OT RE: language question re: IDNs Conclusion

John C Klensin klensin at jck.com
Thu Oct 8 17:47:56 CEST 2009


A few small observations about this, many of them based on a few
painful years trying to sort out multilingual thesauri and
dictionaries for use in treaty-enabled regulatory environments
(a somewhat higher standard than anything the IDNAWG or IETF are
involved with).


--On Thursday, October 08, 2009 11:15 +0100 Debbie Garside
<debbie at ictmarketing.co.uk> wrote:

> Hi Cary
> 
> Going slightly off track to the original enquiry, I am quite
> interested in the terminology currently being used for
> character/glyph similarities.  I have read quite a bit about
> this during the course of the past couple of years and I think
> you are right there is a need for a term to describe
> "same/similar glyph".   Despite the fact that the Wikipedia
> article has no clear references to the word or its etymology,
> I think there is considerable merit in using Homoglyph to
> describe two or more glyphs or combination characters that are
> visually the same or so similar to the human eye as to cause
> confusion.
>...

This is, itself, fuzzy because of differences in perception,
expectations, and (with some scripts more than others)
variations in type and calligraphic styles.  Subjectively,
"confusables" fully captures the problem, but it does so because
(a few tech reports notwithstanding) it fully captures the
subjectiveness of it all.

As this particular discussion evolves, I think it is also likely
that we will want to distinguish "same character, with
same derivation, in two different scripts" (e.g., Latin, Greek,
and Cyrillic Capital "A") from "in the context of the right
experiment, someone might confuse these two characters" (e.g.,
Latin Capital U in an appropriately-decorative typeface as
compared to Thai Kho Khai (U+0E02)), and from "these characters
are really different, but might be mistaken for each other
visually" (e.g., Latin Capital I and Katakana Small E (U+30A7)).
I'm not sure that the distinction between the latter two is
important, but distinguishing the first one from the other two
may be critical.
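
For readers who want to see the code points behind these examples, the
following is a minimal sketch using Python's standard unicodedata
module.  The three category labels, the EXAMPLES table, and the
script_hint helper are illustrative assumptions made here, not
established terminology or part of any standard; in particular,
reading the script off the first word of the Unicode character name is
a crude shortcut rather than a real script-property lookup.

```python
import unicodedata

# Example characters taken from the message above. The category labels
# are informal paraphrases, not established terminology.
EXAMPLES = {
    "same derivation, different scripts": ["\u0041", "\u0391", "\u0410"],
    "confusable in some typefaces": ["\u0055", "\u0E02"],
    "different but visually similar": ["\u0049", "\u30A7"],
}


def describe(ch):
    """Return (code point label, Unicode name) for one character."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch)


def script_hint(ch):
    """Guess the script from the first word of the Unicode name.

    This is a crude assumption that happens to hold for the characters
    above; it is not a substitute for the Unicode Script property.
    """
    return unicodedata.name(ch).split()[0]


for label, chars in EXAMPLES.items():
    print(label)
    for ch in chars:
        code_point, name = describe(ch)
        print(f"  {code_point}  {name}  [{script_hint(ch)}]")
```

The point of the sketch is only that all of these are distinct code
points with distinct names; whether any two of them look alike depends
on typeface and reader, which the character data cannot tell you.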

>...
> Getting back to the original request for guidance, I still
> don't think "Homoglyph bundling" is the correct terminology
> (for the reasons stated in my mail regarding whole domain
> names - labels). Indeed having re-read some of the documents
> cited above, I believe the term should be "Homograph(ic)
> bundling" as the term Homograph is used consistently across
> the web in this context.

And, as Cary has pointed out, it is also wrong.  References to
"consistently across the web" or to Wikipedia articles are
useless here because they reflect mob mentality rather than an
attempt to make very precise --and probably important--
distinctions clear.
 
> So, does anyone know how we can suggest Homoglyph to the
> editors of OED! :-)

Wrong question, I think.

Given that we are looking for a precise term that can be
precisely mapped into multiple languages, the right solution is
to borrow a note from John Tukey and several scientific fields
and make something up -- traditionally based on some language
that was once widely-used but is not now in normal
conversational use -- define it precisely, and then move toward
getting it into the use-vocabularies of all of the relevant
modern languages either directly or in transliteration.
"Homograph" might work if it has not already been used too much
as a synonym for "confusable".

If it has been used too much, I'd recommend deferring to the
international character of this work and choosing something
based on some classical language other than Greek or Latin,
perhaps one that, like them, is primarily used only liturgically
today.   Once such a term appears in a few official translation
dictionaries (such as an EU one), the OED and similar references
will take care of themselves.

    john



