Update to clarify combining characters

"Martin J. Dürst" duerst at it.aoyama.ac.jp
Tue Apr 22 12:35:57 CEST 2014


Hello Cary,

On 2014/04/22 17:10, Cary Karp wrote:
> Quoting Eric,
>
>> ... in Abenaki we use several ASCII character sequences
>> inter-changeably ("ou", "w" and "8") as well as an "u atop o" character
>> defined in one or more extensions to ASCII, which typewriters with
>> half-height settings, and the character "8" have accommodated over the
>> past century, in support of a local (to a zone) semantic, e.g.,
>> equivalency of two labels, e.g., "ou.example" and "8.example" (or
>> "wabanaki.example" and "8abanaki.example" and "ouabanaki.example"),
>
> Are there similar non-ASCII examples?

The case of simplified vs. traditional Chinese characters has been 
discussed at length (at length even for IDNA) leading up to IDNA 2003. 
There are many other scripts and languages where things like this can 
occur. English as used around the world is one of them; if you want 
color.example, you may want colour.example at the same time.

>> Obviously, what ICANN gTLD registry operators do is governed by contacts
>> between them and ICANN, and what ccTLD registry operators do is also
>> governed, in part, by desires for consistency, but below (or outside) of
>> these namespaces with _local_ (not pervasive to all levels of the tree)
>> restrictions  on labels, what resolves is a local question -- local in
>> the sense of both the FQDN, the RRSet associated, and the resolvers to
>> which query(s) are made.
>
> Does this suggest that there are language communities with need to have
> such intricacy accommodated on lower levels of the gTLD/ccTLD namespace,
> who are willing to forgo the possibility of manifesting their languages
> directly in TLD labels?

In TLDs, it might actually be easier to deal with such cases, because
they can be dealt with on a one-by-one basis. For an actual example,
please see .中国 and .中國
(http://www.iana.org/domains/root/db/xn--fiqs8s.html and
http://www.iana.org/domains/root/db/xn--fiqz9s.html). Of course, if
keeping such equivalents in sync is desired, it is not easy to do
(whatever "in sync" may mean).
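
To make the point concrete that these are two separate labels in the
DNS, here is a minimal Python sketch. The helper name to_a_label is
made up for this illustration, and it only does the raw Punycode step;
it skips all the IDNA validity checks a registry would apply.

```python
def to_a_label(label: str) -> str:
    """Punycode-encode a label and add the ACE prefix "xn--".

    Illustration only: no IDNA validity checks are performed here.
    """
    return "xn--" + label.encode("punycode").decode("ascii")


# Simplified and traditional forms of the same name.
for tld in ("中国", "中國"):
    print(tld, to_a_label(tld))
```

The two results differ (they match the two IANA URLs above), so any
equivalence between them has to be arranged by the operator rather
than by the protocol.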


> Variation in keyboard practice otherwise appears in many contexts but it
> is difficult to see how this can be weighed into the IDNA protocol. My
> Swedish keyboard has separate keys for the direct entry of the last
> three letters of the Swedish alphabet (å ä ö). These can, however, also
> be typed by using the "dead key" that is necessary for the other
> diacritically marked letters used in written Swedish. That method
> requires the mark to be entered first but it neither displays nor
> spaces. The letter with which it combines is then entered and the
> corresponding pre-composed single-code point character is displayed.
>
> I had always assumed that the trailing order of combining marks was
> imposed directly by Unicode and that this simply cascaded into IDNA.

True. Keyboard input uses keystrokes (represented internally by 
so-called keycodes), and whether diacritics are entered before or after 
the base letter isn't indicative of what characters end up in the data.

> Can
> that constraint actually be overridden in any situation that would be
> trapped by a new contextual rule in 5892?

No. A combining mark applies to the character that precedes it, so a
diacritic followed by a base character would mean that the diacritic
is displayed over the previous character, not over that base
character. Also, IDNA requires NFC, which in many cases (including the
Swedish ones) combines the mark with the base character.
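
As a rough illustration, using Python's standard unicodedata module
(the helper name nfc_codepoints is just made up for this sketch), one
can compare the two orders for Swedish "å":

```python
import unicodedata


def nfc_codepoints(s: str) -> list[str]:
    """Return the code points of s after NFC normalization."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFC", s)]


base_then_mark = "a\u030A"  # "a" followed by COMBINING RING ABOVE
mark_then_base = "\u030Aa"  # COMBINING RING ABOVE followed by "a"

print(nfc_codepoints(base_then_mark))  # ['U+00E5'], precomposed "å"
print(nfc_codepoints(mark_then_base))  # ['U+030A', 'U+0061'], unchanged
```

Only the base-then-mark order composes to the single code point; with
the mark first, NFC leaves the two characters as they are.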

> (If new rules are going to be
> added, there are a few others that might be suggested. Is that topic now
> open for discussion?)

I don't think so.

Regards,   Martin.

> /Cary

