Update to clarify combining characters

Fri Apr 25 21:24:51 CEST 2014

--On Friday, April 25, 2014 19:16 +0900 "Martin J. Dürst"
<duerst at it.aoyama.ac.jp> wrote:

Hi Martin,

>...
>> For use of characters with precombined forms in DNS
>> labels, it is important that the IDNA requirement for NFC be
>> applied carefully (that requirement essentially eliminates
>> leading combining characters or marks)
> 
> Actually, it doesn't. A string that starts with a leading
> combining mark is still in NFC, assuming that the remainder of
> the string is in NFC. Something else in IDNA may specify that
> leading combining marks are not allowed, but NFC doesn't.

Yes.  You are quite correct -- I was thinking of something else
and got confused.  And, yes, there is such an IDNA rule (in the
text of Section 4.2.3.2 of RFC 5891, which more or less started
this thread because of the Unicode section it pointed to).
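
To make the distinction concrete, here is a minimal Python sketch
(not part of the original exchange) using the standard unicodedata
module.  The helper names are hypothetical, and treating "combining
mark" as General_Category M* is an assumption made for illustration;
RFC 5891 defers to the Unicode Standard for the exact definition.

```python
import unicodedata


def is_nfc(label: str) -> bool:
    """Return True if normalizing to NFC leaves the label unchanged."""
    return unicodedata.normalize("NFC", label) == label


def starts_with_combining_mark(label: str) -> bool:
    """Return True if the first code point has General_Category Mn, Mc, or Me.

    Assumption: this approximates the "leading combining mark" test;
    it is not a full IDNA validity check.
    """
    return bool(label) and unicodedata.category(label[0]).startswith("M")


# COMBINING ACUTE ACCENT (U+0301) followed by ASCII letters.
label = "\u0301abc"

print(is_nfc(label))                      # the NFC check alone accepts it
print(starts_with_combining_mark(label))  # the separate rule catches it
```

The two checks are independent: the first passes and the second
flags the label, which is why the prohibition has to be stated
separately from the NFC requirement.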

>> There are problems that the IETF could not solve even if there
>> were the will to do so.   One involves decisions by the
>> Unicode community that are unattractive for particular
>> scripts.  In my experience, while I'd be very interested in
>> counter-examples, there are few such problems with
>> Latin-based characters unless one gets to characters that
>> require multiple decorations and that can potentially be
>> written as a base (i.e., undecorated and typically ASCII)
>> character plus two (or more) combining characters or a
>> precombined character plus one (or more) of them.  Because
>> some of those combinations appear to not be resolved into a
>> single form by normalization, there might be an opportunity
>> for "variant" consideration  except that ICANN, in its wisdom
>> (and unless things have changed recently), decided that there
>> is no such thing as a variant for Latin-based scripts.
> 
> This agrees with my understanding that for Latin,
> normalization (i.e. NFC) deals with these problems, even in
> those cases where multiple 'decorations' (diacritics) are
> involved.

I used the term "decorations" to avoid controversies about what
was, or was not, a diacritical mark.  
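
As one illustration of the multiple-decoration case (a sketch added
for clarity, not from the original thread), the Python below
normalizes three spellings of a single Latin letter carrying two
marks.  The letter chosen, U+1EBF (e with circumflex and acute), is
only an example; it does not settle whether every such combination
is resolved the same way.

```python
import unicodedata

# Three spellings of "e with circumflex and acute".
spellings = {
    "precomposed": "\u1ebf",
    "precomposed + 1 mark": "\u00ea\u0301",  # e-circumflex + combining acute
    "base + 2 marks": "e\u0302\u0301",       # e + circumflex + acute
}

# Apply NFC to each spelling.
normalized = {
    name: unicodedata.normalize("NFC", text)
    for name, text in spellings.items()
}

for name, text in normalized.items():
    print(name, [f"U+{ord(ch):04X}" for ch in text])

# True when every spelling normalized to the same string.
all_same = len(set(normalized.values())) == 1
print(all_same)
```

For this particular letter, all three spellings come out of NFC as
the single code point U+1EBF.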

>> Variants are also out of IETF scope, at least for IDNA,
>> because doing anything about them in anything resembling a
>> general case turns into a set of issues that cannot be
>> handled in the DNS except by externally treating names as
>> equivalent.  As Andrew has mentioned, there have been
>> extensive ICANN efforts to deal with a set of problems they
>> have lumped together under that term; it may be of note that
>> there does not seem to be a single period with experience
>> using endangered languages or writing systems in a DNS
>> context in the relevant decision-making committees.
> 
> 'period' -> 'person' ?

yes.  Sorry.

>> Finally, to respond to Martin's comment about simplified and
>> traditional Chinese, that problem is very different from those
>> associated with other, especially "alphabetic-phonetic",
>> scripts, in part because those of us who did the final editing
>> on the JET document that put "variants" on the map made a
>> serious error in terminology.  But, again, it isn't a topic
>> for this list.
> 
> Can you (slightly) expand on "serious error in terminology",
> or provide a pointer?

Yes.  This has been elaborated on in several ICANN contexts;
perhaps someone else will supply the references.  But, briefly
and with a disclaimer about deliberate lack of precision to keep
this short...

The "variants" discussed in the JET spec (RFC 3743) are
described in terms of "characters".  In particular, the tables
are arranged around "character variants".  A great deal of the
confusion around variants [1] has gotten involved with
look-alike (or "confusable") characters but the key JET
discussions were about characters with equivalent or
substitutable meanings.  Not only was the SC<->TC discussion not
about look-alike characters [2] but the whole idea of
"characters with equivalent meanings" doesn't mean the same
thing for ideographic scripts that it does for
alphabetic-phonetic ones.  "Character" was the right term to use
to describe the relationships because they are called
characters.  But, in retrospect, I wish we had written RFC 3743
to use terminology about equivalence in meaning or semantics,
not terminology that could easily be interpreted in
glyph-relationships terms.  

If we had done so, we might still be having the same
discussions, but a great many confusing and largely bogus
positions derived from invalid (or seriously stretched)
extrapolations from Chinese to other scripts might have been
avoided or framed more clearly.  Conversely, a number of things
for which ICANN has forbidden variant treatment might be seen in a
different light: if the TC->SC relationship is seen as a
writing and spelling simplification rather than a notion of
equivalent characters, then the claim that "colour" -> "color"
is somehow fundamentally different would probably fall apart.

None of that changes the basics in any way.  It does show how
much confusion can be caused by getting sloppy about terminology
in this area.  And, sadly, I was probably in the best position
to notice that particular bit of sloppiness, to figure out what
the consequences might be, and to fix it.   It took me some
years to figure out what the problem actually was.

     best,
       john

[1] Some inside the IETF, far more outside, and some of the
latter driven by people with an economic interest in their
interpretations.

[2] Except maybe to a very trained eye.