Tonos, Simplified Chinese, and Han Unification (was: Re: The Future of IDNA)

Fri Mar 20 07:30:07 CET 2009

Another note in the series.  If you haven't read the one titled
"Basic IDN assumptions" yet, please do so first...

--On Thursday, March 19, 2009 11:12 -0700 Erik van der Poel
<erikv at google.com> wrote:

>...
> Note that some language communities may wish to strip accents
> in lookup/registration (e.g. tonos in Greek script), while
> other language communities may agree to leave accents on the
> letters (e.g. Latin script). Yet other communities may agree
> to add mappings for similar characters (e.g. East Asian Han
> characters that are currently bundled or blocked on the server
> side).

While I at least understand the Final Sigma and Eszett
discussions (or think I do), the Tonos one leaves me more than a
little confused.   Drawing somewhat on Ken's notes (whether or
not we reach the same conclusions, I believe he understands the
issues better than I do), I don't understand either the question
or your proposed solution in the context of IDNA.  As far as I
can tell, the behavior of the Tonos-bearing characters is the
same in IDNA2003, IDNAv2, and the IDNA2008 proposals.  If one
creates a mapping, it is a mapping that never existed before in
any of the IDNA contexts.
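
As a rough illustration of that point, the sketch below
approximates the IDNA2003 mapping step with Python's standard
unicodedata module (case folding followed by NFKC).  That
approximation is an assumption made for illustration, not the
full Nameprep profile, and the helper name approx_idna2003_map
is hypothetical.  It suggests that the mapping folds case but
leaves the tonos in place.

```python
import unicodedata

def approx_idna2003_map(label: str) -> str:
    """Rough stand-in for the IDNA2003 mapping step: case-fold, then NFKC.

    Assumption: this omits the rest of Nameprep (prohibited
    characters, bidi checks, and so on).
    """
    return unicodedata.normalize("NFKC", label.casefold())

alpha = "\u03b1"            # GREEK SMALL LETTER ALPHA
alpha_tonos = "\u03ac"      # GREEK SMALL LETTER ALPHA WITH TONOS
cap_alpha_tonos = "\u0386"  # GREEK CAPITAL LETTER ALPHA WITH TONOS

# Case is folded: the capital with tonos maps to the small with tonos.
print(approx_idna2003_map(cap_alpha_tonos) == alpha_tonos)  # True
# The tonos itself is not stripped: accented and plain alpha stay distinct.
print(approx_idna2003_map(alpha_tonos) == alpha)            # False
```

Any rule that made the second comparison come out equal would
be a new mapping layered on top of what is sketched here.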

Contrary to your "leave accents on the letters" comments, there
are many circumstances in which the ideal set of matching rules
would treat decorated versions of some characters and
undecorated ones as equivalent.   In the general case, there are
three problems with doing so.  First, transitivity doesn't work:
while it is possible that "à" (U+00E0) should match "a"
(U+0061) and that "á" (U+00E1) should match "a" also, there are
few cases in which people will want "à" to match "á" (although
there might be some -- there is no accounting for either taste
or culture).    Second, almost all of the relevant cases are
language-dependent.  The treatment of "ö" (U+00F6) is the usual
example: in German the character might be treated as equivalent
to "oe", or the diaeresis safely dropped, while in Swedish it is
a completely different letter from any orthographic variation
on "o".  That example has been used often enough to be tedious,
but it is still relevant.  Jefsey's concerns about different
matching rules for French are relevant here too (and they become
even more relevant if not every user of the language agrees with
him).  Finally, the combination of the first and second issues
violates those principles about Unicode and the DNS.  While it
might be
possible to design a coding system that would make this
"sometimes it matches and sometimes it doesn't" sort of thing
easy to do, Unicode --especially Unicode with NF[K]C
normalization-- definitely is not one of them.  Something like
RFC 5242 might have conveniently permitted a "just ignore the
decoration on lookup" bit, but it wouldn't have worked in
anything resembling practice (which is only one of many reasons
why 5242 was a joke).  And the DNS as we know it can't support
this sort of conditional matching either.
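
The transitivity and language-dependence problems can be
sketched in a few lines.  The example assumes that "ignoring
decoration" is implemented as a single mapping applied before
comparison (here: NFD decomposition, then dropping combining
marks); the function name strip_marks is hypothetical.  Once
matching is defined by mapping to a common form, it is
necessarily transitive, and it knows nothing about language.

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Decompose to NFD and drop every combining mark.

    Illustrative only: this is not part of any IDNA version.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

a_grave, a_acute, o_diaeresis = "\u00e0", "\u00e1", "\u00f6"

# Both accented forms match plain "a", as one might want ...
print(strip_marks(a_grave) == "a", strip_marks(a_acute) == "a")
# ... but then they unavoidably match each other as well.
print(strip_marks(a_grave) == strip_marks(a_acute))
# NFC alone keeps them apart; normalization does no accent folding.
print(unicodedata.normalize("NFC", a_grave) == unicodedata.normalize("NFC", a_acute))
# One fixed answer for every language: "o", never the German "oe",
# and never "a separate letter" as Swedish would require.
print(strip_marks(o_diaeresis))
```

A per-language table could replace strip_marks, but the DNS has
no place to carry the language at lookup time.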

Since Erik has brought up the original CDNC (not JET) proposal
for mapping of Traditional Chinese to Simplified Chinese several
times recently, let me suggest that, while it is superficially
very different from the Tonos case, it was another variation on
the themes above.   One would like to be able to simply have the
two match, but that is complicated by both DNS limitations and
the occasional many-one and one-many relationships.  One might
want to map the two together to overcome the DNS limitations,
but that not only runs into those many-one and one-many mapping
issues (and associated loss of information) but also would turn
Hanja and Kanji into Simplified Chinese.  One could avoid that
if Unicode hadn't done Han unification, but it did, so...  The
net result was the invention of variant systems and what we now
call bundling -- not a perfect solution, but all of the others
are overconstrained or violate those basic assumptions.
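
The many-one problem, and the variant idea that replaced
straight mapping, can be sketched as follows.  The two-entry
table is a hypothetical toy, not registry data, and the names
TRAD_TO_SIMP, to_simplified, and build_bundles are made up for
the example.  It uses the commonly cited pair of Traditional
characters that share one Simplified form.

```python
from collections import defaultdict

# Hypothetical two-entry table for illustration only; real variant
# tables are far larger and are maintained by the registries.
TRAD_TO_SIMP = {"發": "发", "髮": "发"}

def to_simplified(label: str) -> str:
    """Map each character through the table, leaving others untouched."""
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in label)

def build_bundles(table):
    """Group each Simplified form with every character that maps to it."""
    bundles = defaultdict(set)
    for trad, simp in table.items():
        bundles[simp].update({trad, simp})
    return bundles

BUNDLES = build_bundles(TRAD_TO_SIMP)

# Mapping is lossy: two different Traditional characters collapse
# into one, and the original cannot be recovered afterwards.
print(to_simplified("發") == to_simplified("髮"))
# A variant bundle keeps all three forms and treats them as a group.
print(len(BUNDLES["发"]))
```

Mapping throws the distinction away at lookup time; bundling
keeps every form and leaves the grouping to registration
policy.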

If you actually have a proposal, please make it (I await the
I-D).  But please sort out the details (rather than just telling
us what you think the Greek language community wants), including
both the issues Ken raises and the fact that this would be an
incompatible change relative to IDNA2003 and to either of the
update strategies that are now on the table.

     john