The Two Lookups Approach (was Re: Parsing the issues and finding a middle ground -- another attempt)

John C Klensin klensin at jck.com
Sat Mar 14 09:01:19 CET 2009


The question below seems to have never been answered... sorry.

--On Saturday, March 07, 2009 15:52 +0900 Martin Duerst
<duerst at it.aoyama.ac.jp> wrote:

> At 01:41 09/03/07, John C Klensin wrote:
> 
>> It is worth stressing that the occurrence of this sort of
>> problem does not depend on IDNA2008.  Paul's IDNAv2 proposal
>> would cause it equally well, as would anything else that
>> provides a change from Unicode 3.2 to Unicode 5.1 and, more
>> generally, most or all future changes to Unicode that add new
>> characters to existing scripts to improve the way in which
>> those scripts can be expressed.
> 
> To be precise, only characters that interact with others
> in the script would be problematic, not completely independent
> characters. Or am I missing something?

I think that is correct.  But "completely independent
characters" may be a slightly fuzzy concept in practice.  Three
examples...

(1) Adding the Chillu to Malayalam, as Unicode 5.1 does,
involves completely independent characters, in the sense that
nothing was there before and they are not treated as
compositions of characters that were in Unicode earlier.   In
one way, that makes them "independent" of what has come before,
but they profoundly change the way the script is coded, so they
do interact and set up a problematic situation.
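
A minimal sketch of that interaction, not taken from the original
message.  It assumes, for illustration, chillu N (U+0D7B) as the new
atomic character and U+0D28 U+0D4D U+200D (NA, VIRAMA, ZERO WIDTH
JOINER) as the sequence commonly used before Unicode 5.1, and it
assumes Python's unicodedata tables are at Unicode 5.1 or later.
The helper names are hypothetical.

```python
import unicodedata

# Assumed illustrative pair: the atomic chillu N added in Unicode 5.1
# and the older spelling NA + VIRAMA + ZERO WIDTH JOINER.
CHILLU_N = "\u0d7b"
OLD_SEQUENCE = "\u0d28\u0d4d\u200d"


def code_points(s):
    """Render a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


def same_under(form, a, b):
    """True if a and b become identical under the given normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


for form in ("NFC", "NFKC"):
    # Neither form maps one spelling onto the other, so the two
    # encodings of the same written letter stay distinct strings.
    print(form, code_points(CHILLU_N), "vs", code_points(OLD_SEQUENCE),
          "->", same_under(form, CHILLU_N, OLD_SEQUENCE))
```

Because normalization alone does not connect the two spellings, any
equivalence between them would have to come from the protocol or from
registry policy.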

(2) Assume that the Catalan ela geminada were added to Unicode
as a single character (I understand the odds of this are
vanishingly small, but perhaps it makes a good illustration).
I assume that it would go in with an NFKC equivalent of U+006C
U+00B7 U+006C following the existing model for U+0140 (a
character that some folks in the Catalan community believe is
orthographic nonsense), i.e., NFKC(U+0140) -> U+006C U+00B7.  For
IDNA2008 purposes, the fact that both U+0140 and this
hypothetical new precomposed ela geminada would turn into other
things under NFKC would cause them to be disallowed, but, as
soon as one moves into an environment in which mappings are
considered part of the standard, there is clearly an
interaction.  Whether or not the new character could be
considered "independent", whether it would interact in ways that
would present transition issues depends on what else is going on
with the protocol.
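
A small sketch of the existing U+0140 behavior, using Python's
standard unicodedata module.  The helper name changed_by_nfkc is
hypothetical, and the check is only a rough stand-in for the
"changes under NFKC" criterion, not the actual IDNA2008 derivation
rules.

```python
import unicodedata

L_MIDDLE_DOT = "\u0140"      # LATIN SMALL LETTER L WITH MIDDLE DOT
ELA_GEMINADA = "l\u00b7l"    # U+006C U+00B7 U+006C


def changed_by_nfkc(s):
    """True if NFKC alters the string (a rough proxy for 'unstable')."""
    return unicodedata.normalize("NFKC", s) != s


# U+0140 carries a compatibility decomposition, so NFC leaves it alone
# while NFKC rewrites it to U+006C U+00B7.
nfc_form = unicodedata.normalize("NFC", L_MIDDLE_DOT)
nfkc_form = unicodedata.normalize("NFKC", L_MIDDLE_DOT)

print([f"U+{ord(c):04X}" for c in nfc_form])
print([f"U+{ord(c):04X}" for c in nfkc_form])
print(changed_by_nfkc(L_MIDDLE_DOT), changed_by_nfkc(ELA_GEMINADA))
```

The three-code-point spelling is already stable under NFKC, which is
why a mapping step that folds U+0140 (or a hypothetical new
precomposed character) onto it creates the interaction.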

(3) Unless one adopts a strong "if they look alike, they are the
same thing" model (often obnoxiously expressed as "Unicode
really works only for printing"), strong cases can be made that
the Yiddish character that looks like the Hebrew U+05D0 U+05B7
should be treated as an entirely separate precomposed character
with its own code point.  In Yiddish, that is a real character
with its own phonemic properties that should not, under any
circumstances, compare equal to U+05D0 alone.  In Hebrew U+05B7
is a point ("vowel") whose presence is optional (and in most
modern circumstances, discouraged), so dropping it on a
comparison or, for the DNS case, even banning U+05D0 U+05B7 from
being registered, would be reasonable.  The points do not appear
on most Hebrew keyboards or appear only as non-spacing
(combining) marks.  The particular character in question appears
on most Yiddish keyboards.  Now, assume that new character were
added to Unicode and assume that it does not immediately
transform itself back into U+05D0 U+05B7 under NFC (which would
defeat most of the purpose of adding it).  I don't know whether
that makes it an independent character or not, but it would
certainly have an effect on backward compatibility.
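
A minimal sketch of the two comparison behaviors described above,
not from the original message.  The function strip_marks is a
hypothetical illustration of a "drop the points" comparison, not
something IDNA specifies.

```python
import unicodedata

ALEF = "\u05d0"               # HEBREW LETTER ALEF
ALEF_PATAH = "\u05d0\u05b7"   # ALEF followed by HEBREW POINT PATAH


def strip_marks(s):
    """Drop non-spacing marks (category Mn), as a point-insensitive
    Hebrew comparison might do.  Hypothetical helper."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


# NFC keeps the sequence as two code points, so a plain comparison
# distinguishes it from a bare ALEF (what Yiddish needs).
nfc = unicodedata.normalize("NFC", ALEF_PATAH)
print(len(nfc), nfc == ALEF)

# A point-insensitive comparison collapses it onto ALEF (acceptable
# for Hebrew, wrong for Yiddish).
print(strip_marks(ALEF_PATAH) == ALEF)
```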

  --john