Transitions and compatibility (was: The Future of IDNA)

John C Klensin klensin at jck.com
Fri Mar 20 08:32:14 CET 2009


Final note for tonight...

--On Thursday, March 19, 2009 11:12 -0700 Erik van der Poel
<erikv at google.com> wrote:

>...
> Both of these have distracted us from the critical issue,
> which is that IF the client does NOT map certain character
> sequences to others, THEN the server must bundle a large
> number of names that differ only in the way that these
> character sequences are presented.
> 
> Notice that it does not matter whether those mappings are
> performed immediately after keyboard input or a long time
> after that, e.g. in HTML processing. The point is that IF the
> client does not map Final Sigma to Normal Small Sigma and
> Characters with Tonos to Characters without Tonos, THEN the
> server must bundle all of the permutations (final/normal and
> with/without tonos).
>...
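
To make the bundling arithmetic in the quoted text concrete, here
is a minimal Python sketch.  The names client_map and bundle, and
the choice to treat every sigma and every vowel as a free choice,
are assumptions made for illustration; they are not taken from any
IDNA draft or registry policy, and a real policy would prune
variants that are not valid Greek.

```python
import itertools
import unicodedata

FINAL_SIGMA = "\u03c2"  # ς
SMALL_SIGMA = "\u03c3"  # σ
TONOS = "\u0301"        # combining acute, which is what tonos becomes under NFD
VOWELS = "\u03b1\u03b5\u03b7\u03b9\u03bf\u03c5\u03c9"  # α ε η ι ο υ ω


def client_map(label: str) -> str:
    """Hypothetical client-side mapping: strip tonos, fold final sigma."""
    decomposed = unicodedata.normalize("NFD", label)
    stripped = decomposed.replace(TONOS, "")
    recomposed = unicodedata.normalize("NFC", stripped)
    return recomposed.replace(FINAL_SIGMA, SMALL_SIGMA)


def bundle(label: str) -> set:
    """Every spelling a registry would bundle if clients did no mapping."""
    choices = []
    for ch in client_map(label):
        if ch == SMALL_SIGMA:
            # Either form of sigma may appear in the presented name.
            choices.append((SMALL_SIGMA, FINAL_SIGMA))
        elif ch in VOWELS:
            # The vowel may be presented with or without tonos.
            choices.append((ch, unicodedata.normalize("NFC", ch + TONOS)))
        else:
            choices.append((ch,))
    return {"".join(combo) for combo in itertools.product(*choices)}
```

Under these assumptions the bundle doubles with every sigma and
every vowel in the label, which is the growth the quoted text is
concerned about.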

Erik,

It seems to me that, in the above and several followup notes,
you (and some others) are making a fundamental assumption that
leads you to "have to do this" conclusions.  It is an assumption
with which some of us disagree.  We may ultimately need to agree
to disagree, but I think it would be helpful if you would
understand and acknowledge that other perspective.

I believe that "no incompatible changes, anywhere" is a
perfectly viable position.   I think that, if one takes the
position that anything that has been done, practiced, and relied
upon as working, whether it conforms to applicable standards and
specifications or not, has to be preserved and supported
forever, one very rapidly gets to that "no incompatible changes"
point.  

But I think that, if we believe it and take it literally in the
IDNA case, it gets us to "IDNA2003 forever".  The Malayalam case
is only one of several nasty examples: if the natural coding of
a particular string is changed, then one has a transition
problem that is inconsistent with the "no incompatible changes"
model.  If the folks using the scripts added in Unicode versions
between 3.2 and 5.2 were previously using local conventions or
idiosyncratic transliterations and now want to use their new
scripts, they have a transition and backward-compatibility
problem that is really no different from the Chillu one or even
the Eszett case except that it is even less detectable by some
IDNA or Unicode action -- a transitional two-stage lookup
process would need to understand the transliteration rules, not
just IDNA2003 operations.
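
A minimal Python sketch of such a two-stage lookup may make the
limitation clearer.  Everything here is an assumption made for
illustration: the function names are hypothetical, the "zone" is
just a set of registered labels, the first stage simply
Punycode-encodes the label with no mapping, and the second stage
uses Python's built-in "idna" codec, which implements IDNA2003.

```python
def to_alabel_unmapped(label: str) -> str:
    """Stage one: encode the label exactly as typed, with no mapping."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


def to_alabel_idna2003(label: str) -> str:
    """Stage two: apply the IDNA2003 mappings, then encode."""
    return label.encode("idna").decode("ascii")


def two_stage_lookup(label: str, zone: set):
    """Return the first encoded form found in the zone, else None."""
    for encode in (to_alabel_unmapped, to_alabel_idna2003):
        name = encode(label)
        if name in zone:
            return name
    return None
```

The fallback finds a name that was registered under the IDNA2003
mapping (Eszett registered as "ss", for example), but a name
registered under an ad hoc transliteration matches neither stage,
because neither stage knows the transliteration rules.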

Now, I happen to think that "IDNA2003 forever" is not viable for
a number of reasons.  I think from reading your notes that you
probably agree.  But the position is still plausible.  It is
also important that we not confuse IDNAv2, which is a little bit
of a hybrid (nothing inherently wrong with that, of course),
with an "absolutely no incompatible change as seen by the user,
registrant, or web designer" position -- it isn't one of those.
If one includes those likely transliteration or "fake the
script" cases, it is nearly certain that no position of that
kind can exist once newly-added (since 2003) characters are
included.

Those transliteration cases probably call for some warnings in
Rationale that anyone deciding to register transliterated names
while pressing for inclusion of a script in Unicode had best
know what they are getting into in terms of transition.  But it
is important to notice that this, too, is a tradeoff: if one does
nothing until Unicode approves and codes the script and it ends
up in IDNA, there is no transition problem to deal with.  On the
other hand, doing nothing may mean a delay of some years before
getting anything resembling mnemonics based on one's language
into the DNS.

We have also heard from one community whose input keeps getting
lost in these discussions.  We have a number of registries out
there (certainly not all of them, but a non-trivial number by
registration count), who have, to varying degrees, come to us
(some of them via ICANN) and said "some of the issues raised in
RFC 4290 and elsewhere are real, we would like to get them
fixed, and we are willing to tolerate one more round of
incompatible changes to permit them to get fixed".  The last
part of that is important because, (i) whatever else you or I
may think of those registries and their operators, they are
smart enough to understand what they are getting into, and (ii)
their customers, whom they have to deal with about transition
plans, including sunrise or bundling arrangements or the absence
of either, are typically the web page designers whose reliance
on IDNA2003 behavior is at issue, or the representatives of the
management of those web page designers.  They haven't said
"IDNA2003
forever", "no incompatible changes at all", or even "IDNAng is
ok iff we don't have to do any work or persuade our customers to
do work".  What they have said is "let's get this right".  

Again, one can disagree about what "right" means and I certainly
don't expect that debate to stop.  But the assumption is
incorrect that everyone with skin in the IDN game holds complete
compatibility in the same esteem that several participants in
the WG seem to.  In addition, for some of the non-conforming
cases, such as the use of mapping-dependent UTF-8 strings in
URIs, there are people who would make the case that "violating
the standard makes any pain you feel when things change your own
problem" is an ok rule.  YMMV, but the position is plausible.

Finally, these issues are not about display and especially not
about the display (or not) of around four characters.  Unicode,
by its design and very nature, suppresses a lot of glyph
variations that some people think are very important if one
talks about display.  Han unification covered over differences
in display of some characters in Japan and China -- differences
that cannot even be expressed in Unicode terms (except by
language coding and heuristics).  No distinction is made in
Unicode among the three major sets of glyphs for Eastern
Arabic-Indic digits, but some of those symbols may look very
different if displayed properly for the local context (one can
argue about whether the three sets should have been unified, and
that argument does come up periodically, but see my earlier
comment about having to take Unicode as a given).  Eszett and "ss"
are different things, not different presentation forms.   More
generally, there is a very large collection of compatibility
characters in Unicode, characters that are mapped to other
things in IDNA2003 (and presumably IDNAv2) and prohibited in
IDNA2008.  But I presume that, in most or all cases, those
characters were added because they had fans who pushed for them,
people who would not have bothered had they considered the
compatibility characters to always be equivalent to the
associated base characters that Unicode maps them to.  I do believe
that one could devise a way to deal with all of those issues,
and maybe one that was application protocol-independent, but it
is a rather different matter from getting IDNAng to come
together.
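
The distinction between compatibility characters and the Eszett
case can be seen with Python's standard unicodedata module.  This
is a small sketch; the helper name describe is made up for the
example.  It relies on the fact that a compatibility decomposition
in the Unicode database carries a tag such as <compat> or <wide>,
while a canonical decomposition carries none.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Report how Unicode itself treats a single character."""
    return {
        # True only when the character has a tagged (compatibility) mapping.
        "compat": unicodedata.decomposition(ch).startswith("<"),
        # NFKC is the normalization form used by IDNA2003's Nameprep.
        "nfkc": unicodedata.normalize("NFKC", ch),
        # Case folding is a separate operation from normalization.
        "casefold": ch.casefold(),
    }


ligature_fi = describe("\ufb01")   # LATIN SMALL LIGATURE FI
fullwidth_a = describe("\uff21")   # FULLWIDTH LATIN CAPITAL LETTER A
eszett = describe("\u00df")        # LATIN SMALL LETTER SHARP S
```

The ligature and the fullwidth letter are compatibility characters
that NFKC replaces with "fi" and "A".  Eszett has no decomposition
of either kind and NFKC leaves it alone; it becomes "ss" only
through case folding.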

    john


