IDNAbis compatibility

Wed Apr 4 03:50:45 CEST 2007

John Klensin commented:

> --On Tuesday, 03 April, 2007 09:49 -0700 Mark Davis
> <mark.davis at icu-project.org> wrote:
> 
> > IDNA2003 eliminates any inconsistency by specifying an exact
> > folding, one that is a language-independent folding.
> 
> Or, put differently, one that is a compromise among
> language-related issues and that hence ends up being correct for
> some languages and incorrect for others.

You know, John, sometimes I think you end up doing so many
stutter steps that you end up faking yourself out.

Casefolding is not some "compromise" worked out in a give and take
conference somewhere.

Language-independent folding basically works well enough for
all Latin-, Cyrillic-, Greek-, and Armenian-based orthographies,
but...

Language-independent folding doesn't work for "i"'s in
Turkish (and Azeri written in Latin, and so on for other
Turkic languages adopting the pan-Turkic Latin conventions).
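
To make the "i" problem concrete, here is a minimal Python sketch of the two
foldings. It assumes Python's built-in str.casefold() as a stand-in for
language-independent folding (it is not the IDNA2003 Nameprep table itself),
and turkish_fold is a hypothetical helper written only for illustration.

```python
# Sketch: language-independent case folding versus Turkish-specific folding.
# turkish_fold is a hypothetical helper, not part of any IDNA specification.

DOTTED_CAPITAL_I = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"   # LATIN SMALL LETTER DOTLESS I


def generic_fold(label: str) -> str:
    """Language-independent folding, using Python's built-in casefold()."""
    return label.casefold()


def turkish_fold(label: str) -> str:
    """Turkish-style folding: I -> dotless i, dotted capital I -> i."""
    swapped = label.replace("I", DOTLESS_SMALL_I).replace(DOTTED_CAPITAL_I, "i")
    # Everything other than the two "i" pairs folds the generic way.
    return swapped.casefold()


if __name__ == "__main__":
    for word in ("ISTANBUL", DOTTED_CAPITAL_I + "STANBUL"):
        print(repr(word), repr(generic_fold(word)), repr(turkish_fold(word)))
```

The two functions agree on every letter except the "i" family, which is
the whole of the disagreement under discussion.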

So instead of bringing up this vague, unspecified anxiety
every other week about language-independent folding being
less than perfect and subject to vague, unknowable,
unspecified attacks from whom-we-know-not for alleged
failures of localization, how about we actually do a
case study for the *one* staring-us-in-the-face, known
example where there is a significant problem: Turkish, and more 
specifically, "i"'s in Turkish?

By the way, the Turkish problem is *not* a Unicode problem.
The same issue applies to Turkish data encoded using
the *Turkish* ISO 8-bit code page, ISO/IEC 8859-9, Latin-5,
as well as Turkish data using old Turkish PC code pages
or Windows CP 1254.
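
A quick way to check that claim is to confirm that the legacy Turkish
code pages carry both special "i" letters as ordinary single-byte
characters. The sketch below uses Python's codec names; picking cp857 to
represent the "old Turkish PC code pages" is my assumption.

```python
# Sketch: the Turkish "i" letters exist as single bytes in the legacy
# Turkish code pages, so the folding issue predates Unicode.
# Choosing cp857 for "old Turkish PC code pages" is an assumption.

TURKISH_I_LETTERS = "\u0130\u0131"  # dotted capital I, dotless small i
LEGACY_CODECS = ("iso8859_9", "cp1254", "cp857")


def encodes_as_single_bytes(text: str, codec: str) -> bool:
    """True if each character of text maps to one byte and round-trips."""
    raw = text.encode(codec)
    return len(raw) == len(text) and raw.decode(codec) == text


if __name__ == "__main__":
    for codec in LEGACY_CODECS:
        print(codec, encodes_as_single_bytes(TURKISH_I_LETTERS, codec))
```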

Bezillions of Turkish domains already exist -- and they already
have established practices of folding "i"'s, because they have had
to. Go look at:

http://www.tubitak.gov.tr/

for example. That is "tubitak", but the correct name of
the Turkish Scientific and Technological Research Council
is TÜB{I-dot}TAK.

If I enter "TUBITAK" (note, no umlaut on the u and no dot on the i)
and expect it to be resolved, it gets casefolded to "tubitak"
by *generic* casefolding rules, and I get to the right place
because "tubitak.gov.tr" is the actual domain I'm looking for.

If, on the other hand, I decide that this is a Turkish domain
name (by whatever means I don't know -- inferring from "tr",
I suppose, since the label doesn't have a language tag), and
needs a Turkish-language-specific casefolding, then I'm going
to be looking for "tub{i-dotless}tak.gov.tr", which probably
doesn't exist, and which *could* be different from what I
am expecting, if the .tr domain registry wasn't careful about
what got into it belonging to whom.
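
The two lookups just described can be sketched as follows. REGISTERED is a
toy stand-in for the registry holding only the label named above, and
both folding helpers are hypothetical illustrations, with str.casefold()
assumed as the generic folding.

```python
# Sketch: one typed label, two folding policies, one registered name.
# REGISTERED and both helpers are illustrative assumptions.

REGISTERED = {"tubitak"}  # the label that actually exists under gov.tr


def generic_fold(label: str) -> str:
    """Language-independent folding via Python's casefold()."""
    return label.casefold()


def turkish_fold(label: str) -> str:
    """Turkish-style folding: I -> dotless i, dotted capital I -> i."""
    return label.replace("I", "\u0131").replace("\u0130", "i").casefold()


def resolves(label: str, fold) -> bool:
    """True if the folded label matches a registered label."""
    return fold(label) in REGISTERED


if __name__ == "__main__":
    typed = "TUBITAK"
    print("generic:", generic_fold(typed), resolves(typed, generic_fold))
    print("turkish:", turkish_fold(typed), resolves(typed, turkish_fold))
```

Under the Turkish-specific policy the folded label is no longer ASCII at
all, so it cannot match any label registered in the ASCII-only era.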

This is already established practice for Turkish domain names,
as best as I can tell, given that they have already had to
work for years within the constraints of an ASCII-only context.
Whatever we do for IDNAbis should not be flying in the face
of such practice and introducing problems *specifically*
for Turkish, worse than the situation that already exists.

One such potential problem would result from deciding to *remove* the
language-independent case folding from the specification for IDNAbis,
with the recommendation that this all be left up to applications,
to be determined by who-knows-what.

> As long as this is strictly a discussion about the "lookup",
> rather than "registration", side of things, and it doesn't
> affect what is normative in URIs or IRIs used between systems,
> that would certainly work.   Of course, such applications would
> be faced with a three-way choice when localization was
> considered:
> 
> 	(i) Don't fold

Which would lead to worse results than current expectation.

> 	
> 	(ii) Fold according to the new RFC, presumably
> 	consistently with IDNA2003 updated if necessary to
> 	reflect Unicode 5.0 (or 5.1) or updated from time to
> 	time to reflect new versions of Unicode.  Of course, if
> 	no new case-sensitive characters are added, those
> 	updates would be trivial.

Which would effectively be equivalent to the existing situation.

> 	
> 	(iii) Fold according to local conventions as required by
> 	the local language(s).

Which would likely lead to chaos worse than the current
situation.

Run out the scenario *specifically* for Turkey here. Don't
get stuck worrying vaguely about how things could be
different everywhere around the world and gosh we don't
know and we have to throw up our hands and let everybody
case the way they want to.

What is the *specific* problem for Turkey?

Well, if you advocate that Turkish-language applications
start applying correct Turkish case-folding rules to domain
names, because those are the local conventions required
by the local language, then you *might* get correct results
for some new IDNA-specific registrations, but you *would*
get lots of erroneous results for lots of existing
domain names, because the "i"s and "I"s already exist in
them and are part of the casefolding problem.

> 
> This raises some issues which we at least need to understand
> before adopting that approach.  The first is that complete
> consistency cannot be guaranteed, simply because the three
> choices exist. 

Yes. We understand that you can't guarantee against the
bozo factor. Somebody can always goof up. People may even
insist on goofing up on purpose and by principle.

> Note that we can do nothing to eliminate the
> third choice: implementations will do what they need to do to
> gain local acceptance and make local sense.

Which needs to be justified *specifically* in the Turkish
context. What are user agents doing in Turkey *now* to
handle domain names? Why would they start doing something
different in the near future that would make them work
marginally more correctly for *some* new Turkish domain
names and break for everything else?

Please make the case for why this would be the issue, and why
we thus need to inoculate IDNA against criticism for preventing
Turks from correct adaptation of their software.

>   Second, unless we
> are careful, (ii) would reintroduce version dependency.

Not an issue at this point.

I understand that *fear* that it might be an issue is an
issue at this point. 

But I do not see a *technical* problem here -- I see a
cultural, communicative, psychological, and political
problem here.

>  That
> would be unfortunate, but, if it were in a supplemental option,
> rather than the base protocol, I don't see it as a showstopper.

Why is it a showstopper in the base protocol?

--Ken