Comments on IDNAbis protocol-03

Mark Davis mark.davis at icu-project.org
Fri Dec 14 04:47:11 CET 2007


http://www.ietf.org/internet-drafts/draft-klensin-idnabis-protocol-02.txt
Overview:
Protocol-1. Excluding case/width folding will cause significant backwards
compatibility problems, because there will be no standard folding.
Examples of current usage:


1. U-Label:          http://öbb.at
   U-Label Escaped:  http://%C3%B6bb.at
   Current Punycode: http://xn--bb-eka.at
   Canonical, allowed in both IDNA2003 and IDNAbis.

2. U-Label:          http://ÖBB.at
   U-Label Escaped:  http://%C3%96bb.at
   Current Punycode: http://xn--bb-eka.at
   Disallowed in IDNAbis: case variation.

3. U-Label:          http://öｂb.at
   U-Label Escaped:  http://%C3%B6%EF%BD%82b.at
   Current Punycode: http://xn--bb-eka.at
   Disallowed in IDNAbis: width variation (NFKC).

I am very concerned about the breakage that will occur if the folding
operations are entirely at the option of the implementation. See the mail
discussion under "IDNAbis compatibility":

http://www.alvestrand.no/pipermail/idna-update/2007-March/000537.html
http://www.alvestrand.no/pipermail/idna-update/2007-April/thread.html
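
To make the three example rows concrete, here is a minimal Python sketch of
what a standard folding step does to them. It is an illustration only, not
text from either draft. The function names are hypothetical, and NFKC plus
lower() is a simplifying assumption: IDNA2003 (Nameprep) uses its own mapping
table, which differs for some characters.

```python
import unicodedata


def fold_label(label: str) -> str:
    """Approximate IDNA2003-style folding: NFKC (covers width), then lowercase.

    Assumption: real Nameprep uses a case-folding table, not str.lower().
    """
    return unicodedata.normalize("NFKC", label).lower()


def to_a_label(label: str) -> str:
    """Fold a label, then Punycode-encode it if it is not pure ASCII."""
    folded = fold_label(label)
    if folded.isascii():
        return folded
    return "xn--" + folded.encode("punycode").decode("ascii")


def host_to_ascii(host: str) -> str:
    """Convert each dot-separated label of a host name."""
    return ".".join(to_a_label(part) for part in host.split("."))


# Rows 1-3: plain, uppercase, and with FULLWIDTH LATIN SMALL LETTER B (U+FF42).
for host in ("öbb.at", "ÖBB.at", "ö\uff42b.at"):
    print(host, "->", host_to_ascii(host))
```

With the folding step in place, all three inputs yield xn--bb-eka.at. Without
it, rows 2 and 3 have no agreed-upon mapping, which is the breakage at issue.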

I'll copy one portion. As of last March, "Out of a significantly large
sampling of the web, there were about 800,000 cases where an HTML document
contained an href="..." that contained a host name that was valid IDNA2003.
We tested those host names to see if they would also be valid under IDNAbis
(based on the current working proposals). About 85% were valid, about 8%
more would be valid if IDNAbis were changed to also do case and width
folding, and about 6% would still be invalid even if case and width foldings
were applied. (The width foldings are applying NFKC to just the half-width
and full-width characters to get the normal ones.) "
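
One plausible reading of "applying NFKC to just the half-width and full-width
characters" can be sketched in Python as follows. This is an assumption about
the approach, not the code used for the measurement: it selects characters
whose Unicode decomposition is tagged <wide> or <narrow>, and the function
name is hypothetical.

```python
import unicodedata


def fold_width(text: str) -> str:
    """Apply NFKC only to characters tagged <wide> or <narrow>.

    Other compatibility characters (ligatures, superscripts, and so on)
    are left untouched.
    """
    out = []
    for ch in text:
        decomp = unicodedata.decomposition(ch)
        if decomp.startswith(("<wide>", "<narrow>")):
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)


print(fold_width("ö\uff42b"))  # fullwidth b becomes ASCII b
```

The result is not recomposed afterwards, so a real pipeline would still need
a normalization pass after this step.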

IDNAbis is already excluding thousands of characters that used to be valid.
There is, however, rough consensus that symbol characters, punctuation, and
others were ok to exclude, and their numbers are relatively small.

But the folding case is different. The case/NFKC folding of IDNA is not just
a UI issue; a huge number of names that depend on it are embedded in email,
web pages, and so on. I'm very leery of causing 4% of embedded URLs to break.
And we haven't seen any real evidence that case/width folding is a real,
demonstrable problem.

Note: There is only really one locale where locale-sensitive lowercasing is
needed, and that is for Turkish (and related languages using the same
conventions in Latin). There are some possible issues with uppercasing
(typically in whether accents are retained, although there are clear
differences of opinion on this topic, such as in French), but those are not
relevant to IDNA since only the lowercasing is at issue.
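
The Turkish exception can be illustrated with a short Python sketch. Python's
str.lower() is locale-independent, so the Turkish convention is emulated here
by a hypothetical helper, lower_turkish, written only for this illustration.

```python
def lower_default(text: str) -> str:
    """Locale-independent lowercasing, as built into Python."""
    return text.lower()


def lower_turkish(text: str) -> str:
    """Lowercasing under the Turkish dotted/dotless i convention.

    U+0130 (capital I with dot above) maps to "i";
    "I" maps to U+0131 (dotless small i).
    """
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


print(lower_default("ISTANBUL"))  # starts with ASCII "i"
print(lower_turkish("ISTANBUL"))  # starts with dotless U+0131
```

The two functions differ only on I and U+0130; every other letter lowercases
identically, which is why a single locale exception covers the problem.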

Now, one possibility is that we have a separate IDNA-Folding document that
preserves the case/width folding of IDNA2003. Then other standards,
protocols, and implementations (such as browsers) could also claim
conformance to that. This wouldn't be as good as keeping it inside the IDNA
umbrella, but would be better than a potential huge backwards compatibility
breakage.

Protocol-2.  Section 5 has Normalization (5.5), but it is missing from
Section 4. It must be there also (probably just an oversight).

Protocol-3.  It needs to have a prohibition on a leading combining mark. See
Michel's emails.
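
A check of this kind could look like the following Python sketch. It assumes
that "combining mark" means Unicode General Category Mn, Mc, or Me; the draft
would need to state the exact definition, and the function name is
hypothetical.

```python
import unicodedata


def has_leading_combining_mark(label: str) -> bool:
    """Return True if the label starts with a character of category M*."""
    return bool(label) and unicodedata.category(label[0]) in ("Mn", "Mc", "Me")


print(has_leading_combining_mark("\u0301abc"))  # starts with COMBINING ACUTE ACCENT
print(has_leading_combining_mark("abc"))
```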

Protocol-4.  Some of the same issues arise as in
draft-faltstrom-idnabis-tables-03.txt
(http://www.ietf.org/internet-drafts/draft-faltstrom-idnabis-tables-03.txt),
such as MAYBE YES vs MAYBE NO.

Protocol-5.  The "Contextual Rules" need to be supplied. (What is the
format? Machine readable? Are there default required ones -- there should
be, for ZWJ/ZWNJ).

Protocol-6.  Section 5.1 assumes that URLs are entered by users, when they
are often (perhaps most often) interpreted by machines. That is of great
importance, of course, for search engines, email readers, browsers, and
others.

Details:

Protocol-7.

   Unicode (without surrogates), paralleling the process above

(Minor) this is unnecessary. The tables disallow surrogates.

Protocol-8.

      a character is never removed from
      it unless it is removed from Unicode.

This is not necessary. If you really have to have it, then add "(however,
the Unicode stability policies expressly forbid this)".