Comments on IDNAbis protocol-03

Mark Davis mark.davis at icu-project.org
Thu Jan 10 01:26:27 CET 2008


I sent this almost a month ago and got no reply. I'm assuming that the lack
of response was due to the holidays, and that some discussion of, or response
to, these items will be forthcoming soon.

Mark

On Dec 13, 2007 7:47 PM, Mark Davis <mark.davis at icu-project.org> wrote:

> http://www.ietf.org/internet-drafts/draft-klensin-idnabis-protocol-02.txt
> Overview:
> Protocol-1. By excluding case/width folding, there will be significant
> backwards compatibility problems, caused by having no standard folding.
> Examples of current usage:
>
>
> #  U-Label         U-Label Escaped             Current Punycode      Status
> 1  http://öbb.at   http://%C3%B6bb.at          http://xn--bb-eka.at  canonical, allowed in both IDNA2003 and IDNAbis
> 2  http://ÖBB.at   http://%C3%96bb.at          http://xn--bb-eka.at  Disallowed in IDNAbis: case variation
> 3  http://öｂb.at  http://%C3%B6%EF%BD%82b.at  http://xn--bb-eka.at  Disallowed in IDNAbis: width variation (NFKC)
>
> I am very concerned about the breakage that will occur if the folding
> operations are entirely at the option of the implementation. See the mail
> discussion under "IDNAbis compatibility":
>
> http://www.alvestrand.no/pipermail/idna-update/2007-March/000537.html
> http://www.alvestrand.no/pipermail/idna-update/2007-April/thread.html
>
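As a rough illustration of the three example host names above, the following Python sketch applies an NFKC-plus-lowercase fold to a single label before Punycode encoding. It only approximates the IDNA2003 mapping (the real Nameprep step uses its own tables), and the function name fold_and_encode is made up for this example.

```python
import unicodedata


def fold_and_encode(label: str) -> str:
    """Approximate IDNA2003-style folding, then Punycode-encode one label.

    Assumption: NFKC + str.lower() stands in for the Nameprep mapping.
    """
    folded = unicodedata.normalize("NFKC", label).lower()
    if folded.isascii():
        # Pure-ASCII labels need no ACE prefix.
        return folded
    return "xn--" + folded.encode("punycode").decode("ascii")


examples = [
    "\u00f6bb",       # öbb: canonical form
    "\u00d6BB",       # ÖBB: case variation
    "\u00f6\uff42b",  # ö + FULLWIDTH LATIN SMALL LETTER B + b: width variation
]
for label in examples:
    print(repr(label), "->", fold_and_encode(label))
```

With the fold applied, all three inputs should come out as xn--bb-eka, matching the Punycode column; without it, only the first label is already in that form.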
> I'll copy one portion. As of last March, "Out of a significantly large
> sampling of the web, there were about 800,000 cases where an HTML document
> contained an href="..." that contained a host name that was valid IDNA2003.
> We tested those host names to see if they would also be valid under IDNAbis
> (based on the current working proposals). About 85% were valid, about 8%
> more would be valid if IDNAbis were changed to also do case and width
> folding, and about 6% would still be invalid even if case and width foldings
> were applied. (The width foldings are applying NFKC to just the half-width
> and full-width characters to get the normal ones.)"
>
> IDNAbis is already excluding thousands of characters that used to be
> valid. There is, however, rough consensus that symbol characters,
> punctuation, and others were ok to exclude, and their numbers are relatively
> small.
>
> But the folding case is different. The case/NFKC folding of IDNA is not
> just a UI issue; a huge number of host names that depend on it are embedded
> in email, web pages, and so on. I'm very leery of causing 4% of embedded URLs
> to break. And we haven't seen any real evidence that case/width folding is a
> real, demonstrable problem.
>
> Note: There is only really one locale where locale-sensitive lowercasing
> is needed, and that is for Turkish (and related languages using the same
> conventions in Latin). There are some possible issues with uppercasing
> (typically in whether accents are retained, although there are clear
> differences of opinion on this topic, such as in French), but those are not
> relevant to IDNA since only the lowercasing is at issue.
>
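To make the Turkish case concrete, here is a small Python sketch contrasting the locale-independent lowercase mapping with a Turkish-style one. The helper turkish_lower is hypothetical, written only for this illustration, and is not taken from any IDNA draft.

```python
def turkish_lower(text: str) -> str:
    """Hypothetical Turkish-aware lowercasing (illustration only).

    In Turkish, dotted capital I (U+0130) lowercases to plain "i",
    and plain capital "I" lowercases to dotless i (U+0131).
    """
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


for word in ["I", "\u0130", "DI\u015e"]:
    # str.lower() is the locale-independent mapping.
    print(repr(word), "default:", repr(word.lower()),
          "turkish:", repr(turkish_lower(word)))
```

The two mappings disagree only on the dotted and dotless forms of the letter i, which is why a single locale-independent lowercasing is workable for the folding discussed here.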
> Now, one possibility is that we have a separate IDNA-Folding document that
> preserves the case/width folding of IDNA2003. Then other standards,
> protocols, and implementations (such as browsers) could also claim
> conformance to that. This wouldn't be as good as keeping it inside the IDNA
> umbrella, but would be better than a potential huge backwards compatibility
> breakage.
>
> Protocol-2.  Section 5 has Normalization (5.5), but it is missing from
> Section 4. It must be there also (probably just an oversight).
>
> Protocol-3.  It needs to have a prohibition on a leading combining mark.
> See Michel's emails.
>
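A check of this kind could look roughly like the Python sketch below. It assumes that "combining mark" means any character whose Unicode general category starts with M (Mn, Mc, or Me); the function name is invented for the example and the exact definition would be up to the protocol document.

```python
import unicodedata


def has_leading_combining_mark(label: str) -> bool:
    """Return True if the label starts with a character of category Mn, Mc, or Me.

    Assumption: "combining mark" is read as general category M*.
    """
    return bool(label) and unicodedata.category(label[0]).startswith("M")


for label in ["abc", "\u0301abc", "a\u0301bc"]:
    # U+0301 is COMBINING ACUTE ACCENT (category Mn).
    print(repr(label), has_leading_combining_mark(label))
```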
> Protocol-4.  Some of the same issues as
> draft-faltstrom-idnabis-tables-03.txt <http://www.ietf.org/internet-drafts/draft-faltstrom-idnabis-tables-03.txt>,
> such as MAYBE YES vs MAYBE NO.
>
> Protocol-5.  The "Contextual Rules" need to be supplied. (What is the
> format? Is it machine readable? Are there default required ones? There
> should be, for ZWJ/ZWNJ.)
>
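Purely to show what a machine-checkable contextual rule might look like, here is a Python sketch. The rule itself is an assumption made for this example and is not taken from the draft: ZWJ or ZWNJ is accepted only when it directly follows a character with canonical combining class 9 (a virama). The function name is also invented.

```python
import unicodedata

ZWNJ = "\u200c"
ZWJ = "\u200d"
VIRAMA_CCC = 9  # canonical combining class of virama characters


def joiners_in_context(label: str) -> bool:
    """Hypothetical rule: every ZWJ/ZWNJ must directly follow a virama."""
    for i, ch in enumerate(label):
        if ch in (ZWNJ, ZWJ):
            if i == 0 or unicodedata.combining(label[i - 1]) != VIRAMA_CCC:
                return False
    return True


# Devanagari KA + VIRAMA + ZWJ + SSA: the joiner follows a virama.
print(joiners_in_context("\u0915\u094d\u200d\u0937"))
# Latin letters around a ZWJ: no virama before the joiner.
print(joiners_in_context("a\u200db"))
```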
> Protocol-6.  Section 5.1 assumes that URLs are entered by users, when they
> are often (perhaps most often) interpreted by machines. That is of great
> importance, of course, for search engines, email readers, browsers, and
> others.
>
> Details:
> Protocol-7.
>
>    Unicode (without surrogates), paralleling the process above
>
> (Minor) this is unnecessary. The tables disallow surrogates.
>
> Protocol-8.
>
>       a character is never removed from
>       it unless it is removed from Unicode.
>
> This is not necessary. If you really have to have it, then add "(however,
> the Unicode stability policies expressly forbid this)".
>



-- 
Mark