idna folding (was Re: idna-bis and '゜')

Mark Davis mark.davis at icu-project.org
Wed Dec 12 00:37:05 CET 2007


There is only one case where locale-sensitive lowercasing is needed, and
that is for Turkish (and related languages using the same conventions in
Latin). There are some possible issues with uppercasing (typically in
whether accents are retained, although there are clear differences of
opinion on this topic, such as in French), but those are not relevant to
IDNA since only the lowercasing is at issue.
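
To make the Turkish case concrete, here is a minimal sketch in Python. The helper names `default_lower` and `turkish_lower` are hypothetical, chosen for illustration only. The sketch covers just the dotted/dotless I pairs and ignores the combining-dot-above condition in the full Unicode rules.

```python
# Sketch only: contrasts locale-independent lowercasing with the
# Turkish/Azeri convention for the letter I. Function names are
# illustrative, not taken from any IDNA specification or library.

def default_lower(text: str) -> str:
    """Locale-independent lowercasing (Python's built-in behaviour)."""
    return text.lower()

def turkish_lower(text: str) -> str:
    """Simplified Turkish lowercasing: handle the two special pairs
    first, then fall back to the default rules for everything else."""
    text = text.replace("\u0130", "i")   # LATIN CAPITAL LETTER I WITH DOT ABOVE -> i
    text = text.replace("I", "\u0131")   # I -> LATIN SMALL LETTER DOTLESS I
    return text.lower()

if __name__ == "__main__":
    for word in ("ISTANBUL", "\u0130STANBUL"):
        print(repr(default_lower(word)), repr(turkish_lower(word)))
```

The same input label therefore maps to two different lowercase strings depending on the locale assumed, which is the only place where a context-free lowercasing and a locale-sensitive one disagree.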

I am very concerned about the breakage that will occur if the folding
operations are entirely freeform. See the mail discussion under
"IDNAbis compatibility":

http://www.alvestrand.no/pipermail/idna-update/2007-March/000537.html
http://www.alvestrand.no/pipermail/idna-update/2007-April/thread.html

I'll copy one portion. As of last March, "Out of a significantly large
sampling of the web, there were about 800,000 cases where an HTML document
contained an href="..." that contained a host name that was valid IDNA2003.
We tested those host names to see if they would also be valid under
IDNAbis (based on the current working proposals). About 85% were valid,
about 8% more would be valid if IDNAbis were changed to also do case and
width folding, and about 6% would still be invalid even if case and width
foldings were applied. (The width foldings are applying NFKC to just the
half-width and full-width characters to get the normal ones.)"
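
The case and width folding described in that quote can be sketched roughly as follows, in Python. This is an illustration under stated assumptions, not the procedure used for the measurement: `width_fold` and `case_width_fold` are hypothetical names, width variants are detected through the `<wide>` and `<narrow>` decomposition tags in the Unicode character database, and `str.lower()` stands in for the case-folding table that IDNA2003 actually specifies.

```python
import unicodedata

def width_fold(text: str) -> str:
    """Apply NFKC only to characters whose compatibility decomposition
    is tagged <wide> or <narrow>; leave every other character alone."""
    out = []
    for ch in text:
        decomp = unicodedata.decomposition(ch)
        if decomp.startswith("<wide>") or decomp.startswith("<narrow>"):
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)

def case_width_fold(label: str) -> str:
    """Width-fold first, then lowercase (an approximation of case folding)."""
    return width_fold(label).lower()

if __name__ == "__main__":
    # Full-width Latin capitals, as they might appear in an href.
    print(case_width_fold("\uff25\uff38\uff21\uff2d\uff30\uff2c\uff25"))
```

Because each character is normalized on its own, a half-width kana followed by a half-width voicing mark is not recomposed into a single character here; a full NFKC pass over the whole label would do that.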

IDNAbis is already excluding thousands of characters that used to be valid.
There is, however, rough consensus that symbol characters, punctuation, and
others were ok to exclude, and their numbers are relatively small.

But the folding case is different. The case/NFKC folding of IDNA is not just
a UI issue; a huge number of names that depend on it are embedded in email,
web pages, and so on. I'm very leery of causing 8% of embedded URLs to break.
And we haven't seen any real evidence that case/width folding is a real,
demonstrable problem.

Now, one possibility is that we have a separate IDNA-Folding document that
preserves the case/width folding of IDNA2003. Then other standards,
protocols, and implementations (such as browsers) could also claim
conformance to that. This wouldn't be as good as keeping it inside the IDNA
umbrella, but would be better than a potential huge backwards compatibility
breakage.

> (given the requests I have got for example)

Patrik, can you be more specific about this? Numbers and examples to justify
this would be useful.

Mark

On Nov 27, 2007 2:28 AM, Patrik Fältström <patrik at frobbit.se> wrote:

>
> On 27 nov 2007, at 08.13, Martin Duerst wrote:
>
> > With the current IDNA architecture, mapping happened at
> > a single place in the protocol stack. Any idna library
> > would do it, or it wouldn't want to call itself an idna
> > library. That leads to a consistent and predictable behavior
> > from a user viewpoint.
>
> The major argument for me to NOT include mapping in IDNAbis is that
> IDNA(bis) is context free, while mappings that people want to have
> (given the requests I have got for example) require context dependent
> mapping. For example based on what locale is in use.
>
> That one might need well defined mapping mechanisms is of course
> clear, but it can not be resolved as part of the context-free domain
> name layer in the chain of functions between user and wire.
>
>    Patrik



-- 
Mark