IDNAbis compatibility

Mark Davis mark.davis at icu-project.org
Sat Mar 31 03:14:15 CEST 2007


We had a bit more time to look at IDNAbis compatibility, and here are some
better (and hopefully clearer) results. In a large sample of the web, there
were about 800,000 cases where an HTML document contained an href="..."
whose host name was valid under IDNA2003. We tested those host names to see
whether they would also be valid under IDNAbis (based on the current working
proposals). About 85% were valid, about 8% more would be valid if IDNAbis
were changed to also do case and width folding, and about 6% would still be
invalid even if case and width folding were applied. (Width folding here
means applying NFKC to only the half-width and full-width characters,
mapping them to their normal forms.)
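
To make the width folding concrete, here is a minimal Python sketch. It is
an illustration only, not the code used for the measurements. It assumes
that "half-width and full-width characters" means the characters whose
Unicode compatibility decomposition is tagged <narrow> or <wide>; the
function name width_fold is hypothetical.

```python
import unicodedata


def width_fold(text: str) -> str:
    """Apply NFKC only to half-width and full-width characters.

    Assumption: a character counts as half-width or full-width when its
    compatibility decomposition carries the <narrow> or <wide> tag.
    All other characters are passed through unchanged.
    """
    out = []
    for ch in text:
        tag = unicodedata.decomposition(ch)
        if tag.startswith(("<wide>", "<narrow>")):
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)
```

Because the sketch maps one character at a time, it does not recompose
sequences such as a half-width kana followed by a half-width voiced sound
mark, which NFKC over the whole string would combine.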

Here are some more details, where A0-A4 are disjoint categories.

Category                                                     Count  Percent
A0: Passes IDNAbis                                         708,760   85.26%
A1: Passes IDNAbis after case folding                       22,714    2.73%
A2: Passes IDNAbis after width folding                      47,312    5.69%
A3: Passes IDNAbis after width folding, then case folding        4    0.00%
A4: Fails IDNAbis after all 3 steps                         52,456    6.31%
A5: Passes IDNA = sum(A0-A4)                               831,246  100.00%

This differs from some of our previous data, because we are explicitly
testing IDNA vs IDNAbis (not just approximating the latter), and also
filtering out invalid URLs. I will be out next week, but we'll try to follow
up with more of a breakdown of A4.
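
The disjoint categories A0-A4 can be read as a cascade in which each host
name lands in the first bucket whose check succeeds. The Python sketch below
illustrates that reading and is not the actual test harness. The names
classify, width_fold, and toy_is_valid are hypothetical. The validator is a
toy stand-in (lowercase ASCII letters, digits, hyphen, and dot only), not
the IDNAbis rules, and str.casefold() is assumed as the case folding.

```python
import unicodedata
from typing import Callable


def width_fold(host: str) -> str:
    """Apply NFKC only to characters tagged <wide> or <narrow> (assumption)."""
    return "".join(
        unicodedata.normalize("NFKC", ch)
        if unicodedata.decomposition(ch).startswith(("<wide>", "<narrow>"))
        else ch
        for ch in host
    )


def classify(host: str, is_valid: Callable[[str], bool]) -> str:
    """Return the first category (A0-A4) whose check the host name passes."""
    if is_valid(host):
        return "A0"  # passes as-is
    if is_valid(host.casefold()):
        return "A1"  # passes after case folding
    folded = width_fold(host)
    if is_valid(folded):
        return "A2"  # passes after width folding
    if is_valid(folded.casefold()):
        return "A3"  # passes after width folding, then case folding
    return "A4"  # fails after all three steps


# Toy stand-in for an IDNAbis validity check; NOT the real rules.
_ALLOWED = set("abcdefghijklmnopqrstuvwxyz0123456789-.")


def toy_is_valid(host: str) -> bool:
    return bool(host) and all(ch in _ALLOWED for ch in host)


if __name__ == "__main__":
    for name in ["example.com", "Example.com", "\uff45xample.com",
                 "\uff25xample.com", "exa mple.com"]:
        print(repr(name), classify(name, toy_is_valid))
```

Since each name is assigned to exactly one bucket, the bucket counts add up
to the number of names tested, which is why A0 through A4 sum to the total.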

Mark