Erik van der Poel
erikv at google.com
Fri Dec 14 04:50:21 CET 2007
Correction: Mark's number corresponds to (F + G - X) / A, where X is
the number of host names that contain both mapped-by-NFKC and
non-ASCII upper-case characters.
So, 15% is an upper bound for that number in Nov 2006 and 4% is an
upper bound for that number in Nov 2007. Either way, 8% could still
fall between these numbers, so I'm satisfied with the results.
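As a quick check of the arithmetic, the sketch below recomputes (F + G) / A from the percentages in the table quoted further down. The dictionary layout and the function name are only for illustration. X is not reported anywhere in this thread, so only the upper bound can be computed.

```python
# Percentages copied from the table in the quoted message (relative to
# all URIs/IRIs that contain host names).
stats = {
    "Nov 2006": {"A": 0.0166, "F": 0.00153, "G": 0.000916},
    "Nov 2007": {"A": 0.0477, "F": 0.00108, "G": 0.000921},
}

def upper_bound(row):
    """Return (F + G) / A, an upper bound on (F + G - X) / A since X >= 0."""
    return (row["F"] + row["G"]) / row["A"]

bounds = {month: upper_bound(row) for month, row in stats.items()}
for month, value in bounds.items():
    print(f"{month}: {value:.1%}")
```

This gives roughly 15% for Nov 2006 and roughly 4% for Nov 2007, matching the figures above.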
Just because this number is falling does not mean that it will ever
reach zero, nor can I predict what the browser developers will do.
They may try to force things by removing the NFKC/case/dot mapping.
Or they may not. I don't know.
On Dec 13, 2007 7:07 PM, Erik van der Poel <erikv at google.com> wrote:
> I've been told that these numbers are confusing, so let me try to clarify.
> First, these numbers were all computed using an IDNA2003
> implementation. I haven't taken a look at IDNA200X tables yet.
> Second, it is hard to see which numbers are overlapping and so on. So
> let me write down the relationships:
> A = B + D
> C = D + E
> F < C (F is a subset of C)
> G < C
> H < C
> I < C
> J !< any (J is not a subset of any of the listed sets)
> K !< any
> L !< any
> Also, regarding Mark's earlier email: "about 8% more would be valid if
> IDNAbis were changed to also do case and width folding":
> This corresponds to (F + G) / A, which was 15% in Nov 2006 and is
> about 4% in Nov 2007. The 8% number was computed by someone else at
> Google, in March 2007, so this is still consistent.
> On Dec 13, 2007 2:44 PM, Erik van der Poel <erikv at google.com> wrote:
> > I had a look at the URIs/IRIs in half a billion HTML documents in
> > Google's index from 2006 and 2007. The following percentages are
> > relative to the total number of URIs/IRIs that contain host names.
> >   Nov 2006     Nov 2007
> > A 0.016600000% 0.047700000% IDNs
> > B 0.007590000% 0.042800000% Punycode
> > C 0.010000000% 0.006130000% non-ASCII
> > D 0.009000000% 0.004940000% round-trip to non-ASCII
> > E 0.001020000% 0.001190000% round-trip to ASCII
> > F 0.001530000% 0.001080000% mapped by NFKC
> > G 0.000916000% 0.000921000% non-ASCII upper-case
> > H 0.000072900% 0.000030000% non-ASCII dots
> > I 0.000060300% 0.000067600% escaped UTF-8
> > J 0.000045200% 0.000035700% escaped non-UTF-8
> > K 0.000000124% 0.000001030% unassigned in Unicode 3.2
> > L 0.113000000% 0.076900000% ASCII non-LDH
> > The good news is that the number of IDNs (A) has grown from 0.017% to
> > 0.048%, and most of that growth is in Punycode (B), which grew from
> > 0.008% to 0.043%.
> > On the other hand, IDNs written in other encodings (D) shrank from
> > 0.009% to 0.005%. This figure was computed by converting the original
> > to UTF-8 (C), and then to Punycode, and finally back to UTF-8 again.
> > If the result is non-ASCII (D), it is considered an IDN.
> > Some of these round-trips resulted in ASCII (E), and this figure
> > remained steady at about 0.001%.
> > Most of these are mapped by NFKC (F), i.e. full-width -> normal ASCII.
> > The number of non-ASCII upper-case host names (G) remained steady at
> > about 0.0009%, while host names containing non-ASCII dots (H) shrank
> > from 0.00007% to 0.00003%.
> > Some URIs have %-escaped bytes in their host names. Escaped UTF-8 host
> > names (I) grew slightly from 0.00006% to 0.00007%. These are supported
> > by Opera 9, but not by any other browser that I know of.
> > On the other hand, the escaped non-UTF-8 host names (J) shrank from
> > 0.00005% to 0.00004%. These are only really supported by MSIE 6.
> > There is a very small number of host names with characters that were
> > unassigned in Unicode 3.2 (K), though this number has increased. These
> > are not supported by MSIE 7 (a wise decision).
> > Finally, there are some ASCII host names with non-LDH characters in
> > them (L), the most prevalent of which is the underscore (_).
> > Disclaimer: Google's index consists of documents from the open Web
> > only, i.e. not blocked by robots.txt, firewalls, etc. We only index
> > high-value documents, which are selected using a proprietary
> > algorithm.
> > Erik
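The round-trip test described in the original message (UTF-8 to Punycode and back to UTF-8) can be approximated with Python's built-in "idna" codec, which implements IDNA2003. This is only a sketch under that assumption, not the tool that produced the numbers above. The function names and sample host names are made up for illustration.

```python
def round_trip(host: str) -> str:
    """Convert a host name to its ACE (Punycode) form and back to Unicode."""
    ace = host.encode("idna")   # IDNA2003 ToASCII: Nameprep, then Punycode
    return ace.decode("idna")   # IDNA2003 ToUnicode

def classify(host: str) -> str:
    """Return 'D' if the round trip is still non-ASCII, 'E' if it is ASCII."""
    if host.isascii():
        raise ValueError("expected a non-ASCII host name (set C)")
    return "E" if round_trip(host).isascii() else "D"

# A host name with a genuine non-ASCII letter (u with diaeresis).
print(classify("b\u00fccher.example"))
# A host name spelled with full-width Latin letters, which NFKC folds to ASCII.
print(classify("\uff45\uff58\uff41\uff4d\uff50\uff4c\uff45.com"))
```

The first host name keeps its non-ASCII letter after the round trip and so counts toward D. The second is folded to plain "example.com" by the NFKC step of Nameprep and so counts toward E.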