Erik van der Poel
erikv at google.com
Thu Dec 13 23:44:05 CET 2007
I had a look at the URIs/IRIs in half a billion HTML documents in
Google's index from 2006 and 2007. The following percentages are
relative to the total number of URIs/IRIs that contain host names.

   2006          2007          Category
A  0.016600000%  0.047700000%  IDNs
B  0.007590000%  0.042800000%  Punycode
C  0.010000000%  0.006130000%  non-ASCII
D  0.009000000%  0.004940000%  round-trip to non-ASCII
E  0.001020000%  0.001190000%  round-trip to ASCII
F  0.001530000%  0.001080000%  mapped by NFKC
G  0.000916000%  0.000921000%  non-ASCII upper-case
H  0.000072900%  0.000030000%  non-ASCII dots
I  0.000060300%  0.000067600%  escaped UTF-8
J  0.000045200%  0.000035700%  escaped non-UTF-8
K  0.000000124%  0.000001030%  unassigned in Unicode 3.2
L  0.113000000%  0.076900000%  ASCII non-LDH

The good news is that the number of IDNs (A) has grown from 0.017% to
0.048%, and most of that growth is in Punycode (B), which grew from
0.008% to 0.043%.
On the other hand, IDNs written in other encodings (D) shrank from
0.009% to 0.005%. This figure was computed by converting the original
to UTF-8 (C), and then to Punycode, and finally back to UTF-8 again.
If the result is non-ASCII (D), it is considered an IDN.
Some of these round-trips resulted in ASCII (E), and this figure
remained steady at about 0.001%.
Most of these are mapped by NFKC (F), i.e. full-width -> normal ASCII.
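
The round-trip test described above can be sketched roughly as follows.
This is only an illustration using Python's built-in "idna" codec (which
implements IDNA 2003, including Nameprep/NFKC), not the code that
produced the figures; the function names round_trip and
is_idn_by_round_trip are made up for this sketch, and the host name is
assumed to have been converted to Unicode already.

```python
def round_trip(host: str) -> str:
    """Unicode host name -> Punycode (ACE) form -> Unicode again."""
    # Encoding applies Nameprep (case folding, NFKC) and then Punycode
    # to every label that is still non-ASCII afterwards.
    ace = host.encode("idna")
    # Decoding turns any "xn--" labels back into Unicode.
    return ace.decode("idna")


def is_idn_by_round_trip(host: str) -> bool:
    """True if the round-tripped host name still contains non-ASCII."""
    return any(ord(ch) > 127 for ch in round_trip(host))


if __name__ == "__main__":
    # Stays non-ASCII after the round trip: counted like (D).
    print(is_idn_by_round_trip("b\u00fccher.example"))
    # Full-width letters are mapped to plain ASCII by NFKC: like (E)/(F).
    print(round_trip("\uff45\uff58\uff41\uff4d\uff50\uff4c\uff45.com"))
```

The same codec also folds non-ASCII upper-case letters and accepts the
ideographic full stop (U+3002) as a label separator, so such host names
come out of the round trip in their normalized form.
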
The number of non-ASCII upper-case host names (G) remained steady at
about 0.0009%, while host names containing non-ASCII dots (H) shrank
from 0.00007% to 0.00003%.
Some URIs have %-escaped bytes in their host names. Escaped UTF-8 host
names (I) grew slightly from 0.00006% to 0.00007%. These are supported
by Opera 9, but not by any other browser that I know of.
On the other hand, the escaped non-UTF-8 host names (J) shrank from
0.00005% to 0.00004%. These are only really supported by MSIE 6.
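
One way to tell the two kinds of escaped host names apart is to undo
the %-escapes and check whether the resulting bytes are valid UTF-8.
The sketch below is a heuristic under that assumption, not the method
used for the figures above; classify_escaped_host is a hypothetical
name, and a legacy-encoded byte sequence that happens to be valid UTF-8
would be misclassified.

```python
from urllib.parse import unquote_to_bytes


def classify_escaped_host(host: str) -> str:
    """Classify a host name by what its %-escapes decode to."""
    if "%" not in host:
        return "not escaped"
    raw = unquote_to_bytes(host)  # undo %XX escapes, yielding raw bytes
    if all(b < 0x80 for b in raw):
        return "escaped ASCII"
    try:
        raw.decode("utf-8")  # strict decoding rejects invalid sequences
    except UnicodeDecodeError:
        return "escaped non-UTF-8"
    return "escaped UTF-8"


if __name__ == "__main__":
    print(classify_escaped_host("b%C3%BCcher.example"))  # u-umlaut as UTF-8
    print(classify_escaped_host("b%FCcher.example"))     # u-umlaut as Latin-1
```
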
There is a very small number of host names with characters that were
unassigned in Unicode 3.2 (K), though this number has increased. These
are not supported by MSIE 7 (a wise decision).
Finally, there are some ASCII host names with non-LDH characters in
them (L), the most prevalent of which is the underscore (_).
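
For illustration, a check for non-LDH characters in an ASCII host name
might look like the sketch below. LDH is taken to mean letters, digits
and hyphen, with the dot allowed as label separator; the helper name
non_ldh_chars is an assumption of this example.

```python
import re

# Anything other than ASCII letters, digits, hyphen, or the label dot.
NON_LDH = re.compile(r"[^A-Za-z0-9.-]")


def non_ldh_chars(host: str) -> list:
    """Return the distinct non-LDH characters in an ASCII host name."""
    return sorted(set(NON_LDH.findall(host)))


if __name__ == "__main__":
    print(non_ldh_chars("my_host.example.com"))
    print(non_ldh_chars("www.example.com"))
```
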
Disclaimer: Google's index comprises documents from the open Web
only, i.e. those not blocked by robots.txt, firewalls, etc. We only index
high-value documents, which are computed using a proprietary