IDN trends

Fri Dec 14 04:07:41 CET 2007

I've been told that these numbers are confusing, so let me try to clarify.

First, these numbers were all computed using an IDNA2003
implementation. I haven't taken a look at IDNA200X tables yet.

Second, it is hard to see which of these numbers overlap. So let me
write down the relationships:

A = B + D
C = D + E
F < C (F is a subset of C)
G < C
H < C
I < C
J !< any (J is not a subset of any of the listed sets)
K !< any
L !< any
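
The two equalities can be checked against the table in the quoted message below. This is a minimal sketch in Python with the percentages copied from that table. The published figures are rounded to three significant digits, so the sums are only expected to agree to within rounding; the 1% tolerance and the helper name `holds` are assumptions made for illustration.

```python
# Percentages copied from the quoted table: (Nov 2006, Nov 2007).
table = {
    "A": (0.0166, 0.0477),    # IDNs
    "B": (0.00759, 0.0428),   # Punycode
    "C": (0.0100, 0.00613),   # non-ASCII
    "D": (0.00900, 0.00494),  # round-trip to non-ASCII
    "E": (0.00102, 0.00119),  # round-trip to ASCII
}

def holds(total, part1, part2, rel_tol=0.01):
    """True if total == part1 + part2 in both years, within rel_tol (assumed tolerance)."""
    return all(
        abs(table[total][i] - (table[part1][i] + table[part2][i]))
        <= rel_tol * table[total][i]
        for i in range(2)
    )

print("A = B + D:", holds("A", "B", "D"))
print("C = D + E:", holds("C", "D", "E"))
```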

Also, regarding Mark's earlier email: "about 8% more would be valid if
IDNAbis were changed to also do case and width folding":

This corresponds to (F + G) / A, which was about 15% in Nov 2006 and
about 4% in Nov 2007. The 8% figure was computed by someone else at
Google in March 2007, between those two dates, so it is consistent
with this trend.
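
For reference, the ratio can be reproduced from the table in the quoted message. This is a small sketch with the percentages copied from that table; the function name `folding_share` is only illustrative.

```python
# Percentages copied from the quoted table: (Nov 2006, Nov 2007).
A = (0.0166, 0.0477)      # IDNs
F = (0.00153, 0.00108)    # mapped by NFKC
G = (0.000916, 0.000921)  # non-ASCII upper-case

def folding_share(year):
    """(F + G) / A as a percentage; year is 0 for Nov 2006, 1 for Nov 2007."""
    return 100 * (F[year] + G[year]) / A[year]

for label, year in (("Nov 2006", 0), ("Nov 2007", 1)):
    print(f"{label}: {folding_share(year):.1f}%")
```

Rounded to whole percentages, this gives the 15% and 4% quoted above.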

Erik

On Dec 13, 2007 2:44 PM, Erik van der Poel <erikv at google.com> wrote:
> I had a look at the URIs/IRIs in half a billion HTML documents in
> Google's index from 2006 and 2007. The following percentages are
> relative to the total number of URIs/IRIs that contain host names.
>
>   Nov 2006     Nov 2007
> A 0.016600000% 0.047700000% IDNs
> B 0.007590000% 0.042800000% Punycode
> C 0.010000000% 0.006130000% non-ASCII
> D 0.009000000% 0.004940000% round-trip to non-ASCII
> E 0.001020000% 0.001190000% round-trip to ASCII
> F 0.001530000% 0.001080000% mapped by NFKC
> G 0.000916000% 0.000921000% non-ASCII upper-case
> H 0.000072900% 0.000030000% non-ASCII dots
> I 0.000060300% 0.000067600% escaped UTF-8
> J 0.000045200% 0.000035700% escaped non-UTF-8
> K 0.000000124% 0.000001030% unassigned in Unicode 3.2
> L 0.113000000% 0.076900000% ASCII non-LDH
>
> The good news is that the number of IDNs (A) has grown from 0.017% to
> 0.048%, and most of that growth is in Punycode (B), which grew from
> 0.008% to 0.043%.
>
> On the other hand, IDNs written in other encodings (D) shrank from
> 0.009% to 0.005%. This figure was computed by converting the original
> to UTF-8 (C), and then to Punycode, and finally back to UTF-8 again.
> If the result is non-ASCII (D), it is considered an IDN.
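
The round trip described above can be sketched with Python's built-in `idna` codec, which implements IDNA2003 (Nameprep plus ToASCII/ToUnicode). This is only an approximation of the procedure, not the code used for the survey: it assumes the host name has already been decoded into a Python string (standing in for the UTF-8 step), and the function names are hypothetical.

```python
def round_trip(host):
    """Convert host to its Punycode (ACE) form and back again."""
    ace = host.encode("idna")   # ToASCII applied to each label
    return ace.decode("idna")   # ToUnicode applied to each label

def classify(host):
    """Return 'D' if the round trip is still non-ASCII, else 'E'."""
    return "E" if round_trip(host).isascii() else "D"

# A host with a real non-ASCII letter survives the round trip (D).
print(classify("b\u00fccher.example"))
# Full-width letters are folded to plain ASCII by Nameprep's NFKC step (E).
print(classify("\uff45\uff58\uff41\uff4d\uff50\uff4c\uff45.com"))
```

Host names that fail Nameprep raise `UnicodeError` in this sketch; how such cases were counted in the survey is not stated.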
>
> Some of these round-trips resulted in ASCII (E), and this figure
> remained steady at about 0.001%.
>
> Most of these are mapped by NFKC (F), i.e. full-width -> normal ASCII.
>
> The number of non-ASCII upper-case host names (G) remained steady at
> about 0.0009%, while host names containing non-ASCII dots (H) shrank
> from 0.00007% to 0.00003%.
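
One way to detect categories G and H is sketched below. The three dot characters are the non-ASCII label separators listed in RFC 3490. The definition of "non-ASCII upper-case" (any non-ASCII character that changes when lower-cased) is an assumption, since the exact test used for the survey is not stated, and the function names are hypothetical.

```python
# Non-ASCII characters that IDNA2003 (RFC 3490) treats as label separators:
# ideographic full stop, full-width full stop, half-width ideographic full stop.
NON_ASCII_DOTS = "\u3002\uff0e\uff61"

def has_non_ascii_upper(host):
    """True if any non-ASCII character in host changes under lower-casing."""
    return any(ord(c) > 127 and c != c.lower() for c in host)

def has_non_ascii_dot(host):
    """True if host contains one of the non-ASCII label separators."""
    return any(c in NON_ASCII_DOTS for c in host)

print(has_non_ascii_upper("B\u00dcCHER.example"))  # contains U+00DC
print(has_non_ascii_dot("example\u3002com"))       # contains U+3002
```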
>
> Some URIs have %-escaped bytes in their host names. Escaped UTF-8 host
> names (I) grew slightly from 0.00006% to 0.00007%. These are supported
> by Opera 9, but not by any other browser that I know of.
>
> On the other hand, the escaped non-UTF-8 host names (J) shrank from
> 0.00005% to 0.00004%. These are only really supported by MSIE 6.
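
The distinction between I and J can be sketched as follows: unescape the host name to raw bytes and test whether those bytes form valid UTF-8. The function name and the extra return values for unescaped and escaped-ASCII hosts are assumptions added for illustration.

```python
from urllib.parse import unquote_to_bytes

def classify_escaped_host(host):
    """Classify a %-escaped host name as 'I' (UTF-8) or 'J' (non-UTF-8)."""
    if "%" not in host:
        return "unescaped"
    raw = unquote_to_bytes(host)
    if all(b < 128 for b in raw):
        return "escaped ASCII"  # escapes present, but no non-ASCII bytes
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "J"
    return "I"

print(classify_escaped_host("b%C3%BCcher.example"))  # u-umlaut as UTF-8 bytes
print(classify_escaped_host("b%FCcher.example"))     # u-umlaut as a Latin-1 byte
```

Note that a short legacy-encoded byte sequence can happen to be valid UTF-8, so this test can misclassify some J hosts as I.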
>
> There is a very small number of host names with characters that were
> unassigned in Unicode 3.2 (K), though this number has increased. These
> are not supported by MSIE 7 (a wise decision).
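
A check for category K can be sketched with Python's `stringprep` module, whose `in_table_a1` function tests membership in RFC 3454 Table A.1 (code points unassigned in Unicode 3.2). The wrapper name is hypothetical, and this is not necessarily how the survey detected them.

```python
import stringprep

def has_unassigned_3_2(host):
    """True if host contains a code point that was unassigned in Unicode 3.2."""
    return any(stringprep.in_table_a1(c) for c in host)

# U+0221 was added after Unicode 3.2, so it is listed in Table A.1.
print(has_unassigned_3_2("\u0221.example"))
print(has_unassigned_3_2("b\u00fccher.example"))
```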
>
> Finally, there are some ASCII host names with non-LDH characters in
> them (L), the most prevalent of which is the underscore (_).
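
Finding non-LDH characters in an ASCII host name is a simple pattern match. In this minimal sketch, LDH means letters, digits, and hyphen, with the dot allowed as the label separator; the function name is hypothetical and the input is assumed to be pure ASCII.

```python
import re

# Anything other than letters, digits, hyphen, or the dot between labels.
NON_LDH = re.compile(r"[^A-Za-z0-9.-]")

def non_ldh_chars(host):
    """Return the set of non-LDH characters found in an ASCII host name."""
    return set(NON_LDH.findall(host))

print(non_ldh_chars("my_host.example.com"))
print(non_ldh_chars("www.example.com"))
```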
>
> Disclaimer: Google's index consists of documents from the open Web
> only, i.e. not blocked by robots.txt, firewalls, etc. We only index
> high-value documents, which are selected using a proprietary
> algorithm.
>
> Erik
>