prohibiting previously mapped and unmapped characters

Erik van der Poel erikv at google.com
Sat Dec 2 00:51:46 CET 2006


OK, thanks to Mark Davis, my IDN character frequency results have been
made available on the Web:

http://macchiato.com/idn/idn-unmapped-sorted.html
http://macchiato.com/idn/idn-mapped-sorted.html

There are several caveats/notes:

These URLs are for documents that Google was actually able to fetch
from the Web quite recently. The sample was a large portion of the
main index, so it covers only a subset of the domain names that have
actually been registered.

I recommend MSIE 7 if you wish to try the links. Firefox is stricter
about the links it will follow.

Some of the domains are wildcard domains. No attempt has been made to
distinguish between wildcard and normal domains. If bar.com is a
wildcard domain, then foo.bar.com, blah.bar.com and blurfl.bar.com all
work just fine: any label you type in that position resolves.
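
For anyone who does want to separate the two kinds of domain, a rough
heuristic is to look up a few random labels under the domain and see
whether they all resolve. The sketch below is only an illustration and
is not part of the program that produced the tables. The function name
looks_like_wildcard and its parameters are hypothetical, and the
resolver is injectable so the logic can be tried without touching the
network. Resolvers that rewrite failed lookups would produce false
positives.

```python
import secrets
import socket


def looks_like_wildcard(domain, resolve=socket.gethostbyname, probes=3):
    """Guess whether `domain` is a wildcard domain (heuristic only).

    A few random labels are looked up under the domain. If every one of
    them resolves, the domain probably answers for any label.
    """
    for _ in range(probes):
        # A random 16-hex-digit label is very unlikely to be a real host.
        probe = f"{secrets.token_hex(8)}.{domain}"
        try:
            resolve(probe)
        except OSError:  # socket.gaierror is a subclass of OSError
            return False
    return True
```

Calling looks_like_wildcard("bar.com") with the default resolver does
real DNS lookups, so the answer depends on the network it runs on.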

Some of the URLs take you to "parked" domains, which are really just
ads for those domain names and other services. No attempt has been
made to distinguish between parked and normal domains.

Some domain names and Web sites may be offensive to some. No attempt
has been made to filter out potentially offensive material.

The first table contains both unmapped and mapped characters. The IDNA
process maps characters to themselves, to nothing or to something else
via normalization and case-mapping. The second table is an attempt to
separate out the mapped characters only. Overall, 0.0188% of the domain
names are mapped to different strings by the IDNA process on the way
from the links found in HTML to the domain names passed to DNS.
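
To make the three outcomes concrete, here is a minimal sketch of the
mapping step using the stringprep tables that ship with Python. It is
not the program used to produce the tables. The helper names classify,
map_name and is_mapped are made up for illustration, and the sketch
covers only mapping and normalization, not the prohibited-character or
bidi checks, nor the Punycode encoding.

```python
import stringprep
# IDNA (RFC 3490/3491) is defined against Unicode 3.2, so use that database.
from unicodedata import ucd_3_2_0 as ucd


def classify(ch):
    """Return (kind, result) for a single character.

    kind is "nothing", "itself" or "other".
    """
    if stringprep.in_table_b1(ch):        # Table B.1: mapped to nothing
        return ("nothing", "")
    # Table B.2: case folding, followed by NFKC normalization
    mapped = ucd.normalize("NFKC", stringprep.map_table_b2(ch))
    if mapped == ch:
        return ("itself", ch)
    return ("other", mapped)


def map_name(name):
    """Apply the mapping step to a whole name, normalizing once at the end."""
    kept = [stringprep.map_table_b2(ch)
            for ch in name if not stringprep.in_table_b1(ch)]
    return ucd.normalize("NFKC", "".join(kept))


def is_mapped(name):
    """True if the mapping step turns the name into a different string."""
    return map_name(name) != name
```

With these helpers, classify("a") is ("itself", "a"), classify("A") is
("other", "a"), and the soft hyphen U+00AD comes back as
("nothing", ""). The whole name is normalized in one pass in map_name
because NFKC can combine adjacent characters, which a per-character
check would miss.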

If you click on the "Code" heading at the top, you can sort by
codepoint. The G heading means Glyph.

No URLs are provided for the LDH set, since those characters are less
interesting and their URLs required too much disk space when running
my program.

There are several interesting things here. The only one I will point
out for now is that the fl ligature U+FB02 appears to be quite
frequent among the mapped characters.
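
As an illustration of how such a tally could be built, the sketch
below counts the characters in a list of host names that the mapping
step would change or remove. It is an assumption about the general
approach, not the actual program behind the tables. The function name
mapped_char_counts and the sample host names are hypothetical, and
unlike the tables it counts uppercase LDH letters along with
everything else.

```python
import stringprep
from collections import Counter
# IDNA (RFC 3490/3491) is defined against Unicode 3.2.
from unicodedata import ucd_3_2_0 as ucd


def mapped_char_counts(hostnames):
    """Count characters that the mapping step would change or remove."""
    counts = Counter()
    for host in hostnames:
        for ch in host:
            if stringprep.in_table_b1(ch):
                counts[ch] += 1           # mapped to nothing
            elif ucd.normalize("NFKC", stringprep.map_table_b2(ch)) != ch:
                counts[ch] += 1           # mapped to something else
    return counts


# Made-up host names, for illustration only.
sample = ["\ufb02ower.example", "con\ufb02ict.example", "Stra\u00dfe.example"]
counts = mapped_char_counts(sample)
for ch, n in counts.most_common():
    print(f"U+{ord(ch):04X} {n}")
```

In this made-up sample the fl ligature U+FB02 is counted twice, while
U+00DF and the uppercase S are counted once each.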

Also, if you notice anything wrong or would like some enhancements,
let me know. I cannot make any promises, but I will try to improve the
programs I wrote to get the data. One possible enhancement is to add
links to the referrers, so that it is possible for you to confirm that
an existing HTML document did actually use those characters in domain
names.

Erik

