looking up domain names with unassigned code points
Shawn.Steele at microsoft.com
Sat May 10 07:17:42 CEST 2008
I'm in the "Punycode is a hack because DNS is 7 bit" camp. I prefer to recognize that the usefulness of the names is in the human readable (by the correct reader anyway) value of the name.
I "get" that Punycode could work in a link, but in practice? In practice, someone making a blog entry to some web site is going to go for the human form of the name. I can't even get a Punycode form easily, and I wrote the APIs we use. (Sure, I can call the API, but I never bothered to have a gadget to do that; instead I surf for one of the web-based ones when I need to convert.) The technically oriented people needing a workaround could easily use an ASCII CNAME that's more user friendly than the Punycode name.
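For anyone who wants to see the two forms side by side, here is a minimal sketch using the "idna" codec that ships with Python (it implements the IDNA 2003 rules, so it says nothing about newer code points). The helper names to_ace and to_display and the name bücher.example are only illustrations, not anything from our APIs.

```python
# Sketch only: convert between the human-readable form and the
# ASCII ("xn--") form using Python's built-in IDNA 2003 codec.

def to_ace(name: str) -> str:
    """Return the ASCII-compatible encoding of a domain name."""
    return name.encode("idna").decode("ascii")

def to_display(ace_name: str) -> str:
    """Return the Unicode form of an ASCII-compatible name."""
    return ace_name.encode("ascii").decode("idna")

if __name__ == "__main__":
    print(to_ace("bücher.example"))             # the form DNS sees
    print(to_display("xn--bcher-kva.example"))  # the form a person types
```

The point stands either way: nobody is going to type the first line's output into a blog post.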
I also sense a distinction between the "DNS is a network infrastructure used to make it easier to find computers than a big IPv6 sequence" people and the "my domain name is how I brand my business and protect my trademarks" people. Bring a camera when you go tell any marketing department that their pretty Unicode name needs to be written xn--abcdef just to make sure it works. Their expression's going to be priceless :)
Consider also the IMA/EAI UTF-8 effort. Clients that get a pretty UTF-8 name aren't going to be able to process that name if they can't do the punycode conversion to do the DNS query because they're on the wrong version of IDN. Sure, they can (hopefully) use the fallback address, but I suspect that'll have a human readable name, not a punycode string.
I am also in the "old client" camp so long as they are also allowed to process Unicode. Most of the rules, even for new characters, are reasonable and shouldn't block lookup. First of all, the string has to end up in a common form. Likely that's the form that's going to show up on a business card, since that's the form that the browser'll show, so even if I don't do anything but normalization, there are good odds that the string will convert to the proper Punycode for resolution. The slightly more complex cases involve casing, but even for new scripts the casing rules are fairly straightforward, so if the machine understood the Unicode code points it would still be likely to resolve. There are more complex cases, but the worst is probably RTL/bidi processing, which hopefully will be clear after this version, even for code points not currently allowed. Basing rules on Unicode character types is also likely to properly enable future versions since you'd just have to update the Unicode tables.
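To make the "normalization plus casing gets you most of the way" argument concrete, here is a rough sketch of what such a client could do. It is an assumption-laden illustration, not the IDNA algorithm: the function name naive_lookup_form is made up, it uses plain NFKC and lowercasing from whatever Unicode tables the runtime has, it splits labels on "." only, and it skips the prohibited-character and bidi checks entirely.

```python
import unicodedata

def naive_lookup_form(name: str) -> str:
    """Normalize, lowercase, then Punycode-encode each non-ASCII label.

    Hypothetical helper: no prohibited-character or bidi checks,
    and no label-length checks.
    """
    # Step 1: bring the string to a common form (NFKC), then lowercase.
    common = unicodedata.normalize("NFKC", name).lower()
    # Lowercasing can occasionally denormalize, so normalize once more.
    common = unicodedata.normalize("NFKC", common)

    labels = []
    for label in common.split("."):
        if label.isascii():
            labels.append(label)
        else:
            # Step 2: Punycode the label and add the ACE prefix.
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(labels)

if __name__ == "__main__":
    # Precomposed, upper-case, and decomposed spellings of the same name.
    for spelling in ("bücher.example", "BÜCHER.EXAMPLE", "Bu\u0308cher.Example"):
        print(naive_lookup_form(spelling))
```

All three spellings come out as the same query string, which is the behavior I'd expect from an old client that only knows how to normalize and case-map.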
I realize that there will be new code points that will be added that will be harder to convert to proper punycode. Those may not work, but if we can get a large percentage to work, and if the workaround is to use the normalized/clean form on one's business card, then the actual impact should be minimized until a "true" IDN version update can happen.