UTF-8

Shawn Steele Shawn.Steele at microsoft.com
Thu Jun 17 22:02:18 CEST 2010


>> I'd argue any new application protocol ought to specify the encoding rather than 
>> allowing multiple.   Specifying UTF-8 would be good :-)

> Just UTF-8, un-pre-processed, raw user input?  Or did you mean U-labels?

I meant that in non-DNS cases it doesn't really matter.  If they aren't U-labels, they won't work (just as "(*&$(*&.com" won't work), but other protocols shouldn't have to know how DNS behaves.

> Also, with respect to deployed protocols that have protocol elements for carrying
> domainnames, where those protocol elements are defined as carrying UTF-8, but
> where in practice most implementors did not actually code those slots as IDN-
> aware, wouldn't it be a strong presumption that the slots are IDN-unaware?

My assertion is that applications should use Unicode to enable globalization.  My app doesn't have to be IDN-aware or IDN-unaware, so long as it uses system APIs that "do the right thing."  The problem is that Punycode leaks into everything, and then suddenly anyone handling a name has to know how ACE works instead of just treating the name as an opaque string.

It's reasonably easy to build a network-enabled app.  You can call system APIs on most systems to open connections or resolve names.  If you're handling a protocol, you may need to know some protocol-specific stuff, but that's the app's domain (as in area/field, not name).  Apps may need to know how to parse their protocol to get a host name and then pass that to the system APIs, but why should they have to know how to convert to ACE, compare ACE vs. Unicode, etc.?  Presuming that those operations are interesting to apps, there should be things like "CompareHostName()" functions so that apps don't have to worry about IDN or the various forms a name can take.
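
To make the "CompareHostName()" idea concrete, here is a minimal sketch of what such a helper might look like.  The names to_ace() and compare_host_names() are hypothetical, and Python's built-in "idna" codec (which implements IDNA 2003, not IDNA2008 or UTS#46) is only a stand-in for whatever mapping a real system API would apply.

```python
def to_ace(host: str) -> bytes:
    """Return a lowercase ASCII (ACE) form of a host name.

    Assumption: Python's built-in "idna" codec (IDNA 2003, RFC 3490)
    stands in for the mapping a real system API would use.
    """
    # Non-ASCII labels are nameprepped and Punycode-encoded; ASCII labels
    # pass through unchanged, so lowercase the result before comparing.
    return host.encode("idna").lower()


def compare_host_names(a: str, b: str) -> bool:
    """Hypothetical CompareHostName(): True if both spellings map to the same ACE name."""
    return to_ace(a) == to_ace(b)


if __name__ == "__main__":
    # The caller never touches Punycode; it only asks "same host or not?"
    print(compare_host_names("bücher.example", "XN--BCHER-KVA.example"))
    print(compare_host_names("bücher.example", "buecher.example"))
```

The point of the sketch is that the application treats both arguments as opaque strings; all knowledge of ACE lives inside the helper.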

EAI is a good example of layering.  The protocol doesn't have to know anything about Punycode or the details of DNS; it just uses UTF-8.  At some point an EAI app will have to connect to a name server, and, hopefully, it can do so by calling a UTF-16 or UTF-8 (or native code page) aware API that does the right conversions, using UTF-8 or whatever on Intranet requests, and ACE on Internet requests as necessary.  EAI never has to worry about different forms of a name.
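
A rough sketch of that layering follows.  Everything here is an assumption for illustration: domain_of() stands for the app's protocol-specific parsing, wire_name() stands for the system layer that picks the on-the-wire form, and Python's "idna" codec (IDNA 2003) substitutes for the real conversion rules.

```python
def domain_of(address: str) -> str:
    """App layer (hypothetical): protocol-specific parsing of a mailbox address."""
    # The app only knows its own syntax; the domain stays an opaque Unicode string.
    return address.rsplit("@", 1)[1]


def wire_name(host: str, internet: bool) -> bytes:
    """System layer (hypothetical): choose the encoding the resolver will send.

    Assumption: ACE for Internet lookups, raw UTF-8 for Intranet lookups.
    """
    if internet:
        return host.encode("idna")   # ACE, via Python's IDNA 2003 codec
    return host.encode("utf-8")      # UTF-8 passed through unchanged


if __name__ == "__main__":
    host = domain_of("用户@bücher.example")
    print(wire_name(host, internet=True))
    print(wire_name(host, internet=False))
```

Only wire_name() knows that two encodings exist; the app-level code above it would look the same if the conversion rules changed.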

And, FWIW, if I were building a name server, I'd let it accept UTF-8 requests.  (They'd have to be U-labels, so the server would have to use the UTS#46 mappings like any client would; however, it wouldn't matter as long as the rules were consistent.)

-Shawn


More information about the Idna-update mailing list