UTF-8

Nicolas Williams Nicolas.Williams at oracle.com
Thu Jun 17 22:47:49 CEST 2010


On Thu, Jun 17, 2010 at 08:02:18PM +0000, Shawn Steele wrote:
> >> I'd argue any new application protocol ought to specify the
> >> encoding rather than allowing multiple.  Specifying UTF-8 would be
> >> good :-)
> 
> > Just UTF-8, un-pre-processed, raw user input?  Or did you mean
> > U-labels?
> 
> I meant, in non-DNS cases, it doesn't really matter.  If they aren't
> U-labels, they won't work (just like (*&$(*&.com won't work)), but
> other protocols shouldn't have to know how DNS behaves.

That's not really a useful answer.

However, I believe it'd be fine in NFSv4 to send un-pre-processed, raw
user input in UTF-8 and let the receiver apply ToASCII() or
ToUnicode(ToASCII()) as necessary.  Note that non-U-label UTF-8 would
work with this approach.
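
As a rough illustration of that receiver-side step, here is a minimal
Python sketch.  It assumes Python's built-in "idna" codec as a stand-in
for ToASCII() and ToUnicode(); that codec implements the IDNA2003
operations (RFC 3490), not IDNA2008, and the helper names to_ascii() and
to_unicode() are made up for this example.

```python
# Sketch only: Python's built-in "idna" codec stands in for the
# ToASCII()/ToUnicode() operations (IDNA2003, RFC 3490).
# to_ascii() and to_unicode() are hypothetical helper names.

def to_ascii(wire_bytes: bytes) -> bytes:
    """Decode a UTF-8 domain name off the wire and return its ACE form."""
    name = wire_bytes.decode("utf-8")
    return name.encode("idna")


def to_unicode(wire_bytes: bytes) -> str:
    """Normalize a name via ToUnicode(ToASCII(name))."""
    return to_ascii(wire_bytes).decode("idna")


if __name__ == "__main__":
    # What a sender might put on the wire: raw user input in UTF-8.
    raw = "b\u00fccher.example".encode("utf-8")
    print(to_ascii(raw))
    print(to_unicode(raw))
```

The sender does nothing beyond UTF-8 encoding; all IDNA processing
happens on the receiving side, and plain ASCII names pass through
unchanged.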

> > Also, with respect to deployed protocols that have protocol elements
> > for carrying domainnames, where those protocol elements are defined
> > as carrying UTF-8, but where in practice most implementors did not
> > actually code those slots as IDN-aware, wouldn't it be a strong
> > presumption that the slots are IDN-unaware?
> 
> My assertion is that applications should use Unicode to enable
> globalization.  My app doesn't have to be IDN aware or unaware, so
> long as it uses system APIs that "do the right thing."  The problem is
> that punycode leaks into everything, then suddenly anyone handling a
> name has to know how ACE works, instead of just treating it as an
> opaque string.  

Indeed, it's all about those system APIs.  If the application sends raw
user input encoded in UTF-8 and the peer passes that to getaddrinfo(),
and getaddrinfo() does the Right Thing, then everything works.  And the
application is left pretty darned simple.
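
To make "pretty darned simple" concrete, a sketch of what the peer's
code might reduce to, written in Python.  The helper name
resolve_from_wire() and the injectable resolver parameter are
assumptions for illustration; socket.getaddrinfo() here plays the role
of the system API that is trusted to do the Right Thing.

```python
import socket

# Sketch only: resolve_from_wire() is a hypothetical helper.  The resolver
# parameter defaults to socket.getaddrinfo, standing in for a system API
# that handles non-ASCII names itself.

def resolve_from_wire(wire_bytes, port, resolver=socket.getaddrinfo):
    """Pass a UTF-8 domain name from a protocol slot straight to the resolver."""
    hostname = wire_bytes.decode("utf-8")  # no IDNA logic in the application
    return resolver(hostname, port, type=socket.SOCK_STREAM)
```

The application never touches ACE or Punycode.  Whether this works for
non-ASCII names depends entirely on the resolver underneath, which is
the point being made above.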

Achieving that level of simplicity has been my goal in engaging this
list.

It's not necessarily that simple in all cases.  For example, if the
application needs to format LDAP DNs using the DC name attribute to hold
domainname labels (e.g., DC=foo,DC=example) then the application has to
make sure to use A-labels.  However, if getaddrinfo() by default returns
A-labels as canonical names, then the application still has nothing
special to do.  The point is that there will be a variety of cases, each
of which has to be handled individually.
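
For the LDAP case, a minimal Python sketch of building such a DN from
A-labels.  The helper name domain_to_dn() is hypothetical, and Python's
"idna" codec (IDNA2003) is assumed as the source of A-labels; on a real
system the canonical name returned by the resolver might already be in
that form.

```python
# Sketch only: domain_to_dn() is a hypothetical helper; the "idna" codec
# (IDNA2003) stands in for whatever produces A-labels on a real system.

def domain_to_dn(domain: str) -> str:
    """Build an LDAP DN of DC components from a domain name, using A-labels."""
    ace = domain.rstrip(".").encode("idna").decode("ascii")
    # Assumes every resulting label is letters, digits and hyphens only;
    # anything else would need DN escaping, which this sketch omits.
    return ",".join("DC=" + label for label in ace.split("."))


if __name__ == "__main__":
    print(domain_to_dn("foo.example"))
    print(domain_to_dn("b\u00fccher.example"))
```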

Nico