Definitions limit on label length in UTF-8
"Martin J. Dürst"
duerst at it.aoyama.ac.jp
Mon Sep 14 05:11:46 CEST 2009
Some additional points below.
On 2009/09/12 12:14, Martin J. Dürst wrote:
> Hello John,
> On 2009/09/12 0:47, John C Klensin wrote:
>> --On Friday, September 11, 2009 17:37 +0900 "Martin J. Dürst"
>> <duerst at it.aoyama.ac.jp> wrote:
>> I note that, while I haven't had time to respond, some of the
>> discussion on the IRI list has included an argument that domain
>> names in URIs cannot be restricted to A-label forms but must
>> include %-escaped UTF-8 simply because those strings might not
>> be public-DNS domain names but references to some other database
>> or DNS environment.
> It's not 'simply because'. It's first and foremost because of the
> syntactic uniformity of URIs, and the fact that it's impossible to
> identify all domain names in a URI. The usual slot after the '//' is
> easy, and scheme-specific processing (which is not what URIs and IRIs
> are about) may be able to deal with some of 'mailto', but what do you
> do about domain names in query parts? Also, this syntax is part of RFC
> 3986, STD 66, a full IETF Standard.
Also, consider EAI (email address internationalization) and mailto: (or
something like 'imailto:' if we go with a separate scheme name for
internationalized addresses). If we use scheme-specific processing, we
can convert IDN labels to punycode, but for EAI, that would be useless
overkill, because EAI uses UTF-8. This works much better if we use
%-encoding for the whole IRI->URI conversion than if we try to be 'smart'.
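To make the two escaping conventions concrete, here is a minimal Python sketch. The function names and the sample host are mine, purely for illustration; Python's built-in "idna" codec follows IDNA2003, and urllib's quote() stands in for the generic IRI->URI %-encoding step.

```python
from urllib.parse import quote


def percent_encode_host(host: str) -> str:
    """Generic IRI->URI step: %-encode the UTF-8 bytes of non-ASCII characters.

    ASCII letters, digits, '-' and '.' are left untouched.
    """
    return quote(host, safe=".-")


def to_a_labels(host: str) -> str:
    """DNS-specific step: convert each label to its IDNA2003 A-label form.

    Uses Python's built-in "idna" codec (nameprep, punycode, "xn--" prefix).
    """
    return host.encode("idna").decode("ascii")


host = "b\u00fccher.example"  # "bücher.example", an illustrative name only
print(percent_encode_host(host))
print(to_a_labels(host))
```

The first function needs no knowledge of where domain names sit in the identifier, so it can be applied to the whole IRI; the second only makes sense once a component is known to be a DNS name.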
> Overall, it's just a question of what escaping convention should be
> used. URIs have their specific escaping convention (%-encoding), and DNS
> has its specific escaping convention (punycode).
> Also please note that the IRI spec doesn't prohibit using punycode when
> converting to URIs.
> In addition, please note that at least my personal implementation
> experience (adding IDN support to Amaya) shows that the overhead of
> supporting %-encoding in domain names in URIs is minimal, and helps
> streamline the implementation.
>> It seems to me that one cannot have it
>> both ways -- either the application knows whether a string is a
>> public DNS reference that must conform _only_ to IDNA
>> requirements (but then can be restricted to A-labels) or the
>> application does not know and therefore must conform to DNS
>> requirements for label lengths.
> There is absolutely no need to restrict *all* references just because
> *some of them* may use other resolver systems with other length
> restrictions (which may be "63 octets per label when measured in UTF-8"
> or something completely different). It would be very similar to saying
> "Some compilers/linkers can only deal with identifiers 6 characters or
> shorter, so all longer identifiers are prohibited."
In addition, for IDNA2003 (which we are using for implementation
experience), a label being in UTF-8 means that it may not yet have been
nameprepped. That in turn implies that it may contain non-NFKC
characters, which take more or less space than the nameprepped version
of UTF-8. If there were indeed implementations that did conversion to
length-string pairs in UTF-8 and only later applied punycode, there
could be cases where an IDN label may or may not resolve depending on
whether input was normalized or not. So it could e.g. resolve on a Linux
or Windows system (these use precomposed characters mostly identical to
NFC), but not resolve on a Mac (which uses decomposed characters, taking
more space). Weird and improbable.
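A small Python sketch of that scenario, under stated assumptions: the 63-octet limit measured in UTF-8 is the hypothetical restriction discussed above, the sample label is made up, and Python's "idna" codec (IDNA2003, which nameprep-normalizes with NFKC) stands in for a resolver that converts to an A-label.

```python
import unicodedata

# Hypothetical limit: 63 octets per label, measured on the UTF-8 form.
UTF8_LABEL_LIMIT = 63


def utf8_len(label: str, form: str) -> int:
    """Length in octets of the label's UTF-8 encoding after normalizing to `form`."""
    return len(unicodedata.normalize(form, label).encode("utf-8"))


def fits_utf8_limit(label: str, form: str) -> bool:
    """True if the normalized label fits within the hypothetical UTF-8 limit."""
    return utf8_len(label, form) <= UTF8_LABEL_LIMIT


label = "\u00e9" * 30  # thirty times "é" (illustrative label)

print(utf8_len(label, "NFC"), fits_utf8_limit(label, "NFC"))  # precomposed input
print(utf8_len(label, "NFD"), fits_utf8_limit(label, "NFD"))  # decomposed input

# After nameprep, both inputs map to the same A-label, so a length check
# on the A-label gives the same answer regardless of input normalization.
a_nfc = unicodedata.normalize("NFC", label).encode("idna")
a_nfd = unicodedata.normalize("NFD", label).encode("idna")
print(a_nfc == a_nfd, len(a_nfc))
```

Precomposed "é" takes two octets in UTF-8, while the decomposed form ("e" plus U+0301) takes three, so the same label is 60 octets in one form and 90 in the other.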
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:duerst at it.aoyama.ac.jp