Standards and localization (was Dot-mapping)

Sat Dec 8 20:59:16 CET 2007

--On Saturday, 08 December, 2007 11:28 -0800 Mark Davis
<mark.davis at icu-project.org> wrote:

> I'm a bit puzzled. If I take a "raw" IDN, like
> 
> http://Bücher.com
> 
> and paste it into an IDNA unaware browser, it won't work.

Be careful about how you define "work".  The behavior is
discouraged for many reasons and prohibited in some application
protocols.  The DNS specs themselves say that, unless the
application protocol prohibits it, it is perfectly valid
to parse that string into the series of octets that make up
"Bücher" and the string of characters that make up "com", and
go look them up.  If a label whose octets correspond to the
octets of "Bücher"  is found in the "com" zone, and the query
type matches the type of those records, the associated
information will be returned.

The prohibition is in the definition of the http URL and URLs in
general, not in the DNS.  
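
As an illustration of the parse described above, here is a minimal Python sketch. The function name split_on_ascii_dot is hypothetical, and the choice of UTF-8 is an assumption made only for the example; the DNS itself does not say which encoding the octets are in.

```python
def split_on_ascii_dot(name: str, encoding: str = "utf-8") -> list[bytes]:
    """Split a typed name on U+002E only and return each label as raw octets.

    Assumption: the octets are UTF-8. Nothing in the DNS requires that.
    """
    return [label.encode(encoding) for label in name.split(".")]


labels = split_on_ascii_dot("Bücher.com")
print(labels)  # [b'B\xc3\xbccher', b'com']
```

A resolver that is not IDNA-aware would look up the first of these octet strings in the "com" zone exactly as given.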

> We
> should expect that of browsers that doesn't handle IDN. We'd
> need to paste in a punycode version to work: xn--bcher-kva.com

Yes, certainly if that domain is going to be interpreted by IDNA
rules, because "xn--bcher-kva" appears as a label in the zone
and not "Bücher".
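
For reference, the label conversion can be reproduced with Python's built-in "punycode" codec. This is a sketch only: it assumes the label has already been lowercased (IDNA's preparation step does case folding before encoding), and it adds the "xn--" prefix by hand.

```python
# Assumption: the label is already case-folded, as IDNA preparation would do.
label = "bücher"

# The "punycode" codec encodes one label; the ACE prefix is added separately.
ace_label = "xn--" + label.encode("punycode").decode("ascii")
print(ace_label)  # xn--bcher-kva
```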

> If I take a "raw" IDN, like
> 
> http://Buecher．com               // that dot is a full-width dot
> 
> and paste it into an IDNA unaware browser, it also won't work.

Ah.  But 

(1) the reason why it won't work is fundamentally different
because, ignoring the http URL restriction for a moment, that
string won't result in a lookup of "Buecher" in the "com" zone
but of "Buecher．com" in the root.  Remember that, as far as
the DNS is concerned, independent of what the http URL and IDNA
think, "Buecher．com" is probably a perfectly reasonable label
(there are some other issues about encoding, since nothing at
all requires that non-ASCII characters be written in UTF-8 or
any other Unicode representation) and...

(2) http://Buecher.com/ can be parsed and looked up properly,
and the name presumably found, by IDNA-unaware applications.
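
The difference between the two cases can be sketched in Python. The helper names are hypothetical; the three extra separators are the ones RFC 3490 tells IDNA-aware software to treat as dots.

```python
# Dot-like characters that RFC 3490 (section 3.1) says to treat as separators:
# ideographic full stop, fullwidth full stop, halfwidth ideographic full stop.
IDNA_DOTS = "\u3002\uff0e\uff61"


def naive_labels(name: str) -> list[str]:
    """What an IDNA-unaware parser does: split on U+002E only."""
    return name.split(".")


def map_dots(name: str) -> str:
    """What an IDNA-aware front end does first: map the extra dots to U+002E."""
    return name.translate({ord(ch): "." for ch in IDNA_DOTS})


wide = "Buecher\uff0ecom"  # "Buecher．com" with a fullwidth dot

print(naive_labels(wide))            # one label, to be looked up in the root
print(naive_labels(map_dots(wide)))  # ['Buecher', 'com']
```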

> We should also expect that of browsers that doesn't handle
> IDN. We'd need to paste in a normalized version to work:
> http://Buecher.com
> 
> That is, it doesn't appear that the dot conversion is much
> different than the punycode conversion (and case/normalization
> folding) -- something that has to be done before passing off
> to DNS for it to work correctly.

Let's get unstuck from browsers and all of the traps that "the
Internet is the web" leads us into.  Assume there is an
application that uses DNS names as identifiers but that is rarely
required to look them up as DNS names.  Assume that, unlike
URLs, it carries domain names around, and passes them from system
to system, in internal (length-value list of labels) form.  Now
assume that the first system that encounters the identifier
(presumably read from something typed by the user in
dot-separated form) is not IDNA-aware.  It parses the identifier
into labels, using periods only.  Perhaps it then rejects it
because it contains non-ASCII characters or no dots, but it is
not required to do that by the DNS.  It places the length-value
list into the database or passes it to another system.  
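
To make the length-value form concrete, here is a rough Python sketch of what such a non-IDNA-aware first system might store. The function names are hypothetical, and UTF-8 is again only an assumption about how the typed characters become octets.

```python
def to_wire(labels: list[bytes]) -> bytes:
    """Pack labels as length-prefixed octet strings, ending with the root label."""
    out = bytearray()
    for label in labels:
        if not 1 <= len(label) <= 63:
            raise ValueError("label must be 1 to 63 octets")
        out.append(len(label))
        out += label
    out.append(0)  # zero-length root label terminates the name
    return bytes(out)


def naive_parse(typed: str) -> list[bytes]:
    """Split on ASCII periods only; UTF-8 is an assumption for this example."""
    return [part.encode("utf-8") for part in typed.split(".")]


intended = to_wire(naive_parse("Buecher.com"))
wide = to_wire(naive_parse("Buecher\uff0ecom"))

print(intended)  # two labels: 7 octets, then 3 octets
print(wide)      # one 13-octet label; the wide dot is now just label data
```

Once the name is in this form, nothing marks the wide dot as a would-be separator; any later system sees a single label.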

If that system actually needs to do a DNS lookup and is not
IDNA-aware, it looks the wrong label up in the wrong zone.  If
it is IDNA-aware it presumably rejects the label for containing
a wide dot, and doesn't look it up at all (the error message is
likely to be very interesting).  Or it recombines the label list
into a dot-separated FQDN, checks for wide dots and fixes them,
and then looks everything up as the user intended.  

Or suppose that second (or subsequent) system doesn't look the
name up in the DNS at all.  It isn't IDNA-aware because it isn't
DNS-aware: all it knows how to do with DNS names is to parse
them into labels and recombine those labels, plus how to do DNS
matching (case-insensitive matching in the ASCII range and
bitstring mapping outside it).  It compares the string with the
embedded wide dot to one with a standard dot and the match
fails, even if the user typed in a FQDN that contains punycode
but also a wide dot (what does the user know?  And the whole
argument for those dots was that they were easier to type and
made more sense in context!).  How it fails depends on whether
the pre-parsed string is first recombined and then matched or
whether the comparison string is parsed into labels and then
matched.  But it will fail.
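
A small Python sketch of that matching rule, under the same assumptions as before (hypothetical function names, UTF-8 octets for the wide dot), shows the failure on both paths:

```python
def ascii_lower(octets: bytes) -> bytes:
    """Fold A-Z to a-z; leave every other octet untouched."""
    return bytes(b + 32 if 0x41 <= b <= 0x5A else b for b in octets)


def names_match(a: list[bytes], b: list[bytes]) -> bool:
    """DNS-style comparison: label by label, case-insensitive in ASCII only."""
    return len(a) == len(b) and all(
        ascii_lower(x) == ascii_lower(y) for x, y in zip(a, b)
    )


stored = [b"xn--bcher-kva", b"com"]                   # name with a standard dot
typed = ["xn--bcher-kva\uff0ecom".encode("utf-8")]    # punycode plus a wide dot

# Path 1: compare as label lists.
print(names_match(typed, stored))  # False: one label versus two

# Path 2: recombine with ASCII dots first, then compare the strings.
print(ascii_lower(b".".join(typed)) == ascii_lower(b".".join(stored)))  # False

# Ordinary ASCII case differences still match.
print(names_match([b"XN--BCHER-KVA", b"COM"], stored))  # True
```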

    john