Unicode & IETF

Tue Aug 12 13:51:00 CEST 2014

On Mon, Aug 11, 2014 at 6:30 PM, Shawn Steele <Shawn.Steele at microsoft.com>
wrote:

> I feel like I'm hearing a third:  Make linguistic domain names
> mathematically unique so that I can depend that X==Y and maybe map back and
> forth between them with 100% certainty, particularly if they are rendered
> the same.  I don't feel that IDNA2003 or IDNA2008 accomplished that and
> don't see the class of behavior being discussed as contributing to the
> problem.
>

Shawn, your note made me try to think through the basic motivations of
IDNA2008. To enable IDNs without having to change every resolver in the
world, it was decided to map Unicode to a representation employing only
ASCII strings through a coding step called Punycode. It was important that
the mapping be one-to-one between the Unicode and the punycoded forms.
These two forms were designed to be canonically equivalent. That was one of
the versions of the "X==Y" you reference in the paragraph above. A second
assumption was that it was possible to use only
the Unicode properties of the Unicode characters to determine whether a
[new] character was or was not allowed for use in IDNs. The reason this was
considered valuable was precisely because it decoupled the class of PVALID
characters from any particular version of Unicode. IDNA2003 did not have
that property. Instead, it used what John K and others called "normative
tables."
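
The Unicode-to-ASCII step described above can be illustrated with a minimal
sketch. It assumes Python's built-in "punycode" codec and an arbitrarily
chosen example label; neither appears in this thread. It covers only the bare
Punycode encoding, not the "xn--" prefix or the rest of IDNA processing.

```python
# Illustrative only: the label is a made-up example, not taken from this thread.
label = "bücher"

# Python ships a "punycode" codec that performs the bare encoding step.
ascii_form = label.encode("punycode")
round_trip = ascii_form.decode("punycode")

print(ascii_form)            # bytes containing only ASCII
print(round_trip == label)   # decoding recovers the original label
```

The round trip is what "one-to-one" means here: each Unicode label has exactly
one punycoded form, and decoding that form gives back the same label.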

The basic need in DNS is for a resolver to be able to find, in an efficient
way, a domain name in a hierarchical and distributed structure. To do this,
DNS has to be able to compare ASCII strings as equal in a reliable way. To
do that, it is important to get the Unicode elements of an IDN label into a
canonical order so that comparison of either the Unicoded elements (e.g. in
UTF-8) or the punycoded (ASCII) elements can detect equality by simple
string comparison.
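
As a rough sketch of why the canonical form matters for simple string
comparison, the following uses Python's standard unicodedata module and NFC,
the normalization form IDNA2008 expects labels to be in. The example
character and the helper name canonical are chosen here for illustration.

```python
import unicodedata

# Two encodings of the same visible text (example chosen for illustration):
precomposed = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE
combining = "e\u0301"     # "e" followed by COMBINING ACUTE ACCENT

def canonical(s: str) -> str:
    """Return s in Normalization Form C (hypothetical helper name)."""
    return unicodedata.normalize("NFC", s)

raw_equal = precomposed == combining
normalized_equal = canonical(precomposed) == canonical(combining)
print(raw_equal, normalized_equal)   # prints: False True
```

Without normalization the two strings differ code point by code point; once
both are in the same canonical form, plain comparison detects that they match.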

When strings that users would regard as "the same" have ambiguous
representations in either the Unicoded or the punycoded sequences, the
ambiguity can result in failure to find the appropriate domain name in the
DNS. Or, worse, one may find the "wrong" one in the case that the ambiguous
versions have been independently registered and map to different IP
addresses. This is not about "confusables" in the sense that some
characters look like others. It is about the fact that the same glyph has
multiple encodings that do not collapse to an unambiguous canonical form.

The argument against allowing the new character is found in the paragraph
above and is not about glyph confusion. It is about coding ambiguity.

And that is why the new pre-composed character should not be allowed in
IDNs: because it was heretofore generated using a combining sequence and the
canonicalizing rules fail to produce that sequence in lieu of the new
pre-composed character.
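
A sketch of that failure, under an assumption: this message does not name the
character, so the example below uses U+08A1 ARABIC LETTER BEH WITH HAMZA ABOVE
(added in Unicode 7.0) and the sequence U+0628 U+0654 as stand-ins. The helper
name same_after is hypothetical.

```python
import unicodedata

# Assumption: U+08A1 stands in for "the new pre-composed character";
# this message does not name it.
precomposed = "\u08a1"       # ARABIC LETTER BEH WITH HAMZA ABOVE
sequence = "\u0628\u0654"    # ARABIC LETTER BEH + ARABIC HAMZA ABOVE

def same_after(form: str, a: str, b: str) -> bool:
    """True if a and b are identical after the given normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

nfc_same = same_after("NFC", precomposed, sequence)
nfd_same = same_after("NFD", precomposed, sequence)
print(nfc_same, nfd_same)
```

Because U+08A1 has no canonical decomposition, neither form maps one encoding
onto the other, so both comparisons come out false and the two encodings
remain distinct labels.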

As John mentions in passing, getting something into printable form
(regardless of the display medium) and comparing two instances of glyph
sequences impose very different requirements on rules for processing the
strings. I would have thought that the DNS case is very similar to the
general "string search" problem. Finding text in a large corpus of material
that uses Unicode to encode the characters must also place some constraints
on canonicalization since, without it, there would be a potentially
combinatorial explosion of different (under simple string comparison) ways
to represent the same sequence of glyphs, making it hard to find matching
texts.

vint