IDNA and U+08A1 and related cases (was: Re: Barry Leiba's Discuss on draft-ietf-json-i-json-05: (with DISCUSS and COMMENT))
nico at cryptonector.com
Mon Jan 26 23:28:00 CET 2015
[much quoted text trimmed; trimming not noted below]
On Sun, Jan 25, 2015 at 10:30:54PM -0800, Asmus Freytag wrote:
> Canonical decomposition, by necessity, thus cannot "solve" the issue
> of turning Unicode into a perfect encoding for the sole purpose of
> constructing a robust identifier syntax - like the hypothetical
> encoding I opened this message with. If there was, at any time, a
> misunderstanding of that, it can't be helped -- we need to look for
> solutions elsewhere.
> The fundamental design limitation of IDNA 2008 is that, largely, the
> rules that it describes pertain to a single label in isolation.
I'd say that the fundamental design limitation of DNS (never mind IDNA)
in this context is that a domainname's labels are supposed to make good
identifiers (observe: hand-waving about what an "identifier" is).
> You can look at the string of code points in a putative label, and
> compute whether it is conforming or not.
Right, instead we need to look at the zone's contents to see if there
are semantically-similar/identical labels with different encoding. Then
we need to either forbid these (at the registration step) or require
that all of them resolve to the same RRsets. The latter can be
difficult (there might be many aliases for a given label), but so can
the former: the registries need a robust label-similarity testing tool,
and this has to be kept up to date with the UC's assignments of new
codepoints that may be confusable (which in turn requires that someone
be able to, and actually does, note their confusability potential).
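To make the "label-similarity testing tool" concrete, here is a minimal sketch of the UTS #39-style "skeleton" approach: map each code point through a confusables table, normalize, and compare skeletons. The table below is a tiny illustrative subset I've made up for the example -- the real data is the Unicode confusables.txt file, which is exactly the thing that has to be kept up to date as new code points are assigned.

```python
import unicodedata

# Toy confusables table (illustrative subset only; the real mapping
# lives in confusables.txt from UTS #39).
CONFUSABLES = {
    '\u0430': 'a',   # CYRILLIC SMALL LETTER A  -> LATIN 'a'
    '\u03bf': 'o',   # GREEK SMALL LETTER OMICRON -> LATIN 'o'
}

def skeleton(label: str) -> str:
    """Rough sketch of the UTS #39 skeleton transform:
    NFD-normalize, map each code point through the confusables
    table, then NFD-normalize again."""
    s = unicodedata.normalize('NFD', label)
    s = ''.join(CONFUSABLES.get(c, c) for c in s)
    return unicodedata.normalize('NFD', s)

def confusable(a: str, b: str) -> bool:
    """Two labels are confusable when their skeletons match."""
    return skeleton(a) == skeleton(b)

# 'paypal' spelled with a Cyrillic 'а' collides with the ASCII label:
print(confusable('p\u0430ypal', 'paypal'))  # True
print(confusable('example', 'different'))   # False
```

A registry would run such a check against the zone's existing labels at registration time, rather than expecting resolvers to do it.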
> What that kind of system handles poorly is the case where two labels
> look identical (or are semantically identical with different
> appearance -- where they look "identical" to the mind, not the eyes,
> of the user).
Well, we call that confusables. And we have UTR#39 for it, no?
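The U+08A1 case from the subject line is the canonical example of "identical to the mind, not unified by the encoding": ARABIC LETTER BEH WITH HAMZA ABOVE was assigned with no canonical decomposition, so no Unicode normalization form will ever equate it with the sequence BEH + combining HAMZA ABOVE. A quick check:

```python
import unicodedata

precomposed = '\u08A1'        # ARABIC LETTER BEH WITH HAMZA ABOVE
sequence = '\u0628\u0654'     # ARABIC LETTER BEH + ARABIC HAMZA ABOVE

# U+08A1 has no canonical (or compatibility) decomposition, so every
# normalization form leaves the two spellings distinct:
for form in ('NFC', 'NFD', 'NFKC', 'NFKD'):
    same = (unicodedata.normalize(form, precomposed) ==
            unicodedata.normalize(form, sequence))
    print(form, same)  # False in every form
```

That is why normalization alone cannot carry the weight, and confusables data (UTR#39) or registry-side policy has to.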
> In these cases, it's not necessarily possible, a-priori, to come to
> a solid preference of one over the other label (by ruling out
> certain code points). In fact, both may be equally usable - if one
> could guarantee that the name space did not contain a doppelganger.
> That calls for a different mechanism, what I have called "exclusion
Isn't UTR#39 enough? Hmmm, probably not. Do we need a BCP for DNS
label registration, then?
> Having a robust, machine readable specification of which labels are
> equivalent variants of which other labels, so that from such a
> variant set, only one of them gets to be an actual identifier.
> (Presumably the first to be applied for).
> This less draconian system is not something that is easy to retrofit
> on the protocol level.
It can't be retrofitted into DNS itself, only into the domainname
registration protocol (which involves layers 8 and 9, and which
anyway is not an Internet protocol).
> So, seen from the perspective of the entire eco-system around the
> registration of labels, the perceived shortcomings of Unicode are
> not as egregious and as devastating as they would appear if one
> looks only at the protocol level.
> There is a whole spectrum of issues, and a whole set of layers in
> the eco system to potentially deal with them. Just as string
> similarity is not handled in the protocol, these types of homographs
> should not have to be.
> Let's recommend handling them in more appropriate ways.
Again, isn't that what UTR#39 was for?