IAB Statement on Identifiers and Unicode 7.0.0

Thu Jan 29 05:35:06 CET 2015

On 1/28/2015 7:02 PM, John C Klensin wrote:
> This is getting tedious.  Vint has explained, Andrew has
> explained, Pete has explained, Patrik has explained, and I have
> explained, each in different ways (and my apologies to anyone
> I've left out), that the examples in the IAB statement are not
> the problem.  They are symptoms of what may be a fundamental
> misunderstanding in the IDNA design (specifically that we may
> not have used the right set of properties) and perhaps an even
> more fundamental one (specifically that the necessary set of
> properties may not exist or be complete).

What we've heard, mostly, was the assertion that there are cases
for which there needs to be "normalization"-plus.

(In this contribution, I will attempt to triage *all* cases, known
and not yet known, based on differences in their structural typology
and usage scenarios. Therefore, please compare the discussion below
not just to a few, but to any cases that you are aware of, and let me
know if you have examples that you think do not fit.)

Let me start with the basics:

Normalization asserts that two sequences are equivalent; canonical
normalization asserts that they are fully equivalent based on their
underlying identity.

For the cases where Unicode does not provide normalization,
the argument by the UTC is that the equivalence, if it exists at all,
is in appearance only and not manifest in the underlying identity.

In most of these cases, attempting to assert such an equivalence
after the fact with some additional normalization step, based on
whatever additional properties, is *not* the correct strategy.

In the vast majority of cases this is the wrong strategy for the
simple reason that nobody (other than malicious users) would
ever use what looks like the decomposed form. Only the
composite form is actually used - the Danish o-slash is a typical
example. Given that, the simple solution on the protocol level
for such cases is adding a context rule that _prevents_ "fake"
compositions - instead of making up false equivalences.
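
To make the o-slash case concrete, here is a minimal Python sketch
using only the standard unicodedata module. It contrasts a sequence
that NFC does treat as canonically equivalent (o + combining
diaeresis) with a look-alike sequence that it does not. The choice of
U+0338 as the mark used in the "fake" composition is an assumption
made for illustration; other overlay marks could play the same role.

```python
import unicodedata

def nfc_equivalent(a: str, b: str) -> bool:
    """True if both strings have the same NFC form (canonical equivalence)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

O_DIAERESIS = "\u00f6"       # precomposed; has a canonical decomposition
O_DIAERESIS_SEQ = "o\u0308"  # o + COMBINING DIAERESIS
O_SLASH = "\u00f8"           # Danish o-slash; encoded as an atomic letter
O_SLASH_FAKE = "o\u0338"     # o + COMBINING LONG SOLIDUS OVERLAY (look-alike)

print(nfc_equivalent(O_DIAERESIS, O_DIAERESIS_SEQ))  # True
print(nfc_equivalent(O_SLASH, O_SLASH_FAKE))         # False
print(repr(unicodedata.decomposition(O_SLASH)))      # ''
```

The last line shows that o-slash carries no decomposition mapping at
all, which is why no normalization form will ever relate the two.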

It would be even easier to disallow certain combining marks
altogether, but, if that's seen as too drastic, then by all means
disallow them where they appear to form a composite that
a) is visually equivalent to an encoded character
b) is never expressed as a sequence in ordinary use

Again, disallowing the mark altogether would be easiest; but
because requirement (a) is, by definition, met by only a
limited and enumerable set of code points for composites, it's
possible to use context rules.

With clever design of some properties for the purpose,
creating a general context rule appears possible.
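
A rough sketch of what such a context rule could look like, in
Python. The table FAKE_COMPOSITES and the function name are
hypothetical; the entries are illustrative only and are not taken
from any existing IDNA rule or Unicode property. The check runs on
the NFD form so that a mark is matched against its base letter even
when other marks are attached to the same base.

```python
import unicodedata

# Hypothetical, illustrative table: (base, combining mark) pairs that look
# like an encoded character without being canonically equivalent to it.
FAKE_COMPOSITES = {
    ("o", "\u0338"): "\u00f8",  # o + long solidus overlay  ~ o-slash
    ("o", "\u0337"): "\u00f8",  # o + short solidus overlay ~ o-slash
    ("l", "\u0335"): "\u0142",  # l + short stroke overlay  ~ l-stroke
}

def has_fake_composite(label: str) -> bool:
    """Return True if any base letter carries a mark listed in the table."""
    base = None
    for ch in unicodedata.normalize("NFD", label):
        if unicodedata.combining(ch) == 0:
            base = ch                      # start of a new combining sequence
        elif (base, ch) in FAKE_COMPOSITES:
            return True                    # look-alike sequence: reject
    return False

print(has_fake_composite("k\u00f8benhavn"))   # False: real o-slash
print(has_fake_composite("ko\u0338benhavn"))  # True: "fake" composition
```

A label that fails this check would simply be rejected; nothing is
mapped or normalized, so no equivalence is being asserted.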

With this approach, I estimate that you catch 90%+ of
all cases "missed" by NFC, including 90%+ of those that
have been identified as concerns in this discussion.

There are a small number of remaining cases, some of them
digraphs (things that look exactly like two letters) and perhaps
a few other ones.

For the digraphs, asserting an equivalence via some algorithm
that works like extended normalization _may_ make sense, if
the conclusion is that they must remain allowed (some are
really special like the case of Latin digraphs for writing poetry
in some African language).

Because of the limited way these are used, the sequence of
ordinary letters must be a not uncommon fall-back in many
non-DNS situations anyway, so taking that fallback as the
preferred form would make some sense. (Still, it's probably
better to disallow all or most of the digraphs altogether - they
really don't need to be supported.)
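
If digraphs were to remain allowed, taking the fallback as the
preferred form could be sketched as a simple mapping step. The table
DIGRAPH_FALLBACK below is hypothetical and lists just two digraph
code points that have no decomposition in Unicode; which digraphs
would actually be covered is exactly the open question.

```python
# Hypothetical, illustrative mapping from digraph code points to the
# sequence of ordinary letters used as the preferred form.
DIGRAPH_FALLBACK = {
    "\u02a3": "dz",  # LATIN SMALL LETTER DZ DIGRAPH
    "\u02a6": "ts",  # LATIN SMALL LETTER TS DIGRAPH
}

def fold_digraphs(label: str) -> str:
    """Replace each listed digraph with its two-letter fallback."""
    return "".join(DIGRAPH_FALLBACK.get(ch, ch) for ch in label)

def same_after_folding(a: str, b: str) -> bool:
    """Treat two labels as equivalent if their folded forms match."""
    return fold_digraphs(a) == fold_digraphs(b)

print(fold_digraphs("pi\u02a6a"))                 # pitsa
print(same_after_folding("pi\u02a6a", "pitsa"))   # True
```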

Finally, we come to the 1% of 1% of cases where there may be
actual use of both composites and look-alike sequences.
This is the Arabic case that started all this. In such cases, if both
forms are really used (by separate constituents) it's not
possible to come to a sensible "preferred" form that doesn't
play favorites in an arbitrary way. So, it's not possible to know
what to normalize to!

From a pragmatic point of view, and given that "look-alike" fades
gradually into "look-nearly-alike" and then into
"look-confusingly-similar" and so on, whenever such remaining edge
cases are exhibited by *rarely* or *very rarely* used code points,
the benefit of addressing them in the protocol, vs. relying on upper
layers (like string similarity), becomes _vanishingly small_.

This is Mark's and Shawn's point (and shared by many).

If we can get some recognition and acknowledgement that
solving arbitrarily minuscule problems on the protocol level
(when bigger problems can only be addressed outside), is not
productive, then we have a basis on which we can come
together - such that we can look at the wider, and perhaps
more relevant subset of cases and discuss solutions for them.

Now, to come back out of that rat-hole, I want to reiterate
that I see a number of classes of cases for which it is possible
to construct a robust solution on the protocol level that
does not necessarily have to be arbitrary -- or force users
into creating strings that contain code point sequences that
are explicitly discouraged for their language.

The vast majority of these cases, to repeat from above, are
those where only one form is in practical or recommended
use. In these cases, finding a way to disallow the competing
representation (usually a sequence) would be the answer
that impacts non-malicious users the least.

It would also be implementable using a combination of
properties and context rules not too dissimilar from existing
rules, and not require a new or modified normalization
algorithm.

In careful review, it might even be possible to establish that
the set of code points that could be successfully normalized
to a preferred form is empty, or contains only code points of
such rarity that, speaking pragmatically, adding a whole
algorithm for their sake is not appropriate.

(For the root, the draft designs for Arabic side-step the issue
by disallowing the combining hamza, along with a number
of other combining marks that are felt unnecessary for the
purpose of creating identifiers).

This radical step may not be possible on the protocol level,
particularly as some combining marks may be needed for
novel combinations, which may be difficult to enumerate
in advance for the entire DNS.

(For the root, limiting combining marks to very specific
contexts, which would then be explicitly enumerated,
is one of the strategies we are looking at).
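
The "explicitly enumerated contexts" strategy is the inverse of a
deny-list: a combining mark is accepted only after a base character
it has been listed with. A minimal Python sketch follows; the table
ALLOWED_CONTEXTS and its entries are hypothetical, chosen only
because these combinations have no precomposed form, and do not
reflect the actual draft designs for the root.

```python
import unicodedata

# Hypothetical allow-list of (base, combining mark) contexts.
ALLOWED_CONTEXTS = {
    ("\u0254", "\u0301"),  # open o + combining acute
    ("\u025b", "\u0301"),  # open e + combining acute
}

def marks_in_allowed_context(label: str) -> bool:
    """True if every mark left after NFC sits in an enumerated context."""
    prev = None
    for ch in unicodedata.normalize("NFC", label):
        # General categories Mn, Mc and Me all start with "M".
        if unicodedata.category(ch).startswith("M"):
            if (prev, ch) not in ALLOWED_CONTEXTS:
                return False
        prev = ch
    return True

print(marks_in_allowed_context("\u0254\u0301"))  # True: enumerated context
print(marks_in_allowed_context("e\u0301"))       # True: NFC composes it away
print(marks_in_allowed_context("o\u0338"))       # False: not enumerated
```

Marks that NFC folds into a precomposed letter never reach the check,
so only genuinely free-standing combining marks need to be listed.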

A./

PS: in the meantime, I continue to consider the IAB
statement in its totality, and in particular its immediate
recommendations regarding Arabic, not merely as not
really helpful, but as outright harmful.
