<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">On 1/28/2015 7:02 PM, John C Klensin
wrote:<br>
</div>
<blockquote cite="mid:BBD619D4D1D9EFC4A0792C4E@JcK-HP8200.jck.com"
type="cite">
<pre wrap="">This is getting tedious. Vint has explained, Andrew has
explained, Pete has explained, Patrik has explained, and I have
explained, each in different ways (and my apologies to anyone
I've left out), that the examples in the IAB statement are not
the problem. They are symptoms of what may be a fundamental
misunderstanding in the IDNA design (specifically that we may
not have used the right set of properties) and perhaps an even
more fundamental one (specifically that the necessary set of
properties may not exist or be complete).
</pre>
</blockquote>
<br>
<font face="Candara">What we've heard, mostly, was the assertion
that there are cases<br>
for which there needs to be "normalization"-plus.<br>
</font><br>
<font face="Candara"><font face="Candara">(In this contribution, I
will attempt to triage *all* known, and not<br>
yet known cases, based on differences in their structural
typology<br>
and usage scenarios. Therefore, please compare the discussion
below<br>
not just to a few, but any cases that you are aware of, and let
me<br>
know if you have examples that you think that do not fit.)<br>
<br>
Let me start with the basics:<br>
<br>
Normalization asserts that two sequences are equivalent;
canonical<br>
normalization asserts that they are fully equivalent based on
their<br>
underlying identity.<br>
<br>
For the cases where Unicode does not provide normalization,<br>
the argument by the UTC is that the equivalence, if it exists at<br>
all, is in appearance only and not manifest in the underlying<br>
identity.<br>
<br>
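(As a concrete illustration - a minimal sketch using Python's<br>
standard unicodedata module; the code points are ones I picked<br>
as examples:)<br>
<pre>
import unicodedata

# U+00E9 (e-acute) canonically decomposes to U+0065 U+0301, so NFC
# asserts the two spellings are the same string.
assert unicodedata.normalize("NFC", "e\u0301") == "\u00e9"

# U+00F8 (the Danish o-slash) has no canonical decomposition; NFC
# leaves U+006F U+0338 (o + combining long solidus overlay) alone,
# even though the two can render identically.
assert unicodedata.normalize("NFC", "o\u0338") != "\u00f8"
</pre>
<br>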
In most of these cases, attempting to assert such an equivalence<br>
after the fact, with some additional normalization step based on<br>
whatever additional properties, is *<u>not</u>* the correct
strategy.<br>
<br>
In the vast majority of cases this is the wrong strategy for the <br>
simple reason that nobody (other than malicious users) would <br>
ever use what looks like the decomposed form. Only the<br>
composite form is actually used - the Danish o-slash is a typical<br>
example. Given that, the simple solution on the protocol level <br>
for such cases is adding a context rule that <u>prevents</u> "fake"
<br>
compositions - instead of making up false equivalences.<br>
<br>
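(To make that concrete: a toy sketch, in Python, of the shape such<br>
a context rule could take - not an actual IDNA rule, and hard-coded<br>
here for the o-slash case only:)<br>
<pre>
# Reject a combining mark where it would form a "fake" composite:
# o/O followed by U+0338 renders like the separately encoded
# U+00F8/U+00D8, so the sequence is disallowed.
def violates_fake_composition_rule(label: str) -> bool:
    fake_pairs = {("o", "\u0338"), ("O", "\u0338")}
    return any(pair in fake_pairs for pair in zip(label, label[1:]))

assert not violates_fake_composition_rule("gr\u00f8d")  # real o-slash
assert violates_fake_composition_rule("gro\u0338d")     # fake one
</pre>
<br>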
It would be even easier to disallow certain combining marks<br>
altogether, but, if that's seen as too drastic, then by all means<br>
disallow them where they appear to form a composite that<br>
a) is visually equivalent to an encoded character, and<br>
b) is never expressed as a sequence in ordinary use.<br>
<br>
Again, disallowing the mark altogether would be easiest; but<br>
because requirement (a) is, by definition, met by only a<br>
limited and enumerable set of code points for composites, it's<br>
possible to use context rules. <br>
<br>
With clever design of some properties for the purpose, <br>
creating a general context rule appears possible.<br>
<br>
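(Sketched in the same toy Python: the "property" would amount to a<br>
derived table of (base, mark) pairs meeting (a) and (b). The entries<br>
below are merely illustrative, not a vetted list:)<br>
<pre>
# Illustrative only: pairs that mimic a separately encoded composite.
LOOKALIKE_COMPOSITES = {
    ("o", "\u0338"): "\u00f8",  # ~ LATIN SMALL LETTER O WITH STROKE
    ("O", "\u0338"): "\u00d8",  # ~ LATIN CAPITAL LETTER O WITH STROKE
    ("l", "\u0335"): "\u0142",  # ~ LATIN SMALL LETTER L WITH STROKE
}

def allowed_by_general_context_rule(label: str) -> bool:
    # A mark is disallowed wherever the (base, mark) pair mimics an
    # encoded composite; all other sequences pass.
    return all(pair not in LOOKALIKE_COMPOSITES
               for pair in zip(label, label[1:]))
</pre>
<br>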
With this approach, I estimate that you catch 90%+ of <br>
all cases "missed" by NFC, including 90%+ of those that <br>
have been identified as concerns in this discussion.<br>
<br>
There are a small number of remaining cases, some of them <br>
digraphs (things that look exactly like two letters) and perhaps<br>
a few others.<br>
<br>
For the digraphs, asserting an equivalence via some algorithm<br>
that works like extended normalization <u>may</u> make sense, if<br>
the conclusion is that they must remain allowed (some are<br>
really special, like the Latin digraphs used for writing poetry<br>
in some African language).<br>
<br>
Because of the limited way these are used, the sequence of <br>
ordinary letters must already be a not-uncommon fallback in many<br>
non-DNS situations, so taking that fallback as the <br>
preferred form would make some sense. (Still, it's probably<br>
better to disallow all or most of the digraphs altogether - they<br>
really don't need to be supported.)<br>
<br>
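(Should some digraphs have to remain allowed after all, the<br>
"extended normalization" would amount to a fallback mapping like<br>
the toy sketch below; as far as I can tell, U+0238 and U+0239 are<br>
digraph code points with no Unicode decomposition at all:)<br>
<pre>
DIGRAPH_FALLBACKS = {
    "\u0238": "db",  # LATIN SMALL LETTER DB DIGRAPH
    "\u0239": "qp",  # LATIN SMALL LETTER QP DIGRAPH
}

def fold_digraphs(label: str) -> str:
    # Map each encoded digraph to the ordinary-letter sequence that
    # serves as its fallback outside the DNS.
    return "".join(DIGRAPH_FALLBACKS.get(ch, ch) for ch in label)
</pre>
<br>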
Finally, we come to the 1% of 1% of cases where there may be<br>
actual use of both composites and look-alike sequences.<br>
This is the Arabic case that started all this. In such cases, if
both<br>
forms are really used (by separate constituencies), it's not <br>
possible to come to a sensible "preferred" form that doesn't<br>
play favorites in an arbitrary way. So, it's not possible to know
<br>
what to normalize to!<br>
<br>
From a pragmatic point of view, and given that "look-alike" fades<br>
gradually into "look-nearly-alike" and then into
"look-confusingly-<br>
similar" and so on, whenever such remaining edge cases are<br>
exhibited by *<u>rarely</u>* or *<u>very rarely</u>* used code
points, the<br>
benefit of addressing them in the protocol, vs. relying on upper<br>
layers (like string similarity checks), becomes <u>vanishingly small</u>.<br>
<br>
This is Mark's and Shawn's point (and shared by many).<br>
<br>
If we can get some recognition and acknowledgement that<br>
solving arbitrarily minuscule problems on the protocol level<br>
(when bigger problems can only be addressed outside) is not<br>
productive, then we have a basis on which we can come<br>
together - and look at the wider, and perhaps more relevant,<br>
subset of cases and discuss solutions for them.<br>
<br>
Now, to come back out of that rat-hole, I want to reiterate<br>
that I see a number of classes of cases for which it is possible<br>
to construct a robust solution on the protocol level that <br>
does not necessarily have to be arbitrary -- or force users<br>
into creating strings that contain code point sequences that<br>
are explicitly discouraged for their language.<br>
<br>
The vast majority of these cases, to repeat from above, are<br>
those where only one form is in practical or recommended<br>
use. In these cases, finding a way to disallow the competing<br>
representation (usually a sequence) would be the answer<br>
that impacts non-malicious users the least.<br>
<br>
It would also be implementable using a combination of <br>
properties and context rules not too dissimilar from existing<br>
rules, and not require a new or modified normalization <br>
algorithm.<br>
<br>
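(Schematically - again only a toy sketch, reusing the rule from<br>
the earlier fragment - the check would slot in after the existing,<br>
unmodified NFC step:)<br>
<pre>
import unicodedata

def label_passes(label: str) -> bool:
    # Existing pipeline: the label must already be in NFC form ...
    if unicodedata.normalize("NFC", label) != label:
        return False
    # ... followed by the new context rule; NFC itself is untouched.
    return allowed_by_general_context_rule(label)
</pre>
<br>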
On careful review, it might even be possible to establish that<br>
the set of code points that could be successfully normalized<br>
to a preferred form is empty, or contains only code points of<br>
such rarity that, speaking pragmatically, adding a whole <br>
algorithm for their sake is not appropriate.<br>
<br>
(For the root, the draft designs for Arabic side-step the issue<br>
by disallowing the combining hamza, along with a number <br>
of other combining marks that are felt to be unnecessary for the<br>
purpose of creating identifiers).<br>
<br>
This radical step may not be possible on the protocol level,<br>
particularly as some combining marks may be needed for<br>
novel combinations, which may be difficult to enumerate<br>
in advance for the entire DNS.<br>
<br>
(For the root, limiting combining marks to very specific<br>
contexts, which would then be explicitly enumerated,<br>
is one of the strategies we are looking at).<br>
<br>
A./<br>
<br>
PS: in the meantime, I continue to consider the IAB <br>
statement in its totality, and in particular its immediate<br>
recommendations regarding Arabic, not merely unhelpful<br>
but outright harmful.<br>
<br>
<br>
</font>
</body>
</html>