IAB Statement on Identifiers and Unicode 7.0.0

Abdulrahman I. ALGhadir aghadir at citc.gov.sa
Thu Jan 29 12:59:34 CET 2015


Dear All,

I haven't been up to date on the IDNA mailing list lately, due to my work with TF-AIDN (the group assigned by ICANN to develop the LGR for the Arabic script). I just read the IAB statement once ICANN staff forwarded it to us. We were aware of the character U+08A1 and the confusability caused by its lack of an NFC decomposition. Their concern is valid and reasonable, but ONLY for this character (U+08A1). The problem is at the end of their statement, where the IAB gives inaccurate information, namely that the characters U+0623, U+0624, U+0626, U+0677, U+06C2 and U+06D3 aren't canonically equivalent to <character> followed by U+0654, ARABIC HAMZA ABOVE. This statement has two problems: 1) the information is inaccurate, and 2) these characters are safe, and they are very important to the languages written with the Arabic script. Removing them would be like dropping vowels from English! Their statement should be restricted to ONLY the character U+08A1, ARABIC LETTER BEH WITH HAMZA ABOVE. If the statement is adopted as written, it is going to murder these languages, and it will be very hard for ordinary users to form a lot of words!
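
Anyone can verify this with a few lines of Python - a minimal sketch
using the standard unicodedata module (it assumes a Python build whose
Unicode data includes U+08A1, i.e. Unicode 7.0 or later):

    import unicodedata

    # U+0623 ARABIC LETTER ALEF WITH HAMZA ABOVE *is* canonically
    # equivalent to U+0627 ALEF followed by U+0654 ARABIC HAMZA ABOVE:
    # NFC recomposes the sequence into the precomposed letter.
    assert unicodedata.normalize("NFC", "\u0627\u0654") == "\u0623"

    # U+08A1 ARABIC LETTER BEH WITH HAMZA ABOVE has no canonical
    # decomposition, so U+0628 BEH followed by U+0654 remains a
    # distinct string even though the two spellings render alike.
    assert unicodedata.normalize("NFC", "\u0628\u0654") == "\u0628\u0654"
    assert unicodedata.normalize("NFC", "\u0628\u0654") != "\u08a1"

The same check recomposes U+0624 from U+0648 and U+0626 from U+064A, so
the canonical equivalence does in fact hold for those characters.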

I hope my concern is clear, and that the IAB statement will be reconsidered in light of the points above.

AbdulRahman,

From: Idna-update [mailto:idna-update-bounces at alvestrand.no] On Behalf Of Asmus Freytag
Sent: Thursday, January 29, 2015 7:35 AM
To: John C Klensin; Shawn Steele; Vint Cerf
Cc: IDNA update work
Subject: Re: IAB Statement on Identifiers and Unicode 7.0.0

On 1/28/2015 7:02 PM, John C Klensin wrote:

This is getting tedious.  Vint has explained, Andrew has explained,
Pete has explained, Patrik has explained, and I have explained, each
in different ways (and my apologies to anyone I've left out), that
the examples in the IAB statement are not the problem.  They are
symptoms of what may be a fundamental misunderstanding in the IDNA
design (specifically that we may not have used the right set of
properties) and perhaps an even more fundamental one (specifically
that the necessary set of properties may not exist or be complete).

What we've heard, mostly, is the assertion that there are cases
for which "normalization"-plus is needed.

(In this contribution, I will attempt to triage *all* known and
not-yet-known cases, based on differences in their structural typology
and usage scenarios. Therefore, please compare the discussion below
not just to a few cases, but to any cases that you are aware of, and
let me know if you have examples that you think do not fit.)

Let me start with the basics:

Normalization asserts that two sequences are equivalent; canonical
normalization asserts that they are fully equivalent based on their
underlying identity.

For the cases where Unicode does not provide normalization,
the argument by the UTC is that the equivalence, if it exists
at all, is in appearance only and not manifest in the
underlying identity.
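
As a minimal illustration in Python (the unicodedata module from the
standard library; my use of U+0337 COMBINING SHORT SOLIDUS OVERLAY as
the look-alike mark for the o-slash is just for this sketch):

    import unicodedata

    # Canonical equivalence: precomposed U+00E9 and the sequence
    # "e" + U+0301 COMBINING ACUTE ACCENT denote the same abstract
    # character, so NFC maps both spellings to a single form.
    assert unicodedata.normalize("NFC", "e\u0301") == "\u00e9"

    # No equivalence asserted: U+00F8, the Danish o-slash, has no
    # canonical decomposition, so "o" + U+0337 remains a distinct
    # string even where it renders nearly identically.
    assert unicodedata.normalize("NFC", "o\u0337") != "\u00f8"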

In most of these cases, attempting to assert such an equivalence
after the fact with some additional normalization step, based on
whatever additional properties, is *not* the correct strategy.

In the vast majority of cases this is the wrong strategy for the
simple reason that nobody (other than malicious users) would
ever use what looks like the decomposed form. Only the
composite form is actually used - the Danish o-slash is a typical
example. Given that, the simple solution on the protocol level
for such cases is adding a context rule that prevents "fake"
compositions - instead of making up false equivalences.
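
A sketch of what such a context rule could look like (the pair table
below is hypothetical and hand-picked for illustration; a real rule
would be derived from character properties rather than hand-listed):

    import unicodedata

    # Hypothetical table of "fake" compositions: base + combining-mark
    # sequences that merely look like an encoded composite which is
    # never spelled as a sequence in ordinary use.  The Danish o-slash
    # is the example from above.
    FAKE_COMPOSITIONS = {
        ("o", "\u0337"),  # looks like U+00F8 o-slash
        ("O", "\u0337"),  # looks like U+00D8 O-slash
    }

    def violates_context_rule(label):
        # Check the label in NFC form: any sequence that survives here
        # is one that canonical normalization did not compose away.
        s = unicodedata.normalize("NFC", label)
        return any(pair in FAKE_COMPOSITIONS for pair in zip(s, s[1:]))

    assert violates_context_rule("o\u0337l")     # fake o-slash: rejected
    assert not violates_context_rule("\u00f8l")  # genuine U+00F8: allowed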

It would be even easier to disallow certain combining marks
altogether, but, if that's seen as too drastic, then by all means
disallow them where they appear to form a composite that
a) is visually equivalent to an encoded character, and
b) is never expressed as a sequence in ordinary use.

Again, disallowing the mark altogether would be easiest; but
because requirement (a) is, by definition, met by only a limited
and enumerable set of code points for composites, it's possible
to use context rules.

With clever design of some properties for the purpose,
creating a general context rule appears possible.

With this approach, I estimate that you catch 90%+ of
all cases "missed" by NFC, including 90%+ of those that
have been identified as concerns in this discussion.

There are a small number of remaining cases, some of them
digraphs (things that look exactly like two letters) and perhaps
a few other ones.

For the digraphs, asserting an equivalence via some algorithm
that works like extended normalization may make sense, if
the conclusion is that they must remain allowed (some are
really special, like the case of Latin digraphs for writing poetry
in some African language).
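
For contrast: where Unicode records a digraph equivalence at all, it
is a compatibility - not canonical - one, which NFKC already folds to
the two-letter fallback; the digraphs at issue here are precisely
those with no recorded equivalence. A quick Python sketch of the
recorded case:

    import unicodedata

    # U+01F3 LATIN SMALL LETTER DZ carries a *compatibility*
    # decomposition to "d" + "z": NFC keeps the digraph distinct,
    # while NFKC folds it to the ordinary two-letter sequence.
    assert unicodedata.normalize("NFC", "\u01f3") == "\u01f3"
    assert unicodedata.normalize("NFKC", "\u01f3") == "dz"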

Because of the limited way these are used, the sequence of
ordinary letters must already be a not-uncommon fallback in many
non-DNS situations anyway, so taking that fallback as the
preferred form would make some sense. (Still, it's probably
better to disallow all or most of the digraphs altogether - they
really don't need to be supported.)

Finally, we come to the 1% of 1% of cases where there may be
actual use of both composites and look-alike sequences.
This is the Arabic case that started all this. In such cases, if both
forms are really used (by separate constituents), it's not
possible to come to a sensible "preferred" form that doesn't
play favorites in an arbitrary way. So, it's not possible to know
what to normalize to!

From a pragmatic point of view, and given that "look-alike" fades
gradually into "look-nearly-alike" and then into
"look-confusingly-similar" and so on, whenever such remaining edge
cases are exhibited by *rarely* or *very rarely* used code points,
the benefit of addressing them in the protocol, vs. relying on
upper layers (like string similarity), becomes vanishingly small.

This is Mark's and Shawn's point (and shared by many).
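
As a sketch of what such an upper-layer check could look like - a toy
version of a UTS #39-style confusable "skeleton", with a single
hand-picked mapping covering just the hamza case from this thread (a
real implementation would use the full confusables data):

    import unicodedata

    # Map visually equivalent spellings to one key before comparing.
    CONFUSABLE_MAP = {
        "\u08a1": "\u0628\u0654",  # BEH WITH HAMZA ABOVE -> BEH + HAMZA
    }

    def skeleton(label):
        nfd = unicodedata.normalize("NFD", label)
        return "".join(CONFUSABLE_MAP.get(ch, ch) for ch in nfd)

    # A registry-side similarity check catches the collision even
    # though the protocol treats the two labels as distinct.
    assert skeleton("\u08a1") == skeleton("\u0628\u0654")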

If we can get some recognition and acknowledgement that
solving arbitrarily minuscule problems on the protocol level
(when bigger problems can only be addressed outside) is not
productive, then we have a basis on which we can come
together - such that we can look at the wider, and perhaps
more relevant, subset of cases and discuss solutions for them.

Now, to come back out of that rat-hole, I want to reiterate
that I see a number of classes of cases for which it is possible
to construct a robust solution on the protocol level that
does not necessarily have to be arbitrary -- or force users
into creating strings that contain code point sequences that
are explicitly discouraged for their language.

The vast majority of these cases, to repeat from above, are
those where only one form is in practical or recommended
use. In these cases, finding a way to disallow the competing
representation (usually a sequence) would be the answer
that impacts non-malicious users the least.

It would also be implementable using a combination of
properties and context rules not too dissimilar from existing
rules, and not require a new or modified normalization
algorithm.

In careful review, it might even be possible to establish that
the set of code points that could be successfully normalized
to a preferred form is empty, or contains only code points of
such rarity that, speaking pragmatically, adding a whole
algorithm for their sake is not appropriate.

(For the root, the draft designs for Arabic side-step the issue
by disallowing the combining hamza, along with a number
of other combining marks that are felt unnecessary for the
purpose of creating identifiers.)

This radical step may not be possible on the protocol level,
particularly as some combining marks may be needed for
novel combinations, which may be difficult to enumerate
in advance for the entire DNS.

(For the root, limiting combining marks to very specific
contexts, which would then be explicitly enumerated,
is one of the strategies we are looking at.)

A./

PS: In the meantime, I continue to consider the IAB
statement in its totality, and particularly in its immediate
recommendations regarding Arabic, not merely unhelpful
but outright harmful.
