Unicode & IETF

John C Klensin klensin at jck.com
Tue Aug 12 16:42:55 CEST 2014



--On Tuesday, August 12, 2014 07:51 -0400 Vint Cerf
<vint at google.com> wrote:

>...
> A second
> assumption was that it was possible to use only the Unicode
> properties of the Unicode characters to determine whether a
> [new] character was or was not allowed for use in IDNs. The
> reason this was considered valuable was precisely because it
> decoupled the class of PVALID characters from any particular
> version of Unicode. IDNA2003 did not have that property.
> Instead, it used what John K and others called "normative
> tables."

Vint,

That much is clearly correct, at least from my perspective on
IDNA and independent of any particular Unicode issues.   But let
me review the rest of your note in the light of Ken's recent
note and the response to it I sent a few hours ago.

> The basic need in DNS is for a resolver to be able to find, in
> an efficient way a domain name in a hierarchical and
> distributed structure. To do this, DNS has to be able to
> compare ASCII strings as equal in a reliable way. To do that,
> it is important to get the Unicode elements of an IDN label
> into a canonical order so that comparison of either the
> Unicoded elements (e.g. in UTF8) or the punycoded (ASCII)
> elements can detect equality by simple string comparison.

Yes.  That was true even in the table-based IDNA2003, but,
because it involved mappings in the native character forms, the
string comparison had to be made on the Punycode-encoded form --
the native character forms could not be reliably compared.  That
caused other problems as you have pointed out.
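
For illustration, a minimal Python sketch of that
canonicalize-then-compare model (simplified: real IDNA also
applies mapping and validity rules, and the "xn--" prefix is
used only for non-ASCII labels):

    import unicodedata

    def ascii_label(label):
        # Canonicalize to NFC, then Punycode-encode, so that
        # equal labels compare equal as plain ASCII strings.
        canonical = unicodedata.normalize('NFC', label)
        return 'xn--' + canonical.encode('punycode').decode('ascii')

    # Two codings of the same label collapse to one A-label:
    # precomposed u-umlaut vs. u + combining diaeresis.
    assert ascii_label('b\u00fccher') == ascii_label('bu\u0308cher')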

 
> When strings that users would regard as "the same" have
> ambiguous representations in either the Unicoded or the
> punycoded sequences, the ambiguity can result in failure to
> find the appropriate domain name in the DNS. Or, worse, one
> may find the "wrong" one in the case that the ambiguous
> versions have been independently registered and map to
> different IP addresses. This is not about "confusables" in the
> sense that some characters look like others. It is about the
> fact that the same glyph has multiple encodings that do not
> collapse to an unambiguous canonical form.
> 
> The argument against allowing the new character is found in
> the paragraph above and is not about glyph confusion. It is
> about coding ambiguity.

Indeed.  As I indicated in my note to Ken, while I resisted
various statements about confusable characters, especially
inter-script, I may have talked enough about glyphs and shapes
to add to the confusion.   One of my problems with all of this
has been that I identified a problem but haven't been very good
at explaining it, especially in a way that is clear to everyone
on this list, given the amount of apparently-unshared
terminology and conceptual understanding.
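
To make the coding-ambiguity point concrete, a small Python
check (it needs a Python whose unicodedata tables come from
Unicode 7.0 or later, e.g. 3.5+):

    import unicodedata

    atomic   = '\u08a1'        # ARABIC LETTER BEH WITH HAMZA ABOVE
    combined = '\u0628\u0654'  # BEH + combining HAMZA ABOVE

    # U+08A1 was assigned no canonical decomposition, so
    # normalization does not collapse the two spellings:
    print(unicodedata.normalize('NFC', combined) == atomic)  # False
    print(unicodedata.normalize('NFD', atomic) == atomic)    # True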

It is worth noting that the IAB report in RFC 4690, the
predecessor to the work on IDNA2008, discussed some of the
issues we are facing now (I believe before the Unicode stability
rules for normalization reached their current state), and
suggested a thought experiment of banning composing sequences
entirely from IDNs.  We knew at the time that was impractical
for the general case, but it is interesting that such a rule
would have prevented the current problem (if, indeed, there is a
problem in more than principle -- see below) and might allow a
solution to it.

> And that is why the new pre-composed character should not be
> allowed in IDNs: because it was heretofore generated using a
> combined sequence and the canonicalizing rules fail to produce
> that sequence in lieu of the new pre-composed character.

Certainly that is one solution and, at first glance, the most
obvious and most compatible with how IDNA2008 is structured.  As
others have pointed out, there are enough existing (coded well
before Unicode 7.0) characters with very similar properties that
one has to address such questions as whether handling one case
and leaving the others makes things better or worse, whether the
incompatibilities (if any in practice, again see below) that
would be introduced by making changes to the previously-coded
characters would be acceptable, whether we just need to accept
that the criteria you have outlined above simply cannot work for
Arabic (and some other) script(s) and what that would imply, and
so on.  Where the "that case isn't the important one" discussion
comes back in is that two of the IDNA2008 assumptions were that
history (i.e., around Unicode 5.0 and earlier) was history: (i)
that we had adequately dealt with the problem cases with
exceptions and special rules, and (ii) that, as a result of
applying the Unicode criteria and stability rules, no new
problem cases would arise.  You will recall that we were told
that we didn't even need the existing provisions for
backward-compatibility rules because the problems would never
arise.   The present case and, more important, the discussions
around it, strongly suggest that both of those assumptions were
incorrect.  We now need to deal with that (or disprove it).
DISALLOWing this one character may be part of the solution but,
even if it were, it certainly would not address all of it -- the
more fundamental questions need to be considered.

> As john mentions in passing, getting something into printable
> form (regardless of the display medium) and comparing two
> instances of glyph sequences impose very different
> requirements on rules for processing the strings. I would have
> thought that the DNS case is very similar to the general
> "string search" problem. Finding text in a large corpus of
> material that uses Unicode to encode the characters must also
> place some constraints on canonicalization since, without it,
> there would be a potentially combinatorial explosion of
> different (under simple string comparison) ways to represent
> the same sequence of glyphs, making it hard to find matching
> texts.

Indeed.
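
As a rough Python sketch of that search case (canonicalize both
sides first, so coding differences alone cannot defeat a match):

    import unicodedata

    def nfc_find(haystack, needle):
        # Returns an index into the NFC-normalized haystack.
        return unicodedata.normalize('NFC', haystack).find(
            unicodedata.normalize('NFC', needle))

    text = 'caf\u00e9 menu'              # precomposed e-acute
    print(nfc_find(text, 'cafe\u0301'))  # e + combining acute: 0, not -1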

I have been thinking a bit more about something that Roozbeh's
initial note included.   I am _not_ attributing this
extrapolation to him, but it did stimulate my thinking.  Suppose
one could accurately say something like "without U+08A1, it is
not possible to write important phonemes of Fula".  Suppose,
after examining the table Ken pointed to,
http://www.unicode.org/Public/UCD/latest/ucd/CompositionExclusions.txt
(which, by the way, doesn't appear to contain any of these Hamza
cases), some comments on this list (not limited to Roozbeh's),
and other materials, one concludes that none of the relevant
letters (and phonemes) can be sensibly expressed except in
single-code point forms.  One extension from that, equivalent to
the "just wait until this is available" option in my note from
Ken, would be that no one has ever written BEH with HAMZA ABOVE
as a combining sequence because it just makes no sense to
writers of Fula _and_ that the combination is not used in any of
the other languages that use Arabic script (e.g., because
glottal stops just don't appear around BEH).  Suppose the same
situation existed (for other languages) with HAH with HAMZA
ABOVE (U+0681) and REH WITH HAMZA ABOVE (U+076C) and any other
forms that might exist in that category, i.e., that, while it is
theoretically possible to write those letters as a combining
sequence, no one would ever do it, no keyboard mapping would
ever generate such a sequence no matter what was typed, etc.

In that case, we might have a problem in principle but, in
practice, there would only be one coding of the character in use
and it would be canonical for that reason.  We might still need
a rule in IDNA but that rule might sensibly ban the combining
forms because, if they are unused and useless, prohibiting them
now couldn't possibly have an effect on any existing labels.

What I definitely do not know is whether that set of conditions
is actually met, either for BEH or for any of the other
relevant combinations.  If they were, and the interpretation of
Roozbeh's "can't write" assertion were taken a step further, we
might even be able to ban the use of combining sequences with
Arabic, telling anyone who comes along with a unique, atomic
letter that cannot be coded in 7.0 without a combining sequence,
that they just have to persuade Unicode to add an appropriate
code point.  

The combining sequences for characters that already have
decompositions are already de facto prohibited, so would not be
affected.  For example, U+0623, ARABIC LETTER ALEF WITH HAMZA
ABOVE, already has a decomposition (to U+0627 U+0654).  Applying
NFC to the combining sequence gets U+0623 back, so the combining
sequence cannot appear in IDNA-conforming labels.
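
In Python terms (any reasonably recent Unicode database will do):

    import unicodedata

    # U+0623 has the canonical decomposition <U+0627, U+0654>, so
    # NFC recomposes the sequence and the ambiguity disappears:
    print(unicodedata.normalize('NFC', '\u0627\u0654') == '\u0623')  # True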

It appears that, to at least a limited extent, Unicode has
created a similar rule for some cases: there are several
single-code point characters that use "WAVY HAMZA ABOVE" but no
combining character for WAVY HAMZA ABOVE, so no coding ambiguity
is even possible.

Whether that is feasible at this stage is something about which
we need advice from experts in Arabic-based writing systems.
But I just wanted to point out that, in principle, we could
eliminate the string comparison problem with Arabic by banning
(or deprecating, with a lot of warnings) not the use of
particular characters (especially single-code point ones) but
the practice (for that script and others that might be relevant)
of using combining sequences in domain name labels.
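
As an illustration only (not a proposal for exact rules), such a
ban would be trivial to test for.  A minimal Python sketch,
rejecting any label that contains a combining mark (a real rule
would presumably be scoped per script):

    import unicodedata

    def has_combining_mark(label):
        # True if any code point is a mark (general category M*).
        return any(unicodedata.category(ch).startswith('M')
                   for ch in label)

    print(has_combining_mark('\u08a1'))        # False: atomic code point
    print(has_combining_mark('\u0628\u0654'))  # True: combining sequence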

best,
    john




