IAB Statement on Identifiers and Unicode 7.0.0

Wed Jan 28 19:51:43 CET 2015

Hi.

I think Vint, Andrew, Patrik, and I are saying much the same
thing, but let me try to respond to several messages with a
slightly different viewpoint and vocabulary, much of which
closely parallels the (IMO, very helpful) exchange between Asmus
and myself earlier this week -- an exchange that, while long,
I'd encourage those who are interested in this subject to read
if you have not done so.

First, I'd like to encourage everyone to try to think about this
issue as precisely as possible and to avoid both hyperbole and
hysteria, if only because they create distractions and make it
hard to identify the real issues and figure out what to do about
them.  This is not about similar-looking ("confusable")
characters or about characters, however related, from different
scripts.  It is not about revising Unicode's script
classifications for characters used in normal, natural-language,
writing systems either.

The IAB has not called for banning Hamza, or characters
constructed with Hamza. It has only called for a temporary
go-slow policy on a range of code points until we really figure
out how to respond to this.  If I had known when work started on
that statement what I think I know now (thanks to help from John
Cowan and Asmus among others), I think the statement would have
said equally cautionary things about, e.g., some collections of
phonetic description characters that are classified as Latin
script letters but that have some very similar decomposition
properties.  FWIW, the most radical long-term suggestions I
have made, seen, or heard of would have the effect of
disallowing one or the other of a combining sequence and a
[pre]composed form of a character.  That is no different from
what IDNA2008's "U-labels must be in NFC form" rule (or the
tables of IDNA2003) does for thousands of other characters,
except that in these cases NFC doesn't do the elimination job.
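
As a rough illustration of what the NFC rule does in the ordinary
case, the sketch below uses Python's standard unicodedata module.
The helper name is_nfc and the sample label are assumptions made
for this example; this is only the normalization check, not a
full IDNA2008 label validation.

```python
import unicodedata

def is_nfc(label):
    """True if the string is already in Normalization Form C."""
    return unicodedata.normalize("NFC", label) == label

# Hypothetical sample label, written two ways.
precomposed = "caf\u00E9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
sequence = "cafe\u0301"     # "e" followed by U+0301 COMBINING ACUTE ACCENT

print(is_nfc(precomposed))  # True: this form survives the NFC rule
print(is_nfc(sequence))     # False: the combining sequence is eliminated
print(unicodedata.normalize("NFC", sequence) == precomposed)  # True
```

Here the rule leaves exactly one permitted coding of the label,
which is the "elimination job" referred to above.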

I would assume that any decision to ban _both_ the precomposed
character and the combining sequence would need to be a matter
of per-script or per-language recommendations to, and actions
by, zone administrators (in the DNS case) and those who pick
identifiers (in most or all of the other cases).  That isn't
because those "ban both" cases are somehow harder or more risky
but because the decision as to what abstract characters to allow
in a label (or other identifier) is fully consistent with
IDNA2008's general call for care and good judgment and not an
issue about how things are coded.

I understand that passions can run high about this, but the
solution to the IAB's "defer using these things until there is a
real solution" advice (at least that is how I have understood
it) is to get to a real solution as quickly as possible.  That
doesn't require that we all focus on the same issues, but a
narrow focus, with minimal distractions, would certainly help.

The problem _is_ about whether two ways to code the same
abstract character, within the same script, can be reliably
compared equal with the existing technology and, if not, what
can be done to create technology that will work.   That makes
it, inevitably, about whether the meaning of "same abstract
character" can be the same for Unicode purposes (which
apparently includes language, phonetic, and usage
considerations) as for pure identifier ones (think "IETF
identifier", but see the Historical Note at the end).  IDNA2008's
property and new-Unicode-version transition rules assumed the
answer to that abstract character question was "yes".  It is now
clear that, for some groups of characters, it is "no" without
further work on IDNA and that is a problem for the reasons Vint,
Patrik, and Andrew have referred to.
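
A minimal sketch of the comparison problem, again using Python's
unicodedata module.  The choice of code points is an assumption
made for illustration: U+0623 (Alef with Hamza Above) stands for
the well-behaved case, and U+08A1 (Beh with Hamza Above, added in
Unicode 7.0.0) for the kind of case at issue.  The function name
same_identifier is hypothetical.

```python
import unicodedata

def same_identifier(a, b):
    """Compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Alef with Hamza Above: precomposed form and combining sequence.
alef_precomposed = "\u0623"
alef_sequence = "\u0627\u0654"   # ALEF + HAMZA ABOVE

# Beh with Hamza Above: precomposed form and combining sequence.
beh_precomposed = "\u08A1"
beh_sequence = "\u0628\u0654"    # BEH + HAMZA ABOVE

print(same_identifier(alef_precomposed, alef_sequence))  # True
print(same_identifier(beh_precomposed, beh_sequence))    # False
```

U+0623 has a canonical decomposition, so normalization unifies
the two codings.  U+08A1 has none, so both codings pass an NFC
check unchanged and never compare equal.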

Others have already said this but an approach of lumping these
code points (or sequences) that may or may not be the same
abstract character together with confusables and handing the
issue off to zone administrators is impractical and undesirable
because of the "number of registries that act independently" and
"no-registry" identifiers issues (see an earlier note from
Andrew and/or my discussion with Asmus).  If we cannot
reasonably know whether two representations (via input methods,
coding, or elsewhere) of an identifier match, then we don't have
identifiers, we have only names by which things are (or might
be) called.

For me, at least, the next steps, with the understanding that
the first two may, to some extent, depend on the third, are:

(1) Try to figure out how, if at all possible, to disentangle
PRECIS (and, to the degree relevant, JSON, IETF adaptations of
PKCS, and other protocols and systems that need to accommodate
non-ASCII strings) from this situation and its possible
implications.

(2)  Update draft-klensin-idna-5892upd-unicode70 to reflect our
current understanding of the problem.  Until and unless some
other approach comes along, that document lies on the path to a
solution for IDNA.  If someone does want to suggest another
approach, I'd be happy to work with them to incorporate it if
they conclude it would be inefficient to write a separate
document.  However, I don't think denial is going to do it for
us for reasons given above and elsewhere.

(3) Try to get a better understanding of the scope and locations
of the problem.  We know about the Hamza-related cases, but not
whether there are similar non-decomposing cases elsewhere within
the Arabic script.  The discussion in the Unicode Standard
suggests that there are not and won't be; some of Asmus's
comments appear to indicate that composed forms for many other
"characters" that can now only be represented by combining
sequences may be in our future.  We also know, now, about the
phonetic description characters, some of which can be formed
from Basic Latin characters and Latin or Common Script composing
characters.  If there are other types of cases, it would be
really desirable, perhaps essential, to know what they are and
where to find them, rather than just having comments that there
are lots of cases out there (some of which turn out to be
cross-script or "similar", not "identical", cases).

See also (5) below.

I also think there is some work that UTC could do here that the
IETF can't do and that would considerably improve the situation:

(4)  There are statements in the Unicode Standard and about
normalization stability that seem, to some of us who have read
them multiple times and very carefully, to be very specific about
conditions under which new code points are added and their
interactions with normalization.  It appears from the
discussions of the last few weeks that there are additional
considerations about phonetics, language issues within scripts,
different treatments for different scripts, and perhaps other
cases that call for what appear to be exceptions to those
statements.   It seems to me that it would be wise, in the
interest of predictability on which the community can rely --the
very essence of applied stability-- to align the statements in
the standard with the actual practices and guidelines used in
assigning new code points.   If the problem is not in the intent
of the statements made in the standard but in how many of us
have, in good faith, interpreted them, that also suggests that
some textual revisions are in order.    If those issues have to
be worked out in collaboration with ISO/IEC JTC1/SC2, I believe
that an increase in transparency would be beneficial to all
concerned.

(5) I'm sensitive to the distinction Asmus made between legacy
cases and rules and plans going forward.   Although I hate the
idea if the list is long, we can handle the legacy cases by
exception list (and have done some of that already).   It would
be far better if there were some property that would identify
code points that would have been handled differently (wrt
composition/decomposition (see (7) below) and anything else we
should be worried about) if assigned under 2014 rules rather
than whatever legacy principles applied.  Only UTC can create
such a property with any hope of getting it right.  However,
even such a property would not be of much help unless there are
clear rules about new code point assignments that we can
understand and rely on (see above) and a clear dividing line
between "legacy" and "new and consistent rules".  When we did
IDNA2008, we were under the impression that dividing line
existed and was set by the additional stability rules introduced
into Unicode 5.2 (if not a bit earlier).   That inference
appears to be incorrect, so help in formulating a better one
seems to me to be important.

(6) If there are really some scripts (e.g., Latin) for which new
characters are assigned code points based on "composing
sequences preferred" and others (e.g., Arabic and, given the
difficulties with using ideographic description sequences in
identifiers, Han) in which new characters are assigned based on
"precomposed grapheme with single code point preferred", it
would be extremely desirable to have a property that
distinguished the two rather than our having to make lists of
scripts.  The latter is inherently unstable on a Unicode version
by version basis as new scripts are added; presumably the
property could easily be updated as part of decisions about each
new script.

(7) Similarly, the core of the current set of issues (but
possibly not the only misconceptions around which IDNA2008 was
designed) appears to be associated with characters that one
would expect, under a "same script, same form" principle, to
decompose (and, in most cases, to compose) when converted to the
appropriate canonical form but that do not behave as predicted
for some well-thought-out reason.  If UTC concludes that is, indeed,
the key issue, a "you might expect this to decompose but it
doesn't" property would be extremely helpful.   Because that
property would be new, it could be assigned to all existing
cases of that situation without violating any stability rules.
And the IETF could choose to either disallow all such code
points (a decision that would favor stability of possible
existing names) or, in conjunction with the "type of script"
property described in (6), construct a more complex rule that
would favor predictability.  Either, in principle, might have
exception lists for particularly difficult legacy cases.
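
No such property exists today, so the sketch below is only a
crude stand-in for it and for the survey suggested in (3).  It
assumes, purely as a heuristic, that a letter whose character
name contains " WITH " but that has no decomposition mapping is a
candidate for "you might expect this to decompose but it
doesn't".  The function name and the ranges scanned are
hypothetical, and the results depend on the Unicode version
shipped with the Python in use.

```python
import unicodedata

def undecomposed_with_letters(start, end):
    """Heuristic scan of a code point range: letters named '... WITH ...'
    that have no decomposition mapping in the running Unicode database."""
    hits = []
    for cp in range(start, end + 1):
        ch = chr(cp)
        name = unicodedata.name(ch, "")   # "" for unassigned code points
        if " WITH " not in name:
            continue
        if not unicodedata.category(ch).startswith("L"):
            continue
        if unicodedata.decomposition(ch) == "":
            hits.append((cp, name))
    return hits

# Latin-1 Supplement letters through Latin Extended-A.
latin_hits = undecomposed_with_letters(0x00C0, 0x017F)
# Arabic Extended-A, where the Unicode 7.0.0 Hamza case lives.
arabic_hits = undecomposed_with_letters(0x08A0, 0x08FF)

for cp, name in latin_hits + arabic_hits:
    print("U+%04X %s" % (cp, name))
```

A name-based test both over- and under-reports, which is part of
why a property maintained by UTC would be far preferable to a
list built this way.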

(8) It might be covered by the above case or might not and might
or might not be useful for other reasons, but it appears that a
great many code points have been assigned to characters that
have been assigned to the Latin (and perhaps Greek or other)
scripts and given letter properties that, with a different set
of conventions, would have been considered as symbols.  My
current examples are the IPA Extensions block at 0250..02AF and
the Phonetic Extensions and Phonetic Extensions Supplement at
1D00..1DBF but I have no reason to believe those are the only
cases.    It seems to me that those near-symbols are bound to
cause problems sooner or later for some identifier context (even
if IDNA can handle them in other ways) so I would encourage UTC
to consider whether some new property that can be used to
distinguish them from the letters normally used to write words
in human languages would be appropriate.
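
To see how those blocks look to software that relies on existing
properties, one can tally General_Category values over the
ranges.  This is a sketch using Python's unicodedata module; the
function name category_counts is an assumption, and the second
range covers the two Phonetic Extensions blocks (1D00..1DBF).

```python
import unicodedata
from collections import Counter

def category_counts(start, end):
    """Tally General_Category values over an inclusive code point range."""
    counts = Counter()
    for cp in range(start, end + 1):
        counts[unicodedata.category(chr(cp))] += 1
    return counts

ipa = category_counts(0x0250, 0x02AF)        # IPA Extensions
phonetic = category_counts(0x1D00, 0x1DBF)   # Phonetic Extensions + Supplement

print(dict(ipa))
print(dict(phonetic))
```

Every code point in both ranges reports a letter category, so
General_Category alone gives an identifier profile nothing with
which to separate them from letters used to write words.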

best regards to all,
   john

Historical note: FWIW, the issue of whether identifiers were
different if different languages were used as a base is not a
new issue or an IETF-specific one.  There were extensive
discussions in the ISO, ANSI, and ECMA programming language
standardization communities in the mid-1980s (long enough ago
that what is now ISO/IEC JTC1/SC22 was still ISO TC97/SC5) about
whether the concrete, machine-stored, form of identifiers other
than those in what we now call Basic Latin characters needed to
be a tuple of a language or CCS identifier with a coded string,
much along the lines that Andrew suggested in one of his
comments.  That discussion, which was strongly influenced by
increasing awareness of the difficulties with a collection of
specialized CCSs, was one part of what led to the creation of
the project that became ISO/IEC 10646.  It also led to a
liaison letter (if I recall, more than one) between the two SCs
cautioning against multiple coding forms for the same character
and warning against just the situation we find ourselves in
today.   Perhaps those who write, or have written, programs can
understand why the present situation is so disturbing by
thinking about a requirement on programming languages that every
program contain a declaration of the human language used as a
basis for its identifiers, with that information carried, not
only into object code, but into every procedure call and little
hope of interoperation in the general case among programs or
libraries with different language declarations (one could get
around that by passing only pointer-references and not names,
but we have presumably had enough experience with where that
leads from a security vulnerability standpoint).