IAB Statement on Identifiers and Unicode 7.0.0

John C Klensin klensin at jck.com
Thu Jan 29 17:06:55 CET 2015


--On Thursday, January 29, 2015 11:59 +0000 "Abdulrahman I.
ALGhadir" <aghadir at citc.gov.sa> wrote:

> Dear All,
> 
> I haven't been up-to-date on the IDNA mailing group lately,
> due to working with TF-AIDN (the group which was assigned by
> ICANN LGR for the Arabic script). I just read the IAB statement
> once it got forwarded by ICANN staff to us. We were aware of the
> character 08A1 and the confusability caused by not making the
> NFC for it. Their concern is valid and reasonable but ONLY for
> this character (U+08A1). The problem is at the end of their
> statement: the IAB gave inaccurate information, which is that the
> characters U+0623, U+0624, U+0626, U+0677, U+06C2 and U+06D3
> aren't canonically equivalent to <character> followed by
> U+0654, ARABIC HAMZA ABOVE. This statement has two problems:
> 1) inaccurate information and 2) these characters are safe and
> they are very important characters to the languages belonging
> to the Arabic script. It is like dropping vowels from English!
> Their statement should be restricted to ONLY character U+08A1,
> ARABIC LETTER BEH WITH HAMZA ABOVE. If this statement gets
> adopted it is going to murder the language and it'll be very
> hard for normal users to form a lot of words!
> 
> I hope my concern is clear and we should reconsider their
> statement with the concerns I mentioned previously.

Abdulrahman,

Nice to hear from you.

Having advised on that statement and contributed some words but
not being a member of the IAB or formally responsible in any
way, let me comment with the understanding that this is my
personal perspective and in my personal capacity alone.

First, I hope I'm not disclosing any secrets when I tell you
that there was considerable desire to keep the statement as
short as possible and to get it out quickly rather than trying
to achieve perfection and get every detail right.   That
tradeoff decision, which I personally consider reasonable, is
one that almost always results in rough edges.   Perhaps there
should have been more discussion about the differences among
that list of characters.  Perhaps the IAB should have waited
longer and gotten more discussion of issues in other scripts
into the statement.   It is always easy to have different
opinions about those kinds of tradeoffs in retrospect.  It is
harder, or impossible, to get things completely right initially,
at least without spending a very long time at it.

I don't believe there is anything "inaccurate" about the IAB
statement, although it is possible to misread it as saying far
more specific things than it actually did.  See below before
reacting to that assertion.

The preference to keep the document short and complexity
minimized made a full explanation impossible.   I tried to get a
somewhat more comprehensive explanation into
draft-klensin-idna-5892upd-unicode70-03.  Based on what we have
now learned, that explanation is horribly inadequate (although
much more complete than the IAB statement was intended to be),
but it is already 16 pages long.   -04, which I will get back to
as soon as I can take a break from reading and responding to
messages on this list, will be better (and will address the
somewhat-similar Latin script examples), but also longer.  That
is still very much a work in progress and I would welcome your
help with getting the Arabic-specific discussion in it correct.

I'm saddened by the fact that this particular problem -- one
that seems to be a very general one of normalization not
consistently working the way the IETF expected (and was led to
expect [1]) -- first showed up with an Arabic script character.
As you know better than the vast majority of the people on this
list, the Arabic script has more than enough difficulties with
proper and consistent coding in Unicode and with DNS labels and
fully-qualified domain names.  It would be nice to not have even
more complications.  But, at least as long as Unicode is taken
as given, there is little that can be done about that general
issue at this stage.   Some characters were added to Unicode
that were clearly associated with letters with which a sensible
person might want to write mnemonic domain name labels.  As part
of the normal IDNA review of changes in Unicode, I caught one of
them as being able to be created as a combining sequence but not
decomposing, which identified this problem, and it happened to
be in Arabic script.  The latter is probably no one's fault,
certainly not mine or the IAB's.

My assumption when it was first spotted was that it was an
isolated anomaly.  We would decide what to do about it (which
might have been "nothing" or excluding it by exception), do
that, and then move on.  It rapidly became clear that it was
part of a broader issue, an issue tied to the fairly extensive
Unicode discussion of the various uses and contexts for Hamza
(there may be other single characters, in other scripts, that
get similarly extensive treatment, but there aren't many and I
don't personally know of them).  Only later -- much too late to
have a significant effect on the IAB Statement -- did the issues
with non-decomposable Latin script characters become evident.
As I have said before on this list, I'm still not confident I
have an adequate understanding of them, so I don't expect -04 to
be right either, just closer.

With the understanding that I have not, except accidentally,
followed any of the ICANN LGR work since ICANN concluded that I
had nothing useful to contribute to it (except possibly on terms
they knew I would not and could not accept) and that I resolved
a long time ago and for several reasons to stop doing free
consulting for ICANN, let me give you a little informal, and
very much personal, advice about the interactions between this
situation, the statement, and the TF-AIDN.

The observation has been made several times in the last week
that proposed "solutions" based on "hand it off to the
registrars, maybe with some warnings, and make it their problem"
or "rely on a registration process to be sure conflicts in
coding forms or characters do not occur" are unworkable for the
DNS given its hierarchy and how the distributed administrative
arrangements for it operate and even less workable for many
other types of identifiers.  Wrt TF-AIDN's task, that
observation works in reverse as well: there is a single registry
for the root zone, there is a single administrator for that
registry and a process that cannot be overridden (at least until
ICANN implements a "second-guessing" procedure by which
important actors can get their way in spite of the normal
process as they have done for the ccTLD Fast Track).   IIRC, the
LGR process also has "no mixed-script labels" rules in place and
assumes a much smaller repertoire, including exclusion of code
points that are likely to be problematic, than an unrestricted
view of IDNA might allow.

So, it would, IMO, be perfectly reasonable for TF-AIDN to say
something like "We understand the script and the language and
cultures in which it is used and the particular context and
restrictions of the DNS root.  Given that, and the observation
that registration-time comparisons and checks are feasible if
needed, we make the following recommendations about names in the
root in Arabic script...  We recognize the IAB Statement but
believe that, for this one particular situation, our
recommendations provide even more protection against problems".
Without knowing anything about your discussions to date, I note
that the list of characters specified for Arabic language use in
RFC 5564 does not require (or allow) any combining characters at
all.  I assume (indeed, if I remember ASIDNWG discussions
correctly, I know) that is not sufficient for comfortable
writing of reasonable mnemonics of other languages that need the
script, but restrictions of that nature could make the question
of what Unicode normalization does or does not do with
particular characters irrelevant.
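
As a rough illustration of the kind of restriction described
above, the following Python sketch rejects any label that
contains a combining mark.  It is only a sketch under stated
assumptions: the function names are hypothetical, the test is
"Unicode general category starts with M", and it does not
implement the actual RFC 5564 repertoire or any LGR rule set.

```python
import unicodedata


def combining_marks(label):
    """Return the characters in label whose general category is a mark (Mn, Mc, Me)."""
    return [c for c in label if unicodedata.category(c).startswith("M")]


def is_mark_free(label):
    """True if the label contains no combining marks at all (hypothetical policy check)."""
    return not combining_marks(label)


# BEH, ALEF, BEH: base letters only, so the label passes.
print(is_mark_free("\u0628\u0627\u0628"))

# BEH followed by U+0654 ARABIC HAMZA ABOVE: rejected by this policy.
print(is_mark_free("\u0628\u0654"))
```

Under a policy like this, a sequence such as U+0628 U+0654 can
never appear in a label, so whether normalization would or would
not compose it stops mattering for that zone.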

On the other hand, the credibility of such a statement (at least
in a rational world) would be heavily dependent on the
representation of a broad spectrum of expertise about languages
and writing systems that use Arabic script.  In particular, if
you are going to make decisions that could have either positive
or adverse effects on Fula (when it is written in Arabic), it
would seem important to have someone who is both expert on Fula
and who reads and writes it in Arabic script on a daily basis as
part of the task force.

best regards,
    john



[1] The relevant principle, as it was explained to the IDNABis
WG while IDNA2008 was being developed (and which is consistent
with what the Unicode Standard and stabilization policies appear
to say) is one I would summarize as:

	-- New characters will generally not be added if they
	can be constructed as a combining sequence from code
	points that already exist.
	
	-- If special circumstances require that such a
	character be added anyway, normalization stability
	requires that the combining sequence not canonicalize,
	under NFC, to the new character as would normally be
	expected.  However, the new character will decompose
	(under NFD) to the combining sequence as one would
	expect and NFC will produce that decomposed sequence.

Those sections and statements do not point to further exceptions
for particular characters, nor to exceptions for phonetic or
language-use reasons, and the IETF relied on them.
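
The difference between that expectation and what actually
happens with U+08A1 can be checked with a short Python sketch
using the standard unicodedata module.  The helper names are
hypothetical, and the U+08A1 lines are only meaningful when the
Python build's Unicode database is version 7.0 or later.

```python
import unicodedata

HAMZA_ABOVE = "\u0654"  # ARABIC HAMZA ABOVE


def describe(ch):
    """Return (NFD of ch, NFC of that NFD form) as lists of U+XXXX strings."""
    nfd = unicodedata.normalize("NFD", ch)
    nfc = unicodedata.normalize("NFC", nfd)

    def fmt(s):
        return ["U+%04X" % ord(c) for c in s]

    return fmt(nfd), fmt(nfc)


def composes_to(base, mark, precomposed):
    """True if NFC turns the sequence base + mark into the precomposed character."""
    return unicodedata.normalize("NFC", base + mark) == precomposed


# U+0623 ARABIC LETTER ALEF WITH HAMZA ABOVE has a canonical
# decomposition, so it round-trips through NFD and NFC.
print(describe("\u0623"))
print(composes_to("\u0627", HAMZA_ABOVE, "\u0623"))

# U+08A1 ARABIC LETTER BEH WITH HAMZA ABOVE has no decomposition:
# NFD leaves it alone, and BEH + HAMZA ABOVE stays a two-code-point
# sequence under NFC.
print(describe("\u08A1"))
print(composes_to("\u0628", HAMZA_ABOVE, "\u08A1"))
```

The result is two distinct normalized forms, U+08A1 and
U+0628 U+0654, which is the situation the principle summarized
above led the IETF not to expect.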



