IAB Statement on Identifiers and Unicode 7.0.0

Sun Feb 1 13:11:46 CET 2015

Hey John,

Well, my point wasn't clear in my previous email, so I'll try my best to explain it further:

I agree with the document where it explains the risk of having one character that can be written in two ways; it is reasonable to advise excluding such a character until a solution is found. Everything was fine until the end: "the IAB recommends that the following characters and character sequences be excluded from use in any new identifiers until that solution is found". As we know, some of the characters mentioned in the list do have normalization forms (NF), and they shouldn't be included in the recommendation. A possible recommendation should be restricted to only the character U+08A1 and Any Unicode code point, followed by U+0654, ARABIC HAMZA ABOVE. This would reduce the risk of security issues and of losing usability for the languages in which these characters are used.
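The normalization behaviour being argued about here can be checked directly. The following is a minimal sketch, assuming Python's standard unicodedata module built against Unicode 7.0 or later; the helper names are hypothetical and the code point list is simply the one quoted in this thread. It reports, for each character, its NFD form and whether NFC maps that form back to the original character.

```python
import unicodedata

HAMZA_ABOVE = "\u0654"  # ARABIC HAMZA ABOVE

# Code points named in this thread (helper names below are hypothetical).
CODE_POINTS = [0x0623, 0x0624, 0x0626, 0x0677, 0x06C2, 0x06D3, 0x08A1]


def canonical_decomposition(cp):
    """Return the NFD form of one code point as a list of code points."""
    return [ord(c) for c in unicodedata.normalize("NFD", chr(cp))]


def recomposes_under_nfc(cp):
    """True if the character decomposes and NFC maps the sequence back to it."""
    nfd = unicodedata.normalize("NFD", chr(cp))
    return len(nfd) > 1 and unicodedata.normalize("NFC", nfd) == chr(cp)


if __name__ == "__main__":
    for cp in CODE_POINTS:
        parts = " ".join(f"U+{p:04X}" for p in canonical_decomposition(cp))
        print(f"U+{cp:04X}  NFD: {parts}  round-trips: {recomposes_under_nfc(cp)}")

    # BEH followed by HAMZA ABOVE: check whether NFC turns it into U+08A1.
    beh_hamza = "\u0628" + HAMZA_ABOVE
    print("NFC(BEH + HAMZA ABOVE) == U+08A1:",
          unicodedata.normalize("NFC", beh_hamza) == "\u08A1")
```

With the Unicode data I am aware of, U+0623, U+0624, U+0626, U+06C2 and U+06D3 decompose canonically to a base letter followed by U+0654 and recompose under NFC, while U+08A1 has no decomposition at all. U+0677 carries only a compatibility decomposition, so NFC and NFD leave it unchanged; it is worth confirming that one separately against the data files.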

Regarding the Fula language: only the older form of it uses Arabic characters, and nowadays it is written in Latin characters. We couldn't find any experts who could help the task force with Fula as written in Arabic characters. I don't see any unfairness in blocking this character until some expert addresses the daily need for it.

I hope it is clear now.

-----Original Message-----
From: John C Klensin [mailto:klensin at jck.com] 
Sent: Thursday, January 29, 2015 7:07 PM
To: Abdulrahman I. ALGhadir
Cc: IDNA update work
Subject: RE: IAB Statement on Identifiers and Unicode 7.0.0

--On Thursday, January 29, 2015 11:59 +0000 "Abdulrahman I.
ALGhadir" <aghadir at citc.gov.sa> wrote:

> Dear All,
> 
> I haven't been up-to-date on the IDNA mailing group lately, due to
> working with TF-AIDN (the group assigned by ICANN LGR for the
> Arabic script). I just read the IAB statement once it got forwarded by
> ICANN staff to us. We were aware of the character 08A1 and the
> confusability caused by the lack of NFC normalization for it. Their
> concern is valid and reasonable, but ONLY for this character (U+08A1).
> The problem is at the end of their statement: the IAB gave inaccurate
> information, namely that the characters U+0623, U+0624, U+0626,
> U+0677, U+06C2 and U+06D3 aren't canonically equivalent to
> <character> followed by U+0654, ARABIC HAMZA ABOVE. This statement
> has two problems: 1) the information is inaccurate and 2) these
> characters are safe and they are very important characters to the
> languages that belong to the Arabic script. It is like dropping
> vowels from English!
> Their statement should be restricted to ONLY character U+08A1, ARABIC
> LETTER BEH WITH HAMZA ABOVE. If this statement gets adopted, it is
> going to murder the language and it'll be very hard for normal users
> to form a lot of words!
> 
> I hope my concern is clear and we should reconsider their statement 
> with the concerns I mentioned previously.

Abdulrahman,

Nice to hear from you.

Having advised on that statement and contributed some words but not being a member of the IAB or formally responsible in any way, let me comment with the understanding that this is my personal perspective and in my personal capacity alone.

First, I hope I'm not disclosing any secrets when I tell you that there was considerable desire to keep the statement as short as possible and to get it out quickly rather than trying to achieve perfection and get every detail right.  That tradeoff decision, which I personally consider reasonable, is one that almost always results in rough edges.  Perhaps there should have been more discussion about the differences among that list of characters.  Perhaps the IAB should have waited longer and gotten more discussion of issues in other scripts into the statement.  It is always easy to have different opinions about those kinds of tradeoffs in retrospect.  It is harder, or impossible, to get things completely right initially, at least without spending a very long time at it.

I don't believe there is anything "inaccurate" about the IAB statement, although it is possible to misread it as saying far more specific things than it actually did.  See below before reacting to that assertion.

The preference to keep the document short and complexity minimized made a full explanation impossible.  I tried to get a somewhat more comprehensive explanation into draft-klensin-idna-5892upd-unicode70-03.  Based on what we have now learned, that explanation is horribly inadequate (although much more complete than the IAB statement was intended to be), but it is already 16 pages long.  -04, which I will get back to as soon as I can take a break from reading and responding to messages on this list, will be better (and will address the somewhat-similar Latin script examples), but also longer.  That is still very much a work in progress and I would welcome your help with getting the Arabic-specific discussion in it correct.

I'm saddened by the fact that this particular problem -- one that seems to be a very general one of normalization not consistently working the way the IETF expected (and was led to expect [1]) -- first showed up with an Arabic script character.  As you know better than the vast majority of the people on this list, the Arabic script has more than enough difficulties with proper and consistent coding in Unicode and with DNS labels and fully-qualified domain names.  It would be nice to not have even more complications.  But, at least as long as Unicode is taken as given, there is little that can be done about that general issue at this stage.  Some characters were added to Unicode that were clearly associated with letters with which a sensible person might want to write mnemonic domain name labels.  As part of the normal IDNA review of changes in Unicode, I caught one of them as being able to be created as a combining sequence but not decomposing, which identified this problem, and it happened to be in Arabic script.  The latter is probably no one's fault, certainly not mine or the IAB's.

My assumption when it was first spotted was that it was an isolated anomaly.  We would decide what to do about it (which might have been "nothing" or excluding it by exception), do that, and then move on.  It rapidly became clear that it was part of a broader issue, an issue tied to the fairly extensive Unicode discussion of the various uses and contexts for Hamza (there may be other single characters, in other scripts, that get similarly extensive treatment, but there aren't many and I don't personally know of them).  Only later -- much too late to have a significant effect on the IAB Statement -- did the issues with non-decomposable Latin script characters become evident.
As I have said before on this list, I'm still not confident I have an adequate understanding of them, so I don't expect -04 to be right either, just closer.

With the understanding that I have not, except accidentally, followed any of the ICANN LGR work since ICANN concluded that I had nothing useful to contribute to it (except possibly on terms they knew I would not and could not accept) and that I resolved a long time ago and for several reasons to stop doing free consulting for ICANN, let me give you a little informal, and very much personal, advice about the interactions between this situation, the statement, and the TF-AIDN.

The observation has been made several times in the last week that proposed "solutions" based on "hand it off to the registrars, maybe with some warnings, and make it their problem" or "rely on a registration process to be sure conflicts in coding forms or characters do not occur" are unworkable for the DNS given its hierarchy and how the distributed administrative arrangements for it operate, and even less workable for many other types of identifiers.  Wrt TF-AIDN's task, that observation works in reverse as well: there is a single registry for the root zone, there is a single administrator for that registry and a process that cannot be overridden (at least until ICANN implements a "second-guessing" procedure by which important actors can get their way in spite of the normal process as they have done for the ccTLD Fast Track).  IIRC, the LGR process also has "no mixed-script labels" rules in place and assumes a much smaller repertoire, including exclusion of code points that are likely to be problematic, than an unrestricted view of IDNA might allow.

So, it would, IMO, be perfectly reasonable for TF-AIDN to say something like "We understand the script and the language and cultures in which it is used and the particular context and restrictions of the DNS root.  Given that, and the observation that registration-time comparisons and checks are feasible if needed, we make the following recommendations about names in the root in Arabic script...  We recognize the IAB Statement but believe that, for this one particular situation, our recommendations provide even more protection against problems."
Without knowing anything about your discussions to date, I note that the list of characters specified for Arabic language use in RFC 5564 does not require (or allow) any combining characters at all.  I assume (indeed, if I remember ASIDNWG discussions correctly, I know) that is not sufficient for comfortable writing of reasonable mnemonics of other languages that need the script, but restrictions of that nature could make the question of what Unicode normalization does or does not do with particular characters irrelevant.

On the other hand, the credibility of such a statement (at least in a rational world) would be heavily dependent on the representation of a broad spectrum of expertise about languages and writing systems that use Arabic script.  In particular, if you are going to make decisions that could have either positive or adverse effects on Fula (when it is written in Arabic), it would seem important to have someone who is both expert on Fula and who reads and writes it in Arabic script on a daily basis as part of the task force.

best regards,
    john

[1] The relevant principle, as it was explained to the IDNABis WG while IDNA2008 was being developed (and which is consistent with what the Unicode Standard and stabilization policies appear to say) is one I would summarize as:

	-- New characters will generally not be added if they
	can be constructed as a combining sequence from code
	points that already exist.

	-- If special circumstances require that such a
	character be added anyway, normalization stability
	requires that the combining sequence not canonicalize,
	under NFC, to the new character as would normally be
	expected.  However, the new character will decompose
	(under NFD) to the combining sequence as one would
	expect and NFC will produce that decomposed sequence.

Those sections and statements do not point to further exceptions for particular characters, nor to exceptions for phonetic or language-use reasons, and the IETF relied on them.
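
As an illustration of the second bullet, here is a small sketch, again assuming Python's standard unicodedata module. U+0958 DEVANAGARI LETTER QA is my own choice of example, not one raised in this thread: as far as I know it is listed among Unicode's composition exclusions, so it shows the behaviour the principle describes, and U+08A1 is included for contrast. The describe() helper is hypothetical.

```python
import unicodedata


def describe(ch):
    """Return the NFD and NFC forms of one character as tuples of code points."""
    return {
        form.lower(): tuple(ord(c) for c in unicodedata.normalize(form, ch))
        for form in ("NFD", "NFC")
    }


def fmt(cps):
    """Render a tuple of code points in U+XXXX notation."""
    return " ".join(f"U+{cp:04X}" for cp in cps)


if __name__ == "__main__":
    # U+0958: illustrative composition-excluded character (an assumption of
    # this example). U+08A1: the character discussed in this thread.
    for ch in ("\u0958", "\u08A1"):
        forms = describe(ch)
        print(f"U+{ord(ch):04X}  NFD: {fmt(forms['nfd'])}  NFC: {fmt(forms['nfc'])}")
```

For U+0958 both forms should come out as the sequence U+0915 U+093C, matching the principle as summarized. For U+08A1 both forms should come out as U+08A1 itself, which is the behaviour that does not fit that summary.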