Retry: Roozbeh's concerns on the IAB statement (was Re: Fwd: IAB Statement on Identifiers and Unicode 7.0.0)
roozbeh at google.com
Wed Jan 28 22:24:19 CET 2015
Resending, as an early draft was sent.
On Wed, Jan 28, 2015 at 6:52 AM, Andrew Sullivan <ajs at anvilwalrusden.com> wrote:
> It would be very helpful to me if you could point out which statements
> are ill-informed or which conclusions are not supported by the
> premises in that statement.
Sure. I thought I and others had done that. But let me try another way:
There are three major things wrong with the final recommendations. As
you were involved in the process, you can try to trace which is failed by
- Very common Arabic letters which have absolutely no problem with
normalization in Unicode, such as U+0624 ARABIC LETTER WAW WITH HAMZA
ABOVE, are discouraged, with absolutely no difference in how Unicode treats
them and how it treats U+00E4 LATIN SMALL LETTER A WITH DIAERESIS. Both
characters are canonically decomposed to two proper pieces. The reasoning
for discouraging U+0624 appears to be that it has hamza in its
decomposition. There is nothing wrong with hamza per se; the problems are
only where hamza appears on letters with no decomposition, namely 0681,
076C, and 08A1.
- Other Arabic letters that have the *exact* confusability issues that
U+08A1 ARABIC LETTER BEH WITH HAMZA ABOVE has are not discouraged in the
statement. Here's my list of some: 063D, 0673, 0692, 06B5, 06C6, 06C7,
06C8, 06C9, 06CE, ... Note that I'm saying "exact" here. If you loosen the
criteria just a tiny little bit more, many more Arabic characters fall into
the bucket, such as 0679 and 0688.
- Characters from other scripts which have the exact same problem as 08A1
and my list in the bullet point above are not discouraged:
- The character U+023B LATIN CAPITAL LETTER C WITH STROKE is not
discouraged, although it can also be represented as <0043, 0338>. There
are tens of such characters in Unicode, spread across every script.
- As another example, the appearance of the sequence <0069, 0307>
(LATIN SMALL LETTER I, COMBINING DOT ABOVE) is exactly the same as <0069>
(LATIN SMALL LETTER I) according to Unicode and in any good font (the same
would be true if you replace "i" with various Soft_Dotted characters
defined in Unicode). The character sequence <0069, 0307> (followed by no
other combining mark) is actually very common on the internet based on my
- Several other such cases exist in other scripts. For example, in
Malayalam, the character U+0D08 has the exact same problem: it can also be
represented as <0D07, 0D57>. Unicode discourages the use of the sequence
the same way it discourages using <Beh, Combining Hamza Above>. Almost
every Indic script has a handful of such examples.
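
The normalization claims in the bullets above can be checked mechanically. The following is a minimal sketch, assuming Python 3 and its bundled unicodedata module (whose Unicode version depends on the Python release); the helper names canonical_decomposition and nfc_equivalent are made up for this illustration and are not part of any specification.

```python
import unicodedata


def canonical_decomposition(cp):
    """Return the canonical decomposition of code point cp as a list of ints.

    Returns an empty list when the character has no canonical decomposition.
    """
    fields = unicodedata.decomposition(chr(cp)).split()
    # Compatibility decompositions start with a tag such as "<compat>";
    # they are not canonical, so treat them as "no decomposition" here.
    if not fields or fields[0].startswith("<"):
        return []
    return [int(f, 16) for f in fields]


def nfc_equivalent(sequence, single):
    """True if NFC turns the code point sequence into the single code point."""
    text = "".join(chr(c) for c in sequence)
    return unicodedata.normalize("NFC", text) == chr(single)


# Letters that decompose canonically (normalization handles them).
for cp in (0x0624, 0x00E4):
    print(f"U+{cp:04X}", [f"{c:04X}" for c in canonical_decomposition(cp)])

# Letters with no decomposition (the problematic kind).
for cp in (0x0681, 0x076C, 0x08A1, 0x023B, 0x0D08):
    print(f"U+{cp:04X}", canonical_decomposition(cp))

# Sequences that look the same but are not unified by normalization.
print(nfc_equivalent([0x0628, 0x0654], 0x08A1))
print(nfc_equivalent([0x0043, 0x0338], 0x023B))
print(nfc_equivalent([0x0D07, 0x0D57], 0x0D08))
```

The first group prints a two-element decomposition for each character, while the second group prints empty lists; the last three checks show that NFC leaves each look-alike sequence distinct from the corresponding single character.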
> At least some of us who worked on that statement (and please note, I'm
> speaking for myself in this message and _not_ others on the IAB or in
> the program) knew very well that the suggestion we were making was
> pretty devastating. I believe the statement acknowledges this
> explicitly, and notes that it is making this recommendation only
> because there is a great deal of concern for future compatibility. We
> know that something must happen, but we know not what.
Basically, the problem is that the recommendation's list of discouraged
characters is arbitrary: it excessively overreaches in some cases and
excessively underreaches in others.
> Note also that the IAB can produce advice to others, but it doesn't
> have a police force. I would not be the least surprised if, on
> reflection, someone creating an identifier using Arabic script decided
> to ignore the general advice in one particular case. Your network,
> your rules, after all.
Yes, but the IAB is supposed to have done its homework. Others without
enough knowledge of the matter may defer to the IAB's conclusions, as the
document "appears" to be well-researched. So in practice, people may just block Yeh
Hamza because they don't know better, and the poor Arabic users spread
around the world would have a hard time finding who is in charge and who
should change what.
> > Then, other characters in the Arabic script that have identical
> > confusability issues (I'll leave finding them as an exercise to the
> > to minimize the damage to the script) are not listed.
> I think this is extremely unfortunate, and I urge you to reconsider.
> The issue for the protocols is not going to go away on its own. What
> you appear to be saying is that you know a whole bunch more
> Arabic-character cases like this, but you're not going to tell us what
> they are. That sounds like a reason to avoid Arabic-script
> identifiers at all until a fuller evaluation is done, and I doubt very
> much that either of us wants that sort of suggestion floating around.
> And given that some people will be creating identifiers no matter
> what, isn't it better that they be doing so with a full appreciation
> of what risks might be involved?
Ok. I'm listing some of them above. This problem is not limited to Arabic.
The problem exists in almost every major script: Latin, Cyrillic, Han,
Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu,
Kannada, Malayalam, Sinhala, Khmer, ... There's an (arguably good) reason
for each of these, and a lot of the examples are available in the data
accompanying UTS #39 (http://unicode.org/reports/tr39/). The model in UTS
#39 is not perfect (in the way that it cannot handle some of the cases such
as <i, combining dot above> that I mentioned above, which I hope could be
fixed), but it goes a very long way. We can retry doing UTS #39 for
different requirements (it appears that names may have slightly different
requirements) of course, but it's a great start.
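
To make the <i, combining dot above> case concrete, here is a minimal sketch, assuming Python 3 and its bundled unicodedata module. It shows that neither NFC nor NFKC maps <0069, 0307> to <0069>, and then applies a hypothetical pre-processing step, drop_redundant_dot, that removes the redundant dot. That helper and the two-letter SOFT_DOTTED_SAMPLE set are assumptions made for this illustration (the real Soft_Dotted property covers more characters); they are not part of UTS #39 or any other specification.

```python
import unicodedata

# Illustrative subset only; the Unicode Soft_Dotted property is larger.
SOFT_DOTTED_SAMPLE = {"i", "j"}
COMBINING_DOT_ABOVE = "\u0307"


def drop_redundant_dot(text):
    """Drop U+0307 directly after a soft-dotted letter when no other
    combining mark (non-zero combining class) follows it."""
    out = []
    k = 0
    n = len(text)
    while k < n:
        ch = text[k]
        out.append(ch)
        dot_follows = k + 1 < n and text[k + 1] == COMBINING_DOT_ABOVE
        mark_after_dot = k + 2 < n and unicodedata.combining(text[k + 2]) != 0
        if ch in SOFT_DOTTED_SAMPLE and dot_follows and not mark_after_dot:
            k += 2  # skip the redundant dot
        else:
            k += 1
    return "".join(out)


dotted = "i\u0307d"  # renders like "id" in a good font
for form in ("NFC", "NFKC"):
    # Normalization alone leaves the two spellings distinct.
    print(form, unicodedata.normalize(form, dotted) == "id")

print(drop_redundant_dot(dotted) == "id")
```

The check for a following combining mark mirrors the "followed by no other combining mark" condition: in <0069, 0307, 0301>, for instance, the dot is visible, so the sketch leaves that sequence alone.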
> > I'll refrain from commenting further on the threads:
> I urge you to reconsider. We need greater participation and
> understanding in this area, not less. It is precisely because of the
> low participation rates we've had in these i18n issues that we keep
> discovering problems late.
On top of being very discouraged by the existing results of the
discussions, the whole discussion is a huge time sink for me,
unfortunately. Basically, from a distance I saw a bunch of people trying to
explain how all the weird identifier issues in Unicode are, and then IAB
comes up with a list that is overreaching too much and underreaching too