Objection to draft-klensin-idna-5892upd-unicode70

Mon Aug 11 21:15:47 CEST 2014

--On Monday, August 11, 2014 10:03 -0700 Roozbeh Pournader
<roozbeh at google.com> wrote:

> Hi,
> 
> I found about draft-klensin-idna-5892upd-unicode70 last week.
> I highly object to the potential standardization of that
> draft, due to its singling out characters needed for minority
> communities. For comparison, none of the existing DISALLOWED
> characters in RFC 5892 is a part of the basic alphabet of a
> language. If the draft is adopted, users of the Fulfulde
> language cannot use words with an implosive /b/ in their
> domain names, being singled out for no consistent reason.
>...

Roozbeh,

Just to clarify a few things...

If you have not read the considerable traffic on this subject in
the last week or so, I suggest that you might want to do so.
Without at least some of that, your "objection" is, at best,
seriously out of context.

As several people keep trying to point out (Andrew and Vint have
done so within the last few hours), that particular character is
not really the issue here.   The draft was written the way it
was because there was no other mechanism to initiate the
discussion, but the real issue is about, stated yet another way,
whether the Unicode rules for adding new characters and
stability model are consistent with the IDNA2008 model for
character handling and interaction with new versions of Unicode.

Even if the particular instance of U+08A1 were the issue, I
don't believe that exaggeration and hyperbole are useful to this
discussion.  First, from what we've been told and what I've been
able to find out, the vast majority of "users of the Fulfulde
language" write that language in Latin script, so this
discussion is completely irrelevant to them and the domain names
they might want to use in their normal writing system for their
language.  That certainly does not mean that we should not try
to support Fula in Arabic Script, just that the reason for doing
so isn't the vast number of readers and writers of contemporary
Fula.    Second, this would be a very different issue --and, at
least IMO, there would be no discussion going on-- if the use
of U+08A1 were the only possible way to write a character in the
Arabic script that is, by either visual inspection or name,
identical to that Fula implosive /b/, aka "BEH WITH HAMZA
ABOVE".  The difficulty, and the reason for the discussion of
this character as an example, is that:

	(i) It has been possible to write BEH WITH HAMZA ABOVE
	as a combining sequence for years and years (the
	comment about that in Paul Hoffman's recent note is correct
	in intent but wrong in detail)

	(ii) This new character is being added without a
	normalization that makes it equivalent to the prior
	combining sequence.

If the combining sequence didn't exist, then this addition would
be a completely new character, added to allow a particular
language to be written in a relevant (although not predominant)
script, and we would not be having a discussion because there
would be no issue.  If this character were added but decomposed
back to the combining sequence, the equality consideration that
people have asked for and that IDNA assumes (see several prior
postings) would be satisfied, and we wouldn't be having that
discussion.
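
To make (i) and (ii) concrete, here is a minimal sketch in
Python.  It assumes a standard-library unicodedata module built
on Unicode 7.0 or later; the helper name equal_after_nfc is
purely illustrative and is not part of any IDNA implementation.
It contrasts the new code point with ALEF WITH HAMZA ABOVE
(U+0623), which does carry a canonical decomposition.

```python
import unicodedata

# Precomposed code points and the corresponding combining sequences.
BEH_HAMZA_PRECOMPOSED = "\u08A1"     # ARABIC LETTER BEH WITH HAMZA ABOVE
BEH_HAMZA_SEQUENCE = "\u0628\u0654"  # BEH + combining HAMZA ABOVE
ALEF_HAMZA_PRECOMPOSED = "\u0623"    # ARABIC LETTER ALEF WITH HAMZA ABOVE
ALEF_HAMZA_SEQUENCE = "\u0627\u0654" # ALEF + combining HAMZA ABOVE


def equal_after_nfc(a: str, b: str) -> bool:
    """Illustrative helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# U+0623 decomposes canonically, so both spellings normalize to one form.
alef_matches = equal_after_nfc(ALEF_HAMZA_PRECOMPOSED, ALEF_HAMZA_SEQUENCE)

# U+08A1 has no decomposition, so the two spellings stay distinct.
beh_matches = equal_after_nfc(BEH_HAMZA_PRECOMPOSED, BEH_HAMZA_SEQUENCE)

print("ALEF WITH HAMZA ABOVE spellings compare equal:", alef_matches)
print("BEH WITH HAMZA ABOVE spellings compare equal: ", beh_matches)
```

Under those assumptions the first comparison holds and the
second does not, which is the equality problem in a nutshell.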

Even if one assumes that the least-bad solution to the issues
that have been identified is to DISALLOW U+08A1, the effect
would be to prevent that character from being encoded into the
DNS in that particular way.  Not to prevent the use of the
phoneme, not even to prevent the use of the written character,
just to prevent that particular encoding use.  As I pointed out
earlier today, when UAs (including, but not limited to browsers)
are created that are optimized for Fula, nothing prevents them
from using mappings to get to whatever the IDNA-approved
encoding for the character is.  Indeed, we (including at least
the spirit of RFC 5895) would encourage that.   If such UAs
exist today, whatever they are doing with this character, it
cannot involve the U+08A1 encoding.
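
The kind of local mapping I have in mind could look roughly like
the following Python sketch.  It is hypothetical throughout: it
assumes that U+08A1 ends up DISALLOWED and that the combining
sequence U+0628 U+0654 is the form to be looked up; the table
and function names are invented for illustration, and a real UA
would go on to apply normal IDNA2008 processing to the result.

```python
# Hypothetical pre-IDNA mapping table for a Fula-aware user agent.
# Assumption: U+08A1 is DISALLOWED and <BEH, HAMZA ABOVE> is the
# form that IDNA accepts; neither is settled by this sketch.
LOCAL_MAP = {
    0x08A1: "\u0628\u0654",  # BEH WITH HAMZA ABOVE -> BEH + HAMZA ABOVE
}


def map_label_for_lookup(label: str) -> str:
    """Apply the local mapping to user input before IDNA processing."""
    return label.translate(LOCAL_MAP)


# A label typed with the precomposed code point followed by ALEF.
typed = "\u08A1\u0627"
mapped = map_label_for_lookup(typed)

print([f"U+{ord(ch):04X}" for ch in mapped])
```

The user still types and sees the same letter; only the code
points handed to the resolver change.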

> Unicode is full of confusable characters and character
> sequences (with no canonical or compatibility decomposition
> pointing to them). Using a canonical or compatibility
> decomposition mechanism only for finding such cases doesn't
> make sense, nor does singling out some more obvious cases of
> such confusables.

Please read the recent postings.  I don't think it would be
helpful for me to repeat most of the explanations of why the
above statement is not helpful (and most of it isn't even relevant).

> Just looking at the Arabic blocks, here are some other
> character sequence pairs just like U+08A1 that were not
> singled out in RFC 5892 (for good reason):
> 
> U+0618 ≈ U+064E
> U+0619 ≈ U+064F
> U+061A ≈ U+0650
> U+0628 ≈ U+066E U+065C
> U+0628 ≈ U+066E U+08ED
> U+064B ≈ U+064E U+064E
> U+0688 ≈ U+062F U+0615
> U+0692 ≈ U+0631 U+065A
> U+06CC ≈ U+0649
> U+06DF ≈ U+0652
> U+08FF ≈ U+06E1
> 
> The list goes on and on, and can become even more subtle: for
> example, the medial form of U+06CC is identical to the medial
> form of U+064A, and the initial form of U+06BD is identical to
> the initial form of U+067E, something that is not obvious from
> the charts at all.

Curiously, since you have presented yourself as a primary Arabic
script expert, some of the above were pointed out in Arabic IDN
and ASIWG discussions.  I think I remember sitting next to you
in at least one of them.   But these particular issues with
Hamza in combination with base characters as a phonetic
indicator and with the combined character as a separate entity
that had to be kept separate were not, to my recollection,
identified in those meetings or any of the reports that were
published.  The discussion of "character folding" in Section
2.1.3 of RFC 5564 is interesting but seems to me to address a
rather different issue, and the "permitted characters" table in
that document (Section 2.2) lists only precombined (single code
point) forms.  

If the issue of those characters being able to be constructed by
combining sequences without decomposing forms was known to you,
then it is unfortunate that you didn't bring it to the attention
of the IDNA WG when it would have been much more convenient to
deal with it.  If it took you by surprise too, you are, despite
your outrage at the possibility that we would consider
restricting the use of a newly-added character to preserve
equality checks and backward compatibility, in the same
situation as the rest of us and should be helping to look for a
constructive solution that either makes fundamental changes to
the IDNA2008 model or somehow prevents conflicts among
identical (and, in this case, identically-named) characters in a
single script by allowing them to not compare equal.

> Similar issues exist in other scripts too, and across scripts.
> Just looking at the new characters encoded in Unicode 7.0,
> there's a lot of other potential confusables.

If there are other characters that exactly match
previously-available combining sequences within the same script
without decompositions that would produce those sequences, I
wasn't able to find them.  Your help in identifying situations
that really are similar would be appreciated.  Talking instead
about "potential confusables" is not helpful to this particular
topic.

> Capturing all of this for every script and then across the
> scripts is a very large task. The best publicly available
> document that handles such sequences is UTS #39 at
> http://www.unicode.org/reports/tr39/. UTS #39 has its own
> limitations, but the approach taken in there is much more
> comprehensive than the approach taken in
> draft-klensin-idna-5892upd-unicode70, as can be seen by the
> details of its data files.

I really don't believe this is relevant.  The possibilities of
relying on UTS 39 and/or UAX 31 were considered very early in
the evolution of IDNs.  The community concluded they were not a
sufficient match.  If you need to know why, I recommend the
archives.

> Note that registries can use several ways to go around the
> potential confusability issues. For example, they can disallow
> the registration of domain names which use the sequence <BEH,
> HAMZA ABOVE>, or disallow domain names with the character
> <HAMZA ABOVE>, or disallow the character U+08A1 only if a very
> similar label was already registered that was identical except
> that it was using <BEH, HAMZA ABOVE> instead.

This has also been discussed in recent days, but let me be a
little more precise and explicit about it.  The general
mechanisms used by IDNA (and the tables associated with IDNA2003
before IDNA2008 came along) have been adopted as a basis for
rules about strings and identifiers used by a number of other
protocols.  Most of them involve unmanaged name spaces (or no
name spaces at all) with user names, identification strings,
passwords and pass phrases, and the like as good examples (and,
as we are frequently reminded, identifiers for things like iSCSI
objects for machines and clouds).  There is no mechanism for
most of those systems that allows even as much control as the
set of (probably millions of) DNS registries allow. 

> I also recommend that such discussion about architectural
> issues of character sequence confusability happen in the
> mailing lists hosted by the Unicode Consortium, where such
> expertise lies. 

And where, apparently, expertise about the requirements and
constraints of the DNS is somewhat more limited.

> There are nuances in every corner of Unicode,
> and a one-by-one burning of characters doesn't work for the
> internet community. My personal experience has shown the
> Unicode Consortium and the Unicode Technical Committee to be
> very accepting and communicative environments, and they happen
> to know about all the exceptional cases.

The difficulty from an IDNA perspective is that those
exceptional cases have not been identified to us in a proactive
way that would allow a general model for how they are to be
handled in the DNS.   It would probably be unfair to
categorize many of the recent discussions as "we did it that
way, we think we had good reasons, we won't explain those
reasons or the balances we selected as they compare to other
clear statements we have made, and you can't change it, so suck
it up", but a few of the remarks have felt somewhat like that.
So our (or at least my, since I seem to be the designated bad
guy) experience with IDNA and some of these cases is a little
different from yours.  

I really wish we could conclude that there was a problem in this
area that we needed to work on together, but that is hard as long
as the comments from Unicode folks (including yourself) seem to
be "well, this is normal, you can't do anything with it in IDNA
including exercising provisions that have been there since
before RFC 5890ff were published, so why don't you rely on
registries and/or pervasive use of LGR" (rules that, by the way,
would prohibit the use of ZWNJ in Persian, an option for which
I think you were one of the most passionate advocates).

> PS: I am one of the leading experts in the standardization of
> Arabic script and languages using it. Among other things, I
> have spent the last 15 years sorting out and documenting the
> nuances of Unicode model for the Arabic script.

See above.

best regards,
   john