Objection to draft-klensin-idna-5892upd-unicode70

Roozbeh Pournader roozbeh at google.com
Mon Aug 11 19:03:54 CEST 2014


I found out about draft-klensin-idna-5892upd-unicode70 last week. I strongly
object to the potential standardization of that draft, because it singles
out characters needed by minority communities. For comparison, none of the
existing DISALLOWED characters in RFC 5892 is part of the basic alphabet
of a language. If the draft is adopted, users of the Fulfulde language
will not be able to use words with an implosive /b/ in their domain names,
and will have been singled out for no consistent reason.

Unicode is full of confusable characters and character sequences (with no
canonical or compatibility decomposition pointing to them). Relying only on
canonical or compatibility decompositions to find such cases doesn't make
sense, nor does singling out a few of the more obvious cases of such
confusables.

Just looking at the Arabic blocks, here are some other pairs of characters
or character sequences, just like the U+08A1 case, that were not singled
out in RFC 5892 (for good reason):

U+0618 ≈ U+064E
U+0619 ≈ U+064F
U+061A ≈ U+0650
U+0628 ≈ U+066E U+065C
U+0628 ≈ U+066E U+08ED
U+064B ≈ U+064E U+064E
U+0688 ≈ U+062F U+0615
U+0692 ≈ U+0631 U+065A
U+06CC ≈ U+0649
U+06DF ≈ U+0652
U+08FF ≈ U+06E1

The list goes on and on, and can become even more subtle: for example, the
medial form of U+06CC is identical to the medial form of U+064A, and the
initial form of U+06BD is identical to the initial form of U+067E,
something that is not obvious from the charts at all.
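
To illustrate the point about decompositions, the pairs listed above can be
checked mechanically. The following is a minimal sketch in Python, assuming
only the standard unicodedata module; the helper names are hypothetical and
the U+08A1 pair is added to the list for comparison. It reports, for each
pair, the normalization forms under which the two sides become identical.

```python
import unicodedata

# Pairs copied from the list above, plus the U+08A1 case from the draft.
PAIRS = [
    ("\u0618", "\u064E"),
    ("\u0619", "\u064F"),
    ("\u061A", "\u0650"),
    ("\u0628", "\u066E\u065C"),
    ("\u0628", "\u066E\u08ED"),
    ("\u064B", "\u064E\u064E"),
    ("\u0688", "\u062F\u0615"),
    ("\u0692", "\u0631\u065A"),
    ("\u06CC", "\u0649"),
    ("\u06DF", "\u0652"),
    ("\u08FF", "\u06E1"),
    ("\u08A1", "\u0628\u0654"),  # BEH WITH HAMZA ABOVE vs <BEH, HAMZA ABOVE>
]

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def unified_by_normalization(a, b):
    """Return the normalization forms under which a and b become identical."""
    return [form for form in FORMS
            if unicodedata.normalize(form, a) == unicodedata.normalize(form, b)]


def codepoints(text):
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join("U+{:04X}".format(ord(ch)) for ch in text)


for left, right in PAIRS:
    forms = unified_by_normalization(left, right)
    print(codepoints(left), "vs", codepoints(right), "->",
          ", ".join(forms) if forms else "not unified by any form")
```

None of these pairs is unified by any of the four normalization forms, which
is why a decomposition-based test cannot find them. By contrast, a pair such
as U+0623 and <U+0627, U+0654> is unified, because U+0623 has a canonical
decomposition.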

Similar issues exist in other scripts too, and across scripts. Just looking
at the new characters encoded in Unicode 7.0, there are many other
potential confusables.

Capturing all of this for every script and then across the scripts is a
very large task. The best publicly available document that handles such
sequences is UTS #39 at http://www.unicode.org/reports/tr39/. UTS #39 has
its own limitations, but the approach taken in there is much more
comprehensive than the approach taken in
draft-klensin-idna-5892upd-unicode70, as can be seen by the details of its
data files.

Note that registries have several ways to work around the potential
confusability issues. For example, they can disallow the registration of
domain names which use the sequence <BEH, HAMZA ABOVE>, or disallow domain
names with the character <HAMZA ABOVE>, or disallow the character U+08A1
only if a very similar label was already registered that was identical
except that it was using <BEH, HAMZA ABOVE> instead.
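
The last of these options can be sketched in a few lines of Python. This is
only an illustration under stated assumptions: the function names and the
idea of a comparison key are hypothetical, not taken from any registry's
actual policy, and the check is symmetric (it also blocks the sequence form
when the U+08A1 form is already registered), which is slightly broader than
the option described above.

```python
BEH_HAMZA_CHAR = "\u08A1"       # ARABIC LETTER BEH WITH HAMZA ABOVE
BEH_HAMZA_SEQ = "\u0628\u0654"  # <BEH, HAMZA ABOVE>


def comparison_key(label):
    """Fold U+08A1 into the <BEH, HAMZA ABOVE> sequence for comparison only."""
    return label.replace(BEH_HAMZA_CHAR, BEH_HAMZA_SEQ)


def may_register(label, registered):
    """Allow a label unless an existing label has the same comparison key."""
    key = comparison_key(label)
    return all(comparison_key(existing) != key for existing in registered)


# Hypothetical registry state: one label that uses the sequence form.
registered = {"\u0628\u0654\u0628"}

print(may_register("\u08A1\u0628", registered))  # look-alike of the existing label
print(may_register("\u08A1\u0631", registered))  # different label, no conflict
```

With a check like this, U+08A1 remains usable in general and is refused only
when it would collide with an existing registration.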

I also recommend that such discussions about the architectural issues of
character sequence confusability happen on the mailing lists hosted by the
Unicode Consortium, where such expertise lies. There are nuances in every
corner of Unicode, and a one-by-one burning of characters doesn't work for
the internet community. My personal experience has shown the Unicode
Consortium and the Unicode Technical Committee to be very accepting and
communicative environments, and they happen to know about all the
exceptional cases.


PS: I am one of the leading experts in the standardization of the Arabic
script and the languages using it. Among other things, I have spent the last
15 years sorting out and documenting the nuances of the Unicode model for the
Arabic script.

More information about the Idna-update mailing list