IAB Statement on Identifiers and Unicode 7.0.0

Andrew Sullivan ajs at anvilwalrusden.com
Wed Jan 28 23:23:18 CET 2015


On Wed, Jan 28, 2015 at 09:56:28PM +0000, Shawn Steele wrote:
> 
> I'm a little confused.  What kind of property are you looking for?  To do what?  The behavior of these characters obeys NFC, so the behavior seems to be clear to me.
> 

I think there is a faint circularity at the centre of your argument.
We all agree that these characters obey all the usual NF* rules given
the way they're constructed, because they're not canonically
equivalent.  We all agree that Unicode has some principles by which
characters are encoded and when canonical equivalence is determined.
But you cannot reason from that to, "Since they're not canonically
equivalent, therefore they're fine for identifiers."  That's precisely
the premise that some of us have come to question.

> Is there still a perceived "surprise" if these code points had visibly different glyphs or different names?  Names & glyphs aren't canonical Unicode properties
> 

If they had even possibly visibly different glyphs, in fact, we
wouldn't be arguing about this at all.  If you read carefully the
treatment of _hamza_ in the Unicode Standard (in version 7.0.0 it's in
chapter 9, on pages 362 and 378-379), it's pretty clear that this is a
quite complicated issue, _but also_ that there's one thing we're
talking about ("the _hamza_").  That's why these characters are simply
indistinguishable from one another: no font _could_ render them
distinguishable, because (presumably for historical reasons) they
developed from the "same writing", but over time accrued different
significance without changing form.  But importantly, you cannot read
that off the properties of the characters.  (Mark offered another
example of the combining solidus and o.  I can't tell yet whether that
case is the same issue or isn't, but I suspect it may be.)

So it is not merely that these are visually the same, but that they're
in the same script, and that other analogous cases get treated as
canonically equivalent and these don't.  That's a problem for the IETF
because we were using the derived properties and the canonical
equivalence under the assumption that they'd give us certain
guarantees, and it turns out they don't.  Again, this isn't Unicode's
fault, it's just a fact of the way things are.
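
To make the mismatch concrete, here is a minimal sketch in Python.  It
assumes the pair at issue is U+08A1 (ARABIC LETTER BEH WITH HAMZA
ABOVE) versus the sequence U+0628 U+0654, and it uses alef with hamza
above as the analogous case that does compose.  The helper name
nfc_equal is made up for illustration.

```python
import unicodedata

# Assumption: these are the code points under discussion.
BEH_WITH_HAMZA = "\u08A1"         # single precomposed code point
BEH_PLUS_HAMZA = "\u0628\u0654"   # BEH followed by combining HAMZA ABOVE

# An analogous pair that Unicode does treat as canonically equivalent.
ALEF_WITH_HAMZA = "\u0623"        # single precomposed code point
ALEF_PLUS_HAMZA = "\u0627\u0654"  # ALEF followed by combining HAMZA ABOVE


def nfc_equal(a: str, b: str) -> bool:
    """Return True if the two strings are identical after NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The alef pair collapses to one form under NFC; the beh pair does not.
print(nfc_equal(ALEF_WITH_HAMZA, ALEF_PLUS_HAMZA))
print(nfc_equal(BEH_WITH_HAMZA, BEH_PLUS_HAMZA))
```

A comparison that relies on NFC alone therefore treats the two beh
forms as different identifiers, even though a reader cannot tell them
apart.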

There is more in this thread, above, so I won't repeat it all, but the
choices boil down to a few:

1.  Do nothing.  Yep, some people will be confused.  Too bad.

2.  Unicode can't help us, sorry, because of what it is.  Maybe we
need something else (I hope we all agree this is a non-starter).

3.  Come up with a new property that will either give the IETF, or
allow the IETF to derive, the sort of equivalence it thought it got
from the existing properties + NFC.

4.  Send the IETF off to maintain its own exception list.
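
For what option 4 might look like in practice, here is a rough sketch
in Python.  Everything in it is hypothetical: the EXCEPTIONS set, the
choice of U+08A1 as its only entry, and the function name acceptable
are placeholders, not anything the IETF has specified.  Which code
points would belong on such a list is exactly the open question.

```python
import unicodedata

# Hypothetical exception list: code points a protocol refuses in
# identifiers even though NFC leaves them untouched.  U+08A1 is here
# purely as an illustration.
EXCEPTIONS = {0x08A1}


def acceptable(label: str) -> bool:
    """Accept a label only if it is already in NFC and avoids the exception list."""
    if unicodedata.normalize("NFC", label) != label:
        return False  # not in normalized form
    return all(ord(ch) not in EXCEPTIONS for ch in label)


print(acceptable("\u0628\u0654"))  # BEH + combining HAMZA ABOVE
print(acceptable("\u08A1"))        # precomposed form, on the hypothetical list
```

The cost is the one the option itself implies: somebody has to keep
that set current with every new version of Unicode.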

Best regards,

A

-- 
Andrew Sullivan
ajs at anvilwalrusden.com

