"This case isn't the important one" (was Re: Visually confusable characters (8))

Andrew Sullivan ajs at anvilwalrusden.com
Mon Aug 11 21:06:11 CEST 2014

On Mon, Aug 11, 2014 at 06:31:33PM +0000, Shawn Steele wrote:

> A)     From a purely binary standpoint no mapping was added, which is pretty much what normalization guaranteed, that no binary mappings would change.

"No mappings change" is true because the code point wasn't assigned
before.  The stability "promise" was also that if a new precomposed
version of something formerly made out of a composition sequence were
added, the normalization rules would use the decomposed form.  So, the
entire argument turns on …

> B)      Linguistic experts have indicated that, despite the confusing name, this is not the same character.

…this.  As I've said more than once, I get this argument.  It's hard
to understand, however, why the linguistic argument holds in this
case, where it produces a precomposed character and a combining
sequence that are undetectably different, and yet not in the case of
(e.g.) ö (in Swedish) and o-umlaut (in German).  They're clearly
different letters linguistically too.
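To make concrete what "the same character" means at the encoding
level: Unicode assigns Swedish ö and German o-umlaut the identical
code point, U+00F6, and the normalization forms simply toggle between
the precomposed form and the combining sequence.  A minimal sketch
using Python's standard unicodedata module:

```python
import unicodedata

# Swedish "ö" and German o-umlaut are the same code point, U+00F6.
# Unicode carries no language information, so normalization treats
# them identically regardless of which language the text is in.
precomposed = "\u00f6"   # ö, LATIN SMALL LETTER O WITH DIAERESIS
sequence = "o\u0308"     # o + COMBINING DIAERESIS

# NFC folds the combining sequence into the precomposed character...
assert unicodedata.normalize("NFC", sequence) == precomposed

# ...and NFD decomposes the precomposed character back again, so the
# two spellings always compare equal after normalization.
assert unicodedata.normalize("NFD", precomposed) == sequence
```

This is the behavior normalization was expected to provide in general:
multiple encodings of what renders as one letter collapse to a single
canonical form.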

Note that I'm not saying Unicode made the wrong decision.  I'm not
qualified to have an opinion about that.  I'm saying instead that, for
the purposes of IDNA, this decision appears to cause us trouble.

> So, in my view, nothing’s changed or broken WRT IDN’s use of normalization.  Yes, another potentially confusing character combination now exists, but we already have thousands of homographs. 

I think this "thousands of homographs" premise that keeps being
offered is either disingenuous or else equivocates on "homographs".
The relevant class of cases here is the intra-script one, where there
are both a precomposed character and a combining sequence that nobody
could ever, even in principle, tell apart without looking at the
Unicode code points directly, and which the normalization forms do
not render as matching one another.  It's _not_ the look-alike cases
(like the lower-case letter l and the number 1), nor the inter-script
cases (like those among Latin, Cyrillic, and Greek).  There _does_
appear to be a set of cases in this class that were already part of
IDNA2008.  But that actually highlights, rather than refutes, the
point that I, at least, am trying to make.  We _thought_ that NFC
would mean that, if there really were multiple ways of writing
something in practice, normalization would collapse those multiple
ways into the same thing.  It turns out we were wrong.  Now the
question is what to do about it, and so far there's a suggestion that
even the authors seem to dislike.
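The problematic class can be illustrated with a long-standing
Latin-script example (deliberately not the newly added character this
thread is about): Ø, U+00D8, has no canonical decomposition, so no
normalization form ever unifies it with the visually similar sequence
O followed by U+0338 COMBINING LONG SOLIDUS OVERLAY.  A sketch, again
with Python's unicodedata:

```python
import unicodedata

# A precomposed character with NO canonical decomposition, alongside
# a combining sequence that renders (near-)identically:
precomposed = "\u00d8"   # Ø, LATIN CAPITAL LETTER O WITH STROKE
sequence = "O\u0338"     # O + COMBINING LONG SOLIDUS OVERLAY

# Because U+00D8 has neither a canonical nor a compatibility
# decomposition, every normalization form leaves the two spellings
# as distinct strings:
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert (unicodedata.normalize(form, precomposed)
            != unicodedata.normalize(form, sequence))
```

Two labels spelled those two ways survive normalization as different
strings even though no user could distinguish them on screen; that is
exactly the gap in the "NFC unifies multiple spellings" expectation
described above.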

Best regards,


Andrew Sullivan
ajs at anvilwalrusden.com
