"This case isn't the important one" (was Re: Visually confusable characters (8))

Mon Aug 11 21:50:25 CEST 2014

Hi Ken,

First, thanks for this.  Your message clarified a number of things for
me.

On Mon, Aug 11, 2014 at 07:28:50PM +0000, Whistler, Ken wrote:
> 
> U+0529 CYRILLIC LETTER EN WITH LEFT HOOK
> 
> For Orok.  For some value of "possible" and some value of "create", that
> was "the same" as the existing U+043D CYRILLIC LETTER EN + 
> U+0321 COMBINING PALATALIZED HOOK BELOW (which takes different
> positions, depending on the base letter it attaches to). 

Ok, this is helpful.  Thanks.

> I am guessing that you have taken on board the Unicode formal non-equivalence
> of these "precomposed" characters that have diacritics attached
> or overlaying the base letters, even though *logically* these
> diacritic modifications are, at another level the "same" as the
> base character plus the application of the combining diacritic
> in question.

No, not quite; this is exactly the problem.  While I accept and
believe I understand Unicode's decision that these are formally
non-equivalent, I think that for IDNA purposes that may not be good.
And this non-equivalence is not consistent with what I believed NFC
was supposed to get us for IDNA purposes.  That there are now these
other examples (thank you for them) makes the problem worse, not
better.
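
To make the gap concrete, here is a minimal sketch using Python's
standard unicodedata module.  It assumes a Python build whose Unicode
database knows U+0529 (Unicode 7.0 or later); the variable names are
mine, for illustration only.  It checks whether NFC maps your sequence
onto the precomposed letter, and compares that with a Latin pair that
does have a canonical decomposition.

```python
import unicodedata

# The pair from Ken's example: base letter + combining hook vs. the
# separately encoded letter.
decomposed = "\u043d\u0321"   # U+043D followed by U+0321
precomposed = "\u0529"        # U+0529

# NFC composes only where the standard defines a canonical decomposition.
nfc_of_sequence = unicodedata.normalize("NFC", decomposed)
nfd_of_precomposed = unicodedata.normalize("NFD", precomposed)

sequence_composes = nfc_of_sequence == precomposed
precomposed_decomposes = nfd_of_precomposed == decomposed

# For contrast: 'e' + U+0301 COMBINING ACUTE ACCENT does compose to U+00E9.
latin_composes = unicodedata.normalize("NFC", "e\u0301") == "\u00e9"

print("U+043D U+0321 -> U+0529 under NFC:", sequence_composes)
print("U+0529 -> U+043D U+0321 under NFD:", precomposed_decomposes)
print("e U+0301 -> U+00E9 under NFC:", latin_composes)
```

The two Cyrillic forms stay distinct under both NFC and NFD, so a
protocol that relies on normalization alone will treat them as
different labels.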

> In any case, at this point I find it surprising that anyone who had
> been paying close attention to Unicode for the last 15 years or so
> would find the regular (not common, not rare) introduction of these
> kinds of letters into the standard to actually be surprising.

What is surprising to me is the difference between what I thought
happened with normalization in the cases where precomposed characters
were to be added, and what actually happens.  I don't mean "I am
surprised" in the sense that is really a snide way of saying, "You are
wrong."  It's just a genuine expression of surprise.  It's entirely
possible (it wouldn't surprise me at all) that I was utterly confused
about the way Unicode works.  I've been working intimately with the
DNS since the early 2000s, and I continue to be surprised by it too.
Probably others are more clever than I am and are less often mistaken.
My apologies.

> It will recur. These kinds of situations are built into the structure
> of a number of scripts -- most notably Latin, Cyrillic, and Arabic.
> They should not be surprises.

That's good to know.  Now we have to figure out what the consequences
are for protocols.

Best regards,

A

-- 
Andrew Sullivan
ajs at anvilwalrusden.com