Unicode 7.0.0, (combining) Hamza Above, and normalization

Shawn Steele Shawn.Steele at microsoft.com
Fri Aug 8 20:33:56 CEST 2014

>  It would appear that at least some of us had the wrong understanding, and the implications of the actual rules are different to what we'd believed....

> IDNA2008 and into PRECIS and elsewhere.  Stated simplistically, that understanding has been that normalization would deal effectively with the issue of equality comparisons between "characters" within the same script that had the same appearance.  

That may be the crux of the issue.  Normalization is more concerned with linguistic behavior than with visual appearance.  The question is "is it a different character?"  The normalization stability guidelines prohibit adding a different version of the same character, as that would confuse things.  However, my understanding is that this is a different character (sorry, I'm not an expert on this language), though it appears the naming may be confusing.

Obvious cases are things like Cyrillic and Latin, which look very similar but are easily understood to be different scripts.  Unfortunately there are examples within scripts too.  In some fonts even rn looks a lot like m, or, as pointed out, l, I and 1, or 0 and O.

> > There are likely many similar-looking things that fit in a similar 
> > bucket and have escaped notice.

> All the more reason to concern ourselves with it, no?

I'm gathering that the concerns are more about homographs in general than about just one character.  There appears to have been a belief that NFKC would prevent one class of homographs (same-script ones); however, that is not the case, and it goes beyond this character.
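
A quick way to see this is to compare a few lookalike pairs after NFKC.  The sketch below uses Python's standard unicodedata module.  The pair list and the helper name nfkc_equal are only illustrative, and I'm assuming that U+08A1 versus U+0628 U+0654 is the pair under discussion in this thread.  Results depend on the Unicode data shipped with the Python build, so treat it as a rough check rather than anything definitive.

```python
import unicodedata


def nfkc_equal(a: str, b: str) -> bool:
    """Return True if the two strings are identical after NFKC normalization."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


# Illustrative pairs that can look alike in some fonts but are distinct characters.
lookalikes = {
    "Latin a vs Cyrillic a": ("a", "\u0430"),
    "rn vs m": ("rn", "m"),
    "lowercase l vs digit 1": ("l", "1"),
    # Assumed to be the pair discussed in the thread.
    "BEH WITH HAMZA ABOVE vs BEH + combining HAMZA ABOVE": ("\u08A1", "\u0628\u0654"),
}

results = {label: nfkc_equal(a, b) for label, (a, b) in lookalikes.items()}

# For contrast, a compatibility equivalent that NFKC does fold: the "fi" ligature.
ligature_folds = nfkc_equal("\uFB01", "fi")

for label, same in results.items():
    print(f"{label}: equal after NFKC? {same}")
print(f"fi ligature vs 'fi': equal after NFKC? {ligature_folds}")
```

None of the lookalike pairs should compare equal, while the ligature should, which is the distinction between "same character, different encoding" and "different character, similar appearance".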

As far as this concerns homographs, it was suggested that guidance be provided to registrars to not allow both of these.  It seems like that'd go for quite a few sequences, and I have a difficult time imagining any prescriptive homograph mechanism being complete.  Stick Chinese in a small-ish font, for example, and it'd probably be pretty easy to find thousands of characters that may trick someone at a quick glance.  It's the same way that rnicrosoft.com can trick you if you're expecting to see "Microsoft".

I'm sort of afraid that there's a catch-22 here: a desire to allow linguistic strings so that people's languages aren't disallowed, and a scientific precision that prescribes a mathematical uniqueness.  Linguistics != mathematics.

Re: PRECIS, what is its intended purpose?  I haven't heard of it.

