Unicode & IETF

Shawn Steele Shawn.Steele at microsoft.com
Wed Aug 13 18:25:27 CEST 2014


> Isn't it the case that NFC does NOT convert sharp-s to ss or vice versa?
> assuming that is correct, then these are treated as distinct,  at least for comparison purposes.

Exactly, but there's a somewhat popular fussball.de site.  Users probably expect the sharp-s version to take them there.  It was pointed out that they are alternate spellings, not look-alikes.  I seriously doubt that users care which bucket we put them in when they're surprised that it doesn't work.
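
To make the sharp-s point concrete, here is a minimal Python sketch using only the standard library. The variable names are illustrative, and it assumes Python's built-in "idna" codec, which implements the older IDNA2003 rules, as a stand-in for IDNA2003 behavior. It suggests that no normalization form touches the sharp-s, while case folding and the IDNA2003 mapping both turn it into "ss".

```python
import unicodedata

sharp_s = "fu\u00dfball.de"   # "fussball" spelled with U+00DF LATIN SMALL LETTER SHARP S
plain = "fussball.de"         # the "ss" spelling

# U+00DF has no decomposition mapping, so every normalization form leaves it alone.
forms = ("NFC", "NFD", "NFKC", "NFKD")
unchanged_by_all_forms = all(
    unicodedata.normalize(form, sharp_s) == sharp_s for form in forms
)

# Case folding is a different operation from normalization; it maps sharp-s to "ss".
casefolded = sharp_s.casefold()

# Assumption: Python's built-in "idna" codec follows IDNA2003 (nameprep),
# which also maps sharp-s to "ss" before encoding.
idna2003 = sharp_s.encode("idna")

print(unchanged_by_all_forms)  # True
print(casefolded == plain)     # True
print(idna2003)                # b'fussball.de'
```

So under normalization alone the two spellings stay distinct labels; whether they land on the same site depends entirely on whatever mapping step is applied on top.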

> Even if they did look the same to you,
> it might not be relevant unless, e.g., they occupied the same
> key on your keyboard.  As far as I know, no one is trying to get
> a universal similar-looking character recognizer out of these
> process.

However, as Asmus mentions, the other character in question is never on your keyboard.  The NFC/NFD rules say they aren't the same character, so people using the appropriate language with the appropriate keyboard never type 'the other' character.

Abstracting away from this specific character, I could see the wrong-yet-similar-looking character being used only while I have to hack around a missing letter (or missing support for a letter) in my language.  That happened in Hawaiian, with diereses used in place of the vowel macrons and an apostrophe in place of the glottal-stop letter; in Romanian, with the cedilla vs the turned comma; and in Maori with something I can't remember.  It also happens when a user ends up with the wrong keyboard or support.  There are probably tons of cases I don't know about, and probably cases where Unicode is still missing something and people are making do.  For *NONE* of the cases I mentioned do NFC/NFD/NFKD/NFKC, or even IDNA2003, provide mappings to alleviate this confusion.
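
The claim about the normalization forms can be checked with a short Python sketch. The code points below are my reading of the pairs mentioned above (cedilla vs comma below for Romanian, diaeresis vs macron and apostrophe vs U+02BB for Hawaiian), and the labels and helper name are hypothetical. It only covers NFC/NFD/NFKC/NFKD, not the IDNA2003 mapping tables.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

# (makeshift character, intended character); code points are an assumed reading
# of the examples in the message, not an authoritative list.
PAIRS = {
    "Romanian s: cedilla vs comma below": ("\u015f", "\u0219"),
    "Romanian t: cedilla vs comma below": ("\u0163", "\u021b"),
    "Hawaiian a: diaeresis vs macron": ("\u00e4", "\u0101"),
    "Hawaiian glottal stop: apostrophe vs U+02BB": ("'", "\u02bb"),
}


def unified_by(form, a, b):
    """Return True if normalizing both strings with `form` makes them equal."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# For each pair, collect the normalization forms (if any) that make them equal.
unified = {
    label: [form for form in FORMS if unified_by(form, a, b)]
    for label, (a, b) in PAIRS.items()
}

for label, forms in unified.items():
    print(f"{label}: unified by {forms or 'no normalization form'}")
```

Each pair decomposes to a different combining mark (or does not decompose at all), so every list comes back empty: normalization keeps the makeshift and the intended characters apart.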

We seem to be having this conversation because in this specific case the characters happen to look similar (though fonts could probably vary if they felt like it).  If we want to consider cases where a similar-looking, same-script letter was hacked in while there was no support, vs the actual letter used once support is provided, then at the least we should be having a serious discussion about the cedilla/turned comma in Romanian as well.  There are undoubtedly numerous cedillas showing up in Romanian text, and possibly even still on older computers.  Or, perhaps, about vowel macrons vs diereses, which don't really look alike on my machine (unless maybe I made it an 8 pt font), but probably still look alike on Keola's machine, as can be seen in a PDF: http://www.olelo.hawaii.edu/pub/WinKeyboardInstallers.pdf

When the characters are used as designed, there is no confusion, because they should never appear in the other one's context.

-Shawn



