Thai Codepoint U+0E33

Tue Jul 28 15:15:51 CEST 2009

Mark,

Pete suggested that we email you, as the resident Unicode expert, to find out what the deal is with this code point:

U+0E33 THAI CHARACTER SARA AM is not closed under NFKC.

NFC(U+0E33) = U+0E33
NFKC(U+0E33) = U+0E4D,U+0E32

NFC(U+0E4D,U+0E32) = U+0E4D,U+0E32
NFKC(U+0E4D,U+0E32) = U+0E4D,U+0E32
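The four results above can be reproduced with Python's standard `unicodedata` module. This is only a quick check, not part of any mapping proposal; the helper name `normalize_cps` is made up for illustration, and the output depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

def normalize_cps(form: str, text: str) -> str:
    """Normalize text with the given form and render it as U+XXXX,U+XXXX."""
    normalized = unicodedata.normalize(form, text)
    return ",".join(f"U+{ord(ch):04X}" for ch in normalized)

SARA_AM = "\u0E33"             # THAI CHARACTER SARA AM
NIKHAHIT_AA = "\u0E4D\u0E32"   # THAI CHARACTER NIKHAHIT + THAI CHARACTER SARA AA

for label, text in (("U+0E33", SARA_AM), ("U+0E4D,U+0E32", NIKHAHIT_AA)):
    for form in ("NFC", "NFKC"):
        print(f"{form}({label}) = {normalize_cps(form, text)}")
```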

This character is used, for example, at the end of the Thai word for ‘gold’. [http://www.thai-language.com/id/131373#def3]

However, if you use a Windows machine with a Thai keyboard and hit the 'E' key, it enters that single code point (not the two separate code points). Unless someone knows to map U+0E33 to U+0E4D,U+0E32, it is therefore not possible to type the Thai word for gold as a U-label. The current mappings document (i.e. apply NFC) does not help turn the string into a valid U-label, even though one exists (i.e. there is a sequence of PVALID code points that produces the same string on screen). I hope I am explaining this clearly. Is there a reason why U+0E4D,U+0E32 doesn’t compose to U+0E33 under NFC?

Do you know of any other cases like this? In this case we need to apply NFKC to the user input to convert it into a valid U-label, which the mapping document currently doesn’t cover.

Overall, the concern here is that width and case mappings alone leave a few holes. The character above is DISALLOWED because it is unstable (toNFKC(toCaseFold(toNFKC(cp))) != cp), yet it can be typed on a keyboard with a single keystroke. We used the Unicode normalization test to determine which characters are DISALLOWED in their NFC form yet become PVALID after NFKC, and this identifies many code points. However, applying NFKC across the board leads to unwanted outcomes, such as mapping superscript characters into something that is PVALID.
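The stability condition quoted above can be sketched in a few lines of Python. This assumes that `str.casefold()` is an acceptable stand-in for toCaseFold and that the interpreter's bundled Unicode data is recent enough; the function name `is_unstable` is hypothetical.

```python
import unicodedata

def is_unstable(cp: str) -> bool:
    """Return True when toNFKC(toCaseFold(toNFKC(cp))) != cp."""
    step1 = unicodedata.normalize("NFKC", cp)
    step2 = step1.casefold()  # assumption: stands in for toCaseFold
    step3 = unicodedata.normalize("NFKC", step2)
    return step3 != cp

# U+0E33 changes under NFKC, so it is unstable; its two-code-point
# replacement is made of code points that are individually stable.
print(is_unstable("\u0E33"))
print(is_unstable("\u0E4D"), is_unstable("\u0E32"))
```

Note that the same test also flags superscripts such as U+00B2, which is why it cannot by itself decide which code points deserve an extra mapping.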

Our current thinking is that, in addition to the current steps in the mapping guide, we would apply compatibility mappings to arrive at a U-label. These should only be applied when the code point that is DISALLOWED in its NFC form has a General_Category in {Ll, Lu, Lo, Nd, Lm, Mn, Mc} (i.e. identical to the protocol). A further restriction could be that the code point must have a compatibility decomposition, and not one that arises from superscript, subscript, font or other such mappings.
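A rough sketch of that filter, using Python's `unicodedata`, might look as follows. It makes two assumptions that are ours and not settled: that "compatibility decomposition, not superscript/subscript/font" means the decomposition carries the generic `<compat>` tag, and that the caller has already established that the code point is DISALLOWED (that check needs the protocol tables and is not done here). The name `is_mapping_candidate` is hypothetical.

```python
import unicodedata

# General_Category values taken from the paragraph above.
ALLOWED_CATEGORIES = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}

def is_mapping_candidate(cp: str) -> bool:
    """True if cp has an allowed category and a generic <compat> decomposition.

    Does NOT check DISALLOWED/PVALID status; that is assumed to be done elsewhere.
    """
    if unicodedata.category(cp) not in ALLOWED_CATEGORIES:
        return False
    # unicodedata.decomposition returns e.g. "<compat> 0E4D 0E32",
    # "<super> 0032", or "" when there is no decomposition.
    return unicodedata.decomposition(cp).startswith("<compat>")

print(is_mapping_candidate("\u0E33"))  # THAI CHARACTER SARA AM
print(is_mapping_candidate("\u00B2"))  # SUPERSCRIPT TWO
```

Under these assumptions SARA AM qualifies, while SUPERSCRIPT TWO is rejected on both counts (category No and a `<super>` tag).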

There are several cases we have identified where this could be required and there may well be more:

0675 ARABIC LETTER HIGH HAMZA ALEF
0678 ARABIC LETTER HIGH HAMZA YEH
FB4F HEBREW LIGATURE ALEF LAMED
0EB3 LAO VOWEL SIGN AM
0F77 TIBETAN VOWEL SIGN VOCALIC RR
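For reference, the listed code points can be inspected the same way as U+0E33. The snippet below only prints what the interpreter's Unicode data reports for each one; the helper `cps` and the `rows` list are illustrative names.

```python
import unicodedata

CASES = ["\u0675", "\u0678", "\uFB4F", "\u0EB3", "\u0F77"]

def cps(text: str) -> str:
    """Render a string as comma-separated U+XXXX code points."""
    return ",".join(f"U+{ord(ch):04X}" for ch in text)

rows = []
for ch in CASES:
    nfc = unicodedata.normalize("NFC", ch)
    nfkc = unicodedata.normalize("NFKC", ch)
    rows.append((ch, nfc, nfkc))
    print(f"{cps(ch)} {unicodedata.name(ch)}: NFC = {cps(nfc)}, NFKC = {cps(nfkc)}")
```

Each of these is left unchanged by NFC but rewritten by NFKC, which is the same pattern as SARA AM.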

What are your thoughts on this?

Thanks

Chris