Thai Codepoint U+0E33

Wed Jul 29 09:54:42 CEST 2009

Would one of the other Unicode folks on the list take a shot at this 
question, since Mark seems uninterested in actually answering the 
question asked:

On 7/28/09 at 11:15 PM +1000, Chris Wright wrote:

>...find out what's the deal with this code point:
>
>U+0E33 THAI CHARACTER SARA AM is not closed under NFKC.
>
>NFC(U+0E33) = U+0E33
>NFKC(U+0E33) = U+0E4D,U+0E32
>
>NFC(U+0E4D,U+0E32) = U+0E4D,U+0E32
>NFKC(U+0E4D,U+0E32) = U+0E4D,U+0E32
>
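
For anyone who wants to reproduce the four results above, here is a minimal sketch using Python's standard unicodedata module. The helper name codepoints is mine, not something from Chris's message. It also prints the decomposition field for U+0E33, which is tagged <compat>. NFC composes only canonical pairs, so as far as I can tell that is the mechanical reason the two-code-point sequence does not recompose.

```python
import unicodedata

def codepoints(s):
    """Render a string as a comma-separated list of U+XXXX values."""
    return ",".join(f"U+{ord(c):04X}" for c in s)

sara_am = "\u0e33"       # THAI CHARACTER SARA AM
pair = "\u0e4d\u0e32"    # THAI CHARACTER NIKHAHIT + THAI CHARACTER SARA AA

for form in ("NFC", "NFKC"):
    for label, text in (("U+0E33", sara_am), ("U+0E4D,U+0E32", pair)):
        result = unicodedata.normalize(form, text)
        print(f"{form}({label}) = {codepoints(result)}")

# The decomposition field carries a "<compat>" tag, i.e. it is a
# compatibility mapping rather than a canonical one.
print(unicodedata.decomposition(sara_am))
```
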
>This character is used, for example, on the end of the Thai word 
>'gold'. [http://www.thai-language.com/id/131373#def3]
>
>However, if you use a Windows machine with a Thai keyboard and hit 
>the 'E' key, it enters that single code point (not the two separate 
>code points). So unless something knows to map 0E33 to 0E4D,0E32, 
>it is not possible to type the word gold in Thai as a U-label.  The 
>current mappings document (i.e. saying apply NFC) does not help turn 
>the string into a valid U-label, even though that is possible (i.e. 
>there is a sequence of PVALID code points that produces the same 
>string on the screen). I hope I am explaining this clearly. Is there 
>a reason why U+0E4D,U+0E32 doesn't NFC to U+0E33?
>
>Do you know of any other cases like this? Because in this case we 
>need to apply NFKC to the user input to convert it into a valid 
>U-label, which the mapping document currently doesn't deal with.
>
>Overall, the concern here is that width and case mappings alone leave 
>a few holes. The character above is DISALLOWED because it is 
>unstable (toNFKC(toCaseFold(toNFKC(cp))) != cp), yet it can be 
>typed on a keyboard with one keystroke. We used the Unicode 
>normalization test to determine which characters are DISALLOWED in 
>NFC form yet PVALID after NFKC, which identifies many code points.  
>However, applying NFKC across the board leads to unwanted outcomes, 
>such as mapping superscript characters etc. into something that is 
>PVALID.
>
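
The stability test quoted above can be sketched in a few lines of Python. This is only an approximation under stated assumptions: I am using str.casefold() as a stand-in for toCaseFold and unicodedata.normalize for toNFKC, and the function name is_unstable is mine. The results also depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

def is_unstable(cp):
    """Return True when toNFKC(toCaseFold(toNFKC(cp))) != cp."""
    ch = chr(cp)
    step1 = unicodedata.normalize("NFKC", ch)   # toNFKC
    step2 = step1.casefold()                    # toCaseFold (assumed equivalent)
    step3 = unicodedata.normalize("NFKC", step2)
    return step3 != ch

# U+0E33 changes under NFKC, so it fails the test; U+0E32 does not.
for cp in (0x0E33, 0x0E32, 0x0041, 0x0061, 0x00B2):
    print(f"U+{cp:04X} unstable: {is_unstable(cp)}")
```
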
>Our current thinking is that, in addition to the current steps in 
>the mapping guide, we would apply compatibility mappings to arrive 
>at a U-label. These should only be applied when the DISALLOWED 
>NFC-form code point has General_Category(cp) in {Ll, Lu, Lo, Nd, Lm, 
>Mn, Mc} (i.e. identical to the protocol).  A further restriction 
>could be that the mapping has to be a compatibility decomposition, 
>and not one that arises from superscript, subscript, font or other 
>such mappings.
>
>There are several cases we have identified where this could be 
>required and there may well be more:
>
>0675 ARABIC LETTER HIGH HAMZA ALEF
>0678 ARABIC LETTER HIGH HAMZA YEH
>FB4F HEBREW LIGATURE ALEF LAMED
>0EB3 LAO VOWEL SIGN AM
>0F77 TIBETAN VOWEL SIGN VOCALIC RR
>
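
To make the proposal concrete, here is a rough sketch of the filter as I read it, again in Python. The names ALLOWED_CATEGORIES, EXCLUDED_TAGS and proposed_mapping are hypothetical. The set of excluded tags is my guess at what "superscript, subscript, font" means in terms of decomposition tags; the "other mappings" Chris mentions are not specified, so they are not covered here.

```python
import unicodedata

# General categories named in the proposal.
ALLOWED_CATEGORIES = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}
# Assumption: decomposition tags that the proposal wants left alone.
EXCLUDED_TAGS = {"<super>", "<sub>", "<font>"}

def proposed_mapping(cp):
    """Return the NFKC form of cp if the proposed filter would map it, else None."""
    ch = chr(cp)
    if unicodedata.category(ch) not in ALLOWED_CATEGORIES:
        return None
    decomp = unicodedata.decomposition(ch)
    # Compatibility decompositions start with a "<tag>"; canonical ones
    # (and characters with no decomposition) do not.
    if not decomp.startswith("<"):
        return None
    if decomp.split()[0] in EXCLUDED_TAGS:
        return None
    return unicodedata.normalize("NFKC", ch)

# The Thai character plus the five cases listed above.
for cp in (0x0E33, 0x0675, 0x0678, 0xFB4F, 0x0EB3, 0x0F77):
    mapped = proposed_mapping(cp)
    shown = ",".join(f"{ord(c):04X}" for c in mapped) if mapped else "not mapped"
    print(f"{cp:04X} {unicodedata.name(chr(cp))} -> {shown}")
```

With this reading, a superscript letter such as U+02B0 (category Lm, tagged <super>) is left unmapped, while all six code points in the loop get a mapping.
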
>What are your thoughts on this?
>
>Thanks
>
>Chris

-- 
Pete Resnick <http://www.qualcomm.com/~presnick/>
Qualcomm Incorporated - Direct phone: (858)651-4478, Fax: (858)651-1102