Thai Codepoint U+0E33
Pete Resnick
presnick at qualcomm.com
Wed Jul 29 09:54:42 CEST 2009
Would one of the other Unicode folks on the list take a shot at this
question, since Mark seems uninterested in actually answering the
question asked:
On 7/28/09 at 11:15 PM +1000, Chris Wright wrote:
>...find out what's the deal with this code point:
>
>U+0E33 THAI CHARACTER SARA AM is not closed under NFKC.
>
>NFC(U+0E33) = U+0E33
>NFKC(U+0E33) = U+0E4D,U+0E32
>
>NFC(U+0E4D,U+0E32) = U+0E4D,U+0E32
>NFKC(U+0E4D,U+0E32) = U+0E4D,U+0E32
>
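
The four results quoted above can be reproduced with Python's standard unicodedata module. This is only a minimal sketch: the helper name codepoints is made up for illustration, and the results depend on the Unicode version bundled with the Python build in use.

```python
import unicodedata

def codepoints(s):
    """Format a string as a list of U+XXXX code point labels."""
    return [f"U+{ord(ch):04X}" for ch in s]

sara_am = "\u0E33"       # THAI CHARACTER SARA AM (single code point)
pair = "\u0E4D\u0E32"    # the two-code-point sequence U+0E4D, U+0E32

# Show what each normalization form does to both spellings.
for form in ("NFC", "NFKC"):
    for text in (sara_am, pair):
        result = unicodedata.normalize(form, text)
        print(form, codepoints(text), "->", codepoints(result))
```
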
>This character is used, for example, on the end of the Thai word
>'gold'. [http://www.thai-language.com/id/131373#def3]
>
>However, if you use a Windows machine with a Thai keyboard and hit
>the 'E' key, it enters that single code point (not the two separate
>code points). So unless someone knows to map 0E33 to 0E4D,0E32, it is
>not possible to type the word 'gold' in Thai as a U-label. The
>current mappings document (i.e., saying apply NFC) does not help turn
>the string into a valid U-label, even though that is possible (i.e.,
>there is a sequence of PVALID code points that produces the same
>string on the screen). I hope I am explaining this clearly. Is there
>a reason why U+0E4D,U+0E32 does not compose to U+0E33 under NFC?
>
>Do you know of any other cases like this? Because in this case we
>need to apply NFKC to the user input to convert it into a valid
>U-Label, which the mapping document currently doesn't deal with.
>
>Overall, the concern here is that width and case mappings alone leave
>a few holes. The character above is disallowed because it is
>unstable (toNFKC(toCaseFold(toNFKC(cp))) != cp); however, it can be
>typed on a keyboard with one keystroke. We used the Unicode
>normalization test to determine which characters are DISALLOWED in
>NFC yet PVALID in NFKC, which identifies many code points. However,
>applying NFKC across the board leads to unwanted outcomes, such as
>mapping superscript characters etc. into something that is PVALID.
>
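
The stability test mentioned above can be approximated in a few lines of Python. This is a sketch under assumptions: the function name is_unstable is hypothetical, and str.casefold() is used as a stand-in for toCaseFold, which may not match the exact folding the protocol tables were built with.

```python
import unicodedata

def is_unstable(cp):
    """True if toNFKC(toCaseFold(toNFKC(cp))) differs from cp itself."""
    ch = chr(cp)
    step1 = unicodedata.normalize("NFKC", ch)
    step2 = step1.casefold()   # assumption: approximates toCaseFold
    step3 = unicodedata.normalize("NFKC", step2)
    return step3 != ch

# U+0E33 changes under NFKC, so it fails the stability test.
print("U+0E33 unstable:", is_unstable(0x0E33))

# Applying NFKC to everything also rewrites characters such as
# U+00B2 SUPERSCRIPT TWO into a plain digit.
superscript_two_mapped = unicodedata.normalize("NFKC", "\u00B2")
print("NFKC(U+00B2) =", superscript_two_mapped)
```
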
>Our current thinking is that, in addition to the current steps in
>the mapping guide, we would apply compatibility mappings to arrive
>at a U-label. These should only be applied when the code point that
>is disallowed in its NFC form has a General_Category(cp) in {Ll, Lu,
>Lo, Nd, Lm, Mn, Mc} (i.e., identical to the protocol). A further
>restriction could be that the mapping has to be a compatibility
>decomposition, and not one that arises from superscript, subscript,
>font or other such mappings.
>
>There are several cases we have identified where this could be
>required and there may well be more:
>
>0675 ARABIC LETTER HIGH HAMZA ALEF
>0678 ARABIC LETTER HIGH HAMZA YEH
>FB4F HEBREW LIGATURE ALEF LAMED
>0EB3 LAO VOWEL SIGN AM
>0F77 TIBETAN VOWEL SIGN VOCALIC RR
>
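
One possible reading of the proposed restriction, applied to the code points listed above, is sketched below. The function name compat_mapping is hypothetical, and accepting only decompositions tagged <compat> in UnicodeData (which excludes <super>, <sub>, <font> and the other tagged kinds) is my interpretation of the proposal, not something it spells out.

```python
import unicodedata

# General categories named in the proposal.
ALLOWED_CATEGORIES = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}

def compat_mapping(cp):
    """Return the NFKC form of cp if it passes the filter, else None."""
    ch = chr(cp)
    if unicodedata.category(ch) not in ALLOWED_CATEGORIES:
        return None
    # decomposition() returns e.g. "<compat> 0E4D 0E32", or "" if none.
    # Assumption: only the generic <compat> tag qualifies.
    if not unicodedata.decomposition(ch).startswith("<compat>"):
        return None
    return unicodedata.normalize("NFKC", ch)

LISTED = (0x0E33, 0x0675, 0x0678, 0xFB4F, 0x0EB3, 0x0F77)

for cp in LISTED + (0x02B0, 0x00B2):   # last two are expected rejects
    mapped = compat_mapping(cp)
    shown = "rejected" if mapped is None else " ".join(
        f"U+{ord(c):04X}" for c in mapped)
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: {shown}")
```

Here U+02B0 MODIFIER LETTER SMALL H is rejected because its decomposition is tagged <super>, and U+00B2 SUPERSCRIPT TWO is rejected earlier because its general category is No.
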
>What are your thoughts on this?
>
>Thanks
>
>Chris
--
Pete Resnick <http://www.qualcomm.com/~presnick/>
Qualcomm Incorporated - Direct phone: (858)651-4478, Fax: (858)651-1102