Thai Codepoint U+0E33

Patrik Fältström patrik at frobbit.se
Wed Jul 29 10:16:47 CEST 2009


Some characters change properties when run through normalization.
Another example is U+0140 LATIN SMALL LETTER L WITH MIDDLE DOT, which is
Ll. NFKC normalizes it into a pair of codepoints, one of which is U+00B7
MIDDLE DOT, which is Po.

Unless the lack of coffee is making me misread my own data.
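
For anyone who wants to check this, a minimal sketch using Python's
standard unicodedata module. The helper name categories_after is made
up for illustration, and the results depend on the Unicode version
shipped with the interpreter.

```python
import unicodedata

def categories_after(form, ch):
    """Return (code point, General_Category) pairs for ch after normalization."""
    normalized = unicodedata.normalize(form, ch)
    return [(f"U+{ord(c):04X}", unicodedata.category(c)) for c in normalized]

# Category of the original code point, then of each code point in its NFKC form.
print(unicodedata.category("\u0140"))
print(categories_after("NFKC", "\u0140"))
```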

    Patrik

On 29 jul 2009, at 09.54, Pete Resnick wrote:

> Would one of the other Unicode folks on the list take a shot at this
> question since Mark seems uninterested in actually answering the
> question asked:
>
> On 7/28/09 at 11:15 PM +1000, Chris Wright wrote:
>
>> ...find out what's the deal with this code point:
>>
>> U+0E33 THAI CHARACTER SARA AM is not closed under NFKC.
>>
>> NFC(U+0E33) = U+0E33
>> NFKC(U+0E33) = U+0E4D,U+0E32
>>
>> NFC(U+0E4D,U+0E32) = U+0E4D,U+0E32
>> NFKC(U+0E4D,U+0E32) = U+0E4D,U+0E32
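
The four results above can be reproduced with a short sketch, assuming
Python's standard unicodedata module (the helper cps is hypothetical,
only there for formatting). The last line prints the decomposition
field from the character database; its <compat> tag marks the mapping
as a compatibility one, which NFC neither applies nor reverses.

```python
import unicodedata

def cps(s):
    """Format a string as a comma-separated list of U+XXXX code points."""
    return ",".join(f"U+{ord(c):04X}" for c in s)

sara_am = "\u0e33"      # THAI CHARACTER SARA AM
pair = "\u0e4d\u0e32"   # THAI CHARACTER NIKHAHIT + THAI CHARACTER SARA AA

for label, text in (("U+0E33", sara_am), ("U+0E4D,U+0E32", pair)):
    for form in ("NFC", "NFKC"):
        print(f"{form}({label}) = {cps(unicodedata.normalize(form, text))}")

# Decomposition field for U+0E33, including its tag.
print(unicodedata.decomposition(sara_am))
```
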
>>
>> This character is used, for example, on the end of the Thai word
>> 'gold'. [http://www.thai-language.com/id/131373#def3]
>>
>> However if you use a Windows machine with a Thai keyboard and hit
>> the 'E' key, it puts in that code point (not the two separate code
>> points), thus it would not be possible, without someone knowing to
>> map 0E33 to 0E4D,0E32 to type the word gold in Thai as a
>> U-label.  The current mappings document (i.e. saying apply NFC) does
>> not help turn the string into a valid U-label even though it is
>> possible (i.e. there is a sequence of PVALID code points that produce
>> the same string on the screen). I hope I am explaining this clearly.
>> Is there a reason why U+0E4D,U+0E32 don't NFC to U+0E33?
>>
>> Do you know of any other cases like this? Because in this case we
>> need to apply NFKC to the user input to convert it into a valid
>> U-Label, which the mapping document currently doesn't deal with.
>>
>> Overall the concern here is that width and case mappings alone leave
>> a few holes. The character above is disallowed because it is
>> unstable (toNFKC(toCaseFold(toNFKC(cp))) != cp); however, it can be
>> typed on a keyboard with one keystroke. We used the Unicode
>> normalization test to determine which characters are DISALLOWED in
>> NFC yet PVALID in NFKC which identifies many code points.  However,
>> blanketly applying NFKC leads to unwanted outcomes, such as mapping
>> superscript characters etc into something that is PVALID.
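
The stability test quoted above can be sketched as follows. This is an
approximation, assuming Python's str.casefold() is an acceptable
stand-in for toCaseFold; the function name is_unstable is made up for
illustration and is not taken from any IDNA document.

```python
import unicodedata

def is_unstable(ch):
    """True if toNFKC(toCaseFold(toNFKC(ch))) differs from ch."""
    step1 = unicodedata.normalize("NFKC", ch)
    step2 = step1.casefold()  # assumption: stands in for toCaseFold
    step3 = unicodedata.normalize("NFKC", step2)
    return step3 != ch

# SARA AM, SUPERSCRIPT TWO, and a plain lowercase letter for contrast.
for ch in ("\u0e33", "\u00b2", "a"):
    print(f"U+{ord(ch):04X}", is_unstable(ch))
```

Both U+0E33 and U+00B2 fail the test, which is why applying NFKC to
everything that fails it would also map superscripts.
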
>>
>> Our current thoughts are that in addition to the current steps in
>> the mapping guide, we would apply compatibility mappings to arrive
>> at a U-label. These should only be applied when the disallowed
>> NFC-form code point has General_Category(cp) in {Ll, Lu,
>> Lo, Nd, Lm, Mn, Mc} (i.e. identical to the protocol).  A further
>> restriction on this could be that it has to be a compatibility
>> decomposition, and not one that is due to superscript, subscript,
>> font or other mappings.
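
A rough sketch of the filter proposed above, assuming Python's
unicodedata module. The names ALLOWED_CATEGORIES, EXCLUDED_TAGS,
compat_tag and is_mapping_candidate are hypothetical. Only the three
tags named in the text are excluded, since "other mappings" is not
spelled out, and the precondition that the code point is DISALLOWED is
not modelled here.

```python
import unicodedata

ALLOWED_CATEGORIES = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}
EXCLUDED_TAGS = {"<super>", "<sub>", "<font>"}  # assumption: only these three

def compat_tag(ch):
    """Return the compatibility tag (e.g. '<compat>') of ch, or None."""
    decomp = unicodedata.decomposition(ch)
    if decomp.startswith("<"):
        return decomp.split()[0]
    return None  # no decomposition, or a canonical (untagged) one

def is_mapping_candidate(ch):
    """True if ch passes the category test and has a non-excluded compat tag."""
    if unicodedata.category(ch) not in ALLOWED_CATEGORIES:
        return False
    tag = compat_tag(ch)
    return tag is not None and tag not in EXCLUDED_TAGS

# SARA AM, MODIFIER LETTER CAPITAL A (<super>), SUPERSCRIPT TWO (category No).
for ch in ("\u0e33", "\u1d2c", "\u00b2"):
    print(f"U+{ord(ch):04X}", is_mapping_candidate(ch))
```
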
>>
>> There are several cases we have identified where this could be
>> required and there may well be more:
>>
>> 0675 ARABIC LETTER HIGH HAMZA ALEF
>> 0678 ARABIC LETTER HIGH HAMZA YEH
>> FB4F HEBREW LIGATURE ALEF LAMED
>> 0EB3 LAO VOWEL SIGN AM
>> 0F77 TIBETAN VOWEL SIGN VOCALIC RR
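
To see what each of the five code points listed above turns into under
NFKC, a small sketch assuming Python's unicodedata module (the names
CANDIDATES and nfkc_mapping are made up for illustration):

```python
import unicodedata

CANDIDATES = (0x0675, 0x0678, 0xFB4F, 0x0EB3, 0x0F77)

def nfkc_mapping(cp):
    """Return the NFKC form of one code point as a list of U+XXXX strings."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFKC", chr(cp))]

# Print each code point, its character name, and its NFKC sequence.
for cp in CANDIDATES:
    name = unicodedata.name(chr(cp))
    print(f"{cp:04X} {name} -> {' '.join(nfkc_mapping(cp))}")
```
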
>>
>> What are your thoughts on this?
>>
>> Thanks
>>
>> Chris
>
> -- 
> Pete Resnick <http://www.qualcomm.com/~presnick/>
> Qualcomm Incorporated - Direct phone: (858)651-4478, Fax:  
> (858)651-1102
