Thai Codepoint U+0E33

Mark Davis ⌛ mark at macchiato.com
Tue Jul 28 19:15:44 CEST 2009


The mapping has quite a number of flaws, but now that it looks like it will
not be normative, I don't think it is worth spending any further time on
it.

My recommendation (for Google and other companies) is to use a more robust
mapping that maintains as much backwards compatibility as possible, which is
to map non-PVALID/CONTEXTx characters via the Unicode property
NFKC_Casefold.
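
As a rough illustration of that mapping, here is a minimal Python sketch. It
only approximates NFKC_Casefold with the standard library: it alternates case
folding and NFKC until the string stops changing, and it does not remove
default-ignorable code points as the real property does. The function name
nfkc_casefold_approx is hypothetical, and the results depend on the Unicode
version bundled with the Python build.

```python
import unicodedata


def nfkc_casefold_approx(text: str) -> str:
    """Approximate the NFKC_Casefold mapping (illustrative sketch only).

    Assumption: repeating casefold + NFKC until the string is stable is
    close enough for illustration. Unlike the real Unicode property, this
    does NOT strip default-ignorable code points.
    """
    for _ in range(4):  # small bound; stabilises in one or two passes
        mapped = unicodedata.normalize("NFKC", text.casefold())
        if mapped == text:
            break
        text = mapped
    return text


if __name__ == "__main__":
    # U+0E33 THAI CHARACTER SARA AM, as typed with a single keystroke
    result = nfkc_casefold_approx("\u0e33")
    print(" ".join(f"U+{ord(ch):04X}" for ch in result))
```

A real implementation would more likely read the NFKC_CF data from the
Unicode Character Database or use a library that exposes the property
directly.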

Mark


On Tue, Jul 28, 2009 at 06:15, Chris Wright <chris at ausregistry.com.au> wrote:

> Mark,
>
> It was suggested by Pete that, as the resident Unicode expert, we email you
> to find out what’s the deal with this code point:
>
> U+0E33 THAI CHARACTER SARA AM is not closed under NFKC.
>
> NFC(U+0E33) = U+0E33
> NFKC(U+0E33) = U+0E4D,U+0E32
>
> NFC(U+0E4D,U+0E32) = U+0E4D,U+0E32
> NFKC(U+0E4D,U+0E32) = U+0E4D,U+0E32
>
> This character is used, for example, on the end of the Thai word ‘gold’. [
> http://www.thai-language.com/id/131373#def3]
>
> However, if you use a Windows machine with a Thai keyboard and hit the 'E'
> key, it enters that single code point (not the two separate code points).
> Thus it would not be possible to type the word gold in Thai as a U-label
> unless something knows to map 0E33 to 0E4D,0E32. The current mappings
> document (i.e. saying apply NFC) does not help turn the string into a valid
> U-label, even though that is possible (i.e. there is a sequence of PVALID
> code points that produces the same string on the screen). I hope I am
> explaining this clearly. Is there a reason why U+0E4D,U+0E32 don’t NFC to
> U+0E33?
>
> Do you know of any other cases like this? In this case we need to apply
> NFKC to the user input to convert it into a valid U-label, which the
> mapping document currently doesn’t deal with.
>
> Overall, the concern here is that width and case mappings alone leave a few
> holes. The character above is disallowed because it is unstable
> (toNFKC(toCaseFold(toNFKC(cp))) != cp), yet it can be typed on a keyboard
> with one keystroke. We used the Unicode normalization test to determine
> which characters are DISALLOWED in NFC yet PVALID in NFKC, which identifies
> many code points.  However, applying NFKC across the board leads to unwanted
> outcomes, such as mapping superscript characters etc. into something that is
> PVALID.
>
> Our current thoughts are that, in addition to the current steps in the
> mapping guide, we would apply compatibility mappings to arrive at a U-label.
> These should only be applied when the disallowed NFC-form code point has a
> General_Category in {Ll, Lu, Lo, Nd, Lm, Mn, Mc} (i.e. identical to the
> protocol).  Further restrictions on this could be that it has to be a
> compatibility decomposition, and not one that arises from superscript,
> subscript, font or other mappings.
>
> There are several cases we have identified where this could be required and
> there may well be more:
>
> 0675 ARABIC LETTER HIGH HAMZA ALEF
> 0678 ARABIC LETTER HIGH HAMZA YEH
> FB4F HEBREW LIGATURE ALEF LAMED
> 0EB3 LAO VOWEL SIGN AM
> 0F77 TIBETAN VOWEL SIGN VOCALIC RR
>
> What are your thoughts on this?
>
> Thanks
>
> Chris
>
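
For anyone who wants to reproduce the observations in the quoted message,
here is a minimal Python sketch using only the standard unicodedata module.
It checks the normal forms given for U+0E33, applies the stability test
toNFKC(toCaseFold(toNFKC(cp))) != cp, and applies the proposed filter
(General_Category in the listed set, a compatibility decomposition, and not a
superscript, subscript or font mapping). The helper names is_unstable and
is_candidate are hypothetical, the tag exclusion list covers only the three
mappings named explicitly in the message, and the results depend on the
Unicode version bundled with the Python build.

```python
import unicodedata

# General_Category values named in the proposal
ALLOWED_CATEGORIES = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}
# Decomposition tags the proposal suggests excluding (assumption: only the
# three named explicitly; "other mappings" is left unspecified in the message)
EXCLUDED_TAGS = {"<super>", "<sub>", "<font>"}


def is_unstable(cp: str) -> bool:
    """True if toNFKC(toCaseFold(toNFKC(cp))) != cp."""
    nfkc = unicodedata.normalize("NFKC", cp)
    return unicodedata.normalize("NFKC", nfkc.casefold()) != cp


def is_candidate(cp: str) -> bool:
    """Sketch of the proposed filter for applying a compatibility mapping."""
    decomp = unicodedata.decomposition(cp)
    # A compatibility decomposition starts with a tag such as "<compat>";
    # canonical decompositions and undecomposable characters have no tag.
    tag = decomp.split()[0] if decomp.startswith("<") else None
    return (
        unicodedata.category(cp) in ALLOWED_CATEGORIES
        and tag is not None
        and tag not in EXCLUDED_TAGS
        and is_unstable(cp)
    )


if __name__ == "__main__":
    for cp in ("\u0e33", "\u0675", "\u0678", "\ufb4f", "\u0eb3", "\u0f77"):
        nfkc = unicodedata.normalize("NFKC", cp)
        mapped = ",".join(f"U+{ord(ch):04X}" for ch in nfkc)
        print(f"U+{ord(cp):04X} {unicodedata.name(cp)}: "
              f"candidate={is_candidate(cp)} NFKC={mapped}")
```

The tag check is what keeps a character such as U+02B0 MODIFIER LETTER SMALL
H (category Lm, superscript mapping) out of the candidate set even though its
General_Category is in the allowed list.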


More information about the Idna-update mailing list