The mapping has quite a number of flaws, but now that it looks like it will not be normative, I don&#39;t think it is worth spending any further time on it.<br><br>My recommendation (for Google and other companies) is to use a more robust mapping that maintains as much backwards compatibility as possible: map non-PVALID/CONTEXTx characters via the Unicode property NFKC_Casefold.<br>

<br clear="all">Mark<br>

<br><br><div class="gmail_quote">On Tue, Jul 28, 2009 at 06:15, Chris Wright <span dir="ltr">&lt;<a href="mailto:chris@ausregistry.com.au">chris@ausregistry.com.au</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

Mark,<br>

 <br>

It was suggested by Pete that, as the resident Unicode expert, we email you to find out what’s the deal with this code point:<br>

 <br>

U+0E33 THAI CHARACTER SARA AM is not closed under NFKC.<br>

 <br>

NFC(U+0E33) = U+0E33<br>

NFKC(U+0E33) = U+0E4D,U+0E32<br>

 <br>

NFC(U+0E4D,U+0E32) = U+0E4D,U+0E32<br>

NFKC(U+0E4D,U+0E32) = U+0E4D,U+0E32<br>
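
 <br>

The four results above can be reproduced with a short script. This is only a sketch, assuming Python&#39;s standard unicodedata module; the helper name codepoints is made up for illustration.<br>

```python
import unicodedata

def codepoints(s):
    """Format a string as a comma-separated list of U+XXXX code points."""
    return ",".join(f"U+{ord(c):04X}" for c in s)

sara_am = "\u0e33"           # THAI CHARACTER SARA AM
decomposed = "\u0e4d\u0e32"  # THAI CHARACTER NIKHAHIT + THAI CHARACTER SARA AA

# Apply both normalization forms to both spellings and print the results.
for form in ("NFC", "NFKC"):
    for s in (sara_am, decomposed):
        result = unicodedata.normalize(form, s)
        print(f"{form}({codepoints(s)}) = {codepoints(result)}")
```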

 <br>

This character is used, for example, at the end of the Thai word for ‘gold’. [<a href="http://www.thai-language.com/id/131373#def3" target="_blank">http://www.thai-language.com/id/131373#def3</a>]<br>

 <br>

However, if you use a Windows machine with a Thai keyboard and hit the &#39;E&#39; key, it inserts that single code point (not the two separate code points). It would therefore not be possible to type the word gold in Thai as a U-label unless someone knows to map 0E33 to 0E4D,0E32. The current mappings document (i.e. saying apply NFC) does not help turn the string into a valid U-label, even though that is possible (i.e. there is a sequence of PVALID code points that produces the same string on the screen). I hope I am explaining this clearly. Is there a reason why U+0E4D,U+0E32 doesn’t compose to U+0E33 under NFC?<br>


 <br>

Do you know of any other cases like this? In this case we need to apply NFKC to the user input to convert it into a valid U-label, which the mapping document currently doesn’t deal with.<br>

 <br>

Overall, the concern here is that width and case mappings alone leave a few holes. The character above is disallowed because it is unstable (toNFKC(toCaseFold(toNFKC(cp))) != cp); however, it can be typed on a keyboard with one keystroke. We used the Unicode normalization test to determine which characters are DISALLOWED in NFC yet PVALID in NFKC, which identified many code points. However, applying NFKC across the board leads to unwanted outcomes, such as mapping superscript characters, etc., into something that is PVALID.<br>
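
 <br>

The stability test quoted above can be sketched as follows. This is an approximation under stated assumptions: it uses Python&#39;s unicodedata module, it treats str.casefold() (full case folding) as a stand-in for toCaseFold, and the function name is_unstable is made up for illustration.<br>

```python
import unicodedata

def is_unstable(cp):
    """Return True if toNFKC(toCaseFold(toNFKC(cp))) differs from cp.

    Assumption: str.casefold() is used as the case-folding step.
    """
    ch = chr(cp)
    folded = unicodedata.normalize("NFKC", ch).casefold()
    return unicodedata.normalize("NFKC", folded) != ch

# SARA AM, SARA AA, LATIN CAPITAL LETTER A, SUPERSCRIPT TWO
for cp in (0x0E33, 0x0E32, 0x0041, 0x00B2):
    name = unicodedata.name(chr(cp))
    print(f"U+{cp:04X} {name}: unstable={is_unstable(cp)}")
```

Note that the superscript digit is reported as unstable by the same test as SARA AM, which is why the test alone cannot tell a wanted mapping from an unwanted one.<br>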


 <br>

Our current thinking is that, in addition to the current steps in the mapping guide, we would apply compatibility mappings to arrive at a U-label. These should only be applied when the code point that is disallowed in NFC form satisfies General_Category(cp) in {Ll, Lu, Lo, Nd, Lm, Mn, Mc} (i.e. identical to the protocol). A further restriction could be that the mapping has to be a compatibility decomposition, and not one that arises from superscript, subscript, font or other such mappings.<br>
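
 <br>

A rough sketch of the filter described above, assuming Python&#39;s unicodedata module. Two things are assumptions rather than part of the proposal as written: the restriction is read as &quot;the decomposition tag is exactly compat&quot; (which also excludes tags such as super, sub, font and wide), and the check that the code point is actually DISALLOWED is left out because the standard library has no IDNA2008 tables. The function name is made up for illustration.<br>

```python
import unicodedata

ALLOWED_CATEGORIES = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}

def compat_mapping_candidate(cp):
    """Return the NFKC mapping of cp if it passes the filter, else None."""
    ch = chr(cp)
    if unicodedata.category(ch) not in ALLOWED_CATEGORIES:
        return None
    decomp = unicodedata.decomposition(ch)
    if not decomp:
        return None  # no decomposition at all
    # For tagged decompositions the first field is the tag wrapped in angle
    # brackets; slicing off the first and last character leaves the tag name.
    # Canonical decompositions start with a hex code point and never match.
    tag = decomp.split()[0][1:-1]
    if tag != "compat":
        return None
    return unicodedata.normalize("NFKC", ch)

# SARA AM, MODIFIER LETTER SMALL A, FULLWIDTH A, A WITH RING ABOVE
for cp in (0x0E33, 0x1D43, 0xFF21, 0x00C5):
    mapped = compat_mapping_candidate(cp)
    shown = None if mapped is None else " ".join(f"{ord(c):04X}" for c in mapped)
    print(f"{cp:04X} {unicodedata.name(chr(cp))}: {shown}")
```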


 <br>

There are several cases we have identified where this could be required and there may well be more:<br>

 <br>

0675 ARABIC LETTER HIGH HAMZA ALEF<br>

0678 ARABIC LETTER HIGH HAMZA YEH<br>

FB4F HEBREW LIGATURE ALEF LAMED<br>

0EB3 LAO VOWEL SIGN AM<br>

0F77 TIBETAN VOWEL SIGN VOCALIC RR<br>
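
 <br>

For reference, the behaviour of the five code points listed above can be inspected with a short script. This is a sketch assuming Python&#39;s unicodedata module; the helper name describe is made up for illustration.<br>

```python
import unicodedata

# The five code points listed above.
CANDIDATES = (0x0675, 0x0678, 0xFB4F, 0x0EB3, 0x0F77)

def describe(cp):
    """Collect the name, NFC stability and NFKC mapping of one code point."""
    ch = chr(cp)
    nfc = unicodedata.normalize("NFC", ch)
    nfkc = unicodedata.normalize("NFKC", ch)
    return {
        "name": unicodedata.name(ch),
        "nfc_unchanged": nfc == ch,
        "nfkc": " ".join(f"{ord(c):04X}" for c in nfkc),
    }

for cp in CANDIDATES:
    info = describe(cp)
    print(f"{cp:04X} {info['name']}: "
          f"NFC unchanged={info['nfc_unchanged']}, NFKC = {info['nfkc']}")
```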

 <br>

What are your thoughts on this?<br>

 <br>

Thanks<br>

<font color="#888888"> <br>

Chris<br>

</font></blockquote></div><br>