Some clarifications. <br><br>1. It appears that you may think that NFKC does <span style="font-style: italic;">not </span>allow combining marks; however, it only forbids sequences that could be expressed with a precomposed (combined) form, with a few exceptions. Thus:
<br><br>A + acute is forbidden in NFKC<br>X + cedilla is not forbidden in NFKC<br><br>See <a href="http://www.unicode.org/reports/tr15/#Primary_Exclusion_List_Table">http://www.unicode.org/reports/tr15/#Primary_Exclusion_List_Table
</a><br><br>2. Unicode composition and decomposition are not based on visual confusability (referring to your memo on Gujarati). For example, "m" does not decompose to "rn" even though those two sequences are visually confusable (at address-box sizes in common fonts they look the same). Nor are they simply based on origin: "w" does not decompose to "vv". For more examples, see
<a href="http://unicode.org/charts/normalization/">http://unicode.org/charts/normalization/</a><br><br>Visual similarity is much broader than Unicode composition and decomposition. See <a href="http://www.unicode.org/reports/tr39/#Confusable_Detection">
http://www.unicode.org/reports/tr39/#Confusable_Detection</a><br><br>Baking visual similarity into the protocol would be a real problem for many, many languages: it would be the equivalent of disallowing the use of the letter "m" in English.
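<br><br>As a rough illustration of both points, here is a minimal sketch in Python (my choice of language for the example). It assumes only the standard unicodedata module, whose results follow whatever Unicode version the interpreter ships with. The helper names nfkc and is_nfkc are hypothetical, introduced just for this example:

```python
import unicodedata


def nfkc(s):
    """Return the NFKC form of s (hypothetical helper for this example)."""
    return unicodedata.normalize("NFKC", s)


def is_nfkc(s):
    """True if s is already in NFKC, i.e. normalizing it changes nothing."""
    return nfkc(s) == s


# Point 1: NFKC only rejects sequences that have a precomposed equivalent.
a_acute = "A\u0301"    # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT
x_cedilla = "X\u0327"  # LATIN CAPITAL LETTER X + COMBINING CEDILLA

composed = nfkc(a_acute)  # becomes the single code point U+00C1
kept = nfkc(x_cedilla)    # no precomposed form exists, so it is unchanged

print(is_nfkc(a_acute))    # False: the sequence is not valid NFKC as typed
print(is_nfkc(x_cedilla))  # True: the combining mark is perfectly fine here

# Point 2: decomposition has nothing to do with visual confusability.
m_nfkd = unicodedata.normalize("NFKD", "m")  # stays "m", never "rn"
w_nfkd = unicodedata.normalize("NFKD", "w")  # stays "w", never "vv"

print(m_nfkd == "m", w_nfkd == "w")  # True True
```

The first pair shows that NFKC changes only the sequence that has a precomposed equivalent. The second shows that normalization leaves "m" and "w" alone, so it cannot stand in for confusable detection.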
<br><br>Mark<br><br><div><span class="gmail_quote">On 11/26/06, <b class="gmail_sendername">Sam Vilain</b> &lt;<a href="mailto:sam.vilain@catalyst.net.nz">sam.vilain@catalyst.net.nz</a>&gt; wrote:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
John C Klensin wrote:<br>>> But doesn't è decompose to a sequence including that mark?<br>>><br>> I may miss your point but, if I don't, that is one of the<br>> reasons we have used NFKC, rather than NFKD, all along.
<br>><br><br>Oh, right :-}. Funny how little details like that can be missed. I<br>thought it happened the other way around.<br><br>This is a bit of a problem. The Indic scripts must be able to use their<br>combining marks/vowel signs; they don't have a rich enough set of
<br>pre-composed characters to write their language. And if romanised forms<br>of African languages need compositions which are not already there, then<br>they will never work.<br><br>This might need to wait for the next version, but it should be possible
<br>to permit combining characters without breaking backwards compatibility<br>or losing the intent of this specification, you'd need to:<br><br>1. be able to classify combining marks with their target scripts, to<br>make sure that you're not trying to combine a Latin diacritical mark
<br>with a Chinese ideograph (etc)<br><br>2. disallow combining marks except in places where they're expected<br><br>3. standardise on the NFKD form, except for where a pre-composed form<br>exists.<br><br>It's ugly, but any tidier suggestions that don't exclude >25% of the
<br>world's population? :)<br><br>--<br>Sam Vilain, Systems Architect, Catalyst IT (NZ) Ltd.<br>phone: +64 4 499 2267 PGP ID: 0x66B25843<br></blockquote></div>