Combining accents

Mark Davis markdavis at google.com
Mon Nov 27 17:11:22 CET 2006


Some clarification.

1. It appears that you may think that NFKC forbids combining marks
altogether; however, it only forbids sequences that could be expressed with
a combined (precomposed) form, with a few exceptions. Thus:

A + acute is forbidden in NFKC
X + cedilla is not forbidden in NFKC

See http://www.unicode.org/reports/tr15/#Primary_Exclusion_List_Table
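
The two cases above can be checked with Python's standard unicodedata
module. This is only a minimal sketch: the helper name is_nfkc is made up
for illustration, and the result depends on the Unicode data shipped with
the Python version in use.

```python
import unicodedata


def is_nfkc(s: str) -> bool:
    """Return True if s is left unchanged by NFKC normalization."""
    return unicodedata.normalize("NFKC", s) == s


a_acute = "A\u0301"    # A + COMBINING ACUTE ACCENT
x_cedilla = "X\u0327"  # X + COMBINING CEDILLA

# A + acute has a precomposed form (U+00C1), so the sequence is not in NFKC.
print(is_nfkc(a_acute))

# X + cedilla has no precomposed form, so the sequence is already in NFKC.
print(is_nfkc(x_cedilla))
```

In other words, the combining mark itself is acceptable; what NFKC replaces
is a sequence that has a composed equivalent.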

2. Unicode composition and decomposition are not based on visual
confusability (referring to your memo on Gujarati). For example, "m" does
not decompose to "rn" even though those two sequences are visually
confusable (at address box sizes in common fonts they look the same). Nor are
they simply based on origin: "w" does not decompose to "vv". For more
examples, see http://unicode.org/charts/normalization/
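
As a rough illustration, the same standard unicodedata module shows that
"m" and "w" have no decomposition mapping at all, while a precomposed
letter such as e with grave does. The helper name has_decomposition is an
assumption made for this sketch.

```python
import unicodedata


def has_decomposition(ch: str) -> bool:
    """Return True if the character has a decomposition mapping."""
    # decomposition() returns an empty string when no mapping is defined.
    return unicodedata.decomposition(ch) != ""


# "m" and "w" look like "rn" and "vv" but are not decomposed by Unicode.
for ch in ("m", "w"):
    print(ch, has_decomposition(ch), unicodedata.normalize("NFKD", ch) == ch)

# e with grave (U+00E8) decomposes to "e" followed by a combining grave.
e_grave = "\u00e8"
decomposed = unicodedata.normalize("NFD", e_grave)
print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+0065', 'U+0300']
```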

Visual similarity is much broader than Unicode composition and
decomposition. See http://www.unicode.org/reports/tr39/#Confusable_Detection
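
To make the difference concrete, here is a toy sketch of confusable
detection as a layer separate from normalization. The two-entry table and
the names TOY_CONFUSABLES, toy_skeleton and look_alike are hypothetical;
the real data and algorithm are defined in the report linked above, which
also involves normalization steps omitted here.

```python
# Hypothetical two-entry table for illustration only; this is NOT the
# confusables data published with the Unicode report.
TOY_CONFUSABLES = {"m": "rn", "w": "vv"}


def toy_skeleton(s: str) -> str:
    """Map each character to its look-alike sequence, if the table has one."""
    return "".join(TOY_CONFUSABLES.get(ch, ch) for ch in s)


def look_alike(a: str, b: str) -> bool:
    """Treat two strings as confusable when their toy skeletons match."""
    return toy_skeleton(a) == toy_skeleton(b)


# Different strings, and normalization leaves both unchanged,
# yet the toy table flags them as visually confusable.
print(look_alike("modem", "modern"))
print(look_alike("modem", "model"))
```

A check like this can flag suspicious pairs without forbidding the letter
"m" itself, which is the point of keeping it out of the normalization step.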

Baking visual similarity into the protocol would be a real problem for many,
many languages: it would be the equivalent of disallowing the use of the
letter "m" in English.

Mark

On 11/26/06, Sam Vilain <sam.vilain at catalyst.net.nz> wrote:
>
> John C Klensin wrote:
> >> But doesn't è decompose to a sequence including that mark?
> >>
> > I may miss your point but, if I don't, that is one of the
> > reasons we have used NFKC, rather than NFKD, all along.
> >
>
> Oh, right :-}.  Funny how little details like that can be missed.  I
> thought it happened the other way around.
>
> This is a bit of a problem.  The Indic scripts must be able to use their
> combining marks/vowel signs; they don't have a rich enough set of
> pre-composed characters to write their language.  And if romanised forms
> of African languages need compositions which are not already there, then
> they will never work.
>
> This might need to wait for the next version, but it should be possible
> to permit combining characters without breaking backwards compatibility
> or losing the intent of this specification, you'd need to:
>
> 1. be able to classify combining marks with their target scripts, to
> make sure that you're not trying to combine a Latin diacritical mark
> with a Chinese ideograph (etc)
>
> 2. disallow combining marks except in places where they're expected
>
> 3. standardise on the NFKD form, except for where a pre-composed form
> exists.
>
> It's ugly, but any tidier suggestions that don't exclude >25% of the
> world's population?  :)
>
> --
> Sam Vilain, Systems Architect, Catalyst IT (NZ) Ltd.
> phone: +64 4 499 2267        PGP ID: 0x66B25843
>
> _______________________________________________
> Idna-update mailing list
> Idna-update at alvestrand.no
> http://www.alvestrand.no/mailman/listinfo/idna-update
>
