FW: [centr-tech] IDNA Redux]
harald at alvestrand.no
Tue Nov 7 01:39:44 CET 2006
--On 7. november 2006 13:03 +1300 Sam Vilain <sam.vilain at catalyst.net.nz> wrote:
> One issue important to .nz, is that most of the Indic Vowel Signs are
> marked "possibly not". The Gujarati word "ગાંધી" (Gandhi)
> contains a character marked as "possibly not". Even the Gujarati word for
> "Gujarati" - ગુજરાતી - contains characters marked this way.
> How would we feel if we could not have certain Latin letters, like "o"?
> These signs look to be very similar to combining accents in Latin
> scripts. I think that someone literate in each of the scripts needs to
> sit down and for each of these signs, produce a white list of characters
> which they may follow. For instance, U+AC7 ("Gujarati Vowel Sign E")
> might only follow Gujarati letters other than U+A8D - U+A94.
This is good input. It shows that the rule that says "possibly not" to
U+AC7 is too harsh. We need to look at the characters affected by that
rule, and either allow the whole range, or find a sensible way to split
that group into subranges.
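
To make the whitelist idea above concrete, here is a minimal Python sketch. It is only an illustration: the table name ALLOWED_BEFORE and the function sign_context_ok are hypothetical, and the allowed ranges simply transcribe Sam's example (Gujarati letters U+0A85 - U+0AB9, minus U+0A8D - U+0A94). They are not a reviewed linguistic rule, and a real table would have to come from someone literate in the script.

```python
# Hypothetical context whitelist: for each restricted sign, the set of
# code points it is allowed to directly follow.
VOWEL_SIGN_E = "\u0ac7"  # GUJARATI VOWEL SIGN E

# Assumption taken from the example in the mail: Gujarati letters
# U+0A85..U+0AB9, excluding U+0A8D..U+0A94. The ranges include a few
# unassigned code points, which is harmless for this sketch.
ALLOWED_BEFORE = {
    VOWEL_SIGN_E: (
        {chr(cp) for cp in range(0x0A85, 0x0A8D)}
        | {chr(cp) for cp in range(0x0A95, 0x0ABA)}
    ),
}


def sign_context_ok(label):
    """Return True if every restricted sign follows a whitelisted character."""
    prev = None
    for ch in label:
        allowed = ALLOWED_BEFORE.get(ch)
        # A restricted sign at the start of the label (prev is None) or
        # after a character outside its whitelist fails the check.
        if allowed is not None and prev not in allowed:
            return False
        prev = ch
    return True
```

With this table, sign_context_ok("\u0a95\u0ac7") (KA followed by the sign) passes, while the sign after U+0A8F, after a Latin letter, or at the start of a label is rejected.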
> Hmm, did the normalization miss the homographs in Indic scripts? E.g.,
> "Candra O" can be written as U+A91 (ઑ) - or as U+A86, U+AC5 (આૅ).
> There doesn't seem to be anything in the Unicode database dealing with
> this... so I assume Stringprep doesn't try to re-write those characters
> at the moment.
> I think that the character-based whitelist is an incomplete approach.
> Maybe it would be better to take a word-based approach as a second
> layer. This should hopefully have the advantage of not breaking
> backwards compatibility with software that uses Stringprep for non-DNS
> things; a word-based stringprep would just be a supplemental restriction
> recommended for applications such as the DNS where avoiding confusion is
> more important than representing text correctly. Then, by default, the
> script may not vary within a word - unless the rules for that script
> permit it to be mixed with other scripts. E.g., most scripts will be
> quite happy to mix with Arabic numerals. Then, each script has its own
> rules about which sequences are legal, and you won't get people putting
> Latin accents on Indic characters.
This would be equivalent to the approach taken by the "JET Tables",
and makes sense.
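
As a rough sketch of the "script may not vary within a word" default, the Python below rejects labels that mix scripts, while letting ASCII digits mix with anything. Note the assumptions: Python's standard library does not expose the Unicode Script property, so the first word of each character's Unicode name (e.g. "LATIN", "GUJARATI") is used as a crude stand-in, and the function names rough_script and single_script are hypothetical. This is not how the JET tables work; it only illustrates the shape of a second-layer check.

```python
import unicodedata


def rough_script(ch):
    """Crude script guess: the first word of the character's Unicode name.

    This is a proxy, not the real Unicode Script property.
    """
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def single_script(label):
    """Return True if the label draws on at most one script.

    ASCII digits (names starting with "DIGIT") are ignored, so they may
    mix with any script. The label is NFC-normalized first so that, for
    example, "e" plus a combining acute becomes a single Latin character.
    """
    label = unicodedata.normalize("NFC", label)
    scripts = {rough_script(ch) for ch in label}
    scripts.discard("DIGIT")
    return len(scripts) <= 1
```

Under this proxy, a Gujarati word or "abc123" passes, while a Gujarati letter followed by a Latin letter, or by U+0301 COMBINING ACUTE ACCENT, is rejected, since the combining mark does not compose with the Gujarati base and shows up as a second pseudo-script.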
To be effective, it does require knowledge of which language a string was
intended to be drawn from (think about the æ vs ae problem) and which
script (think about Hans vs Hant). So it's a feature that can only be
applied at registration time, and requires that registries are willing to
police such a scheme.