Mapping and Variants
duerst at it.aoyama.ac.jp
Sat Mar 7 09:31:34 CET 2009
At 06:06 09/03/06, John C Klensin wrote:
>When IDNA2003 was written, no one (as far as I know) anticipated
>the need to create elaborate variant (bundling) systems to
>associate potentially-confusing labels within a zone so that
>they could be given special treatment.
Maybe the exact details weren't anticipated, but lots of
discussion surrounding the issues definitely went on well
before IDNA2003 was finalized. Whether we called it 'bundling'
or whatever else, I'm pretty sure people such as Ken and
me who were sceptical (and, as it turned out, right) on a
central, uniform solution for CJK simplified/traditional
mappings were mentioning solutions in this direction.
>For scripts with case differences, IDNA2003 also chose to
>concentrate on lower case, partially because there was better
>differentiation of those characters. It has often been
>observed, for example, that Greek lower case ("SMALL LETTER")
>alpha and beta don't look nearly enough like their Latin
>counterparts ("a" and "b") to be confusing to anyone, but that
>the capital character pairs are identical.
>Unfortunately, if one has a situation in which Greek and Latin
>scripts are considered today and chooses to use variants _and_
>has the expectation of case-mapping, GREEK SMALL LETTER ALPHA
>(U+03B1) must be treated as a variant of LATIN SMALL LETTER A
>(U+0061) because a user might be looking at the combination of
>GREEK CAPITAL LETTER ALPHA (U+0391) and LATIN CAPITAL LETTER A
>(U+0041) which map (CaseFold) into the lower case pair. That
>sort of relationship exists for a significant number of
>Latin-Greek pairs and for a much larger number of Cyrillic-Greek
>pairs. For Cyrillic, it just about doubles the number of
>variants in the table.
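
To make the quoted argument concrete, here is a minimal sketch of how case folding turns capital look-alikes into lower-case variant pairs. The table UPPERCASE_LOOKALIKES and the function implied_lowercase_variants are hypothetical names made up for illustration, and Python's str.casefold() is only a stand-in for whatever case mapping a registry actually applies.

```python
# Hypothetical table of capital letters that look identical across scripts.
# Only the two pairs mentioned in the quoted text are listed.
UPPERCASE_LOOKALIKES = [
    ("\u0391", "\u0041"),  # GREEK CAPITAL LETTER ALPHA, LATIN CAPITAL LETTER A
    ("\u0392", "\u0042"),  # GREEK CAPITAL LETTER BETA, LATIN CAPITAL LETTER B
]


def implied_lowercase_variants(pairs):
    """Case-fold both members of each pair.

    If the capitals are confusable and both fold to lower case, the
    lower-case letters end up in the variant table too, even though
    they do not look alike themselves.
    """
    return [(a.casefold(), b.casefold()) for a, b in pairs]


variants = implied_lowercase_variants(UPPERCASE_LOOKALIKES)
for greek, latin in variants:
    print(f"U+{ord(greek):04X} must be a variant of U+{ord(latin):04X}")
```

The point of the sketch is only that the variant relationship is inherited through the case mapping, not through the visual similarity of the lower-case letters.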
Is this some highly theoretical discussion, or do you actually
expect that this would be needed in practice? In my view, it
should clearly be treated as the former, but I would have
expected you to say so if you thought so.
Why do I think so? It is well accepted now that script mixing
is a bad idea, exactly because of cases such as the above.
So a label consisting of a Latin and a Greek small letter
a/alpha just doesn't make much sense to start with.
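
As a rough illustration of rejecting mixed-script labels, the following sketch guesses each letter's script from the first word of its Unicode character name. That is a heuristic assumed here for brevity (Python's standard library does not expose the Unicode Script property), and scripts_in and is_mixed_script are hypothetical helper names.

```python
import unicodedata


def scripts_in(label):
    """Return a rough set of scripts used by the letters in a label.

    Assumption: the first word of the character name ("LATIN",
    "GREEK", "CYRILLIC", ...) is good enough to tell scripts apart.
    """
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in label
        if ch.isalpha()
    }


def is_mixed_script(label):
    """True if the letters of the label come from more than one script."""
    return len(scripts_in(label)) > 1


# Latin small a followed by Greek small alpha: the case discussed above.
print(is_mixed_script("a\u03b1"))
print(is_mixed_script("abc"))
```

A registry that refuses such labels never has to decide whether Latin a and Greek alpha are variants of each other within one label.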
It is also well-known that some carefully chosen letter
combinations in one script, in particular in upper case,
are difficult or impossible to visually distinguish from
potentially completely different letter combinations in
other scripts. But these are few and far between, in particular
if they are of a certain length and contain some bits of
I would also like to point out that with your approach
above, you may not be able to stop at letter pairs. As
an example, in script fonts and handwriting, Cyrillic
Ts (both upper and lower case) may look similar to Latin
Ms, but in print fonts, Cyrillic and Latin Ms look alike.
So suddenly, you have to group Cyrillic Ts and Ms with
Latin Ms. Not sure anybody will use such a system, at
least not for Cyrillic :-(.
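
The way such groups grow can be sketched with a small union-find over look-alike pairs. The pairs below are only the ones from the example above (lower-case Cyrillic te and em against Latin m), and group_variants is a hypothetical helper, not part of any existing variant system.

```python
def group_variants(pairs):
    """Merge look-alike pairs into groups (transitive closure via union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for ch in list(parent):
        groups.setdefault(find(ch), set()).add(ch)
    return list(groups.values())


# Illustrative pairs taken from the example in the text.
pairs = [
    ("\u0442", "m"),  # CYRILLIC SMALL LETTER TE vs Latin m (script fonts, handwriting)
    ("\u043c", "m"),  # CYRILLIC SMALL LETTER EM vs Latin m (print fonts)
]
groups = group_variants(pairs)
print(groups)
```

Because the relation is treated as transitive, Cyrillic te and em land in the same group even though nobody would confuse them with each other.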
#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst at it.aoyama.ac.jp