Mapping and Variants

John C Klensin klensin at jck.com
Thu Mar 5 22:06:44 CET 2009


Hi.

In the process of working on a document that makes
recommendations about Cyrillic registrations (paralleling RFC
4713 for Chinese and the I-D for Arabic language registrations),
it was forcefully brought to my attention that there is a
tradeoff, and perhaps an actual conflict, between mapping and
JET-like variant approaches.  I hope the draft document on
Cyrillic will be posted by Monday, but it is not within the WG's
scope and I think I can explain the issue without it.

When IDNA2003 was written, no one (as far as I know) anticipated
the need to create elaborate variant (bundling) systems to
associate potentially-confusing labels within a zone so that
they could be given special treatment.  Since the publication
of RFC 3743 (the "JET Guidelines"), the practice has become more
or less widespread, although more in discussion than in
implementation.  In this WG we have repeatedly referred to
variants or bundling as an important possibility, both for
handling confusing character combinations and as part of
transitional strategies.  We have also discussed, although not
necessarily agreed upon, the issue of variant explosion, in
which having multiple variants for even a small number of
characters potentially produces more variant labels than a zone
can plausibly handle (some of the CJK registries and others deal
with that by banning some variant combinations outright rather
than bundling them into a zone).
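
To make the arithmetic of variant explosion concrete, here is a
small Python sketch.  The variant table in it is invented purely
for illustration and is not taken from RFC 3743 or from any
registry's actual table; the function names are likewise
hypothetical.  The number of variant labels is the product of the
number of alternatives at each position, so it grows
exponentially with the count of characters that have variants.

```python
from itertools import product

# Hypothetical variant table (an assumption for illustration only):
# each code point maps to the full set of code points a registry
# might treat as interchangeable with it, itself included.
VARIANT_TABLE = {
    "a": ["a", "\u03b1", "\u0430"],  # Latin a, Greek alpha, Cyrillic a
    "o": ["o", "\u03bf", "\u043e"],  # Latin o, Greek omicron, Cyrillic o
    "e": ["e", "\u0435"],            # Latin e, Cyrillic ie
}


def variant_count(label, table=VARIANT_TABLE):
    """Number of variant labels: product of alternatives per position."""
    count = 1
    for ch in label:
        count *= len(table.get(ch, [ch]))
    return count


def variant_labels(label, table=VARIANT_TABLE):
    """Enumerate every variant label (only sensible for short labels)."""
    choices = [table.get(ch, [ch]) for ch in label]
    return ["".join(combo) for combo in product(*choices)]


print(variant_count("oa"))      # 3 * 3 = 9
print(variant_count("aeon"))    # 3 * 2 * 3 * 1 = 18
print(variant_count("a" * 10))  # 3 ** 10 = 59049
```

Even with only three alternatives per character, a ten-character
label already implies tens of thousands of variant labels, which
is why enumerating them all into a zone is rarely practical.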

For scripts with case differences, IDNA2003 also chose to
concentrate on lower case, partly because lower case characters
are better differentiated from one another.  It has often been
observed, for example, that Greek lower case ("SMALL LETTER")
alpha and beta don't look nearly enough like their Latin
counterparts ("a" and "b") to be confusing to anyone, but that
the corresponding capital letters look identical.

Unfortunately, if one has a situation in which Greek and Latin
scripts are considered today and chooses to use variants _and_
has the expectation of case-mapping, GREEK SMALL LETTER ALPHA
(U+03B1) must be treated as a variant of LATIN SMALL LETTER A
(U+0061) because a user might be looking at the combination of
GREEK CAPITAL LETTER ALPHA (U+0391) and LATIN CAPITAL LETTER A
(U+0041), which look identical and which map (CaseFold) into
that lower case pair.  That
sort of relationship exists for a significant number of
Latin-Greek pairs and for a much larger number of Cyrillic-Greek
pairs.  For Cyrillic, it just about doubles the number of
variants in the table.
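
The Greek-Latin case can be checked mechanically.  The following
Python sketch assumes a made-up list of capital-letter pairs that
a registry considers visually identical (the list is mine, for
illustration, and not from any published table) and uses Python's
str.casefold(), which applies Unicode case folding, to show which
lower case pairs are dragged into the variant table as a result.

```python
import unicodedata

# Assumption for illustration: capital pairs a registry treats as
# visually identical.  Not taken from any published variant table.
CONFUSABLE_CAPITALS = [
    ("\u0391", "\u0041"),  # GREEK CAPITAL ALPHA / LATIN CAPITAL A
    ("\u0392", "\u0042"),  # GREEK CAPITAL BETA  / LATIN CAPITAL B
]


def implied_lowercase_variants(capital_pairs):
    """Case-fold each confusable capital pair to the pair it maps into."""
    return [(a.casefold(), b.casefold()) for a, b in capital_pairs]


for greek, latin in implied_lowercase_variants(CONFUSABLE_CAPITALS):
    print(f"U+{ord(greek):04X} {unicodedata.name(greek)}"
          f" <-> U+{ord(latin):04X} {unicodedata.name(latin)}")
```

The first pair comes out as U+03B1 GREEK SMALL LETTER ALPHA and
U+0061 LATIN SMALL LETTER A: two characters nobody would confuse
on sight, but which end up as variants of each other once
case-mapping is expected.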

Not a good situation.  But it is one that I think we need to
consider as we weigh the various tradeoffs associated with
mapping, even for transitional purposes, since variant methods
are at least as much part of our landscape today as creative
interpretations of the specs in web page design.

     john


