Mapping and Variants

Thu Mar 5 23:48:09 CET 2009

Hi John,

That is certainly an interesting addition to our discussion. Within
the current DNS, it would be very difficult to do server-side
matching, so if there is no mapping at all on the client side, we
would have to bundle a number of characters on the server side, e.g.
upper-case and lower-case accented Latin letters (or just make the
lookup fail).

However, since many systems do perform lower-case mapping, we may have
to bundle seemingly unrelated characters like Greek lower-case alpha
and Latin lower-case a, just because their upper-case forms look
identical.
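
To make the point concrete, here is a rough Python sketch. The pair
list and the function name are my own illustration, not a real
confusables table or anything from the drafts. It takes capital
letters that look alike and shows which lower-case pairs a registry
would end up having to bundle if clients case-map before lookup.

```python
# Illustrative (Greek capital, Latin capital) pairs that are usually
# rendered identically. This is NOT a complete confusables table.
LOOKALIKE_CAPITALS = [
    ("\u0391", "A"),  # GREEK CAPITAL LETTER ALPHA / LATIN CAPITAL LETTER A
    ("\u0392", "B"),  # GREEK CAPITAL LETTER BETA  / LATIN CAPITAL LETTER B
]


def implied_lowercase_variants(pairs):
    """Return the lower-case pairs that a case-mapping client would
    actually send to the DNS for each look-alike capital pair."""
    return [
        (greek_cap.casefold(), latin_cap.casefold())
        for greek_cap, latin_cap in pairs
    ]


for greek, latin in implied_lowercase_variants(LOOKALIKE_CAPITALS):
    # The lower-case forms look different, yet they inherit the
    # confusability of their capitals through the case mapping.
    print(f"U+{ord(greek):04X} would need bundling with U+{ord(latin):04X}")
```

The lower-case letters themselves are easy to tell apart; they only
become variants because the mapping step hides which capital the user
actually saw.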

So the answer is (again) to get users used to the idea of using
lower-case only in the DNS. We can do a number of things to "educate"
them. One is to always display domain names in lower-case. Some
organizations may not like this because they like to advertise their
domain names with certain letters capitalized.

Clients and registrars could nudge users in the right direction by
lower-casing domain names immediately or soon after the user has typed
them.

HTML clients could nudge authors in the right direction by refusing to
lower-case non-ASCII letters, thereby causing those links to be
ineffective. This may seem draconian, but it appears that HTML authors
no longer use very many upper-case non-ASCII letters in their domain
names. Around November 2005, the use of upper-case non-ASCII letters
peaked at 0.005% of HTML links (in Google's index). After that, usage
dwindled, and since November 2007 the percentage has been 0.0001%.
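
As a rough sketch of what such a client rule might look like (the
function name and the choice to return None for a rejected host are
assumptions of mine, not taken from any browser or from the drafts),
ASCII letters are lower-cased as usual, while any non-ASCII character
that would change under lower-casing makes the link fail:

```python
from typing import Optional


def host_for_lookup(host: str) -> Optional[str]:
    """Lower-case ASCII letters only.

    Returns None (meaning "treat the link as ineffective") if the host
    contains a non-ASCII character that lower-casing would change.
    """
    out = []
    for ch in host:
        if ch.isascii():
            # Traditional DNS behaviour: ASCII case is not significant.
            out.append(ch.lower())
        elif ch.lower() != ch:
            # Upper-case (or title-case) non-ASCII letter: refuse to map it.
            return None
        else:
            out.append(ch)
    return "".join(out)


print(host_for_lookup("Example.COM"))
print(host_for_lookup("b\u00fccher.example"))
print(host_for_lookup("B\u00dcCHER.example"))
```

With this rule the first two hosts go through (the first one
lower-cased) and the third is rejected, which is the nudge described
above.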

Erik

On Thu, Mar 5, 2009 at 1:06 PM, John C Klensin <klensin at jck.com> wrote:
> Hi.
>
> In the process of working on a document that makes
> recommendations about Cyrillic registrations (paralleling RFC
> 4713 for Chinese and the I-D for Arabic language registrations),
> it was forcefully brought to my attention that there is a
> tradeoff, and perhaps an actual conflict, between mapping and
> JET-like variant approaches.  I hope the draft document on
> Cyrillic will be posted by Monday, but it is not within the WG's
> scope and I think I can explain the issue without it.
>
> When IDNA2003 was written, no one (as far as I know) anticipated
> the need to create elaborate variant (bundling) systems to
> associate potentially-confusing labels within a zone so that
> they could be given special treatment.  Since the publication
> of RFC 3743 (the "JET Guidelines"), the practice has become more
> or less widespread, even though more in discussions than in
> implementations.  In the current WG's discussions we have
> included references to variants or bundling as an important
> possibility in many of our discussions about confusing character
> combinations as well as transitional strategies.  We have also
> discussed, although not necessarily agreed upon, the issue of
> variant explosion in which having multiple variants for even a
> small number of characters potentially causes more variant
> labels than a zone might plausibly be able to handle (some of
> the CJK registries and others deal with that by banning some
> variant combinations outright rather than allowing for bundling
> them into a zone).
>
> For scripts with case differences, IDNA2003 also chose to
> concentrate on lower case, partially because there was better
> differentiation of those characters.  It has often been
> observed, for example, that Greek lower case ("SMALL LETTER")
> alpha and beta don't look nearly enough like their Latin
> counterparts ("a" and "b") to be confusing to anyone, but that
> the capital character pairs are identical.
>
> Unfortunately, if one has a situation in which Greek and Latin
> scripts are considered today and chooses to use variants _and_
> has the expectation of case-mapping, GREEK SMALL LETTER ALPHA
> (U+03B1) must be treated as a variant of LATIN SMALL LETTER A
> (U+0061) because a user might be looking at the combination of
> GREEK CAPITAL LETTER ALPHA (U+0391) and LATIN CAPITAL LETTER A
> (U+0041) which map (CaseFold) into the lower case pair.  That
> sort of relationship exists for a significant number of
> Latin-Greek pairs and for a much larger number of Cyrillic-Greek
> pairs.  For Cyrillic, it just about doubles the number of
> variants in the table.
>
> Not a good situation.  But it is one that I think we need to
> consider as we weigh the various tradeoffs associated with
> mapping, even for transitional purposes, since variant methods
> are at least as much part of our landscape today as creative
> interpretations of the specs in web page design.
>
>     john