IDNAbis compatibility

John C Klensin klensin at jck.com
Fri Mar 16 06:26:01 CET 2007



--On Thursday, 15 March, 2007 16:06 -0700 Mark Davis
<mark.davis at icu-project.org> wrote:

> We did a test run over about a billion documents, looking for
> hrefs that use
> IDNA, and we got the following information:
>   changed by ToUnicode, case variant              117,546
>   changed by ToUnicode, other mapping difference  240,794
>   unchanged by ToUnicode                        1,197,657
> This is a rough proxy for the proportion of IDNs that would
> become invalid
> under the current proposals for IDNAbis (that is, not using
> case mappings,
> NFKC, etc.). It is only very rough -- this is preliminary
> data, and a
> billion documents is just a sampling of the web. Nor are we
>...

Mark,

I'm trying to understand this experiment.  Normally, an href
that "uses IDNA" would have Punycode labels (A-labels) in its
domain names.  If that were the case, presumably at least most
of the transformations you are describing below would be
non-issues since the A-labels are already in reduced form, with
all case variations forced to lower, all compatibility
characters reduced to canonical form, etc.

So, presumably, if you are running ToUnicode against the
contents of hrefs, you are looking at hrefs that either use
UTF-8 (or some other encoding) directly as domain names (a
string of U-labels in IDNA200x terminology) -- i.e., are IRIs
rather than URIs -- or contain the UTF-8 strings in %-escape
form.  But the last I checked, the %-escaped form was not a
recommended practice for domain names (A-labels are generally
considered preferable if real U-labels cannot be used) and
support for IRIs was not widespread in the installed browser
base (which of course contains many copies of versions of IE
prior to IE7, etc.).  That limited support suggests that those
who are interested in having their web pages accessible from a
large number of browsers are probably not using IRIs yet.
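For concreteness, the three ways a non-ASCII domain name might appear in an href can be sketched as follows. This is an illustration under assumptions: the host name is invented, and Python's "idna" codec (IDNA2003) stands in for whatever tool produced the A-labels in practice.

```python
from urllib.parse import quote

# Hypothetical host name, used only for illustration.
host = "bücher.example"

# 1. Direct form: the U-labels appear as-is, which makes the href an IRI.
iri_form = "http://" + host + "/"

# 2. %-escaped form: the UTF-8 octets of the name, percent-encoded.
escaped_form = "http://" + quote(host) + "/"

# 3. A-label form: Punycode with the "xn--" prefix, plain ASCII.
a_label_form = "http://" + host.encode("idna").decode("ascii") + "/"

print(iri_form)       # http://bücher.example/
print(escaped_form)   # http://b%C3%BCcher.example/
print(a_label_form)   # http://xn--bcher-kva.example/
```

Only the first two forms can contain case variants or compatibility characters that the mappings would alter.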

Am I missing something or, if not, could you summarize where
these hrefs are coming from?   Even though the IDNA-containing
hrefs appear to constitute less than 0.2% of the hrefs you
examined (it would be around 0.2% if those billion documents
contained only one href each), my intuition suggests that the
number is somewhat higher than I would have expected.
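The arithmetic behind that estimate, using only the three counts quoted above and taking "about a billion" as exactly 10^9 (an assumption made for the sake of the calculation):

```python
# Counts as quoted from the test run; the document total is approximate.
case_variant = 117_546
other_mapping = 240_794
unchanged = 1_197_657
documents = 1_000_000_000   # "about a billion", assumed exact here

idna_hrefs = case_variant + other_mapping + unchanged
changed = case_variant + other_mapping

# Upper bound on the share of hrefs: one href per document.
per_document = idna_hrefs / documents
# Share of the IDNA hrefs that ToUnicode changed.
changed_share = changed / idna_hrefs

print(f"IDNA hrefs found:      {idna_hrefs:,}")       # 1,555,997
print(f"per document examined: {per_document:.2%}")   # 0.16%
print(f"changed by ToUnicode:  {changed_share:.1%}")  # 23.0%
```

Since real documents usually contain many hrefs, the true share of hrefs is lower than the per-document figure.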

You also asked...

> Actually, one question that has come up. It appears that in
> http://www.ietf.org/internet-drafts/draft-klensin-idnabis-issues-01.txt
> no mappings are being done, thus the "B.1 Commonly
> mapped to nothing" characters from rfc3454 are simply illegal.
> The only ones that would be mapped to nothing would be the
> joiners (subject to context).
> 
> Is this the intent?

Yes.  I think it was the intent that these be prohibited even in
IDNA2003 although our collective understanding might not have
been sufficient to get things right at the time.  One way of
looking at this is that, regardless of whatever other measures
are taken, one of the most important weapons against either
general confusion or malicious acts (such as phishing) is user
intuition as to whether or not two domain name strings are the
same, based on visual inspection of those strings.  That
intuition should mostly get the right answer, with very little
astonishment.  Naturally, we can
expect more surprises with scripts that are unfamiliar to the
user than with familiar ones, and visual comparison of domain
names tells us nothing about the values in the underlying
resource records and where they point, but applying restrictions
to reduce obvious sources of such confusion or astonishment
appears to be generally a good idea, at least in the absence of
arguments for particular code points that are sufficient to
overwhelm the downside risks.

In that regard, invisible characters, and characters that are
visible but later disappear, are the friends of those who want
to create confusion, since they can make two strings that appear
different visually be the same or two strings that appear the
same visually be different.
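A small sketch of the effect, using the soft hyphen (U+00AD), one of the RFC 3454 table B.1 "commonly mapped to nothing" characters. Python's standard stringprep module and its IDNA2003 "idna" codec are used here purely for illustration; the label is made up.

```python
import stringprep

SOFT_HYPHEN = "\u00ad"                      # invisible in most renderings
plain = "example"                           # hypothetical label
with_invisible = "exam" + SOFT_HYPHEN + "ple"

# U+00AD is listed in RFC 3454 table B.1.
print(stringprep.in_table_b1(SOFT_HYPHEN))  # True

# The two strings differ as code point sequences...
print(plain == with_invisible)              # False

# ...but IDNA2003 maps the soft hyphen to nothing, so both produce the
# same label on the wire.
same_after_mapping = with_invisible.encode("idna") == plain.encode("idna")
print(same_after_mapping)                   # True
```

Under the no-mapping approach discussed here, the second string would simply be rejected rather than silently folded into the first.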

As with anything else that the IDNA200X model prohibits from
appearing in a U-label, nothing prevents a properly localized
implementation from accepting characters that are banned by the
protocol and mapping them as appears reasonable under local
conditions.

    john

p.s.  I owe responses for a number of other notes that have been
posted to this list.  I got hit by several pre-IETF priority
demands on my time and will try to dig out over the next few
days.


