mark.davis at icu-project.org
Tue Apr 3 19:18:32 CEST 2007
On 3/15/07, John C Klensin <klensin at jck.com> wrote:
> I'm trying to understand this experiment. Normally, an href
> that "uses IDNA" would have Punycode labels (A-labels) in its
> domain names.
I don't know the basis for saying that this would be the "normal" usage.
There isn't anything in IDNA2003, unless I'm missing something, that
requires or even suggests that it is not perfectly fine to have:
<a href="http://ÖBB.at">Österreichische Bundesbahn</a>
> If that were the case, presumably at least most
> of the transformations you are describing below would be
> non-issues since the A-labels are already in reduced form, with
> all case variations forced to lower, all compatibility
> characters reduced to canonical form, etc.
True. Any pages that are already in Punycode should already be ok.
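To make the two href forms concrete, here is a minimal sketch using Python's built-in "idna" codec, which implements the IDNA2003 ToASCII and ToUnicode operations. The choice of Python is an assumption made purely for illustration; nothing in this thread depends on it.

```python
# Sketch only: Python's standard "idna" codec implements IDNA2003 (RFC 3490).
u_form = "ÖBB.at"                  # domain as written in the href above
a_form = u_form.encode("idna")     # ToASCII: Nameprep, then Punycode
print(a_form)                      # b'xn--bb-eka.at'

# The lowercase spelling yields the same A-label: case is folded by Nameprep.
print("öbb.at".encode("idna") == a_form)

# ToUnicode on the A-label returns the folded form, not the original case.
print(a_form.decode("idna"))       # öbb.at
```

A page that already carries the A-label therefore needs no further mapping, while a page with the Unicode form relies on the client to do it.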
> So, presumably, if you are running ToUnicode against the
> contents of hrefs, you are looking at hrefs that either use
> UTF-8 (or some other encoding) directly as domain names (a
> string of U-labels in IDNA200x terminology) -- i.e., are IRIs
> rather than URIs -- or contain the UTF-8 strings in %-escape
> form. But the last I checked, the latter was not a recommended
> practice for domain names (A-labels are generally considered
> preferable if real U-labels cannot be used)
> and support for IRIs
> was not widespread in the installed browser base (which of
> course contains many copies of versions of IE prior to IE7,
> etc.). The latter suggests that those who are interested in
> having their web pages accessible from a large number of
> browsers are probably not using IRIs yet.
That is what I would have thought; I was surprised by the data as well. I
can only surmise that there was sufficient drive to use IDNA that people
either used browsers that handled IDNA natively or downloaded the IE
plugin that enabled it.
> Am I missing something or, if not, could you summarize where
> these hrefs are coming from? Even though the IDNA-containing
> hrefs appear to constitute less than 0.2% of the hrefs you
> examined (it would be around 0.2% if those billion documents
> contain only one href each), my intuition suggests that it
> might be somewhat more than I expected.
These are from a random sampling of pages in Google's index, where each of
the pages was scanned for hrefs that contained IDNAs. Of course, a billion
pages is a small sampling of the web....
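The kind of scan described here can be sketched roughly as follows. This is not the code that was actually run against the index; it is an illustrative Python sketch, and the names `is_idn_host` and `IdnHrefCollector` are hypothetical. It treats a host as an IDN if any label is an A-label ("xn--" prefix) or contains non-ASCII characters, and it does not attempt to handle %-escaped host names.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


def is_idn_host(host: str) -> bool:
    """True if any label is an A-label (xn--...) or contains non-ASCII."""
    return any(
        label.lower().startswith("xn--") or not label.isascii()
        for label in host.split(".")
    )


class IdnHrefCollector(HTMLParser):
    """Collects host names of <a href> targets that look like IDNs."""

    def __init__(self) -> None:
        super().__init__()
        self.idn_hosts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name != "href" or not value:
                continue
            try:
                host = urlsplit(value).hostname or ""
            except ValueError:      # malformed URL; skip it
                continue
            if is_idn_host(host):
                self.idn_hosts.append(host)


sample_page = (
    '<a href="http://ÖBB.at">Österreichische Bundesbahn</a>'
    '<a href="http://xn--bb-eka.at/">same site, A-label form</a>'
    '<a href="http://example.com/">plain ASCII</a>'
)
collector = IdnHrefCollector()
collector.feed(sample_page)
print(collector.idn_hosts)
```

Only the first two links in the sample are reported; the plain ASCII host is ignored.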
> You also asked...
> > Actually, one question that has come up. It appears that in
> > http://www.ietf.org/internet-drafts/draft-klensin-idnabis-issues-01.txt
> > no mappings are being done, thus the "B.1 Commonly
> > mapped to nothing" characters from rfc3454 are simply illegal.
> > The only ones that would be mapped to nothing would be the
> > joiners (subject to context).
> > Is this the intent?
> Yes. I think it was the intent that these be prohibited even in
> IDNA2003 although our collective understanding might not have
> been sufficient to get things right at the time. One way of
> looking at this is that, regardless of whatever measures are
> taken, one of the most important weapons against either general
> confusion or malicious acts (such as phishing), user intuition
> as to whether or not two domain name strings are the same, based
> on visual inspection of those strings, should mostly get the
> right answer with very little astonishment.
I agree -- see below on case and width folding.
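For reference, the effect of the RFC 3454 "B.1 Commonly mapped to nothing" table under IDNA2003 can be seen with a short sketch. It assumes Python, whose standard `stringprep` module exposes the RFC 3454 tables, and uses the `nameprep` helper from the standard library's IDNA2003 codec module; the example label is made up.

```python
import stringprep
from encodings.idna import nameprep  # helper in the stdlib IDNA2003 codec

SOFT_HYPHEN = "\u00ad"         # listed in RFC 3454 table B.1
ZERO_WIDTH_JOINER = "\u200d"   # a joiner, also listed in B.1

for ch in (SOFT_HYPHEN, ZERO_WIDTH_JOINER):
    print(f"U+{ord(ch):04X} in B.1:", stringprep.in_table_b1(ch))

# Under IDNA2003 the invisible character silently disappears, so two
# different code point sequences end up as the same label.
with_invisible = "exam" + SOFT_HYPHEN + "ple"
prepared = nameprep(with_invisible)
print(prepared == "example")
```

Under the draft as described above, the same input would instead be rejected rather than mapped.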
> Naturally, we can
> expect more surprises with scripts that are unfamiliar to the
> user than with familiar ones, and visual comparison of domain
> names tells us nothing about the values in the underlying
> resource records and where they point, but applying restrictions
> to reduce obvious sources of such confusion or astonishment
> appears to be generally a good idea, at least in the absence of
> arguments for particular code points that are sufficient to
> overwhelm the downside risks.
> In that regard, invisible characters and characters that are
> visible but later disappear, are the friends of those who want
> to create confusion since they can make two strings that appear
> different visually to be the same or two strings that appear the
> same visually to be different.
The chief problem is where two different strings have the same visual
representation -- (paypal.com), not the case where the same string has two
different visuals. It is not much of a problem that, say, both a fullwidth A
and a normal A are treated the same, nor that a lowercase and uppercase A
are treated the same; in fact, all but software engineers expect them to be
treated the same. It is where case/width differences are treated as
significant that average people get confused.
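The distinction can be illustrated with a small sketch, again assuming Python and the `nameprep` helper from its IDNA2003 codec module. The spoofed label is a hypothetical example built with Cyrillic U+0430 in place of Latin "a"; it is not taken from real data.

```python
from encodings.idna import nameprep  # helper in the stdlib IDNA2003 codec

# Same string, different visuals: case and width variants all fold together.
variants = ["A", "a", "\uff21", "\uff41"]   # A, a, fullwidth A, fullwidth a
folded = {nameprep(v) for v in variants}
print(folded)                               # {'a'}

# Different strings, same visual: folding does not (and cannot) merge these.
latin = "paypal"
mixed = "p\u0430yp\u0430l"    # Cyrillic a (U+0430) replacing Latin a
print(nameprep(latin) == nameprep(mixed))   # False
```

The first case is harmless because the mapping matches what people expect; the second is the one that enables spoofing.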
> As with anything else that the IDNA200X model prohibits from
> appearing in a U-label, nothing prevents a properly localized
> implementation from accepting characters that are banned by the
> protocol and mapping them as appears reasonable under local
I'm very leery of that statement. I think it is a really bad idea that
browser A could, say, treat <a href="Μαρκ.com">....</a> as if it were
<a href="Mark.com">....</a> and browser B could treat <a
href="Μαρκ.com">....</a> as if it were <a href="μαρκ.com">....</a>, and both correctly claim
conformance to IDNA2003. That seems to me a huge security hole, as well as
in practice a nightmare.
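As a point of comparison, here is what the single, fixed IDNA2003 mapping does with that label. This is a sketch assuming Python and the `nameprep` helper from its IDNA2003 codec module; the "local" mapping to Latin "mark" is the hypothetical browser A behavior, not something any implementation is known to do.

```python
from encodings.idna import nameprep  # helper in the stdlib IDNA2003 codec

href_label = "\u039c\u03b1\u03c1\u03ba"     # "Μαρκ": Greek capital Mu + αρκ
folded = nameprep(href_label)

# IDNA2003 lowercases within the Greek script: the result is "μαρκ".
print(folded == "\u03bc\u03b1\u03c1\u03ba")

# It is never the Latin look-alike that a local mapping might choose.
print(folded == "mark")

# Every conforming client therefore resolves the same A-label.
print((href_label + ".com").encode("idna"))
```

With one mapping there is one answer; with per-implementation mappings the same href could resolve to different names.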
> p.s. I owe responses for a number of other notes that have been
> posted to this list. I got hit by several pre-IETF priority
> demands on my time and will try to dig out over the next few