IDNAbis compatibility

Mon Apr 2 02:09:28 CEST 2007

I don't see this as a UI issue. Many programs process web pages, and depend
on a correct interpretation of the HTML attribute href="<someURL>". These
include not only browsers, but many other processes (like our search engine
at Google), where no human is involved. And even for a browser, what URL
gets used when you click on a link in a page should be predictable.

Leaving the mapping from the URL to what is sent to the DNS up to the
whim of the program doesn't seem to be a good thing, at least to me.
Presumably market pressure would force the browsers to do case folding and
width folding, and maybe some other foldings, but that is a presumption. And
that doesn't tell us exactly which characters they will fold and how --
since there are a number of edge cases (look at the situation with charsets,
where we have gratuitous differences between different vendors' SJIS
mappings for certain characters). Maybe we can assume that implementations
use the foldings in IDNA2003, maybe not. We certainly don't want every
implementation to have to maintain two bodies of code, IDNAbis and IDNA2003,
and first try to see if the URL works with IDNA2003 before trying IDNAbis
(or maybe that's what you had in mind?).
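
To make "case folding and width folding" concrete, here is a minimal sketch in Python. It is not taken from IDNA2003 or from any IDNAbis draft. The function names are made up for illustration, width folding is assumed to mean "apply NFKC only to characters whose compatibility decomposition is tagged <wide> or <narrow>", the final NFC pass is an extra assumption, and Python's str.casefold() merely stands in for whatever case-folding table a standard would pin down.

```python
import unicodedata


def width_fold(text: str) -> str:
    """Map full-width and half-width characters to their normal forms.

    Assumption: only characters whose decomposition carries the <wide> or
    <narrow> tag are touched; other compatibility characters are left alone.
    """
    out = []
    for ch in text:
        # decomposition() returns e.g. "<wide> 0041", or "" if there is none.
        tag = unicodedata.decomposition(ch).split(" ", 1)[0]
        if tag in ("<wide>", "<narrow>"):
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    # Assumption: recompose afterwards, so that half-width kana followed by a
    # half-width voicing mark end up as a single composed character.
    return unicodedata.normalize("NFC", "".join(out))


def fold_host(text: str) -> str:
    """Width folding followed by case folding (hypothetical helper)."""
    return width_fold(text).casefold()


if __name__ == "__main__":
    print(fold_host("\uff25\uff38\uff21\uff2d\uff30\uff2c\uff25"))
```

The edge cases are exactly where two implementations of something like this would drift apart: which characters count as width variants, which case-folding table is used, and whether anything is renormalized afterwards.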

Our lives are not made easier if the foldings that are used for URLs for
each and every browser and other product have to be researched either by
trying to ferret out documentation for all of those products to figure out
what they are doing, or by having to reverse-engineer what they are doing.
Our lives are made easier if there is a standard that products can claim
conformance to, that specifies a set of foldings to be used. Now maybe this
doesn't belong in your conception of IDNAbis, maybe it belongs in a separate
RFC "Standard folding for IDNAbis".

And I agree with you that we should not have done folding in the first place
-- or at least should have done it differently: Punycode would actually have
let us deal with basic foldings in a productive way, since it allows case
or other features in the input to be represented by case in the output,
which would have provided a unique mapping without folding, while still
using the case-insensitivity already built into the DNS.
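
A small illustration of the slack involved: the digits of a Punycode label decode to the same string whether they are written in upper or lower case, and that freedom is what the optional mixed-case annotation in RFC 3492 (Appendix A) uses to carry case information. The sketch below only demonstrates the case-insensitive decoding, using Python's built-in punycode codec. It does not implement the annotation itself, and the helper name and sample label are arbitrary choices for illustration.

```python
def decodes_same_in_any_case(label: str) -> bool:
    """Check that upper-casing the Punycode digits does not change the result."""
    encoded = label.encode("punycode")          # e.g. b"bcher-kva"
    # Everything after the last "-" is the run of encoded digits.
    head, sep, digits = encoded.rpartition(b"-")
    shouted = head + sep + digits.upper()       # e.g. b"bcher-KVA"
    return shouted.decode("punycode") == encoded.decode("punycode") == label


if __name__ == "__main__":
    print(decodes_same_in_any_case("b\u00fccher"))
```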

If the number of incompatible cases were exceedingly small, maybe it would
not be an issue (I often hear from various people that even the percentage
of cases that are changed by the Unicode normalization corrigenda between
3.0 and 4.1 is too large, and that percentage -- in actual data -- is
zero!). But 15% is pretty high in my book -- so we should think carefully
about the issue of folding.
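
For reference, the 15% figure can be recomputed from the category counts in the quoted message below. This is just arithmetic on those published counts. The variable names are arbitrary, and nothing here re-runs the original test.

```python
# Counts for the disjoint categories A0-A4, copied from the quoted message.
counts = {
    "A0": 708_760,  # passes IDNAbis as-is
    "A1": 22_714,   # passes after case folding
    "A2": 47_312,   # passes after width folding
    "A3": 4,        # passes after width folding, then case folding
    "A4": 52_456,   # still fails after all three steps
}

total = sum(counts.values())
shares = {name: 100 * n / total for name, n in counts.items()}

# Everything outside A0 is incompatible unless some folding is applied.
incompatible = 100 - shares["A0"]
# A1-A3 are the cases that case and/or width folding would rescue.
rescued_by_folding = shares["A1"] + shares["A2"] + shares["A3"]

if __name__ == "__main__":
    print(f"total hosts:          {total:,}")
    print(f"incompatible as-is:   {incompatible:.2f}%")
    print(f"rescued by folding:   {rescued_by_folding:.2f}%")
    print(f"still failing (A4):   {shares['A4']:.2f}%")
```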

Mark

On 3/31/07, John C Klensin <klensin at jck.com> wrote:
>
>
>
> --On Friday, 30 March, 2007 18:14 -0700 Mark Davis
> <mark.davis at icu-project.org> wrote:
>
> > We had a bit more time to look at IDNAbis compatibility, and
> > here are some
> > better (and hopefully clearer) results. Out of a significantly
> > large
> > sampling of the web, there were about 800,000 cases where an
> > HTML document
> > contained an href="..." that contained a host name that was
> > valid IDNA2003.
> > We tested those host names to see if they would also be valid
> > under IDNAbis
> > (based on the current working proposals). About 85% were
> > valid, about 8%
> > more would be valid if IDNAbis were changed to also do case
> > and width
> > folding, and about 6% would still be invalid even if case and
> > width foldings
> > were applied. (The width foldings are applying NFKC to just
> > the half-width
> > and full-width characters to get the normal ones.)
> >
> > Here are some more details, where A0-A4 are disjoint
> > categories.
> >
> > Category                                               Count  Percent
> > A0: Passes IDNAbis                                   708,760   85.26%
> > A1: Passes IDNAbis after case folding                 22,714    2.73%
> > A2: Passes IDNAbis after width folding                47,312    5.69%
> > A3: Passes IDNAbis after width, then case folding          4    0.00%
> > A4: Failed to pass IDNAbis after 3 steps              52,456    6.31%
> >
> > A5: Passes IDNA = sum(A0-A4)                         831,246  100.00%
> >
> > This differs from some of our previous data, because we are
> > explicitly
> > testing IDNA vs IDNAbis (not just approximating the latter),
> > and also
> > filtering out invalid URLs. I will be out next week, but we'll
> > try to follow
> > up with more of a breakdown of A4.
>
> Mark,
>
> This is very interesting, but I'm still not clear about where it
> takes us except as implementation advice.
>
> Suppose I encounter a URI that falls into your cases A1-A3 (to
> keep this simple).   I'm running client software that is either
>
>         (i) conformant to IDNA2003, in which case these foldings
>         and mappings are made,
>
>         (ii) a conforming implementation of IDNAbis, in which
>         case the software implementer has the option of
>         performing those foldings and mappings as a UI issue, or
>
>         (iii) completely conformant to neither (e.g., refusing
>         to resolve strings that one or the other will permit
>         and, arguably, refusing to resolve some such strings
>         without explicit user intervention).
>
> I'm assuming that "IDNAbis", in your tests, relies on Ken's
> tables.  More on that below.
>
> So, to me, data like this aren't a useful critique (positive or
> negative) of the IDNAbis effort.  Instead, it turns into
> implementer advice, e.g., "if you are in an environment that
> normally expects upper and lower case to be treated as
> equivalent, you probably should do the mapping although it is
> not part of IDNA; if you are in an environment that normally
> expects differential-width characters to be treated as
> equivalent, you should do that mapping although it is not part
> of IDNA".   And I would expect HTML validity-testers, and maybe
> UIs that are especially concerned about these things, to warn
> about possibly invalid URIs.
>
> As you look at this further, and especially as you look at A4, I
> think it would be helpful to distinguish between href strings
> that use domain names that are consistent with the ICANN
> Guidelines and the IESG advice and those that are not.
> Distinguishing between strings that IDNAbis newly prohibits and
> strings that are prohibited under existing guidelines for
> IDNA2003 but become a hard prohibition in IDNAbis would also
> seem helpful in understanding the issues.
>
>      john
>
>
>

-- 
Mark