IDNAbis compatibility

Mon Apr 2 03:32:16 CEST 2007

In the ASCII case, URLs can be rendered rather loosely because of the
robust matching allowed by case folding.

In the IDN case, production of URL references appears to require more complex
rules.

May I ask, naively, whether one could invoke case folding and character
width mapping in some way that is general and not language dependent? I
think that is one interpretation of Mark's message. Is there any rule we
could choose that would eliminate the ambiguity that has apparently
manifested because IDNAbis does not specify this aspect?
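
One language-independent reading of that question, as a minimal sketch rather than anything IDNAbis specifies: Python's `str.casefold()` applies Unicode case folding without consulting a locale, and width mapping can be limited to characters whose East Asian Width property is Fullwidth or Halfwidth. The function names `width_fold` and `fold_label` are illustrative assumptions.

```python
import unicodedata

def width_fold(label: str) -> str:
    """Apply NFKC only to full-width (F) and half-width (H) characters.

    The mapping is done character by character, so other compatibility
    characters (ligatures, superscripts, ...) are left untouched.
    """
    return "".join(
        unicodedata.normalize("NFKC", ch)
        if unicodedata.east_asian_width(ch) in ("F", "H")
        else ch
        for ch in label
    )

def fold_label(label: str) -> str:
    """Width-fold, then case-fold; neither step depends on a locale."""
    return width_fold(label).casefold()

print(fold_label("ＥＸＡＭＰＬＥ"))  # input is full-width Latin capitals
print(fold_label("Example"))
```

Both calls are expected to yield the same plain lowercase string, which is the kind of general rule the question asks about. Whether such a rule is acceptable for every script is a separate matter that this sketch does not settle.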

vint

Vinton G Cerf
Chief Internet Evangelist
Google
Regus Suite 384
13800 Coppermine Road
Herndon, VA 20171

+1 703 234-1823
+1 703-234-5822 (f)

vint at google.com
www.google.com

  _____  

From: idna-update-bounces at alvestrand.no
[mailto:idna-update-bounces at alvestrand.no] On Behalf Of Mark Davis
Sent: Sunday, April 01, 2007 8:09 PM
To: John C Klensin
Cc: idna-update at alvestrand.no
Subject: Re: IDNAbis compatibility

I don't see this as a UI issue. Many programs process web pages, and depend
on a correct interpretation of the HTML attribute href="<someURL>". These
include not only browsers, but many other processes (like our search engine
at Google), where no human is involved. And even for a browser, what URL
gets used when you click on a link in a page should be predictable.

Leaving the mapping from the URL to what is sent to the DNS up to the
whim of the program doesn't seem to be a good thing, at least to me.
Presumably market pressure would force the browsers to do case folding and
width folding, and maybe some other foldings, but that is a presumption. And
that doesn't tell us exactly which characters they will fold and how --
since there are a number of edge cases (look at the situation with charsets,
where we have gratuitous differences between different vendors' SJIS
mappings for certain characters). Maybe we can assume that implementations
use the foldings in IDNA2003, maybe not. We certainly don't want every
implementation to have to maintain two bodies of code, IDNAbis and IDNA2003,
and first try to see if the URL works with IDNA2003 before trying IDNAbis
(or maybe that's what you had in mind?). 
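
For reference, the IDNA2003 foldings mentioned here can be observed with Python's built-in `idna` codec, which implements IDNA2003 (RFC 3490 with Nameprep). This only illustrates existing IDNA2003 behaviour; `bücher.example` is an assumed sample host name, not one from the data set, and `to_dns_idna2003` is a hypothetical helper name.

```python
# Python's standard "idna" codec implements IDNA2003 (RFC 3490 + Nameprep),
# so it performs case and width mapping before the Punycode step.
def to_dns_idna2003(host: str) -> str:
    return host.encode("idna").decode("ascii")

lower = to_dns_idna2003("bücher.example")
upper = to_dns_idna2003("BÜCHER.example")   # upper-case label
wide = to_dns_idna2003("ｂücher.example")    # label starts with a full-width "b"

# All three spellings reach the DNS as the same ASCII name under IDNA2003.
print(lower, upper, wide)
```

An implementation that does no mapping would treat the second and third spellings differently from the first, which is the compatibility gap under discussion.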

Our lives are not made easier if the foldings that are used for URLs for
each and every browser and other product have to be researched either by
trying to ferret out documentation for all of those products to figure out
what they are doing, or by having to reverse-engineer what they are doing.
Our lives are made easier if there is a standard that products can claim
conformance to, that specifies a set of foldings to be used. Now maybe this
doesn't belong in your conception of IDNAbis, maybe it belongs in a separate
RFC "Standard folding for IDNAbis". 

And I agree with you that we should not have done folding in the first place
-- or at least should have done it differently: Punycode would actually have
let us deal with basic foldings in a productive way, since it allows case
or other features in the input to be represented by case in the output,
which would have provided a unique mapping without folding, but used the
case-insensitivity already built into the DNS.

If the number of incompatible cases were exceedingly small, maybe it would
not be an issue (I often hear from various people that even the percentage
of cases that are changed by the Unicode normalization corrigenda between
3.0 and 4.1 is too large, and that percentage -- in actual data -- is
zero!). But 15% is pretty high in my book -- so we should think carefully
about the issue of folding.

Mark

On 3/31/07, John C Klensin <klensin at jck.com> wrote: 

--On Friday, 30 March, 2007 18:14 -0700 Mark Davis
<mark.davis at icu-project.org> wrote:

> We had a bit more time to look at IDNAbis compatibility, and here are some
> better (and hopefully clearer) results. Out of a significantly large
> sampling of the web, there were about 800,000 cases where an HTML document
> contained an href="..." that contained a host name that was valid IDNA2003.
> We tested those host names to see if they would also be valid under IDNAbis
> (based on the current working proposals). About 85% were valid, about 8%
> more would be valid if IDNAbis were changed to also do case and width
> folding, and about 6% would still be invalid even if case and width foldings
> were applied. (The width foldings are applying NFKC to just the half-width
> and full-width characters to get the normal ones.)
>
> Here are some more details, where A0-A4 are disjoint categories.
>
> Category                                                     Count  Percent
> A0: Passes IDNAbis                                         708,760   85.26%
> A1: Passes IDNAbis after case folding                       22,714    2.73%
> A2: Passes IDNAbis after width folding                      47,312    5.69%
> A3: Passes IDNAbis after width folding, then case folding        4    0.00%
> A4: Failed to pass IDNAbis after 3 steps                    52,456    6.31%
> A5: Passes IDNA = sum(A0-A4)                               831,246  100.00%
>
> This differs from some of our previous data, because we are explicitly
> testing IDNA vs IDNAbis (not just approximating the latter), and also
> filtering out invalid URLs. I will be out next week, but we'll try to follow
> up with more of a breakdown of A4.
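
The A0-A4 breakdown quoted above can be read as a short decision procedure. The sketch below is an assumption about how such a test might be structured, not the code used for the measurement. In particular, `is_valid_idnabis` is a hypothetical stand-in predicate (it merely accepts lowercase ASCII letters, digits, hyphens and dots), since the real check depends on the IDNAbis working tables, and the sample host names are invented.

```python
import unicodedata
from collections import Counter

def width_fold(s: str) -> str:
    # NFKC applied only to full-width / half-width characters.
    return "".join(
        unicodedata.normalize("NFKC", c)
        if unicodedata.east_asian_width(c) in ("F", "H") else c
        for c in s
    )

def is_valid_idnabis(host: str) -> bool:
    # HYPOTHETICAL placeholder, not the real IDNAbis rules.
    return all(c in "abcdefghijklmnopqrstuvwxyz0123456789-." for c in host)

def classify(host: str, valid=is_valid_idnabis) -> str:
    """Assign a host name to exactly one of the disjoint categories A0-A4."""
    if valid(host):
        return "A0"  # passes unchanged
    if valid(host.casefold()):
        return "A1"  # passes after case folding alone
    if valid(width_fold(host)):
        return "A2"  # passes after width folding alone
    if valid(width_fold(host).casefold()):
        return "A3"  # needs width folding, then case folding
    return "A4"      # still fails after all three steps

# Invented samples: plain, capitalised, full-width "e", full-width "E", space.
hosts = ["example.com", "Example.com", "ｅxample.com", "Ｅxample.com", "exa mple.com"]
counts = Counter(classify(h) for h in hosts)
for cat in ("A0", "A1", "A2", "A3", "A4"):
    print(cat, counts[cat], f"{100 * counts[cat] / len(hosts):.2f}%")
```

Because each host stops at the first test it passes, the categories cannot overlap, which matches the statement that A0-A4 are disjoint.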

Mark,

This is very interesting, but I'm still not clear about where it
takes us except as implementation advice.

Suppose I encounter a URI that falls into your cases A1-A3 (to 
keep this simple).   I'm running client software that is either

        (i) conformant to IDNA2003, in which case these foldings
        and mappings are made,

        (ii) a conforming implementation of IDNAbis, in which 
        case the software implementer has the option of
        performing those foldings and mappings as a UI issue, or

        (iii) completely conformant to neither (e.g., refusing
        to resolve strings that one or the other will permit 
        and, arguably, refusing to resolve some such strings
        without explicit user intervention).

I'm assuming that "IDNAbis", in your tests, relies on Ken's
tables.  More on that below. 

So, to me, data like this aren't a useful critique (positive or
negative) of the IDNAbis effort.  Instead, it turns into
implementer advice, e.g., "if you are in an environment that
normally expects upper and lower case to be treated as 
equivalent, you probably should do the mapping although it is
not part of IDNA; if you are in an environment that normally
expects differential-width characters to be treated as
equivalent, you should do that mapping although it is not part 
of IDNA".   And I would expect HTML validity-testers, and maybe
UIs that are especially concerned about these things, to warn
about possibly-invalid URIs.
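
A validity tester of the kind suggested here might look roughly like the following sketch. It assumes only Python's standard `html.parser` and `urllib.parse` modules; the names `HrefHostChecker`, `raw_host` and `fold`, and the warning wording, are illustrative. The check only flags host names that change under case or width folding and does not decide IDNA validity.

```python
import unicodedata
from html.parser import HTMLParser
from urllib.parse import urlsplit

def fold(host: str) -> str:
    # Width folding (NFKC on full-/half-width characters only), then case folding.
    folded = "".join(
        unicodedata.normalize("NFKC", c)
        if unicodedata.east_asian_width(c) in ("F", "H") else c
        for c in host
    )
    return folded.casefold()

def raw_host(url: str) -> str:
    # Take the host from netloc by hand so its original case is preserved.
    # Simplification: userinfo and port are stripped; IPv6 literals are not handled.
    netloc = urlsplit(url).netloc
    return netloc.rpartition("@")[2].partition(":")[0]

class HrefHostChecker(HTMLParser):
    """Collect a warning for every href whose host is altered by folding."""

    def __init__(self):
        super().__init__()
        self.warnings = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value:
                host = raw_host(value)
                if host and host != fold(host):
                    self.warnings.append(
                        f"{host!r} relies on folding; a strict resolver may reject it"
                    )

checker = HrefHostChecker()
checker.feed('<a href="http://Example.COM/a">x</a><a href="http://example.com/b">y</a>')
print(checker.warnings)
```

Only the first link is flagged, since the second host is already in folded form.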

As you look at this further, and especially as you look at A4, I
think it would be helpful to distinguish between href strings
that use domain names that are consistent with the ICANN
Guidelines and the IESG advice and those that do not.  Distinguishing between strings
that IDNAbis newly prohibits and strings that are prohibited 
under existing guidelines for IDNA2003 but become a hard
prohibition in IDNAbis would seem helpful in understanding the
issues.

     john

_______________________________________________
Idna-update mailing list
Idna-update at alvestrand.no
http://www.alvestrand.no/mailman/listinfo/idna-update

-- 
Mark 