IDNAbis compatibility

Mark Davis mark.davis at icu-project.org
Tue Apr 3 18:49:02 CEST 2007


IDNA2003 eliminates any inconsistency by specifying an exact folding, one
that is language-independent. IDNAbis, as currently proposed, drops any
notion of folding (or leaves it up to the application). I think that there
is rough consensus that we shouldn't have had folding in IDNA2003, but my
concern is that if we drop it on the floor in IDNAbis, we will get
inconsistency between applications and/or too high a level of breakage.

I mentioned the possibility of separating folding into a separate RFC; that
may be a way to deal with it. Applications that wanted consistent folding
could adhere to the RFC on IDNA folding; ones that didn't want folding, or
wanted their own, wouldn't claim conformance to that separate RFC.
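
To make "consistent folding" concrete, here is a minimal sketch in Python of
a language-independent case and width folding. It is only an illustration,
not the IDNA2003 mapping tables: it assumes that Unicode full case folding
(str.casefold) is an acceptable stand-in for the case mapping, and that
"width folding" means applying NFKC only to characters whose decomposition
is tagged <wide> or <narrow>. The function names are hypothetical.

```python
import unicodedata


def is_width_variant(ch: str) -> bool:
    """True for full-width and half-width compatibility characters."""
    decomp = unicodedata.decomposition(ch)
    return decomp.startswith("<wide>") or decomp.startswith("<narrow>")


def width_fold(text: str) -> str:
    """Apply NFKC only to width variants; leave every other character alone.

    NFKC is applied one character at a time, so sequences such as a
    half-width kana followed by a half-width voicing mark are not recomposed.
    """
    return "".join(
        unicodedata.normalize("NFKC", ch) if is_width_variant(ch) else ch
        for ch in text
    )


def fold(text: str) -> str:
    """Width folding followed by locale-independent case folding."""
    return width_fold(text).casefold()


if __name__ == "__main__":
    # U+FF25 is FULLWIDTH LATIN CAPITAL LETTER E.
    for host in ("Example.COM", "\uff25xample.com"):
        print(ascii(host), "->", ascii(fold(host)))
```

Two applications that both applied a fold() like this would send the same
string to the DNS for either spelling; an application that skipped it would
not, which is the inconsistency I am worried about.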

Mark

On 4/1/07, Vint Cerf <vint at google.com> wrote:
>
>  in the ascii case, the URLs can be rendered rather loosely because of the
> robust matching allowed by case folding.
>
> in the IDN case, production of URL references appears to require
> more complex rules.
>
> May I ask, naively, whether one could invoke case folding and character
> width mapping in some way that is not language dependent and is general? I
> think that is one interpretation of Mark's message. Is there any rule we
> could choose that would eliminate the ambiguity that has apparently
> manifested because IDNAbis does not specify this aspect?
>
> vint
>
>
>  Vinton G Cerf
> Chief Internet Evangelist
> Google
> Regus Suite 384
> 13800 Coppermine Road
> Herndon, VA 20171
>
> +1 703 234-1823
> +1 703-234-5822 (f)
>
> vint at google.com
> www.google.com
>
>
>
>  ------------------------------
> *From:* idna-update-bounces at alvestrand.no [mailto:
> idna-update-bounces at alvestrand.no] *On Behalf Of *Mark Davis
> *Sent:* Sunday, April 01, 2007 8:09 PM
> *To:* John C Klensin
> *Cc:* idna-update at alvestrand.no
> *Subject:* Re: IDNAbis compatibility
>
> I don't see this as a UI issue. Many programs process web pages, and
> depend on a correct interpretation of the HTML attribute href="<someURL>".
> These include not only browsers, but many other processes (like our search
> engine at Google), where no human is involved. And even for a browser, what
> URL gets used when you click on a link in a page should be
> predictable.
>
> Leaving the mappings from the URL to what is sent to the DNS up to the
> whim of the program doesn't seem to be a good thing, at least to me.
> Presumably market pressure would force the browsers to do case folding and
> width folding, and maybe some other foldings, but that is a presumption. And
> that doesn't tell us exactly which characters they will fold and how --
> since there are a number of edge cases (look at the situation with charsets,
> where we have gratuitous differences between different vendors' SJIS
> mappings for certain characters). Maybe we can assume that implementations
> use the foldings in IDNA2003, maybe not. We certainly don't want every
> implementation to have to maintain two bodies of code, IDNAbis and IDNA2003,
> and first try to see if the URL works with IDNA2003 before trying IDNAbis
> (or maybe that's what you had in mind?).
>
> Our lives are not made easier if the foldings that are used for URLs for
> each and every browser and other product have to be researched either by
> trying to ferret out documentation for all of those products to figure out
> what they are doing, or by having to reverse-engineer what they are doing.
> Our lives are made easier if there is a standard that products can claim
> conformance to, that specifies a set of foldings to be used. Now maybe this
> doesn't belong in your conception of IDNAbis, maybe it belongs in a separate
> RFC "Standard folding for IDNAbis".
>
> And I agree with you that we should not have done folding in the first
> place -- or at least should have done it differently: Punycode would
> actually have let us deal with basic foldings in a productive way, since it
> allows case or other features in the input to be represented by case in the
> output, which would have provided a unique mapping without folding, but used
> the case-insensitivity already built into the DNS.
>
> If the number of incompatible cases were exceedingly small, maybe it would
> not be an issue (I often hear from various people that even the percentage
> of cases that are changed by the Unicode normalization corrigenda between
> 3.0 and 4.1 is too large, and that percentage -- in actual data -- is
> zero!). But 15% is pretty high in my book -- so we should think carefully
> about the issue of folding.
>
> Mark
>
> On 3/31/07, John C Klensin <klensin at jck.com> wrote:
> >
> >
> >
> > --On Friday, 30 March, 2007 18:14 -0700 Mark Davis
> > <mark.davis at icu-project.org> wrote:
> >
> > > We had a bit more time to look at IDNAbis compatibility, and here are some
> > > better (and hopefully clearer) results. Out of a significantly large
> > > sampling of the web, there were about 800,000 cases where an HTML document
> > > contained an href="..." that contained a host name that was valid IDNA2003.
> > > We tested those host names to see if they would also be valid under IDNAbis
> > > (based on the current working proposals). About 85% were valid, about 8%
> > > more would be valid if IDNAbis were changed to also do case and width
> > > folding, and about 6% would still be invalid even if case and width foldings
> > > were applied. (The width foldings are applying NFKC to just the half-width
> > > and full-width characters to get the normal ones.)
> > >
> > > Here are some more details, where A0-A4 are disjoint categories.
> > >
> > > Category                                                     Count        %
> > > A0: Passes IDNAbis                                         708,760   85.26%
> > > A1: Passes IDNAbis after case folding                       22,714    2.73%
> > > A2: Passes IDNAbis after width folding                      47,312    5.69%
> > > A3: Passes IDNAbis after width folding, then case folding        4    0.00%
> > > A4: Failed to pass IDNAbis after 3 steps                    52,456    6.31%
> > >
> > > A5: Passes IDNA = sum(A0-A4)                               831,246  100.00%
> > >
> > > This differs from some of our previous data, because we are explicitly
> > > testing IDNA vs IDNAbis (not just approximating the latter), and also
> > > filtering out invalid URLs. I will be out next week, but we'll try to follow
> > > up with more of a breakdown of A4.
> >
> > Mark,
> >
> > This is very interesting, but I'm still not clear about where it
> > takes us except as implementation advice.
> >
> > Suppose I encounter a URI that falls into your cases A1-A3 (to
> > keep this simple).   I'm running client software that is either
> >
> >         (i) conformant to IDNA2003, in which case these foldings
> >         and mappings are made,
> >
> >         (ii) a conforming implementation of IDNAbis, in which
> >         case the software implementer has the option of
> >         performing those foldings and mappings as a UI issue, or
> >
> >         (iii) completely conformant to neither (e.g., refusing
> >         to resolve strings that one or the other will permit
> >         and, arguably, refusing to resolve some such strings
> >         without explicit user intervention).
> >
> > I'm assuming that "IDNAbis", in your tests, relies on Ken's
> > tables.  More on that below.
> >
> > So, to me, data like this aren't a useful critique (positive or
> > negative) of the IDNAbis effort.  Instead, it turns into
> > implementer advice, e.g., "if you are in an environment that
> > normally expects upper and lower case to be treated as
> > equivalent, you probably should do the mapping although it is
> > not part of IDNA; if you are in an environment that normally
> > expects differential-width characters to be treated as
> > equivalent, you should do that mapping although it is not part
> > of IDNA".   And I would expect HTML validity-testers, and maybe
> > UIs that are especially concerned about these things, to warn
> > about possibly invalid URIs.
> >
> > As you look at this further, and especially as you look at A4, I
> > think it would be helpful to distinguish between href strings
> > that use domain names that are consistent with the ICANN
> > Guidelines and the IESG advice and those that are not.  Distinguishing
> > between strings
> > that IDNAbis newly prohibits and strings that are prohibited
> > under existing guidelines for IDNA2003 but become a hard
> > prohibition in IDNAbis would seem helpful in understanding the
> > issues.
> >
> >      john
> >
> >
> >
> > _______________________________________________
> > Idna-update mailing list
> > Idna-update at alvestrand.no
> > http://www.alvestrand.no/mailman/listinfo/idna-update
> >
>
>
>
> --
> Mark
>
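
For reference, the A0-A4 breakdown quoted above amounts to trying a fixed
sequence of foldings and recording the first one after which the host name
validates. The following is a rough Python sketch of that procedure, not the
code we actually ran. The IDNAbis validity check is passed in as a
parameter; toy_valid below is a made-up stand-in (it rejects uppercase,
width variants and symbols) and is not the real IDNAbis rule set.

```python
import unicodedata
from typing import Callable


def is_width_variant(ch: str) -> bool:
    """True for full-width and half-width compatibility characters."""
    decomp = unicodedata.decomposition(ch)
    return decomp.startswith("<wide>") or decomp.startswith("<narrow>")


def width_fold(text: str) -> str:
    """Apply NFKC to width variants only."""
    return "".join(
        unicodedata.normalize("NFKC", ch) if is_width_variant(ch) else ch
        for ch in text
    )


def classify(host: str, is_valid: Callable[[str], bool]) -> str:
    """Return the first (hence disjoint) category the host name falls into."""
    if is_valid(host):
        return "A0"  # passes as-is
    if is_valid(host.casefold()):
        return "A1"  # passes after case folding
    if is_valid(width_fold(host)):
        return "A2"  # passes after width folding
    if is_valid(width_fold(host).casefold()):
        return "A3"  # passes after width folding, then case folding
    return "A4"      # still fails after all three steps


def toy_valid(host: str) -> bool:
    """Hypothetical validity rule, for illustration only."""
    return all(
        ch == ch.casefold()
        and not is_width_variant(ch)
        and not unicodedata.category(ch).startswith("S")
        for ch in host
    )


if __name__ == "__main__":
    samples = [
        "example.com",
        "Example.com",
        "\uff45xample.com",  # full-width small e
        "\uff25xample.com",  # full-width capital E
        "i\u2665you.com",    # contains a symbol
    ]
    for host in samples:
        print(ascii(host), classify(host, toy_valid))
```

Because each test is tried only when the earlier ones fail, every host name
lands in exactly one category, which is why the counts add up to the total.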



-- 
Mark