IDNA2003 eliminates any inconsistency by specifying an exact, language-independent folding. IDNAbis, as currently proposed, drops any notion of folding (or leaves it up to the application). I think that there is rough consensus that we shouldn't have had folding in IDNA2003, but my concern is that if we drop it on the floor in IDNAbis, we will get inconsistency between applications and/or too high a level of breakage.
<br><br>I mentioned the possibility of separating folding into a separate RFC; that may be a way to deal with it. Applications that wanted consistent folding could adhere to the RFC on IDNA folding; ones that didn't want folding, or wanted their own, wouldn't claim conformance to that separate RFC.
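<br><br>A minimal sketch of what such a language-independent folding might look like, under stated assumptions: Python's built-in str.casefold() stands in for case folding, and NFKC is applied only to characters whose East Asian Width is Fullwidth or Halfwidth (the width folding described in the quoted test results below). The function names are hypothetical, and this is not the IDNA2003 Nameprep mapping table itself.
<pre>
```python
import unicodedata


def width_fold(text):
    """Map fullwidth and halfwidth characters to their normal forms.

    NFKC is applied one character at a time, and only to characters whose
    East Asian Width property is F (Fullwidth) or H (Halfwidth). Combining
    sequences are not recomposed afterwards (a simplification).
    """
    folded = []
    for ch in text:
        if unicodedata.east_asian_width(ch) in ("F", "H"):
            folded.append(unicodedata.normalize("NFKC", ch))
        else:
            folded.append(ch)
    return "".join(folded)


def fold_label(text):
    """Width folding followed by case folding (order is an assumption)."""
    return width_fold(text).casefold()


if __name__ == "__main__":
    # Fullwidth Latin capitals, written as escapes to avoid encoding issues.
    print(fold_label("\uff25\uff38\uff21\uff2d\uff30\uff2c\uff25"))
    print(fold_label("EXAMPLE"))
```
</pre>
A separate folding RFC would have to pin down exactly these choices, for example the order of the two steps and the treatment of edge cases such as combining marks.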
<br><br>Mark<br><br><div><span class="gmail_quote">On 4/1/07, <b class="gmail_sendername">Vint Cerf</b> <<a href="mailto:vint@google.com" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">
vint@google.com</a>> wrote:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
<div>
<div dir="ltr" align="left"><span><font color="#0000ff" face="Arial" size="2">in the ascii case, the URLs can be rendered rather loosely
because of the robust matching allowed by case folding. </font></span></div>
<div dir="ltr" align="left"><span><font color="#0000ff" face="Arial" size="2"></font></span> </div>
<div dir="ltr" align="left"><span><font color="#0000ff" face="Arial" size="2">in the IDN case, production of URL references appears to
require more complex rules. </font></span></div>
<div dir="ltr" align="left"><span><font color="#0000ff" face="Arial" size="2"></font></span> </div>
<div dir="ltr" align="left"><span><font color="#0000ff" face="Arial" size="2">May I ask, naively, whether one could invoke case folding
and character width mapping in some way that is not language dependent and is
general? I think that is one interpretation of Mark's message. Is there any rule
we could choose that would eliminate the ambiguity that has apparently
manifested because IDNAbis does not specify this aspect?</font></span></div>
<div dir="ltr" align="left"><span><font color="#0000ff" face="Arial" size="2"></font></span> </div>
<div dir="ltr" align="left"><span><font color="#0000ff" face="Arial" size="2">vint</font></span></div>
<div dir="ltr" align="left"><span></span> </div>
<div> </div>
<div dir="ltr" align="left">
<div dir="ltr" align="left"><font face="Arial" size="2">Vinton G Cerf</font></div>
<div dir="ltr" align="left"><font face="Arial" size="2">Chief Internet
Evangelist</font></div>
<div dir="ltr" align="left"><font face="Arial" size="2">Google</font></div>
<div dir="ltr" align="left"><font face="Arial" size="2">Regus Suite 384</font></div>
<div dir="ltr" align="left"><font face="Arial" size="2">13800 Coppermine
Road</font></div>
<div dir="ltr" align="left"><font face="Arial" size="2">Herndon, VA 20171</font></div>
<div dir="ltr" align="left"><font face="Arial" size="2"></font> </div>
<div dir="ltr" align="left"><font face="Arial" size="2">+1 703 234-1823</font></div>
<div dir="ltr" align="left"><font face="Arial" size="2">+1 703-234-5822 (f)</font></div>
<div dir="ltr" align="left"><font face="Arial" size="2"></font> </div>
<div dir="ltr" align="left"><font face="Arial" size="2"><a href="mailto:vint@google.com" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">vint@google.com</a></font></div>
<div dir="ltr" align="left"><font face="Arial" size="2"><a href="http://www.google.com/" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">www.google.com</a></font></div>
<div dir="ltr" align="left"> </div></div>
<div> </div><br>
<div dir="ltr" align="left" lang="en-us">
<hr>
<font face="Tahoma" size="2"><b>From:</b> <a href="mailto:idna-update-bounces@alvestrand.no" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">idna-update-bounces@alvestrand.no</a>
[mailto:<a href="mailto:idna-update-bounces@alvestrand.no" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">idna-update-bounces@alvestrand.no</a>] <b>On Behalf Of </b>Mark
Davis<br><b>Sent:</b> Sunday, April 01, 2007 8:09 PM<br><b>To:</b> John C
Klensin<br><b>Cc:</b> <a href="mailto:idna-update@alvestrand.no" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">idna-update@alvestrand.no</a><br><b>Subject:</b> Re: IDNAbis
compatibility<br></font><br></div><div><span>
<div></div>I don't see this as a UI issue. Many programs process web pages, and
depend on a correct interpretation of the HTML attribute href="&lt;someURL&gt;".
These include not only browsers, but many other processes (like our search
engine at Google), where no human is involved. And even for a browser, what URL
gets used when you click on a link in a page should be predictable.
<br><br>Leaving the mappings from the URL to what is sent to the DNS up to
the whim of the program doesn't seem to be a good thing, at least to me.
Presumably market pressure would force the browsers to do case folding and width
folding, and maybe some other foldings, but that is a presumption. And that
doesn't tell us exactly which characters they will fold and how -- since there
are a number of edge cases (look at the situation with charsets, where we have
gratuitous differences between different vendors' SJIS mappings for certain
characters). Maybe we can assume that implementations use the foldings in
IDNA2003, maybe not. We certainly don't want every implementation to have to
maintain two bodies of code, IDNAbis and IDNA2003, and first try to see if the
URL works with IDNA2003 before trying IDNAbis (or maybe that's what you had in
mind?). <br><br>Our lives are not made easier if the foldings that are used for
URLs for each and every browser and other product have to be researched either
by trying to ferret out documentation for all of those products to figure out
what they are doing, or by having to reverse-engineer what they are doing. Our
lives are made easier if there is a standard that products can claim conformance
to, that specifies a set of foldings to be used. Now maybe this doesn't belong
in your conception of IDNAbis, maybe it belongs in a separate RFC "Standard
folding for IDNAbis". <br><br>And I agree with you that we should not have done
folding in the first place -- or at least should have done it differently:
Punycode would actually have let us deal with basic foldings in a productive
way, since it allows case or other features in the input to be represented by
case in the output, which would have provided a unique mapping without folding,
while still using the case-insensitivity already built into the DNS. <br><br>If the number
of incompatible cases were exceedingly small, maybe it would not be an issue (I
often hear from various people that even the percentage of cases that are
changed by the Unicode normalization corrigenda between 3.0 and 4.1 is too
large, and that percentage -- in actual data -- is zero!). But 15% is pretty
high in my book -- so we should think carefully about the issue of
folding.<br><br>Mark<br><br>
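The figures behind the 15% can be rechecked from the category counts reported in the quoted message below. This is a small illustrative sketch: the counts are copied from categories A0-A4, and the variable and function names are hypothetical.
<pre>
```python
# Host-name counts per category, as reported in the quoted test results.
counts = {
    "A0": 708_760,  # passes IDNAbis as-is
    "A1": 22_714,   # passes after case folding
    "A2": 47_312,   # passes after width folding
    "A3": 4,        # passes after width folding, then case folding
    "A4": 52_456,   # still fails after all three steps
}

total = sum(counts.values())


def share(n):
    """Percentage of all tested host names, rounded to two decimals."""
    return round(100 * n / total, 2)


fixed_by_folding = counts["A1"] + counts["A2"] + counts["A3"]
fails_without_folding = total - counts["A0"]

if __name__ == "__main__":
    print("total:", total)
    print("valid as-is (%):", share(counts["A0"]))
    print("recovered by folding (%):", share(fixed_by_folding))
    print("invalid even with folding (%):", share(counts["A4"]))
    print("invalid without folding (%):", share(fails_without_folding))
```
</pre>
The five categories add up to 831,246 host names; just under 15% fail without any folding, and a little over 8 of those percentage points are recovered by case and width folding.<br><br>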
<div><span class="gmail_quote">On 3/31/07, <b class="gmail_sendername">John C
Klensin</b> <<a href="mailto:klensin@jck.com" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">klensin@jck.com</a>>
wrote:</span>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><br><br>--On
Friday, 30 March, 2007 18:14 -0700 Mark Davis<br><<a href="mailto:mark.davis@icu-project.org" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">mark.davis@icu-project.org</a>>
wrote:<br><br>> We had a bit more time to look at IDNAbis compatibility,
and <br>> here are some<br>> better (and hopefully clearer) results. Out
of a significantly<br>> large<br>> sampling of the web, there were about
800,000 cases where an<br>> HTML document<br>> contained an href="..."
that contained a host name that was <br>> valid IDNA2003.<br>> We tested
those host names to see if they would also be valid<br>> under
IDNAbis<br>> (based on the current working proposals). About 85%
were<br>> valid, about 8%<br>> more would be valid if IDNAbis were
changed to also do case <br>> and width<br>> folding, and about 6% would
still be invalid even if case and<br>> width foldings<br>> were applied.
(The width foldings are applying NFKC to just<br>> the half-width<br>>
and full-width characters to get the normal ones.) <br>><br>> Here are
some more details, where A0-A4 are disjoint<br>>
categories.<br>><br>> A0: Passes IDNAbis: 708,760 (85.26%)<br>> A1: Passes IDNAbis after case folding: 22,714 (2.73%)<br>> A2: Passes IDNAbis after width folding: 47,312 (5.69%)<br>> A3: Passes IDNAbis after applying width folding, and then case folding: 4 (0.00%)<br>> A4: Failed to pass IDNAbis after 3 steps: 52,456 (6.31%)<br>><br>> A5: Passes IDNA = sum(A0-A4): 831,246 (100.00%)<br>><br>> This differs from some of our previous data, because we are
are<br>> explicitly<br>> testing IDNA vs IDNAbis (not just approximating
the latter),<br>> and also<br>> filtering out invalid URLs. I will be
out next week, but we'll <br>> try to follow<br>> up with more of a
breakdown of A4.<br><br>Mark,<br><br>This is very interesting, but I'm still
not clear about where it<br>takes us except as implementation
advice.<br><br>Suppose I encounter a URI that falls into your cases A1-A3 (to
<br>keep this simple). I'm running client software that is
either<br><br>&nbsp;&nbsp;&nbsp;&nbsp;(i) conformant to IDNA2003, in which case these foldings<br>&nbsp;&nbsp;&nbsp;&nbsp;and mappings are made,<br><br>&nbsp;&nbsp;&nbsp;&nbsp;(ii) a conforming implementation of IDNAbis, in which<br>&nbsp;&nbsp;&nbsp;&nbsp;case the software implementer has the option of<br>&nbsp;&nbsp;&nbsp;&nbsp;performing those foldings and mappings as a UI issue, or<br><br>&nbsp;&nbsp;&nbsp;&nbsp;(iii) completely conformant to neither (e.g., refusing<br>&nbsp;&nbsp;&nbsp;&nbsp;to resolve strings that one or the other will permit<br>&nbsp;&nbsp;&nbsp;&nbsp;and, arguably, refusing to resolve some such strings<br>&nbsp;&nbsp;&nbsp;&nbsp;without explicit user intervention).<br><br>I'm assuming that "IDNAbis", in your tests, relies
on Ken's<br>tables. More on that below. <br><br>So, to me, data
like this aren't a useful critique (positive or<br>negative) of the IDNAbis
effort. Instead, it turns into<br>implementer advice, e.g., "if you
are in an environment that<br>normally expects upper and lower case to be
treated as <br>equivalent, you probably should do the mapping although it
is<br>not part of IDNA; if you are in an environment that normally<br>expects
differential-width characters to be treated as<br>equivalent, you should do
that mapping although it is not part <br>of IDNA". And I would
expect HTML validity-testers, and maybe<br>UIs that are especially concerned
about these things, to warn<br>about possibly invalid URIs.<br><br>As you look
at this further, and especially as you look at A4, I <br>think it would be
helpful to distinguish between href strings<br>that use domain names that are
consistent with the ICANN<br>Guidelines and the IESG
advice. Distinguishing between strings<br>that IDNAbis newly
prohibits and strings that are prohibited <br>under existing guidelines for
IDNA2003 but become a hard<br>prohibition in IDNAbis would seem helpful in
understanding the<br>issues.<br><br>
john<br><br>
</blockquote></div><br><br clear="all"><br>-- <br>Mark </span></div></div>
</blockquote></div><br><br clear="all"><br>-- <br>Mark