IDNAbis compatibility

Sat Mar 31 18:16:54 CEST 2007

--On Friday, 30 March, 2007 18:14 -0700 Mark Davis
<mark.davis at icu-project.org> wrote:

> We had a bit more time to look at IDNAbis compatibility, and here are
> some better (and hopefully clearer) results. Out of a significantly
> large sampling of the web, there were about 800,000 cases where an
> HTML document contained an href="..." that contained a host name that
> was valid IDNA2003. We tested those host names to see if they would
> also be valid under IDNAbis (based on the current working proposals).
> About 85% were valid, about 8% more would be valid if IDNAbis were
> changed to also do case and width folding, and about 6% would still be
> invalid even if case and width foldings were applied. (The width
> foldings are applying NFKC to just the half-width and full-width
> characters to get the normal ones.)
> 
> Here are some more details, where A0-A4 are disjoint categories.
>
>                                                      Count  Percent
> A0: Passes IDNAbis                                 708,760   85.26%
> A1: Passes IDNAbis after case folding               22,714    2.73%
> A2: Passes IDNAbis after width folding              47,312    5.69%
> A3: Passes IDNAbis after width, then case folding        4    0.00%
> A4: Failed to pass IDNAbis after 3 steps            52,456    6.31%
>
> A5: Passes IDNA = sum(A0-A4)                       831,246  100.00%
>
> This differs from some of our previous data, because we are explicitly
> testing IDNA vs IDNAbis (not just approximating the latter), and also
> filtering out invalid URLs. I will be out next week, but we'll try to
> follow up with more of a breakdown of A4.
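
For anyone who wants to reproduce this kind of breakdown, the A0-A4
classification described above could be sketched roughly as follows.
This is only an illustration under stated assumptions: `toy_is_valid`
is a hypothetical stand-in for a real IDNAbis validity check (the
quoted message does not say how that check was implemented), and
Python's `str.casefold()` is used as an approximation of the case
folding step. Width folding follows the parenthetical description
above: NFKC applied only to half-width and full-width characters.

```python
import unicodedata
from collections import Counter


def fold_width(s: str) -> str:
    """Apply NFKC only to characters with a <wide> or <narrow> decomposition."""
    out = []
    for ch in s:
        decomp = unicodedata.decomposition(ch)
        if decomp.startswith("<wide>") or decomp.startswith("<narrow>"):
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)


def fold_case(s: str) -> str:
    """Approximate case folding with Unicode full case folding."""
    return s.casefold()


def classify(host: str, is_valid) -> str:
    """Place a host name into one of the disjoint categories A0-A4."""
    if is_valid(host):
        return "A0"  # passes as-is
    if is_valid(fold_case(host)):
        return "A1"  # passes after case folding
    if is_valid(fold_width(host)):
        return "A2"  # passes after width folding
    if is_valid(fold_case(fold_width(host))):
        return "A3"  # passes after width folding, then case folding
    return "A4"      # fails even after all three steps


def toy_is_valid(host: str) -> bool:
    """ASSUMPTION: placeholder validity rule, not the real IDNAbis tables.

    Rejects anything that case or width folding would change, and
    (arbitrarily, for demonstration) anything containing an underscore.
    """
    return host == fold_case(host) and host == fold_width(host) and "_" not in host


# Made-up sample host names, one per category.
sample_hosts = [
    "example.com",
    "Example.com",
    "\uff45xample.com",   # full-width lowercase "e"
    "\uff25xample.com",   # full-width uppercase "E"
    "ex_ample.com",
]
print(Counter(classify(h, toy_is_valid) for h in sample_hosts))
```

Because each test is tried only after the earlier ones fail, the
categories stay disjoint, which is what allows the counts in the table
to add up to the total.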

Mark,

This is very interesting, but I'm still not clear about where it
takes us except as implementation advice.

Suppose I encounter a URI that falls into your cases A1-A3 (to
keep this simple).   I'm running client software that is either

	(i) conformant to IDNA2003, in which case these foldings
	and mappings are made,

	(ii) a conforming implementation of IDNAbis, in which
	case the software implementer has the option of
	performing those foldings and mappings as a UI issue, or

	(iii) completely conformant to neither (e.g., refusing
	to resolve strings that one or the other will permit
	and, arguably, refusing to resolve some such strings
	without explicit user intervention).

I'm assuming that "IDNAbis", in your tests, relies on Ken's
tables.  More on that below.

So, to me, data like this aren't a useful critique (positive or
negative) of the IDNAbis effort.  Instead, it turns into
implementer advice, e.g., "if you are in an environment that
normally expects upper and lower case to be treated as
equivalent, you probably should do the mapping although it is
not part of IDNA; if you are in an environment that normally
expects differential-width characters to be treated as
equivalent, you should do that mapping although it is not part
of IDNA".   And I would expect HTML validity-testers, and maybe
UIs that are especially concerned about these things, to warn
about possibly invalid URIs.
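
To make the validity-tester idea concrete, here is a rough sketch of a
checker that scans HTML for href host names and warns when a host only
looks like its folded form after case or width mapping. Everything
named here (`HrefHostChecker`, `mapping_warnings`, `host_of`) is
hypothetical and invented for illustration; it does not test IDNA or
IDNAbis validity at all, it only flags reliance on the two mappings
discussed above, with `str.casefold()` standing in for case folding.

```python
import unicodedata
from html.parser import HTMLParser
from urllib.parse import urlsplit


def fold_width(s: str) -> str:
    """Apply NFKC only to half-width and full-width characters."""
    return "".join(
        unicodedata.normalize("NFKC", ch)
        if unicodedata.decomposition(ch).startswith(("<wide>", "<narrow>"))
        else ch
        for ch in s
    )


def host_of(href: str) -> str:
    """Return the host as written (case preserved), or "" if none is found."""
    try:
        netloc = urlsplit(href).netloc
    except ValueError:
        return ""
    host = netloc.rpartition("@")[2]   # drop any userinfo
    if host.startswith("["):           # skip IPv6 literals
        return ""
    return host.partition(":")[0]      # drop any port


def mapping_warnings(host: str) -> list:
    """List the mappings this host relies on (assumed warning wording)."""
    warnings = []
    if host != host.casefold():
        warnings.append("relies on case folding")
    if host != fold_width(host):
        warnings.append("relies on width folding")
    return warnings


class HrefHostChecker(HTMLParser):
    """Collect (host, warnings) pairs for <a href="..."> attributes."""

    def __init__(self):
        super().__init__()
        self.findings = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                host = host_of(value)
                warnings = mapping_warnings(host) if host else []
                if warnings:
                    self.findings.append((host, warnings))


checker = HrefHostChecker()
checker.feed('<a href="http://Example.COM/x">a</a>'
             '<a href="http://example.com/">b</a>')
for host, warnings in checker.findings:
    print(host, "->", ", ".join(warnings))
```

The host is taken from `netloc` rather than `urlsplit(...).hostname`
because the latter lowercases its result, which would hide exactly the
case differences the checker is meant to report.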

As you look at this further, and especially as you look at A4, I
think it would be helpful to distinguish between href strings
that use domain names that are consistent with the ICANN
Guidelines and the IESG advice and those that do not.
Distinguishing between strings
that IDNAbis newly prohibits and strings that are prohibited
under existing guidelines for IDNA2003 but become a hard
prohibition in IDNAbis would seem helpful in understanding the
issues.

     john