looking up domain names with unassigned code points
John C Klensin
klensin at jck.com
Tue May 13 02:39:41 CEST 2008
Mark,
I, at least, found this very helpful and even more interesting.
A few comments below...
--On Monday, 12 May, 2008 15:53 -0700 Mark Davis
<mark.davis at icu-project.org> wrote:
> In answer to your question, I wrote a quick and dirty test of
> taking random "xn--something" codes (where something is from
> [- a-z 0-9]+) and seeing what the percents would be. Here are
> the results.
>
> For the lengths up to 4, I do an exhaustive test; above that
> it is a random sampling.
>
> The percentages are not what one would expect from a random
> sampling of strings (eg < 10% of possible Unicode code points
> are assigned LMN), I suspect because PunyCode would favor
> locality of deltas.
Certainly it does favor locality, so that is a reasonable
hypothesis.
> Key:
>
> - illegal_punycode - means that converting to unicode and
> back has an error
> - unassigned - has at least one unassigned character
> - non_LMN - has at least one non Letter/Mark/Number
> - non_folded - has at least one non NFKC or non CaseFolded
> - all_ascii - is all ASCII
> - otherwise_ok - everything else
>...
> length: 1
> 97.297% : illegal_punycode (36)
> 02.703% : non_LMN (1)
> length: 2
> 92.909% : illegal_punycode (1,271)
> 04.386% : non_LMN (60)
> 02.705% : all_ascii (37)
The nature of the IDNA and punycode beasts essentially makes
one-character strings impossible (my guess is that the one
non_LMN character you found is in the 0..7F range) and
two-character strings unlikely. It is good to have that
impression confirmed.
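
For anyone who wants to reproduce the short-length cases, here is a
minimal sketch of the round-trip test over the 37-symbol alphabet. It
assumes Python's built-in "punycode" codec, which is not the
implementation used for the numbers quoted above, so counts at longer
lengths may differ. The names ALPHABET, round_trips and
count_round_trips are made up for illustration.

```python
from itertools import product

# The 37 symbols in [- a-z 0-9] used for the part after "xn--".
ALPHABET = "-abcdefghijklmnopqrstuvwxyz0123456789"


def round_trips(payload):
    """True if payload decodes as Punycode and re-encodes to itself."""
    try:
        text = payload.encode("ascii").decode("punycode")
        return text.encode("punycode").decode("ascii") == payload
    except UnicodeError:
        # Truncated or overflowing input is rejected by the codec.
        return False


def count_round_trips(length):
    """Exhaustively count payloads of the given length that round-trip."""
    return sum(
        round_trips("".join(chars))
        for chars in product(ALPHABET, repeat=length)
    )


if __name__ == "__main__":
    for n in (1, 2):
        print(n, count_round_trips(n), "of", len(ALPHABET) ** n)
```

Everything that fails the round trip here corresponds roughly to the
"illegal_punycode" bucket in the key.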
> length: 3
> 48.121% : otherwise_ok (24,357)
> 30.725% : illegal_punycode (15,552)
> 11.109% : non_LMN (5,623)
> 04.645% : unassigned (2,351)
> 02.705% : all_ascii (1,369)
> 02.695% : non_folded (1,364)
> length: 4
> 46.001% : illegal_punycode (861,512)
> 23.924% : otherwise_ok (448,047)
> 16.716% : unassigned (313,059)
> 08.354% : non_LMN (156,447)
> 02.705% : all_ascii (50,653)
> 02.300% : non_folded (43,074)
>...
Part of what is interesting here (and in the rest of your
results) is that, if we ignore the cases that might be shifted
from DISALLOWED to PVALID as a vanishingly small percentage and
ignore strings requiring bidi and/or contextual tests (as your
experiment did), the number of strings that could ever be valid
(valid now ("otherwise_ok") or unassigned now) is around 52% at
length three and significantly under half (pretty constantly
between 30 and 40%) for all lengths longer than that.
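
The arithmetic behind those figures can be checked with a few lines.
This is only a sketch: the counts are copied from the length 3 and
length 4 tables quoted above, and the helper name could_ever_be_valid
is made up for illustration.

```python
# Counts copied from the quoted length-3 and length-4 results.
counts = {
    3: {
        "otherwise_ok": 24357, "illegal_punycode": 15552, "non_LMN": 5623,
        "unassigned": 2351, "all_ascii": 1369, "non_folded": 1364,
    },
    4: {
        "illegal_punycode": 861512, "otherwise_ok": 448047,
        "unassigned": 313059, "non_LMN": 156447,
        "all_ascii": 50653, "non_folded": 43074,
    },
}


def could_ever_be_valid(table):
    """Percent of strings that are valid now or only blocked by being unassigned."""
    total = sum(table.values())
    return 100.0 * (table["otherwise_ok"] + table["unassigned"]) / total


if __name__ == "__main__":
    for length, table in counts.items():
        print(f"length {length}: {could_ever_be_valid(table):.1f}%")
```

With the quoted counts this gives about 52.8% at length three and
about 40.6% at length four.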
So it is actually reasonable to infer that, if people start
generating strings at random and building putative A-labels from
them, most of the results will be invalid for one reason or
another, independent of future versions of Unicode. If I were
designing an application that was doing lookup, that, plus the
display concerns, would almost certainly induce me to try to
convert the string to U-label form and test it before trying to
look it up. I still don't think we should try to require that
behavior, but...
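
A rough sketch of what such a pre-lookup check might look like,
following the categories in the key above. The assumptions are
significant: it uses Python's "punycode" codec and unicodedata module
(so "unassigned" means unassigned in whatever Unicode version that
Python ships), it approximates "non_folded" as "changes under NFKC
plus case folding", the order of the checks is a guess, and it does
none of the bidi or contextual tests. It is not an IDNA validity
check, and classify_a_label is a made-up name.

```python
import unicodedata

ACE_PREFIX = "xn--"


def classify_a_label(label):
    """Classify a putative A-label before attempting a lookup (sketch only)."""
    label = label.lower()
    if not label.startswith(ACE_PREFIX):
        return "not_an_a_label"
    payload = label[len(ACE_PREFIX):]
    if not payload:
        return "illegal_punycode"
    try:
        # Convert to the putative U-label form and back again.
        text = payload.encode("ascii").decode("punycode")
        if text.encode("punycode").decode("ascii") != payload:
            return "illegal_punycode"
    except UnicodeError:
        return "illegal_punycode"
    if text.isascii():
        return "all_ascii"
    categories = [unicodedata.category(ch) for ch in text]
    if "Cn" in categories:
        return "unassigned"
    if any(cat[0] not in "LMN" for cat in categories):
        return "non_LMN"
    folded = unicodedata.normalize(
        "NFKC", unicodedata.normalize("NFKC", text).casefold()
    )
    if folded != text:
        return "non_folded"
    return "otherwise_ok"


if __name__ == "__main__":
    for candidate in ("xn--bcher-kva", "xn--a", "xn--b", "example"):
        print(candidate, classify_a_label(candidate))
```

An application could skip the lookup for anything that does not come
back as "otherwise_ok", and decide separately what to do about
"unassigned".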
john