looking up domain names with unassigned code points

Tue May 13 02:39:41 CEST 2008

Mark,

I, at least, found this very helpful and even more interesting.
A few comments below...

--On Monday, 12 May, 2008 15:53 -0700 Mark Davis
<mark.davis at icu-project.org> wrote:

> In answer to your question, I wrote a quick and dirty test of
> taking random "xn--something" codes (where something is from
> [- a-z 0-9]+) and seeing what the percents would be. Here are
> the results.
> 
> For the lengths up to 4, I do an exhaustive test; above that
> it is a random sampling.
> 
> The percentages are not what one would expect from a random
> sampling of strings (eg < 10% of possible Unicode code points
> are assigned LMN), I suspect because PunyCode would favor
> locality of deltas.

Certainly it does favor locality, so that is a reasonable
hypothesis.

> Key:
> 
>    - illegal_punycode - means that converting to unicode and
>      back has an error
>    - unassigned - has at least one unassigned character
>    - non_LMN - has at least one non Letter/Mark/Number
>    - non_folded - has at least one non NFKC or non CaseFolded
>    - all_ascii - is all ASCII
>    - otherwise_ok - everything else
>...
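
For anyone wanting to repeat the experiment, the key above might be
turned into a classifier along the following lines.  This is only a
sketch, not the code actually used for the test: it assumes Python's
built-in "punycode" codec and unicodedata module (so the results depend
on the Unicode version shipped with the interpreter), the function name
classify is made up for illustration, and the order in which the
buckets are tested is a guess.

```python
import unicodedata


def classify(suffix):
    """Bucket the part after 'xn--' using the categories in the key.

    Assumptions (not taken from the original test): Python's stdlib
    punycode codec, the interpreter's bundled Unicode data, and this
    particular order of checks.
    """
    try:
        text = suffix.encode("ascii").decode("punycode")
        round_trip = text.encode("punycode").decode("ascii")
    except (UnicodeError, ValueError, OverflowError):
        return "illegal_punycode"
    if round_trip != suffix:
        return "illegal_punycode"
    if text.isascii():
        return "all_ascii"
    categories = [unicodedata.category(ch) for ch in text]
    if "Cn" in categories:  # Cn = unassigned code point
        return "unassigned"
    if any(cat[0] not in "LMN" for cat in categories):
        return "non_LMN"
    if unicodedata.normalize("NFKC", text.casefold()) != text:
        return "non_folded"
    return "otherwise_ok"
```

Under those assumptions, classify("bcher-kva") lands in otherwise_ok
(it decodes to "bücher") and classify("a-") lands in all_ascii.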

> length: 1
>     97.297%    :    illegal_punycode    (36)
>     02.703%    :    non_LMN    (1)
> length: 2
>     92.909%    :    illegal_punycode    (1,271)
>     04.386%    :    non_LMN    (60)
>     02.705%    :    all_ascii    (37)

The nature of the IDNA and punycode beasts essentially makes
one-character strings impossible (my guess is that the one
non_LMN character you found is in the 0..7F range) and two
unlikely.  It is good to have that impression confirmed.
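
The one-character case is small enough to enumerate rather than guess
at.  A rough sketch, assuming Python's built-in "punycode" codec as a
stand-in for whatever converter the test used (the helper name
round_trips is hypothetical, and a different converter could behave
differently):

```python
import string

# The 37 symbols the test draws from: hyphen, a-z, 0-9.
ALPHABET = "-" + string.ascii_lowercase + string.digits


def round_trips(suffix):
    """Return the decoded text if suffix survives decode and re-encode, else None."""
    try:
        text = suffix.encode("ascii").decode("punycode")
        if text.encode("punycode").decode("ascii") == suffix:
            return text
    except (UnicodeError, ValueError, OverflowError):
        pass
    return None


# Keep only the one-character strings that survive the round trip.
survivors = {}
for symbol in ALPHABET:
    decoded = round_trips(symbol)
    if decoded is not None:
        survivors[symbol] = decoded

for suffix, text in survivors.items():
    print(suffix, [f"U+{ord(ch):04X}" for ch in text])
```

With CPython's codec the only survivor is "a", which decodes to U+0080,
a C1 control character rather than a letter, mark, or number.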

> length: 3
>     48.121%    :    otherwise_ok    (24,357)
>     30.725%    :    illegal_punycode    (15,552)
>     11.109%    :    non_LMN    (5,623)
>     04.645%    :    unassigned    (2,351)
>     02.705%    :    all_ascii    (1,369)
>     02.695%    :    non_folded    (1,364)
> length: 4
>     46.001%    :    illegal_punycode    (861,512)
>     23.924%    :    otherwise_ok    (448,047)
>     16.716%    :    unassigned    (313,059)
>     08.354%    :    non_LMN    (156,447)
>     02.705%    :    all_ascii    (50,653)
>     02.300%    :    non_folded    (43,074)
>...

Part of what is interesting here (and in the rest of your
results) is that, if we ignore the cases that might be shifted
from DISALLOWED to PVALID as a vanishingly small percentage and
ignore strings requiring bidi and/or contextual tests (as your
experiment did), the number of strings that could ever be valid
(valid now, i.e. "otherwise_ok", or unassigned now) is around 52%
at length three and significantly under half (pretty constantly
between 30 and 40%) for all lengths longer than that.
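
The "could ever be valid" figure is just the sum of the otherwise_ok
and unassigned rows.  As a quick check of that arithmetic, using only
the percentages quoted above for lengths 3 and 4 (the longer lengths
are elided from the quote, so they are not included; the function name
is illustrative):

```python
# Percentages copied from the quoted results for lengths 3 and 4.
results = {
    3: {"otherwise_ok": 48.121, "unassigned": 4.645},
    4: {"otherwise_ok": 23.924, "unassigned": 16.716},
}


def could_ever_be_valid(buckets):
    """Share that is valid now, or might become valid once code points are assigned."""
    return buckets["otherwise_ok"] + buckets["unassigned"]


for length, buckets in results.items():
    print(f"length {length}: {could_ever_be_valid(buckets):.3f}%")
# length 3: 52.766%
# length 4: 40.640%
```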

So it is actually reasonable to infer that, if people start
generating strings at random and building putative A-labels from
them, most of the results will be invalid for one reason or
another, independent of future versions of Unicode.  If I were
designing an application that was doing lookup, that, plus the
display concerns, would almost certainly induce me to try to
convert the string to U-label form and test it before trying to
look it up.  I still don't think we should try to require that
behavior, but...
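
To make the idea concrete, a lookup application might do something
like the sketch below before handing a putative A-label to the
resolver.  Everything here is an assumption for illustration: the
function names are invented, Python's built-in "punycode" codec does
the conversion, and the checks are a rough approximation based on the
categories in the key above, not the actual IDNA rules (no bidi or
contextual tests, and the letter/mark/number test is cruder than the
real PVALID/DISALLOWED tables).

```python
import unicodedata

ACE_PREFIX = "xn--"


def to_u_label(a_label):
    """Return the Unicode form of a putative A-label, or None if a check fails."""
    label = a_label.lower()
    if not label.startswith(ACE_PREFIX):
        return None
    suffix = label[len(ACE_PREFIX):]
    try:
        text = suffix.encode("ascii").decode("punycode")
        if text.encode("punycode").decode("ascii") != suffix:
            return None  # does not round-trip, so not a canonical encoding
    except (UnicodeError, ValueError, OverflowError):
        return None
    if text.isascii():
        return None  # an all-ASCII string should not carry the prefix
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Cn":
            return None  # unassigned in this interpreter's Unicode version
        if category[0] not in "LMN" and ch != "-":
            return None  # not a letter, mark, number, or hyphen
    if unicodedata.normalize("NFKC", text.casefold()) != text:
        return None  # not already case-folded and NFKC-normalized
    return text


def should_look_up(a_label):
    """Hypothetical policy: only query the DNS for labels that pass the checks."""
    return to_u_label(a_label) is not None
```

With this sketch, should_look_up("xn--bcher-kva") is True, while
should_look_up("xn--b") is False because the suffix is not valid
punycode.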

     john