looking up domain names with unassigned code points

Mark Davis mark.davis at icu-project.org
Tue May 13 00:53:39 CEST 2008


In answer to your question, I wrote a quick-and-dirty test that takes random
"xn--something" labels (where something matches [-a-z0-9]+) and measures what
percentage falls into each category. Here are the results.

For lengths up to 4, the test is exhaustive; above that, it is a random
sample. The length is that of the "something" part.
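
The exhaustive-then-sampled enumeration could be sketched roughly as follows
in Python. This is not the original test code, which is not shown in this
message. The names ALPHABET, candidates, and tally are hypothetical, and the
default sample size is an assumption, since the actual sample size is not
stated.

```python
import itertools
import random
from collections import Counter

# The 37 characters allowed after "xn--": hyphen, a-z, 0-9.
ALPHABET = "-abcdefghijklmnopqrstuvwxyz0123456789"


def candidates(length, sample_size=60000, exhaustive_up_to=4, rng=None):
    """Yield the "something" strings to test for a given length.

    Exhaustive for short lengths, random sampling (with replacement) above.
    sample_size is an assumed value, not the one used in the original test.
    """
    if length <= exhaustive_up_to:
        for chars in itertools.product(ALPHABET, repeat=length):
            yield "".join(chars)
    else:
        rng = rng or random.Random()
        for _ in range(sample_size):
            yield "".join(rng.choice(ALPHABET) for _ in range(length))


def tally(length, classify, **kwargs):
    """Count how many candidates fall into each category named by classify."""
    return Counter(classify(s) for s in candidates(length, **kwargs))
```

Here classify is any function that maps a string to a category name, so a
call such as tally(3, classify) would produce the counts for one length.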

The percentages are not what one would expect from a random sampling of
strings (e.g., fewer than 10% of possible Unicode code points are assigned
LMN); I suspect this is because Punycode favors locality of deltas.
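
One way to probe that suspicion is to look at which code points short labels
decode to. The following minimal Python sketch uses the standard library's
"punycode" codec, which performs raw RFC 3492 decoding without any IDNA
processing, so its results will not necessarily match the test above. The
helper name decoded_code_points is hypothetical.

```python
def decoded_code_points(something):
    """Return the code points that "something" decodes to, or None on error.

    Relies on Python's built-in "punycode" codec (raw decoding only, no
    Nameprep/IDNA checks); using that codec is an assumption of this sketch.
    """
    try:
        text = something.encode("ascii").decode("punycode")
    except ValueError:  # UnicodeError is a subclass of ValueError
        return None
    return [ord(ch) for ch in text]


# Each encoded delta is an offset from the decoder's previous state, which
# starts at U+0080, so short strings tend to land on relatively low code points.
for label in ["a", "bcher-kva"]:
    cps = decoded_code_points(label)
    print(label, None if cps is None else [f"U+{cp:04X}" for cp in cps])
```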

Key:

   - illegal_punycode - converting to Unicode and back results in an error
   - unassigned - has at least one unassigned character
   - non_LMN - has at least one character that is not a Letter, Mark, or
   Number
   - non_folded - has at least one character that is not NFKC-normalized or
   not case-folded
   - all_ascii - is all ASCII
   - otherwise_ok - everything else
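
The categories in the key could be approximated with a sketch like the
following in Python. It is not the code that produced the numbers below. It
assumes Python's built-in "punycode" codec (no IDNA processing) and Python's
bundled Unicode database, which is newer than the one available in 2008, so
the counts would differ. The order in which the checks are applied is also an
assumption, as is the function name classify.

```python
import unicodedata


def classify(something):
    """Assign the part after "xn--" to one of the categories in the key."""
    try:
        text = something.encode("ascii").decode("punycode")
        round_trip = text.encode("punycode").decode("ascii")
    except ValueError:  # includes UnicodeError: cannot decode or re-encode
        return "illegal_punycode"
    if round_trip != something:
        # Decodes, but does not convert back to the same string.
        return "illegal_punycode"
    if text.isascii():
        return "all_ascii"
    categories = [unicodedata.category(ch) for ch in text]
    if "Cn" in categories:  # Cn = unassigned in this Unicode version
        return "unassigned"
    if any(cat[0] not in "LMN" for cat in categories):
        return "non_LMN"
    if text != unicodedata.normalize("NFKC", text) or text != text.casefold():
        return "non_folded"
    return "otherwise_ok"
```

For example, under these assumptions classify("bcher-kva") falls into
otherwise_ok, since it decodes to "bücher" and encodes back unchanged, while
classify("9") is illegal_punycode because the string cannot be decoded.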


Hope this is useful,

Mark

length: 1
    97.297%    :    illegal_punycode    (36)
    02.703%    :    non_LMN    (1)
length: 2
    92.909%    :    illegal_punycode    (1,271)
    04.386%    :    non_LMN    (60)
    02.705%    :    all_ascii    (37)
length: 3
    48.121%    :    otherwise_ok    (24,357)
    30.725%    :    illegal_punycode    (15,552)
    11.109%    :    non_LMN    (5,623)
    04.645%    :    unassigned    (2,351)
    02.705%    :    all_ascii    (1,369)
    02.695%    :    non_folded    (1,364)
length: 4
    46.001%    :    illegal_punycode    (861,512)
    23.924%    :    otherwise_ok    (448,047)
    16.716%    :    unassigned    (313,059)
    08.354%    :    non_LMN    (156,447)
    02.705%    :    all_ascii    (50,653)
    02.300%    :    non_folded    (43,074)
length: 5
    46.694%    :    illegal_punycode    (29,064)
    31.959%    :    otherwise_ok    (19,892)
    09.399%    :    non_LMN    (5,850)
    06.754%    :    unassigned    (4,204)
    02.696%    :    all_ascii    (1,678)
    02.498%    :    non_folded    (1,555)
length: 6
    47.881%    :    illegal_punycode    (29,946)
    25.041%    :    otherwise_ok    (15,661)
    13.676%    :    unassigned    (8,553)
    08.294%    :    non_LMN    (5,187)
    02.723%    :    all_ascii    (1,703)
    02.386%    :    non_folded    (1,492)
length: 7
    49.545%    :    illegal_punycode    (30,959)
    26.269%    :    otherwise_ok    (16,415)
    10.865%    :    unassigned    (6,789)
    08.253%    :    non_LMN    (5,157)
    02.713%    :    all_ascii    (1,695)
    02.356%    :    non_folded    (1,472)
length: 8
    50.707%    :    illegal_punycode    (31,414)
    22.831%    :    otherwise_ok    (14,144)
    13.564%    :    unassigned    (8,403)
    07.871%    :    non_LMN    (4,876)
    02.741%    :    all_ascii    (1,698)
    02.287%    :    non_folded    (1,417)
length: 9
    50.912%    :    illegal_punycode    (31,940)
    24.295%    :    otherwise_ok    (15,242)
    12.015%    :    unassigned    (7,538)
    07.693%    :    non_LMN    (4,826)
    02.766%    :    all_ascii    (1,735)
    02.319%    :    non_folded    (1,455)
length: 10
    51.009%    :    illegal_punycode    (31,760)
    22.071%    :    otherwise_ok    (13,742)
    14.408%    :    unassigned    (8,971)
    07.512%    :    non_LMN    (4,677)
    02.676%    :    all_ascii    (1,666)
    02.324%    :    non_folded    (1,447)
length: 11
    52.386%    :    illegal_punycode    (32,852)
    21.739%    :    otherwise_ok    (13,633)
    13.479%    :    unassigned    (8,453)
    07.289%    :    non_LMN    (4,571)
    02.751%    :    all_ascii    (1,725)
    02.355%    :    non_folded    (1,477)
length: 12
    52.535%    :    illegal_punycode    (32,787)
    20.875%    :    otherwise_ok    (13,028)
    14.156%    :    unassigned    (8,835)
    07.188%    :    non_LMN    (4,486)
    02.697%    :    all_ascii    (1,683)
    02.549%    :    non_folded    (1,591)
length: 13
    52.923%    :    illegal_punycode    (33,188)
    20.389%    :    otherwise_ok    (12,786)
    14.432%    :    unassigned    (9,050)
    07.005%    :    non_LMN    (4,393)
    02.682%    :    all_ascii    (1,682)
    02.569%    :    non_folded    (1,611)
length: 14
    53.417%    :    illegal_punycode    (33,263)
    19.515%    :    otherwise_ok    (12,152)
    14.837%    :    unassigned    (9,239)
    06.803%    :    non_LMN    (4,236)
    02.730%    :    non_folded    (1,700)
    02.698%    :    all_ascii    (1,680)
length: 15
    54.071%    :    illegal_punycode    (33,796)
    18.905%    :    otherwise_ok    (11,816)
    14.684%    :    unassigned    (9,178)
    06.712%    :    non_LMN    (4,195)
    02.909%    :    non_folded    (1,818)
    02.720%    :    all_ascii    (1,700)
length: 16
    53.885%    :    illegal_punycode    (33,741)
    18.150%    :    otherwise_ok    (11,365)
    15.368%    :    unassigned    (9,623)
    06.703%    :    non_LMN    (4,197)
    03.145%    :    non_folded    (1,969)
    02.750%    :    all_ascii    (1,722)
length: 17
    54.281%    :    illegal_punycode    (33,862)
    18.016%    :    otherwise_ok    (11,239)
    15.265%    :    unassigned    (9,523)
    06.526%    :    non_LMN    (4,071)
    03.302%    :    non_folded    (2,060)
    02.610%    :    all_ascii    (1,628)
length: 18
    54.470%    :    illegal_punycode    (34,239)
    17.489%    :    otherwise_ok    (10,993)
    15.230%    :    unassigned    (9,573)
    06.615%    :    non_LMN    (4,158)
    03.513%    :    non_folded    (2,208)
    02.684%    :    all_ascii    (1,687)
length: 19
    54.406%    :    illegal_punycode    (34,149)
    17.162%    :    otherwise_ok    (10,772)
    15.237%    :    unassigned    (9,564)
    06.741%    :    non_LMN    (4,231)
    03.763%    :    non_folded    (2,362)
    02.691%    :    all_ascii    (1,689)
length: 20
    54.940%    :    illegal_punycode    (34,364)
    16.606%    :    otherwise_ok    (10,387)
    15.089%    :    unassigned    (9,438)
    06.732%    :    non_LMN    (4,211)
    03.895%    :    non_folded    (2,436)
    02.737%    :    all_ascii    (1,712)

On Sun, May 11, 2008 at 7:36 AM, Vint Cerf <vint at google.com> wrote:

> I think we should say nothing about display. John's focus is on whether
> and how to do the lookup.
>
> I agree with what I understand his two positions to be:
>
> 1. just put the punycode string into the DNS query opaquely.
>
> OR
>
> 2. do the conversion and handle as if the resulting Unicode had been
> submitted.
>
> technical question:
>
> if someone generates an arbitrary string of the form "xn--<random
> sequence of lowercase a-z, 0-9 and hyphen>",
> does the algorithm ALWAYS produce a sequence of UNICODE code points? Note
> I did not say a PVALID set of code points or even ASSIGNED.
>
> I am asking because I am wondering how a relatively simple-minded
> implementation might look from the UI perspective.
>
> If we always get a sequence of code points regardless of the sequence of
> LDH, the simple-minded implementation could easily produce gibberish if
> attempting to invert to UNICODE a sequence of random LDH characters
> (confining the letters to lowercase)
>
> Is the following correct:
>
> let s be a random string of <lower case a-z, 0-9, hyphen> prefixed by
> "xn--"
>
> let To UNICODE be a function that maps s into UNICODE
>
> let To ASCII be a function that maps UNICODE into punycode
>
> s is valid punycode If and Only If s = To ASCII ( To UNICODE  (s) )
>
> I hope I haven't mangled the question too badly.
>
> v


-- 
Mark

More information about the Idna-update mailing list