looking up domain names with unassigned code points
Mark Davis
mark.davis at icu-project.org
Tue May 13 00:53:39 CEST 2008
In answer to your question, I wrote a quick-and-dirty test that takes random
"xn--something" labels (where something matches [-a-z0-9]+) and reports what
percentage of them fall into each category. Here are the results.
For lengths up to 4, the test is exhaustive; above that, it is a random
sample.
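A minimal sketch of how such candidates might be generated, in Python. The
37-symbol alphabet follows the [-a-z0-9] class above; the function name, the
sample size, and the fixed seed are assumptions for illustration, not details
of the original test.

```python
import itertools
import random

# The 37 symbols allowed after "xn--": hyphen, a-z, 0-9.
ALPHABET = "-abcdefghijklmnopqrstuvwxyz0123456789"


def candidates(length, sample_size=60000, exhaustive_up_to=4, rng=None):
    """Yield the part after "xn--" for one length.

    Short lengths are enumerated exhaustively; longer ones are sampled
    at random. sample_size is an assumed value, not the original one.
    """
    if length <= exhaustive_up_to:
        for chars in itertools.product(ALPHABET, repeat=length):
            yield "".join(chars)
    else:
        rng = rng or random.Random(0)  # fixed seed so runs are repeatable
        for _ in range(sample_size):
            yield "".join(rng.choice(ALPHABET) for _ in range(length))
```

Exhaustive enumeration grows as 37^n (1,874,161 strings at length 4), which
is one plausible reason to switch to sampling beyond that point.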
The percentages are not what one would expect from a random sampling of
strings (e.g., fewer than 10% of possible Unicode code points are assigned
Letter/Mark/Number characters). I suspect this is because Punycode favors
locality of deltas.
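One rough way to look at the locality effect, assuming Python's built-in
"punycode" codec as a stand-in for whatever implementation the test used:
decode a label and list its code points. Punycode stores the non-ASCII
characters as deltas, starting from U+0080 and moving upward, so a short
label can only encode small deltas and its decoded characters tend to
cluster instead of spreading uniformly over the code space. The helper name
is hypothetical.

```python
def decoded_code_points(suffix):
    """Return the code points for the part after "xn--", or None on failure."""
    try:
        text = suffix.encode("ascii").decode("punycode")
    except (UnicodeError, ValueError):
        return None
    return [ord(ch) for ch in text]


# "bcher-kva" is the Punycode form of "bücher": five ASCII letters
# plus one non-ASCII character, U+00FC.
print([hex(cp) for cp in decoded_code_points("bcher-kva")])
```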
Key:
- illegal_punycode - converting to Unicode and back produces an error
- unassigned - has at least one unassigned code point
- non_LMN - has at least one character that is not a Letter, Mark, or Number
- non_folded - has at least one character that is not NFKC-normalized or
not case-folded
- all_ascii - is all ASCII
- otherwise_ok - everything else
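The categories in the key could be approximated along these lines. This is
a sketch under several assumptions, not the code that produced the numbers
below: it uses Python's "punycode" codec and unicodedata module, treats a
round-trip mismatch as illegal_punycode, checks the categories in a guessed
order, and approximates "NFKC and case-folded" with normalize() plus
casefold(). Counts will also differ because unicodedata reflects a newer
Unicode version than was current in 2008.

```python
import unicodedata
from collections import Counter


def classify(suffix):
    """Assign the part after "xn--" to one category from the key."""
    try:
        text = suffix.encode("ascii").decode("punycode")
        back = text.encode("punycode").decode("ascii")
    except (UnicodeError, ValueError):
        return "illegal_punycode"
    if back != suffix:  # assumption: a round-trip mismatch counts as illegal
        return "illegal_punycode"
    if text.isascii():
        return "all_ascii"
    categories = [unicodedata.category(ch) for ch in text]
    if "Cn" in categories:  # Cn = unassigned code point
        return "unassigned"
    if any(cat[0] not in "LMN" for cat in categories):
        return "non_LMN"
    # Rough stand-in for "NFKC-normalized and case-folded".
    nfkc = unicodedata.normalize("NFKC", text)
    if unicodedata.normalize("NFKC", nfkc.casefold()) != text:
        return "non_folded"
    return "otherwise_ok"


def report(suffixes):
    """Format category counts in the same style as the tables below."""
    counts = Counter(classify(s) for s in suffixes)
    total = sum(counts.values())
    return [
        f"{100 * n / total:06.3f}% : {name} ({n:,})"
        for name, n in counts.most_common()
    ]
```

For example, classify("bcher-kva") decodes to "bücher" and falls through to
otherwise_ok, while classify("abc-") decodes to plain "abc" and is reported
as all_ascii.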
Hope this is useful,
Mark
length: 1
97.297% : illegal_punycode (36)
02.703% : non_LMN (1)
length: 2
92.909% : illegal_punycode (1,271)
04.386% : non_LMN (60)
02.705% : all_ascii (37)
length: 3
48.121% : otherwise_ok (24,357)
30.725% : illegal_punycode (15,552)
11.109% : non_LMN (5,623)
04.645% : unassigned (2,351)
02.705% : all_ascii (1,369)
02.695% : non_folded (1,364)
length: 4
46.001% : illegal_punycode (861,512)
23.924% : otherwise_ok (448,047)
16.716% : unassigned (313,059)
08.354% : non_LMN (156,447)
02.705% : all_ascii (50,653)
02.300% : non_folded (43,074)
length: 5
46.694% : illegal_punycode (29,064)
31.959% : otherwise_ok (19,892)
09.399% : non_LMN (5,850)
06.754% : unassigned (4,204)
02.696% : all_ascii (1,678)
02.498% : non_folded (1,555)
length: 6
47.881% : illegal_punycode (29,946)
25.041% : otherwise_ok (15,661)
13.676% : unassigned (8,553)
08.294% : non_LMN (5,187)
02.723% : all_ascii (1,703)
02.386% : non_folded (1,492)
length: 7
49.545% : illegal_punycode (30,959)
26.269% : otherwise_ok (16,415)
10.865% : unassigned (6,789)
08.253% : non_LMN (5,157)
02.713% : all_ascii (1,695)
02.356% : non_folded (1,472)
length: 8
50.707% : illegal_punycode (31,414)
22.831% : otherwise_ok (14,144)
13.564% : unassigned (8,403)
07.871% : non_LMN (4,876)
02.741% : all_ascii (1,698)
02.287% : non_folded (1,417)
length: 9
50.912% : illegal_punycode (31,940)
24.295% : otherwise_ok (15,242)
12.015% : unassigned (7,538)
07.693% : non_LMN (4,826)
02.766% : all_ascii (1,735)
02.319% : non_folded (1,455)
length: 10
51.009% : illegal_punycode (31,760)
22.071% : otherwise_ok (13,742)
14.408% : unassigned (8,971)
07.512% : non_LMN (4,677)
02.676% : all_ascii (1,666)
02.324% : non_folded (1,447)
length: 11
52.386% : illegal_punycode (32,852)
21.739% : otherwise_ok (13,633)
13.479% : unassigned (8,453)
07.289% : non_LMN (4,571)
02.751% : all_ascii (1,725)
02.355% : non_folded (1,477)
length: 12
52.535% : illegal_punycode (32,787)
20.875% : otherwise_ok (13,028)
14.156% : unassigned (8,835)
07.188% : non_LMN (4,486)
02.697% : all_ascii (1,683)
02.549% : non_folded (1,591)
length: 13
52.923% : illegal_punycode (33,188)
20.389% : otherwise_ok (12,786)
14.432% : unassigned (9,050)
07.005% : non_LMN (4,393)
02.682% : all_ascii (1,682)
02.569% : non_folded (1,611)
length: 14
53.417% : illegal_punycode (33,263)
19.515% : otherwise_ok (12,152)
14.837% : unassigned (9,239)
06.803% : non_LMN (4,236)
02.730% : non_folded (1,700)
02.698% : all_ascii (1,680)
length: 15
54.071% : illegal_punycode (33,796)
18.905% : otherwise_ok (11,816)
14.684% : unassigned (9,178)
06.712% : non_LMN (4,195)
02.909% : non_folded (1,818)
02.720% : all_ascii (1,700)
length: 16
53.885% : illegal_punycode (33,741)
18.150% : otherwise_ok (11,365)
15.368% : unassigned (9,623)
06.703% : non_LMN (4,197)
03.145% : non_folded (1,969)
02.750% : all_ascii (1,722)
length: 17
54.281% : illegal_punycode (33,862)
18.016% : otherwise_ok (11,239)
15.265% : unassigned (9,523)
06.526% : non_LMN (4,071)
03.302% : non_folded (2,060)
02.610% : all_ascii (1,628)
length: 18
54.470% : illegal_punycode (34,239)
17.489% : otherwise_ok (10,993)
15.230% : unassigned (9,573)
06.615% : non_LMN (4,158)
03.513% : non_folded (2,208)
02.684% : all_ascii (1,687)
length: 19
54.406% : illegal_punycode (34,149)
17.162% : otherwise_ok (10,772)
15.237% : unassigned (9,564)
06.741% : non_LMN (4,231)
03.763% : non_folded (2,362)
02.691% : all_ascii (1,689)
length: 20
54.940% : illegal_punycode (34,364)
16.606% : otherwise_ok (10,387)
15.089% : unassigned (9,438)
06.732% : non_LMN (4,211)
03.895% : non_folded (2,436)
02.737% : all_ascii (1,712)
On Sun, May 11, 2008 at 7:36 AM, Vint Cerf <vint at google.com> wrote:
> I think we should say nothing about display. John's focus is on whether
> and how to do the lookup.
>
> I agree with what I understand his two positions to be:
>
> 1. just put the punycode string into the DNS query opaquely.
>
> OR
>
> 2. do the conversion and handle as if the resulting Unicode had been
> submitted.
>
> technical question:
>
> if someone generates an arbitrary string of the form "xn--<random
> sequence of lowercase a-z, 0-9 and hyphen>",
> does the algorithm ALWAYS produce a sequence of UNICODE code points? Note
> I did not say a PVALID set of code points or even ASSIGNED.
>
> I am asking because I am wondering how a relatively simple-minded
> implementation might look from the UI perspective.
>
> If we always get a sequence of code points regardless of the sequence of
> LDH, the simple-minded implementation could easily produce gibberish if
> attempting to invert to UNICODE a sequence of random LDH characters
> (confining the letters to lowercase)
>
> Is the following correct:
>
> let s be a random string of <lower case a-z, 0-9, hyphen> prefixed by
> "xn--"
>
> let To UNICODE be a function that maps s into UNICODE
>
> let To ASCII be a function that maps UNICODE into punycode
>
> s is valid punycode If and Only If s = To ASCII ( To UNICODE (s) )
>
> I hope I haven't mangled the question too badly.
>
> v
>
--
Mark