In answer to your question, I wrote a quick-and-dirty test that takes random "xn--something" labels (where something is from [- a-z 0-9]+) and reports what percentage falls into each category. Here are the results.<br><br>For lengths up to 4, the test is exhaustive; above that, it uses random sampling.<br>
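<br>A minimal sketch of how such a test could be set up in Python follows. It is not the code that produced the figures in this message. It assumes Python's built-in "punycode" codec and the unicodedata module as stand-ins for the actual implementation, so its counts depend on the Unicode version shipped with the interpreter and need not match the numbers reported here. The names classify, candidates, tally and sample_size are hypothetical, and the order in which overlapping categories are checked (the order of the key) is an assumption.<br>

```python
import itertools
import random
import unicodedata
from collections import Counter

# The 37 symbols allowed after "xn--": lowercase a-z, 0-9 and hyphen.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-"


def classify(body: str) -> str:
    """Classify the part of a label that follows 'xn--'."""
    try:
        text = body.encode("ascii").decode("punycode")
        # Treat a string that does not survive the round trip as illegal.
        if text.encode("punycode").decode("ascii") != body:
            return "illegal_punycode"
    except UnicodeError:
        return "illegal_punycode"  # the codec rejected the string

    categories = [unicodedata.category(ch) for ch in text]
    if "Cn" in categories:  # Cn = unassigned in this Unicode version
        return "unassigned"
    if any(cat[0] not in "LMN" for cat in categories):
        return "non_LMN"
    # Assumption: "folded" means unchanged by both NFKC and case folding.
    if text != unicodedata.normalize("NFKC", text) or text != text.casefold():
        return "non_folded"
    if text.isascii():
        return "all_ascii"
    return "otherwise_ok"


def candidates(length, sample_size=10_000, exhaustive_up_to=4):
    """Yield every string for short lengths, random samples otherwise."""
    if length <= exhaustive_up_to:
        for combo in itertools.product(ALPHABET, repeat=length):
            yield "".join(combo)
    else:
        for _ in range(sample_size):
            yield "".join(random.choices(ALPHABET, k=length))


def tally(length, **kwargs):
    """Count how many candidate strings fall into each category."""
    return Counter(classify(body) for body in candidates(length, **kwargs))


if __name__ == "__main__":
    for length in range(1, 7):
        counts = tally(length)
        total = sum(counts.values())
        print(f"length: {length}")
        for name, n in counts.most_common():
            print(f" {100 * n / total:06.3f}% : {name} ({n:,})")
```

Running the script prints one block per length in roughly the same layout as the results below; sample_size only affects lengths above 4.<br>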
<br>The percentages are not what one would expect from a random sampling of strings (e.g., &lt; 10% of possible Unicode code points are assigned Letter/Mark/Number characters); I suspect this is because Punycode favors locality of deltas.<br><br>Key:<br>
<ul><li>illegal_punycode - converting to Unicode and back produces an error</li><li>unassigned - has at least one unassigned character</li><li>non_LMN - has at least one character that is not a Letter, Mark, or Number</li><li>non_folded - has at least one character that is not NFKC-normalized or not case-folded</li>
<li>all_ascii - is all ASCII</li><li>otherwise_ok - everything else</li></ul><br>Hope this is useful,<br><br>Mark<br><br>length: 1<br> 97.297% : illegal_punycode (36)<br> 02.703% : non_LMN (1)<br>length: 2<br>
92.909% : illegal_punycode (1,271)<br> 04.386% : non_LMN (60)<br> 02.705% : all_ascii (37)<br>length: 3<br> 48.121% : otherwise_ok (24,357)<br> 30.725% : illegal_punycode (15,552)<br>
11.109% : non_LMN (5,623)<br> 04.645% : unassigned (2,351)<br> 02.705% : all_ascii (1,369)<br> 02.695% : non_folded (1,364)<br>length: 4<br> 46.001% : illegal_punycode (861,512)<br>
23.924% : otherwise_ok (448,047)<br> 16.716% : unassigned (313,059)<br> 08.354% : non_LMN (156,447)<br> 02.705% : all_ascii (50,653)<br> 02.300% : non_folded (43,074)<br>
length: 5<br> 46.694% : illegal_punycode (29,064)<br> 31.959% : otherwise_ok (19,892)<br> 09.399% : non_LMN (5,850)<br> 06.754% : unassigned (4,204)<br> 02.696% : all_ascii (1,678)<br>
02.498% : non_folded (1,555)<br>length: 6<br> 47.881% : illegal_punycode (29,946)<br> 25.041% : otherwise_ok (15,661)<br> 13.676% : unassigned (8,553)<br> 08.294% : non_LMN (5,187)<br>
02.723% : all_ascii (1,703)<br> 02.386% : non_folded (1,492)<br>length: 7<br> 49.545% : illegal_punycode (30,959)<br> 26.269% : otherwise_ok (16,415)<br> 10.865% : unassigned (6,789)<br>
08.253% : non_LMN (5,157)<br> 02.713% : all_ascii (1,695)<br> 02.356% : non_folded (1,472)<br>length: 8<br> 50.707% : illegal_punycode (31,414)<br> 22.831% : otherwise_ok (14,144)<br>
13.564% : unassigned (8,403)<br> 07.871% : non_LMN (4,876)<br> 02.741% : all_ascii (1,698)<br> 02.287% : non_folded (1,417)<br>length: 9<br> 50.912% : illegal_punycode (31,940)<br>
24.295% : otherwise_ok (15,242)<br> 12.015% : unassigned (7,538)<br> 07.693% : non_LMN (4,826)<br> 02.766% : all_ascii (1,735)<br> 02.319% : non_folded (1,455)<br>
length: 10<br> 51.009% : illegal_punycode (31,760)<br> 22.071% : otherwise_ok (13,742)<br> 14.408% : unassigned (8,971)<br> 07.512% : non_LMN (4,677)<br> 02.676% : all_ascii (1,666)<br>
02.324% : non_folded (1,447)<br>length: 11<br> 52.386% : illegal_punycode (32,852)<br> 21.739% : otherwise_ok (13,633)<br> 13.479% : unassigned (8,453)<br> 07.289% : non_LMN (4,571)<br>
02.751% : all_ascii (1,725)<br> 02.355% : non_folded (1,477)<br>length: 12<br> 52.535% : illegal_punycode (32,787)<br> 20.875% : otherwise_ok (13,028)<br> 14.156% : unassigned (8,835)<br>
07.188% : non_LMN (4,486)<br> 02.697% : all_ascii (1,683)<br> 02.549% : non_folded (1,591)<br>length: 13<br> 52.923% : illegal_punycode (33,188)<br> 20.389% : otherwise_ok (12,786)<br>
14.432% : unassigned (9,050)<br> 07.005% : non_LMN (4,393)<br> 02.682% : all_ascii (1,682)<br> 02.569% : non_folded (1,611)<br>length: 14<br> 53.417% : illegal_punycode (33,263)<br>
19.515% : otherwise_ok (12,152)<br> 14.837% : unassigned (9,239)<br> 06.803% : non_LMN (4,236)<br> 02.730% : non_folded (1,700)<br> 02.698% : all_ascii (1,680)<br>
length: 15<br> 54.071% : illegal_punycode (33,796)<br> 18.905% : otherwise_ok (11,816)<br> 14.684% : unassigned (9,178)<br> 06.712% : non_LMN (4,195)<br> 02.909% : non_folded (1,818)<br>
02.720% : all_ascii (1,700)<br>length: 16<br> 53.885% : illegal_punycode (33,741)<br> 18.150% : otherwise_ok (11,365)<br> 15.368% : unassigned (9,623)<br> 06.703% : non_LMN (4,197)<br>
03.145% : non_folded (1,969)<br> 02.750% : all_ascii (1,722)<br>length: 17<br> 54.281% : illegal_punycode (33,862)<br> 18.016% : otherwise_ok (11,239)<br> 15.265% : unassigned (9,523)<br>
06.526% : non_LMN (4,071)<br> 03.302% : non_folded (2,060)<br> 02.610% : all_ascii (1,628)<br>length: 18<br> 54.470% : illegal_punycode (34,239)<br> 17.489% : otherwise_ok (10,993)<br>
15.230% : unassigned (9,573)<br> 06.615% : non_LMN (4,158)<br> 03.513% : non_folded (2,208)<br> 02.684% : all_ascii (1,687)<br>length: 19<br> 54.406% : illegal_punycode (34,149)<br>
17.162% : otherwise_ok (10,772)<br> 15.237% : unassigned (9,564)<br> 06.741% : non_LMN (4,231)<br> 03.763% : non_folded (2,362)<br> 02.691% : all_ascii (1,689)<br>
length: 20<br> 54.940% : illegal_punycode (34,364)<br> 16.606% : otherwise_ok (10,387)<br> 15.089% : unassigned (9,438)<br> 06.732% : non_LMN (4,211)<br> 03.895% : non_folded (2,436)<br>
02.737% : all_ascii (1,712)<br><br><div class="gmail_quote">On Sun, May 11, 2008 at 7:36 AM, Vint Cerf &lt;<a href="mailto:vint@google.com">vint@google.com</a>&gt; wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
I think we should say nothing about display. John's focus is on whether and how to do the lookup.<br>
<br>
I agree with what I understand his two positions to be:<br>
<br>
1. just put the punycode string into the DNS query opaquely.<br>
<br>
OR<br>
<br>
2. do the conversion and handle as if the resulting Unicode had been submitted.<br>
<br>
technical question:<br>
<br>
if someone generates an arbitrary string of the form "xn-- &lt;random sequence of lowercase a-z, 0-9 and hyphen&gt;",<br>
does the algorithm ALWAYS produce a sequence of UNICODE code points? Note I did not say a PVALID set of code points or even ASSIGNED.<br>
<br>
I am asking because I am wondering how a relatively simple-minded implementation might look from the UI perspective.<br>
<br>
If we always get a sequence of code points regardless of the sequence of LDH, the simple-minded implementation could easily produce gibberish if attempting to invert to UNICODE a sequence of random LDH characters (confining the letters to lowercase)<br>
<br>
Is the following correct:<br>
<br>
let s be a random string of &lt;lower case a-z, 0-9, hyphen&gt; prefixed by "xn--"<br>
<br>
let To UNICODE be a function that maps s into UNICODE<br>
<br>
let To ASCII be a function that maps UNICODE into punycode<br>
<br>
s is valid punycode If and Only If s = To ASCII ( To UNICODE (s) )<br>
<br>
I hope I haven't mangled the question too badly.<br><font color="#888888">
<br>
v<br>
<br>
<br>
<br>
</font></blockquote></div><br><br clear="all"><br>-- <br>Mark