draft-liman-tld-names-00.txt and bidi

Martin Duerst duerst at it.aoyama.ac.jp
Mon Mar 9 02:16:50 CET 2009

Hello Andrew,

At 00:53 09/03/09, Andrew Sullivan wrote:
>On Sat, Mar 07, 2009 at 03:43:02PM +0900, Martin Duerst wrote:
>> You are right that there is a bidi issue. For some very specific
>> example, please see Example 11 at
>> http://www.w3.org/International/iri-edit/BidiExamples
>> (please read the legends or tooltips carefully).
>> The reason why there are bidi issues is:
>> - Non-IDN labels turn up in IDNs
>> - Digits get close to RTL characters, maybe only separated by dots
>> - In the bidi algorithm, numbers and dots get associated with nearby
>>   text and thrown around
>Ok.  Now the important question is, is it ever possible for Punycode
>to produce output that ends in a digit?  I haven't run into an example
>yet, but I haven't been able to convince myself that's anything but an
>accident.  If someone who understands the algorithm better than I
>says, "No, it can't, and here's why," then we'll be in a position to
>add to draft-liman-tld-names a restriction that a TLD must both begin
>_and end_ with an ASCII letter, and the problem will automatically go
>away.  Otherwise, we can't make that rule.  Right?

Oh, yes, we have to look at digits on the Unicode level and on the
ASCII-only level.

I have cursorily looked at RFC 3492, and it indeed seems to be the
case that for each punycode-encoded Unicode character, the last of
the ASCII characters in the punycode is always a letter, never a
digit. The reason for this seems to be that tmax is set to 26. But
that's only a guess. I'm sure Adam can say more about this.
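
To make the tmax guess concrete, here is a minimal sketch of the
variable-length integer encoding from RFC 3492, section 6.3. The
function names (digit_char, threshold, encode_delta) are mine, not
the RFC's, and bias adaptation is left out: the bias is simply passed
in. The point it tries to show is that the final digit q of each delta
is emitted only once q < t, and t never exceeds tmax = 26, so q falls
in 0..25, which are the digit values mapped to letters.

```python
# Parameter values from RFC 3492, section 5.
BASE, TMIN, TMAX = 36, 1, 26


def digit_char(d):
    """Map a digit value to its basic code point: 0..25 -> a..z, 26..35 -> 0..9."""
    return chr(ord("a") + d) if d < 26 else chr(ord("0") + d - 26)


def threshold(k, bias):
    """Clamp k - bias into the range [TMIN, TMAX] (RFC 3492, section 6.3)."""
    return min(max(k - bias, TMIN), TMAX)


def encode_delta(delta, bias):
    """Encode one delta as a generalized variable-length integer.

    Illustrative helper only: bias adaptation between deltas is not done here.
    """
    out = []
    q, k = delta, BASE
    while True:
        t = threshold(k, bias)
        if q < t:
            break
        out.append(digit_char(t + (q - t) % (BASE - t)))
        q = (q - t) // (BASE - t)
        k += BASE
    # Final digit: q < t <= TMAX = 26, so it is always in the letter range.
    out.append(digit_char(q))
    return "".join(out)
```

For instance, the single delta that arises when encoding "bücher" is
745 with the initial bias of 72, and encode_delta(745, 72) gives "kva",
as in "bcher-kva".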

Appendix A mentions how to include upper/lower-case hints into
punycode, and says:
                                             Each non-basic code point
   is represented by a delta, which is represented by a sequence of
   basic code points, the last of which provides the annotation.  If it
   is uppercase, it is a suggestion to map the non-basic code point to
   uppercase (if possible); if it is lowercase, it is a suggestion to
   map the non-basic code point to lowercase (if possible).

The case that the last 'basic code point' (i.e. ASCII) is a digit
indeed doesn't seem to exist.
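
For anybody who wants to look for a counterexample empirically, a
rough check along the following lines should do. It assumes Python's
built-in "punycode" codec and uses a helper name (ends_in_letter) that
I made up. It only makes sense for labels containing at least one
non-ASCII character, since an all-ASCII input comes out ending in the
delimiter. Sampling like this is of course not a proof.

```python
def ends_in_letter(label):
    """Return True if the Punycode form of label ends in an ASCII letter.

    Assumes label contains at least one non-ASCII character.
    """
    encoded = label.encode("punycode").decode("ascii")
    return encoded[-1].isalpha()


def find_counterexample(first=0x80, last=0x2FFF):
    """Try labels that mix ASCII digits with one non-ASCII character.

    Returns the first label whose encoding does not end in a letter,
    or None if no such label is found in the sampled range.
    """
    for cp in range(first, last + 1):
        label = "a1" + chr(cp) + "9"
        if not ends_in_letter(label):
            return label
    return None
```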

This means that as long as we can assume that punycode will be the
only algorithm for encoding non-ASCII stuff (or whatever else) into
the DNS, we are safe to prohibit TLDs ending with a digit. In my
view, that's fine; if there is something other than punycode in
10 or 20 years, we can always work on changing the RFC that will
have resulted from the draft now being discussed.

Regards,   Martin.

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst at it.aoyama.ac.jp     
