Definitions limit on label length in UTF-8

Sat Sep 12 03:33:34 CEST 2009

Hello Adam,

Great to hear from you.

On 2009/09/12 5:10, Adam M. Costello wrote:
> "Martin J. Dürst" <duerst at it.aoyama.ac.jp> wrote:
>
>> [Short summary: It's very easy to create UTF-8 strings that are longer
>> than punycode, for everything except US-ASCII. Remember, punycode was
>> *designed* to be efficient, in particular for domain name labels.]
>
> Of the 11 languages I tried for my ACE evaluation, five were more
> compact in Punycode than in UTF-8 for my example sentence ("Why can't
> they just speak <language>?").
>
> Arabic:
>    Punycode: xn--egbpdaj6bu4bxfgehfvwxn
>       UTF-8: ??????????????????????????????????
> Hebrew:
>    Punycode: xn--4dbcagdahymbxekheh6e0a7fei0b
>       UTF-8: ????????????????????????????????????????????
> Hindi (Devanagari):
>    Punycode: xn--i1baa7eci9glrd9b2ae1bj0hfcgg6iyaf8o0a1dig0cd
>       UTF-8: ??????????????????????????????????????????????????????????????????????????????????????????
> Japanese (kanji and hiragana):
>    Punycode: xn--n8jok5ay5dzabd5bym9f0cm5685rrjetr6pdxa
>       UTF-8: ??????????????????????????????????????????????????????
> Russian:
>    Punycode: xn--b1abfaaepdrnnbgefbaDotcwatmq2g4l
>       UTF-8: ??????????????????????????????????????????????????????????

As for the other examples, Chinese (simplified in particular) is lucky
in that its code points lie quite close together. Czech, Spanish, and
Vietnamese use the Latin script and are mostly ASCII. Korean was always
known to be a problem for compression, even for Punycode.
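
For anyone who wants to check such numbers themselves, here is a minimal
sketch in Python. It assumes that Python's built-in "punycode" codec
(which follows RFC 3492) is an acceptable stand-in for a full IDNA
implementation. The helper name label_lengths is made up for this
illustration, and no Nameprep or other mapping is applied.

```python
# Compare the octet counts of one label in ACE form and in UTF-8.
# The label is the Arabic example quoted above; "xn--" is the ACE prefix.

ACE_PREFIX = "xn--"


def label_lengths(ace_label: str) -> tuple[int, int]:
    """Return (ace_octets, utf8_octets) for an ACE label such as 'xn--...'."""
    if not ace_label.lower().startswith(ACE_PREFIX):
        raise ValueError("not an ACE label")
    encoded = ace_label[len(ACE_PREFIX):]
    # Decode the Punycode part back to the Unicode text it stands for.
    text = encoded.encode("ascii").decode("punycode")
    # An ACE label is pure ASCII, so its character count is its octet count.
    return len(ace_label), len(text.encode("utf-8"))


arabic = "xn--egbpdaj6bu4bxfgehfvwxn"
ace_octets, utf8_octets = label_lengths(arabic)
print(f"ACE: {ace_octets} octets, UTF-8: {utf8_octets} octets")
```

The same function can be run over the other four labels in the list to
see how far the two forms diverge for each script.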

>> People not familiar with the history of the development of IDNA2003
>> should be aware of the fact that a lot of energy went into the
>> development of compression algorithms for domain names,
>
> I can confirm that.  :)
>
>> The "max 63 octets in UTF-8" provision, unless removed, negates all
>> this effort.
>
> Yeah, that would be a shame.
>
> Since I haven't had time to participate in IDNA2008, maybe I haven't
> earned the right to comment, but...

Well, you are THE expert on Punycode, so your comments are very much
appreciated.

Regards,   Martin.

> A real concern back then was that IDNA would be unfair to non-ASCII
> scripts, because they couldn't fit nearly as much text in a domain
> label.  Making it truly fair was never possible, but we did work very
> hard to find an encoding that could squeeze as much non-ASCII text
> as possible into a 63-byte ACE label.  If a 63-byte limit on UTF-8
> forms is imposed, the complexity of Punycode is largely wasted and/or
> misdirected; the encoding should have been designed to be just complex
> enough to beat UTF-8, not to be as compact as possible.
>
> Also, I agree with Martin's concern that adding a 63-byte limit on the
> UTF-8 form of labels would have greater cost than benefit.  The cost is
> breaking compatibility with names that are valid in IDNA2003.
>
> AMC
>
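
To make the compatibility concern concrete, here is a rough sketch of a
label that fits into 63 octets as an ACE label but not in its UTF-8
form. The label itself is artificial (one Cyrillic letter repeated), the
function names are invented for this example, and Python's "punycode"
codec is used directly, without Nameprep or any other IDNA processing.

```python
MAX_LABEL_OCTETS = 63  # limit on the length of a single DNS label


def fits_as_ace(label: str) -> bool:
    """True if 'xn--' plus the Punycode form stays within 63 octets."""
    ace = b"xn--" + label.encode("punycode")
    return len(ace) <= MAX_LABEL_OCTETS


def fits_as_utf8(label: str) -> bool:
    """True if the UTF-8 form of the label stays within 63 octets."""
    return len(label.encode("utf-8")) <= MAX_LABEL_OCTETS


# Artificial label: 40 copies of CYRILLIC SMALL LETTER A,
# which takes 2 octets per character in UTF-8 (80 octets in total).
label = "\u0430" * 40
print("fits as ACE:  ", fits_as_ace(label))
print("fits as UTF-8:", fits_as_utf8(label))
```

A label like this is acceptable under a limit on the ACE form alone, and
would be rejected once the same limit is also applied to the UTF-8 form.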

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst at it.aoyama.ac.jp