Definitions limit on label length in UTF-8

Adam M. Costello idna-update.amc+0+ at nicemice.net.RemoveThisWord
Fri Sep 11 22:10:58 CEST 2009


"Martin J. Dürst" <duerst at it.aoyama.ac.jp> wrote:

> [Short summary: It's very easy to create UTF-8 strings that are longer
> than punycode, for everything except US-ASCII. Remember, punycode was
> *designed* to be efficient, in particular for domain name labels.]

Of the 11 languages I tried for my ACE evaluation, five were more
compact in Punycode than in UTF-8 for my example sentence ("Why can't
they just speak <language>?").

Arabic:
  Punycode: xn--egbpdaj6bu4bxfgehfvwxn
     UTF-8: ??????????????????????????????????
Hebrew:
  Punycode: xn--4dbcagdahymbxekheh6e0a7fei0b
     UTF-8: ????????????????????????????????????????????
Hindi (Devanagari):
  Punycode: xn--i1baa7eci9glrd9b2ae1bj0hfcgg6iyaf8o0a1dig0cd
     UTF-8: ??????????????????????????????????????????????????????????????????????????????????????????
Japanese (kanji and hiragana):
  Punycode: xn--n8jok5ay5dzabd5bym9f0cm5685rrjetr6pdxa
     UTF-8: ??????????????????????????????????????????????????????
Russian:
  Punycode: xn--b1abfaaepdrnnbgefbaDotcwatmq2g4l
     UTF-8: ??????????????????????????????????????????????????????????
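
For anyone who wants to reproduce this kind of comparison, here is a
minimal sketch in Python. It assumes the standard library's "punycode"
codec and decodes the raw Punycode only; it does not run nameprep or
any other IDNA step. The function name ace_and_utf8_lengths is made up
for illustration.

```python
ACE_PREFIX = "xn--"  # ACE prefix, as it appears on the labels above


def ace_and_utf8_lengths(ace_label: str) -> tuple[int, int]:
    """Return (ACE length, UTF-8 length) in octets for one xn-- label.

    Assumption: the label is plain Punycode behind the ACE prefix;
    no nameprep or IDNA validity checks are performed.
    """
    if not ace_label.lower().startswith(ACE_PREFIX):
        raise ValueError("expected a label starting with xn--")
    # Strip the prefix and decode the Punycode part back to Unicode.
    text = ace_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    # Compare the ASCII form against the UTF-8 encoding of the same text.
    return len(ace_label), len(text.encode("utf-8"))


# The Russian label quoted above.
russian = "xn--b1abfaaepdrnnbgefbaDotcwatmq2g4l"
ace_len, utf8_len = ace_and_utf8_lengths(russian)
print("ACE octets:", ace_len, "UTF-8 octets:", utf8_len)
```

The same call can be repeated for the other labels in the list.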

> People not familiar with the history of the development of IDNA2003
> should be aware of the fact that a lot of energy went into the
> development of compression algorithms for domain names,

I can confirm that.  :)

> The "max 63 octets in UTF-8" provision, unless removed, negates all
> this effort.

Yeah, that would be a shame.

Since I haven't had time to participate in IDNA2008, maybe I haven't
earned the right to comment, but...

A real concern back then was that IDNA would be unfair to non-ASCII
scripts, because they couldn't fit nearly as much text in a domain
label.  Making it truly fair was never possible, but we did work very
hard to find an encoding that could squeeze as much non-ASCII text
as possible into a 63-byte ACE label.  If a 63-byte limit on UTF-8
forms is imposed, the complexity of Punycode is largely wasted and/or
misdirected; the encoding should have been designed to be just complex
enough to beat UTF-8, not to be as compact as possible.

Also, I agree with Martin's concern that adding a 63-byte limit on the
UTF-8 form of labels would have greater cost than benefit.  The cost is
breaking compatibility with names that are valid in IDNA2003.
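
To illustrate the compatibility cost, here is a rough sketch in Python
that checks one label against both limits. It looks at length only and
ignores nameprep and every other validity rule, it is meant for
non-ASCII labels, and the helper names fits_ace_limit and
fits_utf8_limit are invented for this example.

```python
MAX_LABEL_OCTETS = 63  # the label limit discussed in this thread


def fits_ace_limit(label: str) -> bool:
    """True if "xn--" plus the Punycode form fits in 63 octets."""
    ace = b"xn--" + label.encode("punycode")
    return len(ace) <= MAX_LABEL_OCTETS


def fits_utf8_limit(label: str) -> bool:
    """True if the UTF-8 form fits in 63 octets (the proposed extra limit)."""
    return len(label.encode("utf-8")) <= MAX_LABEL_OCTETS


# Recover the Hindi example text from the ACE label quoted earlier.
hindi_ace = "xn--i1baa7eci9glrd9b2ae1bj0hfcgg6iyaf8o0a1dig0cd"
hindi = hindi_ace[len("xn--"):].encode("ascii").decode("punycode")

print("fits as ACE:  ", fits_ace_limit(hindi))
print("fits as UTF-8:", fits_utf8_limit(hindi))
```

A label that passes the first check but fails the second is the kind of
name that the extra limit would newly exclude.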

AMC

