Definitions limit on label length in UTF-8
Adam M. Costello
idna-update.amc+0+ at nicemice.net.RemoveThisWord
Fri Sep 11 22:10:58 CEST 2009
"Martin J. Dürst" <duerst at it.aoyama.ac.jp> wrote:
> [Short summary: It's very easy to create UTF-8 strings that are longer
> than punycode, for everything except US-ASCII. Remember, punycode was
> *designed* to be efficient, in particular for domain name labels.]
Of the 11 languages I tried for my ACE evaluation, five were more
compact in Punycode than in UTF-8 for my example sentence ("Why can't
they just speak <language>?").
Arabic:
Punycode: xn--egbpdaj6bu4bxfgehfvwxn (26 bytes)
UTF-8: ليهمابتكلموشعربي؟ (34 bytes)
Hebrew:
Punycode: xn--4dbcagdahymbxekheh6e0a7fei0b (32 bytes)
UTF-8: למההםפשוטלאמדבריםעברית (44 bytes)
Hindi (Devanagari):
Punycode: xn--i1baa7eci9glrd9b2ae1bj0hfcgg6iyaf8o0a1dig0cd (48 bytes)
UTF-8: यहलोगहिन्दीक्योंनहींबोलसकतेहैं (90 bytes)
Japanese (kanji and hiragana):
Punycode: xn--n8jok5ay5dzabd5bym9f0cm5685rrjetr6pdxa (42 bytes)
UTF-8: なぜみんな日本語を話してくれないのか (54 bytes)
Russian:
Punycode: xn--b1abfaaepdrnnbgefbaDotcwatmq2g4l (36 bytes)
UTF-8: почемужеонинеговорятпорусски (56 bytes)
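The comparison above can be reproduced mechanically. What follows is a minimal Python sketch, not the tooling used for the original ACE evaluation. It assumes Python's built-in "punycode" codec, and the names LABELS, octet_lengths, and compare_all are invented for illustration. It decodes each label listed above and counts the bytes of both forms.

```python
# Labels copied from the examples above; all other names are illustrative.
LABELS = {
    "Arabic": "xn--egbpdaj6bu4bxfgehfvwxn",
    "Hebrew": "xn--4dbcagdahymbxekheh6e0a7fei0b",
    "Hindi": "xn--i1baa7eci9glrd9b2ae1bj0hfcgg6iyaf8o0a1dig0cd",
    "Japanese": "xn--n8jok5ay5dzabd5bym9f0cm5685rrjetr6pdxa",
    "Russian": "xn--b1abfaaepdrnnbgefbaDotcwatmq2g4l",
}

ACE_PREFIX = "xn--"


def octet_lengths(ace_label):
    """Return (ACE length, UTF-8 length) in bytes for one ACE label."""
    # Strip the ACE prefix and decode the remainder with the raw Punycode
    # codec. This performs no Nameprep or other IDNA processing.
    encoded = ace_label[len(ACE_PREFIX):]
    text = encoded.encode("ascii").decode("punycode")
    # The ACE form is pure ASCII, so its character count is its byte count.
    return len(ace_label), len(text.encode("utf-8"))


def compare_all(labels=LABELS):
    """Map each language name to its (ACE bytes, UTF-8 bytes) pair."""
    return {name: octet_lengths(label) for name, label in labels.items()}


if __name__ == "__main__":
    for name, (ace, utf8) in compare_all().items():
        print(f"{name}: ACE {ace} bytes, UTF-8 {utf8} bytes")
```

Run as a script, this prints one line per language. For these five labels the ACE form should come out shorter than the UTF-8 form, even with the "xn--" prefix counted.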
> People not familiar with the history of the development of IDNA2003
> should be aware of the fact that a lot of energy went into the
> development of compression algorithms for domain names,
I can confirm that. :)
> The "max 63 octets in UTF-8" provision, unless removed, negates all
> this effort.
Yeah, that would be a shame.
Since I haven't had time to participate in IDNA2008, maybe I haven't
earned the right to comment, but...
A real concern back then was that IDNA would be unfair to non-ASCII
scripts, because they couldn't fit nearly as much text in a domain
label. Making it truly fair was never possible, but we did work very
hard to find an encoding that could squeeze as much non-ASCII text
as possible into a 63-byte ACE label. If a 63-byte limit on UTF-8
forms is imposed, the complexity of Punycode is largely wasted and/or
misdirected; the encoding should have been designed to be just complex
enough to beat UTF-8, not to be as compact as possible.
Also, I agree with Martin's concern that adding a 63-byte limit on the
UTF-8 form of labels would have greater cost than benefit. The cost is
breaking compatibility with names that are valid in IDNA2003.
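To make that cost concrete, here is a rough Python sketch that tests one label against a 63-byte limit in both forms. The helper name check_label and the constant MAX_LABEL_BYTES are hypothetical, and the check looks only at length: it runs no Nameprep or other IDNA validity steps. The Hindi label from the list above serves as the test case.

```python
MAX_LABEL_BYTES = 63  # per-label length limit under discussion


def check_label(ace_label, limit=MAX_LABEL_BYTES):
    """Return (fits as ACE, fits as UTF-8) for a label starting with "xn--"."""
    # Decode the part after the 4-character ACE prefix with the raw
    # Punycode codec built into Python.
    text = ace_label[4:].encode("ascii").decode("punycode")
    ace_fits = len(ace_label) <= limit
    utf8_fits = len(text.encode("utf-8")) <= limit
    return ace_fits, utf8_fits


if __name__ == "__main__":
    hindi = "xn--i1baa7eci9glrd9b2ae1bj0hfcgg6iyaf8o0a1dig0cd"
    ace_fits, utf8_fits = check_label(hindi)
    print(f"ACE form fits: {ace_fits}, UTF-8 form fits: {utf8_fits}")
```

For that label the ACE form is within the limit while the UTF-8 form is not, so on length alone a name like it would be acceptable under a 63-byte ACE limit and rejected under a 63-byte UTF-8 limit.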
AMC