Definitions limit on label length in UTF-8

Sun Sep 13 10:05:58 CEST 2009

Hello John, others,

[I removed public-iri at w3.org, because this is no longer directly 
relevant to IRIs.]

On 2009/09/12 22:15, John C Klensin wrote:
> Martin,
>
> First of all, please understand that I'm much more agnostic on
> this issue than I think you assume.  I'm trying to reflect what
> I believe I've been told by the WG and by various other
> communities on the subject but, if the WG says "change it", I
> will do so as editor and lose very little sleep about the
> subject.

Okay, got it.

> I'll let Dave and Stuart address the API and eventual migration
> to pure UTF-8 issues.

I'd definitely like to hear from them re. API issues.

As for 'eventual migration to pure UTF-8', if you mean that something 
like an IDN-2015 or so could move to UTF-8 in IDN (I'm not using IDNA 
here because the A expresses the fact that internationalization is 
limited to the application), then as somebody who already advocated that 
in the past millennium, I'm pleased to hear that such ideas have 
resurfaced. But I think it would be much more appropriate to put 
something in a Security Section or Rationale saying "In the case of a 
future move of IDNs to direct encoding in UTF-8, some labels that can 
currently be expressed in at most 63 bytes in punycode won't be handled 
anymore because they will take more than 63 bytes in UTF-8."

> I've been told that the ability to
> convert to length-value form (with a six-bit length) _before_
> Punycode conversion (or in an IDNA-unaware, "octets only"
> implementation) is critical for the DNS community and for some
> security-related applications which store DNS-based identifiers
> in that form.

As I said in a previous mail, I doubt the former, and think that the 
latter ("octets only") case is irrelevant. If it were security-relevant, 
then that would also apply to IDNA 2003, wouldn't it? This would mean 
that what essentially happened was that the security folks messed up a 
hard-fought property of IDNA 2003 and punycode. I hope that's not the case.

> But I have no personal implementation experience
> in either area, so perhaps Andrew and Paul can either speak to
> those issues or point us to someone who can.

Andrew, Paul, please do so if you can!

> As a sometime-implementer, I'm nervous about unlimited-length
> strings

Nervous, yes, but unlimited-length input is something one essentially 
always has to be prepared for. Otherwise, one is open to a DoS attack.

> (as, based on recent interactions, are Stuart and Vint).
> But it seems to me that the string length here is bounded in any
> event -- with 59 characters of Punycode in an A-label, the upper
> limit on a UTF-8 or UTF-32 string cannot be over 236 characters
> and, I assume, would be considerably smaller.  Especially if we
> can pin that number down (Adam?), I'd be a lot happier with text
> that said, essentially, "the limit is on the A-label string, but
> implementations should be aware that a maximum-length A-label
> can convert to a U-label of up to NNN" characters than saying
> "unlimited" and I think some others would be too.

Here are my calculations. After a few tests, one finds out that punycode 
uses a single 'a' to express 'one more of the same character'. The 
question is then how many characters it takes punycode to express the 
first character. Expressing that first character takes more and more 
punycode characters as its Unicode number gets higher, so one has to 
test with the smallest Unicode character that needs a certain number of 
bytes in UTF-8. Going through lengths 1,2,3, and 4 per character in 
UTF-8, we find:

1 octet per character in UTF-8:
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.org gives
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.org
and has 63 characters, so 63 octets in UTF-8, 126 octets in UTF-16, and 
252 octets in UTF-32.

2 octets per character in UTF-8:
¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢.org gives
xn--8aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.org
and has 58 characters, so 116 octets in UTF-8, 116 octets in UTF-16, and 
232 octets in UTF-32. 59 seems possible in theory, but impossible in 
practice.

3 octets per character in UTF-8:
ँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँ.org (using the currently lowest encoded character that needs 3 bytes, 
U+0901, DEVANAGARI SIGN CANDRABINDU), gives
xn--h1baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.org
and has 57 characters, so 171 octets in UTF-8, 114 octets in UTF-16, and 
228 octets in UTF-32. Please note that even characters at the start of 
the 3-byte range (U+0800) would need three punycode characters for the 
first character, because even a character such as 'ü' already needs 
that many.

4 octets per character in UTF-8:
𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀.org 
(using U+10300, OLD ITALIC LETTER A, the lowest character in Unicode 3.2 
that needs 4 bytes in UTF-8) gives
xn--097caaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.org
and has 56 characters, so 224 octets in UTF-8, 224 octets in UTF-16, and 
224 octets in UTF-32.

Overall, we get a maximum label length in octets of 252 octets for 
UTF-32 (with US-ASCII), and 224 octets in UTF-8 and UTF-16 (with Old 
Italic and the like).
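
For anyone who wants to reproduce these numbers, here is a minimal 
Python sketch. It assumes Python's built-in "punycode" codec, applies no 
Nameprep at all, and adds the "xn--" prefix by hand; the function names 
are made up for illustration and are not part of any IDNA library.

```python
ACE_PREFIX = "xn--"      # prefix that marks an A-label
MAX_LABEL_OCTETS = 63    # DNS limit on the length of one label


def a_label_length(u_label):
    """Length of the label as it would appear in the DNS (no Nameprep applied)."""
    if all(ord(c) < 128 for c in u_label):
        return len(u_label)  # plain ASCII labels are used unchanged
    return len(ACE_PREFIX) + len(u_label.encode("punycode"))


def longest_run(ch):
    """Largest n such that ch repeated n times still fits in one label."""
    n = 0
    while a_label_length(ch * (n + 1)) <= MAX_LABEL_OCTETS:
        n += 1
    return n


def octet_sizes(u_label):
    """Octets needed in UTF-8, UTF-16 and UTF-32 (no byte order mark)."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return tuple(len(u_label.encode(e)) for e in encodings)


# One sample character per UTF-8 length: 'a', U+00A2, U+0901, U+10300.
for ch in ("a", "\u00a2", "\u0901", "\U00010300"):
    n = longest_run(ch)
    print("U+%04X" % ord(ch), n, octet_sizes(ch * n))
```

If the codec behaves as assumed, the loop should report runs of 63, 58, 
57 and 56 characters, matching the counts given above.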

> All of that said, I'm not persuaded by the "there have been no
> issues raised, therefore there is no problem" argument.

Well, of course, logically speaking, you are right. But then that would 
apply to any and all IETF activity, wouldn't it?

> The
> reality is that, for mnemonic and typing convenience, people
> generally prefer shorter labels to longer ones.  Other than in
> test demonstrations and as part of efforts to encode other types
> of information in DNS labels, I don't believe I've ever seen a
> 60+ character ASCII label in the wild.

Yes, of course. Hitting something between 60 and 63 isn't exactly easy. 
But please don't forget, as I have written in an earlier mail, that the 
limit is around 20 characters or a bit more for a large number of 
scripts. While that's still above average and still not necessarily easy 
to remember, it's in a much more realistic range that may come in handy 
from time to time.
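
As a rough illustration of where "around 20" comes from, here is a small 
Python sketch. It assumes that the limit in question is 63 octets on the 
UTF-8 form of a label and that every character in the label takes the 
same number of octets as the sample character; the function name is 
made up for illustration.

```python
MAX_LABEL_OCTETS = 63  # assumed limit on the UTF-8 form of one label


def max_chars_under_utf8_limit(sample_char, limit=MAX_LABEL_OCTETS):
    """Characters that fit if each one is as wide in UTF-8 as sample_char."""
    return limit // len(sample_char.encode("utf-8"))


# 'a' takes 1 octet, U+00FC takes 2, U+0901 takes 3, U+10300 takes 4.
for ch in ("a", "\u00fc", "\u0901", "\U00010300"):
    print("U+%04X" % ord(ch), max_chars_under_utf8_limit(ch))
```

Under these assumptions, a script whose characters take 3 octets each in 
UTF-8 would get 21 characters per label, against 63 for US-ASCII.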

Also please remember that it's not so much the absolute number that led 
to all the discussion and engineering work for IDNA2003, but the 
relative disadvantage of these scripts when compared to US-ASCII. I 
think you are one of the better-known and more frequent critics of UTF-8 
on that basis.

> Regardless of script, a
> few such labels in the same FQDN would not only be nearly
> impossible for most people to enter correctly but also would
> guarantee line-wrapping of DNS names in most screen-layout and
> documentation arrangements... never an ideal situation.   That
> isn't an argument for banning labels of that length or longer;
> it does suggest a reason why no problems have been identified
> other than "people have been using this for years with no
> difficulty".

You are right that in some sense we are speaking about an edge case 
that accounts for a rather small percentage of the overall number of 
IDN labels. (Can somebody with DNS experience, or maybe Erik, provide 
some statistics on lengths of IDN labels?)

However, in general IETF practice, if a feature like the one being 
discussed here is testably supported in all the major user agents 
(browsers in this case), and if there are no verifiable reports of 
actual problems, then such a feature would not only make it into a 
Proposed Standard, but would also have absolutely no problem passing 
Draft and Full Standard criteria.

Regards,   Martin.

> --On Saturday, September 12, 2009 12:14 +0900 "\"Martin J.
> Dürst\""<duerst at it.aoyama.ac.jp>  wrote:
>
>> Hello John,
>>
>> [Dave, this is Cc'ed to you because of some discussion
>> relating to draft-iab-idn-encoding-00.txt.]
>>
>> [I'm also cc'ing public-iri at w3.org because of the IRI-related
>> issue at the end.]
>>
>> [Everybody, please remove the Cc fields when they are
>> unnecessary.]
>>
>>
>> Overall, I'm afraid that on this issue, more convoluted
>> explanations won't convince me nor anybody else, but I'll
>> nevertheless try to answer your discussion below
>> point-by-point.
>>
>> What I (and I guess others on this list) really would like to
>> know is whether you have any CONCRETE reports or evidence
>> regarding problems with IDN labels that are longer than 63
>> octets when expressed in UTF-8.
>>
>> Otherwise, Michel has put it much better than me: "given the
>> lack of issues with IDNA2003 on that specific topic there are
>> no reasons to introduce an incompatible change".

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst at it.aoyama.ac.jp