Definitions limit on label length in UTF-8
"Martin J. Dürst"
duerst at it.aoyama.ac.jp
Sun Sep 13 10:05:58 CEST 2009
Hello John, others,
[I removed public-iri at w3.org, because this is no longer directly
relevant to IRIs.]
On 2009/09/12 22:15, John C Klensin wrote:
> Martin,
>
> First of all, please understand that I'm much more agnostic on
> this issue than I think you assume. I'm trying to reflect what
> I believe I've been told by the WG and by various other
> communities on the subject but, if the WG says "change it", I
> will do so as editor and lose very little sleep about the
> subject.
Okay, got it.
> I'll let Dave and Stuart address the API and eventual migration
> to pure UTF-8 issues.
I'd definitely like to hear from them re. API issues.
As for 'eventual migration to pure UTF-8', if you mean that something
like an IDN-2015 or so could move to UTF-8 in IDN (I'm not using IDNA
here because the A expresses the fact that Internationalization is
limited to the application), then as somebody who has advocated that
already in the past millennium, I'm pleased to hear that such ideas have
resurfaced. But I think it would be much more appropriate to put
something in a Security Section or Rationale saying "In the case of a
future move of IDNs to direct encoding in UTF-8, some labels that can
currently be expressed in at most 63 bytes in punycode won't be handled
anymore because they will take more than 63 bytes in UTF-8."
> I've been told that the ability to
> convert to length-value form (with a six-bit length) _before_
> Punycode conversion (or in an IDNA-unaware, "octets only"
> implementation) is critical for the DNS community and for some
> security-related applications which store DNS-based identifiers
> in that form.
As I said in a previous mail, I doubt the former, and think that the
latter ("octets only") case is irrelevant. If it were security-relevant,
then that would also apply to IDNA 2003, wouldn't it? This would mean
that what essentially happened was that the security folks messed up a
hard-fought property of IDNA 2003 and punycode. I hope that's not the case.
> But I have no personal implementation experience
> in either area, so perhaps Andrew and Paul can either speak to
> those issues or point us to someone who can.
Andrew, Paul, please do so if you can!
> As a sometime-implementer, I'm nervous about unlimited-length
> strings
Nervous, yes, but something one essentially always has to be prepared
for. Otherwise, one is open to a DoS attack.
> (as, based on recent interactions, are Stuart and Vint).
> But it seems to me that the string length here is bounded in any
> event -- with 59 characters of Punycode in an A-label, the upper
> limit on a UTF-8 or UTF-32 string cannot be over 236 characters
> and, I assume, would be considerably smaller. Especially if we
> can pin that number down (Adam?), I'd be a lot happier with text
> that said, essentially, "the limit is on the A-label string, but
> implementations should be aware that a maximum-length A-label
> can convert to a U-label of up to NNN" characters than saying
> "unlimited" and I think some others would be too.
Here are my calculations. After a few tests, one finds out that punycode
uses a single 'a' to express 'one more of the same character'. The
question is then how many characters it takes punycode to express the
first character. Expressing that first character takes more and more
punycode characters as its Unicode number gets higher, so one has to
test with the smallest Unicode character that needs a certain number of
bytes in UTF-8. Going through lengths 1,2,3, and 4 per character in
UTF-8, we find:
1 octet per character in UTF-8:
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.org gives
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.org
and has 63 characters, so 63 octets in UTF-8, 126 octets in UTF-16, and
252 octets in UTF-32.
2 octets per character in UTF-8:
¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢.org gives
xn--8aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.org
and has 58 characters, so 116 octets in UTF-8, 116 octets in UTF-16, and
232 octets in UTF-32. 59 seems possible in theory, but impossible in
practice.
3 octets per character in UTF-8:
ँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँ.org (using the currently lowest encoded character that needs 3 bytes,
U+0901, DEVANAGARI SIGN CANDRABINDU), gives
xn--h1baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.org
and has 57 characters, so 171 octets in UTF-8, 114 octets in UTF-16, and
228 octets in UTF-32. Please note that even characters in the U+0800
range would need three punycode characters for the first character,
because a character as low as 'ü' already needs that many.
4 octets per character in UTF-8:
Trying to assess how many characters one could use of
𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀.org
(using U+10300, OLD ITALIC LETTER A, the lowest character in Unicode 3.2
that needs 4 bytes in UTF-8) gives
xn--097caaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.org
and has 56 characters, so 224 octets in UTF-8, 224 octets in UTF-16, and
224 octets in UTF-32.
Overall, we get a maximum label length of 252 octets for UTF-32 (with
US-ASCII), and 224 octets for UTF-8 and UTF-16 (with Old Italic and the
like).
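For anyone who wants to re-run these numbers, here is a minimal sketch. It
assumes Python's built-in "punycode" codec and does no IDNA mapping or
validity checking at all; the helper names (a_label, longest_run, sizes) are
made up for illustration. For a given character, it looks for the longest run
of that character whose "xn--" form still fits in 63 octets, then reports the
size of that run in each encoding.

```python
# Sketch only: raw punycode via Python's built-in codec, no IDNA processing.

LIMIT = 63  # maximum DNS label length in octets


def a_label(u_label: str) -> str:
    """Return the label unchanged if it is pure ASCII, else xn-- + punycode."""
    if all(ord(c) < 128 for c in u_label):
        return u_label
    return "xn--" + u_label.encode("punycode").decode("ascii")


def longest_run(ch: str) -> int:
    """Largest n such that ch repeated n times still fits in LIMIT as an A-label."""
    n = 0
    while len(a_label(ch * (n + 1))) <= LIMIT:
        n += 1
    return n


def sizes(ch: str) -> dict:
    """Character count and octet counts for the longest fitting run of ch."""
    label = ch * longest_run(ch)
    return {
        "characters": len(label),
        "utf-8": len(label.encode("utf-8")),
        "utf-16": len(label.encode("utf-16-be")),  # -be avoids counting a BOM
        "utf-32": len(label.encode("utf-32-be")),
    }


if __name__ == "__main__":
    # 'a', CENT SIGN, DEVANAGARI SIGN CANDRABINDU, OLD ITALIC LETTER A
    for ch in ("a", "\u00a2", "\u0901", "\U00010300"):
        print(f"U+{ord(ch):04X}", sizes(ch))
```

Repeating a single character is the most compact case for punycode, so this
gives upper bounds on label length, not what typical labels look like.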
> All of that said, I'm not persuaded by the "there have been no
> issues raised, therefore there is no problem" argument.
Well, of course, logically speaking, you are right. But then that would
apply to any and all IETF activity, wouldn't it?
> The
> reality is that, for mnemonic and typing convenience, people
> generally prefer shorter labels to longer ones. Other than in
> test demonstrations and as part of efforts to encode other types
> of information in DNS labels, I don't believe I've ever seen a
> 60+ character ASCII label in the wild.
Yes, of course. Hitting something between 60 and 63 isn't exactly easy.
But please don't forget, as I have written in an earlier mail, the limit
is around 20 characters or a bit more for a large number of scripts.
While that's still above average and still not necessarily easy to
remember, it's much more in a realistic range that may come in handy
from time to time.
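To make the "around 20 characters" figure concrete, here is a small sketch.
It assumes a 63-octet limit applied to the UTF-8 form of a label, and a label
whose characters all take the same number of octets as the sample character.
The function name max_chars is hypothetical.

```python
def max_chars(sample: str, limit: int = 63) -> int:
    """How many characters like `sample` fit in `limit` octets of UTF-8."""
    return limit // len(sample.encode("utf-8"))


if __name__ == "__main__":
    # One sample character each for 1, 2, 3, and 4 octets per character.
    for sample in ("a", "\u00a2", "\u0901", "\U00010300"):
        print(f"U+{ord(sample):04X}", max_chars(sample))
```

Scripts encoded in the three-octet range of UTF-8 are the ones that end up
with the limit of 21 characters.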
Also please remember that it's not so much the absolute number that led
to all the discussion and engineering work for IDNA2003, but the
relative disadvantage of these scripts when compared to US-ASCII. I
think you are one of the better-known and more frequent critics of UTF-8
on that basis.
> Regardless of script, a
> few such labels in the same FQDN would not only be nearly
> impossible for most people to enter correctly but also would
> guarantee line-wrapping of DNS names in most screen-layout and
> documentation arrangements... never an ideal situation. That
> isn't an argument for banning labels of that length or longer;
> it does suggest a reason why no problems have been identified
> other than "people have been using this for years with no
> difficulty".
You are right that in some sense we are speaking about an edge case
with, compared with the overall number of IDN labels, a rather small
percentage. (Can somebody with DNS experience, or maybe Erik, provide
some statistics on lengths of IDN labels?)
However, in general IETF practice, if a feature like the one being
discussed here is testably supported in all the major user agents
(browsers in this case), and if there are no verifiable reports of
actual problems, then such a feature would not only make it into a
Proposed Standard, but would also have absolutely no problem passing
Draft and Full Standard criteria.
Regards, Martin.
> --On Saturday, September 12, 2009 12:14 +0900 "\"Martin J.
> Dürst\""<duerst at it.aoyama.ac.jp> wrote:
>
>> Hello John,
>>
>> [Dave, this is Cc'ed to you because of some discussion
>> relating to draft-iab-idn-encoding-00.txt.]
>>
>> [I'm also cc'ing public-iri at w3.org because of the IRI-related
>> issue at the end.]
>>
>> [Everybody, please remove the Cc fields when they are
>> unnecessary.]
>>
>>
>> Overall, I'm afraid that on this issue, more convoluted
>> explanations won't convince me nor anybody else, but I'll
>> nevertheless try to answer your discussion below
>> point-by-point.
>>
>> What I (and I guess others on this list) really would like to
>> know is whether you have any CONCRETE reports or evidence
>> regarding problems with IDN labels that are longer than 63
>> octets when expressed in UTF-8.
>>
>> Otherwise, Michel has put it much better than me: "given the
>> lack of issues with IDNA2003 on that specific topic there are
>> no reasons to introduce an incompatible change".
>
>
>
>
>
--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:duerst at it.aoyama.ac.jp