Hyphen Restrictions

Wed Jan 5 15:38:09 CET 2011

--On Wednesday, January 05, 2011 08:48 -0500 Andrew Sullivan
<ajs at shinkuro.com> wrote:

> On Wed, Jan 05, 2011 at 08:03:43AM +0000, Adam M. Costello
> wrote:
>> Yoshiro YONEYA <yoshiro.yoneya at jprs.co.jp> wrote:
>> 
>> > Dear Andrew and John,
>> > 
>> > Thank you for your quick response.  I'm clear now.
>> 
>> You are?  But didn't Andrew and John disagree?  Andrew said
>> it means 3rd & 4th characters, while John said it means 3rd &
>> 4th octets.
> 
> That's what I thought, too.  I can find no way to interpret
> that text other than "third and fourth characters".

Sorry.  I misspoke.  I've been thinking too much about a
different problem and confused myself.  It really is third and
fourth characters.  But remember that the IDNA specs are written
entirely in terms of Unicode character abstractions, not a
particular encoding.  That is one of the reasons it has to be
"characters".

But it also makes the apparent problem with the text a lot less
significant than it appears when one starts thinking in terms of
encodings.

Vint is also correct: remember that a "convert, as needed, to
Unicode" step appears in both algorithms long before any
conditions are imposed on hyphens.  The test is expected to be
made only on ASCII strings (more precisely, on Unicode
characters in the range U+0000 through U+007F), so the "octet
versus character" distinction, which is important for UTF-8 and
perhaps even more important for UTF-16, UTF-32, and encodings of
other coded character sets, is not really significant here.
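
A minimal sketch of the point above, in Python and assuming UTF-8
as the encoding used for comparison (IDNA itself does not pick
one).  The function names are hypothetical and come from no
IDNA library.  It compares a check on the third and fourth
characters with a check on the third and fourth octets: the two
agree for ASCII-only labels and can differ once a non-ASCII
character precedes the hyphens.

```python
def hyphens_in_chars_3_and_4(label: str) -> bool:
    """True if the 3rd and 4th characters (code points) are both '-'."""
    return len(label) >= 4 and label[2] == "-" and label[3] == "-"


def hyphens_in_octets_3_and_4(label: str) -> bool:
    """True if the 3rd and 4th octets of the UTF-8 form are both '-'.

    UTF-8 is an assumption made for illustration only.
    """
    data = label.encode("utf-8")
    return len(data) >= 4 and data[2:4] == b"--"


ascii_label = "xn--abc"          # every character is in U+0000..U+007F
non_ascii_label = "\u00e9--a"    # U+00E9 takes two octets in UTF-8

# ASCII-only input: the character test and the octet test agree.
print(hyphens_in_chars_3_and_4(ascii_label),
      hyphens_in_octets_3_and_4(ascii_label))

# Non-ASCII input: the two tests can give different answers.
print(hyphens_in_chars_3_and_4(non_ascii_label),
      hyphens_in_octets_3_and_4(non_ascii_label))
```

Since the hyphen test is only expected to run on strings in the
ASCII range, the first case is the one that matters in practice.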

I note that the word "octet" appears in RFC 5890 only in
conjunction with the limits on lengths of DNS labels and FQDNs
inherited from RFC 1035 and does not appear in RFC 5892 at all.

If I were to propose a change to this text to eliminate any
possible confusion, it would be to insert a reminder that IDNA
does not deal with encodings at all.

     john