A-label definition (was: IDN test TLDs)

Sat Jun 21 02:02:37 CEST 2008

I'll skip the various issues related to A-label definition
that John addressed, but... Frank Ellermann stated:

>     ... I also don't see why the
>     U-label is limited to a "standard Unicode encoding
>     form", that would mean "can be SCSU, but not BOCU,
>     UTF-7, UTF-1, GB 18030, etc.".  IMO the question of
>     encoding forms misses some points, maybe we should
>     simply rename U-label to I-label:
> 
>     "I" as in I18N, IDNAbis, IRI is intuitive and KISS.

First of all, a standard Unicode Character Encoding Form could
be UTF-8, UTF-16, or UTF-32, but *not* SCSU, which is
not a Unicode Character Encoding Form at all, by Unicode Standard
definitions. (I realize that SCSU is a registered charset,
but that is an entirely different thing.)
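
To make the three encoding forms concrete, here is a rough Python
sketch. The sample label and the variable names are my own
assumptions for illustration, not anything from the drafts. Note that
Python's codecs really produce byte serializations (encoding schemes)
of the forms; I use the "-be" variants only to avoid a byte order mark.

```python
# Illustrative sketch: one Unicode string, three standard encoding forms.
# The sample label is hypothetical and not taken from the drafts.
label = "b\u00fccher"  # 6 code points; U+00FC is u with diaeresis

# Codec name -> size of one code unit in bytes.
forms = {"utf-8": 1, "utf-16-be": 2, "utf-32-be": 4}

code_units = {}
round_trips = {}
for codec, unit_size in forms.items():
    data = label.encode(codec)
    # Number of code units used by this form for the same string.
    code_units[codec] = len(data) // unit_size
    # Each form decodes back to the identical sequence of code points.
    round_trips[codec] = data.decode(codec) == label

for codec in forms:
    print(codec, code_units[codec], "code units, round trip:", round_trips[codec])
```

The code unit counts differ, but all three represent the same Unicode
string. The Python standard library has no SCSU codec, so it is left
out of the sketch.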

draft-ietf-idnabis-rationale-00.txt states that:

  * A "U-label" is an IDNA-valid string of Unicode characters,
    expressed in a standard Unicode Encoding Form, normally
    UTF-8 in an Internet transmission context...

I assume the 2nd part is obvious in this discussion (i.e.
why UTF-8, rather than UTF-16 or UTF-32).

The reason why it should be a standard Unicode Encoding Form
can be seen from the proposed protocol definition:

draft-ietf-idnabis-protocol-01.txt requires that:

  Some system routine, or a localized front-end to the IDNA
  process, ensures that the proposed label is a Unicode string.
  That string MUST be in Unicode Normalization Form C.

Unicode text content compressed by the SCSU algorithm
is a sequence of bytes that is neither a Unicode string nor
in Normalization Form C. And it could be put in NFC only
by first extracting it from the compressed form into one
of the 3 Unicode Character Encoding Forms, and then
normalizing it by the UAX #15 algorithm -- which works on
Unicode strings (in standard Encoding Forms).
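
The same ordering (decode to a Unicode string first, normalize second)
can be sketched in Python. This is only an illustration under my own
assumptions: the helper name to_nfc_label is hypothetical, and I use
UTF-8 bytes as the input because Python's standard library ships no
SCSU decoder. The point is just that normalization is defined on
strings, not on raw byte sequences.

```python
import unicodedata


def to_nfc_label(raw: bytes, encoding: str = "utf-8") -> str:
    """Hypothetical helper: decode bytes to a Unicode string, then apply NFC."""
    text = raw.decode(encoding)                # step 1: bytes -> Unicode string
    return unicodedata.normalize("NFC", text)  # step 2: normalize the string


# A decomposed sample label: "u" followed by U+0308 COMBINING DIAERESIS.
raw = "bu\u0308cher".encode("utf-8")

# Normalizing the raw bytes directly is rejected: normalize() needs a str.
try:
    unicodedata.normalize("NFC", raw)
    accepted_bytes = True
except TypeError:
    accepted_bytes = False

label = to_nfc_label(raw)
print(accepted_bytes, [hex(ord(c)) for c in label])
```

For SCSU input, the decode step would have to be an SCSU decompressor
instead of raw.decode(); the normalization step would stay the same.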

--Ken