A-label definition (was: IDN test TLDs)
Kenneth Whistler
kenw at sybase.com
Sat Jun 21 02:02:37 CEST 2008
I'll skip the various issues related to A-label definition
that John addressed, but... Frank Ellermann stated:
> ... I also don't see why the
> U-label is limited to a "standard Unicode encoding
> form", that would mean "can be SCSU, but not BOCU,
> UTF-7, UTF-1, GB 18030, etc.". IMO the question of
> encoding forms misses some points, maybe we should
> simply rename U-label to I-label:
>
> "I" as in I18N, IDNAbis, IRI is intuitive and KISS.
First of all, a standard Unicode Character Encoding Form could
be UTF-8, UTF-16, or UTF-32, but *not* SCSU, which is
not a Unicode Character Encoding Form at all, by Unicode Standard
definitions. (I realize that SCSU is a registered charset,
but that is an entirely different thing.)
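To make the distinction concrete, here is a minimal Python sketch of one string expressed in each of the three Unicode Character Encoding Forms. The sample label "bücher" is an assumption chosen for illustration, and the big-endian byte order is likewise just a choice made here to avoid a byte order mark.

```python
# One string of Unicode characters, three Character Encoding Forms.
# The sample label is hypothetical and only serves as an illustration.
label = "b\u00fccher"  # "bücher": 6 code points, one of them non-ASCII

utf8 = label.encode("utf-8")       # 8-bit code units
utf16 = label.encode("utf-16-be")  # 16-bit code units, big-endian, no BOM
utf32 = label.encode("utf-32-be")  # 32-bit code units, big-endian, no BOM

# Each form round-trips to the same sequence of code points.
round_trips = [
    utf8.decode("utf-8"),
    utf16.decode("utf-16-be"),
    utf32.decode("utf-32-be"),
]

# Byte lengths differ because the code unit width differs.
byte_lengths = {"UTF-8": len(utf8), "UTF-16": len(utf16), "UTF-32": len(utf32)}
print(byte_lengths)
```

All three byte sequences carry the same code points; only the code unit width changes. SCSU output has no such fixed code unit structure, which is why it does not qualify as an encoding form.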
draft-ietf-idnabis-rationale-00.txt states that:
   * A "U-label" is an IDNA-valid string of Unicode characters,
     expressed in a standard Unicode Encoding Form, normally
     UTF-8 in an Internet transmission context...
I assume the second part is obvious in this discussion (i.e.,
why UTF-8 rather than UTF-16 or UTF-32).
The reason why it should be a standard Unicode Encoding Form
can be seen from the proposed protocol definition:
draft-ietf-idnabis-protocol-01.txt requires that:
   Some system routine, or a localized front-end to the IDNA
   process, ensures that the proposed label is a Unicode string.
   That string MUST be in Unicode Normalization Form C.
Unicode text content compressed by the SCSU algorithm
is a sequence of bytes that is neither a Unicode string nor
in Normalization Form C. And it could be put in NFC only
by first extracting it from the compressed form into one
of the three Unicode Character Encoding Forms, and then
normalizing it by the UAX #15 algorithm -- which works on
Unicode strings (in standard Encoding Forms).
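The decode-then-normalize order can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper name to_nfc_label is hypothetical, UTF-8 stands in for whatever encoding the bytes actually arrive in, and the standard library has no SCSU decoder, so that extraction step is not shown.

```python
import unicodedata


def to_nfc_label(raw: bytes, encoding: str = "utf-8") -> str:
    """Hypothetical helper: decode bytes to a Unicode string, then apply NFC."""
    text = raw.decode(encoding)  # step 1: bytes -> Unicode string
    return unicodedata.normalize("NFC", text)  # step 2: UAX #15 normalization


# "bu" + COMBINING DIAERESIS + "cher": a decomposed spelling, sent as UTF-8 bytes.
decomposed_bytes = "bu\u0308cher".encode("utf-8")

label = to_nfc_label(decomposed_bytes)

# Normalization is defined on strings; handing it raw bytes is rejected.
try:
    unicodedata.normalize("NFC", decomposed_bytes)
    bytes_accepted = True
except TypeError:
    bytes_accepted = False

print(label == "b\u00fccher", bytes_accepted)
```

The point of the sketch is the ordering: a byte sequence has to be turned into a Unicode string before the question "is it in NFC?" even makes sense.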
--Ken