A-label definition

Sat Jun 21 08:23:58 CEST 2008

John C Klensin wrote:

   [See separate reply wrt <toplabel>]
>>> U-label , which contains at least a non-ASCII character

>> Okay, but please without the "standard Unicode encoding" blurb,
>> it only needs Unicode code points (the numbers, any encoding).

> No.  Unless I misunderstand what you are asking for, it really
> is important that U-label and A-label refer to _valid_ IDNA
> label forms.

Yes, no problem with that, it's the definition.  I'm concerned
about something else.  At the moment IRIs are the most important
application of IDNA, with EAI still struggling to get on board.
I'm ignoring XML system identifiers, which are temporarily broken.

In an IRI it's the <ihost> part that actually uses IDNA.  And an
IRI can in essence exist in *any* charset, it's not limited to
the "standard Unicode encodings" (UTF-16, -16BE, -16LE, -32, 
-32BE, -32LE, -8).  

In a reply not yet visible on the list from my POV Ken noted
that SCSU is not a "standard Unicode encoding", but a registered
charset.  AFAIK it's a "Unicode standard", as opposed to say
UTF-EBCDIC, UTF-7, UTF-1, or BOCU-1.  If a "Unicode standard"
charset is not the same as a "standard Unicode encoding" that
is okay, it isn't the point I'm concerned about.  Maybe it is
only me, but it might indicate why "standard Unicode encoding"
in idnabis-rationale can be misleading:

IRIs with perfectly valid <ihost>s containing "U-labels" can
use other encodings (based on the document charset), not only
the seven (at the moment) "standard Unicode encodings".  What
you really want is that U-labels survive the IDNAbis procedure
resulting in a corresponding A-label.

In RFC 3987 the IRI to URI translation starts with "transform
whatever it is to UTF-8".  After that step (wannabe-) U-labels 
are UTF-8 and in a "standard Unicode encoding".  But they were
already U-labels before that step.  

The requirement for U-labels (before the IDNAbis procedure) is
"must have corresponding Unicode code points", not "must be in
a standard Unicode encoding".  I hope it's now clearer what I
mean, and why I proposed "I-label" instead of "U-label".

For an IRI in a Latin-1 document a "U-label" will use Latin-1
octets, and iso-8859-1 is not a "standard Unicode encoding".
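
To make the code point argument concrete, here is a minimal Python
sketch.  It assumes a hypothetical label "bücher" and uses only the
raw Punycode step (RFC 3492) from the Python standard library; it
does none of the IDNAbis validity checks, and the helper names are
made up for illustration.

```python
# The same label as it might appear in documents with different charsets.
label = "b\u00fccher"  # "bücher", one non-ASCII code point (U+00FC)

latin1_octets = label.encode("iso-8859-1")   # not a "standard Unicode encoding"
utf8_octets = label.encode("utf-8")
utf16_octets = label.encode("utf-16-le")


def code_points(octets: bytes, charset: str) -> list[int]:
    """Decode octets with the document charset and return Unicode code points."""
    return [ord(ch) for ch in octets.decode(charset)]


def to_a_label_sketch(octets: bytes, charset: str) -> str:
    """Raw Punycode conversion only; real IDNA adds validity rules on top."""
    text = octets.decode(charset)
    return "xn--" + text.encode("punycode").decode("ascii")


# The octets differ, the code points do not, and neither does the result.
print(to_a_label_sketch(latin1_octets, "iso-8859-1"))  # xn--bcher-kva
print(to_a_label_sketch(utf16_octets, "utf-16-le"))    # xn--bcher-kva
```

The conversion only ever sees code points, so the charset of the
document containing the IRI does not matter once it has been decoded.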

Digression:  It might also use percent-encoded UTF-8, and while
you might not like this IRI-magic, "percent encoded UTF-8" is
also not a "standard Unicode encoding", ditto all RFC 5137 ideas.

[...]   
>> +1, define A-label based on U-label, and not the other way
>> around.

> At the moment, neither is defined in terms of the other (in
> rationale).   There is an implication of linkage, but that
> is because A-labels have to be IDNA-valid and IDNA-validity
> is defined in terms of operations on U-labels.  What are you
> suggesting?

Swap the paragraphs, start with U-label followed by A-label.

>>> LDH label includes A-label.

>> +1, that is the whole point of this business.

> No, actually, "rationale" creates, effectively, four 
> categories which are disjoint:

> * LDH labels (as defined in 1035, with no prefix or other 
> IDNA implications)

ldh-label = <letdig> [*61<l-d-h> <letdig>] ;or similar

> * A-labels (prefix, punycode encoding of the rest of the
> string, IDNA-valid)

a-label   = "xn--" [*<l-d-h> "-"] 1*<letdig>  ;or similar

Limited to length 63 and only valid if following the rules
in idnabis-protocol, TUS, RFC 3492, the works.  Any valid
<a-label> is by definition also a valid <ldh-label>, that
is what I meant.
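
The subset relation can be checked mechanically.  The following
Python sketch is one reading of the grammar above, not a normative
one: the patterns accept two-octet labels and Punycode strings with
no basic code points (hence no "-" delimiter), and all names are
hypothetical.  It tests the shape only, not IDNA validity.

```python
import re

LETDIG = "[A-Za-z0-9]"
LDH = "[A-Za-z0-9-]"

# ldh-label: letter or digit at both ends, hyphens only inside, 1..63 octets.
LDH_LABEL = re.compile(rf"{LETDIG}(?:{LDH}{{0,61}}{LETDIG})?")

# a-label shape: "xn--" (ABNF literals are case-insensitive), an optional
# basic part closed by the "-" delimiter, then the Punycode digits.
A_LABEL_SHAPE = re.compile(rf"[Xx][Nn]--(?:{LDH}*-)?{LETDIG}+")


def is_ldh_label(s: str) -> bool:
    return len(s) <= 63 and LDH_LABEL.fullmatch(s) is not None


def has_a_label_shape(s: str) -> bool:
    # Requiring is_ldh_label() first makes the subset relation explicit.
    return is_ldh_label(s) and A_LABEL_SHAPE.fullmatch(s) is not None


for candidate in ("example", "xn--bcher-kva", "xn--80akhbyknj4f", "-bad", "bad-"):
    print(candidate, is_ldh_label(candidate), has_a_label_shape(candidate))
```

Anything for which has_a_label_shape() is true is an LDH label by
construction; the reverse obviously does not hold.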

> * U-labels (Unicode string that is valid under IDNA)

NAK, you need at least one non-<l-d-h> code point to get 
U-label != LDH-label, and therefore U-label != A-label.

> * Invalid

An invalid <a-label> matching the ABNF outlined above, and
not longer than 63 octets, is still a valid <ldh-label>.

Likewise a label that is not a valid <ldh-label> can still be
a valid label, DNS allows any 1*63<octet>.

> Treating A-labels as a subset of LDH labels gets us back
> into situations in which there are LDH labels that look
> like A-labels and aren't.

But A-labels just *are* a proper subset of LDH labels, that
is the one and only point of IDNA(bis), as opposed to using
raw UTF-8 octets up to a maximal length of 63 octets.

That not every LDH-label starting with "xn--" is also a valid
<a-label> is the fine print, and one reason why IDNAbis and
IDNA need several RFCs for the details.
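
A small part of that fine print can be sketched in Python.  The
check below is necessary but far from sufficient: it only asks
whether the part after "xn--" decodes as Punycode (RFC 3492), yields
at least one non-ASCII code point, and encodes back to the same
string.  The function name is hypothetical, and the code point
rules of idnabis-protocol are not covered at all.

```python
def roundtrips_as_punycode(label: str) -> bool:
    """Necessary (not sufficient) condition for an "xn--" label to be an A-label."""
    if not label.lower().startswith("xn--"):
        return False
    payload = label[4:].lower()
    try:
        decoded = payload.encode("ascii").decode("punycode")
        # An A-label corresponds to a U-label, so a pure ASCII result is out.
        if decoded.isascii():
            return False
        # Re-encoding must reproduce the payload exactly.
        return decoded.encode("punycode").decode("ascii") == payload
    except ValueError:  # UnicodeError is a subclass of ValueError
        return False


print(roundtrips_as_punycode("xn--bcher-kva"))  # True
print(roundtrips_as_punycode("xn--z"))          # False, "z" is incomplete Punycode
```

Both inputs are perfectly good LDH labels, but only the first one
survives the round trip.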

 Frank