Definitional Problem with U-Label and A-Label

Mon Dec 8 00:58:40 CET 2008

--On Wednesday, 03 December, 2008 17:41 -0800 Mark Davis
<mark at macchiato.com> wrote:

> Thanks for replying on all of these. It is much easier to have
> effective review when it is clear what is just going in, and
> what still needs to be discussed.
> I definitely disagree with you on the value of fixing these
> definitions.  A clear and precise specification of the
>...
> However, you've convinced me of some things, see below.

> On Wed, Nov 26, 2008 at 11:50, John C Klensin
> <klensin at jck.com> wrote:
>...

>> > Condition 3 is not stated in D2.3.1.2, but appears
>> > elsewhere. Should be in Defs 2.3.1.2.
>> 
>> Actually, it cannot be.  It was in with that definition in
>> earlier versions, but people made me take it out because it
>> constituted a restriction on the DNS more broadly.  Because
>> 2.3.1.2 describes an LDH-label as obeying the hostname syntax
>> and _not being an IDN_, it allows hostnames that do, indeed,
>> contain "--" in positions three and four.  Put differently,
>> "ab--abcde" is a perfectly valid LDH-label (but not an IDN)
>> even if other provisions prevent it from appearing in an
>> IDNA-aware zone.
> 
> 
> Ah, that was completely unclear to me. So what you're saying
> is that it is perfectly fine to have an A-Label "ab--abc", but
> just not the U-Label "ab--åbc"?

Nope.  "ab--abc" might be considered to be "in the form of an
A-label", but clearly is not one.  Whether it is even "in the
form of an A-label" depends on how far one goes down the
"putative A-label" validation path.  See below and the new text
in Defs.

Similar comments would apply to putative U-labels, except that
the "no '--' in 3,4" rule is applied much earlier and more
explicitly.  Part of the problem here --fixed, I hope, in the
next version of Defs-- is that we have, statically, 

	(1)  IDNs, consisting of 
	    LDH-labels
	    U-labels
	    A-labels

	(2)  Stuff that is valid but is not IDNs, and so is not
	valid in IDNA slots or IDNA-conforming environments.
	ASCII strings that are not LDH-conformant, such as SRV
	labels, binary and bit string labels, etc., are examples
	of this category which, by definition, IDNA can't say
	much about.

	(3) Various forms of trash.

Part of the purpose of IDNA is to classify things that, at
various levels of assertion or appearance, might be IDNs into
whether they are actually IDNs or whether, instead, they fall
into one of the other categories.  Someone might claim that
"ab--åbc" is a U-label or looks like a U-label.  The assertion
that it was a U-label would obviously be false; whether it looks
like one is somewhat in the eye of the beholder and is not
something the protocol needs to resolve.  Similarly, someone
might claim that "ab--abc" was an A-label or looked like one,
but testing the assertion would clearly show that it was not
such a label (and whether or not the person making the assertion
continued to believe that it looked like one is irrelevant).
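As a rough illustration of the tests involved, here is a minimal
sketch in Python.  The function names are hypothetical and the
hostname-syntax rule is simplified to "letters, digits, hyphens, 1 to
63 characters, no leading or trailing hyphen"; it is not text from
Defs or Protocol.

```python
import re

# Simplified hostname (LDH) syntax: letters, digits and hyphens only,
# 1 to 63 characters, no hyphen at either end.  (Assumed rule.)
_LDH = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_ldh_syntax(label: str) -> bool:
    """True if the label obeys the simplified hostname syntax."""
    return _LDH.fullmatch(label) is not None


def has_hyphens_in_3_4(label: str) -> bool:
    """True if the third and fourth characters are both hyphens."""
    return label[2:4] == "--"


def has_ace_prefix(label: str) -> bool:
    """True if the label starts with "xn--" in any letter case."""
    return label[:4].lower() == "xn--"


for candidate in ("ab--abcde", "ab--abc", "xn-1", "xn-$"):
    print(candidate,
          is_ldh_syntax(candidate),
          has_hyphens_in_3_4(candidate),
          has_ace_prefix(candidate))
```

Under these assumptions "ab--abcde" passes the hostname-syntax test
and has "--" in positions three and four, but lacks the "xn--"
prefix, so a claim that it is an A-label fails on the prefix test
alone.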

In case it is not obvious, the discriminator between "not an
IDN" (i.e., (2) above) and "trash" (i.e., (3) above) is a
contextual assumption associated with "Domain Name Slot"
(Defs-03 2.3.1.6) and contexts that are IDNA-aware.  Unless both
of those conditions are met, and often when they are, there is
no distinction for IDNA purposes.

>...
>> 
>> >    2. starts with "xn--" (or case variants thereof)
>> > [implicitly no hyphen at    end]
>> >    3. the remainder  is valid punycode
>> >    4. and the depunycoded result must be a valid U-Label
>> 
>> Except that "valid punycode" is not, itself, completely
>> well-defined, since Punycode (the algorithm) can encode any
>> string whose codepoints fall within the Unicode range
>> (assigned or unassigned, etc.).
> 
> 
> no. There are definitely *many* invalid punycode strings, like
> "xn-1" or "xn-$".

We don't have agreement on what constitutes a "punycode string",
and this gets back into the "what looks like a duck" discussion
above.  Those strings might look to you like "punycode strings".
Indeed, they might even look to you like A-labels, even though
I'd hope that the absence of the second hyphen would make the
incorrectness of that inference clear.  (Actually, the first one
is a valid LDH-label, since nothing prevents one of those from
starting with "xn-" as long as the next character isn't a "-",
and "1" is a hostname-valid ASCII character.)

>  This cannot be validly transformed back to
> any Unicode string whatsoever. That is, if you apply
> http://www.ietf.org/rfc/rfc3492.txt, you fail.

But none of them are valid U-labels.

> That is different than the *further* condition (which I marked
> as #4) which is that when the punycode is decoded, you end up
> with a valid U-Label.

See above.

>> Validity of punycode must, in practice,
>> be defined in terms of transformations from [valid] U-labels.
> 
> No, http://www.ietf.org/rfc/rfc3492.txt doesn't know anything
> about U-Labels or need to.

See prior note.  Its failure conditions (other than the one for
overall label length) are tied up with invalid input characters.
We have chosen to define IDNA2008 in a way that prevents invalid
input characters from reaching the Punycode algorithm, an approach
that RFC 3492 explicitly considers valid.

One could turn that around and make tests for the status codes
implicit in Appendix C of 3492, but that would take us back to
reliance on a specific algorithm or pseudo-code sequence as the
definition of operations, rather than using a more operational
definition.  We decided to not take that approach in IDNA2008.
While one could argue that not replacing 3492 entirely with a
different style of definition violates that principle, leaving
it alone seemed, pragmatically, like the right thing to do for
several reasons.
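To make the two views concrete, here is a minimal sketch that uses
the "punycode" codec shipped with Python's standard library.  That
codec is only a convenient stand-in for RFC 3492, and the helper
names are hypothetical.  The first helper reports whether the decoder
rejects its input; the second asks whether re-encoding the decoded
string reproduces the input, which is closer to defining validity in
terms of the transformation.

```python
from typing import Optional


def decode_punycode(remainder: str) -> Optional[str]:
    """Decode the part of a label after "xn--".

    Returns None when the codec rejects the input.
    """
    try:
        return remainder.encode("ascii").decode("punycode")
    except UnicodeError:
        return None


def round_trips(remainder: str) -> bool:
    """True if decoding and then re-encoding reproduces the input."""
    decoded = decode_punycode(remainder)
    if decoded is None:
        return False
    return decoded.encode("punycode").decode("ascii") == remainder


print(decode_punycode("bcher-kva"))
print(round_trips("bcher-kva"))
print(decode_punycode("$"))
```

Neither helper says anything about whether the decoded string is a
valid U-label: a string can survive the round trip and still contain
code points that IDNA2008 does not allow.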

> Validity of "A-Labels", clearly is defined in terms of the
> transformation -- we are in agreement there, I think -- and
> that is what is captured by the above 4 conditions.

Please check the next text in Defs, which has been changed in
response to part of this discussion, and see if it better meets
your needs.
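For comparison only, the conditions quoted above could be sketched in
Python roughly as follows.  Everything here is an assumption made for
illustration: the function name, the order of the tests, the
lower-casing of the remainder before decoding, and above all the
caller-supplied is_valid_u_label predicate, which stands in for the
U-label rules and is not defined by this sketch.

```python
from typing import Callable


def check_putative_a_label(
    label: str,
    is_valid_u_label: Callable[[str], bool],
) -> str:
    """Return "ok", or a short reason why the A-label claim fails."""
    if label[:4].lower() != "xn--":
        return "no xn-- prefix"
    if label.endswith("-"):
        return "trailing hyphen"
    if len(label) > 63:
        return "too long"
    # Assumption: compare in lower case, since DNS matching of ASCII
    # letters is case-insensitive.
    remainder = label[4:].lower()
    try:
        decoded = remainder.encode("ascii").decode("punycode")
    except UnicodeError:
        return "punycode decoding failed"
    if decoded.encode("punycode").decode("ascii") != remainder:
        return "does not round-trip"
    if not is_valid_u_label(decoded):
        return "decoded form is not a valid U-label"
    return "ok"


def toy_u_label_rule(text: str) -> bool:
    """Toy stand-in: at least one non-ASCII character.  NOT the real rule."""
    return any(ord(ch) > 127 for ch in text)


for candidate in ("xn--bcher-kva", "ab--abc", "xn-1", "xn--abc-"):
    print(candidate, "->", check_putative_a_label(candidate, toy_u_label_rule))
```

The interesting part is the last test: whatever is plugged in for
is_valid_u_label carries most of the weight of the definition.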

>...

>> > Putative A-Label
>> > Any string that is all ASCII, but is neither LDH nor A-Label.
>> 
>> That term is used, I think consistently, for a string that is
>> offered to a registry or lookup process with the claim that it
>> is an A-label.   Because that claim can be false for all sorts
>> of reasons, and because all [valid] A-labels are potential
>> members of the category of putative A-labels, I don't think
>> the definition above works.
> 
> 
> You use the term "putative" in many places. It would be
> clearer if we had a formal definition.

While, as noted in an earlier response, I've inserted a new
section into Defs that addresses these strings that are asserted
to have some particular property (such as "being an A-label"), I
also believe that the use of "putative" is exactly consistent
with its ordinary dictionary definition.

>...
> You are right. It is ok for the definition to refer to a clear
> sequence of steps in the protocol document, as long as none of
> those steps are circularly referring to the definition.
> 
> So we could say, a U-Label is a Unicode string that satisfies
> the conditions of Section X.Y in [protocol].
> 
> Originally, I was trying to figure out what the definition
> actually was, and to see if those 8 conditions matched.

Please see if the new text is more satisfactory.

>...

     john