Definitional Problem with U-Label and A-Label

John C Klensin klensin at jck.com
Wed Nov 26 20:50:16 CET 2008



--On Tuesday, 18 November, 2008 12:55 -0800 Mark Davis
<mark at macchiato.com> wrote:

> I had a chance to review the documents again. There is good
> progress; the split of the definitions is very helpful. There
> is a tendency to always look for the remaining issues in the
> document, so I want to thank John for all the work done on
> this. I'll respond with some comments, with different subjects
> for easier tracking.

Much appreciated.   However, some of your comments below
illustrate the reason I've been resisting responding to long
sets of document comments by incorporating them without
community comment.  I'm now trying to finish up a revision of
the documents that reflects your comments and others received
since the beginning of next week, but I am incorporating only
those things about which I'm absolutely sure.  Especially since
I can spend far more time responding to particular suggestions
than actually editing the documents (and have now done so
several times), the comments below are intended only as examples
and, to some degree, an attempt to demonstrate that we have
gotten just about close enough for Proposed Standard and are
reaching the point of diminishing returns.

Before I dig into the details, a general comment for the WG: I'm
profoundly unhappy about the situation with the "definitions" at
this point and, while we disagree about the right solution,
Mark's comments highlight some of the issues and increase my
unhappiness.   These definitions started out as relatively
informal ones in a document that combined the general content of
Defs, Rationale, and Protocol.  They were intended to provide
the reader with a framework for understanding what was going on,
not to provide the precise category-boundary definitions that
are, in essence, part of Protocol and Tables.  When we split
Protocol off from Rationale, we tuned them to serve somewhat
more of the role of category definitions, but they remained
informal.  Pulling them out of Rationale as "normative material"
increases the implicit requirement that they really _be_
definitions, definitions that are capable of delimiting category
boundaries.  But, as Mark's note illustrates, we have no real
mechanism for doing that without turning the definitions into a
roadmap of Protocol (and, in some cases, Bidi and Rationale).

At one level, I'd be very enthused about a proposal to
completely revamp the definitions, introduce additional
terminology as needed, and produce the document organization we
have discussed... one in which there are no normative
dependencies from Defs to any of the other documents and in
which the others, as needed, point to Defs.  The downside of
doing that is that the odds of making mistakes as we move text
around --mistakes that would be far more problematic than the
current lack of mathematical rigor in the definitions document--
are extremely high.   I think that we should get this finished
and into the hands of those who need it and revisit these
definitional structure issues when we go for Draft Standard.
But I've said that before, and you and others clearly feel
differently.


> Definitional Problem with U-Label and A-Label
> 
> I believe (although am not 100% sure) that the intent is for
> both U-Label and A-Label to only refer to *valid* possible
> labels under the specifications of IDNA2008,

Yes.  The WG agreed on that some time ago and the text was
changed to match (I had hoped correctly so, but obviously I
didn't get every case).

> but the text does
> not yet support that consistently. Here is the breakdown. (I'm
> using D1.3 to mean section 1.3 in Defs, and so on, with P for
> protocol, B for bidi, R for rationale).

Again, this careful breakdown is much appreciated.

> LDH
> The following conditions:
> 
>    1. Must match http://tools.ietf.org/html/rfc952
>       - <name> ::= <let>[*[<let-or-digit-or-hyphen>]<let-or-digit>]

Plus the amendment in RFC 1123 that permits leading (ASCII)
digits.

>    2. Length limited to 1..63 (http://tools.ietf.org/html/rfc1034, 3.1)
>    3. Must not have hyphens in both positions 3 and 4. (new
>       condition)

That restriction applies only to an LDH-label, i.e., a label
used in (or proposed to be used in) an IDNA-aware slot and
application.  It still does not apply to LDH strings (and RFC
952 host names) in general.

> Condition 3 is not stated in D2.3.1.2, but appears elsewhere.
> Should be in Defs 2.3.1.2.

Actually, it cannot be.  It was in with that definition in
earlier versions, but people made me take it out because it
constituted a restriction on the DNS more broadly.  Because
2.3.1.2 describes an LDH-label as obeying the hostname syntax
and _not being an IDN_, it allows hostnames that do, indeed,
contain "--" in positions three and four.  Put differently,
"ab--abcde" is a perfectly valid LDH-label (but not an IDN) even
if other provisions prevent it from appearing in an IDNA-aware
zone.
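
The distinction above can be sketched in a few lines of Python. This is
only an illustration, assuming the RFC 952 grammar as amended by RFC 1123
(leading digit allowed) and the 1..63 length limit; the function names are
hypothetical and appear in none of the documents.

```python
import re

# Hostname-style label: letters, digits, and hyphens, with no hyphen at
# either end (RFC 952 plus the RFC 1123 leading-digit amendment).
# One character + up to 61 inner characters + one character = 63 maximum.
_LDH = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_ldh_syntax(label: str) -> bool:
    """True if the label obeys the basic hostname syntax and length."""
    return _LDH.fullmatch(label) is not None


def has_hyphens_in_3_and_4(label: str) -> bool:
    """True if positions 3 and 4 (counting from 1) are both hyphens."""
    return label[2:4] == "--"


for candidate in ("ab--abcde", "3com", "-abc"):
    print(candidate, is_ldh_syntax(candidate), has_hyphens_in_3_and_4(candidate))
```

Keeping the two tests separate mirrors the point being made: "ab--abcde"
passes the syntax test even though the second test flags it.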

> A-Label
> I believe the definition should be the following conditions:
> 
>    1. ASCII string of length 5 to 64

5 to 63

>    2. starts with "xn--" (or case variants thereof)
>       [implicitly no hyphen at end]
>    3. the remainder  is valid punycode
>    4. and the depunycoded result must be a valid U-Label

Except that "valid punycode" is not, itself, completely
well-defined, since Punycode (the algorithm) can encode any
string whose codepoints fall within the Unicode range (assigned
or unassigned, etc.).   Validity of punycode must, in practice,
be defined in terms of transformations from [valid] U-labels.
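
A small Python sketch of that point, using the "punycode" codec from the
standard library. The helper names are hypothetical, and the choice of
U+0378 as an example of a code point with no assigned character is an
assumption about current Unicode versions.

```python
PREFIX = "xn--"


def has_a_label_shape(candidate: str) -> bool:
    """ASCII, 5..63 characters long, and starting with "xn--" in any case."""
    return (
        candidate.isascii()
        and 5 <= len(candidate) <= 63
        and candidate[:4].lower() == PREFIX
    )


def punycode_roundtrip(text: str) -> tuple[str, str]:
    """Encode text with Punycode, then decode the result again."""
    encoded = text.encode("punycode").decode("ascii")
    decoded = encoded.encode("ascii").decode("punycode")
    return encoded, decoded


print(punycode_roundtrip("bücher"))        # ('bcher-kva', 'bücher')
print(has_a_label_shape("xn--bcher-kva"))  # True

# Assumption: U+0378 has no assigned character. The algorithm encodes
# and decodes it anyway, so a successful decode says nothing about
# whether the result is a valid U-label.
encoded, decoded = punycode_roundtrip("\u0378")
print(decoded == "\u0378")
```

The shape test and the round trip together are still not a validity test;
they only show why validity has to come from the U-label side.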

> I believe that the above is the intended definition, but it is
> not fully supported by the text in Defs, except (perhaps) very
> indirectly. Note that A-Label according to this is dependent
> on U-Label. To make sure that we are not circular, we need to
> define U-Label independently of A-Label.

We can, and should, try, but the isomorphism between the two
imposes a slightly different definitional relationship
requirement than would have existed under IDNA2003 rules.
  
> Putative A-Label
> Any string that is all ASCII, but is neither LDH nor A-Label.

That term is used, I think consistently, for a string that is
offered to a registry or lookup process with the claim that it
is an A-label.   Because that claim can be false for all sorts
of reasons, and because all [valid] A-labels are potential
members of the category of putative A-labels, I don't think the
definition above works.

> U-Label
> This is difficult to make out. I believe the definition should
> be:
> 
>    1. contains at least one non-ASCII character.
>    2. is in form NFC (P4.2)
>    3. contains neither DISALLOWED nor UNASSIGNED (P4.3.1)
>    4. no hyphens in both position 3 and 4 (P4.3.2.1)
>       [implicitly no hyphen at start or end]
>    5. no leading combining marks (P4.3.2.2)
>    6. obeys context constraints (P4.3.2.3)
>    7. obeys bidi constraints (P4.3.2.4)
>    8. converts to valid punycode of length < 60

While I think this definition is correct (except that the "60"
in #8 should be 59), it takes us in a circle through Protocol
and, in actuality, Bidi (since P4.3.2.4 is ultimately just a
reference to Bidi).   That is not satisfactory for the purpose
of using Defs to support Rationale.  It amounts to "the
definition of a U-label is that a U-label is whatever the
algorithm says is a U-label".  That takes us back to a problem
we had with IDNA2003, which is that almost no one could figure
out what was and was not valid.

Your definition is much more a tour of Protocol, Bidi, and the
base DNS RFCs than it is a definition applicable to this set of
documents and, IMO, as such, belongs in Rationale or not at all.

> Protocol:
> 
> 4.3.3 says the following:
> 
>    Strings that have been produced by the steps above, and whose
>    contents pass the above tests, are U-labels.
> 
> However, this does not include condition 8 above; that is
> the test for mapping to A-Label (e.g., overly long punycode) in
> 4.5, not "above" 4.3.3. Condition #1 is also implicit.

This has been clarified, thanks.

> Defs:
> 
> 2.3.1.1 says the following:
> 
>       A "U-label" is an IDNA-valid string of Unicode characters,
>       including at least one non-ASCII character, expressed in a
>       standard Unicode Encoding Form -- in an Internet transmission
>       context this will normally be UTF-8 -- and subject to the
>       constraint below.

>    1. This is inconsistent with 4.3.3, with the only constraints
>    being that U-Labels be NFC, be convertible to and from valid
>    A-Labels, and not be of the form xx--.. But the phrase in
>    bullet 2 seems to state that they must meet "all of the
>    requirements of *these specifications*".

This is a problem, but perhaps different from the one you
identify.  The "a standard... encoding form" language was
precisely correct, and was intended to note that there is a
requirement that a U-label be in Unicode but not that it be in
UTF-8.  I've fixed the normalization form requirement and
rewritten the rest of the paragraph in Defs.
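
As a small illustration of the encoding-form point, the Python sketch
below serializes one string in three standard Unicode encoding forms and
shows that each decodes to the same sequence of code points. The sample
label is an arbitrary choice for illustration.

```python
label = "bücher"

# The same code points serialized in three Unicode encoding forms.
serialized = {
    "utf-8": label.encode("utf-8"),
    "utf-16-be": label.encode("utf-16-be"),
    "utf-32-be": label.encode("utf-32-be"),
}

for name, data in serialized.items():
    # Byte lengths differ, but decoding always recovers the same string.
    print(name, len(data), data.decode(name) == label)
```

The byte sequences differ (7, 12, and 24 bytes here), while the string
they represent does not.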

> But it is
> not clear what those are: they should be listed precisely.

> I can understand not wanting to complicate Defs by having
> conditions 1-8 spelled out completely. It would be possible to
> handle this without complicating Defs, *if* the specific
> sections corresponding to the conditions were explicitly
> referenced in Defs.

Of course, that introduces a different circularity, which is the
need to go from Defs to Protocol and back to Defs to understand
a definition.  I note that the circularity would exist even if
these definitions were incorporated into Protocol and the Defs
document eliminated; it would just be between sections of a
document rather than sections of different documents.  And it
would require that those who are reading Rationale to understand
registry restriction and other policy options to read Protocol,
which we agreed we did not want to require.  

> Putative U-Label
> Any Unicode string that contains at least one non-ASCII
> character, but is not a U-Label.

See comments about "putative A-label", above.

> I can suggest some text fixes, if that would be helpful, but
> wanted to get the principles right first.

As the above indicates, I think we disagree about the principles.

Best Thanksgiving wishes to those who celebrate that holiday.

   john




