Definitional Problem with U-Label and A-Label

Thu Dec 4 02:41:35 CET 2008

Thanks for replying on all of these. It is much easier to have effective
review when it is clear what is just going in, and what still needs to be
discussed.
I definitely disagree with you on the value of fixing these definitions.  A
clear and precise specification of the protocol is absolutely required, and
A-Label, U-Label and the others are clearly vital to correct understanding
and implementation of the protocol.
The fact that any problems were exposed when they were moved from Rationale
is a good thing, not a bad thing. Better to have them out in the open where
they can be fixed than to leave them hidden under rocks.

However, you've convinced me of some things; see below.

On Wed, Nov 26, 2008 at 11:50, John C Klensin <klensin at jck.com> wrote:

>
>
> --On Tuesday, 18 November, 2008 12:55 -0800 Mark Davis
> <mark at macchiato.com> wrote:
>
> > I had a chance to review the documents again. There is good
> > progress; the split of the definitions is very helpful. There
> > is a tendency to always look for the remaining issues in the
> > document, so I want to thank John for all the work done on
> > this. I'll respond with some comments, with different subjects
> > for easier tracking.
>
> Much appreciated.   However, some of your comments below
> illustrate the reason I've been resisting responding to long
> sets of document comments by incorporating them without
> community comment.  I'm now trying to finish up a revision of
> the documents that reflects your comments and others received
> since the beginning of next week, but I am incorporating only
> those things about which I'm absolutely sure.  Especially since
> I can spend far more time responding to particular suggestions
> than actually editing the documents (and have now done so
> several times), the comments below are intended only as examples
> and, to some degree, an attempt to demonstrate that we have
> gotten just about close enough for Proposed Standard and are
> reaching the point of diminishing returns.
>
> Before I dig into the details, a general comment for the WG: I'm
> profoundly unhappy about the situation with the "definitions" at
> this point and, while we disagree about the right solution,
> Mark's comments highlight some of the issues and increase my
> unhappiness.   These definitions started out as relatively
> informal ones in a document that combined the general content of
> Defs, Rationale, and Protocol.  They were intended to provide
> the reader with a framework for understanding what was going on,
> not to provide the precise category-boundary definitions that
> are, in essence part of Protocol and Tables.  When we split
> Protocol off from Rationale, we tuned them to serve somewhat
> more of the role of category definitions, but they remained
> informal.  Pulling them out of Rationale as "normative material"
> increases the implicit requirement that they really _be_
> definitions, definitions that are capable of delimiting category
> boundaries.  But, as Mark's note illustrates, we have no real
> mechanism for doing that without turning the definitions into a
> roadmap of Protocol (and, in some cases, Bidi and Rationale).
>
> At one level, I'd be very enthused about a proposal to
> completely revamp the definitions, introduce additional
> terminology as needed, and produce the document organization we
> have discussed... one in which there are no normative
> dependencies from Defs to any of the other documents and in
> which the others, as needed, point to Defs.  The downside of
> doing that is that the odds of making mistakes as we move text
> around --mistakes that would be far more problematic than the
> current lack of mathematical rigor in the definitions document--
> are extremely high.   I think that we should get this finished
> and into the hands of those who need it and revisit these
> definitional structure issues when we go for Draft Standard.
> But I've said that before, and you and others clearly feel
> differently.

>
>
> > Definitional Problem with U-Label and A-Label
> >
> > I believe (although am not 100% sure) that the intent is for
> > both U-Label and A-Label to only refer to *valid* possible
> > labels under the specifications of IDNA2008,
>
> Yes.  The WG agreed on that some time ago and the text was
> changed to match (I had hoped correctly so, but obviously I
> didn't get every case).
>
> > but the text does
> > not yet support that consistently. Here is the breakdown. (I'm
> > using D1.3 to mean section 1.3 in Defs, and so on, with P for
> > protocol, B for bidi, R for rationale).
>
> Again, this careful breakdown is much appreciated.
>
> > LDH
> > The following conditions:
> >
> >    1. Must match http://tools.ietf.org/html/rfc952
> >       - <name> ::= <let>[*[<let-or-digit-or-hyphen>]<let-or-digit>]
>
> Plus the amendment in RFC 1123 that permits leading (ASCII)
> digits.

good

>
>
> >    2. Length limited to 1..63 (http://tools.ietf.org/html/rfc1034, 3.1)
>
> >    3. Must not have hyphens in both positions 3 and 4. (new
> >       condition)
>
> That restriction applies only to an LDH-label, i.e., a label
> used in (or proposed to be used in) an IDNA-aware slot and
> application.  It still does not apply to LDH strings (and RFC
> 952 host names) in general.

>
>
> > Condition 3 is not stated in D2.3.1.2, but appears elsewhere.
> > Should be in Defs 2.3.1.2.
>
> Actually, it cannot be.  It was in with that definition in
> earlier versions, but people made me take it out because it
> constituted a restriction on the DNS more broadly.  Because
> 2.3.1.2 describes an LDH-label as obeying the hostname syntax
> and _not being an IDN_, it allows hostnames that do, indeed,
> contain "--" in positions three and four.  Put differently,
> "ab--abcde" is a perfectly valid LDH-label (but not an IDN) even
> if other provisions prevent it from appearing in an IDNA-aware
> zone.

Ah, that was completely unclear to me. So what you're saying is that it is
perfectly fine to have an A-Label "ab--abc", but just not the U-Label
"ab--åbc"?
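
To make the distinction concrete, here is a minimal sketch in Python of the
LDH conditions as they are described in this thread. The function names are
invented for illustration and do not come from any of the drafts. Condition 3
is kept as a separate check because, per the reply above, it is not a
restriction on hostnames in general.

```python
import re

# RFC 952 <name> syntax, relaxed per RFC 1123 so that the first character
# may also be an ASCII digit.
_LDH_SYNTAX = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def is_ldh_syntax(label: str) -> bool:
    """Conditions 1 and 2: hostname syntax and a length of 1..63."""
    return 1 <= len(label) <= 63 and _LDH_SYNTAX.fullmatch(label) is not None


def has_hyphens_in_3_and_4(label: str) -> bool:
    """Condition 3, kept separate because it is not a general DNS rule."""
    return label[2:4] == "--"


for candidate in ("example", "3com", "ab--abcde", "-bad", "bad-"):
    print(candidate, is_ldh_syntax(candidate), has_hyphens_in_3_and_4(candidate))
```

With this split, "ab--abcde" passes the syntax check while still being flagged
by the separate hyphen check, which is the case being discussed.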

>
> > A-Label
> > I believe the definition should be the following conditions:
> >
> >    1. ASCII string of length 5 to 64
>
> 5 to 63

good

>
>
> >    2. starts with "xn--" (or case variants thereof)
> >       [implicitly no hyphen at end]
> >    3. the remainder  is valid punycode
> >    4. and the depunycoded result must be a valid U-Label
>
> Except that "valid punycode" is not, itself, completely
> well-defined, since Punycode (the algorithm) can encode any
> string whose codepoints fall within the Unicode range (assigned
> or unassigned, etc.).

No. There are definitely *many* invalid punycode strings, like "xn-1" or
"xn-$". These cannot be validly transformed back to any Unicode string
whatsoever. That is, if you apply http://www.ietf.org/rfc/rfc3492.txt, the
decoding fails.

That is different than the *further* condition (which I marked as #4) which
is that when the punycode is decoded, you end up with a valid U-Label.
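
A small sketch of that difference, assuming Python's built-in "punycode"
codec is an acceptable stand-in for an RFC 3492 decoder. The helper name
try_depunycode is made up for this example.

```python
from typing import Optional


def try_depunycode(encoded: str) -> Optional[str]:
    """Return the decoded string, or None if punycode decoding fails."""
    try:
        return encoded.encode("ascii").decode("punycode")
    except UnicodeError:
        return None


# The two strings quoted above cannot be decoded at all.
print(try_depunycode("xn-1"))
print(try_depunycode("xn-$"))

# "abc-" decodes without error, but the result is plain ASCII, so it would
# still fail the further condition of being a valid U-Label.
print(try_depunycode("abc-"))
```

The first two calls fail inside the decoder itself. The third succeeds as
punycode and would only be rejected by the later U-Label test.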

> Validity of punycode must, in practice,
> be defined in terms of transformations from [valid] U-labels.

No, http://www.ietf.org/rfc/rfc3492.txt doesn't know anything about U-Labels
or need to.

Validity of "A-Labels" is clearly defined in terms of the transformation --
we are in agreement there, I think -- and that is what is captured by the
above 4 conditions.
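
The four conditions can be sketched roughly as follows, again assuming
Python's built-in "punycode" codec for the decoding step. The U-Label test is
passed in as a parameter so that the dependency runs in one direction only;
toy_u_label is a deliberately incomplete placeholder, not the real
definition.

```python
from typing import Callable

ACE_PREFIX = "xn--"


def is_a_label(label: str, is_u_label: Callable[[str], bool]) -> bool:
    # 1. ASCII string of length 5 to 63.
    if not label.isascii() or not 5 <= len(label) <= 63:
        return False
    # 2. Starts with "xn--" (or a case variant).
    if label[:4].lower() != ACE_PREFIX:
        return False
    # 3. The remainder is valid punycode.
    try:
        decoded = label[4:].encode("ascii").decode("punycode")
    except UnicodeError:
        return False
    # 4. The decoded result must be a valid U-Label.
    return is_u_label(decoded)


def toy_u_label(text: str) -> bool:
    """Placeholder only: checks just 'at least one non-ASCII character'."""
    return any(ord(ch) > 127 for ch in text)


print(is_a_label("xn--bcher-kva", toy_u_label))
print(is_a_label("xn--abc-", toy_u_label))
```

Nothing in is_a_label refers back to A-Labels when it calls the U-Label test,
which is the non-circular ordering the conditions are meant to express.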

>
>
> > I believe that the above is the intended definition, but it is
> > not fully supported by the text in Defs, except (perhaps) very
> > indirectly. Note that A-Label according to this is dependent
> > on U-Label. To make sure that we are not circular, we need to
> > define U-Label independently of A-Label.
>
> We can, and should, try, but the isomorphism between the two
> imposes a slightly different definitional relationship
> requirement than would have existed under IDNA2003 rules.
>
> > Putative A-Label
> > Any string that is all ASCII, but is neither LDH nor A-Label.
>
> That term is used, I think consistently, for a string that is
> offered to a registry or lookup process with the claim that it
> is an A-label.   Because that claim can be false for all sorts
> of reasons, and because all [valid] A-labels are potential
> members of the category of putative A-labels, I don't think the
> definition above works.

You use the term "putative" in many places. It would be clearer if we had a
formal definition.

>
> > U-Label
> > This is difficult to make out. I believe the definition should
> > be:
> >
> >    1. contains at least one non-ASCII character.
> >    2. is in form NFC (P4.2)
> >    3. contains neither DISALLOWED nor UNASSIGNED (P4.3.1)
> >    4. no hyphens in both position 3 and 4 (P4.3.2.1)
> >       [implicitly no hyphen at start or end]
> >    5. no leading combining marks (P4.3.2.2)
> >    6. obeys context constraints (P4.3.2.3)
> >    7. obeys bidi constraints (P4.3.2.4)
> >    8. converts to valid punycode of length < 60
>
> While I think this definition is correct (except that the "60"
> in #8 should be 59), it takes us in a circle through Protocol
> and, in actuality, Bidi (since P4.3.2.4 is ultimately just a
> reference to Bidi).   That is not satisfactory for the purpose
> of using Defs to support Rationale.  It amounts to "the
> definition of a U-label is that a U-label is whatever the
> algorithm says is a U-label".  That takes us back to a problem
> we had with IDNA2003, which is that almost no one could figure
> out what was and was not valid.

I'm sorry I was not clear. My first intent was to make sure that I
understood exactly what *you* meant by U-Label. That is hardly clear in the
document. More on this later.
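
For reference, a partial sketch of the eight conditions listed above, covering
only the ones that can be checked with the Python standard library.
Conditions 3, 6 and 7 (DISALLOWED/UNASSIGNED, context and bidi) need the
tables and rules from the other documents and are left out, so a True result
here does not mean the string is a valid U-Label. The length limit is taken
as 63 minus the 4-character "xn--" prefix; the function name is invented.

```python
import unicodedata


def partial_u_label_check(label: str) -> bool:
    # 1. Contains at least one non-ASCII character.
    if not any(ord(ch) > 127 for ch in label):
        return False
    # 2. Is in Normalization Form C.
    if unicodedata.normalize("NFC", label) != label:
        return False
    # 4. No hyphens in both positions 3 and 4, and none at start or end.
    if label[2:4] == "--" or label.startswith("-") or label.endswith("-"):
        return False
    # 5. No leading combining mark (general category M*).
    if unicodedata.category(label[0]).startswith("M"):
        return False
    # 8. The punycode form plus the "xn--" prefix must fit in 63 octets.
    try:
        encoded = label.encode("punycode")
    except UnicodeError:
        return False
    return len(encoded) <= 63 - 4


print(partial_u_label_check("b\u00fccher"))
print(partial_u_label_check("ab--\u00e5bc"))
```

Each check maps to one numbered condition, which is also how a definition by
reference to specific Protocol sections would read.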

>
> Your definition is much more a tour of Protocol, Bidi, and the
> base DNS RFCs than it is a definition applicable to this set of
> documents and, IMO, as such, belongs in Rationale or not at all.
>
> > Protocol:
> >
> > 4.3.3 says the following:
> >
> >    Strings that have been produced by the steps above, and whose
> >    contents pass the above tests, are U-labels.
> >
> > However, this does not include condition 8 above; that is
> > the test for mapping to A-Label (e.g. overly long punycode) in
> > 4.5, not "above" 4.3.3. Condition #1 is also implicit.
>
> This has been clarified, thanks.
>
> > Defs:
> >
> > 2.3.1.1 says the following:
> >
> >       A "U-label" is an IDNA-valid string of Unicode characters,
> >       including at least one non-ASCII character, expressed in a
> >       standard Unicode Encoding Form -- in an Internet transmission
> >       context this will normally be UTF-8 -- and subject to the
> >       constraint below.
>
> >    1. This is inconsistent with 4.3.3, with the only constraints
> >    being that U-Labels be NFC, be convertible to and from valid
> >    A-Labels, and not be of the form xx--.. But the phrase in
> >    bullet 2 seems to state that they must meet "all of the
> >    requirements of *these specifications*".
>
> This is a problem, but perhaps different from the one you
> identify.  The "a standard... encoding form" language was
> precisely correct, and was intended to note that there is a
> requirement that a U-label be in Unicode but not that it be in
> UTF-8.  I've fixed the normalization form requirement and
> rewritten the rest of the paragraph in Defs.

ok, I'll take a look.

>
>
> > But it is
> > not clear what those are: they should be listed precisely.
>
> > I can understand not wanting to complicate Defs by having
> > conditions 1-8 spelled out completely. It would be possible to
> > handle this without complicating Defs, *if* the specific
> > sections corresponding to the conditions were explicitly
> > referenced in Defs.
>

>
> Of course, that introduces a different circularity, which is the
> need to go from Defs to Protocol and back to Defs to understand
> a definition.  I note that circularity would exist if these
> definitions were incorporated into Protocol and the Defs
> document eliminated, it would just be between sections of a
> document rather than sections of different documents.  And it
> would require that those who are reading Rationale to understand
> registry restriction and other policy options to read Protocol,
> which we agreed we did not want to require.

You are right. It is ok for the definition to refer to a clear sequence of
steps in the protocol document, as long as none of those steps are
circularly referring to the definition.

So we could say, a U-Label is a Unicode string that satisfies the conditions
of Section X.Y in [protocol].

Originally, I was trying to figure out what the definition actually was, and
to see if those 8 conditions matched.

>
>
> > Putative U-Label
> > Any Unicode string that contains at least one non-ASCII
> > character, but is not a U-Label.
>
> See comments about "putative A-label", above.
>
> > I can suggest some text fixes, if that would be helpful, but
> > wanted to get the principles right first.
>
> As the above indicates, I think we disagree about the principles.

We may be closer now.

>
>
> Best Thanksgiving wishes to those who celebrate that holiday.

I hope you had a good Thanksgiving also.

>
>
>   john
>
>
>
> Mark