Definitional Problem with U-Label and A-Label

Tue Nov 18 21:55:34 CET 2008

I had a chance to review the documents again. There is good progress; the
split of the definitions is very helpful. There is a tendency to always look
for the remaining issues in the document, so I want to thank John for all
the work done on this. I'll respond with some comments, with different
subjects for easier tracking.

Definitional Problem with U-Label and A-Label

I believe (although I am not 100% sure) that the intent is for both U-Label
and A-Label to only refer to *valid* possible labels under the
specifications of IDNA2008, but the text does not yet support that
consistently. Here is the breakdown. (I'm using D1.3 to mean section 1.3 in
Defs, and so on, with P for protocol, B for bidi, R for rationale).

LDH
The following conditions:

   1. Must match http://tools.ietf.org/html/rfc952
      - <name> ::= <let>[*[<let-or-digit-or-hyphen>]<let-or-digit>]
   2. Length limited to 1..63 (http://tools.ietf.org/html/rfc1034, 3.1)
   3. Must not have hyphens in both positions 3 and 4. (new condition)

Condition 3 is not stated in D2.3.1.2, but appears elsewhere. It should be
stated in D2.3.1.2 as well.
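
To make the three LDH conditions concrete, here is a rough sketch in Python. The function name is_ldh_label is mine, not from any of the drafts, and the regular expression is my transcription of the RFC 952 <name> production, so treat it as an illustration rather than a reference implementation.

```python
import re

# Condition 1: RFC 952 <name>, i.e. a letter, optionally followed by
# letters/digits/hyphens and ending in a letter or digit.
_RFC952_NAME = re.compile(r"[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?\Z")


def is_ldh_label(label: str) -> bool:
    """Hypothetical check for the three LDH conditions listed above."""
    # Condition 2: length limited to 1..63 (RFC 1034, section 3.1).
    if not 1 <= len(label) <= 63:
        return False
    # Condition 1: must match the RFC 952 grammar.
    if _RFC952_NAME.match(label) is None:
        return False
    # Condition 3: no hyphens in both positions 3 and 4 (1-based).
    if label[2:4] == "--":
        return False
    return True
```

Note that under condition 3 a string such as "xn--abc" is not LDH, which keeps the LDH and A-Label categories disjoint.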

A-Label
I believe the definition should be the following conditions:

   1. ASCII string of length 5 to 63
   2. starts with "xn--" (or case variants thereof) [implicitly no hyphen at
   end]
   3. the remainder is valid punycode
   4. and the depunycoded result must be a valid U-Label

I believe that the above is the intended definition, but it is not fully
supported by the text in Defs, except (perhaps) very indirectly. Note that
A-Label according to this is dependent on U-Label. To make sure that we are
not circular, we need to define U-Label independently of A-Label.
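
The dependency of A-Label on U-Label can be sketched as follows. This is only an illustration: is_a_label is a name I made up, the U-Label test is passed in as a parameter precisely because it has to be defined independently, and the upper bound of 63 is my assumption based on the DNS label limit. The punycode step uses the "punycode" codec from the Python standard library.

```python
from typing import Callable


def is_a_label(label: str, is_u_label: Callable[[str], bool]) -> bool:
    """Hypothetical check for the four A-Label conditions listed above.

    is_u_label is supplied by the caller; it must not itself call
    is_a_label, otherwise the definitions become circular.
    """
    # Condition 1: ASCII, and at least "xn--" plus one character.
    # The upper bound of 63 is an assumption (DNS label limit).
    if not label.isascii() or not 5 <= len(label) <= 63:
        return False
    # Condition 2: starts with "xn--" in any case; no hyphen at the end.
    if label[:4].lower() != "xn--" or label.endswith("-"):
        return False
    # Condition 3: the remainder must decode as punycode.
    try:
        decoded = label[4:].encode("ascii").decode("punycode")
    except UnicodeError:
        return False
    # Condition 4: the decoded result must be a valid U-Label.
    return is_u_label(decoded)
```

For example, is_a_label("xn--bcher-kva", some_u_label_test) hands the decoded string "bücher" to the supplied U-Label test.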

Putative A-Label
Any string that is all ASCII, but is neither LDH nor A-Label.

U-Label
This is difficult to make out. I believe the definition should be:

   1. contains at least one non-ASCII character.
   2. is in form NFC (P4.2)
   3. contains neither DISALLOWED nor UNASSIGNED (P4.3.1)
   4. no hyphens in both positions 3 and 4 (P4.3.2.1) [implicitly no hyphen
   at start or end]
   5. no leading combining marks (P4.3.2.2)
   6. obeys context constraints (P4.3.2.3)
   7. obeys bidi constraints (P4.3.2.4)
   8. converts to valid punycode of length < 60
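
Some of these conditions can be checked with nothing but the Python standard library, which may help show how independent they are of A-Label. The sketch below covers only conditions 1, 2, 4, 5 and 8; conditions 3, 6 and 7 need the IDNA2008 property tables and the bidi rules, which I have not attempted here. The function name and the wording of the messages are my own.

```python
import unicodedata


def check_u_label_basics(label: str) -> list:
    """Return the numbers of the conditions above that the label fails.

    Hypothetical helper; covers conditions 1, 2, 4, 5 and 8 only.
    """
    problems = []
    # Condition 1: at least one non-ASCII character.
    if all(ord(ch) < 128 for ch in label):
        problems.append("1: no non-ASCII character")
    # Condition 2: must already be in NFC.
    if unicodedata.normalize("NFC", label) != label:
        problems.append("2: not in NFC")
    # Condition 4: no "--" in positions 3 and 4, no hyphen at start or end.
    if label[2:4] == "--" or label.startswith("-") or label.endswith("-"):
        problems.append("4: hyphen restriction")
    # Condition 5: no leading combining mark (general category M*).
    if label and unicodedata.category(label[0]).startswith("M"):
        problems.append("5: leading combining mark")
    # Condition 8: the punycode form must be shorter than 60 characters.
    if len(label.encode("punycode")) >= 60:
        problems.append("8: punycode too long")
    return problems
```

An empty result means only that these five conditions hold, not that the string is a U-Label.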

Protocol:

4.3.3 says the following:

   Strings that have been produced by the steps above, and whose
   contents pass the above tests, are U-labels.

However, this may not include condition 8 above; that is the test for
mapping to an A-Label (e.g. overly long punycode) in 4.5, which is not "above"
4.3.3. Condition #1 is also implicit.

Defs:

2.3.1.1 says the following:

      A "U-label" is an IDNA-valid string of Unicode characters,
      including at least one non-ASCII character, expressed in a
      standard Unicode Encoding Form -- in an Internet transmission
      context this will normally be UTF-8 -- and subject to the
      constraint below.

   1. This is inconsistent with 4.3.3, with the only constraints being that
   U-Labels be NFC, be convertible to and from valid A-Labels, and not be of
   the form xx--.. But the phrase in bullet 2 seems to state that they must
   meet "all of the requirements of *these specifications*". But it is not
   clear what those are: they should be listed precisely.

I can understand not wanting to complicate Defs by having conditions 1-8
spelled out completely. It would be possible to handle this without
complicating Defs, *if* the specific sections corresponding to the
conditions were explicitly referenced in Defs.

Putative U-Label
Any Unicode string that contains at least one non-ASCII character, but is
not a U-Label.
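
Taken together, the five terms (LDH, A-Label, Putative A-Label, U-Label, Putative U-Label) are meant to partition all possible label strings. A small sketch of that partition, with the three validity tests passed in as hypothetical predicates rather than implemented here:

```python
from typing import Callable

Predicate = Callable[[str], bool]


def classify(label: str, is_ldh: Predicate, is_a_label: Predicate,
             is_u_label: Predicate) -> str:
    """Assign a label string to exactly one of the five categories.

    The three predicates are assumptions supplied by the caller.
    """
    if label.isascii():
        # All-ASCII strings: LDH, A-Label, or neither.
        if is_ldh(label):
            return "LDH"
        if is_a_label(label):
            return "A-Label"
        return "Putative A-Label"
    # Strings with at least one non-ASCII character.
    return "U-Label" if is_u_label(label) else "Putative U-Label"
```

Every string lands in exactly one category, which is the property the definitions in Defs should make explicit.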

I can suggest some text fixes, if that would be helpful, but wanted to get
the principles right first.

Mark