A-label definition (was: IDN test TLDs)

Sat Jun 21 12:50:56 CEST 2008

--On Friday, 20 June, 2008 03:22 +0200 Frank Ellermann
<hmdmhdfmhdjmzdtjmzdtzktdkztdjz at gmail.com> wrote:

> Hi, I thought I know what an "A-label" is, but looking 
> into draft-ietf-idnabis-rationale-00 I found that this
> is not the case:
> 
> (1) LDH label, that's AFAIK 1 to 63 letters, digits,
>     and hyphens, not starting or ending with a hyphen.

And not having two hyphens in the third and fourth positions,
according to the current definition in idnabis-rationale.  I've
clarified this slightly in the working draft for rationale-01.
If the WG concludes that it doesn't want the restriction
prohibiting non-IDNA labels with hyphens in those positions,
that will need to be revised.
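
As a rough illustration of the label rules discussed above, here
is a minimal Python sketch.  It assumes ASCII input and keeps the
"hyphens in the third and fourth positions" rule as a separate
check, since that restriction comes from the current rationale
draft and may be revised.  The function names are invented for
this note and are not taken from any specification.

```python
import re

# 1 to 63 characters: letters, digits, hyphens; no leading or
# trailing hyphen.
_LDH = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_ldh_label(label: str) -> bool:
    """Classic letter-digit-hyphen check (hypothetical helper)."""
    return _LDH.fullmatch(label) is not None


def has_reserved_hyphens(label: str) -> bool:
    """True if the third and fourth characters are both hyphens."""
    return label[2:4] == "--"


def is_unprefixed_ldh_label(label: str) -> bool:
    """LDH label that also avoids "--" in positions three and four."""
    return is_ldh_label(label) and not has_reserved_hyphens(label)
```

Under these assumptions "xn--bcher-kva" passes is_ldh_label() but
is set aside by has_reserved_hyphens(), so whether it is
acceptable depends on a separate A-label validity test.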

> All LDH labels are technically valid host name labels,
> because that's what the relevant IETF standards say.

Yes.  But the terminology in "rationale" is a little different,
and says so.  Note that, IIRC, 1035 doesn't say "LDH" and 1123
doesn't either; they say "host name".

> (2) Toplabel, that is at the moment a shaky RFC 1123
>     erratum.  IMO it should be the same as LDH label,
>     but including at least one non-digit.  It needs
>     an "updates 1123" in idnabis-rationale.  While at
>     it we could also say "not only a single letter".

This is really a separate discussion.  I hope that it is out of
scope for this WG, but that is certainly subject to debate.  As
you know, I've written the IESG asking them to give some
priority to validating that erratum.  On the other hand, 1123 is
actually quite clear, IMO: it says "alphabetic" and meant it.
And "alphabetic", in ordinary, common-sense usage, means "no
digits" (if 1123 had intended "alphanumeric", it would have said
so).  We probably should extend the 1123 rule to permit those
hyphens but, IMO, that is as far as we should go.

>     If we do the latter:  Folks often need syntax in
>     the form of STD 68 ABNF in their drafts, and we
>     can copy <toplabel> from RFC.ietf-usefor-usefor
> 
>     If we don't do this we can copy <toplabel> from
>     RFC 4408.  You can guess who needed this syntax,
>     and arrived at a slight difference.  <shudder />      
> 
>     JFTR, a USEFOR co-Chair (i.e. Harald) asked IAB
>     and ICANN (IIRC) about this issue.  Somebody 
>     found a simpler <toplabel> version for the "not
>     only a letter" variant, I can find it if needed.

Those are a combination of I-Ds, informational and experimental
documents, and opinions that don't represent demonstrated
community consensus.  Sorry if I don't find much authority in
them.

> (3) U-label, the definition should mention that this 
>     is about labels with at least one non-ASCII code
>     point, otherwise we would get a confusing overlap
>     with LDH labels.

Correction made in the working version of "rationale-01".
Thanks.

> (4) A-label, that is apparently the proper subset of
>     valid LDH labels (see 1) starting with "xn--",
>     and corresponding to valid U-labels (see 3).  By
>     definition an A-label is also a valid <toplabel>, 
>     and we don't need to talk about this.

By whose definition?   By the definition in 1123, one cannot
have an A-label as a TLD label, because it isn't alphabetic.
In that context, all the ICANN test collection proves is that
one can violate 1123 without causing very many problems, at
least for the mostly-web applications that have been used in
tests.

As noted above, I think we should probably change that, but it
means updating 1123, which is not obviously in the WG's charter.
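
To make the conflict concrete, here is a small Python sketch of
the letters-only reading of "alphabetic" described above.  It is
one interpretation of the 1123 wording, not text from the RFC,
and the function name is hypothetical.

```python
def is_alphabetic_label(label: str) -> bool:
    """Letters-only reading: ASCII A-Z and a-z, nothing else."""
    return label.isascii() and label.isalpha()
```

Under that reading is_alphabetic_label("com") is true, while
is_alphabetic_label("xn--bcher-kva") is false because of the
hyphens, which is why an A-label cannot be a TLD label unless the
1123 rule is updated.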

> There's an open question about "valid U-toplabel", is
> more than one code point required.  I think it is not
> required:  Depending on the script "one code point"
> can express things that would need several letters in
> other scripts.  ICANN can sort this out.

It is not clear who gets to "sort this out".  When RFC 1591 was
written, its author and contributors assumed (and discussed the
assumption) that, if and when future TLDs were allocated, they
would be allocated according to the 2-3-4 (ccTLDs, gTLDs, ARPA)
rule and would be all-alphabetic.  That document anticipated
neither IDNs nor ICANN decisions to allocate gTLDs with names of
more or less arbitrary length.

But the enforcers of validity of DNS labels (at any level) have
always been the application protocols.  If the IETF concludes
that there are substantive reasons to prohibit one-character
labels, or labels containing all (or any) digits at the top
level, etc.; incorporates those rules into protocol syntax; and
is lucky enough to have anyone pay attention to it, then ICANN
ends up in a very difficult situation in which they can allocate
strings that don't follow the rules but find that applications
won't look them up or otherwise consider them valid.

So, if it is important enough that we can convince others that
we have a valid basis for doing so, we still have the ability to
do the sorting out.  And, again, I hope that work doesn't belong
to this WG.

> (5) I-label (making up a new term for this article):
>     An "I-label" is an U-label in legacy non-Unicode
>     and non-ASCII charsets, as found in RFC 3987 IRIs,
>     or more precisely in labels of an <ihost> for a
>     corresponding registered DNS host name.
> 
>     The typical example is "bücher", unless I screw up
>     and send this as UTF-8.  Please assume that I want
>     windows-1252 or iso-8859-1, not UTF-8.
> 
>     Maybe idnabis-rationale should define I-label with
>     a reference to RFC 3987.  I also don't see why the
>     U-label is limited to a "standard Unicode encoding
>     form", that would mean "can be SCSU, but not BOCU,
>     UTF-7, UTF-1, GB 18030, etc.".  IMO the question of
>     encoding forms misses some points, maybe we should
>     simply rename U-label to I-label:
> 
>     "I" as in I18N, IDNAbis, IRI is intuitive and KISS.

I believe that 3987, in permitting non-Unicode labels, is either:

	(i) a piece of user interface specification, as provided
	for in Sections 4.1 and 5.1 of
	draft-ietf-idnabis-protocol-01, and not suitable for use
	"on the wire" or

	(ii) a serious threat to interoperability.

See Ken's note for an explanation of why "standard Unicode
encoding form" is exactly the right definition.

> Above all I disagree with the proposed decree that all
> LDH labels with a hyphen in position 3 and 4 have to
> be A-labels.  That could require to update hundreds of
> RFCs simultaneously, followed by a worldwide upgrade.

It does no such thing, IMO.    Remember that any use of the IDNA
"trick" (i.e., treating some domain names as a special encoding
of something else, rather than whatever they appear to be) at
all requires that we take some subset of domain names that were
previously interpreted as themselves and start interpreting them
as something else.  It doesn't require "update hundreds of RFCs
simultaneously, followed by a worldwide upgrade".  What it does
require is "if you are going to be IDNA-aware and -capable, then
you need to interpret prefixed labels in a particular way and
take some other precautions".  Nothing new.  

The current rule (banning anything with "--" in positions three
and four that isn't a valid A-label) in IDNA2008 is extremely
conservative wrt prefix forms as a means of avoiding nonsense in
the present and preserving the ability to introduce new special
codings in the future.   It still doesn't change anything for
applications that are not IDNA-aware.  For IDNA-aware
registries, it prohibits registering such names as a precaution.
That isn't much of a restriction, since no one has really
demonstrated a need for such strings.  And, for IDNA-aware
lookup applications, it recommends not looking the strings up,
at least unless the application is sure it knows how to
interpret them.   Not a really big deal, IMO.

If the WG concludes that is excessive and wants to drop back all
or part of the way to a rule that merely says that, if the label
starts with "xn--", it must be an A-label, I won't lose any sleep
over it... but let's not try to get there by hyperbole about
global changes to RFCs and worldwide upgrades.

> Looking at this from the other side:  If a worldwide
> upgrade would work we could simply decree that host
> names can use UTF-8, and be done with it.  As this is
> obviously wrong we cannot say that certain LDH labels
> are "invalid", we can only define valid A-labels, and
> anything else is whatever it is, xn--cocacola.

The latter is the one thing you cannot do because it prevents
future expansion within the Unicode set.  A valid A-label today
is one that satisfies the U-label conversion rules, e.g., it
doesn't map to Disallowed or Unassigned Unicode code points.
If one decides that an A-label that cannot satisfy those rules
is "whatever it is", one ends up with a string with two possible
interpretations depending on the version of Unicode being used
by the application.
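
The version dependence can be sketched in Python as follows.
This is only an approximation under stated assumptions: it uses
the interpreter's built-in "punycode" codec for the raw encoding
step, and it stands in for the real conversion rules with three
simple tests (at least one non-ASCII code point, no code point
that this interpreter's Unicode database reports as category Cn,
and a clean round trip).  It does not implement the
Disallowed/Unassigned tables of the IDNA drafts, and the function
names are hypothetical.

```python
import unicodedata
from typing import Optional

ACE_PREFIX = "xn--"


def u_to_a(u_label: str) -> str:
    """Raw Punycode encoding plus the ACE prefix (no other checks)."""
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")


def a_to_u_or_none(label: str) -> Optional[str]:
    """Return the decoded label, or None if the sketch rejects it."""
    if not label.lower().startswith(ACE_PREFIX):
        return None
    try:
        payload = label[len(ACE_PREFIX):].encode("ascii")
        decoded = payload.decode("punycode")
    except ValueError:  # UnicodeError is a subclass of ValueError
        return None
    # A U-label needs at least one non-ASCII code point.
    if not decoded or all(ord(ch) < 128 for ch in decoded):
        return None
    # "Cn" depends on the Unicode version shipped with Python, so
    # the answer here can change from one interpreter to the next.
    if any(unicodedata.category(ch) == "Cn" for ch in decoded):
        return None
    # Reject encodings that do not round-trip to the same string.
    if u_to_a(decoded).lower() != label.lower():
        return None
    return decoded
```

For example, u_to_a("bücher") gives "xn--bcher-kva", and
a_to_u_or_none("xn--bcher-kva") gives "bücher".  A label whose
payload decodes to a code point the local Unicode tables do not
know yet is rejected today and could be accepted after an
upgrade, which is the two-interpretations problem described
above.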

     john