plea for NFKC & case-folding, suggestions for definitions

Sun Mar 1 23:17:08 CET 2009

I have not had time to follow the progress of this
working group, but I have now read the latest
draft-ietf-idnabis-{defs,protocol,rationale,tables}, and I have a few
high-level comments.

 1) I am very happy that an internationalized generalization of
    preferred-syntax (host name) labels is being defined, based on the
    principle of including what's needed, rather than excluding only
    what's clearly useless/harmful.  I wanted to work on this in the
    first IDN working group, instead of or in addition to the wide-open
    IDNs of IDNA2003, but the rough consensus then was that it was not
    worth the delay it would cause.

 2) I am not persuaded that IDNAbis can avoid requiring the fundamentals
    of Nameprep: NFKC and case-folding.  More on this below.

 3) I think the approach taken in the definitions section, of building
    up the smaller concepts involved in the ACE architecture, is
    better than the approach I took in RFC 3490--referring to complex
    multi-step operations ToASCII and ToUnicode as primitives.  The
    small-concepts approach allows the reader to develop some intuition.
    I have some suggestions for more concise and rigorous definitions
    following that approach (see below).

Regarding NFKC and case-folding

I think the rationale draft is trying to have it both ways.  It says
a prefix change would be required if a label that is valid in both
IDNA2003 and IDNAbis is represented by different ASCII forms in the
two protocols.  To avoid triggering that incompatibility, it defines
non-normalized and non-case-folded strings as "invalid".  But then
it admits (in section "Front-end and User Interface Processing for
Lookup" in the rationale doc) that in many cases "some local processing
of apparent domain name strings will be required, both to maintain
compatibility with IDNA2003 and to prevent user astonishment".
In practice applications will often have no choice but to accept
non-normalized and/or non-case-folded strings and apply "local
processing", for which there is no standard, only suggestions of either
"generic preprocessing" (which is basically parts of IDNA2003 that are
not specified in IDNAbis and not even included by reference) or "highly
localized preprocessing", which is completely unspecified but won't be
"a threat to interoperability as long as (i) only U-labels and A-labels
are used in interchange with systems outside the local environment..."
But domain names are global identifiers, and they get exchanged
willy-nilly by humans typing and cutting and pasting them.

My opinion:  We can't have it both ways.  Breaking compatibility in
isolated well-considered corner cases involving a few code points (like
the zero-width joiner) is one thing, but breaking compatibility with
giant swaths of the namespace, like all non-normalized names and all
names containing uppercase characters, and not changing the prefix,
would be antithetical to the purpose of the IETF (interoperability) and
would betray the trust of the internet community.

Since changing the prefix is disallowed by the IDNAbis charter, some
Nameprep-like requirement needs to be specified in IDNAbis, based on
NFKC and case-folding.
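
To make concrete what is at stake, here is a minimal Python sketch of
the kind of folding I mean.  It is not Nameprep: the helper name "fold"
is made up for illustration, and real Nameprep (RFC 3491) uses its own
mapping tables and a fixed Unicode version, whereas this relies on
whatever Unicode data the Python runtime ships.

```python
import unicodedata

def fold(label: str) -> str:
    """Rough approximation of Nameprep-style folding (assumption, not RFC 3491).

    Applies NFKC, then Unicode case folding, then NFKC again, because
    case folding can produce a string that is no longer normalized.
    """
    normalized = unicodedata.normalize("NFKC", label)
    return unicodedata.normalize("NFKC", normalized.casefold())

# Four spellings a user might type or paste for the same label.
variants = [
    "b\u00fccher",                      # precomposed u-diaeresis
    "B\u00dcCHER",                      # uppercase
    "bu\u0308cher",                     # 'u' + combining diaeresis
    "\uff22\u00dc\uff23\uff28\uff25\uff32",  # fullwidth letters
]
for v in variants:
    print(repr(v), "->", repr(fold(v)))
```

All four spellings fold to the same string under this sketch, which is
the behavior users have come to expect from IDNA2003.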

Regarding the definitions:

Below are some definitions of various kinds of labels, and some key
observations (useful theorems) implied by the definitions.  They follow
the same general small-concepts approach as in the defs draft, but
I've tried to make them a little more concise and rigorous.  Draft
editors might want to incorporate or draw inspiration from them.  These
definitions assume that something Nameprep-like is required in the
protocol.

============
Definitions:

[[Editorial notes are in double square brackets.]]

A "string" is a sequence of Unicode code points (not bytes).  The
relationship between code points and bytes is beyond the scope of IDNA.
Conversion of non-Unicode text to/from Unicode is beyond the scope of
IDNA.

An "ASCII code point" is a Unicode code point in the range 0..7F.

An "LDH code point" is an ASCII code point that is a letter (41..5A,
61..7A), digit (30..39), or hyphen-minus (U+002D).

An "ASCII string" is a string that contains only ASCII code points (or
is empty).  Note that a non-ASCII string can contain some ASCII code
points.
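
The three definitions above translate almost directly into predicates.
The following is only a sketch; the function names are invented here,
and it leans on the fact that a Python str is already a sequence of
code points rather than bytes.

```python
def is_ascii_cp(cp: int) -> bool:
    """ASCII code point: in the range 0..7F."""
    return 0x00 <= cp <= 0x7F

def is_ldh_cp(cp: int) -> bool:
    """LDH code point: ASCII letter, digit, or hyphen-minus."""
    return (0x41 <= cp <= 0x5A      # uppercase letters
            or 0x61 <= cp <= 0x7A   # lowercase letters
            or 0x30 <= cp <= 0x39   # digits
            or cp == 0x2D)          # hyphen-minus

def is_ascii_string(s: str) -> bool:
    """ASCII string: only ASCII code points; the empty string qualifies."""
    return all(is_ascii_cp(ord(c)) for c in s)

# A non-ASCII string may still contain ASCII code points.
mixed = "b\u00fccher"
print(is_ascii_string(mixed), [is_ascii_cp(ord(c)) for c in mixed])
```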

The "canonical form" of a string is the output of a particular
canonicalization function from strings to strings.  The function does
not always produce an output; sometimes it fails (for example, because
the input string contains a disallowed code point).  A string for which
the function fails has no canonical form.  The canonicalization function
is idempotent; that is, re-applying it to its own output yields the same
output.

The full details of the canonicalization function are specified
elsewhere, but it is worth noting here that its treatment of ASCII code
points agrees with the rules for validating and comparing ASCII host name
labels [RFCs 921, 952, 1123]:  If the input contains only LDH code
points (or is empty) and neither begins nor ends with hyphen-minus,
the function succeeds.  If the input contains any non-LDH ASCII code
points, or if it begins or ends with hyphen-minus, the function fails.
The function replaces uppercase ASCII letters with the corresponding
lowercase ASCII letters, and leaves other ASCII code points unchanged.

[[The canonicalization function is the IDNAbis analog of Nameprep, the
function all applications use for encoding/decoding ACE and validating &
comparing internationalized labels.  It excludes any extra restrictions
that are enforced only by registries.]]
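
The ASCII behavior described above can be sketched as follows.  This
covers only ASCII input; the treatment of non-ASCII code points is
specified elsewhere and is deliberately left out.  The function name is
hypothetical, and None stands for "the function fails".

```python
from typing import Optional

LDH = set("abcdefghijklmnopqrstuvwxyz"
          "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
          "0123456789-")

def canonicalize_ascii(s: str) -> Optional[str]:
    """ASCII-only slice of the canonicalization function (sketch).

    Returns the canonical form, or None if the function fails.
    """
    if any(ord(c) > 0x7F for c in s):
        raise ValueError("this sketch handles ASCII input only")
    if any(c not in LDH for c in s):
        return None                  # contains a non-LDH ASCII code point
    if s.startswith("-") or s.endswith("-"):
        return None                  # begins or ends with hyphen-minus
    return s.lower()                 # uppercase -> lowercase, rest unchanged

for label in ["WWW", "example", "_tcp", "-a", "a-", ""]:
    print(repr(label), "->", repr(canonicalize_ascii(label)))
```

Note that re-applying the function to its own output changes nothing,
which is the idempotence property required of the full function.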

A "canonical string" is a string that has a canonical form and is equal
to it.

A "tagged string" is a string that has a canonical form with U+002D
(hyphen-minus) as its 3rd and 4th code points.  The first four code
points of the canonical form are the "tag".

Two strings are an "XN pair" iff they satisfy all of the following
properties:

 1) Both are canonical strings.

 2) One is an ASCII string and the other is a non-ASCII string.

 3) The ASCII string begins with the tag "xn--".

 4) The non-ASCII string is non-tagged.

    [[Requiring the non-ASCII string to be non-tagged is stronger than
    RFC 3490, which required only that it not begin with "xn--".  That
    was probably an oversight.  Since IDNAbis is tightening up all sorts
    of things, it might as well tighten this up too.]]

 5) If the ASCII string minus its tag is fed to a Punycode decoder, the
    result is the non-ASCII string.  Equivalently, if the non-ASCII
    string is fed to a Punycode encoder that outputs lowercase forms,
    the result is the ASCII string minus its tag.  Punycode is specified
    elsewhere.

If a string is a member of an XN pair, its "XN partner" is the other
member of the pair.  (It can be shown that a string can belong to at
most one XN pair and therefore has at most one XN partner.)
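
The five properties can be checked mechanically.  Below is a sketch in
Python under loud assumptions: "canonical_form" is a hypothetical
stand-in (NFKC plus case folding plus the ASCII rules above), not the
real canonicalization function, and the Punycode step uses the
"punycode" codec from the Python standard library, whose encoder emits
lowercase digits.

```python
import unicodedata
from typing import Optional

def canonical_form(s: str) -> Optional[str]:
    """Hypothetical stand-in for the canonicalization function (assumption)."""
    out = unicodedata.normalize("NFKC", unicodedata.normalize("NFKC", s).casefold())
    if any(ord(c) < 0x80 and not (c.isalnum() or c == "-") for c in out):
        return None                      # non-LDH ASCII code point
    if out.startswith("-") or out.endswith("-"):
        return None                      # leading or trailing hyphen-minus
    return out

def is_canonical(s: str) -> bool:
    return canonical_form(s) == s

def is_tagged(s: str) -> bool:
    c = canonical_form(s)
    return c is not None and c[2:4] == "--"   # 3rd and 4th code points

def is_xn_pair(a: str, b: str) -> bool:
    if not (is_canonical(a) and is_canonical(b)):
        return False                     # property 1
    if a.isascii() == b.isascii():
        return False                     # property 2
    ace, uni = (a, b) if a.isascii() else (b, a)
    if not ace.startswith("xn--"):
        return False                     # property 3
    if is_tagged(uni):
        return False                     # property 4
    # Property 5, in its encoder formulation.
    return uni.encode("punycode").decode("ascii") == ace[4:]

print(is_xn_pair("xn--bcher-kva", "b\u00fccher"))
```

The sketch checks property 5 in the encoder direction only, which
avoids having to handle decoder failures.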

A "domain label" is a component of a domain name, or something that
could be (by virtue of its syntax) a component of a domain name.  For
example, the domain name "www.example.com" (which can in some contexts
be written with a trailing dot: "www.example.com.") is composed of three
domain labels: "www", "example", and "com".  In some contexts there
is said to be a fourth domain label, the empty root label.  In some
contexts domain labels can contain non-text (like binary data).

A "text label" is a domain label that is non-empty and contains only
text.  Domain labels that are not text labels are outside the scope of
IDNA.

An "ASCII label" is an ASCII string with at least 1 and at most 63 code
points.

An "LDH label" is an ASCII label that has a canonical form (that is,
it contains only LDH code points and neither begins nor ends with
hyphen-minus).  Because domain labels intended for human consumption
have generally been LDH labels, this is the class of domain labels that
IDNA extends.  ASCII labels that have no canonical form (like "_tcp"
[RFC 2782]) are outside the scope of IDNA.

An "internationalized label" is a generalization of LDH label:  It is a
string that has a canonical form that is, or is the XN partner of, an
LDH label.

Two internationalized labels are "equivalent" iff they have canonical
forms that are identical or are an XN pair.

A "tagged label" is an internationalized label that is a tagged string.

An "ACE label" is an internationalized label whose canonical form is the
ASCII member of an XN pair.  (ACE stands for ASCII Compatible Encoding.)

A "Joker label" is a tagged label that is not an ACE label.

Within the IDNA specifications the unqualified term "label" means
internationalized label, but this abbreviation is avoided wherever it
might be confusing.

=================
Key observations:

Within the set of all internationalized labels equivalent to any given
internationalized label, at most two are canonical.  If there are two,
they are an XN pair.

For every non-ASCII internationalized label, there exists at least
one equivalent internationalized label that is ASCII.  (Proof by
construction:  If the canonical form of an internationalized label is
not ASCII, then it has an XN partner that is ASCII.)  ASCII forms are
needed in some protocols (like DNS).

For every ACE label, there exists at least one equivalent
internationalized label that is non-tagged and therefore non-ACE.
(Proof by construction:  An ACE label's canonical form has an XN partner
that is non-tagged.)  The non-ACE form is much more user-friendly,
because ACE labels contain Punycode-encoded text, which looks like
garbage.

Every internationalized label has exactly one canonical ASCII form.  Two
internationalized labels are equivalent iff their canonical ASCII forms
are identical.

Every internationalized label has exactly one canonical non-ACE form.
Two internationalized labels are equivalent iff their canonical non-ACE
forms are identical.
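
The canonical ASCII form gives a simple way to test equivalence.  A
sketch, again with a hypothetical stand-in for the canonicalization
function (NFKC plus case folding plus the ASCII rules) and the Python
standard library's "punycode" codec; the function names are mine.

```python
import unicodedata
from typing import Optional

def canonical_form(s: str) -> Optional[str]:
    """Hypothetical stand-in for the canonicalization function (assumption)."""
    out = unicodedata.normalize("NFKC", unicodedata.normalize("NFKC", s).casefold())
    if any(ord(c) < 0x80 and not (c.isalnum() or c == "-") for c in out):
        return None                      # non-LDH ASCII code point
    if out.startswith("-") or out.endswith("-"):
        return None                      # leading or trailing hyphen-minus
    return out

def canonical_ascii(label: str) -> Optional[str]:
    """Canonical ASCII form, or None if not an internationalized label."""
    c = canonical_form(label)
    if c is None:
        return None
    if c.isascii():
        # Already an LDH label if its length is acceptable.
        return c if 1 <= len(c) <= 63 else None
    if c[2:4] == "--":
        return None                      # non-ASCII tagged: no XN partner
    ace = "xn--" + c.encode("punycode").decode("ascii")
    return ace if len(ace) <= 63 else None

def equivalent(a: str, b: str) -> bool:
    ca, cb = canonical_ascii(a), canonical_ascii(b)
    return ca is not None and ca == cb

print(canonical_ascii("B\u00dcCHER"))
print(equivalent("bu\u0308cher", "XN--BCHER-KVA"))
```

Comparing canonical non-ACE forms instead would work equally well; the
ASCII form is used here only because it is what goes on the wire in
protocols like DNS.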

The canonical form of a tagged label is always ASCII, because all
non-ASCII canonical tagged strings fail to qualify as internationalized
labels.

Joker labels are tagged labels that fail to satisfy the properties of
an XN pair.  For example, "aa--foo" has the wrong tag, "xn--3" fails
in the Punycode decoder, and "xn--aa---3ra" fails to have a non-tagged
partner (the would-be partner produced by the Punycode decoder is
"aa--" followed by U+00FC, which is non-ASCII but tagged).
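
The wrong-tag and decoder-failure cases can be checked with a short
classifier.  This is a sketch only: "canonical_form" is a hypothetical
stand-in (NFKC, case folding, and the ASCII rules), the Punycode steps
use the Python standard library's "punycode" codec, and the last test
label is built in code from a tagged non-ASCII string rather than
taken from any specification.

```python
import unicodedata
from typing import Optional

def canonical_form(s: str) -> Optional[str]:
    """Hypothetical stand-in for the canonicalization function (assumption)."""
    out = unicodedata.normalize("NFKC", unicodedata.normalize("NFKC", s).casefold())
    if any(ord(c) < 0x80 and not (c.isalnum() or c == "-") for c in out):
        return None                      # non-LDH ASCII code point
    if out.startswith("-") or out.endswith("-"):
        return None                      # leading or trailing hyphen-minus
    return out

def has_xn_partner(ace: str) -> bool:
    """True if canonical ASCII string `ace` is the ASCII member of an XN pair."""
    if not ace.startswith("xn--"):
        return False                     # wrong tag
    try:
        uni = ace[4:].encode("ascii").decode("punycode")
        return (not uni.isascii()                # partner must be non-ASCII
                and canonical_form(uni) == uni   # ... and canonical
                and uni[2:4] != "--"             # ... and non-tagged
                and uni.encode("punycode").decode("ascii") == ace[4:])
    except UnicodeError:
        return False                     # Punycode decoder (or encoder) failed

def classify(label: str) -> str:
    c = canonical_form(label)
    if c is None or c[2:4] != "--":
        return "not tagged"
    if not c.isascii() or len(c) > 63:
        return "not an internationalized label"
    return "ACE" if has_xn_partner(c) else "Joker"

# A label whose would-be partner is non-ASCII but tagged.
tagged_partner_case = "xn--" + "aa--\u00fc".encode("punycode").decode("ascii")
for label in ["aa--foo", "xn--3", tagged_partner_case, "xn--bcher-kva", "example"]:
    print(label, "->", classify(label))
```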

Since most of the definitions are in terms of canonical forms, it can be
instructive to categorize the set of canonical internationalized labels
as nested subsets:

    +-----------------------------------------+
    | canonical internationalized labels      |
    |                                         |
    |  +-----------------------------------+  |
    |  | canonical ASCII labels            |  |
    |  | all of which are                  |  |
    |  | canonical LDH labels              |  |
    |  |                                   |  |
    |  |  +-----------------------------+  |  |
    |  |  | canonical tagged labels     |  |  |
    |  |  |                             |  |  |
    |  |  |  +-----------------------+  |  |  |
    |  |  |  | canonical ACE labels  |  |  |  |
    |  |  |  +-----------------------+  |  |  |
    |  |  +-----------------------------+  |  |
    |  +-----------------------------------+  |
    +-----------------------------------------+

There is a one-to-one correspondence, defined by XN pairs, between the
outermost ring (the non-ASCII canonical internationalized labels) and
the innermost set (the canonical ACE labels).  The other two rings are
not involved in any XN pairs.  The canonical Joker labels are the third
ring (canonical tagged non-ACE labels).

It can also be instructive to see how internationalized labels relate
to the broader universe of domain labels, via another series of nested
subsets along a different axis:

    +------------------------------+
    | labels                       |
    |                              |
    |  +------------------------+  |
    |  | text labels            |  |
    |  |                        |  |
    |  |  +------------------+  |  |
    |  |  | ASCII labels     |  |  |
    |  |  |                  |  |  |
    |  |  |  +------------+  |  |  |
    |  |  |  | LDH labels |  |  |  |
    |  |  |  +------------+  |  |  |
    |  |  +------------------+  |  |
    |  +------------------------+  |
    +------------------------------+

IDNA supplies the second ring (non-ASCII text labels); before
IDNA, all text labels were ASCII.  The scope of IDNA is the set of
internationalized labels, which includes the LDH labels and the
non-ASCII text labels, but not the intervening ring of non-LDH ASCII
labels.

========

AMC