draft-klensin-idnabis-issues Comments

Kenneth Whistler kenw at sybase.com
Thu Dec 14 21:52:37 CET 2006


This morning, before getting back to the table issues,
I'm going to provide my feedback on the current (00.txt)
version of the IDNAbis issues document.

Details, section by section, below.

--Ken


2.1.1 Proposed label

Current text:

This string is typically produced by keyboard entry and converted
to the local character set by the keyboard driver software.
[[ followed by editor query about "keyboard driver" usage here ]]

Revise to:

This string is typically produced by keyboard entry.

Rationale:

First of all, as stated, this is incorrect and misleading.
"A string" is not "converted to the local character set"
by keyboard drivers. What happens is that combinations of
key press sequences are converted to scancodes, which are
delivered to the input handler of the OS. That, in turn,
interprets scancodes as events and routes character input
events separately from all other keyboard-related events;
another handler then accumulates them into string form,
*creating* them in the local character set used
by the OS. (Or various tweaks on that basic set of steps.)
But dealing with any of that detail, or even mentioning
the role of a keyboard driver is basically irrelevant
to *this* document. The main points to make are that
1. registrants typically submit strings that are created
by typing them in on a keyboard, and 2. those strings will
be created in whatever local character set their OS is
supporting (which *might* be Unicode, or might not be).

2.1.3 Permitted Character Identification

The terminology being used here to distinguish IDNA200x
as having an "inclusion-based approach" versus IDNA2003
as having an "exclusion-based approach" is, I think,
causing confusion among the participants here who are
trying to create the relevant table mentioned in
section 4 and under discussion for IDNA200X-Blocks.

*Logically* there is no difference here. Some Unicode
characters are permitted. Those characters are on a
logical inclusion list. Some Unicode characters are
not permitted. Those characters are on a logical
exclusion list.

The real difference here is that IDNA2003 took a permissive
stance on the issue and permitted large numbers of characters on input
that were of dubious value. And further, IDNA2003, through
the formulation of StringPrep, *presented* the logical
inclusion list (the complement of [Unicode - logical_exclusion_list])
as a process of permitting lots of things to start with,
but then excluding a small list A of bad things, and then
another small list B of bad things, and then mapping
another list C of bad things to nothing, and so on.
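To make the logical equivalence concrete, here is a toy sketch
(the repertoires and exclusion lists are purely hypothetical, just
to illustrate the point): the exclusion-style presentation -- start
with everything, then filter repeatedly -- defines exactly the same
set as a direct inclusion list.

```python
# Hypothetical toy repertoire and exclusion lists, for illustration
# only -- not the actual IDNA character sets.
unicode_repertoire = set("abcdef!@#")
bad_list_a = set("!@")   # first exclusion step (StringPrep-style)
bad_list_b = set("#")    # second exclusion step

# IDNA2003-style presentation: permit everything, then exclude.
exclusion_style = unicode_repertoire - bad_list_a - bad_list_b

# Direct presentation: just state the inclusion list.
inclusion_style = set("abcdef")

# The two presentations define the same logical set.
assert exclusion_style == inclusion_style
```

The difference is one of presentation and of how generous the
starting set is, not of logic.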

The consensus that "this exclusion-based model was a mistake"
is based on two factors, I believe. First of all, being
overly permissive was a mistake -- too many questionable
characters were allowed from the start. Second, the
expression of the process itself became confusing, because
to get to the logical inclusion list, you were presented
with steps to start with a larger set and then filter things
down by excluding from that list, conceptually as part
of StringPrep, instead of predigested as just the set
of allowed things.

What IDNA200X should be doing is presenting its model as
taking a *restrictive* stance on the issue. It starts with
a much more restricted set of characters on the logical
inclusion list, pre-vetted to be those most likely to
be useful and unproblematical as internet identifiers,
not requiring any visible filtering step in StringPrep
to subsequently exclude particular sets of characters
from an overly generous starting set, and minimizing the
amount of mapping going on in StringPrep.

Therefore, I suggest this section be rewritten along the
lines of:

========================================================

2.1.3 Permitted Character Identification

The Unicode string is examined to prohibit characters
that IDNA does not permit in input. IDNA200x uses a
simple, restrictive approach, providing a list of characters
that are permitted on input (see Section 4). This
contrasts with the less restrictive approach of IDNA2003.
Under IDNA2003, the list of characters excluded on input
is quite limited, because the model was to permit almost
all Unicode characters to be used as input, with many of
them subsequently mapped into other characters during
string preparation. There is now general consensus that
this less restrictive approach was a mistake and should
be replaced, in IDNA200x, by a system that lists only
those characters that are permitted and which does much
less mapping.

Under the proposed IDNA200x, the string in Unicode form
will be rejected if it contains characters that are not
on the list of characters acceptable as IDNA input.

========================================================

I think that is a much clearer way to distinguish the
approach in IDNA200x from IDNA2003, and avoids the
entire confusing terminology of "inclusion-based approach"
and "exclusion-based approach" that is interfering with
the work in simply defining the "list of characters
acceptable as IDNA input." Note that the *conclusion*
is identical here.
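The restrictive check described above amounts to a simple membership
test against the permitted list. A minimal sketch, using a
hypothetical permitted set (the real list is the subject of
[IDNA200X-Table], not this example):

```python
# Hypothetical permitted set, for illustration only -- the actual
# IDNA200x list is far larger and is defined elsewhere.
PERMITTED = set("abcdefghijklmnopqrstuvwxyz0123456789-")

def label_is_acceptable(label: str) -> bool:
    # Reject the label if any character is not on the inclusion list.
    return all(ch in PERMITTED for ch in label)

assert label_is_acceptable("example-1")
assert not label_is_acceptable("Example-1")  # 'E' not on the toy list
```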

2.1.4.1 Normalization

The reference "[Unicode-USX15]" should be revised slightly
to "[Unicode-UAX15]", because "UAX" is very widely
used (and normatively preferred by the UTC) as the
abbreviation for a Unicode Standard Annex. There really
isn't any good reason to try to abbreviate these
as "USX" in references at this point.

The term "the new Stable NFKC method" doesn't quite match
the terminology that will be used in UAX #15. Mark,
do you have a more precise specification of this,
so the wording here won't unnecessarily depart from
what UAX #15 will actually call this?
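For readers less familiar with UAX #15, the difference between
canonical (NFC) and compatibility (NFKC) normalization can be seen
with a ligature character:

```python
import unicodedata

# U+FB01 LATIN SMALL LIGATURE FI has only a compatibility
# decomposition, so NFC leaves it alone while NFKC maps it to "fi".
s = "\ufb01le"  # the string "file" written with the fi ligature
assert unicodedata.normalize("NFC", s) == s
assert unicodedata.normalize("NFKC", s) == "file"
```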

2.1.4.2 Case-folding

In the first sentence, "case-mapped" --> "case-folded",
to use the "folded" terminology consistently. Since
simply case mapping doesn't produce exactly the same
results as case-folding, it is best to stay consistent
here, I think.
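The distinction is not merely terminological: case folding and
simple case mapping produce different results for some characters.
The classic example is U+00DF LATIN SMALL LETTER SHARP S:

```python
# Case mapping (lower()) leaves the German sharp s alone, while
# case folding (casefold()) expands it to "ss".
s = "stra\u00dfe"  # "straße"
assert s.lower() == "stra\u00dfe"
assert s.casefold() == "strasse"
```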

2.2.3 Pre-Nameprep Validation and Character List Testing

"characters that fall into 'pending', 'possibly later',
or 'unassigned codepoint' categories in the inclusion tables
should probably not lead to label rejection at this point"

This text should be reworded, as it is assuming things about
the structure of categories in the inclusion table which
are not at all apparent should be there. In my opinion,
having categories of "pending" or "possibly later" in
the inclusion table is incoherent, misleading, and
confusing. An inclusion table for *this* set of
algorithmic processes should simply be that: an inclusion
table.

If there is text in a relevant internet draft suggesting that
some other scripts might possibly later be added to the
inclusion table (or some other characters), that's fine,
but I don't think the table per se should have such
categories, nor should *this* document depend on them.

For *this* document, I would suggest keeping this simple
here, and not trying to presage the exact table results.
Something like:

===========================================================

2.2.3. Pre-Nameprep Validation and Character List Testing

Again in parallel to the above, the Unicode string is
checked to verify that all characters that appear in it
are valid for IDNA input. This check may be more liberal
than that of Section 2.1.4; the presence of characters not
in the inclusion table might not lead to label rejection at
this point. Instead, the resolver MUST rely on the presence
or absence of labels containing such characters in the
DNS to determine their validity.

==============================================================

4. Permitted Characters: An inclusion list

This section also needs a substantial rewrite, as it
describes an approach to building an inclusion list that
doesn't make any sense. We all agree that we need an inclusion
list, but I don't see any point in having idnabis-issues
describing a process here which isn't actually working
to build that inclusion list. A suggested rewrite
to salvage the intent of this section follows:

==========================================================

4. Permitted Characters: An inclusion list

IDNA200x requires a new list of characters that are permitted
in IDNs. An initial version of such a list has been
developed by the contributors to this document
[IDNA200X-Table] [[<-- Note, reference change. Calling
this "Blocks" presupposes the structure of the table, and
is just incorrect.]] This was accomplished by going through
the repertoire of Unicode 5.0, extracting those characters
having character classes needed for Unicode identifiers,
and then systematically paring away characters not
clearly acceptable for IDNs. In addition, characters were
removed that would otherwise have been mapped away during subsequent
steps of string preparation, as during normalization and
case-folding.

[[ Incidentally, I think this generic description of the
table generation is plenty for here. Any further details
about *which* scripts are omitted, *which* classes
are not included, *what* other properties need to be
used in the derivation of the table, what particular
ranges were omitted for other reasons, and so on,
belongs in IDNA200X-Table, not here. ]]

In some cases, subsets of characters for particular
scripts may well need further study or input from the
relevant stakeholders, to determine their appropriateness
for the inclusion list. The discussion in [IDNA200X-BIDI]
illustrates areas in which more work and input is needed.
It is expected that such problems will be resolved quickly
and that questioned subsets of characters for particular
scripts will either be added to or removed from the list
of permitted characters.

A procedure for adding additional characters or scripts
to the inclusion list, either for scripts not in
the initial version of the table, or characters from
future versions of Unicode, will be developed as part of
this work. A key part of that procedure will be
specifications that, in fact, make it possible to add
new characters without long delays in implementation.
For example, by building the inclusion table primarily
on the basis of rules referencing Unicode character
properties, extensions to the table for future versions
of Unicode can be predictable and automatic. 

Also, it may be desirable to more strongly distinguish between
use of the protocols for "registration" (i.e. entering names
in the DNS) and "lookup" (queries to the DNS), with the
character inclusion list applied strictly at registration
time only and with clients generating queries relying on the
lookup process to return "not found" errors if characters
were invalid.

============================================================
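The property-driven derivation suggested in the rewrite above can be
sketched briefly. The category set here is an illustrative choice,
not the actual IDNA200X rule; the point is that a rule stated in
terms of Unicode character properties extends automatically to
newly assigned characters in future Unicode versions.

```python
import unicodedata

# Illustrative set of Unicode general categories to admit
# (lowercase letters, other letters, modifier letters, marks,
# decimal digits) -- an assumption for the sketch, not the
# actual IDNA200X derivation rule.
ALLOWED_CATEGORIES = {"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}

def derived_inclusion(codepoints):
    # Keep only codepoints whose general category is allowed.
    return {cp for cp in codepoints
            if unicodedata.category(chr(cp)) in ALLOWED_CATEGORIES}

# a..z are all category Ll, so all 26 survive the derivation.
table = derived_inclusion(range(0x61, 0x7B))
assert len(table) == 26
```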

8. The Ligature and Digraph Problem

This section is way too verbose. There is no need to once
again launch into a long explanation of the status of
"æ", or irrelevancies about the phonemic status of the
"ph" digraph in English.

In my opinion, all the buildup here should simply be
deleted, leaving the meat in the last two paragraphs, where the
disclaimer is made about those situations that IDNAbis
is not attempting to solve. Just describe "these situations"
with a couple of pithy, in-line examples.

I would also retitle this as "The Problem of Language- and
Context-specific Equivalences", since the issue is not
ligatures and digraphs, but rather the kinds of equivalences
between strings that are language, locale, and/or context-specific,
and which there is no hope of resolving by normalization
or StringPrep, and which registries should not be led to
expect will simply go away if they use this protocol.


