IDNAbis discussion style, mappings, and (incidentally) Eszett

Fri Nov 30 00:14:12 CET 2007

Hi.

I'd like to see if we can change the focus of some of this
discussion, and some related discussions that have occurred on
other lists, in the hope that it will help us move forward.  We
need to remember, somehow, that this whole process is about
tradeoffs.  No change can be made without costs and risks and
every change, no matter how desirable, has negative aspects.

I apologize for the length of this note.  Perhaps parts of it
should be a separate Internet-Draft or other document in the
long term.  But I think it is important for understanding where
we are and how (or if) we can proceed with this work.

With IDNs, there are many tradeoffs, probably more so than in
most other things the IETF considers.  When we reexamine IDNA
in the light of experience and (we hope) the improved
understanding gained over time, the tradeoffs include issues of
scope and procedure as well as technical issues.  They also,
obviously, require balancing the value of changes against the
value of absolute compatibility (both forward and backward)
with the earlier version. 

Accepting as many characters as possible and excluding only
those that are clearly harmful has obvious attractions
although, especially without mapping (NFKC and otherwise) it
also creates more opportunities for both confusion and
deliberate bad behavior and more risk of future
incompatibility.  
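
To make the NFKC point concrete, here is a minimal Python sketch
of how compatibility normalization collapses strings that look
different into one.  It is an illustration only, not text from
any of the drafts; the helper name nfkc() and the sample strings
are my own choices.

```python
import unicodedata

def nfkc(label: str) -> str:
    """Apply Unicode NFKC normalization to a candidate label."""
    return unicodedata.normalize("NFKC", label)

plain = "example"
# The same word spelled with fullwidth Latin letters (U+FF41..U+FF5A).
fullwidth = "".join(chr(0xFF41 + ord(c) - ord("a")) for c in plain)
# "file" spelled with the single-code-point "fi" ligature, U+FB01.
ligature = "\ufb01le"

# Without mapping, these are distinct strings, and so distinct labels.
distinct_without_mapping = len({plain, fullwidth, "file", ligature}) == 4

# With NFKC mapping, each compatibility variant collapses to its base.
collapsed = {nfkc(plain), nfkc(fullwidth)}
```

With the mapping there is one name where a user may see two;
without it there are two names that a user may take for one.
That is the tradeoff in miniature.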

Specifying more mapping in the protocol is a convenience for
registrants who would prefer that all conceivable variations on
their names be accepted. A registrant could even sensibly
believe that it would be desirable to automatically map all
possible transliterations and translations of his preferred
name into it as part of the protocol (the technical and
linguistic problems with that desire do not prevent people,
especially people with a relatively parochial view of the
importance of _their_ names, from wishing for that sort of
feature).  On the other hand, extensive mapping raises issues
of confusion or astonishment for users who see two things as
different that are being treated as the same, who believe in
reverse-mappings, or who are trying to informally compare a
pair of URIs.  

The observation that some mappings make perfectly good sense in
some cultures (or for some languages) that use a relevant
script but not for other uses of that script represents a
significant complication despite the relatively small number of
cases that can be easily identified today.  Telling a country or
culture "well, there aren't very many of you, so you lose" is a
fairly uncomfortable position to take.  So a different, and
equally extreme, position about mapping is that, if the Unicode
Consortium considered two characters sufficiently different to
assign them different code points, we should accept that
conclusion and not try to override it through mappings that
they specify but consider optional and that are dependent on
circumstances and application.

It is always possible to treat particular characters as
exceptions to whatever rules we make and to have special rules
for those characters, but it is difficult to figure out where
to stop doing that.  Do we permit special-case mapping rules
only when someone can claim dependency on the IDNA2003 rules?
If we do, then it is likely that arguments about lower levels
of the DNS will prevent any changes to IDNA2003 at all.  If we
restrict the special cases to a few well-understood issues in
Latin-based scripts (or European scripts), we may do long-term
violence to other scripts and characters.  

There is also a tendency for exception lists to create Unicode
version dependencies (or at least version sensitivity).
Perhaps more important, any exception list increases the
importance of getting everything right the first time (in both
our work and that of the Unicode Consortium).  

Eszett is an example of the fact that "need to get it right the
first time" rules can create a mess later.  Part of our problem
is that some people in German-speaking countries where it is
important in the orthography now argue that we got it wrong the
first time, while others, especially people from countries where
the orthography standards (quite independent of IDNs) call for
it, claim it should be mapped to "ss" more or less always.  Those who take
one position (and some others) argue that the mapping should be
preserved for compatibility.  Those who favor the other
position believe it was a mistake and artifact of case-mapping
in IDNA2003 and that, since IDNA200X removes case-mapping and
proposals continue to be pushed forward to assign a code point
to an upper-case form, the whole decision should be
reconsidered and Eszett treated as a normal character.  It
isn't at all clear to me how we resolve that conflict; I'd
certainly like to hear suggestions.
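
For anyone who wants to see the current behavior rather than
argue about it in the abstract, the following sketch uses
Python's built-in "idna" codec, which implements the IDNA2003
rules (RFC 3490 with Nameprep).  The helper name and the sample
word are mine, chosen only for illustration.

```python
def to_ace_2003(name: str) -> bytes:
    """Convert a name using Python's IDNA2003 ("idna") codec."""
    return name.encode("idna")

eszett_form = "stra\u00dfe"   # spelled with Eszett, U+00DF
ss_form = "strasse"           # spelled with "ss"

# Under IDNA2003 both spellings produce the same name on the wire.
same_on_the_wire = to_ace_2003(eszett_form) == to_ace_2003(ss_form)

# Ordinary string operations show where the mapping comes from:
folded = eszett_form.casefold()   # case folding turns U+00DF into "ss"
lowered = eszett_form.lower()     # lowercasing alone leaves it in place
```

The difference between folded and lowered is the "artifact of
case-mapping" argument in two lines: Eszett is already lower
case, and it becomes "ss" only because case folding, rather than
simple lowercasing, is applied.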

Eszett is clearly not the only example.  IDNA2003 contained its
own rules for parsing FQDNs into labels, essentially requiring
the mapping of a number of dots, and dot-like characters, into
periods before the parsing occurred.  In retrospect (and, for
me, only in retrospect because I thought it was a good idea at
the time), it was probably the worst decision we made.  Since
the list of characters that are mapped to period contains some
dot-like characters and not others, and cannot include those
that are introduced with later versions of Unicode, it creates
a version dependency.  Users have trouble understanding why
"their" dot is or is not mapped to period versus being treated
as a plain character or banned.  It causes violations of the
rule that systems that are not IDNA-aware must be able to
process FQDNs that contain IDN labels in ACE ("punycode") form
without any special knowledge.  It creates a strange sort of
IDN in which all of the labels are ASCII LDH in native form,
not ACE labels, but those labels are separated by these strange
dots (I believe the status of those names is a protocol
ambiguity).  As Martin mentioned, these strange dots were
considered sufficiently problematic that the IRI spec doesn't
provide for them.
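
A small sketch of the dot problem, in Python.  The four code
points are the label separators listed in RFC 3490, section 3.1;
the function names are mine, and U+06D4 (ARABIC FULL STOP) is
just one example of a dot-like character that is not on the
list.

```python
import re

# Label separators recognized by IDNA2003: U+002E, U+3002, U+FF0E, U+FF61.
IDNA2003_DOTS = "\u002e\u3002\uff0e\uff61"
_SPLIT_2003 = re.compile("[" + IDNA2003_DOTS + "]")

def split_idna2003(name: str) -> list[str]:
    """Split a name into labels the way an IDNA2003-aware parser would."""
    return _SPLIT_2003.split(name)

def split_rfc1035(name: str) -> list[str]:
    """Split a name the way a system that is not IDNA-aware would."""
    return name.split(".")

ideographic = "example\u3002com"   # uses IDEOGRAPHIC FULL STOP
arabic = "example\u06d4com"        # uses a dot that is not on the list

aware_view = split_idna2003(ideographic)     # two labels
unaware_view = split_rfc1035(ideographic)    # one label containing the dot
not_on_list = split_idna2003(arabic)         # one label, even when IDNA-aware
```

The same string is two labels to one parser and one label to
another, and a user whose dot is not on the list gets the second
behavior everywhere.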

So the draft IDNA200X documents take the dot-mapping provision
out, turning the parsing of all domain names, including those
that contain A-labels, back over to the rules of RFC 1034 and
1035 and the acceptance of special dots into a UI issue. To me,
the arguments for that choice are overwhelming.  But it is a
tradeoff against user-predictable behavior with scripts that
use non-ASCII dots and compatibility with existing non-protocol
text that represents IDNs using those dots: if applications
that map between such text and the IDNA protocol don't do the
right UI things with dots other than U+002E, bad things will
happen.  And, if we work the tradeoffs so that these types of
compatibility issues overwhelm the reasons why special dot
mapping was a bad idea, then we are stuck with the special dots
forever.  

Obviously, that example isn't precisely equivalent to the
Eszett one, since the dots are about label separation and
Eszett is a character and mapping issue.  However, to the
extent to which an important argument for preserving the Eszett
-> "ss" mapping as part of the protocol involves chunks of
non-protocol text in which the character might appear, the
relationship should be pretty obvious.   Again, this is all
about tradeoffs, not about one position being right or wrong in
an absolute sense.

If, instead of depending on lists of characters that get
special treatment, we rely primarily on rules based on
properties and attributes linked to whatever Unicode version
one might be using, we may, if we are careful about how things
are designed, be somewhat more amenable to adjustments as
point-errors are found and corrected and hence less dependent
on Unicode versions.
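
As a rough illustration of what "rules based on properties"
means in practice, here is a Python sketch.  The set of allowed
general categories is a toy rule I made up for this note; it is
not the rule in the IDNA200X drafts and it ignores hyphens,
context, and much else.

```python
import unicodedata

# Hypothetical rule: general categories accepted in a label.
# (Lowercase letters, other letters, modifier letters, marks, digits.)
ALLOWED_CATEGORIES = {"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}

def allowed_by_property(ch: str) -> bool:
    """Decide from the character's properties, not from a fixed list."""
    return unicodedata.category(ch) in ALLOWED_CATEGORIES

def classify(label: str) -> list[tuple[str, str, bool]]:
    """Return (character, general category, allowed?) for each character."""
    return [(ch, unicodedata.category(ch), allowed_by_property(ch))
            for ch in label]

# The answer for a newly assigned character follows from whatever
# Unicode tables the runtime carries; no table in this code changes.
runtime_unicode_version = unicodedata.unidata_version
```

The rule itself does not mention a Unicode version.  Only the
property tables underneath it do, which is the sense in which
such rules are less version-dependent than lists.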

But all of those are tradeoffs: it is perfectly rational to
argue that all of the IDNA2003 mappings should be preserved
even if it prevents us from moving to new versions of Unicode.
It is also rational to argue that we should preserve the
IDNA2003 rules (and Stringprep and Nameprep) for all characters
that appear in Unicode 3.2 and apply new rules only to new
characters, accepting the considerable added complexity
(including the need to keep a list of valid Unicode 3.2
codepoints in every application, since such a list is unlikely
to come out of character-handling libraries) as the price of
complete forward compatibility.  I happen to have a fairly
strong opinion about those two options, but I am all too aware
that there are other strong opinions and other ways to make the
tradeoffs.
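
The "keep IDNA2003 for Unicode 3.2, new rules for everything
else" option implies a test like the one sketched below.  Python
happens to ship a frozen Unicode 3.2 database
(unicodedata.ucd_3_2_0) for its IDNA2003 support, so I use it
here as a stand-in for the list an application would otherwise
have to carry itself.  The function names and the two rule-set
labels are hypothetical.

```python
import unicodedata

# Frozen Unicode 3.2 view of the character database.
ucd32 = unicodedata.ucd_3_2_0

def assigned_in_unicode_3_2(ch: str) -> bool:
    """True if the code point was assigned as of Unicode 3.2.

    "Cn" is the general category reported for unassigned code points.
    """
    return ucd32.category(ch) != "Cn"

def rule_set_for(ch: str) -> str:
    """Pick which (hypothetical) rule set would govern a character."""
    if assigned_in_unicode_3_2(ch):
        return "IDNA2003 rules"
    return "new rules"

old_char = "\u00df"   # Eszett, present in Unicode 3.2
new_char = "\u1e9e"   # a code point assigned after Unicode 3.2
```

Every character in every label has to go through a check of
this kind before any other rule can be applied, which is the
added complexity referred to above.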

A similar analysis applies to case mapping.  The answer to the
question of whether, if we had the DNS to do over from scratch
today, the case mapping for ASCII would be preserved is that
the question would at least cause an extended and probably
heated argument.  I suspect that anyone who has ever used a
U**x-derived system (or, more properly, a Multics-derived one)
understands most of the argument: case-sensitive identifiers
are sometimes really handy and sometimes a significant pain in
sensitive parts of the anatomy, especially in communicating
with systems that are case-insensitive.  And most of us have
understood, long ago, that, when all of the arguments are added
up, the conclusion as to whether systems should be
case-sensitive or case-insensitive in the ASCII range is
essentially a matter of religion. 

For the DNS (and probably for internationalization generally),
there is another piece of the argument, which is that the case
mapping for the Latin (and I do mean _Latin_, not extended
Latin, Latin-derived, or decorated Latin here) subset of
Unicode is absolutely, 100%, unambiguous.  It is approximately
as good for the Latin-derived superset of undecorated
alphabetic characters that appear in ISO 646BV and its clones.
So, regardless of one's religion about case dependencies, for
those characters, the case mapping is at least unambiguous, fully
reversible, does not require language or locale information for
_any_ characters, and, importantly, the characters are stored
in, and retrieved from, the DNS in their original case --
case-insensitivity is supported only in the matching rules, not
in what gets stored.
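
The "store the original, fold only when matching" behavior can
be sketched as follows.  This is a toy model in Python, not DNS
server code; the dictionary standing in for a zone and all of
the function names are assumptions made for the example.

```python
def ascii_fold(label: str) -> str:
    """Lowercase only A-Z; leave every other character untouched."""
    return "".join(
        chr(ord(c) + 32) if "A" <= c <= "Z" else c
        for c in label
    )

def labels_match(a: str, b: str) -> bool:
    """Compare two labels with ASCII-only case-insensitivity."""
    return ascii_fold(a) == ascii_fold(b)

# Toy "zone": folded form is the lookup key, original form is the value.
zone: dict[str, str] = {}

def register(label: str) -> None:
    """Store the label exactly as the registrant wrote it."""
    zone[ascii_fold(label)] = label

def lookup(query: str) -> "str | None":
    """Match case-insensitively, return the label in its stored case."""
    return zone.get(ascii_fold(query))
```

For example, after register("ExAmple"), lookup("EXAMPLE") gives
back "ExAmple": nothing about the original case was lost, and no
language or locale information was needed to do the match.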

In any event, if only because the case-distinguished strings
are stored in the DNS and retrieved by queries to it, it is far
too late to reopen the question of whether the original
decision was wise... at least within Class=IN.

Now the IDNA WG, responding to different complexities and
tradeoffs, including the desire to _not_ require DNS changes,
concluded that it was not possible to use server-side matching
rules to accommodate case.  Instead, the conclusion was that
there should be case-insensitivity (to parallel the behavior
with ASCII) and that it should be provided by pre-query and
pre-registration mapping.  That was a plausible decision (and
one that I supported).  But it causes some user confusion when
queries return "original case" for ASCII and "all lower case"
for non-ASCII labels, even when they mostly consist of ASCII
characters.  While they are few, there are also ambiguities in
which one character maps into another as a case-shift and
whether reversal works differently by language or locale.  That
creates a mess -- how large a mess depends on the perspective
of the beholder -- and led us to conclude that we should extend
the general "no more mappings in the protocol" principle to
case mappings, thereby making things less complex.  Do we think
that answer is without negative implications and consequences,
including causing problems with case-dependent label strings in
contexts where the DNS is not being used directly?  Of course
not.  Are we sure that eliminating case-mapping in the protocol
is the right answer after all of the tradeoffs are considered?
Again, certainly not.  We do think it is the best way to
resolve the tradeoffs, but we are still listening for
persuasive counterarguments or alternate proposals that don't
introduce even more problems.
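
Both halves of that complaint can be seen with a few lines of
Python.  The built-in "idna" codec implements IDNA2003, so a
round trip through it stands in for "register, then query"; the
helper name is mine.  The second half uses ordinary string case
operations on the Turkish dotted and dotless "i".

```python
def round_trip_2003(label: str) -> str:
    """Encode with the IDNA2003 codec, then decode the result."""
    return label.encode("idna").decode("idna")

# An all-ASCII label keeps its case; a label with one non-ASCII
# character comes back entirely in lower case.
ascii_back = round_trip_2003("Example")
mixed_back = round_trip_2003("B\u00fccher")

# Locale-free case mapping is not reversible for every character.
dotless_i = "\u0131"                          # LATIN SMALL LETTER DOTLESS I
after_up_and_down = dotless_i.upper().lower() # comes back as plain "i"

# Lowercasing capital I with dot above yields two code points.
dotted_capital_lowered = "\u0130".lower()
```

A user who registered "Bücher" sees "bücher" returned, while the
user next door who registered "Example" sees exactly that; and a
Turkish user's dotless "i" does not survive a trip through upper
case and back.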

Even the decision to try to move this work forward via an
open discussion but without a WG was based on careful
consideration of a tradeoff.  We know from experience with the
original IDN WG that WG discussions of this type of topic tend
to be extremely noisy, with a great deal of time spent going
over and over the pet ideas of various people with little
knowledge but strong opinions -- especially about language and
culture issues that don't fit well into the DNS as we know it.
There is potential for even more noise from people who do not
consider "the way the DNS works" and interoperation with it to
be a relevant constraint (or who don't even consider
understanding those issues to be relevant).  

We hoped that, by handling things with a small design team and
an open list, we could make more progress toward a better and
more balanced result than we could in an inherently-noisy WG.
But there are tradeoffs, including our having to listen to
people (none of them in this particular discussion, I hope) who
believe that any issue on which they don't get their way (even
before they express their opinions coherently) indicates a
conspiracy and who then use the absence of a WG to "prove" that
conspiracy exists.

The trends toward noise that led to the "no WG" decision are
still out there. It may amuse some readers of this list to hear
that some of us needed to spend significant time at IGF in Rio
discussing and defending the decision to not abandon IDNA
entirely.  The main suggested alternative was to move all DNS
internationalization work into an extended version of my old
"new class" proposal (for those with an interest in protocol
archeology, the last public version was
draft-klensin-i18n-newclass-02.txt, posted in June 2003).
Those who were pushing that idea in Rio seemed to favor using
not only a completely new DNS tree, new RR types, and new
matching rules but also wanted support for matching and
discrimination using information that would almost certainly
require new data formats for resource records.  If there were
no other reasons to avoid spending time on such a proposal
today (and there are many other reasons), the issues of
incompatibility, not just with IDNA but with the entire naming,
delegation, and administrative structure of the Class=IN DNS
boggle the mind.  

However, even that is another tradeoff and, as we understood
when the "new class" model was first suggested for discussion,
overturning the DNS structure associated with Class=IN and
starting over has a certain appeal, no matter how impractical.
Of course, "junk the DNS entirely and start over" has some
considerable appeal as well. There are sensible people who
would argue that the DNS architecture is sufficiently
mismatched to today's needs and expectations that the only
reason to not discard it and start over, if there are any
reasons at all, is the transition difficulty.

Another tradeoff that I hope we all understand is that we
maximize Internet interoperability by minimizing variations and
different ways of doing things.  Doing IDNs at all creates some
risks that don't exist without them.  However we implement
IDNs, they represent an attempt to balance improved mnemonic
value of names and improved accessibility to present and future
Internet users against risks to global interoperability.  Were
we to conclude that one of those poles was so important that we
should ignore the other, we might well end up with radically
different solutions.  

Doing IDNs with complicated procedures, including mappings and
exception lists, or trying to make IDNs language- and
culture-sensitive, rather than just being registrant-chosen
character strings, makes interoperability even harder and
riskier.   The IDNA200X drafts reflect many decisions that were
made on the basis of less complexity or more simplicity, but it
is possible that we went too far.

And so on.  

We have, I hope, clearly decided that IDNs are worth it, but,
even in the most minimal form, they constitute risks that we
need to understand and accept.  And, to the extent to which
IDNA2003 is in use, any change at all, including moving beyond
Unicode 3.2, implies some risks and transition inconvenience
that, again, we need to understand and accept if we are going
to move forward with those changes.  Or, of course, we can
consider those tradeoffs and decide that the best course of
action is to do nothing, accepting those consequences.

While this list could go on much longer and include many
more examples of tradeoffs, I believe that the above is
sufficient to illustrate the situation we are in and most of
the key issues.

So, the question is, how do we proceed?

We could decide that the decision to proceed without a WG was a
mistake and that we really need a WG, however noisy and slow
that might be.  That would clearly be the right outcome for
those who believe that the design team is engaged in a complex
conspiracy against their language, culture, business, registry,
or persons (I do not believe I've heard that position on this
list, but I've certainly heard it elsewhere).  The tradeoff/
danger is that such a WG could get very seriously bogged down.
We need to recognize that IDN deployment today represents an
infinitesimal fraction of what various actors have claimed will
happen as soon as some threshold is crossed (current popular
theories include getting internationalized email addresses
and/or beginning to deploy IDN TLDs, but there are others) and
that every month of delay creates more uses and applications
and a stronger argument that we cannot modify IDNA2003 at all.

If we are not going to make the decision that it is time to
stop and turn the effort over to a WG, then I have some
suggestions.  I don't know if my colleagues would agree, so
please take the suggestions as personal ones.

(1) Please make suggestions, ideally suggestions that show that
you have understood and considered the tradeoffs, not just
complaints.  Complaints are very hard to deal with, especially
when we understand the tradeoffs well enough to know that no
decision is going to make everyone completely happy (including
us).

(2) When you make those suggestions, expect to be challenged on
their side-effects and on what else would be damaged to give
you what you want.  If your note making the suggestion
considers those issues, we will save a lot of time.  For
reasons that are probably clear from the comments above, my
colleagues and I have a design bias against complexity and a
design bias against tables of exceptions.  Neither is a firm
rule, but, if the nature of your suggestion is such that you
believe that some particular issue is important enough to add
complexity or exception cases, you should assume that we will
push back in an attempt to find out how sure you are, whether
the complexity is really necessary, and whether others agree.
We may even agree with you, but pushing back is part of the job
we think we took on.

(3) Please don't assume in your notes that the other side of
the tradeoffs never occurred to us or that we have blown off
some position and its consequences without considering it.
There are almost certainly things that we have missed and we
want to know about them (as quickly and clearly as possible),
but we have been spending a lot of time on these issues for the
last several years, have gotten input (of varying quality) from
all over the world and in a variety of forums, and have been
trying very hard to listen.    I can't speak for my colleagues,
but I am a lot more likely to be able to respond quickly and
effectively to a note that suggests that we probably got the
tradeoffs wrong about some particular issue, explains why, and
proposes a solution that strikes a reasonable balance with the
other tradeoffs than I am to be able to deal with a note that
starts out on the assumption that we are insensitive idiots
who haven't even bothered to consider the obvious One True Way
of doing something.   I wish that distinction didn't exist and
try to avoid overreacting to it, but I will plead some vestiges
of humanity.

(4) Finally, please try to assume that we are acting in good
faith (even if, at some level, you don't believe it).  We are
much more likely to be able to respond in a useful way and to
participate in a dialog if we aren't first accused of having
some bizarre agenda.

    -john