IDNAbis discussion style, mappings, and (incidentally) Eszett

Fri Nov 30 04:00:42 CET 2007

Hi John,

Overall, I think the idnabis drafts are on a reasonable track. I
remain concerned, though, that the mapping spec (case mapping and
NFKC) is missing. It remains to be seen whether the major browser
developers will stop mapping characters found in domain names in HTML,
but if they continue to map them, I feel that it would be simpler if
they performed similar mapping for the post-Unicode-3.2 characters.
Such a spec could certainly be written, at least as an informative
document.
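
As a rough illustration of what such an informative mapping step
might look like, here is a minimal Python sketch. The function name
map_label is hypothetical, and the code relies on whatever Unicode
version the local Python build ships with rather than on the Unicode
3.2 tables that Nameprep is tied to, so it approximates the idea and
is not a Nameprep implementation.

```python
import unicodedata


def map_label(label: str) -> str:
    """Hypothetical pre-lookup mapping: case-fold, then NFKC-normalize.

    Uses the Unicode data bundled with this Python build, not the
    Unicode 3.2 tables used by IDNA2003's Nameprep.
    """
    return unicodedata.normalize("NFKC", label.casefold())


if __name__ == "__main__":
    samples = [
        "Example",
        "\uff25\uff38\uff21\uff2d\uff30\uff2c\uff25",  # fullwidth "EXAMPLE"
        "B\u00dcCHER",
    ]
    for raw in samples:
        print(repr(raw), "->", repr(map_label(raw)))
```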

In your email below, you refer to non-protocol text. Does this include
HTML? It might be nice to give HTML as an example if that is what you
mean.

I admire the design team's desire to keep things simple and to avoid
exceptions, but with tongue in cheek I point out that Patrik's document
does not really seem simple and appears to be a long list of
exceptions (or exceptional rules). :-)

Anyway, congrats on a job well done. It could not have been easy.

If it would be helpful, I could take a look at Google's data to see
whether any of the characters listed as NEVER or MAYBE NOT appear to
deserve a "higher" classification (e.g. MAYBE YES).

Erik

On Nov 29, 2007 3:14 PM, John C Klensin <klensin at jck.com> wrote:
> Hi.
>
> I'd like to see if we can change the focus of some of this
> discussion, and some related discussions that have occurred on
> other lists, in the hope that it will help us move forward.  We
> need to remember, somehow, that this whole process is about
> tradeoffs.  No change can be made without costs and risks and
> every change, no matter how desirable, has negative aspects.
>
> I apologize for the length of this note.  Perhaps parts of it
> should be a separate Internet-Draft or other document in the
> long term.  But I think it is important for understanding where
> we are and how (or if) we can proceed with this work.
>
> With IDNs, there are many tradeoffs, probably more so than in
> most other things the IETF considers.  When we reexamine IDNA
> in the light of experience and (we hope) the improved
> understanding gained over time, the tradeoffs include issues of
> scope and procedure as well as technical issues.  They also,
> obviously, require balancing the value of changes against the
> value of absolute compatibility (both forward and backward)
> with the earlier version.
>
> Accepting as many characters as possible and excluding only
> those that are clearly harmful has obvious attractions
> although, especially without mapping (NFKC and otherwise) it
> also creates more opportunities for both confusion and
> deliberate bad behavior and more risk of future
> incompatibility.
>
> Specifying more mapping in the protocol is a convenience for
> registrants who would prefer that all conceivable variations on
> their names be accepted. A registrant could even sensibly
> believe that it would be desirable to automatically map all
> possible transliterations and translations of his preferred
> name into it as part of the protocol (the technical and
> linguistic problems with that desire do not prevent people,
> especially people with a relatively parochial view of the
> importance of _their_ names, from wishing for that sort of
> feature).  On the other hand, extensive mapping raises issues
> of confusion or astonishment for users who see two things as
> different that are being treated as the same, who believe in
> reverse-mappings, or who are trying to informally compare a
> pair of URIs.
>
> The observation that some mappings make perfectly good sense in
> some cultures (or for some languages) that use a relevant
> script but not for other uses of that script represents a
> significant complication despite the relatively small number of
> cases that can be easily identified today.  Telling a country or
> culture "well, there aren't very many of you, so you lose" is a
> fairly uncomfortable position to take.  So a different, and
> equally extreme, position about mapping is that, if the Unicode
> Consortium considered two characters sufficiently different to
> assign them different code points, we should accept that
> conclusion and not try to override it through mappings that
> they specify but consider optional and that are dependent on
> circumstances and application.
>
> It is always possible to treat particular characters as
> exceptions to whatever rules we make and to have special rules
> for those characters, but it is difficult to figure out where
> to stop doing that.  Do we permit special-case mapping rules
> only when someone can claim dependency on the IDNA2003 rules?
> If we do, then it is likely that arguments about lower levels
> of the DNS will prevent any changes to IDNA2003 at all.  If we
> restrict the special cases to a few well-understood issues in
> Latin-based scripts (or European scripts), we may do long-term
> violence to other scripts and characters.
>
> There is also a tendency for exception lists to create Unicode
> version dependencies (or at least version sensitivity).
> Perhaps more important, any exception list increases the
> importance of getting everything right the first time (in both
> our work and that of the Unicode Consortium).
>
> Eszett is an example of the fact that "need to get it right the
> first time" rules can create a mess later.  Part of our problem
> is that some people in German-speaking countries where it is
> important in the orthography now argue that we got it wrong the
> first time, while others claim it should be mapped to "ss" more
> or less always, especially people from countries where the
> orthography standards (quite independent of IDNs) say so.  Those
> who take one position (and some others) argue that the mapping
> should be
> preserved for compatibility.  Those who favor the other
> position believe it was a mistake and artifact of case-mapping
> in IDNA2003 and that, since IDNA200X removes case-mapping and
> proposals continue to be pushed forward to assign a code point
> to an upper-case form, the whole decision should be
> reconsidered and Eszett treated as a normal character.  It
> isn't at all clear to me how we resolve that conflict; I'd
> certainly like to hear suggestions.
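
The two positions on Eszett can be made concrete with a small Python
sketch. The helper names are hypothetical. Python's str.casefold()
applies full Unicode case folding, which turns Eszett into "ss" much
as the IDNA2003 mapping does, while str.lower() leaves it as a
distinct character, which is closer to treating it as a normal
character.

```python
def casefold_label(label: str) -> str:
    """Full case folding: Eszett becomes "ss" (IDNA2003-like result)."""
    return label.casefold()


def lowercase_label(label: str) -> str:
    """Simple lowercasing: Eszett stays a separate character."""
    return label.lower()


if __name__ == "__main__":
    name = "stra\u00dfe"
    print(casefold_label(name))   # Eszett folded away
    print(lowercase_label(name))  # Eszett preserved
```

Under the first function, "stra\u00dfe" and "STRASSE" compare equal;
under the second they do not.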
>
> Eszett is clearly not the only example.  IDNA2003 contained its
> own rules for parsing FQDNs into labels, essentially requiring
> the mapping of a number of dots, and dot-like characters, into
> periods before the parsing occurred.  In retrospect (and, for
> me, only in retrospect because I thought it was a good idea at
> the time), it was probably the worst decision we made.  Since
> the list of characters that are mapped to period contains some
> dot-like characters and not others, and cannot include those
> that are introduced with later versions of Unicode, it creates
> a version dependency.  Users have trouble understanding why
> "their" dot is or is not mapped to period versus being treated
> as a plain character or banned.  It causes violations of the
> rule that systems that are not IDNA-aware must be able to
> process FQDNs that contain IDN labels in ACE ("punycode") form
> without any special knowledge.  It creates a strange sort of
> IDN in which all of the labels are ASCII LDH in native form,
> not ACE labels, but those labels are separated by these strange
> dots (I believe the status of those names is a protocol
> ambiguity).  As Martin mentioned, these strange dots were
> considered sufficiently problematic that the IRI spec doesn't
> provide for them.
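
For reference, the IDNA2003 dot handling described above can be
sketched in a few lines of Python. RFC 3490 lists four characters to
be recognized as label separators (U+002E, U+3002, U+FF0E, U+FF61);
the function name split_labels is hypothetical. The last example uses
U+06D4 ARABIC FULL STOP as one dot-like character that is not on the
list, so it stays inside the label.

```python
import re
from typing import List

# The four separators RFC 3490 says must be recognized as dots.
IDNA2003_DOTS = re.compile("[\u002e\u3002\uff0e\uff61]")


def split_labels(name: str) -> List[str]:
    """Split a domain name into labels using the IDNA2003 dot list."""
    return IDNA2003_DOTS.split(name)


if __name__ == "__main__":
    print(split_labels("example\u3002com"))
    print(split_labels("www\uff0eexample\uff61com"))
    print(split_labels("example\u06d4com"))  # not split: U+06D4 is not listed
```

Note that the first two names consist entirely of ASCII LDH labels,
yet a parser that only knows RFC 1034/1035 rules would see each of
them as a single label.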
>
> So the draft IDNA200X documents take the dot-mapping provision
> out, turning the parsing of all domain names, including those
> that contain A-labels, back over to the rules of RFC 1034 and
> 1035 and the acceptance of special dots into a UI issue. To me,
> the arguments for that choice are overwhelming.  But it is a
> tradeoff against user-predictable behavior with scripts that
> use non-ASCII dots and compatibility with existing non-protocol
> text that represents IDNs using those dots: if applications
> that map between such text and the IDNA protocol don't do the
> right UI things with dots other than U+002E, bad things will
> happen.  And, if we work the tradeoffs so that types of
> compatibility issues overwhelm the reasons why special dot
> mapping was a bad idea, then we are stuck with the special dots
> forever.
>
> Obviously, that example isn't precisely equivalent to the
> Eszett one, since the dots are about label separation and
> Eszett is a character and mapping issue.  However, to the
> extent to which an important argument for preserving the Eszett
> -> "ss" mapping as part of the protocol involves chunks of
> non-protocol text in which the character might appear, the
> relationship should be pretty obvious.   Again, this is all
> about tradeoffs, not about one position being right or wrong in
> an absolute sense.
>
> If, instead of depending on lists of characters that get
> special treatment, we rely primarily on rules based on
> properties and attributes linked to whatever Unicode version
> one might be using, we may, if we are careful about how things
> are designed, be somewhat more amenable to adjustments as
> point-errors are found and corrected and hence less dependent
> on Unicode versions.
>
> But all of those are tradeoffs: it is perfectly rational to
> argue that all of the IDNA2003 mappings should be preserved
> even if it prevents us from moving to new versions of Unicode.
> It is also rational to argue that we should preserve the
> IDNA2003 rules (and Stringprep and Nameprep) for all characters
> that appear in Unicode 3.2 and apply new rules only to new
> characters, accepting the considerable added complexity
> (including the need to keep a list of valid Unicode 3.2
> codepoints in every application, since such a list is unlikely
> to come out of character-handling libraries) as the price of
> complete forward compatibility.  I happen to have a fairly
> strong opinion about those two options, but I am all too aware
> that there are other strong opinions and other ways to make the
> tradeoffs.
>
> A similar analysis applies to case mapping.  The answer to the
> question of whether, if we had the DNS to do over from scratch
> today, the case mapping for ASCII would be preserved is that
> the question would at least cause an extended and probably
> heated argument.  I suspect that anyone who has ever used a
> U**x-derived system (or, more properly, a Multics-derived one)
> understands most of the argument: case-sensitive identifiers
> are sometimes really handy and sometimes a significant pain in
> sensitive parts of the anatomy, especially in communicating
> with systems that are case-insensitive.  And most of us have
> understood, long ago, that, when all of the arguments are added
> up, the conclusion as to whether systems should be
> case-sensitive or case-insensitive in the ASCII range is
> essentially a matter of religion.
>
> For the DNS (and probably for internationalization generally),
> there is another piece of the argument, which is that the case
> mapping for the Latin (and I do mean _Latin_, not extended
> Latin, Latin-derived, or decorated Latin here) subset of
> Unicode is absolutely, 100%, unambiguous.  It is approximately
> as good for the Latin-derived superset of undecorated
> alphabetic characters that appear in ISO 646BV and its clones.
> So, regardless of one's religion about case dependencies, for
> those characters, the case mapping is at least unambiguous, fully
> reversible, does not require language or locale information for
> _any_ characters, and, importantly, the characters are stored
> in, and retrieved from, the DNS in their original case --
> case-insensitivity is supported only in the matching rules, not
> in what gets stored.
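
The distinction between matching and storage can be sketched as
follows. This is a toy model in Python, not DNS server code: the
zone contents, the address, and the function names are all made up
for illustration. Only the bytes for ASCII A-Z are folded during
comparison, and the stored label is returned in its original case.

```python
from typing import List, Optional, Tuple


def ascii_lower(label: bytes) -> bytes:
    """Fold only ASCII A-Z to a-z; every other octet is left alone."""
    return bytes(b + 32 if 0x41 <= b <= 0x5A else b for b in label)


def labels_match(a: bytes, b: bytes) -> bool:
    """Case-insensitive comparison restricted to the ASCII range."""
    return ascii_lower(a) == ascii_lower(b)


# Hypothetical stored data: the label keeps the case it was entered in.
ZONE: List[Tuple[bytes, str]] = [(b"Example", "192.0.2.1")]


def lookup(query: bytes) -> Optional[Tuple[bytes, str]]:
    """Return the stored (label, value) pair whose label matches."""
    for stored, value in ZONE:
        if labels_match(stored, query):
            return stored, value
    return None


if __name__ == "__main__":
    print(lookup(b"EXAMPLE"))
    print(lookup(b"other"))
```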
>
> In any event, if only because the case-distinguished strings
> are stored in the DNS and retrieved by queries to it, it is far
> too late to reopen the question of whether the original
> decision was wise... at least within Class=IN.
>
> Now the IDNA WG, responding to different complexities and
> tradeoffs, including the desire to _not_ require DNS changes,
> concluded that it was not possible to use server-side matching
> rules to accommodate case.  Instead, the conclusion was that
> there should be case-insensitivity (to parallel the behavior
> with ASCII) and that it should be provided by pre-query and
> pre-registration mapping.  That was a plausible decision (and
> one that I supported).  But it causes some user confusion when
> queries return "original case" for ASCII and "all lower case"
> for non-ASCII labels, even when they mostly consist of ASCII
> characters.  While they are few, there are also ambiguities in
> which one character maps into another as a case-shift and
> whether reversal works differently by language or locale.  That
> creates a mess -- how large a mess depends on the perspective
> of the beholder -- and led us to conclude that we should extend
> the general "no more mappings in the protocol" principle to
> case mappings, thereby making things less complex.  Do we think
> that answer is without negative implications and consequences,
> including causing problems with case-dependent label strings in
> contexts where the DNS is not being used directly?  Of course
> not.  Are we sure that eliminating case-mapping in the protocol
> is the right answer after all of the tradeoffs are considered?
> Again, certainly not.  We do think it is the best way to
> resolve the tradeoffs, but we are still listening for
> persuasive counterarguments or alternate proposals that don't
> introduce even more problems.
>
> Even the decision to try to move this work forward via an
> open discussion but without a WG was based on careful
> consideration of a tradeoff.  We know from experience with the
> original IDN WG that WG discussions of this type of topic tend
> to be extremely noisy, with a great deal of time spent going
> over and over the pet ideas of various people with little
> knowledge but strong opinions -- especially about language and
> culture issues that don't fit well into the DNS as we know it.
> There is potential for even more noise from the views of people
> who do not consider "the way the DNS works" and interoperation
> with it to be a relevant constraint (or who don't even consider
> understanding those issues to be relevant).
>
> We hoped that, by handling things with a small design team and
> an open list, we could make more progress toward a better and
> more balanced result than we could in an inherently-noisy WG.
> But there are tradeoffs, including our having to listen to
> people (none of them in this particular discussion, I hope) who
> believe that any issue on which they don't get their way (even
> before they express their opinions coherently) indicates a
> conspiracy and who then use the absence of a WG to "prove" that
> the conspiracy exists.
>
> The trends toward noise that led to the "no WG" decision are
> still out there. It may amuse some readers of this list to hear
> that some of us needed to spend significant time at IGF in Rio
> discussing and defending the decision to not abandon IDNA
> entirely.  The main suggested alternative was to move all DNS
> internationalization work into an extended version of my old
> "new class" proposal (for those with an interest in protocol
> archeology, the last public version was
> draft-klensin-i18n-newclass-02.txt, posted in June 2003).
> Those who were pushing that idea in Rio seemed to favor using
> not only a completely new DNS tree, new RR types, and new
> matching rules but also wanted support for matching and
> discrimination using information that would almost certainly
> require new data formats for resource records.  If there were
> no other reasons to avoid spending time on such a proposal
> today (and there are many other reasons), the issues of
> incompatibility, not just with IDNA but with the entire naming,
> delegation, and administrative structure of the Class=IN DNS
> boggle the mind.
>
> However, even that is another tradeoff and, as we understood
> when the "new class" model was first suggested for discussion,
> overturning the DNS structure associated with Class=IN and
> starting over has a certain appeal, no matter how impractical.
> Of course, "junk the DNS entirely and start over" has some
> considerable appeal as well. There are sensible people who
> would argue that the DNS architecture is sufficiently
> mismatched to today's needs and expectations that the only
> reason to not discard it and start over, if there are any
> reasons at all, is the transition difficulty.
>
> Another tradeoff that I hope we all understand is that we
> maximize Internet interoperability by minimizing variations and
> different ways of doing things.  Doing IDNs at all creates some
> risks that don't exist without them.  However we implement
> IDNs, they represent an attempt to balance improved mnemonic
> value of names and improved accessibility to present and future
> Internet users against risks to global interoperability.  Were
> we to conclude that one of those poles was so important that we
> should ignore the other, we might well end up with radically
> different solutions.
>
> Doing IDNs with complicated procedures, including mappings and
> exception lists, or trying to make IDNs language- and
> culture-sensitive, rather than just being registrant-chosen
> character strings, makes interoperability even harder and
> riskier.   The IDNA200X drafts reflect many decisions that were
> made on the basis of less complexity or more simplicity, but it
> is possible that we went too far.
>
> And so on.
>
> We have, I hope, clearly decided that IDNs are worth it, but,
> even in the most minimal form, they constitute risks that we
> need to understand and accept.  And, to the extent to which
> IDNA2003 is in use, any change at all, including moving beyond
> Unicode 3.2, implies some risks and transition inconvenience
> that, again, we need to understand and accept if we are going
> to move forward with those changes.  Or, of course, we can
> consider those tradeoffs and decide that the best course of
> action is to do nothing, accepting those consequences.
>
> While this list could go on much longer and include many
> more examples of tradeoffs, I believe that the above is
> sufficient to illustrate the situation we are in and most of
> the key issues.
>
> So, the question is, how do we proceed?
>
> We could decide that the decision to proceed without a WG was a
> mistake and that we really need a WG, however noisy and slow
> that might be.  That would clearly be the right outcome for
> those who believe that the design team is engaged in a complex
> conspiracy against their language, culture, business, registry,
> or persons (I do not believe I've heard that position on this
> list, but I've certainly heard it elsewhere).  The tradeoff/
> danger is that such a WG could get very seriously bogged down.
> We need to recognize that IDN deployment today represents an
> infinitesimal fraction of what various actors have claimed will
> happen as soon as some threshold is crossed (current popular
> theories include getting internationalized email addresses
> and/or beginning to deploy IDN TLDs, but there are others) and
> that every month of delay creates more uses and applications
> and a stronger argument that we cannot modify IDNA2003 at all.
>
> If we are not going to make the decision that it is time to
> stop and turn the effort over to a WG, then I have some
> suggestions.  I don't know if my colleagues would agree, so
> please take the suggestions as personal ones.
>
> (1) Please make suggestions, ideally suggestions that show that
> you have understood and considered the tradeoffs, not just
> complaints.  Complaints are very hard to deal with, especially
> when we understand the tradeoffs well enough to know that no
> decision is going to make everyone completely happy (including
> us).
>
> (2) When you make those suggestions, expect to be challenged on
> their side-effects and on what else would be damaged to give
> you what you want.  If your note making the suggestion
> considers those issues, we will save a lot of time.  For
> reasons that are probably clear from the comments above, my
> colleagues and I have a design bias against complexity and a
> design bias against tables of exceptions.  Neither is a firm
> rule, but, if the nature of your suggestion is such that you
> believe that some particular issue is important enough to add
> complexity or exception cases, you should assume that we will
> push back in an attempt to find out how sure you are, whether
> the complexity is really necessary, and whether others agree.
> We may even agree with you, but pushing back is part of the job
> we think we took on.
>
> (3) Please don't assume in your notes that the other side of
> the tradeoffs never occurred to us or that we have blown off
> some position and its consequences without considering it.
> There are almost certainly things that we have missed and we
> want to know about them (as quickly and clearly as possible),
> but we have been spending a lot of time on these issues for the
> last several years, have gotten input (of varying quality) from
> all over the world and in a variety of forums, and have been
> trying very hard to listen.    I can't speak for my colleagues,
> but I am a lot more likely to be able to respond quickly and
> effectively to a note that suggests that we probably got the
> tradeoffs wrong about some particular issue, explains why, and
> proposes a solution that strikes a reasonable balance with the
> other tradeoffs than I am to be able to deal with a note that
> starts out on the assumption that we are insensitive idiots
> that haven't even bothered to consider the obvious One True Way
> of doing something.   I wish that distinction didn't exist and
> try to avoid overreacting to it, but I will plead some vestiges
> of humanity.
>
> (4) Finally, please try to assume that we are acting in good
> faith (even if, at some level, you don't believe it).  We are
> much more likely to be able to respond in a useful way and to
> participate in a dialog if we aren't first accused of having
> some bizarre agenda.
>
>     -john
>
>
> _______________________________________________
> Idna-update mailing list
> Idna-update at alvestrand.no
> http://www.alvestrand.no/mailman/listinfo/idna-update
>