The Two Lookups Approach (was Re: Parsing the issues and finding a middle ground -- another attempt)

John C Klensin klensin at jck.com
Fri Mar 6 17:41:03 CET 2009



--On Friday, March 06, 2009 15:03 +0100 "Marcos Sanz/Denic"
<sanz at denic.de> wrote:

>...
> Summary: Different replies to the two lookups is neither a
> sufficient nor  a necessary condition for a "security
> situation", the mechanism produces  plenty of false
> positives/negatives, which by themselves, would be very 
> difficult to debug. We don't want to go down that path.

Marcos,

FWIW, I mostly agree.  I'm just trying to summarize and record
what I'm hearing in the hope of creating a focus for a
discussion that could move us forward.

If we had to do two lookups, I would personally favor it only as
a fallback, i.e., the second lookup is not done at all unless
the first one returns an indication of complete failure to find
anything at that node.  Even that would not address all of the
problems you identify, but it would go a long way.
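
To make the fallback idea concrete, here is a rough sketch of the control flow, not a proposal for specification text. Everything named in it is an assumption made for illustration: `lookup_with_fallback`, the `resolve` callable (which returns None to stand for "nothing found at that node"), and `encode_standin_new`, which is only a per-label Punycode stand-in and not a real IDNA2008 implementation. The IDNA2003 side uses Python's built-in "idna" codec.

```python
from typing import Callable, Optional, Tuple

Encoder = Callable[[str], bytes]
Resolver = Callable[[bytes], Optional[str]]


def encode_idna2003(name: str) -> bytes:
    # Python's built-in "idna" codec follows IDNA2003 (Nameprep mapping).
    return name.encode("idna")


def encode_standin_new(name: str) -> bytes:
    # Hypothetical stand-in for a newer encoder: Punycode per label with no
    # mapping step. NOT a real IDNA2008 implementation.
    labels = []
    for label in name.split("."):
        if label.isascii():
            labels.append(label.encode("ascii"))
        else:
            labels.append(b"xn--" + label.encode("punycode"))
    return b".".join(labels)


def lookup_with_fallback(
    name: str,
    encode_new: Encoder,
    resolve: Resolver,
    encode_old: Encoder = encode_idna2003,
) -> Tuple[Optional[str], str]:
    """Try the new-rules form first; try the old form only on complete failure."""
    new_ace = encode_new(name)
    answer = resolve(new_ace)
    if answer is not None:
        return answer, "new"      # first lookup succeeded, no second lookup
    old_ace = encode_old(name)
    if old_ace == new_ace:
        return None, "none"       # both rule sets agree, nothing more to try
    answer = resolve(old_ace)
    if answer is not None:
        return answer, "old"
    return None, "none"


# Toy "zone" that holds only the IDNA2003 form of an Eszett name.
zone = {encode_idna2003("straße.example"): "192.0.2.1"}
result = lookup_with_fallback("straße.example", encode_standin_new, zone.get)
```

With that toy zone, the first lookup finds nothing and the second one answers, so the caller can tell which form matched. Note that the sketch says nothing about what to do when both forms exist and give different answers, which is the case the quoted summary worries about.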

However, I think we also need to understand that the alternative
to some transition difficulties --whether a two-lookup plan, or
exceptional care on the part of registries, or something else--
is "IDNA2003 forever".   Not "IDNAv2" followed by "IDNAv3" and
so on, but really IDNA2003 forever, Unicode 3.2 restrictions and
all.  Possibly one could permit new scripts, but not new (valid)
characters for any existing script, but I don't believe that
there are properties to help with that -- we'd be further into
the character-by-character decision-making problem than we have
ever been (unless one counts our current four high-attention
cases).  The problem is that one really cannot add characters
without creating the potential for disruption.

The example I heard about last week is illustrative and was, to
me, very interesting.  As I understand it, under Unicode 5.0 and
earlier, Malayalam could not be written without ZWJ and ZWNJ --
character1+character2, character1+ZWJ+character2, and
character1+ZWNJ+character2 would produce three separate glyph
sequences as seen by the user.    Assume that a registry decided
to permit registrations and to accept whatever damage or
restrictions resulted from having these three sequences map
together (note that decision would raise all of the "how does
one control the display" issues that the list was discussing a
week ago).
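
As a small illustration of "map together", the sketch below uses Python's built-in "idna" codec, which follows IDNA2003 and its Nameprep step (ZWJ and ZWNJ are in the "mapped to nothing" table there). The particular Malayalam code points are assumptions picked for illustration only, not necessarily the sequences from the example I was told about, and the helper name `ace_2003` is made up.

```python
ZWJ = "\u200d"       # ZERO WIDTH JOINER
ZWNJ = "\u200c"      # ZERO WIDTH NON-JOINER

# Illustrative picks (assumptions, not the exact case discussed):
CHAR1 = "\u0d28\u0d4d"   # MALAYALAM LETTER NA + SIGN VIRAMA
CHAR2 = "\u0d15"         # MALAYALAM LETTER KA
CHILLU_N = "\u0d7b"      # MALAYALAM LETTER CHILLU N, added in Unicode 5.1


def ace_2003(label: str) -> bytes:
    # The built-in "idna" codec applies Nameprep before Punycode encoding.
    return label.encode("idna")


plain = ace_2003(CHAR1 + CHAR2)
with_zwj = ace_2003(CHAR1 + ZWJ + CHAR2)
with_zwnj = ace_2003(CHAR1 + ZWNJ + CHAR2)

# Nameprep drops the joiners, so the three spellings share one ACE form.
all_three_collapse = plain == with_zwj == with_zwnj

# A spelling that uses the newer chillu character encodes differently, and
# nothing in the ACE form ties it back to any of the three older spellings.
with_chillu = ace_2003(CHILLU_N + CHAR2)
chillu_is_distinct = with_chillu != plain
```

Once the three spellings have collapsed into one ACE string, the ACE string alone cannot say which of them the registrant or the user had in mind.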

Now Unicode 5.1 comes along with new Malayalam Chillu characters
that, at least in the opinion of some including Unicode (see
http://www.unicode.org/versions/Unicode5.1.0/#Malayalam_Chillu_Characters),
eliminate the need to use ZWJ or ZWNJ at all.  Ideally, the
registry would like to support 
  character1+character2
  character3
  character4

But, when seen in an older document in ACE form, or translated
back from ACE form, there is no way to distinguish which one of
the three treatments immediately above

  character1+character2 

was intended to represent.  In particular, if one sees
character3 in a file or typed by a user, one has to wrestle with
whether "character1+character2" should be looked up because one
doesn't know whether character3 appeared because the user
intended it instead of character1+ZWJ+character2 or because
something new and updated happened between the keyboard and the
web browser.  

Note that also implies that users with older versions of
operating systems may type the sequence that produces
character1+ZWJ+character2 and get that sequence into the IRI
while, if the same user goes to a machine with a newer system
and types the same sequence, she will get character3 in the IRI.

The problem is more subtle and complex than the Eszett one, but
really no different.

It is worth stressing that the occurrence of this sort of
problem does not depend on IDNA2008.  Paul's IDNAv2 proposal
would cause it equally well, as would anything else that
provides a change from Unicode 3.2 to Unicode 5.1 and, more
generally, most or all future changes to Unicode that add new
characters to existing scripts to improve the way in which those
scripts can be expressed.

The only way to ensure complete compatibility across the board
is to stick with IDNA2003 and Unicode 3.2 forever.  Presumably,
that would imply telling anyone with a script or set of
important characters added in the last five years or in the
future that they just lose.  I'd be very unhappy about doing
that, but maybe the WG feels differently.

In principle, we could also use exceptions (in IDNA2008) or
omissions from Stringprep (in IDNAvX) to forbid the use of some
or all newly-added characters for existing scripts but, because
many of those characters were added to cure perceived omissions
(the Malayalam case is an example of that), the idea is not much
more attractive than just freezing at IDNA2003.

But, if we do permit Unicode upgrades that add characters to
scripts (by any mechanism in terms of IDNA specifications), we
are going to be stuck with some difficult transition issues.   I
believe that our problem is to find the transition mechanism (or
set of mechanisms) that stink least.  What I've learned from the
discussions of the last year is that finding one that is
completely odor-free is extremely unlikely.  

From my perspective, "lookup once following IDNA2008 rules and
fall over to an IDNA2003 lookup if nothing is found" stinks less
than "two lookups always", partially for the reasons you
identified.  I don't especially like it but, again, I think we
need to search for "least bad", not "wonderful and perfect",
because I don't think the latter is likely to exist.

regards,
      john


If we are going to permit updates and the addition of 






More information about the Idna-update mailing list