Interoperability

John C Klensin klensin at jck.com
Mon Jul 28 06:41:40 CEST 2008



--On Thursday, 24 July, 2008 10:38 -0700 Mark Davis
<mark.davis at icu-project.org> wrote:

> One thing that I hope we have a chance to discuss in Dublin is
> interoperability.
> 
> IDNA2008 is actually much more lenient than IDNA2003, because
> it allows arbitrary local mappings. Suppose you have any of
> the following in an email message, for example.
>...

Mark,

While I'm almost certain we are still not in complete agreement,
this note from you led to something of an insight for me.   I'm
going to try to explain it here and hope we can discuss it
further today and tomorrow.   Wrestling with how to explain what
I hope is a middle ground (and the question of whether it is a
middle ground acceptable to anyone) is one of the reasons I
haven't gotten a new version of Rationale out.

We apparently start from rather different assumptions about
implementer and user behavior, the effectiveness of standards,
and things that constrain that behavior.  We definitely make
different assumptions about the importance of trying to preserve
absolute backward compatibility in a rapidly growing Internet.
I think those differences in assumptions arise from experience
with different communities as well as differences in personality
and the like.

As I read your note (and I would encourage anyone on the list
who hasn't done so, and done so carefully, to fix that), I come
away with the feeling that you assume that, in the absence of
very specific guidance and instructions, implementers are likely
to run wild and do things that we would agree are crazy.   My
experience has been that most Internet protocol and applications
implementers are fairly conservative and, conversely, that the
minority who are inclined to run wild will do so regardless of
what standards documents say.  If only because IDNs overlap
between the areas of your greatest experience and areas of mine,
I can't even guess how to estimate whether your experience or
mine is likely to be the better predictor.

I also have seen instances of behavior that would fit into the
"local mapping" framework with existing implementations.  The
fact that they violate IDNA2003 simply loses out to things that
implementers believe they need to do to provide acceptable user
experiences.  Those observations, and a desire to provide a
context for, and limits to, behavior that was going on anyway
and that will almost certainly continue, dominate the way the
text that is now in the documents was written.  In retrospect,
that was probably a mistake and one that should (and will) be
fixed.

Another important difference, which we have discussed before, is
that I feel very strongly that we need to move toward a
standard URI that allows as few variations as possible.  As with
local mappings, there will always be variations in violation of
the standard.   One example of that, which appears in your
statistics, is that we see a lot of URLs (not conformant IRIs)
around the world that contain IDNs and/or protocol tails with
non-ASCII characters.   In their own environments, those things
clearly work most of the time or people wouldn't use them.    In
other environments, sometimes they will work and sometimes they
won't.  And, of course, if Google indexes those protocol-invalid
links, the pointers returned by the search will sometimes work
and sometimes won't.   I've encountered that in actual examples
(not just test cases) in which Google returns a link that will
not resolve with my browser configuration but will with other
ones (i.e., the pages haven't gone away) and that can be
retrieved from your cache.  So, to some extent, your nightmare
scenario is already with us, with no help from the IDNA2008
proposals.  And, from my point of view, while I don't expect we
can ever eliminate that problem, the best way to make progress
on it is to narrow down the definition of valid IDN-containing
URIs and IRIs, eliminate protocol-required mappings, and then
treat whatever mappings are left as transition mechanisms from
IDNA2003 documents and adjustments for local circumstances.
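
To make the "URLs (not conformant IRIs)" case above concrete, here is
a minimal sketch that flags links whose host or protocol tail contains
non-ASCII characters.  The function name non_ascii_parts and the sample
URLs are assumptions made up for illustration; nothing here is taken
from the IDNA documents, and it checks only for non-ASCII characters,
not for validity under any particular specification.

```python
from urllib.parse import urlsplit


def non_ascii_parts(url: str) -> list[str]:
    """Return which parts of a URL contain non-ASCII characters.

    Hypothetical helper: "host" means the domain name, "tail" means
    everything after it (path, query, and fragment).
    """
    parts = urlsplit(url)
    found = []
    if parts.hostname and not parts.hostname.isascii():
        found.append("host")
    tail = parts.path + parts.query + parts.fragment
    if not tail.isascii():
        found.append("tail")
    return found


# An all-ASCII link, including one whose label is already in Punycode form.
print(non_ascii_parts("http://xn--r8jz45g.jp/index.html"))
# A link with raw non-ASCII characters in both the host and the tail.
print(non_ascii_parts("http://\u4f8b\u3048.jp/\u30d1\u30b9"))
```

A link that reports "host" or "tail" is the kind that may resolve in
one browser configuration and fail in another.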

In any event, I see the flexibility about local mappings to be
important to provide for the following (and _only_ the
following):

	(1) Situations where commonly-used input methods
	generate compatibility characters rather than (or too
	easily in addition to) the associated base characters.
	The obvious example of this is the use of full-width and
	half-width characters in East Asian scripts.  But it is
	really a function of a CCS coding idiosyncrasy and the
	input method mechanisms chosen, not fundamental
	properties of the abstract characters (e.g., with the
	understanding that I believe Unicode and several other
	CCSs got this right, if full-width characters had been
	coded by use of a presentation-affecting combining
	character following the base character rather than as
	different base characters, this discussion would be
	different).  By contrast, if I were to try to enter CJK
	characters or Romaji, I'd either be picking things from
	tables or simply using ASCII.  Giving me automatic
	mapping to or from forms that I'd have trouble entering
	is just an invitation to trouble.
	
	(2) Allowing, where absolutely necessary, for continued
	compatibility with some of the variations supported by
	IDNA2003 that are used in practice.   We know, for
	example, that some of the mathematical characters, mixed
	case arrangements, and digit variations (e.g.,
	superscripts or subscripts) are used in domain names
	displayed to users, presumably because some page
	designer or marketing expert decided those forms would
	provide more effective or distinctive presentations.
	For many environments, I believe that these mappings
	should be used only transitionally, i.e., in ways that
	encourage people to better understand that URLs are
	protocol-related mechanisms rather than presentation
	ones and that there are better ways to handle
	presentation.   For others, the answers are different.
	For example, I think Google should continue to use most
	IDNA2003-compatible mappings forever for indexing old
	documents and for some time for newer ones, but should
	then put only IDNA2008-compatible links in what is
	returned to users, no matter what was actually found.
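
The two cases above can be illustrated with a rough sketch.  It
assumes that a local mapping is approximated by Unicode NFKC
normalization followed by case folding, which is close in spirit to
the IDNA2003 mapping step but is not the exact Nameprep profile.  The
names local_map and needs_mapping are hypothetical.

```python
import unicodedata


def local_map(label: str) -> str:
    """Approximate a local mapping: NFKC normalization, then case folding.

    This is an assumption for illustration, not the Nameprep algorithm.
    """
    return unicodedata.normalize("NFKC", label).casefold()


def needs_mapping(label: str) -> bool:
    """True if the label changes under the approximate mapping."""
    return local_map(label) != label


# Case (1): full-width Latin letters as produced by an East Asian
# input method (U+FF45 U+FF58 U+FF41 U+FF4D U+FF50 U+FF4C U+FF45).
fullwidth = "\uff45\uff58\uff41\uff4d\uff50\uff4c\uff45"
print(local_map(fullwidth))

# Case (2): mixed case plus a superscript digit (U+00B2) chosen for
# presentation.
print(local_map("Example\u00b2"))

# A label that is already in its mapped form is left alone.
print(needs_mapping("example"))
```

An application following the narrow view above would apply such a
mapping only at the point of input or when reading old documents, and
would emit only the mapped form.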

That very narrow view does not come across in the document text
as it now exists, partially because I believe it is pointless to
tell people what they must or must not do in areas where it is
certain that we would often be ignored.  If we can get close, or
at least closer, to agreement on that being what those
provisions are about, then we should work on adjusting the text.

      john




