Unicode position on local mapping

Wed Feb 18 23:16:34 CET 2009

--On Wednesday, February 18, 2009 16:09 -0500 Eric
Brunner-Williams <ebw at abenaki.wabanaki.net> wrote:

> Comments interlinear.
>...
>> A) Registration local mapping.
>> 
>> You request a registration of "Å.com", and the registry
>> converts that  to "aa.com <http://aa.com>" behind your back.
>> They are allowed to do  so by:
>> 
>> http://tools.ietf.org/html/draft-ietf-idnabis-protocol-08#section-4.2
> 
> I'm sorry but I understand 4.2 to be referring to "local
> implementation  choice", not to registry actions, whether bugs
> there, or policy there.  Unless you're discussing a policy of
> a registry (or registrar for that  matter) or a bug in the
> registry (or registrar for that matter) system,  which amounts
> to policy anyway.

My understanding/intent of 4.2 is consistent with yours, Eric.
However, given a number of recent comments about problems with
registration-side mapping and the fact that the second paragraph
of 4.2 seems to be continuing to cause confusion, I propose to
remove that paragraph entirely unless people object.

>> B) Registry local mapping.
>> 
>> You request a registration of "å.com <http://xn--5ca.com>"
>> and the  registry gives you both "å.com
>> <http://xn--5ca.com>" and "aa.com  <http://aa.com>"
>> (bundling).
>> 
>> The draft is silent on this aspect.
> 
> Back in the '03 work I discussed the Abenaki equivalence class
> of {8, w,  ou, and U+0222, U+0223}, and of course, the
> CDNC/JET proposed an SC/TC  equivalence class, but an
> overlooked subtlety of that proposal was the  proposed
> inter-registry cooperation over the set of registries then 
> offering SC or TC.

Indeed.   As discussed in my response to Jefsey, one of the
reasons those equivalence classes didn't go into the protocol as
mappings (at least the CJK ones -- I don't understand the
Abenaki proposal well enough to have an opinion) is that
language knowledge was required and the mappings were not
unambiguously one-to-one.

> The industry seems to have invented "bundling" as a generic 
> non-description of what might be a persistent zone-local
> mapping, or  temporary marketing campaign by a registration
> channel (write access to  the registry db), that is, a
> "registry service" or a "registrar  service", in the ICANN
> gTLD registrar and registry nomenclatures.

As Cary has pointed out, there are legitimate uses for
variations on the JET "variant" scheme as a registry activity
and ideally a common registry activity among registries who are
accepting registrations in the same script or script family.  I
agree with him that we are likely to see more of that in the
future and that it will be a good thing in many cases.  I also
agree with you that the terminology is easily abused (through
either misunderstanding or malice) for marketing or political
purposes, that it has been abused, and that it will probably be
abused in the future.

> Why should the draft have anything on either channel-scoped,
> or  registry-scoped, or multi-registry-scoped equivalence
> classes, or the  mechanism(s) used to implement this local
> scope?

Protocol does not, although there is a narrative/informative
comment in Rationale.  I believe that is the right balance and
hope that others agree (and that, if they do not, they will
speak up soon).

> Just to make something obvious, a registrar could "bundle"
> both  "å.example <http://xn--5ca.com>" and "aa.example
> <http://aa.com>", not  just a registry.

As long as race conditions in registrar-registry protocols did
not cause a disconnect.   Registries that wanted to encourage
registrar bundling would presumably want to be sure that there
were mechanisms to cope with the race conditions.  But I see no
reason for _any_ of our documents to get into that discussion
(it perhaps belongs in an update to RFC 4290, but I don't know
of any plans to update that document).

>> C) Client local mapping.
>> 
>> The web page you're looking at contains <a href="Å.com">,
>> and your  browser sends you to "aa.com <http://aa.com>"
>> instead of "å.com  <http://xn--5ca.com>", behind your back.
>> 
>> In the above messages I'm using the characters:
>> U+00C5 ( Å ) LATIN CAPITAL LETTER A WITH RING ABOVE
>> <http://unicode.org/cldr/utility/character.jsp?a=00C5>
>> U+00E5 ( å ) LATIN SMALL LETTER A WITH RING ABOVE
>> <http://unicode.org/cldr/utility/character.jsp?a=00E5>
>> 
>> They are allowed to do so by:
>> 
>> http://tools.ietf.org/html/draft-ietf-idnabis-protocol-08#section-5.3
> 
> I'm sorry but I understand 5.3 to be referring to
> "preprocessing" (prior  to what actual processing is
> undefined) or the "user interface", hence a  local
> implementation choice, and what happens there seems to be way 
> outside our control.

That has been the intent.  Mark's concern (and that of others)
is that, if everyone does preprocessing in a different way,
especially of names found in URIs and IRIs in files (as distinct
from directly-typed user input), then we have "a massive
interoperability problem".   The tradeoff between the "local
processing is far outside our scope and we can't prevent people
from doing it" position and the perception of likely massive
interoperability problems if they do (or more massive
interoperability problems if we tell them they can than if they
figure it out on their own) is, IMO, what this discussion really
needs to be about.
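
To make the concern concrete, here is a minimal sketch of two clients that preprocess the same href differently before lookup. It assumes Python's built-in "idna" codec as a stand-in for IDNA2003 mapping; the LOCAL_MAP table and both function names are hypothetical and come from no draft.

```python
# Two hypothetical clients resolving the host from <a href="Å.com">.

def client_idna2003(host: str) -> str:
    """Client A: apply only the IDNA2003 mapping (Python's built-in codec)."""
    return host.encode("idna").decode("ascii")

# Hypothetical locale-specific table; an assumption for illustration only.
LOCAL_MAP = {"\u00c5": "aa", "\u00e5": "aa"}

def client_local(host: str) -> str:
    """Client B: apply a local mapping first, then the IDNA2003 mapping."""
    mapped = "".join(LOCAL_MAP.get(ch, ch) for ch in host)
    return mapped.encode("idna").decode("ascii")

href_host = "\u00c5.com"             # "Å.com", U+00C5
print(client_idna2003(href_host))    # xn--5ca.com
print(client_local(href_host))       # aa.com
```

The same link therefore reaches two different zones depending on which client follows it, which is case (C) in the quoted message.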

>> ===
>> 
>> For (A), I think we are in agreement; this is a bad idea. The 
>> registrant should supply exactly what they want, and if it is
>> not an  A-Label, it should be rejected, so that as a result
>> the user only is  able to register what s/he thinks is being
>> registered.
> 
> If the presentation to the registry (or the registrar for that
> matter)  is aa.example, how is the registry (or the registrar
> for that matter) to  divine the registrant's intent was
> Å.example?

By requiring that the registrant submit the intended A-label.
Whether they submit it in addition to or instead of a native
character string is something that the draft leaves up to you*.
It also leaves up to you what you do if they provide both and
you discover a discrepancy (the native character string is not a
U-label and/or the native character string doesn't even map to
the supplied A-label under IDNA2003 mappings).  I believe that
paragraph 2 of Section 4.2 of Protocol now says that.   If it
does not do so clearly, or if we decide to remove that
paragraph, we should consider whether that A-label
requirement/suggestion should be moved to 4.1 and the rest of 4.2
recast or eliminated.   However, going very far in that direction
would
take us very far in the direction of a registration best
practices document, which I believe we should (continue to)
avoid.

* "you" in the above is intended to obscure the details of
registrar-registry relationships.   Such details lie far outside
our scope and are, IMO, matters for individual zones (and anyone
who presumes to regulate them) to sort out.
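
As an illustration of the discrepancy check mentioned above, the following is a rough sketch, not anything the draft specifies. It assumes Python's built-in "idna" codec (which implements the IDNA2003 operations); both function names are hypothetical, and the round-trip test is only one necessary property of a U-label, not the full IDNA2008 validity test.

```python
def maps_under_idna2003(native_label: str, a_label: str) -> bool:
    """True if native_label converts to a_label under IDNA2003 mapping."""
    try:
        mapped = native_label.encode("idna").decode("ascii")
    except UnicodeError:
        return False  # prohibited or otherwise unconvertible input
    return mapped == a_label.lower()

def round_trips(native_label: str, a_label: str) -> bool:
    """True if decoding a_label yields native_label exactly (no mapping needed)."""
    try:
        return a_label.encode("ascii").decode("idna") == native_label
    except UnicodeError:
        return False

# U+00E5 (å) both maps to and round-trips with xn--5ca.
print(maps_under_idna2003("\u00e5", "xn--5ca"), round_trips("\u00e5", "xn--5ca"))
# U+00C5 (Å) maps to xn--5ca but does not round-trip: it needed case mapping.
print(maps_under_idna2003("\u00c5", "xn--5ca"), round_trips("\u00c5", "xn--5ca"))
# "aa" is a different label altogether.
print(maps_under_idna2003("aa", "xn--5ca"))
```

What to do when either check fails (reject, warn, or ask again) is exactly the policy choice left to the zone.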

> My point is, regardless whether this is good or bad, how are
> we  (registry and/or registrar hat on) to know that there is a
> bug between  us and the would-be registrant's actual intent?
> And if it is a bug or  the policy of the registry (or the
> registrar for that matter), how is  this fundamentally
> different from a registry (or the registrar for that  matter)
> that declines all IDN registration offers, or transforms such 
> offers?

See above but, as to declining, probably none.

> For instance, in the 2002 time-frame the MicroSoft browser
> product had a  bug in its IDN code that caused every browser
> in the CDNC market to  source a sequence of packets to
> Redmond, then Mountain View and finally  Reston, resulting in
> noticeable dollar-denominated overseas tariffs for  the CNNIC
> ISP market. The resolution was systematically incorrect due to 
> "(product) local mapping" (a bug in handling the final octet
> of a  string), and the resolvant (party attempting to resolve
> an IDN) was  unable to supply exactly what they wanted.

Indeed, and while that was presumably a bug rather than intent or
malice, it points out that there isn't a lot we can do in the
protocol to prevent people from doing things locally that others
would consider harmful and/or stupid.  Much of this problem,
IMO, comes down to the observation that, pre-IDNA, we had a very
specific, server-enforced criterion about what matches and what
doesn't.   Whether it was correct or not, everyone got used to
it because they had no choice.   For IDNA, where all matching
has to be simulated by the encoding procedure (and any
associated mappings), nothing is going to prevent people from
making lookup-time local determinations about things that should
match other than fear of interoperability or resolution
failures.  

>> For (B), I'm not sure what you think, but for me is clearly a
>> case  where local mappings make sense, and are being
>> currently implemented  (under a different name, bundling).
>> So, for example, simplified  Chinese characters in a label
>> can be mapped to traditional, and both  labels registered.
>> Labels only differing by eszett and "ss" can be  bundled (or
>> blocked). That's up to the registry.
> 
> See above. I favor this, always have.

So do I.  And I hope we have general agreement about that.

>> For (C), this is the area that the UTC position is actually 
>> addressing; where some client program implementing IDNA is
>> doing a  remapping. It is all in reference to (C) that my
>> message to Cary is  written. And it is here where the
>> interoperability and security  problems of local mappings
>> surface.
>> 
>> At least in the current draft, we are not at a "no mappings"
>> model -  we are at an "arbitrary conflicting mappings" model,
>> because we allow  local mappings for (C).
>> 
>> And the thought of the thousands of user agents; browsers,
>> IMers,  emailers, plus search engines, (plus varying by
>> versions!) sending <a  href="Å.com"> to "aa.com
>> <http://aa.com>" instead of "å.com  <http://xn--5ca.com>" is
>> a nightmare. Before allowing that nightmare,  we really need
>> to hear a compelling case for it!
> 
> The example set of potentially incorrect implementations cited
> here (see  also MicroSoft's browser product, circa 2002,
> supra), is vast.
> 
> We can't correct those implementations.

Indeed.  But we can either go down the slippery slope of telling
people what mappings they should and should not make (and hope
that they pay attention and get it right), or we can try to make
the dangers of indiscriminate local mappings more obvious,
explain why a "no mapping" model is desirable, and talk about
transitions.  We can also discuss an intermediate "a few mappings"
position, but should do so with the understanding that stopping
rules are hard and that registry flexibility for the use of
JET-style variant techniques probably benefits from avoiding
mappings that might lose information and foul things up.
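
One example of a mapping that loses information is the IDNA2003 treatment of eszett, mentioned in the quoted text under (B). The sketch below assumes Python's built-in "idna" codec as the IDNA2003 implementation; the helper names and the label "straße.example" are illustrative only.

```python
def to_ascii_2003(name: str) -> str:
    """Apply the IDNA2003 mapping and encoding via Python's built-in codec."""
    return name.encode("idna").decode("ascii")

def to_unicode_2003(name: str) -> str:
    """Convert an ASCII-form name back to its Unicode form."""
    return name.encode("ascii").decode("idna")

original = "stra\u00dfe.example"        # "straße.example"
ascii_form = to_ascii_2003(original)     # eszett is mapped to "ss"
recovered = to_unicode_2003(ascii_form)

print(ascii_form)                        # strasse.example
print(recovered == original)             # False
```

Once the mapping has been applied, nothing downstream can tell which form was requested, so a registry can no longer treat the two as distinct members of a variant bundle.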

However, from my personal point of view, the worst case is not
the one Mark outlines in "C" (however realistic or not that may
be).  The worst case is that we have an IETF standard that
specifies "no mapping" (with or without very localized and
constrained mapping on lookup) and a Unicode Standard that
specifies a particular mapping, because some implementations
would inevitably follow and conform to the IETF Standard only and
others would follow and conform to both.  That would guarantee
"massive interoperability problems" rather than putting us in a
position of speculating about how much local mapping would
actually occur, where, and what fraction of it would be
irresponsible.

     john