local mappings

John C Klensin klensin at jck.com
Fri Jan 23 23:36:42 CET 2009


This is going to be a bit long.  Perhaps one inducement to
reading it is that I've partially changed my mind about mappings
and need to explain that change of mind.



--On Friday, January 23, 2009 9:20 -0800 Erik van der Poel
<erikv at google.com> wrote:

> On Fri, Jan 23, 2009 at 8:42 AM, Andrew Sullivan
> <ajs at shinkuro.com> wrote:
>> The only suggestion I have for this is to tighten the rules
>> more.  One way of tightening them is to do as Mark suggested,
>> and say "mapping for IDNA2003:IDNA2008, and nothing else" (I
>> think that's what he said; apologies if I put words in his
>> mouth).  An alternative would be to put together a registry
>> of known mappings, and say, "These are acceptable, and
>> everything else MUST fail."  The icky thing about that, of
>> course, is that it puts us back in the business of deciding
>> "legal" characters, and that's supposed to be a policy
>> decision.
> 
> Yes, it would be a good idea to tighten the recommendations
> for local mappings. I think we should also add at least one
> example, and the example I would choose is the Turkish
> dotted/dotless 'i' in upper and lower case.

I can easily add that and will do so if there is no objection.
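For the record, the Turkish pair Erik mentions behaves roughly like this. A sketch in Python; the `turkish_lower` helper and its mapping table are purely illustrative, not part of any specification:

```python
# Illustration only: default (language-insensitive) lowercasing vs.
# Turkish rules.  Unicode's default mapping takes U+0049 'I' to
# U+0069 'i', but Turkish expects U+0131 'ı' (dotless); likewise
# U+0130 'İ' (dotted capital) should become plain 'i' in Turkish.

# Hypothetical helper applying the Turkish-specific pairs before
# falling back to the default mapping; not a full locale-aware
# implementation.
TURKISH_PAIRS = {"I": "\u0131", "\u0130": "i"}

def turkish_lower(s: str) -> str:
    return "".join(TURKISH_PAIRS.get(ch, ch.lower()) for ch in s)

assert "DIS".lower() == "dis"               # default rules
assert turkish_lower("DIS") == "d\u0131s"   # Turkish: dotless ı
```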

> The recommendation I would suggest would go roughly along the
> following lines. On the registration side, the implementation
> should ask the user to confirm the domain name that was
> entered, after lower-casing using the user's language's rules.

Andrew knows much more about this than I do, but, as soon as one
gets into the territory of users confirming names, etc., one
runs into issues that we really can't address (or, at least, I
don't know how to do so):  At one extreme, I've got a small
domain that I'm maintaining with a text zone file and emacs.
User calls me up and says "please put a node in with the
following label and records".   I listen, write it down in lower
case, probably read it back to the caller to see if I've gotten
it right (not a very good technique, but likely), invoke a
little program to validate it and convert to A-label form, and
then key the A-label in with my little keyboard. I don't have
any idea whether that procedure would conform to what you are
suggesting above.   

At the other extreme, we enter the strange world of
registrar-registry protocols in which inquiries of the user by
the registry are bad business because they complicate protocols
and create race conditions about name acquisition.    As Andrew,
Cary, and several others know, my personal "good practices to
prevent trouble" recommendation to those sorts of registries
(since before IDNA2003 was published) has been that they get
both the native-character and ACE forms transmitted over the
registrar-registry interface, that they convert the ACE form
back to the Unicode one, and, if they don't match, they bounce
the registration attempt as inconsistent.  I've also recommended
that only the Unicode string that one obtains by mapping back
from the ACE be searchable via, e.g., web interfaces to whois
databases (I continue to believe that whois, especially on the
query side, is required to be ASCII-only, so that only the ACE
can be used to search).   That puts all of the responsibility
for mapping and interpretation, even with IDNA2003, at the
registrant-registrar boundary, where one can engage in a
dialogue with the would-be registrant.  But behavior at that
boundary is far outside anything that, IMO, we can reasonably
specify as part of the IDNA package, even in "rationale".
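A minimal sketch of that registry-side consistency check, using Python's standard punycode codec (the function name is illustrative, and a real registrar-registry interface would of course also apply the full IDNA validation steps before this comparison):

```python
def consistent(native_label: str, a_label: str) -> bool:
    """Return True if the A-label decodes back to the native label.

    Sketch of the check described above: the registrar transmits
    both forms, the registry converts the ACE form back to Unicode
    and bounces the registration if the two disagree.
    """
    if not a_label.lower().startswith("xn--"):
        return False
    try:
        decoded = a_label[4:].encode("ascii").decode("punycode")
    except (UnicodeError, ValueError):
        return False
    return decoded == native_label

# Round trip for a label such as "bücher":
ace = "xn--" + "bücher".encode("punycode").decode("ascii")
assert consistent("bücher", ace)        # forms agree
assert not consistent("bucher", ace)    # mismatch -> bounce
```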

It is now clear to me from these discussions that I've gotten
very confused about the difference between the range of expected
practices from "registrant talking about getting a label in a
particular domain" to "registry putting something in a zone" and
the best way to describe the protocol for the parts of the sequence
of events, and its many variations, which really are protocol.  

It seems obvious that some very specific advice about
registration, registrar, and registry best practices is
needed and that it is probably different for different types of
registries and relationships.  That has actually seemed obvious
for years and there are some documents out there.  But whether
the responsibility for that work should lie with the EPP folks,
or with ICANN, or with some other group (I'm particularly
concerned about where the advice to small ccTLDs and individual
domains deeper in the tree is going to come from), it is fairly
clear to me
that doing very much beyond warning about traps to look out for
is beyond the scope of this WG and probably beyond its
competence.

> On the lookup side, if the user is typing characters one by
> one, lower-casing should follow that user's language's rules.

Agree so far.

(paragraph cut out of order -- it is below)
 
> But this is exactly where some members of the WG are likely to
> disagree, because they are willing to allow local mappings but
> want to get rid of the global mappings.
> 
> So we are back to square one, lack of consensus on the issue
> of global mappings.

Since I have just changed my mind about how this issue should be
approached in the documents (I'm a slow learner), let me explain
where I stand at the moment and a bit about why to see if it
gives others a basis for moving forward...

First of all, there are several possible ways to try to construct
a standard.  One is to write it narrowly on the assumption that
doing so will optimize interoperability if everyone follows the
rules.
Another is to write it to reflect current practice, no matter
how bad, on the expectation that, given whatever the practice is
today, things will not get better and are likely to get worse.
I'm normally an advocate of the former, partially because I
believe that the purpose of standards is to facilitate a better
and more interoperable world and that the latter approach is
really not a "standard" in that sense but rather is a simple,
non-judgmental, description of what is going on out there.  The
latter has its place too, but it is a different thing.

I tried, in constructing the text about local mappings, to fall
somewhere between those two models, setting a clear norm for
domain names in interchange but recognizing that some mapping
was going to occur no matter what we told people to do.  The
text has never worked satisfactorily.  I now believe that the
text has not worked because that particular middle ground model
does not work.

Second, one of the things we have learned about the Internet in
the last twenty years is that it keeps growing, usually with an
accelerating rate of growth.  We usually consider that A Good
Thing, but it means that, when we discover that we've made a
design decision that was sub-optimal, we can either fix it
quickly on the assumption that it will only get worse and larger
if we delay (and that the number of people affected by a
transition will be dwarfed by the number of new users, systems,
and content elements), or we can decide that we are stuck,
forever, with the issue.  The growth of the Internet also brings
about changes in the population -- each new million users lowers
the percentage of users who are sophisticated enough to
appreciate the nuances and constraints of protocols, inherited
legacy conventions, intra-Unicode relationships, etc.

In that tradeoff between "fix it so unsophisticated users are
less likely to be astonished" and "maintain compatibility with
prior practices at all costs", the first one wins for me every
time.  If we disagree about that principle, let's discuss it
rather than its manifestation in, e.g., mappings.

That said, I (and others) have been taking the position that we need
a clear, simple, "no mapping" rule and that, while there are
other reasons for it, it is basically a necessary corollary of a
desire to make URIs as comparable and unambiguous as possible
for all sorts of reasons.  While the alternative is uglier, an
"absolutely no mapping" rule is probably wrong when seen
from a "minimize astonished users" perspective.  One of the
reasons for a clear "no mapping" rule is that it is difficult to
know where to stop once one gets started and, in particular,
difficult to deal with "given that you have that mapping, why
can't I have my favorite mapping?" questions.   At the other
extreme from the lower-casing Erik mentions as an example, there
are mappings in IDNA2003 that are dependent on relationships
within Unicode that are obscure to the average user and that
involve characters that, if they appear in the presentation
forms of domain names or URIs, are most likely to be there to
make mischief (I want to stress that I don't think there is
anything wrong with Unicode in those areas, just the decisions
we've made).

So, for example, we can reasonably expect that, for scripts with
case, users will be astonished if upper case characters don't
map to lower case ones.  Could we agree to apply a lower-case
mapping globally -- lower-case and not CaseFold, because
CaseFold is designed for comparison and not well-suited to
mapping (more or less quoting from TUS on that) and because the
subtle properties of CaseFold, in the cases in which it doesn't
produce what LowerCase produces, are themselves astonishing to
the unsophisticated?    If we do lower-case, but continue to
ban compatibility characters and the other odd cases that
surprise those who don't know what is going on, does that help
us significantly with the compatibility and astonishment
situations that are really important?  I don't think there are
many other situations similar to the lower-case one (which I
assume is why it keeps coming up in examples), but need advice
from Mark, Ken, and others as to whether there are any others
and what they are.  And I can only hope that even suggesting
this doesn't open cans of worms and arguments about which
mappings are more important than others.
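The LowerCase/CaseFold divergence I'm pointing at can be seen directly in Python, whose str.lower() and str.casefold() implement the two Unicode operations:

```python
# U+00DF LATIN SMALL LETTER SHARP S: unchanged by lowercasing, but
# full case folding expands it to "ss" -- fine for comparison,
# astonishing if used as a mapping and shown back to the user.
assert "straße".lower() == "straße"
assert "straße".casefold() == "strasse"

# U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE: lowercasing per
# Unicode's SpecialCasing produces "i" plus U+0307 COMBINING DOT
# ABOVE, not the plain "i" an unsophisticated user might expect.
assert "\u0130".lower() == "i\u0307"
```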

Disclaimer: I have not discussed the idea above with members of
the editorial team or anyone else.   It may be a _really_ dumb
idea.

Does that help at all?

    john



More information about the Idna-update mailing list