xd- with C/DNAME (was: Re: The Two Lookups Approach (was Re: Parsing the issues and finding a middle ground -- another attempt))

Sun Mar 15 17:49:02 CET 2009

--On Saturday, March 14, 2009 10:48 AM -0700 Erik van der Poel 
<erikv at google.com> wrote:

>...
>>> The big question then is whether we can make a similar jump
>>> for Eszett and Final Sigma. I'd be interested to hear from
>>> the German and Greek registries, now that we have figured
>>> out how to display those characters (via CNAME/DNAME).
>>
>> Interesting.  I don't think we have figured out anything of
>> the sort.  What we have "figured out" is how to make
>> CNAME/DNAME do some of the aliasing job, but that is nothing
>> new.
>
> I have explained CNAME/DNAME with xd--. If you have a concrete
> objection, please present it. Or if I have not explained enough
> details, let me know.

With the understanding that this is a real issue, not 
nit-picking, proposals around here come in I-D form, if only 
because it is easier to determine whether they are complete or 
require a lot of context and assumptions to understand, and to 
be able to conveniently circulate them to experts who are not 
following the list.  To a considerable extent, you are proposing 
an entirely new piece of protocol and it has not been described 
in nearly enough detail to evaluate it.

I haven't commented because I haven't seen a proposal, only a 
fairly vague statement of an idea.

But I think some of the concerns have already been noted. 
Anything tricky with the DNS is going to require more than one 
lookup unless you can get the information in the "Additional" 
data block, and the latter raises all of the deployment 
difficulties and delays that Andrew and others have raised. 
Encoding information in the DNS in tricky ways --ways that might 
be recognized by one application and not another-- is sort of an 
invitation to security vulnerabilities as well as 
user-surprising inconsistencies.  I don't know that an adequate 
analysis has been done about how one tests the display 
information to be sure it isn't too different from the base 
information, and doing so may violate the basic IDNA constraint 
of not requiring DNS changes, i.e., not requiring non-IDNA-aware 
applications (and DNS clients and servers) to be even a little 
aware of IDNA.  DNAMEs, especially when the data must be 
synthesized, are not cheap and I haven't seen an analysis yet 
about implications for caches and cache rotation and the effects 
that might have on overall DNS performance.
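
For readers who have not looked at DNAME closely, the synthesis 
step mentioned above can be sketched as follows.  This is a toy 
model only, not a resolver: the function name and the example 
domains are made up for illustration, and real servers must also 
handle length limits, DNSSEC, and caching of the synthesized 
record.

```python
def synthesize_cname(qname, dname_owner, dname_target):
    """Return the CNAME target a server would synthesize for qname,
    or None if the DNAME at dname_owner does not apply.
    Hypothetical helper for illustration only."""
    q = qname.lower().rstrip(".").split(".")
    owner = dname_owner.lower().rstrip(".").split(".")
    target = dname_target.lower().rstrip(".").split(".")
    # A DNAME redirects names strictly below its owner,
    # never the owner name itself.
    if len(q) <= len(owner) or q[-len(owner):] != owner:
        return None
    # Keep the labels to the left of the owner, swap the suffix.
    prefix = q[:-len(owner)]
    return ".".join(prefix + target) + "."


print(synthesize_cname("www.example.com.", "example.com.", "example.net."))
# -> www.example.net.
print(synthesize_cname("example.com.", "example.com.", "example.net."))
# -> None
```

The point of the sketch is that every query below the owner name 
produces a freshly computed record, which is where the cost and 
caching questions come from.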

Even for IE7, which seems to be your primary concern, going to a 
different resolution logic -- looking up different QTYPEs under 
different circumstances, presumably based on a list of 
characters -- is an entirely different deal than the operation 
the traditional plug-ins have done, which is simply to map one 
URL into another.  Shawn or Andrew may want to comment on this 
further, but I think you are into the range of a far more 
serious deployment problem than you anticipate, one that might 
be sufficient to make the idea useless in any relevant time 
frame.

As we have also seen with Eszett and characters with diaeresis 
and other decorations, the boundary between display optimization 
and something one would want to use for comparisons is thin, and 
sometimes a little dependent on outside considerations and/or 
subjective ones.  I don't think any of us would want to deploy a 
display-assistant procedure that some implementations also tried 
to treat as clues about matching and others didn't, leading to 
far more incompatibility than what we are trying to remedy.

Please note that some of those comments are more of the nature 
of "this needs to be examined carefully" rather than "this is a 
deficiency".  And this still might be the right thing to do 
under the "least stinky" principle, although I doubt it.

> At this point, I believe that xd--
> should be limited to characters that we have identified as
> problematic. I.e. characters with three-way relationships
> (Eszett/ss/SS and Final/Capital/Normal Sigma), IDNA2003 "map
> to nothing" characters including ZWJ/ZWNJ, Unicode 3.2
> upper-case characters that now have lower-case counterparts in
> Unicode 5.1, and characters that have unfortunately received
> different normalizations after Unicode 3.2:
>
> http://www.unicode.org/reports/tr46/#Differences_from_IDNA2003

Reducing scope of applicability obviously helps with some of the 
issues above, but not with most of them.

However, it is a mistake to concentrate on just these four.  The 
number of registrations may be on the same order as those for 
Malayalam (whatever that is), but, if someone decided to 
register abbreviations (or labels intended to represent 
old-style numbers) in Hebrew script by just dropping the geresh 
(or gershayim), they, too, will now have a transition problem as 
the introduction of those characters with contextual rules 
creates a problem almost identical to the "previously mapped to 
nothing" one.  In the latter cases, the mapping was performed by 
IDNA2003.  For these two cases, it would have been performed by 
a registrant who decided that some approximation to the desired 
name is better than none at all, even while knowing that users 
trying to guess at the name are likely to type the banned 
characters and get lookup failures.  But, in both situations, 
the registry really can't tell what was intended except by 
appealing to information that isn't stored in the DNS.
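
The IDNA2003 side of this can be seen with Python's standard 
library, whose `encodings.idna` module implements the IDNA2003 
Nameprep profile.  The sample labels below are chosen purely for 
illustration; the sketch shows only the mapping step, not a 
complete ToASCII operation.

```python
from encodings.idna import nameprep

# Illustrative labels: Eszett, Final Sigma, and ZWNJ.
samples = {
    "eszett": "stra\u00dfe",
    "final sigma": "\u03bf\u03b4\u03cc\u03c2",
    "zwnj": "ab\u200ccd",
}

for name, label in samples.items():
    mapped = nameprep(label)
    print(name, ascii(label), "->", ascii(mapped))

# eszett 'stra\xdfe' -> 'strasse'
# final sigma '\u03bf\u03b4\u03cc\u03c2' -> '\u03bf\u03b4\u03cc\u03c3'
# zwnj 'ab\u200ccd' -> 'abcd'
```

Once the mapping has been applied, "strasse" typed directly and 
"strasse" produced from Eszett are the same string, so nothing 
in the stored label records which one was intended.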

>...
> Certainly, the IETF can go ahead and suggest encoding Eszett
> and Final Sigma under xn--. The question is whether the major
> implementations will continue to map or not. Isn't that why
> we're all here in this WG, to try to agree on this?

The major implementations will do what they do, regardless of 
what we agree on.  I think we are here to make IDNs as useful, 
stable,  and predictable going forward as possible.  That 
includes accommodating a broader range of populations in a 
useful way and preventing having to go through this type 
of discussion with each Unicode revision that involves any 
change that might be conceptually incompatible (the Malayalam 
example is again important here) and therefore involve a 
transition strategy by either the relevant registries or in the 
protocol.   One of the other things I'm not sure of from your 
comments/suggestion is whether we might then need "xe--" for 
Unicode 5.2, "xf--" for Unicode 6.x, and so on, depending on how 
additions and changes affected existing characters and 
registrations or whether we would "just" need to add to the 
characters which, if they appear in a label, require a different 
sort of lookup.

Once we decide what the right thing is to do, we then have an 
educational job on our hands, one of convincing the implementers 
that they should accommodate the new characters and strategies. 
Presumably we do that in conjunction with those who want the 
characters to be available and whatever market forces they can 
exercise.   But the bottom line is that not even the purest form 
of the IDNAv2 approach will automagically cause implementers to 
do the right thing.   Looking up unassigned characters will help 
somewhat, but un-updated implementations won't know about 
canonical and compatibility mappings that might be associated 
with the new characters (even though an updated Stringprep 
presumably would), etc.

Of course, if a client has to have a list of characters that 
must be looked up in an entirely different way, then it also has 
a list of characters that are the only ones that might need a 
two-stage lookup arrangement.  While I don't like two-stage 
lookups any more than you do, it is worth remembering that 
CNAMEs also require a two-stage lookup unless the Additional 
Information contains all of the needed information (and is 
trusted) and DNAMEs can require either that or a complex 
synthesis operation.  Those requirements do not come for free. 
More important, unless I've missed something, they are forever 
-- not a transition strategy, but a different form of a new 
display-information / presentation advice bag on the side of the 
DNS.
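
A minimal sketch of the two points above, under loudly stated 
assumptions: the character list is just the handful named in the 
quoted proposal (Eszett, Final Sigma, ZWJ, ZWNJ), the zone data 
and names are invented, and the "resolver" is a dictionary walk 
that only counts queries.  It is not a description of any 
implementation.

```python
# Hypothetical list of characters that would trigger the special path.
SPECIAL = {"\u00df", "\u03c2", "\u200c", "\u200d"}


def needs_special_lookup(label):
    """True if the label contains any character on the list."""
    return any(ch in SPECIAL for ch in label)


# Toy zone data: name -> (rrtype, value).  All names are made up.
ZONE = {
    "alias.example.": ("CNAME", "real.example."),
    "real.example.": ("A", "192.0.2.1"),
}


def resolve(name, zone, max_steps=8):
    """Follow CNAMEs one query at a time; return (address, queries)."""
    queries = 0
    for _ in range(max_steps):
        queries += 1
        rrtype, value = zone[name]
        if rrtype == "A":
            return value, queries
        name = value  # chase the alias with another query
    raise RuntimeError("CNAME chain too long")


print(needs_special_lookup("stra\u00dfe"))   # True
print(needs_special_lookup("strasse"))       # False
print(resolve("real.example.", ZONE))        # ('192.0.2.1', 1)
print(resolve("alias.example.", ZONE))       # ('192.0.2.1', 2)
```

The alias costs a second query unless the first response already 
carries the target's records and the client is willing to trust 
them.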

Part of the disconnect here is that some of us are primarily 
concerned with looking forward -- how do we organize things so 
that they work better and more predictably, for more people, in 
the long run and then how do we get there (which includes making 
sure there are plausible transition strategies in the shorter 
term). It appears to me that you and others keep looking 
backward onto variations of "all IDNA2003 decisions are forever 
and therefore require elaborate workarounds to be added to the 
protocols".

>>> It sure would be nice to avoid double and triple lookups...
>>
>> Yes.  Here is where I say something about wanting a pony.
>>  Or a DWIM DNS RR.  Ultimately, if there are going to be
>> transitions, one is going to need to make a choice about
>> where to feel the pain.  One way is to say, in one form or
>> another, "registry problem", a problem that registries might
>> choose to deal with in a variety of ways.  Another is to
>> decide that it is the problem of the protocol, in which case
>> we are probably going to need strategies that result in more
>> than one lookup (at least sometimes).  We ought to keep the
>> fact that there is a choice in mind.
>
> How about a base IDNA spec that leaves the choice between
> multiple lookup and bundling to the implementers and
> registries? Actually, the current IDNA2008 drafts are already
> there, pretty much. Then the decisions about Eszett, Final
> Sigma, ZWJ and ZWNJ can be hammered out after the base spec is
> released.

As you say, in a way the drafts are already there.   They 
certainly do not prevent registries from bundling or from 
implementing any of the alternate strategies.  Some (fairly 
extensive) previous experience indicates that registries are 
more likely to deal with these issues by sunrise or equivalent 
procedures (or just starting a land rush) than by bundled 
collections of labels.  Some will do that because it is less 
work, others because it is more profitable in their environment, 
and others because they see these newly-available characters as 
"different things" and want to make both labels available.  But 
the experience so far is that strategies involving bundling 
names into a zone are used mostly for orthographically long-term 
issues (like Simplified and Traditional Chinese), not for 
short-term transitions and that, even then, registries much more 
often use variant techniques to ban registration of 
possibly-conflicting variants than they register multiple names 
(again, less work and fewer side-effects on, e.g., database 
structures).

Almost all of these characters (final sigma looks like the one 
exception right now) are supported in the IDNA2008 proposals 
because some group demanded them and demanded support for them. 
A registry or implementation that continues to support mappings 
that de facto make the characters invisible will suffer the 
wrath of those groups.  What they do about that isn't up to us 
to decide.

regards,
   john