xd- with C/DNAME (was: Re: The Two Lookups Approach (was Re: Parsing the issues and finding a middle ground -- another attempt))

Mon Mar 16 17:54:02 CET 2009

John,

I had almost completed this response to your email when my wife and I
went on a bike ride, during which the registrant <-> user idea hit me
right in the head (see other email). I had to write that idea down
before I lost it, but now I'd like to return to your thoughtful note.
I will include some of my latest thoughts.

>>>> The big question then is whether we can make a similar jump
>>>> for Eszett and Final Sigma. I'd be interested to hear from
>>>> the German and Greek registries, now that we have figured
>>>> out how to display those characters (via CNAME/DNAME).
>>>
>>> Interesting.  I don't think we have figured out anything of
>>> the sort.  What we have "figured out" is how to make
>>> CNAME/DNAME do some of the aliasing job, but that is nothing
>>> new.
>>
>> I have explained CNAME/DNAME with xd--. If you have a concrete
>> objection, please present it. Or if I have not explained enough
>> details, let me know.
>
> With the understanding that this is a real issue, not nit-picking, proposals
> around here come in I-D form, if only because it is easier to determine
> whether they are complete or require a lot of context and assumptions to
> understand, and to be able to conveniently circulate them to experts who are
> not following the list.  To a considerable extent, you are proposing an
> entirely new piece of protocol and it has not been described in nearly
> enough detail to evaluate it.
>
> I haven't commented because I haven't seen a proposal, only a fairly vague
> statement of an idea.

I agree that it would be much better to write it up, with as many
details as possible.

> But I think some of the concerns have already been noted. Anything tricky
> with the DNS is going to require more than one lookup unless you can get the
> information in the "Additional" data block, and the latter raises all of the
> deployment difficulties and delays that Andrew and others have raised.

CNAME/DNAME don't appear to require more than one lookup. I tried
www.xn--kxadbfj6eq.gr from Firefox on Windows and Ethereal reported a
single packet going out (the query), and a single packet coming back,
which included a DNAME, a CNAME and an IP address. (I'm willing to
believe that Ethereal is showing me fake packets due to Windows API
quirks.)
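
A minimal sketch of why one response can be enough: if the answer
section already carries the DNAME, the CNAME and the address record,
the client can follow the chain locally without a second query. The
target name example-target.gr and the address 192.0.2.1 are made up
for illustration; they are not what the real zone returned.

```python
def resolve_from_answer(qname, answer):
    """Follow CNAME records inside a single answer section.

    `answer` is a list of (owner, rtype, rdata) tuples with lower-case
    names. DNAME records are ignored here because the server is assumed
    to have included the matching CNAME alongside them.
    Returns (final_name, list_of_addresses).
    """
    name = qname.lower().rstrip(".")
    seen = set()
    while name not in seen:  # guard against alias loops
        seen.add(name)
        addrs = [rdata for owner, rtype, rdata in answer
                 if rtype == "A" and owner == name]
        if addrs:
            return name, addrs
        targets = [rdata for owner, rtype, rdata in answer
                   if rtype == "CNAME" and owner == name]
        if not targets:
            return name, []  # chain ends without an address
        name = targets[0]
    return name, []


# Hypothetical answer section; only the query name comes from the test above.
ANSWER = [
    ("xn--kxadbfj6eq.gr", "DNAME", "example-target.gr"),
    ("www.xn--kxadbfj6eq.gr", "CNAME", "www.example-target.gr"),
    ("www.example-target.gr", "A", "192.0.2.1"),
]

final_name, addresses = resolve_from_answer("www.xn--kxadbfj6eq.gr", ANSWER)
```

If the address record were missing from the response, the function
returns an empty list and a second lookup would be needed, which is
the case John describes.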

> Encoding information in the DNS in tricky ways --ways that might be
> recognized by one application and not another-- is sort of an invitation to
> security vulnerabilities as well as user-surprising inconsistencies.  I
> don't know that an adequate analysis has been done about how one tests the
> display information to be sure it isn't too different from the base
> information

Yes, I need to write down the comparison procedure, to mitigate both
security issues and user astonishment.
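
A rough sketch of what such a comparison could look like, only to make
the idea concrete. The procedure itself is not written yet, so
everything here is an assumption: the function names are hypothetical,
and NFKC plus Unicode case folding is used as a stand-in for the
IDNA2003 mapping, which it only approximates.

```python
import unicodedata


def fold_label(label):
    """Approximate the IDNA2003 mapping for the examples below.

    Assumption: case folding followed by NFKC is close enough for
    Eszett and Final Sigma. It is NOT a full Nameprep implementation
    (for example, it does not remove ZWJ/ZWNJ).
    """
    return unicodedata.normalize("NFKC", label.casefold())


def display_matches_base(display_label, base_label):
    """Accept a display label only if it folds to the same string as
    the base label that was actually looked up."""
    return fold_label(display_label) == fold_label(base_label)


# Eszett folds to "ss", Final Sigma (U+03C2) folds to Sigma (U+03C3).
ok_eszett = display_matches_base("stra\u00dfe", "strasse")
ok_sigma = display_matches_base("\u03c2", "\u03c3")
rejected = display_matches_base("stra\u00dfe", "strasser")
```

A client applying a check of this kind would refuse to display a label
that does not fold back to the name the user actually reached.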

> and doing so may violate the basic IDNA constraint of not
> requiring DNS changes, i.e., not requiring non-IDNA-aware applications (and
> DNS clients and servers) to be even a little aware of IDNA.

As far as I can see, this does not require DNS changes, other than
adding CNAME and/or DNAME, which is presumably permitted. The only
difference is that the app uses CNAME/DNAME for display, using a
carefully specified procedure (which I need to write and refine).

[My registrant <-> user idea uses CNAME/DNAME not only for display,
but also for transition from one IDNA version to another.]

> DNAMEs,
> especially when the data must be synthesized, are not cheap

Yes, I'm quite concerned about DNAME for this purpose, because it is
newer and harder to implement in general, so I'd be more comfortable
with CNAME.
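
For reference, the synthesis step that makes DNAME more work than
CNAME can be sketched as a suffix substitution: a DNAME at an owner
name rewrites names beneath that owner, not the owner itself, and the
server produces a CNAME for each such query. The function name and the
target example-target.gr are illustrative assumptions.

```python
def synthesize_cname(qname, dname_owner, dname_target):
    """Return the CNAME target a server would synthesize from a DNAME,
    or None if the DNAME does not apply to qname.

    A DNAME covers only names strictly below its owner, so a query for
    the owner name itself is not rewritten.
    """
    q = qname.lower().rstrip(".")
    owner = dname_owner.lower().rstrip(".")
    target = dname_target.lower().rstrip(".")
    suffix = "." + owner
    if not q.endswith(suffix):
        return None
    prefix = q[: -len(suffix)]  # labels to the left of the owner name
    return prefix + "." + target


synthesized = synthesize_cname(
    "www.xn--kxadbfj6eq.gr", "xn--kxadbfj6eq.gr", "example-target.gr"
)
```

A plain CNAME needs no such computation, which is one reason to prefer
it where a single name (rather than a whole subtree) is being aliased.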

> and I haven't
> seen an analysis yet about implications for caches and cache rotation and
> the effects that might have on overall DNS performance.

It seems very fair to point this out (and thank you).

> Even for IE7, which seems to be your primary concern

IE7 is one of my biggest concerns, but certainly not the only one.

> going to a different
> resolution logic -- looking up different QTYPEs under different
> circumstances, presumably based on a list of characters

Do we need different QTYPEs to do the simple IP address lookup (with
CNAME/DNAME results) that I mentioned above?

> is an entirely
> different deal than the operation the traditional plug-ins have done, which
> is simply to map one URL into another.  Shawn or Andrew may want to comment
> on this further, but I think you are into the range of a far more serious
> deployment problem than you anticipate, one that might be sufficient to make
> the idea useless in any relevant time frame.

Yes, APIs and deployment are serious issues. I still believe that a
carefully specified http://<domain-name>/idndisp.txt can serve as the
transition strategy.

[Now that the registrant <-> user idea is about IDNA version
transition as well as display, perhaps idndisp is the wrong name for
this file.]

> As we have also seen with Eszett and characters with diaeresis and other
> decorations, the boundary between display optimization and something one
> would want to use for comparisons is thin, and sometimes a little dependent
> on outside considerations and/or subjective ones.  I don't think any of us
> would want to deploy a display-assistant procedure that some implementations
> also tried to treat as clues about matching and others didn't, leading to
> far more incompatibility than what we are trying to remedy.

I think I agree that I would need to make it clear that this is a
display protocol, not a matching protocol, and analyze the dangers of
using the info for matching or anything other than display.

[Again, I'm not just thinking about display now, so I need to analyze
the matching issues.]

>> At this point, I believe that xd--
>> should be limited to characters that we have identified as
>> problematic. I.e. characters with three-way relationships
>> (Eszett/ss/SS and Final/Capital/Normal Sigma), IDNA2003 "map
>> to nothing" characters including ZWJ/ZWNJ, Unicode 3.2
>> upper-case characters that now have lower-case counterparts in
>> Unicode 5.1, and characters that have unfortunately received
>> different normalizations after Unicode 3.2:
>>
>> http://www.unicode.org/reports/tr46/#Differences_from_IDNA2003
>
> Reducing scope of applicability obviously helps with some of the issues
> above, but not with most of them.
>
> However, it is a mistake to concentrate on just these four.  The number of
> registrations may be on the same order as those for Malayalam (whatever that
> is), but, if someone decided to register abbreviations (or labels intended
> to represent old-style numbers) in Hebrew script by just dropping the geresh
> (or gershayim), they, too, will now have a transition problem as the
> introduction of those characters with contextual rules creates a problem
> almost identical to the "previously mapped to nothing" one.  In the latter
> cases, the mapping was performed by IDNA2003.  For these two cases, it would
> have been performed by a registrant who decided that some approximation to
> the desired name is better than none at all, even while knowing that users
> trying to guess at the name are likely to type the banned characters and get
> lookup failures.  But, in both situations, the registry really can't tell
> what was intended except by appealing to information that isn't stored in
> the DNS.

Thanks for the geresh/gershayim example. It led to my other idea
(registrant <-> user).

>> Certainly, the IETF can go ahead and suggest encoding Eszett
>> and Final Sigma under xn--. The question is whether the major
>> implementations will continue to map or not. Isn't that why
>> we're all here in this WG, to try to agree on this?
>
> The major implementations will do what they do, regardless of what we agree
> on.  I think we are here to make IDNs as useful, stable,  and predictable
> going forward as possible.  That includes accommodating a broader range of
> populations in a useful way and preventing having to go through this
> type of discussion with each Unicode revision that involves any change that
> might be conceptually incompatible (the Malayalam example is again important
> here) and therefore involve a transition strategy by either the relevant
> registries or in the protocol.   One of the other things I'm not sure of
> from your comments/suggestion is whether we might then need "xe--" for
> Unicode 5.2, "xf--" for Unicode 6.x, and so on, depending on how additions
> and changes affected existing characters and registrations or whether we
> would "just" need to add to the characters which, if they appear in a label,
> require a different sort of lookup.

I believe we should consider xu-- with CNAME/DNAME/*NAME, not only
because it overloads the meaning of those *NAMEs, but also because it
avoids conflict with any xn-- names that might currently float around
in *NAMEs.

Also, if it is true that the current expectation is that CNAME/DNAME
are typically hidden from the user, I think that might be an argument
for a new prefix, which basically says "this is an exceptional
circumstance, please process according to special rule set R".

> Once we decide what the right thing is to do, we then have an educational
> job on our hands, one of convincing the implementers that they should
> accommodate the new characters and strategies. Presumably we do that in
> conjunction with those who want the characters to be available and whatever
> market forces they can exercise.   But the bottom line is that not even the
> purest form of the IDNAv2 approach will automagically cause implementers to
> do the right thing.   Looking up unassigned characters will help somewhat,
> but un-updated implementations won't know about canonical and compatibility
> mappings that might be associated with the new characters (even though an
> updated Stringprep presumably would), etc.

I still agree that looking up unassigned characters is bad (if
received in Unicode form), but if the app receives the label in xn--
form, it sure would be nice if it was allowed to look it up, as long
as it didn't blindly display the Unicode form of the label. (This
would allow users to "click through" links.) But again, the
discussions about automatic update of MSIE and plug-ins weaken my case
quite a bit.

> Of course, if a client has to have a list of characters that must be looked
> up in an entirely different way, then it also has a list of characters that
> are the only ones that might need a two-stage lookup arrangement.  While I
> don't like two-stage lookups any more than you do, it is worth remembering
> that CNAMEs also require a two-stage lookup unless the Additional
> Information contains all of the needed information (and is trusted) and
> DNAMEs can require either that or a complex synthesis operation.  Those
> requirements do not come for free. More important, unless I've missed
> something, they are forever -- not a transition strategy, but a different
> form of a new display-information / presentation advice bag on the side of
> the DNS.

See my other email about "registrant <-> user".

> Part of the disconnect here is that some of us are primarily concerned with
> looking forward -- how do we organize things so that they work better and
> more predictably, for more people, in the long run and then how do we get
> there (which includes making sure there are plausible transition strategies
> in the shorter term). It appears to me that you and others keep looking
> backward onto variations of "all IDNA2003 decisions are forever and
> therefore require elaborate workarounds to be added to the protocols".

It would be great to "jump" to a new lookup protocol without
transitioning via multiple lookups, but I keep getting pulled back
from that ideal by compatibility considerations. [Now I hope to refine
the "registrant <-> user" idea.]

>>>> It sure would be nice to avoid double and triple lookups...
>>>
>>> Yes.  Here is where I say something about wanting a pony.
>>>  Or a DWIM DNS RR.  Ultimately, if there are going to be
>>> transitions, one is going to need to make a choice about
>>> where to feel the pain.  One way is to say, in one form or
>>> another, "registry problem", a problem that registries might
>>> choose to deal with in a variety of ways.  Another is to
>>> decide that it is the problem of the protocol, in which case
>>> we are probably going to need strategies that result in more
>>> than one lookup (at least sometimes).  We ought to keep the
>>> fact that there is a choice in mind.
>>
>> How about a base IDNA spec that leaves the choice between
>> multiple lookup and bundling to the implementers and
>> registries? Actually, the current IDNA2008 drafts are already
>> there, pretty much. Then the decisions about Eszett, Final
>> Sigma, ZWJ and ZWNJ can be hammered out after the base spec is
>> released.
>
> As you say, in a way the drafts are already there.   They certainly do not
> prevent registries from bundling or from implementing any of the alternate
> strategies.  Some (fairly extensive) previous experience indicates that
> registries are more likely to deal with these issues by sunrise or
> equivalent procedures (or just starting a land rush) than by bundled
> collections of labels.  Some will do that because it is less work, others
> because it is more profitable in their environment, and others because they
> see these newly-available characters as "different things" and want to make
> both labels available.  But the experience so far is that strategies
> involving bundling names into a zone are used mostly for orthographically
> long-term issues (like Simplified and Traditional Chinese), not for
> short-term transitions and that, even then, registries much more often use
> variant techniques to ban registration of possibly-conflicting variants than
> to register multiple names (again, less work and fewer side-effects on,
> e.g., database structures).
>
> Almost all of these characters (final sigma looks like the one exception
> right now) are supported in the IDNA2008 proposals because some group
> demanded them and demanded support for them. A registry or implementation
> that continues to support mappings that de facto makes the characters
> invisible will suffer the wrath of those groups.  What they do about that
> isn't up to us to decide.

Yes, I certainly don't want to risk such wrath.

Erik