xd- with C/DNAME (was: Re: The Two Lookups Approach (was Re: Parsing the issues and finding a middle ground -- another attempt))
John C Klensin
klensin at jck.com
Sun Mar 15 17:49:02 CET 2009
--On Saturday, March 14, 2009 10:48 AM -0700 Erik van der Poel
<erikv at google.com> wrote:
>...
>>> The big question then is whether we can make a similar jump
>>> for Eszett and Final Sigma. I'd be interested to hear from
>>> the German and Greek registries, now that we have figured
>>> out how to display those characters (via CNAME/DNAME).
>>
>> Interesting. I don't think we have figured out anything of
>> the sort. What we have "figured out" is how to make
>> CNAME/DNAME do some of the aliasing job, but that is nothing
>> new.
>
> I have explained CNAME/DNAME with xd--. If you have a concrete
> objection, please present it. Or if I have not explained enough
> details, let me know.
With the understanding that this is a real issue, not
nit-picking: proposals around here come in I-D form, if only
because that makes it easier to determine whether they are
complete or require a lot of context and assumptions to
understand, and because an I-D can conveniently be circulated to
experts who are not following the list. To a considerable
extent, you are proposing an entirely new piece of protocol, and
it has not been described in nearly enough detail to evaluate.
I haven't commented because I haven't seen a proposal, only a
fairly vague statement of an idea.
But I think some of the concerns have already been noted.
Anything tricky with the DNS is going to require more than one
lookup unless you can get the information in the "Additional"
data block, and the latter raises all of the deployment
difficulties and delays that Andrew and others have raised.
Encoding information in the DNS in tricky ways -- ways that might
be recognized by one application and not another -- is an
invitation to security vulnerabilities as well as to
user-surprising inconsistencies. I don't know that an adequate
analysis has been done of how one tests the display
information to be sure it isn't too different from the base
information, and doing so may violate the basic IDNA constraint
of not requiring DNS changes, i.e., not requiring non-IDNA-aware
applications (and DNS clients and servers) to be even a little
aware of IDNA. DNAMEs, especially when the data must be
synthesized, are not cheap, and I haven't seen an analysis yet
of the implications for caches and cache rotation and the effects
those might have on overall DNS performance.
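To make the synthesis cost concrete: a DNAME at an owner name obliges
the server to fabricate a CNAME answer for every query name under that
owner, substituting the owner suffix with the DNAME target. A minimal
sketch of that substitution (the domain names here are hypothetical
placeholders, not part of any proposal):

```python
def synthesize_cname(qname, dname_owner, dname_target):
    """Given a record <dname_owner> DNAME <dname_target>, compute the
    CNAME target the server must synthesize for a query below the owner.
    Pure string manipulation on dotted names; names are hypothetical."""
    suffix = "." + dname_owner
    if not qname.endswith(suffix):
        raise ValueError("query name is not below the DNAME owner")
    prefix = qname[: -len(suffix)]       # labels above the DNAME owner
    return prefix + "." + dname_target   # synthesized CNAME target

# A server holding "example.com DNAME example.net" must compute this
# for every distinct query name under example.com -- nothing is stored.
print(synthesize_cname("www.sub.example.com", "example.com", "example.net"))
# -> www.sub.example.net
```

Each such answer is computed per query name, which is part of why the
cache and performance implications deserve analysis.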
Even for IE7, which seems to be your primary concern, going to a
different resolution logic -- looking up different QTYPEs under
different circumstances, presumably based on a list of
characters -- is an entirely different operation from what the
traditional plug-ins have done, which is simply to map one
URL into another. Shawn or Andrew may want to comment on this
further, but I think you are into the range of a far more
serious deployment problem than you anticipate, one that might
be sufficient to make the idea useless in any relevant time
frame.
As we have also seen with Eszett and characters with diaeresis
and other decorations, the boundary between display optimization
and something one would want to use for comparisons is thin, and
sometimes a little dependent on outside considerations and/or
subjective ones. I don't think any of us would want to deploy a
display-assistant procedure that some implementations also tried
to treat as clues about matching and others didn't, leading to
far more incompatibility than what we are trying to remedy.
Please note that some of those comments are more of the nature
of "this needs to be examined carefully" rather than "this is a
deficiency". And this still might be the right thing to do
under the "least stinky" principle, although I doubt it.
> At this point, I believe that xd--
> should be limited to characters that we have identified as
> problematic. I.e. characters with three-way relationships
> (Eszett/ss/SS and Final/Capital/Normal Sigma), IDNA2003 "map
> to nothing" characters including ZWJ/ZWNJ, Unicode 3.2
> upper-case characters that now have lower-case counterparts in
> Unicode 5.1, and characters that have unfortunately received
> different normalizations after Unicode 3.2:
>
> http://www.unicode.org/reports/tr46/#Differences_from_IDNA2003
Reducing scope of applicability obviously helps with some of the
issues above, but not with most of them.
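For concreteness, the three-way relationships in question (Eszett and
the sigmas) fall directly out of the standard Unicode case mappings;
a quick Python check shows why no round trip recovers the original:

```python
# Eszett: lower-case ß upper-cases to the two-letter sequence "SS"...
assert "ß".upper() == "SS"
assert "SS".lower() == "ss"   # ...which lower-cases back to "ss", not "ß"

# Greek sigma: two lower-case forms (final ς, medial σ) share the one
# upper-case form Σ, so case-folding loses the final/medial distinction.
assert "ς".upper() == "Σ"
assert "σ".upper() == "Σ"
assert "Σ".lower() == "σ"     # the final form ς is never recovered
```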
However, it is a mistake to concentrate on just these four. The
number of registrations may be on the same order as those for
Malayalam (whatever that is), but, if someone decided to
register abbreviations (or labels intended to represent
old-style numbers) in Hebrew script by just dropping the geresh
(or gershayim), they, too, will now have a transition problem,
since the introduction of those characters with contextual rules
creates a problem almost identical to the "previously mapped to
nothing" one. In that case, the mapping was performed by
IDNA2003; in these two Hebrew cases, it would have been performed
by a registrant who decided that some approximation to the
desired name was better than none at all, even while knowing that
users trying to guess at the name are likely to type the banned
characters and get lookup failures. But, in both situations,
the registry really can't tell what was intended except by
appealing to information that isn't stored in the DNS.
>...
> Certainly, the IETF can go ahead and suggest encoding Eszett
> and Final Sigma under xn--. The question is whether the major
> implementations will continue to map or not. Isn't that why
> we're all here in this WG, to try to agree on this?
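For reference, the ACE forms in question are mechanical to compute:
the "xn--" prefix plus the Punycode (RFC 3492) encoding of the label.
Python's standard punycode codec shows them directly (note this is
just the raw encoding step, not full IDNA processing):

```python
def to_ace(label: str) -> str:
    """ACE form of a single all-non-ASCII label: raw Punycode only;
    a real implementation would apply IDNA preprocessing first."""
    return "xn--" + label.encode("punycode").decode("ascii")

print(to_ace("ß"))   # Eszett      -> xn--zca
print(to_ace("ς"))   # final sigma -> xn--3xa
```

So "encoding Eszett and Final Sigma under xn--" is trivial on the wire;
the open question is what deployed implementations do with the input.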
The major implementations will do what they do, regardless of
what we agree on. I think we are here to make IDNs as useful,
stable, and predictable going forward as possible. That
includes accommodating a broader range of populations in a
useful way and avoiding having to go through this type
of discussion with each Unicode revision that involves any
change that might be conceptually incompatible (the Malayalam
example is again important here) and therefore require a
transition strategy from either the relevant registries or the
protocol. One of the other things I'm not sure of from your
comments/suggestion is whether we might then need "xe--" for
Unicode 5.2, "xf--" for Unicode 6.x, and so on, depending on how
additions and changes affected existing characters and
registrations, or whether we would "just" need to add to the set
of characters which, if they appear in a label, require a
different sort of lookup.
Once we decide what the right thing is to do, we then have an
educational job on our hands, one of convincing the implementers
that they should accommodate the new characters and strategies.
Presumably we do that in conjunction with those who want the
characters to be available and whatever market forces they can
exercise. But the bottom line is that not even the purest form
of the IDNAv2 approach will automagically cause implementers to
do the right thing. Looking up unassigned characters will help
somewhat, but un-updated implementations won't know about
canonical and compatibility mappings that might be associated
with the new characters (even though an updated Stringprep
presumably would), etc.
Of course, if a client has to have a list of characters that
must be looked up in an entirely different way, then it also has
a list of characters that are the only ones that might need a
two-stage lookup arrangement. While I don't like two-stage
lookups any more than you do, it is worth remembering that
CNAMEs also require a two-stage lookup unless the Additional
Information contains all of the needed information (and is
trusted) and DNAMEs can require either that or a complex
synthesis operation. Those requirements do not come for free.
More important, unless I've missed something, they are forever
-- not a transition strategy, but a different form of a new
display-information / presentation advice bag on the side of the
DNS.
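The extra round trips are easy to see in miniature. A toy in-memory
resolver (the zone data below is entirely hypothetical) counts one
query per link of a CNAME chain, which is what a stub resolver pays
whenever the chained answers do not arrive as trusted additional data:

```python
# Toy zone: each name maps to ("CNAME", target) or ("A", address).
ZONE = {
    "xd--label.example": ("CNAME", "xn--label.example"),  # hypothetical alias
    "xn--label.example": ("A", "192.0.2.1"),
}

def resolve(name, zone):
    """Follow CNAMEs to an address, counting the queries a stub
    resolver would issue if nothing came back as additional data."""
    queries = 0
    while True:
        queries += 1
        rtype, rdata = zone[name]
        if rtype == "A":
            return rdata, queries
        name = rdata                     # chase the CNAME target

addr, n = resolve("xd--label.example", ZONE)
print(addr, n)   # the aliased name costs two lookups instead of one
```

And because the alias is permanent zone data, that second lookup is
paid forever, not just during a transition.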
Part of the disconnect here is that some of us are primarily
concerned with looking forward -- how do we organize things so
that they work better and more predictably, for more people, in
the long run, and then how do we get there (which includes making
sure there are plausible transition strategies in the shorter
term). It appears to me that you and others keep looking
backward onto variations of "all IDNA2003 decisions are forever
and therefore require elaborate workarounds to be added to the
protocols".
>>> It sure would be nice to avoid double and triple lookups...
>>
>> Yes. Here is where I say something about wanting a pony.
>> Or a DWIM DNS RR. Ultimately, if there are going to be
>> transitions, one is going to need to make a choice about
>> where to feel the pain. One way is to say, in one form or
>> another, "registry problem", a problem that registries might
>> choose to deal with in a variety of ways. Another is to
>> decide that it is the problem of the protocol, in which case
>> we are probably going to need strategies that result in more
>> than one lookup (at least sometimes). We ought to keep the
>> fact that there is a choice in mind.
>
> How about a base IDNA spec that leaves the choice between
> multiple lookup and bundling to the implementers and
> registries? Actually, the current IDNA2008 drafts are already
> there, pretty much. Then the decisions about Eszett, Final
> Sigma, ZWJ and ZWNJ can be hammered out after the base spec is
> released.
As you say, in a way the drafts are already there. They
certainly do not prevent registries from bundling or from
implementing any of the alternate strategies. Some (fairly
extensive) previous experience indicates that registries are
more likely to deal with these issues by sunrise or equivalent
procedures (or just starting a land rush) than by bundled
collections of labels. Some will do that because it is less
work, others because it is more profitable in their environment,
and others because they see these newly-available characters as
"different things" and want to make both labels available. But
the experience so far is that strategies involving bundling
names into a zone are used mostly for orthographically long-term
issues (like Simplified and Traditional Chinese), not for
short-term transitions, and that, even then, registries much more
often use variant techniques to ban registration of
possibly-conflicting variants than they register multiple names
(again, less work and fewer side-effects on, e.g., database
structures).
Almost all of these characters (final sigma looks like the one
exception right now) are supported in the IDNA2008 proposals
because some group demanded them and demanded support for them.
A registry or implementation that continues to support mappings
that de facto make the characters invisible will suffer the
wrath of those groups. What they do about that isn't ours to
decide.
regards,
john