The Two Lookups Approach (was Re: Parsing the issues and finding a middle ground -- another attempt)

Erik van der Poel erikv at
Sat Mar 14 18:48:42 CET 2009

On Sat, Mar 14, 2009 at 9:10 AM, John C Klensin <klensin at> wrote:
> --On Saturday, March 14, 2009 08:31 -0700 Erik van der Poel
> <erikv at> wrote:
>>> I think that is correct.  But "completely independent
>>> characters" may be a slightly fuzzy concept in practice.  Two
>>> examples...
>>> (i) Adding the Chillu to Malayalam, as Unicode 5.1 does,
>>> involves completely independent characters, in the sense that
>>> nothing was there before and they are not treated as
>>> compositions of characters that were in Unicode earlier.   In
>>> one way, that makes them "independent" of what has come
>>> before, but they profoundly change the way the script is
>>> coded, so they do interact and set up a problematic situation.
>> Yes, the Chillu issue is an important example because it would
>> require triple lookup rather than the double lookup that we
>> have been discussing until now. I.e. after the user has
>> entered some sequence of characters, the implementation would
>> have to try a sequence without ZW characters, with ZW
>> characters and with Chillu.
>> One possible answer for this script (Malayalam) is to suggest
>> that since there are very few Malayalam domain names in use
>> today, we should just jump straight to the Chillu, and not
>> attempt any complicated transition involving double or triple
>> lookup.
> But you don't know how many Malayalam domains are in use today.
> You could ask a few registries, notably the Indian one, about
> the second or maybe the third level, but that wouldn't tell you
> about anything lower, nor would it tell you a thing about names
> registered as A-labels with TLD registries that don't check for
> language or script (i.e., just for validity, if that) and don't
> keep track of any language information they get along with the
> registration (if they get it at all).  You could also search
> Google databases for them, but "have web pages in a domain"
> still does not equal "have domain name".  In principle, there
> could be a rather large collection of pages in a given language
> that reference only each other, and use a local IDN for the
> web-related subdomains instead of "www"... I'd guess that your
> procedures for locating and indexing pages would never find any
> of them.
> I think that we can probably accept the risk of consulting the
> obvious registries and, if they agree, making the single-step
> jump without doing anything more complex, but it isn't because
> we had any assurance about the size of the body^H^H^H^H old
> registration count.  Instead, it is because the language
> community asked for these additional characters and now has to
> deal with the consequences of having them.   In that regard,
> even though the request went to Unicode in one case and to us
> in the other, I see little difference between this and Eszett.

I agree. I am not suggesting that we count the number of Malayalam
domains. I am simply suggesting that we decide, after discussing it
with people from organizations "close to" Malayalam, whether or not to
make a "jump".
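
The triple lookup described above can be sketched as follows. This is a hypothetical illustration only: the table covers just two of the Unicode 5.1 chillu equivalences (CHILLU N and CHILLU NN), and a real implementation would cover the full set and validate the label first.

```python
# Hypothetical sketch of the "triple lookup" candidate forms for a
# Malayalam label written with the older consonant + virama + ZWJ
# spelling. Only two chillu equivalences are listed here.
CHILLU = {
    '\u0d28\u0d4d\u200d': '\u0d7b',  # NA + VIRAMA + ZWJ -> CHILLU N
    '\u0d23\u0d4d\u200d': '\u0d7a',  # NNA + VIRAMA + ZWJ -> CHILLU NN
}

def triple_lookup_forms(ulabel):
    """Return the with-ZWJ, without-ZW, and chillu forms of a label."""
    with_zwj = ulabel
    without_zw = ulabel.replace('\u200c', '').replace('\u200d', '')
    chillu = ulabel
    for seq, ch in CHILLU.items():
        chillu = chillu.replace(seq, ch)
    return [with_zwj, without_zw, chillu]
```

Each of the three forms would then be converted to an A-label and looked up in turn.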

>> The big question then is whether we can make a similar jump
>> for Eszett and Final Sigma. I'd be interested to hear from the
>> German and Greek registries, now that we have figured out how
>> to display those characters (via CNAME/DNAME).
> Interesting.  I don't think we have figured out anything of the
> sort.  What we have "figured out" is how to make CNAME/DNAME do
> some of the aliasing job, but that is nothing new.

I have explained CNAME/DNAME with xd--. If you have a concrete
objection, please present it. Or if I have not explained enough
details, let me know. At this point, I believe that xd-- should be
limited to characters that we have identified as problematic:
characters with three-way relationships (Eszett/ss/SS and
Final/Capital/Normal Sigma), IDNA2003 "map to nothing" characters
including ZWJ/ZWNJ, Unicode 3.2 upper-case characters that now have
lower-case counterparts in Unicode 5.1, and characters that have
unfortunately received different normalizations after Unicode 3.2.
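
For the two-way cases (Eszett/ss and Final/Normal Sigma), the double lookup amounts to trying the label as entered and then its IDNA2003-mapped form. A minimal sketch in Python, assuming a table of just the mappings named in this thread (a real implementation would use the full IDNA2003 tables) and using the standard xn-- prefix for illustration:

```python
# Sketch (not an official IDNA library): generate both lookup
# candidates for a label containing a "problematic" character.
IDNA2003_MAP = {
    '\u00df': 'ss',      # Eszett -> "ss"
    '\u03c2': '\u03c3',  # Final Sigma -> Normal Sigma
    '\u200c': '',        # ZWNJ ("map to nothing" in IDNA2003)
    '\u200d': '',        # ZWJ  ("map to nothing" in IDNA2003)
}

def to_alabel(ulabel):
    """Encode a U-label as an A-label (ASCII labels pass through)."""
    if ulabel.isascii():
        return ulabel
    return 'xn--' + ulabel.encode('punycode').decode('ascii')

def lookup_candidates(ulabel):
    """Return the unmapped and IDNA2003-mapped A-label candidates."""
    mapped = ''.join(IDNA2003_MAP.get(c, c) for c in ulabel)
    candidates = [to_alabel(ulabel)]
    if mapped != ulabel:
        candidates.append(to_alabel(mapped))
    return candidates
```

For example, "faß" yields both xn--fa-hia (the direct encoding) and fass (the IDNA2003 mapping).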

>> (And then there's the question of jumping straight to ZW* for
>> Farsi, Urdu, etc.)
>> It might be a good idea to come up with an open source
>> implementation of a utility that generates all (or many) of
>> the possible A-labels from a single label with one or more ZW
>> characters. This could then be downloaded not only by
>> top-level and high-level domain name registries, but also by
>> lower-level zone administrators that wish to participate in
>> any "jump".
> This involves more policy issues, which I think you continue to
> underestimate in your search for automated decisions.  As Cary
> and others have repeatedly pointed out, decisions as to whether
> to do these things by variants, sunrise with the responsibility
> on the registrants, or some other mechanism are entirely up to
> the registries.  The program you propose is either trivial (just
> assume ZWJ and ZWNJ between every pair of characters) or quite
> complicated (examine the cases where those characters are
> actually relevant and make a difference).

I agree that the program would not necessarily be trivial. Note that I
said "all (or many) of the possible A-labels".
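
One way to read "all (or many) of the possible A-labels" is to treat each ZWJ/ZWNJ occurrence as optional and enumerate every combination. A sketch, illustrative only: it enumerates the 2**n U-label variants and leaves Punycode encoding and any contextual-rule filtering to the caller.

```python
from itertools import product

ZW = {'\u200c', '\u200d'}  # ZWNJ, ZWJ

def zw_variants(ulabel):
    """Generate every variant of a label obtained by independently
    keeping or dropping each ZWJ/ZWNJ it contains (2**n variants,
    deduplicated)."""
    slots = [(c,) if c not in ZW else (c, '') for c in ulabel]
    return sorted({''.join(choice) for choice in product(*slots)})
```

This is John's "trivial" end of the spectrum; the complicated version would keep only the variants where the ZW character is contextually meaningful.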

> And that points out another aspect of the ZWJ/ZWNJ issue and
> possibly the Eszett and Final Sigma ones.  Whatever restrictions
> we apply, sensible registries may want to apply even more --
> local context about language and script for ZW* (e.g., further
> tightening up on the cases that the global rules don't detect,
> as described in Mark's note on that subject), positional
> constraints on Eszett and Final Sigma (it is unlikely that there
> are any legitimate uses for either as the first character in a
> label), and so on.  Whether a registry considers such rules
> worth applying is a matter of local choice, but they could do so.
> So, while I do believe that such a program, perhaps with
> switches to control just what it looks for as it computes, would
> be useful, I don't think we should get ourselves into a
> situation where its absence is blocking or where the number of
> labels it generates is used as an excuse to do or not do
> something.

Certainly, the IETF can go ahead and suggest encoding Eszett and Final
Sigma under xn--. The question is whether the major implementations
will continue to map or not. Isn't that why we're all here in this WG,
to try to agree on this?

>> It sure would be nice to avoid double and triple lookups...
> Yes.  Here is where I say something about wanting a pony.  Or a
> DWIM DNS RR.  Ultimately, if there are going to be transitions,
> one is going to need to make a choice about where to feel the
> pain.  One way is to say, in one form or another, "registry
> problem", a problem that registries might choose to deal with in
> a variety of ways.  Another is to decide that it is the problem
> of the protocol, in which case we are probably going to need
> strategies that result in more than one lookup (at least
> sometimes).  We ought to keep the fact that there is a choice in
> mind.

How about a base IDNA spec that leaves the choice between multiple
lookup and bundling to the implementers and registries? Actually, the
current IDNA2008 drafts are already there, pretty much. Then the
decisions about Eszett, Final Sigma, ZWJ and ZWNJ can be hammered out
after the base spec is released.
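
Under such a base spec, the multiple-lookup option reduces to client logic like the following sketch (hypothetical; the resolver parameter exists only so the fallback order can be exercised without live DNS):

```python
import socket

def resolve_with_fallback(candidates, resolver=None):
    """Try each candidate A-label in order and return (label, answer)
    for the first one that resolves, or (None, None) if none do."""
    if resolver is None:
        resolver = lambda host: socket.getaddrinfo(host, None)
    for host in candidates:
        try:
            return host, resolver(host)
        except (socket.gaierror, KeyError):
            # gaierror for real DNS lookups; KeyError lets a plain
            # dict stand in as a toy resolver for testing.
            continue
    return None, None
```

A registry that bundles instead would make both candidates resolve, and this client logic would simply stop at the first one.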


More information about the Idna-update mailing list