The Two Lookups Approach (was Re: Parsing the issues and finding a middle ground -- another attempt)

John C Klensin klensin at
Sat Mar 14 17:10:42 CET 2009

--On Saturday, March 14, 2009 08:31 -0700 Erik van der Poel
<erikv at> wrote:

>> I think that is correct.  But "completely independent
>> characters" may be a slightly fuzzy concept in practice.  Two
>> examples...
>> (i) Adding the Chillu to Malayalam, as Unicode 5.1 does,
>> involves completely independent characters, in the sense that
>> nothing was there before and they are not treated as
>> compositions of characters that were in Unicode earlier.   In
>> one way, that makes them "independent" of what has come
>> before, but they profoundly change the way the script is
>> coded, so they do interact and set up a problematic situation.
> Yes, the Chillu issue is an important example because it would
> require triple lookup rather than the double lookup that we
> have been discussing until now. I.e. after the user has
> entered some sequence of characters, the implementation would
> have to try a sequence without ZW characters, with ZW
> characters and with Chillu.
> One possible answer for this script (Malayalam) is to suggest
> that since there are very few Malayalam domain names in use
> today, we should just jump straight to the Chillu, and not
> attempt any complicated transition involving double or triple
> lookup.

But you don't know how many Malayalam domains are in use today.
You could ask a few registries, notably the Indian one, about
the second or maybe the third level, but that wouldn't tell you
about anything lower, nor would it tell you a thing about names
registered as A-labels with TLD registries that don't check for
language or script (i.e., just for validity, if that) and don't
keep track of any language information they get along with the
registration (if they get it at all).  You could also search
Google databases for them, but "have web pages in a domain"
still does not equal "have domain name".  In principle, there
could be a rather large collection of pages in a given language
that reference only each other, and use a local IDN for the
web-related subdomains instead of "www"... I'd guess that your
procedures for locating and indexing pages would never find any
of them.

I think that we can probably accept the risk of consulting the
obvious registries and, if they agree, making the single-step
jump without doing anything more complex, but it isn't because
we have any assurance about the size of the body^H^H^H^H old
registration count.  Instead, it is because the language
community asked for these additional characters and now has to
deal with the consequences of having them.   In that regard,
even though the request went to Unicode in one case and to us
in the other, I see little difference between this and Eszett.

> The big question then is whether we can make a similar jump
> for Eszett and Final Sigma. I'd be interested to hear from the
> German and Greek registries, now that we have figured out how
> to display those characters (via CNAME/DNAME).

Interesting.  I don't think we have figured out anything of the
sort.  What we have "figured out" is how to make CNAME/DNAME do
some of the aliasing job, but that is nothing new.
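For concreteness, the CNAME/DNAME aliasing being discussed is a registry-side zone entry along these lines; the names are hypothetical, and xn--fa-hia is used here only as the commonly cited A-label for "faß":

```zone
; Hypothetical fragment: alias the Eszett spelling to the "ss" spelling.
; Either record would do part of the job; they are alternatives, since a
; name cannot carry both a CNAME and a DNAME.
xn--fa-hia.example.   IN CNAME  fass.example.    ; alias one name
xn--fa-hia.example.   IN DNAME  fass.example.    ; alias the whole subtree
```

As the text notes, this is ordinary DNS aliasing, not a new mechanism, and it does not by itself resolve the registration-policy questions.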

> (And then there's the question of jumping straight to ZW* for
> Farsi, Urdu, etc.)
> It might be a good idea to come up with an open source
> implementation of a utility that generates all (or many) of
> the possible A-labels from a single label with one or more ZW
> characters. This could then be downloaded not only by
> top-level and high-level domain name registries, but also by
> lower-level zone administrators that wish to participate in
> any "jump".

This involves more policy issues, which I think you continue to
underestimate in your search for automated decisions.  As Cary
and others have repeatedly pointed out, decisions as to whether
to do these things by variants, sunrise with the responsibility
on the registrants, or some other mechanism are entirely up to
the registries.  The program you propose is either trivial (just
assume ZWJ and ZWNJ between every pair of characters) or quite
complicated (examine the cases in which those characters are
actually relevant and make a difference).

And that points out another aspect of the ZWJ/ZWNJ issue and
possibly the Eszett and Final Sigma ones.  Whatever restrictions
we apply, sensible registries may want to apply even more --
local context about language and script for ZW* (e.g., further
tightening up on the cases that the global rules don't detect,
as described in Mark's note on that subject), positional
constraints on Eszett and Final Sigma (it is unlikely that there
are any legitimate uses for either as the first character in a
label), and so on.  Whether a registry considers such rules
worth applying is a matter of local choice, but they could do so.
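A registry-local positional rule of the kind described is simple to state in code. This is a sketch of one possible local check, not a rule any registry is known to apply:

```python
ESZETT, FINAL_SIGMA = "\u00df", "\u03c2"

def passes_positional_rules(label):
    """Local registry rule: neither Eszett nor Final Sigma may begin a label,
    since neither has a legitimate use in label-initial position."""
    return not label.startswith((ESZETT, FINAL_SIGMA))
```

Whether to enforce such a check remains, as the text says, a matter of local registry choice layered on top of the global rules.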

So, while I do believe that such a program, perhaps with
switches to control just what it looks for as it computes, would
be useful, I don't think we should get ourselves into a
situation where its absence is blocking or where the number of
labels it generates is used as an excuse to do or not do things.

> It sure would be nice to avoid double and triple lookups...

Yes.  Here is where I say something about wanting a pony.  Or a
DWIM DNS RR.  Ultimately, if there are going to be transitions,
one is going to need to make a choice about where to feel the
pain.  One way is to say, in one form or another, "registry
problem", a problem that registries might choose to deal with in
a variety of ways.  Another is to decide that it is the problem
of the protocol, in which case we are probably going to need
strategies that result in more than one lookup (at least
sometimes).  We ought to keep the fact that there is a choice in
mind.


More information about the Idna-update mailing list