looking up domain names with unassigned code points

Mon May 12 17:07:42 CEST 2008

--On Sunday, 11 May, 2008 20:35 -0400 Vint Cerf
<vint at google.com> wrote:

> Jefsey, I think what John meant is that the RFCs did not
> impose the restriction, ICANN did.

Yes.  And, as has been pointed out many times, ICANN's
restrictions apply to only a few handfuls of domains (if that
many) out of many millions.  For anyone else, their guidelines
are a suggestion at most.

Jefsey also wrote...

> I do not think this creates any problem as long as users can
> filter in the point-code they do not want to accept in their
> private environment ?

I think we need to be very careful here.  I agree with what you
are saying as I understand it, but I also believe that there are
ways of reading the above that would get us into trouble.

So...

Q: Would it be reasonable for a user to set up a sort of
whitelist of domains to be accepted, with all others being
rejected or producing warnings?

A: Yes.  Whether it would be a good idea or not would depend on
the user and usage patterns, but, if a user wanted to do it, I
don't think we should try to interfere.  Note that this really
has nothing to do with the script in which those domain names
are written or even whether they are LDH or IDNs.  I would also
suppose that the idea would be much more useful on the basis of
domain reputation than on the basis of lexical analysis, but, if
the user is creating explicit lists, there is no need for anyone
else to be concerned about the basis being used.

Q: Would it be reasonable for a user to set up some sort of
algorithm or collection of rules to effectively perform
whitelist selection?  

A: Sure.  And if that algorithm includes rejecting IDNs in
scripts that the user doesn't read, I don't see any problem with
it as long as the user is aware that there is no necessary
relationship between the character set / language / script of the
content reached through a domain name and the script of the
domain name itself.  To avoid getting tangled up in a different
misunderstanding, it is important to remember that all
standard-conforming domain names are based on Unicode, so there
is no question of character set, and that domain names do not
have language bindings except heuristically and possibly at
registration time.
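
A minimal sketch of such a user-side rule, in Python, might look
like the following.  Everything in it is an assumption made for
illustration: the function names are hypothetical, the "script" of
a character is approximated by the first word of its Unicode
character name (Python's standard library has no script property),
and the decoding step is bare Punycode with no IDNA validation.
Code points with no name, which includes unassigned ones, are
reported as UNKNOWN.

```python
import unicodedata

ACE_PREFIX = "xn--"

def to_u_label(label: str) -> str:
    """Decode an "xn--" label with bare Punycode; pass others through."""
    label = label.lower()
    if label.startswith(ACE_PREFIX):
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label

def scripts_in(label: str) -> set:
    """Approximate the scripts used in a label (heuristic, see above)."""
    scripts = set()
    for ch in to_u_label(label):
        if ch in "0123456789-":
            continue  # ASCII digits and hyphen say nothing about script
        name = unicodedata.name(ch, "")
        scripts.add(name.split(" ")[0] if name else "UNKNOWN")
    return scripts

def user_accepts(domain: str, readable_scripts: set) -> bool:
    """True if every label uses only scripts the user has listed."""
    labels = domain.rstrip(".").split(".")
    return all(scripts_in(label) <= readable_scripts for label in labels)
```

A user who lists only {"LATIN"} would accept "example.com" and
reject a name containing a Cyrillic label.  The rule says nothing
about the language of whatever content the name leads to.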

Q: Is it reasonable for someone else to set algorithmic or
heuristic rules that let users see some domain names and not
others?

A: First of all, this goes on today.  We have reputation systems
that filter out or create strong warnings about some domains
based on prior bad uses (e.g., phishing, porn, spyware, or
viruses) of those domains.  We have filters that refuse to
display U-labels based on whether or not the user has the
relevant scripts enabled as part of language choices (e.g., IE)
and other filters that make decisions about U-label display
based on policies of the associated TLD (e.g., Firefox).
Personally, I am much more comfortable with these sorts of
actions if either (i) they generate warnings of one sort or
another rather than rejecting (e.g., refusing to look up) the
name and/or (ii) the user can override the choices.  There is
text in Rationale that is consistent with that view (of course,
it could be changed if there were consensus to do so).
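
The "warn rather than reject, and let the user override" preference
can be sketched in a few lines of Python.  This is an illustration
only: the names Verdict and check, and the idea of passing the
flagged names and the user's overrides as plain sets, are
assumptions and do not describe how any particular browser works.

```python
from enum import Enum

class Verdict(Enum):
    ALLOW = "allow"
    WARN = "warn"

def check(domain: str, flagged: set, user_overrides: set) -> Verdict:
    """Return a display verdict; the lookup itself is never refused."""
    name = domain.rstrip(".").lower()
    if name in user_overrides:
        return Verdict.ALLOW  # the user's explicit choice wins
    if name in flagged:
        return Verdict.WARN   # warn instead of rejecting the name
    return Verdict.ALLOW
```

Note that there is deliberately no "reject" verdict: the filter
only decides what to tell the user, not whether the name may be
looked up.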

But it is an area in which, although I don't think standards
should have much to say about how a user sees or handles names,
we need to be somewhat careful.  Ultimately, at the extremes,
there are only two types of identifier systems.  In one,
identifiers are unique, unambiguous, and universal: one should,
at least in principle, be able to reach the identified object
or, if not, at least be assured that some other object will not
turn up instead.  At the other extreme, we have what, with
apologies to Lewis Carroll, might be called a Humpty Dumpty
naming system, in which words mean whatever one cares to have
them mean and identifiers do not exist except with regard to
each particular interpretation system.

The DNS is clearly designed to be one of the former.  IDNs
should not change that.  If two different people register the
same label with an "xn--" prefix in different zones and do so
with different assumptions about what U-label it will be mapped
into (if it is mapped at all), then the DNS still works because
the respective FQDNs are still unique.  But IDNs essentially
stop working because, for IDNs to be viable, the mappings, in
both directions, between A-labels and U-labels must be
consistent and predictable _and_, given the way the DNS is
constructed, the mappings to be used must not depend on the zone
(or DNS hierarchy) in which the label is embedded.
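
The zone-independence requirement can be illustrated with a short,
self-contained Python sketch.  It assumes bare Punycode (via
Python's built-in "punycode" codec) as a stand-in for the full
A-label/U-label conversion, with none of the validity checks a real
implementation needs; the function names are hypothetical.

```python
ACE_PREFIX = "xn--"

def to_a_label(u_label: str) -> str:
    """Encode a non-ASCII label; ASCII labels are returned unchanged."""
    if u_label.isascii():
        return u_label
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")

def to_u_label(a_label: str) -> str:
    """Decode an "xn--" label; other labels are returned unchanged."""
    lowered = a_label.lower()
    if not lowered.startswith(ACE_PREFIX):
        return a_label
    return lowered[len(ACE_PREFIX):].encode("ascii").decode("punycode")

def display_form(fqdn: str) -> str:
    """Map each label on its own; the surrounding zone is never consulted."""
    return ".".join(to_u_label(label) for label in fqdn.split("."))
```

Because to_u_label sees only the label, the same A-label yields the
same U-label whether it appears under one zone or another, and
to_a_label(to_u_label(x)) gives back x.  A mapping that took the
parent zone as an extra argument would break exactly that property.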

I think there is a place for the latter, more local, type of
identifier as well and that, in particular, one area in which
the Internet is not fully mature yet is that of personal aliases
in which a user can decide what to call a particular object and
have that decision honored by relevant applications software and
environments.  That might include aliases that trigger selection
lists of the "which of these did you really mean?" variety.  But
my personal aliases are normally useful to me and not to you.
If we agree that they should be useful to you, we need to be
running similar software, you need to know that a given personal
alias belongs to me and not to you or someone else, you need to
know where my alias databases are located and have access to
them, and so on.

Confusion between what a user can filter and local decisions
about how strings should be interpreted or mapped between
A-labels and U-labels, or between universally-interpretable
identifiers and personal (or local) aliases, gets us into a lot
of trouble, IMO.

    john