baking into the protocol

Wed Dec 20 21:44:42 CET 2006

--On Wednesday, 20 December, 2006 11:29 -0800 Erik van der Poel
<erikv at google.com> wrote:

> Re: "baked into the protocol": _The_ protocol? There are several
> protocols involved here, and I will only list some of them:
> 
> (1) DNS. This protocol actually doesn't care what byte values you
> put into the labels. There is a length byte that indicates how many
> bytes are in each label, but 2 of the bits in the length bytes are
> used for repetitive substrings, leaving only 6 bits for the length,
> so that's where the 63-byte limit per label comes from.

The DNS protocols are unaffected by any of this, unless someone
proposes to change the fundamentals of the IDNA model.  To my
knowledge, no one who is both sane and who understands the
issues has proposed that (those qualifications leave out, e.g.,
a few ICANN GNSO Council members and some of the participants in
IGF).
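
For readers who want the length-octet arithmetic from the quoted text made concrete, here is a minimal Python sketch. The function names (encode_name, is_compression_pointer) are hypothetical, chosen only for illustration, and the sketch assumes ASCII labels and no name compression. It only shows why a label cannot exceed 63 octets: the top two bits of the length octet are reserved, and both set marks a compression pointer.

```python
MAX_LABEL = 0x3F  # 6 usable bits in the length octet -> at most 63


def is_compression_pointer(octet: int) -> bool:
    """True when the top two bits are both set (a pointer, not a length)."""
    return (octet & 0xC0) == 0xC0


def encode_name(name: str) -> bytes:
    """Encode a dotted ASCII name as length-prefixed labels (no compression)."""
    out = bytearray()
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        if not 0 < len(raw) <= MAX_LABEL:
            raise ValueError(f"label must be 1..63 octets, got {len(raw)}")
        out.append(len(raw))  # length octet; top two bits stay 00
        out += raw
    out.append(0)  # zero-length root label terminates the name
    return bytes(out)


print(encode_name("www.example.com"))
```

Note that nothing in this encoding inspects the byte values inside a label, which is the point made in the quoted text.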

> However, higher-level protocols, such as email-related ones, _do_
> care about the byte values in the labels, and you will bump into all
> sorts of interoperability problems if you try to use byte values
> outside the LDH set. This is why we have Punycode, which re-encodes
> Unicode in the LDH set.

Yes.  Email is not a very interesting example for several
reasons, however.
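
As a small illustration of the re-encoding into the LDH set mentioned in the quoted text, the following Python sketch uses the standard library's built-in "punycode" and "idna" codecs. The "idna" codec implements the original IDNA (RFC 3490) rules, so it is only a stand-in for whatever a future revision specifies. The helper names (to_ace, is_ldh) and the sample label are assumptions made for this example.

```python
import string

# Letters, digits, hyphen: the "LDH" repertoire.
LDH = set(string.ascii_letters + string.digits + "-")


def to_ace(label: str) -> str:
    """Return the ASCII-compatible ("xn--") form of a single label."""
    return label.encode("idna").decode("ascii")


def is_ldh(label: str) -> bool:
    """True when every character of the label is a letter, digit, or hyphen."""
    return all(ch in LDH for ch in label)


label = "b\u00fccher"  # "bücher", a sample non-ASCII label
ace = to_ace(label)

print(label.encode("punycode"))    # raw Punycode: b'bcher-kva'
print(ace)                         # with ACE prefix: xn--bcher-kva
print(is_ldh(label), is_ldh(ace))  # False True
```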

> (2) Communication between registrar and registry. Some
> registry/registrar pairs use more-or-less standardized protocols
> called Registry Registrar Protocol and others. This is one area
> where it might be possible to apply mixed script rules. I.e. the
> registry would simply say "No" when the registrar attempts to
> register such a label.

Registries regularly say "no" to names for all sorts of reasons,
including national regulations about obscenities and names with
religious significance, ICANN/GAC guidelines about country
names, and so on.  This work does not represent a fundamental
change to the protocols that are used in those areas, although
it could suggest changes to tables.  Note that the use of
standardized registrar-registry protocols is very infrequent
below the first operational level of the DNS (typically
second-level domains, but sometimes third).  The use of
non-standard registration protocols within organizations or
enterprises is, however, not at all rare.  And "we" know almost
nothing about those protocols, either individually or as a class.

The class of registry-enforced (and often registry-defined)
restrictions goes well beyond "script".  Many registries define
and enforce language-based rules.  Others define and enforce
other rules about subsets of scripts or about relationships
among names (see, e.g., RFCs 3743, 4290, and 4713).

The important thing for us to remember about restrictions
enforced on registration is that they are voluntary on a
per-registry basis and, barring some radical changes in
international domain name policy, are certain to be generally
ignored below that first operational level.   Implementations of
applications that assume that registries have created and
enforced them will almost certainly end up making names
inaccessible that are, in fact, registered -- a situation that
is clearly bad for the global Internet.  Registry-enforced
restrictions are, in general, very weak.

> (3) HTML. This is also a protocol in the sense that it crosses the
> wire or ether. This is another area where user agents could apply
> mixed script rules. One extreme is to simply refuse to perform a DNS
> lookup when any of the labels mixes scripts. A less extreme policy
> would be to refrain from displaying the Unicode version of a label
> when that label mixes scripts.

See above.  And, because it is a convenient example, note that
application of the latter rule would exclude many domains in
Japan that JPNIC has chosen to register because they are
considered legitimate names.

> So, as you can see, there really is no distinction between
> "bak[ing] into the protocol" and having "registr[ies] and/or
> user-agents" apply rules. There is no magical interceptor in the DNS
> infrastructure that could block certain operations based on mixed
> script rules. The registries and the user agents _are_ the ones that
> could perform some action based on mixed script rules.

I disagree, strongly.  We have the reasonable expectation that
protocol changes -- IDNA (and/or stringprep or nameprep)
changes -- will be implemented globally.  Relying entirely on
registry restrictions, or user agent restrictions that
completely forbid the use of some names that can be registered,
is a recipe for fragmentation of the DNS namespace.

> Now, if by "protocol" you are referring to the rules in a future
> IDNA200x or the guidelines in a future ICANN document, then I agree
> that many people would balk at the idea of prohibiting mixed scripts
> in those documents. But then maybe this is just what we need,
> initially, until we have a better understanding of the problem or
> some progress on this front. In particular, Michael appears to
> believe that it might be possible to get the committees to encode
> additional characters so that no community would be forced to cross
> Unicode script boundaries to write their words. Doesn't Kurdish
> require Latin w and q to be mixed into their Cyrillic text? Until we
> have figured all of this out, should we simply prohibit certain
> script mixtures?

There is another problem with prohibiting script-mixing at the
protocol (IDNA) level and that is that the common,
on-the-street, perception of "the script we use" is different
from the Unicode definitions of "script".  No one is wrong here,
but, if JDNC concludes that Romaji is a necessity and must be
available in mixed names with Kanji and Kana, I don't think we
are in a position to say "no" (although we can _advise_ that
this isn't a good idea).  Similar examples arise with mixtures
of Cyrillic and Roman characters in Russia, even though we are
agreed that this is one of the more dangerous cases of mixed-script
labels (the fact that some strings in Cyrillic can be confused
with names in Latin characters even when they are purely
Cyrillic is one of the arguments why prohibiting mixed scripts
isn't nearly as powerful a tool as is often argued).
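
Both halves of that argument can be seen in a rough Python sketch. It is an approximation under stated assumptions: the standard library does not expose the Unicode Script property, so the hypothetical helper rough_script below takes the first word of each character's Unicode name as a stand-in, which is not the Unicode definition of "script". The sample labels are invented for illustration.

```python
import unicodedata


def rough_script(ch: str) -> str:
    """ASSUMPTION: first word of the character name approximates its script."""
    return unicodedata.name(ch).split()[0]


def scripts(label: str) -> set:
    """Approximate set of scripts used by the letters of a label."""
    return {rough_script(ch) for ch in label if ch.isalpha()}


def is_mixed(label: str) -> bool:
    return len(scripts(label)) > 1


japanese = "\u65e5\u672c\u306eweb"   # Kanji + Hiragana + Latin letters
spoof = "p\u0430ypal"                # Latin with one Cyrillic "a" (U+0430)
pure_cyrillic = "\u0441\u043e\u0441\u043e"  # renders much like Latin "coco"

for label in (japanese, spoof, pure_cyrillic):
    print(ascii(label), sorted(scripts(label)), is_mixed(label))
```

Under this approximation a blanket mixed-script ban rejects the ordinary Japanese label along with the spoof, while the all-Cyrillic look-alike passes untouched.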

    john