baking into the protocol

Wed Dec 20 20:29:10 CET 2006

Re: "baked into the protocol": _The_ protocol? There are several
protocols involved here, and I will only list some of them:

(1) DNS. This protocol actually doesn't care what byte values you put
into the labels. Each label is preceded by a length byte, but the top
2 bits of that byte are reserved (they mark compression pointers,
which let a repeated name suffix be sent only once), leaving only 6
bits for the length, so that's where the 63-byte limit per label comes
from. However, higher-level protocols, such as email-related ones,
_do_ care about the byte values in the labels, and you will bump into
all sorts of interoperability problems if you try to use byte values
outside the LDH (letters, digits, hyphen) set. This is why we have
Punycode, which re-encodes Unicode in the LDH set.
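
To make the length-byte point concrete, here is a rough Python sketch
of how a name goes onto the wire and how a Unicode label ends up in
LDH form. The function names encode_name and is_compression_pointer
are made up for illustration; the "idna" codec is the one that ships
with Python and implements the current IDNA (RFC 3490) rules, so treat
the output as an example of today's encoding only.

```python
def encode_name(name: str) -> bytes:
    """Encode a dotted host name into DNS wire format, without compression.

    Hypothetical helper for illustration; assumes the labels are already
    plain ASCII (LDH or the Punycode-based "xn--" form).
    """
    out = bytearray()
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        # Only 6 bits of the length byte are available, hence 1..63 bytes.
        if not 1 <= len(raw) <= 63:
            raise ValueError("label must be 1 to 63 bytes: %r" % label)
        out.append(len(raw))  # top two bits are 00 for an ordinary label
        out += raw
    out.append(0)  # the zero-length root label terminates the name
    return bytes(out)


def is_compression_pointer(length_byte: int) -> bool:
    """True if the top two bits are 11, i.e. the byte starts a pointer."""
    return length_byte & 0xC0 == 0xC0


# Unicode label -> LDH form via Python's built-in IDNA codec.
ace = "b\u00fccher.example".encode("idna").decode("ascii")
wire = encode_name(ace)
print(ace)   # xn--bcher-kva.example
print(wire)  # length byte 13, the label, length byte 7, "example", 0
```

Note that nothing in encode_name looks at scripts at all; the wire
format only sees lengths and bytes.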

(2) Communication between registrar and registry. Some
registry/registrar pairs use more-or-less standardized protocols such
as the Registry Registrar Protocol (RRP) or the Extensible
Provisioning Protocol (EPP). This is one area where it might be
possible to apply mixed script rules, i.e. the registry would simply
say "No" when the registrar attempts to register such a label.

(3) HTML. This is also a protocol in the sense that it crosses the
wire or ether. This is another area where user agents could apply
mixed script rules. One extreme is to simply refuse to perform a DNS
lookup when any of the labels mixes scripts. A less extreme policy
would be to refrain from displaying the Unicode version of a label
when that label mixes scripts.
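
The two user-agent policies just described could look roughly like
this in Python. Everything here is an assumption for illustration: the
function names are invented, and the script detection is a crude
heuristic (the first word of each letter's Unicode character name),
because Python's standard library does not expose the Unicode Script
property. A real implementation would use proper script data.

```python
import unicodedata


def scripts_in(label: str) -> set:
    """Guess the scripts used by the letters in a label (heuristic only)."""
    found = set()
    for ch in label:
        if ch.isalpha():  # digits and hyphens are shared by all scripts
            found.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return found


def may_look_up(label: str) -> bool:
    """Extreme policy: refuse the DNS lookup for a mixed-script label."""
    return len(scripts_in(label)) <= 1


def display_form(label: str) -> str:
    """Less extreme policy: show the ASCII form of a mixed-script label."""
    if len(scripts_in(label)) > 1:
        return label.encode("idna").decode("ascii")
    return label


spoof = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A
print(sorted(scripts_in(spoof)))    # ['CYRILLIC', 'LATIN']
print(may_look_up(spoof))           # False
print(display_form(spoof))          # an "xn--" label instead of Unicode
print(display_form("b\u00fccher"))  # single script, shown unchanged
```

A registry could apply the same kind of check at registration time and
reject the label instead of changing how it is displayed.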

So, as you can see, there really is no distinction between "bak[ing]
into the protocol" and having "registr[ies] and/or user-agents" apply
rules. There is no magical interceptor in the DNS infrastructure that
could block certain operations based on mixed script rules. The
registries and the user agents _are_ the ones that could perform some
action based on mixed script rules.

Now, if by "protocol" you are referring to the rules in a future
IDNA200x or the guidelines in a future ICANN document, then I agree
that many people would balk at the idea of prohibiting mixed scripts
in those documents. But then maybe this is just what we need,
initially, until we have a better understanding of the problem or some
progress on this front. In particular, Michael appears to believe that
it might be possible to get the committees to encode additional
characters so that no community would be forced to cross Unicode
script boundaries to write their words. Doesn't Kurdish require Latin
w and q to be mixed into its Cyrillic text? Until we have figured
all of this out, should we simply prohibit certain script mixtures?

Erik

On 12/20/06, Mark Davis <mark.davis at icu-project.org> wrote:
> I tend to agree with Michael on the usefulness of disallowing mixed scripts.
> [...]
> I am not yet, however, so sure that it should be baked into the protocol.
> This is a pretty big hammer, and it may be better to leave it to the
> registrars and/or the user-agents, which have a lot more flexibility.