Yes, by "bake into the protocol", I mean incorporate into IDNA200x. And what I'm saying is that I'm a bit leery of putting a mixed-script prohibition into IDNA200x. Not dead set against it, but leery.<br>
<br>Mark<br><br><div><span class="gmail_quote">On 12/20/06, <b class="gmail_sendername">Erik van der Poel</b> &lt;<a href="mailto:erikv@google.com">erikv@google.com</a>&gt; wrote:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Re: "baked into the protocol": _The_ protocol? There are several<br>protocols involved here, and I will only list some of them:<br><br>(1) DNS. This protocol actually doesn't care what byte values you put<br>
into the labels. There is a length byte that indicates how many bytes<br>are in each label, but the top 2 bits of that byte are reserved to mark<br>compression pointers (for repeated name suffixes), leaving only 6 bits for the length, so that's<br>
where the 63-byte limit per label comes from. However, higher-level<br>protocols, such as email-related ones, _do_ care about the byte values<br>in the labels, and you will bump into all sorts of interoperability<br>problems if you try to use byte values outside the LDH (letters, digits, hyphen) set. This is
<br>why we have Punycode, which re-encodes Unicode in the LDH set.<br><br>(2) Communication between registrar and registry. Some<br>registry/registrar pairs use more-or-less standardized protocols,<br>such as the Registry Registrar Protocol, among others. This is one area where
<br>it might be possible to apply mixed script rules. I.e. the registry<br>would simply say "No" when the registrar attempts to register such a<br>label.<br><br>(3) HTML. This is also a protocol in the sense that it crosses the
<br>wire or ether. This is another area where user agents could apply<br>mixed script rules. One extreme is to simply refuse to perform a DNS<br>lookup when any of the labels mixes scripts. A less extreme policy<br>would be to refrain from displaying the Unicode version of a label
<br>when that label mixes scripts.<br><br>So, as you can see, there really is no distinction between "bak[ing]<br>into the protocol" and having "registr[ies] and/or user-agents" apply<br>rules. There is no magical interceptor in the DNS infrastructure that
<br>could block certain operations based on mixed script rules. The<br>registries and the user agents _are_ the ones that could perform some<br>action based on mixed script rules.<br><br>Now, if by "protocol" you are referring to the rules in a future
<br>IDNA200x or the guidelines in a future ICANN document, then I agree<br>that many people would balk at the idea of prohibiting mixed scripts<br>in those documents. But then maybe this is just what we need,<br>initially, until we have a better understanding of the problem or some
<br>progress on this front. In particular, Michael appears to believe that<br>it might be possible to get the committees to encode additional<br>characters so that no community would be forced to cross Unicode<br>script boundaries to write their words. Doesn't Kurdish require Latin
<br>w and q to be mixed into its Cyrillic text? Until we have figured<br>all of this out, should we simply prohibit certain script mixtures?<br><br>Erik<br><br>On 12/20/06, Mark Davis &lt;<a href="mailto:mark.davis@icu-project.org">
mark.davis@icu-project.org</a>&gt; wrote:<br>> I tend to agree with Michael on the usefulness of disallowing mixed scripts.<br>> [...]<br>> I am not yet, however, so sure that it should be baked into the protocol.
<br>> This is a pretty big hammer, and it may be better to leave it to the<br>> registrars and/or the user-agents, which have a lot more flexibility.<br></blockquote></div><br>
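<br>To make the options in the thread concrete, here is a rough Python sketch of a mixed-script check of the kind a registry (point 2) or a user agent (point 3) could apply, together with the 63-byte label limit from point 1. All function names are hypothetical. The script of a character is approximated by the first word of its Unicode character name, which is only a stand-in for the real Unicode Script property, and <code>to_ace</code> skips the Nameprep step that IDNA requires, so treat this as an illustration rather than a conforming implementation.<br>

```python
import unicodedata

# 6 bits are left for the length in a DNS label's length byte.
MAX_LABEL_BYTES = 2**6 - 1  # 63


def scripts_in_label(label: str) -> set[str]:
    """Approximate the set of scripts used by the letters in a label.

    Assumption: the first word of the character name ("LATIN",
    "CYRILLIC", ...) stands in for the Unicode Script property.
    Digits, hyphens and combining marks are ignored.
    """
    scripts = set()
    for ch in label:
        if unicodedata.category(ch).startswith("L"):
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_mixed_script(label: str) -> bool:
    """True if the letters of the label come from more than one script."""
    return len(scripts_in_label(label)) > 1


def to_ace(label: str) -> str:
    """Re-encode a label in the LDH set using Punycode (no Nameprep)."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


def fits_in_dns_label(label: str) -> bool:
    """Check the encoded label against the 63-byte limit."""
    return len(to_ace(label)) <= MAX_LABEL_BYTES


def registry_accepts(label: str) -> bool:
    """Strict policy: the registry simply says "No" to mixed scripts."""
    return fits_in_dns_label(label) and not is_mixed_script(label)


def display_form(label: str) -> str:
    """Less extreme user-agent policy: show the encoded form, not the
    Unicode form, when the label mixes scripts."""
    return to_ace(label) if is_mixed_script(label) else label


if __name__ == "__main__":
    spoof = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A
    print(scripts_in_label(spoof))
    print(registry_accepts(spoof))
    print(display_form(spoof))
```

Note that a blanket rule like <code>registry_accepts</code> would also reject the Kurdish case raised above (Latin w and q inside Cyrillic text), which is exactly the kind of legitimate mixture the thread is worried about.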