IDN and language

Tue Jan 4 23:21:22 CET 2005

--On Tuesday, 04 January, 2005 12:52 -0500 John Cowan
<jcowan at reutershealth.com> wrote:

> John C Klensin scripsit:
> 
>> Returning to the DNS/IDN situation, ICANN has created a
>> recommendation for all TLDs, and a requirement on at least
>> some gTLDs, that languages not be mixed within a label and for
>> registration and use of tables similar to those recommended by
>> RFC 3743.  
> 
> This regulation is going to be completely unenforceable, since
> with a few exceptions (hexagonal French), languages do not
> have bright-line rules saying what words they do and do not
> contain.  Are we to be in the position of saying that
> eigenvector.com may be registered (and is) because the word
> appears in dictionaries, whereas eigenevent.com is ruled out
> because it "mixes" English and German?

John, I am sure that ICANN would welcome your participation as
the various rules/guidelines evolve -- those rules are not an
IETF problem, even though changes to the standard that is used
to label them might be.  One of the things their processes have
in common with the IETF is that they prefer that people actually
try to read and understand documents before attacking them, but
I suppose there are always exceptions.  In particular, the
recommendations of RFC 3743 are about tables of characters, not
dictionary lookup.  If, however, a domain decided to adopt a
canonical dictionary and lookup in it as a registration
criterion, that rule would be perfectly enforceable.  I'd
recommend against it for many reasons, but this would be more or
less up to them.

> Forbidding the mixing of scripts is another matter, although
> in fact some languages are written using more than one
> (Unicode) script.

Whether those languages are a problem or not in the DNS context
depends on whether one wishes to permit a single label to use
both (or all three in at least a few cases I know of) scripts.
Again a per-registry decision and again perfectly enforceable
either way.  Other issues occur if the writing order of
characters in a language obeys specific rules and one chooses to
enforce them (a potential issue with, e.g., Hangul, although,
again, the choice of whether or not to try to enforce is up to
the registry).  But one of the notational problems with using
RFC 3066 would be expressing a rule that a label may contain the
characters of a given language written in, e.g., either a
modified Arabic script or a modified Cyrillic one but not in a
modified Roman ("Latin") one.  Another issue arises when one
wants to permit a character collection that includes the
characters from a given script that are used by two separate
languages -- not all of the characters of that script, but
exactly those characters that fall into the union of the
characters from the script used by the relevant languages.  It
is not clear that the current proposal is much better than
RFC 3066 for handling those cases, but I wonder if anyone has
carefully evaluated whether it would make things worse.
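
As a rough sketch of the two table shapes described above, assuming
a registry keeps one character set per language: the alphabets below
are only illustrative stand-ins, and the names UNION_TABLE, fits and
fits_one_of are hypothetical.

```python
# Illustrative per-language character sets within one script.  These
# are stand-ins for registry tables, not real published tables.
RUSSIAN = set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя")
UKRAINIAN = set("абвгґдеєжзиіїйклмнопрстуфхцчшщьюя")

# Exactly the characters used by either language: a proper subset of
# the Cyrillic script, not the whole script.
UNION_TABLE = RUSSIAN | UKRAINIAN


def fits(label, table):
    """True if every character of the label is in the table."""
    return all(ch in table for ch in label)


def fits_one_of(label, tables):
    """True if the label lies wholly within a single table.

    This models an "either table A or table B, but no mixing" rule,
    e.g. either of two permitted scripts for one language.
    """
    return any(fits(label, t) for t in tables)


print(fits("їё", UNION_TABLE))                 # allowed by the union
print(fits_one_of("їё", [RUSSIAN, UKRAINIAN])) # refused: mixes tables
print(fits("ђак", UNION_TABLE))                # Cyrillic, in neither
```

Either rule is easy to enforce once the tables exist; the open
question is how to name such a table with a language tag.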

      john