Comments on the IDNA2008 document

Mon Jan 5 16:46:45 CET 2009

I have some comments on the IDNA2008 document and thought I'd send them
to this mailing list first.

1.
The document states that allowing names to be interpreted differently by
different applications would cause a "huge interoperability problem."

Then, right after a table listing some examples, the document goes on to
say that "[An IDNA2008-conformant implementation] could even decide,
based on local linguistic mappings, to map #5 and #6 to different valid
domain names".

Do I understand correctly that "huge interoperability problems" will
now become acceptable? Different applications are certain to handle
locales differently.

I see it as an improvement that invalid names are no longer allowed,
i.e. that any name which is not already normalized and in lower case
will be rejected. This makes it unambiguous which name is meant.

Therefore I find it really contradictory that software is allowed to use
"local mapping" to interpret a name in an unpredictable manner. Two
domain names, e.g. "ää.com" and "aa.com" can be owned by two different
entities, so it cannot be acceptable behavior that a name "Ää.com" can
be interpreted as "aa.com" by software running under the US locale, and
as "ää.com" or even "aeae.com" by software running under the German
locale. 

I think software must interpret the name "Ää.com" as "ää.com" and if it
can't, reject it as invalid. 
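
To make the proposal concrete, here is a minimal sketch in Python. It
assumes that "the one interpretation" means lower case plus Unicode NFC,
applied identically under every locale. The function name
canonical_name is made up for illustration, and this is not the mapping
defined in the IDNA2008 drafts.

```python
import unicodedata

def canonical_name(name):
    """Return the single, locale-independent reading of name.

    Assumption: canonical form = lower case + NFC. Raises ValueError
    when the name cannot be interpreted.
    """
    # Neither str.lower() nor NFC normalization consults the process
    # locale, so every application gets the same result.
    folded = unicodedata.normalize("NFC", name.lower())
    for ch in folded:
        # "Cn" is the category of unassigned code points and
        # noncharacters; such input is rejected, not guessed at.
        if unicodedata.category(ch) == "Cn":
            raise ValueError("cannot interpret U+%04X" % ord(ch))
    return folded

print(canonical_name("Ää.com"))
```

Under these assumptions "Ää.com" always becomes "ää.com", whether it
arrives precomposed or as base letters plus combining marks, and it
never becomes "aa.com" or "aeae.com".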

2.
The following sentence seems a bit odd:
"Note also that some browsers allow characters like "_" in domain
names."
RFC 1033 recommends a set of characters for domain name labels which
includes the underscore: [a-zA-Z0-9_-]. It is therefore no surprise that
browsers and other software accept labels containing it.

As an aside, why does the pattern of allowed characters exclude the
underscore character? 

3. 
There is no mention of a requirement for a single language/locale in
IDNA strings. I noticed there was some discussion about this, but no
mention in the document proper. Many security issues with i18n domain
names arise from the use of characters from multiple languages/locales
together. If a requirement were made that every name must contain
characters from only one language/locale, most (maybe all?) of these
problems could be avoided. Using characters from multiple languages in
a domain name is a rare need. I can't actually think of any legitimate
need for such names.

If one used the sets of exemplarCharacters from CLDR, we would even have
a ready database of valid characters. I.e. there is no need for
additional work to classify characters. The work has already been done,
at least for the most part. If some locale doesn't have characters which
are needed, it is easy to add them to the CLDR.
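
A rough sketch in Python of how such a check might look. The character
sets below are hand-written, heavily abbreviated stand-ins and are NOT
real CLDR data; the names EXEMPLARS, ALWAYS_ALLOWED and
locales_covering are hypothetical. The label is assumed to be already
normalized and in lower case.

```python
# Hypothetical stand-ins for CLDR exemplarCharacters (not real data).
EXEMPLARS = {
    "en": set("abcdefghijklmnopqrstuvwxyz"),
    "de": set("abcdefghijklmnopqrstuvwxyzäöüß"),
    "ru": set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя"),
}

# Assumption: digits and the hyphen are acceptable in every locale.
ALWAYS_ALLOWED = set("0123456789-")

def locales_covering(label):
    """Return the locales whose exemplar set contains every character."""
    chars = set(label) - ALWAYS_ALLOWED
    return sorted(loc for loc, allowed in EXEMPLARS.items()
                  if chars <= allowed)

def is_single_locale(label):
    """True if at least one locale covers the whole label."""
    return bool(locales_covering(label))

print(locales_covering("bücher"))
# Latin "p", "y", "p", "l" around a Cyrillic "а" (U+0430):
print(is_single_locale("p\u0430ypal"))
```

With these toy sets, "bücher" is covered by the German set only, while
the spoofed "paypal" with a Cyrillic "а" is covered by no locale and
would be rejected.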

A name can consist of characters in multiple scripts.
E.g. linuxクラブの参加者.com, which contains ASCII, katakana, hiragana
and kanji. These are all used in Japanese, though, and therefore valid
in that locale/language. I can't think of a legitimate use for a name
with multiple languages. Such names will only serve to confuse.
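
The script mix in the Japanese example above can be seen with a few
lines of Python. The standard library does not expose the Unicode
Script property, so this sketch approximates it with the first word of
each character's Unicode name (kanji show up as "CJK"); the function
name scripts_in is made up for illustration.

```python
import unicodedata

def scripts_in(label):
    """Approximate the set of scripts used in a label.

    Assumption: the first word of the Unicode character name is a
    good enough stand-in for the script. Digits and hyphens are
    skipped because they are shared by all scripts.
    """
    found = set()
    for ch in label:
        if ch.isdigit() or ch == "-":
            continue
        found.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return found

print(sorted(scripts_in("linuxクラブの参加者")))
```

A script-based test alone would flag this label as mixed, which is why
the check has to be per language/locale rather than per script.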

I understand that it's difficult to specify that only characters from
some external list (CLDR in this case) are to be allowed. This could be
solved by specifying the version of CLDR, and then later updating only
that part of the document with a revision document. The other option is
to specify that "local" checking of locale is done, meaning that
browsers and other software check the current CLDR, whatever it is.

Cheers,

Troy

-- 
Troy Korjuslommi
+358 40 570 9900
Tksoft Inc.
http://www.tksoft.com/