CLDR data (Re: Comments on the IDNA2008 document)

Harald Alvestrand harald at alvestrand.no
Mon Jan 12 20:39:59 CET 2009


Troy wrote:
> I had some comments on the IDNA2008 and thought I'd send them to this
> mailing list first. 
>   
.... (not commenting on 1 and 2)
>
> 3. 
> There is no mention of a requirement for a single language/locale in
> IDNA strings. I noticed there was some discussion about this, but no
> mention in the document proper. Many security issues with i18n domain
> names occur from the use of characters from multiple languages/locales
> together. If a requirement was made that every name must contain
> characters from only one language/locale, most (maybe all?) of these
> problems could be avoided. Using characters from multiple languages in
> a domain name is a rare need. I can't actually think of any legitimate
> need for such names. 
>
> If one used the sets of exemplarCharacters from CLDR, we would even have
> a ready database of valid characters. I.e. there is no need for
> additional work to classify characters. The work has already been done,
> at least for the most part. If some locale doesn't have characters which
> are needed, it is easy to add them to the CLDR.
>
> A name can consist of characters in multiple scripts. 
> E.g. linuxクラブの参加者.com, which contains ASCII, katakana, hiragana and
> kanji. These are all used in Japanese, though, and therefore valid in
> that locale/language. I can't think of a legitimate use for a name with
> multiple languages. Such names will only serve to confuse. 
>
> I understand that it's difficult to specify that only characters from
> some external list (CLDR in this case) are to be allowed. This could be
> solved by specifying the version of CLDR, and then later updating only
> that part of the document with a revision document. The other option is
> to specify that "local" checking of locale is done, meaning that
> browsers and other software check the current CLDR, whatever it is.
I'd strongly object to placing a dependency on CLDR's character lists in 
the standard itself.

Two reasons:

1) Enforcing this requirement would require (not just recommend) that 
the intended locale for each and every domain name be known at 
registration-checking time. Otherwise, there's no way to know what rules 
to enforce.

2) The CLDR database is developed for multiple purposes; there's always 
debate about what should be in it, and there's no way to get at the 
justifications for the exclusions or inclusions.

For instance, the CLDR locale for Norwegian, as picked up from 
http://www.unicode.org/cldr/data/charts/summary/no.html, claims that 
its "standard" characters are these:

[a à b-e é f-o ó ò ô p-z æ ø å]

and its "auxiliary" characters are these:

[á ǎ ã č ç đ è ê í ń ñ ŋ š ŧ ü ž ä ö]

These seem to cover all of the characters in the Norwegian domain name 
registry's rules (http://www.norid.no/navnepolitikk.html#link3), but are 
slightly different - in this case, ã and í are allowed.
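
To make the dependency concrete, here is a minimal sketch of what a 
CLDR-based check might look like, using only the two sets quoted above. 
The function name, the EXEMPLARS table and its single "no" entry are 
assumptions made up for illustration; they are not part of CLDR or of any 
IDNA document. Note that the locale has to be supplied by the caller, 
which is exactly the problem in reason 1.

```python
import string
import unicodedata


def _chars(s):
    """Return the set of NFC-normalized characters in s."""
    return set(unicodedata.normalize("NFC", s))


# Exemplar sets as quoted in this message (CLDR chart for "no").
STANDARD = set(string.ascii_lowercase) | _chars("àéóòôæøå")
AUXILIARY = _chars("áǎãčçđèêíńñŋšŧüžäö")

# Hypothetical per-locale table; only "no" is filled in here.
EXEMPLARS = {"no": STANDARD | AUXILIARY}


def outside_exemplars(label, locale):
    """Return the sorted characters of label that are not in the
    exemplar set for locale.

    Raises KeyError when the locale is unknown: without a locale
    there is no rule set to apply.
    """
    allowed = EXEMPLARS[locale]
    return sorted(_chars(label) - allowed)
```

With these sets, outside_exemplars("blåbær", "no") returns an empty list. 
Digits and the hyphen are not in either quoted set, so a real check would 
have to add them separately, and a registry whose own list differs from 
CLDR's would still need its own table.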

The question of making registrations match with locales has been pushed 
off to the registries, and I think it should stay pushed.

                 Harald




