IDN and language

John C Klensin john-ietf at jck.com
Tue Jan 4 18:06:37 CET 2005



--On Tuesday, 04 January, 2005 09:38 -0500 Bruce Lilly
<blilly at erols.com> wrote:

>> One is not.  Domain names are strings of characters; only
>> incidentally do they spell out one or more words in one or
>> more languages.  I doubt whether the names "Google," "Yahoo,"
>> and "AltaVista" can be pinned down as belonging to one
>> specific language.
> 
> I was referring specifically to internationalized domain names
> (IDN, RFCs 3490, 3491, 3492, 3743) where the on-the-wire
> domain name continues to be of traditional form (ANSI X3.4
> letters, digits, and hyphen (with restrictions on combinations
> and placement)), but where a certain class of names (those
> beginning with "xn--") are "internationalized" and might be
> presented to users in a different form (which can include
> non-ASCII characters).  That came about because of the
> tendency to associate a domain name (tag) with a natural
> language "name" or legally-registered name (trademark, etc.).
> Whether one considers such associations logical or
> irrational, that is what has happened.  So one could have
> a domain name (beginning with xn--) that is presented by
> an application as "Nestlé.com".  Now certainly some names,
> such as your examples, Kodak, Häagen-Dazs, etc. have no
> language (because they are made-up strings of characters),
> but others do have a specific language.  In skimming through
> the RFCs mentioned above, it appears that there is now some
> provision for language tagging (which was not present in
> earlier versions of IDN).  However, I have not thoroughly
> reviewed those recent additions; therefore it should be
> clear that I have not reviewed the impact of the proposed
> draft changes on IDN or vice versa.  Such a review should
> take place (ideally before the deadline for the New Last
> Call on draft-phillips-langtags-08 (tomorrow!)), but I'm
> not the person to do so as I have only slight interest in
> IDN (I'm one of those who considers associating a tag
> with natural language and/or legally registered names to
> be irrational).  One potential issue is that domain names
> are case-insensitive, and whether lower-case accented
> characters map to/compare with unaccented upper-case
> letters may be a function of language (or culture, or
> political fiat).
>...
> I would add that there is apparently some discussion of
> wreaking similar havoc on local-parts, which appear in
> message-identifiers and email mailbox identifiers (STD 11).
> That too should be evaluated w.r.t. specification of
> language and the proposed changes.

Bruce,

While I'm sympathetic to many of the points you have raised, the
IDN situation is not an issue except in a very narrow sense, and
a similar situation would apply to local-parts if we ever do
something there.  In the IDN case, the protocols are written in
terms of arbitrary Unicode strings and just about have to be --
there has never been a DNS restriction requiring that the labels
be names or words in a language.  The protocols apply some
mapping rules that reject a few characters (and hence the labels
that contain them) and change some characters into others, but
the net effect is still a set of standards written in terms of
strings, not languages.  There has been a good deal of concern
in the DNS community about the potential for deliberately or
accidentally misleading users about domain names and the
consequent opportunities for confusion or outright fraud.  Those
concerns have led to a good deal of work on restrictions about
what strings can be registered, imposing, e.g., rules that the
holder of one string may be the only permitted holder of a
related one and rules that prohibit mixing scripts within a
single label.  These types of rules, especially the latter, are
the "very narrow sense" mentioned above, but they have no impact
on the protocols themselves.  The registration rules actually
differ from zone to zone and can safely do so because, to the
user of the DNS, an unregistered name is an unregistered name
and the distinction as to whether a name is unregistered because
no one wanted it or because some subtle rule prohibited its
registration is not of importance.
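
As a rough illustration of the split described above, the sketch below
separates the protocol-level mapping (strings in, strings out, no
notion of language) from a registry-level rule against mixing scripts
in a label.  It assumes Python's built-in "idna" codec, which
implements the RFC 3490 ToASCII/ToUnicode operations.  The function
names are hypothetical, and the script test is a deliberately crude
approximation that takes the first word of each character's Unicode
name; it is not how any real registry implements its policy.

```python
import unicodedata


def to_ascii(name):
    """Protocol level: map a Unicode domain name to its on-the-wire form."""
    # The "idna" codec applies the nameprep mapping (e.g. case folding)
    # and Punycode-encodes each non-ASCII label with an "xn--" prefix.
    return name.encode("idna").decode("ascii")


def to_unicode(name):
    """Protocol level: map an on-the-wire name back to a display form."""
    return name.encode("ascii").decode("idna")


def scripts_in_label(label):
    """Registry level (hypothetical): crude guess at the scripts in a label."""
    scripts = set()
    for ch in label:
        # ASCII digits and the hyphen are shared by all scripts here.
        if ch.isascii() and (ch.isdigit() or ch == "-"):
            continue
        # Assumption: the first word of the Unicode character name
        # ("LATIN", "CYRILLIC", ...) stands in for the script property.
        scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def registry_accepts(label):
    """Registry level (hypothetical): refuse labels that mix scripts."""
    return len(scripts_in_label(label)) <= 1


if __name__ == "__main__":
    print(to_ascii("Nestl\u00e9.com"))        # on-the-wire form
    print(to_unicode(to_ascii("Nestl\u00e9.com")))
    print(registry_accepts("nestl\u00e9"))    # Latin only
    print(registry_accepts("p\u0430ypal"))    # Latin plus Cyrillic U+0430
```

Note that to_ascii() converts the mixed-script label without complaint;
only the zone's own registration policy, modeled here by
registry_accepts(), would turn it away.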

The situation with local-parts will, most of us are convinced,
work out in much the same way.  There is a long history of
strings used in local-parts that are not "names", "words", or
otherwise bound to a particular language.  Worse, different
destination systems apply different internal syntax rules and
interpretations to local-part strings.  Protocols will need to
be designed to reflect that history and avoid unreasonable
restrictions.  At the same time, I would expect the
administrators of a given local system to impose restrictions
on what local-parts can be used for mailboxes there (just
as is often done today).  Those restrictions may, in many
cases, reflect assumptions about languages and/or scripts but,
since they are purely local conventions, there is no need for
external registration.

Returning to the DNS/IDN situation, ICANN has created a
recommendation for all TLDs, and a requirement on at least some
gTLDs, that languages not be mixed within a label and for
registration and use of tables similar to those recommended by
RFC 3743.  Those tables are identified by a combination of the
domain name associated with the registering TLD registry and an
RFC 3066 code.  That system is not, IMO, working especially well and
the 3066 code model will, I think, have to be extended to deal
with some unusual situations.   But, interestingly,
draft-phillips... doesn't appear to solve that particular
problem: what is needed is a way to specify odd mixtures of
languages and/or scripts that may be appropriate to a particular
zone, and that means less specificity and more
linguistically-strange constructions, not more specificity and
structure.  

     john




