Alternate character sets (was: Re: confusing labels)

John C Klensin klensin at
Mon Apr 13 11:52:35 CEST 2009

--On Monday, April 13, 2009 05:29 +0200 Xavier Legoff
<xlegoff at> wrote:

> Dear Mr. Klensin,
> Another input I find interesting from Don Osborn, calling for
> organised versatility in headers and algorithms and to foresee
> transition and parallel solutions.

M. Legoff,

I am working on a more comprehensive note to you and your
colleagues.  As both a matter of courtesy and to reduce the
chance of further misunderstanding, I will send it only when a
French translation has been prepared and verified.

However, in the hope of quickly giving you at least the outline
of a response...

The data in your message is very interesting.  However, it is
not a surprise and it has little or nothing to do with the work
of this working group.  I do have some data on another
international broadcaster and I know that they try to find out
which coded character set (CCS) is most in use by the target
population and then they use that CCS.  So, again, I am not
surprised by what the BBC is doing.

First of all, I hope you understand already that Internet
protocols that deal with actual content -- words, sentences,
paragraphs, and so on -- generally have provisions for
identifying both the language in which the material is written
and the character set used to encode it.  That is true, in
particular, for both email and the web, which can support the
use of any well-defined coded character set or language.  That is,
of course, why the BBC can use those systems on its web pages
and other distributions.
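
As a rough illustration of that labelling, here is a minimal sketch
using Python's standard email library.  The helper name
build_labelled_message and the sample text, charset, and language tag
are assumptions made for the example; they are not taken from the BBC's
systems or from any WG document.

```python
from email.message import EmailMessage


def build_labelled_message(text, charset, language):
    """Build a message whose headers declare its charset and language.

    The function name and the sample values below are illustrative only.
    """
    msg = EmailMessage()
    # set_content() encodes the text and records the charset in Content-Type.
    msg.set_content(text, charset=charset)
    # Content-Language carries the language tag alongside the content.
    msg["Content-Language"] = language
    return msg


msg = build_labelled_message("Bonjour à tous", "iso-8859-1", "fr")
print(msg["Content-Type"])        # includes the charset parameter
print(msg["Content-Language"])    # the declared language tag
print(msg.get_content_charset())  # the charset a receiver would decode with
```

The point is only that the label travels with the content, so the
receiver never has to guess how to decode it.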

The domain name system does not share that property.  There is
a long list of reasons why it cannot accommodate more than one
character coding system and cannot be language-sensitive.  In
practical terms, it is not even clear that a different design
could have done better as long as many domain names are
abbreviations, acronyms, or numbered objects rather than words
in any language: the user who sees a domain name without
specific context has no way to know what language was intended.
Because of this, IDNs were basically impossible before Unicode,
and UTF-8 is the only plausible encoding form for them.
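
To make the ambiguity concrete, here is a small sketch in Python.  The
byte string, the list of legacy charsets, and the function name
candidate_readings are all hypothetical choices for the illustration.
It shows what happens when a label arrives as bare bytes with no
charset attached, and is not a description of how the DNS or IDNA
actually processes labels.

```python
# Hypothetical label bytes with no accompanying charset information.
raw_label = b"\xe9cole"

# A few legacy single-byte charsets, chosen only for illustration.
LEGACY_CHARSETS = ["iso-8859-1", "iso-8859-5", "iso-8859-7", "koi8-r"]


def candidate_readings(raw, charsets):
    """Decode the same bytes under each charset.

    Every decode succeeds, so nothing in the bytes themselves tells
    the receiver which reading the sender intended.
    """
    return {name: raw.decode(name) for name in charsets}


readings = candidate_readings(raw_label, LEGACY_CHARSETS)
for name, text in readings.items():
    print(name, repr(text))

# With one agreed character set (Unicode) and one encoding form (UTF-8),
# the same label has exactly one byte sequence and one reading.
utf8_label = "école".encode("utf-8")
print(utf8_label, repr(utf8_label.decode("utf-8")))
```

Each legacy charset yields a well-formed but different string, which
is the situation a name without a label would put every resolver and
user in.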

That other note will discuss what can reasonably be done about
the situation, but the work that is required is well outside the
scope of this WG.

