Unicode encoding forms (was: Re: Eszett and IDNAv2 vs IDNA2008)

John C Klensin klensin at jck.com
Fri Mar 20 08:52:33 CET 2009


Peter,

I would like this WG to get finished.  Somehow and with
something. A key element of making that happen is to avoid
getting involved in debates that are seriously out of scope or
long-ago settled somewhere else.

That said, I'm personally a big fan of fixed-length characters,
and that leads me to prefer UTF-32 (UCS-4) to either UTF-8 or
UTF-16 (both of which are variable-width if one considers the
surrogates in the latter).  On the other hand, where characters
do not exist in precomposed form, the combination of a base
character and a non-spacing mark eliminates several of the
advantages of going to UTF-32.  But, in this particular context,
knowing about that preference doesn't do either of us any good.
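
To make the width point concrete, here is a minimal Python sketch.
The helper name encoded_lengths and the sample strings are my own
choices for illustration, not anything taken from a spec.  It counts
octets per encoding form and shows that a base character plus a
combining mark still occupies two code points in UTF-32, even after
NFC normalization, when no precomposed form exists.

```python
import unicodedata


def encoded_lengths(text):
    """Return the number of octets text needs in each encoding form."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-be")),
        "utf-32": len(text.encode("utf-32-be")),
    }


# Illustrative samples (assumptions, chosen to cover each case).
samples = {
    "ASCII letter": "a",
    "eszett": "\u00df",
    "outside the BMP (surrogate pair in UTF-16)": "\U00010348",
    "q + combining dot below (no precomposed form)": "q\u0323",
}

for label, text in samples.items():
    # NFC composes wherever a precomposed character exists.
    nfc = unicodedata.normalize("NFC", text)
    print(label, "code points:", len(nfc), encoded_lengths(nfc))
```

The last sample is the interesting one: it is a single
user-perceived character but two fixed-width units in UTF-32, so
"one unit per character" does not hold even there.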

There is an IETF requirement/preference for UTF-8 that goes
back many years.  One can argue it should be changed, but that
is not a problem for this WG.  In addition, the two fundamental
problems with UTF-32 -- leading "0" octets causing problems in
string parsing with some well-known programming languages and no
inherent compatibility with ASCII -- don't disappear by wishing.
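
Both problems are easy to see in a few lines of Python.  This is
only a sketch: c_string_view is a hypothetical helper that mimics
how a NUL-terminated string routine would read a buffer, and the
label "example" is arbitrary.

```python
def c_string_view(data: bytes) -> bytes:
    """Mimic a NUL-terminated string routine: stop at the first zero octet."""
    end = data.find(b"\x00")
    return data if end == -1 else data[:end]


label = "example"  # arbitrary ASCII-only label, for illustration
utf8 = label.encode("utf-8")
utf32 = label.encode("utf-32-be")

# UTF-8 of ASCII-only text is octet-for-octet the ASCII encoding.
print(utf8 == label.encode("ascii"))

# UTF-32 pads every ASCII character with three leading zero octets.
print(utf32[:8])

# A NUL-terminated reader sees the whole UTF-8 string ...
print(c_string_view(utf8))
# ... but stops at the very first octet of the big-endian UTF-32 form.
print(c_string_view(utf32))
```

With the little-endian form the reader would get one octet further
before hitting a zero, which is not much of an improvement.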

If you read the IDNA2008 spec, you will also discover that it is
not written in terms of UTF-8 but is deliberately agnostic about
how Unicode is encoded.

Finally, if you are referring to the i-DNS patch I think you
are, it isn't a patch, but a deliberate and quite significant
violation of the DNS standard.  Rather than saying more about
that, please see my note a few hours ago about taking the DNS as
given, rather than trying to make it work in a way different
from what is specified.

    john



--On Friday, March 20, 2009 00:22 +0100 Peter Dambier
<peter at peter-dambier.de> wrote:

> 
> 
> John C Klensin wrote:
> 
>> 
>> You will never get case-sensitivity with ."fra" in the DNS
>> unless your intention is to provide an equal-opportunity mess
>> by encoding everything with a prefix, including basic
>> (undecorated) Latin characters.  That encoding can't be
>> Punycode, since it won't encode those basic Latin characters.
>> 
> 
> John, don't forget most of us are not US-ASCII writers and I
> guess most of us don't even use Latin at all.
> 
> UTF-8 is a mess, UTF-16 is an excuse and the world really is
> UTF-32.
> 
> BIND, e.g., can do UTF-32.
> 
> Whether Windows or IE can do UTF-32 does not matter. The
> Chinese will replace it sooner or later with something better.
> 
> Try and compare a correct implementation
> 
> http://www.das-loch-von-koelle.de/UTF/
> 
> and a buggy Apache that always states UTF-8
> 
> http://www.hessen-braucht-sechs.de/UTF/
> 
> It is the same text, tried in ISO, UTF-8, UTF-16 and UTF-32 with
> both big-endian and little-endian versions.
> 
> We do have the bandwidth. We can use UTF-32.
> 
> UTF-16 might be favouring the Chinese but UTF-32 is the same
> trouble for everybody.
> 
> OK, the site is about presentation and HTML. BIND is binary. It
> does not care what code you feed it, but you probably have to
> implement the IDNS-patch to make it forget case folding.
> 
> Interestingly enough, IE can do UTF-32 but Chrome and Safari,
> both on Windows XP, complained.
> 
> Peter





