Unicode encoding forms (was: Re: Eszett and IDNAv2 vs IDNA2008)
John C Klensin
klensin at jck.com
Fri Mar 20 08:52:33 CET 2009
I would like this WG to get finished. Somehow and with
something. A key element of making that happen is to avoid
getting involved in debates that are seriously out of scope or
long-ago settled somewhere else.
That said, I'm personally a big fan of fixed-length characters,
and that leads me to prefer UTF-32 (UCS-4) to either UTF-8 or
UTF-16 (both of which are variable-width if one considers the
surrogates in the latter). In addition, where characters do not
exist in precomposed form, the combination of a base
character and a non-spacing mark eliminates several of the
advantages of going to UTF-32. But, in this particular context,
knowing about that preference doesn't do either of us any good.
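
As a rough illustration of the width point above, here is a small Python sketch. The helper name code_units and the sample characters are chosen here purely for illustration; nothing in it comes from the IDNA documents. It counts code units per encoding form, then shows that a base character plus a non-spacing mark with no precomposed form still takes two UTF-32 code units.

```python
import unicodedata


def code_units(text):
    """Count code units needed for text in each Unicode encoding form."""
    return {
        "utf-8": len(text.encode("utf-8")),           # 8-bit code units
        "utf-16": len(text.encode("utf-16-be")) // 2,  # 16-bit code units
        "utf-32": len(text.encode("utf-32-be")) // 4,  # 32-bit code units
    }


# Illustrative samples: ASCII letter, Eszett, Euro sign, and a code point
# outside the BMP (which needs a surrogate pair in UTF-16).
for ch in ["a", "\u00df", "\u20ac", "\U00010348"]:
    print(f"U+{ord(ch):04X}", code_units(ch))

# "q" followed by COMBINING TILDE has no precomposed form, so NFC
# normalization leaves it as two code points.
combined = unicodedata.normalize("NFC", "q\u0303")
print(len(combined), code_units(combined))
```

So UTF-32 is fixed-width per code point, but not per character as a user would perceive it.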
There is an IETF requirement/preference for UTF-8 that goes
back many years. One can argue it should be changed, but that
is not a problem for this WG. In addition, the two fundamental
problems with UTF-32 -- leading "0" octets causing problems in
string parsing with some well-known programming languages and no
inherent compatibility with ASCII -- don't disappear by wishing.
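
A minimal sketch of those two problems, again in Python. The function c_strlen is a hypothetical stand-in written for this example to mimic how a C-style string routine stops at the first zero octet; it is not a real library call.

```python
def c_strlen(buf):
    """Mimic C strlen(): count the octets before the first zero octet."""
    idx = buf.find(b"\x00")
    return len(buf) if idx < 0 else idx


label = "example"
utf8 = label.encode("utf-8")
utf32 = label.encode("utf-32-be")  # big-endian, no byte order mark

# ASCII compatibility: for pure ASCII text, UTF-8 is octet-for-octet
# identical to ASCII, while UTF-32 is not.
print(utf8 == label.encode("ascii"))
print(utf32 == label.encode("ascii"))

# Zero octets: each ASCII character becomes three zero octets plus one
# data octet in big-endian UTF-32, so a NUL-terminated string routine
# sees an empty string.
print(c_strlen(utf8), c_strlen(utf32), utf32.count(0))
```

With little-endian UTF-32 the zero octets trail each character instead of leading it, so the same routine would stop after the first octet rather than before it.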
If you read the IDNA2008 spec, you will also discover that it is
not written in terms of UTF-8 but is deliberately agnostic about
how Unicode is encoded.
Finally, if you are referring to the i-DNS patch I think you
are, it isn't a patch, but a deliberate and quite significant
violation of the DNS standard. Rather than saying more about
that, please see my note a few hours ago about taking the DNS as
given, rather than trying to make it work in a way different
from what is specified.
--On Friday, March 20, 2009 00:22 +0100 Peter Dambier
<peter at peter-dambier.de> wrote:
> John C Klensin wrote:
>> You will never get case-sensitivity with ."fra" in the DNS
>> unless your intention is to provide an equal-opportunity mess
>> by encoding everything with a prefix, including basic
>> (undecorated) Latin characters. That encoding can't be
>> Punycode, since it won't encode those basic Latin characters.
> John, don't forget most of us are not us-ascii writers and I
> guess most of us don't even use latin at all.
> UTF-8 is a mess, UTF-16 is an excuse and the world really is
> Bind e.g. can do UTF-32.
> Whether Windows or IE can do UTF-32 does not matter. The
> Chinese will replace it sooner or later with something better.
> Try and compare correct implementation
> and buggy Apache stating always UTF-8
> it is the same text trying ISO, UTF-8, UTF-16 and UTF-32 with
> both big-endian and little-endian versions.
> We do have the bandwidth. We can use UTF-32.
> UTF-16 might be favouring the Chinese but UTF-32 is same
> trouble for everybody.
> Ok the site is about presentation and html. Bind is binary. It
> does not care what code you feed it but you probably have to
> implement the IDNS-patch to make it forget folding cases.
> Interestingly enough IE can do UTF-32 but Chrome and Safari,
> both on Windows XP - complained.