The fight about Unicode in the IETF

NOTE added in 1999: This article from 1996 was my attempt to make sense of the discussions about character sets in the IETF context, in particular the arguments for and against ISO 10646 as a universal character set.

Later, the IETF adopted the character set policy of RFC 2277, which recommends UTF-8, an encoding of ISO 10646, as the standard IETF character set.

This page is preserved in memory of those debates, and may help some people remember why there are still people who think the IETF made the wrong decision.

The claim for ISO 10646

Proponents of ISO 10646 claim that it is a universal character set, capable of representing most graphic characters in the world, "and those we have forgotten will be in the next version".

The list of characters in the current ISO 10646/Unicode version can be found using FTP from Unicode.org, together with other Unicode material. Unicode also has a Web server.

NOTE: UNICODE has actively discouraged mirroring of this archive. From the UnicodeData-Readme file:

The complaints against ISO 10646

The main complaint about ISO 10646 is the so-called "Han unification": the decision to merge the Japanese, Chinese and Korean character sets into one character set, in which characters with the same "meaning" and general shape were joined together.
One description of the rules, suggested for addition to the standard, has been put up on the Web by Glenn Adams.

Japanese and Chinese ideographs for the word "dream". Both are represented by codepoint <insert name and number here>

For a UNICODE view of the problem, follow this link into the Unicode Web server.

Other complaints include the size of the characters (16 bits in simplistic implementations like Windows NT), the use of combining accent characters that come after the character they modify rather than before it, the rules for these combining characters (either too liberal or too restrictive, depending on the "level"), and the existence of too many encoding schemes.
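The ordering complaint is easier to see with a concrete sequence. The following is a minimal sketch in present-day Python, not something from the 1996 discussion: it uses the standard unicodedata module, which reflects a much later Unicode version than the one debated here, and the helper name describe is purely illustrative. It shows an "e" followed by a combining acute accent, and compares that sequence with the single precomposed character.

```python
import unicodedata

# "e" followed by U+0301: the combining accent comes AFTER its base letter.
decomposed = "e\u0301"
# The single precomposed character U+00E9 that looks the same on screen.
precomposed = "\u00e9"


def describe(text):
    """Return (code point, character name, combining class) per character.

    Hypothetical helper for illustration only. A combining class of 0
    marks a base character; a nonzero value marks a combining mark.
    """
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))
        for ch in text
    ]


for row in describe(decomposed):
    print(row)

# As raw code point sequences the two spellings are different strings...
print(decomposed == precomposed)
# ...and only compare equal once both are brought to one canonical form.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

A receiver that sees the "e" cannot know the character is complete until it has looked at what follows, which is one reason some people would have preferred the accent to come first.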

The differences between UNICODE and ISO 10646

UNICODE is a product of the UNICODE Consortium. ISO 10646 is a product of the ISO committee JTC1/SC2/WG2.

At the moment they are technically aligned, and the UNICODE Consortium has pledged to keep them technically aligned in the future. However, UNICODE has published some additional documentation on a "character set model" that is used in the design of the system, which is not in the ISO standard. The UNICODE standard also includes some words on character decomposition and character semantics, which ISO has not considered.

ISO plans to allocate characters outside the 16-bit range in the next study period; it remains to be seen how and when the UNICODE Consortium will follow.
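What "outside the 16-bit range" means for the encodings can be sketched as follows. This is an illustration in present-day Python under stated assumptions, not a description of any 1996 implementation: the function name encoded_sizes and the sample characters are my own choices, and Python's "utf-32-be" codec stands in for the fixed four-byte form that ISO 10646 calls UCS-4.

```python
def encoded_sizes(ch):
    """Return the number of bytes one character needs in three encodings.

    Hypothetical helper for illustration. The "-be" codec variants are
    used so that no byte order mark is added to the output.
    """
    return {
        "UTF-8": len(ch.encode("utf-8")),
        "UTF-16": len(ch.encode("utf-16-be")),
        "UTF-32": len(ch.encode("utf-32-be")),
    }


samples = {
    "U+0041 (Latin letter)": "A",
    "U+00E9 (accented letter)": "\u00e9",
    "U+4E2D (Han ideograph)": "\u4e2d",
    "U+10000 (first code point past 16 bits)": "\U00010000",
}

for label, ch in samples.items():
    print(label, encoded_sizes(ch))

# A code point above U+FFFF no longer fits in one 16-bit unit, so the
# 16-bit encoding spends two units (a surrogate pair) on it.
print("\U00010000".encode("utf-16-be").hex())
```

The point of the comparison is that a fixed 16-bit cell stops being sufficient as soon as characters are allocated beyond U+FFFF, so software that assumed "one character, one 16-bit unit" has to change.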
