Please review application/shf+xml

Linus Walleij triad at df.lth.se
Wed Oct 29 18:02:23 CET 2003


On Thu, 30 Oct 2003, MURATA Makoto wrote:

> > * We define that shf+xml will use UTF-8 and UTF-16 only, for reasons of
> >   simplicity.
>
> Which UTF-16?  Unfortunately, there are three charsets for UTF-16.
> They are "utf-16le", "utf-16be" and "utf-16" (see RFC 2781).

The XML specification says:

 Entities encoded in UTF-16 must begin with the Byte Order Mark described
 by Annex F of [ISO/IEC 10646], Annex H of [ISO/IEC 10646-2000], section
 2.4 of [Unicode], and section 2.7 of [Unicode3] (the ZERO WIDTH NO-BREAK
 SPACE character, #xFEFF). This is an encoding signature, not part of
 either the markup or the character data of the XML document. XML
 processors must be able to use this character to differentiate between
 UTF-8 and UTF-16 encoded documents.

As easy as it gets :-)
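
A minimal sketch of that check in Python, for illustration only. The function name detect_utf_encoding is hypothetical, and the sketch assumes the entity can only be UTF-8 or UTF-16, as proposed for shf+xml:

```python
import codecs


def detect_utf_encoding(data: bytes) -> str:
    """Guess UTF-8 vs UTF-16 from the first bytes of an XML entity.

    Assumption: only UTF-8 and UTF-16 are allowed, so UTF-32 and other
    encodings are deliberately not considered here.
    """
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "utf-16-le"
    # No UTF-16 byte order mark: treat the entity as UTF-8
    # (with or without the optional EF BB BF signature).
    return "utf-8"
```

The byte order mark settles the byte order as well, so a reader following this rule does not need a separate "utf-16le" or "utf-16be" label.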

> Since this XML format describes hexadecimal data, almost every character
> is US-ASCII.  I wonder why we have to double the file size by representing
> a US-ASCII character with 16 bits.  1MB in UTF-8 becomes 2MB in UTF-16.

That's a good point. OK, it's UTF-8 only then.
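
The size argument can be checked with a small Python sketch. The sample string is a made-up stand-in for hexadecimal payload, not real SHF data:

```python
# Hypothetical payload: hexadecimal digits only, so every character is US-ASCII.
sample = "0123456789ABCDEF" * 4

utf8_size = len(sample.encode("utf-8"))
# An explicit byte order is used so that Python adds no byte order mark
# and the comparison covers the character data alone.
utf16_size = len(sample.encode("utf-16-be"))

print(utf8_size, utf16_size)
```

For pure US-ASCII content, each character takes one byte in UTF-8 and two in UTF-16, so the second number is exactly twice the first.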

Linus


