Please review application/shf+xml

Wed Oct 29 18:16:54 CET 2003

On Wednesday, October 29, 2003, 5:33:26 PM, MURATA wrote:

MM> On Mon, 27 Oct 2003 18:06:46 +0100 (MET)
MM> Linus Walleij <triad at df.lth.se> wrote:

>> * We define that shf+xml will use UTF-8 and UTF-16 only, for reasons of
>>   simplicity.

MM> Which UTF-16?  Unfortunately, there are three charsets for UTF-16.
MM> They are "utf-16le", "utf-16be" and "utf-16" (see RFC 2781).

I believe the answer to this can be found by a case-insensitive string
match between the three strings you give and the string "UTF-16" that I
suggested above.
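
To spell that comparison out, here is a minimal Python sketch. The
function name matching_charsets is made up for illustration; the three
charset labels are the ones listed above.

```python
def matching_charsets(label, candidates):
    """Return the candidates equal to label, ignoring case."""
    return [c for c in candidates if c.casefold() == label.casefold()]

# The three registered UTF-16 charset labels mentioned above.
candidates = ["utf-16le", "utf-16be", "utf-16"]

print(matching_charsets("UTF-16", candidates))  # ['utf-16']
```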

However, my main reason for suggesting both of the encodings that XML
mandates is that the effort needed to detect, and report an error on,
an otherwise perfectly fine XML file that uses UTF-16, and that the
parser would happily accept, seems greater than any utility gained.

MM> Since this XML format describes hexadecimal data, almost every character
MM> is US-ASCII.  I wonder why we have to double the file size by representing
MM> a US-ASCII character with 16 bits.  1MB in UTF-8 becomes 2MB in UTF-16.

You don't *have* to double the file size. Mandating UTF-16 only would,
I agree, double the file size in most cases.

Use UTF-8 unless the quantity of human-readable content is large and
the overall size is smaller in UTF-16.
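
A rough way to check which encoding is smaller for a given document,
sketched in Python. The helper names encoded_sizes and smaller_encoding
are hypothetical, and the sample strings merely stand in for hex data
and for human-readable text. Note that Python's "utf-16" codec prepends
a two-byte byte order mark.

```python
def encoded_sizes(text):
    """Return (UTF-8 size, UTF-16 size) in bytes for the given text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16"))

def smaller_encoding(text):
    """Prefer UTF-8 unless UTF-16 is strictly smaller."""
    utf8_size, utf16_size = encoded_sizes(text)
    return "utf-8" if utf8_size <= utf16_size else "utf-16"

hex_data = "0123456789ABCDEF" * 4   # US-ASCII only, like hexadecimal dumps
readable = "日本語" * 10            # stand-in for non-ASCII descriptive text

print(encoded_sizes(hex_data), smaller_encoding(hex_data))  # (64, 130) utf-8
print(encoded_sizes(readable), smaller_encoding(readable))  # (90, 62) utf-16
```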

>> * We define that in this case the "charset" parameter is omitted, for this
>>   very reason.

MM> If there is more than one choice, I do not think that this is a
MM> good reason to omit the charset parameter.

On the contrary, since every XML parser worldwide accepts both UTF-8
and UTF-16, and since using either of them therefore does not require
an XML declaration with an encoding declaration, there is absolutely
no problem there.
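
As an illustration only, the following Python sketch feeds the same
document to the standard library's expat-based parser in both
encodings, with no XML declaration at all. The element name "doc" and
the helper parse_without_declaration are invented for the example; the
UTF-16 bytes start with a byte order mark, which is what lets the
parser tell the two apart.

```python
import xml.etree.ElementTree as ET

def parse_without_declaration(text, encoding):
    """Encode text (no XML declaration) and parse the resulting bytes."""
    data = text.encode(encoding)   # "utf-16" prepends a byte order mark
    root = ET.fromstring(data)
    return root.tag, root.text

document = "<doc>0A1B2C</doc>"     # hypothetical content, not real SHF

for encoding in ("utf-8", "utf-16"):
    print(encoding, parse_without_declaration(document, encoding))
```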

Forcing the server to construct redundant and frequently erroneous
information in parallel seems especially unwise in this case, because
every parser can accept both encodings and because the string naming
the encoding is not physically present in the file, so the server
would have to analyze the bytes to compute the value.
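
For a sense of what that byte analysis would involve, here is a
minimal Python sketch. It assumes the file is restricted to UTF-8 or
UTF-16 and carries no encoding declaration, so only the byte order
mark is inspected; the function name sniff_charset is hypothetical.

```python
import codecs

def sniff_charset(data):
    """Guess the charset label from the leading bytes of an XML file.

    Assumes only UTF-8 or UTF-16 can occur and that there is no
    encoding declaration to consult.
    """
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"
    # With or without a UTF-8 byte order mark, everything else is UTF-8
    # under the stated assumption.
    return "utf-8"

print(sniff_charset("<doc/>".encode("utf-16")))  # utf-16
print(sniff_charset("<doc/>".encode("utf-8")))   # utf-8
```

This is work the server would repeat for every file served, only to
tell the parser something it determines for itself anyway.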

-- 
 Chris                            mailto:chris at w3.org