MIME Type Review Request: image/svg+xml

Wed Nov 24 17:54:56 CET 2004

On Wednesday, November 24, 2004, 4:41:48 PM, Bjoern wrote:

BH> * Chris Lilley wrote:
>>As you yourself pointed out, per RFC3023
>>
>>   Processors generating XML MIME entities MUST NOT label conflicting
>>   charset information between the MIME Content-Type and the XML
>>   declaration.
>>
>>such content is already non conforming.

BH> It does not actually apply to content...

Yes, that is an ambiguity that needs to be cleared up. It says that
things that generate the content must not do that; if every generator
complied, there would be no such content. Since there is, it's worth
splitting this into two:

- Conformance for XML generators
- Conformance for XML messages (headers plus bodies)

>>In terms of dealing with such content if it still occurs, the XML well
>>formedness rules already handle that in an entirely satisfactory manner
>>and nothing further need be added. These are already well implemented
>>and highly interoperable.

BH> Consider a *UTF-8 encoded* document

BH>   Content-Type: application/xml;charset=iso-8859-1

Since that is application/xml rather than image/svg+xml, it does have a
charset parameter, although the processor that generated it is
non-conforming to the existing RFC 3023. But let's press on to how to
detect or resolve the error.

BH>   <?xml version="1.0"?>
BH>   ...
BH>   <!--Björn-->
BH>   ...

BH> With no BOM and using only US-ASCII characters for the rest of the
BH> document,

Cleverly constructed example. If the processor believes the charset, it
will think the comment says BjÃ¶rn. Worse, as soon as you save it, your
name is misspelled. I'm sure you would not like that, BjÃ¶rn.
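
To make the mis-decoding concrete, here is a minimal Python sketch (the
variable names are illustrative only) of what a receiver sees when it
trusts the iso-8859-1 label on bytes that are really UTF-8:

```python
# The comment text as the generator actually wrote it: UTF-8 bytes.
utf8_bytes = "Bj\u00f6rn".encode("utf-8")   # U+00F6 is o-umlaut

# A receiver that believes charset=iso-8859-1 maps each byte to one
# character, so the two-byte sequence C3 B6 becomes two characters.
misread = utf8_bytes.decode("iso-8859-1")

print(utf8_bytes)  # b'Bj\xc3\xb6rn'
print(misread)     # BjÃ¶rn
```

Decoding under the wrong label raises no error at all, which is why the
mistake only shows up later, once the text has been saved or displayed.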

So in this case, although the processor that generated it is
non-conforming, the content itself is not non-conforming (though it
should be), and the processor that receives it has two possibilities:

a) it can add the missing encoding declaration when processing and when
saving to disk (note that, if the XML happened to be digitally signed
and in canonical XML form, this would break the signature). See RFC 3741.

b) it can note that a required encoding declaration is not present, and
throw a well-formedness error.

Note that both of these choices will break some content and both of these
choices are licensed by the relevant specifications. There is thus
non-interoperability. Note further that, in the case where the charset
parameter is not present, there is 100% interoperability, no breakage,
all in conformance with the existing clauses in RFC 3023 which 3023bis
will retain, since they are proven by implementation experience with
running code to be highly robust and interoperable.
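
Either way, the receiver first has to notice that the two labels
disagree. A minimal sketch in Python, under stated assumptions: the
helper names `declared_encoding` and `labels_conflict` are hypothetical,
the body is assumed to be in an ASCII-compatible encoding, and BOM and
UTF-16 detection are deliberately left out.

```python
import codecs
import re
from typing import Optional

# Matches an encoding pseudo-attribute inside a leading XML declaration.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def declared_encoding(body: bytes) -> Optional[str]:
    """Return the encoding named in the XML declaration, or None."""
    m = _DECL.match(body)
    return m.group(1).decode("ascii") if m else None

def labels_conflict(charset_param: Optional[str], body: bytes) -> bool:
    """True when the MIME charset and the in-document label differ."""
    if charset_param is None:
        return False  # no external label, so nothing to disagree with
    # With no declaration (and, by assumption here, no BOM), XML
    # defaults to UTF-8.
    in_doc = declared_encoding(body) or "utf-8"
    # codecs.lookup normalises aliases such as "UTF8" and "utf-8".
    return codecs.lookup(charset_param).name != codecs.lookup(in_doc).name

body = b'<?xml version="1.0"?><!--Bj\xc3\xb6rn-->'
print(labels_conflict("iso-8859-1", body))  # True
print(labels_conflict(None, body))          # False
```

Whether a detected conflict leads to option a) or option b) is exactly
the choice the specifications leave open; the sketch only shows that
the no-charset case never reaches that choice.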

So, let's take the other case, which is more interesting.

Consider an *8859-1 encoded* document

  Content-Type: application/xml;charset=UTF-8

  <?xml version="1.0"?>
  ...
  <!--Björn-->
  ...

With your proposal, would the well-formedness error (bytes occur that
cannot occur in UTF-8) be silently recovered from when the HTTP header
overrides it, even for an XML processor, while it would continue to
fail in other cases (such as server-side processing)?
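
The failure in this second case can be reproduced with a short Python
sketch (`decode_strict` is a hypothetical helper, standing in for a
processor that refuses to guess):

```python
# The same comment text, this time written out as ISO-8859-1 bytes.
latin1_bytes = "Bj\u00f6rn".encode("iso-8859-1")   # b'Bj\xf6rn'

def decode_strict(data: bytes, charset: str) -> str:
    """Decode with no error recovery, as a strict processor would."""
    return data.decode(charset, errors="strict")

try:
    decode_strict(latin1_bytes, "utf-8")
    outcome = "decoded"
except UnicodeDecodeError:
    # The byte 0xF6 cannot start a valid UTF-8 sequence.
    outcome = "fatal error"

print(outcome)  # fatal error
```

Unlike the first example, this mismatch is detectable from the bytes
alone, so a strict processor stops instead of producing wrong text.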

BH> with your proposal, which of the following behaviors of
BH> implementations would be considered conforming?

(see above for discussion of b and c)

BH>   a) it fails to process the document due to RFC3023bis/XML 1.0 errors

That would be the safest course. Consider what happens if the non-ASCII
character is a euro or some other currency symbol, the document is an
invoice, and it is being processed by an accounting system rather than
by a human being. Accounting systems do not have the luxury of a human
who can look at the invoice, go to View...Character Encoding and try
various possibilities until it seems to look right, then save the
document, edit the local copy and fix up the encoding declaration.

BH>   b) it considers the comment to include "BjÃ¶rn"
BH>   c) it considers the comment to include "Björn"

BH>   * application/xhtml+xml (with no update to RFC3236)

That is an existing type and has an existing charset parameter.
Applications are thus allowed to use it, with all the complications and
breakage that this entails as described above.

BH>   * image/svg+xml (as you propose it)

There is no charset parameter. Processors that generate one and messages
that contain one are in error.

BH> For application/xml / application/xhtml+xml this would currently be b)
BH> as the document includes 0xC3 0xB6 and the encoding is determined to be
BH> ISO-8859-1 which means the sequence above represents "Ã¶".

It would sometimes be b) and sometimes c), depending on the particular
software and whether it's reading from disk on the server or over the
net. I frankly can't understand how you consider this lack of
interoperability to be a desirable thing.
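
As a rough Python illustration of that split (`read_comment` is a
hypothetical stand-in for two code paths in one piece of software,
not any real processor's API), the same bytes give outcome b) over the
net and outcome c) from disk:

```python
from typing import Optional

# Bytes of the comment text in the UTF-8 encoded example document.
comment_bytes = b"Bj\xc3\xb6rn"

def read_comment(data: bytes, charset_param: Optional[str] = None) -> str:
    """Decode using the MIME charset if one arrived, else XML's default."""
    return data.decode(charset_param or "utf-8")

over_the_net = read_comment(comment_bytes, "iso-8859-1")  # header present
from_disk = read_comment(comment_bytes)                   # no header at all

print(over_the_net == from_disk)  # False
```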

-- 
 Chris Lilley                    mailto:chris at w3.org
 Chair, W3C SVG Working Group
 Member, W3C Technical Architecture Group