Unknown text/* subtypes

Mon Jan 14 01:26:01 CET 2008

> * Ian Hickson <ian at hixie.ch> [2008-01-13 05:47+0000]
> > On Fri, 28 Dec 2007, Frank Ellermann wrote:
> > >
> > > Years later (after 2616bis) it might be possible to upgrade "default
> > > ASCII" to UTF-8, Latin-1 was a dead end.  As soon as we're back to
> > > "default ASCII" just let RFC 2277 finish it off.
> >
> > FWIW, a number of specs are already overriding both MIME and HTTP when it
> > comes to character encodings. For example HTML4 says to not default to any
> > encoding at all [1], CSS defaults to a complicated heuristic [2], HTML5 as
> > currently proposed defaults to an even more complicated heuristic [3], and
> > so on.
> >
> > In the "real world" the implementations are following the heuristics
> > described in CSS2.1 and HTML5 (or something close to them), and those
> > differ for text/css and text/html, so it would seem pointless for HTTP to
> > try to define something here: it would just get ignored.
> >
> > IMHO the best option is for HTTP to stay out of the discussion altogether
> > and let the lower level specs (MIME) and the higher level specs (XML,
> > HTML, CSS, etc, defining the formats) figure it out amongst themselves.

> I think this is consistent with Martin's proposal that HTTP1.1bis not
> set a default encoding
>   http://www.w3.org/2008/01/rdf-media-types#noDefault
> (noting that Frank Ellermann believed the default should be us-ascii for
>  the same effect)
>   http://www.w3.org/2008/01/rdf-media-types#defAscii

> What we still need, however, is an update to 2046 that reflects
> current practice (and eases the discovery process for folks
> registering non-ascii text/ media types). Let's geek out the
> changes we'd like to see.

You might, and I emphasize might, be able to get this changed to a
protocol-specific restriction. (The MIME specifications specify both an
email-specific extension and some more generally useful facilities.) There is
no chance of this rule being lifted in general.

> • CRLF rules:
> [[
>   The canonical form of any MIME "text" subtype MUST always represent
>   a line break as a CRLF sequence.  Similarly, any occurrence of CRLF
>   in MIME "text" MUST represent a line break.  Use of CR and LF
>   outside of line break sequences is also forbidden.
> ]] — RFC2046 §4.1.1 ¶1 http://www.rfc.net/rfc2046.html#s4.1.1.
> is not respected by HTTP1.1, nor is it respected in general when
> shipping text/xml.

> Does anyone rely on any vestige of this rule (e.g. mail clients, MTAs,
> web servers, proxies or clients)?

Not only does email depend on this, conformance to this has been dramatically
strengthened, not weakened, in subsequent revisions of the email protocol
specification. Specifically, RFC 821 was essentially silent on what bare CR and
LF mean, but 2821 and 2821bis (now in last call) both say that bare CR and LF
MUST NOT be sent and if received MUST NOT be treated as CRLF.

This, incidentally, is not the way I personally think things should have been
done. I like the "ignore bare CR, treat LF like CRLF" approach. But my personal
opinion isn't especially relevant - I mention it only to avoid "shoot the
messenger" sorts of responses.
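
To make the strict rule concrete, here is a minimal sketch of a check for
bare CR and bare LF in a message body. The function name and the choice to
report byte offsets are assumptions made for illustration; nothing here is
taken from RFC 2821 beyond the rule that bare CR and LF must not be sent.

```python
import re

# A CR not followed by LF, or an LF not preceded by CR, is "bare".
_BARE_CR_OR_LF = re.compile(rb"\r(?!\n)|(?<!\r)\n")


def find_bare_cr_lf(body: bytes) -> list[int]:
    """Return the byte offsets of every bare CR or bare LF in body.

    Hypothetical helper: an empty list means every CR and LF in the
    body is part of a CRLF pair.
    """
    return [m.start() for m in _BARE_CR_OR_LF.finditer(body)]
```

A sender following the strict reading would refuse to transmit any body for
which this returns a non-empty list, rather than trying to repair it.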

> I would like to think that MIME
> shouldn't care about recognizing new lines in the text block.

I'm sorry, but that's fanciful in the extreme.

> If it can't go away, can it be relaxed in accordance with HTTP 1.1
> [[
>   The line terminator for message-header fields is the sequence CRLF.
>   However, we recommend that applications, when parsing such headers,
>   recognize a single LF as a line terminator and ignore the leading
>   CR.
> ]] — RFC2616 §19.3 ¶3 http://www.rfc.net/rfc2616.html#s19.3

Again, I personally think this is the way to go. But that's not what
has happened.
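
For comparison, the tolerant parsing recommended in the quoted RFC 2616
text can be sketched as follows. This is only an illustration of "recognize
a single LF as a line terminator and ignore the leading CR"; the function
name is hypothetical and header folding is deliberately not handled.

```python
def split_header_lines(data: bytes) -> list[bytes]:
    """Split header bytes into lines, accepting CRLF or a lone LF.

    Every LF ends a line; a CR immediately before that LF is dropped.
    A CR anywhere else is left in the line untouched.
    """
    lines = data.split(b"\n")
    # A trailing terminator yields one empty final chunk; discard it.
    if lines and lines[-1] == b"":
        lines.pop()
    return [line[:-1] if line.endswith(b"\r") else line for line in lines]
```

With this approach `b"Host: a\r\nAccept: b\n"` yields the same two lines
whichever terminator the sender used.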

> or XML 1.1 (which includes NEXT LINE (NEL) and LINE SEPARATOR):
> [[
>    1. the two-character sequence #xD #xA

>    2. the two-character sequence #xD #x85

>    3. the single character #x85

>    4. the single character #x2028

>    5. any #xD character that is not immediately followed by #xA or
>       #x85.
> ]] — XML 1.1 §2.11 ¶2 http://www.w3.org/TR/xml11/#sec-line-ends

> The XML 1.1 rule interacts with character encoding because, while most
> character encodings line up with ascii on CR and LF, clearly none do
> on #x85 and #x2028
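
The five XML 1.1 cases quoted above, and the encoding dependence of #x85
and #x2028, can be sketched like this. The sketch works on already-decoded
text, which is an assumption: it is exactly the decoding step that makes
the rule depend on the character encoding. The names are hypothetical.

```python
import re

# Alternatives are tried left to right, so the two-character
# sequences are matched before the single characters.
_XML11_LINE_END = re.compile("\r\n|\r\x85|\x85|\u2028|\r")


def normalize_xml11_line_ends(text: str) -> str:
    """Translate each XML 1.1 line-end sequence to a single #xA."""
    return _XML11_LINE_END.sub("\n", text)


# CR and LF are the same single octet in most encodings, but NEL and
# LINE SEPARATOR are not: their octets depend on the encoding.
NEL_OCTETS = {
    "utf-8": "\x85".encode("utf-8"),       # b"\xc2\x85"
    "latin-1": "\x85".encode("latin-1"),   # b"\x85"
}
LINE_SEPARATOR_UTF8 = "\u2028".encode("utf-8")  # b"\xe2\x80\xa8"
```

A byte-level scanner that only looks for 0x0D and 0x0A therefore cannot
find these line ends without first knowing the charset.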

> • character encoding:
> [[
> Unlike some other parameter values, the values of the charset
> parameter are NOT case sensitive.  The default character set, which
> must be assumed in the absence of a charset parameter, is US-ASCII.

> The specification for any future subtypes of "text" must specify
> whether or not they will also utilize a "charset" parameter, and may
> possibly restrict its values as well.  For other subtypes of "text"
> than "text/plain", the semantics of the "charset" parameter should be
> defined to be identical to those specified here for "text/plain",
> i.e., the body consists entirely of characters in the given charset.
> In particular, definers of future "text" subtypes should pay close
> attention to the implications of multioctet character sets for their
> subtype definitions.

> The charset parameter for subtypes of "text" gives a name of a
> character set, as "character set" is defined in RFC 2045.  The rules
> regarding line breaks detailed in the previous section must also be
> observed -- a character set whose definition does not conform to these
> rules cannot be used in a MIME "text" subtype.
> ]] — RFC2046 §4.1.2 ¶2-4 http://www.rfc.net/rfc2046.html#s4.1.2.

> When should the "default" character set apply?
>   • no charset parameter
>   • no charset parameter, no fixed encoding for the media type
>   • no charset, no fixed encoding, no internal encoding declaration

> The current text specifies the first, while HTML and CSS count on the
> third. From the use case of "best effort rendering", we are already in
> a state where users who are better-informed than their web or mail
> clients manually set the encoding so they can see the right
> characters. The following heuristics may meet or exceed the user
> experience with today's data while advancing the state of the art to
> enable better rendering with future data:
> [[
> Unlike some other parameter values, the values of the charset
> parameter are NOT case sensitive. The first of the following
> determinants that apply will identify the character set:

>   1. charset parameter

>   2. fixed encoding registered with the media type, if known

>   3. encoding algorithm registered with the media type, if known

>   4. UTF-8 if the document conforms to the UTF-8 encoding pattern

>   5. ISO-8859-1 if all the octets are in [\r\n\x20-\x7e]

>   6. application preference
> ]]

Again, there is absolutely no chance this will fly for email, so it cannot be
written with this degree of generality. And if this is made protocol-specific,
the specifics of any protocol other than email don't belong in an RFC 2046
revision.
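
For readers who want to see the proposed ordering spelled out, the six
determinants in the quoted list can be sketched as a first-match chain.
Everything about the interface is an assumption made for illustration: the
function and parameter names, passing the registered fixed encoding and
sniffing algorithm in as arguments, reading "conforms to the UTF-8 encoding
pattern" as "decodes as strict UTF-8", and the windows-1252 default for the
application preference.

```python
from typing import Callable, Optional

Sniffer = Callable[[bytes], Optional[str]]


def pick_charset(
    body: bytes,
    charset_param: Optional[str] = None,
    fixed_encoding: Optional[str] = None,    # hypothetical registry lookup result
    sniffer: Optional[Sniffer] = None,       # hypothetical registered algorithm
    app_preference: str = "windows-1252",    # assumed default, not from the proposal
) -> str:
    """Return the first determinant that applies, in the proposed order."""
    # 1. charset parameter (values are not case sensitive)
    if charset_param:
        return charset_param.lower()
    # 2. fixed encoding registered with the media type, if known
    if fixed_encoding:
        return fixed_encoding.lower()
    # 3. encoding algorithm registered with the media type, if known
    if sniffer is not None:
        found = sniffer(body)
        if found:
            return found.lower()
    # 4. UTF-8 if the document conforms to the UTF-8 encoding pattern
    try:
        body.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # 5. ISO-8859-1 if all the octets are in [\r\n\x20-\x7e].
    #    Under the strict-decode reading of step 4, such bodies
    #    have already matched there.
    if all(b in (0x0D, 0x0A) or 0x20 <= b <= 0x7E for b in body):
        return "iso-8859-1"
    # 6. application preference
    return app_preference
```

For example, `pick_charset(b"\xe9")` falls through to the application
preference, because the lone octet is neither valid UTF-8 nor in the
printable ASCII range.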

> @@charset constraints — can it have faux line feeds?

> @@bidi? Martin, what do you think?

> @@lowest common denominator:
>   RFC2046 §4.1.2 ¶22 http://www.rfc.net/rfc2046.html#s4.1.2.
> Is it better to encourage the world to write "UTF-8" or "US-ASCII"
> for ascii subset? tension between lcd and one common encoding.

Marking something as utf-8 when it is in fact restricted to the us-ascii subset
has been known to cause problems. I think change in this area is unlikely.

				Ned