Unknown text/* subtypes

Sun Jan 13 19:40:12 CET 2008

* Ian Hickson <ian at hixie.ch> [2008-01-13 05:47+0000]
> On Fri, 28 Dec 2007, Frank Ellermann wrote:
> > 
> > Years later (after 2616bis) it might be possible to upgrade "default 
> > ASCII" to UTF-8, Latin-1 was a dead end.  As soon as we're back to 
> > "default ASCII" just let RFC 2277 finish it off.
> 
> FWIW, a number of specs are already overriding both MIME and HTTP when it 
> comes to character encodings. For example HTML4 says to not default to any 
> encoding at all [1], CSS defaults to a complicated heuristic [2], HTML5 as 
> currently proposed defaults to an even more complicated heuristic [3], and 
> so on.
> 
> In the "real world" the implementations are following the heuristics 
> described in CSS2.1 and HTML5 (or something close to them), and those 
> differ for text/css and text/html, so it would seem pointless for HTTP to 
> try to define something here: it would just get ignored.
> 
> IMHO the best option is for HTTP to stay out of the discussion altogether 
> and let the lower level specs (MIME) and the higher level specs (XML, 
> HTML, CSS, etc, defining the formats) figure it out amongst themselves.

I think this is consistent with Martin's proposal that HTTP1.1bis not
set a default encoding
  http://www.w3.org/2008/01/rdf-media-types#noDefault
(noting that Frank Ellermann believed the default should be us-ascii for
 the same effect)
  http://www.w3.org/2008/01/rdf-media-types#defAscii

What we still need, however, is an update to 2046 that reflects
current practice (and eases the discovery process for folks
registering non-ascii text/ media types). Let's geek out the
changes we'd like to see.

• CRLF rules:
[[
  The canonical form of any MIME "text" subtype MUST always represent
  a line break as a CRLF sequence.  Similarly, any occurrence of CRLF
  in MIME "text" MUST represent a line break.  Use of CR and LF
  outside of line break sequences is also forbidden.
]] — RFC2046 §4.1.1 ¶1 http://www.rfc.net/rfc2046.html#s4.1.1.
is not respected by HTTP1.1, nor is it respected in general when
shipping text/xml.
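
For concreteness, here is a minimal Python sketch of what that paragraph
demands of a body, assuming we only look at raw octets in an
ASCII-compatible encoding. The names is_canonical_text and
to_canonical_text are mine, not from any RFC or library.

```python
import re

# A CR not followed by LF, or an LF not preceded by CR, is "bare".
# RFC 2046 section 4.1.1 forbids both in canonical MIME "text".
_BARE_CR_OR_LF = re.compile(rb"\r(?!\n)|(?<!\r)\n")

# Any of the three common line-break spellings.
_ANY_BREAK = re.compile(rb"\r\n|\r|\n")


def is_canonical_text(body: bytes) -> bool:
    """True if every CR and LF in body is part of a CRLF pair."""
    return _BARE_CR_OR_LF.search(body) is None


def to_canonical_text(body: bytes) -> bytes:
    """Rewrite CRLF, bare CR and bare LF as CRLF (hypothetical helper)."""
    return _ANY_BREAK.sub(b"\r\n", body)
```

A typical LF-only text/xml body served over HTTP fails is_canonical_text,
which is the mismatch with current practice noted above.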

Does anyone rely on any vestige of this rule (e.g. mail clients, MTAs,
web servers, proxies or clients)? I would like to think that MIME
shouldn't care about recognizing new lines in the text block.

If it can't go away, can it be relaxed in accordance with HTTP 1.1
[[
  The line terminator for message-header fields is the sequence CRLF.
  However, we recommend that applications, when parsing such headers,
  recognize a single LF as a line terminator and ignore the leading
  CR.
]] — RFC2616 §19.3 ¶3 http://www.rfc.net/rfc2616.html#s19.3
or XML 1.1 (which includes NEXT LINE (NEL) and LINE SEPARATOR):
[[
   1. the two-character sequence #xD #xA

   2. the two-character sequence #xD #x85

   3. the single character #x85

   4. the single character #x2028

   5. any #xD character that is not immediately followed by #xA or
      #x85.
]] — XML 1.1 §2.11 ¶2 http://www.w3.org/TR/xml11/#sec-line-ends

The XML 1.1 rule interacts with character encoding because, while most
character encodings line up with ascii on CR and LF, clearly none do
on #x85 and #x2028.
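
A rough Python sketch of the XML 1.1 list quoted above, operating on
already-decoded text, plus the octets that NEL and LINE SEPARATOR turn
into under two encodings. The function name is an assumption of mine,
and translating each form to a single LF is how I read XML 1.1.

```python
import re

# The five line-end forms from XML 1.1 section 2.11. The two-character
# sequences come first so they win over the lone-CR alternative.
_XML11_EOL = re.compile("\r\n|\r\x85|\x85|\u2028|\r")


def normalize_xml11_line_ends(text: str) -> str:
    """Translate every XML 1.1 line-end form to a single LF."""
    return _XML11_EOL.sub("\n", text)


# The same two characters as octets: the answer depends on the charset,
# which is why a line-break rule that includes them cannot be applied
# before the encoding is known.
NEL_UTF8 = "\x85".encode("utf-8")       # b"\xc2\x85"
LS_UTF8 = "\u2028".encode("utf-8")      # b"\xe2\x80\xa8"
NEL_LATIN1 = "\x85".encode("latin-1")   # b"\x85"
```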

• character encoding:
[[
Unlike some other parameter values, the values of the charset
parameter are NOT case sensitive.  The default character set, which
must be assumed in the absence of a charset parameter, is US-ASCII.

The specification for any future subtypes of "text" must specify
whether or not they will also utilize a "charset" parameter, and may
possibly restrict its values as well.  For other subtypes of "text"
than "text/plain", the semantics of the "charset" parameter should be
defined to be identical to those specified here for "text/plain",
i.e., the body consists entirely of characters in the given charset.
In particular, definers of future "text" subtypes should pay close
attention to the implications of multioctet character sets for their
subtype definitions.

The charset parameter for subtypes of "text" gives a name of a
character set, as "character set" is defined in RFC 2045.  The rules
regarding line breaks detailed in the previous section must also be
observed -- a character set whose definition does not conform to these
rules cannot be used in a MIME "text" subtype.
]] — RFC2046 §4.1.2 ¶2-4 http://www.rfc.net/rfc2046.html#s4.1.2.

When should the "default" character set apply?
  • no charset parameter
  • no charset parameter, no fixed encoding for the media type
  • no charset, no fixed encoding, no internal encoding declaration

The current text specifies the first, while HTML and CSS count on the
third. From the use case of "best effort rendering", we are already in
a state where users who are better-informed than their web or mail
clients manually set the encoding so they can see the right
characters. The following heuristics may meet or exceed the user
experience with today's data while advancing the state of the art to
enable better rendering with future data:
[[
Unlike some other parameter values, the values of the charset
parameter are NOT case sensitive. The first of the following
determinants that apply will identify the character set:

  1. charset parameter

  2. fixed encoding registered with the media type, if known

  3. encoding algorithm registered with the media type, if known

  4. UTF-8 if the document conforms to the UTF-8 encoding pattern

  5. ISO-8859-1 if all the octets are in [\r\n\x20-\x7e]

  6. application preference
]]
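
To make the ordering concrete, a minimal Python sketch of the six
determinants. Everything named here (pick_charset, its parameters, the
"windows-1252" fallback) is hypothetical. One assumption deserves
flagging: I read step 4 as "valid UTF-8 with at least one multi-octet
sequence", since otherwise any body that satisfies step 5 would already
have matched step 4.

```python
import re
from typing import Callable, Optional

# Step 5: only CR, LF and printable ASCII octets.
_STEP5 = re.compile(rb"[\r\n\x20-\x7e]*")


def pick_charset(
    body: bytes,
    charset_param: Optional[str] = None,
    fixed_encoding: Optional[str] = None,
    sniff: Optional[Callable[[bytes], Optional[str]]] = None,
    app_preference: str = "windows-1252",  # hypothetical default
) -> str:
    """Return the first determinant that applies, lowercased."""
    if charset_param:                      # 1. charset parameter
        return charset_param.lower()
    if fixed_encoding:                     # 2. fixed encoding for the type
        return fixed_encoding.lower()
    if sniff is not None:                  # 3. per-type algorithm
        found = sniff(body)
        if found:
            return found.lower()
    try:                                   # 4. UTF-8 pattern (assumption:
        body.decode("utf-8")               #    needs a non-ASCII octet)
        if any(octet >= 0x80 for octet in body):
            return "utf-8"
    except UnicodeDecodeError:
        pass
    if _STEP5.fullmatch(body):             # 5. CR, LF, printable ASCII
        return "iso-8859-1"
    return app_preference                  # 6. application preference
```

Names are lowercased on the way out only because charset values are not
case sensitive; nothing else is normalized.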

@@charset constraints — can it have faux line feeds?

@@bidi? Martin, what do you think?

@@lowest common denominator:
  RFC2046 §4.1.2 ¶22 http://www.rfc.net/rfc2046.html#s4.1.2.
Is it better to encourage the world to write "UTF-8" or "US-ASCII"
for ascii subset? tension between lcd and one common encoding.

@@Content-Transfer-Encoding: Base64
  Content-Type: text/wibbly
How does TE affect this? I suspect it's completely orthogonal.
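
A small sketch of why I suspect that, using Python's standard email
package and reading "TE" as the Content-Transfer-Encoding above. The
charset=utf-8 parameter and the sample body are my additions for
illustration; text/wibbly is not a registered type.

```python
import base64
from email import message_from_string

# Build a tiny message: base64 on the wire, UTF-8 text underneath.
raw = (
    "Content-Type: text/wibbly; charset=utf-8\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    + base64.b64encode("na\u00efve\r\n".encode("utf-8")).decode("ascii")
    + "\n"
)

msg = message_from_string(raw)

# Layer 1: undo the transfer encoding, which yields plain octets.
octets = msg.get_payload(decode=True)

# Layer 2: only now does the charset (and any line-break rule) apply.
text = octets.decode(msg.get_content_charset())
```

The transfer encoding is removed before the charset is consulted, so the
two questions never meet.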

> -- Footnotes --
> 
> [1] http://www.w3.org/TR/html4/charset.html#h-5.2.2
> This text explicitly says that HTTP's default is useless. It then 
> recommends behaviour that is even more useless, but that's another 
> problem altogether...
> 
> [2] http://www.w3.org/TR/CSS21/syndata.html#charset
> 
> [3] http://www.whatwg.org/specs/web-apps/current-work/multipage/section-parsing.html#determining
> 
> Cheers,
> -- 
> Ian Hickson               U+1047E                )\._.,--....,'``.    fL
> http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
> Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

-- 
-eric

office: +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
mobile: +1.617.599.3509

(eric at w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.