Media types for RDF languages N3 and Turtle

Mon Dec 17 17:15:33 CET 2007

Sean B. Palmer wrote:
> On Dec 17, 2007 3:22 PM, Garret Wilson <garret at globalmentor.com> wrote:
>
>   
>> There exists serious concern regarding the use of a text top-level
>> type for N3. See the recent discussion on www-rdf-comments.
>>     
>
> Eric and I discussed that in some detail prior to and subsequent to
> the start of this thread. One thing that I don't understand is what
> you said here:
>
> [[[
> I suppose that, with the popular understanding that RFC 2046 requires a
> default character set of US-ASCII if there is no charset parameter, then
> it's almost as true as if RFC 2046 said so explicitly.
> ]]] - http://lists.w3.org/Archives/Public/www-rdf-comments/2007OctDec/0017
>
> It seems very clear to me that RFC 2046 states explicitly that
> US-ASCII is required if there is no charset parameter. Here are the
> relevant quotes:
>
>    The default character set, which must be assumed in the absence
>    of a charset parameter, is US-ASCII.
>
>    ...
>
>    Note that the character set used, if anything other than US-ASCII,
>    must always be explicitly specified in the Content-Type field.
>
> The way I read that, that doesn't leave any room for a text/anything
> specification setting its own default.
>   

In the excerpt you presented, "The default character set...", it must be 
asked, "the default character set of what?" My literal reading of RFC 
2046 (which may not be correct) led me to believe that this only applied 
to text/plain. Let's look at the whole section, including parts you left 
out:

   A critical parameter that may be specified in the Content-Type field
   for "text/plain" data is the character set.  This is specified with a
   "charset" parameter, as in:

     Content-type: text/plain; charset=iso-8859-1

   Unlike some other parameter values, the values of the charset
   parameter are NOT case sensitive.  The default character set, which
   must be assumed in the absence of a charset parameter, is US-ASCII.

The first sentence led me to believe that we are talking about 
"text/plain" and "text/plain" only. Therefore, "The default character 
set" to me indicated "The default character set of text/plain".

Immediately following this is the paragraph:

   The specification for any future subtypes of "text" must specify
   whether or not they will also utilize a "charset" parameter, and may
   possibly restrict its values as well.  For other subtypes of "text"
   than "text/plain", the semantics of the "charset" parameter should be
   defined to be identical to those specified here for "text/plain",
   i.e., the body consists entirely of characters in the given charset.
   In particular, definers of future "text" subtypes should pay close
   attention to the implications of multioctet character sets for their
   subtype definitions.

So my interpretation was that 1) the default character set of 
text/plain is US-ASCII, and 2) other "text" subtypes may define 
differently whether and how they utilize a charset parameter.

Again, this may be an incorrect reading. As I mentioned, even if it is 
literally correct, if the rest of the community understands it to mean 
that charset defaults to US-ASCII for all text/* types, the literal 
meaning, which is itself ambiguous, is probably moot.
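
The two readings of the charset default can be contrasted in a short
Python sketch. It is illustrative only: the function name, the
narrow_reading flag, and the SUBTYPE_DEFAULTS table (including a UTF-8
default for text/n3) are assumptions made for this example, not anything
RFC 2046 or a registration defines.

```python
from email.message import Message

# Hypothetical per-subtype defaults, consulted only under the narrow
# reading. The utf-8 entry for text/n3 is an assumption for illustration.
SUBTYPE_DEFAULTS = {"text/n3": "utf-8"}


def effective_charset(content_type, narrow_reading=False):
    """Return the charset a receiver would assume for a Content-Type value.

    narrow_reading=False: the US-ASCII default applies to every text/* type.
    narrow_reading=True:  the US-ASCII default applies to text/plain only,
                          and other text subtypes may set their own default.
    """
    msg = Message()
    msg["Content-Type"] = content_type

    # An explicit charset parameter always wins under either reading.
    explicit = msg.get_param("charset")
    if explicit is not None:
        return explicit.lower()

    # The default discussed here concerns the text top-level type only.
    if msg.get_content_maintype() != "text":
        return None

    media_type = msg.get_content_type()
    if narrow_reading and media_type != "text/plain":
        # May be None if the subtype defines no default of its own.
        return SUBTYPE_DEFAULTS.get(media_type)
    return "us-ascii"
```

Under the common reading, effective_charset("text/n3") yields
"us-ascii"; under the narrow reading the same call with
narrow_reading=True falls through to whatever the subtype declares.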

> As for the CRLF requirement, that CRLF and *only* CRLF be used for
> line breaks, Dan Brickley commented in response to that that text/xml
> was widely regarded troublesome; but it's not clear from his citations
> that CRLF has anything to do with the troublesome nature, only charset
> defaulting.
>   

I don't see how this could cause many problems in practice; I was just 
pointing out that technically XML does not follow RFC 2046 requirements 
because it has its own rules about CR, LF, and CRLF.
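
As a rough illustration of what the CRLF requirement asks for, the
Python sketch below checks whether a text body uses only CRLF line
breaks and converts one that does not. The function names are
hypothetical, and this is a simplification under the assumption that
every bare CR or bare LF is meant as a line break.

```python
import re


def is_canonical_crlf(text):
    """True if every line break in text is a CRLF pair.

    A bare CR (not followed by LF) or a bare LF (not preceded by CR)
    makes the text non-canonical.
    """
    return re.search(r"\r(?!\n)|(?<!\r)\n", text) is None


def to_canonical_crlf(text):
    """Rewrite CRLF, bare CR, and bare LF line breaks as CRLF."""
    # CRLF is listed first so an existing pair is matched as one unit
    # rather than being doubled.
    return re.sub(r"\r\n|\r|\n", "\r\n", text)
```

A document saved with bare LF line endings, as is common for XML and
N3 files on Unix systems, fails the check until it is converted.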

> It seems that most of the problem, as you mentioned in the
> www-rdf-comments thread, is that the text subtree is simply broken.
> RFC 2046 just wasn't written to deal with the Unicode world. Check out
> the following, for example:
>
>    A SINGLE character set that can be used
>    universally for representing all of the world's languages in Internet
>    mail would be preferrable.  Unfortunately, existing practice in
>    several communities seems to point to the continued use of multiple
>    character sets in the near future.  A small number of standard
>    character sets are, therefore, defined for Internet use in this
>    document.
>
> And it defines US-ASCII and ISO-8859-X. It's not RFC 2046's fault that
> it wasn't prescient, but it's *out-of-date* now and perhaps ought to
> be obsoleted so that text/* can be used as intended rather than as
> we're currently forced?
>   

That is what I would prefer. The other option is *not* to obsolete the 
RFC 2046 text rules, in which case no one uses text types as RFC 2046 
defines them, and they become obsolete de facto. In fact, I could easily 
be persuaded just to ignore the US-ASCII and CRLF parts if everyone else 
were to do the same. But surely someone could take the time to write up 
another RFC making this explicit.

> But of course there is the question of what MIME implementations will
> do and what problems, possibly serious, it would cause to, for
> example, make utf-8 the new text/* default. It would need a lot of
> discussion and a new RFC.
>   

The problem with the computer standards process is that it has gotten so 
bureaucratic and slow that it's hard for anything significant to happen 
to solve problems in any short time frame. I applaud anyone who would 
push for a new RFC to update RFC 2046--that's the Right Thing To Do 
here. Unfortunately, I wouldn't place my bets on this happening anytime 
soon.  (It might beat XHTML 2.0 out the door, though.)

> Note that TimBL has never, as far as I know, suggested disregarding
> the charset defaulting requirement, just the CRLF requirement which he
> mightn't even be aware of. And as it seems that the charset defaulting
> is the thing that most people are anxious about, I'd be happy for
> text/rdf+n3; charset=utf-8 or text/n3; charset=utf-8 to go forwards,
> even ignoring the fact that it disregards the CRLF requirement.
>   

I don't like that option; it's too bulky and feels like a hack. I'd 
prefer that either application/* were used, or the RFC 2046 default 
character set were ignored. Either would be a better solution.

Garret