Media types for RDF languages N3 and Turtle
Garret Wilson
garret at globalmentor.com
Mon Dec 17 17:15:33 CET 2007
Sean B. Palmer wrote:
> On Dec 17, 2007 3:22 PM, Garret Wilson <garret at globalmentor.com> wrote:
>
>
>> There exists serious concern regarding the use of a text top-level
>> type for N3. See the recent discussion on www-rdf-comments.
>>
>
> Eric and I discussed that in some detail prior to and subsequent to
> the start of this thread. One thing that I don't understand is what
> you said here:
>
> [[[
> I suppose that, with the popular understanding that RFC 2046 requires a
> default character set of US-ASCII if there is no charset parameter, then
> it's almost as true as if RFC 2046 said so explicitly.
> ]]] - http://lists.w3.org/Archives/Public/www-rdf-comments/2007OctDec/0017
>
> It seems very clear to me that RFC 2046 states explicitly that
> US-ASCII is required if there is no charset parameter. Here are the
> relevant quotes:
>
> The default character set, which must be assumed in the absence
> of a charset parameter, is US-ASCII.
>
> ...
>
> Note that the character set used, if anything other than US-ASCII,
> must always be explicitly specified in the Content-Type field.
>
> The way I read that, that doesn't leave any room for a text/anything
> specification setting its own default.
>
In the excerpt you presented, "The default character set...", it must be
asked, "the default character set of what?" My literal reading of RFC
2046 (which may not be correct) led me to believe that this only applied
to text/plain. Let's look at the whole section, including parts you left
out:

    A critical parameter that may be specified in the Content-Type field
    for "text/plain" data is the character set. This is specified with a
    "charset" parameter, as in:

        Content-type: text/plain; charset=iso-8859-1

    Unlike some other parameter values, the values of the charset
    parameter are NOT case sensitive. The default character set, which
    must be assumed in the absence of a charset parameter, is US-ASCII.

The first sentence led me to believe that we are talking about
"text/plain" and "text/plain" only. Therefore, "The default character
set" to me indicated "The default character set of text/plain".
Immediately following this is the paragraph:

    The specification for any future subtypes of "text" must specify
    whether or not they will also utilize a "charset" parameter, and may
    possibly restrict its values as well. For other subtypes of "text"
    than "text/plain", the semantics of the "charset" parameter should be
    defined to be identical to those specified here for "text/plain",
    i.e., the body consists entirely of characters in the given charset.
    In particular, definers of future "text" subtypes should pay close
    attention to the implications of multioctet character sets for their
    subtype definitions.

So my interpretation was that 1) the default character set of
text/plain is US-ASCII, and 2) other "text" subtypes may define
differently whether and how they utilize a charset parameter.
Again, this may be an incorrect reading; as I mentioned, even if it is
literally correct, if the rest of the community understands it to mean
that charset defaults to US-ASCII for all text/* types, the literal
meaning, which is not unambiguous, is probably moot.
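
To make the two readings concrete, here is a minimal Python sketch of how a
receiver might choose the charset for a text/* type under each one. The
effective_charset function, its strict flag, and the SUBTYPE_DEFAULTS table
(including the utf-8 entry for text/n3) are assumptions made up for
illustration; neither RFC 2046 nor any N3 registration defines them.

```python
from email.message import Message

# Hypothetical per-subtype defaults, for illustration only.
# RFC 2046 itself does not define any of these.
SUBTYPE_DEFAULTS = {"text/n3": "utf-8"}


def effective_charset(content_type, strict=True):
    """Return the charset a receiver would assume for a Content-Type value.

    strict=True  models the common reading: US-ASCII is the default
                 for every text/* type that lacks a charset parameter.
    strict=False models the literal reading discussed above: only
                 text/plain defaults to US-ASCII, and other subtypes
                 may define their own default (or none at all).
    """
    msg = Message()
    msg["Content-Type"] = content_type

    if msg.get_content_maintype() != "text":
        return None  # the text/* default does not apply here

    explicit = msg.get_param("charset")
    if explicit is not None:
        # charset values are not case sensitive, so normalize them
        return str(explicit).lower()

    if strict or msg.get_content_type() == "text/plain":
        return "us-ascii"

    return SUBTYPE_DEFAULTS.get(msg.get_content_type())


print(effective_charset("text/n3"))                # common reading
print(effective_charset("text/n3", strict=False))  # literal reading
```

Under the common reading a bare text/n3 is treated as US-ASCII; under the
literal reading it would fall through to whatever the subtype itself defines.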
> As for the CRLF requirement, that CRLF and *only* CRLF be used for
> line breaks, Dan Brickley commented in response to that that text/xml
> was widely regarded troublesome; but it's not clear from his citations
> that CRLF has anything to do with the troublesome nature, only charset
> defaulting.
>
I don't see how this could cause many problems in practice; I was just
pointing out that technically XML does not follow RFC 2046 requirements
because it has its own rules about CR, LF, and CRLF.
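
As a rough illustration of the mismatch, the sketch below checks a body
against the RFC 2046 canonical form (CRLF is the only line break, with no bare
CR or LF) and, separately, applies XML-style end-of-line normalization (CRLF
and lone CR both become LF). The function names are hypothetical and this is
only a sketch of the two rules, not code from any MIME or XML implementation.

```python
import re

# Matches a CR not followed by LF, or an LF not preceded by CR.
_BARE_CR_OR_LF = re.compile(rb"\r(?!\n)|(?<!\r)\n")


def is_canonical_text(body: bytes) -> bool:
    """True if every line break in body is CRLF (RFC 2046 canonical form)."""
    return _BARE_CR_OR_LF.search(body) is None


def xml_normalize_eol(text: str) -> str:
    """Normalize CRLF and lone CR to LF, as an XML processor does on input."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


print(is_canonical_text(b"<a/>\r\n<b/>\r\n"))  # only CRLF line breaks
print(is_canonical_text(b"<a/>\n<b/>\n"))      # bare LF line breaks
```

A document saved with bare LF line endings is perfectly good XML, yet it would
fail the first check.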
> It seems that most of the problem, as you mentioned in the
> www-rdf-comments thread, is that the text subtree is simply broken.
> RFC 2046 just wasn't written to deal with the Unicode world. Check out
> the following, for example:
>
> A SINGLE character set that can be used
> universally for representing all of the world's languages in Internet
> mail would be preferrable. Unfortunately, existing practice in
> several communities seems to point to the continued use of multiple
> character sets in the near future. A small number of standard
> character sets are, therefore, defined for Internet use in this
> document.
>
> And it defines US-ASCII and ISO-8859-X. It's not RFC 2046's fault that
> it wasn't prescient, but it's *out-of-date* now and perhaps ought to
> be obsoleted so that text/* can be used as intended rather than as
> we're currently forced?
>
That would be what I would prefer. The other option is *not* to obsolete
RFC 2046 text types, which means that no one uses RFC 2046 text types,
making RFC 2046 text types de facto obsolete. In fact, I could easily be
persuaded just to ignore the US-ASCII and CRLF parts if everyone else
were to do the same. But surely someone could take the time to write up
another RFC making this explicit.
> But of course there is the question of what MIME implementations will
> do and what problems, possibly serious, it would cause to, for
> example, make utf-8 the new text/* default. It would need a lot of
> discussion and a new RFC.
>
The problem with the computer standards process is that it has gotten so
bureaucratic and slow that it's hard for anything significant to happen
to solve problems in any short time frame. I applaud anyone who would
push for a new RFC to update RFC 2046--that's the Right Thing To Do
here. Unfortunately, I wouldn't place my bets on this happening anytime
soon. (It might beat XHTML 2.0 out the door, though.)
> Note that TimBL has never, as far as I know, suggested disregarding
> the charset defaulting requirement, just the CRLF requirement which he
> mightn't even be aware of. And as it seems that the charset defaulting
> is the thing that most people are anxious about, I'd be happy for
> text/rdf+n3; charset=utf-8 or text/n3; charset=utf-8 to go forwards,
> even ignoring the fact that it disregards the CRLF requirement.
>
I don't like that option; it's too bulky and feels like a hack. I'd
prefer that either application/* were used, or the RFC 2046 default
character set were ignored. Either would be a better solution.
Garret