Charset mandatory in unix/linux

Bruce Lilly blilly at erols.com
Mon Mar 27 01:50:45 CEST 2006


On Sun March 12 2006 10:10, ned+ietf-822 at mrochek.com wrote:
> 
> (cc'ing the ietf-types list since this doesn't seem like an appropriate topic
> for ietf-822)
[this response to types, cc to 822, Reply-To set to types]

[Jacob Palme wrote, regarding charset] 
> > However, such a parameter is not mandatory in
> > Unix or Linux.
> 
> I could say the same thing about media types. File extensions or type codes are
> commonly used to determine the media type. This is a huge problem that has led
> to serious security glitches as well as poor user experiences.

Agreed.
 
> > This is causing more and more problems, when
> > people have a mixture of files with different charsets,
> > which you easily get when you download files from the
> > Internet or receive them via e-mail.
> 
> The reality is it is causing fewer and fewer problems as things gradually shift
> towards Unicode-based charsets and away from the vast array of less capable
> charsets. The security issues caused by non-use or misuse of media type labels
> are a far bigger problem, and worse, one that doesn't appear to be going away.

Agreed about the security issues w.r.t. non-use/misuse of type labels.
However, I have a different perspective regarding Unicode (see below).
> 
> > Would it be possible to get the people responsible for the
> > file systems in Unix and Linux to add a mandatory charset
> > attribute to all text files?
> 
> Knowing the charset buys you very little without also knowing the media type.
> You seem to be focused on plain text here and hence you're ignoring the larger
> media type issue. Lots of media types have parameters and even when the media
> type can be determined - it frequently cannot be done reliably - it is often
> done in a way that doesn't allow additional parameters to be attached.

I note that Unix and Unix-like systems don't have the notion of a "text
file"; unlike some other systems, there is no distinction between "text"
and "binary" files.  Moreover, it is one of the characteristics of Unix
that file system semantics apply not only to files per se, but also to
devices (disk drives, communications ports, etc.) and, in recent
implementations, to interfaces to system information for processes.
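As a small illustration of that point (a sketch in Python; the paths
assume a typical Linux system with /dev and /proc mounted), the same
byte-stream interface applies whether the "file" is a regular file, a
device node, or a kernel-provided process interface, and nothing in
that interface says anything about a charset:

    import os

    # The same open/read interface applies to regular files, devices, and
    # kernel interfaces; none of them carries any notion of "text" vs.
    # "binary", let alone a charset.  (Paths assume a typical Linux system.)
    for path in ("/etc/hostname", "/dev/urandom", "/proc/self/status"):
        with open(path, "rb") as f:    # everything is just a stream of octets
            data = f.read(16)
        print(path, len(data), "octets read, mode", oct(os.stat(path).st_mode))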
 
> > Best is probably to add a
> > generalized property list to files, so that also other
> > properties than charset can be added in the future.
> 
> The ability to attach metadata to files is indeed a very useful feature, one
> that has been around for decades on some platforms at least. (I'm not going to
> bother with the history here.) And it is already available on Linux - at a
> minimum the ext2, ext3, and XFS file systems support it. (There are probably
> others but I'm too lazy to go look them up.)

ReiserFS, one of the more important Linux file systems, also belongs on that list...
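As a concrete sketch of what such tagging could look like on Linux
(purely hypothetical: the attribute name "user.charset" below is an
invented convention, not an existing standard, and the sketch assumes
a file system mounted with user extended attributes enabled):

    import os

    path = "example.txt"
    with open(path, "w") as f:
        f.write("hello\n")

    # Store a charset label as filesystem metadata (Linux-specific calls;
    # the attribute name "user.charset" is an invented convention).
    os.setxattr(path, "user.charset", b"iso-8859-1")

    # Any application that agrees on that convention can read it back.
    charset = os.getxattr(path, "user.charset").decode("ascii")
    print("stored charset label:", charset)

The hard part, as noted below, is getting applications to agree on any
such convention in the first place.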
 
> So in the sense of getting the filesystem to support this sort of tagging, your
> problem is already solved in many cases. But this is the easy part. You now
> have to get applications to agree on a specific use of metadata tags for
> charsets or media types or whatever. Good luck on getting that to happen.

Specifically regarding Unix-like systems, there is a long history of
representing metadata as character strings containing attribute/value
pairs; that is how environment variables are passed, how command-line
options with parameters are passed to programs, etc.  IIRC, Tom Duff
had a paper in the "papers" volume of one of the recent editions of
the Unix manuals about the use of pairs within graphics data files
for conveying such information.  So if the filesystem metadata can be
represented as character strings containing attribute/value pairs,
that's a very good fit (with one important caveat, noted below) for media
type parameters, since that is precisely what those parameters are.
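For instance, a Content-Type field body is exactly such a string: a
type/subtype followed by ";"-separated attribute=value pairs.  A
minimal sketch (hand-rolled parse, ignoring quoted-strings and
comments for brevity):

    # A Content-Type field body is a type/subtype followed by ";"-separated
    # attribute=value pairs -- i.e. exactly the kind of character-string
    # metadata described above.  (Simplified: no quoted-strings/comments.)
    field = "text/plain; charset=iso-8859-1; format=flowed"

    media_type, *params = [p.strip() for p in field.split(";")]
    pairs = dict(p.split("=", 1) for p in params)

    print(media_type)   # text/plain
    print(pairs)        # {'charset': 'iso-8859-1', 'format': 'flowed'}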

> > The advantage would be that programs which transport files
> > across the Internet, such as e-mail, ftp and http, would
> > more often use the correct charset and not munge the files
> by giving them an incorrect charset. The commonly occurring
> > problem with incorrect charset would be reduced. Also local
> programs such as text editors would benefit from knowing
> > the charset of a file.
> 
> First of all, email and http do not "transfer files" per se. They transfer data
> objects and each protocol defines the metadata it considers appropriate to
> attach to data objects.

It's a bit more complicated than that, specifically for text; email
is fairly consistent regarding line endings -- the message format
is quite specific on that point, as are the MIME specifications
(RFC 2046 in particular).  HTTP, however, does not specify line
endings for text, and that is a source of various inconsistencies
and problems.  There is no way to specify line endings with HTTP;
the protocol specification expressly permits implementation
discrepancies.  That is a potent source of trouble when transferring
unlabeled or mislabeled binary content containing 0x0D octets via
HTTP.
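A small illustration of that failure mode: if a receiver "normalizes"
line endings in something it believes is text (one plausible,
implementation-specific behavior), any 0x0D octets in unlabeled
binary content are silently altered:

    # Binary content that happens to contain 0x0D (CR) octets.
    payload = b"\x89PNG\r\n\x1a\n...image data...\r\x00"

    # A receiver that wrongly treats the body as text and folds CRLF/CR
    # to LF corrupts the octet stream.
    normalized = payload.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

    print(payload == normalized)            # False -- octets were changed
    print(len(payload), len(normalized))    # lengths no longer match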

[...]
> This situation means that in situations where retention of file metadata is
> important some sort of additional container has to be used. A vast number of
> such container formats have been defined - tar files, zip files,
> AppleSingle/AppleDouble, etc.

These typically (and specifically for tar and zip) do not include
media type, charset, or other type parameter information.  The tar
format, for example, carries time stamps, permissions, and file type
(where "type" means plain file vs. directory vs. device, etc.).
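That is easy to verify against the header fields tar actually records;
a short sketch using Python's tarfile module:

    import io
    import os
    import tarfile
    import tempfile

    # Create a small file to archive, then inspect what tar records about it.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("example\n")

    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        tar.add(tmp.name, arcname="example.txt")
    os.unlink(tmp.name)

    buf.seek(0)
    with tarfile.open(fileobj=buf, mode="r") as tar:
        m = tar.getmembers()[0]
        # tar records name, size, mtime, permissions, ownership, file type...
        print(m.name, m.size, m.mtime, oct(m.mode), m.type)
        # ...but nothing resembling a media type, charset, or parameters.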

Media types can be used to label such containers (IIRC there is
a defined media type for the "AppleDouble" stuff), and media type
parameters can be used to convey additional information;
alternatively, media types could be defined to label an octet
stream as a particular type, with metadata carried via parameters.
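A rough sketch of the second alternative (the parameter names used
here on application/octet-stream are invented for illustration, not
registered ones): the metadata rides along as Content-Type parameters
on the labeled octet stream:

    from email.mime.application import MIMEApplication

    # Label an opaque octet stream with a media type and carry metadata
    # as Content-Type parameters.  The "original-name" and "charset"
    # parameters on application/octet-stream are illustrative only.
    part = MIMEApplication(b"raw octets...", _subtype="octet-stream")
    part.set_param("original-name", "notes.txt")
    part.set_param("charset", "iso-8859-1")

    print(part["Content-Type"])
    # application/octet-stream; original-name="notes.txt"; charset="iso-8859-1"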

[...]
> So what's the bottom line? The bottom line is that you appear to be focusing on
> the wrong problem along several dimensions. First, charset information
> specifically isn't as interesting or essential as you claim, and the degree to
> which it is interesting is dropping for a variety of reasons. Second, you
> appear to have missed the larger and much more important problem of not having
> correct parameterized type information available. (And we haven't even
> discussed the many other sorts of metadata, like say language information, that
> is also useful to have.) Third, your focus on getting metadata support into
> filesystems is mostly misplaced - this is a solved problem in a lot of cases.
> And fourth, you don't seem to appreciate the difficulty of getting everyone to
> agree to actually use filesystem metadata to solve any of these problems. This
> last is a complete showstopper and I despair of there ever being significant
> progress in this area because of it.

In reverse order:

Getting agreement to use metadata has several issues: aside from the
slow evolutionary process of having protocols convey the metadata and
having applications store and retrieve that metadata, there is the
issue of some sort of standard(s) for APIs for the metadata
storage/retrieval.  That's not really in IETF's bailiwick; perhaps
it's something that might spark some interest in ECMA or a similar
SDO.  So far as Unix-like systems are concerned, the use of
character-stream representation of attribute/value pairs would seem
to be a good fit as noted above, with one caveat: in MIME
Content-Type fields, one knows how to interpret the "character
strings" because they are specified to be drawn from a limited
repertoire -- that is, one does not need to know the "charset" of
the character strings themselves; they are composed of a small,
well-defined set of characters which fit in octets (in 7 bits, in
fact).  So long as one can count on being able to interpret
the character strings as a stream of octets, I won't despair.
Conversely, if the very problem that Jacob has described, viz. the
inability to determine what sort of "character strings" one is
looking at, extends to the metadata storage, I will abandon all hope
of a solution.
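That property is easy to state concretely: a well-formed Content-Type
field body survives being treated as a bare octet stream, whatever
the charset of the content it labels:

    # The Content-Type field body itself uses only a small, 7-bit
    # repertoire, so it can be interpreted without knowing any "charset"
    # for the metadata; the charset it names applies to the labeled
    # content, not to the label.
    field = b"text/plain; charset=EUC-KR"

    print(field.decode("ascii"))                  # cannot fail for a valid field
    print(all(octet < 0x80 for octet in field))   # True: fits in 7 bits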

As far as additional metadata (language, etc.) is concerned, the
attribute/value pair paradigm still works, but at a higher level.
In email and HTTP, for example, there are header fields -- themselves
attribute/value pairs (specifically, header field name and field
body) -- which convey not only MIME media type and parameters, but
other MIME fields (e.g. Content-Language) as well as non-MIME data.
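A short sketch of that layering, running Python's email parser over a
toy message:

    from email import message_from_string

    msg = message_from_string(
        "Content-Type: text/plain; charset=iso-8859-1\n"
        "Content-Language: sv\n"
        "Subject: exempel\n"
        "\n"
        "Hej!\n"
    )

    # Header fields are themselves attribute/value pairs (name: body), and
    # the Content-Type body carries further attribute=value pairs inside it.
    for name, body in msg.items():
        print(name + ":", body)
    print("charset parameter:", msg.get_content_charset())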

I do think Jacob has a valid concern; having unlabeled text files in
various charsets is a big problem, and from my perspective it is
getting worse, not better.  The holy grail of a single unified
character set that will supposedly solve the problem sounds nice
until one looks at the details.  Fortunately, the notion of a
"charset" being somewhat more complex than the notion of a "character
set" helps a little; at least knowing the charset, one can
distinguish among utf-7, utf-8, utf-32be, utf-32le, utf-16be, and
utf-16le, all of which have "Unicode" as the underlying character
code.  But that doesn't help much, precisely because "Unicode" is
itself a "vast array" (ever-increasing in number) of character code
sets.  Saying "Unicode" doesn't tell me if that's pre-"Korean mess"
(see RFC 2279) "Unicode" or post-"Korean mess" "Unicode".  Or whether
that's the "Unicode" that has among its design principles a uniform
code width of 16 bits and an encoding strictly of text (specifically
excluding musical notation), or the "Unicode" that has a much wider
code width and includes non-textual cruft such as (yes, you guessed
it) musical notation.  Or whether it's one of the "Unicode"s that
has an attempt at encoding language information (versions 3.1 and
3.2), or one of the "Unicode"s (earlier and later) that do not.  And
so on.

Ned, you're quite right that simply adding attributes isn't a
panacea; specifying the "Unicode" version doesn't help.  Obviously an
implementation using "Unicode" version 2.1 won't be able to make
sense of the "features" which crept into "Unicode" version 3.2 -- and
not everybody is able to "upgrade".  Aside from the much greater
resources required to support an increased code width (think of
small, battery-powered hand-held devices like cell phones), and the
costs (monetary and otherwise) that an "upgrade" might entail, some
hardware and/or software might be incapable of supporting a version
change (not merely insufficient hardware resources, but perhaps a
vendor has dropped support or has gone out of business).  And there
are related costs associated with changing entire suites of software,
font support, etc.

As I see it, the problem is much worse than when I had to deal with
ANSI X3.4 and the ISO-8859 variants; at least I COULD have software
that dealt with the various encodings, fonts that supported the
glyphs, locales to switch among, etc.  As far as I know, it's not
even possible to have multiple versions of Unicode and to transcode
between them on the same machine.  Even if that is theoretically
possible, the fundamental version problem remains: "utf-8" still
doesn't tell me (or my software) anything about the underlying
"Unicode" version, whereas "ANSI X3.4" has a specific meaning -- its
code width didn't suddenly double one day.

To put the issue in the perspective of Jacob's problem, suppose Jacob
has received a text file in Korean and the issue of labeling the
charset and language is solved.  If it is labeled as "ISO-2022-KR",
he can proceed to make sense of the file; conversely, if it is
labeled as "utf-7" he cannot, because he lacks the information to
determine whether the result of the transformation to "Unicode"
should be interpreted as groups of 16 bits or some other code width,
as well as which code points represent various hangul characters.
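To make the encoding-form part of that concrete (a sketch; the hangul
syllable chosen below is arbitrary): the charset label does pin down
the encoding form and byte order, but nothing in it says which
"Unicode" version's code point assignments the sender had in mind:

    # One hangul syllable encoded under several charset labels.  The labels
    # distinguish the encoding forms from one another, but none of them
    # says which Unicode *version* gave the code point its meaning.
    text = "\uD55C"   # HANGUL SYLLABLE HAN (arbitrary example)

    for charset in ("utf-7", "utf-8", "utf-16be", "utf-16le",
                    "utf-32be", "iso2022_kr"):
        print(charset, text.encode(charset).hex())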

