<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<META NAME="Generator" CONTENT="MS Exchange Server version 5.5.2653.12">
<TITLE>RE: Media Type "text/csv": new draft (-02) and Last Call</TITLE>
</HEAD>
<BODY>
<P><FONT SIZE=2>Graham,</FONT>
</P>
<P><FONT SIZE=2>-----Original Message-----</FONT>
<BR><FONT SIZE=2>From: Graham Klyne [<A HREF="mailto:GK-lists@ninebynine.org">mailto:GK-lists@ninebynine.org</A>]</FONT>
<BR><FONT SIZE=2>Sent: Wednesday, March 23, 2005 9:55 AM</FONT>
<BR><FONT SIZE=2>To: Yakov Shafranovich; clyde.ingram@edl.uk.eds.com</FONT>
<BR><FONT SIZE=2>Cc: ietf-types@alvestrand.no</FONT>
<BR><FONT SIZE=2>Subject: Re: Media Type "text/csv": new draft (-02) and Last Call</FONT>
</P>
<BR>
<P><FONT SIZE=2>At 01:14 23/03/05 -0500, Yakov Shafranovich wrote:</FONT>
</P>
<P><FONT SIZE=2>>Clyde,</FONT>
<BR><FONT SIZE=2>></FONT>
<BR><FONT SIZE=2>>Thanks for pointing this out. I personally think that instead of making </FONT>
<BR><FONT SIZE=2>>the header record mandatory which is something that most CSV applications </FONT>
<BR><FONT SIZE=2>>do not have, I would rather take the comma out of the end of the record </FONT>
<BR><FONT SIZE=2>>and have the last field end with a CRLF instead of an optional COMMA. Do </FONT>
<BR><FONT SIZE=2>>you think that is a plausible solution?</FONT>
</P>
<P><FONT SIZE=2>No. Some of the Excel data I process has trailing commas. This must be </FONT>
<BR><FONT SIZE=2>allowed.</FONT>
</P>
<P><FONT SIZE=2>I also don't think it's necessary to say anything (other than maybe as a </FONT>
<BR><FONT SIZE=2>comment) about any special status for the first line: such use is </FONT>
<BR><FONT SIZE=2>accommodated quite reasonably within the basic CSV format.</FONT>
</P>
<P><FONT SIZE=2>For example, having such a line when exporting Excel as CSV depends </FONT>
<BR><FONT SIZE=2>entirely upon how the user constructs the original spreadsheet. Column </FONT>
<BR><FONT SIZE=2>headings are common, but not mandatory. In some cases, there may be a more </FONT>
<BR><FONT SIZE=2>complex heading structure -- this is an application issue, not a dataset </FONT>
<BR><FONT SIZE=2>format issue, and as such does not belong in the dataset format specification.</FONT>
</P>
<P><FONT SIZE=2>#g</FONT>
<BR><FONT SIZE=2>--</FONT>
<BR><FONT SIZE=2>------------</FONT>
</P>
<P><FONT SIZE=2>Please clarify whether the trailing commas that your Excel export generates are there to mark the end of the last field, or to mark the start of a last field which currently has no value.</FONT></P>
<P><FONT SIZE=2>To take a concrete example, I would expect a CSV of sibling relationships in a mythical family to look like this, assuming the siblings are one brother (Bart) and 2 sisters (Lisa & Maggie):</FONT></P>
<P><FONT SIZE=2> child,sisters,brothers<CR-LF></FONT>
<BR><FONT SIZE=2> Bart,Lisa & Maggie,<CR-LF></FONT>
<BR><FONT SIZE=2> Lisa,Maggie,Bart<CR-LF></FONT>
<BR><FONT SIZE=2> Maggie,Lisa,Bart<CR-LF></FONT>
</P>
<P><FONT SIZE=2>where the trailing comma for the record of child=Bart signifies that the "brothers" field is null, so that Bart has no brothers. In my view this is a logical conclusion, and in fact stripping that one trailing comma would be an error, as that record would only have 2 fields, not 3.</FONT></P>
<P><FONT SIZE=2>Would you, however, expect the CSV file to use comma as a field-terminator, rather than a field-separator, as follows?:</FONT>
</P>
<P><FONT SIZE=2> child,sisters,brothers,<CR-LF></FONT>
<BR><FONT SIZE=2> Bart,Lisa & Maggie,,<CR-LF></FONT>
<BR><FONT SIZE=2> Lisa,Maggie,Bart,<CR-LF></FONT>
<BR><FONT SIZE=2> Maggie,Lisa,Bart,<CR-LF></FONT>
</P>
<P><FONT SIZE=2>Note that parsers that split data records on unprotected comma would detect one field too many in this latter case. </FONT>
</P>
<P><FONT SIZE=2>In a Comma SEPARATED Value file format, can you configure Excel to use comma as a SEPARATOR between values, rather than a TERMINATOR (at the end of values)?</FONT></P>
<P><FONT SIZE=2> </FONT>
<BR><FONT SIZE=2>Regarding your remarks on the header record being "an application issue, not a dataset format issue, and as such does not belong in the dataset format specification": XML, ASN.1, and other (application-independent) data interchange formats, explicitly tag individual fields so that their type is unambiguously defined within a context. In contrast, CSV conveys no tags per field in a data record. Hence, to help with application-independent data interchange, the CSV format should convey field titles in a header record.</FONT></P>
<P><FONT SIZE=2>Here is an example of lack of application-independence: if my application sends yours this CSV file:</FONT>
</P>
<P><FONT SIZE=2> ,Bart,Lisa & Maggie<CR-LF></FONT>
<BR><FONT SIZE=2> Bart,Lisa,Maggie<CR-LF></FONT>
<BR><FONT SIZE=2> Bart,Maggie,Lisa<CR-LF></FONT>
<BR><FONT SIZE=2> </FONT>
<BR><FONT SIZE=2>and your application depends on the assumption that the fields are the sequence:</FONT>
</P>
<P><FONT SIZE=2> child</FONT>
<BR><FONT SIZE=2> sisters</FONT>
<BR><FONT SIZE=2> brothers</FONT>
</P>
<P><FONT SIZE=2>then your application will mis-interpret the data.</FONT>
<BR><FONT SIZE=2>But if my application precedes this with a header record, like so:</FONT>
</P>
<P><FONT SIZE=2> brothers,child,sisters<CR-LF></FONT>
<BR><FONT SIZE=2> ,Bart,Lisa & Maggie<CR-LF></FONT>
<BR><FONT SIZE=2> Bart,Lisa,Maggie<CR-LF></FONT>
<BR><FONT SIZE=2> Bart,Maggie,Lisa<CR-LF></FONT>
</P>
<P><FONT SIZE=2>then your application can maintain independence from the change by my application, because the CSV file conveys the corresponding new field sequence (the columns "brother" and "child" have swapped).</FONT></P>
<BR>
<P><FONT SIZE=2>Regards,</FONT>
<BR><FONT SIZE=2>Clyde</FONT>
</P>
</BODY>
</HTML>