<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<META NAME="Generator" CONTENT="MS Exchange Server version 5.5.2653.12">
<TITLE>RE: Media Type "text/csv": new draft (-02) and Last Call</TITLE>
</HEAD>
<BODY>
<P><FONT SIZE=2>Graham & Yakov,</FONT>
</P>
<P><FONT SIZE=2>Where a Comma-Separated-Value format is used by peer computer applications attempting to communicate with each other in an open fashion, it is very simple for them to produce a fixed number of Comma-Separated-Values.</FONT></P>
<P><FONT SIZE=2>It sounds like your production of a variable number of Comma-Separated-Values is an artefact of how you are manually driving one proprietary spreadsheet program from the keyboard/mouse. If such manually generated output is to be read by a similarly manually driven mechanism, then it may be acceptable to have variable numbers of Comma-Separated-Values per record. Standardisation in numbers of fieldds is then unnecessary, so an RFC need not cater for the uncontrolled nature of manually handled data.</FONT></P>
<P><FONT SIZE=2>But the same cannot be said of automated computer-based applications, where maintaining a strict count of generated and expected Comma-Separated-Values per record is not only easy, but also allows for an extra level of data validation: namely that a received record is corrupt if it has too few or too many fields. This is where standardisation in the format of the CSV records becomes appropriate material for an RFC.</FONT></P>
<P><FONT SIZE=2>Regards,</FONT>
<BR><FONT SIZE=2>Clyde Ingram</FONT>
</P>
<P><FONT SIZE=2>-----Original Message-----</FONT>
<BR><FONT SIZE=2>From: Graham Klyne [<A HREF="mailto:GK-lists@ninebynine.org">mailto:GK-lists@ninebynine.org</A>]</FONT>
<BR><FONT SIZE=2>Sent: Tuesday, March 29, 2005 12:21 PM</FONT>
<BR><FONT SIZE=2>To: clyde.ingram@edl.uk.eds.com; YakovS@solidmatrix.com</FONT>
<BR><FONT SIZE=2>Cc: ietf-types@alvestrand.no</FONT>
<BR><FONT SIZE=2>Subject: RE: Media Type "text/csv": new draft (-02) and Last Call</FONT>
</P>
<BR>
<P><FONT SIZE=2>At 15:19 23/03/05 +0000, clyde.ingram@edl.uk.eds.com wrote:</FONT>
<BR><FONT SIZE=2>>Please clarify whether the trailing commas that your Excel export </FONT>
<BR><FONT SIZE=2>>generates are there to mark the end of the last field, or to mark the </FONT>
<BR><FONT SIZE=2>>start of a last field which currently has no value.</FONT>
</P>
<P><FONT SIZE=2>I'm not sure how to tell the difference in an Excel spreadsheet.</FONT>
</P>
<P><FONT SIZE=2>In the case where this arose for me, I had created a speadsheet with </FONT>
<BR><FONT SIZE=2>varying numbers of values in different rows, and many of the rows were </FONT>
<BR><FONT SIZE=2>output by Excel with *multiple* trailing commas. Some rows were generated </FONT>
<BR><FONT SIZE=2>without any trailimng commas. My point would be that if this happens with </FONT>
<BR><FONT SIZE=2>reasonable data then is must be permitted. Whether it's interpreted as a </FONT>
<BR><FONT SIZE=2>field terminator as start of field with no value is, I think, moot.</FONT>
</P>
<P><FONT SIZE=2>#g</FONT>
<BR><FONT SIZE=2>--</FONT>
</P>
<BR>
<P><FONT SIZE=2>>Graham,</FONT>
<BR><FONT SIZE=2>></FONT>
<BR><FONT SIZE=2>>-----Original Message-----</FONT>
<BR><FONT SIZE=2>>From: Graham Klyne </FONT>
<BR><FONT SIZE=2>>[<<A HREF="mailto:GK-lists@ninebynine.org">mailto:GK-lists@ninebynine.org</A>><A HREF="mailto:GK-lists@ninebynine.org">mailto:GK-lists@ninebynine.org</A>]</FONT>
<BR><FONT SIZE=2>>Sent: Wednesday, March 23, 2005 9:55 AM</FONT>
<BR><FONT SIZE=2>>To: Yakov Shafranovich; clyde.ingram@edl.uk.eds.com</FONT>
<BR><FONT SIZE=2>>Cc: ietf-types@alvestrand.no</FONT>
<BR><FONT SIZE=2>>Subject: Re: Media Type "text/csv": new draft (-02) and Last Call</FONT>
<BR><FONT SIZE=2>></FONT>
<BR><FONT SIZE=2>>At 01:14 23/03/05 -0500, Yakov Shafranovich wrote:</FONT>
<BR><FONT SIZE=2>></FONT>
<BR><FONT SIZE=2>> >Clyde,</FONT>
<BR><FONT SIZE=2>> ></FONT>
<BR><FONT SIZE=2>> >Thanks for pointing this out. I personally think that instead of making</FONT>
<BR><FONT SIZE=2>> >the header record mandatory which is something that most CSV applications</FONT>
<BR><FONT SIZE=2>> >do not have, I would rather take the comma out of the end of the record</FONT>
<BR><FONT SIZE=2>> >and have the last field end with a CRLF instead of an optional COMMA. Do</FONT>
<BR><FONT SIZE=2>> >you think that is a plausible solution?</FONT>
<BR><FONT SIZE=2>></FONT>
<BR><FONT SIZE=2>>No. Some of the Excel data I process has trailing commas. This must be</FONT>
<BR><FONT SIZE=2>>allowed.</FONT>
<BR><FONT SIZE=2>></FONT>
<BR><FONT SIZE=2>>I also don't think it's necessary to say anything (other than maybe as a</FONT>
<BR><FONT SIZE=2>>comment) about any special status for the first line: such use is</FONT>
<BR><FONT SIZE=2>>accommodated quite reasonably within the basic CSV format.</FONT>
<BR><FONT SIZE=2>></FONT>
<BR><FONT SIZE=2>>For example, having such a line when exporting Excel as CSV depends</FONT>
<BR><FONT SIZE=2>>entirely upon how the user constructs the original spreadsheet. Column</FONT>
<BR><FONT SIZE=2>>headings are common, but not mandatory. In some cases, there may be a more</FONT>
<BR><FONT SIZE=2>>complex heading structure -- this is an application issue, not a dataset</FONT>
<BR><FONT SIZE=2>>format issue, and as such does not belong in the dataset format </FONT>
<BR><FONT SIZE=2>>specification.</FONT>
<BR><FONT SIZE=2>></FONT>
<BR><FONT SIZE=2>>#g</FONT>
<BR><FONT SIZE=2>>--</FONT>
<BR><FONT SIZE=2>>------------</FONT>
<BR><FONT SIZE=2>></FONT>
<BR><FONT SIZE=2>>Please clarify whether the trailing commas that your Excel export </FONT>
<BR><FONT SIZE=2>>generates are there to mark the end of the last field, or to mark the </FONT>
<BR><FONT SIZE=2>>start of a last field which currently has no value.</FONT>
<BR><FONT SIZE=2>></FONT>
<BR><FONT SIZE=2>>To take a concrete example, I would expect a CSV of sibling relationships </FONT>
<BR><FONT SIZE=2>>in a mythical family to look like this, assuming the siblings are one </FONT>
<BR><FONT SIZE=2>>brother (Bart) and 2 sisters (Lisa & Maggie):</FONT>
<BR><FONT SIZE=2>></FONT>
<BR><FONT SIZE=2>> child,sisters,brothers<CR-LF></FONT>
<BR><FONT SIZE=2>> Bart,Lisa & Maggie,<CR-LF></FONT>
<BR><FONT SIZE=2>> Lisa,Maggie,Bart<CR-LF></FONT>
<BR><FONT SIZE=2>> Maggie,Lisa,Bart<CR-LF></FONT>
<BR><FONT SIZE=2>></FONT>
<BR><FONT SIZE=2>>where the trailing comma for the record of child=Bart signifies that the </FONT>
<BR><FONT SIZE=2>>"brothers" field is null, so that Bart has no brothers. In my view this </FONT>
<BR><FONT SIZE=2>>is a logical conclusion, and in fact stripping that one trailing comma </FONT>
<BR><FONT SIZE=2>>would be an error, as that record would only have 2 fields, not 3.</FONT>
<BR><FONT SIZE=2>></FONT>
<BR><FONT SIZE=2>>Would you, however, expect the CSV file to use comma as a </FONT>
<BR><FONT SIZE=2>>field-terminator, rather than a field-separator, as follows?:</FONT>
<BR><FONT SIZE=2>></FONT>
<BR><FONT SIZE=2>> child,sisters,brothers,<CR-LF></FONT>
<BR><FONT SIZE=2>> Bart,Lisa & Maggie,,<CR-LF></FONT>
<BR><FONT SIZE=2>> Lisa,Maggie,Bart,<CR-LF></FONT>
<BR><FONT SIZE=2>> Maggie,Lisa,Bart,<CR-LF></FONT>
<BR><FONT SIZE=2>></FONT>
<BR><FONT SIZE=2>>Note that parsers that split data records on unprotected comma would </FONT>
<BR><FONT SIZE=2>>detect one field too many in this latter case.</FONT>
<BR><FONT SIZE=2>></FONT>
<BR><FONT SIZE=2>>In a Comma SEPARATED Value file format, can you configure Excel to use </FONT>
<BR><FONT SIZE=2>>comma as a SEPARATOR between values, rather than a TERMINATOR (at the end </FONT>
<BR><FONT SIZE=2>>of values)?</FONT>
<BR><FONT SIZE=2>></FONT>
<BR><FONT SIZE=2>></FONT>
<BR><FONT SIZE=2>>Regarding your remarks on the header record being "an application issue, </FONT>
<BR><FONT SIZE=2>>not a dataset format issue, and as such does not belong in the dataset </FONT>
<BR><FONT SIZE=2>>format specification": XML, ASN.1, and other (application-independent) </FONT>
<BR><FONT SIZE=2>>data interchange formats, explicitly tag individual fields so that their </FONT>
<BR><FONT SIZE=2>>type is unambiguously defined within a context. In contrast, CSV conveys </FONT>
<BR><FONT SIZE=2>>no tags per field in a data record. Hence, to help with </FONT>
<BR><FONT SIZE=2>>application-independent data interchange, the CSV format should convey </FONT>
<BR><FONT SIZE=2>>field titles in a header record.</FONT>
<BR><FONT SIZE=2>></FONT>
<BR><FONT SIZE=2>>Here is an example of lack of application-independence: if my application </FONT>
<BR><FONT SIZE=2>>sends yours this CSV file:</FONT>
<BR><FONT SIZE=2>></FONT>
<BR><FONT SIZE=2>> ,Bart,Lisa & Maggie<CR-LF></FONT>
<BR><FONT SIZE=2>> Bart,Lisa,Maggie<CR-LF></FONT>
<BR><FONT SIZE=2>> Bart,Maggie,Lisa<CR-LF></FONT>
<BR><FONT SIZE=2>></FONT>
<BR><FONT SIZE=2>>and your application depends on the assumption that the fields are the </FONT>
<BR><FONT SIZE=2>>sequence:</FONT>
<BR><FONT SIZE=2>></FONT>
<BR><FONT SIZE=2>> child</FONT>
<BR><FONT SIZE=2>> sisters</FONT>
<BR><FONT SIZE=2>> brothers</FONT>
<BR><FONT SIZE=2>></FONT>
<BR><FONT SIZE=2>>then your application will mis-interpret the data.</FONT>
<BR><FONT SIZE=2>>But if my application precedes this with a header record, like so:</FONT>
<BR><FONT SIZE=2>></FONT>
<BR><FONT SIZE=2>> brothers,child,sisters<CR-LF></FONT>
<BR><FONT SIZE=2>> ,Bart,Lisa & Maggie<CR-LF></FONT>
<BR><FONT SIZE=2>> Bart,Lisa,Maggie<CR-LF></FONT>
<BR><FONT SIZE=2>> Bart,Maggie,Lisa<CR-LF></FONT>
<BR><FONT SIZE=2>></FONT>
<BR><FONT SIZE=2>>then your application can maintain independence from the change by my </FONT>
<BR><FONT SIZE=2>>application, because the CSV file conveys the corresponding new field </FONT>
<BR><FONT SIZE=2>>sequence (the columns "brother" and "child" have swapped).</FONT>
<BR><FONT SIZE=2>></FONT>
<BR><FONT SIZE=2>>Regards,</FONT>
<BR><FONT SIZE=2>>Clyde</FONT>
</P>
<P><FONT SIZE=2>------------</FONT>
<BR><FONT SIZE=2>Graham Klyne</FONT>
<BR><FONT SIZE=2>For email:</FONT>
<BR><FONT SIZE=2><A HREF="http://www.ninebynine.org/#Contact" TARGET="_blank">http://www.ninebynine.org/#Contact</A></FONT>
</P>
</BODY>
</HTML>