On Tue, Oct 31, 2006 at 09:29:23AM -0500, Kai Naumann wrote:
> we are planning a business rule for the character separated value
> format (aka tabbed format), dealing with the best choice for field
> delimiter, and with the problem of text delimiters encountered
> inside texts.
the typical problem with tab-delimited is encountering tabs or
newlines inside a field value, input by people who want to "format"
the data, say in a description field. in these cases i tell them to
export the data including field names as the first row. my parser
counts these, to tell it how many fields it should expect, then it
counts the fields in every row. it reports when it encounters a row
that has more or fewer fields than expected, returning the row number
with how many fields it found, in which case i send the data back to
the user with the report and tell them to fix the problem. it's simple
enough to write these parsers in your language of choice.
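
as an illustration, here is a minimal sketch of that kind of check in
python. it is not the actual parser described above; the function name
check_field_counts and the sample data are made up for the example, and
it assumes the first row holds the field names.

```python
import io


def check_field_counts(lines, delimiter="\t"):
    """Return (expected, problems); problems is a list of (row_number, fields_found)."""
    expected = None
    problems = []
    for row_number, line in enumerate(lines, start=1):
        fields = line.rstrip("\r\n").split(delimiter)
        if expected is None:
            # the header row tells us how many fields every later row should have
            expected = len(fields)
            continue
        if len(fields) != expected:
            problems.append((row_number, len(fields)))
    return expected, problems


# hypothetical export: row 3 has a newline typed into the description,
# row 5 has an extra tab typed into the description
sample = (
    "id\tname\tdescription\n"
    "1\twidget\tplain text\n"
    "2\tgadget\tline one\n"
    "line two\n"
    "3\tgizmo\ta\tb\n"
)

expected, problems = check_field_counts(io.StringIO(sample))
for row_number, found in problems:
    print(f"row {row_number}: expected {expected} fields, found {found}")
```

note that an embedded newline shows up as a short row on the following
line, so the reported row number can be one past the row that actually
needs fixing.
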
with csv you're not going to have the problem with embedded tabs. i
can't remember offhand how much of a problem embedded newlines
represent. it's simple enough to experiment with, though.
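
one way to run that experiment, sketched with python's standard csv
module. this only shows how that one module behaves (it quotes a field
that contains a newline and reads it back as a single field); other
exporters and parsers may do something different, so check the tools
you actually use. the sample rows are made up.

```python
import csv
import io

rows = [
    ["id", "description"],
    ["1", "line one\nline two"],  # newline embedded in a field
    ["2", "has\ta tab"],          # tab embedded in a field
]

# newline="" keeps the csv module in charge of line endings
buf = io.StringIO(newline="")
csv.writer(buf).writerows(rows)
text = buf.getvalue()

# a csv-aware reader puts the quoted newline back inside the field
parsed = list(csv.reader(io.StringIO(text, newline="")))

# a naive line-by-line split sees one more line than there are records
naive_line_count = len(text.splitlines())

print(parsed == rows)
print(len(rows), naive_line_count)
```

the practical point is that a quoted newline only survives if every tool
in the chain parses csv properly rather than splitting on line breaks.
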
another issue you might want to keep in mind is character encoding if
that is relevant in your situation. people using these formats
typically are generating data using MS Windows products, which default
to codepage 1252 for character encoding. since i typically want to
convert tab-delimited or csv to xml, i need to convert anything i get
from these sources to utf-8 using a tool such as GNU recode, or tell
them to export as utf-8 (but check what you get in these cases).
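
if GNU recode is not at hand, the same conversion can be sketched in
python. the helper names cp1252_to_utf8 and looks_like_utf8 are
hypothetical, and the code assumes the input really is codepage 1252;
the second helper is only a rough way of checking what you get from a
"utf-8" export.

```python
def cp1252_to_utf8(data: bytes) -> bytes:
    """Re-encode bytes assumed to be codepage 1252 as utf-8."""
    return data.decode("cp1252").encode("utf-8")


def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as utf-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# sample text with an accented letter and windows "smart quotes"
original = "caf\u00e9 \u201cquoted\u201d"
raw = original.encode("cp1252")

print(looks_like_utf8(raw))        # the cp1252 bytes are not valid utf-8
converted = cp1252_to_utf8(raw)
print(looks_like_utf8(converted))  # the converted bytes are
```

plain ascii decodes cleanly as utf-8 too, so a passing check does not
prove the export was done correctly; it only rules out the obvious
failures.
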