Hi All,
A couple of years ago, we did extensive CSV file export from a
proprietary db. Our practice is not necessarily best practice, or at
least it was never tested or shared widely enough to be tagged as such.
I second Steve Bordwell: the delimiter should be something not used in
the text.
We used a tool called xflat (http://www.unidex.com/xflat.htm) to
convert CSV to XML. We were able to re-create the hierarchy
embedded indirectly in the CSV.

CSV in its generic definition means "[anything, e.g. comma] separated
values".
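
To illustrate what re-creating a hierarchy from flat rows can look like,
here is a minimal Python sketch. It does not use xflat and does not
reflect xflat's schema language; the column names (collection, id,
title), the element names and the bar delimiter are all assumptions
made up for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_xml(text, delimiter="|"):
    """Nest flat rows under a parent element keyed by the 'collection' column.

    The column and element names are hypothetical, chosen for illustration.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    root = ET.Element("export")
    parents = {}  # collection name -> its XML element
    for row in reader:
        key = row["collection"]
        if key not in parents:
            parents[key] = ET.SubElement(root, "collection", name=key)
        item = ET.SubElement(parents[key], "item", id=row["id"])
        ET.SubElement(item, "title").text = row["title"]
    return ET.tostring(root, encoding="unicode")


sample = (
    "collection|id|title\n"
    "maps|1|Harbour plan\n"
    "maps|2|City survey\n"
    "letters|3|To the editor\n"
)
xml_text = csv_to_xml(sample)
print(xml_text)
```

Rows that share a value in the parent column end up under the same
parent element, which is one simple way a hierarchy can be carried
indirectly in a flat file.
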
Karim Boughida
[log in to unmask] 
 

>>> Steve Bordwell <[log in to unmask]> 2006-10-31 08:09 >>>
We have also used the bar | as the delimiter, as this is unlikely to
appear in text.

Steve

-----Original Message-----
From: PREMIS Implementors Group Forum [mailto:[log in to unmask]] On Behalf Of John A. Kunze
Sent: 31 October 2006 15:33
To: [log in to unmask] 
Subject: Re: [PIG] Standards for CSV/Tabbed

I wonder if a couple of things could be at play here.
The original email used csv to mean "character separated value
(aka tab delimited)", but I'm used to seeing CSV mean "comma
separated value".  Some versions of CSV also allow double-quotes
to enclose values, which significantly affects parsing.

Whether separated by commas or tabs, the data values will have to
be cleansed of the delimiter (tab or comma) for the format to work.
In much of my work, tabs are better value delimiters because they're
either more rare in values or easier to clean out of values (tending
to be less significant than commas).
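
The two approaches mentioned here (cleansing the delimiter out of the
values, or quoting the values) can be compared in a short Python sketch.
The helper names and the choice to replace tabs and line breaks with a
single space are assumptions for illustration, not a recommended rule.

```python
import csv
import io


def cleanse(value, delimiter="\t"):
    """Replace the delimiter and any line breaks inside a value with a space."""
    for ch in (delimiter, "\r\n", "\n", "\r"):
        value = value.replace(ch, " ")
    return value


def write_tabbed(rows):
    """Write tab-delimited text with no quoting, cleansing every value first."""
    return "".join("\t".join(cleanse(v) for v in row) + "\n" for row in rows)


def write_quoted_csv(rows):
    """Write comma-separated text; the csv module quotes values when needed."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


rows = [["id", "note"], ["1", "red,\tgreen"]]
tabbed = write_tabbed(rows)      # the tab inside the value becomes a space
quoted = write_quoted_csv(rows)  # the value is wrapped in double quotes
print(repr(tabbed))
print(repr(quoted))
```

Cleansing changes the data but keeps the file trivially splittable on
the delimiter; quoting preserves the data but requires a parser that
understands the quoting rules.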

-John


--- On Tue, 31 Oct 2006, Charles Blair wrote:
> On Tue, Oct 31, 2006 at 09:29:23AM -0500, Kai Naumann wrote:
> > we are planning a business rule for the character separated value
> > format (aka tabbed format), dealing with the best choice for field
> > delimiter, and with the problem of text delimiters encountered
> > inside texts.
> 
> the typical problem with tab-delimited is encountering tabs or
> newlines inside a field value, input by people who want to "format"
> the data, say in a description field. in these cases i tell them to
> export the data including field names as the first row. my parser
> counts these, to tell it how many fields it should expect, then it
> counts the fields in every row. it reports when it encounters a row
> that has more or fewer fields than expected, returning the row number
> with how many fields it found, in which case i send the data back to
> the user with the report and tell them to fix the problem. it's simple
> enough to write these parsers in your language of choice.
> 
> with csv you're not going to have the problem with embedded tabs. i
> can't remember offhand how much of a problem embedded newlines
> represent. it's simple enough to experiment with, though.
> 
> another issue you might want to keep in mind is character encoding if
> that is relevant in your situation. people using these formats
> typically are generating data using MS Windows products, which default
> to codepage 1252 for character encoding. since i typically want to
> convert tab-delimited or csv to xml, i need to convert anything i get
> from these sources to utf-8 using a tool such as GNU recode, or tell
> them to export as utf-8 (but check what you get in these cases).
> 
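
The field-count check and the codepage 1252 to UTF-8 conversion
described in the quoted message could be sketched in Python roughly as
follows. This is not the parser referred to above; the function name,
the sample data and the report format (row number, fields found) are
assumptions for illustration.

```python
def check_field_counts(text, delimiter="\t"):
    """Compare every row's field count with the header row's.

    Returns the expected count and a list of (row_number, fields_found)
    for rows that differ. Row 1 is the header.
    """
    lines = text.replace("\r\n", "\n").rstrip("\n").split("\n")
    expected = len(lines[0].split(delimiter))
    problems = []
    for number, line in enumerate(lines[1:], start=2):
        found = len(line.split(delimiter))
        if found != expected:
            problems.append((number, found))
    return expected, problems


# Sample export as bytes in codepage 1252; the second record has a
# line break typed inside its middle field.
raw = b"id\ttitle\tnote\r\n1\tCaf\xe9\tok\r\n2\tfirst\r\nsecond\tok\r\n"

# Decode from the assumed source encoding, then re-encode as UTF-8.
# Bytes that are undefined in cp1252 raise UnicodeDecodeError here.
text = raw.decode("cp1252")
utf8 = text.encode("utf-8")

expected, problems = check_field_counts(text)
for number, found in problems:
    print(f"row {number}: expected {expected} fields, found {found}")
```

An embedded line break shows up as two short rows, so both halves of
the broken record are reported. The decode step only helps if the
source really is codepage 1252, which is why the output is worth
checking.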
______