Hi All,

A couple of years ago we did extensive CSV file exports from a proprietary db. Our practice is not best practice, or at least it was never tested or shared widely enough to be tagged as best practice.

I second Steve Bordwell: the delimiter should be something not used in the text. We used a tool called xflat (http://www.unidex.com/xflat.htm) to convert CSV to XML. We were able to re-create the hierarchy embedded indirectly in the CSV.

CSV in its generic definition means "[anything, e.g. comma] separated value".

Karim Boughida
[log in to unmask]

>>> Steve Bordwell <[log in to unmask]> 2006-10-31 08:09 >>>
We have also used the bar | as the delimiter, as this is unlikely to appear in text.

Steve

-----Original Message-----
From: PREMIS Implementors Group Forum [mailto:[log in to unmask]] On Behalf Of John A. Kunze
Sent: 31 October 2006 15:33
To: [log in to unmask]
Subject: Re: [PIG] Standards for CSV/Tabbed

I wonder if a couple of things could be at play here. The original email used csv to mean "character separated value (aka tab delimited)", but I'm used to seeing CSV mean "comma separated value". Some versions of CSV also allow double-quotes to enclose values, which significantly affects parsing.

Whether separated by commas or tabs, the data values will have to be cleansed of the delimiter (tab or comma) for the format to work. In much of my work, tabs are better value delimiters because they're either rarer in values or easier to clean out of values (tending to be less significant than commas).

-John

--- On Tue, 31 Oct 2006, Charles Blair wrote:
> On Tue, Oct 31, 2006 at 09:29:23AM -0500, Kai Naumann wrote:
> > we are planning a business rule for the character separated value
> > format (aka tabbed format), dealing with the best choice for field
> > delimiter, and with the problem of text delimiters encountered
> > inside texts.
>
> the typical problem with tab-delimited is encountering tabs or
> newlines inside a field value, input by people who want to "format"
> the data, say in a description field. in these cases i tell them to
> export the data including field names as the first row. my parser
> counts these, to tell it how many fields it should expect, then it
> counts the fields in every row. it reports when it encounters a row
> that has more or fewer fields than expected, returning the row number
> with how many fields it found, in which case i send the data back to
> the user with the report and tell them to fix the problem. it's simple
> enough to write these parsers in your language of choice.
>
> with csv you're not going to have the problem with embedded tabs. i
> can't remember offhand how much of a problem embedded newlines
> represent. it's simple enough to experiment with, though.
>
> another issue you might want to keep in mind is character encoding if
> that is relevant in your situation. people using these formats
> typically are generating data using MS Windows products, which default
> to codepage 1252 for character encoding. since i typically want to
> convert tab-delimited or csv to xml, i need to convert anything i get
> from these sources to utf-8 using a tool such as GNU recode, or tell
> them to export as utf-8 (but check what you get in these cases).
> ______
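
For anyone who wants to try the checks described above, here is a minimal sketch in Python. It is not the parser from the message, only one way the same idea might look: the header row sets the expected field count, and every later row whose count differs is reported with its row number. The function names check_field_counts and cp1252_to_utf8 are hypothetical, and the sketch assumes the export has already been read into a string (or, for the encoding step, into bytes).

```python
import csv
import io


def check_field_counts(text, delimiter="\t"):
    """Return (expected, problems) for delimiter-separated text.

    expected is the number of fields in the header row (row 1).
    problems is a list of (row_number, fields_found) for rows that differ.
    The split is deliberately naive (no quoting rules), so a tab or line
    break typed inside a value shows up as a wrong field count.
    """
    rows = text.splitlines()
    if not rows:
        return 0, []
    expected = len(rows[0].split(delimiter))
    problems = []
    for row_number, row in enumerate(rows[1:], start=2):
        found = len(row.split(delimiter))
        if found != expected:
            problems.append((row_number, found))
    return expected, problems


def cp1252_to_utf8(raw):
    """Re-encode bytes from Windows codepage 1252 to UTF-8.

    Assumption: the input really is cp1252. Python raises
    UnicodeDecodeError for the few byte values cp1252 leaves undefined.
    """
    return raw.decode("cp1252").encode("utf-8")


# Row 3 has an extra tab, row 4 is one field short.
sample = "id\tname\tdesc\n1\tA\tok\n2\tB\tbad\textra\n3\tC\n"
expected, problems = check_field_counts(sample)
for row_number, found in problems:
    print(f"row {row_number}: expected {expected} fields, found {found}")

# Experiment for the embedded-newline question: with double-quoted values,
# Python's csv reader keeps a line break inside a field as part of one record.
quoted = 'id,desc\n1,"line one\nline two"\n'
records = list(csv.reader(io.StringIO(quoted)))
```

The naive split suits unquoted tab-delimited exports, where any stray tab or newline is an error to send back to the user. Where the export uses double-quotes around values, a quote-aware reader such as the csv module is the safer choice, since it treats delimiters and newlines inside quotes as data.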