Roy -

That's a good idea. I don't think that we have - I've queried our programmer
who's developed most of our harvesting software.

Thanks!
Sarah

At 10:38 AM 10/25/2006, Roy Tennant wrote:
>Sarah,
>Did you consider piping bad records through something like tidy? I've always
>wondered whether that might be a good way to clean up bad records before
>ingest. Thanks,
>Roy
>
>
>On 10/25/06 8:53 AM, "Sarah L. Shreeves" <[log in to unmask]> wrote:
>
> > Loosening the validation was required for us just
> > to get the OAI harvest done. Strict validation
> > could cause the harvest to hang or fail and usually
> > wasn't worth it for a handful of bad records. We
> > tend to throw out the bad records. If there was a
> > more systematic problem, we would try to contact the data provider.
> >
> > Character encoding is something we should be
> > striving to get right - and should be making sure
> > that our vendors know how to do correctly.
> > Whether the data resides in a digital library
> > environment or not, it's not going to be very
> > interoperable (such as being useful in an on-line
> > catalog!) if there are hang-ups on technical glitches.
> >
> > Sarah
> >
> > At 08:24 AM 10/25/2006, Jackie Shieh wrote:
> >
> >> Though, I'd be a bit careful about loosening validation
> >> standards, as it may come back and haunt one later...
> >>
> >> In this particular case, since I am hoping to get it
> >> working for our online catalog, when character encoding is
> >> incorrect, indexing will then be faulty. Thus, the object
> >> will most likely be lost in the abyss, and the user will not
> >> be able to find it. Consequently it defeats the purpose of
> >> providing it via the online catalog, doesn't it?!
> >>
> >> That said, if the data is to reside only in the digital
> >> lib environment, perhaps the character encoding is not
> >> such a big issue as it can be. (For me at this time it is...
> >> plus more to look into from my original query to the list,
> >> i.e.
> >> the mapping of the stylesheet for 130/240 and parent/child
> >> node for subject!)
> >>
> >> --Jackie
> >>
> >> On Wed, 25 Oct 2006, Sarah L. Shreeves wrote:
> >>
> >>> There's never any guarantee that metadata that
> >>> has been harvested via the OAI Protocol is free
> >>> from character encoding problems. In fact, here
> >>> in our harvesting work at Illinois, we've often
> >>> encountered character encoding problems, so
> >>> much so that we've had to really loosen our
> >>> validation procedures when harvesting. Lagoze
> >>> et al also mention this issue in passing
> >>> in their recent JCDL paper "Metadata
> >>> Aggregation and 'Automated Digital Libraries':
> >>> A Retrospective on the NSDL Experience".
> >>>
> >>> This is sometimes a problem when folks cut and
> >>> paste from MS Word, but at times it can be the
> >>> digital content management system itself that causes the problem.
> >>>
> >>> See the OAI best practices on this:
> >>> http://oai-best.comm.nsdl.org/cgi-bin/wiki.pl?CharacterEncoding
> >>>
> >>> Sarah
> >>>
> >>> ------------------------------------------------------------------------
> >>> Sarah L. Shreeves
> >>> Coordinator, Illinois Digital Environment for
> >>> Access to Learning and Scholarship (IDEALS)
> >>> University of Illinois Library at Urbana-Champaign
> >>> Phone: 217-244-3877 or 217-233-4648
> >>> Email: [log in to unmask]
> >>> http://ideals.uiuc.edu/
> >>> At 07:43 AM 10/25/2006, Jackie Shieh wrote:
> >>>
> >>>> Yes, I am aware that Unicode has various UTF encodings.
> >>>> I had to deal with converting UTF-16 to UTF-8 in order
> >>>> to use the MARC::Record module.
> >>>> What puzzled me was how OAI harvested data
> >>>> can have this problem when it was already declared
> >>>> as utf-8... which then caused my MARC::Record to have
> >>>> diacritics problems. I suspect this is more complicated and I
> >>>> must trace back to the original supplier of the data.
> >>>> Then go from there.
> >>>> Thank you very much for all your patience in this mystery!
> >>>> Regards,
> >>>> --Jackie
> >>>> On Tue, 24 Oct 2006, Erik Hetzner wrote:
> >>>>
> >>>>> At Tue, 24 Oct 2006 14:55:18 -0400,
> >>>>> Jackie Shieh <[log in to unmask]> wrote:
> >>>>>> I am fairly new to MODS and MARC21 conversion,
> >>>>>> so my question is perhaps too elementary...
> >>>>>> If declaring output as ascii, don't I then miss
> >>>>>> the proper diacritics encoding?! The records
> >>>>>> I have are primarily non-English.
> >>>>> Character encoding is simple in concept but complex in execution. I am
> >>>>> not an expert but I will do my best.
> >>>>> The Unicode code point for LATIN SMALL LETTER E WITH ACUTE (é; if you
> >>>>> do not see an e with an acute accent your (or possibly my) mail reader
> >>>>> is not working properly) is U+00E9 (see
> >>>>> <http://www.fileformat.info/info/unicode/char/00e9/>).
> >>>>> As UTF-8 this is expressed by the two bytes
> >>>>> 0xC3 0xA9 (see above message). If your
> >>>>> document is encoded as UTF-8 then those two bytes will make the
> >>>>> character above. If you are looking at the file as latin-1 encoding
> >>>>> these bytes will not look like this é but instead like Ã©. If you set
> >>>>> the encoding of your output file to ascii it will “entity encode” your
> >>>>> character as &#233; (decimal) or &#xE9; (hex). If you are processing
> >>>>> this xml with a useful parser it does not care if you have: (a) é in
> >>>>> utf-8; or (b) &#233; or &#xE9; as entity encoded characters. But if
> >>>>> you wish to force your XSL transform to output entity encoded ascii
> >>>>> rather than UTF-8 you must set your encoding to “ascii” in your
> >>>>> <xsl:output> element. This means that the file itself is 7-bit ascii
> >>>>> but all the characters outside of those 7 bits will be encoded as
> >>>>> numeric character references in pure ascii, which are equivalent as
> >>>>> far as XML parsers are concerned.
> >>>>> best,
> >>>>> --
> >>>>> Erik Hetzner
> >>>>> California Digital Library
> >>>>> 510-987-0884
> >>>>
> >>>>
> >>>> -----------------------------------------------------------------------------------------------
> >>>> Sarah L. Shreeves
> >>>> Coordinator, Illinois Digital Environment for
> >>>> Access to Learning and Scholarship (IDEALS)
> >>>> University of Illinois Library at Urbana-Champaign
> >>>> Phone: 217-244-3877 or 217-233-4648
> >>>> Email: [log in to unmask]
> >>>> http://ideals.uiuc.edu/

-----------------------------------------------------------------------------------------------
Sarah L. Shreeves
Coordinator, Illinois Digital Environment for
Access to Learning and Scholarship (IDEALS)
University of Illinois Library at Urbana-Champaign
Phone: 217-244-3877 or 217-233-4648
Email: [log in to unmask]
http://ideals.uiuc.edu/
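
For anyone following the byte-level explanation quoted earlier in this thread, here is a minimal Python sketch of the same points: the UTF-8 bytes for U+00E9, what they look like when misread as latin-1, the ascii output with numeric character references, and the fact that an XML parser treats all of these forms alike. Python is used here only for illustration (the thread itself involves Perl's MARC::Record and XSLT), and the names E_ACUTE and is_valid_utf8 are hypothetical helpers, not part of anyone's harvesting software. The last function is one possible way to flag a record that is declared as utf-8 but does not actually decode as such.

```python
import xml.etree.ElementTree as ET

# U+00E9, LATIN SMALL LETTER E WITH ACUTE
E_ACUTE = "\u00e9"

# As UTF-8 the character is the two bytes 0xC3 0xA9.
utf8_bytes = E_ACUTE.encode("utf-8")

# Reading those same two bytes as latin-1 yields two characters
# (U+00C3 and U+00A9) instead of one: the familiar garbled form.
misread = utf8_bytes.decode("latin-1")

# Encoding to ascii with character references keeps the file 7-bit clean.
ascii_bytes = E_ACUTE.encode("ascii", errors="xmlcharrefreplace")


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Three serializations of the same one-element document. With no XML
# declaration, the parser assumes UTF-8 for the first one.
doc_utf8 = "<t>caf\u00e9</t>".encode("utf-8")
doc_decimal = b"<t>caf&#233;</t>"
doc_hex = b"<t>caf&#xE9;</t>"

# An XML parser gives the same text for all three.
parsed_texts = {ET.fromstring(d).text for d in (doc_utf8, doc_decimal, doc_hex)}
```

A lone latin-1 byte such as 0xE9 inside a file declared as utf-8 is not a valid UTF-8 sequence, so is_valid_utf8(b"caf\xe9") returns False. A check along these lines could sit in front of a stricter validator, though whether to drop, repair, or report such records is a local policy decision.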