Roy -
That's a good idea. I don't think that we have -
I've queried our programmer who's developed most of our harvesting software.
Thanks!
Sarah
At 10:38 AM 10/25/2006, Roy Tennant wrote:
>Sarah,
>Did you consider piping bad records through something like tidy? I've always
>wondered whether that might be a good way to clean up bad records before
>ingest. Thanks,
>Roy
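One way to picture the kind of pre-ingest cleanup suggested above is a lenient decoder. This is only a sketch under stated assumptions: it does not use tidy, the function name decode_leniently is hypothetical, and the Windows-1252 fallback is a guess based on the cut-and-paste-from-MS-Word problem mentioned elsewhere in this thread, not a method anyone here described.

```python
def decode_leniently(raw):
    """Return (text, encoding_used) for one record's raw bytes.

    Assumption: bytes that are not valid UTF-8 most likely came from a
    Windows-1252 source (for example, text pasted from MS Word).
    """
    try:
        # Strict decode: raises UnicodeDecodeError on any invalid sequence.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Fallback guess; bytes undefined in cp1252 become U+FFFD.
        return raw.decode("cp1252", errors="replace"), "cp1252 (guessed)"


# A valid UTF-8 record and one containing cp1252 "smart quotes".
print(decode_leniently(b"Caf\xc3\xa9"))
print(decode_leniently(b"\x93quoted\x94"))
```

A record repaired this way should still be flagged for review, since the fallback encoding is a guess and may be wrong for a given data provider.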
>
>
>On 10/25/06 8:53 AM, "Sarah L. Shreeves" <[log in to unmask]> wrote:
>
> > Loosening the validation was required for us just
> > to get the OAI harvest done. Strict validation
> > could cause the harvest to hang or fail and usually
> > wasn't worth it for a handful of bad records. We
> > tend to throw out the bad records. If there were a
> > more systematic problem, we would try to contact the data provider.
> >
> > Character encoding is something we should be
> > striving to get right - and should be making sure
> > that our vendors know how to do correctly.
> > Whether the data resides in a digital library
> > environment or not, it's not going to be very
> > interoperable (such as being useful in an online
> > catalog!) if there are hang-ups on technical glitches.
> >
> > Sarah
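The "throw out the bad records" approach described above could look roughly like the following. This is a minimal sketch, not the Illinois harvesting software: the function name split_records and the sample records are hypothetical, and it assumes each harvested record is available as raw bytes.

```python
import xml.etree.ElementTree as ET


def split_records(raw_records):
    """Separate records that are valid UTF-8 and well-formed XML from the rest."""
    good, bad = [], []
    for raw in raw_records:
        try:
            text = raw.decode("utf-8")  # strict: fails on invalid byte sequences
            ET.fromstring(raw)          # fails if the XML is not well-formed
        except (UnicodeDecodeError, ET.ParseError) as err:
            # Keep the reason so systematic problems can be reported
            # to the data provider instead of silently dropped.
            bad.append((raw, str(err)))
        else:
            good.append(text)
    return good, bad


samples = [
    "<title>Caf\u00e9</title>".encode("utf-8"),    # valid
    "<title>Caf\u00e9</title>".encode("latin-1"),  # wrong encoding for UTF-8
    b"<title>unclosed",                            # not well-formed
]
good, bad = split_records(samples)
print(len(good), "kept;", len(bad), "discarded")
```

Collecting the failure reasons alongside the discarded records makes it easier to tell a handful of stray bad records from a systematic problem at one provider.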
> >
> > At 08:24 AM 10/25/2006, Jackie Shieh wrote:
> >
> >> Though, I'd be a bit careful about loosening validation
> >> standards, as it may come back to haunt one later...
> >>
> >> In this particular case, since I am hoping to get it
> >> working for our online catalog, when the character encoding is
> >> incorrect, indexing will be faulty. Thus, the object
> >> will most likely be lost in the abyss, and users will not be able
> >> to find it. That defeats the purpose of providing
> >> it via the online catalog, doesn't it?!
> >>
> >> That said, if the data is to reside only in the digital
> >> lib environment, perhaps, the character encoding is not
> >> such a big issue as it can be. (For me at this time it is...
> >> plus more to look into from my original query to the list,
> >> i.e. the mapping of stylesheet for 130/240 and parent/child
> >> node for subject!)
> >>
> >> --Jackie
> >>
> >> On Wed, 25 Oct 2006, Sarah L. Shreeves wrote:
> >>
> >>> There's never any guarantee that metadata that
> >>> has been harvested via the OAI Protocol is free
> >>> from character encoding problems. In fact, here
> >>> in our harvesting work at Illinois, we've often
> >>> encountered character encoding problems, so
> >>> much so that we've had to really loosen our
> >>> validation procedures when harvesting. Lagoze
> >>> et al. also mention this issue in passing
> >>> in their recent JCDL paper "Metadata
> >>> Aggregation and 'Automated Digital Libraries':
> >>> A Retrospective on the NSDL Experience".
> >>>
> >>> This is sometimes a problem when folks cut and
> >>> paste from MS Word, but at times it can be the
> >>> digital content management system itself that causes the problem.
> >>>
> >>> See the OAI best practices on this:
> >>> http://oai-best.comm.nsdl.org/cgi-bin/wiki.pl?CharacterEncoding
> >>>
> >>> Sarah
> >>>
> >>> ------------------------------------------------------------------------
> >>> Sarah L. Shreeves
> >>> Coordinator, Illinois Digital Environment for
> >>> Access to Learning and Scholarship (IDEALS)
> >>> University of Illinois Library at Urbana-Champaign
> >>> Phone: 217-244-3877 or 217-233-4648
> >>> Email: [log in to unmask]
> >>> http://ideals.uiuc.edu/
> >>> At 07:43 AM 10/25/2006, Jackie Shieh wrote:
> >>>
> >>>> Yes, I am aware that Unicode has various UTF encodings.
> >>>> I had to deal with converting UTF-16 to UTF-8 in order
> >>>> to use the MARC::Record module.
> >>>> What puzzled me was how OAI harvested data
> >>>> can have this problem when it was already declared
> >>>> as utf-8... which then caused my MARC::Record to have diacritics
> >>>> problems. I suspect this is more complicated and that I
> >>>> must trace it back to the original supplier of the data,
> >>>> then go from there.
> >>>> Thank you very much for all your patience in this mystery!
> >>>> Regards,
> >>>> --Jackie
> >>>> On Tue, 24 Oct 2006, Erik Hetzner wrote:
> >>>>
> >>>>> At Tue, 24 Oct 2006 14:55:18 -0400,
> >>>>> Jackie Shieh <[log in to unmask]> wrote:
> >>>>>> I am fairly new to MODS and MARC21 conversion,
> >>>>>> so my question is perhaps too elementary...
> >>>>>> If I declare the output as ascii, don't I then lose
> >>>>>> the proper diacritics encoding?! The records
> >>>>>> I have are primarily non-English.
> >>>>> Character encoding is simple in concept but complex in execution. I am
> >>>>> not an expert but I will do my best.
> >>>>> The Unicode code point for LATIN SMALL LETTER E WITH ACUTE (é; if you
> >>>>> do not see an e with an acute accent your (or possibly my) mail reader
> >>>>> is not working properly) is U+00E9 (see
> >>>>> <http://www.fileformat.info/info/unicode/char/00e9/>).
> >>>>> As UTF-8 this is expressed by the two bytes
> >>>>> 0xC3 0xA9 (see above message). If your
> >>>>> document is encoded as UTF-8 then those two bytes will make the
> >>>>> character above. If you are looking at the file as latin-1 encoding
> >>>>> these bytes will not look like this é but instead like Ã©. If you set
> >>>>> the encoding of your output file to ascii it will “entity encode” your
> >>>>> character as &#233; (decimal) or &#xE9; (hex). If you are processing
> >>>>> this xml with a useful parser it does not care if you have: (a) é in
> >>>>> utf-8; or (b) &#233; or &#xE9; as entity encoded characters. But if
> >>>>> you wish to force your XSL transform to output entity encoded ascii
> >>>>> rather than UTF-8 you must set your encoding to “ascii” in your
> >>>>> <xsl:output> element. This means that the file itself is 7-bit ascii
> >>>>> but all the characters outside of those 7 bits will be encoded as
> >>>>> numeric character references in pure ascii, which is equivalent as
> >>>>> far as XML parsers are concerned.
> >>>>> best,
> >>>>> --
> >>>>> Erik Hetzner
> >>>>> California Digital Library
> >>>>> 510-987-0884
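The byte-level behaviour described in the message above can be checked with a few lines of Python. This is an illustrative sketch only: the variable names are made up for the example, and the XSLT side (the encoding attribute on <xsl:output>) is not exercised here, only the encodings themselves.

```python
import xml.etree.ElementTree as ET

E_ACUTE = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, code point U+00E9

# UTF-8 stores the character as the two bytes 0xC3 0xA9.
utf8_bytes = E_ACUTE.encode("utf-8")

# Reading those same two bytes as latin-1 yields two characters instead of one.
misread = utf8_bytes.decode("latin-1")

# Forcing pure ASCII output replaces the character with a numeric reference.
ascii_ref = E_ACUTE.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

# An XML parser treats the raw UTF-8 form and both reference forms alike.
variants = [
    b"<t>\xc3\xa9</t>",  # literal character in UTF-8
    b"<t>&#233;</t>",    # decimal character reference
    b"<t>&#xE9;</t>",    # hexadecimal character reference
]
parsed = [ET.fromstring(doc).text for doc in variants]

print([hex(b) for b in utf8_bytes])  # ['0xc3', '0xa9']
print(misread)                       # Ã©
print(ascii_ref)                     # &#233;
print(parsed)                        # ['é', 'é', 'é']
```

The last line is the practical point: once parsed, all three serializations carry the same character, so the choice between UTF-8 output and ASCII with character references only matters to tools that read the bytes without an XML parser.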
> >>>>
> >>>>
-----------------------------------------------------------------------------------------------
Sarah L. Shreeves
Coordinator, Illinois Digital Environment for
Access to Learning and Scholarship (IDEALS)
University of Illinois Library at Urbana-Champaign
Phone: 217-244-3877 or 217-233-4648
Email: [log in to unmask]
http://ideals.uiuc.edu/