Did you consider piping bad records through something like tidy? I've always
wondered whether that might be a good way to clean up bad records before
ingesting them.
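Something along these lines, though I haven't actually tested whether tidy's
XML mode fixes the kinds of problems you're describing (the filename is just
a placeholder):

  use strict;
  use warnings;

  # Run a harvested record through tidy before trying to validate/parse it.
  # (-q = quiet, -xml = treat the input as XML, -utf8 = UTF-8 in and out)
  my $cleaned = `tidy -q -xml -utf8 bad-record.xml`;
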
On 10/25/06 8:53 AM, "Sarah L. Shreeves" <[log in to unmask]> wrote:
> Loosening the validation was required for us just
> to get the OAI harvest done. Strict validation
> could cause a harvest to hang or fail, and usually
> wasn't worth it for a handful of bad records. We
> tend to throw out the bad records. If there was a
> more systematic problem, we would try to contact the data provider.
> Character encoding is something we should be
> striving to get right - and should be making sure
> that our vendors know how to do correctly.
> Whether the data resides in a digital library
> environment or not, it's not going to be very
> interoperable (such as being useful in an online
> catalog!) if there are hang-ups over technical glitches.
> At 08:24 AM 10/25/2006, Jackie Shieh wrote:
>> Though I'd be a bit careful about loosening validation
>> standards, as it may come back to haunt one later...
>> In this particular case, since I am hoping to get it
>> working for our online catalog, when the character encoding is
>> incorrect, the indexing will be faulty. Thus the object
>> will most likely be lost in the abyss, and users will not be able
>> to find it. That defeats the purpose of providing
>> it via the online catalog, doesn't it?!
>> That said, if the data is to reside only in the digital
>> library environment, perhaps character encoding is not
>> as big an issue as it can be otherwise. (For me at this time it is...
>> plus there is more to look into from my original query to the list,
>> i.e. the mapping of the stylesheet for 130/240 and the parent/child
>> node for subject!)
>> On Wed, 25 Oct 2006, Sarah L. Shreeves wrote:
>>> There's never any guarantee that metadata that
>>> has been harvested via the OAI Protocol is free
>>> from character encoding problems. In fact, here
>>> in our harvesting work at Illinois, we've often
>>> encountered character encoding problems, so
>>> much so that we've had to really loosen our
>>> validation procedures when harvesting. Lagoze
>>> et al. also mention this issue in passing
>>> in their recent JCDL paper "Metadata
>>> Aggregation and 'Automated Digital Libraries':
>>> A Retrospective on the NSDL Experience".
>>> This is sometimes a problem when folks cut and
>>> paste from MS Word, but at times it can be the
>>> digital content management system itself that causes the problem.
>>> See the OAI best practices on this:
>>> Sarah L. Shreeves
>>> Coordinator, Illinois Digital Environment for
>>> Access to Learning and Scholarship (IDEALS)
>>> University of Illinois Library at Urbana-Champaign
>>> Phone: 217-244-3877 or 217-233-4648
>>> Email: [log in to unmask]
>>> At 07:43 AM 10/25/2006, Jackie Shieh wrote:
>>>> Yes, I am aware that Unicode has various UTF encodings.
>>>> I had to deal with converting UTF-16 to UTF-8 in order
>>>> to use the MARC::Record module.
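>>>> Roughly, something along these lines with Perl's core Encode module
>>>> (the filenames here are just placeholders, not the real ones):
>>>>
>>>>   use strict;
>>>>   use warnings;
>>>>   use Encode qw(decode encode);
>>>>
>>>>   # Slurp the raw bytes, decode them as UTF-16 (the BOM tells Encode
>>>>   # which byte order), and write them back out as UTF-8.
>>>>   open my $in,  '<:raw', 'records-utf16.xml' or die "open: $!";
>>>>   open my $out, '>:raw', 'records-utf8.xml'  or die "open: $!";
>>>>   local $/;                     # slurp the whole file at once
>>>>   print {$out} encode('UTF-8', decode('UTF-16', <$in>));
>>>>   close $out or die "close: $!";
>>>>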
>>>> What puzzled me was how OAI-harvested data
>>>> can have this problem when it was already declared
>>>> as UTF-8... which then caused my MARC::Record to have
>>>> diacritics problems. I suspect this is more complicated,
>>>> and that I must trace it back to the original supplier
>>>> of the data, then go from there.
>>>> Thanks very much for all your patience with this mystery!
>>>> On Tue, 24 Oct 2006, Erik Hetzner wrote:
>>>>> At Tue, 24 Oct 2006 14:55:18 -0400,
>>>>> Jackie Shieh <[log in to unmask]> wrote:
>>>>>> I am fairly new to MODS and MARC21 conversion,
>>>>>> so my question is perhaps too elementary...
>>>>>> If I declare the output as ASCII, don't I then miss
>>>>>> the proper diacritics encoding?! The records
>>>>>> I have are primarily non-English.
>>>>> Character encoding is simple in concept but complex in execution. I am
>>>>> not an expert but I will do my best.
>>>>> The Unicode code point for LATIN SMALL LETTER E WITH ACUTE (é; if you do
>>>>> not see an e with an acute accent, your (or possibly my) mail reader is
>>>>> not working properly) is U+00E9. As UTF-8 this is expressed by the two
>>>>> bytes 0xC3 0xA9 (see above message). If your document is encoded as
>>>>> UTF-8 then those two bytes will make the character above. If you are
>>>>> looking at the file as latin-1, those bytes will not look like é but
>>>>> instead like Ã©. If you set the encoding of your output file to ascii,
>>>>> it will "entity encode" your character as &#233; (decimal) or &#xE9;
>>>>> (hex).
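>>>>> You can check this with a couple of lines of Perl using the core Encode
>>>>> module (purely illustrative):
>>>>>
>>>>>   use strict;
>>>>>   use warnings;
>>>>>   use Encode qw(encode decode);
>>>>>
>>>>>   my $e_acute = "\x{00E9}";                 # LATIN SMALL LETTER E WITH ACUTE
>>>>>   my $bytes   = encode('UTF-8', $e_acute);  # the two bytes 0xC3 0xA9
>>>>>   printf "%s\n", unpack('H*', $bytes);      # prints "c3a9"
>>>>>
>>>>>   # Misreading those same bytes as latin-1 yields the two characters
>>>>>   # A-tilde and the copyright sign, i.e. the familiar "Ã©" mojibake.
>>>>>   my $mojibake = decode('ISO-8859-1', $bytes);
>>>>>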
>>>>> If you are processing this XML with a useful parser, it does not care
>>>>> whether you have (a) é in UTF-8, or (b) &#233; or &#xE9; as
>>>>> entity-encoded characters. But if you wish to force your XSL transform
>>>>> to output entity-encoded ASCII rather than UTF-8, you must set the
>>>>> encoding to "ascii" in your <xsl:output> element. This means that the
>>>>> file itself is 7-bit ASCII, but all the characters outside of those 7
>>>>> bits will be written as character references, which is equivalent as
>>>>> far as XML parsers are concerned.
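>>>>> For instance, here is a throwaway identity stylesheet run through
>>>>> XML::LibXSLT, just to show where the encoding gets set (this is only an
>>>>> illustration, not your actual MODS-to-MARC stylesheet):
>>>>>
>>>>>   use strict;
>>>>>   use warnings;
>>>>>   use XML::LibXML;
>>>>>   use XML::LibXSLT;
>>>>>
>>>>>   # The only interesting part is <xsl:output encoding="ascii"/>: it makes
>>>>>   # the serializer write non-ASCII characters as character references.
>>>>>   my $xsl = q{
>>>>>     <xsl:stylesheet version="1.0"
>>>>>         xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
>>>>>       <xsl:output method="xml" encoding="ascii"/>
>>>>>       <xsl:template match="@*|node()">
>>>>>         <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
>>>>>       </xsl:template>
>>>>>     </xsl:stylesheet>
>>>>>   };
>>>>>
>>>>>   my $parser     = XML::LibXML->new;
>>>>>   my $stylesheet = XML::LibXSLT->new->parse_stylesheet( $parser->parse_string($xsl) );
>>>>>   my $result     = $stylesheet->transform( $parser->parse_string('<title>&#xE9;tude</title>') );
>>>>>   print $stylesheet->output_string($result);
>>>>>   # The é comes back out as a numeric character reference in a pure
>>>>>   # ASCII file, e.g. &#233; or &#xE9;, depending on the serializer.
>>>>>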
>>>>> Erik Hetzner
>>>>> California Digital Library