Loosening the validation was required for us just
to get the OAI harvest done. Strict validation
could cause a harvest to hang or fail, and it usually
wasn't worth it for a handful of bad records. We
tend to throw out the bad records; if there were a
more systematic problem, we would try to contact the data provider.
Character encoding is something we should be
striving to get right - and should be making sure
that our vendors know how to do correctly.
Whether the data resides in a digital library
environment or not, it's not going to be very
interoperable (such as being useful in an online
catalog!) if there are hang-ups on technical glitches.
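The "loosened validation" approach described above can be sketched in a few lines of Python. This is an editorial illustration, not code from the Illinois harvester: the idea is simply to try parsing each harvested record, keep the good ones, and set bad ones aside rather than letting one malformed record halt the whole harvest.

```python
# Sketch: tolerate bad records during a harvest instead of failing outright.
# The record strings below are made-up examples.
import xml.etree.ElementTree as ET

def filter_records(raw_records):
    """Split raw XML record strings into parseable and rejected lists."""
    good, bad = [], []
    for raw in raw_records:
        try:
            ET.fromstring(raw)   # raises on malformed XML or bad byte sequences
            good.append(raw)
        except ET.ParseError:
            bad.append(raw)      # set aside (or log) instead of halting the harvest
    return good, bad

records = [
    "<record><title>Caf\u00e9 culture</title></record>",   # well-formed
    "<record><title>Broken & unescaped</title></record>",  # bare ampersand: invalid XML
]
good, bad = filter_records(records)
print(len(good), len(bad))  # 1 1
```

The rejected list can then be reviewed later, matching the practice of contacting the provider only when the problem looks systematic.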
Sarah
At 08:24 AM 10/25/2006, Jackie Shieh wrote:
>Though, I'd be a bit careful about loosening validation
>standards, as it may come back to haunt one later...
>
>In this particular case, since I am hoping to get it
>working for our online catalog, when the character encoding
>is incorrect the indexing will be faulty. The object
>will most likely be lost in the abyss, and the user will
>not be able to find it, which defeats the purpose of
>providing it via the online catalog, doesn't it?!
>
>That said, if the data is to reside only in the digital
>library environment, perhaps the character encoding is not
>as big an issue as it can be. (For me at this time it is...
>plus more to look into from my original query to the list,
>i.e. the stylesheet mapping for 130/240 and the parent/child
>node for subject!)
>
>--Jackie
>
>On Wed, 25 Oct 2006, Sarah L. Shreeves wrote:
>
>>There's never any guarantee that metadata that
>>has been harvested via the OAI protocol is free
>>from character encoding problems. In fact, in
>>our harvesting work at Illinois we've encountered
>>character encoding problems so often that we've
>>had to really loosen our validation procedures
>>when harvesting. Lagoze et al. also mention this
>>issue in passing in their recent JCDL paper
>>"Metadata Aggregation and "Automated Digital
>>Libraries": A Retrospective on the NSDL
>>Experience".
>>
>>This is sometimes a problem when folks cut and
>>paste from MS Word, but at times it can be the
>>digital content management system itself that causes the problem.
>>
>>See the OAI best practices on this:
>>http://oai-best.comm.nsdl.org/cgi-bin/wiki.pl?CharacterEncoding
>>
>>Sarah
>>
>>------------------------------------------------------------------------
>>Sarah L. Shreeves
>>Coordinator, Illinois Digital Environment for
>>Access to Learning and Scholarship (IDEALS)
>>University of Illinois Library at Urbana-Champaign
>>Phone: 217-244-3877 or 217-233-4648
>>Email: [log in to unmask]
>>http://ideals.uiuc.edu/
>>At 07:43 AM 10/25/2006, Jackie Shieh wrote:
>>
>>>Yes, I am aware that Unicode has various UTF encodings.
>>>I had to deal with converting UTF-16 to UTF-8 in order
>>>to use the MARC::Record module.
>>>What puzzled me was how OAI-harvested data
>>>could have this problem when it was already declared
>>>as UTF-8... which then gave my MARC::Record diacritics
>>>problems. I suspect this is more complicated, and I
>>>must trace back to the original supplier of the data,
>>>then go from there.
>>>Thanks very much for all your patience in this mystery!
>>>Regards,
>>>--Jackie
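The UTF-16 to UTF-8 conversion Jackie mentions can be sketched in a few lines of Python. This is an illustration only; the thread's actual workflow used Perl's MARC::Record, and the sample string here is made up.

```python
# Sketch: re-encode UTF-16 data as UTF-8 before further processing.
utf16_bytes = "Caf\u00e9".encode("utf-16")  # BOM plus two bytes per character
text = utf16_bytes.decode("utf-16")         # back to an abstract Unicode string
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)                           # b'Caf\xc3\xa9' -- é is 0xC3 0xA9
```

The key point is that the decode step must name the *actual* encoding of the incoming bytes; declaring data as UTF-8 when the bytes are something else is exactly how the diacritics problems in this thread arise.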
>>>On Tue, 24 Oct 2006, Erik Hetzner wrote:
>>>
>>>>At Tue, 24 Oct 2006 14:55:18 -0400,
>>>>Jackie Shieh <[log in to unmask]> wrote:
>>>>>I am fairly new to MODS and MARC21 conversion,
>>>>>so my question is perhaps too elementary...
>>>>>If I declare the output as ASCII, don't I then
>>>>>lose the proper diacritics?! The records
>>>>>I have are primarily non-English.
>>>>Character encoding is simple in concept but complex in execution. I am
>>>>not an expert, but I will do my best.
>>>>The Unicode codepoint for LATIN SMALL LETTER E WITH ACUTE (é; if you do
>>>>not see an e with an acute accent, your (or possibly my) mail reader is
>>>>not working properly) is U+00E9 (see
>>>><http://www.fileformat.info/info/unicode/char/00e9/>).
>>>>As UTF-8 this is expressed by the two bytes 0xC3 0xA9. If your
>>>>document is encoded as UTF-8, then those two bytes will make the
>>>>character above. If you are looking at the file as latin-1,
>>>>these bytes will not look like é but instead like Ã©. If you set
>>>>the encoding of your output file to ascii, it will "entity encode" the
>>>>character as &#233; (decimal) or &#xE9; (hex). If you are processing
>>>>this XML with a useful parser, it does not care whether you have: (a) é
>>>>in UTF-8; or (b) &#233; or &#xE9; as entity-encoded characters. But if
>>>>you wish to force your XSL transform to output entity-encoded ASCII
>>>>rather than UTF-8, you must set your encoding to "ascii" in your
>>>><xsl:output> element. This means that the file itself is 7-bit ASCII,
>>>>but all characters outside of those 7 bits will be entity-encoded,
>>>>which is equivalent as far as XML parsers are concerned.
>>>>best,
>>>>--
>>>>Erik Hetzner
>>>>California Digital Library
>>>>510-987-0884
>>>