There's never any guarantee that metadata
harvested via the OAI protocol is free from
character encoding problems. In fact, in our
harvesting work here at Illinois we've encountered
encoding problems often enough that we've had to
loosen our validation procedures considerably
when harvesting. Lagoze et al. also mention this
issue in passing in their recent JCDL paper
"Metadata Aggregation and 'Automated Digital
Libraries': A Retrospective on the NSDL Experience".

This is sometimes a problem when folks cut and 
paste from MS Word, but at times it can be the 
digital content management system itself that causes the problem.

See the OAI best practices on this: 
http://oai-best.comm.nsdl.org/cgi-bin/wiki.pl?CharacterEncoding
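
A quick way to spot these problems before they reach your downstream
tools is to check that what a provider returns is actually valid UTF-8,
whatever the XML declaration claims. A minimal Python sketch (the
endpoint URL here is made up):

    import urllib.request

    # Fetch one page of records from a hypothetical OAI-PMH provider.
    url = ("http://example.org/oai"
           "?verb=ListRecords&metadataPrefix=oai_dc")
    raw = urllib.request.urlopen(url).read()

    # If the response declares UTF-8 but contains stray latin-1 (or
    # Word "smart quote") bytes, this decode fails and points at them.
    try:
        raw.decode("utf-8")
        print("valid UTF-8")
    except UnicodeDecodeError as e:
        print("invalid UTF-8 at byte %d: %r"
              % (e.start, raw[e.start:e.start + 4]))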

Sarah


At 07:43 AM 10/25/2006, Jackie Shieh wrote:

>Yes, I am aware that Unicode has various UTF encodings.
>I had to deal with converting UTF-16 to UTF-8 in order
>to use the MARC::Record module.
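
For anyone else hitting this, the UTF-16 to UTF-8 conversion Jackie
describes is a short job in most languages. A Python sketch, with
made-up filenames:

    # Re-encode a UTF-16 file as UTF-8 so downstream MARC tools can
    # read it. Python's "utf-16" codec honors a byte-order mark if one
    # is present; use "utf-16-be" or "utf-16-le" if you know the order.
    with open("records-utf16.xml", "rb") as f:
        data = f.read()
    with open("records-utf8.xml", "wb") as f:
        f.write(data.decode("utf-16").encode("utf-8"))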
>
>What puzzled me was how OAI-harvested data could
>have this problem when it was already declared as
>utf-8... and then cause diacritics problems in my
>MARC::Record code.  I suspect this is more complicated,
>and that I must trace it back to the original supplier
>of the data and go from there.
>
>Thanks very much for all your patience with this mystery!
>
>Regards,
>
>--Jackie
>
>On Tue, 24 Oct 2006, Erik Hetzner wrote:
>
>>At Tue, 24 Oct 2006 14:55:18 -0400,
>>Jackie Shieh <[log in to unmask]> wrote:
>>>I am fairly new to MODS and MARC21 conversion,
>>>so my question is perhaps too elementary...
>>>
>>>If I declare the output to be ascii, don't I then lose
>>>the proper diacritics encoding?!  The records
>>>I have are primarily non-English.
>>
>>Character encoding is simple in concept but complex in execution. I am
>>not an expert but I will do my best.
>>
>>The Unicode codepoint for LATIN SMALL LETTER E WITH ACUTE (é; if you
>>do not see an e with an acute accent, your (or possibly my) mail
>>reader is not working properly) is U+00E9 (see
>><http://www.fileformat.info/info/unicode/char/00e9/>).
>>As UTF-8 this is expressed by the two bytes
>>0xC3 0xA9 (see above message). If your
>>document is encoded as UTF-8 then those two bytes will make the
>>character above. If you are looking at the file as latin-1,
>>those bytes will not look like é but instead like Ã©.
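
You can watch exactly this round trip happen from a Python prompt
(purely illustrative):

    c = "\u00e9"                 # LATIN SMALL LETTER E WITH ACUTE
    b = c.encode("utf-8")        # the two bytes 0xC3 0xA9
    print(b)                     # b'\xc3\xa9'
    print(b.decode("utf-8"))     # é   -- correct round trip
    print(b.decode("latin-1"))   # Ã©  -- the classic mojibake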
>>If you set the encoding of your output file to ascii it will “entity
>>encode” your character as &#233; (decimal) or &#xe9; (hex). If you
>>are processing this XML with a conforming parser, it does not care
>>whether you have (a) é in UTF-8, or (b) &#233; or &#xe9; as
>>entity-encoded characters. But if you wish to force your XSL
>>transform to output entity-encoded ascii rather than UTF-8, you must
>>set the encoding to “ascii” in your <xsl:output> element. This means
>>that the file itself is 7-bit ascii, with every character outside
>>that range entity-encoded, which is equivalent as far as XML parsers
>>are concerned.
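
Outside of XSLT, the same substitution is easy to reproduce. For
example, Python's "xmlcharrefreplace" error handler entity-encodes
anything that won't fit in ascii:

    s = "caf\u00e9"
    print(s.encode("ascii", "xmlcharrefreplace"))  # b'caf&#233;'
    print(s.encode("utf-8"))                       # b'caf\xc3\xa9'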
>>
>>best,
>>--
>>Erik Hetzner
>>California Digital Library
>>510-987-0884
>
-----------------------------------------------------------------------------------------------
Sarah L. Shreeves
Coordinator, Illinois Digital Environment for
Access to Learning and Scholarship (IDEALS)
University of Illinois Library at Urbana-Champaign
Phone: 217-244-3877 or 217-233-4648
Email: [log in to unmask]
http://ideals.uiuc.edu/