Though I'd be a bit careful about loosening validation
standards, as it may come back to haunt one later...

In this particular case, I am hoping to get it working
for our online catalog. When the character encoding is
incorrect, indexing will be faulty, so the object will
most likely be lost in the abyss and users will not be
able to find it. That defeats the purpose of providing
it via the online catalog, doesn't it?!

That said, if the data is to reside only in the digital
library environment, perhaps character encoding is not
such a big issue. (For me at this time it is... plus
there is more to look into from my original query to the
list, i.e. the stylesheet mapping for 130/240 and the
parent/child nodes for subject!)

--Jackie
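
A minimal sketch of the kind of strict check discussed here,
assuming Python 3. The helper name is_valid_utf8 and the sample
byte strings are illustrative only; they are not taken from any
harvester or from MARC::Record.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# "cafe" with e-acute, encoded correctly as UTF-8.
good = "caf\u00e9".encode("utf-8")

# The same text encoded as Windows-1252, as can happen with text pasted
# from a word processor. The lone byte 0xE9 is not valid UTF-8.
bad = "caf\u00e9".encode("cp1252")

# Doubly encoded text: UTF-8 bytes misread as Latin-1, then re-encoded.
# This is still well-formed UTF-8, so a strict decode cannot detect it.
double_encoded = good.decode("latin-1").encode("utf-8")

print(is_valid_utf8(good))            # True
print(is_valid_utf8(bad))             # False
print(is_valid_utf8(double_encoded))  # True
```

Note the last case: a record can pass a strict UTF-8 check and
still display wrongly, so a check like this narrows the problem
but does not guarantee clean diacritics.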

On Wed, 25 Oct 2006, Sarah L. Shreeves wrote:

> There's never any guarantee that metadata that has been harvested via the OAI 
> Protocol is free from character encoding problems. In fact, here in our 
> harvesting work at Illinois, we've often encountered character encoding 
> problems, so much so that we've had to really loosen our validation 
> procedures when harvesting. Lagoze et al. also mention this issue in passing 
> in their recent JCDL paper "Metadata Aggregation and 'Automated 
> Digital Libraries': A Retrospective on the NSDL Experience".
>
> This is sometimes a problem when folks cut and paste from MS Word, but at 
> times it can be the digital content management system itself that causes the 
> problem.
>
> See the OAI best practices on this: 
> http://oai-best.comm.nsdl.org/cgi-bin/wiki.pl?CharacterEncoding
>
> Sarah
>
> ------------------------------------------------------------------------
> Sarah L. Shreeves
> Coordinator, Illinois Digital Environment for Access to Learning and 
> Scholarship (IDEALS)
> University of Illinois Library at Urbana-Champaign
> Phone: 217-244-3877 or 217-233-4648
> Email: [log in to unmask]
> http://ideals.uiuc.edu/ 
>
> At 07:43 AM 10/25/2006, Jackie Shieh wrote:
>
>> Yes, I am aware that Unicode has various UTF encodings.
>> I had to deal with converting UTF-16 to UTF-8 in order
>> to use the MARC::Record module.
>> 
>> What puzzled me was how OAI-harvested data can have
>> this problem when it was already declared as UTF-8...
>> which then caused my MARC::Record to have diacritics
>> problems.  I suspect this is more complicated, and I
>> must trace it back to the original supplier of the data,
>> then go from there.
>> 
>> Thank you very much for all your patience with this mystery!
>> 
>> Regards,
>> 
>> --Jackie
>> 
>> On Tue, 24 Oct 2006, Erik Hetzner wrote:
>> 
>>> At Tue, 24 Oct 2006 14:55:18 -0400,
>>> Jackie Shieh <[log in to unmask]> wrote:
>>>> I am fairly new to MODS and MARC21 conversion,
>>>> so my question is perhaps too elementary...
>>>> 
>>>> If I declare the output as ascii, don't I then lose
>>>> the proper diacritics encoding?!  The records
>>>> I have are primarily non-English.
>>> 
>>> Character encoding is simple in concept but complex in execution. I am
>>> not an expert but I will do my best.
>>> 
>>> The Unicode code point for LATIN SMALL LETTER E WITH ACUTE (é; if you
>>> do not see an e with an acute accent your (or possibly my) mail reader
>>> is not working properly) is U+00E9 (see
>>> <http://www.fileformat.info/info/unicode/char/00e9/>). In UTF-8 this is
>>> expressed by the two bytes 0xC3 0xA9 (see above message). If your
>>> document is encoded as UTF-8 then those two bytes will make the
>>> character above. If you are looking at the file as latin-1 encoding
>>> these bytes will not look like this é but instead like Ã©. If you set
>>> the encoding of your output file to ascii it will “entity encode” your
>>> character (as a numeric character reference) as &#233; (decimal) or
>>> &#xe9; (hex). If you are processing this xml with a useful parser it
>>> does not care if you have: (a) é in utf-8; or (b) &#233; or &#xe9; as
>>> entity encoded characters. But if you wish to force your XSL transform
>>> to output entity encoded ascii rather than UTF-8 you must set your
>>> encoding to “ascii” in your <xsl:output> element. This means that the
>>> file itself is 7-bit ascii, and all the characters outside of those
>>> 7 bits will be written as character references in pure ascii, which is
>>> equivalent as far as XML parsers are concerned.
>>> 
>>> best,
>>> --
>>> Erik Hetzner
>>> California Digital Library
>>> 510-987-0884
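
The byte-level behaviour described in Erik's message can be
reproduced in a few lines. This is only a sketch, assuming Python 3
and its standard library; the variable names and the <t> test
element are made up for illustration, and the last step imitates
(rather than runs) an XSLT processor with ascii output.

```python
import xml.etree.ElementTree as ET

e_acute = "\u00e9"  # U+00E9 LATIN SMALL LETTER E WITH ACUTE

# UTF-8 stores this one character as the two bytes 0xC3 0xA9.
utf8_bytes = e_acute.encode("utf-8")
print(utf8_bytes)  # b'\xc3\xa9'

# Reading those same two bytes as Latin-1 yields two wrong characters,
# U+00C3 (A with tilde) followed by U+00A9 (copyright sign).
misread = utf8_bytes.decode("latin-1")
print(len(misread))  # 2

# Pure-ASCII output: anything outside 7 bits becomes a numeric
# character reference, much like ascii output from a stylesheet.
ascii_bytes = e_acute.encode("ascii", errors="xmlcharrefreplace")
print(ascii_bytes)  # b'&#233;'

# An XML parser treats all three spellings as the same character.
from_utf8 = ET.fromstring(b"<t>" + utf8_bytes + b"</t>").text
from_decimal = ET.fromstring("<t>&#233;</t>").text
from_hex = ET.fromstring("<t>&#xe9;</t>").text
print(from_utf8 == from_decimal == from_hex == e_acute)  # True
```

The point of the last comparison is that the choice between UTF-8
output and ascii output with character references changes only the
bytes on disk, not what a conforming parser hands to the next tool.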