Yes, I am aware that Unicode has various UTF encodings.
I had to convert UTF-16 to UTF-8 in order
to use the MARC::Record module.

What puzzled me was how OAI-harvested data
can have this problem when it was already declared
as UTF-8, which then caused diacritics problems in
MARC::Record.  I suspect this is more complicated and that I
must trace it back to the original supplier of the data,
then go from there.

Thank you very much for all your patience with this mystery!

Regards,

--Jackie

On Tue, 24 Oct 2006, Erik Hetzner wrote:

> At Tue, 24 Oct 2006 14:55:18 -0400,
> Jackie Shieh <[log in to unmask]> wrote:
>> I am fairly new to MODS and MARC21 conversion,
>> so my question is perhaps too elementary...
>>
>> If I declare the output as ascii, don't I then lose
>> the proper diacritics encoding?!  The records
>> I have are primarily non-English.
>
> Character encoding is simple in concept but complex in execution. I am
> not an expert but I will do my best.
>
> The Unicode code point for LATIN SMALL LETTER E WITH ACUTE (é; if you
> do not see an e with an acute accent, your (or possibly my) mail
> reader is not working properly) is U+00E9 (see
> <http://www.fileformat.info/info/unicode/char/00e9/>). As UTF-8 this is
> expressed by the two bytes 0xC3 0xA9 (see above message). If your
> document is encoded as UTF-8 then those two bytes will make the
> character above. If you are looking at the file as Latin-1,
> these bytes will not look like é but instead like Ã©. If you set
> the encoding of your output file to ASCII it will “entity encode” your
> character as &#233; (decimal) or &#xe9; (hex). If you are processing
> this XML with a proper parser it does not care whether you have: (a) é
> in UTF-8; or (b) &#233; or &#xe9; as entity-encoded characters. But if
> you wish to force your XSL transform to output entity-encoded ASCII
> rather than UTF-8 you must set your encoding to “ascii” in your
> <xsl:output> element. This means that the file itself is 7-bit ASCII,
> and every character outside that 7-bit range is written as a numeric
> character reference in pure ASCII, which is equivalent as far as XML
> parsers are concerned.
>
> best,
> --
> Erik Hetzner
> California Digital Library
> 510-987-0884
>
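
The byte-level behaviour described in the quoted message can be checked
with a short sketch. Python is used here only for illustration (the
thread itself concerns Perl's MARC::Record and XSLT), and the variable
names are made up for this example:

```python
import html

# U+00E9, LATIN SMALL LETTER E WITH ACUTE
ch = "\u00e9"

# As UTF-8 the character becomes the two bytes 0xC3 0xA9.
utf8_bytes = ch.encode("utf-8")

# Reading those same two bytes as Latin-1 yields two characters,
# U+00C3 and U+00A9, which is the familiar garbled form.
misread = utf8_bytes.decode("latin-1")

# Forcing ASCII output with character references, roughly what an
# XSLT processor does when the output encoding is set to "ascii".
entity = ch.encode("ascii", errors="xmlcharrefreplace")

# The hex form of the same reference.
hex_entity = f"&#x{ord(ch):x};"

# Both references resolve back to the original character.
roundtrip = html.unescape(entity.decode("ascii"))
roundtrip_hex = html.unescape(hex_entity)

print(utf8_bytes.hex())   # c3a9
print(entity)             # b'&#233;'
print(hex_entity)         # &#xe9;
print(roundtrip == ch, roundtrip_hex == ch)
```

If harvested data declared as UTF-8 shows the two-character form after
decoding, that suggests it was read as Latin-1 somewhere upstream and
then re-encoded, which is consistent with tracing the problem back to
the original supplier.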