Yes, I am aware that Unicode has various UTF encodings.
I had to deal with converting UTF-16 to UTF-8 in order
to use the MARC::Record module.
What puzzled me was how OAI-harvested data
could have this problem when it was already declared
as utf-8... which then caused diacritics problems
with MARC::Record. I suspect this is more complicated, and that I
must trace it back to the original supplier of the data,
then go from there.
Thank you very much for all your patience in this mystery!
On Tue, 24 Oct 2006, Erik Hetzner wrote:
> At Tue, 24 Oct 2006 14:55:18 -0400,
> Jackie Shieh <[log in to unmask]> wrote:
>> I am fairly new to MODS and MARC21 conversion,
>> so my question is perhaps too elementary...
>> If I declare the output as ascii, don't I then lose
>> the proper diacritics encoding?! The records
>> I have are primarily non-English.
> Character encoding is simple in concept but complex in execution. I am
> not an expert but I will do my best.
> The Unicode code point for LATIN SMALL LETTER E WITH ACUTE (é; if you
> do not see an e with an acute accent your (or possibly my) mail reader
> is not working properly) is U+00E9 (see
> <http://www.fileformat.info/info/unicode/char/00e9/>). As UTF-8 this is
> expressed by the two bytes 0xC3 0xA9 (see above message). If your
> document is encoded as UTF-8 then those two bytes will make the
> character above. If you are looking at the file as latin-1 encoding
> these bytes will not look like this é but instead like Ã©. If you set
> the encoding of your output file to ascii it will “entity encode” your
> character as &#233; (decimal) or &#xE9; (hex). If you are processing
> this xml with a useful parser it does not care if you have: (a) é in
> utf-8; or (b) &#233; or &#xE9; as entity encoded characters. But if
> you wish to force your XSL transform to output entity encoded ascii
> rather than UTF-8 you must set your encoding to “ascii” in your
> <xsl:output> element. This means that the file itself is 7-bit ascii
> and all the characters outside of those 7 bits will be entity encoded
> in pure ascii, which is equivalent as far as XML parsers are concerned.
> Erik Hetzner
> California Digital Library
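
The byte-level behaviour described in the quoted message can be checked with a short sketch. Python is an assumption here (the thread itself uses Perl's MARC::Record and XSLT), and the helper name parsed_text is hypothetical, introduced only for illustration. It shows the UTF-8 bytes for U+00E9, what they look like when misread as latin-1, the two "entity encoded" ascii forms, and that an XML parser treats all three spellings as the same character.

```python
import xml.etree.ElementTree as ET

char = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE (U+00E9)

# UTF-8 encodes U+00E9 as the two bytes 0xC3 0xA9.
utf8_bytes = char.encode("utf-8")

# Reading those same two bytes as latin-1 gives two separate characters.
misread = utf8_bytes.decode("latin-1")

# Forcing ascii output replaces the character with a numeric reference.
decimal_ref = char.encode("ascii", "xmlcharrefreplace").decode("ascii")
hex_ref = "&#x{:X};".format(ord(char))


def parsed_text(xml_bytes):
    """Hypothetical helper: parse a tiny XML document, return its text."""
    return ET.fromstring(xml_bytes).text


# Three spellings of the same document: raw UTF-8, decimal, and hex.
as_utf8 = ("<t>" + char + "</t>").encode("utf-8")
as_decimal = ("<t>" + decimal_ref + "</t>").encode("ascii")
as_hex = ("<t>" + hex_ref + "</t>").encode("ascii")

print(utf8_bytes)            # b'\xc3\xa9'
print(decimal_ref, hex_ref)  # &#233; &#xE9;
print(parsed_text(as_utf8) == parsed_text(as_decimal) == parsed_text(as_hex))
```

If data declared as utf-8 shows the two-character form where an accented letter is expected, that suggests it was decoded as latin-1 somewhere upstream, which fits the idea of tracing the problem back to the supplier.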