At Tue, 24 Oct 2006 14:55:18 -0400,
Jackie Shieh <[log in to unmask]> wrote:
> I am fairly new on MODS and MARC21 conversion,
> so my question perhaps too elementary...
> 
> If declaring output to ascii, don't I then miss
> the proper diacritics encoding?!  The records
> I have are primary non-English.

Character encoding is simple in concept but complex in execution. I am
not an expert but I will do my best.

The Unicode code point for LATIN SMALL LETTER E WITH ACUTE (é; if you do
not see an e with an acute accent, your (or possibly my) mail reader is
not working properly) is U+00E9 (see
<http://www.fileformat.info/info/unicode/char/00e9/>). As UTF-8 this is
expressed by the two bytes 0xC3 0xA9 (see above message). If your
document is encoded as UTF-8 then those two bytes will make the
character above. If you are looking at the file as latin-1 instead,
these bytes will not look like é but like Ã©. If you set
the encoding of your output file to ascii it will “entity encode” your
character as &#233; (decimal) or &#xe9; (hex). If you are processing
this XML with a conforming parser, it does not care whether you have:
(a) é in UTF-8; or (b) &#233; or &#xe9; as entity-encoded characters.
But if you wish to force your XSL transform to output entity-encoded
ascii rather than UTF-8, you must set your encoding to “ascii” in your
<xsl:output> element. This means that the file itself is 7-bit ascii,
and every character outside those 7 bits is written as a numeric
character reference in pure ascii, which is equivalent as far as XML
parsers are concerned.
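
If you want to check the byte-level claims above for yourself, here is
a minimal sketch. It uses Python rather than XSLT, purely as a
convenient way to inspect bytes; the <title> element and the sample
text are made up for illustration and are not taken from any MODS or
MARC21 record.

```python
import xml.etree.ElementTree as ET

E_ACUTE = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, U+00E9

# The character as UTF-8 bytes, and the same bytes misread as latin-1.
utf8_bytes = E_ACUTE.encode("utf-8")      # two bytes: 0xC3 0xA9
misread = utf8_bytes.decode("latin-1")    # two characters: Ã and ©

# ASCII output, with a numeric character reference for the non-ASCII character.
ascii_bytes = E_ACUTE.encode("ascii", errors="xmlcharrefreplace")

# Hypothetical one-element documents carrying the same text three ways:
# raw UTF-8, a decimal reference, and a hex reference.
doc_utf8 = b'<?xml version="1.0" encoding="UTF-8"?><title>caf\xc3\xa9</title>'
doc_dec = b'<?xml version="1.0" encoding="US-ASCII"?><title>caf&#233;</title>'
doc_hex = b'<?xml version="1.0" encoding="US-ASCII"?><title>caf&#xe9;</title>'

# A parser hands back the same string for all three.
texts = [ET.fromstring(doc).text for doc in (doc_utf8, doc_dec, doc_hex)]

# Serializing as ASCII writes the character back out as a reference,
# roughly what an ascii output encoding does in a transform.
serialized = ET.tostring(ET.fromstring(doc_utf8), encoding="us-ascii")

if __name__ == "__main__":
    print(utf8_bytes, ascii_bytes, serialized)
    print(texts[0] == texts[1] == texts[2])
```

The point of the last two steps is that the choice of output encoding
changes only the bytes in the file, not the text a parser hands back.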

best,
--
Erik Hetzner
California Digital Library
510-987-0884