Michele (et al.),
My suggestion is that the Unicode in the EAD XML should be represented consistently:
  1. raw unicode (bytes)
  2. numeric character references (either of the following)
    a. decimal
    b. hex
  3. single ligatures (not double)
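
To make options 1 and 2 concrete, here is a minimal sketch in plain Java (no ICU needed) that turns raw non-ASCII characters into decimal or hex numeric character references. The class and method names are made up for illustration and are not part of any toolkit mentioned in this message.

```java
// Hypothetical helper, for illustration only.
final class CharRefDemo {

    /**
     * Replaces every non-ASCII code point with a numeric character
     * reference, hex (&#xE9;) or decimal (&#233;). ASCII is copied as-is,
     * so markup characters such as & and < are NOT escaped here.
     */
    static String toCharRefs(String text, boolean hex) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp < 0x80) {
                out.append((char) cp);
            } else if (hex) {
                out.append("&#x")
                   .append(Integer.toHexString(cp).toUpperCase())
                   .append(';');
            } else {
                out.append("&#").append(cp).append(';');
            }
            // Step by code point so surrogate pairs stay together.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "caf\u00E9";                    // "café" as raw Unicode
        System.out.println(toCharRefs(raw, false)); // caf&#233;
        System.out.println(toCharRefs(raw, true));  // caf&#xE9;
    }
}
```

All three forms mean the same character to an XML parser; the point is only to pick one and use it throughout a finding aid.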

I have recently created a tool using the International Components for Unicode (ICU) jars.

The objective was to create Java static methods that use ICU components to produce legal XML. This included converting named and numeric character entities into Unicode, as well as converting decomposed characters into composed characters.
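
The two conversions described above can be sketched roughly as follows. This is not the toolkit's code: it swaps in the JDK's own java.text.Normalizer (which needs Java 6 or later) in place of ICU, and the class and method names are hypothetical. Named entities are left out, since resolving them needs a lookup table that is not shown here.

```java
import java.text.Normalizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; uses the JDK, not ICU.
final class EadTextCleaner {

    // Matches &#233; (decimal) or &#xE9; (hex). Digit counts are capped
    // so that parsing cannot overflow an int.
    private static final Pattern NUMERIC_REF =
        Pattern.compile("&#(?:[xX]([0-9A-Fa-f]{1,6})|([0-9]{1,7}));");

    /** Replaces numeric character references with the characters they name. */
    static String decodeNumericRefs(String text) {
        Matcher m = NUMERIC_REF.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int cp = (m.group(1) != null)
                ? Integer.parseInt(m.group(1), 16)
                : Integer.parseInt(m.group(2), 10);
            String replacement;
            if (cp == '&' || cp == '<' || !Character.isValidCodePoint(cp)) {
                // Keep the reference: a raw & or < would break the XML,
                // and an out-of-range number has no character to produce.
                replacement = m.group();
            } else {
                replacement = new String(Character.toChars(cp));
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Composes base letter + combining mark sequences (Unicode form NFC). */
    static String compose(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }
}
```

For example, decodeNumericRefs("caf&#233;") and compose("cafe\u0301") both yield the same four-character string ending in U+00E9, which is what makes later comparison and indexing predictable.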

The Java class I wrote has static methods that can be used either at the command line or through the XSLT stylesheet Java extensions. (IOW, you can clean up your EAD XML using XSLT with this jar on your classpath.)

The toolkit includes files for running both the command-line examples and the XSLT stylesheet examples. Everything is self-contained, so it is easy to run.

The goal initially was to accomplish some tasks using data that would be illegal using the regular XSLT functions (1.0 and 2.0). Personally, I like XSLT 1.0 and would like to continue using it. Java extensions make that possible by adding muscle to the weak or soft areas of XSLT.

This toolkit is available on the LoC staff site. (API available)

ICU has quite a lot to offer for some of the issues (search and display) that are mentioned on this list. You might want to take a look at it.
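
On the search side, one common approach (sketched here with the JDK's java.text.Normalizer rather than ICU, and not taken from the toolkit) is to index and query on an accent-folded key, so that a researcher typing a plain "e" still matches an accented one. The class and method names are hypothetical.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper, for illustration only; uses the JDK, not ICU.
final class SearchKey {

    /**
     * Builds an accent-insensitive, case-insensitive key:
     * decompose (NFD), drop combining marks, then lowercase.
     * Letters with no canonical decomposition are left unchanged.
     */
    static String fold(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        return stripped.toLowerCase(Locale.ROOT);
    }
}
```

The idea is to apply fold() both to the indexed title or originator and to the user's search string, while still displaying the original accented text. Whether simple mark-stripping is acceptable depends on the languages in the collection.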

I can be contacted off list for the toolkit.

  Requirements:
  Windows 2000/XP
  Java JRE 1.5 or higher

Mike Ferrando
Library Technician
Library of Congress
Washington, DC
(202) 707-4454

p.s. I have also compiled a demo class (into a jar) which uses ICU to view and detect the encoding of a file. It is a neat little utility (it works on an http URL or a local file). This is also available on the LoC staff site. -mf

----- Original Message ----
From: Michele Combs
Sent: Wednesday, June 6, 2007 9:26:35 AM
Subject: special characters in EAD

For those of you who are offering web-based searching of your EAD
finding aids by title or author, how are you handling special characters
in the title, subtitle, or originator (for example, the French accented
e, or German umlauted u)?  Are you encoding those special characters in
the EAD finding aid using character entities?  If so, does it cause any
problems with indexing or searching, given that most researchers will
not have the special characters in their search string (for example,
they'll likely just use an e without the accent)?


Michele C.

Michele R. Combs
Manuscripts Processor
Special Collections Research Center
Syracuse University Library
222 Waverly Avenue
Syracuse, NY 13244
(315) 443-2697
