I've been grappling for a few years now with the problem of accented
characters and the best way to deal with them, particularly how to allow
them to be searched and displayed in the optimum way. I've come to the
following conclusions:
1. Special characters should be encoded within XML, using either the
named or numeric form - e.g. &eacute; or &#233;
2. The named form is preferable when editing and proof-reading the file.
If numeric forms are required by your XML parser of choice, these can be
substituted for the named versions en masse prior to parsing.
3. The software used to index the finding aid should provide a means of
normalising each accented character to its unaccented equivalent. This
allows the end user to find the record without having to enter the
accented character.
4. The software should also apply the same normalisation to query terms
entered, so that advanced users who do enter the accented form still find
matches.
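
As a rough illustration of the substitution described in point 2, here is
a minimal Python sketch. It assumes the named entities in your files use
the HTML 4 names that Python's html.entities module knows about; if your
DTD declares other entity sets, the lookup table would need extending.
The function name named_to_numeric is my own invention for this example.

```python
import re
from html.entities import name2codepoint

# The five entities predefined by XML itself; any XML parser understands
# these, so they are left untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches a named entity reference such as &eacute;
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(text):
    """Replace known named entities with decimal numeric references."""

    def substitute(match):
        name = match.group(1)
        # Assumption: names missing from the HTML 4 table are left as-is
        # rather than guessed at.
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return "&#%d;" % name2codepoint[name]

    return ENTITY_RE.sub(substitute, text)


print(named_to_numeric("Caf&eacute; Royal &amp; Co."))
```

Running this over a copy of each file before parsing leaves the original,
more readable named forms in place for editing and proof-reading.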
The approach set out in the four points above - which is currently being
applied to support federated search here in the UK
(http://www.archivehub.ac.uk) - means that matches can be found by
searching with the unaccented form, while the accented characters remain
in the original file. The file therefore appears as intended when it is
retrieved and transformed (by XSLT) for display in the browser.
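
To make points 3 and 4 a little more concrete, here is a minimal Python
sketch of the idea. It is not the code we actually run; the functions
normalise, build_index and search and the sample records are made up for
illustration. It assumes the entities have already been resolved to
Unicode characters by the XML parser, and it strips accents by Unicode
decomposition, which will not catch characters with no decomposed form
(e.g. ø).

```python
import unicodedata


def normalise(term):
    """Strip accents: decompose, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_index(records):
    """Map each normalised word to the ids of the records containing it."""
    index = {}
    for record_id, text in records.items():
        for word in text.split():
            index.setdefault(normalise(word), set()).add(record_id)
    return index


def search(index, query):
    """Apply the same normalisation to the query term before lookup."""
    return index.get(normalise(query), set())


# Hypothetical records; the stored text keeps its accents for display.
records = {
    "rec1": "Café Royal menus",
    "rec2": "Cafe society letters",
}
index = build_index(records)

print(sorted(search(index, "Cafe")))
print(sorted(search(index, "Café")))
```

Both searches return both records, because the same normalisation is
applied at index time and at query time, while the records themselves are
stored unchanged.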
Hope this helps. Please feel free to contact me for further detail about
how we're implementing federated searching.
--
John Harrison
Special Collections and Archives
University of Liverpool Library
Chatham Street, PO Box 123, Liverpool, L693DA
e: [log in to unmask]
w: sca.lib.liv.ac.uk
t: 0151 7943142
f: 0151 7942681