Timothy,

I am eagerly looking for UNICODE to solve similar problems in our CD-ROM
In Principio.  The program works around lots of replacements for
Medieval Latin, e.g. hymn* can find hymus, ymnus, hympnus, etc., but
MÜNICH causes problems!

However, UNICODE will work on Windows NT and beyond.  So... right back
at the starting blocks.

Peregrin Berres, OSB
Hill Monastic Manuscript Library
Collegeville, MN 56321

        ----------
        From:   Timothy Young[SMTP:[log in to unmask]]
        Sent:   Friday, October 17, 1997 1:31 PM
        To:     Multiple recipients of list EAD
        Subject:        Query re: character normalization

        Now that we all seem to be moving in the direction of understanding
        and implementing SGML mark-up, I am going to open a can of worms
        about a further step in making finding aids usable...namely:
        Character Normalization

        In the development of our Finding Aids site at Yale, we were
        happy to realize that we could code non-Latin (extended) characters
        so they would appear properly, thanks to the Special Characters Entity
        component.
        When our finding aids were online, we tested the search interface and
        found that, indeed, we could find extended characters by keying-in
        ASCII number sequences (e.g. Alt-130 = é). However, the wind left our
        sails when we realized that this was the ONLY way to search extended
        characters. Therefore, a researcher looking through our collection of
        Goethe manuscripts will have to learn to type like a programmer to
        find all of the relevant names she desires.

        Knowing that there are other problems that arise, such as an
        inconsistency in using extended characters, and local practice, I
        wonder if anyone has any advice/direction/comments on what can be done
        as far as what I refer to as "character normalization".

        I envision a system that - on the search interface - is able to map
        all accented versions of Latin characters to their unaccented
        equivalents. (e.g. -  é, ë, etc. would map to e)
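
The mapping described above can be sketched in a few lines. This is only an illustration, assuming a Unicode-aware environment and Python's standard `unicodedata` module; the function name `fold_accents` is made up for the example and is not part of any finding-aid system.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Map accented Latin letters to their unaccented base letters."""
    # NFD splits a precomposed letter into base letter + combining marks,
    # e.g. "é" becomes "e" followed by U+0301 (combining acute accent).
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn"); keep everything else.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(fold_accents("é, ë"))     # e, e
print(fold_accents("Müller"))   # Muller
```

Letters that have no decomposition into a base letter plus accent (ø, ß and æ, for instance) pass through unchanged, so a real system would probably need an explicit replacement table for those on top of this.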

        Is this the way to go? Is anybody aware of a system that can do such
        a normalization? Our default strategy is to do what we have done
        with our in-house database - keep a "printable" version with extended
        characters - and create a database version with extended characters
        stripped out and replaced by Latin equivalents.
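
A rough sketch of that two-version strategy follows, again assuming Python and its standard `unicodedata` module. The names `search_key`, `build_index` and `search`, and the sample titles, are invented for illustration; the point is only that the stripped form is used for matching while the "printable" form is what gets returned.

```python
import unicodedata


def search_key(text: str) -> str:
    """Build the stripped, case-insensitive form used only for matching."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() is non-zero for combining accents.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def build_index(titles):
    """Pair each printable title with its stripped search version."""
    return [(search_key(title), title) for title in titles]


def search(index, query):
    """Match on the stripped version, return the printable version."""
    key = search_key(query)
    return [printable for stripped, printable in index if key in stripped]


# Hypothetical sample entries, not taken from any real finding aid.
index = build_index(["Café de Flore letters", "Müller correspondence"])
print(search(index, "cafe"))     # ['Café de Flore letters']
print(search(index, "MÜLLER"))   # ['Müller correspondence']
```

Because the query goes through the same `search_key` step as the stored text, a researcher gets the same hits whether or not she types the accents.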

        Any response would be welcome, even regarding correct terminology for
        this question.

        Timothy Young
        Archivist
        Beinecke Rare Book and Manuscript Library
        Yale University
        New Haven, CT  06520
        (203) 432-8131