Deena,

We've just done exactly this conversion on a large number of Finding 
Aids, and are just finishing up the interface tweaks (in DLXS) before 
releasing them publicly. Here's what Scott Prater and Brian Sheppard 
did:

#######################################

Hello, Deena --

>  I would be interested in hearing from colleagues regarding your 
>experiences converting legacy finding aids from SGML to XML.
>  Did you find you needed to do a lot of post-conversion clean up to 
>make them compliant with the recommended encoding protocol for EAD 
>2002?
>  Our experience has been that the time spent on clean-up may almost 
>be better spent on creating new EADs from scratch, especially for 
>small finding aids.
>  Is this what others have experienced, or is it a function of how 
>our legacy finding aids were initially coded?
>

Depends on how many finding aids you want to convert, and how clean
and/or standardized the legacy finding aids are.  We just recently 
completed a project converting some 3000+ finding aids from
SGML/Latin-1/EAD 1.0 to XML/UTF-8/EAD 2002.  Obviously, doing that
manually was out of the question, so we spent quite a bit of time (the
better part of two weeks) working out a process to automatically convert
them.  James Clark's OpenSP sgml/xml tools form the base of our
conversion toolkit.

Here, roughly, are the steps we followed:

1.  First, we sat down with the data provider and went over the changes
from EAD 1.0 to EAD 2002 (fortunately, there aren't very many).  We then
looked at the EAD 1.0 finding aids, and decided what steps to take when
multiple transformation options were available (for instance, what to do
with the deprecated <admininfo> and <add> regions).  We came up with a
13-point list of standard transformations for our finding aids.

2.  Then we started the work of conversion.  Early on, we decided to do
the EAD 1.0 -> EAD 2002 conversion with XSLT;  this presupposed that
prior to converting the files to EAD 2002, they needed to be valid
XML/UTF-8 EAD 1.0 files.

So first, we simply make sure all the files are valid SGML/Latin-1 EAD
1.0 files.  Then we do a preliminary syntax conversion to well-formed
XML, using a simple Perl script that closes empty SGML tags, fixes a few
character entities, etc.

3.  We then convert the files from Latin-1 to UTF-8, using the freeware
utilities iconv, nct2utf8 and isocer2utf8.

4.  We then run jhove to make sure that the files are valid UTF-8, and
that the syntax is well-formed XML.

5.  We then run osx against the files and an EAD 1.0 XML DTD to convert
the files to XML  (EAD 1.0).

6.  We should now have valid XML/UTF-8 EAD 1.0 files.  We run onsgmls to
validate the new XML files against the XML EAD 1.0 DTD, just to make sure.

7.  Now we run the files through our XSL template to convert from
EAD 1.0 to EAD 2002.

8.  We then normalize and validate the XML/UTF-8/EAD 2002 files.
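
The pre-processing in steps 2 through 4 can be sketched roughly as follows.
This is a minimal Python illustration, not the Perl script or the jhove check
described above: the list of empty elements is an assumed subset of the EAD
DTD, the function names are hypothetical, and the well-formedness test uses
Python's built-in XML parser as a stand-in for jhove.

```python
import re
import xml.etree.ElementTree as ET

# Assumed, illustrative subset of elements declared EMPTY in the EAD DTD.
# Check the DTD you are converting against before relying on this list.
EMPTY_ELEMENTS = ("lb", "ptr", "extptr", "colspec")

# Matches an opening tag for one of the empty elements, with or without
# attributes, and with or without an existing trailing slash.
_EMPTY_TAG = re.compile(
    r"<(%s)\b([^<>]*?)\s*/?>" % "|".join(EMPTY_ELEMENTS), re.IGNORECASE
)


def close_empty_tags(text):
    """Rewrite SGML empty tags such as <lb> as XML empty tags <lb/>."""
    return _EMPTY_TAG.sub(lambda m: "<%s%s/>" % (m.group(1), m.group(2)), text)


def preprocess(raw_latin1):
    """Steps 2 and 3: close empty tags, then re-encode Latin-1 as UTF-8."""
    text = raw_latin1.decode("latin-1")
    text = close_empty_tags(text)
    return text.encode("utf-8")


def is_well_formed_utf8_xml(raw):
    """Step 4 stand-in: True if the bytes are valid UTF-8 and parse as XML."""
    try:
        raw.decode("utf-8")
        ET.fromstring(raw)
    except (UnicodeDecodeError, ET.ParseError):
        return False
    return True


if __name__ == "__main__":
    legacy = '<ead><p>Caf\xe9<lb>next<ptr target="a1"></p></ead>'.encode("latin-1")
    converted = preprocess(legacy)
    print(is_well_formed_utf8_xml(legacy), is_well_formed_utf8_xml(converted))
```

A regular expression like this will not cope with every SGML shortcut
(omitted end tags, unquoted attribute values, a ">" inside an attribute
value), which is one reason to run a real SGML parser over the files as
well.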

Steps 5 - 7 were the most complicated, and took the longest to figure
out to our satisfaction.  But we're pleased with the results, and the
conversion process brought to light some inconsistencies/errors that had
escaped our notice before.  The effort was worth it.
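
To illustrate the kind of decision made in step 1 and applied in step 7,
here is a small sketch of one possible treatment of the deprecated
<admininfo> and <add> wrappers: renaming them to the generic <descgrp>
element that EAD 2002 provides, and recording the old name in a type
attribute.  It uses Python's ElementTree rather than XSLT, the mapping and
function name are hypothetical, and it is only one of the available
options, not necessarily the one chosen for these finding aids.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from deprecated EAD 1.0 wrappers to an EAD 2002 element.
RENAMES = {"admininfo": "descgrp", "add": "descgrp"}


def rewrap_deprecated(root):
    """Rename deprecated wrappers in place; return how many were changed."""
    changed = 0
    for elem in root.iter():
        new_tag = RENAMES.get(elem.tag)
        if new_tag is None:
            continue
        # Keep a trace of the original element name unless a type is already set.
        if "type" not in elem.attrib:
            elem.set("type", elem.tag)
        elem.tag = new_tag
        changed += 1
    return changed


if __name__ == "__main__":
    doc = ET.fromstring(
        "<archdesc>"
        "<admininfo><acqinfo><p>Gift.</p></acqinfo></admininfo>"
        "<add><bibliography><p>See also.</p></bibliography></add>"
        "</archdesc>"
    )
    print(rewrap_deprecated(doc))
    print(ET.tostring(doc, encoding="unicode"))
```

The alternative is to drop the wrapper and promote its children to the
parent element, which takes more care because any <head> or <p> inside the
wrapper has to be dealt with separately.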

All this was done on a Unix platform.  If you're interested, we can make
available the Makefile we used to perform the conversion, and provide
support files (EAD 1.0 XML DTD, XSL template, etc.).  I was surprised how
difficult it was to find an EAD 1.0 XML DTD;  it's no longer available
on the main EAD website, as far as I can tell.  I ended up making some
modifications to the EAD 1.0 SGML DTD, after some research on Google and
reading the DTD.

Hope this helps,

-- Scott


-- 
Scott Prater
Library, Instructional, and Research Applications (LIRA)
Division of Information Technology (DoIT)
University of Wisconsin - Madison
-- 
_______________________________
Peter C. Gorman
Head, University of Wisconsin Digital Collections Center
218 Memorial Library
Madison, WI 53706
(608) 265-5291