Peter,

I believe you could have streamlined your processing. See below.

Daniel

At 05:04 PM 1/17/2007, Peter C. Gorman wrote:
>Deena,
>
>We've just done exactly this conversion on a large number of Finding 
>Aids, and are just finishing up the interface tweaks (in DLXS) 
>before releasing them publicly. Here's what Scott Prater and Brian 
>Sheppard did:
>
>#######################################
>
>Hello, Deena --
>
>>  I would be interested in hearing from colleagues regarding your 
>> experiences converting legacy finding aids from SGML to XML.
>>  Did you find you needed to do a lot of post-conversion clean up 
>> to make them compliant with the recommended encoding protocol for EAD 2002?
>>  Our experience has been that the time spent on clean-up may 
>> almost be better spent on creating new EADs from scratch, 
>> especially for small finding aids.
>>  Is this what others have experienced, or is it a function of how 
>> our legacy finding aids were initially coded?
>
>Depends on how many finding aids you want to convert, and how clean
>and/or standardized the legacy finding aids are.  We just recently 
>completed a project converting some 3000+ finding aids from
>SGML/Latin-1/EAD 1.0 to XML/UTF-8/EAD 2002.  Obviously, doing that
>manually was out of the question, so we spent quite a bit of time (the
>better part of two weeks) working out a process to automatically convert
>them.  James Clark's OpenSP sgml/xml tools form the base of our
>conversion toolkit.
>
>Here, roughly, are the steps we followed:
>
>1.  First, we sat down with the data provider and went over the changes
>from EAD 1.0 to EAD 2002 (fortunately, there aren't very many).  We then
>looked at the EAD 1.0 finding aids, and decided what steps to take when
>multiple transformation options were available (for instance, what to do
>with the deprecated <admininfo> and <add> regions).  We came up with a
>13-point list of standard transformations for our finding aids.
>
>2.  Then we started the work of conversion.  Early on, we decided to do
>the EAD 1.0 -> EAD 2002 conversion with XSLT;  this presupposed that
>prior to converting the files to EAD 2002, they needed to be valid
>XML/UTF-8 EAD 1.0 files.


Provided you have the EAD 1.0 DTD and related files (including the 
SDATA (SGML) and character reference (XML) character entity lists) 
available, you can collapse all of the steps beginning with "So first 
..." through step 5 by simply running OSX. OSX begins by parsing the 
SGML instance, and then converts it into XML. In order to make sure 
the characters are transformed properly, you need to do the following:

Reference the SGML declaration for XML when invoking OSX. See 
http://www.w3.org/TR/NOTE-sgml-xml-971215 (specifically, OSX needs 
this to understand hexadecimal character references of the form &#x...;).

Reference the XML character entity lists. These are available at 
http://www.loc.gov/ead/ead2002a.html

By default, OSX will output the XML in UTF-8 encoding.

OSX is available at 
http://sourceforge.net/project/showfiles.php?group_id=2115 (Download OpenSP).
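
As a rough illustration of the single OSX pass described above, here 
is a minimal Python sketch that assembles and runs the command. The 
file names (xml.dcl for the SGML declaration for XML, the catalog 
path) and both function names are assumptions made for the example, 
not files or tools shipped with OpenSP; the catalog is assumed to 
map the EAD 1.0 DTD and the entity lists to local copies.

```python
import subprocess


def build_osx_command(sgml_path, xml_declaration="xml.dcl", catalog=None):
    """Return the argument list for one OSX run (nothing is executed here)."""
    command = ["osx"]
    if catalog is not None:
        # -c names a catalog that maps public identifiers (DTD, entity
        # lists) to local files. The catalog path is a hypothetical example.
        command += ["-c", catalog]
    # The declaration is listed before the document so that the parser
    # reads it first. "xml.dcl" is an assumed local file name.
    command += [xml_declaration, sgml_path]
    return command


def convert_to_xml(sgml_path, xml_path, **options):
    """Run OSX and write its standard output (the XML) to xml_path."""
    command = build_osx_command(sgml_path, **options)
    with open(xml_path, "wb") as xml_file:
        result = subprocess.run(command, stdout=xml_file, stderr=subprocess.PIPE)
    # OSX reports parse errors on standard error; hand them back to the caller.
    return result.returncode, result.stderr.decode("utf-8", errors="replace")
```

Calling convert_to_xml("findingaid.sgm", "findingaid.xml", 
catalog="ead.cat") would then write the converted file, provided osx 
is on the PATH. The error text is returned rather than raised 
because a batch run will usually want to log problems and carry on.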

With respect to step 6, I would recommend using an XML validator 
rather than onsgmls, as the latter is not an XML parser as such. If 
you do use onsgmls, it is important to make sure you reference the 
SGML declaration for XML mentioned above to ensure that all XML 
features are covered.
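
A minimal sketch of what checking with an XML parser could look 
like, in Python. The well-formedness check uses only the standard 
library, which does not validate against a DTD; for validation the 
sketch merely builds a command line for xmllint (part of libxml2), 
which is assumed to be installed. The function names and the DTD 
path are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_bytes):
    """Return True if the bytes parse as well-formed XML (no DTD validation)."""
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError:
        return False
    return True


def build_xmllint_command(xml_path, dtd_path):
    """Return an xmllint argument list that validates xml_path against dtd_path."""
    # --noout suppresses the serialized document so only errors are printed;
    # --dtdvalid validates against the named DTD instead of the DOCTYPE.
    return ["xmllint", "--noout", "--dtdvalid", dtd_path, xml_path]
```

The command list can be passed to subprocess.run; a non-zero exit 
status from xmllint indicates that the file did not validate.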


>So first, we simply make sure all the files are valid SGML/Latin-1 EAD
>1.0 files.  Then we do a preliminary syntax conversion to 
>well-formed XML, using a simple perl script: close empty SGML tags, 
>fix a few character entities, etc.
>
>3.  We then convert the files from Latin-1 to UTF-8, using the freeware
>utilities iconv, nct2utf8 and isocer2utf8.
>
>4.  We then run jhove to make sure that the files are valid UTF-8, and
>that the syntax is well-formed XML.
>
>5.  We then run osx against the files and an EAD 1.0 XML DTD to convert
>the files to XML  (EAD 1.0).
>
>6.  We should now have valid XML/UTF-8 EAD 1.0 files.  We run onsgmls to
>validate the new XML files against the XML EAD 1.0 DTD, just to make sure.
>
>7.  Now we run the files through our xsl template to convert from
>EAD 1.0 to EAD 2002.
>
>8.  We then normalize and validate the XML/UTF-8/EAD 2002 files.
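
For anyone who keeps the separate re-encoding pass (steps 3 and 4 
above), the byte-level part can be sketched in a few lines of 
Python. This is an illustration only, not the iconv-based process 
described in the message; the function names are assumptions, and 
the sketch does not resolve character entity references or rewrite 
any encoding declaration in the file.

```python
def latin1_to_utf8(data):
    """Re-encode Latin-1 (ISO 8859-1) bytes as UTF-8 bytes."""
    # Every byte value is a valid Latin-1 character, so decoding cannot fail.
    return data.decode("latin-1").encode("utf-8")


def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Running is_valid_utf8 on the output of latin1_to_utf8 gives a cheap 
sanity check before the files are handed to the XML tools.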
>
>Steps 5 - 7 were the most complicated, and took the longest to figure
>out to our satisfaction.  But we're pleased with the results, and the
>conversion process brought to light some inconsistencies/errors that had
>escaped our notice before.  The effort was worth it.
>
>All this was done on a Unix platform.  If you're interested, we can make
>available the Makefile we used to perform the conversion, and provide
>support files (EAD 1.0 XML dtd, xsl template, etc.)  I was surprised how
>difficult it was to find an EAD 1.0 XML dtd;  it's no longer available
>on the main EAD website, as far as I can tell.  I ended up doing some
>modifications on the EAD 1.0 sgml DTD, after some research on Google and
>reading the DTD.
>
>Hope this helps,
>
>-- Scott
>
>
>--
>Scott Prater
>Library, Instructional, and Research Applications (LIRA)
>Division of Information Technology (DoIT)
>University of Wisconsin - Madison
>[log in to unmask]
>--
>_______________________________
>Peter C. Gorman
>Head, University of Wisconsin Digital Collections Center
>218 Memorial Library
>Madison, WI 53706
>[log in to unmask]
>(608) 265-5291

Daniel V. Pitti, Associate Director
Institute for Advanced Technology in the Humanities
319 Alderman Library
P.O. Box 400115
University of Virginia
Charlottesville, Virginia 22904-4115
Phone: 434-924-6594
Fax: 434-982-2363