Hi Kris,

> I am replying to your query of the list because I don't have an answer to
> your question.  I would be very interested in an answer however, as we
> have about 600 finding aids here in HTML that need to be converted to
> EAD at some point.  We are far from starting on this project -- it will
> be part of my new job here, a job which I won't start until Spring
> 2002.  But if you find a solution or some conversion software I would
> appreciate if you could let me know!
I do not know of any such software, but I can offer some pointers.

Before getting involved with EAD at UBC (which is not my main job function
to begin with), I did quite a bit of work with XML technologies. One of
the learning projects I built was a COM component that used a set of
regular expressions and W3C's HTMLTidy software to "clean up" messy HTML
into well-formed XML.

Since XML is very picky about well-formedness, the first challenge in such
a conversion is the HTML tidying process. Just to give you a figure, in my
tests the output of my component parsed as well-formed XML about 90% of
the time.
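
As a rough illustration of the tidying step, here is a minimal sketch in
Python. It is not the COM component described above and does not use
HTMLTidy or regular expressions; it swaps in the standard library's
html.parser instead. The names TidySketch and tidy_to_xml and the
<document> wrapper element are made up for this example, and it handles
only a few common problems (unclosed tags, unescaped "&", unquoted
attributes).

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# Elements that never have content; emitted as self-closing tags.
VOID = {"br", "hr", "img", "meta", "link", "input"}


class TidySketch(HTMLParser):
    """Hypothetical helper: re-emits messy HTML as balanced XML tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # pieces of the XML output
        self.stack = []  # currently open elements

    def _close_through(self, tag):
        # Close open elements until (and including) the named one.
        while self.stack:
            top = self.stack.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_starttag(self, tag, attrs):
        # Drop duplicate attributes; give valueless ones their own name.
        unique = {k: (v if v is not None else k) for k, v in attrs}
        text = "".join(f" {k}={quoteattr(v)}" for k, v in unique.items())
        if tag in VOID:
            self.out.append(f"<{tag}{text}/>")
            return
        if tag == "p" and "p" in self.stack:
            self._close_through("p")  # a new <p> implicitly ends the old one
        self.stack.append(tag)
        self.out.append(f"<{tag}{text}>")

    def handle_endtag(self, tag):
        if tag in self.stack:  # ignore stray closing tags
            self._close_through(tag)

    def handle_data(self, data):
        self.out.append(escape(data))  # re-escape &, < and >


def tidy_to_xml(html):
    parser = TidySketch()
    parser.feed(html)
    parser.close()  # flush any buffered text
    while parser.stack:  # close whatever is still open
        parser.out.append(f"</{parser.stack.pop()}>")
    # Assumption: wrap everything in one root so the result has a single top element.
    return "<document>" + "".join(parser.out) + "</document>"


messy = "<p>Fonds <b>title<p>Scope &amp; content<br>more"
root = ET.fromstring(tidy_to_xml(messy))  # raises ParseError if not well-formed
```

Running the result through an XML parser, as in the last line, is one way
to measure a success rate like the figure quoted above.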

The second process is even more challenging. The program will have to
somehow "study" the structure of the HTML document (which carries almost no
metadata) and match its elements with the appropriate EAD tags.

So it looks like this process would even require some form of artificial
intelligence. UNLESS, of course, all 600 of your HTML finding aids
follow the SAME structure (i.e. the third <p> always corresponds to
<admininfo>, etc.), which I highly doubt.
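
For the fixed-structure case, the mapping could be as simple as a lookup
by paragraph position. The sketch below assumes the HTML has already been
tidied into well-formed XML with no namespaces. Only the "third <p> is
<admininfo>" entry comes from the example above; the other two entries,
the function name map_paragraphs and the <archdesc> wrapper are
hypothetical, and the output is not claimed to be valid EAD.

```python
import xml.etree.ElementTree as ET

# Hypothetical position-to-element table (0-based paragraph index).
# Only the third entry reflects the example in the message.
POSITION_TO_EAD = {0: "bioghist", 1: "scopecontent", 2: "admininfo"}


def map_paragraphs(xhtml):
    """Wrap each <p> in the EAD element assigned to its position."""
    root = ET.fromstring(xhtml)
    archdesc = ET.Element("archdesc")
    unmapped = []  # paragraphs with no rule, left for manual review
    for index, para in enumerate(root.iter("p")):
        text = "".join(para.itertext()).strip()  # flatten inline markup
        name = POSITION_TO_EAD.get(index)
        if name is None:
            unmapped.append(text)
            continue
        section = ET.SubElement(archdesc, name)
        ET.SubElement(section, "p").text = text
    return archdesc, unmapped


sample = ("<html><body><p>Bio</p><p>Scope</p>"
          "<p>Admin <b>note</b></p><p>Extra</p></body></html>")
archdesc, unmapped = map_paragraphs(sample)
print(ET.tostring(archdesc, encoding="unicode"))
print("Needs manual review:", unmapped)
```

Anything that falls outside the table ends up in the unmapped list, which
is where the manual work would start if the documents do not all share
one layout.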

I can think of ways to build an application that could help you with the
conversion process, but it looks to me like you will have to manually
convert most of them.

Good luck.