A very, very late thanks for your reply to my EAD question; apologies for
the delay. I've been out on leave and the in-box piled up a bit.
At 03:10 PM 8/1/01 -0700, you wrote:
> > I am replying to your query to the list because I don't have an answer to
> > your question. I would be very interested in an answer however, as we
> > have about 600 finding aids here in HTML that need to be converted to
> > EAD at some point. We are far from starting on this project -- it will
> > be part of my new job here, a job which I won't start until Spring
> > 2002. But if you find a solution or some conversion software I would
> > appreciate it if you could let me know!
> I do not know of any such software, but I will give you some clues.
>Before getting involved with EAD at UBC (which is not my main job function
>to begin with), I did quite a bit of work with XML technologies. One of
>the learning projects I built was a COM component that used a set of
>regular expressions and W3C's HTMLTidy software to "clean up" messy HTML
>into well-formed XML.
>Since XML is very picky about well-formedness, the first challenge for such
>a conversion would be the HTML tidying process. Just to give you a figure, my
>tests show that my component successfully converted HTML documents into
>parseable XML about 90% of the time.
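
A minimal sketch of how a success figure like the one above could be measured, assuming the HTML has already been run through a tidying step (not shown here). The function names `is_well_formed` and `success_rate` are hypothetical and are not part of the component described in the quoted message; the check relies only on Python's standard XML parser.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the string parses as well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # Unclosed tags, mismatched tags, undefined entities, and so on.
        return False


def success_rate(documents: list[str]) -> float:
    """Fraction of the given documents that parse as well-formed XML."""
    if not documents:
        return 0.0
    return sum(is_well_formed(doc) for doc in documents) / len(documents)
```

Documents that fail the check are the ones that would still need manual cleanup before any EAD mapping could be attempted.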
>The second process is even more challenging. The program will have to
>somehow "study" the structure of the HTML document (which carries almost no
>metadata) and match its parts with the appropriate EAD tags.
>So it looks like this process may even require some form of Artificial
>Intelligence. UNLESS, of course, all 600 of your HTML finding aids
>follow the SAME structure (i.e. the third <p> always corresponds to
><admininfo>, etc.), which I highly doubt.
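
For the uniform-structure case mentioned above, a positional mapping could be sketched roughly as follows. This assumes the input is already well-formed XML with no namespace on its elements. Only the "third <p> corresponds to <admininfo>" entry comes from the quoted message; the other entries in `POSITION_TO_EAD`, and the function name `map_paragraphs`, are hypothetical and for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from 0-based paragraph position to EAD element name.
# Only index 2 -> "admininfo" is taken from the message; the rest is assumed.
POSITION_TO_EAD = {0: "bioghist", 1: "scopecontent", 2: "admininfo"}


def map_paragraphs(xhtml: str) -> ET.Element:
    """Wrap each mapped <p> in the EAD element assigned to its position."""
    root = ET.fromstring(xhtml)
    archdesc = ET.Element("archdesc")
    for index, para in enumerate(root.iter("p")):
        tag = POSITION_TO_EAD.get(index)
        if tag is None:
            continue  # Unmapped paragraphs are left for manual review.
        section = ET.SubElement(archdesc, tag)
        # Keep only the text; inline HTML markup is dropped in this sketch.
        ET.SubElement(section, "p").text = "".join(para.itertext()).strip()
    return archdesc
```

A table like this only works when every finding aid follows the same layout; any document that deviates would be mapped incorrectly without warning, which is why manual review would still be needed.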
>I can think of ways to build an application that could help you with the
>conversion process, but it looks to me like you will have to manually
>convert most of them.
K. Ross Toole Archives
The University of Montana--Missoula
Missoula, MT 59812