On Tue, 31 Mar 1998, Timothy Young wrote:
> Over the last year, during the preparation of instances for our
> OpenText server, I had to create a short list of
> "Files that refuse to index" - suffice it to say that
> this was also known as "The Headache List."
> I parsed, examined, and compared the files on this list,
> but was never able to figure out why OpenText refused
> to index them...until now.
One way to avoid the problem described above is to use an SGML normalizer
such as sgmlnorm, included in SP by James Clark (www.jclark.com). sgmlnorm
attempts to put the start and end tags of an element on one line, and it
never splits a tag across lines. Every element name is uppercased, and
attribute values are surrounded by double quotes.
Working with OpenText (and TEI) myself, I made it a rule to normalize all
documents (which of course also parses them), and to run the indexing
process only if no errors were encountered.
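The rule above could be scripted roughly as follows. This is only a
sketch under assumptions: it assumes sgmlnorm writes the normalized
document to standard output and reports problems on standard error
and/or through a non-zero exit status, and the indexing command is a
hypothetical placeholder to be replaced with whatever starts indexing
on your installation. The function name normalize_then_index and its
parameters are made up for illustration.

```python
import subprocess
from pathlib import Path


def normalize_then_index(src, normalized, normalize_cmd, index_cmd,
                         run=subprocess.run):
    """Normalize `src`; run the indexer only if no errors were reported.

    normalize_cmd and index_cmd are argument lists, for example
    ["sgmlnorm"] and a site-specific (hypothetical) indexing command.
    Returns True only if both steps succeeded.
    """
    # Step 1: normalize (this also parses the document).
    result = run(list(normalize_cmd) + [str(src)],
                 capture_output=True, text=True)

    # Assumption: a non-zero exit status or any diagnostics on stderr
    # mean the document had errors, so it is not indexed. This is
    # deliberately strict: warnings also block indexing.
    if result.returncode != 0 or result.stderr.strip():
        return False

    # Step 2: save the normalized text and hand it to the indexer.
    Path(normalized).write_text(result.stdout)
    index_result = run(list(index_cmd) + [str(normalized)],
                       capture_output=True, text=True)
    return index_result.returncode == 0
```

The run parameter exists only so the commands can be swapped out for a
dry run; by default the real programs are executed.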
Hope this helps,
Jakob.
---------
Jakob Fix
Computing Officer at
The Oxford Text Archive
Oxford University
http://ota.ahds.ac.uk