Dynatext has similar problems with very, VERY long lines
in EAD instances (note this has nothing to do with the DTD).
Some configurations of Author/Editor export SGML documents
with NO linebreaks anywhere in the instance. The entire
document is strung out on one single, huge line. We
don't have the particular error Tim mentions, tags being
broken up along attribute values, but big lines within an
instance do cause Dynatext errors.
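
If you want to see whether a file has this problem before
loading it, a small script can report the suspicious lines.
This is only a sketch in Python: the function name and the
1000-character limit are my own assumptions, since I don't
know the exact line length at which Dynatext starts to fail.

```python
import io

def long_lines(stream, limit=1000):
    """Yield (line_number, length) for every line longer than `limit`.

    `limit` is an assumed threshold, not a documented Dynatext value.
    """
    for number, line in enumerate(stream, start=1):
        length = len(line.rstrip("\r\n"))  # ignore the line terminator
        if length > limit:
            yield number, length

# Made-up sample: 500 paragraphs exported with no line breaks.
sample = io.StringIO("<ead>\n" + "<p>text</p>" * 500 + "\n</ead>\n")
for number, length in long_lines(sample, limit=1000):
    print(f"line {number}: {length} characters")
```

For a real file, pass an open file object (for example
open("filename.sgm")) in place of the sample.
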
If you have perl, and it's in your UNIX path, you can
break up long Author/Editor document lines by issuing
the command at your UNIX prompt:
perl -pi -e 's/(.)(<[^\/])/$1\n$2/g' filename.sgm
which will guarantee that every start tag will begin on a
new line.
To take all tags that are split over two lines and join them
onto one line, issue this command at your UNIX prompt (all on
one line):
perl -0777 -pi -e 's/<([^\n\r>]*)[\n\r]+([^\n\r>]*)>/<$1 $2>/g;' filename.sgm
To join tags that are split into more than two lines, the
command would be somewhat more complex.
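
One possible way to handle tags split over any number of
lines is sketched below in Python rather than perl. It treats
everything from a "<" to the next ">" as one tag and replaces
each run of line breaks inside it with a single space. The
function name and the sample markup are invented for
illustration, and the approach assumes that no ">" appears
inside an attribute value or a comment.

```python
import re

# Anything from "<" up to the next ">" with no other "<" in between.
TAG = re.compile(r"<[^<>]*>")

def join_split_tags(text):
    """Collapse line breaks inside each tag into a single space."""
    def fix(match):
        # Replace each run of line breaks (and the whitespace
        # around it) with one space, within this tag only.
        return re.sub(r"\s*[\r\n]+\s*", " ", match.group(0))
    return TAG.sub(fix, text)

# Made-up sample: an <archdesc> start tag split over three lines.
sample = '<findaid><archdesc\nlevel="collection"\ntype="inventory">\n<did>'
print(join_split_tags(sample))
```

Line breaks between tags are left alone; only those inside a
tag are removed. To use it on a file, read the whole file into
one string, pass it through join_split_tags, and write the
result back out.
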
Alvin Pollock
Electronic Text Unit
UC Berkeley Library
At 12:23 PM 3/31/98 -0500, you wrote:
>Over the last year, during the preparation of instances for our
>OpenText server, I had to create a short list of
>"Files that refuse to index" - suffice it to say that
>this was also known as "The Headache List."
>I parsed, examined, and compared the files on this list,
>but was never able to figure out why OpenText refused
>to index them...until now.
>
>As serendipity has it, I stumbled across an important
>error that will pass most parsers, but will trip up OpenText.
>In marking up an EAD instance, there often are very long character strings
>that must be broken by line wrap, or soft return, depending on the software
>you are using.
>Thus, it is not uncommon to have tags that wrap from one line to the next,
>such as the <extref> tag in the following example:
>
>---------------------------
><did><physdesc><extent>EXTENT<lb>
>Total Boxes: 8<lb>
>Other Storage Formats: oversize<lb>
>Linear Feet: 6.0</extent></physdesc></did><note><p><extref
>ext.ptr="http://www.library.yale.edu/beinecke/manuscript/copyrite.htm">
>Copyright © 1992 by the Yale University Library.</extref></p>
><p>
>-----------------------------
>
>However, it seems that OpenText will allow some tags to be broken, but it
>has a problem with other *crucial* tags. I discovered that all of the files
>on my "refuse to index" list shared the same characteristic, which was that
>the <archdesc> tag was broken between two lines as in the following example:
>
>--------------------------
></titlepage></frontmatter><findaid><archdesc
>level="collection">
>-------------------------
>
>It is important that this tag be complete on one line, or OpenText will
>throw out the file.
>I haven't discovered any other specific tags that are as delicate, but I
>suspect that any high-level tag might cause a similar problem.
>
>I hope this helps anyone working with EAD and OpenText.
>You might not need this information today, but do store it away...
>
>Timothy Young
>Archivist
>Beinecke Rare Book and Manuscript Library
>Yale University
>New Haven, CT 06520
>(203) 432-8131
>
>p.s. OK, not *all* of my files on the "refuse to index" list were
>completely fixed by this procedure. I still have *one* that refuses to
>behave, despite the <archdesc> fix, but that's another mystery to be solved.
>
>