Alvin et al.,
An easier approach in SoftQuad's Author/Editor is to set the "break lines"
option Special-Options-Export-Break Lines to 70 or so characters. It will
not break significant element boundaries.
Though you can always use perl if that is your pleasure.
On a related note, the developer of perl is currently working with the XML
developers to make perl natively XML-capable. There is a beta version of
his most current work available via the SGML/XML Web page under "What's New
At 11:34 AM 3/31/98 -0800, you wrote:
>Dynatext has similar problems with very, VERY long lines
>in EAD instances (note this has nothing to do with the DTD).
>Some configurations of Author/Editor export sgml documents
>with NO linebreaks anywhere in the instance. The entire
>document will be strung up in one, single huge line. We
>don't have the particular error Tim mentions, tags being
>broken up along attribute values, but big lines within an
>instance do cause dynatext errors.
>If you have perl, and it's in your UNIX path, you can
>break up long Author/Editor document lines by issuing
>the command at your UNIX prompt:
>perl -pi -e 's/([^^])(<[^\/])/$1\n$2/g' filename.sgm
>which will guarantee that every start tag will begin on a new line.
>To take all tags that are split over two lines and join them
>onto one line, issue this command at your UNIX prompt (all on one line):
>perl -0777 -pi -e
>'s/<([^\n\r>]*)[\n\r]+([^\n\r>]*)>/<$1 $2>/g;' filename.sgm
>To join tags that are split into more than two lines, the
>command would be somewhat more complex.
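
For the case of tags split over more than two lines, here is a minimal sketch in Python rather than perl. It is not the command the poster had in mind, only one way to get the same effect. It assumes that no attribute value contains a literal ">" and that any "<" in the file starts a tag; the function name `join_split_tags` and the sample markup are made up for illustration.

```python
import re

# One whole tag, from "<" to the next ">", even when it spans line breaks.
# Assumption: attribute values never contain a literal ">", and comments or
# marked sections get no special handling.
TAG = re.compile(r"<[^<>]*>")

# A run of line breaks plus any whitespace around it.
LINE_BREAKS = re.compile(r"\s*[\r\n]+\s*")


def join_split_tags(text: str) -> str:
    """Collapse every line break inside a tag into a single space."""
    return TAG.sub(lambda m: LINE_BREAKS.sub(" ", m.group(0)), text)


if __name__ == "__main__":
    # Hypothetical sample: a start tag spread over three lines.
    sample = '<p><extref\nhref="x"\n  role="y">Copyright</extref></p>\n'
    print(join_split_tags(sample), end="")
```

Line breaks in the text between tags are left alone, so this can be run after the first perl command without undoing it.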
>Electronic Text Unit
>UC Berkeley Library
>At 12:23 PM 3/31/98 -0500, you wrote:
>>Over the last year, during the preparation of instances for our
>>OpenText server, I had to create a short list of
>>"Files that refuse to index" - suffice it to say that
>>this was also known as "The Headache List."
>>I parsed, examined, and compared the files on this list,
>>but was never able to figure out why OpenText refused
>>to index them...until now.
>>As serendipity has it, I stumbled across an important
>>error that will pass most parsers, but will trip up OpenText.
>>In marking up an EAD instance, there often are very long character strings
>>that must be broken by line wrap, or soft return, depending on the software
>>you are using.
>>Thus, it is not uncommon to have tags that wrap from one line to the next,
>>such as the <extref> tag in the following example:
>>Total Boxes: 8<lb>
>>Other Storage Formats: oversize<lb>
>>Linear Feet: 6.0</extent></physdesc></did><note><p><extref
>>Copyright © 1992 by the Yale University Library.</extref></p>
>>However, it seems that OpenText will allow some tags to be broken, but it
>>has a problem with other *crucial* tags. I discovered that all of the files
>>on my "refuse to index" list shared the same characteristic, which was that
>>the <archdesc> tag was broken between two lines as in the following example:
>>It is important that this tag be complete on one line, or OpenText will
>>throw out the file.
>>I haven't discovered any other specific tags that are as delicate, but I
>>suspect that any high-level tag might cause a similar problem.
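
Since it is not known which other tags are delicate, it may help to list every start tag in a file that is split across lines before handing the file to the indexer. The following is a rough Python sketch of such a check, not anything OpenText itself provides. It assumes attribute values never contain a literal ">", and it reads files as Latin-1 so that any byte decodes; the function name `find_split_tags` and the sample markup in the comments are hypothetical.

```python
import re
import sys

# A start tag "<name ...>" that contains at least one line break.
# Assumption: attribute values never contain a literal ">".
SPLIT_TAG = re.compile(r"<([A-Za-z][\w.-]*)[^<>]*[\r\n][^<>]*>")


def find_split_tags(text):
    """Return (line_number, element_name) for each start tag that spans lines."""
    hits = []
    for m in SPLIT_TAG.finditer(text):
        # Line number of the "<" that opens the tag (1-based).
        line_number = text.count("\n", 0, m.start()) + 1
        hits.append((line_number, m.group(1)))
    return hits


if __name__ == "__main__":
    # Usage (hypothetical script name): python find_split_tags.py file1.sgm file2.sgm
    for path in sys.argv[1:]:
        # Latin-1 is an assumption; it never fails to decode a byte.
        with open(path, encoding="latin-1") as handle:
            for line_number, name in find_split_tags(handle.read()):
                print(f"{path}:{line_number}: <{name}> is split across lines")
```

Any file that reports `archdesc`, or another high-level element, would be a candidate for the fix described above.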
>>I hope this helps anyone working with EAD and OpenText.
>>You might not need this information today, but do store it away...
>>Beinecke Rare Book and Manuscript Library
>>New Haven, CT 06520
>>p.s. OK, not *all* of my files on the "refuse to index" list were
>>completely fixed by this procedure. I still have *one* that refuses to
>>index despite the <archdesc> fix, but that's another mystery to be solved.