Re: Search engines
The main difference in software is that:
- some search each document in real time, on every query. This limits
scalability, as every document needs to be read, parsed, and searched for
every query.
- others read all of the documents periodically and build inverted indexes.
Queries are then run against the indexes only. This gives higher performance
and scalability, at higher cost and complexity.
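
To make the difference concrete, here is a minimal Python sketch of both kinds. It is only an illustration under my own assumptions: the function names and the sample finding-aid snippets are made up, and no real product (Zebra, SIM, Oracle) works exactly like this.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def scan_search(docs, term):
    """First kind: read and tokenize every document on every query."""
    term = term.lower()
    return sorted(doc_id for doc_id, text in docs.items()
                  if term in tokenize(text))


def build_index(docs):
    """Second kind: read all documents once and map each word
    to the set of documents that contain it (an inverted index)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def index_search(index, term):
    """Answer a query from the index alone, without rereading documents."""
    return sorted(index.get(term.lower(), set()))


# Hypothetical finding-aid snippets, for illustration only.
docs = {
    "fa001": "Papers of a South Dakota homesteader, 1880-1910",
    "fa002": "Correspondence and diaries of a Dakota Territory judge",
    "fa003": "University photographs and scrapbooks",
}

index = build_index(docs)              # done periodically, not per query
print(scan_search(docs, "dakota"))     # rereads all three documents
print(index_search(index, "dakota"))   # a single dictionary lookup
```

Both calls return the same documents. The difference is where the work happens: the scan repeats it for every query, while the index pays for it once at build time and has to be rebuilt when the documents change.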
The following mostly lists the latter kind of approach:
For free (non-commercial use) software, you might want to review Zebra from
Index Data (http://www.indexdata.com/). It is a full-text engine on top of
Z39.50 and (if I recollect) can handle structured information. There are
some additional interesting pieces of software and utilities there as well.
This isn't shrink-wrapped stuff, though; you would need someone with good
integration skills. But it looks like it has some possibilities.
Other Commercial products:
At the high end, in terms of both cost and capability: SIM (the Structured
Information Manager), www.simdb.com
Oracle 8i, and its Intermedia search technology (full text extensions to
Other "application" servers at various stages of development that may
include searching capability
some are listed at:
Or another useful list of XML database products:
Re: Word to XML
There are at least three possible approaches. Which one fits depends
fundamentally on how structured the information in the Word document is.
(1) OmniMark approach (free version): convert your Word document to RTF, then
either convert the RTF to XML and use XSLT to transform it, or program
OmniMark to recognize the elements and create your own XML document.
(2) Perl approach: convert the Word document to text, use Perl modules like
Parse::RecDescent to "recognize" the structure (presuming there is one), and
then create your XML document from these elements using some of the XML
libraries. Programmers only. I have had reasonable success with "tweaked"
catalogue-type cards.
(3) Open Office approach: use something like Open Office
(www.openoffice.org) to convert the Word document to XML (conforming to
their openoffice.dtd), then again use XSLT to try to transform as much as
possible. You are really attempting to take style and physical formatting
and translate that into structure.
Options 1 and 3 will work "better" if you can put (or already have put)
discrete named styles on the various elements in your Word document.
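
As a rough sketch of why named styles help: once each paragraph carries a style name, translating style into structure is little more than a lookup. The Python below assumes you have already extracted (style name, text) pairs from the document by whatever route. The style names, the "record" wrapper element, and the function name are all hypothetical, and the element names are only loosely borrowed from EAD, so the output is well-formed XML but not valid EAD.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from named Word styles to target element names.
STYLE_TO_ELEMENT = {
    "CollectionTitle": "unittitle",
    "CollectionDates": "unitdate",
    "ScopeNote": "scopecontent",
}


def styled_paragraphs_to_xml(paragraphs, root_name="record"):
    """Build an XML tree from (style name, text) pairs.

    Paragraphs whose style is not in the mapping fall back to <p>.
    """
    root = ET.Element(root_name)
    for style, text in paragraphs:
        tag = STYLE_TO_ELEMENT.get(style, "p")
        child = ET.SubElement(root, tag)
        child.text = text
    return root


# Illustrative input only, not taken from a real finding aid.
paragraphs = [
    ("CollectionTitle", "Example Family Papers"),
    ("CollectionDates", "1880-1910"),
    ("Normal", "Unstyled text falls back to a plain paragraph."),
]

root = styled_paragraphs_to_xml(paragraphs)
print(ET.tostring(root, encoding="unicode"))
```

Without named styles, everything lands in the fallback case, and you are back to guessing structure from fonts and indentation.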
The above presupposes you are trying to convert them to EAD. You can of
course just convert them to well-formed XML (or quasi-XML using Word 2000)
or to the openoffice.org DTD, but then you don't get all of the advantages
offered by the EAD.
At 02:35 PM 2/7/01 -0600, you wrote:
>As we are planning to convert our Finding Aids from Word to XML, I was
>looking at what kind of search engine we might be able to use to deliver
>them and I came to an almost complete halt. I've started out by looking at
>what seemed to be the most popular search engines: DynaWeb and LiveLink.
>However, DynaWeb is no longer available and LiveLink's search is no
>longer sold as a stand-alone application. I've also looked at Isite,
>but it also appears to be no longer available (correct me, if I am wrong
>on this one). Consequently, I was wondering about what applications are
>used by the institutions that are starting the conversion now. Being a
>small institution, we are looking for something fairly reasonable in
>price. If someone could put me on the right track, I would greatly
>appreciate it.
>Yuliya Lef, Digital Resources Coordinator
>I.D. Weeks Library, University of South Dakota
>Phone: (605) 677-6615