Dear Mark,
I am very interested in a survey of the systems out there. Currently,
I am planning to set something up to find out what is really 'out
there'.
I think that, from an IR point of view, Lemur/Indri is probably the
most powerful IR system out there, as it supports XML element
retrieval. However, it is not a real XML IR system, as it is most often
used for full-text search. The nice thing about EAD is that you can use
the XML structure for indexing and retrieval.
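To illustrate what I mean by using the structure (just a sketch in
Python/lxml, not tied to any of the systems above, and the file name is
made up): you can index the text of each element together with its
path, so that <unittitle>, <unitdate>, <scopecontent> and so on can be
searched and weighted separately.

  from collections import defaultdict
  from lxml import etree

  def element_index(ead_file):
      # Map each element path (e.g. /ead/archdesc/did/unitdate) to the
      # text found there, so retrieval can target specific elements.
      tree = etree.parse(ead_file)
      index = defaultdict(list)
      for el in tree.getroot().iter():
          if not isinstance(el.tag, str):  # skip comments and PIs
              continue
          text = " ".join(el.itertext()).strip()
          if text:
              index[tree.getpath(el)].append(text)
      return index

  # idx = element_index("findingaid.xml")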
Another solution is MonetDB + PF/Tijah, which allows you to query the
XML database/index using XQuery. I am using that myself; in terms of
speed it cannot yet beat the other systems out there, but it is
effective and powerful and shows nicely what you can do with XML in
terms of retrieval.
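To give an idea of the kind of structured query this enables (in
PF/Tijah itself it would be an XQuery expression, combined with
NEXI-style about() predicates if I remember the interface correctly),
here is a rough stand-in using plain XPath via lxml in Python; the
element names are EAD, the file name is again made up.

  from lxml import etree

  # Find components whose <unitdate> mentions 1912 and whose
  # <scopecontent> mentions correspondence, then print their titles.
  tree = etree.parse("findingaid.xml")
  hits = tree.xpath(
      "//*[local-name()='c' or local-name()='c01' or local-name()='c02']"
      "[.//*[local-name()='unitdate'][contains(., '1912')]]"
      "[.//*[local-name()='scopecontent'][contains(., 'correspondence')]]"
  )
  for c in hits:
      print(c.xpath("string(.//*[local-name()='unittitle'])"))

Exact matching like this is of course not ranked retrieval; the point
of an XML IR system is that the same structural conditions can be
combined with relevance scoring.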
I am also wondering whether subscribers to this mailing list could let
us know what kind of back-end system they are using for indexing and
retrieval.
Specifically,
- Name of back-end
- Link to the online search system
Depending on the responses, I am thinking of setting up an online
questionnaire for the people who maintain these systems for their
institutions.
Kind regards,
junte zhang
Archives and Information Studies
University of Amsterdam
Custer, Mark wrote:
>
> Yesterday’s post about “normalized dates” has me thinking once again
> about how dates are used (or not used) in EAD records. As far as I can
> tell, RLG’s ArchiveGrid doesn’t permit searching by date (I could be
> wrong on this, though, as I don’t have full access to it, but it does
> use Lucene to index its records; though I suppose that most of these
> records are just MARC records?) and Proquest’s Archive Finder does
> permit searching by date, but it doesn’t really allow you to do very
> much (i.e. there’s no way to rank your results by “relevancy”).
>
> This leads me to a question: what sort of back-end systems are
> archives using for their EAD records? (are there any surveys out there
> that have this information, or should we start one???)
>
> At ECU, we're using an XML database only, but we aren't doing any
> advanced searching by date (primarily because, at this time, if you
> did search for something like "1912", it's not going to limit your
> results very much; and then, really, you're just back at the whole
> "browse by collection name" situation). However, you can do a keyword
> search for "1912", and the results that are returned to you will be
> ordered by the number of hits in each document, which, in my mind, is
> only a small difference in functionality, but perhaps more useful (in
> most occasions) than simply limiting your results to any and all
> collection date ranges that contain the year "1912".
>
> This leads me to another set of questions: is anyone out there using
> the "bulk" attribute as part of your information retrieval process?...
> is anyone using dates beyond the collection range (those dates
> associated with a series, folder, even an item) in the information
> retrieval process?... has anyone attempted to test their corpus of EAD
> records with their current search operations *vs.* indexing and
> searching those records by means of different models of IR, such as
> Nutch <http://lucene.apache.org/nutch/>, INDRI
> <http://www.lemurproject.org/indri/>, Solr
> <http://lucene.apache.org/solr/>, or even just Google Custom Search???
>
> I think it's great that we're encoding our documents so well, but I
> keep wondering if we're harnessing that information in the best
> possible ways yet (and perhaps the best solutions won't be tied to our
> encoding practices at all).
>
> Mark Custer
>
> Text & Markup Coordinator
>
> ECU Digital Collections
>
> http://personal.ecu.edu/custerm
>