Michele R Combs wrote:
>> In response to that thread, I hastily jotted down some thoughts for a blog post, located here:
> From that post:
> "...has anyone attempted to test their corpus of EAD records with their current search operations vs. indexing and searching those records by means of different models of IR, such as Nutch, INDRI, Solr, or even just Google Custom Search???"
> This is a great question, related to the one I posed recently about whether anyone had tried to compare the various indexing and search options. Would be a really interesting research topic.
I think many software packages offer developers the same indexing and
search options. Arguably, some work better than others for retrieval,
but that also depends on the collection.
System evaluation of archives is what I am doing (or should be doing
;-)). To evaluate different search algorithms, a test collection is
needed: a fixed set of EAD files to index, plus a set of queries and
relevance judgments (see the Cranfield experiments in the 1950s, and
TREC later).
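
To make the Cranfield/TREC idea concrete, here is a minimal sketch of how two systems could be compared on the same test collection using mean average precision. The topic ids, EAD file ids and the two runs are made up for illustration; real judgments would come from assessors.

```python
from statistics import mean


def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list against the judged-relevant set."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / len(relevant_ids)


def mean_average_precision(run, qrels):
    """Mean of the per-topic average precision over all judged topics."""
    return mean(
        average_precision(run.get(topic, []), relevant)
        for topic, relevant in qrels.items()
    )


# Hypothetical relevance judgments: topic id -> set of relevant EAD file ids.
qrels = {
    "t1": {"ead_003", "ead_017"},
    "t2": {"ead_008"},
}

# Hypothetical ranked output of two systems indexing the same EAD files.
run_a = {"t1": ["ead_003", "ead_010", "ead_017"], "t2": ["ead_002", "ead_008"]}
run_b = {"t1": ["ead_010", "ead_017", "ead_003"], "t2": ["ead_008", "ead_002"]}

map_a = mean_average_precision(run_a, qrels)
map_b = mean_average_precision(run_b, qrels)
print(f"system A: {map_a:.3f}  system B: {map_b:.3f}")
```

The point of fixing the files, topics and judgments is that the only thing varying between the two scores is the search system itself.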
Many of the software packages use the same search algorithms. Nutch and
Solr both use Lucene (which employs the Vector Space Model and works
well). Lemur/Indri uses so-called language modelling (which tends to
perform better for users who have the time to scan long hit lists
exhaustively).
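
As a rough illustration of the two model families, the sketch below ranks three toy documents once with a TF-IDF cosine score (vector space) and once with query likelihood using Dirichlet smoothing (language modelling). This is a textbook simplification, not the actual scoring code of Lucene or Indri; the documents, the weighting formula and the value of mu are assumptions chosen for the example.

```python
import math
from collections import Counter

# Toy collection standing in for indexed EAD text (made up for illustration).
docs = {
    "d1": "letters and diaries of a dutch merchant".split(),
    "d2": "merchant shipping records amsterdam harbour".split(),
    "d3": "photographs of amsterdam canals".split(),
}


def tfidf_vector(tokens, df, n_docs):
    """Weight each known term by term frequency times a simple idf."""
    tf = Counter(tokens)
    return {t: tf[t] * math.log(1 + n_docs / df[t]) for t in tf if t in df}


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def vsm_rank(query, docs):
    """Rank documents by cosine similarity to the query (vector space)."""
    df = Counter(t for tokens in docs.values() for t in set(tokens))
    q_vec = tfidf_vector(query, df, len(docs))
    scores = {d: cosine(q_vec, tfidf_vector(toks, df, len(docs)))
              for d, toks in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)


def lm_rank(query, docs, mu=10.0):
    """Rank documents by smoothed log query likelihood (language model)."""
    collection = Counter(t for toks in docs.values() for t in toks)
    total = sum(collection.values())
    scores = {}
    for d, toks in docs.items():
        tf = Counter(toks)
        score = 0.0
        for t in query:
            p_coll = collection[t] / total
            if p_coll == 0:
                continue  # term unseen in the whole collection: ignore it
            score += math.log((tf[t] + mu * p_coll) / (len(toks) + mu))
        scores[d] = score
    return sorted(scores, key=scores.get, reverse=True)


query = "amsterdam merchant".split()
print("vector space:  ", vsm_rank(query, docs))
print("language model:", lm_rank(query, docs))
```

On a toy collection both models agree on the top hit; differences only show up on realistic collections, which is why a shared test collection is needed to compare them.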
Regarding data normalization, I was wondering which standard is
preferred. ISO 8601, i.e. YYYY-MM-DD?
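
If ISO 8601 is the target, normalizing free-text dates could look roughly like the sketch below. The list of input formats is an assumption (real finding aids will contain many more variants, plus ranges and approximate dates), and month-name parsing assumes an English locale.

```python
from datetime import datetime

# Assumed input formats; extend for the variants found in a real collection.
# Slash-separated dates are read day-first here, which is itself an assumption.
INPUT_FORMATS = ["%d %B %Y", "%B %d, %Y", "%d/%m/%Y", "%Y-%m-%d"]


def to_iso8601(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    cleaned = text.strip()
    for fmt in INPUT_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None


for raw in ["12 March 1954", "March 12, 1954", "12/03/1954", "circa 1950"]:
    print(f"{raw!r} -> {to_iso8601(raw)}")
```

Returning None for anything unparseable (such as "circa 1950") keeps uncertain dates visible for manual review instead of guessing.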
University of Amsterdam