
As an addendum to this conversation and to address Mark's concerns about
indexing strategies/best practices, I think that there can be no standard
approach to indexing EAD data.  *What* you index is dependent upon your user
interface specifications.  Each institution may have its own requirements,
and stylesheets to transform EAD into Solr documents may vary radically from
institution to institution or project to project.  For example, Blacklight
may index EAD finding aids for full-text searching, along with some other
fields, like title, creator, and date, but Blacklight may not incorporate
controlled access headings, like persname, corpname, occupation, and
genreform, by default.

When indexing EAD into Solr, one must make a conscious decision regarding
the importance of certain pieces of information and how users may wish to
use that information to improve the relevance of their searches.  Some
institutions may deal with many finding aids and want to represent each
finding aid as a separate Solr document.  Other uses of EAD may require a
Solr document for each individual item within a finding aid (so you may
have hundreds or thousands of separate Solr documents extracted from a
single EAD file).
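
To make the item-level option concrete, here is a minimal Python sketch that
emits one record per component of a finding aid.  It is an illustration under
stated assumptions, not anyone's production method: the sample EAD is made up
and un-namespaced, every component is assumed to carry an id attribute, and
the field names (id, parent, path, level, title) are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Made-up, un-namespaced EAD fragment used only for illustration.
SAMPLE_EAD = """<ead>
  <eadheader><eadid>example001</eadid></eadheader>
  <archdesc level="collection">
    <did><unittitle>Example Family Papers</unittitle></did>
    <dsc>
      <c01 id="s1" level="series">
        <did><unittitle>Correspondence</unittitle></did>
        <c02 id="s1f1" level="file">
          <did><unittitle>Letters, 1850-1860</unittitle></did>
        </c02>
        <c02 id="s1f2" level="file">
          <did><unittitle>Letters, 1861-1870</unittitle></did>
        </c02>
      </c01>
    </dsc>
  </archdesc>
</ead>"""

# Matches the unnumbered <c> element and the numbered <c01>..<c12> elements.
COMPONENT_TAG = re.compile(r"^c(0[1-9]|1[0-2])?$")


def component_docs(ead_xml):
    """Return one flat dict per component, in document order.

    Assumes every component has an id attribute (hypothetical requirement).
    """
    ead = ET.fromstring(ead_xml)
    root_id = ead.findtext("eadheader/eadid").strip()
    docs = []

    def walk(parent_el, parent_id, parent_path):
        for child in parent_el:
            if not COMPONENT_TAG.match(child.tag):
                continue
            doc_id = root_id + "_" + child.get("id")
            path = parent_path + "/" + child.get("id")
            title = child.findtext("did/unittitle", default="")
            docs.append({
                "id": doc_id,
                "parent": parent_id,   # lets the interface move up a level
                "path": path,          # full ancestry for display
                "level": child.get("level", ""),
                "title": " ".join(title.split()),
            })
            walk(child, doc_id, path)

    dsc = ead.find("archdesc/dsc")
    if dsc is not None:
        walk(dsc, root_id, root_id)
    return docs


if __name__ == "__main__":
    for doc in component_docs(SAMPLE_EAD):
        print(doc)
```

Each dict would become one Solr document.  Whether a parent field, a path
field, or both are worth storing depends on how the interface needs to
navigate the hierarchy.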

Ideally, a Solr document would include text-searchable fields (full-text of
the document plus perhaps title and author), fields for display purposes,
and a variety of facets that are useful for categorizing the document and
associating it with related documents--for example, corpname, publisher,
persname, subject, etc.
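
As a rough illustration of that mix of fields, the Python sketch below builds
a Solr <add><doc> from a single finding aid.  It is not the EADitor
stylesheet; the sample EAD is made up and un-namespaced, and the field names
(id, title, date, fulltext, and the *_facet fields) are assumptions that
would have to match whatever is defined in your own Solr schema.

```python
import xml.etree.ElementTree as ET

# Made-up, un-namespaced EAD fragment used only for illustration.
SAMPLE_EAD = """<ead>
  <eadheader><eadid>example001</eadid></eadheader>
  <archdesc level="collection">
    <did>
      <unittitle>Example Family Papers</unittitle>
      <unitdate>1850-1900</unitdate>
      <origination><persname>Example, Jane</persname></origination>
    </did>
    <controlaccess>
      <persname>Example, Jane</persname>
      <corpname>Example Trading Company</corpname>
      <subject>Commerce</subject>
    </controlaccess>
  </archdesc>
</ead>"""

# Controlled access elements turned into facet fields (a hypothetical choice).
FACET_ELEMENTS = ("persname", "corpname", "subject", "genreform")


def ead_to_solr_add(ead_xml):
    """Build an <add><doc> element for one finding aid."""
    ead = ET.fromstring(ead_xml)
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")

    def add_field(name, value):
        # Skip missing or empty values and normalize whitespace.
        if value and value.strip():
            field = ET.SubElement(doc, "field", name=name)
            field.text = " ".join(value.split())

    # Displayable and searchable fields.
    add_field("id", ead.findtext("eadheader/eadid"))
    add_field("title", ead.findtext("archdesc/did/unittitle"))
    add_field("date", ead.findtext("archdesc/did/unitdate"))

    # One facet field per controlled access heading.
    for tag in FACET_ELEMENTS:
        for el in ead.iterfind("archdesc/controlaccess/" + tag):
            add_field(tag + "_facet", "".join(el.itertext()))

    # Full text of the whole finding aid for keyword searching.
    add_field("fulltext", " ".join(ead.itertext()))
    return add


if __name__ == "__main__":
    print(ET.tostring(ead_to_solr_add(SAMPLE_EAD), encoding="unicode"))
```

The resulting XML follows Solr's <add><doc><field name="..."> update format
and could be posted to a Solr update handler.  An XSLT stylesheet doing the
same mapping would work equally well.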

Attached is a very simple stylesheet for generating a Solr document from a
finding aid.  This is the stylesheet used in the EADitor project when one
wishes to "publish" a guide via posting it to Solr.  There are some
searchable fields, some displayable fields, and eight facets.

Ethan Gruber

On Fri, Apr 16, 2010 at 10:29 AM, Wang, Ching-Hsien <[log in to unmask]> wrote:

>  Having a little trouble sending this message, so let me try again.
>
>
>
> I agree with John Rees and Ethan Gruber.  Indexing in the back-end and
> presentation in the front are separate.  We have used Solr for indexing for
> several years and we are comfortable with it.  The most recent example is (
> http://collections.si.edu/search/  ).
>
>
>
> During indexing configuration, hierarchy levels will need to be flattened
> for the best indexing and search results.  In Solr, you can set
> stored="true" on any field to make its value available for display, and you
> can preserve hierarchy levels for display if you wish.  You need to do some
> planning ahead of time to keep your presentation options open.
>
>
>
> We don’t have enough EAD data at the Smithsonian at this point, but I think
> that will change in about a year or two.  When we have enough EAD data, we
> plan to define our index.  We will share our configuration with you when we
> get there.
>
>
>
> *Ching-hsien Wang*, * Manager*
>
> Library and Archives System Support Branch
>
> Office of Chief Information Officer
>
> Smithsonian Institution
>
> 202-633-5581(office)  202-312-2874(fax)
>
> [log in to unmask]
>
> Visit us online: www.siris.si.edu
>
>
>
> *From:* Encoded Archival Description List [mailto:[log in to unmask]] *On Behalf
> Of *Rees, John (NIH/NLM) [E]
> *Sent:* Friday, April 16, 2010 9:14 AM
> *To:* [log in to unmask]
> *Subject:* Re: Indexing EAD using Solr
>
>
>
> I agree with Ethan. Indexing EAD with SOLR is different than needing a
> presentation interface, which you still need and what everyone seems to
> continually complain about. That said, my brief experience, along with some
> hearsay, is that SOLR has trouble with deeply hierarchical data. It is
> pretty easy to index at the archdesc level but once you get into the dsc all
> that inheritance business doesn’t fare so well.
>
>
>
> But I’m no SOLR expert.
>
>
>
> John
>
>
>
>
>
> John P. Rees, MA, MLIS
>
> Curator, Archives and Modern Manuscripts
>
> History of Medicine Division, MSC 3819
>
> National Library of Medicine
>
> 8600 Rockville Pike
>
> Bethesda, MD 20894
>
>
>
>
>
>
>
> *From:* Ethan Gruber [mailto:[log in to unmask]]
> *Sent:* Thursday, April 15, 2010 1:46 PM
> *To:* [log in to unmask]
> *Subject:* Re: Indexing EAD using Solr
>
>
>
> Hi Mark,
>
> I've used Solr for several different applications of EAD, from traditional
> finding aids to metadata that is intensive and focused at the item level,
> such as describing museum artifacts, like coins.  I think the Blacklight
> approach to displaying EAD with the application called "Raven" is far
> different than indexing an entire guide as a Solr document, and I'm not
> entirely convinced their method is scalable to a collection of thousands or
> tens of thousands of EAD files.
>
> To address Lisa's statement about reviewing XTF vs. Solr, I'm not sure you
> can compare the two that way.  Solr isn't a mechanism for viewing EAD,
> though I think that the indexing of data into Solr gives one much more
> flexibility to develop a robust framework for searching/browsing documents
> than what Lucene in XTF allows.
>
> Ethan Gruber
>
> 2010/4/15 Király Péter <[log in to unmask]>
>
> Hi Mark,
>
> I have done it once, for a not very sophisticated but quite large EAD set,
> with Drupal as the interface. The steps I took:
>
> 1) created a flat XML file from the original EAD, conforming to the Solr
> input format; important sub-steps:
> a) preserving parent-child relationships with a record ID and a "parent"
> field (c01...c12 levels)
> b) preserving the full path with XPath expressions
> (rootID/childID/grandchildID/.../currentDocID)
> c) converting dates to Solr format
>
> 2) loaded it into Solr
> 3) wrote simple methods to handle
> a) navigation across the hierarchy
> b) searching dates (and other fields, but those are trivial)
> c) showing the full path
>
> That was all I did.
>
> Péter
>
> ----- Original Message ----- From: "Mark A. Matienzo" <[log in to unmask]>
>
>
> To: <[log in to unmask]>
>
> Sent: Thursday, April 15, 2010 5:32 PM
>
>
> Subject: Indexing EAD using Solr
>
> I know there has been some discussion related to this about making EAD
> available as part of the discovery layer, but I'm interested in
> getting a sense of which institutions are using Solr [0] to index EAD.
> At this point, I'm more interested in discussing the different
> indexing strategies from a technical standpoint rather than focusing
> too much on the discovery layer. For what it's worth, this discussion
> began [1] when some folks were talking about incorporating EAD into a
> Solr index to be used by Blacklight [2], an open source discovery
> layer.
>
> If your institution is using Solr to index EAD, can you briefly
> describe your indexing process? I would be interested in coordinating
> future work, or potentially developing a set of recommendations/best
> practices to share with the community.
>
> [0] http://lucene.apache.org/solr
> [1]
> http://groups.google.com/group/blacklight-development/browse_thread/thread/848bae32b11a8501
> [2] http://projectblacklight.org/
>
> Mark A. Matienzo
> Digital Archivist, Manuscripts and Archives
> Yale University Library
>
>
>