We're using a similar technology mix for a 3.6MB finding aid made up of 76 individual EAD files describing around 8,500 photographs. The combined index/site includes a little over 10K EAD c0ns, which also manifest as Fedora objects and Solr/Lucene documents.

Here's the site: http://www.library.northwestern.edu/africana/winterton

Here's a brief description of the moving parts:
There are 76 individual EAD files representing physical sections of the archive (albums, folders, scrapbooks - http://repository.library.northwestern.edu/winterton/about.html). Each of these EADs in ingested into Fedora resulting, initially, in 76 EAD objects. Our EAD content model supports an access service (bound as a Fedora disseminator) that indexes the file for text extraction and encapsulates queries supporting a variety of structural access methods. Here is a list of that disseminator's methods:

These methods require no parameters and return the associated structural material for the EAD object they're invoked on:
getEADHeader
getComponentTOC
getComponents
getArchDescNoComponents
getAsHTML
getChildrenAsHTML

These require a 'unitid' parameter and return the associated structural material for the corresponding 'c0n' having that unitid within the EAD object they're invoked on:
getComponent(unitid)
getComponentStructure(unitid)
getChildComponents(unitid)
getAncestorComponents(unitid)
getEmbeddedComponent(unitid)
getComponentAsMODS(unitid)
getComponentAsDC(unitid)
getComponentAsHTML(unitid)
getComponentAsEmbeddedHTML(unitid)
getComponentChildrenAsHTML(unitid)
getComponentChildrenAsJSON(unitid)

These are general-purpose XML queries that return a given element by its xml:id attribute value, or a set of elements with a given name:
getElementById(xmlid)
getElementsByName(name)
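The dissemination URLs follow Fedora's classic access pattern, /fedora/get/<pid>/<sDef-pid>/<method>, visible in the entity declarations later in this message. A minimal sketch of building those URLs (the helper name is mine, not part of our service; the base URL and inu:sdef-ead pid are taken from the examples here):

```python
# Sketch: build Fedora "classic" dissemination URLs for the EAD access
# service described above. The URL pattern and the inu:sdef-ead service
# definition pid come from the entity declarations in this post; the
# function name itself is hypothetical.
BASE = "http://repository.library.northwestern.edu/fedora/get"
SDEF = "inu:sdef-ead"

def dissemination_url(pid, method, **params):
    """Return the access URL for `method` on the EAD object `pid`."""
    url = f"{BASE}/{pid}/{SDEF}/{method}"
    if params:
        query = "&".join(f"{k}={v}" for k, v in sorted(params.items()))
        url = f"{url}?{query}"
    return url

# No-parameter structural method:
print(dissemination_url("inu:inu-ead-afri-wc01", "getComponentTOC"))

# unitid-parameterized method:
print(dissemination_url("inu:inu-ead-afri-wc01", "getComponent", unitid=1))
```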

To build the combined EAD file, we use the getComponent(unitid) method for each of the finding aids, passing the unitid for each EAD's top-level c01. These URLs are used in XML entity declarations, which are then referenced in an XML file for the combined EAD:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE ead PUBLIC "+//ISBN 1-931666-00-8//DTD ead.dtd (Encoded Archival Description (EAD) Version 2002)//EN" "http://www.library.northwestern.edu/ead/dtd/ead.dtd" [
<!ENTITY wc01 SYSTEM "http://repository.library.northwestern.edu/fedora/get/inu:inu-ead-afri-wc01/inu:sdef-ead/getComponent?unitid=1">
<!ENTITY wc02 SYSTEM "http://repository.library.northwestern.edu/fedora/get/inu:inu-ead-afri-wc02/inu:sdef-ead/getComponent?unitid=2">
<!ENTITY wc03 SYSTEM "http://repository.library.northwestern.edu/fedora/get/inu:inu-ead-afri-wc03/inu:sdef-ead/getComponent?unitid=3">
...
<!ENTITY wc76 SYSTEM "http://repository.library.northwestern.edu/fedora/get/inu:inu-ead-afri-wc76/inu:sdef-ead/getComponent?unitid=76">
]>
<ead>
<eadheader .../>
<archdesc>
...
<dsc>
&wc01;
&wc02;
&wc03;
...
&wc76;
</dsc>
</archdesc>
</ead>
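Since the 76 entity declarations are entirely regular, the DOCTYPE block can be generated rather than typed by hand. A sketch, assuming the sequential pid/unitid numbering shown in the example above (the real mapping may differ):

```python
# Sketch: generate the entity declarations and references for the
# combined EAD. Assumes the sequential pid/unitid numbering shown in
# the example above.
BASE = "http://repository.library.northwestern.edu/fedora/get"

def entity_decls(n):
    """One <!ENTITY> declaration per source EAD object."""
    lines = []
    for i in range(1, n + 1):
        name = f"wc{i:02d}"
        url = f"{BASE}/inu:inu-ead-afri-{name}/inu:sdef-ead/getComponent?unitid={i}"
        lines.append(f'<!ENTITY {name} SYSTEM "{url}">')
    return "\n".join(lines)

def entity_refs(n):
    """The matching &wcNN; references for the <dsc> body."""
    return "\n".join(f"&wc{i:02d};" for i in range(1, n + 1))

print(entity_decls(76))
print(entity_refs(76))
```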


The combined file is then ingested into Fedora, where it gets the same disseminator, making all structural components for the entire archive efficiently available through the access methods above.

There are also Image and Crop objects in Fedora for each scan and crop (when there are multiple photographs on a scan) in the collection. These image objects are associated with their corresponding EAD c0n description through a unitid/pid convention. Image/Crop objects get their description (DC and MODS) as disseminations (getComponentAsMODS(unitid)) on the combined EAD object above. These are also indexed in Solr/Lucene for searching in the site. Image/Crop objects have their own image disseminator bound to a JP2 service for efficiently getting scaled/cropped views of the images.
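The unitid/pid convention isn't spelled out here, so the mapping below is purely hypothetical (including the pids); it only illustrates the idea of an Image object resolving its description through the combined EAD object's disseminator:

```python
# Hypothetical sketch of a unitid/pid association: the actual naming
# convention and pids are not given in this post. Shown only to
# illustrate an image object resolving its MODS description via the
# combined EAD object's disseminator.
BASE = "http://repository.library.northwestern.edu/fedora/get"
COMBINED_EAD_PID = "inu:inu-ead-afri-combined"   # hypothetical pid

def image_pid_to_unitid(pid):
    """e.g. 'inu:inu-dil-afri-0042' -> '0042' (assumed convention)."""
    return pid.rsplit("-", 1)[-1]

def mods_url_for_image(pid):
    """MODS dissemination URL for an Image/Crop object's description."""
    unitid = image_pid_to_unitid(pid)
    return f"{BASE}/{COMBINED_EAD_PID}/inu:sdef-ead/getComponentAsMODS?unitid={unitid}"

print(mods_url_for_image("inu:inu-dil-afri-0042"))
```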

This is a brief description, given perhaps in shorthand and a little too densely packed. I didn't say anything about the Image/Crop disseminators or the site's web presentation mechanisms. I can say more if there's specific interest.

I just wanted to chime in on the 'divide and combine' approach with a description of the mechanisms we're using, especially when I heard Jennie's mention of Fedora and Lucene.

Thanks,
Bill






On Feb 9, 2010, at 8:26 AM, Jennie Levine Knies wrote:

Ethan,
Your question about how we plan to create the finding aid is a good one.  We have your standard "Finding Aid" site at the University of Maryland.  <http://www.lib.umd.edu/archivesum>. For that photograph collection, we definitely wanted a record in that system.  Originally, we were thinking "traditional" finding aid.  Other options available to us *right now* would be something like putting in a basic "abstract" finding aid and linking out to a PDF or some other form of the Access database.

However, the bigger question we've been asking (perhaps just to procrastinate? Although I like to think it's because we are trying to be thorough... ;)) is when is the finding aid not enough?  We have asked this question, as well as "when is the finding aid appropriate?"  We have done a good job at UM getting people to understand that ArchivesUM is where you go for archival finding aids, but what about our rare book collections?  People don't always understand that those are in the catalog, and the question has come up asking if we couldn't put some of our non-archival special collections into an EAD and include them in ArchivesUM for discovery.

With the photograph collection, we have also asked ourselves if it might not make sense instead to put the metadata for the folder descriptions into our Fedora digital repository as discrete items.  That would boost our repository's size from a modest 10,000 or so records to about 75,000 records.  The problem there is that we obviously don't have the entire collection digitized, so would that be confusing to people?  It seems with this type of photograph collection, a true database, rather than an XML file, might be a better form of discovery.

I think I would like both.  With links between and levels of discovery all over the place.  And I don't think we're too far away from that, in the scheme of things.  All we need is some technical support and a will to succeed.

Some other comments - I agree (I forget who mentioned this), the creation of the EAD is not so difficult. With this particular photograph collection, the information is already in a database, and we create our finding aids by starting from a database, so making the actual XML file is trivial.  We could mount it online tomorrow.  And, as I type this, I am wondering why we haven't just gone ahead with a stop-gap measure and used the abstract/PDF model to get started, instead of waiting for everything to get perfect.  The presentation is always the challenge. Our system works great for 95% of our finding aids.  It's just the oddballs that keep us on our toes.

Also, another comment/question - we use Lucene to index our finding aids.  I forget what the limit is, but there apparently is a size limit. We've known this since the beginning.  So, with our very large finding aids, a search from within our site is going to miss some of that stuff in the depths.  Maybe breaking things down into separate files, as Ethan suggested, would be a way to get around this.  Will have to experiment...
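(If the limit is Lucene's maxFieldLength, that historically defaults to indexing only the first 10,000 tokens of a field; in Solr it can be raised in solrconfig.xml - shown here as it appeared in Solr 1.x, so verify against your version:)

```xml
<!-- solrconfig.xml (Solr 1.x): raise Lucene's default 10,000-token
     per-field indexing limit; exact placement may vary by version -->
<indexDefaults>
  <maxFieldLength>2147483647</maxFieldLength>
</indexDefaults>
```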

Jennie

~*~
Jennie Levine Knies
Manager, Digital Collections
2216 Hornbake Library
University of Maryland
College Park, MD 20742
(301)314-2558 TEL (301)314-2709 FAX
[log in to unmask] E-MAIL
http://www.lib.umd.edu/digital

Ethan Gruber wrote, On 2/8/2010 3:30 PM:
I have found that Saxon processes anything that is 5mb or under fairly efficiently, and load times aren't so bad as long as you're not on dialup.

Jennie,
Your photograph collection in an Access database--do you plan on making a traditional type of EAD finding aid that will go into a collection of other finding aids and be served through a typical finding aid website, or do you want to create a site that puts emphasis on the item level?  I have done work on several projects where the focus is on item-level information.  I have gotten around the issue of having a 10 mb finding aid by making each item a standalone XML file that contains only a <c>.  The <c>'s can be reassembled into a full finding aid, if necessary, but processing is only done on the small, singular XML file that has only several kilobytes of information describing an item.
I think dealing with massive finding aids is not such a big deal if you put aside the notion that all the data must reside in the same XML file at processing time.  As long as you can extract all the data into a single XML file at the time of migration, it doesn't really matter how you store the files under normal circumstances.
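A rough sketch of that split/reassemble idea (the element handling here is simplified and hypothetical; a real EAD's nested hierarchy and header would need more care):

```python
# Sketch of the "split into standalone <c> files, reassemble on demand"
# idea. Element handling is simplified and hypothetical; a real EAD's
# nested <c> hierarchy and <eadheader> would need preserving too.
import xml.etree.ElementTree as ET

def split_components(ead_xml):
    """Return each top-level <c> under <dsc> as a standalone XML string."""
    root = ET.fromstring(ead_xml)
    dsc = root.find(".//dsc")
    return [ET.tostring(c, encoding="unicode") for c in dsc.findall("c")]

def reassemble(component_xml_strings):
    """Rebuild a minimal combined <ead> from standalone <c> strings."""
    dsc = ET.Element("dsc")
    for s in component_xml_strings:
        dsc.append(ET.fromstring(s))
    ead = ET.Element("ead")
    archdesc = ET.SubElement(ead, "archdesc")
    archdesc.append(dsc)
    return ET.tostring(ead, encoding="unicode")

sample = "<ead><archdesc><dsc><c id='1'/><c id='2'/></dsc></archdesc></ead>"
parts = split_components(sample)
print(len(parts))          # 2
print(reassemble(parts))
```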
Ethan
On Mon, Feb 8, 2010 at 3:10 PM, Wick, Ryan <[log in to unmask]> wrote:
 Our finding aid for the Ava Helen and Linus Pauling Papers is
 currently at 13.8MB of XML.
 From very early on I put each series into its own XML file. They
 weren't intended to stand on their own so there was nothing "above"
 <c01>. There wasn't a specific link to them, and I just modified our
 stylesheet to pull them in where appropriate. Last year we switched
 to using XML's external entities referencing local files to "link"
 to the series and are happy with the results. See
 http://www.javacommerce.com/displaypage.jsp?name=entities.sql&id=18238
 for more information on XML's entities.
 For web delivery, we have always split the display of the finding
 aid into smaller pieces. We generate static HTML files and divide
 the series and box listings into smaller chunks for ease of
 navigation and retrieval. There is also an option to view the entire
 series in one file. (The 17 series pages total about 16.4 MB of
 HTML. The hundreds of smaller pages combined would have a greater
 total, but most of that is overhead of duplicate navigation). The
 majority of our traffic comes from search engines, so we've tried
 our best to make our content easily indexable.
 http://osulibrary.oregonstate.edu/specialcollections/coll/pauling/index.html
 On another note, in 2006 we published a print version of the Pauling
 Papers. This included some additional content but the entire package
 ended up being 1800 pages in 6 volumes. http://paulingcatalogue.org/
 Mark, thanks for posting about UNC's Hugh Morton collection, I
 wasn't aware of it before.
 Ryan Wick
 Information Technology Consultant
 Special Collections
 Oregon State University Libraries
 http://osulibrary.oregonstate.edu/specialcollections


Bill Parod
Library Technology Division - Enterprise Systems
Northwestern University Library
[log in to unmask]
847 491 5368