We're using a similar technology mix for a 3.6 MB finding aid made up of
76 individual EAD files describing around 8,500 photographs. The
combined index/site includes a little over 10,000 EAD c0ns, which also
manifest as Fedora objects and Solr/Lucene documents.
Here's the site: http://www.library.northwestern.edu/africana/winterton
Here's a brief description of the moving parts:
There are 76 individual EAD files representing physical sections of
the archive (albums, folders, scrapbooks - http://repository.library.northwestern.edu/winterton/about.html).
Each of these EADs is ingested into Fedora, resulting, initially, in
76 EAD objects. Our EAD content model supports an access service
(bound as a Fedora disseminator) that indexes the file for text
extraction and encapsulates queries supporting a variety of structural
access methods. Here is a list of that disseminator's methods:
These methods require no parameters and return the associated
structural material for the EAD object they're invoked on:
These require a 'unitid' parameter and return the associated
structural material for the corresponding 'c0n' having that unitid
within the EAD object they're invoked on:
These are just general purpose xml queries that return a given element
by xml:id attribute value or a set of elements of a given name:
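The structural-access idea behind getComponent(unitid) can be sketched
roughly as follows. This is only an illustration of the concept, not
the actual disseminator implementation; the function name and the
simplified EAD markup are assumptions:

```python
# Illustrative sketch of the getComponent(unitid) idea: given an EAD
# document, return the serialized component whose <unitid> matches.
# Not Northwestern's actual disseminator code, just the concept.
import xml.etree.ElementTree as ET

def get_component(ead_xml, unitid):
    """Return the first component element whose <unitid> text matches."""
    root = ET.fromstring(ead_xml)
    for elem in root.iter():
        # c01..c12 are the numbered EAD component elements ("c0ns");
        # plain <c> is the unnumbered variant
        if elem.tag.startswith('c0') or elem.tag == 'c':
            uid = elem.find('did/unitid')
            if uid is not None and uid.text == unitid:
                return ET.tostring(elem, encoding='unicode')
    return None

sample = """<ead><archdesc><dsc>
  <c01><did><unitid>1</unitid><unittitle>Album 1</unittitle></did></c01>
  <c01><did><unitid>2</unitid><unittitle>Album 2</unittitle></did></c01>
</dsc></archdesc></ead>"""
```

The real service presumably works against the stored EAD datastream;
the point is just that each structural query resolves to a component
addressed by unitid.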
To build the combined EAD file, we use the getComponent(unitid) method
for each of the finding aids, passing the unitid for each EAD's top
level c01. These urls are used in xml entity declarations and then
referenced in an xml file for the combined EAD:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE ead PUBLIC "+//ISBN 1-931666-00-8//DTD ead.dtd (Encoded
Archival Description (EAD) Version 2002)//EN" "http://www.library.northwestern.edu/ead/dtd/ead.dtd" [
<!ENTITY wc01 SYSTEM "http://repository.library.northwestern.edu/fedora/get/inu:inu-ead-afri-wc01/inu:sdef-ead/getComponent?unitid=1">
<!ENTITY wc02 SYSTEM "http://repository.library.northwestern.edu/fedora/get/inu:inu-ead-afri-wc02/inu:sdef-ead/getComponent?unitid=2">
<!ENTITY wc03 SYSTEM "http://repository.library.northwestern.edu/fedora/get/inu:inu-ead-afri-wc03/inu:sdef-ead/getComponent?unitid=3">
<!ENTITY wc76 SYSTEM "http://repository.library.northwestern.edu/fedora/get/inu:inu-ead-afri-wc76/inu:sdef-ead/getComponent?unitid=76">
]>
The combined file is ingested in Fedora. Having the combined EAD file
in Fedora, we can leverage the above disseminator, making all
structural components for the entire archive available efficiently
through the above access methods.
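The wrapper-building step described above is mechanical enough to
script. A sketch, with the pid pattern and URLs taken from the entity
declarations; the bare &lt;ead&gt; wrapper is simplified, since the real
combined file presumably carries a full eadheader/archdesc around the
entity references:

```python
# Sketch of generating the combined-EAD wrapper: one external entity per
# finding aid, each resolving to that EAD's top-level component via the
# getComponent(unitid) dissemination.
BASE = "http://repository.library.northwestern.edu/fedora/get"

def entity_decl(n):
    # pid pattern follows the wc01..wc76 convention shown above
    pid = "inu:inu-ead-afri-wc%02d" % n
    return '<!ENTITY wc%02d SYSTEM "%s/%s/inu:sdef-ead/getComponent?unitid=%d">' % (
        n, BASE, pid, n)

def combined_ead(count=76):
    decls = "\n".join(entity_decl(n) for n in range(1, count + 1))
    refs = "\n".join("&wc%02d;" % n for n in range(1, count + 1))
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<!DOCTYPE ead SYSTEM '
            '"http://www.library.northwestern.edu/ead/dtd/ead.dtd" [\n'
            '%s\n]>\n<ead>\n%s\n</ead>\n' % (decls, refs))
```

When a validating parser resolves the external entities, each reference
pulls in the live dissemination for that finding aid's top-level c01.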
There are also Image and Crop objects in Fedora for each scan and crop
(when multiple photographs on a scan) for the collection. These image
objects are associated with their corresponding EAD c0n description
through unitid/pid convention. Image/Crop objects get their
description (DC and MODS) as disseminations
(getComponentAsMODS(unitid)) on the combined EAD object above. These
are also indexed in Solr/Lucene for searching in the site. Image/Crop
objects have their own image disseminator bound to a jp2 service for
efficiently getting scaled/cropped views on the images.
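The unitid/pid convention linking Image/Crop objects to their EAD
descriptions might look something like the sketch below. The pid
patterns and the combined-EAD pid here are assumptions for
illustration; only sdef-ead and getComponentAsMODS(unitid) come from
the description above:

```python
# Hypothetical sketch of the unitid/pid convention tying an Image/Crop
# object to its EAD component description.
BASE = "http://repository.library.northwestern.edu/fedora/get"
COMBINED_EAD_PID = "inu:inu-ead-afri-winterton"  # assumed pid

def image_pid(unitid):
    # assumed naming convention: the image object's pid is derivable
    # from the component's unitid
    return "inu:inu-dil-%s" % unitid

def mods_url(unitid):
    # the MODS description is a dissemination on the combined EAD object
    return "%s/%s/inu:sdef-ead/getComponentAsMODS?unitid=%s" % (
        BASE, COMBINED_EAD_PID, unitid)
```

The payoff of the convention is that nothing has to store the link:
given either object, the other's identifier is computable.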
This is a brief description, given perhaps in shorthand and a little
too packed. I didn't say anything about the Image/Crop disseminators or
the site's web presentation mechanisms. I can say more if there's
interest.
I just wanted to chime in on the 'divide and combine' approach with a
description of the mechanisms we're using, especially when I heard
Jennie's mention of Fedora and Lucene.
On Feb 9, 2010, at 8:26 AM, Jennie Levine Knies wrote:
> Your question about how we plan to create the finding aid is a good
> one. We have your standard "Finding Aid" site at the University of
> Maryland. <http://www.lib.umd.edu/archivesum>. For that photograph
> collection, we definitely wanted a record in that system.
> Originally, we were thinking "traditional" finding aid. Other
> options available to us *right now* would be something like putting
> in a basic "abstract" finding aid and linking out to a PDF or some
> other form of the Access database.
> However, the bigger question we've been asking (perhaps just to
> procrastinate? Although I like to think it's because we are trying
> to be thorough... ;)) is when is the finding aid not enough? We
> have asked this question, as well as "when is the finding aid
> appropriate?" We have done a good job at UM getting people to
> understand that ArchivesUM is where you go for archival finding
> aids, but what about our rare book collections? People don't always
> understand that those are in the catalog, and the question has come
> up asking if we couldn't put some of our non-archival special
> collections into an EAD and include them in ArchivesUM for discovery.
> With the photograph collection, we have also asked ourselves if it
> might not make sense instead to put the metadata for the folder
> descriptions into our Fedora digital repository as discrete items.
> That would boost our repository's size from a modest 10,000 or so
> records to about 75,000 records. The problem there is that we
> obviously don't have the entire collection digitized, so would that
> be confusing to people? It seems with this type of photograph
> collection, a true database, rather than an XML file, might be a
> better form of discovery.
> I think I would like both. With links between and levels of
> discovery all over the place. And I don't think we're too far away
> from that, in the scheme of things. All we need is some technical
> support and a will to succeed.
> Some other comments - I agree (I forget who mentioned this), the
> creation of the EAD is not so difficult. With this particular
> photograph collection, the information is already in a database, and
> we create our finding aids by starting from a database, so making
> the actual XML file is trivial. We could mount it online tomorrow.
> And, as I type this, I am wondering why we haven't just gone ahead
> with a stop-gap measure and used the abstract/PDF model to get
> started, instead of waiting for everything to get perfect. The
> presentation is always the challenge. Our system works great for 95%
> of our finding aids. It's just the oddballs that keep us on our toes.
> Also, another comment/question - we use Lucene to index our finding
> aids. I forget what the limit is, but there apparently is a size
> limit. We've known this since the beginning. So, with our very
> large finding aids, a search from within our site is going to miss
> some of that stuff in the depths. Maybe breaking down things into
> separate files, as Ethan suggested, would be a way to get around
> this. Will have to experiment...
> Jennie Levine Knies
> Manager, Digital Collections
> 2216 Hornbake Library
> University of Maryland
> College Park, MD 20742
> (301)314-2558 TEL (301)314-2709 FAX
> [log in to unmask] E-MAIL
> Ethan Gruber wrote, On 2/8/2010 3:30 PM:
>> I have found that Saxon processes anything that is 5mb or under
>> fairly efficiently, and load times aren't so bad as long as you're
>> not on dialup.
>> Jennie,
>> Your photograph collection in an Access database--do you plan on
>> making a traditional type of EAD finding aid that will go into a
>> collection of other finding aids and served through a typical type
>> of finding aid website, or do you want to create a site that puts
>> emphasis on the item level? I have done work on several projects
>> where the focus is on item-level information. I have gotten around
>> the issue of having a 10 MB finding aid by making each item a
>> standalone XML file that contains only a <c>. The <c>'s can be
>> reassembled into a full finding aid, if necessary, but processing
>> is only done on the small, singular XML file that has only several
>> kilobytes of information that describes an item.
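Ethan's split/reassemble pattern can be sketched briefly. File naming
and storage are left out; this just illustrates slicing a finding aid
into standalone per-item records and rebuilding the whole when needed:

```python
# Sketch of the divide-and-combine pattern: each item is a tiny
# standalone record holding a single <c>; the full finding aid is only
# reassembled when necessary.
import xml.etree.ElementTree as ET

def split_items(dsc_xml):
    """Yield (unitid, xml_string) for each <c> child of the <dsc>."""
    dsc = ET.fromstring(dsc_xml)
    for c in dsc.findall('c'):
        yield c.findtext('did/unitid'), ET.tostring(c, encoding='unicode')

def reassemble(items):
    """Rebuild a <dsc> from the standalone <c> strings."""
    return '<dsc>' + ''.join(xml for _, xml in items) + '</dsc>'

sample = ('<dsc>'
          '<c><did><unitid>p1</unitid></did></c>'
          '<c><did><unitid>p2</unitid></did></c>'
          '</dsc>')
items = list(split_items(sample))
```

Day-to-day processing touches only the few-kilobyte item files;
reassembly happens only for migration or a full-finding-aid view.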
>> I think dealing with massive finding aids is not such a big deal if
>> you put aside the notion that all the data must reside in the same
>> XML file at processing time. As long as you can extract all the
>> data into a single XML file at the time of migration, it doesn't
>> really matter how you store the files under normal circumstances.
>> On Mon, Feb 8, 2010 at 3:10 PM, Wick, Ryan
>> <[log in to unmask] <mailto:[log in to unmask]>> wrote:
>> Our finding aid for the Ava Helen and Linus Pauling Papers is
>> currently at 13.8MB of XML.
>> From very early on I put each series into its own XML file. They
>> weren't intended to stand on their own so there was nothing "above"
>> <c01>. There wasn't a specific link to them, and I just modified the
>> stylesheet to pull them in where appropriate. Last year we switched
>> to using XML's external entities referencing local files to "link"
>> to the series and are happy with the results. See
>> for more information on XML's entities.
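Ryan's local-file variant of the entity approach presumably looks
something like this; the file and entity names are illustrative, and
the eadheader/archdesc content is omitted:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE ead SYSTEM "ead.dtd" [
<!-- one external entity per series file; names are illustrative -->
<!ENTITY series01 SYSTEM "series01.xml">
<!ENTITY series02 SYSTEM "series02.xml">
]>
<ead>
  <!-- eadheader and archdesc front matter omitted -->
  &series01;
  &series02;
</ead>
```

The mechanism is the same as the Fedora-URL version earlier in the
thread, except the entities resolve to files on local disk.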
>> For web delivery, we have always split the display of the finding
>> aid into smaller pieces. We generate static HTML files and divide
>> the series and box listings into smaller chunks for ease of
>> navigation and retrieval. There is also an option to view the
>> series in one file. (The 17 series pages total about 16.4 MB of
>> HTML. The hundreds of smaller pages combined would have a greater
>> total, but most of that is overhead of duplicate navigation). The
>> majority of our traffic comes from search engines, so we've tried
>> our best to make our content easily indexable.
>> On another note, in 2006 we published a print version of the
>> Papers. This included some additional content, but the entire set
>> ended up being 1800 pages in 6 volumes. http://
>> Mark, thanks for posting about UNC's Hugh Morton collection, I
>> wasn't aware of it before.
>> Ryan Wick
>> Information Technology Consultant
>> Special Collections
>> Oregon State University Libraries
Library Technology Division - Enterprise Systems
Northwestern University Library
[log in to unmask]
847 491 5368