Hello All,

As the week comes to an end, I thought I would pass along a little
information I have gathered so far on local collections in Chronam.  I want
to point out two very good sources of information for building batches.
The first is from the University of Oregon Libraries and the second is from
the North Carolina Digital Heritage Center.  Both sites are careful to
point out that their processes are very much tuned to their own
environments and are not "out of the box" solutions.

Jeremy Echols at the University of Oregon Libraries has documented his
process on GitHub at


           https://github.com/uoregon-libraries/pdf-to-chronam


UO actually has several projects on GitHub associated with Chronam.
Jeremy's process is interesting in that it starts with a PDF and works
through building the OCR file and the JP2.  It uses a slightly modified
version of pdftotext, and the updated source is included in the repository.
Most of the scripts are in Python, and I found them fairly easy to read and
understand, as well as to modify for our purposes.  I am not a Python
programmer, but I was able to get the code running in short order.
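
I have not tried to reproduce Jeremy's modified build here, but to give a
flavor of the starting point: stock poppler pdftotext already has a -bbox
option that emits word-level coordinates, which is roughly the kind of
information his version works from.  A minimal sketch in Python (the
filenames are placeholders, and this is only an approximation of the idea,
not his actual code):

    import subprocess

    # Extract the text of a page PDF along with word-level bounding
    # boxes.  Stock pdftotext -bbox writes each word as an XHTML <word>
    # element with coordinates; Jeremy's modified build changes this
    # step, so treat the call below as an illustration only.
    subprocess.check_call([
        "pdftotext",
        "-bbox",           # emit each word with its coordinates
        "page0001.pdf",    # placeholder input page
        "page0001.html",   # placeholder output
    ])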

The UO process does require a slight modification to the batch_load process
in Chronam: a call to GraphicsMagick is needed to get the size of the JP2.
The modified version of the code can be found on the UO GitHub site listed
above (after doing some navigating around on the site, that is).
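
The change essentially amounts to asking GraphicsMagick for the image
dimensions at load time.  The real patch lives in the UO repositories; this
is just my own minimal sketch of the idea:

    import subprocess

    def jp2_dimensions(path):
        """Return (width, height) of a JP2 via GraphicsMagick."""
        out = subprocess.check_output(
            ["gm", "identify", "-format", "%w %h", path]).decode()
        width, height = out.split()
        return int(width), int(height)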

Stephanie Williams at the North Carolina Digital Heritage Center took a
slightly different approach.  Their process starts with an inventory in an
Excel spreadsheet.  The spreadsheet is then used to drive the generation of
the necessary folder structure for the batch.  Stephanie includes a set of
XSLT files that then generate the XML for the batches.  She also just
finished creating some very nice documentation on the process, which is
online at:


    https://github.com/ncdhc/ndnp-local-batch-process


https://github.com/ncdhc/ndnp-local-batch-process/wiki/An-approach-to-producing-NDNP-compatible-batch-ingest-packages-locally
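
To give a rough idea of the spreadsheet-driven step, here is my own sketch
(not Stephanie's code).  It assumes the inventory has been exported to CSV
with lccn, issue_date, and edition columns, and it uses an NDNP-style
directory layout; both the column names and the layout are assumptions on
my part, and the real inventory format is covered in the wiki above.

    import csv
    import os

    # Create the per-issue folder structure for a batch from an
    # inventory CSV.  The batch name, column names, and layout below
    # are illustrative assumptions, not the NCDHC conventions.
    with open("inventory.csv") as f:
        for row in csv.DictReader(f):
            issue_dir = os.path.join(
                "batch_example_ver01", "data",
                row["lccn"],                         # e.g. sn99999999
                "print",
                row["issue_date"] + row["edition"])  # e.g. 19000101 + 01
            os.makedirs(issue_dir)   # fails if the folder already exists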



Included in the files are sample spreadsheets that are covered in the
documentation.  The NCDHC process begins with TIFFs, as opposed to PDFs.
The generation of the JPEG2000 files, as well as the OCR and PDF creation,
is not covered in this documentation.  However, scripts are included to
process the ALTO output from ABBYY to make it NDNP compliant.
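
If I am reading things correctly, one of the main fix-ups is converting
ABBYY's pixel coordinates into the 1/1200-inch units that NDNP expects.
Here is a crude sketch of that idea (the real scripts do more than this,
and both the 400 dpi figure and the ALTO v2 namespace are my assumptions):

    from lxml import etree

    ALTO_NS = "{http://www.loc.gov/standards/alto/ns-v2#}"
    SCALE = 1200 / 400.0   # assumed 400 dpi scans -> 1/1200-inch units

    doc = etree.parse("page0001_abbyy.xml")
    # Rescale every coordinate attribute, whatever element it is on.
    for el in doc.iter("*"):
        for attr in ("HPOS", "VPOS", "WIDTH", "HEIGHT"):
            if attr in el.attrib:
                el.set(attr, str(int(round(float(el.get(attr)) * SCALE))))
    # Record the new measurement unit so the file is NDNP compliant.
    for mu in doc.iter(ALTO_NS + "MeasurementUnit"):
        mu.text = "inch1200"
    doc.write("page0001.xml", xml_declaration=True, encoding="UTF-8")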

The NCDHC process produces a more complete METS file that includes the
Technical Metadata (techMD) section.  This metadata includes the JPEG2000
dimensions, which means the loader process in Chronam works without
modification.
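
I have not traced exactly which elements the loader reads, but conceptually
it can pull the page dimensions straight out of the METS rather than
opening the JP2 itself.  Something like this namespace-agnostic lookup
(imageWidth/imageHeight are the MIX 2.0 spellings; which MIX version the
batches actually use is an assumption on my part):

    from lxml import etree

    doc = etree.parse("1900010101.xml")   # placeholder issue METS file
    # Match on local names so the MIX namespace version does not matter.
    width = doc.xpath("//*[local-name()='imageWidth']/text()")
    height = doc.xpath("//*[local-name()='imageHeight']/text()")
    print(width, height)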

I found both sites to be a big help.  I learned a lot about the overall
batch creation process by reading through each site and running the
different scripts and processes.  I am also happy to say that, because of
this information, I was able to successfully generate a batch package and
ingest it on our test instance of Chronam.  I still have a lot of work to
do, but I can see light at the end of the tunnel.

I really have to thank Karen Estlund for putting me on the right track.
And then a big thank you to Stephanie and Jeremy.  They were very helpful
and I do appreciate their patience.

I am continuing to work on documenting a more comprehensive workflow but I
did want to report out what I had learned this week.

Thanks

-- 
Michael W. Bolton  |  Assistant Dean, Digital Initiatives
Sterling C. Evans Library  |  Texas A&M University
5000 TAMU  |  College Station, TX  77843-5000
Ph: 979-845-5751  |  [log in to unmask]
http://library.tamu.edu