Hello All,

As the week comes to an end, I thought I would pass along a little of the information I have gathered so far on local collections in Chronam.  I want to point out two very good sources of information for building batches: the first is from the University of Oregon Libraries, and the second is from the North Carolina Digital Heritage Center.  Both sites are careful to point out that their processes are very much tuned to their own environments and are not "out of the box" solutions.

Jeremy Echols at the University of Oregon Libraries has his process documented on GitHub at:

    https://github.com/uoregon-libraries/pdf-to-chronam

UO actually has several projects on GitHub associated with Chronam.  Jeremy's process is interesting in that it starts with a PDF and works through building both the OCR file and the JP2.  It uses a slightly modified version of pdftotext, and the updated source is included on the site.  Most of the scripts are in Python, and I found them fairly easy to read, understand, and modify for our purposes.  I am not a Python programmer, but I was able to get the code running in short order.
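
To give a flavor of that pipeline, here is a minimal sketch of my own (not Jeremy's actual code) using the stock pdftotext and GraphicsMagick; his version swaps in the patched pdftotext and produces NDNP-ready output:

    # My own sketch of the PDF-first idea, NOT the UO code.
    # Assumes poppler's pdftotext and GraphicsMagick are installed.
    import subprocess

    def pdf_page_to_text(pdf_path, page, out_txt):
        # Extract the text layer for a single page.
        subprocess.check_call(["pdftotext", "-f", str(page),
                               "-l", str(page), pdf_path, out_txt])

    def pdf_page_to_jp2(pdf_path, page, out_jp2):
        # Rasterize one page and write JPEG2000.
        # gm page indexing is zero-based: file.pdf[0] is page 1.
        subprocess.check_call(["gm", "convert", "-density", "300",
                               "%s[%d]" % (pdf_path, page - 1), out_jp2])

    pdf_page_to_text("issue.pdf", 1, "0001.txt")
    pdf_page_to_jp2("issue.pdf", 1, "0001.jp2")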

The UO process does require a slight modification to the batch_load process in Chronam: a call to GraphicsMagick is needed to get the size of the JP2.  The modified version of the code can be found on the UO GitHub site listed above (after doing some navigating around on the site, that is).
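
The call itself is simple; something along these lines is all it takes (my sketch of the kind of call involved, not the actual UO patch):

    # Sketch of querying JP2 dimensions with GraphicsMagick;
    # not the actual patch to chronam's batch loader.
    import subprocess

    def jp2_dimensions(jp2_path):
        # gm identify -format "%w %h" prints width and height in pixels.
        out = subprocess.check_output(
            ["gm", "identify", "-format", "%w %h", jp2_path])
        width, height = out.decode().split()
        return int(width), int(height)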

Stephanie Williams at the North Carolina Digital Heritage Center took a slightly different approach.  Their process starts with an inventory in an Excel spreadsheet, which is then used to drive the generation of the necessary folder structure for the batch.  Stephanie includes a set of XSLT files that then generate the XML for the batch.  She also just finished creating some very nice documentation on the process, and it is online at:

    https://github.com/ncdhc/ndnp-local-batch-process

    https://github.com/ncdhc/ndnp-local-batch-process/wiki/An-approach-to-producing-NDNP-compatible-batch-ingest-packages-locally

Included in the files are sample spreadsheets, which are covered in the documentation.  The NCDHC process begins with TIFFs, as opposed to PDFs.  The generation of the JPEG2000 files, as well as the OCR and PDF creation, is not covered in this documentation; however, scripts are included to process the ALTO output from ABBYY to make it NDNP compliant.
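
To give a feel for the spreadsheet-driven step, here is my own illustration (not the NCDHC scripts; the column names and folder layout are made-up placeholders) of reading an inventory exported to CSV and laying out batch directories:

    # My illustration only -- the column names (lccn, issue_date,
    # edition) and the directory layout are hypothetical placeholders.
    import csv
    import os

    def build_folders(inventory_csv, batch_root):
        with open(inventory_csv) as f:
            for row in csv.DictReader(f):
                # One directory per issue, e.g. batch_root/sn12345678/1909031101
                issue_dir = os.path.join(batch_root, row["lccn"],
                                         row["issue_date"] + row["edition"])
                if not os.path.isdir(issue_dir):
                    os.makedirs(issue_dir)

    build_folders("inventory.csv", "batch_test_ver01")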

The NCDHC process produces a more complete METS file that includes the technical metadata section.  This metadata includes the JPEG2000 dimensions, which means the loader process in Chronam works without modification.
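
As an illustration of why that matters (my own sketch; I have not verified the exact MIX element names in the NDNP profile), a loader can pull the dimensions straight out of the METS instead of shelling out to GraphicsMagick:

    # Sketch only: matches any element whose local name looks like a
    # width/height field, since I have not checked the exact NDNP/MIX tags.
    import xml.etree.ElementTree as ET

    def dimensions_from_mets(mets_path):
        width = height = None
        for elem in ET.parse(mets_path).iter():
            local = elem.tag.rsplit("}", 1)[-1].lower()
            if local == "imagewidth":
                width = int(elem.text)
            elif local in ("imageheight", "imagelength"):
                height = int(elem.text)
        return width, height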

I found both sites to be a big help.  I learned a lot about the overall batch creation process by reading through each site and running the different scripts and processes.  I am also happy to say that, because of this information, I was able to successfully generate a batch package and ingest it on our test instance of Chronam.  I still have a lot of work to do, but I can see light at the end of the tunnel.

I really have to thank Karen Estlund for putting me on the right track, and then a big thank-you to Stephanie and Jeremy.  They were very helpful, and I do appreciate their patience.

I am continuing to work on documenting a more comprehensive workflow, but I did want to report out what I had learned this week.

Thanks

--
Michael W. Bolton  |  Assistant Dean, Digital Initiatives
Sterling C. Evans Library  |  Texas A&M University
5000 TAMU  |  College Station, TX  77843-5000
Ph: 979-845-5751  |  [log in to unmask]