Hello All,

As an update, a couple of institutions have shared their locally developed code for ingesting content into the viewer, and this exercise is getting rather interesting.

Jeremy Echols at UO has a set of Python scripts that process born-digital content and produce a batch that can be loaded into Chronam.  The process starts with a PDF, then produces the needed JP2 and OCR files and builds the necessary directory structure and METS files.  I have been able to update those scripts and use them with our content, and I now have a batch that I can test with our viewer.
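
For anyone curious, the overall shape of that pipeline is roughly what the sketch below shows.  This is not the actual UO code, just an illustration assuming poppler's pdftoppm, OpenJPEG's opj_compress, and tesseract are on the PATH; the file names and directory layout are placeholders.

#!/usr/bin/env python3
"""Rough sketch of a PDF-to-batch pipeline (illustrative, not the UO scripts)."""

import subprocess
import sys
from pathlib import Path


def build_issue(pdf_path: str, batch_dir: str, dpi: int = 300) -> None:
    pdf = Path(pdf_path)
    issue_dir = Path(batch_dir) / pdf.stem
    issue_dir.mkdir(parents=True, exist_ok=True)

    # 1. Rasterize each PDF page to TIFF; pdftoppm numbers the pages for us.
    subprocess.run(
        ["pdftoppm", "-tiff", "-r", str(dpi), str(pdf), str(issue_dir / "page")],
        check=True,
    )

    for tif in sorted(issue_dir.glob("page-*.tif")):
        stem = issue_dir / tif.stem

        # 2. JPEG2000 derivative for the viewer.
        subprocess.run(
            ["opj_compress", "-i", str(tif), "-o", f"{stem}.jp2"], check=True
        )

        # 3. OCR.  Tesseract writes hOCR here; a real Chronam batch needs
        #    ALTO XML, so a conversion step would follow.
        subprocess.run(["tesseract", str(tif), str(stem), "hocr"], check=True)

    # 4. METS generation (batch and issue/page METS) is omitted here;
    #    that is the part the UO scripts take care of.


if __name__ == "__main__":
    build_issue(sys.argv[1], sys.argv[2])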

I do have a question about the format of some of the files and how that might affect the ingest process.  In the LOC guidelines, the JP2s have an XML box added that contains what looks to be identifying information for the page.  I see similar XML added to the PDFs.  Can someone tell me what the extra XML does or how it is used?  I am particularly interested in how it might be processed during an ingest.  Does it affect searching or indexing, or is it just there to identify what the file is and where it came from?  In short, can I safely drop it for non-NDNP batches and still have full functionality in the viewer?
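
In the meantime, here is a quick sketch of how I have been peeking at that embedded metadata, assuming it lives in a top-level 'xml ' box: it just walks the JP2 box structure and prints whatever XML it finds (it does not descend into superboxes).

#!/usr/bin/env python3
"""Print the contents of any top-level 'xml ' box in a JP2 file (a sketch)."""

import struct
import sys


def dump_xml_boxes(path: str) -> None:
    with open(path, "rb") as fh:
        while True:
            header = fh.read(8)
            if len(header) < 8:
                break
            length, box_type = struct.unpack(">I4s", header)
            if length == 1:
                # Extended length is stored in the next 8 bytes.
                length = struct.unpack(">Q", fh.read(8))[0]
                payload = length - 16
            elif length == 0:
                # A zero length means the box runs to the end of the file.
                payload = -1
            else:
                payload = length - 8
            data = fh.read() if payload < 0 else fh.read(payload)
            if box_type == b"xml ":
                print(data.decode("utf-8", errors="replace"))


if __name__ == "__main__":
    dump_xml_boxes(sys.argv[1])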

Thanks

--
Michael W. Bolton  |  Assistant Dean, Digital Initiatives
Sterling C. Evans Library  |  Texas A&M University
5000 TAMU  |  College Station, TX  77843-5000
Ph: 979-845-5751  |  [log in to unmask]