Hi everybody,
 
The METS board has been discussing the proposed extension to the METS
schema regarding location information for a file or bytestream within a
file. This issue was raised on the mailing list in July but did not
receive much attention. I would therefore like to bring it up again
before actually adding a change request for the new version of the
schema:
 
Proposed change:
Add BETYPE, BEGIN and END attributes to the <file> and <stream>
elements.
 
Description:
A file or stream which is embedded in another file is represented by
nested <file> elements or by a <stream> element as a child of a <file>
element. To distinguish the two kinds of files, the file containing the
other file or stream is called the container file. The container file is
usually a ZIP, TAR or WARC file, but it could be any container format,
e.g. a TIFF file containing several images (a Multi-TIFF file).
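
As a rough illustration only, a nested <file> carrying the three
proposed attributes might look like the fragment embedded in the sketch
below. The IDs, the file name, the MIME types and the offset values are
invented for this example, and BETYPE="BYTE" is simply borrowed from the
existing <area> element as an assumption; the final schema change may
look different.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Hypothetical fragment: a TAR container with two embedded page images.
# All IDs, paths and offsets are made up for illustration.
FRAGMENT = f"""
<mets:fileGrp xmlns:mets="{METS_NS}" xmlns:xlink="{XLINK_NS}">
  <mets:file ID="CONTAINER01" MIMETYPE="application/x-tar">
    <mets:FLocat LOCTYPE="URL" xlink:href="file:///data/pages.tar"/>
    <mets:file ID="PAGE0001" MIMETYPE="image/tiff"
               BETYPE="BYTE" BEGIN="512" END="1535"/>
    <mets:file ID="PAGE0002" MIMETYPE="image/tiff"
               BETYPE="BYTE" BEGIN="2048" END="4095"/>
  </mets:file>
</mets:fileGrp>
"""


def embedded_locations(xml_text):
    """Yield (ID, BEGIN, END) for each nested <file> with byte offsets."""
    root = ET.fromstring(xml_text.strip())
    file_tag = f"{{{METS_NS}}}file"
    for container in root.findall(file_tag):
        for embedded in container.findall(file_tag):
            if embedded.get("BETYPE") == "BYTE":
                yield (
                    embedded.get("ID"),
                    int(embedded.get("BEGIN")),
                    int(embedded.get("END")),
                )


for file_id, begin, end in embedded_locations(FRAGMENT):
    print(file_id, begin, end)
```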
 
Use case:
Storing the location of a file or stream within a container in the METS
file would make it possible to read data directly from the container
without loading and parsing the whole container file.
The definition of a container file is very vague. Though some file
formats are designed for bundling several files into one big file,
content files themselves may be containers for certain types of
bytestreams. A TIFF file may actually contain several images or several
manifestations of the same image (different resolutions).
The container file as such may not be the main focus of interest. The
embedded files and bytestreams with their metadata are usually more
important. Recording the location of a content file within a container
file using byte offsets might make it possible to read those files even
if the internal structure of the container file is unknown. This might
prove very valuable, especially for the readability of new and still
immature container formats such as WARC.
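
A minimal sketch of what such direct access could look like is given
below. It assumes that BEGIN and END are zero-based byte offsets and
that END is inclusive; the proposal does not define this, so both are
assumptions. The function name read_embedded and the sample data are
invented for the example.

```python
import io


def read_embedded(container, begin, end):
    """Return bytes begin..end (END assumed inclusive) from a seekable
    binary file object, without parsing the container format."""
    if begin < 0 or end < begin:
        raise ValueError("invalid byte range")
    length = end - begin + 1
    container.seek(begin)
    data = container.read(length)
    if len(data) != length:
        raise EOFError("container is shorter than the recorded range")
    return data


# Stand-in for a real container file opened with open(path, "rb").
container = io.BytesIO(b"xxxxPAGE-ONExxxxPAGE-TWOxx")
first = read_embedded(container, 4, 11)
second = read_embedded(container, 16, 23)
```

Only the recorded byte range is read; nothing about the container's
internal structure needs to be known to the reading application.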
 
Though the <file> and <stream> elements would use the same attributes
as the <area> element within the structMap, the semantics are very
different. The <area> element points to an area within a content
file which actually manifests the <div> object. A <div> object is
typically not a file but represents a logical or physical entity such
as a column, a page, a chapter etc.
 
Both kinds of references into a file may even be used at the same time:
- the file set consists of page images
- all page images are bundled into one container file (e.g. a TAR or ZIP
file)
- the physical structMap defines columns
- the structMap contains pointers into the page image files using the
<area> element (using XHTML coordinates defining the columns in the
image file)
- the content <file> elements contain byte offsets defining each file's
position within the container file.
 
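The location information in the <area> element and that in the <file>
element use different points of reference: the former refers to a
position within the content file, the latter to a position within the
container file.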
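
The two kinds of reference could be modelled roughly as follows. This
is only a sketch: the class names, IDs and values are invented, and the
rectangle form (x1, y1, x2, y2) for the <area> coordinates is an
assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ByteRange:
    """From a content <file>: offsets relative to the container file."""
    container_id: str
    begin: int
    end: int


@dataclass(frozen=True)
class RectArea:
    """From an <area>: coordinates relative to the content (image) file."""
    file_id: str
    coords: tuple  # assumed form: (x1, y1, x2, y2)


def resolve(area, file_locations):
    """Return where the image sits in the container, and the rectangle
    within that image. The two results use different reference points."""
    return file_locations[area.file_id], area.coords


# Hypothetical data: one page image inside a container, one column on it.
file_locations = {"PAGE0001": ByteRange("CONTAINER01", 512, 1535)}
column = RectArea("PAGE0001", (100, 50, 400, 900))

location, rect = resolve(column, file_locations)
```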
 
I would appreciate any comments before I add this proposal to the
METS change request wiki page.
 
Ciao
Markus
