Hi everybody,
The METS board has been discussing the proposed extension to the METS schema for recording the location of a file or bytestream within a file. This issue was raised on the mailing list in July but did not receive much attention. I would therefore like to bring it up again before actually adding a change request for the new version of the schema:
Proposed change:
Add BETYPE, BEGIN and END attributes to the <file> and <stream> elements.
A file or stream which is embedded in another file is represented by nested <file> elements or by a <stream> element as a child of a <file> element. As we have two different kinds of files, the file containing the other file or stream is called the container file. The container file is usually a ZIP, TAR or WARC file, but it could be any container format, e.g. a TIFF file containing several images (a Multi-TIFF file).
Use case:
Storing the location of a file or stream within a container in the METS file would allow data to be read directly from the container without loading and parsing the whole container file.
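To illustrate the use case, here is a minimal Python sketch of reading an embedded file by byte offset alone. The helper name read_byte_range is made up for this example, and I am assuming BEGIN is the offset of the first byte and END the offset just past the last byte; the proposal itself does not yet say whether END is inclusive. The TAR container is built in memory only so that the example is self-contained.

```python
import io
import tarfile


def read_byte_range(container, begin, end):
    """Return the bytes from offset begin up to (not including) end.

    Hypothetical helper: 'container' is any seekable binary file object.
    Treating END as exclusive is an assumption made for this sketch.
    """
    container.seek(begin)
    return container.read(end - begin)


# Build a small TAR container in memory to stand in for a real one.
payload = b"page image data"
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    info = tarfile.TarInfo(name="page1.tif")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# Determine the member's byte offsets once, when the METS file is created.
# These are the values that would go into BEGIN and END.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    member = tar.getmember("page1.tif")
    begin = member.offset_data
    end = begin + member.size

# Later: read the embedded file without parsing the TAR structure at all.
extracted = read_byte_range(buf, begin, end)
```

The last step needs no knowledge of the container format; only the first lookup does, and that happens once when the METS file is written.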
The definition of a container file is very vague. Though some file formats are designed for bundling several files into one big file, content files themselves may be containers for certain types of bytestreams. A TIFF file may actually contain several images or several manifestations of the same image (different resolutions).
The container file as such may not be the main focus of interest. The embedded files and bytestreams with their metadata are usually more important. Recording the location of a content file within a container file using byte offsets might make it possible to read those files even if the internal structure of the container file is unknown. This might prove very valuable, especially for the readability of new and still immature container formats such as WARC.
Though the <file> and <stream> elements would use the same attributes as the <area> element within the structMap, the semantics are very different. The <area> element points to an area within a content file which actually manifests the <div> object. A <div> object is typically not a file but represents a logical or physical entity such as a column, a page, a chapter etc.
Both kinds of references into a file may even be used at the same time:
- the file set consists of page images
- all page images are bundled into one container file (e.g. a TAR or ZIP file)
- the physical structMap defines columns
- the structMap contains pointers into the page image files using the <area> element (with XHTML coordinates defining the columns in the image file)
- the content <file> elements contain byte offsets defining each file's position within the container file.
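The location information in the <area> element and in the <file> element uses different points of reference: the <area> coordinates are relative to the content file, while the <file> byte offsets are relative to the container file.
I would appreciate any comments before I add this proposal to the METS change request wiki page.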
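As a rough sketch of how both kinds of references could sit side by side, the Python below parses a simplified METS fragment. BETYPE, BEGIN and END on <file> are the proposed attributes and are not part of the current schema; the IDs, offsets and coordinates are invented for illustration, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

METS_NS = {"mets": "http://www.loc.gov/METS/"}

# Simplified fragment. The attributes on the nested <file> are the PROPOSED
# extension; all IDs and numbers are made up for this example.
SAMPLE = """\
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:fileSec>
    <mets:fileGrp>
      <mets:file ID="CONTAINER1">
        <mets:file ID="PAGE1" BETYPE="BYTE" BEGIN="512" END="2048"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="column">
      <mets:fptr>
        <mets:area FILEID="PAGE1" SHAPE="RECT" COORDS="0,0,300,900"/>
      </mets:fptr>
    </mets:div>
  </mets:structMap>
</mets:mets>
"""


def embedded_file_offsets(root):
    """Map each nested file ID to (container ID, begin, end).

    The offsets are relative to the container file.
    """
    result = {}
    for container in root.iterfind(".//mets:fileGrp/mets:file", METS_NS):
        for inner in container.iterfind("mets:file", METS_NS):
            result[inner.get("ID")] = (
                container.get("ID"),
                int(inner.get("BEGIN")),
                int(inner.get("END")),
            )
    return result


def area_regions(root):
    """List (file ID, coordinates) pairs from the <area> elements.

    The coordinates are relative to the content file, not the container.
    """
    regions = []
    for area in root.iterfind(".//mets:area", METS_NS):
        coords = tuple(int(c) for c in area.get("COORDS").split(","))
        regions.append((area.get("FILEID"), coords))
    return regions


root = ET.fromstring(SAMPLE)
offsets = embedded_file_offsets(root)
regions = area_regions(root)
```

Here the byte offsets locate PAGE1 inside CONTAINER1, while the <area> coordinates locate the column inside PAGE1 once it has been extracted.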