The METS board has been discussing the proposed extension to the METS schema
regarding location information for a file or bytestream within a file. This
issue was raised on the mailing list in July but did not receive much
attention. Therefore I would like to bring it up again before actually adding a
change request for the new version of the schema:

Add BETYPE, BEGIN and END attributes to the <file> and <stream> elements.
A file or stream which is embedded in another file is represented by nested
<file> elements or by a <stream> element as a child of a <file> element. Since
this gives us two different kinds of files, the file containing the other file
or stream is called the container file. The container file is usually a ZIP,
TAR or WARC file, but it could be any container format, e.g. a TIFF file
containing various images (a Multi-TIFF).
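As a rough illustration of this nesting, the following Python sketch parses a hypothetical METS fragment in which a TAR container file holds two page images. The IDs, the file path and the offset values are invented, and BETYPE, BEGIN and END on <file> are the proposed attributes, not part of the current schema.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"

# Hypothetical fragment: a TAR container file with two embedded page images.
# BETYPE/BEGIN/END on <file> are the *proposed* attributes; all values are invented.
FRAGMENT = """
<mets:fileGrp xmlns:mets="http://www.loc.gov/METS/"
              xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:file ID="CONTAINER01" MIMETYPE="application/x-tar">
    <mets:FLocat LOCTYPE="URL" xlink:href="file:///data/pages.tar"/>
    <mets:file ID="PAGE0001" MIMETYPE="image/tiff"
               BETYPE="BYTE" BEGIN="512" END="10751"/>
    <mets:file ID="PAGE0002" MIMETYPE="image/tiff"
               BETYPE="BYTE" BEGIN="11264" END="21503"/>
  </mets:file>
</mets:fileGrp>
"""


def embedded_files(fragment):
    """Return (container ID, embedded file ID, begin, end) for each nested <file>."""
    root = ET.fromstring(fragment)
    tag = "{%s}file" % METS_NS
    result = []
    for container in root.findall(tag):        # outer <file>: the container
        for child in container.findall(tag):   # nested <file>: embedded content
            if child.get("BETYPE") == "BYTE":
                result.append((container.get("ID"), child.get("ID"),
                               int(child.get("BEGIN")), int(child.get("END"))))
    return result
```

Calling embedded_files(FRAGMENT) yields one tuple per embedded page image, each carrying the ID of its container file and the byte range recorded for it.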
Storing the location of a file or stream within a container in the METS file
would allow data to be read directly from the container without loading and
parsing the whole container file.
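A minimal Python sketch of such a direct read is given here, under the assumption that BEGIN is the zero-based offset of the first byte and END the offset of the last byte (inclusive); the proposal does not fix these semantics yet. The helper names read_embedded, build_demo_tar and demo are made up for this example. The demo builds a small uncompressed TAR file, takes the offsets that would be recorded in METS, and then reads the embedded files with a plain seek, without using any TAR parser.

```python
import io
import os
import tarfile
import tempfile


def read_embedded(container_path, begin, end):
    """Read bytes begin..end (assumed zero-based, END inclusive) from the container."""
    with open(container_path, "rb") as fh:
        fh.seek(begin)
        return fh.read(end - begin + 1)


def build_demo_tar(path, members):
    """Write an uncompressed TAR and return {name: (begin, end)} byte offsets."""
    with tarfile.open(path, "w") as tar:
        for name, data in members.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    offsets = {}
    with tarfile.open(path, "r") as tar:
        for info in tar.getmembers():
            # offset_data is where the member's payload starts inside the TAR
            offsets[info.name] = (info.offset_data,
                                  info.offset_data + info.size - 1)
    return offsets


def demo():
    members = {"page1.tif": b"first page image",
               "page2.tif": b"second page image"}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "pages.tar")
        offsets = build_demo_tar(path, members)
        # These offsets are what would be recorded in BEGIN and END.
        return {name: read_embedded(path, b, e)
                for name, (b, e) in offsets.items()}
```

Note that this only returns usable data when the embedded file is stored uncompressed; in a compressed container the same byte range would yield the compressed bytes.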
The definition of a container file is very vague. Though some file formats are
designed for bundling several files into one big file, content files themselves
may be containers for certain types of bytestreams. A TIFF file may actually
contain various images or various manifestations of the same image (different
...). The container file as such may not be the main focus of interest. The
embedded files and bytestreams with their metadata are usually more important.
Recording the location of a content file within a container file using byte
offsets might make it possible to read those files even if the internal
structure of the container file is unknown. This might prove very valuable,
especially for the readability of new and still immature container formats such
as WARC.
Although the <file> and <stream> elements use the same attributes as the
<area> element within the structMap, the semantics are very different.
The <area> element points to an area within a content file which
actually manifests the <div> object. A <div> object is typically not
a file but represents a logical or physical entity such as a column or a page.
Both kinds of references into a file may even be used at the same time:
- the file set consists of page images
- all page images are bundled into one container file (e.g. a TAR or ZIP file)
- the physical structMap defines columns
- the structMap contains pointers into the page image files using the <area>
  element (using XHTML coordinates defining the columns in the image)
- the content file elements contain byte offsets defining each file's position
  within the container file
The information in the <area> element and in the <file> element uses
different points of reference.
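The scenario can be sketched in Python as follows. The METS excerpt is hypothetical: all IDs, coordinates and offsets are invented, and BETYPE, BEGIN and END on <file> are the proposed attributes. The point of the sketch is only that the byte offsets are measured from the start of the container file, while the COORDS of the <area> element are measured within the page image itself.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}

# Hypothetical METS excerpt; IDs, coordinates and offsets are invented.
DOC = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:fileSec>
    <mets:fileGrp>
      <mets:file ID="CONTAINER01">
        <mets:file ID="PAGE0001" BETYPE="BYTE" BEGIN="512" END="10751"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="page">
      <mets:div TYPE="column" ID="COL1">
        <mets:fptr>
          <mets:area FILEID="PAGE0001" SHAPE="RECT" COORDS="100,200,900,2400"/>
        </mets:fptr>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>
"""


def locate_columns(doc):
    """Return (div ID, (container ID, begin, end), pixel coords) per <area>."""
    root = ET.fromstring(doc)
    # Byte offsets: counted from the start of the container file.
    embedded = {}
    for container in root.findall("./mets:fileSec/mets:fileGrp/mets:file", NS):
        for child in container.findall("mets:file", NS):
            embedded[child.get("ID")] = (container.get("ID"),
                                         int(child.get("BEGIN")),
                                         int(child.get("END")))
    columns = []
    for div in root.iter("{http://www.loc.gov/METS/}div"):
        for area in div.findall("mets:fptr/mets:area", NS):
            # COORDS: counted in pixels within the page image itself.
            coords = tuple(int(v) for v in area.get("COORDS").split(","))
            columns.append((div.get("ID"), embedded[area.get("FILEID")], coords))
    return columns
```

A reader would first use the byte range to extract the page image from the container and only then apply the coordinates to the extracted image.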
I would appreciate any comments before I add this proposal to the METS change
request wiki page.