At Mon, 28 Sep 2009 17:47:06 +0100,
Enders, Markus wrote:
>
> Hi everybody,
>
> the METS board were discussing the proposed extension to the METS
> schema regarding location information of a file or bytestream within
> a file. This issue had been raised on the mailing list in July but
> did not receive a lot of attention. Therefore I would like to bring
> this up again before actually adding a change request for the new
> version of the schema:
>
> Proposed change:
> Adding BETYPE, BEGIN and END attributes to the <file> and <stream>
> elements.
>
> […]
Hi all -

While the ARC and WARC formats have been explicitly designed to allow
extraction of substrings of the file without the need to read the
entire file, this is not generally feasible for an arbitrary file format. It
seems that it should be possible with tar and zip files to use byte
offsets to extract only some part of the file, though I have never
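seen this done. It would, however, be impossible to do this with
compressed TAR files, since byte offsets into the uncompressed archive
do not correspond to offsets in the compressed stream.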
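As a rough illustration of the tar case, here is a minimal Python sketch using the standard tarfile module. The helper names (build_tar, member_byte_range, read_range) and the member names are made up for this example; it only shows that a byte range can be read back from an uncompressed tar, and that the same offset is meaningless once the archive is gzipped.

```python
import gzip
import io
import tarfile


def build_tar(members):
    """Return the bytes of an uncompressed tar archive for {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def member_byte_range(tar_bytes, name):
    """Find (offset, length) of a member's payload; this step needs tar knowledge."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        info = tar.getmember(name)
        return info.offset_data, info.size


def read_range(raw, begin, length):
    """Read a byte range with no knowledge of the container format."""
    stream = io.BytesIO(raw)
    stream.seek(begin)
    return stream.read(length)


archive = build_tar({"a.txt": b"first file\n", "b.txt": b"second file\n"})
begin, length = member_byte_range(archive, "b.txt")
payload = read_range(archive, begin, length)  # the bytes of b.txt

# After gzip compression the recorded offset no longer points at anything
# useful: in this small example the compressed file is shorter than the offset.
compressed = gzip.compress(archive)
offset_is_past_end = len(compressed) < begin
```

Whoever records the offset in the first place still has to understand tar; only the later reader can get by with a plain seek and read.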

Furthermore, it is not possible, in the ARC, WARC, TAR, or ZIP file
formats, to be given a substring of bytes from that file and
understand them without understanding, respectively, the ARC, WARC,
TAR or ZIP file formats. This means that in most cases applications
processing these METS files will need to have knowledge of the
container format, which seems to negate the apparent simplicity of a
byte-offset-based pointer scheme.
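A small Python sketch of this point for ZIP, assuming a deflate-compressed member: the bytes found at the member's offset are not the file's content, and decoding them requires knowing both the local header layout and the compression method. The member name and sample text are invented for the example.

```python
import io
import struct
import zipfile
import zlib

text = b"METS points into a container. " * 20

# Build a one-member, deflate-compressed ZIP archive in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("doc.txt", text)
raw_zip = buf.getvalue()

with zipfile.ZipFile(io.BytesIO(raw_zip)) as zf:
    info = zf.getinfo("doc.txt")

# The local file header is 30 fixed bytes followed by the file name and the
# extra field; their lengths are stored at bytes 26..29 of the header.
header = info.header_offset
name_len, extra_len = struct.unpack("<HH", raw_zip[header + 26 : header + 30])
begin = header + 30 + name_len + extra_len
end = begin + info.compress_size

# The byte substring alone is a raw deflate stream, not the document text.
substring = raw_zip[begin:end]
substring_is_plain_text = substring == text

# Only with knowledge of the ZIP format (raw deflate, no zlib header) can
# the substring be turned back into the original content.
decoded = zlib.decompress(substring, -15)
```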

Perhaps there is a more general design which would allow METS to
point into container formats without tying METS (necessarily) to using
substrings of bytes from the files? This would allow a file to, e.g.,
point to a particular file in a TAR.GZ archive, or to a particular
page in a PDF, or to a particular URL in a WARC archive, in a
container-format-specific way. Forgive me if this mechanism already exists; it
has been some time since I worked with METS.
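To make the idea concrete, here is a purely hypothetical Python sketch of a container-format-specific pointer. Nothing in it is part of METS: ContainerPointer, RESOLVERS, the type labels "TAR.GZ" and "ZIP", and the member path are all assumptions made up for illustration.

```python
import io
import tarfile
import zipfile
from dataclasses import dataclass


@dataclass(frozen=True)
class ContainerPointer:
    """Hypothetical pointer: a container type plus a format-specific locator."""
    container_type: str  # e.g. "TAR.GZ" or "ZIP" (labels invented here)
    locator: str         # e.g. a member path inside the archive


def _from_targz(raw, locator):
    # Resolve a member path inside a gzip-compressed tar archive.
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r:gz") as tar:
        return tar.extractfile(locator).read()


def _from_zip(raw, locator):
    # Resolve a member path inside a ZIP archive.
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        return zf.read(locator)


# One resolver per container format; further formats would be added here.
RESOLVERS = {"TAR.GZ": _from_targz, "ZIP": _from_zip}


def resolve(raw, pointer):
    """Dispatch on the container type and return the referenced bytes."""
    return RESOLVERS[pointer.container_type](raw, pointer.locator)


# Usage: build a small TAR.GZ in memory and point at one member by name.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    data = b"page image bytes"
    info = tarfile.TarInfo("pages/0001.tif")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

resolved = resolve(buf.getvalue(), ContainerPointer("TAR.GZ", "pages/0001.tif"))
```

The pointer carries no byte offsets, so it keeps working for compressed archives, at the cost of every consumer needing a resolver for each container type.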
best,
Erik Hetzner
;; Erik Hetzner, California Digital Library
;; gnupg key id: 1024D/01DB07E3