Hi Erik,
Thanks for your thoughts. You are right, not every container format can
be used for pointing into it. However, we have to distinguish between
pointing into a container format to extract a bytestream and decoding
that bytestream, which might be compressed, encoded, etc.
For each <file> element METS already allows us to store a <transformFile>
element. It may store information about the algorithm and the order in
which transformations need to be applied. Substrings which have been
obtained from a file (using byte offsets) should be decompressible
using this information.
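
A minimal sketch of that two-step process (first extract a byte range,
then decode it), written in Python. Everything in it is an assumption
made for illustration: the function names are hypothetical, the end
offset is treated as exclusive, and the container is simply two
concatenated gzip members standing in for a compressed WARC file. It is
not a METS or WARC implementation.

```python
import gzip
import os
import tempfile


def read_byte_range(path, begin, end):
    """Return the bytes in [begin, end) of the file at path.

    Assumption: end is exclusive. The proposal does not fix this detail.
    """
    if begin < 0 or end < begin:
        raise ValueError("invalid byte range")
    with open(path, "rb") as handle:
        handle.seek(begin)
        return handle.read(end - begin)


def extract_and_decode(path, begin, end, decode=None):
    """Step 1: extract the substring. Step 2: apply the decoding, if any.

    decode stands in for whatever <transformFile> describes, here a
    plain Python callable such as gzip.decompress.
    """
    raw = read_byte_range(path, begin, end)
    return decode(raw) if decode is not None else raw


# Build a toy container: two independently compressed records, one after
# the other, so each record can be decoded from its own byte range.
first = gzip.compress(b"record one")
second = gzip.compress(b"record two")

fd, container_path = tempfile.mkstemp(suffix=".bin")
with os.fdopen(fd, "wb") as out:
    out.write(first + second)

begin = len(first)
end = len(first) + len(second)
try:
    payload = extract_and_decode(container_path, begin, end, gzip.decompress)
finally:
    os.remove(container_path)

print(payload)
```

The extraction step needs only the offsets; only the decoding step
needs to know the algorithm, which is the separation described above.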
What is still missing in the METS schema is the possibility to point to
those substrings. Using byte offsets is just one possibility. For other
kinds of files, such as XML files, an IDREF-based mechanism may be
suitable as well (very similar to what the <area> element already
allows).
Our current use case would be referencing files within WARC containers.
I am not sure we can find a way of addressing single image files or
other bytestreams within a PDF other than the byte offset. Using a
BETYPE attribute might actually give us the option to extend the
mechanism later on.
Ciao
Markus
-----Original Message-----
From: Metadata Encoding and Transmission Standard [mailto:[log in to unmask]]
On Behalf Of Erik Hetzner
Sent: 29 September 2009 19:50
To: [log in to unmask]
Subject: Re: [METS] storing byte offsets for files and streams
At Mon, 28 Sep 2009 17:47:06 +0100,
Enders, Markus wrote:
>
> Hi everybody,
>
> the METS board were discussing the proposed extension to the METS
> schema regarding location information of a file or bytestream within a
> file. This issue had been raised on the mailing list in July but did
> not receive a lot of attention. Therefore I would like to bring this
> up again before actually adding a change request for the new version
> of the schema:
>
> Proposed change:
> Adding BETYPE, BEGIN and END attributes to the <file> and <stream>
> elements.
>
> [...]
Hi all -
While the ARC and WARC formats have been explicitly designed to allow
extraction of substrings of the file without the need to read the
entire file, this is not generally feasible with a given file format. It
seems that it should be possible with tar and zip files to use byte
offsets to extract only some part of the file, though I have never seen
this done. It would also be impossible to do this with compressed TAR
files.
Furthermore, it is not possible, in the ARC, WARC, TAR, or ZIP file
formats, to be given a substring of bytes from that file and understand
them without understanding, respectively, the ARC, WARC, TAR or ZIP file
formats. This means that in most cases applications processing these
METS files will need to have knowledge of the container format, which
seems to make moot the seeming simplicity of a byte-offset based pointer
scheme.
Perhaps there is a more general design which would allow for METS to
point into container formats without tying METS (necessarily) to using
substrings of bytes from the files? This would allow a file to, e.g.,
point to a particular file in a TAR.GZ archive, or to a particular page
in a PDF, or to a particular URL in a WARC archive, in a
container-format-specific way. Forgive me if this mechanism already exists; it has been
some time since I worked with METS.
best,
Erik Hetzner