At Wed, 30 Sep 2009 10:46:14 +0100,
Enders, Markus wrote:
>
> Hi Erik,
>
> Thanks for your thoughts. You are right, not every container format can
> be used for pointing into it. However we have to differentiate between
> pointing into a container format to extract a bytestream and decoding
> the bytestream as it might be compressed, encoded etc.
>
> For each <file> element METS allows us to store a <transformFile>
> element already. It may store information about the algorithm or the
> order in which files need to be transformed. Substrings which have been
> obtained from a file (using byte offsets) should be decompressible
> using this information.
> What is still missing in the METS schema is the possibility to point to
> those substrings. Using byte offsets is just one possibility. For other
> kinds of files, such as XML files, an IDREF-based mechanism may be
> suitable as well (very similar to what the <area> element already
> allows).
>
> Our current use case would be referencing files within WARC containers.
>
> I am not sure we may find a solution for addressing single image files or
> other bytestreams within a PDF other than the byte offset. Using a
> BETYPE attribute might actually give us the option to extend the
> mechanism later on.

Hi Markus -

This makes sense; I have not used METS in a while and was not familiar with
transformFile.

In this document:
<http://www.loc.gov/standards/mets/METS%20Documentation%20final%20070930%20msw.pdf>
transformFile is used to extract documents from a tar.gz file. Could
transformFile be used to extract files from a WARC in a similar way?
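
To make the distinction concrete, here is a minimal Python sketch of the
two steps discussed above: pointing into a container by byte offset, then
decoding the extracted bytestream with an ordered list of transforms. It
is only an illustration under stated assumptions: the helper names
(extract_bytestream, apply_transforms), the toy two-member container, and
the "gzip" algorithm label are made up for this example and are not part
of METS or of any WARC tooling.

```python
import gzip
import io


def extract_bytestream(container, begin, end):
    """Pointing step: return the bytes in [begin, end) of the container."""
    container.seek(begin)
    return container.read(end - begin)


def apply_transforms(data, transforms):
    """Decoding step: apply each named transform in the given order.

    The list of names loosely stands in for the ordered transform
    information a file description could carry (hypothetical mapping).
    """
    for algorithm in transforms:
        if algorithm == "gzip":
            data = gzip.decompress(data)
        else:
            raise ValueError("unsupported transform: " + algorithm)
    return data


# Toy container: two independently gzip-compressed members, concatenated.
first = gzip.compress(b"first record\n")
second = gzip.compress(b"second record\n")
container = io.BytesIO(first + second)

# Byte offsets of the second member within the container.
begin = len(first)
end = begin + len(second)

raw = extract_bytestream(container, begin, end)
record = apply_transforms(raw, ["gzip"])
print(record)
```

The offsets only locate the compressed bytes; the transform list is what
turns them back into the original content, which is why the two pieces of
information can be recorded separately.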
best,
Erik Hetzner
PS: The METS schema documentation for transformFile appears to be
broken:
<http://www.loc.gov/standards/mets/docs/mets.v1-8.html#transformFile>
It does not display the attribute restrictions.
;; Erik Hetzner, California Digital Library
;; gnupg key id: 1024D/01DB07E3