I think this is a tough one with lots
of complications and I don't think there is a right/wrong answer.
Even a single tool might report several
degrees of uncertainty. For example, DROID lists its identification
status as:
- Positive (specific). It matched 1 and only 1 internal signature. Note that this is not called "certain".
- Positive (generic). It matched more than 1 internal signature.
- Tentative. It matched only an extension (for a format without an internal signature).
I'm doing this from memory so I might
not have that quite right but the point is that there are more degrees
of uncertainty than "certain" or "uncertain" and you
don't need multiple tools to generate such uncertainty (although they can
then disagree as well of course!).
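To make that concrete, here is a minimal Python sketch of a status with more than two levels. The names and the mapping rule are only an illustration based on the description above (which was itself from memory); this is not DROID's actual API or output format, and the UNIDENTIFIED level is an extra assumption for the case where nothing matched at all.

```python
from enum import Enum


class IdStatus(Enum):
    """Hypothetical status levels, loosely following the list above."""
    POSITIVE_SPECIFIC = "positive (specific)"
    POSITIVE_GENERIC = "positive (generic)"
    TENTATIVE = "tentative"
    UNIDENTIFIED = "unidentified"  # assumption: not part of the list above


def classify(signature_hits: int, extension_matched: bool) -> IdStatus:
    """Map the raw match counts from a single tool to a status level."""
    if signature_hits == 1:
        return IdStatus.POSITIVE_SPECIFIC
    if signature_hits > 1:
        return IdStatus.POSITIVE_GENERIC
    if extension_matched:
        return IdStatus.TENTATIVE
    return IdStatus.UNIDENTIFIED
```

Even this toy version needs four values from one tool, which is why a plain yes/no flag would lose information.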
An additional complication is that what one
system calls a format, another might define as a family of formats, so one
system's certainty can be uncertainty in another system. Also, because
of this, what used to be "certain" could, if re-characterised
later using the same tool, become "uncertain" (e.g., if additional
formats sharing a signature now exist).
Furthermore, we can make a distinction
between identification and validation. A validation tool can update
the identification results (e.g., DROID can't tell TIFF 3/4/5/6 apart but,
now we know to try it, Jhove's TIFF module can). This, in essence,
means that 3 of the identification records are deleted and the single
one left is considered to be definitive (for now). But even validation
is not a cure-all: what happens if several validators disagree, and does
a file have to be strictly valid to be considered as having a format (e.g.,
HTML)?
I guess my point is that there is always
some degree of uncertainty (even if every available current tool agrees).
So it is not possible to state that something is "certain".
It is possible, however, to be clear that something is known to be
uncertain (and why) and it is also possible to declare an absence of known
uncertainty (and the methods attempted to find some).
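As a rough sketch of what such a declaration could look like, the record below keeps the surviving candidate formats, the methods that were tried, and any reasons for known uncertainty. The class, its field names and the example values are all hypothetical and written in Python purely for illustration; none of this comes from PREMIS or from any particular tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FormatAssessment:
    """Hypothetical record: what was found, how, and what doubt is known."""
    candidates: List[str]  # format names still in play
    methods_tried: List[str] = field(default_factory=list)
    known_uncertainty: List[str] = field(default_factory=list)  # reasons

    def summary(self) -> str:
        """State known uncertainty (and why), or the absence of any."""
        methods = ", ".join(self.methods_tried) or "none"
        if self.known_uncertainty:
            reasons = "; ".join(self.known_uncertainty)
            return f"known uncertainty ({reasons}); methods tried: {methods}"
        return f"no known uncertainty; methods tried: {methods}"


# Example values are made up for illustration.
tiff = FormatAssessment(
    candidates=["TIFF 6"],
    methods_tried=["signature identification", "TIFF validation"],
)
print(tiff.summary())
```

Note that the record never says "certain": the strongest statement it can make is that no uncertainty was found by the methods listed.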
Rob
Robert Sharpe
Head of Archiving Solutions
Tessella plc
26 The Quadrant, Abingdon Science Park, Abingdon, Oxfordshire, OX14 3YS
Registered in England
No. 1466429
T: +44 (0)1235 555511 M: +44 (0)7515 197 880
W: www.digital-preservation.com
The PREMIS Editorial Committee received a request to include
an optional "certainty attribute" in the Data Dictionary to indicate
the degree of certainty that the value provided for a particular semantic
unit is correct. Although the requester thought it might be of use
for all semantic units, the specific use case was in reference to format:
===================================================================================================
I generate PREMIS documents from FITS (http://code.google.com/p/fits/).
FITS normalises and consolidates the output from various technical
metadata extraction tools. File formats are where there is the most
difficulty.
If multiple tools agree on format for a given file, and no tools
disagree, then it would be useful to indicate in PREMIS that there is a
high degree of certainty that this file format has been correctly
identified.
If only one tool is able to identify a file format, then there is a
lower degree of certainty. Both this situation and the one above will
produce a single PREMIS format element, but they have very different
degrees of certainty.
If there is disagreement amongst the tools as to the correct format for
a file, then there will be multiple PREMIS format elements. If all tools
but one have identified one format, and one tool another format, again,
it would be helpful to retain this information.
======================================================================================================
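The three cases in the request above could be sketched roughly as follows. This is only an illustration in Python: the function names, the labels and the shape of the input are assumptions, and it is not how FITS itself consolidates tool output.

```python
from collections import Counter
from typing import Dict, List, Tuple


def tally_formats(tool_results: Dict[str, str]) -> List[Tuple[str, int]]:
    """Count how many tools reported each format, most-reported first.

    tool_results maps a tool name to the format it reported; tools that
    could not identify the file are simply left out of the mapping.
    """
    counts = Counter(tool_results.values())
    return counts.most_common()


def describe(tool_results: Dict[str, str]) -> str:
    """Label the outcome using hypothetical names for the cases above."""
    tally = tally_formats(tool_results)
    if not tally:
        return "unidentified"
    if len(tally) > 1:
        return "conflict"  # would yield multiple format elements
    # Exactly one format was reported: separate agreement from a lone tool.
    return "agreed" if tally[0][1] > 1 else "single tool"
```

Keeping the per-format counts from tally_formats, rather than only the label, would retain the "all tools but one" information the request asks for.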
After discussion among EC members and staff at their institutions, the
general concern was that there are too many ways that degrees of certainty
can be expressed. A repository could use certainty information consistently
internally, but this would then be local business information and not core
preservation metadata. For certainty information to be generally
interoperable, use of a single vocabulary for degrees of certainty would
be required, and this would be very difficult to devise. So, recognizing
the importance of certainty information about file formats but concerned
about interoperability, the EC is considering adding a certainty
element that pertains to format only, with only two values:
yes, this format is certain
no, there is uncertainty about the format
I am sending this note to see what PREMIS Implementers think of this idea.
Would this limited certainty information be useful, or would
you prefer to see more complex and/or more generally appropriate certainty
information allowed?
Please reply to the list, not to me directly. I'd love to get a discussion
going about this.
Priscilla