I think this is a tough one with lots of complications and I don't think 
there is a right/wrong answer.

Even a single tool might report several degrees of uncertainty.  For 
example, DROID lists its identification status as:
- Positive (specific).  It matched 1 and only 1 internal signature.  Note 
that this is not called "certain".
- Positive (generic).  It matched more than 1 internal signature.
- Tentative.  It matched only an extension (for a format without an 
internal signature).
I'm doing this from memory so I might not have that quite right, but the 
point is that there are more degrees of uncertainty than "certain" or 
"uncertain", and you don't need multiple tools to generate such uncertainty 
(although multiple tools can, of course, then disagree as well).
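
To make the grading concrete, here is a minimal sketch of how one tool's result could be mapped onto those statuses.  The names (IdStatus, classify) are hypothetical and this is not DROID's actual API; the UNIDENTIFIED value is my own addition for completeness and is not in the list above.

```python
from enum import Enum


class IdStatus(Enum):
    """Hypothetical labels modelled on the statuses recalled above."""
    POSITIVE_SPECIFIC = "positive (specific)"
    POSITIVE_GENERIC = "positive (generic)"
    TENTATIVE = "tentative"
    UNIDENTIFIED = "unidentified"  # assumption: added so every case has a label


def classify(signature_hits, extension_hits):
    """Grade one tool's result.

    signature_hits: formats matched by internal signature
    extension_hits: formats matched by file extension only
    """
    if len(signature_hits) == 1:
        return IdStatus.POSITIVE_SPECIFIC
    if len(signature_hits) > 1:
        return IdStatus.POSITIVE_GENERIC
    if extension_hits:
        return IdStatus.TENTATIVE
    return IdStatus.UNIDENTIFIED
```

Even in this toy form there are already four possible outcomes from a single tool, none of which is "certain".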

An additional complication is that what one system calls a format, another 
might define as a family of formats, so one system's certainty can be 
uncertainty in another system.  Also, because of this, what used to be 
"certain" could, if re-characterised later using the same tool, become 
"uncertain" (e.g., if additional formats sharing a signature now exist).

Furthermore, we can make a distinction between identification and 
validation.  A validation tool can update the identification results 
(e.g., DROID can't tell TIFF 3/4/5/6 apart but, now that we know to try it, 
JHOVE's TIFF module can).  This, in essence, means that 3 of the 
identification records are deleted and the single one left is 
considered to be definitive (for now).  But even validation is not a 
cure-all: what happens if several validators disagree?  And does a file 
have to be strictly valid to be considered as having a format (e.g., HTML)?
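
As a rough sketch of validation narrowing down identification results: the function below assumes a hypothetical validate(path, fmt) callable that returns True, False, or None (cannot judge).  It does not call JHOVE or any real tool.

```python
def refine_with_validation(path, candidates, validate):
    """Narrow identification candidates using a validator.

    validate(path, fmt) is a hypothetical callable returning
    True (valid), False (invalid) or None (validator cannot judge).
    """
    verdicts = {fmt: validate(path, fmt) for fmt in candidates}
    confirmed = [fmt for fmt, ok in verdicts.items() if ok is True]
    if confirmed:
        # At least one candidate was positively validated: keep only those.
        return confirmed
    # Nothing was confirmed: drop only the candidates that were rejected.
    return [fmt for fmt, ok in verdicts.items() if ok is not False]
```

Note that this can still return several formats, or none at all if every candidate is rejected, so it does not settle the "strictly valid" question.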

I guess my point is that there is always some degree of uncertainty (even 
if every currently available tool agrees), so it is not possible to state 
that something is "certain".  It is possible, however, to be clear that 
something is known to be uncertain (and why), and it is also possible to 
declare an absence of known uncertainty (and list the methods attempted to 
find some).
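
One way to picture such a record, purely as an illustration: the class and field names below are hypothetical and are not PREMIS semantic units.

```python
from dataclasses import dataclass, field


@dataclass
class FormatAssertion:
    """Hypothetical record: a format plus what is known about its uncertainty."""
    format_name: str
    known_uncertainties: list = field(default_factory=list)  # reasons for doubt
    methods_attempted: list = field(default_factory=list)    # tools or checks run

    def summary(self):
        if self.known_uncertainties:
            return "known uncertainty: " + "; ".join(self.known_uncertainties)
        # Never "certain": only an absence of known uncertainty, plus how we looked.
        return "no known uncertainty after: " + ", ".join(self.methods_attempted)
```

The point of the second branch is that it records the methods tried rather than asserting certainty.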

Rob
Robert Sharpe
Head of Archiving Solutions
Tessella plc
26 The Quadrant, Abingdon Science Park, Abingdon, Oxfordshire, OX14 3YS  
Registered in England No. 1466429
T: +44 (0)1235 555511       M: +44 (0)7515 197 880       E: 
[log in to unmask]
W: www.digital-preservation.com





Priscilla Caplan <[log in to unmask]>
Sent by: PREMIS Implementors Group Forum <[log in to unmask]>
25/02/2011 14:34
Please respond to: PREMIS Implementors Group Forum <[log in to unmask]>
To: [log in to unmask]
Subject: [PIG] certainty information in PREMIS






The PREMIS Editorial Committee received a request to include an optional 
"certainty attribute" in the Data Dictionary to indicate the degree of 
certainty that the value provided for a particular semantic unit is 
correct.  Although the requester thought it might be of use for all 
semantic units, the specific use case was in reference to format:

 ===================================================================================================
I generate PREMIS documents from FITS (http://code.google.com/p/fits/). 
FITS normalises and consolidates the output from various technical
metadata extraction tools. File formats are where there is the most
difficulty.

If multiple tools agree on format for a given file, and no tools
disagree, then it would be useful to indicate in PREMIS that there is a
high degree of certainty that this file format has been correctly
identified.

If only one tool is able to identify a file format, then there is a
lower degree of certainty. Both this situation and the one above will
produce a single PREMIS format element, but they have very different
degrees of certainty.

If there is disagreement amongst the tools as to the correct format for
a file, then there will be multiple PREMIS format elements. If all tools
but one have identified one format, and one tool another format, again,
it would be helpful to retain this information.

 =====================================================================================================
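
The three situations in the request can be sketched as a simple tally of tool answers.  This is only an illustration: summarise_votes and its labels are hypothetical, and it does not reflect how FITS itself consolidates tool output.

```python
from collections import Counter


def summarise_votes(tool_results):
    """Classify agreement among tools.

    tool_results: mapping of tool name -> format name, or None if the
    tool could not identify the file.
    Returns a (label, votes) pair, where votes counts tools per format.
    """
    votes = Counter(fmt for fmt in tool_results.values() if fmt is not None)
    if not votes:
        return "unidentified", votes
    if len(votes) > 1:
        # Tools disagree: keep the full tally so the minority view is retained.
        return "conflict", votes
    (fmt, count), = votes.items()
    return ("agreed" if count > 1 else "single tool"), votes
```

Keeping the tally alongside the label preserves the "all tools but one" information that the request asks to retain.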

After discussion among EC members and staff at their institutions, the 
general concern was that there are too many ways that degrees of certainty 
can be expressed.  A repository could use certainty information 
consistently internally, but this would then be local business information 
and not core preservation metadata.  For certainty information to be 
generally interoperable, use of a single vocabulary for degrees of 
certainty would be required, and this would be very difficult to devise. 
So, recognizing the importance of certainty information about file formats 
but concerned about interoperability, the EC is considering adding 
a certainty element that pertains to format only, with only two values:

yes, this format is certain 
no, there is uncertainty about the format

I am sending this note to see what PREMIS Implementers think of this 
idea.  Would this limited certainty information be useful, or would you 
prefer to see more complex and/or more generally appropriate certainty 
information allowed?

Please reply to the list, not to me directly.  I'd love to get a 
discussion going about this.

Priscilla