Apologies for taking so long to answer this very well thought-out message from Robert Sharpe. The PREMIS Editorial Committee has been discussing the issues brought up in this message from late October. Following are responses to all of those that have an answer and don't need further discussion. For those that require further discussion, we will either send out a message to the PIG list to solicit feedback or discuss them in the PREMIS EC and send out responses later.
I have retained the original numbering from the message. Gaps in the numbering correspond to questions that require further discussion.
Priscilla Caplan, Peter McKinney, Angela Dappert, Rebecca Guenther
1. Models Intellectual entities (information objects in OAIS)
These are the things that we want to preserve so it is important to model their significant properties. The Planets conceptual model does not worry about descriptive information since other schemas do a good job at this.
However, it is important to model the existence and properties of the atomic information object needed for transformation (we call these "components"), which is often a smaller unit of information than traditional structural/descriptive models normally deal with. As an example, a descriptive model might deal with a web site but we need to model each individual web page if we are to be able to verify their properties before and after a transformation.
Response: The "component" in the Planets model is not equivalent to the Intellectual Entity in PREMIS. Significant properties can (in Planets) adhere to components that can be embedded within larger physical entities, as, for example, text and image components might be embedded within a PDF file. Text and image would have different significant characteristics.
The PREMIS Editorial Committee has started work toward a future revision of PREMIS that will include semantic units describing the Intellectual Entity (in PREMIS terms). It is likely that future revisions of PREMIS will in some way accommodate the emerging Planets model of significant properties. There have already been conversations between PREMIS and Planets principals.
2. Models structural metadata.
There are important concepts here since new representations created via migration can be complex combinations of existing files and newly created files. Similarly, new information objects can also reuse files that already exist in the repository (e.g., when creating new web site snapshots). This can lead to complex structural relationships that need to be modelled by a truly comprehensive preservation information model. I believe this needs to be part of a preservation conceptual model. It is true that, physically, it is possible to hold this information in existing schemas (e.g., METS) although sometimes with a little awkwardness.
Response: PREMIS was never intended to include all information needed for all purposes. For example, clearly some descriptive metadata is needed, but there are adequate schemes already in use for this (as noted in point 1 above) and the Working Group that drafted the Data Dictionary did not feel a need to include descriptive metadata in it. Similarly there are many schemes that adequately handle structural metadata.
3. Models Transformation entities.
This can be used to control preservation planning, migration or emulation. This could be done through the current PREMIS Event entity (but I think having an explicit entity would be clearer especially in a conceptual model). The things that need to be recorded include the representation of the component being transformed and the new representation of that component plus information on the migration pathways and the verification process that took place.
Response: This question is about recording explicitly how a preservation action creates a new representation from an old one. This involves recording the relationship between the representations, the preservation action event, the agent used to perform the preservation action, and details, such as configuration parameters, significant characteristics which guided the choice of preservation action, measured differences between the source and the target (outcome information), etc.
This all fits the PREMIS model very well. The PREMIS Editorial Committee believes that the PREMIS data model needs to stay as slim as possible, while being able to capture what we need. It does not want to introduce a special type of entity for preservation actions.
However, the PREMIS Editorial Committee will consider a refined event model that captures what people want to say about events in one place. For example, an n:m migration (e.g., creating one PDF from multiple files, or creating multiple spreadsheets from one database file) is currently very cumbersome and verbose to record in PREMIS.
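To illustrate why an n:m migration gets verbose, here is a minimal Python sketch. It is not PREMIS syntax and not a proposal from the Editorial Committee: the MigrationEvent class, its field names, and the identifiers are all hypothetical. It assumes a single event record that lists its sources and outcomes once, and then expands it into the pairwise derivation links that would otherwise have to be written out one by one.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MigrationEvent:
    """Hypothetical record of one migration event (not a PREMIS schema type)."""
    event_id: str
    event_type: str
    agent_id: str
    source_ids: list[str] = field(default_factory=list)
    outcome_ids: list[str] = field(default_factory=list)


def derivation_links(event: MigrationEvent) -> list[tuple[str, str, str]]:
    """Expand one n:m event into (source, outcome, event) triples."""
    return [
        (src, out, event.event_id)
        for src in event.source_ids
        for out in event.outcome_ids
    ]


# Three source files migrated into one PDF: a 3:1 migration.
event = MigrationEvent(
    event_id="event-1",
    event_type="migration",
    agent_id="agent-converter",
    source_ids=["file-a", "file-b", "file-c"],
    outcome_ids=["file-pdf"],
)
links = derivation_links(event)
for link in links:
    print(link)
```

The number of pairwise links grows as n times m, whereas the single event record above states the sources and outcomes only once.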
1. Why is it necessary to state whether an embedded object is a FileStream or a Bitstream? Not sure why this helps since anything embedded has to be extracted by some method (and we may not know what that method is).
Response: It is not necessary to state this, but if you want to use a bitstream object you should know what a bitstream is. I think maybe the real question here is why it matters whether an object is a filestream or a bitstream. The answer is that, since a filestream can stand alone, it can actually be treated and described as a file object, while a bitstream cannot.
Here is a possible scenario. Suppose a bytestream contains a bitstream, for example an image inside a Word document. Pulling that image out of Word necessarily involves some degree of transformation of the image to turn it into a filestream that can exist by itself. If, however, we are pulling images out of an ARC file, each image is already a filestream and no transformation is needed, because it can stand by itself.
Therefore, it would help to know whether the object is a filestream or a bitstream. At present you would know that an object is a bitstream because its objectCategory value is "bitstream". You would have to infer that an object is a filestream from the fact that its objectCategory value is "file" while its contentLocationType is "byte offset" or something like that. The PREMIS Editorial Committee thought that it would probably be better to make this explicit.
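The inference described above could be sketched roughly as follows in Python. This is only an illustration: the function name classify_object is hypothetical, and the exact contentLocationType string is an assumption, since the value is given here only as "byte offset" or something like that.

```python
from typing import Optional

# Assumed spelling of the location type that signals an embedded object;
# a real repository would substitute its own controlled vocabulary.
EMBEDDED_LOCATION_TYPES = {"byte offset"}


def classify_object(object_category: str,
                    content_location_type: Optional[str] = None) -> str:
    """Return "bitstream", "filestream", "file", or the category unchanged."""
    if object_category == "bitstream":
        # Stated explicitly by objectCategory, so nothing to infer.
        return "bitstream"
    if object_category == "file":
        # A file located by an offset inside another object is inferred
        # to be a filestream; otherwise it is an ordinary file.
        if content_location_type in EMBEDDED_LOCATION_TYPES:
            return "filestream"
        return "file"
    # Representations and anything else pass through untouched.
    return object_category


print(classify_object("bitstream"))
print(classify_object("file", "byte offset"))
print(classify_object("file", "URI"))
```

The filestream case depends on two separate values agreeing, which is the fragility that an explicit category would remove.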
2. The Data Dictionary states "If all identifiers are local to repository system, it is unlikely that identifier type would need to be explicitly recorded for each identifier in the system". I agree but most Identifier Types in the schema do appear to be mandatory?
Response: Note that "mandatory" means the repository needs to know it, but how it knows it isn't in scope. It really means that it has to be recoverable by the system. The XML schema is a particular implementation of the data dictionary; the type could be generated when exchanging data in XML.
3. Along the same lines, every time you use a format identifier you need to name the registry. This is usually implied and so it is a lot of unnecessary repetition. Can this be made less verbose?
Response: An implementation could implement it as a business rule that it always uses a certain format registry, and again, it doesn't need to be explicitly named if it can be recovered later by the system.
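As a rough illustration of the two responses above, the following Python sketch stores only the values that vary per object and fills in the identifier type and format registry name from a repository-wide business rule when a record is prepared for exchange. The function to_exchange_record, the stored field names, the default values, and the sample key are all hypothetical; the nested keys borrow PREMIS semantic unit names but the dictionary layout is not the PREMIS XML schema.

```python
# Hypothetical repository-wide business rules; the values are examples only.
REPOSITORY_DEFAULTS = {
    "objectIdentifierType": "local",
    "formatRegistryName": "PRONOM",
}


def to_exchange_record(stored: dict) -> dict:
    """Build an exchange record, generating values the repository implies."""
    return {
        "objectIdentifier": {
            # Not stored per object; recovered from the business rule
            # unless the stored record overrides it.
            "objectIdentifierType": stored.get(
                "identifier_type",
                REPOSITORY_DEFAULTS["objectIdentifierType"],
            ),
            "objectIdentifierValue": stored["identifier"],
        },
        "formatRegistry": {
            "formatRegistryName": stored.get(
                "format_registry",
                REPOSITORY_DEFAULTS["formatRegistryName"],
            ),
            "formatRegistryKey": stored["format_key"],
        },
    }


# The internal record carries no identifier type and no registry name.
record = to_exchange_record({"identifier": "obj-001", "format_key": "fmt/18"})
print(record)
```

Internally nothing is repeated per object, yet the exported record still carries every value that is mandatory in the exchange format.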
5. I'm not at all clear how to use "preservation level" or what is the point of it. Can this be further explained?
Response: Preservation level is a business rule and most business rules are not in scope for PREMIS. It has to do with the intentions and capabilities of a given repository. As with other semantic units that are governed by a business rule, there may be nothing that needs to be stored for each item.
The PREMIS Editorial Committee will consider having them stand on their own, not as part of the object entity.
11. How would we record the existence of an empty folder in PREMIS? This is important in some cases (e.g., to allow DVDs to be stored and rebuilt)
Response: You could interpret a folder as a representation or as an intellectualEntity, depending on how you are planning to use it. In either case you would want to declare an objectIdentifier - not yet possible for the intellectualEntity - but the PREMIS Editorial Committee is working on that. Whether or not it is empty is structural information.
Other questions to be addressed after further discussion:
4) Modification dates
6) Significant properties
7) creatingApplication, environment, software, hardware
8) environment registries; related to 7)
12) recording whether file is valid or well-formed against its format
Rebecca S. Guenther
Senior Networking and Standards Specialist
Network Development and MARC Standards Office
Library of Congress
101 Independence Ave. SE
Washington, DC 20540-4402
(202) 707-5092 (voice)
(202) 707-0115 (FAX)