Monday, May 18, 2015 10:43 PM, Tim Thompson wrote:
> Now that BIBFRAME production projects are on the horizon, it seems like a
> good time to revisit the issue of provenance and how to track it. I started
> reading a thread on the topic from a few years ago, but was wondering what
> new insights or best practices had emerged since then. The current
> marc2bibframe transformation outputs a single bf:Annotation with
> provenance/revision information, linked only to the bf:Work entity, probably
> only for demonstration purposes, or to keep the results more (human)
> <http://bibframe.org/resources/XYm1431979154/9057891annotation21> a
> bf:Annotation ;
> bf:annotates <http://bibframe.org/resources/XYm1431979154/9057891> ;
> bf:changeDate "2015-05-07T16:06" ;
> bf:derivedFrom <http://bibframe.org/resources/XYm1431979154/9057891.marcxml.xml> ;
> bf:descriptionConventions <http://id.loc.gov/vocabulary/descriptionConventions/rda> ;
> bf:descriptionLanguage <http://id.loc.gov/vocabulary/languages/eng> ;
> bf:descriptionSource <http://id.loc.gov/vocabulary/organizations/njp> ;
> bf:generationProcess "DLC transform-tool:2015-01-16-T11:00:00" .
I have a problem with the provenance statement above. In my opinion, this kind of provenance only makes sense when the data is actually stored in the BIBFRAME format. Depending on the system you envision for storing your data, that may or may not be the case.
In many cases, systems use an internal data format (IDF), and the data is converted on demand to the requested exchange format, be it MARC 21, MARCXML, BIBFRAME, some-other-kind-of-RDF-e.g.-RDA, ... In that case I'd have a provenance chain:
<http://example.org/bibframe/123456> bf:derivedFrom <http://example.org/internal/123456> .
<http://example.org/internal/123456> bf:derivedFrom <http://example.org/marcxml/123456> .
Would it, in your opinion, make sense to publish the complete provenance chain?
(As an aside, to me the same discussion applies to the new MARC 21 field 884.)
> At production scale, however, it seems the kind of provenance data we record
> should be more robust and systematic. So, what should the scale and focus
> of our provenance data be? Named graphs? Individual triples?
That depends on how your data is handled. You always give provenance for a graph. That graph might be a single triple, or the complete graphification of a record. Since you always need a handle for the provenance (that is, the resource the provenance applies to), you could get one by creating a named graph or, if it's only for a single triple, with (ugly) reification. Both methods have their pros and cons.
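To make the two options concrete, here is a rough sketch in TriG (Turtle plus named graphs). The graph name, the statement URI, the title value, and the use of bf:derivedFrom on a graph or a reified statement are all assumptions made up for illustration; none of it comes from an actual system or from the transform output.

```trig
@prefix bf:  <http://bibframe.org/vocab/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

# Option 1: a named graph (hypothetical URI) is the handle for the
# provenance of every triple it contains.
<http://example.org/graph/123456> {
    <http://example.org/bibframe/123456> a bf:Work ;
        bf:title "An example title" .
}

# Provenance about the graph as a whole, stated in the default graph.
<http://example.org/graph/123456>
    bf:derivedFrom <http://example.org/internal/123456> .

# Option 2: reification gives a handle for one single triple,
# at the cost of four extra triples per statement described.
<http://example.org/statement/1> a rdf:Statement ;
    rdf:subject   <http://example.org/bibframe/123456> ;
    rdf:predicate bf:title ;
    rdf:object    "An example title" ;
    bf:derivedFrom <http://example.org/internal/123456> .
```

The named graph scales to a whole record with one handle; reification only pays off when individual statements really do have different provenance.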
> Do we need to
> repeat the same provenance data for every statement in a description set (e.g.,
> bf:descriptionLanguage, bf:descriptionSource), given that some statements are
> more "trivial" than others? What's the most appropriate structure/vocabulary
> for making assertions about provenance? It seems that Annotations might not
> be the most appropriate, for example (the Open Annotation spec stresses the
> importance of provenance for Annotations themselves). Is PROV-O a
> good fit, or do we need something more domain specific?
We shouldn't reinvent the wheel. I'd say that PROV is the way to go.
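As a sketch only, the provenance chain from my example above could be expressed with PROV-O roughly as follows (plain Turtle, which is also valid TriG). The activity and agent URIs and the timestamp are placeholders I made up; only the three record URIs come from the chain above.

```trig
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# The published BIBFRAME description, derived from the internal record.
<http://example.org/bibframe/123456> a prov:Entity ;
    prov:wasDerivedFrom <http://example.org/internal/123456> ;
    prov:wasGeneratedBy <http://example.org/activity/export-123456> .

# The internal record, itself derived from the original MARCXML.
<http://example.org/internal/123456> a prov:Entity ;
    prov:wasDerivedFrom <http://example.org/marcxml/123456> .

<http://example.org/marcxml/123456> a prov:Entity .

# Hypothetical conversion activity and the software that ran it.
<http://example.org/activity/export-123456> a prov:Activity ;
    prov:used <http://example.org/internal/123456> ;
    prov:wasAssociatedWith <http://example.org/agent/transform-tool> ;
    prov:endedAtTime "2015-05-07T16:06:00"^^xsd:dateTime .

<http://example.org/agent/transform-tool> a prov:SoftwareAgent .
```

Whether the intermediate internal record should be exposed at all is the same question I asked above; PROV can model either the full chain or a shortened one.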