Since one of the things named graphs are supposed to help us with is
recording provenance, I thought I'd follow up my last post by sharing
some thoughts on provenance, too. I sometimes see provenance discussed
in terms of the provider of the data, e.g., the URL domain in linked
data. This is useful so far as it goes, but I am more interested in
provenance in terms of what justifies the data values given. Suppose
OCLC did release all of WorldCat as linked data. It's all very well to
know that some piece of information came from WorldCat, but frankly, the
quality of records varies tremendously within WC, from very, very good
to too minimal to be useful to totally wrong. So knowing that something
came from WC only gets you so far.
There are various kinds of provenance built into existing library
cataloging, although these methods were built for the analog era and
tend to not be machine actionable. IMO they are also less adequate the
farther you get from the original model of books with title pages. You
could think of these mechanisms as the library world's equivalent of
Wikipedia's demand for citations. They provide a way that someone else
coming along later can reproduce what the first cataloger did to come to
their conclusions--a trust but verify approach.
One of the underpinnings of cooperative cataloging is that if another
cataloger (or user) comes along and looks at the bibliographic record
you've created, you've put in enough information in a way that the
second person can tell whether or not s/he has the same item (the FRBR
identify task). One of the reasons catalogers are made so uncomfortable
by ultra brief vendor records (the infamous level 3 records in OCLC
WorldCat) is that they violate this community norm in the extreme. In
some of these vendor records, nothing is right except the identifier
(usually ISBN); the title, creator, format, and publication information
are all different from what's on the item.
When creating a bibliographic record describing a book, CD or whatever,
the cataloging rules tell you to base the description on the item in
hand. This means that the item itself is considered the most
authoritative source for the information in the body of the description.
However, there might be more than one possible source for a given piece
of information in the item. For example, there may be more than one form
of title in different places on the item (title page, cover, running
title, etc.). The approach of existing cataloging rules is to give a
hierarchy of sources (in AACR2 chief and prescribed sources and in RDA
preferred sources). In AACR2, if you take the title from the chief
source (say the title page), you don't have to say what you did and
everyone assumes that's where the title came from (as an aside, let me
say that I have come to hate implicit data like this). If you take the
main title from somewhere else, you are supposed to say so in a note
(e.g., "Cover title."). The source that you use for other basic citation
information is then assumed to be the same as that for the title or, for
some data elements, one of a list of other possible options. If you take
citation data from somewhere else, you bracket it, but the source of
data for everything outside the title is generally not noted.
Leaving aside the lack of machine-friendliness, this also doesn't work
very well for a lot of non-book media. If someone took the title from
the title page of a book, you can assume that they took the rest of
their basic descriptive info from there, too, or from looking at the
item itself (such as the number of pages). If you have a DVD video, even
if someone takes a title from the title frames (chief source), you can't
make any assumptions about where they got the rest of their data. Much
of the publication info (publisher, date, series) is usually best taken
from the disc label or container. Beyond the basic descriptive info, did
the cataloger take the soundtrack, subtitle or caption options from the
container (known to be wrong on occasion), the disc menu (also known to
be wrong occasionally) or from listening to or looking at the tracks
(only works if you recognize the language)?
Also, if a cataloger takes a DVD title from somewhere other than the
title frames, such as the disc label, it could mean one of two
significantly different things:
1) there is no title on the title frames
2) there is a title on the title frames, but the cataloger didn't look
at it due to economic, technological, or other limitations.
It would also be useful in some circumstances to record contradictory
information in combination with sources. One that comes up with DVDs
more often than you would think is the case where the packaging makes no
mention of closed captions, but if you pop the DVD in a player, it is
captioned.
Right now, a cataloger would just make the usual note:
Closed-captioned
If you could say:
Container/Packaging: no closed captions
Validated in player: closed captions
Someone who has this DVD that doesn't say anything externally about
captions would know that they probably do have a captioned DVD whereas
in the current system, they're likely to think they have a different
version. It would also be good to be able to mark which is thought to be
the true statement.
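
To make the closed-caption example concrete, here is a minimal sketch in Python of how contradictory statements could be recorded together with their sources, with one of them marked as the statement thought to be true. The class and field names (SourcedStatement, preferred, and so on) are my own illustrative assumptions, not part of any existing cataloging standard.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SourcedStatement:
    """One claim about an item, tied to where the claim was observed."""
    element: str             # what is being described, e.g. "closed captions"
    value: str               # what that source says
    source: str              # where the cataloger looked
    preferred: bool = False  # True if this is thought to be the true statement


def preferred_statement(
    statements: List[SourcedStatement], element: str
) -> Optional[SourcedStatement]:
    """Return the statement marked as preferred for an element, if any."""
    for statement in statements:
        if statement.element == element and statement.preferred:
            return statement
    return None


# The DVD from the example: the packaging and the disc disagree.
captions = [
    SourcedStatement("closed captions", "no closed captions",
                     "container/packaging"),
    SourcedStatement("closed captions", "closed captions",
                     "validated in player", preferred=True),
]

best = preferred_statement(captions, "closed captions")
```

Both statements stay in the record, so someone holding a DVD whose container says nothing about captions can still match on the packaging statement while trusting the one validated in a player.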
When OLAC was working on our initial investigation of using FRBR to
improve access to moving images (see
http://www.olacinc.org/drupal/?q=node/27, particularly part 3a), we
thought that it was best to allow for element-specific provenance
without requiring it. We were focusing on FRBR works where the
item-in-hand is not necessarily the authoritative source so provenance
is clearly important. On the other hand, recording provenance at a
granular level makes for additional work. Allowing everything to have
a value of unspecified/unknown for provenance let us have
granularity when possible while accommodating legacy data and data from
providers who choose not to provide that level of granularity.
We also played around with a value for inferred/guessed for those
situations where the evidence clearly seems to be pointing at something,
but there isn't enough solid evidence to make a strong assertion.
Clearly identifying elements of unknown or unreliable provenance makes
it easy for those who care to update the information while others can use
the information as is.
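
As a rough sketch of that idea in Python: every element carries a provenance status that defaults to unspecified, and anyone who cares can pull out the elements that are unspecified or merely inferred. The status names, the needs_review helper, and the sample data are hypothetical and only illustrate the approach; they are not taken from the OLAC work.

```python
from enum import Enum
from typing import Dict, List, Tuple


class Provenance(Enum):
    UNSPECIFIED = "unspecified/unknown"  # legacy data or provider gave none
    INFERRED = "inferred/guessed"        # evidence points this way, not solid
    STATED = "stated source"             # an explicit source was recorded


def needs_review(elements: Dict[str, Tuple[str, Provenance]]) -> List[str]:
    """List the element names whose provenance is unknown or only inferred."""
    return [
        name
        for name, (_value, provenance) in elements.items()
        if provenance is not Provenance.STATED
    ]


# Illustrative work-level data; the provenance statuses are made up.
work = {
    "title": ("Citizen Kane", Provenance.STATED),
    "original_language": ("English", Provenance.INFERRED),
    "director": ("Orson Welles", Provenance.UNSPECIFIED),
}

to_check = needs_review(work)
```

Those who don't care about provenance can ignore the status and use the values as they are.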
Right now we have provenance in bib records in the following forms that
I can think of:
* Presence or absence of 500 source of title note in conjunction with
format of item for the source of citation/transcribed parts of record
* 040 field lists institutional codes of libraries that have edited a
record at the record level (so you know the last institution that
touched a record but you don't know what they did)
What I might wish for is something more granular and
machine-actionable, such as three optional machine-comprehensible
provenance elements attached to every data element:
1) source of the data
2) the institution entering the data
3) date the data was input
Perhaps something like the following for title proper:
TitleProper: Citizen Kane
DataSource: title frame
DataInst: OrU
DataDate: 2012-01-04
Even if catalogers only did this for the source of data for title proper,
it would give us as much information as we have today, but in a
machine-friendly form (although it might be hard to come up with a long
enough list of data sources). The editing institution and date could
presumably be generated automatically.
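
Here is one way the title proper example might look as a data structure, sketched in Python. All three provenance elements are optional, so legacy data with no provenance still fits. The DataElement class and its to_lines method are assumptions made for illustration; the labels are the ones used in the example above.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class DataElement:
    """A data element with three optional provenance elements attached."""
    name: str
    value: str
    data_source: Optional[str] = None  # 1) source of the data
    data_inst: Optional[str] = None    # 2) institution entering the data
    data_date: Optional[date] = None   # 3) date the data was input

    def to_lines(self) -> List[str]:
        """Render the element, skipping provenance that was not recorded."""
        lines = [f"{self.name}: {self.value}"]
        if self.data_source is not None:
            lines.append(f"DataSource: {self.data_source}")
        if self.data_inst is not None:
            lines.append(f"DataInst: {self.data_inst}")
        if self.data_date is not None:
            lines.append(f"DataDate: {self.data_date.isoformat()}")
        return lines


# The example above, with full provenance.
title = DataElement("TitleProper", "Citizen Kane",
                    data_source="title frame",
                    data_inst="OrU",
                    data_date=date(2012, 1, 4))

# The same element as legacy data with no provenance at all.
legacy_title = DataElement("TitleProper", "Citizen Kane")
```

In practice data_inst and data_date would be filled in by the cataloging system rather than typed by the cataloger.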
RDA is making a practical move away from identifying the sources of
data even as clearly as in AACR2. In AACR2, if you include descriptive
citation data that was taken from somewhere other than your
selected/allowed sources, you bracket it. So basically AACR2 has a
binary partition into data from the chief or prescribed source and data
not from there. Except for title proper, RDA generally allows data from
other sources to be silently interpolated. This does suggest that we
could use a way of representing provenance for data elements that
contain data from more than one source.
RDA retains the notion of a "source of title" note, but in a way that
ironically undermines its usefulness. In RDA:
1. As in AACR2 (at least for books and moving images), the note is only
given when the data is taken from somewhere other than source at the top
of the hierarchy for preferred sources for a format
2. The note is optional
Given these two conditions, if there's no note, how will anyone ever
know whether the title came from the preferred source or the cataloger
simply omitted the note? IMO, it would be much better to give an
option for a positive source of title note or element across the board
and allow those who value that information to always record it
explicitly.
What about authority records? Authority records are records for things
other than bibliographic entities (people, corporate bodies, subjects)
or for some bibliographic entities (usually those other than FRBR
manifestations, which are described by bibliographic records). If
anything, provenance is even more important for authority records.
Although the related item-in-hand is taken into account when
constructing these, it is commonly necessary to consult external
information. If we start creating separate records for FRBR works and
expressions, these will be more like authority records in that they
can't necessarily rely on an item and will often have to justify their
data with citations for external sources.
Like provenance in bibliographic records, provenance in current
authority records is largely recorded in free text (albeit structured)
notes.
The most common one is 670 (source data found), which includes a
citation plus usually the data found and the location where the data was
found within the cited material.
670 $a Its Guide to manuscripts in the Bentley Historical Library,
1976: $b t.p. (Bentley Historical Library, Michigan Historical
Collections, Univ. of Mich.)
For internet resources, the date when the site was looked at is
included:
670 $a Internet Movie Database, Feb. 6, 2003: $b (b. 30 July 1961 in
Augusta, Ga.; sometimes credited as: Laurence Fishburne III, Lawrence
Fishburne III; changed his name from Larry to Laurence in his films in
1991)
This is important for things like birth dates in IMDb, which can be
something of a moving target.
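
To show how far "free text (albeit structured)" gets you, here is a small Python sketch that splits the display form of a field like those above into its tag and subfields. It assumes the textual "$a ... $b ..." convention used in this post, not real MARC encoding, and it assumes a dollar sign never occurs inside the data. The parse_field name is hypothetical.

```python
import re
from typing import List, Tuple

# A subfield is "$" plus a one-character code, then text up to the next "$".
SUBFIELD = re.compile(r"\$(\w)\s*([^$]*)")


def parse_field(line: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Split a displayed field into its tag and (code, value) subfield pairs."""
    tag, _, rest = line.strip().partition(" ")
    subfields = [(code, value.strip()) for code, value in SUBFIELD.findall(rest)]
    return tag, subfields


tag, subfields = parse_field(
    "670 $a Internet Movie Database, Feb. 6, 2003: "
    "$b (b. 30 July 1961 in Augusta, Ga.)"
)
```

A machine can find the citation ($a) and the information found ($b) this way, but the date consulted and the birth date are still buried in free text inside those subfields, which is the limit of this kind of structure.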
There is a parallel field 675 for sources where data was not found.
This is generally only used for prominent sources where a reasonable
person would be expected to look. Standardized abbreviations for
well-known sources are often used. For a composer, you might have:
675 $a New Grove; $a Thompson, 10th ed.
There is now also a $v in some authority record fields for a source of
information for a specific data element. I think this is supposed to
contain a citation, although I wasn't able to find an example in the
documentation. This is more granular, but it's not clear to me if it's
more machine-interpretable.
In summary, I think we do need better provenance information and that
it should be
* more granular
* machine-interpretable
* optional
* capable of recording alternate viewpoints and reconciling these
viewpoints by identifying preferred data
* capable of recording a history of edits (which I didn't talk about
above but which I think would be useful)
Kelley
Kelley McGrath
University of Oregon