 Since one of the things named graphs are supposed to help us with is 
 recording provenance, I thought I'd follow up my last post by sharing 
 some thoughts on provenance, too. I sometimes see provenance discussed 
 in terms of the provider of the data, e.g., the URL domain in linked 
 data. This is useful so far as it goes, but I am more interested in 
 provenance in terms of what justifies the data values given. Suppose 
 OCLC did release all of WorldCat as linked data. It's all very well to 
 know that some piece of information came from WorldCat, but frankly, the 
 quality of records varies tremendously within WC, from very, very good 
 to too minimal to be useful to totally wrong. So knowing that something 
 came from WC only gets you so far.

 There are various kinds of provenance built into existing library 
 cataloging, although these methods were built for the analog era and 
 tend to not be machine actionable. IMO they are also less adequate the 
 farther you get from the original model of books with title pages. You 
 could think of these mechanisms as the library world's equivalent of 
 Wikipedia's demand for citations. They provide a way that someone else 
 coming along later can reproduce what the first cataloger did to come to 
 their conclusions--a trust but verify approach.

 One of the underpinnings of cooperative cataloging is that if another 
 cataloger (or user) comes along and looks at the bibliographic record 
 you've created, you've put in enough information in a way that the 
 second person can tell whether or not s/he has the same item (the FRBR 
 identify task). One of the reasons catalogers are made so uncomfortable 
 by ultra brief vendor records (the infamous level 3 records in OCLC 
 WorldCat) is that they violate this community norm in the extreme. In 
 some of these vendor records, nothing is right except the identifier 
 (usually ISBN); the title, creator, format, and publication information 
 are all different from what's on the item.

 When creating a bibliographic record describing a book, CD or whatever, 
 the cataloging rules tell you to base the description on the item in 
 hand. This means that the item itself is considered the most 
 authoritative source for the information in the body of the description. 
 However, there might be more than one possible source for a given piece 
 of information in the item. For example, there may be more than one form 
 of title in different places on the item (title page, cover, running 
 title, etc.). The approach of existing cataloging rules is to give a 
 hierarchy of sources (in AACR2 chief and prescribed sources and in RDA 
 preferred sources). In AACR2, if you take the title from the chief 
 source (say the title page), you don't have to say what you did and 
 everyone assumes that's where the title came from (as an aside, let me 
 say that I have come to hate implicit data like this). If you take the 
 main title from somewhere else, you are supposed to say so in a note 
 (e.g., "Cover title."). The source that you use for other basic citation 
 information is then assumed to be the same as that for the title or, for 
 some data elements, one of a list of other possible options. If you take 
 citation data from somewhere else, you bracket it, but the source of 
 data for everything outside the title is generally not noted.

 Leaving aside the lack of machine-friendliness, this also doesn't work 
 very well for a lot of non-book media. If someone took the title from 
 the title page of a book, you can assume that they took the rest of 
 their basic descriptive info from there, too, or from looking at the 
 item itself (such as the number of pages). If you have a DVD video, even 
 if someone takes a title from the title frames (chief source), you can't 
 make any assumptions about where they got the rest of their data. Much 
 of the publication info (publisher, date, series) is usually best taken 
 from the disc label or container. Beyond the basic descriptive info, did 
 the cataloger take the soundtrack, subtitle or caption options from the 
 container (known to be wrong on occasion), the disc menu (also known to 
 be wrong occasionally) or from listening to or looking at the tracks 
 (which only works if you recognize the language)?

 Also, if a cataloger takes a DVD title from somewhere other than the 
 title frames, such as the disc label, it could mean one of two 
 significantly different things:

 1) there is no title on the title frames

 2) there is a title on the title frames, but the cataloger didn't look 
 at it due to economic, technological, or other limitations.

 It would also be useful in some circumstances to record contradictory 
 information in combination with sources. One that comes up with DVDs 
 more often than you would think is the case where the packaging makes no 
 mention of closed captions, but if you pop the DVD in a player, it is 
 captioned.

 Right now, a cataloger would just make the usual note:
   Closed-captioned

 If you could say:
   Container/Packaging: no closed captions
   Validated in player: closed captions

 Someone who has this DVD that doesn't say anything externally about 
 captions would know that they probably do have a captioned DVD whereas 
 in the current system, they're likely to think they have a different 
 version. It would also be good to be able to mark which is thought to be 
 the true statement.
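
 As a rough sketch of how contradictory statements could be recorded 
 side by side with their sources, with one marked as preferred (Python; 
 the class, field and function names are made up for illustration and 
 are not part of any cataloging standard):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Assertion:
    """One claim about an element, tied to where the claim came from."""
    element: str             # hypothetical label, e.g. "closed captions"
    value: str               # what the source says, e.g. "yes" or "no"
    source: str              # e.g. "container/packaging"
    preferred: bool = False  # True if thought to be the true statement


def preferred_value(assertions: List[Assertion], element: str) -> Optional[str]:
    """Return the value marked as preferred for an element, if any."""
    for a in assertions:
        if a.element == element and a.preferred:
            return a.value
    return None


def conflicting_sources(assertions: List[Assertion], element: str) -> List[str]:
    """Return the sources whose value disagrees with the preferred one."""
    best = preferred_value(assertions, element)
    if best is None:
        return []
    return [a.source for a in assertions
            if a.element == element and a.value != best]


# The DVD example: the packaging and the player disagree.
dvd = [
    Assertion("closed captions", "no", "container/packaging"),
    Assertion("closed captions", "yes", "validated in player", preferred=True),
]
```

 With this, preferred_value(dvd, "closed captions") gives the value 
 checked in the player, and conflicting_sources(...) tells someone 
 holding the item that the packaging is the place where it disagrees.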

 When OLAC was working on our initial investigation of using FRBR to 
 improve access to moving images (see 
 http://www.olacinc.org/drupal/?q=node/27, particularly part 3a), we 
 thought that it was best to allow for element-specific provenance 
 without requiring it. We were focusing on FRBR works where the 
 item-in-hand is not necessarily the authoritative source so provenance 
 is clearly important. On the other hand, recording provenance at a 
 granular level makes for additional work. Allowing everything to have 
 a value of unspecified/unknown for provenance gave us granularity when 
 possible while still accommodating legacy data and data from providers 
 who choose not to provide that level of granularity.

 We also played around with a value for inferred/guessed for those 
 situations where the evidence clearly seems to be pointing at something, 
 but there isn't enough solid evidence to make a strong assertion.

 Clearly identifying elements of unknown or unreliable provenance makes 
 it easy for those who care to update the information, while others can 
 use the information as is.
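
 A minimal sketch of that idea, assuming each element carries one of 
 three provenance values (Python; the "stated" value, the element names 
 and the sample record are hypothetical, not taken from the OLAC 
 report):

```python
from enum import Enum


class Provenance(Enum):
    UNSPECIFIED = "unspecified/unknown"  # legacy data or no source given
    INFERRED = "inferred/guessed"        # evidence points here, not solid
    STATED = "stated"                    # hypothetical label: backed by a source


def needs_review(record: dict) -> list:
    """List the elements whose provenance is unknown or only inferred."""
    return sorted(name for name, (value, prov) in record.items()
                  if prov is not Provenance.STATED)


# Hypothetical work-level record: element name -> (value, provenance)
work = {
    "title": ("Citizen Kane", Provenance.STATED),
    "director": ("Orson Welles", Provenance.INFERRED),
    "original_language": ("English", Provenance.UNSPECIFIED),
}
```

 Here needs_review(work) picks out "director" and "original_language" 
 as the elements someone might want to verify, while a consumer who 
 doesn't care can ignore the provenance values entirely.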

 Right now we have provenance in bib records in the following forms that 
 I can think of:
 * Presence or absence of 500 source of title note in conjunction with 
 format of item for the source of citation/transcribed parts of record
 * 040 field lists institutional codes of libraries that have edited a 
 record at the record level (so you know the last institution that 
 touched a record but you don't know what they did)

 What I might wish for is something more granular and 
 machine-actionable, such as

 Three optional machine-comprehensible provenance elements attached to 
 every data element:
 1) source of the data
 2) the institution entering the data
 3) date the data was input

 Perhaps something like the following for title proper:
 TitleProper: Citizen Kane
 DataSource: title frame
 DataInst: OrU
 DataDate: 2012-01-04

 Even if catalogers only did this for the source of data for title proper, 
 it would give us as much information as we have today, but in a 
 machine-friendly form (although it might be hard to come up with a long 
 enough list of data sources). The editing institution and date could 
 presumably be generated automatically.
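
 To make this concrete, here is a sketch of a data element with the 
 three optional provenance parts, where the institution and date are 
 filled in by the system rather than the cataloger (Python; the class 
 and function names and the default institution code are assumptions 
 for illustration only):

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class DataElement:
    name: str
    value: str
    data_source: Optional[str] = None  # all three provenance parts optional
    data_inst: Optional[str] = None
    data_date: Optional[date] = None


def enter(name: str, value: str, data_source: Optional[str] = None,
          inst: str = "OrU", today: Optional[date] = None) -> DataElement:
    """Create an element; institution and date are supplied automatically."""
    return DataElement(name, value, data_source, inst, today or date.today())


def to_lines(e: DataElement) -> List[str]:
    """Render the element in the label: value form used above."""
    lines = [f"{e.name}: {e.value}"]
    if e.data_source is not None:
        lines.append(f"DataSource: {e.data_source}")
    if e.data_inst is not None:
        lines.append(f"DataInst: {e.data_inst}")
    if e.data_date is not None:
        lines.append(f"DataDate: {e.data_date.isoformat()}")
    return lines


title = enter("TitleProper", "Citizen Kane", "title frame",
              inst="OrU", today=date(2012, 1, 4))
legacy = DataElement("TitleProper", "Citizen Kane")  # no provenance at all
```

 Calling to_lines(title) reproduces the four lines of the example 
 above, while to_lines(legacy) yields only the title line, so legacy 
 data without provenance still fits the same structure.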

 RDA is making a practical move away from identifying the sources of 
 data even as clearly as in AACR2.  In AACR2, if you include descriptive 
 citation data that was taken from somewhere other than your 
 selected/allowed sources, you bracket it. So basically AACR2 has a 
 binary partition into data from the chief or prescribed source and data 
 not from there. Except for title proper, RDA generally allows data from 
 other sources to be silently interpolated. This does suggest that we 
 could use a way of representing provenance for data elements that 
 contain data from more than one source.

 RDA retains the notion of a "source of title" note, but in a way that 
 ironically undermines its usefulness. In RDA:

 1. As in AACR2 (at least for books and moving images), the note is only 
 given when the data is taken from somewhere other than the source at 
 the top of the hierarchy of preferred sources for a format

 2. The note is optional

 Given these two possibilities, if there's no note, how will anyone ever 
 know which situation applies? IMO, it would be much better to give an 
 option for a positive source of title note or element across the board 
 and allow those who value that information to always record it 
 explicitly.


 What about authority records? Authority records are records for things 
 other than bibliographic entities (people, corporate bodies, subjects) 
 or for some bibliographic entities (usually those other than FRBR 
 manifestations, which are described by bibliographic records). If 
 anything, provenance is even more important for authority records. 
 Although the related item-in-hand is taken into account when 
 constructing these, it is commonly necessary to consult external 
 information. If we start creating separate records for FRBR works and 
 expressions, these will be more like authority records in that they 
 can't necessarily rely on an item and will often have to justify their 
 data with citations for external sources.

 Like provenance in bibliographic records, provenance in current 
 authority records is largely recorded in free text (albeit structured) 
 notes.

 The most common one is 670 (source data found), which includes a 
 citation plus usually the data found and the location where the data was 
 found within the cited material.

 670 $a Its Guide to manuscripts in the Bentley Historical Library, 
 1976: $b t.p. (Bentley Historical Library, Michigan Historical 
 Collections, Univ. of Mich.)

 For internet resources, the date when the site was looked at is 
 included:

 670 $a Internet Movie Database, Feb. 6, 2003: $b (b. 30 July 1961 in 
 Augusta, Ga.; sometimes credited as: Laurence Fishburne III, Lawrence 
 Fishburne III; changed his name from Larry to Laurence in his films in 
 1991)

 This is important for things like birth dates in IMDb, which can be 
 something of a moving target.

 There is a parallel field 675 for sources where data was not found. 
 This is generally only used for prominent sources where a reasonable 
 person would be expected to look. Standardized abbreviations for 
 well-known sources are often used. For a composer, you might have:

 675  $a New Grove; $a Thompson, 10th ed.
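
 To illustrate why these structured notes are only partly 
 machine-actionable, here is a sketch that splits a field written in 
 the textual form used in this post into its tag and subfields 
 (Python; parse_field is a made-up helper, it does not read real MARC 
 files, and it would break on data containing a literal dollar sign):

```python
import re
from typing import List, Tuple


def parse_field(line: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Split e.g. '670 $a ... $b ...' into ('670', [('a', ...), ('b', ...)])."""
    tag, _, rest = line.strip().partition(" ")
    # re.split with a capture group keeps the subfield codes in the result:
    # [text before first code, code, text, code, text, ...]
    parts = re.split(r"\$([a-z0-9])", rest)
    subfields = [(parts[i], parts[i + 1].strip())
                 for i in range(1, len(parts) - 1, 2)]
    return tag, subfields


found = parse_field(
    "670 $a Internet Movie Database, Feb. 6, 2003: "
    "$b (b. 30 July 1961 in Augusta, Ga.)"
)
not_found = parse_field("675  $a New Grove; $a Thompson, 10th ed.")
```

 The split recovers the citation ($a) and the information found ($b), 
 but the date the site was viewed and the birth date are still buried 
 in free text inside those subfields, which is the limitation at issue.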

 There is now also a $v in some authority record fields for a source of 
 information for a specific data element. I think this is supposed to 
 contain a citation, although I wasn't able to find an example in the 
 documentation. This is more granular, but it's not clear to me if it's 
 more machine-interpretable.

 In summary, I think we do need better provenance information and that 
 it should be

 * more granular
 * machine-interpretable
 * optional
 * capable of recording alternate viewpoints and reconciling these 
 viewpoints by identifying preferred data
 * capable of recording a history of edits (which I didn't talk about 
 above but which I think would be useful)

 Kelley
 
 Kelley McGrath
 University of Oregon