Sigh...I realize my original message was stated in such a way that it was possible people could construe that I am against all provenance information everywhere, when nothing could be further from the truth. My bad.

I had been responding to a specific example of the use of provenance in terms of where the title information was taken for a bibliographic record, for example:

>- title page title
>- cover title
>- title from jewel case insert

And the end-user in me shouted “who cares!” Yes, I understand that there may be variances in those titles, but I wanted to not make the assumption that such variance would have a detrimental effect on end-user needs. Not that it wouldn’t necessarily, but I don’t think we can afford to just assume that it does. All complexity comes at a price and we should be clear when the complexity is worth it and when it is not. We’ve been really bad at that in the past.

I also understand that there will be cases where machine processes create metadata, which means we basically get it “for free”, and that’s great, I welcome it. But I don’t know of a machine process that would supply the information above. That means we would be paying a cataloger to record it, and therefore we need to be REALLY SURE it’s important. I’m not yet convinced that it is.

I’m not claiming that any metadata we capture is “necessarily...more expensive, and that sensible people will immediately see we can’t afford it.” Far from it. I just don’t want us to continue to assume that we can afford infinite metadata. We can’t. Therefore, we need to select wisely what we spend our staff time to capture. For my money, it isn’t where the title is taken from. However, really all I’m saying is “show me the money”. If someone can make the case that not capturing this will have a significant deleterious effect on the end-user communities we serve, then great, let’s get it. Otherwise, we can spend our time more effectively doing something else.

Keep in mind that the effect on our user communities can even be something like “if we didn’t have this data it would make our work so difficult as to prevent us from spending time doing other things our community wishes we could do for them.” In other words, I’m not writing off backroom efficiencies as others have inferred.

Diane, if, as you say, “all the provenance I'm talking about is supplied for machines, by machines” then I’m pretty much all for it, so long as the carrier of it does not need to be so complex as to cause all kinds of difficulties down the line (but that should be rare). In other words, you seem to think we are nearly diametrically opposed, but I sure don’t think so. At least not given how you have laid out your position below.
Roy


On 1/14/12 1:47 PM, "Diane Hillmann" <[log in to unmask]> wrote:

Roy: 

Lucky you! You've stumbled on a topic I really feel strongly about (yeah, there are others).  So there are comments below: 

On Fri, Jan 13, 2012 at 3:43 PM, Roy Tennant <[log in to unmask]> wrote:
Diane asks the question “We want to do this well, don’t we?” My reply would be we should want to do it as well as is required to support real end-user needs that are important to support. This is because we will clearly lack the level of resourcing we enjoyed for much of the 80s and 90s, and even into the 2000s. We must choose well where to put our resources or we will regret it. Lacking any context, any cataloger will want to describe a resource to within an inch of its life. But that isn’t what we can afford to do.

I'm frustrated by the continuing assumption that by suggesting the high value for provenance, we're proposing something that will necessarily be more expensive, and that sensible people will immediately see that we can't afford it.  Certainly this is true in our current environment, but in a world where data will be moving around in very different ways than we see now, and not in MARC-like aggregation, provenance data is essential.  

 

So I’m suggesting we need to provide the end-user use cases where knowing “where it came from, when it was last updated, how it was created (human or machine?)” is important and then we can go from there. This can be something along the lines of “without that information I can’t provide the user with a display from which they can make intelligent decisions about the resource because of X and Y”. But there must be something to justify all the work besides our deep-seated (and laudable) desire to do things “well”.


If we define 'end users' as always being human, we're missing a whole lot of the point of all this shift in focus. If we're expecting machines to parse, manage, and interpret the data coming at them, we have to see them (and the services that depend on them) as 'end users' as well. Yes, as always, humans will be directing all this, but we need to provide much more information about the data itself if we're expecting all this to work in a different environment, AND to be affordable and efficient.  We should have learned well enough in the last forty years about how insufficient and sometimes lousy data limits what we can do. 

Keep in mind that all the provenance I'm talking about is supplied for machines, by machines (but in a manner designed by people). Expensive humans aren't entering data on forms, but they need to know how to instruct the providing and consuming machines what to supply for downstream services, and how to interpret what they're being fed. This is not rocket science, and there are people in our community and others who 'get' this and have even written about it. There are even vendors who are actually using these ideas successfully--one example is the Summon product. 

Let's keep our minds open ...

Diane
