As someone once* asked, "Is this rule necessary?"

For transcribed statements, a good place to begin the inquiry is to look at the goals underlying the original rules, and to see if the rules chosen to achieve those goals are adequate or redundant.

I would argue that the primary goal addressed by rigidly controlled transcribed statements is to determine whether two descriptions are about the same thing. This explains why there is a need to record information as found (or modified by uniformly applied transforms).

There can be a limited amount of benefit to access if the transcribed statements contain terms that would not otherwise be present, but this purpose alone would not justify the specificity of the transcription rules.

If the lack of transcribed statements breaks record linkage badly, then the rules are clearly necessary; however, this _necessity_ would seem to arise from the act of description rather than the nature of what is described.

[Lubetzky's question can, in this situation, be answered empirically: one could take a large number of _records_, mutate them to simulate the records that might have been created if the rule were not present, then compare matching accuracy using e.g. Fellegi-Sunter linkage with weights estimated by EM. It is of course important to account for violations of the independence assumption.
The estimated weights for different fields/subfields in the unmodified records may suggest which rules are important to match/non-match determination.]
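
For what it's worth, the weight-estimation step could be sketched roughly as below. This is only a minimal illustration under stated assumptions: each record pair has already been reduced to a binary agreement vector (1 = field agrees, 0 = it does not), the fields are treated as conditionally independent (exactly the assumption flagged above as needing care), and the function names, the simulated data and the field names are all hypothetical.

```python
import math
import random


def clamp(x, eps=1e-6):
    """Keep a probability strictly inside (0, 1) so logs and ratios stay finite."""
    return min(max(x, eps), 1.0 - eps)


def simulate_vectors(n, p, m, u, seed=0):
    """Draw n agreement vectors: a pair is a true match with probability p;
    field j agrees with probability m[j] for matches and u[j] for non-matches."""
    rng = random.Random(seed)
    vectors = []
    for _ in range(n):
        probs = m if rng.random() < p else u
        vectors.append(tuple(int(rng.random() < q) for q in probs))
    return vectors


def em_fellegi_sunter(vectors, n_iter=100, p=0.1, m_init=0.9, u_init=0.1):
    """Estimate the match proportion p and the per-field m/u probabilities
    by EM on a two-class mixture, assuming conditional independence."""
    k, n = len(vectors[0]), len(vectors)
    m, u = [m_init] * k, [u_init] * k
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        g = []
        for v in vectors:
            like_m, like_u = p, 1.0 - p
            for j, agree in enumerate(v):
                like_m *= m[j] if agree else 1.0 - m[j]
                like_u *= u[j] if agree else 1.0 - u[j]
            g.append(like_m / (like_m + like_u))
        # M-step: re-estimate p, m and u from the posteriors.
        total_m = max(sum(g), 1e-9)
        total_u = max(n - sum(g), 1e-9)
        p = clamp(total_m / n)
        for j in range(k):
            m[j] = clamp(sum(gi for gi, v in zip(g, vectors) if v[j]) / total_m)
            u[j] = clamp(sum(1.0 - gi for gi, v in zip(g, vectors) if v[j]) / total_u)
    return p, m, u


def field_weights(m, u):
    """Return (agreement weight, disagreement weight) per field, in bits."""
    return [(math.log2(mj / uj), math.log2((1.0 - mj) / (1.0 - uj)))
            for mj, uj in zip(m, u)]


if __name__ == "__main__":
    # Hypothetical fields and made-up probabilities, for illustration only.
    fields = ["title_statement", "publisher", "date", "extent"]
    vecs = simulate_vectors(3000, 0.2, [0.95, 0.85, 0.9, 0.8], [0.05, 0.3, 0.2, 0.1])
    p_hat, m_hat, u_hat = em_fellegi_sunter(vecs)
    for name, (agree_w, disagree_w) in zip(fields, field_weights(m_hat, u_hat)):
        print(f"{name}: agree {agree_w:+.2f}, disagree {disagree_w:+.2f}")
```

Running the same estimation on vectors derived from the unmodified and the mutated records, and comparing the resulting weights and match decisions, would be one way to see how much a given rule contributes.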

Simon // Is it not possible that it is possible that this is not a rule?

On Aug 1, 2014 2:31 PM, "Philip Schreur" <[log in to unmask]> wrote:
bf has some very complex use cases:

    - it is the inheritor of MARC and will be expected to find a way of representing that data

    - it will need to represent data created according to our latest set of cataloging rules (in this case, RDA)

    - it should provide a light, extensible framework for representing all the data library patrons may have interest in (which is virtually everything) and all of the above as natively in RDF as possible

At some point we have to acknowledge that not all of these can be accommodated equally well and that compromises will need to be made. But if the point of bf is to integrate this data into the web via RDF, it seems we should compromise least on this aspect. Otherwise what is the point? I think the conversations of the past few weeks have been very helpful in this regard.


On 8/1/2014 9:49 AM, [log in to unmask] wrote:
Thanks, Rob.

But where is the "radical" thought which caught my eye? ;)

Here, the ship hasn't sailed yet. 

Let's be radical: in my future work with bibliographic data, I will ignore systems that do not distinguish between core data, expressed as RDF statements that a machine can process, and descriptive data, which presentational services require and which comes with languages and rules for how to describe data, e.g. on the web.

Let's drop all legacy OPACs and discovery systems now. Cataloging of URLs - that's where it all started. The mix of all kinds of control and descriptive "web data" in the catalog.

It's not "MARC must die". It's "Bad data without clear semantics must die".

Just to add one minor thing: besides RDFa, it is also possible to add JSON-LD into HTML. Google is using that:
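
As a rough illustration of what embedding JSON-LD in HTML looks like, here is a small sketch. It only assumes the usual convention of a script element with type "application/ld+json"; the helper name, the schema.org description and the sample values are hypothetical and not taken from any particular system.

```python
import json


def jsonld_script(data):
    """Serialize a JSON-LD document into an HTML script element."""
    payload = json.dumps(data, indent=2, ensure_ascii=False)
    # "</" can only occur inside JSON strings; writing it as "<\/" is still
    # valid JSON and keeps a value from closing the script element early.
    payload = payload.replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + payload + "\n</script>"


# Hypothetical description of a book using schema.org terms.
book = {
    "@context": "https://schema.org",
    "@type": "Book",
    "name": "An Example Title",
    "author": {"@type": "Person", "name": "A. N. Author"},
}

if __name__ == "__main__":
    print(jsonld_script(book))
```

The human-readable HTML stays as it is, and the machine-readable description travels alongside it in the same page.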


On Fri, Aug 1, 2014 at 6:18 PM, Robert Sanderson <[log in to unmask]> wrote:

Dear all,

In my experience, RDF and Linked Data can do presentation-based information (e.g. here is content to present directly to the user, without semantics [1]), and they can do semantic, descriptive information (e.g. here is a rich description of the resource, say a book or annotation [2]). Doing both at once, however, is very challenging without simply repeating everything in a for-machines way and a for-humans way, as per the current titleStatement, providerStatement and, one assumes, authorStatement, subjectStatement, etc.

Here are two radical ideas, for which the boat has probably long since sailed, but I'll throw them out there regardless.

1. Don't try to mix them up.  Have two completely separate descriptions, where one is intended for humans to read, and the other is intended for machines to reason upon and search.  A machine will only ever throw a transcribed string through to the user, so make it easy for them to do that by separating the non-semantic information from the semantic information, with links between them. 

2.  Mix them up using the appropriate technology: HTML + RDFa.  Instead of thinking about triples for everything, create the HTML that you want the user to see.  Then annotate that HTML with RDFa properties to add the semantics into the record (and really a record now, not a graph).  This way there's only one record to maintain that has both, but uses presentation technology for presenting things to users, and semantic technology for enabling machines to understand the information.
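
To make the second idea a little more concrete, here is a minimal sketch of a record written as HTML first and then annotated with RDFa attributes (vocab, typeof, property, resource). The function name, the choice of schema.org as the vocabulary, and the sample title and author URI are assumptions for illustration only.

```python
from html import escape


def rdfa_record(title_statement, author_name, author_uri):
    """Build a display-ready HTML fragment for a book, annotated with RDFa
    so the same markup carries both the presentation and the semantics."""
    return (
        '<div vocab="https://schema.org/" typeof="Book">\n'
        # The transcribed string is what the user reads; the property
        # attribute is what tells a machine which statement it makes.
        f'  <p><span property="name">{escape(title_statement)}</span> / '
        f'<span property="author" typeof="Person" '
        f'resource="{escape(author_uri, quote=True)}">'
        f'<span property="name">{escape(author_name)}</span></span></p>\n'
        "</div>"
    )


if __name__ == "__main__":
    # Hypothetical values; example.org is a placeholder domain.
    print(rdfa_record("An example title & subtitle", "A. N. Author",
                      "http://example.org/person/1"))
```

The order of the text on the page is preserved by the HTML itself, so nothing has to be modelled as an ordered list in the graph.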

Basically -- use the right tools for the job.  RDF has a hard time representing transcriptions outside of non-semantic strings because it was never intended to do that.  Order in RDF is a complete pain, because a graph is inherently unordered, but there are very real use cases that require order.  On the other hand, RDF is fantastic for controlled data as that is precisely its intended usage.  We should make the most appropriate use of the tools that we have available to us, rather than treating everything as a nail.



[1].  The IIIF Presentation API is focused on this approach of giving information intended for a client to display, while still being useful linked data by referencing existing semantic descriptions and following REST and JSON-LD.
[2].  The Open Annotation work is a rich data model that provides semantics for web annotation, but says almost nothing about presentation.

Rob Sanderson
Technology Collaboration Facilitator
Digital Library Systems and Services
Stanford, CA 94305

Philip E. Schreur
Head, Metadata Department
Stanford University
650-725-1120 (fax)