Ross,

I wonder how you query your MongoDB store? I don't suppose it supports SPARQL?

On Mon, Apr 13, 2015 at 11:51 PM, Ross Singer <[log in to unmask]> wrote:
> Kelley,
>
> I stick by the notion that as long as a system can ingest/export data as
> well-formed RDF graphs, how you store it internally makes no difference.  As
> I've said before, our biggest product is modeled entirely using RDF, but we
> store the data in a MongoDB document database, because it fits our actual
> needs better than a triple store.
>
> It is possible, btw, to produce an ordered list for author/editor names in
> RDF, but it's horribly ugly: you can use rdf:Seq
> http://www.w3.org/TR/rdf-schema/#ch_seq or rdf:List
> http://www.w3.org/TR/rdf-schema/#ch_list.  They each have their pluses and
> minuses: rdf:List is absolutely awful to work with in any serialization
> except Turtle (where it's super easy! see:
> http://www.w3.org/2007/02/turtle/primer/#L2986), but has the advantage of
> being semantically closed.  That is, the rdf:nil terminator lets you say
> "these are *all* of the authors and there are no more".
>
> rdf:Seq (which is an rdf:Container) is considered open (i.e. nothing rules
> out further members of the same container being asserted somewhere else)
> and, unfortunately, has no syntactic sugar like Collections in Turtle.
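
To make the rdf:List structure concrete, here is a minimal sketch in plain Python (no RDF library) of how an ordered author list becomes a chain of rdf:first/rdf:rest triples ending in rdf:nil, and how the order is recovered by walking the chain. The node labels (`_:b0`, ...) and author names are placeholders, and the helper functions are hypothetical, not part of any toolkit.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def to_rdf_list(items, prefix="_:b"):
    """Encode an ordered Python list as rdf:first/rdf:rest triples.

    Returns (head_node, triples). An empty list is just rdf:nil.
    """
    if not items:
        return RDF + "nil", []
    nodes = [f"{prefix}{i}" for i in range(len(items))]
    triples = []
    for i, item in enumerate(items):
        # The last cell points at rdf:nil, which closes the list.
        rest = nodes[i + 1] if i + 1 < len(items) else RDF + "nil"
        triples.append((nodes[i], RDF + "first", item))
        triples.append((nodes[i], RDF + "rest", rest))
    return nodes[0], triples


def from_rdf_list(head, triples):
    """Walk the chain from head to rdf:nil, returning items in order."""
    first = {s: o for s, p, o in triples if p == RDF + "first"}
    rest = {s: o for s, p, o in triples if p == RDF + "rest"}
    items = []
    node = head
    while node != RDF + "nil":
        items.append(first[node])
        node = rest[node]
    return items


authors = ["Author A", "Author B", "Author C"]  # placeholder names
head, triples = to_rdf_list(authors)
recovered = from_rdf_list(head, triples)
```

Three authors cost six triples plus three intermediate nodes, which is a fair picture of why this is painful outside Turtle.
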
>
> Containers and Collections being such major pain points in RDF, JSON-LD
> threw all of it away for a *much* simpler implementation:
> http://www.w3.org/TR/json-ld/#sets-and-lists
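
For comparison, a rough sketch of the JSON-LD approach, built as a Python dict and serialized with the standard `json` module. The `"@container": "@list"` keyword is from the JSON-LD spec linked above; the `example.org` vocabulary, the book identifier, and the author names are all made up for illustration.

```python
import json

# Hypothetical vocabulary and identifiers; only "@context", "@id" and
# "@container": "@list" are JSON-LD keywords.
doc = {
    "@context": {
        "title": "http://example.org/vocab/title",
        "author": {
            "@id": "http://example.org/vocab/author",
            "@container": "@list",  # array order is significant
        },
    },
    "@id": "http://example.org/book/1",
    "title": "An Example Title",
    "author": ["Author A", "Author B", "Author C"],
}

serialized = json.dumps(doc, indent=2)
round_tripped = json.loads(serialized)
```

The ordered list is just a JSON array; the context is what tells a JSON-LD processor to treat it as an ordered list rather than an unordered set.
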
>
> All that said, as long as you can serialize your author lists as one of
> these, model it however best suits your regular workflows.
>
> -Ross.
>
> On Mon, Apr 13, 2015 at 6:12 AM Kelley McGrath <[log in to unmask]> wrote:
>>
>> Although much of the discussion on storing bibframe data went over my
>> head, some things have been niggling at me for a while that maybe are
>> related to this thread.
>>
>> I get that it would be good for us to publish our data as linked data. I
>> get that it would be good for us to consume linked data. I get that we
>> should re-use other people's URIs in our data to save time and reduce
>> maintenance. I get that we should match our identifiers to other people's
>> URIs in order to connect more information.
>>
>> However, it has not been clear to me that it makes sense for us to store
>> and maintain our data as linked data. And yet, I don't see any alternative
>> plan being developed.
>>
>> This may be sacrilege, but from what I understand there seem to be things
>> that linked data isn't good at. For example, retaining the order of things
>> like authors' names or connecting a specific place and publisher written on
>> the title page of a book. Sometimes when this has been discussed on this
>> list, I get the impression that we shouldn't want to do those things; that
>> they're somehow obsolete.
>>
>> I can't get my head around that. Maybe you don't need those things for
>> linking, but I don't think linking is the only thing that we want to do with
>> our data. For example, it emerged recently, when MPOW changed to a discovery
>> layer that didn't do such a good job with this initially, that the ability
>> to generate citations is hugely important to a significant portion of our
>> patrons. If you want to generate an accurate citation, you need to know the
>> order of the authors' names.
>>
>> It has been suggested to me that we shouldn't be generating citations, but
>> rather storing them as strings. However, again I seem to be missing
>> something because that doesn't seem very optimal to me. Do you store a
>> separate string for every format: APA, MLA, Chicago, etc.? What do you do
>> when a style guide gets updated? It might not be very easy to update
>> millions of strings. What if a new citation style is invented and becomes
>> popular? It just seems to me to be more flexible and powerful to store the
>> individual pieces of data and generate the citations. On the other hand,
>> publishing citations as strings might be okay for most use cases.
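
A small sketch of the "store the pieces, generate the strings" idea. The two formatting functions are deliberately simplified, hypothetical styles, not faithful implementations of APA, MLA, or Chicago; the authors and the work are placeholders. The point is only that one ordered record can feed any number of styles, and a style change means editing one function rather than millions of stored strings.

```python
from dataclasses import dataclass


@dataclass
class Author:
    given: str
    family: str


@dataclass
class Work:
    authors: list  # order matters: first author first
    year: int
    title: str


def join_names(names, last_sep):
    """Join names with commas, using last_sep before the final one."""
    if len(names) == 1:
        return names[0]
    return ", ".join(names[:-1]) + last_sep + names[-1]


def cite_initials_style(work):
    """Hypothetical style: 'Family, G.' for every author, year in parens."""
    names = [f"{a.family}, {a.given[0]}." for a in work.authors]
    return f"{join_names(names, ', & ')} ({work.year}). {work.title}."


def cite_full_name_style(work):
    """Hypothetical style: only the first author is inverted."""
    first = work.authors[0]
    names = [f"{first.family}, {first.given}"]
    names += [f"{a.given} {a.family}" for a in work.authors[1:]]
    return f"{join_names(names, ', and ')}. {work.title}. {work.year}."


work = Work([Author("Jane", "Doe"), Author("John", "Roe")],
            2015, "An Example Title")
citation_a = cite_initials_style(work)
citation_b = cite_full_name_style(work)
```

Both outputs depend on knowing which author comes first, which is exactly the information an unordered set of author statements loses.
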
>>
>> MARC records are a single unit. If a record has been edited by multiple
>> parties, you can't tell who changed what when, which is a challenge for
>> trouble-shooting and quality control. Linked data statements are atomistic,
>> but it sounds to me like it is still hard to say anything much *about* the
>> statement other than maybe the domain name used by whoever made it. It would
>> be useful to track more about individual statements, such as when they are
>> made and whether or not they are currently considered valid (one of the
>> problems with bad data in the OCLC master record environment is that even if
>> you take erroneous information out, all it takes is one batchload to put it
>> right back in).
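
One way to picture statement-level tracking is to wrap each triple in a record that carries who asserted it, when, and whether it is still considered valid. This is a toy sketch under that assumption: the `Store` class, its methods, and the source labels are all hypothetical, and real systems would more likely use named graphs or some form of reification.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Statement:
    subject: str
    predicate: str
    obj: str


@dataclass
class Assertion:
    statement: Statement
    source: str
    asserted_on: date
    valid: bool = True


class Store:
    def __init__(self):
        self.assertions = {}

    def assert_statement(self, statement, source, on):
        # A statement that is already known (including one that was
        # retracted) keeps its existing record, so a later batchload
        # cannot silently revive it.
        existing = self.assertions.get(statement)
        if existing is not None:
            return existing
        record = Assertion(statement, source, on)
        self.assertions[statement] = record
        return record

    def retract(self, statement):
        self.assertions[statement].valid = False

    def current(self):
        return [a.statement for a in self.assertions.values() if a.valid]


store = Store()
bad = Statement("work:1", "publicationYear", "1895")   # erroneous value
good = Statement("work:1", "publicationYear", "1985")
store.assert_statement(bad, "library-a", date(2014, 1, 5))
store.assert_statement(good, "library-b", date(2014, 6, 1))
store.retract(bad)
# A later batchload tries to put the erroneous statement back in.
store.assert_statement(bad, "batchload", date(2015, 3, 1))
still_current = store.current()
```
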
>>
>> As some of you know, I have been working on a project to crowdsource the
>> parsing of film credits in catalog records (help us out at
>> http://olac-annotator.org/ ). One result of this is that we have links
>> between transcribed names in records and their authorized form.  It occurs
>> to me that this might be a useful thing to record proactively. For example,
>> even in a world of identifiers, we still need to choose one of many possible
>> versions of a name to display to the user (unless you're going to display
>> them all at once in some kind of cluster, which is not very user-friendly in
>> many situations). In library cataloging, traditionally, for people the most
>> common or the most recent variation is chosen as the preferred one. However,
>> if the math changes, you have to wait for a person with NACO powers to
>> notice this and fix it. This doesn't always happen in a timely fashion. In
>> his earliest movies, Laurence Fishburne was credited as Larry Fishburne so
>> this is how his name was established. It then persisted in library catalogs
>> as Larry Fishburne for long, long after he made the change (I think
>> ten years). If you had data like this, the computer could do the math and
>> display the most current form.
>>
>> Name on piece        Year    Work
>> Larry Fishburne      1984    The Cotton Club
>> Larry Fishburne      1985    The Color Purple
>> Laurence Fishburne   1993    Searching for Bobby Fischer
>> Laurence Fishburne   1999    The Matrix
>> Laurence Fishburne   2006    Mission: Impossible III
>>
>> (if you look at IMDb's Laurence Fishburne page, they do track all this,
>> along with the names of the characters he played:
>> http://www.imdb.com/name/nm0000401/ )
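
"Doing the math" on the credits table above could look roughly like this. The data is taken from the table; the three selection functions are hypothetical illustrations of the "most recent" and "most common" rules, not an existing cataloging tool.

```python
from collections import Counter

# (name on piece, year, work), from the table above
credits = [
    ("Larry Fishburne", 1984, "The Cotton Club"),
    ("Larry Fishburne", 1985, "The Color Purple"),
    ("Laurence Fishburne", 1993, "Searching for Bobby Fischer"),
    ("Laurence Fishburne", 1999, "The Matrix"),
    ("Laurence Fishburne", 2006, "Mission: Impossible III"),
]


def most_recent_name(credits):
    """Name used on the latest credited work."""
    return max(credits, key=lambda c: c[1])[0]


def most_common_name(credits):
    """Name used on the largest number of credited works."""
    counts = Counter(name for name, _, _ in credits)
    return counts.most_common(1)[0][0]


def preferred_name_as_of(credits, year):
    """Most recent name among works released up to a given year."""
    known = [c for c in credits if c[1] <= year]
    return most_recent_name(known) if known else None


recent = most_recent_name(credits)
common = most_common_name(credits)
as_of_1990 = preferred_name_as_of(credits, 1990)
```

With this data both rules pick "Laurence Fishburne", and a catalog looking only at credits through 1990 would still pick "Larry Fishburne".
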
>>
>> With linked data, you can say
>>
>> Movie1  -- has actor  -- LF123
>> Movie1 -- has actor's name credited as  -- "Laurence Fishburne"
>> LF123  -- has been credited as -- "Laurence Fishburne"
>>
>> But you can't get all three of those things to connect up, at least not
>> without using blank nodes, which then makes your data not so shareable. So
>> far as I can see, anytime you want to connect the dots between more than two
>> pieces of information or say something about a statement, it doesn't work so
>> well with triples. This might not be such a problem for linking, but I think
>> there are other things we want to do with our data where we may want this
>> ability.
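
One common workaround for connecting more than two things is to make the credit itself a node with its own identifier, and hang the movie, the person, and the transcribed name off it. A minimal sketch with triples as plain Python tuples follows; every URI and property name under `http://example.org/` is invented for illustration, and giving the credit a URI rather than a blank node is an assumption about how you would want to share it.

```python
EX = "http://example.org/"

# Each credit is a named node that ties movie, person and name together.
triples = [
    (EX + "credit/1", EX + "creditedIn", EX + "movie/1"),
    (EX + "credit/1", EX + "creditedAgent", EX + "person/LF123"),
    (EX + "credit/1", EX + "nameAsCredited", "Laurence Fishburne"),
    (EX + "credit/2", EX + "creditedIn", EX + "movie/2"),
    (EX + "credit/2", EX + "creditedAgent", EX + "person/LF123"),
    (EX + "credit/2", EX + "nameAsCredited", "Larry Fishburne"),
]


def describe(node, triples):
    """Collect the properties of one node as a predicate -> object dict."""
    return {p: o for s, p, o in triples if s == node}


def credited_names(person, movie, triples):
    """Names under which a person was credited in a specific movie."""
    names = []
    for node in sorted({s for s, _, _ in triples}):
        d = describe(node, triples)
        if (d.get(EX + "creditedAgent") == person
                and d.get(EX + "creditedIn") == movie):
            names.append(d[EX + "nameAsCredited"])
    return names


in_movie_1 = credited_names(EX + "person/LF123", EX + "movie/1", triples)
in_movie_2 = credited_names(EX + "person/LF123", EX + "movie/2", triples)
```

The cost is an extra node and three statements per credit instead of one, which is the usual trade-off with this pattern.
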
>>
>> What happens if we implement bibframe and we don't store and maintain our
>> data as bibframe triples? We could just keep generating bibframe from MARC
>> records, but then we haven't really replaced MARC or gotten more flexible,
>> structured data than we already have.
>>
>> Alternatively, ILS vendors could come up with another internal format for
>> us to store data in. However, I don't know that they have the right
>> expertise for this nor any economic incentives. If this happened, we would
>> also end up with much less portable data. Imagine if bib records were like
>> item records and every system had its proprietary format and unique
>> combination of fields. Anytime you do an ILS migration, there is a lot of
>> item data that can't be moved to the new system, either because it's
>> structured differently or because there is no equivalent field.
>>
>> This may be completely wrong-headed and I think I'm using the wrong
>> vocabulary some places, but I thought I'd throw it out there in case someone
>> can enlighten me.
>>
>> Kelley