Rob, you're talking about a small subset of the query space. Query
languages exist for a reason, as I'm sure you know well.
Another important issue here is query portability, where SPARQL also
wins hands down against any imperative solution.
On Tue, Apr 14, 2015 at 12:59 AM, Robert Sanderson wrote:
>
> Many queries are handled perfectly well without graph traversal, which is
> where SPARQL shines (if you'll excuse the pun).
> "Give me all books that were written by author X" is a simple index lookup.
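To make the "simple index" case concrete, here is a minimal sketch of an author index as a plain dictionary. The sample records and the `build_author_index` helper are invented for illustration; this only covers the single-lookup case and says nothing about queries that have to follow links between records.

```python
from collections import defaultdict

# Invented sample records; a real catalog would supply these.
BOOKS = [
    {"title": "Book One", "authors": ["Author X", "Author Y"]},
    {"title": "Book Two", "authors": ["Author X"]},
    {"title": "Book Three", "authors": ["Author Z"]},
]


def build_author_index(books):
    """Map each author name to the titles of the books they wrote."""
    index = defaultdict(list)
    for book in books:
        for author in book["authors"]:
            index[author].append(book["title"])
    return index


AUTHOR_INDEX = build_author_index(BOOKS)

# "Give me all books that were written by author X" becomes one lookup.
print(AUTHOR_INDEX.get("Author X", []))
```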
>
> Secondly, not all information stored is needed for answering graph traversal
> queries, only the information needed for the use cases. If no query is
> going to look at, for example, bf:retentionPolicy, then there's no need to
> manage the millions of copies of it currently required in the model in a
> triple store.
>
> Rob
>
>
>
>
>
> On Mon, Apr 13, 2015 at 3:52 PM, Martynas Jusevičius wrote:
>>
>> Ross,
>>
>> I wonder how you query your MongoDB store? I don't suppose it supports
>> SPARQL?
>>
>> On Mon, Apr 13, 2015 at 11:51 PM, Ross Singer wrote:
>> > Kelley,
>> >
>> > I stick by the notion that as long as a system can ingest/export data as
>> > well-formed RDF graphs, how you store it internally makes no difference.
>> > As
>> > I've said before, our biggest product is modeled entirely using RDF, but
>> > we
>> > store the data in a MongoDB document database, because it fits our
>> > actual
>> > needs better than a triple store.
>> >
>> > It is possible, btw, to produce an ordered list for author/editor names
>> > in
>> > RDF, but it's horribly ugly: you can use rdf:Seq
>> > http://www.w3.org/TR/rdf-schema/#ch_seq or rdf:List
>> > http://www.w3.org/TR/rdf-schema/#ch_list. They each have their pluses
>> > and
>> > minuses: rdf:List is absolutely awful to work with in any serialization
>> > except Turtle (where it's super easy! see:
>> > http://www.w3.org/2007/02/turtle/primer/#L2986), but it has the upside of
>> > being semantically closed. That is, the list ends in rdf:nil, so you can
>> > definitively say "these are *all* of the authors and there are no more".
>> >
>> > rdf:Seq (which is an rdf:Container) is considered open (i.e. nothing
>> > stops a statement somewhere else from adding another member to the same
>> > container) and, unfortunately, has no syntactic sugar like Collections
>> > in Turtle.
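For readers who have not seen the two structures written out, here is a rough sketch of the triples each one produces for an ordered pair of authors. Triples are plain Python tuples; the `ex:` identifiers, the `_:b0` style cell names standing in for blank nodes, and both helper functions are assumptions made for illustration. Only the `rdf:` property names come from the RDF vocabulary.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def as_seq(container, members):
    """Triples for an rdf:Seq: one numbered property (rdf:_1, rdf:_2, ...) per member."""
    triples = [(container, RDF + "type", RDF + "Seq")]
    for i, member in enumerate(members, start=1):
        triples.append((container, f"{RDF}_{i}", member))
    return triples


def as_list(members):
    """Triples for an rdf:List: a chain of rdf:first/rdf:rest cells ending in rdf:nil.

    Returns (head, triples). Cell names such as '_:b0' stand in for blank nodes.
    """
    if not members:
        return RDF + "nil", []
    cells = [f"_:b{i}" for i in range(len(members))]
    triples = []
    for i, member in enumerate(members):
        next_cell = cells[i + 1] if i + 1 < len(cells) else RDF + "nil"
        triples.append((cells[i], RDF + "first", member))
        triples.append((cells[i], RDF + "rest", next_cell))
    return cells[0], triples


AUTHORS = ["ex:author1", "ex:author2"]  # hypothetical identifiers, in citation order

SEQ_TRIPLES = as_seq("ex:authors", AUTHORS)
LIST_HEAD, LIST_TRIPLES = as_list(AUTHORS)

for triple in SEQ_TRIPLES + LIST_TRIPLES:
    print(triple)
```

The list form needs two triples per member against one for the container, which is a large part of why it is painful to read in serializations that lack a shorthand for it.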
>> >
>> > Containers and Collections being such major pain points in RDF, JSON-LD
>> > threw all of it away for a *much* simpler implementation:
>> > http://www.w3.org/TR/json-ld/#sets-and-lists
>> >
>> > All that said, as long as you can serialize your author lists as one of
>> > these, model it however suits your needs the best for your regular
>> > workflows/needs.
>> >
>> > -Ross.
>> >
>> > On Mon, Apr 13, 2015 at 6:12 AM Kelley McGrath wrote:
>> >>
>> >> Although much of the discussion on storing bibframe data went over my
>> >> head, some things have been niggling at me for a while that maybe are
>> >> related to this thread.
>> >>
>> >> I get that it would be good for us to publish our data as linked data.
>> >> I
>> >> get that it would be good for us to consume linked data. I get that we
>> >> should re-use other people's URIs in our data to save time and reduce
>> >> maintenance. I get that we should match our identifiers to other
>> >> people's
>> >> URIs in order to connect more information.
>> >>
>> >> However, it has not been clear to me that it makes sense for us to
>> >> store
>> >> and maintain our data as linked data. And yet, I don't see any
>> >> alternative
>> >> plan being developed.
>> >>
>> >> This may be sacrilege, but from what I understand there seem to be
>> >> things
>> >> that linked data isn't good at. For example, retaining the order of
>> >> things
>> >> like authors' names or connecting a specific place and publisher
>> >> written on
>> >> the title page of a book. Sometimes when this has been discussed on
>> >> this
>> >> list, I get the impression that we shouldn't want to do those things;
>> >> that
>> >> they're somehow obsolete.
>> >>
>> >> I can't get my head around that. Maybe you don't need those things for
>> >> linking, but I don't think linking is the only thing that we want to do
>> >> with
>> >> our data. For example, it emerged recently, when MPOW changed to a
>> >> discovery
>> >> layer that didn't do such a good job with this initially, that the
>> >> ability
>> >> to generate citations is hugely important to a significant portion of
>> >> our
>> >> patrons. If you want to generate an accurate citation, you need to know
>> >> the
>> >> order of the authors' names.
>> >>
>> >> It has been suggested to me that we shouldn't be generating citations,
>> >> but
>> >> rather storing them as strings. However, again I seem to be missing
>> >> something because that doesn't seem very optimal to me. Do you store a
>> >> separate string for every format: APA, MLA, Chicago, etc.? What do you
>> >> do
>> >> when a style guide gets updated? It might not be very easy to update
>> >> millions of strings. What if a new citation style is invented and
>> >> becomes
>> >> popular? It just seems to me to be more flexible and powerful to store
>> >> the
>> >> individual pieces of data and generate the citations. On the other
>> >> hand,
>> >> publishing citations as strings might be okay for most use cases.
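A small sketch of the "store the pieces, generate the strings" approach, assuming the author order is kept in the data. The two formats below are deliberately simplified and hypothetical: they are not faithful renderings of APA, MLA or any other style guide, and the authors, year and title are invented. The point is only that one ordered record can feed several formats, and that changing a format means changing one function rather than millions of stored strings.

```python
from typing import NamedTuple


class Author(NamedTuple):
    family: str
    given: str


def initials(given):
    """Turn 'Richard Lee' into 'R. L.'."""
    return " ".join(part[0] + "." for part in given.split())


def style_a(authors, year, title):
    """Hypothetical format: 'Family, G., & Family, G. (year). Title.'"""
    names = [f"{a.family}, {initials(a.given)}" for a in authors]
    if len(names) > 1:
        joined = ", ".join(names[:-1]) + ", & " + names[-1]
    else:
        joined = names[0]
    return f"{joined} ({year}). {title}."


def style_b(authors, year, title):
    """Hypothetical format: 'Family, Given, and Given Family. Title. year.'"""
    first = f"{authors[0].family}, {authors[0].given}"
    rest = [f"{a.given} {a.family}" for a in authors[1:]]
    if rest:
        joined = ", ".join([first] + rest[:-1]) + ", and " + rest[-1]
    else:
        joined = first
    return f"{joined}. {title}. {year}."


# Invented record; the list order is the order on the title page.
AUTHORS = [Author("Doe", "Jane"), Author("Roe", "Richard Lee")]

print(style_a(AUTHORS, 2015, "An Example Book"))
print(style_b(AUTHORS, 2015, "An Example Book"))
```

Reversing `AUTHORS` changes both outputs, which is the reason the order has to survive in whatever form the data is stored.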
>> >>
>> >> MARC records are a single unit. If a record has been edited by multiple
>> >> parties, you can't tell who changed what when, which is a challenge for
>> >> trouble-shooting and quality control. Linked data statements are
>> >> atomistic,
>> >> but it sounds to me like it is still hard to say anything much *about*
>> >> the
>> >> statement other than maybe the domain name used by whoever made it. It
>> >> would
>> >> be useful to track more about individual statements, such as when they
>> >> are
>> >> made and whether or not they are currently considered valid (one of the
>> >> problems with bad data in the OCLC master record environment is that
>> >> even if
>> >> you take erroneous information out, all it takes is one batchload to
>> >> put it
>> >> right back in).
>> >>
>> >> As some of you know, I have been working on a project to crowdsource
>> >> the
>> >> parsing of film credits in catalog records (help us out at
>> >> http://olac-annotator.org/ ). One result of this is that we have links
>> >> between transcribed names in records and their authorized form. It
>> >> occurs
>> >> to me that this might be a useful thing to record proactively. For
>> >> example,
>> >> even in a world of identifiers, we still need to choose one of many
>> >> possible
>> >> versions of a name to display to the user (unless you're going to
>> >> display
>> >> them all at once in some kind of cluster, which is not very
>> >> user-friendly in
>> >> many situations). In library cataloging, traditionally, for people the
>> >> most
>> >> common or the most recent variation is chosen as the preferred one.
>> >> However,
>> >> if the math changes, you have to wait for a person with NACO powers to
>> >> notice this and fix it. This doesn't always happen in a timely fashion.
>> >> In
>> >> his earliest movies, Laurence Fishburne was credited as Larry Fishburne
>> >> so
>> >> this is how his name was established. It then persisted in library
>> >> catalogs
>> >> as Larry Fishburne for long, long after he made the change (I
>> >> think
>> >> ten years). If you had data like this, the computer could do the math
>> >> and
>> >> display the most current form.
>> >>
>> >> Name on piece       Year  Work
>> >> Larry Fishburne     1984  The Cotton Club
>> >> Larry Fishburne     1985  The Color Purple
>> >> Laurence Fishburne  1993  Searching for Bobby Fischer
>> >> Laurence Fishburne  1999  The Matrix
>> >> Laurence Fishburne  2006  Mission: Impossible III
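A minimal sketch of "the math" on the rows above. The `Credit` tuple and both helper functions are hypothetical names for illustration, and "most recent" and "most common" are just the two selection rules mentioned in the message; a real system might weigh them differently.

```python
from collections import Counter
from typing import NamedTuple


class Credit(NamedTuple):
    name_on_piece: str
    year: int
    work: str


# Rows copied from the table above.
CREDITS = [
    Credit("Larry Fishburne", 1984, "The Cotton Club"),
    Credit("Larry Fishburne", 1985, "The Color Purple"),
    Credit("Laurence Fishburne", 1993, "Searching for Bobby Fischer"),
    Credit("Laurence Fishburne", 1999, "The Matrix"),
    Credit("Laurence Fishburne", 2006, "Mission: Impossible III"),
]


def most_recent_name(credits):
    """Return the name form used on the latest-dated credit."""
    return max(credits, key=lambda c: c.year).name_on_piece


def most_common_name(credits):
    """Return the name form that appears on the most credits."""
    counts = Counter(c.name_on_piece for c in credits)
    return counts.most_common(1)[0][0]


print(most_recent_name(CREDITS))
print(most_common_name(CREDITS))
```

With only the two 1980s rows, both rules would still pick "Larry Fishburne"; the preferred form changes on its own as later credits are recorded.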
>> >>
>> >> (if you look at IMDb's Laurence Fishburne page, they do track all this,
>> >> along with the names of the characters he played:
>> >> http://www.imdb.com/name/nm0000401/ )
>> >>
>> >> With linked data, you can say
>> >>
>> >> Movie1 -- has actor -- LF123
>> >> Movie1 -- has actor's name credited as -- "Laurence Fishburne"
>> >> LF123 -- has been credited as -- "Laurence Fishburne"
>> >>
>> >> But you can't get all three of those things to connect up, at least not
>> >> without using blank nodes, which then makes your data not so shareable.
>> >> So
>> >> far as I can see, anytime you want to connect the dots between more
>> >> than two
>> >> pieces of information or say something about a statement, it doesn't
>> >> work so
>> >> well with triples. This might not be such a problem for linking, but I
>> >> think
>> >> there are other things we want to do with our data where we may want
>> >> this
>> >> ability.
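One common way to connect all three is to introduce an intermediate node for the credit itself, so that every statement is still a plain triple. Below is a rough sketch with triples as Python tuples. All the `ex:` names (`ex:hasCredit`, `ex:agent`, `ex:creditedAs`, `ex:credit1`, `ex:Movie2`) are made up for illustration and do not come from bibframe or any other vocabulary; the credit nodes are given identifiers here rather than left blank.

```python
# Each triple is (subject, predicate, object). All "ex:" names are hypothetical.
TRIPLES = [
    ("ex:Movie1", "ex:hasCredit", "ex:credit1"),
    ("ex:credit1", "ex:agent", "ex:LF123"),
    ("ex:credit1", "ex:creditedAs", "Laurence Fishburne"),
    ("ex:Movie2", "ex:hasCredit", "ex:credit2"),
    ("ex:credit2", "ex:agent", "ex:LF123"),
    ("ex:credit2", "ex:creditedAs", "Larry Fishburne"),
]


def objects(triples, subject, predicate):
    """Return all objects for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def credited_names(triples, movie, person):
    """Names under which `person` is credited in `movie`, found via the credit node."""
    names = []
    for credit in objects(triples, movie, "ex:hasCredit"):
        if person in objects(triples, credit, "ex:agent"):
            names.extend(objects(triples, credit, "ex:creditedAs"))
    return names


print(credited_names(TRIPLES, "ex:Movie1", "ex:LF123"))
print(credited_names(TRIPLES, "ex:Movie2", "ex:LF123"))
```

The cost is one extra node and hop per credit, and the pattern only stays shareable if those nodes get stable identifiers, which is the trade-off raised above.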
>> >>
>> >> What happens if we implement bibframe and we don't store and maintain
>> >> our
>> >> data as bibframe triples? We could just keep generating bibframe from
>> >> MARC
>> >> records, but then we haven't really replaced MARC or gotten more
>> >> flexible,
>> >> structured data than we already have.
>> >>
>> >> Alternatively, ILS vendors could come up with another internal format
>> >> for
>> >> us to store data in. However, I don't know that they have the right
>> >> expertise for this nor any economic incentives. If this happened, we
>> >> would
>> >> also end up with much less portable data. Imagine if bib records were
>> >> like
>> >> item records and every system had its proprietary format and unique
>> >> combination of fields. Anytime you do an ILS migration, there is a lot
>> >> of
>> >> item data that can't be moved to the new system, either because it's
>> >> structured differently or because there is no equivalent field.
>> >>
>> >> This may be completely wrong-headed and I think I'm using the wrong
>> >> vocabulary some places, but I thought I'd throw it out there in case
>> >> someone
>> >> can enlighten me.
>> >>
>> >> Kelley
>
>
>
>
> --
> Rob Sanderson
> Information Standards Advocate
> Digital Library Systems and Services
> Stanford, CA 94305