+1

Thanks,
Shlomo

Sent from my iPad

On Apr 17, 2015, at 10:47, Ross Singer <[log in to unmask]> wrote:

Martynas,

I don't think I've tried to present it otherwise.  We *want* the data layer to be a silo, honestly - we use the application layer to expose the linked data.  Of the list of things our customers care about, a SPARQL endpoint is fairly near the bottom: yes, it's been asked for, but to be honest, those customers crawl their linked data and put it in their own triple store (or *whatever*), anyway, so it's a pretty low priority (compared to other requests) even for them.

Our architecture also works orders of magnitude better than triple stores did for our use cases, and I can't figure out your objection to that, honestly.

What's *your* use case, exactly?
-Ross.

On Fri, Apr 17, 2015 at 1:22 AM Martynas Jusevičius <[log in to unmask]> wrote:
Ross,

in this case you're operating a good old data silo, with an RDF export
capability.

Besides solving the integration issue, RDF also opens a way to
gradually move business rules and user interaction logic from
(imperative) code to (declarative) data. It enables new generic
software design patterns, which I hesitated to go into on this list.
But you can take a look at our SWAT4LS paper about the Graphity platform
architecture:
http://ceur-ws.org/Vol-1320/paper_30.pdf

Martynas

On Tue, Apr 14, 2015 at 9:21 AM, Ross Singer <[log in to unmask]> wrote:
> Martynas,
>
> You have to understand, *we abandoned using a triple store after using one
> for many years*. Advancing RDF and Linked Data was the primary tenet of the
> company and everything we produced used the triple store in some capacity.
> This wasn't an issue of not understanding RDF or not having used SPARQL in
> anger.
>
> But in the end we abandoned using a native triple store because the things
> that they're good at - storing and retrieving data from various sources with
> unbounded shapes and performing ad hoc queries on it - accounted for less
> than 1% of our use cases.
>
> For the other 99%, triple stores proved inefficient and awkward and required
> far more operational scaffolding to scale.
>
> To answer your question about querying our data, we pre-compute the vast
> majority of our common joins into multi-describe graphs, somewhat analogous
> to an RDBMS view. We also pre-compute tabular data out of the graphs for
> SPARQL SELECT-like functionality. The rest we can just query in Mongo if we
> want to, but we rarely have a need to (tracking down a support problem,
> maybe). It does not support SPARQL, no. There's no use case for it.
>
> The point is that our data is all modeled to ingest and export RDF quads, so
> we're not locked into anything, and when we need to run ad hoc queries, we
> can ingest our data into a Fuseki instance we run - but that's for a specific
> need, which is completely different from the general workflow and operation
> of everything else.
>
> -Ross.
>
>
> On Monday, April 13, 2015, Martynas Jusevičius <[log in to unmask]>
> wrote:
>>
>> Ross,
>>
>> I wonder how you query your MongoDB store? I don't suppose it supports
>> SPARQL?
>>
>> On Mon, Apr 13, 2015 at 11:51 PM, Ross Singer <[log in to unmask]>
>> wrote:
>> > Kelley,
>> >
>> > I stick by the notion that as long as a system can ingest/export data as
>> > well-formed RDF graphs, how you store it internally makes no difference.
>> > As
>> > I've said before, our biggest product is modeled entirely using RDF, but
>> > we
>> > store the data in a MongoDB document database, because it fits our
>> > actual
>> > needs better than a triple store.
>> >
>> > It is possible, btw, to produce an ordered list for author/editor names in
>> > RDF, but it's horribly ugly: you can use rdf:Seq
>> > http://www.w3.org/TR/rdf-schema/#ch_seq or rdf:List
>> > http://www.w3.org/TR/rdf-schema/#ch_list.  They each have their pluses and
>> > minuses: rdf:List is absolutely awful to work with in any serialization
>> > except Turtle (where it's super easy! see:
>> > http://www.w3.org/2007/02/turtle/primer/#L2986), but it has the advantage
>> > of being semantically closed.  That is, you can definitively say "these are
>> > *all* of the authors and there are no more".
>> >
>> > rdf:Seq (which is an rdf:Container) is considered open (i.e. nothing rules
>> > out additional members of the same container turning up somewhere else)
>> > and, unfortunately, it has no syntactic sugar like Collections have in
>> > Turtle.
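>> >
>> > Just to make the difference concrete, here's a minimal Turtle sketch of
>> > both (the ex: prefix, URIs, and property names are made up for
>> > illustration):
>> >
>> > @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>> > @prefix ex:  <http://example.org/> .
>> >
>> > # rdf:List (a Collection): the parentheses are Turtle sugar for
>> > # rdf:first/rdf:rest triples, and rdf:nil closes the list.
>> > ex:book1 ex:authorList ( ex:author1 ex:author2 ex:author3 ) .
>> >
>> > # rdf:Seq (a Container): no sugar, just numbered membership properties,
>> > # and nothing says the sequence stops at rdf:_3.
>> > ex:book2 ex:authorSeq [
>> >     a rdf:Seq ;
>> >     rdf:_1 ex:author1 ;
>> >     rdf:_2 ex:author2 ;
>> >     rdf:_3 ex:author3
>> > ] .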
>> >
>> > Containers and Collections being such major pain points in RDF, JSON-LD
>> > threw all of it away for a *much* simpler implementation:
>> > http://www.w3.org/TR/json-ld/#sets-and-lists
>> >
>> > All that said, as long as you can serialize your author lists as one of
>> > these, model it however suits your needs the best for your regular
>> > workflows/needs.
>> >
>> > -Ross.
>> >
>> > On Mon, Apr 13, 2015 at 6:12 AM Kelley McGrath <[log in to unmask]>
>> > wrote:
>> >>
>> >> Although much of the discussion on storing bibframe data went over my
>> >> head, some things have been niggling at me for a while that maybe are
>> >> related to this thread.
>> >>
>> >> I get that it would be good for us to publish our data as linked data.
>> >> I
>> >> get that it would be good for us to consume linked data. I get that we
>> >> should re-use other people's URIs in our data to save time and reduce
>> >> maintenance. I get that we should match our identifiers to other
>> >> people's
>> >> URIs in order to connect more information.
>> >>
>> >> However, it has not been clear to me that it makes sense for us to
>> >> store
>> >> and maintain our data as linked data. And yet, I don't see any
>> >> alternative
>> >> plan being developed.
>> >>
>> >> This may be sacrilege, but from what I understand there seem to be
>> >> things
>> >> that linked data isn't good at. For example, retaining the order of
>> >> things
>> >> like authors' names or connecting a specific place and publisher
>> >> written on
>> >> the title page of a book. Sometimes when this has been discussed on
>> >> this
>> >> list, I get the impression that we shouldn't want to do those things;
>> >> that
>> >> they're somehow obsolete.
>> >>
>> >> I can't get my head around that. Maybe you don't need those things for
>> >> linking, but I don't think linking is the only thing that we want to do
>> >> with
>> >> our data. For example, it emerged recently, when MPOW changed to a
>> >> discovery
>> >> layer that didn't do such a good job with this initially, that the
>> >> ability
>> >> to generate citations is hugely important to a significant portion of
>> >> our
>> >> patrons. If you want to generate an accurate citation, you need to know
>> >> the
>> >> order of the authors' names.
>> >>
>> >> It has been suggested to me that we shouldn't be generating citations,
>> >> but
>> >> rather storing them as strings. However, again I seem to be missing
>> >> something because that doesn't seem very optimal to me. Do you store a
>> >> separate string for every format: APA, MLA, Chicago, etc.? What do you
>> >> do
>> >> when a style guide gets updated? It might not be very easy to update
>> >> millions of strings. What if a new citation style is invented and
>> >> becomes
>> >> popular? It just seems to me to be more flexible and powerful to store
>> >> the
>> >> individual pieces of data and generate the citations. On the other
>> >> hand,
>> >> publishing citations as strings might be okay for most use cases.
>> >>
>> >> MARC records are a single unit. If a record has been edited by multiple
>> >> parties, you can't tell who changed what when, which is a challenge for
>> >> trouble-shooting and quality control. Linked data statements are
>> >> atomistic,
>> >> but it sounds to me like it is still hard to say anything much *about*
>> >> the
>> >> statement other than maybe the domain name used by whoever made it. It
>> >> would
>> >> be useful to track more about individual statements, such as when they
>> >> are
>> >> made and whether or not they are currently considered valid (one of the
>> >> problems with bad data in the OCLC master record environment is that
>> >> even if
>> >> you take erroneous information out, all it takes is one batchload to
>> >> put it
>> >> right back in).
>> >>
>> >> As some of you know, I have been working on a project to crowdsource
>> >> the
>> >> parsing of film credits in catalog records (help us out at
>> >> http://olac-annotator.org/ ). One result of this is that we have links
>> >> between transcribed names in records and their authorized form.  It
>> >> occurs
>> >> to me that this might be a useful thing to record proactively. For
>> >> example,
>> >> even in a world of identifiers, we still need to choose one of many
>> >> possible
>> >> versions of a name to display to the user (unless you're going to
>> >> display
>> >> them all at once in some kind of cluster, which is not very
>> >> user-friendly in
>> >> many situations). In library cataloging, traditionally, for people the
>> >> most
>> >> common or the most recent variation is chosen as the preferred one.
>> >> However,
>> >> if the math changes (i.e. a different form becomes the most common one),
>> >> you have to wait for a person with NACO powers to notice this and fix it.
>> >> This doesn't always happen in a timely fashion.
>> >> In
>> >> his earliest movies, Laurence Fishburne was credited as Larry Fishburne
>> >> so
>> >> this is how his name was established. It then persisted in library
>> >> catalogs
>> >> as Larry Fishburne for long, long after he made the change (I think ten
>> >> years). If you had data like this, the computer could do the math
>> >> and
>> >> display the most current form.
>> >>
>> >> Name on piece        Year   Work
>> >> Larry Fishburne      1984   The Cotton Club
>> >> Larry Fishburne      1985   The Color Purple
>> >> Laurence Fishburne   1993   Searching for Bobby Fischer
>> >> Laurence Fishburne   1999   The Matrix
>> >> Laurence Fishburne   2006   Mission: Impossible III
>> >>
>> >> (if you look at IMDb's Laurence Fishburne page, they do track all this,
>> >> along with the names of the characters he played:
>> >> http://www.imdb.com/name/nm0000401/ )
>> >>
>> >> With linked data, you can say
>> >>
>> >> Movie1  -- has actor  -- LF123
>> >> Movie1 -- has actor's name credited as  -- "Laurence Fishburne"
>> >> LF123  -- has been credited as -- "Laurence Fishburne"
>> >>
>> >> But you can't get all three of those things to connect up, at least not
>> >> without using blank nodes, which then makes your data not so shareable.
>> >> So
>> >> far as I can see, anytime you want to connect the dots between more
>> >> than two
>> >> pieces of information or say something about a statement, it doesn't
>> >> work so
>> >> well with triples. This might not be such a problem for linking, but I
>> >> think
>> >> there are other things we want to do with our data where we may want
>> >> this
>> >> ability.
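>> >>
>> >> To make that concrete, here's roughly what the blank-node workaround
>> >> looks like in Turtle (the ex: prefix, URIs, and property names are made
>> >> up for illustration):
>> >>
>> >> @prefix ex: <http://example.org/> .
>> >>
>> >> # A blank node stands for "this particular credit", so the movie, the
>> >> # actor, and the name as credited can all hang off the same thing.
>> >> ex:Movie1 ex:hasCredit [
>> >>     ex:actor      ex:LF123 ;
>> >>     ex:creditedAs "Laurence Fishburne"
>> >> ] .
>> >>
>> >> That does connect the dots, but only by introducing the blank node that
>> >> makes the data harder to share.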
>> >>
>> >> What happens if we implement bibframe and we don't store and maintain
>> >> our
>> >> data as bibframe triples? We could just keep generating bibframe from
>> >> MARC
>> >> records, but then we haven't really replaced MARC or gotten more
>> >> flexible,
>> >> structured data than we already have.
>> >>
>> >> Alternatively, ILS vendors could come up with another internal format
>> >> for
>> >> us to store data in. However, I don't know that they have the right
>> >> expertise for this nor any economic incentives. If this happened, we
>> >> would
>> >> also end up with much less portable data. Imagine if bib records were
>> >> like
>> >> item records and every system had its proprietary format and unique
>> >> combination of fields. Anytime you do an ILS migration, there is a lot
>> >> of
>> >> item data that can't be moved to the new system, either because it's
>> >> structured differently or because there is no equivalent field.
>> >>
>> >> This may be completely wrong-headed and I think I'm using the wrong
>> >> vocabulary some places, but I thought I'd throw it out there in case
>> >> someone
>> >> can enlighten me.
>> >>
>> >> Kelley