Approaches to Publishing Linked Data via Named Graphs

This is a follow-up to my previous post on managing RDF using named graphs. In that post I looked at the basic concept of named graphs and how they are used in SPARQL 1.0/1.1, and discussed RESTful APIs for managing named graphs. In this post I want to look at how Named Graphs can be used to support the publishing of Linked Data.

There are two scenarios I’m going to explore. The first uses Named Graphs in a way that provides a low-friction method for publishing Linked Data. The second prioritizes ease of data management, in particular for the scenario where RDF is being generated by conversion from other sources. Let’s look at each in turn and at their relative merits.

Publishing Scenario #1: One Resource per Graph

For this scenario let’s assume that we’re building a simple book website. Our URI space is going to look like this:


http://www.example.org/id/thing/id
http://www.example.org/doc/thing/id

The first is the pattern for identifiers in our system; the second is the URI to which we’ll 303 redirect clients in order to get the document containing the metadata about the thing with that identifier. We’ll have several types of thing in our system: books, authors, and subjects.
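
For example, resolving the identifier for a specific book might involve an exchange like this (a sketch only):

GET /id/book/1234 HTTP/1.1
Host: www.example.org
Accept: text/turtle

HTTP/1.1 303 See Other
Location: http://www.example.org/doc/book/1234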

The application will obviously include a range of features such as authentication, registration, search, etc. But I’m only going to look at the Linked Data delivery aspects of the application here, in order to highlight how Named Graphs can support that.

Our application is going to be backed by a triplestore that offers an HTTP protocol for managing Named Graphs, e.g. as specified by SPARQL 1.1. This triplestore will expose graphs from the following base URI:

http://internal.example.org/graphs

The simplest way to manage our application data is to store the data about each resource in a separate Named Graph. Each resource will therefore be fully described in a single graph, so all of the metadata about:

http://www.example.org/id/book/1234

will be found in:

http://internal.example.org/graphs/book/1234

The contents of that graph will be the Concise Bounded Description of http://www.example.org/id/book/1234, i.e. all of its literal properties, any related blank nodes, as well as properties referencing related resources.
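
As an illustration, that graph might contain something like the following (a sketch; the vocabulary and values are hypothetical):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dct: <http://purl.org/dc/terms/> .

<http://www.example.org/id/book/1234>
  rdfs:label "An Example Book" ;
  dct:creator <http://www.example.org/id/author/56> ;
  dct:subject <http://www.example.org/id/subject/78> .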

This means delivering the Linked Data view for this resource is trivial. A GET request to http://www.example.org/doc/book/1234 will trigger our application to perform a GET request to our internal triplestore at http://internal.example.org/graphs/book/1234.

If the triplestore supports multiple serializations then there’s no need for our application to parse or otherwise process the results: we can request the format desired by the client directly from the store and then proxy the response straight through. Ideally the store would also support ETags and/or other HTTP caching headers, which we can also reuse. ETags will be simple to generate as it will be easy to track whether a specific Named Graph has been updated.
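
The mapping from the public request to the store request is little more than URI rewriting with the Accept header passed through, e.g. (a sketch):

GET /doc/book/1234 HTTP/1.1
Host: www.example.org
Accept: text/turtle

becomes an internal request of:

GET /graphs/book/1234 HTTP/1.1
Host: internal.example.org
Accept: text/turtle

with the store’s response, including any Content-Type and ETag headers, proxied straight back to the client.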

As the application code to do all this is agnostic to the type of resource being requested, we won’t have to change anything if we later expand our application to store information about new types of thing. This is the sort of generic behaviour that could easily be abstracted out into a reusable framework.

Another nice architectural feature is that it will be easy to slot in internal load-balancing over a replicated store to spread requests over multiple servers. Because the data is organised into graphs, there are also natural ways to “shard” the data if we want to replicate it in other ways.

This gets us a simple Linked Data publishing framework, but does it help us build an application, i.e. the HTML views of that data? Clearly in that case we’ll need to parse the data so that it can be passed off to a templating engine of some form. And if we need to compose a page containing details of multiple resources then this can easily be turned into requests for multiple graphs, as there’s a clear mapping from resource URI to graph URI.

When we’re creating new things in the system, e.g. capturing data about a new book, the application will have to handle the newly submitted data, perform any needed validation, and generate an RDF graph describing the resource. It then simply PUTs the newly generated data to a new graph in the store. Updates are similarly straightforward.
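
Using the graph store protocol, creating the description of a new book might look something like this (a sketch; the identifier and payload are hypothetical):

PUT /graphs/book/5678 HTTP/1.1
Host: internal.example.org
Content-Type: text/turtle

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dct: <http://purl.org/dc/terms/> .

<http://www.example.org/id/book/5678>
  rdfs:label "A Newly Added Book" ;
  dct:creator <http://www.example.org/id/author/56> .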

If we want to store provenance data, e.g. an update history for each resource, then we can store that in a separate related graph, e.g. http://internal.example.org/graphs/provenance/book/1234.

Benefits and Limitations

This basic approach is simple, effective, and makes good use of the Named Graph feature. Identifying where to retrieve or update data is little more than URI rewriting. It’s well optimised for the common case for Linked Data, which is retrieving, displaying, and updating data about a single resource. To support more complex queries and interactions, ideally our triplestore would also expose a SPARQL endpoint that supported querying against a “synthetic” default graph which consists of the RDF union of all the Named Graphs in the system. This gives us the ability to query against the entire graph but still manage it as smaller chunks.

(Aside: Actually, we’re likely to want two different synthetic graphs: one that merges all our public data, and one that merges the public data + that in the provenance graphs.)

There are a couple of limitations which we’ll hit when managing data using this scenario. The first is that the RDF in the Linked Data views will be quite sparse, e.g. the data won’t contain the labels of any referenced resources. To be friendly to Linked Data browsers we’ll want to include more data. We can work around this issue by performing two requests to the store for each client request: the first to get the individual graph, the second to perform a SPARQL query something like this:


PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

CONSTRUCT {
  <http://www.example.org/id/book/1234> ?p ?referenced.
  ?referenced rdfs:label ?label.
  ?referencing ?p2 <http://www.example.org/id/book/1234>.
  ?referencing rdfs:label ?label2.
} WHERE {
  <http://www.example.org/id/book/1234> ?p ?referenced.
  OPTIONAL {
    ?referenced rdfs:label ?label.
  }
  OPTIONAL {
    ?referencing ?p2 <http://www.example.org/id/book/1234>.
    OPTIONAL {
      ?referencing rdfs:label ?label2.
    }
  }
}

The above query would be executed against the union graph of our triplestore and would let us retrieve the labels of any resources referenced by a specific book (in this case), plus the labels and properties of any referencing resources. This query can be run in parallel with the request for the graph, and its results merged with that RDF by our application framework.

The other limitation is also related to how we’ve chosen to factor the data out into CBDs. Any time we need to add reciprocal relationships, e.g. when we add or update resources, we will have to update several different graphs. This could become expensive depending on the number of affected resources. We could potentially work around that by adopting an Eventual Consistency model and deferring updates using a message queue. This lets us relax the constraint that updates to all resources need to be synchronized, allowing more of that work to be done both asynchronously and in parallel. The same approach can be applied to managing lists of items in the store, e.g. a list of all authors: these can be stored as individual graphs, but regenerated on a regular basis.

The same limitation hits us if we want to do any large-scale updates to all resources. In this case SPARUL updates might become more effective, especially if the engine can update individual graphs, although handling updates to the related provenance graphs might be problematic. What I think is interesting is that, in this data management model, this is the only area in which we might really need something with the power of SPARUL. For the majority of use cases, graph-level updates using simple HTTP PUTs coupled with a mechanism like Changesets are more than sufficient. This is one reason why I’m so keen to see attention paid to the HTTP protocol for managing graphs and data in SPARQL 1.1: not every system will need SPARUL.
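
For instance, a bulk correction applied across every graph might look something like this in SPARQL 1.1 Update syntax (a sketch; the vocabulary and values are hypothetical):

PREFIX ex: <http://www.example.org/vocab/>

DELETE { GRAPH ?g { ?book ex:publisherName "Exmaple Press" } }
INSERT { GRAPH ?g { ?book ex:publisherName "Example Press" } }
WHERE  { GRAPH ?g { ?book ex:publisherName "Exmaple Press" } }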

The final limitation relates to the number of Named Graphs we will end up storing in our triplestore. One graph per resource means that we could easily end up with millions of individual graphs in a large system. I’m not sure that any triplestore is currently handling this many graphs, so there may be some scaling issues. But for small to medium-sized applications this should be a minor concern.

Publishing Scenario #2: Multiple Resources per Graph

The second scenario I want to introduce in this posting is one which I think is slightly more conventional. As a result I’m going to spend less time reviewing it. Rather than using one graph per resource, we instead store multiple resources per Named Graph. This means that each Named Graph will be much larger, perhaps including data about thousands of resources. It also means that there may not be a simple mapping from a resource URI to a single graph URI: the triples for each resource may be spread across multiple graphs, although there’s no requirement that this be the case.

Whereas the first scenario was optimised for data that was largely created, managed, and owned by a web application, this scenario is most useful when the data in the store is derived from other sources. The primary data sources may be a large collection of inter-related spreadsheets which we are regularly converting into RDF, and the triplestore is just a secondary copy of the data created to support Linked Data publishing. It should be obvious that the same approach could be used when aggregating existing RDF data, e.g. as a result of a web crawl.

To make our data conversion workflow easier to manage it makes sense to use a Named Graph per data source, i.e. one for each spreadsheet, rather than one per resource. E.g.:


http://internal.example.org/graphs/spreadsheet/A
http://internal.example.org/graphs/spreadsheet/B
http://internal.example.org/graphs/spreadsheet/C

The end result of our document conversion workflow would then be the updating or replacing of a single specific Named Graph in the system. The underlying triplestore in our system will need to expose a SPARQL endpoint that includes a synthetic graph which is the RDF union of all graphs in the system. This ensures that, where data about an individual resource is spread across a number of underlying graphs, a union view is available where required.

As noted in the first scenario we can store provenance data in a separate related graph, e.g. http://internal.example.org/graphs/provenance/spreadsheet/A.

Benefits and Limitations

From a data publishing point of view our application framework can no longer use URI rewriting to map a request to a GET on a Named Graph. It must instead submit SPARQL DESCRIBE or CONSTRUCT queries to the triplestore, executing them against the union graph. This lets the application ignore the details of the organisation and identifiers of the Named Graphs in the store when retrieving data.
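
For example, a request for http://www.example.org/doc/book/1234 might now be answered by a query as simple as the following, executed against the union graph (a sketch):

DESCRIBE <http://www.example.org/id/book/1234>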

If the application is going to support updates to the underlying data then it will need to know which Named Graph(s) must be updated. This information should be available by querying the store to identify the graphs that contain the specific triple patterns that must be updated. SPARUL request(s) can then be issued to apply the changes across the affected graphs.
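
A graph lookup of this kind might be as simple as the following (a sketch; the resource is hypothetical):

SELECT DISTINCT ?g WHERE {
  GRAPH ?g {
    <http://www.example.org/id/book/1234> ?p ?o.
  }
}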

The difficulty of co-ordinating updates from the application with updates from the document conversion (or crawling) workflow means that this scenario may be best suited to read-only publishing of data.

It’s clear that this approach is optimised much more to support the underlying data conversion and/or collection workflows than the publishing web application. This trade-off doesn’t add much more complexity to the application implementation, but the approach doesn’t exhibit some of the architectural benefits, e.g. easy HTTP caching, data sharding, etc., that the first model does.

Summary

In this blog post I’ve explored two different approaches to managing and publishing RDF data using Named Graphs. The first scenario described an architecture that used Named Graphs in a way that simplified application code whilst exposing some nice architectural properties. This was traded off against ease of data management for large-scale updates to the system.

The second scenario was optimised more for data conversion & collection workflows and is particularly well suited to systems publishing Linked Data derived from other primary sources. This flexibility was traded off against a slightly more complex application implementation.

My goal has been to try to highlight different patterns for using Named Graphs and how those patterns place greater or lesser emphasis on features such as RESTful protocols for managing graphs, and different styles of update language. In reality an application might mix together both styles in different areas, or even at different stages of its lifecycle.

If you’re using Named Graphs in your applications then I’d love to hear more about how you’re making use of the feature. Particularly if you’ve layered on additional functionality such as versioning and other elements of workflow.

Better understanding of how to use these kinds of features will help the community begin assembling good application frameworks to support Linked Data application development.

6 thoughts on “Approaches to Publishing Linked Data via Named Graphs”

  1. I have been thinking about the first case in fairly similar terms for a while. A queue based model with eventual consistency has also been my preferred choice, although I think you may need to build up more general collection objects than full triplestores to efficiently answer specific queries.

    I am glad there is some recognition in the W3C now that it is necessary to talk about RDF and SPARQL outside a pure document context and recognize that the (only significant) use case is on the web and issues about REST, the resource model and updates matter. We need to get away from Prolog in XML to get stuff used.

    • Hi Michael,

Thanks for the pointer, I’ll take a look at the paper. I’ve read your recent blog posts about RDF & REST as it’s an area that I’m particularly interested in: the two are very well aligned in my view.

      Cheers,

      L.

  2. Hi Leigh,

    I’m generally using multiple graphs per resource, where each graph is associated with a certain front-end operation, e.g. the “basic details”-form for resource writes into graph , the “friends”-manager saves data in , interests go into etc.

    This requires a graph look-up when a resource is going to be deleted, but it makes managing sub-chunks of a resource description quite hassle-free. I can simply replace the respective graph w/o messing with the data that’s meant to be kept.

    Caching and RDF serving is then based on the resource view (which is built from multiple graphs), i.e. a new, app-specific graph is exposed to consuming applications.

    To support the larger number of graphs backend-wise, I’ve normalized/separated the “graph” URI from the triple parts so that typical/non-graph queries can work with a compact triple table. On the down-side, certain GRAPH patterns (e.g. in OPTIONALs) become rather painful due to the larger number of table joins.

    • Hi Benji,

      Thanks for the comment, that’s another interesting variant and one that I hadn’t considered.

      L.
