
Layered Data: A Paper & Some Commentary

Two years ago I wrote a short paper about “layering” data but for various reasons never got round to putting it online. The paper tried to capture some of my thinking at the time about the opportunities and approaches for publishing and aggregating data on the web. I’ve finally got around to uploading it and you can read it here.

I’ve made a couple of minor tweaks in a few places but I think it stands up well, even given the recent pace of change around data publishing and re-use. I still think the abstraction that it describes is not only useful but necessary to take us forward on the next wave of data publishing.

Rather than edit the paper to bring it completely up to date with recent changes, I thought I’d publish it as is and then write some additional notes and commentary in this blog post.

You’re probably best off reading the paper, then coming back to the notes here. The illustration referenced in the paper is also now up on slideshare.

RDF & Layering

I see that the RDF Working Group, prompted by Dan Brickley, is now exploring the term. I should acknowledge that I also heard the term “layer” in conjunction with RDF from Dan, but I’ve tried to explore the concept from a number of perspectives.

The RDF Working Group may well end up using the term “layer” to mean a “named graph”. I’m using the term much more loosely in my paper. In my view an entire dataset could be a layer, as well as some easily identifiable sub-set of it. My usage might therefore be closer to Pat Hayes’s concept of a “Surface”, but I’m not sure.

I think that RDF is still an important factor in achieving the goal I outlined of allowing domain experts to quickly assemble aggregates through a layering metaphor. Or, if not RDF, then I think it would need to be based around a graph model, ideally one with a strong notion of identity. I also think that mechanisms to encourage sharing of both schemas and annotations are also useful. It’d be possible to build such a system without RDF, but I’m not sure why you’d go to the effort.

User Experience

One of the things that appeals to me about the concept of layering is that there are some nice ways to create visualisations and interfaces to support the creation, management and exploration of layers. It's not hard to see how, given some descriptive metadata for a collection of layers, you could create:

  • A drag-and-drop tool for creating and managing new composite layers
  • An inspection tool that would let you explore how the dataset for an application or visualisation has been constructed, e.g. to explore provenance or to support sharing and customization. Think “view source” for data aggregation.
  • A recommendation engine that suggested new useful layers that could be added to a composite, including some indication of what additional query options might become available

There’s been some useful work done on describing datasets within the Linked Data community: VoiD and DCat for example. However there’s not yet enough data routinely available about the structure and relationships of individual datasets, nor enough research into how to provide useful summaries.

This is what prompted my work on an RDF Report Card to try and move the conversation forward beyond simply counting triples.

To start working with layers, we need to understand what each layer contains and how they relate to and complement one another.
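To make that slightly more concrete, here's a minimal sketch (using rdflib in Python, with entirely made-up URIs and numbers) of the kind of VoiD description that would help: basic statistics for a layer, the vocabularies it uses, and a linkset recording how it connects to another dataset.

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

VOID = Namespace("http://rdfs.org/ns/void#")
DCT = Namespace("http://purl.org/dc/terms/")

g = Graph()

# A hypothetical "layer" described as a VoiD dataset
layer = URIRef("http://example.org/datasets/museum-annotations")
g.add((layer, RDF.type, VOID.Dataset))
g.add((layer, DCT.title, Literal("Museum annotations layer")))
g.add((layer, VOID.triples, Literal(120000, datatype=XSD.integer)))
g.add((layer, VOID.vocabulary, URIRef("http://xmlns.com/foaf/0.1/")))

# A linkset records how this layer relates to another dataset,
# which is exactly the relationship a layer browser could surface
links = URIRef("http://example.org/datasets/museum-annotations/reference-links")
g.add((links, RDF.type, VOID.Linkset))
g.add((links, VOID.subjectsTarget, layer))
g.add((links, VOID.objectsTarget, URIRef("http://example.org/datasets/reference-layer")))
g.add((links, VOID.linkPredicate, URIRef("http://www.w3.org/2002/07/owl#sameAs")))

print(g.serialize(format="turtle"))

Even this small amount of metadata would be enough to start driving the kind of layer browser or recommendation tool sketched above.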

Linked Data & Layers

In the paper I suggest that RDF & Linked Data alone aren’t enough and that we need systems, tools and vocabularies for capturing the required descriptive data and enabling the kinds of aggregation I envisage.

I also think that the Linked Data community is spending far too much effort on creating new identifiers for the same things and worrying how best to define equivalences.

I think the leap of faith that's required, and that people like the BBC have already taken, is that we just need to get much more comfortable re-using other people's identifiers and publishing annotations. Yes, there will be times when identifiers diverge, but there's a lot to be gained, especially in terms of efficiency around data curation, from focusing on the value-added data rather than re-publishing yet another copy of a core set of facts.

There are efficiency gains to be had for existing businesses, as well as faster routes to market for startups, if they can reliably build on some existing data. I suspect that there are also businesses that currently compete with one another — because they're having to compile or re-compile the same core data assets — that could actually complement one another if they could instead focus on the data curation or collection tasks at which they excel.

Types of Data

In the paper I set out seven different facets which I think cover the majority of types of data that we routinely capture and publish. I think the classification could be debated, but it's a reasonable first attempt.

The intention is to illustrate that we can usefully group together different types of data, and that organisations may be particularly good at creating or collecting particular types. There's scope for organisations to focus on being really good in a particular area, and by avoiding needless competition around collecting and re-collecting the same core facts there are almost certainly efficiency gains and cost savings to be had.

I’m sure there must be some prior work in this space, particularly around the core categories, so if anyone has pointers please share them.

There are also other ways to usefully categorise data. One area that springs to mind is how the data itself is collected, i.e. its provenance. E.g. is it collected automatically by sensors, or as a side-effect of user activity, or entered by hand by a human curator? Are those curators trained or are they self-selected contributors? Is the data derived from some form of statistical analysis?

I had toyed with provenance as a distinct facet, but I think it's an orthogonal concern.

Layering & Big Data

A lot has happened in the last two years and I winced a bit at all of the Web 2.0 references in the paper. Remember that? If I were writing this now then the obvious trend to discuss as context to this approach is Big Data.

Chatting with Matt Biddulph recently he characterised a typical Big Data analysis as being based on “Activity Data” and “Reference Data”. Matt described reference data as being the core facts and information on top of which the activity data — e.g. from users of an application — is added. The analysis then draws on the combination to create some new insight, i.e. more data.

I referenced Matt's characterisation in my Strata talk (with acknowledgement!). Currently Linked Data does really well in the Reference category but there's not a great deal of Activity data. So while it's potentially useful in a Big Data world, there's a lot of value still not being captured.

I think Matt’s view of the world chimes well with both the layered data concept and the data classifications that I’ve proposed. Most of the facets in the paper really define different types of Reference data. The outcome of a typical Big Data analysis is usually a new facet, an obvious one being “Comparative” data, e.g. identifying the most popular, most connected, most referenced resources in a network.

However there's clearly a difference in approach between typical Big Data processing and the graph models that I think underpin a layered view of the world.

MapReduce workflows seem to work best with more regular data, although newer approaches like Pregel illustrate the potential for "graph-native" Big Data analysis. But setting that aside, there's no real contention: a layering approach to combining data doesn't say anything about how the data must actually be used; it can easily be projected out into structures that are amenable to indexing and processing in different ways.

Kasabi

Looking at the last section of the paper it should be obvious that much of the origin of this analysis was early preparation for Kasabi.

I still think that there's a great deal of potential to create a marketplace around data layers and tools for interacting with them. But we're not there yet, for several reasons. Firstly, it's taken time to get the underlying platform in place to support that. We've done that now and you can expect more information from more official sources shortly. Secondly, I underestimated how much effort is still required to move the market forward: there's still lots to be done to support organisations in opening up data before we can really explore more horizontal marketplaces. But that is a topic for another post.

This has been quite a ramble of a blog post but hopefully there are some useful thoughts here that chime with your own experience. Let me know what you think.


RDF and JSON: A Clash of Model and Syntax

I had been meaning to write this post for some time. After reading Jeni Tennison's post from earlier this week I had decided that I didn't need to, but Jeni and Thomas Roessler suggested I publish my thoughts. So here they are. I've got more things to say about where efforts should be expended in meeting the challenges that face us over the next period of growth of the semantic web, but I'll keep those for future posts.

Everyone agrees that a JSON serialization of RDF is a Good Thing. And I think nearly everyone would agree that a standard JSON serialization of RDF would be even better. The problem is no-one can agree on what constitutes a good JSON serialization of RDF. As the RDF Next Working Group is about to convene to try and define a standard JSON serialization now is a very good time to think about what it is we really want them to achieve.

RDF in JSON is RDF in XML all over again

There are very few people who like RDF/XML. Personally, while it's not my favourite RDF syntax, I'm glad it's there for when I want to convert XML formats into RDF. I've even built an entire RDF workflow that began with the ingestion of RDF/XML documents; we even validated them against a schema!

There are several reasons why people dislike RDF/XML.

Firstly, there is a mis-match in the data models: serialization involves turning a graph into a tree. There are many different ways to achieve that so, without applying some external constraints, the output can be highly variable. The problem is that those constraints can be highly specific, so are difficult to generalize. This results in a high degree of syntax variability of RDF/XML in the wild, and that undermines the ability to use RDF/XML with standard XML tools like XPath, XSLT, etc. They (unsurprisingly) operate only on the surface XML syntax not the “real” data model.

Secondly, people dislike RDF/XML because of the mis-match in (loosely speaking) the native data types. XML is largely about elements and attributes whereas RDF has resources, properties, literals, blank nodes, lists, sequences, etc. And of course there are those ever present URIs. This leads to additional syntax short-cuts and hijacking of features like XML Namespaces to simplify the output, whilst simultaneously causing even more variability in the possible serializations.

Thirdly, when it comes to parsing, RDF/XML just isn’t a very efficient serialization. It’s typically more verbose and can involve much more of a memory overhead when parsing than some of the other syntaxes.

Because of these issues, we end up with a syntax which, while flexible, requires some profiling to be really useful within an XML toolchain. Or you just ignore the fact that it's XML at all and throw it straight into a triple store, which is what I suspect most people do. If you do that then an XML serialization of RDF is just a convenient way to generate RDF data from an XML toolchain.

Unfortunately when we look at serializing RDF as JSON we discover that we have nearly all of the same issues. JSON is a tree, so we have the same variety of potential options for serializing any given graph. The data types are also still different: key-value pairs, hashes, lists, strings, dates (of a form!), etc. versus resources, properties, literals, etc. While there is potential to use more native datatypes, the practical issues of repeatable properties, blank nodes, etc. mean that a 1:1 mapping isn't feasible. Lack of support for anything like XML Namespaces means that hiding URIs is also impossible without additional syntax conventions.
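To illustrate the point, here's a quick sketch (as Python literals, with invented data) of two equally defensible JSON shapes for the same two triples; nothing in JSON itself tells you which one is "the" serialization of the graph.

# Two triples about the same resource:
#   <http://example.org/book/1> dct:title "RDF and JSON" .
#   <http://example.org/book/1> dct:creator <http://example.org/person/2> .

# Shape 1: subject-keyed, full property URIs, typed object structures
shape_one = {
    "http://example.org/book/1": {
        "http://purl.org/dc/terms/title": [
            {"type": "literal", "value": "RDF and JSON"}
        ],
        "http://purl.org/dc/terms/creator": [
            {"type": "uri", "value": "http://example.org/person/2"}
        ],
    }
}

# Shape 2: nested objects, short keys, URIs hidden behind a local convention
shape_two = {
    "id": "http://example.org/book/1",
    "title": "RDF and JSON",
    "creator": {"id": "http://example.org/person/2"},
}

# Both encode the same graph; only additional conventions (or an API)
# tell a consumer how to get back to subjects, predicates and objects.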

So, ultimately, both XML and JSON are poor fits for handling RDF. I think most people would agree that a specific format like Turtle is much easier to work with. It’s also better as starting point for learning RDF because most of the syntax is re-used in SPARQL. That’s why standardising Turtle, ideally extended to support Named Graphs, needs to be the first item on the RDF Next Working Group’s agenda.

What do we actually want?

What purpose are we trying to achieve with a JSON serialization of RDF? I’d argue that there are several goals:

  1. Support for scripting languages: Provide better support for processing RDF in scripting languages
  2. Creating convergence: Build some convergence around the dizzying array of existing RDF in JSON proposals, to create consistency in how data is published
  3. Gaining traction: Make RDF more acceptable for web developers, with the hope of increasing engagement with RDF and Linked Data

I don’t think that anyone considers a JSON serialization of RDF as a better replacement for RDF/XML. I think everyone is looking to Turtle to provide that.

I also don’t think that anyone sees JSON as a particularly efficient serialization of RDF, particularly for bulk loading. It might be, but I think N-Triples (a subset of Turtle) fulfills that niche already: it’s easy to stream and to process in parallel.

Let's look at each of those goals in turn.

Support for scripting languages

Unarguably it's much, much easier to process JSON in scripting languages like Javascript, Ruby and PHP than RDF/XML.

Parser support for JSON is ubiquitous as it's the syntax du jour, just as XML was when the RDF specifications were being written. Typically JSON parsing is also much more efficient; that's especially true when we look at Javascript in the browser.

From that perspective RDF in JSON is an instant win as it will simplify consumption of Linked Data and the results of SPARQL CONSTRUCT and DESCRIBE queries. There are other issues with getting wide-spread support for RDF across different programming languages, e.g. proper validation of URIs, but fast parsing of the basic data structure would be a step in the right direction.

Creating Convergence

I think I've seen a dozen or more different RDF in JSON proposals. There's a list on the ESW wiki and some comparison notes on the Talis Platform wiki, but I don't think either is complete. If I get the chance I'll update them. The sheer variety confirms my earlier points about the mis-matches between models: everyone has their own conception of what constitutes a useful JSON serialization.

Because there are fewer syntax options in JSON, the proposals run the full spectrum from capturing the full RDF model but making poor use of JSON syntax, through to making good use of JSON syntax but at the cost of either ignoring aspects of the RDF model or layering additional syntax conventions on top of JSON itself. As an aside, I find it interesting that so many people are happy with subsetting RDF to achieve this one goal.

The thing we should recognise is that none of the existing RDF in JSON formats are really useful without an accompanying API. I’ve used a number of different formats and no matter what serialization I’ve used I’ve ended up with helper code that simplifies some or all of the following:

  • Lookup of all properties of a single resource
  • Mapping between URIs and short names (e.g. CURIES or locally defined keys) for properties
  • Mapping between conventions for encoding particular datatypes (or language annotations) and native objects in the scripting language
  • Cross-referencing between subjects and objects; and vice-versa
  • Looking up all values of a property or a single value (often the first)

In addition, if I'm consuming the results of multiple requests then I may also end up with a custom data structure and code for merging together different descriptions. Even if it's just an array of parsed JSON documents and code to perform the above lookups across that collection.
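To give a flavour of the kind of helper layer I keep re-writing, here's a minimal Python sketch over data parsed from a subject → predicate → list-of-objects structure (the shape used by RDF/JSON); the class and method names are mine, not part of any proposed standard.

class GraphView:
    """Minimal helper over parsed RDF-in-JSON documents (subject -> predicate -> [objects])."""

    def __init__(self, docs, prefixes=None):
        self.docs = docs                  # list of parsed JSON documents
        self.prefixes = prefixes or {}    # e.g. {"foaf": "http://xmlns.com/foaf/0.1/"}

    def expand(self, name):
        # Map a CURIE-style short name to a full property URI
        if ":" in name and not name.startswith("http"):
            prefix, local = name.split(":", 1)
            if prefix in self.prefixes:
                return self.prefixes[prefix] + local
        return name

    def properties(self, subject):
        # Merge the descriptions of a resource across all loaded documents
        merged = {}
        for doc in self.docs:
            for prop, values in doc.get(subject, {}).items():
                merged.setdefault(prop, []).extend(values)
        return merged

    def values(self, subject, prop):
        return self.properties(subject).get(self.expand(prop), [])

    def first_value(self, subject, prop):
        values = self.values(subject, prop)
        return values[0]["value"] if values else None

    def referencing(self, target):
        # Reverse lookup: which subjects point at this resource, and via which property?
        for doc in self.docs:
            for subject, props in doc.items():
                for prop, values in props.items():
                    if any(v.get("type") == "uri" and v.get("value") == target for v in values):
                        yield subject, prop

Something like view.first_value("http://example.org/person/2", "foaf:name") is then all most templating code needs; the serialization underneath becomes an implementation detail.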

So, while we can debate the relative aesthetics of different approaches, I think it's focusing attention on the wrong areas. What we should really be looking at is an API for manipulating RDF. One that will work in Javascript, Ruby or PHP. While I acknowledge the lingering horror of the DOM, I think the design space here is much simpler. Maybe I'm just an optimist!

If we take this approach then what we need is a JSON serialization of RDF that covers as much of the RDF model as possible and, ideally, is already as well supported as possible. From what I've seen the RDF/JSON serialization is actually closest to that ideal. It's supported in a number of different parsing and serialising libraries already and only needs to be extended to support blank nodes and Named Graphs, which is trivial to do. While it's not the prettiest serialization, given a vote, I'd look at standardising that and moving on to focus on the more important area: the API.

Gaining Traction

Which brings me to the last use case. Can we create a JSON serialization of RDF that will help Linked Data and RDF get some traction in the wider web development community?

The answer is no.

If you believe that the issues with gaining adoption are purely related to syntax then you’re not listening to the web developer community closely enough. While a friendlier syntax may undoubtedly help, an API would be even better. The majority of web developers these days are very happy indeed to work with tools like JQuery to handle client-side scripting. A standard JQuery extension for RDF would help adoption much more than spending months debating the best way to profile the RDF model into a clean JSON serialization.

But the real issue is that we’re asking web developers to learn not just new syntax but also an entirely new way to access data: we’re asking them to use SPARQL rather than simple RESTful APIs.

While I think SPARQL is an important and powerful tool in the RDF toolchain I don’t think it should be seen as the standard way of querying RDF over the web. There’s a big data access gulf between de-referencing URIs and performing SPARQL queries. We need something to fill that space, and I think the Linked Data API fills that gap very nicely. We should be promoting a range of access options.
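To make the gulf explicit, here's a rough sketch of the three styles of access side by side; the host and the filter-style URL in the middle are purely illustrative, not a real endpoint or a definitive rendering of the Linked Data API.

import requests

BASE = "http://example.org"   # illustrative host

# 1. De-referencing: everything about one known resource
book = requests.get(BASE + "/doc/book/1234",
                    headers={"Accept": "text/turtle"})

# 2. A simple parameterised, read-only API over the data:
#    the kind of middle ground a Linked Data API configuration can provide
recent = requests.get(BASE + "/api/books",
                      params={"publishedAfter": "2010", "_page": "1"})

# 3. Full SPARQL for arbitrary queries across the dataset
query = """
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?book ?title WHERE {
  ?book dct:title ?title ;
        dct:issued ?date .
  FILTER (?date > "2010-01-01"^^<http://www.w3.org/2001/XMLSchema#date>)
} LIMIT 10
"""
results = requests.get(BASE + "/sparql",
                       params={"query": query},
                       headers={"Accept": "application/sparql-results+json"})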

I have similar doubts about SPARQL Update as the standard way of updating triple stores over the web, but that’s the topic of another post.

Summing Up

As the RDF Next Working Group gets underway I think it needs to carefully prioritise its activities to ensure that we get the most out of this next phase of development and effort around the Semantic Web specifications. It’s particularly crucial right now as we’re beginning to see the ideas being adopted and embraced more widely. As I’ve tried to highlight here, I think there’s a lot of value to be had in having a standard JSON serialization of RDF. But I don’t think that there’s much merit in attempting to create a clean, simple JSON serialization that will meet everyone’s needs.

Standardising Turtle and an API for manipulating RDF data have more value in my view. RDF/JSON as a well-implemented specification meets the core needs of the semantic web developer; a simple scripting API meets the needs of everyone else.


Gridworks Reconciliation API Implementation

Gridworks is a really fantastic tool and there’s scope to extend it in all kinds of interesting ways. Jeni Tennison has recently published a great blog post describing how to use Gridworks for generating Linked Data. I strongly encourage you to read her posting as it not only provides a good introduction to Gridworks itself, but also shows a nice real world example of generating RDF using its built-in data cleaning and templating tools.

I was lucky enough to meet David Huynh at a workshop recently and chatted to him briefly about another aspect of Gridworks: its ability to match field values in a dataset to entities in Freebase, e.g. identifying a place based on just its name. Within Gridworks this process is known as "reconciliation".

Reconciliation is an important step for generating good Linked Data as you’ll often need to correlate values in a dataset with URIs in existing datasets in order to generate links. E.g. matching company names to their URIs. While it is possible to generate identifiers algorithmically during a conversion this typically just defers the reconciliation work until a later stage, when you carry out cross-linking to introduce equivalence links.

Recognising that the ability to introduce new reconciliation services would be a powerful extension to Gridworks, David Huynh has been creating a draft specification that will allow third-parties to create and deploy their own reconciliation services. He’s been documenting his progress on implementing the client side of this protocol and has published a testing service.

It occurred to me that the reconciliation API is essentially a structured search over a dataset and thus could be implemented against the search interface exposed by Talis Platform stores. The RSS 1.0 feeds that the Platform returns include enough information to rank and filter results as required by the API.

I’ve created a simple Ruby application, using the Sinatra web framework, that implements the reconciliation API for any Talis Platform store. You can find the code on github if you want to have a play with it. As I note in the README there are some areas where customisation is useful to get the most from the service. So while in principle it can be used against any existing Platform store you can create a simple JSON config to tweak it for particular datasets.
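The real implementation is the Ruby/Sinatra code in that repository, but to show how small a reconciliation service can be, here's a rough Python/Flask sketch of the same idea. The response fields (id, name, type, score, match) follow my reading of the draft specification, and search_store() is a stand-in for whatever structured search you run against your own data.

import json
from flask import Flask, request, jsonify

app = Flask(__name__)

def search_store(text, type_uri=None):
    """Stand-in for a structured search over the underlying store.
    Should yield (uri, label, type_uri, score) tuples, best match first."""
    raise NotImplementedError

@app.route("/reconcile", methods=["GET", "POST"])
def reconcile():
    query = request.values.get("query")
    if not query:
        # No query: return the service metadata document
        return jsonify({
            "name": "Example reconciliation service",
            "identifierSpace": "http://www.example.org/id/",
            "schemaSpace": "http://www.example.org/def/",
        })

    # A query can be a bare string or a small JSON object with an optional type
    # (the batch "queries" mode is omitted from this sketch)
    if query.startswith("{"):
        parsed = json.loads(query)
        text, type_uri = parsed.get("query"), parsed.get("type")
    else:
        text, type_uri = query, None

    results = []
    for uri, label, rdf_type, score in search_store(text, type_uri):
        results.append({
            "id": uri,
            "name": label,
            "type": [rdf_type],
            "score": score,
            "match": len(results) == 0,   # flag the top-ranked hit as an exact match
        })
    return jsonify({"result": results})

if __name__ == "__main__":
    app.run(port=8080)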

There's a live version of the code running on my server here: http://ldodds.com/gridworks/.

That page has a simple API console for carrying out queries, but consult the draft specification for more details. I think I’ve covered all of the basic features (but bug reports welcome!). Consult the README for notes on configuration options and implementation decisions.

As a simple illustration, let's say that I have the value "Bath" in a dataset and want to match that to some area in the UK administrative geography. This information is available from the Linked Data exposed by statistics.data.gov.uk and this happens to be hosted in this platform store. The reconciliation API we need can therefore be found at: http://ldodds.com/gridworks/govuk-statistics/reconcile. An HTTP GET on that location retrieves the service metadata.

If we use the API explorer we can use a simple HTML form to try out examples. Select govuk-statistics from the Store drop-down and then type Bath into the search box. You’ll get this result. This is not very readable by default, so if you’re using Firefox I recommend you install the JSONView extension which provides a nicely formatted display.

Our initial search returns a number of results. The highest ranked of these being the Westminster Constituency for Bath. That seems like a pretty good initial result to me. As it is the most relevant result in the search it’s marked as an exact match, so once integrated with Gridworks it will capture and store the reconciled identifier for you.

However, we may know that, in the imaginary dataset we're working with, a particular field doesn't contain names of constituencies. It may instead refer to a Local Education Authority. We can refine our search by adding the URI that defines that type of resource into the type field in the API explorer.

Try pasting http://statistics.data.gov.uk/def/geography/LocalEducationAuthority into the type field and running the search again. You'll find that this time you get a single result, which is Bath and North East Somerset. Job done.
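The same two searches can be made outside the explorer. Assuming the query parameter behaves as in the draft specification, a few lines of Python reproduce the example:

import json
import requests

endpoint = "http://ldodds.com/gridworks/govuk-statistics/reconcile"

# Unconstrained search: several candidates, the Bath constituency ranked first
print(requests.get(endpoint, params={"query": "Bath"}).json())

# Constrained by type: only Local Education Authorities are returned
query = json.dumps({
    "query": "Bath",
    "type": "http://statistics.data.gov.uk/def/geography/LocalEducationAuthority",
})
print(requests.get(endpoint, params={"query": query}).json())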

Of course, to get the most from this you need to know what URIs you can use for filtering by types (and properties). But this is something that the Gridworks UI will help with. It can integrate with "suggestion services" that can be used to help map values to properties and types within a schema. I'll be looking at how to expose those as my next piece of work.

Hopefully you can see how the overall system works. Feel free to have a play with the API to try it out for yourself. If you have comments on the implementation then I’d love to hear them, but I’d suggest that comments on the specification are best addressed to the gridworks mailing list.

I also suspect the Reconciliation API has uses outside of just Gridworks. For example, I wonder how easy it would be to introduce reconciliation into Google Spreadsheets using Google Apps Script? It's also another nice demonstration of how easy it is to map simple RESTful APIs onto RDF datasets: this implementation works for any data in the Platform, no matter what schema it conforms to. Neat.


Approaches to Publishing Linked Data via Named Graphs

This is a follow-up to my previous post on managing RDF using named graphs. In that post I looked at the basic concept of named graphs, how they are used in SPARQL 1.0/1.1, and discussed RESTful APIs for managing named graphs. In this post I wanted to look at how Named Graphs can be used to support publishing of Linked Data.

There are two scenarios I'm going to explore. The first uses Named Graphs in a way that provides a low friction method for publishing Linked Data. The second prioritizes ease of data management, and in particular the scenario where RDF is being generated by converting from other sources. Let's look at each in turn and their relative merits.

Publishing Scenario #1: One Resource per Graph

For this scenario let's assume that we're building a simple book website. Our URI space is going to look like this:

http://www.example.org/id/{thing}/{id}
http://www.example.org/doc/{thing}/{id}

The first URI is the pattern for identifiers in our system; the second is the URI to which we'll 303 redirect clients in order to get the document containing the metadata about the thing with that identifier. We'll have several types of thing in our system: books, authors, and subjects.

The application will obviously include a range of features such as authentication, registration, search, etc. But I'm only going to look at the Linked Data delivery aspects of the application here, in order to highlight how Named Graphs can support that.

Our application is going to be backed by a triplestore that offers an HTTP protocol for managing Named Graphs, e.g. as specified by SPARQL 1.1. This triplestore will expose graphs from the following base URI:

http://internal.example.org/graphs

The simplest way to manage our application data is to store the data about each resource in a separate Named Graph. Each resource will therefore be fully described in a single graph, so all of the metadata about:

http://www.example.org/id/book/1234

will be found in:

http://internal.example.org/graphs/book/1234

The contents of that graph will be the Concise Bounded Description of http://www.example.org/id/book/1234, i.e. all its literal properties, any related blank nodes, as well as properties referencing related resources.

This means delivering the Linked Data view for this resource is trivial. A GET request to http://www.example.org/doc/book/1234 will trigger our application to perform a GET request to our internal triplestore at http://internal.example.org/graphs/book/1234.

If the triplestore supports multiple serializations then there’s no need for our application to parse or otherwise process the results: we can request the format desired by the client directly from the store and then proxy the response straight-through. Ideally the store would also support ETags and/or other HTTP caching headers which we can also reuse. ETags will be simple to generate as it will be easy to track whether a specific Named Graph has been updated.
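A minimal sketch of that front-end behaviour, written here in Python/Flask purely for illustration (the real application could be built in anything): a 303 redirect from the identifier to the document, then a straight proxy of the document request to the internal graph store, passing along the Accept header and caching headers.

import requests
from flask import Flask, Response, redirect, request

app = Flask(__name__)
GRAPH_STORE = "http://internal.example.org/graphs"

@app.route("/id/<thing>/<ident>")
def identifier(thing, ident):
    # 303 See Other from the thing itself to the document about it
    return redirect("/doc/%s/%s" % (thing, ident), code=303)

@app.route("/doc/<thing>/<ident>")
def document(thing, ident):
    # One resource per graph: the graph URI is derived by simple rewriting
    headers = {"Accept": request.headers.get("Accept", "text/turtle")}
    if "If-None-Match" in request.headers:
        headers["If-None-Match"] = request.headers["If-None-Match"]
    upstream = requests.get("%s/%s/%s" % (GRAPH_STORE, thing, ident), headers=headers)

    # Proxy the body, content type and caching headers straight through
    out_headers = {k: v for k, v in upstream.headers.items()
                   if k.lower() in ("content-type", "etag", "cache-control")}
    return Response(upstream.content, status=upstream.status_code, headers=out_headers)

if __name__ == "__main__":
    app.run(port=8080)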

As the application code to do all this is agnostic to the type of resource being requested, we don’t have to change anything if we were to expand our application to store information about new types of thing. This is the sort of generic behaviour that could easily be abstracted out into a reusable framework.

Another nice architectural feature is that it will be easy to slot in internal load-balancing over a replicated store to spread requests over multiple servers. Because the data is organised into graphs there are also natural ways to “shard” the data if we wanted to replicate the data in other ways.

This gets us a simple Linked Data publishing framework, but does it help us build an application, i.e. the HTML views of that data? Clearly in that case we'll need to parse the data so that it can be passed off to a templating engine of some form. And if we need to compose a page containing details of multiple resources then this can easily be turned into requests for multiple graphs, as there's a clear mapping from resource URI to graph URI.

When we’re creating new things in the system, e.g. capturing data about a new book, then the application will have to handle any newly submitted data, perform any needed validation and generate an RDF graph describing the resource. It then simply PUTs the newly generated data to a new graph in the store. Updates are similarly straight-forward.
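For example, creating a new book might boil down to something like this sketch: build the graph with rdflib and PUT it directly to its graph URI, assuming the store addresses graphs in the SPARQL 1.1 style used throughout this post.

import requests
from rdflib import Graph, Literal, Namespace, URIRef

DCT = Namespace("http://purl.org/dc/terms/")

def create_book(book_id, title, author_uri):
    subject = URIRef("http://www.example.org/id/book/" + book_id)
    g = Graph()
    g.add((subject, DCT.title, Literal(title)))
    g.add((subject, DCT.creator, URIRef(author_uri)))

    # One resource per graph: PUT the description to its own named graph
    response = requests.put(
        "http://internal.example.org/graphs/book/" + book_id,
        data=g.serialize(format="turtle"),
        headers={"Content-Type": "text/turtle"},
    )
    response.raise_for_status()

create_book("1234", "An Example Book", "http://www.example.org/id/author/42")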

If we want to store provenance data, e.g. an update history for each resource, then we can store that in a separate related graph, e.g. http://internal.example.org/graphs/provenance/book/1234.

Benefits and Limitations

This basic approach is simple, effective, and makes good use of the Named Graph feature. Identifying where to retrieve or update data is little more than URI rewriting. It’s well optimised for the common case for Linked Data, which is retrieving, displaying, and updating data about a single resource. To support more complex queries and interactions, ideally our triplestore would also expose a SPARQL endpoint that supported querying against a “synthetic” default graph which consists of the RDF union of all the Named Graphs in the system. This gives us the ability to query against the entire graph but still manage it as smaller chunks.

(Aside: Actually, we’re likely to want two different synthetic graphs: one that merges all our public data, and one that merges the public data + that in the provenance graphs.)

There are a couple of limitations which we'll hit when managing data using this scenario. The first is that the RDF in the Linked Data views will be quite sparse, e.g. the data wouldn't contain the labels of any referenced resources. To be friendly to Linked Data browsers we'll want to include more data. We can work around this issue by performing two requests to the store for each client request: the first to get the individual graph, the second to perform a SPARQL query something like this:


PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

CONSTRUCT {
  <http://www.example.org/id/book/1234> ?p ?referenced.
  ?referenced rdfs:label ?label.
  ?referencing ?p2 <http://www.example.org/id/book/1234>.
  ?referencing rdfs:label ?label2.
} WHERE {
  <http://www.example.org/id/book/1234> ?p ?referenced.
  OPTIONAL {
    ?referenced rdfs:label ?label.
  }
  OPTIONAL {
    ?referencing ?p2 <http://www.example.org/id/book/1234>.
    OPTIONAL {
      ?referencing rdfs:label ?label2.
    }
  }
}

The above query would be executed against the union graph of our triplestore and would let us retrieve the labels of any resources referenced by a specific book (in this case), plus the labels and properties of any referencing resources. This query can be done in parallel to the request for the graph and merged with its RDF by our application framework.

The other limitation is also related to how we've chosen to factor out the data into CBDs. Any time we need to put in reciprocal relationships, e.g. when we add or update resources, then we will have to update several different graphs. This could become expensive depending on the number of affected resources. We could potentially work around that by adopting an Eventual Consistency model and deferring updates using a message queue. This lets us relax the constraint that updates to all resources need to be synchronized, allowing more of that work to be done both asynchronously and in parallel. The same approach can be applied to manage lists of items in the store, e.g. a list of all authors: these can be stored as individual graphs, but regenerated on a regular basis.

The same limitation hits us if we want to do any large scale updates to all resources. In this case SPARUL updates might become more effective, especially if the engine can update individual graphs, although handling updates to the related provenance graphs might be problematic. What I think is interesting is that in this data management model this is the only area in which we might really need something with the power of SPARUL. For the majority of use cases graph level updates using simple HTTP PUTs coupled with a mechanism like Changesets are more than sufficient. This is one reason why I’m so keen to see attention paid to the HTTP protocol for managing graphs and data in SPARQL 1.1: not every system will need SPARUL.

The final limitation relates to the number of named graphs we will end up storing in our triplestore. One graph per resource means that we could easily end up with millions of individual graphs in a large system. I’m not sure that any triplestore is currently handling this many graphs, so there may be some scaling issues. But for small-medium sized applications this should be a minor concern.

Publishing Scenario #2: Multiple Resources per Graph

The second scenario I want to introduce in this posting is one which I think is slightly more conventional. As a result I’m going to spend less time reviewing it. Rather than using one graph per resource, we instead store multiple resources per Named Graph. This means that each Named Graph will be much larger, perhaps including data about thousands of resources. It also means that there may not be a simple mapping from a resource URI to a single graph URI: the triples for each resource may be spread across multiple graphs, although there’s no requirement that this be the case.

Whereas the first scenario was optimised for data that was largely created, managed, and owned by a web application, this scenario is most useful when the data in the store is derived from other sources. The primary data sources may be a large collection of inter-related spreadsheets which we are regularly converting into RDF, and the triplestore is just a secondary copy of the data created to support Linked Data publishing. It should be obvious that the same approach could be used when aggregating existing RDF data, e.g. as a result of a web crawl.

To make our data conversion workflow system easier to manage it makes sense to use a Named Graph per data source, i.e. one for each spreadsheet, rather than one per resource. E.g:



http://internal.example.org/graphs/spreadsheet/A
http://internal.example.org/graphs/spreadsheet/B
http://internal.example.org/graphs/spreadsheet/C


The end result of our document conversion workflow would then be the updating or replacing of a single specific Named Graph in the system. The underlying triplestore in our system will need to expose a SPARQL endpoint that includes a synthetic graph which is the RDF union of all graphs in the system. This ensures that where data about an individual resource might be spread across a number of underlying graphs, a union view is available where required.

As noted in the first scenario we can store provenance data in a separate related graph, e.g. http://internal.example.org/graphs/provenance/spreadsheet/A.

Benefits and Limitations

From a data publishing point of view our application framework can no longer use URI rewriting to map a request to a GET on a Named Graph. It must instead submit SPARQL DESCRIBE or CONSTRUCT queries to the triplestore, executing them against the union graph. This lets the application ignore the details of the organisation and identifiers of the Named Graphs in the store when retrieving data.

If the application is going to support updates to the underlying data then it will need to know which Named Graph(s) must be updated. This information should be available by querying the store to identify the graphs that contain the specific triple patterns that must be updated. SPARUL request(s) can then be issued to apply the changes across the affected graphs.
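A sketch of that lookup from Python, using a plain SPARQL query against the store (the endpoint URL is illustrative):

import requests

ENDPOINT = "http://internal.example.org/sparql"   # illustrative endpoint URL

def graphs_containing(subject_uri):
    # Find every named graph that mentions this subject,
    # so that an update can be applied to each of them
    query = """
    SELECT DISTINCT ?g WHERE {
      GRAPH ?g { <%s> ?p ?o }
    }
    """ % subject_uri
    response = requests.get(
        ENDPOINT,
        params={"query": query},
        headers={"Accept": "application/sparql-results+json"},
    )
    response.raise_for_status()
    bindings = response.json()["results"]["bindings"]
    return [b["g"]["value"] for b in bindings]

print(graphs_containing("http://www.example.org/id/book/1234"))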

The difficulty of co-ordinating updates from the application with updates from the document conversion (or crawling) workflow means that this scenario may be best suited for read-only publishing of data.

It's clear that this approach is much more optimised to support the underlying data conversion and/or collection workflows than the publishing web application. The trade-off doesn't add much more complexity to the application implementation, but it doesn't exhibit some of the same architectural benefits, e.g. easy HTTP caching, data sharding, etc., that the first model exhibits.

Summary

In this blog post I’ve explored two different approaches to managing and publishing RDF data using Named Graphs. The first scenario described an architecture that used Named Graphs in a way that simplified application code whilst exposing some nice architectural properties. This was traded off against ease of data management for large scales updates to the system.

The second scenario was more optimised for data conversion & collection workflows and is particularly well suited for systems publishing Linked Data derived from other primary sources. This flexibility was traded off against a slightly more complex application implementation.

My goal has been to try to highlight different patterns for using Named Graphs and how those patterns place greater or lesser emphasis on features such as RESTful protocols for managing graphs, and different styles of update language. In reality an application might mix together both styles in different areas, or even at different stages of its lifecycle.

If you’re using Named Graphs in your applications then I’d love to hear more about how you’re making use of the feature. Particularly if you’ve layered on additional functionality such as versioning and other elements of workflow.

Better understanding of how to use these kinds of features will help the community begin assembling good application frameworks to support Linked Data application development.
