Set Algebra For Updating a Triple Store

Let’s assume we have a stored graph Gstore, and that we have been given another graph of incoming data, Gin, that contains some modifications to a specific sub-graph.
Let’s also assume that we have a function view() that can extract the “equivalent” sub-graph (i.e. the equivalent view) of the original data.
In pseudo code, to apply these updates we do the following:

Gview = view(Gstore)
Gdelete = Gview - Gin
Ginsert = Gin - Gview
Gstore' = Gstore.remove(Gdelete).add(Ginsert)

Job done. The Jena API provides methods for handling the basic operations; see, for example, the difference method. You can also wrap the modifications to Gstore in a transaction.
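To make the set algebra concrete, here is a minimal sketch in plain Python. It is not the Jena API: I'm assuming a graph can be modelled as a set of (subject, predicate, object) tuples, and the `view()` shown here (everything about one subject) and the `ex:`/`dc:` names are purely hypothetical stand-ins.

```python
# Illustrative stand-in only: a graph is a Python set of (s, p, o) tuples.

def view(store, subject):
    """Hypothetical view: every stored triple about one subject."""
    return {t for t in store if t[0] == subject}

def apply_update(store, incoming, subject):
    """Return (new_store, to_delete, to_insert) for an incoming sub-graph."""
    g_view = view(store, subject)
    g_delete = g_view - incoming   # in the stored view but absent from the input
    g_insert = incoming - g_view   # in the input but not yet stored
    return (store - g_delete) | g_insert, g_delete, g_insert

store = {
    ("ex:book1", "dc:title", "Old title"),
    ("ex:book1", "dc:creator", "A. Author"),
    ("ex:book2", "dc:title", "Untouched"),
}
incoming = {
    ("ex:book1", "dc:title", "New title"),
    ("ex:book1", "dc:creator", "A. Author"),
}

new_store, g_delete, g_insert = apply_update(store, incoming, "ex:book1")
```

Only the title triple is deleted and re-added; the unchanged creator triple and everything outside the view are left alone.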
The nice thing is that this is agnostic to the actual data being updated: we don’t care which triples are being added or removed. This differentiates it from the SPARQL Update Language, specifically the MODIFY operation, which requires the patterns being inserted or deleted to be written into the query. Changesets are much the same.
In the above approach the detail of what is being changed (or is being allowed to change) is shifted out of the triple store update code and into the view() function. The extent of the graph that is returned by this function must match that being passed as input. So we’ve defined a specific “document type”. As it turns out this is quite reasonable, as you can generally match a request (e.g. a RESTful service call) to a view based on the identifier of the item to which the content is being posted, its media-type, other service parameters, etc.
In terms of implementing the view() function, it turns out you can go a long way with a SPARQL CONSTRUCT operation. DESCRIBE isn’t suitable as you don’t have control over how the sub-graph is built.
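As a rough illustration of why control over the sub-graph matters, here is a plain-Python sketch of a view() that behaves a little like a CONSTRUCT with a fixed template. It assumes triples are (subject, predicate, object) tuples, and `make_view` plus the predicate names are hypothetical; a real implementation would run a SPARQL query against the store.

```python
def make_view(predicates):
    """Build a view function limited to a fixed set of predicates."""
    allowed = frozenset(predicates)

    def view(graph, subject):
        # Keep only triples about `subject` whose predicate is in the view.
        return {(s, p, o) for (s, p, o) in graph if s == subject and p in allowed}

    return view

# Hypothetical "document type": just the title and description of an item.
summary_view = make_view(["dc:title", "dc:description"])

store = {
    ("ex:item1", "dc:title", "A title"),
    ("ex:item1", "dc:description", "Some text"),
    ("ex:item1", "ex:internalFlag", "true"),   # outside the view
    ("ex:item2", "dc:title", "Another item"),  # different subject
}

g_view = summary_view(store, "ex:item1")
```

Because the view never returns `ex:internalFlag`, an incoming document that omits it cannot cause that triple to be deleted by the set difference.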
I think there are strengths and weaknesses to all of the different approaches to updating RDF stores and suspect that there isn’t going to be a one-size-fits-all approach. For example, SPARQL Update looks like a handy syntax to use when the modifications all follow predictable patterns, e.g. I’m doing parameterized updates to some stored data, much like parameterized updates in a SQL database. Changesets offer some extra functionality around store versioning which doesn’t drop out of the set logic approach (although it could be added).
Oh, and the keen-eyed amongst you will notice that this approach does involve some “thrashing” of updates for bnodes, because they don’t compare as equal. But what ya gonna do?! 🙂
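The bnode thrashing is easy to see in a small sketch. This is plain Python rather than a real RDF library, and it assumes (as is typical) that each parse of a document mints fresh blank node labels; `parse_address` and the `ex:` names are made up for the example.

```python
import itertools

_labels = itertools.count()

def parse_address(street):
    """Simulate parsing a document: each parse mints a fresh blank node label."""
    bnode = f"_:b{next(_labels)}"
    return {("ex:alice", "ex:address", bnode), (bnode, "ex:street", street)}

stored = parse_address("1 High St")
incoming = parse_address("1 High St")   # identical content, parsed a second time

g_delete = stored - incoming
g_insert = incoming - stored
```

Even though nothing has really changed, every triple that mentions the blank node ends up in both the delete set and the insert set.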

Graph Shape Sorting

On Sunday I posted about how constrained views of RDF can be useful in order to document the inputs into an application, validate those inputs, and also manage updates via application of set algebra. I explored the idea that a system may support many such views or “document types” without blessing any as the primary view of the data. And, importantly, that this approach doesn’t ultimately constrain the range of data that you can put into a triple store.
It just occurred to me that there’s another way to explain the concept: a shape sorter.
Photo by ellas dad
A shape sorter can contain many different sizes, shapes, and colours of block. Each can only be put into the box through a specific hole, but once in they’re all mixed together. And one can reach in and pick out any or all of them. Depending on which face of the shape sorter you’re looking at the options may look quite limited. But the sorter as a whole has a lot of different faces and options.
The inside of the box is the triple store. It can contain many different things. Each block is a specific data format or the shape of a specific sub-graph. Passing a block through a shape is the validation process, and the shape sorter offers many different forms of validation.
Useful alternate explanation or excuse to post a pointer to a pretty picture?

Modelling Statistical Publications: Some Notes

Lee Feigenbaum has put together a really nice posting discussing different ways of modelling statistical data using RDF. I wanted to contribute to that discussion and add in a few comments about how I’ve been modelling some of the OECD’s statistical publications using RDF.
Note the emphasis: what I’ve been doing is capturing metadata about individual statistical tables and graphs, their association with specific publications, their metadata, etc. I’ve not attempted to capture the detail of the statistics themselves, but do have a few relevant comments there.
The background to this is that I’m currently technically leading a project to build the latest version of OECD’s electronic library. All of the metadata is stored in RDF, with content available as HTML, PDFs, Excel spreadsheets or as views into the OECD.stat application that the OECD have developed as a power tool for housing and delivering their statistical data.
As Lee discovered in the EuroStat data, regions and countries are core concepts. All of the OECD’s statistical output can be classified by country and region, and these are types defined within our schema. We assign URIs to the countries using either the ISO 3166-1 alpha-2 country code or, in the case of classifying data that refers to countries that no longer exist as a specific entity (e.g. Yugoslavia), the ISO 3166-3 four-letter country code.
A country may be associated with zero or more Regions, using an Is Part Of relationship. A region may be the European Union, OECD member states or other arbitrary grouping. I suspect the same basic requirements will apply to other statistical datasets.
There are some other types of classification that we associate with the tables:

  • An indicator of whether the table is a “comparative table”: e.g. does it include data from multiple countries?
  • An association between the table and a “Table Series”, which constitutes a collection of tables published over time
  • The statistical Variables that the table contains, e.g. GDP
  • A summary of the time range that the table covers, e.g. “2007”, “2005-2007”, “2000, 2002-2005”, etc. These are captured as simple literals for now as we have to do little/no processing on them at this level.

And then there’s the usual collection of title, description, etc. all as multi-lingual literals. All tables are also assigned a DOI to provide a stable link that can be cited in publications. If the table was originally published in a specific Book or journal Article then that relationship is also captured.
Obviously this metadata is, largely, at a level above that which Lee has been exploring, but I thought this might provide some useful context. For anyone looking at capturing statistical data in RDF, there are some other useful places to look at for defining terms and drawing on prior experience.
Firstly the Journal of Economic Literature Classification provides some terms that can be associated with statistical publications to help categorize them. The OECD’s statistical glossary fills a similar role.
Secondly, the Statistical Data and Metadata EXchange (SDMX) initiative is also worth a look. It’s not RDF but, as well as defining XML Schemas and web services for exchanging statistical data, the guidelines include lists of cross-domain concepts and their mappings to those in use by EuroStat, OECD, IMF, etc. So there’s plenty of scope for grounding RDF vocabularies for statistical data in a lot of prior art.
Finally, the OECD have some public documentation about the design and implementation of their “MetaStore” database that supports OECD.stat (it’s a different beast to the Ingenta MetaStore, I should point out). For example, the document “Management of Statistical Metadata at the OECD” (PDF) has some interesting detail about the different types of metadata (structural, technical, publishing) that is stored in these multi-dimensional data cubes.

Document Types in RDF

The notions of a “document” and a “document type” are core concepts in XML. The specification includes a precise description of a document, what it means for a document to be well-formed, valid, and so on. Even if you’re not using a DTD or XML schema, and are just using XML as a syntax for exchanging structured or semi-structured data, the concept of a document is still a useful one. For example, a document has a clear boundary and content, and so there is a limited scope for the data that an application has to deal with.
The ability to define classes of documents (“document types”) brings other benefits: the structure and content of documents can be standardized. The document type becomes both a contract that can be enforced by an application prior to its processing of any given document, and a description of the acceptable inputs of that application.
The concepts of “document” and “document type” are quite general and aren’t limited to XML applications. See, for example, the JSON schema discussion. The same concepts and their attendant benefits also crop up in messaging systems.
But you don’t see much discussion about the concept of a document or its types in RDF applications. Granted, RDF/XML does define a document type for serializing RDF graphs, but we all know that the large variation in how any single RDF graph could be encoded in valid RDF/XML means that the benefits we get from non-RDF XML vocabularies are lost. The document scope can be highly variable, as can content and syntax. Of course it is possible to create “RDF profiles” that constrain the RDF/XML syntax so that an XML schema can be used to validate documents. Jeni Tennison has recently discussed some approaches to this, and I’ve explored the topic myself in the past. In fact I regularly apply it when designing RDF based systems: it is extremely useful (essential) to be able to validate incoming data.
But generally the notion of document types doesn’t sit well with RDF. RDF is a data model for semi-structured data. It assumes an “open world model” in which missing information is not invalid, or as Dan Brickley has put it “missing isn’t broken”. This wild and woolly nature of RDF is, I think, one of the reasons many people struggle with it. As Dan says:

If nothing is mandatory, then how can they write code that knows what to expect?

Dan concludes that posting by suggesting that there are certain bedrocks which application authors can still rely on, e.g. XML+Namespaces, conformance to the RDF model, etc. But lately I’ve come around to the view that we need to go beyond that and offer tighter ways to document, declare and validate data that is being exchanged in RDF applications. I don’t know of any applications that adopt an open world model; quite the opposite in fact. I think there are benefits in looking at the notion of “document” and “document type” in an RDF context. Although “document” may not be the right term here, a better one may be “view”.
So how might we achieve this, and what are the benefits in doing so?
We can use the aforementioned “profiling” option to create a constrained RDF/XML vocabulary that can be validated using an XML schema (of whatever kind). Where two parties need to have an agreed format for data exchange this works well. So, for example, the OECD are supplying us with XML documents according to an XML schema. The documents are valid RDF/XML so we can simply pour them into a triple store for our application to use. Each XML document is basically a packet of RDF that describes one section (or sub-graph) of the entire data set. Those same packets are used as the basic message format for passing between internal components (e.g. the search indexer). So this is one useful application of the document concept in an application which is otherwise entirely RDF-driven and which goes to some length to be agnostic to the details of the data it contains.
In a scenario where there isn’t any prior co-ordination between the parties exchanging data, there are other options. A typical scenario here might be submitting my FOAF document (either directly, or referenced via an OpenId) to register/configure some online service. There are many ways I might structure my FOAF document, so how does the service validate or check that the required data is present? The answer here is SPARQL. SPARQL can be used to validate a graph by testing whether specific graph patterns are present using ASK or CONSTRUCT. It can also be used to CONSTRUCT a constrained “view” of the submitted data that throws away anything that the application isn’t directly interested in. The other side benefit to using SPARQL is that it doesn’t really matter which RDF syntax is being used: the validation and data extraction is happening at the level of the data model, not the syntax.
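Here’s a rough sketch of that validate-then-extract idea. It is a plain-Python stand-in, not real SPARQL: triples are (subject, predicate, object) tuples, `None` acts as a wildcard, and unlike a real query the patterns are matched independently rather than joined on shared variables. The function names and the FOAF data are assumptions made up for the example.

```python
def match(graph, pattern):
    """Triples matching a pattern; None acts as a wildcard."""
    return {t for t in graph
            if all(p is None or p == v for p, v in zip(pattern, t))}

def ask(graph, patterns):
    """ASK-like check: True if every pattern matches at least one triple."""
    return all(match(graph, p) for p in patterns)

def construct(graph, patterns):
    """CONSTRUCT-like view: keep only triples matching one of the patterns."""
    return set().union(*(match(graph, p) for p in patterns))

# What a hypothetical service requires from a submitted FOAF document.
required = [
    (None, "rdf:type", "foaf:Person"),
    (None, "foaf:name", None),
    (None, "foaf:mbox", None),
]

submitted = {
    ("ex:me", "rdf:type", "foaf:Person"),
    ("ex:me", "foaf:name", "Alice"),
    ("ex:me", "foaf:mbox", "mailto:alice@example.org"),
    ("ex:me", "foaf:interest", "ex:knitting"),   # not needed by the service
}

is_valid = ask(submitted, required)
constrained = construct(submitted, required)
```

The check passes because all three required patterns are present, and the constrained view drops the `foaf:interest` triple that the service has no use for.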
We use the technique of defining RDF views using CONSTRUCT elsewhere in our applications, primarily for fetching the data required to present some aspect of the RDF graph to an end user. I’ve described this, and the underlying system and its assumptions, in a recent presentation. Here the “view” or “document type” is used to drive a simple data binding layer, and is essentially the contract between the application logic and the presentation layer. The application doesn’t need to deal with the entire graph, just useful, use-case-specific subsets. And these are different “document types” to those used when loading the original data. The application doesn’t have a single document type: it has many and they’re used in different contexts. This avoids overly constraining the model (we want to be able to store arbitrary additional properties) but imposes local scoping to gain the benefits of validation, known contents, etc.
It turns out that there’s another use case where RDF document types or views are useful: managing updates to a triple store. If you know that some incoming data is constrained to a particular view (e.g. by prior agreement, or through extracting only those graph patterns that are of interest) then applying the incoming message as an update to the store is simply a matter of doing some set algebra. Extract the equivalent view from the store (i.e. the relevant sub-graph) and then look for the difference between the stored and incoming sub-graphs. The end result is a list of triples to delete from and add to the store.
I’ll follow up more on the topics in this posting, as I think there are huge benefits to be had here from looking at how the notion of documents and document types can add value to RDF systems. It’s very easy to get caught up in the completely general case of a highly-distributed, wild and woolly world of RDF and the Semantic Web. But the majority of applications will have a much more limited world view, and my experience so far is that applying some additional constraints here and there can have huge benefits. Embracing the notion of multiple document types is one of these.

Oxford SWIG Talks: Twinkle & SPARQL Query Forms

I finally found time to attend one of the Oxford SWIG sessions last night and had a thoroughly enjoyable time.
I gave a couple of presentations which I’ve posted to slideshare, and which I’ll embed below.
The first was a general introduction and mini-demonstration of Twinkle. I gave a basic overview of the key features and showed how the configuration drives the user interface:

The second talk was about the different SPARQL query forms. I started by asking the question “why are there four different query forms?” and then proceeded to examine each one and talk about the benefits and their applied use.

The talk was streamed online via Yahoo Live which was a nice touch as one SWIGger was at home with a broken ankle (get well soon Katie!). It’d be nice to see more use of free video streaming at other events.