Publishing SPARQL queries and documentation using github

Yesterday I released an early version of sparql-doc a SPARQL documentation generator. I’ve got plans for improving the functionality of the tool, but I wanted to briefly document how to use github and sparql-doc to publish SPARQL queries and their documentation.

Create a github project for the queries

First, create a new github project to hold your queries. If you’re new to github then you can follow their guide to get a basic repository set up with a README file.

The simplest way to manage queries in github is to publish each query as a separate SPARQL query file (with the extension “.rq”). You should also add a note somewhere specifying the license associated with your work.

As an example, here is how I’ve published a collection of SPARQL queries for the British National Bibliography.

When you create a new query be sure to add it to your repository and regularly commit and push the changes:

git add my-new-query.rq
git commit my-new-query.rq -m "Added another example"
git push origin master

The benefit of using github is that users can report bugs or submit pull requests to contribute improvements, optimisations, or new queries.

Use sparql-doc to document your queries

When you’re writing your queries follow the commenting conventions encouraged by sparql-doc. You should also install sparql-doc so you can use it from the command-line. (Note: installation guide and notes still needs some work!)

As a test you can try generating the documentation from your queries as follows. Make a new local directory called, e.g. ~/projects/examples. Execute the following from your github project directory to create the docs:

sparql-doc ~/projects/examples ~/projects/docs

You should then be able to open up the index.html document in ~/projects/examples to see how the documentation looks.

If you have an existing web server somewhere then you can just zip up those docs and put them somewhere public to share them with others.However you can also publish them via Github pages. This means you don’t have to setup any web hosting at all.

Use github pages to publish your docs

Github Pages allows github users to host public, static websites directly from github projects. It can be used to publish blogs or other project documentation. But using github pages can seem a little odd if you’re not familiar with git.

Effectively what we’re going to do is create a separate collection of files — the documentation — that sits in parallel to the actual queries. In git terms this is done by creating a separate “orphan” branch. The documentation lives in the branch, which must be called gh-pages, while the queries will remain in the master branch.

Again, github have a guide for manually adding and pushing files for hosting as pages. The steps below follow the same process, but using sparql-doc to create the files.

Github recommend starting with a fresh separate checkout of your project. You then create a new branch in that checkout, remove the existing files and replace them with your documentation.

As a convention I suggest that when you checkout the project a second time, that you give it a separate name, e.g. by adding a “-pages” suffix.

So for my bnb-queries project, I will have two separate checkouts:

The original checkout I did when setting up the project for the BNB queries was:

git clone

This gave me a local ~/projects/bnb-queries directory containing the main project code. So to create the required github pages branch, I would do this in the ~/projects directory:

#specify different directory name on clone
git clone bnb-queries-pages
#then follow the steps as per the github guide to create the pages branch
cd bnb-queries-pages
git checkout --orphan gh-pages
#then remove existing files
git rm -rf .

This gives me two parallel directories one containing the master branch and the other the documentation branch. Make sure you really are working with a separate checkout before deleting and replacing all of the files!

To generate the documentation I then run sparql-doc telling it to read the queries from the project directory containing the master branch, and then use the directory containing the gh-pages branch as the output, e.g.:

sparql-doc ~/projects/bnb-queries ~/projects/bnb-queries-pages

Once that is done, I then add, commit and push the documentation, as per the final step in the github guide:

cd ~/projects/bnb-queries-pages
git add *
git commit -am "Initial commit"
git push origin gh-pages

The first time you do this, it’ll take about 10 minutes for the page to become active. They will appear at the following URL:


If you add new queries to the project, be sure to re-run sparql-doc and add/commit the updated files.

Hopefully that’s relatively clear.

The main thing to understand is that locally you’ll have your github project checked out twice: once for the main line of code changes, and once for the output of sparql-doc.

These will need to be separately updated to add, commit and push files. In practice this is very straight-forward and means that you can publicly share queries and their documentation without the need for web hosting.


Developers often struggle with SPARQL queries. There aren’t always enough good examples to play with when learning the language or when trying to get to grips with a new dataset. Data publishers often overlook the need to publish examples or, if they do, rarely include much descriptive documentation.

I’ve also been involved with projects that make heavy use of SPARQL queries. These are often externalised into separate files to allow them to be easily tuned or tweaked without having to change code. Having documentation on what a query does and how it should be used is useful. I’ve seen projects that have hundreds of different queries.

It occurred to me that while we have plenty of tools for documenting code, we don’t have a tool for documenting SPARQL queries. If generating and publishing documentation was a little more frictionless, then perhaps people will do it more often. Services like SPARQLbin are useful, but provide address a slightly different use case.

Today I’ve hacked up the first version of a tool that I’m calling sparql-doc. Its primary usage is likely to be helping to publish good SPARQL examples, but might also make a good addition to existing code/project documentation tools.

The code is up on github for you to try out. Its still very rough-and-ready but already produces some useful output. You can see a short example here.

Its very simple and adopts the same approach as tools like rdoc and Javadoc: its just specifies some conventions for writing structured comments. Currently it supports adding a title, description, list of authors, tags, and related links to a query. Because the syntax is simple, I’m hoping that other SPARQL tools and IDEs will support it.

I plan to improve the documentation output to provide more ways to navigate the queries, e.g. by tag, query type, prefix, etc.

Let me know what you think!

A Brief Review of the Land Registry Linked Data

The Land Registry have today announced the publication of their Open Data — including both Price Paid information and Transactions as Linked Data. This is great to see, as it means that there is another UK public body making a commitment to Linked Data publishing.

I’ve taken some time to begin exploring the data. This blog post provides some pointers that may help others in using the Linked Data. I’m also including some hopefully constructive feedback on the approach that the Land Registry have taken.

The Land Registry Linked Data

The Linked Data is available from this follows the general pattern used by other organisations publishing public sector Linked Data in the UK.

The data consists of a single SPARQL endpoint — based on the Open Source Fuseki server — which contains RDF versions of both the Price Paid and Transaction data. The documentation notes that the endpoint will be updated on the 20th of each month, with the equivalent to the monthly releases that are already published as CSV files.

Based on some quick tests, it would appear that the endpoint contains all of the currently published Open Data, which in total is 16,873,170 triples covering 663,979 transactions.

The data seems to primarily use custom vocabularies for describing the data:

The landing page for the data doesn’t include any examples, but I ran some SPARQL queries to extract a few, e.g:

So for Price Paid Data, the model appears to be that a Transaction has a Transaction Record which in turn has an associated Address. The transaction counts seem to be standalone resources.

The SPARQL endpoint for the data is at A test form is also available and that page has a couple of example queries, including getting Price Paid data based on a postcode search.

However I’d suggest that the following version might be slightly better as it includes the record status for the record, which will indicate whether it is an “add” or a “delete”:

PREFIX xsd: <>
PREFIX rdf: <>
PREFIX rdfs: <>
PREFIX owl: <>
PREFIX lrppi: <>
PREFIX skos: <>
PREFIX lrcommon: <>
SELECT ?paon ?saon ?street ?town ?county ?postcode ?amount ?date ?status
{ ?transx lrppi:pricePaid ?amount .
 ?transx lrppi:transactionDate ?date .
 ?transx lrppi:propertyAddress ?addr.
 ?transx lrppi:recordStatus ?status.

 ?addr lrcommon:postcode "PL6 8RU"^^xsd:string .
 ?addr lrcommon:postcode ?postcode .

 OPTIONAL {?addr lrcommon:county ?county .}
 OPTIONAL {?addr lrcommon:paon ?paon .}
 OPTIONAL {?addr lrcommon:saon ?saon .}
 OPTIONAL {?addr lrcommon:street ?street .}
 OPTIONAL {?addr lrcommon:town ?town .}
ORDER BY ?amount

General Feedback

Lets start with the good points:

  • The data is clearly licensed so is open for widespread re-use
  • There is a clear commitment to regularly updating the data, so it should stay in line with the Land Registry’s other Open Data. This makes it reliable for developers to use the data and the identifiers it contains
  • The data uses Patterned URIs based on Shared Keys (the Land Registry’s own transaction identifiers) so building links is relatively straight-forward
  • The vocabularies are documented and the URIs resolve, so it is possible to lookup the definitions of terms. I’m already finding that easier than digging through the FAQs that the Land Registry publish for the CSV versions.

However I think there is room for improvement in a number of areas:

  • It would be useful to have more example queries, e.g. how to find the transactional data, as well as example Linked Data resources. A key benefit of a linked dataset is that you should be able to explore it in your browser. I had to run SPARQL queries to find simple examples
  • The SPARQL form could be improved: currently it uses a POST by default and so I don’t get a shareable URL for my query; the Javascript in the page also wipes out my query every time I hit the back button, making it frustrating to use
  • The vocabularies could be better documented, for example a diagram showing the key relationships would be useful, as would a landing page providing more of a conceptual overview
  • The URIs in the data don’t match the patterns recommended in Designing URI Sets for the Public Sector. While I believe that guidance is under review, the data is diverging from current documented best practice. Linked Data purists may also lament the lack of a distinction between resource and page.
  • The data uses custom vocabulary where there are existing vocabularies that fit the bill. The transactional statistics could have been adequately described by the Data Cube vocabulary with custom terms for the dimensions. The related organisations could have been described by the ORG ontology and VCard with extensions ought to have covered the address information.

But I think the biggest oversight is the lack of linking, both internal and external. The data uses “strings” where it could have used “things”: for places, customers, localities, post codes, addresses, etc.

Improving the internal linking will make the dataset richer, e.g. allowing navigation to all transactions relating to a specific address, or all transactions for a specific town or postcode region. I’ve struggled to get a Post Code District based query to work (e.g. “price paid information for BA1”) because the query has to resort to regular expressions which are often poorly optimised in triple stores. Matching based on URIs is always much faster and more reliable.

External linking could have been improved in two ways:

  1. The dates in the transactions could have been linked to the UK Government Interval Sets. This provides URIs for individual days
  2. The postcode, locality, district and other regional information could have been linked to the Ordnance Survey Linked Data. That dataset already has URIs for all of these resources. While it may have been a little more work to match regions, the postcode based URIs are predictable so are trivial to generate.

These improvements would have moved from Land Registry data from 4 to 5 Stars with little additional effort. That does more than tick boxes, it makes the entire dataset easier to consume, query and remix with others.

Hopefully this feedback is useful for others looking to consume the data or who might be undertaking similar efforts. I’m also hoping that it is useful to the Land Registry as they evolve their Linked Data offering. I’m sure that what we’re seeing so far is just the initial steps.

HTTP 1.1 Changes Relevant to Linked Data

Mark Nottingham has posted a nice status report on the ongoing effort to revise HTTP 1.1 and specify HTTP 2.0. In the post Mark highlights a list of changes from RFC2616. This ought to be required reading for anyone doing web application development, particularly if you’re building APIs.

I thought it might be useful to skim through the list of changes and highlight those that are particularly relevant to Linked Data and Linked Data applications. Here are the things that caught my eye. I’ve included references to the relevant documents.

  • In Messaging, there is no longer a limit of 2 connections per server. Linked Data applications typically make multiple parallel requests to fetch Linked Data, so having more connections available could help improve performance on the client. In practice browsers have been ignoring this limit for a while, so you’re probably already seeing the benefit.
  • In Semantics, there is a terminology change with respect to content negotiation: we ought to be talking about proactive (client-side) and reactive (server-side) negotiation
  • In Semantics, a 201 Created response can now indicate that multiple resources have been created. Useful to help indicate if a POST of some data has resulted in the creation of several different RDF resources.
  • In Semantics, a 303 response is now cacheable. This addresses one performance issue associated with redirects.
  • In Semantics, the default charset for text media types is now whatever the media type says it is, and not ISO-8859-1. This should allow some caveats in the Turtle specification to be removed. UTF-8 can be documented as the default.

So, no really major impacts as far as I can see, but the cacheability of 303 should bring some benefits.

If you think I’ve missed anything important, then leave a comment.

UK & EU Linked Data Consultant Network?

As I explained in that announcement that I’m leaving Talis, I’m going to be exploring freelance consulting opportunities.

While I’m not limiting that to Linked Data work, its an area in which I have a lot invested and which there is still lots of activity. Perhaps not enough to support Talis Systems, but there certainly seems to be a number of opportunities that could support freelancers and small consulting businesses.

Talis was always keen to help develop the market and had quite open relationships with others in the industry. Everyone benefits if the pie gets bigger and in an early stage market is makes sense to share opportunities and collaborate where possible.

I’d like to continue that if possible. Even in the last few days I’ve had questions about whether Talis’ decision might mark the beginning of some wider move away from the technology. That’s certainly not how I see it. Even Talis is not moving away from the technology, its just focusing on a specific sector. I’ve already learnt of other companies that are starting to embrace Linked Data within the enterprise.

I think it would be a good thing if those of us working in this area in the UK & EU organise ourselves a little more; to make the most of the available opportunities and to continue to grow the market. There are various interest groups (like Lotico) but those are more community rather than business focused.

A network could take a number of forms. It may be simply be a LinkedIn network. Or a (closed?) mailing list to share opportunities and experience. But it would be nice to find a way to share successes and case studies where they exist. Sites like often promote projects, but I wonder whether something more focused might be useful.

These are just some early stage thoughts. What I’d most like to do is find out:

  • whether others think this is a good idea — would it be useful?
  • what forms people would prefer to see it take — what would be useful for you?
  • who is active, as a freelancer or SME, in this area — I have some contacts but I doubt its exhaustive

If you’ve got thoughts on those then please drop a comment on this post. Or drop me an email.

Four Links Good, Two Links Bad?

Having reviewed a number of Linked Data papers and projects I’ve noticed a recurring argument which goes something like this: “there are billions of triples available as Linked Data but the number of links, either within or between datasets, is still a small fraction of that total number. This is bad, and hence here is our software/project/methodology which will fix that…“.

I’m clearly paraphrasing, partly because I don’t want to single out any particular paper or project, but there seems to be a fairly deep-rooted belief that there aren’t enough links between datasets and this is something that needs fixing. The lack of links is often held up as a reason for why working with different datasets is harder than it should be.

But I’m not sure that I’ve seen anyone attempt to explain why increasing the density of links is automatically a good thing. Or, better yet, attempt to quantify in some way what a “good” level of inter-linking might be.

Simply counting links, and attempting to improve on that number, also glosses over reasons for why some links don’t exist in the first place. Is it better to have more links, or better quality links?

A Simple Illustration

Here’s a simple example. Consider the following trivial piece of Linked Data.

@prefix dct: <> .
@prefix foaf: <> .
@prefix owl: <> .

 dct:title "War and Peace";
 dct:creator <>.
 dct:title "Anna Karenina";
 dct:creator <>.

 foaf:name "Leo Tolstoy";
 owl:sameAs <>.

The example has two resources which are related to their creator, which is identified as being the same as a resource in dbpedia. This is a very common approach as typically a dataset will be progressively enriched with equivalence links. It’s much easier to decouple data conversion from inter-linking if a dataset is initially completely self-referential.

But if we’re counting links, then we only have a single outbound link. We could restructure the data as follows:

@prefix dct: <> .
 dct:title "War and Peace";
 dct:creator <>.

 dct:title "Anna Karenina";
 dct:creator <>.

We now have less triples, but we’ve increased the number of outbound links. If all we’re doing is measuring link density between datasets then clearly the second is “better”. We could go a step further and materialize inferences in our original dataset to assert all of the owl:sameAs links, giving us an even higher link density.

This is clearly a trivial example but it illustrates that even for very simple scenarios we can make decisions on how to publish Linked Data that impact link density. As ever we need a more nuanced understanding to help identify trade-offs for both publisher and consumer.

The first option with the lowest outbound link density is the better option in my opinion, for various reasons:

  • The dataset is initially self-contained, allowing data production to be separated from the process of inter-linking, thereby simplifying data publishing
  • Use of local URIs provides a convenient point of attachment for local annotations of resources, so useful if I have additional statements to make about the equivalent resources
  • Use of local URIs allows me to decide on my own definition of that resource, without immediately buying into a third-party definition.
  • Use of local URIs makes it easier to add new link targets, or remove existing links, at a later date
But there are also downsides:
  • Consumers needs to apply reasoning, or similar, in order to smush together datasets adding extra client-side processing
  • There are more URIs — a great “surface area” — to maintain within my Linked Data
And we’ve not yet considered how the links are created. Regardless of whether you’re creating links manually or automatically, there’s a cost to their creation. So which links are the most important to create? For me, and for the users of my data?
There is likely to be a law of diminishing returns on both sides for the addition of new links, particularly if “missing” relationships between resources can be otherwise inferred. For example if A is sameAs B, then it’s probably unnecessary for me to assert equivalences to all the resources to which B is, in turn, equivalent. Saying less, reduces the amount of data I’m producing so I can focus on making sure its of good quality.

Not All Datasets are Equal

Datasets should exhibit very different link characteristics that derive from how they’re published, who is publishing, and why they are publishing the data. Again these are more nuances that are lost by simply maximising link density.

Some datasets are purely annotations. Annotated Data may have no (new) links in it at all. Such a dataset might only be published to enrich an existing dataset. Because of the lack of links it won’t appear on the Linked Data cloud. They’re also not, yet, easily discoverable. But they’re easy to layer onto existing data and don’t require commitments to maintaining URIs, so they have their advantages.

Some datasets are link bases: they consist only of links and exist to help connect together previously unconnected datasets. Really they’re a particular kind of annotation, so share similar advantages and disadvantages.

Some datasets are hubs. These are intended to be link targets or to be annotated, but may not link to other sources. The UK Government reference interval URIs are one example of a “hub” dataset. The same is true for the Companies House URIs. Its likely that many datasets published by their managing authority will likely have a low outbound link density, simply because they are the definitive source of that data. Where else would you go? Other data publishers may annotate them, or define equivalents, but the source dataset itself may be low in links and remain so over time.

Related to this point, there are are several social, business and technical reasons why links may deliberately not exist between datasets:

  • Because they embody a different world-view or different levels of modelling precision. The Ordnance Survey dataset doesn’t link to dbpedia because even where there are apparent equivalences, with a little more digging it turns out that resources aren’t precisely the same.
  • Because the data publisher chooses not to link to a destination because of concerns about the quality of the destination. A publisher of biomedical data may choose not to link to another dataset if there are concerns about the quality of the data: more harm may be done by linking and then consuming incorrect data, than having no links at all.
  • Because the data publisher chooses not to link to data from a competitor.
  • Because datasets are published and updated on different time-scales. This is the reason for the appearance of many proxy URIs in datasets.

If, as a third party, I publish a Link Base that connects two datasets, then only in the last two scenarios am I automatically improving the situation for everyone.

In the other two scenarios I’m likely to be degrading the value of the available data by leading consumers to incorrect data or conclusions. So if you’re publishing a Link Base you need to be really clear on whether you understand the two datasets you’re connecting and the cost/benefits involved in making those links. Similarly, if you’re a consumer, consider the provenance of those links.

How do consumers rank and qualify different data sources? Blindly following your nose may not always be the best option.

Interestingly I’ve seen very little use of owl:differentFrom by data publishers. I wonder if this would be a useful way for a publisher to indicate that they have clearly considered whether some resources in a dataset are equivalent, but have decided that they are not. Seems like the closest thing to “no follow” in RDF.

Ironically of course, publishing lots of owl:differentFrom statements increases link density! But that speaks to my point that counting links alone isn’t useful. Any dataset can be added to the Linked Data Cloud diagram by adding 51 owl:differentFrom statements to an arbitrary selection of resources.

Studying link density and dataset connectivity is potentially an interesting academic exercise. I’d be interested to see how different datasets, perhaps from different subject domains, relate to known network topologies. But as the Linked Data cloud continues to grow we ought to think carefully about what infrastructure we need to help it be successful.

Focusing on increasing link density, e.g. by publishing more link bases, or by creating more linking tools, may not be the most valuable area to focus on. Infrastructure to support better selection, ranking and discovery of datasets is likely to offer more value longer term; we can see that from the existing web. Similarly, when we’re advising publishers, particularly governments on how to publish and link their data, there are many nuances to consider.

More links aren’t always better.


Principled use of RDF/XML

Everyone loves to hate RDF/XML. Indeed many have argued that RDF/XML is responsible for holding back semantic web adoption. I’m not sure that I fully agree with that (there’s a lot of other issues to consider) but its certainly awkward to work with if you’re trying to integrate both RDF and XML tools into your application.

It’s actually that combination that causes the awkwardness. If you’re just using RDF tools then RDF/XML is mostly fine. It benefits from XML’s Unicode support and is the most widely supported RDF serialisation. There are downsides though. For example there are some potential RDF graphs can’t be serialised as RDF/XML. But that is easy to avoid.

Developers, particularly XML developers, feel cheated by RDF/XML because of what they see as false advertising: its an XML format that doesn’t play nicely with XML tools. Some time ago, Dan Brickley wrote a nice history on the design of RDF/XML which is worth a read for some background. My goal here isn’t to rehash the RDF/XML discussion or even to mount a defense of RDF/XML as a good format for RDF (I prefer Turtle).

But developers are still struggling with RDF/XML, particularly in publishing workflows where XML is a good base representation for document structures, so I think its worthwhile capturing some advice on how to reach a compromise with RDF/XML that allows it to work nicely with XML tools. I can’t remember seeing anyone do that before, so I thought I’d write down some of my experiences. These are drawn from creating a publishing platform that ingested metadata and content in XML, used Apache Jena for storing that metadata, and Solr as a search engine. Integration between different components was carried out using XML based messaging. So there were several places where RDF and XML rubbed up against one another.

Tip 1: Don’t rely on default serialisations

The first thing to note is that RDF/XML offers a lot of flexibility in terms of how an RDF graph can be serialised as XML. A lot. The same graph can be serialised in many different ways using a lot of syntactic short-cuts. More on those in a moment.

It’s this unbounded flexibility that is the major source of the problems: producers and consumers may have reasonable default assumptions about how data will be made published that are completely at odds with one another. This makes it very difficult to consume arbitrary RDF/XML with anything other than RDF tools.

JSON-LD offers a lot of flexibility too, and I can’t help but wonder whether that flexibility is going to come back and bite us in the future.

By default RDF tools tend to generate RDF/XML in a form that makes it easy for them to serialise. This tends to mean automatically generated namespace prefixes and a per-triple approach to serialising the graph, e.g:

  <rdf:Description rdf:about="">
    <p0:label>Joe Bloggs</po:label>
  <rdf:Description rdf:about="">
    <rdf:type rdf:resource=""/>
  <rdf:Description rdf:about="">
    <p1:homepage rdf:resource=""/>

This is a disaster for XML tools as the description of the resource is spread across multiple elements making it hard to process. But its efficient to generate.

Some RDF frameworks may provide options for customising the output to apply some of the RDF/XML syntactic shortcuts. As we’ll see in a moment these are worth embracing and may produce some useful regularity.

But if you need to generate an XML format that has, for example, a precise ordering of child elements then you’re not going to get that kind of flexibility by default. You’ll need to craft a custom serialiser. Apache Jena allows you to use create RDF Writers to support this kind of customization. This isn’t ideal as you need to write code — even to tweak the output options — but it gives you more control.

So, if you need to generate an XML format from RDF sources then ensure that you normalize your output. If you have control over the XML document formats and can live with some flexibility in the content model, then using RDF/XML syntax shortcuts offered by your RDF tools might well be sufficient. However if you’re working to a more rigid format, then you’re likely to need some custom code.

Tip 2: Use all of the shortcuts

Lets look at the above example again but with a heavy dusting of syntax sugar:

  <rdfs:label>Joe Bloggs</rdfs:label>
  <foaf:homepage rdf:resource=""/>

Much nicer! The above describes exactly the same RDF graph as we had before. What have we done here:

  • We’ve omitted the rdf:RDF element as its unnecessary. If you have a single “root” resource in your graph then you can just this as the document element. If we had multiple, unrelated Person resources in the document then we’d need to re-introduce the rdf:RDF element as a generic container.
  • Defined some default namespace prefixes
  • Grouped triples about the same subject into the same element
  • Removed use of rdf:Description and rdf:type, preferring to instead use namespace element names

The result is something that is easier to read and much easier to work with in an XML context. You could even imagine creating an XML schema for this kind of document, particularly if you know which types and predicates are being used in your RDF graphs.

The nice thing about this approach is that its looks just like namespaced XML. For a publishing project I worked on we defined our XML schemas for receipt of data using this kind of approach; the client didn’t really need to know anything about RDF. We just had to explain that:

  • rdf:about is how we assign a unique identifier to a entity (and we used xml:base to simplify the contents further to avoid repetition)
  • rdf:resource was a “link” between two resources, e.g. for cross-referencing between content and subject categories

If you’re not using RDF containers of collections then those two attributes are the only bit of RDF that creeps into the syntax.

However in our case, we were also using RDF Lists to capture ordering of authors in academic papers. So we also explained that rdf:parseType was a processing instruction to indicate that some element content should be handled as a collection (a list).

This worked very well. We’d ended up with fine-grained document types anyway, to make it easier to update individual resources in the system, e.g. individual journal issues or articles, so the above structure mapped well to the system requirements.

Here’s a slightly more complex example that hopefully further illustrates the point. Here I’m showing nesting of several resource descriptions:


 <dc:title>An example article</dc:title>
 <dc:description>This is an article</dc:description>
 <ex:authors rdf:parseType="Collection">
   <foaf:Person rdf:about="">
     <rdfs:label>Joe Bloggs</rdfs:label>
     <foaf:homepage rdf:resource=""/>
   <foaf:Person rdf:about="">
     <rdfs:label>Sue Bloggs</rdfs:label>
     <foaf:homepage rdf:resource=""/>
   <ex:Article rdf:about=""/>
   <skos:Concept rdf:about=""/>

The reality is whether you’re working in an XML or a RDF context, there is very often a primary resource you’re interested in: e.g. your processing a resource or rendering a view of it, etc. This means that in practice there’s nearly always an obvious and natural “root” element to the graph for creating an RDF/XML serialisation. Its just that RDF tools don’t typically let you identify it.

Tip 3: Use RELAX NG

Because of the syntactic variation, writing schemas for RDF/XML can be damn near impossible. But for highly normalised RDF/XML its a much more tractable problem.

My preference has been to use RELAX NG as it offers more flexibility when creating open and flexible content models for elements, e.g. via interleaving. This gives options to leave the document structures a little looser to facilitate serialisation and also allow the contents of the graph to evolve (e.g. addition of new properties).

If you have the option, then I’d recommend RELAX when defining schemas for your XML data.

Tip 4: RDF for metadata; XML for content

The last tip isn’t about RDF/XML per se, I just want to make a general point about where to apply the different technologies.

XML is fantastic at describing document structures and content. RDF is fantastic at describing relationships between things. Both of those qualities are important, but in very different aspects of an application.

In my work in publishing I ended up using a triple store as the primary data repository. This is because the kinds of application behaviour I wanted to drive were increasingly relationship focused: e.g. browsing to related content, author based navigation, concept relationships, etc. Increasingly I also wanted the ability to create new slices and views across the same content and document structures were too rigid.

The extensibility of the RDF graph allowed me to quickly integrate new workflows (using the Blackboard pattern) so that I could, for example, harvest & integrate external links or use text mining tools to extract new relationships. This could be done without having to rework the main publishing workflow, evolve document formats, or the database for the metadata.

However XML works perfectly well for rendering out the detailed content. It would be crazy to try and capture content in RDF/XML (structure yes; but not content). So for transforming XML into HTML or other views, XML was the perfect starting point. We were early adopters of XProc so using pipelines to generate rendered content and to extract RDF/XML for loading into a triple store was easy to do.

In summary RDF/XML is not a great format for working with RDF in an XML context, but its not completely broken. You just need to know how to get the best from it. It provides a default interoperable format for exchanging RDF data over the web, but there are better alternatives for hand-authoring and efficient loading. Once the RDF Working Group completes work on RDF 1.1 its likely that Turtle will rapidly become the main RDF serialisation.

However, I think that RDF/XML will still have a role, as part of a well-designed system, in bridging between RDF and XML tools.

Layered Data: A Paper & Some Commentary

Two years ago I wrote a short paper about “layering” data but for various reasons never got round to putting it online. The paper tried to capture some of my thinking at the time about the opportunities and approaches for publishing and aggregating data on the web. I’ve finally got around to uploading it and you can read it here.

I’ve made a couple of minor tweaks in a few places but I think it stands up well, even given the recent pace of change around data publishing and re-use. I still think the abstraction that it describes is not only useful but necessary to take us forward on the next wave of data publishing.

Rather than edit the paper to bring it completely up to date with recent changes, I thought I’d publish it as is and then write some additional notes and commentary in this blog post.

You’re probably best off reading the paper, then coming back to the notes here. The illustration referenced in the paper is also now up on slideshare.

RDF & Layering

I see that the RDF Working Group, prompted by Dan Brickley, is now exploring the term. I should acknowledge that I also heard the term “layer” in conjunction with RDF from Dan, but I’ve tried to explore the concept from a number of perspectives.

The RDF Working Group may well end up using the term “layer” to mean a “named graph”. I’m using the term much more loosely in my paper. In my view an entire dataset could be a layer, as well as some easily identifiable sub-set of it. My usage might therefore be closer to Pat Hayes’s concept of a “Surface”, but I’m not sure.

I think that RDF is still an important factor in achieving the goal I outlined of allowing domain experts to quickly assemble aggregates through a layering metaphor. Or, if not RDF, then I think it would need to be based around a graph model, ideally one with a strong notion of identity. I also think that mechanisms to encourage sharing of both schemas and annotations are also useful. It’d be possible to build such a system without RDF, but I’m not sure why you’d go to the effort.

User Experience

One of the things that appeals to me about the concept of layering is that there are some nice ways to create visualisation and interfaces to support the creation, management and exploration of layers. It’s not hard to see how, given some descriptive metadata for a collection of layers, you could create:

  • A drag-and-drop tool for creating and managing new composite layers
  • An inspection tool that would let you explore how the dataset for an application or visualisation has been constructed, e.g. to explore provenance or to support sharing and customization. Think “view source” for data aggregation.
  • A recommendation engine that suggested new useful layers that could be added to a composite, including some indication of what additional query options might become available

There’s been some useful work done on describing datasets within the Linked Data community: VoiD and DCat for example. However there’s not yet enough data routinely available about the structure and relationships of individual datasets, nor enough research into how to provide useful summaries.

This is what prompted my work on an RDF Report Card to try and move the conversation forward beyond simply counting triples.

To start working with layers, we need to understand what each layer contains and how they relate to and complement one another.

Linked Data & Layers

In the paper I suggest that RDF & Linked Data alone aren’t enough and that we need systems, tools and vocabularies for capturing the required descriptive data and enabling the kinds of aggregation I envisage.

I also think that the Linked Data community is spending far too much effort on creating new identifiers for the same things and worrying how best to define equivalences.

I think the leap of faith that’s required, and that people like the BBC have already taken, is that we just need to get much more comfortable re-using other people’s identifiers and publishing annotations. Yes, there will be times when identifiers diverge, but there’s a lot to be gained, especially in terms of efficiency around data curation from just focusing on the value-added data, not re-publishing any copy of a core set of facts.

There are efficiency gains to be had from existing businesses, as well as faster routes to market for startups, if they can reliably build on some existing data. I suspect that there are also businesses that currently compete with one another — because they’re having to compile or re-compile the same core data assets — that could actually complement one another if they could instead focus on the data curation or collection tasks at which they excel.

Types of Data

In the paper I set out seven different facets which I think cover the majority of types of data that we routinely capture and publish. I think the classification could be debated, but I think its a reasonable first attempt.

The intention is to try and illustrate that we can usefully group together different types of data. And organisations may be particularly good at creating or collecting particular types of data. There’s scope for organisations to focus on being really good in a particular area and by avoiding needless competition around collecting and re-collecting the same core facts, there are almost certainly efficiency gains and cost savings to be had.

I’m sure there must be some prior work in this space, particularly around the core categories, so if anyone has pointers please share them.

There are also other ways to usefully categorise data. One area that springs to mind is how the data itself is collected, i.e. its provenance. E.g. is it collected automatically by sensors, or as a side-effect of user activity, or entered by hand by a human curator? Are those curators trained or are they self-selected contributors? Is the data derived from some form of statistical analysis?

I had toyed with provenance as a distinct facet, but I think its an orthogonal concern.

Layering & Big Data

A lot has happened in the last two years and I winced a bit at all of the Web 2.0 references in the paper. Remember that? If I were writing this now then the obvious trend to discuss as context to this approach is Big Data.

Chatting with Matt Biddulph recently he characterised a typical Big Data analysis as being based on “Activity Data” and “Reference Data”. Matt described reference data as being the core facts and information on top of which the activity data — e.g. from users of an application — is added. The analysis then draws on the combination to create some new insight, i.e. more data.

I referenced Matt’s characterisation in my Strata talk (with acknowledgement!). Currently Linked Data does really well in the Reference category but there’s not a great deal of Activity data. So while its potentially useful in a Big Data world, there’s a lot of value still not being captured.

I think Matt’s view of the world chimes well with both the layered data concept and the data classifications that I’ve proposed. Most of the facets in the paper really define different types of Reference data. The outcome of a typical Big Data analysis is usually a new facet, an obvious one being “Comparative” data, e.g. identifying the most popular, most connected, most referenced resources in a network.

However there’s clearly a different in approach between typical Big Data processing and the graph models that I think underpin a layered view of the world.

MapReduce workflows seem to work best with more regular data, however newer approaches like Pregel illustrate the potential for “graph-native” Big Data analysis. But setting that aside, there’s no real contention as a layering approach to combining data doesn’t say anything about how the data must actually be used: it can be easily projected out into structures that are amenable for indexing and processing in different ways.


Looking at the last section of the paper it should be obvious that much of the origin of this analysis was early preparation for Kasabi.

I still think that there’s a great deal of potential to create a marketplace around data layers and tools for interacting with them. But we’re not there yet though for several reasons. Firstly its taken time to get the underlying platform in place to support that. We’ve done that now and you can expect more information on that from more official sources shortly. Secondly I under estimated how much effort is still required to move the market forward: there’s still lots to be done to support organisations in opening up data before we can really explore more horizontal marketplaces. But that is a topic for another post.

This has been quite a ramble of a blog post but hopefully there are some useful thoughts here that chime with your own experience. Let me know what you think.

Beyond the Triple Count

This post was originally published on the Kasabi product blog.

On Monday I gave a talk at the SemTechBiz conference: “The RDF Report Card: Beyond the Triple Count“. I’ve published the slides on Slideshare which I’ve embedded below, but I thought I’d also post some of my notes here.

I’ve felt for a while now that the Linked Data community has an unhealthy fascination on triple counts, i.e. on the size of individual datasets.

This was quite natural in the boot-strapping phase of Linked Data in which we were primarily focused on communicating how much data was being gathered. But we’re now beyond that phase and need to start considering a more nuanced discussion around published data.

If you’re a triple store vendor then you definitely want to talk about the volume of data your store can hold. After all, potential users or customers are going to be very interested in how much data could be indexed in your product. Even so, no-one seriously takes a headline figure at face value. As users we’re much more interested in a variety of other factors. For example how long does it take to load my data? Or, how well does a store perform with my usage profile, taking into account my hardware investment? Etc. This is why we have benchmarks, so we can take into account additional factors and more easily compare stores across different environments.

But there’s not nearly enough attention paid to other factors when evaluating a dataset. A triple count alone tells us nothing. They’re not even a good indicator of the number of useful “facts” in a dataset.

During my talk I illustrated this point by showing how, in Dbpedia, there are often several redundant ways for capturing the same information. These in inflate the size of the datasets without adding useful extra information. By my estimate there’s over 4.6m redundant triples for capturing location information alone. In my view, having multiple copies or variations for the same data point reduces the utility of a dataset, because it adds confusion over which values are reliable.

There can be good reasons for including the same information in slightly different ways, e.g. to support consuming applications that rely on slightly different properties, or which cannot infer additional data. Vocabularies also evolve and become more popular and this too can lead to variants if a publisher is keen to adapt to changing best practices.

But I think too often the default position is to simply use every applicable property to publish some data. From a publishing perspective it’s easier: you don’t have to make a decision about which approach might be best. And because of the general fixation on dataset size, there’s an incentive to just publish more data.

I think it’s better for data publishers to make more considered curation decisions, and instead just use one preferred way to publish each piece of information. Its much easier for clients to use normalized data.

I also challenged the view that we need huge amounts of data to build useful applications. In some scenarios more data is always better, that’s especially true if you’re doing some kind of statistical analysis. Semantic web technology potentially allows us to draw on data from hundreds of different sources by reducing integration costs. But it doesn’t mean that we have to or need to in order to drive useful applications. For many cases we need much more modest collections of data.

I used BBC Programmes as an example here. It’s a great example of publishing high quality Linked Data especially because the BBC were amongst the first (if not the first) primary publisher of data on the Linked Data cloud. BBC Programmes is a very popular site with over 2.5 million unique users a week, triggering over 60 reqs/second on their back-end. Now, while the data isn’t managed in a triple store, if you crawl it then you’ll discover than there’s only about 50 million triples in the whole dataset. So you clearly don’t need billions of triples to drive useful applications.

It’s really easy to generate large amounts of data. Curating a good quality dataset is harder. Much harder.

I think it’s time to move beyond boasting about triple counts and instead provide ways for people to assess dataset quality and utility. There are lots of useful factors to take into account when deciding whether a dataset is fit for purpose. In other words, how can we help users understand whether a dataset can help them solve a particular problem, implement a particular feature, or build an application?

Typically the only information we get about a dataset are some brief notes on its size, a few example resources, perhaps a pointer to a SPARQL endpoint and maybe an RDFs or OWL schema. This is not enough. I’d consider myself to be an experienced semantic web developer and this isn’t nearly enough to get started. I always find myself doing a lot of exploration around and within a dataset before deciding whether its useful.

In the talk I presented a simple conceptual model, an “information spectrum”, that tried to tease out the different aspects of a dataset that are useful to communicate to potential users. Some of that information is more oriented towards “business” decisions: is the dataset from a reliable source, correctly licensed, etc. While others are more technical: how has the dataset been constructed, or modelled?

I identified several broad classes of information on that spectrum:

Metadata. This is the kind of information that people are busily pouring into various data catalogs, primarily from government sources. Dataset metadata, including its title, a description, publication dates, license, etc all help solve the discovery problem, i.e. identifying what datasets are available and whether you might be able to use them.

While the situation is improving, its still too hard to find out when some particular source was updated, who maintains or publishes the data, and (of biggest concern) how the data is licensed.

Scope. Scoping information for a dataset tells us what it contains. E.g. is it about people, places, or creative works? How many of each type of thing does a dataset contain? If the dataset contains points of interest, then what is the geographic coverage? If the dataset contains events, then over what time period(s)?

Then we get to the Structure of a dataset. I don’t mean a list of the specific vocabularies that are used, but more how those vocabularies have been meshed together to describe a particular type of entity. E.g. how is a person described in this dataset? Do all people have a common set of properties?

At the lowest level we then have the dataset Internals. This includes things like lists of RDF terms and their frequencies, use of named graphs, pointers to source files, etc. Triple counts may be useful at this point, but only to identify whether you could reasonably mirror a dataset locally. Knowledge of the underlying infrastructure, etc. might also be of use to developers.

Taken together I see presenting this information to users as being one of progressive disclosure: providing the right detail, to the right audience, at the right time. Currently we don’t routinely provide nearly enough information at any point on the spectrum. The irony here is that when we’re using RDF, the data is so well-structured that much of that detail could be automatically generated. Data publishing platforms need to do more to make this information more readily accessible, as well as providing data publishers with the tools to manage it.

We’ve been applying this conceptual model as we build out the features of Kasabi. Currently we’re ticking all of these boxes. It’s clear from every dataset homepage where some data has come from, how it is licensed and when it was updated. A user can easily then drill down into a dataset to get more information on its scope and internal structure. There’s lots more that we’re planning to do at all stages.

To round out the talk I previewed a feature that we’ll be releasing shortly called the “Report Card”. This is intended to provide an at a glance overview of what types of entity a dataset contains. There are examples included in the slides, but we’re still playing with the visuals. The idea is to quickly allow a user to determine the scope of a dataset, and whether it contains some useful information to them. In the BBC Music example you can quickly see that it contains data on Creative Works (reviews, albums), Organizations (bands) and People (artists) but it doesn’t contain any location information. You’re going to need to draw on a related linked dataset if you want to build a location based music app using BBC Music.

As well as summarizing a dataset, the report card will also be used to drive better discovery tools. This will allow users to quickly find datasets that include the same kinds of information, or relevant complementary data.

Ultimately my talk was arguing that I think it’s time to start focusing more on data curation. We need to give users a clearer view of the quality and utility of the data we’re publishing, and also think more carefully about the data we’re publishing.

This isn’t a unique semantic web problem. The same issues are rearing their heads with other approaches. Where I think we are well placed is in the ability to apply semantic web technology to help analyze and present data in a more useful and accessible way.

Giving RDF Datasets more Affordance

This post was originally published on the Kasabi product blog.

The following is a version of the talk on Creating APIs over RDF I gave at SemTech 2011. I’ve pruned some of the technical details in favour of linking out to other sources and concentrated here on the core message I was trying to get across. Comments welcome!

The Trouble with SPARQL

I’m a big fan of SPARQL, I constantly use it in my own development tasks, have built a number of production systems in which the query language is a core component, and wrote (I think!) one of the first SPARQL tutorials back in 2005 when it was still in Last Call. I’ve also worked with a number of engineering teams and developer communities over the last few years, introducing them to RDF and SPARQL.

My experience so far is two-fold: given some training and guidance SPARQL isn’t hard for any developer to learn. It’s just another query language and syntax. There are often some existing mental models that need to be overcome, but that’s always the case with any new technology. So at small scales SPARQL is easy to adopt, and a very useful tool when you’re working with graph shaped data.

But I’ve found, repeatedly, that when SPARQL is presented to a larger community, then the reaction and experience is very different. Developers quickly reject it as the learning curves are too great and, instead of seeing it as an enabler, they often see it as a barrier that’s placed between them and the data.

It’s easy to dismiss this kind of feedback and criticism by exhorting developers to just try harder or read more documentation. Surely any good developer is keen to learn a new technology? This overlooks the need of many people to just get stuff done quickly. Time and commercial pressures are a reality.

It’s also easy to dismiss this reaction as being down to the quality of the tools and documentation. Now, undoubtedly, there’s still much more that can be done there. And I was pleased to hear about Bob DuCharme’s forthcoming book on SPARQL.

But I think there are some technical reasons why, when we move from small groups to distributed adoption by a wider community, SPARQL causes frustrations. And I think this boils down to its affordance.


Consider the interface that most developers are presented with when given access to a SPARQL endpoint. It’s an empty text field and a button. If you’re really lucky, there may be an example query filled in.

In order to do anything you have to not only know how to write a valid SPARQL query, but you also really need to know how the underlying dataset is structured. Two immediate hurdles to get over. Yes, there are queries you can write to return arbitrary triples, or list classes and properties, but that’s still not something a new user would necessarily know. And you typically need a lot of exploration before you can start to understand how to best query the data.

Trial and error experiments aren’t easy either, its not always obvious how a query can be tweaked to customize the results. And when we typically share SPARQL queries, its typically by passing around direct links to an endpoint. Have fun unpicking the query from the URL, reformatting it, so you can understand how it works and how it can be tweaked!

Better tools can definitely help in both of these cases. In Kasabi we’ve added a feature that allows anyone to share useful queries for a SPARQL endpoint along with a description of how it works. It’s a simple click to drop the query into the API explorer to run it, or tweak it.

But in my opinion it’s about more than just the tooling. Affordance flows not just from the tools, but also the syntax and the data. If you point someone at a SPARQL endpoint, it’s not immediately useful, not without a lot of additional background. These are issues that can hamper widespread adoption of a technology but which don’t often arise with smaller groups with direct access to mentors.

Contrast this situation with typical web APIs which have, in my opinion, much more affordance. If I give someone a link to an API call then it’s more immediately useful. I think working with good, RESTful APIs is like pulling a thread: the URLs unravel into useful data. And that data contains more links that I can just follow to find more data.

Trial and error experiments are also much easier. If I want to tweak an API call then I can do that by simply editing the URL. As a developer this is syntax that I already know. URL templates can also give me hints of how to construct a useful request.

Importantly, my understanding of the structure of the dataset can grow as I work with it. My understanding grows through use, rather than before I start using. And that’s a great way to learn. There are no real barriers to progression. I need to know much less in order to start feeling empowered.

So, I’ve come to the conclusion that SPARQL is really for power users. It’s of most use to developers that are willing to take the time and trouble to learn the syntax and the underlying data model in order to get its benefits. This is not a critique of the technology itself, but a reflection on how technology is (or isn’t) being adopted and the challenges people are facing.

The obvious question that springs to mind is: how we can give RDF data more affordance?

Linked Data

Linked Data is all about giving affordance to data. Linking, and “follow your nose” access to data is a core part of the approach. By binding data to the web, making it accessible via a single click, we make it incredibly more useful.

Surely then Linked Data solves all of our problems: “your website is your API”, after all. I think there’s a lot of truth to that and rich Linked Data does remove many of the needs for an separate API.

But I don’t think it addresses all of the requirements, or at least: the current approaches and patterns for publishing Linked Data don’t address all of the requirements. Right now the main guidance is to focus on your domain modelling and the key entities and relationships that it contains. That’s good advice and a useful starting point.

But when you’re developing an application against a dataset, there are many more useful ways to partition the data, e.g.: by date, location, name, etc.

It’s entirely possible to materialize many of these partitions directly in the dataset — as yet more resources and links — but this quickly becomes unfeasible: there are too many useful data partions to realistically do this for any reasonably large or complex dataset. This is the exactly the gap that query languages, and specifically SPARQL, are designed to fill. But if we concede that SPARQL may be too complex for many cases, what other options can we explore?

SPARQL Stored Procedures and the Linked Data API

One option is to just build custom APIs. But this can be expensive to maintain, and can detract from the overall message of the core usefulness of publishing Linked Data. So, are there ways to surface useful views in declarative way, that both takes advantage of, and embraces the utility of a the underlying “web native” graph model?

Currently there are two approaches that we’ve explored. The first is what we’ve calling SPARQL Stored Procedures in Kasabi. This allows developers to:

  • Bind a SPARQL query to a URL, causing that query to be automatically executed when a GET request is made to the URL
  • Indicate that specific URL parameters be injected into the SPARQL query before it is executed, allowing the queries to be parameterized on a per-request basis
  • Generate custom output formats (e.g. XML or JSON) from a query using XSLT stylesheets that can be applied to the query results based on the requested mimetype

Kasabi provides tools for creating this type of API, including the ability to create one based on a SPARQL query shared by another user. This greatly lowers the barrier to entry for sharing useful ways to work with data.

The ability to either access the results of the query directly, e.g. as SPARQL XML results or various RDF serializations, means the underlying graph is still accessible. You can just treat the feature as a convenience layer that hides some of the complexity. But providing custom output formats we can also help developers use the data using existing skills and tools.

The second approach has grown out of work on Jeni Tennison (TSO), Dave Reynolds (Epimorphics) and I explored various options for creating APIs over Linked Data, resulting in the publication of the Linked Data API which is in use at, e.g. to support the excellent organogram visualizations.

As with SPARQL Stored Procedures, the Linked Data API provides a declarative way to create a RESTful API over RDF data sources. However rather than writing SPARQL queries directly, an API developer creates a configuration file that describes how various views of the data should be bound to web requests.

The Linked Data API is much more powerful (at the cost of some complexity), providing many more options for filtering and sorting through data, as well as simple XML and JSON result formats out of the box. In my opinion, the specification does a good job at weaving API interactions together with the underlying Linked Data, creating a very rich way to interact with a dataset. And one that has a lot more affordance than the equivalent SPARQL queries.

Again, Kasabi provides support for hosting this type of API. Right now the tooling is admittedly quite basic but we’re exploring way to make it more interactive. We’ve incorporated the “view source” principle into the custom API hosting feature as a whole, so its possible to view the configuration of an API to see how it was constructed.

I think both of these approaches can usefully provide ways for a wider developer community to get to grips with RDF and Linked Data, removing some of the hurdles to adoption. The tooling we’ve created in Kasabi is designed to allow skilled members of the community to directly drive this adoption by sharing queries and creating different kinds of APIs.

By separating the publication of datasets, from the creation of APIs — useful access paths into the dataset — we hope to let communities find and share useful ways to work with the available data, whatever their skills or preferred technologies.