What is a Dataset?

As my last post highlighted, I’ve been thinking about how we can find and discover datasets and their related APIs and services. I’m thinking of putting together some simple tools to help explore and encourage the kind of linking that my diagram illustrated.

There’s some related work going on in a few areas which is also worth mentioning:

  • Within the UK Government Linked Data group there’s some work progressing around the notion of a “registry” for Linked Data that could be used to collect dataset metadata as well as supporting dataset discovery. There’s a draft specification which is open for comment. I’d recommend you ignore the term “registry” and see it more as a modular approach for supporting dataset discovery, lightweight Linked Data publishing, and “namespace management” (aka URL redirection). A registry function is really just one aspect of the model.
  • There’s an Open Data on the Web workshop in April which will cover a range of topics including dataset discovery. My current thoughts are partly preparation for that event (and I’m on the Programme Committee)
  • There’s been some discussion and a draft proposal for adding the Dataset type to Schema.org. This could result in the publication of more embedded metadata about datasets. I’m interested in tools that can extract that information and do something useful with it.

Thinking about these topics I realised that there are many definitions of “dataset”. Unsurprisingly it means different things in different contexts. If we’re defining models, registries and markup for describing datasets we may need to get a sense of what these different definitions actually are.

As a result, I ended up looking around for a series of definitions and I thought I’d write them down here.

Definitions of Dataset

Let’s start with the most basic. Dictionary.com, for example, has the following definition:

“a collection of data records for computer processing”

Which is pretty vague. Wikipedia has a definition which derives from the term’s use in a mainframe environment:

“A dataset (or data set) is a collection of data, usually presented in tabular form. Each column represents a particular variable. Each row corresponds to a given member of the dataset in question. It lists values for each of the variables, such as height and weight of an object. Each value is known as a datum. The dataset may comprise data for one or more members, corresponding to the number of rows.

Nontabular datasets can take the form of marked up strings of characters, such as an XML file.”

The W3C Data Catalog Vocabulary defines a dataset as:

“A collection of data, published or curated by a single source, and available for access or download in one or more formats.”
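As a rough illustration of that definition, the sketch below describes one dataset from a single publisher that is available in two formats, and renders it as Turtle using DCAT and Dublin Core terms. The example.org URIs, the title and the helper name describe_dataset are all made up for illustration; it is only a minimal sketch, not a complete DCAT record.

```python
def describe_dataset(uri, title, publisher, downloads):
    """Render a minimal DCAT description as Turtle text.

    `downloads` maps a media type to a download URL; each entry becomes
    one dcat:Distribution of the same dataset. Values are inserted as-is,
    so this sketch assumes they contain no quotes or angle brackets.
    """
    if not downloads:
        raise ValueError("at least one download is needed for this sketch")
    distributions = [
        "[\n"
        "        a dcat:Distribution ;\n"
        f'        dcat:mediaType "{media_type}" ;\n'
        f"        dcat:downloadURL <{url}>\n"
        "    ]"
        for media_type, url in sorted(downloads.items())
    ]
    return "\n".join([
        "@prefix dcat: <http://www.w3.org/ns/dcat#> .",
        "@prefix dct: <http://purl.org/dc/terms/> .",
        "",
        f"<{uri}> a dcat:Dataset ;",
        f'    dct:title "{title}" ;',
        f"    dct:publisher <{publisher}> ;",
        "    dcat:distribution " + " , ".join(distributions) + " .",
    ])


print(describe_dataset(
    "http://example.org/dataset/prices",      # hypothetical dataset URI
    "Example price data",                     # hypothetical title
    "http://example.org/org/publisher",       # hypothetical publisher URI
    {
        "text/csv": "http://example.org/prices.csv",
        "application/json": "http://example.org/prices.json",
    },
))
```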

The JISC “Data Information Specialists Committee” have a definition of dataset as:

“…a group of data files–usually numeric or encoded–along with the documentation files (such as a codebook, technical or methodology report, data dictionary) which explain their production or use. Generally a dataset is un-usable for sound analysis by a second party unless it is well documented.”

Which is a good definition as it highlights that the dataset is more than just the individual data files or facts, it also consists of some documentation that supports its use or analysis. I also came across a document called “A guide to data development” (2007) from the National Data Development and Standards Unit in Australia which describes a dataset as

“A data set is a set of data that is collected for a specific purpose. There are many ways in which data can be collected—for example, as part of service delivery, one-off surveys, interviews, observations, and so on. In order to ensure that the meaning of data in the data set is clearly understood and data can be consistently collected and used, data are defined using metadata…”

This too has the notion of context and clear definitions to support usage, but also notes that the data may be collected in a variety of ways.

A Legal Definition

As it happens, there’s also a legal definition of a dataset in the UK, at least as far as it relates to Freedom of Information. The “Protection of Freedoms Act 2012 Part 6, (102) c” includes the following definition:

“In this Act “dataset” means information comprising a collection of information held in electronic form where all or most of the information in the collection—

  • (a)has been obtained or recorded for the purpose of providing a public authority with information in connection with the provision of a service by the authority or the carrying out of any other function of the authority,
  • (b)is factual information which—
    • (i)is not the product of analysis or interpretation other than calculation, and
    • (ii)is not an official statistic (within the meaning given by section 6(1) of the Statistics and Registration Service Act 2007), and
  • (c)remains presented in a way that (except for the purpose of forming part of the collection) has not been organised, adapted or otherwise materially altered since it was obtained or recorded.”

This definition is useful as it defines the boundaries for what type of data is covered by Freedom of Information requests. It clearly states that the data is collected as part of the normal business of the public body and also that the data is essentially “raw”, i.e. it is not the result of analysis and has not been adapted or altered.

Raw data (as defined here!) is more useful as it supports more downstream usage. Raw data has more potential.

Statistical Datasets

The statistical community has also worked towards having a clear definition of dataset. The OECD Glossary defines a Dataset as “any organised collection of data”, but then includes context that describes that further. For example, it notes that a dataset is a set of values that have a common structure and are usually thematically related. However, there’s also this note, which suggests that a dataset may be made up of derived data:

“A data set is any permanently stored collection of information usually containing either case level data, aggregation of case level data, or statistical manipulations of either the case level or aggregated survey data, for multiple survey instances”

Privacy is one key reason why a dataset may contain derived information only.

The RDF Data Cube vocabulary, which borrows heavily from SDMX — a key standard in the statistical community — defines a dataset as being made up of several parts:

  1. “Observations – This is the actual data, the measured numbers. In a statistical table, the observations would be the numbers in the table cells.
  2. Organizational structure – To locate an observation within the hypercube, one has at least to know the value of each dimension at which the observation is located, so these values must be specified for each observation…
  3. Internal metadata – Having located an observation, we need certain metadata in order to be able to interpret it. What is the unit of measurement? Is it a normal value or a series break? Is the value measured or estimated?…
  4. External metadata — This is metadata that describes the dataset as a whole, such as categorization of the dataset, its publisher, and a SPARQL endpoint where it can be accessed.”
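To make those four parts a little more concrete, here is a minimal sketch of a data structure that keeps them separate. The class and field names, and the figures in the usage example, are invented for illustration and are not taken from the Data Cube vocabulary itself.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Observation:
    dimensions: tuple        # organisational structure: ((dimension, value), ...)
    value: float             # the observation itself, i.e. the measured number
    unit: str                # internal metadata: unit of measurement
    estimated: bool = False  # internal metadata: measured or estimated?


@dataclass
class CubeDataset:
    title: str               # external metadata about the dataset as a whole
    publisher: str
    observations: list = field(default_factory=list)

    def add(self, value, unit, estimated=False, **dimensions):
        """Record one observation, located by a value for each dimension."""
        obs = Observation(tuple(sorted(dimensions.items())), value, unit, estimated)
        self.observations.append(obs)
        return obs

    def lookup(self, **dimensions):
        """Find observations at exactly this position in the hypercube."""
        key = tuple(sorted(dimensions.items()))
        return [o for o in self.observations if o.dimensions == key]


# Hypothetical figures, purely to show the shape of the data.
cube = CubeDataset("Example life expectancy", "Example Statistics Office")
cube.add(78.7, "years", area="Area A", period="2005", sex="male")
cube.add(83.3, "years", area="Area A", period="2005", sex="female", estimated=True)
for obs in cube.lookup(area="Area A", period="2005", sex="female"):
    print(obs.value, obs.unit, "estimated" if obs.estimated else "measured")
```

An observation cannot be found without a value for every dimension, which is the point made in the second item of the list.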

The SDMX implementors guide has a long definition of dataset (page 7) which also focuses on the organisation of the data and specifically how individual observations are qualified along different dimensions and measures.

Scientific and Research Datasets

Over the last few years the scientific and research community have been working towards making their datasets more open, discoverable and accessible. Organisations like the Wellcome Trust have published guidance for researchers on data sharing; services like CrossRef and DataCite provide the means for giving datasets stable identifiers; and platforms like FigShare support the publishing and sharing process.

While I couldn’t find a definition of dataset from that community (happy to take pointers!) it’s clear that the definition of dataset is extremely broad. It could cover both raw results, e.g. output from sensors or equipment, through to more analysed results. The boundaries are hard to define.

Given the broad range of data formats and standards, services like FigShare accept any or all data formats. But as the Wellcome Trust note:

“Data should be shared in accordance with recognised data standards where these exist, and in a way that maximises opportunities for data linkage and interoperability. Sufficient metadata must be provided to enable the dataset to be used by others. Agreed best practice standards for metadata provision should be adopted where these are in place.”

This echoes the earlier definitions that included supporting materials as being part of the dataset.

RDF Datasets

I’ve mentioned a couple of RDF vocabularies already, but within the RDF and Linked Data community there are a couple of other definitions of dataset to be found. The Vocabulary for Organising Interlinked Datasets (VoiD) is similar to, but predates, DCAT. Whereas DCAT focuses on describing a broad class of different datasets, VoiD describes a dataset as:

“…a set of RDF triples that are published, maintained or aggregated by a single provider…the term dataset has a social dimension: we think of a dataset as a meaningful collection of triples, that deal with a certain topic, originate from a certain source or process, are hosted on a certain server, or are aggregated by a certain custodian. Also, typically a dataset is accessible on the Web, for example through resolvable HTTP URIs or through a SPARQL endpoint, and it contains sufficiently many triples that there is benefit in providing a concise summary.”

Like the more general definitions this includes the notion that the data may relate to a specific topic or be curated by a single organisation. But this definition also makes some assumption about the technical aspects of how the data is organised and published. VoiD also includes support for linking to the services that relate to a dataset.

Along the same lines, SPARQL also has a definition of a Dataset:

“A SPARQL query is executed against an RDF Dataset which represents a collection of graphs. An RDF Dataset comprises one graph, the default graph, which does not have a name, and zero or more named graphs, where each named graph is identified by an IRI…”

Unsurprisingly for a technical specification this is a very narrow definition of dataset. It also differs from the VoiD definition. While both assume RDF as the means for organising the data, the VoiD term is more general, e.g. it glosses over details of the internal organisation of the dataset into named graphs. This results in some awkwardness when attempting to navigate between a VoiD description and a SPARQL Service Description.
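The difference is easier to see with a small model. The sketch below represents the SPARQL notion of an RDF Dataset as one unnamed default graph plus named graphs keyed by IRI, and adds a method that merges everything, which is roughly the flatter view that a VoiD description takes. The class name and the example triples are illustrative assumptions.

```python
class RDFDataset:
    """One unnamed default graph plus zero or more named graphs."""

    def __init__(self):
        self.default_graph = set()
        self.named_graphs = {}  # graph IRI -> set of (s, p, o) triples

    def add(self, triple, graph=None):
        """Add a triple to the default graph, or to the named graph `graph`."""
        if graph is None:
            self.default_graph.add(triple)
        else:
            self.named_graphs.setdefault(graph, set()).add(triple)

    def triples(self):
        """All triples with graph boundaries ignored."""
        merged = set(self.default_graph)
        for graph_triples in self.named_graphs.values():
            merged |= graph_triples
        return merged


ds = RDFDataset()
ds.add(("ex:book1", "dct:title", "An Example"))
ds.add(("ex:book1", "dct:creator", "ex:alice"), graph="http://example.org/graph/authors")
print(len(ds.default_graph), len(ds.named_graphs), len(ds.triples()))
```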


If you’ve gotten this far, then well done :)

I think there’s a couple of things we can draw out from these definitions which might help us when discussing “datasets”:

  • There’s a clear sense that a dataset relates to a specific topic and is collected for a particular purpose.
  • The means by which a dataset is collected and the definitions of its contents are important for supporting proper re-use
  • Whether a dataset consists of “raw data” or more analysed results can vary across communities. Both forms of dataset might be available, but in some circumstances (e.g. for privacy reasons) only derived data might be published
  • Depending on your perspective and your immediate use case the dataset may be just the data items, perhaps expressed in a particular way (e.g. as RDF).  But in a broader sense, the dataset also includes the supporting documentation, definitions, licensing statements, etc.

While there’s a common core to these definitions, different communities do have slightly different outlooks that are likely to affect how they expect to publish, describe and share data on the web.

Dataset and API Discovery in Linked Data

I’ve been recently thinking about how applications can discover additional data and relevant APIs in Linked Data. While there’s been lots of research done on finding and using (semantic) web services I’m initially interested in supporting the kind of bootstrapping use cases covered by Autodiscovery.

We can characterise that use case as helping to answer the following kinds of questions:

  • Given a resource URI, how can I find out which dataset it is from?
  • Given a dataset URI, how can I find out which resources it contains and which APIs might let me interact with it?
  • Given a domain on the web, how can I find out whether it exposes some machine-readable data?
  • Where is the SPARQL endpoint for this dataset?

More succinctly: can we follow our nose to find all related data and APIs?

I decided to try and draw a diagram to illustrate the different resources involved and their connections. I’ve included a small version below:

Data and API Discovery with Linked Data

Let’s run through the links between different types of resources:

  • From Dataset to SPARQL Endpoint (and Item Lookup, and Open Search Description): this is covered by VoiD, which provides simple predicates for linking a dataset to three types of resources. I’m not aware of other types of linking yet, but it might be nice to support reconciliation APIs.
  • From Well-Known VoiD Description (background) to Dataset. This well-known URL allows a client to find the “top-level” VoiD description for a domain. It’s not clear what that entails, but I suspect the default option will be to serve a basic description of a single dataset, with reference to sub-sets (void:subset) where appropriate. There might also just be rdfs:seeAlso links.
  • From a Dataset to a Resource. A VoiD description can include example resources; this blesses a few resources in the dataset with direct links. Ideally these resources ought to be good representative examples of resources in the dataset, but they might also be good starting points for further browsing or crawling.
  • From a Resource to a Resource Description. If you’re using “slash” URIs in your data, then URIs will usually redirect to a resource description that contains the actual data. The resource description might be available in multiple formats and clients can use content negotiation or follow Link headers to find alternative representations.
  • From a Resource Description to a Resource. A description will typically have a single primary topic, i.e. the resource it’s describing. It might also reference other related resources, either as direct relationships between those resources or via rdfs:seeAlso type links (“more data over here”).
  • From a Resource Description to a Dataset. This is where we might use a dct:source relationship to state that the current description has been extracted from a specific dataset.
  • From a SPARQL Endpoint (Service Description) to a Dataset. Here we run into some differences between definitions of dataset, but essentially we can describe in some detail the structure of the SPARQL dataset that is used in an endpoint and tie that back to the VoiD description. I found myself looking for a simple predicate that linked to a void:Dataset rather than describing the default and named graphs, but couldn’t find one.
  • I couldn’t find any way to relate a Graph Store to a Dataset or SPARQL endpoint. Early versions of the SPARQL Graph Store protocol had some notes on autodiscovery of descriptions, but these aren’t in the latest versions.
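Two of these hops can be sketched without touching the network. The first helper derives the well-known VoiD location for whichever domain a resource URI lives on, using the /.well-known/void path described in the VoiD documentation. The second builds, but does not send, a request that uses content negotiation to ask for an RDF representation. The function names and the example.org URI are illustrative assumptions.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import Request


def well_known_void_url(resource_uri):
    """Map any URI on a domain to that domain's well-known VoiD location."""
    parts = urlsplit(resource_uri)
    # Keep the scheme and host; drop the path, query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/.well-known/void", "", ""))


def rdf_request(uri, media_type="text/turtle"):
    """Build a request that asks for an RDF serialisation of `uri`."""
    return Request(uri, headers={"Accept": media_type})


resource = "http://example.org/id/book/123"  # hypothetical resource URI
print(well_known_void_url(resource))
print(rdf_request(resource).get_header("Accept"))
```

Passing the request to urllib.request.urlopen would follow any redirect from the resource to its description, which is the “slash” URI behaviour mentioned in the list.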

These links are expressed, for the most part, in the data but could also be present as Link headers in HTTP responses or in HTML (perhaps with embedded RDFa).

I’ve also not covered sitemaps at all, which provide a way to exhaustively list the key resources in a website or dataset to support mirroring and crawling. But I thought this diagram might be useful.

I’m not sure that the community has yet standardised on best practices for all of these cases and across all formats. That’s one area of discussion I’m keen to explore further.

Publishing SPARQL queries and documentation using github

Yesterday I released an early version of sparql-doc, a SPARQL documentation generator. I’ve got plans for improving the functionality of the tool, but I wanted to briefly document how to use github and sparql-doc to publish SPARQL queries and their documentation.

Create a github project for the queries

First, create a new github project to hold your queries. If you’re new to github then you can follow their guide to get a basic repository set up with a README file.

The simplest way to manage queries in github is to publish each query as a separate SPARQL query file (with the extension “.rq”). You should also add a note somewhere specifying the license associated with your work.

As an example, here is how I’ve published a collection of SPARQL queries for the British National Bibliography.

When you create a new query be sure to add it to your repository and regularly commit and push the changes:

git add my-new-query.rq
git commit my-new-query.rq -m "Added another example"
git push origin master

The benefit of using github is that users can report bugs or submit pull requests to contribute improvements, optimisations, or new queries.

Use sparql-doc to document your queries

When you’re writing your queries follow the commenting conventions encouraged by sparql-doc. You should also install sparql-doc so you can use it from the command-line. (Note: installation guide and notes still need some work!)

As a test you can try generating the documentation from your queries as follows. Make a new local directory for the output, e.g. ~/projects/docs. Execute the following to create the docs, with the first argument pointing at your github project directory (here ~/projects/examples):

sparql-doc ~/projects/examples ~/projects/docs

You should then be able to open up the index.html document in ~/projects/docs to see how the documentation looks.

If you have an existing web server somewhere then you can just zip up those docs and put them somewhere public to share them with others. However you can also publish them via Github pages. This means you don’t have to set up any web hosting at all.

Use github pages to publish your docs

Github Pages allows github users to host public, static websites directly from github projects. It can be used to publish blogs or other project documentation. But using github pages can seem a little odd if you’re not familiar with git.

Effectively what we’re going to do is create a separate collection of files — the documentation — that sits in parallel to the actual queries. In git terms this is done by creating a separate “orphan” branch. The documentation lives in the branch, which must be called gh-pages, while the queries will remain in the master branch.

Again, github have a guide for manually adding and pushing files for hosting as pages. The steps below follow the same process, but using sparql-doc to create the files.

Github recommend starting with a fresh separate checkout of your project. You then create a new branch in that checkout, remove the existing files and replace them with your documentation.

As a convention I suggest that when you check out the project a second time, you give it a separate name, e.g. by adding a “-pages” suffix.

So for my bnb-queries project, I will have two separate checkouts:

The original checkout I did when setting up the project for the BNB queries was:

git clone git@github.com:ldodds/bnb-queries.git

This gave me a local ~/projects/bnb-queries directory containing the main project code. So to create the required github pages branch, I would do this in the ~/projects directory:

#specify different directory name on clone
git clone git@github.com:ldodds/bnb-queries.git bnb-queries-pages
#then follow the steps as per the github guide to create the pages branch
cd bnb-queries-pages
git checkout --orphan gh-pages
#then remove existing files
git rm -rf .

This gives me two parallel directories, one containing the master branch and the other the documentation branch. Make sure you really are working with a separate checkout before deleting and replacing all of the files!

To generate the documentation I then run sparql-doc telling it to read the queries from the project directory containing the master branch, and then use the directory containing the gh-pages branch as the output, e.g.:

sparql-doc ~/projects/bnb-queries ~/projects/bnb-queries-pages

Once that is done, I then add, commit and push the documentation, as per the final step in the github guide:

cd ~/projects/bnb-queries-pages
git add *
git commit -am "Initial commit"
git push origin gh-pages

The first time you do this, it’ll take about 10 minutes for the pages to become active. They will appear at the following URL:


E.g. http://ldodds.github.com/sparql-doc

If you add new queries to the project, be sure to re-run sparql-doc and add/commit the updated files.

Hopefully that’s relatively clear.

The main thing to understand is that locally you’ll have your github project checked out twice: once for the main line of code changes, and once for the output of sparql-doc.

These will need to be separately updated to add, commit and push files. In practice this is very straight-forward and means that you can publicly share queries and their documentation without the need for web hosting.


Developers often struggle with SPARQL queries. There aren’t always enough good examples to play with when learning the language or when trying to get to grips with a new dataset. Data publishers often overlook the need to publish examples or, if they do, rarely include much descriptive documentation.

I’ve also been involved with projects that make heavy use of SPARQL queries. These are often externalised into separate files to allow them to be easily tuned or tweaked without having to change code. Having documentation on what a query does and how it should be used is useful. I’ve seen projects that have hundreds of different queries.

It occurred to me that while we have plenty of tools for documenting code, we don’t have a tool for documenting SPARQL queries. If generating and publishing documentation was a little more frictionless, then perhaps people would do it more often. Services like SPARQLbin are useful, but address a slightly different use case.

Today I’ve hacked up the first version of a tool that I’m calling sparql-doc. Its primary usage is likely to be helping to publish good SPARQL examples, but might also make a good addition to existing code/project documentation tools.

The code is up on github for you to try out. It’s still very rough-and-ready but already produces some useful output. You can see a short example here.

It’s very simple and adopts the same approach as tools like rdoc and Javadoc: it just specifies some conventions for writing structured comments. Currently it supports adding a title, description, list of authors, tags, and related links to a query. Because the syntax is simple, I’m hoping that other SPARQL tools and IDEs will support it.
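To give a feel for how little machinery the structured-comment approach needs, here is a minimal sketch of a parser for the comment block at the top of a query file. The “# @name value” syntax and the annotation names used here (title, author, tag, see) are assumptions made for this illustration, so check the sparql-doc documentation for the conventions it actually uses. This is not the tool’s own implementation.

```python
import re

# Assumed annotation syntax: "# @name value". Not confirmed against sparql-doc.
ANNOTATION = re.compile(r"^#\s*@(\w+)\s+(.*)$")


def parse_query_doc(text):
    """Extract documentation fields from a query's leading comment block."""
    doc = {"title": None, "description": [], "author": [], "tag": [], "see": []}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("#"):
            break  # the comment block ends at the first non-comment line
        match = ANNOTATION.match(line)
        if match:
            name, value = match.group(1), match.group(2).strip()
            if name == "title":
                doc["title"] = value
            elif name in doc:
                doc[name].append(value)  # unknown annotation names are ignored
        else:
            # Plain comment lines are treated as free-text description.
            doc["description"].append(line.lstrip("#").strip())
    doc["description"] = " ".join(part for part in doc["description"] if part)
    return doc


example = """# Lists every book written
# by a given author.
# @title Books by author
# @author A. N. Example
# @tag books
# @see http://example.org/docs
SELECT ?book WHERE { ?book ?p ?o }
"""
print(parse_query_doc(example))
```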

I plan to improve the documentation output to provide more ways to navigate the queries, e.g. by tag, query type, prefix, etc.

Let me know what you think!

A Brief Review of the Land Registry Linked Data

The Land Registry have today announced the publication of their Open Data — including both Price Paid information and Transactions as Linked Data. This is great to see, as it means that there is another UK public body making a commitment to Linked Data publishing.

I’ve taken some time to begin exploring the data. This blog post provides some pointers that may help others in using the Linked Data. I’m also including some hopefully constructive feedback on the approach that the Land Registry have taken.

The Land Registry Linked Data

The Linked Data is available from http://landregistry.data.gov.uk. This follows the general pattern used by other organisations publishing public sector Linked Data in the UK.

The data consists of a single SPARQL endpoint — based on the Open Source Fuseki server — which contains RDF versions of both the Price Paid and Transaction data. The documentation notes that the endpoint will be updated on the 20th of each month, with the equivalent to the monthly releases that are already published as CSV files.

Based on some quick tests, it would appear that the endpoint contains all of the currently published Open Data, which in total is 16,873,170 triples covering 663,979 transactions.

The data seems to primarily use custom vocabularies for describing the data:

The landing page for the data doesn’t include any examples, but I ran some SPARQL queries to extract a few, e.g:

So for Price Paid Data, the model appears to be that a Transaction has a Transaction Record which in turn has an associated Address. The transaction counts seem to be standalone resources.

The SPARQL endpoint for the data is at http://landregistry.data.gov.uk/landregistry/sparql. A test form is also available and that page has a couple of example queries, including getting Price Paid data based on a postcode search.

However I’d suggest that the following version might be slightly better as it includes the record status for the record, which will indicate whether it is an “add” or a “delete”:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX lrppi: <http://landregistry.data.gov.uk/def/ppi/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX lrcommon: <http://landregistry.data.gov.uk/def/common/>
SELECT ?paon ?saon ?street ?town ?county ?postcode ?amount ?date ?status
{ ?transx lrppi:pricePaid ?amount .
 ?transx lrppi:transactionDate ?date .
 ?transx lrppi:propertyAddress ?addr.
 ?transx lrppi:recordStatus ?status.

 ?addr lrcommon:postcode "PL6 8RU"^^xsd:string .
 ?addr lrcommon:postcode ?postcode .

 OPTIONAL {?addr lrcommon:county ?county .}
 OPTIONAL {?addr lrcommon:paon ?paon .}
 OPTIONAL {?addr lrcommon:saon ?saon .}
 OPTIONAL {?addr lrcommon:street ?street .}
 OPTIONAL {?addr lrcommon:town ?town .}
}
ORDER BY ?amount
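A query like this can also be run from a script rather than the test form. The SPARQL Protocol allows a query to be sent as a GET request using a “query” parameter, which has the side benefit of producing a URL that can be shared. The sketch below only builds that URL for the endpoint given above. The function name and the shortened query are illustrative, and whether this particular endpoint accepts GET requests is an assumption.

```python
from urllib.parse import urlencode

# Endpoint address as given in this post.
ENDPOINT = "http://landregistry.data.gov.uk/landregistry/sparql"


def query_url(query, endpoint=ENDPOINT):
    """Build a GET URL that carries the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


# A shortened query for illustration; the full one above works the same way.
short_query = """PREFIX lrppi: <http://landregistry.data.gov.uk/def/ppi/>
SELECT ?transx ?amount
WHERE { ?transx lrppi:pricePaid ?amount . }
LIMIT 10"""

print(query_url(short_query))
```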

General Feedback

Let’s start with the good points:

  • The data is clearly licensed so is open for widespread re-use
  • There is a clear commitment to regularly updating the data, so it should stay in line with the Land Registry’s other Open Data. This makes it reliable for developers to use the data and the identifiers it contains
  • The data uses Patterned URIs based on Shared Keys (the Land Registry’s own transaction identifiers) so building links is relatively straight-forward
  • The vocabularies are documented and the URIs resolve, so it is possible to lookup the definitions of terms. I’m already finding that easier than digging through the FAQs that the Land Registry publish for the CSV versions.

However I think there is room for improvement in a number of areas:

  • It would be useful to have more example queries, e.g. how to find the transactional data, as well as example Linked Data resources. A key benefit of a linked dataset is that you should be able to explore it in your browser. I had to run SPARQL queries to find simple examples
  • The SPARQL form could be improved: currently it uses a POST by default and so I don’t get a shareable URL for my query; the Javascript in the page also wipes out my query every time I hit the back button, making it frustrating to use
  • The vocabularies could be better documented, for example a diagram showing the key relationships would be useful, as would a landing page providing more of a conceptual overview
  • The URIs in the data don’t match the patterns recommended in Designing URI Sets for the Public Sector. While I believe that guidance is under review, the data is diverging from current documented best practice. Linked Data purists may also lament the lack of a distinction between resource and page.
  • The data uses custom vocabulary where there are existing vocabularies that fit the bill. The transactional statistics could have been adequately described by the Data Cube vocabulary with custom terms for the dimensions. The related organisations could have been described by the ORG ontology and VCard with extensions ought to have covered the address information.

But I think the biggest oversight is the lack of linking, both internal and external. The data uses “strings” where it could have used “things”: for places, customers, localities, post codes, addresses, etc.

Improving the internal linking will make the dataset richer, e.g. allowing navigation to all transactions relating to a specific address, or all transactions for a specific town or postcode region. I’ve struggled to get a Post Code District based query to work (e.g. “price paid information for BA1”) because the query has to resort to regular expressions which are often poorly optimised in triple stores. Matching based on URIs is always much faster and more reliable.

External linking could have been improved in two ways:

  1. The dates in the transactions could have been linked to the UK Government Interval Sets. This provides URIs for individual days
  2. The postcode, locality, district and other regional information could have been linked to the Ordnance Survey Linked Data. That dataset already has URIs for all of these resources. While it may have been a little more work to match regions, the postcode based URIs are predictable so are trivial to generate.
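The sketch below shows how little work those two kinds of link could take once a postcode and a transaction date are in hand. The URI patterns are assumptions based on how the Ordnance Survey and reference.data.gov.uk identifiers appear to be structured, so verify them against those datasets before relying on them. The function names are illustrative.

```python
import re
from datetime import date

# Assumed URI patterns; check against the published datasets before use.
OS_POSTCODE_UNIT = "http://data.ordnancesurvey.co.uk/id/postcodeunit/"
GOV_DAY = "http://reference.data.gov.uk/id/day/"


def compact_postcode(postcode):
    """Upper-case a postcode and strip all whitespace: 'pl6 8ru' -> 'PL68RU'."""
    return re.sub(r"\s+", "", postcode).upper()


def postcode_unit_uri(postcode):
    """URI for a full postcode, under the assumed Ordnance Survey pattern."""
    return OS_POSTCODE_UNIT + compact_postcode(postcode)


def postcode_district(postcode):
    """The district part of a full postcode, e.g. 'BA1' from 'BA1 1AA'.

    Assumes a complete postcode, whose last three characters are the
    inward code.
    """
    return compact_postcode(postcode)[:-3]


def day_uri(day):
    """URI for a single calendar day, under the assumed interval set pattern."""
    return GOV_DAY + day.isoformat()


print(postcode_unit_uri("PL6 8RU"))
print(postcode_district("BA1 1AA"))
print(day_uri(date(2013, 3, 20)))
```

With a district value or URI attached to each address, the “price paid information for BA1” query becomes an exact match rather than a regular expression.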

These improvements would have moved the Land Registry data from 4 to 5 Stars with little additional effort. That does more than tick boxes: it makes the entire dataset easier to consume, query and remix with others.

Hopefully this feedback is useful for others looking to consume the data or who might be undertaking similar efforts. I’m also hoping that it is useful to the Land Registry as they evolve their Linked Data offering. I’m sure that what we’re seeing so far is just the initial steps.

How I organise data conversions

Factual announced a new project last week, called Drake, which is billed as a “make for data”. The tool provides a make-style environment for building workflows for data conversions. It has support for multiple programming languages, uses a standard project layout, and integrates with HDFS.

It looks like a really nice tool and I plan to take a closer look at it. When you’re doing multiple data conversions, particularly in a production setting, it’s important to adopt some standard practices. Having a consistent way to manage assets, convert data and manage workflows is really useful. Quick and dirty data conversions might get the job done, but a little thought up front can save time later when you need to refresh a dataset, fix bugs, or allow others to contribute. Consistency also helps when you come to add another layer of automation, to run a number of conversions on a regular basis.

I’ve done a fair few data conversions over the last few years and I’ve already adopted a similar approach to Drake: I use a standard workflow, programming environment and project structure. I thought I’d write this down here in case it’s useful for others. It’s certainly saved me time. I’d be interested to learn what approaches other people take to help organise their data conversions.

Project Layout

My standard project layout is:

  • bin — the command-line scripts used to run a conversion. I tend to keep these task based, e.g. focusing on one element of the workflow or conversion. E.g. separate scripts for crawling data, converting types of data, etc. Scripts are parameterised with input/output directories and/or filenames
  • data — created automatically this sub-directory holds the output data
    • cache — a cache directory for all data retrieved from the web. When crawling or scraping data I always work on a local cached copy to avoid unnecessary network traffic
    • nt (or rdf) — for RDF conversions I typically generate N-Triples output as it’s simple to generate and work with in a range of tools. I sometimes generate RDF/XML output, but only if I’m using XSLT to do transformations from XML sources
  • etc — additional supporting files, including:
    • static — static data, e.g. hand-crafted data sources, RDF schema files, etc
    • sparql — SPARQL queries that are used in the conversion, as part of the “enrichment” phase
    • xslt — For keeping XSLT transforms when I’m using XML input and have found it easier to process using XSLT rather than using libxml.
  • lib — the actual code for the conversion. The scripts in the bin directory handle the input/output, the rest is done is Ruby classes
  • Rakefile — a Ruby Rakefile that describes the workflow. I use this to actually run the conversions
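The layout above is easy to bootstrap. As a rough illustration, here is a minimal sketch in Python (not the Ruby used for the real conversions) that creates the skeleton for a new project. The function name and the choice of "nt" over "rdf" are assumptions made for the example, and in practice the data directory is created by the conversion itself.

```python
from pathlib import Path

# Sub-directories taken from the layout listed above; "nt" could equally be "rdf".
LAYOUT = [
    "bin",
    "data/cache",
    "data/nt",
    "etc/static",
    "etc/sparql",
    "etc/xslt",
    "lib",
]


def create_project(root):
    """Create the standard directory skeleton under `root` and return the paths."""
    root = Path(root)
    created = []
    for sub in LAYOUT:
        path = root / sub
        # exist_ok makes the function safe to re-run on an existing project.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    # An empty Rakefile is a placeholder for the workflow definition.
    (root / "Rakefile").touch()
    return created
```

Calling `create_project("my-conversion")` would give every conversion the same starting point, which is most of the value of having a standard layout.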

While there are some minor variations, I’ve used this same structure across a number of different conversions.


The workflow for the conversion is managed using a Ruby Rakefile. Like Factual, I’ve found that a make-style environment is useful for organising simple data conversion workflows. Rake allows me to execute command-line tools (e.g. curl for downloading data or rapper for doing RDF format conversions), execute arbitrary Ruby code, and shell out to dedicated scripts.

I try to use a standard set of rake targets to co-ordinate the overall workflow. These are broken down into smaller stages where necessary. While the steps vary between datasets, the stages I most often use are:

  1. download (or cache): the main starting point, which fetches the necessary data. I try to avoid manually downloading any data and rely on curl or perhaps dpm to get the required files. I’ve tended to use “download” for when I’m just grabbing static files and “cache” for when I’m doing a website crawl. This is just a cue for me. I like to tread carefully when hitting other people’s servers, so I aggressively cache files. Having a separate stage to grab data is also handy for when you’re working offline on later steps
  2. convert: perform the actual conversion, working on the locally cached files only. So far I tend to use either custom Ruby code or XSLT
  3. reconcile: generate links to other datasets, often using the Google Refine Reconciliation API
  4. enrich: enrich the dataset with additional data, e.g. by performing SPARQL queries to fetch remote data, or materialise new data
  5. package: package up the generated output as a tar.gz file
  6. publish: the overall target which runs all of the above
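To make the shape of that workflow concrete, here is a minimal sketch of a make-style task runner. It is written in Python rather than as a Rakefile, so the `task` decorator and `invoke` function are stand-ins invented for the example, not Rake's API, and the stage bodies are placeholders. It assumes that only publish declares prerequisites, so that any single stage can still be run on its own.

```python
TASKS = {}


def task(name, needs=()):
    """Register a function as a named task with a list of prerequisite tasks."""
    def register(func):
        TASKS[name] = (tuple(needs), func)
        return func
    return register


def invoke(name, done=None):
    """Run the prerequisites of `name`, then `name` itself, each at most once.

    Returns the list of task names in the order they were run.
    """
    done = [] if done is None else done
    if name not in done:
        needs, func = TASKS[name]
        for dep in needs:
            invoke(dep, done)
        func()
        done.append(name)
    return done


@task("download")
def download():
    print("fetch source files into data/cache")


@task("convert")
def convert():
    print("convert the cached files into data/nt")


@task("reconcile")
def reconcile():
    print("generate links to other datasets")


@task("enrich")
def enrich():
    print("add data fetched from remote sources")


@task("package")
def package():
    print("bundle the output as a tar.gz file")


@task("publish", needs=["download", "convert", "reconcile", "enrich", "package"])
def publish():
    print("all stages complete")
```

Here `invoke("publish")` runs every stage in order, while `invoke("enrich")` runs just that one step, which is handy when re-running part of a conversion.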

The precise stages used vary between projects and there are usually a number of other targets in the Rakefile that perform specific tasks. For example, the convert stage is usually dependent on several other steps that generate particular types of data. But having standard stage names makes it easier to run specific parts of the overall conversion. One additional stage that would be useful to have is “validation”, so you can check the quality of the output.

At various times I’ve considered formalising these stages further, e.g. by creating some dedicated Rake extensions, but I’ve not yet found the need to do that as there’s usually very little code in each step.

I tend to separate out dependencies on external resources, e.g. remote APIs, from the core conversion. The convert stage will work entirely on locally cached data and then I can call out to other APIs in a separate reconcile or enrich stage. Again, this helps when working on parts of the conversion offline and allows the core conversion to happen without risk of failure because of external dependencies. If a remote API fails, I don’t want to have to re-run a potentially lengthy data conversion: I just want to resume from a known point.

I also try to avoid, as far as possible, using extra infrastructure, e.g. relying on databases, triple stores, or a particular environment. While this might help improve performance in some cases (particularly for large conversions), I like to minimise dependencies to make it easier to run the conversions in a range of environments, with minimal set-up and minimal cost for anyone running the conversion code. Many of the conversions I’ve been doing are relatively small scale; for larger datasets, using a triple store or Hadoop might be necessary. But this would be easy to integrate into the existing stages, perhaps by adding a “prepare” stage to do any necessary installation and configuration.

For me it’s very important to be able to automate the download of the initial data files or web pages that need scraping. This allows the whole process to be automated and cached files to be re-used where possible. This simplifies the process of using the data and avoids unnecessary load on data repositories. As I noted at the end of yesterday’s post on using dpm with data.gov.uk, having easy access to the data files is important. The context for interpreting that data mustn’t be overlooked, but consuming that information is done separately from using the data.
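The caching idea is simple enough to sketch. The following Python is an illustration only, not the code used in the actual conversions: the function names, the default data/cache location and the use of a SHA-1 hash of the URL as the filename are all assumptions made for the example.

```python
import hashlib
import urllib.request
from pathlib import Path


def http_get(url):
    """Default fetcher: download the body of `url` as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def fetch_cached(url, cache_dir="data/cache", fetch=http_get):
    """Return the path of a local copy of `url`, downloading it only if missing."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so that any URL maps onto a safe, stable filename.
    path = cache_dir / hashlib.sha1(url.encode("utf-8")).hexdigest()
    if not path.exists():
        path.write_bytes(fetch(url))
    return path
```

Later stages only ever read the returned path, so they keep working offline, and deleting a file from the cache directory is enough to force a fresh download of that one resource.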

To summarise, there’s nothing very revolutionary here: I’m sure many of you use similar and perhaps better approaches. But I wanted to share my style for organising conversions and encourage others to do likewise.

How to use dpm with data.gov.uk

The Data Package Manager is an Open Knowledge Foundation project to create a tool to support discovery and distribution of datasets. The tool uses the concept of a “data package” to describe the basic metadata for a dataset plus the supporting files. Packages are indexed in a registry to make them searchable and to support distribution. The dpm tool works with the CKAN data portal software, using its API to search and download data packages.

The dpm documentation includes guidance on how to install and use the software. Once the basic software is installed you run:

dpm setup config

This will create a default configuration file called .dpmrc in your home directory. This configuration works with The Data Hub, allowing you to access its registry of over 5000 datasets. For example, there’s a basic RDF/XML version of the British National Bibliography. If we want to automatically download the files associated with that package then we can run the following command:

dpm download ckan://bluk-bnb-basic bnb-basic

The ckan:// parameter identifies the dataset: note that bluk-bnb-basic is the same as the id used in the URL of the dataset on the Data Hub. The final parameter is the local directory that the files are downloaded into. This makes it easy to script up downloads of a dataset if the publisher has gone to the trouble of associating the files with their CKAN package.

The data.gov.uk website has been built using CKAN. The API endpoint can be found at: http://data.gov.uk/api/. This means that we can use dpm to interact with data.gov.uk too: all we need to do is specify that dpm should use a different registry.

To get dpm to use a different CKAN instance we need to modify its config:

  1. Take a copy of ~/.dpmrc and put it somewhere handy, e.g. ~/tools/datapkg/datagovuk.ini
  2. Edit the ckan.url entry and change it to http://data.gov.uk/api/
  3. When you run dpm use the --config or -c parameter to specify that it should use the alternate config

Here’s a gist that shows an example of the edited config. It’s best to just modify a copy of the default version as there are other paths in there that should remain unchanged.
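If you’d rather script the copy-and-edit steps than do them by hand, a minimal Python sketch follows. It rewrites only the ckan.url line and leaves every other line of the file untouched. The function names are made up for the example, and it assumes the entry is written as a single `ckan.url = ...` (or `ckan.url: ...`) line; check your own .dpmrc before relying on it.

```python
import re
from pathlib import Path


def retarget_config(text, new_url, key="ckan.url"):
    """Return `text` with the value of `key` replaced by `new_url`.

    Comments, other settings and other paths are passed through unchanged.
    """
    pattern = re.compile(
        r"^([ \t]*" + re.escape(key) + r"[ \t]*[=:][ \t]*).*$", re.MULTILINE
    )
    # A function is used as the replacement so the URL is inserted literally.
    updated, count = pattern.subn(lambda m: m.group(1) + new_url, text)
    if count == 0:
        raise ValueError(f"no {key} entry found")
    return updated


def copy_config(source, destination, new_url):
    """Write a copy of `source` to `destination` that points at `new_url`."""
    text = Path(source).expanduser().read_text()
    Path(destination).expanduser().write_text(retarget_config(text, new_url))
```

For example, `copy_config("~/.dpmrc", "~/tools/datapkg/datagovuk.ini", "http://data.gov.uk/api/")` would produce the alternate config described in the steps above, provided the target directory already exists.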

Here are some examples of using dpm with data.gov.uk. Make sure the config parameter points to the location of your revised configuration file:

Search data.gov.uk for packages with the keyword “spending”:

dpm --config datagovuk.ini search ckan:// spending

Get a summary of a package:

dpm --config datagovuk.ini info ckan://warwickshire-spending-allocation

Download the files associated with a package to a local data directory. The tool will automatically create sub-directories for the package:

dpm --config datagovuk.ini download ckan://warwickshire-spending-allocation data
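These commands are easy to wrap in a script. The sketch below, in Python, builds the same command lines shown above and runs them with the standard subprocess module. It uses only the dpm options that appear in this post; the helper names and the default config filename are assumptions for the example.

```python
import subprocess


def dpm_command(action, *args, config="datagovuk.ini"):
    """Build the argument list for one dpm call against the alternate config."""
    return ["dpm", "--config", config, action, *args]


def download_packages(names, target="data", config="datagovuk.ini",
                      runner=subprocess.run):
    """Download each named package into `target`, stopping at the first failure."""
    for name in names:
        command = dpm_command("download", f"ckan://{name}", target, config=config)
        # check=True makes subprocess.run raise if dpm exits with an error.
        runner(command, check=True)
```

With dpm installed, `download_packages(["warwickshire-spending-allocation"])` is equivalent to the last command above, and passing a longer list fetches several packages in one go.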

The latter command would be much more useful if the data.gov.uk datasets consistently had the data associated with them. Unfortunately in many cases there is still just a reference to another website.

Hopefully this will improve over time. While it’s important for data to be properly documented and contextualised, to support easy re-use it must also be easy to automate the retrieval and processing of that data. These are two separate, but important, use cases.

HTTP 1.1 Changes Relevant to Linked Data

Mark Nottingham has posted a nice status report on the ongoing effort to revise HTTP 1.1 and specify HTTP 2.0. In the post Mark highlights a list of changes from RFC2616. This ought to be required reading for anyone doing web application development, particularly if you’re building APIs.

I thought it might be useful to skim through the list of changes and highlight those that are particularly relevant to Linked Data and Linked Data applications. Here are the things that caught my eye. I’ve included references to the relevant documents.

  • In Messaging, there is no longer a limit of 2 connections per server. Linked Data applications typically make multiple parallel requests to fetch Linked Data, so having more connections available could help improve performance on the client. In practice browsers have been ignoring this limit for a while, so you’re probably already seeing the benefit.
  • In Semantics, there is a terminology change with respect to content negotiation: we ought to be talking about proactive (server-driven) and reactive (agent-driven) negotiation.
  • In Semantics, a 201 Created response can now indicate that multiple resources have been created. Useful to help indicate if a POST of some data has resulted in the creation of several different RDF resources.
  • In Semantics, a 303 response is now cacheable, provided it carries explicit freshness information. This addresses one performance issue associated with redirects.
  • In Semantics, the default charset for text media types is now whatever the media type says it is, and not ISO-8859-1. This should allow some caveats in the Turtle specification to be removed. UTF-8 can be documented as the default.

So, no really major impacts as far as I can see, but the cacheability of 303 should bring some benefits.
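To illustrate what a cacheable 303 could mean for a Linked Data client, here is a small Python sketch of a redirect cache. It is deliberately simplified and the class is invented for the example: it only honours Cache-Control max-age, expects headers as a plain dict with exactly these key spellings, and ignores Expires, no-store and the other directives a real HTTP cache must handle.

```python
import re
import time


def max_age(headers):
    """Return the max-age value from a Cache-Control header, or None if absent."""
    match = re.search(r"max-age=(\d+)", headers.get("Cache-Control", ""))
    return int(match.group(1)) if match else None


class RedirectCache:
    """Remember where a 303 pointed, but only for as long as the server allows."""

    def __init__(self, clock=time.time):
        self.clock = clock
        self.entries = {}  # request URL -> (redirect target, expiry timestamp)

    def store(self, url, status, headers):
        age = max_age(headers)
        # Without explicit freshness information the redirect is not stored.
        if status == 303 and age and "Location" in headers:
            self.entries[url] = (headers["Location"], self.clock() + age)

    def lookup(self, url):
        """Return the cached redirect target, or None if unknown or stale."""
        location, expires = self.entries.get(url, (None, 0))
        return location if self.clock() < expires else None
```

A client resolving the same resource URI repeatedly could call `lookup` first and skip the initial round trip whenever a fresh entry exists.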

If you think I’ve missed anything important, then leave a comment.

Second Screens, Asymmetric Gaming and the New Multiplayer

The Second Screen concept has been with us for a while but interestingly the idea still seems to be largely associated with TV, and largely as a means of adding a social dimension to the on-screen events. But there are many ways in which a second screen could potentially enrich other forms of media. Whether it’s via a smart phone or a tablet, people at home or in an audience often have an internet-enabled device at hand that could be used in some interesting ways.

For example, at conferences it might be useful to deliver supplementary content alongside a presentation. Synchronizing the on-screen slides to the devices is an obvious step, and it would also be a natural way to supplement live audio and video streams, allowing others to participate remotely more easily. There are other useful bits of information that could be delivered on devices, including speaker bios, references to websites, books (“buy this now”), demos, quick polls, etc.

Second screen apps for films (at home, rather than in cinema) wouldn’t be that dissimilar to TV apps. But while TV apps are typically synchronized to the live broadcast and favour social features, a film app would deliver actor, location or other information cued to the film. Given that many media players are now web enabled, synchronizing the device and playback wouldn’t be that hard.

In fact, with a move towards streaming distribution for films, we can expect that typical DVD and Blu-ray features are likely to move to online distribution too. A second screen provides more interesting ways to deliver that content. Arguably Rian Johnson’s in-theatre commentary for Looper is the first example of “second screen” use for films.

Second Screen Gaming

But I think the most interesting area for exploration is in gaming. There’s been some work in this area already, notably Xbox SmartGlass and the new Wii U with its tablet controller. More on that in a moment.

I realised recently that both my son and I are already using second screens. He’s obsessed by Minecraft and Terraria and has taken to having an iPod Touch to hand whilst playing to access their respective wikis, avoiding switching away from the game itself. I’ve also been using a phone or laptop to access game wikis: in my case for Dark Souls, Fallout 3/New Vegas, etc.

I know we’re not the only gamers who do this. The additional content, although crowd-sourced and not formally part of the game, is becoming an integral part of the game play. It’s not cheating, it’s a collaborative way to expand the gaming experience. (Although the infamously hard Dark Souls ships with a link to the community wiki on the back of the box: you’re going to need that help!)

There are many more ways that a second screen could be used as part of game playing over and above delivering documentation and guidance. It opens up some interesting new ways to play.

For example resource collection games like Minecraft have separate inventory management and crafting interfaces. These could just as easily be delivered using a second screen app linked to the game. An embedded web server would provide an easy way to hook this kind of extra interface into a game, opening up any web-enabled device as a separate controller.
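As a very rough sketch of that idea, the Python below exposes a game’s inventory as JSON over HTTP using only the standard library. Everything here is hypothetical: the inventory contents, the /inventory path and the port are invented for the example, and a real game would expose its own state and add authentication.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical game state; a real game would expose its own inventory object.
INVENTORY = {"wood": 12, "stone": 30, "torch": 4}


def inventory_response(path, inventory):
    """Map a request path to a (status, JSON body) pair for the second screen."""
    if path == "/inventory":
        return 200, json.dumps(inventory, sort_keys=True)
    return 404, json.dumps({"error": "not found"})


class SecondScreenHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = inventory_response(self.path, INVENTORY)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8000):
    """Run the embedded server; a game would start this on a background thread."""
    HTTPServer(("", port), SecondScreenHandler).serve_forever()
```

Any web-enabled device on the same network could then poll /inventory and render its own crafting or inventory screen, leaving the main display free for the game.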

Asymmetric Game-play

The concept of asymmetric game-play isn’t new, but the idea has seen some attention this year with the impending launch of the Wii U. Asymmetric gaming is where the players don’t all have exactly the same gaming experience. The differences in game play might be small or large.

At one end you might be playing as different characters and character types: you can do different things in the game but are essentially experiencing the game in the same way. Most multi-player games that use character classes (e.g. Team Fortress) can be said to offer this kind of limited form of asymmetric game play.

The “Co-Star Mode” of Super Mario Galaxy offers a more advanced style of asymmetric gaming. One player controls Mario while the other uses a pointer in a supporting role: their on-screen presence and forms of interaction are more limited. This style is particularly great when you have players of different abilities, e.g. older and younger siblings.

Continuing down this road it’s not hard to see how you could end up with some very different experiences, particularly for multi-player games. This is the angle that is being promoted with the Wii U. The GamePad controller has an integrated screen, allowing one player to potentially have a very different experience to others. Access to a separate screen (e.g. for secret information) creates possibilities for new types of game play. Nintendo have said they want to focus on exploring adversarial challenges where one player is pitted against a number of others, playing the game in different ways.

Even without multi-player the controller offers lots of interesting possibilities, some of which can be seen in the split-screen action included in ZombiU. This trailer has a nice demo: warning zombies. As I noted above, this same functionality could be offered in many games by modding them to expose a web interface that provided additional controls, viewpoints and interactivity.

Arguably the classic example of asymmetric game-play is the traditional paper and dice based RPG. One player is the game master, the others the adventurers. The Dungeon Master’s screen delineates the space between the GM and the other players, much like a second screen. The GM has different knowledge of the game world and plays in a radically different way to the other players.

It would be interesting to see this translated more fully into a video game environment. A separate screen could support those kinds of mechanics when you’re in the same room, but there are plenty of options to explore asymmetric gaming in an on-line multi-player environment.

New Forms of Multi-player

Traditionally games are designed to fit well-defined genres: RPG, FPS or RTS, to choose just three. The different genres each have their own conventions around interfaces and game play, but their common limitation is the AI: designing good artificial intelligence is hard, which is why it’s so much more fun to play against people. Unfortunately multi-player is often limited to co-op or deathmatch (head-to-head) game modes with variations on rules and objectives.

But what if I could play a game, offering an RTS style interface whilst others are experiencing it from an FPS perspective? Why not replace the Left 4 Dead Director, for example, with a real human opponent? The “Play, Create, Share” idea needn’t be limited to crafting a LittleBigPlanet level for others to play independently, why not put the game designer into the action, with the means to affect it, just like an old school GM? Why can’t I take control over an entire region in a game like World of Warcraft and shape it as I want?

The upcoming game Dust 514 offers an interesting form of asymmetric game play that provides a neat twist on conventional multi-player. The game is an FPS that takes place on a planet in the Eve Online universe. Actions in one game can have effects in the other. The games offer different game play experiences, on different hardware, but in the same universe. I’ll be interested to see how that pans out in practice.

Experiments in multi-player gaming might also give us some insight into creating more nuanced, or at least more varied forms of social interaction in other on-line applications and tools. If you’re going to embrace gamification in your application then take it further than just badges and achievements, and let “players” pit themselves against each other or set each other challenges.

Dark Souls has a number of interesting multi-player innovations that come from applying constraints to how players can interact with one another, eschewing conventional friend lists and multi-player options. It’s very difficult to team up with a specific player and communication options are very limited. The primary mechanism is essentially a form of in-game graffiti. You can leave messages for other players, either to help or hinder. The messages are limited, but add an interesting dimension and often humour to the game. You can also catch glimpses of other players in the form of ghosts in your game world.

What if we extended this kind of idea to the web? For example, a way to indicate how many other people are also reading the same page, and what their collective impression is? It’s not quite a Like or a +1 but neither would it be a conventional comment.

Overall I think we’re at a really interesting stage in the development of gaming in general and multi-player gaming specifically. We have a lot of new highly connected devices, more connectivity and, soon, a new generation of consoles.

A lot of people spend a great deal of time in these Third Places now. In virtual environments we’re no longer limited to existing forms of communication. We can explore a lot of new territory. Unfortunately many of the existing forms of online communications are prone to abuse, spam and trolling. Perhaps some of these newer multi-player ideas might offer ways to create a sense of community and sharing that avoids these issues. And if not, well, there are still a lot of interesting games on the horizon.

Not Just Legislation: Sustainable Open Data Curation Projects

Francis Irving recently wrote an excited blog post about the open curation model that now backs legislation.gov.uk. It’s hard not to get excited about legislation.gov.uk. There’s been so much good work done on the project and everyone involved has achieved a great deal of which they can be proud.

If you’re not familiar with the background then read through Irving’s blog post and look over these slides from a talk that John Sheridan and Jeni Tennison gave at Strata London last week. The project is a nice case study not just for the underlying technology but also for the application of open data in general.

However, while similarly excited by the project, I found myself disagreeing with Irving’s claim that legislation.gov.uk is “the world’s first REAL commercial open data curation project”. I suspect we actually agree on a lot of things, and disagree on a few details. But I think there are plenty of other examples and it’s instructive to look closely at them.

The legislation.gov.uk Model

Firstly though, let’s briefly summarise the legislation.gov.uk model. If I’m misrepresenting anything there, then please let me know in a comment!

  • The core asset being worked on is the UK legislation itself. This is available under the Open Government Licence, so no matter what commercial or organisational model underpins its curation, it’s free for anyone to use
  • The new curation model provides a means for commercial organisations to help maintain the corpus of legislation, e.g. to bring it up to date to reflect actual law. This is done under a Memorandum of Understanding, so there’s a direct relationship between the relevant organisations and the National Archives. Not just anyone can contribute
  • The financial contribution from the curators is in the form of labour: they are providing staff to work on the legislation
  • The National Archives save costs on maintaining the legislation
  • The commercial participants have a better asset upon which they can build new products; this covers not just the updated text, but its availability as Open Data via APIs, etc.
  • Everyone benefits from a more up to date, accurate and reference-able body of legislation. This includes not just the immediate participants but all of the downstream users, which includes lawyers, non-lawyer professionals and individual citizens

That’s a great model with some obvious tangible and intangible value being created and exchanged. But I think that there are some potential variations.

Characterising Variations of that Model

To help think about variations, let’s identify several different dimensions along which we might find them:

  • The Asset(s): what is being curated, is it primarily a dataset or is that a secondary by-product? It might be several things, the data might not even be the primary asset.
  • The Contributors: who is actually creating, delivering and maintaining the asset(s)? Can anyone contribute or are contributions limited to a particular group or type of participant?
  • The Consumers: who uses the asset? Is it the same group as contributes to its curation or is there a wider community? We might expect there to always be more consumers than contributors, particularly for a successful data project
  • The Financial Model: how is the work to curate the asset being supported? For a successful project the ongoing provision of the asset ought to be sustainable, but it might actually generate profits.
  • The Licensing: what form of licensing is associated with use of the asset(s)?
  • Loosely we might want to characterise the Incentives: what are the benefits for both the contributors and consumers of the data?

Now, I’m not suggesting that these are the only useful dimensions to consider, but I think these are the main ones. Hopefully it’s obvious how the legislation model can be characterised along these dimensions.

Using headings like this makes it easier to summarise in a blog post, but there are other techniques for teasing out forms of value creation and exchange. The one I’ve used successfully in the past is Value Network Analysis (VNA). In my dimensions above the Consumers and Contributors are the participants in the network, and the Financial Model and Incentives describe the tangible and intangible value being exchanged.

I plan to blog more about VNA in the future when I share the analysis I’ve done around data marketplaces. But for the rest of the article I’m going to highlight a couple of examples that show some useful variations.


MusicBrainz

Let’s start with MusicBrainz. I’ve long used MusicBrainz as an example of a sustainable open data project as it has some nice characteristics.

  • Assets: The project has several products which includes some open source software. But the most significant asset is the MusicBrainz Database. The data is split into a core public domain portion, and a separately licensed set of supplementary data
  • Contributors: Anyone can sign up and make contributions to the database. There are some privileged editorial positions, but anyone can contribute to both the data and the software. While I believe the majority of the contributions come from the MusicBrainz community there is at least one commercial curator: the BBC pay editorial staff to add to and update the database.
  • Consumers: Again, anyone can use the data. There are a lot of projects that use MusicBrainz data some of which are commercial.
  • Financial Model: The project is supported in part by donations from users and businesses, and in part by commercial licensing of the Live Data Feed. The BBC are the most notable commercial licensee; Google the single largest donor. There is also some revenue from affiliate fees, etc. Some organisations have also contributed in kind, e.g. hardware or software services. The project finances are transparent if anyone wants to dig further.
  • Licensing: the core of the database is Public Domain. The rest is under a Creative Commons BY-NC-SA license.
  • Incentives: having an open music database provides lots of benefits for individuals and organisations building products around the data. The costs of building a dataset collaboratively are much lower than building and maintaining it independently. For organisations like the BBC, MusicBrainz provides an off-the-shelf asset that can be enriched directly by its editorial team or integrated into new products.


ORCID

The Open Researcher and Contributor ID (ORCID) project is a not-for-profit organisation that aims to provide “a registry of persistent unique identifiers for researchers and scholars and automated linkages to research objects such as publications, grants, and patents”.

It’s a fairly new venture but has been in incubation for some time. Over the last few years there has been lots of interest in having a shared open identifier for helping link together research literature and ORCID is one of the key projects that has crystallised out of those activities. It’s in the process of moving towards a production system. So, whereas MusicBrainz predates the legislation.gov.uk work, the ORCID system is not yet fully launched.

Let’s look at its model:

  • Assets: the primary asset is the database of researcher and contributor identifiers; the project software will also all be open source
  • Contributors: anyone will be able to use the website tools to create and manage their contributor identifier; there will also be ways for the project members to contribute directly to maintaining the data, e.g. to add new publication links. As noted in the principles, contributors will own their own data and profiles.
  • Consumers: broadly anyone can participate, but the expectation is that it will be of most value to individual researchers, publishers, and funding agencies
  • Financial Model: the ability to contribute data and use some of the basic data maintenance tools will be free. However additional services will only be available to paying members. This includes getting more timely access to updated data; notifications of data changes; etc. The project has been bootstrapped with support of a number of initial sponsors.
  • Licensing: the core database will be released on an annual basis under a CC0 license, placing it into the public domain.
  • Incentives: the broad incentive for all participants is to help bind together the research literature in a better way than is currently possible. Linking research to authors requires participants from across the whole publishing community, including the authors themselves. Using an open collaboration model ensures that everyone can engage with a minimum of cost. The publishers, who perhaps stand to gain most, will be bringing sustainability. The membership model has already proven to work in publishing with CrossRef, which is similarly structured.

ORCID is an interesting variation when contrasted with the legislation.gov.uk approach. Many aspects are similar: it is industry focused and is solving a known problem. The major financial contributions will come from commercial organisations.

There are also several differences. Firstly, the collaboration model is different: it’s not just commercial organisations that can contribute to the basic maintenance of the data, as researchers can manage their own profiles.

Secondly, the data licensing model is different. While legislation.gov.uk offers data under the OGL with free APIs, ORCID places data into the public domain but only plans to update data dumps annually. More frequent access to data requires use of the APIs, which are member services. This difference is clearly useful as a lever to encourage commercial organisations to sign up, which will directly contribute to the sustainability of the overall project.

Board Game Geek (and other crowd-sourcing examples)

I’ve purposefully chosen the next example because it has several different characteristics. Board Game Geek (BGG) is a community of….well…board game geeks! The site provides a number of different features, including a marketplace, but the core of the service is the database of board games which is collaboratively maintained by the community. The database currently holds over 60,000 different games from well over 12,000 different publishers.

  • Assets: the primary asset is the database that backs the site. There are tips for mining data from the service as well as an API.
  • Contributors: anyone can sign-up and contribute
  • Consumers: again, anyone with an interest in the data. I’ve not been able to identify any commercial users of the service
  • Financial Model: the site is supported by advertising and donations from the community (BGG Patrons). It’s possible to place adverts directly through the site, which might be a viable way for games publishers to connect with what appears to be a thriving community.
  • Licensing: the data licensing is actually unclear, although I’ve seen references to free re-use so long as the data is not re-published
  • Incentives: the service provides a focal point for a community, so maintaining the database benefits all the participants equally; access to the raw data allows people to build their own tools for working with the games data

Admittedly the credentials of BGG as an Open Data project are shakier than the other examples here: the licensing is unclear and what data dumps are available are unfortunately out of date.

But I’ve included it because the basic model that underpins the service is actually pretty common. I could have chosen several alternative examples.

The common aspects here are the open participation and sustainability via advertising, donations and (no doubt) ongoing support and engagement by the project leads. In each case the service addresses the needs and interests of a particular community. Licensing and access to data varies considerably. Commercial use of these datasets is either discouraged or needs up-front agreement.

I’ve previously approached the leads behind TheGamesDB.net and TheTVDB.com to discuss whether either of them saw a data marketplace as a potential source of additional revenue. Neither was interested in exploring that further. We could draw any number of conclusions from that but presumably they’re at least not struggling to maintain their current services.

In each of these cases the creation of the core database is the primary aspect of the service. But we can also find examples of where collaborative curation of data is happening as a secondary aspect of a service:

  • Discogs is a community of music collectors. Like MusicBrainz, that community has ended up curating a database of artists, releases and tracks. The original core of the site was a marketplace to support the buying and selling of records. The business model is based around advertising and commission on marketplace sales. The core database is available under a CC0 license via an API or monthly data dumps.
  • Bricklink is essentially an eBay for Lego. It generates revenue from commissions on sales and, like Discogs, along the way has produced a dataset that contains data on Lego bricks, sets, inventories, etc. The data can be downloaded and, while not explicitly licensed, I’ve been told by the maintainers that they just ask for attribution.

In both of these cases we can see that the crowd-sourcing has happened as a means to support another activity: creating a product marketplace. While the previous crowd-sourced databases are based on an “Ads + Donations” model, in these examples, sustainability is brought by the marketplace. The data will remain available and up-to-date so long as the marketplace remains active.


I think there are several conclusions to draw out from these examples.

Firstly, the important part of an open data curation project is not that it’s supported by commercial organisations, it’s the reliance on a sustainable model that will ensure the continued provision of the data. There are clearly plenty of different ways of doing this. I’ve written about various models for generating revenue from data in the past. Jeni Tennison has also shared some thoughts from a more public sector perspective. I suspect there are more that can be explored.

Secondly, legislation.gov.uk clearly isn’t the first example of a sustainable open data curation model, and it’s also not the first example of a commercially supported model. It’s pre-dated by MusicBrainz at least. But it is, to my knowledge, the first of its kind in the public sector. That’s a real innovation of which John Sheridan can be proud.

Finally, there’s clearly a lot more work that we can collectively do to collate examples of the various approaches to building sustainable businesses and collaboration models around Open Data. The right approach is likely to vary considerably based on the domain. It will be useful to understand the trade-offs.

This will provide necessary evidence and case studies to support the further exploration of Open Data releases and operating models in the public sector, and beyond.

But perhaps more importantly it will help provide people with examples of how sustainable and perhaps even profitable businesses can be built around collaborative curation of Open Data.

This is an area in which Data Marketplaces have a role to play. By offering the infrastructure to support data hosting, delivery and revenue collection, they can be platforms to support communities coming together to draw some real tangible value from collective curation of data.

