Monthly Archives: January 2013

Publishing SPARQL queries and documentation using github

Yesterday I released an early version of sparql-doc, a SPARQL documentation generator. I’ve got plans for improving the functionality of the tool, but I wanted to briefly document how to use github and sparql-doc to publish SPARQL queries and their documentation.

Create a github project for the queries

First, create a new github project to hold your queries. If you’re new to github then you can follow their guide to get a basic repository set up with a README file.
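If you prefer the command-line, the setup boils down to something like the following sketch. The user and repository names are placeholders, and it assumes you have already created an empty repository on github:

# the user and repository names below are placeholders;
# this assumes an empty repository already exists on github
git clone git@github.com:YOURUSER/sparql-queries.git
cd sparql-queries
echo "A collection of example SPARQL queries" > README.md
git add README.md
git commit -m "Initial commit"
git push origin master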

The simplest way to manage queries in github is to publish each query as a separate SPARQL query file (with the extension “.rq”). You should also add a note somewhere specifying the license associated with your work.

As an example, here is how I’ve published a collection of SPARQL queries for the British National Bibliography.

When you create a new query be sure to add it to your repository and regularly commit and push the changes:

git add my-new-query.rq
git commit my-new-query.rq -m "Added another example"
git push origin master

The benefit of using github is that users can report bugs or submit pull requests to contribute improvements, optimisations, or new queries.

Use sparql-doc to document your queries

When you’re writing your queries, follow the commenting conventions encouraged by sparql-doc. You should also install sparql-doc so that you can use it from the command-line. (Note: the installation guide and notes still need some work!)

As a test, you can try generating the documentation from your queries as follows. Make a new local directory to hold the output, e.g. ~/projects/docs. Then, assuming your queries are checked out in e.g. ~/projects/examples, create the docs by running:

sparql-doc ~/projects/examples ~/projects/docs

You should then be able to open up the index.html document in ~/projects/docs to see how the documentation looks.

If you have an existing web server somewhere then you can just zip up those docs and put them somewhere public to share them with others. However, you can also publish them via Github Pages, which means you don’t have to set up any web hosting at all.

Use github pages to publish your docs

Github Pages allows github users to host public, static websites directly from github projects. It can be used to publish blogs or other project documentation. But using github pages can seem a little odd if you’re not familiar with git.

Effectively what we’re going to do is create a separate collection of files — the documentation — that sits in parallel to the actual queries. In git terms this is done by creating a separate “orphan” branch. The documentation lives in the branch, which must be called gh-pages, while the queries will remain in the master branch.

Again, github have a guide for manually adding and pushing files for hosting as pages. The steps below follow the same process, but using sparql-doc to create the files.

Github recommend starting with a fresh separate checkout of your project. You then create a new branch in that checkout, remove the existing files and replace them with your documentation.

As a convention I suggest that when you check out the project a second time, you give it a separate name, e.g. by adding a “-pages” suffix.

So for my bnb-queries project, I will have two separate checkouts:

The original checkout I did when setting up the project for the BNB queries was:

git clone git@github.com:ldodds/bnb-queries.git

This gave me a local ~/projects/bnb-queries directory containing the main project code. So to create the required github pages branch, I would do this in the ~/projects directory:

#specify different directory name on clone
git clone git@github.com:ldodds/bnb-queries.git bnb-queries-pages
#then follow the steps as per the github guide to create the pages branch
cd bnb-queries-pages
git checkout --orphan gh-pages
#then remove existing files
git rm -rf .

This gives me two parallel directories: one containing the master branch and the other containing the gh-pages documentation branch. Make sure you really are working with a separate checkout before deleting and replacing all of the files!

To generate the documentation I then run sparql-doc, telling it to read the queries from the project directory containing the master branch and to write the output into the directory containing the gh-pages branch, e.g.:

sparql-doc ~/projects/bnb-queries ~/projects/bnb-queries-pages

Once that is done, I then add, commit and push the documentation, as per the final step in the github guide:

cd ~/projects/bnb-queries-pages
git add *
git commit -am "Initial commit"
git push origin gh-pages

The first time you do this, it’ll take about 10 minutes for the pages to become active. They will appear at the following URL:

http://USER.github.com/PROJECT

E.g. http://ldodds.github.com/sparql-doc

If you add new queries to the project, be sure to re-run sparql-doc and add/commit the updated files.
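In practice the update workflow is just a repeat of the earlier steps; re-using the directory names from above, it looks something like this:

# regenerate the documentation from the latest queries
sparql-doc ~/projects/bnb-queries ~/projects/bnb-queries-pages
# then commit and push the updated docs from the gh-pages checkout
cd ~/projects/bnb-queries-pages
git add .
git commit -am "Regenerate documentation"
git push origin gh-pages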

Hopefully that’s relatively clear.

The main thing to understand is that locally you’ll have your github project checked out twice: once for the main line of code changes, and once for the output of sparql-doc.

These will need to be separately updated to add, commit and push files. In practice this is very straightforward and means that you can publicly share queries and their documentation without the need for web hosting.

sparql-doc

Developers often struggle with SPARQL queries. There aren’t always enough good examples to play with when learning the language or when trying to get to grips with a new dataset. Data publishers often overlook the need to publish examples or, if they do, rarely include much descriptive documentation.

I’ve also been involved with projects that make heavy use of SPARQL queries. These are often externalised into separate files to allow them to be easily tuned or tweaked without having to change code. Having documentation on what a query does and how it should be used is useful. I’ve seen projects that have hundreds of different queries.

It occurred to me that while we have plenty of tools for documenting code, we don’t have a tool for documenting SPARQL queries. If generating and publishing documentation were a little more frictionless, then perhaps people would do it more often. Services like SPARQLbin are useful, but address a slightly different use case.

Today I’ve hacked up the first version of a tool that I’m calling sparql-doc. Its primary use is likely to be helping to publish good SPARQL examples, but it might also make a good addition to existing code/project documentation tools.

The code is up on github for you to try out. It’s still very rough-and-ready but already produces some useful output. You can see a short example here.

It’s very simple and adopts the same approach as tools like rdoc and Javadoc: it just specifies some conventions for writing structured comments. Currently it supports adding a title, description, list of authors, tags, and related links to a query. Because the syntax is simple, I’m hoping that other SPARQL tools and IDEs will support it.

I plan to improve the documentation output to provide more ways to navigate the queries, e.g. by tag, query type, prefix, etc.

Let me know what you think!

A Brief Review of the Land Registry Linked Data

The Land Registry have today announced the publication of their Open Data — including both Price Paid information and Transactions as Linked Data. This is great to see, as it means that there is another UK public body making a commitment to Linked Data publishing.

I’ve taken some time to begin exploring the data. This blog post provides some pointers that may help others in using the Linked Data. I’m also including some hopefully constructive feedback on the approach that the Land Registry have taken.

The Land Registry Linked Data

The Linked Data is available from http://landregistry.data.gov.uk. This follows the general pattern used by other organisations publishing public sector Linked Data in the UK.

The data consists of a single SPARQL endpoint — based on the Open Source Fuseki server — which contains RDF versions of both the Price Paid and Transaction data. The documentation notes that the endpoint will be updated on the 20th of each month, with the equivalent of the monthly releases that are already published as CSV files.

Based on some quick tests, it would appear that the endpoint contains all of the currently published Open Data, which in total is 16,873,170 triples covering 663,979 transactions.
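If you want to repeat that kind of check yourself, a simple count query can be run from the command line with curl against the SPARQL endpoint listed below. This is just a sketch and assumes the endpoint accepts standard SPARQL protocol requests:

# count the triples in the store
curl -G "http://landregistry.data.gov.uk/landregistry/sparql" \
  -H "Accept: application/sparql-results+json" \
  --data-urlencode "query=SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }"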

The data seems to primarily use custom vocabularies for describing the data; the main namespaces can be seen in the prefixes of the example query included below.

The landing page for the data doesn’t include any example resources, but I was able to extract a few by running some SPARQL queries.

So for Price Paid Data, the model appears to be that a Transaction has a Transaction Record which in turn has an associated Address. The transaction counts seem to be standalone resources.

The SPARQL endpoint for the data is at http://landregistry.data.gov.uk/landregistry/sparql. A test form is also available and that page has a couple of example queries, including getting Price Paid data based on a postcode search.

However I’d suggest that the following version might be slightly better, as it also includes the record status, which will indicate whether the record is an “add” or a “delete”:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX lrppi: <http://landregistry.data.gov.uk/def/ppi/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX lrcommon: <http://landregistry.data.gov.uk/def/common/>
SELECT ?paon ?saon ?street ?town ?county ?postcode ?amount ?date ?status
WHERE
{
  ?transx lrppi:pricePaid ?amount .
  ?transx lrppi:transactionDate ?date .
  ?transx lrppi:propertyAddress ?addr .
  ?transx lrppi:recordStatus ?status .

  ?addr lrcommon:postcode "PL6 8RU"^^xsd:string .
  ?addr lrcommon:postcode ?postcode .

  OPTIONAL { ?addr lrcommon:county ?county . }
  OPTIONAL { ?addr lrcommon:paon ?paon . }
  OPTIONAL { ?addr lrcommon:saon ?saon . }
  OPTIONAL { ?addr lrcommon:street ?street . }
  OPTIONAL { ?addr lrcommon:town ?town . }
}
ORDER BY ?amount
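If you would rather run queries like this from the command line than via the test form, something along these lines should work. It assumes the query above has been saved to a local file (here called price-paid.rq, a placeholder name) and that the endpoint can return CSV results for SELECT queries:

# run a saved query against the endpoint, asking for CSV results
curl -G "http://landregistry.data.gov.uk/landregistry/sparql" \
  -H "Accept: text/csv" \
  --data-urlencode "query@price-paid.rq"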

General Feedback

Let’s start with the good points:

  • The data is clearly licensed, so it is open for widespread re-use
  • There is a clear commitment to regularly updating the data, so it should stay in line with the Land Registry’s other Open Data. This means developers can rely on the data and the identifiers it contains
  • The data uses Patterned URIs based on Shared Keys (the Land Registry’s own transaction identifiers) so building links is relatively straightforward
  • The vocabularies are documented and the URIs resolve, so it is possible to look up the definitions of terms. I’m already finding that easier than digging through the FAQs that the Land Registry publish for the CSV versions.

However I think there is room for improvement in a number of areas:

  • It would be useful to have more example queries, e.g. how to find the transactional data, as well as example Linked Data resources. A key benefit of a linked dataset is that you should be able to explore it in your browser; as it stands, I had to run SPARQL queries to find simple examples
  • The SPARQL form could be improved: currently it uses a POST by default, so I don’t get a shareable URL for my query; the JavaScript in the page also wipes out my query every time I hit the back button, making it frustrating to use
  • The vocabularies could be better documented, for example a diagram showing the key relationships would be useful, as would a landing page providing more of a conceptual overview
  • The URIs in the data don’t match the patterns recommended in Designing URI Sets for the Public Sector. While I believe that guidance is under review, the data is diverging from current documented best practice. Linked Data purists may also lament the lack of a distinction between resource and page.
  • The data uses custom vocabularies where there are existing vocabularies that fit the bill. The transactional statistics could have been adequately described by the Data Cube vocabulary, with custom terms for the dimensions. The related organisations could have been described by the ORG ontology, and VCard with extensions ought to have covered the address information.

But I think the biggest oversight is the lack of linking, both internal and external. The data uses “strings” where it could have used “things”: for places, customers, localities, post codes, addresses, etc.

Improving the internal linking will make the dataset richer, e.g. allowing navigation to all transactions relating to a specific address, or all transactions for a specific town or postcode region. I’ve struggled to get a Post Code District based query to work (e.g. “price paid information for BA1”) because the query has to resort to regular expressions, which are often poorly optimised in triple stores. Matching based on URIs is always much faster and more reliable.

External linking could have been improved in two ways:

  1. The dates in the transactions could have been linked to the UK Government Interval Sets, which provide URIs for individual days.
  2. The postcode, locality, district and other regional information could have been linked to the Ordnance Survey Linked Data. That dataset already has URIs for all of these resources. While it may have been a little more work to match regions, the postcode based URIs are predictable so are trivial to generate.

These improvements would have moved the Land Registry data from 4 to 5 Stars with little additional effort. That does more than tick boxes: it makes the entire dataset easier to consume, query and remix with others.

Hopefully this feedback is useful for others looking to consume the data or who might be undertaking similar efforts. I’m also hoping that it is useful to the Land Registry as they evolve their Linked Data offering. I’m sure that what we’re seeing so far is just the initial steps.

How I organise data conversions

Factual announced a new project last week, called Drake, which is billed as a “make for data”. The tool provides a make-style environment for building data conversion workflows: it has support for multiple programming languages, uses a standard project layout, and integrates with HDFS.

It looks like a really nice tool and I plan to take a closer look at it. When you’re doing multiple data conversions, particularly in a production setting, it’s important to adopt some standard practices. Having a consistent way to manage assets, convert data and manage workflows is really useful. Quick and dirty data conversions might get the job done, but a little thought up front can save time later when you need to refresh a dataset, fix bugs, or allow others to contribute. Consistency also helps when you come to add another layer of automation, to run a number of conversions on a regular basis.

I’ve done a fair few data conversions over the last few years and I’ve already adopted a similar approach to Drake: I use a standard workflow, programming environment and project structure. I thought I’d write this down here in case it’s useful for others; it’s certainly saved me time. I’d be interested to learn what approaches other people take to help organise their data conversions.

Project Layout

My standard project layout is as follows (a short sketch for bootstrapping it appears after the list):

  • bin — the command-line scripts used to run a conversion. I tend to keep these task based, e.g. focusing on one element of the workflow or conversion. E.g. separate scripts for crawling data, converting types of data, etc. Scripts are parameterised with input/output directories and/or filenames
  • data — created automatically, this sub-directory holds the output data
    • cache — a cache directory for all data retrieved from the web. When crawling or scraping data I always work on a local cached copy to avoid unnecessary network traffic
    • nt (or rdf) — for RDF conversions I typically generate ntriples output as it’s simple to generate and work with in a range of tools. I sometimes generate RDF/XML output, but only if I’m using XSLT to do transformations from XML sources
  • etc — additional supporting files, including:
    • static — static data, e.g. hand-crafted data sources, RDF schema files, etc
    • sparql — SPARQL queries that are used in the conversion, as part of the “enrichment” phase
    • xslt — XSLT transforms, for when I’m using XML input and have found it easier to process it using XSLT rather than libxml
  • lib — the actual code for the conversion. The scripts in the bin directory handle the input/output; the rest is done in Ruby classes
  • Rakefile — a Ruby Rakefile that describes the workflow. I use this to actually run the conversions
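For reference, a minimal sketch for bootstrapping this layout from scratch might look like the following; the project name is just a placeholder:

# create the standard directory layout (the project name is a placeholder)
mkdir -p my-conversion/{bin,lib,etc/static,etc/sparql,etc/xslt}
cd my-conversion
# data/ (with data/cache and data/nt) is created automatically by the
# conversion scripts, so it doesn't need to exist up front
touch Rakefile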

While there are some minor variations, I’ve used this same structure across a number of different conversions.

Workflow

The workflow for the conversion is managed using a Ruby Rakefile. Like Factual, I’ve found that a make-style environment is useful for organising simple data conversion workflows. Rake allows me to execute command-line tools, e.g. curl for downloading data or rapper for doing RDF format conversions, execute arbitrary Ruby code, as well as shell out to dedicated scripts.

I try to use a standard set of rake targets to co-ordinate the overall workflow. These are broken down into smaller stages where necessary. While the steps vary between datasets, the stages I most often use are:

  1. download (or cache) — the main starting point which fetches the necessary data. I try and avoid manually downloading any data and rely on curl or perhaps dpm to get the required files. I’ve tended to use “download” for when I’m just grabbing static files and “cache” for when I’m doing a website crawl; this is just a cue for me. I like to tread carefully when hitting other people’s servers, so I aggressively cache files. Having a separate stage to grab data is also handy for when you’re working offline on later steps
  2. convert — perform the actual conversion, working on the locally cached files only. So far I tend to use either custom Ruby code or XSLT.
  3. reconcile — generate links to other datasets, often using the Google Refine Reconciliation API
  4. enrich — enrich the dataset with additional data, e.g. by performing SPARQL queries to fetch remote data, or materialise new data
  5. package — package up the generated output as a tar.gz file
  6. publish — the overall target which runs all of the above

The precise stages used vary between projects and there are usually a number of other targets in the Rakefile that perform specific tasks; for example, the convert stage is usually dependent on several other steps that generate particular types of data. But having standard stage names makes it easier to run specific parts of the overall conversion. One additional stage that would be useful to have is “validation”, so you can check the quality of the output.
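In practice that just means invoking rake with the relevant stage name. For example, assuming a Rakefile that defines the targets listed above:

# fetch and cache the source data, e.g. before working offline
rake download
# run an individual stage
rake convert
# run the whole workflow end-to-end
rake publish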

At various times I’ve considered formalising these stages further, e.g. by creating some dedicated Rake extensions, but I’ve not yet found the need to do that as there’s usually very little code in each step.

I tend to separate out dependencies on external resources, e.g. remote APIs, from the core conversion. The convert stage will work entirely on locally cached data, and then I can call out to other APIs in a separate reconcile or enrich stage. Again, this helps when working on parts of the conversion offline and allows the core conversion to happen without risk of failure because of external dependencies. If a remote API fails, I don’t want to have to re-run a potentially lengthy data conversion; I just want to resume from a known point.

I also try and avoid, as far as possible, using extra infrastructure, e.g. relying on databases, triple stores, or a particular environment. While this might help improve performance in some cases (particularly for large conversions), I like to minimise dependencies to make it easier to run the conversions in a range of environments, with minimal set-up and minimal cost for anyone running the conversion code. Many of the conversions I’ve been doing are relatively small scale. For larger datasets, using a triple store or Hadoop might be necessary, but this would be easy to integrate into the existing stages, perhaps by adding a “prepare” stage to do any necessary installation and configuration.

For me it’s very important to be able to automate the download of the initial data files or web pages that need scraping. This allows the whole process to be automated and cached files to be re-used where possible, which simplifies the process of using the data and avoids unnecessary load on data repositories. As I noted at the end of yesterday’s post on using dpm with data.gov.uk, having easy access to the data files is important. The context for interpreting that data mustn’t be overlooked, but consuming that information is done separately from using the data.
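As a concrete, if simplified, example of the kind of download step I mean, the following sketch fetches a source file into the local cache and skips the download if a cached copy already exists; the URL and filename are placeholders:

# cache a source file locally; skip the download if we already have it
# (the URL and filename are placeholders)
mkdir -p data/cache
if [ ! -f data/cache/source.csv ]; then
  curl -s -L -o data/cache/source.csv "http://example.org/dataset/source.csv"
fi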

To summarise, there’s nothing very revolutionary here: I’m sure many of you use similar and perhaps better approaches. But I wanted to share my style for organising conversions and encourage others to do likewise.

How to use dpm with data.gov.uk

The Data Package Manager (dpm) is an Open Knowledge Foundation project to create a tool to support discovery and distribution of datasets. The tool uses the concept of a “data package” to describe the basic metadata for a dataset plus the supporting files. Packages are indexed in a registry to make them searchable and to support distribution. The dpm tool works with the CKAN data portal software, using its API to search and download data packages.

The dpm documentation includes guidance on how to install and use the software. Once the basic software is installed you run:

dpm setup config

This will create a default configuration file called .dpmrc in your home directory. This configuration works with The Data Hub, allowing you to access its registry of over 5000 datasets. For example, there’s a basic RDF/XML version of the British National Bibliography; if we wanted to automatically download the files associated with that package then we can run the following command:

dpm download ckan://bluk-bnb-basic bnb-basic

The first parameter identifies the dataset: note that bluk-bnb-basic is the same as the id used in the URL of the dataset on the Data Hub. The second parameter is the local directory into which the files will be downloaded. This makes it easy to script up downloads of a dataset if the publisher has gone to the trouble of associating the files with their CKAN package.
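For example, fetching several packages could be scripted with a simple loop; the second package id here is just a placeholder:

# download a list of packages ("some-other-package" is a placeholder id)
for pkg in bluk-bnb-basic some-other-package; do
  dpm download ckan://$pkg $pkg
done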

The data.gov.uk website has been built using CKAN, and its API endpoint can be found at http://data.gov.uk/api/. This means that we can use dpm to interact with data.gov.uk too; all we need to do is specify that dpm should use a different registry.

To get dpm to use a different CKAN instance we need to modify its config:

  1. Take a copy of ~/.dpmrc and put it somewhere handy, e.g. ~/tools/datapkg/datagovuk.ini
  2. Edit the ckan.url entry and change it to http://data.gov.uk/api/
  3. When you run dpm use the --config or -c parameter to specify that it should use the alternate config

Here’s a gist that shows an example of the edited config. It’s best to just modify a copy of the default version as there are other paths in there that should remain unchanged.
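In shell terms, the copy-and-edit step might look something like the following. The sed expression assumes the config contains a line starting with “ckan.url”, so check the format of your own .dpmrc before relying on it:

# copy the default config and point it at data.gov.uk
mkdir -p ~/tools/datapkg
cp ~/.dpmrc ~/tools/datapkg/datagovuk.ini
# assumes a line of the form "ckan.url = ..." in the config
sed -i 's|^ckan.url.*|ckan.url = http://data.gov.uk/api/|' ~/tools/datapkg/datagovuk.ini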

Here are some examples of using dpm with data.gov.uk. Make sure the config parameter points to the location of your revised configuration file:

Search data.gov.uk for packages with the keyword “spending”:

dpm --config datagovuk.ini search ckan:// spending

Get a summary of a package:

dpm --config datagovuk.ini info ckan://warwickshire-spending-allocation

Download the files associated with a package to a local data directory. The tool will automatically create sub-directories for the package:

dpm --config datagovuk.ini download ckan://warwickshire-spending-allocation data

The latter command would be much more useful if the data.gov.uk datasets consistently had the data associated with them. Unfortunately in many cases there is still just a reference to another website.

Hopefully this will improve over time — while it’s important for data to be properly documented and contextualised, to support easy re-use it must also be easy to automate the retrieval and processing of that data. These are two separate, but equally important, use cases.
