Gridworks is a really fantastic tool and there’s scope to extend it in all kinds of interesting ways. Jeni Tennison has recently published a great blog post describing how to use Gridworks for generating Linked Data. I strongly encourage you to read her posting as it not only provides a good introduction to Gridworks itself, but also shows a nice real world example of generating RDF using its built-in data cleaning and templating tools.
I was lucky enough to meet David Huynh at a workshop recently and chatted to him briefly about another aspect of Gridworks: its ability to match field values in a dataset to entities in Freebase, e.g. identifying a place based on just its name. Within Gridworks this process is known as “reconciliation”.
Reconciliation is an important step for generating good Linked Data, as you’ll often need to correlate values in a dataset with URIs in existing datasets in order to generate links, for example matching company names to their URIs. While it is possible to generate identifiers algorithmically during a conversion, this typically just defers the reconciliation work until a later stage, when you carry out cross-linking to introduce equivalence links.
Recognising that the ability to introduce new reconciliation services would be a powerful extension to Gridworks, David Huynh has been creating a draft specification that will allow third-parties to create and deploy their own reconciliation services. He’s been documenting his progress on implementing the client side of this protocol and has published a testing service.
It occurred to me that the reconciliation API is essentially a structured search over a dataset and thus could be implemented against the search interface exposed by Talis Platform stores. The RSS 1.0 feeds that the Platform returns include enough information to rank and filter results as required by the API.
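To make that idea a little more concrete, here is a minimal Python sketch (not the Ruby code from the actual implementation) of pulling ranked hits out of an RSS 1.0 search feed. The sample feed, the `example.org` URIs and the `parse_hits` helper are all hypothetical, and the namespace used for the relevance score element is my assumption about what the Platform emits, so check it against a real feed before relying on it.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rss": "http://purl.org/rss/1.0/",
    # Assumption: the relevance score is exposed in this namespace.
    "relevance": "http://a9.com/-/opensearch/extensions/relevance/1.0/",
}

# Hypothetical feed, invented for illustration only.
SAMPLE_FEED = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/"
         xmlns:relevance="http://a9.com/-/opensearch/extensions/relevance/1.0/">
  <channel rdf:about="http://example.org/search">
    <title>Search results</title>
  </channel>
  <item rdf:about="http://example.org/id/area-2">
    <title>Bath and North East Somerset</title>
    <relevance:score>0.6</relevance:score>
  </item>
  <item rdf:about="http://example.org/id/area-1">
    <title>Bath</title>
    <relevance:score>1.0</relevance:score>
  </item>
</rdf:RDF>
"""


def parse_hits(feed_xml):
    """Return the feed items as dicts, highest relevance score first."""
    root = ET.fromstring(feed_xml)
    hits = []
    # In RSS 1.0 the items are direct children of the rdf:RDF root.
    for item in root.findall("rss:item", NS):
        hits.append({
            "id": item.get("{%s}about" % NS["rdf"]),
            "name": item.findtext("rss:title", default="", namespaces=NS),
            "score": float(
                item.findtext("relevance:score", default="0", namespaces=NS)
            ),
        })
    return sorted(hits, key=lambda hit: hit["score"], reverse=True)


hits = parse_hits(SAMPLE_FEED)
```

Each hit carries a URI, a label and a score, which is roughly the information a reconciliation response needs for each candidate.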
I’ve created a simple Ruby application, using the Sinatra web framework, that implements the reconciliation API for any Talis Platform store. You can find the code on github if you want to have a play with it. As I note in the README there are some areas where customisation is useful to get the most from the service. So while in principle it can be used against any existing Platform store you can create a simple JSON config to tweak it for particular datasets.
There’s a live version of the code running on my server here: http://ldodds.com/gridworks/.
That page has a simple API console for carrying out queries, but consult the draft specification for more details. I think I’ve covered all of the basic features (but bug reports welcome!). Consult the README for notes on configuration options and implementation decisions.
As a simple illustration, let’s say that I have the value “Bath” in a dataset and want to match that to some area in the UK administrative geography. This information is available from the Linked Data exposed by statistics.data.gov.uk, and this happens to be hosted in this Platform store. The reconciliation API we need can therefore be found at: http://ldodds.com/gridworks/govuk-statistics/reconcile. An HTTP GET on that location retrieves the service metadata.
If we use the API explorer we can use a simple HTML form to try out examples. Select govuk-statistics from the Store drop-down and then type Bath into the search box. You’ll get this result. This is not very readable by default, so if you’re using Firefox I recommend you install the JSONView extension which provides a nicely formatted display.
Our initial search returns a number of results, the highest ranked of which is the Westminster Constituency for Bath. That seems like a pretty good initial result to me. As it is the most relevant result in the search it’s marked as an exact match, so once integrated with Gridworks it will capture and store the reconciled identifier for you.
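The “top hit is the match” behaviour can be sketched in a few lines of Python. This is an illustration under my own assumptions rather than the code running on the server: the `to_candidates` helper and the `example.org` URIs are hypothetical, and the field names (`id`, `name`, `type`, `score`, `match`) reflect my reading of the draft specification, so check them against the spec.

```python
def to_candidates(hits, limit=None):
    """Turn scored search hits into a reconciliation-style response.

    Only the highest-scoring hit is flagged as a match, mirroring the
    behaviour described above.
    """
    ranked = sorted(hits, key=lambda hit: hit["score"], reverse=True)
    if limit is not None:
        ranked = ranked[:limit]
    result = []
    for position, hit in enumerate(ranked):
        result.append({
            "id": hit["id"],
            "name": hit["name"],
            "type": hit.get("types", []),
            "score": hit["score"],
            "match": position == 0,  # only the top-ranked hit is a match
        })
    return {"result": result}


# Hypothetical hits, invented for illustration only.
sample_hits = [
    {"id": "http://example.org/id/lea", "name": "Bath and North East Somerset",
     "score": 0.4},
    {"id": "http://example.org/id/constituency", "name": "Bath", "score": 0.9,
     "types": ["http://example.org/def/Constituency"]},
]
response = to_candidates(sample_hits)
```

A stricter service might only set `match` when the top score is well clear of the runner-up, but the simple rule is enough to show the shape of the response.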
However, we may know that a particular field in the imaginary dataset we’re working with doesn’t contain names of constituencies. It may instead refer to a Local Education Authority. We can refine our search by adding the URI that defines that type of resource into the type field in the API explorer.
Try pasting http://statistics.data.gov.uk/def/geography/LocalEducationAuthority into that field and running the search again. You’ll find that this time you get a single result, which is Bath and North East Somerset. Job done.
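If you’d rather script this than use the API explorer, the same two searches can be expressed as request URLs. The sketch below only builds the URLs and makes no network calls. The `build_reconcile_url` helper is hypothetical, and passing the search as a JSON object in a `query` parameter is my assumption based on the draft specification, so verify it against the spec.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "http://ldodds.com/gridworks/govuk-statistics/reconcile"
LEA_TYPE = "http://statistics.data.gov.uk/def/geography/LocalEducationAuthority"


def build_reconcile_url(endpoint, text, type_uri=None):
    """Build a GET URL for a single reconciliation query.

    Assumption: the service accepts a JSON-encoded object in a `query`
    parameter, with an optional `type` key to filter by resource type.
    """
    query = {"query": text}
    if type_uri is not None:
        query["type"] = type_uri
    return endpoint + "?" + urlencode({"query": json.dumps(query)})


plain_url = build_reconcile_url(ENDPOINT, "Bath")
typed_url = build_reconcile_url(ENDPOINT, "Bath", LEA_TYPE)

# Decode the typed URL again to show what the service would receive.
decoded = json.loads(parse_qs(urlsplit(typed_url).query)["query"][0])
```

Fetching `typed_url` with any HTTP client should then be equivalent to filling in both the search box and the type field in the form.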
Of course, to get the most from this you need to know what URIs you can use for filtering by types (and properties). But this is something that the Gridworks UI will help with. It can integrate with “suggestion services” that can be used to help map values to properties and types within a schema. I’ll be looking at how to expose those as my next piece of work.
Hopefully you can see how the overall system works. Feel free to have a play with the API to try it out for yourself. If you have comments on the implementation then I’d love to hear them, but I’d suggest that comments on the specification are best addressed to the Gridworks mailing list.
I also suspect the Reconciliation API has uses outside of just Gridworks. For example, I wonder how easy it would be to introduce reconciliation into Google Spreadsheets using Google Apps Script? It’s also another nice demonstration of how easy it is to map simple RESTful APIs onto RDF datasets: this implementation works for any data in the Platform, no matter what schema it conforms to. Neat.