Category Archives: Open Data

What is an Open API?

I was reading a document this week that referred to an “Open API”. It occurred to me that I hadn’t really thought about what that term was supposed to mean before. Having looked at the API in question, it turned out it did not mean what I thought it meant. The definition of Open API on Wikipedia and the associated list of Open APIs are also both a bit lacklustre.

We could probably do with being more precise about what we mean by that term, particularly in how it relates to Open Source and Open Data. So far I’ve seen it used in several different ways:

  1. An API that is free for anyone to use — I think it would be clearer to refer to these as “Public APIs”. Some may require authentication, some may only have a limited free tier of usage, but the API is accessible to anyone who wants to use it
  2. An API that is backed by open data — the data that can be accessed via the API is covered by an open licence. A Public API isn’t necessarily backed by Open Data. While it might be free for me to use an API, I may be limited in how I can use the data by the API terms and/or a non-open data licence that applies to the data
  3. An API that is based on an open standard — the data available via an API might not be open, but the means of accessing and querying the data is covered by a specification that has been created by a standards body or has otherwise been openly published, e.g. the specification of the API is covered by an open licence. The important thing here is that the API could be (re-)implemented in an open source or commercial product without infringing on anyone’s rights or intellectual property. The specifications of APIs that serve open data aren’t necessarily open. A commercial vendor may provide a data publishing service whose API is entirely proprietary.

Personally I think an Open API is one that meets that final definition.

These are important distinctions and I’d encourage you to look at the APIs you’re using or publishing and consider which category they fall into. APIs built on open source software typically fall into the third category: a reference implementation and API documentation are already in the open. It’s easy to create alternate versions, improve an existing code base, or run a copy of a service.

While the data in a platform may be open, lock-in (whether planned or otherwise) can happen when APIs are proprietary. This limits competition and the ability for both data publishers and consumers to choose other vendors. This is also one reason why APIs shouldn’t be the default for open government data: at some level the raw data should be portable and useful outside of whatever platform the organisation may choose to deploy. Ideally platforms aimed at supporting open government data publishing should be open source or should, at the very least, openly licence their API documentation.

Building the new Ordnance Survey Linked Data platform

Disclaimer: the following is my own perspective on the build & design of the Ordnance Survey Linked Data platform. I don’t presume to speak for the OS and don’t have any inside knowledge of their long term plans.

Having said that I wanted to share some of the goals we (Julian Higman, Benjamin Nowack and myself) had when approaching the design of the platform. I will say that we had the full support and encouragement of the Ordnance Survey throughout the project, especially John Goodwin and others in the product management team.

Background & Goals

The original Ordnance Survey Linked Data site launched in April 2010. At the time it was a leading example of adoption of Linked Data by a public sector organisation. But time moves on and both the site and the data were due for a refresh. With Talis’ withdrawal from the data hosting business, the OS decided to bring the data hosting in-house and contracted Julian, Benjamin and myself to carry out the work.

While the migration from Talis was a key driver, the overall goal was to deliver a new Linked Data platform that would make a great showcase for the Ordnance Survey Linked Data. The beta of the new site was launched in April and went properly live at the beginning of June.

We had a number of high-level goals that we set out to achieve in the project:

  • Provide value for everyone, not just developers — the original site was very developer-centric, offering a very limited user experience with no easy way to browse the data. We wanted everyone to begin sharing links to the Ordnance Survey pages and that meant that the site needed a clean, user-friendly design. This meant we approached it from the point of view of building an application, not just a data portal
  • Deliver more than Linked Data — we wanted to offer a set of APIs that made the data accessible and useful for people who weren’t familiar with Linked Data or SPARQL. This meant offering some simpler tools to enable people to search and link to the data
  • Deliver a good developer user experience — this meant integrating API explorers, plenty of examples, and clear documentation. We wanted to shorten the “time to first JSON” to get developers into the data as fast as possible
  • Showcase the OS services and products — the OS offer a number of other web services and location products. The data should provide a way to show that value. Integrating mapping tools was the obvious first step
  • Support latest standards and best practices — where possible we wanted to make sure that the site offered standard APIs and formats, and conformed to the latest best practices around open data publishing
  • Support multiple datasets — the platform has been designed to support multiple datasets, allowing users to use just the data they need or the whole combined dataset. This provides more options for both publishing and consuming the data
  • Build a solid platform to support further innovation — we wanted to leave the OS with an extensible, scalable platform to allow them to further experiment with Linked Data

Best Practices & Standards

From a technical perspective we needed to refresh not just the data but the APIs used to access it. This meant replacing the SPARQL 1.0 endpoint and custom search interface offered by the original site with more standard APIs.

We also wanted to make the data and APIs discoverable and adopted a “completionist” approach to try and tick all the boxes for publishing and exposing dataset metadata, including basic versioning and licensing information.

As a result we ended up with:

  • SPARQL 1.1 query endpoints for every dataset, which expose a basic SPARQL 1.1 Service Description as well as the newer CSV and TSV response formats
  • Well populated VoID descriptions for each dataset, covering all of the key metadata items, including publication dates, licensing, coverage, and some initial dataset statistics
  • Autodiscovery support for datasets, APIs, and for underlying data about individual Linked Data resources
  • OpenSearch 1.1 compliant search APIs that support keyword and geo search over the data. The Atom and RSS response formats include the relevance and geo extensions
  • Licensing metadata is clearly labelled not just on the datasets, but also included as a Link HTTP header in every Linked Data or API result, so you can probe resources to learn more
  • Basic support for the OpenRefine Reconciliation API as a means to offer a simple linking API that can be used in a variety of applications but also, importantly, by people curating and publishing small datasets using OpenRefine
  • Support for CORS, allowing cross-origin requests to be made to the Linked Data and all of the APIs
  • Caching support through the use of ETags and Last-Modified headers. If you’re using the APIs then you can optimise your requests and cache data by making Conditional GET requests (see the sketch after this list)
  • Linked Data pages that offer more than just a data dump: the integrated mapping and links to other products and services make the data more engaging.
  • Custom ontology pages that allow you to explore terms and classes within individual ontologies, e.g. the definition of “London Borough”
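
To illustrate the caching and licensing headers, here’s a minimal Ruby sketch of a conditional GET against a Linked Data resource. The resource URI is only an example and the exact headers returned will depend on what you request, so treat this as a sketch rather than official client code.

require 'net/http'
require 'uri'

# Example resource URI only; substitute any resource or API URL from the site
uri = URI('http://data.ordnancesurvey.co.uk/id/postcodeunit/PL68RU')
http = Net::HTTP.new(uri.host, uri.port)

# First request: note the validators and the licensing Link header
first = http.get(uri.request_uri, 'Accept' => 'text/turtle')
puts "Licence link: #{first['Link']}"
etag = first['ETag']   # assumes an ETag was returned in the response

# Second request: a conditional GET; a 304 response means the cached copy is still valid
second = http.get(uri.request_uri, 'Accept' => 'text/turtle', 'If-None-Match' => etag)
puts second.code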

Clearly there’s more that could potentially be done. Tools can always be improved, but the best way for that to happen is through user feedback. I’d love to know what you think of the platform.

Overall I think we’ve achieved our goal of making a site that, while clearly developer oriented, offers a good user experience for non-developers. I’ll be interested to see what people do with the data over the coming months.

Summarising Geographic Coverage of Dbpedia (and Wikipedia)

In “What Does Your Dataset Contain?” I outlined a conceptual framework for thinking about how we might want to describe datasets, e.g. how they’re produced, what they contain, etc. I’ve been reading with interest the series on dataset summaries in Scraperwiki which is exploring similar ideas.

I finally found the time to do some quick practical exploration of my own. One area that interests me is understanding the geographic coverage of a dataset. There are lots of ways to approach that, mainly because datasets can vary widely in how they include geographical data. Some might include direct references to regions, whilst others might have more fine-grained latitude/longitude data.

I recently discovered local-geocoder, which allows bulk reverse geocoding of lat/lng data to country names. I decided to apply it to Dbpedia to see if I could get a sense of its overall coverage.

The result is a simple shell script that:

  1. Downloads the geographic data from the English version of Dbpedia 3.8
  2. Extracts the georss:point predicates and runs them through the local_geocode command-line tool
  3. Runs the results through some command-line tools to sort and summarise the data to create a simple CSV file

I created a gist that contains the script and the output as formatted text and CSV.
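
For the curious, the extraction and summarisation steps are roughly equivalent to the Ruby sketch below. The input filename and the geocode stub are placeholders: the real script works on the Dbpedia dump files and shells out to the local_geocode command-line tool.

require 'csv'

# Placeholder for the reverse geocoding step; the real script pipes points
# through the local_geocode command-line tool
def geocode(lat, lng)
  nil
end

counts = Hash.new(0)

# 'geo_coordinates_en.nt' stands in for the Dbpedia 3.8 geo dump
File.foreach('geo_coordinates_en.nt') do |line|
  next unless line.include?('georss/point')
  # Point literals look like "51.50 -0.12"; pull out the lat/lng pair
  match = line.match(/"(-?[\d.]+) (-?[\d.]+)"/)
  next unless match
  country = geocode(match[1].to_f, match[2].to_f) || 'nil'
  counts[country] += 1
end

CSV.open('summary.csv', 'w') do |csv|
  csv << %w[country points]
  counts.sort_by { |_, n| -n }.each { |country, n| csv << [country, n] }
end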

Quick summary of the results:

  • 475,001 geographic points in Dbpedia 3.8.
  • 26,763 (recorded as “nil” in the results) were unmatched, giving 448,238 points that can be geocoded to a country
  • 122,230 points were from the US (25.7% of the full set)
  • The US, Poland (46,316; 9.75%) and the United Kingdom (45,917; 9.67%) are the three most represented countries
  • 178 countries are referenced in total

From a quick inspection, I think the results that can’t be geocoded are simply those that are outside country boundaries, e.g. the location for Apollo 8 is in the middle of the Pacific.

The main caveat with the results (other than potential bugs) is that the boundary data used in local-geocoder is of unclear provenance. It’s intended for quick prototyping only. However I’ve had a pull request accepted to local-geocoder that makes it easier to substitute alternative boundary data.

Most online geocoders are rate-limited or have specific terms and conditions that limit re-use of the resulting data. It would be interesting to create a good reference set of open boundary data for countries and administrative regions for use in open source geocoding tools.

I’ve been exploring how the Ordnance Survey data could be converted to GeoJSON for use with the tool. This would give more fine-grained data for England, Scotland and Wales.

 

How Do We Attribute Data?

This post is another in my ongoing series of “basic questions about open data”, which includes “What is a Dataset?” and “What does a dataset contain?”. In this post I want to focus on dataset attribution and in particular questions such as:

  • Why should we attribute data?
  • How are data publishers asking to be attributed?
  • What are some of the issues with attribution?
  • Can we identify some common conventions around attribution?
  • Can we monitor or track attribution?

I started to think about this because I’ve encountered a number of data publishers recently that have published Open Data but are now struggling to highlight how and where that data has been used or consumed. If data is published for anonymous download, or accessible through an open API then a data publisher only has usage logs to draw on.

I had thought that attribution might help here: if we can find links back to sources, then perhaps we can help data publishers mine the web for links and help them build evidence of usage. But it quickly became clear, as we’ll see in a moment, that there really aren’t any conventions around attribution, making it difficult to achieve this.

So let’s explore the topic from first principles and tick off my questions individually.

Why Attribute?

The obvious answer here is simply that if we are building on the work of others, then it’s only fair that those efforts should be acknowledged. This helps the creator of the data (or work, or code) be recognised for their creativity and effort, which is the very least we can do if we’re not exchanging hard cash.

There are also legal reasons why the source of some data might need to be acknowledged. Some licenses require attribution, and copyright may need to be acknowledged. As a consumer I might also want to (or need to) clearly indicate that I am not the originator of some data in case it is found to be false, or misleading, etc.

Acknowledging my sources may also help guarantee that the data I’m using continues to be available: a data publisher might be collecting evidence of successful re-use in order to justify ongoing budget for data collection, curation and publishing. This is especially true when the data publisher is not directly benefiting from the data supply, and I think it’s almost always true for public sector data. If I’m reusing some data I should make it as clear as possible that I’m doing so.

There’s some additional useful background on attribution from a public sector perspective in a document called “Supporting attribution, protecting reputation, and preserving integrity”.

It might also be useful to distinguish between:

  • Attribution — highlighting the creator/publisher of some data to acknowledge their efforts, conferring reputation
  • Citation — providing a link or reference to the data itself, in order to communicate provenance or drive discovery

While these two cases clearly overlap, the intention is often slightly different. As a user of an application, or the reader of an academic paper, I might want a clear citation to the underlying dataset so I can re-use it myself, or do some fact checking. The important use case there is tracking facts and figures back to their sources. Attribution is more about crediting the effort involved in collecting that information.

It may be possible to achieve both goals with a simple link, but I think recognising the different use cases is important.

How are data publishers asking to be attributed?

So how are data publishers asking for attribution? What follows isn’t an exhaustive survey but should hopefully illustrate some of the variety.

Let’s look first at some of the suggested wordings in some common Open Data licenses, then poke around in some terms and conditions to see how these are being applied in practice.

Attribution Statements in Common Open Data Licenses

The Open Data Commons Attribution license includes some recommended text (Section 4.3a – Example Notice):

Contains information from DATABASE NAME which is made available under the ODC Attribution License.

Where DATABASE NAME is the name of the dataset and is linked to the dataset homepage. Notice there’s no mention of the originator, just the database. The license notes that in plain text the links should be written out as text. The Open Data Commons Database license has the same text (again, section 4.3a).

The UK Open Government License notes that re-users should:

…acknowledge the source of the Information by including any attribution statement specified by the Information Provider(s) and, where possible, provide a link to this licence

Where no attribution is provided, or multiple sources must be attributed, then the suggested default text, which should include a link to the license, is:

Contains public sector information licensed under the Open Government Licence v1.0.

So again, no reference to the publisher, but also no reference to the dataset either. The National Archives have some guidance on attribution which includes some other variations. These variants do suggest including more detail, such as the name of the department, date of publication, etc. These look more like typical bibliographic citations.

As another data point we can look at the Ordnance Survey Open Data License. This is a variation of the Open Government License but carries some additional requirements, specifically around attribution. The basic attribution statement is:

Contains Ordnance Survey data © Crown copyright and database right [year]

However the Code Point Open dataset has some additional attribution requirements, which also acknowledge copyright of the Royal Mail and National Statistics. All of these statements acknowledge the originators and there’s no requirement to cite the dataset itself.

Interestingly, while the previous licenses state that re-publication of data should be under a compatible license, only the OS Open Data license explicitly notes that the attribution statements must also be preserved. So both the license and attribution have viral qualities.

Attribution Statements in Terms and Conditions

Now let’s look at some specific Open Data services to see what attribution provisions they include.

Freebase is an interesting example. It draws on multiple datasets which are supplemented by contributions of its user community. Some of that data is under different licenses. As you can see from their attribution page, there are variants in attribution statements depending on whether the data is about one or several resources and whether it includes Wikipedia content, which must be specially acknowledged.

They provide a handy HTML snippet for you to include in your webpage to make sure you get the attribution exactly right. Ironically at the time of writing this service is broken (“User Rate Limit Exceeded”). If you want a slightly different attribution, then you’re asked to contact them.

Now, while Freebase might not meet everyone’s definition of Open Data, it’s an interesting data point, particularly as they ask for deep links back to the dataset, as well as having a clear expectation of where/how the attribution will be surfaced.

OpenCorporates is another illustrative example. Their legal/license info page notes that their dataset is licensed under the Open Data Commons Database License and explains that:

Use of any data must be accompanied by a hyperlink reading “from OpenCorporates” and linking to either the OpenCorporates homepage or the page referring to the information in question

There are also clear expectations around the visibility of that attribution:

The attribution must be no smaller than 70% of the size of the largest bit of information used, or 7px, whichever is larger. If you are making the information available via your own API you need to make sure your users comply with all these conditions.

So there is a clear expectation that the attribution should be displayed alongside any data. Like the OS license these attribution requirements are also viral as they must be passed on by aggregators.

My intention isn’t to criticise either OpenCorporates or Freebase, but merely to highlight some real world examples.

What are some of the issues with data attribution?

Clearly we could undertake a much more thorough review than I have done here. But this is sufficient to highlight what I think are some of the key issues. Put yourself in the position of a developer consuming some Open Data under any or all of these conditions. How do you responsibly provide attribution?

The questions that occur to me, at least, are:

  • Do I need to put attribution on every page of my application, or can I simply add it to a colophon? (Aside: lanyrd has a great colophon page). In some cases it seems like I might have some freedom of choice, in others I don’t
  • If I do have to put a link or some text on a page, then do I have any flexibility around its size, positioning, visibility, etc? Again, in some cases I may do, but in others I have some clear guidance to follow. This might be challenging if I’m creating a mobile application with limited screen space. Or creating a voice or SMS application.
  • What if I just re-use the data as part of some back-end analysis, but none of that data is actually surfaced to the user? How do I attribute in this scenario?
  • Do I need to acknowledge the publisher, or link to the source page(s)?
  • What if I need to address multiple requirements, e.g. if I mashed up data from data.gov.uk, the Ordnance Survey, Freebase and OpenCorporates? That might get awkward.

There are no clear answers to these questions. For individual datasets I might be able to get guidance, but it requires me to read the detailed terms and conditions for the dataset or API I’m using. Isn’t the whole purpose of having off-the-shelf licenses like the OGL or ODbL to help us streamline data sharing? Unclear or overly detailed attribution requirements are a clear source of friction, especially if there are legal consequences for getting it wrong.

And that’s just when we’re considering integrating data sources by hand. What about if we want to automatically combine data? How is a machine going to understand these conditions? I suspect that every Linked Data browser and application fails to comply with the attribution requirements of the data it’s consuming.

Of course these issues have been explored already. The Science Commons Protocol encourages publishing data into the public domain — so no legal requirement for attribution at all. It also acknowledges the “Attribution Stacking” problem (section 5.3) which occurs when trying to attribute large numbers of datasets, each with their own requirements. Too much friction discourages use, whether it’s research or commercial.

Unfortunately the recently published Amsterdam Manifesto on data citation seems to overlook these issues, requiring all authors/contributors to be attributed.

The scientific community may be more comfortable with a public domain licensing approach and a best effort attribution model because it is supported by strong social norms: citation and attribution are essential to scientific discourse. We don’t have anything like that in the broader open data community. Maybe it’s not achievable, but it seems like clear guidance would be very useful.

There’s some useful background on problems with attribution and marking requirements on the Creative Commons wiki that also references some possible amendments and clarifications.

Can we converge on some common conventions?

So would it be possible to converge on a simple set of conventions or norms around data re-use? Ideally to the extent that attribution can be simplified and automated as far as possible.

How about the following:

  • Publishers should clearly describe their attribution requirements. Ideally this should be a short simple statement (similar to the Open Government License) which includes their name and a link to their homepage. This attribution could be included anywhere on the web site or application that consumes the data.
  • Publishers should be aware that the consumers of their data will be using it in a variety of applications and on a variety of platforms. This means allowing a good deal of flexibility around how/where attribution is displayed.
  • Publishers should clearly indicate whether attribution must be passed on to down-stream users
  • Publishers should separately document their citation requirements. If they want to encourage users to link to the dataset, or an individual page on their site, to allow users to find the original context, then they should publish instructions on how to do it. However this kind of linking is for citation, so consumers shouldn’t be required to include it
  • Consumers should comply with publishers’ wishes and include an about page on their site or within their application that attributes the originators of the data they use. Where feasible they should also provide citations to specific resources or datasets from within their applications. This provides their users with clear citations to sources of data
  • Both sides should collaborate on structured markup to support publication of these attribution and citation requirements, as well as harvesting of links

Whether attribution should be legally enforced is another discussion. Personally I’d be keen to see a common set of conventions regardless of the legal basis for doing it. Attribution should be a social norm that we encourage, strongly, in order to acknowledge the sources of our Open Data.

What Does Your Dataset Contain?

Having explored some ways that we might find related data and services, as well as different definitions of “dataset”, I wanted to look at the topic of dataset description and analysis. Specifically, how can we answer the following questions:

  • what kinds of information does this dataset contain?
  • what types of entity are described in this dataset?
  • how can I determine if this dataset will fulfil my requirements?

There’s been plenty of work done around trying to capture dataset metadata, e.g. VoiD and DCAT; there’s also the upcoming workshop on Open Data on the Web. Much of that work has focused on capturing the core metadata about a dataset, e.g. who published it, when was it last updated, where can I find the data files, etc. But there’s still plenty of work to be done here, to encourage broader adoption of best practices, and also to explore ways to expose more information about the internals of a dataset.

This is a topic I’ve touched on before, and which we experimented with in Kasabi. I wanted to move “beyond the triple count” and provide a “report card” that gave a little more insight into a dataset. A report card could usefully complement an ODI Open Data Certificate, for example. Understanding the composition of a dataset can also help support new ways of manipulating and combining datasets.

In this post I want to propose a conceptual framework for capturing metadata about datasets. It’s intended as a discussion point, so I’m interested in getting feedback. (I would have submitted this to the ODW workshop but ran out of time before the deadline).

At the top level I think there are five broad categories of dataset information: Descriptive Data; Access Information; Indicators; Compositional Data; and Relationships. Compositional data can be broken down into smaller categories — this is what I described as an “information spectrum” in the Beyond the Triple Count post.

While I’ve thought about this largely from the perspective of Linked Data, I think it’s applicable to any format/technology.
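
To make the framework a little more concrete, here’s a rough sketch of how a dataset description might be grouped under those five categories. The field names are purely illustrative and aren’t drawn from any existing vocabulary.

# Illustrative only: one possible shape for a dataset "report card",
# grouped by the five top-level categories described below
report_card = {
  descriptive: { title: 'Example Dataset', license: 'ODC-BY', publisher: 'Example Org' },
  access: { download: 'http://example.org/data.nt.gz', sparql: 'http://example.org/sparql' },
  indicators: { size: 1_000_000, last_updated: '2013-06-01', update_frequency: 'monthly' },
  relationships: { draws_on: ['http://example.org/other-dataset'], back_links: [] },
  compositional: {
    scope: { entity_types: %w[Person Place], geographic_focus: 'UK' },
    structure: { vocabularies: %w[foaf geo] },
    internals: { named_graphs: 1, date_format: 'xsd:date' }
  }
}

puts report_card[:indicators][:size]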

Descriptive Data

This kind of information helps us understand a dataset as a “work”: its name, a human-readable description or summary, its license, and pointers to other relevant documentation such as quality control or feedback processes. This information is typically created and maintained directly by the data publisher, whereas the other categories of data I describe here can potentially be derived automatically by data analysis.

Examples:

  • Title
  • Description
  • License
  • Publisher
  • Subject Categories

Access Information

Basically, where do I get the data?

  • Where do I download the latest data?
  • Where can I download archived or previous versions of the data?
  • Are there mirrors for the dataset?
  • Are there APIs that use this data?
  • How do I obtain access to the data or API?

Indicators

This is statistical information that can help provide some insight into the data set, for example its size. But indicators can also build confidence in re-users by highlighting useful statistics such as the timeliness of releases, speed of responding to data fixes, etc.

While a data publisher might publish some of these indicators as targets that they are aiming to achieve, many of these figures could be derived automatically from an underlying publishing platform or service.

Examples of indicators:

  • Size
  • Rate of Growth
  • Date of Last Update
  • Frequency of Updates
  • Number of Re-users (e.g. size of user community, or number of apps that use it)
  • Number of Contributors
  • Frequency of Use
  • Turn-around time for data fixes
  • Number of known errors
  • Availability (for API based access)

Relationships

Relationship data primarily drives discovery use cases: to which other datasets does this dataset relate? For example the dataset might re-use identifiers or directly link to resources in other datasets. Knowing the source of that information can help us build trust in the reliability of the combined data, as well as give us sign-posts to other useful context. This is where Linked Data excels.

Annotation Datasets provide context to, and enrich other reference datasets. Annotations might be limited to linking information (“Link Sets”) or they may add new facts/properties about existing resources. Independently sourced quality control information could be published as annotations.

Provenance is also a form of relationship information. Derived datasets, e.g. created through analysis or data conversions, should refer to their original input datasets, and ideally also the algorithms and/or code that were applied.

Again, much of this information can be derived from data analysis. Recommendations for relevant related datasets might be created based on existing links between datasets or by analysing usage patterns. Set algebra on URIs in datasets can be used to do analysis on their overlap, to discover linkages and to determine whether one dataset contains annotations of another.
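
As a trivial illustration of that sort of set algebra, assuming we’ve already extracted the URIs used by two (hypothetical) datasets:

require 'set'

# URIs extracted from two hypothetical datasets, e.g. by listing
# subject URIs from their data dumps
dataset_a = Set.new(['http://example.org/id/1', 'http://example.org/id/2'])
dataset_b = Set.new(['http://example.org/id/2', 'http://example.org/id/3'])

overlap   = dataset_a & dataset_b   # resources described in both datasets
only_in_b = dataset_b - dataset_a   # candidate links or annotations to follow up

# If most of B's URIs also appear in A, B looks like an annotation of A
annotation_ratio = overlap.size.to_f / dataset_b.size
puts overlap.to_a.inspect, only_in_b.to_a.inspect, annotation_ratio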

Examples:

  • List of dataset(s) that this dataset draws on (e.g. re-uses identifiers, controlled vocabulary, etc)
  • List of datasets that this dataset references, e.g. via links
  • List of source datasets used to compile or create this dataset
  • List of datasets that link to this dataset (“back links”)
  • Which datasets are often used in conjunction with this dataset?

Compositional Data

This is information about the internals of a dataset: e.g. what kind of data does it contain, how is that data organized, and what kinds of things are being described?

This is the most complex area as there are potentially a number of different audiences and abilities to cater for. At one end of the spectrum we want to provide high level summaries of the contents of a dataset, while at the other end we want to provide detailed schema information to support developers. I’ve previously advocated a “progressive disclosure” approach to allow re-users to quickly find the data they need; a product manager looking for data to support a new feature will be looking for different information to a developer constructing queries over a dataset.

I think there are three broad ways that we can decompose Compositional Data further. There are particular questions and types of information that relate to each of them:

  • Scope or Coverage 
    • What kinds of things does this dataset describe? Is it people, places, or other objects?
    • How many of these things are in the dataset?
    • Is there a geographical focus to the dataset, e.g. a county, region, country or is it global?
    • Is the data confined to a particular time period (archival data) or does it contain recent information?
  • Structure
    • What are some typical example records from the dataset?
    • What schema does it conform to?
    • What graph patterns (e.g. combinations of vocabularies) are commonly found in the data?
    • How are various types of resource related to one another?
    • What is the logical data model for the data?
  • Internals
    • What RDF terms and vocabularies are used in the data?
    • What formats are used for capturing dates, times, or other structured values?
    • Are there custom validation rules for particular fields or properties?
    • Are there caveats or qualifiers to individual schema elements or data items?
    • What is the physical data model?
    • How is the dataset laid out in a particular database schema, across a collection of files, or as named graphs?

The experiments we did in Kasabi around the report card (see the last slides for examples) were exploring ways to help visualise the scope of a dataset. It was based on identifying broad categories of entity in a dataset. I’m not sure we got the implementation quite right, but I think it was a useful visual indicator to help understand a dataset.

This is a project I plan to revive when I get some free time. Related to this is the work I did to map the Schema.org Types to the Noun Project Icons.

Summary

I’ve tried to present a framework that captures most, if not all of the kinds of questions that I’ve seen people ask when trying to get to grips with a new dataset. If we can understand the types of information people need and the questions they want to answer, then we can create a better set of data publishing and analysis tools.

To date, I think there’s been a tendency to focus on the Descriptive Data and Access Information — because we want to be able to discover data — and its Internals — so we know how to use it.

But for data to become more accessible to a non-technical audience we need to think about a broader range of information and how this might be surfaced by data publishing platforms.

If you have feedback on the framework, particularly if you think I’ve missed a category of information, then please leave a comment. The next step is to explore ways to automatically derive and surface some of this information.

What is a Dataset?

As my last post highlighted, I’ve been thinking about how we can find and discover datasets and their related APIs and services. I’m thinking of putting together some simple tools to help explore and encourage the kind of linking that my diagram illustrated.

There’s some related work going on in a few areas which is also worth mentioning:

  • Within the UK Government Linked Data group there’s some work progressing around the notion of a “registry” for Linked Data that could be used to collect dataset metadata as well as supporting dataset discovery. There’s a draft specification which is open for comment. I’d recommend you ignore the term “registry” and see it more as a modular approach for supporting dataset discovery, lightweight Linked Data publishing, and “namespace management” (aka URL redirection). A registry function is really just one aspect of the model.
  • There’s an Open Data on the Web workshop in April which will cover a range of topics including dataset discovery. My current thoughts are partly preparation for that event (and I’m on the Programme Committee)
  • There’s been some discussion and a draft proposal for adding the Dataset type to Schema.org. This could result in the publication of more embedded metadata about datasets. I’m interested in tools that can extract that information and do something useful with it.

Thinking about these topics I realised that there are many definitions of “dataset”. Unsurprisingly it means different things in different contexts. If we’re defining models, registries and markup for describing datasets we may need to get a sense of what these different definitions actually are.

As a result, I ended up looking around for a series of definitions and I thought I’d write them down here.

Definitions of Dataset

Let’s start with the most basic, for example Dictionary.com has the following definition:

“a collection of data records for computer processing”

Which is pretty vague. Wikipedia has a definition which derives from the term’s use in a mainframe environment:

“A dataset (or data set) is a collection of data, usually presented in tabular form. Each column represents a particular variable. Each row corresponds to a given member of the dataset in question. It lists values for each of the variables, such as height and weight of an object. Each value is known as a datum. The dataset may comprise data for one or more members, corresponding to the number of rows.

Nontabular datasets can take the form of marked up strings of characters, such as an XML file.”

The W3C Data Catalog Vocabulary defines a dataset as:

“A collection of data, published or curated by a single source, and available for access or download in one or more formats.”

The JISC “Data Information Specialists Committee” have a definition of dataset as:

“…a group of data files–usually numeric or encoded–along with the documentation files (such as a codebook, technical or methodology report, data dictionary) which explain their production or use. Generally a dataset is un-usable for sound analysis by a second party unless it is well documented.”

Which is a good definition as it highlights that the dataset is more than just the individual data files or facts: it also consists of some documentation that supports its use or analysis. I also came across a document called “A guide to data development” (2007) from the National Data Development and Standards Unit in Australia which describes a dataset as:

“A data set is a set of data that is collected for a specific purpose. There are many ways in which data can be collected—for example, as part of service delivery, one-off surveys, interviews, observations, and so on. In order to ensure that the meaning of data in the data set is clearly understood and data can be consistently collected and used, data are defined using metadata…”

This too has the notion of context and clear definitions to support usage, but also notes that the data may be collected in a variety of ways.

A Legal Definition

As it happens, there’s also a legal definition of a dataset in the UK, at least as far as it relates to Freedom of Information. The “Protection of Freedoms Act 2012 Part 6, (102) c” includes the following definition:

In this Act “dataset” means information comprising a collection of information held in electronic form where all or most of the information in the collection—

  • (a)has been obtained or recorded for the purpose of providing a public authority with information in connection with the provision of a service by the authority or the carrying out of any other function of the authority,
  • (b)is factual information which—
    • (i)is not the product of analysis or interpretation other than calculation, and
    • (ii)is not an official statistic (within the meaning given by section 6(1) of the Statistics and Registration Service Act 2007), and
  • (c)remains presented in a way that (except for the purpose of forming part of the collection) has not been organised, adapted or otherwise materially altered since it was obtained or recorded.”

This definition is useful as it defines the boundaries for what type of data is covered by Freedom of Information requests. It clearly states that the data is collected as part of the normal business of the public body and also that the data is essentially “raw”, i.e. it is not the result of analysis and has not been adapted or altered.

Raw data (as defined here!) is more useful as it supports more downstream usage. Raw data has more potential.

Statistical Datasets

The statistical community has also worked towards having a clear definition of dataset. The OECD Glossary defines a Dataset as “any organised collection of data”, but then includes context that describes that further, for example that a dataset is a set of values that have a common structure and are usually thematically related. However there’s also this note that suggests that a dataset may also be made up of derived data:

“A data set is any permanently stored collection of information usually containing either case level data, aggregation of case level data, or statistical manipulations of either the case level or aggregated survey data, for multiple survey instances”

Privacy is one key reason why a dataset may contain derived information only.

The RDF Data Cube vocabulary, which borrows heavily from SDMX — a key standard in the statistical community — defines a dataset as being made up of several parts:

  1. “Observations – This is the actual data, the measured numbers. In a statistical table, the observations would be the numbers in the table cells.
  2. Organizational structure – To locate an observation within the hypercube, one has at least to know the value of each dimension at which the observation is located, so these values must be specified for each observation…
  3. Internal metadata – Having located an observation, we need certain metadata in order to be able to interpret it. What is the unit of measurement? Is it a normal value or a series break? Is the value measured or estimated?…
  4. External metadata — This is metadata that describes the dataset as a whole, such as categorization of the dataset, its publisher, and a SPARQL endpoint where it can be accessed.”

The SDMX implementors guide has a long definition of dataset (page 7) which also focuses on the organisation of the data and specifically how individual observations are qualified along different dimensions and measures.

Scientific and Research Datasets

Over the last few years the scientific and research community have been working towards making their datasets more open, discoverable and accessible. Organisations like the Wellcome Trust have published guidance for researchers on data sharing; services like CrossRef and DataCite provide the means for giving datasets stable identifiers; and platforms like FigShare support the publishing and sharing process.

While I couldn’t find a definition of dataset from that community (happy to take pointers!) it’s clear that the definition of dataset is extremely broad. It could cover anything from raw results, e.g. output from sensors or equipment, through to more analysed results. The boundaries are hard to define.

Given the broad range of data formats and standards, services like FigShare accept any or all data formats. But as the Wellcome Trust note:

“Data should be shared in accordance with recognised data standards where these exist, and in a way that maximises opportunities for data linkage and interoperability. Sufficient metadata must be provided to enable the dataset to be used by others. Agreed best practice standards for metadata provision should be adopted where these are in place.”

This echoes the earlier definitions that included supporting materials as being part of the dataset.

RDF Datasets

I’ve mentioned a couple of RDF vocabularies already, but within the RDF and Linked Data community there are a couple of other definitions of dataset to be found. The Vocabulary of Interlinked Datasets (VoiD) is similar to, but predates, DCAT. Whereas DCAT focuses on describing a broad class of different datasets, VoiD describes a dataset as:

“…a set of RDF triples that are published, maintained or aggregated by a single provider…the term dataset has a social dimension: we think of a dataset as a meaningful collection of triples, that deal with a certain topic, originate from a certain source or process, are hosted on a certain server, or are aggregated by a certain custodian. Also, typically a dataset is accessible on the Web, for example through resolvable HTTP URIs or through a SPARQL endpoint, and it contains sufficiently many triples that there is benefit in providing a concise summary.”

Like the more general definitions this includes the notion that the data may relate to a specific topic or be curated by a single organisation. But this definition also makes some assumption about the technical aspects of how the data is organised and published. VoiD also includes support for linking to the services that relate to a dataset.

Along the same lines, SPARQL also has a definition of a Dataset:

“A SPARQL query is executed against an RDF Dataset which represents a collection of graphs. An RDF Dataset comprises one graph, the default graph, which does not have a name, and zero or more named graphs, where each named graph is identified by an IRI…”

Unsurprisingly for a technical specification this is a very narrow definition of dataset. It also differs from the VoiD definition. While both assume RDF as the means for organising the data, the VoiD term is more general, e.g. it glosses over details of the internal organisation of the dataset into named graphs. This results in some awkwardness when attempting to navigate between a VoiD description and a SPARQL Service Description.

Summary

If you’ve gotten this far, then well done :)

I think there are a couple of things we can draw out from these definitions which might help us when discussing “datasets”:

  • There’s a clear sense that a dataset relates to a specific topic and is collected for a particular purpose.
  • The means by which a dataset is collected and the definitions of its contents are important for supporting proper re-use
  • Whether a dataset consists of “raw data” or more analysed results can vary across communities. Both forms of dataset might be available, but in some circumstances (e.g. for privacy reasons) only derived data might be published
  • Depending on your perspective and your immediate use case the dataset may be just the data items, perhaps expressed in a particular way (e.g. as RDF).  But in a broader sense, the dataset also includes the supporting documentation, definitions, licensing statements, etc.

While there’s a common core to these definitions, different communities do have slightly different outlooks that are likely to affect how they expect to publish, describe and share data on the web.

Dataset and API Discovery in Linked Data

I’ve recently been thinking about how applications can discover additional data and relevant APIs in Linked Data. While there’s been lots of research done on finding and using (semantic) web services, I’m initially interested in supporting the kind of bootstrapping use cases covered by Autodiscovery.

We can characterise that use case as helping to answer the following kinds of questions:

  • Given a resource URI, how can I find out which dataset it is from?
  • Given a dataset URI, how can I find out which resources it contains and which APIs might let me interact with it?
  • Given a domain on the web, how can I find out whether it exposes some machine-readable data?
  • Where is the SPARQL endpoint for this dataset?

More succinctly: can we follow our nose to find all related data and APIs?

I decided to try and draw a diagram to illustrate the different resources involved and their connections. I’ve included a small version below:

Data and API Discovery with Linked Data

Let’s run through the links between different types of resources:

  • From Dataset to Sparql Endpoint (and Item Lookup, and Open Search Description): this is covered by VoiD which provides simple predicates for linking a dataset to three types of resources. I’m not aware of other types of linking yet, but it might be nice to support reconciliation APIs.
  • From Well-Known VoiD Description (background) to Dataset. This well known URL allows a client to find the “top-level” VoiD description for a domain. It’s not clear what that entails, but I suspect the default option will be to serve a basic description of a single dataset, with reference to sub-sets (void:subset) where appropriate. There might also just be rdfs:seeAlso links.
  • From a Dataset to a Resource. A VoiD description can include example resources, this blesses a few resources in the dataset with direct links. Ideally these resources ought to be good representative examples of resources in the dataset, but they might also be good starting points for further browsing or crawling.
  • From a Resource to a Resource Description. If you’re using “slash” URIs in your data, then URIs will usually redirect to a resource description that contains the actual data. The resource description might be available in multiple formats and clients can use content negotiation or follow Link headers to find alternative representations.
  • From a Resource Description to a Resource. A description will typically have a single primary topic, i.e. the resource it’s describing. It might also reference other related resources, either as direct relationships between those resources or via rdfs:seeAlso type links (“more data over here”).
  • From a Resource Description to a Dataset. This is where we might use a dct:source relationship to state that the current description has been extracted from a specific dataset.
  • From a SPARQL Endpoint (Service Description) to a Dataset. Here we run into some differences between definitions of dataset, but essentially we can describe in some detail the structure of the SPARQL dataset that is used in an endpoint and tie that back to the VoiD description. I found myself looking for a simple predicate that linked to a void:Dataset rather than describing the default and named graphs, but couldn’t find one.
  • I couldn’t find any way to relate a Graph Store to a Dataset or SPARQL endpoint. Early versions of the SPARQL Graph Store protocol had some notes on autodiscovery of descriptions, but these aren’t in the latest versions.

These links are expressed, for the most part, in the data but could also be present as Link headers in HTTP responses or in HTML (perhaps with embedded RDFa).
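
As a sketch of how a client might start to follow its nose, the snippet below fetches the well-known VoiD description for a domain and does a naive scan for an advertised SPARQL endpoint. The domain is just an example, and a real client would parse the RDF properly rather than pattern matching on the response body.

require 'net/http'
require 'uri'

# Example domain only; the VoiD spec reserves /.well-known/void for the description
uri = URI('http://example.org/.well-known/void')

response = Net::HTTP.get_response(uri)
# Follow a single redirect, since the description is often hosted elsewhere
if response.is_a?(Net::HTTPRedirection)
  response = Net::HTTP.get_response(URI(response['Location']))
end

# Naive scan for a void:sparqlEndpoint link; use a real RDF parser in practice
if response.body =~ /sparqlEndpoint>?\s*<([^>]+)>/
  puts "SPARQL endpoint: #{$1}"
else
  puts 'No SPARQL endpoint advertised'
end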

I’ve also not covered sitemaps at all, which provide a way to exhaustively list the key resources in a website or dataset to support mirroring and crawling. But I thought this diagram might be useful.

I’m not sure that the community has yet standardised on best practices for all of these cases and across all formats. That’s one area of discussion I’m keen to explore further.

A Brief Review of the Land Registry Linked Data

The Land Registry have today announced the publication of their Open Data — including both Price Paid information and Transactions as Linked Data. This is great to see, as it means that there is another UK public body making a commitment to Linked Data publishing.

I’ve taken some time to begin exploring the data. This blog post provides some pointers that may help others in using the Linked Data. I’m also including some hopefully constructive feedback on the approach that the Land Registry have taken.

The Land Registry Linked Data

The Linked Data is available from http://landregistry.data.gov.uk. This follows the general pattern used by other organisations publishing public sector Linked Data in the UK.

The data consists of a single SPARQL endpoint — based on the Open Source Fuseki server — which contains RDF versions of both the Price Paid and Transaction data. The documentation notes that the endpoint will be updated on the 20th of each month, with the equivalent of the monthly releases that are already published as CSV files.

Based on some quick tests, it would appear that the endpoint contains all of the currently published Open Data, which in total is 16,873,170 triples covering 663,979 transactions.

The data seems to primarily use custom vocabularies for describing the data.

The landing page for the data doesn’t include any examples, but I ran some SPARQL queries to extract a few.

So for Price Paid Data, the model appears to be that a Transaction has a Transaction Record which in turn has an associated Address. The transaction counts seem to be standalone resources.

The SPARQL endpoint for the data is at http://landregistry.data.gov.uk/landregistry/sparql. A test form is also available and that page has a couple of example queries, including getting Price Paid data based on a postcode search.

However I’d suggest that the following version might be slightly better as it includes the record status for the record, which will indicate whether it is an “add” or a “delete”:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX lrppi: <http://landregistry.data.gov.uk/def/ppi/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX lrcommon: <http://landregistry.data.gov.uk/def/common/>
SELECT ?paon ?saon ?street ?town ?county ?postcode ?amount ?date ?status
WHERE
{ ?transx lrppi:pricePaid ?amount .
 ?transx lrppi:transactionDate ?date .
 ?transx lrppi:propertyAddress ?addr.
 ?transx lrppi:recordStatus ?status.

 ?addr lrcommon:postcode "PL6 8RU"^^xsd:string .
 ?addr lrcommon:postcode ?postcode .

 OPTIONAL {?addr lrcommon:county ?county .}
 OPTIONAL {?addr lrcommon:paon ?paon .}
 OPTIONAL {?addr lrcommon:saon ?saon .}
 OPTIONAL {?addr lrcommon:street ?street .}
 OPTIONAL {?addr lrcommon:town ?town .}
}
ORDER BY ?amount
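
If you want to run that query programmatically, something along the following lines should work using the standard SPARQL protocol. I’m assuming the query has been saved to a local file and that the endpoint will honour a request for SPARQL JSON results, which Fuseki normally does.

require 'net/http'
require 'uri'
require 'json'

endpoint = URI('http://landregistry.data.gov.uk/landregistry/sparql')
query = File.read('price_paid.rq')   # the query shown above, saved locally

request = Net::HTTP::Post.new(endpoint)
request.set_form_data('query' => query)
request['Accept'] = 'application/sparql-results+json'

response = Net::HTTP.start(endpoint.host, endpoint.port) { |http| http.request(request) }
results = JSON.parse(response.body)

results['results']['bindings'].each do |row|
  puts [row.dig('paon', 'value'), row.dig('street', 'value'),
        row.dig('amount', 'value'), row.dig('date', 'value')].join(', ')
end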

General Feedback

Let’s start with the good points:

  • The data is clearly licensed so is open for widespread re-use
  • There is a clear commitment to regularly updating the data, so it should stay in line with the Land Registry’s other Open Data. This makes it reliable for developers to use the data and the identifiers it contains
  • The data uses Patterned URIs based on Shared Keys (the Land Registry’s own transaction identifiers) so building links is relatively straight-forward
  • The vocabularies are documented and the URIs resolve, so it is possible to lookup the definitions of terms. I’m already finding that easier than digging through the FAQs that the Land Registry publish for the CSV versions.

However I think there is room for improvement in a number of areas:

  • It would be useful to have more example queries, e.g. how to find the transactional data, as well as example Linked Data resources. A key benefit of a linked dataset is that you should be able to explore it in your browser. I had to run SPARQL queries to find simple examples
  • The SPARQL form could be improved: currently it uses a POST by default and so I don’t get a shareable URL for my query; the Javascript in the page also wipes out my query every time I hit the back button, making it frustrating to use
  • The vocabularies could be better documented, for example a diagram showing the key relationships would be useful, as would a landing page providing more of a conceptual overview
  • The URIs in the data don’t match the patterns recommended in Designing URI Sets for the Public Sector. While I believe that guidance is under review, the data is diverging from current documented best practice. Linked Data purists may also lament the lack of a distinction between resource and page.
  • The data uses custom vocabularies where there are existing vocabularies that fit the bill. The transactional statistics could have been adequately described by the Data Cube vocabulary with custom terms for the dimensions. The related organisations could have been described by the ORG ontology, and vCard with extensions ought to have covered the address information.

But I think the biggest oversight is the lack of linking, both internal and external. The data uses “strings” where it could have used “things”: for places, customers, localities, post codes, addresses, etc.

Improving the internal linking will make the dataset richer, e.g. allowing navigation to all transactions relating to a specific address, or all transactions for a specific town or postcode region. I’ve struggled to get a Post Code District based query to work (e.g. “price paid information for BA1”) because the query has to resort to regular expressions which are often poorly optimised in triple stores. Matching based on URIs is always much faster and more reliable.

External linking could have been improved in two ways:

  1. The dates in the transactions could have been linked to the UK Government Interval Sets. This provides URIs for individual days.
  2. The postcode, locality, district and other regional information could have been linked to the Ordnance Survey Linked Data. That dataset already has URIs for all of these resources. While it may have been a little more work to match regions, the postcode based URIs are predictable so are trivial to generate (see the sketch below).
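
For example, building a candidate Ordnance Survey postcode unit URI from a postcode string could be as simple as the sketch below. The URI pattern (uppercase, spaces stripped) reflects my reading of the OS Linked Data identifiers and should be checked against the live data before relying on it.

# Build a candidate OS Linked Data postcode unit URI from a postcode string.
# The pattern is an assumption based on the published OS identifiers.
def os_postcode_uri(postcode)
  key = postcode.upcase.gsub(/\s+/, '')
  "http://data.ordnancesurvey.co.uk/id/postcodeunit/#{key}"
end

puts os_postcode_uri('PL6 8RU')
# => http://data.ordnancesurvey.co.uk/id/postcodeunit/PL68RU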

These improvements would have moved the Land Registry data from 4 to 5 Stars with little additional effort. That does more than tick boxes: it makes the entire dataset easier to consume, query and remix with others.

Hopefully this feedback is useful for others looking to consume the data or who might be undertaking similar efforts. I’m also hoping that it is useful to the Land Registry as they evolve their Linked Data offering. I’m sure that what we’re seeing so far is just the initial steps.

How I organise data conversions

Factual announced a new project last week called Drake, which is billed as a “make for data”. The tool provides a make style environment for building workflows for data conversions: it has support for multiple programming languages, uses a standard project layout, and integrates with HDFS.

It looks like a really nice tool and I plan to take a closer look at it. When you’re doing multiple data conversions, particularly in a production setting, it’s important to adopt some standard practices. Having a consistent way to manage assets, convert data and manage workflows is really useful. Quick and dirty data conversions might get the job done, but a little thought up front can save time later when you need to refresh a dataset, fix bugs, or allow others to contribute. Consistency also helps when you come to add another layer of automation to run a number of conversions on a regular basis.

I’ve done a fair few data conversions over the last few years and I’ve already adopted an approach similar to Drake’s: I use a standard workflow, programming environment and project structure. I thought I’d write this down here in case it’s useful for others. It’s certainly saved me time. I’d be interested to learn what approaches other people take to help organise their data conversions.

Project Layout

My standard project layout is:

  • bin — the command-line scripts used to run a conversion. I tend to keep these task based, each focusing on one element of the workflow or conversion, e.g. separate scripts for crawling data, converting particular types of data, etc. Scripts are parameterised with input/output directories and/or filenames (see the sketch after this list)
  • data — created automatically, this sub-directory holds the output data
    • cache — a cache directory for all data retrieved from the web. When crawling or scraping data I always work on a local cached copy to avoid unnecessary network traffic
    • nt (or rdf) — for RDF conversions I typically generate N-Triples output as it’s simple to generate and work with in a range of tools. I sometimes generate RDF/XML output, but only if I’m using XSLT to do transformations from XML sources
  • etc — additional supporting files, including:
    • static — static data, e.g. hand-crafted data sources, RDF schema files, etc
    • sparql — SPARQL queries that are used in the conversion, as part of the “enrichment” phase
    • xslt — XSLT transforms, kept for when I’m using XML input and have found it easier to process with XSLT rather than libxml
  • lib — the actual code for the conversion. The scripts in the bin directory handle the input/output; the rest is done in Ruby classes
  • Rakefile — a Ruby Rakefile that describes the workflow. I use this to actually run the conversions
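
To make the layout more concrete, here’s a minimal sketch of the kind of script that lives in bin. The script name, URL handling and defaults below are purely illustrative, not taken from any real project: the idea is simply that a script fetches a source file into data/cache, skipping the download when a cached copy already exists, so that later stages can run offline.

#!/usr/bin/env ruby
# bin/cache.rb -- an illustrative sketch only; the script name and defaults are hypothetical.
# Fetch a source file into the local cache, skipping the download if a cached
# copy already exists, so that later stages can be run offline.
require "fileutils"
require "open-uri"

url = ARGV[0] || abort("Usage: cache.rb URL [CACHE_DIR]")
dir = ARGV[1] || "data/cache"
target = File.join(dir, File.basename(URI(url).path))

FileUtils.mkdir_p(dir)
if File.exist?(target)
  puts "Using cached copy: #{target}"
else
  File.binwrite(target, URI.open(url).read)
  puts "Downloaded #{url} -> #{target}"
end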

While there are some minor variations, I’ve used this same structure across a number of different conversions.

Workflow

The workflow for the conversion is managed using a Ruby Rakefile. Like Factual, I’ve found that a make-style environment is useful for organising simple data conversion workflows. Rake allows me to execute command-line tools (e.g. curl for downloading data, or rapper for RDF format conversions), run arbitrary Ruby code, and shell out to dedicated scripts.

I try to use a standard set of rake targets to co-ordinate the overall workflow. These are broken down into smaller stages where necessary. While the steps vary between datasets, the stages I most often use are listed below (a minimal Rakefile sketch follows the list):

  1. download (or cache) — the main starting point, which fetches the necessary data. I try and avoid manually downloading any data and rely on curl or perhaps dpm to get the required files. I’ve tended to use “download” for when I’m just grabbing static files and “cache” for when I’m doing a website crawl; this is just a cue for me. I like to tread carefully when hitting other people’s servers, so I aggressively cache files. Having a separate stage to grab data is also handy for when you’re working offline on later steps
  2. convert — perform the actual conversion, working on the locally cached files only. So far I tend to use either custom Ruby code or XSLT.
  3. reconcile — generate links to other datasets, often using the Google Refine Reconciliation API
  4. enrich — enrich the dataset with additional data, e.g. by performing SPARQL queries to fetch remote data, or materialise new data
  5. package — package up the generated output as a tar.gz file
  6. publish — the overall target which runs all of the above
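
As an illustration, here’s a heavily simplified Rakefile sketch along these lines. The script names, URLs and file names below are placeholders rather than anything from a real conversion; the point is just how the stages chain together as rake tasks.

# Rakefile -- a minimal sketch of the staged workflow described above.
# Task bodies are placeholders; in a real project they shell out to the
# scripts in bin/ and to tools like curl, rapper or tar.

directory "data/cache"
directory "data/nt"

desc "Fetch the source files into the local cache"
task :download => "data/cache" do
  sh "curl -s -o data/cache/source.csv http://example.org/source.csv" # placeholder URL
end

desc "Convert the locally cached files into N-Triples"
task :convert => [:download, "data/nt"] do
  ruby "bin/convert.rb", "data/cache", "data/nt" # hypothetical script
end

desc "Generate links to other datasets"
task :reconcile => :convert do
  # e.g. call a reconciliation script or API here
end

desc "Enrich the converted data"
task :enrich => :reconcile do
  # e.g. run the SPARQL queries kept in etc/sparql
end

desc "Package up the generated output"
task :package => :enrich do
  sh "tar -czf data/output.tar.gz -C data nt"
end

desc "Run the full conversion"
task :publish => :package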

The precise stages used vary between projects and there are usually a number of other targets in the Rakefile that perform specific tasks; for example, the convert stage is usually dependent on several other steps that generate particular types of data. But having standard stage names makes it easier to run specific parts of the overall conversion. One additional stage that would be useful to have is “validation”, so you can check the quality of the output.

At various times I’ve considered formalising these stages further, e.g. by creating some dedicated Rake extensions, but I’ve not yet found the need to do that as there’s usually very little code in each step.

I tend to separate out dependencies on external resources, e.g. remote APIs, from the core conversion. The convert stage will work entirely on locally cached data; I can then call out to other APIs in a separate reconcile or enrich stage. Again, this helps when working on parts of the conversion offline and allows the core conversion to happen without risk of failure because of external dependencies. If a remote API fails, I don’t want to have to re-run a potentially lengthy data conversion; I just want to resume from a known point.

I also try and avoid, as far as possible, using extra infrastructure, e.g. relying on databases, triple stores, or a particular environment. While this might help improve performance in some cases (particularly for large conversions), I like to minimise dependencies to make it easier to run the conversions in a range of environments, with minimal set-up and minimal cost for anyone running the conversion code. Many of the conversions I’ve been doing are relatively small scale; for larger datasets a triple store or Hadoop might be necessary, but this would be easy to integrate into the existing stages, perhaps by adding a “prepare” stage to do any necessary installation and configuration.

For me it’s very important to be able to automate the download of the initial data files or web pages that need scraping. This allows the whole process to be automated and cached files to be re-used where possible, which simplifies the process of using the data and avoids unnecessary load on data repositories. As I noted at the end of yesterday’s post on using dpm with data.gov.uk, having easy access to the data files is important. The context for interpreting that data mustn’t be overlooked, but consuming that information is done separately from using the data.

To summarise, there’s nothing very revolutionary here: I’m sure many of you use similar and perhaps better approaches. But I wanted to share my style for organising conversions and encourage others to do likewise.

How to use dpm with data.gov.uk

The Data Package Manager is an Open Knowledge Foundation project to create a tool to support discovery and distribution of datasets. The tool uses the concept of a “data package” to describe the basic metadata for a dataset plus the supporting files. Packages are indexed in a registry to make them searchable and to support distribution. The dpm tool works with the CKAN data portal software, using its API to search and download data packages.

The dpm documentation includes guidance on how to install and use the software. Once the basic software is installed you run:

dpm setup config

This will create a default configuration file called .dpmrc in your home directory. This configuration works with The Data Hub, allowing you to access its registry of over 5000 datasets. For example, there’s a basic RDF/XML version of the British National Bibliography; if we want to automatically download the files associated with that package, we can run the following command:

dpm download ckan://bluk-bnb-basic bnb-basic

The ckan:// parameter is an identifier for the dataset (note that bluk-bnb-basic is the same as the id used in the URL of the dataset on the Data Hub); the final parameter is the local directory to download into. This makes it easy to script up downloads of a dataset if the publisher has gone to the trouble of associating the files with their CKAN package.
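
Because dpm is driven from the command-line, downloads are easy to wrap in a small script. Here’s a minimal Ruby sketch, assuming dpm is already installed and configured as above; the package list just reuses the example id from this post, and the script name is hypothetical.

# download_packages.rb -- an illustrative sketch only.
# Loop over a list of CKAN package ids and shell out to dpm for each one,
# downloading the associated files into a directory named after the package.
packages = %w[bluk-bnb-basic]   # add further package ids here

packages.each do |pkg|
  system("dpm", "download", "ckan://#{pkg}", pkg) or warn "Download failed for #{pkg}"
end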

The data.gov.uk website has been built using CKAN. The API endpoint can be found at http://data.gov.uk/api/. This means that we can use dpm to interact with data.gov.uk too; all we need to do is specify that dpm should use a different registry.

To get dpm to use a different CKAN instance we need to modify its config:

  1. Take a copy of ~/.dpmrc and put it somewhere handy, e.g. ~/tools/datapkg/datagovuk.ini
  2. Edit the ckan.url entry and change it to http://data.gov.uk/api/
  3. When you run dpm use the --config or -c parameter to specify that it should use the alternate config

Here’s a gist that shows an example of the edited config. It’s best to just modify a copy of the default version, as there are other paths in there that should remain unchanged.

Here are some examples of using dpm with data.gov.uk. Make sure the config parameter points to the location of your revised configuration file:

Search data.gov.uk for packages with the keyword “spending”:

dpm --config datagovuk.ini search ckan:// spending

Get a summary of a package:

dpm --config datagovuk.ini info ckan://warwickshire-spending-allocation

Download the files associated with a package to a local data directory. The tool will automatically create sub-directories for the package:

dpm --config datagovuk.ini download ckan://warwickshire-spending-allocation data

The latter command would be much more useful if the data.gov.uk datasets consistently had the data files associated with them. Unfortunately, in many cases there is still just a reference to another website.

Hopefully this will improve over time — while it’s important for data to be properly documented and contextualised, to support easy re-use it must also be easy to automate the retrieval and processing of that data. These are two separate but important use cases.
