In a meeting today, I was discussing how and when open geospatial identifiers are useful. I thought this might make a good topic for a blog post in my continuing series of questions about data. So here goes.
An identifier provides an unambiguous reference for something about which we want to collect and publish data. That thing might be a road, a school, a parcel of land or a bus stop.
If we publish a dataset that contains some data about “Westminster” then, without some additional documentation, a user of that dataset won’t know whether the data is about a tube station, the Parliamentary Constituency, a company based in Hayes or a school.
If we have identifiers for all of those different things, then we can use the identifiers in our data. This lets us be confident that we are talking about the same things. Publishing data about “940GZZLUWSM” makes it pretty clear that we’re referring to a specific tube station.
If data publishers use the same sets of identifiers, then we can easily combine your dataset on the wheelchair accessibility of tube stations with my dataset of tube station locations and Transport for London’s transit data. We could then build an application that helps people in wheelchairs make better decisions about how to move around London.
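To make that concrete, here is a minimal sketch of combining two datasets that are keyed on the same identifiers. The identifier 940GZZLUWSM is the one mentioned above. Everything else (the `EXAMPLE-…` identifiers, the field names, the attribute values and the `combine` helper) is a made-up placeholder for illustration, not real published data.

```python
# Two independently published datasets, both keyed on the same identifiers.
# Only 940GZZLUWSM comes from the post; all other values are illustrative.
accessibility = {
    "940GZZLUWSM": {"step_free": True},
    "EXAMPLE-0002": {"step_free": False},
}
locations = {
    "940GZZLUWSM": {"name": "Westminster", "lat": 51.501, "lon": -0.125},
    "EXAMPLE-0003": {"name": "Example Station", "lat": 51.5, "lon": -0.1},
}


def combine(*datasets):
    """Merge the records for every identifier that appears in all datasets."""
    shared = set(datasets[0])
    for dataset in datasets[1:]:
        shared &= set(dataset)

    combined = {}
    for identifier in sorted(shared):
        record = {}
        for dataset in datasets:
            record.update(dataset[identifier])
        combined[identifier] = record
    return combined


merged = combine(accessibility, locations)
```

Because both publishers used the same identifier, the join is a simple key match. There is no guessing about whether “Westminster” in one file means the same thing as “Westminster” in another.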
To help us publish datasets that use the same identifiers, there are a few things that we repeatedly need to do.
For example, it’s common to have to look up an identifier based on the name of the thing we’re describing. E.g. what’s the code for Westminster tube station? We also often need to find information about an identifier we’ve found in a dataset. E.g. what’s the name of the tube station identified by 940GZZLUWSM? And where is it?
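Those two lookups might be sketched like this. The `REGISTER` list, the `EXAMPLE-…` identifiers and the function names are all hypothetical; only 940GZZLUWSM and the different kinds of “Westminster” come from the examples above.

```python
# A hypothetical register of identifiers with basic metadata.
# Only 940GZZLUWSM is a real identifier taken from the post.
REGISTER = [
    {"id": "940GZZLUWSM", "label": "Westminster", "type": "tube station"},
    {"id": "EXAMPLE-CONSTITUENCY", "label": "Westminster", "type": "parliamentary constituency"},
    {"id": "EXAMPLE-SCHOOL", "label": "Westminster", "type": "school"},
]


def find_by_label(label, feature_type=None):
    """Return the identifiers of every entry whose label matches.

    The optional feature_type narrows the results, because a name
    on its own can be ambiguous.
    """
    wanted = label.strip().lower()
    return [
        entry["id"]
        for entry in REGISTER
        if entry["label"].lower() == wanted
        and (feature_type is None or entry["type"] == feature_type)
    ]


def describe(identifier):
    """Return the metadata held for an identifier, or None if it is unknown."""
    for entry in REGISTER:
        if entry["id"] == identifier:
            return entry
    return None
```

Searching for “Westminster” alone returns several candidates, which is exactly the ambiguity that identifiers are meant to remove. Adding the type narrows the result to the tube station.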
When we’re working with geospatial data we often need to find identifiers based on a physical location. For example, based on a latitude and longitude:
- Where is the nearest tube station?
- Or, what polling district am I in, so I can find out where I should go to vote?
- Or, what is the identifier for the parcel of land that contains these co-ordinates?
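The first of those questions can be sketched as a nearest-neighbour search using the haversine (great-circle) distance. The station list, the approximate coordinates and the function names are illustrative assumptions. The other two questions need polygon boundaries and a point-in-polygon test, which this sketch does not attempt.

```python
import math

# Illustrative data: approximate coordinates, and a made-up second station.
STATIONS = [
    {"id": "940GZZLUWSM", "label": "Westminster", "lat": 51.501, "lon": -0.125},
    {"id": "EXAMPLE-0002", "label": "Example Station", "lat": 51.530, "lon": -0.100},
]

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest(lat, lon, features):
    """Return the feature closest to the given point (a simple linear scan)."""
    return min(
        features,
        key=lambda f: haversine_km(lat, lon, f["lat"], f["lon"]),
    )


closest = nearest(51.5007, -0.1246, STATIONS)
```

A linear scan like this is fine for a handful of features. For a national dataset you would want a spatial index instead.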
It can be helpful if these repeated tasks are turned into specialised services (APIs) that make it easier to perform them on demand. The alternative is that we all have to download and index the necessary datasets ourselves.
Choosing which identifiers to use in a dataset is an important part of creating agreements around how we publish data. We call those agreements data standards.
The more datasets that use the same set of identifiers, the easier it becomes to combine those datasets together, in various combinations that will help to solve a range of problems. To put it another way, using common identifiers helps to generate network effects that make it easier for everyone to publish and use data.
I think it’s true to say that almost every problem that we might try to solve with better use of data requires the combination of several different datasets. Some of those datasets might come from the private sector. Some of them might come from the public sector. No single organisation holds all of the data.
This makes it important to be able to share and reuse identifiers across different organisations. And that is why it is important that those identifiers are published under an open licence.
Open licences allow anyone to access, use and share data. Openly licensed identifiers can be used in both open datasets and those that are shared under more restrictive licences. They give data publishers the freedom to choose the correct licence for their dataset, so that it sits at the right point on the data spectrum.
Identifiers that are not published under an open licence remove that choice. Restricted licensing limits the ability of publishers to share their data in the way that makes sense for their business model or application. Restrictive licences cause friction that gets in the way of making data as open as possible.
Open identifiers create open ecosystems. They create opportunities for a variety of business models, products and services. For example, intermediaries can create platforms that aggregate and distribute data that has been published by a variety of different organisations.
So, the best identifiers are those that are:
- published under an open licence that allows anyone to access, use and share them
- published alongside some basic metadata (a label, a location or other geospatial data, a type)
- accessible via services that allow them to be easily used
Who provides that infrastructure?
Whenever there is friction around the use of data, application developers are left with a difficult choice. They either have to invest time and effort in working around that friction, or compromise their plans in some way. The need to quickly bring products to market may lead to choices which are not ideal.
For example, developers may choose to build applications against Google’s mapping services. These services are easily and immediately available to any developer wanting to display a map or recommend a route to a user. But these platforms usually have restricted licensing, which means it is the platform provider that tends to reap the most benefit. In the absence of open licences, network effects can lead to data monopolies.
So who should provide these open identifiers, and the metadata and services that support them?
This is the role of national mapping agencies. These agencies will already have identifiers for important geospatial features. The Ordnance Survey has an identifier called a TOID (Topographic Identifier), which is assigned to every feature in Great Britain. But there are other identifiers in use too. Some are designed to support publication of specific types of data, e.g. UPRNs (Unique Property Reference Numbers).
These identifiers are national assets. They should be managed as data infrastructure and not be tied up in commercial data products.
Publishing these identifiers under an open licence, in the ways that have been outlined here, will provide a framework to support the collection and curation of geospatial data by many different organisations, across the public and private sector. That infrastructure will allow value to be created from that geospatial data in a variety of new ways.
Provision of this type of infrastructure is also in line with what we can see happening across other parts of government, for example the work of the GDS team to develop registers of important data. Identifiers, registers and standards are important building blocks of our local, national and global data infrastructure.
If you’re interested in reading more about the benefits of open identifiers, take a look at this white paper that I wrote with colleagues from the Open Data Institute and Thomson Reuters: “Creating value from identifiers in an open data world”