UnINSPIREd: problems accessing local government geospatial data

This weekend I started a side project which I plan to spend some time on this winter. The goal is to create a web interface that will let people explore geospatial datasets published by the three local authorities that make up the West of England Combined Authority: Bristol City Council, South Gloucestershire Council and Bath & North East Somerset Council.

Through Bath: Hacked we’ve already worked with the council to publish a lot of geospatial data. We’ve also run community mapping events and created online tools to explore geospatial datasets. But we don’t have a single web interface that makes it easy for anyone to explore that data and perhaps mix it with new data that they have collected.

Rather than build something new, which would be fun but time consuming, I’ve decided to try out TerriaJS. It’s an open source, web-based mapping tool that is already being used to publish the Australian National Map. It should handle the West of England quite comfortably. It’s got a great set of features and can connect to existing data catalogues and endpoints. It seems to be perfect for my needs.
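For anyone who hasn’t used it, TerriaJS is configured with JSON “init” files that describe a catalogue of map layers. A minimal sketch of a catalogue entry for a WMS layer looks something like this (the names and URL are placeholders, not a real endpoint):

```json
{
  "catalog": [
    {
      "name": "West of England",
      "type": "group",
      "items": [
        {
          "name": "Allotments (example)",
          "type": "wms",
          "url": "https://example.org/geoserver/wms",
          "layers": "allotments"
        }
      ]
    }
  ]
}
```

In principle, wiring up a council’s published INSPIRE endpoints should just be a matter of adding entries like this.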

I decided to start by configuring the datasets that are already in the Bath: Hacked Datastore, the Bristol Open Data portal, and data.gov.uk. Every council also has to publish some data via standard APIs as part of the INSPIRE regulations, so I hoped to be able to quickly bring in a list of existing datasets without having to download and manage them myself.

Unfortunately this hasn’t proved as easy as I’d hoped. Based on what we’ve learned so far about the state of geospatial data infrastructure in our project at the ODI, I had reasonably low expectations. But there’s nothing like some practical experience to really drive things home.

Here are a few of the challenges and issues I’ve encountered so far.

  • The three councils are publishing different sets of data. Why is that?
  • The dataset licensing isn’t open and looks to be inconsistent across the three councils. When is something covered by INSPIRE rather than the PSMA end user agreement?
  • The new data.gov.uk “filter by publisher” option doesn’t return all datasets for the specified publisher. I’ve reported this as a bug; in the meantime I’ve fallen back on searching by name.
  • The metadata for the datasets is pretty poor, and there is little supporting documentation. I’m not sure what some of the datasets are intended to represent. What are “core strategy areas”?
  • The INSPIRE service endpoints do include metadata that isn’t exposed via data.gov.uk. For example this South Gloucestershire dataset includes contact details, data on geospatial extents, and format information which isn’t otherwise available. It would be nice to be able to see this and not have to read the XML.
  • None of the metadata appears to tell me when the dataset was last updated. The last modified date on data.gov.uk is (I think) the date the catalogue entry was last updated. Are the Section 106 agreements listed in this dataset from 2010, or are they regularly updated? How can I tell?
  • Bath is using GetMapping to host its INSPIRE datasets. Working through them on data.gov.uk I found that 46 out of the 48 datasets I reviewed have broken endpoints. I’m reasonably certain these used to work. I’ve reported the issue to the council.
  • The two datasets that do work in Bath cannot be used in TerriaJS. I managed to work around the fact that they require a username and password to access, but have hit a wall because the GetMapping APIs only seem to support EPSG:27700 (British National Grid) and not EPSG:3857 as used by online mapping tools. So the APIs refuse to serve the data in a way that can be used by the framework. The Bristol and South Gloucestershire endpoints handle this fine. I assume this is either a limitation of the GetMapping service or a misconfiguration. I’ve asked for help. (A way to check this automatically across endpoints is sketched after this list.)
  • A single Web Mapping Service can expose multiple datasets as individual layers. But unlike Bristol, both Bath and South Gloucestershire are publishing each dataset through its own API endpoint. I hope the services they’re using aren’t charging per endpoint, as the extra endpoints seem unnecessary. Bristol has chosen to publish a couple of APIs that bring together several datasets, but these are also available individually through separate APIs.
  • The same datasets are repeated across data catalogues and endpoints. Bristol has its data listed as individual datasets in its own platform, listed as individual datasets in data.gov.uk, and also exposed via two different collections which bundle some (or all?) of them together. I’m unclear on the overlap or whether there are differences between them in terms of scope, timeliness, etc. The licensing is also different. Exploring the three different datasets that describe allotments in Bristol, only one actually displayed any data in TerriaJS; I don’t know why.
  • The South Gloucestershire web mapping services all worked seamlessly, but I noticed that if I wanted to download the data, then I would need to jump through hoops to register to access it. Obviously not ideal if I do want to work with the data locally. This isn’t required by the other councils. I assume this is a feature of MisoPortal.
  • The South Gloucestershire datasets don’t seem to include any useful attributes for the features represented in the data. When you click on the points, lines and polygons in TerriaJS no additional information is displayed. I don’t know yet whether this data just isn’t included in the dataset, or if it’s a bug in the API or in how TerriaJS is requesting it. I’d need to download or explore the data in some other way to find out. However the data that is available from Bath and Bristol also has inconsistencies in how it’s described, so I suspect there aren’t any agreed standards.
  • Neither the GetMapping nor the MisoPortal APIs support CORS. This means you can’t access the data from Javascript running directly in the browser, which is what TerriaJS does by default. I’ve had to configure those endpoints to be accessed via a proxy. “Web mapping services” should work on the web.
  • While TerriaJS doesn’t have a plugin for OpenDataSoft (which powers the Bristol Open Data platform), I found that OpenDataSoft do provide a Web Feature Service interface, so I was able to configure TerriaJS to use that instead. Unfortunately I then found that either there’s a bug in the platform or a problem with the data, because most of the points were in the Indian Ocean.
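Checking dozens of endpoints by hand is tedious, so I’ve been automating some of it. Here’s a sketch of the kind of script I mean, using the open source OWSLib library (the endpoint URLs are placeholders, not the real service addresses). It requests each service’s capabilities document, which both confirms the endpoint is alive and reports whether its layers can be served in the EPSG:3857 projection that web mapping frameworks expect. Because it runs outside the browser, it also sidesteps the CORS restrictions mentioned above.

```python
# pip install owslib
from owslib.wms import WebMapService

# Placeholder URLs for illustration; the real endpoints are listed on data.gov.uk
ENDPOINTS = {
    "bathnes-example": "https://example.getmapping.com/bathnes/inspire/wms",
    "southglos-example": "https://example.misoportal.com/southglos/inspire/wms",
}

for name, url in ENDPOINTS.items():
    try:
        # Fetches and parses the WMS GetCapabilities response
        wms = WebMapService(url, version="1.3.0")
    except Exception as err:
        print(f"{name}: BROKEN ({err})")
        continue
    for layer_id, layer in wms.contents.items():
        # crsOptions lists the projections the service will render this layer in
        ok = "EPSG:3857" in layer.crsOptions
        print(f"{name} / {layer_id}: web mercator supported: {ok}")
```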

The goal of the INSPIRE legislation was to provide a common geospatial data infrastructure across Europe. What I’m trying to do here should be relatively quick and easy. Looking at this graph of INSPIRE conformance for the UK, everything looks rosy.

But, based on an admittedly small sample of only three local authorities, the reality seems to be that:

  • services are inconsistently implemented and have not been designed to be used as part of native web applications and mapping frameworks
  • metadata quality is poor
  • there is inconsistent detail about features which makes it hard to aggregate, use and compare data across different areas
  • it’s hard to tell the provenance of data because of duplicated copies of data across catalogues and endpoints. Without modification or provenance information, it’s unclear whether data is up to date
  • licensing is unclear
  • links to service endpoints are broken. At best, this leads to wasted time from data users. At worst, there’s public money being spent on publishing services that no-one can access

It’s important that we find ways to resolve these problems. As this recent survey by the ODI highlights, SMEs, startups and local community groups all need to be able to use this data. Local government needs more support to help strengthen our geospatial data infrastructure.

The building blocks of data infrastructure – Part 2

This is the second part of a two part post looking at the building blocks of data infrastructure. In part one we looked at definitions of data infrastructure, and the first set of building blocks: identifiers, standards and registers. You should read part one first and then jump back in here.

We’re using the example of weather data to help us think through the different elements of a data infrastructure. In our fictional system we have a network of weather stations installed around the world. The stations are feeding weather observations into a common database. We’ve looked at why it’s necessary to identify our stations, the role of standards, and the benefits of building registers to help us manage the system.

Technology

Technology is obviously part of data infrastructure. In part one we have already introduced several types of technology.

The sensors and components that are used to build the weather stations are also technologies.

The data standards that define how we organise and exchange data are technologies.

The protocols that help us transmit data, like WiFi or telecommunications networks, are technologies.

The APIs that are used to submit data to the global database of observations, or which help us retrieve observations from it, are also technologies.

Unfortunately, I often encounter the mistaken assumption that data infrastructure is only about the technologies we use to help us manage and exchange data.

To use an analogy, this is a bit like focusing on tarmac and kerb stones as the defining characteristics of our road infrastructure. These materials are both important and necessary, but are just parts of a larger system. If we focus only on technology it’s easy to overlook the other, more important building blocks of data infrastructure.

We should be really clear when we are talking about “data infrastructure”, which encompasses all the building blocks we are discussing here, and when we are talking about “infrastructure for data”, which focuses just on the technologies we use to collect and manage data.

Technologies evolve and become obsolete. Over time we might choose to use different technologies in our data infrastructure.

What’s important is choosing technologies that ensure our data infrastructure is as reliable, sustainable and as open as possible.

Organisations

Our data infrastructure is taking shape. We now have a system that consists of weather stations installed around the world, reporting local weather observations into a central database. That dataset is the primary data asset that we will be publishing from our data infrastructure.

We’ve explored the various technologies, data standards and some of the other data assets (registers) that enable the collection and publishing of that data.

We’ve not yet considered the organisations that maintain and govern those assets.

The weather stations themselves will be manufactured and installed by many different organisations around the world. Other organisations might offer services to help maintain and calibrate stations after they are installed.

A National Meteorological Service might take on responsibility for maintaining the network of stations within its nation’s borders. The scope of their role will be defined by national legislation and policies. But a commercial organisation might also choose to take on responsibility for running a collection of stations.

In our data infrastructure, the central database of observations will be curated and managed by a single organisation. The (fictional) Global Weather Office. Our Global Weather Office will do more than just manage data assets. It also has a role to play in choosing and defining the data standards that support data collection. And it helps to certify which models of weather station conform to those standards.

Organisations are a key building block of data infrastructure. The organisational models that we choose to govern a data infrastructure, and which take responsibility for its sustainability, are an important part of its design.

The value of the weather observations comes from their use. E.g. as input into predictive models to create weather forecasts and other services. Many organisations will use the observation data provided by our data infrastructure to create a range of products and services. E.g. national weather forecasts, or targeted advice for farmers that is delivered via farm management systems. The data might also be used by researchers. Or by environmental policy-makers to inform their work.

Mapping out the ecosystem of organisations that operate and benefit from our data infrastructure will help us to understand the roles and responsibilities of each organisation. It will also help clarify how and where value is being created.

Guidance and Policies

With so many different organisations operating, governing and benefiting from our data infrastructure we need to think about how they are supported in creating value from it.

To do this we will need to produce a range of guidance and policies, for example:

  • Documentation for all of the data assets that helps to put them in context, allowing them to be successfully used to create products and services. This might include notes on how we have collected our data, the standards used, and locations of our stations.
  • Recommendations for how data should be processed and interpreted to ensure that weather forecasts that use the data are reliable and safe
  • Licences that define how the data assets can be used
  • Documentation that describes the data governance processes that are applied to the data assets
  • Policies that define how organisations gain access to the data infrastructure, e.g. to start supplying data from new stations
  • Policies that decide how, when and where new stations might be added to the global network, to ensure that global coverage is maintained
  • Procurement policies that define how stations, and the services that relate to them, are purchased
  • National regulations that apply to manufacture of weather stations, or that set safety standards that apply when they are installed or serviced
  • …etc

Guidance and policies are an important building block that help to shape the ecosystem that supports and benefits from our data infrastructure.

A strong data infrastructure will have policies and governance that will support equitable access to the system. Making infrastructure as open as possible will help to ensure that as many organisations as possible have the opportunity to use the assets it provides, and have equal opportunities to contribute to its operation.

Community

Why do we collect weather data? We do it to help create weather forecasts, monitor climate change and a whole host of other reasons. We want the data to be used to make decisions.

Many different people and organisations might benefit from the weather data we are providing. A commuter might just want to know whether to take an umbrella to work. A farmer might want help in choosing which crops to plant. Or an engineer planning a difficult construction task may need to know the expected weather conditions.

Outside of the organisations who are directly interacting with our data infrastructure there will be a number of communities, made up of both individuals and organisations, who will benefit from the products and services made with the data assets it provides. Communities are the final building block of our data infrastructure.

These communities will be relying on our data infrastructure to plan their daily lives, activities and to make business decisions. But they may not realise it. Good infrastructure is boring and reliable.

In his book on the social value of infrastructure, Brett Frischmann refers to infrastructure as “shared means to many ends”. Governing and maintaining infrastructure requires us to recognise this variety of interests and make choices that balance a variety of needs.

The choices we make about who has access to our data infrastructure, and how it will be made sustainable, will be important in ensuring that value can be created from it over the long-term.

Reviewing our building blocks

To summarise, our building blocks of data infrastructure are:

  • Identifiers
  • Standards
  • Registers
  • Technology, of various kinds
  • Organisations, who create, maintain, govern and use our infrastructure
  • Guidance and Policies that inform its use
  • Communities who benefit from it or are affected by it

The building blocks are of different sizes. Identifiers are a well-understood technical concept. Organisations, policies and communities are more complex, and perhaps less well-defined.

Understanding their relationships, and how they benefit from being more open, requires us to engage in some systems thinking. By identifying each building block I hope we can start to have deeper conversations about the systems we are building.

Over time we might be able to tease out more specific building blocks. We might be able to identify important organisational roles that occur as repeated patterns across different types of infrastructure. Or specific organisational models that have been found to be successful in creating trusted, sustainable infrastructures. Over time we might also identify key types of policy and guidance that are important elements of ensuring that a data infrastructure is successful. These are research questions that can help us refine our understanding of data as infrastructure.

There are other aspects of data infrastructure which we have not explicitly explored here. For example ethics and trust. This is because ethics is not a building block. It’s a way of working that will enable fairer, equitable access to data infrastructure by a variety of communities. Ethics should inform every decision and every activity we take to design, build and maintain our data infrastructure.

Trust is also not a building block. Trust emerges from how we operate and maintain our data infrastructures. Trust is earned, rather than designed into a system.

Help me make this better

I’ve written these posts to help me work through some of my thoughts around data infrastructure. There’s a lot more to be said about the different building blocks. And choosing other examples, e.g. that focus on data infrastructure that involves sharing of personal data like medical records, might better highlight some different characteristics.

Let me know what you think of this breakdown. Is it useful? Do you think I’ve missed some building blocks? Leave a comment or tweet me your thoughts.

Thanks to Peter Wells and Jeni Tennison for feedback and suggestions that have helped me write these posts.

The building blocks of data infrastructure – Part 1

Data is a vital form of infrastructure for our societies and our economies. When we think about infrastructure we usually think of physical things like roads and railways.

But there are broader definitions of infrastructure that include less tangible things. Like ideas or the internet.

It is important to recognise that there is more to “infrastructure” than just roads and railways. Otherwise there is a risk that we, as a society, won’t invest the necessary time or effort in building, maintaining and governing that infrastructure. The decisions we make about infrastructure are important because infrastructure helps to shape our societies.

To help explore the idea of data as infrastructure, I want to look at the various building blocks that make up a specific example of a data infrastructure. My hope is that this will help to make it clearer that “data infrastructure” is about more than just technology. As we will see the technical infrastructure we use to manage data is just one component of data infrastructure.

The example we will use is greatly simplified and is partly fictionalised, but it is essentially a real example: we’re going to look at weather data infrastructure.

I’ve written about weather data infrastructure before. It’s a really interesting example to explore in this context because:

  • it’s easy to understand the value of collecting weather data
  • it’s a complex enough example to help dig into some real-world issues
  • it illustrates how data that is collected and used locally or nationally can also be part of a global data infrastructure

Weather data is usually open data, or at least public. But the building blocks we will outline here apply equally well to data from across the data spectrum. In a follow-up post I may explore a more complex example that illustrates a different type of data infrastructure, e.g. one for medical research that relies on researchers having access to medical records.

In the following sections we’ll look at the different building blocks that are important in building a global weather data infrastructure. The real infrastructure is much more complex.

Some of the building blocks are a bit fuzzy and have multiple roles to play in our infrastructure. But that’s fine. The world isn’t a neat and tidy place that we can always reduce to simpler components.

A definition of data infrastructure

Before we begin, let’s introduce a definition of data infrastructure:

A data infrastructure consists of data assets, the standards and technologies that are used to curate and provide access to those assets, the guidance and policies that inform their use and management, the organisations that govern the data infrastructure, and the communities involved in maintaining it or that are impacted by decisions that are made using those data assets.

There are a lot of moving parts there. And there are lots of things to say about each of them. For now let’s focus on the individual building blocks to explore ways in which they fit together.

Identifiers

Imagine we’re planning to build a global network of weather stations. Each station will be regularly recording the local temperature and rainfall. In our system we’ll be collecting all of these readings into a global dataset of weather observations.

So that we know which observations have been reported by which weather station, we need a unique reference for each of them.

We can’t just use the name of the town or village in which the station has been installed as that reference. There are Birminghams in both the UK and the US, for example. We might also need to move and reinstall weather stations over time, while still needing to track information about them, such as when they were installed or serviced. So we need a global identifier that is more reliable than just a name.

By assigning each weather station a unique identifier, we can then attach additional data to it. Like its current location. We can also associate the identifier with every temperature and rainfall observation, so that we know which station reported that data.

Identifiers are the first building block of our data infrastructure.

Identifiers are deceptively simple. They’re just a number or a code, right? But there’s a lot to say about them, such as how they are assigned or are formatted. It can be hard to create good identifiers.

When identifiers are open, for anyone to use in their data, they have a role to play that goes beyond just providing unique references in a database. They can also help to create network effects that encourage publication of additional data.

Standards, part 1

Our weather stations are recording temperature and rainfall. We’ll measure temperature in degrees Centigrade and rainfall in millimetres. Both of these are standard units of measurement.

Standards are our second building block.

Standards are documented, reusable agreements. They help us collect and organise data in consistent ways, and make it easier to work with data from different sources.

Some standards, like units of measurement, are global and are used in many different ways. But some standards might only be relevant to specific communities or systems.

In our weather data infrastructure, we will need to standardise some other aspects of how we plan to collect weather data.

For example, let’s assume that our weather stations are recording data every half an hour. Every thirty minutes a station will record a new temperature reading. But is it recording the temperature at that specific moment in time, or should it report the average temperature over the last thirty minutes? There may be advantages in doing one or the other.

If we don’t standardise some of our data collection practices, then weather stations created by different manufacturers might record data differently. This will affect the quality of our data.  

Standards, part 2

Every data infrastructure will rely on a wide variety of different standards. Some standards support consistent measurement and data collection. Others help us to exchange data more effectively.

Our weather stations will need to record the data they collect and automatically upload it to a service that helps us build our global database. In a real system there are a number of different ways in which we might want a weather station to report data, to provide a variety of ways in which it could be aggregated and reused. But to simplify things, we’ll assume they just upload their data to a centralised service. Centralised data collection is problematic for a number of reasons, but that’s a topic for another article.

To help us define how the weather stations will upload their data we will need to pick a standard data format that will define the syntax for recording data in a machine-readable form. Let’s assume that we decide to use a simple CSV (comma-separated values) format.

Each station will produce a CSV file that contains one row for every half-hourly observation. Each row will consist of a station identifier, a time stamp for the recordings, a temperature reading and a rainfall reading.

The time stamps can be recorded using ISO 8601, which is an international standard for formatting dates and times. Helpfully we can include time zones, which will be essential for reporting time accurately across our global network of weather stations.

We also need to ensure that the order in which the four fields will be reported is consistent, or that the headers in the CSV file clearly identify what is contained in each column. Again, we might be using weather stations from multiple manufacturers and need data to be recorded consistently. Some stations might also include additional sensors, e.g. to record wind speed. So ideally our standard will be extensible to support that additional data. Taking time to design and standardise our CSV format will make data aggregation easier.
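Putting those decisions together, reporting from a single station might look something like the following sketch (the station identifier and readings are invented for illustration):

```
station_id,timestamp,temperature_c,rainfall_mm
WS-000421,2017-11-27T09:00:00Z,4.2,0.0
WS-000421,2017-11-27T09:30:00Z,4.5,0.2
WS-000421,2017-11-27T10:00:00Z,5.1,0.0
```

Even a format this simple embodies several agreements: the column order, the units, and the use of ISO 8601 timestamps.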

Every time we define how to collect, manage or share data within a system, we are creating agreements that will help ensure that everyone involved in those processes can be sure that those tasks are carried out in consistent ways. When we reuse existing standards, rather than creating bespoke versions, we can benefit from the work of thousands of different specialists across a variety of industries.

Sometimes though we do need to define a new standard, like the order of the columns in our specific type of CSV file. But we should approach this by building on existing standards as much as possible.

Registers

To help us manage our network of weather stations it will be useful to record where each of them has been installed. It would also be helpful to record when they were installed. Then we can figure out when they might need to be re-calibrated or replaced, and send someone out to do the necessary work.

To do this, we can create a dataset that lists the identifier, location, model and installation date of every weather station.

This type of dataset is called a register.

Registers are lists of important data. They have multiple uses, but are most frequently used to help us improve the quality of our data reporting.
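As a sketch, our station register might be as simple as this (the values are invented for illustration):

```
station_id,latitude,longitude,model,installed_on
WS-000421,51.3811,-2.3590,WX-200,2016-05-14
WS-000422,51.4545,-2.5879,WX-200,2016-06-02
```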

For example we can use the above register to confirm that we’re regularly receiving data from every station on the network. When a station is installed it will need to get added to the register. We might give the company installing the station permission to do that, to help us maintain the register.

We can also use the register to determine if we have a good geographic spread of stations, to help us assess and improve the coverage and quality of the observations we’re collecting. The register is also useful for anyone using our global dataset so they can see how the dataset has been collected over time. Registers should be as open as possible.

There are other types of register that might be useful for governing our data infrastructure. For example we might create a register that lists all of the models of weather station that have been certified to comply with our preferred data standards.

We can use that register to help us make decisions about how to replace stations when they fail. A register can also help provide an incentive for the manufacturers of weather stations to conform to our chosen standards. If they’re not on the list, then we might not buy their products.

In Part 2 of this post we’ll look at other aspects of data infrastructure, including technology, organisations and policies. Thanks to Peter Wells and Jeni Tennison for feedback and suggestions that have helped me write these posts.

When are open (geospatial) identifiers useful?

In a meeting today, I was discussing how and when open geospatial identifiers are useful. I thought this might make a good topic for a blog post in my continuing series of questions about data. So here goes.

An identifier provides an unambiguous reference for something about which we want to collect and publish data. That thing might be a road, a school, a parcel of land or a bus stop.

If we publish a dataset that contains some data about “Westminster” then, without some additional documentation, a user of that dataset won’t know whether the data is about a tube station, the Parliamentary Constituency, a company based in Hayes or a school.

If we have identifiers for all of those different things, then we can use the identifiers in our data. This lets us be confident that we are talking about the same things. Publishing data about “940GZZLUWSM” makes it pretty clear that we’re referring to a specific tube station.

If data publishers use the same sets of identifiers, then we can start to easily combine your dataset on the wheelchair accessibility of tube stations, with my dataset of tube station locations and Transport for London’s transit data. So we can build an application that will help people in wheelchairs make better decisions about how to move around London.
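To make that concrete, here’s a sketch in Python of how shared identifiers let two independently published datasets be combined with a simple join, rather than error-prone name matching. The sample data is invented, apart from 940GZZLUWSM, the code mentioned above; the second code is made up for illustration.

```python
import pandas as pd

# Your dataset: wheelchair accessibility, keyed by station identifier
accessibility = pd.DataFrame({
    "station_id": ["940GZZLUWSM", "940GZZLUXYZ"],  # second code invented
    "step_free_access": [False, True],
})

# My dataset: station locations, keyed by the same identifiers
locations = pd.DataFrame({
    "station_id": ["940GZZLUWSM", "940GZZLUXYZ"],
    "lat": [51.501, 51.503],
    "lon": [-0.125, -0.113],
})

# Because both datasets use the same identifiers, combining them is trivial
stations = accessibility.merge(locations, on="station_id")
print(stations)
```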

Helpful services

To help us publish datasets that use the same identifiers, there are a few things that we repeatedly need to do.

For example it’s common to have to look up an identifier based on the name of the thing we’re describing. E.g. what’s the code for Westminster tube station? We often need to find information about an identifier we’ve found in a dataset. E.g. what’s the name of the tube station identified by 940GZZLUWSM? And where is it?

When we’re working with geospatial data we often need to find identifiers based on a physical location. For example, based on a latitude and longitude:

  • Where is the nearest tube station?
  • Or, what polling district am I in, so I can find out where I should go to vote?
  • Or, what is the identifier for the parcel of land that contains these co-ordinates?
  • …etc

It can be helpful if these repeated tasks are turned into specialised services (APIs) that make it easier to perform them on-demand. The alternative is that we all have to download and index the necessary datasets ourselves.
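As a sketch of what one of these services does for us, here’s a minimal “nearest station” lookup in Python. Everything here is illustrative: the second identifier is invented, the coordinates are approximate, and a real service would sit on top of a full register and use a proper spatial index rather than a linear scan.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# A tiny register of identified features
STATIONS = [
    ("940GZZLUWSM", "Westminster", 51.501, -0.125),
    ("940GZZLUXYZ", "Example station", 51.513, -0.089),  # invented
]

def nearest_station(lat, lon):
    """Answer 'where is the nearest tube station?' for a location."""
    return min(STATIONS, key=lambda s: haversine_km(lat, lon, s[2], s[3]))

print(nearest_station(51.5033, -0.1196))  # near the London Eye
```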

Network effects

Choosing which identifiers to use in a dataset is an important part of creating agreements around how we publish data. We call those agreements data standards.

The more datasets that use the same set of identifiers, the easier it becomes to combine those datasets together, in various combinations that will help to solve a range of problems. To put it another way, using common identifiers helps to generate network effects that make it easier for everyone to publish and use data.

I think it’s true to say that almost every problem that we might try and solve with better use of data requires the combination of several different datasets. Some of those datasets might come from the private sector. Some of them might come from the public sector. No single organisation always holds all of the data.

This makes it important to be able to share and reuse identifiers across different organisations. And that is why it is important that those identifiers are published under an open licence.

Open licensing

Open licences allow anyone to access, use and share data. Openly licensed identifiers can be used in both open datasets and those that are shared under more restrictive licences. They give data publishers the freedom to choose the correct licence for their dataset, so that it sits at the right point on the data spectrum.

Identifiers that are not published under an open licence remove that choice. Restricted licensing limits the ability of publishers to share their data in the way that makes sense for their business model or application. Restrictive licences cause friction that gets in the way of making data as open as possible.

Open identifiers create open ecosystems. They create opportunities for a variety of business models, products and services. For example intermediaries can create platforms that aggregate and distribute data that has been published by a variety of different organisations.

So, the best identifiers are those that are:

  • published under an open licence that allows anyone to access, use and share them
  • published alongside some basic metadata (a label, a location or other geospatial data, a type)
  • and, are accessible via services that allow them to be easily used

Who provides that infrastructure?

Whenever there is friction around the use of data, application developers are left with a difficult choice. They either have to invest time and effort in working around that friction, or compromise their plans in some way. The need to quickly bring products to market may lead to choices which are not ideal.

For example, developers may choose to build applications against Google’s mapping services. These services are easily and immediately available for any developer wanting to display a map or recommend a route to a user. But these platforms have restrictive licensing which means it is usually the platform provider that reaps the most benefits. In the absence of open licences, network effects can lead to data monopolies.

So who should provide these open identifiers, and the metadata and services that support them?

This is the role of national mapping agencies. These agencies will already have identifiers for important geospatial features. The Ordnance Survey has an identifier called a TOID which is assigned to every feature in Great Britain. But there are other identifiers in use too. Some are designed to support publication of specific types of data, e.g. UPRNs.

These identifiers are national assets. They should be managed as data infrastructure and not be tied up in commercial data products.

Publishing these identifiers under an open licence, in the ways that have been outlined here, will provide a framework to support the collection and curation of geospatial data by many different organisations, across the public and private sector. That infrastructure will allow value to be created from that geospatial data in a variety of new ways.

Provision of this type of infrastructure is also in-line with what we can see happening across other parts of government. For example the work of the GDS team to develop registers of important data. Identifiers, registers and standards are important building blocks of our local, national and global data infrastructure.

If you’re interested in reading more about the benefits of open identifiers, then you might be interested in this white paper that I wrote with colleagues from the Open Data Institute and Thomson Reuters: “Creating value from identifiers in an open data world”.

Data assets and data products

A lot of the work that we’ve done at the ODI over the last few years has involved helping organisations to recognise their data assets.

Many organisations will have their IT equipment and maybe even their desks and chairs asset tagged. They know who is using them, where they are, and have some kind of plan to make sure that they only invest in maintaining the assets they really need. But few will be treating data in the same way.

That’s a change that is only just beginning. Part of the shift is in understanding how those assets can be used to solve problems. Or help them, their partners and customers to make more informed decisions.

Often that means sharing or opening that data so that others can use it. Making sure that data is at the right point of the data spectrum helps to unlock its value.

A sticking point for many organisations is that they begin to question why they should share or open those data assets, and whether others should contribute to their maintenance. There are many common questions around the value of sharing, respecting privacy, logistics, etc.

I think a useful framing for this type of discussion might be to distinguish between data assets and data products.

A data asset is what an organisation is managing internally. It may be shared with a limited audience.

A data product is what you share with or open to a wider audience. It’s created from one or more data assets. A data product may not contain all of the same data as the data assets it’s based on. Personal data might need to be removed or anonymised, for example. This means a data product might sit at a different point in the data spectrum. It can be more open. I’m using data product here to refer to specific types of datasets, not “applications that have been made using data”.

An asset is something you manage and invest in. A product is intended to address some specific needs. It may need some support or documentation to make sure it’s useful. It may also need to evolve based on changing needs.

In some cases a data asset could also be a data product. The complete dataset might be published in its entirety. In my experience this is rarely the case though. There’s usually additional information, e.g. governance and version history, that might not be useful to reusers.

In other cases data assets are collaboratively maintained, often in the open. Wikidata and OpenStreetMap are global data assets that are maintained in this way. There are many organisations that are using those assets to create more tailored data products that help to meet specific needs. Over time I expect more data assets will be managed in collaborative ways.

Obviously not every open data release needs to be a fully supported “product”. To meet transparency goals we often just need to get data published as soon as possible, with a minimum of friction for both publishers and users.

But when we are using data as tool to create other types of impact, more work is sometimes needed. There are often a number of social, legal and technical issues to consider in making data accessible in a sustainable way.

Injecting some product thinking into how we share and open data might help address the types of problems that can contribute to data releases not having the desired impact: Why are we opening this data? Who will use it? How can we help them be more effective? Does releasing the data provide ways in which the data asset might be more collaboratively maintained?

When governments are publishing data that should be part of a national data infrastructure, more value will be unlocked if more of the underlying data assets are available for anyone to access, use and share. Releasing a “data product” that is too closely targeted might limit its utility. So I also think this “data asset” vs “data product” distinction can help us to challenge the types of data that are being released. Are we getting access to the most valuable data assets, or useful subsets of them? Or are we just being given a data product that has much more limited applications, regardless of how well it is being published?

We CAN get there from here

On Wednesday, as part of the Autumn Budget, the Chancellor announced that the government will be creating a Geospatial Commission “to establish how to open up freely the OS MasterMap data to UK-based small businesses”. It will be supported by new funding of £80 million over two years. The Commission will be looking at a range of things including:

  • improving the access to, links between, and quality of their data
  • looking at making more geospatial data available for free and without restriction
  • setting regulation and policy in relation to geospatial data created by the public sector
  • holding individual bodies to account for delivery against the geospatial strategy
  • providing strategic oversight and direction across Whitehall and public bodies who operate in this area

That’s a big pot of money to get something done and a remit that ticks all of the right boxes. As the ODI blog post notes, it creates “the opportunity for national mapping agencies to adapt to a future where they become stewards for national mapping data infrastructure, making sure that data is available to meet the needs of everyone in the country”.

So, I’m really surprised that many of the reactions from the open data community have been fairly negative. I understand the concerns that the end result might not be a completely open MasterMap. There are many, many ways in which this could end up with little or no change to the status quo. That’s certainly true if we ignore the opportunity to embed some change.

From my perspective, this is the biggest step towards a more open future for UK geospatial data since the first OS Open Data release in 2010. (I remember excitedly hitting the publish button to make their first Linked Data release publicly accessible)

Anyone who has been involved with open data in the UK will have encountered the Ordnance Survey licensing issues that are massively inhibiting both the release and use of open data in the UK. It’s a frustration of mine that these issues aren’t manifest in the various open data indexes.

In my opinion, anything that moves us forward from the current licensing position is to be welcomed. Yes, we all want a completely open MasterMap. That’s our shared goal. But how do we get there?

We’ve just seen the government task and resource itself to do something that can help us achieve that goal. It’s taken concerted effort by a number of people to get to this point. We should be focusing on what we all can do, right now, to help this process stay on track. Dismissing it as an already failed attempt isn’t helpful.

I think there’s a great deal that the community could do to engage with and support this process.

Here are a few ways that we could inject some useful thinking into the process:

  • Can we pull together examples of where existing licensing restrictions are causing friction for UK businesses? Those of us who have been involved with open data have internalised many of these issues already, but we need to make sure they’re clearly understood by a wider audience
  • Can we do the same for local government data and services? There are loads of these too. Particularly compelling examples will be those that highlight where more open licensing can help improve local service delivery
  • Where could greater clarity around existing licensing arrangements help UK businesses, public sector and civil society organisations achieve greater impact? It often seems like some projects and local areas are able to achieve releases where others can’t.
  • Even if all of MasterMap were open tomorrow, it might still be difficult to access. No-one likes the current shopping cart model for accessing OS open data. What services would we expect from the OS and others that would make this data useful? I suspect this would go beyond “let me download some shapefiles”. We built some of these ideas into the OS Linked Data site. It still baffles me that you can’t find much OS data on the OS website.
  • If all of MasterMap isn’t made open, then which elements of it would unlock the most value? Are there specific layers or data types that could reduce friction in important application areas?
  • Similarly, how could the existing OS open data be improved to make it more useful? Hint: currently all of the data is generalised and doesn’t have any stable identifiers at all.
  • What could the OS and others do to support the rest of us in annotating and improving their data assets? The OS switched off its TOID lookup service because no-one was using it. It wasn’t very good. So what would we expect that type of identifier service to do?
  • If there is more openly licensed data available, then how could it be usefully added to OpenStreetMap and used by the ecosystem of open geospatial tools that it is supporting?
  • We all want access to MasterMap because it’s a rich resource. What are the options available to ensure that the Ordnance Survey stays resourced to a level where we can retain it as a national asset? Are there reasonable compromises to be made between opening all the data and them offering some commercial services around it?
  • …etc, etc, etc.

Personally, I’m choosing to be optimistic. Let’s get to work to create the result we want to see.

The state of open licensing, 2017 edition

Let’s talk about open data licensing. Again.

Last year I wrote a post, the State of Open Licensing in which I gave a summary of the landscape as I saw it. A few recent developments mean that I think it’s worth posting an update.

But Leigh, I hear you cry, do people really care about licensing? Are you just fretting over needless details? We’re living in a post-open source world after all!

To which I would respond, if licensing doesn’t have real impacts, then why did the open source community recently go into meltdown about Facebook’s open source licences? And why have they recanted? There’s a difference between throwaway, unmaintained code and data, and resources that could and should be infrastructure.

The key points I make in my original post still stand: I think there is still a need to encourage convergence around licensing in order to reduce friction. But I’m concerned that we’re not moving in the right direction. Open Knowledge are doing some research around licensing and have also highlighted their concerns around current trends.

So what follows is a few observations from me looking at trends in a few different areas of open data practice.

Licensing of open government data

I don’t think much has changed with regards to open licenses for government data. The UK Open Government Licence (UK-OGL) still seems to be the starting point for creating bespoke national licences.

Looking through the open definition forum archives, the last government licence that was formally approved as open definition compliant was the Taiwan licence. Like the UK-OGL Version 3, the licence clearly indicates that it is compatible with the Creative Commons Attribution (CC-BY) 4.0 licence. The open data licence for Mexico makes a similar statement.

In short, you can take any data from the UK, Taiwan and Mexico and re-distribute it under a CC-BY 4.0 licence. Minimal friction.

I’d hoped that we could discourage governments from creating new licences. After all, if they’re compatible with CC-BY, then why go to the trouble?

But, chatting briefly about this with Ania Calderon this week, I’ve come to realise that the process of developing these licences is valuable, even if the end products end up being very similar. It encourages useful reflection on the relevant national laws and regulations, whilst also ensuring there is sufficient support and momentum behind adoption of the open data charter. They are as much a statement of shared intent as a legal document.

The important thing is that national licences should always state compatibility with an existing licence. Ideally CC-BY 4.0. This removes all doubt when combining data collected from different national sources. This will be increasingly important as we strengthen our global data infrastructure.

Licensing of data from commercial publishers

Looking at how data is being published by commercial organisations, things are very mixed.

Within the OpenActive project we now have more than 20 commercial organisations publishing open data under a CC-BY 4.0 licence. Thomson Reuters is using CC-BY 4.0 as the core licence for its PermID product. And Syngenta are publishing their open data under a CC-BY-SA 4.0 licence. This is excellent. 10/10 would reuse again.

But in contrast, the UK Open Banking initiative has adopted a custom licence which has a number of limitations, which I’ve written about extensively. Despite feedback they’ve chosen to ignore concerns raised by the community.

Elsewhere the default is for publishers and platforms to use custom terms and conditions that create complexity for reusers. Or for lists of “open data” to have no clear licensing.

Licensing in the open data commons

It’s a similar situation in the broader open data commons.

In the research community the CC0 licence has been recommended for some time and is the default on a number of research data archives. Promisingly, the FigShare State of Open Data 2017 report (PDF) shows a growing awareness of open data amongst researchers, and a reduction in uncertainty around licensing. But there’s still lots of work to do. Julie McMurry of the (Re)usable Data Project notes that less than half of the databases they’ve indexed have a clear, findable licence.

While the CC-BY and CC-BY-SA 4.0 licences are seen as the best-practice default, a number of databases still rely on the Open Database Licence (ODbL), OpenStreetMap being the obvious example.

The OSM Licence Working Group has recently concluded that, pending a more detailed analysis, the Creative Commons licences are incompatible with the ODbL. They now recommend asking for specific permission and the completion of a waiver form before importing CC licenced open data into OSM. This is, of course, exactly the situation that open licensing is intended to avoid.

Obtaining 1:1 agreements is the opposite of friction-less data sharing.

And it’s not clear whose job it is to sort it out. I’m concerned that there’s no clear custodian for the ODbL or investment in its maintenance. Resolving issues of compatibility with the CC licences is clearly becoming more urgent. I think it needs an organisation or a consortia of interested parties to take this forward. It will need some legal advice and investment to resolve any issues. Taking no action doesn’t seem like a viable option to me.

Based on what I’ve seen summarised from previous discussions, there seem to be some basic disagreements around the approaches taken to data licensing that have held things up. Creative Commons could take a lead on this, but so far they’ve not certified any third-party licences as compatible with their suite. All statements have been made the other way.

Despite the use by big projects like OSM, it’s really unclear to me what role the ODbL has longer term. Getting to a clear definition of compatibility would provide a potential way for existing users of the licence to transition at a future date.

Just to add to the fun, the Linux Foundation have thrown two new licences into the mix. There has been some discussion about this in the community, and some feedback in these two articles in the Register. The second has some legal analysis: “I wouldn’t want to sign it”.

Adding more licences isn’t helpful. What would have been helpful would have been exploring compatibility issues amongst existing licences and investing in resolving them. But as their FAQ highlights, the Foundation explicitly chose to just create new licences rather than evaluate the current landscape.

I hope that the Linux Foundation can work with Creative Commons to develop a statement of compatibility, otherwise we’re in an even worse situation.

Some steps to encourage convergence

So how do we move forward?

My suggestions are:

  • No new licences! If you’re a government, you get a pass to create a national licence so long as you include a statement of compatibility with a Creative Commons licence
  • If your organisation has issues with the Creative Commons licences, then document and share them with the community. Then engage with the Creative Commons to explore creating revisions. Spend what you would have given your lawyers on helping the Creative Commons improve their licences. It’s a good test of how much you really do want to work in the open
  • If you’re developing a platform, require people to choose a licence or set a default. Choosing a licence can include “All Rights Reserved”. Let’s get some clarity
  • We need to invest further in developing guidance around data licensing.
  • Let’s sort out compatibility between the CC and ODbL licence suites
  • Let’s encourage the Linux Foundation to do the same, and also ask them to submit their licences to the licence approval process. This should be an obvious step for them, as they’ve repeatedly highlighted the lessons to be learned from open source licensing, where licences go through a similar process.

I think these are all useful steps forward. What would you add to the list? What organisations can help drive this forward?

Note that I’m glossing over a set of more nuanced issues which are worthy of further, future discussion. For example whether licensing is always the right protection, or when “situated openness” may be the best approach towards building trust with communities. Or whether the two completely different licensing schemes for Wikidata and OSM will be a source of friction longer term or are simply necessary to ensure their sustainability.

For now though, I think I’ll stick with the following as my licensing recommendations: