How do data publishing choices shape data ecosystems?

This is the latest in a series of posts in which I explore some basic questions about data.

In our work at the ODI we have often been asked for advice about how best to publish data. When trying to give helpful advice, one thing I’m always mindful of is how decisions about how data is published shape the ways in which value can be created from it. More specifically, whether those choices will enable the creation of a rich data ecosystem of intermediaries and users.

So what are the types of decisions that might help to shape data ecosystems?

To give a simple example, if I publish a dataset so it’s available as a bulk download, then you could use that data in any kind of application. You could also use it to create a service that helps other people create value from the same data, e.g. by providing an API or an interface to generate reports from the data. Publishing in bulk allows intermediaries to help create a richer data ecosystem. But if I’d published that same data only via an API, there are limited ways in which intermediaries can add value. Instead people must come directly to my API or services to use the data.
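
As a rough sketch of what that intermediary role can look like, the snippet below builds a tiny lookup API on top of a bulk CSV download. The file name, columns and route are all hypothetical; the point is simply that having the bulk file makes this kind of service possible.

    # A minimal intermediary service built on a bulk download (a sketch).
    # Assumes a hypothetical stations.csv with columns: id, name, latitude, longitude
    import csv

    from flask import Flask, abort, jsonify

    app = Flask(__name__)

    # Load the bulk download once at startup and index it by identifier
    with open("stations.csv", newline="") as f:
        STATIONS = {row["id"]: row for row in csv.DictReader(f)}

    @app.route("/stations/<station_id>")
    def station(station_id):
        row = STATIONS.get(station_id)
        if row is None:
            abort(404)
        return jsonify(row)

    if __name__ == "__main__":
        app.run()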

This is one of the reasons why people prefer open data to be available in bulk. It allows for more choice and flexibility in how it is used. But, as I noted in a recent post, depending on the “dataset archetype” your publishing options might be limited.

The decision to publish a dataset only as an API, even when it could be published in other ways, is often deliberate. The publisher may want to capture more of the value around the dataset, e.g. by charging for use of the API. Or they may feel it is important to have more direct control over who uses it, and how. These are reasonable choices and, when the data is sensitive, sensible options.

But there are a variety of ways in which the choices made about how to publish data can shape or constrain the ecosystem around a specific dataset. It’s not just about bulk downloads versus APIs.

The choices include:

  • the licence that is applied to the data, which might limit it to non-commercial use, restrict redistribution, or impose limits on the use of derived data
  • the terms and conditions for the API or other service that provides access to the data. These terms are often conflated with data licences, but typically focus on aspects of service provision, for example rate limiting, restrictions on storage of API results, permitted uses of the API, permitted types of users, etc
  • the technology used to provide access to data. In addition to bulk downloads vs APIs, there are also details such as the use of specific standards, the types of API call that are possible, etc
  • the governance around the API or service that provides access to data, which might limit which users can access the service or create friction that discourages use
  • the business model that is wrapped around the API or service, which might include a freemium model, chargeable usage tiers, service level agreements, usage limits, etc

I think these cover the main areas. Let me know if you think I’ve missed something.

You’ll notice that APIs and services provide more choices for how a publisher might control usage. This can be a good or a bad thing.

The range of choices also means it’s very easy to create a situation where an API or service doesn’t work well for some use cases. This is why user research and engagement is such an important part of releasing a data product and designing policy interventions that aim to increase access to data.

For example, let’s imagine someone has published an openly licensed dataset via an API that restricts users to a maximum number of API calls per month.

These choices limit some uses of the API, e.g. applications that need to make lots of queries. This also means that downstream users creating web applications are unable to provide a good quality of service to their own users. A popular application might just stop working at some point during the month because it has hit the usage threshold.
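
To make that failure mode concrete, here is a sketch of the sort of defensive code a downstream developer ends up writing. The endpoint is hypothetical, and the assumption that the service signals an exhausted quota with an HTTP 429 response is mine, not a description of any particular API.

    # Sketch of a client coping with a monthly usage quota (hypothetical API).
    import json

    import requests

    API_URL = "https://api.example.org/v1/records"   # hypothetical endpoint
    CACHE_FILE = "records-cache.json"

    def fetch_records():
        """Return (records, stale) where stale=True means we served cached data."""
        response = requests.get(API_URL, timeout=10)
        if response.status_code == 429:
            # Assumed behaviour: quota exhausted, so serve stale data rather than failing
            with open(CACHE_FILE) as f:
                return json.load(f), True
        response.raise_for_status()
        records = response.json()
        with open(CACHE_FILE, "w") as f:
            json.dump(records, f)
        return records, False

    records, stale = fetch_records()
    print(f"Loaded {len(records)} records (stale={stale})")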

The dataset might be technically open, but in practice its use has been constrained by other choices.

Those choices might have been made for good reasons: for example, as a way for the data publisher to predict how much they need to invest each month in providing a free service that is accessible to lots of users making a smaller number of requests. There is inevitably a trade-off between the needs of individual users and the publisher.

Adding on a commercial usage tier for high volume users might provide a way for the publisher to recoup costs. It also allows some users to choose what to pay for their use of the API, e.g. to more smoothly handle unexpected peaks in their website traffic. But it may sometimes be simpler to provide the data in bulk to support those use cases. Different use cases might be better served by different publishing options.

Another example might be a system that provides access to both shared and open data via a set of APIs that conform to open standards. If the publisher makes it too difficult for users to actually sign up to use those APIs, e.g. because of onerous registration or certification requirements, then only those organisations that can afford to invest the time and money to gain access might bother using them. The end result might be a closed ecosystem that is built on open foundations.

I think it’s important to understand how this range of choices can impact data ecosystems. They’re important not just for how we design products and services, but also for designing successful policies and regulatory interventions. If we don’t consider the full range of choices, then we may not achieve the intended outcomes.

More generally, I think it’s important to think about the ecosystems of data use. Often I don’t think enough attention is paid to the variety of ways in which value is created. This can lead to poor choices, like choosing to try to sell data for short-term gain rather than considering the variety of ways in which value might be created in a more open ecosystem.

The words we use for data

I’ve been on leave this week so, amongst the gardening and relaxing I’ve had a bit of head space to think.  One of the things I’ve been thinking about is the words we choose to use when talking about data. It was Dan‘s recent blog post that originally triggered it. But I was reminded of it this week after seeing more people talking past each other and reading about how the Guardian has changed the language it uses when talking about the environment: Climate crisis not climate change.

As Dan pointed out we often need a broader vocabulary when talking about data.  Talking about “data” in general can be helpful when we want to focus on commonalities. But for experts we need more distinctions. And for non-experts we arguably need something more tangible. “Data”, “algorithm” and “glitch” are default words we use but there are often better ones.

It can be difficult to choose good words for data because everything can be treated as data these days. Whether it’s numbers, text, images or video everything can be computed on, reported and analysed. Which makes the idea of data even more nebulous for many people.

In Metaphors We Live By, George Lakoff and Mark Johnson discuss how the range of metaphors we use in language, whether consciously or unconsciously, impacts how we think about the world. They highlight that careful choice of metaphors can help to highlight or obscure important aspects of the things we are discussing.

The example that stuck with me was how we describe debates. We often do so in terms of things to be won, or battles to be fought (“the war of words”). What if we thought of debates as dances instead? Would that help us focus on compromise and collaboration?

This is why I think that data as infrastructure is such a strong metaphor. It helps to highlight some of the most important characteristics of data: that it is collected and used by communities, needs to be supported by guidance, policies and technologies and, most importantly, needs to be invested in and maintained to support a broad variety of uses. We’ve all used roads and engaged with the systems that let us make use of them. Focusing on data as information, as zeros and ones, brings nothing to the wider debate.

If our choice of metaphors and words can help to highlight or hide important aspects of a discussion, then what words can we use to help focus some of our discussions around data?

It turns out there’s quite a few.

For example there are “samples” and “sampling”. These are words used in statistics but their broader usage has the same meaning. When we talk about sampling something, whether it’s food or drink, music or perfume, it’s clear that we’re not taking the whole thing. Talking about sampling might help us to be clearer that, often, when we’re collecting data we don’t have the whole picture. We just have a tester, a taste. Hopefully one which is representative of the whole. We can make choices about when, where and how often we take samples. We might only be allowed to take a few.

“Polls” and “polling” are similar words. We sample people’s opinions in a poll. While we often use these words in more specific ways, they helpfully come with some understanding that this type of data collection and analysis is imperfect. We’re all very familiar at this point with the limitations of polls.

Or how about “observations” and “observing“?  Unlike “sensing” which is a passive word, “observing” is more active and purposeful. It implies that someone or something is watching. When we want to highlight that data is being collected about people or the environment “taking observations” might help us think about who is doing the observing, and why. Instead of “citizen sensing” which is a passive way of describing participatory data collection, “citizen observers” might place a bit more focus on the work and effort that is being contributed.

“Catalogues” and “cataloguing” are words that, for me at least, imply maintenance and value-added effort. I think of librarians cataloguing books and artefacts. “Stewards” and “curators” are other important roles.

AI and Machine Learning are often used to make predictions: of products we might want to buy, of whether we’re going to commit a crime, or of how likely it is that we might have a car accident based on where we live. These predictions are imperfect, yet we talk about algorithms as “knowing”, “spotting”, “telling” or “helping”. They don’t really do any of those things.

What they are doing is making a “forecast“. We’re all familiar with weather forecasts and their limits. So why not use the same words for the same activity? It might help to highlight the uncertainty around the uses of the data and technology, and reinforce the need to use these forecasts as context.

In other contexts we talk about using data to build models of the world. Or to build “digital twins“. Perhaps we should just talk more about “simulations“? There are enough people playing games these days that I suspect there’s a broader understanding of what a simulation is: a cartoon sketch of some aspect of the real world that might be helpful but which has its limits.

Other words we might use are “ratings” and “reviews” to help describe data and systems that create rankings and automated assessments. Many of us have encountered ratings and reviews and understand that they are often highly subjective and need interpretation.

Or how about simply “measuring” as a tangible example of collecting data? We’ve all used a ruler or measuring tape and know that sometimes we need to be careful about taking measurements: “Measure twice, cut once”.

I’m sure there are lots of others. I’m also well aware that not all of these terms will be familiar to everyone. And not everyone will associate them with things in the same way as I do. The real proof will be testing words with different audiences to see how they respond.

I think I’m going to try to deliberately use a broad range of language in my talks and writing and see how it fares.

What terms do you find most useful when talking about data?

How can we describe different types of dataset? Ten dataset archetypes

As a community, when we are discussing recommendations and best practices for how data should be published and governed, there is a natural tendency for people to focus on the types of data they are most familiar with.

This leads to suggestions that every dataset should have an API, for example. Or that every dataset should be available in bulk. While good general guidance, those approaches aren’t practical in every case. That’s because we also need to take into account a variety of other issues, including:

  • the characteristics of the dataset
  • the capabilities of the publishing organisation and the funding they have available
  • the purpose behind publishing the data
  • and the ethical, legal and social contexts in which it will be used

I’m not going to cover all of that in this blog post.

But it occurred to me that it might be useful to describe a set of dataset archetypes, that would function a bit like user personas. They might help us better answer some of the basic questions people have around data, discuss recommendations around best practices, inform workshop exercises or just test our assumptions.

To test this idea I’ve briefly described ten archetypes. For each one I’ve tried to describe some of its features, identify some specific examples, and briefly outline some of the challenges that might apply in providing sustainable access to it.

Like any characterisation, detail is lost. This is not an exhaustive list. I haven’t attempted to list every possible variation based on size, format, timeliness, category, etc. But I’ve tried to capture a range that hopefully illustrates some different characteristics. The archetypes reflect my own experiences; you will have different thoughts and ideas. I’d love to read them.

The Study

The Study is a dataset that was collected to support a research project. The research group collected a variety of new data as part of conducting their study. The dataset is small, focused on a specific use case, and there are no plans to maintain or update it further as the research group does not have any ongoing funding to collect or maintain the dataset. The data is provided as is for others to reuse, e.g. to confirm the original analysis or to use it in other studies. To help others, and as part of writing some academic papers that reference the dataset, the research group has documented their methodology for collecting the data. The dataset is likely published in an academic data portal or alongside the academic papers that reference it.

Examples: water quality samples, field sightings of animals, laboratory experiment results, bibliographic data from a literature review, photos showing evidence of plant diseases, consumer research survey results

The Sensor Feed

The Sensor Feed is a stream of sensor readings produced by a collection of sensors that have been installed across a city. New readings are added to the stream at regular intervals. The feed is provided to allow a variety of applications to tap into the raw sensor readings. The data points are as reported directly by the individual sensors and are not quality controlled. The individual sensors may have been updated, re-calibrated or replaced over time. The readings are part of the operational infrastructure of the city so can be expected to be available over at least the medium term. This means the dataset is effectively unbounded: new observations will continue to be reported until the infrastructure is decommissioned.

Examples: air quality readings, car park occupancy, footfall measurements, rain gauges, traffic light queuing counts, real-time bus locations

The Statistical Index

The Statistical Index is intended to provide insights into the performance of specific social or economic policies by measuring some aspect of a local community or economy. For example a sales or well-being index. The index draws on a variety of primary datasets, e.g. on commercial activities, which are then processed according to a documented methodology to generate the index. The Index is stewarded by an organisation and is expected to be available over the long term. The dataset is relatively small and is reported against specific geographic areas (e.g. from The Register) to support comparisons. The Index is updated on a regular basis, e.g. monthly or annually. Use of the data typically involves comparing across time and location at different levels of aggregation.

Examples: street safety survey, consumer price indices, happiness index, various national statistical indexes

The Register

The Register is a set of reference data that is useful for adding context to other datasets. It consists of a list of specific things, e.g. locations, cars or services, with a unique identifier and some basic descriptive metadata for each entry on the list. The Register is relatively small, but may grow over time. It is stewarded by an organisation tasked with making the data available for others. The steward, or custodian, provides some guarantees around the quality of the data. It is commonly used as a means to link, validate and enrich other datasets and is rarely used in isolation, other than in reporting on changes to the size and composition of the register.

Examples: licensed pubs, registered doctors, lists of MOT stations, registered companies, a taxonomy of business types, a statistical geography, addresses

The Database

The Database is a copy or extract of the data that underpins a specific application or service. The database contains information about a variety of different types of things, e.g. musicians and their albums and songs. It is a relatively large dataset that can be used to perform a variety of different types of query and to support a variety of uses. As it is used in a live service it is regularly updated, undergoes a variety of quality checks, and is growing over time in both volume and scope. Some aspects of The Database may reference one or more Registers, or could be considered as Registers in themselves.

Examples: geographic datasets that include a variety of different types of features (e.g. OpenStreetMap, MasterMap), databases of music (e.g. MusicBrainz) and books (e.g. OpenLibrary), company product and customer databases, Wikidata

The Description

The Description is a collection of a few data points relating to a single entity. Embedded into a single web page, it provides some basic information about an event, or place, or company. Individually it may be useful in context, e.g. to support a social interaction or application share. The owner of the website provides some data about the things that are discussed or featured on the website, but does not have access to a full dataset. The individual item descriptions are provided by website contributors using a CMS to add content to the website. If available in aggregate, the individual descriptions might make a useful Database or Register.

Examples: descriptions of jobs, events, stores, video content, articles

The Personal Records

The Personal Records are a history of the interactions of a single person with a product or service. The data provides insight into the individual person’s activities. The data is a slice of a larger dataset that contains data for a larger number of people. As the information contains personal data it has to be held securely, and the individual has various rights over the collection and use of the data as granted by GDPR (or similar local regulation). The dataset is relatively small, is focused on a specific set of interactions, but is growing over time. Analysing the data might provide useful insight to the individual that may help them change their behaviour, improve their health, etc.

Examples: bank transactions, home energy usage, fitness or sleep tracker, order history with an online service, location tracker, health records

The Social Graph

The Social Graph is a dataset that describes the relationships between a group of individuals. It is typically built up by a small number of contributions made by individuals who provide information about their relationships and connections to others. They may also provide information about those other people, e.g. names, contact numbers, service ratings, etc. When published or exported it is typically focused on a single individual, but might be available in aggregate. It is different to Personal Records as it’s specifically about multiple people, rather than a history of information about an individual (although Personal Records may reference or include data about others). The graph as a whole is maintained by an organisation that is operating a social network (or a service that has social features).

Examples: social networks data, collaboration graphs, reviews and trip histories from ride sharing services, etc

The Observatory

The Observatory is a very large dataset produced by a coordinated, large-scale data collection exercise, for example by a range of earth observation satellites. The data collection is intentionally designed to support a variety of downstream uses, which informs the scale and type of data collected. The scale and type of data can make it difficult to use because of the need for specific tools or expertise. But there are a wide range of ways in which the raw data can be processed to create other types of data products, to drive a variety of analyses, or to power a variety of services. It is refreshed and re-released as required by the needs and financial constraints of the organisations collaborating on collecting and using the dataset.

Examples: earth observation data, LIDAR point clouds, data from astronomical surveys or Large Hadron Collider experiments

The Forecast

The Forecast is used to predict the outcome of specific real-world events, e.g. a weather or climate forecast. It draws on a variety of primary datasets which are then processed and analysed to produce the output dataset. The process by which the predictions are made is well documented, to provide insight into the quality of the output. As the predictions are time-based, the dataset has a relatively short “shelf-life”, which means that users need to quickly access the most recent data for a specific location or area of interest. Depending on the scale and granularity, Forecast datasets can be very large, making them difficult to distribute in a timely manner.

Example: weather forecasts

Let me know what you think of these. Do they provide any useful perspective? How would you use or improve them?

Talk: Tabular data on the web

This is a rough transcript of a talk I recently gave at a workshop on Linked Open Statistical Data. You can view the slides from the talk here. I’m sharing my notes for the talk here, with a bit of light editing.

At the Open Data Institute our mission is to work with companies and governments to build an open trustworthy data ecosystem. An ecosystem in which we can maximise the value from use of data whilst minimising its potential for harmful impacts.

An important part of building that ecosystem will be ensuring that everyone — including governments, companies, communities and individuals — can find and use the data that might help them to make better decisions and to understand the world around them.

We’re living in a period where there’s a lot of disinformation around. So the ability to find high quality data from reputable sources is increasingly important. Not just for us as individuals, but also for journalists and other information intermediaries, like fact-checking organisations.

Combating misinformation, regardless of its source, is an increasingly important activity. To do that at scale, data needs to be more than just easy to find. It also needs to be easily integrated into data flows and analysis. And the context that describes its limitations and potential uses needs to be readily available.

The statistics community has long had standards and codes of practice that help to ensure that data is published in ways that help to deliver on these needs.

Technology is also changing. The ways in which we find and consume information are evolving. Simple questions are now being directly answered from search results, or through agents like Alexa and Siri.

New technologies and interfaces mean new challenges in integrating and using data. This means that we need to continually review how we are publishing data. So that our standards and practices continue to evolve to meet data user needs.

So how do we integrate data with the web? To ensure that statistics are well described and easy to find?

We’ve actually got a good understanding of basic data user needs. Good quality metadata and documentation. Clear licensing. Consistent schemas. Use of open formats, etc, etc. These are consistent requirements across a broad range of data users.

What standards can help us meet those needs? We have DCAT and Data Packages. Schema.org Dataset metadata, and its use in Google dataset search, now provides a useful feedback loop that will encourage more investment in creating and maintaining metadata. You should all adopt it.
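
To show how little is involved, here is a sketch of a minimal schema.org Dataset description, generated as JSON-LD from a small Python script. All of the values are placeholders for a hypothetical dataset; the vocabulary (Dataset, DataDownload, distribution) is the part that matters.

    # Generate a minimal schema.org/Dataset description as JSON-LD (a sketch).
    # All values are placeholders for a hypothetical dataset.
    import json

    dataset = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": "Example air quality readings",
        "description": "Hourly air quality readings from sensors across a city.",
        "license": "https://creativecommons.org/licenses/by/4.0/",
        "keywords": ["air quality", "sensors"],
        "distribution": [{
            "@type": "DataDownload",
            "encodingFormat": "text/csv",
            "contentUrl": "https://example.org/air-quality.csv",
        }],
    }

    # This JSON would be embedded in a <script type="application/ld+json"> tag
    # on the dataset's landing page so that crawlers can pick it up.
    print(json.dumps(dataset, indent=2))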

And we also have CSV on the Web. It does a variety of things which aren’t covered by some of those other standards. It’s a collection of W3C Recommendations that describe how to publish tabular data on the web, how to describe and annotate it with metadata, and how to convert it into formats like JSON and RDF.

The primer provides an excellent walk through of all of the capabilities and I’d encourage you to explore it.

One of the nice examples in the primer shows how you can annotate individual cells or groups of cells. As you all know this capability is essential for statistical data. Because statistical data is rarely just tabular: it’s usually decorated with lots of contextual information that is difficult to express in most data formats. Users of data need this context to properly interpret and display statistical information.
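
For a flavour of what the standard looks like in practice, here is a sketch that writes a minimal CSV on the Web metadata document for a hypothetical observations.csv. The column names are invented; the @context, url and tableSchema structure come from the Recommendation.

    # Write a minimal CSV on the Web metadata document for a hypothetical
    # observations.csv with three columns (a sketch).
    import json

    metadata = {
        "@context": "http://www.w3.org/ns/csvw",
        "url": "observations.csv",
        "tableSchema": {
            "columns": [
                {"name": "area", "titles": "area", "datatype": "string"},
                {"name": "period", "titles": "period", "datatype": "gYear"},
                {"name": "value", "titles": "value", "datatype": "decimal"},
            ],
            # Each row describes an observation; {_row} is filled in per row
            "aboutUrl": "observations.csv#row={_row}",
        },
    }

    # By convention the metadata sits alongside the CSV as <file>-metadata.json
    with open("observations.csv-metadata.json", "w") as f:
        json.dump(metadata, f, indent=2)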

Unfortunately, CSV on the Web is still not that widely adopted. Even though it’s relatively simple to implement.

(Aside: several audience members noted they are using it internally in their data workflows. I believe the Office for National Statistics are also moving to adopt it.)

This might be because of a lack of understanding of some of the benefits it provides. Or that those benefits are limited in scope.

There also aren’t a great many tools that support CSV on the web currently.

It might also be that actually there’s some other missing pieces of data infrastructure that are blocking us from making best use of CSV on the Web and other similar standards and formats. Perhaps we need to invest further in creating open identifiers to help us describe statistical observations. E.g. so that we can clearly describe what type of statistics are being reported in a dataset?

But adoption could be driven from multiple angles. For example:

  • open data tools, portals and data publishers could start to generate best practice CSVs. That would be easy to implement
  • open data portals could also readily adopt CSV on the Web metadata, most already support DCAT
  • standards developers could adopt CSV on the Web as their primary means of defining schemas for tabular formats

Not everyone needs to implement or use the full set of capabilities. But with some small changes to tools and processes, we could collectively improve how tabular data is integrated into the web.

Thanks for listening.

When are open (geospatial) identifiers useful?

In a meeting today, I was discussing how and when open geospatial identifiers are useful. I thought this might make a good topic for a blog post in my continuing series of questions about data. So here goes.

An identifier provides an unambiguous reference for something about which we want to collect and publish data. That thing might be a road, a school, a parcel of land or a bus stop.

If we publish a dataset that contains some data about “Westminster” then, without some additional documentation, a user of that dataset won’t know whether the data is about a tube station, the Parliamentary Constituency, a company based in Hayes or a school.

If we have identifiers for all of those different things, then we can use the identifiers in our data. This lets us be confident that we are talking about the same things. Publishing data about “940GZZLUWSM” makes it pretty clear that we’re referring to a specific tube station.

If data publishers use the same sets of identifiers, then we can start to easily combine your dataset on the wheelchair accessibility of tube stations, with my dataset of tube station locations and Transport for London’s transit data. So we can build an application that will help people in wheelchairs make better decisions about how to move around London.
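
Here is a sketch of what that combination looks like in code, assuming two hypothetical files (accessibility.csv and locations.csv) that both key their rows on the same station identifiers:

    # Combine two hypothetical datasets that share the same station identifiers.
    import csv

    def load_by_id(path):
        with open(path, newline="") as f:
            return {row["station_id"]: row for row in csv.DictReader(f)}

    accessibility = load_by_id("accessibility.csv")  # e.g. wheelchair_access per station
    locations = load_by_id("locations.csv")          # e.g. latitude/longitude per station

    # Because both datasets use the same identifiers the join is trivial
    accessible_stations = [
        {**locations[station_id], **row}
        for station_id, row in accessibility.items()
        if station_id in locations and row["wheelchair_access"] == "yes"
    ]
    print(f"{len(accessible_stations)} step-free stations with known locations")

Without shared identifiers, that join would instead need fuzzy matching on names, with all the errors and manual checking that brings.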

Helpful services

To help us publish datasets that use the same identifiers, there are a few things that we repeatedly need to do.

For example, it’s common to have to look up an identifier based on the name of the thing we’re describing. E.g. what’s the code for Westminster tube station? We often need to find information about an identifier we’ve found in a dataset. E.g. what’s the name of the tube station identified by 940GZZLUWSM? And where is it?

When we’re working with geospatial data we often need to find identifiers based on a physical location. For example, based on a latitude and longitude:

  • Where is the nearest tube station?
  • Or, what polling district am I in, so I can find out where I should go to vote?
  • Or, what is the identifier for the parcel of land that contains these co-ordinates?
  • …etc

It can be helpful if these repeated tasks are turned into specialised services (APIs) that make it easier to perform them on-demand. The alternative is that we all have to download and index the necessary datasets ourselves.
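
Here is a sketch of those three lookups over a tiny in-memory register. The second station code and all of the coordinates are illustrative values, not authoritative data.

    # Three common identifier lookups: by name, by identifier, and by location.
    from math import asin, cos, radians, sin, sqrt

    STATIONS = {
        "940GZZLUWSM": {"name": "Westminster", "lat": 51.501, "lon": -0.125},
        "940GZZLUEXA": {"name": "Example Road", "lat": 51.510, "lon": -0.130},  # made up
    }

    def find_by_name(name):
        return next((sid for sid, s in STATIONS.items() if s["name"] == name), None)

    def describe(station_id):
        return STATIONS.get(station_id)

    def nearest(lat, lon):
        def distance_km(s):
            # Haversine distance between two points on the Earth's surface
            dlat = radians(s["lat"] - lat)
            dlon = radians(s["lon"] - lon)
            a = sin(dlat / 2) ** 2 + cos(radians(lat)) * cos(radians(s["lat"])) * sin(dlon / 2) ** 2
            return 2 * 6371 * asin(sqrt(a))
        return min(STATIONS, key=lambda sid: distance_km(STATIONS[sid]))

    print(find_by_name("Westminster"))  # -> 940GZZLUWSM
    print(describe("940GZZLUWSM"))      # -> name and coordinates
    print(nearest(51.5007, -0.1246))    # -> identifier of the closest station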

Network effects

Choosing which identifiers to use in a dataset is an important part of creating agreements around how we publish data. We call those agreements data standards.

The more datasets that use the same set of identifiers, the easier it becomes to combine those datasets together, in various combinations that will help to solve a range of problems. To put it another way, using common identifiers helps to generate network effects that make it easier for everyone to publish and use data.

I think it’s true to say that almost every problem that we might try and solve with better use of data requires the combination of several different datasets. Some of those datasets might come from the private sector. Some of them might come from the public sector. No single organisation always holds all of the data.

This makes it important to be able to share and reuse identifiers across different organisations. And that is why it is important that those identifiers are published under an open licence.

Open licensing

Open licences allow anyone to access, use and share data. Openly licensed identifiers can be used in both open datasets and those that are shared under more restrictive licences. They give data publishers the freedom to choose the correct licence for their dataset, so that it sits at the right point on the data spectrum.

Identifiers that are not published under an open licence remove that choice. Restricted licensing limits the ability of publishers to share their data in the way that makes sense for their business model or application. Restrictive licences cause friction that gets in the way of making data as open as possible.

Open identifiers create open ecosystems. They create opportunities for a variety of business models, products and services. For example intermediaries can create platforms that aggregate and distribute data that has been published by a variety of different organisations.

So, the best identifiers are those that are

  • published under an open licence that allows anyone to access, use and share them
  • published alongside some basic metadata (a label, a location or other geospatial data, a type)
  • and, are accessible via services that allow them to be easily used

Who provides that infrastructure?

Whenever there is friction around the use of data, application developers are left with a difficult choice. They either have to invest time and effort in working around that friction, or compromise their plans in some way. The need to quickly bring products to market may lead to choices which are not ideal.

For example, developers may choose to build applications against Google’s mapping services. These services are easily and immediately available for any developer wanting to display a map or recommend a route to a user. But these platforms usually have restricted licensing that means it is the platform provider that reaps the most benefits. In the absence of open licences, network effects can lead to data monopolies.

So who should provide these open identifiers, and the metadata and services that support them?

This is the role of national mapping agencies. These agencies will already have identifiers for important geospatial features. The Ordnance Survey has an identifier called a TOID which is assigned to every feature in Great Britain. But there are other identifiers in use too. Some are designed to support publication of specific types of data, e.g. UPRNs.

These identifiers are national assets. They should be managed as data infrastructure and not be tied up in commercial data products.

Publishing these identifiers under an open licence, in the ways that have been outlined here, will provide a framework to support the collection and curation of geospatial data by many  different organisations, across the public and private sector. That infrastructure will allow value to be created from that geospatial data in a variety of new ways.

Provision of this type of infrastructure is also in-line with what we can see happening across other parts of government. For example the work of the GDS team to develop registers of important data. Identifiers, registers and standards are important building blocks of our local, national and global data infrastructure.

If you’re interested in reading more about the benefits of open identifiers, then you might be interested in this white paper that I wrote with colleagues from the Open Data Institute and Thomson Reuters: “Creating value from identifiers in an open data world”.

Data assets and data products

A lot of the work that we’ve done at the ODI over the last few years has involved helping organisations to recognise their data assets.

Many organisations will have their IT equipment and maybe even their desks and chairs asset tagged. They know who is using them, where they are, and have some kind of plan to make sure that they only invest in maintaining the assets they really need. But few will be treating data in the same way.

That’s a change that is only just beginning. Part of the shift is in understanding how those assets can be used to solve problems. Or help them, their partners and customers to make more informed decisions.

Often that means sharing or opening that data so that others can use it. Making sure that data is at the right point of the data spectrum helps to unlock its value.

A sticking point for many organisations is that they begin to question why they should share or open those data assets, and whether others should contribute to their maintenance. There are many common questions around the value of sharing, respecting privacy, logistics, etc.

I think a useful framing for this type of discussion might be to distinguish between data assets and data products.

A data asset is what an organisation is managing internally. It may be shared with a limited audience.

A data product is what you share with or open to a wider audience. It’s created from one or more data assets. A data product may not contain all of the same data as the data assets it’s based on. Personal data might need to be removed or anonymised, for example. This means a data product might sit at a different point in the data spectrum. It can be more open. (I’m using “data product” here to refer to specific types of datasets, not “applications that have been made using data”.)
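
As a simplified sketch of that derivation, assuming a hypothetical orders.csv data asset that contains some personal data:

    # Derive a shareable data product from an internal data asset by dropping
    # the personal data columns. File and column names are hypothetical.
    import csv

    PERSONAL_COLUMNS = {"customer_name", "email", "delivery_address"}

    with open("orders.csv", newline="") as source, \
         open("orders-product.csv", "w", newline="") as product:
        reader = csv.DictReader(source)
        fields = [c for c in reader.fieldnames if c not in PERSONAL_COLUMNS]
        writer = csv.DictWriter(product, fieldnames=fields)
        writer.writeheader()
        for row in reader:
            writer.writerow({k: row[k] for k in fields})

In practice, removing direct identifiers is rarely enough to properly anonymise a dataset, but the shape of the asset-to-product step is the same.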

An asset is something you manage and invest in. A product is intended to address some specific needs. It may need some support or documentation to make sure it’s useful. It may also need to evolve based on changing needs.

In some cases a data asset could also be a data product. The complete dataset might be published in its entirety. In my experience this is rarely the case though. There’s usually additional information, e.g. governance and version history, that might not be useful to reusers.

In other cases data assets are collaboratively maintained, often in the open. Wikidata and OpenStreetMap are global data assets that are maintained in this way. There are many organisations that are using those assets to create more tailored data products that help to meet specific needs. Over time I expect more data assets will be managed in collaborative ways.

Obviously not every open data release needs to be a fully supported “product”. To meet transparency goals we often just need to get data published as soon as possible, with a minimum of friction for both publishers and users.

But when we are using data as a tool to create other types of impact, more work is sometimes needed. There are often a number of social, legal and technical issues to consider in making data accessible in a sustainable way.

Injecting some product thinking into how we share and open data might help to address the types of problems that can lead to data releases not having the desired impact: Why are we opening this data? Who will use it? How can we help them be more effective? Does releasing the data provide ways in which the data asset might be more collaboratively maintained?

When governments are publishing data that should be part of a national data infrastructure, more value will be unlocked if more of the underlying data assets are available for anyone to access, use and share. Releasing a “data product” that is too closely targeted might limit its utility. So I also think this “data asset” vs “data product” distinction can help us to challenge the types of data that are being released. Are we getting access to the most valuable data assets, or useful subsets of them? Or are we just being given a data product that has much more limited applications, regardless of how well it is published?

We CAN get there from here

On Wednesday, as part of the Autumn Budget, the Chancellor announced that the government will be creating a Geospatial Commission “to establish how to open up freely the OS MasterMap data to UK-based small businesses”. It will be supported by new funding of £80 million over two years. The Commission will be looking at a range of things including:

  • improving the access to, links between, and quality of their data
  • looking at making more geospatial data available for free and without restriction
  • setting regulation and policy in relation to geospatial data created by the public sector
  • holding individual bodies to account for delivery against the geospatial strategy
  • providing strategic oversight and direction across Whitehall and public bodies who operate in this area

That’s a big pot of money to get something done and a remit that ticks all of the right boxes. As the ODI blog post notes, it creates “the opportunity for national mapping agencies to adapt to a future where they become stewards for national mapping data infrastructure, making sure that data is available to meet the needs of everyone in the country”.

So, I’m really surprised that many of the reactions from the open data community have been fairly negative. I understand the concerns that the end result might not be a completely open MasterMap. There are many, many ways in which this could end up with little or no change to the status quo. That’s certainly true if we ignore the opportunity to embed some change.

From my perspective, this is the biggest step towards a more open future for UK geospatial data since the first OS Open Data release in 2010. (I remember excitedly hitting the publish button to make their first Linked Data release publicly accessible)

Anyone who has been involved with open data in the UK will have encountered the Ordnance Survey licensing issues that are massively inhibiting both the release and use of open data in the UK. It’s a frustration of mine that these issues aren’t manifest in the various open data indexes.

In my opinion, anything that moves us forward from the current licensing position is to be welcomed. Yes, we all want a completely open MasterMap. That’s our shared goal. But how do we get there?

We’ve just seen the government task and resource itself to do something that can help us achieve that goal. It’s taken concerted effort by a number of people to get to this point. We should be focusing on what we all can do, right now, to help this process stay on track. Dismissing it as an already failed attempt isn’t helpful.

I think there’s a great deal that the community could do to engage with and support this process.

Here are a few ideas for ways that we could inject some useful thinking into the process:

  • Can we pull together examples of where existing licensing restrictions are causing friction for UK businesses? Those of us who have been involved with open data have internalised many of these issues already, but we need to make sure they’re clearly understood by a wider audience
  • Can we do the same for local government data and services? There are loads of these too. Particularly compelling examples will be those that highlight where more open licensing can help improve local service delivery
  • Where could greater clarity around existing licensing arrangements help UK businesses, public sector and civil society organisations achieve greater impact? It often seems like some projects and local areas are able to achieve releases where others can’t.
  • Even if all of MasterMap were open tomorrow, it might still be difficult to access. No-one likes the current shopping cart model for accessing OS open data. What services would we expect from the OS and others that would make this data useful? I suspect this would go beyond “let me download some shapefiles”. We built some of these ideas into the OS Linked Data site. It still baffles me that you can’t find much OS data on the OS website.
  • If all of MasterMap isn’t made open, then which elements of it would unlock the most value? Are there specific layers or data types that could reduce friction in important application areas?
  • Similarly, how could the existing OS open data be improved to make it more useful? Hint: currently all of the data is generalised and doesn’t have any stable identifiers at all.
  • What could the OS and others do to support the rest of us in annotating and improving their data assets? The OS switched off its TOID lookup service because no-one was using it. It wasn’t very good. So what would we expect that type of identifier service to do?
  • If there is more openly licensed data available, then how could it be usefully added to OpenStreetMap and used by the ecosystem of open geospatial tools that it is supporting?
  • We all want access to MasterMap because it’s a rich resource. What are the options available to ensure that the Ordnance Survey stays resourced to a level where we can retain it as a national asset? Are there reasonable compromises to be made between opening all the data and them offering some commercial services around it?
  • …etc, etc, etc.

Personally, I’m choosing to be optimistic. Let’s get to work to create the result we want to see.