Cooking up a new approach to supporting purposeful use of data

In my last post I explored how we might better support the use of datasets. To do that I applied the BASEDEF framework to outline the ways in which communities might collaborate to help unlock more value from individual datasets.

But what if we changed our focus from supporting discovery and use of datasets and instead focused on helping people explore specific types of problems or questions?

Our paradigm around data discovery is based on helping people find individual datasets. But unless a dataset has been designed to answer the specific question you have in mind, then it’s unlikely to be sufficient. Any non-trivial analysis is likely to need multiple datasets.

We know that data is more useful when it is combined, so why isn’t our approach to discovery based around identifying useful collections of datasets?

A cooking metaphor

To explore this further let’s use a cooking metaphor. I love cooking.

Many cuisines are based on a standard set of elements. Common spices or ingredients that become the base of most dishes. Like a mirepoix, a sofrito, the holy trinity of Cajun cooking, or the mother sauces in French cuisine.

As you learn to cook you come to appreciate how these flavour bases and sauces can be used to create a range of dishes. Add some extra spices and ingredients and you’ve created a complete dish.

Recipes help us consistently recreate these sauces.

A recipe consists of several elements. It will have a set of ingredients and a series of steps to combine them. A good recipe will also include some context. For example some background on the origins of the recipe and descriptions of unusual spices or ingredients. It might provide some things to watch out for during the cooking (“don’t burn the spices”) or suggest substitutions for difficult to source ingredients.

Our current approach to dataset discovery involves trying to document the provenance of an individual ingredient (a dataset) really well. We aren’t helping people combine them together to achieve results.

Efforts to improve dataset metadata, documentation and provenance reporting are important. Projects like the dataset nutrition label are great examples of that. We all want to be ethical, sustainable cooks. To do that we need to make informed choices about our ingredients.

But, to whisk these food metaphors together, a nutrition label is there to help you understand what’s gone into your supermarket pasta sauce. It doesn’t give you a recipe to cook the sauce from scratch yourself. Or an idea of how to use it to make a tasty dish.

Recipes for data-informed problem solving

I think we should think about sharing dataset recipes: instructions for how to mix up a selection of dataset ingredients. What would they consist of?

Firstly, the recipe would need to be based around a specific type of question, problem or challenge. Examples might include:

  • How can I understand air quality in my city?
  • How is deprivation changing in my local area?
  • What are the impacts of COVID-19 in my local authority?

Secondly, a recipe would include a list of datasets that have to be sourced, prepared and combined together to explore the specific problem. For example, if you’re exploring impacts of COVID-19 in your local authority you’re probably going to need:

  • demographic data from the most recent census
  • spatial boundaries to help visualise and present results
  • information about deprivation to help identify vulnerable people

Those three datasets are probably the holy trinity of any local spatial analysis.

Finally, you’re going to need some instructions for how to combine the datasets together. The instructions might identify some tools you need (Excel or QGIS), reference some techniques (reprojection) and maybe include some hints about how to substitute for key ingredients if you can’t get them in your local area (FOI). A rough sketch of what one of those steps might look like in code is below.
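To make that less abstract, here’s a minimal sketch of what one recipe step might look like in code. It’s illustrative only: the file names and column names are hypothetical placeholders, and it assumes the pandas and geopandas libraries.

```python
# Hypothetical recipe step: join census demographics onto spatial
# boundaries and reproject them, ready to visualise in QGIS.
import pandas as pd
import geopandas as gpd

# Ingredient 1: demographic data from the most recent census
census = pd.read_csv("census_demographics.csv")        # columns: area_code, population, ...

# Ingredient 2: spatial boundaries for the local area
boundaries = gpd.read_file("area_boundaries.geojson")  # columns: area_code, geometry

# Technique: reprojection, e.g. to the British National Grid (EPSG:27700)
boundaries = boundaries.to_crs(epsg=27700)

# Combine the ingredients into a single enriched dataset
combined = boundaries.merge(census, on="area_code")
combined.to_file("local_analysis.gpkg", driver="GPKG")
```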

The recipe might also suggest ways to vary it for different purposes: add a sprinkle of Companies House data to understand your local business community, or a dash of OpenStreetMap to identify greenspaces.

As a time saver maybe you can find some pre-made versions of some of the steps in the recipe?

Examples in the wild

OK, it’s easy to come up with a metaphor and an idea. But would this actually meet a need? There are a few reasons why I’m reasonably confident that dataset recipes could be helpful. Mostly because I can see the same approach re-appearing in some related contexts.

If you have examples then let me know in the comments or on twitter.

How can dataset recipes help?

I think there’s a whole range of ways in which these types of recipe can be useful.

Data analysis always starts by posing a question. Documenting how datasets can be applied to specific questions will make them easier to find via search engines. It just fits better with what people want to do.

Data discovery is important during periods where there is a sudden influx of new potential users. For example, where datasets have just been published under an open licence and are now available to more people, for a wider range of purposes.

In my experience, data analysts and scientists who understand a domain, e.g. population or transport modelling, have built up a tacit understanding of which datasets are most useful in different contexts. They understand the limitations and the process of combining datasets together. This thread from Chris Gale, with a recipe for doing spatial analysis using PHE’s COVID-19 data, is a perfect example. Documenting and sharing this knowledge can help others to do similar analyses. It’s like a cooking masterclass.

Discovery is also difficult when there is a sudden influx of new data available. Such as during this pandemic. Writing recipes is a good way to share learning across a community.

Documenting useful recipes might help us scale innovation across local areas.

Lastly, we’re still trying to understand which datasets are the most important parts of our local, national and international data infrastructure. We’re currently lacking any real quantitative information about how datasets are combined together. In the same way that recipes can be analysed to create ingredient networks, dataset recipes could be analysed to find out how datasets are being used together, as in the sketch below. We can then strengthen that infrastructure.
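To illustrate, here’s a minimal sketch of that kind of analysis. The recipes and dataset names are made up; in practice you’d parse them from a corpus of published recipes.

```python
# Count which dataset "ingredients" appear together across recipes.
from collections import Counter
from itertools import combinations

recipes = [
    {"census", "boundaries", "deprivation"},
    {"census", "boundaries", "companies-house"},
    {"boundaries", "openstreetmap", "air-quality"},
]

pair_counts = Counter()
for recipe in recipes:
    for pair in combinations(sorted(recipe), 2):
        pair_counts[pair] += 1

# The most frequent pairs hint at which datasets form shared infrastructure
for pair, count in pair_counts.most_common(3):
    print(pair, count)
```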

If you’ve built something that helps people publish dataset recipes then send me a link to your app. I’d like to try it.

How can you help support the use of a dataset?

Getting the most value from data, whilst minimising its harmful impacts, is a community activity. Datasets need to be governed and published well. Most of that responsibility falls on the data publisher, because the choices they make shape data ecosystems.

But other people have a role to play too. Being a good data user means engaging with that process.

Helping others to find data and find the value in it, feels particularly important at the moment. During the pandemic there are many new datasets becoming available. And there are lots of questions to be answered. Some of them can be answered through better use of data.

So, how can communities work together to support use of data?

There are a lot of different ways to explore that question. But there’s a framework called BASEDEF, created by the open source community, which I find helpful.

BASEDEF stands for Blog, Apply, Suggest, Extend, Document, Evangelise and Fix. It describes the different types of contributions that can support an open source project. It can also be applied to help organise a small team in doing that work. Here’s a handy cheat sheet.

But the framework can also be applied to the task of supporting the use of an openly licensed dataset. Let’s run through the framework with that in mind.


Blog

You can write about a dataset to help others to discover it. You can help explain the potential value of applying the dataset to specific problems. Or perhaps you can see some downsides that others should consider.

Writing about how a dataset has been useful to you, by describing how you’ve successfully applied it in a project, will also help others see its potential value.

Apply

You can show how a dataset can be used, by creating something with it. You might do a detailed analysis of the data, but some simpler contributions can also be helpful.

For example, you might create a simple visualisation. Or write and publish some code that illustrates how the dataset can be accessed and used. You could publish a quick demo showing how the dataset can be imported and used in some frequently used tools and platforms, like the sketch below.
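As a sketch of how small that demo can be: the following loads a dataset and prints a couple of quick summaries. The URL and column name are hypothetical placeholders for whichever dataset you’re supporting, and it assumes the pandas library.

```python
# Load an openly licensed dataset and show its shape and a simple summary.
import pandas as pd

df = pd.read_csv("https://example.org/open-dataset.csv")  # hypothetical URL

print(df.head())                      # a first look at the columns and values
print(df["category"].value_counts())  # hypothetical column: a quick summary
```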

At the moment everyone is a bit tired of charts and graphs. And I agree with the first principle in the visualisation design principles for the pandemic. But a helpful visualisation can do a range of things. Visualisation can be exploratory rather than explanatory.

A visualisation could support other people in understanding the shape of a dataset, to inform their analysis and interpretation of it. It can help identify outliers, gaps, or highlight some of the richness in the data. I’d recommend making it clear when you’re doing this type of visualisation, rather than trying to derive specific insights.

Suggest

Read the documentation. Download and explore the dataset. Ask questions. Give feedback.

Make suggestions to the publisher about changes they could make to publish the data better. Rather than just offering academic critique, be clear about how the suggested changes will support your needs or those of your community.

Extend

The freedoms granted by an open licence allow you to enrich and improve a dataset.

Sometimes the smallest changes can have the most impact. Convert the data into other common or standard formats. Extract data from spreadsheets into CSV files. Convert data published in more complex formats or via APIs into simpler tabular data, to make it more accessible to analysts rather than programmers.
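As a sketch of how small that work can be: the following extracts every sheet of a published spreadsheet into separate CSV files. The file name is a hypothetical placeholder, and reading .xlsx files with pandas assumes the openpyxl package is installed.

```python
# Convert each sheet of a spreadsheet into a separate CSV file.
import pandas as pd

sheets = pd.read_excel("published-dataset.xlsx", sheet_name=None)  # dict of DataFrames
for name, df in sheets.items():
    df.to_csv(f"{name}.csv", index=False)
```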

Or maybe you can enrich a dataset by adding identifiers that will allow it to be linked to other sources. Do the work of merging with other datasets to bring in more context.

The downside here is that if the original data changes your extended version will get out of date. If you can’t commit to keeping your version up to date, then be sure to share your code and document your methods.

Allow others to repeat the steps you’ve taken. And don’t forget to suggest the improvements to the publisher.

Document

Write additional documentation to fill in gaps where the publisher has not provided sufficient background or explanation. Explain technical concepts or academic terms to a non-specialist audience.

As a user of the data, you’re able to write that documentation from a perspective that reflects the needs and questions of your specific community and the kinds of questions you need to ask. The original publisher might not have all that context or understand those needs, so this work can be really helpful.

Good documentation can be a finding aid. There are structured ways that you can go about writing documentation, such as this tool for writing civic data guides. (Check out some of the examples).

Evangelise

Email people that might have a need for the data. Tweet about it to a wider community. Highlight it in a presentation. Talk about it over coffee or Zoom.

Fix

If the dataset is collaboratively maintained then go ahead and fix errors and omissions. If you’re not confident about making a fix, then submit an error report. In addition to fixing errors you might be able to help verify that data is correct.

If a dataset isn’t collaboratively maintained then, when you find errors, be sure to flag them to the publisher and highlight the issue for others. Or consider publishing an enriched version with fixes applied.


This framework isn’t perfect. The name is a bit clunky for a start. But there are a couple of things that I like about it.

Firstly, it recognises that not all contributions need to be technical. There’s room for people to contribute different skills in different ways.

Secondly, the elements overlap and reinforce one another. Writing documentation and blogging about how you’ve used a dataset helps to evangelise it. Enriching a dataset can help demonstrate in a practical way how a publisher can improve how data is published.

Finally, it serves to highlight some important aspects of community curation which aren’t always well supported in existing data platforms and portals. We can do better here.

If you’re interested in working on adapting this further then I’m happy to chat! It might be useful to have a cheat sheet that supports its application to data, and more examples of how to do these different elements well.

Why is change discovery important for open data?

Change discovery is the process of identifying changes to a resource. For example, that a document has been updated. Or, in the case of a dataset, whether some part of the data has been amended, e.g. to add data, fill in missing values, or correct existing data. If we can identify that changes have been made to a dataset, then we can update our locally cached copies, re-run analyses or generate new, enriched versions of the original.

Any developer who is building more than a disposable prototype will be looking for information about the ongoing stability and change frequency of a dataset. Typical questions might be:

  • How often will a dataset get routinely updated and republished?
  • What types of data updates are anticipated? E.g. are only new records added, or might data be amended and removed?
  • How will the dataset, or parts of it, be version controlled?
  • How will changes to the dataset, or parts of it (e.g. individual rows or objects), be flagged?
  • How will planned and unplanned updates and changes be communicated to users of the dataset?
  • How will data updates be published, e.g. will there be a means of monitoring for or accepting incremental updates, or just refreshed data downloads?
  • Are large scale changes to the data model expected, and if so over what timescale?
  • Are changes to the technical infrastructure planned, and if so over what timescale?
  • How will planned (and unplanned) service downtime, e.g. for upgrades, be notified and reported?

These questions span a range of levels: from changes to individual elements of a dataset, through to the system by which it is delivered. These changes will happen at different frequencies and will be communicated in different ways.

Some types of change discovery can be done after the fact, e.g. by comparing two versions of a dataset. But in practice this is an inefficient way to synchronise and share data, as the consumer needs to reconstruct a series of edits and changes that have already been applied by the publisher of the data. To efficiently publish and distribute data we need to be able to understand when changes have happened.

Some types of changes, e.g. to data models and formats, will just break downstream systems if not properly advertised in advance. So it’s even more important to consider the impacts of these types of change.

A robust data infrastructure will include an appropriate change notification system for different levels of the system. Some of these will be automated. Some will be part of the process of supporting end users. For example:

  • changes to a row in a dataset might be flagged with a timestamp and a change notice
  • API responses might indicate the version of the object being retrieved
  • dataset metadata might include an indication of the planned frequency of publication and a timestamp for when the dataset was last modified
  • a data portal might include a calendar indicating when key datasets will be updated or a feed of recently updated or changed datasets
  • changes to the data model and the API used to deliver a dataset might be announced and discussed via a developer support forum

These might be implemented as technical features of the platform. But they might also be as simple as an email to users, or a public tweet.

Versioning of data can also help data publishers improve the scalability of their infrastructure and reduce the costs of data publishing. For example, adding features to data portals that might let data users:

  • make API calls that will only return responses if data has been updated since the user last requested it, e.g. using HTTP Conditional GET. This can reduce bandwidth and load on the publisher by encouraging local caching of data (see the sketch after this list)
  • use a checksum and/or timestamps to detect whether bulk downloads have changed, to reduce bandwidth
  • subscribe to machine-readable feeds of dataset-level changes, to avoid the need for users to repeatedly re-download large datasets
  • subscribe to machine-readable feeds of new datasets, to facilitate mirroring of data across systems
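As a minimal sketch of the first two ideas, here’s how a data user might check for changes using HTTP Conditional GET, falling back to a checksum when the server doesn’t supply a validator. The URL is a hypothetical placeholder and it assumes the requests library.

```python
# Detect whether a dataset has changed before re-downloading it.
import hashlib
import requests

URL = "https://example.org/dataset.csv"  # hypothetical dataset URL

# First fetch: remember the validator and a checksum of the content
first = requests.get(URL)
etag = first.headers.get("ETag")
checksum = hashlib.sha256(first.content).hexdigest()

# Later fetch: ask the server to respond only if the data has changed
headers = {"If-None-Match": etag} if etag else {}
later = requests.get(URL, headers=headers)

if later.status_code == 304:
    print("Not modified: keep the cached copy")
elif hashlib.sha256(later.content).hexdigest() == checksum:
    print("Content unchanged (no validator support): keep the cached copy")
else:
    print("Dataset changed: refresh the local copy")
```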

Supporting change notification and discovery, even if it’s just through documentation rather than more automated means, is an important part of engineering any good data platform.

I think it’s particularly important for open data (and other data that is liberally licensed) because these datasets are frequently copied, distributed and republished across different platforms. The ability to distribute a dataset, in different formats or with improvements and corrections, is one of the key freedoms that an open licence provides.

The downside to secondary publishing is that we end up with multiple copies of a dataset, some or all of which might be out of date, or have diverged from the original at different points in time.

Without robust approaches to provenance, change control and discovery, we run the risk of that data becoming out of date and leading to poor analyses and decision making. Multiple copies of the same dataset, while increasing ease of use, also increase friction by requiring users to find the original, authoritative data among all the copies. Or to try to figure out whether the copy available in their preferred platform is completely up to date with the original.

Documentation and linking to original sources can help mitigate those problems. But automating change notifications, to allow copies of datasets to be easily synchronised between platforms, at the point they are updated, is also important. I’ve not seen a lot of recent work on documenting these as best practices. I think there’s still some gaps in the standards landscape around data platforms. So I’d be interested to hear of examples.

In the meantime, if you’re building a data platform, think about how you can enable users to more efficiently and automatically consume updated data.

And if you’re republishing primary data in other platforms, make sure you’re including detailed information and documentation about how and when you last refreshed the dataset. Ideally your copies will update automatically as the source changes. Linking to the open source code you ran to make the secondary copy will allow others to repeat that process if they need an updated version faster than you plan to produce one.

How can publishing more data decrease the value of existing data?

Last month I wrote a post looking at how publishing new data might increase the value of existing data. I ended up listing seven different ways including things like improving validation, increasing coverage, supporting the ability to link together datasets, etc.

But that post only looked at half of the issue. What about the opposite? Are there ways in which publishing new data might reduce the value of data that’s already available?

The short answer is: yes there are. But before jumping into that, let’s take a moment to reflect on the language we’re using.

A note on language

The original post was prompted by an economic framing of the value of data. I was exploring how the option value for a dataset might be affected by increasing access to other data. While this post is primarily looking at how option value might be reduced, we need to acknowledge that “value” isn’t the only way to frame this type of question.

We might also ask, “how might increasing access to data increase potential for harms?” As part of a wider debate around the issues of increasing access to data, we need to use more than just economic language. There’s a wealth of good writing about the impacts of data on privacy and society which I’m not going to attempt to precis here.

It’s also important to highlight that “increasing value” and “decreasing value” are relative terms.

Increasing the value of existing datasets will not seem like a positive outcome if your goal is to attempt to capture as much value as possible, rather than benefit a broader ecosystem. Similarly, decreasing value of existing data, e.g. through obfuscation, might be seen as a positive outcome if it results in better privacy or increased personal safety.

Decreasing value of existing data

Having acknowledged that, let’s try to answer the earlier question. In what ways can publishing new data reduce the value we can derive from existing data?

Increased harms leading to retraction and reduced trust

Publishing new data always runs the risk of re-identification and the enabling of unintended inferences. While the impacts of these harms are likely to be most directly felt by both communities and individuals, there are also broader commercial and national security issues. Together, these issues might ultimately reduce the value of the existing data ecosystem in several ways:

  • Existing datasets may need to be retracted, have their scope changed, or have their circulation reduced in order to avoid further harm. Data privacy impact assessments will need to be updated as the contexts in which data is being shared and published change
  • Increased concerns over potential privacy impacts might lead organisations to choose not to increase access to similar or related datasets
  • Increased concerns might also lead communities and individuals to reduce the amount of data they are willing to share with previously trusted sources

Overall this can lead to a reduction in the overall coverage, quality and linking of data across a data ecosystem. It’s likely to be one of the most significant impacts of poorly considered data releases. It can be mitigated through proper impact assessments, consultation and engagement.

Reducing overall quality

Newly published data might be intended to increase coverage, enrich, link, validate or otherwise improve existing data. But it might actually have the opposite effect because it’s of poor quality. I’ve briefly touched on this in a previous post on fictional data.

Publication of poor quality data might be unintended. For example an organisation may just be publishing the data it has to help address an issue, without properly considering or addressing underlying problems with it. Or a researcher may publish data that contains honest mistakes.

But publication of poor quality data might also be deliberate. For example as spam or misinformation intended to “poison the well”.

More subtly, practices like p-hacking and falsification of data, which might have a short-term direct benefit for the publisher or author, might create longer-term issues by impacting the use of other datasets.

This is why understanding and documenting the provenance of data, monitoring of retractions, fixes and updates to data, and the ability to link analyses with datasets are all so important.

Creating unnecessary competition or increasing friction

Publishing new datasets containing new observations and data about an area or topic of interest can lead to positive impacts, e.g. by increasing confidence or coverage. But datasets are also competing with one another. The same types of data might be available from different sources, but under different licences, access arrangements, pricing, etc.

This competition isn’t necessarily positive. For example, the data ecosystem might not benefit as much from the network effects that follow from linking data because key datasets are not linked or cannot be used together. Incompatible and competing datasets can add friction across an ecosystem.

Building poor foundations

Data is often published as a means of building stronger data infrastructure for a sector, or to address a specific challenge. But if that data is poorly maintained or is not sustainably funded, then the energy that goes into building the communities, tools and other datasets around that infrastructure might be wasted.

That reduces the value of existing datasets which might otherwise have provided a better foundation to build upon, or whose quality depends on the shared infrastructure. While this issue is similar to the previous one about competition, its root causes and impacts are slightly different.

 

As I noted in my earlier post, I don’t think this is an exhaustive list and it can be improved by contributions. Leave a comment if you have any thoughts.

What is collaborative maintenance of data? A short talk at the Royal Society

Following the publication of their report on data governance in the 21st century, the Royal Society are running a number of workshops to explore data governance in different sectors. In October 2019 they ran one exploring data governance in the auto insurance sector.

Last week they held a workshop looking at data governance in the civil society sector. The ODI were invited to help out, and I chaired a session looking at collaborative maintenance of data. I believe the Royal Society will be publishing a longer write-up of the workshop over the coming weeks.

This blog post is a written version of a short ten minute talk I gave during the workshop. The slides are public.

Let’s start with a definition. What is collaborative maintenance?

You might already be familiar with terms like “crowd-sourcing” or “citizen science”. Both of those are examples of collaborative maintenance. But it can take other forms too. At the ODI we use collaborative maintenance of data to refer to any scenario where organisations and communities are sharing the work of collecting and maintaining data.

It might be helpful to position collaborative maintenance alongside other approaches that are part of “open culture”. These include open standards, open source, and open data. Let’s look at each of them in turn.

Open standards for data are reusable, shared agreements that shape how we collect, share, govern and use data. There are different types of open standards. Some are technical, and describe file formats and methods of exchanging data. Others are higher-level and capture codes of practices and protocols for collecting data. Open standards are best developed collaboratively, so that everyone impacted by or benefiting from the standard can help shape it.

Open source involves collaborating to create reusable, openly licensed code and applications. Some open source projects are run by individuals or small communities. Others are backed by larger commercial organisations. This collaborative work is different to that of open standards. For example, it involves identifying and agreeing features, writing and testing code and producing documentation to allow others to use it.

Open data is about publishing data under an open licence, so it can be accessed, used and shared by anyone for any purpose. Different communities engage in publication of open data for different purposes.

For example, the open government movement originally focused on open data as a means to increase transparency of governments. More recently there is a shift towards using open data to help address a variety of social, economic and environmental challenges. In contrast, as part of the open science movement, there is a different role for open data. Recent attention has been on the use of open data to address the reproducibility crisis around research. Or to help respond to emerging health issues, like Coronavirus.

With a few exceptions, the main approach to open data has been a single organisation (or researcher) publishing data that they have already collected. There may be some collaboration around use of that data, but not in its collection or maintenance.

This makes open data quite distinct from open source or open standards.

We can think of collaborative maintenance as taking the approach used in open source and applying it to data. Collaborative maintenance involves collaboration across the full lifecycle of a dataset.

Some examples might be helpful.

OpenStreetMap is a collaboratively produced spatial database of the entire world. While it was originally produced by individuals and communities, it is now contributed to by large organisations like Facebook, Microsoft and Apple. The Humanitarian OpenStreetMap community focuses on the collection and use of data to support humanitarian activities. The community are involved in deciding what data to collect, prioritising maintenance of data following disasters, and mapping activities either on the ground or remotely. The community works across the lifecycle and is self-directing.

Common Voice is a Mozilla project. It aims to build an open dataset to support voice recognition applications. By asking others to contribute to the dataset, they hope to make it more comprehensive and inclusive. Mozilla have defined what data will be collected and the tasks to be carried out, but anyone can contribute to the dataset by adding their voice or transcribing a recording. It’s this open participation that could help ensure that the dataset represents a more diverse set of people.

Edubase is maintained by the Department for Education (DfE). It’s our national database of schools. It’s used in a variety of different applications. Like Mozilla, DfE are acting as the steward of the data and have defined what information should be collected. But the work of populating and maintaining the shared directory is carried out by people in the individual schools. This is the best way to keep that data up to date. Those who know when the data has changed have the ability to update it. The contributors all benefit from the shared resource.

Building a shared directory is a common use for collaborative maintenance. But there are others.

Looking across these projects and other examples that we’ve studied in our desk and user research, we can see that there are different ways we can collaborate around data.

For example, we can work together to decide what data to collect. We can share the work of collecting and maintaining data, ensuring its quality and governing access to it. We can use open source to help to build the tools to support those communities.

We’ve developed the collaborative maintenance guidebook to help support the design of new services and platforms. It includes some background and a worked example. The bulk of the guidebook is a set of “design patterns” that describe solutions to common problems. For example, how to manage quality when many different people are contributing to the same dataset.

We think collaborative maintenance can be useful in more projects. For civil society organisations collaborative maintenance might help you engage with communities that you’re supporting to collect and maintain useful data. It might also be a tool to support collaboration across the sector as a means of building common resources.

The guidebook is at an early stage and we’d love to get feedback on its contents. Or to help you apply it to a real-world project. Let us know what you think!

 

How can publishing more data increase the value of existing data?

There’s lots to love about the “Value of Data” report. Like the fantastic infographic on page 9. I’ll wait while you go and check it out.

Great, isn’t it?

My favourite part of the paper is that it’s taught me a few terms that economists use, but which I hadn’t heard before. Like “incomplete contracts”: the uncertainty about how people will behave because of ambiguity in norms, regulations, licensing or other rules. Finally, a name to put to my repeated gripes about licensing!

But it’s the term “option value” that I’ve been mulling over for the last few days. Option value is a measure of our willingness to pay for something even though we’re not currently using it. Data has a large option value, because it’s hard to predict how its value might change in future.

Organisations continue to keep data because of its potential future uses. I’ve written before about data as stored potential.

The report notes that the value of a dataset can change because we might be able to apply new technologies to it. Or think of new questions to ask of it. Or, and this is the interesting part, because we acquire new data that might impact its value.

So, how does increasing access to one dataset affect the value of other datasets?

Moving data along the data spectrum means that more and more people will have access to it. That means it can be used by more people, potentially in very different ways than you might expect. Applying Joy’s Law, we might expect some interesting, innovative or just unanticipated uses. (See also: everyone loves a laser.)

But more people using the same data is just extracting additional value from that single dataset. It’s not directly impacting the value of other datasets.

To do that we need to use the new data in some specific ways. So far I’ve come up with seven ways that new data can change the value of existing data.

  1. Comparison. If we have two or more datasets then we can compare them. That will allow us to identify differences, look for similarities, or find correlations. New data can help us discover insights that aren’t otherwise apparent.
  2. Enrichment. New data can enrich an existing dataset by adding new information. It gives us context that we didn’t have access to before, unlocking further uses.
  3. Validation. New data can help us identify and correct errors in existing data.
  4. Linking. A new dataset might help us to merge existing datasets, allowing us to analyse them in new ways. The new dataset acts like a missing piece in a jigsaw puzzle.
  5. Scaffolding. A new dataset can help us to organise other data. It might also help us collect new data.
  6. Improve Coverage. Adding more data, of the same type, into an existing pool can help us create a larger, aggregated dataset. We end up with a more complete dataset, which opens up more uses. The combined dataset might have a better spatial or temporal coverage, be less biased or capture more of the world we want to analyse.
  7. Increase Confidence. If the new data measures something we’ve already recorded, then the repeated measurements can help us to be more confident about the quality of our existing data and analyses. For example, we might pool sensor readings about the weather from multiple weather stations in the same area. Or perform a meta-analysis of multiple scientific studies.

I don’t think this is exhaustive, but it was a useful thought experiment.

A while ago, I outlined ten dataset archetypes. It’s interesting to see how these align with the above uses:

  • A meta-analysis to increase confidence will draw on multiple studies
  • Combining sensor feeds can also help us increase confidence in our observations of the world
  • A register can help us with linking or scaffolding datasets. They can also be used to support validation.
  • Pooling together multiple descriptions or personal records can help us create a database that has improved coverage for a specific application
  • A social graph is often used as scaffolding for other datasets

What would you add to my list of ways in which new data improves the value of existing data? What did I miss?

Licence Friction: A Tale of Two Datasets

For years now at the Open Data Institute we’ve been working to increase access to data, to create social and economic benefits across a range of sectors. While the details change across projects, one of the more consistent aspects of our work and guidance has been to support data stewards in making data as open as possible, whilst ensuring that it is clearly licensed.

Reference data, like addresses and other geospatial data, that underpins our national and global data infrastructure needs to be available under an open licence. If it’s not, which is the ongoing situation in the UK, then other data cannot be made as open as possible. 

Other considerations aside, data can only be as open as the reference data it relies upon. Ideally, reference data would be in the public domain, e.g. using a CC0 waiver. Attribution should be a consistent norm regardless of what licence is used.

Data becomes more useful when it is linked with other data. When it comes to data, adding context adds value. It can also add risks, but more value can be created from linking data. 

When data is published using bespoke or restrictive licences then it is harder to combine different datasets together, because there are often limitations in the licensing terms that restrict how data can be used and redistributed.

This means data needs to be licensed using common, consistent licences. Licences that work with a range of different types of data, collected and used by different communities across jurisdictions. 

Incompatible licences create friction that can make it impossible to create useful products and services. 

It’s well-reported that data scientists and other users spend huge amounts of time cleaning and tidying data because it’s messy and non-standardised. It’s probably less well-reported how many great ideas are simply shelved because of lack of access to data. Or are impossible because of issues with restrictive or incompatible data licences. Or are cancelled or simply needlessly expensive due to the need for legal consultations and drafting of data sharing agreements.

These are the hurdles you often need to overcome before you even get started with that messy data.

Here’s a real-world example of where the lack of open geospatial data in the UK, and ongoing incompatibilities between data licensing is getting in the way of useful work. 

Introducing Active Places

Active Places is a dataset stewarded by Sport England. It provides a curated database of sporting facilities across England. It includes facilities provided by a range of organisations across the public, private and third-sectors. It’s designed to help support decision making about the provision of tens of thousands of sporting sites and facilities around the UK to drive investment and policy making. 

The dataset is rich and includes a wide range of information from disabled access through to the length of ski slopes or the number of turns on a cycling track.

While Sport England are the data steward, the curation of the dataset is partly subcontracted to a data management firm and partly carried out collaboratively with the owners of those sites and facilities.

The dataset is published under a standard open licence, the Creative Commons Attribution 4.0 licence. So anyone can access, use and share the data so long as they acknowledge its source. Contributors to the dataset agree to this licence as part of registering to contribute to the site.

The dataset includes geospatial data, including the addresses and locations of individual sites. This data includes IP from Ordnance Survey and Royal Mail, which means they have a say over what happens to it. In order to release the data under an open licence, Sport England had to request an exemption from the Ordnance Survey to their default position, which is that data containing OS IP cannot be sublicensed. When granted an exemption, an organisation may publish their data under an open licence. In short, OS waive their rights over the geographic locations in the data. 

The OS can’t, however, waive any rights that Royal Mail has over the address data. In order to grant Sport England an exemption, the OS also had to seek permission from Royal Mail. The Sport England team were able to confirm this for me.

Unfortunately it’s not clear, without having checked, that this is actually the case. It’s not evident in the documentation of either Active Places or the OS exemption process. Is clarifying all third-party rights a routine part of the exemption process or not?

It would be helpful to know. As the ODI has highlighted, lack of transparency around third-party rights in open data is a problem. For many datasets the situation remains unclear. And unclear positions are fantastic generators of legal and insurance fees.

So, to recap: Sport England has invested time in convincing Ordnance Survey to allow it to openly publish a rich dataset for the public good. A dataset in which geospatial data is clearly important, but is not the main feature. The reference data is dictating how open the dataset can be and, as a result, how much value can be created from it.

In case you’re wondering, lots of other organisations have had to do the same thing. The process is standardised to try and streamline it for everyone. A 2016 FOI request shows that between 2011 and 2015 the Ordnance Survey handled more than 1,000 of these requests.

Enter OpenStreetMap

At the end of 2019, members of the OpenStreetMap community contacted Sport England to request permission to use the Active Places dataset.

If you’re not familiar with OpenStreetMap, then you should be. It’s an openly licensed map of the world maintained by a huge community of volunteers, humanitarian organisations, and public and private sector businesses around the world.

The OpenStreetMap Foundation is the official steward of the dataset, with the day-to-day curation and operations happening through its volunteer network. As a small not-for-profit, it has to be very cautious about legal issues relating to the data. It can’t afford to be sued. The community is careful to ensure that data imported or added into the database comes from openly licensed sources.

In March 2017, after a consultation with Creative Commons, the OpenStreetMap Licence/Legal Working Group concluded that data published under the Creative Commons Attribution licence is not compatible with the licence used by OpenStreetMap, the Open Database Licence (ODbL). They felt that some specific terms in the licence (particularly in its 4.0 version) meant that they needed additional permission in order to include that data in OpenStreetMap.

Since then the OpenStreetMap community has been contacting data stewards to ask them to sign an additional waiver that grants the OSM community explicit permission to use the data. This is exactly what open licensing of data is intended to avoid.

CC-BY is one of the most frequently used open data licences, so this isn’t a rare occurrence. 

As an indicator of the extra effort required: in a 2018 talk from the Bing Maps team, discussing how they have been supporting the OpenStreetMap community in Australia, they called out their legal team as one of the most important assets they had to offer the local mapping community, helping to get waivers signed. At the time of writing nearly 90 waivers have been circulated in Australia alone, not all of which have been signed.

So, to recap: due to a perceived incompatibility between two of the most frequently used open data licences, the OpenStreetMap community and its supporters are spending time negotiating access to data that is already published under an open licence.

I am not a lawyer. So these are, like, just my opinions. But while I understand why the OSM Licence Working Group needs to be cautious, it feels like they are being overly cautious. Then again, I’m not the one responsible for stewarding an increasingly important part of a global data infrastructure.

Another opinion is that perhaps the Microsoft legal team might be better deployed to solve the licence incompatibility issues. Instead they are now drafting their own new open data licences, which are compatible with CC-BY.

Active Places and OpenStreetMap

Still with me?

At the end of last year, members of the OpenStreetMap community contacted Sport England to ask them to sign a waiver so that they could use the Active Places data. Presumably to incorporate some of the data into the OSM database.

The Sport England data and legal teams then had to understand what they were being asked to do and why. And they asked for some independent advice, which is where I provided some support through our work with Sport England on the OpenActive programme. 

The discussion included:

  • questions about why an additional waiver was actually necessary
  • the differences in how CC-BY and ODbL are designed to require data to remain open and accessible – CC-BY includes a limitation on the use of technical restrictions, which is allowed by the open definition, whilst ODbL adopts a principle of encouraging “parallel distribution”
  • acceptable forms and methods of attribution
  • who, within an organisation like Sport England, might have responsibility to decide what acceptable attribution looked like
  • why the OSM community had come to its decisions
  • who actually had authority to sign-off on the proposed waiver
  • whether signing a waiver and granting a specific permission undermined Sport England’s goal to adopt standard open data practices and licences, and a consistent approach for every user
  • whether the OS exemption, which granted permission to SE to publish the dataset under an open licence, impacted any of the above

All reasonable questions from a team being asked to do something new. 

Like a number of organisations asked to sign waivers in Australia, SE have not yet signed one and may choose not to do so. Like all public sector organisations, SE are being cautious about taking risks.

The discussion has spilled out onto twitter. I’m writing this to provide some context and background to the discussion in that thread. I’m not criticising anyone as I think everyone is trying to come to a reasonable outcome. 

As the twitter thread highlights, the OSM community are not just concerned about the CC-BY licence but also about the potential that additional third-party rights are lurking in the data. Clarifying that may require SE to share more details about how the address and location data in the dataset is collected, validated and normalised for the OSM community to be happy. But, as noted earlier in the blog, I’ve at least been able to determine the status of any third-party rights in the data. So perhaps this will help to move things further.

The End

So, as a final recap, we have two organisations both aiming to publish and use data for the public good. But, because of complexities around derived data and licence compatibilities, data that might otherwise be used in new, innovative ways is instead going unused.

This is a situation that needs solving. It needs the UK government and Geospatial Commission to open up more geospatial data.

It needs the open data community to invest in resolving licence incompatibilities (and less in creating new licences) so that everyone benefits. 

We also need to understand when licences are the appropriate means of governing how data is used and when norms, e.g. around attribution, can usefully shape how data is accessed, used and shared.

Until then these issues are going to continue to undermine the creation of value from open (geospatial) data.

[Paper review] Open data for electricity modeling: Legal aspects

This blog post is a quick review of, and notes relating to, a research paper called: Open data for electricity modeling: Legal aspects.

It’s part of my new research notebook to help me collect and share notes on research papers and reports.

Brief summary

The paper reviews the legal status of publicly available energy data (and some related datasets) in Europe, with a focus on German law. The paper is intended to help identify some of the legal issues relevant to creation of analytical models to support use of energy data, e.g. for capacity planning.

As background, the paper describes the types of data relevant to building these types of model, the relevant aspects of database and copyright law in the EU and the properties of open licences. This background is used to assess some of the key data assets published in the EU and how they are licensed (or not) for reuse.

The paper concludes that the majority of uses of this data to support energy modelling in the EU, whether for research or other purposes, is likely to be infringing on the rights of the database holders, meaning that users are currently carrying legal risks. The paper notes that in many cases this is likely not the intended outcome.

The paper provides a range of recommendations to address this issue, including the adoption of open licences.

Three reasons to read

Here are three reasons why you might want to read this paper:

  1. It provides a helpful primer on the range of datasets and data types that are used to develop applications in the energy sector in the EU. Useful if you want to know more about the domain
  2. The background information on database rights and related IP law is clearly written and a good introduction to the topic
  3. The paper provides a great case study of how licensing and legal protections apply to data use in a sector. The approach taken could be reused and extended to other areas

Three things I learned

Here are three things that I learned from reading the paper.

  1. That a database might be covered by copyright (an “original” database) in addition to database rights. But the authors note this doesn’t apply in the case of a typical energy dataset
  2. That individual member states might have their own statutory exemptions to the Database Directive. E.g. in Germany it doesn’t apply to use of data in non-commercial teaching. So there is variation in how it applies.
  3. The discussion on how the Database Directive relates to statutory obligations to publish data was interesting, but highlights that the situation is unclear.

Thoughts and impressions

Great paper that clearly articulates the legal issues relating to publication and use of data in the energy sector in the EU. It’s easy to extrapolate from this work to other use cases in energy and by extension to other sectors.

The paper concludes with a good set of recommendations: the adoption of open licences, the need to clarify rights around data reuse and the role of data institutions in doing that, and how policy makers can push towards a more open ecosystem.

However there’s a suggestion that funders should just mandate open licences when funding academic research. While this is the general trend I see across research funding, in the context of this article it lacks a bit of nuance. The paper clearly indicates that the current status quo is that data users do not have the rights to apply open licences to the data they are publishing and generating. I think funders also need to engage with other policy makers to ensure that upstream provision of data is aligned with an open research agenda. Otherwise we risk perpetuating an unclear landscape of rights and permissions. The authors do note the need to address wider issues, but I think there’s a potential role of research funders in helping to drive change.

Finally, in their review of open licences, the authors recommend a move towards adoption of CC0 (public domain waivers and marks) and CC-BY 4.0. But they don’t address the fact that upstream licensing might limit the choice of how researchers can licence downstream data.

Specifically, the authors note the use of OpenStreetMap data to provide infrastructure data. However, depending on your use, you may need to adopt its licence (the ODbL) when republishing data. This can be at odds with a mandate to use other licences, or with restrictive licences used by other data stewards.

 

How do data publishing choices shape data ecosystems?

This is the latest in a series of posts in which I explore some basic questions about data.

In our work at the ODI we have often been asked for advice about how best to publish data. When trying to give helpful advice, one thing I’m always mindful of is how decisions about how data is published shape the ways in which value can be created from it. More specifically, whether those choices will enable the creation of a rich data ecosystem of intermediaries and users.

So what are the types of decisions that might help to shape data ecosystems?

To give a simple example: if I publish a dataset so it’s available as a bulk download, then you could use that data in any kind of application. You could also use it to create a service that helps other people create value from the same data, e.g. by providing an API or an interface to generate reports from the data. Publishing in bulk allows intermediaries to help create a richer data ecosystem. But if I’d just published that same data via an API, then there are limited ways in which intermediaries can add value. Instead people must come directly to my API or services to use the data.

This is one of the reasons why people prefer open data to be available in bulk. It allows for more choice and flexibility in how it is used. But, as I noted in a recent post, depending on the “dataset archetype” your publishing options might be limited.

The decision to only publish a dataset as an API, even if it could be published in other ways, is often a deliberate decision. The publisher may want to capture more of the value around the dataset, e.g. by charging for the use of an API. Or they may feel it is important to have more direct control over who uses it, and how. These are reasonable choices and, when the data is sensitive, sensible options.

But there are a variety of ways in which the choices that are made about how to publish data can shape or constrain the ecosystem around a specific dataset. It’s not just about bulk downloads versus APIs.

The choices include:

  • the licence that is applied to the data, which might limit it to non-commercial use, restrict redistribution, or impose limits on the use of derived data
  • the terms and conditions for the API or other service that provides access to the data. These terms are often conflated with data licences, but typically focus on aspects of service provision, for example rate limiting, restrictions on storage of API results, permitted uses of the API, permitted types of users, etc
  • the technology used to provide access to data. In addition to bulk downloads vs APIs, there are also details such as the use of specific standards, the types of API call that are possible, etc
  • the governance around the API or service that provides access to data, which might limit which users can access the service or create friction that discourages use
  • the business model that is wrapped around the API or service, which might include a freemium model, chargeable usage tiers, service level agreements, usage limits, etc

I think these cover the main areas. Let me know if you think I’ve missed something.

You’ll notice that APIs and services provide more choices for how a publisher might control usage. This can be a good or a bad thing.

The range of choices also means it’s very easy to create a situation where an API or service doesn’t work well for some use cases. This is why user research and engagement is such an important part of releasing a data product and designing policy interventions that aim to increase access to data.

For example, let’s imagine someone has published an openly licensed dataset via an API that restricts users to a maximum number of API calls per month.

These choices limit some uses of the API, e.g. applications that need to make lots of queries. They also mean that downstream users creating web applications are unable to provide a good quality of service to their own users. A popular application might just stop working at some point over the course of the month because it has hit the usage threshold.

The dataset might be technically open, but practically its use has been constrained by other choices.

Those choices might have been made for good reasons. For example as a way for the data publisher to be able to predict how much they need to invest each month in providing a free service, that is accessible to lots of users making a smaller number of requests. There is inevitably a trade-off between the needs of individual users and the publisher.

Adding on a commercial usage tier for high volume users might provide a way for the publisher to recoup costs. It also allows some users to choose what to pay for their use of the API, e.g. to more smoothly handle unexpected peaks in their website traffic. But it may sometimes be simpler to provide the data in bulk to support those use cases. Different use cases might be better served by different publishing options.

Another example might be a system that provides access to both shared and open data via a set of APIs that conform to open standards. If the publisher makes it too difficult for users to actually sign up to use those APIs, e.g. because of onerous registration or certification requirements, then only those organisations that can afford to invest the time and money to gain access might bother using them. The end result might be a closed ecosystem that is built on open foundations.

I think it’s important to understand how this range of choices can impact data ecosystems. They’re important not just for how we design products and services, but also in helping to design successful policies and regulatory interventions. If we don’t consider the full range of choices, then we may not achieve the intended outcomes.

More generally, I think it’s important to think about the ecosystems of data use. Often not enough attention is paid to the variety of ways in which value is created. This can lead to poor choices, like choosing to try and sell data for short-term gain rather than considering the variety of ways in which value might be created in a more open ecosystem.

The words we use for data

I’ve been on leave this week so, amongst the gardening and relaxing, I’ve had a bit of head space to think. One of the things I’ve been thinking about is the words we choose to use when talking about data. It was Dan’s recent blog post that originally triggered it. But I was reminded of it this week after seeing more people talking past each other, and reading about how the Guardian has changed the language it uses when talking about the environment: Climate crisis not climate change.

As Dan pointed out, we often need a broader vocabulary when talking about data. Talking about “data” in general can be helpful when we want to focus on commonalities. But for experts we need more distinctions. And for non-experts we arguably need something more tangible. “Data”, “algorithm” and “glitch” are default words we use, but there are often better ones.

It can be difficult to choose good words for data because everything can be treated as data these days. Whether it’s numbers, text, images or video everything can be computed on, reported and analysed. Which makes the idea of data even more nebulous for many people.

In Metaphors We Live By, George Lakoff and Mark Johnson discuss how the range of metaphors we use in language, whether consciously or unconsciously, impacts how we think about the world. They highlight that careful choice of metaphors can help to highlight or obscure important aspects of the things we are discussing.

The example that stuck with me was how we describe debates. We often do so in terms of things to be won, or battles to be fought (“the war of words”). What if we thought of debates as dances instead? Would that help us focus on compromise and collaboration?

This is why I think that data as infrastructure is such a strong metaphor. It helps to highlight some of the most important characteristics of data: that it is collected and used by communities, needs to be supported by guidance, policies and technologies and, most importantly, needs to be invested in and maintained to support a broad variety of uses. We’ve all used roads and engaged with the systems that let us make use of them. Focusing on data as information, as zeros and ones, brings nothing to the wider debate.

If our choice of metaphors and words can help to highlight or hide important aspects of a discussion, then what words can we use to help focus some of our discussions around data?

It turns out there are quite a few.

For example, there are “samples” and “sampling”. These are words used in statistics, but their broader usage has the same meaning. When we talk about sampling something, whether it’s food or drink, music or perfume, it’s clear that we’re not taking the whole thing. Talking about sampling might help us be clearer that often, when we’re collecting data, we don’t have the whole picture. We just have a tester, a taste. Hopefully one which is representative of the whole. We can make choices about when, where and how often we take samples. We might only be allowed to take a few.

“Polls” and “polling” are similar words. We sample people’s opinions in a poll. While we often use these words in more specific ways, they helpfully come with some understanding that this type of data collection and analysis is imperfect. We’re all very familiar at this point with the limitations of polls.

Or how about “observations” and “observing”? Unlike “sensing”, which is a passive word, “observing” is more active and purposeful. It implies that someone or something is watching. When we want to highlight that data is being collected about people or the environment, “taking observations” might help us think about who is doing the observing, and why. Instead of “citizen sensing”, which is a passive way of describing participatory data collection, “citizen observers” might place a bit more focus on the work and effort that is being contributed.

“Catalogues” and “cataloguing” are words that, for me at least, imply maintenance and value-added effort. I think of librarians cataloguing books and artefacts. “Stewards” and “curators” are other important roles.

AI and Machine Learning are often used to make predictions. For example, of products we might want to buy, whether we’re going to commit a crime, or how likely it is that we might have a car accident based on where we live. These predictions are imperfect. Yet we talk about algorithms as “knowing”, “spotting”, “telling” or “helping”. They don’t really do any of those things.

What they are doing is making a “forecast”. We’re all familiar with weather forecasts and their limits. So why not use the same words for the same activity? It might help to highlight the uncertainty around the uses of the data and technology, and reinforce the need to use these forecasts as context.

In other contexts we talk about using data to build models of the world. Or to build “digital twins”. Perhaps we should just talk more about “simulations”? There are enough people playing games these days that I suspect there’s a broader understanding of what a simulation is: a cartoon sketch of some aspect of the real world that might be helpful but which has its limits.

Other words we might use are “ratings” and “reviews”, to help describe data and systems that create rankings and automated assessments. Many of us have encountered ratings and reviews, and understand that they are often highly subjective and need interpretation.

Or how about simply “measuring” as a tangible example of collecting data? We’ve all used a ruler or measuring tape and know that sometimes we need to be careful about taking measurements: “Measure twice, cut once”.

I’m sure there are lots of others. I’m also well aware that not all of these terms will be familiar to everyone. And not everyone will associate them with things in the same way as I do. The real proof will be testing words with different audiences to see how they respond.

I think I’m going to try to deliberately use a broader range of language in my talks and writing and see how it fares.

What terms do you find most useful when talking about data?