What is collaborative maintenance of data? A short talk at the Royal Society

Following the publication of their report on data governance in the 21st century, the Royal Society are running a number of workshops to explore data governance in different sectors. In October 2019 they ran one exploring data governance in the auto insurance sector.

Last week they held a workshop looking at data governance in the civil society sector. The ODI were invited to help out, and I chaired a session looking at collaborative maintenance of data. I believe the Royal Society will be publishing a longer write-up of the workshop over the coming weeks.

This blog post is a written version of a short, ten-minute talk I gave during the workshop. The slides are public.

Let’s start with a definition. What is collaborative maintenance?

You might already be familiar with terms like “crowd-sourcing” or “citizen science”. Both of those are examples of collaborative maintenance. But it can take other forms too. At the ODI we use collaborative maintenance of data to refer to any scenario where organisations and communities are sharing the work of collecting and maintaining data.

It might be helpful to position collaborative maintenance alongside other approaches that are part of “open culture”. These include open standards, open source, and open data. Let’s look at each of them in turn.

Open standards for data are reusable, shared agreements that shape how we collect, share, govern and use data. There are different types of open standards. Some are technical, and describe file formats and methods of exchanging data. Others are higher-level and capture codes of practice and protocols for collecting data. Open standards are best developed collaboratively, so that everyone impacted by or benefiting from the standard can help shape it.

Open source involves collaborating to create reusable, openly licensed code and applications. Some open source projects are run by individuals or small communities. Others are backed by larger commercial organisations. This collaborative work is different to that of open standards. For example, it involves identifying and agreeing features, writing and testing code and producing documentation to allow others to use it.

Open data is about publishing data under an open licence, so it can be accessed, used and shared by anyone for any purpose. Different communities engage in publication of open data for different purposes.

For example, the open government movement originally focused on open data as a means to increase transparency of governments. More recently there has been a shift towards using open data to help address a variety of social, economic and environmental challenges. In contrast, open data plays a different role in the open science movement. Recent attention has been on the use of open data to address the reproducibility crisis around research, or to help respond to emerging health issues like coronavirus.

With a few exceptions, the main approach to open data has been a single organisation (or researcher) publishing data that they have already collected. There may be some collaboration around use of that data, but not in its collection or maintenance.

This makes open data quite distinct from open source or open standards.

We can think of collaborative maintenance as about taking the approach used in open source and applying it to data. Collaborative maintenance involves collaboration across the full lifecycle of a dataset.

Some examples might be helpful.

OpenStreetMap is a collaboratively produced spatial database of the entire world. While it was originally produced by individuals and communities, it is now contributed to by large organisations like Facebook, Microsoft and Apple. The Humanitarian OpenStreetMap community focuses on the collection and use of data to support humanitarian activities. The community are involved in deciding what data to collect, prioritising maintenance of data following disasters, and mapping activities either on the ground or remotely. The community works across the lifecycle and is self-directing.

Common Voice is a Mozilla project. It aims to build an open dataset to support voice recognition applications. By asking others to contribute to the dataset, they hope to make it more comprehensive and inclusive. Mozilla have defined what data will be collected and the tasks to be carried out, but anyone can contribute to the dataset by adding their voice or transcribing a recording. It’s this open participation that could help ensure that the dataset represents a more diverse set of people.

Edubase is maintained by the Department for Education (DfE). It’s our national database of schools. It’s used in a variety of different applications. Like Mozilla, DfE are acting as the steward of the data and have defined what information should be collected. But the work of populating and maintaining the shared directory is carried out by people in the individual schools. This is the best way to keep that data up to date. Those who know when the data has changed have the ability to update it. The contributors all benefit from the shared resource.

Building a shared directory is a common use for collaborative maintenance. But there are others.

Looking across these projects and other examples that we’ve studied in our desk and user research, we can see that there are different ways we can collaborate around data.

For example, we can work together to decide what data to collect. We can share the work of collecting and maintaining data, ensuring its quality and governing access to it. We can use open source to help to build the tools to support those communities.

We’ve developed the collaborative maintenance guidebook to help support the design of new services and platforms. It includes some background and a worked example. The bulk of the guidebook is a set of “design patterns” that describe solutions to common problems. For example, how to manage quality when many different people are contributing to the same dataset.

We think collaborative maintenance can be useful in more projects. For civil society organisations collaborative maintenance might help you engage with communities that you’re supporting to collect and maintain useful data. It might also be a tool to support collaboration across the sector as a means of building common resources.

The guidebook is at an early stage and we’d love to get feedback on its contents, or to help you apply it to a real-world project. Let us know what you think!

 

A key difference between open data and open source

In “left-pad and the data commons” I tried to identify some lessons for the open data community based on recent events in the JavaScript/npm world. Open source, open science and open data are all parts of the same endeavor of creating the commons. There’s a lot of fertile territory to be explored by looking at how those respective communities are operating, the infrastructure they’re building, and the kinds of issues that are being faced.

One thing that occurs to me is that there are currently some important differences between how open source and open data projects operate.

The similarities are obvious. Compare the key principles of the open source definition and the open definition, for example. Both share basic ideas: the ability to access the entirety of the source code or data (let’s call them “works”); the ability to create derived works; the right to distribute the works and derivatives; the ability to use the works for commercial and non-commercial purposes; and so on.

The ability to create derived works means that anyone can also modify the source or data as they see fit. In practice this means forking: creating a new custom version of some software, or a modified (corrected, reformatted) version of a dataset.

The differences are in the infrastructure that supports the original works. The default practice in the open source world is that code will be:

  • published in a public repository
  • published with a complete version history (or at least versioning dating from its publication)
  • published in an environment that supports transparent reporting of issues, bugs and suggestions
  • published in an environment that includes good documentation tools, such as a wiki
  • and, most importantly, published in an environment that allows forks and improvements to be folded back into the original project
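To make the last point concrete for data, here is a minimal sketch in Python of what "fork, improve, fold back" might look like for a dataset that keeps its own version history. Everything in it is hypothetical and for illustration only: the `Dataset` class, its methods and the example school record are my own inventions, not the API of any real data portal or tool. It also assumes the original has not changed since the fork was taken, so there is no conflict handling.

```python
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A toy dataset (hypothetical): records keyed by identifier, plus a change history."""
    records: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

    def update(self, key, value, author, note):
        """Apply a change and record who made it and why."""
        self.records[key] = value
        self.history.append(
            {"key": key, "value": value, "author": author, "note": note}
        )

    def fork(self):
        """Create an independent copy that carries the full history with it."""
        return deepcopy(self)

    def merge_from(self, fork):
        """Fold back the changes a fork made after it was created.

        Assumption: the fork's history begins with this dataset's current
        history, i.e. the original has not changed since the fork was taken.
        Returns the number of changes that were applied.
        """
        new_changes = fork.history[len(self.history):]
        for change in new_changes:
            self.update(
                change["key"], change["value"], change["author"], change["note"]
            )
        return len(new_changes)


# The publisher releases a dataset with one (made-up) record.
original = Dataset()
original.update("school-001", {"name": "Hilltop Primary"}, "publisher", "initial release")

# A contributor forks it and corrects the record in their own copy.
my_fork = original.fork()
my_fork.update(
    "school-001", {"name": "Hilltop Primary School"}, "contributor", "corrected name"
)

# The correction is folded back into the original, along with its provenance.
merged = original.merge_from(my_fork)
```

The point of the sketch is that the correction arrives with its history attached: the original ends up knowing who changed the record and why, which is exactly the kind of insight most open data releases lack.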

I’d go as far as suggesting that each of these is as important to our modern experience and expectations of open source as the basic rights granted by open licences. Clearly, not all open source projects benefit from a community of contributors, but the infrastructure is there to enable it. I see moves in the open source community to make contributions easier and more welcome.

This isn’t the case with the majority of open data releases though. The current practice is that:

  • data is published by a single organisation
  • there is little insight into how the data was curated; at best there is some documentation
  • data portals provide some infrastructure, e.g. for issue reporting and documentation, but this is often limited in scope
  • data portals don’t provide any support for encouraging collaboration or external contributions

There are, of course, examples of open datasets that are created from collaborative models. These include OpenStreetMap, legislation.gov.uk and others. But these are currently the exceptions, rather than the norm. I’ve previously wondered whether we need more of these types of institution and incubators to support them.

Open source really came into the mainstream when commercial organisations started to adopt it not just as a way of releasing a work they had produced, but also embraced its collaborative aspects. Entire industries have now built up around open source projects that see organisations that compete in other areas collaborating on the common, core infrastructure.

While we should continue to urge commercial organisations to open up their existing assets, I think that the open data commons will really start to mature once we start adopting collaborative models. Which means the open data community needs to think about the tooling we need to enable that.

A “GitHub for data” might be a useful short-hand. But this would overlook the fact that modern open source development is now done in an ecosystem that consists of an extremely rich infrastructure: continuous integration tools, discovery tools, package managers, repositories, etc. GitHub is the platform within which these tools co-ordinate. There will also be challenges that are specific to open data, such as anonymisation, aggregation, registries, identifiers and more.