A key difference between open data and open source

In “left-pad and the data commons” I tried to identify some lessons for the open data community based on recent events in the JavaScript/npm world. Open source, open science and open data are all parts of the same endeavor of creating the commons. There’s a lot of fertile territory to be explored by looking at how those respective communities are operating, the infrastructure they’re building, and the kinds of issues that are being faced.

One thing that occurs to me is that there are currently some important differences between how open source and open data projects operate.

The similarities are obvious. Compare the key principles of the open source definition and the open definition, for example. Both share basic ideas: the ability to access the entirety of the source code or data (let’s call them “works”); the ability to create derived works; the right to distribute the works and derivatives; the ability to use the works for commercial and non-commercial purposes, etc.

The ability to create derived works means that anyone can also modify the source or data as they see fit. In practice this means forking: creating a new custom version of some software, or a modified (corrected, reformatted) version of a dataset.

The differences are in the infrastructure that supports the original works. The default practice in the open source world is that code will be:

  • published in a public repository
  • published with a complete version history (or at least versioning dating from its publication)
  • published in an environment that supports transparent reporting of issues, bugs and suggestions
  • published in an environment that includes good documentation tools, such as a wiki
  • and, most importantly, published in an environment that allows forks and improvements to be folded back into the original project

I’d go as far as suggesting that each of these is as important to our modern experience and expectations of open source as the basic rights granted by open licences. Clearly, not all open source projects benefit from a community of contributors, but the infrastructure is there to enable it. I see moves in the open source community to make contributions easier and more welcome.

This isn’t the case with the majority of open data releases though. The current practice is that:

  • data is published by a single organisation
  • there is little insight into how the data was curated; at best there is some documentation
  • data portals provide some infrastructure for, e.g. issue reporting and documentation, but this is often limited in scope
  • data portals don’t provide any support for encouraging collaboration or external contributions

There are, of course, examples of open datasets that are created through collaborative models. These include OpenStreetMap, legislation.gov.uk and others. But these are currently the exceptions, rather than the norm. I’ve previously wondered whether we need more of these types of institution, and incubators to support them.

Open source really came into the mainstream when commercial organisations not only adopted it as a way of releasing a work they had produced, but also embraced its collaborative aspects. Entire industries have now built up around open source projects in which organisations that compete in other areas collaborate on the common, core infrastructure.

While we should continue to urge commercial organisations to open up their existing assets, I think that the open data commons will really start to mature once we start adopting collaborative models. That means the open data community needs to think about the tooling we need to enable that.

A “GitHub for data” might be a useful shorthand. But this would overlook the fact that modern open source development is now done in an ecosystem that consists of an extremely rich infrastructure: continuous integration tools, discovery tools, package managers, repositories, etc. GitHub is the platform within which these tools co-ordinate. There will also be challenges that are specific to open data, such as anonymisation, aggregation, registries, identifiers and more.