Not Just Legislation: Sustainable Open Data Curation Projects

Francis Irving recently wrote an excited blog post about the open curation model that now backs the UK legislation project. It’s hard not to get excited about it. There’s been so much good work done on the project, and everyone involved has achieved a great deal of which they can be proud.

If you’re not familiar with the background then read through Irving’s blog post and look over these slides from a talk that John Sheridan and Jeni Tennison gave at Strata London last week. The project is a nice case study not just for the underlying technology but also for the application of open data in general.

However, while similarly excited by the project, I found myself disagreeing with Irving’s claim that it is “the world’s first REAL commercial open data curation project”. I suspect we actually agree on a lot of things, and disagree on a few details. But I think there are plenty of other examples and it’s instructive to look closely at them.

The Model

Firstly though, let’s briefly summarise the model. If I’m misrepresenting anything there, then please let me know in a comment!

  • The core asset being worked on is the UK legislation itself. This is available under the Open Government Licence, so no matter what commercial or organisational model underpins its curation, it’s free for anyone to use
  • The new curation model provides a means for commercial organisations to help maintain the corpus of legislation, e.g. to bring it up to date to reflect actual law. This is done under a Memorandum of Understanding, so there’s a direct relationship between the relevant organisations and the National Archives. Not just anyone can contribute
  • The financial contribution from the curators is in the form of labour: they are providing staff to work on the legislation
  • The National Archives save costs on maintaining the legislation
  • The commercial participants have a better asset upon which they can build new products; this covers not just the updated text, but its availability as Open Data via APIs, etc.
  • Everyone benefits from a more up to date, accurate and reference-able body of legislation. This includes not just the immediate participants but all of the downstream users, which includes lawyers, non-lawyer professionals and individual citizens

That’s a great model with some obvious tangible and intangible value being created and exchanged. But I think that there are some potential variations.

Characterising Variations of that Model

To help think about variations, let’s identify several different dimensions along which we might find them:

  • The Asset(s): what is being curated, is it primarily a dataset or is that a secondary by-product? It might be several things, the data might not even be the primary asset.
  • The Contributors: who is actually creating, delivering and maintaining the asset(s)? Can anyone contribute or are contributions limited to a particular group or type of participant?
  • The Consumers: who uses the asset? Is it the same group as contributes to its curation or is there a wider community? We might expect there to always be more consumers than contributors, particularly for a successful data project
  • The Financial Model: how is the work to curate the asset being supported? For a successful project the ongoing provision of the asset ought to be sustainable, but it might actually generate profits.
  • The Licensing: what form of licensing is associated with use of the asset(s)?
  • Loosely we might want to characterise the Incentives: what are the benefits for both the contributors and consumers of the data?

Now, I’m not suggesting that these are the only useful dimensions to consider, but I think these are the main ones. Hopefully it’s obvious how the legislation model can be characterised along these dimensions.
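As a rough illustration, the dimensions can be written down as a small data structure. This is only a sketch: the class name `CurationModel`, its field names, the extra `open_contribution` flag and the `summarise` helper are all invented for this example, and the values simply restate the summary of the legislation model given earlier.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CurationModel:
    """One curation project, described along the dimensions listed above.

    Hypothetical structure for illustration; not part of any existing tool.
    """
    name: str
    assets: str
    contributors: str
    open_contribution: bool  # assumption: a simple yes/no for "can anyone contribute?"
    consumers: str
    financial_model: str
    licensing: str
    incentives: str


def summarise(model: CurationModel) -> str:
    """Render one line per dimension, e.g. 'Licensing: ...'."""
    lines = []
    for f in fields(model):
        label = f.name.replace("_", " ").title()
        lines.append(f"{label}: {getattr(model, f.name)}")
    return "\n".join(lines)


# Values paraphrase the summary of the legislation model earlier in this post.
legislation = CurationModel(
    name="UK legislation",
    assets="The corpus of UK legislation",
    contributors="Commercial organisations under a Memorandum of Understanding",
    open_contribution=False,
    consumers="Lawyers, non-lawyer professionals and individual citizens",
    financial_model="Contributions in kind: curators provide staff",
    licensing="Open Government Licence",
    incentives="A more up to date, accurate and reference-able body of legislation",
)

print(summarise(legislation))
```

Writing each project down in the same shape makes it easier to compare them side by side, which is all the headings in the rest of this post are really doing.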

Using headings like this makes it easier to summarise in a blog post, but there are other techniques for teasing out forms of value creation and exchange. The one I’ve used successfully in the past is Value Network Analysis (VNA). In my dimensions above the Consumers and Contributors are the participants in the network, and the Financial Model and Incentives describe the tangible and intangible value being exchanged.
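To make that mapping a little more concrete, a value network can be thought of as a list of exchanges between participants. The sketch below is an assumption-laden illustration rather than a formal VNA: the `Exchange` type and `value_received` helper are invented here, and the tangible/intangible labels on each exchange are a judgment call made for the example, using the legislation model as input.

```python
from collections import defaultdict
from typing import NamedTuple


class Exchange(NamedTuple):
    """A single flow of value between two participants (hypothetical type)."""
    sender: str
    receiver: str
    deliverable: str
    tangible: bool  # assumption: classification is illustrative only


# Exchanges paraphrased from the summary of the legislation model.
exchanges = [
    Exchange("Commercial curators", "National Archives",
             "staff time to update legislation", True),
    Exchange("National Archives", "Commercial curators",
             "up to date legislation as Open Data via APIs", True),
    Exchange("National Archives", "Downstream users",
             "accurate, reference-able legislation", False),
]


def value_received(items):
    """Group deliverables by the participant who receives them."""
    received = defaultdict(list)
    for item in items:
        received[item.receiver].append(item.deliverable)
    return dict(received)


for participant, deliverables in value_received(exchanges).items():
    print(participant, "receives:", "; ".join(deliverables))
```

Listing the exchanges this way makes it easy to spot a participant who contributes but receives nothing, which is usually a sign that a model will struggle to sustain itself.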

I plan to blog more about VNA in the future when I share the analysis I’ve done around data marketplaces. But for the rest of the article I’m going to highlight a couple of examples that show some useful variations.


MusicBrainz

Let’s start with MusicBrainz. I’ve long used MusicBrainz as an example of a sustainable open data project as it has some nice characteristics.

  • Assets: The project has several products which includes some open source software. But the most significant asset is the MusicBrainz Database. The data is split into a core public domain portion, and a separately licensed set of supplementary data
  • Contributors: Anyone can sign up and contribute to both the data and the software, although there are some privileged editorial positions. While I believe the majority of the contributions come from the MusicBrainz community, there is at least one commercial curator: the BBC pay editorial staff to add to and update the database.
  • Consumers: Again, anyone can use the data. There are a lot of projects that use MusicBrainz data, some of which are commercial.
  • Financial Model: The project is supported in part by donations from users and businesses, and in part by commercial licensing of the Live Data Feed. The BBC are the most notable commercial licensee; Google the single largest donor. There is also some revenue from affiliate fees, etc. Some organisations have also contributed in kind, e.g. hardware or software services. The project finances are transparent if anyone wants to dig further.
  • Licensing: the core of the database is Public Domain. The rest is under a Creative Commons BY-NC-SA license.
  • Incentives: having an open music database provides lots of benefits for individuals and organisations building products around the data. The costs of building a dataset collaboratively are much lower than building and maintaining it independently. For organisations like the BBC, MusicBrainz provides an off-the-shelf asset that can be enriched directly by its editorial team or integrated into new products.


ORCID

The Open Researcher and Contributor ID (ORCID) project is a not-for-profit organisation that aims to provide “a registry of persistent unique identifiers for researchers and scholars and automated linkages to research objects such as publications, grants, and patents”.

It’s a fairly new venture but has been in incubation for some time. Over the last few years there has been lots of interest in having a shared open identifier to help link together research literature, and ORCID is one of the key projects that has crystallised out of those activities. It’s in the process of moving towards a production system. So, whereas MusicBrainz predates the legislation work, the ORCID system is not yet fully launched.

Let’s look at its model:

  • Assets: the primary asset is the database of researcher and contributor identifiers; the project software will also all be open source
  • Contributors: anyone will be able to use the website tools to create and manage their contributor identifier; there will also be ways for the project members to contribute directly to maintaining the data, e.g. to add new publication links. As noted in the principles, contributors will own their own data and profiles.
  • Consumers: broadly anyone can participate, but the expectation is that it will be of most value to individual researchers, publishers, and funding agencies
  • Financial Model: the ability to contribute data and use some of the basic data maintenance tools will be free. However additional services will only be available to paying members. This includes getting more timely access to updated data; notifications of data changes; etc. The project has been bootstrapped with support of a number of initial sponsors.
  • Licensing: the core database will be released on an annual basis under a CC0 license, placing it into the public domain.
  • Incentives: the broad incentive for all participants is to help bind together the research literature in a better way than is currently possible. Linking research to authors requires participants from across the whole publishing community, including the authors themselves. Using an open collaboration model ensures that everyone can engage with a minimum of cost. The publishers, who perhaps stand to gain most, will be bringing sustainability. The membership model has already proven to work in publishing with CrossRef, which is similarly structured.

ORCID is an interesting variation when contrasted with the legislation approach. Many aspects are similar: it is industry focused and is solving a known problem. The major financial contributions will come from commercial organisations.

There are also several differences. Firstly, the collaboration model is different: it’s not just commercial organisations that can contribute to the basic maintenance of the data, as researchers can manage their own profiles.

Secondly, the data licensing model is different. While the legislation project offers data under the OGL with free APIs, ORCID places data into the public domain but only plans to update data dumps annually. More frequent access to data requires use of the APIs, which are member services. This difference is clearly useful as a lever to encourage commercial organisations to sign up, which will directly contribute to the sustainability of the overall project.

Board Game Geek (and other crowd-sourcing examples)

I’ve purposefully chosen the next example because it has several different characteristics. Board Game Geek (BGG) is a community of… well… board game geeks! The site provides a number of different features, including a marketplace, but the core of the service is the database of board games, which is collaboratively maintained by the community. The database currently holds over 60,000 different games from well over 12,000 different publishers.

  • Assets: the primary asset is the database that backs the site. There are tips for mining data from the service as well as an API.
  • Contributors: anyone can sign up and contribute
  • Consumers: again, anyone with an interest in the data. I’ve not been able to identify any commercial users of the service
  • Financial Model: the site is supported by advertising and donations from the community (BGG Patrons). It’s possible to place adverts directly through the site, which might be a viable way for games publishers to connect with what appears to be a thriving community.
  • Licensing: the data licensing is actually unclear, although I’ve seen references to free re-use so long as the data is not re-published
  • Incentives: the service provides a focal point for a community, so maintaining the database benefits all the participants equally; access to the raw data allows people to build their own tools for working with the games data

Admittedly the credentials of BGG as an Open Data project are shakier than the other examples here: the licensing is unclear and what data dumps are available are unfortunately out of date.

But I’ve included it because the basic model that underpins the service is actually pretty common. I could have chosen several alternative examples.

The common aspects here are the open participation and sustainability via advertising, donations and (no doubt) ongoing support and engagement by the project leads. In each case the service addresses the needs and interests of a particular community. Licensing and access to data varies considerably. Commercial use of these datasets is either discouraged or needs up-front agreement.

I’ve previously approached the leads behind two of these projects to discuss whether either of them saw a data marketplace as a potential source of additional revenue. Neither was interested in exploring that further. We could draw any number of conclusions from that, but presumably they’re at least not struggling to maintain their current services.

In each of these cases the creation of the core database is the primary aspect of the service. But we can also find examples of where collaborative curation of data is happening as a secondary aspect of a service:

  • Discogs is a community of music collectors. Like MusicBrainz, that community has ended up curating a database of artists, releases and tracks. The original core of the site was a marketplace to support the buying and selling of records. The business model is based around advertising and commission on marketplace sales. The core database is available under a CC0 license via an API or monthly data dumps.
  • Bricklink is essentially an eBay for Lego. It generates revenue from commissions on sales and, like Discogs, along the way has produced a dataset that contains data on Lego bricks, sets, inventories, etc. The data can be downloaded and, while not explicitly licensed, I’ve been told by the maintainers that they just ask for attribution.

In both of these cases we can see that the crowd-sourcing has happened as a means to support another activity: creating a product marketplace. While the previous crowd-sourced databases are based on an “Ads + Donations” model, in these examples, sustainability is brought by the marketplace. The data will remain available and up-to-date so long as the marketplace remains active.


I think there are several conclusions to draw out from these examples.

Firstly, the important part of an open data curation project is not that it’s supported by commercial organisations; it’s the reliance on a sustainable model that will ensure the continued provision of the data. There are clearly plenty of different ways of doing this. I’ve written about various models for generating revenue from data in the past. Jeni Tennison has also shared some thoughts from a more public sector perspective. I suspect there are more that can be explored.

Secondly, the legislation project clearly isn’t the first example of a sustainable open data curation model, and it’s also not the first example of a commercially supported model. It’s pre-dated by MusicBrainz at least. But it is, to my knowledge, the first of its kind in the public sector. That’s a real innovation of which John Sheridan can be proud.

Finally, there’s clearly a lot more work that we can collectively do to collate examples of the various approaches to building sustainable businesses and collaboration models around Open Data. The right approach is likely to vary considerably based on the domain. It will be useful to understand the trade-offs.

This will provide necessary evidence and case studies to support the further exploration of Open Data releases and operating models in the public sector, and beyond.

But perhaps more importantly it will help provide people with examples of how sustainable and perhaps even profitable businesses can be built around collaborative curation of Open Data.

This is an area in which Data Marketplaces have a role to play. By offering the infrastructure to support data hosting, delivery and revenue collection, they can be platforms to support communities coming together to draw some real tangible value from collective curation of data.
