Lunchtime Lecture: “How you (yes, you) can contribute to open data”

The following is a written version of the lunchtime lecture I gave today at the Open Data Institute. I’ll put in a link to the video when it comes online. It’s not a transcript; I’m just writing down what I had planned to say.


I’m going to talk today about some of the projects that first got me excited about data on the web and open data specifically. I’m hopefully going to get you excited about them too. And show some ways in which you can individually get involved in creating some open data.

Open data is not (just) open government data

I’ve been reflecting recently about the shape of the open data community and ecosystem, to try and understand common issues and areas for useful work.

For example, we spend a lot of time focusing on Open Government Data. And so we talk about how open data can drive economic growth, create transparency, and be used to help tackle social issues.

But open data isn’t just government data. It’s a broader church that includes many different communities and organisations who are publishing and using open data for different purposes.

Open data is not (just) organisational data

More recently, as a community, we’ve focused some of our activism on encouraging commercial organisations to not just use open data (which many have been doing for years), but also to publish open data.

And so we talk about how open data can be supported by different business models and the need for organisational change to create more open cultures. And we collect evidence of impact to encourage more organisations to also become more open.

But open data isn’t just about data from organisations. Open data can be created and published by individuals and communities for their own needs and purposes.

Open data can (also) be a creative activity

Open data can also be a creative activity. A means for communities to collaborate around sharing what they know about a topic that is important or meaningful to them. Simply because they want to do it. I think sometimes we overlook these projects in the drive to encourage governments and other organisations to publish open data.

So I’m going to talk through eight (you said six in the talk, idiot! – Ed) different example projects. Some you will definitely have heard about before, but I suspect there will be a few that you haven’t. In most cases the primary goal of these projects is to create an openly licensed dataset. So when you contribute to the project, you’re directly helping to create more open data.

Of course, there are other ways in which we each contribute to open data. But these are often indirect contributions. For example, where our personal data that is held in various services is aggregated, anonymised and openly published. But today I want to focus on more direct contributions.

For each of the examples I’ve collected a few figures that indicate the date the project started, the number of contributors, and an indication of the size of the dataset. Hopefully this will help paint a picture of the level of effort that is already going into maintaining these resources. (Psst, see the slides for the figures – Ed)


The first example is Wikipedia. Everyone knows that anyone can edit Wikipedia. But you might not be aware that Wikipedia can be turned into structured data and used in applications. There are lots of projects that do it, for example DBpedia, which brings Wikipedia into the web of data.

The bits that are turned into structured data are the “infoboxes” that give you the facts and figures about the person (for example) that you’re reading about. So if you add to Wikipedia and specifically add to the infoboxes, then you’re adding to an openly licensed dataset.

The most obvious example of where this data is used is in Google search results. The infoboxes you see on search results whenever you google for a person, place or thing are partly powered by Wikipedia data.
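To give a rough feel for what “turning an infobox into structured data” means, here is a minimal Python sketch. The wikitext snippet is made up for illustration, and the `parse_infobox` function is my own hypothetical helper, not how DBpedia or any real extractor works. Real infoboxes use nested templates, links and multi-line values that this deliberately ignores.

```python
import re

# A made-up infobox in simplified wikitext, for illustration only.
SAMPLE_WIKITEXT = """{{Infobox writer
| name        = Example Author
| occupation  = Novelist
| nationality = British
| genre       = Children's fiction
}}"""


def parse_infobox(wikitext):
    """Extract `| key = value` pairs from a simple, flat infobox.

    Nested templates, wiki links and multi-line values are not handled.
    """
    fields = {}
    for line in wikitext.splitlines():
        # Match lines of the form "| key = value", trimming whitespace.
        match = re.match(r"^\|\s*([^=|]+?)\s*=\s*(.*?)\s*$", line)
        if match:
            key, value = match.groups()
            fields[key] = value
    return fields


if __name__ == "__main__":
    for key, value in parse_infobox(SAMPLE_WIKITEXT).items():
        print(f"{key}: {value}")
```

The result is a plain dictionary of facts, which is the kind of thing an application can query, link and reuse.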

A few years ago I added a Wikipedia page for Gordon Boshell, the author of some children’s books I loved as a kid. There wasn’t a great deal of information about him on the web, so I pulled whatever I could find together and created a page for him. Now when anyone searches for Gordon Boshell they can see some information about him right on Google. And they now link out to the books that he wrote. It’s nice to think that I’ve helped raise his profile.

There’s also a related project from the Wikimedia Foundation called Wikidata. Again, anyone can edit it, but it’s a database of facts and figures rather than an encyclopedia.


The second example is OpenStreetMap. You’ll definitely have already heard about its goal to create a crowd-sourced map of the world. OpenStreetMap is fascinating because it’s grown this incredible ecosystem of tools and projects that make it easier to contribute to the database.

I’ve recently been getting involved with contributing to OpenStreetMap. My initial impression was that I was probably going to have to get a commercial GPS and go out and do complicated surveying. But it’s not like that at all. It’s really easy to add points to the map, and to use their tools to trace buildings from satellite imagery. They provide great tutorials to help you get started.

It’s surprisingly therapeutic. I’ve spent a few evenings drinking a couple of beers and tracing buildings. It’s a bit like an adult colouring book, except you’re creating a better map of the world. Neat!

There are a variety of other tools that let you contribute to OpenStreetMap. For example Wheelmap allows you to add wheelchair accessibility ratings to locations on the map. We’ve been using this in the AccessibleBath project to help crowd-source data about wheelchair accessibility in Bath. One afternoon we got a group of around 25 volunteers together for a couple of hours and mapped 30% of the city centre.

There’s a lot of humanitarian mapping that happens using OpenStreetMap. If there’s been a disaster or a disease outbreak then aid workers often need better maps to help reach the local population and target their efforts. Missing Maps lets you take part in that. They have a really nice workflow that lets you contribute towards improving the map by tracing satellite imagery.

There’s a related project called MapSwipe. It’s a mobile application that presents you with a grid of satellite images. All you have to do is click the tiles which contain a building and then swipe left. Behind the scenes this data is used to direct Missing Maps volunteers towards the areas where more detailed mapping would be most useful. This focuses contributors’ attention where it’s most needed and so is really respectful of people’s time.

MapSwipe can also be used offline. So you can download a work package to do when you’re on your daily commute. Easy!


You’ve probably also heard of Zooniverse, which is my third example. It’s a platform for citizen science projects. That just means using crowd-sourcing to create scientific datasets.

Their most famous project is probably GalaxyZoo which asked people to help classify objects in astronomical imagery. But there are many other projects. If you’re interested in biology then perhaps you’d like to help catalogue specimens held in the archives of the Natural History Museum?

Or there’s Old Weather, which I might get involved with. In that project you can help to build a picture of our historical climate by transcribing the weather reports that whaling ship captains wrote in their logs. By collecting that information we can build a dataset that tells us more about our climate.

I think it’s a really innovative way to use historical documents.


This is my fourth and favourite example. MusicBrainz is a database of music metadata: information about artists, albums, and tracks. It was created in direct response to commercial music databases that were asking people to contribute to their dataset, but then were taking all of the profits and not returning any value to the community. MusicBrainz created a free, open alternative.

I think MusicBrainz is the first open dataset I got involved with. I wrote a client library to help developers use the data. (14 years ago, and you’re still talking about it – Ed)

MusicBrainz has also grown a commercial ecosystem around it, which has helped it be sustainable. There are a number of projects that use the dataset, including Spotify. And it’s been powering the BBC Music website for about ten years.


My fifth example, Discogs, is also a music dataset. But it’s a dataset about vinyl releases. So it focuses more on the releases, labels, etc. Discogs is a little different because it started as, and still is, a commercial service. At its core it’s a marketplace for record collectors. But to power that marketplace you need a dataset of vinyl releases. So they created tools to help the community build it. And, over time, it’s become progressively more open.

Today all of the data is in the public domain.


My sixth example is OpenPlaques. It’s a database of the commemorative plaques that you can see dotted around on buildings and streets. The plaques mark that an important event happened in that building, or that someone famous was born or lived there. Volunteers take photos of the plaques and share them with the service, along with the text and names of anyone who might be mentioned in the plaque.

It provides a really interesting way to explore the historical information in the context of cities and buildings. All of the information is linked to Wikipedia so you can find out more information.


My seventh example is Rebrickable which you’re unlikely to have heard about. I’m cheating a little here as it’s a service and not strictly a dataset. But it’s Lego, so I had to include it.

Rebrickable has a big database of all the official Lego sets and what parts they contain. If you’re a fan of Lego (they’re called AFOLS – Ed) and design and create your own custom Lego models (they’re known as MOCS – Ed) then you can upload the design and instructions to the service in machine-readable LEGO CAD formats.

Rebrickable exposes all of the information via an API under a liberal licence. So people can build useful tools. For example using the service you can find out which other official and custom sets you can build with bricks you already own.
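The “what can I build with the bricks I already own?” question is a nice example of what open, structured data makes possible. Here’s a small Python sketch of the idea. The part names, quantities and set names are all invented for illustration, and this doesn’t call the Rebrickable API or reflect how their service is actually implemented. It just treats each inventory as a bag of parts and checks whether one bag fits inside another.

```python
from collections import Counter

# Hypothetical part inventories (part ID -> quantity). Not real Rebrickable data.
MY_BRICKS = Counter({"2x4-red": 6, "2x2-blue": 4, "1x2-white": 10, "wheel": 4})

SETS = {
    "Small Car": Counter({"2x4-red": 2, "1x2-white": 4, "wheel": 4}),
    "Tower": Counter({"2x4-red": 8, "2x2-blue": 2}),
    "Flag": Counter({"2x2-blue": 4, "1x2-white": 6}),
}


def buildable_sets(owned, sets):
    """Return the names of sets whose every part is covered by `owned`."""
    return [
        name
        for name, parts in sets.items()
        # Counter subtraction drops zero and negative counts, so an
        # empty result means no parts are missing.
        if not (parts - owned)
    ]


if __name__ == "__main__":
    print(buildable_sets(MY_BRICKS, SETS))
```

With the made-up data above, the “Tower” is out of reach because it needs eight red 2x4 bricks and I only own six.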

Grand Comics Database

My eighth and final example is the Grand Comics Database. It’s also the oldest project as it was started in 1994. The original creators started with desktop tools before bringing it to the web.

It’s a big database of 1.3m comics. It contains everything from The Dandy and The Beano through to Marvel and DC releases. It’s not just data on the comics, but also story arcs, artists, authors, etc. If you love comics you’ll love GCD. I checked and this week’s 2000AD (published 2 days ago – Ed) is in there already.

So those are my examples of places where you could contribute to open data.

Open data is an enabler

The interesting thing about them all is that open data is an enabler. Open data isn’t creating economic growth, or being used as a business model. Open licensing is being applied as a tool.

It creates a level playing field that means that everyone who contributes has an equal stake in the results. If you and I both contribute then we can both use the end result for any purpose. A commercial organisation is not extracting that value from us.

Open licensing can help to encourage people to share what they know, which is the reason the web exists.

Working with data

The projects are also great examples of ways of working with data on the web. They’re all highly distributed projects, accepting submissions from people internationally who will have very different skill sets and experience. That creates a challenge that can only be dealt with by having good collaboration tools and by having really strong community engagement.

Understanding how and why people contribute to your open database is important, because often those reasons will change over time. When OpenStreetMap had just started, contributors had the thrill of filling in a blank map with data about their local area. But now contributions are different. It’s more about maintaining data and adding depth.

Collaborative maintenance

In the open data community we often talk about making things open to make them better. It’s the tenth GDS design principle. And making data open does make it better in the sense that more people can use it. And perhaps more eyes can help spot flaws.

But if you really want to let people help make something better, then you need to put your data into a collaborative environment. Then the data can get better at the pace of the community, rather than at the pace of your ability to accept feedback.

It’s not work if you love it

Hopefully the examples give you an indication of the size of these communities and how much has been created. It struck me that many of them have been around since the early 2000s. I’ve not really found any good recent examples (Maybe people can suggest some – Ed). I wonder why that is?

Most of the examples were born around the Web 2.0 era (Mate. That phrase dates you. – Ed) when we were all excitedly contributing different types of content to different services. Bookmarks and photos and playlists. But now we mostly share things on social media. It feels like we’ve lost something. So it’s worth revisiting these services to see that they still exist and that we can still contribute.

While these fan communities are quietly hard at work, maybe we in the open data community can do more to support them?

There are a lot of examples of “open” datasets that I didn’t use because they’re not actually open. The licences are restrictive. Or the community has decided not to think about it. Perhaps we can help them understand why being a bit more open might be better?

There are also examples of openly licensed content that could be turned into more data. Take Wikia for example. It contains 360,000 wikis all with openly licensed content. They get 190m views a month and the system contains 43 million pages. That’s about the same size as the English version of Wikipedia is currently. They’re all full of infoboxes that are crying out to be turned into structured data.

I think it’d be great to make all this fan-produced data a proper part of the open data commons, sitting alongside the government and organisational datasets that are being published.

Thank you (yes, you!)

That’s the end of my talk. I hope I’ve piqued your interest in looking at one or more of these projects in more detail. Hopefully there’s a project that will help you express your inner data geek.

Photo Attributions

Lego Spaceman, Edwin Andrade, Jamie Street, Olu Elet, Aaron Burden, Volkan Olmez, Alvaro, Jordan Whitfield, Anthony DELANOIX