Lunchtime Lecture: “How you (yes, you) can contribute to open data”

The following is a written version of the lunchtime lecture I gave today at the Open Data Institute. I’ll add a link to the video when it comes online. It’s not a transcript; I’m just writing down what I had planned to say.

Hello!

I’m going to talk today about some of the projects that first got me excited about data on the web, and about open data specifically. I’m hopefully going to get you excited about them too, and show some ways in which you can individually get involved in creating open data.

Open data is not (just) open government data

I’ve been reflecting recently about the shape of the open data community and ecosystem, to try and understand common issues and areas for useful work.

For example, we spend a lot of time focusing on Open Government Data. And so we talk about how open data can drive economic growth, create transparency, and be used to help tackle social issues.

But open data isn’t just government data. It’s a broader church that includes many different communities and organisations who are publishing and using open data for different purposes.

Open data is not (just) organisational data

More recently, as a community, we’ve focused some of our activism on encouraging commercial organisations to not just use open data (which many have been doing for years), but also to publish open data.

And so we talk about how open data can be supported by different business models and the need for organisational change to create more open cultures. And we collect evidence of impact to encourage more organisations to also become more open.

But open data isn’t just about data from organisations. Open data can be created and published by individuals and communities for their own needs and purposes.

Open data can (also) be a creative activity

Open data can also be a creative activity. A means for communities to collaborate around sharing what they know about a topic that is important or meaningful to them. Simply because they want to do it. I think sometimes we overlook these projects in the drive to encourage governments and other organisations to publish open data.

So I’m going to talk through eight (you said six in the talk, idiot! – Ed) different example projects. Some you will have definitely heard about before, but I suspect there will be a few that you haven’t. In most cases the primary goals of these projects are to create an openly licensed dataset. So when you contribute to the project, you’re directly helping to create more open data.

Of course, there are other ways in which we each contribute to open data. But these are often indirect contributions. For example, where our personal data that is held in various services is aggregated, anonymised and openly published. But today I want to focus on more direct contributions.

For each of the examples I’ve collected a few figures that indicate the date the project started, the number of contributors, and an indication of the size of the dataset. Hopefully this will help paint a picture of the level of effort that is already going into maintaining these resources. (Psst, see the slides for the figures – Ed)

Wikipedia

The first example is Wikipedia. Everyone knows that anyone can edit Wikipedia. But you might not be aware that Wikipedia can be turned into structured data and used in applications. There are lots of projects that do this, e.g. DBpedia, which brings Wikipedia into the web of data.

The bits that are turned into structured data are the “infoboxes” that give you the facts and figures about the person (for example) that you’re reading about. So if you add to Wikipedia, and specifically to the infoboxes, then you’re adding to an openly licensed dataset.
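If you want to see what that structured data looks like, here’s a minimal sketch (in Python, using the requests library and DBpedia’s public SPARQL endpoint) that pulls an infobox-derived fact back out of DBpedia. The entity and property are just examples:

```python
import requests

# Ask DBpedia for an infobox-derived fact: Tim Berners-Lee's date of birth
query = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?birthDate WHERE { dbr:Tim_Berners-Lee dbo:birthDate ?birthDate }
"""

resp = requests.get(
    "https://dbpedia.org/sparql",
    params={"query": query, "format": "application/sparql-results+json"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["birthDate"]["value"])
```

That birth date started life as a line in a Wikipedia infobox that somebody, somewhere, typed in.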

The most obvious example of where this data is used is in Google search results. The infoboxes you see in search results whenever you google for a person, place or thing are partly powered by Wikipedia data.

A few years ago I added a Wikipedia page for Gordon Boshell, the author of some children’s books I loved as a kid. There wasn’t a great deal of information about him on the web, so I pulled together whatever I could find and created a page for him. Now when anyone searches for Gordon Boshell they can see some information about him right on Google. And the results now link out to the books that he wrote. It’s nice to think that I’ve helped raise his profile.

There’s also a related project from the Wikimedia Foundation called Wikidata. Again, anyone can edit it, but it’s a database of facts and figures rather than an encyclopedia.

OpenStreetMap

The second example is OpenStreetMap. You’ll definitely have already heard about its goal to create a crowd-sourced map of the world. OpenStreetMap is fascinating because it’s grown this incredible ecosystem of tools and projects that make it easier to contribute to the database.

I’ve recently been getting involved with contributing to OpenStreetMap. My initial impression was that I was probably going to have to get a commercial GPS and go out and do complicated surveying. But it’s not like that at all. It’s really easy to add points to the map, and to use their tools to trace buildings from satellite imagery. They provide great tutorials to help you get started.

It’s surprisingly therapeutic. I’ve spent a few evenings drinking a couple of beers and tracing buildings. It’s a bit like an adult colouring book, except you’re creating a better map of the world. Neat!

There are a variety of other tools that let you contribute to OpenStreetMap. For example Wheelmap allows you to add wheelchair accessibility ratings to locations on the map. We’ve been using this in the AccessibleBath project to help crowd-source data about wheelchair accessibility in Bath. One afternoon we got a group of around 25 volunteers together for a couple of hours and mapped 30% of the city centre.

There’s a lot of humanitarian mapping that happens using OpenStreetMap. If there’s been a disaster or a disease outbreak then aid workers often need better maps to help reach the local population and target their efforts. Missing Maps lets you take part in that. They have a really nice workflow that lets you contribute towards improving the map by tracing satellite imagery.

There’s a related project called MapSwipe. It’s a mobile application that presents you with a grid of satellite images. All you have to do is tap the tiles which contain a building and then swipe left. Behind the scenes this data is used to direct Missing Maps volunteers towards the areas where more detailed mapping would be most useful. This focuses contributors’ attention where it’s best needed, and so is really respectful of people’s time.

MapSwipe can also be used offline. So you can download a work package to do when you’re on your daily commute. Easy!

Zooniverse

You’ve probably also heard of Zooniverse, which is my third example. It’s a platform for citizen science projects. That just means using crowd-sourcing to create scientific datasets.

Their most famous project is probably GalaxyZoo which asked people to help classify objects in astronomical imagery. But there are many other projects. If you’re interested in biology then perhaps you’d like to help catalogue specimens held in the archives of the Natural History Museum?

Or there’s Old Weather, which I might get involved with. In that project you can help to build a picture of our historical climate by transcribing the weather reports that whaling ship captains wrote in their logs. By collecting that information we can build a dataset that tells us more about our climate.

I think it’s a really innovative way to use historical documents.

MusicBrainz

This is my fourth and favourite example. MusicBrainz is a database of music metadata: information about artists, albums, and tracks. It was created in direct response to commercial music databases that were asking people to contribute to their dataset, but were then taking all of the profits and not returning any value to the community. MusicBrainz created a free, open alternative.

I think MusicBrainz was the first open dataset I got involved with. I wrote a client library to help developers use the data. (14 years ago, and you’re still talking about it – Ed)

MusicBrainz has also grown a commercial ecosystem around it, which has helped it be sustainable. There are a number of projects that use the dataset, including Spotify. And it’s been powering the BBC Music website for about ten years.

Discogs

My fifth example, Discogs, is also a music dataset. But it’s a dataset about vinyl releases, so it focuses more on the releases, labels, etc. Discogs is a little different because it started as, and still is, a commercial service. At its core it’s a marketplace for record collectors. But to power that marketplace you need a dataset of vinyl releases. So they created tools to help the community build it. And, over time, it’s become progressively more open.

Today all of the data is in the public domain.

OpenPlaques

My sixth example is OpenPlaques. It’s a database of the commemorative plaques that you can see dotted around on buildings and streets. The plaques mark that an important event happened in that building, or that someone famous was born or lived there. Volunteers take photos of the plaques and share them with the service, along with the text and names of anyone who might be mentioned in the plaque.

It provides a really interesting way to explore historical information in the context of cities and buildings. Everything is linked to Wikipedia so you can find out more.

Rebrickable

My seventh example is Rebrickable, which you’re unlikely to have heard about. I’m cheating a little here as it’s a service and not strictly a dataset. But it’s Lego, so I had to include it.

Rebrickable has a big database of all the official Lego sets and what parts they contain. If you’re a fan of Lego (they’re called AFOLs – Ed) who designs and creates custom Lego models (they’re known as MOCs – Ed), then you can upload the designs and instructions to the service in machine-readable LEGO CAD formats.

Rebrickable exposes all of the information via an API under a liberal licence. So people can build useful tools. For example using the service you can find out which other official and custom sets you can build with bricks you already own.
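As a hedged sketch of what that looks like (the endpoint layout reflects my reading of the v3 API, and the key and set number are placeholders; check their docs before relying on it), here’s how you might list the parts in an official set:

```python
import requests

API_KEY = "your-api-key"  # placeholder: free keys are available from rebrickable.com
SET_NUM = "75192-1"       # an example official set number

resp = requests.get(
    f"https://rebrickable.com/api/v3/lego/sets/{SET_NUM}/parts/",
    headers={"Authorization": f"key {API_KEY}"},
)
resp.raise_for_status()

# The response is paginated; "results" holds the parts on the current page
for item in resp.json()["results"]:
    print(item["quantity"], item["part"]["name"])
```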

Grand Comics Database

My eighth and final example is the Grand Comics Database. It’s also the oldest project as it was started in 1994. The original creators started with desktop tools before bringing it to the web.

It’s a big database of 1.3m comics. It contains everything from The Dandy and The Beano through to Marvel and DC releases. It’s not just data on the comics, but also story arcs, artists, authors, etc. If you love comics you’ll love GCD. I checked, and this week’s 2000AD (published 2 days ago – Ed) is in there already.

So those are my examples of places where you could contribute to open data.

Open data is an enabler

The interesting thing about them all is that open data is an enabler. Open data isn’t creating economic growth, or being used as a business model. Open licensing is being applied as a tool.

It creates a level playing field that means that everyone who contributes has an equal stake in the results. If you and I both contribute then we can both use the end result for any purpose. A commercial organisation is not extracting that value from us.

Open licensing can help to encourage people to share what they know, which is the reason the web exists.

Working with data

The projects are also great examples of ways of working with data on the web. They’re all highly distributed projects, accepting submissions from people internationally who will have very different skill sets and experience. That creates a challenge that can only be dealt with by having good collaboration tools and by having really strong community engagement.

Understanding how and why people contribute to your open database is important, because often those reasons will change over time. When OpenStreetMap had just started, contributors had the thrill of filling in a blank map with data about their local area. But now contributions are different. It’s more about maintaining data and adding depth.

Collaborative maintenance

In the open data community we often talk about making things open to make them better. It’s the tenth GDS design principle. And making data open does make it better, in the sense that more people can use it. And perhaps more eyes can help spot flaws.

But if you really want to let people help make something better, then you need to put your data into a collaborative environment. Then the data can get better at the pace of the community, not at the pace of your ability to accept feedback.

It’s not work if you love it

Hopefully the examples give you an indication of the size of these communities and how much has been created. It struck me that many of them have been around since the early 2000s. I’ve not really found any good recent examples (Maybe people can suggest some – Ed). I wonder why that is?

Most of the examples were born around the Web 2.0 era (Mate. That phrase dates you. – Ed) when we were all excitedly contributing different types of content to different services. Bookmarks and photos and playlists. But now we mostly share things on social media. It feels like we’ve lost something. So it’s worth revisiting these services to see that they still exist and that we can still contribute.

While these fan communities are quietly hard at work, maybe we in the open data community can do more to support them?

There’s a lot of examples of “open” datasets that I didn’t use because they’re not actually open. The licenses are restrictive. Or the community has decided not to think about it. Perhaps we can help them understand why being a bit more open might be better?

There are also examples of openly licensed content that could be turned into more data. Take Wikia, for example. It contains 360,000 wikis, all with openly licensed content. They get 190m views a month and the system contains 43 million pages, which is about the same size as the English version of Wikipedia currently. They’re all full of infoboxes that are crying out to be turned into structured data.

I think it’d be great to make all this fan-produced data a proper part of the open data commons, sitting alongside the government and organisational datasets that are being published.

Thank you (yes, you!)

That’s the end of my talk. I hope I’ve piqued your interest in looking at one or more of these projects in more detail. Hopefully there’s a project that will help you express your inner data geek.

Photo Attributions

Lego Spaceman, Edwin Andrade, Jamie Street, Olu Elet, Aaron Burden, Volkan Olmez, Alvaro Serrano, RawPixel.com, Jordan Whitfield, Anthony DELANOIX


Where can you contribute to open data? Yes, you!

This is just a quick post to gather together some pointers and links that were shared in answer to a question I asked on Twitter yesterday.

I want to try out a bunch of different services to explore how easy it is for people to contribute to open data projects. Because I’m interested in how we can contribute as individuals, I’m ruling out things like government open data portals. They’re not typically places where mere punters like you or I can contribute.

I’m also interested in sites that generate open data. Not public data. There needs to be an open licence on the results. Or, at the very least, a note along the lines of: “do whatever you want with this”.

I’m thinking more of places where we can collaborate around creating open data.

The short list

Here’s a quick list of the suggestions, along with a few I’d already turned up. I’m sure there are a lot more. Please leave a comment or ping me on twitter if you have suggestions. And yes, I’ll turn this into data at some point.

  1. OpenStreetMap was the starter for ten. I’ve already written about a number of ways you can contribute to the effort
  2. Discogs, contribute to their public domain database
  3. Wikipedia, content in infoboxes is presented as data via dbpedia and wikidata
  4. You can also contribute directly to Wikidata
  5. MusicBrainz is completely crowd-sourced
  6. You can contribute company information to OpenCorporates
  7. Questions you answer on Stackoverflow end up as open data
  8. DemocracyClub are doing an awesome job of co-ordinating crowd-sourced data collection that the UK government should just be doing itself
  9. The product data you add to OpenFoodFacts is open
  10. It looks like you can contribute Creative Commons licensed content and data to the Encyclopedia of Life
  11. OpenPlaques is open to contributions
  12. The Quick, Draw! data from Google is actually open. Google seem to be opening up more of their research data
  13. ESRI are building some crowdsourcing apps, which generate open data
  14. If you’re in Germany and have some sensor data, you can feed it into OpenSenseMap. Their data dumps are in the public domain

What else should be on this list?

Disqualifications

There were also a number of sites that were suggested, or which I considered, but had to be rejected. Mostly because they’re not actually publishing open data. They either have restrictions on usage, or the licensing is very unclear. If you can clarify any of these then let me know.

Clearly there are hundreds of non-open databases, but do let me know if I’m wrong about any of the above, and I’ll amend the article accordingly.

Can you publish tweets as open data?

Can you publish data from twitter as open data? The short answer is: No. Read on for some notes, pointers and comments.

Twitter’s developer policy places a number of restrictions on your use of their API and the data you get from it. Some of the key ones are:

  • In the Restrictions on Use of Licensed Materials (II.C) they restrict what you can do with geographic data from the platform. You can only use it to identify the location from which a tweet was made, and not for any other purpose. You also can’t aggregate or cache it, unless you’re storing it with the rest of the tweet. And elsewhere they place further restrictions on storage of tweets. They reiterate this in section B.9
  • Section F.2 “Be a Good Partner to Twitter” (sic) is the key one for data, as here you’re agreeing not to store anything except the ID for a tweet. You can’t store the message, its metadata or anything about the user, just the ID.
  • You are allowed to make those IDs downloadable in various ways but there are restrictions on how many tweets you can publish per user, per day
  • In the Ownership and Feedback section, they make it clear that the only rights you have to use content are derived from this agreement and those rights can be taken away at any time.
  • Anyone that you distribute data to must also agree to ALL of Twitter’s terms, not just the developer policy, but its general terms of service, privacy policy, etc. So everyone’s agreements can be revoked at any time.

That’s a very closed set of terms.

There’s some great analysis of the terms and what they mean for researchers elsewhere. Ernesto Priego has an interesting pair of posts looking at twitter as public evidence and the ethics of twitter research and why you might want to archive and share small twitter datasets.

Ed Summers has also written about archiving twitter datasets and the process of “hydrating” a twitter ID to turn it back into useful content. There’s a whole set of APIs, tools and practices that have built up around the process of hydration as a means to work around Twitter’s terms. I think it’s interesting as an example of using a combination of data and open source to address licensing limitations.
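As an illustration, here’s roughly what hydration looks like using Ed Summers’ twarc library. This is a sketch assuming twarc 1.x and a set of Twitter API credentials; the credential strings and filename are placeholders:

```python
from twarc import Twarc

# Placeholder credentials: twarc needs Twitter API keys to call the lookup API
t = Twarc("consumer_key", "consumer_secret", "access_token", "access_token_secret")

# ids.txt holds one tweet ID per line; hydrate() fetches the full tweets in batches
with open("ids.txt") as f:
    for tweet in t.hydrate(line.strip() for line in f):
        print(tweet["id_str"], tweet.get("full_text", tweet.get("text")))
```

The point is that the shared artefact is just a list of IDs; everyone who wants the actual content has to fetch it from Twitter themselves, under Twitter’s terms.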

Yesterday, Justin Littman published a short piece highlighting how Twitter have just further restricted their terms. The key changes are around placing upper limits on how many tweet IDs you can distribute. The changes raise concerns about how archival projects like DocNow can continue. Although on my reading of the terms, those projects were already in question, as Twitter doesn’t grant you the rights to re-publish data under anything other than its own terms. I think those datasets were already in breach of the agreement.

So, we get to our answer: no, you can’t publish anything from Twitter under an open licence. If you’re intending to do this in a project then I recommend you get approval from Twitter directly.

Obviously these terms are designed for Twitter’s sole benefit. They help it retain as much value as possible while still operating as a platform. Data asymmetry in action.

I think what’s particularly frustrating is that they seem to rarely enforce these terms, even for services that clearly breach them. After crafting a legal agreement they choose not to actively police it, because it’s not worth their time to do so. Presumably they will step in if there are large scale, significant breaches. But it makes you wonder how much value is really being protected.

In the meantime we are left with areas of doubt and uncertainty. Does the continued existence of a service mean it’s an exemplar of acceptable practice, or are Twitter just choosing to ignore it? And this starts to poison the well of open data. A more open approach would be for them to offer some allowance for small scale archiving and data sharing. Openly licensing tweet IDs would be a start.

For better or worse Twitter’s data has a role in helping us understand modern society, so we should be able to use it. Unfortunately their donation of the twitter archive to the Library of Congress is floundering because of a mixture of technical and legal issues. Twitter is not really a public space. It’s a private hall where we choose to meet.

Addendum

A couple of final extra points based on comments on this post (see below) and on twitter. Ed Summers rightly pointed out that services that are seemingly breaching Twitter’s terms may in fact have permission to do so. In fact a couple of examples came up.

Andy Piper (Twitter Dev lead) notes that Twitter have posted a policy update clarification.

The clarification explains that developers can request permission to share more than 1.5m tweet IDs in a 30-day period. It also notes that researchers from “an accredited academic institution” can share an unlimited number of tweet IDs. This lifts some of the restrictions on distribution, but also reinforces some of the key points I make above: any use of the data remains subject to Twitter’s policies. By default data from Twitter can’t be published as open data. But if you’re willing to pay then it looks like Twitter are willing to share more widely.

Joe Wass from CrossRef explained that they’ve had explicit permission from Twitter to distribute Tweet IDs under a CC0 waiver within their Event Data service.

CrossRef negotiated this permission as part of their commercial arrangement with Twitter. This means that at least some Tweet IDs can be considered to be in the public domain. It just depends on where you got them from: the Twitter API or CrossRef.

Enabling data forensics

I’m interested in how people share information, particularly data, on social networks. I think it’s something to which it’s worth paying attention, so we can ensure that it’s easy for people to share insights and engage in online debates.

There’s lots of discussion at the moment around fact checking and similar ways that we can improve the ability to identify reliable and unreliable information online. But there may be other ways that we can make some small improvements in order to help people identify and find sources of data.

Data forensics is a term that usually refers to analysis of data to identify illegal activities. But the term does have a broader meaning that encompasses “identifying, preserving, recovering, analyzing, and presenting attributes of digital information”. So I’m going to appropriate the term to put a label on a few ideas.

The design of the Twitter and Facebook platforms constrain how we can share information. Within those constraints people have, inevitably, adopted various patterns that allow them to publish and share content in preferred ways. For example, information might be shared:

  1. As a link to a page, where the content of the tweet or post is just the title
  2. As a link to a page, but with a comment and/or hashtags for context
  3. As a screenshot, e.g. of some text, chart or something. This usually has some commentary attached. Some apps enable this automatically, allowing you to share a screenshot of some highlighted text
  4. As images and photographs, e.g. of printed page or report (or even sometimes a screenshot of text from another app)

In the first two examples there are always links that allow someone to go and read the original content. In fact that seems to be the typical intention: go read (or watch) this thing.

The other two examples are usually workarounds for the fact that it’s often hard to deep link to a section of a page or video.

Sometimes it’s just not possible, because the information of interest isn’t in a bookmarkable section of a page. Or perhaps the user doesn’t know how to create that kind of deep link. Or they may be further constrained by a mobile app or other service that restricts their ability to easily share a link. Not every application lets the web happen.

In some cases screenshotting may also be a conscious choice, e.g. posting a photo of someone’s tweet because you don’t want to directly interact with them.

Whatever the reason, this means there is usually no link in the resulting post. Which often makes it difficult for a reader to find the original content. While social media is reducing friction in sharing, it’s increasing friction around our ability to check the reliability and accuracy of what’s been shared.

If you tweet out a graph with some figures in a debate, I want to know where it’s come from. I want to see the context that goes with it. The ability to easily identify the source of shared content is, I think, part of “data forensics”.

So, what can we do to fix this?

Firstly, there’s more that could be done to build better ways to deep link into pages, e.g. to allow sharing of individual page elements. But people have been trying to do that on and off for years without much visible success. It’s a hard problem, particularly if you want to allow someone to link to a piece of text. It could be time for a standards body to have another crack at it. Or I might have missed some exciting progress, so please tell me if I have! But I think something like this would need some serious push behind it. You need support from not just web frameworks and the major CMS platforms, but also (probably) browser vendors.

Secondly, Twitter and Facebook could allow us some more flexibility. For example, allow apps to post additional links and/or other metadata that are then attached to posts and tweets. It won’t address every scenario, but it could help. It also feels like a relatively easy thing for them to do, as it’s a natural extension of some existing features.

Thirdly, we could look at ways to attach data to the images people are posting, regardless of what the platforms support. I’ve previously wondered about using XMP packets to attach provenance and attribution information to images. Unfortunately it doesn’t work for every format and it turns out that most platforms strip embedded metadata anyway. This is presumably due to reasonable concerns around privacy, but they could still white-list some metadata. We could maybe use steganography too.
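For example, here’s a speculative sketch of attaching a provenance link to an image’s XMP packet, using Python to drive the exiftool command-line tool (assumed to be installed; the URL and filename are hypothetical, and “source” is one of the standard Dublin Core fields):

```python
import subprocess

# Write a provenance URL into the image's XMP packet (Dublin Core "source" field)
subprocess.run(
    ["exiftool", "-XMP-dc:Source=https://example.org/report#figure-3", "chart.png"],
    check=True,
)

# Read it back to confirm the metadata was written
subprocess.run(["exiftool", "-XMP-dc:Source", "chart.png"], check=True)
```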

But the major downside here is that you’d need a custom social media client or browser extension to let you see and interact with the data. So, again, that’s a massive deployment issue.

As things currently stand I think the best approach is to plan for visualisations and information to be shared, and design the interactions and content accordingly. Assume that your carefully crafted web page is going to be shared in a million different pieces. Which means that you should:

  • Include plenty of in-page anchors and use clear labelling to help people build links to relevant sections
  • Adapt your social media sharing buttons to not just link to the whole page, but also allow the user to share a link to a specific section (there’s a sketch of building such a link after this list)
  • Design your twitter cards and other social metadata, for example is there a key graphic that would be best used as the page image?
  • Include links and source information on all of the graphs and infographics that you share. Make sure the link is short and persistent in case it has to be re-keyed from a screenshot
  • Provide direct ways to tweet and share out a graph that will automatically include a clearly labelled image that contains a link
  • Help users cite their sources
  • …etc
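To make the section-sharing idea concrete, here’s a small sketch that builds a tweet link pointing at an in-page anchor, using Twitter’s web intent URLs (the page, anchor and text are hypothetical):

```python
from urllib.parse import quote

page = "https://example.org/atm-report"  # hypothetical report page
anchor = "branch-closures"               # id of an in-page anchor
text = "Branch closures by region"

# Deep link to the section, then wrap it in a Twitter web intent URL
deep_link = f"{page}#{anchor}"
share_url = (
    "https://twitter.com/intent/tweet"
    "?text=" + quote(text) + "&url=" + quote(deep_link, safe="")
)
print(share_url)
```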

What do you think? Any tips or suggestions you’d add to this list? With a bit of awareness around how data is shared, we might be able to make small improvements to online discussions.

Adventures in geodata

I spend a lot of my professional life giving people advice. Mostly around how to publish and use open data. In order to make sure I give people the best advice I can, I try and spend a lot of time actually publishing and using open data. A mixture of research and practical work is the best way I’ve found of improving my own open data practice. This is one of the reasons I run Bath: Hacked, continue to work at the Open Data Institute, and like to stay hands-on with data projects.

Amongst my goals for this year was to spend time learning some new skills. For example, I’ve not been involved in running a crowd-sourcing project, but now have that underway with Accessible Bath.

And, while I’ve done some work with geographic data, until recently I hadn’t really spent any time contributing to OpenStreetMap or exploring its ecosystem. But I’ve spent the last couple of months fixing that by immersing myself in its community and tools. In this blog post I wanted to share some of the things I’ve learnt. It’s been really fascinating and, as I’d hoped, it’s given me a new perspective on a number of issues.

Finding my way

To begin with I looked around for some online tutorials. While I knew that OpenStreetMap allowed anyone to contribute, I wasn’t really sure how I could go about doing that. I had a bunch of questions such as:

  • Did I need a dedicated GPS device or could I collect data with my phone? (Answer: you can use your phone)
  • Did I need to go out with a clipboard and do a formal survey or are there other ways to contribute? (Answer: you can contribute in a lot of different ways)
  • How do you actually go about editing the map, what tools do you need to use? (Answer: however you feel comfortable)
  • How do I find useful ways to contribute? Has everything been mapped already? (Answer: there’s still a lot to do!)

To help answer my questions I started out by watching some YouTube tutorials. There’s a lot of great training material for the OpenStreetMap ecosystem, covering the basics of mapping and how to add buildings, plus some nice bite-size videos that introduce the best elements of the tool-set.

Other people in the Bath: Hacked community had also been looking at OpenStreetMap, mainly as a potential data resource. So we held a small evening meetup to get together and share what we knew. Two experienced local mappers came along and offered encouragement (thanks Neil and Dave!).

This was a great way to learn the ropes and build up the confidence to wade in. I personally found having some existing members of the OSM community on hand very helpful. Dave has been particularly supportive of reviewing my edits and offering suggested improvements.

Equipping my expedition

There’s an amazing set of tools that support the OSM community. Too many to mention in a single blog post. But here’s a few that I’ve found particularly useful:

  • There are a few different OSM editors, but the new, default iD editor is really easy to use. If you plan on making some edits, focus on learning this tool rather than looking at the older, more complex tools (although they have their uses). It’s really nice to work with. It also has some pleasing little UX elements.
  • osmtracker is an Android (and Windows mobile) application that lets you record GPS traces, upload them to OSM (where they can be viewed in the iD editor), and export them to GPX files for use in other tools. It’s in the app store, so it’s easy to install
  • The OSM wiki is an essential resource. The OSM database itself is basically a wiki: you can add tags to any item on the map. While the online editor does a lot of the work for you, sometimes you need to add some additional metadata, and the trick is in knowing which tags to add to which locations. The wiki provides plenty of examples. It also includes some beginner tutorials, but I found the videos to be a good starting point. (There’s a sketch of querying tagged data just below)
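To show what those tags look like in practice, here’s a minimal sketch that reads tagged data back out of OSM via the community-run Overpass API (the coordinates are roughly Bath city centre; adjust to taste):

```python
import requests

# Overpass QL: all nodes tagged amenity=atm within ~1km of Bath city centre
query = "[out:json];node[amenity=atm](around:1000,51.3811,-2.3590);out;"

resp = requests.get(
    "https://overpass-api.de/api/interpreter",
    params={"data": query},
)
for node in resp.json()["elements"]:
    tags = node.get("tags", {})
    print(node["lat"], node["lon"], tags.get("operator", "unknown operator"))
```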

My first attempt at proper mapping was walking my local high street, recording my progress and using osmtracker to take notes of the names of each shop. I later updated the building outlines, names and details of all the locations.

Into the unknown

That process of collecting data and updating the map lit up the bits of my brain that like exploring and scavenging in video games, so I was immediately keen to do more. That’s when I started contributing to Missing Maps, which I’d heard about from Rares during our meetup.

Missing Maps uses volunteers to trace satellite imagery of locations around the world. This data is then improved locally and used by humanitarian organisations to plan their disaster response activities. So I spent a happy evening finding and tracing Tukuls in Sudan. I thoroughly enjoyed it. It felt like doing an adult colouring book, but where I was painting the world a bit better with each stroke.

As a contributor the tooling is great: simple task allocation, clear guidance and tutorials, and making contributions is straightforward as you’re using the standard editors. The community was also quick to provide feedback.

I also tripped over MapSwipe. This lets you identify, with a simple click, satellite images that contain buildings. This generates new tasks that go into the Missing Maps pipeline. It also has some light gamification and encouragement to keep you contributing.

Even if you’re not confident about editing the full map, you can quickly make small contributions using this mobile app. You can download tasks for use offline, so it’s also possible to map when you’re on the go. There’s a little micro-tasking app called StreetComplete which takes a similar approach towards making local contributions as easy as possible.

Between MapSwipe, MissingMaps and editing the local OSM map and updating locations on the Wheelmap app, I’m now trying to make a small contribution to OSM every day.

The landscape

I’ve been really blown away by the range of tools and applications that fill out the OSM ecosystem. I plan on doing a lengthier post on some of this at a later date, but I’d be very surprised if this ecosystem wasn’t at least as good as, or even better than, those used internally by national mapping agencies.

The ecosystem doesn’t just consist of hobbyists, there’s a growing commercial community that are contributing to, supporting and helping develop OpenStreetMap. Just look at how clearly Mapbox and Mapillary articulate how their company strategies align with making OSM a continued success.

I was also really surprised to learn that the satellite imagery that all OSM mappers are using has been donated by Microsoft. The Bing aerial imagery is free for use in OSM mapping and has been since 2010. That’s a significant contribution to an open data ecosystem.

If you’re interested in learning more about the OSM community, I’d encourage you to explore the videos from the annual State of the Map conference. There’s some really interesting work presented there including:

  • introductions to new OSM tools and research
  • analysis and discussion about the OSM community itself, the reasons why people contribute and how to encourage them to continue to do so
  • case studies of how OSM data and tooling is used in a variety of projects

New territory

I’ve now done several street surveys of Bath and have refined my workflow. What I’ve found to be the simplest approach is to use osmtracker to record my route and use its facility to take photos of streets and shop fronts. This gives a quick way to collect information on the go, and I can then use this to update the map later.

Uploading the GPX traces to OSM, putting the photos into the public domain on Flickr, and also publishing them to Mapillary allows me to demonstrate that I’ve actually done the field work, rather than just sneakily copied from Google StreetView, whilst also making them available to other people to use when they’re mapping. Mapillary data can be added to the iD editor so you can see contributed photos as you work.

I’ve decided that the surveys are a good way to encourage me to be more active over the summer!

Trip report

This post has just been a taster of what I’ve learnt and explored over the last couple of months. If you’ve ever wondered about contributing to OSM I’d encourage you to have a go. And I’m happy to help you get started! As I’ve outlined here, there are a number of different ways you can contribute either your local knowledge, or pitch in to some humanitarian mapping.

I’m going to be writing more here about some of the ecosystem in future. The exercise has been a great insight into how the OSM community hangs together and I’ve really only scratched the surface.

To briefly summarise, though, I think there are some aspects of OSM that could work well in other contexts, for example:

  • the various approaches taken to ensuring quality and consistency of the map
  • the effort that goes into understanding and managing the community
  • the means by which commercial and volunteer efforts can both contribute to an open resource

If you’re interested in data as infrastructure then OSM is a great project to study in more detail. I think it embodies all of the key principles of a strong open data infrastructure.

Someone also needs to do a proper review of the OSM ecosystem, because all of that “open data impact” people are looking to measure is right there. There’s a bit too much focus on measuring the impact of government data IMHO, when there’s an existing ecosystem which can provide some great insights.

The limitations of the open banking licence

The Open Banking initiative recently began to publicly publish specifications, guidance and data through its website. If you’re not already aware of the initiative, it was created as a direct result of government reforms that aim to encourage the banking sector to be more open and innovative. The CMA undertook a lengthy consultation period during which the ODI coordinated work on the Open Banking Standard report.

The recommendations from that report and the CMA ruling were clear. Banks have to:

  • publish open data about their products, branches and locations, and
  • develop and provide open APIs to support access to other data, e.g. the transaction history on your account.

Unfortunately, while the banks are moving in that direction, the data they are publishing is not open data.

The Open Definition is the definitive description of what makes content and data open. It describes certain freedoms that are essential to maximise the value of publishing data under an open licence.

I think publishing open data is what the CMA and others really intended. It’s also clearly spelt out in the Open Banking report. But unfortunately something has been lost in translation. The Open Banking Licence does not conform to the open definition.

Owen Boswarva has given a detailed review on his blog. For a review of the impacts of non-open licences you can read the ODI guidance which I helped to draft.

Rather than recap that guidance here, I thought it might be useful to try to spell out where the limitations in the Open Banking licence will impact reuse of the data. This is based on my early explorations with the public data.

Exploring the limitations of the open banking licence

The Open Banking API dashboard provides direct access to the currently available data. It includes data on the ATMs provided by each of the participating banks, their branches and products.

The data is published as JSON, a commonly used data format that is easy for developers to work with.

I can’t freely distribute the data

The first thing I did was to build a public map of all of the ATM data. To do this I had to convert the data from JSON to CSV which I could then load into an online mapping tool (Carto).
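The conversion itself is straightforward. Here’s a hedged sketch of the kind of script involved; the field names and document structure are illustrative rather than the actual Open Banking schema:

```python
import csv
import json

# Field names are illustrative; adjust them to the actual Open Banking documents
with open("atms.json") as f:
    atms = json.load(f)["data"]

with open("atms.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["brand", "latitude", "longitude"])
    writer.writeheader()
    for atm in atms:
        writer.writerow({
            "brand": atm.get("Brand"),
            "latitude": atm.get("Latitude"),
            "longitude": atm.get("Longitude"),
        })
```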

This is a permitted use under the Open Banking licence. The conversion of the data from JSON to CSV, and the creation of a map is explicitly allowed in the licence. Section 2.1(c) says that I am allowed “to adapt the Open Data into different formats for the purposes of data mapping (or otherwise for display or presentational purposes)”.

But that clause means that:

  • I can’t share the CSV version of the data. Data in CSV format is useful to many more potential reusers of the data. Many analytics tools support CSV but won’t support custom JSON documents. Because I can’t distribute the alternative version, fewer people can immediately use the data
  • I had to keep the dataset private in my Carto account. I’m lucky enough to have a personal account that lets me keep data private. Most freely available online tools allow people to use their services for free, so long as they’re using open data. If I was allowed to share the data with other Carto users, anyone could use it in their own maps. People without a paid Carto account can’t use this data. The result is, again, that fewer people can get immediate benefit from it.

The ability to freely convert and distribute data is a key part of the open definition. It allows data re-users to support each other in using the data by making it available in alternative formats and on all available platforms.

At the moment we are only allowed to copy, re-use, publish and distribute data so long as we don’t change it.

I’m limited in using the data to enrich other services and products

Because I can’t distribute the data, it means I can’t take the data that has been provided and use it to improve an existing system. For example, I don’t believe I can use the data to add missing ATM locations to OpenStreetMap.

The terms of the Open Banking licence are not compatible with the OpenStreetMap licence. Because it is a custom licence, rather than an existing standard open licence, resolving that issue will require legal advice.

OSM requires contributors to be extra cautious when adding data from other sources. They suggest getting explicit written agreement. This takes time and effort. That doesn’t seem to be achieving the desired outcome of a more open banking sector.

The licence is also revocable. At any time the banks can revoke my ability to use the data. Open licences, like the Creative Commons licences, are not revocable. This means I’m exposing myself to legal and commercial risks if I build the data into a product or service. I would need to take legal advice on that.

I can’t improve the data

After creating a basic map of ATM locations, I wanted to link the data with other sources. Data becomes more valuable once it’s linked together.

I opened my CSV version of the data in a free, open source desktop GIS tool called QGIS. Using the standard features of that tool I was able to match the geographic coordinates in the ATM data against openly licensed geographic data from the Office for National Statistics.

This generated an enriched dataset in which every ATM was now linked to an LSOA. An LSOA is a statistical area used by the ONS and others to help publish statistics about the UK. There are many statistical datasets that are reported against these areas.
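If you prefer code to a desktop GIS, the same enrichment can be sketched with the geopandas library. This assumes an LSOA boundary file downloaded from the ONS Open Geography portal, and that its LSOA code column is named LSOA11CD (the name ONS uses for 2011 boundaries; check your file):

```python
import geopandas as gpd
import pandas as pd

# The flattened ATM data from earlier, turned into a GeoDataFrame of points
atms = pd.read_csv("atms.csv")
points = gpd.GeoDataFrame(
    atms,
    geometry=gpd.points_from_xy(atms.longitude, atms.latitude),
    crs="EPSG:4326",
)

# LSOA boundaries, e.g. downloaded from the ONS Open Geography portal
lsoas = gpd.read_file("lsoa_boundaries.geojson").to_crs("EPSG:4326")

# Point-in-polygon join: each ATM picks up the code of the LSOA that contains it
# (older geopandas versions call the predicate parameter "op")
enriched = gpd.sjoin(points, lsoas[["LSOA11CD", "geometry"]], how="left", predicate="within")
print(enriched[["brand", "LSOA11CD"]].head())
```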

Having completed that enrichment process I could now start to explore the data in the context of official statistics on demographics. There are many interesting questions that I can now ask of the data. But other people might also have interesting uses for that enriched dataset.

The process of doing the enriching is quite technical. I’m comfortable with teaching myself how to do that. But it would be great if I could help other people unlock value by letting them explore the enriched data.

Unfortunately I can’t share my enriched version with them. I’m not allowed to change any of the content of the data, or distribute it in alternate forms. The best I could do is tweet out a few interesting insights.

I am discouraged from using the data

One way I could use the enriched data is to explore how ATM and branch locations might relate to deprivation or other demographic statistics. This might highlight patterns in how individual banks have chosen to site their branches.

I could also monitor the data over time and build up a picture of where ATMs and branches are opening and closing around the country. Or explore the changing mix of products available from individual banks.

Unfortunately I don’t think I can do that. Clause 3.1(b) of the licence states that I must not “use or present the Open Data or any analysis of it in a way that is unfair or misleading, for example comparisons must be based on objective criteria and not be prejudiced by commercial interests”.

It’s not clear to me what unfair or misleading means. Unfair to the banks? Unfair to consumers? What type of objective criteria are acceptable?

If I were working for a fintech startup, I could perhaps use the data to identify new financial products that could be offered to consumers. I think that’s the type of innovation that the CMA wanted to encourage?

But if I do that and share my analysis with others, then am I “prejudiced by commercial interests”? The licence says I can use the data commercially, but seems to discourage certain types of commercial usage.

These types of broad, under-defined clauses in licences discourage reuse. They create uncertainty around what is actually permitted under the terms of the licence. This reduces the likelihood of people using the data, unless they can cover the cost of the legal guidance needed to remove the uncertainty.

I have probably already broken the terms of the licence

I think I may have already broken the terms of the licence. As a bit of fun I’ve created a twitter account called @allthebarclays. Every day it tweets out a picture of a branch of Barclays along with its name and unique identifier.

I’m probably not allowed to do that. The photos in the data don’t have a licence attached to them, so I’m hoping that if challenged, I can justify it under fair use.

The account is clearly a joke. It’s of no real use to anyone. But it gave me a focus for my explorations with the data.

It was also a deliberate attempt to show how the data could be used to create something which is far from its original intended use. Because encouraging unexpected uses of the data is one of the primary goals of publishing open data. It’s the unexpected uses that are most likely to hit the types of limitations that I’ve outlined above.

How does this get resolved?

There are several ways in which these issues could begin to be addressed. There are measures that the initiative could take that would address some specific limitations, or they could take steps to address all of them. For example, the Open Banking Initiative could:

  1. Publish data in other formats, e.g. by providing a CSV download. This would explicitly address one part of the first issue I highlighted, but none of the real concerns
  2. Publish some guidance for reusers that clarifies some of the terms of its existing licence. This might avoid discouraging some uses of the data but again, it doesn’t address the primary issues. The data would still not be open data
  3. Revise its licence to remove the problematic clauses and create an open data licence. This would ideally go through the licence approval process. This would address all of the concerns
  4. Drop the licence completely in favour of the Creative Commons Attribution licence (CC-BY 4.0). This would address all of the concerns with the added benefit that it would be explicitly clear to all users that the data could be freely and easily mixed with other open data

Only the last two options would be substantial progress.

What’s needed is for someone at the Open Banking initiative (or perhaps the CMA?) to step up and take responsibility for addressing the issues. Unfortunately, until that happens, the initiative is just another example of open washing.

What is data asymmetry?

You’ve just parked your car. Google Maps offers to record your current location so you can find your car later. It also lets you note how much parking time you have available.

Sharing this data allows Google Maps to provide you with a small but valuable service: you can quickly find your car and avoid having to pay a fine.

For you that data has a limited shelf-life. It’s useful to know where you are currently parked, but much less useful to know where you were parked last week.

But that data has much more value to Google because it can be combined with the data from everyone else who uses the same feature. When aggregated that data will tell them:

  • The location of all the parking spaces around the world
  • Which parking spaces are most popular
  • Whether those parking spaces are metered (or otherwise time-limited)
  • Which parking spaces will become available in the next few hours
  • When combined with other geographic data, it can tell them the places where people usually park when they visit other locations, e.g. specific shops or venues
  • …etc

That value only arises when many data points are aggregated together. And that data remains valuable for a much longer period.

With access to just your individual data point Google can offer you a simple parking reminder service. But with access to the aggregate data points they can extract further value. For example by:

  • Improving their maps, using the data points to add parking spaces or validate those that they may already know about
  • Suggesting a place to park as people plan a trip into the same city
  • Creating an analytics solution that provides insight into where and when people park in a city
  • …etc

The term data asymmetry refers to any occasion when there is a disparity in access to data. In all cases this results in the data steward being able to unlock more value than a contributor.

A simple illustration using personal data

When does data asymmetry occur?

Broadly, data asymmetry occurs in almost every single digital service or application. Anyone running an application automatically has access to more information than its users. In almost all cases there will be a database of users, content or transaction histories.

Data asymmetry, and the resulting imbalances of power and value, are most often raised in the context of personal data. Social networks mining user information to target adverts, for example. This prompts discussion around how to put people back in control of their own data as well as encouraging individuals to be more aware of their data.

Apart from social networks, other examples of data asymmetry that relate to personal data include:

  • Smart meters that provide you with a personal view of your energy consumption, whilst providing energy companies with an aggregated view of consumption patterns across all consumers
  • Health devices that track and report on fitness and diet, whilst developing aggregated views of health across their population of users
  • Activity loggers like Strava that allow you to record your individual rides, whilst developing an understanding of mobility and usage of transport networks across a larger population

But because asymmetry is so prevalent it occurs in many other areas; it’s not an issue that is specific to personal data. Any organisation that offers a B2B digital service will also be involved in data asymmetry. Examples include:

  • Accounting packages that allow better access to business information, whilst simultaneously creating a rich set of benchmarking data on organisations across an industry
  • Open data portals that will have metrics and usage data on how users of the service are finding and consuming data
  • “Sharing economy” platforms that can turn individual transactions into analytics products

Data asymmetry is as important an issue in these areas as it is for personal data. These asymmetries can create power imbalances in sharing economy platforms like Uber. The impact of information asymmetry on markets has been understood since the 1970s.

How can data asymmetry be reduced?

There are many ways that data asymmetry can be reduced. Broadly, the solutions either involve reducing disparity in access to data, or in reducing disparities in the ability to extract value from that data.

Reducing the amount of data available to an application or service provider is where data protection legislation has a role to play. For example, data protection law places limits on what personal data companies can collect, store and share. Other examples of reducing disparities in access to data include:

  • Allowing users to opt-out of providing certain information
  • Allowing users to remove their data from a service
  • Creating data retention policies to reduce accumulation of data

Practising Datensparsamkeit (data minimisation) reduces the risks and imbalances associated with unfettered collection of data.

Reducing disparities in the ability to extract value from data can include:

  • Giving users more insight and input into when and where their data is used or shared
  • Giving users or businesses access to all of their data, e.g. a complete transaction history or a set of usage statistics, so they can attempt to draw additional value from it
  • Publishing some or all of the aggregated data as open data

Different applications and services will adopt a different mix of strategies. This will require balancing the interests of everyone involved in the value exchange. Policy makers and regulators also have a role to play in creating a level playing field.

Open data can reduce asymmetry by allowing value to spread through a wider network

Update: the diagrams in this post were made with a service called LOOPY. You can customise the diagrams and play with the systems yourself. Here’s the first diagram visualising data asymmetry, and here is the revised version showing how open data reduces asymmetry by allowing value to spread further.

This post is part of a series called “basic questions about data”.