Where can you contribute to open data? Yes, you!

This is just a quick post to gather together some pointers and links that were shared in answer to a question I asked on twitter yesterday:

I want to try out a bunch of different services to explore how easy it is for people to contribute to open data projects. Because I’m interested in how we can contribute as individuals, I’m ruling out things like government open data portals. They’re not typically places where mere punters like you or me can contribute.

I’m also interested in sites that generate open data. Not public data. There needs to be an open licence on the results. Or, at the very least, a note along the lines of: “do whatever you want with this”.

I’m thinking more of places where we can collaborate around creating open data.

The short list

Here’s a quick list of the suggestions, along with a few I’d already turned up. I’m sure there are a lot more. Please leave a comment or ping me on twitter if you have suggestions. And yes, I’ll turn this into data at some point.

  1. OpenStreetMap was the starter for ten. I’ve already written about a number of ways you can contribute to the effort
  2. Discogs, contribute to their public domain database
  3. Wikipedia, content in infoboxes is presented as data via dbpedia and wikidata
  4. You can also contribute directly to Wikidata
  5. MusicBrainz is completely crowd-sourced
  6. You can contribute company information to OpenCorporates
  7. Questions you answer on Stack Overflow end up as open data
  8. DemocracyClub are doing an awesome job of co-ordinating crowd-sourced data collection that the UK government should just be doing itself
  9. The product data you add to OpenFoodFacts is open
  10. It looks like you can contribute Creative Commons licensed content and data to the Encyclopedia of Life
  11. OpenPlaques is open to contributions
  12. The Quick, Draw with Google data is actually open. Google seem to be opening up more of their research data
  13. ESRI are building some crowdsourcing apps, which generate open data
  14. If you’re in Germany and have some sensor data, you can feed it into OpenSenseMap. Their data dumps are in the public domain

What else should be on this list?

Disqualifications

There were also a number of sites that were suggested, or which I considered, but had to be rejected. Mostly because they’re not actually publishing open data. They either have restrictions on usage, or the licensing is very unclear. If you can clarify any of these then let me know.

Clearly there are hundreds of non-open databases, but do let me know if I’m wrong about any of the above, and I’ll amend the article accordingly.

Can you publish tweets as open data?

Can you publish data from twitter as open data? The short answer is: No. Read on for some notes, pointers and comments.

Twitter’s developer policy places a number of restrictions on your use of their API and the data you get from it. Some of the key ones are:

  • In the Restrictions on Use of Licensed Materials (II.C) they make it clear that you can’t use any geographic data from the platform. You can only use it to identify the location from which a tweet was made and not for any other purpose. You also can’t aggregate or cache it, unless you’re storing it with the rest of the tweet. And elsewhere they place further restrictions on storage of tweets. They reiterate this in section B.9
  • Section F.2 “Be a Good Partner to Twitter” (sic) is the key one for data, as here you’re agreeing to not store anything except the ID for a tweet. You can’t store the message, its metadata or anything about the user, just the ID.
  • You are allowed to make those IDs downloadable in various ways but there are restrictions on how many tweets you can publish per user, per day
  • In the Ownership and Feedback section, they make it clear that the only rights you have to use content are derived from this agreement and those rights can be taken away at any time.
  • Anyone that you distribute data to must also agree to ALL of Twitter’s terms, not just the developer policy, but its general terms of service, privacy policy, etc. So everyone’s agreements can be revoked at any time.

That’s a very closed set of terms.

There’s some great analysis of the terms and what they mean for researchers elsewhere. Ernesto Priego has an interesting pair of posts looking at twitter as public evidence and the ethics of twitter research and why you might want to archive and share small twitter datasets.

Ed Summers has also written about archiving twitter datasets and the process of “hydrating” a twitter ID to turn it back into useful content. There’s a whole set of APIs, tools and practices that have built up around the process of hydration as a means to work around Twitter’s terms. I think it’s interesting as an example of using a combination of data and open source to address licensing limitations.
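To make the ID-only sharing idea concrete, here is a minimal Python sketch of the "dehydration" side: reducing already-collected tweets to a de-duplicated list of IDs. It assumes each tweet is a dictionary carrying an `id_str` field, as in Twitter's JSON; the sample records and the `dehydrate` function name are invented for illustration. Hydration, the reverse step, needs a call to Twitter's API and isn't shown.

```python
def dehydrate(tweets):
    """Reduce collected tweets to a sorted list of unique tweet IDs."""
    # Only the ID is kept; the text, metadata and user details are dropped.
    ids = {tweet["id_str"] for tweet in tweets}
    # IDs are numeric strings, so sort them by their integer value.
    return sorted(ids, key=int)


# Invented sample records, for illustration only.
sample = [
    {"id_str": "1002", "text": "second"},
    {"id_str": "1001", "text": "first"},
    {"id_str": "1002", "text": "second"},  # duplicate, collapsed on output
]

tweet_ids = dehydrate(sample)
print("\n".join(tweet_ids))
```

A file of IDs like this is the form in which these datasets tend to be shared, though as noted above, even that is subject to Twitter's limits.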

Yesterday, Justin Littman published a short piece highlighting how Twitter have just further restricted their terms. The key changes are around placing upper limits on how many tweet IDs you can distribute. The changes raise concerns about how archival projects like DocNow can continue. Although in my reading of the terms, those projects were already under question as Twitter doesn’t grant you the rights to re-publish data under anything other than its own terms. I think those datasets were already in breach of the agreement.

So, we get to our answer: no you can’t publish anything from twitter under an open licence. If you’re intending to do this in a project then I recommend you get approval from twitter directly.

Obviously these terms are designed for Twitter’s sole benefit. They help it retain as much value as possible while still operating as a platform. Data asymmetry in action.

I think what’s particularly frustrating is that they seem to rarely enforce these terms, even for services that clearly breach them. After crafting a legal agreement they choose not to actively police it, because it’s not worth their time to do so. Presumably they will step in if there are large scale, significant breaches. But it makes you wonder how much value is really being protected.

In the meantime we are left with areas of doubt and uncertainty. Does the continued existence of a service mean it’s an exemplar of acceptable practice? Or are twitter just choosing to ignore it? And this starts to poison the well of open data. A more open approach would be for them to offer some allowance for small scale archiving and data sharing. Openly licensing twitter IDs would be a start.

For better or worse Twitter’s data has a role in helping us understand modern society, so we should be able to use it. Unfortunately their donation of the twitter archive to the Library of Congress is floundering because of a mixture of technical and legal issues. Twitter is not really a public space. It’s a private hall where we choose to meet.

Addendum

A couple of final extra points based on comments on this post (see below) and on twitter. Ed Summers rightly pointed out that services that are seemingly breaching Twitter’s terms may in fact have permission to do so. In fact a couple of examples came up.

Andy Piper (Twitter Dev lead) notes that Twitter have posted a policy update clarification:

The clarification explains that developers can request permission to share more than 1.5m tweet IDs in a 30 day period. It also notes that researchers from “an accredited academic institution” can share an unlimited number of tweets. This relaxes some of the restrictions on distribution, but also reinforces some of the key points I make above: any use of the data remains subject to Twitter’s policies. By default data from Twitter can’t be published as open data. But if you’re willing to pay then it looks like Twitter are willing to share more widely.

Joe Wass from CrossRef explained that they’ve had explicit permission from Twitter to distribute Tweet IDs under a CC0 waiver within their Event Data service.

CrossRef negotiated this permission as part of their commercial arrangement with Twitter. This means that at least some Tweet IDs can be considered to be in the public domain. It just depends on where you got them from: the Twitter API or CrossRef.

Enabling data forensics

I’m interested in how people share information, particularly data, on social networks. I think it’s something to which it’s worth paying attention, so we can ensure that it’s easy for people to share insights and engage in online debates.

There’s lots of discussion at the moment around fact checking and similar ways that we can improve the ability to identify reliable and unreliable information online. But there may be other ways that we can make some small improvements in order to help people identify and find sources of data.

Data forensics is a term that usually refers to analysis of data to identify illegal activities. But the term does have a broader meaning that encompasses “identifying, preserving, recovering, analyzing, and presenting attributes of digital information“. So I’m going to appropriate the term to put a label on a few ideas.

The design of the Twitter and Facebook platforms constrains how we can share information. Within those constraints people have, inevitably, adopted various patterns that allow them to publish and share content in preferred ways. For example, information might be shared:

  1. As a link to a page, where the content of the tweet or post is just the title
  2. As a link to a page, but with a comment and/or hashtags for context
  3. As a screenshot, e.g. of some text, chart or something. This usually has some commentary attached. Some apps enable this automatically, allowing you to share a screenshot of some highlighted text
  4. As images and photographs, e.g. of a printed page or report (or even sometimes a screenshot of text from another app)

In the first two examples there are always links that allow someone to go and read the original content. In fact that seems to be the typical intention: go read (or watch) this thing.

The other two examples are usually workarounds for the fact that it’s often hard to deep link to a section of a page or video.

Sometimes it’s just not possible because the information of interest isn’t in a bookmarkable section of a page. Or perhaps the user doesn’t know how to create that kind of deep link. Or they may be further constrained by a mobile app or other service that is restricting their ability to easily share a link. Not every application lets the web happen.

In some cases screenshotting may also be a conscious choice, e.g. posting a photo of someone’s tweet because you don’t want to directly interact with them.

Whatever the reason, this means there is usually no link in the resulting post, which often makes it difficult for a reader to find the original content. While social media is reducing friction in sharing, it’s increasing friction around our ability to check the reliability and accuracy of what’s been shared.

If you tweet out a graph with some figures in a debate, I want to know where it’s come from. I want to see the context that goes with it. The ability to easily identify the source of shared content is, I think, part of “data forensics”.

So, what can we do to fix this?

Firstly, there’s more that could be done to build better ways to deep link into pages, e.g. to allow sharing of individual page elements. But people have been trying to do that on and off for years without much visible success. It’s a hard problem, particularly if you want to allow someone to link to a piece of text. It could be time for a standards body to have another crack at it. Or I might have missed some exciting progress, so please tell me if I have! But I think something like this would need some serious push behind it. You need support from not just web frameworks and the major CMS platforms, but also (probably) browser vendors.

Secondly, Twitter and Facebook could allow us some more flexibility. For example, allow apps to post additional links and/or other metadata that are then attached to posts and tweets. It won’t address every scenario, but it could help. It also feels like a relatively easy thing for them to do as it’s a natural extension of some existing features.

Thirdly, we could look at ways to attach data to the images people are posting, regardless of what the platforms support. I’ve previously wondered about using XMP packets to attach provenance and attribution information to images. Unfortunately it doesn’t work for every format and it turns out that most platforms strip embedded metadata anyway. This is presumably due to reasonable concerns around privacy, but they could still white-list some metadata. We could maybe use steganography too.

But the major downside here is that you’d need a custom social media client or browser extension to let you see and interact with the data. So, again that’s a massive deployment issue.

As things currently stand I think the best approach is to plan for visualisations and information to be shared, and design the interactions and content accordingly. Assume that your carefully crafted web page is going to be shared in a million different pieces. Which means that you should:

  • Include plenty of in-page anchors and use clear labelling to help people build links to relevant sections
  • Adapt your social media sharing buttons to not just link to the whole page, but also allow the user to share a link to a specific section
  • Design your twitter cards and other social metadata. For example, is there a key graphic that would be best used as the page image?
  • Include links and source information on all of the graphs and infographics that you share. Make sure the link is short and persistent in case it has to be re-keyed from a screenshot
  • Provide direct ways to tweet and share out a graph that will automatically include a clearly labelled image, that contains a link
  • Help users cite their sources
  • …etc
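As a small illustration of the first two suggestions, here is a Python sketch of how a sharing button might build a link to a specific section rather than the whole page. It assumes the page already has in-page anchors with known ids; the `section_link` helper, the example URL and the anchor name are all hypothetical.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def section_link(page_url, anchor_id):
    """Return page_url with its fragment replaced by the given anchor id."""
    parts = urlsplit(page_url)
    # Swap in the anchor as the URL fragment, percent-encoding it if needed.
    return urlunsplit(parts._replace(fragment=quote(anchor_id)))


# Hypothetical page and anchor, for illustration only.
link = section_link("https://example.org/report?year=2017#top", "figure-2")
print(link)
```

The same short link could then be printed on the graphic itself, so that it can be re-keyed from a screenshot.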

What do you think? Any tips or suggestions you’d add to this list? With a bit of awareness around how data is shared, we might be able to make small improvements to online discussions.

Adventures in geodata

I spend a lot of my professional life giving people advice. Mostly around how to publish and use open data. In order to make sure I give people the best advice I can, I try and spend a lot of time actually publishing and using open data. A mixture of research and practical work is the best way I’ve found of improving my own open data practice. This is one of the reasons I run Bath: Hacked, continue to work at the Open Data Institute, and like to stay hands-on with data projects.

Amongst my goals for this year was to spend time learning some new skills. For example, I’ve not been involved in running a crowd-sourcing project, but now have that underway with Accessible Bath.

And, while I’ve done some work with geographic data, until recently I hadn’t really spent any time contributing to OpenStreetMap or exploring its ecosystem. But I’ve spent the last couple of months fixing that by immersing myself in its community and tools. In this blog post I wanted to share some of the things I’ve learnt. It’s been really fascinating and, as I’d hoped, given me a new perspective on a number of issues.

Finding my way

To begin with I looked around for some online tutorials. While I knew that OpenStreetMap allowed anyone to contribute, I wasn’t really sure about how I could go about doing that. I had a bunch of questions such as:

  • Did I need a dedicated GPS device or could I collect data with my phone? (Answer: you can use your phone)
  • Did I need to go out with a clipboard and do a formal survey or are there other ways to contribute? (Answer: you can contribute in a lot of different ways)
  • How do you actually go about editing the map, what tools do you need to use? (Answer: however you feel comfortable)
  • How do I find useful ways to contribute? Has everything been mapped already? (Answer: there’s still a lot to do!)

To help answer my questions I started out by watching some YouTube tutorials. There’s a lot of great training material for the OpenStreetMap ecosystem that covers the basics of mapping, how to add buildings, and some nice bite size videos that introduce the best elements of the tool-set.

Other people in the Bath: Hacked community had also been looking at OpenStreetMap, mainly as a potential data resource. So we held a small evening meetup to get together and share what we knew. We had two experienced local mappers who came along and also offered encouragement (thanks Neil and Dave!).

This was a great way to learn the ropes and build up the confidence to wade in. I personally found having some existing members of the OSM community on hand very helpful. Dave has been particularly supportive of reviewing my edits and offering suggested improvements.

Equipping my expedition

There’s an amazing set of tools that support the OSM community. Too many to mention in a single blog post. But here’s a few that I’ve found particularly useful:

  • There are a few different OSM editors, but the new, default iD editor is really easy to use. If you plan on making some edits, focus on learning this tool, rather than looking at the older, more complex tools (although they have their uses). It’s really nice to work with. It also has some pleasing little UX elements.
  • osmtracker is an Android (and Windows mobile) application that lets you record GPS traces, upload them to OSM (where they can be viewed in the iD editor), and export them to GPX files for use in other tools. It’s in the app store so easy to install
  • The OSM wiki is an essential resource. The OSM database itself is basically a wiki: you can add tags to any item on the map. While the online editor does a lot of the work for you, sometimes you need to add some additional metadata and the trick is in knowing which tags to add to which locations. The wiki provides plenty of examples. It also includes some beginner tutorials, but I found the videos to be a good starting point

My first attempt at proper mapping was walking my local high street, recording my progress and using osmtracker to take notes of the names of each shop. I later updated the building outlines, names and details of all the locations.

Into the unknown

That process of collecting data and updating a map lit up the bits of my brain that like exploring and scavenging in video games, so I was immediately keen to do more. That’s when I started contributing to Missing Maps, which I’d heard about from Rares during our meetup.

Missing Maps uses volunteers to trace satellite imagery of locations around the world. This data is then improved locally and used by humanitarian organisations to plan their disaster response activities. So I spent a happy evening finding and tracing Tukuls in Sudan. I thoroughly enjoyed it. It felt like doing an adult colouring book, but where I was painting the world a bit better with each stroke.

As a contributor the tooling is great: simple task allocation, clear guidance and tutorials, and making contributions is straight-forward as you’re using the standard editors. The community was also quick to provide feedback.

I also tripped over MapSwipe. This lets you identify, with a simple click, satellite images that contain buildings. This generates new tasks that go into the Missing Maps pipeline. It also has some light gamification and encouragement to keep you contributing.

Even if you’re not confident about editing the full map, you can quickly make small contributions using this mobile app. You can download tasks for use offline, so it’s also possible to map when you’re on the go. There’s a little micro-tasking app called StreetComplete which takes a similar approach towards making local contributions as easy as possible.

Between MapSwipe, Missing Maps, editing the local OSM map and updating locations on the Wheelmap app, I’m now trying to make a small contribution to OSM every day.

The landscape

I’ve been really blown away by the range of tools and applications that fill out the OSM ecosystem. I plan on doing a lengthier post on some of this at a later date, but I’d be very surprised if this ecosystem wasn’t at least as good as, or even better than those used internally by national mapping agencies.

The ecosystem doesn’t just consist of hobbyists, there’s a growing commercial community that are contributing to, supporting and helping develop OpenStreetMap. Just look at how clearly Mapbox and Mapillary articulate how their company strategies align with making OSM a continued success.

I was also really surprised to learn that the satellite imagery that all OSM mappers are using has been donated by Microsoft. The Bing aerial imagery is free for use in OSM mapping and has been since 2010. That’s a significant contribution to an open data ecosystem.

If you’re interested in learning more about the OSM community, I’d encourage you to explore the videos from the annual State of the Map conference. There’s some really interesting work presented there including:

  • introductions to new OSM tools and research
  • analysis and discussion about the OSM community itself, the reasons why people contribute and how to encourage them to continue to do so
  • case studies of how OSM data and tooling is used in a variety of projects

New territory

I’ve now done several street surveys of Bath and have refined my workflow. What I’ve found to be the simplest approach is to use osmtracker to record my route and use its facility to take photos of streets and shop fronts. This gives a quick way to collect information on the go, and I can then use this to update the map later.
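If you want to look at a recorded trace outside the editor, a GPX file is just XML. The following is a minimal Python sketch that pulls the track points out of a trace. It assumes the file uses the GPX 1.1 namespace; the `track_points` function and the sample coordinates are made up for illustration, and a real trace will carry more detail (timestamps, elevation, waypoints) than is read here.

```python
import xml.etree.ElementTree as ET

# Namespace used by GPX 1.1 documents (assumed for the trace being read).
GPX_NS = {"gpx": "http://www.topografix.com/GPX/1/1"}


def track_points(gpx_text):
    """Return the (lat, lon) pairs of every track point in a GPX document."""
    root = ET.fromstring(gpx_text)
    return [
        (float(pt.get("lat")), float(pt.get("lon")))
        for pt in root.iterfind(".//gpx:trkpt", GPX_NS)
    ]


# Invented two-point trace, for illustration only.
SAMPLE = """<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1" creator="example">
  <trk><trkseg>
    <trkpt lat="51.3811" lon="-2.3590"/>
    <trkpt lat="51.3815" lon="-2.3601"/>
  </trkseg></trk>
</gpx>"""

points = track_points(SAMPLE)
print(len(points), "points recorded")
```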

Uploading the GPX traces to OSM, putting the photos into the public domain on Flickr, and also publishing them to Mapillary allows me to demonstrate that I’ve actually done the field work, rather than just sneakily copied from Google StreetView, whilst also making them available to other people to use when they’re mapping. Mapillary data can be added to the iD editor so you can see contributed photos as you work.

I’ve decided that the surveys are a good way to encourage me to be more active over the summer!

Trip report

This post has just been a taster of what I’ve learnt and explored over the last couple of months. If you’ve ever wondered about contributing to OSM I’d encourage you to have a go. And I’m happy to help you get started! As I’ve outlined here, there are a number of different ways you can contribute either your local knowledge, or pitch in to some humanitarian mapping.

I’m going to be writing more here about some of the ecosystem in future. The exercise has been a great insight into how the OSM community hangs together and I’ve really only scratched the surface.

To briefly summarise though, I think there are some aspects of OSM that could work well in other contexts, for example:

  • the various approaches taken to ensuring quality and consistency of the map
  • the effort that goes into understanding and managing the community
  • the means by which commercial and volunteer efforts can both contribute to an open resource

If you’re interested in data as infrastructure then OSM is a great project to study in more detail. I think it embodies all of the key principles of a strong open data infrastructure.

Someone also needs to do a proper review of the OSM ecosystem because all of that “open data impact” people are looking to measure is right there. There’s a bit too much focus on measuring impact of government data IMHO, when there’s an existing ecosystem which can provide some great insights.

The limitations of the open banking licence

The Open Banking initiative recently began to publicly publish specifications, guidance and data through its website. If you’re not already aware of the initiative, it was created as a direct result of government reforms that aim to encourage the banking sector to be more open and innovative. The CMA undertook a lengthy consultation period during which the ODI coordinated work on the Open Banking Standard report.

The recommendations from that report and the CMA ruling were clear. Banks have to:

  • publish open data about their products, branches and locations, and
  • develop and provide open APIs to support access to other data, e.g. the transaction history on your account.

Unfortunately, while the banks are moving in that direction, the data they are publishing is not open data.

The Open Definition is the definitive description of what makes content and data open. It describes certain freedoms that are essential to maximise the value of publishing data under an open licence.

I think publishing open data is what the CMA and others really intended. It’s also clearly spelt out in the Open Banking report. But unfortunately something has been lost in translation. The Open Banking Licence does not conform to the Open Definition.

Owen Boswarva has given a detailed review on his blog. For a review of the impacts of non-open licences you can read the ODI guidance which I helped to draft.

Rather than recap that guidance here, I thought it might be useful to try to spell out where the limitations in the Open Banking licence will impact reuse of the data. This is based on my early explorations with the public data.

Exploring the limitations of the open banking licence

The Open Banking API dashboard provides direct access to the currently available data. It includes data on the ATMs provided by each of the participating banks, their branches and products.

The data is published as JSON, a commonly used data format that is easy for developers to work with.

I can’t freely distribute the data

The first thing I did was to build a public map of all of the ATM data. To do this I had to convert the data from JSON to CSV which I could then load into an online mapping tool (Carto).
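The conversion itself is straightforward. Here is a minimal Python sketch of that kind of JSON to CSV step, using only the standard library. The structure of the JSON (a top-level `data` list) and the field names are invented for illustration; the real Open Banking files are structured differently, so the code would need adapting before it could be used on them.

```python
import csv
import io
import json


def atms_to_csv(json_text, fields):
    """Flatten a list of JSON records into CSV text with the given columns."""
    records = json.loads(json_text)["data"]  # hypothetical top-level key
    out = io.StringIO()
    # Keys not listed in `fields` are ignored; missing ones are left blank.
    writer = csv.DictWriter(out, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return out.getvalue()


# Invented sample document, for illustration only.
SAMPLE = """{"data": [
  {"id": "A1", "latitude": 51.38, "longitude": -2.36, "note": "dropped"},
  {"id": "A2", "latitude": 51.45, "longitude": -2.59}
]}"""

csv_text = atms_to_csv(SAMPLE, ["id", "latitude", "longitude"])
print(csv_text)
```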

This is a permitted use under the Open Banking licence. The conversion of the data from JSON to CSV, and the creation of a map is explicitly allowed in the licence. Section 2.1(c) says that I am allowed “to adapt the Open Data into different formats for the purposes of data mapping (or otherwise for display or presentational purposes)“.

But that clause means that:

  • I can’t share the CSV version of the data. Data in CSV format is useful to many more potential reusers of the data. Many analytics tools support CSV but won’t support custom JSON documents. Because I can’t distribute the alternative version, fewer people can immediately use the data
  • I had to keep the dataset private in my Carto account. I’m lucky enough to have a personal account that lets me keep data private. Most freely available online tools allow people to use their services for free, so long as they’re using open data. If I was allowed to share the data with other Carto users, anyone could use it in their own maps. People without a paid Carto account can’t use this data. The result is, again, that fewer people can get immediate benefit from it.

The ability to freely convert and distribute data is a key part of the Open Definition. It allows data re-users to support each other in using the data by making it available in alternative formats and on all available platforms.

At the moment we are only allowed to copy, re-use, publish and distribute data so long as we don’t change it.

I’m limited in using the data to enrich other services and products

Because I can’t distribute the data, I can’t take the data that has been provided and use it to improve an existing system. For example I don’t believe I can use the data to add missing ATM locations in OpenStreetMap.

The terms of the Open Banking licence are not compatible with the OpenStreetMap licence. Because it is a custom licence, rather than an existing standard open licence, resolving that issue will require legal advice.

OSM requires contributors to be extra cautious when adding data from other sources. They suggest getting explicit written agreement. This takes time and effort. That doesn’t seem to be achieving the desired outcome of a more open banking sector.

The licence is also revocable. At any time the banks can revoke my ability to use the data. Open licences, like the Creative Commons licences, are not revocable. This means I’m exposing myself to legal and commercial risks if I build the data into a product or service. I would need to take legal advice on that.

I can’t improve the data

After creating a basic map of ATM locations, I wanted to link the data with other sources. Data becomes more valuable once it’s linked together.

I opened my CSV version of the data in a free, open source desktop GIS tool called QGIS. Using the standard features of that tool I was able to match the geographic coordinates in the ATM data against openly licensed geographic data from the Office for National Statistics.

This generated an enriched dataset in which every ATM was now linked to an LSOA. An LSOA (Lower Layer Super Output Area) is a statistical area used by the ONS and others to help publish statistics about the UK. There are many statistical datasets that are reported against these areas.
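The matching step is essentially a point-in-polygon test: for each ATM, find the area whose boundary contains its coordinates. The following Python sketch shows the idea using the classic ray-casting test. It is not how QGIS implements its spatial join, and the square "areas" and their codes are invented; real LSOA boundaries are far more complex shapes.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: is the point inside the polygon (a list of (x, y))?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # x position where the edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def assign_area(lon, lat, areas):
    """Return the code of the first area containing the point, else None."""
    for code, polygon in areas.items():
        if point_in_polygon(lon, lat, polygon):
            return code
    return None


# Two invented square areas with made-up codes, for illustration only.
AREAS = {
    "AREA_A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "AREA_B": [(1, 0), (2, 0), (2, 1), (1, 1)],
}

print(assign_area(0.5, 0.5, AREAS))
print(assign_area(1.5, 0.5, AREAS))
```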

Having completed that enrichment process I could now start to explore the data in the context of official statistics on demographics. There are many interesting questions that I can now ask of the data. But other people might also have interesting uses for that enriched dataset.

The process of doing the enriching is quite technical. I’m comfortable with teaching myself how to do that. But it would be great if I could help other people unlock value by letting them explore the enriched data.

Unfortunately I can’t share my enriched version with them. I’m not allowed to change any of the content of the data, or distribute it in alternate forms. The best I could do is tweet out a few interesting insights.

I am discouraged from using the data

One way I could use the enriched data is to explore how ATM and branch locations might relate to deprivation or other demographic statistics. This might highlight patterns in how individual banks have chosen to site their branches.

I could also monitor the data over time and build up a picture of where ATMs and branches are opening and closing around the country. Or explore the changing mix of products available from individual banks.
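Technically, that kind of monitoring amounts to comparing snapshots of the data taken at different times. A minimal Python sketch, assuming each snapshot has been reduced to a dictionary keyed by a stable branch or ATM identifier (the identifiers, names and the `compare_snapshots` function below are invented):

```python
def compare_snapshots(old, new):
    """Report which identifiers appeared or disappeared between snapshots."""
    old_ids, new_ids = set(old), set(new)
    return {
        "opened": sorted(new_ids - old_ids),
        "closed": sorted(old_ids - new_ids),
    }


# Invented snapshots, for illustration only.
january = {"B001": "High Street", "B002": "Station Road"}
february = {"B002": "Station Road", "B003": "Market Square"}

changes = compare_snapshots(january, february)
print(changes)
```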

Unfortunately I don’t think I can do that. Clause 3.1(b) of the licence states that I must not “use or present the Open Data or any analysis of it in a way that is unfair or misleading, for example comparisons must be based on objective criteria and not be prejudiced by commercial interests“.

It’s not clear to me what unfair or misleading means. Unfair to the banks? Unfair to consumers? What type of objective criteria are acceptable?

If I were working for a fintech startup, I could perhaps use the data to identify new financial products that could be offered to consumers. I think that’s the type of innovation that the CMA wanted to encourage?

But if I do that and share my analysis with others, then am I “prejudiced by commercial interests”? The licence says I can use the data commercially, but seems to discourage certain types of commercial usage.

These types of broad, under-defined clauses in licences discourage reuse. They create uncertainty around what is actually permitted under the terms of the licence. This reduces the likelihood of people using the data, unless they can cover the cost of the legal advice needed to remove the uncertainty.

I have probably already broken the terms of the licence

I think I may have already broken the terms of the licence. As a bit of fun I’ve created a twitter account called @allthebarclays. Every day it tweets out a picture of a branch of Barclays along with its name and unique identifier.

I’m probably not allowed to do that. The photos in the data don’t have a licence attached to them, so I’m hoping that if challenged, I can justify it under fair use.

The account is clearly a joke. It’s of no real use to anyone. But it gave me a focus for my explorations with the data.

It was also a deliberate attempt to show how the data could be used to create something which is far from its original intended use. Because encouraging unexpected uses of the data is one of the primary goals of publishing open data. It’s the unexpected uses that are most likely to hit the types of limitations that I’ve outlined above.

How does this get resolved?

There are several ways in which these issues could begin to be addressed. There are measures that the initiative could take that would address some specific limitations, or it could take steps to address all of them. For example, the Open Banking Initiative could:

  1. Publish data in other formats, e.g. by providing a CSV download. This would explicitly address one part of the first issue I highlighted, but none of the real concerns
  2. Publish some guidance for reusers that clarifies some of the terms of its existing licence. This might avoid discouraging some uses of the data but again, it doesn’t address the primary issues. The data would still not be open data
  3. Revise its licence to remove the problematic clauses and create an open data licence. This would ideally go through the licence approval process. This would address all of the concerns
  4. Drop the licence completely in favour of the Creative Commons Attribution licence (CC-BY 4.0). This would address all of the concerns with the added benefit that it would be explicitly clear to all users that the data could be freely and easily mixed with other open data

Only the last two options would be substantial progress.

What’s needed is for someone at the Open Banking initiative (or perhaps the CMA?) to step up and take responsibility for addressing the issues. Unfortunately, until that happens, the initiative is just another example of open washing.

What is data asymmetry?

You’ve just parked your car. Google Maps offers to record your current location so you can find where you parked your car. It also lets you note how much parking time you have available.

Sharing this data allows Google Maps to provide you with a small but valuable service: you can quickly find your car and avoid having to pay a fine.

For you that data has a limited shelf-life. It’s useful to know where you are currently parked, but much less useful to know where you were parked last week.

But that data has much more value to Google because it can be combined with the data from everyone else who uses the same feature. When aggregated that data will tell them:

  • The location of all the parking spaces around the world
  • Which parking spaces are most popular
  • Whether those parking spaces are metered (or otherwise time-limited)
  • Which parking spaces will become available in the next few hours
  • When combined with other geographic data, it can tell them the places where people usually park when they visit other locations, e.g. specific shops or venues
  • …etc

That value only arises when many data points are aggregated together. And that data remains valuable for a much longer period.
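
To make the contrast concrete, here is a minimal Python sketch. The records, user names and location names are all invented for illustration, and the record layout is my own assumption rather than anything Google has published. A single record can only answer “where is my car?”, while the pooled records start to show which places are popular and how long people stay.

```python
from collections import Counter

# Hypothetical parking records: (user, location, hours_of_parking_time).
# All values are made up for illustration.
records = [
    ("alice", "High Street", 2),
    ("bob", "High Street", 1),
    ("carol", "Station Road", 4),
    ("dave", "High Street", 2),
    ("erin", "Station Road", 3),
]


def my_parking_spot(user, records):
    """What one contributor gets back: their own most recent location."""
    for name, location, _hours in reversed(records):
        if name == user:
            return location
    return None


def popularity(records):
    """What the data steward can derive: how often each location is used."""
    return Counter(location for _user, location, _hours in records)


def average_stay(records):
    """Average recorded parking time per location, in hours."""
    totals = {}
    for _user, location, hours in records:
        count, total = totals.get(location, (0, 0))
        totals[location] = (count + 1, total + hours)
    return {loc: total / count for loc, (count, total) in totals.items()}
```

Calling my_parking_spot("alice", records) only needs Alice’s own record. The other two functions are meaningless for a single data point and only become informative as more people contribute.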

With access to just your individual data point Google can offer you a simple parking reminder service. But with access to the aggregate data points they can extract further value. For example by:

  • Improving their maps, using the data points to add parking spaces or validate those that they may already know about
  • Suggesting a place to park as people plan a trip into the same city
  • Creating an analytics solution that provides insight into where and when people park in a city
  • …etc

The term data asymmetry refers to any occasion when there is a disparity in access to data. In all cases this results in the data steward being able to unlock more value than a contributor.

A simple illustration using personal data

When does data asymmetry occur?

Broadly, data asymmetry occurs in almost every single digital service or application. Anyone running an application automatically has access to more information than its users. In almost all cases there will be a database of users, content or transaction histories.

Data asymmetry, and the resulting imbalances of power and value, are most often raised in the context of personal data. Social networks mining user information to target adverts, for example. This prompts discussion around how to put people back in control of their own data as well as encouraging individuals to be more aware of their data.

Apart from social networks, other examples of data asymmetry that relate to personal data include:

  • Smart meters that provide you with a personal view of your energy consumption, whilst providing energy companies with an aggregated view of consumption patterns across all consumers
  • Health devices that track and report on fitness and diet, whilst developing aggregated views of health across their population of users
  • Activity loggers like Strava that allow you to record your individual rides, whilst developing an understanding of mobility and usage of transport networks across a larger population

But because asymmetry is so prevalent it occurs in many other areas; it’s not an issue that is specific to personal data. Any organisation that offers a B2B digital service will also be involved in data asymmetry. Examples include:

  • Accounting packages that allow better access to business information, whilst simultaneously creating a rich set of benchmarking data on organisations across an industry
  • Open data portals that will have metrics and usage data on how users of the service are finding and consuming data
  • “Sharing economy” platforms that can turn individual transactions into analytics products

Data asymmetry is as important an issue in these areas as it is for personal data. These asymmetries can create power imbalances in sharing economy platforms like Uber. The impact of information asymmetry on markets has been understood since the 1970s.

How can data asymmetry be reduced?

There are many ways that data asymmetry can be reduced. Broadly, the solutions either involve reducing disparity in access to data, or in reducing disparities in the ability to extract value from that data.

Reducing the amount of data available to an application or service provider is where data protection legislation has a role to play. For example, data protection law places limits on what personal data companies can collect, store and share. Other examples of reducing disparities in access to data include:

  • Allowing users to opt-out of providing certain information
  • Allowing users to remove their data from a service
  • Creating data retention policies to reduce accumulation of data

Practising Datensparsamkeit reduces risks and imbalances associated with unfettered collection of data.
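
As a rough illustration of the last item in that list, a retention policy can be as simple as discarding anything older than a fixed window. This is a minimal sketch only: the record layout, the seven-day window and the function name apply_retention are assumptions made for the example, not a description of how any real service works.

```python
from datetime import datetime, timedelta


def apply_retention(records, now, max_age_days):
    """Keep only records recorded within the last `max_age_days` days."""
    cutoff = now - timedelta(days=max_age_days)
    return [r for r in records if r["recorded_at"] >= cutoff]


# Hypothetical data: two parking records, one recent and one old.
now = datetime(2017, 6, 30)
records = [
    {"user": "alice", "location": "High Street",
     "recorded_at": datetime(2017, 6, 29)},
    {"user": "alice", "location": "Station Road",
     "recorded_at": datetime(2017, 5, 1)},
]

# Only the record from the last seven days survives.
kept = apply_retention(records, now, max_age_days=7)
```

The shorter the window, the less data accumulates with the service provider, at the cost of some of the longer-term aggregate value described earlier.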

Reducing disparities in the ability to extract value from data can include:

  • Giving users more insight and input into when and where their data is used or shared
  • Giving users or businesses access to all of their data, e.g. a complete transaction history or a set of usage statistics, so they can attempt to draw additional value from it
  • Publishing some or all of the aggregated data as open data

Different applications and services will adopt a different mix of strategies. This will require balancing the interests of everyone involved in the value exchange. Policy makers and regulators also have a role to play in creating a level playing field.

Open data can reduce asymmetry by allowing value to spread through a wider network

Update: the diagrams in this post were made with a service called LOOPY. You can customise the diagrams and play with the systems yourself. Here’s the first diagram visualising data asymmetry and here is the revised version that shows how open data reduces asymmetry by allowing value to spread further.

This post is part of a series called “basic questions about data”.

Fearful about personal data, a personal example

I was recently at a workshop on making better use of (personal) data for the benefit of specific communities. The discussion, perhaps inevitably, ended up focusing on many of the attendees’ concerns around how data about them was being used.

The group was asked to share what made them afraid or fearful about how personal data might be misused. The examples were mainly about use of the data by Facebook, by advertisers, as surveillance, etc. There was a view that being in control of that data would remove the fear and put the individual back in control. This same argument pervades a lot of the discussion around personal data. The narrative is that if I own my data then I can decide how and where it is used.

But this overlooks the fact that data ownership is not a clear-cut thing. Multiple people might reasonably claim to have ownership over some data. For example bank transactions between individuals. Or data about cats. Multiple people might need to have a say in how and when that data is used.

But setting aside that aspect of the discussion for now, I wanted to share what made me fearful about how some personal data might be misused.

As I’ve written here before my daughter has Type-1 diabetes. People with Type-1 diabetes live a quantified life. Blood glucose testing and carbohydrate counting are a fact of life. Using sensors makes this easier and produces better data.

We have access to my daughter’s data because we are a family. By sharing it we can help her manage her condition. The data is shared with her diabetes nurses through an online system that allows us to upload and view the data.

What makes me fearful isn’t that this data might be misused by that system or the NHS staff.

What makes me fearful is that we might not be using the data as effectively as we could be.

We are fully in control of the data, but that doesn’t automatically give us the tools, expertise or insight to use it. There may be other ways to use that data that might help my daughter manage her condition better. Is there more that we could be doing? Is there more data we could be collecting?

I’m technically proficient enough to do things with that data. I can download, chart and analyse it. Not everyone can do that. What I don’t have are the skills, the medical knowledge, to really use it effectively.

We have access to some online reporting tools as a consequence of sharing the data with the NHS. I’m glad that’s available to us. It does a better job than I can do.

I also fear that there might be insights that researchers could extract from that data, by aggregating it with data shared by other people with diabetes. But that isn’t happening, because we have no way to really allow that. And even so I’m not sure we would be qualified to judge the quality of a research project to know where it might best be shared.

My aim here is not to be melodramatic. We are managing very well thank you. And yes there are clearly areas where unfettered access to personal data is problematic. There’s no denying that. My point is to highlight that ownership and control don’t automatically address concerns or create value.

We are not empowered by the data, we are empowered when it is being used effectively. We are empowered when it is shared.