People like you are in this dataset

One of the recent projects we’ve done at Bath: Hacked is to explore a sample of the Strava Metro data covering the city of Bath. I’m not going to cover all of the project details in this post, but if you’re interested then I suggest you read this introductory post and then look at some of the different ways we presented and analysed the data.

From the start of the project we decided that we wanted to show the local (cycling) community what insights we might be able to draw from the dataset and illustrate some of the ways it might be used.

Our first step was to describe the dataset and how it was collected. We then outlined some questions we might ask of the data. And we tried to assess how representative the dataset was of the local cycling community by comparing it with data from the last census.
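As a rough illustration of that last step, here is a minimal sketch of how a sample might be compared with census figures. It is not the analysis we actually ran: the function names, group labels and counts are all invented for illustration, and they are not taken from the Strava Metro data or the census.

```python
def shares(counts):
    """Convert raw counts per group into proportions of the total."""
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def representation_ratios(sample_counts, reference_counts):
    """Ratio of each group's share in the sample to its share in the reference.

    A ratio near 1.0 suggests the group is represented in proportion;
    below 1.0 it is under-represented, above 1.0 over-represented.
    """
    sample = shares(sample_counts)
    reference = shares(reference_counts)
    return {
        group: sample.get(group, 0.0) / ref_share
        for group, ref_share in reference.items()
        if ref_share > 0
    }


# Hypothetical counts, made up purely to show the calculation.
sample_counts = {"under_25": 50, "25_to_44": 300, "45_plus": 150}
census_counts = {"under_25": 200, "25_to_44": 500, "45_plus": 300}

ratios = representation_ratios(sample_counts, census_counts)
for group, ratio in ratios.items():
    print(f"{group}: {ratio:.2f}")
```

A table of ratios like this is only a starting point, but it gives readers something concrete to check themselves against.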

The reactions were really interesting. I spent a great deal of time on social media patiently answering questions and objections. I wanted to help answer those questions and understand what issues and concerns people might have in using this type of data.

I found that there were broadly two different types of feedback.

Visible participation

The first, more positive, response was from existing or previous Strava users who were surprised or delighted that their data might contribute towards this type of analysis. Some people shared the fact that they only logged some types of rides, while others explained that they already logged all of their activity, including commutes and recreational riding. I saw one comment from a user who was now determined to do this more diligently, just so they could contribute to the Metro dataset.

A lesson here is that even users who understand that their data is being collected can still be surprised by the ways that the data might be re-purposed. This is a data literacy issue: how can we help non-specialists understand the incredible malleability of data?

I think the reaction also reinforces the point that people will often contribute more if they think their data can be used for social good, or simply if they can see that people like them are also contributing.

This is important if we want to encourage more participation in the maintenance of data infrastructure. Commercial organisations would do well to think about how open data and data philanthropy might drive more use of their platforms rather than threaten them.

Even if the Strava data were completely open, there would still be challenges in its use and interpretation. This creates the space for value-added services. (btw, if anyone wants help with using the Strava Metro data then I’m happy to discuss how Bath: Hacked could help out!)

Two tribes

The second, more negative, response was from people who didn’t use Strava and often had strong opinions about the service. I’ll step lightly over the details here. But, while I want to avoid being critical (because I’m genuinely not), I want to share a variety of the responses I saw:

  • I don’t use this service, so the dataset can’t tell you anything about how I cycle
  • I don’t understand why people might use the service, so I’m suspicious of what the data might include
  • I think only a certain type of person uses the service, so it’s only representative of them, not me
  • I think people only use this service in a specific way, e.g. not for regular commutes, and so the data has limited use
  • I’m suspicious about the reliability of the data, so I distrust it

I think I’d sum all of that up as: “people like me don’t use this service, so any data you have isn’t representative of me or my community”.

This is exactly the issue we tried to shed some light on in our first two blog posts. So clearly we failed at that! Something to improve on in future.

The real lesson for me here is that people need to see themselves in a dataset.

If we don’t help someone understand whether a dataset is representative of them, then its use will be viewed with suspicion and doubt. It doesn’t matter how rigorous the data collection and analysis process might be behind the scenes; what matters is finding ways for people to see that for themselves. This isn’t a data literacy issue: it’s a problem with how we effectively communicate and build trust in data.

If we increasingly want to use data as a mirror of society, then people need to be able to see themselves in its reflection.

If they can see how they might be a valuable part of a dataset, then they may be more willing to contribute. If they can see whether they (or people like them) are represented in a dataset, then they may be more willing to accept insights drawn from that data.

Storytelling is likely to be a useful tool here, but I wonder whether there are other complementary ways to approach these issues?