Why are bulk downloads of open data important?

I was really pleased to see that at the GODAN Summit last week the USDA announced the launch of its Branded Food Product Database, providing nutritional information on over 80,000 food products. Product reference data is an area that has long been under-represented in the open data commons, so it's great to see data of this type being made available. Nutritional data is also an area in which I'm developing a personal interest.

The database is in the public domain, so anyone can use it for any purpose. It’s also been made available via a (rate-limited) API that allows it to be easily searched. For many people the liberal licence and machine-readability will be enough to place a tick in the “open data” box. And I’m inclined to agree.

In the balance

However, as Owen Boswarva observed, the lack of a bulk download option means that the dataset technically doesn’t meet the open definition. The latest version of the definition states that data “must be provided as a whole…and should be downloadable via the Internet”. This softens the language used in the previous version which required data to be “available in bulk”.

The question is, does this matter? Is the open definition taking an overly pedantic view, or is it enough, as many people would argue, for the data to be openly licensed?

I think having a clear definition of what makes data open, as opposed to closed or shared, is essential as it helps us focus discussion around what we want to achieve: the ability for anyone to access, use and share data, for any purpose.

It’s important to understand how licensing or accessibility restrictions might stop us from achieving those goals. Because then we can make informed decisions, with an understanding of the impacts.

I’m less interested in using the definition as a means of beating up on data publishers. It’s a tool we can use to understand how a dataset has been published.

That said, I’m never going to stop getting cross about people talking about “open data” that doesn’t have an open licence. That’s the line I won’t cross!

Bulking up

I think it's an unequivocally good thing that the USDA have made this public domain data available. So let's focus on the impacts of their decision not to publish a bulk download. Something which I suspect would be very easy for them to do. It's a large spreadsheet, not “big data”.

The API help page notes that the service is “intended primarily to assist application developers wishing to incorporate nutrient data into their applications or websites”. And I think it achieves that for the most part. But there are some uses of data that are harder when the machine-readable version is only available via an API.

Here's a quick, non-exhaustive list of the ways a dataset could be used:

  • A developer may want to create a new interface to the dataset, to improve on the USDA’s own website
  • A developer may want to query it to add some extra features to an existing website
  • A developer may want to use the data in a mobile application
  • A developer may want to use the data in a desktop application
  • A developer may want to enrich the dataset with additional information and re-publish it
  • A data scientist might want to use the data as part of an analysis
  • An archivist might want to package the dataset and place a copy in the Internet Archive to preserve it
  • A scientist might want to use the data as part of their experimental analysis
  • An organisation might want to provide a mirror of the USDA data (and perhaps service) to help it scale
  • A developer might want to use the data inside services like Amazon or Google public datasets, or Kaggle, or data.world
  • A data journalist might want to analyse the data as part of a story
  • ….etc.

Basically there are a lot of different use cases, which vary based on:

  • the technical expertise of the user
  • the technical infrastructure in which the data is being used
  • whether all or only part of the dataset is required
  • whether custom processing or analysis is required
  • whether the results are being distributed

What’s important to highlight is that all of these use cases can be supported by a bulk download. But many of them are easier if there is an API available.

How much easier depends on the design of the API. And the trade-off of making data easier to use is that it increases the cost and effort of publishing it. The USDA are obviously aware of that cost, because they've added rate limits to the API.

Many of the use cases are harder if the publisher only provides an API. Again, how much harder will depend on the design of the API.

Personally I always advocate having bulk downloads by default and APIs available on a best effort basis. This is because it supports the broadest possible set of use cases. In particular it helps make data portable so that it can be used in a variety of platforms. And as there are no well-adopted standard APIs for managing and querying open datasets, bulk downloads offer the most portability across platforms.

Of course there are some datasets, those that are particularly large or rapidly changing, where it is harder to provide a useful, regularly updated data dump. In those cases provision via an API or other infrastructure is a reasonable compromise.

Balancing the scales

Returning to the USDA's specific goals, they are definitely assisting developers in incorporating nutrient data into their applications. But they're not necessarily making it easy for all application developers, or even all types of user.

Presumably they’ve made a conscious decision to focus on querying the data over use cases involving bulk analysis, archiving and mirroring. This might not be an ideal trade-off for some. And if you feel disadvantaged then you should take time to engage with the USDA to explain what you’d like to achieve.

But the fact that the data is openly licensed and in a machine-readable form means that it’s possible for a third-party intermediary or aggregator to collect and republish the data as a bulk download. It’ll just be less efficient for them to do it than the USDA. The gap can be filled by someone else.
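To make that concrete, here's a minimal sketch (in Python) of what such an aggregator might do: page through a hypothetical rate-limited search API and write the results out as a single CSV for bulk reuse. The endpoint, parameters and field names are purely illustrative, not the USDA's actual API.

```python
import csv
import time
import requests

# Hypothetical paged API endpoint and parameters; the real service will
# differ, so treat this as a sketch of the pattern rather than a client.
API_URL = "https://api.example.gov/food-products/search"
API_KEY = "your-api-key"
PAGE_SIZE = 100


def fetch_all_products():
    """Page through the API, pausing between requests to respect rate limits."""
    offset = 0
    while True:
        response = requests.get(API_URL, params={
            "api_key": API_KEY,
            "max": PAGE_SIZE,
            "offset": offset,
        })
        response.raise_for_status()
        items = response.json().get("items", [])
        if not items:
            break
        yield from items
        offset += PAGE_SIZE
        time.sleep(1)  # stay well within the rate limit


with open("products-bulk.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "manufacturer"])
    writer.writeheader()
    for product in fetch_all_products():
        # Keep only the illustrative fields declared above
        writer.writerow({k: product.get(k) for k in writer.fieldnames})
```

Even this simple pattern shows the inefficiency: thousands of small requests, and a rate limit to respect, just to recreate what could have been published as a single file.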

Which is why I think it's so important to focus on licensing. It's the real enabler behind making something open. Without an open licence you can't get any real (legal) value-add from your community.

And if you don’t want to enable that, then why are you sharing data in the first place?

This post is part of a series called “basic questions about data”.

People like you are in this dataset

One of the recent projects we’ve done at Bath: Hacked is to explore a sample of the Strava Metro data covering the city of Bath. I’m not going to cover all of the project details in this post, but if you’re interested then I suggest you read this introductory post and then look at some of the different ways we presented and analysed the data.

From the start of the project we decided that we wanted to show the local (cycling) community what insights we might be able to draw from the dataset and illustrate some of the ways it might be used.

Our first step was to describe the dataset and how it was collected. We then outlined some questions we might ask of the data. And we tried to assess how representative the dataset was of the local cycling community by comparing it with data from the last census.

The reactions were really interesting. I spent a great deal of time on social media patiently answering questions and objections. I wanted to help answer those questions and understand what issues and concerns people might have in using this type of data.

I found that there were broadly two different types of feedback.

Visible participation

The first, more positive response, was from existing or previous Strava users surprised or delighted that their data might contribute towards this type of analysis. Some people shared the fact that they only logged some types of rides, while others explained that they already logged all of their activity including commutes and recreational riding. I saw one comment from a user who was now determined to do this more diligently, just so they could contribute to the Metro dataset.

A lesson here is that even users who understand that their data is being collected can still be surprised by the ways that the data might be re-purposed. This is a data literacy issue: how can we help non-specialists understand the incredible malleability of data?

I think the reaction also reinforces the point that people will often contribute more if they think their data can be used for social good. Or just that people like them are also contributing.

This is important if we want to encourage more participation in the maintenance of data infrastructure. Commercial organisations would do well to think about how open data and data philanthropy might drive more use of their platforms rather than threaten them.

Even if the Strava data were completely open there are still challenges in its use and interpretation. This creates the space for value-added services. (btw, if anyone wants help with using the Strava Metro data then I’m happy to discuss how Bath: Hacked could help out!)

Two tribes

The second, more negative response, was from people who didn’t use Strava and often had strong opinions about the service. I’ll step lightly over the details here. But, while I want to avoid being critical (because I’m genuinely not), I want to share a variety of the responses I saw:

  • I don’t use this dataset, so it can’t tell you anything about how I cycle
  • I don’t understand why people might use the service, so I’m suspicious of what the data might include
  • I think only a certain type of people use the service so it's only representative of them, not me
  • I think people only use this service in a specific way, e.g. not for regular commutes, and so the data has limited use
  • I’m suspicious about the reliability of the data, so distrust it.

I think I'd sum all of that up as: “people like me don't use this service, so any data you have isn't representative of me or my community”.

This is exactly the issue we tried to shed some light on in our first two blog posts. So clearly we failed at that! Something to improve on in future.

The real lesson for me here is that people need to see themselves in a dataset.

If we don't help someone understand whether a dataset is representative of them, then its use will be viewed with suspicion and doubt. It doesn't matter how rigorous the data collection and analysis process might be behind the scenes, it's important to help find ways for people to see that for themselves. This isn't a data literacy issue: it's a problem with how we effectively communicate and build trust in data.

If we increasingly want to use data as a mirror of society, then people need to be able to see themselves in its reflection.

If they can see how they might be a valuable part of a dataset, then they may be more willing to contribute. If they can see whether they (or people like them) are represented in a dataset, then they may be more willing to accept insights drawn from that data.

Storytelling is likely to be a useful tool here, but I wonder whether there are other complementary ways to approach these issues?

Help me use your data

I've been interviewed a few times recently by people interested in understanding how best to publish data to make it useful for others. Once by a startup and a couple of times by researchers. The core of the discussion has essentially been the same question: “how do you know if a dataset will be useful to you?”

I’ve given essentially the same answer each time. When I’m sifting through dataset descriptions, either in a portal or via a web search, my first stage of filtering involves looking for:

  1. A brief summary of the dataset: e.g. a title and a description
  2. The licence
  3. Some idea of its coverage, e.g. geographic coverage, scope of time series, level of aggregation, etc
  4. Whether it’s in a usable format

Beyond that, there's a lot more that I'm interested in: the provenance of the data, its timeliness and a variety of quality indicators. But those pieces of information are what I'm looking for right at the start. I'll happily jump through hoops to massage some data into a better format. But if the licence or coverage isn't right then it's useless to me.

We can frame these as questions:

  1. What is it? (Description)
  2. Can I use it? (Licence)
  3. Will it help answer my question? (in whole, or in part)
  4. How difficult will it be to use? (format, technical characteristics)
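As a sketch of what answering those questions up front might look like, here's a minimal, entirely illustrative metadata record; the field names are my own rather than any particular portal's schema or standard.

```python
# A minimal, illustrative dataset description answering the four questions.
# Field names are hypothetical, not taken from a specific portal or standard.
dataset_description = {
    # 1. What is it?
    "title": "Hourly temperature observations",
    "description": "Hourly air temperature readings from ground-based stations.",
    # 2. Can I use it?
    "licence": "https://creativecommons.org/licenses/by/4.0/",
    # 3. Will it help answer my question?
    "spatial_coverage": "United Kingdom (station locations listed separately)",
    "temporal_coverage": "2010-01-01 to present, updated hourly",
    # 4. How difficult will it be to use?
    "format": "CSV",
    "download_url": "https://example.org/observations/latest.csv",
}
```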

It’s frustrating how often these essentials aren’t readily available.

Here’s an example of why this is important.

A weather data example

I’m currently working on a project that needs access to local weather observations. I want openly licensed temperature readings for my local area.

My initial port of call was the Met Office Hourly Site Specific Observations. The product description is a useful overview and the terms of use make the licensing clear. Questions 1 & 2 answered.

However I couldn’t find a list of sites to answer Question 3. Eventually I found the API documentation for the service that would generate me a list of sites. But I can only access that with an API key. So I’ve signed up, obtained a key, made the API call, downloaded the JSON, converted it into CSV, uploaded it to Carto and then made a map.
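The conversion itself was trivial; the friction was in having to do it at all. For the curious, the steps looked roughly like the sketch below. The site-list URL and the JSON structure are placeholders from memory: the real details sit in the Met Office API documentation, behind the key sign-up.

```python
import csv
import requests

# Placeholder URL and assumed response structure; the real endpoint and
# field names are documented behind the Met Office API key sign-up.
SITELIST_URL = "https://api.example-metoffice.gov.uk/sitelist"
API_KEY = "your-api-key"

response = requests.get(SITELIST_URL, params={"key": API_KEY})
response.raise_for_status()
sites = response.json()["Locations"]["Location"]  # assumed structure

with open("sites.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "name", "latitude", "longitude"])
    for site in sites:
        writer.writerow([site.get("id"), site.get("name"),
                         site.get("latitude"), site.get("longitude")])

# sites.csv can then be uploaded to Carto (or any mapping tool) to see
# whether there are any observation sites nearby.
```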

And now I can answer Question 3. The closest site is in Bristol, so the service isn't useful to me at all. Time wasted, though hopefully not all the effort, because now you can just look at the map. But the Met Office could simply have published a map. There is one of the whole network, but not all of those sites contribute to the open dataset.

So I started to look at the OpenWeatherMap API. They also have an API endpoint that exposes weather data for a specific station, or for stations within a geographic area. But again, they've not actually published a map that would let me see if there are any stations local to me. I might have missed something so I've asked them.

In both cases I'm having to invest time and some technical effort in answering questions which should be answered by the documentation. They could even use their own APIs to create an interactive map for people to use!

As a result I'm going to end up using wunderground. By browsing the user-facing part of their site I've been able to confirm there are several local weather stations. And hopefully these will be exposed via the API. (But I'm going to have to dig a bit to check on the terms of use. Sigh.)

If you really want me to use your data then you need to help me to use it. Think about my user experience. Help me understand what your dataset contains before I have to actually poke around inside it.

Reputation data portability

Yesterday I went to the ODI lunchtime lecture on portability of reputation data. It was an interesting discussion which triggered a few thoughts that I thought I'd share here.

The debate was prompted by a call for evidence from the Department formerly known as BIS around consumer data and account switching:

“The government would like to understand whether the reputation data earned by a user on a particular platform could be used to help them win business or prove their trustworthiness in other contexts. We would also be interested in views on the technical and other challenges that would be associated with making this reputation data portable”

The consultation includes this question:

“What new opportunities or risks for businesses, workers and consumers would be created if they were able to port their reputation and feedback data between platforms?”

It also asks about the barriers that might hinder this type of portability.

One useful way to answer these questions is to break them down into smaller pieces:

  1. Should consumers be able to access data they've contributed to a platform?
  2. Should businesses be able to access data about them in a platform, e.g. reputation data such as reviews?
  3. Should businesses and consumers be able to move this data between platforms?
  4. Should it be permitted for that data to be reused by others, e.g. in competing platforms?

The first two questions are about exporting data.

The third and fourth questions are really about portability and data licensing.

I would say that, broadly, the answer to all of these questions is: Yes.

I think consumers and businesses should be able to access this data and, further, that it should be machine-readable open data. They should also be able to access any of their personal data held in the platform, but this isn't really an area of debate. The new EU GDPR regulation requires platforms to provide you with your data if you request it, although it doesn't (to my knowledge) require it to be in a machine-readable, reusable form.

I think this also answers the last question: the data should be reusable. However I expect resistance from platforms, as where this type of data is currently made available it is done so under non-open terms. For example, via API agreements that prohibit some forms of reuse, such as use in a competing service.

The question on portability is trickier though. While I think that portability is something to aspire to, in practice it is going to be difficult to achieve.

Portability requires more than just creating a data standard to enable export and import of data, or APIs that enable more dynamic synchronisation. I think that’s the easy part.

Portability would also require platforms to agree or converge around how reputation data is collected and calculated. It’s no good moving data from one system to another if they have incompatible definitions. There are many ways in which platforms might differ:

  • They can use a different rating scheme, e.g. 5 stars, 10 stars, or just “likes”
  • They might allow, or require, a text review in addition to a rating
  • They can allow anonymous reviews or require users to make themselves known
  • They can allow anyone to review any service or business (e.g. TripAdvisor, Amazon product reviews), or they can enforce that reviews are only made when there has been evidence of a transaction (e.g. rating a supplier on eBay or Amazon)
  • Related, they might allow both forms of review, but distinguish those that are based on a transaction
  • Or they may not allow explicit reviews at all and measure reputation in some other way (e.g. completing transactions within an expected time period, or number of sales made)
  • …etc, etc
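To see why convergence is hard, here's a very rough sketch of what a common record for porting a single review between platforms might look like. The field names are entirely hypothetical; the point is how much of it has to be optional or platform-specific.

```python
from dataclasses import dataclass
from typing import Optional


# A hypothetical interchange record for one review. Almost every field is
# optional or platform-specific, which is exactly the problem: two platforms
# can both "support reviews" and still have very little in common.
@dataclass
class PortableReview:
    subject_id: str                      # the business, product or service reviewed
    rating: Optional[float] = None       # 5 stars? 10? or just a "like"?
    scale_max: Optional[int] = None      # needed to interpret the rating at all
    text: Optional[str] = None           # some platforms require text, others forbid it
    reviewer_id: Optional[str] = None    # anonymous reviews have no stable identity
    verified_transaction: bool = False   # only meaningful on marketplace platforms
    transaction_id: Optional[str] = None
```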

And this is without even getting into the weeds of what users think they are reviewing. For example, are you reviewing the restaurant, the service you received on a specific visit, the menu choice with respect to your preferences, or perhaps even the specific person that delivered that service? I think we've all seen examples of all of those variations even within single platforms.

XKCD has nicely summarised a variety of issues with rating systems in these three cartoons. And we shouldn’t forget the creative ways in which review systems get repurposed.

It's important to highlight here that this type of variation doesn't really occur with data like banking transactions, utility bills, etc. I tend to think portability there is much easier to achieve. There is variation, but it is typically around charging models, not the meaning and method of collection of the data.

Is all the variation in rating and review schemes warranted? Perhaps not. Some convergence might actually be useful. But these variations are also likely to be key parts of the user experience and functionality of the platform. So I'm personally very wary about restricting product developers from innovating in this area.

In my view rather than focusing on portability, we should be asking for this data to be published as open data. This will then open the possibility for the data to be aggregated and presented across platforms.

Enabling the creation of aggregated reference points for reputation data may be more practical than requiring true portability across platforms. We have models for this already: price comparison sites and credit reference agencies. In fact, if this data becomes more open it seems likely that credit agencies will be the first to benefit from it.

The state of open licensing

I spend a lot of time reading through licences and terms & conditions. Much more so than I thought I would when I first started getting involved with open data. After all, I largely just like making things with data.

But there’s still so much data that is public but not open. Or datasets that are nearly open but which have been wrapped in awkward additional terms. And still plenty of confusion about what open data actually is, as Andy Dickinson highlighted yesterday.

And yet the open data licensing choices really aren't that hard; you can fit the essential choices in a tweet.

Resolving this is just going to take more time, education and patient explanation of the benefits and disadvantages of different licensing models.

But I’ve been wondering about what direction we’re moving in with regards to licensing.

Reducing friction

Since the release of the 4.0 series of Creative Commons licences we've had a standard, globally applicable set of terms that allow us to openly licence all forms of creative works and datasets. I really don't see any reason to continue to use the Open Database Licence and I would love the maintainers to either clarify the continued role it plays or acknowledge that it's deprecated and discourage its use.

The UK Open Government Licence (OGL) has spawned a variety of national licences. But, now that it is interchangeable with CC-BY 4.0, its continued existence also seems largely unnecessary. Governments currently without a standard national licence are better off adopting CC-BY 4.0 than creating another fork of the OGL.

There may be good reasons for retaining the OGL and I’d be interested in hearing them if anyone has opinions. But it feels like we might continue to simplify the licensing landscape by planning for it to become obsolete.

I continue to wrestle with whether I'm becoming an open data pedant. (And maybe I am!) But I feel these are issues that are important to pay attention to, if only to follow evolving best practice.

That said, I'm convinced that any friction around licensing can potentially hamper the reuse of open data. So I think it's something to recognise and remove wherever possible. The more the commons is used, the more value will be unlocked. And this will help it grow, not just by increasing contribution, but also through increasing investment, so we can have a proper open data infrastructure for society as a whole.

And friction not only hampers reuse, it also slows the publication of new data. I know from experience that confusion around appropriate licences is a common area of uncertainty for publishers. Especially commercial publishers who are concerned about the risks of adopting open licences rather than using custom terms which are within their comfort zone.

As specific licensing frameworks and model terms and conditions become embedded, they will be harder to remove later. It's important not to overlook the impacts of bespoke terms.

Evolving practice

It's interesting to see how, for example, the OpenOpp terms borrow heavily from those of OpenCorporates. As a successful open data business it's not surprising that OpenCorporates is being used as an exemplar.

But, in my opinion, the OpenCorporates terms have some niggling issues. Firstly there is the specific requirement around how attribution must be presented (font sizes, and not just text and a link), coupled with the requirement that anyone re-publishing the data must ensure that downstream users also conform with those requirements. That's really not dissimilar to the custom attribution requirements that were present in the Ordnance Survey's original fork of the OGL.

The open data community has campaigned at length to convince governments that they should, at most, require simple attribution statements from re-users of their data. I don't think it's a positive move for that same data to begin accumulating new terms and licences within its first few steps into the ecosystem.

That said, the more concerning way in which practice may evolve is by stepping away from open licensing entirely. That goes hand-in-hand with the increasing interest and reference to “data markets” which I’ve encountered from many city-based initiatives. I’ve already written at length about my thoughts on the Copenhagen marketplace and I’m hoping London isn’t going in the same direction.

Elsewhere though, I see promising progress. The scientific research community has long been converging on CC0 (public domain) for its data and CC-BY for its content. CC0 avoids problems with attribution stacking and that community has long had social norms that encourage recognition of sources, without requiring it through a licensing regime.

But that practice isn't yet so commonplace elsewhere. Even though it's part and parcel of being a good re-user. The visible impact of open data and content is a tide that raises all boats. If you call yourself an open data start-up you should be able to proudly point to where your data sources are listed on your website.

I also read that the US may be adopting legislation that will ensure that its open government data remains in the public domain. This is fantastic. That change will also clarify that the data is in the public domain internationally. It’s currently unclear whether “public domain” actually means “public domain within the US”. It may be crystal clear to IP and Copyright lawyers but not necessarily to non-experts like myself, which is my general point.

I wonder whether the general trajectory will be as the EFF recommend, for more open data to be placed into the public domain? That would require a big step forward for many governments as well as established projects like OpenStreetMap. Large scale licensing changes of that form are tricky to co-ordinate. Realistically I don’t see it happening unless there are either major changes to the social norms around data reuse, or until we start bumping into compatibility issues between data from different communities.

That's not entirely unlikely however. For example, the prevalence of CC-BY and CC-BY-SA style licensing from the commercial and public sector is at odds with research norms that require raw and derived data to be placed into the public domain under a CC0 waiver. You can't draw from one well and then add to the other. However, there are bigger issues to address first, as the recent OKCupid data release highlighted.

 

“The Wizard of the Wash”, an open data parable

The fourth open data parable.

In a time long past, in a land far away, there was once a great fen. A vast, sprawling wetland filled with a richness of plants and criss-crossed with many tiny streams and rivers.

This fertile land was part of a great kingdom ruled by a wise and recently crowned king. The fen was home to a hardy and industrious people who made a living from fishing, cutting peat and gathering the rare herbs that sprouted amongst the verdant grasses.

At the time of this tale the new king was travelling across his lands to learn more about his people. In a certain area of the fen he expected to find a thriving town that had become widely renowned for the skills of its herbalists and fishermen.

Instead he came upon a ramshackle collection of makeshift huts and tents clinging to patches of dry ground. The dejected people living in these shelters had clearly fallen on hard times and were eking out a living on the verges of the fen. Nearby lay the ruins of their settlement. Houses had tumbled haphazardly into the waters. The broken remains were being picked over for materials to build shelters and provide wood for fires.

Speaking to a fisherman, the king asked “What terrible disaster has befallen your village? How have you good people been brought so low?”

While continuing his task of mending a fishing net, the fisherman proceeded to tell the following tale:

“Our town has grown slowly over the years, sire. We live a hard life in the fens, and building on this treacherous land takes great care. For years our people were limited to building on isolated patches of stable ground. Our original village clung tightly to the patches of rock hidden just beneath the surface of these waters.

Until we made our pact with the Wizard of the Wash.

One day the Wizard came to us and demonstrated his great magicks. Showing how his powers could be used to drive great wooden piles deep into the peat. Deep enough to reach the bedrock and let us build wherever we wished. We would need only ask the Wizard to create a stable footing and we could build wherever we chose. In return, and to complete our pact, we need only to collect for him the rarest herbs and plants for his research. An easy task for us as we have long known the secrets of the fen.

And so for many years we have prospered. Each year we have planned out where we would build our new houses and workshops. And pointed to where we needed new roads, inns and store houses. And each year the Wizard would oblige us with his magicks. The town has spread across the fen and we started to grow rich from trade.

But then things began to change.

In the beginning the Wizard refused to drive new piles in a few places. He explained that he was concerned that the buildings may hinder certain herbs which grew in that area. And we followed his wishes for there were other places to build.

And so this continued. Each year the Wizard would reject some of our plans or convince us to change them for his own ends. For example where we once had planned a school he instead convinced us to build a new dock for his supply boats. Disappointed, we again submitted to his wishes, for we still needed to build and there was still space aplenty elsewhere. As traders we had grown accustomed to compromise.

But then the Wizard began to visit us more frequently, demanding to review in more detail our plans. He objected to certain buildings being extended as they blocked views that he enjoyed. He began to refuse to build in ever more locations and expressed opinions about how the town should grow.

Once he even required us to dismantle several houses so that we might build a better inn for him to stay in during his visits. He threatened to simply remove the foundations if we didn't comply. In return he chose to drive in only a few new piles. As a result some families were forced to live in cramped and poorer lodgings. And what choice did we have but to comply?

In these last few years the Wizard has become ever more demanding. He has argued that these piles were his, had always been his, and that we have only been using them with his permission. If we were unhappy, he argued, we could simply return to building as we had before.

Sire, while these lands are ours and have been for many generations, we had gladly given ourselves over to a petty tyrant. Once the pact had been made it was easier to comply than to resist.

The final disaster happened a few months ago. The Wizard had long been growing old and unwell. One night he passed away whilst staying in our finest inn. And on that night all of his magicks were undone. And so our fine town suddenly fell back into the swamp.

And so, as you see, we were ruined.”

Saddened by the tale, the king realised that here was a people whose needs had long been overlooked, leaving them at the mercy of fickle powers. He resolved to help them rebuild.

On the spot he issued a decree for the Royal Engineers to provide assistance to any town, village or people that required help. His kingdom would be built on firm foundations.

Discussion document: archiving open data

This is a brief post to highlight a short discussion document that I recently published about archiving open data. The document is intended to help gather ideas, suggestions and best practices around archiving open data to the Internet Archive. The goal is to gather together useful guidance that can help encourage archiving and distribution of open data from existing portals, frameworks, etc.

This isn’t an attempt to build a new standard, just encourage some convergence and activity. At present the guidance recommends building around the Data Package specification as it is simple and provides a well-defined unit (a zip file) for archiving purposes.
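As a rough illustration of how little machinery is involved, here's a sketch of packaging a single CSV file as a Data Package and zipping it ready for archiving. The metadata shown is the bare minimum and the file names are placeholders; a real archive would want richer descriptions and schemas.

```python
import json
import zipfile

# Minimal Data Package descriptor: a name, a licence and a list of resources.
# A real package should carry a fuller description and a table schema.
datapackage = {
    "name": "example-open-dataset",
    "licenses": [{"name": "CC-BY-4.0"}],
    "resources": [
        {"name": "data", "path": "data.csv", "format": "csv"}
    ],
}

with open("datapackage.json", "w") as f:
    json.dump(datapackage, f, indent=2)

# Bundle the descriptor and the data (assumed to be in the working directory)
# into a single zip: a well-defined unit that can be mirrored or archived.
with zipfile.ZipFile("example-open-dataset.zip", "w") as z:
    z.write("datapackage.json")
    z.write("data.csv")
```

The resulting zip could then be uploaded to the Internet Archive using their standard upload tools or client libraries.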

Archiving data can help build resilience in the open data commons, providing backups of important data resources. This will help deal with:

  • Unexpected system outages that could take down data portals
  • Decisions by publishers to remove data previously published under an open licence, ensuring an original copy remains
  • Decisions by publishers to take down data
  • Services and portals permanently going offline

If you have thoughts or suggestions then feel free to add them to the document. It would particularly benefit from input from those in the archival community and especially those who are already familiar with working with the Internet Archive.

I hope to build a small reference implementation to illustrate the idea and help to archive the data from Bath: Hacked.