Why are bulk downloads of open data important?

I was really pleased to see that at the GODAN Summit last week the USDA announced the launch of its Branded Food Product Database, providing nutritional information on over 80,000 food products. Product reference data is an area that has long been under-represented in the open data commons, so it’s great to see data of this type being made available. Nutritional data is also an area in which I’m developing a personal interest.

The database is in the public domain, so anyone can use it for any purpose. It’s also been made available via a (rate-limited) API that allows it to be easily searched. For many people the liberal licence and machine-readability will be enough to place a tick in the “open data” box. And I’m inclined to agree.

In the balance

However, as Owen Boswarva observed, the lack of a bulk download option means that the dataset technically doesn’t meet the open definition. The latest version of the definition states that data “must be provided as a whole…and should be downloadable via the Internet”. This softens the language used in the previous version which required data to be “available in bulk”.

The question is, does this matter? Is the open definition taking an overly pedantic view, or is it enough, as many people would argue, for the data to be openly licensed?

I think having a clear definition of what makes data open, as opposed to closed or shared, is essential as it helps us focus discussion around what we want to achieve: the ability for anyone to access, use and share data, for any purpose.

It’s important to understand how licensing or accessibility restrictions might stop us from achieving those goals. Because then we can make informed decisions, with an understanding of the impacts.

I’m less interested in using the definition as a means of beating up on data publishers. It’s a tool we can use to understand how a dataset has been published.

That said, I’m never going to stop getting cross about people talking about “open data” that doesn’t have an open licence. That’s the line I won’t cross!

Bulking up

I think it’s an unequivocally good thing that the USDA have made this public domain data available. So let’s focus on the impacts of their decision not to publish a bulk download. Providing one is something I suspect would be very easy for them to do: it’s a large spreadsheet, not “big data”.

The API help page notes that the service is “intended primarily to assist application developers wishing to incorporate nutrient data into their applications or websites”. And I think it achieves that for the most part. But there are some uses of data that are harder when the machine-readable version is only available via an API.

Here’s a quick, non-exhaustive list of the ways a dataset could be used:

  • A developer may want to create a new interface to the dataset, to improve on the USDA’s own website
  • A developer may want to query it to add some extra features to an existing website
  • A developer may want to use the data in a mobile application
  • A developer may want to use the data in a desktop application
  • A developer may want to enrich the dataset with additional information and re-publish it
  • A data scientist might want to use the data as part of an analysis
  • An archivist might want to package the dataset and place a copy in the Internet Archive to preserve it
  • A scientist might want to use the data as part of their experimental analysis
  • An organisation might want to provide a mirror of the USDA data (and perhaps service) to help it scale
  • A developer might want to use the data inside services like Amazon or Google public datasets, or Kaggle, or data.world
  • A data journalist might want to analyse the data as part of a story
  • …etc.

Basically there are a lot of different use cases, which vary based on:

  • the technical expertise of the user
  • the technical infrastructure in which the data is being used
  • whether all or only part of the dataset is required
  • whether custom processing or analysis is required
  • whether the results are being distributed

What’s important to highlight is that all of these use cases can be supported by a bulk download. But many of them are easier if there is an API available.

How much easier depends on the design of the API. And the trade-off in making data easier to use is that it increases the cost and effort of publishing it. The USDA are obviously aware of that cost, because they’ve added rate limits to the API.

Many of the use cases are harder if the publisher only provides an API. Again, how much harder will depend on the design of the API.

Personally I always advocate having bulk downloads by default and APIs available on a best effort basis. This is because it supports the broadest possible set of use cases. In particular it helps make data portable so that it can be used in a variety of platforms. And as there are no well-adopted standard APIs for managing and querying open datasets, bulk downloads offer the most portability across platforms.

Of course there are some datasets, those that are particularly large or rapidly changing, where it is harder to provide a useful, regularly updated data dump. In those cases provision via an API or other infrastructure is a reasonable compromise.

Balancing the scales

Returning to the USDA’s specific goals, they are definitely assisting developers in incorporating nutrient data into their applications. But they’re not necessarily making it easy for all application developers, or even all types of user.

Presumably they’ve made a conscious decision to focus on querying the data over use cases involving bulk analysis, archiving and mirroring. This might not be an ideal trade-off for some. And if you feel disadvantaged then you should take time to engage with the USDA to explain what you’d like to achieve.

But the fact that the data is openly licensed and in a machine-readable form means that it’s possible for a third-party intermediary or aggregator to collect and republish the data as a bulk download. It’ll just be less efficient for them to do it than the USDA. The gap can be filled by someone else.
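To make that concrete, here’s a minimal sketch of how an intermediary could reassemble a bulk download by walking a rate-limited, paginated API. The page size, delay, record shape and function names here are illustrative assumptions, not the USDA’s actual API parameters, and a stub stands in for the real endpoint so the sketch runs offline.

```python
# Hypothetical sketch: rebuilding a bulk dump from a paginated API.
# All endpoint details are assumptions for illustration.
import json
import time


def harvest(fetch_page, page_size=50, delay_seconds=0.0):
    """Walk every page of a paginated API and collect the records.

    fetch_page(offset, limit) should return a list of records,
    or an empty list once the dataset is exhausted.
    """
    records = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            break
        records.extend(page)
        offset += page_size
        time.sleep(delay_seconds)  # stay under the publisher's rate limit
    return records


def dump_to_file(records, path):
    # One JSON object per line: a simple, streamable bulk format
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


# Stub standing in for the real API, so the sketch runs offline
DATASET = [{"id": i, "name": f"product-{i}"} for i in range(120)]


def stub_fetch(offset, limit):
    return DATASET[offset:offset + limit]


if __name__ == "__main__":
    dump = harvest(stub_fetch)
    print(len(dump))  # prints 120
```

The point of the sketch is the inefficiency it illustrates: a third party can recreate the dump, but only by making one request per page and pacing themselves against the rate limit, where the publisher could simply export the whole table once.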

Which is why I think it’s so important to focus on licensing. It’s the real enabler behind making something open. Without an open licence you can’t get any real (legal) value-add from your community.

And if you don’t want to enable that, then why are you sharing data in the first place?

This post is part of a series called “basic questions about data”.

3 thoughts on “Why are bulk downloads of open data important?”

  1. Hi Leigh,

    It’s an interesting argument, but couldn’t it just be that providing a dump would go against the reason they put a rate limit on their API? If they served a dump they may as well remove that limit too. There must be some reason that drove them to put it in place.

    I’ve sometimes found that people reluctant to share a dump are happy to share an API. They perceive that as a way to make their data open without “losing control”, as they would still be the only ones holding the whole dump – and would put some API controls in place to limit the risk of someone re-assembling that dump.

    Christophe

  2. Hi Christophe,

    Rate limiting an API is a measure for limiting the amount of resources that someone can use on a server, either accidentally (e.g. poorly optimised code) or purposefully (e.g. denial of service). It could also be used to limit access to a complete dataset, but I don’t think it’s being used that way here. I can walk the dataset in around 80 hours. Slower than ideal, but not that slow for what is going to be a very slowly evolving dataset. The fact that the data is also in the public domain means that they’ve already relinquished quite a bit of control.

    I agree though that APIs can be used to limit access to a dataset. I think that’s most commonly associated with a freemium model, where the publisher wants to generate revenue from API usage. But again, that doesn’t seem to be the case here.

    Cheers,

    L.

Comments are closed.