The practice of open data

Open data is data that anyone can access, use and share.

Open data is the result of several processes. The most obvious one is the release process that results in data being made available for reuse and sharing.

But there are other processes that may take place before that open data is made available: collecting and curating a dataset; running it through quality checks; or ensuring that data has been properly anonymised.

There are also processes that happen after data has been published. Providing support to users, for example. Or dealing with error reports or service issues with an API or portal.

Some processes are also continuous. Engaging with re-users is something that is best done on an ongoing basis. Re-users can help you decide which datasets to release and when. They can also give you feedback on ways to improve how your data is published, or how it can be connected to and enriched with other sources.

Collectively these processes define the practice of open data.

The practice of open data covers much more than the technical details of helping someone else access your data. It covers a whole range of organisational activities.

Releasing open data can be really easy. But developing your open data practice can take time. It can involve other changes in your organisation, such as creating a more open approach to data sharing. Or getting better at data governance and management.

The extent to which you develop an open data practice depends on how important open data is to your organisation. Is it part of your core strategy or just something you’re doing on a more limited basis?

The breadth and depth of the practice of open data surprises many people. The learning process is best experienced first-hand: going through the process of opening a dataset, however small, provides useful insight that can help identify where further learning is needed.

One aspect of the practice of open data involves understanding what data can be open, what can be shared and what must stay closed. Moving data along the data spectrum can unlock more value. But not all data can be open.

An open data practitioner works to make sure that data is at the right point on the data spectrum.

An open data practitioner will understand the practice of open data and be able to use those skills to create value for their organisation.

Often I find that when people write about “the state of open data” what they’re actually writing about is the practice of open data within a specific community. For example, the practice of open data in research, or the practice of open government data in the US, or the UK.

Different communities are developing their open data practices at different rates. It’s useful to compare practices so we can distil out the useful, reusable elements. But we must acknowledge that these differences exist, and that open data can fulfil a different role and offer a different value proposition in different communities. However, there will obviously be common elements to those practices: the common processes that we all follow.

The open data maturity model is an attempt to describe the practice of open data. The framework identifies a range of activities and processes that are relevant to the practice of open data. It’s based on years of experience across a range of different projects. And it’s been used by both public and private sector organisations.

The model is designed to help organisations assess and improve their open data practice. It provides a tool-kit to help you think about the different aspects of open data practice. By using a common framework we can benchmark our practices against those in other organisations. Not as a way to generate leader-boards, but as a way to identify opportunities for sharing our experiences to help each other develop.

If you try it and find it useful, then let me know. And if you don’t find it useful, then let me know too. Hearing what works and what doesn’t is how I develop my own open data practice.

Discogs: a business based on public domain data

When I’m discussing business models around open data I regularly refer to a few different examples. Not all of these have well developed case studies, so I thought I’d start trying to capture them here. In this first write-up I’m going to look at Discogs.

In an attempt to explore a few different aspects of the service I’m going to look at the service itself, its data ecosystem and its data infrastructure.

How well that will work I don’t know, but let’s see!

Discogs: the service

Discogs is a crowd-sourced database about music releases: singles, albums, artists, etc. The service was launched in 2000. In 2015 it held data on more than 6.6 million releases; as of today there are 7.7 million. That’s 30% growth from 2014-15 and around 16% growth from 2015-16. The 2015 report and this Wikipedia entry contain more details.

The database has been built from the contributions of over 300,000 people. That community has grown about 10% in the last six months alone.

The database has been described as one of the most exhaustive collections of discographical metadata in the world.

The service has been made sustainable through its marketplace, which allows record collectors to buy and sell releases. As of today there are more than 30 million items for sale. A New York Times article from last year explained that the marketplace was generating 80,000 orders a week and was on track to do $100 million in sales, of which Discogs takes an 8% commission.

The company has grown from a one man operation to having 47 employees around the world, and the website now has 20 million visitors a month and over 3 million registered users. So only around 1.5% of monthly visitors also contribute to the database.

In 2007 Discogs added an API to allow anyone to access the database. Initially the data was made available under a custom data licence which included attribution and no derivatives clauses. The latter encouraged reusers to contribute to the core database, rather than modify it outside of the system. This licence was rapidly dropped (within a few months, as far as I can tell) in favour of a public domain licence. This has subsequently transitioned to a Creative Commons CC0 waiver.

The API has gone through a number of iterations. Over time the requirement to use API keys has been dropped, rate limits have been lifted and since 2008 full data dumps of the catalogue have been available for anyone to download. In short, the data has become increasingly open and accessible to anyone who wants to use it.
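
To make that a little more concrete, here’s a minimal sketch of fetching a single release from the API in Python. It assumes the api.discogs.com/releases/{id} endpoint and uses a made-up release identifier and illustrative field names, so check the current API documentation before relying on either.

```python
# Minimal sketch: fetch a single release from the Discogs API.
# The endpoint pattern and field names are assumptions based on the
# public documentation; verify them before using this in anger.
import requests

RELEASE_ID = 1  # hypothetical release identifier
USER_AGENT = "my-discogs-client/0.1"  # Discogs asks clients to identify themselves

response = requests.get(
    f"https://api.discogs.com/releases/{RELEASE_ID}",
    headers={"User-Agent": USER_AGENT},
    timeout=10,
)
response.raise_for_status()
release = response.json()

# Illustrative fields; inspect the JSON to see the full structure.
print(release.get("title"))
print(release.get("year"))
print([artist.get("name") for artist in release.get("artists", [])])
```

If you want the whole catalogue rather than individual records, the data dumps mentioned above are the better starting point.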

Wikipedia lists a number of pieces of music software that use the data. In May 2012 Discogs and The Echo Nest announced a partnership which would see the Discogs database incorporated into The Echo Nest’s Rosetta Stone product, which was being sold as a “big data” product to music businesses. It’s unclear to me if there’s an ongoing relationship. But The Echo Nest were acquired by Spotify in 2014 and have a range of customers, so we might expect that the Discogs data is being used regularly as part of their products.

Discogs: the data ecosystem

Looking at the various roles in the Discogs data ecosystem, we can identify:

  • Steward: Discogs is a service operated by Zink Media, Inc. They operate the infrastructure and marketplace.
  • Contributor: The community of volunteers who curate the database, along with the community support staff and leaders on the Discogs team
  • Reusers: The database is used in a number of pieces of music software and potentially by other organisations like The Echo Nest and their customers. More work is required here to understand this aspect
  • Aggregator: Echo Nest aggregates data from Discogs and other services, providing value-added services to other organisations on a commercial basis. Echo Nest in turn support additional reusers and applications.
  • Beneficiaries: Through the website, the information is consumed by a wide variety of enthusiasts, collectors and music stores. A larger network of individuals and organisations is likely supported through the APIs and aggregators

Discogs: the data infrastructure

To characterise the model we can identify:

  • Assets: the core database is available as open data. Most of this is available via the data dumps, although the API also exposes some additional data and functionality, including user lists and marketplace entries. It’s not clear to me how much data is available on the historical pricing in the marketplace. This might not be openly available, in which case it would be classified as shared data available only to the Discogs team.
  • Community: the Contributors, Reusers and Aggregators are all outlined above
  • Financial Model: the service is made sustainable through the revenue generated from the marketplace transactions. Interestingly, originally the marketplace wasn’t a part of the core service but was added based on user demand. This clearly provided a means for the service to become more sustainable and supported growth of staff and office space.
  • Licensing: I wasn’t able to find any details on other partnerships or deals, but the entire data assets of the business are in the public domain. It’s the community around the dataset and the website that has meant that Discogs has continued to grow whilst other efforts have failed
  • Incentives: as with any enthusiast driven website, the incentives are around creating and maintaining a freely available, authoritative resource. The marketplace provides a means for record collectors to buy and sell releases, whilst the website itself provides a reference and a resource in support of other commercial activities

Discogs can also be explored as a data infrastructure using Ostrom’s principles for governing a commons.

While it is hard to assess any community from the outside, the fact that both the marketplace and contributor communities are continuing to grow suggests that these measures are working.

I’ll leave this case study with the following great quote from Discogs’ founder, Kevin Lewandowski:

See, the thing about a community is that it’s different from a network. A network is like your Facebook group; you cherrypick who you want to live in your circle, and it validates you, but it doesn’t make you grow as easily. A web community, much like a neighborhood community, is made up of people you do not pluck from a roster, and the only way to make order out of it is to communicate and demonstrate democratic growth, which I believe we have done and will continue to do with Discogs in the future.

If you found this case study interesting and useful, then let me know. It’ll encourage me to do more. I’m particularly interested in your views on the approach I’ve taken to capture the different aspects of the ecosystem, infrastructure, etc.

Checking Fact Checkers

As of last month, Google News has begun to highlight fact check articles. Content from fact checking organisations will be tagged so that their contribution to on-line debate can be more clearly identified. I think this is a great move and a first small step towards addressing wider concerns around use of the web for disinformation and a “post truth” society.

So how does it work?

Firstly, news sites can now advertise fact checking articles using a pending schema.org extension called ClaimReview. The mark-up allows a fact checker to indicate which article they are critiquing along with a brief summary of what aspects are being reviewed.
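
As a rough illustration, here’s what that mark-up might look like when serialised as JSON-LD. The property names come from the schema.org ClaimReview type, but the URLs, organisation name, date and rating values are invented for the example.

```python
# Minimal sketch of ClaimReview markup, serialised as JSON-LD.
# Property names follow the schema.org ClaimReview type; all values
# below are made up for illustration.
import json

claim_review = {
    "@context": "https://schema.org",
    "@type": "ClaimReview",
    "url": "https://factchecker.example.org/reviews/123",   # the fact check article
    "author": {"@type": "Organization", "name": "Example Fact Checkers"},
    "datePublished": "2016-11-01",
    "claimReviewed": "The claim being checked, quoted or paraphrased",
    "itemReviewed": {
        "@type": "CreativeWork",
        "url": "https://news.example.com/original-article",  # the article being critiqued
    },
    "reviewRating": {
        "@type": "Rating",
        "ratingValue": 2,
        "bestRating": 5,
        "worstRating": 1,
        "alternateName": "Mostly false",
    },
}

print(json.dumps(claim_review, indent=2))
```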

Metadata alone is obviously ripe for abuse. Anyone could claim any article is a fact check. So there’s an additional level of editorial control that Google layer on top of that metadata. They’ve outlined their criteria in their help pages. These seem perfectly reasonable: it should be clear what facts are being checked, sources must be cited, organisations must be non-partisan and transparent, etc.

It’s the latter aspect that I think is worth digging into a little more. The Google News announcement references the International Fact Checking Network and a study on fact checking sites. The study, by the Duke Reporters’ Lab, outlines how they identify fact checking organisations. Again, they mention both transparency of sources and organisational transparency as being important criteria.

I think I’d go a step further and require that:

  • Google’s (and others’) lists of approved fact checking organisations are published as open data
  • The lists are cross-referenced with identifiers from sources like OpenCorporates that will allow independent verification of ownership, etc.
  • Fact checking organisations publish open data about their sources of funding and affiliations
  • Fact checking organisations publish open data, perhaps using Schema.org annotations, about the dataset(s) they use to check individual claims in their articles
  • Fact checking organisations licence their ClaimReview metadata for reuse by anyone

Fact checking is an area that benefits from the greatest possible transparency. Open data can deliver that transparency.

Another angle to consider is that fact checking may be carried out by more than just media organisations. Jon Udell has written a couple of interesting pieces on annotating the wild-west of information flow and bird-dogging the web that highlight the potential role of annotation services in helping to fact check and create constructive debate and discussion on-line.

Current gaps in the open data standards framework

In this post I want to highlight what I think are some fairly large gaps in the standards we have for publishing and consuming data on the web. My purpose for writing these down is to try and fill in gaps in my own knowledge, so leave a comment if you think I’m missing something (there’s probably loads!)

To define the scope of those standards, let’s try and answer two questions.

Question 1: What are the various activities that we might want to carry out around an open dataset?

  • A. Discover the metadata and documentation about a dataset
  • B. Download or otherwise extract the contents of a dataset
  • C. Manage a dataset within a platform, e.g. create and publish it, update or delete it
  • D. Monitor a dataset for updates
  • E. Extract metrics about a dataset, e.g. a description of its contents or quality metrics
  • F. Mirror a dataset to another location, e.g. exporting its metadata and contents
  • G. Link or reconcile some data against a dataset or register

Question 2: What are the various activities that we might want to carry out around an open data catalogue?

  • V. Find whether a dataset exists, e.g. via a search or similar interface
  • X. List the contents of the platform, e.g. its datasets or other published assets
  • Y. Manage user accounts, e.g. to create accounts, or grant or remove rights from specific accounts
  • Z. Extract usage statistics, e.g. metrics on use of the platform and the datasets it contains

Now, based on that quick review: which of these areas of functionality are covered by existing standards?

  • DCAT and its extensions give us a way to describe a dataset (A) and can be used to find download links, which addresses part of (B). But it doesn’t say how the metadata is to be discovered by clients (there’s a small sketch of a DCAT description after this list).
  • The draft Data Quality Vocabulary starts to address parts of (E), but also doesn’t address discovery of published metrics
  • OData provides a means for querying and manipulating data via a RESTful interface (B, C), although I don’t think it recognises a dataset as such, just resources exposed over the web
  • SPARQL (query, update, etc) also provides a means for similar operations (B, C), but on RDF data.
  • The Linked Data Platform specification also offers a similar set of functionality (B, C)
  • If a platform exposes its catalogue using DCAT then a client could use that to list its contents (X)
  • The draft Linked Data Notifications specification covers monitoring and synchronising of data (D)
  • Data Packages provide a means for packaging the metadata and contents of a dataset for download and mirroring (B, F)
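
To make the DCAT bullet concrete, here’s a small sketch of the kind of description that covers (A) and part of (B), built with the rdflib library. The dataset URI, title, licence and download link are all invented.

```python
# Minimal sketch: describing a dataset and its download link with DCAT,
# using rdflib. All URIs and values are invented for illustration.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, DCTERMS

DCAT = Namespace("http://www.w3.org/ns/dcat#")

g = Graph()
g.bind("dcat", DCAT)
g.bind("dcterms", DCTERMS)

dataset = URIRef("https://data.example.org/dataset/cycle-counts")
dist = URIRef("https://data.example.org/dataset/cycle-counts/csv")

# (A) basic metadata about the dataset
g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Cycle counts")))
g.add((dataset, DCTERMS.license,
       URIRef("https://creativecommons.org/publicdomain/zero/1.0/")))

# (B) a distribution pointing at the actual download
g.add((dataset, DCAT.distribution, dist))
g.add((dist, RDF.type, DCAT.Distribution))
g.add((dist, DCAT.downloadURL,
       URIRef("https://data.example.org/downloads/cycle-counts.csv")))
g.add((dist, DCAT.mediaType, Literal("text/csv")))

print(g.serialize(format="turtle"))
```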

I think there are a number of obvious gaps around discovery and platform (portal) functionality. API and metadata discovery is also something that could usefully be addressed.

If you’re managing and publishing data as RDF and Linked Data then you’re slightly better covered at least in terms of standards, if not in actual platform and tool support. The majority of current portals don’t manage data as RDF or Linked Data. They’re focused on either tabular or maybe geographic datasets.

This means that portability among the current crop of portals is actually pretty low. Moving between platforms means moving between entirely different sets of APIs and workflows. I’m not sure that’s ideal. I don’t feel like we’ve yet created a very coherent set of standards.

What do you think? What am I missing?

Why are bulk downloads of open data important?

I was really pleased to see that at the GODAN Summit last week the USDA announced the launch of its Branded Food Products Database, providing nutritional information on over 80,000 food products. Product reference data is an area that has long been under-represented in the open data commons, so it’s great to see data of this type being made available. Nutritional data is also an area in which I’m developing a personal interest.

The database is in the public domain, so anyone can use it for any purpose. It’s also been made available via a (rate-limited) API that allows it to be easily searched. For many people the liberal licence and machine-readability will be enough to place a tick in the “open data” box. And I’m inclined to agree.

In the balance

However, as Owen Boswarva observed, the lack of a bulk download option means that the dataset technically doesn’t meet the open definition. The latest version of the definition states that data “must be provided as a whole…and should be downloadable via the Internet”. This softens the language used in the previous version which required data to be “available in bulk”.

The question is, does this matter? Is the open definition taking an overly pedantic view, or is it enough, as many people would argue, for the data to be openly licensed?

I think having a clear definition of what makes data open, as opposed to closed or shared, is essential as it helps us focus discussion around what we want to achieve: the ability for anyone to access, use and share data, for any purpose.

It’s important to understand how licensing or accessibility restrictions might stop us from achieving those goals. Because then we can make informed decisions, with an understanding of the impacts.

I’m less interested in using the definition as a means of beating up on data publishers. It’s a tool we can use to understand how a dataset has been published.

That said, I’m never going to stop getting cross about people talking about “open data” that doesn’t have an open licence. That’s the line I won’t cross!

Bulking up

I think it’s an unequivocally good thing that the USDA have made this public domain data available. So let’s focus on the impacts of their decision not to publish a bulk download, something which I suspect would be very easy for them to do. It’s a large spreadsheet, not “big data”.

The API help page notes that the service is “intended primarily to assist application developers wishing to incorporate nutrient data into their applications or websites”. And I think it achieves that for the most part. But there are some uses of data that are harder when the machine-readable version is only available via an API.

Here’s a quick, non-exhaustive list of the ways a dataset could be used:

  • A developer may want to create a new interface to the dataset, to improve on the USDA’s own website
  • A developer may want to query it to add some extra features to an existing website
  • A developer may want to use the data in a mobile application
  • A developer may want to use the data in a desktop application
  • A developer may want to enrich the dataset with additional information and re-publish it
  • A data scientist might want to use the data as part of an analysis
  • An archivist might want to package the dataset and place a copy in the Internet Archive to preserve it
  • A scientist might want to use the data as part of their experimental analysis
  • An organisation might want to provide a mirror of the USDA data (and perhaps service) to help it scale
  • A developer might want to use the data inside services like Amazon or Google public datasets, or Kaggle, or data.world
  • A data journalist might want to analyse the data as part of a story
  • ….etc.

Basically there are a lot of different use cases, which vary based on:

  • the technical expertise of the user
  • the technical infrastructure in which the data is being used
  • whether all or only part of the dataset is required
  • whether custom processing or analysis is required
  • whether the results are being distributed

What’s important to highlight is that all of these use cases can be supported by a bulk download. But many of them are easier if there is an API available.

How much easier depends on the design of the API. And the trade-off of making data easier to use is that it increases the cost and effort of publishing it. The USDA are obviously aware of that cost, because they’ve added some rate-limits to the API.

Many of the use cases are harder if the publisher is only providing an API. Again, it will depend on the design of the API how much harder.
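
A rough sketch of the two access patterns makes the difference clear. The bulk file URL, API endpoint, parameters and response shape below are entirely hypothetical; this is not the USDA’s actual API.

```python
# Sketch of the difference between bulk and API-only access.
# The bulk URL, API endpoint and parameters are entirely hypothetical.
import csv
import io
import time
import requests


def analyse_bulk(dump_url: str):
    """One request fetches the whole dataset; analysis can then run locally."""
    text = requests.get(dump_url, timeout=60).text
    return list(csv.DictReader(io.StringIO(text)))


def analyse_via_api(endpoint: str, api_key: str, pages: int, delay: float = 1.0):
    """Paging through a rate-limited API: many requests plus client-side throttling."""
    rows = []
    for page in range(1, pages + 1):
        response = requests.get(
            endpoint,
            params={"api_key": api_key, "page": page},
            timeout=30,
        )
        response.raise_for_status()
        rows.extend(response.json().get("items", []))  # assumed response shape
        time.sleep(delay)  # stay under the publisher's rate limit
    return rows
```

The first function is a single request, after which everything happens locally; the second has to handle paging, respect the publisher’s rate limits and repeat all of that work whenever the full dataset is needed again.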

Personally I always advocate having bulk downloads by default and APIs available on a best effort basis. This is because it supports the broadest possible set of use cases. In particular it helps make data portable so that it can be used in a variety of platforms. And as there are no well-adopted standard APIs for managing and querying open datasets, bulk downloads offer the most portability across platforms.

Of course there are some datasets, those that are particularly large or rapidly changing, where it is harder to provide a useful, regularly updated data dump. In those cases provision via an API or other infrastructure is a reasonable compromise.

Balancing the scales

Returning to the USDA’s specific goals, they are definitely assisting developers in incorporating nutrient data into their applications. But they’re not necessarily making it easy for all application developers, or even all types of user.

Presumably they’ve made a conscious decision to focus on querying the data over use cases involving bulk analysis, archiving and mirroring. This might not be an ideal trade-off for some. And if you feel disadvantaged then you should take time to engage with the USDA to explain what you’d like to achieve.

But the fact that the data is openly licensed and in a machine-readable form means that it’s possible for a third-party intermediary or aggregator to collect and republish the data as a bulk download. It’ll just be less efficient for them to do it than the USDA. The gap can be filled by someone else.

Which is why I think it’s so important to focus on licensing. It’s the real enabler behind making something open. Without an open licence you can’t get any real (legal) value-add from your community.

And if you don’t want to enable that, then why are you sharing data in the first place?

This post is part of a series called “basic questions about data“.

People like you are in this dataset

One of the recent projects we’ve done at Bath: Hacked is to explore a sample of the Strava Metro data covering the city of Bath. I’m not going to cover all of the project details in this post, but if you’re interested then I suggest you read this introductory post and then look at some of the different ways we presented and analysed the data.

From the start of the project we decided that we wanted to show the local (cycling) community what insights we might be able to draw from the dataset and illustrate some of the ways it might be used.

Our first step was to describe the dataset and how it was collected. We then outlined some questions we might ask of the data. And we tried to assess how representative the dataset was of the local cycling community by comparing it with data from the last census.

The reactions were really interesting. I spent a great deal of time on social media patiently answering questions and objections. I wanted to help answer those questions and understand what issues and concerns people might have in using this type of data.

I found that there were broadly two different types of feedback.

Visible participation

The first, more positive response, was from existing or previous Strava users surprised or delighted that their data might contribute towards this type of analysis. Some people shared the fact that they only logged some types of rides, while others explained that they already logged all of their activity including commutes and recreational riding. I saw one comment from a user who was now determined to do this more diligently, just so they could contribute to the Metro dataset.

A lesson here is that even users who understand that their data is being collected can still be surprised by the ways that the data might be re-purposed. This is a data literacy issue: how can we help non-specialists understand the incredible malleability of data?

I think the reaction also reinforces the point that people will often contribute more if they think their data can be used for social good. Or just that people like them are also contributing.

This is important if we want to  encourage more participation in the maintenance of data infrastructure. Commercial organisations would do well to think about how open data and data philanthropy might drive more use of their platforms rather than threaten them.

Even if the Strava data were completely open there are still challenges in its use and interpretation. This creates the space for value-added services. (btw, if anyone wants help with using the Strava Metro data then I’m happy to discuss how Bath: Hacked could help out!)

Two tribes

The second, more negative response, was from people who didn’t use Strava and often had strong opinions about the service. I’ll step lightly over the details here. But, while I want to avoid being critical (because I’m genuinely not), I want to share a variety of the responses I saw:

  • I don’t use this dataset, so it can’t tell you anything about how I cycle
  • I don’t understand why people might use the service, so I’m suspicious of what the data might include
  • I think only a certain type of person uses the service, so it’s only representative of them, not me
  • I think people only use this service in a specific way, e.g. not for regular commutes, and so the data has limited use
  • I’m suspicious about the reliability of the data, so distrust it.

I think I’d sum all of that up as: “people like me don’t use this service, so any data you have isn’t representative of me or my community”.

This is exactly the issue we tried to shed some light on in our first two blog posts. So clearly we failed at that! Something to improve on in future.

The real lesson for me here is that people need to see themselves in a dataset.

If we don’t help someone understand whether a dataset is representative of them, then its use will be viewed with suspicion and doubt. It doesn’t matter how rigorous the data collection and analysis process might be behind the scenes, it’s important to help find ways for people to see that for themselves. This isn’t a data literacy issue: it’s a problem with how we effectively communicate and build trust in data.

If we increasingly want to use data as a mirror of society, then people need to be able to see themselves in its reflection.

If they can see how they might be a valuable part of a dataset, then they may be more willing to contribute. If they can see whether they (or people like them) are represented in a dataset, then they may be more willing to accept insights drawn from that data.

Story telling is likely to be a useful tool here, but I wonder whether there are other complementary ways to approach these issues?

Help me use your data

I’ve been interviewed a few times recently by people interested in understanding how best to publish data to make it useful for others: once by a startup and a couple of times by researchers. The core of the discussion has essentially been the same question: “how do you know if a dataset will be useful to you?”

I’ve given essentially the same answer each time. When I’m sifting through dataset descriptions, either in a portal or via a web search, my first stage of filtering involves looking for:

  1. A brief summary of the dataset: e.g. a title and a description
  2. The licence
  3. Some idea of its coverage, e.g. geographic coverage, scope of time series, level of aggregation, etc
  4. Whether it’s in a usable format

Beyond that, there’s a lot more that I’m interested in: the provenance of the data, its timeliness and a variety of quality indicators. But those pieces of information are what I’m looking for right at the start. I’ll happily jump through hoops to massage some data into a better format. But if the licence or coverage isn’t right then it’s useless to me.

We can frame these as questions:

  1. What is it? (Description)
  2. Can I use it? (Licence)
  3. Will it help answer my question? (in whole or in part)
  4. How difficult will it be to use? (format, technical characteristics)

It’s frustrating how often these essentials aren’t readily available.

Here’s an example of why this is important.

A weather data example

I’m currently working on a project that needs access to local weather observations. I want openly licensed temperature readings for my local area.

My initial port of call was the Met Office Hourly Site Specific Observations. The product description is a useful overview and the terms of use make the licensing clear. Questions 1 & 2 answered.

However I couldn’t find a list of sites to answer Question 3. Eventually I found the API documentation for a call that would generate a list of sites. But I can only access that with an API key. So I’ve signed up, obtained a key, made the API call, downloaded the JSON, converted it into CSV, uploaded it to Carto and then made a map.
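
For anyone repeating that exercise, the sitelist-to-CSV step looks roughly like the sketch below. The DataPoint URL and the structure of the JSON response are from memory and may well have changed, so treat both as assumptions and check the current Met Office documentation.

```python
# Sketch of the sitelist-to-CSV step. The DataPoint URL and JSON structure
# are assumptions; verify them against the current Met Office documentation.
import csv
import requests

API_KEY = "your-datapoint-api-key"  # obtained by registering with the Met Office
SITELIST_URL = "http://datapoint.metoffice.gov.uk/public/data/val/wxobs/all/json/sitelist"

data = requests.get(SITELIST_URL, params={"key": API_KEY}, timeout=30).json()
sites = data["Locations"]["Location"]  # assumed shape of the response

with open("observation_sites.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "latitude", "longitude"])
    writer.writeheader()
    for site in sites:
        writer.writerow({key: site.get(key) for key in writer.fieldnames})
```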

And now I can answer Question 3. The closest site is in Bristol, so the service isn’t useful to me at all. Time wasted, but hopefully not entirely, because now you can just look at the map. The Met Office could simply have published a map themselves. There is one of the whole network, but not all of those sites contribute to the open dataset.

So I started to look at the OpenWeatherMap API. They also have an API endpoint that exposes weather data for a specific station, or for stations within a geographic area. But again, they’ve not actually published a map that would let me see if there are any stations local to me. I might have missed something, so I’ve asked them.

In both cases I’m having to invest time and some technical effort in answering questions which should be part of the documentation. They could even use their own APIs to create an interactive map for people to use!

As a result I’m going to end up using wunderground. By browsing the user facing part of their site I’ve been able to confirm there are several local weather stations. And hopefully these will be exposed via the API. (But I’m going to have to dig a bit to check on the terms of use. Sigh.)

If you really want me to use your data then you need to help me to use it. Think about my user experience. Help me understand what your dataset contains before I have to actually poke around inside it.