Caution: data, use responsibly

Originally published on the Open Data Institute blog. Original URL:

In December 2015, Ben Goldacre and Anna Powell-Smith launched the beta of Open Prescribing. The site, which was swiftly celebrated in the open data community and beyond, provides insight into the prescribing practices of GPs around the UK. Its visualisations and reports give an entirely new perspective on some of the bulk open datasets available from the NHS.

Open Prescribing is a fantastic demonstration of how openly publishing data can unlock new, creative uses.

There is a particular feature of the site which piqued my interest: a page entitled ‘Caution: how to use the data responsibly’. Goldacre and Powell-Smith have included some clear guidance that helps users to properly interpret their findings, including:

  • guidance on how to interpret high and low values for the measurements, encouraging users to think about what patterns they may or may not demonstrate – because of differences in population around a practice, for example
  • notes on how the individual measures were decided upon
  • insight into the importance of specific drugs and measures for a non-specialist audience
  • links to useful background information from the original data publishers

The ‘About’ page for the site also attributes all of the datasets that were used as input to the analysis.

Clear attribution, provenance reporting and guidance on limits to the analysis might be expected from authors with a background in evidence-based medicine. It’s not yet normal practice within the open data community. But it should be.

As a society, we are making an increasing number of decisions based on data, about our health, economy and businesses. So it’s becoming more and more important that we know the limits of what that data can reliably tell us. Data enables informed decisions. Knowing the limits of data also makes us more informed.

In my opinion, all data analysis should have an equivalent of the Open Prescribing “/caution” URL.

To achieve this, data users need to know more about how data is collected and processed before it is published. This is why the higher levels of the Open Data Certificate require publishers to:

  • document any known quality issues or limitations with the data
  • publish details of their quality control processes, including how to report errors
  • describe the provenance of the data, e.g. how it was collected and analysed

That information provides the necessary foundation for re-users to properly interpret and apply data. This information can then be cited, as it is on Open Prescribing, to help downstream users understand the impacts on any analysis.

Documenting the datasets used in an analysis is another norm that’s common in the medical and scientific communities. Linking to source datasets is the basis for citation analysis in academic research. These links power many types of discovery tools, and help improve reproducibility and transparency in research.

Use of machine-readable attributions could do the same for more general uses of data online. In the early days of the web, developers would “view source” to view the markup behind a webpage to learn how it was put together. The ability to “view sources” to discover the data underlying an application or data analysis would be a useful feature for the data web.
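To make the “view sources” idea a little more concrete, here is a minimal sketch of what a machine-readable attribution might look like and how a tool could read it. It is an illustration only, not how Open Prescribing does it: the record loosely borrows the `isBasedOn` property from the schema.org vocabulary, and the dataset names, publishers, example.org URLs and the `view_sources` helper are all hypothetical placeholders.

```python
import json

# Hypothetical attribution record for a data analysis. The structure loosely
# follows schema.org's JSON-LD vocabulary; every name and example.org URL
# below is a placeholder, not a real dataset.
ATTRIBUTION = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example prescribing analysis",
    "isBasedOn": [
        {
            "@type": "Dataset",
            "name": "Example practice-level prescribing data",
            "url": "https://example.org/data/prescribing",
            "publisher": "Example Publisher A",
        },
        {
            "@type": "Dataset",
            "name": "Example practice list sizes",
            "url": "https://example.org/data/list-sizes",
            "publisher": "Example Publisher B",
        },
    ],
}


def view_sources(record):
    """Return (name, url) pairs for each dataset an analysis is based on."""
    sources = record.get("isBasedOn", [])
    # Accept a single source object as well as a list of them.
    if isinstance(sources, dict):
        sources = [sources]
    return [(source.get("name"), source.get("url")) for source in sources]


if __name__ == "__main__":
    # The JSON form is what an application would publish alongside its results.
    print(json.dumps(ATTRIBUTION, indent=2))
    # A "view sources" tool only needs to read that JSON back.
    for name, url in view_sources(ATTRIBUTION):
        print(f"{name}: {url}")
```

Publishing a record like this alongside an analysis would let other tools follow the links back to the source data, in much the same way that citation links work in academic research.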

So, if you’re doing some data analysis, follow the best practices embodied by Open Prescribing and help users and other developers to understand how you’ve achieved your results.