Exploring open data quality

Originally published on the Open Data Institute blog. Original URL: https://theodi.org/blog/exploring-open-data-quality

A number of initiatives are currently exploring data quality, with a particular focus on describing, measuring and improving the quality of open data.

For example, the W3C Data on the Web Best Practices Working Group are producing a vocabulary for publishing and describing data quality metrics. There is also related work capturing best practices for sharing public sector data.

Various open data projects and communities are working to improve the quality of their open data and have started to share guidance. For example, data.gov.sg have recently shared their data quality guide for tabular data, and Mark Frank and Johanna Walker at Southampton University have recently published a paper exploring a user-centred view of data quality.

To contribute to this ongoing discussion, we recently undertook a small project with Experian to explore data quality in some open datasets.

The project had several goals:

  • to identify the types of data quality issues we might find in some existing open datasets
  • to suggest some common data quality checks that both publishers and users could apply to data
  • to explore the idea of an ‘open data quality index’, building on existing work on Open Data Certificates and benchmarking open data

For the initial exploratory project we used three datasets: the Land Registry Price Paid data, the Companies House register and the NHS Choices GP Practices and Surgeries data.

We worked with the data quality team at Experian to run the datasets through their Pandora data quality tool. Pandora is a data-profiling tool designed to support exploration of datasets, highlight data quality issues and enrich data against other sources. For this project we used Pandora to generate some quality metrics for each of the datasets we reviewed.

You can recreate a number of the checks we carried out using the free version of the tool.
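
To give a flavour of what profiling a dataset involves, here is a minimal sketch in Python. It is not Pandora and makes no claim about how Pandora works; the function name `profile_columns`, the column names and the sample rows are all hypothetical and invented for illustration. It simply counts rows, blank values and distinct values per column, which is often enough to spot the first signs of a quality problem.

```python
import csv
import io
from collections import Counter


def profile_columns(csv_text):
    """Return per-column counts of rows, blank values and distinct values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    profile = {
        name: {"rows": 0, "blank": 0, "values": Counter()}
        for name in reader.fieldnames
    }
    for row in reader:
        for name in reader.fieldnames:
            # Short rows yield None for missing cells; treat those as blank.
            value = (row[name] or "").strip()
            stats = profile[name]
            stats["rows"] += 1
            if value == "":
                stats["blank"] += 1
            else:
                stats["values"][value] += 1
    return {
        name: {
            "rows": stats["rows"],
            "blank": stats["blank"],
            "distinct": len(stats["values"]),
            "most_common": stats["values"].most_common(1),
        }
        for name, stats in profile.items()
    }


# Hypothetical sample data, not taken from any of the datasets reviewed.
SAMPLE = """postcode,price,property_type
AB1 2CD,250000,D
,180000,S
AB1 2CD,,D
"""

summary = profile_columns(SAMPLE)
for column, stats in summary.items():
    print(column, stats)
```

A column with many blanks, or with far more distinct values than expected for a coded field, is a good candidate for a closer look.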

The outputs have been published under an open licence and we’ve written a short report on the findings.

Our key insights are as follows:

  • There is still scope to improve how well datasets are documented and published to data.gov.uk and beyond
  • Even in large, well-used and well-maintained datasets, a number of basic checks could be applied to improve data quality
  • Defining and using standard schemas for datasets would benefit both data publishers and users
  • Being able to quickly summarise and explore a dataset offers a powerful way to understand its structure and highlight potential data quality issues
  • The use of standard, open registers will be a significant boost to the quality of many open datasets
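
As a rough illustration of the basic checks and standard schemas mentioned above, the following Python sketch validates rows of a CSV file against a small schema. The schema format, the column names, the patterns and the sample rows are assumptions made up for this example; they are not taken from the report or from any of the datasets, and the postcode pattern is deliberately simplified.

```python
import csv
import io
import re

# Hypothetical schema: column name -> (required?, pattern the value must match).
SCHEMA = {
    "practice_code": (True, re.compile(r"[A-Z]\d{5}")),
    "postcode": (True, re.compile(r"[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}")),
    "phone": (False, re.compile(r"[\d ]{10,13}")),
}


def check_rows(csv_text, schema):
    """Return a list of (line_number, column, problem) tuples."""
    issues = []
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing_columns = [column for column in schema if column not in header]
    for column in missing_columns:
        issues.append((1, column, "missing column"))
    # Data rows start on line 2, after the header.
    for line_number, row in enumerate(reader, start=2):
        for column, (required, pattern) in schema.items():
            if column in missing_columns:
                continue
            value = (row.get(column) or "").strip()
            if value == "":
                if required:
                    issues.append((line_number, column, "missing value"))
            elif not pattern.fullmatch(value):
                issues.append((line_number, column, "unexpected format"))
    return issues


# Hypothetical sample data, invented for illustration only.
SAMPLE = """practice_code,postcode,phone
A12345,AB1 2CD,01234 567890
B6789,AB1 2CD,
C11111,,01234 567890
"""

issues = check_rows(SAMPLE, SCHEMA)
for issue in issues:
    print(issue)
```

Checks like these are cheap for a publisher to run before release and for a user to run on download, which is one reason a shared schema helps both sides.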

If you have any feedback on the findings or suggestions for how to build on the work further, then please get in touch with our labs team.