We can strengthen data infrastructure by analysing open data

Data is infrastructure for our society and businesses. To create stronger, sustainable data infrastructure that supports a variety of users and uses, we need to build it in a principled way.

Over time, as we gain experience with a variety of infrastructures supporting both shared and open data, we can identify the common elements of good data infrastructure. We can use that experience to help write a design manual for data infrastructure.

There are a variety of ways to approach that task. We can write case studies on specific projects, and we can map ecosystems to understand how value is created through data. We can also take time to contribute to projects. Experiencing different types of governance, following processes and using tools can provide useful insight.

We can also analyse open data to look for additional insights that might help us improve data infrastructure. I’ve recently been involved in two short projects that have analysed some existing open data.

Exploring open data quality

Working with Experian and colleagues at the ODI, we looked at the quality of some UK government datasets. We used a data quality tool to analyse data from the Land Registry, the NHS and Companies House. We found issues with each of the datasets.

It’s clear that there is still plenty of scope to make basic improvements to how data is published, by providing:

  • better guidance on the structure, content and licensing of data
  • basic data models and machine-readable schemas to help standardise approaches to sharing similar data
  • better tooling to help reconcile data against authoritative registers

The UK is also still in need of a national open address register.

Open data quality is a current topic in the open data community. The community might benefit from access to an “open data quality index” that provides more detail on these issues. Open data certificates would be an important part of that index. The tools used to generate that index could also be used on shared datasets. The results could be open, even if the datasets themselves might not be.

Exploring the evolution of data

There are currently plans to further improve the data infrastructure that supports academic research by standardising organisation identifiers. I’ve been doing some R&D work for that project to analyse several different shared and open datasets of organisation identifiers. By collecting and indexing the data, we’ve been able to assess how well they can support improving existing data, through automated reconciliation and by creating better data entry tools for users.
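To make "automated reconciliation" a little more concrete, here is a minimal sketch in Python of matching a free-text organisation name against a register of identifiers. It is not the approach used in the project: the register contents, the identifiers (`org-001` and so on), the `normalise` and `reconcile` helpers and the 0.8 similarity cutoff are all assumptions made for illustration, and the fuzzy matching uses the standard library's `difflib` rather than a dedicated reconciliation service.

```python
from difflib import get_close_matches
from typing import Dict, Optional, Tuple

# Hypothetical register: identifier -> organisation name.
# These identifiers are made up for illustration, not taken from a real dataset.
REGISTER: Dict[str, str] = {
    "org-001": "University of Bath",
    "org-002": "University of Bristol",
    "org-003": "Open Data Institute",
}


def normalise(name: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not block a match."""
    return " ".join(name.lower().split())


def reconcile(
    name: str, register: Dict[str, str], cutoff: float = 0.8
) -> Optional[Tuple[str, str]]:
    """Return (identifier, registered name) for the closest match, or None.

    The cutoff is an assumed threshold; a real tool would tune it and
    probably offer several candidates for a person to review.
    """
    by_normalised = {normalise(label): ident for ident, label in register.items()}
    matches = get_close_matches(
        normalise(name), list(by_normalised), n=1, cutoff=cutoff
    )
    if not matches:
        return None
    ident = by_normalised[matches[0]]
    return ident, register[ident]


if __name__ == "__main__":
    for candidate in ["university of  bath", "Univeristy of Bath", "Acme Widgets Ltd"]:
        print(candidate, "->", reconcile(candidate, REGISTER))
```

The same lookup could sit behind a data entry form, suggesting a registered organisation as the user types rather than accepting free text.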

Increasingly, when we are building new data infrastructures, we are building on and linking together existing datasets. So it’s important to have a good understanding of the scope, coverage and governance of the source data we are using. Access to regularly published data gives us an opportunity to explore the dynamics around the management of those sources.

For example, I’ve explored the growth of the GRID organisational identifiers.

This type of analysis can help assess the level of investment required to maintain different types of datasets and registers. The type of governance we decide to put around data will have a big impact on the technology and processes that need to be created to maintain it. A collaborative, user-maintained register will operate very differently to one that is managed by a single authority.
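As a rough illustration of this kind of analysis, the sketch below compares successive releases of a register and counts how many identifiers were added or removed each time. The release labels and identifiers are invented for the example and are not real GRID data, and `release_changes` is a hypothetical helper. It assumes each release has already been reduced to a set of identifiers and that release labels sort chronologically.

```python
from typing import Dict, List, Set

# Hypothetical snapshots: release label -> set of identifiers in that release.
SNAPSHOTS: Dict[str, Set[str]] = {
    "2016-01": {"a", "b"},
    "2016-02": {"a", "b", "c", "d"},
    "2016-03": {"a", "c", "d", "e"},
}


def release_changes(snapshots: Dict[str, Set[str]]) -> List[dict]:
    """Summarise growth and churn between consecutive releases.

    Releases are processed in sorted label order, so labels are assumed
    to sort chronologically (for example ISO-style dates).
    """
    rows = []
    previous: Set[str] = set()
    for label in sorted(snapshots):
        current = snapshots[label]
        rows.append(
            {
                "release": label,
                "total": len(current),
                "added": len(current - previous),    # identifiers new in this release
                "removed": len(previous - current),  # identifiers dropped since the last one
            }
        )
        previous = current
    return rows


if __name__ == "__main__":
    for row in release_changes(SNAPSHOTS):
        print(row)
```

Steady additions with few removals suggest a register that is mostly growing, while high churn in both directions points to ongoing curation work that somebody has to resource.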

One final area in which I hope the community can begin to draw together some insight is around how data is used. At present there are no standards to guide the collection and reporting of metrics for the usage of either shared or open data. Publishing open data about how data is used could be extremely useful, not just in understanding data infrastructure, but also in providing transparency about when and how data is being used.


