Data is a vital form of infrastructure for our societies and our economies. When we think about infrastructure we usually think of physical things like roads and railways.
But there are broader definitions of infrastructure that include less tangible things. Like ideas or the internet.
It is important to recognise that there is more to “infrastructure” than just roads and railways. Otherwise there is a risk that we, as a society, won’t invest the necessary time or effort in building, maintaining and governing that infrastructure. The decisions we make about Infrastructure are important because infrastructure helps to shape our societies.
To help explore the idea of data as infrastructure, I want to look at the various building blocks that make up a specific example of a data infrastructure. My hope is that this will help to make it clearer that “data infrastructure” is about more than just technology. As we will see the technical infrastructure we use to manage data is just one component of data infrastructure.
The example we will use is greatly simplified and is partly fictionalised, but it is essentially a real example: we’re going to look at weather data infrastructure.
I’ve written about weather data infrastructure before. It’s a really interesting example to explore in this context because:
- it’s easy to understand the value of collecting weather data
- it’s a complex enough example to help dig into some real-world issues
- It illustrates how data that is collected and used locally or nationally can also be part of a global data infrastructure
Weather data is usually open data, or at least public. But the building blocks we will outline here apply equally well to data from across the data spectrum. In a follow-up post I may explore a more complex example that illustrates a different type of data infrastructure, e.g. one for medical research that relies on researchers having access to medical records.
In the following sections we’ll look at the different building blocks that are important in building a global weather data infrastructure. The real infrastructure is much more complex.
Some of the building blocks are a bit fuzzy and have multiple roles to play in our infrastructure. But that’s fine. The world isn’t a neat and tidy place that we can always reduce to simpler components.
A definition of data infrastructure
Before we begin, let’s introduce a definition of data infrastructure:
A data infrastructure consists of data assets, the standards and technologies that are used to curate and provide access to those assets, the guidance and policies that inform their use and management, the organisations that govern the data infrastructure, and the communities involved in maintaining it, or are impacted by decisions that are made using those data assets.
There are a lot of moving parts there. And there are lots of things to say about each of them. For now let’s focus on the individual building blocks to explore ways in which they fit together.
Imagine we’re planning to build a global network of weather stations. Each station will be regularly recording the local temperature and rainfall. In our system we’ll be collecting all of these readings into a global dataset of weather observations.
So that we know which observations have been reported by which weather station, we need a unique reference for each of them.
We can’t just use the name of the town or village in which the station has been installed as that reference. There are Birminghams in both the UK and the US, for example. We might also need to move and reinstall weather stations over time, but may need to track information about them, such as when they were installed or services. So we need a global identifier that is more reliable than just a name.
By assigning each weather station a unique identifier, we can then attached additional data to it. Like it’s current location. We can also associate the identifier with every temperature and rainfall observation, so that we know which station reported that data.
Identifiers are the first building block of our data infrastructure.
Identifiers are deceptively simple. They’re just a number or a code, right? But there’s a lot to say about them, such as how they are assigned or are formatted. It can be hard to create good identifiers.
When identifiers are open, for anyone to use in their data, they have a role to play that goes beyond just providing unique references in a database. They can also help to create network effects that encourage publication of additional data.
Standards, part 1
Our weather stations are recording temperature and rainfall. We’ll measure temperature in degrees Centigrade and rainfall in millimetres. Both of these are standard units of measurement.
Standards are our second building block.
Standards are documented, reusable agreements. They help us collect and organise data in consistent ways, and make it easier to work with data from different sources.
Some standards, like units of measurement are global and are used in many different ways. But some standard might be only be relevant to specific communities or systems.
In our weather data infrastructure, we will need to standardise some other aspects of how we plan to collect weather data.
For example, let’s assume that our weather stations are recording data every half an hour. Every thirty minutes a station will record a new temperature reading. But is it recording the temperature at that specific moment in time, or should it report the average temperature over the last thirty minutes? There may be advantages in doing one or the other.
If we don’t standardise some of our data collection practices, then weather stations created by different manufacturers might record data differently. This will affect the quality of our data.
Standards, part 2
Every data infrastructure will rely on a wide variety of different standards. Some standards support consistent measurement and data collection. Others help us to exchange data more effectively.
Our weather stations will need to record the data they collect and automatically upload it to a service that helps us build our global database. In a real system there are a number of different ways in which we might want a weather station to report data, to provide a variety of ways in which it could be aggregated and reused. But to simplify things, we’ll assume they just upload their data to a centralised service. Centralised data collection is problematic for a number of reasons, but that’s a topic for another article.
To help us define how the weather stations will upload their data we will need to pick a standard data format that will define the syntax for recording data in a machine-readable form. Let’s assume that we decide to use a simple CSV (comma-separated values) format.
Each station will produce a CSV file that contains one row for every half-hourly observation. Each row will consist of a station identifier, a time stamp for the recordings, a temperate reading and a rainfall reading.
The time stamps can be recorded using ISO 8601, which is an international standard for formatting dates and times. Helpfully we can include time zones, which will be essential for reporting time accurately across our global network of weather stations.
We also need to ensure that the order in which the four fields will be reported is consistent, or that the headers in the CSV file clearly identify what is contained in each column. Again, we might be using weather stations from multiple manufacturers and need data to be recorded consistently. Some stations might also include additional sensors, e.g. to record wind speed. So ideally our standard will be extensible to support that additional data. Taking time to design and standardise our CSV format will make data aggregation easier.
Every time we define how to collect, manage or share data within a system, we are creating agreements that will help ensure that everyone involved in those processes can be sure that those tasks are carried out in consistent ways. When we reuse existing standards, rather than creating bespoke versions, we can benefit from the work of thousands of different specialists across a variety of industries.
Sometimes though we do need to define a new standard, like the order of the columns in our specific type of CSV file. But where possible we should approach this by building on existing standards as much as possible.
To help us manage our network of weather stations it will be useful to record where each of them has been installed. It would also be helpful to record when they were installed. Then we can figure out when they might need to be re-calibrated or replaced and send some out to do the necessary work.
To do this, we can create a dataset, that lists the identifier, location, model and installation date of every weather station.
This type of dataset a register.
Registers are lists of important data. They have multiple uses, but are most frequently used to help us improve the quality of our data reporting.
For example we can use the above register to confirm that we’re regularly receiving data from every station on the network. When a station is installed it will need to get added to the register. We might give the company installing the station permission to do that, to help us maintain the register.
We can also use the register to determine if we have a good geographic spread of stations, to help us assess and improve the coverage and quality of the observations we’re collecting. The register is also useful for anyone using our global dataset so they can see how the dataset has been collected over time. Registers should be as open as possible.
There are other types of register that might be useful for governing our data infrastructure. For example we might create a register that lists all of the models of weather station that have been certified to comply with our preferred data standards.
We can use that register to help us make decisions about how to replace stations when they fail. A register can also help provide an incentive for the manufacturers of weather stations to conform to our chosen standards. If they’re not on the list, then we might not buy their products.
In Part 2 of this post we’ll look at others aspects of data infrastructure, including technology, organisations and policies. Thanks to Peter Wells and Jeni Tennison for feedback and suggestions that have helped me write these posts.