Schema explorers and how they can help guide adoption of common standards

Despite being very different projects, Wikidata and OpenStreetMap have a number of similarities: recurring patterns in how they organise and support the work of their communities.

We documented a number of these patterns in the ODI Collaborative Maintenance Guidebook. There were also a number we didn’t get time to write up.

A further pattern which I noticed recently is that both Wikidata and OSM provide tools and documentation that help contributors and data users explore the schema that shapes the data.

Both projects have a core data model around which their communities are building and iterating on a more focused domain model. This approach of providing tools for the community to discuss, evolve and revise a schema is what we called the Shared Canvas pattern in the ODI guidebook.

In OpenStreetMap that core model consists of nodes, ways and relations. Tags (name-value pairs) can be attached to any of these types.

In Wikidata the core data model is essentially a graph. A collection of statements that associate values with nodes using a range of different properties. It’s actually more complicated than that, but the detail isn’t important here.
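
To make the comparison concrete, here’s a rough sketch of what a single element in each model might look like. It’s illustrative only: the field names are simplified and don’t match either project’s actual serialisation formats.

```python
# A simplified OpenStreetMap element: a node is a point with coordinates,
# and tags (free-form name-value pairs) describe what it represents.
osm_node = {
    "type": "node",
    "id": 123456789,  # illustrative identifier
    "lat": 51.5074,
    "lon": -0.1278,
    "tags": {"building": "yes", "name": "Example Hall"},
}

# A simplified Wikidata statement: a property (here P2561, "name")
# associates a value with an item.
wikidata_statement = {
    "item": "Q42",
    "property": "P2561",
    "value": {"text": "Douglas Adams", "language": "en"},
}
```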

The list of properties in Wikidata and the list of tags in OpenStreetmap are continually revised and extended by the community to capture additional information.

The OpenStreetMap community documents tags in its Wiki (e.g. the building tag). Wikidata documents its properties within the project dataset (e.g. the name property, P2561).

But to successfully apply the Shared Canvas pattern, you also need to keep the community up to date about your Evolving Schema. To do that you need some way to communicate which properties or tags are in use, and how. OSM and Wikidata both provide tools to support that.

In OSM this role is filled by TagInfo. It can provide you with a breakdown of the types of feature a tag is used on, the range of values, combinations with other tags and some idea of its geographic usage. Tag usage varies by geographic community in OSM. Here’s the information about the building tag.

In Wikidata this tooling is provided by a series of reports that are available from the Discussion page for an individual property. This includes information about how often it is used and pointers to examples of frequent and recent uses. Here’s the information about the name property.
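
If you want to go beyond the canned reports, the same kind of question can be asked directly of the Wikidata Query Service. Here’s a minimal sketch that counts how many statements use a given property; it uses the public SPARQL endpoint at query.wikidata.org and may time out for very heavily used properties.

```python
import requests

# Count how many statements use a given Wikidata property (here P2561, "name").
QUERY = """
SELECT (COUNT(*) AS ?uses) WHERE {
  ?item wdt:P2561 ?value .
}
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY},
    headers={
        "Accept": "application/sparql-results+json",
        "User-Agent": "schema-explorer-example/0.1",
    },
)
response.raise_for_status()

bindings = response.json()["results"]["bindings"]
print("Statements using P2561:", bindings[0]["uses"]["value"])
```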

Both tools provide useful insight into how different aspects of a schema are being adopted and used. They can help guide not just the discussion around the schema (“is this tag in use?”), but also the process of collecting data (“which tags should I use here?”) and using the data (“what tags might I find, or query for?”).

Any project that adopts a Shared Canvas approach is likely to need to implement this type of tooling. Let’s call it the “Schema explorer” pattern for now.

I’ll leave documenting it further for another post, or a contribution to the guidebook.

Schema explorers for open standards and open data

This type of tooling would be useful in other contexts.

Anywhere that we’re trying to drive adoption of a common data standard, it would be helpful to be able to assess how well used different parts of that schema are by analysing the available data.

That’s not something I’ve regularly seen produced. In our survey of decentralised publishing initiatives at the ODI we found common types of documentation, data validators and other tools to support use of data, like useful aggregations. But no tooling to help explore how well a standard has been adopted. Or to help data users understand the shape of the available data prior to aggregating it.

When I was working on the OpenActive standard, I found the data profiles that Dan Winchester produced really helpful. They provided useful insight into which parts of a standard different publishers were actually using.

I was thinking about this again recently whilst doing some work for Full Fact, exploring the ClaimReview markup in Schema.org. It would be great to see which features different fact checkers are actually using. In fact that would be true of many different aspects of Schema.org.
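
As a sketch of what this kind of “schema explorer” reporting might involve, the following counts which ClaimReview properties appear across a set of harvested JSON-LD documents. The folder name and the assumption of one JSON-LD object per file are hypothetical; a real pipeline would need to handle @graph structures, embedded markup and much larger volumes.

```python
import json
from collections import Counter
from pathlib import Path

def count_claimreview_properties(directory: str) -> Counter:
    """Tally which properties are used on ClaimReview objects in a folder
    of JSON-LD files (hypothetical local harvest, one object per file)."""
    counts = Counter()
    for path in Path(directory).glob("*.json"):
        doc = json.loads(path.read_text())
        if doc.get("@type") == "ClaimReview":
            # Count every non-keyword property the fact checker has used.
            counts.update(key for key in doc if not key.startswith("@"))
    return counts

if __name__ == "__main__":
    for prop, n in count_claimreview_properties("harvested-claims").most_common():
        print(f"{prop}: {n}")
```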

This type of reporting is hard to do in a distributed environment without aggregating all the data. But Google are regularly harvesting some of this data, so it feels like it would be relatively easy for them to provide insights like this if they chose.

An alternative is the Schema.org Table Corpus which provides exports of Schema.org data contained in the Common Crawl dataset. But more work is likely needed to generate some useful views over the data, and it is less frequently updated.

Outside of Schema.org, schema explorers reporting on the contents of open datasets would help inform a range of standards work. For example, they could help inform decisions about how to iterate on a schema, guide the production of documentation, and help improve the design of validators and other tools.

If you’ve seen examples of this type of tooling, then I’d be interested to see some links.

Building data validators

This is a post about building tools to validate data. I wanted to share a few reflections based on helping to design and build a few different public and private tools, as well as my experience as a user.

I like using data validators to check my homework. I’ve been using a few different ones recently, which has prompted me to think a bit about their role and the decisions that go into their design.

The tl;dr version of this post is along the lines of “Think about user needs when designing tools. But also be conscious of the role those tools play in their broader ecosystem“.

What is a data validator?

A data validator is a tool that checks the correctness and quality of data. This means carrying out the following categories of checks (a minimal sketch in code follows the list):

  • Syntax
    • Checking to determine whether there are any mistakes in how it is formatted. E.g. is the syntax of a CSV, XML or JSON file correct?
  • Validity
    • Confirming whether all of the required fields, necessary to make the data useful, have been provided.
    • Testing that individual values have been correctly specified. E.g. if the field contains a number, is the provided value actually a number rather than text?
    • Performing more semantic checks such as, if this is a dataset about UK planning applications, then are the coordinates actually in the UK? Or is the start date for the application before the end date?
  • Utility
    • Confirming that provided data is of a useful quality, e.g. are geographic coordinates of the right precision? Or do any links to other resources actually work?
    • Warning about data that may or may not be included. For example, prompting the user to include additional fields that may improve the utility of the data. Or asking them to consider whether any personal data included should be there
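
Here’s a minimal sketch of how those three categories of check might look in code, using a made-up dataset of planning applications. The field names and rules are hypothetical; they’re only there to show how syntax, validity and utility checks differ in character.

```python
import json
from datetime import date

def validate_planning_application(raw: str) -> dict:
    """Run syntax, validity and utility checks over a (hypothetical)
    planning application record supplied as a JSON string."""
    errors, warnings = [], []

    # Syntax: is this well-formed JSON at all?
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {"errors": [f"Invalid JSON: {exc}"], "warnings": []}

    # Validity: required fields and sensible values.
    for field in ("reference", "start_date", "end_date", "latitude", "longitude"):
        if field not in record:
            errors.append(f"Missing required field: {field}")

    if "start_date" in record and "end_date" in record:
        if date.fromisoformat(record["start_date"]) > date.fromisoformat(record["end_date"]):
            errors.append("start_date is after end_date")

    # A rough semantic check: are the coordinates plausibly within the UK?
    lat = record.get("latitude")
    if isinstance(lat, (int, float)) and not (49.5 <= lat <= 61.0):
        errors.append("latitude does not look like a UK location")

    # Utility: optional information that would make the data more useful.
    if "description" not in record:
        warnings.append("Consider adding a description to help data users")

    return {"errors": errors, "warnings": warnings}

print(validate_planning_application('{"reference": "21/1234", "start_date": "2021-06-01"}'))
```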

These validation rules will typically come from a range of different sources, including:

  • The standard or specification that defines the syntax of the data.
  • The standard or specification (or schema) that describes the structure and content of the data. (This might be the same as the above, or might be defined elsewhere)
  • Legislation, which might guide, inform or influence what data should or should not be included
  • The implementer of the validation tool, who may have opinions about what is considered to be correct or useful data based on their specific needs (e.g. as a direct consumer of the data) or more broadly as a contributor to a community initiative to support improvements to how data is published

Data validators are frequently web based these days. At least for smaller datasets. But both desktop and command-line tools are also regularly used in different settings. The choice of design will be informed by things like how open the data can be, the volume of data being checked, and how the validator might be integrated into a data workflow, e.g. as an automated or manual step.

Examples of different types of data validator

Here are some examples of different data validators created for different purposes and projects:

  1. JSON lint
  2. GeoJSON Lint
  3. JSON LD Playground
  4. CSVlint
  5. ODI Leeds Business Rates format validator
  6. 360Giving Data Quality Tool
  7. OpenContracting Data Review Tool
  8. The OpenActive validator
  9. OpenReferral UK Service Validator
  10. The Schema.org validator
  11. Google’s Rich Results Test
  12. The Twitter Card validator
  13. Facebook’s sharing debugger

The first few on the list are largely syntax checkers. They validate whether your CSV, JSON or GeoJSON files are correctly structured.

The others go further and check not just the format of the data, but also its validity against a schema. That schema is defined in a standard intended to support consistent publication of data across a community. The goal of these tools is to improve quality of data for a wide range of potential users, by guiding publishers about how to publish data well.

The last three examples are validators that are designed to help publishers meet the needs of a specific application or consumer of the data. They’re an actionable way to test data against the requirements of a specific user.

Validators also vary in other ways.

For example, the 360Giving, OpenContracting and Rich Results Test validators all accept a range of different data formats. They validate different syntaxes against a common schema. Others are built around a single specific format.

Some tools provide a low-level view of the results, e.g. a list of errors and warnings with reference to specific sections of the data. Others provide a high-level interface, such as a preview of what the data looks like on a map or as it would be displayed in a specific application. This type of visual presentation can help catch other types of errors and more directly confirm how data might be interpreted, whilst also making the tool useful to a wider audience.

What do we mean by data being valid?

For simple syntax checking, identifying whether something is valid is straightforward. Your JSON is either well-formed or it’s not.

Validators that are designed around specific applications also usually have a clear marker of what is “valid”: can the application parse, interpret and display the data as expected? Does my twitter card look correct?

In other examples, the notion of “valid” is harder to define. There may be some basic rules around what a minimum viable dataset looks like. If so, these are easier to identify and classify as errors.

But there is often variability within a schema. E.g. optional elements. This means that validators need to offer more than just a binary decision and instead offer warnings, suggestions and feedback.

For example, when thinking about the design of the OpenActive validator we discussed the need to go beyond simple validation and provide feedback and prompts along the lines of “you haven’t provided a price, is the event free or chargeable?” Or “you haven’t provided an image for this event, this is legal but evidence shows that participants are more likely to sign-up to events where they can see what participation looks like.”
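
A sketch of what that kind of prompt might look like in practice, assuming a hypothetical event record with optional offers and image fields (this is not the actual OpenActive validator logic):

```python
def feedback_for_event(event: dict) -> list:
    """Generate prompts, rather than errors, for a hypothetical event record."""
    prompts = []
    if "offers" not in event:
        prompts.append("You haven't provided a price. Is the event free or chargeable?")
    if "image" not in event:
        prompts.append(
            "You haven't provided an image. This is legal, but participants are "
            "more likely to sign up when they can see what taking part looks like."
        )
    return prompts

print(feedback_for_event({"name": "Beginners yoga"}))
```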

To put this differently: data quality depends on how you’re planning to use the data. It’s not an absolute. If you’re not validating data for a specific application or purpose, then your tool should be prompting users to think about the choices they are making around how data is being shared.

In the context of sharing and publishing open data, this moves the role of a data validator beyond simply checking correctness, and towards identifying sources of friction that will exist between publisher and consumer.

Beyond the formal conformance criteria defined in a specification, deciding whether something is valid or not is really just a marker for how much extra work is required by a consumer. And in some cases the publisher may not have the time, budget or resources to invest in reducing that burden.

Things to think about when designing a validator

To wrap up this post, here are some things to think about when designing a data validator:

  • Who are your users? What level of technical skill and understanding are you designing for?
  • How will the validator be used or integrated into the user’s workflow? A tool for integration into a continuous integration environment will need to operate differently to something used to do acceptance checking before data is published. Maybe you need several different tools?
  • How much knowledge of the relevant standards or specification will a user need before they can use the tool? Should the tool facilitate learning and exploration about how to structure data, or is it just for checking existing data?
  • How can you provide good, clear feedback? Tools that rely on applying machine-readable schemas like JSON Schema can often produce cryptic messages, as they rely on an underlying library to report errors (see the sketch after this list)
  • How can you provide guidance and feedback that will help users decide how to improve data? Is the feedback actionable? (For example in CSVLint we figured out that when reporting that a user had an incorrect mime-type for their CSV file we could identify if it was served from AWS and provide a clear suggestion about how to fix the issue)
  • Would showing the data, as a preview or within a mocked up view, help surface problems or build confidence in how data is published?
  • Are the documentation about how to publish data and the reports from your validator consistent? If not, then fix the documentation or explain the limits of the validator
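
For example, here’s a minimal sketch of how errors from the jsonschema library might be translated into friendlier feedback. The schema and messages are hypothetical; the point is simply that raw validation errors usually need a translation layer before being shown to users.

```python
from jsonschema import Draft7Validator

# A hypothetical schema: events must have a name and a numeric capacity.
SCHEMA = {
    "type": "object",
    "required": ["name", "capacity"],
    "properties": {
        "name": {"type": "string"},
        "capacity": {"type": "number"},
    },
}

FRIENDLY = {
    "required": "A required field is missing: {message}",
    "type": "A value is the wrong type: {message}",
}

def friendly_errors(document: dict) -> list:
    """Turn raw jsonschema errors into messages aimed at data publishers."""
    messages = []
    for error in Draft7Validator(SCHEMA).iter_errors(document):
        location = "/".join(str(p) for p in error.path) or "(top level)"
        template = FRIENDLY.get(error.validator, "{message}")
        messages.append(f"At {location}: " + template.format(message=error.message))
    return messages

print(friendly_errors({"capacity": "twenty"}))
```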

Finally, if you’re designing a validator for a specific application, then don’t mark as “invalid” anything that you can simply ignore. Don’t force the ecosystem to converge on your preferences.

You may not be interested in the full scope of a standard, but different applications and users will have different needs.

Data quality is a dialogue between publishers and users of data. One that will evolve over time as tools, applications, norms and standards become adopted across a data ecosystem. A data validator is an important building block that can facilitate that discussion.

Some lessons learned from building standards around Schema.org

OpenActive is a community-led initiative in the sport and physical activity sector in England. Its goal is to help get people healthier and more active by making it easier for people to find information about activities and events happening in their area. Publishing open data about opportunities to be active is a key part of its approach.

The initiative has been running for several years, funded by Sport England. It’s supported by a team at the Open Data Institute who are working in close collaboration with a range of organisations across the sector.

During the early stages of the project I was responsible for leading the work to develop the technical standards and guidance that would help organisations publish open data about squash courts and exercise classes. I’ve written some previous blog posts that described the steps that got us to version 1.0 of the standards and then later the roadmap towards 2.0.

Since then the team have been exploring new features like publishing data about walking and cycling routes, improving accessibility information and, more recently, testing a standard API for booking classes.

If you’re interested in more of the details then I’d encourage you to dig into those posts as well as the developer portal.

What I wanted to cover in this blog post are some reflections about one of the key decisions we made early in the standards workstream. This was to base the core data model on Schema.org.

Why did we end up basing the standards on Schema.org?

We started the standards work in OpenActive by doing a proper scoping exercise. This helped us to understand the potential benefits of introducing a standard, and the requirements that would inform its development.

As part of our initial research, we did a review of what standards existed in the sector. We found very little that matched our needs. The few APIs that were provided were quite limited and proprietary and there was little consistency around how data was organised.

It was clear that some standardisation would be beneficial and that there was little in the way of sector-specific work to build on. It was also clear that we’d need a range of different types of standard: data formats and APIs to support exchange of data, a common data model to help organise data and a taxonomy to help describe different types of activity.

For the data model, it was clear that the core domain model would need to be able to describe events. E.g. that a yoga class takes place in a specific gym at regular times. This would support basic discovery use cases. Where can I go and exercise today? What classes are happening near me?

As part of our review of existing standards, we found that Schema.org already provided this core model along with some additional vocabulary that would help us categorise and describe both the events and locations. For example, whether an Event was free, its capacity and information about the organiser.
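
To give a flavour of that fit, here’s a rough JSON-LD sketch of a class described using core Schema.org terms. It’s illustrative only and not the OpenActive specification itself, which adds its own extensions and constraints.

```python
import json

# A yoga class described using core Schema.org vocabulary (illustrative only).
yoga_class = {
    "@context": "https://schema.org",
    "@type": "Event",
    "name": "Beginners yoga",
    "startDate": "2021-06-01T18:30:00",
    "isAccessibleForFree": False,
    "maximumAttendeeCapacity": 20,
    "organizer": {"@type": "Organization", "name": "Example Leisure Centre"},
    "location": {
        "@type": "Place",
        "name": "Example Leisure Centre",
        "address": {"@type": "PostalAddress", "addressLocality": "Bath"},
    },
    "offers": {"@type": "Offer", "price": "7.50", "priceCurrency": "GBP"},
}

print(json.dumps(yoga_class, indent=2))
```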

For many people Schema.org may be more synonymous with publishing data for use by search engines. But as a project its goal is much broader: it is “a collaborative, community activity with a mission to create, maintain, and promote schemas for structured data”.

The data model covers much more than what search engines are consuming. Some communities are instead using the project as a means to collaborate on developing better vocabulary for sharing data between other applications. As well as aligning existing vocabularies under a common umbrella.

New standards should ideally be based on existing standards. We knew we were going to be building the OpenActive technical standards around a “stack” of standards that included HTTP, JSON and JSON-LD. So it was a natural step to base our initial domain model on aspects of Schema.org.

What were the benefits?

An early benefit of this approach is that we could immediately focus our roadmap on exploring extensions to the Schema.org data model that would add value to the physical activity sector.

Our initial community sessions around the OpenActive standards involved demonstrating how well the existing Schema.org model fitted the core requirements. And exploring where additional work was needed.

This meant we skipped any wrangling around how to describe events and instead focused on what we wanted to say about them. Important early questions focused on what information potential participants would find helpful in understanding whether a specific activity or event is something they might want to try. For example, details like: what activities does it involve, and what level of competency is needed?

We were able to identify those elements of the core Schema.org model that supported our use cases and then document some extensions in our own specifications. The extensions and clarifications were important for the OpenActive community, but not necessarily relevant in the broader context in which Schema.org is being used. We wanted to build some agreement and usage in our community first, before suggesting changes to Schema.org.

As well as giving us an initial head start, the decision also helped us address new requirements much more quickly.

As we uncovered further requirements that meant expanding our data model, we were always able to first look to see if existing Schema.org terms covered what we needed. We began using it as a kind of “dictionary” that we could draw on when needed.

Where existing parts of the Schema.org model fitted our needs, it was gratifying to be able to rapidly address the new requirements by documenting patterns for how to use them. Data publishers were also doing the same thing. Having a common dictionary of terms gave freedom to experiment with new features, drawing on terms defined in a public schema, before the community had discussed and agreed how to implement those patterns more broadly.

Every standards project has its own cadence. The speed of development and adoption are tied up with a whole range of different factors that go well beyond how quickly you can reach consensus around a specification.

But I think the decision to use Schema.org definitely accelerated progress and helped us more quickly deliver a data model that covered the core requirements for the sector.

Where were the challenges?

The approach wasn’t without its challenges, however.

Firstly, for a sector that was new to building open standards, choosing to base parts of that new standard on one project and then defining extensions created some confusion. Some communities seem more comfortable with piecing together vocabularies and taxonomies, but that is not true more widely.

Developers found it tricky to refer to both specifications when exploring their options for publishing different types of data. So we ended up expanding our documentation to cover all of the Schema.org terms we recommended or suggested people use, instead of focusing more on our own extensions.

Secondly, we also initially adopted the same flexible, non-prescriptive approach to data publishing that Schema.org uses. It does not define strict conformance criteria, and there are often different options for how the same data might be organised depending on the level of detail a publisher has available. If Schema.org were too restrictive then it would limit how well the model could be used by different communities. It also leaves space for usage patterns to emerge.

In OpenActive we recognised that the physical activity sector had a wide range of capabilities when it came to publishing structured data. And different organisations organised data in different ways. We adopted the same less prescriptive approach to publishing with the goal of reducing the barriers to getting more data published. Essentially asking publishers to structure data as best they could within the options available.

In the end this wasn’t the right decision.

Too much flexibility made it harder for implementers to understand what data would be most useful to publish. And how to do it well. Many publishers were building new services to expose the data so they needed a clearer specification for their development teams.

We addressed this in Version 2 of the specifications by considerably tightening up the requirements. We defined which terms were required or just recommended (and why). And added cardinalities and legal values for terms. Our specification became a more formal, extended profile of Schema.org. This also allowed us to build a data validator that is now being released and maintained alongside the specifications.

Our third challenge was about process. In a few cases we identified changes that we felt would sit more naturally within Schema.org than our own extensions. For example, they were improvements and clarifications around the core Event model that would be useful more widely. So we submitted those as proposed changes and clarifications.

Given that Schema.org has a very open process, and the wide range of people active in discussing issues and proposals, it was sometimes hard to know how decisions would get made. We had good support from Dan Brickley and others stewarding the project, but without knowing much about who is commenting on your proposal, their background or their own use cases, it was tricky to know how much time to spend on handling this feedback. Or when we could confidently say that we had achieved some level of consensus.

We managed to successfully navigate this, by engaging as we would within any open community: working transparently and collegiately, and being willing to reflect on and incorporate feedback regardless of its source.

The final challenge was about assessing the level of use of different parts of the Schema.org model. If we wanted to propose a change in how a term was documented or suggest a revision to its expected values, it is difficult to assess the potential impact of that change. There’s no easy way to see which applications might be relying on specific parts of the model. Or how many people are publishing data that uses different terms.

The Schema.org documentation does flag terms that are currently under discussion or evaluation as “pending”. But outside of this it’s difficult to understand more about how the model is being used in practice. To do that you need to engage with a user community, or find some metrics about deployment.

We handled this by engaging with the open process of discussion, sharing our own planned usage to inform the discussion. And, where we felt that Schema.org didn’t fit with the direction we needed, we were happy to look to other standards that better filled those gaps. For example we chose to use SKOS to help us organise and structure a taxonomy of physical activities rather than using some of the similar vocabulary that Schema.org provides.
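
As a sketch of what that looks like, here’s a tiny SKOS fragment built with rdflib. The namespace and concepts are hypothetical and not the actual OpenActive activity list; it just shows why SKOS suits a hierarchy of activities.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

# Hypothetical namespace; the real OpenActive activity list uses its own URIs.
ACT = Namespace("https://example.org/activities/")

g = Graph()
g.bind("skos", SKOS)

# A narrower concept with preferred and alternative labels.
g.add((ACT.yoga, RDF.type, SKOS.Concept))
g.add((ACT.yoga, SKOS.prefLabel, Literal("Yoga", lang="en")))
g.add((ACT.yoga, SKOS.altLabel, Literal("Hatha yoga", lang="en")))
g.add((ACT.yoga, SKOS.broader, ACT.mind_and_body))

# The broader grouping it sits under.
g.add((ACT.mind_and_body, RDF.type, SKOS.Concept))
g.add((ACT.mind_and_body, SKOS.prefLabel, Literal("Mind and body", lang="en")))

print(g.serialize(format="turtle"))
```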

Choosing to draw on Schema.org as a source of part of our domain model didn’t mean that we felt tied to using only what it provides.

Some recommendations

Overall I’m happy that we made the right decision. The benefits definitely outweighed the challenges.

But navigating those challenges was easier because those of us leading the standards work were comfortable both with working in the open and in combining different standards to achieve a specific goal. Helping to build more competency in this area is one goal of the ODI standards guidebook.

If you’re involved in a project to build a common data model as part of a community project to publish data, then I’d recommend looking at whether basing some or all of that model on Schema.org might help kickstart your technical work.

If you do that, my personal advice would be:

  • Remember that Schema.org isn’t the right home for every data model. Depending on your requirements, the complexity and the potential uses for the data, you may be better off designing and iterating on your model separately. Similarly, don’t expect that every change or extension you might want to make will necessarily be accepted into the core model
  • Don’t assume that search engines will start using your data, just because you’re using Schema.org as a basis for publishing, or even if you successfully submit change proposals. It’s not a means of driving adoption and use of your data or preferred model
  • Plan to write your own specifications and documentation that describe how your application or community expects data to be published. You’ll need to add more conformance criteria and document useful patterns that go beyond what Schema.org is providing
  • Work out how you will engage with your community. To make it easier to refine your specifications, discuss extensions and gather implementation feedback, you’ll still need a dedicated forum or channel for your community to collaborate. Schema.org doesn’t really provide a home for that. You might have your own GitHub project or set up a W3C community group.
  • Build your own tooling. Schema.org are improving their own tooling, but you’ll likely need your own validation tools that are tailored to your community and your specifications
  • Contribute to the Schema.org project where you can. If you have feedback, proposed changes or revisions then submit these to the project. It’s through a community approach that we improve the model for everyone. Just be aware that there are likely to be a whole range of different use cases that may be different to your own. Your proposals may need to go through several revisions before being accepted. Proposals that draw on real-world experience or are tied to actual applications will likely carry more weight than general opinions about the “right” way to design something
  • Be prepared to diverge where necessary. As I’ve explained above, sometimes the right option is to propose changes to Schema.org. And sometimes you may need to be ready to draw on other standards or approaches.

The UK Smart Meter Data Ecosystem

Disclaimer: this blog post is about my understanding of the UK’s smart meter data ecosystem and contains some opinions about how it might evolve. These do not in any way reflect those of Energy Sparks of which I am a trustee.

This blog post is an introduction to the UK’s smart meter data ecosystem. It sketches out some of the key pieces of data infrastructure with some observations around how the overall ecosystem is evolving.

It’s a large, complex system so this post will only touch on the main elements. Pointers to more detail are included along the way.

If you want a quick reference with more diagrams, then this UK government document, “Smart Meters, Smart Data, Smart Growth”, is a good start.

Smart meter data infrastructure

Smart meters and meter readings

Historically, data about your home or business energy usage was collected by someone coming to read the actual numbers displayed on the front of your meter. And in some cases that’s still how the data is collected. It’s just that today you might be entering those readings into a mobile or web application provided by your supplier. In between those readings, your supplier will be estimating your usage.

This situation improved with the introduction of AMR (“Automated Meter Reading”) meters which can connect via radio to an energy supplier. The supplier can then read your meter automatically, to get basic information on your usage. After receiving a request the meter can broadcast the data via radio signal. These meters are often only installed in commercial properties.

Smart meters are a step up from AMR meters. They connect via a Wide Area Network (WAN) rather than radio, support two way communications and provide more detailed data collection. This means that when you have a smart meter your energy supplier can send messages to the meter, as well as taking readings from it. These messages can include updated tariffs (e.g. as you switch supplier or if you are on a dynamic tariff) or a notification to say you’ve topped up your meter, etc.

The improved connectivity and functionality means that readings can be collected more frequently and are much more detailed. Half hourly usage data is the standard. A smart meter can typically store around 13 months of half-hourly usage data. 

The first generation of smart meters are known as SMETS-1 meters. The latest meters are SMETS-2.

Meter identifiers and registers

Meters have unique identifiers.

For gas meters the identifiers are called MPRNs. I believe these are allocated in blocks to gas providers to be assigned to meters as they are installed.

For electricity meters, these identifiers are called MPANs. Electricity meters also have a serial number. I believe MPANs are assigned by the individual regional electricity network operators and that this information is used to populate a national database of installed meters.

From a consumer point of view, services like Find My Supplier will allow you to find your MPRN and energy suppliers.

Connectivity and devices in the home

If you have a smart meter installed then your meters might talk directly to the WAN, or access it via a separate controller that provides the necessary connectivity. 

But within the home, devices will talk to each other using Zigbee, which is a low power internet of things protocol. Together they form what is often referred to as the “Home Area Network” (HAN).

It’s via the home network that your “In Home Display” (IHD) can show your current and historical energy usage as it can connect to the meter and access the data it stores. Your electricity usage is broadcast to connected devices every 10 seconds, while gas usage is broadcast every 30 minutes.

Your IHD can show your energy consumption in various ways, including how much it is costing you. This relies on your energy supplier sending your latest tariff information to your meter.

As this article by Bulb highlights, the provision of an IHD and its basic features is required by law. Research showed that IHDs were more accessible and nudged people towards being more conscious of their energy usage. The high-frequency updates from the meter to connected devices makes it easier, for example, for you to identify which devices or uses contribute most to your bill.

Your energy supplier might provide other apps and services that provide you with insights, using the data collected via the WAN.

But you can also connect other devices into the home network provided by your smart meter (or data controller). One example is a newer category of IHD called a “Consumer Access Device” (CAD), e.g. the Glow.

These devices connect via Zigbee to your meter and via WiFi to a third-party service, where they will send your meter readings. For the Glow device, that service is operated by Hildebrand.

These third party services can then provide you with access to your energy usage data via mobile or web applications. Or even via API. Otherwise as a consumer you need to access data via whatever methods your energy supplier supports.

The smart meter network infrastructure

SMETS-1 meters connected to a variety of different networks. This meant that if you switched suppliers then the new supplier frequently couldn’t access your meter because it was on a different network, so meters needed to be replaced. And, even if they were on the same network, then differences in technical infrastructure meant the meters might lose functionality.

SMETS-2 meters don’t have this issue as they all connect via a shared Wide Area Network (WAN). There are two of these covering the north and south of the country.

While SMETS-2 meters are better than previous models, they still have all of the issues of any Internet of Things device: problems with connectivity in rural areas, need for power, varied performance based on manufacturer, etc.

Some SMETS-1 meters are also now being connected to the WAN. 

Who operates the infrastructure?

The Data Communications Company (DCC) is a state-licensed monopoly that operates the entire UK smart meter network infrastructure. It’s a wholly-owned subsidiary of Capita. Their current licence runs until 2025.

DCC subcontracted provision of the WAN to support connectivity of smart meters to two regional providers. In the North of England and Scotland that provider is Arqiva. In the rest of England and Wales it is Telefonica UK (who own O2).

All of the messages that go to and from the meters via the WAN go via DCC’s technical infrastructure.

The network has been designed to be secure. As a key piece of national infrastructure, that’s a basic requirement. Here’s a useful overview of how the security was designed, including some notes on trust and threat modelling.

Part of the design of the system is that there is no central database of meter readings or customer information. It’s all just messages between the suppliers and the meters. However, as they describe in a recently published report, the DCC do apparently hold some databases of the “system data” generated by the network: metadata about individual meters and the messages sent to them.

The smart meter roll-out

It’s now mandatory for smart meters to be installed in domestic and smaller commercial properties in the UK. Companies can install SMETS-1 or SMETS-2 meters, but the rules were changed recently so only newer meters count towards their individual targets. And energy companies can get fined if they don’t install them quickly enough.

Consumers are being encouraged to have smart meters fitted in existing homes, as meters are replaced, to provide them with more information on their usage and access to better tariffs, such as those that offer dynamic time-of-day pricing.

But there are also concerns around privacy and fears of energy supply being remotely disconnected, which are making people reluctant to switch when given the choice. Trust is clearly an important part of achieving a successful rollout.

Ofgem have a handy guide to consumer rights relating to smart meters. Which? have an article about whether you have to accept a smart meter, and Energy UK and Citizens Advice have a one-page “data guide” that provides the key facts.

But smart meters aren’t being uniformly rolled out. For example they are not mandated for all commercial (non-domestic) properties. 

At the time of writing there are over 10 million smart meters connected via the DCC, with 70% of those being SMETS-2 meters. The Elexon dashboard for smart electricity meters estimates that the rollout of electricity meters is roughly 44% complete. There are also some official statistics about the rollout.

The future will hold much more fine-grained data about energy usage across the homes and businesses in the UK. But in the short term there’s likely to be a continued mix of different meter types (dumb, AMR and smart), meaning that the quality and coverage of data will differ between domestic and non-domestic properties because of how smart meters are being rolled out.

Smart meters will give consumers greater choice in tariffs because the infrastructure can better deal with dynamic pricing. It will help to shift to a greener more efficient energy network because there is better data to help manage the network.

Access to the data infrastructure

Access to and use of the smart meter infrastructure is governed by the Smart Energy Code. Section I covers privacy.

The code sets out the roles and responsibilities of the various actors who have access to the network. That includes the infrastructure operators (e.g. the organisations looking after the power lines and cables) as well as the energy companies (e.g. those who are generating the energy) and the energy suppliers (e.g. the organisations selling you the energy). 

There is a public list of all of the organisations in each category and a summary of their licensing conditions that apply to smart meters.

The focus of the code is on those core actors. But there is an additional category of “Other Providers”. This is basically a miscellaneous group of organisations that are not directly involved in provision of energy as a utility, but which may have or require access to the data infrastructure.

These other providers include organisations that:

  • provide technology to energy companies who need to be able to design, test and build software against the smart meter network
  • offer services like switching and product recommendations
  • access the network on behalf of consumers, allowing them to directly access usage data in the home using devices, e.g. Hildebrand and its Glow device
  • provide other additional third-party services. This includes companies like Hildebrand and N3RGY that are providing value-added APIs over the core network

To be authorised to access the network you need to go through a number of stages, including an audit to confirm that you have the right security in place. This can take a long time to complete. Documentation suggests this might take upwards of 6 months.

There are also substantial annual costs for access to the network. This helps to make the infrastructure sustainable, with all users contributing to it. 

Data ecosystem map


As a summary, here are the key points:

  • your in-home devices send and receive messages and data via the smart meter or controller installed in your home or business property
  • your in-home device might also be sending your data to other services, with your consent
  • messages to and from your meter are sent via a secure network operated by the DCC
  • the DCC provide APIs that allow authorised organisations to send and receive messages from that data infrastructure
  • the DCC doesn’t store any of the meter readings, but do collect metadata about the traffic over that network
  • organisations that have access to the infrastructure may store and use the data they can access, but generally need consent from users for detailed meter data
  • the level and type of access, e.g. what messages can be sent and received, may differ across organisations
  • your energy supplier uses the data it retrieves from the DCC to generate your bills, provide you with services, optimise the system, etc
  • the UK government has licensed the DCC to operate that national data infrastructure, with Ofgem regulating the system

At a high-level, the UK smart meter system is like a big federated database: the individual meters store and submit data, with access to that database being governed by the DCC. The authorised users of that network build and maintain their own local caches of data as required to support their businesses and customers.

The evolving ecosystem

This is a big complex piece of national data infrastructure. This makes it interesting to unpick as an example of real-world decisions around the design and governance of data access.

It’s also interesting as the ecosystem is evolving.

Changing role of the DCC

The DCC have recently published a paper called “Data for Good” which sets out their intention to offer a “system data exchange” (you should read that as “system data” exchange). This means providing access to the data they hold about meters and the messages sent to and from them. (There’s a list of these message types in a SEC code appendix.)

The paper suggests that increased access to that data could be used in a variety of beneficial ways. This includes helping people in fuel poverty, or improving management of the energy network.

Encouragingly the paper talks about open and free access to data, which seems reasonable if data is suitably aggregated and anonymised. However the language is qualified in many places. DCC will presumably be incentivised by the existing ecosystem to reduce its costs and find other revenue sources. And their 5 year business development plan makes it clear that they see data services as a new revenue stream.

So time will tell.

The DCC is also required to improve efficiency and costs for operating the network to reduce burden on the organisations paying to use the infrastructure. This includes extending use of the network into other areas. For example to water meters or remote healthcare (see note at end of page 13).

Any changes to what data is provided, or how the network is used will require changes to the licence and some negotiation with Ofgem. As the licence is due to be renewed in 2025, then this might be laying groundwork for a revised licence to operate.

New intermediaries

In addition to a potentially changing role for the DCC, the other area in which the ecosystem is growing is via “Other Providers” that are becoming data intermediaries.

The infrastructure and financial costs of meeting the technical, security and audit requirements needed for direct access to the DCC network create a high barrier for third parties wanting to provide additional services that use the data.

The DCC APIs and messaging infrastructure are also difficult to work with, meaning that integration costs can be high. The DCC “Data for Good” report notes that direct integration “…is recognised to be challenging and resource intensive”.

There are a small but growing number of organisations, including Hildebrand, N3RGY, Smart Pear and Utiligroup, who see an opportunity to lower this barrier by providing value-added services over the DCC infrastructure. For example, simple JSON-based APIs that simplify access to meter data.
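
To illustrate what “simpler” means in practice, here’s a sketch of the kind of request a downstream user might make to an intermediary. The endpoint, parameters and response shape are entirely hypothetical and not any specific provider’s API; real services also require consent to have been registered for the meter in question.

```python
import requests

# Entirely hypothetical intermediary API; not any specific provider's product.
BASE_URL = "https://api.example-intermediary.co.uk/v1"
MPAN = "1200012345678"  # illustrative electricity meter point number

response = requests.get(
    f"{BASE_URL}/electricity/{MPAN}/consumption",
    params={"start": "2021-06-01", "end": "2021-06-02", "granularity": "halfhourly"},
    headers={"Authorization": "Bearer <access-token>"},
)
response.raise_for_status()

# Print each half-hourly reading returned by the (hypothetical) service.
for reading in response.json()["values"]:
    print(reading["timestamp"], reading["kwh"])
```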

Coupled with access to sandbox environments to support prototyping, this provides a simpler and cheaper API with which to integrate. Security remains important but the threat profiles and risks are different as API users have no direct access to the underlying infrastructure and only read-only access to data.

To comply with the governance of the existing system, the downstream user still needs to ensure they have appropriate consent to access data. And they need to be ready to provide evidence if the intermediary is audited.

The APIs offered by these new intermediaries are commercial services: the businesses are looking to do more than just cover their costs and will be hoping to generate significant margin through what is basically a reseller model. 

It’s worth noting that access to AMR meter data is also typically via commercial services, at least for non-domestic meters. The price per meter for data from smart meters currently seems lower, perhaps because it’s relying on a more standard, shared underlying data infrastructure.

As the number of smart meters grows I expect access to a cheaper and more modern API layer will become increasingly interesting for a range of existing and new products and services.

Lessons from Open Banking

From my perspective the major barrier to more innovative use of smart meter data is the existing data infrastructure. The DCC obviously recognises the difficulty of integration and other organisations are seeing potential for new revenue streams by becoming data intermediaries.

And needless to say, all of these new intermediaries have their own business models and bespoke APIs. Ultimately, while they may end up competing in different sectors or markets, or over quality of service, they’re all relying on the same underlying data and infrastructure.

In the finance sector, Open Banking has already demonstrated that a standardised set of APIs, licensing and approach to managing access and consent can help to drive innovation in a way that is good for consumers. 

There are clear parallels to be drawn between Open Banking, which increased access to banking data, and how access to smart meter data might be increased. It’s a very similar type of data: highly personal, transactional records. And can be used in very similar ways, e.g. account switching.

The key difference is that there’s no single source of banking transactions, so regulation was required to ensure that all the major banks adopted the standard. Smart meter data is already flowing through a single state-licensed monopoly.

Perhaps if the role of the DCC is changing, then they could also provide a simpler standardised API to access the data? Ofgem and DCC could work with the market to define this API as happened with Open Banking. And by reducing the number of intermediaries it may help to increase trust in how data is being accessed, used and shared?

If there is a reluctance to extend DCC’s role in this direction then an alternative step would be to recognise the role and existence of these new types of intermediary within the Smart Energy Code. That would allow their licence to use the network to include agreement to offer a common, core standard API, common data licensing terms and a common approach to the collection and management of consent. Again, Ofgem, DCC and others could work with the market to define that API.

For me either of these approaches are the most obvious ways to carry the lessons and models from Open Banking into the energy sector. There are clearly many more aspects of the energy data ecosystem that might benefit from improved access to data, which is where initiatives like Icebreaker One are focused. But starting with what will become a fundamental part of the national data infrastructure seems like an obvious first step to me.

The other angle that Open Banking tackled was creating better access to data about banking products. The energy sector needs this too, as there’s no easy way to access data on energy supplier tariffs and products.

The importance of tracking dataset retractions and updates

There are lots of recent examples of researchers collecting and releasing datasets which end up raising serious ethical and legal concerns. The IBM facial recognition dataset being just one example that springs to mind.

I read an interesting post exploring how facial recognition datasets are being widely used despite being taken down due to ethical concerns.

The post highlights how these datasets, despite being retracted, are still being widely used in research. This is in part because the original datasets are still circulating via mirrors of the original files. But also because they have been incorporated into derived datasets which are still being distributed with the original contents intact.

The authors describe how just one dataset, the DukeMTMC dataset, was used in more than 135 papers after being retracted, 116 of those drawing on derived datasets. Some datasets have many derivatives; one example cited has been used in 14 derived datasets.

The research raises important questions about how datasets are published, mirrored, used and licensed. There’s a lot to unpack there and I look forward to reading more about the research. The concerns around open licensing are reminiscent of similar debates in the open source community leading to a set of “ethical open source licences“.

But the issue I wanted to highlight here is the difficulty of tracking the mirroring and reuse of datasets.

Change notification is a missing piece of our data infrastructure.

If it were easier to monitor important changes to datasets, then it would be easier to:

  • maintain mirrors of data
  • retract or remove data that breached laws or social and ethical norms
  • update derived datasets to remove or amend data
  • re-run analyses against datasets which have seen significant corrections or revisions
  • assess the impacts of poor quality or unethically shared data
  • proactively notify relevant communities of potential impacts relating to published data
  • monitor and review the reasons why datasets get retracted
  • …etc, etc

The importance of these activities can be seen in other contexts.

For example, Retraction Watch is a project that monitors retractions of research papers. CrossMark helps to highlight major changes to published papers including corrections and retractions.

Principle T3: Orderly Release, of the UK Statistics Authority code of practice explains that scheduled revisions and unscheduled corrections to statistics should be transparent, and that organisations should have a specific policy for how they are handled.

More broadly, product recalls and safety notices are standard for consumer goods. Maybe datasets should be treated similarly?

This feels like an area that warrants further research, investment and infrastructure. At some point we need to raise our sights from setting up even more portals and endlessly refining their feature sets and think more broadly about the system and ecosystem we are building.

Four types of innovation around data

Vaughn Tan’s The Uncertainty Mindset is one of the most fascinating books I’ve read this year. It’s an exploration of how to build R&D teams drawing on lessons learned in high-end kitchens around the world. I love cooking and I’m interested in creative R&D and what makes high-performing teams work well. I’d strongly recommend it if you’re interested in any of these topics.

I’m also a sucker for a good intellectual framework that helps me think about things in different ways. I did that recently with the BASEDEF framework.

Tan introduces a nice framework in Chapter 4 of the book which looks at four broad types of innovation around food. These are presented as a way to help the reader understand how and where innovation creates impact in restaurants. The four categories are:

  1. New dishes – new arrangements of ingredients, where innovation might be incremental refinements to existing dishes, combining ingredients together in new ways, or using ingredients from different contexts (think “fusion”)
  2. New ingredients – coming up with new things to be cooked
  3. New cooking methods – new ways of cooking things, like spherification or sous vide
  4. New cooking processes – new ways of organising the processes of cooking, e.g. to help kitchen staff prepare a dish more efficiently and consistently

The categories at the top are more evident to the consumer, those lower down less so. But the impacts of new methods and processes are greater as they apply in a variety of contexts.

Somewhat inevitably, I found myself thinking about how these categories work in the context of data:

  1. New dishes → new analyses – New derived datasets made from existing primary sources. Or new ways of combining datasets to create insights. I’ve used the metaphor of cooking to describe data analysis before; those recipes for data-informed problem solving help to document this stage to make it reproducible
  2. New ingredients → new datasets and data sources – Finding and using new sources of data, like turning image, text or audio libraries into datasets, using cheaper sensors, finding a way to extract data from non-traditional sources, or using phone sensors for earthquake detection
  3. New cooking methods for cleaning, managing or analysing data – which includes things like Jupyter notebooks, machine learning or differential privacy
  4. New cooking processes for organising the collection, preparation and analysis of data – e.g. collaborative maintenance, developing open standards for data or approaches to data governance and collective consent?

The breakdown isn’t perfect, but I found the exercise useful to think through the types of innovation around data. I’ve been conscious recently that I’m often using the word “innovation” without really digging into what that means, how that innovation happens and what exactly is being done differently or produced as a result.

The categories are also useful, I think, in reflecting on the possible impacts of breakthroughs of different types. Or perhaps where investment in R&D might be prioritised and where ensuring the translation of innovative approaches into the mainstream might have most impact?

What do you think?

Increasing inclusion around open standards for data

I read an interesting article this week by Ana Brandusescu, Michael Canares and Silvana Fumega. Called “Open data standards design behind closed doors?” it explores issues of inclusion and equity around the development of “open data standards” (which I’m reading as “open standards for data”).

Ana, Michael and Silvana rightly highlight that standards development is often seen and carried out as a technical process, whereas their development and impacts are often political, social or economic. To ensure that standards are well designed, we need to recognise their power, choose when to wield that tool, and ensure that we use it well. The article also asks questions about how standards are currently developed and suggests a framework for creating more participatory approaches throughout their development.

I’ve been reflecting on the article this week alongside a discussion that took place in this thread started by Ana.

Improving the ODI standards guidebook

I agree that standards development should absolutely be more inclusive. I too often find myself in standards discussions and groups with people that look like me and whose experiences may not always reflect those who are ultimately impacted by the creation and use of a standard.

In the open standards for data guidebook we explore how and why standards are developed to help make that process more transparent to a wider group of people. We also placed an emphasis on the importance of the scoping and adoption phases of standards development because this is so often where standards fail. Not just because the wrong thing is standardised, but also because the standard is designed for the wrong audience, or its potential impacts and value are not communicated.

Sometimes we don’t even need a standard. Standards development isn’t about creating specifications or technology, those are just outputs. The intended impact is to create some wider change in the world, which might be to increase transparency, or support implementation of a policy or to create a more equitable marketplace. Other interventions or activities might achieve those same goals better or faster. Some of them might not even use data(!)

But looking back through the guidebook, while we highlight in many places the need for engagement, outreach, developing a shared understanding of goals and desired impacts and a clear set of roles and responsibilities, we don’t specifically foreground issues of inclusion and equity as much as we could have.

The language and content of the guidebook could be improved. As could some prototype tools we included like the standards canvas. How would that be changed in order to foreground issues of inclusion and equity?

I’d love to get some contributions to the guidebook to help us improve it. Drop me a message if you have suggestions about that.

Standards as shared agreements

Open standards for data are reusable agreements that guide the exchange of data. They shape how I, as a data user, collect data from you, the data provider. They also shape how you (re)present the data you have collected and, in many cases, will ultimately impact how you collect data in the future.

If we foreground standards as agreements for shaping how data is collected and shared, then to increase inclusion and equity in the design of those agreements we can look to existing work like the Toolkit for Centering Racial Equity which provides a framework for thinking about inclusion throughout the life-cycle of data. Standards development fits within that life-cycle, even if it operates at a larger scale and extends it out to different time frames.

We can also recognise existing work and best practices around good participatory design and research.

We should avoid standards development, as a process, being divorced from broader discussions and best practices around ethics, equity and engagement around data. Taking a more inclusive and equitable approach to standards development is part of the broader discussion around the need for more integration across the computing and social sciences.

We may also need to recognise that sometimes agreements are made that don’t provide equitable outcomes for everyone. We might not be able to achieve a compromise that works for everyone. Being transparent about the goals and aims of a standard, and how it was developed, can help to surface who it is designed for (or not). Sometimes we might just need different standards, optimised for different purposes.

Some standards are more harmful than others

There are many different types of standard. And standards can be applied to different types of data. The authors of the original article didn’t really touch on this within their framework, but I think it’s important to recognise these differences as part of any follow-on activities.

The impacts of a poorly designed standard that classifies people or their health outcomes will be much more harmful than those of a poorly designed data exchange format. See all of Susan Leigh Star’s work. Or concerns from indigenous peoples about how they are counted and represented (or not) in statistical datasets.

Increasing inclusion can help to mitigate the harmful impacts around data. So focusing on improving inclusion (or recognising existing work and best practices) around the design of standards with greater capacity for harm is important. The skills and experience required to develop a taxonomy are fundamentally different from those required to develop a data exchange format.

Recognising these differences is also helpful when planning how to engage with a wider group of people, as we can identify what help and input is needed: what skills or perspectives are lacking among those leading standards work? What help or support needs to be offered to increase inclusion, e.g. by developing skills, or choosing different collaboration tools or methods of seeking input?

Developing a community of practice

Since we launched the standards guidebook I’ve been wondering whether it would be helpful to have more of a community of practice around standards development. I found myself thinking about this again after reading Ana, Michael and Silvana’s article and the subsequent discussion on twitter.

What would that look like? Does it exist already?

Perhaps supported by a set of learning or training resources that re-purposes some of the ODI guidebook material alongside other resources to help others to engage with and lead impactful, inclusive standards work?

I’m interested to see how this work and discussion unfolds.

FAIR, fairer, fairest?

“FAIR” (or “FAIR data”) is a term that I’ve been bumping into more and more frequently. For example, it’s included in the UK’s recently published Geospatial Strategy.

FAIR is an acronym that stands for Findable, Accessible, Interoperable and Reusable. It defines a set of principles that highlight some important aspects of publishing machine-readable data well. For example they identify the need to adopt common standards, use common identifiers, provide good metadata and clear usage licences.

The principles were originally defined by researchers in the life sciences. They were intended to help to improve management and sharing of data in research. Since then the principles have been increasingly referenced in other disciplines and domains.

At the ODI we’re currently working with CABI on a project that is applying the FAIR data principles, alongside other recommendations, to improve data sharing in grants and projects funded by the Gates Foundation.

From the perspective of encouraging the management and sharing of well-structured, standardised, machine-readable data, the FAIR principles are pretty good. They explore similar territory as the ODI’s Open Data Certificates and Tim Berners-Lee’s 5-Star Principles.

But the FAIR principles have some limitations and have been critiqued by various communities. As the principles become adopted in other contexts it is important that we understand these limitations, as they may have more of an impact in different situations.

A good background on the FAIR principles and some of their limitations can be found in this 2018 paper. But there are a few I’d like to highlight in this post.

They’re just principles

A key issue with the FAIR principles is that they are just that: principles. They offer recommendations about best practices, but they don’t help you answer specific questions. For example:

  • what metadata is useful to publish alongside different types of datasets?
  • which standards and shared identifiers are the best to use when publishing a specific dataset?
  • where will people be looking for this dataset to ensure it’s findable?
  • what are the trade-offs of using different competing standards?
  • what terms of use and licensing are appropriate to use when publishing a specific dataset for use by a specific community?
  • …etc

Applying the principles to a specific dataset means you need to have a clear idea about what you’re trying to achieve, what standards and best practices are used by the community you’re trying to support, or what approach might best enable the ecosystem you’re trying to grow and support.

We touched on some of these issues in a previous project that CABI and ODI delivered to the Gates Foundation. We encouraged people to think about FAIR in the context of a specific data ecosystem.

Currently there’s very little guidance to support these decisions around FAIR, which makes it harder to assess whether something is really FAIR in practice. Inevitably there will be trade-offs that involve making choices about standards and how much to invest in data curation and publication. Principles only go so far.

The principles are designed for a specific context

The FAIR principles were designed to reflect the needs of a specific community and context. Many of the recommendations are also broadly applicable to data publishing in other domains and contexts. But they embody design decisions that may not apply universally.

For example, they choose to emphasise machine-readability. Other communities might choose to focus on other elements that are more important to them or their needs.

As an alternative, the CARE principles for indigenous data governance are based around Collective Benefit, Authority to Control, Responsibility and Ethics. Those are good principles too. Other groups have chosen to propose ways to adapt and expand on FAIR.

It may be that the FAIR principles will work well in your specific context or community. But it might also be true that, if you were to start from scratch and design a new set of principles, you might choose to highlight other things.

Whenever we are applying off-the-shelf principles in new areas, we need to think about whether they are helping us to achieve our own goals. Do they emphasise and prioritise work in the right areas?

The principles are not about being “fair”

Despite the acronym, the principles aren’t about being “fair”.

I don’t really know how to properly define “fair”. But I think it includes things like equity ‒ of access, or representation, or participation. And ethics and engagement. The principles are silent on those topics, leading some people to think about FAIRER data.

Don’t let the memorable acronym distract from the importance of ethics, consequence scanning and centering equity.

FAIR is not open

The principles were designed to be applied in contexts where not all data can be open. Life science research involves lots of sensitive personal information. Instead the principles recommend that data usage rights are clear.

I usually point out that FAIR data can exist across the data spectrum. But the principles don’t remind you that data should be as open as possible. Or prompt you to consider the impacts of different types of licensing. They just ask you to be clear about the terms of reuse, however restrictive they might be.

So, to recap: the FAIR data principles offer a useful framework of things to consider when making data more accessible and easier to reuse. But they are not perfect. And they do not consider all of the various elements required to build an open and trustworthy data ecosystem.

What kinds of data is it useful to include in a register?

Registers are useful lists of information. A register might be a list of countries, companies, or registered doctors. Or addresses.

At the ODI we did a whole report on registers. It looks at different types of registers and how they’re governed. And GDS built a whole infrastructure to support them being published and used across the UK government.

Registers are core components of some types of identifier systems. They help to collect and share information about some aspect of the world we’re collectively interested in. For that reason it can be useful to know more about how the register is governed. So we know what it contains and how that list might change over time.

When those lists of things are useful in many different contexts, then making those registers open helps us to connect together different datasets and analyse them in new ways. They help to unlock context.

How much information should we put in a register? What information might it be useful to capture about the things ‒ the countries, the companies, or the addresses ‒ that are in our shared lists? Do we record just a company number and a name? Or also include the address of the company headquarters and the date it was founded?

When I’ve been designing registers and similar reference datasets, there are some common categories of information that I usually think about.

Identifiers

It’s useful if the things in our list have a unique identifier. They might have other identifiers assigned by different systems.

By capturing identifiers we can do things like:

  • clearly refer to items in the register, so we can find their attributes
  • use that identifier to link together different datasets
  • map between datasets that use different identifiers

Names and Labels

Things in the real world aren’t often referred to by an identifier. We give things names. Sometimes they may have several names.

Including names and labels in our register allows us to do things like:

  • use a consistent, canonical name for things wherever they are referenced
  • link to things from a webpage
  • provide a way for a human being to recognise and find things in the register
  • turn a name into an identifier, so we can find more information about something

Relationships

Things in the real world are related to one another. Sometimes literally: I am your father (not really). Sometimes spatially (this thing is here, or next to this other thing). Sometimes our world is organised into hierarchies or connected in other ways.

Including relationships in our register allows us to do things like:

  • visualise, present and navigate the contents of the list in a variety of ways
  • aggregate and report data according to the relationships between things
  • put something on a map

Types and categories

The things in our list might not all be the same, or there may be differences between them. For example, different types of companies. Or residential versus business addresses. Things might also be put into different categories. A register of companies might also categorise businesses by sector.

Having types and categories in a list allows us to do things like:

  • extract the part of the list we are interested in; sometimes we don’t need the whole thing
  • visualise, present and navigate the contents of the list in a greater variety of ways
  • aggregate and report data according to how things are categorised

Lifecycle information

Things in the real world often have a life cycle. So do many digital things. Things are built, created, updated, revised, republished, retracted and demolished. Sometimes those events are tied to the thing being added to the register (“a list of registered companies”), sometimes they’re not (“a list of our current customers”).

Recording lifecycle information can help us to do things like:

  • understand the current state or status of something, which can help drive business and planning decisions
  • visualise, present and navigate the contents of the list in an even greater variety of ways
  • aggregate and report data according to where things are in their lifecycle

Administrative data (relating to the register)

It’s useful to capture data about when the information in a register has changed. For example when was something added to, or removed from a register? When did we last update its attributes or check that the information is current?

This type of information can help us to:

  • identify when information has been changed, so we can update our local copy of what’s in the register
  • extract the part of the list we are interested in, as maybe we only want current or historical entries, or just the recent additions
  • aggregate and report on how the data in the register has changed
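
To make those categories a bit more concrete, here’s a rough sketch of what a single register entry might look like if we captured all of them. It is purely illustrative: the field names, the example company and all of the values are hypothetical, not drawn from any real register.

    # A hypothetical register entry illustrating the categories discussed above.
    # All field names and values are invented for illustration only.
    company_entry = {
        # Identifiers: a primary identifier, plus identifiers from other systems
        "id": "C0012345",
        "other_identifiers": {"example-scheme": "XYZ-98765"},

        # Names and labels: a canonical name plus any alternatives
        "name": "Example Widgets Ltd",
        "alternative_names": ["Example Widgets"],

        # Relationships: links to other entries, or to entries in other registers
        "parent_company": "C0098765",
        "registered_address": "A0055511",

        # Types and categories
        "company_type": "private-limited",
        "sector": "manufacturing",

        # Lifecycle information about the thing itself
        "status": "active",
        "incorporated_on": "2005-03-14",

        # Administrative data about the register entry
        "entry_added_on": "2005-03-15",
        "entry_last_updated": "2023-11-02",
    }

Which of those fields belong in the register itself, and which belong in other datasets that link to it, is exactly the design decision discussed next.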

Everything else

The list of useful things we might want to include in a register is potentially open ended. The trick in designing a good register is working out which bits are useful to include in the register, and which bits should be part of separate databases.

A good register will contain the data that is most commonly used across systems. Centralising that data can reduce the work, costs and also the risks of collecting and maintaining it. But if you put too much into the register you may end up increasing costs, as there is more to maintain, or users have to spend more time pruning out what they don’t need.

But, if you are already maintaining a register and are planning to share it for others to use, you can increase its utility by sharing more information about each entry in the list.

Open UPRNs, a worked example

The UK should have an openly licensed address register. At the ODI we’ve long argued for the need for one. But we don’t have it yet.

We do have a partial subset of our national address register available under an open licence, in the form of the OS Open UPRNs product. It contains just the UPRN identifier and some spatial coordinates. Through the information in the related Open Identifiers product, we can also uncover some relationships between UPRNs and other spatial objects and administrative areas.

Drawing from the examples above, this means we can do things like the following (a rough sketch follows this list):

  • increase use of UPRNs as a common machine-readable identifier across datasets
  • identify a valid UPRN
  • locate them spatially on a map
  • relate those UPRNs to other things of interest, like administrative areas
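
As a rough sketch of that kind of use, assuming we’ve downloaded the OS Open UPRNs CSV and have some other dataset that also records a UPRN for each row (the file names and column names here are assumptions and will need adjusting to match the actual data):

    import pandas as pd

    # Coordinates for every open UPRN; the column names are assumptions based
    # on the published CSV and may need adjusting to match the actual file.
    uprns = pd.read_csv("osopenuprn.csv", usecols=["UPRN", "LATITUDE", "LONGITUDE"])

    # A second, hypothetical dataset that references properties by UPRN,
    # e.g. inspections, certificates or planning applications.
    inspections = pd.read_csv("inspections.csv")  # assumed to include a "UPRN" column

    # Joining on the shared identifier lets us put each record on a map.
    located = inspections.merge(uprns, on="UPRN", how="left")

    # Rows that fail to match may indicate invalid (or retired) UPRNs.
    unmatched = located[located["LATITUDE"].isna()]
    print(f"{len(unmatched)} records could not be matched to an open UPRN")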

With a bit of extra data engineering and analysis, e.g. to look for variations across versions of the dataset, we can also perhaps work out a rough date for when a UPRN was added to the list.

This is more than we could do before, which is great.

But there’s clearly much, much more we still can’t do:

  • filter out historical UPRNs
  • filter out UPRNs of different types
  • map between addresses (the names for those places) and the identifiers
  • understand the current status of a UPRN
  • aggregate and report on them using different categories
  • help people by building services that use the names (addresses) they’re familiar with
  • …etc, etc

We won’t be able to do those things until we have a fully open address register. But, until then, even including a handful of additional attributes (like a status code!) would clearly unlock more value.

I’ve previously argued that introducing a bit of product thinking might help to bring some focus to the decisions made about how data is published. And I still stand by much of that. But we need to be able to evaluate whether those product design decisions are achieving the intended effect.

Why is change discovery important for open data?

Change discovery is the process of identifying changes to a resource. For example, that a document has been updated. Or, in the case of a dataset, whether some part of the data has been amended, e.g. to add data, fill in missing values, or correct existing data. If we can identify that changes have been made to a dataset, then we can update our locally cached copies, re-run analyses or generate new, enriched versions of the original.

Any developer who is building more than a disposable prototype will be looking for information about the ongoing stability and change frequency of a dataset. Typical questions might be:

  • How often will a dataset get routinely updated and republished?
  • What types of data updates are anticipated? E.g. are only new records added, or might data be amended and removed?
  • How will the dataset, or parts of it be version controlled?
  • How will changes to the dataset, or parts of it (e.g. individual rows or objects), be flagged?
  • How will planned and unplanned updates and changes be communicated to users of the dataset?
  • How will data updates be published, e.g. will there be a means of monitoring for or accepting incremental updates, or just refreshed data downloads?
  • Are large scale changes to the data model expected, and if so over what timescale?
  • Are changes to the technical infrastructure planned, and if so over what timescale?
  • How will planned (and unplanned) service downtime, e.g. for upgrades, be notified and reported?

These questions span a range of levels: from changes to individual elements of a dataset, through to the system by which it is delivered. These changes will happen at different frequencies and will be communicated in different ways.

Some types of change discovery can be done after the fact, e.g. by comparing two versions of a dataset. But in practice this is an inefficient way to synchronise and share data, as the consumer needs to reconstruct a series of edits and changes that have already been applied by the publisher of the data. To efficiently publish and distribute data we need to be able to understand when changes have happened.
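
As a minimal sketch of that after-the-fact approach, assuming we have two CSV snapshots of the same dataset that share a unique identifier column (the file names and the "id" column are hypothetical):

    import csv

    def load_snapshot(path, key="id"):
        # Index each row of a CSV snapshot by its identifier column.
        with open(path, newline="") as f:
            return {row[key]: row for row in csv.DictReader(f)}

    old = load_snapshot("register-2023-01.csv")
    new = load_snapshot("register-2023-02.csv")

    # Reconstruct the edits the publisher has already applied.
    added = new.keys() - old.keys()
    removed = old.keys() - new.keys()
    changed = {k for k in old.keys() & new.keys() if old[k] != new[k]}

    print(f"{len(added)} added, {len(removed)} removed, {len(changed)} changed")

It works, but the consumer is doing work (and re-downloading data) to rediscover changes that the publisher already knew about.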

Some types of changes, e.g. to data models and formats, will just break downstream systems if not properly advertised in advance. So it’s even more important to consider the impacts of these types of change.

A robust data infrastructure will include an appropriate change notification system for different levels of the system. Some of these will be automated. Some will be part of the process of supporting end users. For example:

  • changes to a row in a dataset might be flagged with a timestamp and a change notice
  • API responses might indicate the version of the object being retrieved
  • dataset metadata might include an indication of the planned frequency of publication and a timestamp for when the dataset was last modified (see the sketch after this list)
  • a data portal might include a calendar indicating when key datasets will be updated or a feed of recently updated or changed datasets
  • changes to the data model and the API used to deliver a dataset might be announced and discussed via a developer support forum
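
At the dataset level, the metadata example might be as simple as a couple of extra fields, loosely modelled on the DCAT vocabulary’s modification date and update frequency. A rough sketch, with hypothetical field names:

    from datetime import datetime, timezone

    # Hypothetical dataset-level metadata; the field names are illustrative,
    # loosely echoing DCAT's dct:modified and dct:accrualPeriodicity.
    dataset_metadata = {
        "title": "Example register of widgets",
        "modified": "2024-05-01T09:30:00+00:00",
        "update_frequency": "monthly",
    }

    # A consumer can compare the published modification date with the date of
    # their local copy to decide whether a refresh is needed.
    local_copy_date = datetime(2024, 4, 1, tzinfo=timezone.utc)
    published = datetime.fromisoformat(dataset_metadata["modified"])

    if published > local_copy_date:
        print("Local copy is stale; fetch the latest version")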

These might be implemented as technical features of the platform. But they might also be as simple as an email to users, or a public tweet.

Versioning of data can also help data publishers improve the scalability of their infrastructure and reduce the costs of data publishing. For example, by adding features to data portals that let data users:

  • make API calls that will only return responses if data has been updated since the user last requested it, e.g. using HTTP Conditional GET. This can reduce bandwidth and load on the publisher by encouraging local caching of data (a sketch follows this list)
  • use a checksum and/or timestamps to detect whether bulk downloads have changed to reduce bandwidth
  • subscribe to machine-readable feeds of dataset-level changes, to avoid the need for users to repeatedly re-download large datasets
  • subscribe to machine-readable feeds of new datasets, to facilitate mirroring of data across systems
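
As a minimal sketch of the first two of those from the consumer’s side, using an HTTP conditional GET plus a checksum of the downloaded file (the URL and file names are placeholders):

    import hashlib
    import requests

    DATA_URL = "https://example.org/dataset.csv"  # placeholder URL

    # Conditional GET: send the ETag saved from the last download. If nothing
    # has changed the server can reply with 304 Not Modified instead of the
    # full file.
    saved_etag = open("dataset.etag").read().strip()
    response = requests.get(DATA_URL, headers={"If-None-Match": saved_etag})

    if response.status_code == 304:
        print("No change since last download")
    else:
        with open("dataset.csv", "wb") as f:
            f.write(response.content)
        if "ETag" in response.headers:
            with open("dataset.etag", "w") as f:
                f.write(response.headers["ETag"])

        # A checksum gives another way to detect whether a bulk download has
        # changed, e.g. when comparing copies mirrored across platforms.
        digest = hashlib.sha256(response.content).hexdigest()
        print(f"Downloaded new version, sha256: {digest}")

The same idea works with Last-Modified and If-Modified-Since headers, or with a checksum published alongside the bulk download.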

Supporting change notification and discovery, even if it’s just through documentation rather than more automated means, is an important part of engineering any good data platform.

I think it’s particularly important for open data (and other data that is liberally licensed) because these datasets are frequently copied, distributed and republished across different platforms. The ability to distribute a dataset, in different formats or with improvements and corrections, is one of the key freedoms that an open licence provides.

The downside to secondary publishing is that we end up with multiple copies of a dataset, some or all of which might be out of date, or have diverged from the original at different points in time.

Without robust approaches to provenance, change control and discovery, we run the risk of that data becoming out of date and leading to poor analyses and decision making. Multiple copies of the same dataset, while increasing ease of use, also increase friction by requiring users to find the original, authoritative data among all the copies, or to try to figure out whether the copy available in their preferred platform is completely up to date with the original.

Documentation and linking to original sources can help mitigate those problems. But automating change notifications, to allow copies of datasets to be easily synchronised between platforms at the point they are updated, is also important. I’ve not seen a lot of recent work on documenting these as best practices. I think there are still some gaps in the standards landscape around data platforms. So I’d be interested to hear of examples.

In the meantime, if you’re building a data platform, think about how you can enable users to more efficiently and automatically consume updated data.

And if you’re republishing primary data in other platforms, make sure you’re including detailed information and documentation about how and when you last refreshed the dataset. Ideally your copies will update automatically as the source changes. Linking to the open source code you ran to make the secondary copy will allow others to repeat that process if they need an updated version faster than you plan to produce one.