A further pattern which I noticed recently is that both Wikidata and OSM provide tools and documentation that help contributors and data users explore the schema that shapes the data.
Both projects have a core data model around which their communities are building and iterating on a more focused domain model. This approach of providing tools for the community to discuss, evolve and revise a schema is what we called the Shared Canvas pattern in the ODI guidebook.
But to successfully apply the Shared Canvas pattern, you also need to keep the community up to date about your Evolving Schema. To do that you need some way to communicate which properties or tags are in use, and how. OSM and Wikidata both provide tools to support that.
In OSM this role is filled by TagInfo. It can provide you with a break down of what type of feature the tag is used on, the range of values, combinations with other tags and some idea of its geographic usage. Tag uses varies by geographic community in OSM. Here’s the information about the building tag.
In Wikidata this tooling is provided by a series of reports that are available from the Discussion page for an individual property. This includes information about how often it is used and pointers to examples of frequent and recent uses. Here’s the information about the name property.
Both tools provide useful insight into how different aspects of a schema are being adopted and uses. They can help guide not just the discussion around the schema (“is this tag in use?”, but also the process of collecting data (“which tags should I use here”) and using the data (“what tags might I find, or query for?”).
Any project that adopts a Shared Canvas approach is likely to need to implement this type of tooling. Lets call it the “Schema explorer” pattern for now.
I’ll leave documenting it further for another post, or a contribution to the guidebook.
Schema explorers for open standards and open data
This type of tooling would be useful in other contexts.
Anywhere that we’re trying to drive adoption of a common data standard, it would be helpful to be able to assess how well used different parts of that schema are by analysing the available data.
That’s not something I’ve regularly seen produced. In our survey of decentralised publishing initiatives at the ODI we found common types of documentation, data validators and other tools to support use of data, like useful aggregations. But no tooling to help explore how well it is adopted. Or to help data users understand the shape of the available data prior to aggregating it.
When i was working on the OpenActive standard, I found the data profiles that Dan Winchester produced really helpful. They provide useful insight into which parts of a standard different publishers were actually using.
I was thinking about this again recently whilst doing some work for Full Fact, exploring the ClaimReview markup in Schema.org. It would be great to see which features different fact checkers are actually using. In fact that would be true of many different aspects of Schema.org.
This type of reporting is hard to do in a distributed environment without aggregating all the data. But Google are regularly harvesting some of this data, so it feels like it would be relatively easy for them to provide insights like this if they chose.
An alternative is the Schema.org Table Corpus which provides exports of Schema.org data contained in the Common Crawl dataset. But more work is likely needed to generate some useful views over the data, and it is less frequently updated.
Outside of Schema.org, schema explorers reporting on the contents of open datasets, would help inform a range of standards work. For example, it could help inform decisions about how to iterate on a schema, guide the production of documentation, and help improve the design of validators and other tools.
If you’ve seen examples of this type of tooling, then I’d be interested to see some links.
This is a post about building tools to validate data. I wanted to share a few reflections based on helping to design and build a few different public and private tools, as well as my experience as a user.
I like using data validators to check my homework. I’ve been using a few different recently which has prompted me to think a bit about their role and the designs that go into their design.
The tl;dr version of this post is along the lines of “Think about user needs when designing tools. But also be conscious of the role those tools play in their broader ecosystem“.
What is a data validator?
A data validator is a tool that checks the correctness and quality of data. This means doing the following categories of checks:
Checking to determine whether there are any mistakes in how it is formatted. E.g. is the syntax of a CSV, XML or JSON file correct?
Confirming if all of the required fields, necessary to make the data useful, been provided?
Testing that individual values have been correctly specified. E.g. if the field contains a number then is the provided value actually a number rather than a text?
Performing more semantic checks such as, if this is a dataset about UK planning applications, then are the coordinates actually in the UK? Or is the start date for the application before the end date?
Confirming that provided data is of a useful quality, e.g. are geographic coordinates of the right precision? Or do any links to other resources actually work?
Warning about data that may or may not be included. For example, prompting the user to include additional fields that may improve the utility of the data. Or asking them to consider whether any personal data included should be there
These validation rules will typically come from a range of different sources, including:
The standard or specification that defines the syntax of the data.
The standard or specification (or schema) that describes the structure and content of the data. (This might be the same as the above, or might be defined elsewhere)
Legislation, which might guide, inform or influence what data should or should not be included
The implementer of the validation tool, who may have opinions about what is considered to be correct or useful data based on their specific needs (e.g. as a direct consumer of the data) or more broadly as a contributor to a community initiative to support improvements to how data is published
Data validators are frequently web based these days. At least for smaller datasets. But both desktop and command-line tools are also regularly used in different settings. The choice of design will be informed by things like how open the data can be, the volume of data being checked, and how the validator might be integrated into a data workflow, e.g. as an automated or manual step.
Examples of different types of data validator
Here are some examples of different data validators created for different purposes and projects
The first few on the list are largely syntax checkers. They validate whether your CSV, JSON or GeoJSON files are correctly structured.
The others go further and check not just the format of the data, but also its validity against a schema. That schema is defined in a standard intended to support consistent publication of data across a community. The goal of these tools is to improve quality of data for a wide range of potential users, by guiding publishers about how to publish data well.
The last three examples are validators that are designed to help publishers meet the needs of a specific application or consumer of the data. They’re an actionable way to test data against the requirements of a specific user.
Validators also vary in other ways.
For example, the 360Giving, OpenContracting and Rich Results Test validators all accept a range of different data formats. They validate different syntaxes against a common schema. Others are built around a single specific format
Some tools provide a low-level view of the results, e.g. a list of errors and warnings with reference to specific sections of the data. Others provide a high-level interface, such as a preview of what the data looks like on a map or as it would be displayed in a specific application. This type of visual presentation can help catch other types of errors and more directly confirm how data might be interpreted, whilst also making the tool useful to a wider audience.
What do we mean by data being valid?
For simple syntax checking identifying whether something is valid is straight-forward. Your JSON is either well-formed or its not.
Validators that are designed around specific applications also usually have a clear marker of what is “valid”: can the application parse, interpret and display the data as expected? Does my twitter card look correct?
In other examples, the notion of “valid” is harder to define. They may be some basic rules around what a minimum viable dataset looks like. If so, these are easier to identify and classify as errors.
But there is often variability within a schema. E.g. optional elements. This means that validators need to offer more than just a binary decision and instead offer warnings, suggestions and feedback.
For example, when thinking about the design of the OpenActive validator we discussed the need to go beyond simple validation and provide feedback and prompts along the lines of “you haven’t provided a price, is the event free or chargeable“? Or “you haven’t provided an image for this event, this is legal but evidence shows that participants are more likely to sign-up to events where they can see what participation looks like.”
To put this differently: data quality depends on how you’re planning to use the data. It’s not an absolute. If you’re not validating data for a specific application or purpose, then you tool should be prompting users to think about the choices they are making around how data is being shared.
In the context of sharing and publishing open data, this moves the role of a data validator beyond simplify checking correctness, and towards identifying sources of friction that will exist between publisher and consumer.
Beyond the formal conformance criteria defined in a specification, deciding whether something is valid or not, is really just a marker for how much extra work is required by a consumer. And in some cases the publisher may not have the time, budget or resources to invest in reducing that burden.
Things to think about when designing a validator
To wrap up this post, here are some things to think about when designing a data validator
Who are your users? What level of technical skill and understanding are you designing for?
How will the validator be used or integrated into the users workflow? A tool for integration into a continuous integration environment will need to operate differently to something used to do acceptance checking before data is published. Maybe you need several different tools?
How much knowledge of the relevant standards or specification will a user need before they can use the tool? Should the tool facilitate learning and exploration about how to structure data, or is just checking existing data?
How can you provide good, clear feedback? Tools that rely on applying machine-readable schemas like JSON Schema can often have cryptic messages as they rely on an underlying library to report errors
How can you provide guidance and feedback that will help users decide how to improve data? Is the feedback actionable? (For example in CSVLint we figured out that when reporting that a user had an incorrect mime-type for their CSV file we could identify if it was served from AWS and provide a clear suggestion about how to fix the issue)
Would showing the data, as a preview or within a mocked up view, help surface problems or build confidence in how data is published?
Are the documentation about how to publish data and the reports from your validator consistent? If not, then fix the documentation or explain the limits of the validator
Finally, if you’re designing a validator for a specific application, then don’t mark as “invalid” anything that you can simply ignore. Don’t force the ecosystem to converge on your preferences.
You may not be interested in the full scope of a standard, but different applications and users will have different needs.
Data quality is a dialogue between publishers and users of data. One that will evolve over time as tools, applications, norms and standards become adopted across a data ecosystem. A data validator is an important building block that can facilitate that discussion.
Most of the last few years has been very focused on research and advisory work. I’ve enjoyed all of that. But I’ve been missing the rewards that come from building, maintaining and growing things over the longer term.
I really enjoy consulting and freelance work in general. It’s a fantastic opportunity to learn about many different sectors and work with a range of different teams and organisations. But you can only go so far: a good engagement is really about providing insight, building capacity and the moving on to the next thing.
In my earlier career I spent more time making and maintaining stuff which has its own rewards. So I’ve been looking for a role that would allow me to do that. I’ve turned down some and got knocked back from others. So it goes.
I’m very happy to say that I’ve found a new part-time role. And it’s with a project that I’ve already been involved with for some time. Having stepped down as a trustee, as of next month I will be taking on the role of CTO of Energy Sparks on a part-time basis.
I helped to start the project a number of years ago. And have continued to provide some advice and support as it grew into a charity. Recently I’ve been helping the team build out some new features. Which is what prompted me to learn more about the UK’s smart meter data ecosystem.
The product is at a stage where it needs some more technical leadership and support. There are some interesting data engineering and technical challenges involved in scaling the system as it continues to roll-out across the UK. I’ll be enjoying digging into that.
What really excites me though, is the opportunity that Energy Sparks provides to help educate children around climate change and energy efficiency. And, more broadly, data literacy in general. The insights it provides to staff are already unlocking cost savings, but it’s this wider impact and use in the classroom, which I think is really key.
There’s also a lot of interesting work happening in the energy sector in the UK right now, which should lead to increased access to, and better use of data. It will be good to be a small part of that.
For now, this is will be a part-time role. I’ll be continuing to do freelance work alongside my other duties. Recently I’ve been helping Full Fact thinking through structured data around fact checks and a team at the World Bank to think about how to develop open standards for risk data.
Get in touch if you need some help with other projects or want to learn more about what we’re doing in Energy Sparks.
OpenActive is a community-led initiative in the sport and physical activity sector in England. It’s goal is to help to get people healthier and more active by making its easier for people to find information about activities and events happening in their area. Publishing open data about opportunities to be active is a key part of its approach.
The initiative has been running for several years, funded by Sport England. Its supported by a team at the Open Data Institute who are working in close collaboration with a range of organisations across the sector.
If you’re interested in more of the details then I’d encourage you to dig into those posts as well as the developer portal.
What I wanted to cover in this blog post are some reflections about one of the key decisions we made early in the standards workstream. This was to base the core data model on Schema.org.
Why did we end up basing the standards on Schema.org?
We started the standards work in OpenActive by doing a proper scoping exercise. This helped us to understand the potential benefits of introducing a standard, and the requirements that would inform its development.
As part of our initial research, we did a review of what standards existed in the sector. We found very little that matched our needs. The few APIs that were provided were quite limited and proprietary and there was little consistency around how data was organised.
It was clear that some standardisation would be beneficial and that there was little in the way of sector-specific work to build on. It was also clear that we’d need a range of different types of standard. Data formats and APIs to support exchange of data, a common data model to help organise data and a taxonomy to help describe different types of activity.
For the data model, it was clear that the core domain model would need to be able to describe events. E.g. that a yoga class takes place in a specific gym at regular times. This would support basic discovery use cases. Where can I go and exercise today? What classes are happening near me?
As part of our review of existing standards, we found that Schema.org already provided this core model along with some additional vocabulary that would help us categorise and describe both the events and locations. For example, whether an Event was free, its capacity and information about the organiser.
For many people Schema.org may be more synonymous with publishing data for use by search engines. But as a project its goal is much broader, it is “a collaborative, community activity with a mission to create, maintain, and promote schemas for structured data“.
The data model covers much more than what search engines are consuming. Some communities are instead using the project as a means to collaborate on developing better vocabulary for sharing data between other applications. As well as aligning existing vocabularies under a common umbrella.
New standards should ideally be based on existing standards. We knew we were going to be building the OpenActive technical standards around a “stack” of standards that included HTTP, JSON and JSON-LD. So it was a natural step to base our initial domain model on aspects of Schema.org.
What were the benefits?
An early benefit of this approach is that we could immediately focus our roadmap on exploring extensions to the Schema.org data model that would add value to the physical activity sector.
Our initial community sessions around the OpenActive standards involved demonstrating how well the existing Schema.org model fitted the core requirements. And exploring where additional work was needed.
This meant we skipped any wrangling around how to describe events and instead focused on what we wanted to say about them. Important early questions focused on what information would potential participants find helpful in understanding whether this is specific activity or event is something that they might want to try? For example, details like: what activities they involved and for what level of competency?
We were able to identify those elements of the core Schema.org model supported out use cases and then documented some extensions in our own specifications. The extensions and clarifications were important for the OpenActive community, but not necessarily relevant in the broader context in which Schema.org is being used. We wanted to build some agreement and usage in our community first, before suggesting changes to Schema.org.
As well as giving us an initial head start, the decision also helped us address new requirements much quicker.
As we uncovered further requirements that mean expanding our data model, we were always able to initially look to see if existing Schema.org terms covered what we needed. We began using it as a kind of “dictionary” that we could draw on when needed.
Where existing parts of the Schema.org model fitted out needs, it was gratifying to be able to rapidly address the new requirements by documenting patterns for how to use them. Data publishers were also doing the same thing. Having a common dictionary of terms gave freedom to experiment with new features, drawing on terms defined in a public schema, before the community had discussed and agreed how to implement those patterns more broadly.
Every standards project has its own cadence. The speed of development and adoption are tied up with a whole range of different factors that go well beyond how quickly you can reach consensus around a specification.
But I think the decision to use Schema.org definitely accelerated progress and helped us more quickly deliver a data model that covered the core requirements for the sector.
Where were the challenges?
The approach wasn’t without its challenges, however.
Firstly, for a sector that was new to building open standards, choosing to based parts of that new standard on one project and then defining extensions created some confusion. Some communities seem more comfortable with piecing together vocabularies and taxonomies, but that is not true more widely.
Developers found it tricky to refer to both specifications, to explore their options for publishing different types of data. So we ended up expanding our documentation to cover all of the Schema.org terms we recommended or suggested people use, instead of focusing more on our own extensions.
Secondly, we also initially adopted the same flexible, non-prescriptive approach to data publishing that Schema.org uses. It does not define strict conformance critiera and there are often different options for how the same data might be organised depending on the level of detail a publisher has available. If Schema.org were too restrictive then it would limit how well the model could be used by different communities. It also leaves space for usage patterns to emerge.
In OpenActive we recognised that the physical activity sector had a wide range of capabilities when it came to publishing structured data. And different organisations organised data in different ways. We adopted the same less prescriptive approach to publishing with the goal of reducing the barriers to getting more data published. Essentially asking publishers to structure data as best they could within the options available.
In the end this wasn’t the right decision.
Too much flexibility made it harder for implementers to understand what data would be most useful to publish. And how to do it well. Many publishers were building new services to expose the data so they needed a clearer specification for their development teams.
We addressed this in Version 2 of the specifications by considerably tightening up the requirements. We defined which terms were required or just recommended (and why). And added cardinalities and legal values for terms. Our specification became a more formal, extended profile of Schema.org. This also allowed us build a data validator that is now being released and maintained alongside the specifications.
Our third challenge was about process. In a few cases we identified changes that we felt would sit more naturally within Schema.org than our own extensions. For example, they were improvements and clarifications around the core Event model that would be useful more widely. So we submitted those as proposed changes and clarifications.
Given that Schema.org has a very open process, and the wide range of people active in discussing issues and proposals, it was sometimes hard to know how decisions would get made. We had good support from Dan Brickley and others stewarding the project, but without knowing much about who is commenting on your proposal, their background or their own uses cases, it was tricky to know how much time to spend on handling this feedback. Or when we could confidently say that we had achieved some level of consensus.
We managed to successfully navigate this, by engaging as we would within any open community: working transparently and collegiately, and being willing to reflect on and incorporate feedback regardless of its source.
The final challenge was about assessing the level of use of different parts of the Schema.org model. If we wanted to propose a change in how a term was documented or suggest a revision to its expected values, it is difficult to assess the potential impact of that change. There’s no easy way to see which applications might be relying on specific parts of the model. Or how many people are publishing data that uses different terms.
The Schema.org documentation does flag terms that are currently under discussion or evaluation as “pending”. But outside of this its difficult to understand more about how the model is being used in practice. To do that you need to engage with a user community, or find some metrics about deployment.
We handled this by engaging with the open process of discussion, sharing our own planned usage to inform the discussion. And, where we felt that Schema.org didn’t fit with the direction we needed, we were happy to look to other standards that better filled those gaps. For example we chose to use SKOS to help us organise and structure a taxonomy of physical activities rather than using some of the similar vocabulary that Schema.org provides.
Choosing to draw on Schema.org as a source of part of our domain model didn’t mean that we felt tied to using only what it provides.
Overall I’m happy that we made the right decision. The benefits definitely outweighed the challenges.
But navigating those challenges was easier because those of us leading the standards work were comfortable both with working in the open and in combining different standards to achieve a specific goal. Helping to build more competency in this area is one goal of the ODI standards guidebook.
If you’re involved in a project to build a common data model as part of a community project to publish data, then I’d recommend looking at whether based some or all of that model around Schema.org might help kickstart your technical work.
If you do that, my personal advice would be:
Remember that Schema.org isn’t the right home for every data model. Depending on your requirements, the complexity and the potential uses for the data, you may be better off designing and iterating on your model separately. Similarly, don’t expect that every change or extension you might want to make will necessarily be accepted into the core model
Don’t assume that search engines will start using your data, just because you’re using Schema.org as a basis for publishing, or even if you successfully submit change proposals. It’s not a means of driving adoption and use of your data or preferred model
Plan to write your own specifications and documentation that describe how your application or community expects data to be published. You’ll need to add more conformance criteria and document useful patterns that go beyond that Schema.org is providing
Work out how you will engage with your community. To make it easier to refine your specifications, discuss extensions and gather implementation feedback, you’ll still need a dedicated forum or channel for your community to collaborate. Schema.org doesn’t really provide a home for that. You might have your own github project or setup a W3C community group.
Build your own tooling. Schema.org are improving their own tooling, but you’ll likely need your own validation tools that are tailored to your community and your specifications
Contribute to the Schema.org project where you can. If you have feedback, proposed changes or revisions then submit these to the project. Its through a community approach that we improve the model for everyone. Just be aware that there are likely to be a whole range of different use cases that may be different to your own. Your proposals may need to go through several revisions before being accepted. Proposals that draw on real-world experience or are tied to actual applications will likely carry more weight than general opinions about the “right” way to design something
Be prepared to diverge where necessary. As I’ve explained above, sometimes the right option is to propose changes to Schema.org. And sometimes you may need to be ready to draw on other standards or approaches.
Disclaimer: this blog post is about my understanding of the UK’s smart meter data ecosystem and contains some opinions about how it might evolve. These do not in any way reflect those of Energy Sparks of which I am a trustee.
This blog post is an introduction to the UK’s smart meter data ecosystem. It sketches out some of the key pieces of data infrastructure with some observations around how the overall ecosystem is evolving.
It’s a large, complex system so this post will only touch on the main elements. Pointers to more detail are included along the way.
Data about your home or business energy usage was collected by someone coming to read the actual numbers displayed on the front of your meter. And in some cases that’s still how the data is collected. It’s just that today you might be entering those readings into a mobile or web application provided by your supplier. In between those readings, your supplier will be estimating your usage.
This situation improved with the introduction of AMR (“Automated Meter Reading”) meters which can connect via radio to an energy supplier. The supplier can then read your meter automatically, to get basic information on your usage. After receiving a request the meter can broadcast the data via radio signal. These meters are often only installed in commercial properties.
Smart meters are a step up from AMR meters. They connect via a Wide Area Network (WAN) rather than radio, support two way communications and provide more detailed data collection. This means that when you have a smart meter your energy supplier can send messages to the meter, as well as taking readings from it. These messages can include updated tariffs (e.g. as you switch supplier or if you are on a dynamic tariff) or a notification to say you’ve topped up your meter, etc.
The improved connectivity and functionality means that readings can be collected more frequently and are much more detailed. Half hourly usage data is the standard. A smart meter can typically store around 13 months of half-hourly usage data.
The first generation of smart meters are known as SMETS-1 meters. The latest meters are SMETS-2.
From a consumer point of view, services like Find My Supplier will allow you to find your MPRN and energy suppliers.
Connectivity and devices in the home
If you have a smart meter installed then your meters might talk directly to the WAN, or access it via a separate controller that provides the necessary connectivity.
But within the home, devices will talk to each other using Zigbee, which is a low power internet of things protocol. Together they form what is often referred to as the “Home Area Network” (HAN).
It’s via the home network that your “In Home Display” (IHD) can show your current and historical energy usage as it can connect to the meter and access the data it stores. Your electricity usage is broadcast to connected devices every 10 seconds, while gas usage is broadcast every 30 minutes.
You IHD can show your energy consumption in various ways, including how much it is costing you. This relies on your energy supplier sending your latest tariff information to your meter.
As this article by Bulb highlights, the provision of an IHD and its basic features is required by law. Research showed that IHDs were more accessible and nudged people towards being more conscious of their energy usage. The high-frequency updates from the meter to connected devices makes it easier, for example, for you to identify which devices or uses contribute most to your bill.
Your energy supplier might provide other apps and services that provide you with insights, via the data collected via the WAN.
But you can also connect other devices into the home network provided by your smart meter (or data controller). One example is a newer category of IHD called a “Consumer Access Device” (CAD), e.g. the Glow.
These devices connect via Zigbee to your meter and via Wifi to a third-party service, where it will send your meter readings. For the Glow device, that service is operated by Hildebrand.
These third party services can then provide you with access to your energy usage data via mobile or web applications. Or even via API. Otherwise as a consumer you need to access data via whatever methods your energy supplier supports.
The smart meter network infrastructure
SMETS-1 meters connected to a variety of different networks. This meant that if you switched suppliers then they frequently couldn’t access your meter because it was on a different network. So meters needed to be replaced. And, even if they were on the same network, then differences in technical infrastructure meant the meters might lose functionality..
SMETS-2 meters don’t have this issue as they all connect via a shared Wide Area Network (WAN). There are two of these covering the north and south of the country.
While SMETS-2 meters are better than previous models, they still have all of the issues of any Internet of Things device: problems with connectivity in rural areas, need for power, varied performance based on manufacturer, etc.
Some SMETS-1 meters are also now being connected to the WAN.
Who operates the infrastructure?
The Data Communication Company is a state-licensed monopoly that operates the entire UK smart meter network infrastructure. It’s a wholly-owned subsidiary of Capita. Their current licence runs until 2025.
DCC subcontracted provision of the WAN to support connectivity of smart meters to two regional providers.In the North of England and Scotland that provider is Arqiva. In the rest of England and Wales it is Telefonica UK (who own O2).
All of the messages that go to and from the meters via the WAN go via DCC’s technical infrastructure.
It’s mandatory for smart meters to now be installed in domestic and smaller commercial properties in the UK. Companies can install SMETS-1 or SMETS-2 meters, but the rules were changed recently so only newer meters count towards their individual targets. And energy companies can get fined if they don’t install them quickly enough.
Consumers are being encouraged to have smart meters fitted in existing homes, as meters are replaced, to provide them with more information on their usage and access to better tariffs such as those that offer dynamic time of day pricing., etc.
But there are also concerns around privacy and fears of energy supplies being remotely disconnected, which are making people reluctant to switch when given the choice. Trust is clearly an important part of achieving a successful rollout.
The future will hold much more fine-grained data about energy usage across the homes and businesses in the UK. But in the short-term there’s likely to be a continued mix of different meter types (dumb, AMR and smart) meaning that domestic and non-domestic usage will have differences in the quality and coverage of data due to differences in how smart meters are being rolled out.
Smart meters will give consumers greater choice in tariffs because the infrastructure can better deal with dynamic pricing. It will help to shift to a greener more efficient energy network because there is better data to help manage the network.
The code sets out the roles and responsibilities of the various actors who have access to the network. That includes the infrastructure operators (e.g. the organisations looking after the power lines and cables) as well as the energy companies (e.g. those who are generating the energy) and the energy suppliers (e.g. the organisations selling you the energy).
The focus of the code is on those core actors. But there is an additional category of “Other Providers”. This is basically a miscellaneous group of other organisations not directly involved in provision of energy as a utility, but may have or require access to the data infrastructure.
These other providers include organisations that:
provide technology to energy companies who need to be able to design, test and build software against the smart meter network
that offer services like switching and product recommendations
that access the network on behalf of consumers allowing them to directly access usage data in the home using devices, e.g. Hildebrand and its Glow device
provide other additional third-party services. This includes companies like Hildebrand and N3RGY that are providing value-added APIs over the core network
There are also substantial annual costs for access to the network. This helps to make the infrastructure sustainable, with all users contributing to it.
Data ecosystem map
As a summary, here’s the key points:
your in-home devices send and receive messages and data via a the smart meter or controller installed in your home, or business property
your in-home device might also be sending your data to other services, with your consent
messages to and from your meter are sent via a secure network operated by the DCC
the DCC provide APIs that allow authorised organisations to send and receive messages from that data infrastructure
the DCC doesn’t store any of the meter readings, but do collect metadata about the traffic over that network
organisation who have access to the infrastructure may store and use the data they can access, but generally need consent from users for detailed meter data
the level and type of access, e.g. what messages can be sent and received, may differ across organisations
your energy suppliers uses the data they retrieve from the DCC to generate your bills, provide you with services, optimise the system, etc
the UK government has licensed the DCC to operate that national data infrastructure, with Ofgem regulating the system
At a high-level, the UK smart meter system is like a big federated database: the individual meters store and submit data, with access to that database being governed by the DCC. The authorised users of that network build and maintain their own local caches of data as required to support their businesses and customers.
The evolving ecosystem
This is a big complex piece of national data infrastructure. This makes it interesting to unpick as an example of real-world decisions around the design and governance of data access.
It’s also interesting as the ecosystem is evolving.
Changing role of the DCC
The DCC have recently published a paper called “Data for Good” which sets out their intention to a “system data exchange” (you should read that as “system data”exchange). This means providing access to the data they hold about meters and the messages sent to and from them. (There’s a list of these message types in a SEC code appendix).
The paper suggests that increased access to that data could be used in a variety of beneficial ways. This includes helping people in fuel poverty, or improving management of the energy network.
The DCC is also required to improve efficiency and costs for operating the network to reduce burden on the organisations paying to use the infrastructure. This includes extending use of the network into other areas. For example to water meters or remote healthcare (see note at end of page 13).
Any changes to what data is provided, or how the network is used will require changes to the licence and some negotiation with Ofgem. As the licence is due to be renewed in 2025, then this might be laying groundwork for a revised licence to operate.
In addition to a potentially changing role for the DCC, the other area in which the ecosystem is growing is via “Other Providers” that are becoming data intermediaries.
The infrastructure and financial costs of meeting the technical, security and audit requirements required for direct access to the DCC network creates a high barrier for third-parties wanting to provide additional services that use the data.
There are a small but growing number of organisations, including Hildebrand, N3RGY, Smart Pear and Utiligroup who see an opportunity both to lower this barrier by providing value-added services over the DCC infrastructure. For example, simple JSON based APIs that simplify access to meter data.
Coupled with access to sandbox environments to support prototyping, this provides a simpler and cheaper API with which to integrate. Security remains important but the threat profiles and risks are different as API users have no direct access to the underlying infrastructure and only read-only access to data.
To comply with the governance of the existing system, the downstream user still needs to ensure they have appropriate consent to access data. And they need to be ready to provide evidence if the intermediary is audited.
The APIs offered by these new intermediaries are commercial services: the businesses are looking to do more than just cover their costs and will be hoping to generate significant margin through what is basically a reseller model.
It’s worth noting that access to AMR meter data is also typically via commercial services, at least for non-domestic meters. The price per meter for data from smart meters currently seems lower, perhaps because it’s relying on a more standard, shared underlying data infrastructure.
As the number of smart meters grows I expect access to a cheaper and more modern API layer will become increasingly interesting for a range of existing and new products and services.
Lessons from Open Banking
From my perspective the major barrier to more innovative use of smart meter data is the existing data infrastructure. The DCC obviously recognises the difficulty of integration and other organisations are seeing potential for new revenue streams by becoming data intermediaries.
And needless to say, all of these new intermediaries have their own business models and bespoke APIs. Ultimately, while they may end up competing in different sectors or markets, or over quality of service, they’re all relying on the same underlying data and infrastructure.
In the finance sector, Open Banking has already demonstrated that a standardised set of APIs, licensing and approach to managing access and consent can help to drive innovation in a way that is good for consumers.
There are clear parallels to be drawn between Open Banking, which increased access to banking data, and how access to smart meter data might be increased. It’s a very similar type of data: highly personal, transactional records. And can be used in very similar ways, e.g. account switching.
The key difference is that there’s no single source of banking transactions, so regulation was required to ensure that all the major banks adopted the standard. Smart meter data is already flowing through a single state-licensed monopoly.
Perhaps if the role of the DCC is changing, then they could also provide a simpler standardised API to access the data? Ofgem and DCC could work with the market to define this API as happened with Open Banking. And by reducing the number of intermediaries it may help to increase trust in how data is being accessed, used and shared?
If there is a reluctance to extend DCC’s role in this direction then an alternative step would be to recognise the role and existence of these new types of intermediary with the Smart Energy Code. That would allow their license to use the network to include agreement to offer a common, core standard API, common data licensing terms and approach for collection and management of consent. Again, Ofgem, DCC and others could work with the market to define that API.
For me either of these approaches are the most obvious ways to carry the lessons and models from Open Banking into the energy sector. There are clearly many more aspects of the energy data ecosystem that might benefit from improved access to data, which is where initiatives like Icebreaker One are focused. But starting with what will become a fundamental part of the national data infrastructure seems like an obvious first step to me.
The other angle that Open Banking tackled was creating better access to data about banking products. The energy sector needs this too, as there’s no easy way to access data on energy supplier tariffs and products.
This blog post is basically a mood board showing some examples of how people are mapping data ecosystems. I wanted to record a few examples and highlight some of the design decisions that goes into creating a map.
A data ecosystem consists of data infrastructure, and the people, communities and organisations that benefit from the value created by it. A map of that data ecosystem can help illustrate how data and value is created and shared amongst those different actors.
Joseph Wilk‘s “The Flow of My Voice” is highlights the many different steps through which his voice travels before being stored and served from a YouTube channel, and transcribed for others to read.
The emphasis here is on exhaustively mapping each step, with a representation of the processing at each stage. The text notes which organisation owns the infrastructure at each stage. The intent here is to help to highlight the loss of control over data as it passes through complex interconnected infrastructures. This means a lot of detail.
Data Archeogram: mapping the datafication of work
Armelle Skatulski has produced a “Data Archeogram” that highlights the complex range of data flows and data infrastructure that are increasingly being used to monitor people in the workplace. Starting from various workplace and personal data collection tools, it rapidly expands out to show a wide variety of different systems and uses of data.
Similar to Wilk’s map this diagram is intended to help promote critical review and discussion about how this data is being accessed, used and shared. But it necessarily sacrifices detail around individual flows in an attempt to map out a much larger space. I think the use of patent diagrams to add some detail is a nice touch.
Retail and EdTech data flows
The Future of Privacy Forum recently published some simple data ecosystem maps to illustrate local and global data flows using the Retail and EdTech sectors as examples.
These maps are intended to help highlight the complexity of real world data flows, to help policy makers understand the range of systems and jurisdictions that are involved in sharing and storing personal data.
Because these maps are intended to highlight cross-border flows of data they are presented as if they were an actual map of routes between different countries and territories. This is something that is less evident in the previous examples. These diagrams aren’t showing any specific system and illustrate a typical, but simplified data flow.
They emphasise the actors and flows of different types of data in a geographical context.
Data privacy project: Surfing the web from a library computer terminal
Again, the map shows a typical rather than a real-world system. Its useful to contrast this with the first example which is much more detailed by comparison. For an educational tool, a more summarised view is better to help building understanding.
The choice of which actors are shown also reflects its intended use. It highlights web hosts, ISPs and advertising networks, but has less to say about the organisations whose websites are being used and how they might use data they collect.
This ecosystem map, which I produced for a project we did at the ODI, has a similar intended use.
It provides a summary of a typical data ecosystem we observed around some Gates Foundation funded agronomy projects. The map is intended as a discussion and educational tool to help Programme Officers reflect on the ecosystem within which their programmes are embedded.
This map uses features of Kumu to encourage exploration, providing summaries for each of the different actors in the map. This makes it more dynamic than the previous examples.
Following the methodology we were developing at the ODI it also tries to highlight different types of value exchange: not just data, but also funding, insights, code, etc. These were important inputs and outputs to these programmes.
In contrast to most of the earlier examples, this partial map of the OSM ecosystem tries to show a real-world ecosystem. It would be impossible to properly map the full OSM ecosystem so this is inevitably incomplete and increasingly out of date.
The decision about what detail to include was driven by the goals of the project. The intent was to try and illustrate some of the richness of the ecosystem whilst highlighting how a number of major commercial organisations were participants in that ecosystem. This was not evident to many people until recently.
The map mixes together broad categories of actors, e.g. “End Users” and “Contributor Community” alongside individual commercial companies and real-world applications. The level of detail is therefore varied across the map.
Governance design patterns
The final example comes from this Sage Bionetworks paper. The paper describes a number of design patterns for governing the sharing of data. It includes diagrams of some general patterns as well as real-world applications.
The diagrams shows relatively simple data flows, but they are drawn differently to some of the previous examples. Here the individual actors aren’t directly shown as the endpoints of those data flows. Instead, the data stewards, users and donors are depicted as areas on the map. This is to help emphasise where data is crossing governance boundaries and its use informed by different rules and agreements. Those agreements are also highlighted on the map.
Like the Future of Privacy ecosystem maps, the design is being used to help communicate some important aspects of the ecosystem.
Like the original guidance my feedback largely ignores considerations of infrastructure or tools. That’s quite a big topic and recommendations in those areas are unlikely to be applicable solely to reference data.
The guidance also doesn’t address issues around data sharing, such as privacy or regulatory compliance. I’m also going to gloss over that. Again, not because its not important, but because those considerations apply to sharing and publishing any form of data, not just reference data
Here’s the list of things I’d revise or add to this guidance:
The guidance should recommend that reference data be at open as possible, to allow it to be reused as broadly as possible. Reference data that doesn’t contain personal information should be published under an open licence. Licensing is important even for cross-government sharing because other parts of government might be working with private or third sector who also need to be able to use the reference data. This is the biggest omission for me.
Reference data needs to be published over the long term so that other teams can rely on it and build it into their services and workflows. When developing an approach for publishing reference data, consider what investment needs to be made for this to happen. That investment will need to cover people and infrastructure costs. If you can’t do that, then at least indicate how long you expect to be publishing this data. Transparent stewardship can build trust.
For reference data to be used, it needs to be discoverable. The guide mentions creating metadata and doing SEO on dataset pages, but doesn’t include other suggestions such as using Schema.org Dataset metadata or even just depositing metadata in data.gov.uk.
The guidance should recommend that stewardship of reference data is part of a broader data governance strategy. While you may need to identify stewards for individual datasets, governance of reference data should be part of broader data governance within the organisation. It’s not a separate activity. Implementing that wider strategy shouldn’t block making early progress to open up data, but consider reference data alongside other datasets
Forums for discussing how reference data is published should include external voices. The guidance suggests creating a forum for discussing reference data, involving people from across the organisation. But the intent is to publish data so it can be reused by others. This type of forum needs external voices too.
The guidance should recommend documenting provenance of data. It notes that reference data might be created from multiple sources, but does not encourage recording or sharing information about its provenance. That’s important context for reusers.
There is a recommendation to allow users to report errors or provide feedback on a dataset. That should be extended to include a recommendation that the data publisher makes known errors clear to other users, as well as transparency around when individual errors might be fixed. Reporting an error without visibility of the process for fixing data is frustrating
GDS might recommend an API first approach, but reference data is often used in bulk. So there should be a recommendation to have bulk access to data, not just an API. It might also be cheaper and more sustainable to share data in this way
The guidance on versioning should include record level metadata. The guidance contains quite a bit of detail around versioning of datasets. While useful, it should also include suggestions to include status codes and timestamps on individual records, to simplify integration and change monitoring. Change reporting is an important but detailed topic.
While the guidance doesn’t touch on infrastructure, I think it would be helpful for it to recommend that platforms and tools used to manage reference data are open sourced. This will help others to manage and publish their own reference data, and build alignment around how data is published.
Finally, if multiple organisations are benefiting from use of the same reference data then encouraging exploration of collaborative maintenance might help to reduce costs for maintaining data, as well as improving its quality. This can help to ensure that data infrastructure is properly supported and invested in.
For the past month I’ve been working on a small side project which I’m pleased to launch for Open Data Day 2021.
I’ve long been a fan of OpenStreetMap. I’ve contributed to the map, coordinated a local crowd-mapping project and used OSM tiles to help build web based maps. But I’ve only done a small amount of work with the actual data. Not much more than running a few Overpass API queries and playing with some of the exports available from Geofabrik.
I recently started exploring the Overpass API again to learn how to write useful queries. I wanted to see if I could craft some queries to help me contribute more effectively. For example by helping me to spot areas that might need updating. Or identify locations where I could add links to Wikidata.
There’s a quite a bit of documentation about the Overpass API and the query language it uses, which is called OverpassQL. But I didn’t find them that accessible. The documentation is more of a reference than a tutorial.
And, while, there’s quite a few example queries to find across the OSM wiki and other websites, there isn’t always a great deal of context to those examples that explain how they work or when you might use them.
So I’ve been working on two things to address what I think is a gap in helping people learn how to get more from the OpenStreetMap API.
The first is a simple tool that will take a collection of Overpass queries and build a set of HTML pages from them. It’s based on a similar tool I built for SPARQL queries a few years ago. Both are inspired by Javadoc and other code documentation tools.
The idea was to encourage the publication of collections of useful, documented queries. E.g. to be shared amongst members of a community or people working on a project. The OSM wiki can be used to share queries, but it might not always be a suitable home for this type of content.
The tool is still at quite an early stage. It’s buggy, but functional.
To test it out I’ve been working on my own collection of Overpass queries. I initially started to pull together some simple examples that illustrated a few features of the language. But then realised that I should just use the tool to write a proper tutorial. So that’s what I’ve been doing for the last week or so.
how to write queries to extract nodes, ways and relations from the OSM database using a variety of different methods
how to filtering data to extract just the features of interest
how to write spatial queries to find features based on whether they are within specific areas or are within proximity to one another
how to output data as CSV and JSON for use in other tools
Every query in the tutorial has its own page containing an embedded syntax highlighted version of the query. This makes them easier to share with others. You can click a button to load and run the query using the Overpass Turbo IDE. So you can easily view the results and tinker with the query.
I think the tutorial covers all the basic options for querying and filtering data. Many of the queries include comments that illustrate variations of the syntax, encouraging you to further explore the language.
I’ve also been compiling an Overpass QL syntax reference that provides a more concise view of some of the information in the OSM wiki. There’s a lot of advanced features (like this) which I will likely cover in a separate tutorial.
Writing a tutorial against the live OpenStreetMap database is tricky. The results can change at any time. So I opted to focus on demonstrating the functionality using mostly natural features and administrative boundaries.
In the end I chose to focus on an area around Uluru in Australia. Not just because it provides an interesting and stable backdrop for the tutorial. But because I also wanted to encourage a tiny bit of reflection in the reader about what gets mapped, who does the mapping, and how things get tagged.
A bit of map art, and a request
The three other query collections are quite small:
a set of miscellaneous queries that are based on my own experiments
These were all done by styling the existing OSM data. No edits were done to change the map. I wouldn’t encourage you to do that.
I’ve put all the source files and content for the website into the public domain so you’re free to adapt, use and share however you see fit.
While I’ll continue to improve the tutorial and add some more examples I’m also hoping that I can encourage others to contribute to the site. If you have useful queries that you could be added to the site then submit them via Github. I’ve provided a simple issue template to help you do that.
I’m hoping this provides a useful resource for people in the OSM community and that we can collectively improve it over time. I’ve love to get some feedback, so feel free to drop me an email, comment on this post or message me on twitter.
And if you’ve never explored the data behind OpenStreetMap then Open Data Day is a great time to dive in. Enjoy.
Disclaimer: this blog post is about some of the challenges that we have faced in consuming and using data in Energy Sparks. While I am a trustee of the Energy Sparks application, and am currently working with the team on some improvements to the application, this blog post are my own opinions.
Energy Sparks is an online energy analysis tool and energy education programme specifically designed to help schools reduce their electricity and gas usage through the analysis of smart meter data. The service is run by the Energy Sparks charity, which aims to educate young people about climate change and the importance of energy saving and reducing carbon emissions.
The team provides support for teachers and school eco-teams in running educational activities to help pupils learn about energy and climate change in the context of their school.
The application uses a lot of different types of data, to provide insights, analysis and reporting to pupils, teachers and school administrators.
There are a number of challenges with accessing and using these different types of dataset. As there is a lot of work happening across the UK energy data ecosystem at the moment, I thought I’d share some information about what data is being used and where the challenges lie.
Unsurprisingly the base dataset for the service is information about schools. There’s actually a lot of different type of information that is useful to know about a school in order to do some useful analysis:
basic data about the school, e.g. its identifier, type and the curriculum key stages taught at the school
where it is, so we can map the schools and find local weather data
whether the school is part of a multi-academy trust or which local authority it is with
information about its physical infrastructure. For example number of pupils, floor area, whether it has solar panels, night storage heaters, a swimming pool or serves school dinners
its calendar, so we can identify term times, inset days, etc. Useful if you want to identify when the heating may need to be on, or when it can be switched off
contact information for people at the school (provided with consent)
what energy meters are installed at the school and what energy tariffs are being used
This data can be tricky to acquire because:
there are separate databases of schools across the devolved nations, no consistent method of access or similarity of data
calendars vary across local authorities, school groups and on an individual basis
schools typically have multiple gas and electricity meters installed in different buildings
schools might have direct contracts with energy suppliers, be part of a group purchase scheme managed by a trust or their local authority or be part of a large purchasing framework agreement, so tariff and meter data might need to come from elsewhere. Many local authorities appoint separate meter operators adding a further layer of complexity to data acquisition.
If you want to analyse energy usage then you need to know what the weather was like at the location and time it was being used. You need more energy for heating when it’s cold. But maybe you can switch the heating off if it’s cold and it’s outside of term time.
If you want to suggest that the heating might be adjusted because it’s going to be sunny next week, then you need a weather forecast.
And if you want to help people understand whether solar panels might be useful, then you need to be able to estimate how much energy they might have been able to generate in their location.
This means we use:
half-hourly historical temperature data to analyse the equivalent historical energy usage. On average we’re looking at four years worth of data for each school, but for some schools we have ten or more years
forecast temperatures to drive some user alerts and recommendations
estimated solar PV generation data
The unfortunate thing here is that the Met Office doesn’t provide the data we need. They don’t provide historical temperature or solar irradiance data at all. They do provide forecasts via DataPoint, but these are weather station specific forecasts. So not that useful if you want something more local.
We’ve now moved on to using Meteostat. Which is a volunteer run service that provides a range of APIs and bulk data access under a CC-BY-NC-4.0 licence.
The feature that Meteostat, Dark Sky and Weather Underground all offer is location based weather data, based on interpolating observation data from individual stations. This lets us get closer to actual temperature data at the schools.
It would be great if the Met Office offered a similar feature under a fully open licence. They only offer a feed of recent site-specific observations.
To provide schools with estimates of the potential benefits of installing solar panels, we currently use the Sheffield University Solar PV Live API, which is publicly available, but unfortunately not clearly licensed. But it’s our best option. Based on that data we can indicate potential economic benefits of installing different sizes of solar panels.
National energy generation
We provide schools with reports on their carbon emissions and, as part of the educational activities, give insights into the sources of energy being generated on the national grid.
For both of these uses, we take data from the Carbon Intensity API provided by the National Grid which publishes data under a CC-BY licence. The API provides both live and historical half-hourly data, which aligns nicely with our other sources.
School energy usage and generation
The bulk of the data coming into the system is half-hourly meter readings from gas and electricity meters (usage) and from solar PV panels from schools that have them (generation and export).
This allows us to chart and analyse data presenting reports and analysis across the school data.
There are numerous difficulties with getting access to this data:
the complexity of the energy ecosystem means that data is passed between meter operators, energy suppliers, local authorities, school groups, solar PV systems and a mixture of intermediary platforms. So just getting permission in the right place can be tricky
some solar PV monitoring systems, e.g. SolarEdge and RBee, offer APIs so in some cases we integrate with these, further adding to the mixture of sources
the mixture of platforms means that, while there is more or less industry standard reporting of half-hourly readings, there is no standard API for accessing this data. There’s a tangle of proprietary, undocumented or restricted access APIs
meters get added and removed over time, so the number of meters can change
for some rural schools, reporting of usage and generation is made trickier because of connectivity issues
in some cases, we know schools have solar PV installed, but we can’t get access to a proper feed. So in this case, we have to use the Sheffield PV Live API and knowledge of what panels are installed to create estimated outputs
The feature that most platforms and suppliers seem to consistently offer, at least to non-domestic customers, is a daily or weekly email with a CSV attachment containing the relevant readings. So this is the main route by which we currently bring data into the system.
We’re also currently prototyping ingesting smart meter data, rather than the current AMR data. We will likely be accessing that via one or more intermediaries who provide public APIs that interface with the underlying infrastructure and APIs run by the Data Communications Company (DCC). The DCC are the organisation responsible for the UK’s entire smart meter infrastructure. There is a growing set of these companies that are providing access to this data.
I plan to write more about that in a future post. But I’ll note here that the approach for managing consent and standardising API access is in a very early stage.
Unfortunately the government legislation backing the shift to smart meters only applies to domestic meters. So there is no requirement to stop installing AMR meters in non-domestic settings or a path to upgrade. So services targeting schools and businesses will need to deal with a tangle of data sources for some time to come.
In addition to exploring integration with the DCC there are other ways that we might improve our data collection. For example directly integrating with the “Get Information about Schools Service” for data on English schools. Or using the EPC data to help find data on floor area for school buildings.
But as of today, for the bulk of data we use, the two big wins would be access to historical data from the Met Office in a useful format, and some coordination across the industry around standardising access to meter data. I doubt we’ll see the former and I’m not clear yet whether any of the various open energy initiatives will produce the latter.
One of my little side projects is to explore historical images and maps of Bath and the surrounding areas. I like understanding the contrast between how Bath used to look and how it is today. It’s grown and changed a huge amount over the years. It gives me a strong sense of place and history.
There is a rich archive of photographs and images of the city and area that were digitised for the Bath in Time project. Unfortunately the council has chosen to turn this archive into a, frankly terrible, website that is being used to sell over-priced framed prints.
The website has limited navigation and there’s no access to higher resolution imagery. Older versions of the site had better navigation and access to some maps.
The current version looks like it’s based on a default ecommerce theme for WordPress rather than being designed to show off the richness of the 40,000 images it contains. Ironically the @bathintime twitter account tweets out higher resolution images than you can find on the website.
This is a real shame. Frankly I can’t imagine there’s a huge amount of revenue being generated from these prints.
If the metadata and images were published under a more open licence (even with a non-commercial limitation) then it would be more useful for people like me who are interested in local history. We might even be able to help build useful interfaces. I would happily invest time in cataloguing images and making something useful with them. In fact, I have been.
In lieu of a proper online archive, I’ve been compiling a list of publicly available images from other museums and collections. So far, I’ve sifted through:
So you can take the metadata and links and explore them for yourself. I thought they may be useful for anyone looking to reuse images in their research or publications.
I’m in the process of adding geographic coordinates to each of the images, so they can be placed on the map. I’m approaching that by geocoding them as if they were produced using a mobile phone or camera. For example, an image of the abbey won’t have the coordinates of the abbey associated with it, it’ll be wherever the artist was standing when they painted the picture.
This is already showing some interesting common views over the years. I’ve included a selection below.
Views from the river, towards Pulteney Bridge
Southern views of the city
Looking to the east across abbey churchyard
Views of the Orange Grove and Abbey
It’s really interesting to be able to look at the same locations over time. Hopefully that gives a sense of what could be done if more of the archives we made available.