24 different tabular formats for half-hourly energy data

A couple of months ago I wrote a post that provided some background on the data we use in Energy Sparks.

The largest data source comes from gas and electricity meters (consumption) and solar panels (generation). While we’re integrating with APIs that allow us to access data from smart meters, for the foreseeable future most of this data will still be collected via AMR rather than SMETS-2 meters. And then shared with us as CSV files attached to emails.

That data is sent via a variety of systems and platforms run by energy companies, aggregators and local authorities. We’re currently dealing with about 24 different variations of what is basically the same dataset.

I thought I’d share a quick summary of that variation, as it’s interesting from a “designing CSV files” and data standards perspective.

For a quick overview, you can look at this Google spreadsheet which provides a summary of the formats, in a way that hopefully makes them easy to compare.

The rest of this post has some notes on the variations.

What data are we talking about?

In Energy Sparks we work with half-hourly consumption and production data. A typical dataset will consist of a series of 48 daily readings for each meter.

Each half hourly data point reports the total amount of energy consumed (or generated) in the previous 30 minutes.

A dataset will usually contain several days of readings for many different meters.

This means that the key bits of information we need in order to process each dataset are:

  • An identifier for the meter, e.g. an MPAN or MPRN
  • The date that the readings were taken
  • A series of 48 data points making up a full day’s readings

Pretty straight-forward. But as you can see in the spreadsheet there are a lot of different variations.

We receive different formats for gas and for electricity data, and different formats for historical versus ongoing data supply. Sometimes both at once.

And formats might change as schools or local authorities change platform, suppliers, etc.

Use of CSV

In general, the CSV files are pretty consistent. We rely on the Ruby CSV parser’s default behaviour to automatically identify line endings. And all the formats we receive use commas, rather than tabs, as delimiters.

The number of header rows varies. Most have a single row, but some don’t have any. A couple have two.
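
For illustration, here’s roughly what that looks like in code. This is a minimal sketch rather than our actual ingest code, and the file and column names are made up for the example:

```ruby
require 'csv'

# Minimal sketch of reading one column-oriented file. We lean on the
# Ruby CSV library's defaults: comma delimiters and automatic
# detection of line endings. Assumes a single header row.
CSV.foreach('readings.csv', headers: true) do |row|
  meter_id = row['MPAN']          # the identifier column name varies by format
  date     = row['Reading Date']  # so does the date column
  # ...the 48 half-hourly reading columns are handled per format too
end
```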

Date formats

Various date formats are used. The following lists them, most common first, with the number of formats using each shown in parentheses:

  1. %d/%m/%Y (15)
  2. %d/%m/%y (4)
  3. %y-%m-%d (3)
  4. %b %e %Y %I:%M%p (1)
  5. %e %b %Y %H:%M:%S (1)

Not much use of ISO 8601!

But the skew towards readable formats probably makes sense given that the primary anticipated use of this data is for people to open it in a spreadsheet.
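
In code terms, each of those patterns is just a different strptime format string. A quick sketch, using made-up values with three of the real patterns from the list above:

```ruby
require 'date'

# Parsing a reading date with a per-format pattern.
Date.strptime('13/09/2021', '%d/%m/%Y')  # => 2021-09-13
Date.strptime('13/09/21',   '%d/%m/%y')  # => 2021-09-13
Date.strptime('21-09-13',   '%y-%m-%d')  # => 2021-09-13
```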

Where we have several different formats from a single source (yes, this happens), I’ve noticed that the %Y-based date formats tend to be used in the feeds providing historical data, while the two-digit %y year seems to be the default for ongoing data.

Data is supplied either with UTC dates or, most commonly, in local UK time. So readings switch between GMT and BST, which means that when the clocks change we end up with gaps in the readings.

Tabular structure

The majority of formats (22/24) are column oriented. By which I mean the tables consist of one row per meter per day, with each row holding its 48 half-hourly readings as separate columns.

Two are row oriented, with each row containing a single measurement for a specific meter at a specific date-time.
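
To make that difference concrete, here’s a rough sketch of folding a row-oriented feed into the one-row-per-meter-per-day shape we use internally. The column names and timestamp format are illustrative only; as described below, they vary between formats:

```ruby
require 'csv'
require 'date'

# Sketch: accumulate row-oriented readings (one reading per row) into
# 48 half-hourly slots per meter per day. Assumes each timestamp marks
# the start of its half-hour period.
readings = Hash.new { |hash, key| hash[key] = Array.new(48) }

CSV.foreach('row_oriented.csv', headers: true) do |row|
  timestamp = DateTime.strptime(row['ReadDatetime'], '%d/%m/%Y %H:%M')
  slot      = (timestamp.hour * 2) + (timestamp.minute >= 30 ? 1 : 0)
  readings[[row['MPAN'], timestamp.to_date]][slot] = row['kWh'].to_f
end
```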

Meter identifiers

The column used to hold meter identifiers also varies. We might expect at least two: MPAN for electricity meters and MPRN for gas. What we actually get is:

  • Channel
  • M1_Code1
  • Meter
  • Meter Number
  • MPAN
  • MPN
  • MPR
  • "MPR"
  • MPR Value
  • Site Id

“Meter” seems fair as a generic column header if you know what you’re getting. Otherwise some baffling variations here.

Date column

What about the column that contains the date (or the date-time, for row-oriented files)? What is it called?

  • "Date"
  • ConsumptionDate
  • Date
  • Date (Local)
  • Date (UTC)
  • DAY
  • read_date
  • ReadDate
  • ReadDatetime
  • Reading Date

Units

The default is that data is supplied in kilowatt-hours (kWh).

So few of the formats actually bother to specify a unit. Those that do call it “ReportingUnit”, “Units” or “Data Type”.

One format actually contains 48 columns reporting kWh and another 48 columns reporting kilovolt-ampere reactive hours (kVArh).

Readings

Focusing on the column oriented formats, what are the columns containing the 48 half-hourly readings called?

Most commonly they’re named after the half-hour. For example, a column called “20:00” will contain the kWh consumption for the period between 7.30pm and 8pm.

In other cases the columns are positional, e.g. “half hour 1” through to “half hour 48”. This gives us the following variants:

  • 00:00
  • 00:30:00
  • [00:30]
  • H0030
  • HH01
  • hh01
  • hr0000
  • kWh_1

For added fun, some formats have their first column as 00:30, while others have 00:00.
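
Rather than listing all 48 header names by hand for every format, they can be generated from a small per-format pattern. A sketch covering two of the variants above (the helper names are mine, not anything from the formats themselves):

```ruby
# Time-of-day style headers: "00:00", "00:30" ... "23:30".
# Pass an offset of 30 for the formats whose first column is "00:30".
def time_of_day_headers(offset_minutes = 0)
  (0...48).map do |slot|
    minutes = (slot * 30) + offset_minutes
    format('%02d:%02d', (minutes / 60) % 24, minutes % 60)
  end
end

# Positional style headers: "hh01" .. "hh48" (or "HH01" etc. with a
# different prefix).
def positional_headers(prefix = 'hh')
  (1..48).map { |n| format('%s%02d', prefix, n) }
end

time_of_day_headers      # => ["00:00", "00:30", ..., "23:30"]
time_of_day_headers(30)  # => ["00:30", "01:00", ..., "00:00"]
positional_headers       # => ["hh01", "hh02", ..., "hh48"]
```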

Some formats interleave the actual readings with an extra column that is used to provide a note or qualifier. There are two variants of this:

  • 00:00 Flag
  • Type

Other columns

In addition to the meter numbers, dates, readings, etc., the files sometimes contain extra columns, e.g.:

  • Location
  • MPRAlias
  • Stark ID
  • Meter Name
  • MSN
  • meter_identifier
  • Meter Number
  • siteRef
  • ReadType
  • Total
  • Total kWh
  • PostCode
  • M1_Code2

We generally ignore this information as it’s either redundant or irrelevant to our needs.

Some files provide additional meter names, numbers or identifiers that are bespoke to the data source rather than being public identifiers.

Summary

We’ve got to the point now where adding new formats is relatively straight-forward.

Like anyone dealing with large volumes of tabular data, we’ve got a configuration-driven data ingest which we can tailor for different formats. We largely just need to know the name of the date column, the name of the column containing the meter id, and the names of the 48 readings columns.
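
For a flavour of what that configuration looks like, here’s a heavily simplified sketch and how it might be applied to a parsed row. The real configuration lives in our database and has more options (header rows to skip, units, flag columns, etc.):

```ruby
require 'date'

# Heavily simplified sketch of a per-format configuration.
EXAMPLE_FORMAT = {
  meter_id_column: 'MPAN',
  date_column:     'Reading Date',
  date_format:     '%d/%m/%Y',
  reading_columns: ['00:00', '00:30', '01:00'] # ...through to '23:30'
}.freeze

# Turn one parsed CSV row into a normalised day of readings.
def normalise(row, config)
  {
    meter_id: row[config[:meter_id_column]],
    date:     Date.strptime(row[config[:date_column]], config[:date_format]),
    kwh:      config[:reading_columns].map { |name| row[name]&.to_f }
  }
end
```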

But it’s taken time to develop that.

Most of the ongoing effort is during the setup of a new school or data provider, when we need to check to see if a data feed matches something we know, or whether we need to configure another slightly different variation.

And we have ongoing reporting to alert us when formats change without notice.

The fact that there are so many variations isn’t a surprise. There are many different sources and at every organisation someone has made a reasonable guess at what a useful format might be. They might have spoken to users, but probably don’t know what their competitors are doing.

This variation inevitably creates cost. That cost isn’t immediately felt by the average user, who only has to deal with 1-2 formats at a time when they’re working with their own data in spreadsheets.

But those costs add up for those of us building and operating the tools, platforms and systems that support those users.

I don’t see anyone driving a standardisation effort in this area. Although, as I’ve hopefully shown here, behind the variations there is a simple, common tabular format that is waiting to be defined.

My impression at the moment is that most focus is on the emerging smart meter data ecosystem, and the new range of APIs that might support faster access to this same data.

But as I pointed out in my other post, if there isn’t an early attempt to standardise those, we’ll just end up with a whole range of new, slightly different APIs and data feeds. What we need is a common API standard.

Schema explorers and how they can help guide adoption of common standards

Despite being very different projects, Wikidata and OpenStreetMap have a number of similarities: recurring patterns in how they organise and support the work of their communities.

We documented a number of these patterns in the ODI Collaborative Maintenance Guidebook. There were also a number we didn’t get time to write up.

A further pattern which I noticed recently is that both Wikidata and OSM provide tools and documentation that help contributors and data users explore the schema that shapes the data.

Both projects have a core data model around which their communities are building and iterating on a more focused domain model. This approach of providing tools for the community to discuss, evolve and revise a schema is what we called the Shared Canvas pattern in the ODI guidebook.

In OpenStreetMap that core model consists of nodes, ways and relations. Tags (name-value pairs) can be attached to any of these types.

In Wikidata the core data model is essentially a graph. A collection of statements that associate values with nodes using a range of different properties. It’s actually more complicated than that, but the detail isn’t important here.

The list of properties in Wikidata and the list of tags in OpenStreetmap are continually revised and extended by the community to capture additional information.

The OpenStreetMap community documents tags in its Wiki (e.g. the building tag). Wikidata documents its properties within the project dataset (e.g. the name property, P2561).

But to successfully apply the Shared Canvas pattern, you also need to keep the community up to date about your Evolving Schema. To do that you need some way to communicate which properties or tags are in use, and how. OSM and Wikidata both provide tools to support that.

In OSM this role is filled by TagInfo. It can provide you with a breakdown of what types of feature a tag is used on, the range of values, combinations with other tags and some idea of its geographic usage. Tag usage varies by geographic community in OSM. Here’s the information about the building tag.

In Wikidata this tooling is provided by a series of reports that are available from the Discussion page for an individual property. This includes information about how often it is used and pointers to examples of frequent and recent uses. Here’s the information about the name property.

Both tools provide useful insight into how different aspects of a schema are being adopted and used. They can help guide not just the discussion around the schema (“is this tag in use?”), but also the process of collecting data (“which tags should I use here?”) and using the data (“what tags might I find, or query for?”).

Any project that adopts a Shared Canvas approach is likely to need to implement this type of tooling. Let’s call it the “Schema explorer” pattern for now.

I’ll leave documenting it further for another post, or a contribution to the guidebook.

Schema explorers for open standards and open data

This type of tooling would be useful in other contexts.

Anywhere that we’re trying to drive adoption of a common data standard, it would be helpful to be able to assess how well used different parts of that schema are by analysing the available data.

That’s not something I’ve regularly seen produced. In our survey of decentralised publishing initiatives at the ODI we found common types of documentation, data validators and other tools to support use of data, like useful aggregations. But no tooling to help explore how well a standard has been adopted. Or to help data users understand the shape of the available data prior to aggregating it.

When I was working on the OpenActive standard, I found the data profiles that Dan Winchester produced really helpful. They provide useful insight into which parts of a standard different publishers were actually using.

I was thinking about this again recently whilst doing some work for Full Fact, exploring the ClaimReview markup in Schema.org. It would be great to see which features different fact checkers are actually using. In fact that would be true of many different aspects of Schema.org.

This type of reporting is hard to do in a distributed environment without aggregating all the data. But Google are regularly harvesting some of this data, so it feels like it would be relatively easy for them to provide insights like this if they chose.

An alternative is the Schema.org Table Corpus which provides exports of Schema.org data contained in the Common Crawl dataset. But more work is likely needed to generate some useful views over the data, and it is less frequently updated.

Outside of Schema.org, schema explorers reporting on the contents of open datasets would help inform a range of standards work. For example, they could help inform decisions about how to iterate on a schema, guide the production of documentation, and help improve the design of validators and other tools.

If you’ve seen examples of this type of tooling, then I’d be interested to see some links.

Building data validators

This is a post about building tools to validate data. I wanted to share a few reflections based on helping to design and build a few different public and private tools, as well as my experience as a user.

I like using data validators to check my homework. I’ve been using a few different ones recently, which has prompted me to think a bit about their role and the decisions that go into their design.

The tl;dr version of this post is along the lines of “Think about user needs when designing tools. But also be conscious of the role those tools play in their broader ecosystem“.

What is a data validator?

A data validator is a tool that checks the correctness and quality of data. This means doing the following categories of checks, with a rough sketch of how they differ after the list:

  • Syntax
    • Checking to determine whether there are any mistakes in how it is formatted. E.g. is the syntax of a CSV, XML or JSON file correct?
  • Validity
    • Confirming that all of the required fields, necessary to make the data useful, have been provided
    • Testing that individual values have been correctly specified. E.g. if the field should contain a number then is the provided value actually a number rather than text?
    • Performing more semantic checks such as, if this is a dataset about UK planning applications, then are the coordinates actually in the UK? Or is the start date for the application before the end date?
  • Utility
    • Confirming that provided data is of a useful quality, e.g. are geographic coordinates of the right precision? Or do any links to other resources actually work?
    • Warning about data that may or may not be included. For example, prompting the user to include additional fields that may improve the utility of the data. Or asking them to consider whether any personal data included should be there.
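
As a rough sketch of how those categories differ in practice, here’s a toy validator for a single record from a hypothetical planning-applications dataset. Syntax and validity problems are reported as errors, utility problems as warnings:

```ruby
require 'json'

def validate(raw)
  # Syntax: is it even well-formed JSON?
  begin
    record = JSON.parse(raw)
  rescue JSON::ParserError => e
    return { errors: ["invalid JSON: #{e.message}"], warnings: [] }
  end

  errors, warnings = [], []

  # Validity: are required fields present, and values of the right type?
  errors << 'missing start_date' unless record['start_date']
  if record['latitude'] && !record['latitude'].is_a?(Numeric)
    errors << 'latitude must be a number'
  end

  # Utility: legal, but probably not what the publisher intended.
  if record['latitude'].is_a?(Numeric) && !record['latitude'].between?(49.0, 61.0)
    warnings << 'latitude appears to be outside the UK'
  end

  { errors: errors, warnings: warnings }
end

validate('{"start_date": "2021-09-13", "latitude": 51.38}')
# => {:errors=>[], :warnings=>[]}
```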

These validation rules will typically come from a range of different sources, including:

  • The standard or specification that defines the syntax of the data.
  • The standard or specification (or schema) that describes the structure and content of the data. (This might be the same as the above, or might be defined elsewhere)
  • Legislation, which might guide, inform or influence what data should or should not be included
  • The implementer of the validation tool, who may have opinions about what is considered to be correct or useful data based on their specific needs (e.g. as a direct consumer of the data) or more broadly as a contributor to a community initiative to support improvements to how data is published

Data validators are frequently web based these days. At least for smaller datasets. But both desktop and command-line tools are also regularly used in different settings. The choice of design will be informed by things like how open the data can be, the volume of data being checked, and how the validator might be integrated into a data workflow, e.g. as an automated or manual step.

Examples of different types of data validator

Here are some examples of different data validators created for different purposes and projects:

  1. JSON lint
  2. GeoJSON Lint
  3. JSON LD Playground
  4. CSVlint
  5. ODI Leeds Business Rates format validator
  6. 360Giving Data Quality Tool
  7. OpenContracting Data Review Tool
  8. The OpenActive validator
  9. OpenReferral UK Service Validator
  10. The Schema.org validator
  11. Google’s Rich Results Test
  12. The Twitter Card validator
  13. Facebook’s sharing debugger

The first few on the list are largely syntax checkers. They validate whether your CSV, JSON or GeoJSON files are correctly structured.

The others go further and check not just the format of the data, but also its validity against a schema. That schema is defined in a standard intended to support consistent publication of data across a community. The goal of these tools is to improve quality of data for a wide range of potential users, by guiding publishers about how to publish data well.

The last three examples are validators that are designed to help publishers meet the needs of a specific application or consumer of the data. They’re an actionable way to test data against the requirements of a specific user.

Validators also vary in other ways.

For example, the 360Giving, OpenContracting and Rich Results Test validators all accept a range of different data formats. They validate different syntaxes against a common schema. Others are built around a single specific format.

Some tools provide a low-level view of the results, e.g. a list of errors and warnings with reference to specific sections of the data. Others provide a high-level interface, such as a preview of what the data looks like on a map or as it would be displayed in a specific application. This type of visual presentation can help catch other types of errors and more directly confirm how data might be interpreted, whilst also making the tool useful to a wider audience.

What do we mean by data being valid?

For simple syntax checking, identifying whether something is valid is straight-forward. Your JSON is either well-formed or it’s not.

Validators that are designed around specific applications also usually have a clear marker of what is “valid”: can the application parse, interpret and display the data as expected? Does my Twitter card look correct?

In other examples, the notion of “valid” is harder to define. There may be some basic rules around what a minimum viable dataset looks like. If so, these are easier to identify and classify as errors.

But there is often variability within a schema. E.g. optional elements. This means that validators need to offer more than just a binary decision and instead offer warnings, suggestions and feedback.

For example, when thinking about the design of the OpenActive validator we discussed the need to go beyond simple validation and provide feedback and prompts along the lines of “you haven’t provided a price, is the event free or chargeable?” Or “you haven’t provided an image for this event, this is legal but evidence shows that participants are more likely to sign up to events where they can see what participation looks like.”

To put this differently: data quality depends on how you’re planning to use the data. It’s not an absolute. If you’re not validating data for a specific application or purpose, then your tool should be prompting users to think about the choices they are making around how data is being shared.

In the context of sharing and publishing open data, this moves the role of a data validator beyond simply checking correctness, and towards identifying sources of friction that will exist between publisher and consumer.

Beyond the formal conformance criteria defined in a specification, deciding whether something is valid or not is really just a marker for how much extra work is required by a consumer. And in some cases the publisher may not have the time, budget or resources to invest in reducing that burden.

Things to think about when designing a validator

To wrap up this post, here are some things to think about when designing a data validator:

  • Who are your users? What level of technical skill and understanding are you designing for?
  • How will the validator be used or integrated into the user’s workflow? A tool for integration into a continuous integration environment will need to operate differently to something used to do acceptance checking before data is published. Maybe you need several different tools?
  • How much knowledge of the relevant standards or specification will a user need before they can use the tool? Should the tool facilitate learning and exploration about how to structure data, or is it just for checking existing data?
  • How can you provide good, clear feedback? Tools that rely on applying machine-readable schemas like JSON Schema can often have cryptic messages as they rely on an underlying library to report errors
  • How can you provide guidance and feedback that will help users decide how to improve data? Is the feedback actionable? (For example in CSVLint we figured out that when reporting that a user had an incorrect mime-type for their CSV file we could identify if it was served from AWS and provide a clear suggestion about how to fix the issue)
  • Would showing the data, as a preview or within a mocked up view, help surface problems or build confidence in how data is published?
  • Are the documentation about how to publish data and the reports from your validator consistent? If not, then fix the documentation or explain the limits of the validator

Finally, if you’re designing a validator for a specific application, then don’t mark as “invalid” anything that you can simply ignore. Don’t force the ecosystem to converge on your preferences.

You may not be interested in the full scope of a standard, but different applications and users will have different needs.

Data quality is a dialogue between publishers and users of data. One that will evolve over time as tools, applications, norms and standards become adopted across a data ecosystem. A data validator is an important building block that can facilitate that discussion.

Some lessons learned from building standards around Schema.org

OpenActive is a community-led initiative in the sport and physical activity sector in England. Its goal is to help get people healthier and more active by making it easier for people to find information about activities and events happening in their area. Publishing open data about opportunities to be active is a key part of its approach.

The initiative has been running for several years, funded by Sport England. It’s supported by a team at the Open Data Institute who are working in close collaboration with a range of organisations across the sector.

During the early stages of the project I was responsible for leading the work to develop the technical standards and guidance that would help organisations publish open data about squash courts and exercise classes. I’ve written some previous blog posts that described the steps that got us to version 1.0 of the standards and then later the roadmap towards 2.0.

Since then the team have been exploring new features like publishing data about walking and cycling routes, improving accessibility information and, more recently, testing a standard API for booking classes.

If you’re interested in more of the details then I’d encourage you to dig into those posts as well as the developer portal.

What I wanted to cover in this blog post are some reflections about one of the key decisions we made early in the standards workstream. This was to base the core data model on Schema.org.

Why did we end up basing the standards on Schema.org?

We started the standards work in OpenActive by doing a proper scoping exercise. This helped us to understand the potential benefits of introducing a standard, and the requirements that would inform its development.

As part of our initial research, we did a review of what standards existed in the sector. We found very little that matched our needs. The few APIs that were provided were quite limited and proprietary and there was little consistency around how data was organised.

It was clear that some standardisation would be beneficial and that there was little in the way of sector-specific work to build on. It was also clear that we’d need a range of different types of standard. Data formats and APIs to support exchange of data, a common data model to help organise data and a taxonomy to help describe different types of activity.

For the data model, it was clear that the core domain model would need to be able to describe events. E.g. that a yoga class takes place in a specific gym at regular times. This would support basic discovery use cases. Where can I go and exercise today? What classes are happening near me?

As part of our review of existing standards, we found that Schema.org already provided this core model along with some additional vocabulary that would help us categorise and describe both the events and locations. For example, whether an Event was free, its capacity and information about the organiser.
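
To give a flavour of that fit, here’s a hand-written sketch of a minimal Schema.org Event for a fitness class, built as a Ruby hash and serialised to JSON-LD. It uses only core Schema.org terms and is illustrative, not an excerpt from the OpenActive specifications:

```ruby
require 'json'

# A minimal Schema.org Event: name, start time, whether it's free,
# its capacity, the organiser and the location. Values are made up.
event = {
  '@context'  => 'https://schema.org',
  '@type'     => 'Event',
  'name'      => 'Beginners Yoga',
  'startDate' => '2021-09-13T18:30:00+01:00',
  'isAccessibleForFree'     => false,
  'maximumAttendeeCapacity' => 20,
  'organizer' => { '@type' => 'Organization', 'name' => 'Example Leisure Centre' },
  'location'  => { '@type' => 'Place',
                   'name'  => 'Example Leisure Centre',
                   'address' => 'High Street, Bath' }
}

puts JSON.pretty_generate(event)
```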

For many people Schema.org may be more synonymous with publishing data for use by search engines. But as a project its goal is much broader, it is “a collaborative, community activity with a mission to create, maintain, and promote schemas for structured data“.

The data model covers much more than what search engines are consuming. Some communities are instead using the project as a means to collaborate on developing better vocabulary for sharing data between other applications. As well as aligning existing vocabularies under a common umbrella.

New standards should ideally be based on existing standards. We knew we were going to be building the OpenActive technical standards around a “stack” of standards that included HTTP, JSON and JSON-LD. So it was a natural step to base our initial domain model on aspects of Schema.org.

What were the benefits?

An early benefit of this approach is that we could immediately focus our roadmap on exploring extensions to the Schema.org data model that would add value to the physical activity sector.

Our initial community sessions around the OpenActive standards involved demonstrating how well the existing Schema.org model fitted the core requirements. And exploring where additional work was needed.

This meant we skipped any wrangling around how to describe events and instead focused on what we wanted to say about them. Important early questions focused on what information potential participants would find helpful in understanding whether a specific activity or event is something they might want to try. For example, details like: what activities are involved, and what level of competency is expected?

We were able to identify those elements of the core Schema.org model that supported our use cases and then document some extensions in our own specifications. The extensions and clarifications were important for the OpenActive community, but not necessarily relevant in the broader context in which Schema.org is being used. We wanted to build some agreement and usage in our community first, before suggesting changes to Schema.org.

As well as giving us an initial head start, the decision also helped us address new requirements much quicker.

As we uncovered further requirements that meant expanding our data model, we were always able to first look to see if existing Schema.org terms covered what we needed. We began using it as a kind of “dictionary” that we could draw on when needed.

Where existing parts of the Schema.org model fitted our needs, it was gratifying to be able to rapidly address the new requirements by documenting patterns for how to use them. Data publishers were also doing the same thing. Having a common dictionary of terms gave freedom to experiment with new features, drawing on terms defined in a public schema, before the community had discussed and agreed how to implement those patterns more broadly.

Every standards project has its own cadence. The speed of development and adoption are tied up with a whole range of different factors that go well beyond how quickly you can reach consensus around a specification.

But I think the decision to use Schema.org definitely accelerated progress and helped us more quickly deliver a data model that covered the core requirements for the sector.

Where were the challenges?

The approach wasn’t without its challenges, however.

Firstly, for a sector that was new to building open standards, choosing to base parts of that new standard on one project and then defining extensions created some confusion. Some communities seem more comfortable with piecing together vocabularies and taxonomies, but that is not true more widely.

Developers found it tricky to refer to both specifications, to explore their options for publishing different types of data. So we ended up expanding our documentation to cover all of the Schema.org terms we recommended or suggested people use, instead of focusing more on our own extensions.

Secondly, we also initially adopted the same flexible, non-prescriptive approach to data publishing that Schema.org uses. It does not define strict conformance criteria and there are often different options for how the same data might be organised depending on the level of detail a publisher has available. If Schema.org were too restrictive then it would limit how well the model could be used by different communities. It also leaves space for usage patterns to emerge.

In OpenActive we recognised that the physical activity sector had a wide range of capabilities when it came to publishing structured data. And different organisations organised data in different ways. We adopted the same less prescriptive approach to publishing with the goal of reducing the barriers to getting more data published. Essentially asking publishers to structure data as best they could within the options available.

In the end this wasn’t the right decision.

Too much flexibility made it harder for implementers to understand what data would be most useful to publish. And how to do it well. Many publishers were building new services to expose the data so they needed a clearer specification for their development teams.

We addressed this in Version 2 of the specifications by considerably tightening up the requirements. We defined which terms were required or just recommended (and why). And added cardinalities and legal values for terms. Our specification became a more formal, extended profile of Schema.org. This also allowed us to build a data validator that is now being released and maintained alongside the specifications.

Our third challenge was about process. In a few cases we identified changes that we felt would sit more naturally within Schema.org than our own extensions. For example, they were improvements and clarifications around the core Event model that would be useful more widely. So we submitted those as proposed changes and clarifications.

Given that Schema.org has a very open process, and the wide range of people active in discussing issues and proposals, it was sometimes hard to know how decisions would get made. We had good support from Dan Brickley and others stewarding the project, but without knowing much about who is commenting on your proposal, their background or their own use cases, it was tricky to know how much time to spend on handling this feedback. Or when we could confidently say that we had achieved some level of consensus.

We managed to successfully navigate this, by engaging as we would within any open community: working transparently and collegiately, and being willing to reflect on and incorporate feedback regardless of its source.

The final challenge was about assessing the level of use of different parts of the Schema.org model. If we wanted to propose a change in how a term was documented or suggest a revision to its expected values, it is difficult to assess the potential impact of that change. There’s no easy way to see which applications might be relying on specific parts of the model. Or how many people are publishing data that uses different terms.

The Schema.org documentation does flag terms that are currently under discussion or evaluation as “pending”. But outside of this it’s difficult to understand more about how the model is being used in practice. To do that you need to engage with a user community, or find some metrics about deployment.

We handled this by engaging with the open process of discussion, sharing our own planned usage to inform the discussion. And, where we felt that Schema.org didn’t fit with the direction we needed, we were happy to look to other standards that better filled those gaps. For example we chose to use SKOS to help us organise and structure a taxonomy of physical activities rather than using some of the similar vocabulary that Schema.org provides.

Choosing to draw on Schema.org as a source of part of our domain model didn’t mean that we felt tied to using only what it provides.

Some recommendations

Overall I’m happy that we made the right decision. The benefits definitely outweighed the challenges.

But navigating those challenges was easier because those of us leading the standards work were comfortable both with working in the open and in combining different standards to achieve a specific goal. Helping to build more competency in this area is one goal of the ODI standards guidebook.

If you’re involved in a project to build a common data model as part of a community project to publish data, then I’d recommend looking at whether basing some or all of that model on Schema.org might help kickstart your technical work.

If you do that, my personal advice would be:

  • Remember that Schema.org isn’t the right home for every data model. Depending on your requirements, the complexity and the potential uses for the data, you may be better off designing and iterating on your model separately. Similarly, don’t expect that every change or extension you might want to make will necessarily be accepted into the core model
  • Don’t assume that search engines will start using your data, just because you’re using Schema.org as a basis for publishing, or even if you successfully submit change proposals. It’s not a means of driving adoption and use of your data or preferred model
  • Plan to write your own specifications and documentation that describe how your application or community expects data to be published. You’ll need to add more conformance criteria and document useful patterns that go beyond what Schema.org is providing
  • Work out how you will engage with your community. To make it easier to refine your specifications, discuss extensions and gather implementation feedback, you’ll still need a dedicated forum or channel for your community to collaborate in. Schema.org doesn’t really provide a home for that. You might have your own GitHub project or set up a W3C community group.
  • Build your own tooling. Schema.org are improving their own tooling, but you’ll likely need your own validation tools that are tailored to your community and your specifications
  • Contribute to the Schema.org project where you can. If you have feedback, proposed changes or revisions then submit these to the project. It’s through a community approach that we improve the model for everyone. Just be aware that there are likely to be a whole range of different use cases that may be different to your own. Your proposals may need to go through several revisions before being accepted. Proposals that draw on real-world experience or are tied to actual applications will likely carry more weight than general opinions about the “right” way to design something
  • Be prepared to diverge where necessary. As I’ve explained above, sometimes the right option is to propose changes to Schema.org. And sometimes you may need to be ready to draw on other standards or approaches.

The UK Smart Meter Data Ecosystem

Disclaimer: this blog post is about my understanding of the UK’s smart meter data ecosystem and contains some opinions about how it might evolve. These do not in any way reflect those of Energy Sparks of which I am a trustee.

This blog post is an introduction to the UK’s smart meter data ecosystem. It sketches out some of the key pieces of data infrastructure with some observations around how the overall ecosystem is evolving.

It’s a large, complex system so this post will only touch on the main elements. Pointers to more detail are included along the way.

If you want a quick reference, with more diagrams then this UK government document, “Smart Meters, Smart Data, Smart Growth” is a good start.

Smart meter data infrastructure

Smart meters and meter readings

Data about your home or business energy usage used to be collected by someone coming to read the actual numbers displayed on the front of your meter. And in some cases that’s still how the data is collected. It’s just that today you might be entering those readings into a mobile or web application provided by your supplier. In between those readings, your supplier will be estimating your usage.

This situation improved with the introduction of AMR (“Automated Meter Reading”) meters which can connect via radio to an energy supplier. The supplier can then read your meter automatically, to get basic information on your usage. After receiving a request the meter can broadcast the data via radio signal. These meters are often only installed in commercial properties.

Smart meters are a step up from AMR meters. They connect via a Wide Area Network (WAN) rather than radio, support two way communications and provide more detailed data collection. This means that when you have a smart meter your energy supplier can send messages to the meter, as well as taking readings from it. These messages can include updated tariffs (e.g. as you switch supplier or if you are on a dynamic tariff) or a notification to say you’ve topped up your meter, etc.

The improved connectivity and functionality means that readings can be collected more frequently and are much more detailed. Half hourly usage data is the standard. A smart meter can typically store around 13 months of half-hourly usage data. 

The first generation of smart meters are known as SMETS-1 meters. The latest meters are SMETS-2.

Meter identifiers and registers

Meters have unique identifiers.

For gas meters the identifiers are called MPRNs. I believe these are allocated in blocks to gas providers to be assigned to meters as they are installed.

For electricity meters, these identifiers are called MPANs. Electricity meters also have a serial number. I believe MPANs are assigned by the individual regional electricity network operators and that this information is used to populate a national database of installed meters.

From a consumer point of view, services like Find My Supplier will allow you to find your MPRN and energy suppliers.

Connectivity and devices in the home

If you have a smart meter installed then your meters might talk directly to the WAN, or access it via a separate controller that provides the necessary connectivity. 

But within the home, devices will talk to each other using Zigbee, which is a low power internet of things protocol. Together they form what is often referred to as the “Home Area Network” (HAN).

It’s via the home network that your “In Home Display” (IHD) can show your current and historical energy usage as it can connect to the meter and access the data it stores. Your electricity usage is broadcast to connected devices every 10 seconds, while gas usage is broadcast every 30 minutes.

Your IHD can show your energy consumption in various ways, including how much it is costing you. This relies on your energy supplier sending your latest tariff information to your meter.

As this article by Bulb highlights, the provision of an IHD and its basic features is required by law. Research showed that IHDs were more accessible and nudged people towards being more conscious of their energy usage. The high-frequency updates from the meter to connected devices makes it easier, for example, for you to identify which devices or uses contribute most to your bill.

Your energy supplier might provide other apps and services that provide you with insights, via the data collected via the WAN. 

But you can also connect other devices into the home network provided by your smart meter (or data controller). One example is a newer category of IHD called a “Consumer Access Device” (CAD), e.g. the Glow.

These devices connect via Zigbee to your meter and via WiFi to a third-party service, to which they send your meter readings. For the Glow device, that service is operated by Hildebrand.

These third party services can then provide you with access to your energy usage data via mobile or web applications. Or even via API. Otherwise as a consumer you need to access data via whatever methods your energy supplier supports.

The smart meter network infrastructure

SMETS-1 meters connected to a variety of different networks. This meant that if you switched suppliers then they frequently couldn’t access your meter because it was on a different network. So meters needed to be replaced. And, even if they were on the same network, differences in technical infrastructure meant the meters might lose functionality.

SMETS-2 meters don’t have this issue as they all connect via a shared Wide Area Network (WAN). There are two of these covering the north and south of the country.

While SMETS-2 meters are better than previous models, they still have all of the issues of any Internet of Things device: problems with connectivity in rural areas, need for power, varied performance based on manufacturer, etc.

Some SMETS-1 meters are also now being connected to the WAN. 

Who operates the infrastructure?

The Data Communications Company (DCC) is a state-licensed monopoly that operates the entire UK smart meter network infrastructure. It’s a wholly-owned subsidiary of Capita. Their current licence runs until 2025.

DCC subcontracted provision of the WAN to support connectivity of smart meters to two regional providers. In the North of England and Scotland that provider is Arqiva. In the rest of England and Wales it is Telefonica UK (who own O2).

All of the messages that go to and from the meters via the WAN go via DCC’s technical infrastructure.

The network has been designed to be secure. As a key piece of national infrastructure, that’s a basic requirement. Here’s a useful overview of how the security was designed, including some notes on trust and threat modelling.

Part of the design of the system is that there is no central database of meter readings or customer information. It’s all just messages between the suppliers and the meters. However, as they describe in a recently published report, the DCC do apparently hold some databases of what they call “system data”: metadata about individual meters and the messages sent to them.

The smart meter roll-out

It’s now mandatory for smart meters to be installed in domestic and smaller commercial properties in the UK. Companies can install SMETS-1 or SMETS-2 meters, but the rules were changed recently so only the newer meters count towards their individual targets. And energy companies can get fined if they don’t install them quickly enough.

Consumers are being encouraged to have smart meters fitted in existing homes, as meters are replaced, to provide them with more information on their usage and access to better tariffs, such as those that offer dynamic time-of-day pricing.

But there are also concerns around privacy and fears of energy supplies being remotely disconnected, which are making people reluctant to switch when given the choice. Trust is clearly an important part of achieving a successful rollout.

Ofgem have a handy guide to consumer rights relating to smart meters. Which? have an article about whether you have to accept a smart meter, and Energy UK and Citizens Advice have a one-page “data guide” that provides the key facts.

But smart meters aren’t being uniformly rolled out. For example they are not mandated for all commercial (non-domestic) properties. 

At the time of writing there are over 10 million smart meters connected via the DCC, with 70% of those being SMETS-2 meters. The Elexon dashboard for smart electricity meters estimates that the rollout of electricity meters is roughly 44% complete. There are also some official statistics about the rollout.

The future will hold much more fine-grained data about energy usage across the homes and businesses in the UK. But in the short-term there’s likely to be a continued mix of different meter types (dumb, AMR and smart) meaning that domestic and non-domestic usage will have differences in the quality and coverage of data due to differences in how smart meters are being rolled out.

Smart meters will give consumers greater choice in tariffs because the infrastructure can better deal with dynamic pricing. It will help to shift to a greener more efficient energy network because there is better data to help manage the network.

Access to the data infrastructure

Access to and use of the smart meter infrastructure is governed by the Smart Energy Code. Section I covers privacy.

The code sets out the roles and responsibilities of the various actors who have access to the network. That includes the infrastructure operators (e.g. the organisations looking after the power lines and cables) as well as the energy companies (e.g. those who are generating the energy) and the energy suppliers (e.g. the organisations selling you the energy). 

There is a public list of all of the organisations in each category and a summary of their licensing conditions that apply to smart meters.

The focus of the code is on those core actors. But there is an additional category of “Other Providers”. This is basically a miscellaneous group of other organisations that are not directly involved in the provision of energy as a utility, but which may have or require access to the data infrastructure.

These other providers include organisations that:

  • provide technology to energy companies who need to be able to design, test and build software against the smart meter network
  • offer services like switching and product recommendations
  • access the network on behalf of consumers, allowing them to directly access usage data in the home using devices, e.g. Hildebrand and its Glow device
  • provide other additional third-party services. This includes companies like Hildebrand and N3RGY that are providing value-added APIs over the core network

To be authorised to access the network you need to go through a number of stages, including an audit to confirm that you have the right security in place. This can take a long time to complete. Documentation suggests this might take upwards of 6 months.

There are also substantial annual costs for access to the network. This helps to make the infrastructure sustainable, with all users contributing to it. 

Data ecosystem map


As a summary, here’s the key points:

  • your in-home devices send and receive messages and data via the smart meter or controller installed in your home or business property
  • your in-home device might also be sending your data to other services, with your consent
  • messages to and from your meter are sent via a secure network operated by the DCC
  • the DCC provide APIs that allow authorised organisations to send and receive messages from that data infrastructure
  • the DCC doesn’t store any of the meter readings, but does collect metadata about the traffic over that network
  • organisations who have access to the infrastructure may store and use the data they can access, but generally need consent from users for detailed meter data
  • the level and type of access, e.g. what messages can be sent and received, may differ across organisations
  • your energy supplier uses the data it retrieves from the DCC to generate your bills, provide you with services, optimise the system, etc
  • the UK government has licensed the DCC to operate that national data infrastructure, with Ofgem regulating the system

At a high-level, the UK smart meter system is like a big federated database: the individual meters store and submit data, with access to that database being governed by the DCC. The authorised users of that network build and maintain their own local caches of data as required to support their businesses and customers.

The evolving ecosystem

This is a big complex piece of national data infrastructure. This makes it interesting to unpick as an example of real-world decisions around the design and governance of data access.

It’s also interesting as the ecosystem is evolving.

Changing role of the DCC

The DCC have recently published a paper called “Data for Good” which sets out their intention to offer a “system data exchange” (you should read that as “system data” exchange). This means providing access to the data they hold about meters and the messages sent to and from them. (There’s a list of these message types in a SEC code appendix).

The paper suggests that increased access to that data could be used in a variety of beneficial ways. This includes helping people in fuel poverty, or improving management of the energy network.

Encouragingly the paper talks about open and free access to data, which seems reasonable if data is suitably aggregated and anonymised. However the language is qualified in many places. DCC will presumably be incentivised by the existing ecosystem to reduce its costs and find other revenue sources. And their 5 year business development plan makes it clear that they see data services as a new revenue stream.

So time will tell.

The DCC is also required to improve efficiency and costs for operating the network to reduce burden on the organisations paying to use the infrastructure. This includes extending use of the network into other areas. For example to water meters or remote healthcare (see note at end of page 13).

Any changes to what data is provided, or how the network is used will require changes to the licence and some negotiation with Ofgem. As the licence is due to be renewed in 2025, then this might be laying groundwork for a revised licence to operate.

New intermediaries

In addition to a potentially changing role for the DCC, the other area in which the ecosystem is growing is via “Other Providers” that are becoming data intermediaries.

The infrastructure and financial costs of meeting the technical, security and audit requirements for direct access to the DCC network create a high barrier for third-parties wanting to provide additional services that use the data.

The DCC APIs and messaging infrastructure are also difficult to work with meaning that integration costs can be high. The DCC “Data for Good” report notes that direct integration “…is recognised to be challenging and resource intensive“.

There are a small but growing number of organisations, including Hildebrand, N3RGY, Smart Pear and Utiligroup, who see an opportunity to lower this barrier by providing value-added services over the DCC infrastructure. For example, simple JSON based APIs that simplify access to meter data.

Coupled with access to sandbox environments to support prototyping, this provides a simpler and cheaper API with which to integrate. Security remains important but the threat profiles and risks are different as API users have no direct access to the underlying infrastructure and only read-only access to data.

To comply with the governance of the existing system, the downstream user still needs to ensure they have appropriate consent to access data. And they need to be ready to provide evidence if the intermediary is audited.

The APIs offered by these new intermediaries are commercial services: the businesses are looking to do more than just cover their costs and will be hoping to generate significant margin through what is basically a reseller model. 

It’s worth noting that access to AMR meter data is also typically via commercial services, at least for non-domestic meters. The price per meter for data from smart meters currently seems lower, perhaps because it’s relying on a more standard, shared underlying data infrastructure.

As the number of smart meters grows I expect access to a cheaper and more modern API layer will become increasingly interesting for a range of existing and new products and services.

Lessons from Open Banking

From my perspective the major barrier to more innovative use of smart meter data is the existing data infrastructure. The DCC obviously recognises the difficulty of integration and other organisations are seeing potential for new revenue streams by becoming data intermediaries.

And needless to say, all of these new intermediaries have their own business models and bespoke APIs. Ultimately, while they may end up competing in different sectors or markets, or over quality of service, they’re all relying on the same underlying data and infrastructure.

In the finance sector, Open Banking has already demonstrated that a standardised set of APIs, licensing and approach to managing access and consent can help to drive innovation in a way that is good for consumers. 

There are clear parallels to be drawn between Open Banking, which increased access to banking data, and how access to smart meter data might be increased. It’s a very similar type of data: highly personal, transactional records. And can be used in very similar ways, e.g. account switching.

The key difference is that there’s no single source of banking transactions, so regulation was required to ensure that all the major banks adopted the standard. Smart meter data is already flowing through a single state-licensed monopoly.

Perhaps if the role of the DCC is changing, then they could also provide a simpler standardised API to access the data? Ofgem and DCC could work with the market to define this API as happened with Open Banking. And by reducing the number of intermediaries it may help to increase trust in how data is being accessed, used and shared?

If there is a reluctance to extend DCC’s role in this direction then an alternative step would be to recognise the role and existence of these new types of intermediary within the Smart Energy Code. That would allow their licence to use the network to include agreement to offer a common, core standard API, common data licensing terms and a shared approach to the collection and management of consent. Again, Ofgem, DCC and others could work with the market to define that API.

For me either of these approaches are the most obvious ways to carry the lessons and models from Open Banking into the energy sector. There are clearly many more aspects of the energy data ecosystem that might benefit from improved access to data, which is where initiatives like Icebreaker One are focused. But starting with what will become a fundamental part of the national data infrastructure seems like an obvious first step to me.

The other angle that Open Banking tackled was creating better access to data about banking products. The energy sector needs this too, as there’s no easy way to access data on energy supplier tariffs and products.

Examples of data ecosystem mapping

This blog post is basically a mood board showing some examples of how people are mapping data ecosystems. I wanted to record a few examples and highlight some of the design decisions that go into creating a map.

A data ecosystem consists of data infrastructure, and the people, communities and organisations that benefit from the value created by it. A map of that data ecosystem can help illustrate how data and value is created and shared amongst those different actors.

The ODI has published a range of tools and guidance on ecosystem mapping. Data ecosystem mapping is one of several approaches that are being used to help people design and plan data initiatives. A recent ODI report looks at these “data landscaping” tools with some useful references to other examples.

The Flow of My Voice

Joseph Wilk’s “The Flow of My Voice” highlights the many different steps through which his voice travels before being stored and served from a YouTube channel, and transcribed for others to read.

The emphasis here is on exhaustively mapping each step, with a representation of the processing at each stage. The text notes which organisation owns the infrastructure at each stage. The intent here is to help to highlight the loss of control over data as it passes through complex interconnected infrastructures. This means a lot of detail.

Data Archeogram: mapping the datafication of work

Armelle Skatulski has produced a “Data Archeogram” that highlights the complex range of data flows and data infrastructure that are increasingly being used to monitor people in the workplace. Starting from various workplace and personal data collection tools, it rapidly expands out to show a wide variety of different systems and uses of data.

Similar to Wilk’s map this diagram is intended to help promote critical review and discussion about how this data is being accessed, used and shared. But it necessarily sacrifices detail around individual flows in an attempt to map out a much larger space. I think the use of patent diagrams to add some detail is a nice touch.

Retail and EdTech data flows

The Future of Privacy Forum recently published some simple data ecosystem maps to illustrate local and global data flows using the Retail and EdTech sectors as examples.

These maps are intended to help highlight the complexity of real world data flows, to help policy makers understand the range of systems and jurisdictions that are involved in sharing and storing personal data.

Because these maps are intended to highlight cross-border flows of data they are presented as if they were an actual map of routes between different countries and territories. This is something that is less evident in the previous examples. These diagrams aren’t showing any specific system and illustrate a typical, but simplified data flow.

They emphasise the actors and flows of different types of data in a geographical context.

Data privacy project: Surfing the web from a library computer terminal

The Data Privacy Project “teaches NYC library staff how information travels and is shared online, what risks users commonly encounter online, and how libraries can better protect patron privacy”. As part of their training materials they have produced a simple ecosystem map and some supporting illustrations to help describe the flow of data that happens when someone is surfing the web in a library.

Again, the map shows a typical rather than a real-world system. It’s useful to contrast this with the first example, which is much more detailed. For an educational tool, a more summarised view is better for building understanding.

The choice of which actors are shown also reflects its intended use. It highlights web hosts, ISPs and advertising networks, but has less to say about the organisations whose websites are being used and how they might use data they collect.

Agronomy projects

This ecosystem map, which I produced for a project we did at the ODI, has a similar intended use.

It provides a summary of a typical data ecosystem we observed around some Gates Foundation funded agronomy projects. The map is intended as a discussion and educational tool to help Programme Officers reflect on the ecosystem within which their programmes are embedded.

This map uses features of Kumu to encourage exploration, providing summaries for each of the different actors in the map. This makes it more dynamic than the previous examples.

Following the methodology we were developing at the ODI it also tries to highlight different types of value exchange: not just data, but also funding, insights, code, etc. These were important inputs and outputs to these programmes.

OpenStreetMap Ecosystem

In contrast to most of the earlier examples, this partial map of the OSM ecosystem tries to show a real-world ecosystem. It would be impossible to properly map the full OSM ecosystem so this is inevitably incomplete and increasingly out of date.

The decision about what detail to include was driven by the goals of the project. The intent was to try and illustrate some of the richness of the ecosystem whilst highlighting how a number of major commercial organisations were participants in that ecosystem. This was not evident to many people until recently.

The map mixes together broad categories of actors, e.g. “End Users” and “Contributor Community” alongside individual commercial companies and real-world applications. The level of detail is therefore varied across the map.

Governance design patterns

The final example comes from this Sage Bionetworks paper. The paper describes a number of design patterns for governing the sharing of data. It includes diagrams of some general patterns as well as real-world applications.

The diagrams show relatively simple data flows, but they are drawn differently to some of the previous examples. Here the individual actors aren’t directly shown as the endpoints of those data flows. Instead, the data stewards, users and donors are depicted as areas on the map. This helps to emphasise where data is crossing governance boundaries and where its use is informed by different rules and agreements. Those agreements are also highlighted on the map.

Like the Future of Privacy ecosystem maps, the design is being used to help communicate some important aspects of the ecosystem.

12 ways to improve the GDS guidance on reference data publishing

GDS have published some guidance about publishing reference data for reuse across government. I’ve had a read and it contains a good set of recommendations. But some of them could be clearer. And I feel like some important areas aren’t covered. So I thought I’d write this post to capture my feedback.

Like the original guidance my feedback largely ignores considerations of infrastructure or tools. That’s quite a big topic and recommendations in those areas are unlikely to be applicable solely to reference data.

The guidance also doesn’t address issues around data sharing, such as privacy or regulatory compliance. I’m also going to gloss over that. Again, not because it’s not important, but because those considerations apply to sharing and publishing any form of data, not just reference data.

Here’s the list of things I’d revise or add to this guidance:

  1. The guidance should recommend that reference data be as open as possible, to allow it to be reused as broadly as possible. Reference data that doesn’t contain personal information should be published under an open licence. Licensing is important even for cross-government sharing because other parts of government might be working with private or third sector organisations that also need to be able to use the reference data. This is the biggest omission for me.
  2. Reference data needs to be published over the long term so that other teams can rely on it and build it into their services and workflows. When developing an approach for publishing reference data, consider what investment needs to be made for this to happen. That investment will need to cover people and infrastructure costs. If you can’t do that, then at least indicate how long you expect to be publishing this data. Transparent stewardship can build trust.
  3. For reference data to be used, it needs to be discoverable. The guide mentions creating metadata and doing SEO on dataset pages, but doesn’t include other suggestions such as using Schema.org Dataset metadata (see the sketch after this list) or even just depositing metadata in data.gov.uk.
  4. The guidance should recommend that stewardship of reference data is part of a broader data governance strategy. While you may need to identify stewards for individual datasets, governance of reference data should be part of broader data governance within the organisation. It’s not a separate activity. Implementing that wider strategy shouldn’t block making early progress to open up data, but consider reference data alongside other datasets.
  5. Forums for discussing how reference data is published should include external voices. The guidance suggests creating a forum for discussing reference data, involving people from across the organisation. But the intent is to publish data so it can be reused by others. This type of forum needs external voices too.
  6. The guidance should recommend documenting provenance of data. It notes that reference data might be created from multiple sources, but does not encourage recording or sharing information about its provenance. That’s important context for reusers.
  7. The guide should recommend documenting how identifiers are assigned and managed. The guidance has quite a bit of detail about adding unique identifiers to records. It should also encourage those publishing reference data to document how and when they create identifiers for things, and what types of things will be identified. Mistakes in understanding the scope and coverage of reference data can have huge impacts.
  8. There is a recommendation to allow users to report errors or provide feedback on a dataset. That should be extended to include a recommendation that the data publisher makes known errors clear to other users, as well as transparency around when individual errors might be fixed. Reporting an error without visibility of the process for fixing data is frustrating.
  9. GDS might recommend an API first approach, but reference data is often used in bulk. So there should be a recommendation to provide bulk access to data, not just an API. It might also be cheaper and more sustainable to share data in this way.
  10. The guidance on versioning should include record level metadata. The guidance contains quite a bit of detail around versioning of datasets. While useful, it should also include suggestions to include status codes and timestamps on individual records, to simplify integration and change monitoring. Change reporting is an important but detailed topic.
  11. While the guidance doesn’t touch on infrastructure, I think it would be helpful for it to recommend that platforms and tools used to manage reference data are open sourced. This will help others to manage and publish their own reference data, and build alignment around how data is published.
  12. Finally, if multiple organisations are benefiting from use of the same reference data then encouraging exploration of collaborative maintenance might help to reduce costs for maintaining data, as well as improving its quality. This can help to ensure that data infrastructure is properly supported and invested in.
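
To make point 3 a bit more concrete, here’s a minimal sketch of the kind of Schema.org Dataset description that could be embedded in a dataset’s landing page as JSON-LD. All of the names and URLs are placeholders I’ve invented for illustration, not a real government dataset.

```python
import json

# A minimal, hypothetical Schema.org Dataset description. The names and URLs
# are placeholders for illustration, not a real government dataset.
dataset_metadata = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Register of example reference codes",
    "description": "Reference data listing the codes used across example services.",
    "license": "https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/",
    "publisher": {"@type": "Organization", "name": "Example Department"},
    "distribution": [
        {
            "@type": "DataDownload",
            "encodingFormat": "text/csv",
            "contentUrl": "https://example.gov.uk/reference-codes.csv",
        }
    ],
}

# Embedding this as JSON-LD in a <script type="application/ld+json"> element
# on the dataset page lets search engines and aggregators index it.
print(json.dumps(dataset_metadata, indent=2))
```

Even this much structured metadata makes a dataset far easier to find than relying on page copy and SEO alone.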

OSM Queries

For the past month I’ve been working on a small side project which I’m pleased to launch for Open Data Day 2021.

I’ve long been a fan of OpenStreetMap. I’ve contributed to the map, coordinated a local crowd-mapping project and used OSM tiles to help build web based maps. But I’ve only done a small amount of work with the actual data. Not much more than running a few Overpass API queries and playing with some of the exports available from Geofabrik.

I recently started exploring the Overpass API again to learn how to write useful queries. I wanted to see if I could craft some queries to help me contribute more effectively. For example by helping me to spot areas that might need updating. Or identify locations where I could add links to Wikidata.

There’s quite a bit of documentation about the Overpass API and the query language it uses, which is called Overpass QL. But I didn’t find it that accessible. The documentation is more of a reference than a tutorial.

And, while there are quite a few example queries to be found across the OSM wiki and other websites, there isn’t always a great deal of context explaining how they work or when you might use them.

So I’ve been working on two things to address what I think is a gap in helping people learn how to get more from the OpenStreetMap API.

overpass-doc

The first is a simple tool that will take a collection of Overpass queries and build a set of HTML pages from them. It’s based on a similar tool I built for SPARQL queries a few years ago. Both are inspired by Javadoc and other code documentation tools.

The idea was to encourage the publication of collections of useful, documented queries. E.g. to be shared amongst members of a community or people working on a project. The OSM wiki can be used to share queries, but it might not always be a suitable home for this type of content.

The tool is still at quite an early stage. It’s buggy, but functional.

To test it out I’ve been working on my own collection of Overpass queries. I initially started to pull together some simple examples that illustrated a few features of the language. But then realised that I should just use the tool to write a proper tutorial. So that’s what I’ve been doing for the last week or so.

Announcing OSM Queries

OSM Queries is the result. As of today the website contains four collections of queries. The main collection of queries is a 26 part tutorial that covers the basic features of Overpass QL.

By working through the tutorial you’ll learn:

  • some basics of the OpenStreetMap data model
  • how to write queries to extract nodes, ways and relations from the OSM database using a variety of different methods
  • how to filter data to extract just the features of interest
  • how to write spatial queries to find features based on whether they are within specific areas or are within proximity to one another
  • how to output data as CSV and JSON for use in other tools

Every query in the tutorial has its own page containing an embedded syntax highlighted version of the query. This makes them easier to share with others. You can click a button to load and run the query using the Overpass Turbo IDE. So you can easily view the results and tinker with the query.

I think the tutorial covers all the basic options for querying and filtering data. Many of the queries include comments that illustrate variations of the syntax, encouraging you to further explore the language.
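
To give a flavour of the sort of thing the tutorial covers, here’s a minimal sketch of running a basic Overpass QL query from Python rather than from the Overpass Turbo IDE. The query and bounding box are my own illustrative choices (a rough box around the Uluru area), not one of the tutorial’s queries, and the results will vary as the map changes.

```python
import requests

# Public Overpass API endpoint; the tutorial itself runs queries via Overpass Turbo.
OVERPASS_URL = "https://overpass-api.de/api/interpreter"

# Find named place nodes inside a rough bounding box around Uluru and
# output the results as CSV. Overpass bounding boxes are written as
# (south, west, north, east).
query = """
[out:csv(name, ::lat, ::lon)];
node["place"](-25.5, 130.8, -25.2, 131.2);
out;
"""

response = requests.post(OVERPASS_URL, data={"data": query})
response.raise_for_status()
print(response.text)
```

The same query pasted into Overpass Turbo will show the results on a map, which is usually a friendlier way to explore while you’re learning.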

I’ve also been compiling an Overpass QL syntax reference that provides a more concise view of some of the information in the OSM wiki. There are a lot of advanced features (like this) which I will likely cover in a separate tutorial.

Writing a tutorial against the live OpenStreetMap database is tricky. The results can change at any time. So I opted to focus on demonstrating the functionality using mostly natural features and administrative boundaries.

In the end I chose to focus on an area around Uluru in Australia. Not just because it provides an interesting and stable backdrop for the tutorial. But because I also wanted to encourage a tiny bit of reflection in the reader about what gets mapped, who does the mapping, and how things get tagged.

A bit of map art, and a request

The three other query collections are quite small:

I ended up getting a bit creative with the MapCSS queries.

For example, to show off the functionality I’ve written a query that shows the masonic symbol hidden in the streets of Bath, styled Brøndby Haveby like a bunch of flowers and the Lotus Bahai Temple as, well, a lotus flower.

These were all done by styling the existing OSM data. No edits were done to change the map. I wouldn’t encourage you to do that.

I’ve put all the source files and content for the website into the public domain so you’re free to adapt, use and share however you see fit.

While I’ll continue to improve the tutorial and add some more examples, I’m also hoping that I can encourage others to contribute to the site. If you have useful queries that could be added to the site then submit them via Github. I’ve provided a simple issue template to help you do that.

I’m hoping this provides a useful resource for people in the OSM community and that we can collectively improve it over time. I’d love to get some feedback, so feel free to drop me an email, comment on this post or message me on twitter.

And if you’ve never explored the data behind OpenStreetMap then Open Data Day is a great time to dive in. Enjoy.

Bath Historical Images

One of my little side projects is to explore historical images and maps of Bath and the surrounding areas. I like understanding the contrast between how Bath used to look and how it is today. It’s grown and changed a huge amount over the years. It gives me a strong sense of place and history.

There is a rich archive of photographs and images of the city and area that were digitised for the Bath in Time project. Unfortunately the council has chosen to turn this archive into a, frankly terrible, website that is being used to sell over-priced framed prints.

The website has limited navigation and there’s no access to higher resolution imagery. Older versions of the site had better navigation and access to some maps.

The current version looks like it’s based on a default ecommerce theme for WordPress rather than being designed to show off the richness of the 40,000 images it contains. Ironically the @bathintime twitter account tweets out higher resolution images than you can find on the website.

This is a real shame. Frankly I can’t imagine there’s a huge amount of revenue being generated from these prints.

If the metadata and images were published under a more open licence (even with a non-commercial limitation) then it would be more useful for people like me who are interested in local history. We might even be able to help build useful interfaces. I would happily invest time in cataloguing images and making something useful with them. In fact, I have been.

In lieu of a proper online archive, I’ve been compiling a list of publicly available images from other museums and collections. So far, I’ve sifted through:

I’ve only found around 230 images (including some duplicates across collections) so far, but there are some interesting items in there. Including some images of old maps.

I’ve published the list as open data.

So you can take the metadata and links and explore them for yourself. I thought they may be useful for anyone looking to reuse images in their research or publications.

I’m in the process of adding geographic coordinates to each of the images, so they can be placed on the map. I’m approaching that by geocoding them as if they were produced using a mobile phone or camera. For example, an image of the abbey won’t have the coordinates of the abbey associated with it; it’ll have the coordinates of wherever the artist was standing when they painted the picture.

This is already showing some interesting common views over the years. I’ve included a selection below.

Views from the river, towards Pulteney Bridge

Southern views of the city

Looking to the east across abbey churchyard

Views of the Orange Grove and Abbey

It’s really interesting to be able to look at the same locations over time. Hopefully that gives a sense of what could be done if more of the archives were made available.

There’s more documentation on the dataset if you want to poke around. If you know of other collections of images I should look at, then let me know.

And if you have metadata or images to release under an open licence, or have archives you want to share, then get in touch as I may be able to help.

The Common Voice data ecosystem

In 2021 I’m planning to spend some more time exploring different data ecosystems with an emphasis on understanding the flows of data within and between different data initiatives, the tools they use to collect and share data, and the role of collaborative maintenance and open standards.

One project I’ve been looking at this week is Mozilla Common Voice. It’s an initiative that is producing a crowd-sourced, public domain dataset that can be used to train voice recognition applications. It’s the largest dataset of its type, consisting of over 7,000 hours of audio across 60 languages.

It’s a great example of communities working to create datasets that are more open and representative. Helping to address biases and supporting the creation of more equitable products and services. I’ve been using it in my recent talks on collaborative maintenance, but have had a chance to dig a bit deeper this week.

The main interface allows contributors to either record their voice, by reading short pre-prepared sentences, or validate existing contributions by listening to existing recordings and confirming that they match the script.

Behind the scenes is a more complicated process, which I found interesting.

It further highlights the importance of both open source tooling and openly licensed content in supporting the production of open data. It’s also another example of how choices around licensing can create friction between open projects.

The data pipeline

Essentially, the goal of the Common Voice project is to create new releases of its dataset. With each release including more languages and, for each language, more validated recordings.

The data pipeline that supports that consists of the following basic steps. (There may be other stages involved in the production of the output corpus, but I’ve not dug further into the code and docs.)

  1. Localisation. The Common Voice web application first has to be localised into the required language. This is coordinated via Mozilla Pontoon, with a community of contributors submitting translations licensed under the Mozilla Public License 2.0. Pontoon is open source and can be used for other non-Mozilla applications. When the localisation gets to 95% the language can be added to the website and the process can move to the next stage.
  2. Sentence Collection. Common Voice needs short sentences for people to read. These sentences need to be in the public domain (e.g. via a CC0 waiver). A minimum of 5,000 sentences is required before a language can be added to the website. The content comes from people submitting and validating sentences via the sentence collector tool. The text is also drawn from public domain sources; there’s a sentence extractor tool that can pull content from Wikipedia and other sources. For bulk imports the Mozilla team needs to check for licence compatibility before adding text. All of this means that the source texts for each language are different.
  3. Voice Donation. Contributors read the provided sentences to add their voice to the dataset. The reading and validation steps are separate microtasks. Contributions are gamified and there are progress indicators for each language.
  4. Validation. Submitted recordings go through retrospective review to assess their quality. This allows for some moderation, allowing contributors to flag recordings that are offensive, incorrect or are of poor quality. Validation tasks are also gamified. In general there are more submitted recordings than validations. Clips need to be reviewed by two separate users for them to be marked as valid (or invalid).
  5. Publication. The corpus consists of valid, invalid and “other” (not yet validated) recordings, split into development, training and test datasets. There are separate datasets for each language.

There is an additional dataset which consists of 14 single word sentences (the ten digits, “yes”, “no”, “hey”, “Firefox”) and is published separately. Steps 2-4 look similar though.
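
To make the publication step a little more tangible, here’s a minimal sketch of inspecting a downloaded release for a single language. The directory layout and file names reflect my understanding of the per-language releases (tab-separated files for the validated, invalidated, other, train, dev and test splits, plus a clips directory of audio); treat them as assumptions rather than a definitive description of the format.

```python
import csv
from pathlib import Path

# Path to one language folder from an extracted Common Voice release.
# "cv-corpus/cy" (Welsh) is a placeholder; point this at wherever you unpacked it.
LANGUAGE_DIR = Path("cv-corpus/cy")

SPLITS = ["validated.tsv", "invalidated.tsv", "other.tsv",
          "train.tsv", "dev.tsv", "test.tsv"]

def count_clips(tsv_name: str) -> int:
    """Count the rows (one per recorded clip) in one of the release's TSV files."""
    path = LANGUAGE_DIR / tsv_name
    with path.open(newline="", encoding="utf-8") as f:
        return sum(1 for _ in csv.DictReader(f, delimiter="\t"))

for split in SPLITS:
    print(f"{split}: {count_clips(split)} clips")
```

Comparing the counts across splits gives a quick sense of how much of a language’s recordings have actually been validated.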

Some observations

What should be clear is that there are multiple stages, each with their own thresholds for success.

To get a language into the project you need to translate around 600 text fragments from the application and compile a corpus of at least 5,000 sentences before the real work of collecting the voice dataset can begin.

That work requires input from multiple, potentially overlapping communities:

  • the community of translators, working through Pontoon
  • the community of writers, authors, content creators creating public domain content that can be reused in the service
  • the Common Voice contributors submitting additional sentences
  • the contributors recording their voice
  • the contributors validating other recordings
  • the teams at Mozilla, coordinating and supporting all of the above

As the Common Voice application and configuration is open source, it is easy to include it in Pontoon to allow others to contribute to its localisation. To build representative datasets, your tools need to work for all the communities that will be using them.

The availability of public domain text in the source languages is clearly a contributing factor in getting a language added to the site and ultimately included in the dataset.

So the adoption of open licences and the richness of the commons in those languages will be a factor in determining how rich the voice dataset might be for that language. And, hence, how easy it is to create good voice and text applications that can support those communities.

You can clearly create a new dedicated corpus, as people have done for Hakha Chin. But the strength and openness of one area of the commons will impact other areas. It’s all linked.

While there are different communities involved in Common Voice, it’s clear from these reports from communities working on Hakha Chin and Welsh that, in some cases, it’s the same community that is working across the whole process.

Every language community is working to address its own needs: “We’re not dependent on anyone else to make this happen…We just have to do it”.

That’s the essence of shared infrastructure. A common resource that supports a mixture of uses and communities.

The decisions about which licences to use are, as ever, really important. At present Common Voice only takes a few sentences from individual pages of the larger Wikipedia instances. As I understand it, this is because Wikipedia content is not public domain, so cannot be used wholesale. But small extracts should be covered by fair use?

I would expect that those interested in building and maintaining their language specific instances of wikipedia have overlaps with those interested in making voice applications work in that same language. Incompatible licensing can limit the ability to build on existing work.

Regardless, the Mozilla and Wikimedia Foundations have made licensing choices that reflect the needs of their communities and the goals of their projects. That’s an important part of building trust. But, as ever, those licensing choices have subtle impacts across the wider ecosystem.