The UK Smart Meter Data Ecosystem

Disclaimer: this blog post is about my understanding of the UK’s smart meter data ecosystem and contains some opinions about how it might evolve. These do not in any way reflect those of Energy Sparks of which I am a trustee.

This blog post is an introduction to the UK’s smart meter data ecosystem. It sketches out some of the key pieces of data infrastructure with some observations around how the overall ecosystem is evolving.

It’s a large, complex system so this post will only touch on the main elements. Pointers to more detail are included along the way.

If you want a quick reference with more diagrams, then this UK government document, “Smart Meters, Smart Data, Smart Growth”, is a good start.

Smart meter data infrastructure

Smart meters and meter readings

Data about your home or business energy usage used to be collected by someone coming to read the actual numbers displayed on the front of your meter. In some cases that’s still how the data is collected. It’s just that today you might be entering those readings into a mobile or web application provided by your supplier. In between those readings, your supplier will be estimating your usage.

This situation improved with the introduction of AMR (“Automated Meter Reading”) meters which can connect via radio to an energy supplier. The supplier can then read your meter automatically, to get basic information on your usage. After receiving a request the meter can broadcast the data via radio signal. These meters are often only installed in commercial properties.

Smart meters are a step up from AMR meters. They connect via a Wide Area Network (WAN) rather than radio, support two way communications and provide more detailed data collection. This means that when you have a smart meter your energy supplier can send messages to the meter, as well as taking readings from it. These messages can include updated tariffs (e.g. as you switch supplier or if you are on a dynamic tariff) or a notification to say you’ve topped up your meter, etc.

The improved connectivity and functionality means that readings can be collected more frequently and are much more detailed. Half hourly usage data is the standard. A smart meter can typically store around 13 months of half-hourly usage data. 
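To give a rough sense of scale, here’s a small sketch that estimates how many half-hourly readings a meter might be holding. The 13 month figure is the one quoted above; treating every month as 30 days is my own simplifying assumption.

```python
# Rough estimate of how many half-hourly readings a smart meter stores.
READINGS_PER_DAY = 24 * 2   # one reading every 30 minutes
MONTHS_STORED = 13          # figure quoted in the text above
DAYS_PER_MONTH = 30         # assumption: simplified month length


def stored_readings(months: int = MONTHS_STORED) -> int:
    """Return the approximate number of half-hourly readings held."""
    return months * DAYS_PER_MONTH * READINGS_PER_DAY


print(stored_readings())
```

So a single meter is holding somewhere in the region of 18,000 to 19,000 data points per fuel.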

The first generation of smart meters are known as SMETS-1 meters. The latest meters are SMETS-2.

Meter identifiers and registers

Meters have unique identifiers.

For gas meters the identifiers are called MPRNs. I believe these are allocated in blocks to gas providers to be assigned to meters as they are installed.

For electricity meters, these identifiers are called MPANs. Electricity meters also have a serial number. I believe MPANs are assigned by the individual regional electricity network operators and that this information is used to populate a national database of installed meters.

From a consumer point of view, services like Find My Supplier will allow you to find your MPRN and energy suppliers.

Connectivity and devices in the home

If you have a smart meter installed then your meters might talk directly to the WAN, or access it via a separate controller that provides the necessary connectivity. 

But within the home, devices will talk to each other using Zigbee, which is a low power internet of things protocol. Together they form what is often referred to as the “Home Area Network” (HAN).

It’s via the home network that your “In Home Display” (IHD) can show your current and historical energy usage as it can connect to the meter and access the data it stores. Your electricity usage is broadcast to connected devices every 10 seconds, while gas usage is broadcast every 30 minutes.

Your IHD can show your energy consumption in various ways, including how much it is costing you. This relies on your energy supplier sending your latest tariff information to your meter.
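As an illustration of why the tariff matters, here’s a minimal sketch of how a day’s cost could be derived from half-hourly readings. The `Tariff` class, the flat unit rate and the standing charge values are all made up for the example. Real tariffs, and whatever calculation an IHD actually performs, will be more involved.

```python
from dataclasses import dataclass


@dataclass
class Tariff:
    """A simple flat tariff. All values used below are invented examples."""
    unit_rate_pence: float        # pence per kWh
    standing_charge_pence: float  # pence per day


def daily_cost_pounds(half_hourly_kwh: list[float], tariff: Tariff) -> float:
    """Cost of one day's usage in pounds, from half-hourly kWh readings."""
    usage_pence = sum(half_hourly_kwh) * tariff.unit_rate_pence
    return round((usage_pence + tariff.standing_charge_pence) / 100, 2)


# 48 half-hourly readings of 0.25 kWh each, i.e. 12 kWh for the day
readings = [0.25] * 48
example_tariff = Tariff(unit_rate_pence=20.0, standing_charge_pence=25.0)
print(daily_cost_pounds(readings, example_tariff))
```

If the meter doesn’t have the latest tariff then the usage figure is still right, but the cost shown will be wrong.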

As this article by Bulb highlights, the provision of an IHD and its basic features is required by law. Research showed that IHDs were more accessible and nudged people towards being more conscious of their energy usage. The high-frequency updates from the meter to connected devices make it easier, for example, for you to identify which devices or uses contribute most to your bill.

Your energy supplier might provide other apps and services that provide you with insights, via the data collected via the WAN. 

But you can also connect other devices into the home network provided by your smart meter (or data controller). One example is a newer category of IHD called a “Consumer Access Device” (CAD), e.g. the Glow.

These devices connect via Zigbee to your meter and via Wifi to a third-party service, to which they send your meter readings. For the Glow device, that service is operated by Hildebrand.

These third party services can then provide you with access to your energy usage data via mobile or web applications. Or even via API. Otherwise as a consumer you need to access data via whatever methods your energy supplier supports.

The smart meter network infrastructure

SMETS-1 meters connected to a variety of different networks. This meant that if you switched suppliers then they frequently couldn’t access your meter because it was on a different network. So meters needed to be replaced. And, even if they were on the same network, differences in technical infrastructure meant the meters might lose functionality.

SMETS-2 meters don’t have this issue as they all connect via a shared Wide Area Network (WAN). There are two of these covering the north and south of the country.

While SMETS-2 meters are better than previous models, they still have all of the issues of any Internet of Things device: problems with connectivity in rural areas, need for power, varied performance based on manufacturer, etc.

Some SMETS-1 meters are also now being connected to the WAN. 

Who operates the infrastructure?

The Data Communications Company (DCC) is a state-licensed monopoly that operates the entire UK smart meter network infrastructure. It’s a wholly-owned subsidiary of Capita. Their current licence runs until 2025.

DCC subcontracted provision of the WAN to support connectivity of smart meters to two regional providers. In the North of England and Scotland that provider is Arqiva. In the rest of England and Wales it is Telefonica UK (who own O2).

All of the messages that go to and from the meters via the WAN go via DCC’s technical infrastructure.

The network has been designed to be secure. As a key piece of national infrastructure, that’s a basic requirement. Here’s a useful overview of how the security was designed, including some notes on trust and threat modelling.

Part of the design of the system is that there is no central database of meter readings or customer information. It’s all just messages between the suppliers and the meters. However, as they describe in a recently published report, the DCC do apparently have some databases of what they call “system data”. This is the metadata about individual meters and the messages sent to them.

The smart meter roll-out

It’s mandatory for smart meters to now be installed in domestic and smaller commercial properties in the UK. Companies can install SMETS-1 or SMETS-2 meters, but the rules were changed recently so only newer meters count towards their individual targets. And energy companies can get fined if they don’t install them quickly enough.

Consumers are being encouraged to have smart meters fitted in existing homes, as meters are replaced, to provide them with more information on their usage and access to better tariffs, such as those that offer dynamic time of day pricing.

But there are also concerns around privacy and fears of energy supplies being remotely disconnected, which are making people reluctant to switch when given the choice. Trust is clearly an important part of achieving a successful rollout.

Ofgem have a handy guide to consumer rights relating to smart meters. Which? have an article about whether you have to accept a smart meter, and Energy UK and Citizens Advice have a 1 page “data guide” that provides the key facts.

But smart meters aren’t being uniformly rolled out. For example they are not mandated for all commercial (non-domestic) properties. 

At the time of writing there are over 10 million smart meters connected via the DCC, with 70% of those being SMETS-2 meters. The Elexon dashboard for smart electricity meters estimates that the rollout of electricity meters is roughly 44% complete. There are also some official statistics about the rollout.

The future will hold much more fine-grained data about energy usage across the homes and businesses in the UK. But in the short-term there’s likely to be a continued mix of different meter types (dumb, AMR and smart) meaning that domestic and non-domestic usage will have differences in the quality and coverage of data due to differences in how smart meters are being rolled out.

Smart meters will give consumers greater choice in tariffs because the infrastructure can better deal with dynamic pricing. They will also help the shift to a greener, more efficient energy network because there is better data to help manage the network.

Access to the data infrastructure

Access to and use of the smart meter infrastructure is governed by the Smart Energy Code. Section I covers privacy.

The code sets out the roles and responsibilities of the various actors who have access to the network. That includes the infrastructure operators (e.g. the organisations looking after the power lines and cables) as well as the energy companies (e.g. those who are generating the energy) and the energy suppliers (e.g. the organisations selling you the energy). 

There is a public list of all of the organisations in each category and a summary of their licensing conditions that apply to smart meters.

The focus of the code is on those core actors. But there is an additional category of “Other Providers”. This is basically a miscellaneous group of other organisations that are not directly involved in provision of energy as a utility, but which may have or require access to the data infrastructure.

These other providers include organisations that:

  • provide technology to energy companies who need to be able to design, test and build software against the smart meter network
  • offer services like switching and product recommendations
  • access the network on behalf of consumers, allowing them to directly access usage data in the home using devices, e.g. Hildebrand and its Glow device
  • provide other additional third-party services. This includes companies like Hildebrand and N3RGY that are providing value-added APIs over the core network

To be authorised to access the network you need to go through a number of stages, including an audit to confirm that you have the right security in place. This can take a long time to complete. Documentation suggests this might take upwards of 6 months.

There are also substantial annual costs for access to the network. This helps to make the infrastructure sustainable, with all users contributing to it. 

Data ecosystem map

As a summary, here are the key points:

  • your in-home devices send and receive messages and data via the smart meter or controller installed in your home or business property
  • your in-home device might also be sending your data to other services, with your consent
  • messages to and from your meter are sent via a secure network operated by the DCC
  • the DCC provide APIs that allow authorised organisations to send and receive messages from that data infrastructure
  • the DCC doesn’t store any of the meter readings, but do collect metadata about the traffic over that network
  • organisations that have access to the infrastructure may store and use the data they can access, but generally need consent from users for detailed meter data
  • the level and type of access, e.g. what messages can be sent and received, may differ across organisations
  • your energy supplier uses the data they retrieve from the DCC to generate your bills, provide you with services, optimise the system, etc
  • the UK government has licensed the DCC to operate that national data infrastructure, with Ofgem regulating the system

At a high-level, the UK smart meter system is like a big federated database: the individual meters store and submit data, with access to that database being governed by the DCC. The authorised users of that network build and maintain their own local caches of data as required to support their businesses and customers.
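To make the federated database analogy a bit more concrete, here’s a toy model in code. It is only a sketch of the analogy: the class names (`Meter`, `Network`, `Supplier`) and methods are invented for illustration and bear no relation to the real DCC interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class Meter:
    """A meter holds its own readings; there is no central copy."""
    meter_id: str
    readings: list[float] = field(default_factory=list)


class Network:
    """Hypothetical stand-in for the DCC: routes requests, stores no readings."""

    def __init__(self) -> None:
        self._meters: dict[str, Meter] = {}
        self._authorised: set[str] = set()
        # Metadata about traffic only: (user, meter_id) pairs
        self.message_log: list[tuple[str, str]] = []

    def register_meter(self, meter: Meter) -> None:
        self._meters[meter.meter_id] = meter

    def authorise(self, user: str) -> None:
        self._authorised.add(user)

    def request_readings(self, user: str, meter_id: str) -> list[float]:
        """Pass a request through to the meter if the user is authorised."""
        if user not in self._authorised:
            raise PermissionError(f"{user} is not authorised")
        self.message_log.append((user, meter_id))
        return list(self._meters[meter_id].readings)


class Supplier:
    """An authorised user that keeps its own local cache of readings."""

    def __init__(self, name: str, network: Network) -> None:
        self.name = name
        self.network = network
        self.cache: dict[str, list[float]] = {}

    def refresh(self, meter_id: str) -> None:
        self.cache[meter_id] = self.network.request_readings(self.name, meter_id)


network = Network()
network.register_meter(Meter("meter-1", [0.2, 0.3]))
network.authorise("supplier-a")

supplier = Supplier("supplier-a", network)
supplier.refresh("meter-1")
print(supplier.cache)
print(network.message_log)
```

The readings end up in the meter and in the supplier’s cache, while the network in the middle only accumulates a log of who asked for what.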

The evolving ecosystem

This is a big complex piece of national data infrastructure. This makes it interesting to unpick as an example of real-world decisions around the design and governance of data access.

It’s also interesting as the ecosystem is evolving.

Changing role of the DCC

The DCC have recently published a paper called “Data for Good” which sets out their intention to offer a “system data exchange” (you should read that as “system data” exchange). This means providing access to the data they hold about meters and the messages sent to and from them. (There’s a list of these message types in a Smart Energy Code appendix).

The paper suggests that increased access to that data could be used in a variety of beneficial ways. This includes helping people in fuel poverty, or improving management of the energy network.

Encouragingly the paper talks about open and free access to data, which seems reasonable if data is suitably aggregated and anonymised. However the language is qualified in many places. DCC will presumably be incentivised by the existing ecosystem to reduce its costs and find other revenue sources. And their 5 year business development plan makes it clear that they see data services as a new revenue stream.

So time will tell.

The DCC is also required to improve efficiency and costs for operating the network to reduce burden on the organisations paying to use the infrastructure. This includes extending use of the network into other areas. For example to water meters or remote healthcare (see note at end of page 13).

Any changes to what data is provided, or how the network is used, will require changes to the licence and some negotiation with Ofgem. As the licence is due to be renewed in 2025, this might be laying groundwork for a revised licence to operate.

New intermediaries

In addition to a potentially changing role for the DCC, the other area in which the ecosystem is growing is via “Other Providers” that are becoming data intermediaries.

The infrastructure and financial costs of meeting the technical, security and audit requirements required for direct access to the DCC network creates a high barrier for third-parties wanting to provide additional services that use the data. 

The DCC APIs and messaging infrastructure are also difficult to work with, meaning that integration costs can be high. The DCC “Data for Good” report notes that direct integration “…is recognised to be challenging and resource intensive”.

There are a small but growing number of organisations, including Hildebrand, N3RGY, Smart Pear and Utiligroup, who see an opportunity to lower this barrier by providing value-added services over the DCC infrastructure. For example, simple JSON based APIs that simplify access to meter data.

Coupled with access to sandbox environments to support prototyping, this provides a simpler and cheaper API with which to integrate. Security remains important but the threat profiles and risks are different as API users have no direct access to the underlying infrastructure and only read-only access to data.
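To show why a simple JSON API is attractive, here’s a sketch of the kind of code a downstream user might write. The response format is entirely hypothetical. I’ve invented the field names (`meter`, `unit`, `readings`, `start`, `value`) for illustration, and each intermediary’s real API will differ.

```python
import json
from collections import defaultdict
from datetime import datetime

# Hypothetical response body; real intermediary APIs will differ.
SAMPLE_RESPONSE = """
{
  "meter": "example-meter",
  "unit": "kWh",
  "readings": [
    {"start": "2021-03-01T00:00:00", "value": 0.5},
    {"start": "2021-03-01T00:30:00", "value": 0.25},
    {"start": "2021-03-02T00:00:00", "value": 1.0}
  ]
}
"""


def daily_totals(response_text: str) -> dict[str, float]:
    """Sum half-hourly readings into a total per calendar day."""
    payload = json.loads(response_text)
    totals: dict[str, float] = defaultdict(float)
    for reading in payload["readings"]:
        day = datetime.fromisoformat(reading["start"]).date().isoformat()
        totals[day] += reading["value"]
    return dict(totals)


print(daily_totals(SAMPLE_RESPONSE))
```

A few lines with a standard JSON parser is a very different proposition from integrating directly with the underlying messaging infrastructure.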

To comply with the governance of the existing system, the downstream user still needs to ensure they have appropriate consent to access data. And they need to be ready to provide evidence if the intermediary is audited.

The APIs offered by these new intermediaries are commercial services: the businesses are looking to do more than just cover their costs and will be hoping to generate significant margin through what is basically a reseller model. 

It’s worth noting that access to AMR meter data is also typically via commercial services, at least for non-domestic meters. The price per meter for data from smart meters currently seems lower, perhaps because it’s relying on a more standard, shared underlying data infrastructure.

As the number of smart meters grows I expect access to a cheaper and more modern API layer will become increasingly interesting for a range of existing and new products and services.

Lessons from Open Banking

From my perspective the major barrier to more innovative use of smart meter data is the existing data infrastructure. The DCC obviously recognises the difficulty of integration and other organisations are seeing potential for new revenue streams by becoming data intermediaries.

And needless to say, all of these new intermediaries have their own business models and bespoke APIs. Ultimately, while they may end up competing in different sectors or markets, or over quality of service, they’re all relying on the same underlying data and infrastructure.

In the finance sector, Open Banking has already demonstrated that a standardised set of APIs, licensing and approach to managing access and consent can help to drive innovation in a way that is good for consumers. 

There are clear parallels to be drawn between Open Banking, which increased access to banking data, and how access to smart meter data might be increased. It’s a very similar type of data: highly personal, transactional records. And can be used in very similar ways, e.g. account switching.

The key difference is that there’s no single source of banking transactions, so regulation was required to ensure that all the major banks adopted the standard. Smart meter data is already flowing through a single state-licensed monopoly.

Perhaps if the role of the DCC is changing, then they could also provide a simpler standardised API to access the data? Ofgem and DCC could work with the market to define this API as happened with Open Banking. And by reducing the number of intermediaries it may help to increase trust in how data is being accessed, used and shared?

If there is a reluctance to extend DCC’s role in this direction then an alternative step would be to recognise the role and existence of these new types of intermediary within the Smart Energy Code. That would allow their licence to use the network to include agreement to offer a common, core standard API, common data licensing terms and a common approach to the collection and management of consent. Again, Ofgem, DCC and others could work with the market to define that API.

For me either of these approaches is the most obvious way to carry the lessons and models from Open Banking into the energy sector. There are clearly many more aspects of the energy data ecosystem that might benefit from improved access to data, which is where initiatives like Icebreaker One are focused. But starting with what will become a fundamental part of the national data infrastructure seems like an obvious first step to me.

The other angle that Open Banking tackled was creating better access to data about banking products. The energy sector needs this too, as there’s no easy way to access data on energy supplier tariffs and products.

Examples of data ecosystem mapping

This blog post is basically a mood board showing some examples of how people are mapping data ecosystems. I wanted to record a few examples and highlight some of the design decisions that go into creating a map.

A data ecosystem consists of data infrastructure, and the people, communities and organisations that benefit from the value created by it. A map of that data ecosystem can help illustrate how data and value is created and shared amongst those different actors.

The ODI has published a range of tools and guidance on ecosystem mapping. Data ecosystem mapping is one of several approaches that are being used to help people design and plan data initiatives. A recent ODI report looks at these “data landscaping” tools with some useful references to other examples.

The Flow of My Voice

Joseph Wilk’s “The Flow of My Voice” highlights the many different steps through which his voice travels before being stored and served from a YouTube channel, and transcribed for others to read.

The emphasis here is on exhaustively mapping each step, with a representation of the processing at each stage. The text notes which organisation owns the infrastructure at each stage. The intent here is to help to highlight the loss of control over data as it passes through complex interconnected infrastructures. This means a lot of detail.

Data Archeogram: mapping the datafication of work

Armelle Skatulski has produced a “Data Archeogram” that highlights the complex range of data flows and data infrastructure that are increasingly being used to monitor people in the workplace. Starting from various workplace and personal data collection tools, it rapidly expands out to show a wide variety of different systems and uses of data.

Similar to Wilk’s map this diagram is intended to help promote critical review and discussion about how this data is being accessed, used and shared. But it necessarily sacrifices detail around individual flows in an attempt to map out a much larger space. I think the use of patent diagrams to add some detail is a nice touch.

Retail and EdTech data flows

The Future of Privacy Forum recently published some simple data ecosystem maps to illustrate local and global data flows using the Retail and EdTech sectors as examples.

These maps are intended to help highlight the complexity of real world data flows, to help policy makers understand the range of systems and jurisdictions that are involved in sharing and storing personal data.

Because these maps are intended to highlight cross-border flows of data they are presented as if they were an actual map of routes between different countries and territories. This is something that is less evident in the previous examples. These diagrams aren’t showing any specific system and illustrate a typical, but simplified data flow.

They emphasise the actors and flows of different types of data in a geographical context.

Data privacy project: Surfing the web from a library computer terminal

The Data Privacy Project “teaches NYC library staff how information travels and is shared online, what risks users commonly encounter online, and how libraries can better protect patron privacy”. As part of their training materials they have produced a simple ecosystem map and some supporting illustrations to help describe the flow of data that happens when someone is surfing the web in a library.

Again, the map shows a typical rather than a real-world system. It’s useful to contrast this with the first example, which is much more detailed by comparison. For an educational tool, a more summarised view is better to help build understanding.

The choice of which actors are shown also reflects its intended use. It highlights web hosts, ISPs and advertising networks, but has less to say about the organisations whose websites are being used and how they might use data they collect.

Agronomy projects

This ecosystem map, which I produced for a project we did at the ODI, has a similar intended use.

It provides a summary of a typical data ecosystem we observed around some Gates Foundation funded agronomy projects. The map is intended as a discussion and educational tool to help Programme Officers reflect on the ecosystem within which their programmes are embedded.

This map uses features of Kumu to encourage exploration, providing summaries for each of the different actors in the map. This makes it more dynamic than the previous examples.

Following the methodology we were developing at the ODI it also tries to highlight different types of value exchange: not just data, but also funding, insights, code, etc. These were important inputs and outputs to these programmes.
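One way to think about this kind of map is as a graph in which every edge is labelled with the type of value being exchanged. Here’s a minimal sketch of that idea. The actors and exchanges are invented for illustration and aren’t taken from the agronomy map or any of the other examples.

```python
from collections import defaultdict
from typing import NamedTuple


class Exchange(NamedTuple):
    """A single flow of value between two actors in the ecosystem."""
    source: str
    target: str
    kind: str  # e.g. "data", "funding", "insights", "code"


# Illustrative actors only; not taken from any of the maps discussed here.
EXCHANGES = [
    Exchange("Funder", "Research programme", "funding"),
    Exchange("Research programme", "Data repository", "data"),
    Exchange("Data repository", "Analyst", "data"),
    Exchange("Analyst", "Funder", "insights"),
]


def exchanges_by_kind(exchanges: list[Exchange]) -> dict[str, list[tuple[str, str]]]:
    """Group (source, target) pairs by the type of value exchanged."""
    grouped: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for exchange in exchanges:
        grouped[exchange.kind].append((exchange.source, exchange.target))
    return dict(grouped)


def actors(exchanges: list[Exchange]) -> list[str]:
    """List every actor that appears at either end of an exchange."""
    return sorted({name for ex in exchanges for name in (ex.source, ex.target)})


print(actors(EXCHANGES))
print(exchanges_by_kind(EXCHANGES))
```

Labelling the edges like this is what lets a map show that data is only one of several things moving around the ecosystem.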

OpenStreetMap Ecosystem

In contrast to most of the earlier examples, this partial map of the OSM ecosystem tries to show a real-world ecosystem. It would be impossible to properly map the full OSM ecosystem so this is inevitably incomplete and increasingly out of date.

The decision about what detail to include was driven by the goals of the project. The intent was to try and illustrate some of the richness of the ecosystem whilst highlighting how a number of major commercial organisations were participants in that ecosystem. This was not evident to many people until recently.

The map mixes together broad categories of actors, e.g. “End Users” and “Contributor Community” alongside individual commercial companies and real-world applications. The level of detail is therefore varied across the map.

Governance design patterns

The final example comes from this Sage Bionetworks paper. The paper describes a number of design patterns for governing the sharing of data. It includes diagrams of some general patterns as well as real-world applications.

The diagrams show relatively simple data flows, but they are drawn differently to some of the previous examples. Here the individual actors aren’t directly shown as the endpoints of those data flows. Instead, the data stewards, users and donors are depicted as areas on the map. This is to help emphasise where data is crossing governance boundaries and its use is informed by different rules and agreements. Those agreements are also highlighted on the map.

Like the Future of Privacy ecosystem maps, the design is being used to help communicate some important aspects of the ecosystem.

12 ways to improve the GDS guidance on reference data publishing

GDS have published some guidance about publishing reference data for reuse across government. I’ve had a read and it contains a good set of recommendations. But some of them could be clearer. And I feel like some important areas aren’t covered. So I thought I’d write this post to capture my feedback.

Like the original guidance my feedback largely ignores considerations of infrastructure or tools. That’s quite a big topic and recommendations in those areas are unlikely to be applicable solely to reference data.

The guidance also doesn’t address issues around data sharing, such as privacy or regulatory compliance. I’m also going to gloss over that. Again, not because it’s not important, but because those considerations apply to sharing and publishing any form of data, not just reference data.

Here’s the list of things I’d revise or add to this guidance:

  1. The guidance should recommend that reference data be as open as possible, to allow it to be reused as broadly as possible. Reference data that doesn’t contain personal information should be published under an open licence. Licensing is important even for cross-government sharing because other parts of government might be working with private or third sector organisations who also need to be able to use the reference data. This is the biggest omission for me.
  2. Reference data needs to be published over the long term so that other teams can rely on it and build it into their services and workflows. When developing an approach for publishing reference data, consider what investment needs to be made for this to happen. That investment will need to cover people and infrastructure costs. If you can’t do that, then at least indicate how long you expect to be publishing this data. Transparent stewardship can build trust.
  3. For reference data to be used, it needs to be discoverable. The guide mentions creating metadata and doing SEO on dataset pages, but doesn’t include other suggestions such as using Schema.org Dataset metadata or even just depositing metadata in data.gov.uk.
  4. The guidance should recommend that stewardship of reference data is part of a broader data governance strategy. While you may need to identify stewards for individual datasets, governance of reference data should be part of broader data governance within the organisation. It’s not a separate activity. Implementing that wider strategy shouldn’t block making early progress to open up data, but consider reference data alongside other datasets.
  5. Forums for discussing how reference data is published should include external voices. The guidance suggests creating a forum for discussing reference data, involving people from across the organisation. But the intent is to publish data so it can be reused by others. This type of forum needs external voices too.
  6. The guidance should recommend documenting provenance of data. It notes that reference data might be created from multiple sources, but does not encourage recording or sharing information about its provenance. That’s important context for reusers.
  7. The guide should recommend documenting how identifiers are assigned and managed. The guidance has quite a bit of detail about adding unique identifiers to records. It should also encourage those publishing reference data to document how and when they create identifiers for things, and what types of things will be identified. Mistakes in understanding the scope and coverage of reference data can have huge impacts.
  8. There is a recommendation to allow users to report errors or provide feedback on a dataset. That should be extended to include a recommendation that the data publisher makes known errors clear to other users, as well as transparency around when individual errors might be fixed. Reporting an error without visibility of the process for fixing data is frustrating.
  9. GDS might recommend an API first approach, but reference data is often used in bulk. So there should be a recommendation to have bulk access to data, not just an API. It might also be cheaper and more sustainable to share data in this way.
  10. The guidance on versioning should include record level metadata. The guidance contains quite a bit of detail around versioning of datasets. While useful, it should also include suggestions to include status codes and timestamps on individual records, to simplify integration and change monitoring. Change reporting is an important but detailed topic.
  11. While the guidance doesn’t touch on infrastructure, I think it would be helpful for it to recommend that platforms and tools used to manage reference data are open sourced. This will help others to manage and publish their own reference data, and build alignment around how data is published.
  12. Finally, if multiple organisations are benefiting from use of the same reference data then encouraging exploration of collaborative maintenance might help to reduce costs for maintaining data, as well as improving its quality. This can help to ensure that data infrastructure is properly supported and invested in.
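To illustrate the point about record level metadata (item 10), here’s a rough sketch of how status codes and timestamps on individual records can simplify change monitoring for a reuser. The field names and the status values (“active” and “retired”) are my own invention and aren’t taken from the GDS guidance.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    """A reference data record with record-level metadata."""
    identifier: str
    label: str
    status: str         # hypothetical codes: "active" or "retired"
    last_updated: date


def changed_since(records: list[Record], since: date) -> list[str]:
    """Return identifiers of records updated after the given date."""
    return [r.identifier for r in records if r.last_updated > since]


def active(records: list[Record]) -> list[Record]:
    """Return only the records that are still in use."""
    return [r for r in records if r.status == "active"]


# Made-up example records
RECORDS = [
    Record("A1", "First example", "active", date(2021, 1, 10)),
    Record("A2", "Second example", "retired", date(2021, 2, 20)),
    Record("A3", "Third example", "active", date(2021, 3, 5)),
]

print(changed_since(RECORDS, date(2021, 2, 1)))
print([r.identifier for r in active(RECORDS)])
```

Without the timestamp and status on each record, a reuser has to diff whole versions of the dataset to work out what has changed.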

OSM Queries

For the past month I’ve been working on a small side project which I’m pleased to launch for Open Data Day 2021.

I’ve long been a fan of OpenStreetMap. I’ve contributed to the map, coordinated a local crowd-mapping project and used OSM tiles to help build web based maps. But I’ve only done a small amount of work with the actual data. Not much more than running a few Overpass API queries and playing with some of the exports available from Geofabrik.

I recently started exploring the Overpass API again to learn how to write useful queries. I wanted to see if I could craft some queries to help me contribute more effectively. For example by helping me to spot areas that might need updating. Or identify locations where I could add links to Wikidata.

There’s quite a bit of documentation about the Overpass API and the query language it uses, which is called Overpass QL. But I didn’t find it that accessible. The documentation is more of a reference than a tutorial.

And, while there are quite a few example queries to be found across the OSM wiki and other websites, there isn’t always a great deal of context to explain how those examples work or when you might use them.

So I’ve been working on two things to address what I think is a gap in helping people learn how to get more from the OpenStreetMap API.

overpass-doc

The first is a simple tool that will take a collection of Overpass queries and build a set of HTML pages from them. It’s based on a similar tool I built for SPARQL queries a few years ago. Both are inspired by Javadoc and other code documentation tools.

The idea was to encourage the publication of collections of useful, documented queries. E.g. to be shared amongst members of a community or people working on a project. The OSM wiki can be used to share queries, but it might not always be a suitable home for this type of content.

The tool is still at quite an early stage. It’s buggy, but functional.

To test it out I’ve been working on my own collection of Overpass queries. I initially started to pull together some simple examples that illustrated a few features of the language. But then realised that I should just use the tool to write a proper tutorial. So that’s what I’ve been doing for the last week or so.

Announcing OSM Queries

OSM Queries is the result. As of today the website contains four collections of queries. The main collection of queries is a 26 part tutorial that covers the basic features of Overpass QL.

By working through the tutorial you’ll learn:

  • some basics of the OpenStreetMap data model
  • how to write queries to extract nodes, ways and relations from the OSM database using a variety of different methods
  • how to filter data to extract just the features of interest
  • how to write spatial queries to find features based on whether they are within specific areas or are within proximity to one another
  • how to output data as CSV and JSON for use in other tools
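
To give a flavour of what these queries look like, here is a small Python sketch that builds a basic Overpass QL query. It is not taken from the tutorial: the `build_query` helper, the tag and the bounding box values are assumptions chosen for illustration. The query asks for nodes with a given tag inside a bounding box and requests JSON output.

```python
def build_query(key, value, bbox, timeout=25):
    """Build an Overpass QL query for nodes with key=value inside a bounding box.

    bbox is (south, west, north, east), the order Overpass QL expects.
    """
    south, west, north, east = bbox
    return (
        f"[out:json][timeout:{timeout}];\n"
        f'node["{key}"="{value}"]({south},{west},{north},{east});\n'
        "out body;"
    )


# A rough, illustrative bounding box around Uluru.
ULURU_BBOX = (-25.5, 130.8, -25.2, 131.2)

query = build_query("natural", "peak", ULURU_BBOX)
print(query)
```

The printed query can be pasted into Overpass Turbo to run it and explore the results on a map.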

Every query in the tutorial has its own page containing an embedded syntax highlighted version of the query. This makes them easier to share with others. You can click a button to load and run the query using the Overpass Turbo IDE. So you can easily view the results and tinker with the query.

I think the tutorial covers all the basic options for querying and filtering data. Many of the queries include comments that illustrate variations of the syntax, encouraging you to further explore the language.

I’ve also been compiling an Overpass QL syntax reference that provides a more concise view of some of the information in the OSM wiki. There are a lot of advanced features (like this) which I will likely cover in a separate tutorial.

Writing a tutorial against the live OpenStreetMap database is tricky. The results can change at any time. So I opted to focus on demonstrating the functionality using mostly natural features and administrative boundaries.

In the end I chose to focus on an area around Uluru in Australia. Not just because it provides an interesting and stable backdrop for the tutorial. But because I also wanted to encourage a tiny bit of reflection in the reader about what gets mapped, who does the mapping, and how things get tagged.

A bit of map art, and a request

The three other query collections are quite small:

I ended up getting a bit creative with the MapCSS queries.

For example, to show off the functionality I’ve written queries that show the masonic symbol hidden in the streets of Bath, style Brøndby Haveby like a bunch of flowers and render the Lotus Bahai Temple as, well, a lotus flower.

These were all done by styling the existing OSM data. No edits were done to change the map. I wouldn’t encourage you to do that.

I’ve put all the source files and content for the website into the public domain so you’re free to adapt, use and share however you see fit.

While I’ll continue to improve the tutorial and add some more examples, I’m also hoping that I can encourage others to contribute to the site. If you have useful queries that could be added to the site then submit them via Github. I’ve provided a simple issue template to help you do that.

I’m hoping this provides a useful resource for people in the OSM community and that we can collectively improve it over time. I’d love to get some feedback, so feel free to drop me an email, comment on this post or message me on twitter.

And if you’ve never explored the data behind OpenStreetMap then Open Data Day is a great time to dive in. Enjoy.

Bath Historical Images

One of my little side projects is to explore historical images and maps of Bath and the surrounding areas. I like understanding the contrast between how Bath used to look and how it is today. It’s grown and changed a huge amount over the years. It gives me a strong sense of place and history.

There is a rich archive of photographs and images of the city and area that were digitised for the Bath in Time project. Unfortunately the council has chosen to turn this archive into a, frankly terrible, website that is being used to sell over-priced framed prints.

The website has limited navigation and there’s no access to higher resolution imagery. Older versions of the site had better navigation and access to some maps.

The current version looks like it’s based on a default ecommerce theme for WordPress rather than being designed to show off the richness of the 40,000 images it contains. Ironically the @bathintime twitter account tweets out higher resolution images than you can find on the website.

This is a real shame. Frankly I can’t imagine there’s a huge amount of revenue being generated from these prints.

If the metadata and images were published under a more open licence (even with a non-commercial limitation) then it would be more useful for people like me who are interested in local history. We might even be able to help build useful interfaces. I would happily invest time in cataloguing images and making something useful with them. In fact, I have been.

In lieu of a proper online archive, I’ve been compiling a list of publicly available images from other museums and collections. So far, I’ve sifted through:

I’ve only found around 230 images (including some duplicates across collections) so far, but there are some interesting items in there. Including some images of old maps.

I’ve published the list as open data.

So you can take the metadata and links and explore them for yourself. I thought they may be useful for anyone looking to reuse images in their research or publications.

I’m in the process of adding geographic coordinates to each of the images, so they can be placed on the map. I’m approaching that by geocoding them as if they were produced using a mobile phone or camera. For example, an image of the abbey won’t have the coordinates of the abbey associated with it, it’ll be wherever the artist was standing when they painted the picture.
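
As a sketch of what that could look like as data, here is one way to represent an image as a GeoJSON feature in Python. The property names and coordinates are made up for illustration and are not the schema of the published list. The point geometry records where the artist was standing, while the subject of the image is kept as a separate property.

```python
import json


def viewpoint_feature(title, viewpoint_lat, viewpoint_lon, depicts):
    """Describe an image as a GeoJSON Feature located at the artist's viewpoint."""
    return {
        "type": "Feature",
        "geometry": {
            "type": "Point",
            # GeoJSON positions are ordered longitude, then latitude.
            "coordinates": [viewpoint_lon, viewpoint_lat],
        },
        "properties": {"title": title, "depicts": depicts},
    }


# Hypothetical image and approximate coordinates, for illustration only.
feature = viewpoint_feature(
    "Hypothetical view of the abbey", 51.382, -2.36, "Bath Abbey"
)
print(json.dumps(feature, indent=2))
```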

This is already showing some interesting common views over the years. I’ve included a selection below.

Views from the river, towards Pulteney Bridge

Southern views of the city

Looking to the east across abbey churchyard

Views of the Orange Grove and Abbey

It’s really interesting to be able to look at the same locations over time. Hopefully that gives a sense of what could be done if more of the archives were made available.

There’s more documentation on the dataset if you want to poke around. If you know of other collections of images I should look at, then let me know.

And if you have metadata or images to release under an open licence, or have archives you want to share, then get in touch as I may be able to help.

The Common Voice data ecosystem

In 2021 I’m planning to spend some more time exploring different data ecosystems with an emphasis on understanding the flows of data within and between different data initiatives, the tools they use to collect and share data, and the role of collaborative maintenance and open standards.

One project I’ve been looking at this week is Mozilla Common Voice. It’s an initiative that is producing a crowd-sourced, public domain dataset that can be used to train voice recognition applications. It’s the largest dataset of its type, consisting of over 7,000 hours of audio across 60 languages.

It’s a great example of communities working to create datasets that are more open and representative. Helping to address biases and supporting the creation of more equitable products and services. I’ve been using it in my recent talks on collaborative maintenance, but have had a chance to dig a bit deeper this week.

The main interface allows contributors to either record their voice, by reading short pre-prepared sentences, or validate existing contributions by listening to existing recordings and confirming that they match the script.

Behind the scenes is a more complicated process, which I found interesting.

It further highlights the importance of both open source tooling and openly licensed content in supporting the production of open data. It’s also another example of how choices around licensing can create friction between open projects.

The data pipeline

Essentially, the goal of the Common Voice project is to create new releases of its dataset. With each release including more languages and, for each language, more validated recordings.

The data pipeline that supports that consists of the following basic steps. (There may be other stages involved in the production of the output corpus, but I’ve not dug further into the code and docs.)

  1. Localisation. The Common Voice web application first has to be localised into the required language. This is coordinated via Mozilla Pontoon, with a community of contributors submitting translations licensed under the Mozilla Public Licence 2.0. Pontoon is open source and can be used for other non-Mozilla applications. When the localisation gets to 95% the language can be added to the website and the process can move to the next stage.
  2. Sentence Collection. Common Voice needs short sentences for people to read. These sentences need to be in the public domain (e.g. via a CC0 waiver). A minimum of 5,000 sentences is required before a language can be added to the website. The content comes from people submitting and validating sentences via the sentence collector tool. The text is also drawn from public domain sources. There’s a sentence extractor tool that can pull content from Wikipedia and other sources. For bulk imports the Mozilla team needs to check for licence compatibility before adding text. All of this means that the source texts for each language are different.
  3. Voice Donation. Contributors read the provided sentences to add their voice to the dataset. The reading and validation steps are separate microtasks. Contributions are gamified and there are progress indicators for each language.
  4. Validation. Submitted recordings go through retrospective review to assess their quality. This allows for some moderation, allowing contributors to flag recordings that are offensive, incorrect or are of poor quality. Validation tasks are also gamified. In general there are more submitted recordings than validations. Clips need to be reviewed by two separate users for them to be marked as valid (or invalid).
  5. Publication. The corpus consists of valid, invalid and “other” (not yet validated) recordings, split into development, training and test datasets. There are separate datasets for each language.
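
The validation rule in step 4 can be sketched in a few lines of Python. This is my own simplification rather than the project’s actual code: the `clip_status` function, the way votes are stored and the handling of split votes are all assumptions. A clip stays in the “other” bucket until two separate reviewers agree on it.

```python
def clip_status(votes):
    """Classify a clip from its review votes.

    `votes` maps a reviewer id to True (recording matches the sentence)
    or False (it does not). Keying by reviewer id means each reviewer
    counts once. Split votes are left as "other", which is an assumption.
    """
    up = sum(1 for matches in votes.values() if matches)
    down = len(votes) - up
    if up >= 2 and up > down:
        return "valid"
    if down >= 2 and down > up:
        return "invalid"
    return "other"


print(clip_status({"reviewer-1": True}))
print(clip_status({"reviewer-1": True, "reviewer-2": True}))
print(clip_status({"reviewer-1": False, "reviewer-2": False}))
```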

There is an additional dataset which consists of 14 single word sentences (the ten digits, “yes”, “no”, “hey”, “Firefox”) which is published separately. Steps 2-4 look similar, though.

Some observations

What should be clear is that there are multiple stages, each with their own thresholds for success.

To get a language into the project you need to translate around 600 text fragments from the application and compile a corpus of at least 5,000 sentences before the real work of collecting the voice dataset can begin.
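
Those two thresholds are simple enough to express as a check. The sketch below uses the 95% localisation and 5,000 sentence figures described above, but the function name and the way progress is counted are assumptions for illustration.

```python
LOCALISATION_PERCENT = 95  # share of application strings translated
MIN_SENTENCES = 5000       # public domain sentences collected


def ready_for_voice_collection(translated, total_strings, sentences):
    """Check whether a language has passed both thresholds.

    Uses integer arithmetic so that exactly 95% counts as passing.
    """
    if total_strings <= 0:
        return False
    localised = translated * 100 >= LOCALISATION_PERCENT * total_strings
    return localised and sentences >= MIN_SENTENCES


print(ready_for_voice_collection(570, 600, 5000))
print(ready_for_voice_collection(600, 600, 4999))
```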

That work requires input from multiple, potentially overlapping communities:

  • the community of translators, working through Pontoon
  • the community of writers, authors, content creators creating public domain content that can be reused in the service
  • the common voice contributors submitting new additional sentences
  • the contributors recording their voice
  • the contributors validating other recordings
  • the teams at Mozilla, coordinating and supporting all of the above

As the Common Voice application and configuration are open source, it is easy to include them in Pontoon to allow others to contribute to its localisation. To build representative datasets, your tools need to work for all the communities that will be using them.

The availability of public domain text in the source languages is clearly a contributing factor in getting a language added to the site and ultimately included in the dataset.

So the adoption of open licences and the richness of the commons in those languages will be a factor in determining how rich the voice dataset might be for that language. And, hence, how easy it is to create good voice and text applications that can support those communities.

You can clearly create a new dedicated corpus, as people have done for Hakha Chin. But the strength and openness of one area of the commons will impact other areas. It’s all linked.

While there are different communities involved in Common Voice, it’s clear from these reports from communities working on Hakha Chin and Welsh that, in some cases, it’s the same community that is working across the whole process.

Every language community is working to address its own needs: “We’re not dependent on anyone else to make this happen…We just have to do it”.

That’s the essence of shared infrastructure. A common resource that supports a mixture of uses and communities.

The decision about which licences to use is, as ever, really important. At present Common Voice only takes a few sentences from individual pages of the larger Wikipedia instances. As I understand it this is because Wikipedia content is not public domain, so cannot be used wholesale. But small extracts should be covered by fair use?

I would expect that those interested in building and maintaining their language specific instances of Wikipedia have overlaps with those interested in making voice applications work in that same language. Incompatible licensing can limit the ability to build on existing work.

Regardless, the Mozilla and the Wikimedia Foundations have made licensing choices that reflect the needs of their communities and the goals of their projects. That’s an important part of building trust. But, as ever, those licensing choices have subtle impacts across the wider ecosystem.

The importance of tracking dataset retractions and updates

There are lots of recent examples of researchers collecting and releasing datasets which end up raising serious ethical and legal concerns. The IBM facial recognition dataset being just one example that springs to mind.

I read an interesting post exploring how facial recognition datasets are being widely used despite being taken down due to ethical concerns.

The post highlights how these datasets, despite being retracted, are still being widely used in research. This is in part because the original datasets are still circulating via mirrors of the original files. But also because they have been incorporated into derived datasets which are still being distributed with the original contents intact.

The authors describe how just one dataset, the DukeMTMC dataset, was used in more than 135 papers after being retracted, 116 of those drawing on derived datasets. Some datasets have many derivatives: one example cited has been used in 14 derived datasets.

The research raises important questions about how datasets are published, mirrored, used and licensed. There’s a lot to unpack there and I look forward to reading more about the research. The concerns around open licensing are reminiscent of similar debates in the open source community leading to a set of “ethical open source licences”.

But the issue I wanted to highlight here is the difficulty of tracking the mirroring and reuse of datasets.

Change notification is a missing piece of our data infrastructure.

If it were easier to monitor important changes to datasets, then it would be easier to:

  • maintain mirrors of data
  • retract or remove data that breached laws or social and ethical norms
  • update derived datasets to remove or amend data
  • re-run analyses against datasets which have seen significant corrections or revisions
  • assess the impacts of poor quality or unethically shared data
  • proactively notify relevant communities of potential impacts relating to published data
  • monitor and review the reasons why datasets get retracted
  • …etc, etc
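
To make that a little more concrete, here is a minimal Python sketch of how a feed of change notices might be used. No such shared feed exists as far as I know, so the `ChangeNotice` structure, the kinds of notice and the dataset names are all hypothetical. Given a set of notices and a record of which datasets were derived from which sources, it finds the derived datasets that need reviewing after a retraction.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ChangeNotice:
    dataset_id: str
    kind: str     # assumed values: "correction", "revision" or "retraction"
    issued: date
    reason: str


def affected_derivatives(notices, derived_from):
    """List derived datasets that include at least one retracted source.

    `derived_from` maps a derived dataset id to the set of source ids it uses.
    """
    retracted = {n.dataset_id for n in notices if n.kind == "retraction"}
    return sorted(
        derived for derived, sources in derived_from.items()
        if sources & retracted
    )


# Hypothetical notices and provenance records, for illustration only.
NOTICES = [
    ChangeNotice("faces-v1", "retraction", date(2020, 6, 1), "ethical concerns"),
    ChangeNotice("streets-v2", "correction", date(2020, 7, 1), "fixed labels"),
]
DERIVED_FROM = {
    "faces-plus": {"faces-v1", "other-source"},
    "street-scenes": {"streets-v2"},
}

print(affected_derivatives(NOTICES, DERIVED_FROM))
```

The sketch only works if derived datasets record their sources, which is part of the same missing infrastructure.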

The importance of these activities can be seen in other contexts.

For example, Retraction Watch is a project that monitors retractions of research papers. CrossMark helps to highlight major changes to published papers including corrections and retractions.

Principle T3: Orderly Release, of the UK Statistics Authority code of practice explains that scheduled revisions and unscheduled corrections to statistics should be transparent, and that organisations should have a specific policy for how they are handled.

More broadly, product recalls and safety notices are standard for consumer goods. Maybe datasets should be treated similarly?

This feels like an area that warrants further research, investment and infrastructure. At some point we need to raise our sights from setting up even more portals and endlessly refining their feature sets and think more broadly about the system and ecosystem we are building.

Four types of innovation around data

Vaughn Tan’s The Uncertainty Mindset is one of the most fascinating books I’ve read this year. It’s an exploration of how to build R&D teams drawing on lessons learned in high-end kitchens around the world. I love cooking and I’m interested in creative R&D and what makes high-performing teams work well. I’d strongly recommend it if you’re interested in any of these topics.

I’m also a sucker for a good intellectual framework that helps me think about things in different ways. I did that recently with the BASEDEF framework.

Tan introduces a nice framework in Chapter 4 of the book which looks at four broad types of innovation around food. These are presented as a way to help the reader understand how and where innovation creates impact in restaurants. The four categories are:

  1. New dishes – new arrangements of ingredients, where innovation might be incremental refinements to existing dishes, combining ingredients together in new ways, or using ingredients from different contexts (think “fusion”)
  2. New ingredients – coming up with new things to be cooked
  3. New cooking methods – new ways of cooking things, like spherification or sous vide
  4. New cooking processes – new ways of organising the processes of cooking, e.g. to help kitchen staff prepare a dish more efficiently and consistently

The categories at the top are more evident to the consumer, those lower down less so. But the impacts of new methods and processes are greater as they apply in a variety of contexts.

Somewhat inevitably, I found myself thinking about how these categories work in the context of data:

  1. New analyses (in place of dishes) – New derived datasets made from existing primary sources. Or new ways of combining datasets to create insights. I’ve used the metaphor of cooking to describe data analysis before. Those recipes for data-informed problem solving help to document this stage to make it reproducible
  2. New datasets and data sources (in place of ingredients) – Finding and using new sources of data, like turning image, text or audio libraries into datasets, using cheaper sensors, finding a way to extract data from non-traditional sources, or using phone sensors for earthquake detection
  3. New methods for cleaning, managing or analysing data (in place of cooking methods) – which includes things like Jupyter notebooks, machine learning or differential privacy
  4. New processes for organising the collection, preparation and analysis of data (in place of cooking processes) – e.g. collaborative maintenance, developing open standards for data or approaches to data governance and collective consent?

The breakdown isn’t perfect, but I found the exercise useful to think through the types of innovation around data. I’ve been conscious recently that I’m often using the word “innovation” without really digging into what that means, how that innovation happens and what exactly is being done differently or produced as a result.

The categories are also useful, I think, in reflecting on the possible impacts of breakthroughs of different types. Or perhaps where investment in R&D might be prioritised and where ensuring the translation of innovative approaches into the mainstream might have most impact?

What do you think?

Increasing inclusion around open standards for data

I read an interesting article this week by Ana Brandusescu, Michael Canares and Silvana Fumega. Called “Open data standards design behind closed doors?” it explores issues of inclusion and equity around the development of “open data standards” (which I’m reading as “open standards for data”).

Ana, Michael and Silvana rightly highlight that standards development is often seen and carried out as a technical process, whereas their development and impacts are often political, social or economic. To ensure that standards are well designed, we need to recognise their power, choose when to wield that tool, and ensure that we use it well. The article also asks questions about how standards are currently developed and suggests a framework for creating more participatory approaches throughout their development.

I’ve been reflecting on the article this week alongside a discussion that took place in this thread started by Ana.

Improving the ODI standards guidebook

I agree that standards development should absolutely be more inclusive. I too often find myself in standards discussions and groups with people that look like me and whose experiences may not always reflect those who are ultimately impacted by the creation and use of a standard.

In the open standards for data guidebook we explore how and why standards are developed to help make that process more transparent to a wider group of people. We also placed an emphasis on the importance of the scoping and adoption phases of standards development because this is so often where standards fail. Not just because the wrong thing is standardised, but also because the standard is designed for the wrong audience, or its potential impacts and value are not communicated.

Sometimes we don’t even need a standard. Standards development isn’t about creating specifications or technology, those are just outputs. The intended impact is to create some wider change in the world, which might be to increase transparency, or support implementation of a policy or to create a more equitable marketplace. Other interventions or activities might achieve those same goals better or faster. Some of them might not even use data(!)

But looking back through the guidebook, while we highlight in many places the need for engagement, outreach, developing a shared understanding of goals and desired impacts and a clear set of roles and responsibilities, we don’t specifically foreground issues of inclusion and equity as much as we could have.

The language and content of the guidebook could be improved. As could some prototype tools we included like the standards canvas. How would that be changed in order to foreground issues of inclusion and equity?

I’d love to get some contributions to the guidebook to help us improve it. Drop me a message if you have suggestions about that.

Standards as shared agreements

Open standards for data are reusable agreements that guide the exchange of data. They shape how I collect data from you, as a data provider. And as a data provider they shape how you (re)present data you have collected and, in many cases, will ultimately impact how you collect data in the future.

If we foreground standards as agreements for shaping how data is collected and shared, then to increase inclusion and equity in the design of those agreements we can look to existing work like the Toolkit for Centering Racial Equity which provides a framework for thinking about inclusion throughout the life-cycle of data. Standards development fits within that life-cycle, even if it operates at a larger scale and extends it out to different time frames.

We can also recognise existing work and best practices around good participatory design and research.

We should avoid standards development, as a process, being divorced from broader discussions and best practices around ethics, equity and engagement around data. Taking a more inclusive and equitable approach to standards development is part of the broader discussion around the need for more integration across the computing and social sciences.

We may also need to recognise that sometimes agreements are made that don’t provide equitable outcomes for everyone. We might not be able to achieve a compromise that works for everyone. Being transparent about the goals and aims of a standard, and how it was developed, can help to surface who it is designed for (or not). Sometimes we might just need different standards, optimised for different purposes.

Some standards are more harmful than others

There are many different types of standard. And standards can be applied to different types of data. The authors of the original article didn’t really touch on this within their framework, but I think it’s important to recognise these differences as part of any follow-on activities.

The impacts of a poorly designed standard that classifies people or their health outcomes will be much more harmful than a poorly defined data exchange format. See all of Susan Leigh Star’s work. Or concerns from indigenous peoples about how they are counted and represented (or not) in statistical datasets.

Increasing inclusion can help to mitigate the harmful impacts around data. So focusing on improving inclusion (or recognising existing work and best practices) around the design of standards with greater capacity for harms is important. The skills and experience required to develop a taxonomy are fundamentally different from those required to develop a data exchange format.

Recognising these differences is also helpful when planning how to engage with a wider group of people, as we can identify what help and input is needed: What skills or perspectives are lacking among those leading standards work? What help or support needs to be offered to increase inclusion? E.g. by developing skills, or choosing different collaboration tools or methods of seeking input.

Developing a community of practice

Since we launched the standards guidebook I’ve been wondering whether it would be helpful to have more of a community of practice around standards development. I found myself thinking about this again after reading Ana, Michael and Silvana’s article and the subsequent discussion on twitter.

What would that look like? Does it exist already?

Perhaps supported by a set of learning or training resources that re-purposes some of the ODI guidebook material alongside other resources to help others to engage with and lead impactful, inclusive standards work?

I’m interested to see how this work and discussion unfolds.

FAIR, fairer, fairest?

“FAIR” (or “FAIR data”) is a term that I’ve been bumping into more and more frequently. For example, it’s included in the UK’s recently published Geospatial Strategy.

FAIR is an acronym that stands for Findable, Accessible, Interoperable and Reusable. It defines a set of principles that highlight some important aspects of publishing machine-readable data well. For example they identify the need to adopt common standards, use common identifiers, provide good metadata and clear usage licences.

The principles were originally defined by researchers in the life sciences. They were intended to help to improve management and sharing of data in research. Since then the principles have been increasingly referenced in other disciplines and domains.

At the ODI we’re currently working with CABI on a project that is applying the FAIR data principles, alongside other recommendations, to improve data sharing in grants and projects funded by the Gates Foundation.

From the perspective of encouraging the management and sharing of well-structured, standardised, machine-readable data, the FAIR principles are pretty good. They explore similar territory as the ODI’s Open Data Certificates and Tim Berners-Lee’s 5-Star Principles.

But the FAIR principles have some limitations and have been critiqued by various communities. As the principles become adopted in other contexts it is important that we understand these limitations, as they may have more of an impact in different situations.

A good background on the FAIR principles and some of their limitations can be found in this 2018 paper. But there are a few I’d like to highlight in this post.

They’re just principles

A key issue with FAIR is that they’re just principles. They offer recommendations about best practices, but they don’t help you answer specific questions. For example:

  • what metadata is useful to publish alongside different types of datasets?
  • which standards and shared identifiers are the best to use when publishing a specific dataset?
  • where will people be looking for this dataset to ensure it’s findable?
  • what are the trade-offs of using different competing standards?
  • what terms of use and licensing are appropriate to use when publishing a specific dataset for use by a specific community?
  • …etc

Applying the principles to a specific dataset means you need to have a clear idea about what you’re trying to achieve, what standards and best practices are used by the community you’re trying to support, or what approach might best enable the ecosystem you’re trying to grow and support.

We touched on some of these issues in a previous project that CABI and ODI delivered to the Gates Foundation. We encouraged people to think about FAIR in the context of a specific data ecosystem.

Currently there’s very little guidance that exists to support these decisions around FAIR. Which makes it harder to assess whether something is really FAIR in practice. Inevitably there will be trade-offs that involve making choices about standards and how much to invest in data curation and publication. Principles only go so far.

The principles are designed for a specific context

The FAIR principles were designed to reflect the needs of a specific community and context. Many of the recommendations are also broadly applicable to data publishing in other domains and contexts. But they embody design decisions that may not apply universally.

For example, they choose to emphasise machine-readability. Other communities might choose to focus on other elements that are more important to them or their needs.

As an alternative, the CARE principles for indigenous data governance are based around Collective Benefit, Authority to Control, Responsibility and Ethics. Those are good principles too. Other groups have chosen to propose ways to adapt and expand on FAIR.

It may be that the FAIR principles will work well in your specific context or community. But it might also be true that if you were to start from scratch and designed a new set of principles, you might choose to highlight other principles.

Whenever we are applying off-the-shelf principles in new areas, we need to think about whether they are helping us to achieve our own goals. Do they emphasise and prioritise work in the right areas?

The principles are not about being “fair”

Despite the acronym, the principles aren’t about being “fair”.

I don’t really know how to properly define “fair”. But I think it includes things like equity ‒ of access, or representation, or participation. And ethics and engagement. The principles are silent on those topics, leading some people to think about FAIRER data.

Don’t let the memorable acronym distract from the importance of ethics, consequence scanning and centering equity.

FAIR is not open

The principles were designed to be applied in contexts where not all data can be open. Life science research involves lots of sensitive personal information. Instead the principles recommend that data usage rights are clear.

I usually point out that FAIR data can exist across the data spectrum. But the principles don’t remind you that data should be as open as possible. Or prompt you to consider the impacts of different types of licensing. They just ask you to be clear about the terms of reuse, however restrictive they might be.

So, to recap: the FAIR data principles offer a useful framework of things to consider when making data more accessible and easier to reuse. But they are not perfect. And they do not consider all of the various elements required to build an open and trustworthy data ecosystem.