There’s a lot going on at the minute. Both in general and personally.
I’ve been trying to think through the way I feel about it all. To reflect on what helps me deal with the anxiety of These Times. As well as what doesn’t.
Today it clicked and I tweeted it. This post just expands on that a little.
Maslow’s hierarchy of needs describes an ordering of human needs, some of which are more fundamental than others. But all of them (or at least as many of them as possible) need to be met for us to feel good. There are plenty of critiques, but it’s a handy reference point.
I feel like at the moment I’m wrestling with the equivalent of that pyramid of needs. But each layer is instead a different source of anxiety.
The bottom layers are the broad existential angst of the climate crisis and the pandemic. And all these nazis there are these days. It’s a good solid base.
The top of the pyramid is a bit more pointed. A bit more personal. It’s those layers of anxieties that are (or at least we feel are) unique to ourselves. Am I good at my job? Am I being a good parent? AITA?
I’m calling it the Doom Pyramid.
Trying to deal with the whole pyramid at once is too much. Social media keeps feeding us new reasons to be concerned, as well as ever increasing ways to be anxious about ourselves.
Douglas Adams introduced the idea of the “Someone Else’s Problem Field”. A kind of cloaking device to hide things from our brains because we think that it’s somebody else’s problem.
I’m finding that dealing with all of the anxieties that the Doom Pyramid represents requires me to strategically deploy a SEP field. To try and drown out all the things that I have no control over. Or which I cannot deal with today. Sometimes though the SEP field breaks down and all of the reality rushes in.
Is that healthy? I don’t know. But that’s how I’m trying to handle things at the moment.
Unfortunately it comes with its own guilt.
Am I part of those broader problems, if I’m not actively trying to tackle them? Probably.
At least some, maybe all, of those broader problems are caused by denial or a general lack of awareness. Which might be attributable to education, experience or just overwhelming privilege.
But I have only so much time and energy in each day. I need to focus on fighting the smaller, more personal demons at the top of the pyramid. The rest are problems for another (election) day.
What I mean is that there are a lot of different types of bee. 275 species in Great Britain and Ireland, and over 20,000 species globally? The planet is absolutely buzzing. Never really thought about it before. I spent more time during my degree reading about ants than bees.
This wasn’t my first surprise encounter with bees. There were the bumble bees nesting in the garage, or that time we had a swarm of bees appear in the garden. We won’t talk about the time I was in the toilet and wasps started coming out of the wall. Because wasps are not bees.
But this was the first time I had tried to actually identify a bee. As a result I found all kinds of useful identification charts and guides. (If you know me well, you can probably guess where this is going).
Later this “summer” whilst dozing in the garden I realised our honeysuckle was absolutely crawling with bees. They kept waking me up. So I started trying to count how many different types I could see.
This inevitably has led to me spending most of the summer surveying how many different types of bee we actually get in our garden.
I’ve learned how hard it is to photograph bees. The buzzers keep moving. Top tip: video them, then export individual frames.
I’ve always felt privileged to have the garden that we do. But never more so than during two summers living through lockdowns. Working in the garden has given me light, air and a welcome distraction from both the frustrations of my work and current events.
But the garden for me has mostly been somewhere to work. I end up going from task to task: weeding, pruning, planting and watering. Spending time looking at the bees and insects in the garden has required me to slow down. To just sit and focus on what’s going on around me.
One afternoon I discovered we had a nascent wasp nest in the lawn, and mason bees burrowing into a tree branch.
I’ve watched a honey bee crawl into a fuchsia flower like it was a sleeping bag. I’ve witnessed nectar robbing. (I’ve seen things you people wouldn’t believe, etc).
I’ve also repeatedly exclaimed “New bee just dropped!” whilst dashing in from the garden to grab my phone and field guide. Much to the chagrin of everyone in the family.
COVID-19 aside we’ve got a lot going on here at Chez Dodds. Everything has indeed been happening so much. A new hobby that happens to encourage a bit of mindful distraction has been a balm.
Amongst all the dissembling, disinformation and disaster that faces me whenever I open a browser or switch on the news, I’ve come to re-appreciate all of the quiet work that supports us, largely unnoticed. The low hum that keeps the world turning. And the ceaseless efforts to understand it.
Vaccines are great. But have you tried the reassuring presence of a four-hundred page field guide?
It’s good clear guidance that should help anyone building a data portal. It has tips for designing search interfaces, presenting results and dataset metadata.
There’s very little advice that is specifically relevant to geospatial data and little in the way of new insights in general. The recommendations echo lots of existing research, guidance and practice. But it’s always helpful to see best practices presented in an accessible way.
For guidance that is ostensibly about geospatial data portals, I would have liked to have seen more of a focus on geospatial data. This aspect is largely limited to recommending the inclusion of a geospatial search, spatial filtering and use of spatial data formats.
It would have been nice to see some suggestions around the useful boundaries to include in search interfaces, recommendations around specific GIS formats and APIs, or some exploration of how to communicate the geographic extents of individual datasets to users.
Fixing a broken journey
The guidance presents a typical user journey that involves someone using a search engine, finding a portal rather than the data they need, and then repeating their search in a portal.
Improving that user journey is best done at the first step. A portal is just getting in the way.
Data publishers should be encouraged to improve the SEO of their datasets if they really want them to be found and used.
Data publishers should be encouraged to improve the documentation and metadata on their “dataset landing pages” to help put that data in context.
If we can improve this then we don’t have to support users in discovering a portal, checking whether it is relevant, teaching them to navigate it, etc.
We don’t really need more portals to improve discovery or use of data. We should be thinking about this differently.
There are many portals, but this one is mine
Portals are created for all kinds of purposes.
Many are just a fancy CMS for datasets that are run by individual organisations.
Others are there to act as hosts for data to help others make it more accessible. Some provide a directory of datasets across a sector.
Looking more broadly, portals support consumption of data by providing a single point of integration with a range of tools and platforms. They work as shared spaces for teams, enabling collaborative maintenance and sharing of research outputs. They also support data governance processes: you need to know what data you have in order to ensure you’re managing it correctly.
If we want to build better portals, then we ought to really have a clearer idea of what is being built, for whom and why.
This new guidance rightly encourages user research, but presumes building a portal as the eventual outcome.
I don’t mean that to be dismissive. There are definitely cases where it is useful to bring together collections of data to help users. But that doesn’t necessarily mean that we need to create a traditional portal interface.
For example, in order to tackle specific challenges it can be useful to identify a set of relevant related data. This implies a level of curation — a librarian function — which is so far missing from the majority of portals.
Curated collections of data (& code & models & documentation & support) might drive innovation whilst helping ensure that data is used in ways that are mindful of the context of its collection. I’ve suggested recipes as one approach to that. But there are others.
Curation and maintenance of collections are less popular because they’re not easily automated. You need to employ people with an understanding of an issue, the relevant data, and how it might be used or not. To me this approach is fundamental to “publishing with purpose”.
I won’t attempt to capture the nuance of her idea, but it involves providing a service to support people in finding data via an expert help desk. The ONS already have something similar for their own datasets, but an agency could cover a whole sector or domain. It could also publish curated lists of useful data.
This approach would help broker relationships between data users and data publishers. This would not only help improve discovery, but also build trust and confidence in how data is being accessed, used and shared.
Actually linking to data?
I have a working hypothesis that, setting aside those that need to aggregate lots of small datasets from different sources, most data-enabled analyses, products and services typically only use a small number of related datasets. Maybe a dozen?
The same foundational datasets are used repeatedly in many different ways. The same combination of datasets might also be analysed for different purposes. It would be helpful to surface the most useful datasets and their combinations.
We have very little insight into this because dataset citation, linking and attribution practices are poor.
We could improve data search if this type of information was more readily available. Link analysis isn’t a substitute for good metadata, but it’s part of the overall puzzle in creating good discovery tools.
Portals often provide an opportunity to standardise how data is being published. As an intermediary they inevitably shape how data is published and used. This is another area where existing portals do little to improve their overall ecosystem.
But those activities aren’t necessarily tied to the creation and operation of a portal. Provision of shared platforms, open source tools, guidance, quality checkers, linking and aggregation tools, and driving development and adoption of standards can all be done in other ways.
Six months later and I’m now in two weekly TTRPG sessions. And I’m thoroughly enjoying it.
For a long period TTRPGs were a big part of my life.
Like many people of my age, my introduction to TTRPGs was through the D&D “red box” set. When I got my copy I badgered my older teenage cousin and his mates to run a session. They threw us into a tiny dungeon filled with werewolves where we immediately died. Then they went down the pub. But that early taste got me hooked.
I started playing regularly with a group of school friends when I was 11 (1983). We had a weekly RuneQuest session run by my mate Darren’s big brother. I played a Duck.
When Games Workshop announced they were releasing a UK version of Middle-Earth Role Playing (MERP) in 1985, I immediately pre-ordered it from the local game shop. Then proceeded to annoy the hell out of the owner by visiting every couple of days to see if it had arrived yet.
On 21st July 1989 I spent a day in a disused underground bunker in Nottingham dressed as a wizard. Before jumping on a train to see The Cure at the NEC. Good Times.
When I went to university in Leicester I naturally made a beeline for the RPG Society. My first weekend I joined a MERP game that proceeded to run every Saturday afternoon for three years. It gave me a near instant set of friends whilst I was still getting to know my housemates and the other people on my course.
At some point we spun out a regular Wednesday afternoon session. Branching out into Shadowrun, Vampire and Werewolf because they were the new and exciting games. Lots of dice were purchased.
I also started regularly running a Call of Cthulhu game.
I’ve joked at times that everything I know about running workshops professionally, I first learned around a table leading a TTRPG session.
It’s not really a joke though. The need for good preparation, helping to ensure that everyone is comfortable and, most importantly, that everyone is getting something out of the event are all transferable skills. And if you’re shy like me then it’s just good practice at leading and speaking in front of a group.
After graduation I turned that Call of Cthulhu game into a free-form play by mail campaign for the same group. Letters with revelations of cosmic horror and containing extracts from cryptic texts flew around the country for a while.
And then things petered out. Life, inevitably, moves on.
I think our last game as a group was a one-off session of Paranoia that I ran. To try and create a suitably ridiculous atmosphere for Alpha Complex, I kept playing sound effects from a BBC Radiophonic Workshop cassette I’d found in the local library. But it just baffled everyone, so we just went down the pub. And that was that for a while.
Now I’ve jumped back in I can’t believe I waited for so long.
I’m currently playing in two campaigns in two very different settings. Masks (which I’m running) is about teenage superheroes. While Good Society is about telling stories inspired by Jane Austen.
Both are “Powered by the Apocalypse” games. These are games that all share some common heritage, being based on a system of rules that were originally designed for a game called Apocalypse World. But that system has been generalised and adapted in many different ways. It now exists as a template for others to build on.
TTRPGs have had a huge boost recently. Largely due to the rise in popularity of Actual Play streams, especially during lockdown. Video calls, tools like Discord and platforms like Roll 20 have also reduced barriers to play, making it easier to play with friends no matter where they are. Both of the groups I’m in include people who are new to TTRPGs.
But mainly the hobby is in an exciting place because of the sheer range of new TTRPGs there are available. Many more than when I was a kid. And many of these systems are Powered by the Apocalypse games. Or at least have some common DNA.
They’re very different to D&D. In ways that encourage creativity and collaboration while placing a focus on narrative. I love rolling dice and poring over rule books as much as the next nerd. But it’s the opportunity to create fun, exciting or moving stories that really brings people (and increasingly, more people, it seems) to the table.
So now I’ve got a rapidly growing collection of new games to play. Fantastic stuff like TEETH (“Jane Austen’s STALKER”) and Brindlewood Bay (Murder She Wrote x Lovecraft) and Tales from the Loop (based on Simon Stålenhag’s art).
I picked an exciting time to return to one of my favourite hobbies.
The largest data source comes from gas and electricity meters (consumption) and solar panels (generation). While we’re integrating with APIs that allow us to access data from smart meters, for the foreseeable future most of this data will still be collected via AMR rather than SMETS-2 meters. And then shared with us as CSV files attached to emails.
That data is sent via a variety of systems and platforms run by energy companies, aggregators and local authorities. We’re currently dealing with about 24 different variations of what is basically the same dataset.
I thought I’d share a quick summary of that variation, as it’s interesting from a “designing CSV files” and data standards perspective.
For a quick overview, you can look at this Google spreadsheet which provides a summary of the formats, in a way that hopefully makes them easy to compare.
The rest of this post has some notes on the variations.
What data are we talking about?
In Energy Sparks we work with half-hourly consumption and production data. A typical dataset will consist of a series of 48 daily readings for each meter.
Each half hourly data point reports the total amount of energy consumed (or generated) in the previous 30 minutes.
A dataset will usually contain several days of readings for many different meters.
This means that the key bits of information that we need to process each dataset are:
An identifier for the meter, e.g. an MPAN or MPRN
The date that the readings were taken
A series of 48 data points making up a full day’s readings
Pretty straight-forward. But as you can see in the spreadsheet there are a lot of different variations.
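As a rough illustration of that shape, here is a minimal Python sketch of one meter's day of readings. The class and field names are my own invention for this post, not something taken from Energy Sparks or any of the supplier formats.

```python
from dataclasses import dataclass
from datetime import date

READINGS_PER_DAY = 48  # one reading per half hour


@dataclass
class DayOfReadings:
    """One meter, one day, 48 half-hourly values (hypothetical structure)."""

    meter_id: str        # e.g. an MPAN or MPRN
    reading_date: date   # the day the readings cover
    kwh: list[float]     # energy consumed (or generated) in each half hour

    def __post_init__(self) -> None:
        # Reject anything that isn't a full day of readings.
        if len(self.kwh) != READINGS_PER_DAY:
            raise ValueError(
                f"expected {READINGS_PER_DAY} readings, got {len(self.kwh)}"
            )

    def total_kwh(self) -> float:
        return sum(self.kwh)


# A made-up meter identifier and a flat 0.5 kWh in every half hour.
day = DayOfReadings("1234567890123", date(2021, 6, 1), [0.5] * 48)
print(day.total_kwh())  # 24.0
```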
We receive different formats for both the gas and electricity data. Different formats for historical vs ongoing data supply. Or both.
And formats might change as schools or local authorities change platform, suppliers, etc.
Use of CSV
In general, the CSV files are pretty consistent. We rely on the Ruby CSV parser’s default behaviour to automatically identify line endings. And all the formats we’re using use commas, rather than tabs, as delimiters.
The number of header rows varies. Most have a single row, but some don’t have any. A couple have two.
Various date formats are used. The following lists the most common first:
%b %e %Y %I:%M%p (1)
%e %b %Y %H:%M:%S (1)
Not much use of ISO 8601!
But the skew towards readable formats probably makes sense given that the primary anticipated use of this data is for people to open it in a spreadsheet.
Where we have several different formats from a single source (yes, this happens), I’ve noticed that the %Y based date formats are used in formats used to provide historical data, while %y year format seems to be the default for ongoing data.
Data is supplied either as UTC dates or, most commonly, in whatever the current timezone is in the UK. So readings switch from GMT to BST. And this means that when the clocks change we end up with gaps in the readings.
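To show why local time and a fixed set of 48 readings don't line up on clock-change days, here's a small Python sketch that counts the real half hours between one local midnight and the next. It assumes the standard library zoneinfo module can find a time zone database on your machine. The function name is just something I've made up for the example.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

LONDON = ZoneInfo("Europe/London")


def half_hours_in_local_day(year: int, month: int, day: int) -> int:
    """Count the half-hour periods between two local midnights in the UK."""
    start = datetime(year, month, day, tzinfo=LONDON)
    end = start + timedelta(days=1)  # the next local midnight (wall-clock arithmetic)
    # Convert to UTC before subtracting so we measure elapsed time,
    # not the difference between two wall-clock values.
    elapsed = end.astimezone(timezone.utc) - start.astimezone(timezone.utc)
    return int(elapsed / timedelta(minutes=30))


print(half_hours_in_local_day(2021, 3, 27))   # 48, an ordinary day
print(half_hours_in_local_day(2021, 3, 28))   # 46, clocks go forward
print(half_hours_in_local_day(2021, 10, 31))  # 50, clocks go back
```

So a format with 48 local-time columns has two slots with nothing to put in them in March, and two readings too many in October.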
The majority of formats (22/24) are column oriented. By which I mean the tables consist of one row per meter, per day. Each row having 48 half-hourly readings as separate columns.
Two are row oriented. Each row containing a measurement for a specific meter at a specific date-time.
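Converting between the two orientations is mostly a grouping exercise. This is a minimal Python sketch, not our actual ingest code. The sample rows are invented, and it assumes each timestamp marks the start of its half hour, which won't be true of every format.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical row oriented input: (meter id, reading timestamp, kWh).
rows = [
    ("M1", datetime(2021, 6, 1, 0, 0), 0.4),
    ("M1", datetime(2021, 6, 1, 0, 30), 0.6),
    ("M1", datetime(2021, 6, 1, 23, 30), 1.0),
]


def pivot_to_days(rows):
    """Group single readings into one 48-slot list per meter, per day."""
    days = defaultdict(lambda: [None] * 48)
    for meter_id, timestamp, kwh in rows:
        # Slot 0 covers 00:00-00:30, slot 47 covers 23:30-00:00.
        slot = timestamp.hour * 2 + timestamp.minute // 30
        days[(meter_id, timestamp.date())][slot] = kwh
    return dict(days)


days = pivot_to_days(rows)
```

Slots with no reading are left as None, which keeps any gaps visible rather than silently treating them as zero.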
The column used to hold meter identifiers also varies. We might expect at least two: MPAN for electricity meters and MPRN for gas. What we actually get is:
“Meter” seems fair as a generic column header if you know what you’re getting. Otherwise some baffling variations here.
What about the column that contains the date (or date-time for row oriented files). What are they called?
The default is that data is supplied in kilowatt-hours (kWh).
So few of the formats actually bother to specify a unit. Those that do call it “ReportingUnit”, “Units” or “Data Type”.
One format actually contains 48 columns reporting kWh and another 48 columns reporting kilovolt-ampere reactive hours (kVArh).
Focusing on the column oriented formats, what are the columns containing the 48 half-hourly readings called?
Most commonly they’re named after the half-hour. For example a column called “20:00” will contain the kWh consumption for the time between 7.30pm and 8pm.
In other cases the columns are positional, e.g. “half hour 1” through to “half hour 48”. This gives us the following variants:
For added fun, some formats have their first column as 00:30, while others have 00:00.
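The difference matters when mapping a column name onto a position in the day. Here's a hedged Python sketch of that mapping. It assumes that formats starting at 00:30 label each column with the end of its half hour, and that formats starting at 00:00 label it with the start. That's my reading of the files rather than anything the formats document, and the function name is made up.

```python
def slot_for_header(header: str, labelled_by_period_end: bool) -> int:
    """Map a column header like "20:00" to a 0-based half-hour slot."""
    hours, minutes = (int(part) for part in header.split(":"))
    half_hours = hours * 2 + minutes // 30
    if labelled_by_period_end:
        # "00:30" is the first slot of the day; a final "00:00" (or "24:00")
        # closes the day, so it wraps around to slot 47.
        return (half_hours - 1) % 48
    return half_hours


print(slot_for_header("20:00", labelled_by_period_end=True))   # 39, i.e. 19:30-20:00
print(slot_for_header("20:00", labelled_by_period_end=False))  # 40, i.e. 20:00-20:30
```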
Some formats interleave the actual readings with an extra column that is used to provide a note or qualifier. There are two variants of this:
In addition to the meter numbers, dates, readings, etc the files sometimes contain extra columns, e.g:
We generally ignore this information as it’s either redundant or irrelevant to our needs.
Some files provide additional meter names, numbers or identifiers that are bespoke to the data source rather than a public identifier.
We’ve got to the point now that adding new formats is relatively straight-forward.
Like anyone dealing with large volumes of tabular data, we’ve got a configuration driven data ingest which we can tailor for different formats. We largely just need to know the name of the date column, the name of the column containing the meter id, and the names of the 48 readings columns.
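To give a flavour of what configuration driven means here, this is a minimal sketch in Python (our own code is Ruby, and this isn't it). The column names, the date format and the config keys are all hypothetical. The point is just that each new format becomes a small bit of configuration rather than new parsing code.

```python
import csv
import io
from datetime import datetime

# Hypothetical configuration describing one supplier's column oriented format.
EXAMPLE_FORMAT = {
    "meter_id_column": "MPAN",
    "date_column": "Date",
    "date_format": "%d/%m/%Y",
    # "00:00", "00:30", ... "23:30"
    "reading_columns": [f"{h:02d}:{m:02d}" for h in range(24) for m in (0, 30)],
}


def parse_readings(csv_text: str, config: dict) -> list[dict]:
    """Read one row per meter, per day, using the supplied column names."""
    parsed = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        parsed.append({
            "meter_id": row[config["meter_id_column"]],
            "date": datetime.strptime(
                row[config["date_column"]], config["date_format"]
            ).date(),
            "kwh": [float(row[column]) for column in config["reading_columns"]],
        })
    return parsed


# Build a tiny made-up file: a header row plus one day for one meter.
header = ",".join(["MPAN", "Date"] + EXAMPLE_FORMAT["reading_columns"])
line = ",".join(["1234567890123", "01/06/2021"] + ["0.5"] * 48)
readings = parse_readings(header + "\n" + line + "\n", EXAMPLE_FORMAT)
print(readings[0]["date"], sum(readings[0]["kwh"]))  # 2021-06-01 24.0
```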
But it’s taken time to develop that.
Most of the ongoing effort is during the setup of a new school or data provider, when we need to check to see if a data feed matches something we know, or whether we need to configure another slightly different variation.
And we have ongoing reporting to alert us when formats change without notice.
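Checking whether a file matches something we already know can be as simple as comparing its header row against each known configuration. A rough Python sketch follows. The format names and date column names are invented, and a real check would also need to cope with files that have no header row at all.

```python
from typing import Optional

READING_COLUMNS = [f"{h:02d}:{m:02d}" for h in range(24) for m in (0, 30)]

# Hypothetical known formats, keyed by a made-up name.
KNOWN_FORMATS = {
    "supplier-a": {
        "meter_id_column": "MPAN",
        "date_column": "Date",
        "reading_columns": READING_COLUMNS,
    },
    "supplier-b": {
        "meter_id_column": "Meter",
        "date_column": "Reading Date",
        "reading_columns": READING_COLUMNS,
    },
}


def identify_format(header: list[str], known_formats: dict) -> Optional[str]:
    """Return the first known format whose required columns are all present."""
    columns = {name.strip() for name in header}
    for name, config in known_formats.items():
        required = {
            config["meter_id_column"],
            config["date_column"],
            *config["reading_columns"],
        }
        if required <= columns:  # extra columns are fine, missing ones are not
            return name
    return None  # nothing matched: a new variation, or a format has changed


print(identify_format(["Meter", "Reading Date", "Site"] + READING_COLUMNS, KNOWN_FORMATS))
print(identify_format(["Site", "Day"], KNOWN_FORMATS))
```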
The fact that there are so many variations isn’t a surprise. There are many different sources and at every organisation someone has made a reasonable guess at what a useful format might be. They might have spoken to users, but probably don’t know what their competitors are doing.
This variation inevitably creates cost. This cost isn’t immediately felt by the average user who only has to deal with 1-2 formats at a time when they’re working with their own data in spreadsheets.
But those costs add up for those of us building tools and platforms, and operating systems, to support those users.
I don’t see anyone driving a standardisation effort in this area. Although, as I’ve hopefully shown here, behind the variations there is a simple, common tabular format that is waiting to be defined.
My impression at the moment is that most focus is on the emerging smart meter data ecosystem, and the new range of APIs that might support faster access to this same data.
But as I pointed out in my other post, if there isn’t an early attempt to standardise those, we’ll just end up with a whole range of new, slightly different APIs and data feeds. What we need is a common API standard.
A further pattern which I noticed recently is that both Wikidata and OSM provide tools and documentation that help contributors and data users explore the schema that shapes the data.
Both projects have a core data model around which their communities are building and iterating on a more focused domain model. This approach of providing tools for the community to discuss, evolve and revise a schema is what we called the Shared Canvas pattern in the ODI guidebook.
But to successfully apply the Shared Canvas pattern, you also need to keep the community up to date about your Evolving Schema. To do that you need some way to communicate which properties or tags are in use, and how. OSM and Wikidata both provide tools to support that.
In OSM this role is filled by TagInfo. It can provide you with a break down of what type of feature the tag is used on, the range of values, combinations with other tags and some idea of its geographic usage. Tag use varies by geographic community in OSM. Here’s the information about the building tag.
In Wikidata this tooling is provided by a series of reports that are available from the Discussion page for an individual property. This includes information about how often it is used and pointers to examples of frequent and recent uses. Here’s the information about the name property.
Both tools provide useful insight into how different aspects of a schema are being adopted and used. They can help guide not just the discussion around the schema (“is this tag in use?”), but also the process of collecting data (“which tags should I use here”) and using the data (“what tags might I find, or query for?”).
Any project that adopts a Shared Canvas approach is likely to need to implement this type of tooling. Let’s call it the “Schema explorer” pattern for now.
I’ll leave documenting it further for another post, or a contribution to the guidebook.
Schema explorers for open standards and open data
This type of tooling would be useful in other contexts.
Anywhere that we’re trying to drive adoption of a common data standard, it would be helpful to be able to assess how well used different parts of that schema are by analysing the available data.
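At its simplest, that kind of assessment is a tally of which properties actually turn up in the data. Here's a minimal Python sketch. The records and property names are made up, and a real schema explorer would also want to report on values and combinations in the way TagInfo does.

```python
from collections import Counter

# Hypothetical records harvested from different publishers.
records = [
    {"name": "Yoga", "price": 5, "image": "yoga.jpg"},
    {"name": "Football", "price": 0},
    {"name": "Swimming"},
]


def property_usage(records):
    """Return the share of records using each property, most used first."""
    counts = Counter(key for record in records for key in record)
    total = len(records)
    return {key: count / total for key, count in counts.most_common()}


usage = property_usage(records)
for name, share in usage.items():
    print(f"{name}: {share:.0%}")
```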
That’s not something I’ve regularly seen produced. In our survey of decentralised publishing initiatives at the ODI we found common types of documentation, data validators and other tools to support use of data, like useful aggregations. But no tooling to help explore how well it is adopted. Or to help data users understand the shape of the available data prior to aggregating it.
When I was working on the OpenActive standard, I found the data profiles that Dan Winchester produced really helpful. They provide useful insight into which parts of a standard different publishers were actually using.
I was thinking about this again recently whilst doing some work for Full Fact, exploring the ClaimReview markup in Schema.org. It would be great to see which features different fact checkers are actually using. In fact that would be true of many different aspects of Schema.org.
This type of reporting is hard to do in a distributed environment without aggregating all the data. But Google are regularly harvesting some of this data, so it feels like it would be relatively easy for them to provide insights like this if they chose.
An alternative is the Schema.org Table Corpus which provides exports of Schema.org data contained in the Common Crawl dataset. But more work is likely needed to generate some useful views over the data, and it is less frequently updated.
Outside of Schema.org, schema explorers reporting on the contents of open datasets would help inform a range of standards work. For example, they could help inform decisions about how to iterate on a schema, guide the production of documentation, and help improve the design of validators and other tools.
If you’ve seen examples of this type of tooling, then I’d be interested to see some links.
This is a post about building tools to validate data. I wanted to share a few reflections based on helping to design and build a few different public and private tools, as well as my experience as a user.
I like using data validators to check my homework. I’ve been using a few different ones recently which has prompted me to think a bit about their role and the decisions that go into their design.
The tl;dr version of this post is along the lines of “Think about user needs when designing tools. But also be conscious of the role those tools play in their broader ecosystem”.
What is a data validator?
A data validator is a tool that checks the correctness and quality of data. This means doing the following categories of checks:
Checking to determine whether there are any mistakes in how it is formatted. E.g. is the syntax of a CSV, XML or JSON file correct?
Confirming whether all of the required fields, necessary to make the data useful, have been provided.
Testing that individual values have been correctly specified. E.g. if the field contains a number then is the provided value actually a number rather than a text?
Performing more semantic checks such as, if this is a dataset about UK planning applications, then are the coordinates actually in the UK? Or is the start date for the application before the end date?
Confirming that provided data is of a useful quality, e.g. are geographic coordinates of the right precision? Or do any links to other resources actually work?
Warning about data that may or may not be included. For example, prompting the user to include additional fields that may improve the utility of the data. Or asking them to consider whether any personal data included should be there.
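To make those categories a bit more concrete, here's a small Python sketch of a validator for an imaginary CSV of UK planning applications. The column names, the very rough UK bounding box and the messages are all assumptions for illustration. It isn't how any of the tools mentioned in this post are implemented.

```python
import csv
import io
from datetime import date

# Hypothetical required columns for the imaginary dataset.
REQUIRED = ["reference", "latitude", "longitude", "start_date", "end_date"]
# Very rough bounding box for the UK; an assumption for illustration only.
UK_LAT, UK_LON = (49.8, 60.9), (-8.7, 1.8)


def validate(csv_text):
    """Return (errors, warnings) for a CSV of planning applications."""
    errors, warnings = [], []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Required fields: are all the necessary columns present?
    missing = [name for name in REQUIRED if name not in (reader.fieldnames or [])]
    if missing:
        return [f"missing required columns: {', '.join(missing)}"], warnings
    for number, row in enumerate(reader, start=2):  # the header is line 1
        # Individual values: are numbers numbers, and dates ISO 8601 dates?
        try:
            lat, lon = float(row["latitude"]), float(row["longitude"])
            start = date.fromisoformat(row["start_date"])
            end = date.fromisoformat(row["end_date"])
        except (TypeError, ValueError) as problem:
            errors.append(f"line {number}: {problem}")
            continue
        # Semantic checks: is it in the UK, and do the dates make sense?
        if not (UK_LAT[0] <= lat <= UK_LAT[1] and UK_LON[0] <= lon <= UK_LON[1]):
            errors.append(f"line {number}: coordinates are outside the UK")
        if start > end:
            errors.append(f"line {number}: start date is after end date")
        # Prompts: optional data that would make the record more useful.
        if not row.get("description"):
            warnings.append(f"line {number}: consider adding a description")
    return errors, warnings


sample = (
    "reference,latitude,longitude,start_date,end_date\n"
    "A1,51.38,-2.36,2021-01-01,2021-02-01\n"
    "A2,48.85,2.35,2021-03-01,2021-02-01\n"
    "A3,north,-2.36,2021-01-01,2021-02-01\n"
)
errors, warnings = validate(sample)
print(len(errors), "errors,", len(warnings), "warnings")  # 3 errors, 2 warnings
```

Keeping errors and warnings in separate lists means the tool can say “this is broken” and “this could be better” as two different things.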
These validation rules will typically come from a range of different sources, including:
The standard or specification that defines the syntax of the data.
The standard or specification (or schema) that describes the structure and content of the data. (This might be the same as the above, or might be defined elsewhere)
Legislation, which might guide, inform or influence what data should or should not be included
The implementer of the validation tool, who may have opinions about what is considered to be correct or useful data based on their specific needs (e.g. as a direct consumer of the data) or more broadly as a contributor to a community initiative to support improvements to how data is published
Data validators are frequently web based these days. At least for smaller datasets. But both desktop and command-line tools are also regularly used in different settings. The choice of design will be informed by things like how open the data can be, the volume of data being checked, and how the validator might be integrated into a data workflow, e.g. as an automated or manual step.
Examples of different types of data validator
Here are some examples of different data validators created for different purposes and projects
The first few on the list are largely syntax checkers. They validate whether your CSV, JSON or GeoJSON files are correctly structured.
The others go further and check not just the format of the data, but also its validity against a schema. That schema is defined in a standard intended to support consistent publication of data across a community. The goal of these tools is to improve quality of data for a wide range of potential users, by guiding publishers about how to publish data well.
The last three examples are validators that are designed to help publishers meet the needs of a specific application or consumer of the data. They’re an actionable way to test data against the requirements of a specific user.
Validators also vary in other ways.
For example, the 360Giving, OpenContracting and Rich Results Test validators all accept a range of different data formats. They validate different syntaxes against a common schema. Others are built around a single specific format.
Some tools provide a low-level view of the results, e.g. a list of errors and warnings with reference to specific sections of the data. Others provide a high-level interface, such as a preview of what the data looks like on a map or as it would be displayed in a specific application. This type of visual presentation can help catch other types of errors and more directly confirm how data might be interpreted, whilst also making the tool useful to a wider audience.
What do we mean by data being valid?
For simple syntax checking identifying whether something is valid is straight-forward. Your JSON is either well-formed or it’s not.
Validators that are designed around specific applications also usually have a clear marker of what is “valid”: can the application parse, interpret and display the data as expected? Does my twitter card look correct?
In other examples, the notion of “valid” is harder to define. There may be some basic rules around what a minimum viable dataset looks like. If so, these are easier to identify and classify as errors.
But there is often variability within a schema. E.g. optional elements. This means that validators need to offer more than just a binary decision and instead offer warnings, suggestions and feedback.
For example, when thinking about the design of the OpenActive validator we discussed the need to go beyond simple validation and provide feedback and prompts along the lines of “you haven’t provided a price, is the event free or chargeable?” Or “you haven’t provided an image for this event, this is legal but evidence shows that participants are more likely to sign-up to events where they can see what participation looks like.”
To put this differently: data quality depends on how you’re planning to use the data. It’s not an absolute. If you’re not validating data for a specific application or purpose, then your tool should be prompting users to think about the choices they are making around how data is being shared.
In the context of sharing and publishing open data, this moves the role of a data validator beyond simply checking correctness, and towards identifying sources of friction that will exist between publisher and consumer.
Beyond the formal conformance criteria defined in a specification, deciding whether something is valid or not is really just a marker for how much extra work is required by a consumer. And in some cases the publisher may not have the time, budget or resources to invest in reducing that burden.
Things to think about when designing a validator
To wrap up this post, here are some things to think about when designing a data validator:
Who are your users? What level of technical skill and understanding are you designing for?
How will the validator be used or integrated into the user’s workflow? A tool for integration into a continuous integration environment will need to operate differently to something used to do acceptance checking before data is published. Maybe you need several different tools?
How much knowledge of the relevant standards or specification will a user need before they can use the tool? Should the tool facilitate learning and exploration about how to structure data, or is it just checking existing data?
How can you provide good, clear feedback? Tools that rely on applying machine-readable schemas like JSON Schema can often have cryptic messages as they rely on an underlying library to report errors.
How can you provide guidance and feedback that will help users decide how to improve data? Is the feedback actionable? (For example in CSVLint we figured out that when reporting that a user had an incorrect mime-type for their CSV file we could identify if it was served from AWS and provide a clear suggestion about how to fix the issue)
Would showing the data, as a preview or within a mocked up view, help surface problems or build confidence in how data is published?
Are the documentation about how to publish data and the reports from your validator consistent? If not, then fix the documentation or explain the limits of the validator
Finally, if you’re designing a validator for a specific application, then don’t mark as “invalid” anything that you can simply ignore. Don’t force the ecosystem to converge on your preferences.
You may not be interested in the full scope of a standard, but different applications and users will have different needs.
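As an illustration of the point about actionable feedback in the list above, here is a rough sketch of a mime-type check that adds a hosting-specific hint. It is not how CSVLint is implemented: the function name, the message wording and the idea of detecting the host from the `Server` response header are assumptions for the example.

```python
from typing import Optional


def mime_type_feedback(headers: dict) -> Optional[str]:
    """Return an actionable message if a CSV is served with the wrong type."""
    # Ignore parameters such as "; charset=utf-8" when comparing types.
    content_type = headers.get("Content-Type", "").split(";")[0].strip().lower()
    if content_type == "text/csv":
        return None
    message = f"Served as '{content_type or 'unknown'}' rather than 'text/csv'."
    # Assumption: responses from Amazon S3 identify themselves in the
    # Server header, which lets us suggest a specific fix.
    if headers.get("Server", "").lower() == "amazons3":
        message += (" The file appears to be hosted on Amazon S3: set the "
                    "object's Content-Type metadata to 'text/csv'.")
    return message


print(mime_type_feedback({"Content-Type": "application/octet-stream",
                          "Server": "AmazonS3"}))
```

The generic message tells the user what is wrong. The extra sentence tells them where to go to fix it, which is what makes the feedback actionable.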
Data quality is a dialogue between publishers and users of data. One that will evolve over time as tools, applications, norms and standards become adopted across a data ecosystem. A data validator is an important building block that can facilitate that discussion.
Most of the last few years has been very focused on research and advisory work. I’ve enjoyed all of that. But I’ve been missing the rewards that come from building, maintaining and growing things over the longer term.
I really enjoy consulting and freelance work in general. It’s a fantastic opportunity to learn about many different sectors and work with a range of different teams and organisations. But you can only go so far: a good engagement is really about providing insight, building capacity and then moving on to the next thing.
In my earlier career I spent more time making and maintaining stuff which has its own rewards. So I’ve been looking for a role that would allow me to do that. I’ve turned down some and got knocked back from others. So it goes.
I’m very happy to say that I’ve found a new part-time role. And it’s with a project that I’ve already been involved with for some time. Having stepped down as a trustee, as of next month I will be taking on the role of CTO of Energy Sparks on a part-time basis.
I helped to start the project a number of years ago. And have continued to provide some advice and support as it grew into a charity. Recently I’ve been helping the team build out some new features. Which is what prompted me to learn more about the UK’s smart meter data ecosystem.
The product is at a stage where it needs some more technical leadership and support. There are some interesting data engineering and technical challenges involved in scaling the system as it continues to roll-out across the UK. I’ll be enjoying digging into that.
What really excites me though, is the opportunity that Energy Sparks provides to help educate children around climate change and energy efficiency. And, more broadly, data literacy in general. The insights it provides to staff are already unlocking cost savings, but it’s this wider impact and use in the classroom, which I think is really key.
There’s also a lot of interesting work happening in the energy sector in the UK right now, which should lead to increased access to, and better use of data. It will be good to be a small part of that.
For now, this will be a part-time role. I’ll be continuing to do freelance work alongside my other duties. Recently I’ve been helping Full Fact think through structured data around fact checks and a team at the World Bank think about how to develop open standards for risk data.
Get in touch if you need some help with other projects or want to learn more about what we’re doing in Energy Sparks.
OpenActive is a community-led initiative in the sport and physical activity sector in England. Its goal is to help people get healthier and more active by making it easier for them to find information about activities and events happening in their area. Publishing open data about opportunities to be active is a key part of its approach.
The initiative has been running for several years, funded by Sport England. It’s supported by a team at the Open Data Institute who are working in close collaboration with a range of organisations across the sector.
If you’re interested in more of the details then I’d encourage you to dig into those posts as well as the developer portal.
What I wanted to cover in this blog post are some reflections about one of the key decisions we made early in the standards workstream. This was to base the core data model on Schema.org.
Why did we end up basing the standards on Schema.org?
We started the standards work in OpenActive by doing a proper scoping exercise. This helped us to understand the potential benefits of introducing a standard, and the requirements that would inform its development.
As part of our initial research, we did a review of what standards existed in the sector. We found very little that matched our needs. The few APIs that were provided were quite limited and proprietary and there was little consistency around how data was organised.
It was clear that some standardisation would be beneficial and that there was little in the way of sector-specific work to build on. It was also clear that we’d need a range of different types of standard. Data formats and APIs to support exchange of data, a common data model to help organise data and a taxonomy to help describe different types of activity.
For the data model, it was clear that the core domain model would need to be able to describe events. E.g. that a yoga class takes place in a specific gym at regular times. This would support basic discovery use cases. Where can I go and exercise today? What classes are happening near me?
As part of our review of existing standards, we found that Schema.org already provided this core model along with some additional vocabulary that would help us categorise and describe both the events and locations. For example, whether an Event was free, its capacity and information about the organiser.
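To make that concrete, here is a sketch of how a yoga class might be described using core Schema.org terms, built as a Python dictionary and serialised as JSON-LD. The property names are Schema.org terms, but the values (names, date, capacity) are invented for illustration and this is not taken from the OpenActive specifications.

```python
import json

# A hypothetical yoga class described with Schema.org terms.
# All values below are made up for the example.
yoga_class = {
    "@context": "https://schema.org",
    "@type": "Event",
    "name": "Beginners yoga",
    "startDate": "2024-01-08T18:00:00Z",
    "location": {
        "@type": "Place",
        "name": "Example Leisure Centre",
    },
    "organizer": {
        "@type": "Organization",
        "name": "Example Yoga Club",
    },
    "isAccessibleForFree": True,
    "maximumAttendeeCapacity": 20,
}

print(json.dumps(yoga_class, indent=2))
```

Even this small record is enough to support the basic discovery questions: what is happening, where, when, who runs it and whether it costs anything.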
For many people Schema.org may be more synonymous with publishing data for use by search engines. But as a project its goal is much broader: it is “a collaborative, community activity with a mission to create, maintain, and promote schemas for structured data”.
The data model covers much more than what search engines are consuming. Some communities are instead using the project as a means to collaborate on developing better vocabulary for sharing data between other applications. As well as aligning existing vocabularies under a common umbrella.
New standards should ideally be based on existing standards. We knew we were going to be building the OpenActive technical standards around a “stack” of standards that included HTTP, JSON and JSON-LD. So it was a natural step to base our initial domain model on aspects of Schema.org.
What were the benefits?
An early benefit of this approach is that we could immediately focus our roadmap on exploring extensions to the Schema.org data model that would add value to the physical activity sector.
Our initial community sessions around the OpenActive standards involved demonstrating how well the existing Schema.org model fitted the core requirements. And exploring where additional work was needed.
This meant we skipped any wrangling around how to describe events and instead focused on what we wanted to say about them. Important early questions focused on what information potential participants would find helpful in understanding whether a specific activity or event is something that they might want to try. For example, details like: what activities are involved and for what level of competency?
We were able to identify those elements of the core Schema.org model that supported our use cases and then documented some extensions in our own specifications. The extensions and clarifications were important for the OpenActive community, but not necessarily relevant in the broader context in which Schema.org is being used. We wanted to build some agreement and usage in our community first, before suggesting changes to Schema.org.
As well as giving us an initial head start, the decision also helped us address new requirements much quicker.
As we uncovered further requirements that meant expanding our data model, we were always able to look first to see if existing Schema.org terms covered what we needed. We began using it as a kind of “dictionary” that we could draw on when needed.
Where existing parts of the Schema.org model fitted our needs, it was gratifying to be able to rapidly address the new requirements by documenting patterns for how to use them. Data publishers were also doing the same thing. Having a common dictionary of terms gave freedom to experiment with new features, drawing on terms defined in a public schema, before the community had discussed and agreed how to implement those patterns more broadly.
Every standards project has its own cadence. The speed of development and adoption are tied up with a whole range of different factors that go well beyond how quickly you can reach consensus around a specification.
But I think the decision to use Schema.org definitely accelerated progress and helped us more quickly deliver a data model that covered the core requirements for the sector.
Where were the challenges?
The approach wasn’t without its challenges, however.
Firstly, for a sector that was new to building open standards, choosing to base parts of that new standard on one project and then defining extensions created some confusion. Some communities seem more comfortable with piecing together vocabularies and taxonomies, but that is not true more widely.
Developers found it tricky to refer to both specifications, to explore their options for publishing different types of data. So we ended up expanding our documentation to cover all of the Schema.org terms we recommended or suggested people use, instead of focusing more on our own extensions.
Secondly, we also initially adopted the same flexible, non-prescriptive approach to data publishing that Schema.org uses. It does not define strict conformance criteria and there are often different options for how the same data might be organised depending on the level of detail a publisher has available. If Schema.org were too restrictive then it would limit how well the model could be used by different communities. It also leaves space for usage patterns to emerge.
In OpenActive we recognised that the physical activity sector had a wide range of capabilities when it came to publishing structured data. And different organisations organised data in different ways. We adopted the same less prescriptive approach to publishing with the goal of reducing the barriers to getting more data published. Essentially asking publishers to structure data as best they could within the options available.
In the end this wasn’t the right decision.
Too much flexibility made it harder for implementers to understand what data would be most useful to publish. And how to do it well. Many publishers were building new services to expose the data so they needed a clearer specification for their development teams.
We addressed this in Version 2 of the specifications by considerably tightening up the requirements. We defined which terms were required or just recommended (and why). And added cardinalities and legal values for terms. Our specification became a more formal, extended profile of Schema.org. This also allowed us to build a data validator that is now being released and maintained alongside the specifications.
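A tightened profile of this kind can be thought of as a table of rules that a validator applies to each record. The sketch below is an assumption-laden illustration, not the OpenActive specification or validator: the `PROFILE` table, the choice of terms and the legal values for `level` are all invented.

```python
# A made-up profile: each term has a requirement level, a maximum
# cardinality (None means unbounded) and optionally a set of legal values.
PROFILE = {
    "name": {"requirement": "required", "max": 1},
    "startDate": {"requirement": "required", "max": 1},
    "image": {"requirement": "recommended", "max": None},
    "level": {"requirement": "recommended", "max": None,
              "allowed": {"Beginner", "Intermediate", "Advanced"}},
}


def check_profile(record: dict) -> dict:
    """Report broken rules as errors and missing recommended terms as warnings."""
    report = {"errors": [], "warnings": []}
    for term, rule in PROFILE.items():
        if term not in record:
            bucket = "errors" if rule["requirement"] == "required" else "warnings"
            report[bucket].append(f"{term}: missing {rule['requirement']} term")
            continue
        value = record[term]
        values = value if isinstance(value, list) else [value]
        # Cardinality: too many values for a term is an error.
        if rule["max"] is not None and len(values) > rule["max"]:
            report["errors"].append(f"{term}: at most {rule['max']} value(s) allowed")
        # Legal values: anything outside the allowed set is an error.
        allowed = rule.get("allowed")
        if allowed:
            for v in values:
                if v not in allowed:
                    report["errors"].append(f"{term}: '{v}' is not a legal value")
    return report


report = check_profile({"name": "Beginners yoga", "level": ["Beginner", "Expert"]})
print(report)
```

Writing the rules down as data rather than prose is what makes it practical to keep a validator and a specification in step with each other.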
Our third challenge was about process. In a few cases we identified changes that we felt would sit more naturally within Schema.org than our own extensions. For example, they were improvements and clarifications around the core Event model that would be useful more widely. So we submitted those as proposed changes and clarifications.
Given that Schema.org has a very open process, and the wide range of people active in discussing issues and proposals, it was sometimes hard to know how decisions would get made. We had good support from Dan Brickley and others stewarding the project, but without knowing much about who is commenting on your proposal, their background or their own use cases, it was tricky to know how much time to spend on handling this feedback. Or when we could confidently say that we had achieved some level of consensus.
We managed to successfully navigate this, by engaging as we would within any open community: working transparently and collegiately, and being willing to reflect on and incorporate feedback regardless of its source.
The final challenge was about assessing the level of use of different parts of the Schema.org model. If we wanted to propose a change in how a term was documented or suggest a revision to its expected values, it is difficult to assess the potential impact of that change. There’s no easy way to see which applications might be relying on specific parts of the model. Or how many people are publishing data that uses different terms.
The Schema.org documentation does flag terms that are currently under discussion or evaluation as “pending”. But outside of this it’s difficult to understand more about how the model is being used in practice. To do that you need to engage with a user community, or find some metrics about deployment.
We handled this by engaging with the open process of discussion, sharing our own planned usage to inform the discussion. And, where we felt that Schema.org didn’t fit with the direction we needed, we were happy to look to other standards that better filled those gaps. For example we chose to use SKOS to help us organise and structure a taxonomy of physical activities rather than using some of the similar vocabulary that Schema.org provides.
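The appeal of SKOS for this job is that it models a taxonomy as concepts with labels and broader or narrower links. The sketch below mimics that structure with a plain Python dictionary. The concept identifiers and labels are invented and this is not the actual OpenActive activity list.

```python
# Hypothetical concepts, loosely following skos:prefLabel and skos:broader.
# The identifiers and labels are invented for the example.
CONCEPTS = {
    "ex:yoga": {"prefLabel": "Yoga", "broader": None},
    "ex:hatha-yoga": {"prefLabel": "Hatha Yoga", "broader": "ex:yoga"},
    "ex:hot-yoga": {"prefLabel": "Hot Yoga", "broader": "ex:yoga"},
}


def ancestors(concept_id: str) -> list:
    """Follow the broader links up to the top of the taxonomy."""
    path = []
    parent = CONCEPTS[concept_id]["broader"]
    while parent is not None:
        path.append(CONCEPTS[parent]["prefLabel"])
        parent = CONCEPTS[parent]["broader"]
    return path


print(ancestors("ex:hatha-yoga"))
```

A structure like this lets an application treat a search for “Yoga” as also matching events tagged with the more specific concepts.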
Choosing to draw on Schema.org as a source of part of our domain model didn’t mean that we felt tied to using only what it provides.
Overall I’m happy that we made the right decision. The benefits definitely outweighed the challenges.
But navigating those challenges was easier because those of us leading the standards work were comfortable both with working in the open and in combining different standards to achieve a specific goal. Helping to build more competency in this area is one goal of the ODI standards guidebook.
If you’re involved in a project to build a common data model as part of a community project to publish data, then I’d recommend looking at whether basing some or all of that model around Schema.org might help kickstart your technical work.
If you do that, my personal advice would be:
Remember that Schema.org isn’t the right home for every data model. Depending on your requirements, the complexity and the potential uses for the data, you may be better off designing and iterating on your model separately. Similarly, don’t expect that every change or extension you might want to make will necessarily be accepted into the core model
Don’t assume that search engines will start using your data, just because you’re using Schema.org as a basis for publishing, or even if you successfully submit change proposals. It’s not a means of driving adoption and use of your data or preferred model
Plan to write your own specifications and documentation that describe how your application or community expects data to be published. You’ll need to add more conformance criteria and document useful patterns that go beyond what Schema.org is providing
Work out how you will engage with your community. To make it easier to refine your specifications, discuss extensions and gather implementation feedback, you’ll still need a dedicated forum or channel for your community to collaborate. Schema.org doesn’t really provide a home for that. You might have your own github project or setup a W3C community group.
Build your own tooling. Schema.org are improving their own tooling, but you’ll likely need your own validation tools that are tailored to your community and your specifications
Contribute to the Schema.org project where you can. If you have feedback, proposed changes or revisions then submit these to the project. It’s through a community approach that we improve the model for everyone. Just be aware that there are likely to be a whole range of different use cases that may be different to your own. Your proposals may need to go through several revisions before being accepted. Proposals that draw on real-world experience or are tied to actual applications will likely carry more weight than general opinions about the “right” way to design something
Be prepared to diverge where necessary. As I’ve explained above, sometimes the right option is to propose changes to Schema.org. And sometimes you may need to be ready to draw on other standards or approaches.
Disclaimer: this blog post is about my understanding of the UK’s smart meter data ecosystem and contains some opinions about how it might evolve. These do not in any way reflect those of Energy Sparks of which I am a trustee.
This blog post is an introduction to the UK’s smart meter data ecosystem. It sketches out some of the key pieces of data infrastructure with some observations around how the overall ecosystem is evolving.
It’s a large, complex system so this post will only touch on the main elements. Pointers to more detail are included along the way.
Traditionally, data about your home or business energy usage was collected by someone coming to read the actual numbers displayed on the front of your meter. And in some cases that’s still how the data is collected. It’s just that today you might be entering those readings into a mobile or web application provided by your supplier. In between those readings, your supplier will be estimating your usage.
This situation improved with the introduction of AMR (“Automated Meter Reading”) meters which can connect via radio to an energy supplier. The supplier can then read your meter automatically, to get basic information on your usage. After receiving a request the meter can broadcast the data via radio signal. These meters are often only installed in commercial properties.
Smart meters are a step up from AMR meters. They connect via a Wide Area Network (WAN) rather than radio, support two way communications and provide more detailed data collection. This means that when you have a smart meter your energy supplier can send messages to the meter, as well as taking readings from it. These messages can include updated tariffs (e.g. as you switch supplier or if you are on a dynamic tariff) or a notification to say you’ve topped up your meter, etc.
The improved connectivity and functionality means that readings can be collected more frequently and are much more detailed. Half hourly usage data is the standard. A smart meter can typically store around 13 months of half-hourly usage data.
The first generation of smart meters are known as SMETS-1 meters. The latest meters are SMETS-2.
From a consumer point of view, services like Find My Supplier will allow you to find your MPRN and energy suppliers.
Connectivity and devices in the home
If you have a smart meter installed then your meters might talk directly to the WAN, or access it via a separate controller that provides the necessary connectivity.
But within the home, devices will talk to each other using Zigbee, which is a low power internet of things protocol. Together they form what is often referred to as the “Home Area Network” (HAN).
It’s via the home network that your “In Home Display” (IHD) can show your current and historical energy usage as it can connect to the meter and access the data it stores. Your electricity usage is broadcast to connected devices every 10 seconds, while gas usage is broadcast every 30 minutes.
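To put rough numbers on these frequencies, here is a small back-of-the-envelope calculation. The intervals come from the figures above. Treating a month as 30 days is a simplifying assumption, so the storage figure is only approximate.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Half-hourly usage readings recorded by the meter.
half_hourly_readings_per_day = SECONDS_PER_DAY // (30 * 60)

# Rough size of the meter's store: 13 months, assuming 30-day months.
stored_readings = half_hourly_readings_per_day * 13 * 30

# Updates broadcast to devices on the home network.
electricity_updates_per_day = SECONDS_PER_DAY // 10
gas_updates_per_day = SECONDS_PER_DAY // (30 * 60)

print(half_hourly_readings_per_day)   # 48
print(stored_readings)                # 18720
print(electricity_updates_per_day)    # 8640
print(gas_updates_per_day)            # 48
```

The gap between 48 stored readings a day and thousands of in-home electricity updates shows why the home network is the better source for spotting which devices are driving usage.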
Your IHD can show your energy consumption in various ways, including how much it is costing you. This relies on your energy supplier sending your latest tariff information to your meter.
As this article by Bulb highlights, the provision of an IHD and its basic features is required by law. Research showed that IHDs were more accessible and nudged people towards being more conscious of their energy usage. The high-frequency updates from the meter to connected devices make it easier, for example, for you to identify which devices or uses contribute most to your bill.
Your energy supplier might provide other apps and services that provide you with insights, via the data collected via the WAN.
But you can also connect other devices into the home network provided by your smart meter (or data controller). One example is a newer category of IHD called a “Consumer Access Device” (CAD), e.g. the Glow.
These devices connect via Zigbee to your meter and via Wifi to a third-party service, where it will send your meter readings. For the Glow device, that service is operated by Hildebrand.
These third party services can then provide you with access to your energy usage data via mobile or web applications. Or even via API. Otherwise as a consumer you need to access data via whatever methods your energy supplier supports.
The smart meter network infrastructure
SMETS-1 meters connected to a variety of different networks. This meant that if you switched suppliers then they frequently couldn’t access your meter because it was on a different network. So meters needed to be replaced. And, even if they were on the same network, differences in technical infrastructure meant the meters might lose functionality.
SMETS-2 meters don’t have this issue as they all connect via a shared Wide Area Network (WAN). There are two of these covering the north and south of the country.
While SMETS-2 meters are better than previous models, they still have all of the issues of any Internet of Things device: problems with connectivity in rural areas, need for power, varied performance based on manufacturer, etc.
Some SMETS-1 meters are also now being connected to the WAN.
Who operates the infrastructure?
The Data Communication Company is a state-licensed monopoly that operates the entire UK smart meter network infrastructure. It’s a wholly-owned subsidiary of Capita. Their current licence runs until 2025.
DCC subcontracted provision of the WAN to support connectivity of smart meters to two regional providers. In the North of England and Scotland that provider is Arqiva. In the rest of England and Wales it is Telefonica UK (who own O2).
All of the messages that go to and from the meters via the WAN go via DCC’s technical infrastructure.
It’s mandatory for smart meters to now be installed in domestic and smaller commercial properties in the UK. Companies can install SMETS-1 or SMETS-2 meters, but the rules were changed recently so only newer meters count towards their individual targets. And energy companies can get fined if they don’t install them quickly enough.
Consumers are being encouraged to have smart meters fitted in existing homes, as meters are replaced, to provide them with more information on their usage and access to better tariffs such as those that offer dynamic time of day pricing.
But there are also concerns around privacy and fears of energy supplies being remotely disconnected, which are making people reluctant to switch when given the choice. Trust is clearly an important part of achieving a successful rollout.
The future will hold much more fine-grained data about energy usage across the homes and businesses in the UK. But in the short-term there’s likely to be a continued mix of different meter types (dumb, AMR and smart) meaning that domestic and non-domestic usage will have differences in the quality and coverage of data due to differences in how smart meters are being rolled out.
Smart meters will give consumers greater choice in tariffs because the infrastructure can better deal with dynamic pricing. It will help to shift to a greener more efficient energy network because there is better data to help manage the network.
The code sets out the roles and responsibilities of the various actors who have access to the network. That includes the infrastructure operators (e.g. the organisations looking after the power lines and cables) as well as the energy companies (e.g. those who are generating the energy) and the energy suppliers (e.g. the organisations selling you the energy).
The focus of the code is on those core actors. But there is an additional category of “Other Providers”. This is basically a miscellaneous group of other organisations not directly involved in provision of energy as a utility, but may have or require access to the data infrastructure.
These other providers include organisations that:
provide technology to energy companies who need to be able to design, test and build software against the smart meter network
offer services like switching and product recommendations
access the network on behalf of consumers, allowing them to directly access usage data in the home using devices, e.g. Hildebrand and its Glow device
provide other additional third-party services. This includes companies like Hildebrand and N3RGY that are providing value-added APIs over the core network
There are also substantial annual costs for access to the network. This helps to make the infrastructure sustainable, with all users contributing to it.
Data ecosystem map
As a summary, here are the key points:
your in-home devices send and receive messages and data via the smart meter or controller installed in your home or business property
your in-home device might also be sending your data to other services, with your consent
messages to and from your meter are sent via a secure network operated by the DCC
the DCC provide APIs that allow authorised organisations to send and receive messages from that data infrastructure
the DCC doesn’t store any of the meter readings, but does collect metadata about the traffic over that network
organisations that have access to the infrastructure may store and use the data they can access, but generally need consent from users for detailed meter data
the level and type of access, e.g. what messages can be sent and received, may differ across organisations
your energy supplier uses the data they retrieve from the DCC to generate your bills, provide you with services, optimise the system, etc
the UK government has licensed the DCC to operate that national data infrastructure, with Ofgem regulating the system
At a high-level, the UK smart meter system is like a big federated database: the individual meters store and submit data, with access to that database being governed by the DCC. The authorised users of that network build and maintain their own local caches of data as required to support their businesses and customers.
The evolving ecosystem
This is a big complex piece of national data infrastructure. This makes it interesting to unpick as an example of real-world decisions around the design and governance of data access.
It’s also interesting as the ecosystem is evolving.
Changing role of the DCC
The DCC have recently published a paper called “Data for Good” which sets out their intention to become a “system data exchange” (you should read that as “system data” exchange). This means providing access to the data they hold about meters and the messages sent to and from them. (There’s a list of these message types in a SEC code appendix).
The paper suggests that increased access to that data could be used in a variety of beneficial ways. This includes helping people in fuel poverty, or improving management of the energy network.
The DCC is also required to improve efficiency and costs for operating the network to reduce burden on the organisations paying to use the infrastructure. This includes extending use of the network into other areas. For example to water meters or remote healthcare (see note at end of page 13).
Any changes to what data is provided, or how the network is used will require changes to the licence and some negotiation with Ofgem. As the licence is due to be renewed in 2025, then this might be laying groundwork for a revised licence to operate.
In addition to a potentially changing role for the DCC, the other area in which the ecosystem is growing is via “Other Providers” that are becoming data intermediaries.
The infrastructure and financial costs of meeting the technical, security and audit requirements for direct access to the DCC network create a high barrier for third parties wanting to provide additional services that use the data.
There is a small but growing number of organisations, including Hildebrand, N3RGY, Smart Pear and Utiligroup, who see an opportunity to lower this barrier by providing value-added services over the DCC infrastructure. For example, simple JSON-based APIs that simplify access to meter data.
Coupled with access to sandbox environments to support prototyping, this provides a simpler and cheaper API with which to integrate. Security remains important but the threat profiles and risks are different as API users have no direct access to the underlying infrastructure and only read-only access to data.
To comply with the governance of the existing system, the downstream user still needs to ensure they have appropriate consent to access data. And they need to be ready to provide evidence if the intermediary is audited.
The APIs offered by these new intermediaries are commercial services: the businesses are looking to do more than just cover their costs and will be hoping to generate significant margin through what is basically a reseller model.
It’s worth noting that access to AMR meter data is also typically via commercial services, at least for non-domestic meters. The price per meter for data from smart meters currently seems lower, perhaps because it’s relying on a more standard, shared underlying data infrastructure.
As the number of smart meters grows I expect access to a cheaper and more modern API layer will become increasingly interesting for a range of existing and new products and services.
Lessons from Open Banking
From my perspective the major barrier to more innovative use of smart meter data is the existing data infrastructure. The DCC obviously recognises the difficulty of integration and other organisations are seeing potential for new revenue streams by becoming data intermediaries.
And needless to say, all of these new intermediaries have their own business models and bespoke APIs. Ultimately, while they may end up competing in different sectors or markets, or over quality of service, they’re all relying on the same underlying data and infrastructure.
In the finance sector, Open Banking has already demonstrated that a standardised set of APIs, licensing and approach to managing access and consent can help to drive innovation in a way that is good for consumers.
There are clear parallels to be drawn between Open Banking, which increased access to banking data, and how access to smart meter data might be increased. It’s a very similar type of data: highly personal, transactional records. And can be used in very similar ways, e.g. account switching.
The key difference is that there’s no single source of banking transactions, so regulation was required to ensure that all the major banks adopted the standard. Smart meter data is already flowing through a single state-licensed monopoly.
Perhaps if the role of the DCC is changing, then they could also provide a simpler standardised API to access the data? Ofgem and DCC could work with the market to define this API as happened with Open Banking. And by reducing the number of intermediaries it may help to increase trust in how data is being accessed, used and shared?
If there is a reluctance to extend DCC’s role in this direction then an alternative step would be to recognise the role and existence of these new types of intermediary within the Smart Energy Code. That would allow their licence to use the network to include agreement to offer a common, core standard API, common data licensing terms and a common approach to the collection and management of consent. Again, Ofgem, DCC and others could work with the market to define that API.
For me, these two approaches are the most obvious ways to carry the lessons and models from Open Banking into the energy sector. There are clearly many more aspects of the energy data ecosystem that might benefit from improved access to data, which is where initiatives like Icebreaker One are focused. But starting with what will become a fundamental part of the national data infrastructure seems like an obvious first step to me.
The other angle that Open Banking tackled was creating better access to data about banking products. The energy sector needs this too, as there’s no easy way to access data on energy supplier tariffs and products.