Why are we still building portals?

The Geospatial Commission have recently published some guidance on Designing Geospatial Data Portals. There’s a useful overview in the accompanying blog post.

It’s good, clear guidance that should help anyone building a data portal. It has tips for designing search interfaces, presenting results and dataset metadata.

There’s very little advice that is specifically relevant to geospatial data and little in the way of new insights in general. The recommendations echo lots of existing research, guidance and practice. But it’s always helpful to see best practices presented in an accessible way.

For guidance that is ostensibly about geospatial data portals, I would have liked to have seen more of a focus on geospatial data. This aspect is largely limited to recommending the inclusion of a geospatial search, spatial filtering and use of spatial data formats.

It would have been nice to see some suggestions around the useful boundaries to include in search interfaces, recommendations around specific GIS formats and APIs, or some exploration of how to communicate the geographic extents of individual datasets to users.

Fixing a broken journey

The guidance presents a typical user journey that involves someone using a search engine, finding a portal rather than the data they need, and then repeating their search in a portal.

Improving that user journey is best done at the first step. A portal is just getting in the way.

Data publishers should be encouraged to improve the SEO of their datasets if they really want them to be found and used.

Data publishers should be encouraged to improve the documentation and metadata on their “dataset landing pages” to help put that data in context.

If we can improve this then we don’t have to support users in discovering a portal, checking whether it is relevant, teaching them to navigate it, etc.

We don’t really need more portals to improve discovery or use of data. We should be thinking about this differently.

There are many portals, but this one is mine

Portals are created for all kinds of purposes.

Many, run by individual organisations, are just a fancy CMS for datasets.

Others are there to act as hosts for data to help others make it more accessible. Some provide a directory of datasets across a sector.

Looking more broadly, portals support consumption of data by providing a single point of integration with a range of tools and platforms. They work as shared spaces for teams, enabling collaborative maintenance and sharing of research outputs. They also support data governance processes: you need to know what data you have in order to ensure you’re managing it correctly.

If we want to build better portals, then we ought to really have a clearer idea of what is being built, for whom and why.

This new guidance rightly encourages user research, but presumes building a portal as the eventual outcome.

I don’t mean that to be dismissive. There are definitely cases where it is useful to bring together collections of data to help users. But that doesn’t necessarily mean that we need to create a traditional portal interface.

Librarians exist

For example, in order to tackle specific challenges it can be useful to identify a set of relevant related data. This implies a level of curation — a librarian function — which is so far missing from the majority of portals.

Curated collections of data (& code & models & documentation & support) might drive innovation whilst helping ensure that data is used in ways that are mindful of the context of its collection. I’ve suggested recipes as one approach to that. But there are others.

Curation and maintenance of collections are less popular because they’re not easily automated. You need to employ people with an understanding of an issue, the relevant data, and how it might be used or not. To me this approach is fundamental to “publishing with purpose”.

Data agencies

Jeni has previously proposed the idea of “data agencies” as a means of improving discovery. The idea is briefly mentioned in this document.

I won’t attempt to capture the nuance of her idea, but it involves providing a service to support people in finding data via an expert help desk. The ONS already have something similar for their own datasets, but an agency could cover a whole sector or domain. It could also publish curated lists of useful data.

This approach would help broker relationships between data users and data publishers. This would not only help improve discovery, but also build trust and confidence in how data is being accessed, used and shared.

Actually linking to data?

I have a working hypothesis that, setting aside those that need to aggregate lots of small datasets from different sources, most data-enabled analyses, products and services typically only use a small number of related datasets. Maybe a dozen?

The same foundational datasets are used repeatedly in many different ways. The same combination of datasets might also be analysed for different purposes. It would be helpful to surface the most useful datasets and their combinations.

We have very little insight into this because dataset citation, linking and attribution practices are poor.

We could improve data search if this type of information was more readily available. Link analysis isn’t a substitute for good metadata, but it’s part of the overall puzzle in creating good discovery tools.

Actually linking to data when it’s referenced would also be helpful.

Developing shared infrastructure

Portals often provide an opportunity to standardise how data is being published. As an intermediary they inevitably shape how data is published and used. This is another area where existing portals do little to improve their overall ecosystem.

But those activities aren’t necessarily tied to the creation and operation of a portal. Provision of shared platforms, open source tools, guidance, quality checkers, linking and aggregation tools, and driving development and adoption of standards can all be done in other ways.

It doesn’t matter how well designed your portal interface is if a user ends up at an out-of-date, poor quality or inaccessible dataset. Or if the costs of using it are too high. Or a lack of context contributes to it being misused or misinterpreted.

This type of shared infrastructure development doesn’t get funded because it’s not easy to automate. And it rarely produces something you can point at and say “we launched this”.

But it is vital to actually achieving good outcomes.

Portals as service failures

The need for a data portal is an indicator of service failure.

Addressing that failure might involve creating a new service. But we shouldn’t rule out reviewing existing services to see where data can be made more discoverable.

If a new service is required then it doesn’t necessarily have to be a conventional search engine.

24 different tabular formats for half-hourly energy data

A couple of months ago I wrote a post that provided some background on the data we use in Energy Sparks.

The largest data source comes from gas and electricity meters (consumption) and solar panels (generation). While we’re integrating with APIs that allow us to access data from smart meters, for the foreseeable future most of this data will still be collected via AMR rather than SMETS-2 meters. And then shared with us as CSV files attached to emails.

That data is sent via a variety of systems and platforms run by energy companies, aggregators and local authorities. We’re currently dealing with about 24 different variations of what is basically the same dataset.

I thought I’d share a quick summary of that variation, as it’s interesting from a “designing CSV files” and data standards perspective.

For a quick overview, you can look at this Google spreadsheet which provides a summary of the formats, in a way that hopefully makes them easy to compare.

The rest of this post has some notes on the variations.

What data are we talking about?

In Energy Sparks we work with half-hourly consumption and production data. A typical dataset will consist of a series of daily readings for each meter, with 48 half-hourly values per day.

Each half hourly data point reports the total amount of energy consumed (or generated) in the previous 30 minutes.

A dataset will usually contain several days of readings for many different meters.

This means that the key bits of information that we need to process each dataset are:

  • An identifier for the meter, e.g. an MPAN or MPRN
  • The date that the readings were taken
  • A series of 48 data points making up a full day’s readings

Pretty straight-forward. But as you can see in the spreadsheet there are a lot of different variations.

We receive different formats for both the gas and electricity data. Different formats for historical vs ongoing data supply. Or both.

And formats might change as schools or local authorities change platform, suppliers, etc.

Use of CSV

In general, the CSV files are pretty consistent. We rely on the Ruby CSV parser’s default behaviour to automatically identify line endings. And all of the formats use commas, rather than tabs, as delimiters.

The number of header rows varies. Most have a single row, but some don’t have any. A couple have two.
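As a rough illustration of how we cope with that variation (a simplified sketch rather than our actual ingest code, with `header_rows` as an invented per-format setting):

```ruby
require "csv"

# Simplified sketch: read a file and skip a configurable number of header rows.
# The CSV parser's defaults handle the line-ending differences for us.
def data_rows(path, header_rows: 1)
  CSV.read(path).drop(header_rows)
end
```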

Date formats

Various date formats are used. The following lists the most common first:

  1. %d/%m/%Y (15)
  2. %d/%m/%y (4)
  3. %y-%m-%d (3)
  4. %b %e %Y %I:%M%p (1)
  5. %e %b %Y %H:%M:%S (1)

Not much use of ISO 8601!

But the skew towards readable formats probably makes sense given that the primary anticipated use of this data is for people to open it in a spreadsheet.

Where we have several different formats from a single source (yes, this happens), I’ve noticed that %Y-based date formats tend to be used for historical data, while the %y year format seems to be the default for ongoing data.
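In practice this just means pairing each format with a date pattern. A minimal sketch, with illustrative values:

```ruby
require "date"

# Illustrative only: apply whichever date pattern is configured for a format.
def parse_reading_date(value, pattern)
  Date.strptime(value, pattern)
end

parse_reading_date("15/03/2021", "%d/%m/%Y") # => 2021-03-15
parse_reading_date("21-03-15", "%y-%m-%d")   # => 2021-03-15
```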

Data is supplied either as UTC dates or, most commonly, in whatever the current timezone is in the UK. So readings switch from GMT to BST. And this means that when the clocks change we end up with gaps in the readings.

Tabular structure

The majority of formats (22/24) are column oriented. By which I mean the tables consist of one row per meter, per day. Each row having 48 half-hourly readings as separate columns.

Two are row oriented. Each row containing a measurement for a specific meter at a specific date-time.
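To make the difference concrete, here are mocked-up fragments of each layout (the headers and values are invented, not taken from any particular supplier):

```
# Column oriented: one row per meter, per day
MPAN,Reading Date,00:30,01:00,...,00:00
1234567890123,15/03/2021,0.21,0.18,...,0.25

# Row oriented: one row per meter, per half hour
MPAN,ReadDatetime,kWh
1234567890123,15/03/2021 00:30,0.21
1234567890123,15/03/2021 01:00,0.18
```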

Meter identifiers

The column used to hold meter identifiers also varies. We might expect at least two: MPAN for electricity meters and MPRN for gas. What we actually get is:

  • Channel
  • M1_Code1
  • Meter
  • Meter Number
  • MPAN
  • MPN
  • MPR
  • "MPR"
  • MPR Value
  • Site Id

“Meter” seems fair as a generic column header if you know what you’re getting. Otherwise some baffling variations here.

Date column

What about the columns that contain the date (or date-time for row oriented files)? What are they called?

  • "Date"
  • ConsumptionDate
  • Date
  • Date (Local)
  • Date (UTC)
  • DAY
  • read_date
  • ReadDate
  • ReadDatetime
  • Reading Date

Units

The default is that data is supplied in kilowatt-hours (kWh).

Few of the formats actually bother to specify a unit. Those that do call it “ReportingUnit”, “Units” or “Data Type”.

One format actually contains 48 columns reporting kWh and another 48 columns reporting kilovolt-ampere reactive hours (kVArh).

Readings

Focusing on the column oriented formats, what are the columns containing the 48 half-hourly readings called?

Most commonly they’re named after the half-hour. For example, a column called “20:00” will contain the kWh consumption for the time between 7.30pm and 8pm.

In other cases the columns are positional, e.g. “half hour 1” through to “half hour 48”. This gives us the following variants:

  • 00:00
  • 00:30:00
  • [00:30]
  • H0030
  • HH01
  • hh01
  • hr0000
  • kWh_1

For added fun, some formats have their first column as 00:30, while others have 00:00.

Some formats interleave the actual readings with an extra column that is used to provide a note or qualifier. There are two variants of this:

  • 00:00 Flag
  • Type

Other columns

In addition to the meter numbers, dates, readings, etc., the files sometimes contain extra columns, e.g.:

  • Location
  • MPRAlias
  • Stark ID
  • Meter Name
  • MSN
  • meter_identifier
  • Meter Number
  • siteRef
  • ReadType
  • Total
  • Total kWh
  • PostCode
  • M1_Code2

We generally ignore this information as it’s either redundant or irrelevant to our needs.

Some files provide additional meter names, numbers or identifiers that are bespoke to the data source rather than a public identifier.

Summary

We’ve got to the point now where adding new formats is relatively straight-forward.

Like anyone dealing with large volumes of tabular data, we’ve got a configuration driven data ingest which we can tailor for different formats. We largely just need to know the name of the date column, the name of the column containing the meter id, and the names of the 48 readings columns.
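That configuration is roughly of this shape; a sketch with invented keys and names rather than our actual schema:

```ruby
# Sketch of per-format configuration; keys and names are illustrative only.
FORMATS = {
  "example-supplier-historic" => {
    mpan_column:     "MPAN",
    date_column:     "Reading Date",
    date_format:     "%d/%m/%Y",
    reading_columns: (1..48).map { |i| format("hh%02d", i) },
    header_rows:     1
  }
}
```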

But it’s taken time to develop that.

Most of the ongoing effort is during the setup of a new school or data provider, when we need to check to see if a data feed matches something we know, or whether we need to configure another slightly different variation.

And we have ongoing reporting to alert us when formats change without notice.

The fact that there are so many variations isn’t a surprise. There are many different sources and at every organisation someone has made a reasonable guess at what a useful format might be. They might have spoken to users, but probably don’t know what their competitors are doing.

This variation inevitably creates cost. This cost isn’t immediately felt by the average user who only has to deal with 1-2 formats at a time when they’re working with their own data in spreadsheets.

But those costs add up for those of us building tools and platforms, and operating systems, to support those users.

I don’t see anyone driving a standardisation effort in this area. Although, as I’ve hopefully shown here, behind the variations there is a simple, common tabular format that is waiting to be defined.

My impression at the moment is that most focus is on the emerging smart meter data ecosystem, and the new range of APIs that might support faster access to this same data.

But as I pointed out in my other post, if there isn’t an early attempt to standardise those, we’ll just end up with a whole range of new, slightly different APIs and data feeds. What we need is a common API standard.

Schema explorers and how they can help guide adoption of common standards

Despite being very different projects, Wikidata and OpenStreetmap have a number of similarities: recurring patterns in how they organise and support the work of their communities.

We documented a number of these patterns in the ODI Collaborative Maintenance Guidebook. There were also a number we didn’t get time to write up.

A further pattern which I noticed recently is that both Wikidata and OSM provide tools and documentation that help contributors and data users explore the schema that shapes the data.

Both projects have a core data model around which their communities are building and iterating on a more focused domain model. This approach of providing tools for the community to discuss, evolve and revise a schema is what we called the Shared Canvas pattern in the ODI guidebook.

In OpenStreetmap that core model consists of nodes, ways and relations. Tags (name-value pairs) can be attached to any of these types.

In Wikidata the core data model is essentially a graph. A collection of statements that associate values with nodes using a range of different properties. It’s actually more complicated than that, but the detail isn’t important here.

The list of properties in Wikidata and the list of tags in OpenStreetmap are continually revised and extended by the community to capture additional information.

The OpenStreetmap community documents tags in its Wiki (e.g. the building tag). Wikidata documents its properties within the project dataset (e.g. the name property, P2561).

But to successfully apply the Shared Canvas pattern, you also need to keep the community up to date about your Evolving Schema. To do that you need some way to communicate which properties or tags are in use, and how. OSM and Wikidata both provide tools to support that.

In OSM this role is filled by TagInfo. It can provide you with a breakdown of what type of feature the tag is used on, the range of values, combinations with other tags and some idea of its geographic usage. Tag usage varies by geographic community in OSM. Here’s the information about the building tag.

In Wikidata this tooling is provided by a series of reports that are available from the Discussion page for an individual property. This includes information about how often it is used and pointers to examples of frequent and recent uses. Here’s the information about the name property.

Both tools provide useful insight into how different aspects of a schema are being adopted and used. They can help guide not just the discussion around the schema (“is this tag in use?”), but also the process of collecting data (“which tags should I use here?”) and using the data (“what tags might I find, or query for?”).

Any project that adopts a Shared Canvas approach is likely to need to implement this type of tooling. Let’s call it the “Schema explorer” pattern for now.

I’ll leave documenting it further for another post, or a contribution to the guidebook.

Schema explorers for open standards and open data

This type of tooling would be useful in other contexts.

Anywhere that we’re trying to drive adoption of a common data standard, it would be helpful to be able to assess how well used different parts of that schema are by analysing the available data.

That’s not something I’ve regularly seen produced. In our survey of decentralised publishing initiatives at the ODI we found common types of documentation, data validators and other tools to support use of data, like useful aggregations. But no tooling to help explore how well a standard is being adopted. Or to help data users understand the shape of the available data prior to aggregating it.
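Even something very simple would be a start. For example, a sketch of a report that counts how often each property appears across a pile of harvested JSON documents (assuming one object per file; the harvesting is the hard part):

```ruby
require "json"

# Count how often each top-level property appears across a set of JSON files.
def property_usage(paths)
  counts = Hash.new(0)
  paths.each do |path|
    JSON.parse(File.read(path)).each_key { |key| counts[key] += 1 }
  end
  counts.sort_by { |_, count| -count }
end

property_usage(Dir.glob("data/*.json"))
```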

When I was working on the OpenActive standard, I found the data profiles that Dan Winchester produced really helpful. They provide useful insight into which parts of a standard different publishers were actually using.

I was thinking about this again recently whilst doing some work for Full Fact, exploring the ClaimReview markup in Schema.org. It would be great to see which features different fact checkers are actually using. In fact that would be true of many different aspects of Schema.org.

This type of reporting is hard to do in a distributed environment without aggregating all the data. But Google are regularly harvesting some of this data, so it feels like it would be relatively easy for them to provide insights like this if they chose.

An alternative is the Schema.org Table Corpus which provides exports of Schema.org data contained in the Common Crawl dataset. But more work is likely needed to generate some useful views over the data, and it is less frequently updated.

Outside of Schema.org, schema explorers reporting on the contents of open datasets would help inform a range of standards work. For example, it could help inform decisions about how to iterate on a schema, guide the production of documentation, and help improve the design of validators and other tools.

If you’ve seen examples of this type of tooling, then I’d be interested to see some links.

Building data validators

This is a post about building tools to validate data. I wanted to share a few reflections based on helping to design and build a few different public and private tools, as well as my experience as a user.

I like using data validators to check my homework. I’ve been using a few different ones recently, which has prompted me to think a bit about their role and the decisions that go into their design.

The tl;dr version of this post is along the lines of “Think about user needs when designing tools. But also be conscious of the role those tools play in their broader ecosystem”.

What is a data validator?

A data validator is a tool that checks the correctness and quality of data. This means doing the following categories of checks:

  • Syntax
    • Checking to determine whether there are any mistakes in how it is formatted. E.g. is the syntax of a CSV, XML or JSON file correct?
  • Validity
    • Confirming whether all of the required fields, necessary to make the data useful, have been provided
    • Testing that individual values have been correctly specified. E.g. if the field contains a number then is the provided value actually a number rather than text?
    • Performing more semantic checks such as, if this is a dataset about UK planning applications, then are the coordinates actually in the UK? Or is the start date for the application before the end date?
  • Utility
    • Confirming that provided data is of a useful quality, e.g. are geographic coordinates of the right precision? Or do any links to other resources actually work?
    • Warning about data that may or may not be included. For example, prompting the user to include additional fields that may improve the utility of the data. Or asking them to consider whether any personal data included should be there

These validation rules will typically come from a range of different sources, including:

  • The standard or specification that defines the syntax of the data.
  • The standard or specification (or schema) that describes the structure and content of the data. (This might be the same as the above, or might be defined elsewhere)
  • Legislation, which might guide, inform or influence what data should or should not be included
  • The implementer of the validation tool, who may have opinions about what is considered to be correct or useful data based on their specific needs (e.g. as a direct consumer of the data) or more broadly as a contributor to a community initiative to support improvements to how data is published

Data validators are frequently web based these days. At least for smaller datasets. But both desktop and command-line tools are also regularly used in different settings. The choice of design will be informed by things like how open the data can be, the volume of data being checked, and how the validator might be integrated into a data workflow, e.g. as an automated or manual step.

Examples of different types of data validator

Here are some examples of different data validators created for different purposes and projects:

  1. JSON lint
  2. GeoJSON Lint
  3. JSON LD Playground
  4. CSVlint
  5. ODI Leeds Business Rates format validator
  6. 360Giving Data Quality Tool
  7. OpenContracting Data Review Tool
  8. The OpenActive validator
  9. OpenReferral UK Service Validator
  10. The Schema.org validator
  11. Google’s Rich Results Test
  12. The Twitter Card validator
  13. Facebook’s sharing debugger

The first few on the list are largely syntax checkers. They validate whether your CSV, JSON or GeoJSON files are correctly structured.

The others go further and check not just the format of the data, but also its validity against a schema. That schema is defined in a standard intended to support consistent publication of data across a community. The goal of these tools is to improve quality of data for a wide range of potential users, by guiding publishers about how to publish data well.

The last three examples are validators that are designed to help publishers meet the needs of a specific application or consumer of the data. They’re an actionable way to test data against the requirements of a specific user.

Validators also vary in other ways.

For example, the 360Giving, OpenContracting and Rich Results Test validators all accept a range of different data formats. They validate different syntaxes against a common schema. Others are built around a single specific format.

Some tools provide a low-level view of the results, e.g. a list of errors and warnings with reference to specific sections of the data. Others provide a high-level interface, such as a preview of what the data looks like on a map or as it would be displayed in a specific application. This type of visual presentation can help catch other types of errors and more directly confirm how data might be interpreted, whilst also making the tool useful to a wider audience.

What do we mean by data being valid?

For simple syntax checking, identifying whether something is valid is straight-forward. Your JSON is either well-formed or it’s not.

Validators that are designed around specific applications also usually have a clear marker of what is “valid”: can the application parse, interpret and display the data as expected? Does my twitter card look correct?

In other examples, the notion of “valid” is harder to define. There may be some basic rules around what a minimum viable dataset looks like. If so, these are easier to identify and classify as errors.

But there is often variability within a schema. E.g. optional elements. This means that validators need to offer more than just a binary decision and instead offer warnings, suggestions and feedback.

For example, when thinking about the design of the OpenActive validator we discussed the need to go beyond simple validation and provide feedback and prompts along the lines of “you haven’t provided a price, is the event free or chargeable?” Or “you haven’t provided an image for this event, this is legal but evidence shows that participants are more likely to sign-up to events where they can see what participation looks like.”

To put this differently: data quality depends on how you’re planning to use the data. It’s not an absolute. If you’re not validating data for a specific application or purpose, then your tool should be prompting users to think about the choices they are making around how data is being shared.

In the context of sharing and publishing open data, this moves the role of a data validator beyond simply checking correctness, and towards identifying sources of friction that will exist between publisher and consumer.

Beyond the formal conformance criteria defined in a specification, deciding whether something is valid or not is really just a marker for how much extra work is required by a consumer. And in some cases the publisher may not have the time, budget or resources to invest in reducing that burden.

Things to think about when designing a validator

To wrap up this post, here are some things to think about when designing a data validator:

  • Who are your users? What level of technical skill and understanding are you designing for?
  • How will the validator be used or integrated into the user’s workflow? A tool for integration into a continuous integration environment will need to operate differently to something used to do acceptance checking before data is published. Maybe you need several different tools?
  • How much knowledge of the relevant standards or specification will a user need before they can use the tool? Should the tool facilitate learning and exploration about how to structure data, or is it just for checking existing data?
  • How can you provide good, clear feedback? Tools that rely on applying machine-readable schemas like JSON Schema can often have cryptic messages as they rely on an underlying library to report errors
  • How can you provide guidance and feedback that will help users decide how to improve data? Is the feedback actionable? (For example in CSVLint we figured out that when reporting that a user had an incorrect mime-type for their CSV file we could identify if it was served from AWS and provide a clear suggestion about how to fix the issue)
  • Would showing the data, as a preview or within a mocked up view, help surface problems or build confidence in how data is published?
  • Are the documentation about how to publish data and the reports from your validator consistent? If not, then fix the documentation or explain the limits of the validator

Finally, if you’re designing a validator for a specific application, then don’t mark as “invalid” anything that you can simply ignore. Don’t force the ecosystem to converge on your preferences.

You may not be interested in the full scope of a standard, but different applications and users will have different needs.

Data quality is a dialogue between publishers and users of data. One that will evolve over time as tools, applications, norms and standards become adopted across a data ecosystem. A data validator is an important building block that can facilitate that discussion.

Some lessons learned from building standards around Schema.org

OpenActive is a community-led initiative in the sport and physical activity sector in England. Its goal is to help get people healthier and more active by making it easier for people to find information about activities and events happening in their area. Publishing open data about opportunities to be active is a key part of its approach.

The initiative has been running for several years, funded by Sport England. It’s supported by a team at the Open Data Institute who are working in close collaboration with a range of organisations across the sector.

During the early stages of the project I was responsible for leading the work to develop the technical standards and guidance that would help organisations publish open data about squash courts and exercise classes. I’ve written some previous blog posts that described the steps that got us to version 1.0 of the standards and then later the roadmap towards 2.0.

Since then the team have been exploring new features like publishing data about walking and cycling routes, improving accessibility information and, more recently, testing a standard API for booking classes.

If you’re interested in more of the details then I’d encourage you to dig into those posts as well as the developer portal.

What I wanted to cover in this blog post are some reflections about one of the key decisions we made early in the standards workstream. This was to base the core data model on Schema.org.

Why did we end up basing the standards on Schema.org?

We started the standards work in OpenActive by doing a proper scoping exercise. This helped us to understand the potential benefits of introducing a standard, and the requirements that would inform its development.

As part of our initial research, we did a review of what standards existed in the sector. We found very little that matched our needs. The few APIs that were provided were quite limited and proprietary and there was little consistency around how data was organised.

It was clear that some standardisation would be beneficial and that there was little in the way of sector-specific work to build on. It was also clear that we’d need a range of different types of standard. Data formats and APIs to support exchange of data, a common data model to help organise data and a taxonomy to help describe different types of activity.

For the data model, it was clear that the core domain model would need to be able to describe events. E.g. that a yoga class takes place in a specific gym at regular times. This would support basic discovery use cases. Where can I go and exercise today? What classes are happening near me?

As part of our review of existing standards, we found that Schema.org already provided this core model along with some additional vocabulary that would help us categorise and describe both the events and locations. For example, whether an Event was free, its capacity and information about the organiser.
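For illustration, a minimal event description using plain Schema.org terms might look like the following. The values are invented, and the actual OpenActive specifications define their own profile and extensions on top of this:

```json
{
  "@context": "https://schema.org",
  "@type": "Event",
  "name": "Beginners' yoga class",
  "startDate": "2021-06-01T18:30:00+01:00",
  "isAccessibleForFree": false,
  "maximumAttendeeCapacity": 20,
  "organizer": { "@type": "Organization", "name": "Example Leisure Centre" },
  "location": { "@type": "Place", "name": "Example Gym", "address": "1 Example Street, Bath" }
}
```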

For many people Schema.org may be more synonymous with publishing data for use by search engines. But as a project its goal is much broader, it is “a collaborative, community activity with a mission to create, maintain, and promote schemas for structured data”.

The data model covers much more than what search engines are consuming. Some communities are instead using the project as a means to collaborate on developing better vocabulary for sharing data between other applications. As well as aligning existing vocabularies under a common umbrella.

New standards should ideally be based on existing standards. We knew we were going to be building the OpenActive technical standards around a “stack” of standards that included HTTP, JSON and JSON-LD. So it was a natural step to base our initial domain model on aspects of Schema.org.

What were the benefits?

An early benefit of this approach is that we could immediately focus our roadmap on exploring extensions to the Schema.org data model that would add value to the physical activity sector.

Our initial community sessions around the OpenActive standards involved demonstrating how well the existing Schema.org model fitted the core requirements. And exploring where additional work was needed.

This meant we skipped any wrangling around how to describe events and instead focused on what we wanted to say about them. Important early questions focused on what information potential participants would find helpful in understanding whether a specific activity or event is something that they might want to try. For example, details like: what activities are involved and what level of competency is required?

We were able to identify those elements of the core Schema.org model that supported our use cases and then documented some extensions in our own specifications. The extensions and clarifications were important for the OpenActive community, but not necessarily relevant in the broader context in which Schema.org is being used. We wanted to build some agreement and usage in our community first, before suggesting changes to Schema.org.

As well as giving us an initial head start, the decision also helped us address new requirements much quicker.

As we uncovered further requirements that meant expanding our data model, we were always able to initially look to see if existing Schema.org terms covered what we needed. We began using it as a kind of “dictionary” that we could draw on when needed.

Where existing parts of the Schema.org model fitted our needs, it was gratifying to be able to rapidly address the new requirements by documenting patterns for how to use them. Data publishers were also doing the same thing. Having a common dictionary of terms gave freedom to experiment with new features, drawing on terms defined in a public schema, before the community had discussed and agreed how to implement those patterns more broadly.

Every standards project has its own cadence. The speed of development and adoption are tied up with a whole range of different factors that go well beyond how quickly you can reach consensus around a specification.

But I think the decision to use Schema.org definitely accelerated progress and helped us more quickly deliver a data model that covered the core requirements for the sector.

Where were the challenges?

The approach wasn’t without its challenges, however.

Firstly, for a sector that was new to building open standards, choosing to base parts of that new standard on one project and then defining extensions created some confusion. Some communities seem more comfortable with piecing together vocabularies and taxonomies, but that is not true more widely.

Developers found it tricky to refer to both specifications, to explore their options for publishing different types of data. So we ended up expanding our documentation to cover all of the Schema.org terms we recommended or suggested people use, instead of focusing more on our own extensions.

Secondly, we also initially adopted the same flexible, non-prescriptive approach to data publishing that Schema.org uses. It does not define strict conformance criteria and there are often different options for how the same data might be organised depending on the level of detail a publisher has available. If Schema.org were too restrictive then it would limit how well the model could be used by different communities. It also leaves space for usage patterns to emerge.

In OpenActive we recognised that the physical activity sector had a wide range of capabilities when it came to publishing structured data. And different organisations organised data in different ways. We adopted the same less prescriptive approach to publishing with the goal of reducing the barriers to getting more data published. Essentially asking publishers to structure data as best they could within the options available.

In the end this wasn’t the right decision.

Too much flexibility made it harder for implementers to understand what data would be most useful to publish. And how to do it well. Many publishers were building new services to expose the data so they needed a clearer specification for their development teams.

We addressed this in Version 2 of the specifications by considerably tightening up the requirements. We defined which terms were required or just recommended (and why). And added cardinalities and legal values for terms. Our specification became a more formal, extended profile of Schema.org. This also allowed us to build a data validator that is now being released and maintained alongside the specifications.

Our third challenge was about process. In a few cases we identified changes that we felt would sit more naturally within Schema.org than our own extensions. For example, they were improvements and clarifications around the core Event model that would be useful more widely. So we submitted those as proposed changes and clarifications.

Given that Schema.org has a very open process, and the wide range of people active in discussing issues and proposals, it was sometimes hard to know how decisions would get made. We had good support from Dan Brickley and others stewarding the project, but without knowing much about who is commenting on your proposal, their background or their own use cases, it was tricky to know how much time to spend on handling this feedback. Or when we could confidently say that we had achieved some level of consensus.

We managed to successfully navigate this, by engaging as we would within any open community: working transparently and collegiately, and being willing to reflect on and incorporate feedback regardless of its source.

The final challenge was about assessing the level of use of different parts of the Schema.org model. If we wanted to propose a change in how a term was documented or suggest a revision to its expected values, it is difficult to assess the potential impact of that change. There’s no easy way to see which applications might be relying on specific parts of the model. Or how many people are publishing data that uses different terms.

The Schema.org documentation does flag terms that are currently under discussion or evaluation as “pending”. But outside of this it’s difficult to understand more about how the model is being used in practice. To do that you need to engage with a user community, or find some metrics about deployment.

We handled this by engaging with the open process of discussion, sharing our own planned usage to inform the discussion. And, where we felt that Schema.org didn’t fit with the direction we needed, we were happy to look to other standards that better filled those gaps. For example we chose to use SKOS to help us organise and structure a taxonomy of physical activities rather than using some of the similar vocabulary that Schema.org provides.

Choosing to draw on Schema.org as a source of part of our domain model didn’t mean that we felt tied to using only what it provides.

Some recommendations

Overall I’m happy that we made the right decision. The benefits definitely outweighed the challenges.

But navigating those challenges was easier because those of us leading the standards work were comfortable both with working in the open and in combining different standards to achieve a specific goal. Helping to build more competency in this area is one goal of the ODI standards guidebook.

If you’re involved in a project to build a common data model as part of a community project to publish data, then I’d recommend looking at whether basing some or all of that model on Schema.org might help kickstart your technical work.

If you do that, my personal advice would be:

  • Remember that Schema.org isn’t the right home for every data model. Depending on your requirements, the complexity and the potential uses for the data, you may be better off designing and iterating on your model separately. Similarly, don’t expect that every change or extension you might want to make will necessarily be accepted into the core model
  • Don’t assume that search engines will start using your data, just because you’re using Schema.org as a basis for publishing, or even if you successfully submit change proposals. It’s not a means of driving adoption and use of your data or preferred model
  • Plan to write your own specifications and documentation that describe how your application or community expects data to be published. You’ll need to add more conformance criteria and document useful patterns that go beyond what Schema.org is providing
  • Work out how you will engage with your community. To make it easier to refine your specifications, discuss extensions and gather implementation feedback, you’ll still need a dedicated forum or channel for your community to collaborate. Schema.org doesn’t really provide a home for that. You might have your own GitHub project or set up a W3C community group.
  • Build your own tooling. Schema.org are improving their own tooling, but you’ll likely need your own validation tools that are tailored to your community and your specifications
  • Contribute to the Schema.org project where you can. If you have feedback, proposed changes or revisions then submit these to the project. It’s through a community approach that we improve the model for everyone. Just be aware that there are likely to be a whole range of different use cases that may be different to your own. Your proposals may need to go through several revisions before being accepted. Proposals that draw on real-world experience or are tied to actual applications will likely carry more weight than general opinions about the “right” way to design something
  • Be prepared to diverge where necessary. As I’ve explained above, sometimes the right option is to propose changes to Schema.org. And sometimes you may need to be ready to draw on other standards or approaches.

The UK Smart Meter Data Ecosystem

Disclaimer: this blog post is about my understanding of the UK’s smart meter data ecosystem and contains some opinions about how it might evolve. These do not in any way reflect those of Energy Sparks of which I am a trustee.

This blog post is an introduction to the UK’s smart meter data ecosystem. It sketches out some of the key pieces of data infrastructure with some observations around how the overall ecosystem is evolving.

It’s a large, complex system so this post will only touch on the main elements. Pointers to more detail are included along the way.

If you want a quick reference, with more diagrams then this UK government document, “Smart Meters, Smart Data, Smart Growth” is a good start.

Smart meter data infrastructure

Smart meters and meter readings

Historically, data about your home or business energy usage was collected by someone coming to read the actual numbers displayed on the front of your meter. And in some cases that’s still how the data is collected. It’s just that today you might be entering those readings into a mobile or web application provided by your supplier. In between those readings, your supplier will be estimating your usage.

This situation improved with the introduction of AMR (“Automated Meter Reading”) meters which can connect via radio to an energy supplier. The supplier can then read your meter automatically, to get basic information on your usage. After receiving a request the meter can broadcast the data via radio signal. These meters are often only installed in commercial properties.

Smart meters are a step up from AMR meters. They connect via a Wide Area Network (WAN) rather than radio, support two way communications and provide more detailed data collection. This means that when you have a smart meter your energy supplier can send messages to the meter, as well as taking readings from it. These messages can include updated tariffs (e.g. as you switch supplier or if you are on a dynamic tariff) or a notification to say you’ve topped up your meter, etc.

The improved connectivity and functionality means that readings can be collected more frequently and are much more detailed. Half hourly usage data is the standard. A smart meter can typically store around 13 months of half-hourly usage data. 

The first generation of smart meters are known as SMETS-1 meters. The latest meters are SMETS-2.

Meter identifiers and registers

Meters have unique identifiers.

For gas meters the identifiers are called MPRNs. I believe these are allocated in blocks to gas providers to be assigned to meters as they are installed.

For electricity meters, these identifiers are called MPANs. Electricity meters also have a serial number. I believe MPANs are assigned by the individual regional electricity network operators and that this information is used to populate a national database of installed meters.

From a consumer point of view, services like Find My Supplier will allow you to find your MPRN and energy suppliers.

Connectivity and devices in the home

If you have a smart meter installed then your meters might talk directly to the WAN, or access it via a separate controller that provides the necessary connectivity. 

But within the home, devices will talk to each other using Zigbee, which is a low power internet of things protocol. Together they form what is often referred to as the “Home Area Network” (HAN).

It’s via the home network that your “In Home Display” (IHD) can show your current and historical energy usage as it can connect to the meter and access the data it stores. Your electricity usage is broadcast to connected devices every 10 seconds, while gas usage is broadcast every 30 minutes.

Your IHD can show your energy consumption in various ways, including how much it is costing you. This relies on your energy supplier sending your latest tariff information to your meter.

As this article by Bulb highlights, the provision of an IHD and its basic features is required by law. Research showed that IHDs were more accessible and nudged people towards being more conscious of their energy usage. The high-frequency updates from the meter to connected devices makes it easier, for example, for you to identify which devices or uses contribute most to your bill.

Your energy supplier might provide other apps and services that provide you with insights, via the data collected via the WAN. 

But you can also connect other devices into the home network provided by your smart meter (or data controller). One example is a newer category of IHD called a “Consumer Access Device” (CAD), e.g. the Glow

These devices connect via Zigbee to your meter and via Wifi to a third-party service, where it will send your meter readings. For the Glow device, that service is operated by Hildebrand

These third party services can then provide you with access to your energy usage data via mobile or web applications. Or even via API. Otherwise as a consumer you need to access data via whatever methods your energy supplier supports.

The smart meter network infrastructure

SMETS-1 meters connected to a variety of different networks. This meant that if you switched suppliers then they frequently couldn’t access your meter because it was on a different network. So meters needed to be replaced. And, even if they were on the same network, then differences in technical infrastructure meant the meters might lose functionality.

SMETS-2 meters don’t have this issue as they all connect via a shared Wide Area Network (WAN). There are two of these covering the north and south of the country.

While SMETS-2 meters are better than previous models, they still have all of the issues of any Internet of Things device: problems with connectivity in rural areas, need for power, varied performance based on manufacturer, etc.

Some SMETS-1 meters are also now being connected to the WAN. 

Who operates the infrastructure?

The Data Communications Company (DCC) is a state-licensed monopoly that operates the entire UK smart meter network infrastructure. It’s a wholly-owned subsidiary of Capita. Their current licence runs until 2025.

DCC subcontracted provision of the WAN to support connectivity of smart meters to two regional providers. In the North of England and Scotland that provider is Arqiva. In the rest of England and Wales it is Telefonica UK (who own O2).

All of the messages that go to and from the meters via the WAN go via DCC’s technical infrastructure.

The network has been designed to be secure. As a key piece of national infrastructure, that’s a basic requirement. Here’s a useful overview of how the security was designed, including some notes on trust and threat modelling.

Part of the design of the system is that there is no central database of meter readings or customer information. It’s all just messages between the suppliers and the meters. However, as they describe in a recently published report, the DCC do apparently have some databases of metadata about individual meters and the messages sent to them. The DCC calls this “system data”.

The smart meter roll-out

It’s mandatory for smart meters to now be installed in domestic and smaller commercial properties in the UK. Companies can install SMETS-1 or SMETS-2 meters, but the rules were changed recently so only newer meters count towards their individual targets. And energy companies can get fined if they don’t install them quickly enough.

Consumers are being encouraged to have smart meters fitted in existing homes, as meters are replaced, to provide them with more information on their usage and access to better tariffs, such as those that offer dynamic time-of-day pricing.

But there are also concerns around privacy and fears of energy supplies being remotely disconnected, which are making people reluctant to switch when given the choice. Trust is clearly an important part of achieving a successful rollout.

Ofgem have a handy guide to consumer rights relating to smart meters. Which? have an article about whether you have to accept a smart meter, and Energy UK and Citizens Advice have a 1 page “data guide” that provides the key facts.

But smart meters aren’t being uniformly rolled out. For example they are not mandated for all commercial (non-domestic) properties. 

At the time of writing there are over 10 million smart meters connected via the DCC, with 70% of those being SMETS-2 meters. The Elexon dashboard for smart electricity meters estimates that the rollout of electricity meters is roughly 44% complete. There are also some official statistics about the rollout.

The future will hold much more fine-grained data about energy usage across the homes and businesses in the UK. But in the short-term there’s likely to be a continued mix of different meter types (dumb, AMR and smart) meaning that domestic and non-domestic usage will have differences in the quality and coverage of data due to differences in how smart meters are being rolled out.

Smart meters will give consumers greater choice in tariffs because the infrastructure can better deal with dynamic pricing. It will help to shift to a greener more efficient energy network because there is better data to help manage the network.

Access to the data infrastructure

Access to and use of the smart meter infrastructure is governed by the Smart Energy Code. Section I covers privacy.

The code sets out the roles and responsibilities of the various actors who have access to the network. That includes the infrastructure operators (e.g. the organisations looking after the power lines and cables) as well as the energy companies (e.g. those who are generating the energy) and the energy suppliers (e.g. the organisations selling you the energy). 

There is a public list of all of the organisations in each category and a summary of their licensing conditions that apply to smart meters.

The focus of the code is on those core actors. But there is an additional category of “Other Providers”. This is basically a miscellaneous group of other organisations that are not directly involved in provision of energy as a utility, but that may have or require access to the data infrastructure.

These other providers include organisations that:

  • provide technology to energy companies who need to be able to design, test and build software against the smart meter network
  • offer services like switching and product recommendations
  • access the network on behalf of consumers, allowing them to directly access usage data in the home using devices, e.g. Hildebrand and its Glow device
  • provide other additional third-party services. This includes companies like Hildebrand and N3RGY that are providing value-added APIs over the core network

To be authorised to access the network you need to go through a number of stages, including an audit to confirm that you have the right security in place. This can take a long time to complete. Documentation suggests this might take upwards of 6 months.

There are also substantial annual costs for access to the network. This helps to make the infrastructure sustainable, with all users contributing to it. 

Data ecosystem map


As a summary, here are the key points:

  • your in-home devices send and receive messages and data via the smart meter or controller installed in your home or business property
  • your in-home device might also be sending your data to other services, with your consent
  • messages to and from your meter are sent via a secure network operated by the DCC
  • the DCC provide APIs that allow authorised organisations to send and receive messages from that data infrastructure
  • the DCC doesn’t store any of the meter readings, but does collect metadata about the traffic over that network
  • organisations who have access to the infrastructure may store and use the data they can access, but generally need consent from users for detailed meter data
  • the level and type of access, e.g. what messages can be sent and received, may differ across organisations
  • your energy supplier uses the data they retrieve from the DCC to generate your bills, provide you with services, optimise the system, etc
  • the UK government has licensed the DCC to operate that national data infrastructure, with Ofgem regulating the system

At a high-level, the UK smart meter system is like a big federated database: the individual meters store and submit data, with access to that database being governed by the DCC. The authorised users of that network build and maintain their own local caches of data as required to support their businesses and customers.

The evolving ecosystem

This is a big complex piece of national data infrastructure. This makes it interesting to unpick as an example of real-world decisions around the design and governance of data access.

It’s also interesting as the ecosystem is evolving.

Changing role of the DCC

The DCC have recently published a paper called “Data for Good” which sets out their intention to create a “system data exchange” (you should read that as “system data” exchange). This means providing access to the data they hold about meters and the messages sent to and from them. (There’s a list of these message types in a SEC code appendix.)

The paper suggests that increased access to that data could be used in a variety of beneficial ways. This includes helping people in fuel poverty, or improving management of the energy network.

Encouragingly the paper talks about open and free access to data, which seems reasonable if data is suitably aggregated and anonymised. However the language is qualified in many places. DCC will presumably be incentivised by the existing ecosystem to reduce its costs and find other revenue sources. And their 5 year business development plan makes it clear that they see data services as a new revenue stream.

So time will tell.

The DCC is also required to improve efficiency and costs for operating the network to reduce burden on the organisations paying to use the infrastructure. This includes extending use of the network into other areas. For example to water meters or remote healthcare (see note at end of page 13).

Any changes to what data is provided, or how the network is used, will require changes to the licence and some negotiation with Ofgem. As the licence is due to be renewed in 2025, this might be laying the groundwork for a revised licence to operate.

New intermediaries

In addition to a potentially changing role for the DCC, the other area in which the ecosystem is growing is via “Other Providers” that are becoming data intermediaries.

The infrastructure and financial costs of meeting the technical, security and audit requirements for direct access to the DCC network create a high barrier for third parties wanting to provide additional services that use the data.

The DCC APIs and messaging infrastructure are also difficult to work with, meaning that integration costs can be high. The DCC “Data for Good” report notes that direct integration “…is recognised to be challenging and resource intensive”.

There are a small but growing number of organisations, including Hildebrand, N3RGY, Smart Pear and Utiligroup, who see an opportunity to lower this barrier by providing value-added services over the DCC infrastructure. For example, simple JSON-based APIs that simplify access to meter data.

Coupled with access to sandbox environments to support prototyping, this provides a simpler and cheaper API with which to integrate. Security remains important but the threat profiles and risks are different as API users have no direct access to the underlying infrastructure and only read-only access to data.
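
As a rough illustration, here is what client code against one of these intermediary-style APIs might look like. The endpoint, parameters and response shape below are hypothetical, not the actual API of any of the companies mentioned above, but they show the kind of simple, read-only JSON integration on offer compared with talking to the DCC messaging infrastructure directly.

```python
# A minimal sketch of an intermediary-style meter data API client.
# The base URL, paths, parameters and response fields are all hypothetical;
# real providers have their own (different) APIs and onboarding processes.
import requests

API_BASE = "https://api.example-intermediary.co.uk/v1"  # hypothetical endpoint


def get_half_hourly_consumption(meter_id: str, date: str, api_key: str) -> list[dict]:
    """Fetch half-hourly consumption readings for a single, consented meter."""
    response = requests.get(
        f"{API_BASE}/meters/{meter_id}/electricity/consumption",
        params={"date": date, "granularity": "half-hourly"},
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=30,
    )
    response.raise_for_status()
    # Expected shape: {"readings": [{"timestamp": "...", "kwh": 0.12}, ...]}
    return response.json()["readings"]
```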

To comply with the governance of the existing system, the downstream user still needs to ensure they have appropriate consent to access data. And they need to be ready to provide evidence if the intermediary is audited.

The APIs offered by these new intermediaries are commercial services: the businesses are looking to do more than just cover their costs and will be hoping to generate significant margin through what is basically a reseller model. 

It’s worth noting that access to AMR meter data is also typically via commercial services, at least for non-domestic meters. The price per meter for data from smart meters currently seems lower, perhaps because it’s relying on a more standard, shared underlying data infrastructure.

As the number of smart meters grows I expect access to a cheaper and more modern API layer will become increasingly interesting for a range of existing and new products and services.

Lessons from Open Banking

From my perspective the major barrier to more innovative use of smart meter data is the existing data infrastructure. The DCC obviously recognises the difficulty of integration and other organisations are seeing potential for new revenue streams by becoming data intermediaries.

And needless to say, all of these new intermediaries have their own business models and bespoke APIs. Ultimately, while they may end up competing in different sectors or markets, or over quality of service, they’re all relying on the same underlying data and infrastructure.

In the finance sector, Open Banking has already demonstrated that a standardised set of APIs, licensing and approach to managing access and consent can help to drive innovation in a way that is good for consumers. 

There are clear parallels to be drawn between Open Banking, which increased access to banking data, and how access to smart meter data might be increased. It’s a very similar type of data: highly personal, transactional records. And it can be used in very similar ways, e.g. account switching.

The key difference is that there’s no single source of banking transactions, so regulation was required to ensure that all the major banks adopted the standard. Smart meter data is already flowing through a single state-licensed monopoly.

Perhaps if the role of the DCC is changing, then they could also provide a simpler standardised API to access the data? Ofgem and DCC could work with the market to define this API as happened with Open Banking. And by reducing the number of intermediaries it may help to increase trust in how data is being accessed, used and shared?

If there is a reluctance to extend DCC’s role in this direction, then an alternative step would be to recognise the role and existence of these new types of intermediary within the Smart Energy Code. That would allow their licence to use the network to include agreement to offer a common, core standard API, common data licensing terms and a shared approach to the collection and management of consent. Again, Ofgem, DCC and others could work with the market to define that API.

For me, these are the most obvious ways to carry the lessons and models from Open Banking into the energy sector. There are clearly many more aspects of the energy data ecosystem that might benefit from improved access to data, which is where initiatives like Icebreaker One are focused. But starting with what will become a fundamental part of the national data infrastructure seems like an obvious first step to me.

The other angle that Open Banking tackled was creating better access to data about banking products. The energy sector needs this too, as there’s no easy way to access data on energy supplier tariffs and products.

The importance of tracking dataset retractions and updates

There are lots of recent examples of researchers collecting and releasing datasets which end up raising serious ethical and legal concerns. The IBM facial recognition dataset is just one example that springs to mind.

I read an interesting post exploring how facial recognition datasets are being widely used despite being taken down due to ethical concerns.

The post highlights how these datasets, despite being retracted, are still being widely used in research. This is in part because the original datasets are still circulating via mirrors of the original files. But also because they have been incorporated into derived datasets which are still being distributed with the original contents intact.

The authors describe how just one dataset, the DukeMTMC dataset, was used in more than 135 papers after being retracted, 116 of those drawing on derived datasets. Some datasets have many derivatives; one example cited has been used in 14 derived datasets.

The research raises important questions about how datasets are published, mirrored, used and licensed. There’s a lot to unpack there and I look forward to reading more about the research. The concerns around open licensing are reminiscent of similar debates in the open source community leading to a set of “ethical open source licences”.

But the issue I wanted to highlight here is the difficulty of tracking the mirroring and reuse of datasets.

Change notification is a missing piece of our data infrastructure.

If it were easier to monitor important changes to datasets, then it would be easier to:

  • maintain mirrors of data
  • retract or remove data that breached laws or social and ethical norms
  • update derived datasets to remove or amend data
  • re-run analyses against datasets which have seen significant corrections or revisions
  • assess the impacts of poor quality or unethically shared data
  • proactively notify relevant communities of potential impacts relating to published data
  • monitor and review the reasons why datasets get retracted
  • …etc, etc
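
To make that a bit more concrete, here is a sketch of what a machine-readable change notification for a dataset might contain. The structure and event types are my own assumptions rather than any existing standard, but a feed of events like this is the kind of thing that mirrors, aggregators and maintainers of derived datasets could subscribe to.

```python
# An illustrative, non-standard structure for dataset change notifications.
# A publisher or registry could expose a feed of these events; downstream
# mirrors and derived datasets could subscribe and react automatically.
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class DatasetChangeEvent:
    dataset_id: str                      # persistent identifier, e.g. a DOI
    event_type: str                      # "correction", "revision" or "retraction"
    published_at: datetime
    reason: str                          # human-readable explanation of the change
    affected_versions: list[str] = field(default_factory=list)
    superseded_by: Optional[str] = None  # identifier of a replacement dataset, if any


def should_stop_distribution(event: DatasetChangeEvent) -> bool:
    """A mirror might apply a simple rule like this to decide whether to pull data."""
    return event.event_type == "retraction"
```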

The importance of these activities can be seen in other contexts.

For example, Retraction Watch is a project that monitors retractions of research papers. CrossMark helps to highlight major changes to published papers including corrections and retractions.

Principle T3: Orderly Release, of the UK Statistics Authority code of practice explains that scheduled revisions and unscheduled corrections to statistics should be transparent, and that organisations should have a specific policy for how they are handled.

More broadly, product recalls and safety notices are standard for consumer goods. Maybe datasets should be treated similarly?

This feels like an area that warrants further research, investment and infrastructure. At some point we need to raise our sights from setting up even more portals and endlessly refining their feature sets and think more broadly about the system and ecosystem we are building.

Four types of innovation around data

Vaughn Tan’s The Uncertainty Mindset is one of the most fascinating books I’ve read this year. It’s an exploration of how to build R&D teams drawing on lessons learned in high-end kitchens around the world. I love cooking and I’m interested in creative R&D and what makes high-performing teams work well. I’d strongly recommend it if you’re interested in any of these topics.

I’m also a sucker for a good intellectual framework that helps me think about things in different ways. I did that recently with the BASEDEF framework.

Tan introduces a nice framework in Chapter 4 of the book which looks at four broad types of innovation around food. These are presented as a way to help the reader understand how and where innovation creates impact in restaurants. The four categories are:

  1. New dishes – new arrangements of ingredients, where innovation might be incremental refinements to existing dishes, combining ingredients together in new ways, or using ingredients from different contexts (think “fusion”)
  2. New ingredients – coming up with new things to be cooked
  3. New cooking methods – new ways of cooking things, like spherification or sous vide
  4. New cooking processes – new ways of organising the processes of cooking, e.g. to help kitchen staff prepare a dish more efficiently and consistently

The categories at the top are more evident to the consumer, those lower down less so. But the impacts of new methods and processes are greater as they apply in a variety of contexts.

Somewhat inevitably, I found myself thinking about how these categories work in the context of data:

  1. New dishes → new analyses – new derived datasets made from existing primary sources, or new ways of combining datasets to create insights. I’ve used the metaphor of cooking to describe data analysis before; those recipes for data-informed problem solving help to document this stage and make it reproducible
  2. New ingredients → new datasets and data sources – finding and using new sources of data, like turning image, text or audio libraries into datasets, using cheaper sensors, finding ways to extract data from non-traditional sources, or using phone sensors for earthquake detection
  3. New cooking methods → new methods for cleaning, managing or analysing data – which includes things like Jupyter notebooks, machine learning or differential privacy
  4. New cooking processes → new processes for organising the collection, preparation and analysis of data – e.g. collaborative maintenance, developing open standards for data, or approaches to data governance and collective consent?

The breakdown isn’t perfect, but I found the exercise useful to think through the types of innovation around data. I’ve been conscious recently that I’m often using the word “innovation” without really digging into what that means, how that innovation happens and what exactly is being done differently or produced as a result.

The categories are also useful, I think, in reflecting on the possible impacts of breakthroughs of different types. Or perhaps where investment in R&D might be prioritised and where ensuring the translation of innovative approaches into the mainstream might have most impact?

What do you think?

Increasing inclusion around open standards for data

I read an interesting article this week by Ana Brandusescu, Michael Canares and Silvana Fumega. Called “Open data standards design behind closed doors?” it explores issues of inclusion and equity around the development of “open data standards” (which I’m reading as “open standards for data”).

Ana, Michael and Silvana rightly highlight that standards development is often seen and carried out as a technical process, whereas their development and impacts are often political, social or economic. To ensure that standards are well designed, we need to recognise their power, choose when to wield that tool, and ensure that we use it well. The article also asks questions about how standards are currently developed and suggests a framework for creating more participatory approaches throughout their development.

I’ve been reflecting on the article this week alongside a discussion that took place in this thread started by Ana.

Improving the ODI standards guidebook

I agree that standards development should absolutely be more inclusive. I too often find myself in standards discussions and groups with people who look like me and whose experiences may not always reflect those who are ultimately impacted by the creation and use of a standard.

In the open standards for data guidebook we explore how and why standards are developed to help make that process more transparent to a wider group of people. We also placed an emphasis on the importance of the scoping and adoption phases of standards development because this is so often where standards fail. Not just because the wrong thing is standardised, but also because the standard is designed for the wrong audience, or its potential impacts and value are not communicated.

Sometimes we don’t even need a standard. Standards development isn’t about creating specifications or technology, those are just outputs. The intended impact is to create some wider change in the world, which might be to increase transparency, or support implementation of a policy or to create a more equitable marketplace. Other interventions or activities might achieve those same goals better or faster. Some of them might not even use data(!)

But looking back through the guidebook, while we highlight in many places the need for engagement, outreach, developing a shared understanding of goals and desired impacts and a clear set of roles and responsibilities, we don’t specifically foreground issues of inclusion and equity as much as we could have.

The language and content of the guidebook could be improved. As could some of the prototype tools we included, like the standards canvas. How might these be changed to foreground issues of inclusion and equity?

I’d love to get some contributions to the guidebook to help us improve it. Drop me a message if you have suggestions about that.

Standards as shared agreements

Open standards for data are reusable agreements that guide the exchange of data. They shape how I collect data from you as a data provider. And they shape how you, as a data provider, (re)present the data you have collected and, in many cases, they will ultimately impact how you collect data in the future.
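
One way to picture a standard as a shared agreement is as a schema that both the data provider and the data user validate against. The fields and rules below are a toy example I have made up purely for illustration, but they show how the agreement ends up shaping what gets collected and exchanged.

```python
# A toy "shared agreement": a schema that a data provider and a data user both
# validate against. The fields and rules are invented for illustration only.
REQUIRED_FIELDS = {"site_id", "observed_at", "value", "unit"}
ALLOWED_UNITS = {"mm", "cm"}


def conforms_to_standard(record: dict) -> bool:
    """Check a single record against the shared agreement."""
    if not REQUIRED_FIELDS.issubset(record):
        return False
    return record["unit"] in ALLOWED_UNITS


# The provider checks records before publishing; the consumer can rely on the
# same rules when processing them.
assert conforms_to_standard(
    {"site_id": "abc-1", "observed_at": "2021-06-01", "value": 4.2, "unit": "mm"}
)
```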

If we foreground standards as agreements for shaping how data is collected and shared, then to increase inclusion and equity in the design of those agreements we can look to existing work like the Toolkit for Centering Racial Equity which provides a framework for thinking about inclusion throughout the life-cycle of data. Standards development fits within that life-cycle, even if it operates at a larger scale and extends it out to different time frames.

We can also recognise existing work and best practices around good participatory design and research.

We should avoid standards development, as a process, being divorced from broader discussions and best practices around ethics, equity and engagement around data. Taking a more inclusive and equitable approach to standards development is part of the broader discussion around the need for more integration across the computing and social sciences.

We may also need to recognise that sometimes agreements are made that don’t provide equitable outcomes for everyone. We might not be able to achieve a compromise that works for everyone. Being transparent about the goals and aims of a standard, and how it was developed, can help to surface who it is designed for (or not). Sometimes we might just need different standards, optimised for different purposes.

Some standards are more harmful than others

There are many different types of standard. And standards can be applied to different types of data. The authors of the original article didn’t really touch on this within their framework, but I think it’s important to recognise these differences as part of any follow-on activities.

The impacts of a poorly designed standard that classifies people or their health outcomes will be much more harmful than those of a poorly defined data exchange format. See all of Susan Leigh Star’s work. Or concerns from indigenous peoples about how they are counted and represented (or not) in statistical datasets.

Increasing inclusion can help to mitigate the harmful impacts around data. So focusing on improving inclusion (or recognising existing work and best practices) around the design of standards with greater capacity for harm is important. The skills and experience required to develop a taxonomy are fundamentally different from those required to develop a data exchange format.

Recognising these differences is also helpful when planning how to engage with a wider group of people, as we can identify what help and input is needed: what skills or perspectives are lacking among those leading standards work? What help or support needs to be offered to increase inclusion, e.g. by developing skills, or choosing different collaboration tools or methods of seeking input?

Developing a community of practice

Since we launched the standards guidebook I’ve been wondering whether it would be helpful to have more of a community of practice around standards development. I found myself thinking about this again after reading Ana, Michael and Silvana’s article and the subsequent discussion on twitter.

What would that look like? Does it exist already?

Perhaps supported by a set of learning or training resources that re-purposes some of the ODI guidebook material alongside other resources to help others to engage with and lead impactful, inclusive standards work?

I’m interested to see how this work and discussion unfolds.

FAIR, fairer, fairest?

“FAIR” (or “FAIR data”) is a term that I’ve been bumping into more and more frequently. For example, it’s included in the UK’s recently published Geospatial Strategy.

FAIR is an acronym that stands for Findable, Accessible, Interoperable and Reusable. It defines a set of principles that highlight some important aspects of publishing machine-readable data well. For example, they identify the need to adopt common standards, use common identifiers, provide good metadata and clear usage licences.
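
As a rough illustration of what those recommendations can look like in practice, here is a small metadata record for a made-up dataset, written as a Python dictionary loosely following the schema.org Dataset vocabulary. The identifiers and URLs are fictional; the point is the kind of information (persistent identifiers, standard formats, an explicit licence) that the principles push publishers towards.

```python
# A fictional example of machine-readable dataset metadata, loosely based on
# the schema.org "Dataset" vocabulary. All identifiers and URLs are made up.
dataset_metadata = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example rainfall observations 2020",
    "identifier": "https://doi.org/10.1234/example-rainfall-2020",  # fictional DOI
    "description": "Daily rainfall observations from a fictional sensor network.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "keywords": ["rainfall", "weather", "observations"],
    "distribution": [
        {
            "@type": "DataDownload",
            "encodingFormat": "text/csv",
            "contentUrl": "https://example.org/data/rainfall-2020.csv",
        }
    ],
}
```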

The principles were originally defined by researchers in the life sciences. They were intended to help to improve management and sharing of data in research. Since then the principles have been increasingly referenced in other disciplines and domains.

At the ODI we’re currently working with CABI on a project that is applying the FAIR data principles, alongside other recommendations, to improve data sharing in grants and projects funded by the Gates Foundation.

From the perspective of encouraging the management and sharing of well-structured, standardised, machine-readable data, the FAIR principles are pretty good. They explore similar territory as the ODI’s Open Data Certificates and Tim Berners-Lee’s 5-Star Principles.

But the FAIR principles have some limitations and have been critiqued by various communities. As the principles become adopted in other contexts it is important that we understand these limitations, as they may have more of an impact in different situations.

A good background on the FAIR principles and some of their limitations can be found in this 2018 paper. But there are a few I’d like to highlight in this post.

They’re just principles

A key issue with FAIR is that these are just principles. They offer recommendations about best practices, but they don’t help you answer specific questions. For example:

  • what metadata is useful to publish alongside different types of datasets?
  • which standards and shared identifiers are the best to use when publishing a specific dataset?
  • where will people be looking for this dataset, to ensure it’s findable?
  • what are the trade-offs of using different competing standards?
  • what terms of use and licensing are appropriate to use when publishing a specific dataset for use by a specific community?
  • …etc

Applying the principles to a specific dataset means you need to have a clear idea about what you’re trying to achieve, what standards and best practices are used by the community you’re trying to support, or what approach might best enable the ecosystem you’re trying to grow and support.

We touched on some of these issues in a previous project that CABI and ODI delivered to the Gates Foundation. We encouraged people to think about FAIR in the context of a specific data ecosystem.

Currently there’s very little guidance that exists to support these decisions around FAIR. Which makes it harder to assess whether something is really FAIR in practice. Inevitably there will be trade-offs that involve making choices about standards and how much to invest in data curation and publication. Principles only go so far.

The principles are designed for a specific context

The FAIR principles were designed to reflect the needs of a specific community and context. Many of the recommendations are also broadly applicable to data publishing in other domains and contexts. But they embody design decisions that may not apply universally.

For example, they choose to emphasise machine-readability. Other communities might choose to focus on other elements that are more important to them or their needs.

As an alternative, the CARE principles for indigenous data governance are based around Collective Benefit, Authority to Control, Responsibility and Ethics. Those are good principles too. Other groups have chosen to propose ways to adapt and expand on FAIR.

It may be that the FAIR principles will work well in your specific context or community. But it might also be true that, if you were to start from scratch and design a new set of principles, you would choose to highlight other things.

Whenever we are applying off-the-shelf principles in new areas, we need to think about whether they are helping us to achieve our own goals. Do they emphasise and prioritise work in the right areas?

The principles are not about being “fair”

Despite the acronym, the principles aren’t about being “fair”.

I don’t really know how to properly define “fair”. But I think it includes things like equity ‒ of access, or representation, or participation. And ethics and engagement. The principles are silent on those topics, leading some people to think about FAIRER data.

Don’t let the memorable acronym distract from the importance of ethics, consequence scanning and centering equity.

FAIR is not open

The principles were designed to be applied in contexts where not all data can be open. Life science research involves lots of sensitive personal information. Instead the principles recommend that data usage rights are clear.

I usually point out that FAIR data can exist across the data spectrum. But the principles don’t remind you that data should be as open as possible. Or prompt you to consider the impacts of different types of licensing. They just ask you to be clear about the terms of reuse, however restrictive they might be.

So, to recap: the FAIR data principles offer a useful framework of things to consider when making data more accessible and easier to reuse. But they are not perfect. And they do not consider all of the various elements required to build an open and trustworthy data ecosystem.