Why are we still building portals?

The Geospatial Commission have recently published some guidance on Designing Geospatial Data Portals. There’s a useful overview in the accompanying blog post.

It’s good clear guidance that should help anyone building a data portal. It has tips for designing search interfaces, presenting results and dataset metadata.

There’s very little advice that is specifically relevant to geospatial data and little in the way of new insights in general. The recommendations echo lots of existing research, guidance and practice. But it’s always helpful to see best practices presented in an accessible way.

For guidance that is ostensibly about geospatial data portals, I would have liked to have seen more of a focus on geospatial data. This aspect is largely limited to recommending the inclusion of a geospatial search, spatial filtering and use of spatial data formats.

It would have been nice to see some suggestions around the useful boundaries to include in search interfaces, recommendations around specific GIS formats and APIs, or some exploration of how to communicate the geographic extents of individual datasets to users.
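As a rough illustration of how dataset extents could support spatial filtering, here is a minimal sketch of a bounding box search. The catalogue, dataset names and coordinates are hypothetical, and the check assumes WGS84 decimal degrees and boxes that do not cross the antimeridian. It is not taken from the guidance.

```python
from typing import NamedTuple


class BoundingBox(NamedTuple):
    """Geographic extent in decimal degrees (WGS84 assumed)."""
    west: float
    south: float
    east: float
    north: float


def intersects(a, b):
    """True if two boxes overlap or touch. Boxes crossing the antimeridian are not handled."""
    return (
        a.west <= b.east
        and b.west <= a.east
        and a.south <= b.north
        and b.south <= a.north
    )


# Hypothetical catalogue; the extents are rough, illustrative values.
catalogue = {
    "bath-cycle-routes": BoundingBox(-2.42, 51.35, -2.31, 51.42),
    "scotland-rivers": BoundingBox(-7.7, 54.6, -0.7, 60.9),
}


def search_by_extent(query):
    """Return the names of datasets whose extent intersects the query box."""
    return [name for name, box in catalogue.items() if intersects(box, query)]


# A query box roughly covering south-west England.
print(search_by_extent(BoundingBox(-6.0, 49.9, -1.5, 52.0)))
```

A real portal would more likely use proper geometries and a spatial index, but even a simple extent like this gives users something concrete to filter on.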

Fixing a broken journey

The guidance presents a typical user journey that involves someone using a search engine, finding a portal rather than the data they need, and then repeating their search in a portal.

Improving that user journey is best done at the first step. A portal is just getting in the way.

Data publishers should be encouraged to improve the SEO of their datasets if they really want them to be found and used.

Data publishers should be encouraged to improve the documentation and metadata on their “dataset landing pages” to help put that data in context.
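One common way to make a dataset landing page easier for search engines to understand is to embed a schema.org Dataset description as JSON-LD. The sketch below builds one in Python. The property names come from schema.org, but the helper function and every value passed to it are placeholders I've made up for illustration.

```python
import json


def dataset_jsonld(name, description, url, licence_url, keywords,
                   download_url, media_type):
    """Build a schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": licence_url,
        "keywords": keywords,
        "distribution": [
            {
                "@type": "DataDownload",
                "encodingFormat": media_type,
                "contentUrl": download_url,
            }
        ],
    }


# All values below are placeholders for illustration only.
example = dataset_jsonld(
    name="Example cycle route network",
    description="Hypothetical dataset of cycle routes, updated quarterly.",
    url="https://example.org/datasets/cycle-routes",
    licence_url="https://example.org/licence",
    keywords=["cycling", "transport"],
    download_url="https://example.org/datasets/cycle-routes.geojson",
    media_type="application/geo+json",
)

# The resulting JSON would be embedded in the landing page inside a
# <script type="application/ld+json"> element.
print(json.dumps(example, indent=2))
```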

If we can improve this then we don’t have to support users in discovering a portal, checking whether it is relevant, teaching them to navigate it, etc.

We don’t really need more portals to improve discovery or use of data. We should be thinking about this differently.

There are many portals, but this one is mine

Portals are created for all kinds of purposes.

Many are just a fancy CMS for datasets that are run by individual organisations.

Others are there to act as hosts for data to help others make it more accessible. Some provide a directory of datasets across a sector.

Looking more broadly, portals support consumption of data by providing a single point of integration with a range of tools and platforms. They work as shared spaces for teams, enabling collaborative maintenance and sharing of research outputs. They also support data governance processes: you need to know what data you have in order to ensure you’re managing it correctly.

If we want to build better portals, then we ought to really have a clearer idea of what is being built, for whom and why.

This new guidance rightly encourages user research, but presumes building a portal as the eventual outcome.

I don’t mean that to be dismissive. There are definitely cases where it is useful to bring together collections of data to help users. But that doesn’t necessarily mean that we need to create a traditional portal interface.

Librarians exist

For example, in order to tackle specific challenges it can be useful to identify a set of relevant related data. This implies a level of curation — a librarian function — which is so far missing from the majority of portals.

Curated collections of data (& code & models & documentation & support) might drive innovation whilst helping ensure that data is used in ways that are mindful of the context of its collection. I’ve suggested recipes as one approach to that. But there are others.

Curation and maintenance of collections are less popular because they’re not easily automated. You need to employ people with an understanding of an issue, the relevant data, and how it might be used or not. To me this approach is fundamental to “publishing with purpose”.

Data agencies

Jeni has previously proposed the idea of “data agencies” as a means of improving discovery. The idea is briefly mentioned in this document.

I won’t attempt to capture the nuance of her idea, but it involves providing a service to support people in finding data via an expert help desk. The ONS already have something similar for their own datasets, but an agency could cover a whole sector or domain. It could also publish curated lists of useful data.

This approach would help broker relationships between data users and data publishers. This would not only help improve discovery, but also build trust and confidence in how data is being accessed, used and shared.

Actually linking to data?

I have a working hypothesis that, setting aside those that need to aggregate lots of small datasets from different sources, most data-enabled analyses, products and services typically only use a small number of related datasets. Maybe a dozen?

The same foundational datasets are used repeatedly in many different ways. The same combination of datasets might also be analysed for different purposes. It would be helpful to surface the most useful datasets and their combinations.

We have very little insight into this because dataset citation, linking and attribution practices are poor.

We could improve data search if this type of information was more readily available. Link analysis isn’t a substitute for good metadata, but it’s part of the overall puzzle in creating good discovery tools.

Actually linking to data when it’s referenced would also be helpful.
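To make the idea of link analysis a little more concrete, here is a minimal sketch that counts how often individual datasets, and pairs of datasets, are cited together. It assumes we already have a mapping from projects to the datasets they reference, which is exactly the information that is hard to get today. The project and dataset names are invented.

```python
from collections import Counter
from itertools import combinations


def dataset_usage(citations):
    """Count how often each dataset, and each pair of datasets, is cited.

    `citations` maps a project or publication to the set of datasets it uses.
    """
    single = Counter()
    pairs = Counter()
    for datasets in citations.values():
        # Sort so that each pair is always recorded in the same order.
        unique = sorted(set(datasets))
        single.update(unique)
        pairs.update(combinations(unique, 2))
    return single, pairs


# Hypothetical citation data, for illustration only.
citations = {
    "report-a": {"postcodes", "census", "house-prices"},
    "app-b": {"postcodes", "census"},
    "analysis-c": {"postcodes", "bus-stops"},
}

single, pairs = dataset_usage(citations)
print(single.most_common(2))
print(pairs.most_common(1))
```

Counts like these could be used to rank search results or to suggest datasets that are commonly used together.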

Developing shared infrastructure

Portals often provide an opportunity to standardise how data is being published. As an intermediary they inevitably shape how data is published and used. This is another area where existing portals do little to improve their overall ecosystem.

But those activities aren’t necessarily tied to the creation and operation of a portal. Provision of shared platforms, open source tools, guidance, quality checkers, linking and aggregation tools, and driving development and adoption of standards can all be done in other ways.

It doesn’t matter how well designed your portal interface is if a user ends up at an out-of-date, poor quality or inaccessible dataset. Or if the costs of using it are too high. Or a lack of context contributes to it being misused or misinterpreted.

This type of shared infrastructure development doesn’t get funded because it’s not easy to automate. And it rarely produces something you can point at and say “we launched this”.

But it is vital to actually achieving good outcomes.

Portals as service failures

The need for a data portal is an indicator of service failure.

Addressing that failure might involve creating a new service. But we shouldn’t rule out reviewing existing services to see where data can be made more discoverable.

If a new service is required then it doesn’t necessarily have to be a conventional search engine.

The UK Smart Meter Data Ecosystem

Disclaimer: this blog post is about my understanding of the UK’s smart meter data ecosystem and contains some opinions about how it might evolve. These do not in any way reflect those of Energy Sparks of which I am a trustee.

This blog post is an introduction to the UK’s smart meter data ecosystem. It sketches out some of the key pieces of data infrastructure with some observations around how the overall ecosystem is evolving.

It’s a large, complex system so this post will only touch on the main elements. Pointers to more detail are included along the way.

If you want a quick reference with more diagrams, then this UK government document, “Smart Meters, Smart Data, Smart Growth”, is a good start.

Smart meter data infrastructure

Smart meters and meter readings

Traditionally, data about your home or business energy usage was collected by someone coming to read the actual numbers displayed on the front of your meter. In some cases that’s still how the data is collected. It’s just that today you might be entering those readings into a mobile or web application provided by your supplier. In between those readings, your supplier will be estimating your usage.

This situation improved with the introduction of AMR (“Automated Meter Reading”) meters which can connect via radio to an energy supplier. The supplier can then read your meter automatically, to get basic information on your usage. After receiving a request the meter can broadcast the data via radio signal. These meters are often only installed in commercial properties.

Smart meters are a step up from AMR meters. They connect via a Wide Area Network (WAN) rather than radio, support two way communications and provide more detailed data collection. This means that when you have a smart meter your energy supplier can send messages to the meter, as well as taking readings from it. These messages can include updated tariffs (e.g. as you switch supplier or if you are on a dynamic tariff) or a notification to say you’ve topped up your meter, etc.

The improved connectivity and functionality means that readings can be collected more frequently and are much more detailed. Half hourly usage data is the standard. A smart meter can typically store around 13 months of half-hourly usage data. 
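To give a sense of the volumes involved, here is a small sketch that works out how many half-hourly readings fit into a window of 13 calendar months, and rolls half-hourly values up into daily totals. The dates and consumption values are arbitrary examples, not real meter data.

```python
from datetime import date

READINGS_PER_DAY = 24 * 2  # one reading every half hour


def readings_between(start, end):
    """Number of half-hourly readings for the days from start up to, but not including, end."""
    return (end - start).days * READINGS_PER_DAY


def daily_totals(readings):
    """Sum a flat list of half-hourly kWh values into one total per complete day."""
    return [
        sum(readings[i:i + READINGS_PER_DAY])
        for i in range(0, len(readings) - READINGS_PER_DAY + 1, READINGS_PER_DAY)
    ]


# Example window of 13 calendar months; the dates are arbitrary.
print(readings_between(date(2020, 1, 1), date(2021, 2, 1)))

# Two days of made-up readings at a constant 0.25 kWh per half hour.
print(daily_totals([0.25] * 96))
```

So a single meter holds somewhere in the region of 19,000 half-hourly readings, depending on exactly which months are covered.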

The first generation of smart meters are known as SMETS-1 meters. The latest meters are SMETS-2.

Meter identifiers and registers

Meters have unique identifiers

For gas meters the identifiers are called MPRNs. I believe these are allocated in blocks to gas providers to be assigned to meters as they are installed.

For electricity meters, these identifiers are called MPANs. Electricity meters also have a serial number. I believe MPANs are assigned by the individual regional electricity network operators and that this information is used to populate a national database of installed meters.

From a consumer point of view, services like Find My Supplier will allow you to find your MPRN and energy suppliers.

Connectivity and devices in the home

If you have a smart meter installed then your meters might talk directly to the WAN, or access it via a separate controller that provides the necessary connectivity. 

But within the home, devices will talk to each other using Zigbee, which is a low power internet of things protocol. Together they form what is often referred to as the “Home Area Network” (HAN).

It’s via the home network that your “In Home Display” (IHD) can show your current and historical energy usage as it can connect to the meter and access the data it stores. Your electricity usage is broadcast to connected devices every 10 seconds, while gas usage is broadcast every 30 minutes.

Your IHD can show your energy consumption in various ways, including how much it is costing you. This relies on your energy supplier sending your latest tariff information to your meter.

As this article by Bulb highlights, the provision of an IHD and its basic features is required by law. Research showed that IHDs were more accessible and nudged people towards being more conscious of their energy usage. The high-frequency updates from the meter to connected devices make it easier, for example, for you to identify which devices or uses contribute most to your bill.

Your energy supplier might provide other apps and services that provide you with insights, via the data collected via the WAN. 

But you can also connect other devices into the home network provided by your smart meter (or data controller). One example is a newer category of IHD called a “Consumer Access Device” (CAD), e.g. the Glow.

These devices connect via Zigbee to your meter and via Wi-Fi to a third-party service, to which they send your meter readings. For the Glow device, that service is operated by Hildebrand.

These third party services can then provide you with access to your energy usage data via mobile or web applications. Or even via API. Otherwise as a consumer you need to access data via whatever methods your energy supplier supports.

The smart meter network infrastructure

SMETS-1 meters connected to a variety of different networks. This meant that if you switched suppliers then they frequently couldn’t access your meter because it was on a different network. So meters needed to be replaced. And, even if they were on the same network, differences in technical infrastructure meant the meters might lose functionality.

SMETS-2 meters don’t have this issue as they all connect via a shared Wide Area Network (WAN). There are two of these covering the north and south of the country.

While SMETS-2 meters are better than previous models, they still have all of the issues of any Internet of Things device: problems with connectivity in rural areas, need for power, varied performance based on manufacturer, etc.

Some SMETS-1 meters are also now being connected to the WAN. 

Who operates the infrastructure?

The Data Communication Company is a state-licensed monopoly that operates the entire UK smart meter network infrastructure. It’s a wholly-owned subsidiary of Capita. Their current licence runs until 2025. 

DCC subcontracted provision of the WAN to support connectivity of smart meters to two regional providers. In the North of England and Scotland that provider is Arqiva. In the rest of England and Wales it is Telefonica UK (who own O2).

All of the messages that go to and from the meters via the WAN go via DCC’s technical infrastructure.

The network has been designed to be secure. As a key piece of national infrastructure, that’s a basic requirement. Here’s a useful overview of how the security was designed, including some notes on trust and threat modelling.

Part of the design of the system is that there is no central database of meter readings or customer information. It’s all just messages between the suppliers and the meters. However, as they describe in a recently published report, the DCC do apparently hold some databases of what they call “system data”: the metadata about individual meters and the messages sent to them.

The smart meter roll-out

It’s mandatory for smart meters to now be installed in domestic and smaller commercial properties in the UK. Companies can install SMETS-1 or SMETS-2 meters, but the rules were changed recently so only newer meters count towards their individual targets. And energy companies can get fined if they don’t install them quickly enough.

Consumers are being encouraged to have smart meters fitted in existing homes, as meters are replaced, to provide them with more information on their usage and access to better tariffs, such as those that offer dynamic time of day pricing.

But there are also concerns around privacy and fears of energy supplies being remotely disconnected, which are making people reluctant to switch when given the choice. Trust is clearly an important part of achieving a successful rollout.

Ofgem have a handy guide to consumer rights relating to smart meters. Which? have an article about whether you have to accept a smart meter, and Energy UK and Citizens Advice have a one-page “data guide” that provides the key facts.

But smart meters aren’t being uniformly rolled out. For example they are not mandated for all commercial (non-domestic) properties. 

At the time of writing there are over 10 million smart meters connected via the DCC, with 70% of those being SMETS-2 meters. The Elexon dashboard for smart electricity meters estimates that the rollout of electricity meters is roughly 44% complete. There are also some official statistics about the rollout.

The future will hold much more fine-grained data about energy usage across the homes and businesses in the UK. But in the short-term there’s likely to be a continued mix of different meter types (dumb, AMR and smart) meaning that domestic and non-domestic usage will have differences in the quality and coverage of data due to differences in how smart meters are being rolled out.

Smart meters will give consumers greater choice in tariffs because the infrastructure can better deal with dynamic pricing. They will also help the shift to a greener, more efficient energy network because there is better data to help manage the network.

Access to the data infrastructure

Access to and use of the smart meter infrastructure is governed by the Smart Energy Code. Section I covers privacy.

The code sets out the roles and responsibilities of the various actors who have access to the network. That includes the infrastructure operators (e.g. the organisations looking after the power lines and cables) as well as the energy companies (e.g. those who are generating the energy) and the energy suppliers (e.g. the organisations selling you the energy). 

There is a public list of all of the organisations in each category and a summary of their licensing conditions that apply to smart meters.

The focus of the code is on those core actors. But there is an additional category of “Other Providers”. This is basically a miscellaneous group of other organisations that are not directly involved in provision of energy as a utility, but which may have or require access to the data infrastructure.

These other providers include organisations that:

  • provide technology to energy companies who need to be able to design, test and build software against the smart meter network
  • offer services like switching and product recommendations
  • access the network on behalf of consumers, allowing them to directly access usage data in the home using devices, e.g. Hildebrand and its Glow device
  • provide other additional third-party services. This includes companies like Hildebrand and N3RGY that are providing value-added APIs over the core network

To be authorised to access the network you need to go through a number of stages, including an audit to confirm that you have the right security in place. This can take a long time to complete. Documentation suggests this might take upwards of 6 months.

There are also substantial annual costs for access to the network. This helps to make the infrastructure sustainable, with all users contributing to it. 

Data ecosystem map


As a summary, here are the key points:

  • your in-home devices send and receive messages and data via the smart meter or controller installed in your home or business property
  • your in-home device might also be sending your data to other services, with your consent
  • messages to and from your meter are sent via a secure network operated by the DCC
  • the DCC provide APIs that allow authorised organisations to send and receive messages from that data infrastructure
  • the DCC doesn’t store any of the meter readings, but does collect metadata about the traffic over that network
  • organisations that have access to the infrastructure may store and use the data they can access, but generally need consent from users for detailed meter data
  • the level and type of access, e.g. what messages can be sent and received, may differ across organisations
  • your energy supplier uses the data they retrieve from the DCC to generate your bills, provide you with services, optimise the system, etc
  • the UK government has licensed the DCC to operate that national data infrastructure, with Ofgem regulating the system

At a high-level, the UK smart meter system is like a big federated database: the individual meters store and submit data, with access to that database being governed by the DCC. The authorised users of that network build and maintain their own local caches of data as required to support their businesses and customers.

The evolving ecosystem

This is a big complex piece of national data infrastructure. This makes it interesting to unpick as an example of real-world decisions around the design and governance of data access.

It’s also interesting as the ecosystem is evolving.

Changing role of the DCC

The DCC have recently published a paper called “Data for Good” which sets out their intention to create a “system data exchange” (you should read that as “system data” exchange). This means providing access to the data they hold about meters and the messages sent to and from them. (There’s a list of these message types in an appendix to the Smart Energy Code).

The paper suggests that increased access to that data could be used in a variety of beneficial ways. This includes helping people in fuel poverty, or improving management of the energy network.

Encouragingly the paper talks about open and free access to data, which seems reasonable if data is suitably aggregated and anonymised. However the language is qualified in many places. DCC will presumably be incentivised by the existing ecosystem to reduce its costs and find other revenue sources. And their 5 year business development plan makes it clear that they see data services as a new revenue stream.

So time will tell.

The DCC is also required to improve efficiency and costs for operating the network to reduce burden on the organisations paying to use the infrastructure. This includes extending use of the network into other areas. For example to water meters or remote healthcare (see note at end of page 13).

Any changes to what data is provided, or how the network is used will require changes to the licence and some negotiation with Ofgem. As the licence is due to be renewed in 2025, then this might be laying groundwork for a revised licence to operate.

New intermediaries

In addition to a potentially changing role for the DCC, the other area in which the ecosystem is growing is via “Other Providers” that are becoming data intermediaries.

The infrastructure and financial costs of meeting the technical, security and audit requirements required for direct access to the DCC network creates a high barrier for third-parties wanting to provide additional services that use the data. 

The DCC APIs and messaging infrastructure are also difficult to work with, meaning that integration costs can be high. The DCC “Data for Good” report notes that direct integration “…is recognised to be challenging and resource intensive”.

There are a small but growing number of organisations, including Hildebrand, N3RGY, Smart Pear and Utiligroup, who see an opportunity to lower this barrier by providing value-added services over the DCC infrastructure. For example, simple JSON based APIs that simplify access to meter data.

Coupled with access to sandbox environments to support prototyping, this provides a simpler and cheaper API with which to integrate. Security remains important but the threat profiles and risks are different as API users have no direct access to the underlying infrastructure and only read-only access to data.
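To illustrate why a simple JSON API lowers the barrier, here is a sketch of what consuming one might look like. The response body and its field names are entirely hypothetical. They are invented for this example and do not describe the API of any of the providers mentioned above.

```python
import json
from datetime import datetime

# Hypothetical response body. The field names are invented for this sketch
# and do not describe any specific provider's API.
SAMPLE_RESPONSE = """
{
  "meter": "example-meter-1",
  "unit": "kWh",
  "readings": [
    {"start": "2021-01-01T00:00:00", "value": 0.21},
    {"start": "2021-01-01T00:30:00", "value": 0.18},
    {"start": "2021-01-01T01:00:00", "value": 0.25}
  ]
}
"""


def parse_readings(body):
    """Turn the JSON body into a list of (interval start, consumption) tuples."""
    payload = json.loads(body)
    return [
        (datetime.fromisoformat(r["start"]), float(r["value"]))
        for r in payload["readings"]
    ]


readings = parse_readings(SAMPLE_RESPONSE)
total = sum(value for _, value in readings)
print(f"{len(readings)} readings, {total:.2f} kWh in total")
```

A few lines of standard library code is a very different proposition to integrating directly with the underlying messaging infrastructure.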

To comply with the governance of the existing system, the downstream user still needs to ensure they have appropriate consent to access data. And they need to be ready to provide evidence if the intermediary is audited.

The APIs offered by these new intermediaries are commercial services: the businesses are looking to do more than just cover their costs and will be hoping to generate significant margin through what is basically a reseller model. 

It’s worth noting that access to AMR meter data is also typically via commercial services, at least for non-domestic meters. The price per meter for data from smart meters currently seems lower, perhaps because it’s relying on a more standard, shared underlying data infrastructure.

As the number of smart meters grows I expect access to a cheaper and more modern API layer will become increasingly interesting for a range of existing and new products and services.

Lessons from Open Banking

From my perspective the major barrier to more innovative use of smart meter data is the existing data infrastructure. The DCC obviously recognises the difficulty of integration and other organisations are seeing potential for new revenue streams by becoming data intermediaries.

And needless to say, all of these new intermediaries have their own business models and bespoke APIs. Ultimately, while they may end up competing in different sectors or markets, or over quality of service, they’re all relying on the same underlying data and infrastructure.

In the finance sector, Open Banking has already demonstrated that a standardised set of APIs, licensing and approach to managing access and consent can help to drive innovation in a way that is good for consumers. 

There are clear parallels to be drawn between Open Banking, which increased access to banking data, and how access to smart meter data might be increased. It’s a very similar type of data: highly personal, transactional records. And can be used in very similar ways, e.g. account switching.

The key difference is that there’s no single source of banking transactions, so regulation was required to ensure that all the major banks adopted the standard. Smart meter data is already flowing through a single state-licensed monopoly.

Perhaps if the role of the DCC is changing, then they could also provide a simpler standardised API to access the data? Ofgem and DCC could work with the market to define this API as happened with Open Banking. And by reducing the number of intermediaries it may help to increase trust in how data is being accessed, used and shared?

If there is a reluctance to extend DCC’s role in this direction then an alternative step would be to recognise the role and existence of these new types of intermediary within the Smart Energy Code. That would allow their licence to use the network to include agreement to offer a common, core standard API, common data licensing terms and a common approach for collection and management of consent. Again, Ofgem, DCC and others could work with the market to define that API.

For me, these two approaches are the most obvious ways to carry the lessons and models from Open Banking into the energy sector. There are clearly many more aspects of the energy data ecosystem that might benefit from improved access to data, which is where initiatives like Icebreaker One are focused. But starting with what will become a fundamental part of the national data infrastructure seems like an obvious first step to me.

The other angle that Open Banking tackled was creating better access to data about banking products. The energy sector needs this too, as there’s no easy way to access data on energy supplier tariffs and products.

Examples of data ecosystem mapping

This blog post is basically a mood board showing some examples of how people are mapping data ecosystems. I wanted to record a few examples and highlight some of the design decisions that go into creating a map.

A data ecosystem consists of data infrastructure, and the people, communities and organisations that benefit from the value created by it. A map of that data ecosystem can help illustrate how data and value are created and shared amongst those different actors.

The ODI has published a range of tools and guidance on ecosystem mapping. Data ecosystem mapping is one of several approaches that are being used to help people design and plan data initiatives. A recent ODI report looks at these “data landscaping” tools with some useful references to other examples.

The Flow of My Voice

Joseph Wilk’s “The Flow of My Voice” highlights the many different steps through which his voice travels before being stored and served from a YouTube channel, and transcribed for others to read.

The emphasis here is on exhaustively mapping each step, with a representation of the processing at each stage. The text notes which organisation owns the infrastructure at each stage. The intent here is to help to highlight the loss of control over data as it passes through complex interconnected infrastructures. This means a lot of detail.

Data Archeogram: mapping the datafication of work

Armelle Skatulski has produced a “Data Archeogram” that highlights the complex range of data flows and data infrastructure that are increasingly being used to monitor people in the workplace. Starting from various workplace and personal data collection tools, it rapidly expands out to show a wide variety of different systems and uses of data.

Similar to Wilk’s map this diagram is intended to help promote critical review and discussion about how this data is being accessed, used and shared. But it necessarily sacrifices detail around individual flows in an attempt to map out a much larger space. I think the use of patent diagrams to add some detail is a nice touch.

Retail and EdTech data flows

The Future of Privacy Forum recently published some simple data ecosystem maps to illustrate local and global data flows using the Retail and EdTech sectors as examples.

These maps are intended to help highlight the complexity of real world data flows, to help policy makers understand the range of systems and jurisdictions that are involved in sharing and storing personal data.

Because these maps are intended to highlight cross-border flows of data they are presented as if they were an actual map of routes between different countries and territories. This is something that is less evident in the previous examples. These diagrams aren’t showing any specific system and illustrate a typical, but simplified data flow.

They emphasise the actors and flows of different types of data in a geographical context.

Data privacy project: Surfing the web from a library computer terminal

The Data Privacy Project “teaches NYC library staff how information travels and is shared online, what risks users commonly encounter online, and how libraries can better protect patron privacy”. As part of their training materials they have produced a simple ecosystem map and some supporting illustrations to help describe the flow of data that happens when someone is surfing the web in a library.

Again, the map shows a typical rather than a real-world system. It’s useful to contrast this with the first example, which is much more detailed by comparison. For an educational tool, a more summarised view is better to help build understanding.

The choice of which actors are shown also reflects its intended use. It highlights web hosts, ISPs and advertising networks, but has less to say about the organisations whose websites are being used and how they might use data they collect.

Agronomy projects

This ecosystem map, which I produced for a project we did at the ODI, has a similar intended use.

It provides a summary of a typical data ecosystem we observed around some Gates Foundation funded agronomy projects. The map is intended as a discussion and educational tool to help Programme Officers reflect on the ecosystem within which their programmes are embedded.

This map uses features of Kumu to encourage exploration, providing summaries for each of the different actors in the map. This makes it more dynamic than the previous examples.

Following the methodology we were developing at the ODI it also tries to highlight different types of value exchange: not just data, but also funding, insights, code, etc. These were important inputs and outputs to these programmes.

OpenStreetMap Ecosystem

In contrast to most of the earlier examples, this partial map of the OSM ecosystem tries to show a real-world ecosystem. It would be impossible to properly map the full OSM ecosystem so this is inevitably incomplete and increasingly out of date.

The decision about what detail to include was driven by the goals of the project. The intent was to try and illustrate some of the richness of the ecosystem whilst highlighting how a number of major commercial organisations were participants in that ecosystem. This was not evident to many people until recently.

The map mixes together broad categories of actors, e.g. “End Users” and “Contributor Community” alongside individual commercial companies and real-world applications. The level of detail is therefore varied across the map.

Governance design patterns

The final example comes from this Sage Bionetworks paper. The paper describes a number of design patterns for governing the sharing of data. It includes diagrams of some general patterns as well as real-world applications.

The diagrams show relatively simple data flows, but they are drawn differently to some of the previous examples. Here the individual actors aren’t directly shown as the endpoints of those data flows. Instead, the data stewards, users and donors are depicted as areas on the map. This helps to emphasise where data is crossing governance boundaries and where its use is informed by different rules and agreements. Those agreements are also highlighted on the map.

Like the Future of Privacy ecosystem maps, the design is being used to help communicate some important aspects of the ecosystem.

The Common Voice data ecosystem

In 2021 I’m planning to spend some more time exploring different data ecosystems with an emphasis on understanding the flows of data within and between different data initiatives, the tools they use to collect and share data, and the role of collaborative maintenance and open standards.

One project I’ve been looking at this week is Mozilla Common Voice. It’s an initiative that is producing a crowd-sourced, public domain dataset that can be used to train voice recognition applications. It’s the largest dataset of its type, consisting of over 7,000 hours of audio across 60 languages.

It’s a great example of communities working to create datasets that are more open and representative, helping to address biases and supporting the creation of more equitable products and services. I’ve been using it in my recent talks on collaborative maintenance, but have had a chance to dig a bit deeper this week.

The main interface allows contributors to either record their voice, by reading short pre-prepared sentences, or validate existing contributions by listening to existing recordings and confirming that they match the script.

Behind the scenes is a more complicated process, which I found interesting.

It further highlights the importance of both open source tooling and openly licensed content in supporting the production of open data. It is also another example of how choices around licensing can create friction between open projects.

The data pipeline

Essentially, the goal of the Common Voice project is to create new releases of its dataset, with each release including more languages and, for each language, more validated recordings.

The data pipeline that supports that consists of the following basic steps. (There may be other stages involved in the production of the output corpus, but I’ve not dug further into the code and docs.)

  1. Localisation. The Common Voice web application first has to be localised into the required language. This is coordinated via Mozilla Pontoon, with a community of contributors submitting translations licensed under the Mozilla Public Licence 2.0. Pontoon is open source and can be used for other non-Mozilla applications. When the localisation gets to 95% the language can be added to the website and the process can move to the next stage.
  2. Sentence Collection. Common Voice needs short sentences for people to read. These sentences need to be in the public domain (e.g. via a CC0 waiver). A minimum of 5,000 sentences are required before a language can be added to the website. The content comes from people submitting and validating sentences via the sentence collector tool. The text is also drawn from public domain sources. There’s a sentence extractor tool that can pull content from Wikipedia and other sources. For bulk imports the Mozilla team needs to check for licence compatibility before adding text. All of this means that the source texts for each language are different.
  3. Voice Donation. Contributors read the provided sentences to add their voice to the dataset. The reading and validation steps are separate microtasks. Contributions are gamified and there are progress indicators for each language.
  4. Validation. Submitted recordings go through retrospective review to assess their quality. This allows for some moderation, allowing contributors to flag recordings that are offensive, incorrect or are of poor quality. Validation tasks are also gamified. In general there are more submitted recordings than validations. Clips need to be reviewed by two separate users for them to be marked as valid (or invalid).
  5. Publication. The corpus consists of valid, invalid and “other” (not yet validated) recordings, split into development, training and test datasets. There are separate datasets for each language.
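The thresholds in those steps can be summarised in a few lines of code. This is only a rough sketch of the rules as I've described them, not how Common Voice implements them: the names are hypothetical, and I'm assuming that a clip is settled as soon as two reviewers agree.

```python
from dataclasses import dataclass
from typing import List

# Thresholds quoted in the steps above.
LOCALISATION_THRESHOLD = 0.95  # share of the web application that is translated
MIN_SENTENCES = 5000           # public domain sentences needed before launch
REVIEWS_PER_CLIP = 2           # separate reviewers needed to settle a clip


@dataclass
class LanguageStatus:
    """Hypothetical progress record for a single language."""
    localised_fraction: float
    sentence_count: int


def can_launch(status: LanguageStatus) -> bool:
    """A language can be added to the site once both thresholds are met."""
    return (status.localised_fraction >= LOCALISATION_THRESHOLD
            and status.sentence_count >= MIN_SENTENCES)


def clip_bucket(votes: List[bool]) -> str:
    """Place a clip in the valid, invalid or "other" part of the corpus.

    Each vote is one reviewer's verdict. Assumption: two matching verdicts
    settle a clip; the real rules may differ.
    """
    if votes.count(True) >= REVIEWS_PER_CLIP:
        return "valid"
    if votes.count(False) >= REVIEWS_PER_CLIP:
        return "invalid"
    return "other"
```

For example, `can_launch(LanguageStatus(0.96, 4000))` is false because the sentence threshold hasn't been reached, and a clip with a single review stays in the "other" bucket.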

There is an additional dataset which consists of 14 single word sentences (the ten digits, “yes”, “no”, “hey”, “Firefox”) which is published separately. The steps 2-4 look similar though.

Some observations

What should be clear is that there are multiple stages, each with their own thresholds for success.

To get a language into the project you need to translate around 600 text fragments from the application and compile a corpus of at least 5,000 sentences before the real work of collecting the voice dataset can begin.

That work requires input from multiple, potentially overlapping communities:

  • the community of translators, working through Pontoon
  • the community of writers, authors, content creators creating public domain content that can be reused in the service
  • the Common Voice contributors submitting additional sentences
  • the contributors recording their voice
  • the contributors validating other recordings
  • the teams at Mozilla, coordinating and supporting all of the above

As the Common Voice application and configuration is open source, it is easy to include it in Pontoon to allow others to contribute to its localisation. To build representative datasets, your tools need to work for all the communities that will be using them.

The availability of public domain text in the source languages is clearly a contributing factor in getting a language added to the site and ultimately included in the dataset.

So the adoption of open licences and the richness of the commons in those languages will be a factor in determining how rich the voice dataset might be for that language. And, hence, how easy it is to create good voice and text applications that can support those communities.

You can clearly create a new dedicated corpus, as people have done for Hakha Chin. But the strength and openness of one area of the commons will impact other areas. It’s all linked.

While there are different communities involved in Common Voice, it’s clear from these reports from communities working on Hakha Chin and Welsh that, in some cases, it’s the same community that is working across the whole process.

Every language community is working to address its own needs: “We’re not dependent on anyone else to make this happen…We just have to do it“.

That’s the essence of shared infrastructure. A common resource that supports a mixture of uses and communities.

The decision about which licences to use is, as ever, really important. At present Common Voice only takes a few sentences from individual pages of the larger Wikipedia instances. As I understand it this is because Wikipedia content is not public domain, so cannot be used wholesale. But small extracts should be covered by fair use?

I would expect that those interested in building and maintaining their language-specific instances of Wikipedia have overlaps with those interested in making voice applications work in that same language. Incompatible licensing can limit the ability to build on existing work.

Regardless, the Mozilla and the Wikimedia Foundations have made licensing choices that reflect the needs of their communities and the goals of their projects. That’s an important part of building trust. But, as ever, those licensing choices have subtle impacts across the wider ecosystem.

How do data publishing choices shape data ecosystems?

This is the latest in a series of posts in which I explore some basic questions about data.

In our work at the ODI we have often been asked for advice about how best to publish data. When trying to give helpful advice, one thing I’m always mindful of is how the decisions about how data is published shape the ways in which value can be created from it. More specifically, whether those choices will enable the creation of a rich data ecosystem of intermediaries and users.

So what are the types of decisions that might help to shape data ecosystems?

To give a simple example, if I publish a dataset so it’s available as a bulk download, then you could use that data in any kind of application. You could also use it to create a service that helps other people create value from the same data, e.g. by providing an API or an interface to generate reports from the data. Publishing in bulk allows intermediaries to help create a richer data ecosystem. But, if I’d just published that same data via an API then there are limited ways in which intermediaries can add value. Instead people must come directly to my API or services to use the data.

This is one of the reasons why people prefer open data to be available in bulk. It allows for more choice and flexibility in how it is used. But, as I noted in a recent post, depending on the “dataset archetype” your publishing options might be limited.

The decision to only publish a dataset as an API, even if it could be published in other ways, is often a deliberate one. The publisher may want to capture more of the value around the dataset, e.g. by charging for the use of an API. Or they may feel it is important to have more direct control over who uses it, and how. These are reasonable choices and, when the data is sensitive, sensible options.

But there are a variety of ways in which the choices that are made about how to publish data can shape or constrain the ecosystem around a specific dataset. It’s not just about bulk downloads versus APIs.

The choices include:

  • the licence that is applied to the data, which might limit it to non-commercial use, restrict redistribution, or impose limits on the use of derived data
  • the terms and conditions for the API or other service that provides access to the data. These terms are often conflated with data licences, but typically focus on aspects of service provision, for example rate limiting, restrictions on storage of API results, permitted uses of the API, permitted types of users, etc
  • the technology used to provide access to data. In addition to bulk downloads vs API, there are also details such as the use of specific standards, the types of API call that are possible, etc
  • the governance around the API or service that provides access to data, which might limit which users can get access to the service or create friction that discourages use
  • the business model that is wrapped around the API or service, which might include a freemium model, chargeable usage tiers, service level agreements, usage limits, etc

I think these cover the main areas. Let me know if you think I’ve missed something.

You’ll notice that APIs and services provide more choices for how a publisher might control usage. This can be a good or a bad thing.

The range of choices also means it’s very easy to create a situation where an API or service doesn’t work well for some use cases. This is why user research and engagement is such an important part of releasing a data product and designing policy interventions that aim to increase access to data.

For example, let’s imagine someone has published an openly licensed dataset via an API that restricts users to a maximum number of API calls per month.

This choice limits some uses of the API, e.g. applications that need to make lots of queries. It also means that downstream users creating web applications are unable to provide a good quality of service to their own users. A popular application might just stop working at some point over the course of the month because it has hit the usage threshold.

The dataset might be technically open, but in practice its use has been constrained by other choices.
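A bit of arithmetic shows how quickly a monthly cap can bite. This is a minimal sketch with made-up numbers: the function name, the quota and the traffic figures are all hypothetical, and it assumes the application makes the same number of calls every day.

```python
def first_day_refused(monthly_quota, calls_per_day, days_in_month=30):
    """Return the first day of the month on which a call would be refused.

    Returns None if the quota lasts for the whole month. Assumes a steady
    number of calls per day, which real traffic rarely is.
    """
    if calls_per_day <= 0:
        return None
    # Days that can be served in full, plus one: on that next day the
    # remaining allowance (if any) runs out partway through.
    day = monthly_quota // calls_per_day + 1
    return day if day <= days_in_month else None
```

With a quota of 10,000 calls a month, an application making 100 calls a day never notices the limit. One making 500 calls a day is served for twenty days and then stops working on day 21.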

Those choices might have been made for good reasons. For example, they might give the data publisher a way to predict how much they need to invest each month in providing a free service that is accessible to lots of users making a smaller number of requests. There is inevitably a trade-off between the needs of individual users and the publisher.

Adding on a commercial usage tier for high volume users might provide a way for the publisher to recoup costs. It also allows some users to choose what to pay for their use of the API, e.g. to more smoothly handle unexpected peaks in their website traffic. But it may sometimes be simpler to provide the data in bulk to support those use cases. Different use cases might be better served by different publishing options.

Another example might be a system that provides access to both shared and open data via a set of APIs that conform to open standards. If the publisher makes it too difficult for users to actually sign up to use those APIs, e.g. because of difficult registration or certification requirements, then only those organisations that can afford to invest the time and money to gain access might bother using them. The end result might be a closed ecosystem that is built on open foundations.

I think it’s important to understand how this range of choices can impact data ecosystems. They’re important not just for how we design products and services, but also in helping to design successful policies and regulatory interventions. If we don’t consider the full range of choices, then we may not achieve the intended outcomes.

More generally, I think it’s important to think about the ecosystems of data use. Often I don’t think enough attention is paid to the variety of ways in which value is created. This can lead to poor choices, like choosing to try and sell data for short term gain rather than considering the variety of ways in which value might be created in a more open ecosystem.

Some tips for open data ecosystem mapping

At Open Data Camp last month I pitched to run a session on mapping open data ecosystems. Happily quite a few people were interested in the topic, so we got together to try out the process and discuss the ideas. We ended up running the session according to my outline and a handout I’d prepared to help people.

There’s a nice writeup with a fantastic drawnalism summary on the Open Data Camp blog. I had a lot of good feedback from people afterwards to say that they’d found the process useful.

I’ve explored the idea a bit further with some of the ODI team, which has prompted some useful discussion. It also turns out that the Food Standards Agency are working through a similar exercise at the moment to better understand their value networks.

This blog post just gathers together those links along with a couple more examples and a quick brain dump of some hints and tips for applying the tool.

Some example maps

After the session at Open Data Camp I shared a few example maps I’d created:

That example starts to present some of the information covered in my case study on Discogs.

I also tried doing a map to illustrate aspects of the Energy Sparks project:

Neither of those are fully developed, but hopefully provide useful reference points.

I’ve been using Draw.io to do those maps as it saves to Google Drive which makes it easier to collaborate.

Some notes

  • The maps don’t have to focus on just the external value, e.g. what happens after data is published. You could map value networks internal to an organisation as well
  • I’ve found that the maps can get very busy, very quickly. My suggestion is to focus on the key value exchanges rather than trying to be completely comprehensive (at least at first)
  • Try to focus on real, rather than potential exchanges of value. So, rather than brainstorm ways that sharing some data might prove useful, as a rule of thumb check whether you can point to some evidence of a tangible or intangible value exchange. For example:
    • Tangible value: Is someone signing up to a service, or is there a documented API or data access route?
    • Intangible value: is there an event, contact point or feedback form which allows this value to actually be shared?
  • “Follow the data”. Start with the data exchanges and then add applications and related services.
  • While one of the goals is to identify the different roles that organisations play in data ecosystems (e.g. “Aggregator”) it’s often easier to start with the individual organisation and their specific exchanges first, rather than the goal. Organisations may end up playing several roles, and that’s fine. The map will help evidence that
  • Map the current state, not the future. There’s no time aspect to these maps, so I’d recommend drawing a different map to show how you hope things might be, rather than how they are.
  • There was a good suggestion to label data exchanges in some way to add a bit more context, e.g. by using thicker lines for key data exchanges, or a marker to indicate open (versus shared or closed) data sharing
  • Don’t forget that for almost all exchanges where a service is being delivered (e.g. an application, hosting arrangement, etc) there will also be an implicit, reciprocal data exchange. As a user of a service I am contributing data back to the service provider in the form of usage statistics, transactional data, etc. Identifying where that data is accruing (but not being shared) is a good way to identify future open data releases
  • A value network is not a process diagram. The value exchanges are between people and organisations, not systems. If you’ve got a named application on the diagram it should only be as the name of tangible value (“provision of application X”) not as a node in the diagram
  • Sometimes you’re better off drawing a process or data flow diagram. If you want to follow how the data gets exchanged between systems, e.g. to understand its provenance or how it is processed, then you may be better off drawing a data flow diagram. I think as practitioners we may need to draw different views of our data ecosystems. Similar to how systems architects have different ways to document software architecture
  • The process of drawing a map is as important as the output itself. From the Open Data Camp workshop and some subsequent discussions, I’ve found that the diagrams quickly generate useful insights and talking points. I’m keen to try the process out in a workshop setting again to explore this further

I’m keen to get more feedback on this. So if you’ve tried out the approach then let me know how it works for you. I’d be really interested to see some more maps!

If you’re not sure how to get started then also let me know how I can help, for example what resources would be useful? This is one of several tools I’m hoping to write up in my book.

Open Data Camp Pitch: Mapping data ecosystems

I’m going to Open Data Camp #4 this weekend. I’m really looking forward to catching up with people and seeing what sessions will be running. I’ve been toying with a few session proposals of my own and thought I’d share an outline for this one to gauge interest and get some feedback.

I’m calling the session: “Mapping open data ecosystems“.

Problem statement

I’m very interested in understanding how people and organisations create and share value through open data. One of the key questions that the community wrestles with is demonstrating that value, and we often turn to case studies to attempt to describe it. We also develop arguments to use to convince both publishers and consumers of data that “open” is a positive.

But, as I’ve written about before, the open data ecosystem consists of more than just publishers and consumers. There are a number of different roles. Value is created and shared between those roles. This creates a value network including both tangible (e.g. data, applications) and intangible (knowledge, insight, experience) value.

I think if we map these networks we can get more insight into what roles people play, what makes a stable ecosystem, and better understand the needs of different types of user. For example we can compare open data ecosystems with more closed marketplaces.

The goal

Get together a group of people to:

  • map some ecosystems using a suggested set of roles, e.g. those we are individually involved with
  • discuss whether the suggested roles need to be refined
  • share the maps with each other, to look for overlaps, draw out insights, validate the approach, etc


I know Open Data Camp sessions are self-organising, but I was going to propose a structure to give everyone a chance to contribute, whilst also generating some output. Assuming an hour session, we could organise it as follows:

  • 5 mins review of the background, the roles and approach
  • 20 mins group activity to do a mapping exercise
  • 20 mins discussion to share maps, thoughts, etc
  • 15 mins discussion on whether the approach is useful, refine the roles, etc

The intention here being to try to generate some outputs that we can take away. Most of the session will be group activity and discussion.

Obviously I’m open to other approaches.

And if no-one is interested in the session then that’s fine. I might just wander round with bits of paper and ask people to draw their own networks over the weekend.

Let me know if you’re interested!


Beyond Publishers and Consumers: Some Example Ecosystems

Yesterday I wrote a post suggesting that we should move beyond publishers and consumers and recognise the presence of a wider variety of roles in the open data ecosystem. I suggested a taxonomy of roles as a starting point for discussion.

In this post I wanted to explore how we can use that taxonomy to help map and understand an ecosystem. Eventually I want to work towards a more complete value network analysis and some supporting diagrams for a few key ecosystems. But I wanted to start with hopefully simple examples.

As I’ve been looking at it recently I thought I’d start by examining Copenhagen’s open data initiative and their city data marketplace.

What kind of ecosystems do those two programmes support?

The Copenhagen open data ecosystem

The open data ecosystem can support all of the roles I outlined in my taxonomy:

  • Steward: The city of Copenhagen is the steward of all (or the majority of) the datasets that are made available through its data platform, e.g. the location of parking meters
  • Contributor: The contributors to the dataset are the staff and employees of the administration who collect and then publish the data
  • Reuser: Developers or start-ups who are using open data to build apps and services, such as I Bike CpH
  • Beneficiary: Residents and visitors to Copenhagen

Examples of the tangible value being exchanged here are:

  • (Steward -> Reuser) The provision of data from the Steward to the Reuser
  • (Reuser -> Beneficiary) The provision of a transport application from the Reuser to the Beneficiary

Examples of the intangible value are:

  • (Contributor -> Steward) The expertise of the Contributors offered to the Steward to help manage the data
  • (Beneficiary -> Reuser) The market insights gained by the Reuser which may be used to create new products
  • (Reuser -> Steward) The insights shared by the Reuser with the Steward into which other datasets might be useful to release or improve

In addition, the open licensing of the data enables two additional actors in the ecosystem:

  • Intermediaries: who can link the Copenhagen data with other datasets, enrich it against other sources, or offer value added APIs. Services such as TransportAPI.
  • Aggregators: e.g. services that aggregate data from multiple portals to create specific value-added datasets, e.g. an aggregation of census data

In this case the Intermediaries and Aggregators will be supporting their own community of Reusers and Beneficiaries. This increases the number of ways in which value is exchanged.
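To make the idea of a value network a bit more concrete, here's a minimal sketch that writes down the exchanges listed above as data. The structure and names are just my assumptions for illustration, and it only covers the exchanges I've listed, not the additional Intermediary and Aggregator roles.

```python
from typing import NamedTuple


class Exchange(NamedTuple):
    source: str  # role providing the value
    target: str  # role receiving the value
    kind: str    # "tangible" or "intangible"
    what: str    # short description of the value exchanged


# The exchanges described above for the Copenhagen open data ecosystem.
EXCHANGES = [
    Exchange("Steward", "Reuser", "tangible", "provision of data"),
    Exchange("Reuser", "Beneficiary", "tangible",
             "provision of a transport application"),
    Exchange("Contributor", "Steward", "intangible",
             "expertise to help manage the data"),
    Exchange("Beneficiary", "Reuser", "intangible", "market insights"),
    Exchange("Reuser", "Steward", "intangible",
             "insights into which datasets to release or improve"),
]


def roles(exchanges):
    """Every role that appears at either end of an exchange."""
    return {e.source for e in exchanges} | {e.target for e in exchanges}


def received_by(role, exchanges):
    """The exchanges in which the given role receives value."""
    return [e for e in exchanges if e.target == role]
```

Calling `received_by("Steward", EXCHANGES)` returns the two intangible exchanges flowing back to the city, which is a reminder that a Steward gets something out of publishing too.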

The Copenhagen city data marketplace

The ecosystem around the city data marketplace is largely identical to the open data ecosystem. However there are some important differences.

  • Steward: The city of Copenhagen is not the only Steward: the goal is to allow other organisations to publish their data via the marketplace. The marketplace will be multi-tenant.
  • Intermediary: the marketplace itself has become an intermediary, operated by Hitachi
  • The ecosystem will have a greater variety of Contributors, reflecting the wider variety of organisations contributing to the maintenance of those datasets.
  • Reusers and Beneficiaries will be present as before

In addition, because the marketplace offers paid access to data, there are other forms of value exchange, e.g. exchange of money for services (Reuser -> Intermediary).

But the marketplace explicitly rules out the Intermediary and Aggregator roles. Services like TransportAPI or Geolytix could not build their businesses against the city data marketplace. This is because the terms of use of the market prohibit onward distribution of data and the creation of potentially competitive services.

The effort to create a more open platform to enable data sharing has resulted in the exclusion of certain types of value exchange and value-added services. The design of the ecosystem privileges a single Intermediary: in this case Hitachi as operator of the platform.

Time will tell whether this is an issue or not. But my feeling is that limiting certain forms of value creation isn’t a great basis for encouraging innovation.

An alternative approach would be to have designed the platform to be part of the digital commons. For example, allowing Stewards the choice of adding data to the platform under an open licence would give space for other Intermediaries and Aggregators to operate.

Let me know if you think this type of analysis is useful!



Enabling the Linked Data Ecosystem

This post was originally published in the Talis “Nodalities” blog and in “Nodalities” magazine issue 5.

The Linked Data web might usefully be viewed as an incremental evolution beyond Web 2.0. Instead of disconnected silos of data accessible only through disconnected custom APIs, we have datasets that are deeply connected to one another using simple web links, allowing applications to “follow their nose” to find additional relevant data about a specific resource. Custom protocols and data formats are the realm of the early web; the future of the web is in an increased emphasis on standards like HTTP, URIs and RDF that ironically have been in use for many years.
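The “follow your nose” pattern can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the URIs are invented, and a dictionary stands in for the HTTP lookups and RDF parsing that a real application would perform.

```python
from collections import deque

# Toy stand-in for the web: each URI maps to the description that a lookup
# might return. All URIs and properties are made up for this sketch.
WEB = {
    "http://example.org/album/1": {
        "title": "Example Album",
        "links": ["http://example.org/artist/7"],
    },
    "http://example.org/artist/7": {
        "name": "Example Artist",
        "links": ["http://example.org/place/3", "http://example.org/album/1"],
    },
    "http://example.org/place/3": {"label": "Example Town", "links": []},
}


def follow_your_nose(start, max_hops=2):
    """Breadth-first walk over links, returning descriptions keyed by URI."""
    seen = {start}
    queue = deque([(start, 0)])
    found = {}
    while queue:
        uri, hops = queue.popleft()
        description = WEB.get(uri)
        if description is None:
            continue  # a dead link: nothing to learn here
        found[uri] = description
        if hops == max_hops:
            continue  # far enough from the starting resource
        for link in description["links"]:
            if link not in seen:
                seen.add(link)
                queue.append((link, hops + 1))
    return found
```

Starting from the album and following links for two hops gathers the descriptions of the artist and the place as well, without the application needing to know about either in advance.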

Describing this as a “back to basics” approach wouldn’t be far wrong. Many might argue that RDF is far from simple, but this overlooks the elegance of its core model. Working within the constraints of standard technologies and the web architecture allows for a greater focus on the real drivers behind data publishing: what information do we want to share, and how is it modelled?

Answering those questions should be relatively easy for any organisation. All businesses have useful datasets that their customers and business partners might usefully access; and they have the domain expertise required to structure that data for online reuse. And, should any organisation want some additional creative input, the Linked Data community has also put together a shopping list [1] to highlight some specific datasets of interest. This list is worth reviewing alongside the Linked Data graph [2], to explore both the current state of the Linked Data web and the directions in which it is potentially going to grow.

Beyond the first questions of what and how to share data, there are other issues that need to be considered. These range from internal issues that organisations face in attempting to justify the sharing of data online, through to larger concerns that may impact the Linked Data ecosystem. For the purposes of this article, this ecosystem can be divided up into two main categories: data publishers, who publish and share information online; and data consumers, who make use of these rich datasets.

There is obvious overlap between these two categories: many organisations will fall into both camps, as do we all through our personal contributions to the web. However, for this paper I want to focus primarily on business and organisational participants, and attempt to illustrate the different issues that are relevant to these roles.

Data Publishers Perspective

The first issue facing any organisation is how to justify both the initial and ongoing effort required to support the publishing of Linked Data. Depending on existing infrastructure this may range from a relatively small effort to a major engineering task—particularly true if content has to be converted from other formats or new workflows introduced. In “A Call to Arms” in the last issue of Nodalities [3], John Sheridan and Jeni Tennison provided some insight into how to address the technology hurdle by using technologies like RDFa.

But can this effort be made sustainable? Can the initial investment and ongoing costs be recouped? And, if a dataset becomes popular and grows to become very heavily used, can the infrastructure supporting the data publishing scale to match?

The general aim with enabling access to data is that it will foster network effects, and drive increasing traffic and usage towards existing products and services. There are success stories aplenty (Amazon, eBay, Salesforce, etc) that illustrate that there is real and not imagined potential.

But this justification overlooks some important distinctions. Firstly, for some organisations, e.g. charities and non-governmental organisations, information dissemination is part of their mission and there may not be other chargeable services to which additional traffic may be driven. In this scenario everything must be sustainable from the outset. Secondly, it also overlooks the fact that the data being shared may itself be an asset that can be commoditised. The value of access to raw data, stripped of any bundling application, has never been clearer, or been easier to achieve. New business models are likely to arise around direct access to quality data sources. Simple usage-based models are already prevalent on a number of Web 2.0 services and APIs: the free basic access fosters network effects, while the tiered pricing provides more reliable revenue for the data publisher.

Software as a service and cloud computing models undoubtedly have a role to play in addressing the sustainability and scaling issue, allowing data publishers to build out a publishing infrastructure that will support these operations without significant capital investments. But few of the existing services are really firmly targeted at this particular niche: while computing power and storage are increasingly readily available, support for Linked Data publishing or metered access to resources is not yet commonplace.

This is where Talis and the Talis Platform have a distinct offering: by supporting organisations in their initial exploration of Linked Data publishing, with a minimum of initial investment, and a scalable, standards-based infrastructure, it becomes much easier to justify dipping a toe into the “Blue Ocean” (see Nodalities issue 2 [5]).

Data Consumers Perspective

Let’s turn now to another aspect of the Linked Data ecosystem, and consider the data consumers perspective.

One issue that quickly becomes apparent when integrating an application with a web service or Linked Dataset is the need to move beyond simple “on the fly” data requests, e.g. to compose (“mash-up”) and view data sources in the browser, towards polling and harvesting increasingly large chunks of a Linked Dataset.

What drives this requirement? In part it is a natural consequence and benefit of the close linking of resources: links can be mined to find additional relevant metadata that can be used to enrich an application. The way that the data is exposed, e.g. as inter-related resources, is unlikely to always match the needs of the application developer who must harvest the data in order to index, process and analyse it so that it best fits the use cases of her application.

Creating an efficient web-crawling infrastructure is not an easy task, particularly as the Linked Data web continues to grow and the pool of available data expands. Technologies like SPARQL go some way towards mitigating these issues, as a query language allows for more flexibility in extracting data. However, provision of a stable SPARQL endpoint may be beyond the reach of smaller data publishers, particularly those who are adopting the RDFa approach of instrumenting existing applications with embedded data. Nor does SPARQL address the need to analyse datasets, e.g. to mine the graph in order to generate recommendations, analyse social networks, etc.
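As an illustration of the flexibility a query language offers, the sketch below builds the URL for a SPARQL query sent as an HTTP GET request, with the query text carried in the `query` parameter as the SPARQL protocol specifies. The endpoint address is hypothetical and the query is just an example; no request is actually sent.

```python
from urllib.parse import urlencode

# Hypothetical endpoint address, used purely for illustration.
ENDPOINT = "http://example.org/sparql"

# Ask for up to ten resources that have a FOAF name.
QUERY = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person ?name
WHERE {
  ?person foaf:name ?name .
}
LIMIT 10
"""


def build_query_url(endpoint, query):
    """Return a GET URL carrying the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = build_query_url(ENDPOINT, QUERY)
print(url)
```

A single request like this asks the publisher's endpoint to do the selection, where a crawler would need many separate fetches. That is precisely why a stable endpoint is a burden for smaller publishers to provide.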

Just as few applications carry out large scale crawling of the web, instead relying on services from a small number of large search engines, it seems reasonable to assume that the Linked Data web will similarly organise around some “true” semantic web search engines that provide data harvesting and acquisition services to machines rather than human users. Issues of trust will also need to be addressed within this community as the Linked Data web matures and becomes an increasing target for spam and other malicious uses. Inaccuracies and inconsistencies are already showing up.

The Talis Platform aims to address these issues by ultimately providing application developers with ready access to Linked Datasets, avoiding the need for individual users and organisations to repeatedly crawl the web. Value-added services can then be offered across these data sources, allowing features such as graph analysis (e.g. recommendations) to become commodity services available to all. The intention is not to try to mirror or aggregate the whole Linked Data web, which would be infeasible, but rather to collate those datasets that are of most value and use to the community, as well as shepherding the publishing of new datasets by working closely with data publishers.

As an intermediary, the Talis Platform can also address another issue: that of scaling service infrastructure to meet the requirements of data consumers without requiring data publishers to do likewise. It seems likely that data publishers may ultimately choose to “multi-home” their datasets, e.g. publishing directly onto the Linked Data web and also within environments such as the Talis Platform in order to allow consumers more choice in the method of data access.


The bootstrapping phase of the Linked Data web is now behind us. As a community, we need to begin considering the next steps, especially as the available data continues to grow. This article has attempted to illustrate a few of the wide range of issues that we face. While technology development, particularly around key standards like SPARQL, rules and inferencing, and the creation of core vocabularies, will always underpin the growth of the semantic web, increasingly it will be issues such as serviceable infrastructure and sustainable business models that come to the fore.

At Talis we are thinking carefully about the role we might play in addressing those issues and playing our part in enabling the Linked Data ecosystem to flourish.

[1]. http://community.linkeddata.org/MediaWiki/index.php?ShoppingList
[2]. http://richard.cyganiak.de/2007/10/lod/
[3]. http://www.talis.com/nodalities/pdf/nodalities_issue4.pdf
[4]. http://labs.google.com/papers/bigtable.html
[5]. http://www.talis.com/nodalities/pdf/nodalities_issue2.pdf