Examples of data ecosystem mapping

This blog post is basically a mood board showing some examples of how people are mapping data ecosystems. I wanted to record a few examples and highlight some of the design decisions that go into creating a map.

A data ecosystem consists of data infrastructure, and the people, communities and organisations that benefit from the value created by it. A map of that data ecosystem can help illustrate how data and value are created and shared amongst those different actors.

The ODI has published a range of tools and guidance on ecosystem mapping. Data ecosystem mapping is one of several approaches that are being used to help people design and plan data initiatives. A recent ODI report looks at these “data landscaping” tools with some useful references to other examples.

The Flow of My Voice

Joseph Wilk's "The Flow of My Voice" highlights the many different steps through which his voice travels before being stored and served from a YouTube channel, and transcribed for others to read.

The emphasis here is on exhaustively mapping each step, with a representation of the processing at each stage. The text notes which organisation owns the infrastructure at each stage. The intent is to help highlight the loss of control over data as it passes through complex, interconnected infrastructures. This means a lot of detail.

Data Archeogram: mapping the datafication of work

Armelle Skatulski has produced a “Data Archeogram” that highlights the complex range of data flows and data infrastructure that are increasingly being used to monitor people in the workplace. Starting from various workplace and personal data collection tools, it rapidly expands out to show a wide variety of different systems and uses of data.

Similar to Wilk's map, this diagram is intended to help promote critical review and discussion about how this data is being accessed, used and shared. But it necessarily sacrifices detail around individual flows in an attempt to map out a much larger space. I think the use of patent diagrams to add some detail is a nice touch.

Retail and EdTech data flows

The Future of Privacy Forum recently published some simple data ecosystem maps to illustrate local and global data flows using the Retail and EdTech sectors as examples.

These maps are intended to help highlight the complexity of real world data flows, to help policy makers understand the range of systems and jurisdictions that are involved in sharing and storing personal data.

Because these maps are intended to highlight cross-border flows of data, they are presented as if they were an actual map of routes between different countries and territories. This is something that is less evident in the previous examples. These diagrams aren't showing any specific system; instead they illustrate a typical, but simplified, data flow.

They emphasise the actors and flows of different types of data in a geographical context.

Data privacy project: Surfing the web from a library computer terminal

The Data Privacy Project "teaches NYC library staff how information travels and is shared online, what risks users commonly encounter online, and how libraries can better protect patron privacy". As part of their training materials they have produced a simple ecosystem map and some supporting illustrations to help describe the flow of data that happens when someone is surfing the web in a library.

Again, the map shows a typical rather than a real-world system. It's useful to contrast this with the first example, which is much more detailed. For an educational tool, a more summarised view is better for building understanding.

The choice of which actors are shown also reflects its intended use. It highlights web hosts, ISPs and advertising networks, but has less to say about the organisations whose websites are being used and how they might use data they collect.

Agronomy projects

This ecosystem map, which I produced for a project we did at the ODI, has a similar intended use.

It provides a summary of a typical data ecosystem we observed around some Gates Foundation funded agronomy projects. The map is intended as a discussion and educational tool to help Programme Officers reflect on the ecosystem within which their programmes are embedded.

This map uses features of Kumu to encourage exploration, providing summaries for each of the different actors in the map. This makes it more dynamic than the previous examples.

Following the methodology we were developing at the ODI it also tries to highlight different types of value exchange: not just data, but also funding, insights, code, etc. These were important inputs and outputs to these programmes.

OpenStreetMap Ecosystem

In contrast to most of the earlier examples, this partial map of the OSM ecosystem tries to show a real-world ecosystem. It would be impossible to properly map the full OSM ecosystem so this is inevitably incomplete and increasingly out of date.

The decision about what detail to include was driven by the goals of the project. The intent was to try and illustrate some of the richness of the ecosystem whilst highlighting how a number of major commercial organisations were participants in that ecosystem. This was not evident to many people until recently.

The map mixes together broad categories of actors, e.g. “End Users” and “Contributor Community” alongside individual commercial companies and real-world applications. The level of detail is therefore varied across the map.

Governance design patterns

The final example comes from this Sage Bionetworks paper. The paper describes a number of design patterns for governing the sharing of data. It includes diagrams of some general patterns as well as real-world applications.

The diagrams show relatively simple data flows, but they are drawn differently to some of the previous examples. Here the individual actors aren't directly shown as the endpoints of those data flows. Instead, the data stewards, users and donors are depicted as areas on the map. This helps emphasise where data is crossing governance boundaries and where its use is informed by different rules and agreements. Those agreements are also highlighted on the map.

Like the Future of Privacy ecosystem maps, the design is being used to help communicate some important aspects of the ecosystem.

12 ways to improve the GDS guidance on reference data publishing

GDS have published some guidance about publishing reference data for reuse across government. I’ve had a read and it contains a good set of recommendations. But some of them could be clearer. And I feel like some important areas aren’t covered. So I thought I’d write this post to capture my feedback.

Like the original guidance, my feedback largely ignores considerations of infrastructure or tools. That's quite a big topic and recommendations in those areas are unlikely to be applicable solely to reference data.

The guidance also doesn't address issues around data sharing, such as privacy or regulatory compliance. I'm also going to gloss over that. Again, not because it's not important, but because those considerations apply to sharing and publishing any form of data, not just reference data.

Here’s the list of things I’d revise or add to this guidance:

  1. The guidance should recommend that reference data be as open as possible, to allow it to be reused as broadly as possible. Reference data that doesn't contain personal information should be published under an open licence. Licensing is important even for cross-government sharing because other parts of government might be working with private or third sector organisations who also need to be able to use the reference data. This is the biggest omission for me.
  2. Reference data needs to be published over the long term so that other teams can rely on it and build it into their services and workflows. When developing an approach for publishing reference data, consider what investment needs to be made for this to happen. That investment will need to cover people and infrastructure costs. If you can’t do that, then at least indicate how long you expect to be publishing this data. Transparent stewardship can build trust.
  3. For reference data to be used, it needs to be discoverable. The guide mentions creating metadata and doing SEO on dataset pages, but doesn't include other suggestions such as using Schema.org Dataset metadata or even just depositing metadata in data.gov.uk. (There's a sketch of what that markup might look like after this list.)
  4. The guidance should recommend that stewardship of reference data is part of a broader data governance strategy. While you may need to identify stewards for individual datasets, governance of reference data should be part of broader data governance within the organisation. It's not a separate activity. Implementing that wider strategy shouldn't block making early progress to open up data, but consider reference data alongside other datasets.
  5. Forums for discussing how reference data is published should include external voices. The guidance suggests creating a forum for discussing reference data, involving people from across the organisation. But the intent is to publish data so it can be reused by others. This type of forum needs external voices too.
  6. The guidance should recommend documenting provenance of data. It notes that reference data might be created from multiple sources, but does not encourage recording or sharing information about its provenance. That’s important context for reusers.
  7. The guide should recommend documenting how identifiers are assigned and managed. The guidance has quite a bit of detail about adding unique identifiers to records. It should also encourage those publishing reference data to document how and when they create identifiers for things, and what types of things will be identified. Mistakes in understanding the scope and coverage of reference data can have huge impacts.
  8. There is a recommendation to allow users to report errors or provide feedback on a dataset. That should be extended to include a recommendation that the data publisher makes known errors clear to other users, as well as transparency around when individual errors might be fixed. Reporting an error without visibility of the process for fixing data is frustrating.
  9. GDS might recommend an API first approach, but reference data is often used in bulk. So there should be a recommendation to provide bulk access to data, not just an API. It might also be cheaper and more sustainable to share data in this way.
  10. The guidance on versioning should include record level metadata. The guidance contains quite a bit of detail around versioning of datasets. While useful, it should also include suggestions to include status codes and timestamps on individual records, to simplify integration and change monitoring. Change reporting is an important but detailed topic.
  11. While the guidance doesn’t touch on infrastructure, I think it would be helpful for it to recommend that platforms and tools used to manage reference data are open sourced. This will help others to manage and publish their own reference data, and build alignment around how data is published.
  12. Finally, if multiple organisations are benefiting from use of the same reference data then encouraging exploration of collaborative maintenance might help to reduce costs for maintaining data, as well as improving its quality. This can help to ensure that data infrastructure is properly supported and invested in.
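On point 3: here's a minimal sketch of the kind of Schema.org Dataset markup a publisher could embed in a dataset page, emitted from Python for convenience. All of the names and URLs are invented for illustration; only the vocabulary (Dataset, DataDownload, etc.) is standard Schema.org.

```python
import json

# A minimal sketch of Schema.org Dataset metadata, of the kind that could be
# embedded in a dataset page via a <script type="application/ld+json"> tag.
# All names and URLs below are invented for illustration.
metadata = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Register of Widget Licences",  # hypothetical dataset
    "description": "Reference list of licensed widget operators.",
    "license": "https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/",
    "publisher": {
        "@type": "Organization",
        "name": "Example Department",  # hypothetical publisher
    },
    "distribution": [
        {
            "@type": "DataDownload",
            "encodingFormat": "text/csv",
            "contentUrl": "https://data.example.gov.uk/widgets.csv",  # hypothetical
        }
    ],
}

print(json.dumps(metadata, indent=2))
```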

OSM Queries

For the past month I’ve been working on a small side project which I’m pleased to launch for Open Data Day 2021.

I’ve long been a fan of OpenStreetMap. I’ve contributed to the map, coordinated a local crowd-mapping project and used OSM tiles to help build web based maps. But I’ve only done a small amount of work with the actual data. Not much more than running a few Overpass API queries and playing with some of the exports available from Geofabrik.

I recently started exploring the Overpass API again to learn how to write useful queries. I wanted to see if I could craft some queries to help me contribute more effectively. For example by helping me to spot areas that might need updating. Or identify locations where I could add links to Wikidata.
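To give a flavour of what I mean, here's a sketch of one such query, wrapped in a bit of Python so it can be run against the public Overpass API endpoint. It looks for pubs in a named area that don't yet have a wikidata tag; the area name is just an illustration.

```python
import requests

# Overpass QL: find pubs within a named area that lack a wikidata tag.
# The area name is illustrative; substitute wherever you're mapping.
query = """
[out:json][timeout:25];
area[name="Bath"]->.searchArea;
node["amenity"="pub"][!"wikidata"](area.searchArea);
out body;
"""

# The main public Overpass API endpoint (there are several mirrors).
response = requests.post("https://overpass-api.de/api/interpreter", data=query)
response.raise_for_status()

for element in response.json()["elements"]:
    print(element["id"], element["tags"].get("name", "(unnamed)"))
```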

There's quite a bit of documentation about the Overpass API and the query language it uses, which is called Overpass QL. But I didn't find it that accessible. The documentation is more of a reference than a tutorial.

And, while there are quite a few example queries to be found across the OSM wiki and other websites, there isn't always a great deal of context explaining how they work or when you might use them.

So I’ve been working on two things to address what I think is a gap in helping people learn how to get more from the OpenStreetMap API.

overpass-doc

The first is a simple tool that will take a collection of Overpass queries and build a set of HTML pages from them. It’s based on a similar tool I built for SPARQL queries a few years ago. Both are inspired by Javadoc and other code documentation tools.

The idea was to encourage the publication of collections of useful, documented queries. E.g. to be shared amongst members of a community or people working on a project. The OSM wiki can be used to share queries, but it might not always be a suitable home for this type of content.

The tool is still at quite an early stage. It’s buggy, but functional.

To test it out I’ve been working on my own collection of Overpass queries. I initially started to pull together some simple examples that illustrated a few features of the language. But then realised that I should just use the tool to write a proper tutorial. So that’s what I’ve been doing for the last week or so.

Announcing OSM Queries

OSM Queries is the result. As of today the website contains four collections of queries. The main collection of queries is a 26 part tutorial that covers the basic features of Overpass QL.

By working through the tutorial you’ll learn:

  • some basics of the OpenStreetMap data model
  • how to write queries to extract nodes, ways and relations from the OSM database using a variety of different methods
  • how to filter data to extract just the features of interest
  • how to write spatial queries to find features based on whether they are within specific areas or in proximity to one another
  • how to output data as CSV and JSON for use in other tools (there's a small sketch of these last two after this list)
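As a taste of the spatial and output options mentioned above, here's another sketch. It asks for drinking water taps within 5km of a named node and returns the results as CSV. It assumes the named feature is mapped as a node, which is purely for illustration; for ways or relations you'd need a slightly different pattern.

```python
import requests

# Overpass QL combining a proximity filter with CSV output: drinking water
# taps within 5km of a named node. Assumes the feature is mapped as a node.
query = """
[out:csv(::id, name, ::lat, ::lon; true; ",")];
node[name="Uluru"];
node[amenity=drinking_water](around:5000);
out;
"""

response = requests.post("https://overpass-api.de/api/interpreter", data=query)
response.raise_for_status()
print(response.text)  # CSV text, one row per matching node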

Every query in the tutorial has its own page containing an embedded syntax highlighted version of the query. This makes them easier to share with others. You can click a button to load and run the query using the Overpass Turbo IDE. So you can easily view the results and tinker with the query.

I think the tutorial covers all the basic options for querying and filtering data. Many of the queries include comments that illustrate variations of the syntax, encouraging you to further explore the language.

I've also been compiling an Overpass QL syntax reference that provides a more concise view of some of the information in the OSM wiki. There are a lot of advanced features (like this) which I will likely cover in a separate tutorial.

Writing a tutorial against the live OpenStreetMap database is tricky. The results can change at any time. So I opted to focus on demonstrating the functionality using mostly natural features and administrative boundaries.

In the end I chose to focus on an area around Uluru in Australia. Not just because it provides an interesting and stable backdrop for the tutorial. But because I also wanted to encourage a tiny bit of reflection in the reader about what gets mapped, who does the mapping, and how things get tagged.

A bit of map art, and a request

The three other query collections are quite small.

I ended up getting a bit creative with the MapCSS queries.

For example, to show off the functionality I’ve written a query that shows the masonic symbol hidden in the streets of Bath, styled Brøndby Haveby like a bunch of flowers and the Lotus Bahai Temple as, well, a lotus flower.

These were all done by styling the existing OSM data. No edits were done to change the map. I wouldn’t encourage you to do that.

I’ve put all the source files and content for the website into the public domain so you’re free to adapt, use and share however you see fit.

While I'll continue to improve the tutorial and add some more examples I'm also hoping that I can encourage others to contribute to the site. If you have useful queries that could be added to the site then submit them via Github. I've provided a simple issue template to help you do that.

I'm hoping this provides a useful resource for people in the OSM community and that we can collectively improve it over time. I'd love to get some feedback, so feel free to drop me an email, comment on this post or message me on twitter.

And if you’ve never explored the data behind OpenStreetMap then Open Data Day is a great time to dive in. Enjoy.

The data we use in Energy Sparks

Disclaimer: this blog post is about some of the challenges that we have faced in consuming and using data in Energy Sparks. While I am a trustee of the Energy Sparks charity, and am currently working with the team on some improvements to the application, the opinions in this blog post are my own.

Energy Sparks is an online energy analysis tool and energy education programme specifically designed to help schools reduce their electricity and gas usage through the analysis of smart meter data. The service is run by the Energy Sparks charity, which aims to educate young people about climate change and the importance of energy saving and reducing carbon emissions. 

The team provides support for teachers and school eco-teams in running educational activities to help pupils learn about energy and climate change in the context of their school.

It was originally started as a project by Bath: Hacked and Transition Bath and has been funded by a range of organisations. Recent funding has come from the Ovo Foundation and via BEIS as part of the Non-Domestic Smart Meter Innovation Challenge.

The application uses a lot of different types of data, to provide insights, analysis and reporting to pupils, teachers and school administrators.

There are a number of challenges with accessing and using these different types of dataset. As there is a lot of work happening across the UK energy data ecosystem at the moment, I thought I’d share some information about what data is being used and where the challenges lie.

School data

Unsurprisingly the base dataset for the service is information about schools. There are actually a lot of different types of information that are useful to know about a school in order to do some useful analysis:

  • basic data about the school, e.g. its identifier, type and the curriculum key stages taught at the school
  • where it is, so we can map the schools and find local weather data
  • whether the school is part of a multi-academy trust or which local authority it falls under
  • information about its physical infrastructure. For example number of pupils, floor area, whether it has solar panels, night storage heaters, a swimming pool or serves school dinners
  • its calendar, so we can identify term times, inset days, etc. Useful if you want to identify when the heating may need to be on, or when it can be switched off
  • contact information for people at the school (provided with consent)
  • what energy meters are installed at the school and what energy tariffs are being used

This data can be tricky to acquire because:

  • there are separate databases of schools across the devolved nations, with no consistent method of access or consistency in the data
  • calendars vary across local authorities, school groups and on an individual basis
  • schools typically have multiple gas and electricity meters installed in different buildings
  • schools might have direct contracts with energy suppliers, be part of a group purchase scheme managed by a trust or their local authority or be part of a large purchasing framework agreement, so tariff and meter data might need to come from elsewhere. Many local authorities appoint separate meter operators adding a further layer of complexity to data acquisition. 

Weather data

If you want to analyse energy usage then you need to know what the weather was like at the location and time it was being used. You need more energy for heating when it’s cold. But maybe you can switch the heating off if it’s cold and it’s outside of term time.

If you want to suggest that the heating might be adjusted because it’s going to be sunny next week, then you need a weather forecast.

And if you want to help people understand whether solar panels might be useful, then you need to be able to estimate how much energy they might have been able to generate in their location. 

This means we use:

  • half-hourly historical temperature data to analyse the equivalent historical energy usage. On average we're looking at four years' worth of data for each school, but for some schools we have ten or more years
  • forecast temperatures to drive some user alerts and recommendations
  • estimated solar PV generation data 

The unfortunate thing here is that the Met Office doesn’t provide the data we need. They don’t provide historical temperature or solar irradiance data at all. They do provide forecasts via DataPoint, but these are weather station specific forecasts. So not that useful if you want something more local. 

For weather data, in lieu of Met Office data we draw on other sources. We originally used Weather Underground, until IBM acquired the service and later shut down the API. So then we used Dark Sky, until Apple acquired it and released the important and exciting news that they were shutting down that API too.

We've now moved on to using Meteostat, which is a volunteer-run service that provides a range of APIs and bulk data access under a CC-BY-NC-4.0 licence.

The feature that Meteostat, Dark Sky and Weather Underground all offer is location-based weather data, based on interpolating observations from individual stations. This lets us get closer to actual temperature data at the schools.
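As a sketch of what that looks like from our side: Meteostat publish a Python library that does the interpolation for you, given a point location. Something like the following (the coordinates are illustrative, and the library's API may have changed since writing):

```python
from datetime import datetime
from meteostat import Point, Hourly

# Interpolated hourly weather for a point location (roughly central Bath).
# Point() takes latitude, longitude and elevation in metres.
school = Point(51.3811, -2.3590, 20)

# Fetch a few weeks' worth of hourly observations.
data = Hourly(school, datetime(2021, 1, 1), datetime(2021, 2, 12)).fetch()

# 'temp' is air temperature in °C: the series we'd feed into the
# heating analysis described above.
print(data["temp"].describe())
```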

It would be great if the Met Office offered a similar feature under a fully open licence. They only offer a feed of recent site-specific observations.

To provide schools with estimates of the potential benefits of installing solar panels, we currently use the Sheffield University Solar PV Live API, which is publicly available, but unfortunately not clearly licensed. But it’s our best option. Based on that data we can indicate potential economic benefits of installing different sizes of solar panels.

National energy generation

We provide schools with reports on their carbon emissions and, as part of the educational activities, give insights into the sources of energy being generated on the national grid. 

For both of these uses, we take data from the Carbon Intensity API provided by the National Grid, which publishes data under a CC-BY licence. The API provides both live and historical half-hourly data, which aligns nicely with our other sources.
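A minimal sketch of pulling the current half-hourly figure from that API, based on my reading of the public documentation:

```python
import requests

# Fetch the current half-hourly carbon intensity for the national grid.
response = requests.get("https://api.carbonintensity.org.uk/intensity")
response.raise_for_status()

period = response.json()["data"][0]
intensity = period["intensity"]

# 'actual' can be null for the current period, so fall back to the forecast.
value = intensity["actual"] or intensity["forecast"]
print(f"{period['from']} to {period['to']}: {value} gCO2/kWh ({intensity['index']})")
```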

School energy usage and generation

The bulk of the data coming into the system is half-hourly meter readings from gas and electricity meters (usage) and from solar PV panels from schools that have them (generation and export).

This allows us to chart and analyse the data, presenting reports and analysis across each school's data.

There are numerous difficulties with getting access to this data:

  • the complexity of the energy ecosystem means that data is passed between meter operators, energy suppliers, local authorities, school groups, solar PV systems and a mixture of intermediary platforms. So just getting permission in the right place can be tricky
  • some solar PV monitoring systems, e.g. SolarEdge and RBee, offer APIs, so in some cases we integrate with these, further adding to the mixture of sources
  • the mixture of platforms means that, while there is more or less industry standard reporting of half-hourly readings, there is no standard API for accessing this data. There’s a tangle of proprietary, undocumented or restricted access APIs 
  • meters get added and removed over time, so the number of meters can change
  • for some rural schools, reporting of usage and generation is made trickier because of connectivity issues
  • in some cases, we know schools have solar PV installed, but we can’t get access to a proper feed. So in this case, we have to use the Sheffield PV Live API and knowledge of what panels are installed to create estimated outputs

The feature that most platforms and suppliers seem to consistently offer, at least to non-domestic customers, is a daily or weekly email with a CSV attachment containing the relevant readings. So this is the main route by which we currently bring data into the system.
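The core of that ingest is not much more than parsing a CSV of half-hourly readings. The sketch below uses invented column names, because every platform formats its exports slightly differently, which is exactly the problem:

```python
import csv

# Parse half-hourly readings from a supplier's CSV export. The column names
# here are invented for illustration; every platform's exports differ, which
# is why there's no single ingest pipeline.
def read_half_hourly(path):
    readings = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            meter = row["meter_id"]        # hypothetical column
            date = row["reading_date"]     # hypothetical column
            # 48 half-hourly values in columns hh01..hh48 (hypothetical)
            values = [float(row[f"hh{i:02d}"]) for i in range(1, 49)]
            readings[(meter, date)] = values
    return readings
```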

We’re also currently prototyping ingesting smart meter data, rather than the current AMR data. We will likely be accessing that via one or more intermediaries who provide public APIs that interface with the underlying infrastructure and APIs run by the Data Communications Company (DCC). The DCC are the organisation responsible for the UK’s entire smart meter infrastructure. There is a growing set of these companies that are providing access to this data. 

I plan to write more about that in a future post. But I’ll note here that the approach for managing consent and standardising API access is in a very early stage.

Unfortunately the government legislation backing the shift to smart meters only applies to domestic meters. So there is no requirement to stop installing AMR meters in non-domestic settings or a path to upgrade. So services targeting schools and businesses will need to deal with a tangle of data sources for some time to come. 

In addition to exploring integration with the DCC there are other ways that we might improve our data collection. For example directly integrating with the “Get Information about Schools Service” for data on English schools. Or using the EPC data to help find data on floor area for school buildings. 

But as of today, for the bulk of data we use, the two big wins would be access to historical data from the Met Office in a useful format, and some coordination across the industry around standardising access to meter data. I doubt we’ll see the former and I’m not clear yet whether any of the various open energy initiatives will produce the latter. 

Bath Historical Images

One of my little side projects is to explore historical images and maps of Bath and the surrounding areas. I like understanding the contrast between how Bath used to look and how it is today. It’s grown and changed a huge amount over the years. It gives me a strong sense of place and history.

There is a rich archive of photographs and images of the city and area that were digitised for the Bath in Time project. Unfortunately the council has chosen to turn this archive into a, frankly terrible, website that is being used to sell over-priced framed prints.

The website has limited navigation and there’s no access to higher resolution imagery. Older versions of the site had better navigation and access to some maps.

The current version looks like it’s based on a default ecommerce theme for WordPress rather than being designed to show off the richness of the 40,000 images it contains. Ironically the @bathintime twitter account tweets out higher resolution images than you can find on the website.

This is a real shame. Frankly I can’t imagine there’s a huge amount of revenue being generated from these prints.

If the metadata and images were published under a more open licence (even with a non-commercial limitation) then it would be more useful for people like me who are interested in local history. We might even be able to help build useful interfaces. I would happily invest time in cataloguing images and making something useful with them. In fact, I have been.

In lieu of a proper online archive, I've been compiling a list of publicly available images that I've found by sifting through the collections of other museums and archives.

I’ve only found around 230 images (including some duplicates across collections) so far, but there are some interesting items in there. Including some images of old maps.

I’ve published the list as open data.

So you can take the metadata and links and explore them for yourself. I thought they may be useful for anyone looking to reuse images in their research or publications.

I'm in the process of adding geographic coordinates to each of the images, so they can be placed on the map. I'm approaching that by geocoding them as if they were produced using a mobile phone or camera. For example, an image of the abbey won't have the coordinates of the abbey associated with it; it'll have the coordinates of wherever the artist was standing when they painted the picture.

This is already showing some interesting common views over the years. I’ve included a selection below.

Views from the river, towards Pulteney Bridge

Southern views of the city

Looking to the east across abbey churchyard

Views of the Orange Grove and Abbey

It's really interesting to be able to look at the same locations over time. Hopefully that gives a sense of what could be done if more of the archives were made available.

There’s more documentation on the dataset if you want to poke around. If you know of other collections of images I should look at, then let me know.

And if you have metadata or images to release under an open licence, or have archives you want to share, then get in touch as I may be able to help.

The Common Voice data ecosystem

In 2021 I’m planning to spend some more time exploring different data ecosystems with an emphasis on understanding the flows of data within and between different data initiatives, the tools they use to collect and share data, and the role of collaborative maintenance and open standards.

One project I’ve been looking at this week is Mozilla Common Voice. It’s an initiative that is producing a crowd-sourced, public domain dataset that can be used to train voice recognition applications. It’s the largest dataset of its type, consisting of over 7,000 hours of audio across 60 languages.

It's a great example of communities working to create datasets that are more open and representative. Helping to address biases and supporting the creation of more equitable products and services. I've been using it in my recent talks on collaborative maintenance, but have had the chance to dig a bit deeper this week.

The main interface allows contributors to either record their voice, by reading short pre-prepared sentences, or validate existing contributions by listening to existing recording and confirming that they match the script.

Behind the scenes is a more complicated process, which I found interesting.

It further highlights the importance of both open source tooling and openly licensed content in supporting the production of open data. It's also another example of how choices around licensing can create friction between open projects.

The data pipeline

Essentially, the goal of the Common Voice project is to create new releases of its dataset. With each release including more languages and, for each language, more validated recordings.

The data pipeline that supports that consists of the following basic steps. (There may be other stages involved in the production of the output corpus, but I’ve not dug further into the code and docs.)

  1. Localisation. The Common Voice web application first has to be localised into the required language. This is coordinated via Mozilla Pontoon, with a community of contributors submitting translations licensed under the Mozilla Public Licence 2.0. Pontoon is open source and can be used for other non-Mozilla applications. When the localisation gets to 95%, the language can be added to the website and the process can move to the next stage.
  2. Sentence Collection. Common Voice needs short sentences for people to read. These sentences need to be in the public domain (e.g. via a CC0 waiver). A minimum of 5,000 sentences are required before a language can be added to the website. The content comes from people submitting and validating sentences via the sentence collector tool. The text is also drawn from public domain sources. There's a sentence extractor tool that can pull content from Wikipedia and other sources. For bulk imports the Mozilla team needs to check for licence compatibility before adding text. All of this means that the source texts for each language are different.
  3. Voice Donation. Contributors read the provided sentences to add their voice to the dataset. The reading and validation steps are separate microtasks. Contributions are gamified and there are progress indicators for each language.
  4. Validation. Submitted recordings go through retrospective review to assess their quality. This allows for some moderation, allowing contributors to flag recordings that are offensive, incorrect or are of poor quality. Validation tasks are also gamified. In general there are more submitted recordings than validations. Clips need to be reviewed by two separate users for them to be marked as valid (or invalid).
  5. Publication. The corpus consists of valid, invalid and "other" (not yet validated) recordings, split into development, training and test datasets. There are separate datasets for each language. (There's a sketch of what a release looks like on disk below.)

There is an additional dataset, which consists of 14 single-word sentences (the ten digits, "yes", "no", "hey", "Firefox") and is published separately. Steps 2-4 look similar for it, though.
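To make the published output a bit more concrete: as I understand it, each language in a corpus release ships as a folder of audio clips plus TSV files reflecting the validation states above (validated, invalidated, other) and the train/dev/test splits. A sketch of inspecting one language, assuming that layout and the column names I've seen in recent releases:

```python
import pandas as pd

# Inspect one language from a Common Voice corpus release. The file layout
# and column names are based on recent releases and may vary between versions.
lang_dir = "cv-corpus/cy"  # hypothetical path to the Welsh subset

validated = pd.read_csv(f"{lang_dir}/validated.tsv", sep="\t")

# Each row is one clip: a path into clips/, the sentence that was read, and
# the up/down votes from the validation step described above.
print(len(validated), "validated clips")
print(validated[["path", "sentence", "up_votes", "down_votes"]].head())
```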

Some observations

What should be clear is that there are multiple stages, each with their own thresholds for success.

To get a language into the project you need to translate around 600 text fragments from the application and compile a corpus of at least 5,000 sentences before the real work of collecting the voice dataset can begin.

That work requires input from multiple, potentially overlapping communities:

  • the community of translators, working through Pontoon
  • the community of writers, authors, content creators creating public domain content that can be reused in the service
  • the common voice contributors submitting new additional sentences
  • the contributors recording their voice
  • the contributors validating other recordings
  • the teams at Mozilla, coordinating and supporting all of the above

As the Common Voice application and its configuration are open source, it is easy to include it in Pontoon and allow others to contribute to its localisation. To build representative datasets, your tools need to work for all the communities that will be using them.

The availability of public domain text in the source languages is clearly a contributing factor in getting a language added to the site and ultimately included in the dataset.

So the adoption of open licences and the richness of the commons in those languages will be a factor in determining how rich the voice dataset might be for that language. And, hence, how easy it is to create good voice and text applications that can support those communities.

You can clearly create a new dedicated corpus, as people have done for Hakha Chin. But the strength and openness of one area of the commons will impact other areas. It’s all linked.

While there are different communities involved in Common Voice, it's clear from these reports from communities working on Hakha Chin and Welsh that, in some cases, it's the same community that is working across the whole process.

Every language community is working to address its own needs: “We’re not dependent on anyone else to make this happen…We just have to do it“.

That’s the essence of shared infrastructure. A common resource that supports a mixture of uses and communities.

The decision about what licences to use is, as ever, really important. At present Common Voice only takes a few sentences from individual pages of the larger Wikipedia instances. As I understand it this is because Wikipedia content is not public domain, so cannot be used wholesale. But small extracts should be covered by fair use?

I would expect that those interested in building and maintaining their language specific instances of wikipedia have overlaps with those interested in making voice applications work in that same language. Incompatible licensing can limit the ability to build on existing work.

Regardless, Mozilla and the Wikimedia Foundation have made licensing choices that reflect the needs of their communities and the goals of their projects. That's an important part of building trust. But, as ever, those licensing choices have subtle impacts across the wider ecosystem.

Reflecting on 2020

It’s been a year, eh?

I didn’t have a lot of plans for 2020. And those that I did have were pretty simple. That was deliberate as I tend to beat myself up for not achieving everything. But it turned out to be for the best anyway.

I’m not really expecting 2021 to be any different to be honest. But I’ll write another post about plans for the next 12 months.

For now, I just want to write a diary entry with a few reflections and notes on the year. Largely because I want to capture some of the positives and lessons learned.

Working

Coming in to 2020 I’d decided it was probably time for a change. I’ve loved working at the Open Data Institute, but the effort and expense of long-term distance commuting was starting to get a bit much.

Three trips a week, with increasingly longer days, on top of mounting work pressures was affecting my mood and my health. I was on a health kick in Q1 which helped, but The Lockdown really demonstrated for me how much that commute was wiping me out. And reminded me of what work-life balance looked like.

I was also ready to do something different. I feel like my career has had regular cycles where I’ve been doing research and consultancy, interleaved with periods of building and delivering things. I decided it was time to get back to the latter.

So, after helping the team weather the shit-storm of COVID-19, I resigned. Next year is going to look a lot different one way or another!

I’ve already written some thoughts on things I’ve enjoyed working on.

Running

I set out to lose some weight and be healthier at the start of the year. And I succeeded in doing that. I’ve been feeling so much better because of that.

I also took up running. I did a kind of self-directed Couch to 5K. I read up on the system and gradually increased distance and periods of running over time as recommended. I’ve previously tried using an app but without much success. I also prefer running without headphones on.

The hardest part has been learning to breathe properly. I suffer from allergic asthma. It used to be really bad when I was a kid. Like, not being able to properly walk up stairs bad. And not being allowed out during school play times bad.

It’s gotten way better and rarely kicks in now unless the pollen is particularly bad. But I still get this rising panic when I’m badly out of breath. I’ve mostly dealt with it now and found that running early in the mornings avoids issues with pollen.

While it's never easy, it turns out running can actually be enjoyable. As someone who closely identifies with the sloth, this is a revelation.

It's also helped to work out nervous tension and stress during the year. So it's great to have found a new way to handle that.

Listening

My other loose goal for 2020 was to listen to more music. I'd fallen into the habit of only listening to music while commuting, working or cooking. While that was plenty of opportunity, I felt like I was in a rut, listening to the same mixes and playlists because they helped me tune out others and concentrate on writing.

I did several things to achieve that goal. I started regularly listening to my Spotify Discover and Release Radar playlists. And dug into the back catalogues from the artists I found there.

I listened to more radio to break out of my recommendation bubble and used the #music channel on the ODI slack to do the same. I also started following some labels on YouTube and via weekly playlists on Spotify.

While I've griped about the BBC Sounds app, and while it's still flaky, I have to admit it's really changed how I engage with the BBC's radio output. The linking from track lists to Spotify is one of the killer features for me.

Building in listening to the BBC Unclassified show with Elizabeth Alker, on my Saturday mornings, has been one of the best decisions I’ve made this year.

Another great decision was to keep a dedicated playlist of "tracks that I loved on first listen, which were released in 2020". It's helped me be intentional about recording music that I like, so I can dig deeper. Here's a link to the playlist which has 247 tracks on it.

According to my year in review, Spotify tells me I listened to 630 new artists this year, across 219 new genres. We all know Spotify genres are kind of bullshit, but I’m pleased with that artist count.

Cooking

I generally cook on a Saturday night. I try out a new recipe. We drink cocktails and listen to Craig Charles Funk and Soul Show.

I’ve been tweeting what I’ve been cooking this year to keep a record of what I made. And I bookmark recipes here.

I was most proud of the burger buns, bao buns and gyoza.

We also started a routine of Wednesday Stir Fries, where I cooked whilst Debs was taking Martha to her ice-skating lesson. Like all routines this year that fell away in April.

But, I’ve added Doubanjiang (fermented broad bean chilli paste) to my list of favourite ingredients. Really quick and tasty base for a quick stir fry with a bit of garlic, ginger and whatever veg is to hand.

Gardening

I’ve already published a blog post with my Gardening Retro for 2020.

Reading

Like last year I wanted to read more again this year. As always I’ve been tweeting what I’ve read. I do this for the same reason I tweet things that I cook: it helps me track what I’ve been doing. But it also sometimes prompts interesting chats and other recommendations. Which is why I use social media after all.

I’ve fallen into a good pattern of having one fiction book, one non-fiction book and one graphic novel on the go at any one time. This gives me a choice of things to dip into based on how much time, energy and focus I have. That’s been useful this year.

I’ve read fewer papers and articles (I track those here). This is in large part because my habit was to do this during my commute. But again, that routine has fallen away.

If I'm honest it's also because I've not really felt like it this year. I've read what I needed to, but have otherwise retreated into comfort reading.

The other thing I’ve been doing this year is actively muting words, phrases and hashtags on twitter. It helps me manage what I’m seeing and reading, even if I can’t kick the scrolling habit. I feel vaguely guilty about that. But how else to manage the fire hose of other people’s thoughts, attentions and fears?

Here are some picks. These weren't all published this year. It's just when I consumed them:

Comics

I also read the entire run of Locke and Key, finished up the Alan Moore Swamp Thing collected editions and started in on Monstress. All great.

I also read a lot of Black Panther singles this year. Around 100-150 I think. Which led to my second most popular tweet this year (40,481 impressions).

Non-fiction

Fiction

I enjoyed but was disappointed by William Gibson’s Agency. Felt like half a novel.

Writing

I started a monthly blogging thread this year. I did that for two reasons. The first was to track what I was writing. I wanted to write more this year and to write differently.

The second was as another low key way to promote posts so that they might find readers. I mostly write for myself, but it's good to know that things get read. Again, prompting discussions is why I do this in the open rather than in a diary.

In the end I’ve written more this year than last. Which is good. Not writing at all some months was also fine.

I managed to write a bit of fiction and a few silly posts among the thousand word opinion pieces on obscure data topics. My plan to write more summaries of research papers failed, because I wasn’t reading that many.

My post looking at the statistic about data scientists spending 80% of their time cleaning data was the most read of what I wrote this year (4,379 views). But my most read post of all time remains this one on derived data (25,499 views). I should do a better version.

The posts I’m most pleased with are the one about dataset recipes and the two pieces of speculative fiction.

I carry around stuff in my head, sometimes for weeks or months. Writing it down helps me not just organise those thoughts but also move on to other things. This too is a useful coping mechanism.

Coding

Didn’t really do any this year. All things considered, I’m fine with that. But this will change next year.

Gaming

This year has been about those games that I can quickly pick up and put down again.

I played, loved, but didn’t finish Death Stranding. I need to immerse myself in it and haven’t been in the mood. I dipped back into The Long Dark, which is fantastically well designed, but the survival elements were making me anxious. So I watch other people play it instead.

Things that have worked better: Darkest Dungeon. XCOM: Chimera Squad. Wilmot’s Warehouse. Townscaper. Ancient Enemy. I’ve also been replaying XCOM 2.

These all have relatively short game loops and mission structures that have made them easy to dip into when I’ve been in the mood. Chimera Squad is my game of the year, but Darkest Dungeon is now one of my favourite games ever.

There Is No Game made me laugh. And Townscaper prompted some creativity which I wrote about previously.

That whole exercise led to my most popular tweet this year (54,025 impressions). People like being creative. Nice to have been responsible for a tiny spark of fun this year.

This is the first year in ages when I’ve not ended up with a new big title that I’m excited to dip into. Tried and failed to get a PS5. Nothing else is really grabbing my interest. I only want a PS5 so I can play the Demon’s Souls remake.

Watching

For the most part I’ve watched all of the things everyone else seems to have watched.

Absolutely loved the Queen's Gambit. Enjoyed Soul, the Umbrella Academy and The Boys. Thought Kingdom was brilliant (I was late to that one) and #Alive was fun. Korea clearly knows how to make zombie movies and so I'm looking forward to Peninsula.

The Mandalorian was so great it's really astounding that no-one thought to make any kind of film or TV follow-up to the original Star Wars trilogies until now. Glad they finally did and managed to mostly avoid making it about the same characters.

But Star Trek: Discovery unfortunately seems to have lost its way. I love the diverse characters and the new setting has so much potential. The plot is just chaotic though. His Dark Materials seems to be a weekly episode of exposition. Yawn.

If I'm being honest though, then my top picks for 2020 are the things I've been able to relax into for hours at a time:

  • The Finnish guy streaming strategy and survival games like The Long Dark and XCom 2
  • The Dutch guy playing classic and community designed Doom 2 levels
  • And the guy doing traditional Japanese woodblock carvings

I’m only slightly exaggerating to say these were the only things I watched in that difficult March-May period.

Everything else

I could write loads more about 2020 and what it was like. But I won’t. I’ve felt all of the things. Had all of the fears, experienced all of the anger, disbelief and loss.

The lesson is to keep moving forward. And to turn to books, music, games, walking, running, cooking to help keep us sane.

A short list of some of the things I’ve worked on which I’ve particularly enjoyed

Part of planning for whatever comes next for me in my career involved reflecting on the things I've enjoyed doing. I'm pleased to say that there's quite a lot.

I thought I’d write some of them down to help me gather my thoughts around what I’d like to do more of in the future. And, well, it never hurts to share your experience when you’re looking for work. Right?

The list below focuses on projects and activities which I’ve contributed to or had a hand in leading.

There’s obviously more to a career and work than that. For example, I’ve enjoyed building a team and supporting them in their work and development. I’ve enjoyed pitching for and winning work and funding.

I’ve also very much enjoyed working with a talented group of people who have brought a whole range of different skills and experiences to projects we’ve collaborated on together. But this post isn’t about those things.

Some of the things I’ve enjoyed working on at the ODI

  • Writing this paper on the value of open identifiers, which was co-authored with a team at Thomson Reuters. It was a great opportunity to distil a number of insights around the benefits of open, linked data. I think the recommendations stand up well. It's a topic I keep coming back to.
  • Developing the open data maturity model and supporting tool. The model was used by Defra to assess all its arms-length bodies during their big push to release open data. It was adopted by a number of government agencies in Australia, and helped to structure a number of projects that the ODI delivered to some big private sector organisations. Today we'd scope the model around data in general, not just open data. And it needs a stronger emphasis on diversity, inclusion, equity and ethics. But I think the framework is still sound.
  • Working with the Met Office on a paper looking at the state of weather data infrastructure. This turned into a whole series of papers looking at different sectors. I particularly enjoyed this first one as it was a nice opportunity to look at data infrastructure through a number of different lenses in an area that was relatively new to me. The insight that an economic downturn in Russia led to issues with US agriculture because of data gaps in weather forecasting might be my favourite example of how everything is intertwingled. I later used what I learned in that paper to write this primer on data infrastructure.
  • Leading research and development of the open standards for data guidebook. Standards being another of my favourite topics, it was great to have space to explore this area in more detail. And I got to work with Edafe which was ace.
  • Leading development of the OpenActive standards. Standards development is tiring work. But I’m pleased with the overall direction that we took and what we’ve achieved. I learned a lot. And I had the chance to iterate on what we were doing based on what we learned from developing the standards guidebook, before handing it over to others to lead. I’m pleased that we were able to align the standards with Schema.org and SKOS. I’m less pleased that it resulted in lots of video of me on YouTube leading discussions in the open.
  • Developing a methodology for doing data ecosystem mapping. The ODI now has a whole tool and methodology for mapping data ecosystems. It’s used in a lot of projects. While I wouldn’t claim to have invented the idea of doing this type of exercise, the ODI’s approach directly builds on the session I ran at Open Data Camp #4. I plan to continue to work on this as there’s much more to explore.
  • Leading development of the collaborative maintenance guidebook. Patterns provide a great way to synthesise and share insight. So it was fantastic to be able to apply that approach to capturing some of the lessons learned from projects like OpenStreetMap and Wikidata. There's a lot that can be applied in this guidebook to help shape many different data projects and platforms. The future of data management is more, not less collaborative.
  • Researching the sustainable institutions report. One of the reasons I (re-)joined the ODI about 4 years ago was to work on data institutions. Although we weren’t using that label at that point. I wanted to help to set up organisations like CrossRef, OpenStreetMap and others that are managing data for a community. So it was great to be involved in this background research. I still want to do that type of work, but want to be working in that type of organisation, rather than advising them.

There’s a whole bunch of other things I did during my time at the ODI.

For example, I've designed and delivered a training course on API design, evaluated a number of open data platforms, written code for a bunch of openly available tools, provided advice to a bunch of different organisations around the world, and written guidance that still regularly gets used and referenced by people. I get a warm glow from having done all those things.

Things I’ve enjoyed working on elsewhere

I’ve also done a bunch of stuff outside the ODI that I’ve also thoroughly enjoyed. For example:

  • I’ve helped to launch two new data-enabled products. Some years ago, I worked with the founders of Growkudos to design and build the first version of their platform, then helped them hire a technical team to take it forward. I also helped to launch EnergySparks, which is now used by schools around the country. I’m now a trustee of the charity.
  • I’ve worked with the ONS Digital team. After working on this prototype for Matt Jukes and co at the ODI, it was great to spend a few months freelancing with Andy Dudfield and the team working on their data principles and standards to put stats on the web. Publishing statistics is good, solid data infrastructure work.
  • Through Bath: Hacked, I've led a community mapping activity to map wheelchair accessibility in the centre of Bath. It was superb to have people from the local community, from all walks of life, contributing to the project. Not ashamed to admit that I had a little cry when I learned that one of the mappers hadn't been into the centre of Bath for years, because they'd felt excluded by their disability. But they were motivated to be part of the project. That single outcome made it all worthwhile for me.

What do I want to do more of in the future? I've spent quite a bit of the last few years doing research and advising people about how they might go about their projects. But it's time to get back into doing more hands-on practical work to deliver some data projects or initiatives. More doing, less advising.

So, I'm currently looking for work. If you're looking for a "Leigh shaped" person in your organisation, where "Leigh shaped" means "able to do the above kinds of things", then do get in touch.

The Saybox

I’ve been in a reflective mood over the past few weeks as I wrap up my time at the Open Data Institute. One of the little rituals I will miss is the “Saybox”. I thought I’d briefly write it up and explain why I like it.

I can’t remember who originally introduced the idea. It’s been around long enough that I think I was still only working part-time as an associate, so wasn’t always at every team event. But I have a suspicion it was Briony. Maybe someone can correct me on that? (Update: it was Briony 🙂 )

It’s also possible that the idea is well-known and documented elsewhere, but I couldn’t find a good reference. So again, if someone has a pointer, then let me know and I’ll update this post.

Originally, the Saybox was just a decorated shoebox. It had strong school craft project vibes. I'm sad that I can't find a picture of it.

The idea is that anyone in the team can drop an anonymous post-it into the box with a bit of appreciation for another member of the team, questions for the leadership team, a joke or a “did you know”. At our regular team meetings we open the box, pass it around and we all read out a few of the post-its.

I’ll admit that it took me a while to warm to the idea. But it didn’t take me long to be won over.

The Saybox became part of the team culture. A regular source of recognition for individual team members, warm welcomes for new hires and, at times, a safe way to surface difficult questions. The team have used it to troll each other whilst on holiday and it became a source of running gags. For a time, no Saybox session was complete without a reminder that Simon Bullmore ran a marathon.

As I took on leadership positions in the team, I came to appreciate it for other reasons. It was more than just a means of providing and encouraging feedback across the team. It became a source of prompts for where more clarity on plans or strategy was needed. And, in a very busy setting, it also helped to reinforce how delivery really is a team sport.

There's nothing like hearing an outpouring of appreciation for an individual or small team to be constantly reminded of the important role they play.

Like any aspect of team culture, the Saybox has evolved over time.

There’s a bit less trolling and fewer running gags now. But the appreciation is still strong.

The shoebox was also eventually replaced by a tidy wooden box. This was never quite the same for me. The shoebox had more of a scruffy, team-owned vibe about it.

As we’ve moved to remote working we’ve adapted the idea. We now use post-it notes on a Jamboard, and take turns reading them over the team zooms. Dr Dave likes to tick them off as we go, helping to orchestrate the reading.

The move to online unfortunately means there isn’t the same constant reminder to provide feedback, in the way that a physical box presents. You don’t just walk past a Jamboard on your way to or from a meeting. This means that the Saybox jamboard is now typically “filled” just before or during the team meetings, which can change the nature of the feedback it contains.

It’s obviously difficult to adapt team practices to virtual settings. But I’m glad the ODI has kept it going.

I’ll end this post with a brief confession. It might help reinforce why rituals like this are so important.

In a Saybox session, when we used to do them in person with actual paper, we handed the note over to whoever it was about. So sometimes you could leave a team meeting with one or more notes of appreciation from the team. That’s a lovely feeling.

I got into the habit of dropping them into my bag or sticking them into my notebook. As I tidied up my bag or had a clearout of old paperwork, I started collecting the notes into an envelope.

The other day I found that envelope in a drawer. As someone who is wired to always look for the negatives in any feedback, having these hand-written notes is lovely.

There’s nothing like reading unprompted bits of positive feedback, collected over about 5 years or so, to help you reflect on your strengths.

Thanks everyone.

A poem about standards

To help me wrap up my time at the ODI I asked the team for suggestions for things I could add to my list of handover documentation.

Amongst the suggestions that came back was: “Maybe also a poem about why standards are the best thing on Earth?”

So, with a nod to the meme, and apologies to William Carlos Williams, I wrote this:

I have tidied
the data
in your
spreadsheet

those numbers
you were
planning
to share

Forgive me
they were so messy
but now standard
and FAIR

Close enough I think 🙂