The Common Voice data ecosystem

In 2021 I’m planning to spend some more time exploring different data ecosystems with an emphasis on understanding the flows of data within and between different data initiatives, the tools they use to collect and share data, and the role of collaborative maintenance and open standards.

One project I’ve been looking at this week is Mozilla Common Voice. It’s an initiative that is producing a crowd-sourced, public domain dataset that can be used to train voice recognition applications. It’s the largest dataset of its type, consisting of over 7,000 hours of audio across 60 languages.

It’s a great example of communities working to create datasets that are more open and representative. Helping to address biases and supporting the creation of more equitable products and services. I’ve been using it in my recent talks on collaborative maintenance, but have had the chance to dig a bit deeper this week.

The main interface allows contributors to either record their voice, by reading short pre-prepared sentences, or validate existing contributions by listening to recordings and confirming that they match the script.

Behind the scenes is a more complicated process, which I found interesting.

It further highlights the importance of both open source tooling and openly licensed content in supporting the production of open data. It’s also another example of how choices around licensing can create friction between open projects.

The data pipeline

Essentially, the goal of the Common Voice project is to create new releases of its dataset. With each release including more languages and, for each language, more validated recordings.

The data pipeline that supports that consists of the following basic steps. (There may be other stages involved in the production of the output corpus, but I’ve not dug further into the code and docs.)

  1. Localisation. The Common Voice web application first has to be localised into the required language. This is coordinated via Mozilla Pontoon, with a community of contributors submitting translations licensed under the Mozilla Public License 2.0. Pontoon is open source and can be used for other non-Mozilla applications. When localisation reaches 95%, the language can be added to the website and the process can move to the next stage.
  2. Sentence Collection. Common Voice needs short sentences for people to read. These sentences need to be in the public domain (e.g. via a CC0 waiver). A minimum of 5,000 sentences are required before a language can be added to the website. The content comes from people submitting and validating sentences via the sentence collector tool. The text is also drawn from public domain sources. There’s a sentence extractor tool that can pull content from Wikipedia and other sources. For bulk imports, the Mozilla team needs to check for licence compatibility before adding text. All of this means that the source texts for each language are different.
  3. Voice Donation. Contributors read the provided sentences to add their voice to the dataset. The reading and validation steps are separate microtasks. Contributions are gamified and there are progress indicators for each language.
  4. Validation. Submitted recordings go through retrospective review to assess their quality. This provides some moderation: contributors can flag recordings that are offensive, incorrect or of poor quality. Validation tasks are also gamified. In general there are more submitted recordings than validations. Clips need to be reviewed by two separate users for them to be marked as valid (or invalid).
  5. Publication. The corpus consists of valid, invalid and “other” (not yet validated) recordings, split into development, training and test datasets. There are separate datasets for each language.

There is an additional dataset which consists of 14 single-word sentences (the ten digits, “yes”, “no”, “hey”, “Firefox”) which is published separately. Steps 2-4 look similar, though.
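
For a concrete sense of what the publication step produces, here’s a minimal sketch of inspecting one language from a downloaded release. It assumes the per-language layout I’ve seen documented for recent releases (separate TSV files for the validated, invalidated, other and train/dev/test splits, plus a folder of audio clips); the exact file and column names may vary between releases.

    # A minimal sketch, assuming each language folder in a Common Voice release
    # contains validated.tsv, invalidated.tsv, other.tsv and train/dev/test splits.
    # Column names such as up_votes/down_votes are assumptions from the docs.
    import pandas as pd
    from pathlib import Path

    language_dir = Path("cv-corpus/en")  # hypothetical local path to one language

    splits = {}
    for name in ["validated", "invalidated", "other", "train", "dev", "test"]:
        tsv = language_dir / f"{name}.tsv"
        if tsv.exists():
            splits[name] = pd.read_csv(tsv, sep="\t")
            print(f"{name}: {len(splits[name])} clips")

    # The two-review rule described above should show up in the vote counts:
    # validated clips ought to have at least two up votes.
    if "validated" in splits and {"up_votes", "down_votes"} <= set(splits["validated"].columns):
        print(splits["validated"][["sentence", "up_votes", "down_votes"]].head())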

Some observations

What should be clear is that there are multiple stages, each with their own thresholds for success.

To get a language into the project you need to translate around 600 text fragments from the application and compile a corpus of at least 5,000 sentences before the real work of collecting the voice dataset can begin.

That work requires input from multiple, potentially overlapping communities:

  • the community of translators, working through Pontoon
  • the community of writers, authors, content creators creating public domain content that can be reused in the service
  • the Common Voice contributors submitting additional sentences
  • the contributors recording their voice
  • the contributors validating other recordings
  • the teams at Mozilla, coordinating and supporting all of the above

As the Common Voice application and its configuration are open source, it is easy to include it in Pontoon to allow others to contribute to its localisation. To build representative datasets, your tools need to work for all the communities that will be using them.

The availability of public domain text in the source languages is clearly a contributing factor in getting a language added to the site and ultimately included in the dataset.

So the adoption of open licences and the richness of the commons in those languages will be a factor in determining how rich the voice dataset might be for that language. And, hence, how easy it is to create good voice and text applications that can support those communities.

You can clearly create a new dedicated corpus, as people have done for Hakha Chin. But the strength and openness of one area of the commons will impact other areas. It’s all linked.

While there are different communities involved in Common Voice, it’s clear from these reports from communities working on Hakha Chin and Welsh that, in some cases, it’s the same community that is working across the whole process.

Every language community is working to address its own needs: “We’re not dependent on anyone else to make this happen… We just have to do it”.

That’s the essence of shared infrastructure. A common resource that supports a mixture of uses and communities.

The decisions about what licences to use are, as ever, really important. At present Common Voice only takes a few sentences from individual pages of the larger Wikipedia instances. As I understand it, this is because Wikipedia content is not public domain, so cannot be used wholesale. But small extracts should be covered by fair use?

I would expect that those interested in building and maintaining their language-specific instances of Wikipedia have overlaps with those interested in making voice applications work in that same language. Incompatible licensing can limit the ability to build on existing work.

Regardless, Mozilla and the Wikimedia Foundation have made licensing choices that reflect the needs of their communities and the goals of their projects. That’s an important part of building trust. But, as ever, those licensing choices have subtle impacts across the wider ecosystem.

Reflecting on 2020

It’s been a year, eh?

I didn’t have a lot of plans for 2020. And those that I did have were pretty simple. That was deliberate as I tend to beat myself up for not achieving everything. But it turned out to be for the best anyway.

I’m not really expecting 2021 to be any different to be honest. But I’ll write another post about plans for the next 12 months.

For now, I just want to write a diary entry with a few reflections and notes on the year. Largely because I want to capture some of the positives and lessons learned.

Working

Coming into 2020 I’d decided it was probably time for a change. I’ve loved working at the Open Data Institute, but the effort and expense of long-term distance commuting was starting to get a bit much.

Three trips a week, with increasingly long days, on top of mounting work pressures were affecting my mood and my health. I was on a health kick in Q1 which helped, but The Lockdown really demonstrated for me how much that commute was wiping me out. And reminded me of what work-life balance looked like.

I was also ready to do something different. I feel like my career has had regular cycles where I’ve been doing research and consultancy, interleaved with periods of building and delivering things. I decided it was time to get back to the latter.

So, after helping the team weather the shit-storm of COVID-19, I resigned. Next year is going to look a lot different one way or another!

I’ve already written some thoughts on things I’ve enjoyed working on.

Running

I set out to lose some weight and be healthier at the start of the year. And I succeeded in doing that. I’ve been feeling so much better because of that.

I also took up running. I did a kind of self-directed Couch to 5K. I read up on the system and gradually increased distance and periods of running over time as recommended. I’ve previously tried using an app but without much success. I also prefer running without headphones on.

The hardest part has been learning to breathe properly. I suffer from allergic asthma. It used to be really bad when I was a kid. Like, not being able to properly walk up stairs bad. And not being allowed out during school play times bad.

It’s gotten way better and rarely kicks in now unless the pollen is particularly bad. But I still get this rising panic when I’m badly out of breath. I’ve mostly dealt with it now and found that running early in the mornings avoids issues with pollen.

While it’s never easy, it turns out running can actually be enjoyable. As someone who closely identifies with the sloth, this is a revelation.

It’s also helped to work out nervous tension and stress during the year. So it’s great to have found a new way to handle that.

Listening

My other loose goal for 2020 was to listen to more music. I’d fallen into the habit of only listening to music while commuting, working or cooking. While that was plenty of opportunity, I felt like I was in a rut. Listening to the same mixes and playlists as they helped me tune out others and concentrate on writing.

I did several things to achieve that goal. I started regularly listening to my Spotify Discover and Release Radar playlists. And dug into the back catalogues from the artists I found there.

I listened to more radio to break out of my recommendation bubble and used the #music channel on the ODI slack to do the same. I also started following some labels on YouTube and via weekly playlists on Spotify.

While I’ve griped about the BBC Sounds app, and while it’s still flaky, I have to admit it’s really changed how I engage with the BBC’s radio output. The linking from track lists to Spotify is one of the killer features for me.

Building in listening to the BBC Unclassified show with Elizabeth Alker, on my Saturday mornings, has been one of the best decisions I’ve made this year.

Another great decision was to keep a dedicated playlist of “tracks that I loved on first listen, which were released in 2020”. It’s helped me be intentional about recording music that I like, so I can dig deeper. Here’s a link to the playlist which has 247 tracks on it.

According to my year in review, Spotify tells me I listened to 630 new artists this year, across 219 new genres. We all know Spotify genres are kind of bullshit, but I’m pleased with that artist count.

Cooking

I generally cook on a Saturday night. I try out a new recipe. We drink cocktails and listen to Craig Charles Funk and Soul Show.

I’ve been tweeting what I’ve been cooking this year to keep a record of what I made. And I bookmark recipes here.

I was most proud of the burger buns, bao buns and gyoza.

We also started a routine of Wednesday Stir Fries, where I cooked whilst Debs was taking Martha to her ice-skating lesson. Like all routines this year that fell away in April.

But, I’ve added Doubanjiang (fermented broad bean chilli paste) to my list of favourite ingredients. It’s a really quick and tasty base for a stir fry with a bit of garlic, ginger and whatever veg is to hand.

Gardening

I’ve already published a blog post with my Gardening Retro for 2020.

Reading

Like last year I wanted to read more again this year. As always I’ve been tweeting what I’ve read. I do this for the same reason I tweet things that I cook: it helps me track what I’ve been doing. But it also sometimes prompts interesting chats and other recommendations. Which is why I use social media after all.

I’ve fallen into a good pattern of having one fiction book, one non-fiction book and one graphic novel on the go at any one time. This gives me a choice of things to dip into based on how much time, energy and focus I have. That’s been useful this year.

I’ve read fewer papers and articles (I track those here). This is in large part because my habit was to do this during my commute. But again, that routine has fallen away.

If I’m honest it’s also because I’ve not really felt like it this year. I’ve read what I needed to, but have otherwise retreated into comfort reading.

The other thing I’ve been doing this year is actively muting words, phrases and hashtags on twitter. It helps me manage what I’m seeing and reading, even if I can’t kick the scrolling habit. I feel vaguely guilty about that. But how else to manage the fire hose of other people’s thoughts, attentions and fears?

Here are some picks. These weren’t all published this year. It’s just when I consumed them:

Comics

I also read the entire run of Locke and Key, finished up the Alan Moore Swamp Thing collected editions and started in on Monstress. All great.

I also read a lot of Black Panther singles this year. Around 100-150 I think. Which led to my second most popular tweet this year (40,481 impressions).

Non-fiction

Fiction

I enjoyed but was disappointed by William Gibson’s Agency. Felt like half a novel.

Writing

I started a monthly blogging thread this year. I did that for two reasons. The first was to track what I was writing. I wanted to write more this year and to write differently.

The second was as another low-key way to promote posts so that they might find readers. I mostly write for myself, but it’s good to know that things get read. Again, prompting discussions is why I do this in the open rather than in a diary.

In the end I’ve written more this year than last. Which is good. Not writing at all some months was also fine.

I managed to write a bit of fiction and a few silly posts among the thousand word opinion pieces on obscure data topics. My plan to write more summaries of research papers failed, because I wasn’t reading that many.

My post looking at the statistic about data scientists spending 80% of their time cleaning data was the most read of what I wrote this year (4,379 views). But my most read post of all time remains this one on derived data (25,499 views). I should do a better version.

The posts I’m most pleased with are the one about dataset recipes and the two pieces of speculative fiction.

I carry around stuff in my head, sometimes for weeks or months. Writing it down helps me not just organise those thoughts but also move on to other things. This too is a useful coping mechanism.

Coding

Didn’t really do any this year. All things considered, I’m fine with that. But this will change next year.

Gaming

This year has been about those games that I can quickly pick up and put down again.

I played, loved, but didn’t finish Death Stranding. I need to immerse myself in it and haven’t been in the mood. I dipped back into The Long Dark, which is fantastically well designed, but the survival elements were making me anxious. So I watch other people play it instead.

Things that have worked better: Darkest Dungeon. XCOM: Chimera Squad. Wilmot’s Warehouse. Townscaper. Ancient Enemy. I’ve also been replaying XCOM 2.

These all have relatively short game loops and mission structures that have made them easy to dip into when I’ve been in the mood. Chimera Squad is my game of the year, but Darkest Dungeon is now one of my favourite games ever.

There Is No Game made me laugh. And Townscaper prompted some creativity which I wrote about previously.

That whole exercise led to my most popular tweet this year (54,025 impressions). People like being creative. Nice to have been responsible for a tiny spark of fun this year.

This is the first year in ages when I’ve not ended up with a new big title that I’m excited to dip into. Tried and failed to get a PS5. Nothing else is really grabbing my interest. I only want a PS5 so I can play the Demon’s Souls remake.

Watching

For the most part I’ve watched all of the things everyone else seems to have watched.

Absolutely loved The Queen’s Gambit. Enjoyed Soul, The Umbrella Academy and The Boys. Thought Kingdom was brilliant (I was late to that one) and #Alive was fun. Korea clearly knows how to make zombie movies and so I’m looking forward to Peninsula.

The Mandalorian was so great it’s really astounding that no-one thought to make any kind of film or TV follow-up to the original Star Wars trilogies until now. Glad they finally did and managed to mostly avoid making it about the same characters.

But Star Trek: Discovery unfortunately seems to have lost its way. I love the diverse characters and the new setting has so much potential. The plot is just chaotic though. His Dark Materials seems to be a weekly episode of exposition. Yawn.

If I’m being honest though, then my top picks for 2020 are the things I’ve been able to relax into for hours at a time:

  • The Finnish guy streaming strategy and survival games like The Long Dark and XCOM 2
  • The Dutch guy playing classic and community designed Doom 2 levels
  • And the guy doing traditional Japanese woodblock carvings

I’m only slightly exaggerating to say these were the only things I watched in that difficult March-May period.

Everything else

I could write loads more about 2020 and what it was like. But I won’t. I’ve felt all of the things. Had all of the fears, experienced all of the anger, disbelief and loss.

The lesson is to keep moving forward. And to turn to books, music, games, walking, running, cooking to help keep us sane.

A short list of some of the things I’ve worked on which I’ve particularly enjoyed

Part of planning for whatever comes next for me in my career involved reflecting on the things I’ve enjoyed doing. I’m pleased to say that there’s quite a lot.

I thought I’d write some of them down to help me gather my thoughts around what I’d like to do more of in the future. And, well, it never hurts to share your experience when you’re looking for work. Right?

The list below focuses on projects and activities which I’ve contributed to or had a hand in leading.

There’s obviously more to a career and work than that. For example, I’ve enjoyed building a team and supporting them in their work and development. I’ve enjoyed pitching for and winning work and funding.

I’ve also very much enjoyed working with a talented group of people who have brought a whole range of different skills and experiences to projects we’ve collaborated on together. But this post isn’t about those things.

Some of the things I’ve enjoyed working on at the ODI

  • Writing this paper on the value of open identifiers, which was co-authored with a team at Thomson Reuters. It was a great opportunity to distil a number of insights around the benefits of open, linked data. I think the recommendations stand up well. It’s a topic I keep coming back to.
  • Developing the open data maturity model and supporting tool. The model was used by Defra to assess all its arms-length bodies during their big push to release open data. It was adopted by a number of government agencies in Australia, and helped to structure a number of projects that the ODI delivered to some big private sector organisations. Today we’d scope the model around data in general, not just open data. And it needs a stronger emphasis on diversity, inclusion, equity and ethics. But I think the framework is still sound.
  • Working with the Met Office on a paper looking at the state of weather data infrastructure. This turned into a whole series of papers looking at different sectors. I particularly enjoyed this first one as it was a nice opportunity to look at data infrastructure through a number of different lenses in an area that was relatively new to me. The insight that an economic downturn in Russia led to issues with US agriculture because of data gaps in weather forecasting might be my favourite example of how everything is intertwingled. I later used what I learned in that paper to write this primer on data infrastructure.
  • Leading research and development of the open standards for data guidebook. Standards being another of my favourite topics, it was great to have space to explore this area in more detail. And I got to work with Edafe which was ace.
  • Leading development of the OpenActive standards. Standards development is tiring work. But I’m pleased with the overall direction that we took and what we’ve achieved. I learned a lot. And I had the chance to iterate on what we were doing based on what we learned from developing the standards guidebook, before handing it over to others to lead. I’m pleased that we were able to align the standards with Schema.org and SKOS. I’m less pleased that it resulted in lots of video of me on YouTube leading discussions in the open.
  • Developing a methodology for doing data ecosystem mapping. The ODI now has a whole tool and methodology for mapping data ecosystems. It’s used in a lot of projects. While I wouldn’t claim to have invented the idea of doing this type of exercise, the ODI’s approach directly builds on the session I ran at Open Data Camp #4. I plan to continue to work on this as there’s much more to explore.
  • Leading development of the collaborative maintenance guidebook. Patterns provide a great way to synthesise and share insight. So it was fantastic to be able to apply that approach to capturing some of the lessons learned from projects like OpenStreetMap and Wikidata. There’s a lot that can be applied in this guidebook to help shape many different data projects and platforms. The future of data management is more, not less, collaborative.
  • Researching the sustainable institutions report. One of the reasons I (re-)joined the ODI about 4 years ago was to work on data institutions. Although we weren’t using that label at that point. I wanted to help to set up organisations like CrossRef, OpenStreetMap and others that are managing data for a community. So it was great to be involved in this background research. I still want to do that type of work, but want to be working in that type of organisation, rather than advising them.

There’s a whole bunch of other things I did during my time at the ODI.

For example, I’ve designed and delivered a training course on API design, evaluated a number of open data platforms, written code for a bunch of openly available tools, provided advice to a bunch of different organisations around the world, and written guidance that still regularly gets used and referenced by people. I get a warm glow from having done all those things.

Things I’ve enjoyed working on elsewhere

I’ve also done a bunch of stuff outside the ODI that I’ve thoroughly enjoyed. For example:

  • I’ve helped to launch two new data-enabled products. Some years ago, I worked with the founders of Growkudos to design and build the first version of their platform, then helped them hire a technical team to take it forward. I also helped to launch EnergySparks, which is now used by schools around the country. I’m now a trustee of the charity.
  • I’ve worked with the ONS Digital team. After working on this prototype for Matt Jukes and co at the ODI, it was great to spend a few months freelancing with Andy Dudfield and the team working on their data principles and standards to put stats on the web. Publishing statistics is good, solid data infrastructure work.
  • Through Bath: Hacked, I’ve led a community mapping activity to map wheelchair accessibility in the centre of Bath. It was superb to have people from the local community, from all walks of life, contributing to the project. Not ashamed to admit that I had a little cry when I learned that one of the mappers hadn’t been into the centre of Bath for years, because they’d felt excluded by their disability. But was motivated to be part of the project. That single outcome made it all worthwhile for me.

What do I want to do more of in the future? I’ve spent quite a bit of the last few years doing research and advising people about how they might go about their projects. But it’s time to get back into doing more hands-on practical work to deliver some data projects or initiatives. More doing, less advising.

So, I’m currently looking for work. If you’re looking for a “Leigh shaped” person in your organisation, where “Leigh shaped” means “able to do the above kinds of things”, then do get in touch.

The Saybox

I’ve been in a reflective mood over the past few weeks as I wrap up my time at the Open Data Institute. One of the little rituals I will miss is the “Saybox”. I thought I’d briefly write it up and explain why I like it.

I can’t remember who originally introduced the idea. It’s been around long enough that I think I was still only working part-time as an associate, so wasn’t always at every team event. But I have a suspicion it was Briony. Maybe someone can correct me on that? (Update: it was Briony 🙂)

It’s also possible that the idea is well-known and documented elsewhere, but I couldn’t find a good reference. So again, if someone has a pointer, then let me know and I’ll update this post.

Originally, the Saybox was just a decorated shoebox. It had strong school craft project vibes. I’m sad that I can’t find a picture of it.

The idea is that anyone in the team can drop an anonymous post-it into the box with a bit of appreciation for another member of the team, questions for the leadership team, a joke or a “did you know”. At our regular team meetings we open the box, pass it around and we all read out a few of the post-its.

I’ll admit that it took me a while to warm to the idea. But it didn’t take me long to be won over.

The Saybox has become part of the team culture. A regular source of recognition for individual team members, warm welcomes for new hires and, at times, a safe way to surface difficult questions. The team have used it to troll each other whilst on holiday and it became a source of running gags. For a time, no Saybox session was complete without a reminder that Simon Bullmore ran a marathon.

As I took on leadership positions in the team, I came to appreciate it for other reasons. It was more than just a means of providing and encouraging feedback across the team. It became a source of prompts for where more clarity on plans or strategy was needed. And, in a very busy setting, it also helped to reinforce how delivery really is a team sport.

There’s nothing like hearing an outpouring of appreciation for an individual or small team to remind you of the important role they play.

Like any aspect of team culture, the Saybox has evolved over time.

There’s a bit less trolling and fewer running gags now. But the appreciation is still strong.

The shoebox was also eventually replaced by a tidy wooden box. This was never quite the same for me. The shoebox had more of a scruffy, team-owned vibe about it.

As we’ve moved to remote working we’ve adapted the idea. We now use post-it notes on a Jamboard, and take turns reading them over the team zooms. Dr Dave likes to tick them off as we go, helping to orchestrate the reading.

The move to online unfortunately means there isn’t the same constant reminder to provide feedback that a physical box offers. You don’t just walk past a Jamboard on your way to or from a meeting. This means that the Saybox Jamboard is now typically “filled” just before or during the team meetings, which can change the nature of the feedback it contains.

It’s obviously difficult to adapt team practices to virtual settings. But I’m glad the ODI has kept it going.

I’ll end this post with a brief confession. It might help reinforce why rituals like this are so important.

In a Saybox session, when we used to do them in person with actual paper, we handed the note over to whoever it was about. So sometimes you could leave a team meeting with one or more notes of appreciation from the team. That’s a lovely feeling.

I got into the habit of dropping them into my bag or sticking them into my notebook. As I tidied up my bag or had a clearout of old paperwork, I started collecting the notes into an envelope.

The other day I found that envelope in a drawer. As someone who is wired to always look for the negatives in any feedback, having these hand-written notes is lovely.

There’s nothing like reading unprompted bits of positive feedback, collected over about 5 years or so, to help you reflect on your strengths.

Thanks everyone.

A poem about standards

To help me wrap up my time at the ODI I asked the team for suggestions for things I could add to my list of handover documentation.

Amongst the suggestions that came back was: “Maybe also a poem about why standards are the best thing on Earth?”

So, with a nod to the meme and apologies to William Carlos Williams, I wrote this:

I have tidied
the data
in your
spreadsheet

those numbers
you were
planning
to share

Forgive me
they were so messy
but now standard
and FAIR

Close enough I think 🙂

Brief review of revisions and corrections policies for official statistics

In my earlier post on the importance of tracking updates to datasets I noted that the UK Statistics Authority Code of Practice includes a requirement that publishers of official statistics must publish a policy that describes their approach to revisions and corrections.

See 3.9 in T3: Orderly Release, which states: “Scheduled revisions or unscheduled corrections to the statistics and data should be released as soon as practicable. The changes should be handled transparently in line with a published policy.”

The Code of Practice includes definitions of both Scheduled Revisions and Unscheduled Corrections.

Scheduled Revisions are defined as: “Planned amendments to published statistics in order to improve quality by incorporating additional data that were unavailable at the point of initial publication”.

Whereas Unscheduled Corrections are: “Amendments made to published statistics in response to the identification of errors following their initial publication”.

I decided to have a read through a bunch of policies to see what they include and how they compare.

Here are some observations based on a brief survey of this list of 15 different policies including those by the Office for National Statistics, the FSA, Gambling Commission, CQC, DfE, PHE, HESA and others.

Scope

The Code of Practice applies to official statistics. Some organisations publishing official statistics also publish other statistical datasets.

In some cases organisations have written policies that apply:

  • to all their statistical outputs, regardless of designation
  • only to those outputs that are official statistics
  • to specific datasets, via individual policies

There’s some variation in the amount of detail provided.

Some read as simple compliance documents, with basic statements of intent to follow the recommendations of the code of practice. They include, for example, a note that revisions and corrections will be handled transparently and in a timely way, along with general notes about how that will happen.

Others are more detailed, giving more insight into how the policy will actually be carried out in practice. From a data consumer perspective these feel a bit more useful as they often include timescales for reporting, lines of responsibility and notes about how changes are communicated.

Definitions

Some policies elaborate on the definitions in the code of practice, providing a bit more breakdown on the types of scheduled revisions and sources of error.

For example some policies indicate that changes to statistics may be driven by:

  • access to new or corrected source data
  • routine recalculations, as per methodologies, to establish baselines
  • improvements to methodologies
  • corrections to calculations

Some organisations publish provisional releases of these statistics. So their policies discuss Scheduled Revisions in this light: a dataset is published in one or more provisional releases before being finalised. During those updates the organisation may have been in receipt of new or updated data that impacts how the statistics are calculated. Or may fix errors.

Other organisations do not publish provisional statistics so their datasets do not have scheduled revisions.

A few policies include a classification of the severity of errors, along the lines of:

  • major errors that impact interpretation or reuse of data
  • minor errors in statistics, which may include anything that is not major
  • other minor errors or mistakes, e.g. typographical errors

These classifications are used to describe different approaches to handling the errors, appropriate to their severity.

Decision making

The policies frequently require decision making around how specific revisions and corrections might be handled, with implications for the investment of time and resources in handling and communicating them.

In some cases responsibility lies with a senior leader, e.g. a Head of Profession, or other senior analyst. In some cases decision making rests with the product owner with responsibility for the dataset.

Scheduled revisions

Scheduled changes are, by definition, planned in advance. So the policy sections relating to these revisions are typically brief and tend to focus on the release process.

In general, the policies align around:

  • having clear timetables for when revisions are to be expected
  • summarising key impacts, detail and extent of revisions in the next release of a publication and/or dataset
  • clear labelling of provisional, final and revised statistics

Several of the policies include methodological changes in their handling of scheduled revisions. These explain that changes will be consulted on and clearly communicated in advance. In some cases historical data may be revised to align with the new methodology.

Corrections

Handling of corrections tends to be the largest section of each policy. These sections frequently highlight that, despite rigorous quality control, errors may creep in, either because of mistakes or because of corrections to upstream data sources.

There are different approaches to how quickly errors will be handled and fixed. In some cases this depends on the severity of errors. But in others the process is based on publication schedules or organisational preference.

For example, in one case (SEPA), there is a stated preference to handle publication of unscheduled corrections once a year. In other policies corrections will be applied at the next planned (“orderly”) release of the dataset.

Impact assessments

Several policies note that there will be an impact assessment undertaken to fully understand an error before any changes are made.

These assessments include questions like:

  • does the error impact a headline figure or statistic?
  • is the error within previously reported margins of accuracy or certainty?
  • who will be impacted by the change?
  • what are the consequences of the change, e.g. does it impact the main insights from the previously published statistics or how they might be used?

Severity of errors

Major errors tend to get some special treatment. Corrections to these errors are typically made more rapidly. But there are few commitments to timeliness of publishing corrections. “As soon as possible” is a typical statement.

The two exceptions I noted are the MOD policy which notes that minor errors will be corrected within 12 months, and the CQC policy which commits to publishing corrections within 20 days of an agreement to do so. (Others may include commitments that I’ve missed.)

A couple of policies highlight that an error may be identified some time before a fix is ready. In these cases, the existence of the error will still be reported.

The Welsh Revenue Authority policy was the only one that noted a dataset might even be retracted from publication while an error was fixed.

A couple of policies noted that minor errors that did not impact interpretation may not be fixed at all. For example one ONS policy notes that errors within the original bounds of uncertainty in the statistics may not be corrected.

Minor typographic errors might just be directly fixed on websites without recording or reporting of changes.

Marking

There seems to be general consensus on the use of “p” for provisional and “r” for revised figures in statistics. Interestingly, in the Welsh Revenue Authority policy they note that while there is an accepted Welsh translation for “provisional” and “revised”, the marker symbols remain untranslated.

Some policies clarify that these markers may be applied at several levels, e.g. to individual cells as well as rows and columns in a table.

Only one policy noted a convention around adding “revised” to a dataset name.
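
As a purely hypothetical illustration (not drawn from any of the policies reviewed), markers typically end up attached to individual figures in a published table, something like this:

    Quarter    Claims processed
    2020 Q1    12,410 [r]    revised: corrected source data received
    2020 Q2    13,025
    2020 Q3    11,980 [p]    provisional: subject to scheduled revision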

Communications

As required by the code of practice, the policies align on providing some transparency around what has been changed and the reason for the changes. Where they differ is around how that will be communicated and how much detail is included in the policy.

In general, revisions and corrections will simply be explained in the next release of the dataset, or before if a major error is fixed. The goal being to provide users with a reason for the change, and the details of the impact on the statistics and data.

These explanations are handled by additional documentation to be included in publications, markers on individual statistics, etc. Revision logs and notices are common.

Significant changes to methodologies or major errors get special treatment, e.g. via notices on websites or announcements via Twitter.

Many of the policies also explain that known users or “key users” will be informed of significant revisions or corrections. Presumably this is via email or other communications.

One policy noted that the results of their impact assessment and decision making around how to handle a problem might be shared publicly.

Capturing lessons learned

A few of the policies included a commitment to carry out a review of how an error occurred in order to improve internal processes, procedures and methods. This process may be extended to include data providers where appropriate.

One policy noted that the results of this review and any planned changes might be published where it would be deemed to increase confidence in the data.

Wrapping up

I found this to be an interesting exercise. It isn’t a comprehensive review, but hopefully it provides a useful summary of approaches.

I’m going to resist the urge to write recommendations or thoughts on what might be added to these policies. Reading a policy doesn’t tell us how well it’s implemented, or whether users feel it is serving their needs.

I will admit to feeling a little surprised that there isn’t a more structured approach in many cases. For example, pointers to where I might find a list of recent revisions or how to sign up to get notified as an interested user of the data.

I had also expected some stronger commitments about how quickly fixes may be made. These can be difficult to make in a general policy, but are what you might expect from a data product or service.

These elements might be covered by other policies or regulations. If you know of any that are worth reviewing, then let me know.

 

The importance of tracking dataset retractions and updates

There are lots of recent examples of researchers collecting and releasing datasets which end up raising serious ethical and legal concerns. The IBM facial recognition dataset being just one example that springs to mind.

I read an interesting post exploring how facial recognition datasets are being widely used despite being taken down due to ethical concerns.

The post highlights how these datasets, despite being retracted, are still being widely used in research. This is in part because the original datasets are still circulating via mirrors of the original files. But also because they have been incorporated into derived datasets which are still being distributed with the original contents intact.

The authors describe how just one dataset, the DukeMTMC dataset, was used in more than 135 papers after being retracted, 116 of those drawing on derived datasets. Some datasets have many derivatives; one example cited has been used in 14 derived datasets.

The research raises important questions about how datasets are published, mirrored, used and licensed. There’s a lot to unpack there and I look forward to reading more about the research. The concerns around open licensing are reminiscent of similar debates in the open source community leading to a set of “ethical open source licences”.

But the issue I wanted to highlight here is the difficulty of tracking the mirroring and reuse of datasets.

Change notification is a missing piece of our data infrastructure.

If it were easier to monitor important changes to datasets, then it would be easier to:

  • maintain mirrors of data
  • retract or remove data that breached laws or social and ethical norms
  • update derived datasets to remove or amend data
  • re-run analyses against datasets which have seen significant corrections or revisions
  • assess the impacts of poor quality or unethically shared data
  • proactively notify relevant communities of potential impacts relating to published data
  • monitor and review the reasons why datasets get retracted
  • …etc, etc
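
To make the gap concrete, here’s a rough sketch of the kind of ad-hoc monitoring that data users are left to do today: periodically fetch a dataset and compare a content hash to spot that something has changed. The URL is hypothetical, and the point is what’s missing: a hash tells you that a dataset changed, but nothing about why, or whether it was a correction, revision or retraction.

    # A minimal sketch of ad-hoc dataset change detection: fetch the file,
    # hash it, and compare against the hash recorded on the previous run.
    # The dataset URL is hypothetical.
    import hashlib
    import json
    import urllib.request
    from pathlib import Path

    DATASET_URL = "https://example.org/some-dataset.csv"  # hypothetical
    STATE_FILE = Path("dataset_state.json")

    def fetch_hash(url: str) -> str:
        with urllib.request.urlopen(url) as response:
            return hashlib.sha256(response.read()).hexdigest()

    previous = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    latest = fetch_hash(DATASET_URL)

    if previous.get("sha256") != latest:
        # We know *that* it changed, but not *why*: no reason, no severity,
        # no retraction notice. That context is the missing infrastructure.
        print("Dataset has changed since the last check; review before reuse.")

    STATE_FILE.write_text(json.dumps({"sha256": latest}))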

The importance of these activities can be seen in other contexts.

For example, Retraction Watch is a project that monitors retractions of research papers. CrossMark helps to highlight major changes to published papers including corrections and retractions.

Principle T3 (Orderly Release) of the UK Statistics Authority code of practice explains that scheduled revisions and unscheduled corrections to statistics should be transparent, and that organisations should have a specific policy for how they are handled.

More broadly, product recalls and safety notices are standard for consumer goods. Maybe datasets should be treated similarly?

This feels like an area that warrants further research, investment and infrastructure. At some point we need to raise our sights from setting up even more portals and endlessly refining their feature sets and think more broadly about the system and ecosystem we are building.

Consulting Spreadsheet Detective, Season 1

I was very pleased to announce my new TV series today, loosely based on real events. More details here in the official press release.

FOR IMMEDIATE RELEASE

Coming to all major streaming services in 2021 will be the exciting new series: “Turning the Tables”.

Exploring the murky corporate world of poorly formatted spreadsheets and nefarious macros each episode of this new series will explore another unique mystery.

When the cells lie empty, who can help the CSV:PI team pivot their investigation?

When things don’t add up, who can you turn to but an experienced solver?

Who else but Leigh Dodds, Consulting Spreadsheet Detective?

This smart, exciting and funny new show throws deductive reasoner Dodds into the mix with Detectives Rose Cortana and Colm Bing, part of the crack new CSV:PI squad.

Rose: the gifted hacker. Quick to fire up an IDE, but slow to validate new friends.

Colm: the user researcher. Strong on empathy but with an enigmatic past that hints at time in the cells.

What can we expect from Season 1?

Episode 1: #VALUE!

In his first case, Dodds has to demonstrate his worth to a skeptical Rose and Colm, by fixing a corrupt formula in a startup valuation.

Episode 2: #NAME?

A personal data breach leaves the team in a race against time to protect the innocent. A mysterious informant known as VLOOKUP leaves Dodds a note.

Episode 3: #REF!

A light-hearted episode where Dodds is called in to resolve a mishap with a 5-a-side football team matchmaking spreadsheet. Does he stay between the lines?

Episode 4: #NUM!

A misparsed gene name leads a researcher into recommending the wrong vaccine. It’s up to Dodds to fix the formula.

Episode 5: #NULL!

Sometimes it’s not the spreadsheet that’s broken. Rose and Colm have to educate a researcher on the issue of data bias, while Dodds follows up references to the mysterious Macro corporation.

Episode 6: #DIV/0!

Chasing down an internationalisation issue Dodds, Rose and Colm race around the globe following a trail of error messages. As Dodds gets unexpectedly separated from the CSV:PI team, Rose and Colm unmask the hidden cell containing the mysterious VLOOKUP.

In addition to the six episodes in season one, a special feature length episode will air on National Spreadsheet Day 2021:

Feature Episode: #####

Colm’s past resurfaces. Can he grow enough to let the team see the problem, and help him validate his role in the team?

Having previously only anchored documentaries, like “Around the World with 80,000 Apps” and “Great Data Journeys”, taking on the eponymous role will be Dodds’ first foray into fiction. We’re sure he’ll have enough pizazz to wow even the harshest critics.

“Turning the Tables” will feature music composed by Dan Barrett.

Tip for improving standards documentation

I love a good standard. I’ve written about them a lot here.

As it’s #WorldStandardsDay I thought I’d write a quick post to share something that I’ve learned from leading and supporting some standards work.

I’ve already shared this with a number of people who have asked for advice on standards work, and in some recent user research interviews I’ve participated in. So it makes sense to write it down.

In the ODIHQ standards guide, we explained that at the end of your initial activity to develop a standard, you should plan to produce a range of outputs. This includes a variety of tools and guidance that help people use the standard. You will need much more than just a technical specification.

To plan for the different types of documentation that you may need I recommend applying this “Grand Unified Theory of Documentation”.

That framework highlights that four different types of documentation are intended to be used by different audiences to address different needs. The content designers and writers out there reading this will be rolling their eyes at this obvious insight.

Here’s how I’ve been trying to apply it to standards documentation:

Reference

This is your primary technical specification. It’ll have all the detail about the standard, the background concepts, the conformance criteria, etc.

It’s the document of record that captures all of the hard work you’ve invested in building consensus around the standard. It fills a valuable role as the document you can point back to when you need to clarify or confirm what was agreed.

But, unless it’s a very simple standard, it’s going to have a limited audience. A developer looking to implement a conformant tool, API or library may need to read and digest all of the detail. But most people want something else.

Put the effort into ensuring it’s clear, precise and well-structured. But plan to also produce three additional categories of documentation.

Explainers

Many people just want an overview of what it is designed to do. What value will it provide? What use cases was it designed to support? Why was it developed? Who is developing it?

These are higher-level introductory questions. The type of questions that business stakeholders want to answer to sign-off on implementing a standard, so it goes onto a product roadmap.

Explainers also provide useful background information for a developer ahead of taking a deeper dive. If there are some key concepts that are important to understanding the design and implementation of a standard, then write an explainer.

Tutorials

A simple, end-to-end description of how to apply the standard. E.g. how to publish a dataset that conforms to the standard, or export data from an existing system.

A tutorial will walk you through using a specific set of tools, frameworks or programming languages. The end result being a basic implementation of the standard. Or a simple dataset that passes some basic validation checks. A tutorial won’t cover all of the detail, it’s enough to get you started.

You may need several tutorials to support different types of users. Or different languages and frameworks.

If you’ve produced a tool, like a validator or a template spreadsheet to support data publication, you’ll probably need a tutorial for each of them unless they are very simple to use.

Tutorials are gold for a developer who has been told: “please implement this standard, but you only have 2 days to do it”.

How-Tos

Short, task oriented documentation focused on helping someone apply the standard. E.g. “How to produce a CSV file from Excel”, “Importing GeoJSON data in QGIS”, “Describing a bus stop”. Make them short and digestible.

How-Tos can help developers build from a tutorial, to a more complete implementation. Or help a non-technical user quickly apply a standard or benefit from it.

You’ll probably end up with lots of these over time. Drive creating them from the types of questions or support requests you’re getting. Been asked how to do something three times? Write a How-To.
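
Pulling the four categories together, here’s a hypothetical layout for a standard’s documentation. The file and folder names are purely illustrative, not drawn from any particular project; the point is that each category gets its own home and can grow independently.

    docs/
      reference/
        specification.md                # the normative spec: concepts, conformance criteria
      explainers/
        why-this-standard.md            # value, use cases, who is behind it
        key-concepts.md
      tutorials/
        publish-your-first-dataset.md   # end-to-end walkthrough using one toolchain
      how-to/
        produce-a-csv-from-excel.md     # short, task-oriented answers to common questions
        validate-a-file.md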

There’s lots more that can be said about standards documentation. For example you could add Case Studies to this list. And it’s important to think about whether written documentation is the right format. Maybe your Explainers and How-Tos can be videos?

But I’ve found the framework to be a useful planning tool. Have a look at the documentation for more tips.

Producing extra documentation to support the launch of a standard, and then investing in improving and expanding it over time will always be time well-spent.

A letter from the future about numbers

It’s odd now, looking at early 21st century content in the Internet Archive. So little nuance.

It feels a little like watching those old black and white movies. All that colour which was just right there. But now lost. Easy to imagine that life was just monochrome. Harder to imagine the richer colours.

Or at least hard for me. There are AIs that will imagine it all for you now, of course. There have been for a while. They’ll repaint the pictures using data they’ve gleaned from elsewhere. But it’s not the film that is difficult to look at. It’s the numbers.

How did you manage with just those bare numerals?

If I showed you, a 21st century reader, one of our numbers you wouldn’t know what it was. You wouldn’t be able to read it.

Maybe you’ve seen that film Arrival? Based on a book by Ted Chiang? Remember the alien writing that was so complex and rich in meaning? That’s what our numbers might look like to you. You’d struggle to decode them.

Oh, the rest of it is much the same. The text, emojis and memes. Everything is just that bit richer, more visual. More nuanced. It’s even taught in schools now. Standardised, tested and interpreted for all. It’d be familiar enough.

You struggle with the numbers though. They’d take much more time to learn.

Not all of them. House numbers. Your position in the queue. The cost of a coffee. Those look exactly the same. Why would we change those?

It’s the important numbers that look different. The employment figures. Your pension value. Your expected grade. The air quality. The life-changing numbers. Those all look very different now.

At some point we decided that those numbers needed to be legible in entirely different ways. We needed to be able to see (or hear, or feel) the richness and limitations in the most important numbers. It was, it turned out, the only way to build that shared literacy.

To imagine how we got there, just think about how people have always adapted and co-opted digital platforms and media for their own ends. Hashtags and memes.

Faced with the difficulty of digging behind the numbers – the need to search for sample sizes, cite the sources, highlight the bias, check the facts – we had to find a different way. It began with adding colour, toying with fonts and diacritics.

5 – a NUMBER INTERPOLATED.

It took off from there. Layers of annotations becoming conventions and then standards. Whole new planes and dimensions in unicode.

42 – a PROJECTION based on a SIGNIFICANT POPULATION SAMPLE.

All of the richness, all of the context made visible right there in the number.

27-30 – a PREDICTED RANGE created by a BAYESIAN INTERPOLATION over a RECENT SAMPLE produced by an OFFICIAL SOURCE.

180 – an INDICATOR AUTOMATICALLY SELECTED by a DEEP LEARNING system, NO HUMAN INTERVENTION.

Context expressed as colour and weight and strokes in the glyphs. You can just read it all right off the digits. There and there. See?

Things aren’t automatically better of course. Numbers aren’t suddenly more trusted. Why would they be?

It’s easier to see what’s not being said. It’s easier to demand better. It’s that little bit harder to ignore what’s before your eyes. It moves us on in our debates or just helps us recognise when the reasons for them aren’t actually down to the numbers at all.

It’s no longer acceptable to elide the detail. The numbers just look wrong. Simplistic. Black and white.

Which is why it’s difficult to read the Internet Archive sometimes.

We’ve got AIs that can dream up the missing information. Mining the Archive for the necessary provenance and adding it all back into the numbers. Just like adding colour to those old films, it can be breathtaking to see. But not in a good way. How could you have deluded yourselves and misled each other so easily?

I’ve got one more analogy for you.

Rorschach tests have long been consigned to history. But our numbers – the life-changing ones – might just remind you of those colourful inkblots. And you might accuse us of just reading things into them. Imagining things that just aren’t there.

But numbers are just inkblots. Shapes in which we choose to see different aspects of the world. They always have been. We’ve just got a better palette.