The Common Voice data ecosystem

In 2021 I’m planning to spend some more time exploring different data ecosystems with an emphasis on understanding the flows of data within and between different data initiatives, the tools they use to collect and share data, and the role of collaborative maintenance and open standards.

One project I’ve been looking at this week is Mozilla Common Voice. It’s an initiative that is producing a crowd-sourced, public domain dataset that can be used to train voice recognition applications. It’s the largest dataset of its type, consisting of over 7,000 hours of audio across 60 languages.

It’s a great example of communities working to create datasets that are more open and representative, helping to address biases and supporting the creation of more equitable products and services. I’ve been using it in my recent talks on collaborative maintenance, but have only had the chance to dig a bit deeper this week.

The main interface allows contributors either to record their voice, by reading short pre-prepared sentences, or to validate contributions by listening to existing recordings and confirming that they match the script.

Behind the scenes is a more complicated process, which I found interesting.

It further highlights the importance of both open source tooling and openly licensed content in supporting the production of open data. It is also another example of how choices around licensing can create friction between open projects.

The data pipeline

Essentially, the goal of the Common Voice project is to create new releases of its dataset, with each release including more languages and, for each language, more validated recordings.

The data pipeline that supports those releases consists of the following basic steps. (There may be other stages involved in the production of the output corpus, but I’ve not dug further into the code and docs.)

  1. Localisation. The Common Voice web application first has to be localised into the required language. This is coordinated via Mozilla Pontoon, with a community of contributors submitting translations licensed under the Mozilla Public License 2.0. Pontoon is open source and can be used for other, non-Mozilla applications. When the localisation reaches 95%, the language can be added to the website and the process can move to the next stage.
  2. Sentence Collection. Common Voice needs short sentences for people to read. These sentences need to be in the public domain (e.g. via a CC0 waiver). A minimum of 5,000 sentences are required before a language can be added to the website. The content comes from people submitting and validating sentences via the sentence collector tool. Text is also drawn from existing public domain sources: there’s a sentence extractor tool that can pull content from Wikipedia and other sources. For bulk imports the Mozilla team needs to check for licence compatibility before adding text. All of this means that the source texts for each language are different.
  3. Voice Donation. Contributors read the provided sentences to add their voice to the dataset. The reading and validation steps are separate microtasks. Contributions are gamified and there are progress indicators for each language.
  4. Validation. Submitted recordings go through retrospective review to assess their quality. This provides some moderation, letting contributors flag recordings that are offensive, incorrect or of poor quality. Validation tasks are also gamified. In general there are more submitted recordings than validations. Clips need to be reviewed by two separate users before they are marked as valid (or invalid); there’s a minimal sketch of this rule after the list.
  5. Publication. The corpus consists of valid, invalid and “other” (not yet validated) recordings, split into development, training and test datasets. There are separate datasets for each language.
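
Here’s that review rule as a minimal sketch. The vote model and function are mine, for illustration only; the real logic lives in the Common Voice codebase.

```python
from collections import Counter

# A toy model of the review rule from step 4: a clip is only marked valid
# (or invalid) once two separate reviewers agree. Anything else (a single
# review, or a disagreement awaiting more votes) stays unvalidated.
# This is my sketch, not Common Voice's actual implementation.

def clip_status(votes: list[bool]) -> str:
    """Classify a clip from its reviews (True = recording matches script)."""
    counts = Counter(votes)
    if counts[True] >= 2:
        return "valid"
    if counts[False] >= 2:
        return "invalid"
    return "other"  # not yet validated

print(clip_status([True, True]))    # valid
print(clip_status([False, False]))  # invalid
print(clip_status([True]))          # other: awaiting a second review
print(clip_status([True, False]))   # other: reviewers disagree
```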

There is an additional dataset consisting of 14 single-word sentences (the ten digits, “yes”, “no”, “hey”, “Firefox”), which is published separately. Steps 2-4 look similar for it, though.
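
To make the publication step concrete, here’s a minimal loading sketch. It assumes the layout of recent per-language downloads, a clips folder of audio files plus tab-separated manifests for the validated, invalidated, “other” and train/dev/test subsets; check the release notes for the exact file and column names.

```python
import csv
from pathlib import Path

# Read one TSV manifest from a per-language Common Voice release.
# Assumes the layout of recent releases: a clips/ directory of audio
# files plus manifests such as validated.tsv, train.tsv, dev.tsv and
# test.tsv. File and column names may differ between release versions.

def load_manifest(release_dir: str, split: str) -> list[dict]:
    """Return the rows of e.g. train.tsv as a list of dictionaries."""
    path = Path(release_dir) / f"{split}.tsv"
    with path.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter="\t"))

# Hypothetical usage: rows = load_manifest("cv-corpus/cy", "validated")
# Each row pairs a clip filename with its sentence and review vote counts.
```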

Some observations

What should be clear is that there are multiple stages, each with their own thresholds for success.

To get a language into the project you need to translate around 600 text fragments from the application and compile a corpus of at least 5,000 sentences before the real work of collecting the voice dataset can begin.
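
To make those gates explicit, here’s a toy check using the numbers above. The constants and function name are mine, and the figures are approximate.

```python
# The launch gates for a new language, as I understand them: roughly 600
# application strings to localise (to at least 95%) and a minimum of
# 5,000 public domain sentences. Constants and function are illustrative.

APP_STRINGS = 600               # approximate count of localisable UI strings
LOCALISATION_THRESHOLD = 0.95   # share of strings that must be translated
MIN_SENTENCES = 5_000           # sentences needed before recording can start

def language_can_launch(translated: int, sentences: int) -> bool:
    """True once a language clears both thresholds described above."""
    return (translated / APP_STRINGS >= LOCALISATION_THRESHOLD
            and sentences >= MIN_SENTENCES)

print(language_can_launch(translated=580, sentences=6_200))  # True
print(language_can_launch(translated=580, sentences=3_000))  # False
```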

That work requires input from multiple, potentially overlapping communities:

  • the community of translators, working through Pontoon
  • the community of writers, authors and content creators producing public domain content that can be reused in the service
  • the Common Voice contributors submitting additional sentences
  • the contributors recording their voice
  • the contributors validating other recordings
  • the teams at Mozilla, coordinating and supporting all of the above

As the Common Voice application and its configuration are open source, it is easy to include it in Pontoon and allow others to contribute to its localisation. To build representative datasets, your tools need to work for all the communities that will be using them.

The availability of public domain text in the source languages is clearly a contributing factor in getting a language added to the site and ultimately included in the dataset.

So the adoption of open licences, and the richness of the commons in those languages, will be factors in determining how rich the voice dataset might be for a language. And, hence, how easy it is to create good voice and text applications that can support those communities.

You can clearly create a new dedicated corpus, as people have done for Hakha Chin. But the strength and openness of one area of the commons will impact other areas. It’s all linked.

While there are different communities involved in Common Voice, it’s clear from these reports from communities working on Hakha Chin and Welsh that, in some cases, it’s the same community that is working across the whole process.

Every language community is working to address its own needs: “We’re not dependent on anyone else to make this happen…We just have to do it”.

That’s the essence of shared infrastructure. A common resource that supports a mixture of uses and communities.

Decisions about which licences to use are, as ever, really important. At present Common Voice only takes a few sentences from individual pages of the larger Wikipedia instances. As I understand it, this is because Wikipedia content is not public domain and so cannot be used wholesale; small extracts, though, should presumably be covered by fair use.
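
The result is, in effect, a per-article cap on extraction. Here’s a toy illustration of that idea; the cap of three sentences and the sampling approach are my assumptions for the sketch, not a description of Mozilla’s actual extractor.

```python
import random

# Take at most a few sentences from each source article, so that no
# substantial part of any single page is reproduced in the corpus.
# The cap of 3 is an assumption for illustration purposes only.
MAX_PER_ARTICLE = 3

def sample_sentences(article_sentences: list[str]) -> list[str]:
    """Randomly pick at most MAX_PER_ARTICLE sentences from one article."""
    k = min(MAX_PER_ARTICLE, len(article_sentences))
    return random.sample(article_sentences, k)
```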

I would expect that those interested in building and maintaining language-specific instances of Wikipedia overlap with those interested in making voice applications work in that same language. Incompatible licensing can limit the ability to build on existing work.

Regardless, Mozilla and the Wikimedia Foundation have made licensing choices that reflect the needs of their communities and the goals of their projects. That’s an important part of building trust. But, as ever, those licensing choices have subtle impacts across the wider ecosystem.