Assessing data infrastructure: the Digital Public Goods standard and registry

This is the second in a short series of posts in which I’m sharing my notes and thoughts on a variety of different approaches for assessing data infrastructure and data institutions.

The first post in the series looked at The Principles of Open Scholarly Infrastructure.

In this post I want to take a look at the Digital Public Good (DPG) registry developed by The Digital Public Goods Alliance.

What are Digital Public Goods?

The Digital Public Goods Alliance define digital public goods as:

open-source software, open data, open AI models, open standards, and open content that adhere to privacy and other applicable laws and best practices, do no harm by design, and help attain the Sustainable Development Goals (SDGs)

Digital Public Goods Alliance

While the link to the Sustainable Development Goals narrows the field, this definition still encompasses a very diverse set of openly licensed resources.

Investing in the creation and use of DPGs was one of eight key actions in the UN Roadmap for Digital Cooperation published in 2020.

What is the Digital Public Goods Standard?

The Digital Public Goods Standard consists of 9 indicators and requirements that are used to assess whether a dataset, AI model, standard, software package or content can be considered a DPG.

To summarise, the indicators and requirements cover:

  • relevance to the Sustainable Development Goals
  • openness: open licensing, clarity over ownership of the resource and access to data (in software systems)
  • reusability: platform independence and comprehensive documentation, in addition to open licensing
  • use of standards and best practices
  • minimising harms, with the ninth indicator, “Do No Harm by Design”, decomposed into data privacy and security, policies for handling inappropriate and illegal content, and protection from harassment

In contrast to the Principles of Open Scholarly Infrastructure, which defines principles for infrastructure services (i.e. data infrastructure and data institutions), the Digital Public Goods Standard can be viewed as focusing on the outputs of that infrastructure, e.g. the datasets that they publish or the software or standards that they produce.

But assessing a resource to determine if it is a Digital Public Good inevitably involves some consideration of the processes by which it has been produced.

A recent Rockefeller Foundation report on co-developing Digital Public Infrastructure, endorsed by the Digital Public Goods Alliance, highlights that Digital Public Goods might also be used to create new digital infrastructure, e.g. by deploying open platforms in other countries or using data and AI models to build new infrastructure.

So Digital Public Goods are produced by, used by, and support the deployment of data and digital infrastructure.

How was the Standard developed?

The Digital Public Goods Standard was developed by the Digital Public Goods Alliance (DPGA), “a multi-stakeholder initiative with a mission to accelerate the attainment of the sustainable development goals in low- and middle-income countries by facilitating the discovery, development, use of, and investment in digital public goods”.

An early pilot of the standard was developed to assess Digital Public Goods focused on Early Grade Reading. The initial assessment criteria were developed by a technical group that explored cross-domain indicators and an expert group that focused on topics relevant to literacy.

This ended up covering 11 categories and 51 different indicators.

The results of that pilot were turned into the initial version of the DPG Standard, published in September 2020. In that process the 51 indicators were reduced to just 9.

It is interesting to see what was removed, for example:

  • Utility and Impact — whether the Digital Public Good was actually in use in multiple countries
  • Product Design — whether there’s a process for prioritising and managing feature requests
  • Product Quality — accessibility statements and testing, version control, multi-lingual support
  • Community — codes of conduct, community management
  • Do No Harm — security audits, data minimisation
  • Financial Sustainability — are there revenue streams that support continual development of the public good?

The process of engaging with domain experts has continued, with the DPGA developing Communities of Practice that have produced reports highlighting key digital public goods in specific domains. This is an example of what we called “data landscaping” at the ODI.

How are Digital Public Goods assessed?

The assessment process is as follows:

  1. The owner of an openly licensed resource uses an eligibility tool to determine whether their resource is suitable for assessment
  2. If eligible, the owner will submit a nomination. The submission process involves answering all of these questions
  3. If accepted, a nominated resource will be listed in the public registry
  4. Nominated submissions will be further reviewed by the DPGA team in order to complete the assessment, at which point the nomination is marked as a Digital Public Good
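
As a rough illustration, the four steps above could be modelled as a simple state machine. This is a minimal sketch, not the DPGA's actual tooling: the `Status` names, the `Resource` fields and the three functions are all hypothetical, and the boolean flags stand in for the human judgement involved at each step.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Status(Enum):
    # Hypothetical labels for the stages described in the list above.
    UNCHECKED = auto()
    INELIGIBLE = auto()
    ELIGIBLE = auto()
    NOMINEE = auto()
    DIGITAL_PUBLIC_GOOD = auto()


@dataclass
class Resource:
    name: str
    openly_licensed: bool
    status: Status = Status.UNCHECKED


def check_eligibility(resource: Resource) -> Resource:
    # Step 1: stand-in for the eligibility tool. Assumption: open
    # licensing is the only check modelled here; the real tool asks more.
    resource.status = (
        Status.ELIGIBLE if resource.openly_licensed else Status.INELIGIBLE
    )
    return resource


def nominate(resource: Resource, answers_complete: bool) -> Resource:
    # Steps 2 and 3: an eligible resource with a complete submission
    # is listed in the registry as a nominee.
    if resource.status is Status.ELIGIBLE and answers_complete:
        resource.status = Status.NOMINEE
    return resource


def review(resource: Resource, passes_review: bool) -> Resource:
    # Step 4: only an existing nominee can be marked as a
    # Digital Public Good, and only after the further review.
    if resource.status is Status.NOMINEE and passes_review:
        resource.status = Status.DIGITAL_PUBLIC_GOOD
    return resource


example = Resource(name="Example dataset", openly_licensed=True)
review(nominate(check_eligibility(example), answers_complete=True), passes_review=True)
print(example.status.name)
```

The point of the sketch is that each stage depends on the previous one: a resource cannot be marked as a Digital Public Good without first being an eligible, listed nominee.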

While nominations can be made by third parties, some indicators are only assessed based on evidence provided directly by the publisher of the resource.

At the time of writing there are 651 nominees and 87 assessed public goods in the registry. The list of Digital Public Goods consists of the following (items can be in multiple categories):

[Chart: distribution of Digital Public Goods by category, as catalogued on 24th February 2022. The only value recoverable here is AI Model: 4.]

It's worth noting that several of the items in the “Data” category are actually APIs and services.

The assessment of a verified Digital Public Good is publicly included in the registry. For example, here is the recently published assessment of the Mozilla Common Voice dataset. However, all of the data supporting the individual nominations can be found in this public repository.

The documentation and the submission guide explain that the benefits of becoming a Digital Public Good include:

  • increased adoption or use
  • discoverability, promotion and recognition within development agencies, the UN and governments
  • in the future — additional branding opportunities through use of icons or brand marks
  • in the future — potential to be included in recommendations to government procurers and funding bodies
  • in the future — additional support, e.g. mentoring and funding

Indirectly, by providing a standard for assessment, the DPGA will be influencing the process by which openly licensed resources might be created.

Could the Standard be used in other contexts?

Is the Standard useful as a broader assessment tool, e.g. for projects that are not directly tied to the SDGs? Or for organisations looking to improve their approach to publishing open data, open source or machine-learning models?

I think the detailed submission questions provide some useful points of reflection.

But I think the process of trying to produce a single set of assessment criteria that covers data, code, AI models and content means that useful and important detail is lost.

Even trying to produce a single set of criteria for assessing (open) datasets across domains is difficult. We tried that at the ODI with Open Data Certificates. Others are trying to do this now with the FAIR data principles. You inevitably end up with language like “using appropriate best practices and standards” which is hard to assess without knowledge of the context in which data is being collected and published.

I also think the process of winnowing down the original 51 indicators to just 9, plus some supporting questions, has lost some important elements, particularly around sustainability.

Again, in Open Data Certificates, we asked questions about longer-term sustainability and access to data. This also seems highly relevant in the context in which the DPGA are operating.

I think the standard might have been better with separate criteria for different types of resource, directly referencing existing criteria (e.g. FAIR data assessment tools for data) or best practices (e.g. Model Cards for Model Reporting for AI).