What datasets have been classified as Digital Public Goods?

Update: 2024-04-14. I’ve updated this post with some corrections. See below.

A couple of years ago I wrote a short series of posts looking at some different approaches for assessing data infrastructure. It includes this post on the Digital Public Goods standard and registry.

Digital Public Goods are defined as:

“open-source software, open data, open AI models, open standards, and open content that adhere to privacy and other applicable laws and best practices, do no harm by design, and help attain the Sustainable Development Goals (SDGs)”

Source: Digital Public Goods Alliance

I noted recently that OpenStreetMap has been certified as a Digital Public Good, and that Creative Commons has joined the Digital Public Goods Alliance.

So I thought I’d see how the registry has changed and take a closer look at the “Data” category to see what’s in there.

How has the registry changed?

Category    2022-02    2024-03    Change
AI Model          4          4         0
Content          17         14        -3
Data              8         14        +6
Software         68        127       +59
Standard          4          0        -4
Total*          101        153       +52
* Note: items can be in multiple categories, so the total can be lower than the sum of the category counts.
Source: the DPG registry
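
As a quick sanity check on the table, here is a minimal Python sketch that recomputes the Change column and compares the per-category sums with the totals. The counts are copied from the table above. The variable names are just for illustration, and reading the difference as "overlap" assumes the totals count distinct items.

```python
# Category counts copied from the table above: (2022-02, 2024-03).
counts = {
    "AI Model": (4, 4),
    "Content": (17, 14),
    "Data": (8, 14),
    "Software": (68, 127),
    "Standard": (4, 0),
}
# Totals as given in the table's "Total" row.
reported_total = (101, 153)

# Change per category between the two snapshots.
changes = {name: new - old for name, (old, new) in counts.items()}

# Sum of the category counts for each snapshot.
category_sums = (
    sum(old for old, _ in counts.values()),
    sum(new for _, new in counts.values()),
)

# Assumption: the totals count distinct items, so any excess in the
# category sums comes from items listed in more than one category.
overlap = tuple(s - t for s, t in zip(category_sums, reported_total))

for name, delta in changes.items():
    print(f"{name:10} {delta:+d}")
print("category sums:", category_sums)
print("excess over reported totals:", overlap)
```

On these figures the 2024-03 category counts sum to 159 against a total of 153, which is consistent with a handful of items sitting in more than one category.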

Clearly the main area of growth has been software: 59 more entries, against a net increase of 52 across the whole registry.

As items in the registry are regularly reassessed, I can only assume that the standards that were originally included have been withdrawn or perhaps recategorised.

I couldn’t find a way to tell whether something was previously a Digital Public Good, or why it might no longer be classified as such. Those feel like important things to be able to find out. It might be possible to mine that from the GitHub project.
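
One way to approach that kind of mining, as a rough sketch: take the list of item names from two snapshots of the registry (for example, two points in the project's history) and diff them. I haven't checked how the repository is organised, so extracting the lists is left open here. The function name and the item names below are hypothetical placeholders, not real registry entries.

```python
def diff_snapshots(old_names, new_names):
    """Return (removed, added) item names between two registry snapshots."""
    old, new = set(old_names), set(new_names)
    return sorted(old - new), sorted(new - old)


# Hypothetical placeholder names, NOT real registry entries.
snapshot_2022 = ["example-standard-a", "example-dataset-b", "example-software-c"]
snapshot_2024 = ["example-dataset-b", "example-software-c", "example-software-d"]

removed, added = diff_snapshots(snapshot_2022, snapshot_2024)
print("removed:", removed)
print("added:", added)
```

This would only show what has dropped out of the registry, not why. The reasons would still need to come from the review history or from the maintainers.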

I had expected to see a lot more datasets but there’s been little change in that category. It was a small enough list that I thought I’d take a closer look.

What’s in the data category?

Update: 2024-04-14. In an earlier version of this post I had wrongly suggested that some of the Data DPGs were not actually providing data. This was clarified in this discussion. I have updated the comments below to reflect that.

I’ve created a spreadsheet with the current list of items in the Dataset category for future reference.

There’s a range of domains covered including geospatial and weather data, health, government and agriculture.

Taking a closer look at each of them, I found:

One “Dataset” which I would consider to be a Standard rather than a Dataset: Agrontology. It’s an ontology, so basically a means of organising data, rather than a dataset in itself. It’s very small and is used to help organise data in Agrovoc, which is also in the registry as a dataset.

Crosscut is a commercial service with a free tier. However, the service itself is not registered as a DPG; it is the example datasets provided here which are the DPGs. When I originally reviewed Crosscut I mistakenly thought these were simply limited examples rather than datasets intended for reuse.

Crosscut claims all data accessed from the service is CC0, but I’m sceptical of that given that OSM, which is licensed under the ODbL, is one of the sources and the rest are CC-BY 4.0. I’d expect CC-BY 4.0 as a minimum to preserve attribution.

Doptor Open Data provides an API for accessing limited data for Bangladesh, but the original sources are unclear. So I’m not sure how it’s collected or maintained.

Dicra provides some tooling to help curate and refine datasets. It seems to have a volunteer process for assessing potentially useful data. But the datasets have a range of licenses, not all of which are open. The data classified as a DPG are the datasets accessible from this service.

Global Healthsites provides a website and API for accessing data. But it’s ultimately a wrapper around OpenStreetMap which is where all the primary data resides.

Govdirectory is similar, in that while there is a web presence and data downloads, the actual data is curated within Wikidata.

The Open Terms Archive submission indicates that the license for the data is ODC-BY. This is the license applied to this dataset on GitHub, which is part of the DPG submission. However, from the Open Terms website you are taken to this license file, which uses the ODbL. The other datasets accessible from the site are also ODbL licensed.

Project AEDES is classified as an AI Model, Open Content and Open Data. It’s a research prototype that describes how to build a model for predicting disease outbreaks from Google searches. The project doesn’t contain any data that is continually updated: the GitHub repo was last updated a year ago. I would classify this as Software and maybe AI Model.

Reviewing datasets can be tricky, but there are a few category and licensing errors here that I’d have expected the DPG review process to have caught.

What should be in the registry?

I don’t think the registry is intended to be a discovery tool per se. So I wouldn’t expect it to be comprehensive or a destination for someone looking to find a dataset.

The registration process is driven by the publisher of the data (or code, etc). So there’s inevitably going to be a somewhat uneven mix of entries. These early entries are likely to be from projects that are looking for additional recognition.

If I were running this I’d be actively soliciting specific projects to submit themselves for registration. There are a lot of national and international geospatial, weather and statistical datasets that would obviously fit the bill as a “digital public good”.

But if the registry grows really rapidly then there are going to be challenges with using it as a discovery tool: the ways you might want to filter datasets will be different from code, content, etc.

To me, this type of registry would be best used to highlight datasets that have good coverage, are regularly updated and have a sustainable revenue model so that they will be available over the long term.

Some of the datasets in the list fit that bill, but the rest seem very small or are essentially just extracts of other datasets.