Comments on “A data for AI taxonomy”

Jack Hardinges and Elena Simperl recently published a taxonomy to describe the data relevant to AI models and systems. Their goal is to distinguish between the different types of data involved in developing, using and monitoring AI models and systems, and thereby add some nuance to debates around what type of data infrastructure and governance is required.

Datasets are shaped by various factors including their contents, their intended use, the communities involved in collecting and using them, and expectations around their longevity.

For example, a dataset of images will be organised very differently to one consisting of tabular data. And if that same tabular data contains geospatial coordinates, it might be published using a different set of standards by an organisation working in the geospatial community than by one working in local government (e.g. a GeoPackage rather than a CSV file). One published as the result of a research project might be uploaded to an archive, whereas another might be published via an API. Etc, etc.

A few years ago I wrote a paper about different dataset archetypes which was intended to help inform this kind of discussion by highlighting the different characteristics of some commonly produced datasets.

So I read the taxonomy with some interest. Here are a few notes on some of the core definitions.

Existing data

Given that anything digital can now be processed by AI, or any other system, I’m uneasy about using “existing data” as the framing for all text, audio, images, movies, code, etc., because most people won’t think of that digital stuff as “data”.

The taxonomy sets out to define different types of dataset, so it’s understandable that it attempts to define the “everything else”. But if the intention is to help shed light on the different types of inputs and outputs of AI systems, for a broader audience, then a better label might be more appropriate.

The current definition of “Existing data” encompasses big collections of digital objects harvested from the web, unstructured corpuses of text and imagery, as well as structured datasets, etc.

Some of this “data” might only exist in aggregate as the result of creating a training set. Or may have been intentionally published and structured with a specific set of use cases in mind that did not originally include AI.

I find it hard to think of something like the web as a single dataset. It doesn’t fit my personal definition.

Training data

I also don’t think the taxonomy adequately distinguishes between “Existing data” and “Training data”.

One of the defining characteristics of foundation models is that they are trained on very large, unstructured datasets, usually harvested from the web, as opposed to the more purposefully curated data used in other types of machine-learning system.

All of the current concerns around foundation models derive from this feature.

Attempts to retrofit governance and resolve licensing issues for the outputs of web crawling look different to more traditional approaches to building datasets, where gaining consent and permission is more straightforward because there is a closer connection between the people represented in, or contributing to, a dataset and those producing it.

The taxonomy references Common Crawl as “Training data”, but its use, at least originally, was much broader. The openly licensed StackOverflow dataset has been used in a variety of ways, but it’s only recently that its use to train foundation models has caused concern. Whereas ImageNet is well over a decade old and was intentionally published to support machine-learning research.

Some datasets are intentionally produced as “Training data”. Others become training data because they are used that way.

Reference and Local data

One way that AI systems try to incorporate additional data, e.g. facts that were not available when the model was trained, is through techniques like “Retrieval Augmented Generation” (RAG). It’s a popular technique because it can incorporate new or private data into a system that is otherwise powered by a foundation model trained on a broad set of sources.
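As a rough illustration of the pattern, here’s a minimal sketch of RAG in Python. The function names and the naive keyword retrieval are purely illustrative; real systems typically use vector search and a hosted model rather than these stand-ins:

    # Minimal sketch of Retrieval Augmented Generation (RAG).
    # Everything here is illustrative, not a production design.
    from typing import Callable, List

    def retrieve(query: str, knowledge_base: List[str], top_k: int = 3) -> List[str]:
        """Naive retrieval: rank documents by keyword overlap with the query."""
        terms = set(query.lower().split())
        ranked = sorted(
            knowledge_base,
            key=lambda doc: len(terms & set(doc.lower().split())),
            reverse=True,
        )
        return ranked[:top_k]

    def answer(query: str, knowledge_base: List[str], llm: Callable[[str], str]) -> str:
        """Augment the prompt with retrieved context before calling the model."""
        context = "\n".join(retrieve(query, knowledge_base))
        prompt = (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {query}"
        )
        return llm(prompt)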

The same dataset might feature as both “Reference data” and “Local data” in the same deployed system. E.g. using Wikidata as a source of labels for a training or fine-tuning dataset AND as a knowledge base that is queried during deployment.
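To make that dual role concrete, here’s a rough sketch using the public Wikidata SPARQL endpoint. The query, function names and surrounding workflow are my own illustration, not anything defined in the taxonomy:

    # Illustrative only: one Wikidata lookup used both to label training
    # examples ("Reference data") and to answer queries from a deployed
    # system ("Local data"). The workflow around it is hypothetical.
    import requests

    WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

    def label_for(entity_id: str) -> str:
        """Fetch the English label for a Wikidata entity, e.g. 'Q42'."""
        query = (
            f"SELECT ?label WHERE {{ wd:{entity_id} rdfs:label ?label . "
            'FILTER(LANG(?label) = "en") }'
        )
        response = requests.get(
            WIKIDATA_SPARQL,
            params={"query": query, "format": "json"},
            headers={"User-Agent": "example-rag-demo/0.1"},  # Wikidata asks for a descriptive UA
        )
        response.raise_for_status()
        bindings = response.json()["results"]["bindings"]
        return bindings[0]["label"]["value"] if bindings else entity_id

    # Reference data: labelling examples while building a fine-tuning set.
    training_example = {"entity": "Q42", "label": label_for("Q42")}

    # Local data: the same lookup answering a query at deployment time.
    def handle_user_query(entity_id: str) -> str:
        return f"That identifier refers to {label_for(entity_id)}."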

“Local data” to me implies data local to a deployment or use of a system, e.g. that of a specific organisation or user. But it might also include any other “existing data”.

A shorter summary of my feedback might be that this is less a taxonomy of data, e.g. something that could be used to classify or describe existing datasets, and more a description of the roles that a dataset might play in an AI workflow or system.

Finally, going back to my introduction, the reason I’m interested in understanding the roles that different datasets play in AI systems and workflows is that those roles shape how the data is accessed, used and shared, and might provide useful insights into the types of governance models they need.

Training and fine-tuning datasets need to be accessible as a whole, which implies that they will be published and accessed as complete datasets. So we may need systems of change discovery and retraction to deal with issues found in that data.

If data is published in a decentralised way then an aggregate dataset may need to be created before it can be used in training. That creates another layer of governance.

Whereas data included via RAG is more likely to be accessed via APIs. So the responses from an API can be altered in real-time if issues are discovered. Those APIs are also governed by additional terms that may shape use of the data. Existing APIs might be pressed into service to help deploy AI systems without their providers being aware of those new use cases. Etc, etc.
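As a sketch of what that real-time mediation might look like (the record store, retraction list and function names here are assumptions of mine, not anything from the taxonomy):

    # Illustrative sketch: an API layer feeding a RAG pipeline that filters
    # out retracted records at query time, so corrections take effect
    # immediately without retraining a model.
    RETRACTED_IDS = {"doc-123"}  # updated whenever an issue is discovered

    def fetch_documents(query: str, store) -> list:
        """Return matching records from the (assumed) store, minus retractions."""
        return [doc for doc in store.search(query) if doc["id"] not in RETRACTED_IDS]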