“AI-Ready Data” is the wrong framing

A paper was published this week by Stefaan Verhulst, Andrew Zahuranec and Hannah Chafetz called “Moving Toward the FAIR-R principles: Advancing AI-Ready Data”.

The paper sets out to do two things:

  1. Make the case that we are in a “Fourth Wave” of open data in which it is critical that data is made useful for AI, and Generative AI (GenAI) in particular, so that data can be democratised and be used to create impact
  2. Argue that the FAIR data framework needs to be extended to make it “FAIR-R”: Ready for AI

I think both of these things are wrong.

Is there a Fourth Wave of open data, and if so, is it about making data “AI-Ready”?

The paper opens by laying out a description of how the open data movement has evolved. The authors suggest that this can be mapped out as the following stages:

  1. Freedom of Information requests, and government transparency
  2. Open by default
  3. Purpose-driven reuse of data
  4. Preparing data for generative AI

I think it’s impossible to adequately describe the evolution of the open data movement without acknowledging that different parts of that movement have adopted open data, as a tool and a means to an end, for very different reasons. Anything else is unnecessarily reductive.

The open government data movement might be traced back to FOI and transparency. But that’s not the origin of open data in science, which can be traced back for decades with a focus on cooperation and reusability.

These movements do not start from the same place. They have not gone through the same stages, and they are not all facing the same issues today.

There was definitely a refocusing of the effort around opening government and commercial data from an “open by default” (“if we release it, the magic will happen”) stance to one that was more focused on creating impact through more purposeful publishing and collaboration.

“Open by default” made sense in the early days as a means to unlock data. But purposeful publishing, with closer coordination between publishers and reusers of data, has been shown to deliver better results.

But again, this is not the same for all parts of the open data movement.

In my opinion the first three “waves” that the authors describe might be seen as a characterisation of the evolution of the open government data movement, but not more broadly.

I think the suggestion that the next, or current, wave is about preparing data for Generative AI is just wrong.

To me the current situation is more about reconciling the goals of a movement, which has always been based on unrestricted access to and use of data, with a landscape in which data is being used at a scale which is not sustainable, and in ways that may cause harm.

Responding to that challenge involves improving the stewardship and governance of data, by building on that closer collaboration between the publishers and consumers of data demonstrated in the “Third Wave”. It is not about leaning in to the large scale industrial use of data in AI.

I’m not denying that machine learning and AI can be used in profound and innovative ways. Just that focusing on the needs of a particular type of use increases the risk of eroding trust and creating harms.

Put more simply: framing the “Fourth Wave” of open data as being about servicing the needs of AI makes the same mistake as the “open by default” framing: “if we do this thing, the magic will happen”.

We’re just substituting AI for hackdays.

Do we need to extend FAIR to FAIR-R?

The rest of the paper proposes that FAIR should be extended with an extra letter in the acronym:

Readiness for AI: Datasets must be structured to meet the specific (quality) requirements of AI applications, such as labeled data for supervised learning or comprehensive coverage for unsupervised learning.

The authors outline some of the potential impacts of AI, characteristics of the datasets that are useful for training, and the emerging standards for describing and publishing training datasets.

The aspects of FAIR (“Findable”, “Accessible”, “Interoperable” and “Reusable”) are all underpinned by a set of principles that describe what they mean. For example, for a dataset to be “Findable” it must have good metadata, and that metadata should be published somewhere where it can be indexed.
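To make that concrete, here is a minimal sketch of the kind of machine-readable metadata that supports Findability: a schema.org/Dataset description in JSON-LD, the form that search engines and data catalogues can index. The dataset name, identifier and URLs are illustrative placeholders, not a real dataset.

```python
import json

# Illustrative schema.org/Dataset metadata in JSON-LD.
# All values below are hypothetical placeholders.
metadata = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example Air Quality Readings",  # human-readable title
    "description": "Hourly sensor readings from an illustrative network.",
    "identifier": "https://example.org/dataset/air-quality",  # persistent ID
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "keywords": ["air quality", "sensors", "open data"],
    "distribution": {
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://example.org/data/air-quality.csv",
    },
}

# Publishing this JSON-LD alongside the dataset is what lets
# indexes and catalogues discover it.
print(json.dumps(metadata, indent=2))
```

The point is that “Findable” already cashes out as concrete, checkable requirements on metadata; no new principle is needed to express them.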

What it means to be “Ready for AI” is not really defined in the paper.

But, with an exception that I’ll explore below, the concept of “Ready for AI” is already covered within the existing elements of FAIR. For a dataset to be useful in training AI means ticking all of the Findable, Accessible, Interoperable and Reusable boxes.

What the authors should instead be proposing is a FAIR implementation profile that describes what FAIR means when applied to training datasets.

I’ve previously described the importance of implementation profiles in bridging the gap between broad principles and actionable data management guidance. Packaging up all of the existing work around improving data infrastructure for AI into a set of actionable best practices would be a useful step.
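A sketch of what that could look like: an implementation profile expressed as concrete, checkable requirements against a dataset's metadata, rather than a new letter in the acronym. The required fields below are my own illustrative assumptions, not a published profile.

```python
# Hypothetical implementation profile for training datasets,
# expressed as required metadata fields. These fields are
# illustrative assumptions, not an agreed standard.
REQUIRED_FIELDS = [
    "name",            # Findable: human-readable title
    "identifier",      # Findable: persistent identifier
    "license",         # Reusable: explicit terms of use
    "encodingFormat",  # Interoperable: machine-readable format
    "provenance",      # Reusable: where the data came from
]

def check_profile(metadata: dict) -> list[str]:
    """Return the profile requirements this metadata fails to meet."""
    return [field for field in REQUIRED_FIELDS if not metadata.get(field)]

# Example: a record missing licensing and provenance information.
record = {
    "name": "Example corpus",
    "identifier": "doi:10.1234/example",  # placeholder identifier
    "encodingFormat": "text/csv",
}
print(check_profile(record))  # → ['license', 'provenance']
```

Framed this way, the work becomes writing down and agreeing the checks, which is actionable guidance rather than another principle.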

Not least because it would be clearer about what the authors are actually requesting: that data publishers should invest in supporting specific standards, and publish data in ways, and within platforms, that are useful for a specific technology and user community.

Open data has always placed more of a burden on publishers than consumers. It is publishers that invest in collecting, structuring, documenting and publishing data in ways that reduce the work of consumers.

Asking data providers to do more work, to be part of a broader data ecosystem, potentially comes at the cost of supporting existing users. That feels problematic to me. In many areas essential data infrastructure needs additional investment in order to remain useful in its current form. We’re asking people to do more.

It’s also problematic because the organisations building new AI based systems are amongst the most well funded organisations in the world. Organisations that might reasonably be expected to be able to shoulder the costs of translating and restructuring data into the forms that are useful for them. Or even invest in its co-creation and maintenance.

In the paper, the authors do note the need to address ethical concerns around the use and reuse of data within AI applications. But they don’t acknowledge other efforts, like the CARE principles, which already address that. I previously summarised a few other limitations of FAIR and efforts to address those. Addressing ethical and legal concerns should not be focused solely on AI.

In my opinion, FAIR data has always been about advocating for better data management: the processes by which we collect, manage, publish and share data. It hasn’t really been about data governance: the decision making and oversight that guides and informs those data management processes.

To use a metaphor: FAIR is about making data easy to consume. It encourages you to describe what is on the menu; to make sure it’s clearly described and labelled; and to make sure that what is presented is well-cooked and served. It’s never been about telling you the source of those ingredients or what has been going on in the kitchen.

To achieve that we don’t really need more principles.