“AI-Ready Data” is the wrong framing

A paper was published this week by Stefaan Verhulst, Andrew Zahuranec and Hannah Chafetz called "Moving Toward the FAIR-R principles: Advancing AI-Ready Data". The paper sets out to do two things: Make the case that we are in a "Fourth Wave" of open data in which it is critical that data is made useful for …

What does community-driven data governance look like?

Some idle thoughts for a Friday afternoon. I was just taking a look at Source.Plus, a dataset of public domain images for training foundation models. It's a project of Spawning.ai, which is working to build "data governance for generative AI". I have some thoughts on the tools they're building, but that's not what I'm writing …

Comments on “A data for AI taxonomy”

Jack Hardinges and Elena Simperl recently published a taxonomy to describe the data relevant to AI models and systems. Their goal is to better distinguish between the different types of data relevant to developing, using and monitoring AI models and systems, and thereby add some nuance to …

A basis for better definitions of “open”

There's been a lot of discussion recently around what it means to be "open". I think this has largely been driven by issues and concerns around the development and deployment of Large Language Models and claims for at least some of those models to be "open". What does it mean for an LLM or other …

Will AI hamper our ability to crawl the web for useful data?

As websites start to block Common Crawl, and as the project leans into its role in training LLMs, will it become harder to use data from the web for other purposes?