Some idle thoughts for a Friday afternoon.
I was just taking a look at Source.Plus, a dataset of public domain images for training foundation models. It’s a project of Spawning.ai, which is working to build “data governance for generative AI”. I have some thoughts on the tools they’re building, but that’s not what I’m writing about here.
It was this statement that caught my attention: “Through the Source.Plus platform, we also introduce novel, community-driven dataset governance mechanisms that reduce harm and support reproducibility over time.”
I think it’s early days for the project, so there’s not much detail about what that novel governance might look like, although I assume it will be based on the Spawning.ai tools.
What exists at present is a brief summary of how the Source.Plus dataset is managed, and specifically how it deals with representation, safety, copyright, etc. There are also links to relevant policies, e.g. the Takedown Policy, which outlines a process that kicks in if content is flagged.
I think what I was expecting to see is what we called “Visible Processes” in Collaborative Data Patterns: not just a policy document but a set of online tools that would provide:
- a summary of activity, e.g. how many cases are open, how many have been resolved (and how), and how long it takes requests to be completed
- insights into specific cases, e.g. at what stage is my request to take down some content?
- some ability for the community to engage with that process once started, e.g. to upvote a request, or to add a problematic piece of content or additional evidence that might be useful in the process
- …and maybe some detail on who is driving that process, who are the people behind the email addresses and contact forms, and how might others get involved?
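To make the first of those points a bit more concrete, here is a minimal sketch of the kind of activity summary I have in mind. It is purely illustrative: the `Case` record, its field names, the outcome labels and the sample cases are all my own assumptions, not anything Source.Plus or Spawning.ai publishes.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from statistics import median
from typing import Optional


@dataclass
class Case:
    """A hypothetical takedown case; field names are assumptions for illustration."""
    case_id: str
    opened: date
    closed: Optional[date] = None   # None while the case is still open
    outcome: Optional[str] = None   # made-up labels, e.g. "removed" or "rejected"


def summarise(cases):
    """Count open and resolved cases, tally outcomes, and report time to resolution."""
    open_cases = [c for c in cases if c.closed is None]
    resolved = [c for c in cases if c.closed is not None]
    days = [(c.closed - c.opened).days for c in resolved]
    return {
        "open": len(open_cases),
        "resolved": len(resolved),
        "outcomes": dict(Counter(c.outcome for c in resolved)),
        "median_days_to_resolve": median(days) if days else None,
    }


# Invented sample data, only to show the shape of the summary.
cases = [
    Case("c1", date(2024, 1, 1), date(2024, 1, 4), "removed"),
    Case("c2", date(2024, 1, 2), date(2024, 1, 9), "rejected"),
    Case("c3", date(2024, 1, 5)),
]
summary = summarise(cases)
print(summary)
```

Even something this small, published regularly, would tell the community how many requests are in flight, how they tend to be resolved, and how long that takes.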
Obviously there are privacy and safety issues that need to be considered in all of the above. You need to protect staff, rights holders and contributors, for multiple reasons.
I think this type of framework is what I would expect to see as a minimum around a “community-driven dataset governance mechanism”.
But to me community-driven means more than community-initiated. The community should be involved at every step of the process. And that means more than just dealing with takedowns and copyright. It means shaping the content and organisation of the dataset itself.
We captured some more patterns around that but they’re clearly not exhaustive.
The value of capturing these types of patterns is that it makes it easier for different projects to adopt similar approaches, allows the creation of shared infrastructure and tools, and builds community expectations around what good looks like.