Acceptable answers only

It can be hard to comment on a lot of tech news without coming across like Apu taking a bullet for a big tech platform. But a few aspects of the current debate around the new StackOverflow deal with OpenAI, as reported in TechCrunch and The Register and debated on Mastodon, have irked me.

So here I am writing a blog post. I’m not trying to take a bullet for StackOverflow or OpenAI. My interest is in how we build and maintain a commons. Read to the end.

I’ve seen some comments along the lines of “Oh no, StackOverflow is now being used to train AI, delete your accounts”.

Which overlooks the fact that the entirety of the StackExchange dataset has been openly licensed for at least ten years and hosted on the Internet Archive, right here. The quarterly updated dataset has been available for anyone to use for any purpose, including training AI, for all of that time.

LLMs and other forms of AI have been using that dataset for some time. There’s a Kaggle dataset which is over 5 years old. It has featured in datasets like The Pile for at least 3 years.

I’ve got a lot of sympathy for people struggling with the impacts of choosing to contribute to openly licensed datasets and repositories.

I’ve also seen suggestions that “Oh no, StackOverflow will now be useless because of LLM contributions”.

Which overlooks the fact that StackOverflow banned LLM contributions some time ago and there’s nothing at all in the announcement that suggests this is going to change.

The deal appears to be about letting StackOverflow use OpenAI models in its enterprise products, and giving OpenAI more real-time access to the content in StackOverflow. It’s not about changing their submission policies.

I’ve also seen suggestions that it’s “Time to delete your account and posts”. Which overlooks the fact that, even assuming you could scrub them from an open dataset, the immediate impact will be making the web worse for everyone else who isn’t using an LLM. That’s a net loss for everyone.

Within the limited details currently available about the deal, one likely outcome will be that OpenAI, and other platforms that integrate with StackOverflow, will give better answers to programming questions and cite their sources while doing so. That seems…good?

As a very occasional contributor to (but frequent user of) StackOverflow, an attribution is all I’ve ever expected for a useful answer. That to me is a better situation than OpenAI and others just harvesting the data dump.

What I haven’t seen so far is discussion like:

  • If StackOverflow is having to cut costs and contributions have been falling, then will this deal allow the platform to continue to survive until the community can build an alternative?
  • How could StackOverflow (and other platforms and tools) engage better with its community to maintain and build trust?
  • If StackOverflow is doing things that the community is not keen on, what is required to fork it and build a new community-owned and governed alternative? The dataset is open and seems pretty comprehensive. The open source community has a culture of forking when things go awry. That seems rarer in the open data community (although it did give us things like Discogs, MusicBrainz and OSM).

There’s lots to be unhappy about around AI. Especially the environmental impacts. And the capitalism. I’m not overlooking those issues, but I do think there’s a more nuanced discussion to be had.

The fact that StackOverflow has an openly licensed dataset at all sets it apart from other platforms carving out similar deals.

I don’t see AI or LLMs as inevitable. But we do need to find a way to build and maintain a commons in an environment that will continue to include a mixture of for-profit and not-for-profit use cases, in ways that maximise value while minimising harms.

Taking a scorched earth policy whenever a platform does a deal with an AI developer is not a good way to maintain and build a thriving commons.

One thought on “Acceptable answers only”

  1. I think one of the problems Stack Overflow will increasingly face is that folk will be having the “how do I?” and “why doesn’t this work?” conversations in their editor’s AI chat bar, and won’t be posting questions or self-answered questions, or being led into Stack Overflow where they might then be minded to answer questions, perhaps while searching for other things.
