How can publishing more data decrease the value of existing data?

Last month I wrote a post looking at how publishing new data might increase the value of existing data. I ended up listing seven different ways including things like improving validation, increasing coverage, supporting the ability to link together datasets, etc.

But that post only looked at half of the issue. What about the opposite? Are there ways in which publishing new data might reduce the value of data that’s already available?

The short answer is: yes, there are. But before jumping into that, let’s take a moment to reflect on the language we’re using.

A note on language

The original post was prompted by an economic framing of the value of data. I was exploring how the option value for a dataset might be affected by increasing access to other data. While this post is primarily looking at how option value might be reduced, we need to acknowledge that “value” isn’t the only way to frame this type of question.

We might also ask, “how might increasing access to data increase potential for harms?” As part of a wider debate around the issues of increasing access to data, we need to use more than just economic language. There’s a wealth of good writing about the impacts of data on privacy and society which I’m not going to attempt to precis here.

It’s also important to highlight that “increasing value” and “decreasing value” are relative terms.

Increasing the value of existing datasets will not seem like a positive outcome if your goal is to attempt to capture as much value as possible, rather than benefit a broader ecosystem. Similarly, decreasing value of existing data, e.g. through obfuscation, might be seen as a positive outcome if it results in better privacy or increased personal safety.

Decreasing value of existing data

Having acknowledged that, let’s try to answer the earlier question. In what ways can publishing new data reduce the value we can derive from existing data?

Increased harms leading to retraction and reduced trust

Publishing new data always runs the risk of re-identification and the enabling of unintended inferences. While the impacts of these harms are likely to be most directly felt by both communities and individuals, there are also broader commercial and national security issues. Together, these issues might ultimately reduce the value of the existing data ecosystem in several ways:

Existing datasets may need to be retracted, have their scope changed, or have their circulation reduced in order to avoid further harm. Data privacy impact assessments will need to be updated as the contexts in which data is being shared and published change.
Increased concerns over potential privacy impacts might lead organisations to choose not to increase access to similar or related datasets.
Increased concerns might also lead communities and individuals to reduce the amount of data they are willing to share with previously trusted sources.

This can lead to a reduction in the overall coverage, quality and linking of data across a data ecosystem. It’s likely to be one of the most significant impacts of poorly considered data releases. It can be mitigated through proper impact assessments, consultation and engagement.
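To make the re-identification risk a little more concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the two datasets, the field names, and the choice of quasi-identifiers (`postcode_area`, `birth_year`, `sex`) are all assumptions, and real linkage attacks are usually fuzzier than an exact match. The idea is simply that an existing dataset that looked safe on its own can start to leak once a new dataset sharing the same attributes is published alongside it.

```python
from collections import Counter

# Hypothetical existing dataset: no names, so it was assumed safe to share.
existing_health_records = [
    {"postcode_area": "BA1", "birth_year": 1975, "sex": "F", "condition": "asthma"},
    {"postcode_area": "BA1", "birth_year": 1975, "sex": "M", "condition": "diabetes"},
    {"postcode_area": "BA2", "birth_year": 1982, "sex": "F", "condition": "migraine"},
    {"postcode_area": "BA2", "birth_year": 1982, "sex": "F", "condition": "asthma"},
]

# Hypothetical newly published dataset: names plus the same attributes.
new_membership_list = [
    {"name": "Alice Example", "postcode_area": "BA1", "birth_year": 1975, "sex": "F"},
    {"name": "Bob Example", "postcode_area": "BA1", "birth_year": 1975, "sex": "M"},
    {"name": "Carol Example", "postcode_area": "BA2", "birth_year": 1982, "sex": "F"},
]

# Assumed quasi-identifiers: fields that appear in both datasets.
QUASI_IDENTIFIERS = ("postcode_area", "birth_year", "sex")


def key(record):
    """Build the combination of quasi-identifier values for one record."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def reidentified(existing, new):
    """Return (name, condition) pairs where the link is unambiguous.

    A link counts only when the quasi-identifier combination is unique
    in both datasets, so exactly one person matches exactly one record.
    """
    existing_counts = Counter(key(r) for r in existing)
    new_counts = Counter(key(r) for r in new)
    # Only consulted for keys that occur exactly once in `existing`.
    existing_by_key = {key(r): r for r in existing}

    matches = []
    for person in new:
        k = key(person)
        if existing_counts[k] == 1 and new_counts[k] == 1:
            matches.append((person["name"], existing_by_key[k]["condition"]))
    return matches


if __name__ == "__main__":
    for name, condition in reidentified(existing_health_records, new_membership_list):
        print(f"{name} can be linked to a record with condition: {condition}")
```

In this toy example two of the three named people link to exactly one health record. The third shares her attributes with two records, so she is not uniquely linked. This is the kind of check an updated impact assessment might run before a new release, although a real assessment would need to consider far more than exact matches.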

Reducing overall quality

Newly published data might be intended to increase coverage, enrich, link, validate or otherwise improve existing data. But it might actually have the opposite effect because it’s of poor quality. I’ve briefly touched on this in a previous post on fictional data.

Publication of poor quality data might be unintended. For example, an organisation may just be publishing the data it has to help address an issue, without properly considering or addressing underlying problems with it. Or a researcher may publish data that contains honest mistakes.

But publication of poor quality data might also be deliberate, for example as spam or misinformation intended to “poison the well”.

More subtly, practices like p-hacking and falsification of data, which might be intended to bring a short-term direct benefit to the publisher or author, can cause longer-term problems by undermining the use of other datasets.

This is why understanding and documenting the provenance of data, monitoring of retractions, fixes and updates to data, and the ability to link analyses with datasets are all so important.

Creating unnecessary competition or increasing friction

Publishing new datasets containing new observations and data about an area or topic of interest can lead to positive impacts, e.g. by increasing confidence or coverage. But datasets are also competing with one another. The same types of data might be available from different sources, but under different licences, access arrangements, pricing, etc.

This competition isn’t necessarily positive. For example, the data ecosystem might not benefit as much from the network effects that follow from linking data because key datasets are not linked or cannot be used together. Incompatible and competing datasets can add friction across an ecosystem.

Building poor foundations

Data is often published as a means of building stronger data infrastructure for a sector, or to address a specific challenge. But if that data is poorly maintained or is not sustainably funded, then the energy that goes into building the communities, tools and other datasets around that infrastructure might be wasted.

That reduces the value of existing datasets which might otherwise have provided a better foundation to build upon, or whose quality is dependent on the shared infrastructure. While this issue is similar to the previous one about competition, its root causes and impacts are slightly different.

As I noted in my earlier post, I don’t think this is an exhaustive list, and it can be improved by contributions. Leave a comment if you have any thoughts.