There are lots of recent examples of researchers collecting and releasing datasets which end up raising serious ethical and legal concerns. The IBM facial recognition dataset is just one example that springs to mind.
I read an interesting post exploring how facial recognition datasets are being widely used despite being taken down due to ethical concerns.
The post highlights how these datasets, despite being retracted, are still widely used in research. This is partly because the original datasets are still circulating via mirrors of the original files, but also because they have been incorporated into derived datasets which are still being distributed with the original contents intact.
The authors describe how just one dataset, DukeMTMC, was used in more than 135 papers after being retracted, 116 of them drawing on derived datasets. Some datasets have many derivatives; one example cited has been used in 14 derived datasets.
The research raises important questions about how datasets are published, mirrored, used and licensed. There’s a lot to unpack there and I look forward to reading more about the research. The concerns around open licensing are reminiscent of similar debates in the open source community leading to a set of “ethical open source licences”.
But the issue I wanted to highlight here is the difficulty of tracking the mirroring and reuse of datasets.
If it were easier to monitor important changes to datasets, then it would be easier to:
- maintain mirrors of data
- retract or remove data that breached laws or social and ethical norms
- update derived datasets to remove or amend data
- re-run analyses against datasets which have seen significant corrections or revisions
- assess the impacts of poor quality or unethically shared data
- proactively notify relevant communities of potential impacts relating to published data
- monitor and review the reasons why datasets get retracted
- …etc, etc
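As a concrete illustration of the kind of lightweight infrastructure that could support several of these activities, a minimal sketch of dataset change monitoring using a checksum manifest follows. Everything here (the function names, the manifest layout) is my own invention for illustration, not something proposed by the researchers: you snapshot a hash of every file in a dataset, then diff snapshots to detect additions, removals and modifications, e.g. data quietly changing in a mirror.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file's contents so any change to the data is detectable."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(dataset_dir: Path) -> dict:
    """Record a checksum for every file in the dataset directory."""
    return {
        str(p.relative_to(dataset_dir)): sha256_of(p)
        for p in sorted(dataset_dir.rglob("*"))
        if p.is_file()
    }


def diff_manifests(old: dict, new: dict) -> dict:
    """Report which files were added, removed, or changed between snapshots."""
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(k for k in set(old) & set(new) if old[k] != new[k]),
    }
```

A publisher could ship the manifest alongside the data; a mirror or derived dataset could then be checked against it, and a retraction could be expressed as a signed update to the manifest listing the files to remove.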
The importance of these activities can be seen in other contexts.
Principle T3 (Orderly Release) of the UK Statistics Authority’s Code of Practice explains that scheduled revisions and unscheduled corrections to statistics should be transparent, and that organisations should have a specific policy for how they are handled.
More broadly, product recalls and safety notices are standard for consumer goods. Maybe datasets should be treated similarly?
This feels like an area that warrants further research, investment and infrastructure. At some point we need to raise our sights from setting up ever more portals and endlessly refining their feature sets, and think more broadly about the system and ecosystem we are building.