Getting the most value from data, whilst minimising its harmful impacts, is a community activity. Datasets need to be governed and published well. Most of that responsibility falls on the data publisher, because the choices they make shape data ecosystems.
But other people have a role to play too. Being a good data user means engaging with that process.
Helping others to find data, and to find the value in it, feels particularly important at the moment. During the pandemic many new datasets are becoming available, and there are lots of questions to be answered. Some of them can be answered through better use of data.
So, how can communities work together to support use of data?
There are a lot of different ways to explore that question. But there’s a framework called BASEDEF, created by the open source community, which I find helpful.
BASEDEF stands for Blog, Apply, Suggest, Extend, Document, Evangelize and Fix. It describes the different types of contributions that can support an open source project. It can also be applied to help organise a small team in doing that work. Here’s a handy cheat sheet.
But the framework can also be applied to the task of supporting the use of an openly licensed dataset. Let’s run through the framework with that in mind.
Blog

You can write about a dataset to help others to discover it. You can help explain the potential value of applying the dataset to specific problems. Or perhaps you can see some downsides that others should consider.
Writing about how a dataset has been useful to you, by describing how you’ve successfully applied it in a project, will also help others see its potential value.
Apply

You can show how a dataset can be used by creating something with it. You might do a detailed analysis of the data, but some simpler contributions can also be helpful.
For example you might create a simple visualisation. Or write and publish some code that illustrates how the dataset can be accessed and used. You could publish a quick demo showing how the dataset can be imported and used in some frequently used tools and platforms.
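As a sketch of the kind of code you might publish, the snippet below loads and summarises a small invented sample of case counts. The dataset, column names, and figures are all hypothetical; in practice you would read the published CSV file itself.

```python
import csv
import io

# Hypothetical sample of an openly licensed dataset, inlined so this
# example is self-contained. A real demo would download the published file.
SAMPLE = """region,cases,date
North,12,2020-05-01
South,7,2020-05-01
North,15,2020-05-02
South,9,2020-05-02
"""

def summarise(csv_text):
    """Return total cases per region from a CSV with region/cases columns."""
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["region"]] = totals.get(row["region"], 0) + int(row["cases"])
    return totals

print(summarise(SAMPLE))
```

Even something this small shows a newcomer which columns exist, how they're typed, and one way to aggregate them.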
At the moment everyone is a bit tired of charts and graphs. And I agree with the first principle in the visualisation design principles for the pandemic. But a helpful visualisation can do a range of things. Visualisation can be exploratory rather than explanatory.
A visualisation could support other people in understanding the shape of a dataset, to inform their analysis and interpretation of it. It can help identify outliers, gaps, or highlight some of the richness in the data. I'd recommend making it clear when you're doing this type of visualisation, rather than trying to derive specific insights.
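To sketch the exploratory side of this, the snippet below checks an invented daily series for gaps and outliers, the kinds of features an exploratory chart could then surface. The dates, values, and thresholds are all made up for illustration.

```python
from datetime import date, timedelta
from statistics import median

# Hypothetical daily counts: one date is missing and one value jumps
# implausibly -- exactly what exploratory visualisation should reveal.
observations = {
    date(2020, 5, 1): 10,
    date(2020, 5, 2): 12,
    # 2020-05-03 is missing from the series
    date(2020, 5, 4): 11,
    date(2020, 5, 5): 95,  # an implausible jump
}

def find_gaps(obs):
    """Dates missing from an otherwise daily series."""
    days = sorted(obs)
    expected = {days[0] + timedelta(days=d)
                for d in range((days[-1] - days[0]).days + 1)}
    return sorted(expected - set(days))

def find_outliers(obs, threshold=5.0):
    """Values far from the median, measured in median absolute deviations."""
    values = list(obs.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return {d: v for d, v in obs.items() if abs(v - med) > threshold * mad}

print(find_gaps(observations))
print(find_outliers(observations))
```

Flagging these features, rather than interpreting them, keeps the work exploratory and leaves the explanation to people who know the data's context.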
Suggest

Read the documentation. Download and explore the dataset. Ask questions. Give feedback.
Make suggestions to the publisher about changes they could make to publish the data better. Rather than just offer academic critique, be clear about how suggested changes will support your needs or those of your community.
Extend

The freedoms granted by an open licence allow you to enrich and improve a dataset.
Sometimes the smallest changes can have the most impact. Convert the data into other common or standard formats. Extract data from spreadsheets into CSV files. Convert data published in more complex formats, or via APIs, into simpler tabular data to make it more accessible to analysts rather than programmers.
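As an illustration of that kind of conversion, here's a minimal sketch that flattens a hypothetical API-style JSON response into simple CSV. The field names and structure are invented; a real conversion would follow the publisher's actual schema.

```python
import csv
import io
import json

# Hypothetical API response: nested JSON that is awkward for spreadsheet
# users. Flattening it into CSV makes it accessible to analysts.
api_response = json.loads("""
{
  "results": [
    {"area": {"name": "North", "code": "N1"}, "metrics": {"cases": 27}},
    {"area": {"name": "South", "code": "S1"}, "metrics": {"cases": 16}}
  ]
}
""")

def to_csv(payload):
    """Flatten the nested records into simple tabular CSV text."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["area_name", "area_code", "cases"])
    for record in payload["results"]:
        writer.writerow([
            record["area"]["name"],
            record["area"]["code"],
            record["metrics"]["cases"],
        ])
    return out.getvalue()

print(to_csv(api_response))
```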
Or maybe you can enrich a dataset by adding identifiers that will allow it to be linked to other sources. Do the work of merging with other datasets to bring in more context.
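A minimal sketch of that kind of enrichment is below. The mapping of area names to ONS-style codes and the population figures are invented for illustration; the point is the shape of the work, i.e. adding a shared identifier and then joining on it.

```python
# Hypothetical lookup from area names to standard identifiers.
# Real enrichment would use the relevant published code list.
AREA_CODES = {"North": "E12000001", "South": "E12000008"}

# Hypothetical context dataset keyed by the same identifiers.
POPULATION = {"E12000001": 2_600_000, "E12000008": 9_100_000}

def enrich(rows):
    """Add a standard identifier to each row, then merge in population."""
    enriched = []
    for row in rows:
        code = AREA_CODES.get(row["area"])
        enriched.append({
            **row,
            "area_code": code,
            "population": POPULATION.get(code),
        })
    return enriched

data = [{"area": "North", "cases": 27}, {"area": "South", "cases": 16}]
print(enrich(data))
```

Once the identifiers are in place, anyone can repeat the join against other sources keyed the same way.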
The downside here is that if the original data changes your extended version will get out of date. If you can’t commit to keeping your version up to date, then be sure to share your code and document your methods.
Allow others to repeat the steps you've taken. And don't forget to suggest your improvements to the publisher.
Document

Write additional documentation to fill in gaps where the publisher has not provided sufficient background or explanation. Explain technical concepts or academic terms to a non-specialist audience.
As a user of the data, you're able to write that documentation from a perspective that reflects the needs of your specific community and the kinds of questions you need to ask. The original publisher might not have all that context or understand those needs, so this work can be really helpful.
Evangelize

Email people that might have a need for the data. Tweet about it to a wider community. Highlight it in a presentation. Talk about it over coffee with colleagues.
Fix

If the dataset is collaboratively maintained then go ahead and fix errors and omissions. If you're not confident about making a fix, then submit an error report. In addition to fixing errors you might be able to help verify that data is correct.
If a dataset isn’t collaboratively maintained then, when you find errors, be sure to flag them to the publisher and highlight the issue for others. Or consider publishing an enriched version with fixes applied.
This framework isn’t perfect. The name is a bit clunky for a start. But there are a couple of things that I like about it.
Firstly, it recognises that not all contributions need to be technical. There’s room for people to contribute different skills in different ways.
Secondly, the elements overlap and reinforce one another. Writing documentation and blogging about how you’ve used a dataset helps to evangelise it. Enriching a dataset can help demonstrate in a practical way how a publisher can improve how data is published.
Finally, it serves to highlight some important aspects of community curation which aren’t always well supported in existing data platforms and portals. We can do better here.
If you’re interested in working on adapting this further then I’m happy to chat! It might be useful to have a cheat sheet that supports its application to data, and more examples of how to do these different elements well.