In my last post I explored how we might better support the use of datasets. To do that I applied the BASEDEF framework to outline the ways in which communities might collaborate to help unlock more value from individual datasets.
But what if we changed our focus from supporting discovery and use of datasets and instead focused on helping people explore specific types of problems or questions?
Our paradigm around data discovery is based on helping people find individual datasets. But unless a dataset has been designed to answer the specific question you have in mind, then it’s unlikely to be sufficient. Any non-trivial analysis is likely to need multiple datasets.
We know that data is more useful when it is combined, so why isn’t our approach to discovery based around identifying useful collections of datasets?
A cooking metaphor
To explore this further let’s use a cooking metaphor. I love cooking.
Many cuisines are based on a standard set of elements. Common spices or ingredients that become the base of most dishes. Like a mirepoix, a sofrito, the holy trinity of Cajun cooking, or the mother sauces in French cuisine.
As you learn to cook you come to appreciate how these flavour bases and sauces can be used to create a range of dishes. Add some extra spices and ingredients and you’ve created a complete dish.
Recipes help us consistently recreate these sauces.
A recipe consists of several elements. It will have a set of ingredients and a series of steps to combine them. A good recipe will also include some context. For example, some background on the origins of the recipe and descriptions of unusual spices or ingredients. It might provide some things to watch out for during the cooking (“don’t burn the spices”) or suggest substitutions for difficult-to-source ingredients.
Our current approach to dataset discovery involves trying to document the provenance of an individual ingredient (a dataset) really well. We aren’t helping people combine them together to achieve results.
Efforts to improve dataset metadata, documentation and provenance reporting are important. Projects like the dataset nutrition label are great examples of that. We all want to be ethical, sustainable cooks. To do that we need to make informed choices about our ingredients.
But, to whisk these food metaphors together, nutrition labels are there to help you understand what’s gone into your supermarket pasta sauce. They don’t give you a recipe to cook it from scratch for yourself, or an idea of how to use the sauce to make a tasty dish.
Recipes for data-informed problem solving
I think we should think about sharing dataset recipes: instructions for how to mix up a selection of dataset ingredients. What would they consist of?
Firstly, the recipe would need to be based around a specific type of question, problem or challenge. Examples might include:
- How can I understand air quality in my city?
- How is deprivation changing in my local area?
- What are the impacts of COVID-19 in my local authority?
Secondly, a recipe would include a list of datasets that have to be sourced, prepared and combined together to explore the specific problem. For example, if you’re exploring impacts of COVID-19 in your local authority you’re probably going to need:
- demographic data from the most recent census
- spatial boundaries to help visualise and present results
- information about deprivation to help identify vulnerable people
Those three datasets are probably the holy trinity of any local spatial analysis?
Finally, you’re going to need some instructions for how to combine the datasets together. The instructions might identify some tools you need (Excel or QGIS), reference some techniques (Reprojection) and maybe some hints about how to substitute for key ingredients if you can’t get them in your local area (FOI).
The recipe might also suggest ways to vary it for different purposes: add a sprinkle of Companies House data to understand your local business community, and a dash of OpenStreetMap to identify greenspaces.
As a time saver maybe you can find some pre-made versions of some of the steps in the recipe?
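To make the shape of a recipe a bit more concrete, here is a minimal sketch of how one might be written down as structured data. This is only an illustration, not a proposed standard: the class names, the field names and the wording of the steps are all my own assumptions, and the example simply restates the COVID-19 recipe described above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Ingredient:
    """One dataset the recipe needs (hypothetical structure)."""
    name: str
    purpose: str
    substitute: str = ""  # what to try if the dataset is hard to source


@dataclass
class DatasetRecipe:
    """A question, plus the datasets, tools and steps used to explore it."""
    question: str
    ingredients: list[Ingredient]
    tools: list[str] = field(default_factory=list)
    steps: list[str] = field(default_factory=list)
    variations: list[str] = field(default_factory=list)

    def shopping_list(self) -> list[str]:
        """Return just the names of the datasets that need to be sourced."""
        return [ingredient.name for ingredient in self.ingredients]


# Illustrative recipe only: the step wording is an assumption, not a tested method.
covid_recipe = DatasetRecipe(
    question="What are the impacts of COVID-19 in my local authority?",
    ingredients=[
        Ingredient("census demographics", "describe the local population"),
        Ingredient("spatial boundaries", "visualise and present results"),
        Ingredient(
            "deprivation data",
            "identify vulnerable people",
            substitute="ask for it via an FOI request",
        ),
    ],
    tools=["Excel", "QGIS"],
    steps=[
        "Source each dataset",
        "Reproject the boundaries so they line up with the other data",
        "Join the demographic and deprivation data to the boundaries",
    ],
    variations=[
        "Add Companies House data to understand the local business community",
        "Add OpenStreetMap data to identify greenspaces",
    ],
)

print(covid_recipe.shopping_list())
```

Even something this simple separates the question, the ingredients and the method, which is the part that dataset-level metadata doesn't capture.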
Examples in the wild
OK, it’s easy to come up with a metaphor and an idea. But would this actually meet a need? There are a few reasons why I’m reasonably confident that dataset recipes could be helpful. Mostly because I can see this same approach re-appearing in some related contexts. For example:
- The community tutorials published by Digital Ocean aren’t about using data, but include recipes for common technical tasks
- The service recipes published by The Catalyst
- This series of webinars from the UK Data Service that explore what data can help you to understand political behaviour, spoken languages, religion and mental health in the UK
If you have other examples then let me know in the comments or on Twitter.
How can dataset recipes help?
I think there’s a whole range of ways in which these types of recipe can be useful.
Data analysis always starts by posing a question. Documenting how datasets can be applied to specific questions will make them easier to find on search engines. It just fits better with what people want to do.
Data discovery is important during periods where there is a sudden influx of new potential users. For example, where datasets have just been published under an open licence and are now available to more people, for a wider range of purposes.
In my experience data analysts and scientists who understand a domain, e.g. population or transport modelling, have built up a tacit understanding of what datasets are most useful in different contexts. They understand the limitations and the process of combining datasets together. This thread from Chris Gale with a recipe about doing spatial analysis using PHE’s COVID-19 data is a perfect example. Documenting and sharing this knowledge can help others to do similar analyses. It’s like a cooking masterclass.
Discovery is also difficult when there is a sudden influx of new data available. Such as during this pandemic. Writing recipes is a good way to share learning across a community.
Documenting useful recipes might help us scale innovation across local areas.
Lastly, we’re still trying to understand which datasets are the most important parts of our local, national and international data infrastructure. We’re currently lacking any real quantitative information about how datasets are combined together. In the same way that recipes can be analysed to create ingredient networks, dataset recipes could be analysed to find out how datasets are being used together. We can then strengthen that infrastructure.
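As a rough sketch of what that analysis could look like, the snippet below counts how often pairs of datasets appear in the same recipe, which is the basic building block of an ingredient network. The recipes and dataset names here are invented purely for illustration; they are not real usage figures, and a real analysis would need a published collection of recipes to work from.

```python
from collections import Counter
from itertools import combinations

# Invented example recipes: each maps a question to the datasets it uses.
recipes = {
    "covid impacts": ["census", "boundaries", "deprivation"],
    "deprivation change": ["boundaries", "deprivation"],
    "air quality": ["boundaries", "air quality sensors"],
    "local business": ["boundaries", "census", "companies house"],
}


def dataset_usage(recipes):
    """Count how many recipes each dataset appears in."""
    usage = Counter()
    for datasets in recipes.values():
        usage.update(set(datasets))
    return usage


def co_occurrence(recipes):
    """Count how often each pair of datasets is used in the same recipe."""
    pairs = Counter()
    for datasets in recipes.values():
        # Sort so that each pair is always counted under the same key.
        for a, b in combinations(sorted(set(datasets)), 2):
            pairs[(a, b)] += 1
    return pairs


usage = dataset_usage(recipes)
pairs = co_occurrence(recipes)

# Datasets that turn up in many recipes are candidates for core infrastructure.
for name, count in usage.most_common():
    print(f"{name}: used in {count} recipes")

# Pairs with high counts are the edges of the ingredient network.
for (a, b), count in pairs.most_common():
    print(f"{a} + {b}: combined in {count} recipes")
```

The pair counts can be read as weighted edges between datasets, so the same output could be loaded into a graph tool to look for the dataset equivalents of a mirepoix.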
If you’ve built something that helps people publish dataset recipes then send me a link to your app. I’d like to try it.