Consulting Spreadsheet Detective, Season 1

I was very pleased to announce my new TV series today, loosely based on real events. More details here in the official press release.

FOR IMMEDIATE RELEASE

Coming to all major streaming services in 2021 will be the exciting new series: “Turning the Tables”.

Delving into the murky corporate world of poorly formatted spreadsheets and nefarious macros, each episode of this new series will explore another unique mystery.

When the cells lie empty, who can help the CSV:PI team pivot their investigation?

When things don’t add up, who can you turn to but an experienced solver?

Who else but Leigh Dodds, Consulting Spreadsheet Detective?

This smart, exciting and funny new show throws deductive reasoner Dodds into the mix with Detectives Rose Cortana and Colm Bing, part of the crack new CSV:PI squad.

Rose: the gifted hacker. Quick to fire up an IDE, but slow to validate new friends.

Colm: the user researcher. Strong on empathy but with an enigmatic past that hints at time in the cells.

What can we expect from Season 1?

Episode 1: #VALUE!

In his first case, Dodds has to demonstrate his worth to a skeptical Rose and Colm, by fixing a corrupt formula in a startup valuation.

Episode 2: #NAME?

A personal data breach leaves the team in a race against time to protect the innocent. A mysterious informant known as VLOOKUP leaves Dodds a note.

Episode 3: #REF!

A light-hearted episode where Dodds is called in to resolve a mishap with a 5-a-side football team matchmaking spreadsheet. Does he stay between the lines?

Episode 4: #NUM?

A misparsed gene name leads a researcher into recommending the wrong vaccine. It’s up to Dodds to fix the formula.

Episode 5: #NULL!

Sometimes it’s not the spreadsheet that’s broken. Rose and Colm have to educate a researcher on the issue of data bias, while Dodds follows up references to the mysterious Macro corporation.

Episode 6: #DIV/0?

Chasing down an internationalisation issue, Dodds, Rose and Colm race around the globe following a trail of error messages. As Dodds gets unexpectedly separated from the CSV:PI team, Rose and Colm unmask the hidden cell containing the mysterious VLOOKUP.

In addition to the six episodes in season one, a special feature length episode will air on National Spreadsheet Day 2021:

Feature Episode: #####

Colm’s past resurfaces. Can he grow enough to let the team see the problem, and help him validate his role in the team?

Having previously only anchored documentaries, like “Around the World with 80,000 Apps” and “Great Data Journeys”, taking on the eponymous role will be Dodds’ first foray into fiction. We’re sure he’ll have enough pizazz to wow even the harshest critics.

“Turning the Tables” will feature music composed by Dan Barrett.

Tip for improving standards documentation

I love a good standard. I’ve written about them a lot here.

As it’s #WorldStandardsDay I thought I’d write a quick post to share something that I’ve learned from leading and supporting some standards work.

I’ve already shared this with a number of people who have asked for advice on standards work, and in some recent user research interviews I’ve participated in. So it makes sense to write it down.

In the ODIHQ standards guide, we explained that at the end of your initial activity to develop a standard, you should plan to produce a range of outputs. This includes a variety of tools and guidance that help people use the standard. You will need much more than just a technical specification.

To plan for the different types of documentation that you may need I recommend applying this “Grand Unified Theory of Documentation”.

That framework highlights that four different types of documentation are intended to be used by different audiences to address different needs. The content designers and writers out there reading this will be rolling their eyes at this obvious insight.

Here’s how I’ve been trying to apply it to standards documentation:

Reference

This is your primary technical specification. It’ll have all the detail about the standard, the background concepts, the conformance criteria, etc.

It’s the document of record that captures all of the hard work you’ve invested in building consensus around the standard. It fills a valuable role as the document you can point back to when you need to clarify or confirm what was agreed.

But, unless it’s a very simple standard, it’s going to have a limited audience. A developer looking to implement a conformant tool, API or library may need to read and digest all of the detail. But most people want something else.

Put the effort into ensuring it’s clear, precise and well-structured. But plan to also produce three additional categories of documentation.

Explainers

Many people just want an overview of what it is designed to do. What value will it provide? What use cases was it designed to support? Why was it developed? Who is developing it?

These are higher-level introductory questions. The type of questions that business stakeholders want answered before they sign off on implementing a standard, so it goes onto a product roadmap.

Explainers are also useful background reading for a developer ahead of taking a deeper dive. If there are some key concepts that are important to understanding the design and implementation of a standard, then write an explainer.

Tutorials

A simple, end-to-end description of how to apply the standard. E.g. how to publish a dataset that conforms to the standard, or export data from an existing system.

A tutorial will walk you through using a specific set of tools, frameworks or programming languages. The end result being a basic implementation of the standard. Or a simple dataset that passes some basic validation checks. A tutorial won’t cover all of the detail, it’s enough to get you started.
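To make “basic validation checks” a little more concrete, here is a minimal sketch in Python of the kind of check a tutorial might end with. It assumes a made-up standard for describing bus stops that requires four columns; the column names and the `validate_csv` function are hypothetical and would be replaced by whatever your specification actually defines.

```python
import csv
import io

# Hypothetical required columns; a real standard would define its own.
REQUIRED_COLUMNS = ["id", "name", "latitude", "longitude"]


def validate_csv(text):
    """Return a list of human-readable problems found in the CSV text."""
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []

    # Check the header first: without the right columns nothing else applies.
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        problems.append("missing columns: " + ", ".join(missing))
        return problems

    # Data rows start on line 2, after the header.
    for line_no, row in enumerate(reader, start=2):
        if not row["id"]:
            problems.append(f"line {line_no}: empty id")
        for col in ("latitude", "longitude"):
            try:
                float(row[col])
            except (TypeError, ValueError):
                problems.append(f"line {line_no}: {col} is not a number")
    return problems


sample = "id,name,latitude,longitude\n1,Stop A,51.38,-2.36\n,Stop B,abc,-2.36\n"
for problem in validate_csv(sample):
    print(problem)
```

A check like this is deliberately shallow. It is enough to tell a newcomer they are on the right track, while the reference specification remains the place for full conformance criteria.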

You may need several tutorials to support different types of users. Or different languages and frameworks.

If you’ve produced a tool, like a validator or a template spreadsheet to support data publication, you’ll probably need a tutorial for each of them unless they are very simple to use.

Tutorials are gold for a developer who has been told: “please implement this standard, but you only have 2 days to do it”.

How-Tos

Short, task oriented documentation focused on helping someone apply the standard. E.g. “How to produce a CSV file from Excel”, “Importing GeoJSON data in QGIS”, “Describing a bus stop”. Make them short and digestible.

How-Tos can help developers build from a tutorial, to a more complete implementation. Or help a non-technical user quickly apply a standard or benefit from it.

You’ll probably end up with lots of these over time. Drive creating them from the types of questions or support requests you’re getting. Been asked how to do something three times? Write a How-To.

There’s lots more that can be said about standards documentation. For example you could add Case Studies to this list. And it’s important to think about whether written documentation is the right format. Maybe your Explainers and How-Tos can be videos?

But I’ve found the framework to be a useful planning tool. Have a look at the documentation for more tips.

Producing extra documentation to support the launch of a standard, and then investing in improving and expanding it over time will always be time well-spent.

A letter from the future about numbers

It’s odd now looking at early 21st century content in the Internet Archive. So little nuance.

It feels a little like watching those old black and white movies. All that colour which was just right there. But now lost. Easy to imagine that life was just monochrome. Harder to imagine the richer colours.

Or at least hard for me. There are AIs that will imagine it all for you now, of course. There have been for a while. They’ll repaint the pictures using data they’ve gleaned from elsewhere. But it’s not the film that is difficult to look at. It’s the numbers.

How did you manage with just those bare numerals?

If I showed you, a 21st century reader, one of our numbers you wouldn’t know what it was. You wouldn’t be able to read it.

Maybe you’ve seen that film Arrival? Based on a book by Ted Chiang? Remember the alien writing that was so complex and rich in meaning? That’s what our numbers might look like to you. You’d struggle to decode them.

Oh, the rest of it is much the same. The text, emojis and memes. Everything is just that bit richer, more visual. More nuanced. It’s even taught in schools now. Standardised, tested and interpreted for all. It’d be familiar enough.

You’d struggle with the numbers though. They’d take much more time to learn.

Not all of them. House numbers. Your position in the queue. The cost of a coffee. Those look exactly the same. Why would we change those?

It’s the important numbers that look different. The employment figures. Your pension value. Your expected grade. The air quality. The life-changing numbers. Those all look very different now.

At some point we decided that those numbers needed to be legible in entirely different ways. We needed to be able to see (or hear, or feel) the richness and limitations in the most important numbers. It was, it turned out, the only way to build that shared literacy.

To imagine how we got there, just think about how people have always adapted and co-opted digital platforms and media for their own ends. Hashtags and memes.

Faced with the difficulty of digging behind the numbers – the need to search for sample sizes, cite the sources, highlight the bias, check the facts –  we had to find a different way. It began with adding colour, toying with fonts and diacritics.

5—a NUMBER INTERPOLATED.

It took off from there. Layers of annotations becoming conventions and then standards. Whole new planes and dimensions in unicode.

42—a PROJECTION based on a SIGNIFICANT POPULATION SAMPLE.

All of the richness, all of the context made visible right there in the number.

27-30—a PREDICTED RANGE created by a BAYESIAN INTERPOLATION over a RECENT SAMPLE produced by an OFFICIAL SOURCE.

180—an INDICATOR AUTOMATICALLY SELECTED by a DEEP LEARNING system, NO HUMAN INTERVENTION.

Context expressed as colour and weight and strokes in the glyphs. You can just read it all right off the digits. There and there. See?

Things aren’t automatically better of course. Numbers aren’t suddenly to be more trusted. Why would they be?

It’s easier to see what’s not being said. It’s easier to demand better. It’s that little bit harder to ignore what’s before your eyes. It moves us on in our debates or just helps us recognise when the reasons for them aren’t actually down to the numbers at all.

It’s no longer acceptable to elide the detail. The numbers just look wrong. Simplistic. Black and white.

Which is why it’s difficult to read the Internet Archive sometimes.

We’ve got AIs that can dream up the missing information. Mining the Archive for the necessary provenance and adding it all back into the numbers. Just like adding colour to those old films, it can be breathtaking to see. But not in a good way. How could you have deluded yourselves and misled each other so easily?

I’ve got one more analogy for you.

Rorschach tests have long been consigned to history. But one of our numbers – the life-changing ones – might just remind you of a colourful inkblot. And you might accuse us of just reading things into them. Imagining things that just aren’t there.

But numbers are just inkblots. Shapes in which we choose to see different aspects of the world. They always have been. We’ve just got a better palette.

Garden Retro 2020

I’ve been growing vegetables in our garden for years now. I usually end up putting the garden “to bed” for the winter towards the end of September. Harvesting the last bits of produce, weeding out the vegetable patches and covering up the earth until the Spring.

I thought I’d also do a bit of a retro to help me reflect on what worked and what didn’t work so well this year. We’ve had some mixed successes, so there are some things to reflect and improve on.

What did I set out to do this year?

This year I wanted to do a few things:

  • Grow some different vegetables
  • Get more produce out of the garden
  • Have fewer gluts of a single item (no more courgettes!) and limits on wastage
  • Have a more continuous harvest

What changes did I make?

To help achieve my goals, I made the following changes this year:

  • Made some of the planting denser, to try and get more into the same space
  • Planted some vegetables in pots and not just the vegetable patches, to make use of all available growing space
  • Tried to germinate and plant out seedlings as early as possible
  • Had several plantings of some vegetables, to allow me to harvest blocks of vegetables over a longer period. To help with this I produced a planting layout for each bed at the start of the year
  • Paid closer attention to the dates when produce was due to be ripe, by creating a Google calendar of expected harvest dates
  • Bought some new seeds, as I had a lot of older seeds

What did we grow?

The final list for this year was (new things in bold):

Basil, Butternut Squash, Carrots, Coriander, Cucumber, Lettuce, Pak Choi, Peas, Potato, Radish, Shallots, Spinach, Spring Onion, Sweetcorn

So, not as many new vegetables as I’d hoped, but I did try some different varieties.

What went well?

  • Had a great harvest overall, including about a kilo of fresh peas, 6kg of potatoes, couple of dozen cucumbers, great crop of spring onions and carrots
  • Having the calendar to help guide planting of seeds and planting in blocks across different beds. This definitely helped to limit gluts and spread out the availability of veg
  • Denser planting of peas and giving them a little more space worked well
  • Grew a really great lettuce 
  • Freezing the peas immediately after harvesting, so we could spread out use
  • Making pickled cucumbers and a carrot pickle to preserve some of the produce
  • Using Nemaslug (as usual) to keep the slugs at bay. Seriously, this is my number 1 gardening tip
  • Spring onions grew just fine in pots
  • Spinach harvest was great. None of it went to waste
  • Being able to go to the garden and pick radish, spinach, carrots, spring onion and pak choi and throw them in the wok for dinner was amazing

What didn’t go so well?

  • Germinated and planted out 2-3 different sets of sweetcorn, squash and cucumber plants. Ended up losing them all in early frosts. Nothing more frustrating than seeing things die within a day or so of planting out
  • Basil just didn’t properly germinate or grow this year. Tried 3-4 plantings, ended up with a couple of really scrawny plants. Not sure what happened there. They were in pots but were reasonably well watered.
  • Lost some decent lettuces to snails
  • Radish crop was pretty poor. Some good early harvest, but later sets were poor. I think I used some old seed. The close planting and not enough thinning also meant the plants ended up “leggy” and not growing sufficient bulbs
  • Tried coriander indoor and outdoor with mixed success. Like the radishes, they were pretty stringy. Managed to harvest some leaves but in the end, left them to go to seed and harvested those
  • Sweetcorn, after I did get some to grow, wasn’t great. Had some decent cobs on a few, but weakest harvest ever. Normally super reliable.
  • Spinach, Pak Choi and some Radishes went to bolt. So didn’t get the full harvest I might have done
  • Cucumbers I grew from seed. But ended up getting a couple of dozen from basically a single monster plant which spread all over the place. So, still had a massive glut of them. There are 7 in the kitchen right now.
  • Crap shallot harvest. Had about half a dozen

What will I do differently?

  • Thin the radishes more, use the early pickings in salads
  • Don’t rush to get the seedlings out too early in the year. This is the second year in a row where I’ve lost plants early on. Make sure to acclimatise them to the outdoors for longer
  • Apply Nemaslug at least twice, not just once a year at the start of the season
  • Try to find a way to control the slugs
  • While I watered regularly when it was very hot, I got lax when we had a wet period. Suspect this may have contributed to some plants going to bolt
  • Need to rotate stuff through the beds next year, to mix up planting
  • Look at where I can do companion planting, e.g. around the sweetcorn 
  • Going to expand the growing patch. The kids have outgrown their trampoline, so will be converting more of garden to beds next year
  • Add another 1-2 compost bins

Main thing I want to do next year is get a greenhouse. I’ve got my eye on this one. I want to grow tomatoes, chillis and peppers. It’ll also help me acclimatise some of the seedlings before properly planting out.

Having the space to grow vegetables is a privilege and I’m very glad and very lucky to have the opportunity.

Gardening can be time consuming and frustrating, but I love being able to cook with what I’ve grown myself. Getting out into the garden, doing something physical, seeing things grow is also a nice balm given everything else that is going on.

Looking forward to next year.

 

Four types of innovation around data

Vaughn Tan’s The Uncertainty Mindset is one of the most fascinating books I’ve read this year. It’s an exploration of how to build R&D teams drawing on lessons learned in high-end kitchens around the world. I love cooking and I’m interested in creative R&D and what makes high-performing teams work well. I’d strongly recommend it if you’re interested in any of these topics.

I’m also a sucker for a good intellectual framework that helps me think about things in different ways. I did that recently with the BASEDEF framework.

Tan introduces a nice framework in Chapter 4 of the book which looks at four broad types of innovation around food. These are presented as a way to help the reader understand how and where innovation creates impact in restaurants. The four categories are:

  1. New dishes – new arrangements of ingredients, where innovation might be incremental refinements to existing dishes, combining ingredients together in new ways, or using ingredients from different contexts (think “fusion”)
  2. New ingredients – coming up with new things to be cooked
  3. New cooking methods – new ways of cooking things, like spherification or sous vide
  4. New cooking processes – new ways of organising the processes of cooking, e.g. to help kitchen staff prepare a dish more efficiently and consistently

The categories at the top are more evident to the consumer, those lower down less so. But the impacts of new methods and processes are greater as they apply in a variety of contexts.

Somewhat inevitably, I found myself thinking about how these categories work in the context of data:

  1. New analyses (in place of dishes) – New derived datasets made from existing primary sources. Or new ways of combining datasets to create insights. I’ve used the metaphor of cooking to describe data analysis before, those recipes for data-informed problem solving help to document this stage to make it reproducible
  2. New datasets and data sources (in place of ingredients) – Finding and using new sources of data, like turning image, text or audio libraries into datasets, using cheaper sensors, finding a way to extract data from non-traditional sources, or using phone sensors for earthquake detection
  3. New methods for cleaning, managing or analysing data (in place of cooking methods) – which includes things like Jupyter notebooks, machine learning or differential privacy
  4. New processes for organising the collection, preparation and analysis of data (in place of cooking processes) – e.g. collaborative maintenance, developing open standards for data or approaches to data governance and collective consent?

The breakdown isn’t perfect, but I found the exercise useful to think through the types of innovation around data. I’ve been conscious recently that I’m often using the word “innovation” without really digging into what that means, how that innovation happens and what exactly is being done differently or produced as a result.

The categories are also useful, I think, in reflecting on the possible impacts of breakthroughs of different types. Or perhaps where investment in R&D might be prioritised and where ensuring the translation of innovative approaches into the mainstream might have most impact?

What do you think?

#TownscaperDailyChallenge

This post is a bit of a diary entry. It’s to help me remember a fun little activity that I was involved in recently.

I’d seen little gifs and screenshots of Townscaper on twitter for months. But then suddenly it was in early access.

I bought it and started playing around. I’ve been feeling like I was in a rut recently and wanted to do something creative. After seeing Jim Rossignol mention playing with Townscaper as a nightly activity, I thought I’d do something similar.

Years ago I used to do lunchtime hacks and experiments as a way to be a bit more creative than I got to be in my day job. Having exactly an hour to create and build something is a nice constraint. Forces you to plan ahead and do the simplest thing to move an idea forward.

I decided to try lunchtime Townscaper builds. Each one with a different theme. I did my first one, with the theme “Bridge”, and shared it on twitter.

Chris Love liked the idea and suggested adding a hashtag so others could do the same. I hadn’t planned to share my themes and builds every day, but I thought, why not? The idea was to try doing something different after all.

So I tweeted out the first theme using the hashtag.

That tweet is the closest thing I’ve ever had to a “viral” tweet. It’s had 53,523 impressions and over 650 interactions.

Turns out people love Townscaper. And are making lots of cool things with it.

Tweetdeck was pretty busy for the next few days. I had a few people start following me as a result, and suddenly felt a bit pressured. To help orchestrate things and manage my own peace of mind, I did a bit of forward planning.

I decided to run the activity for one week. At the end I’d either hand it over to someone or just step back.

I also spent the first evening brainstorming a list of themes. More than enough to keep me going for the week, so I could avoid the need to come up with new themes on the fly. I tried to find a mixture of words that were within the bounds of the types of things you could create in Townscaper, but left room for creativity. In the end I revised and prioritized the initial list over the course of the week based on how people engaged.

I wanted the activity to be inclusive so came up with a few ground rules: “No prizes, no winners. It’s just for fun.” And some brief guidance about how to participate (post screenshots, use the right hashtags).

I also wanted to help gather together submissions, but didn’t want to retweet or share all of them. So decided to finally try out creating twitter moments. One for each daily challenge. This added some work as I was always worrying I’d missed something, but it also meant I spent time looking at every build.

I ended up with two template tweets, one to introduce the challenge and one to publish the results. These were provided as a single thread to help weave everything together.

And over the course of a week, people built some amazing things. Take a look for yourself:

  1. Townscaper Daily Challenge #1 – Bridge
  2. Townscaper Daily Challenge #2 – Garden
  3. Townscaper Daily Challenge #3 – Neighbours
  4. Townscaper Daily Challenge #4 – Canal
  5. Townscaper Daily Challenge #5 – Eyrie
  6. Townscaper Daily Challenge #6 – Fortress
  7. Townscaper Daily Challenge #7 – Labyrinth

People played with the themes in interesting ways. They praised and commented on each other’s work. It was one of the most interesting, creative and fun things I’ve done on twitter.

By the end of the week, only a few people were contributing, so it was right to let it run its course. (Although I see that people are still occasionally using the hashtag).

It was a reminder that twitter can be and often is a completely different type of social space. A break from the doomscrolling was good.

It also reminded me how much I love creating and making things. So I’m resolved to do more of that in the future.

Increasing inclusion around open standards for data

I read an interesting article this week by Ana Brandusescu, Michael Canares and Silvana Fumega. Called “Open data standards design behind closed doors?” it explores issues of inclusion and equity around the development of “open data standards” (which I’m reading as “open standards for data”).

Ana, Michael and Silvana rightly highlight that standards development is often seen and carried out as a technical process, whereas their development and impacts are often political, social or economic. To ensure that standards are well designed, we need to recognise their power, choose when to wield that tool, and ensure that we use it well. The article also asks questions about how standards are currently developed and suggests a framework for creating more participatory approaches throughout their development.

I’ve been reflecting on the article this week alongside a discussion that took place in this thread started by Ana.

Improving the ODI standards guidebook

I agree that standards development should absolutely be more inclusive. I too often find myself in standards discussions and groups with people that look like me and whose experiences may not always reflect those who are ultimately impacted by the creation and use of a standard.

In the open standards for data guidebook we explore how and why standards are developed to help make that process more transparent to a wider group of people. We also placed an emphasis on the importance of the scoping and adoption phases of standards development because this is so often where standards fail. Not just because the wrong thing is standardised, but also because the standard is designed for the wrong audience, or its potential impacts and value are not communicated.

Sometimes we don’t even need a standard. Standards development isn’t about creating specifications or technology, those are just outputs. The intended impact is to create some wider change in the world, which might be to increase transparency, or support implementation of a policy or to create a more equitable marketplace. Other interventions or activities might achieve those same goals better or faster. Some of them might not even use data(!)

But looking back through the guidebook, while we highlight in many places the need for engagement, outreach, developing a shared understanding of goals and desired impacts and a clear set of roles and responsibilities, we don’t specifically foreground issues of inclusion and equity as much as we could have.

The language and content of the guidebook could be improved. As could some prototype tools we included like the standards canvas. How would that be changed in order to foreground issues of inclusion and equity?

I’d love to get some contributions to the guidebook to help us improve it. Drop me a message if you have suggestions about that.

Standards as shared agreements

Open standards for data are reusable agreements that guide the exchange of data. They shape how I collect data from you, as a data provider. And as a data provider they shape how you (re)present data you have collected and, in many cases, will ultimately impact how you collect data in the future.

If we foreground standards as agreements for shaping how data is collected and shared, then to increase inclusion and equity in the design of those agreements we can look to existing work like the Toolkit for Centering Racial Equity which provides a framework for thinking about inclusion throughout the life-cycle of data. Standards development fits within that life-cycle, even if it operates at a larger scale and extends it out to different time frames.

We can also recognise existing work and best practices around good participatory design and research.

We should avoid standards development, as a process, being divorced from broader discussions and best practices around ethics, equity and engagement around data. Taking a more inclusive and equitable approach to standards development is part of the broader discussion around the need for more integration across the computing and social sciences.

We may also need to recognise that sometimes agreements are made that don’t provide equitable outcomes for everyone. We might not be able to achieve a compromise that works for everyone. Being transparent about the goals and aims of a standard, and how it was developed, can help to surface who it is designed for (or not). Sometimes we might just need different standards, optimised for different purposes.

Some standards are more harmful than others

There are many different types of standard. And standards can be applied to different types of data. The authors of the original article didn’t really touch on this within their framework, but I think it’s important to recognise these differences, as part of any follow-on activities.

The impacts of a poorly designed standard that classifies people or their health outcomes will be much more harmful than a poorly defined data exchange format. See all of Susan Leigh Star‘s work. Or concerns from indigenous peoples about how they are counted and represented (or not) in statistical datasets.

Increasing inclusion can help to mitigate the harmful impacts around data. So focusing on improving inclusion (or recognising existing work and best practices) around the design of standards with greater capacity for harms is important. The skills and experience required in developing a taxonomy are fundamentally different to those required to develop a data exchange format.

Recognising these differences is also helpful when planning how to engage with a wider group of people, as we can identify what help and input is needed: What skills or perspectives are lacking among those leading standards work? What help or support needs to be offered to increase inclusion, e.g. by developing skills, or choosing different collaboration tools or methods of seeking input?

Developing a community of practice

Since we launched the standards guidebook I’ve been wondering whether it would be helpful to have more of a community of practice around standards development. I found myself thinking about this again after reading Ana, Michael and Silvana’s article and the subsequent discussion on twitter.

What would that look like? Does it exist already?

Perhaps supported by a set of learning or training resources that re-purposes some of the ODI guidebook material alongside other resources to help others to engage with and lead impactful, inclusive standards work?

I’m interested to see how this work and discussion unfolds.

FAIR, fairer, fairest?

“FAIR” (or “FAIR data”) is a term that I’ve been bumping into more and more frequently. For example, it’s included in the UK’s recently published Geospatial Strategy.

FAIR is an acronym that stands for Findable, Accessible, Interoperable and Reusable. It defines a set of principles that highlight some important aspects of publishing machine-readable data well. For example they identify the need to adopt common standards, use common identifiers, provide good metadata and clear usage licences.
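As a rough illustration of what those recommendations look for, here is a minimal sketch in Python that checks a dataset’s metadata record for an identifier, descriptive metadata, a declared standard and a licence. The field names and the `fair_gaps` function are my own assumptions for the example; the FAIR principles don’t prescribe them.

```python
# Hypothetical metadata fields, loosely grouped under the four FAIR headings.
# The FAIR principles do not prescribe these names; they are assumptions.
CHECKS = {
    "findable": ["identifier", "title", "description"],
    "accessible": ["access_url"],
    "interoperable": ["conforms_to"],
    "reusable": ["licence"],
}


def fair_gaps(metadata):
    """Return, per heading, the fields that are missing or empty."""
    gaps = {}
    for heading, fields in CHECKS.items():
        missing = [f for f in fields if not metadata.get(f)]
        if missing:
            gaps[heading] = missing
    return gaps


# An example record with an empty description and no declared standard.
record = {
    "identifier": "https://example.org/dataset/1",
    "title": "Example dataset",
    "description": "",
    "access_url": "https://example.org/dataset/1.csv",
    "licence": "CC-BY-4.0",
}
print(fair_gaps(record))
```

A check like this only shows whether a field has been filled in. It can’t say whether the chosen licence, identifier scheme or standard is the right one for the people you expect to use the data.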

The principles were originally defined by researchers in the life sciences. They were intended to help to improve management and sharing of data in research. Since then the principles have been increasingly referenced in other disciplines and domains.

At the ODI we’re currently working with CABI on a project that is applying the FAIR data principles, alongside other recommendations, to improve data sharing in grants and projects funded by the Gates Foundation.

From the perspective of encouraging the management and sharing of well-structured, standardised, machine-readable data, the FAIR principles are pretty good. They explore similar territory to the ODI’s Open Data Certificates and Tim Berners-Lee’s 5-Star Principles.

But the FAIR principles have some limitations and have been critiqued by various communities. As the principles become adopted in other contexts it is important that we understand these limitations, as they may have more of an impact in different situations.

A good background on the FAIR principles and some of their limitations can be found in this 2018 paper. But there are a few I’d like to highlight in this post.

They’re just principles

A key issue with FAIR is that they’re just principles. They offer recommendations about best practices, but they don’t help you answer specific questions. For example:

  • what metadata is useful to publish alongside different types of datasets?
  • which standards and shared identifiers are the best to use when publishing a specific dataset?
  • where will people be looking for this dataset, to ensure it’s findable?
  • what are the trade-offs of using different competing standards?
  • what terms of use and licensing are appropriate to use when publishing a specific dataset for use by a specific community?
  • …etc

Applying the principles to a specific dataset means you need to have a clear idea about what you’re trying to achieve, what standards and best practices are used by the community you’re trying to support, or what approach might best enable the ecosystem you’re trying to grow and support.

We touched on some of these issues in a previous project that CABI and ODI delivered to the Gates Foundation. We encouraged people to think about FAIR in the context of a specific data ecosystem.

Currently there’s very little guidance that exists to support these decisions around FAIR. Which makes it harder to assess whether something is really FAIR in practice. Inevitably there will be trade-offs that involve making choices about standards and how much to invest in data curation and publication. Principles only go so far.

The principles are designed for a specific context

The FAIR principles were designed to reflect the needs of a specific community and context. Many of the recommendations are also broadly applicable to data publishing in other domains and contexts. But they embody design decisions that may not apply universally.

For example, they choose to emphasise machine-readability. Other communities might choose to focus on other elements that are more important to them or their needs.

As an alternative, the CARE principles for indigenous data governance are based around Collective Benefit, Authority to Control, Responsibility and Ethics. Those are good principles too. Other groups have chosen to propose ways to adapt and expand on FAIR.

It may be that the FAIR principles will work well in your specific context or community. But it might also be true that if you were to start from scratch and designed a new set of principles, you might choose to highlight other principles.

Whenever we are applying off-the-shelf principles in new areas, we need to think about whether they are helping us to achieve our own goals. Do they emphasise and prioritise work in the right areas?

The principles are not about being “fair”

Despite the acronym, the principles aren’t about being “fair”.

I don’t really know how to properly define “fair”. But I think it includes things like equity ‒ of access, or representation, or participation. And ethics and engagement. The principles are silent on those topics, leading some people to think about FAIRER data.

Don’t let the memorable acronym distract from the importance of ethics, consequence scanning and centering equity.

FAIR is not open

The principles were designed to be applied in contexts where not all data can be open. Life science research involves lots of sensitive personal information. Instead the principles recommend that data usage rights are clear.

I usually point out that FAIR data can exist across the data spectrum. But the principles don’t remind you that data should be as open as possible. Or prompt you to consider the impacts of different types of licensing. They just ask you to be clear about the terms of reuse, however restrictive they might be.

So, to recap: the FAIR data principles offer a useful framework of things to consider when making data more accessible and easier to reuse. But they are not perfect. And they do not consider all of the various elements required to build an open and trustworthy data ecosystem.

What kinds of data is it useful to include in a register?

Registers are useful lists of information. A register might be a list of countries, companies, or registered doctors. Or addresses.

At the ODI we did a whole report on registers. It looks at different types of registers and how they’re governed. And GDS built a whole infrastructure to support them being published and used across the UK government.

Registers are core components of some types of identifier systems. They help to collect and share information about some aspect of the world we’re collectively interested in. For that reason it can be useful to know more about how the register is governed. So we know what it contains and how that list might change over time.

When those lists of things are useful in many different contexts, then making those registers open helps us to connect together different datasets and analyse them in new ways. They help to unlock context.

How much information should we put in a register? What information might it be useful to capture about the things ‒ the countries, the companies, or the addresses ‒  that are in our shared lists? Do we record just a company number and a name? Or also include the address of the company headquarters and the date it was founded?

When I’ve been designing registers and similar reference datasets, there are some common categories of information that I usually think about.

Identifiers

It’s useful if the things in our list have a unique identifier. They might have other identifiers assigned by different systems.

By capturing identifiers we can do things like:

  • clearly refer to items in the register, so we can find their attributes
  • use that identifier to link together different datasets
  • map between datasets that use different identifiers
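
As a rough sketch of the last two points, here is how a register that records identifiers from another system can act as a crosswalk between datasets. The register entries, the "tax" scheme and all the values are made up for illustration.

```python
# Hypothetical register entries: our own identifier plus one assigned elsewhere.
register = [
    {"id": "C001", "other_ids": {"tax": "T-9001"}},
    {"id": "C002", "other_ids": {"tax": "T-9002"}},
]


def build_crosswalk(entries, scheme):
    """Map identifiers from another scheme onto the register's own identifiers."""
    return {
        entry["other_ids"][scheme]: entry["id"]
        for entry in entries
        if scheme in entry["other_ids"]
    }


def link(rows, crosswalk, key):
    """Attach the register identifier to rows keyed on the other scheme.

    Rows whose identifier is not in the register get None.
    """
    return [{**row, "register_id": crosswalk.get(row[key])} for row in rows]


# A made-up dataset keyed on the tax identifier rather than the register's own.
tax_returns = [
    {"tax_id": "T-9002", "turnover": 120},
    {"tax_id": "T-0000", "turnover": 5},
]
linked = link(tax_returns, build_crosswalk(register, "tax"), "tax_id")
```

Rows that can't be matched are kept, with an empty register identifier, so it's easy to see what the register doesn't cover.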

Names and Labels

Things in the real world aren’t often referred to by an identifier. We give things names. Sometimes they may have several names.

Including names and labels in our register allows us to do things like:

  • use a consistent, canonical name for things wherever they are referenced
  • link to things from a webpage
  • provide a way for a human being to recognise and find things in the register
  • turn a name into an identifier, so we can find more information about something
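
Turning a name into an identifier could be sketched like this, assuming each entry has one canonical name and an optional list of alternative names. The field names and the normalisation rule are my own assumptions, not part of any particular register.

```python
def normalise(label):
    """Lower-case and collapse whitespace so small variations still match."""
    return " ".join(label.casefold().split())


def build_name_index(entries):
    """Index every known name (canonical and alternative) to its identifier."""
    index = {}
    for entry in entries:
        for label in [entry["name"], *entry.get("alt_names", [])]:
            index[normalise(label)] = entry["id"]
    return index


# A tiny illustrative list of countries.
countries = [
    {"id": "GB", "name": "United Kingdom", "alt_names": ["UK"]},
    {"id": "FR", "name": "France"},
]
index = build_name_index(countries)


def lookup(name):
    """Return the identifier for a name, or None if it is not in the register."""
    return index.get(normalise(name))
```

A real register would need to handle names that are ambiguous or shared between entries, which this sketch ignores.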

Relationships

Things in the real world are related to one another. Sometimes literally: I am your father (not really). Sometimes spatially (this thing is here, or next to this other thing). Sometimes our world is organised into hierarchies or connected in other ways.

Including relationships in our register allows us to do things like:

  • visualise, present and navigate the contents of the list in a variety of ways
  • aggregate and report data according to the relationships between things
  • put something on a map
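
To illustrate aggregating by relationships, here is a minimal sketch that assumes each entry records a single parent. The areas and the figures are invented.

```python
from collections import defaultdict

# Hypothetical entries: each area optionally points at its parent area.
areas = {
    "ward-1": {"parent": "district-a"},
    "ward-2": {"parent": "district-a"},
    "ward-3": {"parent": "district-b"},
    "district-a": {"parent": None},
    "district-b": {"parent": None},
}


def aggregate_to_parent(values, register):
    """Sum values reported for child areas up to their parent area."""
    totals = defaultdict(int)
    for area_id, value in values.items():
        parent = register[area_id]["parent"]
        if parent is not None:
            totals[parent] += value
    return dict(totals)


# Made-up figures reported at ward level.
population = {"ward-1": 100, "ward-2": 150, "ward-3": 80}
totals = aggregate_to_parent(population, areas)
```

Without the parent relationship in the register, every user would need to source or rebuild that hierarchy for themselves.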

Types and categories

The things in our list might not all be the same. Or there may be differences between them. For example different types of companies. Or residential versus business addresses. Things might also be put into different categories. A register of companies might also categorise businesses by sector.

Having types and categories in a list allows us to do things like:

  • extract the part of the list we are interested in, as sometimes we don’t need the whole thing
  • visualise, present and navigate the contents of the list in a greater variety of different ways
  • aggregate and report data according to how things are categorised

Lifecycle information

Things in the real world often have a life cycle. So do many digital things. Things are built, created, updated, revised, republished, retracted and demolished. Sometimes those events are tied to the thing being added to the register (“a list of registered companies”), sometimes they’re not (“a list of our current customers”).

Recording lifecycle information can help us to do things like:

  • understand the current state or status of something, which can help drive business and planning decisions
  • visualise, present and navigate the contents of the list in an even greater variety of ways
  • aggregate and report data according to where things are in their lifecycle

Administrative data (relating to the register)

It’s useful to capture data about when the information in a register has changed. For example when was something added to, or removed from a register? When did we last update its attributes or check that the information is current?

This type of information can help us to:

  • identify when information has been changed, so we can update our local copy of what’s in the register
  • extract part of the list we are interested in, as maybe we only want current or historical entries. Or just the recent additions
  • aggregate and report on how the data in the register has changed
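
A sketch of the first two points, assuming each entry carries "added", "updated" and "removed" dates. Those field names and the sample entries are hypothetical.

```python
from datetime import date

# Hypothetical entries with administrative dates; "removed" is None if current.
entries = [
    {"id": "C001", "added": date(2019, 1, 10), "updated": date(2020, 6, 1), "removed": None},
    {"id": "C002", "added": date(2019, 3, 5), "updated": date(2019, 3, 5), "removed": date(2020, 2, 1)},
    {"id": "C003", "added": date(2020, 7, 20), "updated": date(2020, 7, 20), "removed": None},
]


def last_changed(entry):
    """Most recent date on which the entry was added, updated or removed."""
    dates = (entry["added"], entry["updated"], entry["removed"])
    return max(d for d in dates if d is not None)


def changed_since(entries, since):
    """Entries that changed on or after the given date, e.g. to refresh a local copy."""
    return [e for e in entries if last_changed(e) >= since]


def current(entries):
    """Entries that have not been removed from the register."""
    return [e for e in entries if e["removed"] is None]
```

For example, `changed_since(entries, date(2020, 5, 1))` picks out only the entries a consumer would need to re-fetch since their last sync on that date.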

Everything else

The list of useful things we might want to include in a register is potentially open ended. The trick in designing a good register is working out which bits are useful to have in the register, and which bits should be part of separate databases.

A good register will contain the data that is most commonly used across systems. Centralising that data can reduce the work, costs and also risks of collecting and maintaining it. If you put too much into the register you may end up increasing costs as you may have more to maintain. Or users have to spend more time pruning out what they don’t need.

But, if you are already maintaining a register and are planning to share it for others to use, you can increase its utility by sharing more information about each entry in the list.

Open UPRNs, a worked example

The UK should have an openly licensed address register. At the ODI we’ve long argued for the need for an open address register. But we don’t have that yet.

We do have a partial subset of our national address register available under an open licence, in the form of the OS Open UPRNs product. It contains just the UPRN identifier and some spatial coordinates. Through the information in the related Open Identifiers product, we can also uncover some relationships between UPRNs and other spatial objects and administrative areas.

Drawing from the above examples this means we can do things like:

  • increase use of UPRNs as a common machine-readable identifier across datasets
  • identify a valid UPRN
  • locate them spatially on a map
  • relate those UPRNs to other things of interest, like administrative areas
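
As a very rough sketch of those four things, the example below uses two in-memory lookups to stand in for the open products. The UPRNs, coordinates, area codes and record layout are all made up, and this is not the actual structure of either product.

```python
# Made-up rows standing in for an open UPRN list: identifier plus coordinates.
open_uprns = {
    "100000000001": (51.381, -2.359),
    "100000000002": (51.454, -2.588),
}
# Made-up lookup relating each UPRN to an administrative area.
uprn_to_area = {
    "100000000001": "area-a",
    "100000000002": "area-b",
}


def locate(records):
    """Attach coordinates and an area to records that carry a known UPRN.

    Records whose UPRN is not in the list are returned separately.
    """
    located, unmatched = [], []
    for record in records:
        uprn = record["uprn"]
        if uprn not in open_uprns:
            unmatched.append(record)  # not a UPRN we can validate
            continue
        lat, lon = open_uprns[uprn]
        located.append({**record, "lat": lat, "lon": lon, "area": uprn_to_area.get(uprn)})
    return located, unmatched


# A hypothetical dataset that uses the UPRN as its identifier.
inspections = [
    {"uprn": "100000000001", "result": "pass"},
    {"uprn": "999999999999", "result": "fail"},
]
located, unmatched = locate(inspections)
```

Note that nothing here can tell us whether a matched UPRN is still in use, or what kind of place it refers to.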

With a bit of extra data engineering and analysis, e.g. to look for variations across versions of the dataset, we can also maybe work out a rough date for when a UPRN was added to the list.

This is more than we could do before, which is great.

But there’s clearly much, much more we still can’t do:

  • filter out historical UPRNs
  • filter out UPRNs of different types
  • map between addresses (the names for those places) and the identifiers
  • understand the current status of a UPRN
  • aggregate and report on them using different categories
  • help people by building services that use the names (addresses) they’re familiar with
  • …etc, etc

We won’t be able to do those things until we have a fully open address register. But, until then, even including a handful of additional attributes (like a status code!) would clearly unlock more value.

I’ve previously argued that introducing a bit of product thinking might help to bring some focus to the decisions made about how data is published. And I still stand by much of that. But we need to be able to evaluate whether those product design decisions are achieving the intended effect.

Cooking up a new approach to supporting purposeful use of data

In my last post I explored how we might better support the use of datasets. To do that I applied the BASEDEF framework to outline the ways in which communities might collaborate to help unlock more value from individual datasets.

But what if we changed our focus from supporting discovery and use of datasets and instead focused on helping people explore specific types of problems or questions?

Our paradigm around data discovery is based on helping people find individual datasets. But unless a dataset has been designed to answer the specific question you have in mind, then it’s unlikely to be sufficient. Any non-trivial analysis is likely to need multiple datasets.

We know that data is more useful when it is combined, so why isn’t our approach to discovery based around identifying useful collections of datasets?

A cooking metaphor

To explore this further let’s use a cooking metaphor. I love cooking.

Many cuisines are based on a standard set of elements. Common spices or ingredients that become the base of most dishes. Like a mirepoix, a sofrito, the holy trinity of Cajun cooking, or the mother sauces in French cuisine.

As you learn to cook you come to appreciate how these flavour bases and sauces can be used to create a range of dishes. Add some extra spices and ingredients and you’ve created a complete dish.

Recipes help us consistently recreate these sauces.

A recipe consists of several elements. It will have a set of ingredients and a series of steps to combine them. A good recipe will also include some context. For example some background on the origins of the recipe and descriptions of unusual spices or ingredients. It might provide some things to watch out for during the cooking (“don’t burn the spices”) or suggest substitutions for difficult to source ingredients.

Our current approach to dataset discovery involves trying to document the provenance of an individual ingredient (a dataset) really well. We aren’t helping people combine them together to achieve results.

Efforts to improve dataset metadata, documentation and provenance reporting are important. Projects like the dataset nutrition label are great examples of that. We all want to be ethical, sustainable cooks. To do that we need to make informed choices about our ingredients.

But, to whisk these food metaphors together, nutrition labels are there to help you understand what’s gone into your supermarket pasta sauce. It’s not giving you a recipe to cook it from scratch for yourself. Or an idea of how to use the sauce to make a tasty dish.

Recipes for data-informed problem solving

I think we should think about sharing dataset recipes: instructions for how to mix up a selection of dataset ingredients. What would they consist of?

Firstly, the recipe would need to be based around a specific type of question, problem or challenge. Examples might include:

  • How can I understand air quality in my city?
  • How is deprivation changing in my local area?
  • What are the impacts of COVID-19 in my local authority?

Secondly, a recipe would include a list of datasets that have to be sourced, prepared and combined together to explore the specific problem. For example, if you’re exploring impacts of COVID-19 in your local authority you’re probably going to need:

  • demographic data from the most recent census
  • spatial boundaries to help visualise and present results
  • information about deprivation to help identify vulnerable people

Those three datasets are probably the holy trinity of any local spatial analysis?

Finally, you’re going to need some instructions for how to combine the datasets together. The instructions might identify some tools you need (Excel or QGIS), reference some techniques (Reprojection) and maybe offer some hints about how to substitute for key ingredients if you can’t get them in your local area (FOI).

The recipe might suggest ways to vary it for different purposes: add a sprinkle of Companies House data to understand your local business community, and a dash of OpenStreetMap to identify greenspaces?

As a time saver maybe you can find some pre-made versions of some of the steps in the recipe?
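
To make the shape of a recipe a bit more concrete, here is one way it could be written down as a data structure. This is only a sketch: the class names, fields and the wording of the steps are my own assumptions, not an existing format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Ingredient:
    """A dataset needed by the recipe and why it is needed."""
    name: str
    purpose: str
    substitute: Optional[str] = None  # what to try if you can't source it


@dataclass
class DatasetRecipe:
    """A question, the datasets needed to explore it, and how to combine them."""
    question: str
    ingredients: List[Ingredient]
    steps: List[str]
    tools: List[str] = field(default_factory=list)
    variations: List[str] = field(default_factory=list)

    def shopping_list(self):
        """Names of the datasets that have to be sourced."""
        return [ingredient.name for ingredient in self.ingredients]


recipe = DatasetRecipe(
    question="What are the impacts of COVID-19 in my local authority?",
    ingredients=[
        Ingredient("census demographics", "understand the local population"),
        Ingredient("spatial boundaries", "visualise and present results"),
        Ingredient("deprivation data", "help identify vulnerable people",
                   substitute="request via FOI"),
    ],
    # Illustrative steps only; a real recipe would be far more specific.
    steps=[
        "Reproject the boundaries so all datasets share a coordinate system",
        "Join demographic and deprivation data to each boundary",
    ],
    tools=["Excel", "QGIS"],
    variations=["Add Companies House data to understand the local business community"],
)
```

Writing recipes down in a structured way like this would also make them easier to search, compare and analyse.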

Examples in the wild

OK, it’s easy to come up with a metaphor and an idea. But would this actually meet a need? There are a few reasons why I’m reasonably confident that dataset recipes could be helpful. Mostly because I can see this same approach re-appearing in some related contexts. For example:

If you have examples then let me know in the comments or on twitter.

How can dataset recipes help?

I think there’s a whole range of ways in which these types of recipe can be useful.

Data analysis always starts by posing a question. Documenting how datasets can be applied to specific questions will make them easier to find on search engines. It just fits better with what people want to do.

Data discovery is important during periods where there is a sudden influx of new potential users. For example, where datasets have just been published under an open licence and are now available to more people, for a wider range of purposes.

In my experience data analysts and scientists who understand a domain, e.g. population or transport modelling, have built up a tacit understanding of what datasets are most useful in different contexts. They understand the limitations and the process of combining datasets together. This thread from Chris Gale with a recipe about doing spatial analysis using PHE’s COVID-19 data is a perfect example. Documenting and sharing this knowledge can help others to do similar analyses. It’s like a cooking masterclass.

Discovery is also difficult when there is a sudden influx of new data available. Such as during this pandemic. Writing recipes is a good way to share learning across a community.

Documenting useful recipes might help us scale innovation across local areas.

Lastly, we’re still trying to understand which datasets are the most important parts of our local, national and international data infrastructure. We’re currently lacking any real quantitative information about how datasets are combined together. In the same way that recipes can be analysed to create ingredient networks, dataset recipes could be analysed to find out how datasets are being used together. We can then strengthen that infrastructure.
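
As a minimal sketch of that kind of analysis, assume we had a collection of recipes that each list the datasets they use. We could then count how often each dataset appears, and how often pairs of datasets appear together. The recipes and dataset names below are invented.

```python
from collections import Counter
from itertools import combinations

# Hypothetical recipes, each listing the datasets it combines.
recipes = {
    "covid-impacts": ["census", "boundaries", "deprivation"],
    "air-quality": ["boundaries", "air-quality-sensors", "census"],
    "local-business": ["boundaries", "companies-house"],
}


def dataset_usage(recipes):
    """Count how many recipes each dataset appears in."""
    return Counter(d for datasets in recipes.values() for d in set(datasets))


def co_occurrence(recipes):
    """Count how many recipes each pair of datasets appears in together.

    Pairs are stored in alphabetical order so (a, b) and (b, a) are the same edge.
    """
    pairs = Counter()
    for datasets in recipes.values():
        for a, b in combinations(sorted(set(datasets)), 2):
            pairs[(a, b)] += 1
    return pairs


usage = dataset_usage(recipes)
edges = co_occurrence(recipes)
```

The pair counts are the weighted edges of a simple dataset network. The most frequently used datasets and pairs would be candidates for the infrastructure we should be strengthening.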

If you’ve built something that helps people publish dataset recipes then send me a link to your app. I’d like to try it.