I’ve been on leave this week so, amongst the gardening and relaxing I’ve had a bit of head space to think. One of the things I’ve been thinking about is the words we choose to use when talking about data. It was Dan‘s recent blog post that originally triggered it. But I was reminded of it this week after seeing more people talking past each other and reading about how the Guardian has changed the language it uses when talking about the environment: Climate crisis not climate change.
As Dan pointed out we often need a broader vocabulary when talking about data. Talking about “data” in general can be helpful when we want to focus on commonalities. But for experts we need more distinctions. And for non-experts we arguably need something more tangible. “Data”, “algorithm” and “glitch” are default words we use but there are often better ones.
It can be difficult to choose good words for data because everything can be treated as data these days. Whether it’s numbers, text, images or video everything can be computed on, reported and analysed. Which makes the idea of data even more nebulous for many people.
In Metaphors We Live By, George Lakoff and Mark Johnson discuss how the range of metaphors we use in language, whether consciously or unconsciously, impacts how we think about the world. They highlight that careful choice of metaphors can help to highlight or obscure important aspects of the things we are discussing.
The example that stuck with me was that when we are describing debates. We often do so in terms of things to be won, or battles to be fought (“the war of words”). What if we thought of debates as dances instead? Would that help us focus on compromise and collaboration?
This is why I think that data as infrastructure is such a strong metaphor. It helps to highlight some of the most important characteristics of data: that it is collected and used by communities, needs to be supported by guidance, policies and technologies and, most importantly, needs to be invested in and maintained to support a broad variety of uses. We’ve all used roads and engaged with the systems that let us make use of them. Focusing on data as information, as zeros and ones, brings nothing to the wider debate.
If our choice of metaphors and words can help to highlight or hide important aspects of a discussion, then what words can we use to help focus some of our discussions around data?
It turns out there’s quite a few.
For example there are “samples” and “sampling“. These are words used in statistics but their broader usage has the same meaning. When we talk about sampling something, whether its food or drink, music or perfume it’s clear that we’re not taking the whole thing. Talking about sampling might help us be to clearer that often when we’re collecting data we don’t have the whole picture. We just have a tester, a taste. Hopefully one which is representative of the whole. We can make choices about when, where and how often we take samples. We might only be allowed to take a few.
“Polls” and “polling” are similar words. We sample people’s opinions in a poll. While we often use these words in more specific ways, they helpfully come with some understanding that this type of data collection and analysis is imperfect. We’re all very familiar at this point with the limitations of polls.
Or how about “observations” and “observing“? Unlike “sensing” which is a passive word, “observing” is more active and purposeful. It implies that someone or something is watching. When we want to highlight that data is being collected about people or the environment “taking observations” might help us think about who is doing the observing, and why. Instead of “citizen sensing” which is a passive way of describing participatory data collection, “citizen observers” might place a bit more focus on the work and effort that is being contributed.
“Catalogues” and “cataloguing” are words that, for me at least, imply maintenance and value-added effort. I think of librarians cataloguing books and artefacts. “Stewards” and “curators” are other important roles.
AI and Machine Learning are often being used to make predictions. For example, of products we might want to buy, or whether we’re going to commit a crime. Or how likely it is that we might have a car accident based on where we live. These predictions are imperfect. But we talk about algorithms as “knowing”, “spotting”, “telling” or “helping”. But they don’t really do any of those things.
What they are doing is making a “forecast“. We’re all familiar with weather forecasts and their limits. So why not use the same words for the same activity? It might help to highlight the uncertainty around the uses of the data and technology, and reinforce the need to use these forecasts as context.
In other contexts we talk about using data to build models of the world. Or to build “digital twins“. Perhaps we should just talk more about “simulations“? There are enough people playing games these days that I suspect there’s a broader understanding of what a simulation is: a cartoon sketch of some aspect of the real world that might be helpful but which has its limits.
Other words we might use are “ratings” and “reviews” to help to describe data and systems that create rankings and automated assessments. Many of us have encountered ratings and reviews and understand that they are often highly subjective and need interpretation?
Or how about simply “measuring” as a tangible example of collecting data? We’ve all used a ruler or measuring tape and know that sometimes we need to be careful about taking measurements: “Measure twice, cut once”.
I’m sure there are lots of others. I’m also well aware that not all of these terms will be familiar to everyone. And not everyone will associate them with things in the same way as I do. The real proof will be testing words with different audiences to see how they respond.
I think I’m going to try to deliberately use a broad range of language in my talks and writing and see how it fairs.
What terms do you find most useful when talking about data?