A letter from the future about numbers

It’s odd now, looking at early 21st century content in the Internet Archive. So little nuance.

It feels a little like watching those old black and white movies. All that colour which was just right there. But now lost. Easy to imagine that life was just monochrome. Harder to imagine the richer colours.

Or at least hard for me. There are AIs that will imagine it all for you now, of course. There have been for a while. They’ll repaint the pictures using data they’ve gleaned from elsewhere. But it’s not the film that is difficult to look at. It’s the numbers.

How did you manage with just those bare numerals?

If I showed you, a 21st century reader, one of our numbers you wouldn’t know what it was. You wouldn’t be able to read it.

Maybe you’ve seen that film Arrival? Based on a story by Ted Chiang? Remember the alien writing that was so complex and rich in meaning? That’s what our numbers might look like to you. You’d struggle to decode them.

Oh, the rest of it is much the same. The text, emojis and memes. Everything is just that bit richer, more visual. More nuanced. It’s even taught in schools now. Standardised, tested and interpreted for all. It’d be familiar enough.

You’d struggle with the numbers though. They’d take much more time to learn.

Not all of them. House numbers. Your position in the queue. The cost of a coffee. Those look exactly the same. Why would we change those?

It’s the important numbers that look different. The employment figures. Your pension value. Your expected grade. The air quality. The life-changing numbers. Those all look very different now.

At some point we decided that those numbers needed to be legible in entirely different ways. We needed to be able to see (or hear, or feel) the richness and limitations in the most important numbers. It was, it turned out, the only way to build that shared literacy.

To imagine how we got there, just think about how people have always adapted and co-opted digital platforms and media for their own ends. Hashtags and memes.

Faced with the difficulty of digging behind the numbers – the need to search for sample sizes, cite the sources, highlight the bias, check the facts –  we had to find a different way. It began with adding colour, toying with fonts and diacritics.

5—a NUMBER INTERPOLATED.

It took off from there. Layers of annotations becoming conventions and then standards. Whole new planes and dimensions in Unicode.

42—a PROJECTION based on a SIGNIFICANT POPULATION SAMPLE.

All of the richness, all of the context made visible right there in the number.

27-30—a PREDICTED RANGE created by a BAYESIAN INTERPOLATION over a RECENT SAMPLE produced by an OFFICIAL SOURCE.

180—an INDICATOR AUTOMATICALLY SELECTED by a DEEP LEARNING system, NO HUMAN INTERVENTION.

Context expressed as colour and weight and strokes in the glyphs. You can just read it all right off the digits. There and there. See?

Things aren’t automatically better of course. Numbers aren’t suddenly more trusted. Why would they be?

It’s easier to see what’s not being said. It’s easier to demand better. It’s that little bit harder to ignore what’s before your eyes. It moves us on in our debates or just helps us recognise when the reasons for them aren’t actually down to the numbers at all.

It’s no longer acceptable to elide the detail. The numbers just look wrong. Simplistic. Black and white.

Which is why it’s difficult to read the Internet Archive sometimes.

We’ve got AIs that can dream up the missing information. Mining the Archive for the necessary provenance and adding it all back into the numbers. Just like adding colour to those old films, it can be breathtaking to see. But not in a good way. How could you have deluded yourselves and misled each other so easily?

I’ve got one more analogy for you.

Rorschach tests have long been consigned to history. But our numbers – the life-changing ones – might just remind you of colourful inkblots. And you might accuse us of just reading things into them. Imagining things that just aren’t there.

But numbers are just inkblots. Shapes in which we choose to see different aspects of the world. They always have been. We’ve just got a better palette.

Garden Retro 2020

I’ve been growing vegetables in our garden for years now. I usually end up putting the garden “to bed” for the winter towards the end of September. Harvesting the last bits of produce, weeding out the vegetable patches and covering up the earth until the Spring.

I thought I’d also do a bit of a retro to help me reflect on what worked and what didn’t work so well this year. We’ve had some mixed successes, so there are some things to reflect and improve on.

What did I set out to do this year?

This year I wanted to do a few things:

  • Grow some different vegetables
  • Get more produce out of the garden
  • Have fewer gluts of a single item (no more courgettes!) and limit wastage
  • Have a more continuous harvest

What changes did I make?

To help achieve my goals, I made the following changes this year:

  • Made some of the planting denser, to try and get more into the same space
  • Planted some vegetables in pots and not just the vegetable patches, to make use of all available growing space
  • Tried to germinate and plant out seedlings as early as possible
  • Had several plantings of some vegetables, to allow me to harvest blocks of vegetables over a longer period. To help with this I produced a planting layout for each bed at the start of the year
  • Paid closer attention to the dates when produce was due to be ripe, by creating a Google calendar of expected harvest dates
  • Bought some new seeds, as I had a lot of older seeds

What did we grow?

The final list for this year was (new things in bold):

Basil, Butternut Squash, Carrots, Coriander, Cucumber, Lettuce, Pak Choi, Peas, Potato, Radish, Shallots, Spinach, Spring Onion, Sweetcorn

So, not as many new vegetables as I’d hoped, but I did try some different varieties.

What went well?

  • Had a great harvest overall, including about a kilo of fresh peas, 6kg of potatoes, a couple of dozen cucumbers, and great crops of spring onions and carrots
  • Having the calendar to help guide planting of seeds and planting in blocks across different beds. This definitely helped to limit gluts and spread out the availability of veg
  • Denser planting of peas and giving them a little more space worked well
  • Grew a really great lettuce 
  • Freezing the peas immediately after harvesting, so we could spread out use
  • Making pickled cucumbers and a carrot pickle to preserve some of the produce
  • Using Nemaslug (as usual) to keep the slugs at bay. Seriously, this is my number 1 gardening tip
  • Spring onions grew just fine in pots
  • Spinach harvest was great. None of it went to waste
  • Being able to go to the garden and pick radish, spinach, carrots, spring onion and pak choi and throw them in the wok for dinner was amazing

What didn’t go so well?

  • Germinated and planted out 2-3 different sets of sweetcorn, squash and cucumber plants. Ended up losing them all in early frosts. Nothing more frustrating than seeing things die within a day or so of planting out
  • Basil just didn’t properly germinate or grow this year. Tried 3-4 plantings, ended up with a couple of really scrawny plants. Not sure what happened there. They were in pots but were reasonably well watered.
  • Lost some decent lettuces to snails
  • Radish crop was pretty poor. Some good early harvest, but later sets were poor. I think I used some old seed. The close planting and not enough thinning also meant the plants ended up “leggy” and not growing sufficient bulbs
  • Tried coriander indoors and outdoors with mixed success. Like the radishes, they were pretty stringy. Managed to harvest some leaves but in the end, left them to go to seed and harvested those
  • Sweetcorn, after I did get some to grow, weren’t great. Had some decent cobs on a few, but weakest harvest ever. Normally super reliable.
  • Spinach, Pak Choi and some Radishes bolted. So I didn’t get the full harvest I might have done
  • Cucumbers I grew from seed. But ended up getting a couple of dozen from basically a single monster plant which spread all over the place. So, still had a massive glut of them. There are 7 in the kitchen right now.
  • Crap shallot harvest. Had about half a dozen

What will I do differently?

  • Thin the radishes more, use the early pickings in salads
  • Don’t rush to get the seedlings out too early in the year. This is the second year in a row where I’ve lost plants early on. Make sure to acclimatise them to the outdoors for longer
  • Apply Nemaslug at least twice, not just once a year at the start of the season
  • Try to find a way to control the snails
  • While I watered regularly when it was very hot, I got lax when we had a wet period. Suspect this may have contributed to some plants bolting
  • Need to rotate stuff through the beds next year, to mix up planting
  • Look at where I can do companion planting, e.g. around the sweetcorn 
  • Going to expand the growing patch. The kids have outgrown their trampoline, so I will be converting more of the garden to beds next year
  • Add another 1-2 compost bins

Main thing I want to do next year is get a greenhouse. I’ve got my eye on this one. I want to grow tomatoes, chillis and peppers. It’ll also help me acclimatise some of the seedlings before properly planting out.

Having the space to grow vegetables is a privilege and I’m very glad and very lucky to have the opportunity.

Gardening can be time consuming and frustrating, but I love being able to cook with what I’ve grown myself. Getting out into the garden, doing something physical, seeing things grow is also a nice balm given everything else that is going on.

Looking forward to next year.


Four types of innovation around data

Vaughn Tan’s The Uncertainty Mindset is one of the most fascinating books I’ve read this year. It’s an exploration of how to build R&D teams drawing on lessons learned in high-end kitchens around the world. I love cooking and I’m interested in creative R&D and what makes high-performing teams work well. I’d strongly recommend it if you’re interested in any of these topics.

I’m also a sucker for a good intellectual framework that helps me think about things in different ways. I did that recently with the BASEDEF framework.

Tan introduces a nice framework in Chapter 4 of the book which looks at four broad types of innovation around food. These are presented as a way to help the reader understand how and where innovation creates impact in restaurants. The four categories are:

  1. New dishes – new arrangements of ingredients, where innovation might be incremental refinements to existing dishes, combining ingredients together in new ways, or using ingredients from different contexts (think “fusion”)
  2. New ingredients – coming up with new things to be cooked
  3. New cooking methods – new ways of cooking things, like spherification or sous vide
  4. New cooking processes – new ways of organising the processes of cooking, e.g. to help kitchen staff prepare a dish more efficiently and consistently

The categories at the top are more evident to the consumer, those lower down less so. But the impacts of new methods and processes are greater as they apply in a variety of contexts.

Somewhat inevitably, I found myself thinking about how these categories work in the context of data:

  1. New dishes (analyses) – new derived datasets made from existing primary sources. Or new ways of combining datasets to create insights. I’ve used the metaphor of cooking to describe data analysis before; those recipes for data-informed problem solving help to document this stage to make it reproducible
  2. New ingredients (datasets and data sources) – finding and using new sources of data, like turning image, text or audio libraries into datasets, using cheaper sensors, finding a way to extract data from non-traditional sources, or using phone sensors for earthquake detection
  3. New cooking methods (for cleaning, managing or analysing data) – which includes things like Jupyter notebooks, machine learning or differential privacy
  4. New cooking processes (for organising the collection, preparation and analysis of data) – e.g. collaborative maintenance, developing open standards for data or approaches to data governance and collective consent?

The breakdown isn’t perfect, but I found the exercise useful to think through the types of innovation around data. I’ve been conscious recently that I’m often using the word “innovation” without really digging into what that means, how that innovation happens and what exactly is being done differently or produced as a result.

The categories are also useful, I think, in reflecting on the possible impacts of breakthroughs of different types. Or perhaps where investment in R&D might be prioritised and where ensuring the translation of innovative approaches into the mainstream might have most impact?

What do you think?

#TownscaperDailyChallenge

This post is a bit of a diary entry. It’s to help me remember a fun little activity that I was involved in recently.

I’d seen little gifs and screenshots of Townscaper on twitter for months. But then suddenly it was in early access.

I bought it and started playing around. I’ve been feeling like I was in a rut recently and wanted to do something creative. After seeing Jim Rossignol mention playing with townscaper as a nightly activity, I thought I’d do similar.

Years ago I used to do lunchtime hacks and experiments as a way to be a bit more creative than I got to be in my day job. Having exactly an hour to create and build something is a nice constraint. Forces you to plan ahead and do the simplest thing to move an idea forward.

I decided to try lunchtime Townscaper builds. Each one with a different theme. I did my first one, with the theme “Bridge”, and shared it on twitter.

Chris Love liked the idea and suggested adding a hashtag so others could do the same. I hadn’t planned to share my themes and builds every day, but I thought, why not? The idea was to try doing something different after all.

So I tweeted out the first theme using the hashtag.

That tweet is the closest thing I’ve ever had to a “viral” tweet. It’s had 53,523 impressions and over 650 interactions.

Turns out people love Townscaper. And are making lots of cool things with it.

TweetDeck was pretty busy for the next few days. I had a few people start following me as a result, and suddenly felt a bit pressured. To help orchestrate things and manage my own peace of mind, I did a bit of forward planning.

I decided to run the activity for one week. At the end I’d either hand it over to someone or just step back.

I also spent the first evening brainstorming a list of themes. More than enough to keep me going for the week, so I could avoid the need to come up with new themes on the fly. I tried to find a mixture of words that were within the bounds of the types of things you could create in Townscaper, but left room for creativity. In the end I revised and prioritised the initial list over the course of the week based on how people engaged.

I wanted the activity to be inclusive so came up with a few ground rules: “No prizes, no winners. It’s just for fun.” And some brief guidance about how to participate (post screenshots, use the right hashtags).

I also wanted to help gather together submissions, but didn’t want to retweet or share all of them. So decided to finally try out creating twitter moments. One for each daily challenge. This added some work as I was always worrying I’d missed something, but it also meant I spent time looking at every build.

I ended up with two template tweets, one to introduce the challenge and one to publish the results. These were provided as a single thread to help weave everything together.

And over the course of a week, people built some amazing things. Take a look for yourself:

  1. Townscaper Daily Challenge #1 – Bridge
  2. Townscaper Daily Challenge #2 – Garden
  3. Townscaper Daily Challenge #3 – Neighbours
  4. Townscaper Daily Challenge #4 – Canal
  5. Townscaper Daily Challenge #5 – Eyrie
  6. Townscaper Daily Challenge #6 – Fortress
  7. Townscaper Daily Challenge #7 – Labyrinth

People played with the themes in interesting ways. They praised and commented on each other’s work. It was one of the most interesting, creative and fun things I’ve done on twitter.

By the end of the week, only a few people were contributing, so it was right to let it run its course. (Although I see that people are still occasionally using the hashtag).

It was a reminder that twitter can be and often is a completely different type of social space. A break from the doomscrolling was good.

It also reminded me how much I love creating and making things. So I’m resolved to do more of that in the future.

Increasing inclusion around open standards for data

I read an interesting article this week by Ana Brandusescu, Michael Canares and Silvana Fumega. Called “Open data standards design behind closed doors?” it explores issues of inclusion and equity around the development of “open data standards” (which I’m reading as “open standards for data”).

Ana, Michael and Silvana rightly highlight that standards development is often seen and carried out as a technical process, whereas their development and impacts are often political, social or economic. To ensure that standards are well designed, we need to recognise their power, choose when to wield that tool, and ensure that we use it well. The article also asks questions about how standards are currently developed and suggests a framework for creating more participatory approaches throughout their development.

I’ve been reflecting on the article this week alongside a discussion that took place in this thread started by Ana.

Improving the ODI standards guidebook

I agree that standards development should absolutely be more inclusive. I too often find myself in standards discussions and groups with people that look like me and whose experiences may not always reflect those who are ultimately impacted by the creation and use of a standard.

In the open standards for data guidebook we explore how and why standards are developed to help make that process more transparent to a wider group of people. We also placed an emphasis on the importance of the scoping and adoption phases of standards development because this is so often where standards fail. Not just because the wrong thing is standardised, but also because the standard is designed for the wrong audience, or its potential impacts and value are not communicated.

Sometimes we don’t even need a standard. Standards development isn’t about creating specifications or technology, those are just outputs. The intended impact is to create some wider change in the world, which might be to increase transparency, or support implementation of a policy or to create a more equitable marketplace. Other interventions or activities might achieve those same goals better or faster. Some of them might not even use data(!)

But looking back through the guidebook, while we highlight in many places the need for engagement, outreach, developing a shared understanding of goals and desired impacts and a clear set of roles and responsibilities, we don’t specifically foreground issues of inclusion and equity as much as we could have.

The language and content of the guidebook could be improved. As could some prototype tools we included like the standards canvas. How would that be changed in order to foreground issues of inclusion and equity?

I’d love to get some contributions to the guidebook to help us improve it. Drop me a message if you have suggestions about that.

Standards as shared agreements

Open standards for data are reusable agreements that guide the exchange of data. They shape how I collect data from you, as a data provider. And, as a data provider, they shape how you (re)present data you have collected and, in many cases, will ultimately impact how you collect data in the future.

If we foreground standards as agreements for shaping how data is collected and shared, then to increase inclusion and equity in the design of those agreements we can look to existing work like the Toolkit for Centering Racial Equity which provides a framework for thinking about inclusion throughout the life-cycle of data. Standards development fits within that life-cycle, even if it operates at a larger scale and extends it out to different time frames.

We can also recognise existing work and best practices around good participatory design and research.

We should avoid standards development, as a process, being divorced from broader discussions and best practices around ethics, equity and engagement around data. Taking a more inclusive and equitable approach to standards development is part of the broader discussion around the need for more integration across the computing and social sciences.

We may also need to recognise that sometimes agreements are made that don’t provide equitable outcomes for everyone. We might not be able to achieve a compromise that works for everyone. Being transparent about the goals and aims of a standard, and how it was developed, can help to surface who it is designed for (or not). Sometimes we might just need different standards, optimised for different purposes.

Some standards are more harmful than others

There are many different types of standard. And standards can be applied to different types of data. The authors of the original article didn’t really touch on this within their framework, but I think it’s important to recognise these differences, as part of any follow-on activities.

The impacts of a poorly designed standard that classifies people or their health outcomes will be much more harmful than a poorly defined data exchange format. See all of Susan Leigh Star’s work. Or concerns from indigenous peoples about how they are counted and represented (or not) in statistical datasets.

Increasing inclusion can help to mitigate the harmful impacts around data. So focusing on improving inclusion (or recognising existing work and best practices) around the design of standards with greater capacity for harms is important. The skills and experience required in developing a taxonomy are fundamentally different to those required to develop a data exchange format.

Recognising these differences is also helpful when planning how to engage with a wider group of people, as we can identify what help and input is needed: What skills or perspectives are lacking among those leading standards work? What help or support needs to be offered to increase inclusion, e.g. by developing skills, or by choosing different collaboration tools or methods of seeking input?

Developing a community of practice

Since we launched the standards guidebook I’ve been wondering whether it would be helpful to have more of a community of practice around standards development. I found myself thinking about this again after reading Ana, Michael and Silvana’s article and the subsequent discussion on twitter.

What would that look like? Does it exist already?

Perhaps supported by a set of learning or training resources that re-purposes some of the ODI guidebook material alongside other resources to help others to engage with and lead impactful, inclusive standards work?

I’m interested to see how this work and discussion unfolds.

FAIR, fairer, fairest?

“FAIR” (or “FAIR data”) is a term that I’ve been bumping into more and more frequently. For example, it’s included in the UK’s recently published Geospatial Strategy.

FAIR is an acronym that stands for Findable, Accessible, Interoperable and Reusable. It defines a set of principles that highlight some important aspects of publishing machine-readable data well. For example they identify the need to adopt common standards, use common identifiers, provide good metadata and clear usage licences.

The principles were originally defined by researchers in the life sciences. They were intended to help to improve management and sharing of data in research. Since then the principles have been increasingly referenced in other disciplines and domains.

At the ODI we’re currently working with CABI on a project that is applying the FAIR data principles, alongside other recommendations, to improve data sharing in grants and projects funded by the Gates Foundation.

From the perspective of encouraging the management and sharing of well-structured, standardised, machine-readable data, the FAIR principles are pretty good. They explore similar territory as the ODI’s Open Data Certificates and Tim Berners-Lee’s 5-Star Principles.

But the FAIR principles have some limitations and have been critiqued by various communities. As the principles become adopted in other contexts it is important that we understand these limitations, as they may have more of an impact in different situations.

A good background on the FAIR principles and some of their limitations can be found in this 2018 paper. But there are a few I’d like to highlight in this post.

They’re just principles

A key issue with the FAIR principles is just that: they’re principles. They offer recommendations about best practices, but they don’t help you answer specific questions. For example:

  • what metadata is useful to publish alongside different types of datasets?
  • which standards and shared identifiers are the best to use when publishing a specific dataset?
  • where will people be looking for this dataset to ensure it’s findable?
  • what are the trade-offs of using different competing standards?
  • what terms of use and licensing are appropriate to use when publishing a specific dataset for use by a specific community?
  • …etc

Applying the principles to a specific dataset means you need to have a clear idea about what you’re trying to achieve, what standards and best practices are used by the community you’re trying to support, or what approach might best enable the ecosystem you’re trying to grow and support.

We touched on some of these issues in a previous project that CABI and ODI delivered to the Gates Foundation. We encouraged people to think about FAIR in the context of a specific data ecosystem.

Currently there’s very little guidance that exists to support these decisions around FAIR. Which makes it harder to assess whether something is really FAIR in practice. Inevitably there will be trade-offs that involve making choices about standards and how much to invest in data curation and publication. Principles only go so far.

The principles are designed for a specific context

The FAIR principles were designed to reflect the needs of a specific community and context. Many of the recommendations are also broadly applicable to data publishing in other domains and contexts. But they embody design decisions that may not apply universally.

For example, they choose to emphasise machine-readability. Other communities might choose to focus on other elements that are more important to them or their needs.

As an alternative, the CARE principles for indigenous data governance are based around Collective Benefit, Authority to Control, Responsibility and Ethics. Those are good principles too. Other groups have chosen to propose ways to adapt and expand on FAIR.

It may be that the FAIR principles will work well in your specific context or community. But it might also be true that if you were to start from scratch and designed a new set of principles, you might choose to highlight other principles.

Whenever we are applying off-the-shelf principles in new areas, we need to think about whether they are helping us to achieve our own goals. Do they emphasise and prioritise work in the right areas?

The principles are not about being “fair”

Despite the acronym, the principles aren’t about being “fair”.

I don’t really know how to properly define “fair”. But I think it includes things like equity ‒ of access, or representation, or participation. And ethics and engagement. The principles are silent on those topics, leading some people to think about FAIRER data.

Don’t let the memorable acronym distract from the importance of ethics, consequence scanning and centering equity.

FAIR is not open

The principles were designed to be applied in contexts where not all data can be open. Life science research involves lots of sensitive personal information. Instead the principles recommend that data usage rights are clear.

I usually point out that FAIR data can exist across the data spectrum. But the principles don’t remind you that data should be as open as possible. Or prompt you to consider the impacts of different types of licensing. They just ask you to be clear about the terms of reuse, however restrictive they might be.

So, to recap: the FAIR data principles offer a useful framework of things to consider when making data more accessible and easier to reuse. But they are not perfect. And they do not consider all of the various elements required to build an open and trustworthy data ecosystem.

What kinds of data is it useful to include in a register?

Registers are useful lists of information. A register might be a list of countries, companies, or registered doctors. Or addresses.

At the ODI we did a whole report on registers. It looks at different types of registers and how they’re governed. And GDS built a whole infrastructure to support them being published and used across the UK government.

Registers are core components of some types of identifier systems. They help to collect and share information about some aspect of the world we’re collectively interested in. For that reason it can be useful to know more about how the register is governed. So we know what it contains and how that list might change over time.

When those lists of things are useful in many different contexts, then making those registers open helps us to connect together different datasets and analyse them in new ways. They help to unlock context.

How much information should we put in a register? What information might it be useful to capture about the things ‒ the countries, the companies, or the addresses ‒  that are in our shared lists? Do we record just a company number and a name? Or also include the address of the company headquarters and the date it was founded?

When I’ve been designing registers and similar reference datasets, there are some common categories of information that I usually think about.

Identifiers

It’s useful if the things in our list have a unique identifier. They might have other identifiers assigned by different systems.

By capturing identifiers we can do things like:

  • clearly refer to items in the register, so we can find their attributes
  • use that identifier to link together different datasets
  • map between datasets that use different identifiers

Names and Labels

Things in the real world aren’t often referred to by an identifier. We give things names. Sometimes they may have several names.

Including names and labels in our registers allows us to do things like:

  • use a consistent, canonical name for things wherever they are referenced
  • link to things from a webpage
  • provide a way for a human being to recognise and find things in the register
  • turn a name into an identifier, so we can find more information about something

Relationships

Things in the real world are related to one another. Sometimes literally: I am your father (not really). Sometimes spatially (this thing is here, or next to this other thing). Sometimes our world is organised into hierarchies or connected in other ways.

Including relationships in our register allows us to do things like:

  • visualise, present and navigate the contents of the list in a variety of ways
  • aggregate and report data according to the relationships between things
  • put something on a map

Types and categories

The things in our list might not all be the same. Or there may be differences between them. For example different types of companies. Or residential versus business addresses. Things might also be put into different categories. A register of companies might also categorise businesses by sector.

Having types and categories in a list allows us to do things like:

  • extract the part of the list we are interested in; sometimes we don’t need the whole thing
  • visualise, present and navigate the contents of the list in a greater variety of different ways
  • aggregate and report data according to how things are categorised

Lifecycle information

Things in the real world often have a life cycle. So do many digital things. Things are built, created, updated, revised, republished, retracted and demolished. Sometimes those events are tied to the thing being added to the register (“a list of registered companies”), sometimes they’re not (“a list of our current customers”).

Recording lifecycle information can help us to do things like:

  • understand the current state or status of something, which can help drive business and planning decisions
  • visualise, present and navigate the contents of the list in an even greater variety of ways
  • aggregate and report data according to where things are in their lifecycle

Administrative data (relating to the register)

It’s useful to capture data about when the information in a register has changed. For example when was something added to, or removed from a register? When did we last update its attributes or check that the information is current?

This type of information can help us to:

  • identify when information has been changed, so we can update our local copy of what’s in the register
  • extract part of the list we are interested in, as maybe we only want current or historical entries. Or just the recent additions
  • aggregate and report on how the data in the register has changed
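
To make these categories concrete, here’s a minimal sketch of what a single entry in a hypothetical register of companies might capture. This is purely illustrative: every field name and value below is invented.

    # A hypothetical register entry, illustrating the categories above.
    # Every field name and value here is invented for illustration.
    entry = {
        # Identifiers
        "company_number": "01234567",  # the register's own identifier
        "other_identifiers": {"lei": "EXAMPLE00LEI00000001"},  # ids from other systems
        # Names and labels
        "name": "Example Widgets Ltd",
        "previous_names": ["Example Widgets (UK) Ltd"],
        # Relationships
        "parent_company": "07654321",
        "registered_office_uprn": "100023336956",
        # Types and categories
        "company_type": "private-limited",
        "sector": "manufacturing",
        # Lifecycle information
        "status": "active",
        "incorporated_on": "2005-03-14",
        # Administrative data about the register itself
        "added_to_register": "2005-03-14",
        "entry_last_updated": "2020-09-01",
    }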

Everything else

The list of useful things we might want to include in a register is potentially open ended. The trick in designing a good register is working out which bits are useful to have in the register, and which bits should be part of separate databases.

A good register will contain the data that is most commonly used across systems. Centralising that data can reduce the work, costs and also risks of collecting and maintaining it. If you put too much into the register you may end up increasing costs as you may have more to maintain. Or users have to spend more time pruning out what they don’t need.

But, if you are already maintaining a register and are planning to share it for others to use, you can increase its utility by sharing more information about each entry in the list.

Open UPRNs, a worked example

The UK should have an openly licensed address register. At the ODI we’ve long argued for the need for an open address register. But we don’t have that yet.

We do have a partial subset of our national address register available under an open licence, in the form of the OS Open UPRN product. It contains just the UPRN identifier and some spatial coordinates. Through the information in the related Open Identifiers product, we can also uncover some relationships between UPRNs and other spatial objects and administrative areas.

Drawing from the above examples this means we can do things like:

  • increase use of UPRNs as a common machine-readable identifier across datasets
  • identify a valid UPRN
  • locate them spatially on a map
  • relate those UPRNs to other things of interest, like administrative areas

With a bit of extra data engineering and analysis, e.g. to look for variations across versions of the dataset, we can also maybe work out a rough date for when a UPRN has been added to the list.
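
Purely as a sketch of that kind of analysis, here’s how you might diff two published versions of the dataset to identify newly added UPRNs. It assumes each version is a CSV with a UPRN column; the file names are placeholders:

    # Compare two versions of the dataset to find added and removed UPRNs.
    # Note: this only gives us the publication window, not a true creation date.
    import pandas as pd

    january = set(pd.read_csv("osopenuprn_jan.csv")["UPRN"])
    february = set(pd.read_csv("osopenuprn_feb.csv")["UPRN"])

    added = february - january    # UPRNs first seen in the later release
    removed = january - february  # UPRNs no longer present

    print(f"{len(added)} UPRNs added between the two releases")
    print(f"{len(removed)} UPRNs removed between the two releases")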

This is more than we could do before, which is great.

But there’s clearly much, much more we still can’t do:

  • filter out historical UPRNs
  • filter out UPRNs of different types
  • map between addresses (the names for those places) and the identifiers
  • understand the current status of a UPRN
  • aggregate and report on them using different categories
  • help people by building services that use the names (addresses) they’re familiar with
  • …etc, etc

We won’t be able to do those things until we have a fully open address register. But, until then, even including a handful of additional attributes (like a status code!) would clearly unlock more value.

I’ve previously argued that introducing a bit of product thinking might help to bring some focus to the decisions made about how data is published. And I still stand by much of that. But we need to be able to evaluate whether those product design decisions are achieving the intended effect.

Cooking up a new approach to supporting purposeful use of data

In my last post I explored how we might better support the use of datasets. To do that I applied the BASEDEF framework to outline the ways in which communities might collaborate to help unlock more value from individual datasets.

But what if we changed our focus from supporting discovery and use of datasets and instead focused on helping people explore specific types of problems or questions?

Our paradigm around data discovery is based on helping people find individual datasets. But unless a dataset has been designed to answer the specific question you have in mind, then it’s unlikely to be sufficient. Any non-trivial analysis is likely to need multiple datasets.

We know that data is more useful when it is combined, so why isn’t our approach to discovery based around identifying useful collections of datasets?

A cooking metaphor

To explore this further let’s use a cooking metaphor. I love cooking.

Many cuisines are based on a standard set of elements. Common spices or ingredients that become the base of most dishes. Like a mirepoix, a sofrito, the holy trinity of Cajun cooking, or the mother sauces in French cuisine.

As you learn to cook you come to appreciate how these flavour bases and sauces can be used to create a range of dishes. Add some extra spices and ingredients and you’ve created a complete dish.

Recipes help us consistently recreate these sauces.

A recipe consists of several elements. It will have a set of ingredients and a series of steps to combine them. A good recipe will also include some context. For example some background on the origins of the recipe and descriptions of unusual spices or ingredients. It might provide some things to watch out for during the cooking (“don’t burn the spices”) or suggest substitutions for difficult to source ingredients.

Our current approach to dataset discovery involves trying to document the provenance of an individual ingredient (a dataset) really well. We aren’t helping people combine them together to achieve results.

Efforts to improve dataset metadata, documentation and provenance reporting are important. Projects like the dataset nutrition label are great examples of that. We all want to be ethical, sustainable cooks. To do that we need to make informed choices about our ingredients.

But, to whisk these food metaphors together, nutrition labels are there to help you understand what’s gone into your supermarket pasta sauce. It’s not giving you a recipe to cook it from scratch for yourself. Or an idea of how to use the sauce to make a tasty dish.

Recipes for data-informed problem solving

I think we should be sharing dataset recipes: instructions for how to mix up a selection of dataset ingredients. What would they consist of?

Firstly, the recipe would need to be based around a specific type of question, problem or challenge. Examples might include:

  • How can I understand air quality in my city?
  • How is deprivation changing in my local area?
  • What are the impacts of COVID-19 in my local authority?

Secondly, a recipe would include a list of datasets that have to be sourced, prepared and combined together to explore the specific problem. For example, if you’re exploring impacts of COVID-19 in your local authority you’re probably going to need:

  • demographic data from the most recent census
  • spatial boundaries to help visualise and present results
  • information about deprivation to help identify vulnerable people

Those three datasets are probably the holy trinity of any local spatial analysis?

Finally, you’re going to need some instructions for how to combine the datasets together. The instructions might identify some tools you need (Excel or QGIS), reference some techniques (reprojection) and maybe offer some hints about how to substitute for key ingredients if you can’t get them in your local area (FOI).

The recipe might suggest ways to vary it for different purposes: add a sprinkle of Companies House data to understand your local business community, and a dash of OpenStreetMap to identify greenspaces?

As a time saver maybe you can find some pre-made versions of some of the steps in the recipe?
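
As a sketch only, a machine-readable version of one of these recipes might look something like the following. The structure and field names are invented for illustration; nothing like this is a standard:

    # A hypothetical dataset recipe for the COVID-19 example above.
    recipe = {
        "question": "What are the impacts of COVID-19 in my local authority?",
        "ingredients": [
            {"dataset": "Census demographics", "source": "ONS"},
            {"dataset": "Local authority boundaries", "source": "ONS Open Geography Portal"},
            {"dataset": "Index of Multiple Deprivation", "source": "MHCLG"},
        ],
        "tools": ["Excel", "QGIS"],
        "steps": [
            "Download the three datasets",
            "Reproject the boundaries to a common coordinate system",
            "Join the demographic and deprivation data to the boundaries",
            "Map and report the results",
        ],
        "substitutions": "If a dataset isn't published for your area, try an FOI request",
    }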

Examples in the wild

OK, it’s easy to come up with a metaphor and an idea. But would this actually meet a need? There are a few reasons why I’m reasonably confident that dataset recipes could be helpful. Mostly because I can see this same approach re-appearing in some related contexts.

If you have examples then let me know in the comments or on twitter.

How can dataset recipes help?

I think there’s a whole range of ways in which these types of recipe can be useful.

Data analysis always starts by posing a question. Documenting how datasets can be applied to specific questions will make them easier to find on search engines. It just fits better with what people want to do.

Data discovery is important during periods where there is a sudden influx of new potential users. For example, where datasets have just been published under an open licence and are now available to more people, for a wider range of purposes.

In my experience data analysts and scientists who understand a domain, e.g. population or transport modelling, have built up a tacit understanding of what datasets are most useful in different contexts. They understand the limitations and the process of combining datasets together. This thread from Chris Gale with a recipe for doing spatial analysis using PHE’s COVID-19 data is a perfect example. Documenting and sharing this knowledge can help others to do similar analyses. It’s like a cooking masterclass.

Discovery is also difficult when there is a sudden influx of new data available, such as during this pandemic. Writing recipes is a good way to share learning across a community.

Documenting useful recipes might help us scale innovation across local areas.

Lastly, we’re still trying to understand which datasets are the most important parts of our local, national and international data infrastructure. We’re currently lacking any real quantitative information about how datasets are combined together. In the same way that recipes can be analysed to create ingredient networks, dataset recipes could be analysed to find out how datasets are being used together. We can then strengthen that infrastructure.

If you’ve built something that helps people publish dataset recipes then send me a link to your app. I’d like to try it.

How can you help support the use of a dataset?

Getting the most value from data, whilst minimising its harmful impacts, is a community activity. Datasets need to be governed and published well. Most of that responsibility falls on the data publisher, because the choices they make shape data ecosystems.

But other people have a role to play too. Being a good data user means engaging with that process.

Helping others to find data and find the value in it, feels particularly important at the moment. During the pandemic there are many new datasets becoming available. And there are lots of questions to be answered. Some of them can be answered through better use of data.

So, how can communities work together to support use of data?

There are a lot of different ways to explore that question. But there’s a framework called BASEDEF, created by the open source community, which I find helpful.

BASEDEF stands for Blog, Apply, Suggest, Extend, Document, Evangelize and Fix. It describes the different types of contributions that can support an open source project. It can also be applied to help organise a small team in doing that work. Here’s a handy cheat sheet.

But the framework can also be applied to the task of supporting the use of an openly licensed dataset. Let’s run through the framework with that in mind.


Blog

You can write about a dataset to help others to discover it. You can help explain the potential value of applying the dataset to specific problems. Or perhaps you can see some downsides that others should consider.

Writing about how a dataset has been useful to you, by describing how you’ve successfully applied it in a project, will also help others see its potential value.

Apply

You can show how a dataset can be used, by creating something with it. You might do a detailed analysis of the data, but some simpler contributions can also be helpful.

For example you might create a simple visualisation. Or write and publish some code that illustrates how the dataset can be accessed and used. You could publish a quick demo showing how the dataset can be imported and used in some frequently used tools and platforms.

At the moment everyone is a bit tired of charts and graphs. And I agree with the first principle in the visualisation design principles for the pandemic. But a helpful visualisation can do a range of things. Visualisation can be exploratory rather than explanatory.

A visualisation could support other people in understanding the shape of a dataset, to inform their analysis and interpretation of it. It can help identify outliers, gaps, or highlight some of the richness in the data. I’d recommend making it clear when you’re doing this type of visualisation, rather than trying to derive specific insights.
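
As an example of the kind of quick, exploratory demo described above, here’s a minimal sketch that loads a dataset and charts how complete each column is. The URL is a placeholder:

    # Load a dataset and produce a quick exploratory chart of its shape.
    # The URL is a placeholder for whatever dataset you're writing about.
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("https://example.org/some-open-dataset.csv")

    # Count of non-null values per column: a quick way to spot gaps
    df.count().plot(kind="bar", title="Non-null values per column")
    plt.tight_layout()
    plt.show()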

Suggest

Read the documentation. Download and explore the dataset. Ask questions. Give feedback.

Make suggestions to the publisher about changes they could make to publish the data better. Rather than just offer academic critique, be clear about how suggested changes will support your needs or that of your community.

Extend

The freedoms granted by an open licence allow you to enrich and improve a dataset.

Sometimes the smallest changes can have the most impact. Convert the data into other common or standard formats. Extract data from spreadsheets into CSV files. Convert data published in more complex formats or via APIs into simpler tabular data to make it more accessible to analysts rather than programmers.
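
As a sketch of how small that kind of contribution can be, here’s the spreadsheet-to-CSV conversion in a few lines, using pandas; the file name is a placeholder:

    # Extract every sheet of a published spreadsheet into a separate CSV file.
    import pandas as pd

    # sheet_name=None reads all sheets into a dict of DataFrames
    sheets = pd.read_excel("published-data.xlsx", sheet_name=None)
    for name, df in sheets.items():
        df.to_csv(f"{name}.csv", index=False)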

Or maybe you can enrich a dataset by adding identifiers that will allow it to be linked to other sources. Do the work of merging with other datasets to bring in more context.

The downside here is that if the original data changes your extended version will get out of date. If you can’t commit to keeping your version up to date, then be sure to share your code and document your methods.

Allow others to repeat the steps you’ve taken. And don’t forget to suggest the improvements to the publisher.

Document

Write additional documentation to fill in gaps where the publisher has not provided sufficient background or explanation. Explain technical concepts or academic terms to a non-specialist audience.

As a user of the data, you’re able to write that documentation from a perspective that reflects the needs and questions of your specific community and the kinds of questions you need to ask. The original publisher might not have all that context or understand those needs, so this work can be really helpful.

Good documentation can be a finding aid. There are structured ways that you can go about writing documentation, such as this tool for writing civic data guides. (Check out some of the examples).

Evangelise

Email people that might have a need for the data. Tweet about it to a wider community. Highlight it in a presentation. Talk about it over coffee (or, these days, Zoom).

Fix

If the dataset is collaboratively maintained then go ahead and fix errors and omissions. If you’re not confident about making a fix, then submit an error report. In addition to fixing errors you might be able to help verify that data is correct.

If a dataset isn’t collaboratively maintained then, when you find errors, be sure to flag them to the publisher and highlight the issue for others. Or consider publishing an enriched version with fixes applied.


This framework isn’t perfect. The name is a bit clunky for a start. But there are a couple of things that I like about it.

Firstly, it recognises that not all contributions need to be technical. There’s room for people to contribute different skills in different ways.

Secondly, the elements overlap and reinforce one another. Writing documentation and blogging about how you’ve used a dataset helps to evangelise it. Enriching a dataset can help demonstrate in a practical way how a publisher can improve how data is published.

Finally, it serves to highlight some important aspects of community curation which aren’t always well supported in existing data platforms and portals. We can do better here.

If you’re interested in working on adapting this further then I’m happy to chat! It might be useful to have a cheat sheet that supports its application to data, and more examples of how to do these different elements well.

Why is change discovery important for open data?

Change discovery is the process of identifying changes to a resource. For example, that a document has been updated. Or, in the case of a dataset, whether some part of the data has been amended, e.g. to add data, fill in missing values, or correct existing data. If we can identify that changes have been made to a dataset, then we can update our locally cached copies, re-run analyses or generate new, enriched versions of the original.

Any developer who is building more than a disposable prototype will be looking for information about the ongoing stability and change frequency of a dataset. Typical questions might be:

  • How often will a dataset get routinely updated and republished?
  • What types of data updates are anticipated? E.g. are only new records added, or might data be amended and removed?
  • How will the dataset, or parts of it be version controlled?
  • How will changes to the dataset, or part of it (e.g. individual rows or objects) in the dataset be flagged?
  • How will planned and unplanned updates and changes be communicated to users of the dataset?
  • How will data updates be published, e.g. will there be a means of monitoring for or accepting incremental updates, or just refreshed data downloads?
  • Are large scale changes to the data model expected, and if so over what timescale?
  • Are changes to the technical infrastructure planned, and if so over what timescale?
  • How will planned (and unplanned) service downtime, e.g. for upgrades, be notified and reported?

These questions span a range of levels: from changes to individual elements of a dataset, through to the system by which it is delivered. These changes will happen at different frequencies and will be communicated in different ways.

Some types of change discovery can be done after the fact, e.g. by comparing two versions of a dataset. But in practice this is an inefficient way to synchronise and share data, as the consumer needs to reconstruct a series of edits and changes that have already been applied by the publisher of the data. To efficiently publish and distribute data we need to be able to understand when changes have happened.

Some types of changes, e.g. to data models and formats, will just break downstream systems if not properly advertised in advance. So it’s even more important to consider the impacts of these types of change.

A robust data infrastructure will include an appropriate change notification system for different levels of the system. Some of these will be automated. Some will be part of the process of supporting end users. For example:

  • changes to a row in a dataset might be flagged with a timestamp and a change notice
  • API responses might indicate the version of the object being retrieved
  • dataset metadata might include an indication of the planned frequency of publication and a timestamp for when the dataset was last modified
  • a data portal might include a calendar indicating when key datasets will be updated or a feed of recently updated or changed datasets
  • changes to the data model and the API used to deliver a dataset might be announced and discussed via a developer support forum

These might be implemented as technical features of the platform. But they might also be as simple as an email to users, or a public tweet.

Versioning of data can also help data publishers improve the scalability of their infrastructure and reduce the costs of data publishing. For example, adding features to data portals that might let data users:

  • make API calls that will only return responses if data has been updated since the user last requested it, e.g. using HTTP Conditional GET (see the sketch after this list). This can reduce bandwidth and load on the publisher by encouraging local caching of data
  • use a checksum and/or timestamps to detect whether bulk downloads have changed to reduce bandwidth
  • subscribe to machine-readable feeds of dataset-level changes, to avoid the need for users to repeatedly re-download large datasets
  • subscribe to machine-readable feeds of new datasets, to facilitate mirroring of data across systems
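
As a sketch of the first two of those from a data user’s perspective, here’s how a client might check for changes using an HTTP conditional GET, with a content checksum as a fallback. The URL is a placeholder:

    # Check whether a dataset download has changed, using a conditional GET
    # (via the ETag header) and a content checksum as a fallback.
    import hashlib
    import requests

    URL = "https://example.org/dataset.csv"

    # First fetch: remember the ETag and a checksum of the content
    first = requests.get(URL)
    etag = first.headers.get("ETag")
    checksum = hashlib.sha256(first.content).hexdigest()

    # Later fetch: ask the server for the data only if it has changed
    headers = {"If-None-Match": etag} if etag else {}
    later = requests.get(URL, headers=headers)

    if later.status_code == 304:
        print("Not modified: keep the local copy")
    elif hashlib.sha256(later.content).hexdigest() == checksum:
        print("Content unchanged: no update needed")
    else:
        print("Dataset has changed: refresh the local copy")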

Supporting change notification and discovery, even if it’s just through documentation rather than more automated means, is an important part of engineering any good data platform.

I think it’s particularly important for open data (and other data that is liberally licensed) because these datasets are frequently copied, distributed and republished across different platforms. The ability to distribute a dataset, in different formats or with improvements and corrections, is one of the key freedoms that an open licence provides.

The downside to secondary publishing is that we end up with multiple copies of a dataset, some or all of which might be out of date, or have diverged from the original at different points in time.

Without robust approaches to provenance, change control and discovery, we run the risk of data becoming out of date and leading to poor analyses and decision making. Multiple copies of the same dataset, while increasing ease of use, also increase friction by requiring users to find the original authoritative data among all the copies. Or to figure out whether the copy available in their preferred platform is completely up to date with the original.

Documentation and linking to original sources can help mitigate those problems. But automating change notifications, to allow copies of datasets to be easily synchronised between platforms at the point they are updated, is also important. I’ve not seen a lot of recent work on documenting these as best practices. I think there are still some gaps in the standards landscape around data platforms. So I’d be interested to hear of examples.

In the meantime, if you’re building a data platform, think about how you can enable users to more efficiently and automatically consume updated data.

And if you’re republishing primary data in other platforms, make sure you’re including detailed information and documentation about how and when you last refreshed the dataset. Ideally your copies will be automatically updated as the source changes. Linking to the open source code you ran to make the secondary copy will allow others to repeat that process if they need an updated version faster than you plan to produce one.