What kinds of data is it useful to include in a register?

Registers are useful lists of information. A register might be a list of countries, companies, or registered doctors. Or addresses.

At the ODI we did a whole report on registers. It looks at different types of registers and how they’re governed. And GDS built a whole infrastructure to support them being published and used across the UK government.

Registers are core components of some types of identifier systems. They help to collect and share information about some aspect of the world we’re collectively interested in. For that reason it can be useful to know more about how the register is governed. So we know what it contains and how that list might change over time.

When those lists of things are useful in many different contexts, then making those registers open helps us to connect together different datasets and analyse them in new ways. They help to unlock context.

How much information should we put in a register? What information might it be useful to capture about the things ‒ the countries, the companies, or the addresses ‒  that are in our shared lists? Do we record just a company number and a name? Or also include the address of the company headquarters and the date it was founded?

When I’ve been designing registers and similar reference datasets, there’s some common categories of a information that I usually think about.

Identifiers

It’s useful if the things in our list have a unique identifier. They might have other identifiers assigned by different systems.

By capturing identifiers we can do things like:

  • clearly refer to items in the register, so we can find their attributes
  • use that identifier to link together different datasets
  • map between datasets that use different identifiers

Names and Labels

Things in the real world aren’t often referred to by an identifier. We give things names. Sometimes they may have several names.

Including names and labels in our identifiers allows us to do things like:

  • use a consistent, canonical name for things wherever they are referenced
  • link to things from a webpage
  • provide a way for a human being to recognise and find things in the register
  • turn a name into an identifier, so we can find more information about something

Relationships

Things in the real world are related to one another. Sometimes literally: I am your father (not, really). Sometimes spatially (this thing is here, or next to this other thing). Sometimes our world is organised into hierarchies or connected in other ways.

Including relationships in our register allows us to do things like:

  • visualise, present and navigate the contents of the list in a variety of ways
  • aggregate and report data according to the relationships between things
  • put something on a map

Types and categories

The things in our list might not all be the same. Or there may be differences between them. For example different types of companies. Or residential versus business addresses. Things might also be put into different categories. A register of companies might also categories businesses by sector.

Having types and categories in a list allows us to do things like:

  • extract part of the list we are interested in, sometimes we don’t need the whole thing
  • visualise, present and navigate the contents of the list in a greater variety of different ways
  • aggregate and report data according to how things are categorised

Lifecycle information

Things in the real world often have a life cycle. So do many digital things. Things are built, created, updated, revised, republished, retracted and demolished. Sometimes those events are tied to the thing being added to the register (“a list of registered companies”), sometimes they’re not (“a list of our current customers”).

Recording lifecycle information can help us to do things like:

  • understand the current state or status of something, which can help drive business and planning decisions
  • visualise, present and navigate the contents of the list in an even greater variety of ways
  • aggregate and report data according to where things are in their lifecycle

Administrative data (relating to the register)

It’s useful to capture data about when the information in a register has changed. For example when was something added to, or removed from a register? When did we last update its attributes or check that the information is current?

This type of information can help us to:

  • identify when information has been changed, so we can update our local copy of what’s in the register
  • extract part of the list we are interested in, as maybe we only want current or historical entries. Or just the recent additions
  • aggregate and report on how the data in the register has changed

Everything else

The list of useful things we might want to include in a register is potentially open ended. The trick in designing a good register is the working out of which bits are useful to be in the register, and which bits should be part of separate databases.

A good register will contain the data that is most commonly used across systems. Centralising that data can reduce the work, costs and also risks of collecting and maintaining it. If you put too much into the register you may end up increasing costs as you may have more to maintain. Or users have to spend more time pruning out what they don’t need.

But, if you are already maintaining a register and are planning to share it for others to use, you can increase its utility by sharing more information about each entry in the list.

Open UPRNs, a worked example

The UK should have an openly licensed address register. At the ODI we’ve long argued for the need for an open address register. But we don’t have that yet.

We do have a partial subset of our national address register available under an open licence, in the form of OS Open UPRNs product. It contains just the UPRN identifier and some spatial coordinates. Through the information in the related Open Identifiers product, we can also uncover some relationships between UPRNs and other spatial objects and administrative areas.

Drawing from the above examples this means we can do things like:

  • increase use of UPRNs as a common machine-readable identifier across datasets
  • identify a valid UPRN
  • locate them spatially on a map
  • relate those UPRNs to other things of interest, like administrative areas

With a bit of extra data engineering and analysis, e.g to look for variations across versions of the dataset we can also maybe work out a rough date for when a UPRN has been added to the list.

This is more than we can do before, which is great.

But there’s obviously clear much, much more we still can’t do:

  • filter out historical UPRNs
  • filter out UPRNs of different types
  • map between addresses (the names for those places) and the identifiers
  • understand the current status of a UPRN
  • aggregate and report on them using different categories
  • help people by building services that use the names (addresses) they’re familiar with
  • …etc, etc

We won’t be able to do those things until we have a fully open address register. But, until then, even including a handful of additional attributes (like a status code!) would clearly unlock more value.

I’ve previously argued that introducing a bit of product thinking might help to bring some focus to the decisions made about how data is published. And I still stand by much of that. But we need to be able to evaluate whether those product design decisions are achieving the intended effect.