Talk: Documenting Identifiers for Humans and Machines

This is a rough transcript of a talk I recently gave at a session at Pidapalooza 2019. You can view the slides from the talk here. I’m sharing my notes for the talk here, with a bit of light editing. I’d also really welcome you thoughts and feedback on this discussion document.

At the Open Data Institute we think of data as infrastructure. Something that must be invested in and maintained so that we can maximise the value we get from data. For research, to inform policy and for a wide variety of social and economic benefits.

Identifiers, registers and open standards are some of the key building blocks of data infrastructure. We’ve done a lot of work to explore how to build strong, open foundations for our data infrastructure.

A couple of years ago we published a white paper highlighting the importance of openly licensed identifiers in creating open ecosystems around data. We used that to introduce some case studies from different sectors and to explore some of the characteristics of good identifier systems.

We’ve also explored ways to manage and publish registers. “Register” isn’t a word that I’ve encountered much in this community. But its frequently used to describe a whole set of government data assets.

Registers are reference datasets that provide both unique and/or persistent identifiers for things, and data about those things. The datasets of metadata that describe ORCIDs and DOIs are registers. As are lists of doctors, countries and locations where you can get our car taxed. We’ve explored different models for stewarding registers and ways to build trust
around how they are created and maintained.

In the work I’ve done and the conversations I’ve been involved with around identifiers, I think we tend to focus on two things.

The first is persistence. We need identifiers to be persistent in order to be able to rely on them enough to build them into our systems and processes. I’ve seen lots of discussion about the technical and organisational foundations necessary to ensure identifiers are persistent.

There’s also been great work and progress around giving identifiers affordance. Making them actionable.

Identifiers that are URIs can be clicked on in documents and emails. They can be used by humans and machines to find content, data and metadata. Where identifiers are not URIs, then there are often resolvers that will help to make to integrate them with the web.

Persistence and affordance are both vital qualities for identifiers that will help us build a stronger data infrastructure.

But lately I’ve been thinking that there should be more discussion and thought put into how we document identifiers. I think there are three reasons for this.

Firstly, identifiers are boundary objects. As we increase access to data, by sharing it between organisations or publishing it as open data, then an increasing number of data users  and communities are likely to encounter these identifiers.

I’m sure everyone in this room know what a DOI is (aside: they did). But how many people know what a TOID is? (Aside: none of them did). TOIDs are a national identifier scheme. There’s a TOID for every geographic feature on Ordnance Survey maps. As access to OS data increases, more developers will be introduced to TOIDs and could start using them in their applications.

As identifiers become shared between communities. It’s important that the context around how those identifiers are created and managed is accessible, so that we can properly interpret the data that uses them.

Secondly, identifiers are standards. There are many different types of standard. But they all face common problems of achieving wide adoption and impact. Getting a sector to adopt a common set of identifiers is a process of agreement and implementation. Adoption is driven by engagement and support.

To help drive adoption of standards, we need to ensure that they are well documented. So that users can understand their utility and benefits.

Finally identifiers usually exist as part of registers or similar reference data. So when we are publishing identifiers we face all the general challenges of being good data publishers. The data needs to be well described and documented. And to meet a variety of data user needs, we may need a range of services to help people consume and use it.

Together I think these different issues can lead to additional friction that can hinder the adoption of open identifiers. Better documentation could go some way towards addressing some of these challenges.

So what documentation should we publish around identifier schemes?

I’ve created a discussion document to gather and present some thoughts around this. Please have a read and leave you comments and suggestions on that document. For this presentation I’ll just talk through some of the key categories of information.

I think these are:

  • Descriptive information that provides the background to a scheme, such as what it’s for, when it was created, examples of it being used, etc
  • Governance information that describes how the scheme is managed, who operates it and how access is managed
  • Technical notes that describe the syntax and validation rules for the scheme
  • Operational information that helps developers understand how many identifiers there are, when and how new identifiers are assigned
  • Service pointers that signpost to resolvers and other APIs and services that help people use or adopt the identifiers

I take it pretty much as a given that this type of important documentation and metadata should be machine-readable in some form. So we need to approach all of the above in a way that can meet the needs of both human and machine data users.

Before jumping into bike-shedding around formats. There’s a few immediate questions to consider:

  • how do we make this metadata discoverable, e.g. from datasets and individual identifiers?
  • are there different use cases that might encourage us to separate out some of this information into separate formats and/or types of documentation?
  • what services might we build off the metadata?
  • …etc

I’m interested to know whether others think this would be a useful exercise to take further. And also the best forum for doing that. For example should there be a W3C community group or similar that we could use to discuss and publish some best practice.

Please have a look at the discussion document. I’m keen to learn from this community. So let me know what you think.

Thanks for listening.