The words we use for data

I’ve been on leave this week so, amongst the gardening and relaxing, I’ve had a bit of head space to think. One of the things I’ve been thinking about is the words we choose to use when talking about data. It was Dan’s recent blog post that originally triggered it. But I was reminded of it this week after seeing more people talking past each other and reading about how the Guardian has changed the language it uses when talking about the environment: climate crisis, not climate change.

As Dan pointed out we often need a broader vocabulary when talking about data.  Talking about “data” in general can be helpful when we want to focus on commonalities. But for experts we need more distinctions. And for non-experts we arguably need something more tangible. “Data”, “algorithm” and “glitch” are default words we use but there are often better ones.

It can be difficult to choose good words for data because everything can be treated as data these days. Whether it’s numbers, text, images or video everything can be computed on, reported and analysed. Which makes the idea of data even more nebulous for many people.

In Metaphors We Live By, George Lakoff and Mark Johnson discuss how the metaphors we use in language, whether consciously or unconsciously, shape how we think about the world. They show that a careful choice of metaphor can emphasise or obscure important aspects of the things we are discussing.

The example that stuck with me was how we describe debates. We often do so in terms of contests to be won, or battles to be fought (“the war of words”). What if we thought of debates as dances instead? Would that help us focus on compromise and collaboration?

This is why I think that data as infrastructure is such a strong metaphor. It helps to highlight some of the most important characteristics of data: that it is collected and used by communities, needs to be supported by guidance, policies and technologies and, most importantly, needs to be invested in and maintained to support a broad variety of uses. We’ve all used roads and engaged with the systems that let us make use of them. Focusing on data as information, as zeros and ones, brings nothing to the wider debate.

If our choice of metaphors and words can help to highlight or hide important aspects of a discussion, then what words can we use to help focus some of our discussions around data?

It turns out there’s quite a few.

For example, there are “samples” and “sampling”. These are words used in statistics, but their broader usage has the same meaning. When we talk about sampling something, whether it’s food or drink, music or perfume, it’s clear that we’re not taking the whole thing. Talking about sampling might help us to be clearer that often, when we’re collecting data, we don’t have the whole picture. We just have a tester, a taste. Hopefully one which is representative of the whole. We can make choices about when, where and how often we take samples. We might only be allowed to take a few.

“Polls” and “polling” are similar words. We sample people’s opinions in a poll. While we often use these words in more specific ways, they helpfully come with some understanding that this type of data collection and analysis is imperfect. We’re all very familiar at this point with the limitations of polls.

Or how about “observations” and “observing”? Unlike “sensing”, which is a passive word, “observing” is more active and purposeful. It implies that someone or something is watching. When we want to highlight that data is being collected about people or the environment, “taking observations” might help us think about who is doing the observing, and why. Instead of “citizen sensing”, which is a passive way of describing participatory data collection, “citizen observers” might place a bit more focus on the work and effort that is being contributed.

“Catalogues” and “cataloguing” are words that, for me at least, imply maintenance and value-added effort. I think of librarians cataloguing books and artefacts. “Stewards” and “curators” are other important roles.

AI and Machine Learning are often used to make predictions. For example, of products we might want to buy, or whether we’re going to commit a crime. Or how likely it is that we might have a car accident based on where we live. These predictions are imperfect. Yet we talk about algorithms as “knowing”, “spotting”, “telling” or “helping”, when they don’t really do any of those things.

What they are doing is making a “forecast”. We’re all familiar with weather forecasts and their limits. So why not use the same words for the same activity? It might help to highlight the uncertainty around the uses of the data and technology, and reinforce the need to use these forecasts as context.

In other contexts we talk about using data to build models of the world, or to build “digital twins”. Perhaps we should just talk more about “simulations”? There are enough people playing games these days that I suspect there’s a broader understanding of what a simulation is: a cartoon sketch of some aspect of the real world that might be helpful but which has its limits.

Other words we might use are “ratings” and “reviews” to help describe data and systems that create rankings and automated assessments. Many of us have encountered ratings and reviews and understand that they are often highly subjective and need interpretation.

Or how about simply “measuring” as a tangible example of collecting data? We’ve all used a ruler or measuring tape and know that sometimes we need to be careful about taking measurements: “Measure twice, cut once”.

I’m sure there are lots of others. I’m also well aware that not all of these terms will be familiar to everyone. And not everyone will associate them with things in the same way as I do. The real proof will be testing words with different audiences to see how they respond.

I think I’m going to try to deliberately use a broad range of language in my talks and writing and see how it fares.

What terms do you find most useful when talking about data?

How can we describe different types of dataset? Ten dataset archetypes

As a community, when we are discussing recommendations and best practices for how data should be published and governed, there is a natural tendency for people to focus on the types of data they are most familiar with.

This leads to suggestions that every dataset should have an API, for example. Or that every dataset should be available in bulk. While good general guidance, those approaches aren’t practical in every case. That’s because we also need to take into account a variety of other issues, including:

  • the characteristics of the dataset
  • the capabilities of the publishing organisation and the funding they have available
  • the purpose behind publishing the data
  • and the ethical, legal and social contexts in which it will be used

I’m not going to cover all of that in this blog post.

But it occurred to me that it might be useful to describe a set of dataset archetypes, that would function a bit like user personas. They might help us better answer some of the basic questions people have around data, discuss recommendations around best practices, inform workshop exercises or just test our assumptions.

To test this idea I’ve briefly described ten archetypes. For each one I’ve tried to describe some of its features, identify some specific examples, and briefly outline some of the challenges that might apply in providing sustainable access to it.

Like any characterisation, detail is lost. This is not an exhaustive list. I haven’t attempted to list every possible variation based on size, format, timeliness, category, etc. But I’ve tried to capture a range that hopefully illustrates some different characteristics. The archetypes reflect my own experiences; you will have different thoughts and ideas. I’d love to read them.

The Study

The Study is a dataset that was collected to support a research project. The research group collected a variety of new data as part of conducting their study. The dataset is small, focused on a specific use case, and there are no plans to maintain or update it further as the research group does not have any ongoing funding to collect or maintain the dataset. The data is provided as is for others to reuse, e.g. to confirm the original analysis of the data or to use it in other studies. To help others, and as part of writing some academic papers that reference the dataset, the research group has documented their methodology for collecting the data. The dataset is likely published in an academic data portal or alongside the academic papers that reference it.

Examples: water quality samples, field sightings of animals, laboratory experiment results, bibliographic data from a literature review, photos showing evidence of plant diseases, consumer research survey results

The Sensor Feed

The Sensor Feed is a stream of sensor readings produced by a collection of sensors that have been installed across a city. New readings are added to the stream at regular intervals. The feed is provided to allow a variety of applications to tap into the raw sensor readings. The data points are as directly reported by the individual sensors and are not quality controlled. The individual sensors may have been updated, re-calibrated or replaced over time. The readings are part of the operational infrastructure of the city, so can be expected to be available over at least the medium term. This means the dataset is effectively unbounded: new observations will continue to be reported until the infrastructure is decommissioned.

Examples: air quality readings, car park occupancy, footfall measurements, rain gauges, traffic light queuing counts, real-time bus locations
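To make the archetype a little more concrete, here’s a sketch of what a single reading in a feed like this might look like. It’s a purely hypothetical example: the field names, sensor and location are invented for illustration.

```json
{
  "sensor": "air-quality-042",
  "location": {"lat": 51.3781, "lon": -2.3597},
  "timestamp": "2019-05-20T09:15:00Z",
  "readings": {
    "no2": 38.1,
    "pm25": 11.4
  },
  "units": "µg/m3"
}
```

As described above, these would be raw values: any calibration or quality control is left to downstream users.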

The Statistical Index

The Statistical Index is intended to provide insights into the performance of specific social or economic policies by measuring some aspect of a local community or economy. For example a sales or well-being index. The index draws on a variety of primary datasets, e.g. on commercial activities, which are then processed according to a documented methodology to generate the index. The Index is stewarded by an organisation and is expected to be available over the long term. The dataset is relatively small and is reported against specific geographic areas (e.g. from The Register) to support comparisons. The Index is updated on a regular basis, e.g. monthly or annually. Use of the data typically involves comparing across time and location at different levels of aggregation.

Examples: street safety survey, consumer price indices, happiness index, various national statistical indexes

The Register

The Register is a set of reference data that is useful for adding context to other datasets. It consists of a list of specific things, e.g. locations, cars or services, with a unique identifier and some basic descriptive metadata for each of the entries on the list. The Register is relatively small, but may grow over time. It is stewarded by an organisation tasked with making the data available for others. The steward, or custodian, provides some guarantees around the quality of the data. It is commonly used as a means to link, validate and enrich other datasets and is rarely used in isolation, other than in reporting on changes to the size and composition of the register.

Examples: licensed pubs, registered doctors, lists of MOT stations, registered companies, a taxonomy of business types, a statistical geography, addresses

The Database

The Database is a copy or extract of the data that underpins a specific application or service. The database contains information about a variety of different types of things, e.g. musicians and their albums and songs. It is a relatively large dataset that can be used to perform a variety of different types of query and to support a variety of uses. As it is used in a live service it is regularly updated, undergoes a variety of quality checks, and is growing over time in both volume and scope. Some aspects of The Database may reference one or more Registers, or could be considered as Registers in themselves.

Examples: geographic datasets that include a variety of different types of features (e.g. OpenStreetMap, MasterMap), databases of music (e.g. MusicBrainz) and books (e.g. OpenLibrary), company product and customer databases, Wikidata

The Description

The Description is a collection of a few data points relating to a single entity. Embedded into a single web page, it provides some basic information about an event, or place, or company. Individually it may be useful in context, e.g. to support a social interaction or application share. The owner of the website provides some data about the things that are discussed or featured on the website, but does not have access to a full dataset. The individual item descriptions are provided by website contributors using a CMS to add content to the website. If available in aggregate, the individual descriptions might make a useful Database or Register.

Examples: descriptions of jobs, events, stores, video content, articles

The Personal Records

The Personal Records are a history of the interactions of a single person with a product or service. The data provides insight into the individual person’s activities. The data is a slice of a larger dataset that contains data for a larger number of people. As the information contains personal information, it has to be kept secure and the individual has various rights over the collection and use of the data, as granted by GDPR (or similar local regulation). The dataset is relatively small and focused on a specific set of interactions, but is growing over time. Analysing the data might provide useful insight to the individual that may help them change their behaviour, improve their health, etc.

Examples: bank transactions, home energy usage, fitness or sleep tracker, order history with an online service, location tracker, health records

The Social Graph

The Social Graph is a dataset that describes the relationships between a group of individuals. It is typically built up from small contributions made by individuals that provide information about their relationships and connections to others. They may also provide information about those other people, e.g. names, contact numbers, service ratings, etc. When published or exported it is typically focused on a single individual, but might be available in aggregate. It is different to Personal Records as it’s specifically about multiple people, rather than a history of information about an individual (although Personal Records may reference or include data about others). The graph as a whole is maintained by an organisation that is operating a social network (or a service that has social features).

Examples: social networks data, collaboration graphs, reviews and trip histories from ride sharing services, etc

The Observatory

The Observatory is a very large dataset produced by a coordinated, large-scale data collection exercise, for example by a range of earth observation satellites. The data collection is intentionally designed to support a variety of downstream uses, which informs the scale and type of data collected. The scale and type of data can make it difficult to use because of the need for specific tools or expertise. But there are a wide range of ways in which the raw data can be processed to create other types of data products, to drive a variety of analyses, or to power a variety of services. It is refreshed and re-released as required by the needs and financial constraints of the organisations collaborating on collecting and using the dataset.

Examples: earth observation data, LIDAR point clouds, data from astronomical surveys or Large Hadron Collider experiments

The Forecast

The Forecast is used to predict the outcome of specific real-world events, e.g. a weather or climate forecast. It draws on a variety of primary datasets which are then processed and analysed to produce the output dataset. The process by which the predictions are made is well documented, to provide insight into the quality of the output. As the predictions are time-based, the dataset has a relatively short “shelf-life”, which means that users need to quickly access the most recent data for a specific location or area of interest. Depending on the scale and granularity, Forecast datasets can be very large, making them difficult to distribute in a timely manner.

Example: weather forecasts

Let me know what you think of these. Do they provide any useful perspective? How would you use or improve them?

That thing we call “open”

I’ve been involved in a few conversations recently about what “open” or “being open” means in different situations.

As I’ve noted previously, when people say “open” they often mean very different things. And while there may be clear definitions of “open”, people don’t often use the terms correctly. And some phrases, like “open API”, are still, well, open to interpretation.

In this post I’m going to summarise some of the ways in which I tend to think about making something “open”.

Let me know if I’m missing something so I can plug gaps in my understanding.

Openness of a “thing”

Digital objects: books, documents, images, music, software and datasets can all be open.

Making things open in this sense is the most well documented, but still the most consistently misunderstood. There are clear definitions for open content and data, open source, etc. Open in these contexts provides various freedoms: to use, remix, share, etc.

People often confuse something being visible or available to them as being open, but that’s not the same thing at all. Being able to see or read something doesn’t give you any legal permissions at all.

It’s worth noting that the definitions of open “things” in different communities are often overlapping. For example, the Creative Commons licences allow works to be licensed in ways that enable a wide variety of legal reuses. But the Open Definition only recognises a subset of those licences as open; the rest might be better described as shared.

Putting an open licence on something also doesn’t necessarily grant you the full freedom to reuse that thing. For example I could open source some machine learning software but it might only be practically reusable if you can train it on some data that I’ve chosen not to share.

Or I might use an open licence like the Open Government Licence that allows me to put an open licence on something whilst ignoring the existence of any third-party rights. No need to do my homework. Reuser beware.

Openness of a process

Processes can be open. In this context it might be better to think about transparency (e.g. of how the process is run) or the ability to participate in a process.

Anything that changes and evolves over time will have a process by which those changes are identified, agreed, prioritised and applied. We sometimes call that governance. The definition of an open standard includes defining both the openness of the standard (the thing) as well as the process.

Stewardship of a software project, a dataset or a standard is another example of where it might be useful for a process to be open. Questions we can ask of open processes are things like:

  • Can I contribute to the main codebase of a software package, rather than just fork it?
  • Can I get involved in the decision making around how a piece of software or standard evolves?
  • Can I directly fix errors in a dataset?
  • Can I see what decisions have been, or are being, made that relate to how something is evolving?

When we’re talking about open data or open source, often we’re really talking about openness of the “thing”. But when we’re making things open to make them better, I think we’re often talking about being open to contributions and participation. Which needs something more than a licence on a thing.

There’s probably a broader category of openness here which relates to how open a process is socially. Words like inclusivity and diversity spring to mind.

Your standards process isn’t really open to all if all of your meetings are held face to face in Hawaii.

Openness of a product, system or platform

Products, platforms and systems can be open too. Here we can think of openness as relating to the degree to which the system:

  • is built around open standards and open data (made from open things)
  • is operated using open processes
  • is available for wider access and use

We can explore this by asking questions like:

  • Is it designed to run on open infrastructure or is it tied to particular cloud infrastructure or hardware?
  • Are the interfaces to the system built around open standards?
  • Can I get access to an API? Or is it invite only?
  • How do the terms of service shape the acceptable uses of the system?
  • Can I use its outputs, e.g. the data returned by a platform or an API, under an open licence?
  • Can we observe how well the system or platform is performing, or measure its impacts in different ways (e.g. socially, economically, environmentally)?

Openness of an ecosystem

Ecosystems can be open too. In one sense an open ecosystem is “all of the above”. But there are properties of an ecosystem that might themselves indicate aspects of openness:

  • Is there a choice in providers, or is there a monopoly provider of services or data?
  • How easy is it for new organisations to engage with the ecosystem, e.g. to provide competing or new services?
  • Can we measure the impacts and operations of the ecosystem?

When we’re talking about openness of an ecosystem we’re usually talking about markets and sectors and regulation and governance.

Applying this in practice

So when thinking about whether something is “open”, the first thing I tend to do is consider which of the above categories apply. In some cases it’s actually several.

This is evident in my attempt to define “open API”.

For example we’re doing some work @ODIHQ to explore the concept of a digital twin. According to the Gemini Principles a digital twin should be open. Here we can think of an individual digital twin as an object (a piece of software or a model), a process (e.g. as an open source project), or an operational system or platform, depending on how it’s made available.

We’re also looking at cities. Cities can be open in the sense of the openness of their processes of governance and decision making. They might also be considered as platforms for sharing data and connecting software. Or as ecosystems of the same.

Thinking about the governance of data

I find “governance” to be a tricky word. Particularly when we’re talking about the governance of data.

For example, I’ve experienced conversations with people from a public policy background and people with a background in data management where it’s clear that there are different perspectives. From a policy perspective, governance of data could be described as the work that governments do to enforce, encourage or enable an environment where data works for everyone. Which is slightly different to the work that organisations do to ensure that data is treated as an asset, which is how I tend to think about organisational data governance.

These aren’t mutually exclusive perspectives. But they operate at different scales with a different emphasis, which I think can sometimes lead to crossed wires or missed opportunities.

As another example, reading this interesting piece on open data governance recently, I found myself wondering about that phrase: “open data governance”. Does it refer to the governance of open data? Being open about how data is governed? The use of open data in governance (e.g. as a public policy tool)? Or the role of open data in demonstrating good governance (e.g. through transparency)? I think the article touched on all of these, but they seem quite different things. (Personally I’m not sure there is anything special about the governance of open data as opposed to data in general: open data isn’t special.)

Now, all of the above might be completely clear to everyone else and I’m just falling into my usual trap of getting caught up on words and meanings. But picking away at definitions is often useful, so here we are.

The way I’ve rationalised the different data management and public policy perspectives is in thinking about the governance of data as a set of (partly) overlapping contexts. Like this:

 

Governance of data as a set of overlapping contexts

 

Whenever we are managing and using data we are doing so within a nested set of rules, processes, legislation and norms.

In the UK our use of data is bounded by a number of contexts. These include, for example: legislation from the EU (currently!), legislation from the UK government, rules defined by regulators, best practices that might define how a sector operates, our norms as a society and community, and then the governance processes that apply within our specific organisations, departments and even teams.

Depending on what you’re doing with the data, and the type of data you’re working with, then different contexts might apply. The obvious one being the use of personal data. As data moves between organisations and countries, then different contexts will apply, but we can’t necessarily ignore the broader contexts in which it already sits.

The narrowest contexts, e.g. those within an organisation, will focus on questions like: “how are we managing dataset XYZ to ensure it is protected and managed to a high quality?” The broadest contexts are likely to focus on questions like: “how do we safely manage personal data?”

Narrow contexts define the governance and stewardship of individual datasets. Wider contexts guide the stewardship of data more broadly.

What the above diagram hopefully shows is that data, and our use of data, is never free from governance. It’s just that the terms under which it is governed may be very loosely defined.

This terrible sketch I shared on twitter a while ago shows another way of looking at this. The laws, permissions, norms and guidelines that define the context in which we use data.

Data use in context

One of the ways in which I’ve found this “overlapping contexts” perspective useful, is in thinking about how data moves into and out of different contexts. For example when it is published or shared between organisations and communities. Here’s an example from this week.

IBM have been under fire because they recently released (or re-released) a dataset intended to support facial recognition research. The dataset was constructed by linking to public and openly licensed images already published on the web, e.g. on Flickr. The photographers, and in some cases the people featured in those images, are unhappy about the photographs being used in this new way. In this new context.

In my view, the IBM researchers producing this dataset made two mistakes. Firstly, they didn’t properly appreciate the norms and regulations that apply to this data: the broader contexts which inform how it is governed and used, even though it’s published under an open licence. For example, people’s expectations about how photographs of them will be used.

An open licence helps data move between organisations, between contexts, but doesn’t absolve anyone from complying with all of the other rules, regulations, norms, etc that will still apply to how it is accessed, used and shared. The statement from Creative Commons helps to clarify that their licences are not a tool for governance. They just help to support the reuse of information.

This led to IBM’s second mistake. By creating a new dataset they took on responsibility as its data steward. And being a data steward means having a well-defined set of data governance processes that are informed and guided by all of the applicable contexts of governance. But they missed some things.

The dataset included content that was created by, and features, individuals. So their lack of engagement with the community of contributors to discuss norms and expectations was a mistake. The lack of good tools to allow people to remove photos (NBC News created a better tool to allow Flickr users to check the contents of the dataset) is also a shortfall in their duties. It’s the combination of these that has led to the outcry.

If IBM had instead launched an initiative in which they built this dataset collaboratively with the community, then they could have avoided this issue. This is the approach that Mozilla took with Voice. IBM, and the world, might even have had a better dataset as a result, because people might have opted in to including more photos. This is important because, as John Wilbanks has pointed out, the market isn’t creating these fairer, more inclusive datasets. We need them to create an open, trustworthy data ecosystem.

Anyway, that’s one example of how I’ve found thinking about the different contexts of governing data helpful in understanding how to build stronger data infrastructure. What do you think? Am I thinking about this all wrong? What else should I be reading?

 

Impressions from pidapalooza 19

This week I was at the third pidapalooza conference in Dublin. It’s a conference dedicated to open identifiers: how to create them, steward them, drive adoption and promote their benefits.

Anyone who has spent any time reading this blog or following me on twitter will know that this is a topic close to my heart. Open identifiers are infrastructure.

I’ve separately written up the talk I gave on documenting identifiers to help drive adoption and spur the creation of additional services. I had lots of great nerdy discussions around URIs, identifier schemes, compact URIs, standards development and open data. But I wanted to briefly capture and share a few general impressions.

Firstly, while the conference topic is very much my thing, and the attendees were very much my people (including a number of ex-colleagues and collaborators), I was approaching the event from a very different perspective to the majority of other attendees.

Pidapalooza as a conference has been created by organisations from the scholarly publishing, research and archiving communities. Identifiers are a key part of how the integrity of the scholarly record is maintained over the long term. They’re essential to support archiving and access to a variety of research outputs, with data being a key growth area. Open access and open data were very much in evidence.

But I think I was one of only a few (perhaps the only?) attendee from what I’ll call the “broader” open data community. That wasn’t a complete surprise but I think the conference as a whole could benefit from a wider audience and set of participants.

If you’re working in and around open data, I’d encourage you to go to pidapalooza, submit some talk ideas and consider sponsoring. I think that would be beneficial for several reasons.

Firstly, in the pidapalooza community, the idea of data infrastructure is just a given. It was refreshing to be around a group of people that were already past the idea of thinking of data as infrastructure and were instead focusing on how to build, govern and drive adoption of that infrastructure. There are a lot of lessons there that are more generally applicable.

For example I went to a fascinating talk about how EIDR, an identifier for movie and television assets, had helped to drive digital transformation in that sector. Persistent identifiers are critical to digital supply chains (Netflix, streaming services, etc). There are lessons here for other sectors around the benefits of wider sharing of data.

I also attended a great talk by the Australian Research Data Commons that reviewed the ways in which they were engaging with their community to drive adoption and best practices for their data infrastructure. They have a programme of policy change, awareness raising, skills development, community building and culture change which could easily be replicated in other areas. It paralleled some of the activities that the Open Data Institute has carried out around its sector programmes like OpenActive.

The need for transparent governance and long-term sustainability were also frequent topics. As was the recognition that data infrastructure takes time to build. Technology is easy; it’s growing a community and building consensus around an approach that takes time.

(btw, I’d love to spend some time capturing some of the lessons learned by the research and publishing community, perhaps as a new entry to the series of data infrastructure papers that the ODI has previously written. If you’d like to collaborate with or sponsor the ODI to explore that, then drop me a line?)

Secondly, the pidapalooza community seems to have generally accepted (with a few exceptions) the importance of web identifiers and open licensing of reference data. But that practice is still not widely adopted in other domains. Few of the identifiers I encounter in open government data, for example, are well documented, openly licensed or supported by a range of APIs and services.

Finally, much of the focus of pidapalooza was on identifying research outputs and related objects: papers, conferences, organisations, datasets, researchers, etc. I didn’t see many discussions around the potential benefits and consequences of use of identifiers in research datasets. Again, this focus follows from the community around the conference.

But as the research, data science and machine-learning communities begin exploring new approaches to increase access to data, it will be increasingly important to explore the use of standard identifiers in that context. Identifiers have a clear role in helping to integrate data from different sources, but there are wider risks around data privacy, and ethical considerations around identification of individuals, for example, that will need to be addressed.

I think we should be building a wider community of practice around use of identifiers in different contexts, and I think pidapalooza could become a great venue to do that.

Talk: Documenting Identifiers for Humans and Machines

This is a rough transcript of a talk I recently gave at a session at Pidapalooza 2019. You can view the slides from the talk here. I’m sharing my notes for the talk here, with a bit of light editing. I’d also really welcome your thoughts and feedback on this discussion document.

At the Open Data Institute we think of data as infrastructure. Something that must be invested in and maintained so that we can maximise the value we get from data. For research, to inform policy and for a wide variety of social and economic benefits.

Identifiers, registers and open standards are some of the key building blocks of data infrastructure. We’ve done a lot of work to explore how to build strong, open foundations for our data infrastructure.

A couple of years ago we published a white paper highlighting the importance of openly licensed identifiers in creating open ecosystems around data. We used that to introduce some case studies from different sectors and to explore some of the characteristics of good identifier systems.

We’ve also explored ways to manage and publish registers. “Register” isn’t a word that I’ve encountered much in this community. But it’s frequently used to describe a whole set of government data assets.

Registers are reference datasets that provide both unique and/or persistent identifiers for things, and data about those things. The datasets of metadata that describe ORCIDs and DOIs are registers. As are lists of doctors, countries and locations where you can get your car taxed. We’ve explored different models for stewarding registers and ways to build trust around how they are created and maintained.
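As a sketch of the general shape, a register is essentially a list of entries like the following. This is a purely hypothetical example; the identifiers and fields are invented for illustration.

```json
[
  {
    "id": "https://example.org/register/doctor/1234",
    "name": "Jane Example",
    "registered": "2012-06-01",
    "status": "licensed"
  },
  {
    "id": "https://example.org/register/doctor/5678",
    "name": "Joe Example",
    "registered": "2015-09-14",
    "status": "suspended"
  }
]
```

Each entry has a unique, persistent identifier plus just enough descriptive metadata to let other datasets link to it and validate against it.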

In the work I’ve done and the conversations I’ve been involved with around identifiers, I think we tend to focus on two things.

The first is persistence. We need identifiers to be persistent in order to be able to rely on them enough to build them into our systems and processes. I’ve seen lots of discussion about the technical and organisational foundations necessary to ensure identifiers are persistent.

There’s also been great work and progress around giving identifiers affordance. Making them actionable.

Identifiers that are URIs can be clicked on in documents and emails. They can be used by humans and machines to find content, data and metadata. Where identifiers are not URIs, there are often resolvers that will help to integrate them with the web.

Persistence and affordance are both vital qualities for identifiers that will help us build a stronger data infrastructure.

But lately I’ve been thinking that there should be more discussion and thought put into how we document identifiers. I think there are three reasons for this.

Firstly, identifiers are boundary objects. As we increase access to data, by sharing it between organisations or publishing it as open data, an increasing number of data users and communities are likely to encounter these identifiers.

I’m sure everyone in this room knows what a DOI is (aside: they did). But how many people know what a TOID is? (Aside: none of them did.) TOIDs are a national identifier scheme. There’s a TOID for every geographic feature on Ordnance Survey maps. As access to OS data increases, more developers will be introduced to TOIDs and could start using them in their applications.

As identifiers become shared between communities, it’s important that the context around how those identifiers are created and managed is accessible, so that we can properly interpret the data that uses them.

Secondly, identifiers are standards. There are many different types of standard. But they all face common problems of achieving wide adoption and impact. Getting a sector to adopt a common set of identifiers is a process of agreement and implementation. Adoption is driven by engagement and support.

To help drive adoption of standards, we need to ensure that they are well documented. So that users can understand their utility and benefits.

Finally, identifiers usually exist as part of registers or similar reference data. So when we are publishing identifiers, we face all the general challenges of being good data publishers. The data needs to be well described and documented. And to meet a variety of data user needs, we may need a range of services to help people consume and use it.

Together I think these different issues can lead to additional friction that can hinder the adoption of open identifiers. Better documentation could go some way towards addressing some of these challenges.

So what documentation should we publish around identifier schemes?

I’ve created a discussion document to gather and present some thoughts around this. Please have a read and leave your comments and suggestions on that document. For this presentation I’ll just talk through some of the key categories of information.

I think these are:

  • Descriptive information that provides the background to a scheme, such as what it’s for, when it was created, examples of it being used, etc
  • Governance information that describes how the scheme is managed, who operates it and how access is managed
  • Technical notes that describe the syntax and validation rules for the scheme
  • Operational information that helps developers understand how many identifiers there are, when and how new identifiers are assigned
  • Service pointers that signpost to resolvers and other APIs and services that help people use or adopt the identifiers

I take it pretty much as a given that this type of important documentation and metadata should be machine-readable in some form. So we need to approach all of the above in a way that can meet the needs of both human and machine data users.
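To make that a little more concrete, here’s one possible sketch of machine-readable documentation for an identifier scheme, organised around the categories above. To be clear: this is an invented example, not a proposal from the discussion document. The format, field names and the scheme itself are all hypothetical.

```json
{
  "name": "Example Identifier Scheme",
  "description": "Identifies widgets manufactured in the UK.",
  "created": "2010",
  "governance": {
    "steward": "Example Standards Body",
    "licence": "https://creativecommons.org/licenses/by/4.0/"
  },
  "technical": {
    "pattern": "^EX-[0-9]{8}$",
    "example": "EX-01234567"
  },
  "operational": {
    "size": 250000,
    "assignment": "identifiers are assigned on registration and never reused"
  },
  "services": {
    "resolver": "https://example.org/id/{identifier}",
    "search": "https://example.org/api/search"
  }
}
```

Something like this could be published alongside a register, discovered from individual identifiers, and used to drive validation tools or human-readable documentation pages.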

Before jumping into bike-shedding around formats, there are a few immediate questions to consider:

  • how do we make this metadata discoverable, e.g. from datasets and individual identifiers?
  • are there different use cases that might encourage us to separate out some of this information into separate formats and/or types of documentation?
  • what services might we build off the metadata?
  • …etc

I’m interested to know whether others think this would be a useful exercise to take further. And also the best forum for doing that. For example, should there be a W3C community group or similar that we could use to discuss and publish some best practice?

Please have a look at the discussion document. I’m keen to learn from this community. So let me know what you think.

Thanks for listening.

Talk: Tabular data on the web

This is a rough transcript of a talk I recently gave at a workshop on Linked Open Statistical Data. You can view the slides from the talk here. I’m sharing my notes for the talk here, with a bit of light editing.

At the Open Data Institute our mission is to work with companies and governments to build an open trustworthy data ecosystem. An ecosystem in which we can maximise the value from use of data whilst minimising its potential for harmful impacts.

An important part of building that ecosystem will be ensuring that everyone — including governments, companies, communities and individuals — can find and use the data that might help them to make better decisions and to understand the world around them.

We’re living in a period where there’s a lot of disinformation around. So the ability to find high quality data from reputable sources is increasingly important. Not just for us as individuals, but also for journalists and other information intermediaries, like fact-checking organisations.

Combating misinformation, regardless of its source, is an increasingly important activity. To do that at scale, data needs to be more than just easy to find. It also needs to be easily integrated into data flows and analysis. And the context that describes its limitations and potential uses needs to be readily available.

The statistics community has long had standards and codes of practice that help to ensure that data is published in ways that help to deliver on these needs.

Technology is also changing. The ways in which we find and consume information are evolving. Simple questions are now being directly answered from search results, or through agents like Alexa and Siri.

New technologies and interfaces mean new challenges in integrating and using data. This means that we need to continually review how we are publishing data. So that our standards and practices continue to evolve to meet data user needs.

So how do we integrate data with the web? To ensure that statistics are well described and easy to find?

We’ve actually got a good understanding of basic data user needs. Good quality metadata and documentation. Clear licensing. Consistent schemas. Use of open formats, etc, etc. These are consistent requirements across a broad range of data users.

What standards can help us meet those needs? We have DCAT and Data Packages. Schema.org Dataset metadata, and its use in Google dataset search, now provides a useful feedback loop that will encourage more investment in creating and maintaining metadata. You should all adopt it.
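For those unfamiliar with it, the markup involved is lightweight. Here’s a minimal, hypothetical example of a schema.org Dataset description, embedded as JSON-LD in a web page; the dataset and URLs are invented for illustration.

```json
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "Example unemployment statistics",
  "description": "Quarterly unemployment rates by local authority.",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "distribution": {
    "@type": "DataDownload",
    "encodingFormat": "text/csv",
    "contentUrl": "https://example.org/data/unemployment.csv"
  }
}
```

Publishing metadata like this is what allows services like Google dataset search to find and index a dataset.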

And we also have CSV on the Web. It does a variety of things which aren’t covered by some of those other standards. It’s a collection of W3C Recommendations that:

  • define a model and vocabulary for annotating tables, columns, rows and cells
  • describe how to locate the metadata for a tabular dataset on the web
  • specify how to validate tabular data and transform it into other formats, like JSON and RDF

The primer provides an excellent walk through of all of the capabilities and I’d encourage you to explore it.

One of the nice examples in the primer shows how you can annotate individual cells or groups of cells. As you all know this capability is essential for statistical data. Because statistical data is rarely just tabular: it’s usually decorated with lots of contextual information that is difficult to express in most data formats. Users of data need this context to properly interpret and display statistical information.
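To give a flavour of the approach, here’s a minimal CSV on the Web metadata file for a hypothetical table of statistical observations. The filenames and columns are invented for illustration, and this only scratches the surface of what the standard supports.

```json
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "unemployment.csv",
  "dc:license": {"@id": "https://creativecommons.org/licenses/by/4.0/"},
  "tableSchema": {
    "aboutUrl": "http://example.org/area/{area}",
    "columns": [
      {"name": "area", "titles": "area", "datatype": "string"},
      {"name": "year", "titles": "year", "datatype": "gYear"},
      {"name": "rate", "titles": "rate", "datatype": "decimal"}
    ]
  }
}
```

A file like this gives processors enough information to validate the CSV, link each row to an identifier for the area it describes, and convert the table to JSON or RDF.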

Unfortunately, CSV on the Web is still not that widely adopted. Even though it’s relatively simple to implement.

(Aside: several audience members noted they are using it internally in their data workflows. I believe the Office for National Statistics is also moving to adopt it.)

This might be because of a lack of understanding of some of the benefits it provides. Or that those benefits are limited in scope.

There also aren’t a great many tools that support CSV on the Web currently.

It might also be that there are some other missing pieces of data infrastructure that are blocking us from making best use of CSV on the Web and other similar standards and formats. Perhaps we need to invest further in creating open identifiers to help us describe statistical observations, e.g. so that we can clearly describe what types of statistics are being reported in a dataset.

But adoption could be driven from multiple angles. For example:

  • open data tools, portals and data publishers could start to generate best practice CSVs. That would be easy to implement
  • open data portals could also readily adopt CSV on the Web metadata, most already support DCAT
  • standards developers could adopt CSV on the Web as their primary means of defining schemas for tabular formats

Not everyone needs to implement or use the full set of capabilities. But with some small changes to tools and processes, we could collectively improve how tabular data is integrated into the web.

Thanks for listening.