Thinking about the governance of data

I find “governance” to be a tricky word. Particularly when we’re talking about the governance of data.

For example, I’ve had conversations with people from a public policy background and people with a background in data management where it’s clear that there are different perspectives. From a policy perspective, governance of data could be described as the work that governments do to enforce, encourage or enable an environment where data works for everyone. Which is slightly different to the work that organisations do to ensure that data is treated as an asset, which is how I tend to think about organisational data governance.

These aren’t mutually exclusive perspectives. But they operate at different scales with a different emphasis, which I think can sometimes lead to crossed wires or missed opportunities.

As another example, reading this interesting piece on open data governance recently, I found myself wondering about that phrase: “open data governance”. Does it refer to the governance of open data? Being open about how data is governed? The use of open data in governance (e.g. as a public policy tool)? Or the role of open data in demonstrating good governance (e.g. through transparency)? I think the article touched on all of these, but they seem quite different things. (Personally I’m not sure there is anything special about the governance of open data as opposed to data in general: open data isn’t special).

Now, all of the above might be completely clear to everyone else and I’m just falling into my usual trap of getting caught up on words and meanings. But picking away at definitions is often useful, so here we are.

The way I’ve rationalised the different data management and public policy perspectives is in thinking about the governance of data as a set of (partly) overlapping contexts. Like this:


Governance of data as a set of overlapping contexts


Whenever we are managing and using data we are doing so within a nested set of rules, processes, legislation and norms.

In the UK our use of data is bounded by a number of contexts. This includes, for example: legislation from the EU (currently!), legislation from the UK government, legislation defined by regulators, best practices that might define how a sector operates, our norms as a society and community, and then the governance processes that apply within our specific organisations, departments and even teams.

Depending on what you’re doing with the data, and the type of data you’re working with, different contexts might apply. The obvious one being the use of personal data. As data moves between organisations and countries, different contexts will apply, but we can’t necessarily ignore the broader contexts in which it already sits.

The narrowest contexts, e.g. those within an organisation, will focus on questions like: “how are we managing dataset XYZ to ensure it is protected and managed to a high quality?” The broadest contexts are likely to focus on questions like: “how do we safely manage personal data?”

Narrow contexts define the governance and stewardship of individual datasets. Wider contexts guide the stewardship of data more broadly.

What the above diagram hopefully shows is that data, and our use of data, is never free from governance. It’s just that the terms under which it is governed may be very loosely defined.

This terrible sketch I shared on Twitter a while ago shows another way of looking at this: the laws, permissions, norms and guidelines that define the context in which we use data.

Data use in context

One of the ways in which I’ve found this “overlapping contexts” perspective useful is in thinking about how data moves into and out of different contexts. For example, when it is published or shared between organisations and communities. Here’s an example from this week.

IBM have been under fire because they recently released (or re-released) a dataset intended to support facial recognition research. The dataset was constructed by linking to public and openly licensed images already published on the web, e.g. on Flickr. The photographers, and in some cases the people featured in those images, are unhappy about the photographs being used in this new way. In this new context.

In my view, the IBM researchers producing this dataset made two mistakes. Firstly, they didn’t give proper consideration to the norms and regulations that apply to this data (the broader contexts which inform how it is governed and used), even though it’s published under an open licence. For example, people’s expectations about how photographs of them will be used.

An open licence helps data move between organisations (between contexts), but doesn’t absolve anyone from complying with all of the other rules, regulations, norms, etc. that will still apply to how it is accessed, used and shared. The statement from Creative Commons helps to clarify that their licences are not a tool for governance. They just help to support the reuse of information.

This leads to IBM’s second mistake. By creating a new dataset they took on responsibility as its data steward. And being a data steward means having a well-defined set of data governance processes that are informed and guided by all of the applicable contexts of governance. But they missed some things.

The dataset included content that was created by, and features, individuals. So their failure to engage with the community of contributors, in order to discuss norms and expectations, was a mistake. The lack of good tools to allow people to remove photos (NBC News created a better tool to allow Flickr users to check the contents of the dataset) is also a shortfall in their duties. It’s the combination of these that has led to the outcry.

If IBM had instead launched an initiative in which they built this dataset collaboratively with the community, then they could have avoided this issue. This is the approach that Mozilla took with Common Voice. IBM, and the world, might even have had a better dataset as a result, because people might have opted in to including more photos. This is important because, as John Wilbanks has pointed out, the market isn’t creating these fairer, more inclusive datasets. We need them to create an open, trustworthy data ecosystem.

Anyway, that’s one example of how I’ve found thinking about the different contexts of governing data helpful in understanding how to build stronger data infrastructure. What do you think? Am I thinking about this all wrong? What else should I be reading?


Talk: Tabular data on the web

This is a rough transcript of a talk I recently gave at a workshop on Linked Open Statistical Data. You can view the slides from the talk here. Below are my notes, with a bit of light editing.

At the Open Data Institute our mission is to work with companies and governments to build an open, trustworthy data ecosystem. An ecosystem in which we can maximise the value from the use of data whilst minimising its potential for harmful impacts.

An important part of building that ecosystem will be ensuring that everyone, including governments, companies, communities and individuals, can find and use the data that might help them to make better decisions and to understand the world around them.

We’re living in a period where there’s a lot of disinformation around. So the ability to find high quality data from reputable sources is increasingly important. Not just for us as individuals, but also for journalists and other information intermediaries, like fact-checking organisations.

Combating misinformation, regardless of its source, is an increasingly important activity. To do that at scale, data needs to be more than just easy to find. It also needs to be easily integrated into data flows and analysis. And the context that describes its limitations and potential uses needs to be readily available.

The statistics community has long had standards and codes of practice that help to ensure that data is published in ways that help to deliver on these needs.

Technology is also changing. The ways in which we find and consume information are evolving. Simple questions are now being directly answered from search results, or through agents like Alexa and Siri.

New technologies and interfaces mean new challenges in integrating and using data. This means that we need to continually review how we are publishing data. So that our standards and practices continue to evolve to meet data user needs.

So how do we integrate data with the web? To ensure that statistics are well described and easy to find?

We’ve actually got a good understanding of basic data user needs: good quality metadata and documentation, clear licensing, consistent schemas, use of open formats, etc. These are consistent requirements across a broad range of data users.

What standards can help us meet those needs? We have DCAT and Data Packages. Schema.org Dataset metadata, and its use in Google Dataset Search, now provides a useful feedback loop that will encourage more investment in creating and maintaining metadata. You should all adopt it.
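To make that more concrete, here’s a minimal sketch of what publishing Schema.org Dataset metadata can look like. The dataset details below are invented for illustration, but the markup follows the Schema.org vocabulary. A small Python script that generates the JSON-LD block to embed in a dataset’s landing page:

```python
import json

# Hypothetical dataset details, invented for illustration.
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example population statistics",
    "description": "A fictional dataset illustrating Schema.org Dataset markup.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": {
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://example.org/data/population.csv",
    },
}

# Embedding this in a dataset's landing page makes it discoverable
# by crawlers such as Google Dataset Search.
print('<script type="application/ld+json">')
print(json.dumps(dataset, indent=2))
print("</script>")
```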

And we also have CSV on the Web. It does a variety of things which aren’t covered by some of those other standards. It’s a collection of W3C Recommendations that:

  • define a model for tabular data and its annotations
  • provide a metadata vocabulary for describing the structure and content of CSV files
  • specify how tabular data can be converted into both JSON and RDF

The primer provides an excellent walk through of all of the capabilities and I’d encourage you to explore it.
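To give a flavour, here’s a minimal sketch of a CSV on the Web metadata document, generated from Python. The file name, title and columns are invented for illustration, but the properties used are from the CSVW vocabulary:

```python
import json

# A minimal CSV on the Web metadata document describing a hypothetical
# population.csv with two columns. By convention it would be published
# alongside the data as population.csv-metadata.json.
metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "population.csv",
    "dc:title": "Example population statistics",
    "tableSchema": {
        "columns": [
            {"name": "area", "titles": "Area code", "datatype": "string"},
            {"name": "population", "titles": "Population", "datatype": "integer"},
        ],
        "primaryKey": "area",
    },
}

with open("population.csv-metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```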

One of the nice examples in the primer shows how you can annotate individual cells or groups of cells. As you all know, this capability is essential for statistical data. Because statistical data is rarely just tabular: it’s usually decorated with lots of contextual information that is difficult to express in most data formats. Users of data need this context to properly interpret and display statistical information.

Unfortunately, CSV on the Web is still not that widely adopted. Even though it’s relatively simple to implement.

(Aside: several audience members noted they are using it internally in their data workflows. I believe the Office for National Statistics are also moving to adopt it.)

This might be because of a lack of understanding of some of the benefits it provides. Or that those benefits are limited in scope.

There also aren’t currently a great many tools that support CSV on the Web.

It might also be that there are some other missing pieces of data infrastructure that are blocking us from making best use of CSV on the Web and other similar standards and formats. Perhaps we need to invest further in creating open identifiers to help us describe statistical observations. For example, so that we can clearly describe what type of statistics are being reported in a dataset.

But adoption could be driven from multiple angles. For example:

  • open data tools, portals and data publishers could start to generate best practice CSVs. That would be easy to implement (a minimal sketch follows this list)
  • open data portals could also readily adopt CSV on the Web metadata; most already support DCAT
  • standards developers could adopt CSV on the Web as their primary means of defining schemas for tabular formats
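As a rough sketch of the first of those ideas, a publishing tool could derive a minimal CSV on the Web metadata file from a CSV’s header row. This is an illustrative sketch, not a full implementation; a real tool would also want to infer datatypes from the values:

```python
import csv
import json
import sys

def minimal_csvw_metadata(csv_path: str) -> dict:
    """Derive a minimal CSV on the Web metadata document from a CSV header."""
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f))
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": csv_path,
        "tableSchema": {
            "columns": [
                # Default every column to strings; a smarter tool would
                # sniff the values to infer more specific datatypes.
                {"name": name, "titles": name, "datatype": "string"}
                for name in header
            ]
        },
    }

if __name__ == "__main__":
    path = sys.argv[1]
    with open(path + "-metadata.json", "w") as out:
        json.dump(minimal_csvw_metadata(path), out, indent=2)
```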

Not everyone needs to implement or use the full set of capabilities. But with some small changes to tools and processes, we could collectively improve how tabular data is integrated into the web.

Thanks for listening.

Creating better checklists: a short review of The Checklist Manifesto

I’ve just finished reading The Checklist Manifesto by Atul Gawande (Cancer Research UK affiliate link). It’s been on my reading list for a while. In my work I’ve written quite a few checklists to help capture best practice or to provide advice. So I was curious about whether I could learn something about creating better checklists.

I wanted to write down a few reflections whilst they’re still fresh in my mind.

The book explores the use of checklists in medicine, aviation and to a lesser extent in business. Checklists aren’t to-do lists. They are a tool to help reduce risk, uncertainty and failure. Gawande uses ample anecdotes, supported by evidence from real-world studies, to illustrate how effective a simple checklist can be. They routinely save people’s lives during surgeries and are a key contributor to the safety of modern aviation.

Gawande explains how checklists allow teams to perform better in complex situations. They protect against individual fallibility, and can help to transfer best practice and research into operational use. He explains that checklists aren’t a teaching tool. They are a means of imposing discipline on a team. Their goal is to improve outcomes.

He also explains why he thinks they’re not being used more widely. In particular, he highlights the tendency of professionals to feel like they’re being undermined or challenged when asked to use simple checklists. A “hero culture” contributes to this problem, something that is very evident in surgery. It’s also something that the tech industry struggles with.

Gawande explains that checklists help to address this by re-balancing power within teams. For example, by giving nurses the permission to halt a surgical procedure if a checklist hasn’t been completed to their satisfaction. A hero culture might otherwise silence the raising of concerns, or deter team members from pointing out problems as they see them.

The book highlights that a common pattern in many successful checklists is the inclusion of specific steps to encourage, and make time for, team communication. These range from simple introductions and a review of responsibilities, through to a walk-through of expected and possible outcomes. These all contribute towards making the team a more effective unit.

Throughout the book I was wondering how to transfer the insight into other areas. Gawande suggests that checklists are useful anywhere that we have multi-disciplinary teams working together on complex tasks. And specifically where those tasks have complex outcomes with potentially serious impacts.

I think there are probably a lot of examples in the data and digital world where they might be useful.

What if teams working on data science and machine learning had “preflight checklists” that were used not just at the start of a project, but also at the time of launch and beyond? Would they help highlight problems, increase discipline and create time for missteps or other concerns to be raised?
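As a sketch of what that might look like (the checklist items below are invented for illustration, not drawn from the book), such a checklist could be captured as simple data that a team confirms together at launch time:

```python
# A hypothetical "preflight" checklist for launching a machine
# learning model. The items are illustrative, not prescriptive.
LAUNCH_CHECKLIST = [
    "Team members have introduced themselves and confirmed their roles",
    "Training data sources and known limitations have been documented",
    "Model performance has been reviewed across key user groups",
    "A rollback plan exists and has been tested",
    "Someone is named as responsible for monitoring after launch",
]

def run_checklist(items: list[str]) -> bool:
    """Ask the team to confirm each item aloud before proceeding."""
    for item in items:
        answer = input(f"{item}? [y/n] ").strip().lower()
        if answer != "y":
            print("Pause the launch and resolve this item first.")
            return False
    print("All items confirmed.")
    return True

if __name__ == "__main__":
    run_checklist(LAUNCH_CHECKLIST)
```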

The ODI data ethics canvas, developed by Amanda Smith, Ellen Broad and Peter Wells is not quite a checklist. But it’s a similar type of tool, aiming to address some similar problems. Privacy impact assessments are another example. But perhaps there are other useful aids?

The book also raises wider questions about the approach we take in our societies towards ensuring safe outcomes of our work, research, etc. There is often too much focus on the use and application of exciting new research and technologies, and not enough on the discipline required to use them safely and effectively.

In short, are we taking care of one another in the best ways we know how?

Creating better checklists

There’s some great insight into creating checklists scattered throughout the Manifesto. But ironically, they’re not gathered together into a single list of suggestions.

So, for my own benefit I’ve jotted down some points to reflect on:

  • Checklists need to be focused. An exhaustive list of steps is not useful. Trust that people know how to do their job; just ask them to confirm the most critical or important steps
  • Think about how the checklist will be used. There are READ-DO checklists (where you read each step and then carry it out) and DO-CONFIRM checklists (where you carry out an activity, and then review what you have done)
  • Checklists can be used to help with both routine situations (pre-flight) and emergencies (engine failure)
  • Make the checklist easy to use. A good user experience can help embed them into routine practice
  • Consider who is leading use of the checklist. A checklist can help to balance power across a team
  • Include team communications. Teams perform better when they know each other and understand their roles. Ask them to explain what will happen and what the expected outcomes might be. This helps teams deal with items not on the checklist
  • Test and iterate the list
  • Let people customise it, so they can adapt it to local use
  • Measuring success and impact (e.g. by tracking outcomes, or even just identifying where the checklist has helped) can help encourage others to adopt it