Managing risks when publishing open data

A question that I frequently encounter when talking to organisations about publishing open data is: “what if someone misuses or misunderstands our data?“.

These concerns stem from several different sources:

that the data might be analysed incorrectly, drawing incorrect conclusions that might be attributed to the publisher
that the data has known limitations and this might reflect on the publisher’s abilities, e.g. exposing issues with their operations
that the data might be used against the publisher in some way, e.g. to paint them in a bad light
that the data might be used for causes with which the publisher does not want to be aligned
that the data might harm the business activities of the publisher, e.g. by allowing someone to replicate a service or product

All of these are understandable and reasonable concerns. And the truth is that when publishing open data you are giving up a great deal of control over your data.

But the same is true of publishing any information: there will always cases of accidental and wilful misuse of information. Short of not sharing information at all, all organisations already face this risk. It’s just that open data, which anyone can access, use and share for any purpose, really draws this issue into the spotlight.

In this post I wanted to share some thoughts about how organisations can manage the risks associated with publishing open data.

Risks of not sharing

Firstly its worth noting that the risks of not sharing data are often unconsciously discounted.

There’s increasing evidence that holding on to data can hamper innovation whereas opening data can unlock value. This might be of direct benefit for the organisation or have wider economic, social and environmental benefits.

Organisations with a specific mission or task can more readily demonstrate their impact and progress by publishing open data. Those that are testing a theory of change will be reporting on indicators that help to measure impact and confirm that interventions are working as expected. Open data is the most transparent way to approach to these impact assessments.

Many organisations, particularly government bodies, are attempting to address challenges that can only be overcome in collaboration with others. Open data specifically, and data sharing practices in general, provides an important foundation for collaborative projects.

As data moves from the closed to the open end of the data spectrum, there is an increasingly wider audience that can access and use that information. We can point to Joy’s Law as a reason why this is a good thing.

In scientific publishing there are growing concerns of a “reproducibility crisis” which is in part fuelled by both a lack of access to original experimental data and analysis. Open publishing of scientific results is one remedy.

But setting aside what might be seen as a sleight of hand re-framing of the original question, how can organisation minimise specific types of risk?

Managing forms of permitted reuse

Organisations manage the forms of reuse of its data through a licence. The challenge for many is that an open licence places few limits on how data can be reused.

There is a wider range of licences that publishers could use, including some that limit creation of derivative works or commercial uses. But all of these restrictions may also unintentionally stop the kinds of reuse that publishers want to encourage or enable. This is particularly true when applying a “non-commercial” use clause. These issues are covered in detail in the recently published ODI guidance on the impacts of non-open licences.

While my default recommendation is that organisations use a CC-BY 4.0 licence, an alternative is the CC-BY-SA licence which requires that any derivative works are published under the same licence, i.e. that reusers must share in the same spirit as the publisher.

This could be a viable alternative that might help organisations feel more confident that they are deterring some forms of undesired reuse, e.g. discouraging a third-party or competitor from publishing a commercial analysis based on their data by requiring that the report also be distributed under an open licence.

The attribution requirement already stops data being reused without its original source being credited.

Managing risks of accidental misinterpretation

When I was working in academic publishing a friend at the OECD told me that at least one statistician had been won over to a plan to publicly publish data by the observation that the alternative was to continue to allow users to manually copy data from published reports, with the obvious risks of transcription errors.

This is a small example of how to manage risks of data being accidentally misused or misinterpreted. Putting appropriate effort into the documentation and publication of a dataset will help reusers understand how it can be correctly used. This includes:

describing what data is being reported
how the data was collected
the quality control, if any, that has been used to check the data
any known limits on its accuracy or gaps in coverage

All of these help to provide reusers with the appropriate context that can guide their use. It also makes them more likely to be successful. This detail is already covered in the ODI certification process.

Writing a short overview of a dataset highlighting its most interesting features, sharing ideas for how it might be used, and clearly marking known limits can also help orientate potential reusers.

Of course, publishers may not have the resources to fully document every dataset. This is where having a contact point to allow users to ask for help, guidance and clarification is important.

Managing risks of wilful misinterpration

Managing risks of wilful misinterpretation of data is harder. You can’t control cases where people totally disregard documentation and licensing in order to push a particular agenda. Publishers can however highlight breaches of social norms and can choose to call out misuse they feel is important to highlight.

It’s important to note that there are standard terms in the majority of open licences, including the Creative Commons Licences and the Open Government Licence, which address:

limited warranties – no guarantees that data is fit for purpose, so reusers can’t claim damages if misused or misapplied
non-endorsement– reusers can’t say that their use of the data was endorsed or supported by the publisher
no use of trademarks, branding, etc. – reusers don’t have permission to brand their analysis as originating from the publisher
attribution– reusers must acknowledge the source of their data and cannot pass it off as their own

These clauses collectively limit the liability of the publisher. It also potentially provides some recourse to take legal action if a reuser did breach the terms of they licence, and the publisher thought that this was worth doing.

I would usually add to this that the attribution requirement means that there is always a link back to the original source of the data. This allows the reader of some analysis to find the original authoritative data and confirm any findings for themselves. It is important that publishers document how they would like to be attributed.

Managing business impacts

Finally, publishers concerned about the risk of releasing data to their business, should ensure they’re doing so with a clear business case. This includes understanding whether supply of data is the core value of your business or whether customers place more value in the services.

One startup I worked with were concerned that an open licence on user contributions might allow a competitor to clone their product. But in this case the defensibility in their business model didn’t derive from controlling the data but in the services provided and the network effects of the platform. These are harder things to replicate.

This post isn’t intended to be a comprehensive review of all approaches to risk management when releasing data. There’s a great deal more which I’ve not covered including the need to pay appropriate attention to data protection, privacy, anonymisation, and general data governance.

But there is plenty of existing guidance available to help organisations work through those areas. I wanted to share some advice that more specifically relates to publishing data under an open licence.

Please leave a comment to let me know what you think. Is this advice useful and is there anything you would add?

3 thoughts on “Managing risks when publishing open data”

Mike says:

November 15, 2015 at 4:59 pm

good stuff! tangential thing about business case for open data: open data was originally sold as a way of engaging with ‘community’, this translated into ‘hey! let’s get other people to understand our data! for free!’

trouble is, data comes with so many footnotes, and because it’s so hard to extract from legacy systems, let alone maintain as an OD release, people quickly saw through this (interesting to see how ODI nodes are now focussing on the business-change aspect, i.e. becoming management consultants).

To seriously make OD a part of the org will probably involve big changes to IT and day to day ops, so consider how you use the data *yourself* before releasing it. if you haven’t internally done any analysis/data viz on the data, it probably means you have no idea what the data means, so how can you release it and expect other people to come up with magical answers?

point is: how can you evaluate the risk attached to data, if you haven’t even tried to analyse it yourself? you can’t! so get your own house in order first, then put the data out there, with a fair degree of comfort that you know what you’re doing… 🙂
Dan Brickley says:

November 17, 2015 at 7:16 am

“Organisations with a specific mission or task can more readily demonstrate their impact and progress by publishing open data.”

One tradeoff you see here is the distinction between putting the whole dataset out into the Web in a technically unencumbered form, versus hiding it behind registration/logins/forms. The latter is in some sense less “open” but provides publishers a lot more machinery for demonstrating use/progress and so on. It would be useful to have some success stories of where useful usage information (case studies, endorsements etc.) was successfully collected without hiding the actual data away…
Pingback: Basic questions about data | Lost Boy

Comments are closed.