Digital public institutions for the information commons?

I’ve been thinking a bit about “the commons” recently. Specifically, the global information commons that is enabled and supported by Creative Commons (CC) licences. This covers an increasingly wide variety of content as you can see in their recent annual review.

The review unfortunately doesn’t mention data although there’s an increasing amount of that published using CC (or compatible) licences. Hopefully they’ll cover that in more detail next year.

I’ve also been following with interest Tom Steinberg’s exploration of Digital Public Institutions (Part 1, Part 2). As a result of my pondering about the information and data commons, I think there are a couple of other types of institution which we might add to Tom’s list.

My proposed examples of digital public services are deliberately broad. They’re intended to serve the citizens of the internet, not just any one country.

Commons curators

Everyone has seen interesting facts and figures about the rapidly growing volume of activity on the web. These are often used as examples of dizzying growth and as a jumping off point for imagining the next future shocks that are only just over the horizon. The world is changing at an ever increasing rate.

But it’s also an archival challenge. The majority of that material will never be listened to, read or watched. Data will remain unanalysed. And in all likelihood it may disappear before anyone has had any chance to unlock its potential. Sometimes media needs time to find its audience.

This is why projects like the Internet Archive are so important. I think the Internet Archive is one of the greatest achievements of the web. If you need convincing then watch this talk by Brewster Kahle. If, like me, you’re of a certain age then these two things alone should be enough to win you over.

I think we might see, and arguably need, more digital public institutions that are not just archiving great chunks of the web, but also working with that material to help present it to a wider audience.

I see other signals that this might be a useful thing to do. Think about all of the classic film, radio and TV material that is unlikely ever to see the light of day again. Not just for rights reasons, but also because it’s not HD quality or hasn’t been cut and edited to reflect modern tastes. I think this is at least partly the reason why we see so many reboots and remakes.

Archival organisations often worry about how to preserve digital information. One tactic is to consider how to migrate between formats to ensure information remains accessible. What if we treated media the same way, e.g. by re-editing or remastering it to make it engaging to a modern audience? Here’s an example of modernising classic scientific texts and another that is remixing Victorian jokes as memes.

Maybe someone could spin a successful commercial venture out of this type of activity. But I’m wondering whether you could build a “public service broadcasting” organisation that presented refined, edited, curated views of the commons? I think there’s certainly enough raw materials.

Global data infrastructure projects

The ODI have spent some time this year trying to bring into focus the fact that data is now infrastructure. In my view the best exemplar of a truly open piece of global data infrastructure is OpenStreetMap (OSM). A collaboratively maintained map of our world. Anyone can contribute. Anyone can use it.

OSM was set up to try to solve the issue that the UK’s mapping and location infrastructure was, and largely still is, tied up with complex licensing and commercial models. Rather than knocking at the door of existing data holders to convince them to release their data, OSM shows what you can deliver with the participation of a crowd of motivated people using modern technology.

It’s a shining example of the networked age we live in.

There’s no reason to think that this couldn’t be done for other types of data, creating more publicly owned infrastructure. There are now many more ways in which people could contribute data to such projects, whether that information is about themselves or the world around us.

Getting coverage and depth to data could also potentially be achieved very quickly. Costs to host and serve data are also dropping, so sustainability also becomes more achievable.

And I also feel (hope?) there is a growing unease with so much data infrastructure being owned by commercial organisations. So perhaps there’s a movement towards wanting more of this type of collaboratively owned infrastructure.

Data infrastructure incubators

If you buy into the idea that we need more projects like OSM, then it’s natural to start thinking about the common features of such projects: those that make them successful and sustainable. There are likely to be some common organisational patterns that can be used as a framework for designing these organisations. Although it is focused on scholarly research, I think this is currently the best attempt at capturing those patterns that I’ve seen so far.

Given a common framework, it becomes possible to create incubators whose job it is to launch these projects and coach, guide and mentor them towards success.

So that is my third and final addition to Steinberg’s list: incubators that are focused not on the creation of the next start-up “unicorn” but on generating successful, global collaborative data infrastructure projects. Institutions whose goal is the creation of the next OpenStreetMap.

These types of projects have a huge potential impact as they’re not focused on a specific sector. OSM is relevant to many different types of application and its data is used in many different ways. I think there’s a lot more foundational data of this type which could and should be publicly owned.

I may be displaying my naivety, but I think this would be a nice thing to work towards.

Improving the global open data index

The 2015 edition of the Global Open Data Index was published this week. From what I’ve seen it’s the result of an enormous volunteer effort and there’s a lot to celebrate. For example, the high ranking of Rwanda resulting from ongoing efforts to improve their open data publication. Owen Boswarva has also highlighted the need for the UK and other highly ranked countries to not get complacent.

Unfortunately, I have a few issues with the index as it stands which I wanted to document and which I hope may be useful input to the revisions for the 2016 review noted at the end of this article. My aim is to provide some constructive feedback rather than to disparage any of the volunteer work or to attempt to discredit any of the results.

My examples below draw on the UK scores simply because it’s the country with which I’m most familiar. I was also involved in a few recent email discussions relating to the compilation of the final scores and some last minute revisions to the dataset definitions.

Disclaimers aside, here are the problems that I think are worth identifying.

Lack of comparability

Firstly, it should be highlighted that the 2015 index is based on a different set of criteria than previous years. A consultation earlier this year led to some revisions to the index. These revisions included both the addition of new datasets and revisions to the assessment criteria for some of the existing datasets.

The use of a different set of criteria means it’s not really possible to compare rankings of countries between years. You can make comparisons between countries on the 2015 rankings, but you can’t really compare the rank of a single country between years as they are being assessed on different information. The website doesn’t make this clear at all on the ranking pages.

Even worse, the information as presented is highly misleading. If you look at the UK results for election data in 2015 and then look at the results for 2014 you’ll see that the 2014 page is displaying the scores for 2014 but the assessment criteria for 2015. It should be showing the assessment criteria for 2014 instead. This makes it seem like the UK has gone backwards from 100% open to 0% open for the same criteria, rather than being assessed in a completely different way.

If you look at the Wayback Machine entry for the 2014 results, you’ll see the original criteria.

Recommended fixes:

  • Remove comparisons between years, or at least provide a clear warning indicator about interpretation
  • Ensure that historical assessments include the original criteria used (I can only assume this is a bug)

Lack of progress indicators

Related to the above, another failing of the index is that it doesn’t measure progress towards a desired outcome.

Sticking with the election data example, the changes in the assessment criteria included requirements to report additional vote counts and, most significantly, that data should be reported at the level of the polling station rather than by constituency.

My understanding is that the Electoral Commission have always published the additional counts, e.g. invalid and spoiled ballots. It’s only the change in the level of aggregation of results that is different. The UK doesn’t report to polling station level (more on that below). But the data that is available – the same data that previously was 100% open – is still available, and its continued availability has been completely ignored.

This raises important questions about how viable it is for a country to use the index as a means to measure or chart its progress: changes to the criteria can completely discount all previous efforts that went into improving how data is published, and the ongoing publication of valuable data.

Recommended fixes:

  • Include some additional context in the report to acknowledge what data does exist in the domain being measured, and how it aligns with criteria
  • Where reporting to a particular level of detail is required, include some intermediary stages, e.g. “national”, “regional”, “local”, that can help identify where data is available but only at a more aggregated level

Inconsistent and debatable reporting requirements

The reporting requirements in different areas of the index are also somewhat inconsistent. In some areas a very high level of detail is expected, whereas in others less detail is acceptable. This is particularly obvious with datasets that have a geographical component. For example:

  • Location data should be available to postcode (or similar) level
  • Election data must now be reported to polling station, rather than constituency as previously indexed
  • The weather data criteria don’t indicate any geographic reporting requirements
  • Similarly the national statistics criteria simply ask for an annual population count, but it’s unclear what level of detail is acceptable. E.g. is a single count for the whole country permissible?

While we might naturally expect different criteria for different domains, this belies the fact that:

  • Address data is hugely valuable, widely identified as a critical dataset for lots of applications, but its availability is not measured in the index. The UK gets full marks whereas it should really be marked down because of its closed address database. The UK is getting a free ride where it should be held up as being behind the curve. Unless the criteria are revised again, Australia’s move to open its national address database will be ignored.
  • Changing the level of reporting of election data in the UK will require a change in the way that votes are counted. It may be possible to report turnout at a station level, but we would need to change aspects of the electoral system in order to reach the level required by the index. Is this really warranted?

The general issue is that the rationale behind the definition of specific criteria is not clear.

Recommended fixes:

  • Develop a more tightly defined set of assessment criteria and supplement these with a clear rationale, e.g. what user needs are being met or unmet if data is published to a specific level?
  • Ensure that a wide representation of data consumers are included in drafting any new criteria
  • Review all of the current criteria, with domain experts and current consumers of these datasets, to understand whether their needs are being met

Inconsistent scoring against criteria

I also think that there’s some inconsistent scoring across the index. One area where there seems to be varied interpretation is around the question: “Available in bulk?”. The definition of this question states:

“Data is available in bulk if the whole dataset can be downloaded easily. It is considered non-bulk if the citizens are limited to getting parts of the dataset through an online interface. For example, if restricted to querying a web form and retrieving a few results at a time from a very large database.”

Looking at the UK scores we have:

  1. Location data is marked as available. However there are several separate datasets which cover this area, e.g. lists of postcodes are in one dataset, and administrative boundaries are in another. The user has to fill in a form to get access to a separate download URL for each product. The registration step is noted but ignored in the assessment. And it seems that the reviewer is happy with downloading separate files for each dataset
  2. Weather data is marked as available. But the details say that you have to sign up to use an API on the Azure Marketplace, which you then use to harvest the data, presumably having first written the custom code required to retrieve it. Is this not limiting access to parts of the dataset? And specifically, limiting its access to specific types of users? It’s not what I’d consider to be a bulk download. The reviewer has also missed the download service that does actually provide the information in bulk
  3. Air quality data is not available in bulk. However you can access all the historical data as Atom files, or have all the historical data for all monitoring sites emailed to you, or access the data as RData files via the openair package. Is this presenting a problem for users of air quality data in the UK?

Personally, as a data consumer I’m happy with the packaging of all of these datasets. I definitely agree that the means of access could be improved, e.g. provision of direct download links without registration forms, provision of manifest files to enable easier mirroring. And I have a preference for the use of simpler formats (CSV vs Atom), etc. But that’s not what is being assessed here.
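To make the manifest idea a little more concrete, here’s a minimal sketch in Python of what a publisher could generate alongside a set of downloads. The function name, the field names (path, bytes, sha256) and the JSON layout are my own assumptions for illustration; this isn’t an existing standard or something any of the publishers above provide.

```python
import hashlib
import json
from pathlib import Path


def build_manifest(directory):
    """List every file under `directory` with its size and SHA-256 digest.

    The layout is hypothetical: one entry per file, so that a mirror can
    tell which files are new or have changed without downloading them all.
    """
    root = Path(directory)
    entries = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            data = path.read_bytes()
            entries.append({
                "path": path.relative_to(root).as_posix(),
                "bytes": len(data),
                "sha256": hashlib.sha256(data).hexdigest(),
            })
    return {"files": entries}


if __name__ == "__main__":
    # Print a manifest for the current directory as JSON.
    print(json.dumps(build_manifest("."), indent=2))
```

Someone mirroring the data could then compare the digests against their local copies and fetch only the files that differ.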

Again, it’s really a user needs issue. Is it important to have bulk access to all of the UK’s air quality data in a single zip file, or all of the historical data for a particular location in a readily accessible form? The appropriate level of packaging will depend on the use case of the consumer.

Recommended fixes:

  • Provide additional guidance for reviewers
  • Ensure that there are multiple reviewers for each assessment, including those familiar with not just the domain but also the local publishing infrastructure
  • Engage with the data publishers throughout the review process and enable them to contribute to the assessments

Hopefully this feedback is taken in the spirit it’s offered: as constructive input into improving a useful part of the open data landscape.



We have a long way to go

Stood in the queue at the supermarket earlier I noticed the cover of the Bath Chronicle. The lead story this week is: “House prices in Bath almost 13 times the average wage“. This is almost perfectly designed clickbait for me. I can’t help but want to explore the data.

In fact I’ve already done this before, when the paper published a similar headline in September last year: “Average house price in Bath is now eight times average salary“. I wrote a blog post at the time to highlight some of the issues with their reporting.

Now I’m writing another blog post, but this time to highlight how far we still have to go with publishing data on the web.

To try to illustrate the problems, here’s what happened when I got back from the supermarket:

  1. Read the article on the Chronicle website to identify the source of the data, the annual Home Truths report published by the National Housing Federation.
  2. I then googled for “National Housing Federation Home Truths” as the Chronicle didn’t link to its sources.
  3. I then found and downloaded the “Home Truths 2014/15: South West” report which has a badly broken table of figures in it. After some careful reading I realised the figures didn’t match the Chronicle
  4. Double-checking, I browsed around the NHF website and found the correct report: “Home Truths 2015/2016: The housing market in the South West“. Which, you’ll notice, isn’t clearly signposted from their research page
  5. The report has a mean house price of £321,674 for Bath & North East Somerset using Land Registry data from 2014. It also has a figure of £25,324 for mean annual earnings in 2014 for the region, giving a ratio of 12.7. The earnings data is from the ONS ASHE survey
  6. I then googled for the ASHE survey figures as the NHF didn’t link to its sources
  7. Having found the ONS ASHE survey I clicked on the latest figures and found the reference tables before downloading the zip file containing Table 8
  8. Unzipping, I opened the relevant spreadsheet and found the worksheet containing the figures for “All” employees
  9. Realising that the ONS figures were actually weekly rather than annual wages I opened up my calculator and multiplied the value by 52
  10. The figures didn’t match. Checked my maths
  11. I then realised that, like an idiot, I’d downloaded the 2015 figures but the NHF report was based on the 2014 data
  12. Returning to the ONS website I found the tables for the 2014 Revised version of the ASHE
  13. Downloading, unzipping, and calculating I found that again the figures didn’t match
  14. On a hunch, I checked the ONS website again and then found the reference tables for the 2014 Provisional version of the ASHE
  15. Downloading, unzipping, and re-calculating I finally had my match for the NHF figure
  16. I then decided that rather than dig further I’d write this blog post

This is a less than ideal situation. What could have streamlined this process?

The lack of direct linking – from the Chronicle to the NHF, and from the NHF to the ONS – was the root cause of my issues here. I spent far too much time working to locate the correct data. Direct links would have avoided all of my bumbling around.

While a direct link would have taken me straight to the data, I might have missed out on the fact that there were revised figures for 2014. Or that there were actually some new provisional figures for 2015. So there’s actually an update to the story already waiting to be written. The analysis is already out of date.

The new data was published on the 18th November and the NHF report on the 23rd. That gave a five day period in which the relevant tables and commentary could have been updated. Presumably the report was too deep into final production to make changes. Or maybe just no-one thought to check for updated data.

If both the raw data from the ONS and the NHF analysis had been published natively to the web rather than in a PDF maybe some of that production overhead could have been reduced. I know PDF has some better support for embedding and linking data these days, but a web native publication might have been more dynamic.

In fact, why should the numbers have been manually recalculated at all? The actual analysis involves little more than pulling some cells from existing tables and doing some basic calculations. Maybe that could have been done on the fly? Perhaps by embedding the relevant figures. At the moment I’m left with doing some manual copy-and-paste.
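As a rough illustration of how little calculation is involved, here’s a minimal sketch in Python using the two figures quoted from the NHF report above. The function names are my own, and the weekly value of 487.0 is simply the report’s annual figure divided by 52 rather than a number copied from the ONS tables.

```python
WEEKS_PER_YEAR = 52


def annualise(weekly_earnings):
    """Convert weekly earnings to annual earnings by multiplying by 52."""
    return weekly_earnings * WEEKS_PER_YEAR


def affordability_ratio(house_price, annual_earnings):
    """Ratio of house price to annual earnings."""
    return house_price / annual_earnings


# Figures quoted in the NHF report for Bath & North East Somerset (2014).
mean_house_price = 321_674
mean_annual_earnings = 25_324

ratio = affordability_ratio(mean_house_price, mean_annual_earnings)
print(round(ratio, 1))  # 12.7

# Back-calculated weekly figure (an assumption, see above).
print(annualise(487.0))  # 25324.0
```

If the source tables were directly addressable on the web, the two inputs could be fetched rather than typed in, and the ratio would update whenever revised figures were published.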

It’s not just NHF that are slow to publish their figures though. Researching the Chronicle article from last year, I turned up some DCLG figures on housing market and house prices. These weren’t actually referenced from the article or any of its sources. I just tripped over them whilst investigating. Because data nerd.

The live (sic) DCLG tables include a ratio of median house prices to median earnings but they haven’t been updated since April 2014. Their analysis only uses the provisional ASHE figures for 2013.

Oh, and just for fun, the NHF analysis uses mean house prices and wages, whilst the DCLG data uses medians. The ONS publish both weekly mean and median earnings for all periods, as well as some data for different quantiles.
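To show why that difference matters, here’s a small sketch in Python. The prices and wages are entirely made up for illustration and aren’t taken from the NHF, DCLG or ONS; the point is only that one expensive house or one high earner pulls the mean but not the median, so the two ratios aren’t comparable.

```python
from statistics import mean, median

# Hypothetical figures, invented purely for illustration.
house_prices = [150_000, 200_000, 250_000, 300_000, 900_000]
annual_wages = [18_000, 22_000, 25_000, 28_000, 57_000]

# Ratio built from means, as in the NHF analysis.
mean_ratio = mean(house_prices) / mean(annual_wages)

# Ratio built from medians, as in the DCLG tables.
median_ratio = median(house_prices) / median(annual_wages)

print(mean_ratio)    # 12.0
print(median_ratio)  # 10.0
```

The same underlying data gives two different headline numbers depending on which average is used.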

And this is just one small example.

My intent here isn’t to criticise the Chronicle, the NHF, DCLG, and especially not the ONS who are working hard to improve how they publish their data.

I just wanted to highlight that:

  • we need better norms around data citation, including when and how to link to both new and revised data
  • we need better tools for telling stories on the web, that can easily be used by anyone and which can readily access and manipulate raw data
  • we need better discovery tools for data that go beyond just keyword searches
  • we need to make it easier to share not just analyses but also insights and methods, to avoid doing unnecessary work and to make it easier (or indeed unnecessary) to fact check against sources

That’s an awful lot still to be done. Opening data is just the start of building a good data infrastructure for the web. I’m up for the challenge though. This is the stuff I want to help solve.

Shortly after I published this, Matt Jukes published a post wondering what a digital statistical publication might look like. Matt’s post and Russell Davies’ thoughts on digital white papers are definitely worth a read.

How can open data publishers monitor usage?

Some open data publishers require a user to register with their portal or provide other personal information before downloading a dataset.

For example:

  • the recently launched Consumer Data Research Centre data portal requires users to register and login before data can be downloaded
  • access to any of the OS Open Data products requires the completion of a form which asks for personal information and an email address to which a download link is sent
  • the Met Office Data Point API provides OGL licensed data but users must register in order to obtain an API key

Requiring a registration step is in fact very common when it comes to open data published via an API. Registration is required on Transport API, Network Rail and Companies House to name a few. This isn’t always the case though as the Open Corporates API can be used without a key, as can APIs exposed via the Socrata platform (and other platforms, I’m sure). In both cases registration carries the benefit of increased usage limits.

The question of whether to require a login is one that I’ve run into a few times. I wanted to explore it a little in this post to tease out some of the issues and alternatives.

For the rest of the post whenever I refer to “a login” please read it as “a login, registration step, or other intermediary web form”.

Is requiring a login permitted?

I’ll note from the start that the open definition doesn’t have anything to say about whether a login is permitted or not permitted.

The definition simply says that data “…must be provided as a whole and at no more than a reasonable one-time reproduction cost, and should be downloadable via the Internet without charge”. In addition the data “…must be provided in a form readily processable by a computer and where the individual elements of the work can be easily accessed and modified.”

You can choose to interpret that in a number of ways. The relevant bits of text have gone through a number of iterations since the definition was first published and I think the current language isn’t as strong as that present in previous versions. That said, I don’t recall there ever being a specific pronouncement against having a login.

There is however a useful discussion on the open definition list from October 2014 which has some interesting comments and is worth reviewing. Andrew Stott’s comments provide a useful framing, asking whether such a step is necessary to the provision of the information.

In my view there are very few cases where such a step is necessary, so as general guidance I’d always recommend against requiring a login when publishing open data.

But, being a pragmatic chap, I prefer not to deal in absolutes so I’d like you to think about the pros and cons on either side.

Why do publishers want a login?

I’ve encountered several reasons why publishers want to require a login:

  1. to collect user information to learn more about who is using their data
  2. to help manage and monitor usage of an API
  3. all of the above

The majority of open data publishers I’ve worked with are very keen to understand who is using their data, how they’re using it, and how successful their users are at building things with their data. It’s entirely natural, as part of providing a free resource, to want to understand if people are finding it useful.

Knowing that data is in use and is delivering value can help justify ongoing access, publication of additional data, or improvements in how existing data is published. Everyone wants to understand if they’re having an impact. Knowing who is interested enough to download the data is a first step towards measuring that.

An API without usage limits presents a potentially unbounded liability for a publisher in terms of infrastructure costs. The inability to manage or balance usage across a user base means that especially active or abusive users can hamper the ability for everyone to benefit from the API. API keys, and similar authentication methods, provide a hook that can be used to monitor and manage usage. (IP addresses are not enough.)

Why don’t consumers want to login?

There are also several reasons why data consumers don’t want to have to login:

  1. they want to quickly review and explore some data and a registration step provides unnecessary barriers
  2. they want or need the freedom to access data anonymously
  3. they don’t trust the publisher with their personal information
  4. they want to automatically script bulk downloads to create production workflows without the hassle of providing credentials or navigating access control
  5. they want to use an API from a browser based application which limits their ability to provide private credentials
  6. all of the above

Again, these are all reasonable concerns.

What are the alternatives?

So, how can publishers learn more about their users and, where necessary, offer a reasonable quality of service whilst also staying mindful of the concerns of users?

I think the best way to explore that is by focusing on the question that publishers really want to answer: who are the users actively engaged in using my data?

Requiring a registration step or just counting downloads doesn’t help you answer that question. For example:

  • I’ve filled in the OS Open Data download form multiple times for the same product, sometimes on the same day but from different machines. I can’t imagine it tells them much about what I’ve done (or not done) with their data and they’ve never asked
  • I’ve registered on portals in order to download data simply to take a look at its contents without any serious intent to use it
  • I’ve worked with data publishers that have lots of detail in their registration database but no real insight into what users are doing, and no ongoing relationship with them

In my view the best way to identify active users and learn more about how they are using your data is to talk to them.

Develop an engagement plan that involves users not just after the release of some data, but before a release. Give them a reason to want to talk to you. For example:

  • tell them when the data is updated, or you’ve made corrections to it. This is a service that many serious consumers would jump at
  • give them a feedback channel that lets them report problems or make suggestions about improvements and then make sure that channel is actually monitored so feedback is acted on
  • help celebrate their successes by telling their stories, featuring their applications in a showcase, or via social media

Giving users a reason to engage can also help with API and resource management. As I mentioned in the introduction, Open Corporates and others provide a basic usage tier that doesn’t require registration. This lets hobbyists, tinkerers and occasional users get what they need. But the promise of freely accessible, raised usage limits gives active users a reason to engage more closely.
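Here’s a minimal sketch in Python of how that kind of tiered access might work. The limits, the window length and the class name are all hypothetical and aren’t taken from Open Corporates or any other real API. Anonymous callers are counted by client address, which is crude for the reasons noted earlier, while registered callers are counted by their key and get a higher ceiling.

```python
import time

# Hypothetical limits: requests allowed per window for each tier.
ANONYMOUS_LIMIT = 50
REGISTERED_LIMIT = 500
WINDOW_SECONDS = 3600


class TieredLimiter:
    """Fixed-window request counter with a keyless tier and a keyed tier."""

    def __init__(self, registered_keys, clock=time.monotonic):
        self.registered_keys = set(registered_keys)
        self.clock = clock
        self.windows = {}  # caller id -> (window start, request count)

    def allow(self, client_address, api_key=None):
        """Return True if this request fits within the caller's limit."""
        registered = api_key in self.registered_keys
        # Unknown or missing keys fall back to the anonymous tier.
        caller = ("key", api_key) if registered else ("anon", client_address)
        limit = REGISTERED_LIMIT if registered else ANONYMOUS_LIMIT

        now = self.clock()
        start, count = self.windows.get(caller, (now, 0))
        if now - start >= WINDOW_SECONDS:
            start, count = now, 0  # a new window has begun
        if count >= limit:
            return False
        self.windows[caller] = (start, count + 1)
        return True
```

The keyless tier keeps the barrier to entry at zero, and the key only becomes relevant once someone is using the data heavily enough to benefit from a conversation with the publisher.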

If you’re providing data in bulk but are concerned about data volumes then provide smaller sample datasets that can be used as a preview of the full data.

In short, just like any other data collection exercise, it’s important that publishers understand why they’re asking users to register. If the data is ultimately of low value, e.g. people providing fake details, or isn’t acted on as part of an engagement plan, then there’s very little reason to collect the data at all.

This post is part of my “basic questions about data” series. If you’ve enjoyed this one then take a look at the other articles. I’m also interested to hear suggestions for topics, so let me know if you have an idea. 

Who is the intended audience for open data?

This post is part of my ongoing series: basic questions about data. It’s intended to expand on a point that I made in a previous post in which I asked: who uses data portals?

At times I see quite a bit of debate within the open data community around how best to publish data. For example should data be made available in bulk or via an API? Which option is “best”? Depending on where you sit in the open data community you’re going to have very different responses to that question.

But I find that in the ensuing debate we often overlook that open data is intended to be used by anyone, for any purpose. And that means that maybe we need to think about more than just the immediate needs of developers and the open data community.

While the community has rightly focused on ensuring that data is machine-readable, so it can be used by developers, we mustn’t forget that data needs to be human-readable too. Otherwise we end up with critiques of what I consider to be fairly reasonable and much-needed guidance on structuring spreadsheets, and suggestions of alternatives that are well meaning but feel a little under-baked.

I feel that there are several different and inter-related viewpoints being expressed:

  • That the citizen or user is the focus and we need to understand their needs and build services that support them. Here data tends to be a secondary concern and perhaps focused on transactional statistics on performance of those services, rather than the raw data
  • That open data is not meant for mere mortals and that its primary audience is developers, who analyse it and present it to users. The emphasis here is on provision of the raw data as rapidly as possible
  • A variant of the above that emphasises delivery of data via an API to web and mobile developers allowing them to more rapidly deliver value. Here we see cases being made about the importance of platforms, infrastructure, and API programs
  • That citizens want to engage with data and need tools to explore it. In this case we see arguments for on-line tools to explore and visualise data, or reasonable suggestions to simply publish data in spreadsheets as this is a format with which many, many people are comfortable

Of course all of these are correct, although their prominence around different types of data, application, etc varies wildly. Depending on where you sit in the open data value network your needs are going to be quite different.

It would be useful to map out the different roles of consumers, aggregators, intermediaries, etc to understand what value exchanges are taking place, as I think this would help highlight the value that each role brings to the ecosystem. But until then both consumers and publishers need to be mindful of potentially competing interests. In an ideal world publishers would serve every reuser need equally.

My advice is simple: publish for machines, but don’t forget the humans. All of the humans. Publish data with context that helps anyone – developers and the interested reader – properly understand the data. Ensure there is at least a human-readable summary or view of the data as well as more developer oriented bulk downloads. If you can get APIs “out of the box” with your portal, then invest the effort you would otherwise spend on preparing machine-readable data in providing more human-readable documentation and reports.
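As a rough illustration of publishing for both audiences, here is a minimal Python sketch that produces a machine-readable CSV and a short human-readable summary from the same records. The records, field names and function names are all invented for this example; a real portal would have its own tooling.

```python
import csv
import io

# Hypothetical example records; the field names are invented for illustration.
ROWS = [
    {"area": "North", "year": 2015, "visits": 1200},
    {"area": "South", "year": 2015, "visits": 950},
    {"area": "East", "year": 2015, "visits": 430},
]


def to_csv(rows):
    """Machine-readable view: CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_summary(rows, measure, label):
    """Human-readable view: a few plain sentences about the same rows."""
    total = sum(row[measure] for row in rows)
    largest = max(rows, key=lambda row: row[measure])
    return (
        f"This dataset has {len(rows)} rows. "
        f"Total {measure}: {total}. "
        f"Highest {measure}: {largest[measure]} ({largest[label]})."
    )


if __name__ == "__main__":
    print(to_csv(ROWS))
    print(to_summary(ROWS, "visits", "area"))
```

The point is only that both views are generated from one source, so the summary cannot drift away from the bulk download.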

Our ambition should be to build an open data commons that is accessible and useful for as many people as possible.


Managing risks when publishing open data

A question that I frequently encounter when talking to organisations about publishing open data is: “what if someone misuses or misunderstands our data?”.

These concerns stem from several different sources:

  • that the data might be analysed incorrectly, drawing incorrect conclusions that might be attributed to the publisher
  • that the data has known limitations and this might reflect on the publisher’s abilities, e.g. exposing issues with their operations
  • that the data might be used against the publisher in some way, e.g. to paint them in a bad light
  • that the data might be used for causes with which the publisher does not want to be aligned
  • that the data might harm the business activities of the publisher, e.g. by allowing someone to replicate a service or product

All of these are understandable and reasonable concerns. And the truth is that when publishing open data you are giving up a great deal of control over your data.

But the same is true of publishing any information: there will always be cases of accidental and wilful misuse of information. Short of not sharing information at all, all organisations already face this risk. It’s just that open data, which anyone can access, use and share for any purpose, really draws this issue into the spotlight.

In this post I wanted to share some thoughts about how organisations can manage the risks associated with publishing open data.

Risks of not sharing

Firstly it’s worth noting that the risks of not sharing data are often unconsciously discounted.

There’s increasing evidence that holding on to data can hamper innovation whereas opening data can unlock value. This might be of direct benefit for the organisation or have wider economic, social and environmental benefits.

Organisations with a specific mission or task can more readily demonstrate their impact and progress by publishing open data. Those that are testing a theory of change will be reporting on indicators that help to measure impact and confirm that interventions are working as expected. Open data is the most transparent way to approach these impact assessments.

Many organisations, particularly government bodies, are attempting to address challenges that can only be overcome in collaboration with others. Open data specifically, and data sharing practices in general, provide an important foundation for collaborative projects.

As data moves from the closed to the open end of the data spectrum, there is an increasingly wide audience that can access and use that information. We can point to Joy’s Law as a reason why this is a good thing.

In scientific publishing there are growing concerns of a “reproducibility crisis” which is fuelled in part by a lack of access to both original experimental data and analysis. Open publishing of scientific results is one remedy.

But setting aside what might be seen as a sleight of hand re-framing of the original question, how can organisations minimise specific types of risk?

Managing forms of permitted reuse

Organisations manage the forms of reuse of their data through a licence. The challenge for many is that an open licence places few limits on how data can be reused.

There is a wider range of licences that publishers could use, including some that limit creation of derivative works or commercial uses. But all of these restrictions may also unintentionally stop the kinds of reuse that publishers want to encourage or enable. This is particularly true when applying a “non-commercial” use clause. These issues are covered in detail in the recently published ODI guidance on the impacts of non-open licences.

While my default recommendation is that organisations use a CC-BY 4.0 licence, an alternative is the CC-BY-SA licence which requires that any derivative works are published under the same licence, i.e. that reusers must share in the same spirit as the publisher.

This could be a viable alternative that might help organisations feel more confident that they are deterring some forms of undesired reuse, e.g. discouraging a third-party or competitor from publishing a commercial analysis based on their data by requiring that the report also be distributed under an open licence.

The attribution requirement already stops data being reused without its original source being credited.

Managing risks of accidental misinterpretation

When I was working in academic publishing a friend at the OECD told me that at least one statistician had been won over to a plan to publicly publish data by the observation that the alternative was to continue to allow users to manually copy data from published reports, with the obvious risks of transcription errors.

This is a small example of how to manage risks of data being accidentally misused or misinterpreted. Putting appropriate effort into the documentation and publication of a dataset will help reusers understand how it can be correctly used. This includes:

  • describing what data is being reported
  • describing how the data was collected
  • describing the quality control, if any, that has been used to check the data
  • noting any known limits on its accuracy or gaps in coverage

All of these help to provide reusers with the appropriate context that can guide their use, and make those reusers more likely to be successful. This detail is already covered in the ODI certification process.
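One way to make this kind of documentation routine is to keep it as a small structured record next to the data. The Python sketch below is an assumption about what such a record might look like: the class name and field names are hypothetical, and are not taken from the ODI certification process or any metadata standard. It reports which parts of the documentation are still empty and renders a plain-text summary.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetNotes:
    """Hypothetical documentation record; field names are illustrative only."""

    title: str
    what_is_reported: str = ""
    collection_method: str = ""
    quality_control: str = ""
    known_limits: list = field(default_factory=list)
    contact: str = ""

    def missing(self):
        """Return the names of documentation fields that are still empty."""
        return [name for name, value in vars(self).items() if not value]

    def render(self):
        """Plain-text summary to publish alongside the data."""
        limits = "; ".join(self.known_limits) or "none documented"
        return "\n".join([
            self.title,
            f"What is reported: {self.what_is_reported or 'not documented'}",
            f"How it was collected: {self.collection_method or 'not documented'}",
            f"Quality control: {self.quality_control or 'not documented'}",
            f"Known limits: {limits}",
            f"Contact: {self.contact or 'not documented'}",
        ])


if __name__ == "__main__":
    notes = DatasetNotes(
        title="Example dataset",
        what_is_reported="Monthly counts",
        known_limits=["no data before 2010"],
    )
    print(notes.render())
    print("Still to document:", notes.missing())
```

Even when a publisher cannot fill in every field, listing the gaps explicitly tells reusers what they will need to ask about.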

Writing a short overview of a dataset highlighting its most interesting features, sharing ideas for how it might be used, and clearly marking known limits can also help orientate potential reusers.

Of course, publishers may not have the resources to fully document every dataset. This is where having a contact point to allow users to ask for help, guidance and clarification is important. 

Managing risks of wilful misinterpretation

Managing risks of wilful misinterpretation of data is harder. You can’t control cases where people totally disregard documentation and licensing in order to push a particular agenda. Publishers can however highlight breaches of social norms and can choose to call out misuse they feel is important to highlight.

It’s important to note that there are standard terms in the majority of open licences, including the Creative Commons Licences and the Open Government Licence, which address:

  • limited warranties – no guarantees that data is fit for purpose, so reusers can’t claim damages if the data is misused or misapplied
  • non-endorsement – reusers can’t say that their use of the data was endorsed or supported by the publisher
  • no use of trademarks, branding, etc. – reusers don’t have permission to brand their analysis as originating from the publisher
  • attribution – reusers must acknowledge the source of their data and cannot pass it off as their own

These clauses collectively limit the liability of the publisher. They also potentially provide some recourse to take legal action if a reuser did breach the terms of the licence, and the publisher thought that this was worth doing.

I would usually add to this that the attribution requirement means that there is always a link back to the original source of the data. This allows the reader of some analysis to find the original authoritative data and confirm any findings for themselves. It is important that publishers document how they would like to be attributed.

Managing business impacts

Finally, publishers concerned about the risk to their business of releasing data should ensure they’re doing so with a clear business case. This includes understanding whether supply of data is the core value of your business or whether customers place more value in the services.

One startup I worked with were concerned that an open licence on user contributions might allow a competitor to clone their product. But in this case the defensibility in their business model didn’t derive from controlling the data but from the services provided and the network effects of the platform. These are harder things to replicate.

This post isn’t intended to be a comprehensive review of all approaches to risk management when releasing data. There’s a great deal more which I’ve not covered including the need to pay appropriate attention to data protection, privacy, anonymisation, and general data governance.

But there is plenty of existing guidance available to help organisations work through those areas. I wanted to share some advice that more specifically relates to publishing data under an open licence.

Please leave a comment to let me know what you think. Is this advice useful and is there anything you would add?

Fictional data

The phrase “fictional data” popped into my head recently, largely because of odd connections between a couple of projects I’ve been working on.

It’s stuck with me because, if you set aside the literal meaning of “data that doesn’t actually exist”, there are some interesting aspects to it. For example the phrase could apply to:

  1. data that is deliberately wrong or inaccurate in order to mislead – lies or spam
  2. data that is deliberately wrong as a proof of origin or claim of ownership – e.g. inaccuracies introduced into maps to identify their sources, or copyright easter eggs
  3. data that is deliberately wrong, but intended as a prank – e.g. the original entry of Uqbar on Wikipedia. Uqbar is actually a doubly fictional place.
  4. data that is fictionalised (but still realistic) in order to support testing of some data analysis – e.g. a set of anonymised and obfuscated bank transactions
  5. data that is fictionalised in order to avoid being a nuisance, causing confusion, or accidental linkage – like 555 prefix telephone numbers or perhaps social media account names
  6. data that is drawn from a work of fiction or a virtual world – such as the Marvel Universe social graph, the Elite: Dangerous trading economy (context), or the data and algorithms relating to Pokémon capture.

I find all of these fascinating, for a variety of reasons:

  • How do we identify and exclude deliberately fictional data when harvesting, aggregating and analysing data from the web? Credit to Ian Davis for some early thinking about attack vectors for spam in Linked Data. While I’d expect copyright easter eggs to become less frequent they’re unlikely to completely disappear. But we can definitely expect more and more deliberate spam and attacks on authoritative data. (Categories 1, 2, 3)
  • How do we generate useful synthetic datasets that can be used for testing systems? Could we generate data based on some rules and a better understanding of real-world data as a safer alternative to obfuscating data that is shared for research purposes? It turns out that some fictional data is a good proxy for real world social networks. And analysis of videogame economics is useful for creating viable long-term communities. (Categories 4, 6)
  • Some of the most enthusiastic collectors and curators of data are those that are documenting fictional environments. Wikia is a small universe of mini-wikipedias complete with infoboxes and structured data. What can we learn from those communities and what better tools could we build for them? (Category 6)
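To make the synthetic data question a little more concrete, here is a small Python sketch of rule-based fictional data in the spirit of categories 4 and 5. Everything in it is an assumption made up for illustration: the merchant names, the amount range, the function name and the record layout are not modelled on any real dataset. The phone numbers use the 555-01xx block, which as far as I know is the range set aside for fictional use in North America.

```python
import random
from datetime import date, timedelta

# Assumed rules for illustration only; not derived from any real dataset.
MERCHANTS = ["grocer", "cafe", "bookshop", "rail"]


def fictional_transactions(count, seed=42, start=date(2016, 1, 1)):
    """Generate reproducible, entirely made-up transaction records."""
    rng = random.Random(seed)  # seeded so the same fiction can be regenerated
    rows = []
    for i in range(count):
        rows.append({
            "id": f"TX{i:05d}",
            "date": (start + timedelta(days=rng.randrange(365))).isoformat(),
            "merchant": rng.choice(MERCHANTS),
            "amount": round(rng.uniform(1.0, 200.0), 2),
            # Fictional phone number in the 555-01xx block
            "phone": f"555-01{rng.randrange(100):02d}",
        })
    return rows


if __name__ == "__main__":
    for row in fictional_transactions(3):
        print(row)
```

Because no real record is ever involved there is nothing to de-anonymise, although simple rules like these will not reproduce the statistical quirks of real transactions. That gap is exactly the open question.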

Interesting, huh?