Jeni Tennison asked an interesting question on Twitter last week:
Question: aside from personally identifiable data, is there any data that *should not* be open?
The question prompted some interesting discussion which included examples of data that might be sensitive, suggestions about data that would be useful to open up, and the need for better understanding of how data can be applied.
But I don’t feel like we’ve yet got a good framework for these kinds of discussions. Applying labels can often mask important aspects of the debate.
For example, we talk about data being Open but often overlook that this doesn’t immediately make it accessible to anyone who wants to use it, at any time. That requires skills, supporting infrastructure and applications.
We might want data to be Open, and free at the point of use, but sometimes overlook the costs of collection and curation. Offsetting those costs, and switching to new, more open commercial models, can be achieved in different ways.
Similarly, we’re often concerned about the privacy of Personal data, but want to reserve the right to require some people, e.g. those in public office, to release more information. Privacy is rarely a binary decision. Sharing is usually a matter of degree, not a public-private distinction.
A Process View?
In my view we focus too much on the data itself — what can we release — rather than the wider process:
- What data is being collected?
- Who is collecting the data?
- Who (or what) is the data being collected about?
- What immediate use is going to be made of the data?
- What future uses might the collector make of the data?
- How is the data going to be distributed?
- Who else can have access to the data?
- What other data might it be remixed with?
And so on. Data collection, curation, publishing and re-use is a process. Understanding that process, as it applies to particular data, helps us to understand the risks and rewards of data sharing, whether it’s personal data or government data. We often talk about provenance, but that’s usually a retrospective view, e.g. where did this data come from? We also need to concern ourselves with future uses. Licensing is important, as is sustainability.
Answering these kinds of questions, for different types of data, may be illuminating. For example, data that I collect myself about my diet is highly personal data, but I will have a different attitude to sharing that than to sharing my bank statement. Data about my spending habits is collected for me by my bank. We share access to that because it is mutually beneficial (I think!).
Greater access to data that is collected about me (but not necessarily by me) could be useful in other contexts. But unless I plan to analyse all that myself, I’m going to end up sharing it with someone in order to get some useful insight, e.g. suggestions on better financial management or, in the case of utility bills, proposals for a more cost-effective provider.
Choosing to publish data openly, for unrestricted use, or within a limited group, or not at all is a decision that has to be made with informed consent and an appreciation of the risks and rewards. That’s true for governments, organisations and individuals.
Data is stored potential.
The Big Data movement is largely about organisations realising that they can tap into their internal large data reserves faster and in more cost-effective ways than was previously possible. The technology is helping unlock stored potential in internal data for its current owners.
In contrast, the Open Data movement is largely about unlocking potential by putting data into the hands of more people. More hands on that data allows it to be used in potentially more creative ways, perhaps to drive innovation or to increase transparency.
Personal data stores and the “midata” vision are intended to unlock potential by allowing individuals to readily access and share their data in more ways.
Unstructured data has less potential than structured data. The effort put into collecting and curating data increases its potential by making it easier to process or improving its quality.
Similarly, the potential in data that is released on a one-off basis declines over time, at a speed that depends on the rate of change of the dataset.
Much of the education that is happening in government and in enterprises around data is in building understanding of the potential in their data. The education that needs to happen for all of us is in understanding the potential of our own data, both for good and for ill. What we give away either willingly or unconsciously can be used in unexpected ways.
However, even for simple data items it can be difficult to foresee all potential uses. A single check-in at a geographic location is one thing, but a series of check-ins over time enables an entirely different kind of application and analysis. Aggregate that with other data and the options expand in many different ways.
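To make that concrete, here is a minimal sketch in Python. The check-in records, place labels and the usual_place_by_period helper are all invented for illustration, not taken from any real service. One record says only that someone was somewhere once; a handful of records is already enough to suggest a daily routine.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical check-ins: (ISO timestamp, place label). Invented data.
CHECKINS = [
    ("2012-03-05T08:10", "station"),
    ("2012-03-05T09:00", "office"),
    ("2012-03-05T22:30", "flat"),
    ("2012-03-06T08:05", "station"),
    ("2012-03-06T09:10", "office"),
    ("2012-03-06T23:00", "flat"),
    ("2012-03-07T08:15", "station"),
    ("2012-03-07T13:00", "cafe"),
    ("2012-03-07T22:45", "flat"),
]


def period_of_day(hour):
    """Map an hour (0-23) onto a coarse period of the day."""
    if hour < 12:
        return "morning"
    if hour < 18:
        return "afternoon"
    return "evening"


def usual_place_by_period(checkins):
    """Return the most frequently visited place for each period of the day."""
    counts = defaultdict(Counter)
    for stamp, place in checkins:
        hour = datetime.fromisoformat(stamp).hour
        counts[period_of_day(hour)][place] += 1
    return {period: c.most_common(1)[0][0] for period, c in counts.items()}


if __name__ == "__main__":
    # One check-in: a single place at a single time.
    print(usual_place_by_period(CHECKINS[:1]))
    # The full series: a likely commute and a likely home address.
    print(usual_place_by_period(CHECKINS))
```

Nothing in any individual record is especially sensitive here. It is the repetition over time that makes an evening location look like a home, which is the kind of use that is hard to anticipate when the data is first shared.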
For me the question is less about what kinds of data should or should not be open, and more about what processes we want to enable with that data and a judgement on the risks and rewards involved.
Everything being open to everyone is just the opposite extreme to the largely closed world we’ve been living in to date. There’s still plenty of scope to discuss the points in between.