Data is Potential

Jeni Tennison asked an interesting question on Twitter last week:

Question: aside from personally identifiable data, is there any data that *should not* be open?

The question prompted some interesting discussion which included examples of data that might be sensitive, suggestions about data that would be useful to open up, and the need for better understanding of how data can be applied.

But I don’t feel like we’ve yet got a good framework for these kinds of discussions. Applying labels can often mask important aspects of the debate.

For example, we talk about data being Open but often overlook that this doesn’t immediately make it accessible to anyone who wants to use it, at any time. That requires skills, supporting infrastructure and applications.

We might want data to be Open, and free at the point of use, but sometimes overlook the costs of collection and curation. Offsetting those costs, and switching to new, more open commercial models, can be achieved in different ways.

Similarly, we’re often concerned about the privacy of Personal data, but want to reserve the right to require some people, e.g. those in public office, to release more information. Privacy is rarely a binary decision. Sharing is usually a matter of degree, not a simple public-private distinction.

A Process View?

In my view we focus too much on the data itself — what can we release — rather than the wider process:

  • What data is being collected?
  • Who is collecting the data?
  • Who (or what) is the data being collected about?
  • What immediate use is going to be made of the data?
  • What future uses might the collector make of the data?
  • How is the data going to be distributed?
  • Who else can have access to the data?
  • What are the terms of use — either immediate or future — for the data?
  • What other data might it be remixed with?

And so on. Data collection, curation, publishing and re-use is a process. Understanding that process, as it applies to particular data, helps us to understand the risks & rewards of data sharing, whether it’s personal data or government data. We often talk about provenance, but that’s usually a retrospective view, e.g. where did this data come from? But we also need to concern ourselves with future uses. Licensing is important, as is sustainability.

Answering these kinds of questions, for different types of data, may be illuminating. For example, data that I collect myself about my diet is highly personal data, but I will have a different attitude to sharing that than to sharing my bank statement. Data about my spending habits is collected for me by my bank. We share access to that because it is mutually beneficial (I think!).

Greater access to data that is collected about me (but not necessarily by me) could be useful in other contexts. But unless I plan to analyse all that myself, I’m going to end up sharing it with someone in order to get some useful insight, e.g. suggestions on better financial management or, in the case of utility bills, proposals for a more cost effective provider.

Choosing to publish data openly, for unrestricted use, or within a limited group, or not at all is a decision that has to be made with informed consent and an appreciation of the risks and rewards. That’s true for governments, organisations and individuals.

Potential Energy

Data is stored potential.

The Big Data movement is largely about organisations realising that they can tap into their large internal data reserves faster and in more cost-effective ways than was previously possible. The technology is helping unlock stored potential in internal data for its current owners.

In contrast, the Open Data movement is largely about unlocking potential by putting data into the hands of more people. More hands on that data allows it to be used in potentially more creative ways, perhaps to drive innovation or to increase transparency.

Personal data stores and the “midata” vision are intended to unlock potential by allowing individuals to readily access and share their data in more ways.

Unstructured data has less potential than structured data. The effort put into collecting and curating data increases its potential by making it easier to process or improving its quality.

Similarly, the potential in data that is released on a one-off basis declines over time, at a speed depending on the rate of change of the dataset.

Much of the education that is happening in government and in enterprises around data is in building understanding of the potential in their data. The education that needs to happen for all of us is in understanding the potential of our own data, both for good and for ill. What we give away either willingly or unconsciously can be used in unexpected ways.

However, even for simple data items it can be difficult to foresee all potential uses. A single check-in at a geographic location is one thing, but a series of check-ins over time enables an entirely different kind of application and analysis. Aggregate that with other data and the options expand in many different ways.
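To make that concrete, here is a small sketch in plain Python, using entirely hypothetical check-in data: a single check-in is just a point on a map, but even a short series reveals a routine.

```python
from collections import Counter

# Hypothetical series of check-ins: (timestamp, place).
checkins = [
    ("2012-07-02T08:10", "Caffe Nero, Bath"),
    ("2012-07-02T18:30", "Oldfield Park"),
    ("2012-07-03T08:05", "Caffe Nero, Bath"),
    ("2012-07-03T18:40", "Oldfield Park"),
    ("2012-07-04T08:12", "Caffe Nero, Bath"),
]

# One check-in says little; the aggregate reveals a daily pattern.
by_place = Counter(place for _, place in checkins)
print(by_place.most_common(1))  # the most frequented location
```

A handful of records is enough to infer where someone is most mornings, which is exactly the kind of unexpected use the original data subject may not have anticipated.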

For me the question is less about what kinds of data should or should not be open, but about what processes we want to enable with that data and a judgement on the risk-rewards involved.

Everything being open to everyone is just the opposite extreme to the, largely closed, world we’ve been living in to date. There’s still plenty of scope to discuss the points in between.

UK & EU Linked Data Consultant Network?

As I explained in my announcement that I’m leaving Talis, I’m going to be exploring freelance consulting opportunities.

While I’m not limiting that to Linked Data work, it’s an area in which I have a lot invested and in which there is still lots of activity. Perhaps not enough to support Talis Systems, but there certainly seem to be a number of opportunities that could support freelancers and small consulting businesses.

Talis was always keen to help develop the market and had quite open relationships with others in the industry. Everyone benefits if the pie gets bigger, and in an early stage market it makes sense to share opportunities and collaborate where possible.

I’d like to continue that if possible. Even in the last few days I’ve had questions about whether Talis’ decision might mark the beginning of some wider move away from the technology. That’s certainly not how I see it. Even Talis is not moving away from the technology, it’s just focusing on a specific sector. I’ve already learnt of other companies that are starting to embrace Linked Data within the enterprise.

I think it would be a good thing if those of us working in this area in the UK & EU organised ourselves a little more, to make the most of the available opportunities and to continue to grow the market. There are various interest groups (like Lotico) but those are more community than business focused.

A network could take a number of forms. It might simply be a LinkedIn network. Or a (closed?) mailing list to share opportunities and experience. But it would be nice to find a way to share successes and case studies where they exist. There are sites that promote projects, but I wonder whether something more focused might be useful.

These are just some early stage thoughts. What I’d most like to do is find out:

  • whether others think this is a good idea — would it be useful?
  • what forms people would prefer to see it take — what would be useful for you?
  • who is active, as a freelancer or SME, in this area — I have some contacts but I doubt it’s exhaustive

If you’ve got thoughts on those then please drop a comment on this post. Or drop me an email.

Leaving Talis

Earlier today I hit the publish buttons on the blog posts announcing the shutdown of Kasabi and the end of Talis’s semantic web activities. Neither of those were easy to write.

My time at Talis — which will have been four years in September — has been a fantastic experience. I’ve worked with some incredibly talented people on a wide range of projects. The culture and outlook at Talis was like no other company I’ve worked for; it’s a real pleasure to have been part of that. I’ve learnt a massive amount in so many different areas.

I’d argue that Talis more than any other company has worked incredibly hard to promote and support work around the Semantic Web and Linked Data. And I’m really proud of that. Despite increasing — but still slow — adoption, the decision was made that there was only so much more that could be done, and that it was time for Talis to focus elsewhere. Over the next few weeks I’ll be winding up Talis Systems’ activities in that area, and working with existing customers on continuity plans.

This year has been very difficult, on a number of levels. On the whole I’m now glad that I can focus on the future with a fresh outlook.

In the short term I’m considering freelance opportunities. If you’re interested in talking about that, then please get in touch. My profile is on LinkedIn and I’m available for work from 1st August.

If you need help with a Linked Data or Open Data project or product, then get in touch. Over the past few years I’ve done everything from data processing through to modelling, product & technical strategy, and even training.

Longer term, I want to take some time to think about the kind of work that I enjoy doing. I love building products, particularly those that are heavily data-driven. I want to build something around Open Data. Beyond that I’m not yet sure.

If you have something that you think I could help with, then I’d love to hear from you.

Four Links Good, Two Links Bad?

Having reviewed a number of Linked Data papers and projects I’ve noticed a recurring argument which goes something like this: “there are billions of triples available as Linked Data but the number of links, either within or between datasets, is still a small fraction of that total number. This is bad, and hence here is our software/project/methodology which will fix that…“.

I’m clearly paraphrasing, partly because I don’t want to single out any particular paper or project, but there seems to be a fairly deep-rooted belief that there aren’t enough links between datasets and this is something that needs fixing. The lack of links is often held up as a reason for why working with different datasets is harder than it should be.

But I’m not sure that I’ve seen anyone attempt to explain why increasing the density of links is automatically a good thing. Or, better yet, attempt to quantify in some way what a “good” level of inter-linking might be.

Simply counting links, and attempting to improve on that number, also glosses over reasons for why some links don’t exist in the first place. Is it better to have more links, or better quality links?

A Simple Illustration

Here’s a simple example. Consider the following trivial piece of Linked Data.

@prefix dct: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

# example.org URIs are illustrative stand-ins for the dataset's own identifiers
<http://example.org/book/war-and-peace>
 dct:title "War and Peace";
 dct:creator <http://example.org/author/tolstoy>.

<http://example.org/book/anna-karenina>
 dct:title "Anna Karenina";
 dct:creator <http://example.org/author/tolstoy>.

<http://example.org/author/tolstoy>
 foaf:name "Leo Tolstoy";
 owl:sameAs <http://dbpedia.org/resource/Leo_Tolstoy>.

The example has two resources which are related to their creator, which is identified as being the same as a resource in dbpedia. This is a very common approach as typically a dataset will be progressively enriched with equivalence links. It’s much easier to decouple data conversion from inter-linking if a dataset is initially completely self-referential.

But if we’re counting links, then we only have a single outbound link. We could restructure the data as follows:

@prefix dct: <http://purl.org/dc/terms/> .

<http://example.org/book/war-and-peace>
 dct:title "War and Peace";
 dct:creator <http://dbpedia.org/resource/Leo_Tolstoy>.

<http://example.org/book/anna-karenina>
 dct:title "Anna Karenina";
 dct:creator <http://dbpedia.org/resource/Leo_Tolstoy>.

We now have fewer triples, but we’ve increased the number of outbound links. If all we’re doing is measuring link density between datasets then clearly the second is “better”. We could go a step further and materialize inferences in our original dataset to assert all of the owl:sameAs links, giving us an even higher link density.
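The comparison can be made mechanical. Here is a rough Python sketch, with triples as tuples and hypothetical example.org URIs standing in for a real dataset, that counts outbound links as triples whose object URI falls outside the dataset's own namespace:

```python
# Count outbound links: triples whose object is a URI outside the
# dataset's own namespace. All URIs here are hypothetical stand-ins.
LOCAL = "http://example.org/"

def outbound_links(triples):
    return sum(
        1 for s, p, o in triples
        if o.startswith("http") and not o.startswith(LOCAL)
    )

# Version 1: books link to a local author, which links out once.
version1 = [
    ("http://example.org/book/1", "dct:title", "War and Peace"),
    ("http://example.org/book/1", "dct:creator", "http://example.org/author/tolstoy"),
    ("http://example.org/book/2", "dct:title", "Anna Karenina"),
    ("http://example.org/book/2", "dct:creator", "http://example.org/author/tolstoy"),
    ("http://example.org/author/tolstoy", "foaf:name", "Leo Tolstoy"),
    ("http://example.org/author/tolstoy", "owl:sameAs", "http://dbpedia.org/resource/Leo_Tolstoy"),
]

# Version 2: books link straight out to dbpedia.
version2 = [
    ("http://example.org/book/1", "dct:title", "War and Peace"),
    ("http://example.org/book/1", "dct:creator", "http://dbpedia.org/resource/Leo_Tolstoy"),
    ("http://example.org/book/2", "dct:title", "Anna Karenina"),
    ("http://example.org/book/2", "dct:creator", "http://dbpedia.org/resource/Leo_Tolstoy"),
]

print(outbound_links(version1), outbound_links(version2))  # 1 vs 2
```

By this naive measure the second version scores twice as well, and with fewer triples its link density (links per triple) is higher still, even though it has dropped the author's name from the data entirely.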

This is clearly a trivial example but it illustrates that even for very simple scenarios we can make decisions on how to publish Linked Data that impact link density. As ever we need a more nuanced understanding to help identify trade-offs for both publisher and consumer.

The first option with the lowest outbound link density is the better option in my opinion, for various reasons:

  • The dataset is initially self-contained, allowing data production to be separated from the process of inter-linking, thereby simplifying data publishing
  • Use of local URIs provides a convenient point of attachment for local annotations of resources, which is useful if I have additional statements to make about the equivalent resources
  • Use of local URIs allows me to decide on my own definition of that resource, without immediately buying into a third-party definition.
  • Use of local URIs makes it easier to add new link targets, or remove existing links, at a later date
But there are also downsides:

  • Consumers need to apply reasoning, or similar, in order to smush together datasets, adding extra client-side processing
  • There are more URIs — a greater “surface area” — to maintain within my Linked Data

And we’ve not yet considered how the links are created. Regardless of whether you’re creating links manually or automatically, there’s a cost to their creation. So which links are the most important to create? For me, and for the users of my data?

There is likely to be a law of diminishing returns on both sides for the addition of new links, particularly if “missing” relationships between resources can be otherwise inferred. For example, if A is sameAs B, then it’s probably unnecessary for me to assert equivalences to all the resources to which B is, in turn, equivalent. Saying less reduces the amount of data I’m producing, so I can focus on making sure it’s of good quality.
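One way to see why: owl:sameAs is symmetric and transitive, so a consumer that computes equivalence classes gets the onward links for free. A minimal sketch, using a simple union-find over hypothetical prefixed names rather than a real RDF library:

```python
# Equivalence classes implied by owl:sameAs, via a simple union-find.
# Prefixed names are hypothetical; a real consumer would use an RDF toolkit.
def equivalence_classes(pairs):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)  # union the two classes

    classes = {}
    for x in parent:
        classes.setdefault(find(x), set()).add(x)
    return list(classes.values())

# A single assertion is enough: the link already published elsewhere
# pulls the third identifier into the same equivalence class.
asserted = [
    ("ex:tolstoy", "dbpedia:Leo_Tolstoy"),        # my one new link
    ("dbpedia:Leo_Tolstoy", "other:LeoTolstoy"),  # published by a third party
]
print(equivalence_classes(asserted))
```

All three identifiers end up in one class, so a direct ex:tolstoy to other:LeoTolstoy link would add nothing except more data to maintain.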

Not All Datasets are Equal

Datasets can exhibit very different link characteristics, deriving from how they’re published, who is publishing them, and why. Again, these are nuances that are lost by simply maximising link density.

Some datasets are purely annotations. An annotation dataset may have no (new) links in it at all; it might be published only to enrich an existing dataset. Because of that lack of links it won’t appear on the Linked Data cloud, and such datasets aren’t, yet, easily discoverable. But they’re easy to layer onto existing data and don’t require commitments to maintaining URIs, so they have their advantages.

Some datasets are link bases: they consist only of links and exist to help connect together previously unconnected datasets. Really they’re a particular kind of annotation, so share similar advantages and disadvantages.

Some datasets are hubs. These are intended to be link targets or to be annotated, but may not link to other sources. The UK Government reference interval URIs are one example of a “hub” dataset. The same is true for the Companies House URIs. Many datasets published by their managing authority will have a low outbound link density, simply because they are the definitive source of that data. Where else would you go? Other data publishers may annotate them, or define equivalents, but the source dataset itself may be low in links and remain so over time.

Related to this point, there are several social, business and technical reasons why links may deliberately not exist between datasets:

  • Because they embody a different world-view or different levels of modelling precision. The Ordnance Survey dataset doesn’t link to dbpedia because even where there are apparent equivalences, with a little more digging it turns out that resources aren’t precisely the same.
  • Because the data publisher has concerns about the quality of the destination. A publisher of biomedical data, for example, may choose not to link to another dataset if more harm would be done by linking to, and then consuming, incorrect data than by having no links at all.
  • Because the data publisher chooses not to link to data from a competitor.
  • Because datasets are published and updated on different time-scales. This is the reason for the appearance of many proxy URIs in datasets.

If, as a third party, I publish a Link Base that connects two datasets, then only in the last two scenarios am I automatically improving the situation for everyone.

In the other two scenarios I’m likely to be degrading the value of the available data by leading consumers to incorrect data or conclusions. So if you’re publishing a Link Base you need to be really clear that you understand the two datasets you’re connecting and the costs/benefits involved in making those links. Similarly, if you’re a consumer, consider the provenance of those links.

How do consumers rank and qualify different data sources? Blindly following your nose may not always be the best option.

Interestingly I’ve seen very little use of owl:differentFrom by data publishers. I wonder if this would be a useful way for a publisher to indicate that they have clearly considered whether some resources in a dataset are equivalent, but have decided that they are not. Seems like the closest thing to “no follow” in RDF.

Ironically of course, publishing lots of owl:differentFrom statements increases link density! But that speaks to my point that counting links alone isn’t useful. Any dataset can be added to the Linked Data Cloud diagram by adding 51 owl:differentFrom statements to an arbitrary selection of resources.

Studying link density and dataset connectivity is potentially an interesting academic exercise. I’d be interested to see how different datasets, perhaps from different subject domains, relate to known network topologies. But as the Linked Data cloud continues to grow we ought to think carefully about what infrastructure we need to help it be successful.

Focusing on increasing link density, e.g. by publishing more link bases, or by creating more linking tools, may not be the most valuable area to focus on. Infrastructure to support better selection, ranking and discovery of datasets is likely to offer more value longer term; we can see that from the existing web. Similarly, when we’re advising publishers, particularly governments on how to publish and link their data, there are many nuances to consider.

More links aren’t always better.