Four Links Good, Two Links Bad?

Having reviewed a number of Linked Data papers and projects I’ve noticed a recurring argument which goes something like this: “there are billions of triples available as Linked Data but the number of links, either within or between datasets, is still a small fraction of that total number. This is bad, and hence here is our software/project/methodology which will fix that…”.

I’m clearly paraphrasing, partly because I don’t want to single out any particular paper or project, but there seems to be a fairly deep-rooted belief that there aren’t enough links between datasets and this is something that needs fixing. The lack of links is often held up as a reason for why working with different datasets is harder than it should be.

But I’m not sure that I’ve seen anyone attempt to explain why increasing the density of links is automatically a good thing. Or, better yet, attempt to quantify in some way what a “good” level of inter-linking might be.

Simply counting links, and attempting to improve on that number, also glosses over reasons for why some links don’t exist in the first place. Is it better to have more links, or better quality links?

A Simple Illustration

Here’s a simple example. Consider the following trivial piece of Linked Data.

@prefix dct: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

# the example.org URIs are illustrative placeholders
<http://www.example.org/book/war-and-peace>
 dct:title "War and Peace";
 dct:creator <http://www.example.org/author/tolstoy>.

<http://www.example.org/book/anna-karenina>
 dct:title "Anna Karenina";
 dct:creator <http://www.example.org/author/tolstoy>.

<http://www.example.org/author/tolstoy>
 foaf:name "Leo Tolstoy";
 owl:sameAs <http://dbpedia.org/resource/Leo_Tolstoy>.

The example has two resources which are related to their creator, which is identified as being the same as a resource in dbpedia. This is a very common approach as typically a dataset will be progressively enriched with equivalence links. It’s much easier to decouple data conversion from inter-linking if a dataset is initially completely self-referential.

But if we’re counting links, then we only have a single outbound link. We could restructure the data as follows:

@prefix dct: <http://purl.org/dc/terms/> .

<http://www.example.org/book/war-and-peace>
 dct:title "War and Peace";
 dct:creator <http://dbpedia.org/resource/Leo_Tolstoy>.

<http://www.example.org/book/anna-karenina>
 dct:title "Anna Karenina";
 dct:creator <http://dbpedia.org/resource/Leo_Tolstoy>.

We now have fewer triples, but we’ve increased the number of outbound links. If all we’re doing is measuring link density between datasets then the second is clearly “better”. We could go a step further and materialize inferences in our original dataset to assert all of the owl:sameAs links, giving us an even higher link density.

This is clearly a trivial example but it illustrates that even for very simple scenarios we can make decisions on how to publish Linked Data that impact link density. As ever we need a more nuanced understanding to help identify trade-offs for both publisher and consumer.
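To make that concrete, here’s a minimal Python sketch of the kind of naive link counting being criticised. The triples, URIs and the `outbound_links` helper are all hypothetical illustrations, with an outbound link counted as any triple whose object URI falls outside the dataset’s own namespace:

```python
# Triples as (subject, predicate, object) tuples; URIs are illustrative.
LOCAL_NS = "http://example.org/"

def outbound_links(triples):
    """Count triples whose object is a URI outside the local namespace."""
    return sum(
        1 for _, _, o in triples
        if o.startswith("http") and not o.startswith(LOCAL_NS)
    )

# Version 1: books point at a local author URI, linked once via owl:sameAs.
v1 = [
    ("http://example.org/book/1", "dct:title", "War and Peace"),
    ("http://example.org/book/1", "dct:creator", "http://example.org/author/tolstoy"),
    ("http://example.org/book/2", "dct:title", "Anna Karenina"),
    ("http://example.org/book/2", "dct:creator", "http://example.org/author/tolstoy"),
    ("http://example.org/author/tolstoy", "owl:sameAs",
     "http://dbpedia.org/resource/Leo_Tolstoy"),
]

# Version 2: books link straight to the dbpedia URI.
v2 = [
    ("http://example.org/book/1", "dct:title", "War and Peace"),
    ("http://example.org/book/1", "dct:creator", "http://dbpedia.org/resource/Leo_Tolstoy"),
    ("http://example.org/book/2", "dct:title", "Anna Karenina"),
    ("http://example.org/book/2", "dct:creator", "http://dbpedia.org/resource/Leo_Tolstoy"),
]

print(outbound_links(v1))  # 1
print(outbound_links(v2))  # 2
```

By this crude metric the second version wins, even though it carries less information.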

In my opinion the first option, with the lower outbound link density, is the better one, for several reasons:

  • The dataset is initially self-contained, allowing data production to be separated from the process of inter-linking, thereby simplifying data publishing
  • Use of local URIs provides a convenient point of attachment for local annotations of resources, which is useful if I have additional statements to make about the equivalent resources
  • Use of local URIs allows me to decide on my own definition of that resource, without immediately buying into a third-party definition.
  • Use of local URIs makes it easier to add new link targets, or remove existing links, at a later date

But there are also downsides:
  • Consumers need to apply reasoning, or similar, in order to smush together datasets, adding extra client-side processing
  • There are more URIs — a greater “surface area” — to maintain within my Linked Data
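The client-side “smushing” mentioned in the first downside can be sketched in a few lines of Python. This is illustrative only: triples are plain tuples, and the `smush` helper and the URIs are hypothetical rather than part of any RDF library:

```python
def smush(triples, same_as):
    """Rewrite subjects and objects to canonical URIs, given a sameAs map."""
    canon = lambda term: same_as.get(term, term)
    return {(canon(s), p, canon(o)) for s, p, o in triples}

# Map from local URI to its asserted owl:sameAs equivalent.
same_as = {
    "http://example.org/author/tolstoy":
        "http://dbpedia.org/resource/Leo_Tolstoy",
}

triples = {
    ("http://example.org/book/1", "dct:creator",
     "http://example.org/author/tolstoy"),
    ("http://example.org/author/tolstoy", "foaf:name", "Leo Tolstoy"),
}

merged = smush(triples, same_as)
# After smushing, both statements hang off the dbpedia URI.
```

Even this trivial rewrite is work the consumer has to do, which is the extra client-side processing referred to above.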
And we’ve not yet considered how the links are created. Regardless of whether you’re creating links manually or automatically, there’s a cost to their creation. So which links are the most important to create? For me, and for the users of my data?

There is likely to be a law of diminishing returns on both sides for the addition of new links, particularly if “missing” relationships between resources can otherwise be inferred. For example, if A is sameAs B, then it’s probably unnecessary for me to assert equivalences to all the resources to which B is, in turn, equivalent. Saying less reduces the amount of data I’m producing, so I can focus on making sure it’s of good quality.
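That inference is just transitivity: a consumer can recover every implied equivalence from a spanning set of owl:sameAs links. A small union-find sketch in Python (illustrative; the `equivalence_classes` helper is hypothetical, and bare strings stand in for URIs):

```python
def equivalence_classes(pairs):
    """Group terms into equivalence classes from pairwise sameAs links."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())

# Only A sameAs B and B sameAs C are asserted; A sameAs C is inferred.
classes = equivalence_classes([("A", "B"), ("B", "C")])
print(classes)  # one class containing 'A', 'B' and 'C'
```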

Not All Datasets are Equal

Datasets can exhibit very different link characteristics, deriving from how they’re published, who is publishing them, and why. Again, these are nuances that are lost by simply maximising link density.

Some datasets are purely annotations. An annotation dataset may have no (new) links in it at all; it might be published solely to enrich an existing dataset. Because of the lack of links it won’t appear on the Linked Data cloud diagram, and annotation datasets aren’t, yet, easily discoverable. But they’re easy to layer onto existing data and don’t require a commitment to maintaining URIs, so they have their advantages.

Some datasets are link bases: they consist only of links and exist to help connect together previously unconnected datasets. Really they’re a particular kind of annotation, so share similar advantages and disadvantages.

Some datasets are hubs. These are intended to be link targets or to be annotated, but may not link to other sources. The UK Government reference interval URIs are one example of a “hub” dataset. The same is true for the Companies House URIs. It’s likely that many datasets published by their managing authority will have a low outbound link density, simply because they are the definitive source of that data. Where else would you go? Other data publishers may annotate them, or define equivalents, but the source dataset itself may be low in links and remain so over time.

Related to this point, there are several social, business and technical reasons why links may deliberately not exist between datasets:

  • Because they embody a different world-view or different levels of modelling precision. The Ordnance Survey dataset doesn’t link to dbpedia because, even where there are apparent equivalences, a little more digging shows that the resources aren’t precisely the same.
  • Because the data publisher has concerns about the quality of the destination. A publisher of biomedical data may choose not to link to another dataset if there are doubts about the quality of its data: more harm may be done by linking to, and then consuming, incorrect data than by having no links at all.
  • Because the data publisher chooses not to link to data from a competitor.
  • Because datasets are published and updated on different time-scales. This is the reason for the appearance of many proxy URIs in datasets.

If, as a third party, I publish a Link Base that connects two datasets, then only in the last two scenarios am I automatically improving the situation for everyone.

In the other two scenarios I’m likely to be degrading the value of the available data by leading consumers to incorrect data or conclusions. So if you’re publishing a Link Base you need to be really clear on whether you understand the two datasets you’re connecting and the cost/benefits involved in making those links. Similarly, if you’re a consumer, consider the provenance of those links.

How do consumers rank and qualify different data sources? Blindly following your nose may not always be the best option.

Interestingly I’ve seen very little use of owl:differentFrom by data publishers. I wonder if this would be a useful way for a publisher to indicate that they have clearly considered whether some resources in a dataset are equivalent, but have decided that they are not. Seems like the closest thing to “no follow” in RDF.

Ironically of course, publishing lots of owl:differentFrom statements increases link density! But that speaks to my point that counting links alone isn’t useful. Any dataset can be added to the Linked Data Cloud diagram by adding 51 owl:differentFrom statements to an arbitrary selection of resources.
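To labour the point, a naive counter cannot tell the predicates apart. Another hypothetical Python sketch, in which 51 owl:differentFrom statements register as 51 “links” even though a sameAs-only count would be zero:

```python
def count_external_links(triples, local_ns="http://example.org/"):
    """Naive counter: any triple whose object points outside the namespace."""
    return sum(
        1 for _, _, o in triples
        if o.startswith("http") and not o.startswith(local_ns)
    )

def count_links_by_predicate(triples, predicate):
    """Count only links made with a specific predicate."""
    return sum(1 for _, p, _ in triples if p == predicate)

# 51 differentFrom statements against arbitrary dbpedia resources.
spam = [
    ("http://example.org/r%d" % i, "owl:differentFrom",
     "http://dbpedia.org/resource/Thing%d" % i)
    for i in range(51)
]

naive = count_external_links(spam)                     # 51
useful = count_links_by_predicate(spam, "owl:sameAs")  # 0
```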

Studying link density and dataset connectivity is potentially an interesting academic exercise. I’d be interested to see how different datasets, perhaps from different subject domains, relate to known network topologies. But as the Linked Data cloud continues to grow we ought to think carefully about what infrastructure we need to help it be successful.

Focusing on increasing link density, e.g. by publishing more link bases, or by creating more linking tools, may not be the most valuable area to focus on. Infrastructure to support better selection, ranking and discovery of datasets is likely to offer more value longer term; we can see that from the existing web. Similarly, when we’re advising publishers, particularly governments on how to publish and link their data, there are many nuances to consider.

More links aren’t always better.

