UK & EU Linked Data Consultant Network?

As I explained when I announced that I’m leaving Talis, I’m going to be exploring freelance consulting opportunities.

While I’m not limiting that to Linked Data work, it’s an area in which I have a lot invested and in which there is still plenty of activity. Perhaps not enough to support Talis Systems, but there certainly seem to be a number of opportunities that could support freelancers and small consulting businesses.

Talis was always keen to help develop the market and had quite open relationships with others in the industry. Everyone benefits if the pie gets bigger, and in an early-stage market it makes sense to share opportunities and collaborate where possible.

I’d like to continue that if possible. Even in the last few days I’ve had questions about whether Talis’ decision might mark the beginning of some wider move away from the technology. That’s certainly not how I see it. Even Talis is not moving away from the technology, it’s just focusing on a specific sector. I’ve already learnt of other companies that are starting to embrace Linked Data within the enterprise.

I think it would be a good thing if those of us working in this area in the UK & EU organised ourselves a little more, to make the most of the available opportunities and to continue to grow the market. There are various interest groups (like Lotico) but those are more community focused than business focused.

A network could take a number of forms. It might simply be a LinkedIn network, or a (closed?) mailing list to share opportunities and experience. But it would be nice to find a way to share successes and case studies where they exist. Sites like SemanticWeb.com often promote projects, but I wonder whether something more focused might be useful.

These are just some early stage thoughts. What I’d most like to do is find out:

  • whether others think this is a good idea — would it be useful?
  • what forms people would prefer to see it take — what would be useful for you?
  • who is active, as a freelancer or SME, in this area — I have some contacts but I doubt my list is exhaustive

If you’ve got thoughts on those then please drop a comment on this post. Or drop me an email.

Four Links Good, Two Links Bad?

Having reviewed a number of Linked Data papers and projects I’ve noticed a recurring argument which goes something like this: “there are billions of triples available as Linked Data but the number of links, either within or between datasets, is still a small fraction of that total number. This is bad, and hence here is our software/project/methodology which will fix that…“.

I’m clearly paraphrasing, partly because I don’t want to single out any particular paper or project, but there seems to be a fairly deep-rooted belief that there aren’t enough links between datasets and this is something that needs fixing. The lack of links is often held up as a reason for why working with different datasets is harder than it should be.

But I’m not sure that I’ve seen anyone attempt to explain why increasing the density of links is automatically a good thing. Or, better yet, attempt to quantify in some way what a “good” level of inter-linking might be.

Simply counting links, and attempting to improve on that number, also glosses over reasons for why some links don’t exist in the first place. Is it better to have more links, or better quality links?

A Simple Illustration

Here’s a simple example. Consider the following trivial piece of Linked Data.

@prefix dct: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

<http://example.org/book/war-and-peace>
 dct:title "War and Peace";
 dct:creator <http://example.org/author/leo-tolstoy>.
<http://example.org/book/anna-karenina>
 dct:title "Anna Karenina";
 dct:creator <http://example.org/author/leo-tolstoy>.

<http://example.org/author/leo-tolstoy>
 foaf:name "Leo Tolstoy";
 owl:sameAs <http://dbpedia.org/resource/Leo_Tolstoy>.

The example has two resources which are related to their creator, which is identified as being the same as a resource in DBpedia. This is a very common approach, as typically a dataset will be progressively enriched with equivalence links. It’s much easier to decouple data conversion from inter-linking if a dataset is initially completely self-referential.

But if we’re counting links, then we only have a single outbound link. We could restructure the data as follows:

@prefix dct: <http://purl.org/dc/terms/> .
<http://example.org/book/war-and-peace>
 dct:title "War and Peace";
 dct:creator <http://dbpedia.org/resource/Leo_Tolstoy>.

<http://example.org/book/anna-karenina>
 dct:title "Anna Karenina";
 dct:creator <http://dbpedia.org/resource/Leo_Tolstoy>.

We now have fewer triples, but we’ve increased the number of outbound links. If all we’re doing is measuring link density between datasets then clearly the second is “better”. We could go a step further and materialize inferences in our original dataset to assert all of the owl:sameAs links, giving us an even higher link density.
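
Counting links is easy, which is perhaps part of the problem. As a rough sketch (assuming, as in the examples above, that everything local lives under http://example.org/), a link count amounts to little more than this SPARQL query:

# count triples whose object is a URI outside the dataset's own namespace
SELECT (COUNT(*) AS ?outboundLinks)
WHERE {
  ?s ?p ?o .
  FILTER(isIRI(?o) && !STRSTARTS(STR(?o), "http://example.org/"))
}

Run against the two versions above it returns 1 and 2 respectively, which tells us nothing about which version is actually more useful to publish or consume.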

The book example is clearly trivial, but it illustrates that even in very simple scenarios we make publishing decisions that affect link density. As ever, we need a more nuanced understanding to help identify the trade-offs for both publisher and consumer.

The first option with the lowest outbound link density is the better option in my opinion, for various reasons:

  • The dataset is initially self-contained, allowing data production to be separated from the process of inter-linking, thereby simplifying data publishing
  • Use of local URIs provides a convenient point of attachment for local annotations of resources, which is useful if I have additional statements to make about the equivalent resources
  • Use of local URIs allows me to decide on my own definition of that resource, without immediately buying into a third-party definition.
  • Use of local URIs makes it easier to add new link targets, or remove existing links, at a later date

But there are also downsides:

  • Consumers need to apply reasoning, or something similar, in order to smush together datasets, adding extra client-side processing (see the sketch below)
  • There are more URIs — a greater “surface area” — to maintain within my Linked Data
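
As a sketch of that first downside: without reasoning, a consumer querying the first version of the data has to follow the owl:sameAs links explicitly (in both directions) to gather everything known about the author, for example with a SPARQL 1.1 property path:

PREFIX owl: <http://www.w3.org/2002/07/owl#>

# gather statements about the local URI and anything declared equivalent to it
SELECT ?property ?value
WHERE {
  <http://example.org/author/leo-tolstoy> (owl:sameAs|^owl:sameAs)* ?same .
  ?same ?property ?value .
}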

And we’ve not yet considered how the links are created. Regardless of whether you’re creating links manually or automatically, there’s a cost to their creation. So which links are the most important to create? For me, and for the users of my data?

There is likely to be a law of diminishing returns on both sides for the addition of new links, particularly if “missing” relationships between resources can be otherwise inferred. For example, if A is sameAs B, then it’s probably unnecessary for me to assert equivalences to all the resources to which B is, in turn, equivalent. Saying less reduces the amount of data I’m producing, so I can focus on making sure it’s of good quality.

Not All Datasets are Equal

Datasets will naturally exhibit very different link characteristics, depending on how they’re published, who is publishing them, and why. Again, these are nuances that are lost by simply maximising link density.

Some datasets are purely annotations. An annotation dataset may have no (new) links in it at all; it might be published only to enrich an existing dataset. Because of the lack of links it won’t appear on the Linked Data cloud, and such datasets aren’t yet easily discoverable. But they’re easy to layer onto existing data and don’t require commitments to maintaining URIs, so they have their advantages.

Some datasets are link bases: they consist only of links and exist to help connect together previously unconnected datasets. Really they’re a particular kind of annotation, so they share similar advantages and disadvantages.

Some datasets are hubs. These are intended to be link targets or to be annotated, but may not link to other sources. The UK Government reference interval URIs are one example of a “hub” dataset. The same is true for the Companies House URIs. Many datasets published by their managing authority will have a low outbound link density, simply because they are the definitive source of that data. Where else would you go? Other data publishers may annotate them, or define equivalents, but the source dataset itself may be low in links and remain so over time.

Related to this point, there are several social, business and technical reasons why links may deliberately not exist between datasets:

  • Because they embody a different world-view or different levels of modelling precision. The Ordnance Survey dataset doesn’t link to DBpedia because, even where there are apparent equivalences, a little more digging shows that the resources aren’t precisely the same.
  • Because the data publisher has concerns about the quality of the destination dataset. A publisher of biomedical data may choose not to link to another dataset if there are doubts about its quality: more harm may be done by linking to, and then consuming, incorrect data than by having no links at all.
  • Because the data publisher chooses not to link to data from a competitor.
  • Because datasets are published and updated on different time-scales. This is the reason for the appearance of many proxy URIs in datasets.

If, as a third party, I publish a Link Base that connects two datasets, then only in the last two scenarios am I automatically improving the situation for everyone.

In the other two scenarios I’m likely to be degrading the value of the available data by leading consumers to incorrect data or conclusions. So if you’re publishing a Link Base you need to be really clear on whether you understand the two datasets you’re connecting and the cost/benefits involved in making those links. Similarly, if you’re a consumer, consider the provenance of those links.

How do consumers rank and qualify different data sources? Blindly following your nose may not always be the best option.

Interestingly I’ve seen very little use of owl:differentFrom by data publishers. I wonder if this would be a useful way for a publisher to indicate that they have clearly considered whether some resources in a dataset are equivalent, but have decided that they are not. Seems like the closest thing to “no follow” in RDF.
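
For example, a hypothetical publisher (the URIs below are illustrative, not taken from a real dataset) could record that an obvious candidate match was examined and rejected:

@prefix owl: <http://www.w3.org/2002/07/owl#> .

# we checked the apparent match and decided these are not the same resource
<http://example.org/places/cambridge>
 owl:differentFrom <http://dbpedia.org/resource/Cambridge,_Massachusetts> .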

Ironically of course, publishing lots of owl:differentFrom statements increases link density! But that speaks to my point that counting links alone isn’t useful. Any dataset can be added to the Linked Data Cloud diagram by adding 51 owl:differentFrom statements to an arbitrary selection of resources.

Studying link density and dataset connectivity is potentially an interesting academic exercise. I’d be interested to see how different datasets, perhaps from different subject domains, relate to known network topologies. But as the Linked Data cloud continues to grow we ought to think carefully about what infrastructure we need to help it be successful.

Focusing on increasing link density, e.g. by publishing more link bases, or by creating more linking tools, may not be the most valuable area to focus on. Infrastructure to support better selection, ranking and discovery of datasets is likely to offer more value longer term; we can see that from the existing web. Similarly, when we’re advising publishers, particularly governments on how to publish and link their data, there are many nuances to consider.

More links aren’t always better.

 

Principled use of RDF/XML

Everyone loves to hate RDF/XML. Indeed many have argued that RDF/XML is responsible for holding back semantic web adoption. I’m not sure that I fully agree with that (there are a lot of other issues to consider) but it’s certainly awkward to work with if you’re trying to integrate both RDF and XML tools into your application.

It’s actually that combination that causes the awkwardness. If you’re just using RDF tools then RDF/XML is mostly fine. It benefits from XML’s Unicode support and is the most widely supported RDF serialisation. There are downsides though. For example, there are some RDF graphs that can’t be serialised as RDF/XML. But that is easy to avoid.

Developers, particularly XML developers, feel cheated by RDF/XML because of what they see as false advertising: it’s an XML format that doesn’t play nicely with XML tools. Some time ago, Dan Brickley wrote a nice history of the design of RDF/XML which is worth a read for some background. My goal here isn’t to rehash the RDF/XML discussion or even to mount a defence of RDF/XML as a good format for RDF (I prefer Turtle).

But developers are still struggling with RDF/XML, particularly in publishing workflows where XML is a good base representation for document structures, so I think it’s worthwhile capturing some advice on how to reach a compromise with RDF/XML that allows it to work nicely with XML tools. I can’t remember seeing anyone do that before, so I thought I’d write down some of my experiences. These are drawn from creating a publishing platform that ingested metadata and content in XML, used Apache Jena for storing that metadata, and Solr as a search engine. Integration between different components was carried out using XML-based messaging. So there were several places where RDF and XML rubbed up against one another.

Tip 1: Don’t rely on default serialisations

The first thing to note is that RDF/XML offers a lot of flexibility in terms of how an RDF graph can be serialised as XML. A lot. The same graph can be serialised in many different ways using a lot of syntactic short-cuts. More on those in a moment.

It’s this unbounded flexibility that is the major source of the problems: producers and consumers may have reasonable default assumptions about how data will be published that are completely at odds with one another. This makes it very difficult to consume arbitrary RDF/XML with anything other than RDF tools.

JSON-LD offers a lot of flexibility too, and I can’t help but wonder whether that flexibility is going to come back and bite us in the future.

By default RDF tools tend to generate RDF/XML in a form that makes it easy for them to serialise. This tends to mean automatically generated namespace prefixes and a per-triple approach to serialising the graph, e.g.:

<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:p0="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:p1="http://xmlns.com/foaf/0.1/">
  <rdf:Description rdf:about="http://example.org/person/1">
    <p0:label>Joe Bloggs</p0:label>
  </rdf:Description>
  <rdf:Description rdf:about="http://example.org/person/1">
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://example.org/person/1">
    <p1:homepage rdf:resource="http://example.org/blogs/joe"/>
  </rdf:Description>
 </rdf:RDF>

This is a disaster for XML tools, as the description of the resource is spread across multiple elements, making it hard to process. But it’s efficient to generate.

Some RDF frameworks may provide options for customising the output to apply some of the RDF/XML syntactic shortcuts. As we’ll see in a moment these are worth embracing and may produce some useful regularity.

But if you need to generate an XML format that has, for example, a precise ordering of child elements then you’re not going to get that kind of flexibility by default. You’ll need to craft a custom serialiser. Apache Jena, for example, allows you to create custom RDF writers to support this kind of customisation. This isn’t ideal as you need to write code — even to tweak the output options — but it gives you more control.

So, if you need to generate an XML format from RDF sources then ensure that you normalize your output. If you have control over the XML document formats and can live with some flexibility in the content model, then using RDF/XML syntax shortcuts offered by your RDF tools might well be sufficient. However if you’re working to a more rigid format, then you’re likely to need some custom code.

Tip 2: Use all of the shortcuts

Let’s look at the above example again, but with a heavy dusting of syntax sugar:

<foaf:Person
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
 xmlns:foaf="http://xmlns.com/foaf/0.1/"
 rdf:about="http://example.org/person/1">
  <rdfs:label>Joe Bloggs</rdfs:label>
  <foaf:homepage rdf:resource="http://example.org/blogs/joe"/>
</foaf:Person>

Much nicer! The above describes exactly the same RDF graph as we had before. What have we done here:

  • We’ve omitted the rdf:RDF element as it’s unnecessary. If you have a single “root” resource in your graph then you can just use this as the document element. If we had multiple, unrelated Person resources in the document then we’d need to re-introduce the rdf:RDF element as a generic container.
  • Defined some readable namespace prefixes
  • Grouped triples about the same subject into the same element
  • Removed the use of rdf:Description and rdf:type, instead using the type (foaf:Person) as the element name

The result is something that is easier to read and much easier to work with in an XML context. You could even imagine creating an XML schema for this kind of document, particularly if you know which types and predicates are being used in your RDF graphs.

The nice thing about this approach is that it looks just like namespaced XML. For a publishing project I worked on, we defined our XML schemas for receipt of data using this kind of approach; the client didn’t really need to know anything about RDF. We just had to explain that:

  • rdf:about is how we assign a unique identifier to an entity (and we used xml:base to simplify the contents further and avoid repetition)
  • rdf:resource was a “link” between two resources, e.g. for cross-referencing between content and subject categories

If you’re not using RDF containers or collections then those two attributes are the only bits of RDF that creep into the syntax.

However, in our case we were also using RDF Lists to capture the ordering of authors in academic papers. So we also explained that rdf:parseType is an instruction to the parser indicating that some element content should be handled as a collection (a list).

This worked very well. We’d ended up with fine-grained document types anyway, to make it easier to update individual resources in the system, e.g. individual journal issues or articles, so the above structure mapped well to the system requirements.

Here’s a slightly more complex example that hopefully further illustrates the point. Here I’m showing nesting of several resource descriptions:

<ex:Article
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
 xmlns:foaf="http://xmlns.com/foaf/0.1/"
 xmlns:dc="http://purl.org/dc/terms/"
 xmlns:skos="http://www.w3.org/2004/02/skos/core#"
 xmlns:ex="http://example.org/ns/schema/"
 rdf:about="http://example.org/articles/1">

 <dc:title>An example article</dc:title>
 <dc:description>This is an article</dc:description>
 <ex:authors rdf:parseType="Collection">
   <foaf:Person rdf:about="http://example.org/person/1">
     <rdfs:label>Joe Bloggs</rdfs:label>
     <foaf:homepage rdf:resource="http://example.org/blogs/joe"/>
   </foaf:Person>
   <foaf:Person rdf:about="http://example.org/person/2">
     <rdfs:label>Sue Bloggs</rdfs:label>
     <foaf:homepage rdf:resource="http://example.org/blogs/sue"/>
   </foaf:Person>
 </ex:authors>
 <dc:relation>
   <ex:Article rdf:about="http://example.org/articles/2"/>
 </dc:relation>
 <dc:subject>
   <skos:Concept rdf:about="http://example.org/categories/example"/>
 </dc:subject>
</ex:Article>

The reality is that whether you’re working in an XML or an RDF context, there is very often a primary resource you’re interested in: e.g. you’re processing a resource, or rendering a view of it, etc. This means that in practice there’s nearly always an obvious and natural “root” element to the graph for creating an RDF/XML serialisation. It’s just that RDF tools don’t typically let you identify it.

Tip 3: Use RELAX NG

Because of the syntactic variation, writing schemas for RDF/XML can be damn near impossible. But for highly normalised RDF/XML it’s a much more tractable problem.

My preference has been to use RELAX NG as it offers more flexibility when creating open and flexible content models for elements, e.g. via interleaving. This gives options to leave the document structures a little looser to facilitate serialisation and also allow the contents of the graph to evolve (e.g. addition of new properties).

If you have the option, then I’d recommend RELAX NG when defining schemas for your XML data.

Tip 4: RDF for metadata; XML for content

The last tip isn’t about RDF/XML per se, I just want to make a general point about where to apply the different technologies.

XML is fantastic at describing document structures and content. RDF is fantastic at describing relationships between things. Both of those qualities are important, but in very different aspects of an application.

In my work in publishing I ended up using a triple store as the primary data repository. This is because the kinds of application behaviour I wanted to drive were increasingly relationship focused: e.g. browsing to related content, author based navigation, concept relationships, etc. Increasingly I also wanted the ability to create new slices and views across the same content and document structures were too rigid.

The extensibility of the RDF graph allowed me to quickly integrate new workflows (using the Blackboard pattern) so that I could, for example, harvest & integrate external links or use text mining tools to extract new relationships. This could be done without having to rework the main publishing workflow, evolve document formats, or the database for the metadata.

However XML works perfectly well for rendering out the detailed content. It would be crazy to try and capture content in RDF/XML (structure yes; but not content). So for transforming XML into HTML or other views, XML was the perfect starting point. We were early adopters of XProc so using pipelines to generate rendered content and to extract RDF/XML for loading into a triple store was easy to do.

In summary, RDF/XML is not a great format for working with RDF in an XML context, but it’s not completely broken. You just need to know how to get the best from it. It provides a default interoperable format for exchanging RDF data over the web, but there are better alternatives for hand-authoring and efficient loading. Once the RDF Working Group completes work on RDF 1.1, it’s likely that Turtle will rapidly become the main RDF serialisation.

However, I think that RDF/XML will still have a role, as part of a well-designed system, in bridging between RDF and XML tools.

Layered Data: A Paper & Some Commentary

Two years ago I wrote a short paper about “layering” data but for various reasons never got round to putting it online. The paper tried to capture some of my thinking at the time about the opportunities and approaches for publishing and aggregating data on the web. I’ve finally got around to uploading it and you can read it here.

I’ve made a couple of minor tweaks in a few places but I think it stands up well, even given the recent pace of change around data publishing and re-use. I still think the abstraction that it describes is not only useful but necessary to take us forward on the next wave of data publishing.

Rather than edit the paper to bring it completely up to date with recent changes, I thought I’d publish it as is and then write some additional notes and commentary in this blog post.

You’re probably best off reading the paper, then coming back to the notes here. The illustration referenced in the paper is also now up on slideshare.

RDF & Layering

I see that the RDF Working Group, prompted by Dan Brickley, is now exploring the term “layer”. I should acknowledge that I first heard the term used in conjunction with RDF by Dan, but I’ve tried to explore the concept from a number of perspectives.

The RDF Working Group may well end up using the term “layer” to mean a “named graph”. I’m using the term much more loosely in my paper. In my view an entire dataset could be a layer, as well as some easily identifiable sub-set of it. My usage might therefore be closer to Pat Hayes’s concept of a “Surface”, but I’m not sure.

I think that RDF is still an important factor in achieving the goal I outlined of allowing domain experts to quickly assemble aggregates through a layering metaphor. Or, if not RDF, then I think it would need to be based around a graph model, ideally one with a strong notion of identity. I also think that mechanisms to encourage sharing of both schemas and annotations are also useful. It’d be possible to build such a system without RDF, but I’m not sure why you’d go to the effort.

User Experience

One of the things that appeals to me about the concept of layering is that there are some nice ways to create visualisations and interfaces to support the creation, management and exploration of layers. It’s not hard to see how, given some descriptive metadata for a collection of layers, you could create:

  • A drag-and-drop tool for creating and managing new composite layers
  • An inspection tool that would let you explore how the dataset for an application or visualisation has been constructed, e.g. to explore provenance or to support sharing and customization. Think “view source” for data aggregation.
  • A recommendation engine that suggested new useful layers that could be added to a composite, including some indication of what additional query options might become available

There’s been some useful work done on describing datasets within the Linked Data community: VoID and DCAT, for example. However there’s not yet enough data routinely available about the structure and relationships of individual datasets, nor enough research into how to provide useful summaries.

This is what prompted my work on an RDF Report Card to try and move the conversation forward beyond simply counting triples.

To start working with layers, we need to understand what each layer contains and how they relate to and complement one another.

Linked Data & Layers

In the paper I suggest that RDF & Linked Data alone aren’t enough and that we need systems, tools and vocabularies for capturing the required descriptive data and enabling the kinds of aggregation I envisage.

I also think that the Linked Data community is spending far too much effort on creating new identifiers for the same things and worrying how best to define equivalences.

I think the leap of faith that’s required, and that people like the BBC have already taken, is that we just need to get much more comfortable re-using other people’s identifiers and publishing annotations. Yes, there will be times when identifiers diverge, but there’s a lot to be gained, especially in terms of efficiency around data curation, from focusing on the value-added data rather than re-publishing yet another copy of a core set of facts.

There are efficiency gains to be had from existing businesses, as well as faster routes to market for startups, if they can reliably build on some existing data. I suspect that there are also businesses that currently compete with one another — because they’re having to compile or re-compile the same core data assets — that could actually complement one another if they could instead focus on the data curation or collection tasks at which they excel.

Types of Data

In the paper I set out seven different facets which I think cover the majority of the types of data that we routinely capture and publish. The classification could be debated, but I think it’s a reasonable first attempt.

The intention is to try and illustrate that we can usefully group together different types of data, and that organisations may be particularly good at creating or collecting particular types. There’s scope for organisations to focus on being really good in a particular area; by avoiding needless competition around collecting and re-collecting the same core facts, there are almost certainly efficiency gains and cost savings to be had.

I’m sure there must be some prior work in this space, particularly around the core categories, so if anyone has pointers please share them.

There are also other ways to usefully categorise data. One area that springs to mind is how the data itself is collected, i.e. its provenance. E.g. is it collected automatically by sensors, or as a side-effect of user activity, or entered by hand by a human curator? Are those curators trained or are they self-selected contributors? Is the data derived from some form of statistical analysis?

I had toyed with provenance as a distinct facet, but I think it’s an orthogonal concern.

Layering & Big Data

A lot has happened in the last two years and I winced a bit at all of the Web 2.0 references in the paper. Remember that? If I were writing this now then the obvious trend to discuss as context to this approach is Big Data.

Chatting with Matt Biddulph recently he characterised a typical Big Data analysis as being based on “Activity Data” and “Reference Data”. Matt described reference data as being the core facts and information on top of which the activity data — e.g. from users of an application — is added. The analysis then draws on the combination to create some new insight, i.e. more data.

I referenced Matt’s characterisation in my Strata talk (with acknowledgement!). Currently Linked Data does really well in the Reference category, but there’s not a great deal of Activity data. So while it’s potentially useful in a Big Data world, there’s a lot of value still not being captured.

I think Matt’s view of the world chimes well with both the layered data concept and the data classifications that I’ve proposed. Most of the facets in the paper really define different types of Reference data. The outcome of a typical Big Data analysis is usually a new facet, an obvious one being “Comparative” data, e.g. identifying the most popular, most connected, most referenced resources in a network.

However there’s clearly a difference in approach between typical Big Data processing and the graph models that I think underpin a layered view of the world.

MapReduce workflows seem to work best with more regular data; however, newer approaches like Pregel illustrate the potential for “graph-native” Big Data analysis. But setting that aside, there’s no real contention, as a layering approach to combining data doesn’t say anything about how the data must actually be used: it can easily be projected out into structures that are amenable to indexing and processing in different ways.

Kasabi

Looking at the last section of the paper it should be obvious that much of the origin of this analysis was early preparation for Kasabi.

I still think that there’s a great deal of potential to create a marketplace around data layers and tools for interacting with them. But we’re not there yet, for several reasons. Firstly, it’s taken time to get the underlying platform in place to support that. We’ve done that now and you can expect more information on it from more official sources shortly. Secondly, I underestimated how much effort is still required to move the market forward: there’s still lots to be done to support organisations in opening up data before we can really explore more horizontal marketplaces. But that is a topic for another post.

This has been quite a ramble of a blog post but hopefully there are some useful thoughts here that chime with your own experience. Let me know what you think.

Beyond the Triple Count

This post was originally published on the Kasabi product blog.

On Monday I gave a talk at the SemTechBiz conference: “The RDF Report Card: Beyond the Triple Count“. I’ve published the slides on Slideshare which I’ve embedded below, but I thought I’d also post some of my notes here.

I’ve felt for a while now that the Linked Data community has an unhealthy fascination with triple counts, i.e. with the size of individual datasets.

This was quite natural in the boot-strapping phase of Linked Data in which we were primarily focused on communicating how much data was being gathered. But we’re now beyond that phase and need to start considering a more nuanced discussion around published data.

If you’re a triple store vendor then you definitely want to talk about the volume of data your store can hold. After all, potential users or customers are going to be very interested in how much data could be indexed in your product. Even so, no-one seriously takes a headline figure at face value. As users we’re much more interested in a variety of other factors. For example how long does it take to load my data? Or, how well does a store perform with my usage profile, taking into account my hardware investment? Etc. This is why we have benchmarks, so we can take into account additional factors and more easily compare stores across different environments.

But there’s not nearly enough attention paid to other factors when evaluating a dataset. A triple count alone tells us nothing. They’re not even a good indicator of the number of useful “facts” in a dataset.

During my talk I illustrated this point by showing how, in DBpedia, there are often several redundant ways of capturing the same information. These inflate the size of the dataset without adding useful extra information. By my estimate there are over 4.6m redundant triples for capturing location information alone. In my view, having multiple copies or variations of the same data point reduces the utility of a dataset, because it adds confusion over which values are reliable.
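
As a rough sketch of the kind of redundancy I mean (the property URIs here illustrate the overlapping vocabularies in use, rather than being an exhaustive list), it’s easy to find resources whose coordinates appear several times over:

PREFIX geo:    <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX georss: <http://www.georss.org/georss/>

# resources whose location is given as separate lat/long values
# and again as a combined georss:point literal
SELECT (COUNT(DISTINCT ?place) AS ?duplicated)
WHERE {
  ?place geo:lat ?lat ;
         geo:long ?long ;
         georss:point ?point .
}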

There can be good reasons for including the same information in slightly different ways, e.g. to support consuming applications that rely on slightly different properties, or which cannot infer additional data. Vocabularies also evolve and become more popular and this too can lead to variants if a publisher is keen to adapt to changing best practices.

But I think too often the default position is to simply use every applicable property to publish some data. From a publishing perspective it’s easier: you don’t have to make a decision about which approach might be best. And because of the general fixation on dataset size, there’s an incentive to just publish more data.

I think it’s better for data publishers to make more considered curation decisions, and instead just use one preferred way to publish each piece of information. It’s much easier for clients to use normalized data.

I also challenged the view that we need huge amounts of data to build useful applications. In some scenarios more data is always better; that’s especially true if you’re doing some kind of statistical analysis. Semantic web technology potentially allows us to draw on data from hundreds of different sources by reducing integration costs. But that doesn’t mean we have to, or need to, in order to drive useful applications. For many cases we need much more modest collections of data.

I used BBC Programmes as an example here. It’s a great example of publishing high quality Linked Data, especially because the BBC were amongst the first (if not the first) primary publishers of data on the Linked Data cloud. BBC Programmes is a very popular site with over 2.5 million unique users a week, triggering over 60 requests/second on their back-end. Now, while the data isn’t managed in a triple store, if you crawl it you’ll discover that there are only about 50 million triples in the whole dataset. So you clearly don’t need billions of triples to drive useful applications.

It’s really easy to generate large amounts of data. Curating a good quality dataset is harder. Much harder.

I think it’s time to move beyond boasting about triple counts and instead provide ways for people to assess dataset quality and utility. There are lots of useful factors to take into account when deciding whether a dataset is fit for purpose. In other words, how can we help users understand whether a dataset can help them solve a particular problem, implement a particular feature, or build an application?

Typically the only information we get about a dataset is some brief notes on its size, a few example resources, perhaps a pointer to a SPARQL endpoint and maybe an RDFS or OWL schema. This is not enough. I’d consider myself to be an experienced semantic web developer and this isn’t nearly enough to get started. I always find myself doing a lot of exploration around and within a dataset before deciding whether it’s useful.

In the talk I presented a simple conceptual model, an “information spectrum”, that tries to tease out the different aspects of a dataset that are useful to communicate to potential users. Some of that information is more oriented towards “business” decisions: is the dataset from a reliable source, is it correctly licensed, and so on. Other aspects are more technical: how has the dataset been constructed, or modelled?

I identified several broad classes of information on that spectrum:

Metadata. This is the kind of information that people are busily pouring into various data catalogs, primarily from government sources. Dataset metadata, including its title, a description, publication dates, license, etc all help solve the discovery problem, i.e. identifying what datasets are available and whether you might be able to use them.

While the situation is improving, it’s still too hard to find out when a particular source was updated, who maintains or publishes the data, and (of biggest concern) how the data is licensed.

Scope. Scoping information for a dataset tells us what it contains. E.g. is it about people, places, or creative works? How many of each type of thing does a dataset contain? If the dataset contains points of interest, then what is the geographic coverage? If the dataset contains events, then over what time period(s)?

Then we get to the Structure of a dataset. I don’t mean a list of the specific vocabularies that are used, but more how those vocabularies have been meshed together to describe a particular type of entity. E.g. how is a person described in this dataset? Do all people have a common set of properties?
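
Even that structural question can be answered mechanically. As a sketch (assuming FOAF is the vocabulary in play), listing how people are actually described in a dataset is a very short query:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# which properties are used to describe people in this dataset?
SELECT DISTINCT ?property
WHERE {
  ?person a foaf:Person ;
          ?property ?value .
}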

At the lowest level we then have the dataset Internals. This includes things like lists of RDF terms and their frequencies, use of named graphs, pointers to source files, etc. Triple counts may be useful at this point, but only to identify whether you could reasonably mirror a dataset locally. Knowledge of the underlying infrastructure, etc. might also be of use to developers.

Taken together I see presenting this information to users as being one of progressive disclosure: providing the right detail, to the right audience, at the right time. Currently we don’t routinely provide nearly enough information at any point on the spectrum. The irony here is that when we’re using RDF, the data is so well-structured that much of that detail could be automatically generated. Data publishing platforms need to do more to make this information more readily accessible, as well as providing data publishers with the tools to manage it.
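
As a simple example of what could be generated automatically, the scoping question of “how many of each type of thing does a dataset contain?” reduces to a short aggregate query:

# instance counts per class: a crude but useful summary of dataset scope
SELECT ?type (COUNT(DISTINCT ?resource) AS ?instances)
WHERE { ?resource a ?type }
GROUP BY ?type
ORDER BY DESC(?instances)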

We’ve been applying this conceptual model as we build out the features of Kasabi. Currently we’re ticking all of these boxes. It’s clear from every dataset homepage where some data has come from, how it is licensed and when it was updated. A user can easily then drill down into a dataset to get more information on its scope and internal structure. There’s lots more that we’re planning to do at all stages.

To round out the talk I previewed a feature that we’ll be releasing shortly called the “Report Card”. This is intended to provide an at-a-glance overview of what types of entity a dataset contains. There are examples included in the slides, but we’re still playing with the visuals. The idea is to quickly allow a user to determine the scope of a dataset, and whether it contains information useful to them. In the BBC Music example you can quickly see that it contains data on Creative Works (reviews, albums), Organizations (bands) and People (artists), but it doesn’t contain any location information. You’re going to need to draw on a related linked dataset if you want to build a location-based music app using BBC Music.

As well as summarizing a dataset, the report card will also be used to drive better discovery tools. This will allow users to quickly find datasets that include the same kinds of information, or relevant complementary data.

Ultimately my talk was arguing that I think it’s time to start focusing more on data curation. We need to give users a clearer view of the quality and utility of the data we’re publishing, and also think more carefully about the data we’re publishing.

This isn’t a unique semantic web problem. The same issues are rearing their heads with other approaches. Where I think we are well placed is in the ability to apply semantic web technology to help analyze and present data in a more useful and accessible way.

Giving RDF Datasets more Affordance

This post was originally published on the Kasabi product blog.

The following is a version of the talk on Creating APIs over RDF I gave at SemTech 2011. I’ve pruned some of the technical details in favour of linking out to other sources and concentrated here on the core message I was trying to get across. Comments welcome!

The Trouble with SPARQL

I’m a big fan of SPARQL, I constantly use it in my own development tasks, have built a number of production systems in which the query language is a core component, and wrote (I think!) one of the first SPARQL tutorials back in 2005 when it was still in Last Call. I’ve also worked with a number of engineering teams and developer communities over the last few years, introducing them to RDF and SPARQL.

My experience so far is two-fold: given some training and guidance SPARQL isn’t hard for any developer to learn. It’s just another query language and syntax. There are often some existing mental models that need to be overcome, but that’s always the case with any new technology. So at small scales SPARQL is easy to adopt, and a very useful tool when you’re working with graph shaped data.

But I’ve found, repeatedly, that when SPARQL is presented to a larger community, then the reaction and experience is very different. Developers quickly reject it as the learning curves are too great and, instead of seeing it as an enabler, they often see it as a barrier that’s placed between them and the data.

It’s easy to dismiss this kind of feedback and criticism by exhorting developers to just try harder or read more documentation. Surely any good developer is keen to learn a new technology? This overlooks the need of many people to just get stuff done quickly. Time and commercial pressures are a reality.

It’s also easy to dismiss this reaction as being down to the quality of the tools and documentation. Now, undoubtedly, there’s still much more that can be done there. And I was pleased to hear about Bob DuCharme’s forthcoming book on SPARQL.

But I think there are some technical reasons why, when we move from small groups to distributed adoption by a wider community, SPARQL causes frustrations. And I think this boils down to its affordance.

Affordance

Consider the interface that most developers are presented with when given access to a SPARQL endpoint. It’s an empty text field and a button. If you’re really lucky, there may be an example query filled in.

In order to do anything you have to not only know how to write a valid SPARQL query, but you also really need to know how the underlying dataset is structured. Two immediate hurdles to get over. Yes, there are queries you can write to return arbitrary triples, or list classes and properties, but that’s still not something a new user would necessarily know. And you typically need a lot of exploration before you can start to understand how to best query the data.
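
None of those exploratory queries are complicated — the problem is that they’re folklore. The usual starting point for sizing up an unfamiliar endpoint is something like:

# which predicates does this dataset use, and how often?
SELECT ?predicate (COUNT(*) AS ?uses)
WHERE { ?s ?predicate ?o }
GROUP BY ?predicate
ORDER BY DESC(?uses)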

Trial and error experiments aren’t easy either: it’s not always obvious how a query can be tweaked to customize the results. And when we share SPARQL queries, it’s typically by passing around direct links to an endpoint. Have fun unpicking the query from the URL and reformatting it so you can understand how it works and how it can be tweaked!

Better tools can definitely help in both of these cases. In Kasabi we’ve added a feature that allows anyone to share useful queries for a SPARQL endpoint along with a description of how it works. It’s a simple click to drop the query into the API explorer to run it, or tweak it.

But in my opinion it’s about more than just the tooling. Affordance flows not just from the tools, but also from the syntax and the data. If you point someone at a SPARQL endpoint, it’s not immediately useful, not without a lot of additional background. These are issues that can hamper widespread adoption of a technology, but which don’t often arise with smaller groups that have direct access to mentors.

Contrast this situation with typical web APIs which have, in my opinion, much more affordance. If I give someone a link to an API call then it’s more immediately useful. I think working with good, RESTful APIs is like pulling a thread: the URLs unravel into useful data. And that data contains more links that I can just follow to find more data.

Trial and error experiments are also much easier. If I want to tweak an API call then I can do that by simply editing the URL. As a developer this is syntax that I already know. URL templates can also give me hints of how to construct a useful request.

Importantly, my understanding of the structure of the dataset can grow as I work with it. My understanding grows through use, rather than before I start using. And that’s a great way to learn. There are no real barriers to progression. I need to know much less in order to start feeling empowered.

So, I’ve come to the conclusion that SPARQL is really for power users. It’s of most use to developers who are willing to take the time and trouble to learn the syntax and the underlying data model in order to get its benefits. This is not a critique of the technology itself, but a reflection on how the technology is (or isn’t) being adopted and the challenges people are facing.

The obvious question that springs to mind is: how can we give RDF data more affordance?

Linked Data

Linked Data is all about giving affordance to data. Linking, and “follow your nose” access to data is a core part of the approach. By binding data to the web, making it accessible via a single click, we make it incredibly more useful.

Surely then Linked Data solves all of our problems: “your website is your API”, after all. I think there’s a lot of truth to that, and rich Linked Data does remove much of the need for a separate API.

But I don’t think it addresses all of the requirements, or at least: the current approaches and patterns for publishing Linked Data don’t address all of the requirements. Right now the main guidance is to focus on your domain modelling and the key entities and relationships that it contains. That’s good advice and a useful starting point.

But when you’re developing an application against a dataset, there are many more useful ways to partition the data, e.g.: by date, location, name, etc.
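
For example (with purely illustrative property URIs), “articles published in 2011” is a perfectly natural slice of a dataset:

PREFIX dct: <http://purl.org/dc/terms/>

# an ad hoc partition of the data: everything issued during 2011
SELECT ?article ?title
WHERE {
  ?article dct:title ?title ;
           dct:issued ?date .
  FILTER(YEAR(?date) = 2011)
}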

It’s entirely possible to materialize many of these partitions directly in the dataset — as yet more resources and links — but this quickly becomes unfeasible: there are too many useful data partitions to realistically do this for any reasonably large or complex dataset. This is exactly the gap that query languages, and specifically SPARQL, are designed to fill. But if we concede that SPARQL may be too complex for many cases, what other options can we explore?

SPARQL Stored Procedures and the Linked Data API

One option is to just build custom APIs. But this can be expensive to maintain, and can detract from the overall message about the core usefulness of publishing Linked Data. So, are there ways to surface useful views in a declarative way, ones that both take advantage of and embrace the utility of the underlying “web native” graph model?

Currently there are two approaches that we’ve explored. The first is what we’re calling SPARQL Stored Procedures in Kasabi. This allows developers to:

  • Bind a SPARQL query to a URL, causing that query to be automatically executed when a GET request is made to the URL
  • Indicate that specific URL parameters be injected into the SPARQL query before it is executed, allowing the queries to be parameterized on a per-request basis
  • Generate custom output formats (e.g. XML or JSON) from a query using XSLT stylesheets that can be applied to the query results based on the requested mimetype
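
As a sketch of the idea (the URL and the parameter-injection convention here are purely illustrative, not Kasabi’s actual syntax), an “articles by author” stored procedure is just a query with a hole in it:

PREFIX dct: <http://purl.org/dc/terms/>

# ?author is injected from a URL parameter before execution, e.g.
# GET /apis/example/articles-by-author?author=http://example.org/person/1
SELECT ?article ?title
WHERE {
  ?article dct:creator ?author ;
           dct:title ?title .
}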

Kasabi provides tools for creating this type of API, including the ability to create one based on a SPARQL query shared by another user. This greatly lowers the barrier to entry for sharing useful ways to work with data.

The ability to access the results of the query directly, e.g. as SPARQL XML results or various RDF serializations, means the underlying graph is still accessible. You can just treat the feature as a convenience layer that hides some of the complexity. But by providing custom output formats we can also help developers use the data with existing skills and tools.

The second approach has grown out of work on data.gov.uk. Jeni Tennison (TSO), Dave Reynolds (Epimorphics) and I explored various options for creating APIs over Linked Data, resulting in the publication of the Linked Data API which is in use at data.gov.uk, e.g. to support the excellent organogram visualizations.

As with SPARQL Stored Procedures, the Linked Data API provides a declarative way to create a RESTful API over RDF data sources. However rather than writing SPARQL queries directly, an API developer creates a configuration file that describes how various views of the data should be bound to web requests.

The Linked Data API is much more powerful (at the cost of some complexity), providing many more options for filtering and sorting through data, as well as simple XML and JSON result formats out of the box. In my opinion, the specification does a good job at weaving API interactions together with the underlying Linked Data, creating a very rich way to interact with a dataset. And one that has a lot more affordance than the equivalent SPARQL queries.

Again, Kasabi provides support for hosting this type of API. Right now the tooling is admittedly quite basic, but we’re exploring ways to make it more interactive. We’ve incorporated the “view source” principle into the custom API hosting feature as a whole, so it’s possible to view the configuration of an API to see how it was constructed.

I think both of these approaches can usefully provide ways for a wider developer community to get to grips with RDF and Linked Data, removing some of the hurdles to adoption. The tooling we’ve created in Kasabi is designed to allow skilled members of the community to directly drive this adoption by sharing queries and creating different kinds of APIs.

By separating the publication of datasets, from the creation of APIs — useful access paths into the dataset — we hope to let communities find and share useful ways to work with the available data, whatever their skills or preferred technologies.

SemTech Thoughts

This post was originally published on the Kasabi product blog.

Attending SemTech 2011 last week I was struck by a shift in emphasis from “What If?” to “Here’s How”. I think there were more people sharing their experiences, technical and business approaches, and general war stories than in previous years. I think this reflects both the extent to which semantic technologies are, slowly, percolating into the mainstream, and the number of organizations that have jumped in to explore what benefits the technology might bring.

Attendance numbers at SemTech remain high, with around 1500 people visiting the conference this year. SemTech has one of the most punishing schedules of any conference I’ve attended, with 9 parallel tracks on some days! This year I changed my own strategy to spend a little more time in the “hallway track”, which gave me plenty of time to catch up with a number of people.

I did catch a number of talks, and while I won’t attempt to review them all here, I will mention a few stand-out sessions. John O’Donovan’s talk on the experiences of the BBC with semantic web technology was the best keynote. I’ve previously seen other speakers from the BBC talk about the domain modelling approach that is yielding great results for them when building websites, but John was able to put some business and architectural context around that, which I found interesting. I saw echoes of that during the rest of the conference, with the three-part architecture — triple store, CMS, search engine — appearing in a number of talks, e.g. from O’Reilly and Entagen. Not surprising, as it allows each component to do what it does best, and it’s an approach I’ve personally used in the past.

The utility of separate search indexes to complement structured queries using SPARQL is something we’re supporting in Kasabi by having both of these options as part of our standard set of APIs.

I also sat in on Lin Clark’s tutorial on using the new semweb features of Drupal 7. We’re using Drupal in Kasabi currently, but haven’t started using these features as yet. Lin gave a great run down of the current Drupal support for publishing and consuming RDF and Linked Data, and I was impressed with the general capabilities.

My main reason for attending SemTech was to give two talks about Kasabi. My first talk was on some of the work we’ve been doing around building APIs over RDF and Linked Data. Our goal is to make data as useful as possible, in as many different contexts and to as many different developers as possible. You can find the slides for these on Slideshare and I’ve embedded them below:

My second talk was a product demo of Kasabi. We launched Kasabi into public beta a few days before SemTech began and I was very pleased to have hit that milestone, allowing me to give a live demo of the product during the talk. I gave a walk through of the site, showing what we’re doing to make datasets more accessible, the ease of publishing both dataset and APIs, and how to quickly import data from the web using a simple browser plugin.

Again, the slides are up on slideshare, and embedded below, but I’m working on some screencasts that should capture the demonstration which was the bulk of the talk.

We had some fantastic reactions to the demo, and lots of interest in the product in general during the event. I was pleased to see Kasabi getting a mention in four other talks during the week. It’s exciting to be able to show more people what we’re building.

I’m looking forward to the new SemTech events later this year in both London and Washington. However Kasabi isn’t just for semantic web developers and so we’ll also be casting a wider net to reach out to developers from a number of different communities.

Attending Strataconf earlier this year confirmed for me that it will quickly become another key event for those of us interested in data. There seems to be a great community forming around the conference already. I did come away from the January conference wishing there had been more discussion of publishing data to the web, rather than simply using data from the web, but I think the emphasis was right for that first event. I’ll be interested to see how Edd Dumbill is planning to add a little more semantic web flavour to the agenda of later events.