
Giving RDF Datasets more Affordance

This post was originally published on the Kasabi product blog.

The following is a version of the talk on Creating APIs over RDF I gave at SemTech 2011. I’ve pruned some of the technical details in favour of linking out to other sources and concentrated here on the core message I was trying to get across. Comments welcome!

The Trouble with SPARQL

I’m a big fan of SPARQL, I constantly use it in my own development tasks, have built a number of production systems in which the query language is a core component, and wrote (I think!) one of the first SPARQL tutorials back in 2005 when it was still in Last Call. I’ve also worked with a number of engineering teams and developer communities over the last few years, introducing them to RDF and SPARQL.

My experience so far is two-fold: given some training and guidance, SPARQL isn’t hard for any developer to learn. It’s just another query language and syntax. There are often some existing mental models that need to be overcome, but that’s always the case with any new technology. So at small scales SPARQL is easy to adopt, and a very useful tool when you’re working with graph-shaped data.

But I’ve found, repeatedly, that when SPARQL is presented to a larger community, the reaction and experience is very different. Developers quickly reject it because the learning curve is too steep and, instead of seeing it as an enabler, they often see it as a barrier placed between them and the data.

It’s easy to dismiss this kind of feedback and criticism by exhorting developers to just try harder or read more documentation. Surely any good developer is keen to learn a new technology? This overlooks the need of many people to just get stuff done quickly. Time and commercial pressures are a reality.

It’s also easy to dismiss this reaction as being down to the quality of the tools and documentation. Now, undoubtedly, there’s still much more that can be done there. And I was pleased to hear about Bob DuCharme’s forthcoming book on SPARQL.

But I think there are some technical reasons why, when we move from small groups to distributed adoption by a wider community, SPARQL causes frustrations. And I think this boils down to its affordance.

Affordance

Consider the interface that most developers are presented with when given access to a SPARQL endpoint. It’s an empty text field and a button. If you’re really lucky, there may be an example query filled in.

In order to do anything you have to not only know how to write a valid SPARQL query, but you also really need to know how the underlying dataset is structured. Two immediate hurdles to get over. Yes, there are queries you can write to return arbitrary triples, or list classes and properties, but that’s still not something a new user would necessarily know. And you typically need a lot of exploration before you can start to understand how to best query the data.
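
For reference, the sort of exploratory queries I have in mind look something like this (standard SPARQL, run one at a time against any endpoint):

    # Peek at some arbitrary triples to get a feel for the data
    SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 25

    # List the classes used in the dataset
    SELECT DISTINCT ?class WHERE { ?s a ?class } LIMIT 100

    # List the properties used in the dataset
    SELECT DISTINCT ?property WHERE { ?s ?property ?o } LIMIT 100

Useful, but hardly obvious if you’ve never seen the query language before.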

Trial and error experiments aren’t easy either: it’s not always obvious how a query can be tweaked to customize the results. And when we share SPARQL queries, it’s typically by passing around direct links to an endpoint. Have fun unpicking the query from the URL and reformatting it so you can understand how it works and how it can be tweaked!
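
To make that concrete, here’s what sharing a query by link tends to look like in practice: the same trivial query as the link you get sent, and as the text you actually need to read and tweak. The endpoint URL here is invented purely for illustration.

    # The link that gets passed around, with the query URL-encoded into a parameter:
    # http://api.example.org/sparql?query=SELECT%20%3Fname%20WHERE%20%7B%20%3Fs%20%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2Fname%3E%20%3Fname%20%7D%20LIMIT%2010
    #
    # The query the recipient actually needs to see:
    SELECT ?name
    WHERE { ?s <http://xmlns.com/foaf/0.1/name> ?name }
    LIMIT 10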

Better tools can definitely help in both of these cases. In Kasabi we’ve added a feature that allows anyone to share useful queries for a SPARQL endpoint along with a description of how it works. It’s a simple click to drop the query into the API explorer to run it, or tweak it.

But in my opinion it’s about more than just the tooling. Affordance flows not just from the tools, but also the syntax and the data. If you point someone at a SPARQL endpoint, it’s not immediately useful, not without a lot of additional background. These are issues that can hamper widespread adoption of a technology but which don’t often arise with smaller groups with direct access to mentors.

Contrast this situation with typical web APIs which have, in my opinion, much more affordance. If I give someone a link to an API call then it’s more immediately useful. I think working with good, RESTful APIs is like pulling a thread: the URLs unravel into useful data. And that data contains more links that I can just follow to find more data.

Trial and error experiments are also much easier. If I want to tweak an API call then I can do that by simply editing the URL. As a developer this is syntax that I already know. URL templates can also give me hints of how to construct a useful request.

Importantly, my understanding of the structure of the dataset can grow as I work with it. My understanding grows through use, rather than before I start using. And that’s a great way to learn. There are no real barriers to progression. I need to know much less in order to start feeling empowered.

So, I’ve come to the conclusion that SPARQL is really for power users. It’s of most use to developers that are willing to take the time and trouble to learn the syntax and the underlying data model in order to get its benefits. This is not a critique of the technology itself, but a reflection on how technology is (or isn’t) being adopted and the challenges people are facing.

The obvious question that springs to mind is: how can we give RDF data more affordance?

Linked Data

Linked Data is all about giving affordance to data. Linking, and “follow your nose” access to data is a core part of the approach. By binding data to the web, making it accessible via a single click, we make it incredibly more useful.

Surely then Linked Data solves all of our problems: “your website is your API”, after all. I think there’s a lot of truth to that, and rich Linked Data does remove much of the need for a separate API.

But I don’t think it addresses all of the requirements, or at least: the current approaches and patterns for publishing Linked Data don’t address all of the requirements. Right now the main guidance is to focus on your domain modelling and the key entities and relationships that it contains. That’s good advice and a useful starting point.

But when you’re developing an application against a dataset, there are many more useful ways to partition the data, e.g.: by date, location, name, etc.

It’s entirely possible to materialize many of these partitions directly in the dataset — as yet more resources and links — but this quickly becomes unfeasible: there are too many useful data partitions to realistically do this for any reasonably large or complex dataset. This is exactly the gap that query languages, and specifically SPARQL, are designed to fill. But if we concede that SPARQL may be too complex for many cases, what other options can we explore?
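
As a sketch of the kind of partition I mean, here’s a query for “everything that happened in a given place during June 2011”, against an imaginary events dataset; the vocabulary is invented for the example, but the shape is typical:

    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
    PREFIX ex:  <http://example.org/vocab/>

    # Events located in Bath during June 2011, newest first.
    # This partition exists only as a query; it isn't a linkable
    # resource anywhere in the published dataset.
    SELECT ?event ?date
    WHERE {
      ?event a ex:Event ;
             ex:location "Bath" ;
             ex:date ?date .
      FILTER (?date >= "2011-06-01T00:00:00Z"^^xsd:dateTime &&
              ?date <  "2011-07-01T00:00:00Z"^^xsd:dateTime)
    }
    ORDER BY DESC(?date)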

SPARQL Stored Procedures and the Linked Data API

One option is to just build custom APIs. But this can be expensive to maintain, and can detract from the overall message of the core usefulness of publishing Linked Data. So, are there ways to surface useful views in a declarative way, ones that both take advantage of, and embrace, the utility of the underlying “web native” graph model?

Currently there are two approaches that we’ve explored. The first is what we’re calling SPARQL Stored Procedures in Kasabi. This allows developers to:

  • Bind a SPARQL query to a URL, causing that query to be automatically executed when a GET request is made to the URL
  • Indicate that specific URL parameters should be injected into the SPARQL query before it is executed, allowing queries to be parameterized on a per-request basis (there’s a sketch of this after the list)
  • Generate custom output formats (e.g. XML or JSON) from a query using XSLT stylesheets, applied to the query results based on the requested MIME type
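
As an illustration of the first two points, a stored procedure might bind a parameterized query along these lines. The URL and the placeholder syntax are invented for the example, not Kasabi’s actual notation:

    # Bound to, say: GET http://api.example.org/data/people/find-by-name?name=Alice
    # The value of the "name" URL parameter is substituted for the {name}
    # placeholder before the query is run against the dataset.
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?person ?homepage
    WHERE {
      ?person foaf:name "{name}" .
      OPTIONAL { ?person foaf:homepage ?homepage }
    }
    LIMIT 50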

Kasabi provides tools for creating this type of API, including the ability to create one based on a SPARQL query shared by another user. This greatly lowers the barrier to entry for sharing useful ways to work with data.

The ability to access the results of the query directly, e.g. as SPARQL XML results or various RDF serializations, means the underlying graph is still accessible. You can just treat the feature as a convenience layer that hides some of the complexity. But by providing custom output formats we can also help developers work with the data using their existing skills and tools.

The second approach has grown out of work on data.gov.uk. Jeni Tennison (TSO), Dave Reynolds (Epimorphics) and I explored various options for creating APIs over Linked Data, resulting in the publication of the Linked Data API, which is in use at data.gov.uk, e.g. to support the excellent organogram visualizations.

As with SPARQL Stored Procedures, the Linked Data API provides a declarative way to create a RESTful API over RDF data sources. However, rather than writing SPARQL queries directly, an API developer creates a configuration file that describes how various views of the data should be bound to web requests.

The Linked Data API is much more powerful (at the cost of some complexity), providing many more options for filtering and sorting through data, as well as simple XML and JSON result formats out of the box. In my opinion, the specification does a good job at weaving API interactions together with the underlying Linked Data, creating a very rich way to interact with a dataset. And one that has a lot more affordance than the equivalent SPARQL queries.
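
To give a flavour of it: a list request such as /doc/schools?type=PrimarySchool&_sort=label&_page=2 (the deployment and vocabulary here are illustrative, not taken from a real configuration) gets expanded by the API layer into roughly this kind of selection, with the matching resources then rendered in whichever format was requested:

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX ex:   <http://example.org/education/>

    # Select the second page of primary schools, ordered by label;
    # each selected resource is then described for the response.
    SELECT DISTINCT ?item
    WHERE {
      ?item a ex:PrimarySchool ;
            rdfs:label ?label .
    }
    ORDER BY ?label
    LIMIT 10
    OFFSET 10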

Again, Kasabi provides support for hosting this type of API. Right now the tooling is admittedly quite basic, but we’re exploring ways to make it more interactive. We’ve incorporated the “view source” principle into the custom API hosting feature as a whole, so it’s possible to view the configuration of an API to see how it was constructed.

I think both of these approaches can usefully provide ways for a wider developer community to get to grips with RDF and Linked Data, removing some of the hurdles to adoption. The tooling we’ve created in Kasabi is designed to allow skilled members of the community to directly drive this adoption by sharing queries and creating different kinds of APIs.

By separating the publication of datasets from the creation of APIs — useful access paths into the dataset — we hope to let communities find and share useful ways to work with the available data, whatever their skills or preferred technologies.

SemTech Thoughts

This post was originally published on the Kasabi product blog.

Attending SemTech 2011 last week I was struck by a shift in emphasis from “What If?” to “Here’s How”. I think there were more people sharing their experiences, technical & business approaches, and general war stories than in previous years. I think this reflects both the extent to which semantic technologies are, slowly, percolating into the mainstream, and the number of organizations that have jumped in to explore what benefits the technology might bring.

Attendance numbers at SemTech remain high, with around 1500 people visiting the conference this year. SemTech has one of the most punishing schedules of any conference I’ve attended, with 9 parallel tracks on some days! This year I changed my own strategy to spend a little more time in the “hallway track”, which gave me plenty of time to catch up with a number of people.

I did catch a number of talks, and while I won’t attempt to review them all here, I will mention a few stand-out sessions. John O’Donovan’s talk on the experiences of the BBC with semantic web technology was the best keynote. I’ve previously seen other speakers from the BBC talk about the domain modelling approach that is yielding great results for them when building websites, but John was able to put some business and architectural context around that, which I found interesting. I saw echoes of that during the rest of the conference, with the three-part architecture — triple store; CMS; search engine — appearing in a number of talks, e.g. from O’Reilly and Entagen. Not surprising, as it allows each component to do what it does best, and it’s an approach I’ve personally used in the past.

The utility of separate search indexes to complement structured queries using SPARQL is something we’re supporting in Kasabi by having both of these options as part of our standard set of APIs.

I also sat in on Lin Clark’s tutorial on using the new semweb features of Drupal 7. We’re using Drupal in Kasabi currently, but haven’t started using these features as yet. Lin gave a great run down of the current Drupal support for publishing and consuming RDF and Linked Data, and I was impressed with the general capabilities.

My main reason for attending SemTech was to give two talks about Kasabi. My first talk was on some of the work we’ve been doing around building APIs over RDF and Linked Data. Our goal is to make data as useful as possible, in as many different contexts and to as many different developers as possible. You can find the slides for these on Slideshare and I’ve embedded them below:

My second talk was a product demo of Kasabi. We launched Kasabi into public beta a few days before SemTech began and I was very pleased to have hit that milestone, allowing me to give a live demo of the product during the talk. I gave a walkthrough of the site, showing what we’re doing to make datasets more accessible, the ease of publishing both datasets and APIs, and how to quickly import data from the web using a simple browser plugin.

Again, the slides are up on Slideshare, and embedded below, but I’m working on some screencasts that should capture the demonstration, which was the bulk of the talk.

We had some fantastic reactions to the demo, and lots of interest in the product in general during the event. I was pleased to see Kasabi getting a mention in four other talks during the week. It’s exciting to be able to show more people what we’re building.

I’m looking forward to the new SemTech events later this year in both London and Washington. However, Kasabi isn’t just for semantic web developers, and so we’ll also be casting a wider net to reach out to developers from a number of different communities.

Attending Strataconf earlier this year confirmed for me that it will quickly become another key event for those of us interested in data. There seems to be a great community forming around the conference already. I did come away from the January conference wishing there had been more discussion of publishing data to the web, rather than simply using data from the web, but I think the emphasis was right for that first event. I’ll be interested to see how Edd Dumbill is planning to add a little more semantic web flavour to the agenda of later events.
