RDF Data Access Options, or Isn’t HTTP already the API?

This is a follow-up to my blog post from yesterday about RDF and JSON. Ed Summers tweeted to say:

…your blog post suggests that an API for linked data is needed; isn’t http already the API?

I couldn’t answer that in 140 characters, so am writing this post to elaborate a little on the last section of my post in which I suggested that “there’s a big data access gulf between de-referencing URIs and performing SPARQL queries”. What exactly do I mean there? And why do I think that the Linked Data API helps?

Is Your Website Your API?

Most Linked Data presentations that discuss the publishing of data to the web typically run through the Linked Data principles. At point three we reach the recommend that:

“When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)

This has encourages us to create sites that consist of a mesh of interconnected resources described using RDF. We can “follow our nose” through those relationships to find more information.

This gives us two fundamental two data access options:

  • Resource Lookups: by dereferencing APIs we can obtain a (typically) complete description of a resource
  • Graph Traversal: following relationships and recursively de-referencing URIs to retrieve descriptions of related entities; this is (typically, not not necessarily) reconstituted into a graph on the client

However, if we take the “Your Website Is Your API” idea seriously, then we should be able to reflect all of the different points of interaction of that website as RDF, not just resource lookups (viewing a page) and graph traversal (clicking around).

As Tom Coates noted back in 2006 in “Native to a Web of Data“, good data-driven websites will have “list views and batch manipulation interfaces”. So we should be able to provide RDF views of those areas of functionality too. This gives us another kind of access option:

  • Listing: ability to retrieve lists/collections of things; navigation through those lists, e.g. by paging; and list manipulation, e.g. by filtering or sorting.

It’s possible to handle much of that by building some additional structure into your dataset, e.g. creating RDF Lists (or similar) of useful collections of resources. But if you bake this into your data then those views will potentially need to be re-evaluated every time the data changes. And even then there is still no way for a user to manipulate the views, e.g. to page or sort them.

So to achieve the most flexibility you need a more dynamic way of extracting and ordering portions of the underlying data. This is the role that SPARQL often fulfills, it provides some really useful ways to manipulate RDF graphs, and you can achieve far more with it than just extracting and manipulating lists of things.

SPARQL also supports another kind of access option that would otherwise require traversing some or all of the remote graph.

One example would be: “does this graph contain any foaf:name predicates?” or “does anything in this graph relate to http://www.example.org/bob?”. These kinds of existence checks, as well as more complex graph pattern matching, also tend to be the domain of SPARQL queries. It’s more expressive and potentially more efficient to just use a query language for that kind of question. So this gives us a fourth option:

  • Existence Checks: ability to determine whether a particular structure is present in a graph

Interestingly though they are not often the kinds of questions that you can “ask” of a website. There’s no real correlation with typical web browsing features although searching comes close for simple existence check queries.

Where the Linked Data API fits in

So there are at least four kinds of data access option. I doubt whether its exhaustive, but its a useful starting point for discussion.

SPARQL can handle all of these options and more. The graph pattern matching features, and provision of four query types lets us perform any of these kinds of interaction. For example A common way of implementing Resource Lookups over a triple store is to use a DESCRIBE or a CONSTRUCT query.

However the problem, as I see it, is that when we resort to writing SPARQL graph patterns in order to request, say, a list of people, then we’ve kind of stepped around HTTP. We’re no longer specifying and refining our query by interacting with web resources via parameterised URLs, we’re tunnelling the request for what we want in a SPARQL query sent to an endpoint.

From a hypermedia perspective it would be much better if there were a way to be able to handle the “Listing” access option using something that was better integrated with HTTP. It also happens that this might actually be easier for the majority of web developers to get to grips with, because they no longer have to learn SPARQL.

This is what I meant by a “RESTful API” in yesterday’s blog post. In my mind, “Listing things” sits in between Resource Lookups and Existence Checks or complex pattern matching in terms of access options.

It’s precisely this role that the Linked Data API is intended to fulfil. It defines a way to dynamically generate lists of resources from an underlying RDF graph, along with ways to manipulate those collections of resources, e.g. by sorting and filtering. It’s possible to use it to define a number of useful list views for an RDF dataset that nicely complements the relationships present in the data. It’s actually defined in terms of executing SPARQL queries over that graph, but this isn’t obvious to the end user.

These features are supplemented with the definition of simple XML and JSON formats, to supplement the RDF serializations that it supports. This is really intended to encourage adoption by making it easier to process the data using non RDF tools.

So, Isn’t HTTP the API?

Which brings me to the answer to Ed’s question: isn’t HTTP the API we need? The answer is yes, but we need more than just HTTP, we also need well defined media-types.

Mike Amundsen has created a nice categorisation of media types and a description of different types of factors they contain: H Factor.

Section of Fielding’s dissertation explains that:

Control data defines the purpose of a message between components, such as the action being requested or the meaning of a response. It is also used to parameterize requests and override the default behavior of some connecting elements.

As it stands today neither RDF nor the Linked Data API specification ticks all of the the HFactor boxes. What we’ve really done so far is define how to parameterise some requests, e.g. to filter or sort based on a property value, but we’ve not yet defined that in a standard media type; the API configuration captures a lot of the requisite information but isn’t quite there.

That’s a long rambly blog post for a Friday night! Hopefully I’ve clarified what I was referring to yesterday. I absolutely don’t want to see anyone define an API for RDF that steps around HTTP. We need something that is much more closely aligned with the web. And hopefully I’ve also answered Ed’s question.

RDF and JSON: A Clash of Model and Syntax

I had been meaning to write this post for some time. After reading Jeni Tennison’s post from earlier this week I had decided that I didn’t need too, but Jeni and Thomas Roessler suggested I publish my thoughts. So here they are. I’ve got more things to say about where efforts should be expended in meeting the challenges that face us over the next period of growth of the semantic web, but I’ll keep those for future posts.

Everyone agrees that a JSON serialization of RDF is a Good Thing. And I think nearly everyone would agree that a standard JSON serialization of RDF would be even better. The problem is no-one can agree on what constitutes a good JSON serialization of RDF. As the RDF Next Working Group is about to convene to try and define a standard JSON serialization now is a very good time to think about what it is we really want them to achieve.

RDF in JSON, is RDF in XML all over again

There are very few people who like RDF/XML. Personally, while it’s not my favourite RDF syntax, I’m glad its there for when I want to convert XML formats into RDF. I’ve even built an entire RDF workflow that began with the ingestion of RDF/XML documents; we even validated them against a schema!

There are several reasons why people dislike RDF/XML.

Firstly, there is a mis-match in the data models: serialization involves turning a graph into a tree. There are many different ways to achieve that so, without applying some external constraints, the output can be highly variable. The problem is that those constraints can be highly specific, so are difficult to generalize. This results in a high degree of syntax variability of RDF/XML in the wild, and that undermines the ability to use RDF/XML with standard XML tools like XPath, XSLT, etc. They (unsurprisingly) operate only on the surface XML syntax not the “real” data model.

Secondly, people dislike RDF/XML because of the mis-match in (loosely speaking) the native data types. XML is largely about elements and attributes whereas RDF has resources, properties, literals, blank nodes, lists, sequences, etc. And of course there are those ever present URIs. This leads to additional syntax short-cuts and hijacking of features like XML Namespaces to simplify the output, whilst simultaneously causing even more variability in the possible serializations.

Thirdly, when it comes to parsing, RDF/XML just isn’t a very efficient serialization. It’s typically more verbose and can involve much more of a memory overhead when parsing than some of the other syntaxes.

Because of these issues, we end up with a syntax which, while flexible, requires some profiling to be really useful within an XML toolchain. Or you just ignore the fact that its XML at all and throw it straight into a triple store, which is what I suspect most people do. If you do that then an XML serialization of RDF is just a convenient way to generate RDF data from an XML toolchain.

Unfortunately when we look at serializing RDF as JSON we discover that we have nearly all of the same issues. JSON is a tree; so we have the same variety of potential options for serializing any given graph. The data types are also still different: key-value pairs, hashes, lists, strings, dates (of a form!), etc. versus resource, properties, literals, etc. While there is potential to use more native datatypes, the practical issues of repeatable properties, blank nodes, etc mean that a 1:1 mapping isn’t feasible. Lack of support for anything like XML Namespaces means that hiding URIs is also impossible without additional syntax conventions.

So, ultimately, both XML and JSON are poor fits for handling RDF. I think most people would agree that a specific format like Turtle is much easier to work with. It’s also better as starting point for learning RDF because most of the syntax is re-used in SPARQL. That’s why standardising Turtle, ideally extended to support Named Graphs, needs to be the first item on the RDF Next Working Group’s agenda.

What do we actually want?

What purpose are we trying to achieve with a JSON serialization of RDF? I’d argue that there are several goals:

  1. Support for scripting languages: Provide better support for processing RDF in scripting languages
  2. Creating convergence: Build some convergence around the dizzying array of existing RDF in JSON proposals, to create consistency in how data is published
  3. Gaining traction: Make RDF more acceptable for web developers, with the hope of increasing engagement with RDF and Linked Data

I don’t think that anyone considers a JSON serialization of RDF as a better replacement for RDF/XML. I think everyone is looking to Turtle to provide that.

I also don’t think that anyone sees JSON as a particularly efficient serialization of RDF, particularly for bulk loading. It might be, but I think N-Triples (a subset of Turtle) fulfills that niche already: it’s easy to stream and to process in parallel.

Lets look at each of those goals in turn.

Support for scripting languages

Unarguably its much, much easier to process JSON in scripting languages like Javascript, Ruby, PHP than RDF/XML.

Parser support for JSON is ubiquitous as its the syntax de jour. Just as XML was when the RDF specifications were being written. Typically JSON parsing is much more efficient. That’s especially true when we look at Javascript in the browser.

From that perspective RDF in JSON is an instant win as it will simplify consumption of Linked Data and the results of SPARQL CONSTRUCT and DESCRIBE queries. There are other issues with getting wide-spread support for RDF across different programming languages, e.g. proper validation of URIs, but fast parsing of the basic data structure would be a step in the right direction.

Creating Convergence

I think I’ve seen about a dozen or more different RDF in JSON proposals. There’s a list on the ESW wiki and some comparison notes on the Talis Platform wiki, but I don’t think either are complete. If I get chance I’ll update them. The sheer variety confirms my earlier points about the mis-matches between models: everyone has their own conception of what constitutes a useful JSON serialization.

Because there are less syntax options in JSON, the proposals run the full spectrum from capturing the full RDF model but making poor use of JSON syntax, through to making good use of JSON syntax but at the cost of either ignoring aspects of the RDF model or layering additional syntax conventions on top of JSON itself. As an aside, I find it interesting that so many people are happy with subsetting RDF to achieve this one goal.

The thing we should recognise is that none of the existing RDF in JSON formats are really useful without an accompanying API. I’ve used a number of different formats and no matter what serialization I’ve used I’ve ended up with helper code that simplifies some or all of the following:

  • Lookup of all properties of a single resource
  • Mapping between URIs and short names (e.g. CURIES or locally defined keys) for properties
  • Mapping between conventions for encoding particular datatypes (or language annotations) and native objects in the scripting language
  • Cross-referencing between subjects and objects; and vice-versa
  • Looking up all values of a property or a single value (often the first)

In addition, if I’m consuming the results of multiple requests then I may also end up with a custom data structure and code for merging together different descriptions. Even if its just an array of parsed JSON documents and code to perform the above lookups across that collection.

So, while we can debate the relative aesthetics of different approaches, I think its focusing attention on the wrong areas. What we should really be looking at is an API for manipulating RDF. One that will work in Javascript, Ruby or PHP. While I acknowledge the lingering horror of the DOM, I think the design space here is much simpler. Maybe I’m just an optimist!

If we take this approach then what we need is an JSON serialization of RDF that covers as much of the RDF model as possible and, ideally, is already as well supported as possible. From what I’ve seen the RDF/JSON serialization is actually closest to that ideal. It’s supported in a number of different parsing and serialising libraries already and only needs to be extended to support blank nodes and Named Graphs, which is trivial to do. While its not the prettiest serialization, given a vote, I’d look at standardising that and moving on to focus on the more important area: the API.

Gaining Traction

Which brings me to the last use case. Can we create a JSON serialization of RDF that will help Linked Data and RDF get some traction in the wider web development community?

The answer is no.

If you believe that the issues with gaining adoption are purely related to syntax then you’re not listening to the web developer community closely enough. While a friendlier syntax may undoubtedly help, an API would be even better. The majority of web developers these days are very happy indeed to work with tools like JQuery to handle client-side scripting. A standard JQuery extension for RDF would help adoption much more than spending months debating the best way to profile the RDF model into a clean JSON serialization.

But the real issue is that we’re asking web developers to learn not just new syntax but also an entirely new way to access data: we’re asking them to use SPARQL rather than simple RESTful APIs.

While I think SPARQL is an important and powerful tool in the RDF toolchain I don’t think it should be seen as the standard way of querying RDF over the web. There’s a big data access gulf between de-referencing URIs and performing SPARQL queries. We need something to fill that space, and I think the Linked Data API fills that gap very nicely. We should be promoting a range of access options.

I have similar doubts about SPARQL Update as the standard way of updating triple stores over the web, but that’s the topic of another post.

Summing Up

As the RDF Next Working Group gets underway I think it needs to carefully prioritise its activities to ensure that we get the most out of this next phase of development and effort around the Semantic Web specifications. It’s particularly crucial right now as we’re beginning to see the ideas being adopted and embraced more widely. As I’ve tried to highlight here, I think there’s a lot of value to be had in having a standard JSON serialization of RDF. But I don’t think that there’s much merit in attempting to create a clean, simple JSON serialization that will meet everyone’s needs.

Standardising Turtle and an API for manipulating RDF data has more value in my view. RDF/JSON as a well implemented specification meets the core needs of the semantic web developer; a simple scripting API meets the needs of everyone else.