RDF and JSON: A Clash of Model and Syntax

I had been meaning to write this post for some time. After reading Jeni Tennison’s post from earlier this week I had decided that I didn’t need to, but Jeni and Thomas Roessler suggested I publish my thoughts. So here they are. I’ve got more things to say about where efforts should be expended in meeting the challenges that face us over the next period of growth of the semantic web, but I’ll keep those for future posts.

Everyone agrees that a JSON serialization of RDF is a Good Thing. And I think nearly everyone would agree that a standard JSON serialization of RDF would be even better. The problem is no-one can agree on what constitutes a good JSON serialization of RDF. As the RDF Next Working Group is about to convene to try to define a standard JSON serialization, now is a very good time to think about what it is we really want them to achieve.

RDF in JSON is RDF in XML all over again

There are very few people who like RDF/XML. Personally, while it’s not my favourite RDF syntax, I’m glad it’s there for when I want to convert XML formats into RDF. I’ve even built an entire RDF workflow that began with the ingestion of RDF/XML documents; we even validated them against a schema!

There are several reasons why people dislike RDF/XML.

Firstly, there is a mis-match in the data models: serialization involves turning a graph into a tree. There are many different ways to achieve that so, without applying some external constraints, the output can be highly variable. The problem is that those constraints can be highly specific, so are difficult to generalize. This results in a high degree of syntax variability of RDF/XML in the wild, and that undermines the ability to use RDF/XML with standard XML tools like XPath, XSLT, etc. They (unsurprisingly) operate only on the surface XML syntax not the “real” data model.
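To make the graph-to-tree problem concrete, here is a minimal sketch in Python. The resource names, the "id" key and the {"ref": ...} marker are conventions I have made up for illustration; they are not taken from any specification. The point is only that one small graph can be written as two different trees, so a tool that looks at the surface structure sees two different documents.

```python
# A tiny graph as a set of (subject, predicate, object) triples.
# All identifiers here are invented for illustration.
TRIPLES = {
    ("alice", "knows", "bob"),
    ("bob", "knows", "alice"),
    ("alice", "name", "Alice"),
    ("bob", "name", "Bob"),
}

# Two equally valid trees for that graph: one rooted at alice, one at bob.
# Each uses a {"ref": id} marker where the graph loops back on itself.
TREE_A = {"id": "alice", "name": "Alice",
          "knows": {"id": "bob", "name": "Bob", "knows": {"ref": "alice"}}}
TREE_B = {"id": "bob", "name": "Bob",
          "knows": {"id": "alice", "name": "Alice", "knows": {"ref": "bob"}}}


def flatten(node):
    """Recover the triples from one of the nested trees above."""
    triples = set()
    subject = node["id"]
    for key, value in node.items():
        if key == "id":
            continue
        if isinstance(value, dict):
            if "ref" in value:
                # A back-reference to a node described elsewhere in the tree.
                triples.add((subject, key, value["ref"]))
            else:
                # A nested description: link to it, then descend into it.
                triples.add((subject, key, value["id"]))
                triples |= flatten(value)
        else:
            triples.add((subject, key, value))
    return triples


# Both trees carry the same graph, yet they differ as trees.
same_graph = flatten(TREE_A) == flatten(TREE_B) == TRIPLES
different_syntax = TREE_A != TREE_B
```

A path expression written against the first tree would find nothing useful in the second, even though both say exactly the same thing.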

Secondly, people dislike RDF/XML because of the mis-match in (loosely speaking) the native data types. XML is largely about elements and attributes whereas RDF has resources, properties, literals, blank nodes, lists, sequences, etc. And of course there are those ever present URIs. This leads to additional syntax short-cuts and hijacking of features like XML Namespaces to simplify the output, whilst simultaneously causing even more variability in the possible serializations.

Thirdly, when it comes to parsing, RDF/XML just isn’t a very efficient serialization. It’s typically more verbose and can involve much more of a memory overhead when parsing than some of the other syntaxes.

Because of these issues, we end up with a syntax which, while flexible, requires some profiling to be really useful within an XML toolchain. Or you just ignore the fact that it’s XML at all and throw it straight into a triple store, which is what I suspect most people do. If you do that then an XML serialization of RDF is just a convenient way to generate RDF data from an XML toolchain.

Unfortunately, when we look at serializing RDF as JSON we discover that we have nearly all of the same issues. JSON is a tree, so we have the same variety of potential options for serializing any given graph. The data types are also still different: key-value pairs, hashes, lists, strings, dates (of a form!), etc. versus resources, properties, literals, etc. While there is potential to use more native datatypes, the practical issues of repeatable properties, blank nodes, etc. mean that a 1:1 mapping isn’t feasible. Lack of support for anything like XML Namespaces means that hiding URIs is also impossible without additional syntax conventions.
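The repeatable properties problem is easy to demonstrate. The following is a rough sketch in Python, using example data and function names of my own invention. A naive mapping of properties onto JSON object keys silently drops statements, and the obvious fix (always use an array) means even single-valued properties stop looking like native values.

```python
import json

FOAF = "http://xmlns.com/foaf/0.1/"
ALICE = "http://example.org/alice"  # example URI, invented for illustration

# One resource with a repeated property (two mailboxes).
triples = [
    (ALICE, FOAF + "name", "Alice"),
    (ALICE, FOAF + "mbox", "mailto:alice@example.org"),
    (ALICE, FOAF + "mbox", "mailto:a@example.com"),
]


def naive(triples):
    """Map each property straight onto a key: later values overwrite earlier ones."""
    doc = {}
    for s, p, o in triples:
        doc.setdefault(s, {})[p] = o
    return doc


def with_arrays(triples):
    """Keep every value by always using an array, even for single values."""
    doc = {}
    for s, p, o in triples:
        doc.setdefault(s, {}).setdefault(p, []).append(o)
    return doc


lossy = naive(triples)
lossless = with_arrays(triples)

# Both are perfectly good JSON; only one of them is the whole graph.
lossy_json = json.dumps(lossy)
lossless_json = json.dumps(lossless)
```

Here `lossy[ALICE][FOAF + "mbox"]` holds only the second mailbox, while `lossless[ALICE][FOAF + "name"]` is the list `["Alice"]` rather than a plain string. Neither is what a JSON developer would naturally write.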

So, ultimately, both XML and JSON are poor fits for handling RDF. I think most people would agree that a specific format like Turtle is much easier to work with. It’s also better as a starting point for learning RDF because most of the syntax is re-used in SPARQL. That’s why standardising Turtle, ideally extended to support Named Graphs, needs to be the first item on the RDF Next Working Group’s agenda.

What do we actually want?

What purpose are we trying to achieve with a JSON serialization of RDF? I’d argue that there are several goals:

  1. Support for scripting languages: Provide better support for processing RDF in scripting languages
  2. Creating convergence: Build some convergence around the dizzying array of existing RDF in JSON proposals, to create consistency in how data is published
  3. Gaining traction: Make RDF more acceptable for web developers, with the hope of increasing engagement with RDF and Linked Data

I don’t think that anyone considers a JSON serialization of RDF as a better replacement for RDF/XML. I think everyone is looking to Turtle to provide that.

I also don’t think that anyone sees JSON as a particularly efficient serialization of RDF, particularly for bulk loading. It might be, but I think N-Triples (a subset of Turtle) fulfills that niche already: it’s easy to stream and to process in parallel.
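As a rough illustration of why N-Triples streams so well, here is a simplified sketch in Python. It is not a conforming parser: it does no validation of URIs or escape sequences, and the function names are my own. It relies only on the fact that each line holds one complete statement ending in a full stop, so lines can be handled independently of one another.

```python
def parse_line(line):
    """Split one N-Triples line into (subject, predicate, object) strings.

    Simplified: subject and predicate contain no whitespace, so the object
    is whatever remains once the terminating '.' is removed.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # blank line or comment
    subject, predicate, rest = line.split(None, 2)
    rest = rest.rstrip()
    if not rest.endswith("."):
        raise ValueError("statement is not terminated by '.'")
    return subject, predicate, rest[:-1].rstrip()


def stream(lines):
    """Yield triples one at a time without holding the document in memory."""
    for line in lines:
        triple = parse_line(line)
        if triple is not None:
            yield triple


# Example data, invented for illustration.
sample = [
    "# a comment",
    "<http://example.org/alice> <http://xmlns.com/foaf/0.1/knows> <http://example.org/bob> .",
    '<http://example.org/alice> <http://xmlns.com/foaf/0.1/name> "Alice Smith"@en .',
]
parsed = list(stream(sample))
```

Because no line depends on any other, the same file could be split at line boundaries and handed to separate workers.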

Let’s look at each of those goals in turn.

Support for scripting languages

Unarguably, it’s much, much easier to process JSON than RDF/XML in scripting languages like Javascript, Ruby and PHP.

Parser support for JSON is ubiquitous as it’s the syntax du jour, just as XML was when the RDF specifications were being written. Typically JSON parsing is also much more efficient than parsing XML. That’s especially true when we look at Javascript in the browser.

From that perspective RDF in JSON is an instant win as it will simplify consumption of Linked Data and the results of SPARQL CONSTRUCT and DESCRIBE queries. There are other issues with getting wide-spread support for RDF across different programming languages, e.g. proper validation of URIs, but fast parsing of the basic data structure would be a step in the right direction.

Creating Convergence

I think I’ve seen about a dozen or more different RDF in JSON proposals. There’s a list on the ESW wiki and some comparison notes on the Talis Platform wiki, but I don’t think either is complete. If I get the chance I’ll update them. The sheer variety confirms my earlier points about the mis-matches between models: everyone has their own conception of what constitutes a useful JSON serialization.

Because there are fewer syntax options in JSON, the proposals run the full spectrum from capturing the full RDF model but making poor use of JSON syntax, through to making good use of JSON syntax but at the cost of either ignoring aspects of the RDF model or layering additional syntax conventions on top of JSON itself. As an aside, I find it interesting that so many people are happy with subsetting RDF to achieve this one goal.

The thing we should recognise is that none of the existing RDF in JSON formats are really useful without an accompanying API. I’ve used a number of different formats and no matter what serialization I’ve used I’ve ended up with helper code that simplifies some or all of the following:

  • Lookup of all properties of a single resource
  • Mapping between URIs and short names (e.g. CURIES or locally defined keys) for properties
  • Mapping between conventions for encoding particular datatypes (or language annotations) and native objects in the scripting language
  • Cross-referencing between subjects and objects; and vice-versa
  • Looking up all values of a property or a single value (often the first)

In addition, if I’m consuming the results of multiple requests then I may also end up with a custom data structure and code for merging together different descriptions, even if it’s just an array of parsed JSON documents and code to perform the above lookups across that collection.
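The kind of helper code I mean looks something like the sketch below. It is written in Python purely for illustration, and every name in it (`Graph`, `expand`, `to_native`, the prefix table) is hypothetical rather than part of any existing library. It assumes the parsed documents follow the RDF/JSON layout, where each subject maps to its properties and each property maps to a list of value objects with "type" and "value" keys.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"
PREFIXES = {"foaf": FOAF}  # hypothetical short-name table


def expand(name, prefixes=PREFIXES):
    """Turn a short name such as 'foaf:name' into a full URI."""
    prefix, sep, local = name.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    return name  # already a full URI, or an unknown prefix


def to_native(value):
    """Map a value object onto a native type (only integers handled here)."""
    if value.get("datatype") == XSD_INTEGER:
        return int(value["value"])
    return value["value"]


class Graph:
    """Merges parsed documents and offers the lookups listed above."""

    def __init__(self, *documents):
        self.data = {}
        for doc in documents:
            self.merge(doc)

    def merge(self, doc):
        for s, props in doc.items():
            for p, values in props.items():
                existing = self.data.setdefault(s, {}).setdefault(p, [])
                existing.extend(v for v in values if v not in existing)

    def properties(self, subject):
        return self.data.get(subject, {})

    def values(self, subject, predicate):
        found = self.properties(subject).get(expand(predicate), [])
        return [to_native(v) for v in found]

    def first(self, subject, predicate, default=None):
        found = self.values(subject, predicate)
        return found[0] if found else default

    def subjects(self, predicate, obj):
        """Cross-reference: which subjects have this object for a property?"""
        p = expand(predicate)
        return [s for s, props in self.data.items()
                if any(v["value"] == obj for v in props.get(p, []))]


# Example data, invented for illustration: two separately fetched documents.
ALICE, BOB = "http://example.org/alice", "http://example.org/bob"
doc1 = {ALICE: {FOAF + "name": [{"type": "literal", "value": "Alice"}],
                FOAF + "knows": [{"type": "uri", "value": BOB}]}}
doc2 = {ALICE: {FOAF + "age": [{"type": "literal", "value": "30",
                                "datatype": XSD_INTEGER}]},
        BOB: {FOAF + "name": [{"type": "literal", "value": "Bob"}]}}
g = Graph(doc1, doc2)
```

With that in place, `g.first(ALICE, "foaf:name")` and `g.subjects("foaf:knows", BOB)` read much like the lookups I end up writing by hand each time, whatever the underlying serialization.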

So, while we can debate the relative aesthetics of different approaches, I think that debate focuses attention on the wrong areas. What we should really be looking at is an API for manipulating RDF. One that will work in Javascript, Ruby or PHP. While I acknowledge the lingering horror of the DOM, I think the design space here is much simpler. Maybe I’m just an optimist!

If we take this approach then what we need is a JSON serialization of RDF that covers as much of the RDF model as possible and, ideally, is already as well supported as possible. From what I’ve seen the RDF/JSON serialization is actually closest to that ideal. It’s supported in a number of different parsing and serialising libraries already and only needs to be extended to support blank nodes and Named Graphs, which is trivial to do. While it’s not the prettiest serialization, given a vote, I’d look at standardising that and moving on to focus on the more important area: the API.
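For readers who haven’t seen it, the shape of RDF/JSON is roughly as follows. This is a minimal sketch in Python with example data and function names of my own; it covers only URI and literal values and is not a complete implementation of the format. Every object is a small record stating its type, its value and, optionally, a language or datatype, which is what makes the format faithful to the model but not especially pretty.

```python
import json

FOAF = "http://xmlns.com/foaf/0.1/"
ALICE = "http://example.org/alice"  # example URIs, invented for illustration
BOB = "http://example.org/bob"

# Triples whose objects are already RDF/JSON-style value records.
triples = [
    (ALICE, FOAF + "name", {"type": "literal", "value": "Alice", "lang": "en"}),
    (ALICE, FOAF + "knows", {"type": "uri", "value": BOB}),
    (BOB, FOAF + "name", {"type": "literal", "value": "Bob"}),
]


def to_rdf_json(triples):
    """Group triples as {subject: {predicate: [value records]}}."""
    doc = {}
    for s, p, o in triples:
        doc.setdefault(s, {}).setdefault(p, []).append(o)
    return doc


def from_rdf_json(doc):
    """Flatten the nested structure back into a list of triples."""
    return [(s, p, o)
            for s, props in doc.items()
            for p, values in props.items()
            for o in values]


def sort_key(triple):
    s, p, o = triple
    return (s, p, o["value"])


doc = to_rdf_json(triples)
text = json.dumps(doc, indent=2)           # what goes over the wire
round_tripped = from_rdf_json(json.loads(text))
lossless = sorted(round_tripped, key=sort_key) == sorted(triples, key=sort_key)
```

Nothing is lost on the way through JSON, but nobody would want to navigate that structure by hand, which is why I think the effort belongs in the API.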

Gaining Traction

Which brings me to the last use case. Can we create a JSON serialization of RDF that will help Linked Data and RDF get some traction in the wider web development community?

The answer is no.

If you believe that the issues with gaining adoption are purely related to syntax then you’re not listening to the web developer community closely enough. While a friendlier syntax may undoubtedly help, an API would be even better. The majority of web developers these days are very happy indeed to work with tools like jQuery to handle client-side scripting. A standard jQuery extension for RDF would help adoption much more than spending months debating the best way to profile the RDF model into a clean JSON serialization.

But the real issue is that we’re asking web developers to learn not just new syntax but also an entirely new way to access data: we’re asking them to use SPARQL rather than simple RESTful APIs.

While I think SPARQL is an important and powerful tool in the RDF toolchain I don’t think it should be seen as the standard way of querying RDF over the web. There’s a big data access gulf between de-referencing URIs and performing SPARQL queries. We need something to fill that space, and I think the Linked Data API fills that gap very nicely. We should be promoting a range of access options.

I have similar doubts about SPARQL Update as the standard way of updating triple stores over the web, but that’s the topic of another post.

Summing Up

As the RDF Next Working Group gets underway I think it needs to carefully prioritise its activities to ensure that we get the most out of this next phase of development and effort around the Semantic Web specifications. It’s particularly crucial right now as we’re beginning to see the ideas being adopted and embraced more widely. As I’ve tried to highlight here, I think there’s a lot of value to be had in having a standard JSON serialization of RDF. But I don’t think that there’s much merit in attempting to create a clean, simple JSON serialization that will meet everyone’s needs.

Standardising Turtle and an API for manipulating RDF data has more value in my view. RDF/JSON as a well implemented specification meets the core needs of the semantic web developer; a simple scripting API meets the needs of everyone else.