Everyone loves to hate RDF/XML. Indeed, many have argued that RDF/XML is responsible for holding back semantic web adoption. I’m not sure that I fully agree with that (there are a lot of other issues to consider), but it’s certainly awkward to work with if you’re trying to integrate both RDF and XML tools into your application.
It’s actually that combination that causes the awkwardness. If you’re just using RDF tools then RDF/XML is mostly fine. It benefits from XML’s Unicode support and is the most widely supported RDF serialisation. There are downsides though. For example, there are some RDF graphs that can’t be serialised as RDF/XML at all, such as those using a predicate URI that can’t be turned into a valid XML element name. But that is easy to avoid.
Developers, particularly XML developers, feel cheated by RDF/XML because of what they see as false advertising: it’s an XML format that doesn’t play nicely with XML tools. Some time ago, Dan Brickley wrote a nice history of the design of RDF/XML which is worth a read for some background. My goal here isn’t to rehash the RDF/XML discussion or even to mount a defense of RDF/XML as a good format for RDF (I prefer Turtle).
But developers are still struggling with RDF/XML, particularly in publishing workflows where XML is a good base representation for document structures, so I think it’s worthwhile capturing some advice on how to reach a compromise with RDF/XML that allows it to work nicely with XML tools. I can’t remember seeing anyone do that before, so I thought I’d write down some of my experiences. These are drawn from creating a publishing platform that ingested metadata and content in XML, used Apache Jena for storing that metadata, and Solr as a search engine. Integration between different components was carried out using XML-based messaging. So there were several places where RDF and XML rubbed up against one another.
Tip 1: Don’t rely on default serialisations
The first thing to note is that RDF/XML offers a lot of flexibility in terms of how an RDF graph can be serialised as XML. A lot. The same graph can be serialised in many different ways using a lot of syntactic short-cuts. More on those in a moment.
It’s this unbounded flexibility that is the major source of the problems: producers and consumers may have reasonable default assumptions about how data will be published that are completely at odds with one another. This makes it very difficult to consume arbitrary RDF/XML with anything other than RDF tools.
JSON-LD offers a lot of flexibility too, and I can’t help but wonder whether that flexibility is going to come back and bite us in the future.
By default RDF tools tend to generate RDF/XML in a form that makes it easy for them to serialise. This tends to mean automatically generated namespace prefixes and a per-triple approach to serialising the graph, where each statement is written out on its own.
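As a rough illustration of that per-triple style, here is a sketch of a Person resource with a name and a homepage. The example.org URIs, the values and the ns1 prefix are all invented for illustration; real tools will differ in the details.

```xml
<!-- Hypothetical default output: one rdf:Description per triple -->
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:ns1="http://xmlns.com/foaf/0.1/">
  <rdf:Description rdf:about="http://example.org/people/alice">
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://example.org/people/alice">
    <ns1:name>Alice Example</ns1:name>
  </rdf:Description>
  <rdf:Description rdf:about="http://example.org/people/alice">
    <ns1:homepage rdf:resource="http://example.org/~alice/"/>
  </rdf:Description>
</rdf:RDF>
```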
This is a disaster for XML tools, as the description of the resource is spread across multiple elements, making it hard to process. But it’s efficient to generate.
Some RDF frameworks may provide options for customising the output to apply some of the RDF/XML syntactic shortcuts. As we’ll see in a moment these are worth embracing and may produce some useful regularity.
But if you need to generate an XML format that has, for example, a precise ordering of child elements then you’re not going to get that kind of flexibility by default. You’ll need to craft a custom serialiser. Apache Jena allows you to create RDF Writers to support this kind of customization. This isn’t ideal as you need to write code (even to tweak the output options), but it gives you more control.
So, if you need to generate an XML format from RDF sources then ensure that you normalize your output. If you have control over the XML document formats and can live with some flexibility in the content model, then using RDF/XML syntax shortcuts offered by your RDF tools might well be sufficient. However if you’re working to a more rigid format, then you’re likely to need some custom code.
Tip 2: Use all of the shortcuts
Let’s look at the above example again but with a heavy dusting of syntax sugar:
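As a sketch of what that looks like, here is a Person resource with a name and a homepage written using the shortcuts. The example.org URIs and the values are invented for illustration.

```xml
<!-- Hypothetical normalised form: the typed resource is the document element -->
<foaf:Person
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    rdf:about="http://example.org/people/alice">
  <foaf:name>Alice Example</foaf:name>
  <foaf:homepage rdf:resource="http://example.org/~alice/"/>
</foaf:Person>
```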
Much nicer! The above describes exactly the same RDF graph as we had before. What have we done here?
- We’ve omitted the rdf:RDF element as it’s unnecessary. If you have a single “root” resource in your graph then you can just use this as the document element. If we had multiple, unrelated Person resources in the document then we’d need to re-introduce the rdf:RDF element as a generic container.
- Defined some default namespace prefixes
- Grouped triples about the same subject into the same element
- Removed use of rdf:Description and rdf:type, preferring to instead use namespaced element names
The result is something that is easier to read and much easier to work with in an XML context. You could even imagine creating an XML schema for this kind of document, particularly if you know which types and predicates are being used in your RDF graphs.
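To illustrate how little XML tooling that regularity demands, here is a minimal XSLT 1.0 sketch. It assumes (hypothetically) that the document element is a foaf:Person with a foaf:name child and a foaf:homepage child carrying an rdf:resource attribute; the HTML it emits is just an example.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    exclude-result-prefixes="rdf foaf">

  <xsl:output method="html"/>

  <!-- Matches the typed root element directly; this only works because
       the input is assumed to be normalised as described above -->
  <xsl:template match="/foaf:Person">
    <p>
      <!-- rdf:resource on foaf:homepage becomes the link target -->
      <a href="{foaf:homepage/@rdf:resource}">
        <xsl:value-of select="foaf:name"/>
      </a>
    </p>
  </xsl:template>

</xsl:stylesheet>
```

Pointed at un-normalised output, the same stylesheet would have to gather up several rdf:Description elements and inspect rdf:type values before it could do anything useful.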
The nice thing about this approach is that it looks just like namespaced XML. For a publishing project I worked on we defined our XML schemas for receipt of data using this kind of approach; the client didn’t really need to know anything about RDF. We just had to explain that:
- rdf:about is how we assign a unique identifier to an entity (and we used xml:base to simplify the contents further to avoid repetition)
- rdf:resource was a “link” between two resources, e.g. for cross-referencing between content and subject categories
If you’re not using RDF containers or collections then those two attributes are the only bits of RDF that creep into the syntax.
However in our case, we were also using RDF Lists to capture ordering of authors in academic papers. So we also explained that rdf:parseType was a processing hint to indicate that some element content should be handled as a collection (a list).
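Put together, those three attributes might appear as in the sketch below. The ex: vocabulary, the property names and the example.org URIs are hypothetical, not the schema from the project.

```xml
<!-- xml:base lets rdf:about and rdf:resource use short relative URIs -->
<ex:Article
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:ex="http://example.org/schema/"
    xml:base="http://example.org/"
    rdf:about="articles/1">
  <dc:title>An example article</dc:title>
  <!-- rdf:resource: a "link" to a subject category -->
  <ex:category rdf:resource="categories/semantic-web"/>
  <!-- rdf:parseType="Collection": the child elements form an ordered list -->
  <ex:authorList rdf:parseType="Collection">
    <rdf:Description rdf:about="people/alice"/>
    <rdf:Description rdf:about="people/bob"/>
  </ex:authorList>
</ex:Article>
```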
This worked very well. We’d ended up with fine-grained document types anyway, to make it easier to update individual resources in the system, e.g. individual journal issues or articles, so the above structure mapped well to the system requirements.
Here’s a slightly more complex example that hopefully further illustrates the point. Here I’m showing nesting of several resource descriptions:
<dc:title>An example article</dc:title>
<dc:description>This is an article</dc:description>
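A fuller sketch of that kind of nesting, built around the title and description above, might look like the following. The ex: vocabulary, the issue and journal resources, and the example.org URIs are invented for illustration.

```xml
<!-- Hypothetical nesting: article, then the issue it appears in, then the journal -->
<ex:Article
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:ex="http://example.org/schema/"
    rdf:about="http://example.org/articles/1">
  <dc:title>An example article</dc:title>
  <dc:description>This is an article</dc:description>
  <ex:publishedIn>
    <ex:Issue rdf:about="http://example.org/issues/7">
      <dc:title>Issue 7</dc:title>
      <ex:partOf>
        <ex:Journal rdf:about="http://example.org/journals/example">
          <dc:title>An example journal</dc:title>
        </ex:Journal>
      </ex:partOf>
    </ex:Issue>
  </ex:publishedIn>
</ex:Article>
```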
The reality is that whether you’re working in an XML or an RDF context, there is very often a primary resource you’re interested in: e.g. you’re processing a resource or rendering a view of it, etc. This means that in practice there’s nearly always an obvious and natural “root” element to the graph for creating an RDF/XML serialisation. It’s just that RDF tools don’t typically let you identify it.
Tip 3: Use RELAX NG
Because of the syntactic variation, writing schemas for RDF/XML can be damn near impossible. But for highly normalised RDF/XML it’s a much more tractable problem.
My preference has been to use RELAX NG as it offers more flexibility when creating open and flexible content models for elements, e.g. via interleaving. This gives options to leave the document structures a little looser to facilitate serialisation and also allow the contents of the graph to evolve (e.g. addition of new properties).
If you have the option, then I’d recommend RELAX NG when defining schemas for your XML data.
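As a minimal sketch of that approach, the RELAX NG schema below (in the XML syntax) describes a hypothetical foaf:Person document. The known properties may appear in any order thanks to interleave, and elements from other namespaces are allowed so the graph can grow. The choice of properties and the pattern names anyOtherProperty and anyContent are assumptions for illustration.

```xml
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">

  <start>
    <element name="foaf:Person">
      <attribute name="rdf:about"><data type="anyURI"/></attribute>
      <!-- interleave: child elements may arrive in any order -->
      <interleave>
        <element name="foaf:name"><text/></element>
        <zeroOrMore>
          <element name="foaf:homepage">
            <attribute name="rdf:resource"><data type="anyURI"/></attribute>
          </element>
        </zeroOrMore>
        <!-- leave room for properties added to the graph later -->
        <zeroOrMore>
          <ref name="anyOtherProperty"/>
        </zeroOrMore>
      </interleave>
    </element>
  </start>

  <!-- Any element outside the FOAF namespace, with arbitrary content -->
  <define name="anyOtherProperty">
    <element>
      <anyName>
        <except>
          <nsName ns="http://xmlns.com/foaf/0.1/"/>
        </except>
      </anyName>
      <ref name="anyContent"/>
    </element>
  </define>

  <define name="anyContent">
    <zeroOrMore>
      <choice>
        <attribute><anyName/></attribute>
        <text/>
        <element><anyName/><ref name="anyContent"/></element>
      </choice>
    </zeroOrMore>
  </define>

</grammar>
```

The wildcard excludes the FOAF namespace because RELAX NG does not allow the branches of an interleave to contain elements with overlapping names.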
Tip 4: RDF for metadata; XML for content
The last tip isn’t about RDF/XML per se; I just want to make a general point about where to apply the different technologies.
XML is fantastic at describing document structures and content. RDF is fantastic at describing relationships between things. Both of those qualities are important, but in very different aspects of an application.
In my work in publishing I ended up using a triple store as the primary data repository. This is because the kinds of application behaviour I wanted to drive were increasingly relationship focused: e.g. browsing to related content, author based navigation, concept relationships, etc. Increasingly I also wanted the ability to create new slices and views across the same content, and document structures were too rigid for that.
The extensibility of the RDF graph allowed me to quickly integrate new workflows (using the Blackboard pattern) so that I could, for example, harvest & integrate external links or use text mining tools to extract new relationships. This could be done without having to rework the main publishing workflow, evolve the document formats, or change the database for the metadata.
However, XML works perfectly well for rendering out the detailed content. It would be crazy to try and capture content in RDF/XML (structure yes; but not content). So for transforming XML into HTML or other views, XML was the perfect starting point. We were early adopters of XProc, so using pipelines to generate rendered content and to extract RDF/XML for loading into a triple store was easy to do.
In summary, RDF/XML is not a great format for working with RDF in an XML context, but it’s not completely broken. You just need to know how to get the best from it. It provides a default interoperable format for exchanging RDF data over the web, but there are better alternatives for hand-authoring and efficient loading. Once the RDF Working Group completes work on RDF 1.1 it’s likely that Turtle will rapidly become the main RDF serialisation.
However, I think that RDF/XML will still have a role, as part of a well-designed system, in bridging between RDF and XML tools.