Document Types in RDF

The notions of a “document” and a “document type” are core concepts in XML. The specification includes a precise description of a document, and of what it means for a document to be well-formed, valid, and so on. Even if you’re not using a DTD or XML schema, and are just using XML as a syntax for exchanging structured or semi-structured data, the concept of a document is still a useful one. For example, a document has a clear boundary and content, so there is a limited scope for the data that an application has to deal with.
The ability to define classes of documents (“document types”) brings other benefits: the structure and content of documents can be standardized. The document type becomes both a contract that can be enforced by an application prior to its processing of any given document, and a description of the acceptable inputs of that application.
The concepts of “document” and “document type” are quite general and aren’t limited to XML applications. See, for example, the JSON schema discussion. The same concepts and their attendant benefits also crop up in messaging systems.
But you don’t see much discussion about the concept of documents or their types in RDF applications. Granted, RDF/XML does define a document type for serializing RDF graphs, but we all know that the large variation in how any single RDF graph could be encoded in valid RDF/XML means that the benefits we get from non-RDF XML vocabularies are lost. The document scope can be highly variable, as can the content and syntax. Of course it is possible to create “RDF profiles” that constrain the RDF/XML syntax so that an XML schema can be used to validate documents. Jeni Tennison has recently discussed some approaches to this, and I’ve explored the topic myself in the past. In fact I regularly apply it when designing RDF-based systems: it is extremely useful (essential) to be able to validate incoming data.
But generally the notion of document types doesn’t sit well with RDF. RDF is a data model for semi-structured data. It assumes an “open world model” in which missing information is not invalid, or as Dan Brickley has put it, “missing isn’t broken”. This wild and woolly nature of RDF is, I think, one of the reasons many people struggle with it. As Dan says:

If nothing is mandatory, then how can they write code that knows what to expect?

Dan concludes that posting by suggesting that there are certain bedrocks which application authors can still rely on, e.g. XML+Namespaces, conformance to the RDF model, etc. But lately I’ve come around to the view that we need to go beyond that and offer tighter ways to document, declare and validate data that is being exchanged in RDF applications. I don’t know of any applications that adopt an open world model; quite the opposite in fact. I think there are benefits in looking at the notion of “document” and “document type” in an RDF context. “Document” may not be the right term here, though; a better one may be “view”.
So how might we achieve this, and what are the benefits in doing so?
We can use the aforementioned “profiling” option to create a constrained RDF/XML vocabulary that can be validated using an XML schema (of whatever kind). Where two parties need to have an agreed-on format for data exchange this works well. So for example, the OECD are supplying us with XML documents according to an XML schema. The documents are valid RDF/XML so we can simply pour them into a triple store for our application to use. Each XML document is basically a packet of RDF that describes one section (or sub-graph) of the entire data set. Those same packets are used as the basic message format for passing between internal components (e.g. the search indexer). So this is one useful application of the document concept in an application which is otherwise entirely RDF-driven and which goes to some length to be agnostic to the details of the data it contains.
In a scenario where there isn’t any prior co-ordination between the parties exchanging data, there are other options. A typical scenario here might be submitting my FOAF document (either directly, or referenced via an OpenID) to register or configure some online service. There are many ways I might structure my FOAF document, so how does the service validate or check that the required data is present? The answer here is SPARQL. SPARQL can be used to validate a graph by testing whether specific graph patterns are present using ASK or CONSTRUCT. It can also be used to CONSTRUCT a constrained “view” of the submitted data that throws away anything that the application isn’t directly interested in. The other side benefit to using SPARQL is that it doesn’t really matter which RDF syntax is being used: the validation and data extraction is happening at the level of the data model, not the syntax.
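To make the ASK-style check concrete, here is a minimal sketch in plain Python. It is a toy stand-in for a real SPARQL engine, not how any particular service does it: triples are plain tuples, strings starting with "?" are treated as variables, and the names ask, solutions and REQUIRED are made up for illustration. The required pattern (a foaf:Person with a name and a mailbox) is likewise just an assumed example of what a service might demand.

```python
# Toy model: a graph is a set of (subject, predicate, object) tuples.
# In patterns only, strings starting with "?" are variables.
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def match_pattern(pattern, triple, bindings):
    """Unify one triple pattern with one triple; return extended bindings or None."""
    result = dict(bindings)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # Bind the variable, or fail if it is already bound to something else.
            if result.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return result


def solutions(patterns, graph, bindings=None):
    """Yield every set of variable bindings that satisfies all patterns."""
    bindings = {} if bindings is None else bindings
    if not patterns:
        yield bindings
        return
    first, rest = patterns[0], patterns[1:]
    for triple in graph:
        extended = match_pattern(first, triple, bindings)
        if extended is not None:
            yield from solutions(rest, graph, extended)


def ask(patterns, graph):
    """Rough analogue of SPARQL ASK: does at least one solution exist?"""
    return next(solutions(patterns, graph), None) is not None


# Hypothetical "document type" for a registration service.
REQUIRED = [
    ("?person", RDF_TYPE, FOAF + "Person"),
    ("?person", FOAF + "name", "?name"),
    ("?person", FOAF + "mbox", "?mbox"),
]

# Made-up submitted data; the extra foaf:nick triple is simply ignored.
submitted = {
    ("http://example.org/me", RDF_TYPE, FOAF + "Person"),
    ("http://example.org/me", FOAF + "name", "Alice"),
    ("http://example.org/me", FOAF + "mbox", "mailto:alice@example.org"),
    ("http://example.org/me", FOAF + "nick", "al"),
}

is_valid = ask(REQUIRED, submitted)
```

The check never looks at how the data was serialized, only at the triples, which is the point made above about working at the level of the data model. Extra triples don't cause a failure; only a missing required pattern does.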
We use the technique of defining RDF views using CONSTRUCT elsewhere in our applications. The primary use is fetching the data required to present some aspect of the RDF graph to an end user. I’ve described this, and the underlying system and its assumptions, in a recent presentation. Here the “view” or “document type” is used to drive a simple data binding layer, and is essentially the contract between the application logic and the presentation layer. The application doesn’t need to deal with the entire graph, just useful, use-case-specific subsets. And these are different “document types” from those used when loading the original data. The application doesn’t have a single document type: it has many, and they’re used in different contexts. This avoids overly constraining the model (we want to be able to store arbitrary additional properties) but imposes local scoping to gain the benefits of validation, known contents, etc.
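A view of this kind can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not our actual binding layer: a view is reduced to "these properties of this subject", where a real CONSTRUCT query can express far richer patterns. The function names construct_view and bind, the example.org URIs and the sample data are all hypothetical.

```python
DCT = "http://purl.org/dc/terms/"
EX = "http://example.org/vocab/"  # made-up namespace for illustration


def construct_view(graph, subject, properties):
    """Return the sub-graph of triples about `subject` whose predicate is in `properties`."""
    return {(s, p, o) for (s, p, o) in graph if s == subject and p in properties}


def bind(view):
    """Flatten a single-subject view into {predicate: sorted list of values}."""
    record = {}
    for _, p, o in view:
        record.setdefault(p, []).append(o)
    return {p: sorted(values) for p, values in record.items()}


DOC = "http://example.org/doc/1"

# Made-up store contents: the model is free to hold extra properties.
graph = {
    (DOC, DCT + "title", "Example report"),
    (DOC, DCT + "description", "A short summary"),
    (DOC, EX + "internalNote", "not for display"),
    ("http://example.org/doc/2", DCT + "title", "Another report"),
}

# One "document type" among many: what a summary page needs.
SUMMARY_VIEW = {DCT + "title", DCT + "description"}

view = construct_view(graph, DOC, SUMMARY_VIEW)
record = bind(view)
```

The presentation layer only ever sees record, so it can rely on a known shape while the store itself stays unconstrained. A different page would define a different property set over the same graph.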
It turns out that there’s another use case where RDF document types or views are useful: managing updates to a triple store. If you know that some incoming data is constrained to a particular view (e.g. by prior agreement, or through extracting only those graph patterns that are of interest) then applying the incoming message as an update to the store is simply a matter of doing some set algebra. Extract the equivalent view from the store (i.e. the relevant sub-graph) and then look for the difference between the stored and incoming sub-graphs. The end result is a list of triples to delete and a list of triples to add to the store.
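The set algebra can be sketched directly with Python sets. This assumes triples are plain tuples and that the view contains no blank nodes (which can't be compared by simple equality across graphs); the function names and the sample data are made up for illustration.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
EX = "http://example.org/vocab/"  # made-up namespace for illustration


def extract_view(store, subject, properties):
    """Pull the sub-graph covered by the view out of the store."""
    return {(s, p, o) for (s, p, o) in store if s == subject and p in properties}


def diff_views(stored_view, incoming_view):
    """Return (triples to delete, triples to add) to turn the stored view into the incoming one."""
    stored, incoming = set(stored_view), set(incoming_view)
    return stored - incoming, incoming - stored


def apply_update(store, to_delete, to_add):
    """Return a new store with the deletions and additions applied."""
    return (store - to_delete) | to_add


ME = "http://example.org/me"
PROFILE_VIEW = {FOAF + "name", FOAF + "nick"}

# Made-up store contents, including a property outside the view.
store = {
    (ME, FOAF + "name", "Alice"),
    (ME, FOAF + "nick", "al"),
    (ME, EX + "shoeSize", "7"),
}

# Incoming message, already constrained to the same view.
incoming = {
    (ME, FOAF + "name", "Alice Smith"),
    (ME, FOAF + "nick", "al"),
}

stored_view = extract_view(store, ME, PROFILE_VIEW)
to_delete, to_add = diff_views(stored_view, incoming)
updated = apply_update(store, to_delete, to_add)
```

Because the diff is computed only within the view, triples outside it (the shoe size here) are left alone, and unchanged triples generate no work at all.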
I’ll follow up more on the topics in this posting, as I think there are huge benefits to be had here from looking at how the notion of documents and document types can add value to RDF systems. It’s very easy to get caught up in the completely general case of a highly-distributed, wild and woolly world of RDF and the Semantic Web. But the majority of applications will have a much more limited world view, and my experience so far is that applying some additional constraints here and there can have huge benefits. Embracing the notion of multiple document types is one of these.