In this post I want to put down some thoughts around using named graphs to manage and query RDF datasets. This thinking is prompted in large part by considering how best to use Named Graphs to support publishing of Linked Data, but also most recently by the first Working Drafts of SPARQL 1.1.
While the notion of named graphs for RDF has been around for many years now, the closest they have come to being standardised as a feature is through the SPARQL 1.0 specification which refers to named graphs in its specification of the dataset for a SPARQL query. SPARQL 1.1 expands on this, explaining how named graphs may be used in SPARQL Update, and also as part of the new Uniform HTTP Protocol for Managing RDF Graphs document.
Named graphs are an increasingly important feature of RDF triplestores and are very relevant to the notion of publishing Linked Data, so their use and specification do bear some additional analysis.
What Are Named Graphs?
Named Graphs turn the RDF triple model into a quad model by extending a triple to include an additional item of information. This extra piece of information takes the form of a URI which provides some additional context to the triple with which it is associated, providing an extra degree of freedom when it comes to managing RDF data. The ability to group triples around a URI underlies features such as:
- Tracking provenance of RDF data — here the extra URI is used to track the source of the data; especially useful for web crawling scenarios
- Replication of RDF graphs — triples are grouped into sets, labelled by a URI, that may then be separately exchanged and replicated
- Managing RDF datasets — here the set of triples may be an entire RDF dataset, e.g. all of dbpedia, or all of musicbrainz, making it easier to identify and query subsets within an aggregation
- Versioning — the URI identifies a set of triples, and that URI may be separately described, e.g. to capture the creation & modification dates of the triples in that set, who performed the change, etc.
- Access Control — by identifying sets of triples we can then record access control related metadata
Clearly there’s some degree of overlap between these different approaches, but then you’d expect that given that they’re all built on what is a fairly simple extension to the RDF model. Two of the key differentiators are:
- Granularity: i.e. does the named graph relate to a discrete identifiable subset of a dataset, e.g. every statement about a specific resource, or does it identify the dataset itself, e.g. dbpedia
- Concrete-ness: do the named graphs relate to how the data is actually being managed or stored; or does it instead reflect some other useful partitioning of the data?
One of the nice things about the simplicity of Named Graphs is that you can do so many things with that extra degree of freedom, i.e. by managing quads rather than triples.
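To make the quad idea concrete, here is a minimal sketch in Python. The `Quad` type, the `group_by_graph` helper and all of the example.org URIs are my own illustrative assumptions, not part of any specification or store.

```python
from collections import defaultdict
from typing import NamedTuple


class Quad(NamedTuple):
    """A triple plus the URI of the graph it is associated with."""
    subject: str
    predicate: str
    obj: str
    graph: str  # the extra item of information that names the graph


def group_by_graph(quads):
    """Group triples into sets keyed by their graph URI."""
    graphs = defaultdict(set)
    for q in quads:
        graphs[q.graph].add((q.subject, q.predicate, q.obj))
    return dict(graphs)


# Hypothetical data: two graphs named after the crawl that produced them.
quads = [
    Quad("http://example.org/alice", "http://xmlns.com/foaf/0.1/name",
         '"Alice"', "http://example.org/graphs/crawl-1"),
    Quad("http://example.org/alice", "http://xmlns.com/foaf/0.1/knows",
         "http://example.org/bob", "http://example.org/graphs/crawl-1"),
    Quad("http://example.org/bob", "http://xmlns.com/foaf/0.1/name",
         '"Bob"', "http://example.org/graphs/crawl-2"),
]

graphs = group_by_graph(quads)
```

Everything in the list above (provenance, replication, versioning, access control) then comes down to what you choose to say about the keys of that dictionary.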
Exchanging Named Graphs
Clearly if we’re working with Named Graphs then it would be useful if there were a way to exchange them. Being able to serialize RDF quads would allow a complete Named Graph to be transferred between stores. Actually, for some uses of Named Graphs this may not be required. For example if I’m using Named Graphs as a means to track which triples came from which URIs during a web crawl, I only need to serialize the quads if I decide to move data between stores, not as part of the basic functionality.
Unsurprisingly none of the standard RDF serializations are capable of serializing Named Graphs; however, there are two formats that have been developed to support their interchange: TriG and TriX. TriG is a plain text format, which is a variant of Turtle, while TriX is a highly normalized XML format for RDF that includes the ability to name graphs.
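As a rough illustration of what the plain text option looks like, the sketch below writes quads out as graph blocks in the style of TriG: a graph name followed by its triples in braces. The `to_trig` function is hypothetical and deliberately naive: it assumes subjects, predicates and graph names are full IRIs, that objects are already formatted as terms, and it does no escaping or prefix handling.

```python
def to_trig(quads):
    """Serialize (subject, predicate, object, graph) tuples as TriG-style text.

    Assumption: subject, predicate and graph are bare IRIs, while the object
    is already a formatted term such as '<http://...>' or '"a literal"'.
    """
    by_graph = {}
    for s, p, o, g in quads:
        by_graph.setdefault(g, []).append((s, p, o))

    lines = []
    for g, triples in by_graph.items():
        lines.append(f"<{g}> {{")
        for s, p, o in triples:
            lines.append(f"    <{s}> <{p}> {o} .")
        lines.append("}")
    return "\n".join(lines)


# Hypothetical example data.
quads = [
    ("http://example.org/alice", "http://xmlns.com/foaf/0.1/name",
     '"Alice"', "http://example.org/graphs/crawl-1"),
    ("http://example.org/alice", "http://xmlns.com/foaf/0.1/knows",
     "<http://example.org/bob>", "http://example.org/graphs/crawl-1"),
    ("http://example.org/bob", "http://xmlns.com/foaf/0.1/name",
     '"Bob"', "http://example.org/graphs/crawl-2"),
]

trig = to_trig(quads)
print(trig)
```

For anything beyond a toy, an existing RDF library with TriG or TriX support is the better choice.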
Named Graphs in SPARQL 1.0
Let’s look at how Named Graphs are used in SPARQL 1.0 and in the SPARQL 1.1 drafts. SPARQL 1.0 explains that a query executes against an RDF Dataset which “…represents a collection of graphs. An RDF Dataset comprises one graph, the default graph, which does not have a name, and zero or more named graphs, where each named graph is identified by an IRI. A SPARQL query can match different parts of the query pattern against different graphs”.
In practice one uses the FROM and FROM NAMED clauses to identify the default and named graphs, and the GRAPH keyword to apply triple pattern matches to specific graphs. There are a few things to observe here already, some of which are consequences of the above, some from wording in the SPARQL protocol:
- A SPARQL endpoint may not support Named Graphs at all
- A SPARQL endpoint may let you define an arbitrary dataset for your query. Some open endpoints will fetch data by dereferencing the URIs mentioned in the FROM/FROM NAMED clause but that’s quite rare, mainly for efficiency, cost, and security reasons.
- A SPARQL endpoint may not let you define the dataset for your query, i.e. it might use a fixed dataset scoped to some backing store. Any definition of the dataset in the protocol request or query is either optional, or must match the definition of the endpoint
- A SPARQL endpoint may let you define the default graph to be used in a query, but may not be willing/able to do arbitrary graph merges. For example in an endpoint containing dbpedia and geonames, you might be able to select FROM one of them, but not both.
- A SPARQL endpoint may be backed by a triple store that is organized around the model of an RDF Dataset, and therefore has a fixed default graph and any number of named graphs. This limits flexibility of constructing the dataset for a query, as it is fixed by the underlying storage model.
- A SPARQL endpoint may let you query graphs that don’t physically exist in the underlying triplestore. Such a synthetic graph may be, for example, the merge of all Named Graphs in the triple store.
There may be other variations, but I’m aware of implementations and endpoints that exhibit each of those outlined above. The important thing to realise is that while SPARQL doesn’t place any restrictions on how you use named graphs, implementation decisions of the endpoint and/or the underlying triple store may place some limits on how they can be used in queries. The other important point to draw out is that the set of named graphs exposed through a SPARQL query interface may be different than the set of named graphs managed in the backing storage. This is most obvious in the case of synthetic graphs.
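To show the two places a dataset can be described, here is a hedged sketch of a query that uses FROM, FROM NAMED and GRAPH, plus a helper that builds a SPARQL protocol GET request URL using the protocol's `default-graph-uri` and `named-graph-uri` parameters. The endpoint address, the graph URIs and the `query_url` helper are all hypothetical, and, as the list above makes clear, any given endpoint may ignore or reject either form.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "http://example.org/sparql"  # hypothetical endpoint

# The dataset is described in the query itself: one default graph, one named graph.
QUERY = """
SELECT ?g ?name
FROM <http://example.org/graphs/people>
FROM NAMED <http://example.org/graphs/crawl-1>
WHERE {
  ?person <http://xmlns.com/foaf/0.1/knows> ?friend .
  GRAPH ?g { ?friend <http://xmlns.com/foaf/0.1/name> ?name }
}
"""


def query_url(endpoint, query, default_graphs=(), named_graphs=()):
    """Build a protocol request URL, optionally describing the dataset there too."""
    params = [("query", query.strip())]
    params += [("default-graph-uri", g) for g in default_graphs]
    params += [("named-graph-uri", g) for g in named_graphs]
    return endpoint + "?" + urlencode(params)


url = query_url(
    ENDPOINT,
    QUERY,
    default_graphs=["http://example.org/graphs/people"],
    named_graphs=["http://example.org/graphs/crawl-1"],
)

# Decode the query string again to check what would be sent.
params = parse_qs(urlsplit(url).query)
```

The first triple pattern is matched against the default graph, while the pattern inside GRAPH is matched against each named graph in turn, with `?g` bound to the graph's name.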
Synthetic graphs are a very useful feature as they can provide some useful abstraction over how the underlying data is managed and how it is queried.
For example, one might use a large number of separate named graphs when managing data, thereby making it easy to merge and manage data from different sources (e.g. a web crawl). Some applications use thousands of very small Named Graphs simply because they’re easier to manage. By using a synthetic graph which exposes all of the data through a SPARQL endpoint as if it were in fact in a single graph, it’s possible to abstract over those details of storage. There are a few stores that support this kind of technique, and it can be pushed further by making the definition of the synthetic graph more flexible, e.g. the set of all graphs that are valid between particular dates, or the set of all graphs whose names match a specific URI pattern. This approach can help abstract away the management and modelling details that are necessary for dealing with issues like versioning.
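A synthetic graph of this kind can be sketched in a few lines. The `synthetic_union` function and the store layout below are assumptions of mine rather than how any particular triplestore does it, and a plain set union is only a faithful RDF merge when the graphs share no blank nodes.

```python
def synthetic_union(store, include=lambda graph_uri: True):
    """Merge the triples of every stored graph whose URI is accepted by `include`."""
    merged = set()
    for graph_uri, triples in store.items():
        if include(graph_uri):
            merged |= triples
    return merged


# Hypothetical store: graph URI -> set of triples (abbreviated terms for brevity).
store = {
    "http://example.org/graphs/crawl/1": {("ex:a", "ex:p", "ex:b")},
    "http://example.org/graphs/crawl/2": {("ex:b", "ex:p", "ex:c")},
    "http://example.org/graphs/manual": {("ex:a", "ex:p", "ex:b"),
                                         ("ex:c", "ex:p", "ex:d")},
}

# The merge of all Named Graphs, exposed as if it were a single graph.
everything = synthetic_union(store)

# A more flexible definition: only graphs whose names match a URI pattern.
crawl_only = synthetic_union(
    store, lambda g: g.startswith("http://example.org/graphs/crawl/")
)
```

The same `include` hook could just as easily consult metadata about each graph, such as validity dates, rather than its name.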
Named Graphs in SPARQL 1.1
Let’s look at how SPARQL 1.1 might impact on the above scenarios. I use “might” advisedly as it’s still early days: we’ve only just had the first public Working Drafts and so the state of play might change.
Section 4.1 of the SPARQL 1.1 Update draft notes that: If an update service is managing some graph store, then there is no presumption that this exactly corresponds to any RDF dataset offered by some query service. The dataset of the query service may be formed from graphs managed by the update service but the dataset requested by the query can be a subset of the graphs in the update service dataset and the names of those graphs may be different. The composition of the RDF dataset may also change.
So basically the set of RDF graphs exposed by a SPARQL 1.1 Update service may differ from the set exposed by a Query service at the same endpoint. This will always be the case if the Query endpoint exposes any synthetic graphs. These, presumably overlapping, sets make sense from the perspective of wanting some flexibility in how data is managed versus how it is queried. It’s likely that we’ll see implementations offer a range of options with the most likely case being that the “core” set of graphs is identical, but that an additional set may be available for querying.
SPARQL 1.1, as it currently stands, includes a Uniform HTTP Protocol for Managing RDF Graphs. I’m very happy to see this and think that it’s an important part of the picture for publishing RDF data on the web in a RESTful way. As part of the overall Linked Data message we’ve been saying that “your website is your API”; that by assigning clear stable URIs to things in your system and then exposing both human and machine-readable data at those URIs, Linked Data just drops out of the design. And this is also clearly a RESTful approach.
But to make things completely RESTful then we need to not only be able to read data from those URIs, we should be able to update the data at those URIs using the uniform protocol that HTTP defines. I was always a little wary of SPARQL Update because it seemed like it might supplant a more RESTful option, but I’m encouraged by the presence of this working draft that this won’t be the case. But I don’t think the draft goes far enough in a few places: I’d like to have the ability to make changes to individual statements within a graph, as well as just whole graphs, using techniques like Changesets.
The draft currently doesn’t get into the issues surrounding how URIs might be managed on a server, instead deferring that to the implementation. But I think it’s an important topic to explore, so let’s devote some time to it here.
Approaches to Managing Named Graphs on the Web
For the most part the mapping of graph management to the web is uncontroversial: the four HTTP verbs of GET, PUT, POST and DELETE have obvious and intuitive meanings. The subtleties arise out of questions such as how URIs are assigned to graphs, and what those URIs identify.
Client Managed Graph Identifiers
There are two ways that URIs can be assigned to graphs managed in a networked store. The first and simplest is that the client assigns all URIs. To create a new graph and populate it with data, we just PUT to a new URI. Starting from a base URI, distinct from any SPARQL endpoint the service might expose, the client can build out a URI space for the graphs by just PUTting to URIs. In this scenario one really only needs GET, PUT, and DELETE. POST doesn’t have any clear role, but could be used to handle, e.g. submissions of Changesets.
Even with the simple style of client-side URIs for graphs, there’s one wrinkle we need to address. As I explained at the start of the post there may be several different reasons why someone is using Named Graphs. Using the graph identifier to keep track of the source of the data is a fairly common requirement. So this means you have several options for how URIs might be supplied:
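As a sketch of the client-managed style, the snippet below builds (but does not send) the PUT and DELETE requests a client might issue. The base URI, the helper names and the choice of `text/turtle` as the media type are my assumptions; a real service would document its own.

```python
from urllib.request import Request

BASE = "http://example.org/graphs/"  # hypothetical base URI for the graph collection


def put_graph(path, turtle):
    """Build a PUT that creates or replaces the graph at BASE + path."""
    return Request(
        BASE + path,
        data=turtle.encode("utf-8"),
        headers={"Content-Type": "text/turtle"},  # assumed media type
        method="PUT",
    )


def delete_graph(path):
    """Build a DELETE that removes the graph at BASE + path."""
    return Request(BASE + path, method="DELETE")


req = put_graph(
    "abc/123",
    "<http://example.org/a> <http://example.org/p> <http://example.org/b> .",
)
# Passing `req` to urllib.request.urlopen() would actually send it.
```

The client picks the path (`abc/123` here), so no round trip is needed to discover where the graph lives.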
1. /graphs/abc, where the client is building out a collection of named graphs whose identifiers all share a common prefix, with each having its own suffix. We may end up with a relatively flat structure or a hierarchical one, e.g. /graphs/abc/123. There’s no implicit requirement that graph URIs with a hierarchical arrangement have any formal relationship, but this does have the useful property that the URIs are hackable.
2. /graphs/http://www.example.org/abc, which is similar to the above except that the unique portion of each graph name is a complete URI. This would probably need to be encoded but I’ve omitted that for readability. This approach is useful when using Named Graphs to track the source URI of a graph.
3. /graphs?graph=http://www.example.org/abc, a variant of the second option that moves the graph identifier out into a parameter rather than putting it into the path info of the base URI. I think typically the value of the parameter would be a full rather than a relative URI, but a server could support resolving URIs against a base.
It's clear that while Option 1 provides nice clean identifiers for graphs, ultimately it's limiting for scenarios where the graph may have another "natural" identifier, e.g. its source. For Options 2 & 3 we have to deal with URL encoding (especially if the URI itself contains parameters). Personally, of the two alternate options I think 3 is nicer, if only aesthetically. I'm not aware of any problems or limitations with performing an HTTP PUT to a URI with parameters: it is the full Request URI, including any parameters, that identifies the resource being created or updated.
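The encoding issue is easy to see with Python's standard library. In this sketch the base URI and the source URI (which deliberately carries its own query parameters) are made up for illustration.

```python
from urllib.parse import quote, urlencode

BASE = "http://example.org/graphs"  # hypothetical graph collection
source = "http://www.example.org/abc?page=2&sort=asc"  # a source URI with its own parameters

# Option 2: the whole source URI becomes a single, fully escaped path segment.
option_2 = BASE + "/" + quote(source, safe="")

# Option 3: the source URI is carried as the value of a `graph` parameter.
option_3 = BASE + "?" + urlencode({"graph": source})

print(option_2)
print(option_3)
```

In both cases the `?`, `&` and `=` inside the source URI have to be escaped, otherwise they would be read as part of the request URI's own query string.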
Server Assigned Graph Identifiers
A server managing Named Graphs may not allow clients to assign graph identifiers. For example, the server may want to enforce a particular naming convention for graphs. This might also be useful for clients too, e.g. if they want to throw some data into a named graph as a scratch store. What restrictions does this scenario impose?
Firstly it would require the client to POST the data to be stored to a generic graphs collection (/graphs). The server would then determine the graph URI and return an HTTP 201 response with a Location header indicating where the client can find the data it has just stored. This way the client would know where to find the data and could then use further requests (GET/PUT/POST/DELETE) to manage it.
To support tracking of the source of a graph, one might allow a graph parameter to be added to the URI. And, to avoid a client having to maintain a local mapping from the original graph URI to the stored alias, the server could store the value of the graph parameter as metadata associated with the graph it creates. The server could support a GET request on /graphs?graph=X, returning a 302 redirect to the URI which is acting as a local alias for graph X. The client could then PUT/POST/DELETE that resource. If a client sent a repeated POST request, identifying the same graph URI, then the server could allow this, and return a 303 See Other response rather than a 201.
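The interaction described above can be sketched with a small in-memory stand-in for the server. The `GraphCollection` class, its method names and its sequential ids are hypothetical, and two details are my own choices rather than anything stated above: a repeated POST leaves the stored data untouched, and looking up an unknown graph returns a 404.

```python
import itertools


class GraphCollection:
    """In-memory stand-in for a server that assigns graph URIs itself."""

    def __init__(self, base="/graphs"):
        self.base = base
        self.graphs = {}   # local graph URI -> stored data
        self.aliases = {}  # source graph URI -> local graph URI
        self._ids = itertools.count(1)

    def post(self, data, graph=None):
        """Handle POST /graphs or POST /graphs?graph=X."""
        if graph is not None and graph in self.aliases:
            # Repeated POST for a known source graph: point at the existing alias.
            return 303, {"Location": self.aliases[graph]}
        local = f"{self.base}/{next(self._ids)}"
        self.graphs[local] = data
        if graph is not None:
            # Remember the source URI as metadata about the new graph.
            self.aliases[graph] = local
        return 201, {"Location": local}

    def get_alias(self, graph):
        """Handle GET /graphs?graph=X by redirecting to the local alias."""
        if graph in self.aliases:
            return 302, {"Location": self.aliases[graph]}
        return 404, {}  # assumption: unknown graphs are simply not found


server = GraphCollection()
first = server.post("...some RDF...", graph="http://www.example.org/abc")
again = server.post("...some RDF...", graph="http://www.example.org/abc")
lookup = server.get_alias("http://www.example.org/abc")
scratch = server.post("...scratch data...")
```

The client never needs to remember `/graphs/1`: the source URI is enough to find the alias again.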
It's also possible to support a hybrid approach in which a client may PUT to any URI with a base of /graphs, but the server disallows graph ids that start with http://. For those URIs, the server could require that a client let it assign the id, supporting the graph parameter as described earlier in this section.
There's no right or wrong way here. The differences fall out of the different ways we can map graph management onto the HTTP protocol. While a lot is fixed (methods, response codes and their semantics) if we are aiming to be RESTful, there are still some degrees of freedom with which to play around with different mappings. The SPARQL 1.1 Uniform HTTP Protocol draft doesn't address this, so perhaps there's room for the community to standardise best practices or conventions. However I think it'd be useful to at least see some informational text in the document.
Named Graphs are an important part of the overall technical framework for managing, publishing and querying RDF and Linked Data, and it's important to understand the trade-offs in different approaches to using them. Hopefully this post is a step in the right direction.
If anyone has any strong opinions on how they think Named Graphs should be managed RESTfully, then please feel free to comment on this posting. I'm very interested to hear your thoughts.
One thing that interests me is: how can we use Named Graphs to support publishing of Linked Data? That's something I'll follow up on in a separate post.