Managing RDF Using Named Graphs

In this post I want to put down some thoughts around using named graphs to manage and query RDF datasets. This thinking is prompted in large part by considering how best to use Named Graphs to support publishing of Linked Data, but also most recently by the first Working Drafts of SPARQL 1.1.

While the notion of named graphs for RDF has been around for many years now, the closest they have come to being standardised as a feature is through the SPARQL 1.0 specification which refers to named graphs in its specification of the dataset for a SPARQL query. SPARQL 1.1 expands on this, explaining how named graphs may be used in SPARQL Update, and also as part of the new Uniform HTTP Protocol for Managing RDF Graphs document.

Named graphs are an increasingly important feature of RDF triplestores and are very relevant to the notion of publishing Linked Data, so their use and specification bear some additional analysis.

What Are Named Graphs?

Named Graphs turn the RDF triple model into a quad model by extending a triple to include an additional item of information. This extra piece of information takes the form of a URI which provides some additional context to the triple with which it is associated, providing an extra degree of freedom when it comes to managing RDF data. The ability to group triples around a URI underlies features such as:

  • Tracking provenance of RDF data — here the extra URI is used to track the source of the data; especially useful for web crawling scenarios
  • Replication of RDF graphs — triples are grouped into sets, labelled by a URI, that may then be separately exchanged and replicated
  • Managing RDF datasets — here the set of triples may be an entire RDF dataset, e.g. all of dbpedia, or all of musicbrainz, making it easier to identify and query subsets within an aggregation
  • Versioning — the URI identifies a set of triples, and that URI may be separately described, e.g. to capture the creation & modification dates of the triples in that set, who performed the change, etc.
  • Access Control — by identifying sets of triples we can then record access control related metadata

…and many more. There’s some useful background available on Named Graphs in general in a paper about NG4J, and specifically on their use in OpenAnzo.

Clearly there’s some degree of overlap between these different approaches, but then you’d expect that given that they’re all built on what is a fairly simple extension to the RDF model. Two of the key differentiators are:

  • Granularity: i.e. does the named graph relate to a discrete identifiable subset of a dataset, e.g. every statement about a specific resource, or does it identify the dataset itself, e.g. dbpedia
  • Concrete-ness: do the named graphs relate to how the data is actually being managed or stored; or does it instead reflect some other useful partitioning of the data?

One of the nice things about the simplicity of Named Graphs is that you can do so many things with that extra degree of freedom, i.e. by managing quads rather than triples.
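
To make the quad model a little more concrete, here is a minimal sketch in Python. The Quad type, the example URIs and the group_by_graph helper are all my own illustrative names rather than anything defined by a specification; the point is only that the fourth element lets us group triples around a URI.

```python
from collections import defaultdict
from typing import NamedTuple


class Quad(NamedTuple):
    """A triple plus the URI of the graph it belongs to."""
    subject: str
    predicate: str
    obj: str
    graph: str


# Hypothetical data: here the graph URI records where each triple was crawled from.
QUADS = [
    Quad("http://example.org/alice", "http://xmlns.com/foaf/0.1/name", "Alice",
         "http://example.org/crawl/page1"),
    Quad("http://example.org/alice", "http://xmlns.com/foaf/0.1/knows",
         "http://example.org/bob", "http://example.org/crawl/page1"),
    Quad("http://example.org/bob", "http://xmlns.com/foaf/0.1/name", "Bob",
         "http://example.org/crawl/page2"),
]


def group_by_graph(quads):
    """Group triples into sets keyed by their graph URI."""
    graphs = defaultdict(set)
    for q in quads:
        graphs[q.graph].add((q.subject, q.predicate, q.obj))
    return dict(graphs)


graphs = group_by_graph(QUADS)
for uri, triples in sorted(graphs.items()):
    print(uri, len(triples))
```

Whether the graph URI means "source", "version" or "dataset" is entirely down to the application; the data structure is the same in each case.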

Exchanging Named Graphs

Clearly if we’re working with Named Graphs then it would be useful if there were a way to exchange them. Being able to serialize RDF quads would allow a complete Named Graph to be transferred between stores. Actually, for some uses of Named Graphs this may not be required. For example, if I’m using Named Graphs as a means to track which triples came from which URIs during a web crawl, I only need to serialize the quads if I decide to move data between stores, not as part of the basic functionality.

Unsurprisingly none of the standard RDF serializations are capable of representing Named Graphs; however, two serializations have been developed to support their interchange: TriG and TriX. TriG is a plain text format, a variant of Turtle, while TriX is a highly normalized XML format for RDF that includes the ability to name graphs.
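
As an illustration of what a TriG document looks like, here is a small sketch that writes quads out in TriG form. It is deliberately simplified and not a full serializer: I am assuming every subject, predicate and graph name is a full IRI, and that objects are either IRIs or plain string literals. The to_trig function name is my own.

```python
def to_trig(quads):
    """Serialize (subject, predicate, object, graph) tuples as TriG text.

    Simplifications: no prefixes, no blank nodes, no datatypes or language
    tags. Objects starting with "http" are treated as IRIs, anything else
    as a plain literal.
    """
    by_graph = {}
    for s, p, o, g in quads:
        by_graph.setdefault(g, []).append((s, p, o))

    lines = []
    for g in sorted(by_graph):
        lines.append(f"<{g}> {{")
        for s, p, o in by_graph[g]:
            if o.startswith("http"):
                obj = f"<{o}>"
            else:
                # Escape backslashes, quotes and newlines inside the literal.
                escaped = (o.replace("\\", "\\\\")
                            .replace('"', '\\"')
                            .replace("\n", "\\n"))
                obj = f'"{escaped}"'
            lines.append(f"    <{s}> <{p}> {obj} .")
        lines.append("}")
    return "\n".join(lines)


example = to_trig([
    ("http://example.org/a", "http://xmlns.com/foaf/0.1/name", "A",
     "http://example.org/g1"),
])
print(example)
```

Each graph becomes a named block of Turtle-style triples, which is essentially all TriG adds to Turtle.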

Named Graphs in SPARQL 1.0

Let’s look at how Named Graphs are used in SPARQL 1.0 and in the SPARQL 1.1 drafts. SPARQL 1.0 explains that a query executes against an RDF Dataset which “…represents a collection of graphs. An RDF Dataset comprises one graph, the default graph, which does not have a name, and zero or more named graphs, where each named graph is identified by an IRI. A SPARQL query can match different parts of the query pattern against different graphs”.

In practice one uses the FROM and FROM NAMED clauses to identify the default and named graphs, and the GRAPH keyword to apply triple pattern matches to specific graphs. There are a few things to observe here already, some of which are consequences of the above, some from wording in the SPARQL protocol:

  • A SPARQL endpoint may not support Named Graphs at all
  • A SPARQL endpoint may let you define an arbitrary dataset for your query. Some open endpoints will fetch data by dereferencing the URIs mentioned in the FROM/FROM NAMED clauses, but that’s quite rare, mainly for efficiency, cost, and security reasons.
  • A SPARQL endpoint may not let you define the dataset for your query, i.e. it might use a fixed dataset scoped to some backing store. Any definition of the dataset in the protocol request or query is either optional, or must match the definition of the endpoint
  • A SPARQL endpoint may let you define the default graph to be used in a query, but may not be willing/able to do arbitrary graph merges. For example in an endpoint containing dbpedia and geonames, you might be able to select FROM one of them, but not both.
  • A SPARQL endpoint may be backed by a triple store that is organized around the model of an RDF Dataset, and therefore has a fixed default graph and any number of named graphs. This limits flexibility in constructing the dataset for a query, as it is fixed by the underlying storage model.
  • A SPARQL endpoint may let you query graphs that don’t physically exist in the underlying triplestore. Such a synthetic graph may be, for example, the merge of all Named Graphs in the triple store.

There may be other variations, but I’m aware of implementations and endpoints that exhibit each of those outlined above. The important thing to realise is that while SPARQL doesn’t place any restrictions on how you use named graphs, implementation decisions of the endpoint and/or the underlying triple store may place some limits on how they can be used in queries. The other important point to draw out is that the set of named graphs exposed through a SPARQL query interface may be different than the set of named graphs managed in the backing storage. This is most obvious in the case of synthetic graphs.
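
To make the FROM, FROM NAMED and GRAPH clauses concrete, here is a sketch of a query that declares its own dataset, together with the alternative of naming the dataset in the protocol request using the default-graph-uri and named-graph-uri parameters. The endpoint and graph URIs are hypothetical, and as noted above a given endpoint is free to ignore or reject either form.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and graph URIs, for illustration only.
ENDPOINT = "http://example.org/sparql"
DEFAULT_GRAPH = "http://example.org/graphs/dbpedia"
NAMED_GRAPH = "http://example.org/graphs/geonames"

# Option A: the dataset is declared in the query text itself.
QUERY = f"""
SELECT ?place ?label ?lat
FROM <{DEFAULT_GRAPH}>
FROM NAMED <{NAMED_GRAPH}>
WHERE {{
  ?place <http://www.w3.org/2000/01/rdf-schema#label> ?label .
  GRAPH <{NAMED_GRAPH}> {{
    ?place <http://www.w3.org/2003/01/geo/wgs84_pos#lat> ?lat .
  }}
}}
"""


def protocol_url(query, default_graphs=(), named_graphs=()):
    """Option B: build a GET request URL that names the dataset via
    the SPARQL protocol parameters rather than in the query."""
    params = [("query", query)]
    params += [("default-graph-uri", g) for g in default_graphs]
    params += [("named-graph-uri", g) for g in named_graphs]
    return ENDPOINT + "?" + urlencode(params)


url = protocol_url("ASK { ?s ?p ?o }",
                   default_graphs=[DEFAULT_GRAPH],
                   named_graphs=[NAMED_GRAPH])
print(url)
```

The first triple pattern is matched against the default graph, while the pattern inside GRAPH is matched only against the named graph.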

Synthetic graphs are a very useful feature as they provide an abstraction between how the underlying data is managed and how it is queried.

For example, one might use a large number of separate named graphs when managing data, thereby making it easy to merge and manage data from different sources (e.g. a web crawl). Some applications use thousands of very small Named Graphs simply because they’re easier to manage. By using a synthetic graph which exposes all of the data through a SPARQL endpoint as if it were in a single graph, it’s possible to abstract over those details of storage. There are a few stores that support this kind of technique, and it can be pushed further by making the definition of the synthetic graph more flexible, e.g. the set of all graphs that are valid between particular dates, or the set of all graphs whose names match a specific URI pattern. This approach can help abstract away the management and modelling decisions that are necessary for dealing with issues like versioning.
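
Here is a rough sketch of the idea, modelling a store as a mapping from graph URI to a set of triples. The STORE and CREATED data and the synthetic_graph function are hypothetical, and I’m ignoring blank nodes, which a real RDF merge would need to keep apart.

```python
from datetime import date

# Hypothetical store: graph URI -> set of (subject, predicate, object) triples.
STORE = {
    "http://example.org/graphs/crawl/1": {("ex:a", "foaf:name", "A")},
    "http://example.org/graphs/crawl/2": {("ex:a", "foaf:name", "A"),
                                          ("ex:b", "foaf:name", "B")},
    "http://example.org/graphs/manual/1": {("ex:c", "foaf:name", "C")},
}

# Hypothetical per-graph metadata: when each graph was created.
CREATED = {
    "http://example.org/graphs/crawl/1": date(2009, 10, 1),
    "http://example.org/graphs/crawl/2": date(2009, 11, 1),
    "http://example.org/graphs/manual/1": date(2009, 11, 15),
}


def synthetic_graph(store, include=lambda uri: True):
    """Merge the triples of every graph whose URI passes the include test.

    The result is never stored; it is computed on demand at query time.
    """
    merged = set()
    for uri, triples in store.items():
        if include(uri):
            merged |= triples
    return merged


# The merge of all Named Graphs in the store.
everything = synthetic_graph(STORE)

# All graphs whose names match a URI pattern.
crawled = synthetic_graph(
    STORE, lambda uri: uri.startswith("http://example.org/graphs/crawl/"))

# All graphs created between two dates.
november = synthetic_graph(
    STORE, lambda uri: date(2009, 11, 1) <= CREATED[uri] <= date(2009, 11, 30))

print(len(everything), len(crawled), len(november))
```

Note that the triple repeated in both crawl graphs appears only once in the merged result, since the merge is a set of triples rather than a list of quads.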

Named Graphs in SPARQL 1.1

Let’s look at how SPARQL 1.1 might impact the above scenarios. I use “might” advisedly as it’s still early days: we’ve only just had the first public Working Drafts, so the state of play might change.

Section 4.1 of the SPARQL 1.1 Update draft notes that: If an update service is managing some graph store, then there is no presumption that this exactly corresponds to any RDF dataset offered by some query service. The dataset of the query service may be formed from graphs managed by the update service but the dataset requested by the query can be a subset of the graphs in the update service dataset and the names of those graphs may be different. The composition of the RDF dataset may also change.

So basically the set of RDF graphs exposed by a SPARQL 1.1 Update service may differ from the set exposed by a Query service at the same endpoint. This will always be the case if the Query endpoint exposes any synthetic graphs. These, presumably overlapping, sets make sense from the perspective of wanting some flexibility in how data is managed versus how it is queried. It’s likely that we’ll see implementations offer a range of options, with the most likely case being that the “core” set of graphs is identical, but that an additional set may be available for querying.

SPARQL 1.1, as it currently stands, includes a Uniform HTTP Protocol for Managing RDF Graphs. I’m very happy to see this and think that it’s an important part of the picture for publishing RDF data on the web in a RESTful way. As part of the overall Linked Data message we’ve been saying that “your website is your API”; that by assigning clear stable URIs to things in your system and then exposing both human and machine-readable data at those URIs, Linked Data just drops out of the design. And this is also clearly a RESTful approach.

But to make things completely RESTful then we need to not only be able to read data from those URIs, we should be able to update the data at those URIs using the uniform protocol that HTTP defines. I was always a little wary of SPARQL Update because it seemed like it might supplant a more RESTful option, but I’m encouraged by the presence of this working draft that this won’t be the case. But I don’t think the draft goes far enough in a few places: I’d like to have the ability to make changes to individual statements within a graph, as well as just whole graphs, using techniques like Changesets.

The draft currently doesn’t get into the issues surrounding how URIs might be managed on a server, instead deferring that to the implementation. But I think it’s an important topic to explore, so let’s devote some time to it here.

Approaches to Managing Named Graphs on the Web

For the most part the mapping of graph management to the web is uncontroversial: the four HTTP verbs of GET, PUT, POST and DELETE have obvious and intuitive meanings. The subtleties arise out of issues such as how URIs are assigned to graphs, and what each URI identifies.

Client Managed Graph Identifiers

There are two ways that URIs can be assigned to graphs managed in a networked store. The first and simplest is that the client assigns all URIs. To create a new graph and populate it with data, we just PUT to a new URI. Starting from a base URI, distinct from any SPARQL endpoint the service might expose, the client can build out a URI space for the graphs by just PUTting to URIs. In this scenario one really only needs GET, PUT, and DELETE. POST doesn’t have any clear role, but could be used to handle, e.g. submissions of Changesets.
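
As a sketch of what this looks like from the client side, the following builds (but doesn’t send) the three kinds of request using Python’s standard library. The base URI, the helper names and the choice of text/turtle as the media type are all my own assumptions; a real service would document what it accepts.

```python
from urllib.request import Request

# Hypothetical base URI for the graph collection, distinct from any SPARQL endpoint.
BASE = "http://example.org/graphs/"


def put_graph(name, body, content_type="text/turtle"):
    """Create or replace the graph at BASE + name."""
    return Request(BASE + name, data=body.encode("utf-8"), method="PUT",
                   headers={"Content-Type": content_type})


def get_graph(name, accept="text/turtle"):
    """Fetch the current contents of the graph."""
    return Request(BASE + name, method="GET", headers={"Accept": accept})


def delete_graph(name):
    """Remove the graph entirely."""
    return Request(BASE + name, method="DELETE")


# The client decides the URI simply by choosing where to PUT.
req = put_graph("abc", '<http://example.org/a> <http://example.org/p> "x" .')
print(req.get_method(), req.full_url)

# To actually send a request one would pass it to urllib.request.urlopen(req).
```

The client never asks the server for an identifier; the URI space grows as a side effect of the PUT requests.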

Even with the simple style of client-side URIs for graphs, there’s one wrinkle we need to address. As I explained at the start of the post there may be several different reasons why someone is using Named Graphs. Using the graph identifier to keep track of the source of the data is a fairly common requirement. So this means you have several options for how URIs might be supplied:

  1. /graphs/abc: here the client is building out a collection of named graphs whose identifiers all share a common prefix, with each having a suffix. We may end up with a relatively flat structure or a hierarchical one, e.g. /graphs/abc/123. There’s no implicit requirement that graph URIs that have a hierarchical arrangement have any formal relationship, but this does have the useful property that the URIs are hackable.
  2. /graphs/http://www.example.org/abc: this is similar to the above except the unique portion of each graph name is a complete URI. This would probably need to be encoded but I’ve omitted that for readability. This approach is useful when using Named Graphs to track the source URI of a graph.
  3. /graphs?graph=http://www.example.org/abc: this is a variant of the second option but moving the graph identifier out into a parameter rather than allowing it to be put into the path info of the base URI. I think typically the value of the parameter would be a full rather than a relative URI, but a server could support resolving URIs against a base.

It’s clear that while Option 1 provides nice clean identifiers for graphs, ultimately it’s limiting for scenarios where the graph may have another “natural” identifier, e.g. its source. For Options 2 & 3 we have to deal with URL encoding (especially if the URI itself contains parameters). Personally, of the two alternate options I think 3 is nicer, if only aesthetically. I’m not aware of any problems or limitations with performing an HTTP PUT to a URI with parameters: it is the full Request URI, including any parameters, that identifies the resource being created or updated.
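
To show the encoding involved, here is a small sketch that builds the request URI for each of the three options using Python’s urllib.parse. The base URI and the function names are hypothetical.

```python
from urllib.parse import quote, urlencode

# Hypothetical base URI of the graph collection.
BASE = "http://example.org/graphs"


def option1(name):
    """Short, possibly hierarchical name; '/' is kept as a path separator."""
    return f"{BASE}/{quote(name)}"


def option2(graph_uri):
    """Complete URI in the path; every reserved character must be escaped."""
    return f"{BASE}/{quote(graph_uri, safe='')}"


def option3(graph_uri):
    """Complete URI as the value of a 'graph' query parameter."""
    return f"{BASE}?{urlencode({'graph': graph_uri})}"


source = "http://www.example.org/abc?x=1&y=2"
print(option1("abc/123"))
print(option2(source))
print(option3(source))
```

In both Options 2 and 3 the “?”, “=” and “&” of the source URI have to be percent-encoded, otherwise they would be read as part of the request URI itself rather than as part of the graph name.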

Server Assigned Graph Identifiers

A server managing Named Graphs may not allow clients to assign graph identifiers. For example, the server may want to enforce a particular naming convention for graphs. This might be useful for clients too, e.g. if they want to throw some data into a named graph as a scratch store. What restrictions does this scenario impose?

Firstly it would require the client to POST the data to be stored to a generic graphs collection (/graphs). The server would then determine the graph URI, and return an HTTP 201 response with a Location header indicating where the client can find the data it has just stored. This way the client would know where to find the data and could then use further requests (GET/PUT/POST/DELETE) to manage it.

To support tracking of the source of a graph, one might allow a graph parameter to be added to the URI. And, to avoid a client having to maintain a local mapping from the original graph URI to the stored alias, the server could store the value of the graph parameter as metadata associated with the graph it creates. The server could support a GET request on /graphs?graph=X, returning a 302 redirect to the URI which is acting as a local alias for graph X. The client could then PUT/POST/DELETE that resource. If a client sent a repeated POST request, identifying the same graph URI, then the server could allow this, and return a 303 See Other response rather than a 201.
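
The interaction described above can be sketched as a small in-memory model of the server’s behaviour. This is only an illustration: the GraphCollection class, the numeric identifiers it hands out, and the decision to leave stored data untouched on a repeated POST are my own assumptions, and a real server might well merge or replace the data instead.

```python
import itertools


class GraphCollection:
    """In-memory model of a /graphs collection that assigns its own graph URIs."""

    def __init__(self, base="/graphs"):
        self.base = base
        self._ids = itertools.count(1)
        self.graphs = {}   # local graph URI -> stored data
        self.aliases = {}  # source graph URI -> local graph URI

    def post(self, data, source=None):
        """Model POST /graphs, or POST /graphs?graph=<source>."""
        if source is not None and source in self.aliases:
            # Repeated POST for a known source: point at the existing alias.
            return 303, {"Location": self.aliases[source]}
        local = f"{self.base}/{next(self._ids)}"
        self.graphs[local] = data
        if source is not None:
            # Remember the original URI as metadata about the new graph.
            self.aliases[source] = local
        return 201, {"Location": local}

    def get_by_source(self, source):
        """Model GET /graphs?graph=<source>."""
        if source in self.aliases:
            return 302, {"Location": self.aliases[source]}
        return 404, {}


collection = GraphCollection()
first = collection.post("...triples...", source="http://www.example.org/abc")
again = collection.post("...triples...", source="http://www.example.org/abc")
lookup = collection.get_by_source("http://www.example.org/abc")
print(first, again, lookup)
```

The client only ever needs to remember the source URI; the server-assigned alias can always be rediscovered through the redirect.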

It’s also possible to support a hybrid approach in which a client may PUT to any URI with a base of /graphs but disallow use of graph ids that start with http://. For those URIs, the server could require that a client let it assign the id, supporting the graph parameter as described earlier in this section.

There’s no right or wrong way here. The differences fall out of the different ways we can map graph management onto the HTTP protocol. While a lot is fixed (methods, response codes and their semantics) if we are aiming to be RESTful, there are still some degrees of freedom with which to experiment with different mappings. The SPARQL 1.1 uniform protocol specification doesn’t address this, so perhaps there’s room for the community to standardise best practices or conventions. However I think it’d be useful to at least see some informational text in the document.

Conclusions

Named Graphs are an important part of the overall technical framework for managing, publishing and querying RDF and Linked Data, and it’s important to understand the trade-offs in different approaches to using them. Hopefully this document is a step in the right direction.

If anyone has any strong opinions on how they think Named Graphs should be managed RESTfully, then please feel free to comment on this posting. I’m very interested to hear your thoughts.

One thing that interests me is: how can we use Named Graphs to support publishing of Linked Data? That’s something I’ll follow up on in a separate post.

13 thoughts on “Managing RDF Using Named Graphs”

  1. I have some moderately strong opinions based on authoring an implementation that is in use for daily production work. See http://bitbucket.org/gavin/tenuki/

    My current implementation uses your “2” method:

    /graphs/http://www.example.org/abc

    There have been some issues with clients failing to escape the http:// URIs, mostly folks using curl by hand to load graphs. On the other hand it maps reasonably nicely into the Java Restlet API. Caching and ETag support is much simpler so far with non query string based URIs for graph resources. Adding support for method (3), a query string based API, would be possible but not nearly as simple.

    The majority of the implementation of method (2) can be expressed in 2 lines of Java:
    router.attach("/graphs/", GraphsResource.class);
    router.attach("/graphs/{graphName}", GraphResource.class);

    Routing based on a query parameter would require writing a custom routing class. Wouldn’t be hard, but most REST web frameworks seem to be leaning that way. Pylons RestControllers work the same way.

    The client for 2 is also nearly as simple as for 3 if not as simple. Our Python client:

    graph = graph.serialize()
    # urllib.quote is the Python 2 spelling; in Python 3 it is urllib.parse.quote
    request_url = self.url + "graphs/" + urllib.quote(graph_uri, safe="")
    log.debug("Storing graph %s at URL %r", graph_uri, request_url)
    resp, content = self.http.request(request_url, method="POST",
                                      body=graph,
                                      headers={"Content-Type": "application/rdf+xml"})

    Anyway, look forward to more conversation on the topic.

    1. Hi Gavin,

      Great feedback thanks. The issue of how easy it is to support the different styles on the client, server, and implications for proxy servers (e.g. for caching) is definitely something that needs closer attention.

      The ?graph= style is easier to handle on the client side if you’re using curl, wget, or HTML forms. These clients will more readily handle the escaping if the graph URI is in a parameter. If you’re coding something up then I’d argue that using a query string or constructing a complete URI is just as easy either way. Depending on your HTTP client library, you might not need to care about encoding of parameters, so the code *may* be slightly cleaner, but that’s clearly debatable.

      On the server side, I’m not sure I see much of a distinction either. In my experience it’s just as easy (and in some cases easier) to pull out a parameter from a request than to split up the request URI. This is with both Ruby and Java frameworks. In the latter case I’ve not used Restlet, but I’m pretty sure that Jersey handles this OK. You mention that “routing based on query parameter” is slightly harder, and I agree, although I think most frameworks are leaning towards supporting binding to all aspects of the URI? I’ll have to do some more digging.

      Your comments led me to dig into the public-rdf-dawg mailing list to see how the Working Group have been discussing this issue. There is some interesting, related discussion on this thread about REST and HTTP Update and these suggestions for HTTP updates. Haven’t digested it all yet, but the main issues seem to be support for PUT with parameters in web frameworks, and the impact of parameters on caching of requests.

      Cheers,

      L.

  2. A few questions:
    – is TriX compatible with RDF/XML ?
    – how to include the same triple in several different named graphs (syntax in RDF/XML is preferred)
    – is it possible to declare that a graph includes other graphs? (i.e a kind of supergraph – subgraphs)

    1. Hi Olivier,

      I’ll do my best to answer your questions.

      1. Is TriX compatible with RDF/XML?

      TriX is a completely separate XML serialization for RDF. As an extension it includes support for grouping triples into graphs. There are some examples here: http://sw.nokia.com/trix/examples.xml

      So from a syntax point of view, the two are incompatible. From an RDF model point of view, TriX can encode things (like named graphs) that you can’t encode in RDF/XML.

      2. how to include the same triple in several different named graphs (syntax in RDF/XML is preferred)

      As RDF/XML doesn’t allow you to indicate that a triple is included in a specific graph, then there’s no way I can provide an example using that syntax I’m afraid! 🙂

      Using TriG notation (which I prefer) you could write:


      @prefix : <http://www.example.org/doc#> .
      @prefix foaf: <http://xmlns.com/foaf/0.1/> .

      # :me is a placeholder subject for the example
      :G1 { :me foaf:name "Leigh Dodds" . }
      :G2 { :me foaf:name "Leigh Dodds" . }

      This asserts the same triple in two different graphs. If we had a synthetic graph which was the RDF merge of these two, then it would contain a single triple.

      3. is it possible to declare that a graph includes other graphs? (i.e a kind of supergraph – subgraphs)

      I don’t think either TriG or TriX address this, but then I’m not sure I’d expect them to as they’re primarily a serialization format. I think what you’re asking here is how we can associate metadata with graphs, e.g. to indicate that they have some relationship. I’m sure there has been work done on this, but I’m not aware of it.

      Metadata about named graphs, and how to manage that separately from the graphs themselves, is something I plan to cover in my next posting.

    1. Hi Tony,

      I’m *very* wary about using the word “namespace” in conjunction with Named Graphs 🙂

      In the sense that they provide some context, then yes, I suppose they are similar.

      But XML namespaces typically indicate some collection of elements and attributes that have been defined by a specific source. So the closest analogy to a namespace in the RDF world is a schema or vocabulary. There’s no implication for named graphs that the data in the graph is all from one source, or uses one vocabulary, or is about one thing. There are many ways they can be used.

  3. Re namespacing, I wasn’t implying any semantic overloading. Nor, I think, is that necessarily true in an XML context. Far from it.

    Basically namespacing is a mechanism that allows for a simple modularization so that two separate peer entities (or structures) can be managed within a common application space.

    In that sense I do think that “namespacing” is exactly what the “named” qualifier in “Named Graphs” is bringing to the party. It allows for separate RDF graphs to be “named” (hence “namespaced”) and to coexist within a single application.

    How that “namespace” is applied – i.e. the semantic overtones attributed to it – is another matter.

    But just as XML namespaces were developed subsequently to XML 1.0, and XML 1.0 had no namespaces, so RDF namespaces (or Named Graphs) are not represented in RDF 1.0 (aka RDF).

    Tony

    1. Hi Peter,

      I’m not sure I see a big overlap between Resource Maps and Named Graphs. Resource Maps are an RDF resource that describes an aggregation of other resources, whereas a Named Graph provides context to a set of triples. I suppose in a loose sense, the Resource Map is also providing “context”, but I don’t think the implementation or use cases really overlap. We might use one Named Graph for each Resource Map, but we could just as easily store several of them in a Named Graph.

      I did wonder if a Resource Map could be considered as a serialization of a Named Graph, but I don’t think that makes a lot of sense. One could probably implement a means to populate a collection of Named Graphs from a set of Resource Maps, but I think this would be messing with the semantics of both.

      For me Resource Maps are just extra assertions, describing additional relationships between resources, whereas Named Graphs contextualise data in a different, more “out of band” way.

  4. The “synthetic graphs” idea seems very similar to the notion of “views” in relational databases and there is an implementation for Sesame called Networked Graphs which basically does that. But I think you need to combine this with access control management on the graph level (I think you can do that in Virtuoso). So for open external access you only offer a set of synthetic graphs for querying. But an application that has to query+update the data would get access to the original graphs for both operations. Otherwise you get some nasty problems, comparable to updates on views in RDBs.

    Regarding subgraphs: http://www.w3.org/2004/03/trix/rdfg-1/ has a predicate for just that. But I wonder if explicitly stating this relationship is the best way of doing it. You can either associate statements with one graph only and then group those graphs together into larger graphs which don’t actually have any statements associated with them in the store but are just linked to the subgraphs with this predicate. Or you can add statements to several graphs at once where some graphs contain fewer statements and are for more fine-grained purposes. The store could still store this efficiently by using statement IDs and associating those with different graph URIs instead of storing the statement multiple times. I’m sure that’s what some stores do anyway.
