RDF Dataset Notifications

Like many people in the RDF community I’ve been thinking about the issue of syndicating updates to RDF datasets. If we want to support truly distributed aggregation and processing of data then we need an efficient way to share updates.

There’s been a lot of experimentation around different mechanisms, and PubSubHubbub seems to be a current favourite approach. I’ve been playing with it myself recently and have hacked up a basic push mechanism around Talis Platform stores. More on that another time.

But I’ve not yet seen any general discussion about the merits of different approaches, or even discussion about what it is that we really want to syndicate.

So let’s take it from the top.

It seems to me that there are basically three broad categories of information we want to syndicate:

  • Dataset Notifications — has a new dataset been added to a directory? has one been updated in some way, e.g. through the addition or removal of triples?
  • Resource Notifications — what resources have been added or modified within a dataset?
  • Triple Notifications — what triples have been changed within a dataset?

Each one of these categories is syndicating a different level of detail and may benefit from a different technical approach. For example there’s a different volume of information being exchanged if one is simply notifying dataset changes vs every triple. We’ll also likely need a different format or syntax.
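To make the difference in volume concrete, here is a minimal Python sketch of the three levels. All of the class and field names are hypothetical, chosen for illustration rather than taken from any existing vocabulary or API. The `summarise` function rolls triple-level changes up to the coarser levels. It marks every affected resource as "modified", because triples alone don't say whether a resource is new.

```python
from dataclasses import dataclass
from typing import List, Literal, Tuple

# A triple as plain strings: (subject, predicate, object). Illustrative only.
Triple = Tuple[str, str, str]


@dataclass(frozen=True)
class DatasetNotification:
    dataset: str
    change: Literal["added", "updated"]


@dataclass(frozen=True)
class ResourceNotification:
    dataset: str
    resource: str
    change: Literal["added", "modified"]


@dataclass(frozen=True)
class TripleNotification:
    dataset: str
    triple: Triple
    change: Literal["added", "removed"]


def summarise(changes: List[TripleNotification]):
    """Roll triple-level changes up to resource and dataset level."""
    resources = {}
    datasets = {}
    for c in changes:
        subject = c.triple[0]
        # One resource notification per (dataset, subject) pair.
        resources[(c.dataset, subject)] = ResourceNotification(
            c.dataset, subject, "modified"
        )
        # One dataset notification per dataset, however many triples changed.
        datasets[c.dataset] = DatasetNotification(c.dataset, "updated")
    return list(resources.values()), list(datasets.values())


ds = "http://example.org/dataset"
triple_changes = [
    TripleNotification(ds, ("http://example.org/a", "ex:name", "A"), "added"),
    TripleNotification(ds, ("http://example.org/a", "ex:age", "42"), "added"),
    TripleNotification(ds, ("http://example.org/b", "ex:name", "B"), "removed"),
]
resource_changes, dataset_changes = summarise(triple_changes)
```

Here three triple notifications collapse into two resource notifications and a single dataset notification, which is the kind of volume difference a syndication format has to cope with.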

Actually there may be a fourth category: notifications of graph structural changes to a dataset, e.g. adding or removing named graphs. I’ve not yet seen anyone exploring that level of syndication, but suspect it may be very useful.

Now, for each of those different categories, there are two different styles of notifications: push or pull. Pull mechanisms are typified by feed subscriptions, crawlers, or repeated queries of datasets. Push mechanisms are usually based on some form of publish-subscribe system.
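The two styles can be sketched side by side with a toy in-memory change log. This is an illustration only: the `ChangeLog` class and its methods are made up for the example, and a real system would put a feed or a hub where the Python method calls are.

```python
from typing import Callable, List, Tuple

Entry = Tuple[int, str]  # (sequence number, message)


class ChangeLog:
    """In-memory stand-in for a dataset's stream of notifications."""

    def __init__(self) -> None:
        self._entries: List[Entry] = []
        self._subscribers: List[Callable[[int, str], None]] = []

    def subscribe(self, callback: Callable[[int, str], None]) -> None:
        """Push style: register a callback to be told about each change."""
        self._subscribers.append(callback)

    def publish(self, message: str) -> int:
        """Record a change and push it to every subscriber."""
        seq = len(self._entries) + 1
        self._entries.append((seq, message))
        for callback in self._subscribers:
            callback(seq, message)
        return seq

    def poll(self, since: int = 0) -> List[Entry]:
        """Pull style: the consumer asks for everything after `since`."""
        return [(s, m) for s, m in self._entries if s > since]


log = ChangeLog()
pushed: List[Entry] = []
log.subscribe(lambda seq, msg: pushed.append((seq, msg)))

log.publish("triples added")
log.publish("triples removed")

# A pull consumer that has already seen entry 1 asks for anything newer.
pulled = log.poll(since=1)
```

The push consumer hears about both changes as they happen, while the pull consumer only learns about the second one when it next asks.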

Given those different scenarios, we can take a look at some existing technologies and categorise them. I’ve done just that and published a simple Google spreadsheet with my first stab at this analysis. (This probably needs a little more context in places, but hopefully the classifications are fairly obvious.)

PubSubHubbub seems to offer the most flexibility in that it mixes a standard Pull based Feed architecture with a Push based subscription system. Clearly worthy of the attention it’s getting. Other technologies offer similar features but are optimised for different purposes.
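For a feel of the Push half, here is a rough sketch of the form-encoded body a subscriber posts to a hub to ask for updates to a feed. The `hub.mode`, `hub.topic` and `hub.callback` parameter names come from the PubSubHubbub specification. The helper function name and both URLs are made up for the example, and depending on the version of the spec a hub may expect further parameters that are left out here. Nothing is actually sent over the network.

```python
from urllib.parse import urlencode


def build_subscription_body(topic_url: str, callback_url: str) -> str:
    """Return the form-encoded body of a hub subscription request.

    Hypothetical helper: it only builds the request body and does not
    contact a hub.
    """
    return urlencode(
        {
            "hub.mode": "subscribe",       # ask the hub to start pushing
            "hub.topic": topic_url,        # the feed we want updates for
            "hub.callback": callback_url,  # where the hub should deliver them
        }
    )


# Placeholder URLs, not real endpoints.
body = build_subscription_body(
    "http://example.org/dataset/changes.atom",
    "http://subscriber.example.net/callback",
)
```

The topic is still an ordinary feed URL, which is why the same feed can be consumed by plain Pull clients as well.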

However that doesn’t mean that PubSubHubbub is just perfect out of the box. For example it’s worth noting that consumers aren’t required to use the Push aspects of the system; they can just subscribe to the feeds. So you need to be prepared to scale a PubSubHubbub system just as you would a Pull based Feed.

It may also be sub-optimal for systems which are syndicating out high-volume Triple level updates. The Feeds can potentially get very large and the hub system needs to be prepared to handle large exchanges. It also doesn’t say anything about how to catch up on, or recover from, missed updates. A hybrid approach may be required to cover all use cases and scenarios and to produce a robust system.
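One possible shape for such a hybrid, sketched under some loud assumptions: every update carries a sequence number (PubSubHubbub itself defines no such thing), and the publisher offers some Pull endpoint that returns everything after a given number. The `CatchUpSubscriber` class and the `fetch_since` callable are hypothetical names for the example.

```python
from typing import Callable, List, Tuple

Entry = Tuple[int, str]  # (sequence number, payload); numbering is assumed


class CatchUpSubscriber:
    """Accepts pushed updates and falls back to a pull when it spots a gap."""

    def __init__(self, fetch_since: Callable[[int], List[Entry]]) -> None:
        self._fetch_since = fetch_since  # stand-in for a feed request
        self.last_seen = 0
        self.applied: List[Entry] = []

    def on_push(self, seq: int, payload: str) -> None:
        if seq <= self.last_seen:
            return  # duplicate or stale delivery, ignore it
        if seq > self.last_seen + 1:
            # Gap detected: pull everything we have not yet seen.
            for missed_seq, missed_payload in self._fetch_since(self.last_seen):
                self._apply(missed_seq, missed_payload)
        if seq == self.last_seen + 1:
            # Either there was no gap, or the pull stopped just short of seq.
            self._apply(seq, payload)

    def _apply(self, seq: int, payload: str) -> None:
        if seq > self.last_seen:
            self.applied.append((seq, payload))
            self.last_seen = seq


# The publisher's full history, as a pull request would see it.
feed: List[Entry] = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
subscriber = CatchUpSubscriber(lambda since: [e for e in feed if e[0] > since])

subscriber.on_push(1, "a")
subscriber.on_push(4, "d")  # pushes 2 and 3 never arrived
```

After the second push the subscriber has pulled the two missed entries and applied all four in order. The trade-off is that the publisher has to keep enough history around to answer those catch-up requests.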

In order to be able to properly compare different approaches we need to understand their respective trade-offs. I’m hoping this posting contributes to that discussion and can complement the ongoing community experimentation.

Am interested to hear your thoughts.

4 thoughts on “RDF Dataset Notifications”

    1. Hi Inigo,

      Yes but only very briefly so far. I plan to look a little closer, I have listed it on the spreadsheet. Let me know if it’s misclassified!


  1. Hi Leigh,
    There are some important (I think) user requirements that might add some extra colour to your analysis:

    * atomic updates of sets of triples: suppose I add an OWL class declaration, I’d like consumers (push or pull) to see all of the triples in the class declaration or none

    * bi-directionality: sync with the master copy, go off-line, do updates, reconnect, re-sync.

    * conflict detection: if I get updates from more than one location, I’d at least like to know if there are conflicting updates (a la git merge).

    The third, and possibly second, might be protocol layers on top of the basic update mechanism, I suppose, but it may be that the mechanism can help or hinder those higher levels.

