Monthly Archives: April 2010

RDF Dataset Notifications

Like many people in the RDF community I’ve been thinking about the issue of syndicating updates to RDF datasets. If we want to support truly distributed aggregation and processing of data then we need an efficient way to share updates.

There’s been a lot of experimentation around different mechanisms, and PubSubHubbub seems to be a current favourite approach. I’ve been playing with it myself recently and have hacked up a basic push mechanism around Talis Platform stores. More on that another time.

But I’ve not yet seen any general discussion about the merits of different approaches, or even discussion about what it is that we really want to syndicate.

So let’s take it from the top.

It seems to me that there are basically three broad categories of information we want to syndicate:

  • Dataset Notifications — has a new dataset been added to a directory? has one been updated in some way, e.g. through the addition or removal of triples?
  • Resource Notifications — what resources have been added or modified within a dataset?
  • Triple Notifications — what triples have been changed within a dataset?

Each one of these categories is syndicating a different level of detail and may benefit from a different technical approach. For example there’s a different volume of information being exchanged if one is simply notifying dataset changes vs every triple. We’ll also likely need a different format or syntax.

Actually there may be a fourth category: notifications of graph structural changes to a dataset, e.g. adding or removing named graphs. I’ve not yet seen anyone exploring that level of syndication, but suspect it may be very useful.

Now, for each of those different categories, there are two different styles of notifications: push or pull. Pull mechanisms are typified by feed subscriptions, crawlers, or repeated queries of datasets. Push mechanisms are usually based on some form of publish-subscribe system.

Given those different scenarios, we can take a look at some existing technologies and categorise them. I’ve done just that and published a simple Google spreadsheet with my first stab at this analysis. (This probably needs a little more context in places but hopefully the classifications are fairly obvious).

PubSubHubbub seems to offer the most flexibility in that it mixes a standard Pull based Feed architecture with a Push based subscription system. Clearly worthy of the attention it’s getting. Other technologies offer similar features but are optimised for different purposes.

However that doesn’t mean that PubSubHubbub is perfect out of the box. For example it’s worth noting that consumers aren’t required to use the Push aspects of the system; they can just subscribe to the feeds. So you need to be prepared to scale a PubSubHubbub system just as you would a Pull based Feed.

It may also be sub-optimal for systems which are syndicating out high-volume Triple level updates. The Feeds can potentially get very large and the hub system needs to be prepared to handle large exchanges. It also doesn’t say anything about how to catch up on or recover from missed updates. A hybrid approach may be required to cover all use cases and scenarios and to produce a robust system.
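To make the hybrid idea a little more concrete, here is a minimal sketch of a consumer that accepts pushed updates but falls back to a pull request when it notices a gap. Everything in it is an assumption for illustration: the sequence numbers, the `Update` and `Consumer` names, and the `fetch_since` pull call are hypothetical and are not part of PubSubHubbub or any other system mentioned here.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Update:
    seq: int      # hypothetical, monotonically increasing sequence number
    payload: str  # e.g. a serialised set of triple changes


@dataclass
class Consumer:
    # fetch_since(seq) stands in for a hypothetical pull endpoint returning
    # every update with a sequence number greater than seq, in order.
    fetch_since: Callable[[int], List[Update]]
    last_seq: int = 0
    applied: List[str] = field(default_factory=list)

    def on_push(self, update: Update) -> None:
        """Handle a pushed update, pulling first if earlier ones were missed."""
        if update.seq <= self.last_seq:
            return  # duplicate or stale delivery, ignore it
        if update.seq > self.last_seq + 1:
            # Gap detected: catch up through the pull channel.
            for missed in self.fetch_since(self.last_seq):
                self._apply(missed)
        if update.seq > self.last_seq:
            # Not already covered by the catch-up, so apply it now.
            self._apply(update)

    def _apply(self, update: Update) -> None:
        self.applied.append(update.payload)
        self.last_seq = update.seq


# Usage: updates 2 and 3 never arrive by push, so they are pulled instead.
log = [Update(i, f"change-{i}") for i in range(1, 5)]
consumer = Consumer(fetch_since=lambda seq: [u for u in log if u.seq > seq])
consumer.on_push(log[0])
consumer.on_push(log[3])
print(consumer.applied)
```

The design choice worth noting is that the publisher still has to serve the pull side, which is the same scaling concern raised above for plain feeds.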

In order to be able to properly compare different approaches we need to understand their respective trade-offs. I’m hoping this posting contributes to that discussion and can complement the ongoing community experimentation.

Am interested to hear your thoughts.

Linked Data Patterns: a free book for practitioners

A few months ago Ian Davis and I were chatting about some new approaches to helping practitioners climb the learning curve around Linked Data, RDF and related technologies. We were both keen to help communicate the value of Linked Data, share knowledge amongst practitioners, and to encourage the community to converge on best practices. We kicked around a number of different ideas in this vein.

For example, Ian was keen to provide guidance as to how to mix and match different vocabularies to achieve a particular goal, like describing a person or a book. Having a ready reference containing recipes for these common tasks would address a number of goals. He’s ended up exploring that idea further in the recently released Schemapedia. If you’ve not seen it yet, then you should take a look. It provides a really nice way to navigate through RDF vocabularies and explore their intersections.

The other thing that we discussed was Design Patterns. I’ve been a Design Pattern nut for some time now. Discovering them was something of a rite of passage for me during my Master’s dissertation. I’d spent weeks revising and honing a design for the distributed system I was building, only to discover that what I’d produced was already documented as a design pattern in an obscure corner of the research literature. While I’d clearly reinvented the wheel, the discovery not only provided external validation for what I’d produced, but also neatly illustrated the benefit of using design patterns to share knowledge and experience within a community. Knowing when to apply particular patterns is a key skill for any developer, and the terms are a part of the design vocabulary we all share.

I suggested to Ian that we explore writing some patterns for Linked Data: patterns for assigning identifiers, modelling data, and application development. We experimented with this for a while but ended up parking the discussion for a few months whilst other priorities intervened.

I recently revived the project. It’s pretty clear to me that there’s still a big skills gap between experienced practitioners and those seeking to apply the technology. I think the current situation is reminiscent of the move of OO programming from the research lab out into the developer community; design patterns played a key role there too.

Ian and I have decided to share this with the community as an on-line book, a pattern catalogue that covers a range of different use cases. We started out with about half a dozen patterns, but over the last few weeks I’ve expanded that figure to thirty. I’ve still got a number on my short-list (more than a dozen, I think) but it’s time to start sharing this with the community. The work won’t ever be complete as the space is still unfolding; it will just get refined over time.

You can read the book online at http://patterns.dataincubator.org.

The work is licensed under a Creative Commons Attribution license so you’re free to use it as you see fit, but please attribute the source. If you want to download it, then there’s a PDF, and an EPUB too. We’re using DocBook for the text so there will be a number of different access options.

I’ll stress that this is a very early draft, so be gentle. But we’d love to hear your comments.

A Tour of the OS 50k Gazetteer Linked Data

The Ordnance Survey have today published the first in a series of open datasets. In addition to the administrative geography that was published last year, the Linked Data available from data.ordnancesurvey.co.uk now includes data from their 1:50 000 Scale Gazetteer. In this post I thought I’d give an overview of the dataset and summarise what it contains.

Analysis

The Gazetteer identifiers all have a base URL of:

http://data.ordnancesurvey.co.uk/id/50kGazetteer/.

The base URL is suffixed with a unique numeric code. I’m not sure where this originates from, and it’s not present in the underlying data.

The dataset consists of 2,368,655 triples (individual facts) asserted over 259,080 unique resources. So about 9 triples per resource. Here’s how the properties break down:

http://www.w3.org/1999/02/22-rdf-syntax-ns#type 259080
http://xmlns.com/foaf/0.1/name 259080
http://www.w3.org/2000/01/rdf-schema#label 259080
http://data.ordnancesurvey.co.uk/ontology/spatialrelations/northing 259080
http://data.ordnancesurvey.co.uk/ontology/spatialrelations/easting 259080
http://data.ordnancesurvey.co.uk/ontology/50kGazetteer/featureType 259080
http://data.ordnancesurvey.co.uk/ontology/50kGazetteer/oneKMGridReference 259080
http://data.ordnancesurvey.co.uk/ontology/50kGazetteer/twentyKMGridReference 259080
http://data.ordnancesurvey.co.uk/ontology/50kGazetteer/mapReference 296015

The first few properties are labels and a type for each resource. The next two predicates are from the OS Spatial Relations ontology, providing the Eastings and Northings for each feature. The remaining four predicates provide a “feature type” and OS map & grid references. There are slightly more map references than resources, so some resources have more than one such property, presumably because they’re large enough to span more than one map. You can see that there are no links to other datasets as yet, or lat/long co-ordinates.

Let’s look closer at some of the predicates. For the RDF types, I discovered that every resource has the same type: they’re all instances of a “Named Place”:
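As a quick sanity check on those figures, the arithmetic can be reproduced from the property counts listed above. This is only a sketch: the prefixed names are shorthand for the full predicate URIs, and the reading of the surplus map references assumes every resource has at least one.

```python
# Property counts as listed above (prefixed shorthand -> number of triples).
property_counts = {
    "rdf:type": 259080,
    "foaf:name": 259080,
    "rdfs:label": 259080,
    "spatial:northing": 259080,
    "spatial:easting": 259080,
    "gaz:featureType": 259080,
    "gaz:oneKMGridReference": 259080,
    "gaz:twentyKMGridReference": 259080,
    "gaz:mapReference": 296015,
}
resources = 259080

total_triples = sum(property_counts.values())
triples_per_resource = total_triples / resources

# Map references beyond one per resource (assumes each resource has at least one).
extra_map_refs = property_counts["gaz:mapReference"] - resources

print(total_triples)                   # 2368655
print(round(triples_per_resource, 2))  # 9.14
print(extra_map_refs)                  # 36935
```

So the per-property counts do add up to the reported total, and there are 36,935 map references over and above one per resource.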

http://data.ordnancesurvey.co.uk/ontology/50kGazetteer/NamedPlace.

Presumably then the detailed classification for the different types of landscape feature is present in the “feature type” predicate. A SPARQL query to count and group the values for that predicate gives me:

http://data.ordnancesurvey.co.uk/ontology/50kGazetteer/Other 128662
http://data.ordnancesurvey.co.uk/ontology/50kGazetteer/OtherSettlement 41228
http://data.ordnancesurvey.co.uk/ontology/50kGazetteer/Farm 34723
http://data.ordnancesurvey.co.uk/ontology/50kGazetteer/WaterFeature 24425
http://data.ordnancesurvey.co.uk/ontology/50kGazetteer/HillOrMountain 14524
http://data.ordnancesurvey.co.uk/ontology/50kGazetteer/ForestOrWood 8708
http://data.ordnancesurvey.co.uk/ontology/50kGazetteer/Antiquity 5252
http://data.ordnancesurvey.co.uk/ontology/50kGazetteer/Town 1259
http://data.ordnancesurvey.co.uk/ontology/50kGazetteer/RomanAntiquity 237
http://data.ordnancesurvey.co.uk/ontology/50kGazetteer/City 62
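The query behind those counts isn’t shown here, so the following is only a guess at what an equivalent might look like. It assumes an endpoint that supports SPARQL 1.1 aggregates (`COUNT` and `GROUP BY`), and the endpoint address is a placeholder rather than the real service URL. The Python wrapper simply builds a SPARQL Protocol GET URL; it doesn’t send the request.

```python
from urllib.parse import urlencode

# Assumed SPARQL 1.1 aggregate query; not necessarily the query used above.
FEATURE_TYPE_COUNTS = """
PREFIX gaz: <http://data.ordnancesurvey.co.uk/ontology/50kGazetteer/>

SELECT ?featureType (COUNT(?uri) AS ?total)
WHERE {
  ?uri gaz:featureType ?featureType .
}
GROUP BY ?featureType
ORDER BY DESC(?total)
"""


def build_query_url(endpoint: str, query: str) -> str:
    """Build a SPARQL Protocol GET URL with the query URL-encoded."""
    return endpoint + "?" + urlencode({"query": query.strip()})


# Hypothetical endpoint address; substitute the real one for the store.
url = build_query_url("http://example.org/sparql", FEATURE_TYPE_COUNTS)
print(url)
```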

We can see that 128,662 resources (49% of total) are simply “Other” with another 41,228 being “Other Settlement”; not that inspiring! The rest of the feature types are more interesting, and give us some very basic data on various geographic features. The Roman Antiquity features piqued my interest; Hadrian’s Wall has the following identifier (click to see the data):

http://data.ordnancesurvey.co.uk/id/50kGazetteer/106584

The values for the Easting and Northing properties should be obvious, so I’ll skip over those. The remaining properties are all map references, and the values of these are all resources. So the Gazetteer has begun assigning URIs to all of the 1KM and 20KM grid references, as well as to each of the OS Landranger maps. Here are some sample URLs for each, taken from the description of Hadrian’s Wall:

http://data.ordnancesurvey.co.uk/id/1kmgridsquare/NY3359
http://data.ordnancesurvey.co.uk/id/20kmgridsquare/NY24
http://data.ordnancesurvey.co.uk/id/OSLandrangerMap/85

The URIs seem predictable and can probably be derived from data found elsewhere. Unfortunately, no further data has been included about these resources. I believe they are place-holders for data that has yet to be released.
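If the URIs really are as predictable as they look, building them from a grid reference or map number is just string templating. The sketch below assumes the patterns seen in the three samples above hold across the whole dataset, which hasn’t been verified; the function names are made up for illustration.

```python
BASE = "http://data.ordnancesurvey.co.uk/id/"


def one_km_square_uri(grid_ref: str) -> str:
    """URI for a 1KM grid square, e.g. 'NY3359' (pattern assumed from the sample)."""
    return f"{BASE}1kmgridsquare/{grid_ref}"


def twenty_km_square_uri(grid_ref: str) -> str:
    """URI for a 20KM grid square, e.g. 'NY24' (pattern assumed from the sample)."""
    return f"{BASE}20kmgridsquare/{grid_ref}"


def landranger_map_uri(sheet: int) -> str:
    """URI for an OS Landranger map sheet, e.g. 85 (pattern assumed from the sample)."""
    return f"{BASE}OSLandrangerMap/{sheet}"


# Rebuild the three sample URIs from the description of Hadrian's Wall.
print(one_km_square_uri("NY3359"))
print(twenty_km_square_uri("NY24"))
print(landranger_map_uri(85))
```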

Overall the data in the Gazetteer is pretty sparse but presumably it will become much richer once more OS data is released. Latitudes and longitudes are something that I’d particularly like to see added. There’s an opportunity here for someone to link up these resources with pages in Wikipedia & resources in DBpedia.

Sample Queries

If you want to play with the data, here are a couple of SPARQL queries to get you started. The first retrieves 10 features classified as Roman Antiquities:


PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX spatial: <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/>
PREFIX gaz: <http://data.ordnancesurvey.co.uk/ontology/50kGazetteer/>

SELECT ?uri ?label ?easting ?northing ?one ?twenty ?map 
WHERE {
  ?uri 
    #filter on type
    gaz:featureType gaz:RomanAntiquity;

    #bind everything we want to return
    rdfs:label ?label;
    spatial:easting ?easting;
    spatial:northing ?northing;
    gaz:oneKMGridReference ?one;
    gaz:twentyKMGridReference ?twenty;
    gaz:mapReference ?map.
}
LIMIT 10

Results in JSON

The following query lists all of the features on a specific OS Landranger map. So even though we don’t (yet) have any details about the map, we can use its identifier as a means to filter the results:


PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX spatial: <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/>
PREFIX gaz: <http://data.ordnancesurvey.co.uk/ontology/50kGazetteer/>

SELECT ?uri ?label ?easting ?northing ?featureType 
WHERE {
  ?uri 
    #filter on map reference
    gaz:mapReference <http://data.ordnancesurvey.co.uk/id/OSLandrangerMap/85>;

    #bind everything we want to return
    rdfs:label ?label;
    spatial:easting ?easting;
    spatial:northing ?northing;
    gaz:featureType ?featureType.
}

Results in JSON
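If you fetch results as JSON, they come back in the standard SPARQL JSON results layout, where each solution is a set of variable bindings. Here’s a minimal sketch of flattening that into plain rows. The sample document is hand-written for illustration: the URI is the Hadrian’s Wall identifier from above, but the label value is a guess and the function name is hypothetical.

```python
import json
from typing import Dict, List


def bindings_to_rows(results_json: str) -> List[Dict[str, str]]:
    """Flatten SPARQL JSON results into one plain dict per solution."""
    doc = json.loads(results_json)
    return [
        {var: term["value"] for var, term in binding.items()}
        for binding in doc["results"]["bindings"]
    ]


# Hand-written sample in the SPARQL JSON results layout (not real output).
sample = """
{
  "head": {"vars": ["uri", "label"]},
  "results": {"bindings": [
    {"uri": {"type": "uri",
             "value": "http://data.ordnancesurvey.co.uk/id/50kGazetteer/106584"},
     "label": {"type": "literal", "value": "Hadrian's Wall"}}
  ]}
}
"""

for row in bindings_to_rows(sample):
    print(row["label"], row["uri"])
```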
