Monthly Archives: September 2009

A Spectrum of Schema Related Questions

On Thursday and Friday I was lucky enough to be able to take some time out to attend VocampBristol2009. This was the third Vocamp event I’ve attended, the previous two being the very first (Oxford) and a recent event hosted by Yahoo in Sunnyvale.

There seems to be a common theme emerging around the topics and discussions for these events. On the one hand there’s a focus on practical exercises, i.e. actually authoring or extending a vocabulary, with the aim of creating some kind of deliverable at the end of the event. Unsurprising, as this is the fundamental goal of this particular breed of unconference: to ensure that people can take the time out from day to day issues and contribute towards the creation of useful schemas.

On the other hand there’s also commonly a desire amongst attendees to discuss more general issues around RDF modelling, vocabulary creation and management. Again this is useful stuff, even if it is unlikely to yield immediate practical outcomes.

This trend held true for VocampBristol2009 and on Friday morning we had a really interesting group discussion that touched on a number of different areas. The general framing of the discussion was how we, as a community, should be helping people better understand how to create RDF schemas; actually use them; and also understand how they have been deployed. The latter point is of particular importance for schema authors wanting to connect with their user base and see how a schema might evolve over time with a minimum of impact.

The discussion was wide ranging, but it seemed to me to fall into a series of issues along a spectrum: at one end, RDF data modelling patterns; at the other, the importance of having ready access to statistics on how data has been deployed.

I tried to sum this up as a series of questions:

  1. Generic Modelling Patterns: e.g. When do I use an n-ary relationship in my RDF modelling?
  2. Specific Modelling Patterns: e.g. How do I model time-series data in RDF?
  3. Vocabulary Usage Patterns: e.g. How do I use the Example.org Time Series Schema in my own dataset?
  4. Deployment Patterns & Statistics: e.g. How many people are using ex:TimeSeries? How many are using this specific predicate?
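To make the first two questions a little more concrete, here is a minimal sketch of one possible answer: modelling a time-series point as an n-ary relationship, with an intermediate node carrying both the timestamp and the value. It uses plain Python tuples rather than a real RDF library, and every name in it (the `EX` namespace, `ex:observation`, `ex:time`, `ex:value`, the `observation` helper) is hypothetical, invented purely for illustration.

```python
# Triples are plain (subject, predicate, object) tuples.
# All ex: terms below are hypothetical, not part of any published schema.
EX = "http://example.org/ns#"


def observation(series, index, timestamp, value):
    """Return the triples for one time-series point.

    A single triple cannot hold both the time and the value of a reading,
    so an intermediate node stands for the reading itself (the n-ary pattern).
    """
    node = f"{series}/obs/{index}"
    return [
        (series, EX + "observation", node),
        (node, "rdf:type", EX + "Observation"),
        (node, EX + "time", timestamp),
        (node, EX + "value", value),
    ]


triples = observation(
    "http://example.org/series/1", 0, "2009-09-11T09:00:00Z", 21.5
)
for triple in triples:
    print(triple)
```

Whether an intermediate node like this is the right choice for a given dataset is exactly the kind of question a shared catalogue of patterns would help answer.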

The consensus of the discussion, to me at least, seemed to be that we need to address these questions, with the recognition that because they range along a spectrum, from generic modelling patterns through to information about real-world data, there are a variety of means of addressing them. And also that, while there is already a lot of previous and ongoing work in these areas to draw upon, there is still a lot more co-ordination to be done.

Here’s my personal view of what infrastructure we need to support each question:

  1. Generic Modelling Patterns: a good, well-run wiki of formal design patterns that cover both general issues…
  2. Specific Modelling Patterns: …and more specific questions. The ESW wiki is, frankly, a bit messy, and I think this kind of documentation and discussion warrants a specific site with a dedicated community interested in creating and refining some documentation (i.e. discussion should happen elsewhere). The kind of material that could easily end up as a book.
  3. Vocabulary Usage Patterns: some additional documentation on best practices for schema publishing. Every schema should have both clear documentation and clear examples that explore the different aspects of the model. Examples of mixing terms with other schemas are also particularly useful, as are SPARQL queries that can “validate” that some data matches the expectations of the schema.
  4. Deployment Patterns & Statistics: converge on some standard statistical measures for Linked Data sets. These will range from the obvious metrics of size and reports on class and property usage, through to descriptions of common “features” of the dataset, e.g. hub resources, common pairings of properties or classes, etc. This should be backed up with a means of publishing those stats (VOID and SCOVO should help here); and we need services like Sindice, the Talis Platform, etc., that are hosting or indexing a number of datasets to generate and publish this information.
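As a rough illustration of the last two points, the sketch below counts class and property usage over a small set of triples and runs a simple “validation” check for instances that lack an expected property. It is plain Python over tuples, not SPARQL and not the output format of VOID or SCOVO; the sample data and the function names `usage_stats` and `missing_property` are assumptions made up for this example.

```python
from collections import Counter

RDF_TYPE = "rdf:type"


def usage_stats(triples):
    """Count class usage (objects of rdf:type) and usage of every other property."""
    class_counts = Counter()
    property_counts = Counter()
    for _subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            class_counts[obj] += 1
        else:
            property_counts[predicate] += 1
    return class_counts, property_counts


def missing_property(triples, cls, required):
    """Return subjects typed as cls that never use the required property."""
    instances = {s for s, p, o in triples if p == RDF_TYPE and o == cls}
    having = {s for s, p, _o in triples if p == required}
    return sorted(instances - having)


# Hypothetical sample data using made-up ex: terms.
sample = [
    ("ex:s1", "rdf:type", "ex:TimeSeries"),
    ("ex:s1", "ex:observation", "ex:o1"),
    ("ex:s2", "rdf:type", "ex:TimeSeries"),
    ("ex:o1", "rdf:type", "ex:Observation"),
    ("ex:o1", "ex:value", "21.5"),
]

classes, properties = usage_stats(sample)
print(classes.most_common())
print(properties.most_common())
print(missing_property(sample, "ex:TimeSeries", "ex:observation"))
```

A hosting or indexing service would run the equivalent over far larger datasets, but the measures themselves can stay this simple.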

These kinds of questions are increasingly important as the web of data grows and as new communities begin to explore both the data and the technology.
