Monthly Archives: December 2009

Thoughts on Enterprise Linked Data

There have been a number of discussions about “Enterprise Linked Data” recently, and I took part in a panel on precisely that topic at ESTC 2009. Unfortunately the panel was cut short due to time pressures, so I didn’t get a chance to say everything I’d hoped. In lieu of that debate, here’s a blog post containing a few thoughts on the subject.

When we refer to enterprise use of Linked Data, there are a number of different facets to that discussion which are worth highlighting. In my opinion the issues and justifications relating to each of them are quite different. So different, in fact, that we’re in danger of having a confused debate unless we tease out these different aspects.

Aspects of the Debate

In my view there are three facets to the discussion:

  • Publishing Linked Data, the key question here being: what does an enterprise stand to gain by publishing Linked Data?
  • Consuming Linked Data: what does an enterprise stand to gain from consuming Linked Data?
  • Adopting Linked Data: what benefits can an enterprise gain by deploying Linked Data technologies internally?

I think these facets, whilst obviously closely related, are largely orthogonal. For example, I could see a scenario in which an organization consumed Linked Data but didn’t store or use it as RDF, instead just feeding it into existing applications. Similarly, businesses could clearly adopt Linked Data as a technology without publishing any data to the web at all.

These issues are also largely orthogonal to the Open Data discussion: an enterprise might use, consume and publish Linked Data without it being completely open for others to reuse. The data may only be available behind the firewall, amongst authorised business partners, or to licensed third parties. So, while the question of whether to publish open data is a very important aspect of the discussion, it’s not a defining one.

Here are a few thoughts on each of these facets.

Publishing Linked Data

So why might an enterprise publish Linked Data? And if that is a worthwhile goal, is it clear how to achieve it? Let’s tackle the second question first, as it’s the simplest.

There is an increasingly large amount of good advice available online, as well as tools and applications, to support the publishing of Linked Data. We’re making good strides in the important transition of moving Linked Data out of the research arena and into the hands of actual practitioners. The How to Publish Linked Data on the Web tutorial is a great resource, but to my mind Jeni Tennison’s recent series on publishing Linked Data is an excellent end-to-end guide full of great practical advice.

We can declare victory when someone writes the O’Reilly book on the subject and does for Linked Data what RESTful Web Services did for REST. (And the two would make great companion pieces.)

But technology issues aside, what are the benefits to an organization in publishing Linked Data? There are several ways to approach answering that question but I think in most discussions Linked Data tends to get compared with Web APIs. The value of creating an API is now reasonably well understood, and many of the benefits that come from opening data through an API also apply to Linked Data.

However the argument that Linked Data married with a SPARQL endpoint is as easy for developers to use as a Web API is still a little weak at this stage. SPARQL can be off-putting for developers used to simpler, more tightly defined APIs. As a community we ought to treat it as a power tool and look for ways to make it easier to get started with. It’s also worth recognising that a search API is a useful complement to a SPARQL endpoint in a Linked Data deployment.
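Part of making SPARQL approachable is showing that the simplest useful queries are one-liners. For example, this is a sketch of the kind of query a developer might start with (the resource URI is purely illustrative):

# Ask a store for everything it knows about a single resource
DESCRIBE <http://dbpedia.org/resource/SPARQL>

From there it’s a small step to SELECT queries over a single triple pattern, which is the level at which most comparisons with Web APIs are actually made.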

But publishing Linked Data can’t be directly compared to just creating an API, because it’s also largely a pattern for web publishing in general. It’s increasingly easy to instrument existing content management systems to expose RDF(a) and Linked Data. So rather than create a custom API, which will involve expensive development costs, particularly if it’s going to scale, it’s possible to simply expose Linked Data as part of an existing website.

By following the Linked Data pattern for web publishing, in particular the use of strong identifiers, an enterprise can end up with a single point of presence on the web for publishing all of its human- and machine-readable data, resulting in a website that is well optimised for search engines. Search engines can better crawl and index well-structured websites and are increasingly ingesting embedded RDFa to improve search results and rankings. That by itself is a strong incentive to publish Linked Data.

Adopting Linked Data, particularly as part of a reorganization of an existing web presence, could deliver improved search engine rankings and exposure of content whilst saving on the costs of developing and running a custom API. The longer term benefits of being part of the growing web of data can be the icing on the cake.

Consuming Linked Data

Next we can consider why an enterprise might want to consume Linked Data.

To my knowledge organizations are currently only publishing Linked Open Data (albeit with some wide variations in licensing terms), so for the present we’ll skip the question of whether enterprises have the option of consuming non-open Linked Data, e.g. as part of a privately licensed dataset.

The LOD Cloud is still growing and provides a great resource of highly interlinked data. The main issues that face an organization consuming this data are ones of quantity (there’s still a lot more data that could be available); quality (how good is the data, and how well is it modelled); and trust (picking and choosing reliable sources).

To some extent these issues face any organization that begins relying on a third-party API or dataset. However at present a lot of the data in the LOD cloud is still from secondary sources. The same can’t be said for the majority of web APIs, which tend to be published by the original curators of the data.

These issues should resolve themselves over time as more primary sources join the LOD cloud. Because Linked Data is all based on the same data model, bulk loading and merging data from external sources is very simple. This gives enterprises the option of creating their own mirrors of LOD data sources, which provides some additional reassurance around stability and longevity.
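As a rough sketch of how simple that mirroring can be: with a store that supports SPARQL Update, pulling a published RDF document into a local named graph is a single statement. The exact syntax varies slightly between the SPARUL submission and the SPARQL 1.1 Update drafts (with or without the GRAPH keyword), and the URIs below are purely illustrative.

# Mirror a published RDF document into a local named graph
LOAD <http://example.org/dumps/dataset.rdf>
  INTO GRAPH <http://internal.example.org/graphs/mirrors/dataset>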

Linked Data, with its reliance on strong identifiers, is much easier to navigate and process than other sources, even if you’re not storing the results of that processing as RDF. There’s also a much greater chance of serendipity, resulting in the discovery of new data sources and new data items, whereas there is virtually no serendipity with a Web API, as each API needs to be explicitly integrated.

But this benefit will only become evident if we continue to put effort into helping (enterprise) developers understand how to consume Linked Data, e.g. as part of existing frameworks or via new data integration patterns; this is another area that needs more attention. The Consuming Linked Data tutorial at ISWC 2009 was a good step in that direction, although the message needs to be circulated more widely, beyond the core semantic web community.

In my opinion it will be easier for enterprises to consume Linked Data if they first begin to publish it. By publishing data they are putting their identifiers out into the wild. These identifiers become points for annotation and reuse by the community, creating liminal zones from which the enterprise can harvest and filter useful data. This is a benefit that I think is unique to Linked Data: with a Web API the end results are typically mashups or widgets displayed in a third-party application; these are just new silos one step removed from the data publisher.

Adopting Linked Data

Finally, what value could be gained if an organization adopts Linked Data internally as a means to manage and integrate data behind the firewall?

The issues and potential benefits here are largely a mixture of the above, except that there are few or no issues with trust, as all of the data comes from known sources. In a typical enterprise environment Linked Data as an integration technology will be compared to a wider range of systems, ranging from integrated developer tools through to middleware. There’s a reason why SOAP-based systems are still widely used in enterprise IT: most organizations aren’t (yet?) internally organized as if they were true microcosms of the web.

It’s interesting to see that Linked Data can potentially provide a means for solving many of the issues that Master Data Management is trying to address. Linked Data encourages strong identifiers; clean modelling; and linking to, rather than replicating, data. These are core issues for data consolidation within the enterprise. Coupled with the ability to link out to data that is part of the LOD Cloud, or published by business partners, Linked Data has the potential to provide a unifying infrastructure for managing both internal and external data sources.

It’s worth noting, however, that semantic technologies in general, e.g. document analysis, entity extraction, reasoning and ontologies, seem to be much more widely deployed in enterprise systems than Linked Data. This is no doubt in large part because the advantages of those technologies are currently easier to articulate, as they’re more easily packaged into a product.

Summary

In this post I wanted to tease out some of the questions that underpin the discussions about enterprise adoption of Linked Data. I’ve presented a few thoughts on those questions and I’d love to hear your opinions.

Along the way I’ve attempted to highlight some areas where we need to focus to help the transition from a researcher-led to a practitioner-led community. More data, more documentation, and more tools are the key themes.

SPARQL Extension Function Survey Summary

This post contains the first set of results from my SPARQL extension survey. I’ve completed an initial survey of a number of different SPARQL processors to itemise the extension functions that each of them has implemented. This will be an ongoing activity as implementations evolve continually, but I thought it would be useful to summarise my findings so far.

If you want to look at the results for yourselves, I’ve created a publicly accessible Google Spreadsheet that lists all of the results. The first tab of the spreadsheet lists the SPARQL endpoints/processors that I’ve surveyed.

I completed the initial round of the survey a few weeks ago, so any updates since then won’t have been included.

List of Implementations

The full list of surveyed processors/endpoints consists of:

  • Allegrograph
  • ARQ
  • Corese
  • Geospatialweb project
  • Mulgara
  • OpenAnzo
  • Openlink Virtuoso
  • Sesame
  • TopBraid product suite
  • XMLArmyKnife.com

If I’ve missed any other implementations that support extension functions then please let me know. I’m aware that other engines also support property functions, but I’ve not included this type of extension in my first survey round. I’ll be exploring that area in the new year.

I want to thank the implementers of a number of these systems for providing me with additional information, feedback and support as I’ve compiled the results. If anything has been misrepresented or simply missed, then you have my apologies and I will endeavour to fix any reported problems ASAP. The goal is to perform a fair, objective survey of the current situation: I’m not pushing any agenda here, other than a desire for convergence and continual improvement.

Breakdown of Results

The currently implemented extension functions can be organised into the following categories:

  • String
  • Date/Time
  • Math/Logic
  • RDF/Graph Manipulation
  • Geospatial
  • Network

The first three categories, covering string, date, and mathematical manipulations, have the largest number of functions. This is as expected, as these areas are the most useful for any programming or query language. Given that extension functions are restricted to value testing in SPARQL 1.0, you would also expect them to be most commonly used to provide additional flexibility when comparing strings, manipulating and comparing dates, and performing simple mathematical functions.
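To make the SPARQL 1.0 restriction concrete: an extension function is invoked by URI inside a FILTER expression, so it can only accept or reject bindings. A sketch follows; whether a given engine recognises this particular XPath function URI varies by implementation.

PREFIX fn: <http://www.w3.org/2005/xpath-functions#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

# Keep only the books whose title starts with a given string
SELECT ?book ?title WHERE {
  ?book dc:title ?title .
  FILTER ( fn:starts-with(?title, "Linked") )
}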

Very few implementations offer any functions in the remaining categories. I had originally expected to find more functions in the Geospatial category but I think that the majority of exploration in that area has focused on using property functions instead.

I would expect the number of distinct functions in each area to grow with the delivery of SPARQL 1.1, if it becomes possible to use them as part of a SELECT expression, e.g. to create new values/bindings, as well as just in FILTER tests. Those implementations that already offer a wide range of additional functions, such as Virtuoso, already have SPARQL language extensions that allow functions to be used in this way.
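For illustration, this is roughly what SELECT expressions look like in the SPARQL 1.1 drafts: the same kind of function is now creating a new binding rather than just filtering (again, function support is engine-specific):

PREFIX fn: <http://www.w3.org/2005/xpath-functions#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

# Project a derived value alongside the original binding
SELECT ?book (fn:upper-case(?title) AS ?sortKey) WHERE {
  ?book dc:title ?title .
}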

Currently, however, the numbers are inflated by repeated implementations of the same function in different engines. For example ARQ, Virtuoso and Corese all have their own variant of a “contains” function.

Portability

This brings me to the topic of query portability. A SPARQL query is portable if it can run unchanged on any SPARQL processor. A query is not portable if it uses proprietary extensions that are not supported by other processors. Implementers can increase portability by supporting each other’s extensions or by converging on a common set of functions. As a standard develops, you’d expect to see some replication of functions across engines before pressure from users, and a better understanding of the utility of various extensions, encourages convergence.

It’s encouraging to see that some replication of functions is happening across SPARQL engines. For example both Mulgara and TopQuadrant support a basic set of string functions that were originally provided by the ARQ engine. These functions are part of the XPath Functions and Operators library, which acts as a handy “off-the-shelf” set of function definitions for SPARQL implementors to converge around. Mulgara also now supports a number of the EXSLT functions, which can act as another reference point for useful function definitions.

Looking at the list of extensions, it’s easy to see that more convergence could take place, as there are plenty of other extension functions that have been independently implemented. Expanding the set of commonly used functions in SPARQL is currently a time-permitting feature for SPARQL 1.1.

Replication of functions across implementations is partially hampered by a couple of non-standard ways in which extension functions have been implemented. For example both Corese and Virtuoso implement their extension functions as language extensions, i.e. they don’t quite conform to the SPARQL 1.0 recommendation. Corese doesn’t associate its functions with a URI, i.e. they are just functions that are exposed in the basic language. The Virtuoso “bif” (built-in) functions are used with a prefix (e.g. bif:contains), but this prefix is not (and cannot be) associated with a URI. In both cases this means that other implementations cannot replicate the functions using the existing extension points: they’d have to be implemented with similar language extensions, or via query rewriting.
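The contrast is easiest to see side by side. In the standard mechanism the function is named by a URI bound to a prefix, so another engine can support the same URI; in the Virtuoso style the bif: prefix is not a real namespace, so there is nothing to redeclare. Both queries below are sketches: afn:localname is an ARQ function, and the exact Virtuoso built-ins should be checked against its documentation.

# Standard SPARQL 1.0 extension call: the function is identified by a URI
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX afn: <http://jena.hpl.hp.com/ARQ/function#>
SELECT ?s ?label WHERE {
  ?s rdfs:label ?label .
  FILTER ( afn:localname(?s) = "Book1234" )
}

# Virtuoso free-text matching: bif:contains looks like a prefixed name,
# but bif: cannot be bound with a PREFIX declaration
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?s ?label WHERE {
  ?s rdfs:label ?label .
  ?label bif:contains "linked" .
}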

Conclusions and Recommendations

I’m encouraged to see the wide range of experimentation that has been taking place around SPARQL extensions as it illustrates that developers are exploring how to use the language in a variety of ways. Extensions also indicate areas where the query language could be extended to encourage interoperability and address common issues faced by developers.

There is clearly a common set of functions around strings, dates and mathematical operators that ought to be available as a core part of the language. If the SPARQL 1.1 specification doesn’t end up defining this, then I’d like to encourage the implementer community to do further work to explore replicating useful extensions, or converging on a common set, outside of the Working Group.

To help this process along it would be useful for developers to provide more feedback on the functions they find useful, and for some statistics to be gathered on which functions are commonly used in practice.

Right now there is a common set of functions available from the ARQ engine that is implemented in at least two other SPARQL processors. The same functions can be ported to other engines with a minimum of query rewriting, often with little more than changes to query prefixes.

My other recommendation at this stage is that implementers need to work harder on documenting the extensions they provide. Some engines have pretty good documentation, but for others the documentation is either hard to find or clearly lagging behind the latest code base. Publishing documentation about extensions, ideally with examples, really does help developers get started much quicker.

Approaches to Publishing Linked Data via Named Graphs

This is a follow-up to my previous post on managing RDF using named graphs. In that post I looked at the basic concept of named graphs, how they are used in SPARQL 1.0/1.1, and discussed RESTful APIs for managing named graphs. In this post I wanted to look at how Named Graphs can be used to support publishing of Linked Data.

There are two scenarios I’m going to explore. The first uses Named Graphs in a way that provides a low-friction method for publishing Linked Data. The second prioritizes ease of data management, in particular in the scenario where RDF is being generated by conversion from other sources. Let’s look at each in turn and their relative merits.

Publishing Scenario #1: One Resource per Graph

For this scenario let’s assume that we’re building a simple book website. Our URI space is going to look like this:

http://www.example.org/id/{thing}/{id}

http://www.example.org/doc/{thing}/{id}

The first is the pattern for identifiers in our system; the second is the URI to which we’ll 303 redirect clients in order to serve the document containing the metadata about the thing with that identifier. We’ll have several types of thing in our system: books, authors, and subjects.

The application will obviously include a range of features such as authentication, registration, search, etc. But I’m only going to look at the Linked Data delivery aspects of the application here, in order to highlight how Named Graphs can support that.

Our application is going to be backed by a triplestore that offers an HTTP protocol for managing Named Graphs, e.g. as specified by SPARQL 1.1. This triplestore will expose graphs from the following base URI:

http://internal.example.org/graphs

The simplest way to manage our application data is to store the data about each resource in a separate Named Graph. Each resource will therefore be fully described in a single graph, so all of the metadata about:

http://www.example.org/id/book/1234

will be found in:

http://internal.example.org/graphs/book/1234

The contents of that graph will be the Concise Bounded Description of http://www.example.org/id/book/1234, i.e. all of its literal properties, any related blank nodes, as well as properties referencing related resources.

This means delivering the Linked Data view for this resource is trivial. A GET request to http://www.example.org/doc/book/1234 will trigger our application to perform a GET request to our internal triplestore at http://internal.example.org/graphs/book/1234.
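If the store only exposes a SPARQL endpoint rather than the HTTP graph protocol, the same lookup can still be expressed as a query scoped to that one graph; a sketch, reusing the graph URI above:

# Retrieve the full contents of the named graph for this resource
CONSTRUCT { ?s ?p ?o }
WHERE {
  GRAPH <http://internal.example.org/graphs/book/1234> { ?s ?p ?o }
}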

If the triplestore supports multiple serializations then there’s no need for our application to parse or otherwise process the results: we can request the format desired by the client directly from the store and then proxy the response straight through. Ideally the store would also support ETags and/or other HTTP caching headers, which we can also reuse. ETags will be simple to generate, as it is easy to track whether a specific Named Graph has been updated.

As the application code to do all this is agnostic to the type of resource being requested, we don’t have to change anything if we were to expand our application to store information about new types of thing. This is the sort of generic behaviour that could easily be abstracted out into a reusable framework.

Another nice architectural feature is that it will be easy to slot in internal load-balancing over a replicated store to spread requests over multiple servers. Because the data is organised into graphs there are also natural ways to “shard” the data if we wanted to replicate the data in other ways.

This gets us a simple Linked Data publishing framework, but does it help us build an application, i.e. the HTML views of that data? Clearly in that case we’ll need to parse the data so that it can be passed off to a templating engine of some form. And if we need to compose a page containing details of multiple resources, this can easily be turned into requests for multiple graphs, as there’s a clear mapping from resource URI to graph URI.

When we’re creating new things in the system, e.g. capturing data about a new book, the application will have to handle the newly submitted data, perform any needed validation, and generate an RDF graph describing the resource. It then simply PUTs the newly generated data to a new graph in the store. Updates are similarly straightforward.

If we want to store provenance data, e.g. an update history for each resource, then we can store that in a separate related graph, e.g. http://internal.example.org/graphs/provenance/book/1234.

Benefits and Limitations

This basic approach is simple, effective, and makes good use of the Named Graph feature. Identifying where to retrieve or update data is little more than URI rewriting. It’s well optimised for the common case for Linked Data, which is retrieving, displaying, and updating data about a single resource. To support more complex queries and interactions, ideally our triplestore would also expose a SPARQL endpoint that supported querying against a “synthetic” default graph which consists of the RDF union of all the Named Graphs in the system. This gives us the ability to query against the entire graph but still manage it as smaller chunks.

(Aside: Actually, we’re likely to want two different synthetic graphs: one that merges all our public data, and one that merges the public data + that in the provenance graphs.)

There are a couple of limitations which we’ll hit when managing data using this scenario. The first is that the RDF in the Linked Data views will be quite sparse, e.g. the data won’t contain the labels of any referenced resources. To be friendly to Linked Data browsers we’ll want to include more data. We can work around this issue by performing two requests to the store for each client request: the first to get the individual graph, the second to perform a SPARQL query something like this:


PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

CONSTRUCT {
  <http://www.example.org/id/book/1234> ?p ?referenced .
  ?referenced rdfs:label ?label .
  ?referencing ?p2 <http://www.example.org/id/book/1234> .
  ?referencing rdfs:label ?label2 .
} WHERE {
  {
    # outgoing links, plus the labels of any referenced resources
    <http://www.example.org/id/book/1234> ?p ?referenced .
    OPTIONAL { ?referenced rdfs:label ?label . }
  } UNION {
    # incoming links, plus the labels of any referencing resources
    ?referencing ?p2 <http://www.example.org/id/book/1234> .
    OPTIONAL { ?referencing rdfs:label ?label2 . }
  }
}

The above query would be executed against the union graph of our triplestore and would let us retrieve the labels of any resources referenced by a specific book (in this case), plus the labels and properties of any referencing resources. This query can be run in parallel with the request for the graph, and its results merged with that RDF by our application framework.

The other limitation is also related to how we’ve chosen to factor the data into CBDs. Any time we need to put in reciprocal relationships, e.g. when we add or update resources, we will have to update several different graphs. This could become expensive depending on the number of affected resources. We could potentially work around that by adopting an Eventual Consistency model and deferring updates using a message queue. This lets us relax the constraint that updates to all resources need to be synchronized, allowing more of that work to be done both asynchronously and in parallel. The same approach can be applied to managing lists of items in the store, e.g. a list of all authors: these can be stored as individual graphs, but regenerated on a regular basis.

The same limitation hits us if we want to do any large-scale updates to all resources. In this case SPARUL updates might become more effective, especially if the engine can update individual graphs, although handling updates to the related provenance graphs might be problematic. What I think is interesting is that, in this data management model, this is the only area in which we might really need something with the power of SPARUL. For the majority of use cases, graph-level updates using simple HTTP PUTs coupled with a mechanism like Changesets are more than sufficient. This is one reason why I’m so keen to see attention paid to the HTTP protocol for managing graphs and data in SPARQL 1.1: not every system will need SPARUL.

The final limitation relates to the number of Named Graphs we will end up storing in our triplestore. One graph per resource means that we could easily end up with millions of individual graphs in a large system. I’m not sure that any triplestore is currently handling this many graphs, so there may be some scaling issues. But for small to medium-sized applications this should be a minor concern.

Publishing Scenario #2: Multiple Resources per Graph

The second scenario I want to introduce in this posting is one which I think is slightly more conventional. As a result I’m going to spend less time reviewing it. Rather than using one graph per resource, we instead store multiple resources per Named Graph. This means that each Named Graph will be much larger, perhaps including data about thousands of resources. It also means that there may not be a simple mapping from a resource URI to a single graph URI: the triples for each resource may be spread across multiple graphs, although there’s no requirement that this be the case.

Whereas the first scenario was optimised for data that was largely created, managed, and owned by a web application, this scenario is most useful when the data in the store is derived from other sources. The primary data sources may be a large collection of inter-related spreadsheets which we are regularly converting into RDF, and the triplestore is just a secondary copy of the data created to support Linked Data publishing. It should be obvious that the same approach could be used when aggregating existing RDF data, e.g. as a result of a web crawl.

To make our data conversion workflow easier to manage, it makes sense to use a Named Graph per data source, i.e. one for each spreadsheet, rather than one per resource. E.g.:

http://internal.example.org/graphs/spreadsheet/A

http://internal.example.org/graphs/spreadsheet/B

http://internal.example.org/graphs/spreadsheet/C

The end result of our document conversion workflow will then be the updating or replacing of a single specific Named Graph in the system. The underlying triplestore will need to expose a SPARQL endpoint that includes a synthetic graph which is the RDF union of all graphs in the system. This ensures that, where data about an individual resource is spread across a number of underlying graphs, a unified view is still available where required.

As noted in the first scenario we can store provenance data in a separate related graph, e.g. http://internal.example.org/graphs/provenance/spreadsheet/A.

Benefits and Limitations

From a data publishing point of view our application framework can no longer use URI rewriting to map a request to a GET on a Named Graph. It must instead submit SPARQL DESCRIBE or CONSTRUCT queries to the triplestore, executing them against the union graph. This lets the application ignore the organisation and identifiers of the Named Graphs in the store when retrieving data.
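In the simplest case that query can be a one-line DESCRIBE against the union graph. A sketch is below; exactly which triples a DESCRIBE returns is implementation-defined, so a CONSTRUCT like the one in the first scenario gives more control over the response.

# Resolve a request for /doc/book/1234 regardless of which graphs hold the data
DESCRIBE <http://www.example.org/id/book/1234>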

If the application is going to support updates to the underlying data then it will need to know which Named Graph(s) must be updated. This information can be obtained by querying the store to identify the graphs that contain the specific triple patterns to be updated. SPARUL request(s) can then be issued to apply the changes across the affected graphs.
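A sketch of that graph-location step, again using the illustrative book URI from the first scenario:

# Find every named graph that mentions the resource we're about to update
SELECT DISTINCT ?g WHERE {
  GRAPH ?g {
    { <http://www.example.org/id/book/1234> ?p ?o }
    UNION
    { ?s ?p <http://www.example.org/id/book/1234> }
  }
}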

The difficulty of co-ordinating updates from the application with updates from the document conversion (or crawling) workflow means that this scenario may be best suited to read-only publishing of data.

It’s clear that this approach is optimised more for the underlying data conversion and/or collection workflows than for the publishing web application. The trade-off doesn’t add much complexity to the application implementation, but it does give up some of the architectural benefits, e.g. easy HTTP caching and data sharding, that the first model exhibits.

Summary

In this blog post I’ve explored two different approaches to managing and publishing RDF data using Named Graphs. The first scenario described an architecture that used Named Graphs in a way that simplified application code whilst exposing some nice architectural properties. This was traded off against ease of data management for large-scale updates to the system.

The second scenario was optimised more for data conversion and collection workflows, and is particularly well suited to systems publishing Linked Data derived from other primary sources. This flexibility was traded off against a slightly more complex application implementation.

My goal has been to try to highlight different patterns for using Named Graphs and how those patterns place greater or lesser emphasis on features such as RESTful protocols for managing graphs, and different styles of update language. In reality an application might mix together both styles in different areas, or even at different stages of its lifecycle.

If you’re using Named Graphs in your applications then I’d love to hear more about how you’re making use of the feature. Particularly if you’ve layered on additional functionality such as versioning and other elements of workflow.

Better understanding of how to use these kinds of features will help the community begin assembling good application frameworks to support Linked Data application development.

Annotated Data

One of the things I’ve always liked about the Semantic Web vision is the idea that “Anyone can say Anything, Anywhere” (hereafter: the AAA Principle): that I can publish data about anything, and that this data can link to and annotate data that other people are publishing elsewhere. I’ve been thinking recently about whether we’ve spent a lot of time focusing on the publishing of data and not enough on annotation. Some of this thinking is potentially heretical, so I’m hoping for an interesting debate!

Before I leap into the heresy, let’s review the key steps of publishing Linked Data:

  1. Use URIs as names for things
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)
  4. Include links to other URIs, so that they can discover more things.

The dominant publishing pattern for Linked Data is for people to mint new URIs for their resources in a domain that they control. We then make links to other sources by using their URIs as the objects of statements in our data; owl:sameAs links are a special case of linking that asserts equality between the subject and object of that specific statement. Through this approach we tick off all of the Linked Data publishing steps.

Some people have argued that maybe we can drop the requirement of using RDF and SPARQL and still have “linked data”. I don’t agree with that, largely because the term already has a precise definition, and muddying it doesn’t really help the discussion. Publishing data using HTTP URIs, in formats that natively define a linking mechanism, is to my mind simply “RESTful data publishing”. I’ve recently referred to this as “web integrated data”. I mention it because it’s an approach to data publishing that only uses three of the four Linked Data publishing guidelines.

What would happen if we chose to follow some other subset of the guidelines? In fact, what if we didn’t assign URIs to things, or publish data at those URIs, and instead just published RDF to the web?

If we want to take advantage of the AAA Principle then technically we don’t need to assign URIs to things. Or rather, to be precise, we don’t need to assign new URIs to things: we can simply reuse someone else’s URI; no need to mint a new one. We also don’t need to publish data at those URIs; we just need to make sure that the data is linked into the growing web of data and is therefore discoverable. We can do all this and still use and publish RDF. Let’s refer to this form of publishing as “Annotated Data”, to distinguish it from Linked Data and Web Integrated Data.

Annotation is about publishing additional data about things that are already in the web. For that simple use case the need to deploy a Linked Data publishing framework is potentially overkill: publishing a document to a web server is all the machinery I need. Obviously by using someone else’s URIs I’m buying into the longevity of that URI space and the meaning of those identifiers. This may not be the right thing for some applications, but for many common use cases it may be good enough. Also, over time, as we get more hubs in the web of data, certain URI spaces are going to become much more stable, because people will need them to be stable in order to build reliable platforms upon which applications can be constructed. To put that another way: if we’re too fearful about relying on other people’s identifiers then we’ve got bigger problems.

Clearly if we’re just publishing RDF documents which contain statements about other people’s URIs then we can’t publish data at those URIs. So how will our annotations be found? How will they become part of the web of data? This is actually not that different from the current situation. Any given RDF dataset may have links to a small number of other datasets, but it will never comprehensively link to all possible related datasets. That level of co-ordination just isn’t achievable. It may also not be desirable: there may be valid reasons why I don’t want to have reciprocal links to everyone who links to me, e.g. spam or other untrusted data sources. The solution here is that services like sameas.org or Sindice let us search and locate documents that refer to a specific resource, or other resources that have declared an equivalence. The same solution works for publishing Annotated Data: if we can ping a service or crawler that will index the content of our document, then this small additional part can be linked into the whole. The current document web is not fully linked, so there’s no reason to expect the web of data to be either: there will always be a need for bridging/linking services.
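To make the discovery step concrete: once an aggregator has crawled our document into a quad store, finding all of the documents that annotate a given resource is a simple query. This is a sketch over a hypothetical crawled store, using a DBpedia URI purely as an example:

# Which crawled documents (graphs) say something about this resource?
SELECT DISTINCT ?doc WHERE {
  GRAPH ?doc {
    <http://dbpedia.org/resource/Linked_data> ?p ?o
  }
}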

What I’m describing here is broadly what we used to do in the early days of FOAF: we just published RDF documents with rdfs:seeAlso links and crawled them to compile data. This scruffy, lo-fi approach to the web of data was based on the assumption that having strong identifiers for things (particularly people) may not scale or be socially acceptable. It was also based on having more flexible notions of data merging; identification by description (“smushing”) gave us a little more leeway. Now we promote use of strong identifiers and strong notions of equality using owl:sameAs. This is clearly progress, as evidenced by the much larger collections of data we’ve created. But there are concerns about whether owl:sameAs may be too formal for lightweight Linked Data integration. Perhaps we could see these approaches as opposite ends of the spectrum, and be willing to explore more of the middle-ground?

Some questions that occur to me are:

  • Why not encourage people to reuse strong identifiers rather than create new ones? This reduces the need for owl:sameAs linking, and makes it even easier to merge data.
  • Can smushing and the use of rdfs:seeAlso be more widely promoted and discussed as approaches to linking and fusion?
  • Can we create simple data annotation tools that let people contribute to the web of data without requiring that they follow all of the Linked Data principles?

The notion of Annotated Data I’ve described in this post is an attempt to start that conversation. Because it lowers the bar to contribution, it may be easier to move people up the “on ramp” to contributing to the web of data. And arguably as the web of data grows, increasingly what people and organizations will be doing is annotating existing resources rather than creating new ones.

As a concrete use case, why not encourage publishers to simply publish RDF documents listing the foaf:topic's of their content, but using DBpedia, Freebase, or OpenCalais URIs as the topic URIs? This is simpler than publishing full Linked Data, is lower cost, and is fairly trivial to do using RDFa. They might later want to adopt more of the Linked Data publishing principles if they want more control over their URI schemes or are prepared to invest more deeply in the technology.

Heresy or just good use of the full range of hypertext publishing mechanisms we have in RDF? Let me know your thoughts.
