One of the things I’ve always liked about the Semantic Web vision is the idea that “Anyone can say Anything, Anywhere” (hereafter: The AAA Principle): that I can publish data about anything, and that this data can link to and annotate data that other people are publishing elsewhere. I’ve been wondering recently whether we’ve spent a lot of time focusing on the publishing of data and not enough on annotation. Some of this thinking is potentially heretical so I’m hoping for an interesting debate!
Before I leap into the heresy, let’s review the key steps of publishing Linked Data:
- Use URIs as names for things
- Use HTTP URIs so that people can look up those names.
- When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)
- Include links to other URIs, so that they can discover more things.
The dominant publishing pattern for Linked Data is for people to mint new URIs for their resources in a domain that they control. We then make links to other sources by using their URIs as the object of statements in our data; owl:sameAs links are a special case of linking that asserts equality between the subject and object of that specific statement. Through this approach we tick off all of the Linked Data publishing steps.
Some people have argued that maybe we can drop the requirement of using RDF & SPARQL and still have “linked data”. I don’t agree with that, largely because the term already has a precise definition and so muddying it doesn’t really help the discussion. Publishing of data using HTTP URIs, using formats that natively define a linking mechanism, is to my mind simply “RESTful data publishing”. I’ve recently referred to this as “web integrated data”. I mention this because it’s an approach to data publishing that only uses three of the four Linked Data publishing guidelines.
What would happen if we chose to follow some other subset of the guidelines? In fact, what if we didn’t assign URIs to things, or publish data at those URIs, and instead just published RDF to the web?
If we want to take advantage of The AAA Principle then technically we don’t need to assign URIs to things. Or rather, to be precise, we don’t need to assign new URIs to things. We can simply reuse someone else’s URI; no need to mint a new one. We also don’t need to publish data at those URIs: we just need to make sure that the data is linked into the growing web of data and is therefore discoverable. We can do this and still use/publish RDF. Let’s refer to this form of publishing as “Annotated Data”, to distinguish it from Linked Data and Web Integrated Data.
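To make that a little more concrete, here is a minimal sketch of what producing an Annotated Data document might look like. It is only an illustration: the helper names (`uri`, `literal`, `to_ntriples`), the example.org URL and the choice of N-Triples as the serialization are all my own assumptions, and a real publisher would more likely reach for an RDF library. The point is simply that the subject is someone else’s URI and the output is a plain file that can sit on any web server.

```python
# Hypothetical example data: the subject is a URI minted by someone else
# (DBpedia here) and is reused as-is rather than replaced by a new URI.
RDFS_COMMENT = "http://www.w3.org/2000/01/rdf-schema#comment"
RDFS_SEEALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"
SUBJECT = "http://dbpedia.org/resource/Semantic_Web"


def uri(value):
    """Format a URI as an N-Triples IRI term."""
    return "<" + value + ">"


def literal(value):
    """Format a plain string as an N-Triples literal, escaping special characters."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return '"' + escaped + '"'


def to_ntriples(triples):
    """Serialize (subject, predicate, object) terms, one statement per line."""
    return "".join(f"{s} {p} {o} .\n" for s, p, o in triples)


annotations = [
    (uri(SUBJECT), uri(RDFS_COMMENT), literal('A vision of a "web of data".')),
    # Assumed URL of a page holding further notes; purely illustrative.
    (uri(SUBJECT), uri(RDFS_SEEALSO), uri("http://example.org/notes/semantic-web")),
]

document = to_ntriples(annotations)
print(document)
```

Saving `document` to a file and uploading it is the whole publishing step; nothing here needs to be served from the URI being described.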
Annotation is about publishing additional data about things that are already in the web. For that simple use case the need to deploy a Linked Data publishing framework is potentially overkill: publishing a document to a web server is all the machinery I need. Obviously by using someone else’s URIs I’m buying into the longevity of that URI space and the meaning of those identifiers. This may not be the right thing for some applications, but for many common use cases it may be good enough. Also, over time, as we get more hubs in the web of data, certain URI spaces are going to become much more stable because people will need them to be so in order to be reliable platforms upon which applications can be constructed. To put that another way: if we’re too fearful about relying on other people’s identifiers then we’ve got bigger problems.
Clearly if we’re just publishing RDF documents which contain statements about other people’s URIs then we can’t publish data at those URIs. So how will our annotations be found? How will they become part of the web of data? This is actually not that different to the current situation. Any given RDF data set may have links to a small number of other data sets, but it will never comprehensively have links to all possible related datasets. That level of co-ordination just isn’t achievable. It may also not be desirable: there may be valid reasons why I don’t want to have reciprocal links to everyone who links to me, e.g. spam or other untrusted data sources. The solution here is that services like sameas.org or sindice let us search and locate documents that refer to a specific resource, or other resources that have declared an equivalence. This same solution works for publishing Annotated Data: if we can ping a service or crawler that will index the content of our document then this small additional part can be linked into the whole. The current document web is not fully linked, so there’s no reason to expect the web of data to be either; there will always be the need for bridging/linking services.
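The kind of bridging service I have in mind can be sketched as a simple inverted index from URIs to the documents that mention them. This is a toy model under my own assumptions, not a description of how sameas.org or sindice actually work: the `ReferenceIndex` class, its `ping` method and the example document URLs are all hypothetical.

```python
from collections import defaultdict


class ReferenceIndex:
    """Toy index mapping each URI to the documents that mention it."""

    def __init__(self):
        self._documents_by_uri = defaultdict(set)

    def ping(self, document_url, triples):
        """Record every URI used as a subject or object in a submitted document."""
        for subject, _predicate, obj in triples:
            self._documents_by_uri[subject].add(document_url)
            # Simplifying assumption: any object starting with http(s):// is
            # a URI; everything else is treated as a literal and skipped.
            if obj.startswith(("http://", "https://")):
                self._documents_by_uri[obj].add(document_url)

    def documents_mentioning(self, uri):
        """Return the known documents that refer to the given URI."""
        return sorted(self._documents_by_uri.get(uri, set()))


TOPIC = "http://dbpedia.org/resource/Semantic_Web"
RDFS_COMMENT = "http://www.w3.org/2000/01/rdf-schema#comment"
FOAF_TOPIC = "http://xmlns.com/foaf/0.1/topic"

index = ReferenceIndex()
# Two independent publishers ping the index with their own documents.
index.ping("http://example.org/my-annotations.nt",
           [(TOPIC, RDFS_COMMENT, "A vision of a web of data.")])
index.ping("http://example.com/articles.nt",
           [("http://example.com/article/1", FOAF_TOPIC, TOPIC)])

print(index.documents_mentioning(TOPIC))
```

Neither publisher links to the other, and neither controls the DBpedia URI, yet a lookup on that URI finds both documents.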
What I’m describing here is broadly what we used to do in the early days of FOAF: we just published RDF documents with rdfs:seeAlso links and crawled them to compile data. This scruffy, lo-fi approach to the web of data was based on the assumption that having strong identifiers for things (particularly people) may not scale or be socially acceptable. It was also based on having more flexible notions of data merging; identification by description (“smushing”) gave us a little more leeway. Now we promote use of strong identifiers and strong notions of equality using owl:sameAs. This is clearly progress, as evidenced by the much larger collections of data we’ve created. But there are concerns about whether owl:sameAs may be too formal for lightweight Linked Data integration. Perhaps we could see these approaches as opposite ends of the spectrum, and be willing to explore more of the middle-ground?
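For anyone who didn’t live through the FOAF days, smushing can be sketched roughly as follows: descriptions gathered from different documents are merged when they share a value for an identifying property, rather than because they share a URI. This sketch assumes a single key property (foaf:mbox_sha1sum) and a simple dictionary representation of each description; the `smush` function and the sample data are hypothetical, and real smushers handled several such properties at once.

```python
import hashlib

FOAF = "http://xmlns.com/foaf/0.1/"
MBOX_SHA1SUM = FOAF + "mbox_sha1sum"
FOAF_NAME = FOAF + "name"
FOAF_HOMEPAGE = FOAF + "homepage"


def smush(descriptions, key_property=MBOX_SHA1SUM):
    """Merge descriptions that share a value for key_property.

    Each description maps a property URI to a set of values. Descriptions
    without exactly one key value are passed through unmerged.
    """
    merged = {}
    unmatched = []
    for description in descriptions:
        keys = description.get(key_property, set())
        if len(keys) != 1:
            # Zero or several key values: out of scope for this sketch.
            unmatched.append(description)
            continue
        (key,) = keys
        target = merged.setdefault(key, {})
        for prop, values in description.items():
            target.setdefault(prop, set()).update(values)
    return list(merged.values()) + unmatched


# foaf:mbox_sha1sum is the SHA1 hash of the mailto: URI of a mailbox.
alice_key = hashlib.sha1(b"mailto:alice@example.org").hexdigest()

# Descriptions as they might be crawled from three separate documents.
crawled = [
    {MBOX_SHA1SUM: {alice_key}, FOAF_NAME: {"Alice"}},
    {MBOX_SHA1SUM: {alice_key}, FOAF_HOMEPAGE: {"http://example.org/alice"}},
    {FOAF_NAME: {"Bob"}},
]

people = smush(crawled)
print(len(people))
```

The first two descriptions collapse into one person with both a name and a homepage, without anyone having minted a URI for Alice or asserted owl:sameAs.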
Some questions that occur to me are:
- Why not encourage people to reuse strong identifiers rather than create new ones? This reduces the need for owl:sameAs linking, and makes it even easier to merge data.
- Can smushing and approaches to using rdfs:seeAlso be more widely promoted/discussed as an approach to linking/fusion?
- Can we create simple data annotation tools that let people contribute to the web of data without requiring that they follow all of the Linked Data principles?
The notion of Annotated Data I’ve described in this post is an attempt to start that conversation. Because it lowers the bar to contribution, it may be easier to move people up the “on ramp” to contributing to the web of data. And arguably as the web of data grows, increasingly what people and organizations will be doing is annotating existing resources rather than creating new ones.
As a concrete use case, why not encourage publishers to simply publish RDF documents listing the foaf:topics of their content, but using dbpedia, or Freebase, or OpenCalais URIs as the topic URIs? This is simpler than publishing full Linked Data, is lower cost, and is fairly trivial to do using RDFa. They might later want to adopt more of the Linked Data publishing principles if they want more control over their URI schemes or are prepared to invest deeper in the technology.
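As a rough idea of how little is involved, here is a sketch that renders a list of topics as an RDFa fragment a publisher could drop into an existing page. I’m assuming RDFa 1.1 syntax (the `prefix` attribute, with an empty `about` referring to the page itself); the `topic_links` function and the sample topics are hypothetical.

```python
from html import escape

FOAF_NS = "http://xmlns.com/foaf/0.1/"


def topic_links(topics):
    """Render an RDFa 1.1 fragment stating foaf:topic for the enclosing page.

    `topics` maps a human-readable label to a topic URI owned by someone else.
    """
    items = "\n".join(
        f'  <li><a rel="foaf:topic" href="{escape(uri, quote=True)}">{escape(label)}</a></li>'
        for label, uri in topics.items()
    )
    # about="" makes the current document the subject of each statement.
    return f'<ul prefix="foaf: {FOAF_NS}" about="">\n{items}\n</ul>'


fragment = topic_links({
    "Semantic Web": "http://dbpedia.org/resource/Semantic_Web",
    "Linked data": "http://dbpedia.org/resource/Linked_data",
})
print(fragment)
```

The visible result is just a list of links; the only additions are a `prefix` declaration and a `rel` attribute on each link.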
Heresy or just good use of the full range of hypertext publishing mechanisms we have in RDF? Let me know your thoughts.