Annotated Data

One of the things I’ve always liked about the Semantic Web vision is the idea that “Anyone can say Anything, Anywhere” (hereafter: The AAA Principle). That is, I can publish data about anything, and that data can link to and annotate data that other people are publishing elsewhere. I’ve been wondering recently whether we’ve spent too much time focusing on publishing data and not enough on annotation. Some of this thinking is potentially heretical so I’m hoping for an interesting debate!

Before I leap into the heresy, let’s review the key steps of publishing Linked Data:

  1. Use URIs as names for things.
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
  4. Include links to other URIs, so that they can discover more things.

The dominant publishing pattern for Linked Data is for people to mint new URIs for their resources in a domain that they control. We then make links to other sources by using them as the object of statements in our data; owl:sameAs links are a special case of linking that asserts equality between the subject and object of that specific statement. Through this approach we tick off all of the Linked Data publishing steps.

Some people have argued that maybe we can drop the requirement of using RDF & SPARQL and still have “linked data”. I don’t agree with that, largely because the term already has a precise definition and so muddying it doesn’t really help the discussion. Publishing of data using HTTP URIs, using formats that natively define a linking mechanism, is to my mind simply “RESTful data publishing”. I’ve recently referred to this as “web integrated data”. I mention this because it’s an approach to data publishing that only uses three of the four Linked Data publishing guidelines.

What would happen if we chose to follow some other subset of the guidelines? In fact, what if we didn’t assign URIs to things, or publish data at those URIs, and instead just published RDF to the web?

If we want to take advantage of The AAA Principle then technically we don’t need to assign URIs to things. Or rather, to be precise, we don’t need to assign new URIs to things. We can simply reuse someone else’s URI; no need to mint a new one. We also don’t need to publish data at those URIs: we just need to make sure that the data is linked into the growing web of data and is therefore discoverable. We can do this and still use/publish RDF. Let’s refer to this form of publishing as “Annotated Data”, to distinguish it from Linked Data and Web Integrated Data.
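To make the contrast concrete, here is a minimal sketch in Python that emits N-Triples for the same comment in both styles. The helper names (`linked_data_style`, `annotated_data_style`) and the `example.org` URI are my own illustrative assumptions, and the literal escaping is deliberately simplified; it is not a full N-Triples serializer.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
RDFS_COMMENT = "http://www.w3.org/2000/01/rdf-schema#comment"


def uri_triple(subject, predicate, obj):
    """Format a triple whose object is a URI as one N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def literal_triple(subject, predicate, text):
    """Format a triple with a plain literal object (simplified escaping)."""
    escaped = (
        text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    return f'<{subject}> <{predicate}> "{escaped}" .'


def linked_data_style(local_uri, remote_uri, comment):
    """Mint a local URI, link it with owl:sameAs, then describe the local URI."""
    return [
        uri_triple(local_uri, OWL_SAME_AS, remote_uri),
        literal_triple(local_uri, RDFS_COMMENT, comment),
    ]


def annotated_data_style(remote_uri, comment):
    """Skip the local URI and describe the existing URI directly."""
    return [literal_triple(remote_uri, RDFS_COMMENT, comment)]


if __name__ == "__main__":
    france = "http://dbpedia.org/resource/France"
    # The local URI below is a hypothetical identifier in a domain we control.
    for line in linked_data_style("http://example.org/id/france", france, "A country"):
        print(line)
    for line in annotated_data_style(france, "A country"):
        print(line)
```

The annotated form needs one statement fewer and no URI space of our own, at the cost of depending on the remote identifier.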

Annotation is about publishing additional data about things that are already in the web. For that simple use case the need to deploy a Linked Data publishing framework is potentially overkill: publishing a document to a web server is all the machinery I need. Obviously by using someone else’s URIs I’m buying into the longevity of that URI space and the meaning of those identifiers. This may not be the right thing for some applications, but for many common use cases it may be good enough. Also, over time, as we get more hubs in the web of data, certain URI spaces are going to become much more stable because people will need them to be so in order to be reliable platforms upon which applications can be constructed. To put that another way: if we’re too fearful about relying on other people’s identifiers then we’ve got bigger problems.

Clearly if we’re just publishing RDF documents which contain statements about other people’s URIs then we can’t publish data at those URIs. So how will our annotations be found? How will they become part of the web of data? This is actually not that different to the current situation. Any given RDF dataset may have links to a small number of other datasets, but it will never comprehensively have links to all possible related datasets. That level of co-ordination just isn’t achievable. It may also not be desirable: there may be valid reasons why I don’t want to have reciprocal links to everyone who links to me, e.g. spam or other untrusted data sources. The solution here is that services like sameas.org or Sindice let us search and locate documents that refer to a specific resource, or other resources that have declared an equivalence. This same solution works for publishing Annotated Data: if we can ping a service or crawler that will index the content of our document then this small additional part can be linked into the whole. The current document web is not fully linked, so there’s no reason to expect the web of data to be either; there will always be the need for bridging/linking services.
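As a rough illustration of what such a bridging service does, here is a toy in-memory index in Python. It is only a sketch under my own assumptions: the class and method names (`AnnotationIndex`, `ping`, `documents_about`) are hypothetical and do not reflect how sameas.org or Sindice actually work, triples are plain tuples of strings, and any object starting with `http` is treated as a URI.

```python
from collections import defaultdict

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


class AnnotationIndex:
    """Toy index from resource URIs to the documents that mention them."""

    def __init__(self):
        self._mentions = defaultdict(set)  # resource URI -> document URLs
        self._aliases = defaultdict(set)   # resource URI -> declared equivalents

    def ping(self, doc_url, triples):
        """Index one document, given its triples as (subject, predicate, object)."""
        for subject, predicate, obj in triples:
            self._mentions[subject].add(doc_url)
            if obj.startswith("http"):  # assumption: such objects are URIs
                self._mentions[obj].add(doc_url)
            if predicate == OWL_SAME_AS:
                self._aliases[subject].add(obj)
                self._aliases[obj].add(subject)

    def documents_about(self, uri):
        """Return documents mentioning the URI or a directly declared equivalent."""
        candidates = {uri} | self._aliases.get(uri, set())
        docs = set()
        for candidate in candidates:
            docs |= self._mentions.get(candidate, set())
        return sorted(docs)


if __name__ == "__main__":
    index = AnnotationIndex()
    index.ping(
        "http://example.org/notes.rdf",
        [("http://dbpedia.org/resource/France",
          "http://www.w3.org/2000/01/rdf-schema#comment",
          "An annotation published on my own server")],
    )
    print(index.documents_about("http://dbpedia.org/resource/France"))
```

Equivalences are only followed one step here; a real service would need to decide how far to chase them and which sources to trust.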

What I’m describing here is broadly what we used to do in the early days of FOAF: we just published RDF documents with rdfs:seeAlso links and crawled them to compile data. This scruffy, lo-fi approach to the web of data was based on the assumption that having strong identifiers for things (particularly people) may not scale or be socially acceptable. It was also based on having more flexible notions of data merging; identification by description (“smushing”) gave us a little more leeway. Now we promote use of strong identifiers and strong notions of equality using owl:sameAs. This is clearly progress, as evidenced by the much larger collections of data we’ve created. But there are concerns about whether owl:sameAs may be too formal for lightweight Linked Data integration. Perhaps we could see these approaches as opposite ends of the spectrum, and be willing to explore more of the middle-ground?
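For anyone who hasn’t met smushing before, here is a minimal Python sketch of the idea: descriptions that share a value for an identifying property are merged, without anyone having assigned a URI to the person. The data layout (dicts of property to set of values) and the `smush` function are my own assumptions for illustration; `foaf:mbox` is used as the key because FOAF declares it inverse functional.

```python
FOAF_MBOX = "http://xmlns.com/foaf/0.1/mbox"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
FOAF_HOMEPAGE = "http://xmlns.com/foaf/0.1/homepage"


def smush(descriptions, key_property=FOAF_MBOX):
    """Merge descriptions that share at least one value of key_property.

    Each description is a dict mapping a property URI to a set of values.
    Descriptions without the key property are left unmerged.
    """
    groups = []
    for description in descriptions:
        # Copy so that the caller's data is never modified.
        merged = {prop: set(values) for prop, values in description.items()}
        remaining = []
        for group in groups:
            shared = group.get(key_property, set()) & merged.get(key_property, set())
            if shared:
                # Same identifying value: fold the earlier group into this one.
                for prop, values in group.items():
                    merged.setdefault(prop, set()).update(values)
            else:
                remaining.append(group)
        remaining.append(merged)
        groups = remaining
    return groups


if __name__ == "__main__":
    crawled = [
        {FOAF_MBOX: {"mailto:alice@example.org"}, FOAF_NAME: {"Alice"}},
        {FOAF_MBOX: {"mailto:alice@example.org"},
         FOAF_HOMEPAGE: {"http://example.org/alice"}},
        {FOAF_MBOX: {"mailto:bob@example.org"}, FOAF_NAME: {"Bob"}},
    ]
    for person in smush(crawled):
        print(person)
```

The first two descriptions collapse into one because they share a mailbox, which is the kind of leeway that identification by description gives us.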

Some questions that occur to me are:

  • Why not encourage people to reuse strong identifiers rather than create new ones? This reduces the need for owl:sameAs linking, and makes it even easier to merge data.
  • Can smushing and approaches to using rdfs:seeAlso be more widely promoted/discussed as an approach to linking/fusion?
  • Can we create simple data annotation tools that let people contribute to the web of data without requiring that they follow all of the Linked Data principles?

The notion of Annotated Data I’ve described in this post is an attempt to start that conversation. Because it lowers the bar to contribution, it may be easier to move people up the “on ramp” to contributing to the web of data. And arguably as the web of data grows, increasingly what people and organizations will be doing is annotating existing resources rather than creating new ones.

As a concrete use case, why not encourage publishers to simply publish RDF documents listing the foaf:topics of their content, but using dbpedia, or Freebase, or OpenCalais URIs as the topic URIs? This is simpler than publishing full Linked Data, is lower cost, and is fairly trivial to do using RDFa. They might later want to adopt more of the Linked Data publishing principles if they want more control over their URI schemes or are prepared to invest deeper in the technology.
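A sketch of how little is involved, in Python. It writes N-Triples rather than RDFa purely to keep the example short; the function name `topic_triples` and the article URL are hypothetical, and the topic URIs are just examples of existing dbpedia identifiers.

```python
FOAF_TOPIC = "http://xmlns.com/foaf/0.1/topic"


def topic_triples(document_url, topic_uris):
    """Return one foaf:topic statement per distinct topic URI, in input order."""
    distinct = list(dict.fromkeys(topic_uris))  # drop duplicates, keep order
    return [f"<{document_url}> <{FOAF_TOPIC}> <{topic}> ." for topic in distinct]


if __name__ == "__main__":
    lines = topic_triples(
        "http://example.org/articles/paris-travel-guide",  # hypothetical article
        [
            "http://dbpedia.org/resource/Paris",
            "http://dbpedia.org/resource/France",
        ],
    )
    # Saving these lines as a static file on a web server is all the
    # publishing machinery this use case needs.
    print("\n".join(lines))
```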

Heresy or just good use of the full range of hypertext publishing mechanisms we have in RDF? Let me know your thoughts.

16 thoughts on “Annotated Data”

  1. doesn’t sound heretic to me. in the projects i’m involved with, we’re just a bit behind in terms of URIs, so we first have to mint a couple new ones to have something to talk about 🙂

  2. I definitely agree that the ‘annotation’ approach is often the most useful one.

    One problem is that the RDF that can be obtained by dereferencing the URI is likely to be regarded as having a special status, because it can be found directly without having to consult any external index. But in general it may not be the case that the ‘owner’ of the URI has the most relevant or accurate information about that resource.

  3. Eric,

    I agree with you, but the usual discussion of how to publish Linked Data (and I’m as guilty of this as anyone) doesn’t explore notions that the URIs that we use to name things (Principle #1) may be in other domains. Or that “looking up” a URI (Principle #4) doesn’t have to be a direct dereference of that URI; it may involve looking up that URI using a service like SameAs.org.

    It’s natural enough when putting together tutorials and frameworks for linked data publishing to focus on the mint your own URIs & publish from those URIs approach, because that gets the core message across. It’s also natural when there aren’t many stable URIs to reuse.

    But as things grow, the emphasis will naturally change, and so as we encourage people to adopt the Linked Data style of publishing data, we need to encourage more annotation. It’s not only simpler, it also side-steps a few issues, as I alluded to in the post.

    I’m also being a bit playful with the whole “heresy” thing.

    Cheers,

    L.

  4. minting URIs in your own domain or not is already an implementation detail, i’d say. using HTTP URIs to identify things in the first place seems to be the big change. what i’m typically confronted with are annotations of stuff that can’t be identified unambiguously – think “as explained in Dodds 2009”. so owl:sameAs or using the same URI in two annotations doesn’t make much difference compared to teaching a machine the difference between “Dodds 2009a” and “Dodds 2009b”.

  5. Interesting discussion Leigh. Thanks for starting it. Couple of points.

    First: The only reason why we can have this conversation today is because now we have quite a nice selection of existing, reasonably stable, well-managed, dereferenceable identifiers around, from DBpedia, Freebase, and hopefully soon from many governments and other public organisations. This wasn’t the case two years ago. Two years ago, the conversation was: “Use a bNode, or use some unresolvable URI, or use the identifier for France that someone minted in their FOAF file, or mint my own.” Out of those, minting your own is clearly the best.

    Second: Until quite recently, there was a widespread attitude that URI aliases are harmful, that the Semantic Web would be impossible to attain if we invent new identifiers rather than going to great lengths finding some existing canonical identifier. This attitude is slowly disappearing, luckily.

    Third: If the backend system that you expose as RDF has its own mechanisms for identifier management, then you’re really better off publishing your data attached to URIs that are based on these internal identifiers. Linking to other identifiers, which are managed independently, should be a separate step; especially if you use imperfect heuristics to discover the coreference.

    Fourth: If your data is inherently connected to someone else’s identifiers (e.g. because you annotated their identifiers in the first place, or because both of you use the same non-URI identifiers as a base, like motorway numbers), then I think it’s reasonable to simply publish RDF documents that use those foreign identifiers.

    Fifth: If you trust the owner of those foreign identifiers to be around longer than the expected lifetime of your dataset, then it’s quite fine to base your data on their identifiers. If not, then you should manage your own identifiers and link to theirs.

    • Hi Richard,

      Thanks for the comments, you make some good points. I’d argue that there’s still not enough of an understanding that creating new URI aliases and using owl:sameAs is not the only way to publish RDF data. I see little advice to the contrary.

      Your point about surfacing data from existing applications is well taken, creating new URIs based on database keys may be the simplest approach. Although at some point you’d expect there to be some alignment between URIs in, e.g. dbpedia. So starting from a perspective of “tagging” a local identifier with a public URI is equally valid.

      The other thing to highlight is that some organizations could have a valid and important role in the development of the web of data if they only focused on defining stable URIs for things. Governments might do this, for example, enabling citizens and businesses to use these as points for common annotation. To some extent this is what is happening in the first phases of the development of the UK government linked data.

    • Hi Tom,

      Good point re: simply adding data at someone’s URI space. Linked Data publishers could usefully provide or enable this kind of feature, suitably qualified for trustworthiness, as part of their publishing framework.

      btw, I love the “web as a cms” concept, I think it’s a really important change in perspective for how organisations can engage with the web.

  6. (Richard, you may need to check the definition of ‘couple’)

    Thanks Leigh, great essay. It would be unfortunate if publishing data in RDF becomes an exercise in bondage-and-discipline, which is what some linked data advocates (the new neats?) seem to advocate.

    Horrible antipattern I’ve seen rather too often:

    owl:sameAs dbpedia:Sesame_Street ;
    rdfs:label “Sesame Street”@en ; ….

  7. I want to publish an “annotation” about dbpedia:France on my own web server. Regardless of whether I use dbpedia:France or ex:France (owl:sameAs dbpedia:France) in my descriptions, consumers still need to discover those descriptions, and the only way for them to do so is to query a central indexer (Sindice and/or sameas.org).

    This means that even in the *current* Linked Data web, the annotation aspect is not “linked” at all: you cannot discover the annotations of dbpedia:France by links alone. The nature of the URIs used does not affect this; it merely determines if you need an additional owl:sameAs step to interpret the annotations.

  8. […] Annotation Datasets provide context to, and enrich other reference datasets. Annotations might be limited to linking information (“Link Sets”) or they may add new facts/properties about existing resources. Independently sourced quality control information could be published as annotations. […]
