Enhanced Descriptions: “Premium Linked Data”

I’ve had several conversations recently with people who are either interested in, or actually implementing, Linked Data, and are struggling with some important questions:

  • How much data should I give away?
  • If I wanted to charge for more than just the basic data, then how would I handle that?

My usual response to the first of those questions is: “as much as you feel comfortable with”. There’s still so much data that’s not yet visible or accessible in machine-readable formats that any progress is good progress. Let’s get more data out there now. More is better.

It usually doesn’t take long to get to the second question. If you’ve spent time evangelising to people about the power and value of data, and particularly their data, then it’s natural for them to begin thinking about how it can be monetized.

Scott Brinker has done a good job of summarising a range of options for Linked Data business models, and I’ve already chipped into that discussion. What I want to briefly discuss here are some of the mechanics of implementing access to what we might call “premium Linked Data”, or, as I’ll refer to it, “Enhanced Descriptions”.

Premium Linked Data

It’s possible to publish Linked Data that is entirely access controlled. Access might be limited to users behind the firewall (“Enterprise Linked Data”) or only to authorised paying customers. As a paid-up customer you’d be given an entry point into that Linked Data and would supply appropriate credentials in order to access it.

This data isn’t going to be something you’d discover on the open web. There are many different authentication models that could be used to mediate access to this “Dark Data”. The precise mechanisms aren’t that important, and the right one is likely to vary across industries and use cases, although I think there’s a strong argument for using something that dovetails nicely with HTTP and web infrastructure in general.
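To make that concrete, here’s a minimal sketch of a paying customer retrieving access-controlled Linked Data, assuming an HTTP Basic authentication scheme; the entry point URI and the credentials are hypothetical placeholders.

# A paying customer fetching access-controlled Linked Data.
# The entry point URI and credentials are hypothetical.
import requests

ENTRY_POINT = "https://data.example.org/premium/thing"

response = requests.get(
    ENTRY_POINT,
    auth=("customer-id", "api-secret"),   # credentials supplied with the request
    headers={"Accept": "text/turtle"},    # negotiate for a Turtle description
)

if response.status_code == 401:
    print("Not authorised: check your credentials")
else:
    response.raise_for_status()
    print(response.text)  # the access-controlled description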

What interests me more is the scenario in which a data publisher might be exposing some public data under a liberal open license, but also wants to make available some “premium” metadata, i.e. some value-added data that is only available to paying customers. In this scenario it would be useful to be able to link together the open and closed data, allowing a user agent to detect that there is extra value hidden behind some kind of authentication barrier. I think this is likely to become a very common pattern, as it aids discovery of the value-added material. Essentially it’s the existing pattern for access-controlled content that we have on the web of documents.

It’s the mechanics of implementing this public/private scenario that have cropped up in my recent conversations.

Enhanced Descriptions

When I dereference the URI of a resource I will typically get redirected to a document that describes that resource. This document might contain data like this (in Turtle):


@prefix ex:   <http://example.org/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

ex:document
  foaf:primaryTopic ex:thing.

ex:thing
  rdfs:label "Some Thing".

i.e. the document contains some data about the resource, and there’s a primary topic relationship between the document and the resource.
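As an aside on the mechanics of that dereference: for a resource that isn’t itself a document, the redirect is conventionally an HTTP 303 See Other pointing at the describing document. Here’s a minimal sketch of following it, with hypothetical URIs:

# Dereferencing a resource URI: the server redirects (conventionally
# 303 See Other) to a document describing the resource.
# The URIs here are hypothetical.
import requests

response = requests.get("http://example.org/thing",
                        headers={"Accept": "text/turtle"})

# requests follows redirects by default; the hops are kept in .history
for hop in response.history:
    print(hop.status_code, "->", hop.headers.get("Location"))

print("Description document:", response.url)  # e.g. http://example.org/document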

If we want to point to additional RDF documents that also describe this resource, or related data, then we can use an rdfs:seeAlso link:


@prefix ex:   <http://example.org/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

ex:document
  foaf:primaryTopic ex:thing.

ex:thing rdfs:label "Some Thing";
  rdfs:seeAlso ex:otherDocument.

We can use the rdfs:seeAlso relationship to point to additional documents either within a specific dataset or in other locations on the web. Those documents provide useful annotations about a resource.

An “Enhanced Description” will contain additional value-added data about a resource. We could just refer to this document using an rdfs:seeAlso link, but if we do that then a user agent can’t easily distinguish an arbitrary rdfs:seeAlso link from one that refers to premium data. We could instead use an additional relationship, a specialisation of rdfs:seeAlso, to disambiguate the two. I’ve defined just such a predicate: ov:enhancedDescription.


@prefix ex:   <http://example.org/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
# assumed URI for the Open Vocab terms namespace
@prefix ov:   <http://open.vocab.org/terms/> .

ex:document
  foaf:primaryTopic ex:thing.

ex:thing rdfs:label "Some Thing";
  rdfs:seeAlso ex:otherDocument;
  ov:enhancedDescription ex:premiumDocument.

By using a separate document to hold the value-added annotations, we make it possible for user agents to identify those documents (via the predicate) and to be challenged for credentials when they retrieve them (e.g. with an HTTP 401 Unauthorized response).
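Here’s a minimal sketch of such a user agent, assuming rdflib and requests; the ov: namespace URI, the document URIs, and the credentials are all assumptions:

# A user agent that spots ov:enhancedDescription links in a public
# description and retries with credentials on a 401 challenge.
# The ov: namespace URI and all other URIs/credentials are assumptions.
import requests
from rdflib import Graph, Namespace

OV = Namespace("http://open.vocab.org/terms/")  # assumed namespace for ov:

def fetch_enhanced_descriptions(public_doc_uri, credentials):
    graph = Graph()
    graph.parse(public_doc_uri, format="turtle")  # the open, public data

    # Follow every document advertised as holding value-added data.
    for _, _, premium_doc in graph.triples((None, OV.enhancedDescription, None)):
        response = requests.get(str(premium_doc),
                                headers={"Accept": "text/turtle"})
        if response.status_code == 401:
            # Challenged: retry as a paying customer.
            response = requests.get(str(premium_doc),
                                    auth=credentials,
                                    headers={"Accept": "text/turtle"})
        if response.ok:
            graph.parse(data=response.text, format="turtle")

    return graph

merged = fetch_enhanced_descriptions("http://example.org/document",
                                     ("customer-id", "api-secret"))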

It also means data publishers can safely dip a toe in the open data waters, while leaving richer descriptions protected, yet still discoverable, behind an access control layer.

Another Approach?

Interestingly I discovered earlier today that OpenCalais returns a “402 Payment Required” status code for some documents.

To see this in practice, visit their description of IBM and try accessing the last of the owl:sameAs links. I’m guessing they’re using a similar technique to the one I’ve outlined here, but the key difference is that rather than use separate documents, they’ve decided to mint new URIs for the access-controlled version of the Linked Data. It would be nice if someone out there could confirm that.

Assuming I’ve interpreted what they’re doing correctly, I think this approach has some failings. Firstly, it creates extra URIs that aren’t really needed; a pattern in which publishers mint two URIs (public and private) for each resource isn’t going to help matters.

Secondly, just like using a generic “see also” relation, using owl:sameAs makes it impossible to detect which resource provides access to the premium data, as opposed to the other resources that exist on the web, without doing some fragile URI matching.
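To make that fragility concrete, here’s a sketch of what a user agent is reduced to under such a scheme: blindly probing each owl:sameAs target and watching for a 402. The description URI is hypothetical.

# Probing owl:sameAs targets for a 402 Payment Required response.
# The description URI is hypothetical.
import requests
from rdflib import Graph
from rdflib.namespace import OWL

graph = Graph()
graph.parse("http://example.org/ibm-description", format="turtle")

for _, _, same_as in graph.triples((None, OWL.sameAs, None)):
    status = requests.head(str(same_as), allow_redirects=True).status_code
    if status == 402:
        print("Payment required for:", same_as)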

Apologies to the OpenCalais team if I’ve misunderstood the mechanism they’re using. I’ll happily publish a correction, but regardless, I’m intrigued by the 402 status code! 🙂

Summary

In my view, the “Enhanced Description” approach is a simple pattern to implement. It’s one that I’ve been recommending to people recently, but I’ve not seen it documented anywhere, so I thought I’d write it up.

I’d be interested to hear from others who have either implemented the same mechanism or, like OpenCalais, are using other schemes.

13 thoughts on “Enhanced Descriptions: “Premium Linked Data””

  1. While the technical options are interesting (e.g. is this a role for certificates, FOAF+SSL etc.), I just wanted to mention one business/license scenario: people can pay for earlier access to time-sensitive materials, which otherwise make their way into universally-available public datasets some time later. How the timing looks would naturally vary by domain – in some fields, a few minutes makes all the difference. I think MusicBrainz for example has sometimes licensed immediate access to some records, while making them open and public some time later. Can’t find details on this right now but I’m pretty sure I didn’t imagine it!

  2. Linked Data is basically a highway-building business model, as was articulated a while back re. the UMBEL project [1]. The quality of the highways (inference context rules) and the tolls (402s) are naturally part of the mix re. Business of Linked Data (BOLD) models.

    People in the Linked Data realm often pooh-pooh OWL; ironically, it’s OWL that’s going to be the key weapon for executing premium routes on the Linked Data driven Information Super Highway 🙂

    Links:

    1. http://umbel.org
    2. http://bit.ly/90tUKJ — old post re. the state of the Linked Data Web (which should link to the post about UMBEL and Data Dictionaries etc.).

    Kingsley

  3. Thanks for this post, Leigh!

    I think this approach is at once elegant and straightforward to both understand and implement. It allows providers to manage the “controlled” aspects of their datasets in a straightforward way. A disadvantage is that it provides the “extended” data as a “canard” on the primary cluster of data.

    From a security standpoint, some might be concerned that this approach allows the client to learn of the existence of a resource that they potentially don’t have access to. A more secure solution (I believe) would ensure that clients can only ever see what they are authorized to see.

    What I’m suggesting is that “access” could either be controlled at the predicate/vocabulary level — there are certain “premium” predicates, and access to the triples based on them is restricted — or at the triple level (meaning, really, at the quad level).

    This means that every client would be challenged to authenticate; depending upon their credentials, they would have access to different subsets of the total set of triples.

    An advantage of this approach is that a client service (customer) could readily “upgrade” without having to think about processing the extended set; once they upgraded, their view of the data would simply be expanded.

    I hope this has made sense; we can discuss further!

    John

  4. John,

    Remember that we now have the following re. Linked Data and their host Graphs:

    1. FOAF+SSL based Identity
    2. Access Control Lists scoped to Named Graphs
    3. Context Rules exist in Named Graphs (so #1 + #2 applies)
    4. SPARQL endpoints can be protected by FOAF+SSL (so you can control access to data sets at an even higher level).

    FOAF+SSL should render current API Keys obsolete.

    Named Graphs are going to become the Entity Oriented Data Access realm variant of SQL Views (Transient or Fully Materialized).

    Not only do we have the super information highways, we also have the concessions and scenic views in the mix too. User Agents based on User Profiles give a Linked Data server enough information to provide optimized paths based on preferences 🙂

    Kingsley

  5. Two remarks:
    – Regarding the technical details of “how do we distribute premium linked data”, we should keep all technical options open, as long as the things referred to are described with the same URIs. (I do not disagree with the options mentioned, but the concrete means is to be decided for each use case at the customer’s site.)

    – My assumption is that we are quite mature (as a community) regarding platforms & tools to technically provide linked data to any kind of consumers.

    – I think we need fitting answers for businesses asking us, “how do we implement and execute these business models IT-wise?” (e.g. billing of data access/usage etc.).

  6. Regarding keeping “all technical options open,” I’d like to understand a bit more, Daniel, about where you would draw the line.

    Looking forward, one of the key opportunities with the “Web of Data” model (leveraging linked data principles) is ease of data integration across a variety of sources. As value-added, premium datasets become more widely available, it will be critically important for clients, and especially other services, to access them in a uniform way. This will be much easier if those services are presented with, and in turn present to their consumers, uniform authentication and authorization models.

    Consider the iconic Web of Data graph. That is a reality in part because of the consistency of the data model, and in part because of the uniformity of the access models for the various providers — including access control, which is non-existent!

    One can imagine the emergence of an ecosystem of premium applications based on constructing and traversing such graphs. It is hard to imagine this happening at significant scale if such construction cannot be done with ease and agility, which begs for a minimum degree of technical uniformity…

  7. The value unit is a LINK (which is now both an Object Identifier and a Data Representation Location); value is a function of “data access” via generic HTTP URIs. This is why the URI is also the Digital Brand Emblem, kinda like seeing the sign “Route 93”, which also puts you on that highway should it match your transport requirements (sticking to the highway metaphor).

    Kingsley

  8. Re John’s question:

    I want to separate the question of which data are selected (that’s where the premium content comes in) from how the data are transported.

    As in earlier ‘middleware’ history, you will have different means for different (technical) requirements in customers’ scenarios. (I think of e.g. a service bus scenario, where the focus is on fast distribution of updates.)

    But one thing remains important: regardless of how you get and distribute the data, the data consumer has to be able to put it together again – based on URIs.

  9. RE the premium presentation of facets of data, consider for a moment using FluidDB’s permissions model to provide differential access control to a linked data (http://linkeddata.org) dataset. Specifically, imagine that a given dataset falls under the equivalent of FluidDB’s closed policy, and individual users — more likely sets of users — have differential access to assertions, just as FluidDB users have differential access to “tags” associated with objects.

    One might push this analogy even further; FluidDB’s permissions model (which covers only namespaces and tags; objects don’t have permissions) addresses all of CRUD for operations on tags, which one might extrapolate to assertions in a triple store or quad store. I do think the analogy only works if one equates FluidDB’s namespaces with named graphs, however…

    Note: This would be a much simpler discussion if FluidDB was more obviously compatible with linked data principles…
