Lunch Hour Game

Our daily office random lunch hour discussion veered into the topic of reality TV today, namely: what new shows could we make up? Come on, you’ve all done it!
Here are my contributions:
1950’s Wife Swap: Like Wife Swap except you exchange spouses with a family from the 1950s. Hilarity ensues. Note: idea slightly limited by need for time travel and/or availability of character actors.
Ready, Steady, Survive!: Ray Mears takes a number of well-known cooks into the wilderness and then presents them with 5 random ingredients harvested from Nature. The winner is the chef who makes the best dish out of the available bush tucker.
Habitat Swap: Presented by David Attenborough and Davina McCall this show selects two animals and forces them to swap habitats for a week. The viewers get to follow the travails of the beasts as they attempt to evolve within a week. The winner is presented with a wildlife preservation order. First guests are a red ant and a black ant.
Call Yourself a Pharaoh?: Sarah Beeny presents this show following the efforts of several tyrants to construct massive monuments and/or tombs using a thousand slaves each. Beeny provides constructive advice on managing a large-scale project, e.g. transportation of massive stone blocks, costing the plaster work required for a pyramid, etc.
Any better than that?


Danny’s discussion about sending FOAF URLs as HTTP headers reminded me that I’d not yet followed up on some similar proposals I’d made at XTech 2005. In particular, the use of DOAP descriptions instead of “API Keys” for RESTful interfaces.
In my paper, after reviewing how services supported authentication and the linking of resources, I wrote:

Many of the services support the notion of an “API Key”. These keys are allocated on a per-application basis and are a required parameter in all requests. Typically a form is provided that allows an application developer to quickly obtain a key. Often some context about its intended usage, such as application name, description and a homepage URL must be supplied.

While API keys are not used for authentication they are used as a mechanism to support usage tracking of the API, e.g. to identify active applications and potentially spot abuses. From this perspective they are a useful feature that furnishes service providers with useful usage statistics about client applications…

Later, I critiqued the use of hypermedia to link together the different resources exposed via several RESTful interfaces, noting that very few services actually used this technique, relying instead on the client to construct additional URLs in order to extract more data. One item that frequently needs to be added is an API Key, which:

…prohibits free publishing of links, as given URL is only suitable for use by a single application, the one to which the key was assigned.

It is the use of API keys that is the most troublesome. While obviously providing a useful feature, API keys hamper loose ad hoc integration; clients must know how to manipulate a URL to insert an API key. Therefore, while a service may provide unauthenticated use of read-only URLs, these links cannot be openly published without also sharing an API key. This obviously undermines their potential benefits.

An alternative to using API keys in the URL is to require applications to identify themselves using an existing HTTP feature: the User-Agent header. This header can be used to pass an application name, URL, or other token to a service without requiring modification of the request URL. An API key is actually request metadata, and HTTP headers are the correct place for this metadata to be communicated.

Some APIs already support or encourage use of User-Agent, notably and WebJay. However the technique isn’t suitable for all environments, e.g. a bookmarklet, where one has no control over the HTTP headers.
User-Agent is also problematic due to its unstructured format: the field is basically free text, and browser User-Agent strings are already hopelessly muddled.
In my presentation I suggested using an alternative HTTP header, X-DOAP, whose value would be the URI of the DOAP description of the client application. This header would supplant the use of API Keys, or at least be encouraged as an alternative mechanism for identifying a client application. To my mind this provides the same level of detail and usage tracking as an API key, but in a more flexible manner.
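As a rough sketch of the idea, here’s how a client might identify itself via headers rather than a URL token. The service endpoint, API key, and DOAP URI below are all hypothetical placeholders:

```python
import urllib.request

# Hypothetical service endpoint and DOAP description for the client.
SERVICE_URL = "http://api.example.org/photos/recent"
DOAP_URI = "http://example.org/myapp/doap.rdf"

# API-key style: the key is baked into the URL, so the link cannot be
# published without also sharing the key.
keyed_url = SERVICE_URL + "?api_key=abc123"

# Header style: the URL stays clean and freely publishable; the client
# identifies itself out-of-band via User-Agent and/or an X-DOAP header
# pointing at its DOAP description.
request = urllib.request.Request(SERVICE_URL, headers={
    "User-Agent": "MyApp/1.0 (+http://example.org/myapp)",
    "X-DOAP": DOAP_URI,
})

# The request URL now carries no application-specific token, yet the
# service can still identify and track the client application.
# (Note: urllib normalises stored header names, e.g. "X-DOAP" -> "X-doap".)
assert "api_key" not in request.full_url
```

The same URL can then be bookmarked, emailed, or published in a feed without leaking any per-application credentials.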
It’s worth noting that Greasemonkey (and other AJAX environments, I assume) allows the addition of custom HTTP headers to outgoing requests. So one can use both X-DOAP and Danny’s (X-)FOAF headers to identify both the client application and the user. As far as I can see it’s only bookmarklets that are limited to not having access to either the User-Agent settings or other outgoing headers. I’m not certain that a lot of API accesses come from those environments anyway; I’d hazard that custom applications and AJAX clients are increasingly the norm.
X-DOAP could be used now, assuming consensus could be reached amongst the various service providers. As Danny has noted, an official registration would carry a lot more weight.

Lost: The Game

Watching the latest episode of Lost last night, I started to wonder whether anyone had already seized on the idea of turning it into a game.
Maybe it’s a little close to the bone given current events, but helping survivors build a working community after a plane crash or shipwreck seems like an interesting spin on the whole God Sim genre. Those kinds of games revolve around basic puzzle solving and resource management.
Throw in a means to write mods that alter the gameplay, add new elements, etc., and you could quickly create the kind of bizarre and strange scenarios that we’re seeing crop up in Lost.
Kind of “The Sims get Lost” or “Post-Apocalyptic Civ”.

Smushing Algorithms

I was pleased to see Leo Sauermann recently publish a draft smushing algorithm as he’s saved me a job! There’s some subsequent discussion on the ESW wiki.
I agree with Sauermann that this is an underspecified but significant area. I also suspect there’s room for a range of algorithms optimised for different purposes.
For example, in a simple application working on relatively small data sets it may be simpler, and sufficient, to smush together all resources irrespective of whether they’re blank nodes or URIs. Just do a global merge to reorganize the properties, ensuring that all the data is collated in a single resource. This could simplify things somewhat at the application level and would remove the need for a triple store that is aware of the semantics of owl:sameAs. This is what my own code does, for example.
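As an illustration (not my actual code), here’s a minimal global smush over triples modelled as plain (subject, property, object) tuples. “mbox” stands in for an inverse functional property (IFP) such as foaf:mbox; all names are made up:

```python
# Treat any two subjects sharing an IFP value as the same resource,
# and rewrite the whole graph so they collapse into a single node.
IFPS = {"mbox"}

def smush(triples):
    canonical = {}   # IFP value -> first subject seen with it
    rename = {}      # later subjects -> the canonical subject
    for s, p, o in triples:
        if p in IFPS:
            if o in canonical:
                rename[s] = canonical[o]
            else:
                canonical[o] = s
    # Rewrite every triple so that merged subjects share a single node.
    return {(rename.get(s, s), p, rename.get(o, o)) for s, p, o in triples}

data = [
    ("_:a", "mbox", "mailto:leigh@example.org"),
    ("_:a", "name", "Leigh"),
    ("_:b", "mbox", "mailto:leigh@example.org"),
    ("_:b", "homepage", "http://example.org/"),
]

merged = smush(data)
# All four statements now hang off a single node, "_:a".
```

Note that the original node identifiers are simply discarded here, which is exactly the trade-off discussed below.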
However, if you’re regularly trawling the web for data, maintaining provenance and original URIs will be important. In that case, simply collapsing bNodes into a suitable “canonical resource”, with owl:sameAs linking the related resources, is more flexible.
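The gentler variant might look like this sketch: rather than rewriting statements, elect a canonical resource per IFP value and emit owl:sameAs links between the equivalent resources, so the original URIs (and hence provenance) survive. Names are again illustrative:

```python
# Provenance-preserving smush: leave the source triples untouched and
# generate owl:sameAs statements linking equivalent resources instead.
IFPS = {"mbox"}

def sameas_links(triples):
    canonical = {}   # IFP value -> elected canonical subject
    links = set()
    for s, p, o in triples:
        if p in IFPS:
            if o in canonical and canonical[o] != s:
                links.add((s, "owl:sameAs", canonical[o]))
            else:
                canonical.setdefault(o, s)
    return links

data = [
    ("http://a.example/#me", "mbox", "mailto:leigh@example.org"),
    ("http://b.example/#me", "mbox", "mailto:leigh@example.org"),
]

links = sameas_links(data)
# -> {("http://b.example/#me", "owl:sameAs", "http://a.example/#me")}
```

An owl:sameAs-aware store can then present the merged view on demand, while the raw data remains attributable to its sources.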
For large data sets, especially where they’re incrementally updated, incremental smushing will be important. This suggests keeping indexes of IFP properties and values to make the merging more efficient. Depending on the store implementation it may also be more efficient to simply add properties to existing resources rather than merge the graphs and subsequently smush the data.
There’s a lot of scope for experimental research here to explore the different approaches and their trade-offs. There’s plenty of data out there to play with, and some performance metrics would be a useful supplement to Sauermann’s specification.

Due Diligence

According to Wikipedia, due diligence is “the effort a party makes to avoid harm to another party”. It goes on to note that within a business context a “due diligence report” is often prepared to discover all risks and implications regarding a decision to be made.
I think this concept should be embraced by the open data movement. In short when you publish a public collection of data I think there’s some due diligence that should take place. My reasoning plays off both of these definitions.
Firstly: avoiding harm. This one is relatively straightforward. Don’t publish any data about a user unless they’ve expressly allowed it. Or, perhaps more realistically, don’t publish any sensitive data (e.g. email addresses) without permission. I’m not aware of any sites that do this, but it should probably be set in stone somewhere to reinforce the convention. Privacy issues are only going to get worse as data becomes more easily available.
Secondly: understanding the risks and implications of the decision. There are several aspects to explore here.
The business implications of releasing open data can be manifold: what are you gaining and losing as a result of increased data sharing? From my perspective, opening up your database is at least a tacit acknowledgement that you’re happy that part, perhaps all, of your business model is shifting to exploit second-order effects. For example, you no longer charge for or hide data, with the intent that the increased traffic or usage resulting from social content hacking will indirectly affect revenues.
There are also copyright and licensing issues. Very few of the social content sites that I’ve explored have clear licensing for their data and APIs. MusicBrainz is streets ahead here. You have to think beyond simple usage licensing (personal/academic/commercial) to issues like aggregation:

  • Can I freely aggregate all your data for a non-commercial application?
  • Is all of your data consistently licensed? For example flickr allows Creative Commons licensing of photos, but what about the photo and personal metadata?
  • Can I redistribute your metadata? And how can I relicense it?
  • How much provenance tracking must be done?

This area is well-worn ground in DRM circles but the issues are not incompatible with open licensing.
Which brings me to my third point: relationships between your data and that already out there “in the wild”.
I’ve spent a fair amount of time looking through collections of data on and very little of it, even where the data is from a common domain, is interlinked. For example there are several geo data sets which could easily be interrelated.
The beauty of RDF is that I can of course begin to publish these interconnections myself, but I think this should become part of the due diligence undertaken by data providers. It helps to avoid perpetuating data islands and makes free mixing of data much easier.
The diligence doesn’t only apply to data, but also schemas. If you’re publishing an RDF schema it’s your job to ensure that you’ve made some effort to relate your terms to existing vocabularies where possible. Again, third parties can easily annotate your schema to include missing or additional relationships. However, to ensure we have not only a web of documents, but also a web of schemas (allowing agents to explore ontology relationships), schema authors must include relevant links. There are some other best practices they should follow too.

Bookmarking Etiquette

Some notes on a brief discussion I had with Geoff yesterday about tagging behaviour, in particular: what’s the etiquette involved in shared bookmarking?
Geoff has previously written about social bookmarking as telltale and the advantages of brain subscriptions. He’d also recently pointed me at a New Scientist article discussing research which shows that email forwarding amounts to ritual gift exchange.
Or rather, he bookmarked it in and I noticed it via my RSS reader and went away and read the article. I, like Geoff, subscribe to various people’s bookmark feeds as a way to find interesting and relevant content.
As we subscribe to each other’s feeds, the act of bookmarking has started to subsume the previous activity of email/link forwarding. I think this is another form of tagging behaviour that’s distinct from both the filing and annotative approaches.
The potential point of etiquette was this: if you tag something as a result of a friend’s bookmarking activity, are you adding noise to their aggregator? After all, they’ve seen it already.
I think we concluded that it wasn’t. After all, one might still legitimately want to bookmark something in order to file it away. This raised the issue of whether both public and private bookmarks are useful features. Furl has them I believe, but does not. There’s also the fact that the additional bookmark constitutes another hop through the social network graph.
As I write this it occurs to me that the feedback loop is useful in that it indicates that the “gift” was received and deemed useful or relevant. Certainly worth a bookmark. Annotation features provide a way to add simple comments that might also be of interest to the originator of the bookmark. A kind of tagging back channel?
Given the level of interest in this space, I presume that someone has begun classifying these different types of activity. The classification would include not only the act of tagging (is it filing, annotation, or sharing) but also the types of tags themselves which range from keywords through actions, such as toread, to the descriptive conventions I discussed here.