Fictional data

The phrase “fictional data” popped into my head recently, largely because of odd connections between a couple of projects I’ve been working on.

It’s stuck with me because, if you set aside the literal meaning of “data that doesn’t actually exist”, there are some interesting aspects to it. For example, the phrase could apply to:

  1. data that is deliberately wrong or inaccurate in order to mislead – lies or spam
  2. data that is deliberately wrong as a proof of origin or claim of ownership – e.g. inaccuracies introduced into maps to identify their sources, or copyright easter eggs
  3. data that is deliberately wrong, but intended as a prank – e.g. the original entry for Uqbar on Wikipedia. Uqbar is actually a doubly fictional place.
  4. data that is fictionalised (but still realistic) in order to support testing of some data analysis – e.g. a set of anonymised and obfuscated bank transactions
  5. data that is fictionalised in order to avoid being a nuisance, causing confusion, or creating accidental linkage – like 555-prefix telephone numbers or perhaps social media account names
  6. data that is drawn from a work of fiction or a virtual world – such as the Marvel Universe social graph, the Elite: Dangerous trading economy (context), or the data and algorithms relating to Pokémon capture.

I find all of these fascinating, for a variety of reasons:

  • How do we identify and exclude deliberately fictional data when harvesting, aggregating and analysing data from the web? Credit to Ian Davis for some early thinking about attack vectors for spam in Linked Data. While I’d expect copyright easter eggs to become less frequent, they’re unlikely to disappear completely. But we can definitely expect more and more deliberate spam and attacks on authoritative data. (Categories 1, 2, 3)
  • How do we generate useful synthetic datasets that can be used for testing systems? Could we generate data based on some rules and a better understanding of real-world data, as a safer alternative to obfuscating data that is shared for research purposes? (There’s a rough sketch of this idea after this list.) It turns out that some fictional data is a good proxy for real-world social networks. And analysis of videogame economics is useful for creating viable long-term communities. (Categories 4, 6)
  • Some of the most enthusiastic collectors and curators of data are those that are documenting fictional environments. Wikia is a small universe of mini-wikipedias complete with infoboxes and structured data. What can we learn from those communities and what better tools could we build for them? (Category 6)
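
As a rough illustration of the “synthetic data” idea (category 4), here’s a minimal sketch that generates fictional but realistic-looking bank transactions from a few simple rules. The field names, merchants and spending distribution are entirely my own assumptions rather than anything drawn from a real dataset:

    import datetime
    import random
    import uuid

    # Invented, purely illustrative merchant names
    MERCHANTS = ["Acme Groceries", "Bus Co", "Coffee Stop", "Online Books"]

    def fake_transaction(account_id):
        """Generate a single fictional bank transaction."""
        return {
            "id": str(uuid.uuid4()),
            "account": account_id,
            "merchant": random.choice(MERCHANTS),
            # a log-normal amount gives the long tail you see in real spending
            "amount": round(random.lognormvariate(2.5, 0.8), 2),
            "date": (datetime.date.today()
                     - datetime.timedelta(days=random.randint(0, 365))).isoformat(),
        }

    # a fictional account with roughly a year of activity
    transactions = [fake_transaction("acct-0001") for _ in range(250)]
    print(transactions[0])

A real test dataset would obviously need richer rules (recurring payments, salaries, correlated behaviour between accounts), but the point is that the data can be generated from rules rather than obfuscated from something real.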

Interesting, huh?

Thoughts on the Netflix API Closure

A year ago Netflix announced that they were shuttering their public API: no new API keys or affiliates and no more support. Earlier this week they announced that the entire public API will be shut down by November 2014.

This is interesting news and it’s been covered in various places already, including this good overview at Programmable Web. I find it interesting because it’s the first time that I can recall a public API being so visibly switched out for a closed, private alternative. Netflix will still offer an API, but only for a limited set of eight existing affiliates and (of course) their own applications. Private APIs have always existed and will continue to do so, but the trend to date has been about these being made public, rather than a move in the opposite direction.

It’s reasonable to ask whether this might be the first sign of a new trend, or whether it’s just an outlier. Netflix have been reasonably forthcoming about their API design decisions, so I expect many others will be reflecting on their decision and whether it would make sense for them.

But does it make sense at all?

If you read this article by Daniel Jacobson (Director of Engineering for the Netflix API) you can get more detail on the decision and some insight into their thought process. By closing the public API and focusing on a few affiliates, Jacobson suggests that they are able to optimise the API to fit the needs of those specific consumers. The article suggests that a fine-grained, resource-oriented API is excellent for supporting largely un-mediated use by a wide range of different consumers with a range of different use cases. In contrast, an API that is optimised for fewer use cases and types of query may be able to offer better performance. An API with a smaller surface area will have lower maintenance overheads. Support overheads will also be lower because there are fewer interactions to consider and a smaller user base making them.
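
To make that contrast concrete, here’s a hypothetical sketch of the two styles from a client’s point of view. The endpoints and fields are invented for illustration; they are not Netflix’s actual API.

    # Hypothetical endpoints and fields, invented purely to illustrate the trade-off.
    import requests

    BASE = "https://api.example.com"

    def home_screen_fine_grained(user_id):
        """Generic, resource-oriented style: the client orchestrates many small calls."""
        user = requests.get(f"{BASE}/users/{user_id}").json()
        queue = requests.get(f"{BASE}/users/{user_id}/queue").json()
        titles = [requests.get(f"{BASE}/titles/{item['title_id']}").json() for item in queue]
        return {"name": user["name"], "queue": titles}

    def home_screen_experience(user_id, device="tv"):
        """Experience-oriented style: one coarse-grained call, shaped for a specific consumer."""
        response = requests.get(f"{BASE}/screens/home",
                                params={"user": user_id, "device": device})
        return response.json()

The first style stays flexible for consumers you don’t know about yet; the second can be tuned, cached and versioned around the handful of consumers you actually have.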

That rationale is hard to argue with from either a technical or business perspective. If you have a small number of users driving most of your revenue, and a long tail of users generating little or no revenue but with high support costs, it mostly makes sense to follow the revenue. I don’t buy all of the technical rationale though. It would be possible to support a mixture of resource types in the API, as well as a mixture of support and service level agreements. So I suspect the business drivers are the main rationale here. APIs have generally meant businesses giving up control; if Netflix are able to make this work then I would be surprised if more businesses don’t eventually do the same, as a means to regain that control.

But by withdrawing from any kind of public API, Netflix are essentially admitting that they don’t see any further innovation happening around their API: what they’ve seen so far is everything they’re going to see. They’re not expecting a sudden new type of usage to drive revenue and users to the service. Or at least not enough to warrant maintaining a more generic API. If they felt that the community was growing, or building new and interesting applications that benefited their business, they’d keep the API open. By restricting it they’re admitting that closer integration with a small number of applications is a better investment. It’s a standard vertical integration move that gives them greater control over the entire user experience of their platform. It wouldn’t surprise me if they acquired some of these applications in the future.

However it all feels a bit short-sighted to me as they’re essentially withdrawing from the Web. They’re no longer going to be able to benefit from any of the network effects of having their API be a part of the wider web and remixable (within their Terms of Service) with other services and datasets. Innovation will be limited to just those companies they’re choosing to work with through an “experience” driven API. That feels like a bottleneck in the making.

It’s always possible to optimise a business and an API to support a limited set of interactions, but that type of close coupling inevitably results in less flexibility. Personally I’d be backing the Web.

What is an Open API?

I was reading a document this week that referred to an “Open API”. It occurred to me that I hadn’t really thought about what that term was supposed to mean before. Having looked at the API in question, it turned out it did not mean what I thought it meant. The definition of Open API on Wikipedia and the associated list of Open APIs are also both a bit lacklustre.

We could probably do with being more precise about what we mean by that term, particularly in how it relates to Open Source and Open Data. So far I’ve seen it used in several different ways:

  1. An API that is free for anyone to use — I think it would be clearer to refer to these as “Public APIs”. Some may require authentication, and some may only have a limited free tier of usage, but the API is accessible to anyone that wants to use it.
  2. An API that is backed by open data — the data that is exposed by the API is covered by an open licence. A Public API isn’t necessarily backed by Open Data. While it might be free for me to use an API, I may be limited in how I can use the data by the API’s terms and/or a non-open licence that applies to the data.
  3. An API that is based on an open standard — the data available via an API might not be open, but the means of accessing and querying the data is covered by a specification that has been created by a standards body or has otherwise been openly published, e.g. the specification of the API is covered by an open licence. The important thing here is that the API could be (re-)implemented in an open source or commercial product without infringing on anyone’s rights or intellectual property. The specifications of APIs that serve open data aren’t necessarily open. A commercial vendor may provide a data publishing service whose API is entirely proprietary.

Personally I think an Open API is one that meets that final definition.

These are important distinctions and I’d encourage you to look at the APIs you’re using, or the APIs you’re publishing, and consider which category they fall into. APIs built on open source software typically fall into the third category: a reference implementation and API documentation are already in the open. It’s easy to create alternate versions, improve an existing code base, or run a copy of a service.

While the data in a platform may be open, lock-in (whether planned or otherwise) can happen when APIs are proprietary. This limits competition and the ability for both data publishers and consumers to choose other vendors. This is also one reason why APIs shouldn’t be the default for open government data: at some level the raw data should be portable and useful outside of whatever platform the organisation may choose to deploy. Ideally platforms aimed at supporting open government data publishing should be open source or should, at the very least, openly licence their API documentation.

It’s about more than the link

To be successful, the web sacrificed some of the features of earlier hypertext systems: things like backwards linking and link integrity. One of the great things about the web is that it’s possible to rebuild some of those features, but in a distributed way. Different communities can then address their own requirements.

Link integrity is one of those aspects. In many cases link integrity is not an issue. Some web resources are ephemeral (e.g. pastebin snippets), but others — particularly those used and consumed by scholarly communities — need to be longer lived. For many years CrossRef and other members of the DOI Foundation have been successfully building linking services that attempt to provide persistent links to material referenced in scholarly research.

Yesterday Geoff Bilder published a great piece that describes what CrossRef and others are doing in this area, highlighting the different communities being served and the different features that the services offer. Just because something has a DOI doesn’t necessarily make it reliable, give any guarantees about its quality, or even imply what kind of resource it is; but it may have some guarantees around persistence.

Geoff’s piece highlights some similar concerns that I’ve had recently. I’m particularly concerned that there seems to be some notion that for something to be citeable it must have a DOI. That’s not true. For something to be citeable it just needs to be online, so people can point at it.

There may be other qualities we want the resource to have, e.g. persistence, but if your goal is to share some data, then get it online first and address the persistence issue afterwards. Data and content sharing platforms and services can help there, but we need to assess them against different criteria, e.g. whether they are good publishing platforms, and separately whether they can make good claims about persistence and longevity.

Assessing persistence means more than just assessing technical issues; it means understanding the legal and business context of the service. What are its terms of service? Does the service have any kind of long-term business plan that means it can make viable claims about the longevity of the links it produces?

I recently came across a service that aims to bring some stability to legal citations. There’s a New York Times article that highlights some of the issues and the goals of the service.

The service allows users to create stable links to content. The content that the links refer to is then archived so if the original link doesn’t resolve then users can still get to the archived content.

This isn’t a new idea: bookmarking services often archive bookmarked content to build personal archives; other citation and linking services have offered similar features that handle content going offline.

It’s also not that hard to implement. Creating link aliases is easy. Archiving content is harder, but achievable for well-known formats and common cases: it gets more difficult if you have to deal with dynamic resources/content, or want to preserve a range of formats for the long term.
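
As a rough sketch of the basic mechanics (minting an alias, archiving a copy, and falling back to the archive when the original link dies), assuming a toy in-memory service and ignoring all of the hard cases above:

    # Minimal sketch of link aliasing with an archive fallback.
    # Storage, identifiers and error handling are deliberately simplistic.
    import hashlib
    import urllib.error
    import urllib.request

    ALIASES = {}   # alias -> original URL
    ARCHIVE = {}   # alias -> archived bytes

    def create_alias(url):
        """Mint a short identifier for a URL and archive a copy of its content."""
        alias = hashlib.sha256(url.encode("utf-8")).hexdigest()[:10]
        ALIASES[alias] = url
        with urllib.request.urlopen(url) as response:
            ARCHIVE[alias] = response.read()
        return alias

    def resolve(alias):
        """Return the live content if the original still resolves, otherwise the archived copy."""
        url = ALIASES[alias]
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                return response.read()
        except (urllib.error.URLError, OSError):
            return ARCHIVE[alias]

The hard parts, as the rest of this post argues, aren’t really in code like this at all: they’re in the rights, economics and longevity of whoever runs it.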

It’s less easy to build stable commercial entities. It’s also tricky dealing with rights issues. Archival organisations often ensure that they have rights to preserve content, e.g. by having agreements with data publishers.

Personally I’m not convinced that they have nailed that aspect yet. If you look at their terms of service (PDF, 23rd Sept 2013), I think there are some problems:

You may use the service “only for non-commercial scholarly and research purposes that do not infringe or violate anyone’s copyright or other rights”. Defining “non-commercial” use is very tricky; it’s an issue with many open content and data licenses. One might argue that a publisher creating links is using the service for commercial purposes.

But I find Section 5, “User Submitted Content and Licensing”, confusing. For example, it seems to suggest that I either have to own the content that I am creating a link for, or that I’ve done all the rights clearance on their behalf.

I don’t see how that can possibly work in the general case, particularly as you must also grant them a license to use the content however they wish. If you’re trying to build links to third-party content, e.g. many of the scenarios described in the New York Times article, then you don’t have any rights to grant them. Even if it’s published under an open content license you may not have all the rights they require.

They also reserve the right to remove any content, and presumably links, that they’re required to remove. From a legal perspective this makes some sense, but I’d be interested to know how that works in practice. For example will the link just disappear or will there be any history available?

Perhaps I’m misunderstanding the terms (entirely possible) or the intended users of the service; I’d be interested in hearing any clarifications.

My general point here is not to be overly critical of the service (I’m largely just confused by their terms). My point is that bringing permanence to (parts of) the web isn’t necessarily a technical issue to solve; it’s one that has important legal, social and economic aspects.

Signing up to a service to create links is easy. Longevity is harder to achieve.

Thoughts on Coursera and Online Courses

I recently completed my first online course (or “MOOC”) on Coursera. It was an interesting experience and I wanted to share some thoughts here.

I decided to take an online course for several reasons. Firstly the topic, Astrobiology, was fun and I thought the short course might make an interesting alternative to watching BBC documentaries and US TV box sets. I certainly wasn’t disappointed as the course content was accessible and well-presented. As a biology graduate I found much of the content was fairly entry-level, but it was nevertheless a good refresher in a number of areas. The mix of biology, astronomy, chemistry and geology was really interesting. The course was very well attended, with around 40,000 registrants and 16,000 active students.

The second reason I wanted to try a course was that MOOCs are so popular at the moment. I was curious how well an online course would work, in terms of both content delivery and the social aspects of learning. Many courses are longer and more rigorously marked and assessed, but the short Astrobiology course looked like it would still offer some useful insights into online learning.

Clearly some of my experiences will be specific to the particular course and Coursera, but I think some of the comments below will generalise to other platforms.

Firstly, the positives:

  • The course material was clear and well presented
  • The course tutors appeared to be engaged and actively participated in discussions
  • The ability to download the video lectures, allowing me to (re)view content whilst travelling, was really appreciated. Flexibility around consuming course content seems like an essential feature to me. While the online experience will undoubtedly be richer, I’m guessing that many people are doing these courses in spare time around other activities. With this in mind, video content needs to be available in easily downloadable chunks.
  • The Coursera site itself was on the whole well constructed. It was easy to navigate to the content, tests and the discussions. The service offered timely notifications that new content and assessments had been published
  • Although I didn’t use it myself, the site offered good integration with services like Meetup, allowing students to start their own local groups. This seemed like a good feature, particularly for longer running courses.

However there were a number of areas in which I thought things could be greatly improved:

  • The online discussion forums very quickly became unmanageable. With so many people contributing, across many different threads, it was hard to separate the signal from the noise. The community had some interesting extremes, ranging from people associated with the early NASA programme through to alien-contact and conspiracy-theory nut-cases. While those particular extremes are peculiar to this course, I expect other courses may experience similar challenges
  • Related to the above point, the ability to post anonymously in forums led to trolling on a number of occasions. I’m sensitive to privacy, but perhaps pseudonyms would be better than anonymity?
  • The discussions are divorced from the content, e.g. I can’t comment directly on a video; I have to create a new thread for it in a discussion group. I wanted to see something more sophisticated, maybe SoundCloud-style annotations on the videos or per-video discussion threads.
  • No integration with wider social networks: there were discussions also happening on Twitter, G+ and Facebook. Maybe it’s better to just integrate those, rather than offer a separate discussion forum?
  • Students consumed content at different rates which meant that some discussions contained “spoilers” for material I hadn’t yet watched. This is largely a side-effect of the discussions happening independently from the content.
  • Coursera offered a course wiki but this seemed useless
  • It wasn’t clear to me during the course what would happen to the discussions after the course ended. Would they be wiped out, preserved, or would later students build on what was there already? Now that it’s finished, it looks like each course is instanced and discussions are preserved as an archive. I’m not sure what the right option is there. Starting with a clean slate seems like a good default, but could particularly useful discussions be highlighted in later courses? The course discussions seem like they would be an interesting thing to mine for links and topics, especially for lecturers.

There are some interesting challenges with designing this kind of product. Unlike almost every other social application, the communities for these courses don’t ramp up over time: they arrive en masse on a particular date and then more or less evaporate overnight.

As a member of that community this makes it very hard to identify which people are worth listening to and who to ignore: all of a sudden I’m surrounded by 16,000 people all talking at once. When things ramp up more slowly, I can build out my social network more easily. Coursera doesn’t have any notion of study groups.

I expect the lecturers face similar challenges, as they’re very quickly confronted with a lot of material that they potentially have to read, review and respond to. This must present challenges when engaging with each new intake.

While a traditional discussion forum might provide the basic infrastructure for communication, MOOC platforms need more nuanced social features, for both students and lecturers, to support the community: features that are sensitive to the sudden growth of the community. I found myself wanting to find out things like:

  • Who is posting on which topics and how frequently?
  • Which commentators are getting up-voted (or down-voted) the most?
  • Which community members are at the same stage in the course as me?
  • Which community members have something to offer on a particular topic, e.g. because of their specific background?
  • What links are people sharing in discussions? Perhaps filtered by users.
  • What courses are my fellow students undertaking next? Are there shared journeys?
  • Is there anyone watching this material at the same time?

Answering all of these requires more than just mining discussions, but it feels like some useful metrics could nevertheless be derived. For example, one common use of the forums was to share additional material, e.g. recent news reports, scientific papers, YouTube videos, etc. That kind of content could either be collected in other ways, e.g. via a shared reading list, or as a list that is automatically surfaced from discussions. I ended up sifting through the forums and creating a reading list on Readlists, as well as a YouTube playlist, just to see whether others would find them useful (they did).
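
Automatically surfacing that kind of shared material needn’t be complicated. Here’s a rough sketch, assuming (and it is only an assumption on my part) that the forum posts could be exported as plain text:

    # Surface the links people share in course discussions, most-shared first.
    # Assumes posts are available as plain text strings, e.g. from a forum export.
    import re
    from collections import Counter

    URL_PATTERN = re.compile(r"https?://[^\s)\"']+")

    def shared_links(posts):
        """Count the URLs mentioned across a collection of forum posts."""
        counts = Counter()
        for post in posts:
            counts.update(URL_PATTERN.findall(post))
        return counts.most_common()

    posts = [
        "Great paper on extremophiles: https://example.org/paper.pdf",
        "See https://example.org/paper.pdf and https://youtube.com/watch?v=abc123",
    ]
    for url, count in shared_links(posts):
        print(count, url)

A per-week or per-lecture version of the same idea would get close to the automatically generated reading list I ended up building by hand.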

All of these challenges we can see playing out in wider social media, but with a MOOC they’re often compressed into relatively short time spans.

(Perhaps inevitably) I also kept thinking that much of the process of creating, delivering and consuming the content could be improved with better linking and annotation tools. Indeed, do we even need specialised MOOC platforms at all? Why not just place all of the content on services like YouTube, Readlists, etc.? Isn’t the web our learning infrastructure?

Well, I think there is a role for these platforms. Their role in certification (“these people have taken this course”) is clearly going to become more important, for example.

However I think their real value is in marking out a space within which the learning experience takes place: these people are taking this content during this period. The community needs a focal point, even if it’s short-lived.

If everything was just on the web, with no real definition to the course, then that completely dissolves the community experience. By concentrating the community into time-boxed, instanced courses, it creates focus that can enrich the experience. The challenge is balancing unwieldy MOOC “flashmobs” against a more diffused internet community.

Google AppEngine for Personal Web Presence?

Some thinking aloud…

I’ve browsed through the Google App Engine gallery and the applications you can find there at the moment are pretty much what you’d expect: lots of Web 2.0 “share this, share that” sites. They’re what you’d expect firstly because they’re the kind of simple application you’d build whilst exploring any new environment, and secondly because they’re exactly the kind of sites that are currently being released every which way you turn.

But for me App Engine is intriguing because it might provide an interesting new perspective on distributing shrink-wrapped, packaged software. When Google take the lid off the number of sign-ups, it’s going to be a simple matter for anyone to have their own App Engine environment. Forget cheap web hosting and the expense and configuration overhead that entails: just sign up for an App Engine account.

App Engine has the potential to provide an enormous number of people with a well-documented, stable environment into which an application can be deployed.

It will be interesting to see if anyone seizes on App Engine as an opportunity to create a simple personal application that combines elements of all of the Web 2.0 favourites: bookmarks, blogging, calendar, photos, travel, and perhaps an OpenID provider. One that makes me the administrator of all of my own data, but doesn’t scrimp on the options for other people to harvest, syndicate and browse what I’m uploading.

At the moment our online identities start out fragmented, because we have to push data into a number of different services. And then we strive for ways to bring that data together and knit it into other sites that we, or our social network, use.

But why not turn this on its head and seize on App Engine as a way to avoid this early fragmentation, starting out instead with a centralised, personal web presence that seamlessly integrates with data in other spaces? The potential is in open data, and services that are built around it. So why aren’t we managing our own open data repositories and letting others offer us services against particular aspects of it?

The App Engine environment doesn’t involve any configuration on the part of the end user, and I suspect you could probably create an App Engine Deployer using App Engine itself. So sign-up, deployment and upgrades could also be pretty straight-forward. Python seems well suited for creating a simple modular web application that could be extended to cover new areas as users needed.
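
To sketch what I mean by a simple, modular application, here’s a toy WSGI app where each “module” (bookmarks, photos, and so on) registers itself under a URL prefix. This is illustrative Python only, not App Engine-specific code, and the modules and content are invented.

    # A toy modular personal web app: each module owns a URL prefix.
    # Illustrative only; the modules and their content are invented.
    from wsgiref.simple_server import make_server

    MODULES = {}

    def module(prefix):
        """Register a handler function under a URL prefix."""
        def register(handler):
            MODULES[prefix] = handler
            return handler
        return register

    @module("/bookmarks")
    def bookmarks(environ):
        return "My bookmarks, published as data I administer myself."

    @module("/photos")
    def photos(environ):
        return "My photos, syndicated for others to harvest and browse."

    def app(environ, start_response):
        path = environ.get("PATH_INFO", "/")
        handler = next((h for prefix, h in MODULES.items() if path.startswith(prefix)), None)
        status = "200 OK" if handler else "404 Not Found"
        body = handler(environ) if handler else "Not found."
        start_response(status, [("Content-Type", "text/plain; charset=utf-8")])
        return [body.encode("utf-8")]

    if __name__ == "__main__":
        make_server("", 8080, app).serve_forever()

Adding a new area of data would just mean dropping in another module; the interesting work is in making each module’s data harvestable and syndicatable by other services.
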
Instead of using lots of different web applications, we can each have our own modular web application that is intimately linked into the web, and becomes the primary repository for the data we want on the web. Data portability follows from the fact that you’d be the administrator of your own data.

This would also change the nature of the kinds of applications that we’d need elsewhere on the web. Instead of lots of specialist databases, we need more generic services and more community/local/temporary aggregations.