A river of research, not news

I already hate the phrase “fake news”. We have better words to describe lies, disinformation, propaganda and slander, so let’s just use those.

While the phrase “fake news” might originally have been used to refer to hoaxes and disinformation, it’s rapidly becoming a meaningless term used to refer to anything you disagree with. Trump’s recent remarks are a case in point: unverified news is something very different.

Of course this is all on a sliding scale. Many news outlets breathlessly report on scientific research. This can make for fun, if eye-rolling, reading. Advances in AI and the discovery of alien mega-structures are two examples that spring to mind.

And then there’s the way in which statistics and research are given a spin by newspapers or politicians. This often glosses over key details in favour of getting across a political message or scoring points. Today I was getting cross about Theresa May’s blaming of GPs for the NHS crisis. Her remarks are based on a report recently published by the National Audit Office. I haven’t seen a single piece of coverage link to the NAO press release or the high-level summary (PDF), so you’ll either have to accept their remarks or search for it yourself.

Organisations like Full Fact do an excellent job of digging into these claims. They link the commentary to the underlying research or statistics alongside a clear explanation. NHS Choices’ Behind the Headlines fills a similar role, but focuses on the reporting of medical and health issues.

There’s also a lot of attention focused on helping to surface this type of fact checking and explanation via search results. Fact checking that properly digs into statistics and presents them clearly is, I suspect, a time-consuming exercise, especially if you’re hoping to present a neutral point of view.

What I think I’d like, though, is a service that brings all those different services together. To literally give me the missing links between research, news and commentary.

But rather than aggregating news articles or fact checking reports to give me a feed, or what we used to call a “river of news”, why not present a river of research instead? Let me see the statistics or reports that are being debated and then let me jump off to see the variety of commentary and fact checking associated with them.

That way I could choose to read the research or a summary of it, and then decide to look at the commentary. Or, more realistically, I could at least see the variety of ways in which a specific report is being presented, described and debated. That would be a useful perspective, I think. It would shift the focus away from individual outlets and help us find alternative viewpoints.

I doubt that this would become anyone’s primary way to consume the news. But it could be interesting to those of us who like to dig behind the headlines. It would also be useful as a research tool in its own right. In the face of a consistent lack of interest from news outlets in linking to primary sources, this might be something that could be crowd-sourced.

Does this type of service already exist? I suspect there are similar efforts around academic research, but I don’t recall seeing anything that covers a wider set of outputs including national and government statistics.

Checking Fact Checkers

As of last month Google News attempts to highlight fact check articles. Content from fact checking organisations will be tagged so that their contribution to on-line debate can be more clearly identified. I think this is a great move and a first small step towards addressing wider concerns around use of the web for disinformation and a “post truth” society.

So how does it work?

Firstly, news sites can now advertise fact checking articles using a pending schema.org extension called ClaimReview. The mark-up allows a fact checker to indicate which article they are critiquing, along with a brief summary of what aspects are being reviewed.
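
To make that a bit more concrete, here’s a minimal sketch of what that markup might look like, generated as JSON-LD from Python. The property names follow the schema.org ClaimReview type, but the exact fields Google expects may differ, and all of the URLs, organisation names and the claim itself are invented for illustration.

```python
import json

# A minimal, illustrative ClaimReview document (all URLs, names and the
# claim are invented; property names follow the schema.org ClaimReview type).
claim_review = {
    "@context": "http://schema.org",
    "@type": "ClaimReview",
    "datePublished": "2016-11-01",
    "url": "https://factchecker.example.org/reviews/123",
    "author": {
        "@type": "Organization",
        "name": "Example Fact Checkers",
        "url": "https://factchecker.example.org",
    },
    "claimReviewed": "Unemployment has halved since 2010",
    "itemReviewed": {
        # The article that made the claim being checked
        "@type": "CreativeWork",
        "url": "https://news.example.com/articles/the-claim",
    },
    "reviewRating": {
        "@type": "Rating",
        "ratingValue": 2,
        "bestRating": 5,
        "worstRating": 1,
        "alternateName": "Mostly false",
    },
}

# The JSON would be embedded in the article page, e.g. inside a
# <script type="application/ld+json"> element.
print(json.dumps(claim_review, indent=2))
```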

Metadata alone is obviously ripe for abuse. Anyone could claim any article is a fact check. So there’s an additional level of editorial control that Google layer on top of that metadata. They’ve outlined their criteria in their help pages. These seem perfectly reasonable: it should be clear what facts are being checked, sources must be cited, organisations must be non-partisan and transparent, etc.

It’s the latter aspect that I think is worth digging into a little more. The Google News announcement references the International Fact Checking Network and a study on fact checking sites. The study, by the Duke Reporters’ Lab, outlines how they identify fact checking organisations. Again, they mention both transparency of sources and organisational transparency as being important criteria.

I think I’d go a step further and require that:

  • Google’s (and others’) lists of approved fact checking organisations are published as open data (there’s a sketch of what this might look like after the list)
  • The lists are cross-referenced with identifiers from sources like OpenCorporates that will allow independent verification of ownership, etc.
  • Fact checking organisations publish open data about their sources of funding and affiliations
  • Fact checking organisations publish open data, perhaps using Schema.org annotations, about the dataset(s) they use to check individual claims in their articles
  • Fact checking organisations license their ClaimReview metadata for reuse by anyone
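
As a rough illustration of the first two points, an open register of approved fact checkers might look something like the following. The schema, organisation names and identifiers are all invented; the point is simply that the list is published in a reusable form and carries identifiers that anyone can cross-reference against a source like OpenCorporates.

```python
import csv
import io

# Hypothetical open register of approved fact checking organisations.
# Each entry carries an OpenCorporates-style company identifier so that
# ownership and registration details can be independently verified.
approved_fact_checkers = [
    {
        "name": "Example Fact Checkers",
        "homepage": "https://factchecker.example.org",
        "opencorporates_id": "gb/01234567",    # invented identifier
        "funding_disclosure": "https://factchecker.example.org/funding",
    },
    {
        "name": "Another Checking Org",
        "homepage": "https://checks.example.net",
        "opencorporates_id": "us_de/7654321",  # invented identifier
        "funding_disclosure": "https://checks.example.net/about/funders",
    },
]

# Publish as a simple CSV that anyone can consume and cross-reference.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(approved_fact_checkers[0]))
writer.writeheader()
writer.writerows(approved_fact_checkers)
print(buffer.getvalue())
```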

Fact checking is an area that benefits from the greatest possible transparency. Open data can deliver that transparency.

Another angle to consider is that fact checking may be carried out by more than just media organisations. Jon Udell has written a couple of interesting pieces on annotating the wild-west of information flow and bird-dogging the web that highlight the potential role of annotation services in helping to fact check and create constructive debate and discussion on-line.

Digital public institutions for the information commons?

I’ve been thinking a bit about “the commons” recently. Specifically, the global information commons that is enabled and supported by Creative Commons (CC) licences. This covers an increasingly wide variety of content as you can see in their recent annual review.

The review unfortunately doesn’t mention data although there’s an increasing amount of that published using CC (or compatible) licences. Hopefully they’ll cover that in more detail next year.

I’ve also been following with interest Tom Steinberg’s exploration of Digital Public Institutions (Part 1, Part 2). As a result of my pondering about the information and data commons, I think there are a couple of other types of institution which we might add to Tom’s list.

My proposed examples of digital public services are deliberately broad. They’re intended to serve the citizens of the internet, not just any one country.

Commons curators

Everyone has seen interesting facts and figures about the rapidly growing volume of activity on the web. These are often used as examples of dizzying growth and as a jumping off point for imagining the next future shocks that are only just over the horizon. The world is changing at an ever increasing rate.

But it’s also an archival challenge. The majority of that material will never be listened to, read or watched. Data will remain unanalysed. And in all likelihood it may disappear before anyone has had any chance to unlock its potential. Sometimes media needs time to find its audience.

This is why projects like the Internet Archive are so important. I think the Internet Archive is one of the greatest achievements of the web. If you need convincing then watch this talk by Brewster Kahle. If, like me, you’re of a certain age then these two things alone should be enough to win you over.

I think we might see, and arguably need, more digital public institutions that are not just archiving great chunks of the web, but also working with that material to help present it to a wider audience.

I see other signals that this might be a useful thing to do. Think about all of the classic film, radio and TV material that is never likely to see the light of day again. Not just for rights reasons, but also because it’s not HD quality or hasn’t been cut and edited to reflect modern tastes. I think this is at least partly the reason why we see so many reboots and remakes.

Archival organisations often worry about how to preserve digital information. One tactic is to migrate between formats to ensure information remains accessible. What if we treated media the same? E.g. by re-editing or remastering it to make it engaging to a modern audience? Here’s an example of modernising classic scientific texts and another that is remixing Victorian jokes as memes.

Maybe someone could spin a successful commercial venture out of this type of activity. But I’m wondering whether you could build a “public service broadcasting” organisation that presented refined, edited, curated views of the commons. I think there’s certainly enough raw material.

Global data infrastructure projects

The ODI have spent some time this year trying to bring into focus the fact that data is now infrastructure. In my view the best exemplar of a truly open piece of global data infrastructure is OpenStreetMap (OSM). A collaboratively maintained map of our world. Anyone can contribute. Anyone can use it.

OSM was set up to try to solve the issue that the UK’s mapping and location infrastructure was, and largely still is, tied up with complex licensing and commercial models. Rather than knocking at the door of existing data holders to convince them to release their data, OSM shows what you can deliver with the participation of a crowd of motivated people using modern technology.

It’s a shining example of the networked age we live in.

There’s no reason to think that this couldn’t be done for other types of data, creating more publicly owned infrastructure. There are now many more ways in which people could contribute data to such projects, whether that information is about themselves or the world around them.

Good coverage and depth of data could also potentially be achieved very quickly. Costs to host and serve data are dropping too, so sustainability becomes more achievable.

And I also feel (hope?) there is a growing unease with so much data infrastructure being owned by commercial organisations. So perhaps there’s a movement towards wanting more of this type of collaboratively owned infrastructure.

Data infrastructure incubators

If you buy into the idea that we need more projects like OSM, then it’s natural to start thinking about the common features of such projects: those that make them successful and sustainable. There are likely to be some common organisational patterns that can be used as a framework for designing these organisations. While it focuses on scholarly research, I think this is the best attempt at capturing those patterns that I’ve seen so far.

Given a common framework, it becomes possible to create incubators whose job it is to launch these projects and coach, guide and mentor them towards success.

So that is my third and final addition to Steinberg’s list: incubators that are focused not on the creation of the next start-up “unicorn” but on generating successful, global collaborative data infrastructure projects. Institutions whose goal is the creation of the next OpenStreetMap.

These types of projects have a huge potential impact as they’re not focused on a specific sector. OSM is relevant to many different types of application, and its data is used in many different ways. I think there’s a lot more foundational data of this type which could and should be publicly owned.

I may be displaying my naivety, but I think this would be a nice thing to work towards.

Fictional data

The phrase “fictional data” popped into my head recently, largely because of odd connections between a couple of projects I’ve been working on.

It’s stuck with me because, if you set aside the literal meaning of “data that doesn’t actually exist”, there are some interesting aspects to it. For example the phrase could apply to:

  1. data that is deliberately wrong or inaccurate in order to mislead – lies or spam
  2. data that is deliberately wrong as a proof of origin or claim of ownership – e.g. inaccuracies introduced into maps to identify their sources, or copyright easter eggs
  3. data that is deliberately wrong, but intended as a prank – e.g. the original entry for Uqbar on Wikipedia. Uqbar is actually a doubly fictional place.
  4. data that is fictionalised (but still realistic) in order to support testing of some data analysis – e.g. a set of anonymised and obfuscated bank transactions
  5. data that is fictionalised in order to avoid being a nuisance, causing confusion, or creating accidental linkage – like 555 prefix telephone numbers or perhaps social media account names
  6. data that is drawn from a work of fiction or a virtual world – such as the Marvel universe social graph, the Elite: Dangerous trading economy (context), or the data and algorithms relating to Pokémon capture.

I find all of these fascinating, for a variety of reasons:

  • How do we identify and exclude deliberately fictional data when harvesting, aggregating and analysing data from the web? Credit to Ian Davis for some early thinking about attack vectors for spam in Linked Data. While I’d expect copyright easter eggs to become less frequent, they’re unlikely to completely disappear. But we can definitely expect more and more deliberate spam and attacks on authoritative data. (Categories 1, 2, 3)
  • How do we generate useful synthetic datasets that can be used for testing systems? Could we generate data based on some rules and a better understanding of real-world data as a safer alternative to obfuscating data that is shared for research purposes? (There’s a small sketch of this after the list.) It turns out that some fictional data is a good proxy for real-world social networks. And analysis of videogame economics is useful for creating viable long-term communities. (Categories 4, 6)
  • Some of the most enthusiastic collectors and curators of data are those that are documenting fictional environments. Wikia is a small universe of mini-wikipedias complete with infoboxes and structured data. What can we learn from those communities and what better tools could we build for them? (Category 6)
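
On the synthetic data question (category 4), here’s a small sketch of the sort of thing I mean: generating fictional but plausible bank transactions from a handful of simple rules, rather than trying to obfuscate real records. The categories, amount ranges and field names are all made up for illustration.

```python
import random
from datetime import date, timedelta

# Invented rules for fictional but plausible-looking bank transactions.
# Nothing here is derived from real data.
CATEGORIES = {
    "groceries": (5, 120),
    "rent": (600, 900),
    "transport": (2, 60),
    "eating out": (8, 45),
}

def fake_transactions(n, start=date(2016, 1, 1), seed=42):
    rng = random.Random(seed)  # fixed seed so the test data is reproducible
    for i in range(n):
        category = rng.choice(list(CATEGORIES))
        low, high = CATEGORIES[category]
        yield {
            "id": f"txn-{i:05d}",
            "date": (start + timedelta(days=rng.randint(0, 364))).isoformat(),
            "category": category,
            "amount_gbp": round(rng.uniform(low, high), 2),
        }

for txn in fake_transactions(5):
    print(txn)
```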

Interesting, huh?

Thoughts on the Netflix API Closure

A year ago Netflix announced that they were shuttering their public API: no new API keys or affiliates and no more support. Earlier this week they announced that the entire public API will be shut down by November 2014.

This is interesting news and it’s been covered in various places already, including this good overview at Programmable Web. I find it interesting because it’s the first time that I can recall a public API being so visibly switched out for a closed, private alternative. Netflix will still offer an API, but only for a limited set of eight existing affiliates and (of course) their own applications. Private APIs have always existed and will continue to do so, but the trend to date has been about these being made public, rather than a move in the opposite direction.

It’s reasonable to consider whether this might be the first of a new trend, or whether it’s just an outlier. Netflix have been reasonably forthcoming about their API design decisions, so I expect many others will be reflecting on their decision and whether it would make sense for them.

But does it make sense at all?

If you read this article by Daniel Jacobson (Director of Engineering for the Netflix API) you can get more detail on the decision and some insight into their thought process. By closing the public API and focusing on a few affiliates, Jacobson suggests that they are able to optimise the API to fit the needs of those specific consumers. The article suggests that a fine-grained resource-oriented API is excellent for supporting largely un-mediated use by a wide range of different consumers with a range of different use cases. In contrast, an API that is optimised for fewer use cases and types of query may be able to offer better performance. An API with a smaller surface area will have lower maintenance overheads. Support overheads will also be lower because there are fewer interactions to consider and a smaller user base making them.
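
A crude way to picture the difference (using entirely hypothetical endpoints, not Netflix’s actual API): a generic resource-oriented client stitches a screen together from several fine-grained calls, while a device-specific “experience” API collapses that into a single optimised request.

```python
import requests

BASE = "https://api.example.com"  # hypothetical service, not the real Netflix API

def home_screen_resource_oriented(user_id):
    """Generic, fine-grained API: the client makes several round trips
    and assembles the screen itself."""
    user = requests.get(f"{BASE}/users/{user_id}").json()
    queue = requests.get(f"{BASE}/users/{user_id}/queue").json()
    titles = [
        requests.get(f"{BASE}/titles/{item['title_id']}").json()
        for item in queue["items"]
    ]
    return {"user": user, "titles": titles}

def home_screen_experience_api(user_id, device="tv"):
    """Coarse-grained 'experience' API: one call, with the server shaping
    the payload for a known client and device."""
    return requests.get(
        f"{BASE}/experience/home", params={"user": user_id, "device": device}
    ).json()
```

The second style is cheaper to serve and support for a handful of known consumers, but only because the server knows exactly what those consumers need.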

That rationale is hard to argue with from either a technical or business perspective. If you have a small number of users driving most of your revenue and a long tail of users generating little or no revenue but with a high support cost, it mostly makes sense to follow the revenue. I don’t buy all of the technical rationale though. It would be possible to support a mixture of resource types in the API, as well as a mixture of support and service level agreements. So I suspect the business drivers are the main rationale here. APIs have generally meant businesses giving up control, so if Netflix are able to make this work then I would be surprised if more businesses don’t do the same eventually, as a means to regain that control.

But by withdrawing from any kind of public API Netflix are essentially admitting that they don’t see any further innovation happening around their API: what they’ve seen so far is everything they’re going to see. They’re not expecting a sudden new type of usage to drive revenue and users to the service. Or at least not enough to warrant maintaining a more generic API. If they felt that the community was growing, or building new and interesting applications that benefited their business, they’d keep the API open. By restricting it they’re admitting that closer integration with a small number of applications is a better investment. It’s a standard vertical integration move that gives them greater control over all user experience with their platform. It wouldn’t surprise me if they acquired some of these applications in the future.

However it all feels a bit short-sighted to me as they’re essentially withdrawing from the Web. They’re no longer going to be able to benefit from any of the network effects of having their API be a part of the wider web and remixable (within their Terms of Service) with other services and datasets. Innovation will be limited to just those companies they’re choosing to work with through an “experience” driven API. That feels like a bottleneck in the making.

It’s always possible to optimise a business and an API to support a limited set of interactions, but that type of close coupling inevitably results in less flexibility. Personally I’d be backing the Web.

What is an Open API?

I was reading a document this week that referred to an “Open API”. It occurred to me that I hadn’t really thought about what that term was supposed to mean before. Having looked at the API in question, it turned out it did not mean what I thought it meant. The definition of Open API on Wikipedia and the associated list of Open APIs are also both a bit lacklustre.

We could probably do with being more precise about what we mean by that term, particularly in how it relates to Open Source and Open Data. So far I’ve seen it used in several different ways:

  1. An API that is free for anyone to use — I think it would be clearer to refer to these as “Public APIs”. Some may require authentication, some may only have a limited free tier of usage, but the API is accessible to anyone that wants to use it
  2. An API that is backed by open data — the data that is accessed via the API is covered by an open licence. A Public API isn’t necessarily backed by Open Data. While it might be free for me to use an API, I may be limited in how I can use the data by the API terms and/or a non-open data licence that applies to the data
  3. An API that is based on an open standard — the data available via an API might not be open, but the means of accessing and querying the data is covered by a specification that has been created by a standards body or has otherwise been openly published, e.g. the specification of the API is covered by an open licence. The important thing here is that the API could be (re-)implemented in an open source or commercial product without infringing on anyone’s rights or intellectual property. The specifications of APIs that serve open data aren’t necessarily open. A commercial vendor may provide a data publishing service whose API is entirely proprietary.

Personally I think an Open API is one that meets that final definition.

These are important distinctions and I’d encourage you to look at the APIs you’re using or the APIs you’re publishing and consider into which category they fall. APIs built on open source software typically fall into the third category: a reference implementation and API documentation are already in the open. It’s easy to create alternate versions, improve an existing code base, or run a copy of a service.

While the data in a platform may be open, lock-in (whether planned or otherwise) can happen when APIs are proprietary. This limits competition and the ability for both data publishers and consumers to choose other vendors. This is also one reason why APIs shouldn’t be the default for open government data: at some level the raw data should be portable and useful outside of whatever platform the organisation may choose to deploy. Ideally platforms aimed at supporting open government data publishing should be open source or should, at the very least, openly licence their API documentation.

It’s about more than the link

To be successful the web sacrificed some of the features of hypertext systems: things like backwards linking and link integrity. One of the great things about the web is that it’s possible to rebuild some of those features, but in a distributed way. Different communities can then address their own requirements.

Link integrity is one of those aspects. In many cases link integrity is not an issue. Some web resources are ephemeral (e.g. pastebin snippets), but others — particularly those used and consumed by scholarly communities — need to be longer lived. CrossRef and other members of the DOI Foundation have, for many years, been successfully building linking services that attempt to provide persistent links to material referenced in scholarly research.

Yesterday Geoff Bilder published a great piece that describes what CrossRef and others are doing in this area, highlighting the different communities being served and the different features that the services offer. Just because something has a DOI doesn’t necessarily make it reliable, give any guarantees about its quality, or even imply what kind of resource it is; but it may have some guarantees around persistence.

Geoff’s piece highlights some similar concerns that I’ve had recently. I’m particularly concerned that there seems to be some notion that for something to be citeable it must have a DOI. That’s not true. For something to be citeable it just needs to be online, so people can point at it.

There may be other qualities we want the resource to have, e.g. persistence, but if your goal is to share some data, then get it online first, then address the persistence issue. Data and content sharing platforms and services can help there, but we need to assess them against different criteria, e.g. whether they are good publishing platforms, and separately whether they can make good claims about persistence and longevity.

Assessing persistence means more than just assessing technical issues; it means understanding the legal and business context of the service. What are its terms of service? Does the service have any kind of long-term business plan that means it can make viable claims about the longevity of the links it produces?

I recently came across a service called perma.cc that aims to bring some stability to legal citations. There’s a New York Times article that highlights some of the issues and the goals of the service.

The perma.cc service allows users to create stable links to content. The content that the links refer to is then archived so if the original link doesn’t resolve then users can still get to the archived content.

This isn’t a new idea: bookmarking services often archive bookmarked content to build personal archives; other citation and linking services have offered similar features that handle content going offline.

It’s also not that hard to implement. Creating link aliases is easy. Archiving content is less easy, but achievable for well-known formats and common cases: it gets harder if you have to deal with dynamic resources or content, or want to preserve a range of formats for the long term.
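
To give a sense of the easy part, here’s a small sketch of creating an alias and snapshotting the content it points to. The storage layout and alias scheme are invented for illustration, and it deliberately ignores the hard cases of dynamic content and long-term format preservation.

```python
import hashlib
import pathlib
import requests

ARCHIVE_DIR = pathlib.Path("archive")  # local snapshot store for this sketch

def create_permalink(url):
    """Fetch a page, store a snapshot, and return a short alias for it."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    alias = hashlib.sha256(url.encode("utf-8")).hexdigest()[:10]
    ARCHIVE_DIR.mkdir(exist_ok=True)
    (ARCHIVE_DIR / f"{alias}.html").write_bytes(response.content)
    return alias  # e.g. served at a hypothetical https://links.example.org/<alias>

def resolve(alias):
    """Return the archived snapshot if the original link has gone away."""
    return (ARCHIVE_DIR / f"{alias}.html").read_bytes()
```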

It’s less easy to build stable commercial entities. It’s also tricky dealing with rights issues. Archival organisations often ensure that they have rights to preserve content, e.g. by having agreements with data publishers.

Personally I’m not convinced that perma.cc have nailed that aspect yet. If you look at their terms of service (PDF, 23rd Sept 2013), I think there are some problems:

You may use the service “only for non-commercial scholarly and research purposes that do not infringe or violate anyone’s copyright or other rights”. Defining “non-commercial” use is very tricky; it’s an issue with many open content and data licences. One might argue that a publisher creating perma.cc links is using it for commercial purposes.

But I find Section 5 “User Submitted Content and Licensing” confusing. For example it seems to suggest that I either have to own the content that I am creating a perma.cc link for, or that I’ve done all the rights clearance on behalf of perma.cc.

I don’t see how that can possibly work in the general case, particularly as you must also grant perma.cc a licence to use the content however they wish. If you’re trying to build perma.cc links to 3rd party content, e.g. many of the scenarios described in the New York Times article, then you don’t have any rights to grant them. Even if it’s published under an open content licence you may not have all the rights they require.

They also reserve the right to remove any content, and presumably links, that they’re required to remove. From a legal perspective this makes some sense, but I’d be interested to know how that works in practice. For example will the perma.cc link just disappear or will there be any history available?

Perhaps I’m misunderstanding the terms (entirely possible) or the intended users of the service, I’d be interested in hearing any clarifications.

My general point here is not to be overly critical of perma.cc — I’m largely just confused by their terms. My point is that bringing permanence to (parts of) the web isn’t necessarily a technical issue to solve; it’s one that has important legal, social and economic aspects.

Signing up to a service to create links is easy. Longevity is harder to achieve.