Enabling data forensics

I’m interested in how people share information, particularly data, on social networks. I think it’s something to which it’s worth paying attention, so we can ensure that it’s easy for people to share insights and engage in online debates.

There’s lots of discussion at the moment around fact checking and similar ways that we can improve the ability to identify reliable and unreliable information online. But there may be other ways that we can make some small improvements in order to help people identify and find sources of data.

Data forensics is a term that usually refers to analysis of data to identify illegal activities. But the term does have a broader meaning that encompasses “identifying, preserving, recovering, analyzing, and presenting attributes of digital information”. So I’m going to appropriate the term to put a label on a few ideas.

The design of the Twitter and Facebook platforms constrains how we can share information. Within those constraints people have, inevitably, adopted various patterns that allow them to publish and share content in preferred ways. For example, information might be shared:

  1. As a link to a page, where the content of the tweet or post is just the title
  2. As a link to a page, but with a comment and/or hashtags for context
  3. As a screenshot, e.g. of some text, chart or something. This usually has some commentary attached. Some apps enable this automatically, allowing you to share a screenshot of some highlighted text
  4. As images and photographs, e.g. of a printed page or report (or even sometimes a screenshot of text from another app)

In the first two examples there are always links that allow someone to go and read the original content. In fact that seems to be the typical intention: go read (or watch) this thing.

The other two examples are usually workarounds for the fact that it’s often hard to deep link to a section of a page or video.

Sometimes it’s just not possible because the information of interest isn’t in a bookmarkable section of a page. Or perhaps the user doesn’t know how to create that kind of deep link. Or they may be further constrained by a mobile app or other service that is restricting their ability to easily share a link. Not every application lets the web happen.

In some cases screenshotting may also be a conscious choice, e.g. posting a photo of someone’s tweet because you don’t want to directly interact with them.

Whatever the reason, this means there is usually no link in the resulting post, which often makes it difficult for a reader to find the original content. While social media is reducing friction in sharing, it’s increasing friction around our ability to check the reliability and accuracy of what’s been shared.

If you tweet out a graph with some figures in a debate, I want to know where it’s come from. I want to see the context that goes with it. The ability to easily identify the source of shared content is, I think, part of “data forensics”.

So, what can we do to fix this?

Firstly, there’s more that could be done to build better ways to deep link into pages, e.g. to allow sharing of individual page elements. But people have been trying to do that on and off for years without much visible success. It’s a hard problem, particularly if you want to allow someone to link to a piece of text. It could be time for a standards body to have another crack at it. Or I might have missed some exciting progress, so please tell me if I have! But I think something like this would need some serious push behind it. You need support from not just web frameworks and the major CMS platforms, but also (probably) browser vendors.

Secondly, Twitter and Facebook could allow us some more flexibility. For example, allow apps to post additional links and/or other metadata that are then attached to posts and tweets. It won’t address every scenario, but it could help. It also feels like a relatively easy thing for them to do as it’s a natural extension of some existing features.

Thirdly, we could look at ways to attach data to the images people are posting, regardless of what the platforms support. I’ve previously wondered about using XMP packets to attach provenance and attribution information to images. Unfortunately it doesn’t work for every format and it turns out that most platforms strip embedded metadata anyway. This is presumably due to reasonable concerns around privacy, but they could still white-list some metadata. We could maybe use steganography too.
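To make the XMP idea a little more concrete, here’s a rough sketch of what a provenance packet might look like. It only builds the packet text, using the Dublin Core dc:source property to carry a link back to the original. The function name provenance_packet and the example URL are my own inventions for illustration, and actually embedding the packet in an image needs a format-aware tool that isn’t shown here.

```python
import xml.etree.ElementTree as ET

# Namespaces used by XMP packets.
X_NS = "adobe:ns:meta/"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

for prefix, uri in (("x", X_NS), ("rdf", RDF_NS), ("dc", DC_NS)):
    ET.register_namespace(prefix, uri)


def provenance_packet(source_url: str) -> str:
    """Build an XMP packet whose dc:source points at the original content.

    Hypothetical helper: it returns the packet as text only and does not
    write it into any image file.
    """
    root = ET.Element(f"{{{X_NS}}}xmpmeta")
    rdf = ET.SubElement(root, f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(rdf, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": ""})
    ET.SubElement(desc, f"{{{DC_NS}}}source").text = source_url
    body = ET.tostring(root, encoding="unicode")
    # The xpacket processing instructions wrap the RDF so tools can find it.
    return (
        '<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>\n'
        f"{body}\n"
        '<?xpacket end="w"?>'
    )


if __name__ == "__main__":
    # Example URL is a placeholder, not a real dataset.
    print(provenance_packet("https://example.org/data"))
```

Even a single source link like this would be enough for a client to offer a “where did this come from?” button, assuming the platform left the packet intact.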

But the major downside here is that you’d need a custom social media client or browser extension to let you see and interact with the data. So, again that’s a massive deployment issue.

As things currently stand I think the best approach is to plan for visualisations and information to be shared, and design the interactions and content accordingly. Assume that your carefully crafted web page is going to be shared in a million different pieces. Which means that you should:

  • Include plenty of in-page anchors and use clear labelling to help people build links to relevant sections
  • Adapt your social media sharing buttons to not just link to the whole page, but also allow the user to share a link to a specific section
  • Design your Twitter cards and other social metadata. For example, is there a key graphic that would be best used as the page image?
  • Include links and source information on all of the graphs and infographics that you share. Make sure the link is short and persistent in case it has to be re-keyed from a screenshot
  • Provide direct ways to tweet and share out a graph that will automatically include a clearly labelled image that contains a link
  • Help users cite their sources
  • …etc
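The first two suggestions are cheap to implement. As a minimal sketch, assuming each section or chart on the page has an id attribute to act as an anchor, a share button could build its link like this. The function name section_share_link and the example URL are hypothetical.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def section_share_link(page_url: str, anchor_id: str) -> str:
    """Return page_url pointing at the in-page anchor anchor_id.

    Any fragment already on the URL is replaced, and the query string
    is preserved.
    """
    parts = urlsplit(page_url)
    # Percent-encode the anchor so unusual characters survive being shared.
    fragment = quote(anchor_id, safe="")
    return urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, fragment))


if __name__ == "__main__":
    # Placeholder page and anchor, purely for illustration.
    print(section_share_link("https://example.org/report?year=2017", "figure-3"))
```

The same link is what you’d want printed on the chart itself, so that it survives being screenshotted.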

What do you think? Any tips or suggestions you’d add to this list? With a bit of awareness around how data is shared, we might be able to make small improvements to online discussions.

The British Hypertextual Society (1905-2017)

With their globe-spanning satellite network nearing completion, Peter Linkage reports on some of the key milestones in the history of the British Hypertextual Society.

The British Hypertextual Society was founded in 1905 with a parliamentary grant from the Royal Society of London. At the time there was growing international interest in finding better ways to manage information, particularly scientific research. Undoubtedly the decision to invest in the creation of a British centre of expertise for knowledge organisation was also influenced by the rapid progress being made in Europe.

Paul Otlet’s Universal Bibliographic Repertory and his ground-breaking postal search engine were rapidly demonstrating their usefulness to scholars. Otlet’s team began publishing the first version of their Universal Decimal Classification only the year before. Letters between Royal Society members during that period demonstrate concern that Britain was losing the lead in knowledge science.

As you might expect, the launch of the British Hypertextual Society (BHS) was a grand affair. The centrepiece of the opening ceremony was the Babbage Bookwheel Engine, which remains on show (and in good working order!) in their headquarters to this day. The Engine was commissioned from Henry Prevost Babbage, who refined a number of his father’s ideas to automate and improve on Ramelli’s Bookwheel concept.

While it might originally have been intended as only a centrepiece, it was the creation of this Engine that laid the groundwork for many of the Society’s later successes. Competition between the BHS members and Otlet’s team in Belgium encouraged the rapid development of new tools. This included refinements to the Bookwheel Engine, prompting its switch from index cards to microfilm. Ultimately it was also instrumental in the creation of the United Kingdom’s national grid and the early success of the BBC.

In the 1920s, in an effort to improve on the Belgium Postal Search Service, the British Government decided to invest in its own solution. This involved reproducing decks of index cards and microfilm sheets that could be easily interchanged between Bookwheel Engines. The new, standardised electric engines were dubbed “Card Wheels”.

The task of distributing the decks and the machines to schools, universities and libraries was given to the recently launched BBC as part of its mission to inform, educate and entertain. Their microfilm version of the Domesday Book was the headline-grabbing release, but the BBC also freely distributed a number of scholarly and encyclopedic works.

Problems with reliable supply of electricity to parts of the UK hampered the roll out of the Card Wheels. This led to the Electricity (Supply) Act of 1926 and the creation of the Central Electricity Board. This simultaneously laid the foundations for a significant cabling infrastructure that would later carry information to the nation in digital forms.

These data infrastructural improvements were mirrored by a number of theoretical breakthroughs. Drawing on Ada Lovelace’s work and algorithms for the Difference Engine, British Hypertextual Society scholars were able to make rapid advances in the area of graph theory and analysis.

These major advances in the distribution of knowledge across the United Kingdom led to Otlet moving to Britain in the early 1930s. A major scandal at the time, this triggered the end of many of the projects underway in Belgium and beyond. Awarded a senior position in the BHS, Otlet transferred his work on the Mundaneum to London. Close ties between the BHS members and key government officials meant that the London we know today is truly the “World City” envisioned by Otlet. It’s interesting to walk through London and consider how so much of the skyline and our familiar landmarks are influenced by the history of hypertext.

The development of the Memex in the 1940s laid the foundations for the development of both home and personal hypertext devices. Combining the latest mechanical and theoretical achievements of the BHS with some American entrepreneurship led to devices rapidly spreading into people’s homes. However the device was the source of some consternation within the BHS as it was felt that British ideas hadn’t been properly credited in the development of that commercial product.

Of course we shouldn’t overlook the importance of the InterGraph in ensuring easy access to information around the globe. Designed to resist nuclear attack, the InterGraph used graph theory concepts developed by the BHS to create a world-wide mesh network between hypertext devices and sensors. All of our homes, cars and devices are part of this truly distributed network.

Tim Berners-Lee’s development of the Hypertext Resource Locator was initially seen as a minor breakthrough. But it actually laid the foundations for the replacement of Otlet’s classification scheme and accelerated the creation of the World Hypertext Engine (WHE) and the global information commons. Today the WHE is ubiquitous. It’s something we all use and contribute to on a daily basis.

But, while we all contribute to the WHE, it’s the tireless work of the “Controllers of The Graph” in London that ensures that the entire knowledge base remains coherent and reliable. How else would we distinguish between reliable, authoritative sources and information published by any random source? Their work to fact check information, manage link integrity and ensure maintenance of core assets is a key feature of the WHE as a system.

Some have wondered what an alternate hypertext system might look like. Scholars have pointed to ideas such as Ted Nelson’s “Xanadu” as one example of an alternative system. Indeed it is one of many that grew out of the counter-culture movement in the 1960s. Xanadu retained many of the features of the WHE as we know it today, e.g. transclusion and micro-transactions, but removed the notion of a centralised index and register of content. This not only removed the ability to have reliable, bi-directional links, but would have allowed anyone to contribute anything, regardless of its veracity.

For many it’s hard to imagine how such a chaotic system would actually work. Xanadu has been dismissed as “a foam of ever-popping bubbles”. And a heavily commercialised and unreliable system of information is a vision to which few would subscribe.

Who would want to give up the thrill of seeing their first contributions accepted into the global graph? It’s a rite of passage that many reflect on fondly. What would the British economy look like if it were not based on providing access to the world’s information? Would we want to use a system that was not fundamentally based on the “Inform, Educate and Entertain” ideal?

This brings us to the present day. The launch of a final batch of satellites will allow the British Hypertextual Society to deliver on a long-standing goal whilst also enabling its next step into the future.

Launched from the British space centre at Goonhilly, each of the standardised CardSat satellites carries both a high-resolution camera and an InterGraph mesh network node. The camera will be used to image the globe in unprecedented detail. This will be used to ensure that every key geographical feature, including every tree and many large animals, can be assigned a unique identifier, bringing them into the global graph. And, by extending the mesh network into space the BHS will ensure that the InterGraph has complete global coverage, whilst also improving connectivity between the fleet of British space drones.

It’s an exciting time for the future of information sharing. Let’s keep sharing what we know!

A river of research, not news

I already hate the phrase “fake news”. We have better words to describe lies, disinformation, propaganda and slander, so let’s just use those.

While the phrase “fake news” might originally have been used to refer to hoaxes and disinformation, it’s rapidly becoming a meaningless term used to refer to anything you disagree with. Trump’s recent remarks being a case in point: unverified news is something very different.

Of course this is all on a sliding scale. Many news outlets breathlessly report on scientific research. This can make for fun, if eye-rolling, reading. Advances in AI and discovery of alien mega-structures are two examples that spring to mind.

And then there’s the way in which statistics and research are given a spin by the newspapers or politicians. This often glosses over key details in favour of getting across a political message or point scoring. Today I was getting cross about Theresa May’s blaming of GPs for the NHS crisis. Her remarks are based on a report recently published by the National Audit Office. I haven’t seen a single piece of coverage link to the NAO press release or the high-level summary (PDF), so you’ll either have to accept their remarks or search for it yourself.

Organisations like Full Fact do an excellent job of digging into these claims. They link the commentary to the underlying research or statistics alongside a clear explanation. In the same vein is NHS Choices Behind the Headlines which fills a similar role, but focuses on the reporting of medical and health issues.

There’s also a lot of attention focused on helping to surface this type of fact checking and explanations via search results. Fact checking, to properly dig into statistics and clearly present them is, I suspect, a time consuming exercise. Especially if you’re hoping to present a neutral point of view.

What I think I’d like though is a service that brings all those different services together. To literally give me the missing links between research, news and commentary.

But rather than aggregating news articles or fact checking reports to give me a feed, or what we used to call a “river of news”, why not present a river of research instead? Let me see the statistics or reports that are being debated and then let me jump off to see the variety of commentary and fact checking associated with it.

That way I could choose to read the research or a summary of it, and then decide to look at the commentary. Or, more realistically, I could at least see the variety of ways in which a specific report is being presented, described and debated. That would be a useful perspective I think. It would shift the focus away from individual outlets and help us find alternative viewpoints.
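The underlying data structure could be very simple. Here’s a rough sketch, assuming each piece of coverage has already been tagged with the report it discusses (which is the hard, possibly crowd-sourced, part). The field names and the river_of_research function are made up for illustration.

```python
from collections import defaultdict


def river_of_research(items):
    """Group news, commentary and fact checks under the report they discuss.

    Each item is a dict with hypothetical keys: 'report' (the primary
    source), 'kind' (e.g. 'news' or 'fact check') and 'url'.
    """
    river = defaultdict(list)
    for item in items:
        river[item["report"]].append((item["kind"], item["url"]))
    return dict(river)


if __name__ == "__main__":
    # Placeholder entries, not real articles.
    coverage = [
        {"report": "Report A", "kind": "news", "url": "https://example.org/news/1"},
        {"report": "Report A", "kind": "fact check", "url": "https://example.org/check/1"},
        {"report": "Report B", "kind": "news", "url": "https://example.org/news/2"},
    ]
    for report, links in river_of_research(coverage).items():
        print(report, links)
```

The feed is then keyed on the research, with the coverage hanging off it, rather than the other way round.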

I doubt that this would become anyone’s primary way to consume the news. But it could be interesting to those of us who like to dig behind the headlines. It would also be useful as a research tool in its own right. In the face of consistent lack of interest from news outlets in linking to primary sources, this might be something that could be crowd-sourced.

Does this type of service already exist? I suspect there are similar efforts around academic research, but I don’t recall seeing anything that covers a wider set of outputs including national and government statistics.


Checking Fact Checkers

As of last month Google News attempts to highlight fact check articles. Content from fact checking organisations will be tagged so that their contribution to on-line debate can be more clearly identified. I think this is a great move and a first small step towards addressing wider concerns around use of the web for disinformation and a “post truth” society.

So how does it work?

Firstly, news sites can now advertise fact checking articles using a pending schema.org extension called Claim Review. The mark-up allows a fact checker to indicate which article they are critiquing along with a brief summary of what aspects are being reviewed.
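To give a feel for the mark-up, here’s a rough sketch that builds a ClaimReview description as JSON-LD. The property names follow the schema.org ClaimReview type as I understand it, so check the current definition before relying on them. The claim_review function and every value passed to it are hypothetical.

```python
import json


def claim_review(review_url, claim_text, claim_source_url,
                 rating_label, reviewer_name, date_published):
    """Return a ClaimReview description as a JSON-LD compatible dict."""
    return {
        "@context": "https://schema.org",
        "@type": "ClaimReview",
        "url": review_url,                # the fact check article itself
        "datePublished": date_published,
        "claimReviewed": claim_text,      # brief summary of the claim
        "itemReviewed": {                 # the article being critiqued
            "@type": "CreativeWork",
            "url": claim_source_url,
        },
        "reviewRating": {"@type": "Rating", "alternateName": rating_label},
        "author": {"@type": "Organization", "name": reviewer_name},
    }


if __name__ == "__main__":
    # All values below are placeholders.
    doc = claim_review(
        "https://example.org/fact-checks/1",
        "Example claim being checked",
        "https://example.com/news/original-article",
        "Mostly false",
        "Example Fact Checkers",
        "2016-11-01",
    )
    print(json.dumps(doc, indent=2))
```

The output would typically be embedded in the fact check page inside a script element of type application/ld+json.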

Metadata alone is obviously ripe for abuse. Anyone could claim any article is a fact check. So there’s an additional level of editorial control that Google layer on top of that metadata. They’ve outlined their criteria in their help pages. These seem perfectly reasonable: it should be clear what facts are being checked, sources must be cited, organisations must be non-partisan and transparent, etc.

It’s the latter aspect that I think is worth digging into a little more. The Google News announcement references the International Fact Checking Network and a study on fact checking sites. The study, by the Duke Reporter’s Lab, outlines how they identify fact checking organisations. Again, they mention both transparency of sources and organisational transparency as being important criteria.

I think I’d go a step further and require that:

  • Google’s (and others’) lists of approved fact checking organisations are published as open data
  • The lists are cross-referenced with identifiers from sources like OpenCorporates that will allow independent verification of ownership, etc.
  • Fact checking organisations publish open data about their sources of funding and affiliations
  • Fact checking organisations publish open data, perhaps using Schema.org annotations, about the dataset(s) they use to check individual claims in their articles
  • Fact checking organisations licence their ClaimReview metadata for reuse by anyone

Fact checking is an area that benefits from the greatest possible transparency. Open data can deliver that transparency.

Another angle to consider is that fact checking may be carried out by more than just media organisations. Jon Udell has written a couple of interesting pieces on annotating the wild-west of information flow and bird-dogging the web that highlight the potential role of annotation services in helping to fact check and create constructive debate and discussion on-line.

Digital public institutions for the information commons?

I’ve been thinking a bit about “the commons” recently. Specifically, the global information commons that is enabled and supported by Creative Commons (CC) licences. This covers an increasingly wide variety of content as you can see in their recent annual review.

The review unfortunately doesn’t mention data although there’s an increasing amount of that published using CC (or compatible) licences. Hopefully they’ll cover that in more detail next year.

I’ve also been following with interest Tom Steinberg’s exploration of Digital Public Institutions (Part 1, Part 2). As a result of my pondering about the information and data commons, I think there are a couple of other types of institution which we might add to Tom’s list.

My proposed examples of digital public services are deliberately broad. They’re intended to serve the citizens of the internet, not just any one country.

Commons curators

Everyone has seen interesting facts and figures about the rapidly growing volume of activity on the web. These are often used as examples of dizzying growth and as a jumping off point for imagining the next future shocks that are only just over the horizon. The world is changing at an ever increasing rate.

But it’s also an archival challenge. The majority of that material will never be listened to, read or watched. Data will remain unanalysed. And in all likelihood it may disappear before anyone has had any chance to unlock its potential. Sometimes media needs time to find its audience.

This is why projects like the Internet Archive are so important. I think the Internet Archive is one of the greatest achievements of the web. If you need convincing then watch this talk by Brewster Kahle. If, like me, you’re of a certain age then these two things alone should be enough to win you over.

I think we might see, and arguably need, more digital public institutions that are not just archiving great chunks of the web, but also working with that material to help present it to a wider audience.

I see other signals that this might be a useful thing to do. Think about all of the classic film, radio and TV material that is never likely to see the light of day again. Not just for rights reasons, but also because it’s not HD quality or hasn’t been cut and edited to reflect modern tastes. I think this is at least partly the reason why we see so many reboots and remakes.

Archival organisations often worry about how to preserve digital information. One tactic is to consider how to migrate between formats to ensure information remains accessible. What if we treated media the same? E.g. by re-editing or remastering it to make it engaging to a modern audience? Here’s an example of modernising classic scientific texts and another that is remixing Victorian jokes as memes.

Maybe someone could spin a successful commercial venture out of this type of activity. But I’m wondering whether you could build a “public service broadcasting” organisation that presented refined, edited, curated views of the commons? I think there’s certainly enough raw materials.

Global data infrastructure projects

The ODI have spent some time this year trying to bring into focus the fact that data is now infrastructure. In my view the best exemplar of a truly open piece of global data infrastructure is OpenStreetMap (OSM). A collaboratively maintained map of our world. Anyone can contribute. Anyone can use it.

OSM was set up to try to solve the issue that the UK’s mapping and location infrastructure was, and largely still is, tied up with complex licensing and commercial models. Rather than knocking at the door of existing data holders to convince them to release their data, OSM shows what you can deliver with the participation of a crowd of motivated people using modern technology.

It’s a shining example of the networked age we live in.

There’s no reason to think that this couldn’t be done for other types of data, creating more publicly owned infrastructure. There are now many more ways in which people could contribute data to such projects. Whether that information is about themselves or the world around us.

Getting coverage and depth to data could also potentially be achieved very quickly. Costs to host and serve data are also dropping, so sustainability also becomes more achievable.

And I also feel (hope?) there is a growing unease with so much data infrastructure being owned by commercial organisations. So perhaps there’s a movement towards wanting more of this type of collaboratively owned infrastructure.

Data infrastructure incubators

If you buy into the fact that we need more projects like OSM, then it’s natural to start thinking about the common features of such projects. Those that make them successful and sustainable. There are likely to be some common organisational patterns that can be used as a framework for designing these organisations. Currently, while focused on scholarly research, I think this is the best attempt at capturing those patterns that I’ve seen so far.

Given a common framework, it becomes possible to create incubators whose job it is to launch these projects and coach, guide and mentor them towards success.

So that is my third and final addition to Steinberg’s list: incubators that are focused not on the creation of the next start-up “unicorn” but on generating successful, global collaborative data infrastructure projects. Institutions whose goal is the creation of the next OpenStreetMap.

These types of projects have a huge potential impact as they’re not focused on a specific sector. OSM is relevant to many different types of application, and its data is used in many different ways. I think there’s a lot more foundational data of this type which could and should be publicly owned.

I may be displaying my naivety, but I think this would be a nice thing to work towards.

Fictional data

The phrase “fictional data” popped into my head recently, largely because of odd connections between a couple of projects I’ve been working on.

It’s stuck with me because, if you set aside the literal meaning of “data that doesn’t actually exist”, there are some interesting aspects to it. For example the phrase could apply to:

  1. data that is deliberately wrong or inaccurate in order to mislead – lies or spam
  2. data that is deliberately wrong as a proof of origin or claim of ownership – e.g. inaccuracies introduced into maps to identify their sources, or copyright easter eggs
  3. data that is deliberately wrong, but intended as a prank – e.g. the original entry of Uqbar on Wikipedia. Uqbar is actually a doubly fictional place.
  4. data that is fictionalised (but still realistic) in order to support testing of some data analysis – e.g. a set of anonymised and obfuscated bank transactions
  5. data that is fictionalised in order to avoid being a nuisance, causing confusion, or accidental linkage – like 555 prefix telephone numbers or perhaps social media account names
  6. data that is drawn from a work of fiction or a virtual world – such as the Marvel universe social graph, the Elite: Dangerous trading economy (context), or the data and algorithms relating to Pokemon capture.

I find all of these fascinating, for a variety of reasons:

  • How do we identify and exclude deliberately fictional data when harvesting, aggregating and analysing data from the web? Credit to Ian Davis for some early thinking about attack vectors for spam in Linked Data. While I’d expect copyright easter eggs to become less frequent they’re unlikely to completely disappear. But we can definitely expect more and more deliberate spam and attacks on authoritative data. (Categories 1, 2, 3)
  • How do we generate useful synthetic datasets that can be used for testing systems? Could we generate data based on some rules and a better understanding of real-world data as a safer alternative to obfuscating data that is shared for research purposes? It turns out that some fictional data is a good proxy for real world social networks. And analysis of videogame economics is useful for creating viable long-term communities. (Categories 4, 6)
  • Some of the most enthusiastic collectors and curators of data are those that are documenting fictional environments. Wikia is a small universe of mini-wikipedias complete with infoboxes and structured data. What can we learn from those communities and what better tools could we build for them? (Category 6)
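As a small illustration of the synthetic data idea (category 4), here’s a rough sketch that generates fictional bank transactions from a few simple rules. The field names, value ranges, merchant names and the synthetic_transactions function are all invented for the example. A realistic generator would need rules derived from an understanding of real-world data.

```python
import random
from datetime import date, timedelta


def synthetic_transactions(n, seed=0, start=date(2016, 1, 1)):
    """Generate n fictional transactions, reproducibly for a given seed."""
    rng = random.Random(seed)  # local generator, so results are repeatable
    merchants = ["Grocer", "Cafe", "Bookshop", "Rail"]  # made-up merchants
    rows = []
    for _ in range(n):
        rows.append({
            "account": f"ACC-{rng.randint(1, 5):04d}",
            "date": (start + timedelta(days=rng.randint(0, 364))).isoformat(),
            "merchant": rng.choice(merchants),
            "amount_pence": rng.randint(100, 20000),
        })
    return rows


if __name__ == "__main__":
    for row in synthetic_transactions(3, seed=42):
        print(row)
```

Because nothing here is derived from real customers there’s nothing to de-anonymise, which is the appeal over obfuscating a real dataset.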

Interesting, huh?

Thoughts on the Netflix API Closure

A year ago Netflix announced that they were shuttering their public API: no new API keys or affiliates and no more support. Earlier this week they announced that the entire public API will be shut down by November 2014.

This is interesting news and it’s been covered in various places already, including this good overview at Programmable Web. I find it interesting because it’s the first time that I can recall a public API being so visibly switched out for a closed, private alternative. Netflix will still offer an API but only for a limited set of eight existing affiliates and (of course) their own applications. Private APIs have always existed and will continue to do so, but the trend to date has been about these being made public, rather than a move in the opposite direction.

It’s reasonable to consider if this might be the first of a new trend, or whether it’s just an outlier. Netflix have been reasonably forthcoming about their API design decisions so I expect many others will be reflecting on their decision and whether it would make sense for them.

But does it make sense at all?

If you read this article by Daniel Jacobson (Director of Engineering for the Netflix API) you can get more detail on the decision and some insight into their thought process. By closing the public API and focusing on a few affiliates Jacobson suggests that they are able to optimise the API to fit the needs of those specific consumers. The article suggests that a fine-grained resource-oriented API is excellent for supporting largely un-mediated use by a wide range of different consumers with a range of different use cases. In contrast an API that is optimised for fewer use cases and types of query may be able to offer better performance. An API with a smaller surface area will have lower maintenance overheads. Support overheads will also be lower because there are fewer interactions to consider and a smaller user base making them.
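A toy example helps show the trade-off. This is not Netflix’s actual API. It’s a rough sketch with an invented in-memory catalogue and hypothetical class names, contrasting a fine-grained resource API with one tailored to a single “home screen” use case.

```python
# Toy in-memory catalogue; titles and fields are invented for illustration.
CATALOGUE = {
    1: {"title": "Film A", "box_art": "a.jpg", "synopsis": "First synopsis"},
    2: {"title": "Film B", "box_art": "b.jpg", "synopsis": "Second synopsis"},
    3: {"title": "Film C", "box_art": "c.jpg", "synopsis": "Third synopsis"},
}


class ResourceApi:
    """Fine-grained: one call per resource, usable for any purpose."""

    def __init__(self):
        self.calls = 0

    def list_title_ids(self):
        self.calls += 1
        return sorted(CATALOGUE)

    def get_title(self, title_id):
        self.calls += 1
        return dict(CATALOGUE[title_id])


class ExperienceApi:
    """Optimised: one call returns exactly what one known client needs."""

    def __init__(self):
        self.calls = 0

    def home_screen(self):
        self.calls += 1
        return [{"title": t["title"], "box_art": t["box_art"]}
                for _, t in sorted(CATALOGUE.items())]


def home_screen_via_resources(api):
    """Assemble the same home screen from the fine-grained API."""
    titles = (api.get_title(i) for i in api.list_title_ids())
    return [{"title": t["title"], "box_art": t["box_art"]} for t in titles]


if __name__ == "__main__":
    resource_api, experience_api = ResourceApi(), ExperienceApi()
    same = home_screen_via_resources(resource_api) == experience_api.home_screen()
    print(same, resource_api.calls, experience_api.calls)
```

The resource API needs a request per title plus one for the listing, but it can also serve clients nobody anticipated. The tailored API needs a single request, but only answers the one question it was designed for.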

That rationale is hard to argue with from either a technical or business perspective. If you have a small number of users driving most of your revenue and a long tail of users generating little or no revenue but with a high support cost, it mostly makes sense to follow the revenue. I don’t buy all of the technical rationale though. It would be possible to support a mixture of resource types in the API, as well as a mixture of support and service level agreements. So I suspect the business drivers are the main rationale here. APIs have generally meant businesses giving up control. If Netflix are able to make this work then I would be surprised if more businesses don’t do the same eventually, as a means to regain that control.

But by withdrawing from any kind of public API Netflix are essentially admitting that they don’t see any further innovation happening around their API: what they’ve seen so far is everything they’re going to see. They’re not expecting a sudden new type of usage to drive revenue and users to the service. Or at least not enough to warrant maintaining a more generic API. If they felt that the community was growing, or building new and interesting applications that benefited their business, they’d keep the API open. By restricting it they’re admitting that closer integration with a small number of applications is a better investment. It’s a standard vertical integration move that gives them greater control over all user experience with their platform. It wouldn’t surprise me if they acquired some of these applications in the future.

However it all feels a bit short-sighted to me as they’re essentially withdrawing from the Web. They’re no longer going to be able to benefit from any of the network effects of having their API be a part of the wider web and remixable (within their Terms of Service) with other services and datasets. Innovation will be limited to just those companies they’re choosing to work with through an “experience” driven API. That feels like a bottleneck in the making.

It’s always possible to optimise a business and an API to support a limited set of interactions, but that type of close coupling inevitably results in less flexibility. Personally I’d be backing the Web.