Enabling data forensics

I’m interested in how people share information, particularly data, on social networks. I think it’s worth paying attention to, so we can ensure that it’s easy for people to share insights and engage in online debates.

There’s lots of discussion at the moment around fact checking and similar ways that we can improve the ability to identify reliable and unreliable information online. But there may be other ways that we can make some small improvements in order to help people identify and find sources of data.

Data forensics is a term that usually refers to analysis of data to identify illegal activities. But the term does have a broader meaning that encompasses “identifying, preserving, recovering, analyzing, and presenting attributes of digital information”. So I’m going to appropriate the term to put a label on a few ideas.

The design of the Twitter and Facebook platforms constrains how we can share information. Within those constraints people have, inevitably, adopted various patterns that allow them to publish and share content in preferred ways. For example, information might be shared:

  1. As a link to a page, where the content of the tweet or post is just the title
  2. As a link to a page, but with a comment and/or hashtags for context
  3. As a screenshot, e.g. of some text or a chart. This usually has some commentary attached. Some apps enable this automatically, allowing you to share a screenshot of some highlighted text
  4. As images and photographs, e.g. of a printed page or report (or even sometimes a screenshot of text from another app)

In the first two examples there is always a link that allows someone to go and read the original content. In fact that seems to be the typical intention: go read (or watch) this thing.

The other two examples are usually workarounds for the fact that it’s often hard to deep link to a section of a page or video.

Sometimes it’s just not possible because the information of interest isn’t in a bookmarkable section of a page. Or perhaps the user doesn’t know how to create that kind of deep link. Or they may be further constrained by a mobile app or other service that is restricting their ability to easily share a link. Not every application lets the web happen.

In some cases screenshotting may also be a conscious choice, e.g. posting a photo of someone’s tweet because you don’t want to directly interact with them.

Whatever the reason, this means there is usually no link in the resulting post, which often makes it difficult for a reader to find the original content. While social media is reducing friction in sharing, it’s increasing friction around our ability to check the reliability and accuracy of what’s been shared.

If you tweet out a graph with some figures in a debate, I want to know where it’s come from. I want to see the context that goes with it. The ability to easily identify the source of shared content is, I think, part of “data forensics”.

So, what can we do to fix this?

Firstly, there’s more that could be done to build better ways to deep link into pages, e.g. to allow sharing of individual page elements. But people have been trying to do that on and off for years without much visible success. It’s a hard problem, particularly if you want to allow someone to link to a piece of text. It could be time for a standards body to have another crack at it. Or I might have missed some exciting progress, so please tell me if I have! But I think something like this would need some serious push behind it. You need support from not just web frameworks and the major CMS platforms, but also (probably) browser vendors.

Secondly, Twitter and Facebook could allow us some more flexibility. For example, they could allow apps to post additional links and/or other metadata that are then attached to posts and tweets. It won’t address every scenario, but it could help. It also feels like a relatively easy thing for them to do as it’s a natural extension of some existing features.

Thirdly, we could look at ways to attach data to the images people are posting, regardless of what the platforms support. I’ve previously wondered about using XMP packets to attach provenance and attribution information to images. Unfortunately it doesn’t work for every format and it turns out that most platforms strip embedded metadata anyway. This is presumably due to reasonable concerns around privacy, but they could still white-list some metadata. We could maybe use steganography too.

But the major downside here is that you’d need a custom social media client or browser extension to let you see and interact with the data. So, again, that’s a massive deployment issue.
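
To make the XMP idea a little more concrete, here’s a minimal sketch of what a provenance packet might look like. It only builds the packet text, using the Dublin Core `dc:source` property to point back at the original page. The function name `build_xmp_packet` and the example URL are my own placeholders, and actually embedding the packet into an image (and keeping it there once a platform has processed the upload) is a separate problem that this doesn’t attempt.

```python
from xml.sax.saxutils import escape

# Skeleton of an XMP packet: an RDF description wrapped in the
# standard xpacket processing instructions. Only dc:source is set.
XMP_TEMPLATE = """<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
    xmlns:dc="http://purl.org/dc/elements/1.1/">
   <dc:source>{source}</dc:source>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>"""


def build_xmp_packet(source_url: str) -> str:
    """Return XMP packet text recording where a graphic came from.

    Hypothetical helper for illustration: it does not write the
    packet into an image file.
    """
    # Escape &, < and > so the URL is safe inside XML element content.
    return XMP_TEMPLATE.format(source=escape(source_url))


# Example URL is a placeholder, not a real data source.
packet = build_xmp_packet("https://example.org/data?a=1&b=2")
print(packet)
```

A client or browser extension that understood this packet could then offer a “view source” link for the image, which is exactly the deployment problem described above.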

As things currently stand I think the best approach is to plan for visualisations and information to be shared, and design the interactions and content accordingly. Assume that your carefully crafted web page is going to be shared in a million different pieces. Which means that you should:

  • Include plenty of in-page anchors and use clear labelling to help people build links to relevant sections
  • Adapt your social media sharing buttons to not just link to the whole page, but also allow the user to share a link to a specific section
  • Design your Twitter cards and other social metadata. For example, is there a key graphic that would be best used as the page image?
  • Include links and source information on all of the graphs and infographics that you share. Make sure the link is short and persistent in case it has to be re-keyed from a screenshot
  • Provide direct ways to tweet and share a graph that will automatically include a clearly labelled image that contains a link
  • Help users cite their sources
  • …etc
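
A couple of the suggestions above are easy to sketch in code. The snippet below is a rough illustration rather than a recommended implementation: it builds a link to a specific in-page anchor, wraps that link in a Twitter web intent URL so that a share button can point at a section instead of the whole page, and generates a few Twitter card `<meta>` tags. The function names, the `fig-3` anchor and the example.org URLs are all made up for the example.

```python
from html import escape
from urllib.parse import quote, urlencode, urlsplit, urlunsplit


def section_link(page_url: str, anchor_id: str) -> str:
    """Return page_url with its fragment set to the given in-page anchor."""
    parts = urlsplit(page_url)
    return urlunsplit(parts._replace(fragment=anchor_id))


def tweet_intent_url(text: str, link: str) -> str:
    """Build a Twitter web intent URL that pre-fills a tweet with text and link."""
    query = urlencode({"text": text, "url": link}, quote_via=quote)
    return f"https://twitter.com/intent/tweet?{query}"


def twitter_card_tags(title: str, image_url: str) -> str:
    """Return Twitter card meta tags that use a key graphic as the page image."""
    tags = {
        "twitter:card": "summary_large_image",
        "twitter:title": title,
        "twitter:image": image_url,
    }
    return "\n".join(
        f'<meta name="{name}" content="{escape(value, quote=True)}">'
        for name, value in tags.items()
    )


# Placeholder page, anchor and image; swap in your own.
link = section_link("https://example.org/report", "fig-3")
print(link)  # https://example.org/report#fig-3
print(tweet_intent_url("Figure 3: spending by region", link))
print(twitter_card_tags("Spending report", "https://example.org/img/fig-3.png"))
```

The same section link can be printed on the graphic itself, so it survives even when the image is shared as a screenshot.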

What do you think? Any tips or suggestions you’d add to this list? With a bit of awareness around how data is shared, we might be able to make small improvements to online discussions.