“The Rock Thane”, an open data parable

In a time long past, in a land far away, there was once a troubled kingdom. Despite the efforts of the King to offer justice freely to all, many of his subjects were troubled by unscrupulous merchants and greedy landowners. Time and again, the King heard claims of goods not being delivered, or disputes over land.

While the merchants and landowners were able to produce documents and affidavits in their defence, the King grew increasingly troubled. He felt that his subjects were being wronged, and he grew distrustful of the scribes that thronged the hallways of his courts and marketplaces.

One day, three wizards visited the kingdom. The wizards had travelled from the Far East, where, as Masters of the Satoshi School, they had developed many curious spells. The three wizards were brothers. Na was the youngest, and was made to work hardest by his elder brothers, Ka and Mo. Mo, the eldest, was versed in many arts still unknown to his brothers.

Their offer to the King was simple: through the use of their magic they would remove all corruption from his lands. In return they would expect to be well paid for their efforts. Keen to be a just and respected ruler, the King agreed to the wizards’ plan. But while their offer was simple, the plan itself was complex.

The wizards explained that, through an obscure art, they could cause words and images to appear within a certain type of rock, or crystal, which could commonly be found throughout the land. Once imbued with words, a crystal could no longer be changed even by a powerful wizard. In a masterful show of power, Ka and Mo embedded the King’s favourite poem and then a painting of his mother in a pair of crystals of the highest quality.

The wizards explained that rather than relying on parchment which could be faked or changed through the cunning application of pumice stones, they could use inscribed crystals to create indelible records of trading bills, property sales and other important documents.

The wizards also demonstrated to the King how, by channelling the power of their masters, groups of their acolytes could simultaneously record the same words in crystals all across the land. This meant that not only would there be an indisputable record of a given trade, but that there would immediately be dozens of copies available across the land, for anyone to check. Readily available and verifiable copies of any bill of trade would mean that no merchant could ever falsify a transaction.

In payment, the wizards would receive a gold piece for every crystal inscribed by their acolytes, each crystal providing clear proof of their work.

Impressed, the King decreed that henceforth, all across his lands, trading would now be carried out in trading posts staffed by teams of the wizards’ acolytes.

And, for a time, everything was fine.

But the King began to again receive troubling reports about trading disputes. Trust was failing once again. Speaking to his advisers and visiting some of the new trading posts, the King learned the source of the concerns.

When trading bills had been written on parchment, they could be read by anyone. This made them accessible to all. But only the wizards and their acolytes could read the words inscribed in the crystals. And the King’s subjects didn’t trust them.

Demanding an explanation, the King learnt that Na, the youngest wizard, had been tasked with providing the power necessary to inscribe the crystals. Not as versed in the art as his elder brothers, he was only able to inscribe the crystals with a limited number of words and only the haziest of images. Rather than inscribing easily readable bills of trade, Na and the acolytes were making inscriptions in a cryptic language known only to wizards.

Anyone wanting to read a bill had to request an acolyte to interpret it for them. Rumours had been spreading that the acolytes could be paid to interpret the runes in ways that were advantageous to those with sufficient coin.

The middle brother, Ka, attempted to placate the enraged King, proposing an alternative arrangement. He would oversee the inscribing of the crystals in the place of his brother. Skilled in additional spells, Ka proposed that the crystals would no longer be inscribed with runes describing the bills of sale. Instead each crystal would simply hold the number of a page in a magical book. Each Book of Bills would hold an infinite number of pages. And, when a sale was made, one acolyte would write the bill into a fresh page of a Book, whilst another would inscribe the page number into a crystal. As before, across the land, other acolytes would simultaneously inscribe copies of the bills into other crystals and other copies of the Book.

In this way, anyone wanting to read a bill of sale could simply ask a Book of Bills to turn to the page they needed. Anyone could then read from the book. But the crystals themselves would remain the ultimate proof of the trade. While someone might have been able to fake a copy of a Book, no-one could fake one of the crystals.

Grudgingly accepting this even more complex arrangement, the King was briefly satisfied. Until the accident.

One day, the wizard Ka visited the Craggy Valley, to forage for the rare Ipoh herb, which was known to grow in that part of the Kingdom. However, in a sudden fog, the wizard slipped and fell to his doom. And at the moment of his death, all of the wizard’s spells were undone. In a blink of an eye, all of the magical Books of Bills disappeared. Along with every proof of trade.

Enraged once more, the King gave the eldest wizard one more opportunity to deliver. Mo reassured the King that his power was far greater and that he was uniquely able to deliver on his late brother’s promise. Mo explained that through various dark arts he was able to resist death. He demonstrated his skill to the King, recklessly drinking terrible poisons, and throwing himself from a high tower only to land unharmed. Stunned at this show of power, the King agreed that Mo could take up his brother’s task.

For a few months, the turmoil was resolved, until fresh reports of corruption began to spread.

A dismayed King granted an audience to a retinue of merchants who had travelled from all across his kingdom. The merchants claimed to have evidence that discrepancies had begun to appear in the Books of Bills. In different towns and cities the Books showed slightly different numbers. There was also talk of a strange, shadowy figure who had been present at many of the trading posts in which discrepancies had been found.

Troubled, the King sent out soldiers to set watch on the trading posts, giving orders that they should attempt to capture and bring this stranger to the court.

Many weeks of waiting and watching passed. More evidence of corrupted Books of Bills continued to appear. Challenged to explain the allegations, Mo scoffed at the evidence. The wizard suggested that the problem was illiterate merchants, asserting that his acolytes were above suspicion.

But finally the King’s soldiers captured the shadowy stranger, and his identity was revealed.

While Mo was the eldest of the three wizards, he was not the eldest of his line. There was a fourth brother, named To. Much older than his brothers, To had been stripped of his riches and banished for studying certain forbidden arts. It was from their brother that Na, Ka and Mo had learned many of their spells, including the arts of inscribing crystals and books, and the means of channelling their powers through acolytes.

Except To had not taught them everything. He had kept many secrets for himself and was able to corrupt the spells used to inscribe the crystals and Books. He was able to change page numbers to refer to other pages which he had inscribed with different words. He had been selling his skills to unscrupulous merchants in an attempt to grow rich once again.

Sickened of wizards and their complicated schemes, the King banished them from his kingdom, never to return.

The King then turned to the task of once more building trust in commerce across his land. He did this not by trusting in magics and complex schemes, but by addressing the problems with which he was originally faced. He decreed the founding of a guild, to create a cadre of trusted, reliable scribes. He appointed new ombudsmen and magistrates across the land, to help oversee and administer all forms of trade. He founded libraries and reading rooms to increase literacy amongst his subjects, so that more of them could read and write their own bills of trade. And he offered free use of the courts to all, so that none were denied an opportunity to seek justice.

Many years passed before the King and his kingdom worked through their troubles. But in the history books, the King was forever known as “The Rock Thane”.

Read the previous open data parables: The scribe and the djinn’s agreement, and The woodcutter.

Data is infrastructure, so it needs a design manual

Data is like roads. Roads help us navigate to a destination. Data helps us navigate to a decision. I like that metaphor. It helps to highlight the increasingly important role that data plays in modern society and business.

Roads help us travel to work and school. They also support a variety of different business uses. Roads are infrastructure that are created and maintained by society for the benefit of everyone. Open data, and especially open data published by the public sector, has similar characteristics. Like roads, data is infrastructure.

I think “infrastructure” is a fantastic framing for thinking about how we design and build systems that support the collection, use and reuse of data. It encourages us to think not just about the technology but also about the people who might use that data, or be impacted by it. And so we can identify some principles that characterise good data infrastructure.

Because I like to leave no metaphor un-stretched, I was excited to learn about the Design Manual for Roads and Bridges. It’s a 15-volume collection that provides standards, advice and other guidance relating to the design, assessment and operation of roads. Its volumes are packed with technical guidance that supports our national infrastructure.

And that’s not all. The government has also helpfully provided the Manual for Streets. The manual explains that “Good design is fundamental to achieving high-quality, attractive places that are socially, economically and environmentally sustainable”. I couldn’t agree more.

The summary explains that the manual breaks down the design of streets into processes that range from policy through to implementation. There’s also a hierarchy that prioritises the needs of pedestrians, those who will be most impacted by the infrastructure, over others who might also benefit from it. The manual explains that this helps to ensure that all user needs are met. Just as we must think about the individual first when building systems that collect and use data.

The Manual for Streets also talks about the importance of standards, of connectivity and of assessing quality. It also notes the need to support and encourage multiple uses. All of these have obvious parallels in data infrastructure and open licensing. The manual also highlights the importance of thinking about maintenance and sustainability, another important characteristic of data infrastructure that is often overlooked.

I think it might be interesting to think about what a Design Manual for Data Infrastructure would look like. Perhaps we can use the roads metaphor to help scope that?

For example, the first few volumes in the Design Manual for Roads and Bridges focus on general design principles, materials and methods of inspection and maintenance. That’s followed by more specific guidance on things like Road Geometry (data modelling and formats), Traffic Signs and Lighting (metadata, documentation, provenance), Traffic Control (data publishing and API design) and Communications (user engagement). There are also separate volumes that cover assessing environmental impact (data ethics, privacy impact assessments, etc).

We’re at an early stage of understanding how to build good data infrastructure. But there are already projects out there that we could learn from. And we can turn that learning into more detailed guidance and patterns that can be reused across sectors.

Sometimes metaphors can be stretched too far, but I think there’s a bit more mileage in the road metaphor yet. (Sorry, not sorry).

Lunchtime Lecture: “How you (yes, you) can contribute to open data”

The following is a written version of the lunchtime lecture I gave today at the Open Data Institute. I’ll put in a link to the video when it comes online. It’s not a transcript, I’m just writing down what I had planned to say.


I’m going to talk today about some of the projects that first got me excited about data on the web and open data specifically. I’m hopefully going to get you excited about them too. And show some ways in which you can individually get involved in creating some open data.

Open data is not (just) open government data

I’ve been reflecting recently about the shape of the open data community and ecosystem, to try and understand common issues and areas for useful work.

For example, we spend a lot of time focusing on Open Government Data. And so we talk about how open data can drive economic growth, create transparency, and be used to help tackle social issues.

But open data isn’t just government data. It’s a broader church that includes many different communities and organisations who are publishing and using open data for different purposes.

Open data is not (just) organisational data

More recently, as a community, we’ve focused some of our activism on encouraging commercial organisations to not just use open data (which many have been doing for years), but also to publish open data.

And so we talk about how open data can be supported by different business models and the need for organisational change to create more open cultures. And we collect evidence of impact to encourage more organisations to also become more open.

But open data isn’t just about data from organisations. Open data can be created and published by individuals and communities for their own needs and purposes.

Open data can (also) be a creative activity

Open data can also be a creative activity. A means for communities to collaborate around sharing what they know about a topic that is important or meaningful to them. Simply because they want to do it. I think sometimes we overlook these projects in the drive to encourage governments and other organisations to publish open data.

So I’m going to talk through eight (you said six in the talk, idiot! – Ed) different example projects. Some you will have definitely heard about before, but I suspect there will be a few that you haven’t. In most cases the primary goals of these projects are to create an openly licensed dataset. So when you contribute to the project, you’re directly helping to create more open data.

Of course, there are other ways in which we each contribute to open data. But these are often indirect contributions. For example where our personal data that is held in various services is aggregated, anonymised and openly published. But today I want to focus on more direct contributions.

For each of the examples I’ve collected a few figures that indicate the date the project started, the number of contributors, and an indication of the size of the dataset. Hopefully this will help paint a picture of the level of effort that is already going into maintaining these resources. (Psst, see the slides for the figures – Ed)


The first example is Wikipedia. Everyone knows that anyone can edit Wikipedia. But you might not be aware that Wikipedia can be turned into structured data and used in applications. There are lots of projects that do this, e.g. dbpedia, which brings Wikipedia into the web of data.

The bits that are turned into structured data are the “infoboxes” that give you the facts and figures about the person (for example) that you’re reading about. So if you add to Wikipedia, and specifically add to the infoboxes, then you’re adding to an openly licensed dataset.
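As a rough illustration of why infoboxes are so amenable to this, the underlying wikitext is essentially a list of `| key = value` pairs. This sketch is not dbpedia’s actual extraction code, just a toy parser showing the idea (the example infobox is invented):

```python
import re

def parse_infobox(wikitext):
    """Pull key/value pairs out of a MediaWiki infobox template.
    A simplified sketch: real infoboxes can contain nested templates,
    links and references that need proper handling."""
    fields = {}
    # Each field sits on its own line, e.g. "| born = 1908"
    for match in re.finditer(r"^\|\s*(\w+)\s*=\s*(.+?)\s*$", wikitext, re.MULTILINE):
        key, value = match.groups()
        fields[key] = value
    return fields

infobox = """{{Infobox writer
| name  = Gordon Boshell
| born  = 1908
| genre = Children's fiction
}}"""

print(parse_infobox(infobox))
# {'name': 'Gordon Boshell', 'born': '1908', 'genre': "Children's fiction"}
```

Every infobox field a contributor fills in becomes one more machine-readable fact downstream.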

The most obvious example of where this data is used is in Google search results. The infoboxes you see in search results whenever you google for a person, place or thing are partly powered by Wikipedia data.

A few years ago I added a Wikipedia page for Gordon Boshell, the author of some children’s books I loved as a kid. There wasn’t a great deal of information about him on the web, so I pulled whatever I could find together and created a page for him. Now when anyone searches for Gordon Boshell they can see some information about him right on Google, with links out to the books that he wrote. It’s nice to think that I’ve helped raise his profile.

There’s also a related project from the Wikimedia Foundation called Wikidata. Again, anyone can edit it, but it’s a database of facts and figures rather than an encyclopedia.


The second example is OpenStreetMap. You’ll definitely have already heard about its goal to create a crowd-sourced map of the world. OpenStreetMap is fascinating because it’s grown this incredible ecosystem of tools and projects that make it easier to contribute to the database.

I’ve recently been getting involved with contributing to OpenStreetMap. My initial impression was that I was probably going to have to get a commercial GPS and go out and do complicated surveying. But it’s not like that at all. It’s really easy to add points to the map, and to use their tools to trace buildings from satellite imagery. They provide great tutorials to help you get started.

It’s surprisingly therapeutic. I’ve spent a few evenings drinking a couple of beers and tracing buildings. It’s a bit like an adult colouring book, except you’re creating a better map of the world. Neat!

There are a variety of other tools that let you contribute to OpenStreetMap. For example Wheelmap allows you to add wheelchair accessibility ratings to locations on the map. We’ve been using this in the AccessibleBath project to help crowd-source data about wheelchair accessibility in Bath. One afternoon we got a group of around 25 volunteers together for a couple of hours and mapped 30% of the city centre.

There’s a lot of humanitarian mapping that happens using OpenStreetMap. If there’s been a disaster or a disease outbreak then aid workers often need better maps to help reach the local population and target their efforts. Missing Maps lets you take part in that. They have a really nice workflow that lets you contribute towards improving the map by tracing satellite imagery.

There’s a related project called MapSwipe. It’s a mobile application that presents you with a grid of satellite images. All you have to do is tap the tiles which contain a building and then swipe left. Behind the scenes this data is used to direct Missing Maps volunteers towards the areas where more detailed mapping would be most useful. This focuses contributors’ attention where it’s most needed and so is really respectful of people’s time.

MapSwipe can also be used offline. So you can download a work package to do when you’re on your daily commute. Easy!


You’ve probably also heard of Zooniverse, which is my third example. It’s a platform for citizen science projects. That just means using crowd-sourcing to create scientific datasets.

Their most famous project is probably GalaxyZoo which asked people to help classify objects in astronomical imagery. But there are many other projects. If you’re interested in biology then perhaps you’d like to help catalogue specimens held in the archives of the Natural History Museum?

Or there’s Old Weather, which I might get involved with. In that project you can help to build a picture of our historical climate by transcribing the weather reports that whaling ship captains wrote in their logs. By collecting that information we can build a dataset that tells us more about our climate.

I think it’s a really innovative way to use historical documents.


This is my fourth and favourite example. MusicBrainz is a database of music metadata: information about artists, albums, and tracks. It was created in direct response to commercial music databases that were asking people to contribute to their dataset, but then were taking all of the profits and not returning any value to the community. MusicBrainz created a free, open alternative.

I think MusicBrainz was the first open dataset I got involved with. I wrote a client library to help developers use the data. (14 years ago, and you’re still talking about it – Ed)

MusicBrainz has also grown a commercial ecosystem around it, which has helped it be sustainable. There are a number of projects that use the dataset, including Spotify. And it’s been powering the BBC Music website for about ten years.


My fifth example, Discogs, is also a music dataset. But it’s a dataset about vinyl releases, so it focuses more on the releases, labels, etc. Discogs is a little different because it started as, and still is, a commercial service. At its core it’s a marketplace for record collectors. But to power that marketplace you need a dataset of vinyl releases. So they created tools to help the community build it. And, over time, it’s become progressively more open.

Today all of the data is in the public domain.


My sixth example is OpenPlaques. It’s a database of the commemorative plaques that you can see dotted around on buildings and streets. The plaques mark that an important event happened in that building, or that someone famous was born or lived there. Volunteers take photos of the plaques and share them with the service, along with the text and names of anyone who might be mentioned in the plaque.

It provides a really interesting way to explore the historical information in the context of cities and buildings. All of the information is linked to Wikipedia so you can find out more information.


My seventh example is Rebrickable, which you’re unlikely to have heard about. I’m cheating a little here as it’s a service and not strictly a dataset. But it’s Lego, so I had to include it.

Rebrickable has a big database of all the official Lego sets and the parts they contain. If you’re a fan of Lego (they’re called AFOLs – Ed) who designs and creates your own custom Lego models (they’re known as MOCs – Ed), then you can upload the designs and instructions to the service in machine-readable LEGO CAD formats.

Rebrickable exposes all of the information via an API under a liberal licence. So people can build useful tools. For example using the service you can find out which other official and custom sets you can build with bricks you already own.
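That “which sets can I build” feature boils down to comparing part inventories: does what you own cover what the set needs? A toy sketch of that check (the part numbers and quantities here are invented, not real Rebrickable data):

```python
def can_build(owned, set_parts):
    """Return True if the owned parts cover every part the set requires.
    Inventories are simple {part_number: quantity} mappings."""
    return all(owned.get(part, 0) >= qty for part, qty in set_parts.items())

# Hypothetical inventories
my_bricks  = {"3001": 10, "3020": 4, "3062b": 2}
moon_buggy = {"3001": 6, "3062b": 1}   # everything covered
star_base  = {"3001": 6, "9999": 3}    # needs a part I don't own

print(can_build(my_bricks, moon_buggy))  # True
print(can_build(my_bricks, star_base))   # False
```

A real implementation would pull both inventories from the Rebrickable API rather than hard-coding them, but the comparison itself is this simple.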

Grand Comics Database

My eighth and final example is the Grand Comics Database. It’s also the oldest project as it was started in 1994. The original creators started with desktop tools before bringing it to the web.

It’s a big database of 1.3m comics. It contains everything from The Dandy and The Beano through to Marvel and DC releases. It’s not just data on the comics, but also story arcs, artists, authors, etc. If you love comics you’ll love GCD. I checked, and this week’s 2000AD (published 2 days ago – Ed) is in there already.

So those are my examples of places where you could contribute to open data.

Open data is an enabler

The interesting thing about them all is that open data is an enabler. Open data isn’t creating economic growth, or being used as a business model. Open licensing is being applied as a tool.

It creates a level playing field that means that everyone who contributes has an equal stake in the results. If you and I both contribute then we can both use the end result for any purpose. A commercial organisation is not extracting that value from us.

Open licensing can help to encourage people to share what they know, which is the reason the web exists.

Working with data

The projects are also great examples of ways of working with data on the web. They’re all highly distributed projects, accepting submissions from people internationally who will have very different skill sets and experience. That creates a challenge that can only be dealt with by having good collaboration tools and by having really strong community engagement.

Understanding how and why people contribute to your open database is important, because often those reasons will change over time. When OpenStreetMap had just started, contributors had the thrill of filling in a blank map with data about their local area. But now contributions are different. It’s more about maintaining data and adding depth.

Collaborative maintenance

In the open data community we often talk about making things open to make them better. It’s the tenth GDS design principle. And making data open does make it better, in the sense that more people can use it. And perhaps more eyes can help spot flaws.

But if you really want to let people help make something better, then you need to put your data into a collaborative environment. Then data can get better at the pace of the community and not your ability to accept feedback.

It’s not work if you love it

Hopefully the examples give you an indication of the size of these communities and how much has been created. It struck me that many of them have been around since the early 2000s. I’ve not really found any good recent examples (Maybe people can suggest some – Ed). I wonder why that is?

Most of the examples were born around the Web 2.0 era (Mate. That phrase dates you. – Ed) when we were all excitedly contributing different types of content to different services. Bookmarks and photos and playlists. But now we mostly share things on social media. It feels like we’ve lost something. So it’s worth revisiting these services to see that they still exist and that we can still contribute.

While these fan communities are quietly hard at work, maybe we in the open data community can do more to support them?

There’s a lot of examples of “open” datasets that I didn’t use because they’re not actually open. The licenses are restrictive. Or the community has decided not to think about it. Perhaps we can help them understand why being a bit more open might be better?

There are also examples of openly licensed content that could be turned into more data. Take Wikia for example. It contains 360,000 wikis, all with openly licensed content. They get 190m views a month and the system contains 43 million pages, about the same size as the current English version of Wikipedia. They’re all full of infoboxes that are crying out to be turned into structured data.

I think it’d be great to make all this fan-produced data a proper part of the open data commons, sitting alongside the government and organisational datasets that are being published.

Thank you (yes, you!)

That’s the end of my talk. I hope I’ve piqued your interest in looking at one or more of these projects in more detail. Hopefully there’s a project that will help you express your inner data geek.

Photo Attributions

Lego Spaceman, Edwin Andrade, Jamie Street, Olu Elet, Aaron Burden, Volkan Olmez, Alvaro Serrano, RawPixel.com, Jordan Whitfield, Anthony DELANOIX


Where can you contribute to open data? Yes, you!

This is just a quick post to gather together some pointers and links that were shared in answer to a question I asked on twitter yesterday:

I want to try out a bunch of different services to explore how easy it is for people to contribute to open data projects. Because I’m interested in how we can contribute as individuals, I’m ruling out things like government open data portals. They’re not typically places where mere punters like you or I can contribute.

I’m also interested in sites that generate open data, not just public data. There needs to be an open licence on the results. Or, at the very least, a note along the lines of: “do whatever you want with this”.

I’m thinking more of places where we can collaborate around creating open data.

The short list

Here’s a quick list of the suggestions, along with a few I’d already turned up. I’m sure there are a lot more. Please leave a comment or ping me on twitter if you have suggestions. And yes, I’ll turn this into data at some point.

  1. OpenStreetMap was the starter for ten. I’ve already written about a number of ways you can contribute to the effort
  2. Discogs, contribute to their public domain database
  3. Wikipedia, content in infoboxes is presented as data via dbpedia and wikidata
  4. You can also contribute directly to Wikidata
  5. MusicBrainz, is completely crowd-sourced
  6. You can contribute company information to OpenCorporates
  7. Questions you answer on Stackoverflow end up as open data
  8. DemocracyClub are doing an awesome job of co-ordinating crowd-sourced data collection that the UK government should just be doing itself
  9. The product data you add to OpenFoodFacts is open
  10. It looks like you can contribute Creative Commons licensed content and data to the Encyclopedia of Life
  11. OpenPlaques is open to contributions
  12. The Quick, Draw with Google data is actually open. Google seem to be opening up more of their research data
  13. ESRI are building some crowdsourcing apps, which generate open data
  14. If you’re in Germany and have some sensor data, you can feed it into OpenSenseMap. Their data dumps are in the public domain

What else should be on this list?


There were also a number of sites that were suggested, or which I considered, but had to be rejected. Mostly because they’re not actually publishing open data. They either have restrictions on usage, or the licensing is very unclear. If you can clarify any of these then let me know.

Clearly there are hundreds of non-open databases, but do let me know if I’m wrong about any of the above, and I’ll amend the article accordingly.

Can you publish tweets as open data?

Can you publish data from twitter as open data? The short answer is: No. Read on for some notes, pointers and comments.

Twitter’s developer policy places a number of restrictions on your use of their API and the data you get from it. Some of the key ones are:

  • In the Restrictions on Use of Licensed Materials (II.C) they make it clear that you can’t use any geographic data from the platform. You can only use it to identify the location from which a tweet was made and not for any other purpose. You also can’t aggregate or cache it, unless you’re storing it with the rest of the tweet. And elsewhere they place further restrictions on storage of tweets. They reiterate this in section B.9
  • Section F.2 “Be a Good Partner to Twitter” (sic) is the key one for data, as here you’re agreeing to not store anything except the ID for a tweet. You can’t store the message, its metadata or anything about the user, just the ID.
  • You are allowed to make those IDs downloadable in various ways but there are restrictions on how many tweets you can publish per user, per day
  • In the Ownership and Feedback section, they make it clear that the only rights you have to use content are derived from this agreement and those rights can be taken away at any time.
  • Anyone that you distribute data to must also agree to ALL of Twitter’s terms, not just the developer policy, but its general terms of service, privacy policy, etc. So everyone’s agreements can be revoked at any time.

That’s a very closed set of terms.

There’s some great analysis of the terms and what they mean for researchers elsewhere. Ernesto Priego has an interesting pair of posts looking at twitter as public evidence and the ethics of twitter research and why you might want to archive and share small twitter datasets.

Ed Summers has also written about archiving twitter datasets and the process of “hydrating” a twitter ID to turn it back into useful content. There’s a whole set of APIs, tools and practices that have built up around the process of hydration as a means to work around Twitter’s terms. I think it’s interesting as an example of using a combination of data and open source to address licensing limitations.
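Hydration itself is conceptually simple: you look the IDs back up via the API, in batches (the v1.1 `statuses/lookup` endpoint accepted up to 100 IDs per request). A sketch of that batching logic, where `fetch` stands in for the real, authenticated API call:

```python
def batches(ids, size=100):
    """Split a list of tweet IDs into lookup-sized chunks."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

def hydrate(ids, fetch):
    """Resolve tweet IDs back into full tweets, one batch at a time.
    `fetch` is a placeholder for an authenticated statuses/lookup call;
    deleted or protected tweets simply come back missing."""
    tweets = []
    for chunk in batches(ids):
        tweets.extend(fetch(chunk))
    return tweets

# With a stub fetch, 250 IDs produce 3 calls of sizes 100, 100 and 50.
calls = []
hydrate(list(range(250)), lambda chunk: calls.append(len(chunk)) or [])
print(calls)  # [100, 100, 50]
```

Sharing only the IDs plus a tool like this keeps the archive within the letter of the terms, at the cost of anyone re-fetching the content themselves.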

Yesterday, Justin Littman published a short piece highlighting how Twitter have just further restricted their terms. The key changes place upper limits on how many tweet IDs you can distribute, and they raise concerns about how archival projects like DocNow can continue. Although, on my reading of the terms, those projects were already in question: Twitter doesn’t grant you the rights to re-publish data under anything other than its own terms, so I think those datasets were already in breach of the agreement.

So, we get to our answer: no, you can’t publish anything from Twitter under an open licence. If you’re intending to do this in a project then I recommend you get approval from Twitter directly.

Obviously these terms are designed for Twitter’s sole benefit. They help Twitter retain as much value as possible while still operating as a platform. Data asymmetry in action.

I think what’s particularly frustrating is that they seem to rarely enforce these terms, even for services that clearly breach them. After crafting a legal agreement they choose not to actively police it, because it’s not worth their time to do so. Presumably they will step in if there are large-scale, significant breaches. But it makes you wonder how much value is really being protected.

In the meantime we are left with areas of doubt and uncertainty. Does the continued existence of a service mean it’s an exemplar of acceptable practice? Or are Twitter just choosing to ignore it? All of this starts to poison the well of open data. A more open approach would be for them to offer some allowance for small-scale archiving and data sharing. Openly licensing tweet IDs would be a start.

For better or worse, Twitter’s data has a role in helping us understand modern society, so we should be able to use it. Unfortunately their donation of the Twitter archive to the Library of Congress is foundering on a mixture of technical and legal issues. Twitter is not really a public space. It’s a private hall where we choose to meet.


A couple of final extra points based on comments on this post (see below) and on Twitter. Ed Summers rightly pointed out that services that are seemingly breaching Twitter’s terms may in fact have permission to do so. A couple of examples came up.

Andy Piper (Twitter Dev lead) notes that Twitter have posted a policy update clarification:

The clarification explains that developers can request permission to share more than 1.5m tweet IDs in a 30-day period. It also notes that researchers from “an accredited academic institution” can share an unlimited number of tweet IDs. This lifts some of the restrictions on distribution, but also reinforces some of the key points I make above: any use of the data remains subject to Twitter’s policies, and by default data from Twitter can’t be published as open data. But if you’re willing to pay then it looks like Twitter are willing to share more widely.

Joe Wass from CrossRef explained that they’ve had explicit permission from Twitter to distribute tweet IDs under a CC0 waiver within their Event Data service.

CrossRef negotiated this permission as part of their commercial arrangement with Twitter. This means that at least some Tweet IDs can be considered to be in the public domain. It just depends on where you got them from: the Twitter API or CrossRef.

Enabling data forensics

I’m interested in how people share information, particularly data, on social networks. I think it’s something to which it’s worth paying attention, so we can ensure that it’s easy for people to share insights and engage in online debates.

There’s lots of discussion at the moment around fact checking and similar ways that we can improve the ability to identify reliable and unreliable information online. But there may be other ways that we can make some small improvements in order to help people identify and find sources of data.

Data forensics is a term that usually refers to analysis of data to identify illegal activities. But the term does have a broader meaning that encompasses “identifying, preserving, recovering, analyzing, and presenting attributes of digital information”. So I’m going to appropriate the term to put a label on a few ideas.

The design of the Twitter and Facebook platforms constrain how we can share information. Within those constraints people have, inevitably, adopted various patterns that allow them to publish and share content in preferred ways. For example, information might be shared:

  1. As a link to a page, where the content of the tweet or post is just the title
  2. As a link to a page, but with a comment and/or hashtags for context
  3. As a screenshot, e.g. of some text or a chart, usually with some commentary attached. Some apps enable this automatically, allowing you to share a screenshot of highlighted text
  4. As images and photographs, e.g. of a printed page or report (or even a screenshot of text from another app)

In the first two examples there are always links that allow someone to go and read the original content. In fact that seems to be the typical intention: go read (or watch) this thing.

The other two examples are usually workarounds for the fact that it’s often hard to deep link into a section of a page or video.

Sometimes it’s just not possible, because the information of interest isn’t in a bookmarkable section of a page. Or perhaps the user doesn’t know how to create that kind of deep link. Or they may be further constrained by a mobile app or other service that restricts their ability to easily share a link. Not every application lets the web happen.

In some cases screenshotting may also be a conscious choice, e.g. posting a photo of someone’s tweet because you don’t want to interact with them directly.

Whatever the reason, the resulting post usually contains no link, which often makes it difficult for a reader to find the original content. While social media is reducing the friction around sharing, it’s increasing the friction around our ability to check the reliability and accuracy of what’s been shared.

If you tweet out a graph with some figures in a debate, I want to know where it’s come from. I want to see the context that goes with it. The ability to easily identify the source of shared content is, I think, part of “data forensics”.

So, what can we do to fix this?

Firstly, more could be done to build better ways to deep link into pages, e.g. to allow sharing of individual page elements. But people have been trying to do that on and off for years without much visible success. It’s a hard problem, particularly if you want to allow someone to link to a piece of text. It could be time for a standards body to have another crack at it. Or I might have missed some exciting progress, so please tell me if I have! But I think something like this would need some serious push behind it: support from not just web frameworks and the major CMS platforms, but also (probably) browser vendors.

Secondly, Twitter and Facebook could allow us some more flexibility. For example, allow apps to post additional links and/or other metadata that are then attached to posts and tweets. It won’t address every scenario, but it could help. It also feels like a relatively easy thing for them to do, as it’s a natural extension of some existing features.

Thirdly, we could look at ways to attach data to the images people are posting, regardless of what the platforms support. I’ve previously wondered about using XMP packets to attach provenance and attribution information to images. Unfortunately it doesn’t work for every format and it turns out that most platforms strip embedded metadata anyway. This is presumably due to reasonable concerns around privacy, but they could still white-list some metadata. We could maybe use steganography too.
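To make the XMP idea concrete, here’s an illustrative sketch (not a full XMP serializer) of building a minimal packet that carries provenance as Dublin Core properties; embedding it in a JPEG would mean wrapping the packet in the format’s metadata segment:

```python
# Illustrative sketch: build a minimal XMP packet carrying provenance
# (a source URL and a creator) as Dublin Core properties. Real-world
# embedding would place this inside the image file's metadata segment,
# which most social platforms currently strip anyway.
from xml.sax.saxutils import escape

XMP_TEMPLATE = (
    '<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>'
    '<x:xmpmeta xmlns:x="adobe:ns:meta/">'
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"'
    ' xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<rdf:Description rdf:about="" dc:source="{source}" dc:creator="{creator}"/>'
    '</rdf:RDF></x:xmpmeta>'
    '<?xpacket end="w"?>'
)

def xmp_packet(source, creator):
    """Return an XMP packet string pointing back at the original content."""
    quote = {'"': '&quot;'}
    return XMP_TEMPLATE.format(source=escape(source, quote),
                               creator=escape(creator, quote))
```

A custom client could read this back out of a shared image and offer the reader a link to the original page.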

But the major downside here is that you’d need a custom social media client or browser extension to let you see and interact with the data. So, again, that’s a massive deployment issue.

As things currently stand I think the best approach is to plan for visualisations and information to be shared, and design the interactions and content accordingly. Assume that your carefully crafted web page is going to be shared in a million different pieces. Which means that you should:

  • Include plenty of in-page anchors and use clear labelling to help people build links to relevant sections
  • Adapt your social media sharing buttons to not just link to the whole page, but also allow the user to share a link to a specific section
  • Design your Twitter cards and other social metadata; for example, is there a key graphic that would be best used as the page image?
  • Include links and source information on all of the graphs and infographics that you share. Make sure the link is short and persistent in case it has to be re-keyed from a screenshot
  • Provide direct ways to tweet and share out a graph that will automatically include a clearly labelled image, that contains a link
  • Help users cite their sources
  • …etc
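The second tip above is easy to implement. Here’s a small sketch that builds a per-section share link by combining the page URL with an in-page anchor and Twitter’s web intent endpoint (the URLs in the example are hypothetical):

```python
# Sketch of a per-section share button: deep link to a specific anchor
# on the page, then wrap it in a Twitter web intent URL so the reader
# shares the graph's section rather than the whole page.
from urllib.parse import urlencode

def section_share_link(page_url, anchor, text):
    """Build a tweet intent URL for one labelled section of a page."""
    deep_link = f"{page_url}#{anchor}"
    query = urlencode({"text": text, "url": deep_link})
    return f"https://twitter.com/intent/tweet?{query}"
```

The same pattern works for any platform with a share endpoint; the important part is that the shared URL points at the section, not just the page.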

What do you think? Any tips or suggestions you’d add to this list? With a bit of awareness around how data is shared, we might be able to make small improvements to online discussions.

Adventures in geodata

I spend a lot of my professional life giving people advice. Mostly around how to publish and use open data. In order to make sure I give people the best advice I can, I try and spend a lot of time actually publishing and using open data. A mixture of research and practical work is the best way I’ve found of improving my own open data practice. This is one of the reasons I run Bath: Hacked, continue to work at the Open Data Institute, and like to stay hands-on with data projects.

Amongst my goals for this year was to spend time learning some new skills. For example, I’ve not been involved in running a crowd-sourcing project, but now have that underway with Accessible Bath.

And, while I’ve done some work with geographic data, until recently I hadn’t really spent any time contributing to OpenStreetmap or exploring its ecosystem. But I’ve spent the last couple of months fixing that by immersing myself in its community and tools. In this blog post I wanted to share some of the things I’ve learnt. It’s been really fascinating and, as I’d hoped, given me a new perspective on a number of issues.

Finding my way

To begin with I looked around for some online tutorials. While I knew that OpenStreetmap allowed anyone to contribute, I wasn’t really sure about how I could go about doing that. I had a bunch of questions such as:

  • Did I need a dedicated GPS device, or could I collect data with my phone? (Answer: you can use your phone)
  • Did I need to go out with a clipboard and do a formal survey or are there other ways to contribute? (Answer: you can contribute in a lot of different ways)
  • How do you actually go about editing the map, what tools do you need to use? (Answer: however you feel comfortable)
  • How do I find useful ways to contribute? Has everything been mapped already? (Answer: there’s still a lot to do!)

To help answer my questions I started out by watching some YouTube tutorials. There’s a lot of great training material for the OpenStreetMap ecosystem that covers the basics of mapping, how to add buildings, and some nice bite size videos that introduce best elements of the tool-set.

Other people in the Bath: Hacked community had also been looking at OpenStreetmap, mainly as a potential data resource. So we held a small evening meetup to get together and share what we knew. We had two experienced local mappers who came along and also offered encouragement (thanks Neil and Dave!).

This was a great way to learn the ropes and build up the confidence to wade in. I personally found having some existing members of the OSM community on hand very helpful. Dave has been particularly supportive of reviewing my edits and offering suggested improvements.

Equipping my expedition

There’s an amazing set of tools that support the OSM community. Too many to mention in a single blog post. But here’s a few that I’ve found particularly useful:

  • There are a few different OSM editors, but the new, default iD editor is really easy to use. If you plan on making some edits, focus on learning this tool rather than the older, more complex tools (although they have their uses). It’s really nice to work with, and it has some pleasing little UX elements.
  • osmtracker is an Android (and Windows mobile) application that lets you record GPS traces, upload them to OSM (where they can be viewed in the iD editor), and export them to GPX files for use in other tools. It’s in the app store, so it’s easy to install
  • The OSM wiki is an essential resource. The OSM database itself is basically a wiki: you can add tags to any item on the map. While the online editor does a lot of the work for you, sometimes you need to add some additional metadata and the trick is in knowing which tags to add to which locations. The wiki provides plenty of examples. It also includes some beginner tutorials, but I found the videos to be a good starting point

My first attempt at proper mapping was walking my local high street, recording my progress and using osmtracker to note the name of each shop. I later updated the building outlines, names and details of all the locations.

Into the unknown

That process of collecting data and updating a map lit up the bits of my brain that like exploring and scavenging in video games, so I was immediately keen to do more. That’s when I started contributing to Missing Maps, which I’d heard about from Rares during our meetup.

Missing Maps uses volunteers to trace satellite imagery of locations around the world. This data is then improved locally and used by humanitarian organisations to plan their disaster response activities. So I spent a happy evening finding and tracing Tukuls in Sudan. I thoroughly enjoyed it. It felt like doing an adult colouring book, but where I was painting the world a bit better with each stroke.

As a contributor the tooling is great: simple task allocation, clear guidance and tutorials, and making contributions is straight-forward as you’re using the standard editors. The community was also quick to provide feedback.

I also tripped over MapSwipe. This lets you identify, with a simple click, satellite images that contain buildings. This generates new tasks that go into the Missing Maps pipeline. It also has some light gamification and encouragement to keep you contributing.

Even if you’re not confident about editing the full map, you can quickly make small contributions using this mobile app. You can download tasks for use offline, so it’s also possible to map when you’re on the go. There’s a little micro-tasking app called StreetComplete which takes a similar approach towards making local contributions as easy as possible.

Between MapSwipe, Missing Maps, editing the local OSM map and updating locations on the Wheelmap app, I’m now trying to make a small contribution to OSM every day.

The landscape

I’ve been really blown away by the range of tools and applications that fill out the OSM ecosystem. I plan on doing a lengthier post on some of this at a later date, but I’d be very surprised if this ecosystem wasn’t at least as good as, if not better than, those used internally by national mapping agencies.

The ecosystem doesn’t just consist of hobbyists, there’s a growing commercial community that are contributing to, supporting and helping develop OpenStreetMap. Just look at how clearly Mapbox and Mapillary articulate how their company strategies align with making OSM a continued success.

I was also really surprised to learn that the satellite imagery that all OSM mappers are using has been donated by Microsoft. The Bing aerial imagery is free for use in OSM mapping and has been since 2010. That’s a significant contribution to an open data ecosystem.

If you’re interested in learning more about the OSM community, I’d encourage you to explore the videos from the annual State of the Map conference. There’s some really interesting work presented there including:

  • introductions to new OSM tools and research
  • analysis and discussion about the OSM community itself, the reasons why people contribute and how to encourage them to continue to do so
  • case studies of how OSM data and tooling is used in a variety of projects

New territory

I’ve now done several street surveys of Bath and have refined my workflow. The simplest approach I’ve found is to use osmtracker to record my route and use its facility to take photos of streets and shop fronts. This gives me a quick way to collect information on the go, and I can then use it to update the map later.

Uploading the GPX traces to OSM, putting the photos into the public domain on Flickr, and also publishing them to Mapillary allows me to demonstrate that I’ve actually done the field work, rather than just sneakily copied from Google StreetView, whilst also making the photos available to other people when they’re mapping. Mapillary data can be added to the iD editor, so you can see contributed photos as you work.
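The GPX traces that osmtracker exports are plain XML, so they’re easy to reuse in other tools. A minimal sketch of pulling trackpoint coordinates out of a trace, assuming the standard GPX 1.1 namespace:

```python
# Sketch: extract trackpoint coordinates from a GPX trace, e.g. one
# exported by osmtracker, assuming the standard GPX 1.1 namespace.
import xml.etree.ElementTree as ET

GPX_NS = {"gpx": "http://www.topografix.com/GPX/1/1"}

def trackpoints(gpx_xml):
    """Return (lat, lon) tuples for every trackpoint in a GPX document."""
    root = ET.fromstring(gpx_xml)
    return [(float(pt.get("lat")), float(pt.get("lon")))
            for pt in root.iterfind(".//gpx:trkpt", GPX_NS)]
```

From there it’s a short step to, say, matching photos to positions by timestamp, or plotting a survey route.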

I’ve decided that the surveys are a good way to encourage me to be more active over the summer!

Trip report

This post has just been a taster of what I’ve learnt and explored over the last couple of months. If you’ve ever wondered about contributing to OSM I’d encourage you to have a go. And I’m happy to help you get started! As I’ve outlined here, there are a number of different ways you can contribute either your local knowledge, or pitch in to some humanitarian mapping.

I’m going to be writing more here about some of the ecosystem in future. The exercise has been a great insight into how the OSM community hangs together and I’ve really only scratched the surface.

To briefly summarise, though, I think there are some aspects of OSM that could work well in other contexts, for example:

  • the various approaches taken to ensuring quality and consistency of the map
  • the effort that goes into understanding and managing the community
  • the means by which commercial and volunteer efforts can both contribute to an open resource

If you’re interested in data as infrastructure then OSM is a great project to study in more detail. I think it embodies all of the key principles of a strong open data infrastructure.

Someone also needs to do a proper review of the OSM ecosystem, because all of that “open data impact” people are looking to measure is right there. There’s a bit too much focus on measuring the impact of government data, IMHO, when there’s an existing ecosystem that can provide some great insights.