Do data scientists spend 80% of their time cleaning data? Turns out, no?

It’s hard to read an article about data science or really anything that involves creating something useful from data these days without tripping over this factoid, or some variant of it:

Data scientists spend 80% of their time cleaning data rather than creating insights.


Data scientists only spend 20% of their time creating insights, the rest wrangling data.

It’s frequently used to highlight the need to address a number of issues around data quality, standards and access. Or as a way to sell portals, dashboards and other analytic tools.

The thing is, I think it’s a bullshit statistic.

Not because I think there aren’t improvements to be made in how we access and share data. Far from it. My issue is with how that statistic is framed, and that it’s endlessly parroted without any real insight.

What did the surveys say?

I’ve tried to dig out the underlying survey or source of that factoid, to see if there’s more context. While the figure is widely referenced, it’s rarely accompanied by a link to a survey or results.

Amusingly this IBM data science product marketing page cites this 2018 HBR blog post which cites this 2017 IBM blog which cites this 2016 Crowdflower survey. Why don’t people link to original sources?

In terms of sources of data on how data scientists actually spend their time, I’ve found two ongoing surveys.

So what do these surveys actually say?

  • Crowdflower, 2015: “66.7% said cleaning and organizing data is one of their most time-consuming tasks”.
    • They didn’t report estimates of time spent
  • Crowdflower, 2016: “What data scientists spend the most time doing? Cleaning and organizing data: 60%, Collecting data sets: 19% …”.
    • You only reach 80% of time spent if you also lump in collecting data
  • Crowdflower, 2017: “What activity takes up most of your time? 51% Collecting, labeling, cleaning and organizing data”
    • Less than 80%, and this now includes tasks like labelling of data
  • Figure Eight, 2018: Doesn’t cover this question.
  • Figure Eight, 2019: “Nearly three quarters of technical respondents 73.5% spend 25% or more of their time managing, cleaning, and/or labeling data”
    • That’s pretty far from 80%!
  • Kaggle, 2017: Doesn’t cover this question
  • Kaggle, 2018: “During a typical data science project, what percent of your time is spent engaged in the following tasks? ~11% Gathering data, 15% Cleaning data…”
    • Again, much less than 80%

Only the Crowdflower survey reports anything close to 80%, but you need to lump in actually collecting data as well.

Are there other sources? I’ve not spent too much time on it. But this 2015 bizreport article mentions another survey which suggests “between 50% and 90% of business intelligence (BI) workers’ time is spend prepping data to be analyzed“.

And an August 2014 New York Times article states that: “Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data“. But doesn’t link to the surveys, because newspapers hate links.

It’s worth noting that “Data Scientist” as a job title only really became a thing around 2009, although the concept of data science is older. So there may not be much more to dig up. If you’ve seen some earlier surveys, then let me know.

Is it a useful statistic?

So, looking at the figures, this seems to me to be a bullshit statistic. Data scientists do a whole range of different types of task. If you arbitrarily label some of these as analysis and others not, then you can make them add up to 80%.

But that’s not the only reason why I think it’s a bullshit statistic.

Firstly there’s the implication that cleaning and working with data is somehow not worth the time of a data scientist. It’s “data janitor work”. And “It’s a waste of their skills to be polishing the materials they rely on“. Ugh.

Who, might I ask, is supposed to do this janitorial work?

I would argue that spending time working with data, to transform, explore and understand it better, is absolutely what data scientists should be doing. This is the medium they are working in.

Understand the material better and you’ll get better insights.

Secondly, I think data science use cases and workflows are a poor measure of how well data is published. Data science is frequently about doing bespoke analysis, which means creating and labelling unique datasets. No matter how cleanly formatted or standardised a dataset is, it’s likely to need some work.

A sculptor has different needs than a bricklayer. They both use similar materials. And they both create things of lasting value and worth.

We could measure utility better using other assessments than time spent on bespoke work.

Thirdly, it’s measuring the wrong thing. Actually, maybe some friction around the use of data is a good thing. Especially if it encourages you to spend more time understanding a dataset. Even more so if it in any way puts a brake on dumb uses of machine-learning.

If we want the process of accessing, using and sharing data to be as frictionless as possible in a technical sense, then let’s make sure that is offset by adding friction elsewhere. E.g. to add checkpoints for reviews of ethical impacts. No matter how highly paid a data scientist is, the impacts of poor use of data and AI can be much, much larger.

Don’t tell me that data scientists are spending too much time working with data and not enough time getting insights into production. Tell me that data scientists are increasingly spending 50% of their time considering the ethical and social impacts of their work.

Let’s measure that.

Observations on the web

Eight years ago I was invited to a workshop. The Office for National Statistics were gathering together people from the statistics and linked data communities to talk about publishing statistics on the web.

At the time there was lots of ongoing discussion within and between the two communities around this topic. With a particular emphasis on government statistics.

I was invited along to talk about how publishing linked data could help improve discovery of related datasets.

Others were there to talk about other related projects. There were lots of people there from the SDMX community who were working hard to standardise how statistics can be exchanged between organisations.

There’s a short write-up that mentions the workshop, some key findings and some follow on work.

One general point of agreement was that statistical data points or observations should be part of the web.

Every number, like the current population of Bath & North East Somerset, should have a unique address or URI. So people could just point at it. With their browsers or code.

Last week the ONS launched the beta of a new API that allows you to create links to individual observations.

Seven years on they’ve started delivering on the recommendations of that workshop.

Agreeing that observations should have URIs was easy. The hard work of doing the digital transformation required to actually deliver it has taken much longer.

Proof-of-concept demos have been around for a while. We made one at the ODI.

But the patient, painstaking work to change processes and culture to create sustainable change takes time. And in the tech community we consistently underestimate how long that takes, and how much work is required.

So kudos to Laura, Matt, Andy, Rob, Darren Barnee and the rest of the present and past ONS team for making this happen. I’ve seen glimpses of the hard work they’ve had to put in behind the scenes. You’re doing an amazing and necessary job.

If you’re unsure as to why this is such a great step forward, here’s a user need I learned at that workshop.

Amongst the attendees was a designer who worked on data visualisations. He would spend a great deal of time working with data to get it into the right format and then designing engaging, interactive views of it.

Often there were unusual peaks and troughs in the graphs and charts which needed some explanation. Maybe there had been an external event that impacted the data, or a change in methodology. Or a data quality issue that needed explaining. Or maybe just something interesting that should be highlighted to users.

What he wanted was a way for the statisticians to give him that context, so he could add notes and explanations to the diagrams. He was doing this manually and it was a lot of time and effort.

For decades statisticians have been putting these useful insights into the margins of their work. Because of the limitations of the printed page and spreadsheet tables this useful context has been relegated into footnotes for the reader to find for themselves.

But by putting this data onto the web, at individual URIs, we can deliver those numbers in context. Everything you need to know can be provided with the statistic, along with pointers to other useful information.

Giving observations unique URIs frees statisticians from the tyranny of the document. And might help us all to share and discuss data in a much richer way.

I’m not naive enough to think that linking data can help us address issues with fake news. But it’s hard for me to imagine how being able to more easily work with data on the web isn’t at least part of the solution.

Lunchtime Lecture: “How you (yes, you) can contribute to open data”

The following is a written version of the lunchtime lecture I gave today at the Open Data Institute. I’ll put in a link to the video when it comes online. It’s not a transcript, I’m just writing down what I had planned to say.


I’m going to talk today about some of the projects that first got me excited about data on the web and open data specifically. I’m hopefully going to get you excited about them too. And show some ways in which you can individually get involved in creating some open data.

Open data is not (just) open government data

I’ve been reflecting recently about the shape of the open data community and ecosystem, to try and understand common issues and areas for useful work.

For example, we spend a lot of time focusing on Open Government Data. And so we talk about how open data can drive economic growth, create transparency, and be used to help tackle social issues.

But open data isn’t just government data. It’s a broader church that includes many different communities and organisations who are publishing and using open data for different purposes.

Open data is not (just) organisational data

More recently, as a community, we’ve focused some of our activism on encouraging commercial organisations to not just use open data (which many have been doing for years), but also to publish open data.

And so we talk about how open data can be supported by different business models and the need for organisational change to create more open cultures. And we collect evidence of impact to encourage more organisations to also become more open.

But open data isn’t just about data from organisations. Open data can be created and published by individuals and communities for their own needs and purposes.

Open data can (also) be a creative activity

Open data can also be a creative activity. A means for communities to collaborate around sharing what they know about a topic that is important or meaningful to them. Simply because they want to do it. I think sometimes we overlook these projects in the drive to encourage governments and other organisations to publish open data.

So I’m going to talk through eight (you said six in the talk, idiot! – Ed) different example projects. Some you will have definitely heard about before, but I suspect there will be a few that you haven’t. In most cases the primary goals of these projects are to create an openly licensed dataset. So when you contribute to the project, you’re directly helping to create more open data.

Of course, there are other ways in which we each contribute to open data. But these are often indirect contributions. For example where our personal data that is held in various services is aggregated, anonymised and openly published. But today I want to focus on more direct contributions.

For each of the examples I’ve collected a few figures that indicate the date the project started, the number of contributors, and an indication of the size of the dataset. Hopefully this will help paint a picture of the level of effort that is already going into maintaining these resources. (Psst, see the slides for the figures – Ed)


The first example is Wikipedia. Everyone knows that anyone can edit Wikipedia. But you might not be aware that Wikipedia can be turned into structured data and used in applications. There are lots of projects that do it. E.g. dbpedia, which brings Wikipedia into the web of data.

The bits that are turned into structured data are the “infoboxes” that give you the facts and figures about the person (for example) that you’re reading about. So if you add to Wikipedia, and specifically add to the infoboxes, then you’re adding to an openly licensed dataset.
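To make that a bit more concrete, here’s a toy sketch of how an infobox in Wikipedia’s wikitext markup can be turned into structured key/value data. Projects like dbpedia use far more robust parsers; this example only handles simple one-line fields, and the infobox text below is a made-up illustration rather than a real Wikipedia page.

```python
import re

# A made-up infobox in simplified wikitext, for illustration only.
WIKITEXT = """{{Infobox writer
| name       = Gordon Boshell
| birth_date = 1908
| occupation = Author
}}"""

def parse_infobox(text):
    """Extract simple '| key = value' fields from an infobox template."""
    fields = {}
    for match in re.finditer(r"^\|\s*(\w+)\s*=\s*(.+?)\s*$", text, re.MULTILINE):
        fields[match.group(1)] = match.group(2)
    return fields
```

Running `parse_infobox(WIKITEXT)` yields a plain dictionary of facts that an application could index or query, which is essentially what the infobox-extraction projects do at scale.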

The most obvious example of where this data is used is in Google search results. The infoboxes you see on search results whenever you google for a person, place or thing are partly powered by Wikipedia data.

A few years ago I added a wikipedia page for Gordon Boshell, the author of some children’s books I loved as a kid. There wasn’t a great deal of information about him on the web, so I pulled whatever I could find together and created a page for him. Now when anyone searches for Gordon Boshell they can see some information about him right on Google. And they now link out to the books that he wrote. It’s nice to think that I’ve helped raise his profile.

There’s also a related project from the Wikimedia Foundation called Wikidata. Again, anyone can edit it, but it’s a database of facts and figures rather than an encyclopedia.


The second example is OpenStreetMap. You’ll definitely have already heard about its goal to create a crowd-sourced map of the world. OpenStreetMap is fascinating because it’s grown this incredible ecosystem of tools and projects that make it easier to contribute to the database.

I’ve recently been getting involved with contributing to OpenStreetMap. My initial impression was that I was probably going to have to get a commercial GPS and go out and do complicated surveying. But it’s not like that at all. It’s really easy to add points to the map, and to use their tools to trace buildings from satellite imagery. They provide great tutorials to help you get started.

It’s surprisingly therapeutic. I’ve spent a few evenings drinking a couple of beers and tracing buildings. It’s a bit like an adult colouring book, except you’re creating a better map of the world. Neat!

There are a variety of other tools that let you contribute to OpenStreetMap. For example Wheelmap allows you to add wheelchair accessibility ratings to locations on the map. We’ve been using this in the AccessibleBath project to help crowd-source data about wheelchair accessibility in Bath. One afternoon we got a group of around 25 volunteers together for a couple of hours and mapped 30% of the city centre.

There’s a lot of humanitarian mapping that happens using OpenStreetMap. If there’s been a disaster or a disease outbreak then aid workers often need better maps to help reach the local population and target their efforts. Missing Maps lets you take part in that. They have a really nice workflow that lets you contribute towards improving the map by tracing satellite imagery.

There’s a related project called MapSwipe. It’s a mobile application that presents you with a grid of satellite images. All you have to do is tap the tiles which contain a building and then swipe left. Behind the scenes this data is used to direct Missing Maps volunteers towards the areas where more detailed mapping would be most useful. This focuses contributors’ attention where it’s best needed and so is really respectful of people’s time.

MapSwipe can also be used offline. So you can download a work package to do when you’re on your daily commute. Easy!


You’ve probably also heard of Zooniverse, which is my third example. It’s a platform for citizen science projects. That just means using crowd-sourcing to create scientific datasets.

Their most famous project is probably GalaxyZoo which asked people to help classify objects in astronomical imagery. But there are many other projects. If you’re interested in biology then perhaps you’d like to help catalogue specimens held in the archives of the Natural History Museum?

Or there’s Old Weather, which I might get involved with. In that project you can help to build a picture of our historical climate by transcribing the weather reports that whaling ship captains wrote in their logs. By collecting that information we can build a dataset that tells us more about our climate.

I think it’s a really innovative way to use historical documents.


This is my fourth and favourite example. MusicBrainz is a database of music metadata: information about artists, albums, and tracks. It was created in direct response to commercial music databases that were asking people to contribute to their dataset, but then were taking all of the profits and not returning any value to the community. MusicBrainz created a free, open alternative.

I think MusicBrainz is the first open dataset I got involved with. I wrote a client library to help developers use the data. (14 years ago, and you’re still talking about it – Ed)

MusicBrainz has also grown a commercial ecosystem around it, which has helped it be sustainable. There are a number of projects that use the dataset, including Spotify. And it’s been powering the BBC Music website for about ten years.


My fifth example, Discogs, is also a music dataset. But it’s a dataset about vinyl releases. So it focuses more on the releases, labels, etc. Discogs is a little different because it started as, and still is, a commercial service. At its core it’s a marketplace for record collectors. But to power that marketplace you need a dataset of vinyl releases. So they created tools to help the community build it. And, over time, it’s become progressively more open.

Today all of the data is in the public domain.


My sixth example is OpenPlaques. It’s a database of the commemorative plaques that you can see dotted around on buildings and streets. The plaques mark that an important event happened in that building, or that someone famous was born or lived there. Volunteers take photos of the plaques and share them with the service, along with the text and names of anyone who might be mentioned in the plaque.

It provides a really interesting way to explore the historical information in the context of cities and buildings. All of the information is linked to Wikipedia so you can find out more information.


My seventh example is Rebrickable, which you’re unlikely to have heard about. I’m cheating a little here as it’s a service and not strictly a dataset. But it’s Lego, so I had to include it.

Rebrickable has a big database of all the official Lego sets and the parts they contain. If you’re a fan of Lego (they’re called AFOLs – Ed) and you design and create your own custom Lego models (they’re known as MOCs – Ed), then you can upload the design and instructions to the service in machine-readable LEGO CAD formats.

Rebrickable exposes all of the information via an API under a liberal licence. So people can build useful tools. For example using the service you can find out which other official and custom sets you can build with bricks you already own.

Grand Comics Database

My eighth and final example is the Grand Comics Database. It’s also the oldest project as it was started in 1994. The original creators started with desktop tools before bringing it to the web.

It’s a big database of 1.3m comics. It contains everything from The Dandy and The Beano through to Marvel and DC releases. It’s not just data on the comics, but also story arcs, artists, authors, etc. If you love comics you’ll love GCD. I checked, and this week’s 2000AD (published 2 days ago – Ed) is in there already.

So those are my examples of places where you could contribute to open data.

Open data is an enabler

The interesting thing about them all is that open data is an enabler. Open data isn’t creating economic growth, or being used as a business model. Open licensing is being applied as a tool.

It creates a level playing field that means that everyone who contributes has an equal stake in the results. If you and I both contribute then we can both use the end result for any purpose. A commercial organisation is not extracting that value from us.

Open licensing can help to encourage people to share what they know, which is the reason the web exists.

Working with data

The projects are also great examples of ways of working with data on the web. They’re all highly distributed projects, accepting submissions from people internationally who will have very different skill sets and experience. That creates a challenge that can only be dealt with by having good collaboration tools and by having really strong community engagement.

Understanding how and why people contribute to your open database is important. Because often those reasons will change over time. When OpenStreetMap had just started, contributors had the thrill of filling in a blank map with data about their local area. But now contributions are different. It’s more about maintaining data and adding depth.

Collaborative maintenance

In the open data community we often talk about making things open to make them better. It’s the tenth GDS design principle. And making data open does make it better, in the sense that more people can use it. And perhaps more eyes can help spot flaws.

But if you really want to let people help make something better, then you need to put your data into a collaborative environment. Then data can get better at the pace of the community and not your ability to accept feedback.

It’s not work if you love it

Hopefully the examples give you an indication of the size of these communities and how much has been created. It struck me that many of them have been around since the early 2000s. I’ve not really found any good recent examples (Maybe people can suggest some – Ed). I wonder why that is?

Most of the examples were born around the Web 2.0 era (Mate. That phrase dates you. – Ed) when we were all excitedly contributing different types of content to different services. Bookmarks and photos and playlists. But now we mostly share things on social media. It feels like we’ve lost something. So it’s worth revisiting these services to see that they still exist and that we can still contribute.

While these fan communities are quietly hard at work, maybe we in the open data community can do more to support them?

There’s a lot of examples of “open” datasets that I didn’t use because they’re not actually open. The licenses are restrictive. Or the community has decided not to think about it. Perhaps we can help them understand why being a bit more open might be better?

There are also examples of openly licensed content that could be turned into more data. Take Wikia for example. It contains 360,000 wikis all with openly licensed content. They get 190m views a month and the system contains 43 million pages. About the same size as the English version of Wikipedia is currently. They’re all full of infoboxes that are crying out to be turned into structured data.

I think it’d be great to make all this fan-produced data a proper part of the open data commons, sitting alongside the government and organisational datasets that are being published.

Thank you (yes, you!)

That’s the end of my talk. I hope I’ve piqued your interest in looking at one or more of these projects in more detail. Hopefully there’s a project that will help you express your inner data geek.

Photo Attributions

Lego Spaceman, Edwin Andrade, Jamie Street, Olu Elet, Aaron Burden, Volkan Olmez, Alvaro, Jordan Whitfield, Anthony DELANOIX


Exploring open data quality

Originally published on the Open Data Institute blog. Original URL:

There are a number of initiatives at the moment exploring the idea of data quality, with particular reference to describing, measuring and improving the quality of open data.

For example, the W3C Data on the Web Best Practices Working Group are producing a vocabulary for publishing and describing data quality metrics. There is also related work capturing best practices for sharing public sector data.

Various open data projects and communities are working to improve the quality of their open data and have started to share guidance. For example, one community has recently shared a data quality guide for tabular data. And Mark Frank and Johanna Walker at Southampton University have recently published a paper exploring a user-centred view of data quality.

To contribute to this ongoing discussion, we recently undertook a small project with Experian to explore data quality in some open datasets.

The project had several goals:

  • to identify the types of data quality issues we might find in some existing open datasets
  • to suggest some common data quality checks that both publishers and users could apply to data
  • to explore the idea of an ‘open data quality index’, building on existing work on Open Data Certificates and benchmarking open data

For the initial exploratory project we’ve used the Land Registry Price Paid data, the Companies House register and the NHS Choices GP Practices and Surgeries.

We worked with the data quality team at Experian to run the datasets through their Pandora data quality tool. Pandora is a data-profiling tool designed to support exploration of datasets, highlight data quality issues and enrich data against other sources. For this project we used Pandora to generate some quality metrics for each of the datasets we reviewed.

You can recreate a number of the checks we carried out using the free version of the tool.

The outputs have been published under an open license and we’ve written a short report on the findings.

Our key insights are as follows:

  • There is still scope to improve how well datasets are documented and published
  • Even in large, well-used and maintained datasets there are a number of basic data quality checks that could be applied to improve data quality
  • Defining and using standard schemas for datasets would benefit both data publishers and users
  • Being able to quickly summarise and explore a dataset offers a powerful way to understand its structure and highlight potential data quality issues
  • The use of standard, open registers will be a significant boost to the quality of many open datasets

If you have any feedback on the findings or suggestions for how to build on the work further, then please get in touch with our labs team.

Four things you should know about open data quality

Originally published on the Open Data Institute blog. Original URL:

1. A quality dataset is a well-published dataset

First impressions are everything. The efforts made to publish a dataset will guide a user’s experience in finding, accessing and using it. No matter how good the contents of your dataset, if it is not clearly documented, well-structured and easily accessible, then it won’t get used.

Open data certificates are a mark of quality and trust for open data. They measure the legal, technical, practical and social aspects of publishing data. Creating and publishing a certificate will help a publisher build confidence in their data. Open data certificates complement the five star scheme, that assesses how well data is integrated with the web.

2. A dataset can contain a variety of problems

Data quality also relates to the contents of a dataset. Data errors usually occur when the data was originally collected. But the problems may only become apparent once a user begins working with the data.

There are a number of different types of data quality problem. The following list isn’t exhaustive but includes some of the most common:

  • The dataset isn’t valid when compared to its schema, for example there are missing columns, or they are in the wrong order
  • The dataset contains invalid or incorrect values, for example numbers that are not within their expected range, text where there should be numbers, spelling mistakes or invalid phone numbers
  • The dataset has missing data from some fields or the dataset doesn’t include all of the available data – some addresses in a dataset might be missing their postcode, for example
  • The data may have precision problems — these may be due to limits in the accuracy of the sensors or other devices (such as GPS devices) that were used to record the data, or they may be due to simple rounding errors introduced during analysis
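Several of these checks can be expressed as a few lines of code. The sketch below assumes a hypothetical three-column schema for a dataset of GP surgeries; the column names, postcode pattern and value range are made up for illustration, not drawn from any real schema.

```python
import re

# Hypothetical schema: expected column names, in order.
SCHEMA_COLUMNS = ["name", "postcode", "patients"]
# Simplified UK postcode pattern, for illustration only.
UK_POSTCODE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$")

def check_row(header, row):
    """Return a list of data quality problems found in one row."""
    problems = []
    if header != SCHEMA_COLUMNS:                      # invalid against schema
        problems.append("columns missing or in wrong order")
    record = dict(zip(header, row))
    if not record.get("postcode"):                    # missing data
        problems.append("missing postcode")
    elif not UK_POSTCODE.match(record["postcode"]):   # invalid value
        problems.append("invalid postcode")
    try:
        patients = int(record.get("patients", ""))
        if not (0 < patients < 100_000):              # out of expected range
            problems.append("patient count out of range")
    except ValueError:                                # text where a number should be
        problems.append("patient count is not a number")
    return problems
```

For example, `check_row(["name", "postcode", "patients"], ["The Surgery", "BA1 1AA", "2500"])` finds no problems, while a row with "lots" in the patients column is flagged as text where there should be a number.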

3. There are several ways to fix data errors

Some types of error are more easily discovered and fixed than others. Tools like CSVLint can use a schema to validate a dataset, applying rules to confirm that data values are valid. But sometimes extra steps are needed to confirm whether a value is correct.

For example, an email address (for contacting a company, for example) might be formatted correctly but it might contain a spelling mistake that means it is unusable. There are a variety of ways to improve confidence that an email address is valid, but you can only reliably confirm that an email address is both valid and actually in use by sending an email and asking a user to confirm receipt.
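To illustrate the gap between format validity and correctness, here’s a minimal sketch. The pattern is a deliberate approximation for illustration, not a full RFC 5322 email validator.

```python
import re

# A format check can only show an address *looks* valid; it cannot
# prove the mailbox exists or is spelled correctly.
EMAIL_FORMAT = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def looks_like_email(value):
    """True if the value is shaped like an email address."""
    return bool(EMAIL_FORMAT.match(value))
```

Note that `looks_like_email("sales@exmaple.com")` returns True even though the domain is misspelled; only sending a confirmation email can establish that the address is actually in use.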

Another way to help identify data quality issues is to check data against a register that provides a master list of legal values. For example, country names might be validated against a standard register of countries. Open registers are an important part of the data ecosystem.
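A register check in this spirit is simple to sketch. The codes below are a small illustrative sample in the style of the ISO 3166-1 alpha-3 country register, not the full list.

```python
# A tiny sample register of country codes, for illustration only.
COUNTRY_REGISTER = {"GBR", "FRA", "DEU", "IRL"}

def invalid_countries(values):
    """Return the values that do not appear in the register."""
    return [v for v in values if v not in COUNTRY_REGISTER]
```

So `invalid_countries(["GBR", "UK", "FRA"])` flags `"UK"`, which looks plausible but is not a code in the register.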

Other types of errors are much harder to fix. Company names and addresses may become invalid or incorrect over time. Publishing data openly can allow others to identify and contribute fixes. Making things open can help make them better.

4. Sometimes ‘good quality’ depends on your needs

One way to help improve data quality is to generate quality metrics for a dataset. Metrics can help summarise the kinds of issues found in a dataset. You might choose to count the numbers of valid and invalid values in specific columns. Run regularly, metrics can identify if the quality of a dataset is changing over time.
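As a sketch of the idea, a metric generator might simply count valid, invalid and missing values in a column. Run against each new release of a dataset, the same function would show whether quality is drifting over time. The validation rule passed in here is purely illustrative.

```python
def column_metrics(rows, column, is_valid):
    """Summarise one column: counts of valid, invalid and missing values."""
    metrics = {"valid": 0, "invalid": 0, "missing": 0}
    for row in rows:
        value = row.get(column)
        if value in (None, ""):
            metrics["missing"] += 1
        elif is_valid(value):
            metrics["valid"] += 1
        else:
            metrics["invalid"] += 1
    return metrics
```

Comparing the share of valid values between releases gives a simple, repeatable signal: a falling "valid" count suggests a quality regression worth investigating.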

However, it’s hard to make an objective assessment about whether a dataset is of a good quality. Sometimes quality is in the eye of the beholder. For example:

  • GPS accuracy in a dataset might not be important if you only want to do a simple geographic visualisation. But if you’re involved in a boundary dispute then precision may be vital.
  • Inaccurate readings from a broken sensor might be an annoyance for the majority of users who might want them filtered out of a raw dataset. But if you are interested in gathering analytics on sensor failures then seeing the errors is important.

Fixing all data quality issues in a dataset can involve significant investment, sometimes with diminishing returns. Data publishers and users need to decide how good is good enough, based on their individual needs and resources.

However, by opening data and letting others contribute fixes, we can spread the cost of maintaining data. Making things open can help make them better, remember?


For the purposes of having something to point to in future, here’s a list of different meanings of “open” that I’ve encountered.

XYZ is “open” because:

  • It’s on the web
  • It’s free to use
  • It’s published under an open licence
  • It’s published under a custom licence, which limits some types of use (usually commercial, often everything except personal)
  • It’s published under an open licence, but we’ve not checked too deeply in whether we can do that
  • It’s free to use, so long as you do so within our app or application
  • There’s a restricted/limited access free version
  • There’s documentation on how it works
  • It was (or is) being made in public, with equal participation by anyone
  • It was (or is) being made in public, led by a consortium or group that has limitations on membership (even if just fees)
  • It was (or is) being made privately, but the results are then being made available publicly for you to use

I gather that at IODC "open washing" was a frequently referenced topic. It's not surprising given the variety of ways in which the word "open" is used, many of which are not open at all. And the list I've given above is hardly comprehensive. This is why the Open Definition is such an important reference, even if it may have its faults.

Depending on your needs, any or all of those definitions might be fine. But “open” for you, may not be “open” for everyone. So let’s not lose sight of the goal and keep checking that we’re using that word correctly.

And, importantly, if we're really making things open to make them better, then we might need to be more open to collaboration. Open isn't entirely about licensing either.


Building best practices for public sector data

Originally published on the Open Data Institute blog. Original URL:

At the ODI we’re big fans of capturing best practices and simple design patterns to help guide people towards the most effective ways to publish data.

By breaking down complex technical and organisational challenges into smaller steps, we can identify common problems across sectors and begin cataloguing common solutions. This is the common thread that ties together our research and technical projects, and it's this experience that we bring to our advisory projects.

We’ve been contributing to the Share-PSI project, which has been documenting a range of best practices that relate to publishing public-sector data. Some of the best practices address specific technical questions relating to the web of data, and these form part of the W3C’s ‘Data on the web best practices’ guidance.

But some of the best practices address higher-level issues, such as the importance of creating an open data strategy and a release plan to support it. Or the creation of change by supporting startups and enabling ecosystems. Each best practice sets out the underlying challenge, a recommended solution, and provides pointers to further reading.

Our guidance, white papers and reports help to add depth to these best practices by linking them to evidence of their successful adoption, both here in the UK and internationally. This helps to ground the best practices in concrete guidance that draws on the experience of the wider community.

The best practices also provide a useful way to explore the elements of existing open data programmes.

For example, it’s possible to see how a large public-sector initiative like #OpenDefrahas been successful through its adoption of so many of these discrete best practices. These include the creation of a common open data strategy across its network, use of a release process that allowed for more rapid publication of data while managing risks, benchmarking practice using a maturity model, moving to an open by default licensing model, and its efforts to engage users and stimulate the wider ecosystem.

The best practices are a useful resource for anyone leading or contributing to an open data initiative. We’re looking forward to adding further to this body of evidence.

We’ve also begun to think about capturing common patterns that illustrate how open and shared data can be successfully used to deliver specific types of government policies. We are looking for feedback on this draft catalogue of strategic government interventions – you can either add comments in the document or email

How to open your data in six easy steps

Originally published on the Open Data Institute blog. Original URL:

1. Scope out the job at hand

Before taking the plunge and jumping straight into publishing, there are a few things to think through first. Take time to consider what data you’re going to release, what it contains and what the business case is for releasing it in the first place.

Consider what licence you're going to put on the data for others to use. There's a selection to choose from, depending on how you want others to use it; see our guidance here.

Here are some other key things to consider at this stage:

  • Where will it be published?
  • Will I need documentation around it?
  • What level of support is needed?
  • How frequently will I release the data?

2. Get prepared

Your data is only really useful to others if it’s well structured and has clear metadata (or a data description) to give it context and explain what it’s about and where it comes from.

Start your prep with a technical review using sample data, and identify suitable formats for release and the level of detail and metadata required. Also consider whether it’ll be most useful to the user as an API or a download. Data can be more useful when linked to other datasets, so keep an eye out for opportunities.

Consider your capabilities in-house and whether you need any training in order to release the data, whether technical or around certification. Some ODI courses can help with this.

Finally, think about what metadata you’re going to add to your data to describe what it is or how to use it.

3. Test your data

Before you release your data, you might want to think about doing a preview with some of your potential users to get some detailed feedback. This isn't necessarily required for smaller datasets, but for larger releases this user-testing can be really useful.

Don’t forget to get an Open Data Certificate to verify that your data is being published properly.

4. Release your data

Now for the exciting bit: releasing your data, the metadata and the documentation to go with it.

The key thing here is to release your data where your users will be. Otherwise, what’s the point? Where you should release it depends on who you are, but in general you should publish it on your own website, ensuring it’s also listed on relevant portals. For example, public sector organisations should add their data to Some sectors have their own portals – in science it’s the norm to publish in an institutional repository or a scientific data repository.

Basically, do your research into how your community shares data, and make sure it’s located in a place you have control over or where you’re confident the data can be consistently available.

When you apply for an Open Data Certificate, we'll ask for evidence that the dataset is listed in one or more portals, to ensure it's accessible.

5. Get engagement and promotion

It’s easy to relax after spending so much time and effort in preparing and releasing your dataset, but don’t just ‘fire and forget’. Make sure you have follow-up activities to let people know the data exists and be responsive to questions they might have. You can engage people in multiple ways (depending on your target audience), for example through blogs or social media. Encourage users to tell you how they’re using the data, so you can promote success stories around it too.

6. Reflect and improve

Now your dataset is out there in the big wide world, take some time to reflect on it. Listen to feedback, and decide what changes you could make or what you'd do differently next time.

If you want to measure your improvement, consider taking a maturity assessment using our Open Data Pathway tool.


Today I am 101100.

That’s XLIV in Roman.

44 is also the square root of 1936. 1936 was a leap year starting on a Wednesday.

The Year 44 was also a leap year starting on a Wednesday.

It was also known as the Year of the Consulship of Crispus and Taurus. Which is another coincidence because I like crisps and I’m also a Taurus.

And while we’re on Wikipedia, we can use the API to find out that page id 101100 is Sydney Harbour National Park which opened when I was 3.

Wolfram Alpha reminds me that 44 is the ASCII code for a comma.

Whichever way you look at it #101100 is a disappointing colour.

But if we use the random art generator then we can make a more colourful image from the number. But actually the image with that identifier is more interesting. Glitchy!

The binary number is also a car multimedia entertainment system. But £200 feels a bit steep, even if it is my birthday.

A 12 year old boy once bid £101,100 for a flooded Seat Toledo on EBay. Because reasons.

101100, or tubulin tyrosine ligase-like family, member 3 to its friends, also seems to do important things for mice.

I didn’t really enjoy Jamendo album 101100, the Jamez Anthony story.

Care of Cell Block 101100 was a bit better in my opinion. But only a bit.

Discogs release 101100 is The Sun’s Running Out by Perfume Tree. Of which the most notable thing is that track six includes a sample from a Dr Who episode.

I’m not really sure what the tag 101100 on flickr means.

IMDB entry 101100 is “Flesh ‘n’ Blood

The Board Game Geek identifier 101100 is for an XBox 360 version of 1 vs 100. That’s not even a board game!

Whereas Drive Thru RPG catalogue product 101100 as Battlemage. Which sounds much more interesting.

If I search for “101100 coordinates” on google, then it tells me that it’s somewhere in China. I should probably know why.

There are 26 results for 101100 on But none on Which explains why the UK is #1 in the world for open data.

But HD 101100 is also a star.

And a minor planet discovered on 14th September 1998

CAS 101-10-0 is 2-(3-Chlorophenoxy)propionic acid. I think its a herbicide. Anyway, this is what it looks like.

It’s also a marine worm.

And an insect.

In the database of useful biological numbers, we discover that entry 101100 is the maximal emission wavelength for Venus fluorophore. Which is, of course, 528 nm.

I think the main thing I’ve learnt in my 44 years is that the web is an amazing place.

Data marketplaces, we hardly knew ye

I’m on a panel at the ODI lunchtime lecture this week, where I’m hoping to help answer the question of “what does a good data market look like?“.

As many of you know I was previously the product manager/lead for a data marketplace called Kasabi. That meant that I spent quite a bit of time exploring options for building both free and commercial services around data, business models for data supply, etc. At the time data marketplaces seemed to be “a thing”. See also this piece from 2011. There were suddenly a number of data marketplaces springing up from a variety of organisations.

The idea of data marketplaces, perhaps as an evolution of current data portals is one that seems to be resurfacing. I’ve already written about why I think “data marketplace” isn’t the right framing for encouraging more collaboration around data, particularly in cities.

I’m not going to rehash that here, but, as preparation for Friday, I thought I’d take a look at how the various data marketplaces are fairing. Here’s a quick run down.

If you think I’ve misrepresented anything then leave a comment and I’ll correct the post.

  • Data Market was originally focused on delivering data to businesses and offered sophisticated charting and APIs. It drew largely on national and international statistics. Great platform and a really nice team (disclaimer: have previously done some freelance work with them). They were acquired by Qlik. My understanding is that this rounded out their product offering by having an off-the-shelf platform for visualising on-demand data. This is no longer what I'd consider a marketplace, more a curated set of data feeds.
  • Azure Data Marketplace is still around but seems to be largely offering only Microsoft’s own data and APIs. Seems to be in the middle of a revamp and refocus on cloud apps and more general APIs rather than a marketplace. In its early stages Microsoft explored iterating this into an enterprise data portal as well as deeper integration with some of their products like SQL Server.
  • Kasabi. Shutdown. Sob.
  • BuzzData. Shutdown.
  • FreeBase. Acquired by Google, continued as a free service for a while and shutdown in 2015. The data is now part of Wikidata.
  • Infochimps. Originally a data marketplace, the team spent a lot of time building out a data processing pipeline using Big Data technologies. They were acquired for this technology.
  • Timetric started out as a data platform focusing on statistical and time series data, now seems to have evolved in a slightly different direction.
  • Factual continue to focus on location data. I was always intrigued by their approach which (at least originally) included businesses pooling their data together to create a richer resource, which was then used to drive additional revenue and sales. While there were suggestions they may expand into other sectors, that hasn’t happened.
  • Gnip and Datasift are still around, both still focusing on services and data analysis around social media data.

There are others that could be included in the list. There’s one interesting new contender that shares a lot of similarity with some things that we were building in Kasabi, but they’re currently in stealth mode so I won’t share more.

I also don’t include Amazon Public Datasets or Google Public Data as they’re not really marketplaces. They’re collections of large datasets that Amazon or Google are providing as an enabler or encouragement to use some of their cloud services. Difficult to demonstrate big data analysis unless there’s a nice collection of demo datasets.

So, really, only the Microsoft offering is still around in its original form of a data marketplace, and it's clear that the emphasis is shifting elsewhere. The other services that are still around are all focused on a specific vertical or business sector rather than offering a general-purpose ("horizontal") platform for the supply and selling of data.

This matches what we can see elsewhere: there are lots of businesses that have been selling data for some time. While the original emphasis was on the data, the move now is to sell services on top of it. But they're all focused on a specific sector or vertical. I think cities are neither.