Consulting Spreadsheet Detective, Season 1

I was very pleased to announce my new TV series today, loosely based on real events. More details here in the official press release.

FOR IMMEDIATE RELEASE

Coming to all major streaming services in 2021 will be the exciting new series: “Turning the Tables”.

Exploring the murky corporate world of poorly formatted spreadsheets and nefarious macros, each episode of this new series explores a unique mystery.

When the cells lie empty, who can help the CSV:PI team pivot their investigation?

When things don’t add up, who can you turn to but an experienced solver?

Who else but Leigh Dodds, Consulting Spreadsheet Detective?

This smart, exciting and funny new show throws deductive reasoner Dodds into the mix with Detectives Rose Cortana and Colm Bing, part of the crack new CSV:PI squad.

Rose: the gifted hacker. Quick to fire up an IDE, but slow to validate new friends.

Colm: the user researcher. Strong on empathy but with an enigmatic past that hints at time in the cells.

What can we expect from Season 1?

Episode 1: #VALUE!

In his first case, Dodds has to demonstrate his worth to a skeptical Rose and Colm by fixing a corrupt formula in a startup valuation.

Episode 2: #NAME?

A personal data breach leaves the team in a race against time to protect the innocent. A mysterious informant known as VLOOKUP leaves Dodds a note.

Episode 3: #REF!

A light-hearted episode where Dodds is called in to resolve a mishap with a 5-a-side football team matchmaking spreadsheet. Does he stay between the lines?

Episode 4: #NUM!

A misparsed gene name leads a researcher into recommending the wrong vaccine. It’s up to Dodds to fix the formula.

Episode 5: #NULL!

Sometimes it’s not the spreadsheet that’s broken. Rose and Colm have to educate a researcher on the issue of data bias, while Dodds follows up references to the mysterious Macro corporation.

Episode 6: #DIV/0!

Chasing down an internationalisation issue, Dodds, Rose and Colm race around the globe following a trail of error messages. As Dodds gets unexpectedly separated from the CSV:PI team, Rose and Colm unmask the hidden cell containing the mysterious VLOOKUP.

In addition to the six episodes in season one, a special feature length episode will air on National Spreadsheet Day 2021:

Feature Episode: #####

Colm’s past resurfaces. Can he grow enough to let the team see the problem, and help him validate his place in the team?

Having previously only anchored documentaries, like “Around the World with 80,000 Apps” and “Great Data Journeys”, taking on the eponymous role will be Dodds’ first foray into fiction. We’re sure he’ll have enough pizazz to wow even the harshest critics.

“Turning the Tables” will feature music composed by Dan Barrett.

Tip for improving standards documentation

I love a good standard. I’ve written about them a lot here.

As it’s #WorldStandardsDay, I thought I’d write a quick post to share something that I’ve learned from leading and supporting some standards work.

I’ve already shared this with a number of people who have asked for advice on standards work, and in some recent user research interviews I’ve participated in. So it makes sense to write it down.

In the ODIHQ standards guide, we explained that at the end of your initial activity to develop a standard, you should plan to produce a range of outputs. These include a variety of tools and guidance that help people use the standard. You will need much more than just a technical specification.

To plan for the different types of documentation that you may need, I recommend applying this “Grand Unified Theory of Documentation”.

That framework highlights that four different types of documentation are intended to be used by different audiences to address different needs. The content designers and writers out there reading this will be rolling their eyes at how obvious this insight is.

Here’s how I’ve been trying to apply it to standards documentation:

Reference

This is your primary technical specification. It’ll have all the detail about the standard, the background concepts, the conformance criteria, etc.

It’s the document of record that captures all of the hard work you’ve invested in building consensus around the standard. It fills a valuable role as the document you can point back to when you need to clarify or confirm what was agreed.

But, unless it’s a very simple standard, it’s going to have a limited audience. A developer looking to implement a conformant tool, API or library may need to read and digest all of the detail. But most people want something else.

Put the effort into ensuring it’s clear, precise and well-structured. But plan to also produce three additional categories of documentation.

Explainers

Many people just want an overview of what it is designed to do. What value will it provide? What use cases was it designed to support? Why was it developed? Who is developing it?

These are higher-level introductory questions. The type of questions that business stakeholders want answered before signing off on implementing a standard, so that it goes onto a product roadmap.

Explainers also provide useful background for a developer ahead of taking a deeper dive. If there are key concepts that are important to understanding the design and implementation of a standard, then write an explainer.

Tutorials

A simple, end-to-end description of how to apply the standard. E.g. how to publish a dataset that conforms to the standard, or export data from an existing system.

A tutorial will walk you through using a specific set of tools, frameworks or programming languages. The end result is a basic implementation of the standard, or a simple dataset that passes some basic validation checks. A tutorial won’t cover all of the detail; it’s just enough to get you started.

You may need several tutorials to support different types of users. Or different languages and frameworks.

If you’ve produced a tool, like a validator or a template spreadsheet to support data publication, you’ll probably need a tutorial for each of them unless they are very simple to use.

Tutorials are gold for a developer who has been told: “please implement this standard, but you only have 2 days to do it”.

How-Tos

Short, task-oriented documentation focused on helping someone apply the standard. E.g. “How to produce a CSV file from Excel”, “Importing GeoJSON data in QGIS”, “Describing a bus stop”. Make them short and digestible.

How-Tos can help developers build from a tutorial to a more complete implementation. Or help a non-technical user quickly apply a standard or benefit from it.

You’ll probably end up with lots of these over time. Drive their creation from the types of questions or support requests you’re getting. Been asked how to do something three times? Write a How-To.

There’s lots more that can be said about standards documentation. For example, you could add Case Studies to this list. And it’s important to think about whether written documentation is the right format. Maybe your Explainers and How-Tos can be videos?

But I’ve found the framework to be a useful planning tool. Have a look at the documentation for more tips.

Producing extra documentation to support the launch of a standard, and then investing in improving and expanding it over time will always be time well-spent.

Lunchtime Lecture: “How you (yes, you) can contribute to open data”

The following is a written version of the lunchtime lecture I gave today at the Open Data Institute. I’ll put in a link to the video when it comes online. It’s not a transcript; I’m just writing down what I had planned to say.

Hello!

I’m going to talk today about some of the projects that first got me excited about data on the web and open data specifically. I’m hopefully going to get you excited about them too. And show some ways in which you can individually get involved in creating some open data.

Open data is not (just) open government data

I’ve been reflecting recently about the shape of the open data community and ecosystem, to try and understand common issues and areas for useful work.

For example, we spend a lot of time focusing on Open Government Data. And so we talk about how open data can drive economic growth, create transparency, and be used to help tackle social issues.

But open data isn’t just government data. It’s a broader church that includes many different communities and organisations who are publishing and using open data for different purposes.

Open data is not (just) organisational data

More recently, as a community, we’ve focused some of our activism on encouraging commercial organisations to not just use open data (which many have been doing for years), but also to publish open data.

And so we talk about how open data can be supported by different business models and the need for organisational change to create more open cultures. And we collect evidence of impact to encourage more organisations to also become more open.

But open data isn’t just about data from organisations. Open data can be created and published by individuals and communities for their own needs and purposes.

Open data can (also) be a creative activity

Open data can also be a creative activity. A means for communities to collaborate around sharing what they know about a topic that is important or meaningful to them. Simply because they want to do it. I think sometimes we overlook these projects in the drive to encourage governments and other organisations to publish open data.

So I’m going to talk through eight (you said six in the talk, idiot! – Ed) different example projects. Some you will have definitely heard about before, but I suspect there will be a few that you haven’t. In most cases the primary goal of these projects is to create an openly licensed dataset. So when you contribute to the project, you’re directly helping to create more open data.

Of course, there are other ways in which we each contribute to open data. But these are often indirect contributions. For example where our personal data that is held in various services is aggregated, anonymised and openly published. But today I want to focus on more direct contributions.

For each of the examples I’ve collected a few figures that indicate the date the project started, the number of contributors, and an indication of the size of the dataset. Hopefully this will help paint a picture of the level of effort that is already going into maintaining these resources. (Psst, see the slides for the figures – Ed)

Wikipedia

The first example is Wikipedia. Everyone knows that anyone can edit Wikipedia. But you might not be aware that Wikipedia can be turned into structured data and used in applications. There are lots of projects that do this, e.g. dbpedia, which brings Wikipedia into the web of data.

The bits that are turned into structured data are the “infoboxes” that give you the facts and figures about the person (for example) that you’re reading about. So if you add to Wikipedia, and specifically to the infoboxes, then you’re adding to an openly licensed dataset.

The most obvious example of where this data is used is in Google search results. The infoboxes you see on search results whenever you google for a person, place or thing are partly powered by Wikipedia data.

A few years ago I added a Wikipedia page for Gordon Boshell, the author of some children’s books I loved as a kid. There wasn’t a great deal of information about him on the web, so I pulled together whatever I could find and created a page for him. Now when anyone searches for Gordon Boshell they can see some information about him right on Google, and the results link out to the books that he wrote. It’s nice to think that I’ve helped raise his profile.

There’s also a related project from the Wikimedia Foundation called Wikidata. Again, anyone can edit it, but it’s a database of facts and figures rather than an encyclopedia.
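To give a flavour of how easy that data is to use, here’s a minimal sketch of querying Wikidata’s public SPARQL endpoint from Python. Q42 (Douglas Adams) and P569 (“date of birth”) are real Wikidata identifiers, used purely as an illustration; the User-Agent is a placeholder you’d replace with your own.

```python
# Ask Wikidata for Douglas Adams' date of birth via SPARQL.
import requests

# wd:Q42 is Douglas Adams; wdt:P569 is the "date of birth" property.
query = "SELECT ?dob WHERE { wd:Q42 wdt:P569 ?dob . }"

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "open-data-example/0.1 (hello@example.org)"},
)
for row in response.json()["results"]["bindings"]:
    print(row["dob"]["value"])  # e.g. 1952-03-11T00:00:00Z
```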

OpenStreetMap

The second example is OpenStreetMap. You’ll definitely have already heard about its goal to create a crowd-sourced map of the world. OpenStreetMap is fascinating because it’s grown this incredible ecosystem of tools and projects that make it easier to contribute to the database.

I’ve recently been getting involved with contributing to OpenStreetMap. My initial impression was that I was probably going to have to get a commercial GPS and go out and do complicated surveying. But it’s not like that at all. It’s really easy to add points to the map, and to use their tools to trace buildings from satellite imagery. They provide great tutorials to help you get started.

It’s surprisingly therapeutic. I’ve spent a few evenings drinking a couple of beers and tracing buildings. It’s a bit like an adult colouring book, except you’re creating a better map of the world. Neat!

There are a variety of other tools that let you contribute to OpenStreetMap. For example Wheelmap allows you to add wheelchair accessibility ratings to locations on the map. We’ve been using this in the AccessibleBath project to help crowd-source data about wheelchair accessibility in Bath. One afternoon we got a group of around 25 volunteers together for a couple of hours and mapped 30% of the city centre.

There’s a lot of humanitarian mapping that happens using OpenStreetMap. If there’s been a disaster or a disease outbreak then aid workers often need better maps to help reach the local population and target their efforts. Missing Maps lets you take part in that. They have a really nice workflow that lets you contribute towards improving the map by tracing satellite imagery.

There’s a related project called MapSwipe. It’s a mobile application that presents you with a grid of satellite images. All you have to do is tap the tiles which contain a building and then swipe left. Behind the scenes this data is used to direct Missing Maps volunteers towards the areas where more detailed mapping would be most useful. This focuses contributors’ attention where it’s most needed and so is really respectful of people’s time.

MapSwipe can also be used offline. So you can download a work package to do when you’re on your daily commute. Easy!

Zooniverse

You’ve probably also heard of Zooniverse, which is my third example. It’s a platform for citizen science projects. That just means using crowd-sourcing to create scientific datasets.

Their most famous project is probably GalaxyZoo which asked people to help classify objects in astronomical imagery. But there are many other projects. If you’re interested in biology then perhaps you’d like to help catalogue specimens held in the archives of the Natural History Museum?

Or there’s Old Weather, which I might get involved with. In that project you can help build a picture of our historical climate by transcribing the weather reports that whaling ship captains wrote in their logs. Collecting that information creates a dataset that tells us more about our climate.

I think it’s a really innovative way to use historical documents.

MusicBrainz

This is my fourth and favourite example. MusicBrainz is a database of music metadata: information about artists, albums, and tracks. It was created in direct response to commercial music databases that were asking people to contribute to their dataset, but were then taking all of the profits and not returning any value to the community. MusicBrainz created a free, open alternative.

I think MusicBrainz is the first open dataset I got involved with. I wrote a client library to help developers use the data. (14 years ago, and you’re still talking about it – Ed)

MusicBrainz has also grown a commercial ecosystem around it, which has helped it be sustainable. There are a number of projects that use the dataset, including Spotify. And it’s been powering the BBC Music website for about ten years.
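The web service is free to use, too. Here’s a minimal sketch of searching for an artist; the search term is just an example, and MusicBrainz asks clients to identify themselves with a descriptive User-Agent, so that value is a placeholder.

```python
# Search the MusicBrainz web service (ws/2) for an artist.
import requests

response = requests.get(
    "https://musicbrainz.org/ws/2/artist",
    params={"query": "artist:Radiohead", "fmt": "json"},
    # MusicBrainz asks clients to send a descriptive User-Agent.
    headers={"User-Agent": "open-data-example/0.1 (hello@example.org)"},
)
for artist in response.json()["artists"][:3]:
    print(artist["name"], artist["id"])
```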

Discogs

My fifth example, Discogs, is also a music dataset. But it’s a dataset about vinyl releases, so it focuses more on the releases, labels, etc. Discogs is a little different because it started as, and still is, a commercial service. At its core it’s a marketplace for record collectors. But to power that marketplace you need a dataset of vinyl releases. So they created tools to help the community build it. And, over time, it’s become progressively more open.

Today all of the data is in the public domain.

OpenPlaques

My sixth example is OpenPlaques. It’s a database of the commemorative plaques that you can see dotted around on buildings and streets. The plaques mark that an important event happened in that building, or that someone famous was born or lived there. Volunteers take photos of the plaques and share them with the service, along with the text and names of anyone who might be mentioned in the plaque.

It provides a really interesting way to explore historical information in the context of cities and buildings. All of the information is linked to Wikipedia so you can find out more.

Rebrickable

My seventh example is Rebrickable, which you’re unlikely to have heard about. I’m cheating a little here as it’s a service and not strictly a dataset. But it’s Lego, so I had to include it.

Rebrickable has a big database of all the official Lego sets and what parts they contain. If you’re a fan of Lego (they’re called AFOLs – Ed) who designs and creates custom Lego models (they’re known as MOCs – Ed), then you can upload the designs and instructions to the service in machine-readable LEGO CAD formats.

Rebrickable exposes all of the information via an API under a liberal licence, so people can build useful tools. For example, using the service you can find out which other official and custom sets you can build with the bricks you already own.
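As an illustration, here’s a minimal sketch against what I understand to be Rebrickable’s v3 REST API. The API key is a placeholder (you get one free with an account) and the set number is just an example; check the API documentation for the current details.

```python
# List a few parts from a Lego set via the Rebrickable v3 API.
import requests

API_KEY = "your-api-key"  # placeholder: issued free with an account

response = requests.get(
    "https://rebrickable.com/api/v3/lego/sets/375-2/parts/",  # example set
    headers={"Authorization": f"key {API_KEY}"},
)
for item in response.json()["results"][:5]:
    print(item["quantity"], item["part"]["name"])
```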

Grand Comics Database

My eighth and final example is the Grand Comics Database. It’s also the oldest project as it was started in 1994. The original creators started with desktop tools before bringing it to the web.

It’s a big database of 1.3m comics. It contains everything from The Dandy and The Beano through to Marvel and DC releases. It’s not just data on the comics, but also story arcs, artists, authors, etc. If you love comics you’ll love GCD. I checked, and this week’s 2000AD (published 2 days ago – Ed) is in there already.

So those are my examples of places where you could contribute to open data.

Open data is an enabler

The interesting thing about them all is that open data is an enabler. In these projects open data isn’t creating economic growth, or being used as a business model. Open licensing is being applied as a tool.

It creates a level playing field that means that everyone who contributes has an equal stake in the results. If you and I both contribute then we can both use the end result for any purpose. A commercial organisation is not extracting that value from us.

Open licensing can help to encourage people to share what they know, which is the reason the web exists.

Working with data

The projects are also great examples of ways of working with data on the web. They’re all highly distributed projects, accepting submissions from people around the world with very different skill sets and experience. That creates a challenge that can only be met with good collaboration tools and really strong community engagement.

Understanding how and why people contribute to your open database is important, because often those reasons will change over time. When OpenStreetMap had just started, contributors had the thrill of filling in a blank map with data about their local area. But now contributions are different. It’s more about maintaining data and adding depth.

Collaborative maintenance

In the open data community we often talk about making things open to make them better. It’s the tenth GDS design principle. And making data open does make it better, in the sense that more people can use it. And perhaps more eyes can help spot flaws.

But if you really want to let people help make something better, then you need to put your data into a collaborative environment. Then data can get better at the pace of the community and not your ability to accept feedback.

It’s not work if you love it

Hopefully the examples give you an indication of the size of these communities and how much has been created. It struck me that many of them have been around since the early 2000s. I’ve not really found any good recent examples (Maybe people can suggest some – Ed). I wonder why that is?

Most of the examples were born around the Web 2.0 era (Mate. That phrase dates you. – Ed) when we were all excitedly contributing different types of content to different services. Bookmarks and photos and playlists. But now we mostly share things on social media. It feels like we’ve lost something. So it’s worth revisiting these services to see that they still exist and that we can still contribute.

While these fan communities are quietly hard at work, maybe we in the open data community can do more to support them?

There are lots of examples of “open” datasets that I didn’t use because they’re not actually open. The licences are restrictive. Or the community has decided not to think about it. Perhaps we can help them understand why being a bit more open might be better?

There are also examples of openly licensed content that could be turned into more data. Take Wikia for example. It contains 360,000 wikis, all with openly licensed content. They get 190m views a month and the system contains 43 million pages. That’s about the same size as the English version of Wikipedia today. They’re all full of infoboxes that are crying out to be turned into structured data.

I think it’d be great to make all this fan-produced data a proper part of the open data commons, sitting alongside the government and organisational datasets that are being published.

Thank you (yes, you!)

That’s the end of my talk. I hope I’ve piqued your interest in looking at one or more of these projects in more detail. Hopefully there’s a project that will help you express your inner data geek.

Photo Attributions

Lego Spaceman, Edwin Andrade, Jamie Street, Olu Elet, Aaron Burden, Volkan Olmez, Alvaro Serrano, RawPixel.com, Jordan Whitfield, Anthony DELANOIX

 

“Open”

For the purposes of having something to point to in future, here’s a list of different meanings of “open” that I’ve encountered.

XYZ is “open” because:

  • It’s on the web
  • It’s free to use
  • It’s published under an open licence
  • It’s published under a custom licence, which limits some types of use (usually commercial, often everything except personal)
  • It’s published under an open licence, but we’ve not checked too deeply into whether we can do that
  • It’s free to use, so long as you do so within our app or application
  • There’s a restricted/limited access free version
  • There’s documentation on how it works
  • It was (or is) being made in public, with equal participation by anyone
  • It was (or is) being made in public, led by a consortium or group that has limitations on membership (even if just fees)
  • It was (or is) being made privately, but the results are then being made available publicly for you to use

I gather that at IODC “open washing” was a frequently referenced topic. It’s not surprising given the variety of ways in which the word “open” is used. Many of which are not open at all. And the list I’ve given above is hardly comprehensive. This is why the Open Definition is such an important reference, even if it may have its faults.

Depending on your needs, any or all of those definitions might be fine. But “open” for you may not be “open” for everyone. So let’s not lose sight of the goal and keep checking that we’re using that word correctly.

And, importantly, if we’re really making things open to make them better, then we might need to be more open to collaboration. Open isn’t entirely about licensing either.

 

Building best practices for sharing public sector data

Originally published on the Open Data Institute blog. Original URL: https://theodi.org/blog/building-best-practices-for-sharing-public-sector-data

At the ODI we’re big fans of capturing best practices and simple design patterns to help guide people towards the most effective ways to publish data.

By breaking down complex technical and organisational challenges into smaller steps, we can identify common problems across sectors and begin cataloguing common solutions. This is the common thread that ties together our research and technical projects, and it’s this experience that we bring to our advisory projects.

We’ve been contributing to the Share-PSI project, which has been documenting a range of best practices that relate to publishing public-sector data. Some of the best practices address specific technical questions relating to the web of data, and these form part of the W3C’s ‘Data on the web best practices’ guidance.

But some of the best practices address higher-level issues, such as the importance of creating an open data strategy and a release plan to support it. Or the creation of change by supporting startups and enabling ecosystems. Each best practice sets out the underlying challenge, a recommended solution, and provides pointers to further reading.

Our guidance, white papers and reports help to add depth to these best practices by linking them to evidence of their successful adoption, both here in the UK and internationally. This helps to ground the best practices in concrete guidance that draws on the experience of the wider community.

The best practices also provide a useful way to explore the elements of existing open data programmes.

For example, it’s possible to see how a large public-sector initiative like #OpenDefra has been successful through its adoption of so many of these discrete best practices. These include the creation of a common open data strategy across its network, use of a release process that allowed for more rapid publication of data while managing risks, benchmarking practice using a maturity model, moving to an open-by-default licensing model, and its efforts to engage users and stimulate the wider ecosystem.

The best practices are a useful resource for anyone leading or contributing to an open data initiative. We’re looking forward to adding further to this body of evidence.

We’ve also begun to think about capturing common patterns that illustrate how open and shared data can be successfully used to deliver specific types of government policies. We are looking for feedback on this draft catalogue of strategic government interventions – you can either add comments in the document or email policy@theodi.org.

How to open your data in six easy steps

Originally published on the Open Data Institute blog. Original URL: https://theodi.org/blog/how-to-open-your-data-in-six-easy-steps

1. Scope out the job at hand

Before taking the plunge and jumping straight into publishing, there are a few things to think through first. Take time to consider what data you’re going to release, what it contains and what the business case is for releasing it in the first place.

Consider what licence you’re going to put on the data for others to use. There’s a selection to choose from, depending on how you want others to use it; see our guidance here.

Here are some other key things to consider at this stage:

  • Where will it be published?
  • Will I need documentation around it?
  • What level of support is needed?
  • How frequently will I release the data?

2. Get prepared

Your data is only really useful to others if it’s well structured and has clear metadata (or a data description) to give it context and explain what it’s about and where it comes from.

Start your prep with a technical review using sample data, and identify suitable formats for release and the level of detail and metadata required. Also consider whether it’ll be most useful to the user as an API or a download. Data can be more useful when linked to other datasets, so keep an eye out for opportunities.

Consider your capabilities in-house and whether you need any training in order to release the data, whether technical or around certification. Some ODI courses can help with this.

Finally, think about what metadata you’re going to add to your data to describe what it is or how to use it.
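As a sketch of what that might look like in practice, here’s a minimal machine-readable description for a CSV file, loosely following the W3C’s CSV on the Web (CSVW) conventions. Every value here is a placeholder.

```python
# Write a minimal, illustrative CSVW-style metadata file for a CSV release.
import json

metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "bus-stops.csv",  # placeholder dataset
    "dc:title": "Bus stops in Bath",
    "dc:description": "Locations and accessibility notes for bus stops.",
    "dc:license": {"@id": "https://creativecommons.org/licenses/by/4.0/"},
    "dc:publisher": "Example Council",
}

# CSVW convention: metadata sits alongside the CSV as <file>-metadata.json
with open("bus-stops.csv-metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```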

3. Test your data

Before you release your data, you might want to think about doing a preview with some of your potential users to get some detailed feedback. This isn’t necessarily required for smaller datasets, but for larger releases this user-testing can be really useful.

Don’t forget to get an Open Data Certificate to verify that your data is being published properly.

4. Release your data

Now for the exciting bit: releasing your data, the metadata and the documentation to go with it.

The key thing here is to release your data where your users will be. Otherwise, what’s the point? Where you should release it depends on who you are, but in general you should publish it on your own website, ensuring it’s also listed on relevant portals. For example, public sector organisations should add their data to data.gov.uk. Some sectors have their own portals – in science it’s the norm to publish in an institutional repository or a scientific data repository.

Basically, do your research into how your community shares data, and make sure it’s located in a place you have control over or where you’re confident the data can be consistently available.

When applying for an Open Data Certificate, we’ll ask for evidence that the dataset is listed in one or more portals to ensure it’s accessible.

5. Get engagement and promotion

It’s easy to relax after spending so much time and effort in preparing and releasing your dataset, but don’t just ‘fire and forget’. Make sure you have follow-up activities to let people know the data exists and be responsive to questions they might have. You can engage people in multiple ways (depending on your target audience), for example through blogs or social media. Encourage users to tell you how they’re using the data, so you can promote success stories around it too.

6. Reflect and improve

Now your dataset is out there in the big wide world, take some time to reflect on it. Listen to feedback, and decide what changes you could make or what you’d do differently next time.

If you want to measure your improvement, consider taking a maturity assessment using our Open Data Pathway tool.

101100

Today I am 101100.

That’s XLIV in Roman numerals.
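(For anyone reaching for a calculator: read as binary, 101100 is 44. Python will confirm it.)

```python
# Read 101100 as a binary number and convert it to decimal.
print(int("101100", 2))  # => 44
```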

44 is also the square root of 1936. 1936 was a leap year starting on a Wednesday.

The Year 44 was also a leap year starting on a Wednesday.

It was also known as the Year of the Consulship of Crispus and Taurus. Which is another coincidence because I like crisps and I’m also a Taurus.

And while we’re on Wikipedia, we can use the API to find out that page id 101100 is Sydney Harbour National Park, which opened when I was 3.
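(The lookup is a one-liner against the MediaWiki API, if you fancy trying your own page id.)

```python
# Look up Wikipedia page id 101100 via the MediaWiki API.
import requests

response = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={"action": "query", "pageids": "101100", "format": "json"},
)
print(response.json()["query"]["pages"]["101100"]["title"])
# => Sydney Harbour National Park
```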

Wolfram Alpha reminds me that 44 is the ASCII code for a comma.

Whichever way you look at it #101100 is a disappointing colour.

But if we use the random art generator then we can make a more colourful image from the number. But actually the image with that identifier is more interesting. Glitchy!

The binary number is also a car multimedia entertainment system. But £200 feels a bit steep, even if it is my birthday.

A 12-year-old boy once bid £101,100 for a flooded Seat Toledo on eBay. Because reasons.

101100, or tubulin tyrosine ligase-like family, member 3 to its friends, also seems to do important things for mice.

I didn’t really enjoy Jamendo album 101100, the Jamez Anthony story.

Care of Cell Block 101100 was a bit better in my opinion. But only a bit.

Discogs release 101100 is The Sun’s Running Out by Perfume Tree. Of which the most notable thing is that track six includes a sample from a Dr Who episode.

I’m not really sure what the tag 101100 on flickr means.

IMDB entry 101100 is “Flesh ‘n’ Blood”.

The Board Game Geek identifier 101100 is for an Xbox 360 version of 1 vs 100. That’s not even a board game!

Whereas Drive Thru RPG catalogues product 101100 as Battlemage. Which sounds much more interesting.

If I search for “101100 coordinates” on Google, then it tells me that it’s somewhere in China. I should probably know why.

There are 26 results for 101100 on data.gov.uk. But none on data.gov. Which explains why the UK is #1 in the world for open data.

But HD 101100 is also a star.

And a minor planet discovered on 14th September 1998.

CAS 101-10-0 is 2-(3-Chlorophenoxy)propionic acid. I think it’s a herbicide. Anyway, this is what it looks like.

It’s also a marine worm.

And an insect.

In the database of useful biological numbers, we discover that entry 101100 is the maximal emission wavelength for Venus fluorophore. Which is, of course, 528 nm.

I think the main thing I’ve learnt in my 44 years is that the web is an amazing place.

On accessibility of data

My third open data “parable”. You can read the first and second ones here. With apologies to Borges.

. . . In that Empire, the Art of Information attained such Perfection that the data of a single City occupied the entirety of a Spreadsheet, and the datasets of the Empire, the entirety of a Portal. In time, those Unconscionable Datasets no longer satisfied, and the Governance Guilds struck a Register of the Empire whose coverage was that of the Empire, and which coincided identifier for identifier with it. The following Governments, who were not so fond of the Openness of Data as their Forebears had been, saw that that vast register was Valuable, and not without some Pitilessness was it, that they delivered it up to the Voraciousness of Privatisation and Monopolies. In the Repositories of the Net, still today, there are Stale Copies of that Data, crowd-sourced by Startups and Citizens; in all the Commons there is no other Relic of the Disciplines of Transparency.

Sharon More, The data roads less travelled. London, 2058.

 

Caution: data, use responsibly

Originally published on the Open Data Institute blog. Original URL: https://theodi.org/blog/caution-data-use-responsibly

In December 2015, Ben Goldacre and Anna Powell-Smith launched the beta of Open Prescribing. The site, which was swiftly celebrated in the open data community and beyond, provides insight into the prescribing practices of GPs around the UK. Its visualisations and reports give an entirely new perspective on some of the bulk open datasets available from the NHS.

Open Prescribing is a fantastic demonstration of how openly publishing data can unlock new, creative uses.

There is a particular feature of the site which piqued my interest: a page entitled ‘Caution: how to use the data responsibly’. Goldacre and Powell-Smith have included some clear guidance that helps users to properly interpret their findings, including:

  • guidance on how to interpret high and low values for the measurements, encouraging thought into what patterns they may or may not demonstrate – because of differences in population around a practice, for example
  • notes on how the individual measures were decided upon
  • insight into the importance of specific drugs and measures for a non-specialist audience
  • links to useful background information from the original data publishers

The ‘About’ page for the site also attributes all of the datasets that were used as input to the analysis.

Clear attribution, provenance reporting and guidance on limits to the analysis might be expected from authors with a background in evidence-based medicine. It’s not yet normal practice within the open data community. But it should be.

As a society, we are making an increasing number of decisions based on data, about our health, economy and businesses. So it’s becoming more and more important that we know the limits of what that data can reliably tell us. Data enables informed decisions. Knowing the limits of data also makes us more informed.

In my opinion all data analysis should have an equivalent of the Open Prescribing “/caution” URL.

To achieve this, data users need to know more about how data is collected and processed before it is published. This is why the higher levels of the Open Data Certificate require publishers to:

  • document any known quality issues or limitations with the data
  • publish details of their quality control processes, including how to report errors
  • describe the provenance of the data, e.g. how it was collected and analysed

That information provides the necessary foundation for re-users to properly interpret and apply data. This information can then be cited, as it is on Open Prescribing, to help downstream users understand the impacts on any analysis.

Documenting the datasets used in an analysis is another norm that’s common in the medical and scientific communities. Linking to source datasets is the basis for citation analysis in academic research. These links power many types of discovery tools, and help improve reproducibility and transparency in research.

Use of machine-readable attributions could do the same for more general uses of data online. In the early days of the web, developers would “view source” to see the markup behind a webpage and learn how it was put together. The ability to “view sources” to discover the data underlying an application or data analysis would be a useful feature for the data web.
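There’s no standard for this yet, so purely as a hypothetical sketch, a published analysis could expose a simple machine-readable manifest of its sources. Every field name and URL below is illustrative, not an existing specification.

```python
# A hypothetical "view sources" manifest for a data analysis.
# All field names and URLs are illustrative placeholders.
import json

sources = {
    "analysis": "https://example.org/analysis/prescribing-trends",
    "caution": "https://example.org/analysis/prescribing-trends/caution",
    "datasets": [
        {
            "title": "GP practice prescribing data",
            "publisher": "NHS",
            "url": "https://example.org/data/prescribing.csv",
            "licence": "https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/",
        }
    ],
}
print(json.dumps(sources, indent=2))
```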

So, if you’re doing some data analysis, follow the best practices embodied by Open Prescribing and help users and other developers to understand how you’ve achieved your results.

Take your first steps with Open Data Pathway

Originally published on the Open Data Institute blog. Original URL: https://theodi.org/blog/take-your-first-steps-with-open-data-pathway

We’re launching a new tool today called Open Data Pathway. It’s a self-assessment tool that will help you assess how well your organisation publishes and consumes open data, and identify actions for improvement.

The tool is based on the Open Data Maturity Model we have been developing in partnership with the Department for Environment, Food & Rural Affairs.

The maturity model is based around five themes and maturity levels. Each theme represents a broad area of operations within an organisation, and is broken down into areas of activity which can then be used to assess progress.

We’ve previously published the maturity model as a public draft. We would like to thank everyone from across central and local government, agencies and other organisations who gave feedback on the draft documents. Your contributions and ideas were extremely valuable. We’re pleased to announce that the finalised first edition of the model is now available.

Open Data Pathway supports open data practitioners in carrying out a maturity assessment. Completing an assessment will create a report that scores your organisation against each activity. The report also includes practical recommendations that suggest how scores can be improved for each activity. Combined with the ability to set targets for improvement, Open Data Pathway provides a complete self-assessment tool to enable practitioners to successfully apply the maturity model to their organisation.

Open Data Pathway offers a useful complement to the Open Data Certificates. The certificates measure how effectively someone is sharing a dataset for ease of reuse. Open Data Pathway helps organisations assess how well they publish and consume open data, helping build a roadmap for their open data journey.

We are initially launching the tool as an alpha release to help us gain valuable user feedback. The beta version will launch at the end of April, 2015, and will have the functionality to support results sharing and organisation benchmarking.

Please sign up and explore the tool and let us know what you think.