What is a Dataset? Part 2: A Working Definition

A few years ago I wrote a post called “What is a Dataset?” It lists a variety of the different definitions of “dataset” used in different communities and standards. What I didn’t do is give my own working definition of dataset. I wanted to share that here along with a few additional thoughts on some related terms.

Answering the right question

I’ve noticed that often, when people ask for a definition of “dataset”, it’s for one of two reasons.

The first occurs when they’re actually asking a different question: “What is data?” Here I usually try to avoid getting into a lengthy discussion around data, facts, information and knowledge, and instead focus on providing examples of datasets. I include databases, spreadsheets, sensor readings and collections of documents, images and video. This helps to get across that these days everything is data; it just depends how you process it.

The second occurs when someone is trying to decide how to turn an existing database or some other collection of data into a “dataset” they can publish on their website, in a portal, or via an API. Answering this question involves a number of other questions. For example:

  • Is a dataset a single data file?
    • Answer: Not necessarily, it could be several files that have been split up for ease of production or consumption
  • Is a database one dataset or several?
    • Answer: It depends. Sometimes a database might be a single dataset, but sometimes it might be better published as several smaller datasets. You’ll often need to strip personal or commercially sensitive data anyway, so what you publish is unlikely to be exactly what you’ve got in your database. But you might decide to publish a collection of different data files (e.g. one per table) packaged together in some way. This might be best if someone will always want to consume the whole thing, e.g. to create a local copy of your database
  • Are there reasons why a single larger collection of data might be broken up into different datasets?
    • Answer: Yes, if it makes it easier for people to access and use the data. Or maybe there are regular updates, each of which is a separate dataset
  • If a database contains data from different sources, should it be published as several different datasets?
    • Answer: It depends. If you’ve created a useful aggregation, then publishing it as a whole makes sense as a user can access the whole thing. Ditto if you’ve corrected, fixed or improved some third-party data. But sometimes you might just want to release whatever new data you’ve added or created, and let people find other datasets that you reference or reuse by providing a link to the original versions
  • …etc

There are no hard and fast answers. Like everything around publishing open data, you need to take into account a number of different factors.

A working definition

Bringing this together, I’ve ended up with the following rough working definition of “dataset”:

A dataset is a collection of data that is managed using the same set of governance processes, has a shared provenance and shares a common schema

By requiring a common set of governance processes, we group together data that has the same level of quality assurance, security and other policies. By requiring a shared provenance, we focus on data that has been collected in similar ways, which means that it will have similar licensing and rights issues. Sharing a common schema means that the data is consistently expressed.
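As a rough sketch (the names and helper here are my own illustration, not any standard vocabulary), the three tests in this definition could be captured in a minimal metadata record:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetDescription:
    """Hypothetical metadata record capturing the three tests above."""
    governance: str  # e.g. the quality assurance and security policies applied
    provenance: str  # e.g. how and when the data was collected
    schema: str      # e.g. a reference to a machine-readable schema


def same_dataset(a: DatasetDescription, b: DatasetDescription) -> bool:
    """Two collections of data belong to the same dataset only if they
    share governance processes, provenance and schema."""
    return (a.governance, a.provenance, a.schema) == \
           (b.governance, b.provenance, b.schema)
```

For example, two annual statistical releases might share governance processes and a schema but have different provenance, so by this definition they are separate datasets.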

To test this out:

  • If you produce a set of official statistics, each annual release is a new dataset, because the data has been collected and processed at different times
  • A database of images and comments that users have made against them would probably be best released as two datasets: one containing the images (& their metadata) and another containing the comments. Images and comments are two different types of object; they’re collected and managed in different ways
  • A set of food hygiene ratings collected by different councils across the UK consists of multiple datasets. Data on each local area will have been collected at different times by different organisations. Publishing them separately allows users to take just the data they need, when it’s updated
  • …etc

There are always exceptions to any rule, but I’ve found this reasonably useful in practice, as it highlights some important considerations. But I’m pretty sure it can be improved. Let me know if you have comments.

This post is part of a series called “basic questions about data”.


The Lego Analogy

I think Lego is a great analogy for understanding the importance of data standards and registers.

Lego have been making plastic toys and bricks since the late 40s. It took them a little while to perfect their designs. But since 1958 they’ve been manufacturing bricks in the same way, to the same basic standard. This means that you can take any bricks manufactured over the last 59 years and they’ll fit together. As a company, they have extremely high standards around how their bricks are manufactured. Only 18 in a million are ever rejected.

A commitment to standards maximises the utility of all of the bricks that the company has ever produced.

Open data standards apply the same principle but to data. By publishing data using common APIs, formats and schemas, we can start to treat data like Lego bricks. Standards help us recombine data in many, many different ways.

There are now many more types and shapes of Lego brick than there used to be. The Lego standard colour palette has also evolved over the years. The types and colours of bricks have changed to reflect the company’s desire to create a wider variety of sets and themes.

If you look across all of the different sets that Lego have produced, you can see that some basic pieces are used very frequently. A number of these pieces are “plates” that help to connect other bricks together. If you ask a Master Lego Builder for a list of their favourite pieces, you’ll discover the same. Elements that help you connect other bricks together in new and interesting ways are the most popular.

Registers are small, simple datasets that play the same role in the data ecosystem. They provide a means for us to connect datasets together. A way to improve the quality and structure of other datasets. They may not be the most excitingly shaped data. Sometimes they’re just simple lists and tables. But they play a very important role in unlocking the value of other data.

So there we have it, the Lego analogy for standards and registers.

Mapping wheelchair accessibility: how Google could help

This month Google announced a new campaign to crowd-source information on wheelchair accessibility. It will be asking the Local Guides community of volunteers to begin answering simple questions about the wheelchair accessibility of places that appear on Google Maps. Google already crowd-sources a lot of information from volunteers. For example, it asks them to contribute photos, add reviews and validate the data it’s displaying to users of its mapping products.

It’s great to see Google responding to requests from wheelchair users for better information on accessibility. But I think they can do better.

There are many projects exploring how to improve accessibility information for people with mobility issues, and how to use data to increase mobility. I’ve recently been leading a project in Bath that is using a service called Wheelmap to crowd-source wheelchair accessibility information for the centre of the city. Over two Saturday afternoons we’ve mapped 86% of the city. Crowd-sourcing is a great way to collect this type of information and Google has the reach to really take this to another level.

The problem is that the resulting data is only available to Google. Displaying the data on Google Maps will put it in front of millions of people, but that data could potentially be reused in a variety of other ways.

For example, for the Accessible Bath project we’re now able to explore accessibility information based on the type of location. This may be useful for policy makers to help shape support and investment in local businesses to improve accessibility across the city. Bath is a popular tourist destination so it’s important that we’re accessible to all.

We’re able to do this because Wheelmap stores all of its data in OpenStreetMap. We have access to all of the data our volunteers collect and can use it in combination with the rich metadata already in OpenStreetMap. And we can also start to combine it with other information, e.g. data on the ages of buildings, which may yield more insight.

As we learnt in our meetings with local wheelchair users and stroke survivors, mobility and accessibility issues are tricky to address. Road and pavement surfaces and types of dropped kerbs can impact you differently depending on your specific needs. Often you need more data and more context from other sources to provide the necessary support. Like Google we’re starting with wheelchair accessibility because that’s the easiest problem to begin to address.

To improve routing, for example, you might need data on terrain, or to be able to identify the locations and sizes of individual disabled parking spaces. Microsoft’s Cities Unlocked are combining accessibility and location data from OpenStreetMap with Wikipedia entries to help blind users navigate a city. They chose OpenStreetMap as their data source because of its flexibility, existing support for accessibility information and rapid updates. This type of innovation requires greater access to raw data, not just data on a map.

By collecting and displaying data only on its own maps, Google is not maximising the value of the contributions made by its Local Guides community. If the data they collected was published under an open licence, it could be used in many other projects. By improving its maps, Google is addressing a specific set of user needs. By opening up the data it could let more people address more user needs.

If Google felt they were unable to publish the data under an open licence, they could at least make the data available to OpenStreetMap contributors to support their mapping events. This type of limited licensing is already being used by Microsoft, DigitalGlobe and others to make commercial satellite imagery available to the OpenStreetMap community. While restrictive licensing is not ideal, allowing the data to be used to improve open databases without the need to worry about IP issues is a useful step forward from keeping the data locked down.

Another form of support that Google could offer is to extend Schema.org to allow accessibility information to be associated with Places. By incorporating this into Google Maps and then openly publishing or sharing that data, it would encourage more organisations to publish this information about their locations.

But I find it hard to think of good reasons why Google wouldn’t make this data openly available. I think its Local Guides community would agree that they’re contributing in order to help make the world a better place. Ensuring that the data can be used by anyone, for any purpose, is the best way to achieve that goal.

Under construction

It’s been a while since I posted a more personal update here. But, as I announced this morning, I’ve got a new job! I thought I’d write a quick overview of what I’ll be doing and what I hope to achieve.

I’ve been considering giving up freelancing for a while now. I’ve been doing it on and off since 2012 when I left Talis. Freelancing has given me a huge amount of flexibility to take on a mixture of different projects. Looking back, there’s a lot of projects I’m really proud of. I’ve worked with the Ordnance Survey, the British Library and the Barbican. I helped launch a startup which is now celebrating its fifth birthday. And I’ve had far too much fun working with the ONS Digital team.

I’ve also been able to devote time to helping lead a plucky band of civic hackers in Bath. We’ve run free training courses, built an energy-saving application for schools and mapped the city. Amongst many other things.

I’ve spent a significant amount of time over the last few years working with the Open Data Institute. The ODI is five and I think I’ve been involved with the organisation for around 4.5 years. Mostly as a part-time associate, but also for a year or so as a consultant. It turned out that wasn’t quite the right role for me, hence the recent dive back into freelancing.

But over that time, I’ve had the opportunity to work on a similarly wide-ranging set of projects. I’ve researched how election data is collected and used and learnt about weather data. I’ve helped to create guidance around open identifiers, licensing, and open data policies, and explored ways to direct organisations on their open data journey. I’ve also provided advice and support to startups, government and multi-national organisations. That’s pretty cool.

I’ve also worked with an amazing set of people. Some of those people are still at the ODI and others have now moved on. I’ve learnt loads from all of them.

I was pretty clear what type of work I wanted to do in a more permanent role. Firstly, I wanted to take on bigger projects. There’s only so much you can do as an independent freelancer. Secondly, I wanted to work on “data infrastructure”. While collectively we’ve only just begun thinking through the idea of data as infrastructure, looking back over my career it’s a useful label for the types of work I’ve been doing, the majority of which has involved looking at applications of data, technology, standards and processes.

I realised that the best place for me to do all of that was at the ODI. So I’ve seized the opportunity to jump back into the organisation.

My new job title is “Data Infrastructure Programme Lead”. In practice this means that I’m going to be:

  • helping to develop the ODI’s programme of work around data infrastructure, including the creation of research, standards, guidance and tools that will support the creation of good data infrastructure
  • taking on product ownership for certificates and pathway, so we’ve got a way to measure good data infrastructure
  • working with the ODI’s partners and network to support them in building stronger data infrastructure
  • building relationships with others who are working on building data infrastructure in public and private sector, so we can learn from one another

And no doubt, a whole lot of other things besides!

I’ll be working closely with Peter and Olivier, as my role should complement theirs. And I’m looking forward to spending more time with the rest of the ODI team, so I can find ways to support and learn more from them all.

My immediate priorities will be working on standards and tools to help build data infrastructure in the physical activity sector, through the OpenActive project. And leading on projects looking at how to build better standards and how to develop collaborative registers.

I’m genuinely excited about the opportunities we have for improving the publication and use of data on the web. It’s a topic that continues to occupy a lot of my attention. For example, I’m keen to see whether we can build a design manual for data infrastructure. Or improve governance around data through analysing existing sources. Or whether mapping data ecosystems and diagramming data flows can help us understand what makes a good data infrastructure. And a million other things. It’s also probably time we started to recognise and invest in the building blocks for data infrastructure that we’ve already built.

If you’re interested in talking about data infrastructure, then I’d love to hear from you. You can reach me on twitter or email.

Bath Playbills 1812-1851

This weekend I published scans of over 2000 historical playbills for the Theatre Royal in Bath. Here are some notes on where they come from and how they might be useful.

The scans are all available on Flickr and have been placed into the public domain under a CC0 waiver. You’re free to use them in any way you see fit. The playbills date from 1812 through to 1851. This is the period just before the fire and rebuilding of the theatre in its current location.

The scans are taken from 5 public domain books available digitally from the British Library. All I’ve done in this instance is take the PDF versions of the books, split out the pages into separate images and then upload them to Flickr, into separate collections.

This is a small step, but will hopefully make the contents more discoverable and accessible. The individual playbills are now part of the web, so can be individually referenced and commented on.

For example there are some great images in the later bills. And I learned that in 1840 you could have seen lions, tigers and leopards.


And this playbill includes detail on the plot and scenes from a play called “Susan Hopley” and an intriguing reference to “Punchinello Vampire!”.


As they are all in the public domain, the images will hopefully be of interest to Wikipedians interested in the history of Bath, the theatre or performers such as Joseph Grimaldi. (I did try adding a reference to a playbill myself, but had this reverted because I was “linking to my own social media site”).

There’s a lot of detail in the bills which it might be useful to extract, e.g. the dates of each bill, the plays being performed and details of the performers and sponsors. If anyone is interested in helping to crowd-source that, then let me know!


We can strengthen data infrastructure by analysing open data

Data is infrastructure for our society and businesses. To create stronger, sustainable data infrastructure that supports a variety of users and uses, we need to build it in a principled way.

Over time, as we gain experience with a variety of infrastructures supporting both shared and open data, we can identify the common elements of good data infrastructure. We can use that to help to write a design manual for data infrastructure.

There are a variety of ways to approach that task. We can write case studies on specific projects, and we can map ecosystems to understand how value is created through data. We can also take time to contribute to projects. Experiencing different types of governance, following processes and using tools can provide useful insight.

We can also analyse open data to look for additional insights that might help us improve data infrastructure. I’ve recently been involved in two short projects that have analysed some existing open data.

Exploring open data quality

Working with Experian and colleagues at the ODI, we looked at the quality of some UK government datasets. We used a data quality tool to analyse data from the Land Registry, the NHS and Companies House. We found issues with each of the datasets.

It’s clear that there’s still plenty of scope to make basic improvements to how data is published, by providing:

  • better guidance on the structure, content and licensing of data
  • basic data models and machine-readable schemas to help standardise approaches to sharing similar data
  • better tooling to help reconcile data against authoritative registers
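As a sketch of the kind of basic check a data quality tool performs (the field names and records below are invented for illustration, not taken from the datasets we analysed):

```python
def completeness(rows, field):
    """Percentage of rows with a non-empty value for the given field."""
    filled = sum(1 for row in rows if row.get(field, "").strip())
    return 100.0 * filled / len(rows)


# Illustrative records: one has a blank postcode, one is missing the field.
records = [
    {"name": "ACME LTD", "postcode": "BA1 1AA"},
    {"name": "WIDGETS PLC", "postcode": ""},
    {"name": "EXAMPLE CIC"},
]

print(completeness(records, "postcode"))  # only one of three rows is filled
```

Real tools apply many such rules (format checks, reference-data lookups, duplicate detection), but even a simple completeness score quickly surfaces the kinds of gaps we found.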

The UK is also still in need of a national open address register.

Open data quality is a current topic in the open data community. The community might benefit from access to an “open data quality index” that provides more detail on these issues. Open data certificates would be an important part of that index. The tools used to generate that index could also be used on shared datasets. The results could be open, even if the datasets themselves might not be.

Exploring the evolution of data

There are currently plans to further improve the data infrastructure that supports academic research by standardising organisation identifiers. I’ve been doing some R&D work for that project to analyse several different shared and open datasets of organisation identifiers. By collecting and indexing the data, we’ve been able to assess how well they can support improving existing data, through automated reconciliation and by creating better data entry tools for users.
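Automated reconciliation of the kind mentioned above can be approximated with simple fuzzy matching. This sketch uses Python’s standard library; the register entries and identifiers are invented examples, not real organisation identifiers:

```python
from difflib import get_close_matches

# A hypothetical register mapping canonical organisation names to identifiers.
register = {
    "University of Bath": "org-001",
    "University of Bristol": "org-002",
    "Open Data Institute": "org-003",
}


def reconcile(name, cutoff=0.8):
    """Return the identifier for the closest-matching register entry,
    or None if nothing is similar enough."""
    matches = get_close_matches(name, register.keys(), n=1, cutoff=cutoff)
    return register[matches[0]] if matches else None
```

A misspelled entry like “Universty of Bath” still resolves to the right identifier, while an unknown name returns None; production systems add normalisation, address matching and confidence scoring on top of this basic idea.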

Increasingly, when we are building new data infrastructures, we are building on and linking together existing datasets. So it’s important to have a good understanding of the scope, coverage and governance of the source data we are using. Access to regularly published data gives us an opportunity to explore the dynamics around the management of those sources.

For example, I’ve explored the growth of the GRID organisational identifiers.
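This kind of growth analysis needs nothing more than record counts from successive releases. The figures below are invented for illustration, not real GRID numbers:

```python
# (release, total records) for successive snapshots of a register.
snapshots = [
    ("2016-01", 60000),
    ("2016-07", 64000),
    ("2017-01", 70000),
]


def growth(snapshots):
    """Net records added between each pair of consecutive releases."""
    return [
        (later[0], later[1] - earlier[1])
        for earlier, later in zip(snapshots, snapshots[1:])
    ]

print(growth(snapshots))  # [('2016-07', 4000), ('2017-01', 6000)]
```

Plotting these deltas over time shows whether a register is growing steadily, in bursts, or plateauing, which is a useful proxy for the effort going into maintaining it.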

This type of analysis can help assess the level of investment required to maintain different types of dataset and registers. The type of governance we decide to put around data will have a big impact on the technology and processes that need to be created to maintain it. A collaborative, user maintained register will operate very differently to one that is managed by a single authority.

One final area in which I hope the community can begin to draw together some insight is around how data is used. At present there are no standards to guide the collection and reporting on metrics for the usage of either shared or open data. Publishing open data about how data is used could be extremely useful not just in understanding data infrastructure, but also in providing transparency about when and how data is being used.


Experiences with the Freestyle Libre

We’ve been using the Freestyle Libre for just over a year now to help my daughter manage her Type-1 diabetes. I wanted to share a few thoughts about how well it’s been working for us. I had lots of questions at the start, so I wanted to help capture what we’ve learned in case it’s useful for anyone else.

I’m writing this as a parent, rather than as a person with experience of wearing a sensor or the emotional cost of dealing with diabetes. Do not take anything I write here as medical advice, this is just a summary of our experience with the sensors.

My daughter is now 13. It was her decision to trial the sensor and hers to continue its use.

Cost & Shipping

Firstly, the Libre is not currently available on the NHS. But I believe it’s under review. This means we’re paying for the sensor ourselves. We’re lucky enough to be able to afford that, but not everyone is able to do so.

To use the sensors you need a reader (£57.95) and then sensors. The sensors are priced at £57.95 each, including VAT. When completing an order, if you’re buying the sensor for yourself to help you manage your diabetes, or for a family member, then you can fill in a disclaimer and the VAT is waived. For our last sensor order we paid £48.29 per sensor. Sensors last a maximum of 14 days (see below), so on average you will be paying around £24 a week.

Shipping is quick and you’ll pay around £5-6 for postage. We buy ours in packs of 5 as that covers around 10 weeks worth of usage and reduces postage costs.

When we first bought the sensors we bought a pack of 10. I wouldn’t advise this as the sensors do have a use-by date, so you can’t just stock up.


Sensor lifespan
Once fitted, a sensor lasts a maximum of 14 days. You can’t choose to wear it for longer: the reader will no longer collect data from a sensor 14 days after it’s been activated.

While we’ve had a full 14 days from many of the sensors, in some cases they may come off early. They’re pretty secure once fitted, and in the optimum location in the back of the upper arm they are generally out of the way. But we’ve also had a number that haven’t lasted that long. They can be knocked off. We’ve also had to put tape over some sensors that have started to come off the skin.

When travelling we generally take a spare as well as manual blood testing equipment. See below.

Fitting the sensor

Fitting the sensor is straightforward. The arm is swabbed with an alcohol wipe to clean the skin, then the sensor is pushed into the arm using a single-use applicator that comes with each sensor. They can be fitted in a couple of minutes. After activation using the reader it takes an hour before the first readings are available.

The applicator makes a clunking sound as the sensor is injected into the skin. It’s a bit like using a hole punch. Martha occasionally has some pain and soreness but that passes quickly.

On one occasion I’ve had a sensor fail to attach properly. This was because I tried to apply the sensor too soon after using the alcohol wipe. I’d recommend letting the skin fully dry before application to ensure the sticky pad adheres properly.

You can shower and swim when wearing the sensor.

Travelling with the sensor

Travelling with diabetes isn’t easy. Airports aren’t generally welcoming to people carrying bags of needles and vials of liquid.

Our understanding is that the sensors won’t go through metal detectors, but might be OK for X-Rays. As a minimum you’ll need to inform security if you or a family member is wearing a sensor. As our daughter now also wears an insulin pump, on our last few flights we’ve ended up having to opt out of all scans. This involves some headaches as you might imagine, but staff in the UK and elsewhere have so far been very helpful.

There is probably better advice online. Our experience is relatively limited here.

Removing the sensor

The sensors are fairly easy to peel off, although it takes some scrubbing to remove the glue. There’s no needle in the device, just a hair-thin sensor. The applicators can go in the bin, but we put the sensors in our sharps bin.

Sometimes the skin under the sensor can be a bit inflamed, but we’ve not had any serious side effects or issues.

Taking readings

To collect readings from the sensor you just scan it with the Reader. It works through clothing, so it’s very easy to do.

The sensor collects readings every 15 minutes automatically and stores up to 8 hours of readings. All of the stored readings are automatically downloaded to the Reader whenever you scan it.

If you have an NFC enabled phone then you can collect readings using the LibreLink app. There’s also a LinkUp app to share readings with family members.

How the sensor helps to manage diabetes

The sensor removes the need to do routine finger prick tests. Martha no longer has to take blood glucose readings before meals, we can just scan her sensor and then work out the necessary dose and any correction.

Now that she is also using an insulin pump it’s really just a matter of scanning and then entering the data into the pump. It will work out any necessary corrections. The combination of the sensor and the pump has made an incredible difference to the routines of managing diabetes. For the better.

However, using the Libre doesn’t mean that you can give up finger pricks completely. The sensor has limited accuracy with blood glucose levels below 4 or above 14 mmol/L. Outside of that range you must still do a finger prick test to ensure that you have an accurate reading for treating hypo- or hyperglycemia.

Accuracy of the sensor

Our biggest challenge when starting to use the Libre was understanding its differences from routine finger prick tests. This made us very wary about its accuracy initially. I’ll try to explain why.

If you want a detailed review of the Libre’s accuracy, then you can read this scientific paper which summarises a controlled test of the Libre. It helps to demonstrate the accuracy and reliability of the sensors, but may be too detailed for some people.

When you perform a finger prick test you are directly measuring the amount of glucose in your blood. But the Libre isn’t testing your blood glucose. The Libre sensor is testing the fluid between the cells in your skin. That fluid is known as interstitial fluid.

Interstitial fluid, its nutrients and oxygen are replenished from your blood stream. This means that you’re only indirectly testing your blood glucose. It takes time for glucose to pass from your blood into the fluid. Roughly speaking, a measurement from the sensor is around 5-10 minutes behind your actual blood glucose level. If you’re running low on the sensor, your blood glucose might be even lower. And vice versa.

This explains why you need to finger prick when you’re low or high: you need to be treating your actual levels. On a routine basis, this delay isn’t an issue. It’s only when you’re particularly low or high that you may need to be more vigilant. This also explains why you need to travel with a full set of equipment and not just replacement sensors.

While there are delays, the fact that the Libre is constantly recording means that whenever you scan you’re getting an updated graph of your glucose levels, not just a single reading. The Reader will show you the graph and also give you an idea of whether you’re level, rising (or falling) slowly, or rising (or falling) rapidly. That makes a massive amount of difference.

When we started testing the Libre we were doing routine finger pricks as well. The end result was a bit like wearing several watches, each of which is showing a slightly different time. We felt like we wouldn’t be able to trust the sensor because it was so often at odds with the blood glucose readings. The fact that this was also happening at a time when Martha’s levels were particularly erratic didn’t help: with highly variable blood glucose levels, you can feel one step behind.

Once we committed to using the Libre as our means of routine testing, everything was fine. You just need to be aware of the differences. Martha’s HbA1c levels demonstrate that we’re able to effectively manage her glucose levels.

One additional issue to be aware of is that it takes time for the sensors to bed in. A sensor won’t start reporting readings until after it’s been on the skin for an hour. But we, and others, have found that it can take some time after that before readings seem reliable. Some sensors seem to work fine straight away, others seem a bit variable.

We’ve not had an issue with a sensor never settling down; they’re normally fine after a few hours. But it’s often hard to tell: is it the sensor, or just a particularly variable set of glucose levels?

We’ve heard that some people using the Libre install a new sensor 24 hours before the previous one runs out, to allow time for it to settle in. We’ve not found it necessary to do that.


Type 1 diabetes is an incredibly difficult condition to live with. I have nothing but admiration for how well Martha is dealing with it. She is my hero.

The Libre has made a significant difference to her (and our) quality of life. Removing the need for routine use of needles greatly reduces the number of medical interventions we have to make every day. The ability to easily scan to get a reading of glucose levels makes it easier for Martha in all aspects of her daily life. It’s much less obtrusive than finger pricking.

As parents it’s easy for us to check on her levels when she’s sleeping. A quick scan is all it takes. An integrated sensor and pump might be even better, but the smaller size of the Libre sensors makes it perfectly adequate for now.

I hope the Libre becomes more widely available on the NHS so that more people can benefit from it. I also hope this article has been useful. We’re very happy to answer any other questions. Leave a comment or drop me an email.