A few years ago I wrote a post called “What is a Dataset?” It surveys the different definitions of “dataset” used across communities and standards. What I didn’t do is give my own working definition of dataset. I wanted to share that here, along with a few additional thoughts on some related terms.
Answering the right question
I’ve noticed that often, when people ask for a definition of “dataset”, it’s for one of two reasons.
The first occurs when they’re actually asking a different question: “What is data?” Here I usually try to avoid a lengthy discussion of data, facts, information and knowledge, and instead focus on providing examples of datasets: databases, spreadsheets, sensor readings, and collections of documents, images and video. This helps get across that, these days, almost everything is data; it just depends on how you process it.
The second occurs when someone is trying to decide how to turn an existing database, or some other collection of data, into a “dataset” they can publish on their website, in a portal, or via an API. Answering this question involves a number of other questions. For example:
- Is a dataset a single data file?
- Answer: Not necessarily; it could be several files that have been split up for ease of production or consumption.
- Is a database one dataset or several?
- Answer: It depends. Sometimes a database might be a single dataset, but sometimes it might be better published as several smaller datasets. You’ll often need to strip personal or commercially sensitive data anyway, so what you publish is unlikely to be exactly what’s in your database. But you might decide to publish a collection of different data files (e.g. one per table) packaged together in some way. This might be best if someone will always want to consume the whole thing, e.g. to create a local copy of your database.
- Are there reasons why a single larger collection of data might be broken up into different datasets?
- Answer: Yes, if it makes it easier for people to access and use the data. Or there may be regular updates, each of which is a separate dataset.
- If a database contains data from different sources, should it be published as several different datasets?
- Answer: It depends. If you’ve created a useful aggregation, then publishing it as a whole makes sense, as a user can access the whole thing. Ditto if you’ve corrected, fixed or improved some third-party data. But sometimes you might just want to release whatever new data you’ve added or created, and let people find the datasets you reference or reuse by providing a link to the original versions.
There are no hard and fast answers. Like everything around publishing open data, you need to take into account a number of different factors.
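To make the “one file per table” option above concrete, here is a minimal Python sketch (using only the standard library; the function name and file layout are my own invention, not part of any standard) that exports each table of an SQLite database to its own CSV file, ready to be packaged together as one dataset:

```python
import csv
import sqlite3

def export_tables_to_csv(db_path, out_dir="."):
    """Export every table in an SQLite database to its own CSV file."""
    conn = sqlite3.connect(db_path)
    # List the tables in the database.
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    for table in tables:
        # Table names come from sqlite_master, so this is not untrusted input.
        cursor = conn.execute(f"SELECT * FROM {table}")
        headers = [col[0] for col in cursor.description]
        with open(f"{out_dir}/{table}.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(headers)   # column names as the header row
            writer.writerows(cursor)   # one CSV row per database row
    conn.close()
    return tables
```

In practice this is also the point where you would strip out any personal or commercially sensitive columns before writing the files.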
A working definition
Bringing this together, I’ve ended up with the following rough working definition of “dataset”:
A dataset is a collection of data that is managed using the same set of governance processes, has a shared provenance and shares a common schema
By requiring a common set of governance processes, we group together data that is subject to the same quality assurance, security and other policies.
By requiring a shared provenance, we focus on data that has been collected in similar ways, which means it will have similar licensing and rights issues.
Sharing a common schema means that the data is consistently expressed.
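One way to make the three parts of the definition concrete is to record them explicitly in a dataset’s metadata. The sketch below is illustrative only: the field names are invented for this post, though real vocabularies such as DCAT have their own terms for much the same ideas.

```python
# An illustrative dataset description capturing the three properties
# in the working definition. Field names are invented, not a standard.
dataset = {
    "title": "Food hygiene ratings: Example Council, 2024",
    "governance": {   # same set of governance processes
        "licence": "OGL-v3",
        "quality_assurance": "validated before publication",
    },
    "provenance": {   # shared provenance
        "collected_by": "Example Council",
        "collected": "2024",
    },
    "schema": {       # common schema
        "fields": ["business_id", "name", "rating", "inspection_date"],
    },
}
```

If two collections of data would need different values in any of these three sections, that is a hint they might be better published as separate datasets.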
To test this out:
- If you produce a set of official statistics, each annual release is a new dataset, because the data has been collected and processed at different times
- A database of images and comments that users have made against them would probably best be released as two datasets: one containing the images (& their metadata) and another containing the comments. Images and comments are two different types of object, collected and managed in different ways
- A set of food hygiene ratings collected by different councils across the UK consists of multiple datasets. Data on each local area will have been collected at different times by different organisations. Publishing them separately allows users to take just the data they need, when it’s updated
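Applying the last example in code: a small sketch (the column name `council` is my own invention) that splits one combined table of ratings into one group per council, so each council’s data could then be published as its own dataset:

```python
from collections import defaultdict

def split_by_council(rows, key="council"):
    """Group rating records into one collection per council."""
    datasets = defaultdict(list)
    for row in rows:
        # Each distinct value of the key becomes a candidate dataset.
        datasets[row[key]].append(row)
    return dict(datasets)
```

Each resulting group shares a provenance (one collecting organisation) and can be updated on its own schedule, which is exactly why the split makes sense.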
There are always exceptions to any rule, but I’ve found this definition reasonably useful in practice, as it highlights some important considerations. I’m pretty sure it can be improved, though, so let me know if you have comments.
This post is part of a series called “basic questions about data”.