What is a Dataset?

As my last post highlighted, I’ve been thinking about how we can find and discover datasets and their related APIs and services. I’m thinking of putting together some simple tools to help explore and encourage the kind of linking that my diagram illustrated.

There’s some related work going on in a few areas which is also worth mentioning:

  • Within the UK Government Linked Data group there’s some work progressing around the notion of a “registry” for Linked Data that could be used to collect dataset metadata as well as supporting dataset discovery. There’s a draft specification which is open for comment. I’d recommend you ignore the term “registry” and see it more as a modular approach for supporting dataset discovery, lightweight Linked Data publishing, and “namespace management” (aka URL redirection). A registry function is really just one aspect of the model.
  • There’s an Open Data on the Web workshop in April which will cover a range of topics including dataset discovery. My current thoughts are partly preparation for that event (and I’m on the Programme Committee)
  • There’s been some discussion and a draft proposal for adding the Dataset type to Schema.org. This could result in the publication of more embedded metadata about datasets. I’m interested in tools that can extract that information and do something useful with it.

Thinking about these topics I realised that there are many definitions of “dataset”. Unsurprisingly it means different things in different contexts. If we’re defining models, registries and markup for describing datasets we may need to get a sense of what these different definitions actually are.

As a result, I ended up looking around for a series of definitions and I thought I’d write them down here.

Definitions of Dataset

Lets start with the most basic, for example Dictionary.com has the following definition:

“a collection of data records for computer processing”

Which is pretty vague. Wikipedia has a definition which derives from the terms use in a mainframe environment:

“A dataset (or data set) is a collection of data, usually presented in tabular form. Each column represents a particular variable. Each row corresponds to a given member of the dataset in question. It lists values for each of the variables, such as height and weight of an object. Each value is known as a datum. The dataset may comprise data for one or more members, corresponding to the number of rows.

Nontabular datasets can take the form of marked up strings of characters, such as an XML file.”

The W3C Data Catalog Vocabulary defines a dataset as:

“A collection of data, published or curated by a single source, and available for access or download in one or more formats.”

The JISC “Data Information Specialists Committee” have a definition of dataset as:

“…a group of data files–usually numeric or encoded–along with the documentation files (such as a codebook, technical or methodology report, data dictionary) which explain their production or use. Generally a dataset is un-usable for sound analysis by a second party unless it is well documented.”

Which is a good definition as it highlights that the dataset is more than just the individual data files or facts, it also consists of some documentation that supports its use or analysis. I also came across a document called “A guide to data development” (2007) from the National Data Development and Standards Unit in Australia which describes a dataset as

“A data set is a set of data that is collected for a specific purpose. There are many ways in which data can be collected—for example, as part of service delivery, one-off surveys, interviews, observations, and so on. In order to ensure that the meaning of data in the data set is clearly understood and data can be consistently collected and used, data are defined using metadata…”

This too has the notion of context and clear definitions to support usage, but also notes that the data may be collected in a variety of ways.

A Legal Definition

As it happens, there’s also a legal definition of a dataset in the UK, at least as far as it relates to the Freedom of Information. The “Protections of Freedom Act 2012 Part 6, (102) c” includes the following definition:

In this Act “dataset” means information comprising a collection of information held in electronic form where all or most of the information in the collection—

  • (a)has been obtained or recorded for the purpose of providing a public authority with information in connection with the provision of a service by the authority or the carrying out of any other function of the authority,
  • (b)is factual information which—
    • (i)is not the product of analysis or interpretation other than calculation, and
    • (ii)is not an official statistic (within the meaning given by section 6(1) of the Statistics and Registration Service Act 2007), and
  • (c)remains presented in a way that (except for the purpose of forming part of the collection) has not been organised, adapted or otherwise materially altered since it was obtained or recorded.”

This definition is useful as it defines the boundaries for what type of data is covered by Freedom of Information requests. It clearly states that the data is collected as part of the normal business of the public body and also that the data is essentially “raw”, i.e. not the result of analysis or has not been adapted or altered.

Raw data (as defined here!) is more useful as it supports more downstream usage. Raw data has more potential.

Statistical Datasets

The statistical community has also worked towards having a clear definition of dataset. The OECD Glossary defines a Dataset as “any organised collection of data”, but then includes context that describes that further. For example that a dataset is a set of values that have a common structure and are usually thematically related. However there’s also this note that suggests that a dataset may also be made up of derived data:

“A data set is any permanently stored collection of information usually containing either case level data, aggregation of case level data, or statistical manipulations of either the case level or aggregated survey data, for multiple survey instances”

Privacy is one key reason why a dataset may contain derived information only.

The RDF Data Cube vocabulary, which borrows heavily from SDMX — a key standard in the statistical community — defines a dataset as being made up of several parts:

  1. “Observations – This is the actual data, the measured numbers. In a statistical table, the observations would be the numbers in the table cells.
  2. Organizational structure – To locate an observation within the hypercube, one has at least to know the value of each dimension at which the observation is located, so these values must be specified for each observation…
  3. Internal metadata – Having located an observation, we need certain metadata in order to be able to interpret it. What is the unit of measurement? Is it a normal value or a series break? Is the value measured or estimated?…
  4. External metadata — This is metadata that describes the dataset as a whole, such as categorization of the dataset, its publisher, and a SPARQL endpoint where it can be accessed.”

The SDMX implementors guide has a long definition of dataset (page 7) which also focuses on the organisation of the data and specifically how individual observations are qualified along different dimensions and measures.

Scientific and Research Datasets

Over the last few years the scientific and research community have been working towards making their datasets more open, discoverable and accessible. Organisations like the Welcome Foundation have published guidance for researchers on data sharing; services like CrossRef and DataCite provide the means for giving datasets stable identifiers; and platforms like FigShare support the publishing and sharing process.

While I couldn’t find a definition of dataset from that community (happy to take pointers!) its clear that the definition of dataset is extremely broad. It could cover both raw results, e.g. output from sensors or equipment, through to more analysed results. The boundaries are hard to define.

Given the broad range of data formats and standards, services like FigShare accept any or all data formats. But as the Welcome Trust note:

“Data should be shared in accordance with recognised data standards where these exist, and in a way that maximises opportunities for data linkage and interoperability. Sufficient metadata must be provided to enable the dataset to be used by others. Agreed best practice standards for metadata provision should be adopted where these are in place.”

This echoes the earlier definitions that included supporting materials as being part of the dataset.

RDF Datasets

I’ve mentioned a couple of RDF vocabularies already, but within the RDF and Linked Data community there are a couple of other definitions of dataset to be found. The Vocabulary for Organising Interlinked Datasets (VoiD) is similar to, but predates, DCAT. Whereas DCAT focuses on describing a broad class of different datasets, VoiD describes a dataset as:

“…a set of RDF triples that are published, maintained or aggregated by a single provider…the term dataset has a social dimension: we think of a dataset as a meaningful collection of triples, that deal with a certain topic, originate from a certain source or process, are hosted on a certain server, or are aggregated by a certain custodian. Also, typically a dataset is accessible on the Web, for example through resolvable HTTP URIs or through a SPARQL endpoint, and it contains sufficiently many triples that there is benefit in providing a concise summary.”

Like the more general definitions this includes the notion that the data may relate to a specific topic or be curated by a single organisation. But this definition also makes some assumption about the technical aspects of how the data is organised and published. VoiD also includes support for linking to the services that relate to a dataset.

Along the same lines, SPARQL also has a definition of a Dataset:

“A SPARQL query is executed against an RDF Dataset which represents a collection of graphs. An RDF Dataset comprises one graph, the default graph, which does not have a name, and zero or more named graphs, where each named graph is identified by an IRI…”

Unsurprisingly for a technical specification this is a very narrow definition of dataset. It also differs from the VoiD definition. While both assume RDF as the means for organising the data, the VoiD term is more general, e.g. it glosses over details of the internal organisation of the dataset into named graphs. This results in some awkwardness when attempting to navigate between a VoiD description and a SPARQL Service Description.

Summary

If you’ve gotten this far, then well done 🙂

I think there’s a couple of things we can draw out from these definitions which might help us when discussing “datasets”:

  • There’s a clear sense that a dataset relates to specific topic and is collected for a particular purpose.
  • The means by which a dataset is collected and the definitions of its contents are important for supporting proper re-use
  • Whether a dataset consists of “raw data” or more analysed results can vary across communities. Both forms of dataset might be available, but in some circumstances (e.g. for privacy reasons) only derived data might be published
  • Depending on your perspective and your immediate use case the dataset may be just the data items, perhaps expressed in a particular way (e.g. as RDF).  But in a broader sense, the dataset also includes the supporting documentation, definitions, licensing statements, etc.

While there’s a common core to these definitions, different communities do have slightly different outlooks that are likely to affect how they expect to publish, describe and share data on the web.

7 thoughts on “What is a Dataset?

  1. Thanks for collecting all these definitions. There’s also the legal definition from the European Database Directive that reads:

    “For the purposes of this Directive, ‘database` shall mean a collection of independent works, data or other materials arranged in a systematic or methodical way and individually accessible by electronic or other means.”

    This definition differs from the others in that it includes collections that aren’t accessible electronically, like a card index.

  2. Hi Leigh,
    I came across your diagram of linked data in today’s Semantic Web feed. I’m intrigued because I’m in the process of looking for APIs to use with my app, TapAMap.

    I’m new to this world (previously a journalist and teacher), but have learned Objective-C and created my first app, TapAMap, which has been on the App Store for a month.

    In creating it, I stumbled across the GeoNames database as an API of geo-linked Wikipedia articles. Only later did I realize that GeoNames was part of the linked data cloud of the Semantic Web. I’ve since become very excited about semantic-web possibilities, and am currently taking Harald Sack’s online course through the Hasso Plattner Institut.

    Your post intrigues me because I’m looking for more APIs, albeit in a “follow your nose” manner. Specifically, I’d love to find a RESTful API that would provide shapefiles (polygons) of geographical areas. I want to send a point (lat/long) and get back the polygons that contain that point. I read that MongoDB does not yet have this capability.

    Please check out TapAMap (it’s free), and let me know what you think:
    https://itunes.apple.com/us/app/tapamap/id591552628?mt=8

    Do you have any suggestions for how I might go about finding APIs of geolinked data?

    Sincerely,

    Scott C. Smith

  3. Whoops! I thought the first comment disappeared because I hadn’t logged in to WordPress, but it re-appeared after I posted the second version… (Feel free to delete one; I didn’t see that I could).

  4. Scott,

    You’ve already found Geonames and there are a few other datasets on the Linked Data Cloud that contain geographical information that might support your use case.

    I’m not aware of any that integrate shape files however. What I think would be particularly useful is an API to take co-ords and then not just return the polygons, but identify the regions, e.g. the wards, political districts, etc that relate to it. This would give you a way to start to associate arbitrary geographic locations with the rest of the Linked Data graph.

    This is particularly useful if you want to associate activity or data points with a region. E.g. sentiment analysis for an election.

    We were exploring some ideas around this at my former employer. There’s some code here:

    https://github.com/kasabi/linkedspatialindex

  5. missed this post at the time and drawn to it by a tweet yesterday; this response really relates to Scott/connectacopia and Leigh’s response to that requirement; Without wishing to blow our own trumpet in this forum “we” (not myself, the guys in the back room!) have had this (RESTful API returning attributes (from spatial entities be they points/rasters, lines or polygons) based on spatial qualifiers such as containment or adjacency) as a primary capability for 5 or 6 years now…….of course returning the attributes of multiple geographies (e.g. admin/political, environmental, social etc) is likely only the jumping off point for analytics and insight in the context of the query…..

Comments are closed.