Monthly Archives: February 2013

What is a Dataset?

As my last post highlighted, I’ve been thinking about how we can find and discover datasets and their related APIs and services. I’m thinking of putting together some simple tools to help explore and encourage the kind of linking that my diagram illustrated.

There’s some related work going on in a few areas which is also worth mentioning:

  • Within the UK Government Linked Data group there’s some work progressing around the notion of a “registry” for Linked Data that could be used to collect dataset metadata as well as supporting dataset discovery. There’s a draft specification which is open for comment. I’d recommend you ignore the term “registry” and see it more as a modular approach for supporting dataset discovery, lightweight Linked Data publishing, and “namespace management” (aka URL redirection). A registry function is really just one aspect of the model.
  • There’s an Open Data on the Web workshop in April which will cover a range of topics including dataset discovery. My current thoughts are partly preparation for that event (and I’m on the Programme Committee)
  • There’s been some discussion and a draft proposal for adding the Dataset type to Schema.org. This could result in the publication of more embedded metadata about datasets. I’m interested in tools that can extract that information and do something useful with it.

Thinking about these topics I realised that there are many definitions of “dataset”. Unsurprisingly it means different things in different contexts. If we’re defining models, registries and markup for describing datasets we may need to get a sense of what these different definitions actually are.

As a result, I ended up looking around for a series of definitions and I thought I’d write them down here.

Definitions of Dataset

Lets start with the most basic, for example Dictionary.com has the following definition:

“a collection of data records for computer processing”

Which is pretty vague. Wikipedia has a definition which derives from the terms use in a mainframe environment:

“A dataset (or data set) is a collection of data, usually presented in tabular form. Each column represents a particular variable. Each row corresponds to a given member of the dataset in question. It lists values for each of the variables, such as height and weight of an object. Each value is known as a datum. The dataset may comprise data for one or more members, corresponding to the number of rows.

Nontabular datasets can take the form of marked up strings of characters, such as an XML file.”

The W3C Data Catalog Vocabulary defines a dataset as:

“A collection of data, published or curated by a single source, and available for access or download in one or more formats.”

The JISC “Data Information Specialists Committee” have a definition of dataset as:

“…a group of data files–usually numeric or encoded–along with the documentation files (such as a codebook, technical or methodology report, data dictionary) which explain their production or use. Generally a dataset is un-usable for sound analysis by a second party unless it is well documented.”

Which is a good definition as it highlights that the dataset is more than just the individual data files or facts, it also consists of some documentation that supports its use or analysis. I also came across a document called “A guide to data development” (2007) from the National Data Development and Standards Unit in Australia which describes a dataset as

“A data set is a set of data that is collected for a specific purpose. There are many ways in which data can be collected—for example, as part of service delivery, one-off surveys, interviews, observations, and so on. In order to ensure that the meaning of data in the data set is clearly understood and data can be consistently collected and used, data are defined using metadata…”

This too has the notion of context and clear definitions to support usage, but also notes that the data may be collected in a variety of ways.

A Legal Definition

As it happens, there’s also a legal definition of a dataset in the UK, at least as far as it relates to the Freedom of Information. The “Protections of Freedom Act 2012 Part 6, (102) c” includes the following definition:

In this Act “dataset” means information comprising a collection of information held in electronic form where all or most of the information in the collection—

  • (a)has been obtained or recorded for the purpose of providing a public authority with information in connection with the provision of a service by the authority or the carrying out of any other function of the authority,
  • (b)is factual information which—
    • (i)is not the product of analysis or interpretation other than calculation, and
    • (ii)is not an official statistic (within the meaning given by section 6(1) of the Statistics and Registration Service Act 2007), and
  • (c)remains presented in a way that (except for the purpose of forming part of the collection) has not been organised, adapted or otherwise materially altered since it was obtained or recorded.”

This definition is useful as it defines the boundaries for what type of data is covered by Freedom of Information requests. It clearly states that the data is collected as part of the normal business of the public body and also that the data is essentially “raw”, i.e. not the result of analysis or has not been adapted or altered.

Raw data (as defined here!) is more useful as it supports more downstream usage. Raw data has more potential.

Statistical Datasets

The statistical community has also worked towards having a clear definition of dataset. The OECD Glossary defines a Dataset as “any organised collection of data”, but then includes context that describes that further. For example that a dataset is a set of values that have a common structure and are usually thematically related. However there’s also this note that suggests that a dataset may also be made up of derived data:

“A data set is any permanently stored collection of information usually containing either case level data, aggregation of case level data, or statistical manipulations of either the case level or aggregated survey data, for multiple survey instances”

Privacy is one key reason why a dataset may contain derived information only.

The RDF Data Cube vocabulary, which borrows heavily from SDMX — a key standard in the statistical community — defines a dataset as being made up of several parts:

  1. “Observations – This is the actual data, the measured numbers. In a statistical table, the observations would be the numbers in the table cells.
  2. Organizational structure – To locate an observation within the hypercube, one has at least to know the value of each dimension at which the observation is located, so these values must be specified for each observation…
  3. Internal metadata – Having located an observation, we need certain metadata in order to be able to interpret it. What is the unit of measurement? Is it a normal value or a series break? Is the value measured or estimated?…
  4. External metadata — This is metadata that describes the dataset as a whole, such as categorization of the dataset, its publisher, and a SPARQL endpoint where it can be accessed.”

The SDMX implementors guide has a long definition of dataset (page 7) which also focuses on the organisation of the data and specifically how individual observations are qualified along different dimensions and measures.

Scientific and Research Datasets

Over the last few years the scientific and research community have been working towards making their datasets more open, discoverable and accessible. Organisations like the Welcome Foundation have published guidance for researchers on data sharing; services like CrossRef and DataCite provide the means for giving datasets stable identifiers; and platforms like FigShare support the publishing and sharing process.

While I couldn’t find a definition of dataset from that community (happy to take pointers!) its clear that the definition of dataset is extremely broad. It could cover both raw results, e.g. output from sensors or equipment, through to more analysed results. The boundaries are hard to define.

Given the broad range of data formats and standards, services like FigShare accept any or all data formats. But as the Welcome Trust note:

“Data should be shared in accordance with recognised data standards where these exist, and in a way that maximises opportunities for data linkage and interoperability. Sufficient metadata must be provided to enable the dataset to be used by others. Agreed best practice standards for metadata provision should be adopted where these are in place.”

This echoes the earlier definitions that included supporting materials as being part of the dataset.

RDF Datasets

I’ve mentioned a couple of RDF vocabularies already, but within the RDF and Linked Data community there are a couple of other definitions of dataset to be found. The Vocabulary for Organising Interlinked Datasets (VoiD) is similar to, but predates, DCAT. Whereas DCAT focuses on describing a broad class of different datasets, VoiD describes a dataset as:

“…a set of RDF triples that are published, maintained or aggregated by a single provider…the term dataset has a social dimension: we think of a dataset as a meaningful collection of triples, that deal with a certain topic, originate from a certain source or process, are hosted on a certain server, or are aggregated by a certain custodian. Also, typically a dataset is accessible on the Web, for example through resolvable HTTP URIs or through a SPARQL endpoint, and it contains sufficiently many triples that there is benefit in providing a concise summary.”

Like the more general definitions this includes the notion that the data may relate to a specific topic or be curated by a single organisation. But this definition also makes some assumption about the technical aspects of how the data is organised and published. VoiD also includes support for linking to the services that relate to a dataset.

Along the same lines, SPARQL also has a definition of a Dataset:

“A SPARQL query is executed against an RDF Dataset which represents a collection of graphs. An RDF Dataset comprises one graph, the default graph, which does not have a name, and zero or more named graphs, where each named graph is identified by an IRI…”

Unsurprisingly for a technical specification this is a very narrow definition of dataset. It also differs from the VoiD definition. While both assume RDF as the means for organising the data, the VoiD term is more general, e.g. it glosses over details of the internal organisation of the dataset into named graphs. This results in some awkwardness when attempting to navigate between a VoiD description and a SPARQL Service Description.

Summary

If you’ve gotten this far, then well done :)

I think there’s a couple of things we can draw out from these definitions which might help us when discussing “datasets”:

  • There’s a clear sense that a dataset relates to specific topic and is collected for a particular purpose.
  • The means by which a dataset is collected and the definitions of its contents are important for supporting proper re-use
  • Whether a dataset consists of “raw data” or more analysed results can vary across communities. Both forms of dataset might be available, but in some circumstances (e.g. for privacy reasons) only derived data might be published
  • Depending on your perspective and your immediate use case the dataset may be just the data items, perhaps expressed in a particular way (e.g. as RDF).  But in a broader sense, the dataset also includes the supporting documentation, definitions, licensing statements, etc.

While there’s a common core to these definitions, different communities do have slightly different outlooks that are likely to affect how they expect to publish, describe and share data on the web.

Dataset and API Discovery in Linked Data

I’ve been recently thinking about how applications can discover additional data and relevant APIs in Linked Data. While there’s been lots of research done on finding and using (semantic) web services I’m initially interested in supporting the kind of bootstrapping use cases covered by Autodiscovery.

We can characterise that use case as helping to answer the following kinds of questions:

  • Given a resource URI, how can I find out which dataset it is from?
  • Given a dataset URI, how can I find out which resources it contains and which APIs might let me interact with it?
  • Given a domain on the web, how can I find out whether it exposes some machine-readable data?
  • Where is the SPARQL endpoint for this dataset?

More succinctly: can we follow our nose to find all related data and APIs?

I decided to try and draw a diagram to illustrate the different resources involved and their connections. I’ve included a small version below:

Data and API Discovery with Linked Data

Lets run through the links between different types of resources:

  • From Dataset to Sparql Endpoint (and Item Lookup, and Open Search Description): this is covered by VoiD which provides simple predicates for linking a dataset to three types of resources. I’m not aware of other types of linking yet, but it might be nice to support reconciliation APIs.
  • From Well-Known VoiD Description (background) to Dataset. This well known URL allows a client to find the “top-level” VoiD description for a domain. It’s not clear what that entails, but I suspect the default option will be to serve a basic description of a single dataset, with reference to sub-sets (void:subset) where appropriate. There might also just be rdfs:seeAlso links.
  • From a Dataset to a Resource. A VoiD description can include example resources, this blesses a few resources in the dataset with direct links. Ideally these resources ought to be good representative examples of resources in the dataset, but they might also be good starting points for further browsing or crawling.
  • From a Resource to a Resource Description. If you’re using “slash” URIs in your data, then URIs will usually redirect to a resource description that contains the actual data. The resource description might be available in multiple formats and clients can content negotiation or follow Link headers to find alternative representations.
  • From a Resource Description to a Resource. A description will typically have a single primary topic, i.e. the resource its describing. It might also reference other related resources, either as direct relationships between those resources or via rdfs:seeAlso type links (“more data over here”).
  • From a Resource Description to a Dataset. This is where we might use a dct:source relationship to state that the current description has been extracted from a specific dataset.
  • From a SPARQL Endpoint (Service Description) to a Dataset. Here we run into some differences between definitions of dataset, but essentially we can describe in some detail the structure of the SPARQL dataset that is used in an endpoint and tie that back to the VoiD description. I found myself looking for a simple predicate that linked to a void:Dataset rather than describing the default and named graphs, but couldn’t find one.
  • I couldn’t find any way to relate a Graph Store to a Dataset or SPARQL endpoint. Early versions of the SPARQL Graph Store protocol had some notes on autodiscovery of descriptions, but these aren’t in the latest versions.

These links are expressed, for the most part, in the data but could also be present as Link headers in HTTP responses or in HTML (perhaps with embedded RDFa).

I’ve also not covered sitemaps at all, which provide a way to exhaustively list the key resources in a website or dataset to support mirroring and crawling. But I thought this diagram might be useful.

I’m not sure that the community has yet standardised on best practices for all of these cases and across all formats. That’s one area of discussion I’m keen to explore further.

Follow

Get every new post delivered to your Inbox.

Join 30 other followers