Having explored some ways that we might find related data and services, as well as different definitions of “dataset”, I wanted to look at the topic of dataset description and analysis. Specifically, how can we answer the following questions:
- what kinds of information does this dataset contain?
- what types of entity are described in this dataset?
- how can I determine if this dataset will fulfil my requirements?
There’s been plenty of work done around trying to capture dataset metadata, e.g. VoiD and DCAT; there’s also the upcoming workshop on Open Data on the Web. Much of that work has focused on capturing the core metadata about a dataset, e.g. who published it, when it was last updated, where the data files can be found, etc. But there’s still plenty of work to be done here, both to encourage broader adoption of best practices and to explore ways to expose more information about the internals of a dataset.
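As a sketch of what that core metadata looks like in practice, here is a minimal hand-written VoiD/DCAT description covering both descriptive and access information. All URIs, titles, and counts are invented for illustration:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix void: <http://rdfs.org/ns/void#> .

<http://example.org/dataset/places> a dcat:Dataset, void:Dataset ;
    dct:title       "Example Places" ;
    dct:description "An illustrative dataset of places." ;
    dct:license     <http://creativecommons.org/licenses/by/4.0/> ;
    dct:modified    "2013-04-01"^^<http://www.w3.org/2001/XMLSchema#date> ;
    void:triples    250000 ;
    void:sparqlEndpoint <http://example.org/sparql> ;
    dcat:distribution [
        a dcat:Distribution ;
        dcat:downloadURL <http://example.org/dumps/places.nt.gz> ;
        dct:format "application/n-triples"
    ] .
```

Note how little of this goes beyond the "work" level: the triple count is about the only compositional detail the standard vocabularies surface by default.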
This is a topic I’ve touched on before, and which we experimented with in Kasabi. I wanted to move “beyond the triple count” and provide a “report card” that gave a little more insight into a dataset. A report card could usefully complement an ODI Open Data Certificate, for example. Understanding the composition of a dataset can also help support new ways of manipulating and combining datasets.
In this post I want to propose a conceptual framework for capturing metadata about datasets. It’s intended as a discussion point, so I’m interested in getting feedback. (I would have submitted this to the ODW workshop but ran out of time before the deadline.)
At the top level I think there are five broad categories of dataset information: Descriptive Data; Access Information; Indicators; Compositional Data; and Relationships. Compositional data can be broken down into smaller categories — this is what I described as an “information spectrum” in the Beyond the Triple Count post.
While I’ve thought about this largely from the perspective of Linked Data, I think it’s applicable to any format/technology.
This kind of information helps us understand a dataset as a “work”: its name, a human-readable description or summary, its subject categories, its license, and pointers to other relevant documentation such as quality control or feedback processes. This information is typically created and maintained directly by the data publisher, whereas the other categories of data described here can potentially be derived automatically by data analysis.
Basically, where do I get the data?
- Where do I download the latest data?
- Where can I download archived or previous versions of the data?
- Are there mirrors for the dataset?
- Are there APIs that use this data?
- How do I obtain access to the data or API?
This is statistical information that can help provide some insight into the dataset, for example its size. But indicators can also build confidence among re-users by highlighting useful statistics such as the timeliness of releases, speed of responding to data fixes, etc.
While a data publisher might publish some of these indicators as targets that they are aiming to achieve, many of these figures could be derived automatically from an underlying publishing platform or service.
Examples of indicators:
- Rate of Growth
- Date of Last Update
- Frequency of Updates
- Number of Re-users (e.g. size of user community, or number of apps that use it)
- Number of Contributors
- Frequency of Use
- Turn-around time for data fixes
- Number of known errors
- Availability (for API based access)
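Several of these indicators really could fall out of a publishing platform automatically. As a minimal sketch, the following derives a few of them from a hypothetical release log; the dates and record counts are invented, and a real platform would draw them from its own change history:

```python
from datetime import date

# Hypothetical release history: (release date, record count) pairs.
# Assumes at least two releases so that gaps can be computed.
releases = [
    (date(2013, 1, 1), 1000),
    (date(2013, 2, 1), 1200),
    (date(2013, 3, 1), 1500),
]

def indicators(releases):
    """Derive simple indicators from a dated release history."""
    dates = [d for d, _ in releases]
    sizes = [n for _, n in releases]
    gaps = [(b - a).days for a, b in zip(dates, dates[1:])]
    return {
        "date_of_last_update": max(dates),
        "update_frequency_days": sum(gaps) / len(gaps),   # mean gap
        "growth_per_release": (sizes[-1] - sizes[0]) / (len(sizes) - 1),
    }

print(indicators(releases))
```

Indicators like number of re-users or turn-around time for fixes need other inputs (app registries, issue trackers), but the shape of the computation is the same: fold a log of events into a handful of summary figures.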
Relationship data primarily drives discovery use cases: to which other datasets does this dataset relate? For example the dataset might re-use identifiers or directly link to resources in other datasets. Knowing the source of that information can help us build trust in the reliability of the combined data, as well as give us signposts to other useful context. This is where Linked Data excels.
Annotation Datasets provide context to, and enrich other reference datasets. Annotations might be limited to linking information (“Link Sets”) or they may add new facts/properties about existing resources. Independently sourced quality control information could be published as annotations.
Provenance is also a form of relationship information. Derived datasets, e.g. created through analysis or data conversions, should refer to their original input datasets, and ideally also the algorithms and/or code that were applied.
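That kind of provenance relationship can be expressed directly with the PROV vocabulary. A minimal sketch, again with invented URIs for the datasets and the conversion code:

```turtle
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix dct:  <http://purl.org/dc/terms/> .

<http://example.org/dataset/places-geocoded>
    prov:wasDerivedFrom <http://example.org/dataset/places> ;
    prov:wasGeneratedBy [
        a prov:Activity ;
        prov:used <http://example.org/code/geocoder> ;
        dct:description "Batch geocoding of place names"
    ] .
```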
Again, much of this information can be derived from data analysis. Recommendations for relevant related datasets might be created based on existing links between datasets or by analysing usage patterns. Set algebra on the URIs in datasets can be used to analyse their overlap, to discover linkages, and to determine whether one dataset contains annotations of another.
- List of dataset(s) that this dataset draws on (e.g. re-uses identifiers, controlled vocabulary, etc)
- List of datasets that this dataset references, e.g. via links
- List of source datasets used to compile or create this dataset
- List of datasets that link to this dataset (“back links”)
- Which datasets are often used in conjunction with this dataset?
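The set-algebra idea mentioned above needs nothing more than the sets of URIs each dataset mentions. A minimal sketch, with invented example URIs standing in for the extracted subject/object URIs of two datasets:

```python
# URIs mentioned by two hypothetical datasets (invented for illustration).
dataset_a = {
    "http://example.org/place/1",
    "http://example.org/place/2",
    "http://example.org/place/3",
}
dataset_b = {
    "http://example.org/place/2",
    "http://example.org/place/3",
    "http://example.org/place/4",
}

overlap = dataset_a & dataset_b    # URIs described in both datasets
only_a = dataset_a - dataset_b     # URIs unique to dataset A
# B could be a pure annotation dataset of A only if every resource
# it describes is already present in A
b_annotates_a = dataset_b <= dataset_a

print(len(overlap), sorted(only_a), b_annotates_a)
```

The same subset, intersection, and difference operations answer the back-link and linkage questions in the list above, given the URI sets of the candidate datasets.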
This is information about the internals of a dataset: e.g. what kind of data does it contain, how is that data organized, and what kinds of things are being described?
This is the most complex area as there are potentially a number of different audiences and abilities to cater for. At one end of the spectrum we want to provide high level summaries of the contents of a dataset, while at the other end we want to provide detailed schema information to support developers. I’ve previously advocated a “progressive disclosure” approach to allow re-users to quickly find the data they need; a product manager looking for data to support a new feature will be looking for different information to a developer constructing queries over a dataset.
I think there are three broad ways that we can decompose Compositional Data further. There are particular questions and types of information that relate to each of them:
- Scope or Coverage
  - What kinds of things does this dataset describe? Is it people, places, or other objects?
  - How many of these things are in the dataset?
  - Is there a geographical focus to the dataset, e.g. a county, region, country, or is it global?
  - Is the data confined to a particular time period (archival data) or does it contain recent information?
  - What are some typical example records from the dataset?
- Structure
  - What schema does it conform to?
  - What graph patterns (e.g. combinations of vocabularies) are commonly found in the data?
  - How are various types of resource related to one another?
  - What is the logical data model for the data?
- Internals
  - What RDF terms and vocabularies are used in the data?
  - What formats are used for capturing dates, times, or other structured values?
  - Are there custom validation rules for particular fields or properties?
  - Are there caveats or qualifiers to individual schema elements or data items?
  - What is the physical data model?
  - How is the dataset laid out in a particular database schema, across a collection of files, or named graphs?
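Some of the scope questions can be answered with a single pass over the data. The sketch below computes VoiD-style class and property partitions from a handful of toy triples (the resources and terms are invented); the same counts could feed a high-level summary for a product manager or detailed schema statistics for a developer:

```python
from collections import Counter

# Toy triples as (subject, predicate, object); all names are invented.
triples = [
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:bob",   "rdf:type", "foaf:Person"),
    ("ex:leeds", "rdf:type", "geo:Place"),
    ("ex:alice", "foaf:based_near", "ex:leeds"),
]

# Class partition: how many resources of each type are described?
class_partition = Counter(o for s, p, o in triples if p == "rdf:type")

# Property partition: which predicates are used, and how often?
property_partition = Counter(p for s, p, o in triples)

print(class_partition)
print(property_partition)
```

The class partition answers "what kinds of things, and how many?" (Scope), while the property partition starts to answer "which terms and vocabularies are used?" (Internals).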
The experiments we did in Kasabi around the report card (see the last slides for examples) were exploring ways to help visualise the scope of a dataset. It was based on identifying broad categories of entity in a dataset. I’m not sure we got the implementation quite right, but I think it was a useful visual indicator to help understand a dataset.
This is a project I plan to revive when I get some free time. Related to this is the work I did to map the Schema.org Types to the Noun Project Icons.
I’ve tried to present a framework that captures most, if not all of the kinds of questions that I’ve seen people ask when trying to get to grips with a new dataset. If we can understand the types of information people need and the questions they want to answer, then we can create a better set of data publishing and analysis tools.
To date, I think there’s been a tendency to focus on the Descriptive Data and Access Information — because we want to be able to discover data — and its Internals — so we know how to use it.
But for data to become more accessible to a non-technical audience we need to think about a broader range of information and how this might be surfaced by data publishing platforms.
If you have feedback on the framework, particularly if you think I’ve missed a category of information, then please leave a comment. The next step is to explore ways to automatically derive and surface some of this information.
9 thoughts on “What Does Your Dataset Contain?”
Surprised to see an article like this not having reference to the Finding Aids work that came out of museum and library culture in the 1990s.
To be honest it never occurred to me to look at this from an library or archiving perspective. At least not beyond Dublin Core style metadata. Were you thinking of Encoded Archival Descriptions, or something else? Any pointers appreciated!
EAD would be a good place to start (Dan Pitti comes to mind). There’s been a ton of work in this area since then, and I’m afraid I haven’t been following it directly.
Nice! This seems like a reasonable collection of categories to me.
What is not entirely clear to me though are the subcategories “Structure” and “Internals.” In particular:
* What exactly differentiates these groups of information?
* Is it may be better to think of the subcategory “Structure” as information about the logical organization of the dataset, and “Internals” as information about the physical organization of the dataset?
* Why do you categorize information about RDF terms and vocabularies used in the data as “Internals” and information about the schema used as “Structure”?
When attempting to visualize the ‘size’ of each of our product implementations, we have imagined a diagram similar to the wordle thing. Lots of ‘object types’ (in our case Book, Journal, Working Paper, Chapter, Article, Publisher, Author etc) whose size is related to the number of instances of that object type in the data. Next to each object type is the actual number so that we can measure one implementation against another. While this method does not count the actual triples related to each type OR the total actual size of the data, it does help simple visualization and comparison. I suspect this approach might be useful for a first pass on other data sets?
In fact, could a wordle type algorithm not be passed across the whole data set in order to provide a first pass at an algorithmic understanding of the data?
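A first pass at that is straightforward once you have instance counts per type. The sketch below (with invented counts, echoing the Book/Article example above) log-scales counts into font sizes, which damps the problem of a single huge type drowning out the rest:

```python
import math
from collections import Counter

# Invented instance counts per entity type, for illustration only.
type_counts = Counter({"Article": 90000, "Chapter": 45000,
                       "Author": 30000, "Book": 12000, "Publisher": 300})

def font_size(count, min_pt=10, max_pt=40):
    """Log-scale an instance count into a tag-cloud font size."""
    lo = math.log(min(type_counts.values()))
    hi = math.log(max(type_counts.values()))
    scale = (math.log(count) - lo) / (hi - lo)
    return round(min_pt + scale * (max_pt - min_pt))

for t, n in type_counts.most_common():
    print(t, n, f"{font_size(n)}pt")
```

Printing the actual number alongside each type, as suggested above, sidesteps some of the perceptual-weighting problems discussed in the next comment.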
Size-weighted tag clouds (the more generic term) can be useful, although typographic variations (e.g. ascenders and descenders) and psychological perceptual weighting (some terms will leap out almost without thought) can limit the effectiveness of a tag cloud for formal study.
The main danger I see is that you need to have analyzed the data and determined which features to present, which can lead to missing major features. A secondary difficulty is in choosing term weighting. This is a general problem in information retrieval, especially Salton-style cosine vector approaches, where longer texts are always reported to be more “relevant” to a query than shorter texts because they contain the search terms more often. The most important term in a vocabulary is often document.title, but it occurs only once per document. For example, _Encyclopædia Britannica_ 🙂
One approach to resolving the relative differences between a word such as _whitehouse_ occurring in the E.B. and the same word occurring in a telegram from George Bush to the ben Laden family might be to use Situational Semantics, an emerging field that considers both context and situation.
For now, in the case that you have some knowledge of your data and some control over it, you’re obviously in a good position to produce finding aids. Even there, there’s a question about what the person using the finding aids is actually seeking. When I’m looking for a particular illustration to answer an image search request for a stock image, perhaps to illustrate a book or a documentary, I find that most people scanning or photographing books and putting them online are in many cases not interested even in mentioning whether the books are illustrated, let alone naming the artists.
My conclusion so far is that to a large extent the best finding aids are dynamic, but based on guiding information provided by a curator.
I’ve raised similar set of questions at http://csarven.ca/statistical-linked-dataspaces#requirements that should be asked before consuming datasets. There is a strong overlap for some of the items, which I guess makes them noteworthy.
I too approached them from the Linked Data angle but it is probably applicable to data consumption using any technology stack.