Having explored some ways that we might find related data and services, as well as different definitions of “dataset”, I wanted to look at the topic of dataset description and analysis. Specifically, how can we answer the following questions:
- what kinds of information does this dataset contain?
- what types of entity are described in this dataset?
- how can I determine if this dataset will fulfil my requirements?
There’s been plenty of work done around trying to capture dataset metadata, e.g. VoID and DCAT; there’s also the upcoming Open Data on the Web workshop. Much of that work has focused on capturing the core metadata about a dataset, e.g. who published it, when was it last updated, where can I find the data files, etc. But there’s still plenty of work to be done here, both to encourage broader adoption of best practices and to explore ways to expose more information about the internals of a dataset.
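To make this concrete, here’s a minimal sketch of the kind of core metadata record that VoID and DCAT support, emitted as N-Triples using plain Python. The dataset URI and all the values are invented for illustration; a real description would use the publisher’s own identifiers and a proper RDF library.

```python
# Sketch only: emitting a VoID-style dataset description as N-Triples
# with the standard library. All URIs and literals here are invented.

DS = "<http://example.org/datasets/my-dataset>"
DCT = "http://purl.org/dc/terms/"

def triple(s, p, o):
    # s and o arrive pre-formatted (angle brackets or quoted literal)
    return f"{s} <{p}> {o} ."

description = [
    triple(DS, "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
           "<http://rdfs.org/ns/void#Dataset>"),
    triple(DS, DCT + "title", '"Example Dataset"'),
    triple(DS, DCT + "license",
           "<http://creativecommons.org/licenses/by/4.0/>"),
    triple(DS, "http://rdfs.org/ns/void#dataDump",
           "<http://example.org/dumps/latest.nt>"),
]

for line in description:
    print(line)
```

Even this tiny record covers the essentials mentioned above: what the dataset is, who may use it (the license), and where to fetch the data files.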
This is a topic I’ve touched on before, and which we experimented with in Kasabi. I wanted to move “beyond the triple count” and provide a “report card” that gave a little more insight into a dataset. A report card could usefully complement an ODI Open Data Certificate, for example. Understanding the composition of a dataset can also help support new ways of manipulating and combining datasets.
In this post I want to propose a conceptual framework for capturing metadata about datasets. It’s intended as a discussion point, so I’m interested in getting feedback. (I would have submitted this to the ODW workshop but ran out of time before the deadline.)
At the top level I think there are five broad categories of dataset information: Descriptive Data; Access Information; Indicators; Compositional Data; and Relationships. Compositional data can be broken down into smaller categories — this is what I described as an “information spectrum” in the Beyond the Triple Count post.
While I’ve thought about this largely from the perspective of Linked Data, I think it’s applicable to any format or technology.
Descriptive data helps us understand a dataset as a “work”: its name, a human-readable description or summary, its license, its subject categories, and pointers to other relevant documentation such as quality control or feedback processes. This information is typically created and maintained directly by the data publisher, whereas the other categories of data I describe here can potentially be derived automatically by data analysis.
Access information answers a basic question: where do I get the data?
- Where do I download the latest data?
- Where can I download archived or previous versions of the data?
- Are there mirrors for the dataset?
- Are there APIs that use this data?
- How do I obtain access to the data or API?
Indicators are statistical information that can provide some insight into the dataset, for example its size. But indicators can also build re-users’ confidence by highlighting useful statistics such as the timeliness of releases, speed of responding to data fixes, etc.
While a data publisher might publish some of these indicators as targets that they are aiming to achieve, many of these figures could be derived automatically from an underlying publishing platform or service.
Examples of indicators:
- Rate of Growth
- Date of Last Update
- Frequency of Updates
- Number of Re-users (e.g. size of user community, or number of apps that use it)
- Number of Contributors
- Frequency of Use
- Turn-around time for data fixes
- Number of known errors
- Availability (for API based access)
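Many of these indicators could indeed fall out of an underlying platform with very little work. As a sketch, assuming we have a list of release dates pulled from a publishing platform’s logs (the dates below are made up), the date of last update and a rough update frequency can be derived directly:

```python
from datetime import date

# Hypothetical release history for a dataset; in practice this would be
# extracted from a publishing platform's logs or release metadata.
releases = [date(2013, 1, 7), date(2013, 2, 4), date(2013, 3, 4), date(2013, 4, 1)]

def last_update(dates):
    """Date of Last Update: the most recent release."""
    return max(dates)

def mean_days_between_updates(dates):
    """Frequency of Updates: average gap between consecutive releases."""
    ordered = sorted(dates)
    gaps = [(b - a).days for a, b in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)

print(last_update(releases))                # 2013-04-01
print(mean_days_between_updates(releases))  # 28.0
```

Other indicators, such as turn-around time for data fixes, could be computed the same way from issue-tracker timestamps.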
Relationship data primarily drives discovery use cases: to which other datasets does this dataset relate? For example the dataset might re-use identifiers or directly link to resources in other datasets. Knowing the source of that information can help us build trust in the reliability of the combined data, as well as give us sign-posts to other useful context. This is where Linked Data excels.
Annotation Datasets provide context to, and enrich other reference datasets. Annotations might be limited to linking information (“Link Sets”) or they may add new facts/properties about existing resources. Independently sourced quality control information could be published as annotations.
Provenance is also a form of relationship information. Derived datasets, e.g. created through analysis or data conversions, should refer to their original input datasets, and ideally also the algorithms and/or code that were applied.
Again, much of this information can be derived from data analysis. Recommendations for relevant related datasets might be created based on existing links between datasets or by analysing usage patterns. Set algebra on URIs in datasets can be used to do analysis on their overlap, to discover linkages and to determine whether one dataset contains annotations of another.
- List of dataset(s) that this dataset draws on (e.g. re-uses identifiers, controlled vocabulary, etc)
- List of datasets that this dataset references, e.g. via links
- List of source datasets used to compile or create this dataset
- List of datasets that link to this dataset (“back links”)
- Which datasets are often used in conjunction with this dataset?
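To make the set-algebra idea concrete, here’s a minimal sketch in Python. The URIs are invented; real ones would be extracted from the datasets themselves. Comparing the sets of resource URIs in two datasets gives a quick measure of overlap, and a containment test flags candidate annotation datasets:

```python
# Sketch of "set algebra on URIs": comparing resource URIs from two
# datasets to measure overlap and containment. URIs are illustrative.

dataset_a = {
    "http://example.org/id/place/1",
    "http://example.org/id/place/2",
    "http://example.org/id/place/3",
}
annotations = {
    "http://example.org/id/place/2",
    "http://example.org/id/place/3",
}

overlap = dataset_a & annotations                  # resources in both
jaccard = len(overlap) / len(dataset_a | annotations)  # overlap score

# If every resource in one dataset also appears in another, it is a
# candidate annotation dataset: it adds facts about existing resources.
is_annotation_of_a = annotations <= dataset_a

print(len(overlap), round(jaccard, 2), is_annotation_of_a)
```

The same comparisons, run pairwise across a catalogue of datasets, would give exactly the kind of “often used together” and back-link relationships listed above.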
Compositional data is information about the internals of a dataset: what kind of data does it contain, how is that data organized, and what kinds of things are being described?
This is the most complex area as there are potentially a number of different audiences and abilities to cater for. At one end of the spectrum we want to provide high level summaries of the contents of a dataset, while at the other end we want to provide detailed schema information to support developers. I’ve previously advocated a “progressive disclosure” approach to allow re-users to quickly find the data they need; a product manager looking for data to support a new feature will be looking for different information to a developer constructing queries over a dataset.
I think there are three broad ways that we can decompose Compositional Data further. There are particular questions and types of information that relate to each of them:
- Scope or Coverage
- What kinds of things does this dataset describe? Is it people, places, or other objects?
- How many of these things are in the dataset?
- Is there a geographical focus to the dataset, e.g. a county, region, country or is it global?
- Is the data confined to a particular time period (archival data), or does it contain recent information?
- What are some typical example records from the dataset?
- What schema does it conform to?
- What graph patterns (e.g. combinations of vocabularies) are commonly found in the data?
- How are various types of resource related to one another?
- What is the logical data model for the data?
- What RDF terms and vocabularies are used in the data?
- What formats are used for capturing dates, times, or other structured values?
- Are there custom validation rules for particular fields or properties?
- Are there caveats or qualifiers to individual schema elements or data items?
- What is the physical data model?
- How is the dataset laid out in a particular database schema, across a collection of files, or in named graphs?
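As a rough sketch of how some of this compositional data could be derived automatically, the following Python counts entity types and predicate vocabularies from a toy set of triples. The URIs are purely illustrative; a real implementation would stream triples from an RDF dump rather than hold them in a list:

```python
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Toy (subject, predicate, object) triples standing in for a real dump.
triples = [
    ("ex:alice", RDF_TYPE, "http://xmlns.com/foaf/0.1/Person"),
    ("ex:alice", "http://xmlns.com/foaf/0.1/name", "Alice"),
    ("ex:bob", RDF_TYPE, "http://xmlns.com/foaf/0.1/Person"),
    ("ex:leeds", RDF_TYPE, "http://schema.org/Place"),
]

def namespace(uri):
    # crude vocabulary split on the last '#' or '/'
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[:cut + 1]

# Counting rdf:type values gives a high-level profile of what kinds of
# things the dataset describes (the "scope" end of the spectrum)...
entity_types = Counter(o for s, p, o in triples if p == RDF_TYPE)

# ...while counting predicate namespaces shows which vocabularies are
# used (the detailed schema end, for developers).
vocabularies = Counter(namespace(p) for s, p, o in triples if p != RDF_TYPE)

print(entity_types.most_common())
print(vocabularies.most_common())
```

The type counts feed the high-level summary a product manager might want, while the vocabulary counts are the raw material for the detailed schema view a developer needs, which fits the progressive disclosure approach described above.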
The experiments we did in Kasabi around the report card (see the last slides for examples) were exploring ways to help visualise the scope of a dataset. It was based on identifying broad categories of entity in a dataset. I’m not sure we got the implementation quite right, but I think it was a useful visual indicator to help understand a dataset.
This is a project I plan to revive when I get some free time. Related to this is the work I did to map the Schema.org Types to the Noun Project Icons.
I’ve tried to present a framework that captures most, if not all of the kinds of questions that I’ve seen people ask when trying to get to grips with a new dataset. If we can understand the types of information people need and the questions they want to answer, then we can create a better set of data publishing and analysis tools.
To date, I think there’s been a tendency to focus on the Descriptive Data and Access Information — because we want to be able to discover data — and its Internals — so we know how to use it.
But for data to become more accessible to a non-technical audience we need to think about a broader range of information and how this might be surfaced by data publishing platforms.
If you have feedback on the framework, particularly if you think I’ve missed a category of information, then please leave a comment. The next step is to explore ways to automatically derive and surface some of this information.