DataIncubator: What Is It and What’s In It?

This article first appeared in Talis Nodalities magazine issue 8.

The Linking Open Data project has had a huge amount of success in bootstrapping the burgeoning Linked Data cloud. There’s a definite sense of momentum behind the project, and a growing number of organisations are now seriously investigating how their data could further enrich the Semantic Web, and how the underlying technologies may help them to innovate and explore new opportunities.

The Linked Data community has rightly begun to look at the next round of challenges: What can we do with all this data? How can it be pressed into service to create new applications? What kinds of frameworks do we need to support consumption of Linked Data? But it is important that we don’t lose sight of the fact that there’s still a huge amount of evangelism to be done and a great deal of data that could and should be part of the web of data. The Linked Data landscape is still not fully mapped out. In short, we need to keep up the process of accumulating, converting, publishing and linking data in as many different subject areas and disciplines as possible.

To date, the bootstrapping process has been supported by a number of community-led projects that convert and re-publish datasets to bring them into the web of data. The recently founded DataIncubator project (http://dataincubator.org) aims to adopt this same “show don’t tell” approach, but with the addition of some best practices and with an eye on long-term sustainability.

Sustainability, Repeatability, Reusability

A key goal of the project is to lightly formalise the way these dataset conversions are carried out to make sure they are sustainable, repeatable, and reusable. But why are these particular aspects important?

Firstly, let’s consider sustainability. As usage of the Linked Data cloud grows, we need to make sure that new data being added isn’t going to disappear later, for example because a small project website goes offline or because the original project owner loses interest. As serious applications begin to be built against this data, it is critical that consumers can rely on it. One of the primary ways the project is ensuring sustainability is through making use of the Talis Connected Commons scheme (http://www.talis.com/cc). All of the public domain datasets that are converted and published through the DataIncubator project site are being hosted in the Talis Platform. This takes full advantage of the free data hosting offered under the Connected Commons initiative. Talis is therefore contributing to the sustainability of that data.

The second aspect to consider is repeatability. The first goal is to make sure that the data conversion process is itself repeatable. That is, we can easily re-generate the data to allow for modelling changes, bug fixes, and the ingesting of new data. And not just now when a project is active, but in three years’ time when the project may be picked up and extended by a number of other contributors. Ensuring that each of the incubated datasets is supported by open source code makes this more achievable. Ideally, the original dataset owners will be convinced by the benefits long before a project goes stale, but it’s important to recognise that evangelism can take time and that different industries move at different speeds. There are already a few Linked Data and RDF projects on the web that model and re-publish the same basic dataset in other ways. By trying to build a community around curating the conversion of a dataset and not just the data itself, DataIncubator hopes to avoid these issues.

The final aspect is one that is often overlooked: how can the original dataset owner build on what the community has created? How can the community’s efforts be reused? Reusability is enabled by ensuring that the conversion code is open source and that schemas and modelling design decisions are well documented. This can lower the barrier to entry facing data providers or publishers looking to embrace Semantic Web technology. This is the case particularly where the data conversion is acting on source data (e.g. open, but not linked data). In this case, the data owner may merely need to re-run the data conversion and publish the Linked Data through their own site rather than DataIncubator. This makes adoption much, much easier.

Community Norms

Alongside addressing these procedural aspects of the data conversion process, the DataIncubator project also encourages a number of useful community norms that will hopefully improve the quality of the converted datasets.

The first of these is to ensure that there is a sufficient amount of both linking and attribution. Every dataset within the umbrella project should reference its original sources. This should not take place just at a high level, such as within the corresponding VoID description (http://rdfs.org/ns/void/). Instead, references should be deeper so resources can be associated with, for example, the original web pages that describe them. This ensures that there is a clear path back to the original source of the data. Attribution, in its various forms, is an important community norm in its own right, but it is especially important in the context of converting and re-publishing an existing dataset. We want to ensure that the original curators of the data don’t think that the community is trying to appropriate or steal their work. Quite the opposite: we want them to embrace it.
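To make the difference between dataset-level and resource-level attribution concrete, here is a minimal sketch in Python that emits both kinds of link as N-Triples. The choice of properties (dcterms:source for the dataset, foaf:isPrimaryTopicOf for each resource) is an assumption for illustration, not something the project prescribes, and every example.org URI is hypothetical.

```python
# Property URIs chosen for this sketch (an assumption, not a DataIncubator rule).
DCTERMS_SOURCE = "http://purl.org/dc/terms/source"
FOAF_IS_PRIMARY_TOPIC_OF = "http://xmlns.com/foaf/0.1/isPrimaryTopicOf"


def triple(subject, predicate, obj):
    """Format one N-Triples statement whose three terms are all URIs."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def attribution_triples(dataset_uri, original_dataset, resources):
    """Return attribution statements for a converted dataset.

    `resources` maps each converted resource URI to the original web
    page that describes it.
    """
    # High-level attribution: the dataset as a whole points at its source.
    lines = [triple(dataset_uri, DCTERMS_SOURCE, original_dataset)]
    # Deeper attribution: each resource points back at its own source page.
    for resource_uri, source_page in resources.items():
        lines.append(triple(resource_uri, FOAF_IS_PRIMARY_TOPIC_OF, source_page))
    return lines


if __name__ == "__main__":
    # Hypothetical URIs, used only to show the shape of the output.
    statements = attribution_triples(
        "http://incubator.example.org/dataset",
        "http://source.example.org/",
        {"http://incubator.example.org/things/42": "http://source.example.org/pages/42.html"},
    )
    print("\n".join(statements))
```

With the per-resource statements in place, a consumer that lands on any single resource can follow one link back to the page it was derived from, without first having to find the dataset description.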

The other norm relates once again to sustainability. Links to the data should be stable, but how do we achieve this if the data will ultimately be removed from the DataIncubator site and moved to another domain? The proposal here is that as data is migrated to its permanent home, redirects will be put into place to ensure that web browsers and semantic web agents can follow the links to their primary source. Every effort will be made to ensure that links don’t break.
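The redirect proposal can be sketched as a simple lookup from an incubator host to the permanent home of the data. This is only an illustration under stated assumptions: the host names are hypothetical, the function name is invented for this example, and the use of a 301 (Moved Permanently) status is a choice made here, since the proposal does not name a status code.

```python
from urllib.parse import urlsplit

# Hypothetical mapping from an incubator host to the dataset's permanent base URL.
MIGRATED = {
    "incubator.example.org": "http://data.example.org",
}


def redirect_for(url, migrated=MIGRATED):
    """Return (status, location) for a migrated URL, or None if the host has not moved."""
    parts = urlsplit(url)
    new_base = migrated.get(parts.netloc)
    if new_base is None:
        return None
    # Keep the path and query unchanged so every old link resolves to its new twin.
    location = new_base.rstrip("/") + parts.path
    if parts.query:
        location += "?" + parts.query
    # 301 is an assumption of this sketch, not part of the proposal.
    return 301, location


if __name__ == "__main__":
    print(redirect_for("http://incubator.example.org/things/42"))
```

Preserving the path means that moving a dataset only needs one entry per host, not one redirect rule per resource.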

What’s In It?

The DataIncubator project already has a wide range of datasets available:

http://nasa.dataincubator.org – Data about satellite launches from 1957 to the present day

http://ol.dataincubator.org – An attempt at improving on the modelling of the OpenLibrary data

http://discogs.dataincubator.org – A conversion of the public domain Discogs music database

http://periodicals.dataincubator.org – Data on thousands of academic journals and publishers

http://airports.dataincubator.org – Facts and figures from the OurAirports.com website, cross-linked with Yahoo! Weather

http://jacs.dataincubator.org – SKOS conversion of the Joint Academic Coding System

There’s a lot more that could yet be added to this list. My personal wishlist includes a conversion of the Prelinger Archives (http://www.archive.org/details/prelinger). This is hosted as part of the Internet Archive project and consists of over 2000 industrial, educational, travel, and propaganda videos published from 1903 to the 1970s. The content is completely within the public domain, so it’s just begging to be converted. It would also be a great dataset on which to explore the modelling of media and media annotations in general.

Currently, one domain with very little Linked Data is gaming, in all of its forms. For example, there is a vast amount of community-curated data about Lego, Lego sets, and Lego models. And what about all of the facts and figures that are routinely collected around online gaming? Data might be available through specific community websites, but what could be built if the data were more open, allowing the community to analyse and re-present this data in new ways?

It strikes me that games and gaming is an area that is ripe for exploration. There are many interesting dimensions to the data, and the communities are very engaged. Gamers are typically very interested in statistics and data about the games they play. This is just one area of the Linked Data landscape that the DataIncubator project is hoping to help explore.

Published by Leigh Dodds