How Do We Attribute Data?

This post is another in my ongoing series of “basic questions about open data”, which includes “What is a Dataset?” and “What does a dataset contain?”. In this post I want to focus on dataset attribution and in particular questions such as:

  • Why should we attribute data?
  • How are data publishers asking to be attributed?
  • What are some of the issues with attribution?
  • Can we identify some common conventions around attribution?
  • Can we monitor or track attribution?

I started to think about this because I’ve encountered a number of data publishers recently that have published Open Data but are now struggling to highlight how and where that data has been used or consumed. If data is published for anonymous download, or accessible through an open API then a data publisher only has usage logs to draw on.

I had thought that attribution might help here: if we can find links back to sources, then perhaps we can help data publishers mine the web for links and help them build evidence of usage. But it quickly became clear, as we’ll see in a moment, that there really aren’t any conventions around attribution, making it difficult to achieve this.

So let’s explore the topic from first principles and tick off my questions individually.

Why Attribute?

The obvious answer here is simply that if we are building on the work of others, then it’s only fair that those efforts should be acknowledged. This helps the creator of the data (or work, or code) be recognised for their creativity and effort, which is the very least we can do if we’re not exchanging hard cash.

There are also legal reasons why the source of some data might need to be acknowledged. Some licenses require attribution, and copyright may need to be acknowledged. As a consumer I might also want to (or need to) clearly indicate that I am not the originator of some data, in case it is found to be false, misleading, etc.

Acknowledging my sources may also help guarantee that the data I’m using continues to be available: a data publisher might be collecting evidence of successful re-use in order to justify ongoing budget for data collection, curation and publishing. This is especially true when the data publisher is not directly benefiting from the data supply, and I think it’s almost always true for public sector data. If I’m reusing some data I should make it as clear as possible that I’m doing so.

There’s some additional useful background on attribution from a public sector perspective in a document called “Supporting attribution, protecting reputation, and preserving integrity”.

It might also be useful to distinguish between:

  • Attribution — highlighting the creator/publisher of some data to acknowledge their efforts, conferring reputation
  • Citation — providing a link or reference to the data itself, in order to communicate provenance or drive discovery

While these two cases clearly overlap, the intention is often slightly different. As a user of an application, or the reader of an academic paper, I might want a clear citation to the underlying dataset so I can re-use it myself, or do some fact checking. The important use case there is tracking facts and figures back to their sources. Attribution is more about crediting the effort involved in collecting that information.

It may be possible to achieve both goals with a simple link, but I think recognising the different use cases is important.

How are data publishers asking to be attributed?

So how are data publishers asking for attribution? What follows isn’t an exhaustive survey but should hopefully illustrate some of the variety.

Let’s look first at some of the suggested wordings in some common Open Data licenses, then poke around in some terms and conditions to see how these are being applied in practice.

Attribution Statements in Common Open Data Licenses

The Open Data Commons Attribution license includes some recommended text (Section 4.3a – Example Notice):

Contains information from DATABASE NAME which is made available under the ODC Attribution License.

Where DATABASE NAME is the name of the dataset and is linked to the dataset homepage. Notice there is no mention of the originator, just the database. The license notes that in plain text the links should be written out as text. The Open Data Commons Database license has the same text (again, section 4.3a).

The UK Open Government License notes that re-users should:

…acknowledge the source of the Information by including any attribution statement specified by the Information Provider(s) and, where possible, provide a link to this licence

Where no attribution statement is specified, or multiple sources must be attributed, the suggested default text, which should include a link to the license, is:

Contains public sector information licensed under the Open Government Licence v1.0.

So again, no reference to the publisher, and no reference to the dataset either. The National Archives have some guidance on attribution which includes some other variations. These variants do suggest including more detail, such as the name of the department, date of publication, etc. These look more like typical bibliographic citations.

As another data point we can look at the Ordnance Survey Open Data License. This is a variation of the Open Government License but carries some additional requirements, specifically around attribution. The basic attribution statement is:

Contains Ordnance Survey data © Crown copyright and database right [year]

However the Code Point Open dataset has some additional attribution requirements, which also acknowledge copyright of the Royal Mail and National Statistics. All of these statements acknowledge the originators and there’s no requirement to cite the dataset itself.

Interestingly, while the previous licenses state that re-publication of data should be under a compatible license, only the OS Open Data license explicitly notes that the attribution statements must also be preserved. So both the license and attribution have viral qualities.

Attribution Statements in Terms and Conditions

Now let’s look at some specific Open Data services to see what attribution provisions they include.

Freebase is an interesting example. It draws on multiple datasets which are supplemented by contributions of its user community. Some of that data is under different licenses. As you can see from their attribution page, there are variants in attribution statements depending on whether the data is about one or several resources and whether it includes Wikipedia content, which must be specially acknowledged.

They provide a handy HTML snippet for you to include in your webpage to make sure you get the attribution exactly right. Ironically at the time of writing this service is broken (“User Rate Limit Exceeded”). If you want a slightly different attribution, then you’re asked to contact them.

Now, while Freebase might not meet everyone’s definition of Open Data, it’s an interesting data point, particularly as they ask for deep links back to the dataset, as well as having a clear expectation of where/how the attribution will be surfaced.

OpenCorporates is another illustrative example. Their legal/license info page explains that their dataset is licensed under the Open Data Commons Database License and states that:

Use of any data must be accompanied by a hyperlink reading “from OpenCorporates” and linking to either the OpenCorporates homepage or the page referring to the information in question

There are also clear expectations around the visibility of that attribution:

The attribution must be no smaller than 70% of the size of the largest bit of information used, or 7px, whichever is larger. If you are making the information available via your own API you need to make sure your users comply with all these conditions.

So there is a clear expectation that the attribution should be displayed alongside any data. Like the OS license these attribution requirements are also viral as they must be passed on by aggregators.

My intention isn’t to criticise either OpenCorporates or Freebase, but merely to highlight some real world examples.

What are some of the issues with data attribution?

Clearly we could undertake a much more thorough review than I have done here. But this is sufficient to highlight what I think are some of the key issues. Put yourself in the position of a developer consuming some Open Data under any or all of these conditions. How do you responsibly provide attribution?

The questions that occur to me, at least are:

  • Do I need to put attribution on every page of my application, or can I simply add it to a colophon? (Aside: lanyrd has a great colophon page). In some cases it seems like I might have some freedom of choice, in others I don’t
  • If I do have to put a link or some text on a page, then do I have any flexibility around its size, positioning, visibility, etc? Again, in some cases I may do, but in others I have some clear guidance to follow. This might be challenging if I’m creating a mobile application with limited screen space. Or creating a voice or SMS application.
  • What if I just re-use the data as part of some back-end analysis, but none of that data is actually surfaced to the user? How do I attribute in this scenario?
  • Do I need to acknowledge the publisher, or provide a link to the source page(s)?
  • What if I need to address multiple requirements, e.g. if I mashed up data from data.gov.uk, the Ordnance Survey, Freebase and OpenCorporates? That might get awkward.

There are no clear answers to these questions. For individual datasets I might be able to get guidance, but it requires me to read the detailed terms and conditions for the dataset or API I’m using. Isn’t the whole point of having off-the-shelf licenses like the OGL or ODbL to help us streamline data sharing? Attribution, or rather unclear or overly detailed attribution requirements, is a clear source of friction, especially if there are legal consequences for getting it wrong.

And that’s just when we’re considering integrating data sources by hand. What if we want to automatically combine data? How is a machine going to understand these conditions? I suspect that every Linked Data browser and application fails to comply with the attribution requirements of the data it’s consuming.

Of course these issues have been explored already. The Science Commons Protocol encourages publishing data into the public domain — so no legal requirement for attribution at all. It also acknowledges the “Attribution Stacking” problem (section 5.3) which occurs when trying to attribute large numbers of datasets, each with their own requirements. Too much friction discourages use, whether it’s research or commercial.

Unfortunately the recently published Amsterdam Manifesto on data citation seems to overlook these issues, requiring all authors/contributors to be attributed.

The scientific community may be more comfortable with a public domain licensing approach and a best effort attribution model because it is supported by strong social norms: citation and attribution are essential to scientific discourse. We don’t have anything like that in the broader open data community. Maybe it’s not achievable, but it seems like clear guidance would be very useful.

There’s some useful background on problems with attribution and marking requirements on the Creative Commons wiki that also references some possible amendments and clarifications.

Can we converge on some common conventions?

So would it be possible to converge on a simple set of conventions or norms around data re-use? Ideally to the extent that attribution can be simplified and automated as far as possible.

How about the following:

  • Publishers should clearly describe their attribution requirements. Ideally this should be a short simple statement (similar to the Open Government License) which includes their name and a link to their homepage. This attribution could be included anywhere on the web site or application that consumes the data.
  • Publishers should be aware that their data will be consumed in a variety of applications and on a variety of platforms. This means allowing a good deal of flexibility around how/where attribution is displayed.
  • Publishers should clearly indicate whether attribution must be passed on to down-stream users
  • Publishers should separately document their citation requirements. If they want to encourage users to link to the dataset, or to an individual page on their site, to allow users to find the original context, then they should publish instructions on how to do it. However this kind of linking is for citation, so consumers should not be bound to include it
  • Consumers should comply with publishers’ wishes and include an about page on their site or within their application that attributes the originators of the data they use. Where feasible they should also provide citations to specific resources or datasets from within their applications. This provides their users with clear citations to sources of data
  • Both sides should collaborate on structured markup to support publication of these attribution and citation requirements, as well as harvesting of links (see the sketch after this list)
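
To make that last point a little more concrete, here is a minimal sketch of the kind of query that a consumer, or an attribution-harvesting tool, might run if publishers exposed their attribution details as structured data. It assumes dataset descriptions using DCAT, Dublin Core and FOAF terms; the exact vocabulary a publisher chooses might well differ.

# A sketch query that assembles the pieces of an attribution statement
# (publisher name, homepage and license) from dataset descriptions.
# The choice of DCAT/Dublin Core/FOAF properties is an assumption.
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?dataset ?publisherName ?homepage ?license
WHERE {
  ?dataset a dcat:Dataset ;
           dct:publisher ?publisher ;
           dct:license ?license .
  ?publisher foaf:name ?publisherName .
  OPTIONAL { ?publisher foaf:homepage ?homepage }
}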

Whether attribution should be legally enforced is another discussion. Personally I’d be keen to see a common set of conventions regardless of the legal basis for doing it. Attribution should be a social norm that we encourage, strongly, in order to acknowledge the sources of our Open Data.

Thoughts on Coursera and Online Courses

I recently completed my first online course (or “MOOC”) on Coursera. It was an interesting experience and I wanted to share some thoughts here.

I decided to take an online course for several reasons. Firstly the topic, Astrobiology, was fun and I thought the short course might make an interesting alternative to watching BBC documentaries and US TV box sets. I certainly wasn’t disappointed as the course content was accessible and well-presented. As a biology graduate I found much of the content was fairly entry-level, but it was nevertheless a good refresher in a number of areas. The mix of biology, astronomy, chemistry and geology was really interesting. The course was very well attended, with around 40,000 registrants and 16,000 active students.

The second reason I wanted to try a course was because MOOCs are so popular at the moment. I was curious how well an online course would work, in terms of both content delivery and the social aspects of learning. Many courses are longer and are more rigorously marked and assessed, but the short Astrobiology course looked like it would still offer some useful insights into online learning.

Clearly some of my experiences will be specific to the particular course and Coursera, but I think some of the comments below will generalise to other platforms.

Firstly, the positives:

  • The course material was clear and well presented
  • The course tutors appeared to be engaged and actively participated in discussions
  • The ability to download the video lectures, allowing me to (re)view content whilst travelling was really appreciated. Flexibility around consuming course content seems like an essential feature to me. While the online experience will undoubtedly be richer, I’m guessing that many people are doing these courses in spare time around other activities. With this in mind, video content needs to be available in easily downloadable chunks.
  • The Coursera site itself was on the whole well constructed. It was easy to navigate to the content, tests and the discussions. The service offered timely notifications that new content and assessments had been published
  • Although I didn’t use it myself, the site offered good integration with services like Meetup, allowing students to start their own local groups. This seemed like a good feature, particularly for longer running courses.

However there were a number of areas in which I thought things could be greatly improved:

  • The online discussion forums very quickly became unmanageable. With so many people contributing, across many different threads, it’s hard to separate the signal from the noise. The community had some interesting extremes: people associated with the early NASA programme, through to alien contact and conspiracy theory nut-cases. While those particular extremes are peculiar to this course, I expect other courses may experience similar challenges
  • Related to the above point, the ability to post anonymously in forums led to trolling on a number of occasions. I’m sensitive to privacy, but perhaps pseudonyms may be better than anonymity?
  • The discussions are divorced from the content, e.g. I can’t comment directly on a video; I have to create a new thread for it in a discussion group. I wanted to see something more sophisticated, maybe SoundCloud-style annotations on the videos or per-video discussion threads.
  • No integration with wider social networks: there were discussions also happening on Twitter, G+ and Facebook. Maybe it’s better to just integrate those, rather than offer a separate discussion forum?
  • Students consumed content at different rates which meant that some discussions contained “spoilers” for material I hadn’t yet watched. This is largely a side-effect of the discussions happening independently from the content.
  • Coursera offered a course wiki but this seemed useless
  • It wasn’t clear to me during the course what would happen to the discussions after the course ended. Would they be wiped out, preserved, or would later students build on what was there already? Now that it’s finished, it looks like each course is instanced and discussions are preserved as an archive. I’m not sure what the right option is there. Starting with a clean slate seems like a good default, but can particularly useful discussions be highlighted in later courses? Seems like the course discussions would be an interesting thing to mine for links and topics, especially for lecturers

There are some interesting challenges with designing this kind of product. Unlike almost every other social application the communities for these courses don’t ramp up over time: they arrive en masse on a particular date and then more or less evaporate overnight.

As a member of that community this makes it very hard to identify which people in the community are worth listening to and who to ignore: all of a sudden I’m surrounded by 16,000 people all talking at once. When things ramp up more slowly, I can build out my social network more easily. Coursera doesn’t have any notion of study groups.

I expect the lecturers have similar challenges, as very quickly they’re faced with a lot of material that they might potentially have to read, review and respond to. This must present challenges when engaging with each new intake.

While a traditional discussion forum might provide the basic infrastructure for the necessary communication, MOOC platforms need more nuanced social features — for both students and lecturers — to support the community: features that are sensitive to the sudden growth of the community. I found myself wanting to find out things like:

  • Who is posting on which topics and how frequently?
  • Which commentators are getting up-voted (or down-voted) the most?
  • Which community members are at the same stage in the course as me?
  • Which community members have something to offer on a particular topic, e.g. because of their specific background?
  • What links are people sharing in discussions? Perhaps filtered by users.
  • What courses are my fellow students undertaking next? Are there shared journeys?
  • Is there anyone watching this material at the same time?

Answering all of these requires more than just mining discussions, but it feels like some useful metrics could nevertheless be derived. For example, one common use of the forums was to share additional material, e.g. recent news reports, scientific papers, YouTube videos, etc. That kind of content could either be collected in other ways, e.g. via a shared reading list, or be automatically surfaced from discussions as a list. I ended up sifting through the forums and creating a reading list on Readlists, as well as a YouTube playlist, just to see whether others would find them useful (they did).

All of these challenges we can see playing out in wider social media, but with a MOOC they’re often compressed into relatively short time spans.

(Perhaps inevitably) I also kept thinking that much of the process of creating, delivering and consuming the content could be improved with better linking and annotation tools. Indeed, do we even need specialised MOOC platforms at all? Why not just place all of the content on services like YouTube, ReadLists, and so on? Isn’t the web our learning infrastructure?

Well, I think there is a role for these platforms. Their role in certification — these people have taken this course — is clearly going to become more important, for example.

However I think their real value is in marking out a space within which the learning experience takes place: these people are taking this content during this period. The community needs a focal point, even if it’s short-lived.

If everything was just on the web, with no real definition to the course, then that completely dissolves the community experience. By concentrating the community into time-boxed, instanced courses, it creates focus that can enrich the experience. The challenge is balancing unwieldy MOOC “flashmobs” against a more diffused internet community.

Minecraft Activities for Younger Kids

So earlier today Thayer asked if I had any suggestions for things to do in Minecraft for younger kids. I thought I might as well write this up as a blog post rather than a long email, then everyone else can add their suggestions too.

Of course, the first thing I did was turn to the experts: my kids. Both my son and daughter have been addicted to Minecraft for some time now. We run our own home server and this gets visits from my son’s friends too. They also play on-line, but having a local server gives us a nice safe environment where we can play with the latest plugins and releases.

So most of the list below was suggested by the kids. I’ve just written it up and added in the links. We’ve assumed that Thayer and Nemi (and you) are playing on the “vanilla” (i.e. out of the box, un-patched) PC version of Minecraft. The browser, XBox, mobile and the Raspberry Pi versions lag behind a little bit so not all of these activities or options might be available.

There’s also a lot more fun to be had exploring many of the amazing mods, maps and mod-packs (bundled packages of mods) that the Minecraft community has shared. The kids are currently obsessed with the Voids Wrath mod pack which has some excellent RPG features. But we decided to focus on the vanilla experience first.

Be Creative

There are lots of ways to play Minecraft. You can opt to build things, undertake survival challenges, explore above and below ground, and push the limits of its sandbox environment in lots of different ways.

But for younger kids we thought it would be best to play on Creative and Peaceful mode. Creative mode removes the survival challenge aspects, allowing you to focus on building and exploration. In Creative mode you have instant access to every block, tool, weapon, etc. in the game. So you can cut to the chase and start building, rather than undertaking a lot of mining and exploring at the start. Peaceful mode disables all of the naturally spawning enemies, meaning you don’t have to worry about being killed by skeletons, exploded by creepers or going hungry whilst travelling.

The Creative mode inventory system can be a little daunting: there are a lot of different blocks in the game. But if you use the compass to activate the search, you can quickly find what you’re looking for. The list of blocks is on the wiki, so that can also be a handy reference.

Build a House

Building a house is obviously the first place to start. Building and decorating a house gives you a base of operations for all your other activities.

There are lots of materials to choose from, including varieties of stone, wood, and even glass. Use different coloured wools to make carpets, and then add bookshelves, paintings, beds (right-click to sleep in one), a jukebox, and lots of other types of furniture.

The Minecraft wiki has a good starting reference for building different types of shelter.

Become a farmer

Once you’ve built a house, you can next turn your hand to a spot of gardening and become a farmer.

Using a hoe, some water, some earth and the right kinds of seeds you can grow all kinds of things, including Wheat, Carrots, Cacti, Melons and even mushrooms. You can even grow giant mushrooms for that fairy tale feel.

You can use a bucket to carry water around to water your garden or to create a pool.

Build a Zoo

You’re not limited to plants. Minecraft has a lot of different types of animals, including pigs, sheep, cows, wolves and ocelots(?!). In Minecraft every animal can be hatched from an egg (unless you breed them).

In Creative mode you have access to Spawn Eggs which you can use to spawn animals. So build your zoo enclosures and then fill them with whatever animals you like.

Tame animals for Pets

Wolves and ocelots can be tamed to make pets. You can do this if you discover them in the wild or just try taming some animals from your zoo. Feed a wolf a bone and it’ll become a pet dog. If you want a pet cat, then you’ll want to tame an ocelot with some raw fish. Pets will follow you around and you can make your pet dog sit and stay, or follow you on adventures.

Ride a Pig

While we’re talking about animals, why not give Pig riding a try? You’ll be needing a saddle and a carrot on a stick.

Build something amazing

You’re only limited by your imagination when crafting giant castles, houses, hotels, beach resorts, tree houses, or models of your own home and garden. In Creative mode you’ll have full access to all the tools you need. By moving water and lava around using a bucket you can create some really cool landscapes.

If you need some space then you can create a “Superflat” world, using the advanced options in the game menu. This creates a basic, flat world that gives you plenty of space to build, but you won’t get any of those amazing Minecraft landscapes. Speaking of which…

Fly around the world

Roaming around on foot is great for exploring, but you can’t beat flying. Flying is another bonus of Creative mode and it’ll let you quickly explore huge areas to find the best place for your next awesome build. Don’t forget to pull out a map and a compass from your inventory. The map will give you a bird’s-eye view, while the compass will always point to home (e.g. where you last slept).

Other Ideas

Some other ideas we had:

  • Once you’ve mastered flying, then you could try and create some Minecraft pixel art
  • You might also want to change the skin in your Minecraft account. There are lots of sites and apps that provide tools for building and changing skins to your favourite characters.
  • You could also try out an alternative look-and-feel for Minecraft by installing a texture pack. While installing them will probably need some adult help, once installed it’s easy to switch between them. Then you can make Minecraft look more like Pokemon, or maybe just give it a more child-friendly feel.

Those were just our ideas. If you’ve got other suggestions for younger kids, or starting players, then feel free to leave a comment below.

What Does Your Dataset Contain?

Having explored some ways that we might find related data and services, as well as different definitions of “dataset”, I wanted to look at the topic of dataset description and analysis. Specifically, how can we answer the following questions:

  • what kinds of information does this dataset contain?
  • what types of entity are described in this dataset?
  • how can I determine if this dataset will fulfil my requirements?

There’s been plenty of work done around trying to capture dataset metadata, e.g. VoiD and DCAT; there’s also the upcoming workshop on Open Data on the Web. Much of that work has focused on capturing the core metadata about a dataset, e.g. who published it, when was it last updated, where can I find the data files, etc. But there’s still plenty of work to be done here, to encourage broader adoption of best practices, and also to explore ways to expose more information about the internals of a dataset.

This is a topic I’ve touched on before, and which we experimented with in Kasabi. I wanted to move “beyond the triple count” and provide a “report card” that gave a little more insight into a dataset. A report card could usefully complement an ODI Open Data Certificate, for example. Understanding the composition of a dataset can also help support new ways of manipulating and combining datasets.

In this post I want to propose a conceptual framework for capturing metadata about datasets. It’s intended as a discussion point, so I’m interested in getting feedback. (I would have submitted this to the ODW workshop but ran out of time before the deadline).

At the top level I think there are five broad categories of dataset information: Descriptive Data; Access Information; Indicators; Compositional Data; and Relationships. Compositional data can be broken down into smaller categories — this is what I described as an “information spectrum” in the Beyond the Triple Count post.

While I’ve thought about this largely from the perspective of Linked Data, I think it’s applicable to any format/technology.

Descriptive Data

This kind of information helps us understand a dataset as a “work”: its name, a human-readable description or summary, its license, and pointers to other relevant documentation such as quality control or feedback processes. This information is typically created and maintained directly by the data publisher, whereas the other categories of data I describe here can potentially be derived automatically by data analysis.

Examples:

  • Title
  • Description
  • License
  • Publisher
  • Subject Categories

Access Information

Basically, where do I get the data?

  • Where do I download the latest data?
  • Where can I download archived or previous versions of the data?
  • Are there mirrors for the dataset?
  • Are there APIs that use this data?
  • How do I obtain access to the data or API?

Indicators

This is statistical information that can help provide some insight into the data set, for example its size. But indicators can also build confidence in re-users by highlighting useful statistics such as the timeliness of releases, speed of responding to data fixes, etc.

While a data publisher might publish some of these indicators as targets that they are aiming to achieve, many of these figures could be derived automatically from an underlying publishing platform or service.

Examples of indicators:

  • Size
  • Rate of Growth
  • Date of Last Update
  • Frequency of Updates
  • Number of Re-users (e.g. size of user community, or number of apps that use it)
  • Number of Contributors
  • Frequency of Use
  • Turn-around time for data fixes
  • Number of known errors
  • Availability (for API based access)
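
To illustrate the earlier point about deriving figures automatically, here is a minimal sketch of the kind of queries a publishing platform might run against a triple store to produce a couple of these indicators. It assumes the whole dataset is loaded into the default graph and that resources carry dct:modified dates; both are assumptions rather than requirements.

# Two separate illustrative queries for automatically derived indicators.

# Size, as a simple triple count
SELECT (COUNT(*) AS ?size) WHERE { ?s ?p ?o }

# Date of last update, assuming dct:modified is used in the data
PREFIX dct: <http://purl.org/dc/terms/>
SELECT (MAX(?modified) AS ?lastUpdate)
WHERE { ?s dct:modified ?modified }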

Relationships

Relationship data primarily drives discovery use cases: to which other datasets does this dataset relate? For example the dataset might re-use identifiers or directly link to resources in other datasets. Knowing the source of that information can help us build trust in the reliability of the combined data, as well as give us sign-posts to other useful context. This is where Linked Data excels.

Annotation Datasets provide context to, and enrich other reference datasets. Annotations might be limited to linking information (“Link Sets”) or they may add new facts/properties about existing resources. Independently sourced quality control information could be published as annotations.

Provenance is also a form of relationship information. Derived datasets, e.g. created through analysis or data conversions, should refer to their original input datasets, and ideally also the algorithms and/or code that were applied.

Again, much of this information can be derived from data analysis. Recommendations for relevant related datasets might be created based on existing links between datasets or by analysing usage patterns. Set algebra on URIs in datasets can be used to do analysis on their overlap, to discover linkages and to determine whether one dataset contains annotations of another.
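
As a rough sketch of that kind of set algebra, the query below counts the resources that two datasets have in common, assuming both have been loaded into named graphs in the same store (the graph names here are hypothetical):

# Overlap between two datasets: resources described in both graphs
SELECT (COUNT(DISTINCT ?resource) AS ?overlap)
WHERE {
  GRAPH <http://example.org/graph/dataset-a> { ?resource ?p1 ?o1 }
  GRAPH <http://example.org/graph/dataset-b> { ?resource ?p2 ?o2 }
}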

Examples:

  • List of dataset(s) that this dataset draws on (e.g. re-uses identifiers, controlled vocabulary, etc)
  • List of datasets that this dataset references, e.g. via links
  • List of source datasets used to compile or create this dataset
  • List of datasets that link to this dataset (“back links”)
  • Which datasets are often used in conjunction with this dataset?

Compositional Data

This is information about the internals of a dataset: e.g. what kind of data does it contain, how is that data organized, and what kinds of things are being described?

This is the most complex area as there are potentially a number of different audiences and abilities to cater for. At one end of the spectrum we want to provide high level summaries of the contents of a dataset, while at the other end we want to provide detailed schema information to support developers. I’ve previously advocated a “progressive disclosure” approach to allow re-users to quickly find the data they need; a product manager looking for data to support a new feature will be looking for different information to a developer constructing queries over a dataset.

I think there are three broad ways that we can decompose Compositional Data further. There are particular questions and types of information that relate to each of them:

  • Scope or Coverage 
    • What kinds of things does this dataset describe? Is it people, places, or other objects?
    • How many of these things are in the dataset?
    • Is there a geographical focus to the dataset, e.g. a county, region, country or is it global?
    • Is the data confined to a particular time period (archival data) or does it contain recent information?
  • Structure
    • What are some typical example records from the dataset?
    • What schema does it conform to?
    • What graph patterns (e.g. combinations of vocabularies) are commonly found in the data?
    • How are various types of resource related to one another?
    • What is the logical data model for the data?
  • Internals
    • What RDF terms and vocabularies are used in the data?
    • What formats are used for capturing dates, times, or other structured values?
    • Are there custom validation rules for particular fields or properties?
    • Are there caveats or qualifiers to individual schema elements or data items?
    • What is the physical data model?
    • How is the dataset laid out in a particular database schema, across a collection of files, or named graphs?
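
As an illustration of how some of the “Scope or Coverage” questions might be answered automatically, here is a minimal SPARQL sketch that summarises the kinds of things a dataset describes and how many of each it contains. It assumes the data is RDF and that resources are explicitly typed.

# What kinds of things does the dataset describe, and how many of each?
SELECT ?type (COUNT(DISTINCT ?resource) AS ?total)
WHERE {
  ?resource a ?type .
}
GROUP BY ?type
ORDER BY DESC(?total)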

The experiments we did in Kasabi around the report card (see the last slides for examples) were exploring ways to help visualise the scope of a dataset. It was based on identifying broad categories of entity in a dataset. I’m not sure we got the implementation quite right, but I think it was a useful visual indicator to help understand a dataset.

This is a project I plan to revive when I get some free time. Related to this is the work I did to map the Schema.org Types to the Noun Project Icons.

Summary

I’ve tried to present a framework that captures most, if not all of the kinds of questions that I’ve seen people ask when trying to get to grips with a new dataset. If we can understand the types of information people need and the questions they want to answer, then we can create a better set of data publishing and analysis tools.

To date, I think there’s been a tendency to focus on the Descriptive Data and Access Information — because we want to be able to discover data — and its Internals — so we know how to use it.

But for data to become more accessible to a non-technical audience we need to think about a broader range of information and how this might be surfaced by data publishing platforms.

If you have feedback on the framework, particularly if you think I’ve missed a category of information, then please leave a comment. The next step is to explore ways to automatically derive and surface some of this information.

What is a Dataset?

As my last post highlighted, I’ve been thinking about how we can find and discover datasets and their related APIs and services. I’m thinking of putting together some simple tools to help explore and encourage the kind of linking that my diagram illustrated.

There’s some related work going on in a few areas which is also worth mentioning:

  • Within the UK Government Linked Data group there’s some work progressing around the notion of a “registry” for Linked Data that could be used to collect dataset metadata as well as supporting dataset discovery. There’s a draft specification which is open for comment. I’d recommend you ignore the term “registry” and see it more as a modular approach for supporting dataset discovery, lightweight Linked Data publishing, and “namespace management” (aka URL redirection). A registry function is really just one aspect of the model.
  • There’s an Open Data on the Web workshop in April which will cover a range of topics including dataset discovery. My current thoughts are partly preparation for that event (and I’m on the Programme Committee)
  • There’s been some discussion and a draft proposal for adding the Dataset type to Schema.org. This could result in the publication of more embedded metadata about datasets. I’m interested in tools that can extract that information and do something useful with it.

Thinking about these topics I realised that there are many definitions of “dataset”. Unsurprisingly it means different things in different contexts. If we’re defining models, registries and markup for describing datasets we may need to get a sense of what these different definitions actually are.

As a result, I ended up looking around for a series of definitions and I thought I’d write them down here.

Definitions of Dataset

Let’s start with the most basic; for example, Dictionary.com has the following definition:

“a collection of data records for computer processing”

Which is pretty vague. Wikipedia has a definition which derives from the term’s use in a mainframe environment:

“A dataset (or data set) is a collection of data, usually presented in tabular form. Each column represents a particular variable. Each row corresponds to a given member of the dataset in question. It lists values for each of the variables, such as height and weight of an object. Each value is known as a datum. The dataset may comprise data for one or more members, corresponding to the number of rows.

Nontabular datasets can take the form of marked up strings of characters, such as an XML file.”

The W3C Data Catalog Vocabulary defines a dataset as:

“A collection of data, published or curated by a single source, and available for access or download in one or more formats.”

The JISC “Data Information Specialists Committee” have a definition of dataset as:

“…a group of data files–usually numeric or encoded–along with the documentation files (such as a codebook, technical or methodology report, data dictionary) which explain their production or use. Generally a dataset is un-usable for sound analysis by a second party unless it is well documented.”

Which is a good definition as it highlights that the dataset is more than just the individual data files or facts, it also consists of some documentation that supports its use or analysis. I also came across a document called “A guide to data development” (2007) from the National Data Development and Standards Unit in Australia which describes a dataset as

“A data set is a set of data that is collected for a specific purpose. There are many ways in which data can be collected—for example, as part of service delivery, one-off surveys, interviews, observations, and so on. In order to ensure that the meaning of data in the data set is clearly understood and data can be consistently collected and used, data are defined using metadata…”

This too has the notion of context and clear definitions to support usage, but also notes that the data may be collected in a variety of ways.

A Legal Definition

As it happens, there’s also a legal definition of a dataset in the UK, at least as far as it relates to Freedom of Information. The “Protection of Freedoms Act 2012” Part 6, (102) c includes the following definition:

In this Act “dataset” means information comprising a collection of information held in electronic form where all or most of the information in the collection—

  • (a)has been obtained or recorded for the purpose of providing a public authority with information in connection with the provision of a service by the authority or the carrying out of any other function of the authority,
  • (b)is factual information which—
    • (i)is not the product of analysis or interpretation other than calculation, and
    • (ii)is not an official statistic (within the meaning given by section 6(1) of the Statistics and Registration Service Act 2007), and
  • (c)remains presented in a way that (except for the purpose of forming part of the collection) has not been organised, adapted or otherwise materially altered since it was obtained or recorded.”

This definition is useful as it defines the boundaries for what type of data is covered by Freedom of Information requests. It clearly states that the data is collected as part of the normal business of the public body and also that the data is essentially “raw”, i.e. it is not the result of analysis and has not been adapted or altered.

Raw data (as defined here!) is more useful as it supports more downstream usage. Raw data has more potential.

Statistical Datasets

The statistical community has also worked towards having a clear definition of dataset. The OECD Glossary defines a Dataset as “any organised collection of data”, but then includes context that describes that further: for example, that a dataset is a set of values that have a common structure and are usually thematically related. However, there’s also this note which suggests that a dataset may also be made up of derived data:

“A data set is any permanently stored collection of information usually containing either case level data, aggregation of case level data, or statistical manipulations of either the case level or aggregated survey data, for multiple survey instances”

Privacy is one key reason why a dataset may contain derived information only.

The RDF Data Cube vocabulary, which borrows heavily from SDMX — a key standard in the statistical community — defines a dataset as being made up of several parts:

  1. “Observations – This is the actual data, the measured numbers. In a statistical table, the observations would be the numbers in the table cells.
  2. Organizational structure – To locate an observation within the hypercube, one has at least to know the value of each dimension at which the observation is located, so these values must be specified for each observation…
  3. Internal metadata – Having located an observation, we need certain metadata in order to be able to interpret it. What is the unit of measurement? Is it a normal value or a series break? Is the value measured or estimated?…
  4. External metadata — This is metadata that describes the dataset as a whole, such as categorization of the dataset, its publisher, and a SPARQL endpoint where it can be accessed.”

The SDMX implementors guide has a long definition of dataset (page 7) which also focuses on the organisation of the data and specifically how individual observations are qualified along different dimensions and measures.

Scientific and Research Datasets

Over the last few years the scientific and research community have been working towards making their datasets more open, discoverable and accessible. Organisations like the Wellcome Trust have published guidance for researchers on data sharing; services like CrossRef and DataCite provide the means for giving datasets stable identifiers; and platforms like FigShare support the publishing and sharing process.

While I couldn’t find a definition of dataset from that community (happy to take pointers!), it’s clear that the definition of dataset is extremely broad. It could cover everything from raw results, e.g. output from sensors or equipment, through to more analysed results. The boundaries are hard to define.

Given the broad range of data formats and standards, services like FigShare accept any or all data formats. But as the Wellcome Trust note:

“Data should be shared in accordance with recognised data standards where these exist, and in a way that maximises opportunities for data linkage and interoperability. Sufficient metadata must be provided to enable the dataset to be used by others. Agreed best practice standards for metadata provision should be adopted where these are in place.”

This echoes the earlier definitions that included supporting materials as being part of the dataset.

RDF Datasets

I’ve mentioned a couple of RDF vocabularies already, but within the RDF and Linked Data community there are a couple of other definitions of dataset to be found. The Vocabulary of Interlinked Datasets (VoiD) is similar to, but predates, DCAT. Whereas DCAT focuses on describing a broad class of different datasets, VoiD describes a dataset as:

“…a set of RDF triples that are published, maintained or aggregated by a single provider…the term dataset has a social dimension: we think of a dataset as a meaningful collection of triples, that deal with a certain topic, originate from a certain source or process, are hosted on a certain server, or are aggregated by a certain custodian. Also, typically a dataset is accessible on the Web, for example through resolvable HTTP URIs or through a SPARQL endpoint, and it contains sufficiently many triples that there is benefit in providing a concise summary.”

Like the more general definitions this includes the notion that the data may relate to a specific topic or be curated by a single organisation. But this definition also makes some assumption about the technical aspects of how the data is organised and published. VoiD also includes support for linking to the services that relate to a dataset.

Along the same lines, SPARQL also has a definition of a Dataset:

“A SPARQL query is executed against an RDF Dataset which represents a collection of graphs. An RDF Dataset comprises one graph, the default graph, which does not have a name, and zero or more named graphs, where each named graph is identified by an IRI…”

Unsurprisingly for a technical specification this is a very narrow definition of dataset. It also differs from the VoiD definition. While both assume RDF as the means for organising the data, the VoiD term is more general, e.g. it glosses over details of the internal organisation of the dataset into named graphs. This results in some awkwardness when attempting to navigate between a VoiD description and a SPARQL Service Description.
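
To make the SPARQL sense of “dataset” a little more concrete, here is a minimal query sketch that reads from both parts of a SPARQL dataset: the default graph and any named graphs. The structure is generic; the graph names in a real dataset are whatever the publisher chose.

# Pull triples from the default graph and from each named graph,
# binding the graph IRI (?g) for the latter.
SELECT ?g ?s ?p ?o
WHERE {
  { ?s ?p ?o }                 # matches the unnamed default graph
  UNION
  { GRAPH ?g { ?s ?p ?o } }    # matches each named graph in turn
}
LIMIT 20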

Summary

If you’ve gotten this far, then well done :)

I think there are a couple of things we can draw out from these definitions which might help us when discussing “datasets”:

  • There’s a clear sense that a dataset relates to a specific topic and is collected for a particular purpose.
  • The means by which a dataset is collected and the definitions of its contents are important for supporting proper re-use
  • Whether a dataset consists of “raw data” or more analysed results can vary across communities. Both forms of dataset might be available, but in some circumstances (e.g. for privacy reasons) only derived data might be published
  • Depending on your perspective and your immediate use case the dataset may be just the data items, perhaps expressed in a particular way (e.g. as RDF).  But in a broader sense, the dataset also includes the supporting documentation, definitions, licensing statements, etc.

While there’s a common core to these definitions, different communities do have slightly different outlooks that are likely to affect how they expect to publish, describe and share data on the web.

Dataset and API Discovery in Linked Data

I’ve been recently thinking about how applications can discover additional data and relevant APIs in Linked Data. While there’s been lots of research done on finding and using (semantic) web services I’m initially interested in supporting the kind of bootstrapping use cases covered by Autodiscovery.

We can characterise that use case as helping to answer the following kinds of questions:

  • Given a resource URI, how can I find out which dataset it is from?
  • Given a dataset URI, how can I find out which resources it contains and which APIs might let me interact with it?
  • Given a domain on the web, how can I find out whether it exposes some machine-readable data?
  • Where is the SPARQL endpoint for this dataset?

More succinctly: can we follow our nose to find all related data and APIs?

I decided to try and draw a diagram to illustrate the different resources involved and their connections. I’ve included a small version below:

Data and API Discovery with Linked Data

Let’s run through the links between different types of resources:

  • From Dataset to SPARQL Endpoint (and Item Lookup, and Open Search Description): this is covered by VoiD, which provides simple predicates for linking a dataset to three types of resources (there’s a small query sketch after this list). I’m not aware of other types of linking yet, but it might be nice to support reconciliation APIs.
  • From Well-Known VoiD Description (background) to Dataset. This well known URL allows a client to find the “top-level” VoiD description for a domain. It’s not clear what that entails, but I suspect the default option will be to serve a basic description of a single dataset, with reference to sub-sets (void:subset) where appropriate. There might also just be rdfs:seeAlso links.
  • From a Dataset to a Resource. A VoiD description can include example resources, this blesses a few resources in the dataset with direct links. Ideally these resources ought to be good representative examples of resources in the dataset, but they might also be good starting points for further browsing or crawling.
  • From a Resource to a Resource Description. If you’re using “slash” URIs in your data, then URIs will usually redirect to a resource description that contains the actual data. The resource description might be available in multiple formats and clients can use content negotiation or follow Link headers to find alternative representations.
  • From a Resource Description to a Resource. A description will typically have a single primary topic, i.e. the resource it’s describing. It might also reference other related resources, either as direct relationships between those resources or via rdfs:seeAlso type links (“more data over here”).
  • From a Resource Description to a Dataset. This is where we might use a dct:source relationship to state that the current description has been extracted from a specific dataset.
  • From a SPARQL Endpoint (Service Description) to a Dataset. Here we run into some differences between definitions of dataset, but essentially we can describe in some detail the structure of the SPARQL dataset that is used in an endpoint and tie that back to the VoiD description. I found myself looking for a simple predicate that linked to a void:Dataset rather than describing the default and named graphs, but couldn’t find one.
  • I couldn’t find any way to relate a Graph Store to a Dataset or SPARQL endpoint. Early versions of the SPARQL Graph Store protocol had some notes on autodiscovery of descriptions, but these aren’t in the latest versions.
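
Pulling a few of those VoiD links together, here is a minimal sketch of a “follow your nose” query: given a dataset URI (a placeholder below), it looks up the endpoints and example resources that a VoiD description may advertise. It assumes the VoiD description itself is loaded and queryable.

PREFIX void: <http://rdfs.org/ns/void#>

SELECT ?endpoint ?lookup ?search ?example
WHERE {
  # The dataset URI is a placeholder for illustration
  VALUES ?dataset { <http://example.org/dataset/my-dataset> }
  ?dataset a void:Dataset .
  OPTIONAL { ?dataset void:sparqlEndpoint ?endpoint }
  OPTIONAL { ?dataset void:uriLookupEndpoint ?lookup }
  OPTIONAL { ?dataset void:openSearchDescription ?search }
  OPTIONAL { ?dataset void:exampleResource ?example }
}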

These links are expressed, for the most part, in the data but could also be present as Link headers in HTTP responses or in HTML (perhaps with embedded RDFa).

I’ve also not covered sitemaps at all, which provide a way to exhaustively list the key resources in a website or dataset to support mirroring and crawling. But I thought this diagram might be useful.

I’m not sure that the community has yet standardised on best practices for all of these cases and across all formats. That’s one area of discussion I’m keen to explore further.

Publishing SPARQL queries and documentation using github

Yesterday I released an early version of sparql-doc, a SPARQL documentation generator. I’ve got plans for improving the functionality of the tool, but I wanted to briefly document how to use github and sparql-doc to publish SPARQL queries and their documentation.

Create a github project for the queries

First, create a new github project to hold your queries. If you’re new to github then you can follow their guide to get a basic repository set up with a README file.

The simplest way to manage queries in github is to publish each query as a separate SPARQL query file (with the extension “.rq”). You should also add a note somewhere specifying the license associated with your work.
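
For example, a query file in the repository might look something like the following. This is purely illustrative: the file name, prefixes and data it assumes are not taken from any particular dataset.

# list-titles.rq: a hypothetical example query file
PREFIX dct: <http://purl.org/dc/terms/>

SELECT ?book ?title
WHERE {
  ?book dct:title ?title .
}
LIMIT 10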

As an example, here is how I’ve published a collection of SPARQL queries for the British National Bibliography.

When you create a new query be sure to add it to your repository and regularly commit and push the changes:

git add my-new-query.rq
git commit my-new-query.rq -m "Added another example"
git push origin master

The benefit of using github is that users can report bugs or submit pull requests to contribute improvements, optimisations, or new queries.

Use sparql-doc to document your queries

When you’re writing your queries, follow the commenting conventions encouraged by sparql-doc. You should also install sparql-doc so you can use it from the command-line. (Note: the installation guide and notes still need some work!)

As a test you can try generating the documentation from your queries as follows. Assuming your github project is checked out at, e.g. ~/projects/examples, make a new local directory for the generated documentation, e.g. ~/projects/docs, then execute the following to create the docs:

sparql-doc ~/projects/examples ~/projects/docs

You should then be able to open up the index.html document in ~/projects/docs to see how the documentation looks.

If you have an existing web server somewhere then you can just zip up those docs and put them somewhere public to share them with others. However, you can also publish them via Github Pages. This means you don’t have to set up any web hosting at all.

Use github pages to publish your docs

Github Pages allows github users to host public, static websites directly from github projects. It can be used to publish blogs or other project documentation. But using github pages can seem a little odd if you’re not familiar with git.

Effectively what we’re going to do is create a separate collection of files — the documentation — that sits in parallel to the actual queries. In git terms this is done by creating a separate “orphan” branch. The documentation lives in the branch, which must be called gh-pages, while the queries will remain in the master branch.

Again, github have a guide for manually adding and pushing files for hosting as pages. The steps below follow the same process, but using sparql-doc to create the files.

Github recommend starting with a fresh separate checkout of your project. You then create a new branch in that checkout, remove the existing files and replace them with your documentation.

As a convention I suggest that when you check out the project a second time, you give it a separate name, e.g. by adding a “-pages” suffix.

So for my bnb-queries project, I will have two separate checkouts:

The original checkout I did when setting up the project for the BNB queries was:

git clone git@github.com:ldodds/bnb-queries.git

This gave me a local ~/projects/bnb-queries directory containing the main project code. So to create the required github pages branch, I would do this in the ~/projects directory:

#specify different directory name on clone
git clone git@github.com:ldodds/bnb-queries.git bnb-queries-pages
#then follow the steps as per the github guide to create the pages branch
cd bnb-queries-pages
git checkout --orphan gh-pages
#then remove existing files
git rm -rf .

This gives me two parallel directories: one containing the master branch and the other the documentation branch. Make sure you really are working with a separate checkout before deleting and replacing all of the files!

To generate the documentation I then run sparql-doc telling it to read the queries from the project directory containing the master branch, and then use the directory containing the gh-pages branch as the output, e.g.:

sparql-doc ~/projects/bnb-queries ~/projects/bnb-queries-pages

Once that is done, I then add, commit and push the documentation, as per the final step in the github guide:

cd ~/projects/bnb-queries-pages
git add *
git commit -am "Initial commit"
git push origin gh-pages

The first time you do this, it'll take about 10 minutes for the pages to become active. They will appear at the following URL:

http://USER.github.com/PROJECT

E.g. http://ldodds.github.com/sparql-doc

If you add new queries to the project, be sure to re-run sparql-doc and add/commit the updated files.

Hopefully that’s relatively clear.

The main thing to understand is that locally you’ll have your github project checked out twice: once for the main line of code changes, and once for the output of sparql-doc.

Each will need to be updated, committed and pushed separately. In practice this is very straightforward, and it means you can publicly share queries and their documentation without needing any web hosting.

sparql-doc

Developers often struggle with SPARQL queries. There aren’t always enough good examples to play with when learning the language or when trying to get to grips with a new dataset. Data publishers often overlook the need to publish examples or, if they do, rarely include much descriptive documentation.

I’ve also been involved with projects that make heavy use of SPARQL queries. These are often externalised into separate files to allow them to be easily tuned or tweaked without having to change code. Having documentation on what a query does and how it should be used is useful. I’ve seen projects that have hundreds of different queries.

It occurred to me that while we have plenty of tools for documenting code, we don't have a tool for documenting SPARQL queries. If generating and publishing documentation were a little more frictionless, then perhaps people would do it more often. Services like SPARQLbin are useful, but address a slightly different use case.

Today I've hacked up the first version of a tool that I'm calling sparql-doc. Its primary use is likely to be helping to publish good SPARQL examples, but it might also make a good addition to existing code/project documentation tools.

The code is up on github for you to try out. It's still very rough-and-ready but already produces some useful output. You can see a short example here.

It's very simple and adopts the same approach as tools like rdoc and Javadoc: it just specifies some conventions for writing structured comments. Currently it supports adding a title, description, list of authors, tags, and related links to a query. Because the syntax is simple, I'm hoping that other SPARQL tools and IDEs will support it.
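
By way of illustration, a documented query following these conventions might look something like this (an illustrative sketch; treat the @ annotation names and the @see URL as assumptions and check the sparql-doc README for the exact syntax):

# List the classes used in a dataset
#
# Returns the distinct classes found in the data, which is a handy
# starting point when exploring an unfamiliar endpoint.
#
# @author Jane Example
# @tags exploration, schema
# @see http://example.org/related-documentation

SELECT DISTINCT ?class
WHERE {
  ?s a ?class .
}
LIMIT 50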

I plan to improve the documentation output to provide more ways to navigate the queries, e.g. by tag, query type, prefix, etc.

Let me know what you think!

A Brief Review of the Land Registry Linked Data

The Land Registry have today announced the publication of their Open Data — including both Price Paid information and Transactions as Linked Data. This is great to see, as it means that there is another UK public body making a commitment to Linked Data publishing.

I’ve taken some time to begin exploring the data. This blog post provides some pointers that may help others in using the Linked Data. I’m also including some hopefully constructive feedback on the approach that the Land Registry have taken.

The Land Registry Linked Data

The Linked Data is available from http://landregistry.data.gov.uk. This follows the general pattern used by other organisations publishing public sector Linked Data in the UK.

The data consists of a single SPARQL endpoint — based on the Open Source Fuseki server — which contains RDF versions of both the Price Paid and Transaction data. The documentation notes that the endpoint will be updated on the 20th of each month with the equivalent of the monthly releases that are already published as CSV files.

Based on some quick tests, it would appear that the endpoint contains all of the currently published Open Data, which in total is 16,873,170 triples covering 663,979 transactions.
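
Queries along the following lines, run individually against the endpoint, will reproduce this kind of count (a sketch; I'm using the lrppi:pricePaid predicate, taken from the example query later in this post, as a rough proxy for counting transactions):

# total number of triples in the store
SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }

# rough count of price paid records, based on one of the lrppi predicates
PREFIX lrppi: <http://landregistry.data.gov.uk/def/ppi/>
SELECT (COUNT(DISTINCT ?transx) AS ?transactions)
WHERE { ?transx lrppi:pricePaid ?amount }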

The data seems to primarily use custom vocabularies, notably the Price Paid terms under http://landregistry.data.gov.uk/def/ppi/ and the common terms (addresses, postcodes, etc.) under http://landregistry.data.gov.uk/def/common/, both of which appear as prefixes in the example query below.

The landing page for the data doesn't include any example resources, but it only takes a few simple SPARQL queries to dig some out.
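
The following, for example, pulls back a handful of price paid records along with their dates, amounts and addresses (a sketch; the predicates are taken from the fuller example query below):

PREFIX lrppi: <http://landregistry.data.gov.uk/def/ppi/>
SELECT ?transx ?date ?amount ?addr
WHERE {
  ?transx lrppi:transactionDate ?date .
  ?transx lrppi:pricePaid ?amount .
  ?transx lrppi:propertyAddress ?addr .
}
LIMIT 5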

So for Price Paid Data, the model appears to be that a Transaction has a Transaction Record which in turn has an associated Address. The transaction counts seem to be standalone resources.

The SPARQL endpoint for the data is at http://landregistry.data.gov.uk/landregistry/sparql. A test form is also available and that page has a couple of example queries, including getting Price Paid data based on a postcode search.

However I'd suggest that the following version might be slightly better, as it includes the record status, which indicates whether the record is an "add" or a "delete":

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX lrppi: <http://landregistry.data.gov.uk/def/ppi/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX lrcommon: <http://landregistry.data.gov.uk/def/common/>
SELECT ?paon ?saon ?street ?town ?county ?postcode ?amount ?date ?status
WHERE {
  ?transx lrppi:pricePaid ?amount .
  ?transx lrppi:transactionDate ?date .
  ?transx lrppi:propertyAddress ?addr .
  ?transx lrppi:recordStatus ?status .

  ?addr lrcommon:postcode "PL6 8RU"^^xsd:string .
  ?addr lrcommon:postcode ?postcode .

  OPTIONAL { ?addr lrcommon:county ?county . }
  OPTIONAL { ?addr lrcommon:paon ?paon . }
  OPTIONAL { ?addr lrcommon:saon ?saon . }
  OPTIONAL { ?addr lrcommon:street ?street . }
  OPTIONAL { ?addr lrcommon:town ?town . }
}
ORDER BY ?amount

General Feedback

Let's start with the good points:

  • The data is clearly licensed so is open for widespread re-use
  • There is a clear commitment to regularly updating the data, so it should stay in line with the Land Registry's other Open Data. This makes it safer for developers to rely on the data and the identifiers it contains
  • The data uses Patterned URIs based on Shared Keys (the Land Registry’s own transaction identifiers) so building links is relatively straight-forward
  • The vocabularies are documented and the URIs resolve, so it is possible to lookup the definitions of terms. I’m already finding that easier than digging through the FAQs that the Land Registry publish for the CSV versions.

However I think there is room for improvement in a number of areas:

  • It would be useful to have more example queries, e.g. how to find the transactional data, as well as example Linked Data resources. A key benefit of a linked dataset is that you should be able to explore it in your browser. I had to run SPARQL queries to find simple examples
  • The SPARQL form could be improved: currently it uses a POST by default and so I don’t get a shareable URL for my query; the Javascript in the page also wipes out my query every time I hit the back button, making it frustrating to use
  • The vocabularies could be better documented, for example a diagram showing the key relationships would be useful, as would a landing page providing more of a conceptual overview
  • The URIs in the data don’t match the patterns recommended in Designing URI Sets for the Public Sector. While I believe that guidance is under review, the data is diverging from current documented best practice. Linked Data purists may also lament the lack of a distinction between resource and page.
  • The data uses custom vocabularies where existing vocabularies would fit the bill. The transactional statistics could have been adequately described by the Data Cube vocabulary, with custom terms for the dimensions; the related organisations could have been described by the ORG ontology; and vCard, with extensions, ought to have covered the address information.

But I think the biggest oversight is the lack of linking, both internal and external. The data uses “strings” where it could have used “things”: for places, customers, localities, post codes, addresses, etc.

Improving the internal linking will make the dataset richer, e.g. allowing navigation to all transactions relating to a specific address, or all transactions for a specific town or postcode region. I've struggled to get a Post Code District based query to work (e.g. "price paid information for BA1") because the query has to resort to regular expressions, which are often poorly optimised in triple stores. Matching based on URIs is always much faster and more reliable.
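
To illustrate the problem, a district-level search currently has to fall back on a filter along these lines (a sketch reusing the prefixes from the query above; it is the regular expression that makes it slow):

PREFIX lrppi: <http://landregistry.data.gov.uk/def/ppi/>
PREFIX lrcommon: <http://landregistry.data.gov.uk/def/common/>
SELECT ?transx ?postcode ?amount
WHERE {
  ?transx lrppi:pricePaid ?amount .
  ?transx lrppi:propertyAddress ?addr .
  ?addr lrcommon:postcode ?postcode .
  # match any postcode in the BA1 district
  FILTER regex(?postcode, "^BA1 ")
}
LIMIT 100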

External linking could have been improved in two ways:

  1. The dates in the transactions could have been linked to the UK Government Interval Sets, which provide URIs for individual days.
  2. The postcode, locality, district and other regional information could have been linked to the Ordnance Survey Linked Data, which already has URIs for all of these resources. While it may have been a little more work to match regions, the postcode-based URIs are predictable and so trivial to generate.

These improvements would have moved the Land Registry data from 4 to 5 Stars with little additional effort. That does more than tick boxes: it makes the entire dataset easier to consume, query and remix with others.

Hopefully this feedback is useful for others looking to consume the data or who might be undertaking similar efforts. I’m also hoping that it is useful to the Land Registry as they evolve their Linked Data offering. I’m sure that what we’re seeing so far is just the initial steps.

How I organise data conversions

Factual announced a new project last week, called Drake, which is billed as a "make for data". The tool provides a make-style environment for building data conversion workflows; it has support for multiple programming languages, uses a standard project layout, and integrates with HDFS.

It looks like a really nice tool and I plan to take a closer look at it. When you're doing multiple data conversions, particularly in a production setting, it's important to adopt some standard practices. Having a consistent way to manage assets, convert data and manage workflows is really useful. Quick and dirty data conversions might get the job done, but a little thought up front can save time later when you need to refresh a dataset, fix bugs, or allow others to contribute. Consistency also helps when you come to add another layer of automation, to run a number of conversions on a regular basis.

I've done a fair few data conversions over the last few years and I've already adopted an approach similar to Drake's: I use a standard workflow, programming environment and project structure. I thought I'd write this down here in case it's useful for others. It's certainly saved me time. I'd be interested to learn what approaches other people take to help organise their data conversions.

Project Layout

My standard project layout is:

  • bin — the command-line scripts used to run a conversion. I tend to keep these task based, each focusing on one element of the workflow, e.g. separate scripts for crawling data, converting particular types of data, etc. Scripts are parameterised with input/output directories and/or filenames
  • data — created automatically, this sub-directory holds the output data
    • cache — a cache directory for all data retrieved from the web. When crawling or scraping data I always work on a local cached copy to avoid unnecessary network traffic
    • nt (or rdf) — for RDF conversions I typically generate ntriples output as it's simple to generate and work with in a range of tools. I sometimes generate RDF/XML output, but only if I'm using XSLT to do transformations from XML sources
  • etc — additional supporting files, including:
    • static — static data, e.g. hand-crafted data sources, RDF schema files, etc
    • sparql — SPARQL queries that are used in the conversion, as part of the "enrichment" phase
    • xslt — XSLT transforms, used when I'm working with XML input and have found it easier to process with XSLT rather than libxml
  • lib — the actual code for the conversion. The scripts in the bin directory handle input/output; the rest is done in Ruby classes
  • Rakefile — a Ruby Rakefile that describes the workflow. I use this to actually run the conversions

While there are some minor variations, I've used this same structure across a number of different conversions.

Workflow

The workflow for the conversion is managed using a Ruby Rakefile. Like Factual, I've found that a make-style environment is useful for organising simple data conversion workflows. Rake allows me to execute command-line tools, e.g. curl for downloading data or rapper for doing RDF format conversions, to execute arbitrary Ruby code, and to shell out to dedicated scripts.

I try to use a standard set of rake targets to co-ordinate the overall workflow. These are broken down into smaller stages where necessary. While the steps vary between datasets, the stages I most often use are:

  1. download (or cache) — the main starting point, which fetches the necessary data. I try and avoid manually downloading any data, relying on curl or perhaps dpm to get the required files. I've tended to use "download" when I'm just grabbing static files and "cache" when I'm doing a website crawl; this is just a cue for me. I like to tread carefully when hitting other people's servers, so I aggressively cache files. Having a separate stage to grab data is also handy when you're working offline on later steps
  2. convert — perform the actual conversion, working on the locally cached files only. So far I tend to use either custom Ruby code or XSLT.
  3. reconcile — generate links to other datasets, often using the Google Refine Reconciliation API
  4. enrich — enrich the dataset with additional data, e.g. by performing SPARQL queries to fetch remote data, or materialise new data
  5. package — package up the generated output as a tar.gz file
  6. publish — the overall target which runs all of the above

The precise stages used vary between projects and there are usually a number of other targets in the Rakefile that perform specific tasks, for example the convert stage is usually dependent on several other steps that generate particular types of data. But having standard stage names makes it easier to run specific parts of the overall conversion. One additional stage that would be useful to have is “validation“, so you can check the quality of the output.
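
To make that concrete, here's a rough sketch of how a Rakefile using these stage names might hang together (purely illustrative and not taken from a real project; the bin/ script names, directories and URL are placeholders):

# Rakefile - sketch of a data conversion workflow
DATA_DIR  = "data"
CACHE_DIR = "#{DATA_DIR}/cache"
NT_DIR    = "#{DATA_DIR}/nt"

directory CACHE_DIR
directory NT_DIR

desc "Fetch the source files into the local cache"
task :download => CACHE_DIR do
  # placeholder: grab a static file with curl
  sh "curl -s -o #{CACHE_DIR}/source.csv http://example.org/source.csv"
end

desc "Convert the cached files into ntriples"
task :convert => [:download, NT_DIR] do
  # placeholder: call out to a task-based script in bin/
  sh "ruby bin/convert.rb #{CACHE_DIR} #{NT_DIR}"
end

desc "Generate links to other datasets"
task :reconcile => :convert do
  sh "ruby bin/reconcile.rb #{NT_DIR}"
end

desc "Enrich the output with additional data"
task :enrich => :reconcile do
  sh "ruby bin/enrich.rb #{NT_DIR}"
end

desc "Package the generated output"
task :package => :enrich do
  sh "tar -czf #{DATA_DIR}/output.tar.gz -C #{NT_DIR} ."
end

desc "Run the full conversion"
task :publish => :package

With something like this in place, rake publish runs the whole workflow from scratch, while rake convert re-runs just the conversion against the locally cached files.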

At various times I've considered formalising these stages further, e.g. by creating some dedicated Rake extensions, but I've not yet found the need to do that as there's usually very little code in each step.

I tend to separate out dependencies on external resources, e.g. remote APIs, from the core conversion. The convert stage will work entirely on locally cached data and then I can call out to other APIs in a separate reconcile or enrich stage. Again, this helps when working on parts of the conversion offline and allows the core conversion to happen without risk of failure because of external dependencies. If a remote API fails, I don't want to have to re-run a potentially lengthy data conversion; I just want to resume from a known point.

I also try and avoid, as far as possible, using extra infrastructure, e.g. relying on databases, triple stores, or a particular environment. While this might help improve performance in some cases (particularly for large conversions), I like to minimise dependencies to make it easier to run the conversions in a range of environments, with minimal set-up and minimal cost for anyone running the conversion code. Many of the conversions I've been doing are relatively small scale; for larger datasets a triple store or Hadoop might be necessary, but this would be easy to integrate into the existing stages, perhaps by adding a "prepare" stage to do any necessary installation and configuration.

For me it's very important to be able to automate the download of the initial data files or web pages that need scraping. This allows the whole process to be automated and cached files to be re-used where possible, which simplifies the process of using the data and avoids unnecessary load on data repositories. As I noted at the end of yesterday's post on using dpm with data.gov.uk, having easy access to the data files is important. The context for interpreting that data mustn't be overlooked, but consuming that information is done separately from using the data.

To summarise, there’s nothing very revolutionary here: I’m sure many of you use similar and perhaps better approaches. But I wanted to share my style for organising conversions and encourage others to do likewise.
