How I organise data conversions

Factual announced a new project last week, called Drake which is billed as a “make for data”. The tool provides a make style environment for building workflows for data conversions, it has support for multiple programming languages, uses a standard project layout, and integrates with HDFS.

It looks like a really nice tool and I plan to take a closer look at it. When you’re doing multiple data conversions, particularly in a production setting, its important to adopt some standard practices. Having a consistent way to manage assets, convert data and manage workflows is really useful. Quick and dirty data conversions might get the job done, but a little thought up front can save time later when you need to refresh a dataset, fix bugs, or allow others to contribute. Consistency also helps when you come to add another layer of automation, to run a number of conversions on a regular basis.

I’ve done a fair few data conversions over the last few years and I’ve already adopted a similar approach to Drake: I use a standard workflow, programming environment and project structure. I thought I’d write this down here in case its useful for others. Its certainly saved me time. I’d be interested to learn what approaches other people take to help organise their data conversions.

Project Layout

My standard project layout is:

bin — the command-line scripts used to run a conversion. I tend to keep these task based, e.g. focusing on one element of the workflow or conversion. E.g. separate scripts for crawling data, converting types of data, etc. Scripts are parameterised with input/output directories and/or filenames
data — created automatically this sub-directory holds the output data
- cache — a cache directory for all data retrieved from the web. when crawling or scraping data I always work on a local cached copy to avoid unnecessary network traffic
- nt (or rdf) — for RDF conversions I typically generate ntriple output as its simple to generate and work with in a range of tools. I sometimes generate RDF/XML output, but only if I’m using XSLT to do transformations from XML sources
etc — additional supporting files, including:
- static — static data, e.g. hand-crafted data sources, RDF schema files, etc
- sparql — SPARQL queries that are used in the conversion, as part of the “enrichment” phase
- xslt — For keeping XSLT transforms when I’m using XML input and have found it easier to process using XSLT rather than using libxml.
lib — the actual code for the conversion. The scripts in the bin directory handle the input/output, the rest is done is Ruby classes
Rakefile — a Ruby Rakefile that describes the workflow. I use this to actually run the conversions

While there are some minor variations I’ve used this same structure across a number of different conversions, including:

Bricklink RDF conversion — CSV to RDF
Prelinger Archive — crawl Internet Archive to generate RDF
Government Art Collection — crawl GAC website to scrape HTML and generate RDF
…Geoplanet, Discogs, etc.

Workflow

The workflow for the conversion is managed using a Ruby Rakefile. Like Factual, I’ve found that a make style environment is useful for organising simple data conversion workflows. Rake allows me to execute command-line tools, e.g. curl for downloading data or rapper for doing RDF format conversions, execute arbitrary Ruby code, as well as shell out to dedicated scripts

I try to use a standard set of rake targets to co-ordinate the overall workflow. These are broken down into smaller stages where necessary. While the steps vary between datasets, the stages I most often use are:

download (or cache) — the main starting point which fetches the necessary data. I try and avoid manually downloading any data and rely on curl or perhaps dpm to get the required files. I’ve tended to use “download” for when I’m just grabbing static files and “cache” for when I’m doing a website crawl. This is just a cue for me. I like to tread carefully when hitting other people’s servers so aggressively cache files. Having a separate stage to grab data is also handy for when you’re working offline on later steps
convert — perform the actual conversion, working on the locally cached files only. So far I tend to use either custom Ruby code or XSLT.
reconcile — generate links to other dataset, often using the Google Refine Reconciliation API
enrich — enrich the dataset with additional data, e.g. by performing SPARQL queries to fetch remote data, or materialise new data
package — package up the generated output as a tar.gz file
publish — the overall target which runs all of the above

The precise stages used vary between projects and there are usually a number of other targets in the Rakefile that perform specific tasks, for example the convert stage is usually dependent on several other steps that generate particular types of data. But having standard stage names makes it easier to run specific parts of the overall conversion. One additional stage that would be useful to have is “validation“, so you can check the quality of the output.

At various times I’ve considered formalising these stages further, e.g by creating some dedicated Rake extensions, but I’ve not yet found the need to do that as there’s usually very little code in each step.

I tend to separate out dependencies on external resources, e.g. remote APIs, from the core conversion. The convert stage will work entirely on locally cached data and then I can call out to other APIs in a separate reconcile or enrich stage. Again, this helps when working on parts of the conversion offline and allows the core conversion to happen without risk of failure because of external dependencies. If a remote API fails, I don’t want to have to re-run a potentially lengthy data conversion, I just want to do resume from a known point.

I also try and avoid, as far as possible, using extra infrastructure, e.g. relying on databases, triple stores, or a particular environment. While this might help improve performances in some cases (particularly for large conversions) I like to minimise dependencies to make it easier to run the conversions in a range of environments, with minimal set-up, and minimal cost for anyone running the conversion code. But many of the conversions I’ve been doing are relatively small scale. For larger datasets using a triple store or Hadoop might be necessary. But this would be easy to integrate into the existing stages, perhaps adding a “prepare” stage to do any necessary installation and configuration.

For me its very important to be able to automate the download of the initial data files or web pages that need scraping. This allows the whole process to be automated and cached files re-used where possible. This simplifies the process of using the data and avoids unnecessary load on data repositories. As I noted at the end of yesterday’s post on using dpm with data.gov.uk, having easy access to the data files is important. The context for interpreting that data mustn’t be overlooked, but consuming that information is done separately from using the data.

To summarise, there’s nothing very revolutionary here: I’m sure many of you use similar and perhaps better approaches. But I wanted to share my style for organising conversions and encourage others to do likewise.

Project Layout

Workflow

Share this:

Published by Leigh Dodds