5 ways to be a better open data reuser

Originally published on the Open Data Institute blog. Original URL: https://theodi.org/blog/5-ways-better-open-data-reuse

Open data is still in its infancy. The focus so far has been on encouraging and supporting owners of data to publish it openly. A lot has been written about why opening up data is valuable, how to build business cases for open data sharing, and how to publish data in order to make it easy for people to reuse.

But, while it’s great there is so much advice for data publishers, we don’t often talk about how to be a good reuser of data. One of the few resources that give users advice is the Open Data Commons Attribution-Sharealike Community Norms.

I want to build on those points and offer some more tips and insights on how to use open data better.

1. Take time to understand the data

It almost goes without saying that in order to use data you need to understand it first. But effective reuse involves more than just understanding the structure and format of some data. We are asking publishers to be clear about how their data was collected, processed and licensed. So it’s important for reusers to use this valuable information and make informed decisions about using data.

It may mean that the data is not fit for the purpose you intend, or perhaps you just need to be aware of caveats that affect its interpretation. These caveats should be shared when you present your own analysis or conclusions based on the data.

2. Be open about your sources

Attribution is a requirement of many open licences and reusers should be sure they are correctly attributing their sources. But citation of sources should be a community norm, not just a provision in a licence. Within research communities the norm is to publish data under a CC0 licence, because attribution and citation of data is already well-embedded as a best-practice: every scientific paper has a list of references.

The same principles should apply to the wider open data community. Acknowledging sources not only helps credit the work of data publishers, it also helps to identify widely-used, high-quality datasets.

Consider adding a page to your application that lists both the open source software and open data sources that you’ve used in developing it. The Lanyrd colophon page provides one example of how this might look.

3. Engage with the publisher

If you’re using someone’s data, tell them! Every open data publisher is keen to understand who is using their data and how. It’s by identifying the value that comes from reuse of their data that publishers can justify continued (and additional) investment in open data publishing.

Engage with publishers when they ask for examples of how their data is being reused. Provide constructive feedback on the data itself and identify quality issues if you find them. Point to improvements in how the data is published that might help you and others consume it more easily.

If it was hard for you to get in touch with the publisher, encourage them to provide clearer contact details on their website. Getting them to complete an Open Data Certificate will help make this point: you can’t get a Pilot rating unless you provide this information.

If open data is a benefit to your business, then share your story. Evidence of open data benefits provides a positive feedback loop that can help people to unlock more data.

4. Share what you know

In some cases it’s not easy or possible to provide feedback directly to publishers, so share what you learn about working with open data with the wider community.

Do you have some tips about how to consume a dataset? Consider writing a blog post to share them. Maybe you can even share some open source code to help work with the data.

Have you identified some issues with a dataset? Those issues may well affect others, so share your observations with the wider community, not just the data publisher.

5. Help build the commons

The open data commons consists of all of the openly licensed and inter-connected datasets that are published to the web. The commons can grow and become more stable if we all contribute to it. There are various ways to achieve this beyond attribution and knowledge-sharing.

For example, if you’ve made improvements to a dataset, perhaps to enrich it against other sources, consider sharing that new dataset under an open licence. This might be the start of a more collaborative relationship with the original publisher or open up new business opportunities.

Some datasets are built and maintained collaboratively. Consider contributing resources to help maintain the dataset, or contributing your fixes and improvements. The more people do this, the more valuable the whole dataset becomes.

Direct financial contributions might also be an option, especially if you’re a commercial organisation making large-scale use of an open dataset. This is a direct way to support open data as a public good.

What do you think?

A mature open data commons will consist of a network of datasets published and reused by a variety of organisations. All organisations will be both publishers and consumers of open data. As we move forward with developing open data culture we need to think about how to encourage and support good practice in both roles.

The suggestions in this blog post should prompt further discussion. We’d like to develop them into more complete guidance for open data practitioners.

Comparing the 5-star scheme with Open Data Certificates

Originally published on the Open Data Institute blog.

I’ve been asked several times recently about the differences between the 5-star scheme for open data and the Open Data Certificates. How do the two ratings relate to one another, if at all? In this blog post I aim to answer that question.

The 5-star scheme

The 5-star deployment scheme was originally proposed by (our President) Tim Berners-Lee in his linked data design principles. The scheme is neatly summarised in this reference, which also identifies the costs and benefits associated with each stage.

Essentially, the scheme measures how well data is integrated into the web. “1-star” data is published in proprietary formats that users must download and process. “5-star” data can be accessed online, uses URIs to identify the resources in the data, and contains links to other sources.

The scheme is primarily focused on how data is published: the formats and technologies being used. Assessing whether a dataset is published at 2, 3 or 4 stars requires some insight into how the data has been published, which can be difficult for a non-technical person.

The scheme is therefore arguably best used as a technical roadmap and a short-hand assessment of the technical aspects of data publishing.

Open Data Certificates

The Open Data Certificates process takes an alternative but complementary view. A certificate measures how effectively someone is sharing a dataset for ease of reuse. The scope covers more than just technical issues, including rights and licensing, documentation, and guarantees about availability. A certificate therefore offers a more rounded assessment of the quality of publication of a dataset.

For data publishers the process of assessing a dataset provides insight into how they might improve their publishing process. The assessment process is therefore valuable in itself, but the certificate that is produced is also of value to reusers.

An Open Data Certificate acts as a reference sheet containing information of interest to reusers of a dataset. This saves time and effort digging through a publisher’s website to find out whether a dataset can meet their needs. The ability to search and browse for certified datasets may eventually make it easier to find useful data.

Despite these differences, the certificates and the 5-star scheme are in broad alignment. Both aim to improve the quality and accessibility of published data. And both require that data is published under open licences using standard formats. We would expect a dataset published to Expert level on the certificates to be well-integrated into the web, for example.

However it doesn’t necessarily follow that all “5-star” data would automatically gain an Expert rating: a dataset may be well integrated into the web but still be poorly maintained or documented.

In our view the Open Data Certificates provide clearer guidance for data publishers to consider when planning and improving their publishing efforts. They help publishers look at the bigger picture of data-user needs, many of which are not about the data format or whether the data contains URIs. This bigger picture can help inform data publishing roadmaps, procurement of data publishing services and policy development.

The certificates also provide a clear quality mark for reusers looking for assurances around how well data is published.

The 5-star scheme has been very effective at moving publishers away from Excel and closed licences and towards CSV and open licences. But for sustained and sustainable open data, reusers need the publishers of open data to consider more than licences and data formats. The Open Data Certificates help publishers do that.

Simplifying the UK open data licensing landscape

Originally published on the Open Data Institute blog. Original URL: https://theodi.org/blog/simplifying-the-uk-open-data-licensing-landscape

The Ordnance Survey has adopted the Open Government Licence (OGL) as the default licence for all of its open data products. This is great news for the open data community as it simplifies licensing around many important UK open datasets. It’s also an opportunity for other data publishers to reflect on their own approach to data licensing.

The original “OS Open Data licence” was based on a customised version of the first version of the OGL. Unfortunately these changes left the open data community in some doubt about how the new clauses should be interpreted. For example, the OpenStreetMap community decided that the terms were incompatible with the Open Database Licence, requiring them to seek explicit permission to use the open data. These are exactly the problems that standard open licences are meant to avoid.

By switching licence the Ordnance Survey has not only resolved outstanding confusion but has also ensured that its data can be freely and easily mixed with other UK Government sources. The knock-on effects will also simplify the licensing of local government data released under the Public Sector Mapping Agreement. The result is a much clearer and simpler open data landscape in the UK.

At the ODI we’ve previously highlighted our concerns around the proliferation of open government licences. Many of these licences have taken a similar approach to the OS Open Data licence and are derived from earlier versions of the OGL.

We think this is a good time for all data publishers to consider their licensing choices:

  • If your custom licence is derived from the OGL then consider adopting the original version unchanged.
  • If you’re using a bespoke licence then consider how adopting a standard licence such as the OGL or the Creative Commons Attribution licence could benefit potential reusers.

For more information you can browse our guidance on open data licensing and our draft guidance on problematic licensing terms.

Ultimately, the simplification of the open data licensing landscape benefits everyone and we ask other publishers to follow the Ordnance Survey’s lead.

Public draft of the open data maturity model

In partnership with the Department for Environment, Food & Rural Affairs (Defra), the ODI has been developing a maturity model to help assess how effective organisations are at publishing and consuming open data.

We are pleased to launch a public draft of the model and invite feedback on it from the wider community.

Last year we announced the start of a project to develop an open data maturity model. Funded through the Release of Data Fund, the project aims to support organisations in mapping out their open data journey and comparing their progress with others. The model will be of immediate value to Defra in implementing its open data strategy, but the aim has always been to develop a model that can be applied by a wide range of organisations.

Since November we’ve run a series of requirements workshops to explore this idea in more detail with representatives from 10 different organisations, including members of the Defra network and the wider open data community.

The results have been used to create a maturity model that will help organisations assess their maturity as both publishers and reusers of open data in several areas:

  • Data management processes
  • Knowledge and skills
  • Customer support and engagement
  • Investment and financial performance
  • Strategy and governance

The draft model consists of two components:

  • An assessment grid that identifies the key elements of the model and the steps towards maturity.
  • A supporting guidance document that provides more detail on the structure of the model, the activities described in the grid and some notes on how to undertake an assessment.

The documents are at a stage where we would like to invite input from the open data community.

We’d welcome all feedback, but are particularly interested in knowing whether:

  • the model covers the right elements of assessing maturity,
  • the guidance includes the right amount of detail and supporting notes, or
  • the results you get from assessing your organisation seem reasonable.

Please read through both documents and let us know your thoughts. It might be useful to read some of the introductory parts of the guide before reviewing the grid and other guidance in more detail.

You can comment on the documents directly or, if you’d prefer, email your feedback to Leigh.Dodds@theodi.org.

Our aim is to deliver a final version of the model by the end of March. So please provide your feedback by Friday, 13 March.

In the meantime, we will be starting the second phase of the project which focuses on developing an assessment tool to support people in using the model.

Developing an open data maturity model

Originally published on the Open Data Institute blog. Original URL: https://theodi.org/blog/developing-an-open-data-maturity-model

Organisational change is an important aspect of becoming an open data publisher. Often the technical process of getting data published is actually the easiest step. But if users are to have reliable, ongoing access to data then organisations need to consider the strategic, financial and operational impacts of making their open data publishing efforts sustainable.

Existing open data publishers are all at different stages in this change process: some are only just beginning to publish data, others have already undergone significant changes towards a more “open by default” model. Understanding the issues commonly encountered provides an opportunity for organisations to both learn from the successes of others and to assess their “maturity” as an open data publisher.

The Defra Transparency Panel, put together by the Department for Environment, Food & Rural Affairs, recently identified the need to be able to assess the open data maturity of organisations Defra works with, with a view to using this as a means to further promote open data publishing. As this type of assessment would clearly have value to other public bodies, Defra has partnered with the ODI to explore the creation of a general “maturity model” for open data publishers.

Funded by the Release of Data Fund, the project is just beginning and has several goals:

  • convene a group of stakeholders from central and local government, and the wider open data community, to provide input into the model
  • create an assessment model that considers the technical, strategic, financial, internal operational, customer and knowledge aspects of open data publication
  • develop a simple tool that will allow organisations to assess themselves against the model, producing a simple scorecard and recommended areas for improvement.

Consisting of a series of questions and measures, the assessment will be a natural complement to the Open Data Certificates. But where the certificates focus on a single dataset, the maturity model will assess the wider organisation. Where possible, data from existing certificates and data.gov.uk measures will be used to help support completion of the assessment.

By providing a means for public bodies to better understand their open data maturity and concrete guidance on areas for improvement, the goal is to ultimately drive an increase in the volume and quality of open data.

First steps

The initial part of the project will consist of a series of workshops. The first of these will focus on requirements gathering with a later workshop providing an opportunity to review and test the model before it is finalised. Development of the assessment tool will then begin.

The team are currently drawing up a shortlist of stakeholders to invite to the workshops, the first of which will take place before the end of the year. The initial set of attendees is based on existing expressions of interest from the Defra Transparency Panel, organisations in the Defra network, DCLG, and some local authorities. The goal is to have a representative mix of different types of organisation with different levels of experience in publishing open data.

While spaces will be limited, if you are interested in taking part in one of the workshops then please send me an email at leigh.dodds@theodi.org as soon as possible. However, the intention is to openly publish both the draft and final models so there will be opportunities for wider review before it is released. We’ll also provide further updates on the project over the coming months.

Loading the British National Bibliography into an RDF Database

This is the second in a series of posts (1, 2, 3, 4) providing background and tutorial material about the British National Bibliography. The tutorials were written as part of some freelance work I did for the British Library at the end of 2012. The material was used as input to creating the new documentation for their Linked Data platform but hasn’t been otherwise published. They are now published here with permission of the BL.

Note: while I’ve attempted to fix up these instructions to account for changes to the software and how the data is published, there may still be some errors. If there are then please leave a comment or drop me an email and I’ll endeavour to fix them.

The British National Bibliography (BNB) is a bibliographic database that contains data on a wide range of books and serial publications published in the UK and Ireland since the 1950s. The database is published under a public domain licence and is available for access online or as a bulk download.

This tutorial provides developers with guidance on how to download the BNB data and load it into an RDF database, or “triple store” for local processing. The tutorial covers:

  • An overview of the different formats available
  • How to download the BNB data
  • Instructions for loading the data into two different open source triple stores

The instructions given in this tutorial are for users of Ubuntu. Where necessary pointers to instructions for other operating systems are provided. It is assumed that the reader is confident in downloading and installing software packages and working with the command-line.

Bulk Access to the BNB

While the BNB is available for online access as Linked Data and via a SPARQL endpoint, there are a number of reasons why working with the dataset locally might be useful, e.g.:

  • Analysis of the data might require custom indexing or processing
  • Using a local triple store might offer better performance or more functionality
  • Re-publishing the dataset as part of aggregating data from a number of data providers
  • The full dataset provides additional data which is not included in the Linked Data.

To support these and other use cases the BNB is available for bulk download, allowing developers the flexibility to process the data in a variety of ways.

The BNB is actually available in two different packages. Both provide exports of the data in RDF but differ in both the file formats used and the structure of the data.

BNB Basic

The BNB Basic dataset is provided as an export in RDF/XML format. The individual files are available for download from the BL website.

This version provides the most basic export of the BNB data. Each record is mapped to a simple RDF/XML description that uses terms from several schemas including Dublin Core, SKOS, and Bibliographic Ontology.

As it provides a fairly raw version of the data, BNB Basic is likely to be most useful when the data is going to undergo further local conversion or analysis.

Linked Open BNB

The Linked Open BNB offers a much more structured view of the BNB data.

This version of the BNB has been modelled according to Linked Data principles:

  • Every resource, e.g. author, book, category, has been given a unique URI
  • Data has been modelled using a wider range of standard vocabularies, including the Bibliographic Ontology, Event Ontology and FOAF.
  • Where possible the data has been linked to other datasets, including LCSH and Geonames

It is this version of the data that is used to provide both the SPARQL endpoint and the Linked Data views, e.g. of The Hobbit.

This package provides the best option for mirroring or aggregating the BNB data because its contents match those of the online versions. The additional structure of the dataset may also make it easier to work with in some cases. For example, lists of unique authors or locations can easily be extracted from the data.

Downloading The Data

Both the BNB Basic and the Linked Open BNB are available for download from the BL website.

Each dataset is split over multiple zipped files. The BNB Basic is published in RDF/XML format while the Linked Open BNB is published as N-Triples. The individual data files can be downloaded from CKAN, although this can be time-consuming to do manually.
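
If you have a list of the individual file URLs then the downloads can be scripted. The following is a minimal sketch, assuming you have copied the URLs from the download pages into a text file called urls.txt (a hypothetical filename used only for illustration):

#Download every file listed in urls.txt into ~/data/bl
#urls.txt is a hypothetical file: one download URL per line
mkdir -p ~/data/bl
cd ~/data/bl
wget --continue --input-file=urls.txt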

The rest of this tutorial will assume that the packages have been downloaded to ~/data/bl.

Unpacking the files is a simple matter of unzipping them:

cd ~/data/bl
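#The backslash stops the shell expanding the wildcard, so unzip itself processes every archive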
unzip \*.zip
#Remove original zip files
rm *.zip

The rest of this tutorial provides guidance on how to load and index the BNB data in two different open source triple stores.

Using the BNB with Fuseki

Apache Jena is an Open Source project that provides access to a number of tools and Java libraries for working with RDF data. One component of the project is the Fuseki SPARQL server.

Fuseki provides support for indexing and querying RDF data using the SPARQL protocol and query language.

The Fuseki documentation provides a full guide for installing and administering a local Fuseki server. The following sections provide a short tutorial on using Fuseki to work with the BNB data.

Installation

Firstly, if Java is not already installed then download the correct version for your operating system.

Once Java has been installed, download the latest binary distribution of Fuseki. At the time of writing this is Jena Fuseki 1.1.0.

The steps to download and unzip Fuseki are as follows:

#Make directory
mkdir -p ~/tools
cd ~/tools

#Download latest version using wget (or manually download)
wget http://www.apache.org/dist/jena/binaries/jena-fuseki-1.1.0-distribution.zip

#Unzip
unzip jena-fuseki-1.1.0-distribution.zip

Change the download URL and local path as required. Then ensure that the fuseki-server script is executable:

cd jena-fuseki-1.1.0
chmod +x fuseki-server

To test whether Fuseki is installed correctly, run the following (on Windows systems use fuseki-server.bat):

./fuseki-server --mem /ds

This will start Fuseki with an empty, read-only, in-memory database. Visiting http://localhost:3030/ in your browser should show the basic Fuseki server page. Use Ctrl-C to shut down the server once the installation test is completed.

Loading the BNB Data into Fuseki

While Fuseki provides an API for loading RDF data into a running instance, for bulk loading it is more efficient to index the data separately. The manually created indexes can then be deployed by a Fuseki instance.

Fuseki is bundled with the TDB triple store. The TDB data loader can be run as follows:

java -cp fuseki-server.jar tdb.tdbloader --loc /path/to/indexes file.nt

This command would create TDB indexes in the /path/to/indexes directory and load file.nt into them.

To index all of the Linked Open BNB run the following command, adjusting paths as required:

java -Xms1024M -cp fuseki-server.jar tdb.tdbloader --loc ~/data/indexes/bluk-bnb ~/data/bl/BNB*

This will process each of the data files and may take several hours to complete depending on the hardware being used.

Once the loader has completed the final step is to generate a statistics file for the TDB optimiser. Without this file SPARQL queries will be very slow. The file should be generated into a temporary location and then copied into the index directory:

java -Xms1024M -cp fuseki-server.jar tdb.stats --loc ~/data/indexes/bluk-bnb >/tmp/stats.opt
mv /tmp/stats.opt ~/data/indexes/bluk-bnb

Running Fuseki

Once the data load has completed Fuseki can be started and instructed to use the indexes as follows:

./fuseki-server --loc ~/data/indexes/bluk-bnb /bluk-bnb

The --loc parameter instructs Fuseki to use the TDB indexes from a specific directory. The second parameter tells Fuseki where to mount the index in the web application. Using a mount point of /bluk-bnb the SPARQL endpoint for the dataset would then be found at:

http://localhost:3030/bluk-bnb/query

To select the dataset and work with it in the admin interface visit the Fuseki control panel:

http://localhost:3030/control-panel.tpl

Fuseki has a basic SPARQL interface for testing out SPARQL queries, e.g. the following will return 10 triples from the data:

SELECT ?s ?p ?o WHERE {
  ?s ?p ?o
}
LIMIT 10
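
The endpoint can also be queried programmatically over the SPARQL protocol. Here is a minimal sketch using curl against the endpoint URL shown above; the Accept header simply asks for results in the SPARQL JSON format:

#Submit a query to the local Fuseki endpoint and request SPARQL JSON results
curl --silent \
  --data-urlencode 'query=SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10' \
  --header 'Accept: application/sparql-results+json' \
  http://localhost:3030/bluk-bnb/query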

For more information on using and administering the server read the Fuseki documentation.

Using the BNB with 4Store

Like Fuseki, 4Store is an Open Source project that provides a SPARQL-based server for managing RDF data. 4Store is written in C and has been proven to scale to very large datasets across multiple systems. It offers a similar level of SPARQL support to Fuseki, so is a good alternative for working with RDF in a production setting.

As the 4Store download page explains, the project has been packaged for a number of different operating systems.

Installation

As 4Store is available as an Ubuntu package, installation is quite simple:

sudo apt-get install 4store

This will install a number of command-line tools for working with the 4Store server. 4Store works differently to Fuseki in that there are separate server processes for managing the data and serving the SPARQL interface.

The following command will create a 4Store database called bluk_bnb:

#ensure /var/lib/4store exists
sudo mkdir -p /var/lib/4store

sudo 4s-backend-setup bluk_bnb

By default 4Store puts all of its indexes in /var/lib/4store. In order to have more control over where the indexes are kept it is currently necessary to build 4store manually. The build configuration can be altered to instruct 4Store to use an alternate location.

Once a database has been created, start a 4Store backend to manage it:

sudo 4s-backend bluk_bnb

This process must be running before data can be imported into, or queried from, the database.

Once the database is running a SPARQL interface can then be started to provide access to its contents. The following command will start a SPARQL server on port 8000:

sudo 4s-httpd -p 8000 bluk_bnb

To check whether the server is running correctly visit:

http://localhost:8000/status/

It is not possible to run a bulk import into 4Store while the SPARQL process is running. So after confirming that 4Store is running successfully, kill the httpd process before continuing:

sudo pkill '^4s-httpd'

Loading the Data

4Store ships with a command-line tool for importing data called 4s-import. It can be used to perform bulk imports of data once the database process has been started.

To bulk import the Linked Open BNB, run the following command, adjusting paths as necessary:

4s-import bluk_bnb --format ntriples ~/data/bl/BNB*

Once the import is complete, restart the SPARQL server:

sudo 4s-httpd -p 8000 bluk_bnb

Testing the Data Load

4Store offers a simple SPARQL form for submitting queries against a dataset. Assuming that the SPARQL server is running on port 8000 this can be found at:

http://localhost:8000/test/

Alternatively 4Store provides a command-line tool for submitting queries:

 4s-query bluk_bnb 'SELECT * WHERE { ?s ?p ?o } LIMIT 10'
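
Richer queries can be submitted in the same way. As a sketch, and assuming the Linked Open BNB types books using the Bibliographic Ontology’s bibo:Book class (as described earlier), the following lists ten book URIs:

#Assumes books are typed as bibo:Book, per the Bibliographic Ontology used in the Linked Open BNB
4s-query bluk_bnb 'PREFIX bibo: <http://purl.org/ontology/bibo/> SELECT ?book WHERE { ?book a bibo:Book } LIMIT 10'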

Summary

The BNB dataset is not just available for use as Linked Data or via a SPARQL endpoint. The underlying data can be downloaded for local analysis or indexing.

To support this type of usage the British Library have made available two versions of the BNB: a “basic” version that uses a simple record-oriented data model, and the “Linked Open BNB”, which offers a more structured dataset.

This tutorial has reviewed how to access both of these datasets and how to download and index the data using two different open source triple stores: Fuseki and 4Store.

The BNB data could also be processed in other ways, e.g. to load into a standard relational database or into a document store like CouchDB.

The basic version of the BNB offers a raw version of the data that supports this type of usage, while the richer Linked Data version supports a variety of aggregation and mirroring use cases.

Research: investigation into publishing open election data

Originally published on the Open Data Institute blog. Original URL: https://theodi.org/blog/election-data-tables

A number of people at the ODI have recently been looking at the topic of open election data, asking how election results could be collected, reported and analysed in order to increase transparency and drive democratic engagement.

For example, the technology team recently developed an approach to collaborative data collection using the European election data. In partnership with Deloitte, the research team conducted a project to explore potential applications of election data, which ultimately highlighted the issues with obtaining good quality data.

Recognising the need for a better approach to publishing open election data, we decided to explore the topic further. Supported by the Partnership for Open Data, we went back to first principles to look at:

  • what types of data are used in electoral processes?
  • how is election data currently being reported internationally?
  • do the differences between different electoral systems impact how data is reported?
  • what data formats currently exist for sharing election data?

We’ve published the results of that research and analysis in a draft paper: Publishing Election Data.

The paper also introduces a simple conceptual model which could inform the design of data standards for open election data.

We also felt that there was scope to define some simple, customisable data formats that could be used to support reporting of election results internationally.

With that in mind we’ve also created a draft specification called “Election Data Tables” that defines some simple tabular formats for election results.

The GitHub project includes some example data using election results from the UK, Albania and Zimbabwe to illustrate some uses of the format.

The specification is still at a very early stage and more work is required in various areas, including defining some schemas to support data validation. But the work is at a stage where it would be really useful to get external feedback and we’d like your input!

For example, if you’re working with, or publishing election data, does the format support your specific use cases?

If you’re interested then please take a read through the paper and the specification and let us know what you think. If you have any thoughts on the research paper then feel free to comment on the document or perhaps leave a note here. If you’d like to suggest amendments to the draft specification then please raise an issue on GitHub or submit a pull request with your suggested changes.

Publishing open statistical data


At the request of the ONS, the ODI tech team have recently been exploring some ideas around publishing statistical open data. This blog post shares some of the results of that thinking and you can also explore a proof-of-concept that showcases some of these ideas with real-world data.

Obviously, there are plenty of existing best practices for data publication that should be followed regardless of the type of data being published: clear licensing, availability of bulk downloads, use of standard, open formats, are all important. The guidance in the Open Data Certificate questionnaire applies to all types of data, including statistics.

But arguably there are some specific challenges that apply to the publication and re-use of statistical data. These challenges are partly due to the inherent complexity of (even simple) statistical publications: there is a lot of context that must be communicated effectively if the data is to be understood and properly interpreted.

The wide community of re-users of statistical data, each with their own distinct needs, also presents challenges: politicians, policy makers, journalists, application developers and members of the general public all need access to official statistics at various times and in different ways.

It is vital that statistical data is published in a way that makes it easy to locate, easy to understand, and easy to use. Statistical data needs to be immediately accessible to all users and this is as much about good user experience design as it is about ready access to bulk downloads and APIs.

There are four key elements to what the team have been exploring so far.

Documentation is Vital

Documentation is a vital part of a statistical data release. Many statistical organisations publish analysis alongside their raw data. It is this analysis that starts to tell the initial stories around the data, drawing out the key highlights.

But at a deeper level all aspects of the dataset need to be documented, and this information needs to be readily available in both human and machine-readable forms:

  • What are the dimensions of the dataset? If the dimension is based on a code list or controlled vocabulary then what do the individual values mean and how do they relate to one another?
  • What is actually being measured?
  • What contextual information is required to understand individual values? For example, are the numbers provisional, or are they based on limited coverage?
  • Is this the latest data available? When was it published? Has it been revised?

Unfortunately it can often be difficult to answer these questions. Even when the raw data is available, the context required to properly interpret it is often missing.

Statistical data, more than any other form of government open data, is prone to being published in carefully formatted Excel spreadsheets. The formatting is usually unnecessary, but often it’s there to communicate some additional context. Provisional values may be in italics. Letter codes might be added to individual data points, referencing footnotes that provide some important notes on interpreting the value.

When we publish data to the web we can do better: we can link directly to the necessary context, in both the human and machine-readable views of the data.

Link All The Things

Every aspect of a statistical dataset should be part of the web: the dataset, its structural elements (dimensions, attributes and measures), and every individual observation should each have a unique URL.

If observations have their own URLs then users can link to individual data points. This allows analysis to be directly linked to its supporting evidence.

Dynamic URLs can be used to redirect to the latest figures, ensuring that people are always accessing the latest data. But the archive of individual observations can still be navigated by linking together observations collected in the previous month, quarter, or year.

If all of the structural elements of a dataset have unique identifiers, then the definitions of terms become accessible from a single click, rather than searching through supporting notes and documentation.
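
As a purely hypothetical illustration of this idea, the URL structure for a statistical dataset might look something like the following; the host, paths and names are invented for this example and don’t refer to a real service:

#A dataset (hypothetical URLs, for illustration only)
http://stats.example.org/datasets/ppi
#One dimension of that dataset
http://stats.example.org/datasets/ppi/dimensions/product
#A single observation: one measure, for one product, in one month
http://stats.example.org/datasets/ppi/observations/2014-01/outputs
#A dynamic URL that always redirects to the latest observation
http://stats.example.org/datasets/ppi/observations/latest/outputs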

Slicing and Dicing


There are many different ways to slice through a statistical dataset. Comparisons can be made across different dimensions to create time-series and other charts. Rather than create a few fixed presentations, e.g. individual data tables or static charts, the data should be published via an API that supports dynamically slicing the data. This facilitates the creation of more dynamic presentations of the data:

  • Developers can use the API to extract the collection of data points they need, rather than downloading a whole dataset
  • Users can navigate through a dataset to generate simple visualisations, e.g. time series of some area of interest

This can also simplify the production processes that support the creation of statistical releases allowing analysts to dynamically generate whatever charts and tables are required.
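
Continuing the hypothetical example above, a slicing request might select a set of observations by fixing some dimensions and choosing an output format; again, the endpoint and parameter names below are invented purely for illustration:

#Hypothetical request: monthly observations for one product over a year, returned as CSV
curl 'http://stats.example.org/api/datasets/ppi/observations?product=outputs&frequency=monthly&from=2013-01&to=2013-12&format=csv'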

Embeddable Views

Dynamic views of the data should not be confined to the original website. The charts and data tables produced for end-users should be embeddable in other websites, allowing them to be included in news reports, blog posts, etc.

This greatly simplifies the process of data re-use for many users. Seamless sharing of information, via linking, is an important part of how discussion happens on the web today. Embeddable views also help address questions of provenance and trust by allowing readers to easily locate the original data sources referenced in an article.

Taken together we think these four elements will usefully complement publication of raw open data; help to integrate statistical data into the web; and make it easier for all types of user to make the most of statistical publications.

Visit the proof-of-concept application to explore how some of these ideas work in the context of the Producer Price Index dataset and for more background on how we built the application.

Introducing CSVLint

Originally published on the Open Data Institute blog. Original URL: https://theodi.org/blog/introducing-csvlint

The ODI tech team has recently been building a tool to validate CSV files. While CSV is a very simple format, it is surprisingly easy to create files that are hard for others to use.

The tool we’ve created is called CSVLint and this blog post provides some background on why we’ve built the tool, its key features, and why we think it can help improve the quality of a large amount of open data.

Why build a CSV validator?

Jeni Tennison recently described 2014 as the Year of CSV. A lot, perhaps even the majority, of open data is published in the tabular format CSV. CSV has many shortcomings, but it is very widely supported and can be easy to use.

Unfortunately though, lots of CSV data is published in broken and inconsistent ways. There are numerous reasons why this happens, but two of the key issues are that:

  • tools differ in how they produce or expect to consume CSV data, leading to the creation of many different variants or “dialects” of CSV that have different delimiters, escape characters, encoding, etc
  • lots of data is dumped from spreadsheets that are designed for human reading rather than automated processing

This case study on the status of CSVs on data.gov.uk highlights the size of this issue: only a third of the data was machine-readable.

These types of issues can be addressed by better tooling. Validation tools can help guide data publishers towards best practices, providing them with a means to check data before it is published to ensure it is usable. Validation tools can also help re-users check data before it is consumed, and provide useful feedback to publishers on issues.

This is the motivation behind CSVLint.

Gathering Requirements

To ensure that we were building a tool that would meet the needs of a variety of users, we gathered requirements from several sources:

  • user workshop — we engaged with a group of data publishers and re-users to discuss the issues they faced and the features they would like to see. The attendees identified and prioritised a potential set of features for the tool
  • background research — we explored a range of different tools, techniques and formats for validating and describing CSV files. This allowed us to identify the types of validation that might be useful and ways to describe constraints and create schemas

We used this input to refine an initial set of features which formed the backlog for the project. The key things that we needed to deliver were:

  • a CSV syntax validator to check the basic structure of a CSV file
  • an extended validator that could check a CSV file against a schema, e.g. to ensure that it contained the correct columns with the correct data types
  • a way to generate documentation for schemas, to make it easy for people to publish and aggregate data in common formats
  • a tool that can be used to check data both before and after publication
  • clear guidance on how to fix identified problems
  • integration options for embedding these tools into various other workflows

The CSVLint Alpha


The end result of our efforts is CSVLint, an open service that supports the validation of CSV files published in a variety of ways.

The service is made up of two components. The web application provides all of the user-facing functionality, including reporting. It is backed by an underlying software library, csvlint.rb, that does all of the heavy lifting around data validation.

Both the web application and the library are open source. This means that everything we’ve built is available for others to customise, improve, or re-deploy.

The service builds on some existing work by the Open Knowledge Foundation, including the CSV Dialect, JSON Table Schema and Data Package formats.

CSVLint supports validating CSV data that has been published in a variety of different ways:

  • As a single CSV file available from a public URL
  • As a collection of CSV files packaged into a zip file
  • As a collection of CSV files associated with a CKAN package
  • As a Data Package
  • Via uploading individual files
  • Via uploading a zip file

Data uploaded to the tool is deemed to be “pre-publication” so the validation reports are not logged. This allows publishers to validate and improve their data files before making them public.

All other data is deemed to be public and validation reports are added to the list of recent validations. This provides a feedback loop to help highlight common errors.

Validation Reports


The validation reports (example) have been designed to give “at a glance” feedback on the results, as well as a detailed breakdown of each issue.

All feedback is classified along two different dimensions:

  • Type of feedback
      • Error — problems that need to be fixed for the CSV to be considered valid
      • Warning — problems that should be fixed, but aren’t critical
      • Message — additional feedback on areas for improvement or assumptions made during the validation
  • Category of error
      • Structure — problems with the syntax of the file, e.g. problems with quoting or delimiters
      • Schema — issues caused by schema validation failures
      • Context — problems related to how the data has been published, e.g. the Content-Type used to serve the file

The summary table for each validation result is supplemented with detailed feedback on every reported issue with suggested improvements.

The report also includes badges that allow a summary result and a link to a full report to be embedded in other web applications.

A JSON view of a validation result provides other integration options.

Schema Validation

In addition to checking structural problems with CSV files, the CSVLint service can also validate a file against a schema.

We proposed some suggested improvements to the JSON Table Schema format that would allow constraints to be expressed for individual fields in a table, e.g. minimum length, patterns, etc. These have now been incorporated into the latest version of the specification.

CSVLint currently supports schemas based on the latest version of JSON Table Schema. There is some background in the documentation (see “How To Write a Schema”) and it is possible to see a list of recently used schemas to view further examples.

Using a schema it is possible to perform additional checks, including:

  • whether the columns have the right name
  • required fields are populated
  • fields have a minimum or maximum length, or match a pattern
  • field values are unique
  • values match a declared type, e.g. a date

This provides a lot of flexibility for checking the data contained in a CSV file. When validating a file, a user can supply a schema either by uploading it alongside the data or by pointing to an existing schema that has been published on the web. For Data Packages any built-in schema is automatically applied.
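
As a rough sketch of what a simple schema file might look like, the snippet below creates one containing constraints of the kind listed above; the column names are invented for illustration and the exact property names should be checked against the current JSON Table Schema documentation:

#Write a hypothetical schema to schema.json; the constraints mirror the checks listed above
cat > schema.json <<'EOF'
{
  "fields": [
    { "name": "transaction_id", "constraints": { "required": true, "unique": true } },
    { "name": "supplier",       "constraints": { "required": true, "minLength": 1 } },
    { "name": "date",           "constraints": { "required": true, "pattern": "^\\d{4}-\\d{2}-\\d{2}$" } }
  ]
}
EOF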

CSVLint automatically generates some summary documentation for schemas loaded from the web, e.g. this schema for the Land Registry Price Paid data.

How CSVLint can make a difference

While CSVLint is still an alpha release, there is already a rich set of features available to support guiding and improving data publication. We think that the service can potentially play a number of roles:

  • by helping users of all kinds improve the data they are publishing via a quick feedback loop that will guide them on fixing errors
  • enabling communities to publish schemas that describe and validate data formats to simplify the aggregation of open data
  • supporting data re-users in checking source data to catch common problems and provide useful constructive feedback to data publishers
  • allowing data repositories to use CSVLint badges in their service to provide immediate feedback to both publishers and re-users on data quality

But to prove this we need people to start using the tool. User feedback will provide us with useful guidance on how the service might evolve. So we’re really keen to get feedback on how well CSVLint supports your particular data publication or re-use use case.

Please try out the service and share your experience by leaving a comment on this blog post. If you encounter a bug, or have an idea for a new feature, then please file an issue.

Tools For Working with CSV Files

Anyone working with open data for any length of time will have inevitably spent a long time wrangling CSV files to tidy, extract and reformat data.

There are various ways to get that job done, but I’ve been compiling a list of useful command-line tools that can provide some useful functionality. The ability to automate data conversion and cleaning helps make a process repeatable, which is essential if you’re doing more than just a one-off task.

My current favourite tool for working with CSV files is csvkit, which is a collection of utilities that support:

  • cleaning CSV files to resolve syntax errors
  • viewing and searching CSV files to extract relevant data
  • generating summaries of columns in the data
  • merging together CSV files, joining them based on common column values

The getting started guide introduces you to each of the tools in turn. Taken together they provide a quick way to inspect the data in a set of CSV files and then combine them to create a more useful structure.
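
To give a flavour of how the utilities combine, here is a brief sketch; the file and column names are invented for illustration only:

#Hypothetical example: file and column names are illustrative only
#Convert an Excel export to CSV
in2csv spend-2014.xls > spend-2014.csv
#Keep two columns, filter rows for one supplier and print a readable table
csvcut -c supplier,amount spend-2014.csv | csvgrep -c supplier -m "Acme" | csvlook
#Summarise the values in a single column
csvstat -c amount spend-2014.csv
#Join two files on a shared column
csvjoin -c supplier spend-2014.csv suppliers.csv > joined.csv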

However, if you need to do more detailed clean-up then csvfix might be a better alternative. The documentation includes some solutions to common problems, which gives a good overview of the functionality. csvfix is particularly good at tidying up and reformatting fields within a CSV file, making it a good complement to csvkit.

Of course, data is often published as an Excel file rather than as CSV, requiring a manual step to convert it to CSV before applying other tools. Data might also be spread across separate worksheets, making the export process more laborious. I wrote the xls-split utility to help extract worksheets from Excel files, converting them into one or more CSV files. It’s very useful for extracting data from a set of related spreadsheets, e.g. annual or monthly statistics, in order to then build an aggregate data file.

Validating data in a CSV file, to check that it conforms to expectations, is another common task. This is something that we’ve been looking at in more depth at the ODI. This github project and the related documentation explores various tools and approaches for validating CSV files against simple schemas.

If working on the command-line isn’t your thing then OpenRefine is still one of the best tools for interactively tidying up messy data.

What other tools do you use when working with CSV files? Leave a comment and let us know.