Anyone working with open data for any length of time will have inevitably spent a long time wrangling CSV files to tidy, extract and reformat data.
There are various ways to get that job done, but I’ve been compiling a list of useful command-line tools that can provide some useful functionality. The ability to automate data conversion and cleaning helps make a process repeatable, which is essential if you’re doing more than just a one-off task.
My current favourite tool for working with CSV files is csvkit, which is a collection of utilities that support:
- cleaning CSV files to resolve syntax errors
- viewing and searching CSV files to extract relevant data
- generating summaries of columns in the data
- merging together CSV files, joining them based on common column values
The getting started guide introduces you to each of the tools in turn. Taken together they provide a quick way to inspect the data in a set of CSV files and then combine them to create a more useful structure.
However, if you need to do more detailed clean-up then csvfix might be a better alternative. The documentation includes some solutions to common problems which gives a good overview of the functionality. csvfix is particularly good at tidying up and reformatting fields within a CSV file making it a good complement to csvkit.
Of course data is often published as an Excel file rather than as CSV, often requiring a manual step to convert it to CSV before applying other tools. Data might also be spread across separate worksheets, making the export process more laborious. I wrote the xls-split utility to help extract worksheets from Excel files, converting them into one or more CSV files. Its very useful for extracting data from a set of related spreadsheets, e.g. annual or monthly statistics, in order to then build an aggregate data file.
Validating data in a CSV file, to check that it conforms to expectations, is another common task. This is something that we’ve been looking at in more depth at the ODI. This github project and the related documentation explores various tools and approaches for validating CSV files against simple schemas.
If working on the command-line isn’t your thing then OpenRefine is still one of the best tools for interactively tidying up messy data.
What other tools do you use when working with CSV files? Leave a comment and let us know.