A couple of the projects I’m involved with at the moment are at a stage where there’s some thinking going on around how to best provide CSV files for users. This has left me thinking about what options we actually have when it comes to designing a CSV file format.
CSV is a very useful, but pretty mundane format. I suspect many of us don’t really think very much about how to organise our CSV files. It’s just a table, right? What decisions do we need to make?
But there are actually quite a few different options we have that might make a specific CSV format more or less suited for specific audiences. So I thought I’d write down some of the options that occured to me. It might be useful input into both my current projects as well as future work on standard formats.
Starting from the “outside in”, we have decisions to make about all of the following:
How are you going to name your CSV file? A good file naming convention can help ensure that a data file has an unambiguous name within a data package or after a user has downloaded it.
Including a name, timestamp or other version indicator will avoid clobbering existing files if a user is archiving or regularly collecting data.
Adopting a similar policy to generating URL slugs can help generate readable file names that work across different platforms.
Tabular Data Packages recommends using a .csv file name extension. Which seems sensible!
CSV is a loosely defined format of which there are several potential dialects. Variants can use different delimiter, line-endings and quoting policies. Content encoding is another variable. CSV files may or may not have headers.
The CSV on the Web standard defines a best practise CSV file dialect. Unless there’s a good reason, this ought to be your default dialect when defining new formats. But note that the recommended UTF-8 encoding may cause some issues with Excel.
CSV on the Web doesn’t say how many header rows a CSV file should have, but does define how, when parsing a CSV file, multiple header rows can be skipped. Multiple header rows are often used as a way to add metadata or comments, but I’d recommend using a CSV on the Web file instead as it provides more options.
What naming convention to use for columns? Options are to use an all lower case convention similar to a URL slug. This might make it marginally easier when accessing columns by name in an application. But if there are expectations that a CSV file will be opened in a spreadsheet application, having readable column names (including spaces) will make the data more user friendly.
CSV on the Web has a few other notes about column and row labelling.
Also, what language will you use in the column headings?
How are you going to order the columns in your CSV? The ordering of columns in a CSV file can enhance readability. But there’s is likely to be several different orderings, some of them more “natural” than others.
A common convention is to start with an identifier and other properties (dimensions) that describe what is being reported first, and then the actual observed values. So for example in a sales report we might have:
region, customer, product, total
Or in a statistical dataset
dimension1, dimension2, dimension3, value
dimension1, dimension2, dimension3, value, qualifier
This has the advantage of having a more natural reading order for the table. Particularly if as you move from left to right the columns can have fewer values. Adding qualifiers and notes to the end also ensures that they sit naturally next to the value they are annotating
Is your CSV sorted by default? Sorting may be less relevant if a CSV is being automatically processed and not worrying about order might reduce overheads when generating a data dump.
But if the CSV is going to be inspected or manipulated in a spreadsheet, then defining a default order can help a reader make sense of it.
If the CSV isn’t ordered, then document this somewhere.
How is the data in your table organised?
The Tidy Data guidance recommends having variables in columns, observations in rows, and only a single type of measure/value per table.
In addition to this, I’d also recommend that where there are qualifiers for reported values (as there often are for statistical data) that these are always provided in a separate column, rather than within the main value column. This has the advantage of letting you value column be numeric, rather than a mix of numbers and symbols or other codes. Missing and surpressed values can also then just be omitted and accompanied by an explanation in an adjacent column.
Another pattern I’ve seen with table layouts is to include an element of redundancy to include both labels and identifiers for something referenced in a row. Going back to the sales report example, we might structure this as follows:
region_id, region_name, customer_id, customer_name, product, total
This allows an identifier (which might be a URI) to be provided alongside a human-readable name. This makes the data more readable, at the cost of increasing file size. But it does avoid the need to publish a separate lookup table of identifiers.
You might also sometimes fine a need for repeated values. This is sometimes handled by adding additional redundant columns, e.g. “SICCode1″…”SICCode4” as used in the Companies House data. This works reasonably well and should be handled by most tools, at the potential cost of having lots of extra columns and a sparsely populated table. The alternative is to use a delimiter to put all the values in a single CSV. Again, CSV on the Web defines ways to process this.
And finally we have to decide how to include values in the individual cells. In the section on parsing the CSV on the Web recommends XML Schema data types and date formats as a default, but also allows formats to be defined in an accompanying metadata file.
Other things to think about are more application specific issues, such as how to specify co-ordinates, e.g. lat/lng or lng/lat?
Again, you should think about likely uses of the data and how, for example data and date formats might be interpreted by spreadsheet applications as well as other internationalisation issues.
This is just an initial list of thoughts. CSV on the Web clearly provides a lot of useful guidance that we can now build on, but there are still reasonable questions and trade-offs to be made. I think I’d also now recommend always producing a CSV on the Web metadata file along with any CSV file to help document its structure and any of the decisions made around its design. It would be nice to see the Tabular Data Package specification begin to align itself with that standard.
I suspect there a number of useful tips and guidance which could be added to what I’ve drawn up here. If you have any comments or thoughts then let me know.