I’ve been getting frustrated by CSV files again.
The context for this is my day job at Energy Sparks. I’ve written about the wide range of
different CSV formats that we have to contend with in order to accept data from a range of energy suppliers and meter operators. While there are a number of loose conventions around how that data is formatted, there’s also a lot of unnecessary variety.
When I wrote that analysis we were processing around 24 different tabular formats for half-hourly energy data. We're now up to 45 different formats, often several per data source, as organisations frequently export different formats from different systems.
One reason for the increase is the recent change in policy that requires energy suppliers to make half-hourly energy data freely available to their non-domestic customers. Unfortunately that policy did not mandate a specific format, or require the industry to coordinate on creating a common one.
I beg you, if you’re working on policy to increase access to data, please recognise that for your policy to be successful, that data needs to be consistently formatted. This report from MySociety and the Centre for Public Data has good suggestions for what to do.
Anyway, while the new policy has eased some of our issues with getting access to data, it has somewhat inevitably led to us having to deal with even more formats. Some suppliers are getting ahead of the curve and, anticipating increased demand, are revamping their customer portals to provide better access to data in new or updated formats.
But in other cases, we’re just dealing with any old export format that can be provided. And some are pretty awful. One is so bad that you can’t actually use the data without reformatting a column in Excel, as the timestamps are hidden by default. That’s pretty user hostile in my opinion.
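To give a flavour of what that means in practice, here’s a minimal sketch of the kind of conversion this pushes onto every user. It assumes a hypothetical export where the timestamps turn out to be raw Excel serial date numbers (the 1900 date system), which isn’t exactly our case but is representative of the problem:

```python
from datetime import datetime, timedelta

# Excel's 1900 date system counts days from an epoch of 1899-12-30, so a
# reading timestamped "45292.5" is actually midday on 1 January 2024.
# If an export leaves timestamps in this raw form, every consumer has to
# rediscover and reimplement this conversion before the data is usable.
EXCEL_EPOCH = datetime(1899, 12, 30)

def excel_serial_to_datetime(serial: float) -> datetime:
    return EXCEL_EPOCH + timedelta(days=serial)

print(excel_serial_to_datetime(45292.5))  # 2024-01-01 12:00:00
```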
I’ve been trying to distract myself from taking on the task of defining a common format (easy, enticing) and then trying to get people to use it (exhausting, endless).
Part of my distraction technique is to instead ask: how do things get this bad?
The need for coordination
One answer to that question is that there’s no-one across the sector who is advocating for users’ needs. If there were then I think there’d be some recommendations for how this type of data might be made more useful.
There also seems to be little practical coordination around data standards and technology in the sector. While there’s been work from Ofgem on best practices for data, and recommendations from Icebreaker One’s Open Energy project, these are all focused on licensing, metadata and so on. They don’t get into the detail of specific formats.
Without coordination, organisations will make isolated decisions about how to organise and publish data. This will never lead to convergence.
I don’t know who is resourced or funded to do this type of work. If it’s happening, then let me know.
Designing good formats, a course outline
The second answer to my question is that I think developers aren’t necessarily trained to think about designing data formats. In much the same way that there seems to be an education gap around security, maybe there should be more focus on teaching developers about standards, good data formats and so on?
In my experience, API design very often focuses on other aspects of sharing data, not the formats themselves. And it seems to me that data engineering focuses more on architecture, tooling and cleaning rather than formatting data for sharing.
I think unless you’ve spent a lot of time working with other people’s data, it’s easy to overlook even basic improvements to the design of a simple CSV file. Things that might be a quality of life improvement for the user on the receiving end.
Format design is obviously a UX problem.
So if I were designing a short course or workshop on data format design I’d focus on things like:
- The importance of knowing user needs. What will they be doing with the data, and what types of tooling or workflows will they be using? A format that works well for one situation (bulk ingestion, or within API responses) won’t work so well in others (Excel).
- How to design good tabular and JSON formats. What kind of design works with the grain of the underlying format you’re using? For tabular formats I’d focus on the kinds of things I highlighted in Designing CSV files.
- Designing data formats that work for different types of data exchange. For example, the way you’d design a streaming format, an export format, or a format for synchronising data are all different.
- The importance of identifiers and how to choose between different identifiers. For example, when to use an internal, shared or public identifier? Or the value of including labels alongside identifiers (there’s a small sketch of this after the list).
- How to design data formats that are split across files. There are design and usability decisions that aren’t quite the same as database normalisation, IMO.
- How to package data for exchange along with its metadata. For example, use of zip or other packaging formats, or how to embed data in different formats versus providing it separately (see the second sketch after the list).
- How to document your format so it’s legible to others. What makes for good data format and dataset documentation? What kind of information is it useful to provide?
- How to coordinate with others around defining and adopting a common standard. There’s a set of negotiation, research and collaboration skills that are involved in creating a shared standard. And ways to encourage consistency around data in lieu of a format standards process.
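To make a couple of those points concrete, here’s a rough sketch of what a half-hourly readings file designed along these lines might look like. The column names, and the choice of an MPAN plus a label as the identifier, are illustrative on my part rather than a proposed standard:

```python
import csv
import io

# An illustrative half-hourly readings file: ISO 8601 timestamps, a shared
# identifier (an MPAN) alongside a human-readable label, and the unit spelled
# out in the column name rather than left implicit or hidden in documentation.
rows = [
    {"mpan": "1200012345678", "meter_label": "Main building",
     "reading_start": "2024-01-01T00:00:00Z",
     "reading_end": "2024-01-01T00:30:00Z",
     "consumption_kwh": "1.25"},
    {"mpan": "1200012345678", "meter_label": "Main building",
     "reading_start": "2024-01-01T00:30:00Z",
     "reading_end": "2024-01-01T01:00:00Z",
     "consumption_kwh": "1.10"},
]

output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
print(output.getvalue())
```

None of these choices are clever, but they cost the publisher very little and save every consumer some work.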
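And here’s an equally rough sketch of the packaging point: shipping the data alongside a small metadata file, loosely in the spirit of a Frictionless Data-style data package. The metadata fields here are illustrative, not a formal schema:

```python
import json
import zipfile

# A minimal sketch of packaging a data file together with its metadata.
readings_csv = (
    "reading_start,consumption_kwh\n"
    "2024-01-01T00:00:00Z,1.25\n"
    "2024-01-01T00:30:00Z,1.10\n"
)

metadata = {
    "name": "half-hourly-consumption",
    "title": "Half-hourly electricity consumption",
    "resources": [{
        "path": "readings.csv",
        "format": "csv",
        "fields": [
            {"name": "reading_start", "type": "datetime"},
            {"name": "consumption_kwh", "type": "number", "unit": "kWh"},
        ],
    }],
}

# Bundle the data and its description into a single archive for exchange.
with zipfile.ZipFile("half-hourly-consumption.zip", "w") as package:
    package.writestr("readings.csv", readings_csv)
    package.writestr("datapackage.json", json.dumps(metadata, indent=2))
```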
I think that’s a pretty comprehensive set of topics. There are obviously other things that could be covered, like anonymisation, but I think that focuses more on the data rather than the format.
What do you think?