A couple of months ago I wrote a post that provided some background on the data we use in Energy Sparks.
The largest data source comes from gas and electricity meters (consumption) and solar panels (generation). While we’re integrating with APIs that allow us to access data from smart meters, for the foreseeable future most of this data will still be collected via AMR rather than SMETS-2 meters. And then shared with us as CSV files attached to emails.
That data is sent via a variety of systems and platforms run by energy companies, aggregators and local authorities. We’re currently dealing with about 24 different variations of what is basically the same dataset.
I thought I’d share a quick summary of that variation. As its interests from a “designing CSV files” and data standards perspective.
For a quick overview, you can look at this Google spreadsheet which provides a summary of the formats, in a way that hopefully makes them easy to compare.
The rest of this post has some notes on the variations.
What data are we talking about?
In Energy Sparks we work with half-hourly consumption and production data. A typical dataset will consist of a series of 48 daily readings for each meter.
Each half hourly data point reports the total amount of energy consumed (or generated) in the previous 30 minutes.
The dataset might usually contain data for several days of readings for many different meters.
This means that the key bits of information that we need to process each dataset is:
- An identifier for the meter, e.g. an MPAN or MPRN
- The date that the readings was taken
- A series of 48 data points making up a full days readings
Pretty straight-forward. But as you can see in the spreadsheet there’s a lot of different variations.
We receive different formats for both the gas and electricity data. Different formats for historical vs ongoing data supply. Or both.
And formats might change as schools or local authorities change platform, suppliers, etc.
Use of CSV
In general, the CSV files are pretty consistent. We rely on the Ruby CSV parsers default behaviour to automatically identify line endings. And all the formats we’re using use commas, rather than tabs, as delimiters.
The number of header rows varies. Most have a single row, but some don’t have any. A couple have two.
Various date formats are used. The following lists the most common first:
%b %e %Y %I:%M%p(1)
%e %b %Y %H:%M:%S(1)
Not much use of ISO 8601!
But the skew towards readable formats probably makes sense given that the primary anticipated use of this data is for people to open it in a spreadsheet.
Where we have several different formats from a single source (yes, this happens), I’ve noticed that the %Y based date formats are used in formats used to provide historical data, while %y year format seems to be the default for ongoing data.
Data is supplied either as UTC dates or, most commonly, in whatever the current timezone is in the UK. So readings switch from GMT to BST. And this means that when the clocks change we end up with gaps in the readings.
The majority of formats (22/24) are column oriented. By which I mean the tables consist of one row per meter, per day. Each row having 48 half-hourly readings as separate columns.
Two are row oriented. Each row containing a measurement for a specific meter at a specific date-time.
The column used to hold meter identifiers also varies. We might expect at least two: MPAN for electricity meters and MPRN for gas. What we actually get is:
“Meter” seems fair as a generic column header if you know what you’re getting. Otherwise some baffling variations here.
What about the column that contains the date (or date-time for row oriented files). What are they called?
The default is that data is supplied in kilowatt-hours (kwh).
So few of the formats actually bother to specify a unit. Those that do call it “
Units” or “
One format actually contains 48 columns reporting kwh and another 48 columns reporting Kilo Volt Amperes Reactive Hours (kVah).
Focusing on the column oriented formats, what are the columns containing the 48 half-hourly readings called?
Most commonly they’re named after the half-hour. For example a column called “
20:00” will contain the kwh consumption for the time between 7.30pm and 8pm.
In other cases the columns are positional, e.g. “half hour 1” through to “half hour 48”. This gives us the following variants:
For added fun, some formats have their first column as 00:30, while others have 00:00.
Some formats interleave the actual readings with an extra column that is used to provide a note or qualifier. There are two variants of this:
In addition to the meter numbers, dates, readings, etc the files sometimes contain extra columns, e.g:
We generally ignore this information as its either redundant or irrelevant to our needs.
Some files provide additional meter names, numbers or identifiers that are bespoke to the data source rather than a public identifier.
We’ve got the point now that adding new formats is relatively straight-forward.
Like anyone dealing with large volumes of tabular data, we’ve got a configuration driven data ingest which we can tailor for different formats. We largely just need to know the name of the date column, the name of the column containing the meter id, and the names of the 48 readings columns.
But it’s taken time to develop that.
Most of the ongoing effort is during the setup of a new school or data provider, when we need to check to see if a data feed matches something we know, or whether we need to configure another slightly different variation.
And we have ongoing reporting to alert us when formats change without notice.
The fact that there are so many variations isn’t a surprise. There are many different sources and at every organisation someone has made a reasonable guess at what a useful format might be. They might have spoken to users, but probably don’t know what their competitors are doing.
This variation inevitably creates cost. This costs isn’t immediately felt by the average user who only has to deal with 1-2 formats at a time when they’re working with their own data in spreadsheets.
But those costs add up for those of us building tools and platforms, and operating systems, to support those users.
I don’t see anyone driving a standardisation effort in this area. Although, as I’ve hopefully shown here, behind the variations there is a simple, common tabular format that is waiting to be defined.
My impression at the moment is that most focus is on the emerging smart meter data ecosystem, and the new range of APIs that might support faster access to this same data.
But as I pointed out in my other post, if there isn’t an early attempt to standardise those, we’ll just end up with a whole range of new, slightly different APIs and data feeds. What we need is a common API standard.