Stood in the queue at the supermarket earlier, I noticed the cover of the Bath Chronicle. The lead story this week is: “House prices in Bath almost 13 times the average wage”. This is almost perfectly designed clickbait for me. I can’t help but want to explore the data.
In fact, I’ve done this before, when the paper published a similar headline in September last year: “Average house price in Bath is now eight times average salary”. I wrote a blog post at the time to highlight some of the issues with their reporting.
Now I’m writing another blog post, but this time to highlight how far we still have to go with publishing data on the web.
To try to illustrate the problems, here’s what happened when I got back from the supermarket:
- Read the article on the Chronicle website to identify the source of the data, the annual Home Truths report published by the National Housing Federation.
- I then googled for “National Housing Federation Home Truths” as the Chronicle didn’t link to its sources.
- I then found and downloaded the “Home Truths 2014/15: South West” report which has a badly broken table of figures in it. After some careful reading I realised the figures didn’t match those in the Chronicle
- Double-checking, I browsed around the NHF website and found the correct report: “Home Truths 2015/2016: The housing market in the South West”, which, you’ll notice, isn’t clearly signposted from their research page
- The report has a mean house price of £321,674 for Bath & North East Somerset using Land Registry data from 2014. It also has a figure of £25,324 for mean annual earnings in 2014 for the region, giving a ratio of 12.7. The earnings data is from the ONS ASHE survey
- I then googled for the ASHE survey figures as the NHF didn’t link to its sources
- Having found the ONS ASHE survey I clicked on the latest figures and found the reference tables before downloading the zip file containing Table 8
- Unzipping, I opened the relevant spreadsheet and found the worksheet containing the figures for “All” employees
- Realising that the ONS figures were actually weekly rather than annual wages I opened up my calculator and multiplied the value by 52
- The figures didn’t match. Checked my maths
- I then realised that, like an idiot, I’d downloaded the 2015 figures but the NHF report was based on the 2014 data
- Returning to the ONS website I found the tables for the 2014 Revised version of the ASHE
- Downloading, unzipping, and calculating I found that again the figures didn’t match
- On a hunch, I checked the ONS website again and then found the reference tables for the 2014 Provisional version of the ASHE
- Downloading, unzipping, and re-calculating I finally had my match for the NHF figure
- I then decided that rather than dig further I’d write this blog post
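For anyone wanting to retrace the arithmetic, here is a minimal sketch of the two calculations involved. The house price and annual earnings figures are the ones quoted in the NHF report; the function names are my own, and the weekly figure is simply back-calculated from the NHF's annual number rather than copied from the ONS spreadsheet.

```python
WEEKS_PER_YEAR = 52  # the multiplier I used to annualise the ONS weekly figures


def annualise_weekly_earnings(weekly_earnings: float) -> float:
    """Convert a weekly earnings figure into an annual one."""
    return weekly_earnings * WEEKS_PER_YEAR


def price_to_earnings_ratio(mean_price: float, mean_annual_earnings: float) -> float:
    """Ratio of mean house price to mean annual earnings."""
    if mean_annual_earnings <= 0:
        raise ValueError("annual earnings must be positive")
    return mean_price / mean_annual_earnings


# Figures quoted in the NHF report for Bath & North East Somerset (2014)
MEAN_HOUSE_PRICE = 321_674
MEAN_ANNUAL_EARNINGS = 25_324

ratio = price_to_earnings_ratio(MEAN_HOUSE_PRICE, MEAN_ANNUAL_EARNINGS)
print(round(ratio, 1))  # 12.7

# Assumption: working backwards, this is the weekly value a matching
# ONS table would need to contain.
implied_weekly = MEAN_ANNUAL_EARNINGS / WEEKS_PER_YEAR
print(implied_weekly)  # 487.0
```

Working backwards to the implied weekly figure would have told me which number to look for in each version of the ASHE tables, instead of multiplying and comparing each time.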
This is a less than ideal situation. What could have streamlined this process?
The lack of direct linking – from the Chronicle to the NHF, and from the NHF to the ONS – was the root cause of my issues here. I spent far too much time working to locate the correct data. Direct links would have avoided all of my bumbling around.
While a direct link would have taken me straight to the data, I might have missed out on the fact that there were revised figures for 2014. Or that there were actually some new provisional figures for 2015. So there’s actually an update to the story already waiting to be written. The analysis is already out of date.
The new data was published on the 18th November and the NHF report on the 23rd. That gave a five-day period in which the relevant tables and commentary could have been updated. Presumably the report was too deep into final production to make changes. Or maybe no one thought to check for updated data.
If both the raw data from the ONS and the NHF analysis had been published natively to the web rather than in a PDF, maybe some of that production overhead could have been reduced. I know PDF has some better support for embedding and linking data these days, but a web-native approach might have made the report easier to update.
In fact, why should the numbers have been manually recalculated at all? The actual analysis involves little more than pulling some cells from existing tables and doing some basic calculations. Maybe that could have been done on the fly? Perhaps by embedding the relevant figures. At the moment I’m left doing some manual copy-and-paste.
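To make the "on the fly" idea a bit more concrete, here is a rough sketch of what it might look like if both tables were available as simple CSV. The column names and layout are entirely hypothetical: neither the Land Registry nor the ONS publishes data in this exact shape. The house price is the figure from the NHF report, and the weekly earnings value is back-calculated from the NHF's annual figure rather than taken from the ONS.

```python
import csv
import io

WEEKS_PER_YEAR = 52

# Hypothetical CSV extracts standing in for the published tables.
PRICES_CSV = """area,year,mean_house_price
Bath and North East Somerset,2014,321674
"""

# Assumption: weekly value implied by the NHF annual figure (25,324 / 52).
EARNINGS_CSV = """area,year,mean_weekly_earnings
Bath and North East Somerset,2014,487.0
"""


def load_by_area(text, value_column):
    """Index a CSV table by (area, year), keeping one numeric column."""
    reader = csv.DictReader(io.StringIO(text))
    return {(row["area"], row["year"]): float(row[value_column]) for row in reader}


def affordability_ratios(prices_csv, earnings_csv):
    """Join the two tables and compute price-to-annual-earnings ratios."""
    prices = load_by_area(prices_csv, "mean_house_price")
    weekly = load_by_area(earnings_csv, "mean_weekly_earnings")
    ratios = {}
    for key, price in prices.items():
        if key in weekly:  # only areas and years present in both tables
            ratios[key] = price / (weekly[key] * WEEKS_PER_YEAR)
    return ratios


ratios = affordability_ratios(PRICES_CSV, EARNINGS_CSV)
for (area, year), ratio in ratios.items():
    print(f"{area} ({year}): {ratio:.1f}")
```

With something like this in place, swapping in the revised 2014 or provisional 2015 earnings would mean changing one input rather than redoing the sums by hand.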
It’s not just NHF that are slow to publish their figures though. Researching the Chronicle article from last year, I turned up some DCLG figures on housing market and house prices. These weren’t actually referenced from the article or any of its sources. I just tripped over them whilst investigating. Because data nerd.
The live (sic) DCLG tables include a ratio of median house prices to median earnings but they haven’t been updated since April 2014. Their analysis only uses the provisional ASHE figures for 2013.
Oh, and just for fun, the NHF analysis uses mean house prices and wages, whilst the DCLG data uses medians. The ONS publish both weekly mean and median earnings for all periods, as well as some data for different quantiles.
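Why does the choice matter? A quick illustration, using numbers I have made up purely for the purpose (they are not drawn from any of the sources above): a single expensive house or high earner pulls the mean well away from the median, so the two ratios can tell quite different stories about the same place.

```python
from statistics import mean, median

# Invented, illustrative figures only. Not real data for any area.
house_prices = [150_000, 180_000, 200_000, 220_000, 850_000]
annual_earnings = [15_000, 20_000, 22_000, 25_000, 43_000]

mean_ratio = mean(house_prices) / mean(annual_earnings)
median_ratio = median(house_prices) / median(annual_earnings)

print(f"mean-based ratio:   {mean_ratio:.1f}")    # 12.8
print(f"median-based ratio: {median_ratio:.1f}")  # 9.1
```

So a headline figure built on means can't be compared directly with one built on medians, even when both claim to measure the same thing.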
And this is just one small example.
My intent here isn’t to criticise the Chronicle, the NHF, DCLG, and especially not the ONS who are working hard to improve how they publish their data.
I just wanted to highlight that:
- we need better norms around data citation, including when and how to link to both new and revised data
- we need better tools for telling stories on the web, that can easily be used by anyone and which can readily access and manipulate raw data
- we need better discovery tools for data that go beyond just keyword searches
- we need to make it easier to share not just analyses but also insights and methods, to avoid doing unnecessary work and to make it easier (or indeed unnecessary) to fact check against sources
That’s an awful lot still to be done. Opening data is just the start of building a good data infrastructure for the web. I’m up for the challenge though. This is the stuff I want to help solve.
Shortly after I published this, Matt Jukes published a post wondering what a digital statistical publication might look like. Matt’s post and Russell Davies’s thoughts on digital white papers are definitely worth a read.