Ignore the Bat Caves and Marketplaces: let's talk about Zoning

Cities are increasingly the place where interesting work is happening within the broader open data community.

Cities, of any size, have a well-defined area of influence, a ready-made community and are becoming instrumented with sensors. The latter is either explicit, through the installation of devices by local government, or implicit, via the data automatically collected by and about citizens through mobile apps and devices.

Smart cities

I dislike the term “smart city” for many reasons. The ODI reframing of “open city” is more accessible but the idea still needs some exploration.

The original narrative around smart cities is what I’ve referred to as “the bat cave vision”.

Somewhere in the city, accessible only to a few, is a shadowy control centre. Inside it is a set of wonderful toys that are used to observe and then take action in the city.

In the Bat Cave vision the city is instrumented and the data is pressed into service to protect and help the citizens. There’s a great article called A History of the Urban Dashboard that digs into the evolution of this view of cities, which I’d recommend reading.

I think when people talk about the failure of the smart city vision, they’re referring to this particular view. Given that much of that narrative seems to have been around selling hardware and infrastructure to cities, it’s perhaps not surprising.

Unless you have a clear view of what problems you’re solving, no amount of hardware and iPads is going to help. Batman had a mission first and then built the cave. Not the other way around.

And that’s setting aside whether dashboards themselves are really useful. To quote from the historical review:

Given that much of what we perceive on our urban dashboards is sanitized, decontextualized, and necessarily partial, we have to wonder, too, about the political and ethical implications of this framing: what ideals of “openness” and “accountability” and “participation” are represented by the sterilized quasi-transparency of the dashboard?

Data portals

If smart cities were primarily about collecting data for an administration to use, then data portals propose the opposite.

There are many, many open data portals across Europe and globally, many of which focus on specific cities or localities. To date, these portals have primarily focused on publishing data from government for use by citizens, businesses and other organisations.

The target audience for using data portals is varied. And, depending on where you sit in that audience, you may find that portals are either getting in your way or not providing enough support.

Based on my experience with Bath: Hacked, and from talking to those involved in other local initiatives, the critical success factor is finding and engaging with that audience.

Local authorities rarely have experience of working with a local technical or start-up community except as suppliers (and perhaps not even then due to procurement issues). And while authorities usually have great engagement with local civic groups, data isn’t typically a part of that conversation or used as a basis for collaboration.

City marketplaces?

It’s reasonable to consider whether there’s a middle ground between these two views. One in which there is sharing and reuse of data on both sides.

In his recent blog post, Eddie Copeland has suggested one approach: the city data marketplace. The post piqued my interest as I’ve got form (and scars) in this area.

Copeland suggests that data portals should evolve from being just publishing platforms into forums in which data can be more easily exchanged between a variety of publishers and consumers. This is something I absolutely agree with.

But I’m not convinced by the re-framing around a marketplace vision. And particularly one which proposes paying for access to data.

I think city data portals should be multi-tenant, allowing anyone in a city to publish data into it, for use by others. And I believe that this also means that they should be owned and operated as a shared piece of data infrastructure.

Looking at each of the eight benefits that Copeland proposes:

  1. Increase availability of previously inaccessible datasets – a shared, openly available platform could also deliver this same benefit. By spreading maintenance costs across many different organisations, a platform can have lower overheads and be more inclusive: letting even small volunteer organisations with limited resources share their data.  For commercial organisations there is scope to benefit from collaborative maintenance around local datasets, or publishing data in exchange for others doing the same.
  2. Increase innovation – a collaboration platform could drive similar supply and demand around data; indeed, for citizen data collection an open, freely available dataset may be more of a draw, and easier to deliver, than one commercially licensed via a marketplace
  3. Competitive pricing – the ability to compete on prices suggests that there will be several potential sources for the same dataset. This overlooks the fact that, even after many years of growth, the majority of datasets in the open data ecosystem are single-sourced. It’s only around a few foundational datasets, such as mapping, where we see a mix of commercial and open alternatives. And open seems to be winning.
  4. Review and feedback – we built this type of feature into Kasabi and it exists in many data portals too. But if there is only a single source for the data then reviews and feedback don’t provide much help. While a low quality option might be quickly supplanted by an alternative version, the network effects around data usage mean that a single dataset will quickly dominate. This is something I hadn’t originally appreciated.
  5. Potential new revenue stream for city authorities – cost recovery for APIs and services offered is one way to ensure sustainability. But it works in an open platform too.
  6. Help spread best practice – slow spread of best practice is definitely an issue, but there are ways to achieve that outside of a marketplace. For example, in the South West we’ve started sharing experience around publishing parking data, which I hope will lead to convergence on standards and also a set of free, open source apps that can easily consume that data.
  7. Highlight which open data sets cities should provide – again, I agree with the need to identify and find value in open data. But there’s some excellent thinking emerging around this already.
  8. Policy based on hard data in place of modelling – data-driven policy making, ftw!

My general intent here is to start a wider dialogue around the future direction of city data portals, rather than critique Eddie specifically. I’ve itemised the points to identify areas of agreement as much as disagreement.

I’m just concerned that we might waste time exploring a paid future rather than embracing a commons based approach. I don’t think we’ve fully explored or experimented with the potential benefits of shared, collaborative open data platforms.

Let’s talk about zoning

You can hear me talking about digital graffiti and data infrastructure for cities in this talk from a few years ago.

The graffiti metaphor I used in my talk was deliberately intended to be challenging. A city data infrastructure shouldn’t reflect a single world-view or be a clean, sealed environment. It should allow itself to be annotated and re-purposed by its community whilst continuing to deliver value to everyone.

One of the points I was trying to make in the talk is that local government are already really good at helping to divide up physical cities to allow communities to grow, meet and do business. We call it zoning.

What is the equivalent of zoning for digital cities?

I’d argue that it will involve:

  • defining a space (or spaces) within which anyone can share data, either openly or with additional controls
  • creating touch points, identifiers, that connect datasets and our physical and digital spaces
  • describing the rules of engagement for using that space, such as open licensing, transparency around data collection, and anonymisation practices (a rough sketch of this follows the list)
  • supporting the collection of data from the physical city, by providing access to infrastructure and perhaps permission to instrument the city
  • providing an environment that supports both public and private spaces and enterprises
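
To make that a little more concrete, here’s a purely illustrative sketch, in Python, of the kind of machine-readable “rules of engagement” a shared zone might attach to each dataset published into it. None of the field names come from a real platform or schema; they’re hypothetical.

```python
# A purely illustrative sketch: the kind of metadata a shared city data "zone"
# might require for each dataset published into it. All field names and values
# are hypothetical.
dataset_registration = {
    "publisher": "Example Community Group",       # anyone in the city can publish
    "title": "Air quality readings, city centre",
    "licence": "CC-BY-4.0",                       # open licensing as a condition of the zone
    "identifiers": {
        "place": "example-ward-01",               # shared identifiers connect data to the physical city
    },
    "collection": {
        "method": "fixed roadside sensors",
        "privacy_notice": "https://example.org/privacy",  # transparency about data collection
        "anonymised": True,
    },
    "visibility": "public",                       # or "restricted", for private spaces and enterprises
}
```

The point isn’t the exact fields, but that the rules of the space are explicit and shared by everyone publishing into it.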

Zoning might not sound exciting. It’s unlikely to sell hardware and may not directly inspire business models. But it’s reusing a successful, proven pattern. It’s also focused on managing access to, and use of, (public) spaces, which feels more inclusive to me.

 

Everyone loves a laser

It’s been really interesting to watch how the recently published Environment Agency (EA) LIDAR data has been seized on by a variety of communities to create interesting, fun and useful tools.

For many people, myself included, learning about LIDAR has been an interesting experience. It’s a technology that suddenly seems to be everywhere and which can be used for a number of different purposes, including helping cars see the world. From a geek perspective it’s a classic example of an exciting open data release. It’s not just that the data wasn’t previously accessible. For many people it wasn’t obvious that this type of data even existed. Happily, as it turns out, there are plenty of tools with which we can use it. Cue lots of interested hackers playing with laser data. Everyone loves a laser, especially the green ones.
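
As a small example of how approachable the data is, here’s a minimal sketch that opens one downloaded tile and summarises it. It assumes the rasterio library is installed and uses a hypothetical file name; the tiles are ordinary gridded rasters, so most GIS tools and libraries can read them directly.

```python
# A minimal sketch: open one downloaded LIDAR tile (a gridded DSM/DTM raster)
# and summarise it. Assumes rasterio is installed; "tile.asc" is a hypothetical
# file name standing in for a real downloaded tile.
import rasterio

with rasterio.open("tile.asc") as src:
    heights = src.read(1, masked=True)  # first band, with nodata cells masked
    print(f"Grid size: {src.width} x {src.height} cells")
    print(f"Cell size: {src.res[0]} m")
    print(f"Elevation range: {heights.min():.1f} m to {heights.max():.1f} m")
```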

I think it’s going to be difficult to top that excitement with further releases. But that’s fine, because open data isn’t about finding the next most exciting new data. It’s about publishing what we have. Sharing what we know. And then using that information to solve problems.

Focus on the exciting and you’ll risk missing the useful. Data infrastructure is likely to be boring.

Coverage and impacts

If you take a look at this map of all of the EA LIDAR data you can see that it’s not a complete map of the UK. The data is collected for the EA’s operational needs, so the choice of areas to scan is based on what they need to know. I’ve seen a few people disappointed that their home city or area isn’t covered.

However there are lots of commercial firms that have offered LIDAR surveys for many years. I recently chatted with someone who had used a commercial survey to fill in coverage in an area of London, as part of a solar installation project. So it’s possible to fill in the gaps, at a price.

I think there are a few interesting questions to explore:

  • has the EA LIDAR release impacted those businesses, or has it just been a useful sales and marketing tool to help showcase the value of LIDAR surveys?
  • are customers of LIDAR surveys interested in just the raw point cloud and/or surface models as released by the EA, or are there value-added services to be offered?
  • how do the typical surveys they carry out compare with the area covered by the EA?
  • what are the typical costs for surveying an area to a similar resolution as the EA uses?

Collaborative curation as a pattern

The reason why I think this is interesting is that if the EA data is genuinely useful, then perhaps there should be a national dataset that covers the entire UK? It’s unlikely that the EA would want to take on this responsibility as it’s probably outside their remit. So what would it take to fill in the gaps? Is there a collaborative business model that could be explored here?

It would be interesting to consider whether there’s a business model for extending the LIDAR survey coverage that might include:

  • shared costs for regular basic surveying and data release, including data collection, analysis and publishing
  • value-added services for data analysis
  • value-added services for more frequent or detailed surveys

I think the Ordnance Survey may also do some LIDAR surveying or at least benefit from it. So is this an area they could invest in? If the data is useful in a variety of businesses, then sharing costs might allow for a more comprehensive national dataset without negatively impacting existing businesses offering services around LIDAR data.

Taking a step back I wonder whether this is a pattern that we can expect to reoccur in other sectors. Specifically:

  • an open data release highlights the utility of wider availability and consumption of a specific operational dataset
  • the dataset does not have the ideal coverage, or is not as up to date as it could be, and so requires further investment
  • organisations in a sector come together to explore benefits of a shared ownership model

The other examples of this that I’m familiar with are the model used by legislation.gov.uk and organisations like CrossRef and ORCID. Although in each case the motivations and interactions are slightly different. I’ve previously described a way to characterise these projects, if you’re interested.

I also wonder whether we might see similar activities emerge around national statistics and other survey data. For example what might we learn if the Defra Family Food Survey was more regular and comprehensive?

As I’ve suggested previously, I think what we need is a data infrastructure incubator to help explore these ideas and begin brokering interactions within a sector.

 

Digital public institutions for the information commons?

I’ve been thinking a bit about “the commons” recently. Specifically, the global information commons that is enabled and supported by Creative Commons (CC) licences. This covers an increasingly wide variety of content as you can see in their recent annual review.

The review unfortunately doesn’t mention data although there’s an increasing amount of that published using CC (or compatible) licences. Hopefully they’ll cover that in more detail next year.

I’ve also been following with interest Tom Steinberg’s exploration of Digital Public Institutions (Part 1, Part 2). As a result of my pondering about the information and data commons, I think there’s a couple of other types of institution which we might add to Tom’s list.

My proposed examples of digital public institutions are deliberately broad. They’re intended to serve the citizens of the internet, not just any one country.

Commons curators

Everyone has seen interesting facts and figures about the rapidly growing volume of activity on the web. These are often used as examples of dizzying growth and as a jumping off point for imagining the next future shocks that are only just over the horizon. The world is changing at an ever increasing rate.

But it’s also an archival challenge. The majority of that material will never be listened to, read or watched. Data will remain unanalysed. And in all likelihood it may disappear before anyone has had any chance to unlock its potential. Sometimes media needs time to find its audience.

This is why projects like the Internet Archive are so important. I think the Internet Archive is one of the greatest achievements of the web. If you need convincing then watch this talk by Brewster Kahle. If, like me, you’re of a certain age then these two things alone should be enough to win you over.

I think we might see, and arguably need, more digital public institutions that are not just archiving great chunks of the web, but also working with that material to help present it to a wider audience.

I see other signals that this might be a useful thing to do. Think about all of the classic film, radio and TV material that is never likely to see the light of day again. Not just for rights reasons, but also because it’s not HD quality or hasn’t been cut and edited to reflect modern tastes. I think this is at least partly the reason why we see so many reboots and remakes.

Archival organisations often worry about how to preserve digital information. One tactic is to consider how to migrate between formats to ensure information remains accessible. What if we treated media the same? E.g. by re-editing or remastering it to make it engaging to a modern audience? Here’s an example of modernising classic scientific texts and another that is remixing Victorian jokes as memes.

Maybe someone could spin a successful commercial venture out of this type of activity. But I’m wondering whether you could build a “public service broadcasting” organisation that presented refined, edited, curated views of the commons? I think there’s certainly enough raw materials.

Global data infrastructure projects

The ODI have spent some time this year trying to bring into focus the fact that data is now infrastructure. In my view the best exemplar of a truly open piece of global data infrastructure is OpenStreetMap (OSM). A collaboratively maintained map of our world. Anyone can contribute. Anyone can use it.

OSM was set up to try to solve the issue that the UK’s mapping and location infrastructure was, and largely still is, tied up with complex licensing and commercial models. Rather than knocking at the door of existing data holders to convince them to release their data, OSM shows what you can deliver with the participation of a crowd of motivated people using modern technology.

It’s a shining example of the networked age we live in.

There’s no reason to think that this couldn’t be done for other types of data, creating more publicly owned infrastructure. There are now many more ways in which people could contribute data to such projects. Whether that information is about themselves or the world around us.

Getting good coverage and depth of data could potentially be achieved very quickly. Costs to host and serve data are also dropping, so sustainability becomes more achievable.

And I also feel (hope?) there is a growing unease with so much data infrastructure being owned by commercial organisations. So perhaps there’s a movement towards wanting more of this type of collaboratively owned infrastructure.

Data infrastructure incubators

If you buy into the fact that we need more projects like OSM, then it’s natural to start thinking about the common features of such projects. Those that make them successful and sustainable. There are likely to be some common organisational patterns that can be used as a framework for designing these organisations. While it is focused on scholarly research, I think this is the best attempt at capturing those patterns that I’ve seen so far.

Given a common framework, it becomes possible to create incubators whose job it is to launch these projects and coach, guide and mentor them towards success.

So that is my third and final addition to Steinberg’s list: incubators that are focused not on the creation of the next start-up “unicorn” but on generating successful, global collaborative data infrastructure projects. Institutions whose goal is the creation of the next OpenStreetMap.

These types of projects have a huge potential impact as they’re not focused on a specific sector. OSM is relevant to many different types of application, and its data is used in many different ways. I think there’s a lot more foundational data of this type which could and should be publicly owned.

I may be displaying my naivety, but I think this would be a nice thing to work towards.

Improving the global open data index

The 2015 edition of the Global Open Data Index was published this week. From what I’ve seen it’s the result of an enormous volunteer effort and there’s a lot to celebrate. For example, the high ranking of Rwanda, resulting from ongoing efforts to improve their open data publication. Owen Boswarva has also highlighted the need for the UK and other highly ranked countries not to get complacent.

Unfortunately, I have a few issues with the index as it stands which I wanted to document and which I hope may be useful input to the revisions for the 2016 review noted at the end of this article. My aim is to provide some constructive feedback rather than to disparage any of the volunteer work or to attempt to discredit any of the results.

My examples below draw on the UK scores simply because it’s the country with which I’m most familiar. I was also involved in a few recent email discussions relating to the compilation of the final scores and some last minute revisions to the dataset definitions.

Disclaimers aside, here are the problems that I think are worth identifying.

Lack of comparability

Firstly, it should be highlighted that the 2015 index is based on a different set of criteria than previous years. A consultation earlier this year led to some revisions to the index. These revisions included both the addition of new datasets and revisions to the assessment criteria for some of the existing datasets.

The use of a different set of criteria means it’s not really possible to compare rankings of countries between years. You can make comparisons between countries in the 2015 rankings, but you can’t really compare the rank of a single country between years as they are being assessed on different information. The website doesn’t make this clear at all on the ranking pages.

Even worse, the information as presented is highly misleading. If you look at the UK results for election data in 2015 and then look at the results for 2014 you’ll see that the 2014 page is displaying the scores for 2014 but the assessment criteria for 2015. It should be showing the assessment criteria for 2014 instead. This makes it seem like the UK has gone backwards from 100% open to 0% open for the same criteria, rather than being assessed in a completely different way.

If you look at the Wayback Machine entry for the 2014 results, you’ll see the original criteria.

Recommended fixes:

  • Remove comparisons between years, or at least provide a clear warning indicator about interpretation
  • Ensure that historical assessments include the original criteria used (I can only assume this is a bug)

Lack of progress indicators

Related to the above, another failing of the index is that it doesn’t measure progress towards a desired outcome.

Sticking with the election data example, the changes in the assessment criteria included requirements to report additional vote counts and, most significantly, that data should be reported at the level of the polling station rather than by constituency.

My understanding is that the Electoral Commission have always published the additional counts, e.g. invalid and spoiled ballots. It’s only the change in the level of aggregation of results that is different. The UK doesn’t report to polling station level (more on that below). But the data that is available – the same data that was previously 100% open – is still being published. Its continued availability has simply been ignored.

This raises important questions about how viable it is for a country to use the index as a means to measure or chart its progress: changes to the criteria can completely discount all previous efforts that went into improving how data is published, and the ongoing publication of valuable data.

Recommended fixes:

  • Include some additional context in the report to acknowledge what data does exist in the domain being measured, and how it aligns with criteria
  • Where reporting to a particular level of detail is required, include some intermediary stages, e.g. “national”, “regional”, “local”, that can help identify where data is available but only at a more aggregated level

Inconsistent and debatable reporting requirements

The reporting requirements in different areas of the index are also somewhat inconsistent. In some areas a very high level of detail is expected, whereas in others less detail is acceptable. This is particularly obvious with datasets that have a geographical component. For example:

  • Location data should be available to postcode (or similar) level
  • Election data must now be reported to polling station, rather than constituency as previously indexed
  • The weather data criteria doesn’t indicate any geographic reporting requirements
  • Similarly the national statistics criteria simply ask for an annual population count, but it’s unclear what level is acceptable. E.g. is a single count for the whole country permissible?

While we might naturally expect different criteria for different domains, this obscures the fact that:

  • Address data is hugely valuable, widely identified as a critical dataset for lots of applications, but its availability is not measured in the index. The UK gets full marks whereas it should really be marked down because of its closed address database. The UK is getting a free ride where it should be held up as being behind the curve. Unless the criteria are revised again, Australia’s move to open its national address database will be ignored.
  • Changing the level of reporting of election data in the UK will require a change in the way that votes are counted. It may be possible to report turnout at a station level, but we would need to change aspects of the electoral system in order to reach the level required by the index. Is this really warranted?

The general issue is that the rationale behind the definition of specific criteria is not clear.

Recommended fixes:

  • Develop a more tightly defined set of assessment criteria and supplement these with a clear rationale, e.g. what user needs are being met or unmet if data is published to a specific level?
  • Ensure that a wide representation of data consumers has been included in drafting any new criteria
  • Review all of the current criteria, with domain experts and current consumers of these datasets, to understand whether their needs are being met

Inconsistent scoring against criteria

I also think that there’s some inconsistent scoring across the index. One area where there seems to be varied interpretation is around the question: “Available in bulk?”. The definition of this question states:

“Data is available in bulk if the whole dataset can be downloaded easily. It is considered non-bulk if the citizens are limited to getting parts of the dataset through an online interface. For example, if restricted to querying a web form and retrieving a few results at a time from a very large database.”

Looking at the UK scores we have:

  1. Location data is marked as available. However there are several separate datasets which cover this area, e.g. the list of postcodes is in one dataset, administrative boundaries are in another. The user has to fill in a form to get access to a separate download URL for each product. The registration step is noted but ignored in the assessment. And it seems that the reviewer is happy with downloading separate files for each dataset
  2. Weather data is marked as available. But the details say that you have to sign up for an API on the Azure Marketplace and then harvest the data using that API, presumably having first written the custom code required to retrieve it. Is this not limiting access to parts of the dataset? And, specifically, limiting access to certain types of users? It’s not what I’d consider to be a bulk download. The reviewer has also missed the download service on data.gov.uk which does actually provide the information in bulk
  3. Air quality data is not available in bulk. However you can access all the historical data as Atom files, or have all the historical data for all monitoring sites emailed to you, or access the data as RData files via the openair package. Is this presenting a problem for users of air quality data in the UK?

Personally, as a data consumer I’m happy with the packaging of all of these datasets. I definitely agree that the means of access could be improved, e.g. provision of direct download links without registration forms, provision of manifest files to enable easier mirroring. And I have a preference for the use of simpler formats (CSV vs Atom), etc. But that’s not what is being assessed here.

Again, it’s really a user needs issue. Is it important to have bulk access to all of the UK’s air quality data in a single zip file, or all of the historical data for a particular location in a readily accessible form? The appropriate level of packaging will depend on the use case of the consumer.

Recommended fixes:

  • Provide additional guidance for reviewers
  • Ensure that there are multiple reviewers for each assessment, including those familiar with not just the domain but also the local publishing infrastructure
  • Engage with the data publishers throughout the review process and enable them to contribute to the assessments

Hopefully this feedback is taken in the spirit it’s offered: as constructive input into improving a useful part of the open data landscape.

 

 

We have a long way to go

Stood in the queue at the supermarket earlier, I noticed the cover of the Bath Chronicle. The lead story this week is: “House prices in Bath almost 13 times the average wage”. This is almost perfectly designed clickbait for me. I can’t help but want to explore the data.

In fact I’ve already done this before, when the paper published a similar headline in September last year: “Average house price in Bath is now eight times average salary“. I wrote a blog post at the time to highlight some of the issues with their reporting.

Now I’m writing another blog post, but this time to highlight how far we still have to go with publishing data on the web.

To try to illustrate the problems, here’s what happened when I got back from the supermarket:

  1. Read the article on the Chronicle website to identify the source of the data, the annual Home Truths report published by the National Housing Federation.
  2. I then googled for “National Housing Federation Home Truths” as the Chronicle didn’t link to its sources.
  3. I then found and downloaded the “Home Truths 2014/15: South West” report, which has a badly broken table of figures in it. After some careful reading I realised the figures didn’t match the Chronicle’s
  4. Double-checking, I browsed around the NHF website and found the correct report: “Home Truths 2015/2016: The housing market in the South West“. Which, you’ll notice, isn’t clearly signposted from their research page
  5. The report has a mean house price of £321,674 for Bath & North East Somerset using Land Registry data from 2014. It also has a figure of £25,324 for mean annual earnings in 2014 for the region, giving a ratio of 12.7. The earnings data is from the ONS ASHE survey
  6. I then googled for the ASHE survey figures as the NHF didn’t link to its sources
  7. Having found the ONS ASHE survey I clicked on the latest figures and found the reference tables before downloading the zip file containing Table 8
  8. Unzipping, I opened the relevant spreadsheet and found the worksheet containing the figures for “All” employees
  9. Realising that the ONS figures were actually weekly rather than annual wages I opened up my calculator and multiplied the value by 52
  10. The figures didn’t match. Checked my maths
  11. I then realised that, like an idiot, I’d downloaded the 2015 figures but the NHF report was based on the 2014 data
  12. Returning to the ONS website I found the tables for the 2014 Revised version of the ASHE
  13. Downloading, unzipping, and calculating I found that again the figures didn’t match
  14. On a hunch, I checked the ONS website again and then found the reference tables for the 2014 Provisional version of the ASHE
  15. Downloading, unzipping, and re-calculating I finally had my match for the NHF figure
  16. I then decided that rather than dig further I’d write this blog post

This is a less than ideal situation. What could have streamlined this process?

The lack of direct linking – from the Chronicle to the NHF, and from the NHF to the ONS – was the root cause of my issues here. I spent far too much time working to locate the correct data. Direct links would have avoided all of my bumbling around.

While a direct link would have taken me straight to the data, I might have missed out on the fact that there were revised figures for 2014. Or that there were actually some new provisional figures for 2015. So there’s actually an update to the story already waiting to be written. The analysis is already out of date.

The new data was published on the 18th November and the NHF report on the 23rd. That gave a five day period in which the relevant tables and commentary could have been updated. Presumably the report was too deep into final production to make changes. Or maybe just no-one thought to check for updated data.

If both the raw data from the ONS and the NHF analysis had been published natively to the web, rather than in a PDF, maybe some of that production overhead could have been reduced. I know PDF has some better support for embedding and linking data these days, but a web-native approach might have been more dynamic.

In fact, why should the numbers have been manually recalculated at all? The actual analysis involves little more than pulling some cells from existing tables and doing some basic calculations. Maybe that could have been done on the fly? Perhaps by embedding the relevant figures. At the moment I’m left doing some manual copy-and-paste.
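
To show just how small that calculation is, here’s a rough sketch using the figures quoted above. The numbers come from the NHF report and the ASHE tables as described earlier; the main wrinkle is remembering that ASHE reports weekly, not annual, earnings.

```python
# A rough sketch of the headline calculation, using the figures quoted above.
# The ONS ASHE tables report *weekly* earnings, so they need multiplying by 52
# before the ratio is calculated.
MEAN_HOUSE_PRICE_BANES_2014 = 321_674  # Land Registry mean price for B&NES, 2014 (via the NHF report)
MEAN_WEEKLY_EARNINGS_2014 = 487        # weekly figure implied by the NHF's £25,324 annual earnings

annual_earnings = MEAN_WEEKLY_EARNINGS_2014 * 52
ratio = MEAN_HOUSE_PRICE_BANES_2014 / annual_earnings

print(f"Mean annual earnings: £{annual_earnings:,}")
print(f"House prices are {ratio:.1f} times mean annual earnings")  # roughly 12.7
```

If the relevant cells were addressable on the web, a calculation like this could be embedded in the article and recomputed whenever the source tables are revised.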

It’s not just NHF that are slow to publish their figures though. Researching the Chronicle article from last year, I turned up some DCLG figures on housing market and house prices. These weren’t actually referenced from the article or any of its sources. I just tripped over them whilst investigating. Because data nerd.

The live (sic) DCLG tables include a ratio of median house prices to median earnings but they haven’t been updated since April 2014. Their analysis only uses the provisional ASHE figures for 2013.

Oh, and just for fun, the NHF analysis uses mean house prices and wages, whilst the DCLG data uses medians. The ONS publish both weekly mean and median earnings for all periods, as well as some data for different quantiles.

And this is just one small example.

My intent here isn’t to criticise the Chronicle, the NHF, DCLG, and especially not the ONS who are working hard to improve how they publish their data.

I just wanted to highlight that:

  • we need better norms around data citation, including when and how to link to both new and revised data
  • we need better tools for telling stories on the web, that can easily be used by anyone and which can readily access and manipulate raw data
  • we need better discovery tools for data that go beyond just keyword searches
  • we need to make it easier to share not just analyses but also insights and methods, to avoid doing unnecessary work and to make it easier (or indeed unnecessary) to fact check against sources

That’s an awful lot to still be done. Opening data is just the start at building a good data infrastructure for the web. I’m up for the challenge though. This is the stuff I want to help solve.

Shortly after I published this, Matt Jukes published a post wondering what a digital statistical publication might look like. Matt’s post and Russell Davies’ thoughts on digital white papers are definitely worth a read.

How can open data publishers monitor usage?

Some open data publishers require a user to register with their portal or provide other personal information before downloading a dataset.

For example:

  • the recently launched Consumer Data Research Centre data portal requires users to register and login before data can be downloaded
  • access to any of the OS Open Data products requires the completion of a form which asks for personal information and an email address to which a download link is sent
  • the Met Office Data Point API provides OGL licensed data but users must register in order to obtain an API key

Requiring a registration step is in fact very common when it comes to open data published via an API. Registration is required on Transport API, Network Rail and Companies House, to name a few. This isn’t always the case though, as the Open Corporates API can be used without a key, as can APIs exposed via the Socrata platform (and other platforms, I’m sure). In both cases registration carries the benefit of increased usage limits.

The question of whether to require a login is one that I’ve run into a few times. I wanted to explore it a little in this post to tease out some of the issues and alternatives.

For the rest of the post whenever I refer to “a login” please read it as “a login, registration step, or other intermediary web form”.

Is requiring a login permitted?

I’ll note from the start that the open definition doesn’t have anything to say about whether a login is permitted or not permitted.

The definition simply says that data “…must be provided as a whole and at no more than a reasonable one-time reproduction cost, and should be downloadable via the Internet without charge”. In addition the data “…must be provided in a form readily processable by a computer and where the individual elements of the work can be easily accessed and modified.”

You can choose to interpret that in a number of ways. The relevant bits of text have gone through a number of iterations since the definition was first published and I think the current language isn’t as strong as that present in previous versions. That said, I don’t recall there ever being a specific pronouncement against having a login.

There is however a useful discussion on the open definition list from October 2014 which has some interesting comments and is worth reviewing. Andrew Stott’s comments provide a useful framing, asking whether such a step is necessary to the provision of the information.

In my view there are very few cases where such a step is necessary, so as general guidance I’d always recommend against requiring a login when publishing open data.

But, being a pragmatic chap, I prefer not to deal in absolutes so I’d like you to think about the pros and cons on either side.

Why do publishers want a login?

I’ve encountered several reasons why publishers want to require a login:

  1. to collect user information to learn more about using their data
  2. to help manage and monitor usage of an API
  3. all of the above

The majority of open data publishers I’ve worked with are very keen to understand who is using their data, how they’re using it, and how successful their users are at building things with their data. It’s entirely natural, as part of providing a free resource, to want to understand if people are finding it useful.

Knowing that data is in use and is delivering value can help justify ongoing access, publication of additional data, or improvements in how existing data is published. Everyone wants to understand if they’re having an impact. Knowing who is interested enough to download the data is a first step towards measuring that.

An API without usage limits presents a potentially unbounded liability for a publisher in terms of infrastructure costs. The inability to manage or balance usage across a user base means that especially active or abusive users can hamper the ability for everyone to benefit from the API. API keys, and similar authentication methods, provide a hook that can be used to monitor and manage usage. (IP addresses are not enough.)
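
To illustrate the hook a key provides, here’s a minimal, illustrative sketch of per-key rate limiting. The limits and the in-memory counter are just for the example; a real service would use a shared store, but the principle is the same.

```python
# A minimal, illustrative sketch of per-key rate limiting: count requests per
# API key per time window. A real deployment would use a shared store rather
# than an in-memory dict, but the principle is the same.
import time
from collections import defaultdict

WINDOW_SECONDS = 3600
ANONYMOUS_LIMIT = 100      # small allowance for unregistered callers
REGISTERED_LIMIT = 10_000  # raised limit as an incentive to register

_counters = defaultdict(int)  # (caller, window) -> request count

def allow_request(api_key=None):
    """Return True if this request is within the caller's usage limit."""
    window = int(time.time()) // WINDOW_SECONDS
    caller = api_key or "anonymous"
    limit = REGISTERED_LIMIT if api_key else ANONYMOUS_LIMIT
    _counters[(caller, window)] += 1
    return _counters[(caller, window)] <= limit
```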

Why don’t consumers want to login?

There are also several reasons why data consumers don’t want to have to login:

  1. they want to quickly review and explore some data and a registration step provides unnecessary barriers
  2. they want or need the freedom to access data anonymously
  3. they don’t trust the publisher with their personal information
  4. they want to automatically script bulk downloads to create production workflows without the hassle of providing credentials or navigating access control
  5. they want to use an API from a browser based application which limits their ability to provide private credentials
  6. all of the above

Again, these are all reasonable concerns.

What are the alternatives?

So, how can publishers learn more about their users and, where necessary, offer a reasonable quality of service whilst also staying mindful of the concerns of users?

I think the best way to explore that is by focusing on the question that publishers really want to answer: who are the users actively engaged in using my data?

Requiring a registration step or just counting downloads doesn’t help you answer that question. For example:

  • I’ve filled in the OS Open Data download form multiple times for the same product, sometimes on the same day but from different machines. I can’t imagine it tells them much about what I’ve done (or not done) with their data and they’ve never asked
  • I’ve registered on portals in order to download data simply to take a look at its contents without any serious intent to use it
  • I’ve worked with data publishers that have lots of detail from their registration database but no real insight into what users are doing, nor any ongoing relationship with them

In my view the best way to identify active users and learn more about how they are using your data is to talk to them.

Develop an engagement plan that involves users not just after the release of some data, but before a release. Give them a reason to want to talk to you. For example:

  • tell them when the data is updated, or when you’ve made corrections to it. This is a service that many serious consumers would jump at
  • give them a feedback channel that lets them report problems or make suggestions about improvements and then make sure that channel is actually monitored so feedback is acted on
  • help celebrate their successes by telling their stories, featuring their applications in a showcase, or via social media

Giving users a reason to engage can also help with API and resource management. As I mentioned in the introduction, Open Corporates and others provide a basic usage tier that doesn’t require registration. This lets hobbyists, tinkerers and occasional users get what they need. But the promise of freely accessible, raised usage limits gives active users a reason to engage more closely.

If you’re providing data in bulk but are concerned about data volumes then provide smaller sample datasets that can be used as a preview of the full data.
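
Producing a preview like that is cheap. As a sketch (the file names are hypothetical), it can be as simple as keeping the header and the first few thousand rows of the bulk export:

```python
# A minimal sketch: produce a small preview extract from a bulk CSV export so
# people can inspect the data before committing to the full download.
# File names are hypothetical.
import csv

SAMPLE_ROWS = 1000

with open("full_dataset.csv", newline="") as full, \
        open("sample_dataset.csv", "w", newline="") as sample:
    reader = csv.reader(full)
    writer = csv.writer(sample)
    writer.writerow(next(reader))  # keep the header row
    for i, row in enumerate(reader):
        if i >= SAMPLE_ROWS:
            break
        writer.writerow(row)
```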

In short, just like any other data collection exercise, it’s important that publishers understand why they’re asking users to register. If the data is ultimately of low value, e.g. people providing fake details, or isn’t acted on as part of an engagement plan, then there’s very little reason to collect it at all.

This post is part of my “basic questions about data” series. If you’ve enjoyed this one then take a look at the other articles. I’m also interested to hear suggestions for topics, so let me know if you have an idea. 

Who is the intended audience for open data?

This post is part of my ongoing series: basic questions about data. It’s intended to expand on a point that I made in a previous post in which I asked: who uses data portals?

At times I see quite a bit of debate within the open data community around how best to publish data. For example should data be made available in bulk or via an API? Which option is “best”? Depending on where you sit in the open data community you’re going to have very different responses to that question.

But I find that in the ensuing debate we often overlook that open data is intended to be used by anyone, for any purpose. And that means that maybe we need to think about more than just the immediate needs of developers and the open data community.

While the community has rightly focused on ensuring that data is machine-readable, so it can be used by developers, we mustn’t forget that data needs to be human-readable too. Otherwise we end up with critiques of what I consider to be fairly reasonable and much-needed guidance on structuring spreadsheets, and suggestions of alternatives that are well meaning but feel a little under-baked.

I feel that there are several different and inter-related viewpoints being expressed:

  • That the citizen or user is the focus and we need to understand their needs and build services that support them. Here data tends to be a secondary concern and perhaps focused on transactional statistics on performance of those services, rather than the raw data
  • That open data is not meant for mere mortals and that its primary audience is developers to analyse and present to users. The emphasis here is on provision of the raw data as rapidly as possible
  • A variant of the above that emphasises delivery of data via an API to web and mobile developers allowing them to more rapidly deliver value. Here we see cases being made about the importance of platforms, infrastructure, and API programs
  • That citizens want to engage with data and need tools to explore it. In this case we see arguments for on-line tools to explore and visualise data, or reasonable suggestions to simply publish data in spreadsheets as this is a format with which many, many people are comfortable

Of course all of these are correct, although their prominence around different types of data, application, etc varies wildly. Depending on where you sit in the open data value network your needs are going to be quite different.

It would be useful to map out the different roles of consumers, aggregators, intermediaries, etc to understand what value exchanges are taking place, as I think this would help highlight the value that each role brings to the ecosystem. But until then both consumers and publishers need to be mindful of potentially competing interests. In an ideal world publishers would serve every reuser need equally.

My advice is simple: publish for machines, but don’t forget the humans. All of the humans. Publish data with context that helps anyone – developers and the interested reader – properly understand the data. Ensure there is at least a human-readable summary or view of the data as well as more developer oriented bulk downloads. If you can get APIs “out of the box” with your portal, then invest the effort you would otherwise spend on preparing machine-readable data in providing more human-readable documentation and reports.

Our ambition should be to build an open data commons that is accessible and useful for as many people as possible.