How can open data publishers monitor usage?

Some open data publishers require a user to register with their portal or provide other personal information before downloading a dataset.

For example:

  • the recently launched Consumer Data Research Centre data portal requires users to register and log in before data can be downloaded
  • access to any of the OS Open Data products requires the completion of a form which asks for personal information and an email address to which a download link is sent
  • the Met Office Data Point API provides OGL licensed data but users must register in order to obtain an API key

Requiring a registration step is in fact very common when it comes to open data published via an API. Registration is required on Transport API, Network Rail and Companies House, to name a few. This isn’t always the case though: the Open Corporates API can be used without a key, as can APIs exposed via the Socrata platform (and other platforms, I’m sure). In both cases registration carries the benefit of increased usage limits.

The question of whether to require a login is one that I’ve run into a few times. I wanted to explore it a little in this post to tease out some of the issues and alternatives.

For the rest of the post whenever I refer to “a login” please read it as “a login, registration step, or other intermediary web form”.

Is requiring a login permitted?

I’ll note from the start that the open definition doesn’t have anything to say about whether or not a login is permitted.

The definition simply says that data “…must be provided as a whole and at no more than a reasonable one-time reproduction cost, and should be downloadable via the Internet without charge”. In addition the data “…must be provided in a form readily processable by a computer and where the individual elements of the work can be easily accessed and modified.”

You can choose to interpret that in a number of ways. The relevant bits of text have gone through a number of iterations since the definition was first published and I think the current language isn’t as strong as that present in previous versions. That said, I don’t recall there ever being a specific pronouncement against having a login.

There is however a useful discussion on the open definition list from October 2014 which has some interesting comments and is worth reviewing. Andrew Stott’s comments provide a useful framing, asking whether such a step is necessary to the provision of the information.

In my view there are very few cases where such a step is necessary, so as general guidance I’d always recommend against requiring a login when publishing open data.

But, being a pragmatic chap, I prefer not to deal in absolutes so I’d like you to think about the pros and cons on either side.

Why do publishers want a login?

I’ve encountered several reasons why publishers want to require a login:

  1. to collect user information, to learn more about who is using their data and how
  2. to help manage and monitor usage of an API
  3. all of the above

The majority of open data publishers I’ve worked with are very keen to understand who is using their data, how they’re using it, and how successful their users are at building things with their data. It’s entirely natural, as part of providing a free resource, to want to understand if people are finding it useful.

Knowing that data is in use and is delivering value can help justify ongoing access, publication of additional data, or improvements in how existing data is published. Everyone wants to understand if they’re having an impact. Knowing who is interested enough to download the data is a first step towards measuring that.

An API without usage limits presents a potentially unbounded liability for a publisher in terms of infrastructure costs. The inability to manage or balance usage across a user base means that especially active or abusive users can hamper the ability of everyone else to benefit from the API. API keys, and similar authentication methods, provide a hook that can be used to monitor and manage usage. (IP addresses alone are not enough: many users can share one address, and one user can appear under several.)
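As a minimal sketch of the kind of hook an API key provides, the following tallies calls per key and picks out the heaviest users. The `Request` record, the key names and both function names are hypothetical, made up for illustration; a real API would read from its own access logs.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Request(NamedTuple):
    """One API call. The fields are hypothetical, for illustration only."""
    api_key: str
    endpoint: str


def summarise_usage(requests: Iterable[Request]) -> Counter:
    """Count how many calls each API key has made."""
    return Counter(r.api_key for r in requests)


def heavy_users(usage: Counter, threshold: int) -> list[str]:
    """Return the keys whose call count exceeds the threshold, busiest first."""
    return [key for key, count in usage.most_common() if count > threshold]


# Made-up log entries standing in for real access logs
log = [
    Request("key-a", "/stations"),
    Request("key-b", "/stations"),
    Request("key-a", "/departures"),
    Request("key-a", "/departures"),
]
usage = summarise_usage(log)
print(usage["key-a"])         # 3
print(heavy_users(usage, 2))  # ['key-a']
```

A tally like this says how much each key is used, not what the data is being used for, which is why it only goes so far as a measure of impact.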

Why don’t consumers want to log in?

There are also several reasons why data consumers don’t want to have to log in:

  1. they want to quickly review and explore some data and a registration step provides unnecessary barriers
  2. they want or need the freedom to access data anonymously
  3. they don’t trust the publisher with their personal information
  4. they want to automatically script bulk downloads to create production workflows without the hassle of providing credentials or navigating access control
  5. they want to use an API from a browser based application which limits their ability to provide private credentials
  6. all of the above

Again, these are all reasonable concerns.

What are the alternatives?

So, how can publishers learn more about their users and, where necessary, offer a reasonable quality of service whilst also staying mindful of the concerns of users?

I think the best way to explore that is by focusing on the question that publishers really want to answer: who are the users actively engaged in using my data?

Requiring a registration step or just counting downloads doesn’t help you answer that question. For example:

  • I’ve filled in the OS Open Data download form multiple times for the same product, sometimes on the same day but from different machines. I can’t imagine it tells them much about what I’ve done (or not done) with their data and they’ve never asked
  • I’ve registered on portals in order to download data simply to take a look at its contents without any serious intent to use it
  • I’ve worked with data publishers that have lots of detail from their registration database but no real insight into what users are doing, and no ongoing relationship with them

In my view the best way to identify active users and learn more about how they are using your data is to talk to them.

Develop an engagement plan that involves users not just after the release of some data, but before a release. Give them a reason to want to talk to you. For example:

  • tell them when the data is updated, or you’ve made corrections to it. This is a service that many serious consumers would jump at
  • give them a feedback channel that lets them report problems or make suggestions about improvements and then make sure that channel is actually monitored so feedback is acted on
  • help celebrate their successes by telling their stories, featuring their applications in a showcase, or via social media

Giving users a reason to engage can also help with API and resource management. As I mentioned in the introduction, Open Corporates and others provide a basic usage tier that doesn’t require registration. This lets hobbyists, tinkerers and occasional users get what they need. But the promise of freely accessible, raised usage limits gives active users a reason to engage more closely.
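To make the tiered approach concrete, here is a rough sketch of a fixed-window limiter that gives anonymous callers a small allowance and keyed callers a larger one. The limits, the window length and the `TieredLimiter` class are all assumptions for illustration, not a description of how Open Corporates or any other service works. The sketch also assumes any key has already been validated, and it falls back on the IP address as a crude identifier for anonymous callers.

```python
import time
from typing import Optional

# Hypothetical limits: requests allowed per window for each tier
ANONYMOUS_LIMIT = 5
REGISTERED_LIMIT = 50
WINDOW_SECONDS = 60


class TieredLimiter:
    """Fixed-window request counter with a higher allowance for keyed users."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        # client id -> (window start time, requests counted in that window)
        self._windows = {}

    def allow(self, ip_address: str, api_key: Optional[str] = None) -> bool:
        # Keyed users are tracked by key, anonymous users by IP address
        client = f"key:{api_key}" if api_key else f"ip:{ip_address}"
        limit = REGISTERED_LIMIT if api_key else ANONYMOUS_LIMIT
        now = self._clock()
        start, count = self._windows.get(client, (now, 0))
        if now - start >= WINDOW_SECONDS:
            # The previous window has expired, so start counting afresh
            start, count = now, 0
        if count >= limit:
            self._windows[client] = (start, count)
            return False
        self._windows[client] = (start, count + 1)
        return True


limiter = TieredLimiter()
anonymous = [limiter.allow("203.0.113.7") for _ in range(6)]
registered = [limiter.allow("203.0.113.7", api_key="demo-key") for _ in range(6)]
print(anonymous.count(True))   # 5
print(registered.count(True))  # 6
```

The anonymous tier keeps casual exploration free of any registration step, while the larger allowance is the incentive for active users to identify themselves.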

If you’re providing data in bulk but are concerned about data volumes then provide smaller sample datasets that can be used as a preview of the full data.
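One way to produce such a preview, sketched here under the assumption that the data is a CSV file with a header row, is to take a random sample of rows. The `sample_csv` function and the in-memory example data are hypothetical; the reservoir sampling approach is my choice for the sketch because it avoids loading the full file into memory.

```python
import csv
import io
import random


def sample_csv(source, destination, sample_size, seed=0):
    """Copy the header plus a random sample of rows from source to destination.

    Uses reservoir sampling, so only sample_size rows are held in memory.
    Returns the number of data rows written.
    """
    rng = random.Random(seed)
    reader = csv.reader(source)
    writer = csv.writer(destination)
    header = next(reader, None)
    if header is None:
        return 0  # empty input, nothing to sample
    reservoir = []
    for index, row in enumerate(reader):
        if index < sample_size:
            reservoir.append(row)
        else:
            # Keep this row with probability sample_size / (index + 1)
            slot = rng.randint(0, index)
            if slot < sample_size:
                reservoir[slot] = row
    writer.writerow(header)
    writer.writerows(reservoir)
    return len(reservoir)


# Made-up data standing in for a large bulk download
full = io.StringIO("id,name\n" + "".join(f"{i},row{i}\n" for i in range(1000)))
preview = io.StringIO()
written = sample_csv(full, preview, sample_size=10)
print(written)  # 10
```

In practice `source` and `destination` would be files opened with `newline=""`, as the `csv` module documentation recommends.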

In short, just like any other data collection exercise, it’s important that publishers understand why they’re asking users to register. If the data is ultimately of low value, e.g. because people provide fake details, or isn’t acted on as part of an engagement plan, then there’s very little reason to collect the data at all.

This post is part of my “basic questions about data” series. If you’ve enjoyed this one then take a look at the other articles. I’m also interested to hear suggestions for topics, so let me know if you have an idea. 

2 thoughts on “How can open data publishers monitor usage?”

  1. Hi Leigh, very nice article – as usual.

    One comment re the Open Definition: it always has been the view that login requirements are excluded by the Open Definition. Specifically in 1.0 we had:

    “Absence of Technological Restriction” which was felt clearly to exclude requiring login.

    In 2.0 we had something a bit more subtle in 1.2 and 1.3 around access and “no unnecessary technological obstacles to the performance of the licensed rights.” (Login would be an unnecessary obstacle).

    Finally, in 2.1 we have 1.2 (Access) and 1.3 (Machine Readable). The latter has “The work must be provided in a form readily processable by a computer and where the individual elements of the work can be easily accessed and modified.” Again, I think a login requirement would prevent the work from being readily processable by a computer (in that logins are usually human only).

    Agree this is not nearly as explicit as it might be – and perhaps we need a special FAQ going forward for some of these examples (the challenge is to keep this simple and short but also cover the key cases like this).
