Do data scientists spend 80% of their time cleaning data? Turns out, no.

It’s hard to read an article about data science or really anything that involves creating something useful from data these days without tripping over this factoid, or some variant of it:

Data scientists spend 80% of their time cleaning data rather than creating insights.

Or

Data scientists only spend 20% of their time creating insights, the rest wrangling data.

It’s frequently used to highlight the need to address a number of issues around data quality, standards and access. Or as a way to sell portals, dashboards and other analytic tools.

The thing is, I think it’s a bullshit statistic.

Not because I think there aren’t improvements to be made in how we access and share data. Far from it. My issue is more with how that statistic is framed, and because it’s just endlessly parroted without any real insight.

What did the surveys say?

I’ve tried to dig out the underlying survey or source of that factoid, to see if there’s more context. While the figure is widely referenced, it’s rarely accompanied by a link to a survey or results.

Amusingly, this IBM data science product marketing page cites this 2018 HBR blog post, which cites this 2017 IBM blog, which cites this 2016 Crowdflower survey. Why don’t people link to original sources?

In terms of sources of data on how data scientists actually spend their time, I’ve found two ongoing surveys.

So what do these surveys actually say?

  • Crowdflower, 2015: “66.7% said cleaning and organizing data is one of their most time-consuming tasks“.
    • They didn’t report estimates of time spent
  • Crowdflower, 2016: “What data scientists spend the most time doing? Cleaning and organizing data: 60%, Collecting data sets: 19% …”.
    • Only adds up to ~80% if you also lump in collecting data as well (60% + 19%)
  • Crowdflower, 2017: “What activity takes up most of your time? 51% Collecting, labeling, cleaning and organizing data”
    • Less than 80% and also now includes tasks like labelling of data
  • Figure Eight, 2018: Doesn’t cover this question.
  • Figure Eight, 2019: “Nearly three quarters of technical respondents (73.5%) spend 25% or more of their time managing, cleaning, and/or labeling data”
    • That’s pretty far from 80%!
  • Kaggle, 2017: Doesn’t cover this question
  • Kaggle, 2018: “During a typical data science project, what percent of your time is spent engaged in the following tasks? ~11% Gathering data, 15% Cleaning data…”
    • Again, much less than 80%

Only the Crowdflower survey reports anything close to 80%, but you need to lump in actually collecting data as well.

Are there other sources? I’ve not spent too much time on it. But this 2015 bizreport article mentions another survey which suggests “between 50% and 90% of business intelligence (BI) workers’ time is spend prepping data to be analyzed“.

And an August 2014 New York Times article states that: “Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data“. But doesn’t link to the surveys, because newspapers hate links.

It’s worth noting that “Data Scientist” as a job only really started to become a thing around 2009, although the concept of data science is older. So there may not be much more to dig up. If you’ve seen some earlier surveys, then let me know.

Is it a useful statistic?

So looking at the figures, it seems to me that this is a bullshit statistic. Data scientists do a whole range of different types of task. If you arbitrarily label some of these as analysis and others not, then you can make them add up to 80%.

But that’s not the only reason why I think it’s a bullshit statistic.

Firstly there’s the implication that cleaning and working with data is somehow not worth the time of a data scientist. It’s “data janitor work”. And “It’s a waste of their skills to be polishing the materials they rely on”. Ugh.

Who, might I ask, is supposed to do this janitorial work?

I would argue that spending time working with data, to transform, explore and understand it better, is absolutely what data scientists should be doing. This is the medium they are working in.

Understand the material better and you’ll get better insights.

Secondly, I think data science use cases and workflows are a poor measure for how well data is published. Data science is frequently about doing bespoke analysis, which means creating and labelling unique datasets. No matter how cleanly formatted or standardised a dataset is, it’s likely to need some work.

A sculptor has different needs than a bricklayer. They both use similar materials. And they both create things of lasting value and worth.

We could measure utility better using assessments other than time spent on bespoke work.

Thirdly, it’s measuring the wrong thing. Actually, maybe some friction around the use of data is a good thing. Especially if it encourages you to spend more time understanding a dataset. Even more so if it in any way puts a brake on dumb uses of machine-learning.

If we want the process of accessing, using and sharing data to be as frictionless as possible in a technical sense, then let’s make sure that is offset by adding friction elsewhere. E.g. to add checkpoints for reviews of ethical impacts. No matter how highly paid a data scientist is, the impacts of poor use of data and AI can be much, much larger.

Don’t tell me that data scientists are spending too much time working with data and not enough time getting insights into production. Tell me that data scientists are increasingly spending 50% of their time considering the ethical and social impacts of their work.

Let’s measure that.

Long live RSS! How I manage my reading

“LONG LIVE RSS!”

I shout these words from my bedroom window every morning. Reaffirming my love for this century’s most criminally neglected data standard.

If you’ve either forgotten, or never enjoyed, the ease of managing your information consumption via the magic of RSS and a feed reader, then you’re missing out mate.

Struggling with the noise, gloom and general bombast of social media? Get yourself a feed reader and fill it full of interesting subscriptions for a most measured and sedate way to consume words.

Once upon a time everyone(*) used them. We engaged in educated discourse, shared blog rolls, sent trackbacks and wrote comments on each other’s websites. Elegant weapons for a more civilized age (**).

I like to read things when I have time, to reduce distractions and give myself a chance to absorb several viewpoints rather than simply the latest, hottest takes.

I’ve fine-tuned my approach to managing my reading and research. A few of the tools and services have changed, but the essentials stay the same. If you’re interested, here’s how I’ve made things work for me:

  • Feedbin
    • Manages all my subscriptions for blogs, newsletters and more into one easily accessible location
    • Lots of sites still support RSS; it’s not dead, merely resting
    • Feedbin is great at discovering feeds if you just paste in a site URL. One of the magic parts of RSS (see the sketch after this list)
    • You can also subscribe to newsletters with a special Feedbin email address and they’ll get delivered to your reader. Brilliant. You’re not making me go back into my inbox, it’s scary in there.
  • Feedme. Feedbin allows me to read posts anywhere, but I use this Android app (there are others) as a client instead
    • Regularly syncs with Feedbin, so I can have all the latest unread posts on my phone for the commute or an idle few minutes
    • It provides a really quick interface to skim through posts and either immediately read them or add them to my “to read” list, in Pocket…
  • Pocket. Mobile and web app that I basically use as a way to manage a backlog of things “to read”.
    • Gives me a clutter-free (no ads!) way to read content either in the browser (which I rarely do) or on my phone
    • It has its issues with some content, but you can easily switch to a full web view
    • Not everything I want to read comes in via my feed reader, so I take links from Slack, Twitter or elsewhere and use the Pocket browser extension or its share button integration to stash things away for later reading. Basically, if it’s not a 1-2 minute read it goes into Pocket until I’m ready for it. Keeps the number of browser tabs under control too.
    • The offline content syncing makes it great for using on my commute, especially on the tube
  • IFTTT. I use this service to do two things:
    • Once I archive something in Pocket, it automatically gets added to Pinboard for me, using the right tags.
    • If I favourite something, it tweets out the link without me having to go and actually look at twitter
  • Pinboard. Basically a complete archive of articles I’ve read.
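As an aside, the “paste in a site URL” magic works because of a simple, widely used convention: sites advertise their feeds in the page HTML using link tags marked rel="alternate". Here’s a minimal sketch of that autodiscovery step in Python (this is just my illustration using the requests and BeautifulSoup libraries, not how Feedbin actually implements it, and the example URL is hypothetical):

```python
# A minimal sketch of RSS/Atom feed autodiscovery, the convention that lets a
# reader find a feed when you paste in a site URL. Not Feedbin's implementation;
# just an illustration using requests and BeautifulSoup (bs4).
import requests
from bs4 import BeautifulSoup

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

def discover_feeds(site_url):
    """Return any feed URLs advertised in the page's <link rel="alternate"> tags."""
    html = requests.get(site_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    feeds = []
    for link in soup.find_all("link"):
        rel = link.get("rel") or []
        if "alternate" in rel and link.get("type") in FEED_TYPES and link.get("href"):
            # hrefs are often relative, so resolve them against the site URL
            feeds.append(requests.compat.urljoin(site_url, link["href"]))
    return feeds

if __name__ == "__main__":
    # hypothetical example site
    print(discover_feeds("https://example.com/blog/"))
```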

The end result is a fully self-curated feed of interesting stuff. I’m no longer fighting someone else’s algorithm, so I can easily find things again.

I can minimise the number of organisations I’m following on twitter, and just subscribe to their blogs. Also helps to buck the trend towards more email newsletters, which are just blogs but you’re all in denial.

Also helps to reduce the number of distractions, and fight the pressure to keep checking on twitter in case I’ve missed something interesting. It’ll be in the feed reader when I’m ready.

Long live RSS!

It’s about time we stopped rebooting social networks and rediscovered more flexible ways to create, share and read content online. Go read

Say it with me. Go on.

LONG LIVE RSS!

(*) not actually everyone, but all the cool kids anyway. Alright, just us nerds, but we loved it.

(**) not actually more civilised, but it was more decentralised

 

Licence Friction: A Tale of Two Datasets

For years now at the Open Data Institute we’ve been working to increase access to data, to create social and economic benefits across a range of sectors. While the details change across projects, one of the more consistent aspects of our work and guidance has been to support data stewards in making data as open as possible, whilst ensuring that it is clearly licensed.

Reference data, like addresses and other geospatial data, that underpins our national and global data infrastructure needs to be available under an open licence. If it’s not, which is the ongoing situation in the UK, then other data cannot be made as open as possible. 

Other considerations aside, data can only be as open as the reference data it relies upon. Ideally, reference data would be in the public domain, e.g. using a CC0 waiver. Attribution should be a consistent norm regardless of what licence is used.

Data becomes more useful when it is linked with other data. When it comes to data, adding context adds value. It can also add risks, but more value can be created from linking data. 

When data is published using bespoke or restrictive licences then it is harder to combine different datasets together, because there are often limitations in the licensing terms that restrict how data can be used and redistributed.

This means data needs to be licensed using common, consistent licences. Licences that work with a range of different types of data, collected and used by different communities across jurisdictions. 

Incompatible licences create friction that can make it impossible to create useful products and services. 

It’s well-reported that data scientists and other users spend huge amounts of time cleaning and tidying data because it’s messy and non-standardised. It’s probably less well-reported how many great ideas are simply shelved because of lack of access to data. Or are impossible because of issues with restrictive or incompatible data licences. Or are cancelled or simply needlessly expensive due to the need for legal consultations and drafting of data sharing agreements.

These are the hurdles you often need to overcome before you even get started with that messy data.

Here’s a real-world example of where the lack of open geospatial data in the UK, and ongoing incompatibilities between data licensing is getting in the way of useful work. 

Introducing Active Places

Active Places is a dataset stewarded by Sport England. It provides a curated database of sporting facilities across England. It includes facilities provided by a range of organisations across the public, private and third-sectors. It’s designed to help support decision making about the provision of tens of thousands of sporting sites and facilities around the UK to drive investment and policy making. 

The dataset is rich and includes a wide range of information from disabled access through to the length of ski slopes or the number of turns on a cycling track.

While Sport England are the data steward, the curation of the dataset is partly subcontracted to a data management firm and partly carried out collaboratively with the owners of those sites and facilities.

The dataset is published under a standard open licence, the Creative Commons Attribution 4.0 licence. So anyone can access, use and share the data so long as they acknowledge its source. Contributors to the dataset agree to this licence as part of registering to contribute to the site.

The dataset includes geospatial data, including the addresses and locations of individual sites. This data includes IP from Ordnance Survey and Royal Mail, which means they have a say over what happens to it. In order to release the data under an open licence, Sport England had to request an exemption from the Ordnance Survey to their default position, which is that data containing OS IP cannot be sublicensed. When granted an exemption, an organisation may publish their data under an open licence. In short, OS waive their rights over the geographic locations in the data. 

The OS can’t, however, waive any rights that Royal Mail has over the address data. In order to grant Sport England an exemption, the OS also had to seek permission from Royal Mail. The Sport England team were able to confirm this for me.

Unfortunately it’s not clear, without having checked, that this is actually the case. It’s not evident in the documentation of either Active Places or the OS exemption process. Is clarifying all third-party rights a routine part of the exemption process or not?

It would be helpful to know. As the ODI has highlighted, lack of transparency around third-party rights in open data is a problem. For many datasets the situation remains unclear. And unclear positions are fantastic generators of legal and insurance fees.

So, to recap: Sport England has invested time in convincing Ordnance Survey to allow it to openly publish a rich dataset for the public good. A dataset in which geospatial data is clearly important, but is not the main feature. The reference data is dictating how open the dataset can be and, as a result, how much value can be created from it.

In case you’re wondering, lots of other organisations have had to do the same thing. The process is standardised to try and streamline it for everyone. A 2016 FOI request shows that between 2011 and 2015 the Ordnance Survey handled more than 1,000 of these requests.

Enter OpenStreetMap

At the end of 2019, members of the OpenStreetmap community contacted Sport England to request permission to use the Active Places dataset. 

If you’re not familiar with OpenStreetmap, then you should be. It’s an openly licensed map of the world maintained by a huge community of volunteers, humanitarian organisations, public and private sector businesses around the world.

The OpenStreetmap Foundation is the official steward of the dataset, with the day-to-day curation and operations happening through its volunteer network. As a small not-for-profit, it has to be very cautious about legal issues relating to the data. It can’t afford to be sued. The community is careful to ensure that data that is imported or added into the database comes from openly licensed sources.

In March 2017, after a consultation with the Creative Commons, the OpenStreetmap Licence/Legal Working Group concluded that data published under the Creative Commons Attribution licence is not compatible with the licence used by OpenStreetmap which is called the Open Database Licence. They felt that some specific terms in the licence (and particularly in its 4.0 version) meant that they needed additional permission in order to include that data in OpenStreetmap.

Since then, the OpenStreetmap community has been contacting data stewards to ask them to sign an additional waiver that grants the OSM community explicit permission to use the data. This is exactly what open licensing of data is intended to avoid.

CC-BY is one of the most frequently used open data licences, so this isn’t a rare occurrence. 

As an indicator of the extra effort required: in a 2018 talk discussing how they have been supporting the OpenStreetmap community in Australia, the Bing Maps team called out their legal team as one of the most important assets they had to offer the local mapping community, helping them to get waivers signed. At the time of writing nearly 90 waivers have been circulated in Australia alone, not all of which have been signed.

So, to recap, due to a perceived incompatibility between two of the most frequently used open data licences, the OpenStreetmap community and its supporters are spending time negotiating access to data that is already published under an open licence.

I am not a lawyer. So these are like, just my opinions. But while I understand why the OSM Licence Working Group needs to be cautious, it feels like they are being overly cautious. Then again, I’m not the one responsible for stewarding an increasingly important part of a global data infrastructure. 

Another opinion is that perhaps the Microsoft legal team might be better deployed to solve the licence incompatibility issues. Instead they are now drafting their own new open data licences, which are compatible with CC-BY.

Active Places and OpenStreetmap

Still with me?

At the end of last year, members of the OpenStreetMap community contacted Sport England to ask them to sign a waiver so that they could use the Active Places data. Presumably to incorporate some of the data into the OSM database.

The Sport England data and legal teams then had to understand what they were being asked to do and why. And they asked for some independent advice, which is where I provided some support through our work with Sport England on the OpenActive programme. 

The discussion included:

  • questions about why an additional waiver was actually necessary
  • the differences in how CC-BY and ODbL are designed to require data to remain open and accessible – CC-BY includes a limitation on the use of technical restrictions, which is allowed by the open definition, whilst ODbL adopts a principle of encouraging “parallel distribution”.
  • acceptable forms and methods of attribution
  • who, within an organisation like Sport England, might have responsibility to decide what acceptable attribution looked like
  • why the OSM community had come to its decisions
  • who actually had authority to sign-off on the proposed waiver
  • whether signing a waiver and granting a specific permission undermined Sport England’s goal to adopt standard open data practices and licences, and a consistent approach for every user
  • whether the OS exemption, which granted permission to SE to publish the dataset under an open licence, impacted any of the above

All reasonable questions from a team being asked to do something new. 

Like a number of organisations asked to sign waivers in Australia, SE have not yet signed a waiver and may choose not to do so. Like all public sector organisations, SE are being cautious about taking risks.

The discussion has spilled out onto twitter. I’m writing this to provide some context and background to the discussion in that thread. I’m not criticising anyone as I think everyone is trying to come to a reasonable outcome. 

As the twitter thread highlights, the OSM community are not just concerned about the CC-BY licence but also about the potential that additional third-party rights are lurking in the data. Clarifying that may require SE to share more details about how the address and location data in the dataset is collected, validated and normalised for the OSM community to be happy. But, as noted earlier in the blog, I’ve at least been able to determine the status of any third-party rights in the data. So perhaps this will help to move things further.

The End

So, as a final recap, we have two organisations both aiming to publish and use data for the public good. But, because of complexities around derived data and licence compatibilities, data that might otherwise be used in new, innovative ways is instead going unused.

This is a situation that needs solving. It needs the UK government and Geospatial Commission to open up more geospatial data.

It needs the open data community to invest in resolving licence incompatibilities (and less in creating new licences) so that everyone benefits. 

We also need to understand when licences are the appropriate means of governing how data is used and when norms, e.g. around attribution, can usefully shape how data is accessed, used and shared.

Until then these issues are going to continue to undermine the creation of value from open (geospatial) data.

[Paper Review] The Coerciveness of the Primary Key: Infrastructure Problems in Human Services Work

This blog post is a quick review and notes relating to a research paper called: The Coerciveness of the Primary Key: Infrastructure Problems in Human Services Work (PDF available here)

It’s part of my new research notebook to help me collect and share notes on research papers and reports.

Brief summary

This paper explores the impact of data infrastructure, and in particular the use of identifiers and the design of databases, on the delivery of human (public) services. By reviewing the use of identifiers and data in services supporting people experiencing homelessness and those affected by AIDS, the authors highlight a number of tensions, showing how the design of data infrastructure and the need to share data with funders and other agencies have an inevitable impact on frontline services.

For example, the need to evidence impact to funders requires the collection of additional personal, legal identifiers. Even when that information is not critical to the delivery of support.

The paper also explores the interplay between the well defined, unforgiving world of database design, and the messy nature of delivering services to individuals. Along the way the authors touch on aspects of identity, identification, and explore different types of identifiers and data collection practices.

The authors draw out a number of infrastructure problems and provide some design provocations for alternate approaches. The three main problems are the immutability of identifiers in database schema, the “hegemony of NOT NULL” (or the need for identification), and the demand for uniqueness across contexts.
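To make those three problems a little more concrete, here’s a small sketch (my own illustration, not taken from the paper) contrasting a conventional schema, where a legal identifier is mandatory, unique and fixed, with a design closer in spirit to the paper’s provocations, where identification is optional, local and can be added later:

```python
# A minimal, illustrative sketch (not from the paper) of two schema designs.
import sqlite3

conn = sqlite3.connect(":memory:")

# The "coercive" design: no record without a verified, globally unique identifier.
conn.execute("""
CREATE TABLE client_strict (
    ssn       TEXT NOT NULL UNIQUE,  -- the 'hegemony of NOT NULL'
    full_name TEXT NOT NULL
)""")

# An alternative: a local, service-issued key; legal identifiers are optional,
# can be added later, and several can coexist for the same person.
conn.execute("""
CREATE TABLE client_flexible (
    local_id  INTEGER PRIMARY KEY,   -- meaningful only within this service
    full_name TEXT
)""")
conn.execute("""
CREATE TABLE client_identifier (
    local_id  INTEGER REFERENCES client_flexible(local_id),
    scheme    TEXT,                  -- e.g. 'ssn', 'case-number', 'nickname'
    value     TEXT                   -- may be absent, provisional or duplicated
)""")

# A person can be supported before (or without) ever supplying a legal identifier.
conn.execute("INSERT INTO client_flexible (full_name) VALUES (?)", ("J.",))
print(conn.execute("SELECT local_id, full_name FROM client_flexible").fetchone())
```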

Three reasons to read

Here’s three reasons why you might want to read this paper:

  1. If, like me, you’re often advocating for the use of consistent, open identifiers, then this paper provides a useful perspective on how this approach might create issues or unwanted side effects outside of the simpler world of reference data
  2. If you’re designing digital public services then the design provocations around identifiers and approaches to identification are definitely worth reading. I think there’s some useful reflections about how we capture and manage personal information
  3. If you’re a public policy person advocating for consistent use of identifiers across agencies, then there are some important considerations around the policy, privacy and personal impacts of data collection in this paper

Three things I learned

Here’s three things that I learned from reading the paper.

  1. In a section on “The Data Work of Human Services Provision“, the authors highlighted three aspects of frontline data collection which I found it useful to think about:
    • data compliance work – collecting data purely to support the needs of funders, which might be at odds with the needs of both the people being supported and the service delivery staff
    • data coordination work – which stems from the need to link and aggregate data across agencies and funders to provide coordinated support
    • data confidence work – the need to build a trusted relationship with people, at the front-line, in order to capture valid, useful data
  2. Similarly, the authors tease out four reasons for capturing identifiers, each of which have different motivations, outcomes and approaches to identification:
    • counting clients – a basic need to monitor and evaluate service provision, identification here is only necessary to avoid duplicates when counting
    • developing longitudinal histories – e.g. identifying and tracking support given to a person over time can help service workers to develop understanding and improve support for individuals
    • as a means of accessing services – e.g. helping to identify eligibility for support
    • to coordinate service provision – e.g. sharing information about individuals with other agencies and services, which may also have different approaches to identification and use of identifiers
  3. The design provocations around database design were helpful to highlight some alternate approaches to capturing personal information and the needs of the service vs that of the individual

Thoughts and impressions

As someone who has not been directly involved in the design of digital systems to support human services, I found the perspectives and insight shared in this paper really useful. If you’ve been working in this space for some time, then it may be less insightful.

However I haven’t seen much discussion about good ways to design more humane digital services and, in particular, the databases behind them, so I suspect the paper could do with a wider airing. It’s useful reading alongside things like Falsehoods Programmers Believe About Names and Falsehoods Programmers Believe About Gender.

Why don’t we have a better approach to managing personal information in databases? Are there solutions out there already?

Finally, the paper makes some pointed comments about the role of funders in data ecosystems. Funders are routinely collecting and aggregating data as part of evaluation studies, but this data might also help support service delivery if it were more accessible. It’s interesting to consider the balance between minimising unnecessary collection of data simply to support evaluation versus the potential role of funders as intermediaries that can provide additional support to charities, agencies or other service delivery organisations that may lack the time, funding and capability to do more with that data.

 

 

[Paper review] Open data for electricity modeling: Legal aspects

This blog post is a quick review and notes relating to a research paper called: Open data for electricity modeling: Legal aspects.

It’s part of my new research notebook to help me collect and share notes on research papers and reports.

Brief summary

The paper reviews the legal status of publicly available energy data (and some related datasets) in Europe, with a focus on German law. The paper is intended to help identify some of the legal issues relevant to creation of analytical models to support use of energy data, e.g. for capacity planning.

As background, the paper describes the types of data relevant to building these types of model, the relevant aspects of database and copyright law in the EU and the properties of open licences. This background is used to assess some of the key data assets published in the EU and how they are licensed (or not) for reuse.

The paper concludes that the majority of uses of this data to support energy modelling in the EU, whether for research or other purposes, is likely to be infringing on the rights of the database holders, meaning that users are currently carrying legal risks. The paper notes that in many cases this is likely not the intended outcome.

The paper provides a range of recommendations to address this issue, including the adoption of open licences.

Three reasons to read

Here’s three reasons why you might want to read this paper

  1. It provides a helpful primer on the range of datasets and data types that are used to develop applications in the energy sector in the EU. Useful if you want to know more about the domain
  2. The background information on database rights and related IP law is clearly written and a good introduction to the topic
  3. The paper provides a great case study of how licensing and legal protections applies to data use in a sector. The approach taken could be reused and extended to other areas

Three things I learned

Here’s three things that I learned from reading the paper.

  1. That a database might be covered by copyright (an “original” database) in addition to database rights. But the authors note this doesn’t apply in the case of a typical energy dataset
  2. That individual member states might have their own statutory exemptions to the Database Directive. E.g. in Germany it doesn’t apply to use of data in non-commercial teaching. So there is variation in how it applies.
  3. The discussion on how the Database Directive relates to statutory obligations to publish data was interesting, but highlights that the situation is unclear.

Thoughts and impressions

Great paper that clearly articulates the legal issues relating to publication and use of data in the energy sector in the EU. It’s easy to extrapolate from this work to other use cases in energy and by extension to other sectors.

The paper concludes with a good set of recommendations: the adoption of open licences, the need to clarify rights around data reuse and the role of data institutions in doing that, and how policy makers can push towards a more open ecosystem.

However there’s a suggestion that funders should just mandate open licences when funding academic research. While this is the general trend I see across research funding, in the context of this article it lacks a bit of nuance. The paper clearly indicates that the current status quo is that data users do not have the rights to apply open licences to the data they are publishing and generating. I think funders also need to engage with other policy makers to ensure that upstream provision of data is aligned with an open research agenda. Otherwise we risk perpetuating an unclear landscape of rights and permissions. The authors do note the need to address wider issues, but I think there’s a potential role of research funders in helping to drive change.

Finally, in their review of open licences, the authors recommend a move towards adoption of CC0 (public domain waivers and marks) and CC-BY 4.0. But they don’t address the fact that upstream licensing might limit the choice of how researchers can licence downstream data.

Specifically, the authors note the use of OpenStreetmap data to provide infrastructure data. However, depending on your use, you may need to adopt its licence (the ODbL) when republishing data. This can be at odds with a mandate to use other licences, or with restrictive licences used by other data stewards.

 

How do data publishing choices shape data ecosystems?

This is the latest in a series of posts in which I explore some basic questions about data.

In our work at the ODI we have often been asked for advice about how best to publish data. When trying to give helpful advice, one thing I’m always mindful of is how decisions about how data is published shape the ways in which value can be created from it. More specifically, whether those choices will enable the creation of a rich data ecosystem of intermediaries and users.

So what are the types of decisions that might help to shape data ecosystems?

To give a simple example, if I publish a dataset so it’s available as a bulk download, then you could use that data in any kind of application. You could also use it to create a service that helps other people create value from the same data, e.g. by providing an API or an interface to generate reports from the data. Publishing in bulk allows intermediaries to help create a richer data ecosystem. But if I’d published that same data only via an API, then there are limited ways in which intermediaries can add value. Instead people must come directly to my API or services to use the data.

This is one of the reasons why people prefer open data to be available in bulk. It allows for more choice and flexibility in how it is used. But, as I noted in a recent post, depending on the “dataset archetype” your publishing options might be limited.

The decision to publish a dataset only as an API, even if it could be published in other ways, is often a deliberate one. The publisher may want to capture more of the value around the dataset, e.g. by charging for the use of an API. Or they may feel it is important to have more direct control over who uses it, and how. These are reasonable choices and, when the data is sensitive, sensible options.

But there are a variety of ways in which the choices that are made about how to publish data can shape or constrain the ecosystem around a specific dataset. It’s not just about bulk downloads versus APIs.

The choices include:

  • the licence that is applied to the data, which might limit it to non-commercial use, restrict redistribution, or impose limits on the use of derived data
  • the terms and conditions for the API or other service that provides access to the data. These terms are often conflated with data licences, but typically focus on aspects of service provision, for example rate limiting, restrictions on storage of API results, permitted uses of the API, permitted types of users, etc
  • the technology used to provide access to data. In addition to bulk downloads vs API, there are also details such as the use of specific standards, the types of API call that are possible, etc
  • the governance around the API or service that provides access to data, which might limit which users can access the service or create friction that discourages use
  • the business model that is wrapped around the API or service, which might include a freemium model, chargeable usage tiers, service level agreements, usage limits, etc

I think these cover the main areas. Let me know if you think I’ve missed something.

You’ll notice that APIs and services provide more choices for how a publisher might control usage. This can be a good or a bad thing.

The range of choices also means it’s very easy to create a situation where an API or service doesn’t work well for some use cases. This is why user research and engagement is such an important part of releasing a data product and designing policy interventions that aim to increase access to data.

For example, let’s imagine someone has published an openly licensed dataset via an API that restricts users to a maximum number of API calls per month.

These choices limit some uses of the API, e.g. applications that need to make lots of queries. This also means that downstream users creating web applications are unable to provide a good quality of service to their own users. A popular application might just stop working at some point over the course of the month because it has hit the usage threshold.
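To make that concrete, here’s a small sketch of how a monthly quota surfaces in a downstream client. The endpoint, response shape and quota behaviour are all hypothetical; real APIs signal limits in different ways (e.g. an HTTP 429 status plus headers), but the effect on an application is the same:

```python
# A hypothetical client for an openly licensed dataset published only via a
# rate-limited API. The URL and response fields are made up for illustration.
import requests

API_URL = "https://api.example.org/dataset/records"  # hypothetical endpoint

def fetch_page(page):
    response = requests.get(API_URL, params={"page": page}, timeout=10)
    if response.status_code == 429:
        # Monthly allowance exhausted: the application stops working until the
        # quota resets, however popular it is with its own users.
        print("Quota exceeded; no more data until the limit resets")
        return None
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    page = 1
    while True:
        data = fetch_page(page)
        if data is None:
            break
        # ... process the page of records here ...
        if not data.get("next_page"):  # hypothetical pagination field
            break
        page += 1
```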

The dataset might be technically open, but in practice its use has been constrained by other choices.

Those choices might have been made for good reasons. For example, as a way for the data publisher to be able to predict how much they need to invest each month in providing a free service that is accessible to lots of users making a smaller number of requests. There is inevitably a trade-off between the needs of individual users and the publisher.

Adding on a commercial usage tier for high volume users might provide a way for the publisher to recoup costs. It also allows some users to choose what to pay for their use of the API, e.g. to more smoothly handle unexpected peaks in their website traffic. But it may sometimes be simpler to provide the data in bulk to support those use cases. Different use cases might be better served by different publishing options.

Another example might be a system that provides access to both shared and open data via a set of APIs that conform to open standards. If the publisher makes it too difficult for users to actually sign up to use those APIs, e.g. because of difficult registration or certification requirements, then only those organisations that can afford to invest the time and money to gain access might bother using them. The end result might be a closed ecosystem that is built on open foundations.

I think it’s important to understand how this range of choices can impact data ecosystems. They’re important not just for how we design products and services, but also in helping to design successful policies and regulatory interventions. If we don’t consider the full range of choices, then we may not achieve the intended outcomes.

More generally, I think it’s important to think about the ecosystems of data use. Often I don’t think enough attention is paid to the variety of ways in which value is created. This can lead to poor choices, like choosing to try and sell data for short-term gain rather than considering the variety of ways in which value might be created in a more open ecosystem.

Let’s talk about plugs

This is a summary of a short talk I gave internally at the ODI to help illustrate some of the important aspects of data standards for non-technical folk. I thought I’d write it up here too, in case it’s useful for anyone else. Let me know what you think.

We benefit from standards in every aspect of our daily lives. But because we take them for granted, we don’t tend to think about them very much. At the ODI we’re frequently talking about standards for data which, if you don’t have a technical background, might be even harder to wrap your head around.

A good example can help to illustrate the value of standards. People frequently refer to telephone lines, railway tracks, etc. But there’s an example that we all have plenty of personal experience with.

Let’s talk about plugs!

You can confidently plug any of your devices into a wall socket and it will just work. No thought required.

Have you ever thought about what it would be like if plugs and wall sockets were all different sizes and shapes?

You couldn’t rely on being able to consistently plug your device into any random socket, so you’d have to carry around loads of different cables. Manufacturers might not design their plugs and sockets very well, so there might be greater risks of electrocution or fires. Or maybe the company that built your new house decided to only fit a specific type of wall socket because it had agreed a deal with an electrical manufacturer, so when you move in you need to buy a completely new set of devices.

We don’t live in that world thankfully. As a nation we’ve agreed that all of our plugs should be designed the same way.

That’s all a standard is. A documented, reusable agreement that everyone uses.

Notice that a single standard, “how to design a really great plug“, has multiple benefits. Safety is increased. We save time and money. Manufacturers can be confident that their equipment will work in any home or office.

That’s true of different standards too. Standards have economic, policy, technical and social impacts.

Open up a UK plug and it looks a bit like this.

Notice that there are colours for different types of wires (2, 3, 4). And that fuses (5) are expected to be the same size and shape. Those are all standards too. The wiring and voltages are standardised too.

So the wiring, wall sockets and plugs in your house are designed according to a whole family of different standards, which are designed to work with one another.

We can design more complex systems from smaller standards. It helps us make new things faster, because we are reusing existing work.

That’s a lot of time and agreement that we all benefit from. Someone somewhere has invested the time and energy into thinking all of that through. Lucky us!

When we visit other countries, we learn that their plugs and sockets are different. Oh no!

That can be a bit frustrating, and means we have to spend a bit more money and remember to pack the right adapters. It’d be nice if the whole world agreed on how to design a plug. But that seems unlikely. It would cost a lot of time and money in replacing wiring and sockets.

But maybe those different designs are intentional? Perhaps there are different local expectations around safety, for example. Or in what devices people might be using in their homes. There might be reasons why different communities choose to design and adopt slightly different standards. Because they’re meeting slightly different needs. But sometimes those differences might be unnecessary. It can be hard to tell sometimes.

The people most impacted by these differences aren’t tourists, it’s the manufacturers that have to design equipment to work in different locations. Which is why your electrical devices normally have separate cables. So, depending on whether you travel or whether you’re a device manufacturer, you’ll have different perceptions of how much of a problem that is.

All of the above is true for data standards.

Standards for data are agreements that help us collect, access, share, use and publish data in consistent ways.  They have a range of different impacts.

There are lots of different types of standard and we combine them together to create different ways to successfully exchange data. Different communities often have their own standards for similar things, e.g. for describing metadata or accessing data via an API.

Sometimes those are simple differences that an adapter can easily fix. Sometimes those differences are because the standards are designed to meet different needs.

Unfortunately we don’t live in a world of standardised data plugs and wires and fuses. We live in that other world. The one where it’s hard to connect one thing to another thing. Where the stuff coming down the wires is completely unexpected. And we get repeated shocks from accidental releases of data.

I guarantee that in every piece of user research, every interview, government consultation or call for evidence, people will be consistently highlighting the need for more standards for data. People will often say this explicitly: “We need more standards!”. But sometimes they refer to the need in other ways: “We need to make data more discoverable!” (metadata standards) or “We need to make it easier to safely release data!” (standardised codes of practice).

Unfortunately that’s not always that helpful because when you probe a little deeper you find that people are talking about lots of different things. Some people want to standardise the wiring. Others just want to agree on a voltage. While others are still debating the definition of “fuse”. These are all useful and important things. You just need to dig a little deeper to find the most useful place to start.

It’s also not always clear whose job it is to actually create those standards. Because we take standards for granted, we’re not always clear about how they get created. Or how long it takes and what process to follow to ensure they’re well designed.

The reason we published the open standards for data guidebook was to help communities get started in designing the standards they need.

Standards development needs time and investment, as someone somewhere needs to do the work of creating them. That, as ever, is the really hard part.

Standards are part of the data infrastructure that help us unlock value from data. We need to invest in creating and maintaining them like we do other parts of our infrastructure.

Don’t just listen to me, listen to some of the people who’ve been creating standards for their communities.