How can you help support the use of a dataset?

Getting the most value from data, whilst minimising its harmful impacts, is a community activity. Datasets need to be governed and published well. Most of that responsibility falls on the data publisher, because the choices they make shape data ecosystems.

But other people have a role to play too. Being a good data user means engaging with that process.

Helping others to find data, and to find the value in it, feels particularly important at the moment. During the pandemic, many new datasets are becoming available. And there are lots of questions to be answered. Some of them can be answered through better use of data.

So, how can communities work together to support use of data?

There are a lot of different ways to explore that question. But there’s a framework called BASEDEF, created by the open source community, which I find helpful.

BASEDEF stands for Blog, Apply, Suggest, Extend, Document, Evangelize and Fix. It describes the different types of contributions that can support an open source project. It can also be applied to help organise a small team in doing that work. Here’s a handy cheat sheet.

But the framework can also be applied to the task of supporting the use of an openly licensed dataset. Let’s run through the framework with that in mind.


Blog

You can write about a dataset to help others to discover it. You can help explain the potential value of applying the dataset to specific problems. Or perhaps you can see some downsides that others should consider.

Writing about how a dataset has been useful to you, by describing how you’ve successfully applied it in a project, will also help others see its potential value.

Apply

You can show how a dataset can be used, by creating something with it. You might do a detailed analysis of the data, but some simpler contributions can also be helpful.

For example, you might create a simple visualisation. Or write and publish some code that illustrates how the dataset can be accessed and used. You could publish a quick demo showing how the dataset can be imported and used in some frequently used tools and platforms.
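As a sketch of what that kind of demo might look like, here’s a minimal example in Python. The file name and column names are hypothetical placeholders for whatever dataset you’re writing about.

```python
# A minimal "how to load and explore this dataset" demo.
# The file name and the columns (date, cases) are hypothetical placeholders.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("cases.csv", parse_dates=["date"])

# Show readers the shape of the data so they know what to expect
print(df.shape)
print(df.dtypes)
print(df.head())

# A quick exploratory chart: totals over time
df.groupby("date")["cases"].sum().plot(title="Cases over time")
plt.show()
```

Even something this small lowers the barrier for the next person, because they can start from working code rather than deciphering the documentation from scratch.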

At the moment everyone is a bit tired of charts and graphs. And I agree with the first principle in the visualisation design principles for the pandemic. But a helpful visualisation can do a range of things. Visualisation can be exploratory rather than explanatory.

A visualisation could support other people in understanding the shape of a dataset, to inform their analysis and interpretation of it. It can help identify outliers, gaps, or highlight some of the richness in the data. I’d recommend making it clear when you’re doing this type of visualisation, rather than trying to derive specific insights.

Suggest

Read the documentation. Download and explore the dataset. Ask questions. Give feedback.

Make suggestions to the publisher about changes they could make to publish the data better. Rather than just offering academic critique, be clear about how the suggested changes will support your needs or those of your community.

Extend

The freedoms granted by an open licence allow you to enrich and improve a dataset.

Sometimes the smallest changes can have the most impact. Convert the data into other common or standard formats. Extract data from spreadsheets into CSV files. Convert data published in more complex formats or via APIs into simpler tabular data, to make it more accessible to analysts rather than programmers.
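As an example, here’s a sketch of flattening a hypothetical JSON API into a simple CSV file. The URL and field names are placeholders, not a real service.

```python
# A sketch of converting data from a hypothetical JSON API into a flat
# CSV file that is easier for analysts to work with.
# The URL and field names are placeholders, not a real service.
import csv
import requests

response = requests.get("https://example.org/api/stations")
response.raise_for_status()
records = response.json()

with open("stations.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "latitude", "longitude"])
    writer.writeheader()
    for record in records:
        writer.writerow({
            "id": record.get("id"),
            "name": record.get("name"),
            "latitude": record.get("lat"),
            "longitude": record.get("lon"),
        })
```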

Or maybe you can enrich a dataset by adding identifiers that will allow it to be linked to other sources. Do the work of merging with other datasets to bring in more context.
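As an illustration, assuming a hypothetical lookup table that maps postcodes to standard area codes, that kind of enrichment can be a single join:

```python
# A sketch of enriching a dataset by joining in standard identifiers.
# File and column names are hypothetical.
import pandas as pd

schools = pd.read_csv("schools.csv")          # e.g. name, postcode, pupils
lookup = pd.read_csv("postcode_lookup.csv")   # e.g. postcode, local_authority_code

# The added identifier lets others link the data to further sources
enriched = schools.merge(lookup, on="postcode", how="left")
enriched.to_csv("schools-enriched.csv", index=False)
```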

The downside here is that if the original data changes, your extended version will get out of date. If you can’t commit to keeping your version up to date, then be sure to share your code and document your methods.

Allow others to repeat the steps you’ve taken. And don’t forget to suggest the improvements to the publisher.

Document

Write additional documentation to fill in gaps where the publisher has not provided sufficient background or explanation. Explain technical concepts or academic terms to a non-specialist audience.

As a user of the data, you’re able to write that documentation from a perspective that reflects the needs and questions of your specific community and the kinds of questions you need to ask. The original publisher might not have all that context or understand those needs, so this work can be really helpful.

Good documentation can be a finding aid. There are structured ways that you can go about writing documentation, such as this tool for writing civic data guides. (Check out some of the examples).

Evangelise

Email people that might have a need for the data. Tweet about it to a wider community. Highlight it in a presentation. Talk about it over a coffee on Zoom.

Fix

If the dataset is collaboratively maintained then go ahead and fix errors and omissions. If you’re not confident about making a fix, then submit an error report. In addition to fixing errors you might be able to help verify that data is correct.

If a dataset isn’t collaboratively maintained then, when you find errors, be sure to flag them to the publisher and highlight the issue for others. Or consider publishing an enriched version with fixes applied.


This framework isn’t perfect. The name is a bit clunky for a start. But there are a couple of things that I like about it.

Firstly, it recognises that not all contributions need to be technical. There’s room for people to contribute different skills, in different ways.

Secondly, the elements overlap and reinforce one another. Writing documentation and blogging about how you’ve used a dataset helps to evangelise it. Enriching a dataset can help demonstrate in a practical way how a publisher can improve how data is published.

Finally, it serves to highlight some important aspects of community curation which aren’t always well supported in existing data platforms and portals. We can do better here.

If you’re interested in working on adapting this further, then I’m happy to chat! It might be useful to have a cheat sheet that supports its application to data, and more examples of how to do these different elements well.

How can publishing more data decrease the value of existing data?

Last month I wrote a post looking at how publishing new data might increase the value of existing data. I ended up listing seven different ways including things like improving validation, increasing coverage, supporting the ability to link together datasets, etc.

But that post only looked at half of the issue. What about the opposite? Are there ways in which publishing new data might reduce the value of data that’s already available?

The short answer is: yes, there are. But before jumping into that, let’s take a moment to reflect on the language we’re using.

A note on language

The original post was prompted by an economic framing of the value of data. I was exploring how the option value for a dataset might be affected by increasing access to other data. While this post is primarily looking at how option value might be reduced, we need to acknowledge that “value” isn’t the only way to frame this type of question.

We might also ask, “how might increasing access to data increase potential for harms?” As part of a wider debate around the issues of increasing access to data, we need to use more than just economic language. There’s a wealth of good writing about the impacts of data on privacy and society which I’m not going to attempt to precis here.

It’s also important to highlight that “increasing value” and “decreasing value” are relative terms.

Increasing the value of existing datasets will not seem like a positive outcome if your goal is to attempt to capture as much value as possible, rather than benefit a broader ecosystem. Similarly, decreasing value of existing data, e.g. through obfuscation, might be seen as a positive outcome if it results in better privacy or increased personal safety.

Decreasing value of existing data

Having acknowledged that, let’s try to answer the earlier question. In what ways can publishing new data reduce the value we can derive from existing data?

Increased harms leading to retraction and reduced trust

Publishing new data always runs the risk of re-identification and the enabling of unintended inferences. While the impacts of these harms are likely to be most directly felt by both communities and individuals, there are also broader commercial and national security issues. Together, these issues might ultimately reduce the value of the existing data ecosystem in several ways:

  • Existing datasets may need to be retracted, have their scope changed, or have their circulation reduced in order to avoid further harm. Data privacy impact assessments will need to be updated as the contexts in which data is being shared and published change
  • Increased concerns over potential privacy impacts might lead organisations to choose not to increase access to similar or related datasets
  • Increased concerns might also lead communities and individuals to reduce the amount of data they are willing to share with previously trusted sources

Overall this can lead to a reduction in the overall coverage, quality and linking of data across a data ecosystem. It’s likely to be one of the most significant impacts of poorly considered data releases. It can be mitigated through proper impact assessments, consultation and engagement.

Reducing overall quality

Newly published data might be intended to increase coverage, enrich, link, validate or otherwise improve existing data. But it might actually have the opposite effect because it’s of poor quality. I’ve briefly touched on this in a previous post on fictional data.

Publication of poor quality data might be unintended. For example an organisation may just be publishing the data it has to help address an issue, without properly considering or addressing underlying problems with it. Or a researcher may publish data that contains honest mistakes.

But publication of poor quality data might also be deliberate. For example, as spam or misinformation intended to “poison the well”.

More subtly, practices like p-hacking and falsification of data, which might be intended to have a short-term benefit for the publisher or author, might create longer-term issues by impacting the use of other datasets.

This is why understanding and documenting the provenance of data, monitoring of retractions, fixes and updates to data, and the ability to link analyses with datasets are all so important.

Creating unnecessary competition or increasing friction

Publishing new datasets containing new observations and data about an area or topic of interest can lead to positive impacts, e.g. by increasing confidence or coverage. But datasets are also competing with one another. The same types of data might be available from different sources, but under different licences, access arrangements, pricing, etc.

This competition isn’t necessarily positive. For example, the data ecosystem might not benefit as much from the network effects that follow from linking data because key datasets are not linked or cannot be used together. Incompatible and competing datasets can add friction across an ecosystem.

Building poor foundations

Data is often published as a means of building stronger data infrastructure for a sector, or to address a specific challenge. But if that data is poorly maintained or is not sustainably funded, then the energy that goes into building the communities, tools and other datasets around that infrastructure might be wasted.

That reduces the value of existing datasets which might otherwise have provided a better foundation to build upon, or whose quality is dependent on the shared infrastructure. While this issue is similar to the previous one about competition, its root causes and impacts are slightly different.


As I noted in my earlier post, I don’t think this is an exhaustive list, and it can be improved by contributions. Leave a comment if you have any thoughts.

How can publishing more data increase the value of existing data?

There’s lots to love about the “Value of Data” report. Like the fantastic infographic on page 9. I’ll wait while you go and check it out.

Great, isn’t it?

My favourite part about the paper is that it’s taught me a few terms that economists use, but which I hadn’t heard before. Like “incomplete contracts”, which is the uncertainty about how people will behave because of ambiguity in norms, regulations, licensing or other rules. Finally, a name to put to my repeated gripes about licensing!

But it’s the term “option value” that I’ve been mulling over for the last few days. Option value is a measure of our willingness to pay for something even though we’re not currently using it. Data has a large option value, because it’s hard to predict how its value might change in future.

Organisations continue to keep data because of its potential future uses. I’ve written before about data as stored potential.

The report notes that the value of a dataset can change because we might be able to apply new technologies to it. Or think of new questions to ask of it. Or, and this is the interesting part, because we acquire new data that might impact its value.

So, how does increasing access to one dataset affect the value of other datasets?

Moving data along the data spectrum means that increasingly more people will have access to it. That means it can be used by more people, potentially in very different ways than you might expect. Applying Joy’s Law then we might expect some interesting, innovative or just unanticipated uses. (See also: everyone loves a laser.)

But more people using the same data is just extracting additional value from that single dataset. It’s not directly impacting the value of other datasets.

To do that we need to use that data in some specific ways. So far I’ve come up with seven ways that new data can change the value of existing data.

  1. Comparison. If we have two or more datasets then we can compare them. That will allow us to identify differences, look for similarities, or find correlations. New data can help us discover insights that aren’t otherwise apparent.
  2. Enrichment. New data can enrich an existing dataset by adding new information. It gives us context that we didn’t have access to before, unlocking further uses.
  3. Validation. New data can help us identify and correct errors in existing data (see the sketch after this list).
  4. Linking. A new dataset might help us to merge some existing datasets, allowing us to analyse them in new ways. The new dataset acts like a missing piece in a jigsaw puzzle.
  5. Scaffolding. A new dataset can help us to organise other data. It might also help us collect new data.
  6. Improve Coverage. Adding more data, of the same type, into an existing pool can help us create a larger, aggregated dataset. We end up with a more complete dataset, which opens up more uses. The combined dataset might have better spatial or temporal coverage, be less biased or capture more of the world we want to analyse
  7. Increase Confidence. If the new data measures something we’ve already recorded, then the repeated measurements can help us to be more confident about the quality of our existing data and analyses. For example, we might pool sensor readings about the weather from multiple weather stations in the same area. Or perform a meta-analysis of a scientific study.
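To make validation and linking concrete, here’s a small sketch of using a newly published dataset to check an existing one. The file names, columns and threshold are all hypothetical.

```python
# A sketch of linking (4) and validation (3): join a new dataset to an
# existing one on a shared identifier, then flag disagreements.
# File names, columns and the threshold are hypothetical.
import pandas as pd

existing = pd.read_csv("existing-measurements.csv")   # site_id, value
new = pd.read_csv("new-measurements.csv")             # site_id, value

# Linking: merge the two datasets on a shared identifier
merged = existing.merge(new, on="site_id", suffixes=("_old", "_new"))

# Validation: flag the records where the two sources disagree the most
merged["diff"] = (merged["value_old"] - merged["value_new"]).abs()
suspect = merged[merged["diff"] > merged["diff"].quantile(0.99)]
print(f"{len(suspect)} records worth double-checking")
```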

I don’t think this is exhaustive, but it was a useful thought experiment.

A while ago, I outlined ten dataset archetypes. It’s interesting to see how these align with the above uses:

  • A meta-analysis to increase confidence will draw on multiple studies
  • Combining sensor feeds can also help us increase confidence in our observations of the world
  • A register can help us with linking or scaffolding datasets. Registers can also be used to support validation.
  • Pooling together multiple descriptions or personal records can help us create a database that has improved coverage for a specific application
  • A social graph is often used as scaffolding for other datasets

What would you add to my list of ways in which new data improves the value of existing data? What did I miss?

Three types of agreement that shape your use of data

Whenever you’re accessing, using or sharing data you will be bound by a variety of laws and agreements. I’ve written previously about how data governance is a nested set of rules, processes, legislation and norms.

In this post I wanted to clarify the differences between three types of agreements that will govern your use of data. There are others, but from a data consumer’s point of view these are the most common.

If you’re involved in any kind of data project, then you should have read all of the relevant agreements that relate to the data you’re planning to use. So you should know what to look for.

Data Sharing Agreements

Data sharing agreements are usually contracts that will have been signed between the organisations sharing data. They describe how, when, where and for how long data will be shared.

They will include things like the purpose and legal basis for sharing data. They will describe the important security, privacy and other considerations that govern how data will be shared, managed and used. Data sharing agreements might be time-limited. Or they might describe an ongoing arrangement.

When the public and private sector are sharing data, then publishing a register of agreements is one way to increase transparency around how data is being shared.

The ICO Data Sharing Code of Practice has more detail on the kinds of information a data sharing agreement should contain. As does the UK’s Digital Economy Act 2017 code of practice for data sharing. In a recent project the ODI and CABI created a checklist for data sharing agreements.

Data sharing agreements are most useful when organisations, of any kind, are sharing sensitive data. A contract with detailed, binding rules helps everyone be clear on their obligations.

Licences

Licences are a different approach to defining the rules that apply to use of data. A licence describes the ways that data can be used without any of the organisations involved having to enter into a formal agreement.

A licence will describe how you can use some data. It may also place some restrictions on your use (e.g. “non-commercial”) and may spell out some obligations (“please say where you got the data”). So long as you use the data in the described ways, then you don’t need any kind of explicit permission from the publisher. You don’t even have to tell them you’re using it. Although it’s usually a good idea to do that.

Licences remove the need to negotiate and sign agreements. Permission is granted in advance, with a few caveats.

Standard licences make it easier to use data from multiple sources, because everyone is expecting you to follow the same rules. But only if the licences are widely adopted. Where licences don’t align, we end up with unnecessary friction.

Licences aren’t time-limited. They’re perpetual. At least as long as you follow your obligations.

Licences are best used for open and public data. Sometimes people use data sharing agreements when a licence might be a better option. That’s often because organisations know how to do contracts, but are less confident in giving permissions. Especially if they’re concerned about risks.

Sometimes, even if there’s an open licence to use data, a business would still prefer to have an agreement in place. That might be because the licence doesn’t give them the freedoms they want, or because they’d like some additional assurances in place around their use of data.

Terms and Conditions

Terms and conditions, or “terms of use”, are a set of rules that describe how you can use a service. Terms and conditions are the things we all ignore when signing up to a website. But if you’re using a data portal, platform or API then you definitely need to have checked the small print. (You have, haven’t you?)

Like a Data Sharing Agreement, a set of terms and conditions is something that you formally agree to. It might be by checking a box rather than signing a document, but it’s still an agreement.

Terms of use will describe the service being offered and the ways in which you can use it. Like licences and data sharing agreements, they will also include some restrictions. For example whether you can build a commercial service with it. Or what you can do with the results.

A good set of terms and conditions will clearly and separately identify those rules that relate to your use of the service (e.g. how often you can use it) from those rules that relate to the data provided to you. Ideally the terms would just refer to a separate licence. The Met Office Data Point terms do this.

A poorly defined set of terms will focus on the service parts but not include enough detail about your rights to use and reuse data. That can happen if the emphasis has been on the terms of use of the service as a product, rather than around the sharing of data.

The terms and conditions for a data service and the rules that relate to the data are two of the important decisions that shape the data ecosystem that service will enable. It’s important to get them right.

Hopefully that’s a helpful primer. Remember, if you’re in any kind of role using data then you need to read the small print. If not, then you’re potentially exposing yourself and others to risks.

When can we expect more from data portability?

We’re at the end of week 5 of 2020, of the new decade and I’m on a diet.

I’m back to using MyFitnessPal again. I’ve used it on and off for the last 10 years, whenever I’ve decided that now is the time to be more healthy. The sporadic but detailed history of data collection around my weight and eating habits marks out each of the times when this time was going to be the time when I really made a change.

My success has been mixed. But the latest diet is going pretty well, thanks for asking.

This morning the app chose the following feature to highlight as part of its irregular nudges for me to upgrade to premium.

Downloading data about your weight, nutrition and exercise history is a premium feature of the service. This gave me pause for thought for several reasons.

Under UK legislation, and for as long as we maintain data adequacy with the EU, I have a right to data portability. I can request access to any data about me, in a machine-readable format, from any service I happen to be using.

The company that produce MyFitnessPal, Under Armour, do offer me a way to exercise this right. It’s described in their privacy policy, as shown in the following images.

[Images: a note about how to exercise your GDPR rights in MyFitnessPal, and the data portability section of its privacy policy]

Rather than enabling this access via an existing product feature, they’ve decided to make me and everyone else request the data directly. Every time I want to use it.

This might be a deliberate decision. They’re following the legislation to the letter. Perhaps it’s a conscious decision to push people towards a premium service, rather than make it easy by default. Their user base is international, so they don’t have to offer this feature to everyone.

Or maybe it’s the legal and product teams not looking at data portability as an opportunity. That’s something that the ODI has previously explored.

I’m hoping to see more exploration of the potential benefits and uses of data portability in 2020.

I think we need to re-frame the discussion away from compliance and on to commercial and consumer benefits. For example, by highlighting how access to data contributes to building ecosystems around services, to help retain and grow a customer base. That is more likely to get traction than a continued focus on compliance and product switching.

MyFitnessPal already connects into an ecosystem of other services. A stronger message around portability might help grow that further.  After all, there are more reasons to monitor what you eat than just weight loss.

Clearer legislation and stronger guidance from organisations like ICO and industry regulators describing how data portability should be implemented would also help. Wider international adoption of data portability rights wouldn’t hurt either.

There’s also a role for community driven projects to build stronger norms and expectations around data portability. Projects like OpenSchufa demonstrate the positive benefits of coordinated action to build up an aggregated view of donated, personal data.

But I’d also settle for a return to the ethos of the early 2010s, when making data flow between services was the default. Small pieces, loosely joined.

If we want the big platforms to go on a diet, then they’re going to need to give up some of those bytes.

Do data scientists spend 80% of their time cleaning data? Turns out, no?

It’s hard to read an article about data science or really anything that involves creating something useful from data these days without tripping over this factoid, or some variant of it:

Data scientists spend 80% of their time cleaning data rather than creating insights.

Or

Data scientists only spend 20% of their time creating insights, the rest wrangling data.

It’s frequently used to highlight the need to address a number of issues around data quality, standards and access. Or as a way to sell portals, dashboards and other analytic tools.

The thing is, I think it’s a bullshit statistic.

Not because I think there aren’t improvements to be made in how we access and share data. Far from it. My issue is more with how that statistic is framed, and because it’s just endlessly parroted without any real insight.

What did the surveys say?

I’ve tried to dig out the underlying survey or source of that factoid, to see if there’s more context. While the figure is widely referenced, it’s rarely accompanied by a link to a survey or results.

Amusingly this IBM data science product marketing page cites this 2018 HBR blog post which cites this 2017 IBM blog which cites this 2016 Crowdflower survey. Why don’t people link to original sources?

In terms of sources of data on how data scientists actually spend their time, I’ve found two ongoing surveys.

So what do these surveys actually say?

  • Crowdflower, 2015: “66.7% said cleaning and organizing data is one of their most time-consuming tasks”.
    • They didn’t report estimates of time spent
  • Crowdflower, 2016: “What data scientists spend the most time doing? Cleaning and organizing data: 60%, Collecting data sets: 19% …”.
    • That only approaches 80% if you also lump in collecting data (60% + 19% = 79%)
  • Crowdflower, 2017: “What activity takes up most of your time? 51% Collecting, labeling, cleaning and organizing data”.
    • Less than 80%, and it now also includes tasks like labelling of data
  • Figure Eight, 2018: Doesn’t cover this question.
  • Figure Eight, 2019: “Nearly three quarters of technical respondents (73.5%) spend 25% or more of their time managing, cleaning, and/or labeling data”.
    • That’s pretty far from 80%!
  • Kaggle, 2017: Doesn’t cover this question.
  • Kaggle, 2018: “During a typical data science project, what percent of your time is spent engaged in the following tasks? ~11% Gathering data, 15% Cleaning data …”.
    • Again, much less than 80%

Only the Crowdflower survey reports anything close to 80%, but you need to lump in actually collecting data as well.

Are there other sources? I’ve not spent too much time on it. But this 2015 bizreport article mentions another survey which suggests “between 50% and 90% of business intelligence (BI) workers’ time is spend prepping data to be analyzed“.

And an August 2014 New York Times article states that: “Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data“. But doesn’t link to the surveys, because newspapers hate links.

It’s worth noting that “Data Scientist” as a job started to really become a thing around 2009. Although the concept of data science is older. So there may not be much more to dig up. If you’ve seen some earlier surveys, then let me know.

Is it a useful statistic?

So looking at the figures, it looks to me that this is a bullshit statistic. Data scientists do a whole range of different types of task. If you arbitrarily label some of these as analysis and others not, then you can make them add up to 80%.

But that’s not the only reason why I think it’s a bullshit statistic.

Firstly there’s the implication that cleaning and working with data is somehow not worth the time of a data scientist. It’s “data janitor work”. And “It’s a waste of their skills to be polishing the materials they rely on”. Ugh.

Who, might I ask, is supposed to do this janitorial work?

I would argue that spending time working with data, to transform, explore and understand it better, is absolutely what data scientists should be doing. This is the medium they are working in.

Understand the material better and you’ll get better insights.

Secondly, I think data science use cases and workflows are a poor measure for how well data is published. Data science is frequently about doing bespoke analysis, which means creating and labelling unique datasets. No matter how cleanly formatted or standardised a dataset is, it’s likely to need some work.

A sculptor has different needs than a bricklayer. They both use similar materials. And they both create things of lasting value and worth.

We could measure utility better using other assessments than time spent on bespoke work.

Thirdly, it’s measuring the wrong thing. Actually, maybe some friction around the use of data is a good thing. Especially if it encourages you to spend more time understanding a dataset. Even more so if it in any way puts a brake on dumb uses of machine learning.

If we want the process of accessing, using and sharing data to be as frictionless as possible in a technical sense, then let’s make sure that is offset by adding friction elsewhere. E.g. to add checkpoints for reviews of ethical impacts. No matter how highly paid a data scientist is, the impacts of poor use of data and AI can be much, much larger.

Don’t tell me that data scientists are spending too much time working with data and not enough time getting insights into production. Tell me that data scientists are increasingly spending 50% of their time considering the ethical and social impacts of their work.

Let’s measure that.

[Paper Review] The Coerciveness of the Primary Key: Infrastructure Problems in Human Services Work

This blog post is a quick review and notes relating to a research paper called: The Coerciveness of the Primary Key: Infrastructure Problems in Human Services Work (PDF available here)

It’s part of my new research notebook to help me collect and share notes on research papers and reports.

Brief summary

This paper explores the impact of data infrastructure, and in particular the use of identifiers and the design of databases, on the delivery of human (public) services. By reviewing the use of identifiers and data in service delivery to support homelessness and those affected by AIDS, the authors highlight a number of tensions between the design of data infrastructure, the need to share data with funders and other agencies, and the inevitable impact on frontline services.

For example, the need to evidence impact to funders requires the collection of additional personal, legal identifiers. Even when that information is not critical to the delivery of support.

The paper also explores the interplay between the well-defined, unforgiving world of database design and the messy nature of delivering services to individuals. Along the way the authors touch on aspects of identity and identification, and explore different types of identifiers and data collection practices.

The authors draw out a number of infrastructure problems and provide some design provocations for alternate approaches. The three main problems are the immutability of identifiers in database schema, the “hegemony of NOT NULL” (or the need for identification), and the demand for uniqueness across contexts.
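To make the “hegemony of NOT NULL” tangible, here’s a small illustrative sketch of how a schema can demand identification before a service interaction can even be recorded. The tables and columns are my own invention, not taken from the paper.

```python
# An illustrative sketch (my own invention, not from the paper) of how a
# schema encodes the "hegemony of NOT NULL": the first table refuses to
# record a client without a legal identifier, while the second allows a
# record to exist even when identification is partial or absent.
import sqlite3

conn = sqlite3.connect(":memory:")

conn.execute("""
    CREATE TABLE clients_coercive (
        ssn TEXT NOT NULL PRIMARY KEY,   -- identification required up front
        name TEXT NOT NULL
    )
""")

conn.execute("""
    CREATE TABLE clients_flexible (
        id INTEGER PRIMARY KEY,          -- internal key, not a legal identifier
        ssn TEXT,                        -- nullable: may be supplied later, or never
        name TEXT
    )
""")

# The flexible schema can record an interaction with an anonymous client...
conn.execute("INSERT INTO clients_flexible (name) VALUES (NULL)")

# ...while the coercive one cannot: this insert raises an IntegrityError
try:
    conn.execute("INSERT INTO clients_coercive (ssn, name) VALUES (NULL, 'A')")
except sqlite3.IntegrityError as e:
    print("Rejected:", e)
```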

Three reasons to read

Here’s three reasons why you might want to read this paper:

  1. If, like me, you’re often advocating for the use of consistent, open identifiers, then this paper provides a useful perspective on how this approach might create issues or unwanted side effects outside of the simpler world of reference data
  2. If you’re designing digital public services then the design provocations around identifiers and approaches to identification are definitely worth reading. I think there’s some useful reflections about how we capture and manage personal information
  3. If you’re a public policy person advocating for consistent use of identifiers across agencies, then there are some important considerations around the policy, privacy and personal impacts of data collection in this paper

Three things I learned

Here’s three things that I learned from reading the paper.

  1. In a section on “The Data Work of Human Services Provision”, the authors highlighted three aspects of frontline data collection which I found useful to think about:
    • data compliance work – collecting data purely to support the needs of funders, which might be at odds with the needs of both the people being supported and the service delivery staff
    • data coordination work – which stems from the need to link and aggregate data across agencies and funders to provide coordinated support
    • data confidence work – the need to build a trusted relationship with people, at the front-line, in order to capture valid, useful data
  2. Similarly, the authors tease out four reasons for capturing identifiers, each of which have different motivations, outcomes and approaches to identification:
    • counting clients – a basic need to monitor and evaluate service provision, identification here is only necessary to avoid duplicates when counting
    • developing longitudinal histories – e.g. identifying and tracking support given to a person over time can help service workers to develop understanding and improve support for individuals
    • as a means of accessing services – e.g. helping to identify eligibility for support
    • to coordinate service provision – e.g. sharing information about individuals with other agencies and services, which may also have different approaches to identification and use of identifiers
  3. The design provocations around database design were helpful in highlighting some alternate approaches to capturing personal information, and the needs of the service versus those of the individual

Thoughts and impressions

As someone who has not been directly involved in the design of digital systems to support human services, I found the perspectives and insight shared in this paper really useful. If you’ve been working in this space for some time, then it may be less insightful.

However, I haven’t seen much discussion about good ways to design more humane digital services and, in particular, the databases behind them, so I suspect the paper could do with a wider airing. It’s useful reading alongside things like Falsehoods Programmers Believe About Names and Falsehoods Programmers Believe About Gender.

Why don’t we have a better approach to managing personal information in databases? Are there solutions out there already?

Finally, the paper makes some pointed comments about the role of funders in data ecosystems. Funders are routinely collecting and aggregating data as part of evaluation studies, but this data might also help support service delivery if it were more accessible. It’s interesting to consider the balance between minimising unnecessary collection of data simply to support evaluation versus the potential role of funders as intermediaries that can provide additional support to charities, agencies or other service delivery organisations that may lack the time, funding and capability to do more with that data.


How do data publishing choices shape data ecosystems?

This is the latest in a series of posts in which I explore some basic questions about data.

In our work at the ODI we have often been asked for advice about how best to publish data. When trying to give helpful advice, one thing I’m always mindful of is how the decisions about how data is published shape the ways in which value can be created from it. More specifically, whether those choices will enable the creation of a rich data ecosystem of intermediaries and users.

So what are the types of decisions that might help to shape data ecosystems?

To give a simple example, if I publish a dataset so it’s available as a bulk download, then you could use that data in any kind of application. You could also use it to create a service that helps other people create value from the same data, e.g. by providing an API or an interface to generate reports from the data. Publishing in bulk allows intermediaries to help create a richer data ecosystem. But, if I’d just published that same data via an API, then there are limited ways in which intermediaries can add value. Instead people must come directly to my API or services to use the data.

This is one of the reasons why people prefer open data to be available in bulk. It allows for more choice and flexibility in how it is used. But, as I noted in a recent post, depending on the “dataset archetype” your publishing options might be limited.

The decision to only publish a dataset as an API, even if it could be published in other ways, is often a deliberate one. The publisher may want to capture more of the value around the dataset, e.g. by charging for the use of an API. Or they may feel it is important to have more direct control over who uses it, and how. These are reasonable choices and, when the data is sensitive, sensible options.

But there are a variety of ways in which the choices that are made about how to publish data can shape or constrain the ecosystem around a specific dataset. It’s not just about bulk downloads versus APIs.

The choices include:

  • the licence that is applied to the data, which might limit it to non-commercial use, restrict redistribution, or impose limits on the use of derived data
  • the terms and conditions for the API or other service that provides access to the data. These terms are often conflated with data licences, but typically focus on aspects of service provision, for example rate limiting, restrictions on storage of API results, permitted uses of the API, permitted types of users, etc
  • the technology used to provide access to data. In addition to bulk downloads vs API, there are also details such as the use of specific standards, the types of API call that are possible, etc
  • the governance around the API or service that provides access to data, which might limit which users can get access to the service, or create friction that discourages use
  • the business model that is wrapped around the API or service, which might include a freemium model, chargeable usage tiers, service level agreements, usage limits, etc

I think these cover the main areas. Let me know if you think I’ve missed something.

You’ll notice that APIs and services provide more choices for how a publisher might control usage. This can be a good or a bad thing.

The range of choices also means it’s very easy to create a situation where an API or service doesn’t work well for some use cases. This is why user research and engagement is such an important part of releasing a data product and designing policy interventions that aim to increase access to data.

For example, let’s imagine someone has published an openly licensed dataset via an API that restricts users to a maximum number of API calls per month.

These choices limit some uses of the API, e.g. applications that need to make lots of queries. They also mean that downstream users creating web applications are unable to provide a good quality of service to their own users. A popular application might just stop working at some point over the course of the month because it has hit the usage threshold.

The dataset might technically be open, but practically its use has been constrained by other choices.

Those choices might have been made for good reasons. For example, as a way for the data publisher to be able to predict how much they need to invest each month in providing a free service that is accessible to lots of users making a smaller number of requests. There is inevitably a trade-off between the needs of individual users and the publisher.

Adding on a commercial usage tier for high volume users might provide a way for the publisher to recoup costs. It also allows some users to choose what to pay for their use of the API, e.g. to more smoothly handle unexpected peaks in their website traffic. But it may sometimes be simpler to provide the data in bulk to support those use cases. Different use cases might be better served by different publishing options.

Another example might be a system that provides access to both shared and open data via a set of APIs that conform to open standards. If the publisher makes it too difficult for users to actually sign up to use those APIs, e.g. because of difficult registration or certification requirements, then only those organisations that can afford to invest the time and money to gain access might bother using them. The end result might be a closed ecosystem that is built on open foundations.

I think it’s important to understand how this range of choices can impact data ecosystems. They’re important not just for how we design products and services, but also in helping to design successful policies and regulatory interventions. If we don’t consider the full range of choices, then we may not achieve the intended outcomes.

More generally, I think it’s important to think about the ecosystems of data use. Often I don’t think enough attention is paid to the variety of ways in which value is created. This can lead to poor choices, like choosing to try to sell data for short-term gain rather than considering the variety of ways in which value might be created in a more open ecosystem.

Let’s talk about plugs

This is a summary of a short talk I gave internally at the ODI to help illustrate some of the important aspects of data standards for non-technical folk. I thought I’d write it up here too, in case it’s useful for anyone else. Let me know what you think.

We benefit from standards in every aspect of our daily lives. But because we take them for granted, we don’t tend to think about them very much. At the ODI we’re frequently talking about standards for data which, if you don’t have a technical background, might be even harder to wrap your head around.

A good example can help to illustrate the value of standards. People frequently refer to telephone lines, railway tracks, etc. But there’s an example that we all have plenty of personal experience with.

Let’s talk about plugs!

You can confidently plug any of your devices into a wall socket and it will just work. No thought required.

Have you ever thought about what it would be like if plugs and wall sockets were all different sizes and shapes?

You couldn’t rely on being able to consistently plug your device into any random socket, so you’d have to carry around loads of different cables. Manufacturers might not design their plugs and sockets very well, so there might be greater risks of electrocution or fires. Or maybe the company that built your new house decided to only fit a specific type of wall socket because it agreed a deal with an electrical manufacturer, so when you move in you’d need to buy a completely new set of devices.

We don’t live in that world thankfully. As a nation we’ve agreed that all of our plugs should be designed the same way.

That’s all a standard is. A documented, reusable agreement that everyone uses.

Notice that a single standard, “how to design a really great plug”, has multiple benefits. Safety is increased. We save time and money. Manufacturers can be confident that their equipment will work in any home or office.

That’s true of different standards too. Standards have economic, policy, technical and social impacts.

Open up a UK plug and it looks a bit like this.

Notice that there are colours for different types of wires (2, 3, 4). And that fuses (5) are expected to be the same size and shape. Those are all standards too. The wiring and voltages are standardised too.

So the wiring, wall sockets and plugs in your house are designed according to a whole family of different standards, which are designed to work with one another.

We can design more complex systems from smaller standards. It helps us make new things faster, because we are reusing existing work.

That’s a lot of time and agreement that we all benefit from. Someone somewhere has invested the time and energy into thinking all of that through. Lucky us!

When we visit other countries, we learn that their plugs and sockets are different. Oh no!

That can be a bit frustrating, and means we have to spend a bit more money and remember to pack the right adapters. It’d be nice if the whole world agreed on how to design a plug. But that seems unlikely. It would cost a lot of time and money in replacing wiring and sockets.

But maybe those different designs are intentional? Perhaps there are different local expectations around safety, for example. Or in what devices people might be using in their homes. There might be reasons why different communities choose to design and adopt slightly different standards. Because they’re meeting slightly different needs. But sometimes those differences might be unnecessary. It can be hard to tell sometimes.

The people most impacted by these differences aren’t tourists, it’s the manufacturers that have to design equipment to work in different locations. Which is why your electrical devices normally have a separate cable. So, depending on whether you travel, or whether you’re a device manufacturer, you’ll have different perceptions of how much of a problem that is.

All of the above is true for data standards.

Standards for data are agreements that help us collect, access, share, use and publish data in consistent ways. They have a range of different impacts.

There are lots of different types of standard and we combine them together to create different ways to successfully exchange data. Different communities often have their own standards for similar things, e.g. for describing metadata or accessing data via an API.

Sometimes those are simple differences that an adapter can easily fix. Sometimes those differences are because the standards are designed to meet different needs.

Unfortunately we don’t live in a world of standardised data plugs and wires and fuses. We live in that other world. The one where it’s hard to connect one thing to another thing. Where the stuff coming down the wires is completely unexpected. And we get repeated shocks from accidental releases of data.

I guarantee that in every piece of user research, every interview, government consultation or call for evidence, people will consistently highlight the need for more standards for data. People will often say this explicitly: “We need more standards!”. But sometimes they refer to the need in other ways: “We need to make data more discoverable!” (metadata standards) or “We need to make it easier to safely release data!” (standardised codes of practice).

Unfortunately that’s not always that helpful because when you probe a little deeper you find that people are talking about lots of different things. Some people want to standardise the wiring. Others just want to agree on a voltage. While others are still debating the definition of “fuse”. These are all useful and important things. You just need to dig a little deeper to find the most useful place to start.

It’s also not always clear whose job it is to actually create those standards. Because we take standards for granted, we’re not always clear about how they get created. Or how long it takes, and what process to follow to ensure they’re well designed.

The reason we published the open standards for data guidebook was to help communities get started in designing the standards they need.

Standards development needs time and investment, as someone somewhere needs to do the work of creating them. That, as ever, is the really hard part.

Standards are part of the data infrastructure that help us unlock value from data. We need to invest in creating and maintaining them like we do other parts of our infrastructure.

Don’t just listen to me, listen to some of the people who’ve been creating standards for their communities.

The words we use for data

I’ve been on leave this week so, amongst the gardening and relaxing, I’ve had a bit of head space to think. One of the things I’ve been thinking about is the words we choose to use when talking about data. It was Dan’s recent blog post that originally triggered it. But I was reminded of it this week after seeing more people talking past each other, and reading about how the Guardian has changed the language it uses when talking about the environment: climate crisis, not climate change.

As Dan pointed out, we often need a broader vocabulary when talking about data. Talking about “data” in general can be helpful when we want to focus on commonalities. But for experts we need more distinctions. And for non-experts we arguably need something more tangible. “Data”, “algorithm” and “glitch” are the default words we use, but there are often better ones.

It can be difficult to choose good words for data because everything can be treated as data these days. Whether it’s numbers, text, images or video everything can be computed on, reported and analysed. Which makes the idea of data even more nebulous for many people.

In Metaphors We Live By, George Lakoff and Mark Johnson discuss how the range of metaphors we use in language, whether consciously or unconsciously, impacts how we think about the world. They highlight that careful choice of metaphors can help to highlight or obscure important aspects of the things we are discussing.

The example that stuck with me was how we describe debates. We often do so in terms of things to be won, or battles to be fought (“the war of words”). What if we thought of debates as dances instead? Would that help us focus on compromise and collaboration?

This is why I think that data as infrastructure is such a strong metaphor. It helps to highlight some of the most important characteristics of data: that it is collected and used by communities, needs to be supported by guidance, policies and technologies and, most importantly, needs to be invested in and maintained to support a broad variety of uses. We’ve all used roads and engaged with the systems that let us make use of them. Focusing on data as information, as zeros and ones, brings nothing to the wider debate.

If our choice of metaphors and words can help to highlight or hide important aspects of a discussion, then what words can we use to help focus some of our discussions around data?

It turns out there’s quite a few.

For example there are “samples” and “sampling”. These are words used in statistics, but their broader usage has the same meaning. When we talk about sampling something, whether it’s food or drink, music or perfume, it’s clear that we’re not taking the whole thing. Talking about sampling might help us to be clearer that often, when we’re collecting data, we don’t have the whole picture. We just have a tester, a taste. Hopefully one which is representative of the whole. We can make choices about when, where and how often we take samples. We might only be allowed to take a few.

“Polls” and “polling” are similar words. We sample people’s opinions in a poll. While we often use these words in more specific ways, they helpfully come with some understanding that this type of data collection and analysis is imperfect. We’re all very familiar at this point with the limitations of polls.

Or how about “observations” and “observing”? Unlike “sensing”, which is a passive word, “observing” is more active and purposeful. It implies that someone or something is watching. When we want to highlight that data is being collected about people or the environment, “taking observations” might help us think about who is doing the observing, and why. Instead of “citizen sensing”, which is a passive way of describing participatory data collection, “citizen observers” might place a bit more focus on the work and effort that is being contributed.

“Catalogues” and “cataloguing” are words that, for me at least, imply maintenance and value-added effort. I think of librarians cataloguing books and artefacts. “Stewards” and “curators” are other important roles.

AI and machine learning are often used to make predictions. For example, of products we might want to buy, or whether we’re going to commit a crime. Or how likely it is that we might have a car accident based on where we live. These predictions are imperfect, but we talk about algorithms as “knowing”, “spotting”, “telling” or “helping”. They don’t really do any of those things.

What they are doing is making a “forecast”. We’re all familiar with weather forecasts and their limits. So why not use the same words for the same activity? It might help to highlight the uncertainty around the uses of the data and technology, and reinforce the need to use these forecasts as context.

In other contexts we talk about using data to build models of the world. Or to build “digital twins”. Perhaps we should just talk more about “simulations”? There are enough people playing games these days that I suspect there’s a broader understanding of what a simulation is: a cartoon sketch of some aspect of the real world that might be helpful but which has its limits.

Other words we might use are “ratings” and “reviews” to help describe data and systems that create rankings and automated assessments. Many of us have encountered ratings and reviews, and understand that they are often highly subjective and need interpretation.

Or how about simply “measuring” as a tangible example of collecting data? We’ve all used a ruler or measuring tape and know that sometimes we need to be careful about taking measurements: “Measure twice, cut once”.

I’m sure there are lots of others. I’m also well aware that not all of these terms will be familiar to everyone. And not everyone will associate them with things in the same way as I do. The real proof will be testing words with different audiences to see how they respond.

I think I’m going to try to deliberately use a broad range of language in my talks and writing, and see how it fares.

What terms do you find most useful when talking about data?