Can the regulation of hazardous substances help us think about regulation of AI?

This post is a thought experiment. It considers how existing laws that cover the registration and testing of hazardous substances like pesticides might be used as an analogy for thinking through approaches to regulation of AI/ML.

As a thought experiment it’s not a detailed or well-researched proposal, but there are elements which I think are interesting. I’m interested in feedback and also pointers to more detailed explorations of similar ideas.

A cursory look at substance registration legislation in the EU and US

Under EU REACH legislation, if you want to manufacture or import large amounts of potentially hazardous chemical substances then you need to register with the ECHA. The registration process involves providing information about the substance and its potential risks.

“No data no market” is a key principle of the legislation. The private sector carries the burden of collecting data and demonstrating safety of substances. There is a standard set of information that must be provided.

In order to demonstrate safety, companies may need to carry out animal testing. The legislation has been designed to minimise unnecessary animal testing. While there is an argument that all testing is unnecessary, current practice requires testing in some circumstances. Where testing is not required, other data sources can be used. But controlled animal tests are the proof of last resort if no other data is available.

To further minimise the need to carry out tests on animals, the legislation is designed to encourage companies registering the same (or similar) substances to share data with one another in a “fair, transparent and non-discriminatory way”. There is detailed guidance around data sharing, including a legal framework and guidance on cost sharing.

The coordination around sharing data and costs is achieved via a SIEF (Substance Information Exchange Forum), a loose consortium of businesses looking to register the same substance. There is guidance to help facilitate creation of these sharing forums.

The US has a similar set of laws which also aim to encourage sharing of data across companies to minimise animal testing and other regulatory burdens. The practice of “data compensation” provides businesses with a right to charge fees for use of data. The legislation doesn’t define acceptable fees, but does specify an arbitration procedure.

The compensation, along with some exclusive use arrangements, is intended to avoid discouraging original research, testing and registration of new substances. Companies that bear the costs of developing new substances can have exclusive use for a period and expect some compensation for the research costs of bringing them to market. Later manufacturers can benefit from the safety testing results, but have to pay for the privilege of access.

Summarising some design principles

Based on my reading, I think both sets of legislation are ultimately designed to:

  • increase safety of the general public, by ensuring that substances are properly tested and documented
  • require companies to assess the risks of substances
  • take an ethical stance on reducing unnecessary animal testing and other data collection by facilitating
    data sharing
  • require companies to register their intention to manufacture or import substances
  • enable companies to coordinate in order to share costs and other burdens of registration
  • provide an arbitration route if data is not being shared
  • avoid discouraging new research and development by providing a cost sharing model to offset regulatory requirements

Parallels to AI regulation

What if we adopted a similar approach towards the regulation of AI/ML?

When we think about some of the issues with large scale, public deployment of AI/ML, I think the debate often highlights a variety of needs, including:

  • greater oversight about how systems are being designed and tested, to help understand risks and design problems
  • understanding how and where systems are being deployed, to help assess impacts
  • minimising harms to either the general public, or specific communities
  • thorough testing of new approaches to assess immediate and potential long-term impacts
  • reducing unnecessary data collection that is otherwise required to train and test models
  • exploration of potential impacts of new technologies to address social, economic and environmental problems
  • continued encouragement of primary research and innovation

That list is not exhaustive. I suspect not everyone will necessarily agree on the importance of all elements.

However, if we look at these concerns and the principles that underpin the legislation of hazardous substances, I think there are a lot of parallels.

Applying the approach to AI

What if, for certain well-defined applications of AI/ML such as facial recognition, autonomous vehicles, etc, we required companies to:

  • register their systems, accompanied by a standard set of technical, testing and other documentation
  • carry out tests of their system using agreed protocols, to encourage consistency in comparison across testing
  • share data, e.g. via a data trust or similar model, in order to minimise the unnecessary collection of data and to facilitate some assessment of bias in training data
  • demonstrate and document the safety of their systems to agreed standards, allowing public and private sector users of systems and models to make informed decisions about risks, or to support enforcement of legal standards
  • coordinate to share costs of collecting and maintaining data, conducting tests of standard models, etc
  • and, perhaps, after a period, accept that trained models would become available for others to reuse, similarly to how medicines or other substances may ultimately be manufactured by other companies

In addition to providing more controls and assurance around how AI/ML is being deployed, an approach based on facilitating collaboration around collection of data might help nudge new and emerging sectors into a more open direction, right from the start.

There are a number of potential risks and issues which I will acknowledge up front:

  • sharing of data about hazardous substance testing doesn’t have to address data protection. But this could be factored in to the design, and some uses of AI/ML draw on non-personal data
  • we may want to simply ban, or discourage use of some applications of AI/ML, rather than enable it. But at the moment there are few, if any controls
  • the approach might encourage collection and sharing of data which we might otherwise want to restrict. But strong governance and access controls, via a data trust or other institution might actually raise the bar around governance and security, beyond that which individual businesses can, or are willing to achieve. Coordination with a regulator might also help decide on how much is “enough” data
  • the utility of data and openly available models might degrade over time, requiring ongoing investment
  • the approach seems most applicable to uses of AI/ML with similar data requirements. In practice there may be only a small number of these, or data requirements may vary enough to limit benefits of data sharing

Again, not an exhaustive list. But as I’ve noted, I think there are ways to mitigate some of these risks.

Let me know what you think, what I’ve missed, or what I should be reading. I’m not in a position to move this forward, but welcome a discussion. Leave your thoughts in the comments below, or ping me on twitter.

When can we expect more from data portability?

We’re at the end of week 5 of 2020, of the new decade and I’m on a diet.

I’m back to using MyFitnessPal again. I’ve used it on and off for the last 10 years whenever I’ve decided that now is the time to be more healthy. The sporadic but detailed history of data collection around my weight and eating habits marks out each of the times when this time was going to be the time when I really made a change.

My success has been mixed. But the latest diet is going pretty well, thanks for asking.

This morning the app chose the following feature to highlight as part of its irregular nudges for me to upgrade to premium.

Downloading data about your weight, nutrition and exercise history is a premium feature of the service. This gave me pause for thought for several reasons.

Under UK legislation, and for as long as we maintain data adequacy with the EU, I have a right to data portability. I can request access to any data about me, in a machine-readable format, from any service I happen to be using.

The company that produce MyFitnessPal, Under Armour, do offer me a way to exercise this right. It’s described in their privacy policy, as shown in the following images.

[Image: Note about how to exercise your GDPR rights in MyFitnessPal]
[Image: Data portability in MyFitnessPal]

Rather than enabling this access via an existing product feature, they’ve decided to make me and everyone else request the data directly. Every time I want to use it.

This might be a deliberate decision. They’re following the legislation to the letter. Perhaps it’s a conscious decision to push people towards a premium service, rather than make it easy by default. Their user base is international, so they don’t have to offer this feature to everyone.

Or maybe it’s the legal and product teams not looking at data portability as an opportunity. That’s something that the ODI has previously explored.

I’m hoping to see more exploration of the potential benefits and uses of data portability in 2020.

I think we need to re-frame the discussion away from compliance and on to commercial and consumer benefits. For example, by highlighting how access to data contributes to building ecosystems around services, to help retain and grow a customer base. That is more likely to get traction than a continued focus on compliance and product switching.

MyFitnessPal already connects into an ecosystem of other services. A stronger message around portability might help grow that further.  After all, there are more reasons to monitor what you eat than just weight loss.

Clearer legislation and stronger guidance from organisations like ICO and industry regulators describing how data portability should be implemented would also help. Wider international adoption of data portability rights wouldn’t hurt either.

There’s also a role for community driven projects to build stronger norms and expectations around data portability. Projects like OpenSchufa demonstrate the positive benefits of coordinated action to build up an aggregated view of donated, personal data.

But I’d also settle for a return to the ethos of the early 2010s, when making data flow between services was the default. Small pieces, loosely joined.

If we want the big platforms to go on a diet, then they’re going to need to give up some of those bytes.

Licence Friction: A Tale of Two Datasets

For years now at the Open Data Institute we’ve been working to increase access to data, to create a range of social and economic benefits across a range of sectors. While the details change across projects, one of the more consistent aspects of our work and guidance has been to support data stewards in making data as open as possible, whilst ensuring that it is clearly licensed.

Reference data, like addresses and other geospatial data, that underpins our national and global data infrastructure needs to be available under an open licence. If it’s not, which is the ongoing situation in the UK, then other data cannot be made as open as possible. 

Other considerations aside, data can only be as open as the reference data it relies upon. Ideally, reference data would be in the public domain, e.g. using a CC0 waiver. Attribution should be a consistent norm regardless of what licence is used.

Data becomes more useful when it is linked with other data. When it comes to data, adding context adds value. It can also add risks, but more value can be created from linking data. 

When data is published using bespoke or restrictive licences then it is harder to combine different datasets together, because there are often limitations in the licensing terms that restrict how data can be used and redistributed.

This means data needs to be licensed using common, consistent licences. Licences that work with a range of different types of data, collected and used by different communities across jurisdictions. 

Incompatible licences create friction that can make it impossible to create useful products and services. 

It’s well-reported that data scientists and other users spend huge amounts of time cleaning and tidying data because it’s messy and non-standardised. It’s probably less well-reported how many great ideas are simply shelved because of lack of access to data. Or are impossible because of issues with restrictive or incompatible data licences. Or are cancelled or simply needlessly expensive due to the need for legal consultations and drafting of data sharing agreements.

These are the hurdles you often need to overcome before you even get started with that messy data.

Here’s a real-world example of where the lack of open geospatial data in the UK, and ongoing incompatibilities between data licensing is getting in the way of useful work. 

Introducing Active Places

Active Places is a dataset stewarded by Sport England. It provides a curated database of sporting facilities across England. It includes facilities provided by a range of organisations across the public, private and third-sectors. It’s designed to help support decision making about the provision of tens of thousands of sporting sites and facilities around the UK to drive investment and policy making. 

The dataset is rich and includes a wide range of information from disabled access through to the length of ski slopes or the number of turns on a cycling track.

While Sport England are the data steward, the curation of the dataset is partly subcontracted to a data management firm and partly carried out collaboratively with the owners of those sites and facilities.

The dataset is published under a standard open licence, the Creative Commons Attribution 4.0 licence. So anyone can access, use and share the data so long as they acknowledge its source. Contributors to the dataset agree to this licence as part of registering to contribute to the site.

The dataset includes geospatial data, including the addresses and locations of individual sites. This data includes IP from Ordnance Survey and Royal Mail, which means they have a say over what happens to it. In order to release the data under an open licence, Sport England had to request an exemption from the Ordnance Survey to their default position, which is that data containing OS IP cannot be sublicensed. When granted an exemption, an organisation may publish their data under an open licence. In short, OS waive their rights over the geographic locations in the data. 

The OS can’t, however, waive any rights that Royal Mail has over the address data. In order to grant Sport England an exemption, the OS also had to seek permission from Royal Mail. The Sport England team were able to confirm this for me.

Unfortunately it’s not clear, without having checked, that this is actually the case. It’s not evident in the documentation of either Active Places or the OS exemption process. Is clarifying all third-party rights a routine part of the exemption process or not?

It would be helpful to know. As the ODI has highlighted, lack of transparency around third-party rights in open data is a problem. For many datasets the situation remains unclear. And unclear positions are fantastic generators of legal and insurance fees.

So, to recap: Sport England has invested time in convincing Ordnance Survey to allow it to openly publish a rich dataset for the public good. A dataset in which geospatial data is clearly important, but is not the main feature of the dataset. The reference data is dictating how open the dataset can be and, as a result, how much value can be created from it.

In case you’re wondering, lots of other organisations have had to do the same thing. The process is standardised to try and streamline it for everyone. A 2016 FOI request shows that between 2011 and 2015 the Ordnance Survey handled more than 1,000 of these requests.

Enter OpenStreetMap

At the end of 2019, members of the OpenStreetMap community contacted Sport England to request permission to use the Active Places dataset.

If you’re not familiar with OpenStreetMap, then you should be. It’s an openly licensed map of the world maintained by a huge community of volunteers, humanitarian organisations, and public and private sector organisations around the world.

The OpenStreetMap Foundation is the official steward of the dataset, with the day-to-day curation and operations happening through its volunteer network. As a small not-for-profit, it has to be very cautious about legal issues relating to the data. It can’t afford to be sued. The community is careful to ensure that data that is imported or added into the database comes from openly licensed sources.

In March 2017, after a consultation with Creative Commons, the OpenStreetMap Licence/Legal Working Group concluded that data published under the Creative Commons Attribution licence is not compatible with the licence used by OpenStreetMap, which is called the Open Database Licence (ODbL). They felt that some specific terms in the licence (and particularly in its 4.0 version) meant that they needed additional permission in order to include that data in OpenStreetMap.

Since then the OpenStreetMap community has been contacting data stewards to ask them to sign an additional waiver that grants the OSM community explicit permission to use the data. This is exactly what open licensing of data is intended to avoid.

CC-BY is one of the most frequently used open data licences, so this isn’t a rare occurrence. 

As an indicator of the extra effort required, in a 2018 talk in which the Bing Maps team discussed how they have been supporting the OpenStreetMap community in Australia, they called out their legal team as one of the most important assets they had to provide to the local mapping community, helping them to get waivers signed. At the time of writing nearly 90 waivers have been circulated in Australia alone, not all of which have been signed.

So, to recap, due to a perceived incompatibility between two of the most frequently used open data licences, the OpenStreetMap community and its supporters are spending time negotiating access to data that is already published under an open licence.
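To make that friction a little more concrete, here is a minimal sketch of the check an importer ends up doing before adding a dataset to an ODbL-licensed database. The table, the licence identifiers and the function name are all hypothetical and written purely for illustration. The only entry taken from this post is the working group position that CC-BY 4.0 data needs additional permission. It is not legal advice, and a real compatibility decision depends on much more than a lookup.

```python
# Hypothetical lookup: (source licence, target licence) -> can data under the
# source licence be imported into a database under the target licence without
# asking the publisher for anything extra?
IMPORTABLE_WITHOUT_WAIVER = {
    ("CC0-1.0", "ODbL-1.0"): True,     # public domain waiver, no conditions to reconcile
    ("ODbL-1.0", "ODbL-1.0"): True,    # same licence on both sides
    ("CC-BY-4.0", "ODbL-1.0"): False,  # the position described above
}


def can_import(source_licence: str, target_licence: str, waiver_signed: bool = False) -> bool:
    """Return True if the data can be imported, either directly or via a signed waiver."""
    key = (source_licence, target_licence)
    if key not in IMPORTABLE_WITHOUT_WAIVER:
        # No recorded decision: someone has to read the licence and decide.
        raise ValueError(f"no compatibility decision recorded for {key}")
    return IMPORTABLE_WITHOUT_WAIVER[key] or waiver_signed


if __name__ == "__main__":
    print(can_import("CC0-1.0", "ODbL-1.0"))
    print(can_import("CC-BY-4.0", "ODbL-1.0"))
    print(can_import("CC-BY-4.0", "ODbL-1.0", waiver_signed=True))
```

The `waiver_signed` flag is where the cost sits: every `False` in the table becomes a conversation with a data steward, and often with their legal team.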

I am not a lawyer. So these are like, just my opinions. But while I understand why the OSM Licence Working Group needs to be cautious, it feels like they are being overly cautious. Then again, I’m not the one responsible for stewarding an increasingly important part of a global data infrastructure. 

Another opinion is that perhaps the Microsoft legal team might be better deployed to solve the licence incompatibility issues. Instead they are now drafting their own new open data licences, which are compatible with CC-BY.

Active Places and OpenStreetMap

Still with me?

At the end of last year, members of the OpenStreetMap community contacted Sport England to ask them to sign a waiver so that they could use the Active Places data. Presumably to incorporate some of the data into the OSM database.

The Sport England data and legal teams then had to understand what they were being asked to do and why. And they asked for some independent advice, which is where I provided some support through our work with Sport England on the OpenActive programme. 

The discussion included:

  • questions about why an additional waiver was actually necessary
  • the differences in how CC-BY and ODbL are designed to require data to remain open and accessible – CC-BY includes a limitation on the use of technical restrictions, which is allowed by the open definition, whilst ODbL adopts a principle of encouraging “parallel distribution”
  • acceptable forms and methods of attribution
  • who, within an organisation like Sport England, might have responsibility to decide what acceptable attribution looked like
  • why the OSM community had come to its decisions
  • who actually had authority to sign-off on the proposed waiver
  • whether signing a waiver and granting a specific permission undermined Sport England’s goal to adopt standard open data practices and licences, and a consistent approach for every user
  • whether the OS exemption, which granted permission to SE to publish the dataset under an open licence, impacted any of the above

All reasonable questions from a team being asked to do something new. 

Like a number of organisations asked to sign waivers in Australia, SE have not yet signed a waiver and may choose not to do so. Like all public sector organisations, SE are being cautious about taking risks.

The discussion has spilled out onto twitter. I’m writing this to provide some context and background to the discussion in that thread. I’m not criticising anyone as I think everyone is trying to come to a reasonable outcome. 

As the twitter thread highlights, the OSM community are not just concerned about the CC-BY licence but also about the potential that additional third-party rights are lurking in the data. Clarifying that may require SE to share more details about how the address and location data in the dataset is collected, validated and normalised for the OSM community to be happy. But, as noted earlier in the blog, I’ve at least been able to determine the status of any third-party rights in the data. So perhaps this will help to move things further.

The End

So, as a final recap, we have two organisations both aiming to publish and use data for the public good. But, because of complexities around derived data and licence compatibilities, data that might otherwise be used in new, innovative ways is instead going unused.

This is a situation that needs solving. It needs the UK government and Geospatial Commission to open up more geospatial data.

It needs the open data community to invest in resolving licence incompatibilities (and less in creating new licences) so that everyone benefits. 

We also need to understand when licences are the appropriate means of governing how data is used and when norms, e.g. around attribution, can usefully shape how data is accessed, used and shared.

Until then these issues are going to continue to undermine the creation of value from open (geospatial) data.

[Paper review] Open data for electricity modeling: Legal aspects

This blog post is a quick review and notes relating to a research paper called: Open data for electricity modeling: Legal aspects.

It’s part of my new research notebook to help me collect and share notes on research papers and reports.

Brief summary

The paper reviews the legal status of publicly available energy data (and some related datasets) in Europe, with a focus on German law. The paper is intended to help identify some of the legal issues relevant to creation of analytical models to support use of energy data, e.g. for capacity planning.

As background, the paper describes the types of data relevant to building these types of model, the relevant aspects of database and copyright law in the EU and the properties of open licences. This background is used to assess some of the key data assets published in the EU and how they are licensed (or not) for reuse.

The paper concludes that the majority of uses of this data to support energy modelling in the EU, whether for research or other purposes, is likely to be infringing on the rights of the database holders, meaning that users are currently carrying legal risks. The paper notes that in many cases this is likely not the intended outcome.

The paper provides a range of recommendations to address this issue, including the adoption of open licences.

Three reasons to read

Here are three reasons why you might want to read this paper:

  1. It provides a helpful primer on the range of datasets and data types that are used to develop applications in the energy sector in the EU. Useful if you want to know more about the domain
  2. The background information on database rights and related IP law is clearly written and a good introduction to the topic
  3. The paper provides a great case study of how licensing and legal protections apply to data use in a sector. The approach taken could be reused and extended to other areas

Three things I learned

Here are three things that I learned from reading the paper.

  1. That a database might be covered by copyright (an “original” database) in addition to database rights. But the authors note this doesn’t apply in the case of a typical energy dataset
  2. That individual member states might have their own statutory exemptions to the Database Directive. E.g. in Germany it doesn’t apply to use of data in non-commercial teaching. So there is variation in how it applies.
  3. The discussion on how the Database Directive relates to statutory obligations to publish data was interesting, but highlights that the situation is unclear.

Thoughts and impressions

Great paper that clearly articulates the legal issues relating to publication and use of data in the energy sector in the EU. It’s easy to extrapolate from this work to other use cases in energy and by extension to other sectors.

The paper concludes with a good set of recommendations: the adoption of open licences, the need to clarify rights around data reuse and the role of data institutions in doing that, and how policy makers can push towards a more open ecosystem.

However there’s a suggestion that funders should just mandate open licences when funding academic research. While this is the general trend I see across research funding, in the context of this article it lacks a bit of nuance. The paper clearly indicates that the current status quo is that data users do not have the rights to apply open licences to the data they are publishing and generating. I think funders also need to engage with other policy makers to ensure that upstream provision of data is aligned with an open research agenda. Otherwise we risk perpetuating an unclear landscape of rights and permissions. The authors do note the need to address wider issues, but I think there’s a potential role of research funders in helping to drive change.

Finally, in their review of open licences, the authors recommend a move towards adoption of CC0 (public domain waivers and marks) and CC-BY 4.0. But they don’t address the fact that upstream licensing might limit the choice of how researchers can license downstream data.

Specifically, the authors note the use of OpenStreetMap data to provide infrastructure data. However, depending on your use, you may need to adopt OpenStreetMap’s licence, the Open Database Licence, when republishing data. This can be at odds with a mandate to use other licences, or with restrictive licences used by other data stewards.

 

How do data publishing choices shape data ecosystems?

This is the latest in a series of posts in which I explore some basic questions about data.

In our work at the ODI we have often been asked for advice about how best to publish data. When trying to give helpful advice, one thing I’m always mindful of is how the decisions about how data is published shape the ways in which value can be created from it. More specifically, whether those choices will enable the creation of a rich data ecosystem of intermediaries and users.

So what are the types of decisions that might help to shape data ecosystems?

To give a simple example, if I publish a dataset so it’s available as a bulk download, then you could use that data in any kind of application. You could also use it to create a service that helps other people create value from the same data, e.g. by providing an API or an interface to generate reports from the data. Publishing in bulk allows intermediaries to help create a richer data ecosystem. But, if I’d just published that same data via an API then there are limited ways in which intermediaries can add value. Instead people must come directly to my API or services to use the data.

This is one of the reasons why people prefer open data to be available in bulk. It allows for more choice and flexibility in how it is used. But, as I noted in a recent post, depending on the “dataset archetype” your publishing options might be limited.

The decision to only publish a dataset as an API, even if it could be published in other ways, is often a deliberate decision. The publisher may want to capture more of the value around the dataset, e.g. by charging for the use of an API. Or they may feel it is important to have more direct control over who uses it, and how. These are reasonable choices and, when the data is sensitive, sensible options.

But there are a variety of ways in which the choices that are made about how to publish data can shape or constrain the ecosystem around a specific dataset. It’s not just about bulk downloads versus APIs.

The choices include:

  • the licence that is applied to the data, which might limit it to non-commercial use. Or restrict redistribution. Or impose limits on the use of derived data
  • the terms and conditions for the API or other service that provides access to the data. These terms are often conflated with data licences, but typically focus on aspects of service provisions, for example rate limiting, restriction on storage of API results, permitted uses of the API, permitted types of users, etc
  • the technology used to provide access to data. In addition to bulk downloads vs API, there are also details such as the use of specific standards, the types of API call that are possible, etc
  • the governance around the API or service that provides access to data, which might limit which users can get access to the service or create friction that discourages use
  • the business model that is wrapped around the API or service, which might include a freemium model, chargeable usage tiers, service level agreements, usage limits, etc

I think these cover the main areas. Let me know if you think I’ve missed something.

You’ll notice that APIs and services provide more choices for how a publisher might control usage. This can be a good or a bad thing.

The range of choices also means it’s very easy to create a situation where an API or service doesn’t work well for some use cases. This is why user research and engagement is such an important part of releasing a data product and designing policy interventions that aim to increase access to data.

For example, let’s imagine someone has published an openly licensed dataset via an API that restricts users to a maximum number of API calls per month.

This choice limits some uses of the API, e.g. applications that need to make lots of queries. It also means that downstream users creating web applications may be unable to provide a good quality of service to their own users. A popular application might just stop working at some point over the course of the month because it has hit the usage threshold.
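As a rough illustration of that failure mode, here is a minimal sketch of a monthly quota and a helper that works out the day on which a steadily used application would first be refused. The class, the function and all of the numbers are hypothetical. Real APIs enforce limits in many different ways, so treat this as a toy model rather than a description of any particular service.

```python
from dataclasses import dataclass
from typing import Optional


class QuotaExceeded(Exception):
    """Raised when a client has used up its monthly allowance."""


@dataclass
class MonthlyQuota:
    limit: int      # hypothetical maximum number of calls per month
    used: int = 0   # calls made so far this month

    def record_call(self) -> None:
        """Count one API call, refusing it once the limit has been reached."""
        if self.used >= self.limit:
            raise QuotaExceeded(f"monthly limit of {self.limit} calls reached")
        self.used += 1


def first_day_blocked(daily_calls: int, limit: int, days_in_month: int = 30) -> Optional[int]:
    """Return the 1-based day on which a call is first refused, or None if never."""
    quota = MonthlyQuota(limit=limit)
    for day in range(1, days_in_month + 1):
        for _ in range(daily_calls):
            try:
                quota.record_call()
            except QuotaExceeded:
                return day
    return None


if __name__ == "__main__":
    # A small site stays within the limit; a popular one runs out mid-month.
    print(first_day_blocked(daily_calls=100, limit=10_000))
    print(first_day_blocked(daily_calls=1_000, limit=10_000))
```

Under these assumed numbers, the quieter application never notices the limit, while the busier one stops working for roughly two thirds of the month.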

The dataset might be technically open, but in practice its use has been constrained by other choices.

Those choices might have been made for good reasons. For example as a way for the data publisher to be able to predict how much they need to invest each month in providing a free service, that is accessible to lots of users making a smaller number of requests. There is inevitably a trade-off between the needs of individual users and the publisher.

Adding on a commercial usage tier for high volume users might provide a way for the publisher to recoup costs. It also allows some users to choose what to pay for their use of the API, e.g. to more smoothly handle unexpected peaks in their website traffic. But it may sometimes be simpler to provide the data in bulk to support those use cases. Different use cases might be better served by different publishing options.

Another example might be a system that provides access to both shared and open data via a set of APIs that conform to open standards. If the publisher makes it too difficult for users to actually sign up to use those APIs, e.g. because of difficult registration or certification requirements, then only those organisations that can afford to invest the time and money to gain access might bother using them. The end result might be a closed ecosystem that is built on open foundations.

I think it’s important to understand how this range of choices can impact data ecosystems. They’re important not just for how we design products and services, but also in helping to design successful policies and regulatory interventions. If we don’t consider the full range of choices, then we may not achieve the intended outcomes.

More generally, I think it’s important to think about the ecosystems of data use. Often I don’t think enough attention is paid to the variety of ways in which value is created. This can lead to poor choices, like choosing to try and sell data for short term gain rather than considering the variety of ways in which value might be created in a more open ecosystem.

The words we use for data

I’ve been on leave this week so, amongst the gardening and relaxing, I’ve had a bit of head space to think. One of the things I’ve been thinking about is the words we choose to use when talking about data. It was Dan’s recent blog post that originally triggered it. But I was reminded of it this week after seeing more people talking past each other and reading about how the Guardian has changed the language it uses when talking about the environment: Climate crisis not climate change.

As Dan pointed out we often need a broader vocabulary when talking about data.  Talking about “data” in general can be helpful when we want to focus on commonalities. But for experts we need more distinctions. And for non-experts we arguably need something more tangible. “Data”, “algorithm” and “glitch” are default words we use but there are often better ones.

It can be difficult to choose good words for data because everything can be treated as data these days. Whether it’s numbers, text, images or video, everything can be computed on, reported and analysed. That makes the idea of data even more nebulous for many people.

In Metaphors We Live By, George Lakoff and Mark Johnson discuss how the range of metaphors we use in language, whether consciously or unconsciously, impacts how we think about the world. They highlight that careful choice of metaphors can help to highlight or obscure important aspects of the things we are discussing.

The example that stuck with me was how we describe debates. We often do so in terms of things to be won, or battles to be fought (“the war of words”). What if we thought of debates as dances instead? Would that help us focus on compromise and collaboration?

This is why I think that data as infrastructure is such a strong metaphor. It helps to highlight some of the most important characteristics of data: that it is collected and used by communities, needs to be supported by guidance, policies and technologies and, most importantly, needs to be invested in and maintained to support a broad variety of uses. We’ve all used roads and engaged with the systems that let us make use of them. Focusing on data as information, as zeros and ones, brings nothing to the wider debate.

If our choice of metaphors and words can help to highlight or hide important aspects of a discussion, then what words can we use to help focus some of our discussions around data?

It turns out there are quite a few.

For example there are “samples” and “sampling”. These are words used in statistics but their broader usage has the same meaning. When we talk about sampling something, whether it’s food or drink, music or perfume, it’s clear that we’re not taking the whole thing. Talking about sampling might help us to be clearer that often when we’re collecting data we don’t have the whole picture. We just have a tester, a taste. Hopefully one which is representative of the whole. We can make choices about when, where and how often we take samples. We might only be allowed to take a few.

“Polls” and “polling” are similar words. We sample people’s opinions in a poll. While we often use these words in more specific ways, they helpfully come with some understanding that this type of data collection and analysis is imperfect. We’re all very familiar at this point with the limitations of polls.

Or how about “observations” and “observing”? Unlike “sensing” which is a passive word, “observing” is more active and purposeful. It implies that someone or something is watching. When we want to highlight that data is being collected about people or the environment “taking observations” might help us think about who is doing the observing, and why. Instead of “citizen sensing” which is a passive way of describing participatory data collection, “citizen observers” might place a bit more focus on the work and effort that is being contributed.

“Catalogues” and “cataloguing” are words that, for me at least, imply maintenance and value-added effort. I think of librarians cataloguing books and artefacts. “Stewards” and “curators” are other important roles.

AI and Machine Learning are often being used to make predictions. For example, of products we might want to buy, or whether we’re going to commit a crime. Or how likely it is that we might have a car accident based on where we live. These predictions are imperfect. But we talk about algorithms as “knowing”, “spotting”, “telling” or “helping”. But they don’t really do any of those things.

What they are doing is making a “forecast”. We’re all familiar with weather forecasts and their limits. So why not use the same words for the same activity? It might help to highlight the uncertainty around the uses of the data and technology, and reinforce the need to use these forecasts as context.

In other contexts we talk about using data to build models of the world. Or to build “digital twins”. Perhaps we should just talk more about “simulations”? There are enough people playing games these days that I suspect there’s a broader understanding of what a simulation is: a cartoon sketch of some aspect of the real world that might be helpful but which has its limits.

Other words we might use are “ratings” and “reviews” to help to describe data and systems that create rankings and automated assessments. Many of us have encountered ratings and reviews and understand that they are often highly subjective and need interpretation.

Or how about simply “measuring” as a tangible example of collecting data? We’ve all used a ruler or measuring tape and know that sometimes we need to be careful about taking measurements: “Measure twice, cut once”.

I’m sure there are lots of others. I’m also well aware that not all of these terms will be familiar to everyone. And not everyone will associate them with things in the same way as I do. The real proof will be testing words with different audiences to see how they respond.

I think I’m going to try to deliberately use a broad range of language in my talks and writing and see how it fares.

What terms do you find most useful when talking about data?

How can we describe different types of dataset? Ten dataset archetypes

As a community, when we are discussing recommendations and best practices for how data should be published and governed, there is a natural tendency for people to focus on the types of data they are most familiar with working with.

This leads to suggestions that every dataset should have an API, for example. Or that every dataset should be available in bulk. While good general guidance, those approaches aren’t practical in every case. That’s because we also need to take into account a variety of other issues, including:

  • the characteristics of the dataset
  • the capabilities of the publishing organisation and the funding they have available
  • the purpose behind publishing the data
  • and the ethical, legal and social contexts in which it will be used

I’m not going to cover all of that in this blog post.

But it occurred to me that it might be useful to describe a set of dataset archetypes, that would function a bit like user personas. They might help us better answer some of the basic questions people have around data, discuss recommendations around best practices, inform workshop exercises or just test our assumptions.

To test this idea I’ve briefly described ten archetypes. For each one I’ve tried to describe some of its features, identify some specific examples, and briefly outline some of the challenges that might apply in providing sustainable access to it.

As with any characterisation, detail is lost. This is not an exhaustive list. I haven’t attempted to list every possible variation based on size, format, timeliness, category, etc. But I’ve tried to capture a range that hopefully illustrates some different characteristics. The archetypes reflect my own experiences; you will have different thoughts and ideas. I’d love to read them.

The Study

The Study is a dataset that was collected to support a research project. The research group collected a variety of new data as part of conducting their study. The dataset is small, focused on a specific use case and there are no plans to maintain or update it further as the research group does not have any ongoing funding to collect or maintain the dataset. The data is provided as is for others to reuse, e.g. to confirm the original analysis of the data or to use it in other studies. To help others, and as part of writing some academic papers that reference the dataset, the research group has documented their methodology for collecting the data. The dataset is likely published in an academic data portal or alongside the academic papers that reference it.

Examples: water quality samples, field sightings of animals, laboratory experiment results, bibliographic data from a literature review, photos showing evidence of plant diseases, consumer research survey results

The Sensor Feed

The Sensor Feed is a stream of sensor readings that are produced by a collection of sensors that have been installed across a city. New readings are added to the stream at regular intervals. The feed is provided to allow a variety of applications to tap into the raw sensor readings. The data points are as directly reported by the individual sensors and are not quality controlled. The individual sensors may be updated, re-calibrated or replaced over time. The readings are part of the operational infrastructure of the city so can be expected to be available over at least the medium term. This means the dataset is effectively unbounded: new observations will continue to be reported until the infrastructure is decommissioned.

Examples: air quality readings, car park occupancy, footfall measurements, rain gauges, traffic light queuing counts, real-time bus locations

The Statistical Index

The Statistical Index is intended to provide insights into the performance of specific social or economic policies by measuring some aspect of a local community or economy. For example a sales or well-being index. The index draws on a variety of primary datasets, e.g. on commercial activities, which are then processed according to a documented methodology to generate the index. The Index is stewarded by an organisation and is expected to be available over the long term. The dataset is relatively small and is reported against specific geographic areas (e.g. from The Register) to support comparisons. The Index is updated on a regular basis, e.g. monthly or annually. Use of the data typically involves comparing across time and location at different levels of aggregation.

Examples: street safety survey, consumer price indices, happiness index, various national statistical indexes

The Register

The Register is a set of reference data that is useful for adding context to other datasets. It consists of a list of specific things, e.g. locations, cars or services, with a unique identifier and some basic descriptive metadata for each of the entries on the list. The Register is relatively small, but may grow over time. It is stewarded by an organisation tasked with making the data available for others. The steward, or custodian, provides some guarantees around the quality of the data. It is commonly used as a means to link, validate and enrich other datasets and is rarely used in isolation other than in reporting on changes to the size and composition of the register.

Examples: licensed pubs, registered doctors, lists of MOT stations, registered companies, a taxonomy of business types, a statistical geography, addresses
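As a rough illustration of how a Register is used to link, validate and enrich another dataset, here is a small sketch. The register entries, the identifiers, the field names and the `enrich` helper are all invented for this example.

```python
# Hypothetical register of licensed pubs, keyed by a unique identifier.
register = {
    "P001": {"name": "The Red Lion", "area": "North"},
    "P002": {"name": "The Crown", "area": "South"},
}

# Hypothetical dataset that refers to register entries by identifier.
inspections = [
    {"premises_id": "P001", "score": 4},
    {"premises_id": "P002", "score": 5},
    {"premises_id": "P999", "score": 3},  # identifier not in the register
]


def enrich(records, register, key="premises_id"):
    """Split records into those that match the register and those that don't.

    Matched records are returned with the register's metadata merged in.
    """
    matched, unmatched = [], []
    for record in records:
        entry = register.get(record[key])
        if entry is None:
            # Validation: the identifier is unknown to the register.
            unmatched.append(record)
        else:
            # Enrichment: add the register's descriptive metadata.
            matched.append({**record, **entry})
    return matched, unmatched


matched, unmatched = enrich(inspections, register)
print(len(matched), "matched;", len(unmatched), "unmatched")
```

The unmatched list is often as useful as the matched one, since it highlights records that need checking against the register.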

The Database

The Database is a copy or extract of the data that underpins a specific application or service. The database contains information about a variety of different types of things, e.g. musicians, their albums and songs. It is a relatively large dataset that can be used to perform a variety of different types of query and to support a variety of uses. As it is used in a live service it is regularly updated, undergoes a variety of quality checks, and is growing over time in both volume and scope. Some aspects of The Database may reference one or more Registers or could be considered as Registers in themselves.

Examples: geographic datasets that include a variety of different types of features (e.g. OpenStreetMap, MasterMap), databases of music (e.g. MusicBrainz) and books (e.g. OpenLibrary), company product and customer databases, Wikidata

The Description

The Description is a collection of a few data points relating to a single entity. Embedded into a single web page, it provides some basic information about an event, or place, or company. Individually it may be useful in context, e.g. to support a social interaction or application share. The owner of the website provides some data about the things that are discussed or featured on the website, but does not have access to a full dataset. The individual item descriptions are provided by website contributors using a CMS to add content to the website. If available in aggregate, the individual descriptions might make a useful Database or Register.

Examples: descriptions of jobs, events, stores, video content, articles

The Personal Records

The Personal Records are a history of the interactions of a single person with a product or service. The data provides insight into the individual person’s activities. The data is a slice of a larger Dataset that contains data for a larger number of people. As the information contains personal information it has to be secure and the individual has various rights over the collection and use of the data as granted by GDPR (or similar local regulation). The dataset is relatively small, is focused on a specific set of interactions, but is growing over time. Analysing the data might provide useful insight to the individual that may help them change their behaviour, improve their health, etc.

Examples: bank transactions, home energy usage, fitness or sleep tracker, order history with an online service, location tracker, health records

The Social Graph

The Social Graph is a dataset that describes the relationships between a group of individuals. It is typically built up by a small number of contributions made by individuals that provide information about their relationships and connections to others. They may also provide information about those other people, e.g. names, contact numbers, service ratings, etc. When published or exported it is typically focused on a single individual, but might be available in aggregate. It is different to Personal Records as it’s specifically about multiple people, rather than a history of information about an individual (although Personal Records may reference or include data about others). The graph as a whole is maintained by an organisation that is operating a social network (or service that has social features).

Examples: social network data, collaboration graphs, reviews and trip histories from ride sharing services, etc

The Observatory

The Observatory is a very large dataset produced by a coordinated large-scale data collection exercise, for example by a range of earth observation satellites. The data collection is intentionally designed to support a variety of down-stream uses, which informs the scale and type of data collected. The scale and type of data can make it difficult to use because of the need for specific tools or expertise. But there are a wide range of ways in which the raw data can be processed to create other types of data products, to drive a variety of analyses, or used to power a variety of services. It is refreshed and re-released as required by the needs and financial constraints of the organisations collaborating on collecting and using the dataset.

Examples: earth observation data, LIDAR point clouds, data from astronomical surveys or Large Hadron Collider experiments

The Forecast

The Forecast is used to predict the outcome of specific real-world events, e.g. a weather or climate forecast. It draws on a variety of primary datasets which are then processed and analysed to produce the output dataset. The process by which the predictions are made is well documented to provide insight into the quality of the output. As the predictions are time-based the dataset has a relatively short “shelf-life” which means that users need to quickly access the most recent data for a specific location or area of interest. Depending on the scale and granularity, Forecast datasets can be very large, making them difficult to distribute in a timely manner.

Example: weather forecasts

Let me know what you think of these. Do they provide any useful perspective? How would you use or improve them?