Why is change discovery important for open data?

Change discovery is the process of identifying changes to a resource. For example, noticing that a document has been updated or, in the case of a dataset, that some part of the data has been amended, e.g. to add data, fill in missing values, or correct existing data. If we can identify that changes have been made to a dataset, then we can update our locally cached copies, re-run analyses or generate new, enriched versions of the original.

Any developer who is building more than a disposable prototype will be looking for information about the ongoing stability and change frequency of a dataset. Typical questions might be:

  • How often will a dataset get routinely updated and republished?
  • What types of data updates are anticipated? E.g. are only new records added, or might data be amended and removed?
  • How will the dataset, or parts of it, be version controlled?
  • How will changes to the dataset, or to parts of it (e.g. individual rows or objects), be flagged?
  • How will planned and unplanned updates and changes be communicated to users of the dataset?
  • How will data updates be published, e.g. will there be a means of monitoring for or accepting incremental updates, or just refreshed data downloads?
  • Are large scale changes to the data model expected, and if so over what timescale?
  • Are changes to the technical infrastructure planned, and if so over what timescale?
  • How will planned (and unplanned) service downtime, e.g. for upgrades, be notified and reported?

These questions span a range of levels: from changes to individual elements of a dataset, through to the system by which it is delivered. These changes will happen at different frequencies and will be communicated in different ways.

Some types of change discovery can be done after the fact, e.g. by comparing two versions of a dataset. But in practice this is an inefficient way to synchronize and share data, as the consumer needs to reconstruct a series of edits and changes that have already been applied by the publisher of the data. To efficiently publish and distribute data we need to be able to understand when changes have happened.
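
To make the cost of after-the-fact comparison concrete, here is a minimal sketch in Python. It assumes each record carries a stable key (a hypothetical "id" field) and that both snapshots fit in memory; the records themselves are made up for illustration.

```python
def diff_snapshots(old_rows, new_rows, key="id"):
    """Compare two snapshots of a dataset and report which records changed."""
    old = {row[key]: row for row in old_rows}
    new = {row[key]: row for row in new_rows}
    return {
        # Keys only present in the new snapshot
        "added": sorted(new.keys() - old.keys()),
        # Keys only present in the old snapshot
        "removed": sorted(old.keys() - new.keys()),
        # Keys present in both, but whose content differs
        "amended": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


# Made-up snapshots of the same dataset at two points in time
version_1 = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]
version_2 = [{"id": 1, "value": "A"}, {"id": 3, "value": "c"}]

changes = diff_snapshots(version_1, version_2)
print(changes)
```

Note that the consumer has to hold both full copies to work this out, and still learns nothing about when or why each change was made. That is the inefficiency a publisher-side change feed avoids.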

Some types of change, e.g. to data models and formats, will just break downstream systems if not properly advertised in advance. So it’s even more important to consider the impacts of these types of change.

A robust data infrastructure will include an appropriate change notification system for different levels of the system. Some of these will be automated. Some will be part of the process of supporting end users. For example:

  • changes to a row in a dataset might be flagged with a timestamp and a change notice
  • API responses might indicate the version of the object being retrieved
  • dataset metadata might include an indication of the planned frequency of publication and a timestamp for when the dataset was last modified
  • a data portal might include a calendar indicating when key datasets will be updated or a feed of recently updated or changed datasets
  • changes to the data model and the API used to deliver a dataset might be announced and discussed via a developer support forum

These might be implemented as technical features of the platform. But they might also be as simple as an email to users, or a public tweet.
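
As an illustration of the first item in that list, here is a rough sketch of how a consumer might use row-level timestamps to pick out only what has changed since its last sync. The "modified" field name and the sample rows are assumptions made for the example, not a feature of any particular platform.

```python
from datetime import datetime, timezone


def changed_since(rows, last_sync):
    """Return the rows whose 'modified' timestamp is later than last_sync."""
    return [
        row for row in rows
        if datetime.fromisoformat(row["modified"]) > last_sync
    ]


# Made-up rows, each flagged with an ISO 8601 timestamp by the publisher
rows = [
    {"id": 1, "modified": "2020-01-10T09:00:00+00:00"},
    {"id": 2, "modified": "2020-01-20T09:00:00+00:00"},
    {"id": 3, "modified": "2020-02-01T09:00:00+00:00"},
]

last_sync = datetime(2020, 1, 15, tzinfo=timezone.utc)
to_refresh = changed_since(rows, last_sync)
print([row["id"] for row in to_refresh])
```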

Versioning of data can also help data publishers improve the scalability of their infrastructure and reduce the costs of data publishing. For example, adding features to data portals that might let data users:

  • make API calls that will only return responses if data has been updated since the user last requested it, e.g. using HTTP Conditional GET. This can reduce bandwidth and load on the publisher by encouraging local caching of data
  • use a checksum and/or timestamps to detect whether bulk downloads have changed to reduce bandwidth
  • subscribe to machine-readable feeds of dataset level changes, to avoid the need for users to repeatedly re-download large datasets
  • subscribe to machine-readable feeds of new datasets, to facilitate mirroring of data across systems
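
The first two items can be sketched from the consumer's side using only the Python standard library. This is an illustration under assumptions rather than a recipe: the URL and ETag value are hypothetical, and it assumes the publisher advertises a SHA-256 checksum alongside its bulk downloads. No request is actually sent.

```python
import hashlib
import urllib.request


def sha256_hex(data: bytes) -> str:
    """Checksum of a bulk download, as a hex string."""
    return hashlib.sha256(data).hexdigest()


def needs_refresh(local_copy: bytes, published_checksum: str) -> bool:
    """True if the publisher's advertised checksum differs from our cached copy."""
    return sha256_hex(local_copy) != published_checksum.lower()


def conditional_request(url, etag=None, last_modified=None):
    """Build a Conditional GET using validators saved from a previous response.

    A server that supports conditional requests can reply 304 Not Modified,
    with no body, when the resource has not changed.
    """
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return urllib.request.Request(url, headers=headers)


# Hypothetical cached copy, URL and ETag
cached = b"id,value\n1,10\n"
print(needs_refresh(cached, sha256_hex(cached)))

request = conditional_request("https://example.org/dataset.csv", etag='"v41"')
print(request.header_items())
```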

Supporting change notification and discovery, even if it’s just through documentation rather than more automated means, is an important part of engineering any good data platform.

I think it’s particularly important for open data (and other data that is liberally licensed) because these datasets are frequently copied, distributed and republished across different platforms. The ability to distribute a dataset, in different formats or with improvements and corrections, is one of the key freedoms that an open licence provides.

The downside to secondary publishing is that we end up with multiple copies of a dataset, some or all of which might be out of date, or have diverged from the original at different points in time.

Without robust approaches to provenance, change control and discovery, we run the risk of that data becoming out of date and leading to poor analyses and decision making. Having multiple copies of the same dataset, while increasing ease of use, also increases friction: users have to find the original authoritative data among all the copies, or try to figure out whether the copy available in their preferred platform is completely up to date with the original.

Documentation and linking to original sources can help mitigate those problems. But automating change notifications, to allow copies of datasets to be easily synchronised between platforms at the point they are updated, is also important. I’ve not seen a lot of recent work on documenting these as best practices. I think there are still some gaps in the standards landscape around data platforms. So I’d be interested to hear of examples.

In the meantime, if you’re building a data platform, think about how you can enable users to more efficiently and automatically consume updated data.

And if you’re republishing primary data in other platforms, make sure you’re including detailed information and documentation about how and when you last refreshed the dataset. Ideally your copies will be automatically updated as the source changes. Linking to the open source code you ran to make the secondary copy will allow others to repeat that process if they need an updated version faster than you plan to produce one.

How can publishing more data decrease the value of existing data?

Last month I wrote a post looking at how publishing new data might increase the value of existing data. I ended up listing seven different ways including things like improving validation, increasing coverage, supporting the ability to link together datasets, etc.

But that post only looked at half of the issue. What about the opposite? Are there ways in which publishing new data might reduce the value of data that’s already available?

The short answer is: yes, there are. But before jumping into that, let’s take a moment to reflect on the language we’re using.

A note on language

The original post was prompted by an economic framing of the value of data. I was exploring how the option value for a dataset might be affected by increasing access to other data. While this post is primarily looking at how option value might be reduced, we need to acknowledge that “value” isn’t the only way to frame this type of question.

We might also ask, “how might increasing access to data increase potential for harms?” As part of a wider debate around the issues of increasing access to data, we need to use more than just economic language. There’s a wealth of good writing about the impacts of data on privacy and society which I’m not going to attempt to precis here.

It’s also important to highlight that “increasing value” and “decreasing value” are relative terms.

Increasing the value of existing datasets will not seem like a positive outcome if your goal is to attempt to capture as much value as possible, rather than benefit a broader ecosystem. Similarly, decreasing value of existing data, e.g. through obfuscation, might be seen as a positive outcome if it results in better privacy or increased personal safety.

Decreasing value of existing data

Having acknowledged that, let’s try and answer the earlier question. In what ways can publishing new data reduce the value we can derive from existing data?

Increased harms leading to retraction and reduced trust

Publishing new data always runs the risk of re-identification and the enabling of unintended inferences. While the impacts of these harms are likely to be most directly felt by both communities and individuals, there are also broader commercial and national security issues. Together, these issues might ultimately reduce the value of the existing data ecosystem in several ways:

  • Existing datasets may need to be retracted, have their scope changed, or have their circulation reduced in order to avoid further harm. Data privacy impact assessments will need to be updated as the contexts in which data is being shared and published change
  • Increased concerns over potential privacy impacts might lead organisations to choose not to increase access to similar or related datasets
  • Increased concerns might also lead communities and individuals to reduce the amount of data they are willing to share with previously trusted sources

Overall this can lead to a reduction in the coverage, quality and linking of data across a data ecosystem. It’s likely to be one of the most significant impacts of poorly considered data releases. It can be mitigated through proper impact assessments, consultation and engagement.

Reducing overall quality

Newly published data might be intended to increase coverage, enrich, link, validate or otherwise improve existing data. But it might actually have the opposite effect because it’s of poor quality. I’ve briefly touched on this in a previous post on fictional data.

Publication of poor quality data might be unintended. For example an organisation may just be publishing the data it has to help address an issue, without properly considering or addressing underlying problems with it. Or a researcher may publish data that contains honest mistakes.

But publication of poor quality data might also be deliberate. For example as spam or misinformation intended to “poison the well”.

More subtly, practices like p-hacking and falsification of data, which might be intended to have a short-term direct benefit to the publisher or author, might cause longer-term issues by impacting the use of other datasets.

This is why understanding and documenting the provenance of data, monitoring of retractions, fixes and updates to data, and the ability to link analyses with datasets are all so important.

Creating unnecessary competition or increasing friction

Publishing new datasets containing new observations and data about an area or topic of interest can lead to positive impacts, e.g. by increasing confidence or coverage. But datasets are also competing with one another. The same types of data might be available from different sources, but under different licences, access arrangements, pricing, etc.

This competition isn’t necessarily positive. For example, the data ecosystem might not benefit as much from the network effects that follow from linking data because key datasets are not linked or cannot be used together. Incompatible and competing datasets can add friction across an ecosystem.

Building poor foundations

Data is often published as a means of building stronger data infrastructure for a sector, or to address a specific challenge. But if that data is poorly maintained or is not sustainably funded, then the energy that goes into building the communities, tools and other datasets around that infrastructure might be wasted.

That reduces the value of existing datasets which might otherwise have provided a better foundation to build upon. Or whose quality is dependent on the shared infrastructure. While this issue is similar to that of the previous one about competition, its root causes and impacts are slightly different.

As I noted in my earlier post, I don’t think this is an exhaustive list and it can be improved by contributions. Leave a comment if you have any thoughts.

Exploring registration agencies as data institutions

A key focus for our research and delivery work at the ODI at the moment is exploring how to design sustainable and trustworthy data institutions. Data institutions are organisations that steward data on behalf of a community. They have a variety of legal forms, roles and purposes.

Yesterday I wrote (again!) about identifiers and specifically, how different communities have been designing and using identifier systems within their business and data ecosystems. In that post I provided an outline of centralised and federated models for assigning identifiers. Both of those models rely on organisations that are known as registration agencies, registration authorities or registrars.

In this post, I’m going to briefly explore the role of registration agencies as a specific form of data institution.

What problem are registration agencies solving?

Organisations working within the same sector, whether they are publishing books, shipping cargo, manufacturing cars or streaming media, need to be able to consistently identify things. Which book has been sold? Where did this cargo container come from? When was this car manufactured? Which artist produced this song?

Whether a group of organisations are competing with one another, providing services or funding to each other, or collaborating as part of a supply chain, they need to be able to refer to the physical and digital objects, people, places and things that are core to their businesses.

Consistent, unique identifiers are one of the building blocks of data infrastructure. As I described in my previous blog post, there are different ways to create identifiers, but a common pattern is to use a registration agency as a central point of coordination.

Registration agencies fulfill the role of an independent, cross-industry organisation responsible for assigning and managing identifiers for those things of shared interest.

What data does a registration agency steward?

The core role of a registration agency is to govern the identifier scheme. That will involve deciding on details such as the syntax and rules for constructing identifiers, how they are assigned and by whom. It will also manage how the scheme evolves over time in order to support the changing needs of its community. Identifier schemes are standards for data and need to be maintained over the long term.

A registration agency might directly create and assign identifiers at the request of its community. Or it might delegate that activity to other organisations. Depending on the specifics of the identifier scheme, the agency may only manage a small amount of data.

For example, the IFPI is the Registration Agency for the ISRC identifier used in the music industry since 1986. As an organisation, to create an ISRC for music you are publishing, you first apply for a registration code (a prefix used in the identifiers) from a national agency. You can then locally assign identifiers to your recordings. There is no requirement to register the individual codes with either IFPI or the national agency. There isn’t a central database of the identifiers. So for a long time the IFPI will likely only have had a small database listing the prefixes that had been assigned to specific organisations.

Other registration agencies capture more information about the things that are being identified. Organisations requesting an identifier either provide that data at the point of assignment or later deposit it with the agency. This seems to me to be a more common setup: having a central database supports a variety of additional use cases. For example, it can help answer some of the questions I posed above, e.g. when was this car manufactured?

In 2016, IFPI worked with a vendor called SoundExchange to launch a search engine and database, although this is not a complete source of all the data. This presumably addressed needs not covered by the existing system.

So, the data stewarded by a registration agency may vary. It may range from basic administrative information about the identifier scheme to a much broader set of data deemed to be useful to the community. Registration agencies may be key data intermediaries in their sector and so fulfill a wider purpose. This is why there is often commercial interest in, and competing projects for, creating identifier schemes for specific industries: there is a lot of potential value to be captured.

How are they set up, and how do they approach sustainability?

In practice any community could work together to set up a common identifier scheme and an organisation to manage it. It just needs a shared understanding of the value of common identifiers and/or a common registry. For example, ZooBank and the LSID in the biosciences. Or the role of the IEEE in managing identifiers in the electronics industry.

Existing data intermediaries may branch out into launching identifier schemes to support aggregation and distribution of other data. For example, Refinitiv’s PermId.

Governments also often set up registers and organisations to steward them. For example, Companies House in the UK. Registers frequently address a different set of needs, but assigning identifiers is often part of the task of maintaining a register.

Governments can create registers and registration agencies whenever they see fit. As can commercial organisations and community initiatives, given sufficient agreement, funding and resources.

A fourth approach to starting a registration agency is via ISO. Some identifier schemes end up being published as international standards. According to ISO policy, if a new standard identifier is going to require a registration process, then ISO will appoint an organisation as the official registration authority for that standard. This creates a monopoly situation so there is a process of review of the proposed approach, the agency and their approach to sustainability.

ISO publish a list of registration agencies for ISO standards. It includes IFPI as the agency for the ISRC standard.

Registration agencies can charge fees for providing the registration services. But ISO requires those to be done on a cost recovery basis only. Approval for the charging of fees requires an additional level of review within ISO. But an agency might provide other supporting services.

Looking across some of the ISO appointed authorities, many appear to charge fees for registration both at the point of assignment of an identifier and on an annual basis. Many also seem to offer additional services and/or operate on a membership basis.

Different approaches to governance

From my reading so far, it seems that registration agencies supporting identifier schemes that are part of the public sector, commercial or community initiatives tend to be more centralised.

Looking across the ISO nominated registration agencies, these tend to use a federated assignment approach, similar to the IFPI, where much of the work is delegated to national agencies, with the primary agency mainly acting as the custodian of the overall scheme and a point of coordination. The primary registration agency might also be a fallback for circumstances where a national agency hasn’t been appointed.

This country based approach makes sense for international standards: national agencies can work more closely with their communities.

Another example of this approach is the International Standard Name Identifier (ISNI), which is governed by the ISNI International Agency, which appears to have been set up specifically for this purpose. Its work is delegated to a long list of specific assignment agencies, one of which is the British Library. As it happens, the British Library fulfills a similar role for a number of identifier schemes. This suggests that long-term sustainability for the identifier scheme and the primary registration agency is related to the sustainability of a broader set of organisations which might be acting as a national registration agency only as part of their operations.

One slightly different approach to governance is that of the DOI Foundation, which is the ISO appointed registration agency for DOI identifiers. DOIs can be assigned to a very broad category of different things and so, while the Foundation does delegate to other agencies, these aren’t along national lines. Instead there are different DOI registration agencies for different communities and purposes.

One example is CrossRef which works in the publishing industry, another is EIDR which operates in the entertainment industry. Both are covered by common rules published by the DOI Foundation which outline acceptable business models, roles and responsibilities.

While the individual agencies run their own technical platforms, the DOI Foundation also provides some common technical infrastructure to support its registration agencies and enable long-term persistence of the identifiers. This common infrastructure was moved to a separate not-for-profit in 2014, apparently as a means to increase trust.

How do different communities create unique identifiers?

Identifiers are part of data infrastructure. They play an important role, helping to publish, structure and link together data. Identifiers are boundary objects that cross communities. That means they need to be well-documented in order to be most useful.

Understanding how identifiers are created, assigned and governed can help us think through how to strengthen our data infrastructure. With that in mind, let’s take a quick tour of how different communities and systems have created identifier systems to help to uniquely refer to different digital and physical objects.

The simplest way to generate identifiers is by a serial number: a steadily increasing number that is assigned to whatever you need to identify next. This is the approach used in most internal databases as well as some commonly encountered public identifiers.

For example the Ordnance Survey TOID identifier is a serial number that looks like this: osgb1000006032892. UPRNs are similar.

Serial numbers work well when you have a single organisation and/or system generating the identifiers. They’re simple to implement, but can have their downsides, especially when they’re shared with others.

Some serial numbering systems include built-in error checking to deal with copying errors, using a check digit. Examples include the CAS registry number for identifying chemicals, and the basic form of the ISSN for identifying academic journals.
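
As a worked example of a check digit, here is a small sketch of the ISSN calculation: the first seven digits are weighted 8 down to 2, summed, and the check digit is whatever brings the total up to a multiple of 11 (with "X" standing in for 10). The function names are my own.

```python
def issn_check_digit(first_seven: str) -> str:
    """Compute the ISSN check character for the first seven digits."""
    weights = range(8, 1, -1)  # 8, 7, 6, 5, 4, 3, 2
    total = sum(int(digit) * weight for digit, weight in zip(first_seven, weights))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)


def is_valid_issn(issn: str) -> bool:
    """Check an ISSN written as NNNN-NNNC, where C is a digit or X."""
    compact = issn.replace("-", "").upper()
    if len(compact) != 8 or not compact[:7].isdigit():
        return False
    return issn_check_digit(compact[:7]) == compact[7]


print(is_valid_issn("0378-5955"))  # a correctly formed ISSN
print(is_valid_issn("0378-5956"))  # the same number with a mistyped last digit
```

A single mistyped digit changes the weighted sum, so the copy error is caught without needing to look anything up in a registry.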

[Figure: bar code form of an ISSN]

As we can see in the bar code form of the ISSN shown above, identifiers often have more structure to them. And they may not be assigned as a simple serial number.

The second way of providing unique identifiers is using a name or code. These are typically still assigned by a central authority, sometimes known as a registration agency, but they are constructed in different ways.

Identifiers for geographic locations typically rely on administrative regions or other areas to help structure identifiers. For example the statistics community in the EU created the NUTS codes to help identify country sub-divisions in statistical datasets. These are assigned based on hierarchy beginning with the country and then smaller geographic regions. Bath is UKK12 for example.

Postal codes are another geographically based set of codes. Both the UK and US postal codes use a geographical hierarchy. Only here the regions are those meaningful to how the Royal Mail and USPS manage their delivery operations, rather than being administratively defined by the government.

Hierarchies that are based on geography and/or organisational structures are common patterns in identifiers. Existing hierarchies provide a handy way to partition up sets of things for identification purposes.

The SWIFT code used in banking has a mixture of organisational and geographic hierarchies.

Encoding information about geography and hierarchy within codes can be useful. It can make them easier to validate. It also means you can manipulate them, e.g. by truncation, to find the identifiers for broader regions.
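
Truncation is easy to sketch. The example below assumes NUTS-style codes, where a two-letter country code is followed by one extra character for each level of the hierarchy; the function name is my own.

```python
def broader_codes(code: str):
    """List the codes of the broader regions that contain this one.

    Drops one trailing character at a time, stopping at the
    two-letter country code.
    """
    return [code[:length] for length in range(len(code) - 1, 1, -1)]


# Using the code for Bath given earlier
print(broader_codes("UKK12"))
```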

But encoding lots of information in identifiers also has its downsides. The main one being dealing with changes to administrative areas that mean the hierarchy has changed. Do you reassign all the identifiers?

Assigning identifiers from a single, central authority isn’t always ideal. It can add coordination overhead which can be particularly problematic if you need to assign lots of identifiers quickly. So some identifier systems look at reducing the burden on that central authority.

A solution to this is to delegate identifier assignment to other organisations. There are two ways this is done in practice.

The first is what we might call federated assignment. This is where the registration agency shares the work of assigning identifiers with other organisations. A typical approach is to delegate the work of registration and assignment to national organisations. Although other approaches are possible.

The delegation of work might be handled entirely “behind the scenes” as an operational approach. But sometimes it ends up being a feature of the identifier system.

For example the Legal Entity Identifier (LEI) uses federated assignment where “Local Operating Units” do the work of assigning identifiers. As you can see below, the identifiers for the LOUs become part of the identifiers they assign.

[Figure: structure of an LEI, showing the LOU identifier as part of the code]

The International Standard Recording Code uses a similar approach with national agencies assigning identifiers.

Another approach to reducing dependence on, and coordination with, a single registration agency is to use what I’ll call “local assignment”. In this approach individual organisations are empowered to assign identifiers as they need them.

A simplistic approach to local assignment is “block allocation”: handing out blocks of pregenerated identifiers to organisations which can locally assign them. Blocks of IP addresses are handed out to Internet Service Providers. Similarly, blocks of UPRNs are handed out to local authorities.

Here the registration agency still generates the identifiers, but the assignment of identifier to “thing” is done locally. And, in the second case at least, a record of this assignment will still be shared with the agency.
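
A toy sketch of block allocation might look like the following. The class, the block size and the organisation names are all hypothetical; real schemes will have their own rules about how blocks are sized and recorded.

```python
class BlockRegistry:
    """Toy registration agency that hands out non-overlapping blocks of serial numbers."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.next_start = 1
        self.blocks = {}  # organisation name -> list of allocated ranges

    def allocate(self, organisation):
        """Reserve the next unused block of identifiers for an organisation."""
        block = range(self.next_start, self.next_start + self.block_size)
        self.next_start += self.block_size
        self.blocks.setdefault(organisation, []).append(block)
        return block


registry = BlockRegistry(block_size=1000)
block_a = registry.allocate("Authority A")
block_b = registry.allocate("Authority B")

# Each organisation can now assign identifiers from its own block
# without contacting the registry for every new "thing"
print(block_a[0], block_a[-1])
print(block_b[0], block_b[-1])
```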

A more common approach is to use “prefix allocation”. In this approach the registration agency assigns individual organisations a prefix within the identifier system. The organisation then generates new unique identifiers by combining their prefix with a locally generated suffix.

A suffix might be generated by adding a local serial number to the prefix. Or by some other approach. Again, after an identifier has been generated and assigned, it is commonly still centrally registered.
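
Here is a minimal sketch of prefix allocation, using a local serial number as the suffix. The prefix values, the "/" separator and the zero-padded suffix are made up for the example and don't correspond to any particular identifier scheme.

```python
import itertools


class PrefixRegistry:
    """Toy registration agency: its only job is to hand out unique prefixes."""

    def __init__(self):
        self._next_prefix = itertools.count(1000)
        self.prefixes = {}  # organisation name -> prefix

    def register(self, organisation):
        if organisation not in self.prefixes:
            self.prefixes[organisation] = str(next(self._next_prefix))
        return self.prefixes[organisation]


class LocalAssigner:
    """Run by each organisation: combines its prefix with a local serial number."""

    def __init__(self, prefix):
        self.prefix = prefix
        self._serial = itertools.count(1)

    def new_identifier(self):
        return f"{self.prefix}/{next(self._serial):06d}"


registry = PrefixRegistry()
publisher_a = LocalAssigner(registry.register("Publisher A"))
publisher_b = LocalAssigner(registry.register("Publisher B"))

first_a = publisher_a.new_identifier()
second_a = publisher_a.new_identifier()
first_b = publisher_b.new_identifier()
print(first_a, second_a, first_b)
```

Because the prefixes are unique, the two organisations can reuse the same local serial numbers without ever generating the same identifier.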

Many identifiers use this approach. The EIDR identifiers used in the entertainment industry look like this:

[Figure: example EIDR identifier]

A GTIN looks like this:

[Figure: example GTIN]

And the BIC code for shipping containers looks like this:

[Figure: example BIC code]

One challenge with prefix allocation is ensuring that the rules for locally assigned suffixes work in every context where the identifier needs to appear. This typically means providing some rules about how suffixes are constructed.

The DOI system encountered problems because publishers were generating identifiers that didn’t work well when DOIs were expressed as URLs, due to the need for extra encoding. This made them tricky to work with.
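
To illustrate the encoding issue, here is a small sketch using Python's standard library. The DOI below is made up, with a suffix that deliberately contains characters that aren't safe in a URL path; the function name is my own.

```python
from urllib.parse import quote, unquote


def doi_to_url(doi, resolver="https://doi.org/"):
    """Express a DOI as a URL, percent-encoding characters that aren't URL-safe."""
    return resolver + quote(doi, safe="/")


# Hypothetical DOI whose locally assigned suffix includes "<", ">" and "#"
doi = "10.1234/abc<123>#x"
url = doi_to_url(doi)
print(url)

# Decoding the path recovers the original identifier
print(unquote(url[len("https://doi.org/"):]))
```

The encoded form is what has to appear in links, while the unencoded form is what appears in print, so anyone handling these identifiers has to convert reliably between the two.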

For a complicated example that mixes use of prefixes, country codes and check digits, we can look at the VIN, which is a unique identifier for vehicles. This 17 digit code includes multiple segments but there are four competing standards for what the segments mean. Sigh.

It’s possible to go further than just reducing dependency on registration agencies. They can be eliminated completely.

In distributed assignment of identifiers, anyone can create an identifier. Rather than requesting an identifier, or a prefix from a registration agency, these systems operate by agreeing rules for how unique identifiers can be constructed.

One approach to distributed assignment is to use an element of randomness to generate a unique identifier at the point in time it’s needed. The goal is to design an algorithm that uses a random number generator, and sometimes additional information like a timestamp or a MAC address, to construct an identifier where there is an extremely low chance that someone could have created the same identifier at the same moment in time (known as a “collision”).

This is how UUIDs work. You can play with generating some using online tools.

Identifiers like UUIDs are cheap to generate and require no coordination beyond an agreed algorithm. They work very well when you just need a reliable way to assign an identifier to something with reasonable confidence that if our data is later combined then we won’t encounter any issues.
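
If you'd rather generate some locally, Python's standard library includes a uuid module. A minimal sketch:

```python
import uuid

# Version 4 UUIDs are built from random numbers
random_id = uuid.uuid4()
another_random_id = uuid.uuid4()

# Version 1 UUIDs combine a timestamp with a host identifier
time_based_id = uuid.uuid1()

print(random_id)
print(another_random_id)
print(time_based_id)
```

No registry is consulted at any point: the only thing shared between the people generating identifiers is the algorithm.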

But what if we need to independently assign an identifier to the same thing? So that when we later combine our datasets, then our data will link up?

For this we need to use a hash-based identifier. A hash-based identifier takes some properties of the thing we want to identify and then uses them to construct an identifier. If we have a good enough algorithm then even if we do this independently we should end up constructing the same identifier.

This is sometimes referred to as creating a “digital fingerprint” of the object. It’s commonly used to identify copies of objects. For example, the approach is used to construct content identifiers in the IPFS system. And as part of YouTube’s Content ID system to manage copyright claims.
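
A very simplified sketch of the idea, assuming we identify things by a small dictionary of descriptive properties. The normalisation rules and the choice of SHA-256 are my own assumptions for illustration; they are not how IPFS or Content ID work, and the sample record is made up.

```python
import hashlib
import json


def hash_identifier(properties):
    """Derive an identifier from the properties of the thing being identified."""
    # Normalise keys and values so trivial differences in case or
    # whitespace don't produce different identifiers
    normalised = {
        key.strip().lower(): " ".join(str(value).split()).lower()
        for key, value in properties.items()
    }
    # Serialise in a canonical form: sorted keys, fixed separators
    canonical = json.dumps(normalised, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two people independently describing the same (made-up) thing
mine = {"title": "An Example Paper", "year": 2020}
yours = {"Year": "2020", "Title": "an  example paper "}

print(hash_identifier(mine))
print(hash_identifier(mine) == hash_identifier(yours))
```

The hard part in practice is the normalisation step: every party has to agree on exactly which properties to use and how to clean them up, or the identifiers won't match.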

But hash-based identifiers don’t have to be used for managing content, they can be used as pure identifiers. The most complex example I’m familiar with is the InChI, which is a means of generating a unique identifier for chemicals by using information about their structure.

By using a consistent algorithm provided as open source software, chemists can reliably create identifiers for the same structures.

The SICI code used to identify academic papers was a hash based system that used metadata about the publication to generate an identifier. However in practice it was difficult to work with due to the variety of ways in which content was actually published and the variety of contexts in which identifiers needed to be generated.

Hash-based identifiers are very tricky to get right as you need a robust algorithm that is widely adopted. Those needing to generate identifiers will also need to be able to reliably access all of the information required to create the identifier. Variations in availability of metadata, object formats, etc. can all impact how well they work in practice.

How can publishing more data increase the value of existing data?

There’s lots to love about the “Value of Data” report. Like the fantastic infographic on page 9. I’ll wait while you go and check it out.

Great, isn’t it?

My favourite part about the paper is that it’s taught me a few terms that economists use, but which I hadn’t heard before. Like “Incomplete contracts” which is the uncertainty about how people will behave because of ambiguity in norms, regulations, licensing or other rules. Finally, a name to put to my repeated gripes about licensing!

But it’s the term “option value” that I’ve been mulling over for the last few days. Option value is a measure of our willingness to pay for something even though we’re not currently using it. Data has a large option value, because it’s hard to predict how its value might change in future.

Organisations continue to keep data because of its potential future uses. I’ve written before about data as stored potential.

The report notes that the value of a dataset can change because we might be able to apply new technologies to it. Or think of new questions to ask of it. Or, and this is the interesting part, because we acquire new data that might impact its value.

So, how does increasing access to one dataset affect the value of other datasets?

Moving data along the data spectrum means that increasingly more people will have access to it. That means it can be used by more people, potentially in very different ways than you might expect. Applying Joy’s Law then we might expect some interesting, innovative or just unanticipated uses. (See also: everyone loves a laser.)

But more people using the same data is just extracting additional value from that single dataset. It’s not directly impacting the value of other datasets.

To do that we need to use the data in some specific ways. So far I’ve come up with seven ways that new data can change the value of existing data.

  1. Comparison. If we have two or more datasets then we can compare them. That will allow us to identify differences, look for similarities, or find correlations. New data can help us discover insights that aren’t otherwise apparent.
  2. Enrichment. New data can enrich an existing dataset by adding new information. It gives us context that we didn’t have access to before, unlocking further uses.
  3. Validation. New data can help us identify and correct errors in existing data.
  4. Linking. A new dataset might help us to merge some existing datasets, allowing us to analyse them in new ways. The new dataset acts like a missing piece in a jigsaw puzzle.
  5. Scaffolding. A new dataset can help us to organise other data. It might also help us collect new data.
  6. Improve Coverage. Adding more data, of the same type, into an existing pool can help us create a larger, aggregated dataset. We end up with a more complete dataset, which opens up more uses. The combined dataset might have better spatial or temporal coverage, be less biased or capture more of the world we want to analyse.
  7. Increase Confidence. If the new data measures something we’ve already recorded, then the repeated measurements can help us to be more confident about the quality of our existing data and analyses. For example, we might pool sensor readings about the weather from multiple weather stations in the same area. Or perform a meta-analysis of a scientific study.
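
A few of these are easy to illustrate with a small sketch. The Python below uses an invented register of schools and two invented datasets to show validation, linking and enrichment. All of the names, identifiers and values are hypothetical, and real datasets would need far more careful matching than a simple key lookup.

```python
# Hypothetical data: a register of schools, plus two datasets that use its ids.
register = {"S1": "Northside Primary", "S2": "Hill Road Academy"}
pupils = [
    {"school_id": "S1", "pupils": 210},
    {"school_id": "S3", "pupils": 95},  # "S3" does not appear in the register
]
inspections = [
    {"school_id": "S1", "rating": "Good"},
    {"school_id": "S2", "rating": "Outstanding"},
]


def validate(rows, register):
    """Validation: return the rows whose identifier is missing from the register."""
    return [row for row in rows if row["school_id"] not in register]


def link(left, right, key):
    """Linking: merge rows from two datasets that share the same key value."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


def enrich(rows, register):
    """Enrichment: add the name from the register to each row it recognises."""
    return [
        {**row, "name": register[row["school_id"]]}
        for row in rows
        if row["school_id"] in register
    ]


if __name__ == "__main__":
    print(validate(pupils, register))
    print(link(pupils, inspections, "school_id"))
    print(enrich(inspections, register))
```

In each case the existing datasets haven’t changed. It’s the arrival of the register, or of a second dataset using the same identifiers, that makes the new uses possible.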

I don’t think this is exhaustive, but it was a useful thought experiment.

A while ago, I outlined ten dataset archetypes. It’s interesting to see how these align with the above uses:

  • A meta-analysis to increase confidence will draw on multiple studies
  • Combining sensor feeds can also help us increase confidence in our observations of the world
  • A register can help us with linking or scaffolding datasets. They can also be used to support validation.
  • Pooling together multiple descriptions or personal records can help us create a database that has improved coverage for a specific application
  • A social graph is often used as scaffolding for other datasets

What would you add to my list of ways in which new data improves the value of existing data? What did I miss?

Three types of agreement that shape your use of data

Whenever you’re accessing, using or sharing data you will be bound by a variety of laws and agreements. I’ve written previously about how data governance is a nested set of rules, processes, legislation and norms.

In this post I wanted to clarify the differences between three types of agreements that will govern your use of data. There are others. But from a data consumer point of view these are most common.

If you’re involved in any kind of data project, then you should have read all of the relevant agreements that relate to the data you’re planning to use. So you should know what to look for.

Data Sharing Agreements

Data sharing agreements are usually contracts that will have been signed between the organisations sharing data. They describe how, when, where and for how long data will be shared.

They will include things like the purpose and legal basis for sharing data. They will describe the important security, privacy and other considerations that govern how data will be shared, managed and used. Data sharing agreements might be time-limited. Or they might describe an ongoing arrangement.

When the public and private sector are sharing data, then publishing a register of agreements is one way to increase transparency around how data is being shared.

The ICO Data Sharing Code of Practice has more detail on the kinds of information a data sharing agreement should contain. As does the UK’s Digital Economy Act 2017 code of practice for data sharing. In a recent project the ODI and CABI created a checklist for data sharing agreements.

Data sharing agreements are most useful when organisations, of any kind, are sharing sensitive data. A contract with detailed, binding rules helps everyone be clear on their obligations.

Licences

Licences are a different approach to defining the rules that apply to use of data. A licence describes the ways that data can be used without any of the organisations involved having to enter into a formal agreement.

A licence will describe how you can use some data. It may also place some restrictions on your use (e.g. “non-commercial”) and may spell out some obligations (“please say where you got the data”). So long as you use the data in the described ways, then you don’t need any kind of explicit permission from the publisher. You don’t even have to tell them you’re using it. Although it’s usually a good idea to do that.

Licences remove the need to negotiate and sign agreements. Permission is granted in advance, with a few caveats.

Standard licences make it easier to use data from multiple sources, because everyone is expecting you to follow the same rules. But only if the licences are widely adopted. Where licences don’t align, we end up with unnecessary friction.

Licences aren’t time-limited. They’re perpetual. At least as long as you follow your obligations.

Licences are best used for open and public data. Sometimes people use data sharing agreements when a licence might be a better option. That’s often because organisations know how to do contracts, but are less confident in giving permissions. Especially if they’re concerned about risks.

Sometimes, even if there’s an open licence to use data, a business would still prefer to have an agreement in place. That might be because the licence doesn’t give them the freedoms they want, or they’d like some additional assurances in place around their use of data.

Terms and Conditions

Terms and conditions, or “terms of use”, are a set of rules that describe how you can use a service. Terms and conditions are the things we all ignore when signing up to a website. But if you’re using a data portal, platform or API then you definitely need to have checked the small print. (You have, haven’t you?)

Like a Data Sharing Agreement, a set of terms and conditions is something that you formally agree to. It might be by checking a box rather than signing a document, but it’s still an agreement.

Terms of use will describe the service being offered and the ways in which you can use it. Like licences and data sharing agreements, they will also include some restrictions. For example whether you can build a commercial service with it. Or what you can do with the results.

A good set of terms and conditions will clearly and separately identify those rules that relate to your use of the service (e.g. how often you can use it) from those rules that relate to the data provided to you. Ideally the terms would just refer to a separate licence. The Met Office Data Point terms do this.

A poorly defined set of terms will focus on the service parts but not include enough detail about your rights to use and reuse data. That can happen if the emphasis has been on the terms of use of the service as a product, rather than around the sharing of data.

The terms and conditions for a data service and the rules that relate to the data are two of the important decisions that shape the data ecosystem that service will enable. It’s important to get them right.

Hopefully that’s a helpful primer. Remember, if you’re in any kind of role using data then you need to read the small print. If not, then you’re potentially exposing yourself and others to risks.

Can the regulation of hazardous substances help us think about regulation of AI?

This post is a thought experiment. It considers how existing laws that cover the registration and testing of hazardous substances like pesticides might be used as an analogy for thinking through approaches to regulation of AI/ML.

As a thought experiment it’s not a detailed or well-researched proposal, but there are elements which I think are interesting. I’m interested in feedback and also pointers to more detailed explorations of similar ideas.

A cursory look at substance registration legislation in the EU and US

Under EU REACH legislation, if you want to manufacture or import large amounts of potentially hazardous chemical substances then you need to register with the ECHA. The registration process involves providing information about the substance and its potential risks.

“No data no market” is a key principle of the legislation. The private sector carries the burden of collecting data and demonstrating safety of substances. There is a standard set of information that must be provided.

In order to demonstrate safety, companies may need to carry out animal testing. The legislation has been designed to minimise unnecessary animal testing. While there is an argument that all testing is unnecessary, current practice requires testing in some circumstances. Where testing is not required, other data sources can be used. But controlled animal tests are the proof of last resort if no other data is available.

To further minimise the need to carry out tests on animals, the legislation is designed to encourage companies registering the same (or similar) substances to share data with one another in a “fair, transparent and non-discriminatory way”. There is detailed guidance around data sharing, including a legal framework and guidance on cost sharing.

The coordination around sharing data and costs is achieved via a SIEF (PDF), a loose consortium of businesses looking to register the same substance. There is guidance to help facilitate creation of these sharing forums.

The US has a similar set of laws which also aim to encourage sharing of data across companies to minimise animal testing and other regulatory burdens. The practice of “data compensation” provides businesses with a right to charge fees for use of data. The legislation doesn’t define acceptable fees, but does specify an arbitration procedure.

The compensation, along with some exclusive use arrangements, is intended to avoid discouraging original research, testing and registration of new substances. Companies that bear the costs of developing new substances can have exclusive use for a period and expect some compensation for the research costs of bringing them to market. Later manufacturers can benefit from the safety testing results, but have to pay for the privilege of access.

Summarising some design principles

Based on my reading, I think both sets of legislation are ultimately designed to:

  • increase safety of the general public, by ensuring that substances are properly tested and documented
  • require companies to assess the risks of substances
  • take an ethical stance on reducing unnecessary animal testing and other data collection by facilitating data sharing
  • require companies to register their intention to manufacture or import substances
  • enable companies to coordinate in order to share costs and other burdens of registration
  • provide an arbitration route if data is not being shared
  • avoid discouraging new research and development by providing a cost sharing model to offset regulatory requirements

Parallels to AI regulation

What if we adopted a similar approach towards the regulation of AI/ML?

When we think about some of the issues with large scale, public deployment of AI/ML, I think the debate often highlights a variety of needs, including:

  • greater oversight about how systems are being designed and tested, to help understand risks and design problems
  • understanding how and where systems are being deployed, to help assess impacts
  • minimising harms to either the general public, or specific communities
  • thorough testing of new approaches to assess immediate and potential long-term impacts
  • reducing unnecessary data collection that is otherwise required to train and test models
  • exploration of potential impacts of new technologies to address social, economic and environmental problems
  • continued encouragement of primary research and innovation

That list is not exhaustive. I suspect not everyone will necessarily agree on the importance of all elements.

However, if we look at these concerns and the principles that underpin the legislation of hazardous substances, I think there are a lot of parallels.

Applying the approach to AI

What if, for certain well-defined applications of AI/ML such as facial recognition, autonomous vehicles, etc, we required companies to:

  • register their systems, accompanied by a standard set of technical, testing and other documentation
  • carry out tests of their system using agreed protocols, to encourage consistency in comparison across testing
  • share data, e.g. via a data trust or similar model, in order to minimise the unnecessary collection of data and to facilitate some assessment of bias in training data
  • demonstrate and document the safety of their systems to agreed standards, allowing public and private sector users of systems and models to make informed decisions about risks, or to support enforcement of legal standards
  • coordinate to share costs of collecting and maintaining data, conducting tests of standard models, etc
  • and, perhaps, after a period, accept that trained models would become available for others to reuse, similarly to how medicines or other substances may ultimately be manufactured by other companies

In addition to providing more controls and assurance around how AI/ML is being deployed, an approach based on facilitating collaboration around collection of data might help nudge new and emerging sectors into a more open direction, right from the start.

There are a number of potential risks and issues which I will acknowledge up front:

  • sharing of data about hazardous substance testing doesn’t have to address data protection. But this could be factored into the design, and some uses of AI/ML draw on non-personal data
  • we may want to simply ban, or discourage use of, some applications of AI/ML, rather than enable them. But at the moment there are few, if any, controls
  • the approach might encourage collection and sharing of data which we might otherwise want to restrict. But strong governance and access controls, via a data trust or other institution, might actually raise the bar around governance and security, beyond that which individual businesses can, or are willing to, achieve. Coordination with a regulator might also help decide on how much is “enough” data
  • the utility of data and openly available models might degrade over time, requiring ongoing investment
  • the approach seems most applicable to uses of AI/ML with similar data requirements. In practice there may be only a small number of these, or data requirements may vary enough to limit benefits of data sharing

Again, not an exhaustive list. But as I’ve noted, I think there are ways to mitigate some of these risks.

Let me know what you think, what I’ve missed, or what I should be reading. I’m not in a position to move this forward, but welcome a discussion. Leave your thoughts in the comments below, or ping me on Twitter.