How can publishing more data increase the value of existing data?

There’s lots to love about the “Value of Data” report. Like the fantastic infographic on page 9. I’ll wait while you go and check it out.

Great, isn’t it?

My favourite part about the paper is that it’s taught me a few terms that economists use, but which I hadn’t heard before. Like “Incomplete contracts” which is the uncertainty about how people will behave because of ambiguity in norms, regulations, licensing or other rules. Finally, a name to put to my repeated gripes about licensing!

But it’s the term “option value” that I’ve been mulling over for the last few days. Option value is a measure of our willingness to pay for something even though we’re not currently using it. Data has a large option value, because its hard to predict how its value might change in future.

Organisations continue to keep data because of its potential future uses. I’ve written before about data as stored potential.

The report notes that the value of a dataset can change because we might be able to apply new technologies to it. Or think of new questions to ask of it. Or, and this is the interesting part, because we acquire new data that might impact its value.

So, how does increasing access to one dataset affect the value of other datasets?

Moving data along the data spectrum means that increasingly more people will have access to it. That means it can be used by more people, potentially in very different ways than you might expect. Applying Joy’s Law then we might expect some interesting, innovative or just unanticipated uses. (See also: everyone loves a laser.)

But more people using the same data is just extracting additional value from that single dataset. It’s not directly impacting the value of other dataset.

To do that we need to use that in some specific ways. So far I’ve come up with seven ways that new data can change the value of existing data.

  1. Comparison. If we have two or more datasets then we can compare them. That will allow us to identify differences, look for similarities, or find correlations. New data can help us discover insights that aren’t otherwise apparent.
  2. Enrichment. New data can enrich an existing data by adding new information. It gives us context that we didn’t have access to before, unlocking further uses
  3. Validation. New data can help us identify and correct errors in existing data.
  4. Linking. A new dataset might help us to merge some existing dataset, allowing us to analyse them in new ways. The new dataset acts like a missing piece in a jigsaw puzzle.
  5. Scaffolding. A new dataset can help us to organise other data. It might also help us collect new data.
  6. Improve Coverage. Adding more data, of the same type, into an existing pool can help us create a larger, aggregated dataset. We end up with a more complete dataset, which opens up more uses. The combined dataset might have a a better spatial or temporal coverage, be less biased or capture more of the world we want to analyse
  7. Increase Confidence. If the new data measures something we’ve already recorded, then the repeated measurements can help us to be more confident about the quality of our existing data and analyses. For example, we might pool sensor readings about the weather from multiple weather stations in the same area. Or perform a meta-analysis of a scientific study.

I don’t think this is exhaustive, but it was a useful thought experiment.

A while ago, I outlined ten dataset archetypes. It’s interesting to see how these align with the above uses:

  • A meta-analysis to increase confidence will draw on multiple studies
  • Combining sensor feeds can also help us increase confidence in our observations of the world
  • A register can help us with linking or scaffolding datasets. They can also be used to support validation.
  • Pooling together multiple descriptions or personal records can help us create a database that has improved coverage for a specific application
  • A social graph is often used as scaffolding for other datasets

What would you add to my list of ways in which new data improves the value of existing data? What did I miss?

Three types of agreement that shape your use of data

Whenever you’re accessing, using or sharing data you will be bound by a variety of laws and agreements. I’ve written previously about how data governance is a nested set of rules, processes, legislation and norms.

In this post I wanted to clarify the differences between three types of agreements that will govern your use of data. There are others. But from a data consumer point of view these are most common.

If you’re involved in any kind of data project, then you should have read all of relevant agreements that relate to data you’re planning to use. So you should know what to look for.

Data Sharing Agreements

Data sharing agreements are usually contracts that will have been signed between the organisations sharing data. They describe how, when, where and for how long data will be shared.

They will include things like the purpose and legal basis for sharing data. They will describe the important security, privacy and other considerations that govern how data will be shared, managed and used. Data sharing agreements might be time-limited. Or they might describe an ongoing arrangement.

When the public and private sector are sharing data, then publishing a register of agreements is one way to increase transparency around how data is being shared.

The ICO Data Sharing Code of Practice has more detail on the kinds of information a data sharing agreement should contain. As does the UK’s Digital Economy Act 2017 code of practice for data sharing. In a recent project the ODI and CABI created a checklist for data sharing agreements.

Data sharing agreements are most useful when organisations, of any kind, are sharing sensitive data. A contract with detailed, binding rules helps everyone be clear on their obligations.

Licences

Licences are a different approach to defining the rules that apply to use of data. A licence describes the ways that data can be used without any of the organisations involved having to enter into a formal agreement.

A licence will describe how you can use some data. It may also place some restrictions on your use (e.g. “non-commercial”) and may spell out some obligations (“please say where you got the data”). So long as you use the data in the described ways, then you don’t need any kind of explicit permission from the publisher. You don’t even have to tell them you’re using it. Although it’s usually a good idea to do that.

Licences remove the need to negotiate and sign agreements. Permission is granted in advance, with a few caveats.

Standard licences make it easier to use data from multiple sources, because everyone is expecting you to follow the same rules. But only if the licences are widely adopted. Where licences don’t align, we end up with unnecessary friction.

Licences aren’t time-limited. They’re perpetual. At least as long as you follow your obligations.

Licences are best used for open and public data. Sometimes people use data sharing agreements when a licence might be a better option. That’s often because organisations know how to do contracts, but are less confident in giving permissions. Especially if they’re concerned about risks.

Sometimes, even if there’s an open licence to use data, a business would still prefer to have an agreement in place. That’s might be because the licence doesn’t give them the freedoms they want, or they’d like some additional assurances in place around their use of data.

Terms and Conditions

Terms and conditions, or “terms of use” are a set of rules that describe how you can use a service. Terms and conditions are the things we all ignore when signing up to website. But if you’re using a data portal, platform or API then you need to have definitely checked the small print. (You have, haven’t you?)

Like a Data Sharing Agreement, a set of terms and conditions is something that you formally agree to. It might be by checking a box rather than signing a document, but its still an agreement.

Terms of use will describe the service being offered and the ways in which you can use it. Like licences and data sharing agreements, they will also include some restrictions. For example whether you can build a commercial service with it. Or what you can do with the results.

A good set of terms and conditions will clearly and separately identify those rules that relate to your use of the service (e.g. how often you can use it) from those rules that relate to the data provided to you. Ideally the terms would just refer to a separate licence. The Met Office Data Point terms do this.

A poorly defined set of terms will focus on the service parts but not include enough detail about your rights to use and reuse data. That can happen if the emphasis has been on the terms of use of the service as a product, rather than around the sharing of data.

The terms and conditions for a data service and the rules that relate to the data are two of the important decisions that shape the data ecosystem that service will enable. It’s important to get them right.

Hopefully that’s a helpful primer. Remember, if you’re in any kind of role using data then you need to read the small print. If not, then you’re potentially exposing yourself and others to risks.

Can the regulation of hazardous substances help us think about regulation of AI?

This post is a thought experiment. It considers how existing laws that cover the registration and testing of hazardous substances like pesticides might be used as an analogy for thinking through approaches to regulation of AI/ML.

As a thought experiment its not a detailed or well-research proposal, but there are elements which I think are interesting. I’m interested in feedback and also pointers to more detailed explorations of similar ideas.

A cursory look of substance registration legislation in the EU and US

Under EU REACH legislation, if you want to manufacture or import large amount of potentially hazardous chemical substances then you need to register with the ECHA. The registration process involves providing information about the substance and its potential risks.

“No data no market” is a key principle of the legislation. The private sector carries the burden of collecting data and demonstrating safety of substances. There is a standard set of information that must be provided.

In order to demonstrate the safety, companies may need to carry out animal testing. The legislation has been designed to minimise unnecessary animal  testing. While there is an argument that all testing is unnecessary, current practices requires testing in some circumstances. Where testing is not required, then other data sources can be used. But controlled animal tests are the proof of last resort if no other data is available.

To further minimise the need to carry out tests on animals, the legislation is designed to encourage companies registering the same (or similar) substances to share data with one another in a “fair, transparent and non-discriminatory way”. Companies There is detailed guidance around data sharing, including a legal framework and guidance on cost sharing.

The coordination around sharing data and costs is achieved via a SIEF (PDF), a loose consortia of businesses looking to register the same substance. There is guidance to help facilitate creation of these sharing forums.

The US has a similar set of laws which also aim to encourage sharing of data across companies to minimise animal testing and other regulatory burdens. The practice of “data compensation” provides businesses with a right to charge fees for use of data. The legislation doesn’t define acceptable fees, but does specify an arbitration procedure.

The compensation, along with some exclusive use arrangements, are intended to avoid discouraging original research, testing and registration of new substances. Companies that bear the costs of developing new substances can have exclusive use for a period and expect some compensation for research costs to bring to market. Later manufacturers can benefit from the safety testing results, but have to pay for the privilege of access.

Summarising some design principles

Based on my reading, I think both sets of legislation are ultimately designed to:

  • increase safety of the general public, by ensuring that substances are properly tested and documented
  • require companies to assess the risks of substances
  • take an ethical stance on reducing unnecessary animal testing and other data collection by facilitating
    data collection
  • require companies to register their intention to manufacture or import substances
  • enable companies to coordinate in order to share costs and other burdens of registration
  • provide an arbitration route if data is not being shared
  • avoid discouraging new research and development by providing a cost sharing model to offset regulatory requirements

Parallels to AI regulation

What if we adopted a similar approach towards the regulation of AI/ML?

When we think about some of the issues with large scale, public deployment of AI/ML, I think the debate often highlights a variety of needs, including:

  • greater oversight about how systems are being designed and tested, to help understand risks and design problems
  • understanding how and where systems are being deployed, to help assess impacts
  • minimising harms to either the general public, or specific communities
  • thorough testing of new approaches to assess immediate and potential long-term impacts
  • reducing unnecessary data collection that is otherwise required to train and test models
  • exploration of potential impacts of new technologies to address social, economic and environmental problems
  • to continue to encourage primary research and innovation

That list is not exhaustive. I suspect not everyone will necessarily agree on the importance of all elements.

However, if we look at these concerns and the principles that underpin the legislation of hazardous substances, I think there are a lot of parallels.

Applying the approach to AI

What if, for certain well-defined applications of AI/ML such as facial recognition, autonomous vehicles, etc, we required companies to:

  • register their systems, accompanies by a standard set of technical, testing and other documentation
  • carry out tests of their system using agreed protocols, to encourage consistency in comparison across testing
  • share data, e.g via a data trust or similar model, in order to minimise the unnecessary collection of data and to facilitate some assessment of bias in training data
  • demonstrate and document the safety of their systems to agreed standards, allowing public and private sector users of systems and models to make informed decisions about risks, or to support enforcement of legal standards
  • coordinate to share costs of collecting and maintaining data, conducting tests of standard models, etc
  • and, perhaps, after a period, accept that trained models would become available for others to reuse, similarly to how medicines or other substances may ultimately be manufactured by other companies

In addition to providing more controls and assurance around how AI/ML is being deployed, an approach based on facilitating collaboration around collection of data might help nudge new and emerging sectors into a more open direction, right from the start.

There are a number of potential risks and issues which I will acknowledge up front:

  • sharing of data about hazardous substance testing doesn’t have to address data protection. But this could be factored in to the design, and some uses of AI/ML draw on non-personal data
  • we may want to simply ban, or discourage use of some applications of AI/ML, rather than enable it. But at the moment there are few, if any controls
  • the approach might encourage collection and sharing of data which we might otherwise want to restrict. But strong governance and access controls, via a data trust or other institution might actually raise the bar around governance and security, beyond that which individual businesses can, or are willing to achieve. Coordination with a regulator might also help decide on how much is “enough” data
  • the utility of data and openly available models might degrade over time, requiring ongoing investment
  • the approach seems most applicable to uses of AI/ML with similar data requirements, In practice there may be only a small number of these, or data requirements may vary enough to limit benefits of data sharing

Again, not an exhaustive list. But as I’ve noted, I think there are ways to mitigate some of these risks.

Let me know what you think, what I’ve missed, or what I should be reading. I’m not in a position to move this forward, but welcome a discussion. Leave your thoughts in the comments below, or ping me on twitter.

When can expect more from data portability?

We’re at the end of week 5 of 2020, of the new decade and I’m on a diet.

I’m back to using MyFitnessPal again. I’ve used it on and off for the last 10 years whenever I’ve decided that now is the time to be more healthy. The sporadic, but detailed history of data collection around my weight and eating habits mark out each of the times when this time was going to be the time when I really made a change.

My success has been mixed. But the latest diet is going pretty well, thanks for asking.

This morning the app chose the following feature to highlight as part of its irregular nudges for me to upgrade to premium.

Downloading data about your weight, nutrition and exercise history are a premium feature of the service. This gave me pause for thought for several reasons.

Under UK legislation, and for as long as we maintain data adequacy with the EU, I have a right to data portability. I can request access to any data about me, in a machine-readable format, from any service I happen to be using.

The company that produce MyFitnessPal, Under Armour, do offer me a way to exercise this right. It’s described in their privacy policy, as shown in the following images.

Note about how to exercise your GDPR rights in MyFitnessPalData portability in MyFitnessPal

Rather than enabling this access via an existing product feature, they’ve decide to make me and everyone else request the data directly. Every time I want to use it.

This might be a deliberate decision. They’re following the legislation to the letter. Perhaps its a conscious decision to push people towards a premium service, rather than make it easy by default. Their user base is international, so they don’t have to offer this feature to everyone.

Or maybe its the legal and product teams not looking at data portability as an opportunity. That’s something that the ODI has previously explored.

I’m hoping to see more exploration of the potential benefits and uses of data portability in 2020.

I think we need to re-frame the discussion away from compliance and on to commercial and consumer benefits. For example, by highlighting how access to data contributes to building ecosystems around services, to help retain and grow a customer base. That is more likely to get traction than a continued focus on compliance and product switching.

MyFitnessPal already connects into an ecosystem of other services. A stronger message around portability might help grow that further.  After all, there are more reasons to monitor what you eat than just weight loss.

Clearer legislation and stronger guidance from organisations like ICO and industry regulators describing how data portability should be implemented would also help. Wider international adoption of data portability rights wouldn’t hurt either.

There’s also a role for community driven projects to build stronger norms and expectations around data portability. Projects like OpenSchufa demonstrate the positive benefits of coordinated action to build up an aggregated view of donated, personal data.

But I’d also settle with a return to the ethos of the early 2010s, when making data flow between services was the default. Small pieces, loosely joined.

If we want the big platforms to go on a diet, then they’re going to need to give up some of those bytes.

Thinking about the governance of data

I find “governance” to be a tricky word. Particularly when we’re talking about the governance of data.

For example, I’ve experienced conversations with people from a public policy background and people with a background in data management, where its clear that there are different perspectives. From a policy perspective, governance of data could be described as the work that governments do to enforce, encourage or enable an environment where data works for everyone. Which is slightly different to the work that organisations do in order to ensure that data is treated as an asset, which is how I tend to think about organisational data governance.

These aren’t mutually exclusive perspectives. But they operate at different scales with a different emphasis, which I think can sometimes lead to crossed wires or missed opportunities.

As another example, reading this interesting piece of open data governance recently, I found myself wondering about that phrase: “open data governance”. Does it refer to the governance of open data? Being open about how data is governed? The use of open data in governance (e.g. as a public policy tool), or the role of open data in demonstrating good governance (e.g. through transparency). I think the article touched on all of these but they seem quite different things. (Personally I’m not sure there is anything special about the governance of open data as opposed to data in general: open data isn’t special).

Now, all of the above might be completely clear to everyone else and I’m just falling into my usual trap of getting caught up on words and meanings. But picking away at definitions is often useful, so here we are.

The way I’ve rationalised the different data management and public policy perspectives is in thinking about the governance of data as a set of (partly) overlapping contexts. Like this:

 

Governance of data as a set of overlapping contexts

 

Whenever we are managing and using data we are doing so within a nested set of rules, processes, legislation and norms.

In the UK our use of data is bounded by a number of contexts. This includes, for example: legislation from the EU (currently!), legislation from the UK government, legislation defined by regulators, best practices that might be defined how a sector operates, our norms as a society and community, and then the governance processes that apply within our specific organisations, departments and even teams.

Depending on what you’re doing with the data, and the type of data you’re working with, then different contexts might apply. The obvious one being the use of personal data. As data moves between organisations and countries, then different contexts will apply, but we can’t necessarily ignore the broader contexts in which it already sits.

The narrowest contexts, e.g. those within an organisations, will focus on questions like: “how are we managing dataset XYZ to ensure it is protected and managed to a high quality?” The broadest contexts are likely to focus on questions like: “how do we safely manage personal data?

Narrow contexts define the governance and stewardship of individual datasets. Wider contexts guide the stewardship of data more broadly.

What the above diagram hopefully shows is that data, and our use of data, is never free from governance. It’s just that the terms under which it is governed may just be very loosely defined.

This terrible sketch I shared on twitter a while ago shows another way of looking at this. The laws, permissions, norms and guidelines that define the context in which we use data.

Data use in context

One of the ways in which I’ve found this “overlapping contexts” perspective useful, is in thinking about how data moves into and out of different contexts. For example when it is published or shared between organisations and communities. Here’s an example from this week.

IBM have been under fire because they recently released (or re-released) a dataset intended to support facial recognition research. The dataset was constructed by linking to public and openly licensed images already published on the web, e.g. on Flickr. The photographers, and in some cases the people featured in those images, are unhappy about the photographs being used in this new way. In this new context.

In my view, the IBM researchers producing this dataset made two mistakes. Firstly, they didn’t give proper appreciation to the norms and regulations that apply to this data — the broader contexts which inform how it is governed and used, even though its published under an open licence. For example, e.g. people’s expectations about how photographs of them will be used.

An open licence helps data move between organisations — between contexts — but doesn’t absolve anyone from complying with all of the other rules, regulations, norms, etc that will still apply to how it is accessed, used and shared. The statement from Creative Commons helps to clarify that their licenses are not a tool for governance. They just help to support the reuse of information.

This lead to IBM’s second mistake. By creating a new dataset they took on responsibility as its data steward. And being a data steward means having a well-defined, set of data governance processes that are informed and guided by all of the applicable contexts of governance. But they missed some things.

The dataset included content that was created by and features individuals. So their lack of engagement with the community of contributors, in order to discuss norms and expectations was mistaken. The lack of good tools to allow people to remove photos — NBC News created a better tool to allow Flickr users to check the contents of the dataset — is also a shortfall in their duties. Its the combination of these that has lead to the outcry.

If IBM had instead launched an initiative similar where they built this dataset, collaboratively, with the community then they could have avoided this issue. This is the approach that Mozilla took with Voice. IBM, and the world, might even have had a better dataset as a result because people have have opted-in to including more photos. This is important because, as John Wilbanks has pointed out, the market isn’t creating these fairer, more inclusive datasets. We need them to create an open, trustworthy data ecosystem.

Anyway, that’s one example of how I’ve found thinking about the different contexts of governing data, helpful in understanding how to build stronger data infrastructure. What do you think? Am I thinking about this all wrong? What else should I be reading?

 

Impressions from pidapalooza 19

This week I was at the third pidapalooza conference in Dublin. It’s a conference that is dedicated open identifiers: how to create them, steward them, drive adoption and promote their benefits.

Anyone who has spent any time reading this blog or following me on twitter will know that this is a topic close to my heart. Open identifiers are infrastructure.

I’ve separately written up the talk I gave on documenting identifiers to help drive adoption and spur the creation of additional services. I had lots of great nerdy discussions around URIs, identifier schemes, compact URIs, standards development and open data. But I wanted to briefly capture and share a few general impressions.

Firstly, while the conference topic is very much my thing, and the attendees were very much my people (including a number of ex-colleagues and collaborators), I was approaching the event from a very different perspective to the majority of other attendees.

Pidapalooza as a conference has been created by organisations from the scholarly publishing, research and archiving communities. Identifiers are a key part of how the integrity of the scholarly record is maintained over the long term. They’re essential to support archiving and access to a variety of research outputs, with data being a key growth area. Open access and open data were very much in evidence.

But I think I was one of only a few (perhaps the only?) attendee from what I’ll call the “broader” open data community. That wasn’t a complete surprise but I think the conference as a whole could benefit from a wider audience and set of participants.

If you’re working in and around open data, I’d encourage you to go to pidapalooza, submit some talk ideas and consider sponsoring. I think that would be beneficial for several reasons.

Firstly, in the pidapalooza community, the idea of data infrastructure is just a given. It was refreshing to be around a group of people that past the idea of thinking of data as infrastructure and were instead focusing on how to build, govern and drive adoption of that infrastructure. There’s a lot of lessons there that are more generally applicable.

For example I went to a fascinating talk about how EIDR, an identifier for movie and television assets, had helped to drive digital transformation in that sector. Persistent identifiers are critical to digital supply chains (Netflix, streaming services, etc). There are lessons here for other sectors around benefits of wider sharing of data.

I also attended a great talk by the Australian Research Data Commons that reviewed the ways in which they were engaging with their community to drive adoption and best practices for their data infrastructure. They have a programme of policy change, awareness raising, skills development, community building and culture change which could easily be replicated in other areas. It paralleled some of the activities that the Open Data Institute has carried out around its sector programmes like OpenActive.

The need for transparent governance and long-term sustainability were also frequent topics. As was the recognition that data infrastructure takes time to build. Technology is easy, its growing a community and building consensus around an approach that takes time.

(btw, I’d love to spend some time capturing some of the lessons learned by the research and publishing community, perhaps as a new entry to the series of data infrastructure papers that the ODI has previously written. If you’d like to collaborate with or sponsor the ODI to explore that, then drop me a line?)

Secondly, while the pidapalooza community seem to have generally accepted (with a few exceptions) the importance of web identifiers and open licensing of reference data. But that practice is still not widely adopted in other domains. Few of the identifiers I encounter in open government data, for example, are well documented, openly licensed or are supported by a range of APIs and services.

Finally, much of the focus of pidapalooza was on identifying research outputs and related objects: papers, conferences, organisations, datasets, researchers, etc. I didn’t see many discussions around the potential benefits and consequences of use of identifiers in research datasets. Again, this focus follows from the community around the conference.

But as the research, data science and machine-learning communities begin exploring new approaches to increase access to data, it will be increasingly important to explore the use of standard identifiers in that context. Identifiers have a clear role in helping to integrate data from different sources, but there are wider risks around data privacy and ethical considerations around identification of individuals, for example, that will need to happen.

I think we should be building a wider community of practice around use of identifiers in different contexts, and I think pidapalooza could become a great venue to do that.

Talk: Documenting Identifiers for Humans and Machines

This is a rough transcript of a talk I recently gave at a session at Pidapalooza 2019. You can view the slides from the talk here. I’m sharing my notes for the talk here, with a bit of light editing. I’d also really welcome you thoughts and feedback on this discussion document.

At the Open Data Institute we think of data as infrastructure. Something that must be invested in and maintained so that we can maximise the value we get from data. For research, to inform policy and for a wide variety of social and economic benefits.

Identifiers, registers and open standards are some of the key building blocks of data infrastructure. We’ve done a lot of work to explore how to build strong, open foundations for our data infrastructure.

A couple of years ago we published a white paper highlighting the importance of openly licensed identifiers in creating open ecosystems around data. We used that to introduce some case studies from different sectors and to explore some of the characteristics of good identifier systems.

We’ve also explored ways to manage and publish registers. “Register” isn’t a word that I’ve encountered much in this community. But its frequently used to describe a whole set of government data assets.

Registers are reference datasets that provide both unique and/or persistent identifiers for things, and data about those things. The datasets of metadata that describe ORCIDs and DOIs are registers. As are lists of doctors, countries and locations where you can get our car taxed. We’ve explored different models for stewarding registers and ways to build trust
around how they are created and maintained.

In the work I’ve done and the conversations I’ve been involved with around identifiers, I think we tend to focus on two things.

The first is persistence. We need identifiers to be persistent in order to be able to rely on them enough to build them into our systems and processes. I’ve seen lots of discussion about the technical and organisational foundations necessary to ensure identifiers are persistent.

There’s also been great work and progress around giving identifiers affordance. Making them actionable.

Identifiers that are URIs can be clicked on in documents and emails. They can be used by humans and machines to find content, data and metadata. Where identifiers are not URIs, then there are often resolvers that will help to make to integrate them with the web.

Persistence and affordance are both vital qualities for identifiers that will help us build a stronger data infrastructure.

But lately I’ve been thinking that there should be more discussion and thought put into how we document identifiers. I think there are three reasons for this.

Firstly, identifiers are boundary objects. As we increase access to data, by sharing it between organisations or publishing it as open data, then an increasing number of data users  and communities are likely to encounter these identifiers.

I’m sure everyone in this room know what a DOI is (aside: they did). But how many people know what a TOID is? (Aside: none of them did). TOIDs are a national identifier scheme. There’s a TOID for every geographic feature on Ordnance Survey maps. As access to OS data increases, more developers will be introduced to TOIDs and could start using them in their applications.

As identifiers become shared between communities. It’s important that the context around how those identifiers are created and managed is accessible, so that we can properly interpret the data that uses them.

Secondly, identifiers are standards. There are many different types of standard. But they all face common problems of achieving wide adoption and impact. Getting a sector to adopt a common set of identifiers is a process of agreement and implementation. Adoption is driven by engagement and support.

To help drive adoption of standards, we need to ensure that they are well documented. So that users can understand their utility and benefits.

Finally identifiers usually exist as part of registers or similar reference data. So when we are publishing identifiers we face all the general challenges of being good data publishers. The data needs to be well described and documented. And to meet a variety of data user needs, we may need a range of services to help people consume and use it.

Together I think these different issues can lead to additional friction that can hinder the adoption of open identifiers. Better documentation could go some way towards addressing some of these challenges.

So what documentation should we publish around identifier schemes?

I’ve created a discussion document to gather and present some thoughts around this. Please have a read and leave you comments and suggestions on that document. For this presentation I’ll just talk through some of the key categories of information.

I think these are:

  • Descriptive information that provides the background to a scheme, such as what it’s for, when it was created, examples of it being used, etc
  • Governance information that describes how the scheme is managed, who operates it and how access is managed
  • Technical notes that describe the syntax and validation rules for the scheme
  • Operational information that helps developers understand how many identifiers there are, when and how new identifiers are assigned
  • Service pointers that signpost to resolvers and other APIs and services that help people use or adopt the identifiers

I take it pretty much as a given that this type of important documentation and metadata should be machine-readable in some form. So we need to approach all of the above in a way that can meet the needs of both human and machine data users.

Before jumping into bike-shedding around formats. There’s a few immediate questions to consider:

  • how do we make this metadata discoverable, e.g. from datasets and individual identifiers?
  • are there different use cases that might encourage us to separate out some of this information into separate formats and/or types of documentation?
  • what services might we build off the metadata?
  • …etc

I’m interested to know whether others think this would be a useful exercise to take further. And also the best forum for doing that. For example should there be a W3C community group or similar that we could use to discuss and publish some best practice.

Please have a look at the discussion document. I’m keen to learn from this community. So let me know what you think.

Thanks for listening.