How do different communities create unique identifiers?

Identifiers are part of data infrastructure. They play an important role, helping to publish, structure and link together data. Identifiers are boundary objects that cross communities. That means they need to be well-documented in order to be most useful.

Understanding how identifiers are created, assigned and governed can help us think through how to strengthen our data infrastructure. With that in mind, let’s take a quick tour of how different communities and systems have created identifier systems to help to uniquely refer to different digital and physical objects.

The simplest way to generate identifiers is with a serial number: a steadily increasing number that is assigned to whatever you need to identify next. This is the approach used in most internal databases, as well as in some commonly encountered public identifiers.

For example the Ordnance Survey TOID identifier is a serial number that looks like this: osgb1000006032892. UPRNs are similar.

Serial numbers work well when you have a single organisation and/or system generating the identifiers. They’re simple to implement, but can have their downsides, especially when they’re shared with others.

Some serial numbering systems include built-in error-checking to deal with copying errors, using a check digit. Examples include the CAS registry number for identifying chemicals, and the basic form of the ISSN for identifying academic journals.
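
To make the idea concrete, here’s a minimal sketch of an ISSN-style check digit calculation in Python. The weighting scheme (8 down to 2, modulo 11, with “X” standing in for 10) follows the published ISSN algorithm, but treat the code as an illustration rather than a reference implementation.

```python
def issn_check_digit(digits: str) -> str:
    """Calculate the check digit for the first seven digits of an ISSN.

    Each digit is weighted from 8 down to 2, the products are summed,
    and the check digit is whatever brings the total to a multiple of 11.
    A value of 10 is written as "X".
    """
    total = sum(int(d) * w for d, w in zip(digits, range(8, 1, -1)))
    check = (11 - (total % 11)) % 11
    return "X" if check == 10 else str(check)

# The ISSN 0028-0836 ends in its check digit:
assert issn_check_digit("0028083") == "6"
```

A copying error in any single digit changes the weighted sum, so the check digit no longer matches and the mistake can be caught before the identifier is used.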

As the barcode form of the ISSN illustrates, identifiers often have more structure to them. And they may not be assigned as a simple serial number.

The second way of providing unique identifiers is using a name or code. These are typically still assigned by a central authority, sometimes known as a registration agency, but they are constructed in different ways.

Identifiers for geographic locations typically rely on administrative regions or other areas to help structure identifiers. For example, the statistics community in the EU created the NUTS codes to help identify country sub-divisions in statistical datasets. These are assigned based on a hierarchy, beginning with the country and moving down to smaller geographic regions. Bath is UKK12, for example.

Postal codes are another geographically based set of codes. Both UK and US postal codes use a geographical hierarchy. Only here the regions are those meaningful to how the Royal Mail and USPS manage their delivery operations, rather than being administratively defined by the government.

Hierarchies that are based on geography and/or organisational structures are common patterns in identifiers. Existing hierarchies provide a handy way to partition up sets of things for identification purposes.

The SWIFT code used in banking has a mixture of organisational and geographic hierarchies.

Encoding information about geography and hierarchy within codes can be useful. It can make them easier to validate. It also means you can manipulate them, e.g. by truncation, to find the identifiers for broader regions.
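
Here’s a rough illustration of that truncation trick in Python, using the NUTS code for Bath mentioned above. It assumes the usual NUTS structure of a two-letter country code followed by one extra character per level of the hierarchy.

```python
def broader_regions(nuts_code: str) -> list[str]:
    """Walk back up the NUTS hierarchy by truncating the code one
    character at a time, stopping at the two-letter country code."""
    return [nuts_code[:i] for i in range(len(nuts_code), 1, -1)]

print(broader_regions("UKK12"))
# ['UKK12', 'UKK1', 'UKK', 'UK']
```

Each prefix identifies a progressively broader region, so a dataset keyed by detailed NUTS codes can be aggregated to higher levels without any lookup table.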

But encoding lots of information in identifiers also has its downsides. The main one is dealing with changes to administrative areas that alter the hierarchy. Do you reassign all the identifiers?

Assigning identifiers from a single, central authority isn’t always ideal. It can add coordination overhead which can be particularly problematic if you need to assign lots of identifiers quickly. So some identifier systems look at reducing the burden on that central authority.

A solution to this is to delegate identifier assignment to other organisations. There are two ways this is done in practice.

The first is what we might call federated assignment. This is where the registration agency shares the work of assigning identifiers with other organisations. A typical approach is to delegate the work of registration and assignment to national organisations. Although other approaches are possible.

The delegation of work might be handled entirely “behind the scenes” as an operational approach. But sometimes it ends up being a feature of the identifier system.

For example, the Legal Entity Identifier (LEI) uses federated assignment, where “Local Operating Units” do the work of assigning identifiers. The identifiers for the LOUs become part of the identifiers they assign.

The International Standard Recording Code uses a similar approach with national agencies assigning identifiers.

Another approach to reducing dependence on, and coordination with, a single registration agency is to use what I’ll call “local assignment”. In this approach individual organisations are empowered to assign identifiers as they need them.

A simplistic approach to local assignment is “block allocation“: handing out blocks of pregenerated identifiers to organisations which can locally assign them. Blocks of IP addresses are handed out to Internet Service Providers. Similarly, blocks of UPRNs are handed out to local authorities.

Here the registration agency still generates the identifiers, but the assignment of identifier to “thing” is done locally. And, in the second case at least, a record of this assignment will still be shared with the agency.

A more common approach is to use “prefix allocation“. In this approach the registration agency assigns individual organisations a prefix within the identifier system. The organisation then generates new unique identifiers by combining their prefix with a locally generated suffix.

A suffix might be generated by adding a local serial number to the prefix. Or by some other approach. Again, after an identifier has been generated and assigned, it is commonly still registered centrally.
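
A minimal sketch of prefix allocation might look something like the following. The prefix and the zero-padded serial suffix are entirely made up; real schemes such as EIDR or GTIN define their own prefix registries, suffix rules and check digits.

```python
class LocalAssigner:
    """Generates identifiers by combining a centrally allocated prefix
    with a locally generated serial-number suffix."""

    def __init__(self, prefix: str):
        self.prefix = prefix  # allocated once by the registration agency
        self.counter = 0      # purely local state, no further coordination

    def next_identifier(self) -> str:
        self.counter += 1
        return f"{self.prefix}-{self.counter:06d}"

# Two organisations with different prefixes can never produce the same identifier.
org_a = LocalAssigner("ORG-A")   # hypothetical prefixes
org_b = LocalAssigner("ORG-B")
print(org_a.next_identifier())   # ORG-A-000001
print(org_b.next_identifier())   # ORG-B-000001
```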

Many identifier systems use this approach. Examples include the EIDR identifiers used in the entertainment industry, GTINs, and the BIC codes used for shipping containers.

One challenge with prefix allocation is ensuring that the rules for locally assigned suffixes work in every context where the identifier needs to appear. This typically means providing some rules about how suffixes are constructed.

The DOI system encountered problems because publishers were generating identifiers that didn’t work well when DOIs were expressed as URLs, due to the need for extra encoding. This made them tricky to work with.
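
To illustrate the kind of problem that crops up: a DOI suffix can contain characters like “<”, “>” or “#” that have special meaning in URLs, so expressing it as a URL means percent-encoding it and decoding it again at the other end. The DOI below is invented for the example.

```python
from urllib.parse import quote

# A made-up DOI with awkward characters in its publisher-assigned suffix.
doi = "10.1000/ex(ample)<123>#1"

# Expressing it as a URL requires percent-encoding the suffix...
url = "https://doi.org/" + quote(doi, safe="/")
print(url)
# https://doi.org/10.1000/ex%28ample%29%3C123%3E%231

# ...and anything consuming that URL has to decode it again to recover the DOI.
```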

For a complicated example that mixes the use of prefixes, country codes and check digits, we can look at the VIN, which is a unique identifier for vehicles. This 17-digit code includes multiple segments, but there are four competing standards for what the segments mean. Sigh.

It’s possible to go further than just reducing dependency on registration agencies. They can be eliminated completely.

In distributed assignment of identifiers, anyone can create an identifier. Rather than requesting an identifier, or a prefix from a registration agency, these systems operate by agreeing rules for how unique identifiers can be constructed.

One approach to distributed assignment is to use an element of randomness to generate a unique identifier at the point in time it’s needed. The goal is to design an algorithm that uses a random number generator, and sometimes additional information like a timestamp or a MAC address, to construct an identifier where there is an extremely low chance that someone else could have created the same identifier at the same moment in time (known as a “collision”).

This is how UUIDs work. You can play with generating some using online tools.
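
If you’d rather not use an online tool, Python’s standard library can generate them directly. uuid4 is the purely random variant; uuid1 mixes in a timestamp and the machine’s MAC address.

```python
import uuid

print(uuid.uuid4())  # random, e.g. 1b9d6bcd-bbfd-4b2d-9b5d-ab8dfbbd4bed
print(uuid.uuid1())  # based on a timestamp and the MAC address
```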

Identifiers like UUIDs are cheap to generate and require no coordination beyond an agreed algorithm. They work very well when you just need a reliable way to assign an identifier to something, with reasonable confidence that, if the data is later combined, we won’t encounter any issues.

But what if we need to independently assign an identifier to the same thing? So that when we later combine our datasets, then our data will link up?

For this we need to use a hash-based identifier. A hash-based identifier takes some properties of the thing we want to identify and uses them to construct an identifier. If we have a good enough algorithm then, even if we do this independently, we should end up constructing the same identifier.
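
As a toy sketch of the idea: if two parties independently hash the same properties in the same agreed order, they end up with the same identifier. Real systems like InChI put far more work into canonicalising their inputs than this example does.

```python
import hashlib

def fingerprint(properties: dict) -> str:
    """Derive an identifier by hashing an agreed, ordered set of properties."""
    canonical = "|".join(f"{key}={properties[key]}" for key in sorted(properties))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

# Two independently constructed records describing the same thing...
a = fingerprint({"title": "On Identifiers", "year": "2020", "volume": "1"})
b = fingerprint({"volume": "1", "title": "On Identifiers", "year": "2020"})
assert a == b  # ...produce the same identifier.
```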

This is sometimes referred to as creating a “digital fingerprint” of the object. It’s commonly used to identify copies of objects. For example, the approach is used to construct content identifiers in the IPFS system. And as part of YouTube’s Content ID system to manage copyright claims.

But hash-based identifiers don’t have to be used for managing content; they can be used as pure identifiers. The most complex example I’m familiar with is the InChI, which is a means of generating a unique identifier for chemicals using information about their structure.

By using a consistent algorithm provided as open source software, chemists can reliably create identifiers for the same structures.

The SICI code used to identify academic papers was a hash-based system that used metadata about the publication to generate an identifier. However, in practice it was difficult to work with, due to the variety of ways in which content was actually published and the variety of contexts in which identifiers needed to be generated.

Hash-based identifiers are very tricky to get right, as you need a robust algorithm that is widely adopted. Those needing to generate identifiers will also need to be able to reliably access all of the information required to create the identifier. Variations in the availability of metadata, object formats, etc. can all affect how well they work in practice.

I miss being able to look people in the eye

What even is time, anymore?

I’ve seen and made many variations of this joke across Slack, Twitter and meetings this week. Remote working and social isolation have disrupted all of our routines and left us feeling adrift. But, for those of us lucky enough to have good connectivity, we’re certainly not talking or seeing each other any less. I’ve ended several days this week hoarse from talking.

The number of people playing with avatars, virtual backgrounds and buying green screens speaks to the level of engagement with video meetings and chat. Of course, there’s also the memes.

By the way, Disney are sharing a nice line in backgrounds. But I have my own favourites.

In team catch-ups this week, a few people have remarked how, despite all the meetings and check-ins, they just didn’t feel as engaged. Key decisions or outcomes were not sinking in. People struggled to remember who was on a particular call. This isn’t surprising. Neither the general situation nor the medium we’re using is really great for focus and connection.

The comments have made me more conscious of the limitations of the software we’re using.

For example, one of the nice features of Zoom is the “gallery view” so you can see everyone on the call. Or at least until your call is so large that you end up with several pages of attendees. It makes it really easy to read the room when chairing. Contrast that with Hangouts, which doesn’t have the same feature. This makes it so much harder to gauge reactions in a discussion, identify people who want to raise questions, or even just catch when someone has had a connectivity problem.

General presence notifications are also a problem. In a drop-in meeting this week, it was only a little way into the call that I realised that we had 17 people in the discussion. That level of participation was so much easier to gauge when we were all sat around tables in the office kitchen.

We tried out Remo recently too. It has a cute office layout that facilitates break-out discussions and you can easily move between chats. I think it’s great for some types of meetings. But it didn’t create quite the same atmosphere for having drinks with the team as a raucous, messy hangout.

I think the thing that I’ve personally been struggling with is that you can’t look anyone in the eye on a video call.

Now, I’m usually terrible at looking people in the eye. In a conversation with me, you’ll find I’m typically looking around as I’m talking. It helps me think. Although when I’m listening, I’m much more attentive to others. But being able to look someone in the eye to read their reactions, look for agreement, or just to enjoy a joke is something that we can’t easily do at the minute. And I miss it.

Some people struggle with direct eye contact. Some people like the freedom to look away, fidget or play with a stress toy when listening. We’re all wired differently. Eye contact isn’t always necessary or desirable. But there’s lots of research exploring the effects of eye contact, which notes some potential impacts on memory and prosocial behaviour.

While tools like Zoom need to fix their security flaws before adding features, I’m hoping this period will lead to more user research and product development. So that we have much better and more secure tools. There’s plenty of room for innovation. Although like others I don’t think that attention correction is what we need. But I’d love to read more about interesting experiments with online presence and remote working tools.

It’s important to remember – as ever when we choose to make something digital – that many of these challenges are a fact of life for people with disabilities, who may be relying on remote participation in events and meetings.

In the meantime there’s a few things we can all do to improve our meetings. Choose the right tool. Find ways to stay in contact with everyone on the call. Take notes. Share key decisions afterwards (duh!)

And, if you’re using multiple monitors, maybe put the video call on the same screen as your webcam. Or think about putting your webcam near your screen. Then we can at least glance in each other’s directions.

Quick tips for chairing remote meetings

There’s a growing set of useful resources and guidance to help people run better remote meetings. I’ve been compiling a list of a few. At the risk of repeating other, better advice, I’m going to write down some brief tips for running remote meetings.

For a year or so I was chairing fortnightly meetings of the OpenActive standards group. Those meetings were an opportunity to share updates with a community of collaborators, get feedback on working documents and have debates and discussion around a range of topics. So I had to get better at doing it. Not sure whether I did, but here’s a few things I learned.

I’ll skip over general good meeting etiquette (e.g. around circulating an agenda and working documents in advance), to focus on the remote bits.

  1. Give people time to arrive. Just because everyone is attending remotely doesn’t mean that everyone will be able to arrive promptly. They may be working through technical difficulties, for example. Build in a bit of deliberate slack time at the start of the meeting. I usually gave it around 5-10 minutes. As people arrive, greet them and let them know this is happening. You can then either chat as a group or people can switch to emails, etc while waiting for things to start.
  2. Call the meeting to order. Make it clear when the meeting is formally starting and you’ve switched from general chat and waiting for late arrivals. This will help ensure you have people’s attention.
  3. Use the tools you have as a chair. Monitor side chat. Monitor the video feeds to check to see if people look like they have something to say. And, most importantly, mute people that aren’t speaking but are typing or have lots of background noise. You can usually avoid the polite dance around asking people to do that, or suffering in silence, by using the option to mute people. Just tell them you’ve done that. I usually had Zoom meetings set up so that people were muted on entry.
  4. Do a roll call. Ask everyone to introduce themselves at the start. Don’t just ask everyone to do that at once, as they’ll talk over each other. Go through people individually and ask them to say hello or do an introduction. This helps with putting voices to names (if not everyone is on video), ensures that everyone knows how to mute/unmute and puts some structure to the meeting.
  5. Be aware of when people are connecting in different ways. Some software, like Zoom, allows people to join in several ways. Be aware of when you have people on phone and video, especially if you’re presenting material. Try to circulate links either before or during the meeting so that everyone can see them.
  6. Use slides to help structure the meeting. I found that doing a screenshare of a set of slides for the agenda and key talking points helps to give people a sense of where you’re at in the meeting. So, for example if you have four items on your agenda, have a slide for each topic item. With key questions or decision points. It can help to focus discussion, keeps people’s attention on the meeting (rather than a separate doc) and gives people a sense of where you are. The latter is especially helpful if people are joining late.
  7. Don’t be afraid of a quick recap. If people join late, give them a quick recap of where you’re at and ask them to introduce themselves. I often did this if people joined a few minutes late, but not if they dropped in 30 minutes into a 1 hour meeting.
  8. Don’t be afraid of silence or directly asking people questions. Chairing remote meetings can be stressful and awkward for everyone. It can be particularly awkward to ask questions and then sit in silence. Often this is because people are worried about talking over each other. Or they just need time to think. Don’t be afraid of a bit of silence. Doing a roll call to ask everyone individually for feedback can be helpful if you want to make decisions. Check in on people who have not said anything for a while. It’s slow, but provides some order for everyone.
  9. Keep to time. I tried very hard not to let meetings over-run even if we didn’t cover everything. People have other events in their calendars. Video and phone calls can be tiring. It’s better to wrap up at a suitable point and follow up on things you didn’t get to cover than to have half the meeting drop out at the end.
  10. Follow-up afterwards. Make sure to follow up afterwards. Especially if not everyone was able to attend. For OpenActive we video the calls and share those as well as a summary of discussion points.

Those are all the things I tried to consciously get better at and I think helped things go more smoothly.

What is collaborative maintenance of data? A short talk at the Royal Society

Following the publication of their report on data governance in the 21st century, the Royal Society are running a number of workshops to explore data governance in different sectors. In October 2019 they ran one exploring data governance in the auto insurance sector.

Last week they held a workshop looking at data governance in the civil society sector. The ODI were invited to help out, and I chaired a session looking at collaborative maintenance of data. I believe the Royal Society will be publishing a longer write-up of the workshop over the coming weeks.

This blog post is a written version of a short ten minute talk I gave during the workshop. The slides are public.

Let’s start with a definition. What is collaborative maintenance?

You might already be familiar with terms like “crowd-sourcing” or “citizen science”. Both of those are examples of collaborative maintenance. But it can take other forms too. At the ODI we use collaborative maintenance of data to refer to any scenario where organisations and communities are sharing the work of collecting and maintaining data.

It might be helpful to position collaborative maintenance alongside other approaches that are part of “open culture”. These include open standards, open source, and open data. Let’s look at each of them in turn.

Open standards for data are reusable, shared agreements that shape how we collect, share, govern and use data. There are different types of open standards. Some are technical, and describe file formats and methods of exchanging data. Others are higher-level and capture codes of practices and protocols for collecting data. Open standards are best developed collaboratively, so that everyone impacted by or benefiting from the standard can help shape it.

Open source involves collaborating to create reusable, openly licensed code and applications. Some open source projects are run by individuals or small communities. Others are backed by larger commercial organisations. This collaborative work is different to that of open standards. For example, it involves identifying and agreeing features, writing and testing code and producing documentation to allow others to use it.

Open data is about publishing data under an open licence, so it can be accessed, used and shared by anyone for any purpose. Different communities engage in publication of open data for different purposes.

For example, the open government movement originally focused on open data as a means to increase transparency of governments. More recently there is a shift towards using open data to help address a variety of social, economic and environmental challenges. In contrast, as part of the open science movement, there is a different role for open data. Recent attention has been on the use of open data to address the reproducibility crisis around research. Or to help respond to emerging health issues, like Coronavirus.

With a few exceptions, the main approach to open data has been a single organisation (or researcher) publishing data that they have already collected. There may be some collaboration around use of that data, but not in its collection or maintenance.

This makes open data quite distinct from open source or open standards.

We can think of collaborative maintenance as taking the approach used in open source and applying it to data. Collaborative maintenance involves collaboration across the full lifecycle of a dataset.

Some examples might be helpful.

OpenStreetMap is a collaboratively produced spatial database of the entire world. While it was originally produced by individuals and communities, it is now contributed to by large organisations like Facebook, Microsoft and Apple. The Humanitarian OpenStreetMap community focuses on the collection and use of data to support humanitarian activities. The community are involved in deciding what data to collect, prioritising maintenance of data following disasters, and mapping activities either on the ground or remotely. The community works across the lifecycle and is self-directing.

Common Voice is a Mozilla project. It aims to build an open dataset to support voice recognition applications. By asking others to contribute to the dataset, they hope to make it more comprehensive and inclusive. Mozilla have defined what data will be collected and the tasks to be carried out, but anyone can contribute to the dataset by adding their voice or transcribing a recording. It’s this open participation that could help ensure that the dataset represents a more diverse set of people.

Edubase is maintained by the Department for Education (DfE). It’s our national database of schools. It’s used in a variety of different applications. Like Mozilla, DfE are acting as the steward of the data and have defined what information should be collected. But the work of populating and maintaining the shared directory is carried out by people in the individual schools. This is the best way to keep that data up to date. Those who know when the data has changed have the ability to update it. The contributors all benefit from the shared resource.

Building a shared directory is a common use for collaborative maintenance. But there are others.

Looking across these projects and other examples that we’ve studied in our desk and user research, we can see that there are different ways we can collaborate around data.

For example, we can work together to decide what data to collect. We can share the work of collecting and maintaining data, ensuring its quality and governing access to it. We can use open source to help to build the tools to support those communities.

We’ve developed the collaborative maintenance guidebook to help support the design of new services and platforms. It includes some background and a worked example. The bulk of the guidebook is a set of “design patterns” that describe solutions to common problems. For example how to manage quality when many different people are contributing to the same dataset.

We think collaborative maintenance can be useful in more projects. For civil society organisations collaborative maintenance might help you engage with communities that you’re supporting to collect and maintain useful data. It might also be a tool to support collaboration across the sector as a means of building common resources.

The guidebook is at an early stage and we’d love to get feedback on its contents. Or to help you apply it to a real-world project. Let us know what you think!


How can publishing more data increase the value of existing data?

There’s lots to love about the “Value of Data” report. Like the fantastic infographic on page 9. I’ll wait while you go and check it out.

Great, isn’t it?

My favourite part about the paper is that it’s taught me a few terms that economists use, but which I hadn’t heard before. Like “Incomplete contracts” which is the uncertainty about how people will behave because of ambiguity in norms, regulations, licensing or other rules. Finally, a name to put to my repeated gripes about licensing!

But it’s the term “option value” that I’ve been mulling over for the last few days. Option value is a measure of our willingness to pay for something even though we’re not currently using it. Data has a large option value, because it’s hard to predict how its value might change in future.

Organisations continue to keep data because of its potential future uses. I’ve written before about data as stored potential.

The report notes that the value of a dataset can change because we might be able to apply new technologies to it. Or think of new questions to ask of it. Or, and this is the interesting part, because we acquire new data that might impact its value.

So, how does increasing access to one dataset affect the value of other datasets?

Moving data along the data spectrum means that increasingly more people will have access to it. That means it can be used by more people, potentially in very different ways than you might expect. Applying Joy’s Law then we might expect some interesting, innovative or just unanticipated uses. (See also: everyone loves a laser.)

But more people using the same data is just extracting additional value from that single dataset. It’s not directly impacting the value of other datasets.

To do that we need to use the new data in some specific ways. So far I’ve come up with seven ways that new data can change the value of existing data.

  1. Comparison. If we have two or more datasets then we can compare them. That will allow us to identify differences, look for similarities, or find correlations. New data can help us discover insights that aren’t otherwise apparent.
  2. Enrichment. New data can enrich an existing dataset by adding new information. It gives us context that we didn’t have access to before, unlocking further uses.
  3. Validation. New data can help us identify and correct errors in existing data.
  4. Linking. A new dataset might help us to merge existing datasets, allowing us to analyse them in new ways. The new dataset acts like a missing piece in a jigsaw puzzle.
  5. Scaffolding. A new dataset can help us to organise other data. It might also help us collect new data.
  6. Improve Coverage. Adding more data, of the same type, into an existing pool can help us create a larger, aggregated dataset. We end up with a more complete dataset, which opens up more uses. The combined dataset might have a better spatial or temporal coverage, be less biased or capture more of the world we want to analyse.
  7. Increase Confidence. If the new data measures something we’ve already recorded, then the repeated measurements can help us to be more confident about the quality of our existing data and analyses. For example, we might pool sensor readings about the weather from multiple weather stations in the same area. Or perform a meta-analysis of a scientific study.

I don’t think this is exhaustive, but it was a useful thought experiment.

A while ago, I outlined ten dataset archetypes. It’s interesting to see how these align with the above uses:

  • A meta-analysis to increase confidence will draw on multiple studies
  • Combining sensor feeds can also help us increase confidence in our observations of the world
  • A register can help us with linking or scaffolding datasets. It can also be used to support validation.
  • Pooling together multiple descriptions or personal records can help us create a database that has improved coverage for a specific application
  • A social graph is often used as scaffolding for other datasets

What would you add to my list of ways in which new data improves the value of existing data? What did I miss?

Three types of agreement that shape your use of data

Whenever you’re accessing, using or sharing data you will be bound by a variety of laws and agreements. I’ve written previously about how data governance is a nested set of rules, processes, legislation and norms.

In this post I wanted to clarify the differences between three types of agreements that will govern your use of data. There are others, but from a data consumer point of view these are the most common.

If you’re involved in any kind of data project, then you should have read all of the relevant agreements that relate to the data you’re planning to use. So you should know what to look for.

Data Sharing Agreements

Data sharing agreements are usually contracts that will have been signed between the organisations sharing data. They describe how, when, where and for how long data will be shared.

They will include things like the purpose and legal basis for sharing data. They will describe the important security, privacy and other considerations that govern how data will be shared, managed and used. Data sharing agreements might be time-limited. Or they might describe an ongoing arrangement.

When the public and private sector are sharing data, then publishing a register of agreements is one way to increase transparency around how data is being shared.

The ICO Data Sharing Code of Practice has more detail on the kinds of information a data sharing agreement should contain. As does the UK’s Digital Economy Act 2017 code of practice for data sharing. In a recent project the ODI and CABI created a checklist for data sharing agreements.

Data sharing agreements are most useful when organisations, of any kind, are sharing sensitive data. A contract with detailed, binding rules helps everyone be clear on their obligations.

Licences

Licences are a different approach to defining the rules that apply to use of data. A licence describes the ways that data can be used without any of the organisations involved having to enter into a formal agreement.

A licence will describe how you can use some data. It may also place some restrictions on your use (e.g. “non-commercial”) and may spell out some obligations (“please say where you got the data”). So long as you use the data in the described ways, then you don’t need any kind of explicit permission from the publisher. You don’t even have to tell them you’re using it. Although it’s usually a good idea to do that.

Licences remove the need to negotiate and sign agreements. Permission is granted in advance, with a few caveats.

Standard licences make it easier to use data from multiple sources, because everyone is expecting you to follow the same rules. But only if the licences are widely adopted. Where licences don’t align, we end up with unnecessary friction.

Licences aren’t time-limited. They’re perpetual. At least as long as you follow your obligations.

Licences are best used for open and public data. Sometimes people use data sharing agreements when a licence might be a better option. That’s often because organisations know how to do contracts, but are less confident in giving permissions. Especially if they’re concerned about risks.

Sometimes, even if there’s an open licence to use data, a business would still prefer to have an agreement in place. That might be because the licence doesn’t give them the freedoms they want, or they’d like some additional assurances in place around their use of data.

Terms and Conditions

Terms and conditions, or “terms of use”, are a set of rules that describe how you can use a service. Terms and conditions are the things we all ignore when signing up to a website. But if you’re using a data portal, platform or API then you need to have definitely checked the small print. (You have, haven’t you?)

Like a Data Sharing Agreement, a set of terms and conditions is something that you formally agree to. It might be by checking a box rather than signing a document, but it’s still an agreement.

Terms of use will describe the service being offered and the ways in which you can use it. Like licences and data sharing agreements, they will also include some restrictions. For example whether you can build a commercial service with it. Or what you can do with the results.

A good set of terms and conditions will clearly and separately identify those rules that relate to your use of the service (e.g. how often you can use it) from those rules that relate to the data provided to you. Ideally the terms would just refer to a separate licence. The Met Office Data Point terms do this.

A poorly defined set of terms will focus on the service parts but not include enough detail about your rights to use and reuse data. That can happen if the emphasis has been on the terms of use of the service as a product, rather than around the sharing of data.

The terms and conditions for a data service and the rules that relate to the data are two of the important decisions that shape the data ecosystem that service will enable. It’s important to get them right.

Hopefully that’s a helpful primer. Remember, if you’re in any kind of role using data then you need to read the small print. If not, then you’re potentially exposing yourself and others to risks.

GUIDE, a retrospective

“Tyntesfield servants’ bells” by Caroline. CC-BY-NC-ND licence. https://www.flickr.com/photos/carolineld/4608720906/

This article was first published in the February 2030 edition of Sustain magazine. Ten years since the public launch of GUIDE we sit down with its designers to chat about its origin and what’s made it successful.

It’s a Saturday morning and I’m sitting in the bustling cafe at Tyntesfield house, a National Trust property south of Bristol. I’m enjoying a large pot of tea and a slice of cake with Joe Shilling and Gordon Leith, designers of one of the world’s most popular social applications: GUIDE. I’d expected to meet somewhere in the city, but Shilling suggested this as a suitable venue. It turns out Tyntesfield plays a part in the origin story of GUIDE. So it’s fitting that we are here for the tenth anniversary of its public launch.

SHILLING: “Originally we were just playing. Exploring the design parameters of social applications.”

He stirs the pot of tea while Leith begins sectioning the sponge cake they’ve ordered.

SHILLING: “People did that more in the early days of the web. But Twitter, Facebook, Instagram…they just kind of sucked up all the attention and users. It killed off all that creativity. For a while it seemed like they just owned the space…But then TikTok happened…”

He pauses while I nod to indicate I’ve heard of it.

SHILLING: “…and small experiments like Yap. It was a slow burn, but I think a bunch of us started to get interested again in designing different kinds of social apps. We were part of this indie scene building and releasing bespoke social networks. They came and went really quickly. People just enjoyed them whilst they were around.”

Leith interjects around a mouthful of cake:

LEITH: “Some really random stuff. Social nets with built-in profile decay so they were guaranteed to end. Made them low commitment, disposable. Messaging services where you could only post at really specific, sometimes random times. Networks that only came online when their members were in precise geographic coordinates. Spatial partitioning to force separation of networks for home, work and play. Experimental, ritualised interactions.”

SHILLING: “The migratory networks grew out of that movement too. They didn’t last long, but they were intense. ”

LEITH: “Yeah. Social networks that just kicked into life around a critical mass of people. Like in a club. Want to stay a member…share the memes? Then you needed to be in its radius. In the right city, at the right time. And then keep up as the algorithm shifted it. Social spaces herding their members.”

SHILLING: “They were intense and incredibly problematic. Which is why they didn’t last long. But for a while there was a crowd that loved them. Until the club promoters got involved and then that commercial aspect killed it.”

RENT-SEEKING

GUIDE had a very different starting point. Flat sharing in Bristol, the duo needed money. Their indie credibility was high, but what they were looking for was a more mainstream hit with some likelihood of revenue. The break-up of Facebook and the other big services had created an opportunity which many were hoping to capitalise on. But investment was a problem.

LEITH: “We wrote a lot of grant proposals. Goal was to use the money to build out decent code base. Pay for some servers that we could use to launch something bigger”.

Shilling pours the tea, while Leith passes me a slice of cake.

SHILLING: “It was a bit more principled than that. There was plenty of money for apps to help with social isolation. We thought maybe we could build something useful, tackle some social problems, work with a different demographic than we had before. But, yeah, we had our own goals too. We had to take what opportunities were out there.”

LEITH: “My mum had been attending this Memory Skills group. Passing around old photos and memorabilia to get people talking and reminiscing. We thought we could create something digital.”

SHILLING: “We managed to land a grant to explore the idea. We figured that there was a demographic that had spent time connecting not around the high street or the local football club. But with stuff they’d all been doing online. Streaming the same shows. Revisiting old game worlds. We thought those could be really useful touch points and memory triggers too. And not everyone can access some of the other services.”

LEITH: “Mum could talk for hours about Skyrim and Fallout”.

SHILLING: “So we prototyped some social spaces based around that kind of content. It was during the user testing that we had the real eye-opener”.

“Memory Box” by judy_and_ed. CC-BY-NC. https://www.flickr.com/photos/65924740@N00/18516079841/

ITERATIONS

The first iterations of the app that ultimately became GUIDE were pretty rough. Shilling and Leith have been pretty open about their early failures.

LEITH: “The first iteration was basically a Twitch knock-off. People could join the group remotely, chat to each other and watch whatever the facilitator decided to stream.”

SHILLING: “Engagement was low. We didn’t have cash to license a decent range of content. The facilitators needed too much training on the streaming interface and real-time community management.”

LEITH: “I then tried getting a generic game engine to boot up old game worlds, so we could run tours. But the tech was a nightmare to get working. Basically needed different engines for different games”

SHILLING: “Some of the users loved it, mainly those that had the right hardware and were already into gaming. But it didn’t work for most people. And again…I…we were worried about licensing issues”

LEITH: “So we started testing a customised, open source version of Yap. Hosted chat rooms, time-limited rooms and content embedding…that ticked a lot of boxes. I built a custom index over the Internet Archive, so we could use their content as embeds”.

SHILLING: “There’s so much great stuff that people love in the Internet Archive. At the time, not many services were using it. Just a few social media accounts. So we made using it a core feature. It neatly avoided the licensing issues. We let the alpha testers run with the service for a while. We gave them and the memory service facilitators tips on hosting their own chats. And basically left them to it for a few weeks. It was during the later user testing that we discovered they were using it in different ways than we’d expected.”

Instead of having conversations with their peer groups, the most engaged users were using it to chat with their families. Grandparents showing their grandchildren stuff they’d watched, listened to, or read when they were younger.

SHILLING: “They were using it to tell stories”

Surrounded by the bustle in the cafe, we pause to enjoy the tea and cake. Then Shilling gestures around the room.

SHILLING: “We came here one weekend. To get out of the city. Take some time to think. They have these volunteers here. One in every room of the house. People just giving up their free time to answer any questions you might have as you wander around. Maybe, point out interesting things you might not have noticed? Or, if you’re interested, tell you about some of things they love about the place. It was fascinating. I realised that’s how our alpha testers were using the prototype…just sharing their passions with their family.”

LEITH: “So this is where GUIDE was born. We hashed out the core features for the next iteration in a walk through the grounds. Fantastic cake, too.”

“Walkman and mix tapes” by henry… CC-BY-NC-ND. https://www.flickr.com/photos/henrybloomfield/5136897807/

MEMORY PALACE

The familiar, core features of GUIDE have stayed roughly the same since that day.

Anyone can become a Guide and create a Room which they can use to curate and showcase small collections of public domain or openly licensed content. But no more than seven videos, photos, games or whatever else you can embed from the Internet Archive. Room contents can be refreshed once a week.

Visitors are limited to a maximum of five people. Everyone else gets to wait in a lobby, with new visitors being admitted every twenty minutes. Audio feeds only from the Guides, allowing them to chat to Visitors. But Visitors can only interact with Guides via a chat interface that requires building up messages — mostly questions — from a restricted set of words and phrases that can be tweaked by Guides for their specific Room. Each visitor limited to one question every five minutes.

LEITH: “The asymmetric interface, lobby system and cool-down timers were lifted straight from games. I looked up the average number of grandchildren people had. Turns out it’s about five, so we used that to size Rooms. The seven item limit was because I thought it was a lucky number. We leaned heavily on the Internet Archive’s bandwidth early on for the embeds, but we now mirror a lot of stuff. And donate, obviously.”

SHILLING: “The restricted chat interface has helped limit spamming and moderation. No video feeds from Guides means that the focus stays on the contents of the Room, not the host. Twitch had some problematic stuff which we wanted to avoid. I think it’s more inclusive.”

LEITH: “Audio only meant the ASMR crowd were still happy though”.

Today there are tens of thousands of Rooms. Shilling shows me a Room where the Guide gives tours of historical maps of Bath, mixing in old photos for context. Another, “Eleanor’s Knitting Room” curates knitting patterns. The Guide alternating between knitting tips and cultural critiques.

Leith has a bookmarked collection of retro-gaming Rooms. Doom WAD teardowns and classic speed-runs analysis for the most part.

In my own collection, my favourite is a Room showing a rota of Japanese manhole cover designs, the Guide an expert on Japanese art and infrastructure. I often have this one on a second screen whilst writing. The lobby wait time is regularly over an hour. Shilling asks me to share that one with him.

LEITH: “There are no discovery tools in Guide. That was deliberate from the start. Strictly no search engine. Want to find a Room? You’ll need to be invited by a Guide or grab a link from a friend”.

SHILLING: “Our approach has been to allow the service to grow within the bounds of existing communities. We originally marketed the site to family groups, and an older demographic. The UK and US were late adopters, the service was much more popular elsewhere for a long time. Things really took off when the fandoms grabbed hold of it.”

An ecosystem of recommendation systems, reviews and community Room databases has grown up around the service. I asked whether that defeated the purpose of not building those into the core app?

LEITH: “It’s about power. If we ran those features then it would be our algorithms. Our choice. We didn’t want that.”

SHILLING: “We wanted the community to decide how to best use GUIDE as social glue. There’s so many more creative ways in which people interact with and use the platform now”.

The two decline to get into discussion of the commercial success of GUIDE. It’s well-documented that the two have become moderately wealthy from the service. More than enough to cover that rent in the city centre. Shilling only touches on it briefly:

SHILLING: “No ads and a subscription-based service has kept us honest. The goal was to pay the bills while running a service we love. We’ve shared a lot of that revenue back with the community in various ways”.

Photo by Jacques Bopp on Unsplash. https://unsplash.com/photos/pvtA7r3jBTc

SLOW WEB

GUIDE can be situated within the Slow Web movement. There are a host of services offering quieter online experiences. Videos of walks through foreign cities. Live feeds from orbiting satellites and VR outposts mounted on marine buoys and in wild locations around the world. Social features as bolt-on features. But GUIDE’s focus on the curation of small spaces, story telling and shared discovery sets it apart.

Of course, all of this was possible before. YouTube and Twitch supported broadcasts and streaming for years, and many people used them in similar ways. But the purposeful design of a more dedicated interface highlights how constraints can shape a community and spark creativity. Removal of many of the asymmetries inherent in the design of those older platforms has undoubtedly helped.

While we finished the last of the tea, I asked them what they thought made the service successful.

SHILLING: “You can find, watch and listen to any of the material that people are sharing in GUIDE on the open web. Just Google it. But I don’t think people just want more content. They want context. And it’s people that bring that context to life. You can find Rooms now where there’s a relay of Guides running 24×7. Each Guide highlighting different aspects of the exact same collection. Costume design, narrative arcs and character bios. Historical and cultural significance. Personal stories. There’s endless context to discover around the same content. That’s what fandoms have understood for years.”

LEITH: “People just like stories. We gave them a place to tell them. And an opportunity to listen.”

Can the regulation of hazardous substances help us think about regulation of AI?

This post is a thought experiment. It considers how existing laws that cover the registration and testing of hazardous substances like pesticides might be used as an analogy for thinking through approaches to regulation of AI/ML.

As a thought experiment it’s not a detailed or well-researched proposal, but there are elements which I think are interesting. I’m interested in feedback and also pointers to more detailed explorations of similar ideas.

A cursory look at substance registration legislation in the EU and US

Under EU REACH legislation, if you want to manufacture or import large amounts of potentially hazardous chemical substances then you need to register with the ECHA. The registration process involves providing information about the substance and its potential risks.

“No data no market” is a key principle of the legislation. The private sector carries the burden of collecting data and demonstrating safety of substances. There is a standard set of information that must be provided.

In order to demonstrate safety, companies may need to carry out animal testing. The legislation has been designed to minimise unnecessary animal testing. While there is an argument that all such testing is unnecessary, current practice requires testing in some circumstances. Where testing is not required, other data sources can be used. But controlled animal tests are the proof of last resort if no other data is available.

To further minimise the need to carry out tests on animals, the legislation is designed to encourage companies registering the same (or similar) substances to share data with one another in a “fair, transparent and non-discriminatory way”. There is detailed guidance around data sharing, including a legal framework and guidance on cost sharing.

The coordination around sharing data and costs is achieved via a SIEF (PDF), a loose consortium of businesses looking to register the same substance. There is guidance to help facilitate the creation of these sharing forums.

The US has a similar set of laws which also aim to encourage sharing of data across companies to minimise animal testing and other regulatory burdens. The practice of “data compensation” provides businesses with a right to charge fees for use of data. The legislation doesn’t define acceptable fees, but does specify an arbitration procedure.

The compensation, along with some exclusive use arrangements, is intended to avoid discouraging original research, testing and registration of new substances. Companies that bear the costs of developing new substances can have exclusive use for a period, and can expect some compensation for the research costs of bringing them to market. Later manufacturers can benefit from the safety testing results, but have to pay for the privilege of access.

Summarising some design principles

Based on my reading, I think both sets of legislation are ultimately designed to:

  • increase safety of the general public, by ensuring that substances are properly tested and documented
  • require companies to assess the risks of substances
  • take an ethical stance on reducing unnecessary animal testing and other data collection by facilitating data sharing
  • require companies to register their intention to manufacture or import substances
  • enable companies to coordinate in order to share costs and other burdens of registration
  • provide an arbitration route if data is not being shared
  • avoid discouraging new research and development by providing a cost sharing model to offset regulatory requirements

Parallels to AI regulation

What if we adopted a similar approach towards the regulation of AI/ML?

When we think about some of the issues with large scale, public deployment of AI/ML, I think the debate often highlights a variety of needs, including:

  • greater oversight about how systems are being designed and tested, to help understand risks and design problems
  • understanding how and where systems are being deployed, to help assess impacts
  • minimising harms to either the general public, or specific communities
  • thorough testing of new approaches to assess immediate and potential long-term impacts
  • reducing unnecessary data collection that is otherwise required to train and test models
  • exploration of potential impacts of new technologies to address social, economic and environmental problems
  • to continue to encourage primary research and innovation

That list is not exhaustive. I suspect not everyone will necessarily agree on the importance of all elements.

However, if we look at these concerns and the principles that underpin the legislation of hazardous substances, I think there are a lot of parallels.

Applying the approach to AI

What if, for certain well-defined applications of AI/ML such as facial recognition, autonomous vehicles, etc, we required companies to:

  • register their systems, accompanied by a standard set of technical, testing and other documentation
  • carry out tests of their system using agreed protocols, to encourage consistency in comparison across testing
  • share data, e.g. via a data trust or similar model, in order to minimise the unnecessary collection of data and to facilitate some assessment of bias in training data
  • demonstrate and document the safety of their systems to agreed standards, allowing public and private sector users of systems and models to make informed decisions about risks, or to support enforcement of legal standards
  • coordinate to share costs of collecting and maintaining data, conducting tests of standard models, etc
  • and, perhaps, after a period, accept that trained models would become available for others to reuse, similarly to how medicines or other substances may ultimately be manufactured by other companies

In addition to providing more controls and assurance around how AI/ML is being deployed, an approach based on facilitating collaboration around collection of data might help nudge new and emerging sectors into a more open direction, right from the start.

There are a number of potential risks and issues which I will acknowledge up front:

  • sharing of data about hazardous substance testing doesn’t have to address data protection. But this could be factored in to the design, and some uses of AI/ML draw on non-personal data
  • we may want to simply ban, or discourage use of some applications of AI/ML, rather than enable it. But at the moment there are few, if any controls
  • the approach might encourage collection and sharing of data which we might otherwise want to restrict. But strong governance and access controls, via a data trust or other institution might actually raise the bar around governance and security, beyond that which individual businesses can, or are willing to achieve. Coordination with a regulator might also help decide on how much is “enough” data
  • the utility of data and openly available models might degrade over time, requiring ongoing investment
  • the approach seems most applicable to uses of AI/ML with similar data requirements. In practice there may be only a small number of these, or data requirements may vary enough to limit the benefits of data sharing

Again, not an exhaustive list. But as I’ve noted, I think there are ways to mitigate some of these risks.

Let me know what you think, what I’ve missed, or what I should be reading. I’m not in a position to move this forward, but welcome a discussion. Leave your thoughts in the comments below, or ping me on twitter.

When can we expect more from data portability?

We’re at the end of week 5 of 2020, of the new decade and I’m on a diet.

I’m back to using MyFitnessPal again. I’ve used it on and off for the last 10 years whenever I’ve decided that now is the time to be more healthy. The sporadic, but detailed, history of data collection around my weight and eating habits marks out each of the times when this time was going to be the time when I really made a change.

My success has been mixed. But the latest diet is going pretty well, thanks for asking.

This morning the app chose the following feature to highlight as part of its irregular nudges for me to upgrade to premium.

Downloading data about your weight, nutrition and exercise history is a premium feature of the service. This gave me pause for thought for several reasons.

Under UK legislation, and for as long as we maintain data adequacy with the EU, I have a right to data portability. I can request access to any data about me, in a machine-readable format, from any service I happen to be using.

The company that produce MyFitnessPal, Under Armour, do offer me a way to exercise this right. It’s described in their privacy policy, as shown in the following images.

Note about how to exercise your GDPR rights in MyFitnessPal
Data portability in MyFitnessPal

Rather than enabling this access via an existing product feature, they’ve decided to make me and everyone else request the data directly. Every time I want to use it.

This might be a deliberate decision. They’re following the legislation to the letter. Perhaps it’s a conscious decision to push people towards a premium service, rather than make access easy by default. Their user base is international, so they don’t have to offer this feature to everyone.

Or maybe it’s the legal and product teams not looking at data portability as an opportunity. That’s something that the ODI has previously explored.

I’m hoping to see more exploration of the potential benefits and uses of data portability in 2020.

I think we need to re-frame the discussion away from compliance and on to commercial and consumer benefits. For example, by highlighting how access to data contributes to building ecosystems around services, to help retain and grow a customer base. That is more likely to get traction than a continued focus on compliance and product switching.

MyFitnessPal already connects into an ecosystem of other services. A stronger message around portability might help grow that further.  After all, there are more reasons to monitor what you eat than just weight loss.

Clearer legislation, and stronger guidance from organisations like the ICO and industry regulators describing how data portability should be implemented, would also help. Wider international adoption of data portability rights wouldn’t hurt either.

There’s also a role for community driven projects to build stronger norms and expectations around data portability. Projects like OpenSchufa demonstrate the positive benefits of coordinated action to build up an aggregated view of donated, personal data.

But I’d also settle with a return to the ethos of the early 2010s, when making data flow between services was the default. Small pieces, loosely joined.

If we want the big platforms to go on a diet, then they’re going to need to give up some of those bytes.

Do data scientists spend 80% of their time cleaning data? Turns out, no?

It’s hard to read an article about data science or really anything that involves creating something useful from data these days without tripping over this factoid, or some variant of it:

Data scientists spend 80% of their time cleaning data rather than creating insights.

Or

Data scientists only spend 20% of their time creating insights, the rest wrangling data.

It’s frequently used to highlight the need to address a number of issues around data quality, standards and access. Or as a way to sell portals, dashboards and other analytic tools.

The thing is, I think it’s a bullshit statistic.

Not because I think there aren’t improvements to be made to how we access and share data. Far from it. My issue is more with how that statistic is framed, and with the fact that it’s endlessly parroted without any real insight.

What did the surveys say?

I’ve tried to dig out the underlying survey or source of that factoid, to see if there’s more context. While the figure is widely referenced, it’s rarely accompanied by a link to a survey or results.

Amusingly this IBM data science product marketing page cites this 2018 HBR blog post which cites this 2017 IBM blog which cites this 2016 Crowdflower survey. Why don’t people link to original sources?

In terms of sources of data on how data scientists actually spend their time, I’ve found two ongoing surveys.

So what do these surveys actually say?

  • Crowdflower, 2015: “66.7% said cleaning and organizing data is one of their most time-consuming tasks”.
    • They didn’t report estimates of time spent
  • Crowdflower, 2016: “What data scientists spend the most time doing? Cleaning and organizing data: 60%, Collecting data sets: 19% …”.
    • It only reaches ~80% of time spent if you also lump in collecting data
  • Crowdflower, 2017: “What activity takes up most of your time? 51% Collecting, labeling, cleaning and organizing data”.
    • Less than 80%, and it now also includes tasks like labelling of data
  • Figure Eight, 2018: Doesn’t cover this question.
  • Figure Eight, 2019: “Nearly three quarters of technical respondents (73.5%) spend 25% or more of their time managing, cleaning, and/or labeling data”.
    • That’s pretty far from 80%!
  • Kaggle, 2017: Doesn’t cover this question.
  • Kaggle, 2018: “During a typical data science project, what percent of your time is spent engaged in the following tasks? ~11% Gathering data, 15% Cleaning data …”.
    • Again, much less than 80%

Only the Crowdflower survey reports anything close to 80%, but you need to lump in actually collecting data as well.
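
As a quick sanity check, here’s a minimal sketch that tallies the figures quoted above. The numbers come straight from the survey quotes; the grouping into “cleaning” versus “collecting” is my own.

```python
# Figures as quoted from the surveys above; the grouping of tasks is my own.
surveys = {
    "Crowdflower 2016": {"cleaning_and_organizing": 60, "collecting": 19},
    "Kaggle 2018": {"cleaning": 15, "gathering": 11},
}

for name, tasks in surveys.items():
    cleaning_only = tasks.get("cleaning_and_organizing", tasks.get("cleaning", 0))
    with_collecting = cleaning_only + tasks.get("collecting", tasks.get("gathering", 0))
    print(f"{name}: cleaning alone ≈ {cleaning_only}%, "
          f"cleaning + collecting ≈ {with_collecting}%")

# Output:
# Crowdflower 2016: cleaning alone ≈ 60%, cleaning + collecting ≈ 79%
# Kaggle 2018: cleaning alone ≈ 15%, cleaning + collecting ≈ 26%
```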

Are there other sources? I’ve not spent too much time on it. But this 2015 bizreport article mentions another survey which suggests “between 50% and 90% of business intelligence (BI) workers’ time is spend prepping data to be analyzed”.

And an August 2014 New York Times article states that: “Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data”. But doesn’t link to the surveys, because newspapers hate links.

It’s worth noting that “Data Scientist” as a job title only started to really become a thing around 2009, although the concept of data science is older. So there may not be much more to dig up. If you’ve seen some earlier surveys, then let me know.

Is it a useful statistic?

So, looking at the figures, this looks to me like a bullshit statistic. Data scientists do a whole range of different types of task. If you arbitrarily label some of these as analysis and others not, then you can make them add up to 80%.

But that’s not the only reason why I think it’s a bullshit statistic.

Firstly, there’s the implication that cleaning and working with data is somehow not worth the time of a data scientist. It’s “data janitor” work. And “It’s a waste of their skills to be polishing the materials they rely on”. Ugh.

Who, might I ask, is supposed to do this janitorial work?

I would argue that spending time working with data, to transform, explore and understand it better, is absolutely what data scientists should be doing. This is the medium they are working in.

Understand the material better and you’ll get better insights.

Secondly, I think data science use cases and workflows are a poor measure of how well data is published. Data science is frequently about doing bespoke analysis, which means creating and labelling unique datasets. No matter how cleanly formatted or standardised a dataset is, it’s likely to need some work.

A sculptor has different needs than a bricklayer. They both use similar materials. And they both create things of lasting value and worth.

We could measure the utility of published data using better yardsticks than the time spent on bespoke work.

Thirdly, it’s measuring the wrong thing. Actually, maybe some friction around the use of data is a good thing. Especially if it encourages you to spend more time understanding a dataset. Even more so if it in any way puts a brake on dumb uses of machine learning.

If we want the process of accessing, using and sharing data to be as frictionless as possible in a technical sense, then let’s make sure that is offset by adding friction elsewhere. For example, by adding checkpoints for reviews of ethical impacts. No matter how highly paid a data scientist is, the impacts of the poor use of data and AI can be much, much larger.

Don’t tell me that data scientists are spending too much time working with data and not enough time getting insights into production. Tell me that data scientists are increasingly spending 50% of their time considering the ethical and social impacts of their work.

Let’s measure that.