How can you help support the use of a dataset?

Getting the most value from data, whilst minimising its harmful impacts, is a community activity. Datasets need to be governed and published well. Most of that responsibility falls on the data publisher, because the choices they make shape data ecosystems.

But other people have a role to play too. Being a good data user means engaging with that process.

Helping others to find data, and to find the value in it, feels particularly important at the moment. During the pandemic many new datasets are becoming available. And there are lots of questions to be answered. Some of them can be answered through better use of data.

So, how can communities work together to support use of data?

There are a lot of different ways to explore that question. But there’s a framework called BASEDEF, created by the open source community, which I find helpful.

BASEDEF stands for Blog, Apply, Suggest, Extend, Document, Evangelize and Fix. It describes the different types of contributions that can support an open source project. It can also be applied to help organise a small team in doing that work. Here’s a handy cheat sheet.

But the framework can also be applied to the task of supporting the use of an openly licensed dataset. Let’s run through the framework with that in mind.


Blog

You can write about a dataset to help others to discover it. You can help explain the potential value of applying the dataset to specific problems. Or perhaps you can see some downsides that others should consider.

Writing about how a dataset has been useful to you, by describing how you’ve successfully applied it in a project, will also help others see its potential value.

Apply

You can show how a dataset can be used, by creating something with it. You might do a detailed analysis of the data, but some simpler contributions can also be helpful.

For example you might create a simple visualisation. Or write and publish some code that illustrates how the dataset can be accessed and used. You could publish a quick demo showing how the dataset can be imported and used in some frequently used tools and platforms.

At the moment everyone is a bit tired of charts and graphs. And I agree with the first principle in the visualisation design principles for the pandemic. But a helpful visualisation can do a range of things. Visualisation can be exploratory rather than explanatory.

A visualisation could support other people in understanding the shape of a dataset, to inform their analysis and interpretation of it. It can help identify outliers, gaps, or highlight some of the richness in the data. I’d recommend making it clear when you’re doing this type of visualisation, rather than trying to derive specific insights.
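Exploring the shape of a dataset doesn't have to start with a chart. As a rough sketch, the snippet below profiles a small CSV file and reports the row count, the columns and how many values are missing in each column. The sample data and the `profile` function are made up for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a downloaded CSV file.
SAMPLE_CSV = """area,date,cases
Bath,2020-04-01,12
Bath,2020-04-02,
Bristol,2020-04-01,30
Bristol,2020-04-02,1200
"""


def profile(csv_text):
    """Summarise the row count, columns and missing values per column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            # Treat empty or absent cells as gaps in the data.
            if value is None or value.strip() == "":
                missing[column] += 1
    return {
        "rows": len(rows),
        "columns": reader.fieldnames,
        "missing": dict(missing),
    }


summary = profile(SAMPLE_CSV)
print(summary)
```

A summary like this can be shared alongside a dataset so that others know where the gaps are before they start their own analysis.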

Suggest

Read the documentation. Download and explore the dataset. Ask questions. Give feedback.

Make suggestions to the publisher about changes they could make to publish the data better. Rather than just offer academic critique, be clear about how suggested changes will support your needs or those of your community.

Extend

The freedoms granted by an open licence allow you to enrich and improve a dataset.

Sometimes the smallest changes can have the most impact. Convert the data into other common or standard formats. Extract data from spreadsheets into CSV files. Convert data published in more complex formats or via APIs into simpler tabular data, to make it accessible to analysts and not just programmers.
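As an illustration of that kind of conversion, here is a minimal sketch that flattens nested JSON records into a CSV table. The response structure, the field names and the `flatten` and `to_csv` helpers are all assumptions for the example rather than any particular API.

```python
import csv
import io
import json

# Hypothetical API response: nested records, as an API might return them.
API_RESPONSE = json.dumps({
    "results": [
        {"id": "a1", "name": "Library", "location": {"lat": 51.38, "lon": -2.36}},
        {"id": "b2", "name": "Museum", "location": {"lat": 51.45, "lon": -2.59}},
    ]
})


def flatten(record, prefix=""):
    """Flatten nested dictionaries, so location.lat becomes location_lat."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


def to_csv(json_text):
    """Convert the (assumed) 'results' list into CSV text with a header row."""
    records = [flatten(r) for r in json.loads(json_text)["results"]]
    output = io.StringIO()
    # Column order is taken from the first record in this simple sketch.
    writer = csv.DictWriter(output, fieldnames=list(records[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return output.getvalue()


csv_text = to_csv(API_RESPONSE)
print(csv_text)
```

Publishing a short script like this alongside the converted file lets others repeat the conversion when the source data changes.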

Or maybe you can enrich a dataset by adding identifiers that will allow it to be linked to other sources. Do the work of merging with other datasets to bring in more context.
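A small sketch of that kind of enrichment: adding a shared code to each row via a lookup table, while keeping track of anything that couldn't be matched. The rows, the codes and the `add_identifiers` function are invented for illustration.

```python
# Hypothetical data: observations keyed by area name, and a lookup that
# gives each name a code which other datasets also use.
observations = [
    {"area": "Bath", "cases": 12},
    {"area": "Bristol", "cases": 30},
    {"area": "Atlantis", "cases": 1},
]
area_codes = {"Bath": "X001", "Bristol": "X002"}  # illustrative codes, not real ones


def add_identifiers(rows, lookup):
    """Return enriched copies of the rows, plus the names that had no match."""
    enriched, unmatched = [], []
    for row in rows:
        code = lookup.get(row["area"])
        if code is None:
            unmatched.append(row["area"])
        # Keep unmatched rows rather than silently dropping them.
        enriched.append({**row, "area_code": code})
    return enriched, unmatched


enriched, unmatched = add_identifiers(observations, area_codes)
print(enriched)
print("No code found for:", unmatched)
```

Reporting the unmatched rows is useful feedback in itself, as it often points to naming inconsistencies that the publisher could fix at source.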

The downside here is that if the original data changes your extended version will get out of date. If you can’t commit to keeping your version up to date, then be sure to share your code and document your methods.

Allow others to repeat the steps you’ve taken. And don’t forget to suggest the improvements to the publisher.

Document

Write additional documentation to fill in gaps where the publisher has not provided sufficient background or explanation. Explain technical concepts or academic terms to a non-specialist audience.

As a user of the data, you’re able to write that documentation from a perspective that reflects the needs and questions of your specific community and the kinds of questions you need to ask. The original publisher might not have all that context or understand those needs, so this work can be really helpful.

Good documentation can be a finding aid. There are structured ways that you can go about writing documentation, such as this tool for writing civic data guides. (Check out some of the examples).

Evangelise

Email people that might have a need for the data. Tweet about it to a wider community. Highlight it in a presentation. Talk about it over coffee (or Zoom).

Fix

If the dataset is collaboratively maintained then go ahead and fix errors and omissions. If you’re not confident about making a fix, then submit an error report. In addition to fixing errors you might be able to help verify that data is correct.

If a dataset isn’t collaboratively maintained then, when you find errors, be sure to flag them to the publisher and highlight the issue for others. Or consider publishing an enriched version with fixes applied.


This framework isn’t perfect. The name is a bit clunky for a start. But there are a few things that I like about it.

Firstly, it recognises that not all contributions need to be technical. There’s room for others to use different skills and in different ways.

Secondly, the elements overlap and reinforce one another. Writing documentation and blogging about how you’ve used a dataset helps to evangelise it. Enriching a dataset can help demonstrate in a practical way how a publisher can improve how data is published.

Finally, it serves to highlight some important aspects of community curation which aren’t always well supported in existing data platforms and portals. We can do better here.

If you’re interested in working on adapting this further then I’m happy to chat! It might be useful to have a cheat sheet that supports its application to data and more examples of how to do these different elements well.

Why is change discovery important for open data?

Change discovery is the process of identifying changes to a resource. For example, that a document has been updated. Or, in the case of a dataset, whether some part of the data has been amended, e.g. to add data, fill in missing values, or correct existing data. If we can identify that changes have been made to a dataset, then we can update our locally cached copies, re-run analyses or generate new, enriched versions of the original.

Any developer who is building more than a disposable prototype will be looking for information about the ongoing stability and change frequency of a dataset. Typical questions might be:

  • How often will a dataset get routinely updated and republished?
  • What types of data updates are anticipated? E.g. are only new records added, or might data be amended and removed?
  • How will the dataset, or parts of it, be version controlled?
  • How will changes to the dataset, or to parts of it (e.g. individual rows or objects), be flagged?
  • How will planned and unplanned updates and changes be communicated to users of the dataset?
  • How will data updates be published, e.g. will there be a means of monitoring for or accepting incremental updates, or just refreshed data downloads?
  • Are large scale changes to the data model expected, and if so over what timescale?
  • Are changes to the technical infrastructure planned, and if so over what timescale?
  • How will planned (and unplanned) service downtime, e.g. for upgrades, be notified and reported?

These questions span a range of levels: from changes to individual elements of a dataset, through to the system by which it is delivered. These changes will happen at different frequencies and will be communicated in different ways.

Some types of change discovery can be done after the fact, e.g. by comparing two versions of a dataset. But in practice this is an inefficient way to synchronize and share data, as the consumer needs to reconstruct a series of edits and changes that have already been applied by the publisher of the data. To efficiently publish and distribute data we need to be able to understand when changes have happened.
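To show what that after-the-fact reconstruction involves, here is a minimal sketch that compares two snapshots of a dataset. It assumes every row carries a stable identifier in an `id` column, which is an assumption of the example; without one, the comparison gets much harder.

```python
def diff_snapshots(old_rows, new_rows, key="id"):
    """Work out which rows were added, removed or amended between snapshots."""
    old = {row[key]: row for row in old_rows}
    new = {row[key]: row for row in new_rows}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "amended": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


# Hypothetical snapshots of the same dataset, downloaded at different times.
last_week = [
    {"id": "1", "value": "10"},
    {"id": "2", "value": "20"},
    {"id": "3", "value": "30"},
]
today = [
    {"id": "1", "value": "10"},
    {"id": "2", "value": "25"},
    {"id": "4", "value": "40"},
]

changes = diff_snapshots(last_week, today)
print(changes)
```

Note that the consumer has to hold both complete copies to do this, and still can't tell when or why each change was made. That is information only the publisher has.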

Some types of change, e.g. to data models and formats, will just break downstream systems if not properly advertised in advance. So it’s even more important to consider the impacts of these types of change.

A robust data infrastructure will include an appropriate change notification system for different levels of the system. Some of these will be automated. Some will be part of the process of supporting end users. For example:

  • changes to a row in a dataset might be flagged with a timestamp and a change notice
  • API responses might indicate the version of the object being retrieved
  • dataset metadata might include an indication of the planned frequency of publication and a timestamp for when the dataset was last modified
  • a data portal might include a calendar indicating when key datasets will be updated or a feed of recently updated or changed datasets
  • changes to the data model and the API used to deliver a dataset might be announced and discussed via a developer support forum

These might be implemented as technical features of the platform. But they might also be as simple as an email to users, or a public tweet.
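To make the first of those examples more concrete, here is a rough sketch of how a data user might take advantage of row-level timestamps to refresh only what has changed since their last sync. The `modified` column name and the sample rows are assumptions for the example.

```python
from datetime import datetime, timezone

# Hypothetical rows, each flagged with the time it was last modified.
rows = [
    {"id": "1", "value": "10", "modified": "2020-04-01T09:00:00+00:00"},
    {"id": "2", "value": "25", "modified": "2020-04-03T14:30:00+00:00"},
    {"id": "3", "value": "30", "modified": "2020-04-05T08:15:00+00:00"},
]


def changed_since(rows, last_sync):
    """Select the rows whose timestamp is later than the last sync time."""
    return [
        row for row in rows
        if datetime.fromisoformat(row["modified"]) > last_sync
    ]


last_sync = datetime(2020, 4, 2, tzinfo=timezone.utc)
to_refresh = changed_since(rows, last_sync)
print([row["id"] for row in to_refresh])
```

This only catches additions and amendments. Deleted rows need a separate signal, such as a change notice, which is one reason the questions about what types of update to expect matter.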

Versioning of data can also help data publishers improve the scalability of their infrastructure and reduce the costs of data publishing. For example, adding features to data portals that might let data users:

  • make API calls that will only return responses if data has been updated since the user last requested it, e.g. using HTTP Conditional GET. This can reduce bandwidth and load on the publisher by encouraging local caching of data
  • use a checksum and/or timestamps to detect whether bulk downloads have changed to reduce bandwidth
  • subscribe to machine-readable feeds of dataset level changes, to avoid the need for users to repeatedly re-download large datasets
  • subscribe to machine-readable feeds of new datasets, to facilitate mirroring of data across systems
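The first two items in that list can be sketched together. The example below is a simplified, server-side view of HTTP Conditional GET: the response carries an `ETag` (derived here from a checksum of the content, which is one common approach) and a `Last-Modified` timestamp, and a repeat request that presents either validator gets an empty 304 response if nothing has changed. The `respond` function and the sample data are hypothetical, and the sketch ignores details such as weak ETags and lists of ETags.

```python
import hashlib
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def make_etag(body: bytes) -> str:
    """Derive an ETag from a checksum of the content."""
    return '"' + hashlib.sha256(body).hexdigest() + '"'


def respond(request_headers, body, last_modified):
    """Return (status, headers, body) for a GET, honouring conditional headers."""
    etag = make_etag(body)
    headers = {
        "ETag": etag,
        "Last-Modified": format_datetime(last_modified, usegmt=True),
    }
    if_none_match = request_headers.get("If-None-Match")
    if_modified_since = request_headers.get("If-Modified-Since")
    if if_none_match is not None:
        # If-None-Match takes precedence over If-Modified-Since.
        unchanged = if_none_match == etag
    elif if_modified_since is not None:
        unchanged = last_modified <= parsedate_to_datetime(if_modified_since)
    else:
        unchanged = False
    if unchanged:
        return 304, headers, b""  # Not Modified: no body is sent
    return 200, headers, body


# Hypothetical dataset download and its modification time.
data_v1 = b"id,value\n1,10\n"
data_v2 = b"id,value\n1,10\n2,20\n"
modified = datetime(2020, 4, 1, 9, 0, 0, tzinfo=timezone.utc)

status_first, headers, _ = respond({}, data_v1, modified)
status_repeat, _, body_repeat = respond({"If-None-Match": headers["ETag"]}, data_v1, modified)
status_updated, _, _ = respond({"If-None-Match": headers["ETag"]}, data_v2, modified)
print(status_first, status_repeat, status_updated)
```

The user keeps the validators from their last download and sends them with the next request. The saving comes from the 304 response having no body, so an unchanged bulk download costs a few hundred bytes rather than the whole file.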

Supporting change notification and discovery, even if it’s just through documentation rather than more automated means, is an important part of engineering any good data platform.

I think it’s particularly important for open data (and other data that is liberally licensed) because these datasets are frequently copied, distributed and republished across different platforms. The ability to distribute a dataset, in different formats or with improvements and corrections, is one of the key freedoms that an open licence provides.

The downside to secondary publishing is that we end up with multiple copies of a dataset, some or all of which might be out of date, or have diverged from the original at different points in time.

Without robust approaches to provenance, change control and discovery, we run the risk of that data becoming out of date and leading to poor analyses and decision making. Multiple copies of the same dataset, while increasing ease of use, also increase friction by requiring users to find the original authoritative data among all the copies, or to figure out whether the copy available in their preferred platform is completely up to date with the original.

Documentation and linking to original sources can help mitigate those problems. But automating change notifications, to allow copies of datasets to be easily synchronised between platforms, at the point they are updated, is also important. I’ve not seen a lot of recent work on documenting these as best practices. I think there are still some gaps in the standards landscape around data platforms. So I’d be interested to hear of examples.

In the meantime, if you’re building a data platform, think about how you can enable users to more efficiently and automatically consume updated data.

And if you’re republishing primary data in other platforms, make sure you’re including detailed information and documentation about how and when you last refreshed the dataset. Ideally your copies will update automatically as the source changes. Linking to the open source code you ran to make the secondary copy will allow others to repeat that process if they need an updated version faster than you plan to produce one.
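As a rough illustration, the refresh information for a republished copy could be as simple as a small machine-readable record published alongside the data. The field names and URLs below are made up for the example and don't follow any particular metadata standard.

```python
import json
from datetime import datetime, timezone


def refresh_record(source_url, code_url, source_last_modified, refreshed_at):
    """Describe where a copy came from and when it was last refreshed."""
    return {
        "source": source_url,
        "source_last_modified": source_last_modified.isoformat(),
        "refreshed_at": refreshed_at.isoformat(),
        "processing_code": code_url,
    }


# Hypothetical values for a republished dataset.
record = refresh_record(
    source_url="https://example.org/original-dataset.csv",
    code_url="https://example.org/my-conversion-scripts",
    source_last_modified=datetime(2020, 4, 5, 8, 15, tzinfo=timezone.utc),
    refreshed_at=datetime(2020, 4, 6, 7, 0, tzinfo=timezone.utc),
)
print(json.dumps(record, indent=2))
```

Even a record this small lets a user judge whether your copy is fresh enough for their purposes, or whether they should go back to the source.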

How can publishing more data decrease the value of existing data?

Last month I wrote a post looking at how publishing new data might increase the value of existing data. I ended up listing seven different ways including things like improving validation, increasing coverage, supporting the ability to link together datasets, etc.

But that post only looked at half of the issue. What about the opposite? Are there ways in which publishing new data might reduce the value of data that’s already available?

The short answer is: yes, there are. But before jumping into that, let’s take a moment to reflect on the language we’re using.

A note on language

The original post was prompted by an economic framing of the value of data. I was exploring how the option value for a dataset might be affected by increasing access to other data. While this post is primarily looking at how option value might be reduced, we need to acknowledge that “value” isn’t the only way to frame this type of question.

We might also ask, “how might increasing access to data increase potential for harms?” As part of a wider debate around the issues of increasing access to data, we need to use more than just economic language. There’s a wealth of good writing about the impacts of data on privacy and society which I’m not going to attempt to precis here.

It’s also important to highlight that “increasing value” and “decreasing value” are relative terms.

Increasing the value of existing datasets will not seem like a positive outcome if your goal is to attempt to capture as much value as possible, rather than benefit a broader ecosystem. Similarly, decreasing value of existing data, e.g. through obfuscation, might be seen as a positive outcome if it results in better privacy or increased personal safety.

Decreasing value of existing data

Having acknowledged that, let’s try and answer the earlier question. In what ways can publishing new data reduce the value we can derive from existing data?

Increased harms leading to retraction and reduced trust

Publishing new data always runs the risk of re-identification and the enabling of unintended inferences. While the impacts of these harms are likely to be most directly felt by both communities and individuals, there are also broader commercial and national security issues. Together, these issues might ultimately reduce the value of the existing data ecosystem in several ways:

  • Existing datasets may need to be retracted, have their scope changed, or have their circulation reduced in order to avoid further harm. Data privacy impact assessments will need to be updated as the contexts in which data is being shared and published change
  • Increased concerns over potential privacy impacts might lead organisations to choose not to increase access to similar or related datasets
  • Increased concerns might also lead communities and individuals to reduce the amount of data they are willing to share with previously trusted sources

Overall this can lead to a reduction in the coverage, quality and linking of data across a data ecosystem. It’s likely to be one of the most significant impacts of poorly considered data releases. It can be mitigated through proper impact assessments, consultation and engagement.

Reducing overall quality

Newly published data might be intended to increase coverage, enrich, link, validate or otherwise improve existing data. But it might actually have the opposite effect because it’s of poor quality. I’ve briefly touched on this in a previous post on fictional data.

Publication of poor quality data might be unintended. For example an organisation may just be publishing the data it has to help address an issue, without properly considering or addressing underlying problems with it. Or a researcher may publish data that contains honest mistakes.

But publication of poor quality data might also be deliberate. For example as spam or misinformation intended to “poison the well”.

More subtly, practices like p-hacking and falsification of data, which might be intended to have a short-term direct benefit to the publisher or author, might cause longer term issues by impacting the use of other datasets.

This is why understanding and documenting the provenance of data, monitoring of retractions, fixes and updates to data, and the ability to link analyses with datasets are all so important.

Creating unnecessary competition or increasing friction

Publishing new datasets containing new observations and data about an area or topic of interest can lead to positive impacts, e.g. by increasing confidence or coverage. But datasets are also competing with one another. The same types of data might be available from different sources, but under different licences, access arrangements, pricing, etc.

This competition isn’t necessarily positive. For example, the data ecosystem might not benefit as much from the network effects that follow from linking data because key datasets are not linked or cannot be used together. Incompatible and competing datasets can add friction across an ecosystem.

Building poor foundations

Data is often published as a means of building stronger data infrastructure for a sector, or to address a specific challenge. But if that data is poorly maintained or is not sustainably funded, then the energy that goes into building the communities, tools and other datasets around that infrastructure might be wasted.

That reduces the value of existing datasets which might otherwise have provided a better foundation to build upon, or whose quality is dependent on the shared infrastructure. While this issue is similar to the previous one about competition, its root causes and impacts are slightly different.

As I noted in my earlier post, I don’t think this is an exhaustive list and it can be improved by contributions. Leave a comment if you have any thoughts.

Exploring registration agencies as data institutions

A key focus for our research and delivery work at the ODI at the moment is exploring how to design sustainable and trustworthy data institutions. Data institutions are organisations that steward data on behalf of a community. They have a variety of legal forms, roles and purposes.

Yesterday I wrote (again!) about identifiers and specifically, how different communities have been designing and using identifier systems within their business and data ecosystems. In that post I provided an outline of centralised and federated models for assigning identifiers. Both of those models rely on organisations that are known as registration agencies, registration authorities or registrars.

In this post, I’m going to briefly explore the role of registration agencies as a specific form of data institution.

What problem are registration agencies solving?

Organisations working within the same sector, whether they are publishing books, shipping cargo, manufacturing cars or streaming media, need to be able to consistently identify things. Which book has been sold? Where did this cargo container come from? When was this car manufactured? Which artist produced this song?

Whether a group of organisations are competing with one another, providing services or funding to each other, or collaborating as part of a supply chain, they need to be able to refer to the physical and digital objects, people, places and things that are core to their businesses.

Consistent, unique identifiers are one of the building blocks of data infrastructure. As I described in my previous blog post, there are different ways to create identifiers, but a common pattern is to use a registration agency as a central point of coordination.

Registration agencies fulfill the role of an independent, cross-industry organisation responsible for assigning and managing identifiers for those things of shared interest.

What data does a registration agency steward?

The core role of a registration agency is to govern the identifier scheme. That will involve deciding on details such as the syntax and rules for constructing identifiers, how they are assigned and by whom. It will also manage how the scheme evolves over time in order to support the changing needs of its community. Identifier schemes are standards for data and need to be maintained over the long term.

A registration agency might directly create and assign identifiers at the request of its community. Or it might delegate that activity to other organisations. Depending on the specifics of the identifier scheme, the agency may only manage a small amount of data.

For example, the IFPI is the Registration Agency for the ISRC identifier used in the music industry since 1986. As an organisation, to create an ISRC for music you are publishing, you first apply for a registration code (a prefix used in the identifiers) from a national agency. You can then locally assign identifiers to your recordings. There is no requirement to register the individual codes with either IFPI or the national agency. There isn’t a central database of the identifiers. So for a long time the IFPI will likely only have had a small database listing the prefixes that had been assigned to specific organisations.

Other registration agencies capture more information about the things that are being identified. Organisations requesting an identifier either provide that data at the point of assignment or later deposit it with the agency. This seems to me to be a more common setup: having a central database supports a variety of additional use cases. For example, it can help answer some of the questions I posed above, e.g. when was this car manufactured?

In 2016, IFPI worked with a vendor called SoundExchange to launch a search engine and database, although this is not a complete source of all the data. This presumably addressed needs not covered by the existing system.

So, the data stewarded by a registration agency may vary. It may range from basic administrative information about the identifier scheme to a much broader set of data deemed to be useful to the community. Registration agencies may be key data intermediaries in their sector and so fulfill a wider purpose. This is why there is often commercial interest in, and competing projects for, creating identifier schemes for specific industries: there is a lot of potential value to be captured.

How are they set up, and how do they approach sustainability?

In practice any community could work together to set up a common identifier scheme and an organisation to manage it. It just needs a shared understanding of the value of common identifiers and/or a common registry. For example, ZooBank and the LSID in the biosciences. Or the role of the IEEE in managing identifiers in the electronics industry.

Existing data intermediaries may branch out into launching identifier schemes to support aggregation and distribution of other data. For example, Refinitiv’s PermId.

Governments also often set up registers and organisations to steward them. For example, Companies House in the UK. Registers frequently address a different set of needs, but assigning identifiers is frequently part of the task of maintaining a register.

Governments can create registers and registration agencies whenever they see fit. As can commercial organisations and community initiatives, given sufficient agreement, funding and resources.

A fourth approach to starting a registration agency is via ISO. Some identifier schemes end up being published as international standards. According to ISO policy, if a new standard identifier is going to require a registration process, then ISO will appoint an organisation as the official registration authority for that standard. This creates a monopoly situation so there is a process of review of the proposed approach, the agency and their approach to sustainability.

ISO publish a list of registration agencies for ISO standards. It includes IFPI as the agency for the ISRC standard.

Registration agencies can charge fees for providing the registration services. But ISO requires those to be done on a cost recovery basis only. Approval for the charging of fees requires an additional level of review within ISO. But an agency might provide other supporting services.

Looking across some of the ISO appointed authorities, many appear to charge fees for registration both at the point of assignment of an identifier and on an annual basis. Many also seem to offer additional services and/or operate on a membership basis.

Different approaches to governance

From my reading so far, it seems that registration agencies supporting identifier schemes that are part of the public sector, commercial or community initiatives tend to be more centralised.

Looking across the ISO nominated registration agencies, these tend to use a federated assignment approach, similar to the IFPI, where much of the work is delegated to national agencies, with the primary agency mainly acting as the custodian of the overall scheme and a point of coordination. The primary registration agency might also be a fallback for circumstances where a national agency hasn’t been appointed.

This country based approach makes sense for international standards: national agencies can work more closely with their communities.

Another example of this approach is the International Standard Name Identifier (ISNI), which is governed by the ISNI International Agency, which appears to have been set up specifically for this purpose. Its work is delegated to a long list of specific assignment agencies, one of which is the British Library. As it happens, the British Library fulfills a similar role for a number of identifier schemes. This suggests that long-term sustainability for the identifier scheme and the primary registration agency is related to the sustainability of a broader set of organisations which might be acting as a national registration agency only as part of their operations.

One slightly different approach to governance is that of the DOI Foundation, which is the ISO appointed registration agency for DOI identifiers. DOIs can be assigned to a very broad category of different things and so, while the Foundation does delegate to other agencies, these aren’t along national lines. Instead there are different DOI registration agencies for different communities and purposes.

One example is CrossRef which works in the publishing industry, another is EIDR which operates in the entertainment industry. Both are covered by common rules published by the DOI Foundation which outline acceptable business models, roles and responsibilities.

While the individual agencies run their own technical platforms, the DOI Foundation also provides some common technical infrastructure to support its registration agencies and enable long-term persistence of the identifiers. This common infrastructure was moved to a separate not-for-profit in 2014, apparently as a means to increase trust.

How do different communities create unique identifiers?

Identifiers are part of data infrastructure. They play an important role, helping to publish, structure and link together data. Identifiers are boundary objects that cross communities. That means they need to be well-documented in order to be most useful.

Understanding how identifiers are created, assigned and governed can help us think through how to strengthen our data infrastructure. With that in mind, let’s take a quick tour of how different communities and systems have created identifier systems to help to uniquely refer to different digital and physical objects.

The simplest way to generate identifiers is by a serial number: a steadily increasing number that is assigned to whatever you need to identify next. This is the approach used in most internal databases as well as some commonly encountered public identifiers.

For example the Ordnance Survey TOID identifier is a serial number that looks like this: osgb1000006032892. UPRNs are similar.

Serial numbers work well when you have a single organisation and/or system generating the identifiers. They’re simple to implement, but can have their downsides, especially when they’re shared with others.

Some serial numbering systems include built-in error checking to deal with copying errors, using a check digit. Examples include the CAS registry number for identifying chemicals, and the basic form of the ISSN for identifying academic journals.
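To show how a check digit catches copying errors, here is a minimal sketch of the ISSN calculation: the first seven digits are weighted from 8 down to 2, summed, and the check character is chosen so that the total is divisible by 11, with "X" standing for 10. The function names are my own.

```python
def issn_check_digit(first_seven: str) -> str:
    """Compute the check character for the first seven digits of an ISSN."""
    if len(first_seven) != 7 or not (first_seven.isascii() and first_seven.isdigit()):
        raise ValueError("expected the first seven digits of an ISSN")
    # Weights run 8, 7, 6, 5, 4, 3, 2 from left to right.
    total = sum(int(d) * w for d, w in zip(first_seven, range(8, 1, -1)))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)


def is_valid_issn(issn: str) -> bool:
    """Check that an ISSN such as '0378-5955' has a consistent check character."""
    compact = issn.replace("-", "").upper()
    if len(compact) != 8:
        return False
    body, check = compact[:7], compact[7]
    if not (body.isascii() and body.isdigit()):
        return False
    return issn_check_digit(body) == check


print(is_valid_issn("0378-5955"))  # well-formed
print(is_valid_issn("0387-5955"))  # two digits transposed when copying
```

The second call fails validation because swapping two adjacent digits changes the weighted sum, which is exactly the kind of copying error the scheme is designed to catch.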

[Image: the bar code form of an ISSN]

As we can see in the bar code form of the ISSN shown above, identifiers often have more structure to them. And they may not be assigned as a simple serial number.

The second way of providing unique identifiers is using a name or code. These are typically still assigned by a central authority, sometimes known as a registration agency, but they are constructed in different ways.

Identifiers for geographic locations typically rely on administrative regions or other areas to help structure identifiers. For example the statistics community in the EU created the NUTS codes to help identify country sub-divisions in statistical datasets. These are assigned based on a hierarchy, beginning with the country and then moving to smaller geographic regions. Bath is UKK12, for example.

Postal codes are another geographically based set of codes. Both the UK and US postal codes use a geographical hierarchy. Only here the regions are those meaningful to how the Royal Mail and USPS manage their delivery operations, rather than being administratively defined by the government.

Hierarchies that are based on geography and/or organisational structures are common patterns in identifiers. Existing hierarchies provide a handy way to partition up sets of things for identification purposes.

The SWIFT code used in banking has a mixture of organisational and geographic hierarchies.


Encoding information about geography and hierarchy within codes can be useful. It can make them easier to validate. It also means you can manipulate them, e.g. by truncation, to find the identifiers for broader regions.

But encoding lots of information in identifiers also has its downsides. The main one is dealing with changes to administrative areas that mean the hierarchy has changed. Do you reassign all the identifiers?
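Truncation is simple enough to sketch in a couple of lines. Using the NUTS code for Bath mentioned earlier, and assuming the first two characters are the country code, the function below (a made-up helper) lists the codes for each broader region.

```python
def broader_codes(code: str, country_length: int = 2):
    """List the codes of each broader region, by repeatedly dropping a character."""
    return [code[:n] for n in range(len(code) - 1, country_length - 1, -1)]


print(broader_codes("UKK12"))  # ['UKK1', 'UKK', 'UK']
```

This only works because the hierarchy is encoded in the identifier itself. With a plain serial number you would need a separate lookup table to find the parent region.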

Assigning identifiers from a single, central authority isn’t always ideal. It can add coordination overhead which can be particularly problematic if you need to assign lots of identifiers quickly. So some identifier systems look at reducing the burden on that central authority.

A solution to this is to delegate identifier assignment to other organisations. There are two ways this is done in practice.

The first is what we might call federated assignment. This is where the registration agency shares the work of assigning identifiers with other organisations. A typical approach is to delegate the work of registration and assignment to national organisations, although other approaches are possible.

The delegation of work might be handled entirely “behind the scenes” as an operational approach. But sometimes it ends up being a feature of the identifier system.

For example the  (LEI) uses federated assignment where “Local Operating Units” do the work of assigning identifiers with. As you can see below, the identifiers for the LOUs become part of the identifiers they assign.

The International Standard Recording Code uses a similar approach with national agencies assigning identifiers.

Another approach to reducing dependence on, and coordination with, a single registration agency is to use what I’ll call “local assignment”. In this approach individual organisations are empowered to assign identifiers as they need them.

A simplistic approach to local assignment is “block allocation”: handing out blocks of pregenerated identifiers to organisations which can locally assign them. Blocks of IP addresses are handed out to Internet Service Providers. Similarly, blocks of UPRNs are handed out to local authorities.

Here the registration agency still generates the identifiers, but the assignment of identifier to “thing” is done locally. And, in the second case at least, a record of this assignment will still be shared with the agency.

A more common approach is to use “prefix allocation”. In this approach the registration agency assigns individual organisations a prefix within the identifier system. The organisation then generates new unique identifiers by combining their prefix with a locally generated suffix.

A suffix might be generated by adding a local serial number to the prefix. Or by some other approach. Again, after an identifier has been generated and assigned, it is commonly still centrally registered.
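
The division of labour in prefix allocation can be sketched in a few lines of Python. Everything here is hypothetical: the class names, the numeric prefixes starting at 1000, the "/" separator and the six-digit serial number are just choices made for the example, not the rules of any real identifier scheme.

```python
from itertools import count


class RegistrationAgency:
    """Hypothetical central agency: hands out prefixes and keeps a register."""

    def __init__(self):
        self._prefixes = count(1000)
        self.register = {}  # identifier -> description of the thing

    def allocate_prefix(self) -> str:
        return str(next(self._prefixes))

    def record(self, identifier: str, description: str) -> None:
        if identifier in self.register:
            raise ValueError(f"{identifier} is already registered")
        self.register[identifier] = description


class Registrant:
    """Hypothetical organisation that mints identifiers under its own prefix."""

    def __init__(self, agency: RegistrationAgency):
        self.agency = agency
        self.prefix = agency.allocate_prefix()
        self._serial = count(1)

    def mint(self, description: str) -> str:
        # The suffix is a local serial number; no call to the agency is needed
        # to choose it, only to record the result afterwards.
        identifier = f"{self.prefix}/{next(self._serial):06d}"
        self.agency.record(identifier, description)
        return identifier


agency = RegistrationAgency()
publisher = Registrant(agency)
print(publisher.mint("first thing"), publisher.mint("second thing"))
```

Because each registrant only ever generates suffixes under its own prefix, two registrants can use the same serial numbers without clashing.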

Many identifiers use this approach. The EIDR identifiers used in the entertainment industry look like this:

A GTIN looks like this:
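
As a rough guide to the structure: a GTIN combines a company prefix allocated through GS1, an item reference chosen by the company, and a final check digit. The Python sketch below computes and verifies that check digit using the GS1 weighting of 3 and 1, working from the right. It doesn’t try to split out the company prefix, because prefixes vary in length. The function names are my own.

```python
def gtin_check_digit(data_digits: str) -> int:
    """Check digit for a GTIN, given every digit except the last one."""
    total = sum(
        int(digit) * (3 if index % 2 == 0 else 1)  # rightmost digit weighs 3
        for index, digit in enumerate(reversed(data_digits))
    )
    return (10 - total % 10) % 10


def gtin_is_valid(gtin: str) -> bool:
    """True if the string is a GTIN-8, -12, -13 or -14 with a correct check digit."""
    if not (gtin.isascii() and gtin.isdigit()) or len(gtin) not in (8, 12, 13, 14):
        return False
    return gtin_check_digit(gtin[:-1]) == int(gtin[-1])


print(gtin_is_valid("4006381333931"))
```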

And the BIC code for shipping containers looks like this:
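
As a sketch of that structure in Python, assuming the ISO 6346 layout: a three-letter owner code (the prefix registered with the BIC), a one-letter equipment category, a six-digit serial number assigned by the owner, and a check digit. The function names are my own, and CSQU3054383 is used only as an example number.

```python
import string


def _letter_values() -> dict:
    # A=10, B=12, ... Z=38: count up from 10, skipping multiples of 11.
    values, number = {}, 10
    for letter in string.ascii_uppercase:
        if number % 11 == 0:
            number += 1
        values[letter] = number
        number += 1
    return values


LETTER_VALUES = _letter_values()


def container_check_digit(first_10: str) -> int:
    """Weight each character by a power of two, then reduce mod 11 (10 becomes 0)."""
    total = sum(
        (LETTER_VALUES[ch] if ch.isalpha() else int(ch)) * 2 ** position
        for position, ch in enumerate(first_10)
    )
    return total % 11 % 10


def split_container_code(code: str) -> dict:
    code = code.replace(" ", "").upper()
    well_formed = (
        len(code) == 11 and code.isascii()
        and code[:4].isalpha() and code[4:].isdigit()
    )
    if not well_formed:
        raise ValueError("expected four letters followed by seven digits")
    return {
        "owner": code[:3],
        "category": code[3],
        "serial": code[4:10],
        "check_digit": int(code[10]),
        "valid": container_check_digit(code[:10]) == int(code[10]),
    }


print(split_container_code("CSQU3054383"))
```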

One challenge with prefix allocation is ensuring that the rules for locally assigned suffixes work in every context where the identifier needs to appear. This typically means providing some rules about how suffixes are constructed.

The DOI system encountered problems because publishers were generating identifiers that didn’t work well when DOIs were expressed as URLs, due to the need for extra encoding. This made them tricky to work with.
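
To see why some suffixes are awkward, here is a small Python sketch that turns a DOI into a resolver URL using the standard library’s percent-encoding. The DOIs shown use a made-up prefix and suffixes, and the function name is my own.

```python
from urllib.parse import quote


def doi_to_url(doi: str) -> str:
    """Build a https://doi.org/ URL, percent-encoding unsafe characters."""
    # Keep "/" readable; everything else that is unsafe in a URL path
    # (such as "<", ">" and "#") is percent-encoded.
    return "https://doi.org/" + quote(doi, safe="/")


# Hypothetical DOIs: the first is URL-friendly, the second is not.
print(doi_to_url("10.1234/abc.123"))
print(doi_to_url("10.1234/a<b>#c"))
```

A suffix made only of letters, digits, dots and hyphens passes through unchanged, while one containing brackets or a hash ends up looking quite different from the DOI that was printed on the page.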

For a complicated example that mixes use of prefixes, country codes and check digits, we can look at the VIN, which is a unique identifier for vehicles. This 17-character code includes multiple segments but there are four competing standards for what the segments mean. Sigh.
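
As a very rough sketch, here is how the segments can be split in Python if we assume the ISO 3779 layout: a three-character World Manufacturer Identifier, a six-character descriptor section and an eight-character identifier section. Other standards interpret the middle positions differently, so treat this as one reading only. The function name is my own and the sample VIN is purely for illustration.

```python
def split_vin(vin: str) -> dict:
    """Split a 17 character VIN into WMI, VDS and VIS (ISO 3779 layout assumed)."""
    vin = vin.strip().upper()
    # VINs use letters and digits, but never I, O or Q.
    if len(vin) != 17 or not (vin.isascii() and vin.isalnum()):
        raise ValueError("a VIN is 17 letters and digits")
    if any(ch in "IOQ" for ch in vin):
        raise ValueError("a VIN never contains I, O or Q")
    return {
        "wmi": vin[:3],   # manufacturer prefix, including a region or country part
        "vds": vin[3:9],  # vehicle descriptor section
        "vis": vin[9:],   # vehicle identifier section
    }


print(split_vin("1HGCM82633A004352"))
```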

It’s possible to go further than just reducing dependency on registration agencies. They can be eliminated completely.

In distributed assignment of identifiers, anyone can create an identifier. Rather than requesting an identifier, or a prefix from a registration agency, these systems operate by agreeing rules for how unique identifiers can be constructed.

One approach to distributed assignment is to use an element of randomness to generate a unique identifier at the point in time it’s needed. The goal is to design an algorithm that uses a random number generator, and sometimes additional information like a timestamp or a MAC address, to construct an identifier where there is an extremely low chance that someone could have created the same identifier at the same moment in time (known as a “collision”).

This is how UUIDs work. You can play with generating some using online tools.
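
You can also generate them locally. For instance, Python’s standard library includes a uuid module. The short sketch below uses uuid4, which is built from random numbers, and uuid1, which combines a timestamp with a node identifier that is usually derived from the MAC address. The helper function name is my own.

```python
import uuid


def new_identifier() -> str:
    """Mint a random (version 4) UUID with no coordination with anyone else."""
    return str(uuid.uuid4())


print(new_identifier())
print(new_identifier())

# Version 1 mixes in the current time and a node identifier instead.
print(uuid.uuid1())
```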

Identifiers like UUIDs are cheap to generate and require no coordination beyond an agreed algorithm. They work very well when you just need a reliable way to assign an identifier to something with reasonable confidence that if our data is later combined then we won’t encounter any issues.

But what if we need to independently assign an identifier to the same thing? So that when we later combine our datasets, then our data will link up?

For this we need to use a hash-based identifier. A hash-based identifier takes some properties of the thing we want to identify and then uses them to construct an identifier. If we have a good enough algorithm then even if we do this independently we should end up constructing the same identifier.
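
A minimal Python sketch of the idea, assuming we have agreed to identify things by a small set of named properties. The normalisation rules (trimming whitespace and lower-casing), the use of SHA-256 and the function name are all choices made for this example rather than part of any real scheme.

```python
import hashlib
import json


def hash_identifier(properties: dict) -> str:
    """Derive an identifier from the properties of a thing."""
    # Normalise first, so trivial differences in how two people typed
    # the same properties do not lead to different identifiers.
    normalised = {
        key.strip().lower(): str(value).strip().lower()
        for key, value in properties.items()
    }
    # Serialise in a fixed key order so the same properties always
    # produce the same bytes, and therefore the same hash.
    canonical = json.dumps(normalised, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


mine = hash_identifier({"title": "Moby Dick", "author": "Herman Melville"})
yours = hash_identifier({"Author": " herman melville ", "Title": "MOBY DICK"})
print(mine == yours)
```

The hard part in practice is the normalisation step: the looser it is, the more likely two different things collapse into one identifier, and the stricter it is, the more likely the same thing ends up with two.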

This is sometimes referred to as creating a “digital fingerprint” of the object. It’s commonly used to identify copies of objects. For example, the approach is used to construct content identifiers in the IPFS system. And as part of YouTube’s Content ID system to manage copyright claims.

But hash-based identifiers don’t have to be used for managing content, they can be used as pure identifiers. The most complex example I’m familiar with is the InChI, which is a means of generating a unique identifier for chemicals by using information about their structure.

By using a consistent algorithm provided as open source software, chemists can reliably create identifiers for the same structures.

The SICI code used to identify academic papers was a hash-based system that used metadata about the publication to generate an identifier. However, in practice it was difficult to work with due to the variety of ways in which content was actually published and the variety of contexts in which identifiers needed to be generated.

Hash-based identifiers are very tricky to get right as you need a robust algorithm that is widely adopted. Those needing to generate identifiers will also need to be able to reliably access all of the information required to create the identifier. Variations in availability of metadata, object formats, etc. can all impact how well they work in practice.

I miss being able to look people in the eye

What even is time, anymore?

I’ve seen and made many variations of this joke across Slack, Twitter and meetings this week. Remote working and social isolation have disrupted all of our routines and left us feeling adrift. But, for those of us lucky enough to have good connectivity, we’re certainly not talking or seeing each other any less. I’ve ended several days this week hoarse from talking.

The number of people playing with avatars, virtual backgrounds and buying green screens speaks to the level of engagement with video meetings and chat. Of course, there’s also the memes.

By the way, Disney are sharing a nice line in backgrounds. But I have my own favourites.

In team catch-ups this week, a few people have remarked how, despite all the meetings and check-ins, they just didn’t feel as engaged. Key decisions or outcomes were not sinking in. People struggled to remember who was on a particular call. This isn’t surprising. Neither the general situation nor the medium we’re using is really great for focus and connection.

The comments have made me more conscious of the limitations of the software we’re using.

For example, one of the nice features of Zoom is the “gallery view” so you can see everyone on the call. Or at least until your call is so large that you end up with several pages of attendees. It makes it really easy to read the room when chairing. Contrast that with Hangouts, which doesn’t have the same feature. This makes it so much harder to gauge reactions in a discussion, identify people who want to raise questions, or even just catch when someone has had a connectivity problem.

General presence notifications are also a problem. In a drop-in meeting this week, it was only a little way into the call that I realised that we had 17 people in the discussion. That level of participation was so much easier to gauge when we were all sat around tables in the office kitchen.

We tried out Remo recently too. It has a cute office layout that facilitates break-out discussions and you can easily move between chats. I think it’s great for some types of meetings. But it didn’t create quite the same atmosphere for having drinks with the team as a raucous, messy hangout.

I think the thing that I’ve personally been struggling with is that you can’t look anyone in the eye on a video call.

Now, I’m usually terrible at looking people in the eye. In a conversation with me, you’ll find I’m typically looking around as I’m talking. It helps me think. Although when I’m listening, I’m much more attentive to others. But being able to look someone in the eye to read their reactions, look for agreement, or just to enjoy a joke is something that we can’t easily do at the minute. And I miss it.

Some people struggle with direct eye contact. Some people like the freedom to look away, fidget or play with a stress toy when listening. We’re all wired differently. Eye contact isn’t always necessary or desirable. But there’s lots of research exploring the effects of eye contact, which notes some potential impacts on memory and prosocial behaviour.

While tools like Zoom need to fix their security flaws before adding features, I’m hoping this period will lead to more user research and product development. So that we have much better and more secure tools. There’s plenty of room for innovation. Although like others I don’t think that attention correction is what we need. But I’d love to read more about interesting experiments with online presence and remote working tools.

It’s important to remember – as ever when we choose to make something digital – that many of these challenges are a fact of life for people with disabilities, who may be relying on remote participation in events and meetings.

In the meantime there’s a few things we can all do to improve our meetings. Choose the right tool. Find ways to stay in contact with everyone on the call. Take notes. Share key decisions afterwards (duh!)

And, if you’re using multiple monitors, maybe put the video call on the same desktop as your webcam. Or think about putting your webcam near your screen. Then we can at least glance in each other’s direction.

Quick tips for chairing remote meetings

There’s a growing set of useful resources and guidance to help people run better remote meetings. I’ve been compiling a list of a few. At the risk of repeating other, better advice, I’m going to write down some brief tips for running remote meetings.

For a year or so I was chairing fortnightly meetings of the OpenActive standards group. Those meetings were an opportunity to share updates with a community of collaborators, get feedback on working documents and have debates and discussion around a range of topics. So I had to get better at doing it. Not sure whether I did, but here’s a few things I learned.

I’ll skip over general good meeting etiquette (e.g. around circulating an agenda and working documents in advance), to focus on the remote bits.

  1. Give people time to arrive. Just because everyone is attending remotely doesn’t mean that everyone will be able to arrive promptly. They may be working through technical difficulties, for example. Build in a bit of deliberate slack time at the start of the meeting. I usually gave it around 5-10 minutes. As people arrive, greet them and let them know this is happening. You can then either chat as a group or people can switch to emails, etc while waiting for things to start.
  2. Call the meeting to order. Make it clear when the meeting is formally starting and you’ve switched from general chat and waiting for late arrivals. This will help ensure you have people’s attention.
  3. Use the tools you have as a chair. Monitor side chat. Monitor the video feeds to check to see if people look like they have something to say. And, most importantly, mute people that aren’t speaking but are typing or have lots of background noise. You can usually avoid the polite dance around asking people to do that, or suffering in silence, by using the option to mute people. Just tell them you’ve done that. I usually had Zoom meetings set up so that people were muted on entry.
  4. Do a roll call. Ask everyone to introduce themselves at the start. Don’t just ask everyone to do that, as they’ll talk over each other. Go through people individually and ask them to say hello or do an introduction. This helps with putting voices to names (if not everyone is on video), ensures that everyone knows how to mute/unmute and puts some structure to the meeting.
  5. Be aware of when people are connecting in different ways. Some software, like Zoom, allows people to join in several ways. Be aware of when you have people on phone and video, especially if you’re presenting material. Try to circulate links either before or during the meeting so they can see them.
  6. Use slides to help structure the meeting. I found that doing a screenshare of a set of slides for the agenda and key talking points helps to give people a sense of where you’re at in the meeting. So, for example if you have four items on your agenda, have a slide for each topic item. With key questions or decision points. It can help to focus discussion, keeps people’s attention on the meeting (rather than a separate doc) and gives people a sense of where you are. The latter is especially helpful if people are joining late.
  7. Don’t be afraid of a quick recap. If people join a few minutes late in the meeting, give them a quick recap of where you’re at and ask them to introduce themselves. I often did this if people joined a few minutes late, but not if they dropped in 30 minutes into a 1 hour meeting.
  8. Don’t be afraid of silence or directly asking people questions. Chairing remote meetings can be stressful and awkward for everyone. It can be particularly awkward to ask questions and then sit in silence. Often this is because people are worried about talking over each other. Or they just need time to think. Don’t be afraid of a bit of silence. Doing a roll call to ask everyone individually for feedback can be helpful if you want to make decisions. Check in on people who have not said anything for a while. It’s slow, but provides some order for everyone.
  9. Keep to time. I tried very hard not to let meetings over-run even if we didn’t cover everything. People have other events in their calendars. Video and phone calls can be tiring. It’s better to wrap up at a suitable point and follow up on things you didn’t get to cover than to have half the meeting drop out at the end.
  10. Follow-up afterwards. Make sure to follow up afterwards. Especially if not everyone was able to attend. For OpenActive we video the calls and share those as well as a summary of discussion points.

Those are all the things I tried to consciously get better at, and that I think helped things go more smoothly.