The Lego Analogy

I think Lego is a great analogy for understanding the importance of data standards and registers.

Lego have been making plastic toys and bricks since the late 40s. It took them a little while to perfect their designs, but since 1958 they’ve been manufacturing bricks in the same way, to the same basic standard. This means that you can take any two bricks manufactured over the last 59 years and they’ll fit together. As a company, they have extremely high standards around how their bricks are manufactured: only 18 bricks in every million produced are ever rejected.

A commitment to standards maximises the utility of all of the bricks that the company has ever produced.

Open data standards apply the same principle but to data. By publishing data using common APIs, formats and schemas, we can start to treat data like Lego bricks. Standards help us recombine data in many, many different ways.

There are now many more types and shapes of Lego brick than there used to be. The Lego standard colour palette has also evolved over the years. The types and colours of bricks have changed to reflect the company’s desire to create a wider variety of sets and themes.

If you look across all of the different sets that Lego have produced, you can see that some basic pieces are used very frequently. A number of these pieces are “plates” that help to connect other bricks together. If you ask a Lego Master Builder for a list of their favourite pieces, you’ll discover the same: elements that help you connect other bricks together in new and interesting ways are the most popular.

Registers are small, simple datasets that play the same role in the data ecosystem. They provide a means for us to connect datasets together, and a way to improve the quality and structure of other datasets. They may not be the most excitingly shaped data. Sometimes they’re just simple lists and tables. But they play a very important role in unlocking the value of other data.
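
To make that concrete, here’s a minimal sketch of the idea in Python, using pandas. The register and datasets are invented for illustration; the point is that two independently published datasets snap together because they both use the register’s identifiers:

```python
import pandas as pd

# A register: a small, authoritative list of identifiers and names.
# (Hypothetical data, for illustration only.)
countries = pd.DataFrame({
    "country_code": ["GB", "FR", "DE"],
    "country_name": ["United Kingdom", "France", "Germany"],
})

# Two independently published datasets that both use the register's codes.
sales = pd.DataFrame({"country_code": ["GB", "FR"], "sales": [120, 80]})
population = pd.DataFrame({
    "country_code": ["GB", "DE"],
    "population": [67_000_000, 83_000_000],
})

# Because both datasets share the register's identifiers, they fit
# together like bricks: join each of them onto the register.
combined = (countries
            .merge(sales, on="country_code", how="left")
            .merge(population, on="country_code", how="left"))
print(combined)
```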

So there we have it, the Lego analogy for standards and registers.

Under construction

It’s been a while since I posted a more personal update here. But, as I announced this morning, I’ve got a new job! I thought I’d write a quick overview of what I’ll be doing and what I hope to achieve.

I’ve been considering giving up freelancing for a while now. I’ve been doing it on and off since 2012, when I left Talis. Freelancing has given me a huge amount of flexibility to take on a mixture of different projects. Looking back, there are a lot of projects I’m really proud of. I’ve worked with the Ordnance Survey, the British Library and the Barbican. I helped launch a startup which is now celebrating its fifth birthday. And I’ve had far too much fun working with the ONS Digital team.

I’ve also been able to devote time to helping lead a plucky band of civic hackers in Bath. We’ve run free training courses, built an energy-saving application for schools and mapped the city. Amongst many other things.

I’ve spent a significant amount of time over the last few years working with the Open Data Institute. The ODI is five and I think I’ve been involved with the organisation for around 4.5 years. Mostly as a part-time associate, but also for a year or so as a consultant. It turned out that wasn’t quite the right role for me, hence the recent dive back into freelancing.

But over that time, I’ve had the opportunity to work on a similarly wide-ranging set of projects. I’ve researched how election data is collected and used, and learnt about weather data. I’ve helped to create guidance around open identifiers, licensing and open data policies, and explored ways to direct organisations on their open data journey. I’ve also provided advice and support to startups, government and multi-national organisations. That’s pretty cool.

I’ve also worked with an amazing set of people. Some of those people are still at the ODI and others have now moved on. I’ve learnt loads from all of them.

I was pretty clear about what type of work I wanted to do in a more permanent role. Firstly, I wanted to take on bigger projects: there’s only so much you can do as an independent freelancer. Secondly, I wanted to work on “data infrastructure”. While collectively we’ve only just begun thinking through the idea of data as infrastructure, looking back over my career it’s a useful label for the types of work I’ve been doing, the majority of which has involved looking at applications of data, technology, standards and processes.

I realised that the best place for me to do all of that was at the ODI. So I’ve seized the opportunity to jump back into the organisation.

My new job title is “Data Infrastructure Programme Lead”. In practice this means that I’m going to be:

  • helping to develop the ODI’s programme of work around data infrastructure, including the creation of research, standards, guidance and tools that will support the building of good data infrastructure
  • taking on product ownership of the Open Data Certificates and Open Data Pathway, so we’ve got a way to measure good data infrastructure
  • working with the ODI’s partners and network to support them in building stronger data infrastructure
  • building relationships with others who are working on building data infrastructure in the public and private sectors, so we can learn from one another

And no doubt, a whole lot of other things besides!

I’ll be working closely with Peter and Olivier, as my role should complement theirs. And I’m looking forward to spending more time with the rest of the ODI team, so I can find ways to support and learn more from them all.

My immediate priorities will be working on standards and tools to help build data infrastructure in the physical activity sector, through the OpenActive project, and leading on projects looking at how to build better standards and how to develop collaborative registers.

I’m genuinely excited about the opportunities we have for improving the publication and use of data on the web. It’s a topic that continues to occupy a lot of my attention. For example, I’m keen to see whether we can build a design manual for data infrastructure. Or improve governance around data through analysing existing sources. Or whether mapping data ecosystems and diagramming data flows can help us understand what makes a good data infrastructure. And a million other things. It’s also probably time we started to recognise and invest in the building blocks for data infrastructure that we’ve already built.

If you’re interested in talking about data infrastructure, then I’d love to hear from you. You can reach me on twitter or email.

We can strengthen data infrastructure by analysing open data

Data is infrastructure for our society and businesses. To create stronger, sustainable data infrastructure that supports a variety of users and uses, we need to build it in a principled way.

Over time, as we gain experience with a variety of infrastructures supporting both shared and open data, we can identify the common elements of good data infrastructure. We can use that experience to help write a design manual for data infrastructure.

There are a variety of ways to approach that task. We can write case studies on specific projects, and we can map ecosystems to understand how value is created through data. We can also take time to contribute to projects: experiencing different types of governance, following processes and using tools can provide useful insight.

We can also analyse open data to look for additional insights that might help us improve data infrastructure. I’ve recently been involved in two short projects that have analysed some existing open data.

Exploring open data quality

Working with Experian and colleagues at the ODI, we looked at the quality of some UK government datasets. We used a data quality tool to analyse data from the Land Registry, the NHS and Companies House. We found issues with each of the datasets.

It’s clear that there is still plenty of scope to make basic improvements to how data is published, by providing:

  • better guidance on the structure, content and licensing of data
  • basic data models and machine-readable schemas to help standardise approaches to sharing similar data
  • better tooling to help reconcile data against authoritative registers

The UK is also still in need of a national open address register.
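
To give a flavour of what that analysis involves, here’s a minimal sketch of the kind of basic checks a data quality tool performs, written in Python with pandas. The file, column names and validation rules are assumptions for illustration, not the actual tool or datasets we used:

```python
import pandas as pd

df = pd.read_csv("companies.csv")  # hypothetical input file

checks = {
    # Completeness: what proportion of rows are missing a postcode?
    "missing_postcode": df["postcode"].isna().mean(),
    # Validity: do company numbers match the expected 8-character format?
    "invalid_company_number": (~df["company_number"].astype(str)
                               .str.fullmatch(r"[A-Z0-9]{8}")).mean(),
    # Uniqueness: what proportion of identifiers are duplicated?
    "duplicate_company_number": df["company_number"].duplicated().mean(),
}

for name, rate in checks.items():
    print(f"{name}: {rate:.1%} of rows")
```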

Open data quality is a current topic in the open data community. The community might benefit from access to an “open data quality index” that provides more detail on these issues. Open data certificates would be an important part of that index. The tools used to generate that index could also be used on shared datasets. The results could be open, even if the datasets themselves might not be.

Exploring the evolution of data

There are currently plans to further improve the data infrastructure that supports academic research by standardising organisation identifiers. I’ve been doing some R&D work for that project to analyse several different shared and open datasets of organisation identifiers. By collecting and indexing the data, we’ve been able to assess how well they can support improving existing data, through automated reconciliation and by creating better data entry tools for users.
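
As a rough illustration of what automated reconciliation looks like, here’s a simplified sketch in Python. The normalisation rules, the register contents and the identifier are all invented for the example; real reconciliation pipelines use much richer matching than this:

```python
import re

def normalise(name: str) -> str:
    """Lower-case a name, strip punctuation and drop common words."""
    name = name.lower()
    name = re.sub(r"[^\w\s]", "", name)
    name = re.sub(r"\b(university|univ|of|the)\b", "", name)
    return " ".join(name.split())

# The register maps a normalised name to its identifier.
# (Hypothetical entry, with a GRID-style identifier.)
register = {
    normalise("University of Bath"): "grid.0000.1",
}

def reconcile(raw_name: str):
    """Return the register identifier for a name, if it matches."""
    return register.get(normalise(raw_name))

print(reconcile("The Univ. of Bath"))  # -> grid.0000.1
```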

Increasingly, when we are building new data infrastructures, we are building on and linking together existing datasets. So it’s important to have a good understanding of the scope, coverage and governance of the source data we are using. Access to regularly published data gives us an opportunity to explore the dynamics around the management of those sources.

For example, I’ve explored the growth of the GRID organisational identifiers.
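
One simple way to do that kind of analysis is to take each published dump of the dataset, count the records, and look at the trend. A hypothetical sketch in Python, assuming one CSV dump per release (the file names and dates are invented):

```python
import pandas as pd

# One dump of the register per release; file names are assumptions.
dumps = {
    "2016-01": "grid-2016-01.csv",
    "2016-07": "grid-2016-07.csv",
    "2017-01": "grid-2017-01.csv",
}

# Count the records in each dump to see how the register has grown.
counts = pd.Series({date: len(pd.read_csv(path))
                    for date, path in dumps.items()})
print(counts)
print(counts.pct_change())  # growth rate between releases
```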

This type of analysis can help assess the level of investment required to maintain different types of datasets and registers. The type of governance we decide to put around data will have a big impact on the technology and processes that need to be created to maintain it. A collaborative, user-maintained register will operate very differently to one that is managed by a single authority.

One final area in which I hope the community can begin to draw together some insight is how data is used. At present there are no standards to guide the collection and reporting of metrics on the usage of either shared or open data. Publishing open data about how data is used could be extremely useful, not just in understanding data infrastructure, but also in providing transparency about when and how data is being used.


Discogs: a business based on public domain data

When I’m discussing business models around open data I regularly refer to a few different examples. Not all of these have well developed case studies, so I thought I’d start trying to capture them here. In this first write-up I’m going to look at Discogs.

In an attempt to explore a few different aspects of the service I’m going to:

  • summarise the service and how it has grown
  • outline the roles in its data ecosystem
  • characterise its data infrastructure

How well that will work I don’t know, but let’s see!

Discogs: the service

Discogs is a crowd-sourced database about music releases: singles, albums, artists, etc. The service was launched in 2000. In 2015 it held data on more than 6.6 million releases; as of today there are 7.7 million releases. That’s around 30% growth from 2014-15 and around 16% growth in 2015-2016. The 2015 report and this Wikipedia entry contain more details.

The database has been built from the contributions of over 300,000 people. That community has grown about 10% in the last six months alone.

The database has been described as one of the most exhaustive collections of discographical metadata in the world.

The service has been made sustainable through its marketplace, which allows record collectors to buy and sell releases. As of today there are more than 30 million items for sale. A New York Times article from last year explained that the marketplace was generating 80,000 orders a week and was on track to do $100 million in sales, of which Discogs takes an 8% commission.

The company has grown from a one-man operation to having 47 employees around the world, and the website now has 20 million visitors a month and over 3 million registered users. So only around 1.5% of monthly visitors, and roughly 10% of registered users, also contribute to the database.

In 2007 Discogs added an API to allow anyone to access the database. Initially the data was made available under a custom data licence which included attribution and no derivatives clauses. The latter encouraged reusers to contribute to the core database, rather than modify it outside of the system. This licence was rapidly dropped (within a few months, as far as I can tell) in favour of a public domain licence. This has subsequently transitioned to a Creative Commons CC0 waiver.

The API has gone through a number of iterations. Over time the requirement to use API keys has been dropped, rate limits have been lifted and, since 2008, full data dumps of the catalogue have been available for anyone to download. In short, the data has become increasingly open and accessible to anyone who wants to use it.
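
For anyone wanting to try it, here’s a minimal sketch of fetching a single release from the API in Python. The endpoint and example release id follow the public developer documentation at the time of writing, but check https://www.discogs.com/developers for current details:

```python
import requests

# The Discogs API asks clients to identify themselves with a User-Agent.
headers = {"User-Agent": "DataInfrastructureExample/0.1"}

resp = requests.get("https://api.discogs.com/releases/249504",
                    headers=headers)
resp.raise_for_status()

release = resp.json()
print(release["title"], "-", release.get("year"))
```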

Wikipedia lists a number of pieces of music software that use the data. In May 2012 Discogs and The Echo Nest announced a partnership which would see the Discogs database incorporated into Echo Nest’s Rosetta Stone product, which was being sold as a “big data” product to music businesses. It’s unclear to me if there’s an ongoing relationship. But The Echo Nest were acquired by Spotify in 2014 and have a range of customers, so we might expect that the Discogs data is being used regularly as part of their products.

Discogs: the data ecosystem

Looking at the various roles in the Discogs data ecosystem, we can identify:

  • Steward: Discogs is a service operated by Zink Media, Inc. They operate the infrastructure and marketplace.
  • Contributor: The team of volunteers curating the website as well as the community support and leaders on the Discogs team
  • Reusers: The database is used in a number of music software applications and potentially by other organisations like Echo Nest and their customers. More work is required here to understand this aspect.
  • Aggregator: Echo Nest aggregates data from Discogs and other services, providing value-added services to other organisations on a commercial basis. Echo Nest in turn support additional reusers and applications.
  • Beneficiaries: Through the website, the information is consumed by a wide variety of enthusiasts, collectors and music stores. A larger network of individuals and organisations is likely supported through the APIs and aggregators

Discogs: the data infrastructure

To characterise the model we can identify:

  • Assets: the core database is available as open data. Most of this is available via the data dumps, although the API also exposes some additional data and functionality, including user lists and marketplace entries. It’s not clear to me how much data is available on the historical pricing in the marketplace. This might not be openly available, in which case it would be classified as shared data available only to the Discogs team.
  • Community: the Contributors, Reusers and Aggregators are all outlined above
  • Financial Model: the service is made sustainable through the revenue generated from marketplace transactions. Interestingly, the marketplace wasn’t originally part of the core service, but was added based on user demand. This clearly provided a means for the service to become more sustainable and supported growth in staff and office space.
  • Licensing: I wasn’t able to find any details on other partnerships or deals, but the entire data assets of the business are in the public domain. It’s the community around the dataset and the website that has meant that Discogs has continued to grow whilst other efforts have failed
  • Incentives: as with any enthusiast driven website, the incentives are around creating and maintaining a freely available, authoritative resource. The marketplace provides a means for record collectors to buy and sell releases, whilst the website itself provides a reference and a resource in support of other commercial activities

Exploring Discogs as a data infrastructure using Ostrom’s principles, we can see that:

While it is hard to assess any community from the outside, the fact that both the marketplace and contributor communities are continuing to grow suggests that these measures are working.

I’ll leave this case study with the following great quote from Discogs’ founder, Kevin Lewandowski:

See, the thing about a community is that it’s different from a network. A network is like your Facebook group; you cherrypick who you want to live in your circle, and it validates you, but it doesn’t make you grow as easily. A web community, much like a neighborhood community, is made up of people you do not pluck from a roster, and the only way to make order out of it is to communicate and demonstrate democratic growth, which I believe we have done and will continue to do with Discogs in the future.

If you found this case study interesting and useful, then let me know. It’ll encourage me to do more. I’m particularly interested in your views on the approach I’ve taken to capture the different aspects of the ecosystem, infrastructure, etc.