Open Data Camp Pitch: Mapping data ecosystems

I’m going to Open Data Camp #4 this weekend. I’m really looking forward to catching up with people and seeing what sessions will be running. I’ve been toying with a few session proposals of my own and thought I’d share an outline for this one to gauge interest and get some feedback.

I’m calling the session: “Mapping open data ecosystems”.

Problem statement

I’m very interested in understanding how people and organisations create and share value through open data. One of the key questions that the community wrestles with is demonstrating that value, and we often turn to case studies to attempt to describe it. We also develop arguments to use to convince both publishers and consumers of data that “open” is a positive.

But, as I’ve written about before, the open data ecosystem consists of more than just publishers and consumers. There are a number of different roles. Value is created and shared between those roles. This creates a value network including both tangible (e.g. data, applications) and intangible (knowledge, insight, experience) value.

I think if we map these networks we can get more insight into what roles people play, what makes a stable ecosystem, and better understand the needs of different types of user. For example we can compare open data ecosystems with more closed marketplaces.

The goal

Get together a group of people to:

  • map some ecosystems using a suggested set of roles, e.g. those we are individually involved with
  • discuss whether the suggested roles need to be refined
  • share the maps with each other, to look for overlaps, draw out insights, validate the approach, etc


I know Open Data Camp sessions are self-organising, but I was going to propose a structure to give everyone a chance to contribute, whilst also generating some output. Assuming an hour session, we could organise it as follows:

  • 5 mins review of the background, the roles and approach
  • 20 mins group activity to do a mapping exercise
  • 20 mins discussion to share maps, thoughts, etc
  • 15 mins discussion on whether the approach is useful, refine the roles, etc

The intention here is to try to generate some outputs that we can take away. Most of the session will be group activity and discussion.

Obviously I’m open to other approaches.

And if no-one is interested in the session then that’s fine. I might just wander round with bits of paper and ask people to draw their own networks over the weekend.

Let me know if you’re interested!


Donate to the commons this holiday season

Holiday season is nearly upon us. Donating to a charity is an alternative form of gift giving that shows you care, whilst directing your money towards helping those that need it. There are a lot of great and deserving causes you can support, and I’m certainly not going to tell you where you should donate your money.

But I’ve been thinking about the various ways in which I can support projects that I care about. There are a lot of them as it turns out. And it occurred to me that I could ask friends and family who might want to buy me a gift to donate to them instead. It’ll save me getting yet another scarf, pair of socks, or (shudder) a brutalised Toblerone.

One topic I’m interested in, as regular readers will know, is how we can create a sustainable commons: open data, open source, etc. So here’s a list of relevant donation options. I’m sharing it here in case you might find it useful too.

Open Source

Open Content & Data

Open Science

Open Standards and Rights

This isn’t meant as an exhaustive list. It’s just the organisations that immediately came to mind. Leave a comment if you’d like to suggest an addition.


The practice of open data

Open data is data that anyone can access, use and share.

Open data is the result of several processes. The most obvious one is the release process that results in data being made available for reuse and sharing.

But there are other processes that may take place before that open data is made available: collecting and curating a dataset; running it through quality checks; or ensuring that data has been properly anonymised.

There are also processes that happen after data has been published. Providing support to users, for example. Or dealing with error reports or service issues with an API or portal.

Some processes are also continuous. Engaging with re-users is something that is best done on an ongoing basis. Re-users can help you decide which datasets to release and when. They can also give you feedback on ways to improve how your data is published. Or how it can be connected and enriched against other sources.

Collectively these processes define the practice of open data.

The practice of open data covers much more than the technical details of helping someone else access your data. It covers a whole range of organisational activities.

Releasing open data can be really easy. But developing your open data practice can take time. It can involve other changes in your organisation, such as creating a more open approach to data sharing. Or getting better at data governance and management.

The extent to which you develop an open data practice depends on how important open data is to your organisation. Is it part of your core strategy or just something you’re doing on a more limited basis?

The breadth and depth of the practice of open data is surprising to many people. The learning process is best experienced. Going through the process of opening a dataset, however small, provides useful insight that can help identify where further learning is needed.

One aspect of the practice of open data involves understanding what data can be open, what can be shared and what must stay closed. Moving data along the data spectrum can unlock more value. But not all data can be open.

An open data practitioner works to make sure that data is at the right point on the data spectrum.

An open data practitioner will understand the practice of open data and be able to use those skills to create value for their organisation.

Often I find that when people write about “the state of open data” what they’re actually writing about is the practice of open data within a specific community. For example, the practice of open data in research, or the practice of open government data in the US, or the UK.

Different communities are developing their open data practices at different rates. It’s useful to compare practices so we can distil out the useful, reusable elements. But we must acknowledge that these differences exist. That open data can fulfil a different role and offer a different value proposition in different communities. However there will obviously be common elements to those practices; the common processes that we all follow.

The open data maturity model is an attempt to describe the practice of open data. The framework identifies a range of activities and processes that are relevant to the practice of open data. It’s based on years of experience across a range of different projects. And it’s been used by both public and private sector organisations.

The model is designed to help organisations assess and improve their open data practice. It provides a tool-kit to help you think about the different aspects of open data practice. By using a common framework we can benchmark our practices against those in other organisations. Not as a way to generate leader-boards, but as a way to identify opportunities for sharing our experiences to help each other develop.

If you try it and find it useful, then let me know. And if you don’t find it useful, then let me know too. Hearing what works and what doesn’t is how I develop my own open data practice.

Discogs: a business based on public domain data

When I’m discussing business models around open data I regularly refer to a few different examples. Not all of these have well developed case studies, so I thought I’d start trying to capture them here. In this first write-up I’m going to look at Discogs.

In an attempt to explore a few different aspects of the service I’m going to:

How well that will work I don’t know, but let’s see!

Discogs: the service

Discogs is a crowd-sourced database about music releases: singles, albums, artists, etc. The service was launched in 2000. In 2015 it held data on more than 6.6 million releases. As of today there are 7.7 million releases. That’s a 30% growth from 2014-15 and around 16% growth in 2015-2016. The 2015 report and this Wikipedia entry contain more details.

The database has been built from the contributions of over 300,000 people. That community has grown about 10% in the last six months alone.

The database has been described as one of the most exhaustive collections of discographical metadata in the world.

The service has been made sustainable through its marketplace, which allows record collectors to buy and sell releases. As of today there are more than 30 million items for sale. A New York Times article from last year explained that the marketplace was generating 80,000 orders a week and was on track to do $100 million in sales, of which Discogs take an 8% commission.

The company has grown from a one-man operation to 47 employees around the world, and the website has 20 million visitors a month and over 3 million registered users. With over 300,000 contributors, that means roughly 10% of registered users (about 1.5% of monthly visitors) also contribute to the database.
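As a quick sanity check, the arithmetic behind these figures can be sketched in a few lines of Python. The inputs are just the rounded numbers quoted in this post (the variable names are mine), and the commission line assumes the $100 million is an annual sales figure, so treat the results as approximate.

```python
# Rounded figures quoted in this post; variable names are illustrative.
releases_2015 = 6_600_000
releases_2016 = 7_700_000
contributors = 300_000
registered_users = 3_000_000
monthly_visitors = 20_000_000
annual_sales_usd = 100_000_000  # assumption: the quoted sales are per year
commission_rate = 0.08


def pct(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Growth in the number of releases between the two quoted totals.
release_growth = pct(releases_2016 - releases_2015, releases_2015)

# Share of contributors measured against two different user bases.
contributor_share_of_users = pct(contributors, registered_users)
contributor_share_of_visitors = pct(contributors, monthly_visitors)

# Commission income implied by the quoted sales and rate.
commission_usd = round(annual_sales_usd * commission_rate)

print(release_growth, contributor_share_of_users,
      contributor_share_of_visitors, commission_usd)
```

The share of people contributing looks very different depending on whether you measure it against registered users or against all monthly visitors, which is worth bearing in mind when comparing communities.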

In 2007 Discogs added an API to allow anyone to access the database. Initially the data was made available under a custom data licence which included attribution and no derivatives clauses. The latter encouraged reusers to contribute to the core database, rather than modify it outside of the system. This licence was rapidly dropped (within a few months, as far as I can tell) in favour of a public domain licence. This has subsequently transitioned to a Creative Commons CC0 waiver.

The API has gone through a number of iterations. Over time the requirement to use API keys has been dropped, rate limits have been lifted and since 2008 full data dumps of the catalogue have been available for anyone to download. In short the data has been increasingly open and accessible to anyone that wanted to use it.

Wikipedia lists a number of pieces of music software that use the data. In May 2012 Discogs and The Echo Nest both announced a partnership which would see the Discogs database incorporated into Echo Nest’s Rosetta Stone product which was being sold as a “big data” product to music businesses. It’s unclear to me if there’s an ongoing relationship. But The Echo Nest were acquired by Spotify in 2014 and have a range of customers, so we might expect that the Discogs data is being used regularly as part of their products.

Discogs: the data ecosystem

Looking at the various roles in the Discogs data ecosystem, we can identify:

  • Steward: Discogs is a service operated by Zink Media, Inc. They operate the infrastructure and marketplace.
  • Contributor: The team of volunteers curating the website as well as the community support and leaders on the Discogs team
  • Reusers: The database is used in a number of small music software applications and potentially by other organisations like Echo Nest and their customers. More work is required here to understand this aspect
  • Aggregator: Echo Nest aggregates data from Discogs and other services, providing value-added services to other organisations on a commercial basis. Echo Nest in turn support additional reusers and applications.
  • Beneficiaries: Through the website, the information is consumed by a wide variety of enthusiasts, collectors and music stores. A larger network of individuals and organisations is likely supported through the APIs and aggregators
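One way to make a map like this comparable with others is to record it as a small directed graph of roles and the value that flows between them. The sketch below is only an illustration: the role names come from the list above, but the individual exchanges are my reading of this case study rather than anything Discogs publishes, and the `Exchange` type and `outgoing` helper are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Exchange:
    """One flow of value between two roles (hypothetical structure)."""
    source: str
    target: str
    value: str
    tangible: bool


# Illustrative exchanges, inferred from the case study text.
exchanges = [
    Exchange("Contributor", "Steward", "release metadata", True),
    Exchange("Steward", "Reuser", "API and data dumps", True),
    Exchange("Steward", "Aggregator", "API and data dumps", True),
    Exchange("Aggregator", "Reuser", "value-added services", True),
    Exchange("Steward", "Beneficiary", "website and marketplace", True),
    Exchange("Beneficiary", "Steward", "marketplace commission", True),
    Exchange("Steward", "Contributor", "community support", False),
]


def outgoing(exchanges):
    """Group the value each role provides by the role that receives it."""
    result = defaultdict(dict)
    for e in exchanges:
        result[e.source].setdefault(e.target, []).append(e.value)
    return dict(result)


role_map = outgoing(exchanges)
for role, targets in role_map.items():
    print(role, "->", targets)
```

Recording maps in a common shape like this would make it easier to look for overlaps between ecosystems, for example roles that only ever receive value, or stewards that depend on a single source of income.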

Discogs: the data infrastructure

To characterise the model we can identify:

  • Assets: the core database is available as open data. Most of this is available via the data dumps, although the API also exposes some additional data and functionality, including user lists and marketplace entries. It’s not clear to me how much data is available on the historical pricing in the marketplace. This might not be openly available, in which case it would be classified as shared data available only to the Discogs team.
  • Community: the Contributors, Reusers and Aggregators are all outlined above
  • Financial Model: the service is made sustainable through the revenue generated from the marketplace transactions. Interestingly, originally the marketplace wasn’t a part of the core service but was added based on user demand. This clearly provided a means for the service to become more sustainable and supported growth of staff and office space.
  • Licensing: I wasn’t able to find any details on other partnerships or deals, but the entire data assets of the business are in the public domain. It’s the community around the dataset and the website that has meant that Discogs has continued to grow whilst other efforts have failed
  • Incentives: as with any enthusiast driven website, the incentives are around creating and maintaining a freely available, authoritative resource. The marketplace provides a means for record collectors to buy and sell releases, whilst the website itself provides a reference and a resource in support of other commercial activities

Exploring Discogs as a data infrastructure using Ostrom’s principles we can see that:

While it is hard to assess any community from the outside, the fact that both the marketplace and contributor communities are continuing to grow suggests that these measures are working.

I’ll leave this case study with the following great quote from Discogs’ founder, Kevin Lewandowski:

See, the thing about a community is that it’s different from a network. A network is like your Facebook group; you cherrypick who you want to live in your circle, and it validates you, but it doesn’t make you grow as easily. A web community, much like a neighborhood community, is made up of people you do not pluck from a roster, and the only way to make order out of it is to communicate and demonstrate democratic growth, which I believe we have done and will continue to do with Discogs in the future.

If you found this case study interesting and useful, then let me know. It’ll encourage me to do more. I’m particularly interested in your views on the approach I’ve taken to capture the different aspects of the ecosystem, infrastructure, etc.

Checking Fact Checkers

As of last month Google News attempts to highlight fact check articles. Content from fact checking organisations will be tagged so that their contribution to on-line debate can be more clearly identified. I think this is a great move and a first small step towards addressing wider concerns around use of the web for disinformation and a “post truth” society.

So how does it work?

Firstly, news sites can now advertise fact checking articles using a pending schema.org extension called ClaimReview. The mark-up allows a fact checker to indicate which article they are critiquing along with a brief summary of what aspects are being reviewed.
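To give a feel for what that metadata looks like, here is a minimal sketch that builds a ClaimReview description as JSON-LD. The property names are taken from the schema.org ClaimReview type, but every value (the URLs, the organisation, the rating scale) is a made-up placeholder, and Google’s guidelines may require more than is shown here.

```python
import json

# All values below are made-up placeholders, not a real fact check.
claim_review = {
    "@context": "http://schema.org",
    "@type": "ClaimReview",
    # Where the fact check itself is published.
    "url": "https://factchecker.example/reviews/123",
    "datePublished": "2016-11-01",
    "author": {"@type": "Organization", "name": "Example Fact Checker"},
    # Brief summary of the claim being reviewed.
    "claimReviewed": "A brief summary of the claim being checked",
    # The article that is being critiqued.
    "itemReviewed": {
        "@type": "CreativeWork",
        "url": "https://news.example/articles/456",
    },
    # The fact checker's verdict on a hypothetical 1-5 scale.
    "reviewRating": {
        "@type": "Rating",
        "ratingValue": 2,
        "bestRating": 5,
        "worstRating": 1,
        "alternateName": "Mostly false",
    },
}

# Serialise as JSON-LD, ready to embed in a page inside a script
# element of type application/ld+json.
markup = json.dumps(claim_review, indent=2)
print(markup)
```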

Metadata alone is obviously ripe for abuse. Anyone could claim any article is a fact check. So there’s an additional level of editorial control that Google layer on top of that metadata. They’ve outlined their criteria in their help pages. These seem perfectly reasonable: it should be clear what facts are being checked, sources must be cited, organisations must be non-partisan and transparent, etc.

It’s the latter aspect that I think is worth digging into a little more. The Google News announcement references the International Fact Checking Network and a study on fact checking sites. The study, by the Duke Reporters’ Lab, outlines how they identify fact checking organisations. Again, they mention both transparency of sources and organisational transparency as being important criteria.

I think I’d go a step further and require that:

  • Google’s (and others’) lists of approved fact checking organisations are published as open data
  • The lists are cross-referenced with identifiers from sources like OpenCorporates that will allow independent verification of ownership, etc.
  • Fact checking organisations publish open data about their sources of funding and affiliations
  • Fact checking organisations publish open data, perhaps using annotations, about the dataset(s) they use to check individual claims in their articles
  • Fact checking organisations licence their ClaimReview metadata for reuse by anyone

Fact checking is an area that benefits from the greatest possible transparency. Open data can deliver that transparency.

Another angle to consider is that fact checking may be carried out by more than just media organisations. Jon Udell has written a couple of interesting pieces on annotating the wild-west of information flow and bird-dogging the web that highlight the potential role of annotation services in helping to fact check and create constructive debate and discussion on-line.

Elinor Ostrom and data infrastructure

One of the topics that most interests me at the moment is how we design systems and organisations that contribute to the creation and maintenance of the open data commons.

This is more than a purely academic interest. If we can understand the characteristics of successful open data projects like OpenStreetMap or MusicBrainz then we could replicate them in other areas. My hope is that we may be able to define a useful tool-kit of organisational and technical design patterns that make it more likely for other similar projects to proceed. These patterns might also give us a way to evaluate and improve other existing systems.

A lot of the current discussion around this topic is going on under the “data infrastructure” heading. Also related is the idea of open data as a public good.

While I believe that open data is a public good, I do wonder whether particular styles of data infrastructure and licensing arrangements mean that data might sometimes be a club good. But lots more thinking and reading to be done there. Economics isn’t my area of expertise.

That said, if you’re interested in data infrastructure then I’d recommend looking at the work of Elinor Ostrom. She received a Nobel Prize for her research exploring how communities self-organise to manage the commons. Her work was instrumental in debunking the idea of the “tragedy of the commons”.

A key outcome of Ostrom’s work was the definition of 8 principles for designing organisations that manage common-pool resources. While her focus was on common-pool resources rather than public goods, the principles define a framework that can be applied more generally. And, as this article on the influences of Ostrom’s work notes, “any group whose members must work together to achieve a common goal is vulnerable to self-serving behaviors and should benefit from the same principles”.

The Open Data Institute have defined some high-level principles for strengthening data infrastructure. These include working in the open, designing collaborative models, building with the web, and balancing stakeholder interests.

I think you can usefully read Ostrom’s principles as more detailed guidance for how to create digital communities that collaborate to create and maintain open data. In fact if you’ve been part of any online community or taken part in community-building activities, I think those principles should resonate pretty strongly.

As an illustration, here are each of the principles and some suggested questions that are relevant to digital communities and open data. Ostrom highlights that communities will have:

  1. Clearly defined boundaries (clear definition of the contents of the common pool resource and effective exclusion of external un-entitled parties)
    • what is the purpose of the data infrastructure?
    • what community does it serve, and how are they identified?
    • what are the key data assets that the infrastructure will produce?
    • when will its mission be complete?
  2. Rules regarding the appropriation and provision of common resources that are adapted to local conditions;
    • how are the data assets and guidance provided by the community licensed?
    • what are the forms of attribution and other social norms that apply to use of the resources?
    • what are the guidelines that apply to contributions from the community?
    • how are new contributors guided towards becoming productive members of the community?
    • what are the means by which people can access and reuse the data?
  3. Collective-choice arrangements that allow most resource appropriators to participate in the decision-making process;
    • how does the community share ideas about how the infrastructure should evolve?
    • what are the decision making processes and the tools used to support them?
    • if poor quality data is added, how is this discussed, highlighted and improved?
    • how are differences of opinion, or innovative ideas relating to e.g. data modelling or organisation issues, discussed within the community?
  4. Effective monitoring by monitors who are part of or accountable to the appropriators;
    • how are contributions to the data assets managed or reviewed by moderators?
    • how does the community measure its progress and activity?
    • how are moderators identified and promoted? how might their privileges be removed?
    • how are good uses of the infrastructure showcased?
    • what metrics are available to measure data quality, coverage, etc?
  5. A scale of graduated sanctions for resource appropriators who violate community rules;
    • how is spam and other wilful misuse identified and dealt with?
    • how is abusive behaviour dealt with?
    • how does the community document and share its norms?
    • what are the means by which contributors gain or lose privileges?
  6. Mechanisms of conflict resolution that are cheap and of easy access;
    • what processes are used to resolve debates and make decisions?
    • how can data quality issues be flagged and addressed?
    • what are the mechanisms by which community members can share their opinions, or have their voice heard?
    • how are the results of debate and key decisions recorded?
  7. Self-determination of the community recognized by higher-level authorities;
    • what type of organisation is used to manage the community resources?
    • what is the process by which other organisations engage with the community and/or its representatives?
  8. In the case of larger common-pool resources, organization in the form of multiple layers of nested enterprises, with small local CPRs at the base level
    • how does the community interact with other similar initiatives, e.g. in a sector or broader community?
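One lightweight way to put these questions to work is as a self-assessment: record answers against each principle and see which principles have not been addressed at all. The sketch below assumes a very simple structure of my own invention; the short labels are my abbreviations of Ostrom’s principles, and the example project and its answers are entirely hypothetical.

```python
# Short labels for Ostrom's eight principles (my own abbreviations).
PRINCIPLES = {
    1: "Clearly defined boundaries",
    2: "Locally adapted rules",
    3: "Collective-choice arrangements",
    4: "Effective monitoring",
    5: "Graduated sanctions",
    6: "Conflict resolution",
    7: "Recognised self-determination",
    8: "Nested enterprises",
}


def unanswered(answers):
    """Return the principles for which no question has been answered yet.

    `answers` maps a principle number to a dict of question -> answer text.
    Empty or whitespace-only answers count as unanswered.
    """
    gaps = []
    for number, label in PRINCIPLES.items():
        responses = answers.get(number, {})
        if not any(text.strip() for text in responses.values()):
            gaps.append((number, label))
    return gaps


# Hypothetical, partially completed assessment for an imaginary project.
example_answers = {
    1: {"what is the purpose of the data infrastructure?": "A shared register of bus stops"},
    2: {"how are the data assets licensed?": "CC0"},
    5: {"how is abusive behaviour dealt with?": ""},
}

for number, label in unanswered(example_answers):
    print(f"Principle {number} ({label}) still needs answers")
```

A real assessment would need more nuance than answered or unanswered, but even this much is enough to show where a community has not yet written anything down.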

Most importantly implementing a viable digital community for managing the data commons means that we must build with the web, which is another of the ODI’s principles.

As ever, if you have thoughts then let me know!

Current gaps in the open data standards framework

In this post I want to highlight what I think are some fairly large gaps in the standards we have for publishing and consuming data on the web. My purpose for writing these down is to try and fill in gaps in my own knowledge, so leave a comment if you think I’m missing something (there’s probably loads!)

To define the scope of those standards, let’s try and answer two questions.

Question 1: What are the various activities that we might want to carry out around an open dataset?

  • A. Discover the metadata and documentation about a dataset
  • B. Download or otherwise extract the contents of a dataset
  • C. Manage a dataset within a platform, e.g. create and publish it, update or delete it
  • D. Monitor a dataset for updates
  • E. Extract metrics about a dataset, e.g. a description of its contents or quality metrics
  • F. Mirror a dataset to another location, e.g. exporting its metadata and contents
  • G. Link or reconcile some data against a dataset or register

Question 2: What are the various activities that we might want to carry out around an open data catalogue?

  • V. Find whether a dataset exists, e.g. via a search or similar interface
  • X. List the contents of the platform, e.g. its datasets or other published assets
  • Y. Manage user accounts, e.g. to create accounts, or grant or remove rights from specific accounts
  • Z. Extract usage statistics, e.g. metrics on use of the platform and the datasets it contains

Now, based on that quick review: which of these areas of functionality are covered by existing standards?

  • DCAT and its extensions give us a way to describe a dataset (A) and can be used to find download links which addresses part of (B). But it doesn’t say how the metadata is to be discovered by clients.
  • The draft Data Quality Vocabulary starts to address parts of (E) but also doesn’t address discovery of published metrics
  • OData provides a means for querying and manipulating data via a RESTful interface (B, C). Although I don’t think it recognises a dataset as such, just resources exposed over the web
  • SPARQL (query, update, etc) also provides a means for similar operations (B, C), but on RDF data.
  • The Linked Data Platform specification also offers a similar set of functionality (B, C)
  • If a platform exposes its catalogue using DCAT then a client could use that to list its contents (X)
  • The draft Linked Data Notifications specification covers monitoring and synchronising of data (D)
  • Data Packages provide a means for packaging metadata and contents of a dataset for download and mirroring (B, F)
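The review above can be turned into a rough coverage check. The sketch below encodes the activity letters from the two questions and the coverage claimed in each bullet, then lists the activities that no standard touches. It treats partial coverage as covered, which is generous, and the labels and the `uncovered` helper are my own shorthand rather than anything defined by the standards themselves.

```python
# Activities from the two questions above, keyed by their letters.
ACTIVITIES = {
    "A": "Discover metadata and documentation",
    "B": "Download or extract contents",
    "C": "Manage a dataset within a platform",
    "D": "Monitor a dataset for updates",
    "E": "Extract metrics about a dataset",
    "F": "Mirror a dataset",
    "G": "Link or reconcile data",
    "V": "Find whether a dataset exists",
    "X": "List the contents of the platform",
    "Y": "Manage user accounts",
    "Z": "Extract usage statistics",
}

# Coverage as described in the bullets above. Partial coverage (for
# example DCAT describing but not discovering metadata) counts as covered.
COVERAGE = {
    "DCAT": {"A", "B", "X"},
    "Draft metrics vocabulary": {"E"},
    "OData": {"B", "C"},
    "SPARQL": {"B", "C"},
    "Linked Data Platform": {"B", "C"},
    "Linked Data Notifications": {"D"},
    "Data Packages": {"B", "F"},
}


def uncovered(activities, coverage):
    """Return the activity letters that no standard claims to address."""
    covered = set().union(*coverage.values())
    return sorted(set(activities) - covered)


gaps = uncovered(ACTIVITIES, COVERAGE)
for letter in gaps:
    print(letter, ACTIVITIES[letter])
```

On this reading the untouched activities are G, V, Y and Z: reconciliation, search, account management and usage statistics.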

I think there are a number of obvious gaps around discovery and platform (portal) functionality. API and metadata discovery is also something that could usefully be addressed.

If you’re managing and publishing data as RDF and Linked Data then you’re slightly better covered at least in terms of standards, if not in actual platform and tool support. The majority of current portals don’t manage data as RDF or Linked Data. They’re focused on either tabular or maybe geographic datasets.

This means that portability among the current crop of portals is actually pretty low. Moving between platforms means moving between entirely different sets of APIs and workflows. I’m not sure that’s ideal. I don’t feel like we’ve yet created a very coherent set of standards.

What do you think? What am I missing?