Who is the intended audience for open data?

This post is part of my ongoing series: basic questions about data. It’s intended to expand on a point that I made in a previous post in which I asked: who uses data portals?

At times I see quite a bit of debate within the open data community around how best to publish data. For example should data be made available in bulk or via an API? Which option is “best”? Depending on where you sit in the open data community you’re going to have very different responses to that question.

But I find that in the ensuing debate we often overlook that open data is intended to be used by anyone, for any purpose. And that means that maybe we need to think about more than just the immediate needs of developers and the open data community.

While the community has rightly focused on ensuring that data is machine-readable, so it can be used by developers, we mustn’t forget that data needs to be human-readable too. Otherwise we end up with critiques of what I consider to be fairly reasonable and much-needed guidance on structuring spreadsheets, and suggestions of alternatives that are well-meaning but feel a little under-baked.

I feel that there are several different and inter-related viewpoints being expressed:

  • That the citizen or user is the focus and we need to understand their needs and build services that support them. Here data tends to be a secondary concern, perhaps focused on transactional statistics about the performance of those services rather than the raw data
  • That open data is not meant for mere mortals and that its primary audience is developers, who analyse and present it to users. The emphasis here is on provision of the raw data as rapidly as possible
  • A variant of the above that emphasises delivery of data via an API to web and mobile developers allowing them to more rapidly deliver value. Here we see cases being made about the importance of platforms, infrastructure, and API programs
  • That citizens want to engage with data and need tools to explore it. In this case we see arguments for on-line tools to explore and visualise data, or reasonable suggestions to simply publish data in spreadsheets as this is a format with which many, many people are comfortable

Of course all of these are correct, although their prominence varies wildly across different types of data, applications, etc. Depending on where you sit in the open data value network your needs are going to be quite different.

It would be useful to map out the different roles of consumers, aggregators, intermediaries, etc to understand what value exchanges are taking place, as I think this would help highlight the value that each role brings to the ecosystem. But until then both consumers and publishers need to be mindful of potentially competing interests. In an ideal world publishers would serve every reuser need equally.

My advice is simple: publish for machines, but don’t forget the humans. All of the humans. Publish data with context that helps anyone – developers and interested readers alike – properly understand the data. Ensure there is at least a human-readable summary or view of the data as well as more developer-oriented bulk downloads. If you can get APIs “out of the box” with your portal, then invest the effort you would otherwise spend on preparing machine-readable data in providing more human-readable documentation and reports.
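To make that concrete, here’s a minimal sketch (in Python, using pandas) of what a human-readable companion to a bulk download might look like: a short HTML summary generated from the CSV itself. The file name, column name and description are hypothetical.

```python
import pandas as pd

# A minimal sketch: alongside the bulk CSV download, generate a simple
# human-readable HTML page with a plain-English summary and a preview.
# "traffic-counts-2014.csv" and the 'site_id' column are hypothetical.
df = pd.read_csv("traffic-counts-2014.csv")

summary = f"""
<h1>Road traffic counts, 2014</h1>
<p>Hourly counts of vehicles at fixed monitoring points.
   {len(df)} rows covering {df['site_id'].nunique()} sites.</p>
<h2>Preview (first 10 rows)</h2>
{df.head(10).to_html(index=False)}
<p><a href="traffic-counts-2014.csv">Download the full dataset (CSV)</a></p>
"""

with open("index.html", "w") as f:
    f.write(summary)
```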

Our ambition should be to build an open data commons that is accessible and useful for as many people as possible.


Managing risks when publishing open data

A question that I frequently encounter when talking to organisations about publishing open data is: “what if someone misuses or misunderstands our data?”.

These concerns stem from several different sources:

  • that the data might be analysed incorrectly, drawing incorrect conclusions that might be attributed to the publisher
  • that the data has known limitations and this might reflect on the publisher’s abilities, e.g. exposing issues with their operations
  • that the data might be used against the publisher in some way, e.g. to paint them in a bad light
  • that the data might be used for causes with which the publisher does not want to be aligned
  • that the data might harm the business activities of the publisher, e.g. by allowing someone to replicate a service or product

All of these are understandable and reasonable concerns. And the truth is that when publishing open data you are giving up a great deal of control over your data.

But the same is true of publishing any information: there will always be cases of accidental and wilful misuse of information. Short of not sharing information at all, all organisations already face this risk. It’s just that open data, which anyone can access, use and share for any purpose, really draws this issue into the spotlight.

In this post I wanted to share some thoughts about how organisations can manage the risks associated with publishing open data.

Risks of not sharing

Firstly, it’s worth noting that the risks of not sharing data are often unconsciously discounted.

There’s increasing evidence that holding on to data can hamper innovation whereas opening data can unlock value. This might be of direct benefit for the organisation or have wider economic, social and environmental benefits.

Organisations with a specific mission or task can more readily demonstrate their impact and progress by publishing open data. Those that are testing a theory of change will be reporting on indicators that help to measure impact and confirm that interventions are working as expected. Open data is the most transparent approach to these impact assessments.

Many organisations, particularly government bodies, are attempting to address challenges that can only be overcome in collaboration with others. Open data specifically, and data sharing practices in general, provides an important foundation for collaborative projects.

As data moves from the closed to the open end of the data spectrum, there is an increasingly wide audience that can access and use that information. We can point to Joy’s Law as a reason why this is a good thing.

In scientific publishing there are growing concerns about a “reproducibility crisis”, fuelled in part by a lack of access to original experimental data and analyses. Open publishing of scientific results is one remedy.

But setting aside what might be seen as a sleight of hand re-framing of the original question, how can organisations minimise specific types of risk?

Managing forms of permitted reuse

Organisations manage the forms of reuse of their data through a licence. The challenge for many is that an open licence places few limits on how data can be reused.

There is a wider range of licences that publishers could use, including some that limit creation of derivative works or commercial uses. But all of these restrictions may also unintentionally stop the kinds of reuse that publishers want to encourage or enable. This is particularly true when applying a “non-commercial” use clause. These issues are covered in detail in the recently published ODI guidance on the impacts of non-open licences.

While my default recommendation is that organisations use a CC-BY 4.0 licence, an alternative is the CC-BY-SA licence which requires that any derivative works are published under the same licence, i.e. that reusers must share in the same spirit as the publisher.

This could be a viable alternative that might help organisations feel more confident that they are deterring some forms of undesired reuse, e.g. discouraging a third-party or competitor from publishing a commercial analysis based on their data by requiring that the report also be distributed under an open licence.

The attribution requirement already stops data being reused without its original source being credited.

Managing risks of accidental misinterpretation

When I was working in academic publishing a friend at the OECD told me that at least one statistician had been won over to a plan to publicly publish data by the observation that the alternative was to continue to allow users to manually copy data from published reports, with the obvious risks of transcription errors.

This is a small example of how to manage risks of data being accidentally misused or misinterpreted. Putting appropriate effort into the documentation and publication of a dataset will help reusers understand how it can be correctly used. This includes:

  • describing what data is being reported
  • how the data was collected
  • the quality control, if any, that has been used to check the data
  • any known limits on its accuracy or gaps in coverage

All of these help to provide reusers with the appropriate context that can guide their use. It also makes them more likely to be successful. This detail is already covered in the ODI certification process.
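As an illustration, here’s a small sketch of how that context might be captured as structured metadata, loosely in the spirit of a Frictionless Data datapackage.json descriptor. The dataset, field names and values are all invented; the point is simply that the documentation travels with the data.

```python
import json

# A sketch of dataset documentation captured as structured metadata.
# The dataset, custom fields (collection_method, quality_control,
# known_limitations) and values are hypothetical.
descriptor = {
    "name": "air-quality-monitoring",
    "title": "Air quality monitoring results, 2015",
    "description": "Hourly NO2 readings from fixed monitoring stations.",   # what is reported
    "collection_method": "Automated analysers recording hourly averages",    # how it was collected
    "quality_control": "Readings ratified quarterly; provisional until then",
    "known_limitations": "Station 12 offline June-August; no rural coverage",
    "licenses": [{"name": "CC-BY-4.0"}],
    "contact": "opendata@example.org",
    "resources": [{"path": "no2-hourly-2015.csv", "format": "csv"}],
}

print(json.dumps(descriptor, indent=2))
```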

Writing a short overview of a dataset highlighting its most interesting features, sharing ideas for how it might be used, and clearly marking known limits can also help orientate potential reusers.

Of course, publishers may not have the resources to fully document every dataset. This is where having a contact point to allow users to ask for help, guidance and clarification is important. 

Managing risks of wilful misinterpretation

Managing risks of wilful misinterpretation of data is harder. You can’t control cases where people totally disregard documentation and licensing in order to push a particular agenda. Publishers can however highlight breaches of social norms and can choose to call out misuse they feel is important to highlight.

It’s important to note that there are standard terms in the majority of open licences, including the Creative Commons Licences and the Open Government Licence, which address:

  • limited warranties – no guarantees that data is fit for purpose, so reusers can’t claim damages if misused or misapplied
  • non-endorsement – reusers can’t say that their use of the data was endorsed or supported by the publisher
  • no use of trademarks, branding, etc. – reusers don’t have permission to brand their analysis as originating from the publisher
  • attribution – reusers must acknowledge the source of their data and cannot pass it off as their own

These clauses collectively limit the liability of the publisher. They also potentially provide some recourse to take legal action if a reuser did breach the terms of the licence, and the publisher thought that this was worth doing.

I would usually add to this that the attribution requirement means that there is always a link back to the original source of the data. This allows the reader of some analysis to find the original authoritative data and confirm any findings for themselves. It is important that publishers document how they would like to be attributed.

Managing business impacts

Finally, publishers concerned about the risks that releasing data poses to their business should ensure they’re doing so with a clear business case. This includes understanding whether the supply of data is the core value of the business or whether customers place more value in the services.

One startup I worked with was concerned that an open licence on user contributions might allow a competitor to clone their product. But in this case the defensibility of their business model derived not from controlling the data but from the services provided and the network effects of the platform. These are harder things to replicate.

This post isn’t intended to be a comprehensive review of all approaches to risk management when releasing data. There’s a great deal more which I’ve not covered including the need to pay appropriate attention to data protection, privacy, anonymisation, and general data governance.

But there is plenty of existing guidance available to help organisations work through those areas. I wanted to share some advice that more specifically relates to publishing data under an open licence.

Please leave a comment to let me know what you think. Is this advice useful and is there anything you would add?

Fictional data

The phrase “fictional data” popped into my head recently, largely because of odd connections between a couple of projects I’ve been working on.

It’s stuck with me because, if you set aside the literal meaning of “data that doesn’t actually exist”, there are some interesting aspects to it. For example the phrase could apply to:

  1. data that is deliberately wrong or inaccurate in order to mislead – lies or spam
  2. data that is deliberately wrong as a proof of origin or claim of ownership – e.g. inaccuracies introduced into maps to identify their sources, or copyright easter eggs
  3. data that is deliberately wrong, but intended as a prank – e.g. the original entry for Uqbar on Wikipedia. Uqbar is actually a doubly fictional place.
  4. data that is fictionalised (but still realistic) in order to support testing of some data analysis – e.g. a set of anonymised and obfuscated bank transactions
  5. data that is fictionalised in order to avoid being a nuisance, causing confusion, or accidental linkage – like 555-prefix telephone numbers or perhaps social media account names
  6. data that is drawn from a work of fiction or a virtual world – such as the Marvel Universe social graph, the Elite: Dangerous trading economy (context), or the data and algorithms relating to Pokemon capture.

I find all of these fascinating, for a variety of reasons:

  • How do we identify and exclude deliberately fictional data when harvesting, aggregating and analysing data from the web? Credit to Ian Davis for some early thinking about attack vectors for spam in Linked Data. While I’d expect copyright easter eggs to become less frequent they’re unlikely to completely disappear. But we can definitely expect more and more deliberate spam and attacks on authoritative data. (Categories 1, 2, 3)
  • How do we generate useful synthetic datasets that can be used for testing systems? Could we generate data based on some rules and a better understanding of real-world data as a safer alternative to obfuscating data that is shared for research purposes? It turns out that some fictional data is a good proxy for real world social networks. And analysis of videogame economics is useful for creating viable long-term communities. (Categories 4, 6)
  • Some of the most enthusiastic collectors and curators of data are those that are documenting fictional environments. Wikia is a small universe of mini-wikipedias complete with infoboxes and structured data. What can we learn from those communities and what better tools could we build for them? (Category 6)
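Picking up the second bullet, here’s a minimal sketch of generating fictional-but-realistic test data from simple rules, rather than obfuscating real records. All of the merchants, amounts and account identifiers are invented.

```python
import csv
import random
from datetime import date, timedelta

# A sketch of synthetic "bank transactions" built from simple rules, as a
# safer alternative to sharing obfuscated real data. Everything is invented.
random.seed(42)

MERCHANTS = {"Grocer": (5, 80), "Cafe": (2, 12), "Rail": (10, 60), "Rent": (650, 650)}

def fake_transactions(n_days=90, account="TEST-0001"):
    start = date(2016, 1, 1)
    for day in range(n_days):
        current = start + timedelta(days=day)
        for merchant, (low, high) in MERCHANTS.items():
            if merchant == "Rent" and current.day != 1:
                continue                      # rent only on the 1st of the month
            if merchant != "Rent" and random.random() > 0.4:
                continue                      # other merchants appear sporadically
            yield {
                "account": account,
                "date": current.isoformat(),
                "merchant": merchant,
                "amount": round(random.uniform(low, high), 2),
            }

with open("synthetic-transactions.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["account", "date", "merchant", "amount"])
    writer.writeheader()
    writer.writerows(fake_transactions())
```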

Interesting, huh?

What is a data portal?

This post is part of my ongoing series of basic questions about data, this time prompted by a tweet by Andy Dickinson asking the same question.

There are lots of open data portals. OpenDataMonitor lists 161 in the EU alone. The numbers have grown rapidly over the last few years. Encouraged by exemplars such as data.gov.uk they’re usually the first item on the roadmap for any open data initiative.

But what is a data portal and what role does it play?

A Basic Definition

I’d suggest that the most basic definition of an open data portal is:

A list of datasets with pointers to how those datasets can be accessed.

A web page on an existing website meets this definition. It’s the minimum viable open data portal. And, quite rightly, this is still where many projects begin.

Once you have more than a handful of datasets then you’re likely to need something more sophisticated to help users discover datasets that are of interest to them. A more sophisticated portal will provide the means to capture metadata about each dataset and then use that to provide the ability to search and browse through the list, e.g. by theme, licence, or other facets.
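As a toy illustration of that basic definition, here’s a sketch of a catalogue as nothing more than a list of metadata records, with a helper for simple faceted browsing. The datasets and URLs are invented.

```python
# A toy catalogue: dataset records with just enough metadata to support
# simple faceted browsing. All records and URLs are invented.
CATALOGUE = [
    {"title": "Planning applications 2015", "theme": "planning",
     "licence": "OGL-3.0", "url": "https://example.org/planning-2015.csv"},
    {"title": "Road traffic counts 2014", "theme": "transport",
     "licence": "CC-BY-4.0", "url": "https://example.org/traffic-2014.csv"},
    {"title": "Air quality monitoring 2015", "theme": "environment",
     "licence": "OGL-3.0", "url": "https://example.org/air-quality-2015.csv"},
]

def browse(facet, value):
    """Return catalogue entries matching a single facet, e.g. theme or licence."""
    return [d for d in CATALOGUE if d.get(facet) == value]

for dataset in browse("licence", "OGL-3.0"):
    print(dataset["title"], "->", dataset["url"])
```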

Portals rarely place any restrictions on the type of data that is catalogued or the means by which data is accessed. However more sophisticated portals offer additional capabilities for both the end user and the publisher.

Publisher features include:

  • File storage to make it easier to get data made available online
  • Additional curation tools, e.g. addition of custom metadata, creation of collections, and promotion of datasets
  • Integrated data stores, e.g. to allow data files to be uploaded into a database that will allow data to be queried and accessed by users in more sophisticated ways

User features include:

  • Notification tools to alert users to the publication of new or updated datasets
  • Integrated visualisations to support manipulation and use of data directly in the portal, often with the option to embed them in other websites
  • Automatically generated APIs to allow for more sophisticated online querying and interaction with datasets
  • Engagement tools such as rating, discussions and publisher feedback channels

There are a number of open source and commercial data portal platforms, including CKAN, Socrata and OpenDataSoft. All of these offer a mixture of the features outlined above.
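For example, CKAN portals expose a standard search API. Here’s a small sketch using its package_search action; the portal URL below (the CKAN demo site) and the search term are just placeholders for whichever portal you’re actually working with.

```python
import requests

# A sketch of querying a CKAN-based portal via its package_search action.
# The portal URL and query are placeholders, not a recommendation.
PORTAL = "https://demo.ckan.org"

response = requests.get(
    f"{PORTAL}/api/3/action/package_search",
    params={"q": "transport", "rows": 5},
    timeout=30,
)
response.raise_for_status()

for dataset in response.json()["result"]["results"]:
    print(dataset["title"])
    for resource in dataset.get("resources", []):
        print("  -", resource.get("format"), resource.get("url"))
```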

Who uses data portals?

Right now the target customer for a data portal is likely to be a public sector organisation, e.g. a local authority, city administration or government department that is looking to publish a number of datasets.

But the users of a data portal are a mixture of all the different parts of the open data community: individual citizens, developers or civic hackers, data journalists, public sector officials, commercial developers, etc.

Balancing the needs of these different constituents is difficult:

  • The customer wants to see some results from publishing their data as soon as possible, so instant access to visualisations and exploration tools gives immediate utility and benefit
  • Data analysts or designers will likely just want to download the data so they can make more sophisticated use of it
  • Web and mobile developers often want an API to allow them to quickly build an application, without setting up infrastructure and a custom data processing pipeline
  • A citizen, assuming they wander in at all, is likely to want some fairly simple data exploration tools, ideally wrapped up in some narrative that puts the data into context and helps tell a story

Depending on where you sit in the community you may think that current data portals are either fantastic or are under-serving your needs.

The business model and target market of the portal developer is also likely to affect how well they serve different communities. APIs, for example, support the creation of platforms that help embed the portal into an ecosystem.

Enterprise use

There are enterprise data portals too. Large enterprises have exactly the same problems as exist in the wider open data community: it’s often not clear what data is available or how to access it.

For example Microsoft has the Azure Data Catalog. This has been around for quite a few years now in various incarnations. There are also tools like Tamr Catalog.

They both have similar capabilities – collaborative cataloguing of datasets within an enterprise – and both are tied into a wider ecosystem of data processing and analytics tools.

Future directions

How might data portals evolve in the future?

I think there’s still plenty of room to develop new features to better serve different audiences.

For example none of the existing catalogues really help me publish some data and then tell a story with it. A story is likely to consist of a mixture of narrative and visualisations, perhaps spanning multiple datasets. This might best be served by making it easier to embed different views of data into blog posts rather than building additional content management features into the catalogue itself. But for a certain audience, e.g. data journalists and media organisations, this might be a useful package.

Better developer tooling, e.g. data syndication and schema validation, would help serve data scientists that are building custom workflows against data that is downloaded or harvested from data portals. This is a way to explore a platform approach that doesn’t necessarily require downstream users to use the portal APIs to query the data – just syndication of updates and notifications of changes.
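As a sketch of what that schema validation might look like, here’s a simple check of a downloaded CSV against an expected set of columns and one type rule. The file name and schema are hypothetical; a portal could run something similar automatically on upload or harvest.

```python
import csv

# A minimal sketch of schema validation as developer tooling: check a
# harvested CSV against an expected column list and one simple type rule.
# The file name and schema are hypothetical.
EXPECTED_COLUMNS = {"site_id", "date", "vehicle_count"}

def validate(path):
    problems = []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        missing = EXPECTED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            problems.append(f"missing columns: {sorted(missing)}")
        for line_no, row in enumerate(reader, start=2):
            if not (row.get("vehicle_count") or "").isdigit():
                problems.append(f"row {line_no}: vehicle_count is not a whole number")
    return problems

for problem in validate("traffic-counts-2014.csv"):
    print(problem)
```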

Another area is curation and data management tools. E.g. features to support multiple people in creating and managing a dataset directly in the portal itself. This might be useful for small-scale enterprise uses as well as supporting collaboration around open datasets.

Automated analysis of hosted data is another area in which data portals could develop features that would support both the publishers and developers. Some metadata about a dataset, e.g. to help describe its contents, could be derived by summarising features of the data rather than requiring manual data entry.
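A minimal sketch of that kind of automated profiling, using pandas to derive simple per-column metadata from an uploaded CSV (the file name is hypothetical):

```python
import pandas as pd

# Derive simple descriptive metadata from the data itself, rather than
# relying on manual data entry. The file name is hypothetical.
df = pd.read_csv("traffic-counts-2014.csv")

derived_metadata = {
    "rows": len(df),
    "columns": [
        {
            "name": col,
            "type": str(df[col].dtype),
            "distinct_values": int(df[col].nunique()),
            "missing": int(df[col].isna().sum()),
        }
        for col in df.columns
    ],
}
print(derived_metadata)
```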

Regardless of how they evolve in terms of features, data portals are likely to remain a key part of open data infrastructure. However as Google and others begin doing more to index the contents of datasets, it may be that the users of portals increasingly become machines rather than humans.

“The woodcutter”, an open data parable

In a time long past, in a land far away, there was once a great forest. It was a huge sprawling forest containing every known species of tree. And perhaps a few more.

The forest was part of a kingdom that had been ruled over by an old mad king for many years. The old king had refused anyone access to the forest. Only he was allowed to hunt amongst its trees. And the wood from the trees was used only to craft things that the king desired.

But there was now a new king. Where the old king was miserly, the new king was generous. Where the old king was cruel, the new king was wise.

As his first decree, the king announced that the trails that meandered through the great forest might be used by anyone who needed passage. And that the wood from his forest could be used by anyone who needed it, provided that they first ask the king’s woodcutter.

Several months after his decree, whilst riding on the edge of the forest, the king happened upon a surprising scene.

Gone was the woodcutter’s small cottage and workshop. In its place had grown up a collection of massive workshops and storage sheds. Surrounding the buildings was a large wooden palisade in which was set some heavily barred gates. From inside the palisade came the sounds of furious activity: sawing, chopping and men shouting orders.

All around the compound, filling the nearby fields, was a bustling encampment. Looking at the array of liveries, flags and clothing on display, the king judged that there were people gathered here from all across his lands. From farms, cities, and towns. From the coast and the mountains. There were also many from neighbouring kingdoms.

It was also clear that many of these people had been living here for some time.

Perplexed, the king rode to the compound, making his way through the crowds waiting outside the gates. Once he had been granted entry, he immediately sought out the woodcutter, finding him directing activities from a high vantage point.

Climbing to stand beside the woodcutter the king asked, “Woodcutter, why are all these people waiting outside of your compound? Where is the wood that they seek?”

Flustered, the woodcutter mopped his brow and bowed to his king. “Sire, these people shall have their wood as soon as we are ready. But first we must make preparations.”

“What preparations are needed?”, asked the king. “Your people have provided wood from this forest for many, many years. While the old king took little, is it not the same wood?”

“Ah, but sire, we must now provide the wood to so many different peoples”. Gesturing to a small group of tents close to the compound, the woodcutter continued: “Those are the ship builders. They need the longest, straightest planks to build their ships. And great trees to make their keels”.

“Over there are the house builders”, the woodcutter gestured, “they too need planks. But of a different size and from a different type of tree. This small group here represents the carpenters guild. They seek only the finest hard woods to craft clever jewellery boxes and similar fine goods.”

The king nodded. “So you have many more people to serve and many more trees to fell.”

“That is not all”, said the woodcutter pointing to another group. “Here are the river people who seek only logs to craft their dugout boats. Here are the toy makers who need fine pieces. Here are the fishermen seeking green wood for their smokers. And there the farmers and gardeners looking for bark and sawdust for bedding and mulch”.

The king nodded. “I see. But why are they still waiting for their wood? Why have you recruited men to build this compound and these workshops, instead of fetching the wood that they need?”

“How else are we to serve their needs sire? In the beginning I tried to handle each new request as it came in. But every day a new type and shape of wood. If I created planks, then the river people needed logs. If I created chippings, the house builders needed cladding.

“Everyone saw only their own needs. Only I saw all of them. To fulfil your decree, I need to be ready to provide whatever the people need.

“And so unfortunately they must wait until we are better able to do so. Soon we will be, once the last dozen workshops are completed. Then we will be able to begin providing wood once more.”

The king frowned in thought. “Can the people not fetch their own wood from the forest?”

Sadly, the woodcutter said, “No sire. Outside of the known trails the woods are too dangerous. Only the woodcutters know the safe paths. And only the woodcutters know the art of finding the good wood and felling it safely. It is an art that is learnt over many years”.

“But don’t you see?” said the king, “You need only do this and then let others do the rest. Fell the trees and bring the logs here. Let others do the making of planks and cladding. Let others worry about running the workshops. There is a host of people here outside your walls who can help. Let them help serve each other’s needs. You need only provide the raw materials”.

And with this the king ordered the gates to the compound to be opened, sending the relieved woodcutter back to the forest.

Returning to the compound many months later, the king once again found it to be a hive of activity. Except now the house builders and ship makers were crafting many sizes and shapes of planks. The toy makers took offcuts to shape the small pieces they needed, and the gardeners swept the leavings from all into sacks to carry to their gardens.

Happy that his decree had at last been fulfilled, the king continued on his way.

Read the first open data parable, “The scribe and the djinn’s agreement”.

Basic questions about data

Over the past couple of years I’ve written several posts that each focus on trying to answer a simple question relating to data and/or open data.

I’ve collected them together into a list here for easier reference. I’ll update the list as I write more related posts.

I find that asking and then trying to answer these questions is a good way to develop understanding. Often there are a number of underlying questions or issues that can be more easily surfaced.

What is Derived Data?

A while ago I asked the question: “What is a Dataset?“. The idea was to look at how different data communities were using the term to see if there were any common themes. This week I’ve been considering how UPRNs can be a part of open data, a question made more difficult due to complex licensing issues.

One aspect of the discussion is the idea of “derived data”. Anyone who has worked with open data in the UK will have come across this term in relation to licensing of Ordnance Survey and Royal Mail data. But, as we’ll see shortly, the term is actually in wider use. I’ve realised though that, like “dataset”, this is another term which hasn’t been well defined. So I thought I’d explore what definitions are available and whether we can bring any clarity.

I think there are several reasons why a clearer definition and understanding of what constitutes “derived data” would be useful:

  1. When using data published under different licenses it’s important to understand what the implications are of reusing and mixing together datasets. While open data licenses create few issues, mixing together open and shared data can create additional complexities due to non-open licensing terms. For further reading here see: “IPR and licensing issues in Derived Data” (Korn et al., 2007) and “Data as IP and Data License Agreements” (Practical Law, 2013).
  2. Understanding how data is derived is useful in understanding the provenance of a dataset and ensuring that sources are correctly attributed
  3. In the EU, at least, there are many open questions relating to the creation of services that use multiple data sources. As a community we should be trying to answer these questions to identify best practices, even if ultimately they might only be resolved through a legal process.

On that basis: what is derived data?

Definitions of derived data from the statistics community

The OECD Glossary of Statistical Terms defines “derived data element” as:

A derived data element is a data element derived from other data elements using a mathematical, logical, or other type of transformation, e.g. arithmetic formula, composition, aggregation.

This same definition is used in the data.gov.uk glossary, which has some comments.

The OECD definition of “derived statistics” also provides some examples of derivation, e.g. creating population-per-square-mile statistics from primary observations (e.g. population counts, geographical areas).
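As a tiny worked example of that kind of derivation, here’s population-per-square-mile computed from primary observations. The figures are invented.

```python
# Derived statistics in the OECD sense: population density computed from
# primary observations (counts and areas). All figures are invented.
primary = [
    {"area": "Northtown", "population": 120_000, "square_miles": 40.0},
    {"area": "Southvale", "population": 45_000, "square_miles": 90.0},
]

derived = [
    {"area": row["area"],
     "population_per_square_mile": round(row["population"] / row["square_miles"], 1)}
    for row in primary
]
print(derived)
```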

Staying in the statistical domain, this britannica.com article on censuses explains that (emphasis added):

there are two broad types of resulting data: direct data, the answers to specific questions on the schedule; and derived data, the facts discovered by classifying and interrelating the answers to various questions. Direct information, in turn, is of two sorts: items such as name, address, and the like, used primarily to guide the enumeration process itself; and items such as birthplace, marital status, and occupation, used directly for the compilation of census tables. From the second class of direct data, derived information is obtained, such as total population, rural-urban distribution, and family composition

I think this clearly indicates the basic idea that derived data is obtained when you apply a process or transformation to one or more source datasets.

What this basic definition doesn’t address is whether there any important differences between categories of data processing, e.g. does validating some data against a dataset yield derived data, or does the process have to be more transformative? We’ll come back to this later.

Legal definitions of derived data

The Open Database Licence (ODbL), which is now used by OpenStreetMap, defines a “Derivative Database” as:

…a database based upon the Database, and includes any translation, adaptation, arrangement, modification, or any other alteration of the Database or of a Substantial part of the Contents. This includes, but is not limited to, Extracting or Re-utilising the whole or a Substantial part of the Contents in a new Database.

This itemises some additional types of process, namely that extracting portions of a dataset also creates a derivative, not just transformations or statistical calculations.

However, as noted in the legal summary for the Creative Commons No Derivatives licence, simply changing the format of a work doesn’t create a derivative. So, in their opinion at least, this type of transformation doesn’t yield a derived work. The full legal code doesn’t use the term “derived data”, largely because the licences can be applied to a wide range of different types of works; instead it defines “Adapted Material”:

…material subject to Copyright and Similar Rights that is derived from or based upon the Licensed Material and in which the Licensed Material is translated, altered, arranged, transformed, or otherwise modified in a manner requiring permission under the Copyright and Similar Rights held by the Licensor.

The Ordnance Survey User Derived Dataset Contract (copy provided by Owen Boswarva), which allows others to create products using OS data, defines “User Derived Datasets” as:

datasets which you have created or obtained containing in part only or utilising in whole or in part Licensed Data in their creation together with additional information not obtained from any Licensed Data which is a fundamental component of the purpose of your Product and/or Service.

The definition stresses that the datasets consist of some geographical data, e.g. points or polygons, plus some additional data elements.

The Ordnance Survey derived data exemptions documentation has this to say about derived data:

data and or hard copy information created by you using (to a greater or lesser degree) data products supplied and licensed by OS, see our Intellectual Property (IP) policy.

For the avoidance of doubt, if you make a direct copy of a product supplied by OS – that copy is not derived data.

Their licensing jargon page just defines the term as:

…any data that you create using Ordnance Survey mapping data as a source

Unfortunately none of these definitions really provide any useful detail, which is no doubt part of the problem that everyone has with understanding OS policy and licensing terms. As my recent post highlights, the OS do have some pretty clear ideas of when and how derived data is created.

The practice note on “Data as IP and Data License Agreements” published by Practical Law provides a great summary of a range of IP issues relating to data and includes a discussion of derived data. Interestingly they highlight that it may be useful to consider not just data generated by processing a dataset but other data that may be generated through the interactions of a data publisher (or service provider) and a data consumer. (See “Original versus Derived Data“, page 7).

This leads them to define the following cases for when derived data might be generated:

  • Processing the licensed data to create new data that is either:
    • sufficiently different from the original data that the original data cannot be identified from analysis, processing or reverse engineering the derived data; or
    • a modification, enhancement, translation or other derivation of the original data but from which the original data may be traced.
  • Monitoring the licensee’s use of a provider’s service (commonly referred to as usage data).

From a general intellectual property stance I can see why usage data should be included here, but I would suggest that this category of derived data is quite different to what is understood by the (open) data community.

What I find helpful about this summary is that it starts to bring some clarity around the different types of processes that yield derived data.

The best existing approach to this that I’ve seen can be found in “Discussion draft: IPR, liability and other issues in regard to Derived Data”. The document aims to clarify, or at least start a discussion around, what is considered to be derived data in the geographical and spatial data domain. They identify a number of different examples, including:

  • Transforming the spatial projection of a dataset, e.g. to/from Mercator
  • Aggregating data about a region to summarise it at the level of an administrative area
  • Layering together different datasets
  • Inferring new geographical entities from existing features, e.g. road centre lines derived from road edges

In my opinion these types of illustrative examples are a much better way of trying to identify when and how derived data is created. For most re-users it’s easier to relate to an example than to legal definitions.

Another nice example is the OpenStreetMap guidance on what they consider to be “trivial transformations” which don’t trigger the creation of derived works.

An expanded definition of derived data

With the above in mind, can we create a better definition of derived data by focusing on the types of processes and transformations that are carried out?

Firstly I’d suggest that the following types of process do not create derived data:

  1. Using a dataset – stating the obvious really, but simply using a dataset doesn’t trigger the creation of a derivative. OpenStreetMap calls these “Produced Works”.
  2. Copying – again, I think this should be well understood, but I mention it for completeness. This is distribution, not derivation.
  3. Changing the format – E.g. converting a JSON file to XML. The information content remains the same, only the format is changed. This is supported by the Creative Commons definitions of remixing/reuse.
  4. Packaging (or repackaging) – E.g. taking a CSV file and re-publishing it as a data package. This would also include taking several CSV files from different publishers and creating a single data package from them. I believe this is best understood as a “Collected Work” or “Compilation” as the original datasets remain intact.
  5. Validation – checking whether field(s) in dataset A are correct according to field(s) in dataset B, so long as dataset A is not corrected as a result. This is a stance that OpenStreetMap seems to agree with.
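To illustrate item 3, here’s a sketch of a pure format change (JSON to CSV in this case, for brevity) that leaves the information content untouched; on the reading above, no derivative is created. The records and file names are invented.

```python
import csv
import json

# Changing the format of a dataset without altering its information content.
# "schools.json" and "schools.csv" are hypothetical file names; the JSON is
# assumed to be a list of flat objects.
with open("schools.json") as f:
    records = json.load(f)

with open("schools.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=sorted(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
```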


This leaves us with a number of other processes which do create derived data:

  1. Extracting – extracting portions of a dataset, e.g. extracting some fields from a CSV file.
  2. Restructuring – changing the schema or internal layout of a database, e.g. parsing out existing data to create new fields such as breaking down an address into its constituent parts
  3. Annotation – enhancing an existing dataset to include new fields, e.g. adding UPRNs to a dataset that contains addresses
  4. Summarising or Analysing – e.g. creating statistical summaries of fields in a dataset, such as the population statistics examples given by the OECD. Whether the original dataset can be reconstructed from the derived data will depend on the type of analysis being carried out, and how much of the original dataset is also included in the derived data.
  5. Correcting – validating dataset A against dataset B, and then correcting dataset A with data from dataset B where there are discrepancies
  6. Inferencing – applying reasoning, heuristics, etc. to generate entirely new data based on one or more datasets as input.
  7. Model Generation – I couldn’t think of a better name for this, but I’m thinking of scenarios such as sharing a neural network that has used some datasets as a training set. I think this is different to inferencing.
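And as a contrast, here’s a sketch of two of the derivation categories above at work: extracting a subset of fields (1) and summarising (4), both of which would create derived data under the definitions discussed. The file and column names are invented.

```python
import pandas as pd

# Two derivations from one source dataset. The file and column names
# ("planning-applications.csv", application_id, decision) are hypothetical.
source = pd.read_csv("planning-applications.csv")

# 1. Extracting: keep only a couple of fields from the source dataset.
extract = source[["application_id", "decision"]]

# 4. Summarising: aggregate the source into counts per decision type.
summary = source.groupby("decision").size().reset_index(name="applications")

extract.to_csv("decisions-extract.csv", index=False)
summary.to_csv("decisions-summary.csv", index=False)
```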

What do you think of this? Does it capture the main categories of deriving data? If you have comments on this then please let me know by leaving a comment here or pinging me on twitter.