We have a long way to go

Standing in the queue at the supermarket earlier, I noticed the cover of the Bath Chronicle. The lead story this week is: “House prices in Bath almost 13 times the average wage”. This is almost perfectly designed clickbait for me. I can’t help but want to explore the data.

In fact I’ve already done this before, when the paper published a similar headline in September last year: “Average house price in Bath is now eight times average salary”. I wrote a blog post at the time to highlight some of the issues with their reporting.

Now I’m writing another blog post, but this time to highlight how far we still have to go with publishing data on the web.

To try to illustrate the problems, here’s what happened when I got back from the supermarket:

  1. Read the article on the Chronicle website to identify the source of the data, the annual Home Truths report published by the National Housing Federation.
  2. I then googled for “National Housing Federation Home Truths” as the Chronicle didn’t link to its sources.
  3. I then found and downloaded the “Home Truths 2014/15: South West” report, which has a badly broken table of figures in it. After some careful reading I realised the figures didn’t match the Chronicle’s
  4. Double-checking, I browsed around the NHF website and found the correct report: “Home Truths 2015/2016: The housing market in the South West”. Which, you’ll notice, isn’t clearly signposted from their research page
  5. The report has a mean house price of £321,674 for Bath & North East Somerset using Land Registry data from 2014. It also has a figure of £25,324 for mean annual earnings in 2014 for the region, giving a ratio of 12.7. The earnings data is from the ONS ASHE survey
  6. I then googled for the ASHE survey figures as the NHF didn’t link to its sources
  7. Having found the ONS ASHE survey I clicked on the latest figures and found the reference tables before downloading the zip file containing Table 8
  8. Unzipping, I opened the relevant spreadsheet and found the worksheet containing the figures for “All” employees
  9. Realising that the ONS figures were actually weekly rather than annual wages I opened up my calculator and multiplied the value by 52
  10. The figures didn’t match. Checked my maths
  11. I then realised that, like an idiot, I’d downloaded the 2015 figures but the NHF report was based on the 2014 data
  12. Returning to the ONS website I found the tables for the 2014 Revised version of the ASHE
  13. Downloading, unzipping, and calculating I found that again the figures didn’t match
  14. On a hunch, I checked the ONS website again and then found the reference tables for the 2014 Provisional version of the ASHE
  15. Downloading, unzipping, and re-calculating I finally had my match for the NHF figure
  16. I then decided that rather than dig further I’d write this blog post
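For what it’s worth, the arithmetic behind the headline figure is trivial to reproduce from the numbers in the report (step 5), remembering that the ASHE tables report weekly rather than annual earnings (step 9):

```python
# Figures quoted in the NHF “Home Truths 2015/2016: South West” report
mean_house_price = 321_674   # Land Registry mean house price, B&NES, 2014 (£)
mean_annual_wage = 25_324    # mean annual earnings, 2014 (£), from ASHE

# ASHE Table 8 actually reports *weekly* earnings, so the annual figure
# is the weekly value multiplied by 52
mean_weekly_wage = mean_annual_wage / 52   # ≈ £487 a week

ratio = mean_house_price / mean_annual_wage
print(round(ratio, 1))   # 12.7 – the “almost 13 times” of the headline
```

A two-line calculation, and yet reproducing it took six dataset downloads.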

This is a less than ideal situation. What could have streamlined this process?

The lack of direct linking – from the Chronicle to the NHF, and from the NHF to the ONS – was the root cause of my issues here. I spent far too much time working to locate the correct data. Direct links would have avoided all of my bumbling around.

While a direct link would have taken me straight to the data, I might have missed out on the fact that there were revised figures for 2014. Or that there were actually some new provisional figures for 2015. So there’s an update to the story already waiting to be written. The analysis is already out of date.

The new data was published on the 18th November and the NHF report on the 23rd. That gave a five-day period in which the relevant tables and commentary could have been updated. Presumably the report was too deep into final production to make changes. Or maybe no-one thought to check for updated data.

If both the raw data from the ONS and the NHF analysis had been published natively to the web, rather than as PDFs, maybe some of that production overhead could have been reduced. I know PDF has better support for embedding and linking data these days, but a web-native approach might have allowed for something more dynamic.

In fact, why should the numbers have been manually recalculated at all? The actual analysis involves little more than pulling some cells from existing tables and doing some basic calculations. Maybe that could have been done on the fly? Perhaps by embedding the relevant figures. At the moment I’m left with doing some manual copy-and-paste.

It’s not just the NHF that are slow to publish their figures though. Researching the Chronicle article from last year, I turned up some DCLG figures on the housing market and house prices. These weren’t actually referenced from the article or any of its sources. I just tripped over them whilst investigating. Because data nerd.

The live (sic) DCLG tables include a ratio of median house prices to median earnings but they haven’t been updated since April 2014. Their analysis only uses the provisional ASHE figures for 2013.

Oh, and just for fun, the NHF analysis uses mean house prices and wages, whilst the DCLG data uses medians. The ONS publish both weekly mean and median earnings for all periods, as well as some data for different quantiles.

And this is just one small example.

My intent here isn’t to criticise the Chronicle, the NHF, DCLG, and especially not the ONS who are working hard to improve how they publish their data.

I just wanted to highlight that:

  • we need better norms around data citation, including when and how to link to both new and revised data
  • we need better tools for telling stories on the web, that can easily be used by anyone and which can readily access and manipulate raw data
  • we need better discovery tools for data that go beyond just keyword searches
  • we need to make it easier to share not just analyses but also insights and methods, to avoid doing unnecessary work and to make it easier (or indeed unnecessary) to fact check against sources

That’s an awful lot still to be done. Opening data is just the start of building a good data infrastructure for the web. I’m up for the challenge though. This is the stuff I want to help solve.

Shortly after I published this Matt Jukes published a post wondering what a digital statistical publication might look like. Matt’s post and Russell Davies’ thoughts on digital white papers are definitely worth a read.

How can open data publishers monitor usage?

Some open data publishers require a user to register with their portal or provide other personal information before downloading a dataset.

For example:

  • the recently launched Consumer Data Research Centre data portal requires users to register and login before data can be downloaded
  • access to any of the OS Open Data products requires the completion of a form which asks for personal information and an email address to which a download link is sent
  • the Met Office Data Point API provides OGL licensed data but users must register in order to obtain an API key

Requiring a registration step is in fact very common when it comes to open data published via an API. Registration is required on Transport API, Network Rail and Companies House to name a few. This isn’t always the case though as the Open Corporates API can be used without a key, as can APIs exposed via the Socrata platform (and other platforms, I’m sure). In both cases registration carries the benefit of increased usage limits.

The question of whether to require a login is one that I’ve run into a few times. I wanted to explore it a little in this post to tease out some of the issues and alternatives.

For the rest of the post whenever I refer to “a login” please read it as “a login, registration step, or other intermediary web form”.

Is requiring a login permitted?

I’ll note from the start that the Open Definition doesn’t have anything to say about whether a login is permitted or not.

The definition simply says that data “…must be provided as a whole and at no more than a reasonable one-time reproduction cost, and should be downloadable via the Internet without charge”. In addition the data “…must be provided in a form readily processable by a computer and where the individual elements of the work can be easily accessed and modified.”

You can choose to interpret that in a number of ways. The relevant bits of text have gone through a number of iterations since the definition was first published, and I think the current language isn’t as strong as that in previous versions. That said, I don’t recall there ever being a specific pronouncement against having a login.

There is however a useful discussion on the open definition list from October 2014 which has some interesting comments and is worth reviewing. Andrew Stott’s comments provide a useful framing, asking whether such a step is necessary to the provision of the information.

In my view there are very few cases where such a step is necessary, so as general guidance I’d always recommend against requiring a login when publishing open data.

But, being a pragmatic chap, I prefer not to deal in absolutes so I’d like you to think about the pros and cons on either side.

Why do publishers want a login?

I’ve encountered several reasons why publishers want to require a login:

  1. to collect user information to learn more about who is using their data
  2. to help manage and monitor usage of an API
  3. both of the above

The majority of open data publishers I’ve worked with are very keen to understand who is using their data, how they’re using it, and how successful their users are at building things with their data. It’s entirely natural, as part of providing a free resource, to want to understand if people are finding it useful.

Knowing that data is in use and is delivering value can help justify ongoing access, publication of additional data, or improvements in how existing data is published. Everyone wants to understand if they’re having an impact. Knowing who is interested enough to download the data is a first step towards measuring that.

An API without usage limits presents a potentially unbounded liability for a publisher in terms of infrastructure costs. The inability to manage or balance usage across a user base means that especially active or abusive users can hamper the ability for everyone to benefit from the API. API keys, and similar authentication methods, provide a hook that can be used to monitor and manage usage. (IP addresses are not enough.)
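To sketch why keys provide that hook: a minimal (and purely illustrative) per-key rate limiter counts and caps requests for each key, something an IP address can’t do reliably behind NATs and shared proxies. A production service would use something more robust, but the principle is the same:

```python
import time
from collections import defaultdict

class RateLimiter:
    """Allow at most `limit` requests per API key in any `window`-second period."""
    def __init__(self, limit=100, window=3600):
        self.limit = limit
        self.window = window
        self.requests = defaultdict(list)  # api_key -> recent request timestamps

    def allow(self, api_key):
        now = time.time()
        # Discard timestamps that have fallen outside the window
        recent = [t for t in self.requests[api_key] if now - t < self.window]
        self.requests[api_key] = recent
        if len(recent) >= self.limit:
            return False  # over quota: reject or throttle this key
        recent.append(now)
        return True

limiter = RateLimiter(limit=3, window=60)
print([limiter.allow("anon") for _ in range(4)])  # [True, True, True, False]
```

The same mechanism supports tiering: unregistered keys get a low default limit, while registered users earn a higher one, which is exactly the trade-off Open Corporates and others offer.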

Why don’t consumers want to login?

There are also several reasons why data consumers don’t want to have to login:

  1. they want to quickly review and explore some data and a registration step provides unnecessary barriers
  2. they want or need the freedom to access data anonymously
  3. they don’t trust the publisher with their personal information
  4. they want to automatically script bulk downloads to create production workflows without the hassle of providing credentials or navigating access control
  5. they want to use an API from a browser based application which limits their ability to provide private credentials
  6. all of the above

Again, these are all reasonable concerns.

What are the alternatives?

So, how can publishers learn more about their users and, where necessary, offer a reasonable quality of service whilst staying mindful of the concerns of users?

I think the best way to explore that is by focusing on the question that publishers really want to answer: who are the users actively engaged in using my data?

Requiring a registration step or just counting downloads doesn’t help you answer that question. For example:

  • I’ve filled in the OS Open Data download form multiple times for the same product, sometimes on the same day but from different machines. I can’t imagine it tells them much about what I’ve done (or not done) with their data and they’ve never asked
  • I’ve registered on portals in order to download data simply to take a look at its contents without any serious intent to use it
  • I’ve worked with data publishers that have lots of detail in their registration database but no real insight into what users are doing, and no ongoing relationship with them

In my view the best way to identify active users and learn more about how they are using your data is to talk to them.

Develop an engagement plan that involves users not just after the release of some data, but before it. Give them a reason to want to talk to you. For example:

  • tell them when the data is updated, or when you’ve made corrections to it. This is a service that many serious consumers would jump at
  • give them a feedback channel that lets them report problems or make suggestions about improvements and then make sure that channel is actually monitored so feedback is acted on
  • help celebrate their successes by telling their stories, featuring their applications in a showcase, or via social media

Giving users a reason to engage can also help with API and resource management. As I mentioned in the introduction, Open Corporates and others provide a basic usage tier that doesn’t require registration. This lets hobbyists, tinkerers and occasional users get what they need. But the promise of freely accessible, raised usage limits gives active users a reason to engage more closely.

If you’re providing data in bulk but are concerned about data volumes then provide smaller sample datasets that can be used as a preview of the full data.

In short, just like any other data collection exercise, it’s important that publishers understand why they’re asking users to register. If the data collected is ultimately of low value, e.g. because people provide fake details, or isn’t acted on as part of an engagement plan, then there’s very little reason to collect it at all.

This post is part of my “basic questions about data” series. If you’ve enjoyed this one then take a look at the other articles. I’m also interested to hear suggestions for topics, so let me know if you have an idea. 

Who is the intended audience for open data?

This post is part of my ongoing series: basic questions about data. It’s intended to expand on a point that I made in a previous post in which I asked: who uses data portals?

At times I see quite a bit of debate within the open data community around how best to publish data. For example should data be made available in bulk or via an API? Which option is “best”? Depending on where you sit in the open data community you’re going to have very different responses to that question.

But I find that in the ensuing debate we often overlook that open data is intended to be used by anyone, for any purpose. And that means that maybe we need to think about more than just the immediate needs of developers and the open data community.

While the community has rightly focused on ensuring that data is machine-readable, so it can be used by developers, we mustn’t forget that data needs to be human-readable too. Otherwise we end up with critiques of what I consider to be fairly reasonable and much-needed guidance on structuring spreadsheets, and suggestions of alternatives that are well meaning but feel a little under-baked.

I feel that there are several different and inter-related viewpoints being expressed:

  • That the citizen or user is the focus and we need to understand their needs and build services that support them. Here data tends to be a secondary concern and perhaps focused on transactional statistics on performance of those services, rather than the raw data
  • That open data is not meant for mere mortals and that its primary audience is developers, who will analyse it and present it to users. The emphasis here is on provision of the raw data as rapidly as possible
  • A variant of the above that emphasises delivery of data via an API to web and mobile developers allowing them to more rapidly deliver value. Here we see cases being made about the importance of platforms, infrastructure, and API programs
  • That citizens want to engage with data and need tools to explore it. In this case we see arguments for on-line tools to explore and visualise data, or reasonable suggestions to simply publish data in spreadsheets as this is a format with which many, many people are comfortable

Of course all of these are correct, although their prominence varies wildly across different types of data, application, etc. Depending on where you sit in the open data value network your needs are going to be quite different.

It would be useful to map out the different roles of consumers, aggregators, intermediaries, etc to understand what value exchanges are taking place, as I think this would help highlight the value that each role brings to the ecosystem. But until then both consumers and publishers need to be mindful of potentially competing interests. In an ideal world publishers would serve every reuser need equally.

My advice is simple: publish for machines, but don’t forget the humans. All of the humans. Publish data with context that helps anyone – developers and the interested reader – properly understand the data. Ensure there is at least a human-readable summary or view of the data as well as more developer oriented bulk downloads. If you can get APIs “out of the box” with your portal, then invest the effort you would otherwise spend on preparing machine-readable data in providing more human-readable documentation and reports.

Our ambition should be to build an open data commons that is accessible and useful for as many people as possible.


Managing risks when publishing open data

A question that I frequently encounter when talking to organisations about publishing open data is: “what if someone misuses or misunderstands our data?“.

These concerns stem from several different sources:

  • that the data might be analysed incorrectly, drawing incorrect conclusions that might be attributed to the publisher
  • that the data has known limitations and this might reflect on the publisher’s abilities, e.g. exposing issues with their operations
  • that the data might be used against the publisher in some way, e.g. to paint them in a bad light
  • that the data might be used for causes with which the publisher does not want to be aligned
  • that the data might harm the business activities of the publisher, e.g. by allowing someone to replicate a service or product

All of these are understandable and reasonable concerns. And the truth is that when publishing open data you are giving up a great deal of control over your data.

But the same is true of publishing any information: there will always be cases of accidental and wilful misuse of information. Short of not sharing information at all, all organisations already face this risk. It’s just that open data, which anyone can access, use and share for any purpose, really draws this issue into the spotlight.

In this post I wanted to share some thoughts about how organisations can manage the risks associated with publishing open data.

Risks of not sharing

Firstly, it’s worth noting that the risks of not sharing data are often unconsciously discounted.

There’s increasing evidence that holding on to data can hamper innovation whereas opening data can unlock value. This might be of direct benefit for the organisation or have wider economic, social and environmental benefits.

Organisations with a specific mission or task can more readily demonstrate their impact and progress by publishing open data. Those that are testing a theory of change will be reporting on indicators that help to measure impact and confirm that interventions are working as expected. Open data is the most transparent approach to these impact assessments.

Many organisations, particularly government bodies, are attempting to address challenges that can only be overcome in collaboration with others. Open data specifically, and data sharing practices in general, provides an important foundation for collaborative projects.

As data moves from the closed to the open end of the data spectrum, an increasingly wide audience can access and use that information. We can point to Joy’s Law as a reason why this is a good thing.

In scientific publishing there are growing concerns about a “reproducibility crisis”, fuelled in part by a lack of access to both original experimental data and analysis. Open publishing of scientific results is one remedy.

But setting aside what might be seen as a sleight of hand re-framing of the original question, how can organisations minimise specific types of risk?

Managing forms of permitted reuse

Organisations manage the forms of reuse of their data through a licence. The challenge for many is that an open licence places few limits on how data can be reused.

There is a wider range of licences that publishers could use, including some that limit creation of derivative works or commercial uses. But all of these restrictions may also unintentionally stop the kinds of reuse that publishers want to encourage or enable. This is particularly true when applying a “non-commercial” use clause. These issues are covered in detail in the recently published ODI guidance on the impacts of non-open licences.

While my default recommendation is that organisations use a CC-BY 4.0 licence, an alternative is the CC-BY-SA licence which requires that any derivative works are published under the same licence, i.e. that reusers must share in the same spirit as the publisher.

This could be a viable alternative that might help organisations feel more confident that they are deterring some forms of undesired reuse, e.g. discouraging a third-party or competitor from publishing a commercial analysis based on their data by requiring that the report also be distributed under an open licence.

The attribution requirement already stops data being reused without its original source being credited.

Managing risks of accidental misinterpretation

When I was working in academic publishing a friend at the OECD told me that at least one statistician had been won over to a plan to publicly publish data by the observation that the alternative was to continue to allow users to manually copy data from published reports, with the obvious risks of transcription errors.

This is a small example of how to manage risks of data being accidentally misused or misinterpreted. Putting appropriate effort into the documentation and publication of a dataset will help reusers understand how it can be correctly used. This includes:

  • describing what data is being reported
  • how the data was collected
  • the quality control, if any, that has been used to check the data
  • any known limits on its accuracy or gaps in coverage

All of these help to provide reusers with the appropriate context that can guide their use. It also makes them more likely to be successful. This detail is already covered in the ODI certification process.

Writing a short overview of a dataset highlighting its most interesting features, sharing ideas for how it might be used, and clearly marking known limits can also help orientate potential reusers.

Of course, publishers may not have the resources to fully document every dataset. This is where having a contact point to allow users to ask for help, guidance and clarification is important. 

Managing risks of wilful misinterpretation

Managing risks of wilful misinterpretation of data is harder. You can’t control cases where people totally disregard documentation and licensing in order to push a particular agenda. Publishers can however highlight breaches of social norms and can choose to call out misuse they feel is important to highlight.

It’s important to note that there are standard terms in the majority of open licences, including the Creative Commons Licences and the Open Government Licence, which address:

  • limited warranties – no guarantees that data is fit for purpose, so reusers can’t claim damages if it is misused or misapplied
  • non-endorsement – reusers can’t say that their use of the data was endorsed or supported by the publisher
  • no use of trademarks, branding, etc. – reusers don’t have permission to brand their analysis as originating from the publisher
  • attribution – reusers must acknowledge the source of their data and cannot pass it off as their own

These clauses collectively limit the liability of the publisher. They also potentially provide some recourse to take legal action if a reuser breached the terms of the licence, and the publisher thought that this was worth doing.

I would usually add to this that the attribution requirement means that there is always a link back to the original source of the data. This allows the reader of some analysis to find the original authoritative data and confirm any findings for themselves. It is important that publishers document how they would like to be attributed.

Managing business impacts

Finally, publishers concerned about the risks that releasing data poses to their business should ensure they’re doing so with a clear business case. This includes understanding whether the supply of data is the core value of the business, or whether customers place more value on the services built around it.

One startup I worked with was concerned that an open licence on user contributions might allow a competitor to clone their product. But in this case the defensibility of their business model derived not from controlling the data but from the services provided and the network effects of the platform. These are harder things to replicate.

This post isn’t intended to be a comprehensive review of all approaches to risk management when releasing data. There’s a great deal more which I’ve not covered including the need to pay appropriate attention to data protection, privacy, anonymisation, and general data governance.

But there is plenty of existing guidance available to help organisations work through those areas. I wanted to share some advice that more specifically relates to publishing data under an open licence.

Please leave a comment to let me know what you think. Is this advice useful and is there anything you would add?

Fictional data

The phrase “fictional data” popped into my head recently, largely because of odd connections between a couple of projects I’ve been working on.

It’s stuck with me because, if you set aside the literal meaning of “data that doesn’t actually exist”, there are some interesting aspects to it. For example the phrase could apply to:

  1. data that is deliberately wrong or inaccurate in order to mislead – lies or spam
  2. data that is deliberately wrong as a proof of origin or claim of ownership – e.g. inaccuracies introduced into maps to identify their sources, or copyright easter eggs
  3. data that is deliberately wrong, but intended as a prank – e.g. the original entry for Uqbar on Wikipedia. Uqbar is actually a doubly fictional place.
  4. data that is fictionalised (but still realistic) in order to support testing of some data analysis – e.g. a set of anonymised and obfuscated bank transactions
  5. data that is fictionalised in order to avoid being a nuisance, causing confusion, or accidental linkage – like 555-prefix telephone numbers or perhaps social media account names
  6. data that is drawn from a work of fiction or a virtual world – such as the Marvel Universe social graph, the Elite: Dangerous trading economy (context), or the data and algorithms relating to Pokémon capture.

I find all of these fascinating, for a variety of reasons:

  • How do we identify and exclude deliberately fictional data when harvesting, aggregating and analysing data from the web? Credit to Ian Davis for some early thinking about attack vectors for spam in Linked Data. While I’d expect copyright easter eggs to become less frequent they’re unlikely to completely disappear. But we can definitely expect more and more deliberate spam and attacks on authoritative data. (Categories 1, 2, 3)
  • How do we generate useful synthetic datasets that can be used for testing systems? Could we generate data based on some rules and a better understanding of real-world data as a safer alternative to obfuscating data that is shared for research purposes? It turns out that some fictional data is a good proxy for real world social networks. And analysis of videogame economics is useful for creating viable long-term communities. (Categories 4, 6)
  • Some of the most enthusiastic collectors and curators of data are those that are documenting fictional environments. Wikia is a small universe of mini-wikipedias complete with infoboxes and structured data. What can we learn from those communities and what better tools could we build for them? (Category 6)

Interesting, huh?

What is a data portal?

This post is part of my ongoing series of basic questions about data, this time prompted by a tweet by Andy Dickinson asking the same question.

There are lots of open data portals. OpenDataMonitor lists 161 in the EU alone. The numbers have grown rapidly over the last few years. Encouraged by exemplars such as data.gov.uk, they’re usually the first item on the roadmap for any open data initiative.

But what is a data portal and what role does it play?

A Basic Definition

I’d suggest that the most basic definition of an open data portal is:

A list of datasets with pointers to how those datasets can be accessed.

A web page on an existing website meets this criterion. It’s the minimum viable open data portal. And, quite rightly, this is still where many projects begin.

Once you have more than a handful of datasets then you’re likely to need something more sophisticated to help users discover datasets that are of interest to them. A more sophisticated portal will provide the means to capture metadata about each dataset and then use that to provide the ability to search and browse through the list, e.g. by theme, licence, or other facets.
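That basic definition can be sketched in a few lines. The dataset records and field names below are invented for illustration, but they show the essential shape: a catalogue is a list of metadata records pointing at data, and faceted browsing is just filtering on those metadata fields:

```python
# A hypothetical minimal catalogue: metadata records with pointers to the data
catalogue = [
    {"title": "Road traffic counts", "theme": "transport",
     "licence": "OGL", "url": "https://example.org/traffic.csv"},
    {"title": "House price statistics", "theme": "housing",
     "licence": "OGL", "url": "https://example.org/prices.csv"},
    {"title": "Bus timetables", "theme": "transport",
     "licence": "CC-BY", "url": "https://example.org/buses.csv"},
]

def browse(datasets, **facets):
    """Filter the catalogue by metadata facets, e.g. theme or licence."""
    return [d for d in datasets
            if all(d.get(k) == v for k, v in facets.items())]

print([d["title"] for d in browse(catalogue, theme="transport")])
# ['Road traffic counts', 'Bus timetables']
```

Everything a more sophisticated portal adds, from search to storage to APIs, is layered on top of this core structure.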

Portals rarely place any restrictions on the type of data that is catalogued or the means by which data is accessed. However more sophisticated portals offer additional capabilities for both the end user and the publisher.

Publisher features include:

  • File storage, to make it easier to make data available online
  • Additional curation tools, e.g. addition of custom metadata, creation of collections, and promotion of datasets
  • Integrated data stores, e.g. to allow data files to be uploaded into a database that will allow data to be queried and accessed by users in more sophisticated ways

User features include:

  • Notification tools to alert to the publication of new or updated datasets
  • Integrated and embeddable visualisations to support manipulation and use of data directly in the portal, often with embedding in other websites.
  • Automatically generated APIs to allow for more sophisticated online querying and interaction with datasets
  • Engagement tools such as rating, discussions and publisher feedback channels

There are a number of open source and commercial data portals, including CKAN, Socrata and OpenDataSoft. All of these offer a mixture of the features outlined above.

Who uses data portals?

Right now the target customer for a data portal is likely to be a public sector organisation, e.g. a local authority, city administration or government department that is looking to publish a number of datasets.

But the users of a data portal are a mixture of all of different aspects of the open data community: individual citizens, developers or civic hackers, data journalists, public sector officials, commercial developers, etc.

Balancing the needs of these different constituents is difficult:

  • The customer wants to see some results from publishing their data as soon as possible, so instant access to visualisations and exploration tools gives immediate utility and benefit
  • Data analysts or designers will likely just want to download the data so they can make more sophisticated use of the data
  • Web and mobile developers often want an API to allow them to quickly build an application, without setting up infrastructure and a custom data processing pipeline
  • A citizen, assuming they wander in at all, is likely to want some fairly simple data exploration tools, ideally wrapped up in some narrative that puts the data into context and helps tell a story

Depending on where you sit in the community you may think that current data portals are either fantastic or under-serving your needs.

The business model and target market of the portal developer is also likely to affect how well they serve different communities. APIs, for example, support the creation of platforms that help embed the portal into an ecosystem.

Enterprise use

There are enterprise data portals too. Large enterprises have exactly the same problems as exist in the wider open data community: it’s often not clear what data is available or how to access it.

For example Microsoft has the Azure Data Catalog. This has been around for quite a few years now in various incarnations. There are also tools like Tamr Catalog.

They both have similar capabilities – collaborative cataloguing of datasets within an enterprise – and both are tied into a wider ecosystem of data processing and analytics tools.

Future directions

How might data portals evolve in the future?

I think there’s still plenty of room to develop new features to better serve different audiences.

For example, none of the existing catalogues really help me publish some data and then tell a story with it. A story is likely to consist of a mixture of narrative and visualisations, perhaps spanning multiple datasets. This might best be served by making it easier to embed different views of data into blog posts, rather than building additional content management features into the catalogue itself. But for a certain audience, e.g. data journalists and media organisations, this might be a useful package.

Better developer tooling, e.g. data syndication and schema validation, would help serve data scientists who are building custom workflows against data that is downloaded or harvested from data portals. This is a way to explore a platform approach that doesn’t necessarily require downstream users to use the portal APIs to query the data – just syndication of updates and notifications of changes.
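
A minimal sketch of the schema-validation idea, assuming a hypothetical two-column CSV (the column names and sample data are invented for illustration):

```python
import csv
import io

# Expected columns and the types their values should parse as.
# This schema is an assumption for the example, not a real standard.
SCHEMA = {"area": str, "mean_price": float}

def validate(csv_text):
    """Return a list of schema errors for the given CSV text."""
    errors = []
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = set(SCHEMA) - set(reader.fieldnames or [])
    if missing:
        errors.append(f"missing columns: {sorted(missing)}")
        return errors
    # Data rows start at line 2, after the header.
    for line_no, row in enumerate(reader, start=2):
        for col, typ in SCHEMA.items():
            try:
                typ(row[col])
            except ValueError:
                errors.append(f"row {line_no}: {col}={row[col]!r} is not {typ.__name__}")
    return errors

sample = "area,mean_price\nBath,321674\nBristol,not-a-number\n"
print(validate(sample))  # one error: row 3 has a non-numeric price
```

A portal could run checks like this at upload time, so that downstream users harvesting the file can rely on its declared structure.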

Another area is curation and data management tools. E.g. features to support multiple people in creating and managing a dataset directly in the portal itself. This might be useful for small-scale enterprise uses as well as supporting collaboration around open datasets.

Automated analysis of hosted data is another area in which data portals could develop features that would support both the publishers and developers. Some metadata about a dataset, e.g. to help describe its contents, could be derived by summarising features of the data rather than requiring manual data entry.
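
As a rough sketch of what automated summarisation might look like, the function below derives simple descriptive metadata (row count, per-column inferred type and range) from a CSV; the dataset and field names are invented:

```python
import csv
import io

def summarise(csv_text):
    """Derive basic metadata from a CSV instead of requiring manual entry."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    meta = {"rows": len(rows), "columns": {}}
    for col in rows[0]:
        values = [r[col] for r in rows]
        try:
            # If every value parses as a number, record its range.
            nums = [float(v) for v in values]
            meta["columns"][col] = {"type": "number", "min": min(nums), "max": max(nums)}
        except ValueError:
            # Otherwise treat it as a string column and count distinct values.
            meta["columns"][col] = {"type": "string", "distinct": len(set(values))}
    return meta

sample = "area,mean_price\nBath,321674\nBristol,264000\n"
print(summarise(sample))
```

Summaries like this could populate a dataset's description page automatically, helping users judge whether a dataset is relevant before downloading it.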

Regardless of how they evolve in terms of features, data portals are likely to remain a key part of open data infrastructure. However as Google and others begin doing more to index the contents of datasets, it may be that the users of portals increasingly become machines rather than humans.

“The woodcutter”, an open data parable

In a time long past, in a land far away, there was once a great forest. It was a huge sprawling forest containing every known species of tree. And perhaps a few more.

The forest was part of a kingdom that had been ruled over by an old mad king for many years. The old king had refused anyone access to the forest. Only he was allowed to hunt amongst its trees. And the wood from the trees was used only to craft things that the king desired.

But there was now a new king. Where the old king was miserly, the new king was generous. Where the old king was cruel, the new king was wise.

As his first decree, the king announced that the trails that meandered through the great forest might be used by anyone who needed passage. And that the wood from his forest could be used by anyone who needed it, provided that they first ask the king’s woodcutter.

Several months after his decree, whilst riding on the edge of the forest, the king happened upon a surprising scene.

Gone was the woodcutter’s small cottage and workshop. In its place had grown up a collection of massive workshops and storage sheds. Surrounding the buildings was a large wooden palisade in which was set some heavily barred gates. From inside the palisade came the sounds of furious activity: sawing, chopping and men shouting orders.

All around the compound, filling the nearby fields, was a bustling encampment. Looking at the array of liveries, flags and clothing on display, the king judged that there were people gathered here from all across his lands. From farms, cities, and towns. From the coast and the mountains. There were also many from neighbouring kingdoms.

It was also clear that many of these people had been living here for some time.

Perplexed, the king rode to the compound, making his way through the crowds waiting outside the gates. Once he had been granted entry, he immediately sought out the woodcutter, finding him directing activities from a high vantage point.

Climbing to stand beside the woodcutter the king asked, “Woodcutter, why are all these people waiting outside of your compound? Where is the wood that they seek?”

Flustered, the woodcutter mopped his brow and bowed to his king. “Sire, these people shall have their wood as soon as we are ready. But first we must make preparations.”

“What preparations are needed?”, asked the king. “Your people have provided wood from this forest for many, many years. While the old king took little, is it not the same wood?”

“Ah, but sire, we must now provide the wood to so many different peoples”. Gesturing to a small group of tents close to the compound, the woodcutter continued: “Those are the ship builders. They need the longest, straightest planks to build their ships. And great trees to make their keels”.

“Over there are the house builders”, the woodcutter gestured, “they too need planks. But of a different size and from a different type of tree. This small group here represents the carpenters guild. They seek only the finest hard woods to craft clever jewellery boxes and similar fine goods.”

The king nodded. “So you have many more people to serve and many more trees to fell.”

“That is not all”, said the woodcutter pointing to another group. “Here are the river people who seek only logs to craft their dugout boats. Here are the toy makers who need fine pieces. Here are the fishermen seeking green wood for their smokers. And there the farmers and gardeners looking for bark and sawdust for bedding and mulch”.

The king nodded. “I see. But why are they still waiting for their wood? Why have you recruited men to build this compound and these workshops, instead of fetching the wood that they need?”

“How else are we to serve their needs sire? In the beginning I tried to handle each new request as it came in. But every day a new type and shape of wood. If I created planks, then the river people needed logs. If I created chippings, the house builders needed cladding.

“Everyone saw only their own needs. Only I saw all of them. To fulfil your decree, I need to be ready to provide whatever the people needed.

“And so unfortunately they must wait until we are better able to do so. Soon we will be, once the last dozen workshops are completed. Then we will be able to begin providing wood once more.”

The king frowned in thought. “Can the people not fetch their own wood from the forest?”

Sadly, the woodcutter said, “No sire. Outside of the known trails the woods are too dangerous. Only the woodcutters know the safe paths. And only the woodcutters know the art of finding the good wood and felling it safely. It is an art that is learnt over many years”.

“But don’t you see?” said the King, “You need only do this and then let others do the rest. Fell the trees and bring the logs here. Let others do the making of planks and cladding. Let others worry about running the workshops. There is a host of people here outside your walls who can help. Let them help serve each other’s needs. You need only provide the raw materials”.

And with this the king ordered the gates to the compound to be opened, sending the relieved woodcutter back to the forest.

Returning to the compound many months later, the king once again found it to be a hive of activity. Except now the house builders and ship makers were crafting many sizes and shapes of planks. The toy makers took offcuts to shape the small pieces they needed, and the gardeners swept the leavings from all into sacks to carry to their gardens.

Happy that his decree had at last been fulfilled, the king continued on his way.

Read the first open data parable, “The scribe and the djinn’s agreement”.