That thing we call “open”

I’ve been involved in a few conversations recently about what “open” or “being open” means in different situations.

As I’ve noted previously, when people say “open” they often mean very different things. And while there may be clear definitions of “open”, people often don’t use the terms correctly. Some phrases, like “open API”, are still, well, open to interpretation.

In this post I’m going to summarise some of the ways in which I tend to think about making something “open”.

Let me know if I’m missing something so I can plug gaps in my understanding.

Openness of a “thing”

Digital objects: books, documents, images, music, software and datasets can all be open.

Making things open in this sense is the best documented, but still the most consistently misunderstood. There are clear definitions for open content and data, open source, etc. Open in these contexts provides various freedoms to use, remix, share, etc.

People often confuse something being visible or available to them with it being open, but that’s not the same thing at all. Being able to see or read something doesn’t give you any legal permissions at all.

It’s worth noting that the definitions of open “things” in different communities often overlap. For example, the Creative Commons licences allow works to be licensed in ways that enable a wide variety of legal reuses. But the Open Definition only recognises a subset of those licences as open, rather than merely shared.

Putting an open licence on something also doesn’t necessarily grant you the full freedom to reuse that thing. For example I could open source some machine learning software but it might only be practically reusable if you can train it on some data that I’ve chosen not to share.

Or I might use a licence like the Open Government Licence, which allows me to put an open licence on something whilst ignoring the existence of any third-party rights. No need to do my homework. Reuser beware.

Openness of a process

Processes can be open. In this context it might be better to think about transparency (e.g. of how the process is running) or the ability to participate in the process.

Anything that changes and evolves over time will have a process by which those changes are identified, agreed, prioritised and applied. We sometimes call that governance. The definition of an open standard covers both the openness of the standard (the thing) and the openness of the process.

Stewardship of a software project, a dataset or a standard is another example of where it might be useful for a process to be open. Questions we can ask of open processes include:

  • Can I contribute to the main codebase of a software package, rather than just fork it?
  • Can I get involved in the decision making around how a piece of software or standard evolves?
  • Can I directly fix errors in a dataset?
  • Can I see what decisions have been, or are being, made that relate to how something is evolving?

When we’re talking about open data or open source, often we’re really talking about openness of the “thing”. But when we’re making things open to make them better, I think we’re often talking about being open to contributions and participation, which needs something more than a licence on a thing.

There’s probably a broader category of openness here which relates to how open a process is socially. Words like inclusivity and diversity spring to mind.

Your standards process isn’t really open to all if all of your meetings are held face to face in Hawaii.

Openness of a product, system or platform

Products, platforms and systems can be open too. Here we can think of openness as relating to the degree to which the system

  • is built around open standards and open data (made from open things)
  •  is operated using open processes
  • is available for wider access and use

We can explore this by asking questions like:

  • Is it designed to run on open infrastructure or is it tied to particular cloud infrastructure or hardware?
  • Are the interfaces to the system built around open standards?
  • Can I get access to an API? Or is it invite only?
  • How do the terms of service shape the acceptable uses of the system?
  • Can I use its outputs, e.g. the data returned by a platform or an API, under an open licence?
  • Can we observe how well the system or platform is performing, or measure its impacts in different ways (e.g. socially, economically, environmentally)?

Openness of an ecosystem

Ecosystems can be open too. In one sense an open ecosystem is “all of the above”. But there are properties of an ecosystem that might themselves indicate aspects of openness:

  • Is there a choice in providers, or is there a monopoly provider of services or data?
  • How easy is it for new organisations to engage with the ecosystem, e.g. to provide competing or new services?
  • Can we measure the impacts and operations of the ecosystem?

When we’re talking about openness of an ecosystem we’re usually talking about markets and sectors and regulation and governance.

Applying this in practice

So when thinking about whether something is “open”, the first thing I tend to do is consider which of the above categories apply. In some cases it’s actually several.

This is evident in my attempt to define “open API”.

For example we’re doing some work @ODIHQ to explore the concept of a digital twin. According to the Gemini Principles a digital twin should be open. Here we can think of an individual digital twin as an object (a piece of software or a model), or a process (e.g. as an open source project), or an operational system or platform, depending on how it’s made available.

We’re also looking at cities. Cities can be open in the sense of the openness of their processes of governance and decision making. They might also be considered as platforms for sharing data and connecting software. Or as ecosystems of the same.

Thinking about the governance of data

I find “governance” to be a tricky word. Particularly when we’re talking about the governance of data.

For example, I’ve experienced conversations with people from a public policy background and people with a background in data management, where it’s clear that there are different perspectives. From a policy perspective, governance of data could be described as the work that governments do to enforce, encourage or enable an environment where data works for everyone. Which is slightly different to the work that organisations do in order to ensure that data is treated as an asset, which is how I tend to think about organisational data governance.

These aren’t mutually exclusive perspectives. But they operate at different scales with a different emphasis, which I think can sometimes lead to crossed wires or missed opportunities.

As another example, reading this interesting piece on open data governance recently, I found myself wondering about that phrase: “open data governance”. Does it refer to the governance of open data? Being open about how data is governed? The use of open data in governance (e.g. as a public policy tool), or the role of open data in demonstrating good governance (e.g. through transparency)? I think the article touched on all of these, but they seem quite different things. (Personally I’m not sure there is anything special about the governance of open data as opposed to data in general: open data isn’t special).

Now, all of the above might be completely clear to everyone else and I’m just falling into my usual trap of getting caught up on words and meanings. But picking away at definitions is often useful, so here we are.

The way I’ve rationalised the different data management and public policy perspectives is in thinking about the governance of data as a set of (partly) overlapping contexts. Like this:

 

Governance of data as a set of overlapping contexts

 

Whenever we are managing and using data we are doing so within a nested set of rules, processes, legislation and norms.

In the UK our use of data is bounded by a number of contexts. This includes, for example: legislation from the EU (currently!), legislation from the UK government, legislation defined by regulators, best practices that might define how a sector operates, our norms as a society and community, and then the governance processes that apply within our specific organisations, departments and even teams.

Depending on what you’re doing with the data, and the type of data you’re working with, then different contexts might apply. The obvious one being the use of personal data. As data moves between organisations and countries, then different contexts will apply, but we can’t necessarily ignore the broader contexts in which it already sits.

The narrowest contexts, e.g. those within an organisation, will focus on questions like: “how are we managing dataset XYZ to ensure it is protected and managed to a high quality?” The broadest contexts are likely to focus on questions like: “how do we safely manage personal data?”

Narrow contexts define the governance and stewardship of individual datasets. Wider contexts guide the stewardship of data more broadly.

What the above diagram hopefully shows is that data, and our use of data, is never free from governance. It’s just that the terms under which it is governed may be only very loosely defined.

This terrible sketch I shared on twitter a while ago shows another way of looking at this: the laws, permissions, norms and guidelines that define the context in which we use data.

Data use in context

One of the ways in which I’ve found this “overlapping contexts” perspective useful, is in thinking about how data moves into and out of different contexts. For example when it is published or shared between organisations and communities. Here’s an example from this week.

IBM have been under fire because they recently released (or re-released) a dataset intended to support facial recognition research. The dataset was constructed by linking to public and openly licensed images already published on the web, e.g. on Flickr. The photographers, and in some cases the people featured in those images, are unhappy about the photographs being used in this new way. In this new context.

In my view, the IBM researchers producing this dataset made two mistakes. Firstly, they didn’t give proper consideration to the norms and regulations that apply to this data, the broader contexts which inform how it is governed and used, even though it’s published under an open licence. For example, people’s expectations about how photographs of them will be used.

An open licence helps data move between organisations and between contexts, but it doesn’t absolve anyone from complying with all of the other rules, regulations, norms, etc. that will still apply to how it is accessed, used and shared. The statement from Creative Commons helps to clarify that their licences are not a tool for governance. They just help to support the reuse of information.

This leads to IBM’s second mistake. By creating a new dataset they took on responsibility as its data steward. And being a data steward means having a well-defined set of data governance processes that are informed and guided by all of the applicable contexts of governance. But they missed some things.

The dataset included content that was created by, and features, individuals. So their lack of engagement with the community of contributors, in order to discuss norms and expectations, was a mistake. The lack of good tools to allow people to remove photos (NBC News created a better tool to allow Flickr users to check the contents of the dataset) is also a shortfall in their duties. It’s the combination of these that has led to the outcry.

If IBM had instead launched a similar initiative in which they built this dataset collaboratively with the community, then they could have avoided this issue. This is the approach that Mozilla took with Voice. IBM, and the world, might even have had a better dataset as a result, because people might have opted in to including more photos. This is important because, as John Wilbanks has pointed out, the market isn’t creating these fairer, more inclusive datasets. We need them to create an open, trustworthy data ecosystem.

Anyway, that’s one example of how I’ve found thinking about the different contexts of governing data helpful in understanding how to build stronger data infrastructure. What do you think? Am I thinking about this all wrong? What else should I be reading?

 

Impressions from pidapalooza 19

This week I was at the third pidapalooza conference in Dublin. It’s a conference dedicated to open identifiers: how to create them, steward them, drive adoption and promote their benefits.

Anyone who has spent any time reading this blog or following me on twitter will know that this is a topic close to my heart. Open identifiers are infrastructure.

I’ve separately written up the talk I gave on documenting identifiers to help drive adoption and spur the creation of additional services. I had lots of great nerdy discussions around URIs, identifier schemes, compact URIs, standards development and open data. But I wanted to briefly capture and share a few general impressions.

Firstly, while the conference topic is very much my thing, and the attendees were very much my people (including a number of ex-colleagues and collaborators), I was approaching the event from a very different perspective to the majority of other attendees.

Pidapalooza as a conference has been created by organisations from the scholarly publishing, research and archiving communities. Identifiers are a key part of how the integrity of the scholarly record is maintained over the long term. They’re essential to support archiving and access to a variety of research outputs, with data being a key growth area. Open access and open data were very much in evidence.

But I think I was one of only a few (perhaps the only?) attendees from what I’ll call the “broader” open data community. That wasn’t a complete surprise, but I think the conference as a whole could benefit from a wider audience and set of participants.

If you’re working in and around open data, I’d encourage you to go to pidapalooza, submit some talk ideas and consider sponsoring. I think that would be beneficial for several reasons.

Firstly, in the pidapalooza community, the idea of data infrastructure is just a given. It was refreshing to be around a group of people who were past debating whether data is infrastructure and were instead focusing on how to build, govern and drive adoption of that infrastructure. There are a lot of lessons there that are more generally applicable.

For example I went to a fascinating talk about how EIDR, an identifier for movie and television assets, had helped to drive digital transformation in that sector. Persistent identifiers are critical to digital supply chains (Netflix, streaming services, etc.). There are lessons here for other sectors around the benefits of wider sharing of data.

I also attended a great talk by the Australian Research Data Commons that reviewed the ways in which they were engaging with their community to drive adoption and best practices for their data infrastructure. They have a programme of policy change, awareness raising, skills development, community building and culture change which could easily be replicated in other areas. It paralleled some of the activities that the Open Data Institute has carried out around its sector programmes like OpenActive.

The need for transparent governance and long-term sustainability was also a frequent topic. As was the recognition that data infrastructure takes time to build. The technology is easy; it’s growing a community and building consensus around an approach that takes time.

(btw, I’d love to spend some time capturing some of the lessons learned by the research and publishing community, perhaps as a new entry to the series of data infrastructure papers that the ODI has previously written. If you’d like to collaborate with or sponsor the ODI to explore that, then drop me a line?)

Secondly, the pidapalooza community seems to have generally accepted (with a few exceptions) the importance of web identifiers and open licensing of reference data. But that practice is still not widely adopted in other domains. Few of the identifiers I encounter in open government data, for example, are well documented, openly licensed or supported by a range of APIs and services.

Finally, much of the focus of pidapalooza was on identifying research outputs and related objects: papers, conferences, organisations, datasets, researchers, etc. I didn’t see many discussions around the potential benefits and consequences of the use of identifiers in research datasets. Again, this focus follows from the community around the conference.

But as the research, data science and machine-learning communities begin exploring new approaches to increase access to data, it will be increasingly important to explore the use of standard identifiers in that context. Identifiers have a clear role in helping to integrate data from different sources, but there are wider risks around data privacy, and ethical considerations around identification of individuals, for example, that will need to be addressed.

I think we should be building a wider community of practice around use of identifiers in different contexts, and I think pidapalooza could become a great venue to do that.

Talk: Documenting Identifiers for Humans and Machines

This is a rough transcript of a talk I recently gave at a session at Pidapalooza 2019. You can view the slides from the talk here. I’m sharing my notes for the talk here, with a bit of light editing. I’d also really welcome your thoughts and feedback on this discussion document.

At the Open Data Institute we think of data as infrastructure. Something that must be invested in and maintained so that we can maximise the value we get from data. For research, to inform policy and for a wide variety of social and economic benefits.

Identifiers, registers and open standards are some of the key building blocks of data infrastructure. We’ve done a lot of work to explore how to build strong, open foundations for our data infrastructure.

A couple of years ago we published a white paper highlighting the importance of openly licensed identifiers in creating open ecosystems around data. We used that to introduce some case studies from different sectors and to explore some of the characteristics of good identifier systems.

We’ve also explored ways to manage and publish registers. “Register” isn’t a word that I’ve encountered much in this community. But it’s frequently used to describe a whole set of government data assets.

Registers are reference datasets that provide both unique and/or persistent identifiers for things, and data about those things. The datasets of metadata that describe ORCIDs and DOIs are registers. As are lists of doctors, countries and locations where you can get your car taxed. We’ve explored different models for stewarding registers and ways to build trust around how they are created and maintained.

In the work I’ve done and the conversations I’ve been involved with around identifiers, I think we tend to focus on two things.

The first is persistence. We need identifiers to be persistent in order to be able to rely on them enough to build them into our systems and processes. I’ve seen lots of discussion about the technical and organisational foundations necessary to ensure identifiers are persistent.

There’s also been great work and progress around giving identifiers affordance. Making them actionable.

Identifiers that are URIs can be clicked on in documents and emails. They can be used by humans and machines to find content, data and metadata. Where identifiers are not URIs, there are often resolvers that will help to integrate them with the web.
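As a rough illustration of the kind of affordance a resolver provides, here’s a minimal sketch in Python. The prefix list is illustrative; real resolvers such as doi.org or identifiers.org maintain much larger registries and their own rules:

```python
# A minimal sketch of what a resolver does: a compact identifier
# ("prefix:accession") is expanded into a URL that humans and machines
# can follow. The prefix list here is just illustrative.

RESOLVERS = {
    "doi": "https://doi.org/{accession}",
    "orcid": "https://orcid.org/{accession}",
}

def resolve(compact_id: str) -> str:
    """Turn e.g. 'doi:10.1000/182' into a clickable, fetchable URL."""
    prefix, accession = compact_id.split(":", 1)
    template = RESOLVERS.get(prefix.lower())
    if template is None:
        raise ValueError(f"No resolver known for prefix '{prefix}'")
    return template.format(accession=accession)

print(resolve("doi:10.1000/182"))   # https://doi.org/10.1000/182
```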

Persistence and affordance are both vital qualities for identifiers that will help us build a stronger data infrastructure.

But lately I’ve been thinking that there should be more discussion and thought put into how we document identifiers. I think there are three reasons for this.

Firstly, identifiers are boundary objects. As we increase access to data, by sharing it between organisations or publishing it as open data, then an increasing number of data users and communities are likely to encounter these identifiers.

I’m sure everyone in this room knows what a DOI is (aside: they did). But how many people know what a TOID is? (Aside: none of them did). TOIDs are a national identifier scheme. There’s a TOID for every geographic feature on Ordnance Survey maps. As access to OS data increases, more developers will be introduced to TOIDs and could start using them in their applications.

As identifiers become shared between communities, it’s important that the context around how those identifiers are created and managed is accessible, so that we can properly interpret the data that uses them.

Secondly, identifiers are standards. There are many different types of standard. But they all face common problems of achieving wide adoption and impact. Getting a sector to adopt a common set of identifiers is a process of agreement and implementation. Adoption is driven by engagement and support.

To help drive adoption of standards, we need to ensure that they are well documented, so that users can understand their utility and benefits.

Finally, identifiers usually exist as part of registers or similar reference data. So when we are publishing identifiers we face all the general challenges of being good data publishers. The data needs to be well described and documented. And to meet a variety of data user needs, we may need a range of services to help people consume and use it.

Together I think these different issues can lead to additional friction that can hinder the adoption of open identifiers. Better documentation could go some way towards addressing some of these challenges.

So what documentation should we publish around identifier schemes?

I’ve created a discussion document to gather and present some thoughts around this. Please have a read and leave your comments and suggestions on that document. For this presentation I’ll just talk through some of the key categories of information.

I think these are:

  • Descriptive information that provides the background to a scheme, such as what it’s for, when it was created, examples of it being used, etc
  • Governance information that describes how the scheme is managed, who operates it and how access is managed
  • Technical notes that describe the syntax and validation rules for the scheme
  • Operational information that helps developers understand how many identifiers there are, when and how new identifiers are assigned
  • Service pointers that signpost to resolvers and other APIs and services that help people use or adopt the identifiers

I take it pretty much as a given that this type of important documentation and metadata should be machine-readable in some form. So we need to approach all of the above in a way that can meet the needs of both human and machine data users.
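Purely as an illustration (not a proposal for a format), here’s a hedged sketch of what a machine-readable description of an identifier scheme might look like, organised around the categories above. The field names and values are invented for this example; they don’t follow any existing vocabulary:

```python
import json
import re

# A sketch of machine-readable documentation for an identifier scheme,
# organised around the categories above. All field names and values are
# invented for illustration.
scheme = {
    "name": "Example identifier scheme",
    "description": "What the identifiers are for, with examples of use",
    "governance": {
        "operator": "Name of the stewarding organisation",
        "licence": "https://creativecommons.org/publicdomain/zero/1.0/",
    },
    "technical": {
        "pattern": "^EX-[0-9]{6}$",   # syntax/validation rule
        "example": "EX-001234",
    },
    "operational": {
        "identifiers_assigned": 125000,
        "assignment_policy": "How and when new identifiers are created",
    },
    "services": {
        "resolver": "https://example.org/id/{id}",
        "lookup_api": "https://example.org/api/search",
    },
}

# Validate a candidate identifier against the documented syntax rule
assert re.match(scheme["technical"]["pattern"], "EX-001234")

print(json.dumps(scheme, indent=2))
```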

Before jumping into bike-shedding around formats, there are a few immediate questions to consider:

  • how do we make this metadata discoverable, e.g. from datasets and individual identifiers?
  • are there different use cases that might encourage us to separate out some of this information into separate formats and/or types of documentation?
  • what services might we build off the metadata?
  • …etc

I’m interested to know whether others think this would be a useful exercise to take further. And also the best forum for doing that. For example, should there be a W3C community group or similar that we could use to discuss and publish some best practice?

Please have a look at the discussion document. I’m keen to learn from this community. So let me know what you think.

Thanks for listening.

Talk: Tabular data on the web

This is a rough transcript of a talk I recently gave at a workshop on Linked Open Statistical Data. You can view the slides from the talk here. I’m sharing my notes for the talk here, with a bit of light editing.

At the Open Data Institute our mission is to work with companies and governments to build an open trustworthy data ecosystem. An ecosystem in which we can maximise the value from use of data whilst minimising its potential for harmful impacts.

An important part of building that ecosystem will be ensuring that everyone — including governments, companies, communities and individuals — can find and use the data that might help them to make better decisions and to understand the world around them

We’re living in a period where there’s a lot of disinformation around. So the ability to find high quality data from reputable sources is increasingly important. Not just for us as individuals, but also for journalists and other information intermediaries, like fact-checking organisations.

Combating misinformation, regardless of its source, is an increasingly important activity. To do that at scale, data needs to be more than just easy to find. It also needs to be easily integrated into data flows and analysis. And the context that describes its limitations and potential uses needs to be readily available.

The statistics community has long had standards and codes of practice that help to ensure that data is published in ways that help to deliver on these needs.

Technology is also changing. The ways in which we find and consume information are evolving. Simple questions are now being directly answered from search results, or through agents like Alexa and Siri.

New technologies and interfaces mean new challenges in integrating and using data. This means that we need to continually review how we are publishing data. So that our standards and practices continue to evolve to meet data user needs.

So how do we integrate data with the web? To ensure that statistics are well described and easy to find?

We’ve actually got a good understanding of basic data user needs. Good quality metadata and documentation. Clear licensing. Consistent schemas. Use of open formats, etc, etc. These are consistent requirements across a broad range of data users.

What standards can help us meet those needs? We have DCAT and Data Packages. Schema.org Dataset metadata, and its use in Google dataset search, now provides a useful feedback loop that will encourage more investment in creating and maintaining metadata. You should all adopt it.
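For a flavour of what adopting it involves, here’s a minimal sketch of schema.org Dataset metadata, written as Python that emits the JSON-LD which could be embedded in a dataset’s landing page. The names, URLs and values are placeholders for illustration:

```python
import json

# Sketch of schema.org Dataset metadata (illustrative values only) that
# could be embedded as JSON-LD in a dataset's landing page so that
# dataset search engines can find and index it.
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example statistical dataset",
    "description": "What the dataset contains and how it was produced",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": [{
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://example.org/data/example.csv",
    }],
}

print('<script type="application/ld+json">')
print(json.dumps(dataset, indent=2))
print("</script>")
```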

And we also have CSV on the Web. It does a variety of things which aren’t covered by some of those other standards. It’s a collection of W3C Recommendations that describe a model for tabular data on the web, a metadata vocabulary for annotating tables, columns and cells, and standard ways to convert tabular data into JSON and RDF.

The primer provides an excellent walk through of all of the capabilities and I’d encourage you to explore it.
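To give a flavour, here’s a minimal sketch of a CSV on the Web metadata document, written as Python that emits the JSON. The file name, columns and description are made up for illustration; the primer covers the full range of options:

```python
import json

# A minimal sketch of a CSV on the Web metadata document (illustrative
# file and column names). It describes the structure of data.csv and
# adds some human-readable context, and would typically be published
# alongside the CSV (e.g. as data.csv-metadata.json).
metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "data.csv",
    "dc:title": "Example observations",
    "dc:description": "Figures for 2017 onwards use a revised methodology.",
    "tableSchema": {
        "columns": [
            {"name": "area", "titles": "Area code", "datatype": "string"},
            {"name": "year", "titles": "Year", "datatype": "gYear"},
            {"name": "value", "titles": "Observed value", "datatype": "decimal"},
        ],
        "primaryKey": ["area", "year"],
    },
}

with open("data.csv-metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```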

One of the nice examples in the primer shows how you can annotate individual cells or groups of cells. As you all know, this capability is essential for statistical data, because statistical data is rarely just tabular: it’s usually decorated with lots of contextual information that is difficult to express in most data formats. Users of data need this context to properly interpret and display statistical information.

Unfortunately, CSV on the Web is still not that widely adopted. Even though it’s relatively simple to implement.

(Aside: several audience members noted they are using it internally in their data workflows. I believe the Office for National Statistics are also moving to adopt it.)

This might be because of a lack of understanding of some of the benefits it provides. Or that those benefits are limited in scope.

There also aren’t a great many tools that support CSV on the web currently.

It might also be that there are some other missing pieces of data infrastructure that are blocking us from making best use of CSV on the Web and other similar standards and formats. Perhaps we need to invest further in creating open identifiers to help us describe statistical observations, e.g. so that we can clearly describe what type of statistics are being reported in a dataset.

But adoption could be driven from multiple angles. For example:

  • open data tools, portals and data publishers could start to generate best practice CSVs. That would be easy to implement
  • open data portals could also readily adopt CSV on the Web metadata, most already support DCAT
  • standards developers could adopt CSV on the Web as their primary means of defining schemas for tabular formats

Not everyone needs to implement or use the full set of capabilities. But with some small changes to tools and processes, we could collectively improve how tabular data is integrated into the web.

Thanks for listening.

UnINSPIREd: problems accessing local government geospatial data

This weekend I started a side project which I plan to spend some time on this winter. The goal is to create a web interface that will let people explore geospatial datasets published by the three local authorities that make up the West of England Combined Authority: Bristol City Council, South Gloucestershire Council and Bath & North East Somerset Council.

Through Bath: Hacked we’ve already worked with the council to publish a lot of geospatial data. We’ve also run community mapping events and created online tools to explore geospatial datasets. But we don’t have a single web interface that makes it easy for anyone to explore that data and perhaps mix it with new data that they have collected.

Rather than build something new, which would be fun but time consuming, I’ve decided to try out TerriaJS. It’s an open source, web-based mapping tool that is already being used to publish the Australian National Map. It should handle the West of England quite comfortably. It’s got a great set of features and can connect to existing data catalogues and endpoints. It seems to be perfect for my needs.

I decided to start by configuring the datasets that are already in the Bath: Hacked Datastore, the Bristol Open Data portal, and data.gov.uk. Every council also has to publish some data via standard APIs as part of the INSPIRE regulations, so I hoped to be able to quickly bring in a list of existing datasets without having to download and manage them myself.
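For context, adding a dataset to TerriaJS is mostly a matter of listing it in a JSON catalog (init) file. Here’s a rough sketch of the kind of thing I mean; the structure is based on my reading of the TerriaJS documentation and the URL and layer names are placeholders, so treat the details as assumptions to check:

```python
import json

# Rough sketch of a TerriaJS catalog (init) file pointing at a WMS
# endpoint. The structure follows my reading of the TerriaJS docs and
# the URL/layer names are placeholders; check against the current
# documentation before using.
catalog = {
    "catalog": [
        {
            "name": "West of England",
            "type": "group",
            "items": [
                {
                    "name": "Example INSPIRE layer",
                    "type": "wms",
                    "url": "https://example.org/inspire/wms",
                    "layers": "example_layer",
                }
            ],
        }
    ]
}

with open("west-of-england.json", "w") as f:
    json.dump(catalog, f, indent=2)
```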

Unfortunately this hasn’t proved as easy as I’d hoped. Based on what we’ve learned so far about the state of geospatial data infrastructure in our project at the ODI I had reasonably low expectations. But there’s nothing like some practical experience to really drive things home.

Here’s a few of the challenges and issues I’ve encountered so far.

  • The three councils are publishing different sets of data. Why is that?
  • The dataset licensing isn’t open and looks to be inconsistent across the three councils. When is something covered by INSPIRE rather than the PSMA end user agreement?
  • The new data.gov.uk “filter by publisher” option doesn’t return all datasets for the specified publisher. I’ve reported this as a bug, in the meantime I’ve fallen back on searching by name
  • The metadata for the datasets is pretty poor, and there is little supporting documentation. I’m not sure what some of the datasets are intended to represent. What are “core strategy areas”?
  • The INSPIRE service endpoints do include metadata that isn’t exposed via data.gov.uk. For example this South Gloucester dataset includes contact details, data on geospatial extents, and format information which isn’t otherwise available. It would be nice to be able to see this and not have to read the XML
  • None of the metadata appears to tell me when the dataset was last updated. The last modified date on data.gov.uk is (I think) the date the catalogue entry was last updated. Are the Section 106 agreements listed in this dataset from 2010, or are they regularly updated? How can I tell?
  • Bath is using GetMapping to host its INSPIRE datasets. Working through them on data.gov.uk I found that 46 out of the 48 datasets I reviewed have broken endpoints. I’m reasonably certain these used to work. I’ve reported the issue to the council.
  • The two datasets that do work in Bath cannot be used in TerriaJS. I managed to work around the fact that they require a username and password to access, but have hit a wall because the GetMapping APIs only seem to support EPSG:27700 (British National Grid) and not EPSG:3857 as used by online mapping tools (see the reprojection sketch after this list). So the APIs refuse to serve the data in a way that can be used by the framework. The Bristol and South Gloucestershire endpoints handle this fine. I assume this is either a limitation of the GetMapping service or a misconfiguration. I’ve asked for help.
  • A single Web Mapping Service can expose multiple datasets as individual layers. But apart from Bristol, both Bath and South Gloucestershire are publishing each dataset through its own API endpoint. I hope the services they’re using aren’t charging per endpoint, as the extra endpoints are probably unnecessary. Bristol has chosen to publish a couple of APIs that bring together several datasets, but these are also available individually through separate APIs.
  • The same datasets are repeated across data catalogues and endpoints. Bristol has its data listed as individual datasets in its own platform, listed as individual datasets in data.gov.uk and also exposed via two different collections which bundle some (or all?) of them together. I’m unclear on the overlap or whether there are differences between them in terms of scope, timeliness, etc. The licensing is also different. Exploring the three different datasets that describe allotments in Bristol, only one actually displayed any data in TerriaJS. I don’t know why
  • The South Gloucestershire web mapping services all worked seamlessly, but I noticed that if I wanted to download the data, then I would need to jump through hoops to register to access it. Obviously not ideal if I do want to work with the data locally. This isn’t required by the other councils. I assume this is a feature of MisoPortal
  • The South Gloucestershire datasets don’t seem to include any useful attributes for the features represented in the data. When you click on the points, lines and polygons in TerriaJS no additional information is displayed. I don’t know yet whether this data just isn’t included in the dataset or if it’s a bug in the API or in how TerriaJS is requesting it. I’d need to download or explore the data in some other way to find out. However, the data that is available from Bath and Bristol also has inconsistencies in how it’s described, so I suspect there aren’t any agreed standards
  • Neither the GetMapping nor the MisoPortal APIs support CORS. This means you can’t access the data from Javascript running directly in the browser, which is what TerriaJS does by default. I’ve had to configure those to be accessed via a proxy. “Web mapping services” should work on the web.
  • While TerriaJS doesn’t have a plugin for OpenDataSoft (which powers the Bristol Open Data platform), I found that OpenDataSoft do provide a Web Feature Service interface, so I was able to configure TerriaJS to access that. Unfortunately I then found that either there’s a bug in the platform or a problem with the data, because most of the points were in the Indian Ocean.
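
The projection mismatch mentioned above is worth spelling out. Web mapping tools generally expect Web Mercator (EPSG:3857), while the GetMapping endpoints only offered British National Grid (EPSG:27700). As a hedged illustration of the underlying conversion (not how TerriaJS itself handles it), pyproj can reproject between the two; the coordinates below are rough values for somewhere near Bath:

```python
# Illustrative reprojection from British National Grid (EPSG:27700) to
# Web Mercator (EPSG:3857), the projection most web mapping frameworks
# expect. The easting/northing values are rough coordinates near Bath.
from pyproj import Transformer

to_web_mercator = Transformer.from_crs("EPSG:27700", "EPSG:3857", always_xy=True)

easting, northing = 375000, 164000   # British National Grid (metres)
x, y = to_web_mercator.transform(easting, northing)
print(f"EPSG:3857 coordinates: {x:.1f}, {y:.1f}")
```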

The goal of the INSPIRE legislation was to provide a common geospatial data infrastructure across Europe. What I’m trying to do here should be relatively quick and easy to do. Looking at this graph of INSPIRE conformance for the UK, everything looks rosy.

But, based on an admittedly small sample of only three local authorities, the reality seems to be that:

  • services are inconsistently implemented and have not been designed to be used as part of native web applications and mapping frameworks
  • metadata quality is poor
  • there is inconsistent detail about features which makes it hard to aggregate, use and compare data across different areas
  • it’s hard to tell the provenance of data because of duplicated copies of data across catalogues and endpoints. Without modification dates or provenance information, it’s unclear whether data is up to date
  • licensing is unclear
  • links to service endpoints are broken. At best, this leads to wasted time from data users. At worst, there’s public money being spent on publishing services that no-one can access

It’s important that we find ways to resolve these problems. As this recent survey by the ODI highlights, SMEs, startups and local community groups all need to be able to use this data. Local government needs more support to help strengthen our geospatial data infrastructure.

Observations on the web

Eight years ago I was invited to a workshop. The Office for National Statistics were gathering together people from the statistics and linked data communities to talk about publishing statistics on the web.

At the time there was lots of ongoing discussion within and between the two communities around this topic. With a particular emphasis on government statistics.

I was invited along to talk about how publishing linked data could help improve discovery of related datasets.

Others were there to talk about other related projects. There were lots of people there from the SDMX community who were working hard to standardise how statistics can be exchanged between organisations.

There’s a short write-up that mentions the workshop, some key findings and some follow on work.

One general point of agreement was that statistical data points or observations should be part of the web.

Every number, like the current population of Bath & North East Somerset, should have a unique address or URI. So people could just point at it. With their browsers or code.

Last week the ONS launched the beta of a new API that allows you to create links to individual observations.
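To make “point at it with code” concrete, here’s a hedged sketch. The observation URI below is hypothetical; the real ONS API defines its own URL structure, but the principle is the same: one observation, one address, fetchable by humans and machines.

```python
import requests

# Hedged sketch of "pointing at" a single statistical observation with
# code. The URI below is hypothetical; the real ONS API has its own URL
# structure, but the principle is the same.
OBSERVATION_URI = "https://example.org/datasets/population/observations/E06000022-2018"

response = requests.get(OBSERVATION_URI, headers={"Accept": "application/json"})
response.raise_for_status()
observation = response.json()

print(observation)   # e.g. the value plus the context needed to interpret it
```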

Seven years on they’ve started delivering on the recommendations of that workshop.

Agreeing that observations should have URIs was easy. The hard work of doing the digital transformation required to actually deliver it has taken much longer.

Proof-of-concept demos have been around for a while. We made one at the ODI.

But the patient, painstaking work to change processes and culture to create sustainable change takes time. And in the tech community we consistently underestimate how long that takes, and how much work is required.

So kudos to Laura, Matt, Andy, Rob, Darren Barnee and the rest of the present and past ONS team for making this happen. I’ve seen glimpses of the hard work they’ve had to put in behind the scenes. You’re doing an amazing and necessary job.

If you’re unsure as to why this is such a great step forward, here’s a user need I learned at that workshop.

Amongst the attendees was a designer who worked on data visualisations. He would spend a great deal of time working with data to get it into the right format and then designing engaging, interactive views of it.

Often there were unusual peaks and troughs in the graphs and charts which needed some explanation. Maybe there had been an external event that impacted the data, or a change in methodology. Or a data quality issue that needed explaining. Or maybe just something interesting that should be highlighted to users.

What he wanted was a way for the statisticians to give him that context, so he could add notes and explanations to the diagrams. He was doing this manually and it was a lot of time and effort.

For decades statisticians have been putting these useful insights into the margins of their work. Because of the limitations of the printed page and spreadsheet tables this useful context has been relegated into footnotes for the reader to find for themselves.

But by putting this data onto the web, at individual URIs, we can deliver those numbers in context. Everything you need to know can be provided with the statistic, along with pointers to other useful information.

Giving observations unique URIs frees statisticians from the tyranny of the document. And might help us all to share and discuss data in a much richer way.

I’m not naive enough to think that linking data can help us address issues with fake news. But it’s hard for me to imagine how being able to more easily work with data on the web isn’t at least part of the solution.