This is a rough transcript of a talk I recently gave at a workshop on Linked Open Statistical Data. You can view the slides from the talk here. What follows are my notes for the talk, with some light editing.
At the Open Data Institute our mission is to work with companies and governments to build an open, trustworthy data ecosystem: one in which we can maximise the value we get from data whilst minimising its potential for harm.
An important part of building that ecosystem will be ensuring that everyone — including governments, companies, communities and individuals — can find and use the data that might help them to make better decisions and to understand the world around them.
We’re living in a period where there’s a lot of disinformation around. So the ability to find high quality data from reputable sources is increasingly important. Not just for us as individuals, but also for journalists and other information intermediaries, like fact-checking organisations.
Combating misinformation, regardless of its source, is an increasingly important activity. To do that at scale, data needs to be more than just easy to find. It also needs to be easily integrated into data flows and analysis. And the context that describes its limitations and potential uses needs to be readily available.
The statistics community has long had standards and codes of practice that help to ensure that data is published in ways that help to deliver on these needs.
Technology is also changing. The ways in which we find and consume information are evolving. Simple questions are now being directly answered from search results, or through agents like Alexa and Siri.
New technologies and interfaces mean new challenges in integrating and using data. This means that we need to continually review how we are publishing data. So that our standards and practices continue to evolve to meet data user needs.
So how do we integrate data with the web, to ensure that statistics are well described and easy to find?
We’ve actually got a good understanding of basic data user needs. Good quality metadata and documentation. Clear licensing. Consistent schemas. Use of open formats, etc, etc. These are consistent requirements across a broad range of data users.
What standards can help us meet those needs? We have DCAT and Data Packages. Schema.org Dataset metadata, and its use in Google dataset search, now provides a useful feedback loop that will encourage more investment in creating and maintaining metadata. You should all adopt it.
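To make that concrete, here's a rough sketch of the kind of schema.org Dataset markup a publisher might embed in a dataset's landing page, which is what Google dataset search picks up. The dataset name, description and URLs are all invented for this illustration:

```json
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "Annual population estimates (example)",
  "description": "A hypothetical dataset, used here only to illustrate the markup.",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "distribution": {
    "@type": "DataDownload",
    "encodingFormat": "text/csv",
    "contentUrl": "https://example.org/data/population.csv"
  }
}
```

Even this small amount of structured metadata — a name, a description, a licence and a link to a download — is enough to make a dataset discoverable in that feedback loop.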
And we also have CSV on the Web. It does a number of things that aren't covered by those other standards. It's a collection of W3C Recommendations that:
- help us to consistently describe tabular data and generate best practice CSVs
- allow us to describe not just a whole table, but also its columns, individual cells and even relationships between files
- allow us to consistently convert tabular data into JSON and RDF to enable its use in other workflows
One of the nice examples in the primer shows how you can annotate individual cells or groups of cells. As you all know this capability is essential for statistical data. Because statistical data is rarely just tabular: it’s usually decorated with lots of contextual information that is difficult to express in most data formats. Users of data need this context to properly interpret and display statistical information.
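As a rough illustration, a CSV on the Web metadata file for a hypothetical population table might look something like the following. The filename, columns and note text are all invented, and the cell annotation is a simplified version of the pattern shown in the primer:

```json
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "population.csv",
  "tableSchema": {
    "columns": [
      {"name": "area", "titles": "Area code", "datatype": "string"},
      {"name": "year", "titles": "Year", "datatype": "gYear"},
      {"name": "population", "titles": "Population estimate", "datatype": "integer"}
    ]
  },
  "notes": [
    {
      "type": "Annotation",
      "target": "population.csv#cell=4,3",
      "body": "Provisional figure, subject to revision."
    }
  ]
}
```

The `#cell=4,3` fragment identifier comes from RFC 7111, which defines fragment identifiers for the text/csv media type; the same mechanism can target whole rows or columns, which is how that contextual decoration gets attached to the parts of the table it describes.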
Unfortunately, CSV on the Web is still not that widely adopted, even though it's relatively simple to implement.
(Aside: several audience members noted they are using it internally in their data workflows. I believe the Office for National Statistics is also moving to adopt it)
This might be because of a lack of understanding of some of the benefits it provides. Or that those benefits are limited in scope.
There also aren’t many tools that currently support CSV on the Web.
It might also be that there are some other missing pieces of data infrastructure blocking us from making best use of CSV on the Web and similar standards and formats. Perhaps we need to invest further in creating open identifiers to help us describe statistical observations, e.g. so that we can clearly state what type of statistic is being reported in a dataset.
But adoption could be driven from multiple angles. For example:
- open data tools, portals and data publishers could start to generate best practice CSVs. That would be easy to implement
- open data portals could also readily adopt CSV on the Web metadata; most already support DCAT
- standards developers could adopt CSV on the Web as their primary means of defining schemas for tabular formats
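To give a flavour of the first of those options, here's a minimal sketch in Python, using only the standard library, of what "generating a best practice CSV" might mean in the simplest case: one header row, one regular record per line, no blank rows or embedded commentary. The field names and values are invented for this example:

```python
import csv
import io

# Hypothetical statistical records; the fields are illustrative only.
rows = [
    {"area": "E09000001", "year": "2020", "population": 9721},
    {"area": "E09000002", "year": "2020", "population": 212906},
]

def write_best_practice_csv(rows, fieldnames):
    """Serialise rows as a simple, regular CSV: a single header row
    followed by one record per line, with a consistent column order."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

output = write_best_practice_csv(rows, ["area", "year", "population"])
print(output)
```

Tools that already emit something CSV-shaped could adopt this kind of regularity with very little effort, and it's exactly the shape of file that a CSV on the Web metadata document can then describe column by column.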
Not everyone needs to implement or use the full set of capabilities. But with some small changes to tools and processes, we could collectively improve how tabular data is integrated into the web.
Thanks for listening.