Schema explorers and how they can help guide adoption of common standards

Despite being very different projects Wikidata and OpenStreetmap have a number of similarities. Recurring patterns in how they organise and support the work of their communities.

We documented a number of these patterns in the ODI Collaborative Maintenance Guidebook. There were also a number we didn’t get time to write-up.

A further pattern which I noticed recently is that both Wikidata and OSM provide tools and documentation that help contributors and data users explore the schema that shapes the data.

Both projects have a core data model around which their communities are building and iterating on a more focused domain model. This approach of providing tools for the community to discuss, evolve and revise a schema is what we called the Shared Canvas pattern in the ODI guidebook.

In OpenStreetmap that core model is consists of nodes, ways and relations. Tags (name-value pairs) can be attached to any of these types.

In Wikidata the core data model is essentially a graph. A collection of statements that associate values with nodes using a range of different properties. It’s actually more complicated than that, but the detail isn’t important here.

The list of properties in Wikidata and the list of tags in OpenStreetmap are continually revised and extended by the community to capture additional information.

The OpenStreetmap community documents tags in its Wiki (e.g. the building tag). Wikidata documents its properties within the project dataset (e.g. the name property, P2561).

But to successfully apply the Shared Canvas pattern, you also need to keep the community up to date about your Evolving Schema. To do that you need some way to communicate which properties or tags are in use, and how. OSM and Wikidata both provide tools to support that.

In OSM this role is filled by TagInfo. It can provide you with a break down of what type of feature the tag is used on, the range of values, combinations with other tags and some idea of its geographic usage. Tag uses varies by geographic community in OSM. Here’s the information about the building tag.

In Wikidata this tooling is provided by a series of reports that are available from the Discussion page for an individual property. This includes information about how often it is used and pointers to examples of frequent and recent uses. Here’s the information about the name property.

Both tools provide useful insight into how different aspects of a schema are being adopted and uses. They can help guide not just the discussion around the schema (“is this tag in use?”, but also the process of collecting data (“which tags should I use here”) and using the data (“what tags might I find, or query for?”).

Any project that adopts a Shared Canvas approach is likely to need to implement this type of tooling. Lets call it the “Schema explorer” pattern for now.

I’ll leave documenting it further for another post, or a contribution to the guidebook.

Schema explorers for open standards and open data

This type of tooling would be useful in other contexts.

Anywhere that we’re trying to drive adoption of a common data standard, it would be helpful to be able to assess how well used different parts of that schema are by analysing the available data.

That’s not something I’ve regularly seen produced. In our survey of decentralised publishing initiatives at the ODI we found common types of documentation, data validators and other tools to support use of data, like useful aggregations. But no tooling to help explore how well it is adopted. Or to help data users understand the shape of the available data prior to aggregating it.

When i was working on the OpenActive standard, I found the data profiles that Dan Winchester produced really helpful. They provide useful insight into which parts of a standard different publishers were actually using.

I was thinking about this again recently whilst doing some work for Full Fact, exploring the ClaimReview markup in Schema.org. It would be great to see which features different fact checkers are actually using. In fact that would be true of many different aspects of Schema.org.

This type of reporting is hard to do in a distributed environment without aggregating all the data. But Google are regularly harvesting some of this data, so it feels like it would be relatively easy for them to provide insights like this if they chose.

An alternative is the Schema.org Table Corpus which provides exports of Schema.org data contained in the Common Crawl dataset. But more work is likely needed to generate some useful views over the data, and it is less frequently updated.

Outside of Schema.org, schema explorers reporting on the contents of open datasets, would help inform a range of standards work. For example, it could help inform decisions about how to iterate on a schema, guide the production of documentation, and help improve the design of validators and other tools.

If you’ve seen examples of this type of tooling, then I’d be interested to see some links.

Building data validators

This is a post about building tools to validate data. I wanted to share a few reflections based on helping to design and build a few different public and private tools, as well as my experience as a user.

I like using data validators to check my homework. I’ve been using a few different recently which has prompted me to think a bit about their role and the designs that go into their design.

The tl;dr version of this post is along the lines of “Think about user needs when designing tools. But also be conscious of the role those tools play in their broader ecosystem“.

What is a data validator?

A data validator is a tool that checks the correctness and quality of data. This means doing the following categories of checks:

  • Syntax
    • Checking to determine whether there are any mistakes in how it is formatted. E.g. is the syntax of a CSV, XML or JSON file correct?
  • Validity
    • Confirming if all of the required fields, necessary to make the data useful, been provided?
    • Testing that individual values have been correctly specified. E.g. if the field contains a number then is the provided value actually a number rather than a text?
    • Performing more semantic checks such as, if this is a dataset about UK planning applications, then are the coordinates actually in the UK? Or is the start date for the application before the end date?
  • Utility
    • Confirming that provided data is of a useful quality, e.g. are geographic coordinates of the right precision? Or do any links to other resources actually work?
    • Warning about data that may or may not be included. For example, prompting the user to include additional fields that may improve the utility of the data. Or asking them to consider whether any personal data included should be there

These validation rules will typically come from a range of different sources, including:

  • The standard or specification that defines the syntax of the data.
  • The standard or specification (or schema) that describes the structure and content of the data. (This might be the same as the above, or might be defined elsewhere)
  • Legislation, which might guide, inform or influence what data should or should not be included
  • The implementer of the validation tool, who may have opinions about what is considered to be correct or useful data based on their specific needs (e.g. as a direct consumer of the data) or more broadly as a contributor to a community initiative to support improvements to how data is published

Data validators are frequently web based these days. At least for smaller datasets. But both desktop and command-line tools are also regularly used in different settings. The choice of design will be informed by things like how open the data can be, the volume of data being checked, and how the validator might be integrated into a data workflow, e.g. as an automated or manual step.

Examples of different types of data validator

Here are some examples of different data validators created for different purposes and projects

  1. JSON lint
  2. GeoJSON Lint
  3. JSON LD Playground
  4. CSVlint
  5. ODI Leeds Business Rates format validator
  6. 360Giving Data Quality Tool
  7. OpenContracting Data Review Tool
  8. The OpenActive validator
  9. OpenReferral UK Service Validator
  10. The Schema.org validator
  11. Google’s Rich Results Test
  12. The Twitter Card validator
  13. Facebook’s sharing debugger

The first few on the list are largely syntax checkers. They validate whether your CSV, JSON or GeoJSON files are correctly structured.

The others go further and check not just the format of the data, but also its validity against a schema. That schema is defined in a standard intended to support consistent publication of data across a community. The goal of these tools is to improve quality of data for a wide range of potential users, by guiding publishers about how to publish data well.

The last three examples are validators that are designed to help publishers meet the needs of a specific application or consumer of the data. They’re an actionable way to test data against the requirements of a specific user.

Validators also vary in other ways.

For example, the 360Giving, OpenContracting and Rich Results Test validators all accept a range of different data formats. They validate different syntaxes against a common schema. Others are built around a single specific format

Some tools provide a low-level view of the results, e.g. a list of errors and warnings with reference to specific sections of the data. Others provide a high-level interface, such as a preview of what the data looks like on a map or as it would be displayed in a specific application. This type of visual presentation can help catch other types of errors and more directly confirm how data might be interpreted, whilst also making the tool useful to a wider audience.

What do we mean by data being valid?

For simple syntax checking identifying whether something is valid is straight-forward. Your JSON is either well-formed or its not.

Validators that are designed around specific applications also usually have a clear marker of what is “valid”: can the application parse, interpret and display the data as expected? Does my twitter card look correct?

In other examples, the notion of “valid” is harder to define. They may be some basic rules around what a minimum viable dataset looks like. If so, these are easier to identify and classify as errors.

But there is often variability within a schema. E.g. optional elements. This means that validators need to offer more than just a binary decision and instead offer warnings, suggestions and feedback.

For example, when thinking about the design of the OpenActive validator we discussed the need to go beyond simple validation and provide feedback and prompts along the lines of “you haven’t provided a price, is the event free or chargeable“? Or “you haven’t provided an image for this event, this is legal but evidence shows that participants are more likely to sign-up to events where they can see what participation looks like.”

To put this differently: data quality depends on how you’re planning to use the data. It’s not an absolute. If you’re not validating data for a specific application or purpose, then you tool should be prompting users to think about the choices they are making around how data is being shared.

In the context of sharing and publishing open data, this moves the role of a data validator beyond simplify checking correctness, and towards identifying sources of friction that will exist between publisher and consumer.

Beyond the formal conformance criteria defined in a specification, deciding whether something is valid or not, is really just a marker for how much extra work is required by a consumer. And in some cases the publisher may not have the time, budget or resources to invest in reducing that burden.

Things to think about when designing a validator

To wrap up this post, here are some things to think about when designing a data validator

  • Who are your users? What level of technical skill and understanding are you designing for?
  • How will the validator be used or integrated into the users workflow? A tool for integration into a continuous integration environment will need to operate differently to something used to do acceptance checking before data is published. Maybe you need several different tools?
  • How much knowledge of the relevant standards or specification will a user need before they can use the tool? Should the tool facilitate learning and exploration about how to structure data, or is just checking existing data?
  • How can you provide good, clear feedback? Tools that rely on applying machine-readable schemas like JSON Schema can often have cryptic messages as they rely on an underlying library to report errors
  • How can you provide guidance and feedback that will help users decide how to improve data? Is the feedback actionable? (For example in CSVLint we figured out that when reporting that a user had an incorrect mime-type for their CSV file we could identify if it was served from AWS and provide a clear suggestion about how to fix the issue)
  • Would showing the data, as a preview or within a mocked up view, help surface problems or build confidence in how data is published?
  • Are the documentation about how to publish data and the reports from your validator consistent? If not, then fix the documentation or explain the limits of the validator

Finally, if you’re designing a validator for a specific application, then don’t mark as “invalid” anything that you can simply ignore. Don’t force the ecosystem to converge on your preferences.

You may not be interested in the full scope of a standard, but different applications and users will have different needs.

Data quality is a dialogue between publishers and users of data. One that will evolve over time as tools, applications, norms and standards become adopted across a data ecosystem. A data validator is an important building block that can facilitate that discussion.

Some lessons learned from building standards around Schema.org

OpenActive is a community-led initiative in the sport and physical activity sector in England. It’s goal is to help to get people healthier and more active by making its easier for people to find information about activities and events happening in their area. Publishing open data about opportunities to be active is a key part of its approach.

The initiative has been running for several years, funded by Sport England. Its supported by a team at the Open Data Institute who are working in close collaboration with a range of organisations across the sector.

During the early stages of the project I was responsible for leading the work to develop the technical standards and guidance that would help organisations publish open data about squash courts and exercise classes. I’ve written some previous blog posts that described the steps that got us to version 1.0 of the standards and then later the roadmap towards 2.0.

Since then the team have been exploring new features like publishing data about walking and cycling routes, improving accessibility information and, more recently, testing a standard API for booking classes.

If you’re interested in more of the details then I’d encourage you to dig into those posts as well as the developer portal.

What I wanted to cover in this blog post are some reflections about one of the key decisions we made early in the standards workstream. This was to base the core data model on Schema.org.

Why did we end up basing the standards on Schema.org?

We started the standards work in OpenActive by doing a proper scoping exercise. This helped us to understand the potential benefits of introducing a standard, and the requirements that would inform its development.

As part of our initial research, we did a review of what standards existed in the sector. We found very little that matched our needs. The few APIs that were provided were quite limited and proprietary and there was little consistency around how data was organised.

It was clear that some standardisation would be beneficial and that there was little in the way of sector-specific work to build on. It was also clear that we’d need a range of different types of standard. Data formats and APIs to support exchange of data, a common data model to help organise data and a taxonomy to help describe different types of activity.

For the data model, it was clear that the core domain model would need to be able to describe events. E.g. that a yoga class takes place in a specific gym at regular times. This would support basic discovery use cases. Where can I go and exercise today? What classes are happening near me?

As part of our review of existing standards, we found that Schema.org already provided this core model along with some additional vocabulary that would help us categorise and describe both the events and locations. For example, whether an Event was free, its capacity and information about the organiser.

For many people Schema.org may be more synonymous with publishing data for use by search engines. But as a project its goal is much broader, it is “a collaborative, community activity with a mission to create, maintain, and promote schemas for structured data“.

The data model covers much more than what search engines are consuming. Some communities are instead using the project as a means to collaborate on developing better vocabulary for sharing data between other applications. As well as aligning existing vocabularies under a common umbrella.

New standards should ideally be based on existing standards. We knew we were going to be building the OpenActive technical standards around a “stack” of standards that included HTTP, JSON and JSON-LD. So it was a natural step to base our initial domain model on aspects of Schema.org.

What were the benefits?

An early benefit of this approach is that we could immediately focus our roadmap on exploring extensions to the Schema.org data model that would add value to the physical activity sector.

Our initial community sessions around the OpenActive standards involved demonstrating how well the existing Schema.org model fitted the core requirements. And exploring where additional work was needed.

This meant we skipped any wrangling around how to describe events and instead focused on what we wanted to say about them. Important early questions focused on what information would potential participants find helpful in understanding whether this is specific activity or event is something that they might want to try? For example, details like: what activities they involved and for what level of competency?

We were able to identify those elements of the core Schema.org model supported out use cases and then documented some extensions in our own specifications. The extensions and clarifications were important for the OpenActive community, but not necessarily relevant in the broader context in which Schema.org is being used. We wanted to build some agreement and usage in our community first, before suggesting changes to Schema.org.

As well as giving us an initial head start, the decision also helped us address new requirements much quicker.

As we uncovered further requirements that mean expanding our data model, we were always able to initially look to see if existing Schema.org terms covered what we needed. We began using it as a kind of “dictionary” that we could draw on when needed.

Where existing parts of the Schema.org model fitted out needs, it was gratifying to be able to rapidly address the new requirements by documenting patterns for how to use them. Data publishers were also doing the same thing. Having a common dictionary of terms gave freedom to experiment with new features, drawing on terms defined in a public schema, before the community had discussed and agreed how to implement those patterns more broadly.

Every standards project has its own cadence. The speed of development and adoption are tied up with a whole range of different factors that go well beyond how quickly you can reach consensus around a specification.

But I think the decision to use Schema.org definitely accelerated progress and helped us more quickly deliver a data model that covered the core requirements for the sector.

Where were the challenges?

The approach wasn’t without its challenges, however.

Firstly, for a sector that was new to building open standards, choosing to based parts of that new standard on one project and then defining extensions created some confusion. Some communities seem more comfortable with piecing together vocabularies and taxonomies, but that is not true more widely.

Developers found it tricky to refer to both specifications, to explore their options for publishing different types of data. So we ended up expanding our documentation to cover all of the Schema.org terms we recommended or suggested people use, instead of focusing more on our own extensions.

Secondly, we also initially adopted the same flexible, non-prescriptive approach to data publishing that Schema.org uses. It does not define strict conformance critiera and there are often different options for how the same data might be organised depending on the level of detail a publisher has available. If Schema.org were too restrictive then it would limit how well the model could be used by different communities. It also leaves space for usage patterns to emerge.

In OpenActive we recognised that the physical activity sector had a wide range of capabilities when it came to publishing structured data. And different organisations organised data in different ways. We adopted the same less prescriptive approach to publishing with the goal of reducing the barriers to getting more data published. Essentially asking publishers to structure data as best they could within the options available.

In the end this wasn’t the right decision.

Too much flexibility made it harder for implementers to understand what data would be most useful to publish. And how to do it well. Many publishers were building new services to expose the data so they needed a clearer specification for their development teams.

We addressed this in Version 2 of the specifications by considerably tightening up the requirements. We defined which terms were required or just recommended (and why). And added cardinalities and legal values for terms. Our specification became a more formal, extended profile of Schema.org. This also allowed us build a data validator that is now being released and maintained alongside the specifications.

Our third challenge was about process. In a few cases we identified changes that we felt would sit more naturally within Schema.org than our own extensions. For example, they were improvements and clarifications around the core Event model that would be useful more widely. So we submitted those as proposed changes and clarifications.

Given that Schema.org has a very open process, and the wide range of people active in discussing issues and proposals, it was sometimes hard to know how decisions would get made. We had good support from Dan Brickley and others stewarding the project, but without knowing much about who is commenting on your proposal, their background or their own uses cases, it was tricky to know how much time to spend on handling this feedback. Or when we could confidently say that we had achieved some level of consensus.

We managed to successfully navigate this, by engaging as we would within any open community: working transparently and collegiately, and being willing to reflect on and incorporate feedback regardless of its source.

The final challenge was about assessing the level of use of different parts of the Schema.org model. If we wanted to propose a change in how a term was documented or suggest a revision to its expected values, it is difficult to assess the potential impact of that change. There’s no easy way to see which applications might be relying on specific parts of the model. Or how many people are publishing data that uses different terms.

The Schema.org documentation does flag terms that are currently under discussion or evaluation as “pending”. But outside of this its difficult to understand more about how the model is being used in practice. To do that you need to engage with a user community, or find some metrics about deployment.

We handled this by engaging with the open process of discussion, sharing our own planned usage to inform the discussion. And, where we felt that Schema.org didn’t fit with the direction we needed, we were happy to look to other standards that better filled those gaps. For example we chose to use SKOS to help us organise and structure a taxonomy of physical activities rather than using some of the similar vocabulary that Schema.org provides.

Choosing to draw on Schema.org as a source of part of our domain model didn’t mean that we felt tied to using only what it provides.

Some recommendations

Overall I’m happy that we made the right decision. The benefits definitely outweighed the challenges.

But navigating those challenges was easier because those of us leading the standards work were comfortable both with working in the open and in combining different standards to achieve a specific goal. Helping to build more competency in this area is one goal of the ODI standards guidebook.

If you’re involved in a project to build a common data model as part of a community project to publish data, then I’d recommend looking at whether based some or all of that model around Schema.org might help kickstart your technical work.

If you do that, my personal advice would be:

  • Remember that Schema.org isn’t the right home for every data model. Depending on your requirements, the complexity and the potential uses for the data, you may be better off designing and iterating on your model separately. Similarly, don’t expect that every change or extension you might want to make will necessarily be accepted into the core model
  • Don’t assume that search engines will start using your data, just because you’re using Schema.org as a basis for publishing, or even if you successfully submit change proposals. It’s not a means of driving adoption and use of your data or preferred model
  • Plan to write your own specifications and documentation that describe how your application or community expects data to be published. You’ll need to add more conformance criteria and document useful patterns that go beyond that Schema.org is providing
  • Work out how you will engage with your community. To make it easier to refine your specifications, discuss extensions and gather implementation feedback, you’ll still need a dedicated forum or channel for your community to collaborate. Schema.org doesn’t really provide a home for that. You might have your own github project or setup a W3C community group.
  • Build your own tooling. Schema.org are improving their own tooling, but you’ll likely need your own validation tools that are tailored to your community and your specifications
  • Contribute to the Schema.org project where you can. If you have feedback, proposed changes or revisions then submit these to the project. Its through a community approach that we improve the model for everyone. Just be aware that there are likely to be a whole range of different use cases that may be different to your own. Your proposals may need to go through several revisions before being accepted. Proposals that draw on real-world experience or are tied to actual applications will likely carry more weight than general opinions about the “right” way to design something
  • Be prepared to diverge where necessary. As I’ve explained above, sometimes the right option is to propose changes to Schema.org. And sometimes you may need to be ready to draw on other standards or approaches.

12 ways to improve the GDS guidance on reference data publishing

GDS have published some guidance about publishing reference data for reuse across government. I’ve had a read and it contains a good set of recommendations. But some of them could be clearer. And I feel like some important areas aren’t covered. So I thought I’d write this post to capture my feedback.

Like the original guidance my feedback largely ignores considerations of infrastructure or tools. That’s quite a big topic and recommendations in those areas are unlikely to be applicable solely to reference data.

The guidance also doesn’t address issues around data sharing, such as privacy or regulatory compliance. I’m also going to gloss over that. Again, not because its not important, but because those considerations apply to sharing and publishing any form of data, not just reference data

Here’s the list of things I’d revise or add to this guidance:

  1. The guidance should recommend that reference data be at open as possible, to allow it to be reused as broadly as possible. Reference data that doesn’t contain personal information should be published under an open licence. Licensing is important even for cross-government sharing because other parts of government might be working with private or third sector who also need to be able to use the reference data. This is the biggest omission for me.
  2. Reference data needs to be published over the long term so that other teams can rely on it and build it into their services and workflows. When developing an approach for publishing reference data, consider what investment needs to be made for this to happen. That investment will need to cover people and infrastructure costs. If you can’t do that, then at least indicate how long you expect to be publishing this data. Transparent stewardship can build trust.
  3. For reference data to be used, it needs to be discoverable. The guide mentions creating metadata and doing SEO on dataset pages, but doesn’t include other suggestions such as using Schema.org Dataset metadata or even just depositing metadata in data.gov.uk.
  4. The guidance should recommend that stewardship of reference data is part of a broader data governance strategy. While you may need to identify stewards for individual datasets, governance of reference data should be part of broader data governance within the organisation. It’s not a separate activity. Implementing that wider strategy shouldn’t block making early progress to open up data, but consider reference data alongside other datasets
  5. Forums for discussing how reference data is published should include external voices. The guidance suggests creating a forum for discussing reference data, involving people from across the organisation. But the intent is to publish data so it can be reused by others. This type of forum needs external voices too.
  6. The guidance should recommend documenting provenance of data. It notes that reference data might be created from multiple sources, but does not encourage recording or sharing information about its provenance. That’s important context for reusers.
  7. The guide should recommend documenting how identifiers are assigned and managed. The guidance has quite a bit of detail about adding unique identifiers to records. It should also encourage those publishing reference data to document how and when they create identifiers for things, and what types of things will be identified. Mistakes in understanding the scope and coverage of reference data can have huge impacts.
  8. There is a recommendation to allow users to report errors or provide feedback on a dataset. That should be extended to include a recommendation that the data publisher makes known errors clear to other users, as well as transparency around when individual errors might be fixed. Reporting an error without visibility of the process for fixing data is frustrating
  9. GDS might recommend an API first approach, but reference data is often used in bulk. So there should be a recommendation to have bulk access to data, not just an API. It might also be cheaper and more sustainable to share data in this way
  10. The guidance on versioning should include record level metadata. The guidance contains quite a bit of detail around versioning of datasets. While useful, it should also include suggestions to include status codes and timestamps on individual records, to simplify integration and change monitoring. Change reporting is an important but detailed topic.
  11. While the guidance doesn’t touch on infrastructure, I think it would be helpful for it to recommend that platforms and tools used to manage reference data are open sourced. This will help others to manage and publish their own reference data, and build alignment around how data is published.
  12. Finally, if multiple organisations are benefiting from use of the same reference data then encouraging exploration of collaborative maintenance might help to reduce costs for maintaining data, as well as improving its quality. This can help to ensure that data infrastructure is properly supported and invested in.

OSM Queries

For the past month I’ve been working on a small side project which I’m pleased to launch for Open Data Day 2021.

I’ve long been a fan of OpenStreetMap. I’ve contributed to the map, coordinated a local crowd-mapping project and used OSM tiles to help build web based maps. But I’ve only done a small amount of work with the actual data. Not much more than running a few Overpass API queries and playing with some of the exports available from Geofabrik.

I recently started exploring the Overpass API again to learn how to write useful queries. I wanted to see if I could craft some queries to help me contribute more effectively. For example by helping me to spot areas that might need updating. Or identify locations where I could add links to Wikidata.

There’s a quite a bit of documentation about the Overpass API and the query language it uses, which is called OverpassQL. But I didn’t find them that accessible. The documentation is more of a reference than a tutorial.

And, while, there’s quite a few example queries to find across the OSM wiki and other websites, there isn’t always a great deal of context to those examples that explain how they work or when you might use them.

So I’ve been working on two things to address what I think is a gap in helping people learn how to get more from the OpenStreetMap API.

overpass-doc

The first is a simple tool that will take a collection of Overpass queries and build a set of HTML pages from them. It’s based on a similar tool I built for SPARQL queries a few years ago. Both are inspired by Javadoc and other code documentation tools.

The idea was to encourage the publication of collections of useful, documented queries. E.g. to be shared amongst members of a community or people working on a project. The OSM wiki can be used to share queries, but it might not always be a suitable home for this type of content.

The tool is still at quite an early stage. It’s buggy, but functional.

To test it out I’ve been working on my own collection of Overpass queries. I initially started to pull together some simple examples that illustrated a few features of the language. But then realised that I should just use the tool to write a proper tutorial. So that’s what I’ve been doing for the last week or so.

Announcing OSM Queries

OSM Queries is the result. As of today the website contains four collections of queries. The main collection of queries is a 26 part tutorial that covers the basic features of Overpass QL.

By working through the tutorial you’ll learn:

  • some basics of the OpenStreetMap data model
  • how to write queries to extract nodes, ways and relations from the OSM database using a variety of different methods
  • how to filtering data to extract just the features of interest
  • how to write spatial queries to find features based on whether they are within specific areas or are within proximity to one another
  • how to output data as CSV and JSON for use in other tools

Every query in the tutorial has its own page containing an embedded syntax highlighted version of the query. This makes them easier to share with others. You can click a button to load and run the query using the Overpass Turbo IDE. So you can easily view the results and tinker with the query.

I think the tutorial covers all the basic options for querying and filtering data. Many of the queries include comments that illustrate variations of the syntax, encouraging you to further explore the language.

I’ve also been compiling an Overpass QL syntax reference that provides a more concise view of some of the information in the OSM wiki. There’s a lot of advanced features (like this) which I will likely cover in a separate tutorial.

Writing a tutorial against the live OpenStreetMap database is tricky. The results can change at any time. So I opted to focus on demonstrating the functionality using mostly natural features and administrative boundaries.

In the end I chose to focus on an area around Uluru in Australia. Not just because it provides an interesting and stable backdrop for the tutorial. But because I also wanted to encourage a tiny bit of reflection in the reader about what gets mapped, who does the mapping, and how things get tagged.

A bit of map art, and a request

The three other query collections are quite small:

I ended up getting a bit creative with the MapCSS queries.

For example, to show off the functionality I’ve written a query that shows the masonic symbol hidden in the streets of Bath, styled Brøndby Haveby like a bunch of flowers and the Lotus Bahai Temple as, well, a lotus flower.

These were all done by styling the existing OSM data. No edits were done to change the map. I wouldn’t encourage you to do that.

I’ve put all the source files and content for the website into the public domain so you’re free to adapt, use and share however you see fit.

While I’ll continue to improve the tutorial and add some more examples I’m also hoping that I can encourage others to contribute to the site. If you have useful queries that you could be added to the site then submit them via Github. I’ve provided a simple issue template to help you do that.

I’m hoping this provides a useful resource for people in the OSM community and that we can collectively improve it over time. I’ve love to get some feedback, so feel free to drop me an email, comment on this post or message me on twitter.

And if you’ve never explored the data behind OpenStreetMap then Open Data Day is a great time to dive in. Enjoy.

Bath Historical Images

One of my little side projects is to explore historical images and maps of Bath and the surrounding areas. I like understanding the contrast between how Bath used to look and how it is today. It’s grown and changed a huge amount over the years. It gives me a strong sense of place and history.

There is a rich archive of photographs and images of the city and area that were digitised for the Bath in Time project. Unfortunately the council has chosen to turn this archive into a, frankly terrible, website that is being used to sell over-priced framed prints.

The website has limited navigation and there’s no access to higher resolution imagery. Older versions of the site had better navigation and access to some maps.

The current version looks like it’s based on a default ecommerce theme for WordPress rather than being designed to show off the richness of the 40,000 images it contains. Ironically the @bathintime twitter account tweets out higher resolution images than you can find on the website.

This is a real shame. Frankly I can’t imagine there’s a huge amount of revenue being generated from these prints.

If the metadata and images were published under a more open licence (even with a non-commercial limitation) then it would be more useful for people like me who are interested in local history. We might even be able to help build useful interfaces. I would happily invest time in cataloguing images and making something useful with them. In fact, I have been.

In lieu of a proper online archive, I’ve been compiling a list of publicly available images from other museums and collections. So far, I’ve sifted through:

I’ve only found around 230 images (including some duplicates across collections) so far, but there are some interesting items in there. Including some images of old maps.

I’ve published the list as open data.

So you can take the metadata and links and explore them for yourself. I thought they may be useful for anyone looking to reuse images in their research or publications.

I’m in the process of adding geographic coordinates to each of the images, so they can be placed on the map. I’m approaching that by geocoding them as if they were produced using a mobile phone or camera. For example, an image of the abbey won’t have the coordinates of the abbey associated with it, it’ll be wherever the artist was standing when they painted the picture.

This is already showing some interesting common views over the years. I’ve included a selection below.

Views from the river, towards Pulteney Bridge

Southern views of the city

Looking to the east across abbey churchyard

Views of the Orange Grove and Abbey

It’s really interesting to be able to look at the same locations over time. Hopefully that gives a sense of what could be done if more of the archives we made available.

There’s more documentation on the dataset if you want to poke around. If you know of other collections of images I should look at, then let me know.

And if you have metadata or images to release under an open licence, or have archives you want to share, then get in touch as I may be able to help.

Cooking up a new approach to supporting purposeful use of data

In my last post I explored how we might better support the use of datasets. To do that I applied the BASEDEF framework to outline the ways in which communities might collaborate to help unlock more value from individual datasets.

But what if we changed our focus from supporting discovery and use of datasets and instead focused on helping people explore specific types of problems or questions?

Our paradigm around data discovery is based on helping people find individual datasets. But unless a dataset has been designed to answer the specific question you have in mind, then it’s unlikely to be sufficient. Any non-trivial analysis is likely to need multiple datasets.

We know that data is more useful when it is combined, so why isn’t our approach to discovery based around identifying useful collections of datasets?

A cooking metaphor

To explore this further let’s use a cooking metaphor. I love cooking.

Many cuisines are based on a standard set of elements. Common spices or ingredients that become the base of most dishes. Like a mirepoix, a sofrito, the holy trinity of Cajun cooking, or the mother sauces in French cuisine.

As you learn to cook you come to appreciate how these flavour bases and sauces can be used to create a range of dishes. Add some extra spices and ingredients and you’ve created a complete dish.

Recipes help us consistently recreate these sauces.

A recipe consists of several elements. It will have a set of ingredients and a series of steps to combine them. A good recipe will also include some context. For example some background on the origins of the recipe and descriptions of unusual spices or ingredients. It might provide some things to watch out for during the cooking (“don’t burn the spices”) or suggest substitutions for difficult to source ingredients.

Our current approach to dataset discovery involves trying to document the provenance of an individual ingredient (a dataset) really well. We aren’t helping people combine them together to achieve results.

Efforts to improve dataset metadata, documentation and provenance reporting are important. Projects like the dataset nutrition label are great examples of that. We all want to be ethical, sustainable cooks. To do that we need to make informed choices about our ingredients.

But, to whisk these food metaphors together, nutrition labels are there to help you understand what’s gone into your supermarket pasta sauce. It’s not giving you a recipe to cook it from scratch for yourself. Or an idea of how to use the sauce to make a tasty dish.

Recipes for data-informed problem solving

I think we should think about sharing dataset recipes: instructions for how to mix up a selection of dataset ingredients. What would they consist of?

Firstly, the recipe would need to based around a specific type of question, problem or challenge.  Examples might include:

  • How can I understand air quality in my city?
  • How is deprivation changing in my local area?
  • What are the impacts of COVID-19 in my local authority?

Secondly, a recipe would include a list of datasets that have to be sourced, prepared and combined together to explore the specific problem. For example, if you’re exploring impacts of COVID-19 in your local authority you’re probably going to need:

  • demographic data from the most recent census
  • spatial boundaries to help visualise and present results
  • information about deprivation to help identify vulnerable people

Those three datasets are probably the holy trinity of any local spatial analysis?

Finally, you’re going to need some instructions for how to combine the datasets together. The instructions might identify some tools you need (Excel or QGIS), reference some techniques (Reprojection) and maybe some hints about how to substitute for key ingredients if you can’t get them in your local area (FOI).

The recipe might ways to vary the recipe for different purposes: add a sprinkle of Companies House data to understand your local business community, and a dash of OpenStreetMap to identify greenspaces?

As a time saver maybe you can find some pre-made versions of some of the steps in the recipe?

Examples in the wild

OK, its easy to come up with a metaphor and an idea. But would this actually meet a need? There’s a few reasons why I’m reasonably confident that dataset recipes could be helpful. Mostly because I can see this same approach re-appearing in some related contexts. For example:

If you have examples then let me know in the comments or on twitter.

How can dataset recipes help?

I think there’s a whole range of ways in which these types of recipe can be useful.

Data analysis always starts by posing a question. By documenting how datasets can be applied specific questions will make them easier to find on search engines. It just fits better with what people want to do.

Data discovery is important during periods where there is a sudden influx of new potential users. For example, where datasets have just been published under an open licence and are now available to more people, for a wider range of purposes.

In my experience data analysts and scientists who understand a domain, e.g population or transport modelling, have built up an tacit understanding of what datasets are most useful in different contexts. They understand the limitations and the process of combining datasets together. This thread from Chris Gale with a recipe about doing spatial analysis using PHE’s COVID-19 data is a perfect example. Documenting and sharing this knowledge can help others to do similar analyses. It’s like a cooking masterclass.

Discovery is also difficult when there is a sudden influx of new data available. Such as during this pandemic. Writing recipes is a good way to share learning across a community.

Documenting useful recipes might help us scale innovation across local areas.

Lastly, we’re still trying to understand which datasets are a most important part of our local, national and international data infrastructure. We’re currently lacking any real quantitative information about how datasets are combined together. In the same way that recipes can be analysed to create ingredient networks, dataset recipes could be analysed to find out how datasets are being used together. We can then strengthen that infrastructure.

If you’ve built something that helps people publish dataset recipes then send me a link to your app. I’d like to try it.

How can you help support the use of a dataset?

Getting the most value from data, whilst minimising its harmful impacts, is a community activity. Datasets need to be governed and published well. Most of that responsibility falls on the data publisher. Because the choices they make shapes data ecosystems.

But other people have a role to play too. Being a good data user means engaging with that process.

Helping others to find data and find the value in it, feels particularly important at the moment. During the pandemic there are many new datasets becoming available. And there are lots of questions to be answered. Some of them can be answered through better use of data.

So, how can communities work together to support use of data?

There are a lot of different ways to explore that question. But there’s a framework called BASEDEF, created by the open source community, which I find helpful.

BASEDEF stands for Blog, Apply, Suggest, Extend, Document, Evangelize and Fix. It describes the different types of contributions that can support an open source project. It can also be applied to help organise a small team in doing that work. Here’s a handy cheat sheet.

But the framework can also be applied to the task of supporting the use of an openly licensed dataset. Let’s run through the framework with that in mind.


Blog

You can write about a dataset to help others to discover it. You can help explain the potential value of applying the dataset to specific problems. Or perhaps you can see some downsides that others should consider.

Writing about how a dataset has been useful to you, by describing how you’ve successfully applied it in a project, will also help others see its potential value.

Apply

You can show how a dataset can be used, by creating something with it. You might do a detailed analysis of the data, but some simpler contributions can also be helpful.

For example you might create a simple visualisation. Or write and publish some code that illustrates how the dataset can be accessed and used. You could publish a quick demo showing how the dataset can be imported and used in some frequently used tools and platforms.

At the moment everyone is a bit tired of charts and graphs. And I agree with the first principle in the visualisation design principles for the pandemic. But a helpful visualisation can do a range of things. Visualisation can be exploratory rather than explanatory.

A visualisation could support other people in understanding the shape of a dataset, to inform their analysis and interpretation of it. It can help identify outliers, gaps, or highlight some of the richness in the data. I’d recommend making it clear when you’re doing it type of visualisation, rather than trying to derive specific insights.

Suggest

Read the documentation. Download and explore the dataset. Ask questions. Give feedback.

Make suggestions to the publisher about changes they could make to publish the data better. Rather than just offer academic critique, be clear about how suggested changes will support your needs or that of your community.

Extend

The freedoms granted by an open licence allow you to enrich and improve a dataset.

Sometimes the smallest changes can have the most impact. Convert the data into other common or standard formats. Extracting data from spreadsheets into CSV files. Convert data published in more complex formats or via APIs into simpler tabular data to make it more accessible to analysts rather than programmers.

Or maybe you can enrich a dataset by adding identifiers that will allow it to be linked to other sources. Do the work of merging with other datasets to bring in more context.

The downside here is that if the original data changes your extended version will get out of date. If you can’t commit to keeping your version up to date, then be sure to share your code and document your methods.

Allow others to repeat the steps you’ve taken. And don’t forget to suggest the improvements to the publisher.

Document

Write additional documentation to fill in gaps where the publisher has not provided sufficient background or explanation. Explain technical concepts or academic terms to a non-specialist audience.

As a user of the data, you’re able to write that documentation from a perspective that reflects the needs and questions of your specific community and the kinds of questions you need to ask. The original publisher might not have all that context or understand those needs, so this work can be really helpful.

Good documentation can be a finding aid. There are structured ways that you can go about writing documentation, such as this tool for writing civic data guides. (Check out some of the examples).

Evangelise

Email people that might have a need for the data. Tweet about it to a wider community. Highlight it in a presentation. Talk about it over coffee Zoom.

Fix

If the dataset is collaboratively maintained then go ahead and fix errors and omissions. If you’re not confident about making a fix, then submit an error report. In addition to fixing errors you might be able to help verify that data is correct.

If a dataset isn’t collaboratively maintained then, when you find errors, be sure to flag them to the publisher and highlight the issue for others. Or consider publishing an enriched version with fixes applied.


This framework isn’t perfect. The name is a bit clunky for a start. But there’s a couple of things that I like about it.

Firstly, it recognises that not all contributions need to be technical. There’s room for others to use different skills and in different ways.

Secondly, the elements overlap and reinforce one another. Writing documentation and blogging about how you’ve used a dataset helps to evangelise it. Enriching a dataset can help demonstrate in a practical way how a publisher can improve how data is published.

Finally, it serves to highlight some important aspects of community curation which aren’t always well supported in existing data platforms and portals. We can do better here.

If you’re interested in working on adapting this further then happy to chat!. It might be useful to have a cheat sheet that supports its application to data and more examples of how to do these different elements well.

Why is change discovery important for open data?

Change discovery is the process of identifying changes to a resource. For example, that a document has been updated. Or, in the case of a dataset, whether some part of the data has been amended, e.g. to add data, fill in missing values, or correct existing data. If we can identify that changes have been made to a dataset, then we can update our locally cached copies, re-run analyses or generate new, enriched versions of the original.

Any developer who is building more than a disposable prototype will be looking for information about the ongoing stability and change frequency of a dataset. Typical questions might be:

  • How often will a dataset get routinely updated and republished?
  • What types of data updates are anticipated? E.g. are only new records added, or might data be amended and removed?
  • How will the dataset, or parts of it be version controlled?
  • How will changes to the dataset, or part of it (e.g. individual rows or objects) in the dataset be flagged?
  • How will planned and unplanned updates and changes be communicated to users of the dataset?
  • How will data updates be published, e.g. will there be a means of monitoring for or accepting incremental updates, or just refreshed data downloads?
  • Are large scale changes to the data model expected, and if so over what timescale?
  • Are changes to the technical infrastructure planned, and if so over what timescale?
  • How will planned (and unplanned) service downtime, e.g. for upgrades, be notified and reported?

These questions span a range of levels: from changes to individual elements of a dataset, through to the system by which it is delivered. These changes will happen at different frequencies and will be communicated in different ways.

Some times of change discovery can be done after the fact, e.g. by comparing two versions of a dataset. But in practice this is an inefficient way to synchronize and share data, as the consumer needs to reconstruct a series of edits and changes that have already been applied by the publisher of the data. To efficiently publish and distribute data we need to be able to understand when changes have happened.

Some types of changes, e.g. to data models and formats, will just break downstream systems if they are not properly advertised in advance. So it’s even more important to consider the impacts of these types of change.

A robust data infrastructure will include appropriate change notification mechanisms for each of these levels. Some will be automated. Some will be part of the process of supporting end users. For example:

  • changes to a row in a dataset might be flagged with a timestamp and a change notice
  • API responses might indicate the version of the object being retrieved
  • dataset metadata might include an indication of the planned frequency of publication and a timestamp for when the dataset was last modified
  • a data portal might include a calendar indicating when key datasets will be updated or a feed of recently updated or changed datasets
  • changes to the data model and the API used to deliver a dataset might be announced and discussed via a developer support forum

These might be implemented as technical features of the platform. But they might also be as simple as an email to users, or a public tweet.
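
As a small illustration of the dataset metadata mentioned above, the description of a dataset might include fields along these lines. This is just a sketch, loosely borrowing terms from DCAT (modified, accrualPeriodicity); the field names, URL and values are placeholders rather than a prescribed schema:

```python
# Illustrative dataset-level metadata, loosely based on DCAT terms.
# All names and values here are examples, not a fixed standard.
dataset_metadata = {
    "title": "Example dataset",
    "modified": "2021-02-01T09:30:00Z",  # when the data last changed
    "accrualPeriodicity": "monthly",     # planned frequency of publication
    "version": "2021-02",
    "distribution": {
        "downloadURL": "https://example.org/data/latest.csv",
        "checksum": "sha256:<digest of the current download>",
    },
}
```

Even this small amount of information lets a data user decide whether they need to re-download anything, and when to check back.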

Versioning of data can also help data publishers improve the scalability of their infrastructure and reduce the costs of data publishing. For example, by adding features to data portals that let data users:

  • make API calls that will only return responses if data has been updated since the user last requested it, e.g. using HTTP Conditional GET (see the sketch after this list). This can reduce bandwidth and load on the publisher by encouraging local caching of data
  • use a checksum and/or timestamps to detect whether bulk downloads have changed, to reduce bandwidth
  • subscribe to machine-readable feeds of dataset-level changes, to avoid users having to repeatedly re-download large datasets
  • subscribe to machine-readable feeds of new datasets, to facilitate mirroring of data across systems
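
As a sketch of the first two items, here’s how a data user might poll a (hypothetical) download URL using HTTP Conditional GET, assuming the publisher’s server returns ETag and/or Last-Modified headers:

```python
import requests

# Hypothetical dataset URL; assumes the server supports conditional requests.
URL = "https://example.org/data/latest.csv"

def fetch_if_changed(url, etag=None, last_modified=None):
    """Download the dataset only if it has changed since our cached copy."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified

    response = requests.get(url, headers=headers, timeout=30)
    if response.status_code == 304:
        # Not modified: keep using the locally cached copy
        return None, etag, last_modified

    response.raise_for_status()
    return (
        response.content,
        response.headers.get("ETag"),
        response.headers.get("Last-Modified"),
    )

# The first call downloads the data; later calls pass the stored validators
# back and receive a cheap 304 response if nothing has changed.
data, etag, last_modified = fetch_if_changed(URL)
```

A publisher can achieve a similar effect for bulk downloads by publishing a checksum alongside the file, so users can compare it with the checksum of their local copy before re-downloading.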

Supporting change notification and discovery, even if it’s just through documentation rather than more automated means, is an important part of engineering any good data platform.

I think it’s particularly important for open data (and other data that is liberally licensed) because these datasets are frequently copied, distributed and republished across different platforms. The ability to distribute a dataset, in different formats or with improvements and corrections, is one of the key freedoms that an open licence provides.

The downside to secondary publishing is that we end up with multiple copies of a dataset, some or all of which might be out of date, or have diverged from the original at different points in time.

Without robust approaches to provenance, change control and discovery, we run the risk of that data becoming out of date, leading to poor analyses and decision making. Multiple copies of the same dataset, while increasing ease of use, also increase friction by requiring users to find the original, authoritative data among all the copies. Or to figure out whether the copy available in their preferred platform is completely up to date with the original.

Documentation and linking to original sources can help mitigate those problems. But automating change notifications, to allow copies of datasets to be easily synchronised between platforms at the point they are updated, is also important. I’ve not seen a lot of recent work on documenting these as best practices. I think there are still some gaps in the standards landscape around data platforms. So I’d be interested to hear of examples.

In the meantime, if you’re building a data platform, think about how you can enable users to more efficiently and automatically consume updated data.

And if you’re republishing primary data in other platforms, make sure you’re including detailed information and documentation about how and when you last refreshed the dataset. Ideally your copies will update automatically as the source changes. Linking to the open source code you ran to make the secondary copy will allow others to repeat that process if they need an updated version sooner than you plan to produce one.

How can publishing more data decrease the value of existing data?

Last month I wrote a post looking at how publishing new data might increase the value of existing data. I ended up listing seven different ways including things like improving validation, increasing coverage, supporting the ability to link together datasets, etc.

But that post only looked at half of the issue. What about the opposite? Are there ways in which publishing new data might reduce the value of data that’s already available?

The short answer is: yes there are. But before jumping into that, let’s take a moment to reflect on the language we’re using.

A note on language

The original post was prompted by an economic framing of the value of data. I was exploring how the option value for a dataset might be affected by increasing access to other data. While this post is primarily looking at how option value might be reduced, we need to acknowledge that “value” isn’t the only way to frame this type of question.

We might also ask, “how might increasing access to data increase potential for harms?” As part of a wider debate around the issues of increasing access to data, we need to use more than just economic language. There’s a wealth of good writing about the impacts of data on privacy and society which I’m not going to attempt to precis here.

It’s also important to highlight that “increasing value” and “decreasing value” are relative terms.

Increasing the value of existing datasets will not seem like a positive outcome if your goal is to attempt to capture as much value as possible, rather than benefit a broader ecosystem. Similarly, decreasing value of existing data, e.g. through obfuscation, might be seen as a positive outcome if it results in better privacy or increased personal safety.

Decreasing value of existing data

Having acknowledged that, let’s try and answer the earlier question. In what ways can publishing new data reduce the value we can derive from existing data?

Increased harms leading to retraction and reduced trust

Publishing new data always runs the risk of re-identification and the enabling of unintended inferences. While the impacts of these harms are likely to be most directly felt by both communities and individuals, there are also broader commercial and national security issues. Together, these issues might ultimately reduce the value of the existing data ecosystem in several ways:

  • Existing datasets may need to be retracted, have their scope changed, or have their circulation reduced in order to avoid further harm. Data privacy impact assessments will need to be updated as the contexts in which data is being shared and published change
  • Increased concerns over potential privacy impacts might lead organisations to choose not to increase access to similar or related datasets
  • Increased concerns might also lead communities and individuals to reduce the amount of data they are willing to share with previously trusted sources

This can lead to a reduction in the overall coverage, quality and linking of data across a data ecosystem. It’s likely to be one of the most significant impacts of poorly considered data releases. It can be mitigated through proper impact assessments, consultation and engagement.

Reducing overall quality

Newly published data might be intended to increase coverage, enrich, link, validate or otherwise improve existing data. But it might actually have the opposite effect because it’s of poor quality. I’ve briefly touched on this in a previous post on fictional data.

Publication of poor quality data might be unintended. For example, an organisation may just be publishing the data it has to help address an issue, without properly considering or addressing underlying problems with it. Or a researcher may publish data that contains honest mistakes.

But publication of poor quality data might also be deliberate. For example as spam or misinformation intended to “poison the well”.

More subtly, practices like p-hacking and falsification of data, which might be intended to deliver a short-term, direct benefit to the publisher or author, can cause longer-term issues by undermining the use of other datasets.

This is why understanding and documenting the provenance of data, monitoring of retractions, fixes and updates to data, and the ability to link analyses with datasets are all so important.

Creating unnecessary competition or increasing friction

Publishing new datasets containing new observations and data about an area or topic of interest can lead to positive impacts, e.g. by increasing confidence or coverage. But datasets are also competing with one another. The same types of data might be available from different sources, but under different licences, access arrangements, pricing, etc.

This competition isn’t necessarily positive. For example, the data ecosystem might not benefit as much from the network effects that follow from linking data because key datasets are not linked or cannot be used together. Incompatible and competing datasets can add friction across an ecosystem.

Building poor foundations

Data is often published as a means of building stronger data infrastructure for a sector, or to address a specific challenge. But if that data is poorly maintained or is not sustainably funded, then the energy that goes into building the communities, tools and other datasets around that infrastructure might be wasted.

That reduces the value of existing datasets which might otherwise have provided a better foundation to build upon, or whose quality is dependent on the shared infrastructure. While this issue is similar to the previous one about competition, its root causes and impacts are slightly different.


As I noted in my earlier post, I don’t think this is an exhaustive list, and it can be improved by contributions. Leave a comment if you have any thoughts.