Schema explorers and how they can help guide adoption of common standards

Despite being very different projects Wikidata and OpenStreetmap have a number of similarities. Recurring patterns in how they organise and support the work of their communities.

We documented a number of these patterns in the ODI Collaborative Maintenance Guidebook. There were also a number we didn’t get time to write-up.

A further pattern which I noticed recently is that both Wikidata and OSM provide tools and documentation that help contributors and data users explore the schema that shapes the data.

Both projects have a core data model around which their communities are building and iterating on a more focused domain model. This approach of providing tools for the community to discuss, evolve and revise a schema is what we called the Shared Canvas pattern in the ODI guidebook.

In OpenStreetmap that core model is consists of nodes, ways and relations. Tags (name-value pairs) can be attached to any of these types.

In Wikidata the core data model is essentially a graph. A collection of statements that associate values with nodes using a range of different properties. It’s actually more complicated than that, but the detail isn’t important here.

The list of properties in Wikidata and the list of tags in OpenStreetmap are continually revised and extended by the community to capture additional information.

The OpenStreetmap community documents tags in its Wiki (e.g. the building tag). Wikidata documents its properties within the project dataset (e.g. the name property, P2561).

But to successfully apply the Shared Canvas pattern, you also need to keep the community up to date about your Evolving Schema. To do that you need some way to communicate which properties or tags are in use, and how. OSM and Wikidata both provide tools to support that.

In OSM this role is filled by TagInfo. It can provide you with a break down of what type of feature the tag is used on, the range of values, combinations with other tags and some idea of its geographic usage. Tag uses varies by geographic community in OSM. Here’s the information about the building tag.

In Wikidata this tooling is provided by a series of reports that are available from the Discussion page for an individual property. This includes information about how often it is used and pointers to examples of frequent and recent uses. Here’s the information about the name property.

Both tools provide useful insight into how different aspects of a schema are being adopted and uses. They can help guide not just the discussion around the schema (“is this tag in use?”, but also the process of collecting data (“which tags should I use here”) and using the data (“what tags might I find, or query for?”).

Any project that adopts a Shared Canvas approach is likely to need to implement this type of tooling. Lets call it the “Schema explorer” pattern for now.

I’ll leave documenting it further for another post, or a contribution to the guidebook.

Schema explorers for open standards and open data

This type of tooling would be useful in other contexts.

Anywhere that we’re trying to drive adoption of a common data standard, it would be helpful to be able to assess how well used different parts of that schema are by analysing the available data.

That’s not something I’ve regularly seen produced. In our survey of decentralised publishing initiatives at the ODI we found common types of documentation, data validators and other tools to support use of data, like useful aggregations. But no tooling to help explore how well it is adopted. Or to help data users understand the shape of the available data prior to aggregating it.

When i was working on the OpenActive standard, I found the data profiles that Dan Winchester produced really helpful. They provide useful insight into which parts of a standard different publishers were actually using.

I was thinking about this again recently whilst doing some work for Full Fact, exploring the ClaimReview markup in Schema.org. It would be great to see which features different fact checkers are actually using. In fact that would be true of many different aspects of Schema.org.

This type of reporting is hard to do in a distributed environment without aggregating all the data. But Google are regularly harvesting some of this data, so it feels like it would be relatively easy for them to provide insights like this if they chose.

An alternative is the Schema.org Table Corpus which provides exports of Schema.org data contained in the Common Crawl dataset. But more work is likely needed to generate some useful views over the data, and it is less frequently updated.

Outside of Schema.org, schema explorers reporting on the contents of open datasets, would help inform a range of standards work. For example, it could help inform decisions about how to iterate on a schema, guide the production of documentation, and help improve the design of validators and other tools.

If you’ve seen examples of this type of tooling, then I’d be interested to see some links.