Some lessons learned from building standards around Schema.org

OpenActive is a community-led initiative in the sport and physical activity sector in England. It’s goal is to help to get people healthier and more active by making its easier for people to find information about activities and events happening in their area. Publishing open data about opportunities to be active is a key part of its approach.

The initiative has been running for several years, funded by Sport England. Its supported by a team at the Open Data Institute who are working in close collaboration with a range of organisations across the sector.

During the early stages of the project I was responsible for leading the work to develop the technical standards and guidance that would help organisations publish open data about squash courts and exercise classes. I’ve written some previous blog posts that described the steps that got us to version 1.0 of the standards and then later the roadmap towards 2.0.

Since then the team have been exploring new features like publishing data about walking and cycling routes, improving accessibility information and, more recently, testing a standard API for booking classes.

If you’re interested in more of the details then I’d encourage you to dig into those posts as well as the developer portal.

What I wanted to cover in this blog post are some reflections about one of the key decisions we made early in the standards workstream. This was to base the core data model on Schema.org.

Why did we end up basing the standards on Schema.org?

We started the standards work in OpenActive by doing a proper scoping exercise. This helped us to understand the potential benefits of introducing a standard, and the requirements that would inform its development.

As part of our initial research, we did a review of what standards existed in the sector. We found very little that matched our needs. The few APIs that were provided were quite limited and proprietary and there was little consistency around how data was organised.

It was clear that some standardisation would be beneficial and that there was little in the way of sector-specific work to build on. It was also clear that we’d need a range of different types of standard. Data formats and APIs to support exchange of data, a common data model to help organise data and a taxonomy to help describe different types of activity.

For the data model, it was clear that the core domain model would need to be able to describe events. E.g. that a yoga class takes place in a specific gym at regular times. This would support basic discovery use cases. Where can I go and exercise today? What classes are happening near me?

As part of our review of existing standards, we found that Schema.org already provided this core model along with some additional vocabulary that would help us categorise and describe both the events and locations. For example, whether an Event was free, its capacity and information about the organiser.

For many people Schema.org may be more synonymous with publishing data for use by search engines. But as a project its goal is much broader, it is “a collaborative, community activity with a mission to create, maintain, and promote schemas for structured data“.

The data model covers much more than what search engines are consuming. Some communities are instead using the project as a means to collaborate on developing better vocabulary for sharing data between other applications. As well as aligning existing vocabularies under a common umbrella.

New standards should ideally be based on existing standards. We knew we were going to be building the OpenActive technical standards around a “stack” of standards that included HTTP, JSON and JSON-LD. So it was a natural step to base our initial domain model on aspects of Schema.org.

What were the benefits?

An early benefit of this approach is that we could immediately focus our roadmap on exploring extensions to the Schema.org data model that would add value to the physical activity sector.

Our initial community sessions around the OpenActive standards involved demonstrating how well the existing Schema.org model fitted the core requirements. And exploring where additional work was needed.

This meant we skipped any wrangling around how to describe events and instead focused on what we wanted to say about them. Important early questions focused on what information would potential participants find helpful in understanding whether this is specific activity or event is something that they might want to try? For example, details like: what activities they involved and for what level of competency?

We were able to identify those elements of the core Schema.org model supported out use cases and then documented some extensions in our own specifications. The extensions and clarifications were important for the OpenActive community, but not necessarily relevant in the broader context in which Schema.org is being used. We wanted to build some agreement and usage in our community first, before suggesting changes to Schema.org.

As well as giving us an initial head start, the decision also helped us address new requirements much quicker.

As we uncovered further requirements that mean expanding our data model, we were always able to initially look to see if existing Schema.org terms covered what we needed. We began using it as a kind of “dictionary” that we could draw on when needed.

Where existing parts of the Schema.org model fitted out needs, it was gratifying to be able to rapidly address the new requirements by documenting patterns for how to use them. Data publishers were also doing the same thing. Having a common dictionary of terms gave freedom to experiment with new features, drawing on terms defined in a public schema, before the community had discussed and agreed how to implement those patterns more broadly.

Every standards project has its own cadence. The speed of development and adoption are tied up with a whole range of different factors that go well beyond how quickly you can reach consensus around a specification.

But I think the decision to use Schema.org definitely accelerated progress and helped us more quickly deliver a data model that covered the core requirements for the sector.

Where were the challenges?

The approach wasn’t without its challenges, however.

Firstly, for a sector that was new to building open standards, choosing to based parts of that new standard on one project and then defining extensions created some confusion. Some communities seem more comfortable with piecing together vocabularies and taxonomies, but that is not true more widely.

Developers found it tricky to refer to both specifications, to explore their options for publishing different types of data. So we ended up expanding our documentation to cover all of the Schema.org terms we recommended or suggested people use, instead of focusing more on our own extensions.

Secondly, we also initially adopted the same flexible, non-prescriptive approach to data publishing that Schema.org uses. It does not define strict conformance critiera and there are often different options for how the same data might be organised depending on the level of detail a publisher has available. If Schema.org were too restrictive then it would limit how well the model could be used by different communities. It also leaves space for usage patterns to emerge.

In OpenActive we recognised that the physical activity sector had a wide range of capabilities when it came to publishing structured data. And different organisations organised data in different ways. We adopted the same less prescriptive approach to publishing with the goal of reducing the barriers to getting more data published. Essentially asking publishers to structure data as best they could within the options available.

In the end this wasn’t the right decision.

Too much flexibility made it harder for implementers to understand what data would be most useful to publish. And how to do it well. Many publishers were building new services to expose the data so they needed a clearer specification for their development teams.

We addressed this in Version 2 of the specifications by considerably tightening up the requirements. We defined which terms were required or just recommended (and why). And added cardinalities and legal values for terms. Our specification became a more formal, extended profile of Schema.org. This also allowed us build a data validator that is now being released and maintained alongside the specifications.

Our third challenge was about process. In a few cases we identified changes that we felt would sit more naturally within Schema.org than our own extensions. For example, they were improvements and clarifications around the core Event model that would be useful more widely. So we submitted those as proposed changes and clarifications.

Given that Schema.org has a very open process, and the wide range of people active in discussing issues and proposals, it was sometimes hard to know how decisions would get made. We had good support from Dan Brickley and others stewarding the project, but without knowing much about who is commenting on your proposal, their background or their own uses cases, it was tricky to know how much time to spend on handling this feedback. Or when we could confidently say that we had achieved some level of consensus.

We managed to successfully navigate this, by engaging as we would within any open community: working transparently and collegiately, and being willing to reflect on and incorporate feedback regardless of its source.

The final challenge was about assessing the level of use of different parts of the Schema.org model. If we wanted to propose a change in how a term was documented or suggest a revision to its expected values, it is difficult to assess the potential impact of that change. There’s no easy way to see which applications might be relying on specific parts of the model. Or how many people are publishing data that uses different terms.

The Schema.org documentation does flag terms that are currently under discussion or evaluation as “pending”. But outside of this its difficult to understand more about how the model is being used in practice. To do that you need to engage with a user community, or find some metrics about deployment.

We handled this by engaging with the open process of discussion, sharing our own planned usage to inform the discussion. And, where we felt that Schema.org didn’t fit with the direction we needed, we were happy to look to other standards that better filled those gaps. For example we chose to use SKOS to help us organise and structure a taxonomy of physical activities rather than using some of the similar vocabulary that Schema.org provides.

Choosing to draw on Schema.org as a source of part of our domain model didn’t mean that we felt tied to using only what it provides.

Some recommendations

Overall I’m happy that we made the right decision. The benefits definitely outweighed the challenges.

But navigating those challenges was easier because those of us leading the standards work were comfortable both with working in the open and in combining different standards to achieve a specific goal. Helping to build more competency in this area is one goal of the ODI standards guidebook.

If you’re involved in a project to build a common data model as part of a community project to publish data, then I’d recommend looking at whether based some or all of that model around Schema.org might help kickstart your technical work.

If you do that, my personal advice would be:

Remember that Schema.org isn’t the right home for every data model. Depending on your requirements, the complexity and the potential uses for the data, you may be better off designing and iterating on your model separately. Similarly, don’t expect that every change or extension you might want to make will necessarily be accepted into the core model
Don’t assume that search engines will start using your data, just because you’re using Schema.org as a basis for publishing, or even if you successfully submit change proposals. It’s not a means of driving adoption and use of your data or preferred model
Plan to write your own specifications and documentation that describe how your application or community expects data to be published. You’ll need to add more conformance criteria and document useful patterns that go beyond that Schema.org is providing
Work out how you will engage with your community. To make it easier to refine your specifications, discuss extensions and gather implementation feedback, you’ll still need a dedicated forum or channel for your community to collaborate. Schema.org doesn’t really provide a home for that. You might have your own github project or setup a W3C community group.
Build your own tooling. Schema.org are improving their own tooling, but you’ll likely need your own validation tools that are tailored to your community and your specifications
Contribute to the Schema.org project where you can. If you have feedback, proposed changes or revisions then submit these to the project. Its through a community approach that we improve the model for everyone. Just be aware that there are likely to be a whole range of different use cases that may be different to your own. Your proposals may need to go through several revisions before being accepted. Proposals that draw on real-world experience or are tied to actual applications will likely carry more weight than general opinions about the “right” way to design something
Be prepared to diverge where necessary. As I’ve explained above, sometimes the right option is to propose changes to Schema.org. And sometimes you may need to be ready to draw on other standards or approaches.

Why did we end up basing the standards on Schema.org?

What were the benefits?

Where were the challenges?

Some recommendations

Share this:

Published by Leigh Dodds