GDS have published some guidance about publishing reference data for reuse across government. I’ve had a read and it contains a good set of recommendations. But some of them could be clearer. And I feel like some important areas aren’t covered. So I thought I’d write this post to capture my feedback.
Like the original guidance, my feedback largely ignores considerations of infrastructure or tools. That’s quite a big topic and recommendations in those areas are unlikely to be applicable solely to reference data.
The guidance also doesn’t address issues around data sharing, such as privacy or regulatory compliance. I’m also going to gloss over that. Again, not because it’s not important, but because those considerations apply to sharing and publishing any form of data, not just reference data.
Here’s the list of things I’d revise or add to this guidance:
- The guidance should recommend that reference data be as open as possible, to allow it to be reused as broadly as possible. Reference data that doesn’t contain personal information should be published under an open licence. Licensing is important even for cross-government sharing because other parts of government might be working with private or third sector organisations who also need to be able to use the reference data. This is the biggest omission for me.
- Reference data needs to be published over the long term so that other teams can rely on it and build it into their services and workflows. When developing an approach for publishing reference data, consider what investment needs to be made for this to happen. That investment will need to cover people and infrastructure costs. If you can’t do that, then at least indicate how long you expect to be publishing this data. Transparent stewardship can build trust.
- For reference data to be used, it needs to be discoverable. The guide mentions creating metadata and doing SEO on dataset pages, but doesn’t include other suggestions such as using Schema.org Dataset metadata or even just depositing metadata in data.gov.uk.
- The guidance should recommend that stewardship of reference data is part of a broader data governance strategy. While you may need to identify stewards for individual datasets, governance of reference data should be part of broader data governance within the organisation. It’s not a separate activity. Implementing that wider strategy shouldn’t block making early progress to open up data, but reference data should be considered alongside other datasets.
- Forums for discussing how reference data is published should include external voices. The guidance suggests creating a forum for discussing reference data, involving people from across the organisation. But the intent is to publish data so it can be reused by others. This type of forum needs external voices too.
- The guidance should recommend documenting provenance of data. It notes that reference data might be created from multiple sources, but does not encourage recording or sharing information about its provenance. That’s important context for reusers.
- The guide should recommend documenting how identifiers are assigned and managed. The guidance has quite a bit of detail about adding unique identifiers to records. It should also encourage those publishing reference data to document how and when they create identifiers for things, and what types of things will be identified. Mistakes in understanding the scope and coverage of reference data can have huge impacts.
- There is a recommendation to allow users to report errors or provide feedback on a dataset. That should be extended to recommend that the data publisher makes known errors clear to other users, and is transparent about when individual errors might be fixed. Reporting an error without visibility of the process for fixing data is frustrating.
- GDS might recommend an API-first approach, but reference data is often used in bulk. So there should be a recommendation to provide bulk access to data, not just an API. It might also be cheaper and more sustainable to share data in this way.
- The guidance on versioning should include record level metadata. The guidance contains quite a bit of detail around versioning of datasets. While useful, it should also include suggestions to include status codes and timestamps on individual records, to simplify integration and change monitoring. Change reporting is an important but detailed topic.
- While the guidance doesn’t touch on infrastructure, I think it would be helpful for it to recommend that platforms and tools used to manage reference data are open sourced. This will help others to manage and publish their own reference data, and build alignment around how data is published.
- Finally, if multiple organisations are benefiting from use of the same reference data then encouraging exploration of collaborative maintenance might help to reduce costs for maintaining data, as well as improving its quality. This can help to ensure that data infrastructure is properly supported and invested in.
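
To make the point about record-level metadata a bit more concrete, here’s a minimal sketch of how status codes and timestamps on individual records might simplify change monitoring for a reuser. Everything in it is an assumption for illustration: the field names, the status values and the example records are hypothetical and aren’t taken from the GDS guidance.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ReferenceRecord:
    """One row of a hypothetical reference dataset."""
    identifier: str       # stable unique identifier for the thing described
    name: str
    status: str           # hypothetical codes: "active", "deprecated", "withdrawn"
    updated_at: datetime  # when this individual record last changed


def changed_since(records, since):
    """Return the records modified after `since`, oldest change first."""
    changed = [r for r in records if r.updated_at > since]
    return sorted(changed, key=lambda r: r.updated_at)


# Made-up example data, purely for illustration.
RECORDS = [
    ReferenceRecord("org-001", "Example Department", "active",
                    datetime(2020, 1, 10, tzinfo=timezone.utc)),
    ReferenceRecord("org-002", "Example Agency", "deprecated",
                    datetime(2020, 3, 5, tzinfo=timezone.utc)),
    ReferenceRecord("org-003", "Example Office", "active",
                    datetime(2020, 2, 20, tzinfo=timezone.utc)),
]

# A reuser who last synchronised on 1 February only needs to look at
# records whose timestamp is later than that.
last_sync = datetime(2020, 2, 1, tzinfo=timezone.utc)
for record in changed_since(RECORDS, last_sync):
    print(record.identifier, record.status, record.updated_at.date())
```

With per-record timestamps a reuser can pick out what changed from a bulk download without diffing whole dataset versions, and a status code lets them tell a retired entry apart from one that has simply gone missing.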