Change discovery is the process of identifying changes to a resource: for example, that a document has been updated or, in the case of a dataset, that some part of the data has been amended, e.g. to add data, fill in missing values, or correct existing data. If we can identify that changes have been made to a dataset, then we can update our locally cached copies, re-run analyses or generate new, enriched versions of the original.
Any developer who is building more than a disposable prototype will be looking for information about the ongoing stability and change frequency of a dataset. Typical questions might be:
- How often will a dataset get routinely updated and republished?
- What types of data updates are anticipated? E.g. are only new records added, or might data be amended and removed?
- How will the dataset, or parts of it, be version controlled?
- How will changes to the dataset, or to parts of it (e.g. individual rows or objects), be flagged?
- How will planned and unplanned updates and changes be communicated to users of the dataset?
- How will data updates be published, e.g. will there be a means of monitoring for or accepting incremental updates, or just refreshed data downloads?
- Are large scale changes to the data model expected, and if so over what timescale?
- Are changes to the technical infrastructure planned, and if so over what timescale?
- How will planned (and unplanned) service downtime, e.g. for upgrades, be notified and reported?
These questions span a range of levels: from changes to individual elements of a dataset, through to the system by which it is delivered. These changes will happen at different frequencies and will be communicated in different ways.
Some types of change discovery can be done after the fact, e.g. by comparing two versions of a dataset. But in practice this is an inefficient way to synchronize and share data, as the consumer needs to reconstruct a series of edits and changes that have already been applied by the publisher of the data. To efficiently publish and distribute data we need to be able to understand when changes have happened.
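As a rough illustration of what after-the-fact comparison involves, here is a minimal Python sketch that diffs two snapshots of a dataset. It assumes each snapshot has been loaded into a dict keyed by a stable record identifier. The function name `diff_snapshots` and the sample records are hypothetical, not taken from any particular platform.

```python
def diff_snapshots(old, new):
    """Compare two snapshots keyed by record id and report what changed."""
    old_ids, new_ids = set(old), set(new)
    return {
        "added": sorted(new_ids - old_ids),
        "removed": sorted(old_ids - new_ids),
        # Present in both snapshots, but with different content.
        "amended": sorted(k for k in old_ids & new_ids if old[k] != new[k]),
    }


# Hypothetical sample data: two downloads of the same small dataset.
old = {1: {"name": "A"}, 2: {"name": "B"}, 3: {"name": "C"}}
new = {1: {"name": "A"}, 2: {"name": "B2"}, 4: {"name": "D"}}

print(diff_snapshots(old, new))
# {'added': [4], 'removed': [3], 'amended': [2]}
```

Even in this toy form, the consumer has to hold both complete snapshots, and the result says nothing about when or why each edit was made.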
Some types of change, e.g. to data models and formats, will just break downstream systems if not properly advertised in advance. So it’s even more important to consider the impacts of these types of change.
A robust data infrastructure will include an appropriate change notification system for different levels of the system. Some of these will be automated. Some will be part of the process of supporting end users. For example:
- changes to a row in a dataset might be flagged with a timestamp and a change notice
- API responses might indicate the version of the object being retrieved
- dataset metadata might include an indication of the planned frequency of publication and a timestamp for when the dataset was last modified
- a data portal might include a calendar indicating when key datasets will be updated or a feed of recently updated or changed datasets
- changes to the data model and the API used to deliver a dataset might be announced and discussed via a developer support forum
These might be implemented as technical features of the platform. But they might also be as simple as an email to users, or a public tweet.
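To make the row-level case concrete, here is a small Python sketch of how a consumer might use per-row timestamps to pick out only the records that changed since their last sync. The `modified` field name, the ISO 8601 timestamp format and the `changed_since` helper are all assumptions made for the example. A real dataset may flag changes quite differently.

```python
from datetime import datetime, timezone


def changed_since(rows, last_sync):
    """Return the rows whose 'modified' timestamp is later than last_sync."""
    return [
        row for row in rows
        if datetime.fromisoformat(row["modified"]) > last_sync
    ]


# Hypothetical rows, each flagged with a last-modified timestamp.
rows = [
    {"id": 1, "modified": "2024-01-05T10:00:00+00:00"},
    {"id": 2, "modified": "2024-02-01T08:30:00+00:00"},
]
last_sync = datetime(2024, 1, 31, tzinfo=timezone.utc)

print([row["id"] for row in changed_since(rows, last_sync)])
# [2]
```

This only works if the publisher updates the timestamp reliably on every edit, and it cannot reveal deletions unless removed rows are also flagged in some way.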
Versioning of data can also help data publishers improve the scalability of their infrastructure and reduce the costs of data publishing. For example, adding features to data portals that might let data users:
- make API calls that will only return responses if data has been updated since the user last requested it, e.g. using HTTP Conditional GET. This can reduce bandwidth and load on the publisher by encouraging local caching of data
- use a checksum and/or timestamps to detect whether bulk downloads have changed to reduce bandwidth
- subscribe to machine-readable feeds of dataset level changes, to avoid the need for users to repeatedly re-download large datasets
- subscribe to machine-readable feeds of new datasets, to facilitate mirroring of data across systems
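The first two items in that list can be sketched with the Python standard library. This is a minimal example rather than a complete client. It assumes the publisher's server returns `ETag` and/or `Last-Modified` headers and honours conditional requests, which not every data portal does. The helper names are hypothetical.

```python
import hashlib
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build the request headers for an HTTP Conditional GET."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None):
    """Fetch url only if it changed; return (body, etag, last_modified).

    body is None when the server answers 304 Not Modified.
    """
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as error:
        if error.code == 304:  # urllib reports 304 as an HTTPError
            return None, etag, last_modified
        raise


def checksum(data):
    """SHA-256 digest of a download, for comparing against a stored value."""
    return hashlib.sha256(data).hexdigest()
```

A consumer would store the returned `ETag` and `Last-Modified` values alongside its cached copy and pass them back on the next request. Where the server offers neither header, comparing `checksum()` of a new download against a published or previously stored digest at least detects that a bulk file has changed, although it only saves bandwidth if the publisher exposes the digest separately from the file.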
Supporting change notification and discovery, even if it’s just through documentation rather than more automated means, is an important part of engineering any good data platform.
I think it’s particularly important for open data (and other data that is liberally licensed) because these datasets are frequently copied, distributed and republished across different platforms. The ability to distribute a dataset, in different formats or with improvements and corrections, is one of the key freedoms that an open licence provides.
The downside to secondary publishing is that we end up with multiple copies of a dataset, some or all of which might be out of date, or have diverged from the original at different points in time.
Without robust approaches to provenance, change control and discovery, we run the risk of that data becoming out of date and leading to poor analyses and decision making. Multiple copies of the same dataset, while increasing ease of use, also increase friction by requiring users to find the original authoritative data among all the copies, or to figure out whether the copy available in their preferred platform is completely up to date with the original.
Documentation and linking to original sources can help mitigate those problems. But automating change notifications, to allow copies of datasets to be easily synchronised between platforms at the point they are updated, is also important. I’ve not seen a lot of recent work on documenting these as best practices. I think there are still some gaps in the standards landscape around data platforms. So I’d be interested to hear of examples.
In the meantime, if you’re building a data platform, think about how you can enable users to more efficiently and automatically consume updated data.
And if you’re republishing primary data in other platforms, make sure you’re including detailed information and documentation about how and when you last refreshed the dataset. Ideally your copies will be automatically updated as the source changes. Linking to the open source code you ran to make the secondary copy will allow others to repeat that process if they need an updated version faster than you plan to produce one.
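As one possible way of recording that information, here is a short Python sketch that builds a small refresh record to publish alongside a secondary copy. The field names, the `refresh_record` helper and the example URLs are all hypothetical. They are not drawn from any metadata standard, so treat this as a starting point rather than a recommended schema.

```python
import hashlib
import json
from datetime import datetime, timezone


def refresh_record(source_url, data, script_url, retrieved_at=None):
    """Describe where a copy came from, when it was taken, and how it was made."""
    retrieved_at = retrieved_at or datetime.now(timezone.utc)
    return {
        "source": source_url,                          # the original, authoritative dataset
        "retrieved": retrieved_at.isoformat(),         # when this copy was last refreshed
        "sha256": hashlib.sha256(data).hexdigest(),    # fingerprint of the downloaded bytes
        "generated_by": script_url,                    # code used to build the copy
    }


# Hypothetical values, for illustration only.
record = refresh_record(
    "https://example.org/data/original.csv",
    b"id,name\n1,A\n",
    "https://example.org/code/make-copy",
)
print(json.dumps(record, indent=2))
```

Publishing a record like this next to the copy lets users judge how stale it might be, and lets them compare the fingerprint against the original source.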