Brief review of revisions and corrections policies for official statistics

In my earlier post on the importance of tracking updates to datasets I noted that the UK Statistics Authority Code of Practice includes a requirement that publishers of official statistics must publish a policy that describes their approach to revisions and corrections.

See 3.9 in T3: Orderly Release, which states: “Scheduled revisions or unscheduled corrections to the statistics and data should be released as soon as practicable. The changes should be handled transparently in line with a published policy.”

The Code of Practice includes definitions of both Scheduled Revisions and Unscheduled Corrections.

Scheduled Revisions are defined as: “Planned amendments to published statistics in order to improve quality by incorporating additional data that were unavailable at the point of initial publication”.

Whereas Unscheduled Corrections are: “Amendments made to published statistics in response to the identification of errors following their initial publication”.

I decided to have a read through a bunch of policies to see what they include and how they compare.

Here are some observations based on a brief survey of this list of 15 different policies, including those by the Office for National Statistics, the FSA, Gambling Commission, CQC, DfE, PHE, HESA and others.

Scope

The Code of Practice applies to official statistics. Some organisations publishing official statistics also publish other statistical datasets.

Organisations take different approaches to the scope of their policies. A policy may apply:

to all of an organisation's statistical outputs, regardless of designation
only to those outputs that are official statistics
to a specific dataset, with individual policies written for each one

There’s some variation in the amount of detail provided.

Some read as basic compliance documents, with brief statements of intent to follow the recommendations of the Code of Practice. They include, for example, a note that revisions and corrections will be handled transparently and in a timely way, with general notes about how that will happen.

Others are more detailed, giving more insight into how the policy will actually be carried out in practice. From a data consumer perspective these feel a bit more useful as they often include timescales for reporting, lines of responsibility and notes about how changes are communicated.

Definitions

Some policies elaborate on the definitions in the Code of Practice, breaking down the types of scheduled revisions and sources of error in more detail.

For example some policies indicate that changes to statistics may be driven by:

access to new or corrected source data
routine recalculations, as per methodologies, to establish baselines
improvements to methodologies
corrections to calculations

Some organisations publish provisional releases of their statistics, so their policies discuss Scheduled Revisions in this light: a dataset is published in one or more provisional releases before being finalised. Between those releases the organisation may receive new or updated data that affects how the statistics are calculated, or it may fix errors.

Other organisations do not publish provisional statistics so their datasets do not have scheduled revisions.

A few policies include a classification of the severity of errors, along the lines of:

major errors that impact interpretation or reuse of data
minor errors in statistics, which may include anything that is not major
other minor errors or mistakes, e.g. typographical errors

These classifications are used to describe different approaches to handling the errors, appropriate to their severity.
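
As a rough illustration, the three tiers could be modelled as an enumeration with a handling approach attached to each one. This is a minimal sketch and not taken from any of the policies: the names `Severity`, `HANDLING` and `handling_for`, and the wording of each handling note, are my own assumptions.

```python
from enum import Enum


class Severity(Enum):
    """Hypothetical severity tiers, loosely following the classification above."""
    MAJOR = "major"          # impacts interpretation or reuse of the data
    MINOR = "minor"          # an error in the statistics that is not major
    TYPOGRAPHICAL = "typo"   # other minor mistakes, e.g. typographical errors


# Illustrative handling notes only (assumed); real policies differ in the detail.
HANDLING = {
    Severity.MAJOR: "correct as soon as practicable and notify users",
    Severity.MINOR: "correct at the next planned release",
    Severity.TYPOGRAPHICAL: "fix directly",
}


def handling_for(severity: Severity) -> str:
    """Return the illustrative handling note for a severity tier."""
    return HANDLING[severity]


print(handling_for(Severity.MAJOR))
```

Keeping the handling notes in a single lookup table means every tier has exactly one documented approach, which is roughly what the more detailed policies do in prose.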

Decision making

The policies frequently require decisions about how specific revisions and corrections should be handled. These decisions have implications for the time and resources invested in handling and communicating the changes.

In some cases responsibility lies with a senior leader, e.g. a Head of Profession, or other senior analyst. In some cases decision making rests with the product owner with responsibility for the dataset.

Scheduled revisions

Scheduled changes are, by definition, planned in advance. So the policy sections relating to these revisions are typically brief and tend to focus on the release process.

In general, the policies align around:

having clear timetables for when revisions are to be expected
summarising key impacts, detail and extent of revisions in the next release of a publication and/or dataset
clear labelling of provisional, final and revised statistics

Several of the policies include methodological changes in their handling of scheduled revisions. These explain that changes will be consulted on and clearly communicated in advance. In some cases historical data may be revised to align with the new methodology.

Corrections

Handling of corrections tends to be the largest section of each policy. These sections frequently highlight that, despite rigorous quality control, errors may creep in, either because of mistakes or because of corrections to upstream data sources.

There are different approaches to how quickly errors will be handled and fixed. In some cases this depends on the severity of errors. But in others the process is based on publication schedules or organisational preference.

For example, in one case (SEPA), there is a stated preference to handle publication of unscheduled corrections once a year. In other policies corrections will be applied at the next planned (“orderly”) release of the dataset.

Impact assessments

Several policies note that there will be an impact assessment undertaken to fully understand an error before any changes are made.

These assessments include questions like:

does the error impact a headline figure or statistic?
is the error within previously reported margins of accuracy or certainty?
who will be impacted by the change?
what are the consequences of the change, e.g. does it affect the main insights from the previously published statistics or how they might be used?

Severity of errors

Major errors tend to get some special treatment. Corrections to these errors are typically made more rapidly. But there are few commitments to timeliness of publishing corrections. “As soon as possible” is a typical statement.

The two exceptions I noted are the MOD policy, which notes that minor errors will be corrected within 12 months, and the CQC policy, which commits to publishing corrections within 20 days of an agreement to do so. (Others may include commitments that I’ve missed.)

A couple of policies highlight that an error may be identified before a fix is available. In these cases, the existence of the error will still be reported.

The Welsh Revenue Authority policy was the only one that noted that a dataset might even be retracted from publication while an error is fixed.

A couple of policies noted that minor errors that did not impact interpretation may not be fixed at all. For example one ONS policy notes that errors within the original bounds of uncertainty in the statistics may not be corrected.

Minor typographic errors might just be directly fixed on websites without recording or reporting of changes.

Marking

There seems to be general consensus on the use of “p” for provisional and “r” for revised figures in statistics. Interestingly, the Welsh Revenue Authority policy notes that while there are accepted Welsh translations for “provisional” and “revised”, the marker symbols remain untranslated.

Some policies clarify that these markers may be applied at several levels, e.g. to individual cells as well as rows and columns in a table.
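
To make the convention concrete, here is a minimal sketch of attaching “p” and “r” markers to individual figures and to a whole row. The `Figure` class, the function names and the bracketed rendering (e.g. `1,250 [p]`) are assumptions of mine for illustration; the policies don't prescribe a particular format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Figure:
    """A single published figure with an optional status marker."""
    value: float
    marker: Optional[str] = None  # "p" = provisional, "r" = revised, None = final


def format_figure(figure: Figure) -> str:
    """Render a figure for a table cell, appending its marker in brackets if set."""
    text = f"{figure.value:,.0f}"
    if figure.marker:
        text += f" [{figure.marker}]"
    return text


def mark_row(figures: list[Figure], marker: str) -> list[Figure]:
    """Apply the same marker to every figure in a row (or column)."""
    return [Figure(f.value, marker) for f in figures]


# Made-up values, purely for illustration.
row = [Figure(1250), Figure(980, "r")]
print([format_figure(f) for f in row])
print([format_figure(f) for f in mark_row(row, "p")])
```

Storing the marker alongside the value, rather than baking it into a display string, keeps the figures usable as numbers while still letting a marker be applied at cell, row or column level.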

Only one policy noted a convention around adding “revised” to a dataset name.

Communications

As required by the Code of Practice, the policies align on providing some transparency around what has been changed and the reason for the changes. Where they differ is around how that will be communicated and how much detail is included in the policy.

In general, revisions and corrections will simply be explained in the next release of the dataset, or sooner if a major error is fixed. The goal is to give users a reason for the change and the details of its impact on the statistics and data.

These explanations are handled by additional documentation to be included in publications, markers on individual statistics, etc. Revision logs and notices are common.
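
As an illustration of what a machine-readable revision log might hold, here is a minimal sketch. The `RevisionLogEntry` fields and the pipe-separated rendering are my own assumptions, chosen to capture what the policies say should be communicated (what changed, why, and the impact); the example entries are invented.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class RevisionLogEntry:
    """One hypothetical entry in a dataset's revision log."""
    dataset: str
    published: date
    kind: str     # "scheduled revision" or "unscheduled correction"
    reason: str   # why the change was made
    impact: str   # effect on the statistics and data


def render_log(entries: list[RevisionLogEntry]) -> str:
    """Render entries as plain text lines, most recent first."""
    ordered = sorted(entries, key=lambda e: e.published, reverse=True)
    return "\n".join(
        f"{e.published.isoformat()} | {e.dataset} | {e.kind} | {e.reason} | {e.impact}"
        for e in ordered
    )


# Invented entries, purely for illustration.
log = [
    RevisionLogEntry("Example dataset", date(2021, 3, 1), "scheduled revision",
                     "late returns incorporated", "provisional figures finalised"),
    RevisionLogEntry("Example dataset", date(2021, 6, 15), "unscheduled correction",
                     "calculation error in one table", "headline figure unchanged"),
]
print(render_log(log))
```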

Significant changes to methodologies or major errors get special treatment, e.g. notices on websites or announcements via Twitter.

Many of the policies also explain that known users or “key users” will be informed of significant revisions or corrections. Presumably this is via email or other communications.

One policy noted that the results of their impact assessment and decision making around how to handle a problem might be shared publicly.

Capturing lessons learned

A few of the policies included a commitment to carry out a review of how an error occurred in order to improve internal processes, procedures and methods. This process may be extended to include data providers where appropriate.

One policy noted that the results of this review and any planned changes might be published where it would be deemed to increase confidence in the data.

Wrapping up

I found this to be an interesting exercise. It isn’t a comprehensive review, but hopefully it provides a useful summary of approaches.

I’m going to resist the urge to write recommendations or thoughts on what might be added to these policies. Reading a policy doesn’t tell us how well it’s implemented, or whether users feel it is serving their needs.

I will admit to feeling a little surprised that there isn’t a more structured approach in many cases. For example, I had expected pointers to where I might find a list of recent revisions, or how to sign up to be notified as an interested user of the data.

I had also expected some stronger commitments about how quickly fixes may be made. These can be difficult to make in a general policy, but are what you might expect from a data product or service.

These elements might be covered by other policies or regulations. If you know of any that are worth reviewing, then let me know.