How Do We Attribute Data?

This post is another in my ongoing series of “basic questions about open data”, which includes “What is a Dataset?” and “What does a dataset contain?”. In this post I want to focus on dataset attribution and in particular questions such as:

Why should we attribute data?
How are data publishers asking to be attributed?
What are some of the issues with attribution?
Can we identify some common conventions around attribution?
Can we monitor or track attribution?

I started to think about this because I’ve encountered a number of data publishers recently that have published Open Data but are now struggling to highlight how and where that data has been used or consumed. If data is published for anonymous download, or accessible through an open API then a data publisher only has usage logs to draw on.

I had thought that attribution might help here: if we can find links back to sources, then perhaps we can help data publishers mine the web for links and help them build evidence of usage. But it quickly became clear, as we’ll see in a moment, that there really aren’t any conventions around attribution, making it difficult to achieve this.

So let’s explore the topic from first principles and tick off my questions individually.

Why Attribute?

The obvious answer here is simply that if we are building on the work of others, then it’s only fair that those efforts should be acknowledged. This helps the creator of the data (or work, or code) be recognised for their creativity and effort, which is the very least we can do if we’re not exchanging hard cash.

There are also legal reasons why the source of some data might need to be acknowledged. Some licenses require attribution, and copyright may need to be acknowledged. As a consumer I might also want to (or need to) clearly indicate that I am not the originator of some data in case it is found to be false, or misleading, etc.

Acknowledging my sources may also help guarantee that the data I’m using continues to be available: a data publisher might be collecting evidence of successful re-use in order to justify ongoing budget for data collection, curation and publishing. This is especially true when the data publisher is not directly benefiting from the data supply; and I think it’s almost always true for public sector data. If I’m reusing some data I should make it as clear as possible that I’m doing so.

There’s some additional useful background on attribution from a public sector perspective in a document called “Supporting attribution, protecting reputation, and preserving integrity”.

It might also be useful to distinguish between:

Attribution — highlighting the creator/publisher of some data to acknowledge their efforts, conferring reputation
Citation — providing a link or reference to the data itself, in order to communicate provenance or drive discovery

While these two cases clearly overlap, the intention is often slightly different. As a user of an application, or the reader of an academic paper, I might want a clear citation to the underlying dataset so I can re-use it myself, or do some fact checking. The important use case there is tracking facts and figures back to their sources. Attribution is more about crediting the effort involved in collecting that information.

It may be possible to achieve both goals with a simple link, but I think recognising the different use cases is important.

How are data publishers asking to be attributed?

So how are data publishers asking for attribution? What follows isn’t an exhaustive survey but should hopefully illustrate some of the variety.

Let’s look first at some of the suggested wordings in some common Open Data licenses, then poke around in some terms and conditions to see how these are being applied in practice.

Attribution Statements in Common Open Data Licenses

The Open Data Commons Attribution license includes some recommended text (Section 4.3a – Example Notice):

Contains information from DATABASE NAME which is made available under the ODC Attribution License.

Here DATABASE NAME is the name of the dataset and is linked to the dataset homepage. Notice that there is no mention of the originator, just the database. The license notes that in plain text the links should be included as text. The Open Data Commons Database license has the same text (again, section 4.3a).
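
As a rough sketch of that guidance, the notice could be generated in both an HTML and a plain-text form. The function names and the example.org URLs below are my own placeholders, not anything defined by the license, so check the license text itself before relying on the wording.

```python
from html import escape


def odc_notice_html(database_name: str, database_url: str, license_url: str) -> str:
    """Build the example notice as HTML, linking the database name and the license."""
    return (
        f'Contains information from <a href="{escape(database_url)}">'
        f"{escape(database_name)}</a> which is made available under the "
        f'<a href="{escape(license_url)}">ODC Attribution License</a>.'
    )


def odc_notice_text(database_name: str, database_url: str, license_url: str) -> str:
    """Build the plain-text form, with the links written out as text."""
    return (
        f"Contains information from {database_name} ({database_url}) which is "
        f"made available under the ODC Attribution License ({license_url})."
    )


# Placeholder values for illustration only.
html_notice = odc_notice_html(
    "Example DB", "https://example.org/db", "https://example.org/license"
)
text_notice = odc_notice_text(
    "Example DB", "https://example.org/db", "https://example.org/license"
)
print(html_notice)
print(text_notice)
```

Note that neither form has anywhere to put the name of the originator, which mirrors the recommended text.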

The UK Open Government License notes that re-users should:

…acknowledge the source of the Information by including any attribution statement specified by the Information Provider(s) and, where possible, provide a link to this licence

Where no attribution is provided, or multiple sources must be attributed, then the suggested default text, which should include a link to the license, is:

Contains public sector information licensed under the Open Government Licence v1.0.

So again, no reference to the publisher but also no reference to the dataset either. The National Archives have some guidance on attribution which includes some other variations. These variants do suggest including more detail, such as the name of the department, date of publication, etc. These look more like typical bibliographic citations.

As another data point we can look at the Ordnance Survey Open Data License. This is a variation of the Open Government License but carries some additional requirements, specifically around attribution. The basic attribution statement is:

Contains Ordnance Survey data © Crown copyright and database right [year]

However the Code Point Open dataset has some additional attribution requirements, which also acknowledge copyright of the Royal Mail and National Statistics. All of these statements acknowledge the originators and there’s no requirement to cite the dataset itself.

Interestingly, while the previous licenses state that re-publication of data should be under a compatible license, only the OS Open Data license explicitly notes that the attribution statements must also be preserved. So both the license and attribution have viral qualities.

Attribution Statements in Terms and Conditions

Now let’s look at some specific Open Data services to see what attribution provisions they include.

Freebase is an interesting example. It draws on multiple datasets which are supplemented by contributions of its user community. Some of that data is under different licenses. As you can see from their attribution page, there are variants in attribution statements depending on whether the data is about one or several resources and whether it includes Wikipedia content, which must be specially acknowledged.

They provide a handy HTML snippet for you to include in your webpage to make sure you get the attribution exactly right. Ironically at the time of writing this service is broken (“User Rate Limit Exceeded”). If you want a slightly different attribution, then you’re asked to contact them.

Now, while Freebase might not meet everyone’s definition of Open Data, it’s an interesting data point, particularly as they ask for deep links back to the dataset, as well as having a clear expectation of where/how the attribution will be surfaced.

OpenCorporates is another illustrative example. Their legal/license info page states that their dataset is licensed under the Open Data Commons Database License and explains that:

Use of any data must be accompanied by a hyperlink reading “from OpenCorporates” and linking to either the OpenCorporates homepage or the page referring to the information in question

There are also clear expectations around the visibility of that attribution:

The attribution must be no smaller than 70% of the size of the largest bit of information used, or 7px, whichever is larger. If you are making the information available via your own API you need to make sure your users comply with all these conditions.

So there is a clear expectation that the attribution should be displayed alongside any data. Like the OS license these attribution requirements are also viral as they must be passed on by aggregators.
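
The size rule quoted above is simple enough to write down. Here is a minimal sketch in Python, assuming all sizes are measured in pixels; the function name is mine, not something OpenCorporates provides.

```python
def min_attribution_size_px(largest_data_size_px: float) -> float:
    """Smallest permitted attribution size under the quoted rule:
    70% of the largest piece of displayed information, or 7px,
    whichever is larger."""
    return max(0.7 * largest_data_size_px, 7.0)


# Hypothetical display sizes: large heading text and small table text.
for size in (20, 8):
    print(size, "->", min_attribution_size_px(size))
```

For small text the 7px floor takes over, so the rule still yields a concrete minimum size. What it cannot tell me is what to do in a voice or SMS application, where there is no font size at all.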

My intention isn’t to criticise either OpenCorporates or Freebase, but merely to highlight some real world examples.

What are some of the issues with data attribution?

Clearly we could undertake a much more thorough review than I have done here. But this is sufficient to highlight what I think are some of the key issues. Put yourself in the position of a developer consuming some Open Data under any or all of these conditions. How do you responsibly provide attribution?

The questions that occur to me, at least are:

Do I need to put attribution on every page of my application, or can I simply add it to a colophon? (Aside: lanyrd has a great colophon page). In some cases it seems like I might have some freedom of choice, in others I don’t
If I do have to put a link or some text on a page, then do I have any flexibility around its size, positioning, visibility, etc? Again, in some cases I may do, but in others I have some clear guidance to follow. This might be challenging if I’m creating a mobile application with limited screen space. Or creating a voice or SMS application.
What if I just re-use the data as part of some back-end analysis, but none of that data is actually surfaced to the user? How do I attribute in this scenario?
Do I need to acknowledge the publisher, or a link to the source page(s)?
What if I need to address multiple requirements, e.g. if I mashed up data from data.gov.uk, the Ordnance Survey, Freebase and OpenCorporates? That might get awkward.

There are no clear answers to these questions. For individual datasets I might be able to get guidance, but it requires me to read the detailed terms and conditions for the dataset or API I’m using. Isn’t the whole purpose of having off-the-shelf licenses like the OGL or ODbL to help us streamline data sharing? Attribution, or rather unclear or overly detailed attribution requirements, is a clear source of friction, especially if there are legal consequences for getting it wrong.

And that’s just when we’re considering integrating data sources by hand. What about if we want to automatically combine data? How is a machine going to understand these conditions? I suspect that every Linked Data browser and application fails to comply with the attribution requirements of the data it’s consuming.

Of course these issues have been explored already. The Science Commons Protocol encourages publishing data into the public domain, so there is no legal requirement for attribution at all. It also acknowledges the “Attribution Stacking” problem (section 5.3) which occurs when trying to attribute large numbers of datasets, each with their own requirements. Too much friction discourages use, whether it’s research or commercial.
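
To make the stacking problem concrete, here is a small sketch of what a mash-up would need to do just to collect its required statements. The Attribution class and stack_attributions function are hypothetical names of my own; the statements are the ones quoted earlier in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribution:
    publisher: str  # who is being credited
    statement: str  # the wording the publisher asks for


def stack_attributions(attributions):
    """Return each required statement once, in first-seen order."""
    seen = set()
    stacked = []
    for item in attributions:
        if item.statement not in seen:
            seen.add(item.statement)
            stacked.append(item.statement)
    return stacked


sources = [
    Attribution("Ordnance Survey",
                "Contains Ordnance Survey data © Crown copyright and database right [year]"),
    Attribution("data.gov.uk",
                "Contains public sector information licensed under the Open Government Licence v1.0."),
    Attribution("OpenCorporates", "from OpenCorporates"),
    Attribution("Another OGL publisher",
                "Contains public sector information licensed under the Open Government Licence v1.0."),
]

credits = stack_attributions(sources)
for line in credits:
    print(line)
```

De-duplicating identical statements only helps when publishers share wording. It does nothing about differing rules on placement, size or linking, which is where the real friction lies.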

Unfortunately the recently published Amsterdam Manifesto on data citation seems to overlook these issues, requiring all authors/contributors to be attributed.

The scientific community may be more comfortable with a public domain licensing approach and a best effort attribution model because it is supported by strong social norms: citation and attribution are essential to scientific discourse. We don’t have anything like that in the broader open data community. Maybe it’s not achievable, but it seems like clear guidance would be very useful.

There’s some useful background on problems with attribution and marking requirements on the Creative Commons wiki that also references some possible amendments and clarifications.

Can we converge on some common conventions?

So would it be possible to converge on a simple set of conventions or norms around data re-use? Ideally to the extent that attribution can be simplified and automated as far as possible.

How about the following:

Publishers should clearly describe their attribution requirements. Ideally this should be a short simple statement (similar to the Open Government License) which includes their name and a link to their homepage. This attribution could be included anywhere on the web site or application that consumes the data.
Publishers should be aware that consumers will be using their data in a variety of applications and on a variety of platforms. This means allowing a good deal of flexibility around how/where attribution is displayed.
Publishers should clearly indicate whether attribution must be passed on to down-stream users
Publishers should separately document their citation requirements. If they want to encourage users to link to the dataset, or an individual page on their site, to allow users to find the original context, then they should publish instructions on how to do it. However this kind of linking is for citation so consumers shouldn’t be bound to include it
Consumers should comply with publishers’ wishes and include an about page on their site or within their application that attributes the originators of the data they use. Where feasible they should also provide citations to specific resources or datasets from within their applications. This provides their users with clear citations to sources of data
Both sides should collaborate on structured markup to support publication of these attribution and citation requirements, as well as harvesting of links
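
As a sketch of what that last point might look like, suppose a publisher exposed its requirements as a small JSON document. The field names below are purely illustrative, not an existing standard, and the example.org addresses are placeholders. A consumer could then generate its about-page credit automatically, and a publisher could harvest links back to its own site from pages that cite it.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical machine-readable requirements a publisher might publish.
# The field names are illustrative only, not an existing standard.
REQUIREMENTS_JSON = """
{
  "publisher": "Example Agency",
  "homepage": "https://data.example.org/",
  "attribution_statement": "Contains data from Example Agency",
  "pass_on_to_downstream_users": true,
  "citation_instructions": "Link to the dataset page where feasible"
}
"""


def attribution_line(requirements: dict) -> str:
    """Render a short credit suitable for an about page."""
    return f'{requirements["attribution_statement"]} ({requirements["homepage"]})'


class LinkHarvester(HTMLParser):
    """Collect links in a page that point at a given publisher host."""

    def __init__(self, publisher_host: str):
        super().__init__()
        self.publisher_host = publisher_host
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and urlparse(href).hostname == self.publisher_host:
            self.links.append(href)


requirements = json.loads(REQUIREMENTS_JSON)
harvester = LinkHarvester(urlparse(requirements["homepage"]).hostname)
harvester.feed(
    '<p>Built with <a href="https://data.example.org/dataset/1">this dataset</a> '
    'and <a href="https://other.example.com/">something else</a>.</p>'
)
print(attribution_line(requirements))
print(harvester.links)
```

The harvesting half only finds re-use that is visible on the web and that actually links back, so at best it gives a publisher partial evidence of usage.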

Whether attribution should be legally enforced is another discussion. Personally I’d be keen to see a common set of conventions regardless of the legal basis for doing it. Attribution should be a social norm that we encourage, strongly, in order to acknowledge the sources of our Open Data.

4 thoughts on “How Do We Attribute Data?”

Torsten Rohlfing says:

May 2, 2013 at 7:09 pm

Please don’t use authorship to attribute public data — http://dx.doi.org/10.1016/j.neuroimage.2011.09.080
Bill Roberts (@billroberts) says:

May 2, 2013 at 7:54 pm

Thanks Leigh, very useful analysis. Good to highlight the very real problem of data publishers wanting to know if/how their stuff is being used, without necessarily restricting that use in any way. I wonder if there is some kind of trackback or pingback style approach that could be used…
Owen Boswarva (@owenboswarva) says:

May 2, 2013 at 9:02 pm

Leigh, this is certainly a worthwhile analysis of how more thoughtful use of attribution statements could make it easier for data publishers to gather anecdotal evidence of re-use on the web.

My concern is that we have to be careful not to assume web-based use cases are representative. That may lead us to misidentify which datasets and categories of open data have the widest potential for re-use.

Open data lends itself more to web-based applications than does closed data, because the licensing terms are so much more flexible. However many firms re-use open data for their internal business purposes, or incorporate it in products or services delivered to customers via channels other than the web. Attribution for that re-use may appear only in offline documentation.

Publishers’ download statistics will also usually understate the scale of re-use. Firms may procure a dataset once centrally and then cascade it to internal users or forward it under sub-licence to end users.

I think tracking re-use comprehensively will always be an intractable problem for open data publishers. It’s rather inherent in the nature of open data, because there’s no exchange of contracts and no obligation on re-users to report back to the publisher.