Monthly Archives: April 2013

How Do We Attribute Data?

This post is another in my ongoing series of “basic questions about open data”, which includes “What is a Dataset?” and “What does a dataset contain?”. In this post I want to focus on dataset attribution and in particular questions such as:

  • Why should we attribute data?
  • How are data publishers asking to be attributed?
  • What are some of the issues with attribution?
  • Can we identify some common conventions around attribution?
  • Can we monitor or track attribution?

I started to think about this because I’ve recently encountered a number of data publishers who have published Open Data but are now struggling to highlight how and where that data has been used or consumed. If data is published for anonymous download, or is accessible through an open API, then a data publisher only has usage logs to draw on.

I had thought that attribution might help here: if we can find links back to sources, then perhaps we can help data publishers mine the web for links and help them build evidence of usage. But it quickly became clear, as we’ll see in a moment, that there really aren’t any conventions around attribution, making it difficult to achieve this.

So let’s explore the topic from first principles and tick off my questions individually.

Why Attribute?

The obvious answer here is simply that if we are building on the work of others, then it’s only fair that those efforts should be acknowledged. This helps the creator of the data (or work, or code) be recognised for their creativity and effort, which is the very least we can do if we’re not exchanging hard cash.

There are also legal reasons why the source of some data might need to be acknowledged. Some licenses require attribution, and copyright may need to be acknowledged. As a consumer I might also want to (or need to) clearly indicate that I am not the originator of some data in case it is found to be false, or misleading, etc.

Acknowledging my sources may also help guarantee that the data I’m using continues to be available: a data publisher might be collecting evidence of successful re-use in order to justify ongoing budget for data collection, curation and publishing. This is especially true when the data publisher is not directly benefiting from the data supply; and I think it’s almost always true for public sector data. If I’m reusing some data I should make it as clear as possible that I’m doing so.

There’s some additional useful background on attribution from a public sector perspective in a document called “Supporting attribution, protecting reputation, and preserving integrity”.

It might also be useful to distinguish between:

  • Attribution — highlighting the creator/publisher of some data to acknowledge their efforts, conferring reputation
  • Citation — providing a link or reference to the data itself, in order to communicate provenance or drive discovery

While these two cases clearly overlap, the intention is often slightly different. As a user of an application, or the reader of an academic paper, I might want a clear citation to the underlying dataset so I can re-use it myself, or do some fact checking. The important use case there is tracking facts and figures back to their sources. Attribution is more about crediting the effort involved in collecting that information.

It may be possible to achieve both goals with a simple link, but I think recognising the different use cases is important.

How are data publishers asking to be attributed?

So how are data publishers asking for attribution? What follows isn’t an exhaustive survey but should hopefully illustrate some of the variety.

Let’s look first at some of the suggested wordings in some common Open Data licenses, then poke around in some terms and conditions to see how these are being applied in practice.

Attribution Statements in Common Open Data Licenses

The Open Data Commons Attribution license includes some recommended text (Section 4.3a – Example Notice):

Contains information from DATABASE NAME which is made available under the ODC Attribution License.

Where DATABASE NAME is the name of the dataset and is linked to the dataset homepage. Notice there is no mention of the originator, just the database. The license notes that in plain text the links should be included as text. The Open Data Commons Database license has the same text (again, section 4.3a).

The UK Open Government License notes that re-users should:

…acknowledge the source of the Information by including any attribution statement specified by the Information Provider(s) and, where possible, provide a link to this licence

Where no attribution is provided, or multiple sources must be attributed, then the suggested default text, which should include a link to the license, is:

Contains public sector information licensed under the Open Government Licence v1.0.

So again, no reference to the publisher but also no reference to the dataset either. The National Archives have some guidance on attribution which includes some other variations. These variants do suggest including more detail, such as the name of the department, date of publication, etc. These look more like typical bibliographic citations.

As another data point we can look at the Ordnance Survey Open Data License. This is a variation of the Open Government License but carries some additional requirements, specifically around attribution. The basic attribution statement is:

Contains Ordnance Survey data © Crown copyright and database right [year]

However the Code Point Open dataset has some additional attribution requirements, which also acknowledge copyright of the Royal Mail and National Statistics. All of these statements acknowledge the originators and there’s no requirement to cite the dataset itself.

Interestingly, while the previous licenses state that re-publication of data should be under a compatible license, only the OS Open Data license explicitly notes that the attribution statements must also be preserved. So both the license and attribution have viral qualities.
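As a rough illustration, the default wordings quoted in this section can be captured as fill-in templates. The sketch below is in Python; the dictionary keys and the function name are hypothetical labels rather than anything defined by the licenses, and it deliberately ignores linking, which the licenses also ask for. Any real use should be checked against the license text itself.

```python
from string import Template

# Wordings quoted from the licenses discussed above. The keys
# ("odc-by", "ogl-v1", "os-opendata") are illustrative labels only.
ATTRIBUTION_TEMPLATES = {
    "odc-by": Template(
        "Contains information from $database_name which is made available "
        "under the ODC Attribution License."
    ),
    "ogl-v1": Template(
        "Contains public sector information licensed under the "
        "Open Government Licence v1.0."
    ),
    "os-opendata": Template(
        "Contains Ordnance Survey data © Crown copyright and database right $year"
    ),
}


def attribution_statement(license_key: str, **fields: str) -> str:
    """Fill in the default wording for a license.

    Raises KeyError if the license is unknown or a placeholder is missing.
    """
    return ATTRIBUTION_TEMPLATES[license_key].substitute(**fields)


if __name__ == "__main__":
    print(attribution_statement("odc-by", database_name="Example Dataset"))
    print(attribution_statement("ogl-v1"))
    print(attribution_statement("os-opendata", year="2013"))
```

Even this small sketch shows the inconsistency: one template needs a dataset name, one needs nothing at all, and one needs only a year.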

Attribution Statements in Terms and Conditions

Now let’s look at some specific Open Data services to see what attribution provisions they include.

Freebase is an interesting example. It draws on multiple datasets which are supplemented by contributions of its user community. Some of that data is under different licenses. As you can see from their attribution page, there are variants in attribution statements depending on whether the data is about one or several resources and whether it includes Wikipedia content, which must be specially acknowledged.

They provide a handy HTML snippet for you to include in your webpage to make sure you get the attribution exactly right. Ironically at the time of writing this service is broken (“User Rate Limit Exceeded”). If you want a slightly different attribution, then you’re asked to contact them.

Now, while Freebase might not meet everyone’s definition of Open Data, it’s an interesting data point, particularly as they ask for deep links back to the dataset, as well as having a clear expectation of where/how the attribution will be surfaced.

OpenCorporates is another illustrative example. Their legal/license info page states that their dataset is licensed under the Open Data Commons Database License and explains that:

Use of any data must be accompanied by a hyperlink reading “from OpenCorporates” and linking to either the OpenCorporates homepage or the page referring to the information in question

There are also clear expectations around the visibility of that attribution:

The attribution must be no smaller than 70% of the size of the largest bit of information used, or 7px, whichever is larger. If you are making the information available via your own API you need to make sure your users comply with all these conditions.

So there is a clear expectation that the attribution should be displayed alongside any data. Like the OS license these attribution requirements are also viral as they must be passed on by aggregators.
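The sizing rule quoted above is simple arithmetic, and a small sketch makes the floor explicit. This assumes “size” means a font size measured in pixels, which the quoted text doesn’t spell out; the function name is hypothetical.

```python
def minimum_attribution_size_px(largest_information_px: float) -> float:
    """Smallest attribution size allowed by the quoted rule.

    The rule: no smaller than 70% of the largest bit of information
    used, or 7px, whichever is larger. Interpreting "size" as a font
    size in pixels is an assumption.
    """
    return max(0.7 * largest_information_px, 7.0)


if __name__ == "__main__":
    for size in (8, 16, 32):
        print(size, minimum_attribution_size_px(size))
```

Under that reading, data shown at 16px would need an attribution of about 11.2px, while anything displayed below 10px hits the 7px floor.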

My intention isn’t to criticise either OpenCorporates or Freebase, but merely to highlight some real world examples.

What are some of the issues with data attribution?

Clearly we could undertake a much more thorough review than I have done here. But this is sufficient to highlight what I think are some of the key issues. Put yourself in the position of a developer consuming some Open Data under any or all of these conditions. How do you responsibly provide attribution?

The questions that occur to me, at least are:

  • Do I need to put attribution on every page of my application, or can I simply add it to a colophon? (Aside: Lanyrd has a great colophon page). In some cases it seems like I might have some freedom of choice, in others I don’t.
  • If I do have to put a link or some text on a page, then do I have any flexibility around its size, positioning, visibility, etc? Again, in some cases I may do, but in others I have some clear guidance to follow. This might be challenging if I’m creating a mobile application with limited screen space. Or creating a voice or SMS application.
  • What if I just re-use the data as part of some back-end analysis, but none of that data is actually surfaced to the user? How do I attribute in this scenario?
  • Do I need to acknowledge the publisher, or provide a link to the source page(s)?
  • What if I need to address multiple requirements, e.g. if I mashed up data from data.gov.uk, the Ordnance Survey, Freebase and OpenCorporates? That might get awkward.

There are no clear answers to these questions. For individual datasets I might be able to get guidance, but it requires me to read the detailed terms and conditions for the dataset or API I’m using. Isn’t the whole purpose of having off-the-shelf licenses like the OGL or ODbL to help us streamline data sharing? Unclear or overly detailed attribution requirements are a clear source of friction, especially if there are legal consequences for getting it wrong.

And that’s just when we’re considering integrating data sources by hand. What about if we want to automatically combine data? How is a machine going to understand these conditions? I suspect that every Linked Data browser and application fails to comply with the attribution requirements of the data it’s consuming.

Of course these issues have been explored already. The Science Commons Protocol encourages publishing data into the public domain, so there is no legal requirement for attribution at all. It also acknowledges the “Attribution Stacking” problem (section 5.3) which occurs when trying to attribute large numbers of datasets, each with their own requirements. Too much friction discourages use, whether it’s research or commercial.

Unfortunately the recently published Amsterdam Manifesto on data citation seems to overlook these issues, requiring all authors/contributors to be attributed.

The scientific community may be more comfortable with a public domain licensing approach and a best effort attribution model because it is supported by strong social norms: citation and attribution are essential to scientific discourse. We don’t have anything like that in the broader open data community. Maybe it’s not achievable, but it seems like clear guidance would be very useful.

There’s some useful background on problems with attribution and marking requirements on the Creative Commons wiki that also references some possible amendments and clarifications.

Can we converge on some common conventions?

So would it be possible to converge on a simple set of conventions or norms around data re-use? Ideally to the extent that attribution can be simplified and automated as far as possible.

How about the following:

  • Publishers should clearly describe their attribution requirements. Ideally this should be a short simple statement (similar to the Open Government License) which includes their name and a link to their homepage. This attribution could be included anywhere on the web site or application that consumes the data.
  • Publishers should be aware that the consumers of their data will be doing so in a variety of applications and on a variety of platforms. This means allowing a deal of flexibility around how/where attribution is displayed.
  • Publishers should clearly indicate whether attribution must be passed on to down-stream users
  • Publishers should separately document their citation requirements. If they want to encourage users to link to the dataset, or an individual page on their site, to allow users to find the original context, then they should publish instructions on how to do it. However this kind of linking is for citation so consumers shouldn’t be bound to include it
  • Consumers should comply with publishers wishes and include an about page on their site or within their application that attributes the originators of the data they use. Where feasible they should also provide citations to specific resources or datasets from within their applications. This provides their users with clear citations to sources of data
  • Both sides should collaborate on structured markup to support publication of these attribution and citation requirements, as well as harvesting of links
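To make the last point a little more concrete, here is one possible shape for a machine-readable statement of a publisher’s requirements. This is purely a sketch: no such convention exists as far as this post is concerned, every field name is hypothetical, and the publisher details and example.org URLs are placeholders.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class AttributionRequirements:
    """Hypothetical record of what a publisher asks of re-users."""

    publisher_name: str
    publisher_homepage: str
    attribution_text: str               # short statement to display
    pass_on_to_downstream_users: bool   # must aggregators preserve it?
    citation_url: Optional[str] = None  # optional link for citation


def to_json(requirements: AttributionRequirements) -> str:
    """Serialise the record so that a consuming application could read it."""
    return json.dumps(asdict(requirements), indent=2, sort_keys=True)


if __name__ == "__main__":
    example = AttributionRequirements(
        publisher_name="Example Publisher",
        publisher_homepage="http://example.org/",
        attribution_text="Contains data from Example Publisher",
        pass_on_to_downstream_users=True,
        citation_url="http://example.org/datasets/example",
    )
    print(to_json(example))
```

Keeping the citation link optional and separate from the attribution text mirrors the distinction between attribution and citation drawn earlier.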

Whether attribution should be legally enforced is another discussion. Personally I’d be keen to see a common set of conventions regardless of the legal basis for doing it. Attribution should be a social norm that we encourage, strongly, in order to acknowledge the sources of our Open Data.

Thoughts on Coursera and Online Courses

I recently completed my first online course (or “MOOC”) on Coursera. It was an interesting experience and I wanted to share some thoughts here.

I decided to take an online course for several reasons. Firstly the topic, Astrobiology, was fun and I thought the short course might make an interesting alternative to watching BBC documentaries and US TV box sets. I certainly wasn’t disappointed as the course content was accessible and well-presented. As a biology graduate I found much of the content was fairly entry-level, but it was nevertheless a good refresher in a number of areas. The mix of biology, astronomy, chemistry and geology was really interesting. The course was very well attended, with around 40,000 registrants and 16,000 active students.

The second reason I wanted to try a course was because MOOCs are so popular at the moment. I was curious how well an online course would work, in terms of both content delivery and the social aspects of learning. Many courses are longer and are more rigorously marked and assessed, but the short Astrobiology course looked like it would still offer some useful insights into online learning.

Clearly some of my experiences will be specific to the particular course and Coursera, but I think some of the comments below will generalise to other platforms.

Firstly, the positives:

  • The course material was clear and well presented
  • The course tutors appeared to be engaged and actively participated in discussions
  • The ability to download the video lectures, allowing me to (re)view content whilst travelling was really appreciated. Flexibility around consuming course content seems like an essential feature to me. While the online experience will undoubtedly be richer, I’m guessing that many people are doing these courses in spare time around other activities. With this in mind, video content needs to be available in easily downloadable chunks.
  • The Coursera site itself was on the whole well constructed. It was easy to navigate to the content, tests and the discussions. The service offered timely notifications that new content and assessments had been published
  • Although I didn’t use it myself, the site offered good integration with services like Meetup, allowing students to start their own local groups. This seemed like a good feature, particularly for longer running courses.

However there were a number of areas in which I thought things could be greatly improved:

  • The online discussion forums very quickly became unmanageable. With so many people contributing, across many different threads, it’s hard to separate the signal from the noise. The community had some interesting extremes: people associated with the early NASA programme, through to alien contact and conspiracy theory nut-cases. While those particular extremes are peculiar to this course, I expect other courses may experience similar challenges
  • Related to the above point, the ability to post anonymously in forums led to trolling on a number of occasions. I’m sensitive to privacy, but perhaps pseudonyms may be better than anonymity?
  • The discussions are divorced from the content, e.g. I can’t comment directly on a video; I have to create a new thread for it in a discussion group. I wanted to see something more sophisticated, maybe SoundCloud style annotations on the videos or per-video discussion threads.
  • No integration with wider social networks: there were discussions also happening on Twitter, G+ and Facebook. Maybe it’s better to just integrate those, rather than offer a separate discussion forum?
  • Students consumed content at different rates which meant that some discussions contained “spoilers” for material I hadn’t yet watched. This is largely a side-effect of the discussions happening independently from the content.
  • Coursera offered a course wiki but this seemed useless
  • It wasn’t clear to me during the course what would happen to the discussions after the course ended. Would they be wiped out, preserved, or would later students build on what was there already? Now that it’s finished it looks like each course is instanced and discussions are preserved as an archive. I’m not sure what the right option is there. Starting with a clean slate seems like a good default, but can particularly useful discussions be highlighted in later courses? It seems like the course discussions would be an interesting thing to mine for links and topics, especially for lecturers

There are some interesting challenges with designing this kind of product. Unlike almost every other social application the communities for these courses don’t ramp up over time: they arrive en masse at a particular date and then more or less evaporate overnight.

As a member of that community this makes it very hard to identify which people in the community are worth listening to and who to ignore: all of a sudden I’m surrounded by 16,000 people all talking at once. When things ramp up more slowly, I can build out my social network more easily. Coursera doesn’t have any notion of study groups.

I expect the lecturers must have similar challenges as very quickly they’re faced with a lot of material that they might have to read, review and respond to. This must present challenges when engaging with each new intake.

While a traditional discussion forum might provide the basic infrastructure for enabling the necessary basic communication, MOOC platforms need to have more nuanced social features — for both students and lecturers — to support the community. Features that are sensitive to the sudden growth of the community. I found myself wanting to find out things like:

  • Who is posting on which topics and how frequently?
  • Which commentators are getting up-voted (or down-voted) the most?
  • Which community members are at the same stage in the course as me?
  • Which community members have something to offer on a particular topic, e.g. because of their specific background?
  • What links are people sharing in discussions? Perhaps filtered by users.
  • What courses are my fellow students undertaking next? Are there shared journeys?
  • Is there anyone watching this material at the same time?

Answering all of these requires more than just mining discussions but it feels like some useful metrics could be derived nevertheless. For example, one common use of the forums was to share additional material, e.g. recent news reports, scientific papers, YouTube videos, etc. That kind of content could either be collected in other ways, e.g. via a shared reading list, or as a list that is automatically surfaced from discussions. I ended up sifting through the forums and creating a reading list on Readlists, as well as a YouTube playlist just to see whether others would find them useful (they did).
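As a rough idea of how a list of shared material might be surfaced automatically, the sketch below pulls links out of forum posts and ranks them by how many different people shared them. It assumes posts are available as simple (author, text) pairs, which is a guess at the data rather than anything Coursera is known to expose, and the URL matching is deliberately crude.

```python
import re
from collections import Counter

# Crude URL matcher: anything starting with http:// or https:// up to
# the next whitespace. Trailing punctuation is trimmed separately.
URL_PATTERN = re.compile(r"https?://\S+")
TRAILING_PUNCTUATION = ".,;:!?)\"'"


def shared_links(posts):
    """Count, for each link, how many distinct authors shared it.

    `posts` is an iterable of (author, text) pairs; that shape is an
    assumption made for this sketch.
    """
    seen = set()
    counts = Counter()
    for author, text in posts:
        for url in URL_PATTERN.findall(text):
            url = url.rstrip(TRAILING_PUNCTUATION)
            if (author, url) not in seen:
                seen.add((author, url))
                counts[url] += 1
    return counts


if __name__ == "__main__":
    example_posts = [
        ("ann", "Worth reading: http://example.org/paper."),
        ("bob", "Agreed, http://example.org/paper and http://example.org/video"),
        ("ann", "Posting http://example.org/paper again"),
    ]
    for url, sharers in shared_links(example_posts).most_common():
        print(sharers, url)
```

Counting distinct sharers rather than raw mentions is one way to stop a single enthusiastic poster from dominating the list.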

All of these challenges we can see playing out in wider social media, but with a MOOC they’re often compressed into relatively short time spans.

(Perhaps inevitably) I also kept thinking that much of the process of creating, delivering and consuming the content could be improved with better linking and annotation tools. Indeed, do we even need specialised MOOC platforms at all? Why not just place all of the content on services like YouTube, Readlists, etc.? Isn’t the web our learning infrastructure?

Well I think there is a role for these platforms. The role in certification — these people have taken this course — is clearly going to become more important, for example.

However I think their real value is in marking out a space within which the learning experience takes place: these people are taking this content during this period. The community needs a focal point, even if it’s short-lived.

If everything was just on the web, with no real definition to the course, then that completely dissolves the community experience. By concentrating the community into time-boxed, instanced courses, it creates focus that can enrich the experience. The challenge is balancing unwieldy MOOC “flashmobs” against a more diffused internet community.
