Can you publish data from Twitter as open data? The short answer is: No. Read on for some notes, pointers and comments.
Twitter’s developer policy places a number of restrictions on your use of their API and the data you get from it. Some of the key ones are:
- In the Restrictions on Use of Licensed Materials (II.C) they make it clear that you can’t use any geographic data from the platform, other than to identify the location from which a tweet was made. You also can’t aggregate or cache it, unless you’re storing it with the rest of the tweet. And elsewhere they place further restrictions on storage of tweets. They reiterate this in section B.9.
- Section F.2 “Be a Good Partner to Twitter” (sic) is the key one for data, as here you’re agreeing to not store anything except the ID for a tweet. You can’t store the message, its metadata or anything about the user, just the ID.
- You are allowed to make those IDs downloadable in various ways, but there are restrictions on how many tweets you can publish per user, per day.
- In the Ownership and Feedback section, they make it clear that the only rights you have to use content are derived from this agreement and those rights can be taken away at any time.
That’s a very closed set of terms.
There’s some great analysis of the terms and what they mean for researchers elsewhere. Ernesto Priego has an interesting pair of posts looking at Twitter as public evidence and the ethics of Twitter research and why you might want to archive and share small Twitter datasets.
Ed Summers has also written about archiving Twitter datasets and the process of “hydrating” a tweet ID to turn it back into useful content. There’s a whole set of APIs, tools and practices that have built up around the process of hydration as a means to work around Twitter’s terms. I think it’s interesting as an example of using a combination of data and open source to address licensing limitations.
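To make the idea of hydration a bit more concrete, here is a minimal Python sketch of the two halves of the workflow: reducing tweets to bare IDs for sharing, and turning a shared list of IDs back into tweets. It is an illustration under stated assumptions, not a description of any particular tool. The `lookup` function is hypothetical and stands in for an authenticated call to Twitter’s API, and the batch size of 100 is an assumption you should check against the current API documentation.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical type: a function that takes a batch of tweet IDs and returns
# the tweets that still resolve, keyed by ID. In practice this would wrap an
# authenticated API call; it is injected here so that the sketch makes no
# claims about any real client library.
LookupFn = Callable[[List[str]], Dict[str, dict]]


def dehydrate(tweets: Iterable[dict]) -> List[str]:
    """Reduce full tweets to the bare IDs, the only part you may share."""
    return [str(tweet["id"]) for tweet in tweets]


def _resolve(batch: List[str], lookup: LookupFn) -> Iterator[dict]:
    """Look up one batch and yield the tweets that came back, in ID order."""
    found = lookup(batch)
    for tweet_id in batch:
        if tweet_id in found:
            yield found[tweet_id]


def hydrate(
    ids: Iterable[str],
    lookup: LookupFn,
    batch_size: int = 100,  # assumed per-request limit
) -> Iterator[dict]:
    """Turn shared IDs back into tweets, skipping any that do not resolve."""
    batch: List[str] = []
    for tweet_id in ids:
        batch.append(tweet_id)
        if len(batch) == batch_size:
            yield from _resolve(batch, lookup)
            batch = []
    if batch:  # final, partially filled batch
        yield from _resolve(batch, lookup)
```

One consequence worth noting: anyone hydrating a list of IDs gets back only the tweets the lookup can still resolve, so a rehydrated dataset may be smaller than the one originally collected.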
Yesterday, Justin Littman published a short piece highlighting how Twitter have just further restricted their terms. The key changes are around placing upper limits on how many tweet IDs you can distribute. The changes raise concerns about how archival projects like DocNow can continue. Although in my reading of the terms, those projects were already under question as Twitter doesn’t grant you the rights to re-publish data under anything other than its own terms. I think those datasets were already in breach of the agreement.
So, we get to our answer: no, you can’t publish anything from Twitter under an open licence. If you’re intending to do this in a project then I recommend you get approval from Twitter directly.
Obviously these terms are designed for Twitter’s sole benefit. They help Twitter retain as much value as possible while still operating as a platform. Data asymmetry in action.
I think what’s particularly frustrating is that they seem to rarely enforce these terms, even for services that clearly breach them. After crafting a legal agreement they choose not to actively police it, because it’s not worth their time to do so. Presumably they will step in if there are large scale, significant breaches. But it makes you wonder how much value is really being protected.
In the meantime we are left with areas of doubt and uncertainty. Does the continued existence of a service mean it’s an exemplar of acceptable practice? Or are Twitter just choosing to ignore it? And this uncertainty starts to poison the well of open data. A more open approach would be for them to offer some allowance for small scale archiving and data sharing. Openly licensing tweet IDs would be a start.
For better or worse Twitter’s data has a role in helping us understand modern society, so we should be able to use it. Unfortunately their donation of the Twitter archive to the Library of Congress is floundering because of a mixture of technical and legal issues. Twitter is not really a public space. It’s a private hall where we choose to meet.
A couple of final extra points based on comments on this post (see below) and on Twitter. Ed Summers rightly pointed out that services that are seemingly breaching Twitter’s terms may in fact have permission to do so. In fact a couple of examples came up.
Andy Piper (Twitter Dev lead) notes that Twitter have posted a policy update clarification.
The clarification explains that developers can request permission to share more than 1.5m tweet IDs in a 30 day period. It also notes that researchers from “an accredited academic institution” can share an unlimited number of tweets. This lifts some of the restrictions on distribution, but also reinforces some of the key points I make above: any use of the data remains subject to Twitter’s policies. By default data from Twitter can’t be published as open data. But if you’re willing to pay then it looks like Twitter are willing to share more widely.
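If you are distributing tweet IDs without special permission, it can help to keep a running tally against that cap. The sketch below is one hedged way of doing so in Python. The figures come from the clarification as described above, and treating the 30 days as a rolling window is my assumption, not something the policy text quoted here confirms. The log format and function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# Figures as described in the clarification; check the current policy
# before relying on them. The rolling window is an assumption.
WINDOW = timedelta(days=30)
MAX_IDS_PER_WINDOW = 1_500_000

# Hypothetical log format: one (timestamp, number_of_ids) entry per release.
ShareLog = List[Tuple[datetime, int]]


def ids_shared_in_window(log: ShareLog, now: datetime) -> int:
    """Count the IDs distributed during the 30 days leading up to `now`."""
    cutoff = now - WINDOW
    return sum(count for when, count in log if when > cutoff)


def can_share(log: ShareLog, now: datetime, requested: int) -> bool:
    """True if releasing `requested` more IDs stays within the assumed cap."""
    return ids_shared_in_window(log, now) + requested <= MAX_IDS_PER_WINDOW
```

A check like this only tells you whether you are inside the default allowance. It says nothing about the other terms, which still apply to every ID you release.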
CrossRef negotiated this permission as part of their commercial arrangement with Twitter. This means that at least some Tweet IDs can be considered to be in the public domain. It just depends on where you got them from: the Twitter API or CrossRef.