Can you publish data from twitter as open data? The short answer is: No. Read on for some notes, pointers and comments.
Twitter’s developer policy places a number of restrictions on your use of their API and the data you get from it. Some of the key ones are:
- In the Restrictions on Use of Licensed Materials (II.C) they make it clear that you can’t use geographic data from the platform for any purpose other than identifying the location from which a tweet was made. You also can’t aggregate or cache it, unless you’re storing it with the rest of the tweet. And elsewhere they place further restrictions on storage of tweets. They reiterate this in section B.9.
- Section F.2 “Be a Good Partner to Twitter” (sic) is the key one for data, as here you’re agreeing to not store anything except the ID for a tweet. You can’t store the message, its metadata or anything about the user, just the ID.
- You are allowed to make those IDs downloadable in various ways but there are restrictions on how many tweets you can publish per user, per day.
- In the Ownership and Feedback section, they make it clear that the only rights you have to use content are derived from this agreement and those rights can be taken away at any time.
That’s a very closed set of terms.
There’s some great analysis of the terms and what they mean for researchers elsewhere. Ernesto Priego has an interesting pair of posts looking at twitter as public evidence and the ethics of twitter research and why you might want to archive and share small twitter datasets.
Ed Summers has also written about archiving twitter datasets and the process of “hydrating” a twitter ID to turn it back into useful content. There’s a whole set of APIs, tools and practices that have built up around the process of hydration as a means to work around Twitter’s terms. I think it’s interesting as an example of using a combination of data and open source to address licensing limitations.
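To make the idea concrete, here is a minimal sketch of the opposite step, sometimes called “dehydrating”: reducing a collected dataset to a bare list of IDs before sharing it. It assumes each tweet is a Python dict with an `id` field and optionally an `id_str` field, as in the JSON returned by Twitter’s API; the function names `dehydrate` and `write_ids` are my own and not part of any existing tool. Hydration itself isn’t shown because it needs an API key and a call to Twitter.

```python
from typing import Iterable, Iterator


def dehydrate(tweets: Iterable[dict]) -> Iterator[str]:
    """Yield only the ID of each tweet, dropping text, metadata and user details."""
    for tweet in tweets:
        # Assumption: tweets carry "id" and, optionally, the string form "id_str".
        yield str(tweet.get("id_str") or tweet["id"])


def write_ids(tweets: Iterable[dict], path: str) -> int:
    """Write one tweet ID per line to `path` and return how many were written."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for tweet_id in dehydrate(tweets):
            handle.write(tweet_id + "\n")
            count += 1
    return count
```

Anyone receiving the resulting file has only a list of numbers. To get the content back they have to request each ID from Twitter themselves, which is what keeps them inside Twitter’s terms.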
Yesterday, Justin Littman published a short piece highlighting how Twitter have just further restricted their terms. The key changes are around placing upper limits on how many tweet IDs you can distribute. The changes raise concerns about how archival projects like DocNow can continue. Although in my reading of the terms, those projects were already under question as Twitter doesn’t grant you the rights to re-publish data under anything other than its own terms. I think those datasets were already in breach of the agreement.
So, we get to our answer: no, you can’t publish anything from twitter under an open licence. If you’re intending to do this in a project then I recommend you get approval from Twitter directly.
Obviously these terms are designed for Twitter’s sole benefit. They help Twitter retain as much value as possible while still operating as a platform. Data asymmetry in action.
I think what’s particularly frustrating is that they seem to rarely enforce these terms, even for services that clearly breach them. After crafting a legal agreement they choose not to actively police it, because it’s not worth their time to do so. Presumably they will step in if there are large scale, significant breaches. But it makes you wonder how much value is really being protected.
In the meantime we are left with areas of doubt and uncertainty. Does the continued existence of a service mean it’s an exemplar of acceptable practice? Or are Twitter just choosing to ignore it? And this starts to poison the well of open data. A more open approach would be for them to offer some allowance for small scale archiving and data sharing. Openly licensing twitter IDs would be a start.
For better or worse Twitter’s data has a role in helping us understand modern society, so we should be able to use it. Unfortunately their donation of the twitter archive to the Library of Congress is floundering because of a mixture of technical and legal issues. Twitter is not really a public space. It’s a private hall where we choose to meet.
A couple of final extra points based on comments on this post (see below) and on twitter. Ed Summers rightly pointed out that services that are seemingly breaching Twitter’s terms may in fact have permission to do so. In fact a couple of examples came up.
Andy Piper (Twitter Dev lead) notes that Twitter have posted a policy update clarification:
The clarification explains that developers can request permission to share more than 1.5m tweet IDs in a 30 day period. It also notes that researchers from “an accredited academic institution” can share an unlimited number of tweets. This relaxes some of the restrictions on distribution, but also reinforces some of the key points I make above: any use of the data remains subject to Twitter’s policies. By default data from Twitter can’t be published as open data. But if you’re willing to pay then it looks like Twitter are willing to share more widely.
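If you are distributing IDs and want to stay under that default threshold, a rough way to keep track is a rolling 30 day count. The sketch below is only my reading of the figures quoted above: it assumes the limit applies to the total number of IDs you hand out, that the window is a rolling one, and that exactly 1.5m is still allowed. None of that is spelled out in the text I’ve quoted, and the names (`ID_CAP`, `can_share` and so on) are made up for illustration.

```python
from datetime import date, timedelta
from typing import Mapping

ID_CAP = 1_500_000   # figure quoted in the clarification (assumed to be inclusive)
WINDOW_DAYS = 30     # assumed to be a rolling window ending today


def ids_shared_in_window(log: Mapping[date, int], end: date,
                         window_days: int = WINDOW_DAYS) -> int:
    """Total IDs shared in the `window_days` days up to and including `end`."""
    start = end - timedelta(days=window_days - 1)
    return sum(count for day, count in log.items() if start <= day <= end)


def can_share(log: Mapping[date, int], today: date, new_ids: int,
              cap: int = ID_CAP) -> bool:
    """True if sharing `new_ids` more IDs today keeps the rolling total within `cap`."""
    return ids_shared_in_window(log, today) + new_ids <= cap
```

Here `log` is just a record of how many IDs you released on each day. If `can_share` returns `False`, that is the point at which you would need to ask Twitter for permission.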
Joe Wass from CrossRef explained that they’ve had explicit permission from Twitter to distribute Tweet IDs under a CC0 waiver within their Event Data service.
CrossRef negotiated this permission as part of their commercial arrangement with Twitter. This means that at least some Tweet IDs can be considered to be in the public domain. It just depends on where you got them from: the Twitter API or CrossRef.
3 thoughts on “Can you publish tweets as open data?”
Thanks for this post Leigh. It nicely summarizes where many research organizations are with Twitter and highlights some of the efforts around it. Up until now I’ve missed the virality clause that would for example make a user of a dataset of tweet ids automatically agree to all of Twitter’s Terms of Service. Can you point to where you saw that? To practically use a tweet id dataset the first thing you have to do is hydrate it, which requires that you get a Twitter API key, which requires you to agree to all the terms. So I guess it doesn’t matter. But I, and others have been sharing the tweet id datasets (just lists of numbers) with CC licenses (CC-BY, CC0).
We haven’t really talked about it much publicly yet, but the Documenting the Now project is in the process of shifting its focus to building tools for what archivists call appraisal. Rather than enabling people to slurp up data created by others and archive it we want to build tools that help cultural heritage organizations understand what is going on in Twitter so they can reach out to content creators doing significant work and ask if they would like their content to become part of the historical record in an archive. Even Twitter assert that these users still own copyright to their content and thus are able to give it to others. Obviously there are some serious challenges to being able to do this effectively though.
One thing you perhaps overlook is that services that are clearly breaching the ToS may have in fact gotten a letter from Twitter allowing them to operate. Also, while I understand why many (myself included) like their content to be “open data” I can see lots of benefits to having social media platforms close it off. In fact, I wish that there were more options for controlling how data in social media platforms was shared with third parties.
The relevant clause is F.2.b:
Also relevant to your use case is the next sub-clause F.2.b.b:
“You may not distribute Tweet IDs for the purposes of (a) enabling any entity to store and analyze Tweets for a period exceeding 30 days without the express written permission of Twitter”
You’re right about other services perhaps having permission from twitter, I should have noted that. But even then I’d argue that some transparency would be useful, to help clarify these types of licensing arrangements. As a user of twitter I might also want to know if a third-party has been given additional or different rights over content I’ve submitted to twitter.
I agree about the need for better permissions. I’m not arguing that all of twitter needs to be open data, more highlighting that it can’t be, for people who might otherwise breach the terms.
This is the best, most clear-eyed public summary of the issues I’ve seen. Thank you for writing it.
Whether or not Twitter chooses to enforce all of the terms is moot in many academic and library settings, where scarce resources, conservative/conflict-averse legal counsel, and reliance on grant funding mean that librarians, archivists, and the researchers they work with are not able to choose to ignore them.