In this blog post I compare the Streaming, Search, and Firehose APIs over a three-day period (3rd to the 5th of January, 2015) across three different tools. A comprehensive outline of the different APIs and how they return tweets can be found here.
Most research on Twitter uses either the Search API or the Streaming API. Twitter’s Search API provides access to tweets that have already been posted, i.e., users can request tweets that match a search query, similar to how an individual user would conduct a search directly on Twitter. When you query Twitter via the Search API, the maximum number of tweets going back in time that Twitter will return is 3,200 (with a limit of 180 searches every 15 minutes).
Twitter states that the Search API is:
…focused on relevance and not completeness. This means that some Tweets and users may be missing from search results. If you want to match for completeness you should consider using a Streaming API instead (Twitter developers).
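To make the rate limit quoted above more concrete, here is a minimal sketch of the arithmetic it implies. It uses only the figures given in this post (180 searches every 15 minutes); the function names are hypothetical and nothing here calls Twitter.

```python
import math

# Figures taken from the post: 180 searches per 15-minute window.
WINDOW_SECONDS = 15 * 60
SEARCHES_PER_WINDOW = 180


def min_seconds_between_searches(
    searches_per_window: int = SEARCHES_PER_WINDOW,
    window_seconds: int = WINDOW_SECONDS,
) -> float:
    """Average spacing that keeps a client within the limit."""
    return window_seconds / searches_per_window


def windows_needed(total_searches: int) -> int:
    """Number of 15-minute windows required to issue total_searches."""
    return math.ceil(total_searches / SEARCHES_PER_WINDOW)


print(min_seconds_between_searches())  # one search every 5.0 seconds on average
print(windows_needed(181))             # 181 searches spill into a second window
```

In other words, a collection tool that spaces its searches about five seconds apart should stay within the limit, assuming the limit still stands as described here.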
The Streaming API is a push of data as tweets occur in near real-time. However, Twitter only returns a small percentage of tweets. The tweets that are returned depend on various factors such as the demand on Twitter, and how broad/specific the search query is. Twitter states that the Streaming APIs:
…give developers low latency access to Twitter’s global stream of Tweet data. A proper implementation of a streaming client will be pushed messages indicating Tweets and other events have occurred, without any of the overhead associated with polling a REST endpoint. If your intention is to conduct singular searches, read user profile information, or post Tweets, consider using the REST APIs instead (Twitter developers).
The Firehose (which can be quite costly) provides all the tweets in near real-time; unlike the Streaming API, there is no limitation on the number of results that are provided. I won a historical data prize from DiscoverText, which provided me with access to three days’ worth of Firehose data. I selected this data to overlap with data I had gathered via the Streaming API (using Chorus), and the Search API (using Mozdeh).
This is what I found:
Table 1 – The number of tweets retrieved via each API across three different tools
Tool | API | No. tweets |
DiscoverText/Texifter | Firehose API | 195,713 |
Mozdeh | Search API | 155,086 |
Chorus | Search API | 145,348 |
Table 1 shows that searches with the keyword ‘Ebola’ gathered 79% (155,086) of all tweets via the Search API using Mozdeh, and 74% (145,348) of all tweets via the Search API using Chorus. The baseline for comparison is the complete set of 195,713 tweets (100%) obtained via DiscoverText.
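The percentages above can be reproduced from the counts in Table 1. The following is a small sketch of that arithmetic; the variable and function names are my own and are only illustrative.

```python
# Tweet counts reported in Table 1.
FIREHOSE_TOTAL = 195_713  # DiscoverText (baseline)
SAMPLES = {"Mozdeh": 155_086, "Chorus": 145_348}


def coverage(sample_count: int, baseline_count: int) -> float:
    """Return the sample size as a percentage of the baseline."""
    if baseline_count <= 0:
        raise ValueError("baseline_count must be positive")
    return 100.0 * sample_count / baseline_count


coverage_by_tool = {
    tool: coverage(count, FIREHOSE_TOTAL) for tool, count in SAMPLES.items()
}

for tool, pct in coverage_by_tool.items():
    print(f"{tool}: {pct:.1f}% of the Firehose tweets")
```

Rounded to whole numbers this gives the 79% and 74% quoted above.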
I produced three word clouds to examine the most frequent words across the three samples in order to investigate whether there were any major differences in word frequencies.
Word Cloud 1: 195,713 Ebola tweets via the Firehose API using DiscoverText from the 3rd of January to the 5th of January:
Word Cloud 2: 155,086 Ebola tweets via Mozdeh using the Search API from the 3rd of January to the 5th of January:
Word Cloud 3: 145,348 Ebola tweets via Chorus using the Search API from the 3rd of January to the 5th of January:
The word clouds provide a visual representation of the samples in terms of word frequency, i.e., the more frequent a word is, the bigger it will appear in the word cloud. These word clouds contain words such as ‘nurse’, ‘critical’, and ‘condition’, as within this time period a nurse (Pauline Cafferkey) suffering from Ebola in the U.K. had fallen into critical condition. The word clouds are very similar across the different tools and APIs. This may be because, as Twitter’s senior partner engineer Taylor Singletary suggested in a forum post, the sample stream via the Streaming API would be a random sample of all tweets that were available on the platform (Gerlitz and Rieder, 2013).
These results suggest that if you use a limited number of search queries and gather data over a relatively short period of time, Twitter will provide a fair number of tweets, and depending on the research question of a project this may be sufficient. However, González-Bailón et al. (2014) have found that the structure of samples may be affected by both the type of API and the number of hashtags that are used to retrieve the data. Therefore, depending on the number of keywords and hashtags used, the number of tweets retrieved is likely to vary. All of this may change as Twitter introduces potential adjustments to the Streaming API. These results form a part of a larger project which has ethics approval.
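For readers who would like a number to go alongside the visual comparison, the sketch below counts word frequencies (the quantity a word cloud scales its words by) and measures how much the most frequent words of two samples overlap. This is not how the word clouds in this post were produced; the tokenisation rule, the Jaccard overlap measure, and the toy tweets are all assumptions made for illustration.

```python
import re
from collections import Counter


def word_frequencies(tweets):
    """Count lower-cased words across a list of tweet texts."""
    counts = Counter()
    for text in tweets:
        # Simplistic tokeniser: runs of letters and apostrophes only.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


def top_word_overlap(counts_a, counts_b, n=50):
    """Jaccard overlap between the n most frequent words of two samples."""
    top_a = {word for word, _ in counts_a.most_common(n)}
    top_b = {word for word, _ in counts_b.most_common(n)}
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 1.0


# Toy data standing in for two collections of tweets (not real tweets).
firehose_sample = [
    "Ebola nurse in critical condition",
    "Nurse with Ebola remains critical",
]
search_sample = ["Ebola nurse in critical condition"]

firehose_counts = word_frequencies(firehose_sample)
search_counts = word_frequencies(search_sample)
overlap = top_word_overlap(firehose_counts, search_counts)

print(firehose_counts.most_common(3))
print(f"Top-word overlap: {overlap:.2f}")
```

An overlap close to 1 would indicate that the two samples share nearly the same set of most frequent words, which is what the similar-looking word clouds suggest informally.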
Edit 08/07/15
As Dr. Timothy Cribbin has pointed out in the comments, Chorus uses the Search API and not the Streaming API as previously mentioned in this blog post. Although the comparison is not across three APIs, I hope it is still interesting.
Acknowledgements
I am very grateful to Dr Farida Vis for her expert guidance & advice, for providing me with the literature, and for the various discussions on Twitter APIs.
References and further reading
Gaffney, D., & Puschmann, C. (2014). Data Collection on Twitter. In Jones, S. (Ed.), Twitter and Society (pp. 55–67). New York, NY: Peter Lang.
Gerlitz, C., & Rieder, B. (2013). Mining one percent of Twitter: Collections, baselines, sampling. M/C Journal, 16(2).
González-Bailón, S., Wang, N., Rivero, A., Borge-Holthoefer, J., & Moreno, Y. (2014). Assessing the bias in samples of large online networks. Social Networks, 38, 16–27. doi:10.1016/j.socnet.2014.01.004
Morstatter, F., Pfeffer, J., Liu, H., & Carley, K. (2013). Is the Sample Good Enough? Comparing Data from Twitter’s Streaming API with Twitter’s Firehose. Proceedings of ICWSM.
Thanks for running this interesting comparison. However, I would like to point out that Chorus Tweetcatcher currently uses the Search API, not the Streaming API, for query searches. This is restricted to the last seven days’ worth of matching tweets, which may be more than 3,200 in number.
Thank you very much for this. I edited the blog post as soon as I saw this. I hope the comparison is still interesting between the Search APIs and the Firehose API. However, I am seeking a more quantitative method of comparing the datasets (as opposed to the word clouds) so may build on this again.
I really appreciate that you took the time to let me know about this (if there is anything I can do for the Chorus team, such as a guest blog post, please do let me know). I had the pleasure of meeting one of the developers behind Chorus recently at a workshop, and I think the work you are doing is fantastic, i.e., opening up Twitter research to social scientists (like me).
Hi, Wasim. I just ran across this entry and am curious to know how your work is developing.
Perhaps you are now aware of this, but the historical search that Sifter provides is not actually equivalent to gathering data from the Firehose API. The Firehose provides (nearly) 100% of all tweets in real time, while the Twitter archive contains all *non-deleted* historical tweets. The “non-deleted” element is important, because it means that the archive is missing a number of tweets that would be captured when connected to the Firehose API. And the more time passes from the period in question, the more tweets will be missing from the archive.
My own work is investigating the rate by which this data “degrades,” the reasons for deletion, and how these trends impact statistical (and other analytical) inferences.
Hi Rebekah, I’ve not developed any further work on this as yet. It is something I will definitely look to follow up on.
Thank you very much for letting me know about this. It is very useful to know, both for this blog post and for my PhD.
I hadn’t considered or factored in the deleted aspect of tweets via the archive. I wonder then, how one would ever be in a position to access tweets via the full Firehose API?
Let me know how your work progresses, it all sounds very interesting. The degradation aspect of Twitter data via the archive is under-studied. You are the first person to bring this to my attention.