This blog post is a small case study of how I used Mozdeh to capture and analyse over 5 million tweets related to the norovirus infection. Mozdeh requires no programming knowledge and can be used by researchers in the social sciences to capture and analyse Twitter data. Mozdeh currently supports only Windows operating systems (32 and 64 bit), and it is advisable to run it on a desktop computer. Mozdeh can be downloaded from here, an excellent user guide can be found here, a theoretical overview can be found here, and an overview of Twitter query set generation can be found here.
Mozdeh uses the Search API, which is rate limited to 180 queries per 15 minutes, i.e., Mozdeh can search Twitter up to 180 times in every 15-minute window, and returns a maximum of 3,200 tweets going back in time. An overview of the different software and APIs can be found in one of my previous blog posts here.
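To put the rate limit into perspective, here is a rough back-of-the-envelope calculation in Python. The 180 queries per 15 minutes comes from the paragraph above; the figure of 100 tweets per query is my own assumption for illustration and is not something Mozdeh reports.

```python
# Back-of-the-envelope arithmetic for the rate limit described above.
QUERIES_PER_WINDOW = 180   # from the text: 180 queries per 15 minutes
WINDOW_MINUTES = 15
TWEETS_PER_QUERY = 100     # ASSUMPTION for illustration only, not stated in this post

windows_per_hour = 60 // WINDOW_MINUTES
queries_per_hour = QUERIES_PER_WINDOW * windows_per_hour
queries_per_day = queries_per_hour * 24

# Average spacing between queries if they are spread evenly over a window.
seconds_between_queries = WINDOW_MINUTES * 60 / QUERIES_PER_WINDOW

# Upper bound on tweets per window under the assumed page size.
max_tweets_per_window = QUERIES_PER_WINDOW * TWEETS_PER_QUERY

print(queries_per_hour)         # 720
print(queries_per_day)          # 17280
print(seconds_between_queries)  # 5.0
print(max_tweets_per_window)    # 18000
```

In other words, a long keyword list has to share roughly one query every five seconds, which is one reason busy periods can hit the limit.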
Searches ran continuously from 16/02/2015 19:20:00 to 06/04/2015 14:06:26 and captured a total of 5,055,299 tweets from 2,731,452 unique tweeters. I produced four time series graphs, shown in figures 1 to 4.
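For readers who want a feel for the scale, the short Python sketch below derives the length of the collection period and two simple averages from the figures above. It assumes the timestamps are in day/month/year order; the variable names are my own.

```python
from datetime import datetime

# Figures taken from the text; dates assumed to be day/month/year.
FMT = "%d/%m/%Y %H:%M:%S"
start = datetime.strptime("16/02/2015 19:20:00", FMT)
end = datetime.strptime("06/04/2015 14:06:26", FMT)
total_tweets = 5_055_299
unique_tweeters = 2_731_452

duration = end - start
days = duration.total_seconds() / 86_400  # fractional days of collection

tweets_per_day = total_tweets / days
tweets_per_tweeter = total_tweets / unique_tweeters

print(f"Collection period: {duration}")
print(f"Average tweets per day: {tweets_per_day:,.0f}")
print(f"Average tweets per tweeter: {tweets_per_tweeter:.2f}")
```

The collection ran for just under 49 days, and on average each account contributed fewer than two tweets.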
Figure 1 – Time series for all tweets across all of the keywords
Note that some of the fluctuations in the time series graph are caused by the time of day, with a trough each night. This is because up to 60% of all tweets have English language settings, so the troughs may indicate sleeping time for English-speaking users (Gerlitz & Rieder, 2013). The dips in tweets at the beginning and in the middle of the graph are caused by Twitter’s rate limiting.
The peaks can be investigated further by clicking on a point of interest; for instance, double clicking on the largest peak brings up the list of tweets that caused it. It is also possible to create a time series graph for another keyword or hashtag, e.g., one that occurs frequently in the data. For example, if the word ‘poisoning’ was found to occur frequently, it could be searched for by itself to create a new time series graph (which will also produce a new sentiment graph), as shown in figure 2 below.
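Mozdeh does all of this through its interface, but the underlying idea can be sketched in a few lines of Python. This is only an illustration of the general technique (count tweets per day, optionally filtered by a keyword, then rank the days), not Mozdeh's actual implementation. The function names and the sample tweets are made up for the example.

```python
from collections import Counter
from datetime import datetime


def daily_counts(tweets, keyword=None):
    """Count tweets per calendar day.

    `tweets` is an iterable of (created_at, text) pairs. If `keyword` is
    given, only tweets whose text contains it (case-insensitive) are counted.
    """
    counts = Counter()
    for created_at, text in tweets:
        if keyword is None or keyword.lower() in text.lower():
            counts[created_at.date()] += 1
    return counts


def top_peaks(counts, n=3):
    """Return the n days with the most tweets, largest first."""
    return counts.most_common(n)


# Invented sample tweets, for illustration only (not from the real data set).
sample = [
    (datetime(2015, 2, 16, 20, 5), "I think I have food poisoning"),
    (datetime(2015, 2, 16, 21, 30), "Food POISONING again, feeling sick"),
    (datetime(2015, 2, 17, 9, 0), "stomach pain all night"),
    (datetime(2015, 3, 9, 12, 0), "food poisoning from last night's dinner"),
]

all_counts = daily_counts(sample)
poisoning_counts = daily_counts(sample, keyword="poisoning")
print(top_peaks(poisoning_counts, n=1))
```

Once the busiest days are known, the tweets from those days can be read manually to judge whether a peak is genuine or mostly spam.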
Figure 2 – Time series for all tweets containing the word ‘poisoning’
On the 16th of February 2015 there is a peak in tweets mentioning the word ‘poisoning’. After investigating this peak, it appears that users are indicating that they may be suffering from food poisoning; however, there is also an increase in spam (i.e., tweets that are not relevant to food poisoning). The second largest peak occurs on the 9th of March 2015 and is a genuine peak, i.e., users indicate that they may have food poisoning and there is very little spam. The third largest peak occurs on the 19th of March 2015 and is almost entirely due to an increase in spam, with very few users mentioning that they have food poisoning.
Figure 3 – Time series of all tweets containing the word ‘pain’
Looking at the word ‘pain’ in this time series graph, the first two peaks (16th of February 2015 and 9th of March 2015) occur at the same time as the peaks in the previous graph for the word ‘poisoning’. The third peak begins on the evening of the 18th of March 2015 and continues into the 19th of March 2015, which also overlaps with the third peak in tweets containing the word ‘poisoning’.
Figure 4 – Average tweet sentiment of all tweets
The time series analysis also plots the average tweet sentiment; however, tweets related to norovirus sickness contain very little negative or positive sentiment.
I used the following search queries (adapted from queries used by the Food Standards Agency) to gather the data: sick bug, Sickness bug, stomach flu, vomiting, #Sicknessbug, sickness virus, winter AND sickness, bug, winter sickness, winter virus, winter vomiting, #winterbug, #norovirus, norovirus AND outbreak, Norovirus, Norovirus AND symptoms, puke, retch, sick AND fever, spit up, stomach pain, throw up, throwing up, upset stomach, Vom, #barf, Barf, being sick, chuck up, feeling sick, Heave, Norovirus AND Food, Food poisoning, Norovirus AND cruise, and Norovirus AND cruise ship. Data were analysed at an aggregate level, and these results form part of a project that has ethics approval.
Gerlitz, C., & Rieder, B. (2013). Mining one percent of Twitter: Collections, baselines, sampling. M/C Journal, 16(2).
González-Bailón, S., Wang, N., Rivero, A., Borge-Holthoefer, J., & Moreno, Y. (2014). Assessing the bias in samples of large online networks. Social Networks, 38, 16–27. doi:10.1016/j.socnet.2014.01.004
I would like to thank Prof. Peter Bath from the Health Informatics Research group for suggesting that I examine peaks and sentiment across health topics.