Using Visibrain Focus to analyse #ILookLikeASurgeon

In the previous blog post I examined the unrest in Ferguson using a commercial tool, Visibrain Focus. In this blog post I will outline some Twitter analytics related to the #ILookLikeASurgeon hashtag using Visibrain Focus, which has access to Twitter’s Firehose, i.e., all of the tweets. The results presented here are accurate at the time of writing, i.e., between 12 and 1PM (GMT) on the 27th of August 2015. The network graphs, however, cover two specific time periods: the first day of the hashtag, and the latest day (27th of August).

The #ILookLikeASurgeon hashtag attempts to challenge gender stereotypes, and was inspired by the #ILookLikeAnEngineer hashtag. Both of these hashtags attempt to break the male stereotype that can be associated with these two professions. There is now also an #ILookLikeAPhysicist hashtag which attempts to break the male stereotype that can be associated with the field of physics. The hashtags have received quite a lot of media attention, and you can read BBC Trending’s write-up of the #ILookLikeASurgeon hashtag here.

Over the past week I have managed to speak to quite a lot of the Twitter users behind the hashtag, even finding myself in a 6-way conference call with surgeons across continents. There was a lot of passion and excitement, and I could tell that this was a hashtag that meant a lot to them. I also had the opportunity to interview Heather Logghe, MD, who provided some insight into how the hashtag came about.

In total, at the time of writing, over the last 30 days there have been at least 28,337 tweets by 6,005 users, with 89,936,713 impressions, i.e., the number of times users have seen the tweets. 5,878 (21%) of the tweets are original, and 22,459 (79%) are retweets. This retweet percentage is quite high. The users behind the campaign have indicated that they look to retweet any mentions of the hashtag, which may be one reason for the high retweet ratio. Also interesting here is that 23,532 links have been shared.
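As a sanity check, the ratios quoted above can be reproduced with a few lines of arithmetic (a sketch; the figures are the ones given in the paragraph above):

```python
# Reproduce the original/retweet split from the figures quoted above.
original = 5_878
retweets = 22_459
total = original + retweets

print(total)                          # 28337 tweets in all
print(round(100 * original / total))  # 21 (% original)
print(round(100 * retweets / total))  # 79 (% retweets)
```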

Figure 1 – Timeline of tweets related to the #ILookLikeASurgeon hashtag

timeline

Figure 1 is a time series graph going back 30 days from the date this blog post was written. As mentioned previously, the awareness campaign was started on the 5th of August by Heather Logghe, MD. The largest peak occurred on the 12th of August, when at least 2,221 tweets were posted.

Figure 2 – World map of tweets related to the #ILookLikeASurgeon hashtag

map of the world

Figure 2 is a map that plots user locations related to #ILookLikeASurgeon using data provided within a user’s bio. However, this map only displays instances where users have used the English-language hashtag #ILookLikeASurgeon, rather than, say, a European or Asian alternative.

Figure 3 – Word cloud of related hashtags used in conjunction with the #ILookLikeASurgeon hashtag

looklikeasurgeonwordcloud

Figure 3 is a word cloud of hashtags that are present within the tweets. The most frequently used hashtags alongside #ILookLikeASurgeon include #surgtweeting, #diversitymatters, and #challengestereotypes. Many of the hashtags are related to challenging gender stereotypes, which is not surprising considering the aim of the campaign.

Figure 4 – Top expressions within tweets related to the #ILookLikeASurgeon hashtag

top expressions

Figure 4 displays the most frequently used expressions within the tweets. The interesting expressions within this word cloud include the phrases ‘awareness about women’, ‘diversity in surgery’, ‘women surgeons’, ‘female surgeon’, and ‘not me in heels’.

Figure 5 – Most frequently mentioned users in tweets related to the #ILookLikeASurgeon hashtag

top mentions

Figure 5 displays the most frequently mentioned user-handles, which include @WomenSurgeons, @LoggheMD, and @DrKathy, who are among the users that helped raise the profile of the campaign.

Below are two network graphs: the first corresponds to one of the first days the hashtag was used (6th of August 2015), and the second is of a more recent day (27th of August 2015). Both network graphs represent data retrieved from Twitter’s Firehose.

Figure 6 – Network graph of #ILookLikeASurgeon from 06 Aug 2015 00:00 to 06 Aug 2015 23:00

day1

This is a network graph created in Gephi using data obtained from Visibrain Focus from 06 Aug 2015 00:00 to 06 Aug 2015 23:00. The nodes are ranked by betweenness centrality and laid out using the Fruchterman-Reingold algorithm. Verbal consent was obtained from this community before the analysis was conducted.
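For readers who prefer scripting, the ranking Gephi performs can be sketched with the networkx Python library on a toy mention graph (the node names below are invented for illustration, not drawn from the actual dataset):

```python
import networkx as nx

# Toy directed mention graph: an edge A -> B means A mentioned B.
G = nx.DiGraph()
G.add_edges_from([
    ("alice", "hub"), ("bob", "hub"), ("carol", "hub"),
    ("hub", "dave"), ("dave", "erin"),
])

# Rank nodes by betweenness centrality, the same metric used in the graphs above.
bc = nx.betweenness_centrality(G)
ranked = sorted(bc, key=bc.get, reverse=True)
print(ranked[0])  # "hub": it sits on the most shortest paths
```

Gephi computes the same metric through its Statistics panel; the script is simply a way to reproduce the ranking outside the GUI.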

Figure 7 – Network graph of #ILookLikeASurgeon from 27 Aug 2015 00:00 to 27 Aug 2015 23:00

2

This is a network graph created in Gephi using data obtained from Visibrain Focus from 27 Aug 2015 00:00 to 27 Aug 2015 23:00. The nodes are ranked by betweenness centrality and laid out using the Fruchterman-Reingold algorithm.

What figures 6 and 7 demonstrate is that, between the very beginning (6 August) and fairly recently (27 August), there has been an increase in users tweeting with the hashtag, demonstrating that the community has grown significantly.

Massive thank you to Heather Logghe, MD for letting me talk to her, and thanks also to all of the other surgeons that were on the conference call I mentioned earlier in the post. Thanks also to Mimi Poinsett, MD for suggesting that I analyse the #ILookLikeASurgeon hashtag. Massive thanks as always to the lovely Georgina Parsons from Visibrain Focus. This is a link to the platform for anyone interested in seeing what it is all about.

Using Visibrain Focus to analyse the unrest in Ferguson

In my previous blog post I outlined a number of free tools that could be used to capture and analyse data from Twitter; in the next series of posts I will look at more powerful commercial tools. Over the past few weeks I have had the opportunity to use Visibrain Focus (commercial), which is a Twitter monitoring platform for digital marketing professionals; however, it has several features which are useful for research purposes.

This blog post has two aims: firstly, to show the potential of Visibrain Focus, and secondly, to provide some Twitter insight related to the Ferguson unrest (using the ‘#Ferguson’ hashtag and the ‘Ferguson protests’ keyword). As I have the unique opportunity to access tweets from the Firehose API (i.e., all of the tweets), I hope it can also help those who are currently conducting research around these themes.

Over the last 30 days (i.e., 30 days going back from 22nd August 2015, 3.07PM GMT), there are in total 1,715,534 tweets by 500,252 users, with 13,337,415,455 impressions (that is to say, the number of times users have seen the tweets). 36% of the tweets are original (n=618,772), 64% are retweets (n=1,096,762), and 74% of tweets contain a link (n=1,269,006). The retweet percentage is of interest here, indicating that tweets related to the Ferguson unrest have a high retweet ratio.

Figure 1 – Timeline of tweets containing the keywords ‘#Ferguson’ or ‘Ferguson protests’ 

timeline1

As shown in the figure above, tweets start to increase on August 9th, which corresponds to the one-year anniversary of the fatal shooting of Michael Brown by a white police officer. The largest peak occurs on the 10th of August, when a total of 550,928 tweets are posted. There is a sharp increase because during this time period police in Ferguson, Missouri, shot and critically injured an African-American teenager.

Figure 2 – Most frequently occurring hashtags used in tweets related to the unrest in Ferguson 

fergurson word cloud

In regards to the top three hashtags, #Ferguson is used 803,860 times, #blacklivesmatter 70,393 times, and #mikebrown 52,823 times. However, it is important to note that the hashtag #Ferguson and the keyword ‘Ferguson protests’ were used to retrieve this dataset, so it may be better to say that the word cloud above represents the most frequently occurring co-hashtags.

Figure 3 – Most commonly used expressions in tweets related to the unrest in Ferguson 

expressions

The above word cloud is generated from the most commonly recurring terms found in tweet content. In addition to the hashtags in the word cloud above (such as blacklivesmatter), other interesting expressions include ‘state of emergency’, ‘police’, ‘shots’, and ‘last year’. Also interesting here is the expression ‘Sir Alex Ferguson’, which represents the ‘noise’ in our dataset.

Figure 4 – World map of tweets related to the unrest in Ferguson

worldmap

The figure above is a map of where users are tweeting from, using the location provided within a user’s biography. The majority of tweets derive from the U.S. (69.3%, n=531,654), the U.K. (5.3%, n=40,303), and Canada (2.7%, n=21,093). However, this is a distribution I have observed across topics on Twitter, and it may have more to do with overall use of Twitter, as well as access to the Internet and mobile devices.

In regards to language, the majority of tweets are in English (84.2%, n=1,445,680), followed by Spanish (4.2%, n=72,854) and German (4%, n=68,061). Taken with figure 4 above, this is not surprising, as the majority of tweets derive from English-speaking countries.

Visibrain can also infer gender: in this instance, 22.2% of tweets derive from males (n=381,061), 17.2% derive from females (n=296,120), and 60.6% (n=1,039,824) are classified as other, i.e., where it is not possible to infer gender. This may be because the name provided by a Twitter user is not a real name, or because it is in a format that cannot be processed by Visibrain’s algorithm.

Figure 5 – Audience and following numbers of tweets related to the unrest in Ferguson 

audience

The figure above shows the audience and following numbers of users that have tweeted about the unrest. The most interesting aspect is that users have an average of 7,617 followers, and 158,815 users have a following of over 158 thousand, i.e., a high audience.

In terms of devices, 56.7% of users use a mobile (n=973,988), 2.6% (n=45,311) use a desktop, 1.8% (n=30,870) use a web-based client, and 8.3% (n=141,812) use an automated method, with 30.6% (n=525,024) classified as other.

The top 5 domains include twitter.com (54.6%, n=693,775), youtube.com (2.8%, n=35,523), nytimes.com (1.7%, n=21,705), theguardian.com (1.5%, n=19,267), and cnn.com (1.22%, n=815,694). Many videos are shared on Twitter, so it is not surprising to see YouTube as the second most popular domain. However, it is interesting to see The New York Times, The Guardian, and CNN as popular domains.

The top content types include text (62.5%, n=793,983), photo (46.1%, n=584,916), video (10%, n=126,789), and audio (0.2%, n=2,806). Image and video sharing are quite high; however, text-based tweets outnumber both photo and video sharing. Also of interest is that 1,273,716 tweets contain a link.

Visibrain allows end-users to export mention data in GEXF format; the files can then be imported into Gephi to create network graphs. I extracted a mention graph from 12AM to 1AM on August 9th (i.e., 1 hour’s worth of tweets) in order to create a network graph, shown below.
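As an aside, the same GEXF files Gephi opens can also be inspected programmatically. A minimal sketch using the networkx Python library, with a tiny hand-written GEXF document standing in for a real Visibrain export:

```python
import networkx as nx

# A minimal GEXF document of the kind a mention export produces
# (the two nodes and one edge here are invented for illustration).
gexf = """<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">
  <graph defaultedgetype="directed">
    <nodes>
      <node id="0" label="alice"/>
      <node id="1" label="bob"/>
    </nodes>
    <edges>
      <edge id="0" source="0" target="1"/>
    </edges>
  </graph>
</gexf>"""

with open("mentions.gexf", "w") as f:
    f.write(gexf)

# The same file Gephi would import can be read with networkx.
G = nx.read_gexf("mentions.gexf")
print(G.number_of_nodes(), G.number_of_edges())  # 2 1
```

`read_gexf` respects the file’s `defaultedgetype`, so a mention graph exported as directed stays directed.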

Figure 6 – Network graph of 1 hour of tweets related to the unrest in Ferguson on August 9th 2015 created in Gephi

1 screenshot_172842

Visibrain has many other features; for instance, it is also possible to look at the most frequently occurring tweets and the most retweeted users, and to apply various filters to sort through users and tweets. I hope to tweet out the different features and types of analysis that are possible using Visibrain over the coming weeks.

Below is a more recent network graph, of tweets posted between 4PM and 5PM on the 22nd of August.

Figure 7 – Network graph of 1 hour of tweets related to the unrest in Ferguson on August 22nd 2015 created in Gephi

Fscreenshot_202242

Special thanks goes to the lovely Georgina Parsons (@G_Parsons33) from Visibrain Focus, who has provided excellent user support. Massive thanks also to Pierrette Mimi Poinsett, MD (@yayayarndiva) for providing the idea to examine the Ferguson unrest on Twitter. This is a link to the platform for anyone interested in seeing what it is all about.

Table of tested software that can gather data from Twitter without programming knowledge

This is a table of software that I have used and tested for my PhD research (so far) to either gather or analyse Twitter data. I use them all in combination as they complement each other very well. Some of the software can allow for data gathered from other platforms to be imported into the application so it is best to read the documentation thoroughly.

Tool | OS | Platforms
Mozdeh | Windows (Desktop advisable) | Twitter
Webometric Analyst | Windows | Twitter (+image extraction), YouTube, Flickr, Mendeley, & other web resources
NodeXL | Windows | Twitter, YouTube, & Flickr
Netlytic | Web based | Twitter, Facebook, YouTube, & Instagram
Twitter Archiving Google Spreadsheet (TAGS) | Web based | Twitter
Chorus | Windows (Desktop advisable) | Twitter
DiscoverText (free 30-day trial) | Web based | Twitter, Facebook, Blogs, Forums, & Online news platforms
COSMOS Project | Windows & Mac OS X | Twitter
Visibrain | Web based | Twitter

Did I leave something out? Let me know! Either in the comments section or via Twitter (@was3210). The table first appeared in an LSE Impact Blog post, Using Twitter as a data source: An overview of current social media research tools.

Using @NodeXL to analyse @foodgov and associated hashtags

In this blog post I want to analyse the Food Standards Agency (FSA), specifically their Twitter handle @foodgov, by producing a network graph and associated analytics using the very powerful Microsoft Excel plugin NodeXL. I then want to further analyse the top 5 hashtags by creating 5 further network graphs.

I selected the FSA as I had the opportunity to attend an event at Twitter HQ, London, where the Head of Information Management, Dr Sian Thomas, provided some insight into the innovative and intuitive methods the FSA have applied, both in using social media data and in reaching the public via allergy awareness campaigns and the use of influencers (that is to say, users who may have a bigger reach or a different type of following compared to the FSA). My report on the event, which provides more context to the work by the FSA, can be found here.

In the network graphs, G1, G2, G3, etc. refer to different groups of users, and the words at the top of each group are those that occur most frequently. By visiting the NodeXL graph gallery more analytics can be located, such as the top URLs overall in the graph and in the separate groups (in this blog post I have hyperlinked each of the graphs, i.e., clicking on ‘Network graph 1’, for example, will take you to the graph gallery version of that network graph).

I find network graphs useful for summarizing and providing a snapshot of what users are conversing about on Twitter in relation to a keyword, hashtag, or user-handle at any given time. One topic, e.g., bird flu, may generate a range of conversations, and this would be represented in the network graph by a number of different groups and associated keywords and URLs. For each graph I have added a section where I briefly mention what I found interesting about it.

Network graph 1 – Tweets containing @foodgov
@foodgov

The graph above represents a network of 441 Twitter users whose recent tweets contained “@foodgov”, or who were replied to or mentioned in those tweets, taken from a data set limited to a maximum of 10,000 tweets.

An interesting observation in this network graph: top URLs included FSA advice about avian (bird) flu, FSA Board agrees restrictions on raw milk should remain, Suspected bird flu found on Lancashire poultry farm, Campylobacter Action Plan – Our Progress, and J & K Smokery Ltd recalls vacuum packed smoked fish because of concerns over Clostridium botulinum controls.

What I am interested in for this post are the top 5 hashtags in the entire graph, which were:

fsaboard
birdflu
rawmilk
recall
foodallergy

So, one by one, I entered these hashtags into NodeXL to create 5 further network graphs.

Network graph 2 – fsaboard

fsaboard

The graph represents a network of 100 Twitter users whose recent tweets contained “#fsaboard”, or who were replied to or mentioned in those tweets, taken from a data set limited to a maximum of 10,000 tweets.

An interesting observation in this network graph: The @foodgov account was most influential in this network graph (ranked by betweenness centrality). The top URL in G1 and overall in the graph was: FSA Board agrees restrictions on raw milk should remain and one of the top keywords in this group was ‘raw milk’ indicating that discussion revolved around this news article.

Network graph 3 – birdflu

birdflu

The graph represents a network of 876 Twitter users whose recent tweets contained “#birdflu”, or who were replied to or mentioned in those tweets, taken from a data set limited to a maximum of 10,000 tweets.

An interesting observation in this network graph: In G1 a number of Twitter users (that are not connected to each other) are relaying the message i.e., are posting a tweet that contains the keyword or hashtag ‘birdflu’. The top URL in the entire graph and G1 was Avian flu confirmed in Lancashire.

Network graph 4 – rawmilk

rawmilk

The graph represents a network of 175 Twitter users whose recent tweets contained “#rawmilk”, or who were replied to or mentioned in those tweets, taken from a data set limited to a maximum of 10,000 tweets.

An interesting observation in this network graph: In G2 a number of unconnected users are tweeting about a news article related to scientific risk assessments that were recently published in the Journal of Food Protection. Drawing on these results, the author of the article suggests that raw milk is ‘remarkably’ safe. The top URL in G2, and one of the top 3 URLs overall, is the aforementioned news story: New Science Confirms that Drinking Raw Milk is Remarkably Safe.

Network graph 5 – recall

recall

The graph represents a network of 1,777 Twitter users whose recent tweets contained “#recall”, or who were replied to or mentioned in those tweets, taken from a data set limited to a maximum of 10,000 tweets.

An interesting observation in this network graph: The most influential Twitter account in the entire graph is @usdafoodsafety (ranked by betweenness centrality). In G1 a number of unconnected users are relaying messages, i.e., tweeting about products (mostly food) being recalled. The top URL overall in the graph is a WordPress website which provides news and email alerts on product recalls (not always food products, which explains top keywords such as ‘gm’ and ‘India’, as the company General Motors recently had to recall a large number of vehicles due to a wiring problem).

Network graph 6 – foodallergy

foodallergy

The graph represents a network of 662 Twitter users whose recent tweets contained “#foodallergy”, or who were replied to or mentioned in those tweets, taken from a data set limited to a maximum of 10,000 tweets.

An interesting observation in this network graph: The top hashtags in this graph are foodallergy, faact, and peanutallergy. The top co-word pairs are: food, allergy; peanut, patch; foodallergy, friendly; phase, iii; iii, trials; and trials, foodallergy. The most influential Twitter account is @foodallergy, and the top URL in the entire graph is ‘Peanut Patch’ Heads to Phase III Trials.

This blog post has presented some analytics on the @foodgov Twitter account and associated hashtags using NodeXL. There is much more going on within each graph, and I have only highlighted what I found interesting, particularly from a health informatics perspective. Written consent (consent via a tweet) was obtained to analyse the FSA’s Twitter account and/or associated keywords and hashtags related to the FSA’s Twitter account.

For anyone wanting to learn more about NodeXL and network graphs check out this video – Network Mapping the Ecosystem by Marc Smith (@marc_smith) and this excellent article Mapping Twitter Topic Networks: From Polarized Crowds to Community Clusters.

Why is there so much research on Twitter? And what does this mean for our methods?

I was asked on Twitter by a fellow PhD student what tools and methods there were for capturing and analysing data from Facebook, and although I was able to find a few, there were far more Twitter data capture tools. I also noticed that there are very few tools that can be used to obtain data from other social media platforms such as Pinterest, Google+, Tumblr, Instagram, Flickr, Vine, and Amazon, among others. This led me to wonder whether it was tool availability, or some other reason, that explains why there is more research on Twitter compared to other social media platforms.

I then asked the following question on Twitter:

Why is there so much research on Twitter? Is it because it’s difficult to get data from other platforms? Or is Twitter a special platform?

I received a range of responses:

  1. Twitter is a popular platform in terms of the media attention it receives, and this cultural status attracts more research
  2. Twitter makes it easier to find and follow conversations which consequently makes it easier to research
  3. Twitter has hashtag norms which make it easier to gather, sort, and expand searches when collecting data
  4. Twitter data is easy to retrieve as major incidents, news stories and events on Twitter are normally centered around a hashtag
  5. The Twitter API is more open and accessible compared to other social media platforms, which makes Twitter more favorable to developers creating tools to access data. This consequently increases the availability of tools to researchers.

It is probable that a combination of responses 1 to 5 has led to more research on Twitter. However, this raises another distinct but closely related question: when research is focused so heavily on Twitter, what (if any) are the implications for our methods?

Can the methods that are currently used in analysing Twitter data, i.e., sentiment analysis, time series analysis (examining peaks in tweets), network analysis, etc., be applied to other platforms, or are different tools, methods, and techniques required?

I have used the following four methods in analysing Twitter data for the purposes of my PhD, below I consider whether these would work for other platforms:

  1. Sentiment analysis works well with Twitter data, as tweets are consistent in length (i.e., <= 140 characters), but would sentiment analysis work well with, for example, Facebook data, where posts may be longer?
  2. Time series analysis is normally used when examining tweets over time to see when a peak of tweets may occur; would examining time stamps in Facebook posts, or Instagram posts, for example, produce the same results? Or is this only a viable method because of the real-time nature of Twitter data?
  3. Network analysis is used to visualize the connections between people and to better understand the structure of the conversation. Would this work as well on other platforms where users may not be connected to each other, i.e., public Facebook pages, or images from Instagram?
  4. Machine learning methods may work well with Twitter data due to the length of tweets (i.e., <= 140 characters), but would they work for longer posts and for posts (e.g., on Instagram) where images may be present?

It may well be that at least some of these methods can be applied to other platforms; however, they may not be the best methods, and new methods, techniques, and tools may need to be formulated. On the tool front, I would like to see more software for those in the social sciences to obtain data from a range of platforms, including a range of data types, i.e., web links, images, and video. At the Masters and PhD level there should be more emphasis on training social science students to effectively use existing software that can capture and analyse data from social media platforms.

Acknowledgements

I would like to thank Curtis Jessop, Blog Editor of NSMNSS and Senior Researcher at NatCen Social Research, for the suggestion to write this blog post and the idea to examine the methodological implications of focusing on certain social media platforms.

Metadata across Twitter tools

In this very short blog post I want to show the amount of metadata it is possible to obtain via the Twitter API across Texifter, TAGS, Mozdeh, and Chorus.

I used Tweepy (a Python library), to pull in Twitter data in a raw JSON format. This is shown in Figure 1 below:

Figure 1 – A Tweet with all of the accompanying metadata

JSON All

Code can then be written to extract, i.e., to lift, the tweet, time stamp, author, and tweet ID out of the raw JSON. This is shown in Figure 2 below:

Figure 2 – Filtered JSON

JSON Small

This reduction in metadata may be required because, for hardware reasons, it may not be feasible to store and/or process all of the data. The tweets can then be saved to a CSV file, for example, to be opened in a spreadsheet.
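A minimal sketch of that extraction step, using only the Python standard library; the sample tweet is invented, but the field names (`id_str`, `created_at`, `text`, `user.screen_name`) are those of Twitter’s v1.1 JSON:

```python
import csv
import json

# A trimmed-down example of the raw JSON a library like Tweepy returns
# (the values here are invented for illustration).
raw = json.loads("""{
  "id_str": "123456789",
  "created_at": "Thu Apr 06 15:24:15 +0000 2015",
  "text": "Feeling unwell today #norovirus",
  "user": {"screen_name": "example_user"}
}""")

# Lift out only the tweet, time stamp, author, and tweet ID.
row = {
    "tweet_id": raw["id_str"],
    "timestamp": raw["created_at"],
    "author": raw["user"]["screen_name"],
    "text": raw["text"],
}

# Save the reduced record to a CSV file for use in a spreadsheet.
with open("tweets.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=row.keys())
    writer.writeheader()
    writer.writerow(row)
```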

I thought it would be interesting to compare what metadata existing tools provide, and to see how they have been programmed for metadata retention. It is important to note that in 2013 Twitter increased the amount of metadata that was returned; therefore, if a tool was designed prior to this, the developers could have chosen to retain the same metadata or not to have updated the software at all.

Figure 3 – Metadata provided by Texifter, TAGS, Mozdeh, and Chorus

Comparing the APIS

Texifter, a commercial provider of Twitter data (using the Firehose), supplies all of the metadata, whereas free tools provide only certain amounts of metadata. TAGS provides the most metadata (and can also provide further metadata), and Mozdeh the least.

The figure above demonstrates the varying levels of data that tools can provide, i.e., decisions about what metadata is included or excluded have already been made by the time the tool is used. This may have implications for the types of research that can be conducted; for example, it may not be possible to study the favoriting behaviour of users if a tool does not provide this data.

Using Mozdeh to analyse Norovirus tweets

This blog post is a small case study of how I used Mozdeh to capture and analyse over 5 million tweets related to the norovirus infection. Mozdeh requires no programming knowledge and can be used by those from the social sciences to capture and analyse Twitter data. Mozdeh currently only supports Windows operating systems (32 and 64 bit), and it is advisable to use Mozdeh on a desktop computer. Mozdeh can be downloaded from here, an excellent user guide can be found here, a theoretical overview can be found here, and an overview of Twitter query set generation can be found here.

Mozdeh uses the Search API, which is rate limited at 180 queries per 15 minutes, i.e., Mozdeh will search Twitter 180 times per 15 minutes, and will return a maximum of 3,200 tweets going back in time. An overview of the different software and APIs can be found in one of my previous blog posts here.
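The ceiling this rate limit imposes is easy to work out. Assuming the Search API’s maximum of 100 tweets per request (an API parameter, not something specific to Mozdeh), a collector can retrieve at most:

```python
# Search API rate limit: 180 queries per 15-minute window.
queries_per_window = 180
tweets_per_query = 100      # assumed per-request maximum (count=100)
windows_per_hour = 60 // 15

max_tweets_per_hour = queries_per_window * tweets_per_query * windows_per_hour
print(max_tweets_per_hour)  # 72000 tweets per hour, at best
```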

Searches took place continuously from 16/02/2015 19:20:00 to 06/04/2015 14:06:26, yielding a total of 5,055,299 tweets from 2,731,452 unique tweeters. I produced four time series graphs, figures 1 to 4.

Figure 1 – Time series for all tweets across all of the keywords

T1

Note that for the time series graph some of the fluctuations are caused by the time of day, with troughs each night. This is because up to 60% of all tweets have English language settings, and this may indicate sleeping time for English-speaking users (Gerlitz & Rieder, 2013). The dips in tweets at the beginning and middle of the graph are caused by Twitter’s rate limiting.

The peaks can be further investigated by clicking on a point of interest; for instance, double clicking on the largest peak will bring up the list of tweets that caused the peak. It is also possible to create a time series graph for another keyword or hashtag, e.g., if a particular keyword or hashtag occurs frequently. For example, if it was found that the word ‘poisoning’ was occurring frequently, this word by itself could be searched for to create a new time series graph (which will also produce a new sentiment graph), as shown in figure 2 below.
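The same peak inspection can be sketched in a few lines of Python; the daily counts below are invented for illustration:

```python
# Toy daily tweet counts; in Mozdeh the peak is inspected by clicking,
# but the same idea can be scripted over exported counts.
counts = {
    "2015-03-08": 1200,
    "2015-03-09": 5400,   # stands out as the peak day
    "2015-03-10": 1900,
}

peak_day = max(counts, key=counts.get)
print(peak_day, counts[peak_day])  # 2015-03-09 5400
```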

Figure 2 – Time series for all tweets containing the word ‘poisoning’

t3

On the 16th of February 2015 there is a peak in tweets mentioning the word ‘poisoning’; after investigating this peak, it appears that users indicate they may be suffering from food poisoning, although there is also an increase in spam (i.e., tweets that are not relevant to food poisoning). The second largest peak occurs on the 9th of March 2015, and this is a genuine peak, i.e., users indicate that they may have food poisoning, with very little spam. The third largest peak occurs on the 19th of March 2015 and is solely due to an increase in spam, with very few users mentioning that they have food poisoning.

Figure 3 – Time series of all tweets containing the word ‘pain’

t4

Looking at the word ‘pain’ in this time series graph, the first two peaks (16th of February 2015 and 9th of March 2015) occur at the same time as the peaks in the previous graph for the word ‘poisoning’. The third peak occurs on the evening of the 18th of March 2015 and continues into the 19th of March 2015, which also overlaps with the previous peak in tweets containing ‘poisoning’.

Figure 4 – Average tweet sentiment of all tweets

t5

The time series analysis also plots the average tweet sentiment; however, tweets related to norovirus sickness contain very little negative or positive sentiment.

I used the following search queries (adapted from queries used by the Food Standards Agency) to gather the data: sick bug, sickness bug, stomach flu, vomiting, #Sicknessbug, sickness virus, winter AND sickness, bug, winter sickness, winter virus, winter vomiting, #winterbug, #norovirus, norovirus AND outbreak, Norovirus, Norovirus AND symptoms, puke, retch, sick AND fever, spit up, stomach pain, throw up, throwing up, upset stomach, Vom, #barf, Barf, being sick, chuck up, feeling sick, Heave, Norovirus AND Food, Food poisoning, Norovirus AND cruise, and Norovirus AND cruise ship. Data were analysed at an aggregate level, and these results form part of a project that has ethics approval.

References 

Gerlitz, C., & Rieder, B. (2013). Mining one percent of Twitter: Collections, baselines, sampling. M/C Journal, 16(2).

González-Bailón, S., Wang, N., Rivero, A., Borge-Holthoefer, J., & Moreno, Y. (2014). Assessing the bias in samples of large online networks. Social Networks, 38, 16–27. doi:10.1016/j.socnet.2014.01.004

Acknowledgments

I would like to thank Prof. Peter Bath from the Health Informatics Research group for suggesting to examine peaks and sentiments across health topics.

Tools that can be used to create network graphs of Twitter data

This is by no means a comprehensive list, and the tools are presented in no particular order. I normally use one or more of them when visualizing Twitter chats, workshops, or conferences. These tools require no programming knowledge and can be used by those in the social sciences. Recently, I have been exploring how network graphs can be used to better understand health communication on Twitter. I created a network graph for my Twitter network (@was3210) using each tool (clicking on an image will show a larger version in a separate window).

NodeXL (@nodexl)

NodeXL is a Microsoft Excel plugin. The software can be used to obtain data from Twitter, YouTube, and Flickr, and it runs on Windows operating systems. Users can download graph options from the NodeXL graph gallery by navigating to the bottom of a gallery page. The GraphML file can then be imported into NodeXL, and, using the automate feature, a graph with a similar layout can be constructed. The workbook used to create a particular graph is often linked from the bottom of the page; it can be opened without importing the GraphML file, in which case there is also no need to use the automate feature. The graphs can be further customized, e.g., by adding or removing group labels. There are some excellent NodeXL tutorials on YouTube. NodeXL is part of the Social Media Research Foundation, whose director, Marc Smith, can be found on Twitter (@marc_smith). There is no need to register an account to create the graphs; however, an account is required to upload them to the graph gallery.

@was3210 NodeXL

A network graph of @was3210 created using NodeXL 

Netlytic (@Netlytic)

Netlytic is cross-platform as it is a web-based tool, and it can be used for Twitter, Facebook, YouTube, and Instagram. Tier 2 of Netlytic allows users to create and manage up to 5 datasets and 10,000 records. Feature number 4, 'Network Analysis', allows users to visualize and customize the captured data as a network graph. Netlytic can also automatically summarize large volumes of text and discover social networks from conversations on social media. There are some excellent Netlytic guides on YouTube. Netlytic is part of the Social Media Lab. Members of the lab, Anatoliy Gruzd (@gruzd) and Philip Mai (@PhMai), can be found on Twitter. Users are required to register for an account to use Netlytic.

@was3210 Netlytic

A network graph of @was3210 created using Netlytic 

Twitter Archiving Google Spreadsheet (TAGS)

TAGS (Twitter Archiving Google Spreadsheet), created and managed by Martin Hawksey, is a web-based, cross-platform tool. After capturing Twitter data for a keyword, hashtag, or user handle, it is possible to use TAGS Explorer, currently in beta, to visualize networks. Martin can be found on Twitter (@mhawksey). There is no need to register for an account as the tool uses Google Spreadsheets.

TAGS

A network graph of @was3210 created using TAGS

SocioViz (@SocioVizNet)

SocioViz is a social media analytics platform powered by social network analysis metrics. SocioViz is an analytics tool, so unlike the other tools it does not capture data, although it is possible to extract data from it. SocioViz can provide analytics for keywords, hashtags, or user handles. Users are required to register for an account. More information on SocioViz and how data can be extracted can be found here.

@was3210 socioviz

A network graph of @was3210 created using SocioViz

Gephi (@Gephi)

Gephi, a network visualization and analysis platform, is a very powerful tool and may be of particular interest to developers. A variety of file extensions are supported by Gephi, which makes it easy to import data into the program.
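As a quick illustration of how easy it is to get Twitter data into Gephi, the sketch below writes a tiny mention network to GraphML, one of the file formats Gephi imports directly. The usernames and mention pairs are made up for illustration:

```python
# A minimal sketch of writing a Twitter mention network to GraphML for Gephi.
# The mention pairs are invented; real data would come from captured tweets.
import xml.etree.ElementTree as ET

# Each pair is (author, mentioned user) taken from hypothetical tweets.
mentions = [("was3210", "gruzd"), ("was3210", "mhawksey"), ("gruzd", "PhMai")]

ns = "http://graphml.graphdrawing.org/xmlns"
root = ET.Element("graphml", xmlns=ns)
graph = ET.SubElement(root, "graph", edgedefault="directed")

# Nodes: every user that appears as an author or as a mention target.
for user in sorted({u for pair in mentions for u in pair}):
    ET.SubElement(graph, "node", id=user)

# Edges: one directed edge per mention (author -> mentioned user).
for author, mentioned in mentions:
    ET.SubElement(graph, "edge", source=author, target=mentioned)

ET.ElementTree(root).write("mentions.graphml", encoding="UTF-8",
                           xml_declaration=True)
```

The resulting `mentions.graphml` file can be opened in Gephi via File > Open, after which layouts and metrics can be applied as usual.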

If I have missed a tool, please let me know and I can create a test network graph and include it in this blog post.

A comparison of Twitter APIs across tools

In this blog post I compare the Streaming, Search, and Firehose APIs over a three-day period (3rd to the 5th of January, 2015) across three different tools. A comprehensive outline of the different APIs and how they return tweets can be found here.

Most research on Twitter uses either the Search API or the Streaming API. Twitter's Search API provides access to tweets that have already occurred, i.e., users can request tweets that match a 'search' criterion, similar to how an individual user would conduct a search directly on Twitter. When you query Twitter via the Search API, the maximum number of tweets going back in time that Twitter will return is 3,200 (with a limit of 180 search requests every 15 minutes).
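To get a feel for what that rate limit means in practice, here is a back-of-the-envelope sketch. The 100-tweets-per-request page size is the Search API's maximum; the collection-time estimate is purely illustrative:

```python
# Illustrative arithmetic for the Search API's rate limit: at most 180
# requests per 15-minute window, each returning up to 100 tweets, so a
# single window can yield at most 18,000 tweets.
REQUESTS_PER_WINDOW = 180
TWEETS_PER_REQUEST = 100   # maximum page size for the Search API
WINDOW_MINUTES = 15

def minutes_to_collect(target_tweets):
    """Rough lower bound on wall-clock time to page through a result set."""
    requests_needed = -(-target_tweets // TWEETS_PER_REQUEST)   # ceiling division
    windows_needed = -(-requests_needed // REQUESTS_PER_WINDOW)
    return windows_needed * WINDOW_MINUTES

# Paging through 155,086 tweets would need at least 1,551 requests,
# i.e. nine 15-minute windows.
print(minutes_to_collect(155_086))  # 135
```

This is why collection over anything but a short window has to be planned around the rate limit, quite apart from the Search API's limits on how far back it will go.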

Twitter states that the Search API is:

…focused on relevance and not completeness. This means that some Tweets and users may be missing from search results. If you want to match for completeness you should consider using a Streaming API instead (Twitter developers).

The Streaming API is a push of data as tweets occur in near real-time. However, Twitter only returns a small percentage of tweets. The tweets that are returned depend on various factors such as the demand on Twitter, and how broad/specific the search query is. Twitter states that the Streaming APIs:

…give developers low latency access to Twitter’s global stream of Tweet data. A proper implementation of a streaming client will be pushed messages indicating Tweets and other events have occurred, without any of the overhead associated with polling a REST endpoint. If your intention is to conduct singular searches, read user profile information, or post Tweets, consider using the REST APIs instead (Twitter developers).
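The push model described above can be sketched as follows. The Streaming API delivers newline-delimited JSON, with blank lines as keep-alives, so the consumption side of a client boils down to reading one line at a time; the stream here is simulated with an in-memory buffer rather than a real authenticated connection:

```python
# A minimal sketch of the consumption side of a streaming client. A real
# client would hold an authenticated HTTP connection open; here the stream
# is simulated with an in-memory buffer of newline-delimited JSON.
import io
import json

def consume(stream, handle):
    """Read newline-delimited JSON objects and pass each one to `handle`."""
    for line in stream:
        line = line.strip()
        if not line:          # the API sends blank keep-alive lines
            continue
        handle(json.loads(line))

collected = []
fake_stream = io.StringIO(
    '{"id": 1, "text": "first tweet"}\n'
    '\n'                      # keep-alive
    '{"id": 2, "text": "second tweet"}\n'
)
consume(fake_stream, collected.append)
print(len(collected))  # 2
```

The key contrast with the Search API is that nothing here polls: tweets arrive as Twitter pushes them, which is why the Streaming API captures events in near real-time but cannot reach back into the past.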

The Firehose (which can be quite costly) provides all the tweets in near real-time; unlike the Streaming API, there are no limitations on the number of results that are provided. I won a historical data prize from DiscoverText which provided me with access to 3 days' worth of Firehose data. I selected this data to overlap with data I had gathered via the Streaming API (using Chorus) and the Search API (using Mozdeh).

This is what I found:

Table 1 – The number of tweets retrieved via each API across three different tools

Tool                   | API          | No. tweets
DiscoverText/Texifter  | Firehose API | 195,713
Mozdeh                 | Search API   | 155,086
Chorus                 | Search API   | 145,348

Table 1 shows that searches with the keyword ‘Ebola’ gathered 79% (155,086) of all tweets via the Search API using Mozdeh, and 74% (145,348) of all tweets via the Search API using Chorus. The baseline, i.e., the complete set of 195,713 tweets (100%), was obtained via the Firehose API using DiscoverText.
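The coverage percentages above follow directly from dividing each sample by the Firehose baseline:

```python
# Coverage of each sampled collection relative to the Firehose baseline.
firehose = 195_713
samples = {"Mozdeh (Search API)": 155_086, "Chorus (Search API)": 145_348}

for tool, count in samples.items():
    print(f"{tool}: {count / firehose:.0%}")
# Mozdeh (Search API): 79%
# Chorus (Search API): 74%
```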

I produced three word clouds to examine the most frequent words across the three samples in order to investigate whether there were any major differences in word frequencies.
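Word clouds are driven by simple word frequencies: the more often a word occurs, the larger it is drawn. A minimal sketch of the underlying counting step, using invented tweet texts rather than the actual Ebola data:

```python
# Counting word frequencies across a set of tweets, the step that
# underlies any word cloud. The tweet texts are invented for illustration.
from collections import Counter
import re

tweets = [
    "UK nurse with Ebola in critical condition",
    "Ebola nurse remains in critical condition",
    "Thoughts with the nurse fighting Ebola",
]

counts = Counter(
    word
    for tweet in tweets
    for word in re.findall(r"[a-z']+", tweet.lower())
)

# The most frequent words would appear largest in the cloud.
print(counts.most_common(3))
```

Comparing such frequency tables across the three samples is, in effect, what comparing the word clouds does visually.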

Word Cloud 1: 195,713 Ebola tweets via the Firehose API using DiscoverText, from the 3rd of January to the 5th of January:

Firehose API

Word Cloud 2: 155,086 Ebola tweets via Mozdeh using the Search API from the 3rd of January to the 5th of January:

Mozdeh

Word Cloud 3: 145,348 Ebola tweets via Chorus using the Search API from the 3rd of January to the 5th of January:

Chorus Streaming API

The word clouds provide a visual representation of the samples in terms of word frequency, i.e., the more frequent a word is, the bigger it appears in the word cloud. These word clouds contain words such as ‘nurse’, ‘critical’, and ‘condition’, as within this time period a nurse (Pauline Cafferkey) suffering from Ebola in the U.K. had fallen into critical condition. The word clouds are very similar across the different tools and APIs. This may be because, as Twitter’s senior partner engineer Taylor Singletary suggested in a forum post, the sample stream via the Streaming API is a random sample of all tweets that are available on the platform (Gerlitz and Rieder, 2013).

These results suggest that if you use a limited number of search queries and gather data over a relatively short period of time, Twitter will provide a fair proportion of tweets, and depending on the research question of a project this may be sufficient. However, González-Bailón et al. (2014) have found that the structure of samples may be affected by both the type of API and the number of hashtags that are used to retrieve the data. Therefore, depending on the number of keywords and hashtags used, the number of tweets retrieved is likely to vary. All of this may change as Twitter introduces potential adjustments to the Streaming API. These results form part of a larger project which has ethics approval.

Edit 08/07/15

As Dr. Timothy Cribbin has pointed out in the comments, Chorus uses the Search API and not the Streaming API as previously stated in this blog post. Although the comparison is therefore not across three different APIs, I hope it is still interesting.

Acknowledgements

I am very grateful to Dr Farida Vis for her expert guidance & advice, for providing me with the literature, and for the various discussions on Twitter APIs.

References and further reading

Gaffney, D., & Puschmann, C. (2014). Data collection on Twitter. In K. Weller, A. Bruns, J. Burgess, M. Mahrt, & C. Puschmann (Eds.), Twitter and Society (pp. 55–67). New York, NY: Peter Lang.

Gerlitz, C., & Rieder, B. (2013). Mining one percent of Twitter: Collections, baselines, sampling. M/C Journal, 16(2).

González-Bailón, S., Wang, N., Rivero, A., Borge-Holthoefer, J., & Moreno, Y. (2014). Assessing the bias in samples of large online networks. Social Networks, 38, 16–27. doi:10.1016/j.socnet.2014.01.004

Morstatter, F., Pfeffer, J., Liu, H., & Carley, K. M. (2013). Is the sample good enough? Comparing data from Twitter’s Streaming API with Twitter’s Firehose. Proceedings of ICWSM.

Algorithmic Visibility at the Selfie Citizenship Workshop

The Selfie Citizenship Workshop was held on the 16th of April at the Digital Innovation Centre at Manchester Metropolitan University, and brought together researchers across various disciplines, fields, and backgrounds in order to explore the notion of ‘selfie citizenship’ and how the selfie has been used for acts of citizenship. The event was very well tweeted using the hashtag #selfiecitizenship, generating over 400 tweets during the day; a network analysis of tweets at the event can be seen here. The event was sponsored by the Visual Social Media Lab, Manchester School of Art, Digital Innovation, and the Institute of Humanities and Social Science Research.

The talk that stood out to me the most was by Dr Farida Vis, titled 'Algorithmic Visibility: Edgerank, Selfies and the Networked Photograph'. The reason for this is that I once wrote a blog post in which I briefly outlined Farida's talk on algorithmic culture at the Digital Culture Conference: Improving Reality.

The talk at this workshop centered on an image that Farida saw pop up in her Facebook news feed. The image was shown to her because one of her friends had commented on the picture; due to their perceived close tie, that is to say, because they were Facebook friends, the image was also shown to her. The image was of an Egyptian protester displaying solidarity with Occupy Oakland by holding a homemade cardboard sign with the caption ‘from Egypt to wall street don’t afraid Go ahead #occupyoakland, #ows’.

Occupy Wall Street (OWS) refers to the protest movement which began on September 17th, 2011, in Zuccotti Park, in New York City’s Wall Street financial district. The movement received global attention and led to an international Occupy movement against social and economic inequality across the world; hence an Egyptian protester was holding a sign with both the #occupyoakland and #ows hashtags.

The image left an impression on her, especially its composition: the sign and the man’s face, presumably inviting us to look at his face. Months later she attempted to locate the image, and was surprised to find she could not locate it anywhere on her friend’s wall. It was as if she had never seen the image in the first place. She then asked: how do people locate images on social media? That is to say, if you see an image, do not initially save it, and are then unable to find it again, how would you locate it? In this case, she knew that the image was about the Occupy movement and was related to Egypt; she combined these as search queries and, with some detective work, was able to locate the image.

She found that the photographer had uploaded a range of images to a Facebook album, and that there was an image similar to the one she was searching for, but in which the protester had his eyes closed. Surprisingly, this image had the exact same number of likes as, and more shares than, the original image. However, this series of similar images from the same protest had not been made visible to her. She argued that we should think critically and carefully about the different structures for organising images, which can vary across platforms, and about how images are made visible to us.

How, for example, does EdgeRank decide which image to show us? EdgeRank is the name that was given to the algorithm Facebook once used to decide which stories should be displayed in a user’s news feed. EdgeRank ranked three elements: affinity, weight, and time decay. Facebook no longer refers to its algorithm as EdgeRank internally, but now employs a more complex news feed algorithm, one that does not have a catchy name and that takes into account over 100,000 factors in addition to EdgeRank’s original three. I would argue that just understanding what an algorithm is, in this instance, is difficult. Then, when you attempt to understand the workings behind the algorithm, you find that this is not possible, as the methods that Facebook, for example, uses to adjust the parameters of the algorithm are considered proprietary and are not available to the public. Moreover, even if we do understand how images are made visible, we are still taking the images themselves as a given.
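To make the three elements concrete, here is a toy scoring function in the spirit of EdgeRank's affinity, weight, and time decay; the numbers and the decay function are invented for illustration and are not Facebook's actual formula:

```python
# A toy illustration of EdgeRank-style scoring. The three inputs mirror the
# elements named in the talk: affinity (closeness to the item's creator),
# weight (value of the interaction type), and time decay. All values and the
# exponential decay are invented, not Facebook's real parameters.
def edge_score(affinity, weight, hours_old, half_life=24.0):
    # Exponential decay: an item loses half its score every `half_life` hours.
    time_decay = 0.5 ** (hours_old / half_life)
    return affinity * weight * time_decay

# A comment from a close friend an hour ago outranks a like from a
# distant acquaintance a day ago.
fresh_comment = edge_score(affinity=0.9, weight=2.0, hours_old=1)
stale_like = edge_score(affinity=0.2, weight=1.0, hours_old=24)
print(fresh_comment > stale_like)  # True
```

Even this toy version shows why the real thing is opaque: the ranking depends entirely on parameters and interactions that users never see.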

Algorithms can also get it wrong; take the example of the Facebook Year in Review feature, which received much press coverage after displaying to one user a photograph of his recently deceased daughter, to another a picture of his father’s ashes, and in one case showing a user a picture of their deceased dog.

This was raised in one of the Q&As: changes in features on social media need to be better documented. This is important in this context, as the image was in a Facebook album, a feature that is not used as widely today. In my own work, for example, I have found that Twitter has implemented several new features, which are difficult to document and to connect back to data sets where these features were not present. A further point raised in the Q&As that I thought was interesting was that of Twitter users ‘hacking’ the platform in order to share Instagram images on Twitter, after Instagram images stopped appearing there. IFTTT, for example, will allow users to connect Instagram to Twitter.

Overall, I thought the talk highlighted very well that it is important to think about the conditions in which an image may be shown to us, and also about what is not shown to us. As a social media user and a Facebook user I see images, videos, and links pop up in my news feed. I had not given much thought to the conditions for their visibility, or to the fact that an algorithm taking into account over 100,000 factors was deciding what would appear in my news feed.
