A comparison of Twitter APIs across tools

In this blog post I compare the Streaming, Search, and Firehose APIs over a three-day period (3rd to 5th of January 2015) across three different tools. A comprehensive outline of the different APIs and how they return tweets can be found here.

Most research on Twitter uses either the Search API or the Streaming API. Twitter’s Search API provides access to tweets that have already occurred, i.e., users can request tweets that match ‘search’ criteria, similar to how an individual user would conduct a search directly on Twitter. When you query Twitter via the Search API, the maximum number of tweets going back in time that Twitter will return is 3,200 (with a limit of 180 search requests every 15 minutes).
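As a rough illustration of what these limits mean in practice, the per-window budget works out as follows (this assumes a maximum of 100 tweets per search request, a figure not stated above):

```python
# Back-of-the-envelope Search API budget per rate-limit window.
# Assumes up to 100 tweets per request (an assumption about the maximum
# page size) and the 180-requests-per-15-minutes limit mentioned above.
REQUESTS_PER_WINDOW = 180
TWEETS_PER_REQUEST = 100  # assumption: maximum tweets returned per request

def max_tweets_per_window() -> int:
    """Upper bound on tweets retrievable in one 15-minute window."""
    return REQUESTS_PER_WINDOW * TWEETS_PER_REQUEST

print(max_tweets_per_window())  # 18000
```

In other words, under these assumptions a single set of credentials could page through at most 18,000 tweets per 15-minute window, which is one reason gathering large samples via the Search API takes time.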

Twitter states that the Search API is:

…focused on relevance and not completeness. This means that some Tweets and users may be missing from search results. If you want to match for completeness you should consider using a Streaming API instead (Twitter developers).

The Streaming API is a push of data as tweets occur in near real-time. However, Twitter only returns a small percentage of tweets. The tweets that are returned depend on various factors such as the demand on Twitter, and how broad/specific the search query is. Twitter states that the Streaming APIs:

…give developers low latency access to Twitter’s global stream of Tweet data. A proper implementation of a streaming client will be pushed messages indicating Tweets and other events have occurred, without any of the overhead associated with polling a REST endpoint. If your intention is to conduct singular searches, read user profile information, or post Tweets, consider using the REST APIs instead (Twitter developers).

The Firehose (which can be quite costly) provides all tweets in near real-time; however, unlike the Streaming API, there are no limitations on the number of search results provided. I won a historical data prize from DiscoverText, which provided me with access to 3 days’ worth of Firehose data. I selected this data to overlap with data I had gathered via the Streaming API (using Chorus) and the Search API (using Mozdeh).

This is what I found:

Table 1 – The number of tweets retrieved per API across three different tools

Tool          API           No. tweets
DiscoverText  Firehose API  195,713
Mozdeh        Search API    155,086
Chorus        Search API    145,348
Table 1 shows that searches with the keyword ‘Ebola’ gathered up to 79% (155,086) of all tweets via the Search API using Mozdeh, and 74% (145,348) of all tweets via the Search API using Chorus. The baseline, the complete set of 195,713 tweets (100%), was obtained via the Firehose using DiscoverText.
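Expressed as a quick sketch, the coverage percentages above come from dividing each tool’s sample by the Firehose baseline:

```python
# Coverage of each tool's sample relative to the Firehose baseline,
# using the figures from Table 1.
BASELINE = 195713  # Firehose via DiscoverText (100%)

samples = {
    "Mozdeh (Search API)": 155086,
    "Chorus (Search API)": 145348,
}

# Percentage of the complete tweet set captured by each sample
coverage = {tool: round(100 * n / BASELINE) for tool, n in samples.items()}
print(coverage)  # {'Mozdeh (Search API)': 79, 'Chorus (Search API)': 74}
```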

I produced three word clouds to examine the most frequent words across the three samples in order to investigate whether there were any major differences in word frequencies.

Word Cloud 1: 195,713 Ebola tweets via the Firehose API using DiscoverText, from the 3rd of January to the 5th of January:


Word Cloud 2: 155,086 Ebola tweets via Mozdeh using the Search API, from the 3rd of January to the 5th of January:


Word Cloud 3: 145,348 Ebola tweets via Chorus using the Search API, from the 3rd of January to the 5th of January:


The word clouds provide a visual representation of the samples in terms of word frequency, i.e., the more frequent a word is, the bigger it appears in the word cloud. These word clouds contain words such as ‘nurse’, ‘critical’, and ‘condition’, as within this time period a nurse (Pauline Cafferkey) suffering from Ebola in the U.K. had fallen into critical condition. The word clouds are very similar across the different tools and APIs. This may be because, as Twitter’s senior partner engineer Taylor Singletary suggested in a forum post, the sample stream via the Streaming API is a random sample of all tweets available on the platform (Gerlitz and Rieder, 2013).
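The frequency counts underlying a word cloud can be sketched in a few lines of Python; the tweet texts below are illustrative, not drawn from the actual dataset:

```python
from collections import Counter

# Illustrative tweets (not from the dataset); a word cloud sizes each
# word by its frequency across all texts.
tweets = [
    "Nurse with Ebola in critical condition",
    "UK nurse in critical condition after Ebola diagnosis",
]

# Lowercase and split on whitespace, then tally every word
counts = Counter(word for tweet in tweets for word in tweet.lower().split())
print(counts.most_common(3))
```

A real pipeline would also strip punctuation and stop words before counting, but the core of a word cloud is exactly this kind of tally.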

These results suggest that if you use a limited number of search queries and gather data over a relatively short period of time, Twitter will provide a fair proportion of tweets, and depending on the research question of a project this may be sufficient. However, González-Bailón et al. (2014) have found that the structure of samples may be affected by both the type of API and the number of hashtags used to retrieve the data. Therefore, depending on the number of keywords and hashtags used, the number of tweets retrieved is likely to vary. All of this may change as Twitter introduces potential adjustments to the Streaming API. These results form part of a larger project which has ethics approval.

Edit 08/07/15

As Dr. Timothy Cribbin has pointed out in the comments, Chorus uses the Search API and not the Streaming API, as previously stated in this blog post. Although the comparison therefore spans two APIs rather than three, I hope it is still interesting.


I am very grateful to Dr Farida Vis for her expert guidance & advice, for providing me with the literature, and for the various discussions on Twitter APIs.

References and further reading

Gaffney, D., & Puschmann, C. (2014). Data collection on Twitter. In K. Weller, A. Bruns, J. Burgess, M. Mahrt, & C. Puschmann (Eds.), Twitter and Society (pp. 55–67). New York, NY: Peter Lang.

Gerlitz, C., & Rieder, B. (2013). Mining one percent of Twitter: Collections, baselines, sampling. M/C Journal, 16(2).

González-Bailón, S., Wang, N., Rivero, A., Borge-Holthoefer, J., & Moreno, Y. (2014). Assessing the bias in samples of large online networks. Social Networks, 38, 16–27. doi:10.1016/j.socnet.2014.01.004

Morstatter, F., Pfeffer, J., Liu, H., & Carley, K. (2013). Is the sample good enough? Comparing data from Twitter’s Streaming API with Twitter’s Firehose. In Proceedings of ICWSM.

Algorithmic Visibility at the Selfie Citizenship Workshop

The Selfie Citizenship Workshop was held on the 16th of April at the Digital Innovation Centre at Manchester Metropolitan University, and brought together researchers from various disciplines, fields, and backgrounds to explore the notion of ‘selfie citizenship’ and how the selfie has been used for acts of citizenship. The event was very well tweeted using the hashtag #selfiecitizenship and generated over 400 tweets during the day; a network analysis of tweets at the event can be seen here. The event was sponsored by the Visual Social Media Lab, Manchester School of Art, Digital Innovation, and the Institute of Humanities and Social Science Research.


The talk that stood out to me most was by Dr Farida Vis, titled Algorithmic Visibility: EdgeRank, Selfies and the Networked Photograph. The reason for this is that I once wrote a blog post briefly outlining Farida’s talk on algorithmic culture at the Digital Culture Conference: Improving Reality.

The talk at this workshop centred on an image that Farida saw pop up in her Facebook news feed. The image was shown to her because one of her friends had commented on the picture; due to this perceived close tie (that is, because they were Facebook friends), the image was also shown to her. The image was of an Egyptian protester displaying solidarity with Occupy Oakland by holding a homemade cardboard sign with the caption ‘from Egypt to wall street don’t afraid Go ahead #occupyoakland, #ows’.

Occupy Wall Street (OWS) refers to the protest movement which began on September 17th, 2011, in Zuccotti Park, in New York City’s Wall Street financial district. The movement received global attention, which led to an international Occupy movement against social and economic inequality across the world; hence an Egyptian protester holding a sign with both the #occupyoakland and #ows hashtags.

The image left an impression on her, especially its composition; the sign and the man’s face, presumably inviting us to look at him. Months later she attempted to locate the image, and was surprised to find she could not find it anywhere on her friend’s wall. It was as if she had never seen the image in the first place. She asked, then, how do people locate images on social media? That is to say, if you see an image but do not initially retrieve it, and are then unable to find it again, how would you locate it? In this case, she knew the image was about the Occupy movement and related to Egypt; she combined these as search queries and, with some detective work, was able to locate the image.

She found that the photographer had uploaded a range of images to a Facebook album, and that there was an image similar to the one she was searching for, but in which the protester had his eyes closed. Surprisingly, this image had exactly the same number of likes as, and more shares than, the original image. However, this series of similar images from the same protest was not made visible to her. She argued that we should think critically and carefully about the different structures for organising images, which can vary across platforms, and about how images are made visible to us.

That is, for example, how does EdgeRank decide which image to show us? EdgeRank is the name given to the algorithm that Facebook once used to decide which stories should be displayed in a user’s news feed. Facebook no longer refers to its algorithm as EdgeRank internally, but now employs a more complex news feed algorithm. EdgeRank ranked three elements: affinity, weight, and time decay. The current algorithm, which does not have a catchy name, takes into account over 100,000 factors in addition to EdgeRank’s original three. I would argue that just understanding what an algorithm is, in this instance, is difficult. Then, when you attempt to understand the workings behind the algorithm, you find that this is not possible, as the methods Facebook uses to adjust the parameters of the algorithm are proprietary and not available to the public. Moreover, if we do not question how images are made visible, then we take the images we see as a given.

Algorithms can also get it wrong; take the example of the Facebook Year in Review feature, which received much press coverage after displaying to one user a photograph of his recently deceased daughter, to another a picture of his father’s ashes, and in one case showing a user a picture of their deceased dog.

This was raised in one of the Q&As: changes in features on social media need to be better documented. This is important in this context, as the image was in a Facebook album, a feature that is not used as widely today. In my own work, for example, I have found that Twitter has implemented several new features, which are difficult to document and to connect back to data sets where these features were not present. A further point raised in the Q&As that I thought was interesting was that of Twitter users ‘hacking’ the platform in order to share Instagram images on Twitter after Instagram images stopped appearing there; IFTTT, for example, allows users to connect Instagram to Twitter.

Overall, I thought the talk highlighted very well that it is important to think about the conditions in which an image may be shown to us, and to also think about what is not shown to us. As a social media user and a Facebook user I see images, videos, links pop up on my news feed. I had not given much thought to the conditions for their visibility, or that an algorithm taking into account over 100,000 factors was deciding what would appear on my news feed.

Pandemics and epidemics on Twitter

Predictions that a global pandemic could wipe out a large percentage of the population are regarded as a genuine threat, and it was recently reported that an outbreak of a drug-resistant infection could kill 80,000 people in the UK.


A man protests for the mandatory quarantine of everyone that has returned from Ebola affected countries in front of the White House in Washington, D.C., on Oct.24, 2014. Photographer: Mark Wilson/Getty Images.

In terms of a threat that has recently passed, think back to September of last year, at the peak of the Ebola outbreak, when conversations about the virus on Twitter started to increase due to the accumulation of news reports and sensationalised headlines similar to the one above.

The actual threat, however, as opposed to the perceived public threat, remained low. Fear and hysteria may not allow people to think or act logically during an outbreak, so it is crucial to have an awareness of how people communicate about an infectious disease. Using real-time data from Twitter, it is possible for researchers to gauge public opinion on infectious disease outbreaks.

Why use Twitter?

Twitter feasibly offers researchers millions of views on an outbreak that are available in real-time. This allows the examination of how a subset of the population may react to an infectious disease outbreak. There is ongoing research on why people may have negative views towards vaccines, for example, as this could affect the spread of a disease.

Picture:  Getty Images

Gauging public opinion at the precise time of an outbreak may not be feasible using traditional methods, as designing a survey or questionnaire is an expensive and time-consuming process. Most research suggests, though, that data from Twitter is best used in combination with traditional methods rather than as a substitute, especially for research that predicts the occurrence of an infectious disease.

Challenges of using Twitter

On the other hand, not all adult internet users are on Twitter, although the proportion who are is increasing. According to the Pew Research Center, 23% of adult internet users also use Twitter (up from 18% in 2013), which is 19% of the entire adult population. Twitter is most popular with those who are under 50 and college educated.

When these figures are compared to Facebook, Twitter does not stack up well: 71% of adult internet users are on Facebook, which is 58% of the entire adult population, and unlike Twitter, Facebook is also widely used by internet users aged 65 and over. Those who tweet about outbreaks may be over-represented relative to the national offline population, but these people may be under-represented in survey data.

It is also difficult to obtain Twitter data, as Twitter only provides a sample of data to researchers, and obtaining full Twitter data can be quite costly for small to medium-sized research groups. There are also issues surrounding spam on the platform, and developing methods of filtering out useful content can be quite challenging.

Current research on Infectious diseases using Twitter

Current research on infectious disease outbreaks suggests that Twitter offers a method of understanding what a subset of the population communicates about in real-time, the misconceptions that people may hold, and whether these will be harmful in a public health epidemic or pandemic.


A man dressed in protective hazmat clothing leaves after treating a nurse in Texas who was diagnosed with the Ebola virus. Photographer: Mike Stone/Getty Images

Specifically on the Ebola outbreak, early research indicates that there may have been medical misinformation present on the platform regarding vaccines, the role of health officials, and the cure and transmission of Ebola.

My own research involves using Twitter data related to the Ebola outbreak to better understand the content on the platform, how people communicate about Ebola, and to examine the types of information that are present on the platform.

Research teams are now developing better methods of analysing social media data, so this type of research will become more sophisticated in the future.

Language frequency of Ebola tweets

Ebola is quite a unique keyword for an infectious disease, in comparison to Bird Flu or Swine Flu, for example, where developing search queries may be difficult. In the case of Ebola, using the keyword on its own has, for me, been sufficient to gather an enormous number of tweets. Of the languages supported on Twitter, ‘Ebola’ is used across 15 languages, while 7 languages have their own translation, as shown in the table below:

Language                                                                                                                          Keyword
English, German, Spanish, Portuguese, French, Italian, Dutch, Turkish, Hungarian, Swedish, Polish, Danish, Norwegian, Finnish, Hindi  Use ‘Ebola’
Russian, Japanese, Arabic, Korean, Thai, Urdu, Farsi                                                                                  Different keyword

I found that my sample of tweets contains languages which have their own translation of Ebola, as Twitter users may opt to use ‘Ebola’ rather than their own translation. For example, Russian tweeters may use ‘Ebola’ rather than ‘Эбола’.
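A matching rule reflecting this can be sketched as follows; the `matches_ebola` function and the single-entry translation table are illustrative (only the Russian translation from the paragraph above is included):

```python
# Match a tweet if it contains the Latin keyword or, failing that, the
# language's own translation. The table is illustrative, not exhaustive:
# only the Russian translation mentioned in the post is included.
TRANSLATIONS = {"ru": "эбола"}

def matches_ebola(text: str, lang: str) -> bool:
    text = text.lower()
    if "ebola" in text:
        # Covers users who tweet 'Ebola' regardless of their language
        return True
    translation = TRANSLATIONS.get(lang)
    return translation is not None and translation in text

print(matches_ebola("Вирус Эбола распространяется", "ru"))  # True
print(matches_ebola("Ebola outbreak update", "en"))         # True
```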

In order to examine the percentage of English tweets relative to those in other languages, I gathered over a million tweets using Mozdeh, which uses Twitter’s Search API. The tweets were gathered over an 11-day period starting on the 27th of November and ending on the 7th of December 2014.

I used the language metadata to work out the frequencies of these using SPSS, and I have created a table to show the different languages:

Language  Frequency (%)
English 632112  (62.3)
Spanish 220566 (21.8)
Portuguese 59774 (5.9)
French 42242 (4.2)
Italian 20645 (2.0)
Dutch 12698 (1.3)
Turkish 5099 (0.5)
German 4899 (0.5)
Russian* 2267 (0.2)
Hungarian 1854 (0.2)
Swedish 1779 (0.2)
Japanese* 1649 (0.2)
Polish 1362 (0.1)
Arabic* 1303 (0.1)
Danish 586 (0.1)
Norwegian 465 (0.0)
Finnish 405 (0.0)
Korean 366 (0.0)
Hindi 187 (0.0)
Thai* 170 (0.0)
Urdu* 116 (0.0)
Farsi* 36 (0.0)
Total 1010580
Missing** 37995
Total 1048575

*These languages have their own translation of ‘Ebola’, but users have still chosen to use ‘Ebola’.
**Not all tweets have language identifiers 
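The tallying step done in SPSS can be sketched in Python; the language codes below are illustrative, not the actual dataset:

```python
from collections import Counter

# Tally the per-tweet language identifier, treating missing identifiers
# separately, as in the table above. The codes here are illustrative.
langs = ["en", "en", "es", "en", "pt", None, "es"]  # None = no identifier

counted = Counter(l for l in langs if l is not None)
missing = sum(1 for l in langs if l is None)
total = sum(counted.values())

# Print each language with its share of identified tweets
for lang, n in counted.most_common():
    print(f"{lang}: {n} ({100 * n / total:.1f}%)")
print(f"missing: {missing}")
```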

The keyword ‘Ebola’ was picked up in 22 of the 29 languages that Twitter supports. It is interesting to note that 62.3% of Ebola tweets were in English, Spanish tweets were the second most frequent (21.8%), and Portuguese tweets the third (5.9%). For my PhD research I am focusing on English-language tweets, and this type of analysis tells me that there are a sufficient number of English-language tweets related to the Ebola epidemic.

A limitation, however, is that I was only able to draw up frequencies for languages that are ‘supported’ by Twitter, for which there is metadata, and not for languages which lack language identifiers, such as Sub-Saharan African languages.

In the next post I will look at the number of tweets on Ebola that have geolocation data and cross-tabulate these with language identifiers. These results form a part of a larger project which has ethics approval.

Twitter data capture tools from a usability perspective

In a blog post comment I was asked which tools are good from a usability and interface perspective, and I thought this would make for a good blog post. The tools covered here were recommended to me by my PhD supervisor. Many of these tools have existing guides, videos, or instructional tutorials, and rather than provide my own I have provided the links to these.

Users of these tools are reminded that data obtained via them should be used in a fair and responsible manner, which means adhering to Twitter’s Rules of the Road as well as applicable ethical codes of practice and data protection laws.

TAGS (Twitter Archiving Google Sheet)

System: TAGS is a Web based tool so it will work on most operating systems.

Download TAGS: https://tags.hawksey.info/get-tags/

TAGS Support Forums: https://tags.hawksey.info/forums/


Mozdeh

System: Mozdeh only works on Windows and it is advisable to use a desktop computer (there are 32-bit and 64-bit versions).

Download Mozdeh: http://mozdeh.wlv.ac.uk/installation.html

Mozdeh User Guide: http://mozdeh.wlv.ac.uk/resources/MozdehManual.docx

Mozdeh Theoretical overview: http://mozdeh.wlv.ac.uk/resources/TwitterTimeSeriesAndSentimentAnalysis.pdf

Twitter query set generation with Mozdeh: http://mozdeh.wlv.ac.uk/TwitterQuerySetGeneration.html


Chorus

System: Chorus only runs on Windows. It is also advisable to use Chorus with a desktop computer.

Request to download Chorus: http://chorusanalytics.co.uk/chorus/request_download.php

Chorus Tweetcatcher Desktop manual:  http://chorusanalytics.co.uk/manuals/Chorus-TCD_usermanual.pdf

YouTube tutorial: https://www.youtube.com/watch?v=KmCrmiBOOvw

I made another list a while back ‘A list of tools to capture Twitter data’ at: https://wasimahmed1.wordpress.com/2015/01/30/a-list-of-tools-to-capture-twitter-data/ 

Also be sure to check out Dr Deen Freelon’s curated list at: https://docs.google.com/document/d/1UaERzROI986HqcwrBDLaqGG8X_lYwctj6ek6ryqDOiQ/edit

You can catch me on Twitter @was3210 

Using Twitter to gain an insight into public views and opinions for the Ebola epidemic

The World Health Organisation writes that Ebola, a haemorrhagic fever, is a severe and often fatal illness with an average fatality rate of 50%. The first outbreak of Ebola occurred in 1976. The first case of Ebola outside of West Africa was reported in the U.S. on September 19th 2014. The current Ebola outbreak has taken more lives and infected more people than all the other outbreaks combined, and Twitter provides a platform for people to express their views and opinions on Ebola.

Chew and Eysenbach, for example, used Twitter to monitor mentions of Swine Flu during the 2009 pandemic. They found that Twitter provided health authorities with the potential to become aware of concerns raised by the public. Similarly, Szomszor, Kostkova, and St Louis examined Swine Flu on Twitter and found that Twitter offers the ability to sample large populations for health sentiment (public views and opinions). Signorini, Segre, and Polgreen also found that by using Twitter it was possible to understand users’ interests and concerns during the Swine Flu outbreak.

In 2010, Chew and Eysenbach wrote that Swine Flu was the first global pandemic to occur in the age of Web 2.0, and argued that this was a unique opportunity to investigate the role of technology in public health. Fast forward to the current outbreak: this is the first time a global outbreak of Ebola has occurred in the age of Web 2.0. And as the number of Twitter users has increased since 2010, there is the possibility of examining the recent Ebola outbreak on a larger scale.

In relation to the Ebola outbreak on Twitter, a study by Oyeyemi, Gabarron, and Wynn published last year examined misinformation about Ebola on Twitter. This study found that the most common types of misinformation were that ingesting the plant ‘ewedu’, blood transfusions, or drinking salt water could cure Ebola. Another study by Jin et al., also published last year, found conspiracy theories, innuendos, and rumours on Twitter related to Ebola. Jin et al. looked at the period between late September and late October 2014. Among the rumours reported were that the Ebola vaccine only worked on white people, that Ebola patients had risen from the dead, and that terrorists would contract Ebola and spread it around the world.

Therefore, Twitter has the potential to provide insight into public views and opinions related to the Ebola outbreak, which would allow health authorities to become aware of the public concerns. Furthermore, by examining the rumours related to Ebola health authorities will be able to dispel false information via new or existing health campaigns.

In the next post I will examine the language dynamics of tweets related to Ebola.


I would like to thank Jennifer Salter, from the health informatics research group, for reading and providing extremely valuable feedback on an earlier version of this blog post.


Chew, C., & Eysenbach, G. (2010). Pandemics in the age of Twitter: Content analysis of tweets during the 2009 H1N1 outbreak. PLOS ONE, 5(11).

Jin, F., Wang, W., Zhao, L., Dougherty, E., Cao, Y., Lu, C.-T., & Ramakrishnan, N. (2014). Misinformation propagation in the age of Twitter. Computer, 47(12), 90–94. doi:10.1109/MC.2014.361

Signorini, A., Segre, A. M., & Polgreen, P. M. (2011). The use of Twitter to track levels of disease activity and public concern in the U.S. during the influenza A H1N1 pandemic. PLoS ONE, 6(5), e19467. doi:10.1371/journal.pone.0019467

Szomszor, M., Kostkova, P., & St Louis, C. (2011). Twitter informatics: Tracking and understanding public reaction during the 2009 Swine Flu pandemic. In Proceedings – 2011 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2011 (Vol. 1, pp. 320–323). doi:10.1109/WI-IAT.2011.311

WHO. (2015). WHO | Ebola virus disease. [ONLINE] Available at: http://www.who.int/mediacentre/factsheets/fs103/en/ [Last accessed 20/01/2015].

Oyeyemi, S. O., Gabarron, E., & Wynn, R. (2014). Ebola, Twitter, and misinformation: A dangerous combination? BMJ, 349, g6178.

An outline of upcoming blog posts

Starting this week, I’m going to be posting about my PhD research. I’m currently looking at Twitter to better understand public views and opinions related to the Ebola outbreak. I have gathered tweets on Ebola using both open-source and industry-specific software, and have monitored the international news coverage of Ebola very carefully. I have a series of blog posts lined up which will cover some of the following topics:

  • Using Twitter to gather public views and opinions on Ebola
  • The different languages people use to Tweet about Ebola
  • The number of tweets on Ebola that have geolocation data
  • The number of Ebola tweets that have geolocation and language identifiers
  • A comparison of Ebola tweets with geolocation data across different APIs
  • Popular hashtags, TAG and word clouds on Ebola for Firehose data
  • TAG and word cloud comparisons across the REST, Streaming, and Firehose APIs
  • Network analysis using NodeXL

A list of tools to capture Twitter data

A list of tools that I have used to capture data from Twitter and which worked:




Netlytic: https://netlytic.org

Facepager: http://www.ls1.ifkw.uni-muenchen.de/personen/wiss_ma/keyling_till/software.html

Twython at: https://github.com/ryanmcgrath/twython

KNIME: https://www.knime.org/ with the Palladian extension (obtained via the app); instructions on setting it up are here: http://tech.knime.org/wiki/how-to-get-twitter-data-into-knime. However, I could not figure out a way to extract the tweets with Palladian, so using the Twitter nodes provided by KNIME is much better; the instructions on setting these up are here: http://www.knime.org/blog/knime-twitter-nodes

NodeXL at: http://nodexl.codeplex.com/

Visibrain (Commercial): http://www.visibrain.com/en/

More tools:

Nvivo/Ncapture at: http://www.qsrinternational.com/products_nvivo_add-ons.aspx

TweetMapper at: http://tweetmapper.us

Twitonomy at: http://www.twitonomy.com

Webometrics at: http://lexiurl.wlv.ac.uk/index.html

Follow the Hashtag at: http://analytics.followthehashtag.com/#/

iScience Maps at: http://tweetminer.eu

More tools (these require programming knowledge) from Deen Freelon’s curated list at: https://docs.google.com/document/d/1UaERzROI986HqcwrBDLaqGG8X_lYwctj6ek6ryqDOiQ/edit (it is a great list and I make sure to add to it):

DMI-TCAT at: https://github.com/digitalmethodsinitiative/dmi-tcat

yourTwapperKeeper at: https://github.com/540co/yourTwapperKeeper

140dev at: http://140dev.com/

Hosebird at: https://github.com/twitter/hbc

Pattern at: http://www.clips.ua.ac.be/pattern

poll.emic at: https://github.com/sbenthall/poll.emic

Social Feed Manager at: http://gwu-libraries.github.io/social-feed-manager/

SocialMediaMineR at: http://cran.r-project.org/web/packages/SocialMediaMineR/

streamR at: http://cran.r-project.org/web/packages/streamR/index.html

tStreamingArchiver at: https://github.com/brendam/tStreamingArchiver

twarc at: https://github.com/edsu/twarc

tweepy at: https://github.com/tweepy/tweepy

twitteR at: http://cran.r-project.org/web/packages/twitteR/index.html

Twitter-Tap at: https://github.com/janezkranjc/twitter-tap

Twitter Stream Downloader at: https://github.com/mdredze/twitter_stream_downloader

TWurl at: https://github.com/twitter/twurl


Also be sure to check out Dr Deen Freelon’s curated list at: https://docs.google.com/document/d/1UaERzROI986HqcwrBDLaqGG8X_lYwctj6ek6ryqDOiQ/edit You can catch me on Twitter @was3210

Almost 6 months of PhD!

My six-month progress report is due in soon, so I decided to do a blog post about some of the topics and issues I have encountered, and with which I am currently battling. I am looking at pandemics and epidemics on Web 2.0. More recently, however, I have been investigating the Ebola epidemic, and I have been collecting Ebola-related tweets.

Big data

Big data is a current buzzword within academia and is considered by some to be the new oil. However, keeping with the oil analogy, is it real oil or snake oil? This issue was chronicled by Simon Moss in a Wired article, ‘Big Data: New Oil or Snake Oil?’, where he discusses the issue of normalising big data in an organisational sense. My issue is that of information quality: the data is big but, at times, of poor quality. When the data is filtered it is not as big as it once was, and so it becomes little data. However, this small data is much more valuable than the larger set it came from.


Ethics

Ethical issues are ever present in social media research. The argument in favour of using Web 2.0 data for research centres on whether the data is in the public domain, which raises questions about informed consent. Do Twitter users know that I am gathering this data? If I asked for consent for a tweet on Ebola that I captured in August, would I even get a reply? There is a sense, as a Twitter user, that once you send a tweet out, after a while it goes away. Thus, it is imperative that Twitter users are involved in the decision process when discussing ethical issues. This was discussed at a conference I attended in November, Picturing the Social: Analysing Social Media Images.


Algorithms

I recently viewed a talk by Farida Vis which formed part of the digital culture conference, Improving Reality. Farida provided a well-articulated example of the human influence on an algorithm: an advert on Facebook promoting an assisted reproduction programme, with a picture of a baby. She argues that this reflects how those who programmed the algorithm understood gender-normative issues; that is, those who wrote the code held a schema whereby they believed a woman of a certain age should have children. More recently, on Twitter I saw an advert for a laptop with the caption ‘Costs less than what you spent on pizza last year’, which drew livid responses, e.g. ‘Twitter what are you trying to say?’ This advert could have been targeted at all users, so it may not be the best example of a targeted algorithm. A further example is the adverts for educational courses I saw on Facebook before I started university, which leads to the question of how much influence social media has on young adults. There is also scope to examine how websites such as Amazon create suggestions: how does their algorithm work, and where do human schemas fit into this?


Methods

When talking about methods there is a tendency to select either a quantitative or a qualitative research philosophy. However, in regards to social media research, a mixed-methods approach will yield richer results; that is, a method such as network analysis should be complemented with content analysis. If we limit ourselves to a particular research philosophy we will learn less from the data, so I hope to employ a range of methods in analysing my own data. A related issue is the cost of big data: it is certainly out of reach for most academics, and this is further exacerbated by stringent terms and conditions which restrict data sharing. Whether the data is available for free, or whether there is a tool to obtain it, is also shaping the platforms I look at.


Images

In my dataset of tweets, images occur with great frequency and are often represented as a block of web links when scrolling down a spreadsheet. When I start to filter the dataset, should I remove these links? An observation of big data is that it is associated with words and not images. However, in regards to images on Twitter, I would argue that they form a larger network of big data. According to one estimate, 250 million images are shared on Twitter daily, yet these are overlooked in the majority of Twitter research. During the 2009/2010 H1N1 epidemic, and the various subsequent outbreaks, images must have been shared on Twitter, and they would have formed an integral part of how a person may subsequently think about outbreaks. However, there was no evidence-based research examining these images. Comparing images from different time points allows us to see whether the narratives told via images remain the same or change.

In text references:

The Wired News Article I mentioned can be found here: http://www.wired.com/2014/10/big-data-new-oil-or-snake-oil/

The talk by Farida Vis on algorithmic culture I mentioned can be found here: https://www.youtube.com/watch?v=WBXddqzIZTA

[Edited on 26/01/15]

Started a blog today

Started the blog (08/12/14).
