Language frequency of Ebola tweets

‘Ebola’ is an unusually distinctive word for an infectious disease. By comparison, developing search queries for Bird Flu or Swine Flu can be difficult. In my experience, using the keyword ‘Ebola’ on its own has been sufficient to gather an enormous number of tweets. Among the languages supported on Twitter, ‘Ebola’ is used in 15, while 7 have their own translation, as shown in the table below:

Language: Key word
English, German, Spanish, Portuguese, French, Italian, Dutch, Turkish, Hungarian, Swedish, Polish, Danish, Norwegian, Finnish, Hindi: Use ‘Ebola’
Russian, Japanese, Arabic, Korean, Thai, Urdu, Farsi: Different keyword

I found that my sample of tweets includes languages that have their own translation of ‘Ebola’, because Twitter users may opt to use ‘Ebola’ rather than the translated form. For example, Russian tweeters may use ‘Ebola’ rather than ‘Эбола’.
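To make the point concrete, here is a minimal Python sketch of how one might check which form of the keyword a tweet uses. The function name `keyword_forms` and the example tweets are hypothetical and not part of my actual workflow. It is a plain substring check, so inflected Russian forms would not be caught.

```python
def keyword_forms(text, forms=("ebola", "эбола")):
    """Return the keyword forms found in a tweet's text.

    Matching is case-insensitive and uses simple substring search,
    so it is only a rough check (inflected forms are not handled).
    """
    lowered = text.casefold()
    return [form for form in forms if form in lowered]


# Hypothetical example tweets, invented for illustration only.
print(keyword_forms("Вирус Ebola в новостях"))
print(keyword_forms("Лихорадка Эбола"))
```

A tweet that Twitter tags as Russian but which only matches the Latin-script form is the kind of case described above.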

To examine the percentage of English tweets relative to those in other languages, I gathered over a million tweets using Mozdeh, which uses Twitter’s Search API. The tweets were gathered over an 11-day period, from the 27th of November to the 7th of December 2014.

I used the language metadata to work out the frequency of each language in SPSS, and created a table to show the breakdown:

Language Breakdown
Language Frequency (%)
English 632112 (62.3)
Spanish 220566 (21.8)
Portuguese 59774 (5.9)
French 42242 (4.2)
Italian 20645 (2.0)
Dutch 12698 (1.3)
Turkish 5099 (0.5)
German 4899 (0.5)
Russian* 2267 (0.2)
Hungarian 1854 (0.2)
Swedish 1779 (0.2)
Japanese* 1649 (0.2)
Polish 1362 (0.1)
Arabic* 1303 (0.1)
Danish 586 (0.1)
Norwegian 465 (0.0)
Finnish 405 (0.0)
Korean* 366 (0.0)
Hindi 187 (0.0)
Thai* 170 (0.0)
Urdu* 116 (0.0)
Farsi* 36 (0.0)
Total (with language identifier) 1010580
Missing** 37995
Total (all tweets) 1048575

*These languages have their own translation of ‘Ebola’, but users have still chosen to use ‘Ebola’.
**Not all tweets have language identifiers.
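I produced the table in SPSS, but the same kind of breakdown can be sketched in a few lines of Python. This is only an illustration under some assumptions: the function name `language_frequencies` is hypothetical, the sample list of language codes is made up rather than taken from my data, and a tweet without a language identifier is assumed to be represented as None or an empty string. Percentages are calculated from the tweets that do have an identifier, with missing tweets counted separately.

```python
from collections import Counter


def language_frequencies(lang_codes):
    """Build a frequency table from a list of language codes.

    Returns (rows, valid_total, missing), where each row is
    (code, count, percent of tweets that have an identifier).
    None or "" is treated as a missing identifier.
    """
    valid = [code for code in lang_codes if code]
    missing = len(lang_codes) - len(valid)
    valid_total = len(valid)
    counts = Counter(valid)
    rows = [
        (code, n, round(100 * n / valid_total, 1))
        for code, n in counts.most_common()  # most frequent first
    ]
    return rows, valid_total, missing


# Made-up sample for illustration; not the real dataset.
sample = ["en", "en", "es", None, "en", "pt", "es", "", "en"]
rows, valid_total, missing = language_frequencies(sample)
for code, n, pct in rows:
    print(f"{code} {n} ({pct})")
print(f"Total {valid_total}")
print(f"Missing {missing}")
```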

The keyword ‘Ebola’ was picked up in 22 of the 29 languages that Twitter supports. English accounts for 62.3% of Ebola tweets, followed by Spanish (21.8%) and Portuguese (5.9%). My PhD research focuses on English-language tweets, and this type of analysis tells me that there is a sufficient number of English-language tweets related to the Ebola epidemic.

A limitation, however, is that I was only able to draw up frequencies for languages that are ‘supported’ by Twitter, for which there is metadata, and not for languages without language identifiers, such as Sub-Saharan African languages.

In the next post, I will look at the number of Ebola tweets that have geolocation data and cross-tabulate these with language identifiers. These results form part of a larger project, which has ethics approval.

