
It’s exactly 21 days to the presidential elections in Kenya, and being Kenyan gives me the chance to write about them, with some level of confidence, the data way. The leading contenders for the top seat are arguably Uhuru Kenyatta and Raila Odinga. This led me to look at whether their tweeting patterns and content reflect how Kenyans perceive them. Part 1 of this article focuses solely on Uhuru Kenyatta.

Data science is fast gaining momentum in the field of politics. I chose the same path to look at how leaders, and those aspiring to lead, tweet in Kenya. Are they as abrasive as they are at political rallies, or does their tone vary when they are online?

Collecting a data set is always the first step, and what better platform for the experiment than Twitter? Kenya is a hotbed of social media exchanges, more so among KOT (Kenyans on Twitter). Using my Twitter developer API keys, I collected 184 tweets from Uhuru Kenyatta’s timeline and converted them to a data frame for easier manipulation, such as conversion to vector format. Sample output is below:

[Figure: sample of the collected tweets (Tweetsample_Kenyatta)]

There are three tweeting sources, as the handle @UKenyatta disseminates tweets from Android, iPad and Web Client devices. Our interest is in identifying which of the three Uhuru Kenyatta personally tweets from. This is hard but possible, as it can be inferred from the tweeting times as well as the content (which is a bit harder). We therefore assume that Uhuru’s campaign team tweets from two of the sources and that Uhuru tweets personally from the remaining one. The tweeting times in the graph below suggest that “he” actually tweets from the Android device. “He” tweets quite early in the morning, the rate goes down in the afternoon, and “he” starts tweeting again late at night. Tweets from the iPad and Web Client are dominant in the late mornings and afternoons and fizzle out at night. This supports, though it does not prove, the assumption that “he” may just be the one tweeting from the Android device.

[Figure: tweeting times by source (tweeting_times_uk)]
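For readers who want to reproduce the time-of-day comparison, here is a minimal Python sketch (the analysis in this post was done in R, so this is not the original code). The rows in `tweets` are hypothetical placeholders, not real tweets, and the sketch assumes the timestamps have already been converted to local time.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample rows (source, created_at); NOT real tweets.
tweets = [
    ("Android", "2017-07-10 06:15:00"),
    ("Android", "2017-07-10 23:40:00"),
    ("iPad", "2017-07-10 11:05:00"),
    ("Web Client", "2017-07-10 14:30:00"),
    ("Android", "2017-07-11 06:50:00"),
]

def hourly_share(rows):
    """Return {source: {hour: share of that source's tweets sent in that hour}}."""
    counts = Counter()   # tweets per (source, hour)
    totals = Counter()   # tweets per source
    for source, created_at in rows:
        hour = datetime.strptime(created_at, "%Y-%m-%d %H:%M:%S").hour
        counts[(source, hour)] += 1
        totals[source] += 1
    shares = {}
    for (source, hour), n in counts.items():
        shares.setdefault(source, {})[hour] = n / totals[source]
    return shares

shares = hourly_share(tweets)
```

Using shares rather than raw counts keeps the sources comparable even when one device tweets far more often than the others.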

I used the tidytext R package to clean the dataset, as tweets are noisy in nature. Removal of stop words, lemmatization and removal of numerals, among other steps, were carried out. The most common words in the entire corpus are summarized below:

[Figure: most common words in the corpus (word_co_occurances_uk)]
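The cleaning step can be approximated in a few lines of Python. This is only a rough sketch and not the tidytext workflow used for this post: the stop-word list is a tiny illustrative one, lemmatization is left out, and the example tweets are made up.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real analysis would use a full one.
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "in", "for",
              "is", "we", "our", "will", "rt", "amp"}

def tokenize(text):
    """Lowercase a tweet, drop links, keep letter-only tokens, remove stop words."""
    text = re.sub(r"https?://\S+", " ", text)   # strip URLs first
    words = re.findall(r"[a-z]+", text.lower())  # letters only, so numerals drop out
    return [w for w in words if w not in STOP_WORDS]

def word_counts(texts):
    """Count cleaned tokens across a list of tweets."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts

# Made-up example tweets, for illustration only.
sample = [
    "We will deliver 5 new roads in 2017 https://t.co/abc",
    "New roads for the people",
]
counts = word_counts(sample)
```

Calling `counts.most_common(10)` then gives the kind of ranking shown in the word-frequency figure.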

Another dimension supporting our assumption that “he” tweets from an Android device is the absence of attached images in tweets from the Android device vis-à-vis the Web Client and iPad. The infographic below paints a clearer picture of this.

[Figure: tweets with pictures per source (pictures_per_account)]

Another dimension of interest is the choice of words from the @UKenyatta account depending on the source. I therefore looked at how much the words from the different sources converge, which is simply a measure of commonality between the terms. We’ll consider which words were most common from Uhuru’s Android device relative to the iPad, as we’ve established that “he” may just be tweeting personally from the Android one. The best way to measure this is to calculate the log odds ratio of each word between the two sources (Android and iPad). The formula is as below:

log2( ((No. in Android + 1) / (Total No. in Android + 1)) / ((No. in iPad + 1) / (Total No. in iPad + 1)) )

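It’s basically the number of times a certain term appears in the Android-sourced set over the total number of words from the Android source, with one added to each so that words missing from a source do not produce a division by zero. The same is computed for the iPad, and the log of the ratio of the two is taken. The same comparison is repeated for the Web Client against both the Android and the iPad sources.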
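To make the calculation concrete, here is a small Python sketch of the log odds ratio described above. The function name and the word lists are made up for illustration; only the formula itself comes from this post.

```python
import math
from collections import Counter

def log_odds_ratio(android_words, ipad_words):
    """Per-word log2 odds ratio, Android relative to iPad, with add-one smoothing."""
    android = Counter(android_words)
    ipad = Counter(ipad_words)
    total_android = sum(android.values())
    total_ipad = sum(ipad.values())
    ratios = {}
    for word in set(android) | set(ipad):
        android_share = (android[word] + 1) / (total_android + 1)
        ipad_share = (ipad[word] + 1) / (total_ipad + 1)
        ratios[word] = math.log2(android_share / ipad_share)
    return ratios

# Made-up word lists, for illustration only.
ratios = log_odds_ratio(
    ["kenya", "kenya", "kenya"],
    ["kenya", "jubilee", "jubilee"],
)
```

A positive value means the word leans towards the Android source, a negative value means it leans towards the iPad, and a value near zero means both sources use it about equally.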

Overall, it’s a bit hard to clearly distinguish the tone, bearing in mind that most of the tweets focus on the campaign. The disparity between the three sources is therefore NOT clear-cut content-wise.

Sentiment Analysis: Anything of concern?

We’ll also look at the words from each source and their associated emotions. The NRC Word-Emotion Association lexicon is the preferred resource for making these associations. The lexicon associates words with 10 sentiments, namely anticipation, positive, negative, anger, disgust, fear, joy, sadness, surprise, and trust.

To measure the sentiments in the two sources, we count the words in relation to the sentiments in the NRC lexicon. A sample output is below:-

  source  sentiment    total_words words
  <chr>   <chr>              <int> <dbl>
1 Android anger                246     7
2 Android anticipation         246    15
3 Android disgust              246     3
4 Android fear                 246     8
5 Android joy                  246    19
6 Android negative             246    10
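A table like the one above can be built by joining each source’s words against the lexicon and counting. The Python sketch below shows the idea; the mini-lexicon is hypothetical and does NOT contain real NRC entries, and the word list is made up.

```python
from collections import Counter

# Hypothetical mini-lexicon for illustration only; NOT real NRC entries.
TOY_LEXICON = {
    "win": ["joy", "anticipation", "positive"],
    "fight": ["anger", "fear", "negative"],
    "peace": ["joy", "trust", "positive"],
}

def sentiment_counts(words_by_source, lexicon):
    """Return rows of (source, sentiment, total_words, words) per source."""
    rows = []
    for source, words in words_by_source.items():
        # A word contributes one count to every sentiment it is tagged with.
        counts = Counter(s for w in words for s in lexicon.get(w, []))
        for sentiment in sorted(counts):
            rows.append((source, sentiment, len(words), counts[sentiment]))
    return rows

# Made-up cleaned words, for illustration only.
rows = sentiment_counts({"Android": ["win", "fight", "peace", "kenya"]}, TOY_LEXICON)
```

Words that are missing from the lexicon, such as "kenya" here, still count towards `total_words` but add nothing to any sentiment.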

What about looking at how emotionally charged the terms in the Android-sourced tweets (assumed to be Uhuru Kenyatta’s) are compared with those from the iPad (assumed to be from Uhuru’s online team)? A Poisson test can be used to measure this difference in sentiments, with 95% confidence intervals. The output is as below:

[Figure: emotions in Android vs iPad tweets (emotions_uhuru)]
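For a feel of what such a test does, below is a minimal Python sketch of an exact two-sample comparison of Poisson rates. It conditions on the combined count, which turns the problem into a binomial test; this is a common way to run the comparison, but I am not claiming it is exactly what produced the figure above, and it returns only a rate ratio and p-value, not the confidence intervals. In the example call, the Android figures (7 anger words out of 246) come from the sample table, while the iPad figures are hypothetical.

```python
from math import comb

def poisson_two_sample(x1, t1, x2, t2):
    """Exact two-sided test that the rates x1/t1 and x2/t2 are equal.

    Under the null hypothesis, given n = x1 + x2, the count x1 follows
    Binomial(n, t1 / (t1 + t2)). Returns (rate_ratio, p_value).
    """
    n = x1 + x2
    p = t1 / (t1 + t2)
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[x1]
    # Sum the probability of every outcome at least as unlikely as the observed one.
    p_value = sum(q for q in pmf if q <= observed * (1 + 1e-7))
    rate_ratio = float("inf") if x2 == 0 else (x1 / t1) / (x2 / t2)
    return rate_ratio, min(1.0, p_value)

# Android: 7 anger words out of 246 (from the sample table).
# iPad: 5 out of 400 is a HYPOTHETICAL figure, used only to show the call.
ratio, p_value = poisson_two_sample(7, 246, 5, 400)
```

A rate ratio above 1 means the sentiment is relatively more frequent in the Android tweets; the p-value indicates whether a difference of that size could plausibly be down to chance.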

Disgust, anger and sadness are more prominent in the tweets from Uhuru’s Android device than in those from his iPad. This is an interesting observation, as emotions are expressed more explicitly in the Android tweets.

Part 2 of the post will be on Raila Odinga and whether he really tweets from his iPad and whether he is an emotionally charged person or not.


Uhuru Personally Tweets from an Android device; Raila insists on Change, so his iPad says (Part 1).
