Chapter 8 Sentiment Analysis
Each word can be assigned a sentiment, such as positive or negative, or a more specific emotion category such as happiness, joy, or fear. The sentiment of a word is best identified in the context of a particular problem. For example, when studying positive and negative affect, one can ask people to rate whether a word carries a positive or a negative meaning.
8.1 Word sentiment
Three sentiment dictionaries, or lexicons, with pre-identified word sentiment are widely used in the literature: AFINN, nrc, and bing.
8.1.1 AFINN
The AFINN lexicon assigns each word an integer score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment. AFINN includes 2,476 words in total. For a complete list with the scores see AFINN word list. Some example words and their associated scores are given below, followed by the words with the highest score (5) and the lowest score (-5).
get_sentiments("afinn")
## # A tibble: 2,476 x 2
## word score
## <chr> <int>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ... with 2,466 more rows
get_sentiments("afinn") %>% filter(score == 5)
## # A tibble: 5 x 2
## word score
## <chr> <int>
## 1 breathtaking 5
## 2 hurrah 5
## 3 outstanding 5
## 4 superb 5
## 5 thrilled 5
get_sentiments("afinn") %>% filter(score == -5)
## # A tibble: 16 x 2
## word score
## <chr> <int>
## 1 bastard -5
## 2 bastards -5
## 3 bitch -5
## 4 bitches -5
## 5 cock -5
## 6 cocksucker -5
## 7 cocksuckers -5
## 8 cunt -5
## 9 motherfucker -5
## 10 motherfucking -5
## 11 niggas -5
## 12 nigger -5
## 13 prick -5
## 14 slut -5
## 15 son-of-a-bitch -5
## 16 twat -5
8.1.2 nrc
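The nrc lexicon categorizes words into positive or negative sentiments as well as 8 different emotions: anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. For a complete list with the categories see nrc word list. In total, 6,468 words were rated. The output below has 13,901 rows because a word can belong to several categories and gets one row per category; for example, “abandon” is listed under fear, negative, and sadness.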
get_sentiments("nrc")
## # A tibble: 13,901 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ... with 13,891 more rows
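The difference between the number of rows and the number of rated words can be checked on a small excerpt. The sketch below is only an illustration in base R: `toy_nrc` is a hypothetical six-row data frame copied from the output above, not the full lexicon.

```r
# Toy excerpt in the same long format as get_sentiments("nrc"):
# one row per (word, sentiment) pair. Rows are copied from the output above.
toy_nrc <- data.frame(
  word = c("abacus", "abandon", "abandon", "abandon", "abandoned", "abandoned"),
  sentiment = c("trust", "fear", "negative", "sadness", "anger", "fear"),
  stringsAsFactors = FALSE
)

n_pairs <- nrow(toy_nrc)                 # number of word-sentiment pairs
n_words <- length(unique(toy_nrc$word))  # number of distinct words

# Collect the categories attached to each word into one string per word
categories <- tapply(toy_nrc$sentiment, toy_nrc$word, paste, collapse = ", ")

print(c(pairs = n_pairs, words = n_words))
print(categories)
```

On the full lexicon, the same two counts would be expected to correspond to the 13,901 rows and the 6,468 rated words mentioned above.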
8.1.3 bing
The bing lexicon categorizes words in a binary fashion into positive and negative categories. For a complete list with the categories see bing word list. In total, 6,788 words are rated.
get_sentiments("bing")
## # A tibble: 6,788 x 2
## word sentiment
## <chr> <chr>
## 1 2-faced negative
## 2 2-faces negative
## 3 a+ positive
## 4 abnormal negative
## 5 abolish negative
## 6 abominable negative
## 7 abominably negative
## 8 abominate negative
## 9 abomination negative
## 10 abort negative
## # ... with 6,778 more rows
8.2 Basic sentiment analysis
For each comment, we can calculate its overall sentiment. To quantify the sentiment of a comment, we score it based on its individual words. We first use the AFINN lexicon for sentiment analysis. This can be done using the code below. The inner join keeps only the words that appear in the lexicon and adds a new column called score to the dataset. For example, the word “best” has a score of 3, while the word “odd” has a score of -2.
rating.sentiment <- prof.tm %>% inner_join(get_sentiments("afinn"))
rating.sentiment[1:10, c("word", "score")]
## word score
## 1 best 3
## 2 like 2
## 3 best 3
## 4 help 2
## 5 clear 1
## 6 great 3
## 7 clear 1
## 8 odd -2
## 9 better 2
## 10 respected 2
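To see what the join does, here is a minimal sketch in base R on made-up data. The data frames `toy_tokens` and `toy_afinn` are hypothetical stand-ins for `prof.tm` and the AFINN lexicon (the four scores are taken from the output above), and `merge()` is used in place of `inner_join()` so that the example needs no extra packages.

```r
# Hypothetical tokenized comments: one row per word, as prof.tm is assumed to be.
toy_tokens <- data.frame(
  id = c(1, 1, 1, 2, 2),
  word = c("best", "teacher", "clear", "odd", "lectures"),
  stringsAsFactors = FALSE
)

# A four-word excerpt of AFINN, with scores taken from the output above.
toy_afinn <- data.frame(
  word = c("best", "clear", "odd", "great"),
  score = c(3, 1, -2, 3),
  stringsAsFactors = FALSE
)

# merge() with its default all = FALSE behaves like an inner join:
# words that are missing from the lexicon are dropped.
toy_scored <- merge(toy_tokens, toy_afinn, by = "word")
toy_scored <- toy_scored[order(toy_scored$id, toy_scored$word), ]
print(toy_scored)
```

Words such as “teacher” and “lectures” are not in the toy lexicon, so they disappear from the result. The same happens in the real analysis to every word that AFINN does not rate.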
We can now calculate the overall sentiment of a comment from the scores of its individual words by adding the scores of all the scored words in the comment. For future analysis, we also keep the numerical rating of teaching and the difficulty of the class. Assuming these two values are the same for every word of a comment, taking their mean within a comment simply recovers them. From the output, we can see that the overall sentiment of the first comment is 5 and that of the fifth comment is -10.
prof.tm.sentiment <- rating.sentiment %>% group_by(id) %>% summarise(rating = mean(rating),
difficulty = mean(difficulty), sentiment = sum(score))
prof.tm.sentiment
## # A tibble: 37,397 x 4
## id rating difficulty sentiment
## <int> <dbl> <dbl> <int>
## 1 1 5 3 5
## 2 2 5 4 6
## 3 3 4 5 4
## 4 4 3 5 5
## 5 5 1 5 -10
## 6 6 5 5 4
## 7 7 5 5 6
## 8 8 2 4 0
## 9 9 3 5 1
## 10 10 3 5 2
## # ... with 37,387 more rows
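The aggregation step can be sketched on a few made-up rows. The example below uses base R `aggregate()` rather than `group_by()` and `summarise()`, and all values in `toy_scored` are hypothetical.

```r
# Hypothetical scored words: two comments, each word already joined to its score.
toy_scored <- data.frame(
  id = c(1, 1, 2, 2, 2),
  rating = c(5, 5, 1, 1, 1),
  score = c(3, 2, -2, -3, -3)
)

# Sum the word scores within each comment.
sentiment <- aggregate(score ~ id, data = toy_scored, FUN = sum)

# The rating is constant within a comment, so its mean just recovers it.
rating <- aggregate(rating ~ id, data = toy_scored, FUN = mean)

toy_summary <- merge(rating, sentiment, by = "id")
names(toy_summary)[names(toy_summary) == "score"] <- "sentiment"
print(toy_summary)
```

Because the scoring step is an inner join, a comment that contains no word from the lexicon has no rows left to aggregate and so does not appear in the summary at all.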
The histogram of the sentiment scores of all the comments is shown below. The distribution seems to be symmetric but with long tails.
prof.tm.sentiment %>% ggplot(aes(sentiment)) + geom_histogram(color = "black", fill = "white",
bins = 30)
We then look at the relationship between the rating and the sentiment of the comment using the Pearson correlation, which is 0.56, a fairly large correlation. The correlation between the sentiment and the difficulty is negative (-0.32). A jittered scatter plot of the sentiment against the rating follows the correlation matrix.
cor(prof.tm.sentiment[, 2:4])
## rating difficulty sentiment
## rating 1.0000000 -0.4768910 0.5635187
## difficulty -0.4768910 1.0000000 -0.3164134
## sentiment 0.5635187 -0.3164134 1.0000000
prof.tm.sentiment %>% ggplot(aes(x = rating, y = sentiment)) + geom_jitter()
geom_jitter is a convenient shortcut for geom_point(position = "jitter"). It adds a small amount of random variation to the location of each point, and is a useful way of handling overplotting caused by discreteness in smaller datasets.
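For readers who want to see what the Pearson correlation reported by `cor()` measures, the following sketch computes it by hand on made-up numbers and compares the result with the built-in function. The vectors `x` and `y` are hypothetical and are not taken from the rating data.

```r
x <- c(5, 4, 1, 3, 2)     # hypothetical ratings
y <- c(6, 4, -10, 1, 0)   # hypothetical sentiment scores

# Pearson correlation: the sum of cross-products of the deviations from
# the means, divided by the square root of the product of the sums of
# squared deviations.
dx <- x - mean(x)
dy <- y - mean(y)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

r_builtin <- cor(x, y)    # method = "pearson" is the default

print(c(manual = r_manual, builtin = r_builtin))
```

The two values agree up to floating-point error, which is why `all.equal()` rather than `==` is the safer way to compare them.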
8.3 2-gram sentiment analysis
The sentiment analysis above was based on individual words. In a comment, there are often words such as “not” and “don’t” that give the word following them the opposite meaning. For example, “good” is, in general, a positive word but “not good” is negative. Therefore, we need to identify the negative meaning of phrases such as “not good” and “not useful”. This can be done by splitting each two-word phrase into its two words. For the first word, we check whether it is a negative word such as not, no, or never. Then, we score the second word as in the previous sentiment analysis. After that, we reverse the sign of the sentiment for the words with a preceding negative word.
The following code first uses 2-grams to divide the comments into pairs of consecutive words. After that, we separate each pair into two columns, word1 and word2. We then score word2 using the sentiment word list.
prof.tm <- unnest_tokens(prof1000, word, comments, token = "ngrams", n = 2)
prof.separated <- prof.tm %>% separate(word, c("word1", "word2"), sep = " ")
rating.sentiment <- prof.separated %>% inner_join(get_sentiments("afinn"), by = c(word2 = "word"))
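The three steps above can be sketched on a single made-up sentence. This is only an illustration in base R: the sentence, the two-word lexicon `toy_afinn`, and its scores are assumptions for the example, and the splitting is done with `strsplit()` rather than `unnest_tokens()` and `separate()`.

```r
comment <- "the class was not good but the professor was great"

# Split the comment into lower-case words.
words <- strsplit(tolower(comment), " ", fixed = TRUE)[[1]]
n <- length(words)

# Each 2-gram pairs a word with the word that follows it.
bigrams <- data.frame(
  word1 = words[-n],   # every word except the last
  word2 = words[-1],   # every word except the first
  stringsAsFactors = FALSE
)

# Illustrative two-word lexicon; the scores are assumed for this example.
toy_afinn <- data.frame(
  word = c("good", "great"),
  score = c(3, 3),
  stringsAsFactors = FALSE
)

# Score the second word of each pair; unrated pairs are dropped.
scored <- merge(bigrams, toy_afinn, by.x = "word2", by.y = "word")
print(scored)
```

Two pairs survive the join: “not good” and “was great”. Note that the first word of a comment never appears as word2, so it is not scored under this scheme.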
We now identify a list of words that would flip the sentiment of the words following them.
negativeword <- c("no", "not", "never", "dont", "don't", "cannot", "can't", "won't",
"wouldn't", "shouldn't", "aren't", "isn't", "wasn't", "weren't", "haven't", "hasn't",
"hadn't", "doesn't", "didn't", "mightn't", "mustn't")
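We now create a new score2 variable that takes the negative words into account. The code for the analysis is given below. Note that the correlation with the rating increases to 0.575, compared with 0.563 in the single-word analysis and 0.533 for the 2-gram score without flipping (sentiment1). The increase is not dramatic. In fact, the two sentiment scores are highly correlated, with a correlation of 0.96. In the output, the column named easy holds the mean difficulty.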
rating.sentiment <- rating.sentiment %>% mutate(score1 = score, score2 = ifelse(word1 %in%
negativeword, -score, score))
prof.tm.sentiment <- rating.sentiment %>% group_by(id) %>% summarise(rating = mean(rating),
easy = mean(difficulty), sentiment1 = sum(score1), sentiment2 = sum(score2))
cor(prof.tm.sentiment[, 2:5])
## rating easy sentiment1 sentiment2
## rating 1.0000000 -0.4771256 0.5331909 0.5747061
## easy -0.4771256 1.0000000 -0.2962231 -0.3261040
## sentiment1 0.5331909 -0.2962231 1.0000000 0.9595544
## sentiment2 0.5747061 -0.3261040 0.9595544 1.0000000
prof.tm.sentiment %>% ggplot(aes(x = sentiment1, y = sentiment2)) + geom_jitter()
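The effect of the sign flip is easiest to see on a few made-up rows. The sketch below applies the same `ifelse()` rule in base R; the data frame `toy_bigrams`, its scores, and the shortened `negativeword` vector are hypothetical.

```r
# Shortened version of the negative word list, for illustration only.
negativeword <- c("no", "not", "never", "don't")

# Hypothetical scored 2-grams from two comments.
toy_bigrams <- data.frame(
  id = c(1, 1, 2, 2),
  word1 = c("not", "was", "a", "never"),
  word2 = c("good", "great", "great", "clear"),
  score = c(3, 3, 3, 1),
  stringsAsFactors = FALSE
)

# Flip the sign of the score when the preceding word is a negative word.
toy_bigrams$score2 <- ifelse(toy_bigrams$word1 %in% negativeword,
                             -toy_bigrams$score, toy_bigrams$score)

# Per-comment sentiment without and with the flip.
sentiment1 <- tapply(toy_bigrams$score, toy_bigrams$id, sum)
sentiment2 <- tapply(toy_bigrams$score2, toy_bigrams$id, sum)

print(toy_bigrams)
print(rbind(sentiment1, sentiment2))
```

In the first toy comment, “not good” and “was great” cancel out after the flip, so the sentiment drops from 6 to 0. Pairs without a preceding negative word are unchanged, which is consistent with the two scores being highly correlated in the real data.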