Chapter 8 Sentiment Analysis

Each word can be assigned a sentiment, such as positive or negative, or an emotion category such as happiness, joy, or fear. The sentiment of a word is best identified for the particular problem at hand. For example, when studying positive and negative affect, one can ask people to rate whether a word carries a positive or a negative meaning.

8.1 Word sentiment

In the literature, three sentiment dictionaries, or lexicons, with pre-identified word sentiment are widely used: AFINN, nrc, and bing.

8.1.2 nrc

The nrc lexicon categorizes words as positive or negative and also assigns them to eight emotions: anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. For a complete list with the categories, see the nrc word list. In total, 6,468 words were rated.

8.2 Basic sentiment analysis

For each comment, we can calculate its overall sentiment. To quantify the emotion or sentiment of a comment, we score it based on its individual words. We first use the AFINN lexicon for sentiment analysis. This can be done using the code below. Note that we add a new column called score to the dataset. For example, the word “best” has a score of 3, while the word “odd” has a score of -2.
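The word-level scoring step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: TOY_AFINN is a tiny stand-in for the real lexicon, in which only the scores for "best" and "odd" come from the text above and the remaining entries are made up, and the two comments are hypothetical.

```python
import re

# Toy stand-in for the AFINN lexicon. Only "best" (3) and "odd" (-2) are
# taken from the text; the other entries are made-up illustrative values.
TOY_AFINN = {"best": 3, "odd": -2, "helpful": 2, "boring": -2}


def tokenize(comment):
    """Lowercase a comment and split it into words."""
    return re.findall(r"[a-z']+", comment.lower())


def score_words(comments):
    """Return one row per scored word: (comment id, word, score).

    Words that are not in the lexicon are dropped, as in an inner join.
    """
    rows = []
    for comment_id, comment in enumerate(comments, start=1):
        for word in tokenize(comment):
            if word in TOY_AFINN:
                rows.append((comment_id, word, TOY_AFINN[word]))
    return rows


# Hypothetical comments, not taken from the dataset.
comments = [
    "The best professor, very helpful.",
    "Odd grading and boring lectures.",
]
word_scores = score_words(comments)
for row in word_scores:
    print(row)
```

Each output row plays the role of one line of the scored dataset, with the score stored alongside the word and the comment it came from.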

We can now calculate the overall sentiment of a comment from the scores of its individual words. Basically, we add together the scores of all the scored words in a comment. For future analysis, we also keep the numerical rating of teaching and the difficulty of the class. From the output, we can see that the overall sentiment is 5 for the first comment and -10 for the fifth comment.
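As a rough sketch of this aggregation step, the following Python snippet sums word scores within each comment and attaches the rating and difficulty. All of the data here are hypothetical placeholders, and the names word_scores, metadata, and comment_sentiment are chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical word-level rows: (comment id, word, score).
word_scores = [
    (1, "best", 3), (1, "helpful", 2),
    (2, "odd", -2), (2, "boring", -2),
]

# Hypothetical per-comment metadata: comment id -> (rating, difficulty).
metadata = {1: (5.0, 2.0), 2: (2.0, 4.0)}


def comment_sentiment(rows):
    """Sum the word scores within each comment."""
    totals = defaultdict(int)
    for comment_id, _word, score in rows:
        totals[comment_id] += score
    return dict(totals)


sentiment = comment_sentiment(word_scores)

# One row per comment: (comment id, sentiment, rating, difficulty).
summary = [(cid, sentiment[cid], *metadata[cid]) for cid in sorted(sentiment)]
for row in summary:
    print(row)
```

A comment with no scored words does not appear in the summary at all, which is worth keeping in mind when comparing the number of comments before and after scoring.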

The histogram of the sentiment scores of all the comments is shown below. The distribution appears roughly symmetric but has long tails.

We then look at the relationship between the rating and the sentiment of the comment using the Pearson correlation, which is 0.56, a fairly large correlation. The correlation between the sentiment and the difficulty is negative.
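For readers who want to see what the Pearson correlation computes, here is a minimal from-scratch sketch in Python. The three short lists are invented numbers used only to exercise the function; they are not the course-evaluation data and do not reproduce the 0.56 reported above.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Invented values for five comments (illustration only).
sentiment = [5, -4, 2, 0, -1]
rating = [5.0, 1.5, 4.0, 3.0, 3.5]
difficulty = [2.0, 4.5, 3.0, 3.0, 4.0]

r_rating = pearson(sentiment, rating)
r_difficulty = pearson(sentiment, difficulty)
print(round(r_rating, 2), round(r_difficulty, 2))
```

With these made-up numbers the first correlation is positive and the second is negative, which mirrors the direction of the two relationships described above.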

geom_jitter is a convenient shortcut for geom_point(position = "jitter"). It adds a small amount of random variation to the location of each point, and is a useful way of handling overplotting caused by discreteness in smaller datasets.
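The idea behind jittering can be illustrated without any plotting library. In the sketch below, the function jitter and its amount and seed parameters are hypothetical names, and the value 0.2 is an arbitrary choice rather than the default used by ggplot2.

```python
import random


def jitter(values, amount=0.2, seed=0):
    """Shift each value by a small uniform random offset in [-amount, amount]."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    return [v + rng.uniform(-amount, amount) for v in values]


# Discrete ratings would be drawn on top of each other in a scatter plot.
ratings = [1, 1, 2, 2, 2, 3]
jittered = jitter(ratings)
print(jittered)
```

Points that shared exactly the same coordinate now sit at slightly different positions, so they no longer hide one another when drawn.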

8.3 2-gram sentiment analysis

The sentiment analysis above was based on individual words. In a comment, there are often words such as “not” and “don’t” that reverse the meaning of the word that follows. For example, “good” is, in general, a positive word but “not good” is negative. Therefore, we need to identify the negative meaning of phrases such as “not good” and “not useful”. This can be done by splitting each two-word phrase into its two words. For the first word, we check whether it is a negation word such as not, no, or never. Then, we score the second word as in the previous sentiment analysis. After that, we reverse the sign of the sentiment for words with a preceding negation word.

The following code first uses 2-grams to divide the comments into pairs of consecutive words. After that, we separate each pair into two columns: word1 and word2. We then score word2 using the sentiment word list.
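The 2-gram step can be sketched in plain Python as follows. This is an illustration only: TOY_LEXICON holds made-up scores rather than the real sentiment word list, the comment is invented, and the function names are hypothetical.

```python
import re

# Made-up scores for illustration; not the real sentiment word list.
TOY_LEXICON = {"good": 3, "useful": 2, "boring": -2}


def bigrams(comment):
    """Split a comment into pairs of consecutive words (word1, word2)."""
    words = re.findall(r"[a-z']+", comment.lower())
    return list(zip(words, words[1:]))


def score_bigrams(comment):
    """Keep the pairs whose second word is in the lexicon, with its score."""
    return [
        (word1, word2, TOY_LEXICON[word2])
        for word1, word2 in bigrams(comment)
        if word2 in TOY_LEXICON
    ]


rows = score_bigrams("The class is not good but the notes are useful")
for row in rows:
    print(row)
```

At this stage "good" still receives its ordinary positive score even though it follows "not"; the first word of each pair is kept so that the score can be adjusted afterwards.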

We now identify a list of words that would flip the sentiment of the words following them.

We now create a new score2 variable that takes the negation words into account. The code for the analysis is given below. Note that the correlation with the rating increased from 0.563 to 0.575. The increase is not dramatic; in fact, the two sentiment scores are highly correlated with each other, with a correlation of 0.96.
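A minimal Python sketch of the sign-flipping idea is shown here. The NEGATIONS set is a short assumed list built from the examples mentioned in this section, and the scored word pairs are hypothetical values, so the totals are for illustration only.

```python
# Assumed negation list based on the examples in the text; a real analysis
# would likely use a longer list.
NEGATIONS = {"not", "no", "never", "don't"}

# Hypothetical scored pairs: (word1, word2, score of word2).
rows = [
    ("not", "good", 3),
    ("are", "useful", 2),
    ("never", "boring", -2),
]


def add_score2(scored_pairs):
    """Append score2, which reverses the sign when word1 is a negation."""
    return [
        (word1, word2, score, -score if word1 in NEGATIONS else score)
        for word1, word2, score in scored_pairs
    ]


adjusted = add_score2(rows)
total_score = sum(row[2] for row in adjusted)   # ignores negation
total_score2 = sum(row[3] for row in adjusted)  # accounts for negation
print(total_score, total_score2)
```

Only the pairs that start with a negation word change, so the two totals differ only when a comment contains such pairs. This is consistent with the two scores being highly correlated.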