Chapter 1 Data
Throughout the book, we will use a set of data to illustrate how to conduct text mining. We now introduce the data set.
1.1 Teaching evaluation data
The data file prof1000.original.csv
includes teaching evaluation data on 1,000 professors. The structure of the data is shown below.
prof1000 <- read.csv("data/prof1000.original.csv", stringsAsFactors = FALSE)
str(prof1000)
## 'data.frame': 38240 obs. of 12 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ profid : int 1 1 1 1 1 1 1 1 1 1 ...
## $ rating : num 5 5 4 3 1 5 5 2 3 3 ...
## $ difficulty: int 3 4 5 5 5 5 5 4 5 5 ...
## $ credit : int 1 1 1 1 1 1 1 1 1 1 ...
## $ grade : int 5 4 5 7 3 NA 6 7 7 8 ...
## $ book : int 0 0 0 0 0 1 1 1 1 1 ...
## $ take : int 1 1 1 0 0 0 1 0 NA NA ...
## $ attendance: int 1 1 0 1 1 1 1 1 1 0 ...
## $ tags : chr "respected;accessible outside class;skip class? you won't pass ." "accessible outside class;lots of homework;respected" "tough grader;lots of homework;accessible outside class" "tough grader;so many papers;lots of homework" ...
## $ comments : chr "best professor i've had in college . only thing i dont like is the writing assignments" "Professor has been the best math professor I've had at thus far . He assigns a heavy amount of homework but "| __truncated__ "He was a great professor . he does give a lot of homework but he will work with you if you don't clearly unders"| __truncated__ "Professor is an incredibly respected teacher, however his class is extremely difficult . I believe he just ass"| __truncated__ ...
## $ date : chr "04/17/2018" "02/13/2018" "01/07/2018" "12/11/2017" ...
By default (in R versions before 4.0), when the R function read.csv reads data into R, non-numerical data are converted to factors, with the values of a vector treated as the levels of a factor. Because text data are the focus of text mining, we keep the data as characters by setting stringsAsFactors = FALSE.
The data include 38,240 records on 12 variables. Below is the list of variables.

id
: The identification number of an individual rating. It ranges from 1 to 38,240.

profid
: The identification number of a professor. It ranges from 1 to 1,000.

rating
: The overall rating of a professor's teaching. It ranges from 1 to 5, with 5 indicating the best.

difficulty
: An evaluation of the difficulty of the course. It ranges from 1 to 5, with 5 indicating the most difficult.

credit
: Whether the course was taken for credit. 1: Yes; 0: No.

grade
: The grade a student received in the course. It ranges from 1 (A+) to 13 (F). See the table below for details.
value | grade |
---|---|
1 | A+ |
2 | A |
3 | A- |
4 | B+ |
5 | B |
6 | B- |
7 | C+ |
8 | C |
9 | C- |
10 | D+ |
11 | D |
12 | D- |
13 | F |
book
: Whether a textbook is used. 1: Yes; 0: No.

take
: Whether the student would take a class from the professor again. 1: Yes; 0: No.

attendance
: Whether attendance is required. 1: Yes; 0: No.

tags
: A student can select three tags from the list of 20 tags below to describe the professor/course: accessible outside class, amazing lectures, beware of pop quizzes, caring, clear grading criteria, extra credit, get ready to read, gives good feedback, graded by few things, group projects, hilarious, inspirational, lecture heavy, lots of homework, participation matters, respected, skip class? you will not pass, so many papers, test heavy, tough grader.

comments
: Narrative evaluation of a professor/course.

date
: Date when the evaluation was provided.
In the dataset, NA
always represents a missing value.
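Since NA is used consistently, a quick overview of the missingness in each variable can be obtained with base R; a minimal sketch:

## count the missing values in each variable
colSums(is.na(prof1000))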
1.2 Initial processing of data
We first conduct some preliminary processing of the data. Since very short comments might not be useful, we remove all comments that are shorter than 50 characters. This can be done using the code below. The function nchar
counts the number of characters in a string.
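A minimal sketch of such a filter (our own reconstruction; it assumes the comments column has no missing values, as in this data set):

## keep only the comments with at least 50 characters
prof1000 <- prof1000[nchar(prof1000$comments) >= 50, ]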
Now the dataset has 38,193 records left.
1.3 Handling spelling errors and abbreviations
Spelling errors and abbreviations are much more common in reviews and comments than in formal writing. We can pre-process the comments to remove or replace common spelling errors and abbreviations. This process can be time-consuming and tedious but will improve the quality of later analyses.
First, we obtain a list of unique words, including the misspelled ones and abbreviations, from all the comments. Second, we use Hunspell, a popular spell checker and morphological analyzer, to find words that might be misspelled. After that, we ask Hunspell to suggest potential corrections for the misspelled words. Then, we replace the misspelled words with the suggested corrections. In R, the package hunspell
can be used to conduct the spelling check and find the suggested corrections.
1.3.1 Get the unique word list
To get a list of words used in all comments, we first split the comments into individual words. To do so, we use the unnest_tokens
function from the tidytext
package.
tidytext
is an R package for text mining. It provides easy-to-use functions for mining text information. For more information on the use of tidytext
, see the online book by Julia Silge and David Robinson: Text Mining with R: A Tidy Approach.
If a package has not been installed, one can use the command install.packages("pkgname")
to install it. To use the package, load it with library(pkgname)
. Here, pkgname
is the name of the package to be used, e.g., tidytext
.
The following R code returns a list of words used in all comments.
## load the tidytext package
library(tidytext)
prof.tm <- unnest_tokens(prof1000, word, comments)
head(prof.tm[, c("id", "word")])
## id word
## 1 1 best
## 1.1 1 professor
## 1.2 1 i've
## 1.3 1 had
## 1.4 1 in
## 1.5 1 college
words <- unique(prof.tm$word)
head(words)
## [1] "best" "professor" "i've" "had" "in" "college"
length(words)
## [1] 21226
Note that there are a total of 21,226 unique words.
The function unnest_tokens
splits a column of a data frame into tokens using the tokenizers
package. The resulting data frame has one token per row. A token is typically a word but can be something else, such as a sentence or an n-gram. In this example, the column comments
of prof1000
is split into words, which are saved in a new column called word
.
1.3.2 Identify misspelled words and abbreviations
We now use the R package hunspell
to find the misspelled words and abbreviations. Note that the output of the function hunspell
is a list: if a word is misspelled, the corresponding element contains the word; otherwise, it is an empty character vector (character(0)
). For this data set, 7,173 words were found to be problematic; the check can be done with code like that shown below.
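A minimal sketch of this check, using the vector words obtained above (the object name check is our own):

library(hunspell)
## hunspell() returns character(0) for correctly spelled words
check <- hunspell(words)
## keep the words flagged as misspelled
bad.words <- words[sapply(check, length) > 0]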
1.3.3 Suggest alternatives to the identified words
The package hunspell
can also suggest alternative (e.g., correct) words for the misspelled or abbreviated words. For example, for the word “Professsor”, it gives the suggestions below.
hunspell_suggest("Professsor")
## [[1]]
## [1] "Professor" "Professors" "Profession" "Professed" "Processor"
## [6] "Oppressor"
Note that the first word is typically the best alternative. We therefore construct a list of corrected words for the corresponding misspelled words. Clearly, some suggestions are not the best choice, but overall, the correction improves the quality of the data.
## suggest the correct words
sugg.words <- hunspell_suggest(bad.words)
head(sugg.words)
## [[1]]
## [1] "I've"
##
## [[2]]
## [1] "font" "dint" "don" "dot" "don't" "done" "dent" "dons" "dost"
## [10] "dona" "dolt" "cont" "dong" "wont" "Mont"
##
## [[3]]
## [1] "nth" "mtg" "mt" "meth" "math" "moth" "myth" "meh" "mph" "mt h"
##
## [[4]]
## [1] "knowledgeable" "knowledgeably" "knowledge"
##
## [[5]]
## [1] "web work" "web-work" "workweek"
##
## [[6]]
## [1] "rd" "Dr" "fr" "d" "r" "dry" "er" "tr" "do" "or" "dc" "dd"
## [13] "gr" "pr" "hr"
sugg.words <- unlist(lapply(sugg.words, function(x) x[1]))
Many times, it is worth the effort to quickly go through the list to check the quality of the suggested words. To save time and effort, we will only manually check the words with at least 6 occurrences in all the comments. To do so, we first count the frequency of each word using the function count
from the dplyr
package (part of the tidyverse). Then, we keep only the words in the bad-word list by using the function inner_join
. After that, we save the data into the word.list.freq.csv
file.
## dplyr provides count() and inner_join()
library(dplyr)
word.list <- as.data.frame(cbind(bad.words, sugg.words))
freq.word <- count(prof.tm, word)
freq.word <- inner_join(freq.word, word.list, by = c(word = "bad.words"))
freq.word
## # A tibble: 7,121 x 3
## word n sugg.words
## <chr> <int> <fct>
## 1 a'ce 1 ace
## 2 a's 198 A's
## 3 aa 7 AA
## 4 aaaagghhh 1 <NA>
## 5 aaas 2 alas
## 6 aameetings 1 meetinghouse
## 7 aarc 2 arc
## 8 aaron 4 Aaron
## 9 aas 3 ass
## 10 abaluyia 1 Alabamian
## # ... with 7,111 more rows
write.csv(freq.word, "data/word.list.freq.csv", row.names = F)
We opened the file word.list.freq.csv
in Excel and checked the words in the list. Eventually, we found that about 220 words needed correction. For some words, the alternatives suggested by hunspell
were not correct, so we provided the alternatives ourselves. We saved the word list to the file word.list.csv
. We also saved the rest of the words into a file called stoplist.csv
.
1.3.4 Replace the misspelled words and abbreviations with the corrections
We now replace the misspelled words and abbreviations with the corrections we identified. To do so, we use the R package stringi
, particularly the function stri_replace_all_regex
. The function replaces all occurrences of the given patterns, so if a word is part of another word, it will also be replaced. To avoid this problem, we can put the word-boundary anchor "\\b" before and after the words to be searched. In that case, only whole words will be matched and replaced. A simple example is shown below.
library(stringi)
stri_replace_all_regex("there is the test", c("the", "test"), c("this", "exam"),
vectorize_all = FALSE)
## [1] "thisre is this exam"
stri_replace_all_regex("there is the test", c("\\bthe\\b", "\\btest\\b"), c("this",
"exam"), vectorize_all = FALSE)
## [1] "there is this exam"
We now replace all the misspelled words in all the comments using the code below.
word.list <- read.csv("data/word.list.csv", stringsAsFactors = FALSE)
bad.whole.words <- paste0("\\b", word.list$bad.words, "\\b")
sugg.words <- word.list$sugg.words
prof1000$comments <- stri_replace_all_regex(prof1000$comments, bad.whole.words, sugg.words,
vectorize_all = FALSE)
There are also a lot of abbreviations of course names and some other unusual spelling errors. This kind of information is often not useful for analysis, so we simply remove all those words from the comments by replacing them with a space. We save the changed data to a file called prof1000.corrected.csv
.
stop.list <- read.csv("data/stoplist.csv", stringsAsFactors = FALSE)
whole.words <- paste0("\\b", stop.list$word, "\\b")
prof1000$comments <- stri_replace_all_regex(prof1000$comments, whole.words, " ",
vectorize_all = FALSE)
## save the data with corrected words
write.csv(prof1000, "data/prof1000.corrected.csv", row.names = F)
1.4 Word stemming
Sometimes, different forms of a word might be used. For example, “love”, “loves”, “loving”, and “loved” are all forms of “love”. It can be useful to combine these words if they mean the same thing. One way to do so is to find the stem, root, or common part of the words. Several R packages can be used for this. For example, we can use the wordStem
function from the SnowballC
package.
prof.tm <- unnest_tokens(prof1000, word, comments, to_lower = TRUE)
## use library SnowballC
library(SnowballC)
prof.tm <- mutate(prof.tm, word.stem = wordStem(word, language = "en"))
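As a quick check, the forms of “love” mentioned earlier reduce to the same stem:

wordStem(c("love", "loves", "loving", "loved"), language = "en")
## all four forms should reduce to "love"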
The package hunspell
can also be used to find the stem of a word. However, it sometimes suggests multiple stems; here, we simply pick the last one in the list of suggestions. If a word has no stem, NA is returned. We write a function stem_hunspell
to carry out this logic. Then we get a list of unique words and obtain the possible stem of each. Finally, we replace the original words in the comments with their stems. For better accuracy, we again saved the suggested stems to a file and manually checked whether each stem is accurate.
## A function to find the stem of a word using hunspell
stem_hunspell <- function(term) {
    ## hunspell_stem returns a list; take the stems for this term
    stems <- hunspell_stem(term)[[1]]
    if (length(stems) == 0) {
        ## no stem found
        stem <- NA
    } else if (nchar(stems[[length(stems)]]) > 1) {
        ## use the last suggested stem
        stem <- stems[[length(stems)]]
    } else {
        ## keep the original word if the stem is too short
        stem <- term
    }
    stem
}
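As a usage example (the exact output depends on the installed hunspell dictionary):

stem_hunspell("assignments")  ## expected to return "assignment"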
word.list <- count(prof.tm, word)
## find the stems of the words
words.stem <- lapply(word.list$word, stem_hunspell)
stem.list <- cbind(word.list, stem = unlist(words.stem))
## keep words whose stems exist and differ from the original word
stem.list <- stem.list[!is.na(stem.list[, 3]) & stem.list[, 1] != stem.list[, 3], ]
## save to file and manually check the words
write.csv(stem.list, "data/stem.list.csv", row.names = F)
## replace the words in the comments with stems
stem.list <- read.csv("data/stem.list.csv", stringsAsFactors = FALSE)
prof.stem <- prof1000
orig.words <- paste0("\\b", stem.list[, 1], "\\b")
stem.words <- stem.list[, 3]
prof.stem$comments <- stri_replace_all_regex(prof.stem$comments, orig.words, stem.words,
vectorize_all = FALSE)
## remove duplicated information
prof.stem <- prof.stem[!duplicated(prof.stem$comments), ]
write.csv(prof.stem, "data/prof1000.stem.csv", row.names = F)
After stemming, we save the data into the file prof1000.stem.csv
. Since stemming is often recommended in text mining, we will use this data file in the rest of the analysis unless otherwise specified.
After the above data cleaning steps, we now have a data set with 38,157 rows and 12 columns.
str(prof.stem)
## 'data.frame': 38157 obs. of 12 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ profid : int 1 1 1 1 1 1 1 1 1 1 ...
## $ rating : num 5 5 4 3 1 5 5 2 3 3 ...
## $ difficulty: int 3 4 5 5 5 5 5 4 5 5 ...
## $ credit : int 1 1 1 1 1 1 1 1 1 1 ...
## $ grade : int 5 4 5 7 3 NA 6 7 7 8 ...
## $ book : int 0 0 0 0 0 1 1 1 1 1 ...
## $ take : int 1 1 1 0 0 0 1 0 NA NA ...
## $ attendance: int 1 1 0 1 1 1 1 1 1 0 ...
## $ tags : chr "respected;accessible outside class;skip class? you won't pass." "accessible outside class;lots of homework;respected" "tough grader;lots of homework;accessible outside class" "tough grader;so many papers;lots of homework" ...
## $ comments : chr "best professor I've had in college. only thing i dont like is the writing assignment" "Professor has been the best mathematics professor I've had at thus far. He assign a heavy amount of homework bu"| __truncated__ "He was a great professor. he does give a lot of homework but he will work with you if you don't clear understan"| __truncated__ "Professor is an incredible respected teacher, however his class is extreme difficult. I believe he just assume "| __truncated__ ...
## $ date : chr "04/17/2018" "02/13/2018" "01/07/2018" "12/11/2017" ...