Chapter 1 Data
Throughout the book, we will use a set of data to illustrate how to conduct text mining. We now introduce the data set.
1.1 Teaching evaluation data
The data file prof1000.original.csv
includes teaching evaluation data on 1,000 professors. The structure of the data is shown below.
prof1000 <- read.csv("data/prof1000.original.csv", stringsAsFactors = FALSE)
str(prof1000)
## 'data.frame': 38240 obs. of 12 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ profid : int 1 1 1 1 1 1 1 1 1 1 ...
## $ rating : num 5 5 4 3 1 5 5 2 3 3 ...
## $ difficulty: int 3 4 5 5 5 5 5 4 5 5 ...
## $ credit : int 1 1 1 1 1 1 1 1 1 1 ...
## $ grade : int 5 4 5 7 3 NA 6 7 7 8 ...
## $ book : int 0 0 0 0 0 1 1 1 1 1 ...
## $ take : int 1 1 1 0 0 0 1 0 NA NA ...
## $ attendance: int 1 1 0 1 1 1 1 1 1 0 ...
## $ tags : chr "respected;accessible outside class;skip class? you won't pass ." "accessible outside class;lots of homework;respected" "tough grader;lots of homework;accessible outside class" "tough grader;so many papers;lots of homework" ...
## $ comments : chr "best professor i've had in college . only thing i dont like is the writing assignments" "Professor has been the best math professor I've had at thus far . He assigns a heavy amount of homework but "| __truncated__ "He was a great professor . he does give a lot of homework but he will work with you if you don't clearly unders"| __truncated__ "Professor is an incredibly respected teacher, however his class is extremely difficult . I believe he just ass"| __truncated__ ...
## $ date : chr "04/17/2018" "02/13/2018" "01/07/2018" "12/11/2017" ...
By default (in R versions before 4.0), when the R function read.csv reads data into R, non-numerical data are converted to factors, with the values of a vector treated as the levels of a factor. Because text data are the focus of text mining, we keep the data as characters by setting stringsAsFactors = FALSE.
The data include 38,240 records on 12 variables. Below is the list of variables.

id
: The identification number of an individual rating. It ranges from 1 to 38,240.

profid
: The identification number of a professor. It ranges from 1 to 1,000.

rating
: The overall rating of a professor's teaching. It ranges from 1 to 5, with 5 indicating the best.

difficulty
: An evaluation of the difficulty of the course. It ranges from 1 to 5, with 5 indicating the most difficult.

credit
: Whether the course was taken for credit. 1: Yes; 0: No.

grade
: The grade a student received in the course. It ranges from 1 (A+) to 13 (F). See the table below for details.
value | grade |
---|---|
1 | A+ |
2 | A |
3 | A- |
4 | B+ |
5 | B |
6 | B- |
7 | C+ |
8 | C |
9 | C- |
10 | D+ |
11 | D |
12 | D- |
13 | F |
book
: Whether a textbook is used. 1: Yes; 0: No.

take
: Whether the student would take a class from the professor again. 1: Yes; 0: No.

attendance
: Whether attendance is required. 1: Yes; 0: No.

tags
: A student can select three tags from the list of 20 tags below to describe the professor/course: accessible outside class, amazing lectures, beware of pop quizzes, caring, clear grading criteria, extra credit, get ready to read, gives good feedback, graded by few things, group projects, hilarious, inspirational, lecture heavy, lots of homework, participation matters, respected, skip class? you will not pass, so many papers, test heavy, tough grader.

comments
: Narrative evaluation of a professor/course.

date
: Date when the evaluation was provided.
In the dataset, NA
always represents a missing value.
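Since NA is used consistently, a quick overview of the missingness in each variable can be obtained with base R; a minimal sketch:

## count the missing values in each variable
colSums(is.na(prof1000))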
1.2 Initial processing of data
We first conduct some preliminary processing of the data. Since very short comments might not be useful, we remove all comments that are shorter than 50 characters. This can be done using the code below. The function nchar
counts the number of characters in a string.
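A minimal sketch of such a filter (our own reconstruction; it assumes the comments column has no missing values, as in this data set):

## keep only the comments with at least 50 characters
prof1000 <- prof1000[nchar(prof1000$comments) >= 50, ]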
Now the dataset has 38,193 records left.
1.3 Handling spelling errors and abbreviations
Spelling errors and abbreviations are much more common in reviews and comments than in formal writing. We can pre-process the comments to remove or replace common spelling errors and abbreviations. This process can be time-consuming and tedious but will improve the quality of later analyses.
First, we obtain a list of unique words, including the misspelled ones and abbreviations, from all the comments. Second, we use Hunspell, a popular spell checker and morphological analyzer, to find words that might be misspelled. After that, we ask Hunspell to suggest potential corrections for the misspelled words. Then, we replace the misspelled words with the suggested corrections. In R, the package hunspell
can be used to conduct the spelling check and find the suggested corrections.
1.3.1 Get the unique word list
To get a list of words used in all comments, we first split the comments into individual words. To do so, we use the unnest_tokens
function from the tidytext
package.
tidytext
is an R package for text mining. It provides easy-to-use functions for mining text information. For more information on the use of tidytext
, see the online book by Julia Silge and David Robinson: Text Mining with R: A Tidy Approach.
If a package has not been installed, one can use the command install.packages("pkgname")
to install it. To use the package, load it with library(pkgname)
. Here, pkgname
is the name of the package to be used, e.g., tidytext
.
The following R code returns a list of words used in all comments.
## load the tidytext package
library(tidytext)
prof.tm <- unnest_tokens(prof1000, word, comments)
head(prof.tm[, c("id", "word")])
## id word
## 1 1 best
## 1.1 1 professor
## 1.2 1 i've
## 1.3 1 had
## 1.4 1 in
## 1.5 1 college
words <- unique(prof.tm$word)
head(words)
## [1] "best" "professor" "i've" "had" "in" "college"
length(words)
## [1] 21226
Note that there are a total of 21,226 unique words.
The function unnest_tokens
splits a column of a data frame into tokens using the tokenizers
package. The resulting data frame has one token per row. A token is typically a word but can be something else, such as a sentence or an n-gram. In this example, the column comments
of prof1000
is split into words, which are saved in a new column called word
.
1.3.2 Identify misspelled words and abbreviations
We now use the R package hunspell
to find the misspelled words and abbreviations. Note that the output of the function hunspell
is a list: if a word is misspelled, the corresponding element contains the word; otherwise, it is an empty character vector (character(0)
). For this data set, 7,173 words were found to be problematic; the check can be done with code like that shown below.
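A minimal sketch of this check, using the vector words obtained above (the object name check is our own):

library(hunspell)
## hunspell() returns character(0) for correctly spelled words
check <- hunspell(words)
## keep the words flagged as misspelled
bad.words <- words[sapply(check, length) > 0]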
1.3.3 Suggest alternatives to the identified words
The package hunspell
can also suggest alternative (e.g., correct) words for the misspelled or abbreviated words. For example, for the word “Professsor”, it gives the suggestions below.
hunspell_suggest("Professsor")
## [[1]]
## [1] "Professor" "Professors" "Profession" "Professed" "Processor"
## [6] "Oppressor"
Note that the first word is typically the best alternative. We therefore construct a list of corrected words for the corresponding misspelled words. Clearly, some suggestions are not the best choice, but overall, the correction improves the quality of the data.
## suggest the correct words
sugg.words <- hunspell_suggest(bad.words)
head(sugg.words)
## [[1]]
## [1] "I've"
##
## [[2]]
## [1] "font" "dint" "don" "dot" "don't" "done" "dent" "dons" "dost"
## [10] "dona" "dolt" "cont" "dong" "wont" "Mont"
##
## [[3]]
## [1] "nth" "mtg" "mt" "meth" "math" "moth" "myth" "meh" "mph" "mt h"
##
## [[4]]
## [1] "knowledgeable" "knowledgeably" "knowledge"
##
## [[5]]
## [1] "web work" "web-work" "workweek"
##
## [[6]]
## [1] "rd" "Dr" "fr" "d" "r" "dry" "er" "tr" "do" "or" "dc" "dd"
## [13] "gr" "pr" "hr"
sugg.words <- unlist(lapply(sugg.words, function(x) x[1]))
Many times, it is worth the effort to quickly go through the list to check the quality of the suggested words. To save time and effort, we will only manually check the words with at least 6 occurrences in all the comments. To do so, we first count the frequency of each word using the function count
from the dplyr
package (part of the tidyverse). Then, we keep only the words in the bad-word list by using the function inner_join
. After that, we save the data into the word.list.freq.csv
file.
## dplyr provides count() and inner_join()
library(dplyr)
word.list <- as.data.frame(cbind(bad.words, sugg.words))
freq.word <- count(prof.tm, word)
freq.word <- inner_join(freq.word, word.list, by = c(word = "bad.words"))
freq.word
## # A tibble: 7,121 x 3
## word n sugg.words
## <chr> <int> <fct>
## 1 a'ce 1 ace
## 2 a's 198 A's
## 3 aa 7 AA
## 4 aaaagghhh 1 <NA>
## 5 aaas 2 alas
## 6 aameetings 1 meetinghouse
## 7 aarc 2 arc
## 8 aaron 4 Aaron
## 9 aas 3 ass
## 10 abaluyia 1 Alabamian
## # ... with 7,111 more rows
write.csv(freq.word, "data/word.list.freq.csv", row.names = F)
We opened the file word.list.freq.csv
in Excel and checked the words in the list. Eventually, we found that about 220 words needed correction. For some words, the alternatives suggested by hunspell
were not correct, so we provided the alternatives ourselves. We saved the word list to the file word.list.csv
. We also saved the rest of the words into a file called stoplist.csv
.
1.3.4 Replace the misspelled words and abbreviations with the corrections
We now replace the misspelled words and abbreviations with the corrections we identified. To do so, we use the R package stringi
, particularly the function stri_replace_all_regex
. The function replaces all occurrences of the given patterns, so if a word is part of another word, it will also be replaced. To avoid this problem, we can put the word-boundary anchor "\\b" before and after the words to be searched. In that case, only whole words will be matched and replaced. A simple example is shown below.
library(stringi)
stri_replace_all_regex("there is the test", c("the", "test"), c("this", "exam"),
vectorize_all = FALSE)
## [1] "thisre is this exam"
stri_replace_all_regex("there is the test", c("\\bthe\\b", "\\btest\\b"), c("this",
"exam"), vectorize_all = FALSE)
## [1] "there is this exam"
We now replace all the misspelled words in all the comments using the code below.
word.list <- read.csv("data/word.list.csv", stringsAsFactors = FALSE)
bad.whole.words <- paste0("\\b", word.list$bad.words, "\\b")
sugg.words <- word.list$sugg.words
prof1000$comments <- stri_replace_all_regex(prof1000$comments, bad.whole.words, sugg.words,
vectorize_all = FALSE)
There are also a lot of abbreviations of course names and some other unusual spelling errors. This kind of information is often not useful for analysis, so we simply remove all those words from the comments by replacing them with a space. We save the changed data to a file called prof1000.corrected.csv
.
stop.list <- read.csv("data/stoplist.csv", stringsAsFactors = FALSE)
whole.words <- paste0("\\b", stop.list$word, "\\b")
prof1000$comments <- stri_replace_all_regex(prof1000$comments, whole.words, " ",
vectorize_all = FALSE)
## save the data with corrected words
write.csv(prof1000, "data/prof1000.corrected.csv", row.names = F)
1.4 Word stemming
Sometimes, different forms of a word might be used. For example, “love”, “loves”, “loving”, and “loved” are all forms of “love”. It can be useful to combine these words if they mean the same thing. One way to do so is to find the stem, root, or common part of the words. Several R packages can be used for this. For example, we can use the wordStem
function from the SnowballC
package.
prof.tm <- unnest_tokens(prof1000, word, comments, to_lower = TRUE)
## use library SnowballC
library(SnowballC)
prof.tm <- mutate(prof.tm, word.stem = wordStem(word, language = "en"))
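As a quick check, the forms of “love” mentioned earlier reduce to the same stem:

wordStem(c("love", "loves", "loving", "loved"), language = "en")
## all four forms should reduce to "love"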
The package hunspell
can also be used to find the stem of a word. However, it sometimes suggests multiple stems; here, we simply pick the last one in the list of suggestions. If a word has no stem, NA is returned. We write a function stem_hunspell
to carry out this logic. Then we get a list of unique words and obtain the possible stem of each. Finally, we replace the original words in the comments with their stems. For better accuracy, we again saved the suggested stems to a file and manually checked whether each stem is accurate.
## A function to find the stem of a word using hunspell
stem_hunspell <- function(term) {
    ## hunspell_stem returns a list; take the stems for this term
    stems <- hunspell_stem(term)[[1]]
    if (length(stems) == 0) {
        ## no stem found
        stem <- NA
    } else if (nchar(stems[[length(stems)]]) > 1) {
        ## use the last suggested stem
        stem <- stems[[length(stems)]]
    } else {
        ## keep the original word if the stem is too short
        stem <- term
    }
    stem
}
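As a usage example (the exact output depends on the installed hunspell dictionary):

stem_hunspell("assignments")  ## expected to return "assignment"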
word.list <- count(prof.tm, word)
## find the stems of the words
words.stem <- lapply(word.list$word, stem_hunspell)
stem.list <- cbind(word.list, stem = unlist(words.stem))
## keep words whose stems exist and differ from the original word
stem.list <- stem.list[!is.na(stem.list[, 3]) & stem.list[, 1] != stem.list[, 3], ]
## save to file and manually check the words
write.csv(stem.list, "data/stem.list.csv", row.names = F)
## replace the words in the comments with stems
stem.list <- read.csv("data/stem.list.csv", stringsAsFactors = FALSE)
prof.stem <- prof1000
orig.words <- paste0("\\b", stem.list[, 1], "\\b")
stem.words <- stem.list[, 3]
prof.stem$comments <- stri_replace_all_regex(prof.stem$comments, orig.words, stem.words,
vectorize_all = FALSE)
## remove duplicated information
prof.stem <- prof.stem[!duplicated(prof.stem$comments), ]
write.csv(prof.stem, "data/prof1000.stem.csv", row.names = F)
After stemming, we save the data into the file prof1000.stem.csv
. Since stemming is often recommended in text mining, we will use this data file in the rest of the analysis unless otherwise specified.
After the above data cleaning steps, we now have a data set with 38,157 rows and 12 columns.
str(prof.stem)
## 'data.frame': 38157 obs. of 12 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ profid : int 1 1 1 1 1 1 1 1 1 1 ...
## $ rating : num 5 5 4 3 1 5 5 2 3 3 ...
## $ difficulty: int 3 4 5 5 5 5 5 4 5 5 ...
## $ credit : int 1 1 1 1 1 1 1 1 1 1 ...
## $ grade : int 5 4 5 7 3 NA 6 7 7 8 ...
## $ book : int 0 0 0 0 0 1 1 1 1 1 ...
## $ take : int 1 1 1 0 0 0 1 0 NA NA ...
## $ attendance: int 1 1 0 1 1 1 1 1 1 0 ...
## $ tags : chr "respected;accessible outside class;skip class? you won't pass." "accessible outside class;lots of homework;respected" "tough grader;lots of homework;accessible outside class" "tough grader;so many papers;lots of homework" ...
## $ comments : chr "best professor I've had in college. only thing i dont like is the writing assignment" "Professor has been the best mathematics professor I've had at thus far. He assign a heavy amount of homework bu"| __truncated__ "He was a great professor. he does give a lot of homework but he will work with you if you don't clear understan"| __truncated__ "Professor is an incredible respected teacher, however his class is extreme difficult. I believe he just assume "| __truncated__ ...
## $ date : chr "04/17/2018" "02/13/2018" "01/07/2018" "12/11/2017" ...