Chapter 1 Data

Throughout the book, we will use a set of data to illustrate how to conduct text mining. We introduce the data set first.

1.1 Teaching evaluation data

The data file prof1000.original.csv includes teaching evaluation data on 1,000 professors. The structure of the data is shown below.

By default, when the R function read.csv reads data into R, non-numerical data are converted to factors, and the values of a vector are treated as different levels of a factor. Because text data are the focus of text mining, we should keep the data as characters by setting stringsAsFactors = FALSE.
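A minimal sketch of reading the data is shown below (the file name is from the text; the object name prof1000 matches how the data are referenced later).

```r
## Read the teaching evaluation data; keep the text comments as character
## strings instead of factors
prof1000 <- read.csv("prof1000.original.csv", stringsAsFactors = FALSE)

## Inspect the structure of the data
str(prof1000)
```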

The data include 38,240 records on 12 variables. Below is the list of variables.

  • id: The identification number of an individual rating. It ranges from 1 to 38,240.
  • profid: The identification number of a professor. It ranges from 1 to 1,000.
  • rating: The overall rating of a professor’s teaching. It ranges from 1 to 5 with 5 indicating the best.
  • difficulty: An evaluation of the difficulty of the course. It ranges from 1 to 5 with 5 indicating the most difficult.
  • credit: Whether the course was taken for credit or not. 1: Yes; 0: No.
  • grade: The grade a student received in the course. It ranges from 1 (A+) to 13 (F). See Table 1.1 below for details.
Table 1.1: Grade information

  grade   value
  -----   -----
      1   A+
      2   A
      3   A-
      4   B+
      5   B
      6   B-
      7   C+
      8   C
      9   C-
     10   D+
     11   D
     12   D-
     13   F

In the dataset, NA always represents a missing value.

1.2 Initial processing of data

We first conduct some very preliminary processing of the data. Since very short comments might not be useful, we remove all the comments that are shorter than 50 characters. This can be done using the code below. The function nchar counts the number of characters in a string.
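A sketch of this step, assuming the comment text is stored in a column named comments:

```r
## Keep only the records whose comments have at least 50 characters
## (records with missing comments are dropped as well)
prof1000 <- subset(prof1000, nchar(comments) >= 50)
```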

Now the dataset has 38,193 records left.

1.3 Handling spelling errors and abbreviations

Spelling errors and abbreviations are much more common in reviews and comments than in formal writing. We can pre-process the comments to remove or replace common spelling errors and abbreviations. This process can be time-consuming and tedious but will improve the quality of the analysis later on.

First, we obtain a list of unique words, including the misspelled ones and abbreviations, from all the comments. Second, we use Hunspell, a popular spell checker and morphological analyzer, to find words that might be misspelled. After that, we ask Hunspell to suggest potential correct words for the misspelled ones. Then, we replace the misspelled words with the suggested corrections. In R, the package hunspell can be used to conduct the spelling check and find the suggested corrections.

1.3.1 Get the unique word list

To get a list of words used in all comments, we first split the comments into individual words. To do so, we use the unnest_tokens function from the tidytext package.

tidytext is an R package for text mining. It provides easy-to-use functions for mining text information. For more information on the use of tidytext, see the online book by Julia Silge and David Robinson: Text Mining with R: A Tidy Approach.

If a package has not been installed, one can use the command install.packages('pkgname') to install it. To use the package, load it with library('pkgname'). Here, pkgname is the name of the package to be used, e.g., tidytext.

The following R code returns a list of words used in all comments.
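A sketch under the assumption that the comment text is in the column comments; the new column is named word, as described below (object names such as words.all and word.list are illustrative).

```r
library(dplyr)
library(tidytext)

## Split each comment into individual words, one word per row
words.all <- prof1000 %>%
  unnest_tokens(word, comments)

## The unique words used across all comments
word.list <- words.all %>%
  distinct(word)

nrow(word.list)
```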

Note that there are a total of 21,226 unique words.

The function unnest_tokens splits a column of a data frame into tokens using the tokenizers package. The resulting data frame has one word per row. A token is typically a word but can be something else. In this example, the column comments of prof1000 is split and saved into a new column called word.

1.3.2 Identify misspelled words and abbreviations

We now use the R package hunspell to find the misspelled words and abbreviations. Note that the output of the function hunspell is a list. If a word is misspelled, the corresponding element contains the word; otherwise, it is an empty character vector (character(0)). For this data set, 7,173 words were found to be problematic.
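A sketch of the check, reusing the word.list object from the previous sketch (the object names spell.check and bad.words are illustrative).

```r
library(hunspell)

## hunspell() returns a list: each element contains the misspelled word,
## or character(0) if the word passes the check
spell.check <- hunspell(word.list$word)

## Collect the words flagged as misspelled or abbreviated
bad.words <- unique(unlist(spell.check))
length(bad.words)
```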

1.3.3 Suggest alternatives to the identified words

The package hunspell can also suggest potential alternatives, i.e., corrections, for the misspelled or abbreviated words. For example, for the word “Professsor”, it gives the suggestions below.
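A sketch of asking for suggestions (the exact suggestions depend on the dictionary used):

```r
library(hunspell)

## Ask hunspell for possible corrections of a misspelled word
hunspell_suggest("Professsor")
```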

Note that the first word is typically the best alternative. We therefore construct a list of corrected words for the corresponding misspelled words. Clearly, some suggested words are not the best choice, but overall, the correction improves the quality of the data.
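A sketch of building the correction list, taking the first suggestion for each flagged word (bad.words is from the sketch above; suggestions, corrections, and bad.word.list are illustrative names).

```r
## Suggestions for every flagged word
suggestions <- hunspell_suggest(bad.words)

## Use the first suggestion as the tentative correction; NA if there is none
corrections <- sapply(suggestions, function(s) if (length(s) > 0) s[1] else NA)

bad.word.list <- data.frame(word = bad.words, suggestion = corrections,
                            stringsAsFactors = FALSE)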

Many times, it is worth the effort to quickly go through the list to check the quality of the suggested words. To save time and effort, we will only manually check the words with at least 6 occurrences in all the comments. To do so, we first count the frequency of each word using the function count from the tidyverse. Then, we keep only the words in the bad-word list by using the function inner_join. After that, we save the data into the file word.list.freq.csv.
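A sketch of this step, building on the objects defined in the sketches above (the column n is the word frequency produced by count).

```r
library(dplyr)
library(tidytext)

## Count how often each word appears in all comments
word.freq <- prof1000 %>%
  unnest_tokens(word, comments) %>%
  count(word, sort = TRUE)

## Keep only the flagged words, together with their suggested corrections,
## and restrict to words with at least 6 occurrences for manual checking
word.list.freq <- bad.word.list %>%
  inner_join(word.freq, by = "word") %>%
  filter(n >= 6)

## Save for manual checking in a spreadsheet
write.csv(word.list.freq, "word.list.freq.csv", row.names = FALSE)
```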

We opened the file word.list.freq.csv in Excel and checked the words in the list. Eventually, we found that about 220 words needed correction. For some words, the alternatives suggested by hunspell were not correct, and we therefore provided the alternatives ourselves. We saved the word list to the file word.list.csv. We also saved the rest of the words into a file called stopwordlist.csv.

1.3.4 Replace the misspelled words and abbreviations with the corrections

We now replace the misspelled words and abbreviations with the corrections we identified. To do so, we use the R package stringi, particularly the function stri_replace_all_regex. The function replaces every occurrence of the found patterns, so if a word is part of another word, that part will also be replaced. To avoid this problem, we can put “\b” before and after the words to be searched. In this case, only whole words will be found and replaced. A simple example can be seen below.
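A minimal example of the difference (the sentence is made up for illustration):

```r
library(stringi)

x <- "He is a good prof and very professional."

## Without word boundaries, the "prof" inside "professional" is also replaced
stri_replace_all_regex(x, "prof", "professor")

## With \b around the pattern, only the whole word "prof" is replaced
stri_replace_all_regex(x, "\\bprof\\b", "professor")
```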

We now replace all the misspelled words in all the comments using the code below.
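A sketch, assuming the manually checked corrections were read back from word.list.csv into columns named word and correction (the object and column names are illustrative).

```r
library(stringi)

## The manually checked corrections (column names assumed: word, correction)
word.corrections <- read.csv("word.list.csv", stringsAsFactors = FALSE)

## Replace each misspelled word, matched as a whole word, with its correction;
## vectorize_all = FALSE applies every pattern to every comment in turn
prof1000$comments <- stri_replace_all_regex(
  prof1000$comments,
  pattern       = paste0("\\b", word.corrections$word, "\\b"),
  replacement   = word.corrections$correction,
  vectorize_all = FALSE
)
```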

There are also a lot of abbreviations of course names and some other unusual spelling errors. This kind of information is often not useful for analysis, so we simply remove all those words from the text comments (replacing them with a space). We save the data after the change to a file called prof1000.corrected.csv.
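A sketch of this step, assuming the words to be removed were stored in a column named word of the file stopwordlist.csv (drop.words is an illustrative name).

```r
library(stringi)

## Words to remove (course-name abbreviations and other unusual tokens)
drop.words <- read.csv("stopwordlist.csv", stringsAsFactors = FALSE)$word

## Replace each of these words with a space, then save the cleaned data
prof1000$comments <- stri_replace_all_regex(
  prof1000$comments,
  pattern       = paste0("\\b", drop.words, "\\b"),
  replacement   = " ",
  vectorize_all = FALSE
)
write.csv(prof1000, "prof1000.corrected.csv", row.names = FALSE)
```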

1.4 Word stemming

Sometimes, different forms of a word might be used. For example, “love”, “loves”, “loving”, and “loved” are all forms of “love”. It can be useful to combine these words if they mean the same thing. One way to do this is to find the stem, root, or common part of the words. Different R packages can be used for this. For example, we can use the wordStem function from the SnowballC package.
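A small example with the words mentioned above:

```r
library(SnowballC)

## All four forms reduce to the common stem "love"
wordStem(c("love", "loves", "loving", "loved"))
```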

The package hunspell can also be used to find the stem of a word, but sometimes it suggests multiple stems. Here, we simply pick the last one in the list of suggestions; if a word has no stem, nothing is returned. To do this, we write a function stem_hunspell. We then get the list of unique words and obtain the possible stem of each. Finally, we replace the original words in the comments with their stems. For better quality, we again saved the suggested stems to a file and manually checked whether each stem is accurate.
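A sketch of the stem_hunspell function as described above, picking the last suggested stem and returning nothing when no stem is found:

```r
library(hunspell)

stem_hunspell <- function(term) {
  ## hunspell_stem() may return several stems for a word
  stems <- unlist(hunspell_stem(term))
  if (length(stems) == 0) {
    character(0)               # no stem found: return nothing
  } else {
    stems[length(stems)]       # pick the last suggestion
  }
}

stem_hunspell("teaches")
```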

After stemming, we save the data into the file prof1000.stem.csv. Since stemming is often recommended in text mining, in the rest of the analysis, we will use this data file unless otherwise specified.

After the above data cleaning steps, we now have a data set with 38,157 rows and 12 columns.