Chapter 3 Word Frequency

To understand how students evaluate professors, we can first investigate which words are used to describe a professor. This can be done by finding the most commonly used words in the narrative comments.

We use the cleaned, stemmed data in the analysis. We name the dataset prof1000 in R.
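As a minimal sketch of this setup, the code below reads the data from a CSV file; the file name prof1000.csv is a placeholder for illustration, not the actual file used in the book.

    library(tidyverse)
    # Hypothetical file name; the cleaned and stemmed comments are assumed
    # to be stored in a CSV file with one row per comment.
    prof1000 <- read_csv("prof1000.csv")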

We then divide the comments on the professors into individual words using the function unnest_tokens from the R package tidytext. We call the resulting dataset prof.tm.
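A sketch of the tokenization, assuming the comments are stored in a column named comments (the column name is an assumption):

    library(tidytext)
    # Split each comment into individual words, one word per row
    prof.tm <- prof1000 %>%
      unnest_tokens(word, comments)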

After that, we have a total of 1,668,696 individual words from the 38,157 comments on the 1,000 professors.

We now get the frequency of each word using the count function. We also sort the words from most frequent to least frequent.
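For example, with dplyr loaded:

    # Count how often each word appears, sorted from most to least frequent
    word.freq <- prof.tm %>%
      count(word, sort = TRUE)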

3.1 Stopwords

From the output, a total of 8,573 unique words were used in the 38,157 comments. The words with the highest frequency are “the”, “and”, “is”, and so on. Although these words are extremely common, they do not provide much useful information for analysis because they appear in all contexts. Therefore, we often simply remove these words, called stopwords, before any analysis.

The tidytext package includes a dataset called stop_words, which consists of words from three lexicons.

  • SMART: This stopword list was built by Gerard Salton and Chris Buckley for the SMART information retrieval system at Cornell University. It consists of 571 words.
  • snowball: This list comes from the Snowball string-processing language. It has 174 words.
  • onix: This stopword list is probably the most widely used one. It comes from the Onix text retrieval system and contains 404 words.
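The composition of stop_words can be checked directly; a quick sketch:

    library(tidytext)
    library(dplyr)
    # Number of words contributed by each of the three lists
    stop_words %>% count(lexicon)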

The combination of the three lists includes a total of 1,149 words (some words appear in more than one list), which might be too aggressive to remove. In particular, the lists were not developed for teaching evaluations. For example, the SMART list would remove the word “least”, which is very useful in many comments on teaching; for instance, one comment said, “This is the least organized professor.” Similarly, the onix list suggests removing “best”, which we think should be kept. Therefore, we went through the 1,149 words and identified a total of 568 words that can be safely removed. There are also words such as “professor”, “teacher”, and “Dr.” that are not in the three stopword lists. However, they are very commonly used in teaching evaluations and should be removed when necessary. The list of words is saved in the file stopwords.evaluation.csv. It has two types of words - “evaluation” and “optional”.

The function read_csv from the readr package is similar to read.csv but creates a tibble instead of the typical data frame.

To remove the stopwords, we can use the function anti_join. We first remove the “evaluation” words, using the filter function to select that subset of the stopword list.
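A sketch of these two steps, assuming the file has columns named word and type (the column names are assumptions, not confirmed by the text):

    library(readr)
    # Assumed columns: word and type ("evaluation" or "optional")
    stopwords.evaluation <- read_csv("stopwords.evaluation.csv")
    # Drop every word labeled "evaluation" from the tokenized data
    prof.tm <- prof.tm %>%
      anti_join(filter(stopwords.evaluation, type == "evaluation"),
                by = "word")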

After removing the common stopwords, we have a total of 8,191 unique words left. We can also see that the most frequent words are “class”, “not”, “very”, “professor”, “teacher”, etc. Clearly, some of these words are not very helpful in many analyses. We have included such words in our optional stopword list and therefore further remove them.
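Continuing the sketch above, the optional words can be removed the same way:

    # Remove the optional stopwords as well
    prof.tm <- prof.tm %>%
      anti_join(filter(stopwords.evaluation, type == "optional"),
                by = "word")
    # Recompute the word frequencies on the reduced data
    word.freq <- prof.tm %>% count(word, sort = TRUE)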

3.2 Visualization of word frequency

We can visualize the frequency of the words using different methods.

3.2.1 Barplot

We can plot the frequency of the words using a barplot in which the length of the bars represents the frequencies of the words. For example, Figure 3.1 shows the top 20 most frequently used words. In the figure, we flipped the coordinates to put the words on the y-axis using coord_flip.
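A minimal sketch of the plot with ggplot2, using the word.freq table from the earlier sketches (top_n keeps the 20 most frequent words):

    library(ggplot2)
    # Barplot of the top 20 words; coord_flip puts the words on the y-axis
    word.freq %>%
      top_n(20, n) %>%
      ggplot(aes(x = word, y = n)) +
      geom_col() +
      coord_flip()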


Figure 3.1: Barplot of word frequency

Note that the words were ordered in alphabetical order. We can also order the words based on their frequency. In doing so, we convert the word variable from a character variable to a factor and then plot it in Figure 3.2. The function reorder treats word as a categorical variable and reorders the individual words, its levels, based on the word frequency n.
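Continuing the sketch, only the aesthetic for the x-axis changes:

    # reorder() turns word into a factor with levels sorted by frequency n
    word.freq %>%
      top_n(20, n) %>%
      ggplot(aes(x = reorder(word, n), y = n)) +
      geom_col() +
      xlab("word") +
      coord_flip()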


Figure 3.2: Barplot of word frequency with words ordered by frequency

3.2.2 Word cloud (wordcloud)

Another way to visualize the word frequency is to generate a word cloud. A word cloud directly plots the words in a figure, with the size of each word representing its frequency. A word cloud can be generated using the R package wordcloud.

The package cannot be used directly in the “pipe” way. However, we can use the with function together with it. For example, the code below generates the word cloud in Figure 3.3 for the top 50 words.
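A minimal sketch of this approach (word.freq is the frequency table from the earlier sketches):

    library(wordcloud)
    # with() lets us refer to the columns of word.freq directly
    with(word.freq, wordcloud(word, n, max.words = 50))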


Figure 3.3: Word cloud of word frequency

The R function wordcloud needs at least a vector of words and a vector of the corresponding word frequencies. Other useful options, illustrated in the sketch after the list, include

  • max.words: The maximum number of words to be plotted; the least frequent terms are dropped.
  • random.order: Plot words in random order; otherwise, they will be plotted in decreasing frequency. In general, it is better to set this to FALSE.
  • rot.per: The proportion of words plotted with a 90-degree rotation.
  • colors: Colors for the words from least to most frequent. The function brewer.pal makes the color palettes from the R package RColorBrewer available as R palettes. The palette Dark2 seems to work best for a word cloud.
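Putting these options together, a sketch of a fuller call (the rot.per value is an illustrative choice):

    library(wordcloud)
    library(RColorBrewer)
    # Word cloud of the top 50 words, most frequent in the center,
    # colored with the Dark2 palette
    with(word.freq,
         wordcloud(word, n, max.words = 50, random.order = FALSE,
                   rot.per = 0.3, colors = brewer.pal(8, "Dark2")))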

Another package that can be used to generate a word cloud is wordcloud2. It serves the same purpose as wordcloud but provides some other useful features. It is particularly useful for showing the plot on a webpage because it interactively shows the word frequency.
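A minimal sketch with wordcloud2; it takes a data frame whose first two columns contain the words and their frequencies (here we limit the plot to the top 100 words for readability):

    library(wordcloud2)
    # wordcloud2 uses the first two columns as words and frequencies
    word.freq %>%
      top_n(100, n) %>%
      wordcloud2()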

3.3 n-gram analysis

Compared to a single word, a group of words can be more informative. For example, the comment below starts with “not easy”. If the comment were broken into individual words and we looked at “easy” only, the meaning would be the opposite.

not easy at all, yes she is good teacher and explaine material well enough. BUT there is a LOT of material and you never know what paricularly would be on test. she ask on test not much but in details. be ready to read and memorise a huge amount of information

Instead of splitting a comment into single words, we can divide it into groups of words, e.g., two words, three words, or more. In this way, we can capture some information that cannot be reflected in individual words. The method that breaks text into groups of n consecutive words is in general called “n-gram” analysis. With two words, it is 2-gram or bigram analysis, and with three words, it is called 3-gram or trigram analysis.

The tidytext package can also be used to split documents or comments into groups of multiple words.

3.3.1 2-gram analysis

The following code will split the comments on the professors into groups of two consecutive words. Note that we set token = "ngrams" and n = 2 for two words.
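A sketch of the tokenization, again assuming the comments are stored in a column named comments:

    # Tokenize into bigrams (pairs of consecutive words)
    prof.2tm <- prof1000 %>%
      unnest_tokens(bigram, comments, token = "ngrams", n = 2)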

Each two-word phrase can be viewed as an individual word for word frequency analysis, as in the single-word situation.

As with single words, extremely common phrases such as “he is” and “if you” might not be very informative in teaching evaluation analysis. We can remove those frequent terms first. To do so, we first split the two-word phrases into single words and remove stopwords as we did earlier.
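A sketch of this step using separate from tidyr; the object names continue the earlier sketches:

    library(tidyr)
    # Split each bigram into two columns and drop pairs containing a stopword
    prof.2tm.sep <- prof.2tm %>%
      separate(bigram, c("word1", "word2"), sep = " ") %>%
      filter(!word1 %in% stopwords.evaluation$word,
             !word2 %in% stopwords.evaluation$word)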

Then we can check the frequency of the phrases. Before counting the frequency, we connect the two words back together using the function unite.
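Continuing the sketch:

    # Glue the two words back together, then count the combined phrases
    prof.2tm.sep %>%
      unite(bigram, word1, word2, sep = " ") %>%
      count(bigram, sort = TRUE)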

From the output, we can see that the most widely used pair of words after removing the stopwords is “great teacher”. Note that, as with stemming words, we can also combine similar 2-grams such as “great teacher”, “great professor”, and “best professor” into one 2-gram.

We now visualize the information using both a barplot and a word cloud.

3.3.1.1 Network plot of the word relationship

With n-grams, we can connect words together in a graph to form a network plot. This is another way to visualize 2-grams and the word relationship. The R package ggraph can be used to generate such a network plot. To use the package, we have to first create an igraph graph using the igraph package. This can be done easily using the graph_from_data_frame() function from the igraph package as shown in the code below.
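A sketch of this step, counting the separated word pairs and keeping the frequent ones (object names follow the earlier sketches):

    library(igraph)
    # Count the word pairs, keep the frequent ones, and build the graph
    word.network <- prof.2tm.sep %>%
      count(word1, word2, sort = TRUE) %>%
      filter(n >= 300) %>%
      graph_from_data_frame()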

Note that in the R code, we used the data set with the separated words after removing stopwords. For better presentation, we also select the pairs of words with at least 300 occurrences. The generated igraph graph is called word.network. It has 58 words and 56 connections among them.

An igraph graph includes a lot of information.
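Printing word.network produces a summary roughly like the following (the edge list is truncated here):

    IGRAPH 42e299e DN-- 58 56 --
    + attr: name (v/c), n (e/n)
    + edges from 42e299e (vertex names):
    ...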

The first line of the output includes the following information.

  • A unique identifier of the graph - here it is “42e299e”.
  • Following the identifier is the indicator of the type of graph. Four letters can be used to denote the type - “U” or “D”, “N”, “W”, and “B” in the presented order, with a dash shown when a property does not apply.
    • First letter: “U” for undirected and “D” for directed graphs.
    • Second letter: “N” for a named graph.
    • Third letter: “W” for weighted graphs.
    • Fourth letter: “B” for bipartite graphs.
  • Two numbers: the number of vertices (words in this example) and the number of edges (the connections between words) in the graph.

The second line contains all the attributes of the graph. An attribute is either about the vertices (nodes, actors), identified by the letter “v” in the parentheses, or about the edges (paths, links), identified by the letter “e” in the parentheses. For example, this graph has a “name” vertex attribute of type character (identified by the letter “c” after the “/”). The “name” attribute includes the information on the individual words. The graph also has an edge attribute called “n”, of numeric type, which contains the frequency information of the word pairs. Many attributes can be used here, but an attribute is not required.

Starting on the third line, the output includes the information on the edges. For a given edge, it shows the starting vertex, the direction, and the ending vertex.

With the igraph graph representation, we can now generate a network plot using the R package ggraph, which uses the same plotting approach as ggplot2. In particular, code along the lines shown below can be used. From the plot, we can easily see how the words are related to each other. For example, the words “not”, “very”, and “teacher” are connected to many other words.
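A sketch of the plotting code; the fr layout, the seed, and the text placement options are illustrative choices:

    library(ggraph)
    set.seed(2021)  # fix the layout for reproducibility
    ggraph(word.network, layout = "fr") +
      geom_edge_link() +
      geom_node_point() +
      geom_node_text(aes(label = name), vjust = 1, hjust = 1)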