Chapter 4 Document-Term Matrix
Previously, we looked at the word frequency across all the comments on all professors. Often, it is helpful to examine the words used in the comments on individual professors. In that case, we can count the word frequency for each professor separately. This can be done easily using the tidytext package.
library(tidytext)
library(dplyr)
library(readr)
prof1000 <- read.csv("data/prof1000.stem.gender.csv", stringsAsFactors = FALSE)
prof.tm <- unnest_tokens(prof1000, word, comments)  # one row per word per comment
stopwords <- read_csv("data/stopwords.evaluation.csv")
prof.tm <- prof.tm %>% anti_join(stopwords)  # remove the stop words
word.freq <- prof.tm %>% group_by(profid) %>% count(word, sort = TRUE)
word.freq
## # A tibble: 229,773 x 3
## # Groups: profid [1,000]
## profid word n
## <int> <chr> <int>
## 1 960 lecture 419
## 2 960 homework 188
## 3 960 great 172
## 4 960 material 158
## 5 960 help 152
## 6 960 good 144
## 7 960 understand 134
## 8 960 easy 123
## 9 133 test 121
## 10 960 video 121
## # ... with 229,763 more rows
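With the counts grouped by professor, we can inspect a single professor directly. For example, the following (a quick sketch using standard dplyr verbs) lists the five most frequent words for professor 960, the professor at the top of the output above.
# five most frequent words for professor 960
word.freq %>% filter(profid == 960) %>% slice_max(n, n = 5)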
In the dataset above, the frequency of each word for a given professor was calculated. Another way to organize the data is to create a document-term matrix (DTM). A DTM is a matrix whose rows represent the documents (comments, or any other bodies of text) and whose columns represent the terms, or words, used in the documents. Therefore, the tidy text data can be viewed as the long format and the DTM as the wide format of the same information. The DTM is the default format used by the R package tm. The tidytext package includes the function cast_dtm to convert tidy data to a DTM.
# cast the per-professor word counts into a document-term matrix
prof.dtm <- word.freq %>% cast_dtm(profid, word, n)
tm::inspect(prof.dtm)
## <<DocumentTermMatrix (documents: 1000, terms: 8155)>>
## Non-/sparse entries: 229773/7925227
## Sparsity : 97%
## Maximal term length: 19
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs easy good great hard help learn lecture test time work
## 105 20 25 14 43 27 18 27 105 19 10
## 133 58 47 62 62 25 28 54 121 28 23
## 3 36 30 52 45 19 16 100 62 16 12
## 562 35 35 29 35 30 12 32 19 52 13
## 641 38 38 45 47 55 10 47 65 30 40
## 658 113 28 45 23 14 17 39 90 11 1
## 721 119 25 40 18 8 30 67 46 25 11
## 922 42 21 64 8 33 31 8 5 15 26
## 960 123 144 172 96 152 114 419 64 77 67
## 973 50 35 69 16 46 28 2 7 31 58
The summary information and part of the DTM are shown above. First, it gives the number of documents (professors) and the number of terms (words). Then, it reports the sparsity of the matrix, i.e., the percentage of zero entries. For our example, 97% of the entries in the matrix are 0.
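Because the DTM and the tidy data carry the same information, we can also go in the other direction: the tidy function re-exported by tidytext converts a DTM back to the long format, with one row per nonzero document-term entry.
# back from the wide DTM to the tidy long format
tidy(prof.dtm)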
If a word was used very rarely, it might not provide much useful information for further analysis. Therefore, one might remove such words to reduce the sparsity of the DTM. For example, the following code only keeps the words with less than 97% sparsity (i.e., words used for at least 3% of the professors). The matrix now has 1,221 terms with an overall sparsity of 84%.
library(tm)
# drop terms that are zero for more than 97% of the professors
prof.dtm <- removeSparseTerms(prof.dtm, 0.97)
tm::inspect(prof.dtm)
## <<DocumentTermMatrix (documents: 1000, terms: 1221)>>
## Non-/sparse entries: 196213/1024787
## Sparsity : 84%
## Maximal term length: 15
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs easy good great hard help learn lecture test time work
## 105 20 25 14 43 27 18 27 105 19 10
## 133 58 47 62 62 25 28 54 121 28 23
## 3 36 30 52 45 19 16 100 62 16 12
## 562 35 35 29 35 30 12 32 19 52 13
## 641 38 38 45 47 55 10 47 65 30 40
## 658 113 28 45 23 14 17 39 90 11 1
## 721 119 25 40 18 8 30 67 46 25 11
## 922 42 21 64 8 33 31 8 5 15 26
## 960 123 144 172 96 152 114 419 64 77 67
## 973 50 35 69 16 46 28 2 7 31 58
4.1 tf-idf
Word frequency, also called term frequency (tf), is one way to measure how a word is used. It has a clear drawback: the frequency of a word in a document is related to how commonly the word is used in general. We discussed earlier that commonly used words can be removed early on as stop words. Another way to deal with this is a “weighting” method similar to the one commonly used in sample surveys.
The basic idea is that if a word appears in almost every comment (in the extreme case, in all comments), then the word does not add much information to the analysis, and a small weight should be used. For example, if a word appears exactly once in the evaluation of every professor, a weight of 0 can be used because it carries no useful information for further analysis.
In general, we can define a term frequency–inverse document frequency (tf-idf) statistic as
\[ tf\text{-}idf(t,d,D) = tf(t,d)\times idf(t,D) \] where
- \(t\) is for a specific term or word,
- \(d\) represents a specific document or comment or professor,
- \(tf(t,d)\) is the frequency of \(t\) in \(d\),
- \(D\) represents all documents or comments,
- \(idf(t,D)\) is called the inverse document frequency, which can be viewed as a weight.
The \(idf(t,D)\) can take many different forms. If \(idf(t,D)=1\), then the frequency of the words is used as usual. A popular \(idf(t,D)\) is defined as
\[\begin{equation} \tag{4.1} idf(t,D) = \log\frac{N}{n_t} \end{equation}\]
where \(N\) is the total number of documents (comments or professors) and \(n_t\) is the number of documents containing the term \(t\). Variants of Equation (4.1) can also be defined. For example, to avoid a zero weight, we can use \(\log(1+N/n_t)\). Another option is \(\log[(N-n_t)/n_t]\).
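A small numerical sketch in R shows how these weights behave; the values of \(n_t\) below are made up for illustration.
N <- 1000                # total number of documents
n.t <- c(10, 500, 1000)  # number of documents containing the term
log(N / n.t)             # Equation (4.1): weight is 0 when a term is in every document
log(1 + N / n.t)         # variant that avoids the zero weight
log((N - n.t) / n.t)     # variant; -Inf when a term is in every document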
Along the same lines, the raw count of words, denoted by \(f_{t,d}\), is just one way to define a frequency measure. For example, we can also define \(tf\) by taking the length of the document or comment into account. In this case, we have
\[tf(t,d)=\frac{f_{t,d}}{\sum_{t'\in d}f_{t',d}}\] where the denominator is the total number of words in the document \(d\). This is also called the normalized term frequency.
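Combining the normalized \(tf\) with the \(idf\) in Equation (4.1), the tf-idf can be computed by hand from word.freq with dplyr. This is only a sketch, using \(N=1000\) professors as in our data; it should reproduce the values from the bind_tf_idf function introduced below.
# hand computation of tf-idf from the per-professor word counts
word.freq %>%
  group_by(profid) %>%
  mutate(tf = n / sum(n)) %>%        # normalized term frequency within a professor
  group_by(word) %>%
  mutate(idf = log(1000 / n())) %>%  # n() counts the professors using the word
  ungroup() %>%
  mutate(tf_idf = tf * idf)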
Many different tools can be used to compute the tf-idf. For example, the R package tidytext has the function bind_tf_idf to calculate the tf-idf for each word. As shown below, three variables are added: tf, idf, and tf_idf.
# add tf, idf, and tf-idf for each word within each professor
word.freq <- word.freq %>% bind_tf_idf(word, profid, n)
word.freq
## # A tibble: 229,773 x 6
## # Groups: profid [1,000]
## profid word n tf idf tf_idf
## <int> <chr> <int> <dbl> <dbl> <dbl>
## 1 960 lecture 419 0.0507 0.164 0.00830
## 2 960 homework 188 0.0227 0.546 0.0124
## 3 960 great 172 0.0208 0.0398 0.000828
## 4 960 material 158 0.0191 0.201 0.00384
## 5 960 help 152 0.0184 0.0651 0.00120
## 6 960 good 144 0.0174 0.0315 0.000549
## 7 960 understand 134 0.0162 0.143 0.00231
## 8 960 easy 123 0.0149 0.0243 0.000362
## 9 133 test 121 0.0333 0.121 0.00403
## 10 960 video 121 0.0146 1.82 0.0267
## # ... with 229,763 more rows
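Words with a high tf-idf are those used unusually often for a particular professor. In the output above, for instance, video for professor 960 has a much larger tf-idf (0.0267) than the far more frequent but generic good (0.000549). To list the most distinctive words for professor 960, we can sort on the new variable (again a quick dplyr sketch):
# words most characteristic of professor 960 by tf-idf
word.freq %>% filter(profid == 960) %>% slice_max(tf_idf, n = 5)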
The package tm can also be used to obtain the tf-idf, through its weightTfIdf function, as shown below. Note that by default weightTfIdf uses the normalized term frequency, i.e., the counts are divided by the document length.
# re-weight the DTM by the normalized tf-idf
prof.dtm.tfidf <- weightTfIdf(prof.dtm)
tm::inspect(prof.dtm.tfidf)
## <<DocumentTermMatrix (documents: 1000, terms: 1221)>>
## Non-/sparse entries: 196213/1024787
## Sparsity : 84%
## Maximal term length: 15
## Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
## Sample :
## Terms
## Docs english essay exam homework mathematics online
## 113 0.00000000 0.000000000 0.000000000 0.000000000 0.000000000 0.00000000
## 114 0.00000000 0.000000000 0.000000000 0.000000000 0.000000000 0.00000000
## 377 0.00000000 0.018912773 0.005772269 0.000000000 0.000000000 0.00000000
## 389 0.00641668 0.004683727 0.002858992 0.000000000 0.000000000 0.00000000
## 393 0.00828713 0.000000000 0.000000000 0.000000000 0.000000000 0.00000000
## 490 0.00000000 0.000000000 0.000000000 0.000000000 0.000000000 0.00000000
## 601 0.00000000 0.005929004 0.000000000 0.000000000 0.009072636 0.00000000
## 639 0.00000000 0.000000000 0.000000000 0.000000000 0.007281224 0.00000000
## 71 0.00000000 0.000000000 0.000000000 0.000000000 0.000000000 0.00000000
## 894 0.00000000 0.000000000 0.003316431 0.002866781 0.000000000 0.04163887
## Terms
## Docs paper project quizzes quot
## 113 0.000000000 0.00000000 0.000000000 0.000000000
## 114 0.000000000 0.00000000 0.000000000 0.000000000
## 377 0.002933118 0.00000000 0.002855662 0.000000000
## 389 0.026149805 0.01545946 0.000000000 0.006378790
## 393 0.011257473 0.00000000 0.000000000 0.008238195
## 490 0.008387921 0.00000000 0.000000000 0.027622182
## 601 0.022068221 0.03913943 0.000000000 0.000000000
## 639 0.000000000 0.04711690 0.002873851 0.000000000
## 71 0.006668096 0.01182630 0.000000000 0.000000000
## 894 0.000000000 0.00000000 0.000000000 0.000000000
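For further analysis outside the tm package, the weighted DTM can be coerced to an ordinary matrix; a minimal sketch:
# coerce the tf-idf weighted DTM to a plain matrix
tfidf.mat <- as.matrix(prof.dtm.tfidf)
dim(tfidf.mat)  # 1000 professors by 1221 terms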