Chapter 4 Document Term Matrix

Previously, we looked at the word frequency across all the comments on all professors. It is often helpful to look at the words used in the comments on individual professors, in which case we count the word frequency for each professor. This can be done easily with the tidytext package.

In the dataset, the frequency of each word for a given professor was calculated. Another way to organize the data is to create a document-term matrix (DTM). A DTM is a matrix whose rows represent the documents (comments, or any other bodies of text) and whose columns represent the terms or words used in those documents. Therefore, the tidytext data can be viewed as the long format and the DTM as the wide format of the same information.
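To make the long versus wide distinction concrete, here is a minimal base-R sketch on a toy table. The ids, words, and counts are made up for illustration and are not taken from the professor data; `xtabs` is used only as a stand-in for a dedicated conversion function.

```r
# Toy long-format counts: one row per (professor, word) pair.
# All values are hypothetical.
long <- data.frame(
  id   = c(1, 1, 2, 2, 3),
  word = c("exam", "paper", "exam", "online", "paper"),
  n    = c(2, 1, 1, 3, 4)
)

# Wide format: one row per document, one column per term.
# (id, word) pairs that never occur in the long table become 0.
dtm <- unclass(xtabs(n ~ id + word, data = long))
attr(dtm, "call") <- NULL   # drop xtabs bookkeeping, keep a plain matrix

print(dtm)
```

The long table stores only the 5 pairs that occur, while the wide matrix stores all 9 cells, 4 of which are 0. This is why a DTM built from real text is mostly zeros.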

DTM is the default format used by the R package tm. The package tidytext includes a function cast_dtm to convert data to DTM.

The related information and part of the DTM are shown above. First, it shows the number of documents (professors) and the number of terms (words). Then, it shows the sparsity of the matrix, i.e., the number and percentage of zero entries. For our example, 97% of the entries in the matrix are 0.

If a word is used very rarely, it might not provide much useful information for further analysis. Therefore, one might remove such words to reduce the sparsity of the DTM. For example, the following code keeps only the words with less than 97% sparsity. Now the matrix has 1,205 terms with an overall sparsity of 84%.
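The idea behind this kind of filter can be sketched in base R on a small matrix. This is an illustration under our own assumptions, not the code used for the professor data: the matrix, the term names, and the cutoff of 0.7 are hypothetical. In tm, the function `removeSparseTerms` provides this kind of filter on a real DTM.

```r
# Toy document-term matrix (rows = documents, columns = terms).
dtm <- matrix(
  c(2, 0, 1, 0,
    1, 0, 0, 0,
    3, 1, 0, 0,
    1, 0, 2, 0,
    4, 0, 0, 5),
  nrow = 5, byrow = TRUE,
  dimnames = list(doc  = as.character(1:5),
                  term = c("exam", "essay", "paper", "quot"))
)

# Overall sparsity: share of cells equal to 0.
overall_sparsity <- mean(dtm == 0)

# Per-term sparsity: share of documents in which the term never occurs.
term_sparsity <- colMeans(dtm == 0)

# Keep only the terms whose sparsity is below the (hypothetical) cutoff.
cutoff    <- 0.7
dtm_small <- dtm[, term_sparsity < cutoff, drop = FALSE]

overall_sparsity       # sparsity before filtering
mean(dtm_small == 0)   # sparsity after filtering
```

Dropping the two terms that each appear in a single document lowers the overall sparsity, at the cost of discarding those terms entirely.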

4.1 tf-idf

Word frequency, also called term frequency (tf), is one way to measure how a word is used. It has the clear drawback that the frequency of a word in a document depends on how commonly the word is used in general. We have discussed before that we can remove commonly used words beforehand. Another way to deal with the problem is to use a “weighting” method of the kind commonly used in survey sampling.

The basic idea is that if a word appears in almost every comment (in the extreme case, in all comments), then the word adds little information to the analysis and should receive a small weight. For example, if a word appears exactly once in the evaluation of every professor, a weight of 0 can be used because the word does not help distinguish one professor from another.

In general, we can define a term frequency–inverse document frequency (tf-idf) statistic as

\[ tf\text{-}idf(t,d,D) = tf(t,d)\times idf(t,D) \] where

  • \(t\) is for a specific term or word,
  • \(d\) represents a specific document or comment or professor,
  • \(tf(t,d)\) is the frequency of \(t\) in \(d\),
  • \(D\) represents all documents or comments,
  • \(idf(t,D)\) is called the inverse document frequency, which can be viewed as a weight.

The \(idf(t,D)\) can take many different forms. If \(idf(t,D)=1\), then tf-idf reduces to the plain term frequency. A popular \(idf(t,D)\) is defined as

\[\begin{equation} \tag{4.1} idf(t,D) = \log\frac{N}{n_t} \end{equation}\]

where \(N\) is the total number of documents or comments or professors and \(n_t\) is the number of documents that contain the term \(t\). Variants of Equation (4.1) can be defined. For example, to avoid a weight of zero, we can use \(\log(1+N/n_t)\). Another option is \(\log[(N-n_t)/n_t]\).
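The three forms of \(idf(t,D)\) can be compared on a few toy values. In the sketch below, `N`, `n_t`, and the labels `all`, `half`, and `rare` are hypothetical and chosen only to show how each form behaves at the extremes.

```r
N   <- 4                                # total number of documents (toy value)
n_t <- c(all = 4, half = 2, rare = 1)   # documents containing each term

idf_basic    <- log(N / n_t)            # Equation (4.1)
idf_plus_one <- log(1 + N / n_t)        # variant that avoids a zero weight
idf_ratio    <- log((N - n_t) / n_t)    # third variant in the text

round(cbind(idf_basic, idf_plus_one, idf_ratio), 3)
```

A term that occurs in every document gets weight 0 under Equation (4.1) and a small positive weight under the second form. Under the third form its weight is \(-\infty\), and any term found in more than half of the documents gets a negative weight, so this form needs extra care in practice.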

Along the same lines, the raw count of a word, denoted by \(f_{t,d}\), is just one way to define a frequency measure. For example, we can also define \(tf\) so that it takes into account the length of the documents or comments. In this case, we can have

\[tf(t,d)=\frac{f_{t,d}}{\sum_{t'\in d}f_{t',d}}\] where the denominator is the total number of words in the document \(d\). This is also called the normalized frequency.
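The normalized frequency is simply each row of the count matrix divided by its row total. A minimal base-R sketch, with made-up counts:

```r
# Toy raw counts f_{t,d}: rows = documents, columns = terms.
counts <- matrix(
  c(2, 0, 1,
    1, 3, 0,
    0, 0, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(doc  = c("1", "2", "3"),
                  term = c("exam", "online", "paper"))
)

# Total number of words in each document (the denominator).
doc_length <- rowSums(counts)

# Divide every row by its own total.
tf <- sweep(counts, 1, doc_length, "/")

tf
rowSums(tf)   # each document's frequencies sum to 1
```

After normalization a long comment and a short comment are on the same scale, so a word is not weighted more heavily just because the document is longer.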

There are many ways to compute tf-idf. For example, the R package tidytext has the function bind_tf_idf to calculate tf-idf for each word. As shown below, three variables were added: tf, idf, and tf_idf.
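To see where three such columns come from, the computation can be sketched in base R on a toy long-format table. The sketch assumes the normalized term frequency and the natural-log idf of Equation (4.1); the data are made up, and whether a given package uses exactly these definitions should be checked against its documentation.

```r
# Toy long-format counts: one row per distinct (professor, word) pair.
long <- data.frame(
  id   = c(1, 1, 2, 2, 3),
  word = c("exam", "paper", "exam", "online", "exam"),
  n    = c(2, 1, 1, 3, 4)
)

# tf: count divided by the total number of words for that professor.
long$tf <- long$n / ave(long$n, long$id, FUN = sum)

# idf: log(number of documents / number of documents containing the word).
n_docs   <- length(unique(long$id))
doc_freq <- table(long$word)   # rows are distinct pairs, so this counts documents
long$idf <- log(n_docs / as.numeric(doc_freq[long$word]))

# tf-idf is the product of the two.
long$tf_idf <- long$tf * long$idf

long
```

Here "exam" appears for all three professors, so its idf and tf_idf are 0 no matter how often it is used, while "paper" and "online" each appear for a single professor and keep a positive weight.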

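The package tm can also be used to get tf-idf, as shown below. Note that weightTfIdf uses a base-2 logarithm for the idf and, by default, normalizes the term frequency by the document length, which is what “(normalized)” in the output indicates.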

library(tm)
# Reweight the raw counts in prof.dtm by (normalized) tf-idf
prof.dtm.tfidf <- weightTfIdf(prof.dtm)
tm::inspect(prof.dtm.tfidf)
## <<DocumentTermMatrix (documents: 1000, terms: 1221)>>
## Non-/sparse entries: 196213/1024787
## Sparsity           : 84%
## Maximal term length: 15
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
## Sample             :
##      Terms
## Docs     english       essay        exam    homework mathematics     online
##   113 0.00000000 0.000000000 0.000000000 0.000000000 0.000000000 0.00000000
##   114 0.00000000 0.000000000 0.000000000 0.000000000 0.000000000 0.00000000
##   377 0.00000000 0.018912773 0.005772269 0.000000000 0.000000000 0.00000000
##   389 0.00641668 0.004683727 0.002858992 0.000000000 0.000000000 0.00000000
##   393 0.00828713 0.000000000 0.000000000 0.000000000 0.000000000 0.00000000
##   490 0.00000000 0.000000000 0.000000000 0.000000000 0.000000000 0.00000000
##   601 0.00000000 0.005929004 0.000000000 0.000000000 0.009072636 0.00000000
##   639 0.00000000 0.000000000 0.000000000 0.000000000 0.007281224 0.00000000
##   71  0.00000000 0.000000000 0.000000000 0.000000000 0.000000000 0.00000000
##   894 0.00000000 0.000000000 0.003316431 0.002866781 0.000000000 0.04163887
##      Terms
## Docs        paper    project     quizzes        quot
##   113 0.000000000 0.00000000 0.000000000 0.000000000
##   114 0.000000000 0.00000000 0.000000000 0.000000000
##   377 0.002933118 0.00000000 0.002855662 0.000000000
##   389 0.026149805 0.01545946 0.000000000 0.006378790
##   393 0.011257473 0.00000000 0.000000000 0.008238195
##   490 0.008387921 0.00000000 0.000000000 0.027622182
##   601 0.022068221 0.03913943 0.000000000 0.000000000
##   639 0.000000000 0.04711690 0.002873851 0.000000000
##   71  0.006668096 0.01182630 0.000000000 0.000000000
##   894 0.000000000 0.00000000 0.000000000 0.000000000