Chapter 4 Document-Term Matrix
Previously, we looked at the word frequency across all the comments on all professors. Often, it is helpful to examine the words used in the comments on individual professors. In that case, we can count the word frequency for each professor separately. This can be done easily using the tidytext package.
library(tidytext)
library(dplyr)
library(readr)
prof1000 <- read.csv("data/prof1000.stem.gender.csv", stringsAsFactors = FALSE)
prof.tm <- unnest_tokens(prof1000, word, comments)  # one row per word per comment
stopwords <- read_csv("data/stopwords.evaluation.csv")
prof.tm <- prof.tm %>% anti_join(stopwords)  # remove the stop words
word.freq <- prof.tm %>% group_by(profid) %>% count(word, sort = TRUE)
word.freq
## # A tibble: 229,773 x 3
## # Groups: profid [1,000]
## profid word n
## <int> <chr> <int>
## 1 960 lecture 419
## 2 960 homework 188
## 3 960 great 172
## 4 960 material 158
## 5 960 help 152
## 6 960 good 144
## 7 960 understand 134
## 8 960 easy 123
## 9 133 test 121
## 10 960 video 121
## # ... with 229,763 more rows
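With the counts grouped by professor, we can inspect a single professor directly. For example, the following (a quick sketch using standard dplyr verbs) lists the five most frequent words for professor 960, the professor at the top of the output above.
# five most frequent words for professor 960
word.freq %>% filter(profid == 960) %>% slice_max(n, n = 5)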
In the dataset above, the frequency of each word for a given professor was calculated. Another way to organize the data is to create a document-term matrix (DTM). A DTM is a matrix whose rows represent the documents (comments, or any other bodies of text) and whose columns represent the terms, or words, used in the documents. Therefore, the tidy text data can be viewed as the long format and the DTM as the wide format of the same information. The DTM is the default format used by the R package tm. The tidytext package includes the function cast_dtm to convert tidy data to a DTM.
# cast the per-professor word counts into a document-term matrix
prof.dtm <- word.freq %>% cast_dtm(profid, word, n)
tm::inspect(prof.dtm)
## <<DocumentTermMatrix (documents: 1000, terms: 8155)>>
## Non-/sparse entries: 229773/7925227
## Sparsity : 97%
## Maximal term length: 19
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs easy good great hard help learn lecture test time work
## 105 20 25 14 43 27 18 27 105 19 10
## 133 58 47 62 62 25 28 54 121 28 23
## 3 36 30 52 45 19 16 100 62 16 12
## 562 35 35 29 35 30 12 32 19 52 13
## 641 38 38 45 47 55 10 47 65 30 40
## 658 113 28 45 23 14 17 39 90 11 1
## 721 119 25 40 18 8 30 67 46 25 11
## 922 42 21 64 8 33 31 8 5 15 26
## 960 123 144 172 96 152 114 419 64 77 67
## 973 50 35 69 16 46 28 2 7 31 58
The summary information and part of the DTM are shown above. First, it gives the number of documents (professors) and the number of terms (words). Then, it reports the sparsity of the matrix, i.e., the percentage of zero entries. For our example, 97% of the entries in the matrix are 0.
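Because the DTM and the tidy data carry the same information, we can also go in the other direction: the tidy function re-exported by tidytext converts a DTM back to the long format, with one row per nonzero document-term entry.
# back from the wide DTM to the tidy long format
tidy(prof.dtm)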
If a word was used very rarely, it might not provide much useful information for further analysis. Therefore, one might remove such words to reduce the sparsity of the DTM. For example, the following code only keeps the words with less than 97% sparsity (i.e., words used for at least 3% of the professors). The matrix now has 1,221 terms with an overall sparsity of 84%.
library(tm)
# drop terms that are zero for more than 97% of the professors
prof.dtm <- removeSparseTerms(prof.dtm, 0.97)
tm::inspect(prof.dtm)
## <<DocumentTermMatrix (documents: 1000, terms: 1221)>>
## Non-/sparse entries: 196213/1024787
## Sparsity : 84%
## Maximal term length: 15
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs easy good great hard help learn lecture test time work
## 105 20 25 14 43 27 18 27 105 19 10
## 133 58 47 62 62 25 28 54 121 28 23
## 3 36 30 52 45 19 16 100 62 16 12
## 562 35 35 29 35 30 12 32 19 52 13
## 641 38 38 45 47 55 10 47 65 30 40
## 658 113 28 45 23 14 17 39 90 11 1
## 721 119 25 40 18 8 30 67 46 25 11
## 922 42 21 64 8 33 31 8 5 15 26
## 960 123 144 172 96 152 114 419 64 77 67
## 973 50 35 69 16 46 28 2 7 31 58
4.1 tf-idf
Word frequency, also called term frequency (tf), is one way to measure how a word is used. It has a clear drawback: the frequency of a word in a document is related to how commonly the word is used in general. We discussed earlier that commonly used words can be removed early on as stop words. Another way to deal with this is a “weighting” method similar to the one commonly used in sample surveys.
The basic idea is that if a word appears in almost every comment (in the extreme case, in all comments), then the word does not add much information to the analysis, and a small weight should be used. For example, if a word appears exactly once in the evaluation of every professor, a weight of 0 can be used because it carries no useful information for further analysis.
In general, we can define a term frequency–inverse document frequency (tf-idf) statistic as
\[ tf\text{-}idf(t,d,D) = tf(t,d)\times idf(t,D) \] where
- \(t\) is for a specific term or word,
- \(d\) represents a specific document or comment or professor,
- \(tf(t,d)\) is the frequency of \(t\) in \(d\),
- \(D\) represents all documents or comments,
- \(idf(t,D)\) is called the inverse document frequency, which can be viewed as a weight.
The \(idf(t,D)\) can take many different forms. If \(idf(t,D)=1\), then the frequency of the words is used as usual. A popular \(idf(t,D)\) is defined as
\[\begin{equation} \tag{4.1} idf(t,D) = \log\frac{N}{n_t} \end{equation}\]
where \(N\) is the total number of documents (comments or professors) and \(n_t\) is the number of documents containing the term \(t\). Variants of Equation (4.1) can also be defined. For example, to avoid a zero weight, we can use \(\log(1+N/n_t)\). Another option is \(\log[(N-n_t)/n_t]\).
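A small numerical sketch in R shows how these weights behave; the values of \(n_t\) below are made up for illustration.
N <- 1000                # total number of documents
n.t <- c(10, 500, 1000)  # number of documents containing the term
log(N / n.t)             # Equation (4.1): weight is 0 when a term is in every document
log(1 + N / n.t)         # variant that avoids the zero weight
log((N - n.t) / n.t)     # variant; -Inf when a term is in every document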
Along the same lines, the raw count of words, denoted by \(f_{t,d}\), is just one way to define a frequency measure. For example, we can also define \(tf\) by taking the length of the document or comment into account. In this case, we have
\[tf(t,d)=\frac{f_{t,d}}{\sum_{t'\in d}f_{t',d}}\] where the denominator is the total number of words in the document \(d\). This is also called the normalized term frequency.
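Combining the normalized \(tf\) with the \(idf\) in Equation (4.1), the tf-idf can be computed by hand from word.freq with dplyr. This is only a sketch, using \(N=1000\) professors as in our data; it should reproduce the values from the bind_tf_idf function introduced below.
# hand computation of tf-idf from the per-professor word counts
word.freq %>%
  group_by(profid) %>%
  mutate(tf = n / sum(n)) %>%        # normalized term frequency within a professor
  group_by(word) %>%
  mutate(idf = log(1000 / n())) %>%  # n() counts the professors using the word
  ungroup() %>%
  mutate(tf_idf = tf * idf)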
Many different tools can be used to compute the tf-idf. For example, the R package tidytext has the function bind_tf_idf to calculate the tf-idf for each word. As shown below, three variables are added: tf, idf, and tf_idf.
# add tf, idf, and tf-idf for each word within each professor
word.freq <- word.freq %>% bind_tf_idf(word, profid, n)
word.freq
## # A tibble: 229,773 x 6
## # Groups: profid [1,000]
## profid word n tf idf tf_idf
## <int> <chr> <int> <dbl> <dbl> <dbl>
## 1 960 lecture 419 0.0507 0.164 0.00830
## 2 960 homework 188 0.0227 0.546 0.0124
## 3 960 great 172 0.0208 0.0398 0.000828
## 4 960 material 158 0.0191 0.201 0.00384
## 5 960 help 152 0.0184 0.0651 0.00120
## 6 960 good 144 0.0174 0.0315 0.000549
## 7 960 understand 134 0.0162 0.143 0.00231
## 8 960 easy 123 0.0149 0.0243 0.000362
## 9 133 test 121 0.0333 0.121 0.00403
## 10 960 video 121 0.0146 1.82 0.0267
## # ... with 229,763 more rows
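Words with a high tf-idf are those used unusually often for a particular professor. In the output above, for instance, video for professor 960 has a much larger tf-idf (0.0267) than the far more frequent but generic good (0.000549). To list the most distinctive words for professor 960, we can sort on the new variable (again a quick dplyr sketch):
# words most characteristic of professor 960 by tf-idf
word.freq %>% filter(profid == 960) %>% slice_max(tf_idf, n = 5)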
The package tm can also be used to obtain the tf-idf, through its weightTfIdf function, as shown below. Note that by default weightTfIdf uses the normalized term frequency, i.e., the counts are divided by the document length.
# re-weight the DTM by the normalized tf-idf
prof.dtm.tfidf <- weightTfIdf(prof.dtm)
tm::inspect(prof.dtm.tfidf)
## <<DocumentTermMatrix (documents: 1000, terms: 1221)>>
## Non-/sparse entries: 196213/1024787
## Sparsity : 84%
## Maximal term length: 15
## Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
## Sample :
## Terms
## Docs english essay exam homework mathematics online
## 113 0.00000000 0.000000000 0.000000000 0.000000000 0.000000000 0.00000000
## 114 0.00000000 0.000000000 0.000000000 0.000000000 0.000000000 0.00000000
## 377 0.00000000 0.018912773 0.005772269 0.000000000 0.000000000 0.00000000
## 389 0.00641668 0.004683727 0.002858992 0.000000000 0.000000000 0.00000000
## 393 0.00828713 0.000000000 0.000000000 0.000000000 0.000000000 0.00000000
## 490 0.00000000 0.000000000 0.000000000 0.000000000 0.000000000 0.00000000
## 601 0.00000000 0.005929004 0.000000000 0.000000000 0.009072636 0.00000000
## 639 0.00000000 0.000000000 0.000000000 0.000000000 0.007281224 0.00000000
## 71 0.00000000 0.000000000 0.000000000 0.000000000 0.000000000 0.00000000
## 894 0.00000000 0.000000000 0.003316431 0.002866781 0.000000000 0.04163887
## Terms
## Docs paper project quizzes quot
## 113 0.000000000 0.00000000 0.000000000 0.000000000
## 114 0.000000000 0.00000000 0.000000000 0.000000000
## 377 0.002933118 0.00000000 0.002855662 0.000000000
## 389 0.026149805 0.01545946 0.000000000 0.006378790
## 393 0.011257473 0.00000000 0.000000000 0.008238195
## 490 0.008387921 0.00000000 0.000000000 0.027622182
## 601 0.022068221 0.03913943 0.000000000 0.000000000
## 639 0.000000000 0.04711690 0.002873851 0.000000000
## 71 0.006668096 0.01182630 0.000000000 0.000000000
## 894 0.000000000 0.00000000 0.000000000 0.000000000
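For further analysis outside the tm package, the weighted DTM can be coerced to an ordinary matrix; a minimal sketch:
# coerce the tf-idf weighted DTM to a plain matrix
tfidf.mat <- as.matrix(prof.dtm.tfidf)
dim(tfidf.mat)  # 1000 professors by 1221 terms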