11.4 Word Importance
We use tf-idf (term frequency-inverse document frequency) to pull out the words that are distinctively important or relevant for a given character or document.
- It is the relative frequency of a term within a document, weighted inversely by the number of documents in which the term appears.
- Functionally, if everyone uses the word “know,” then it’s not very important for distinguishing characters/documents from each other.
- We want words that one speech uses frequently but that other speeches use less frequently.
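To make the weighting concrete, here is a minimal base-R sketch on a toy document-term matrix. The counts and the names `counts`, `tf`, `df`, `idf`, and `tfidf` are made up for illustration. It assumes a base-2 logarithm and normalization by document length, which is our understanding of the `weightTfIdf` defaults in `tm`; check the package documentation before relying on an exact match.

```r
# Toy document-term matrix: rows are documents, columns are terms (made-up counts)
counts <- matrix(
  c(2, 1, 0,
    2, 0, 3,
    4, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("know", "liberty", "economy"))
)

# Term frequency: each count divided by the total number of words in its document
tf <- counts / rowSums(counts)

# Inverse document frequency: log2(number of documents / documents containing the term)
df <- colSums(counts > 0)
idf <- log2(nrow(counts) / df)

# tf-idf: scale each term's column by that term's idf
tfidf <- sweep(tf, 2, idf, `*`)
round(tfidf, 3)
```

Because "know" appears in every toy document, its idf is log2(3/3) = 0, so its tf-idf is zero everywhere no matter how often it is used. Terms that appear in only one document keep a positive weight.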
## words uniquely important to a character
sotu.tfidf <- weightTfIdf(sotu.dtm)
## convert to matrix
sotu.tfidf.mat <- as.matrix(sotu.tfidf)
We can summarize the uniquely relevant words for each speech by sorting a speech's row of the matrix and keeping the highest-weighted terms:
## top 8 tf-idf terms for the 1790 speech (row 1) and the 2016 speech (row 236)
Gw1790.tfidf <- head(sort(sotu.tfidf.mat[1,], decreasing=TRUE), n=8)
BO2016.tfidf <- head(sort(sotu.tfidf.mat[236,], decreasing=TRUE), n=8)
Gw1790.tfidf
## intimating licentiousness discern inviolable derive
## 0.01527644 0.01527644 0.01333846 0.01220481 0.01172350
## persuaded cherishing comprehending
## 0.01172350 0.01077658 0.01077658
barplot(Gw1790.tfidf, cex.axis=.7,
cex.names=.7,
main= "Most `Important' 1790 SOTU Words (tf-idf)",
horiz = T, las=2)
barplot(BO2016.tfidf,
cex.names=.7, cex.axis=.7,
main= "Most `Important' 2016 SOTU Words (tf-idf)",
horiz=T, las=2)
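The sort-and-head step used for the two speeches can be wrapped in a small function so it applies to any row. This is only a sketch: `top_terms` is a hypothetical helper, not part of `tm`, and `toy` is a made-up matrix standing in for `sotu.tfidf.mat`.

```r
# Hypothetical helper: the n highest-weighted terms for one row of a weight matrix
top_terms <- function(mat, doc, n = 8) {
  head(sort(mat[doc, ], decreasing = TRUE), n = n)
}

# Toy weights standing in for sotu.tfidf.mat (made-up values)
toy <- matrix(
  c(0.02, 0.00, 0.01,
    0.00, 0.03, 0.01),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("speech1", "speech2"), c("discern", "economy", "know"))
)

# Top two terms for a single speech
top_terms(toy, "speech1", n = 2)

# The same summary for every speech: a list with one named vector per row
all_top <- lapply(rownames(toy), function(d) top_terms(toy, d, n = 2))
```

With the real matrix, `top_terms(sotu.tfidf.mat, 1)` would be equivalent to the `Gw1790.tfidf` line, since `doc` can be a row index or a row name.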