11.4 Word Importance
We use tf-idf (term frequency - inverse document frequency) as a way to pull out uniquely important/relevant words for a given character.
- Relative frequency of a term inversely weighted by the number of documents in which the term appears.
- Functionally, if everyone uses the word “know,” then it’s not very important for distinguishing characters/documents from each other.
- We want words that a speech used frequently, that other speeches use less frequently
## words uniquely important to a character
sotu.tfidf <- weightTfIdf(sotu.dtm)
## convert to matrix
sotu.tfidf.mat <- as.matrix(sotu.tfidf)We can summarize the uniquely relevant words for each speech
Gw1790.tfidf <-head(sort(sotu.tfidf.mat[1,], decreasing=T), n=8)
BO2016.tfidf <-head(sort(sotu.tfidf.mat[236,], decreasing=T), n=8)Gw1790.tfidf##     intimating licentiousness        discern     inviolable         derive 
##     0.01527644     0.01527644     0.01333846     0.01220481     0.01172350 
##      persuaded     cherishing  comprehending 
##     0.01172350     0.01077658     0.01077658barplot(Gw1790.tfidf, cex.axis=.7,
         cex.names=.7,
        main= "Most `Important' 1790 SOTU Words (tf-idf)", 
        horiz = T, las=2)barplot(BO2016.tfidf,
         cex.names=.7, cex.axis=.7,
        main= "Most `Important' 2016 SOTU Words (tf-idf)", 
        horiz=T, las=2)