11.4 Word Importance

We use tf-idf (term frequency-inverse document frequency) to pull out the words that are distinctively important to a given speech (document).

  • The relative frequency of a term within a document, down-weighted by the number of documents in which the term appears.
  • Functionally, if every speech uses the word “know,” then it is not very useful for distinguishing speeches (documents) from each other.
  • We want words that one speech uses frequently and that other speeches use rarely.
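
To make the weighting concrete, here is a minimal sketch in base R on a made-up three-document matrix. The counts and all `toy.*` object names are illustrative assumptions, not part of the SOTU data. It divides each count by the document's length and uses a base-2 log for the inverse document frequency, which is how the default settings of `weightTfIdf()` in `tm` are documented to behave.

```r
## toy document-term matrix: rows are documents, columns are terms (made-up counts)
toy.dtm <- matrix(c(2, 1, 0,
                    2, 0, 3,
                    2, 0, 0),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("doc1", "doc2", "doc3"),
                                  c("know", "inviolable", "economy")))

## term frequency: each count divided by the total number of words in its document
toy.tf <- toy.dtm / rowSums(toy.dtm)

## inverse document frequency: log2(number of documents / documents containing the term)
toy.idf <- log2(nrow(toy.dtm) / colSums(toy.dtm > 0))

## tf-idf: multiply each term's column of tf by that term's idf
toy.tfidf <- sweep(toy.tf, 2, toy.idf, "*")
round(toy.tfidf, 3)
```

Because "know" appears in all three toy documents, its idf is log2(3/3) = 0 and its tf-idf is zero everywhere, while "inviolable" and "economy" each get positive weight only in the single document that uses them.
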
## tf-idf weights: words distinctively important to each speech (weightTfIdf() is in the tm package)
sotu.tfidf <- weightTfIdf(sotu.dtm)

## convert to matrix
sotu.tfidf.mat <- as.matrix(sotu.tfidf)

We can summarize the most distinctive words for each speech by sorting its row of tf-idf weights and keeping the top eight:

## rows 1 and 236 hold the 1790 and 2016 speeches
Gw1790.tfidf <- head(sort(sotu.tfidf.mat[1, ], decreasing = TRUE), n = 8)
BO2016.tfidf <- head(sort(sotu.tfidf.mat[236, ], decreasing = TRUE), n = 8)
Gw1790.tfidf
##     intimating licentiousness        discern     inviolable         derive 
##     0.01527644     0.01527644     0.01333846     0.01220481     0.01172350 
##      persuaded     cherishing  comprehending 
##     0.01172350     0.01077658     0.01077658
barplot(Gw1790.tfidf,
        cex.names = .7, cex.axis = .7,
        main = "Most 'Important' 1790 SOTU Words (tf-idf)",
        horiz = TRUE, las = 2)

barplot(BO2016.tfidf,
        cex.names = .7, cex.axis = .7,
        main = "Most 'Important' 2016 SOTU Words (tf-idf)",
        horiz = TRUE, las = 2)
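
The same sort-and-keep-the-top step can be wrapped in a small function so that it does not have to be retyped for each speech. The sketch below is one possible way to do that: `top.tfidf` is a hypothetical helper (not from `tm`), and `toy.mat` is a made-up stand-in for `sotu.tfidf.mat`.

```r
## hypothetical helper: top n tf-idf terms for one row of a tf-idf matrix
## (doc can be a row number or, if the matrix has row names, a row name)
top.tfidf <- function(tfidf.mat, doc, n = 8) {
  head(sort(tfidf.mat[doc, ], decreasing = TRUE), n = n)
}

## made-up tf-idf weights standing in for sotu.tfidf.mat
toy.mat <- matrix(c(0.00, 0.53, 0.10,
                    0.00, 0.02, 0.95),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("speechA", "speechB"),
                                  c("know", "inviolable", "economy")))

## one named vector of top terms per document
toy.top <- lapply(rownames(toy.mat), function(d) top.tfidf(toy.mat, d, n = 2))
names(toy.top) <- rownames(toy.mat)
toy.top
```

With the real data, a call such as `top.tfidf(sotu.tfidf.mat, 1)` should give the same eight words as `Gw1790.tfidf`, assuming the matrix is built as shown earlier in this section.
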