15.4 Examples of Prediction and Classification

You may be wondering why we are covering these topics in a maximum likelihood course. Well, one of the most fundamental types of classification relies on logistic regression. We are just going to reframe the tools we have already learned for a different purpose.

Let’s use an example with the donation data from Section 12.7.

library(rio)
don <- import("https://github.com/ktmccabe/teachingdata/blob/main/donation.dta?raw=true")

Recall that a key outcome variable here was donation, indicating whether a particular member of the donorate donated to a given senator (=1) or not (=0).

table(don$donation)

    0     1 
61533  2377 

Can we predict whether someone will donate to a Senator?

Outcome: donation (1 or 0). A binary outcome means we are building a binary classifier.

  1. Choose an approach: build a model (e.g., logistic regression)
  2. Train the model
  3. Assess accuracy
    • Both within sample and out of sample, testing the model with cross-validation
  4. If we were very satisfied with our model, we could then apply it to new data for which we do not know the answer
    • If you were a campaign, could you predict who will donate?

Let’s go!

## build a model (choose "features" you think will be good for prediction)

## you will want to remove missing data first
donsub <- don[, c("donation", "NetWorth", "Edsum")]
donsub <- na.omit(donsub)

fit <- glm(donation ~ NetWorth + Edsum, 
           family = binomial(link = "logit"), data = donsub)

In-Sample Accuracy Assessments

## generate probability of donation for each observation
donsub$prob <- predict(fit, type = "response")

## set a prediction threshold (here, the overall donation rate)
donsub$pred <- ifelse(donsub$prob > mean(donsub$donation), 1, 0)

## accuracy: proportion where prediction matches reality
mean(donsub$pred == donsub$donation)
[1] 0.6210679
## confusion matrix
table(truth = donsub$donation, predicted = donsub$pred)
     predicted
truth     0     1
    0 31928 19311
    1   830  1083

There are different measures for accuracy that focus on particular types of errors:

## where did we miss
table(actual = donsub$donation, pred = donsub$pred)
      pred
actual     0     1
     0 31928 19311
     1   830  1083
confmat <- table(actual = donsub$donation, pred = donsub$pred)
truepos <- confmat[2, 2]
falsepos <- confmat[1, 2]
trueneg <- confmat[1, 1]
falseneg <- confmat[2, 1]

## precision
precision <- truepos/(truepos + falsepos)

## specificity
specificity <- trueneg / (trueneg + falsepos)

## false positive rate
falseposrate <- falsepos / (trueneg + falsepos)

## recall aka sensitivity
recall <- truepos/(truepos + falseneg)

## f-score, combination of precision/recall
F1 <- (2 * precision * recall) / (precision + recall)
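
Plugging in the confusion matrix above: precision = 1083 / (1083 + 19311) ≈ 0.05, recall = 1083 / (1083 + 830) ≈ 0.57, specificity = 31928 / (31928 + 19311) ≈ 0.62, and F1 ≈ 0.10. In other words, the model recovers over half of the actual donors, but the large majority of the people it flags as donors did not donate.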

See this post for more details and guidance on which one to choose.

Another common way to assess accuracy is through an ROC curve.

ROC curves plot the true positive rate (sensitivity, or recall) on the y-axis against the false positive rate (1 - specificity) on the x-axis. We want the curve to pull away from the diagonal, increasing the area under the curve (AUC).

# install.packages("pROC")
library(pROC)
ROC <- roc(response = donsub$donation,
                  predictor = donsub$pred)
plot(ROC, print.thres = "best")

auc(ROC)
Area under the curve: 0.5946
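
Note that we passed the 0/1 predictions to roc() above, so the curve reflects only our single threshold. You will often see ROC curves built from the predicted probabilities instead, which traces out performance across every possible threshold. A minimal sketch (output not shown):

## ROC based on the predicted probabilities rather than the 0/1 predictions
ROCprob <- roc(response = donsub$donation,
               predictor = donsub$prob)
plot(ROCprob, print.thres = "best")
auc(ROCprob)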

For more information, see here.

Out-of-Sample Tests: Cross-Validation

## split data into k folds
library(cvTools)
set.seed(1234)
folds <- cvFolds(nrow(donsub), K = 10)
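
cvFolds() stores the permuted row indices in folds$subsets and the fold assignment (1 through 10) for each of those indices in folds$which. A quick sanity check (a minimal sketch) confirms the folds are roughly equal in size:

## each fold should contain roughly nrow(donsub)/10 observations
table(folds$which)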

Let’s run through the process for one fold.

## leave out fold 1 and fit the model on the remaining folds
fit <- glm(donation ~ NetWorth + Edsum, family = binomial(link = "logit"), 
           data = donsub[folds$subsets[folds$which != 1],])

## out-of-sample prediction on the left-out fold (fold 1)
testdata <- donsub[folds$subsets[folds$which == 1],]
testdata$prob <- predict(fit, newdata = testdata, 
                       type = "response")

## choose some criteria to assess accuracy
## we will just assess basic matches, but you may want a different one
## set a prediction threshold
testdata$pred <- ifelse(testdata$prob > .5, 1, 0)
acc <- mean(testdata$pred == testdata$donation)
acc
[1] 0.9683973

Let’s make a loop to repeat this process for each fold.

## container vector for assessment criteria
accs <- rep(NA, 10)

## loop over each of the 10 folds
for(i in 1:10){
  
  ## leave one fold out != i
  fit <- glm(donation ~ NetWorth + Edsum, family = binomial(link = "logit"), 
           data = donsub[folds$subsets[folds$which != i],])
  
  ## out of sample prediction on left out fold == i
  testdata <- donsub[folds$subsets[folds$which == i],]
  testdata$prob <- predict(fit, newdata = testdata, 
                       type = "response")

  ## choose some criteria to assess accuracy
  ## set a prediction threshold
  testdata$pred <- ifelse(testdata$prob > .5, 1, 0)
  accs[i] <- mean(testdata$pred == testdata$donation)
}
accs
 [1] 0.9683973 0.9665162 0.9616181 0.9589840 0.9674506 0.9616181 0.9678269
 [8] 0.9668862 0.9595484 0.9612418
mean(accs)
[1] 0.9640088

We’ve done it! If we are satisfied with this level of accuracy, we could stop here and apply our model to any new data. If we are not satisfied, we would start by building a new model and repeat the process.
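
For example, here is a minimal sketch of that last step, using the most recently fit model (in practice you would refit on all of the training data) and a hypothetical new person; the NetWorth and Edsum values are invented purely for illustration:

## score a hypothetical new observation
## (NetWorth = 4 and Edsum = 6 are made-up values for illustration)
newperson <- data.frame(NetWorth = 4, Edsum = 6)
predict(fit, newdata = newperson, type = "response")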

  • Note: this is actually probably a bad measure of accuracy in our case, given the skew in our data. We are only accurate because the model is predicting that no one will donate! Bad news for campaigns. How do we know this? By inspecting the confusion matrix. Our threshold of .5 was probably too high.
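
For instance, a quick check of the last held-out fold from the loop above (a sketch; the exact numbers depend on which fold was left out last) would show that nearly all predictions are 0, and that a lower threshold changes the picture:

## confusion matrix for the last held-out fold
table(truth = testdata$donation, predicted = testdata$pred)

## rebuild predictions with a lower threshold, e.g., the overall donation rate
testdata$pred2 <- ifelse(testdata$prob > mean(donsub$donation), 1, 0)
table(truth = testdata$donation, predicted = testdata$pred2)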

Continuous Outcome

Let’s now try an example with a continuous outcome: how much will someone donate to a Senator? One way to judge the accuracy of a linear model is the root mean squared error (RMSE).

One basic approach would be to use a linear regression model. Other than that, it’s the same process as before, but we will use a different assessment for accuracy.

fit <- lm(total_donation ~ NetWorth + Edsum, data = don)

## Root mean squared error (computed here with the residual degrees of freedom, i.e., the residual standard error)
rmse <- sqrt(sum(residuals(fit)^2)/fit$df.residual)
rmse
[1] 324.081
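
If we wanted an out-of-sample version of this check, the same train/test logic from the cross-validation above carries over. A minimal sketch using a simple 80/20 split instead of 10 folds (the names introduced here, such as donlm and fit_train, are just for illustration):

## drop missing values on the variables used in the model
donlm <- na.omit(don[, c("total_donation", "NetWorth", "Edsum")])

## simple 80/20 train/test split
set.seed(1234)
train_rows <- sample(nrow(donlm), size = floor(0.8 * nrow(donlm)))
fit_train <- lm(total_donation ~ NetWorth + Edsum, data = donlm[train_rows, ])

## out-of-sample RMSE on the held-out 20 percent
test <- donlm[-train_rows, ]
test$pred <- predict(fit_train, newdata = test)
sqrt(mean((test$total_donation - test$pred)^2))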