15.4 Examples of Prediction and Classification
You may be wondering why we are covering these topics in a maximum likelihood course. Well, one of the most fundamental types of classification relies on logistic regression. We are just going to reframe the tools we have already learned for a different purpose.
Let’s use an example with the donation data from section 12.7.
library(rio)
don <- import("https://github.com/ktmccabe/teachingdata/blob/main/donation.dta?raw=true")
Recall that a key outcome variable here was donation, indicating whether a particular member of the donorate donated to a given senator (=1) or not (=0).
table(don$donation)
    0     1 
61533  2377 
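Note the skew: only about 3.7 percent of observations (2377 of 63910) are donations. Keep this imbalance in mind; it will matter when we assess accuracy below.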
Can we predict whether someone will donate to a Senator?
Outcome: donation (1 or 0). A binary outcome calls for a binary classifier.
- Choose an approach: Build a model (e.g., logistic regression)
- Train the model
- Assess accuracy, both within sample and out of sample (using cross-validation)
- If we were satisfied with our model, we could then apply it to new data for which we do not know the outcome.
- If you were a campaign, could you predict who will donate?
Let’s go!
## build a model (choose "features" you think will be good for prediction)
## you will want to remove missing data first
donsub <- don[, c("donation", "NetWorth", "Edsum")]
donsub <- na.omit(donsub)
fit <- glm(donation ~ NetWorth + Edsum,
           family = binomial(link = "logit"), data = donsub)
In-Sample Accuracy Assessments
## generate probability of donation for each observation
donsub$prob <- predict(fit, type = "response")
## set a prediction threshold
donsub$pred <- ifelse(donsub$prob > mean(donsub$donation), 1, 0)
## accuracy: proportion of predictions that match reality
mean(donsub$pred == donsub$donation)
[1] 0.6210679
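Note that the threshold above is the base rate of donation (roughly .04), not the conventional .5. As a quick comparison, here is a minimal sketch of accuracy under a .5 cutoff (the object name pred50 is ours, not from the original code):

## sketch: accuracy under a conventional .5 threshold
pred50 <- ifelse(donsub$prob > .5, 1, 0)
mean(pred50 == donsub$donation)

Accuracy will look much higher under the .5 cutoff, but as we note at the end of this section, that is largely because the model then predicts that almost no one donates.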
## confusion matrix
table(truth = donsub$donation, predicted = donsub$pred)
     predicted
truth     0     1
    0 31928 19311
    1   830  1083
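Rows are the true outcomes and columns the predictions, so the diagonal cells (31928 and 1083) are correct classifications, while the off-diagonal cells are our errors: 19311 false positives and 830 false negatives.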
There are different measures for accuracy that focus on particular types of errors:
## where did we miss
table(actual = donsub$donation, pred = donsub$pred)
      pred
actual     0     1
     0 31928 19311
     1   830  1083
truepos <- table(actual = donsub$donation, pred = donsub$pred)[2,2]
falsepos <- table(actual = donsub$donation, pred = donsub$pred)[1,2]
trueneg <- table(actual = donsub$donation, pred = donsub$pred)[1,1]
falseneg <- table(actual = donsub$donation, pred = donsub$pred)[2,1]
## precision
precision <- truepos/(truepos + falsepos)

## specificity
specificity <- trueneg/(trueneg + falsepos)

## false positive rate
falsepos_rate <- falsepos/(trueneg + falsepos)

## recall aka sensitivity
recall <- truepos/(truepos + falseneg)

## F1 score, a combination of precision and recall
F1 <- (2 * precision * recall) / (precision + recall)
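To make these concrete, plug in the confusion matrix above (truepos = 1083, falsepos = 19311, trueneg = 31928, falseneg = 830): precision is 1083/(1083 + 19311), about .053; specificity is 31928/(31928 + 19311), about .623; the false positive rate is about .377; recall is 1083/(1083 + 830), about .566; and F1 is about .097. In other words, the model finds over half of the true donors, but the vast majority of its positive predictions are wrong.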
See this post for more details and guidance on which one to choose.
Another common way to assess accuracy is through an ROC curve. ROC curves plot the true positive rate (recall/sensitivity) on the y-axis against the false positive rate (1 - specificity) on the x-axis. We want the curve to bow away from the diagonal, increasing the area under the curve (AUC).
# install.packages("pROC")
library(pROC)
ROC <- roc(response = donsub$donation,
           predictor = donsub$pred)
plot(ROC, print.thres = "best")
auc(ROC)
Area under the curve: 0.5946
For more information, see here.
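One caveat: because donsub$pred has already been thresholded to 0 or 1, the ROC curve above is built from a single operating point. A minimal sketch of the more common alternative, passing roc() the fitted probabilities so it can sweep across all possible thresholds (the object name ROCprob is ours; its AUC will differ from the value above):

ROCprob <- roc(response = donsub$donation,
               predictor = donsub$prob)
auc(ROCprob)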
Out-of-Sample Tests: Cross-Validation
## split data into k folds
library(cvTools)
set.seed(1234)
folds <- cvFolds(nrow(donsub), K = 10)
Let’s run through the process for one fold:
## leave one fold out (here, fold 1), training on the rest
fit <- glm(donation ~ NetWorth + Edsum, family = binomial(link = "logit"),
           data = donsub[folds$subsets[folds$which != 1], ])

## out-of-sample prediction on the left-out fold
testdata <- donsub[folds$subsets[folds$which == 1], ]
testdata$prob <- predict(fit, newdata = testdata,
                         type = "response")
## choose some criterion to assess accuracy
## we will just use the proportion of correct predictions, but you may want a different one
## set a prediction threshold
testdata$pred <- ifelse(testdata$prob > .5, 1, 0)
acc <- mean(testdata$pred == testdata$donation)
acc
[1] 0.9683973
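Note that this out-of-sample accuracy (.968) is far higher than the in-sample accuracy we computed earlier (.621). That is not because the model performs better out of sample; it is because we switched the threshold from the donation base rate to .5, so the model now predicts 0 for almost everyone, and about 96 percent of observations are in fact 0.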
Let’s make a loop to repeat this process for each fold.
## container vector for assessment criteria
accs <- rep(NA, 10)

## loop over the folds (1 to the length of the container)
for(i in 1:10){
  ## leave fold i out, training on the rest
  fit <- glm(donation ~ NetWorth + Edsum, family = binomial(link = "logit"),
             data = donsub[folds$subsets[folds$which != i], ])

  ## out-of-sample prediction on the left-out fold
  testdata <- donsub[folds$subsets[folds$which == i], ]
  testdata$prob <- predict(fit, newdata = testdata,
                           type = "response")

  ## set a prediction threshold and assess accuracy
  testdata$pred <- ifelse(testdata$prob > .5, 1, 0)
  accs[i] <- mean(testdata$pred == testdata$donation)
}
accs
[1] 0.9683973 0.9665162 0.9616181 0.9589840 0.9674506 0.9616181 0.9678269
[8] 0.9668862 0.9595484 0.9612418
mean(accs)
[1] 0.9640088
We’ve done it! If we are satisfied with this level of accuracy, we could stop here and apply our model to new data. If we are not satisfied, we would go back, build a new model, and repeat the process.
- Note: this is actually probably a bad measure of accuracy for our case given the skew in our data. We are only accurate because the model is predicting no one will donate! Bad news for campaigns. How do we know this? By inspecting the confusion matrix. Our threshold of .5 was probably too high.
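If finding donors is the goal, we might instead cross-validate on recall. A minimal sketch, reusing the folds from above and the base-rate threshold from the in-sample section (the container name recalls is ours, not from the original):

## container for fold-level recall
recalls <- rep(NA, 10)

for(i in 1:10){
  fit <- glm(donation ~ NetWorth + Edsum, family = binomial(link = "logit"),
             data = donsub[folds$subsets[folds$which != i], ])
  testdata <- donsub[folds$subsets[folds$which == i], ]
  testdata$prob <- predict(fit, newdata = testdata, type = "response")

  ## base-rate threshold so the model predicts at least some donors
  testdata$pred <- ifelse(testdata$prob > mean(donsub$donation), 1, 0)

  ## recall: share of actual donors in the fold that we correctly flag
  ## (a fold that happens to contain no donors would produce NaN)
  recalls[i] <- mean(testdata$pred[testdata$donation == 1] == 1)
}
mean(recalls)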
Continuous Outcome
Let’s now try an example with a continuous outcome: how much will someone donate to a Senator? One basic approach is a linear regression model. Other than that, it is the same process as before, but we will use a different measure of accuracy: the root mean squared error (RMSE), the square root of the average squared residual.
fit <- lm(total_donation ~ NetWorth + Edsum, data = don)

## root mean squared error
rmse <- sqrt(sum(residuals(fit)^2)/fit$df.residual)
rmse
[1] 324.081
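As with the classifier, this in-sample RMSE is likely optimistic. A minimal sketch of a cross-validated RMSE, following the same cvTools pattern as above (the names donlm, folds2, and rmses are ours; we drop missing values first so the fold indices line up):

## cross-validated RMSE for the linear model
donlm <- na.omit(don[, c("total_donation", "NetWorth", "Edsum")])
folds2 <- cvFolds(nrow(donlm), K = 10)

rmses <- rep(NA, 10)
for(i in 1:10){
  fit <- lm(total_donation ~ NetWorth + Edsum,
            data = donlm[folds2$subsets[folds2$which != i], ])
  testdata <- donlm[folds2$subsets[folds2$which == i], ]
  err <- testdata$total_donation - predict(fit, newdata = testdata)
  rmses[i] <- sqrt(mean(err^2))
}
mean(rmses)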