8.7 Uncertainty with Prediction

Regression (and other prediction algorithms) give us our best guess

  • But any guess has some uncertainty, prediction error, and potential outliers
  • Sometimes these errors can be systematic
  • Even when we use more advanced statistical models
  • A “best guess” is often better than a random guess, but it shouldn’t necessarily be treated as “ground truth.”

Prediction helps us guess unknowns with observed data, but we MUST PROCEED WITH CAUTION

8.7.1 Example: Butterfly Ballot in Florida

In the 2000 U.S. presidential election, the race came down to Florida, which was extremely close. As part of the contest, different counties in Florida came under a microscope. One result that seemed unusual was the number of votes Buchanan received in certain areas, which appeared to be the result of an odd ballot design choice. In this exercise, we examine voting patterns in Florida during the 2000 election.

For more on the 2000 race, you can watch this video.

Load the data and explore the variables

  • county: county name
  • Clinton96: Clinton’s votes in 1996
  • Dole96: Dole’s votes in 1996
  • Perot96: Perot’s votes in 1996
  • Bush00: Bush’s votes in 2000
  • Gore00: Gore’s votes in 2000
  • Buchanan00: Buchanan’s votes in 2000
florida <- read.csv("florida.csv")
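Once the data are loaded, a quick peek confirms the columns listed above (a small sanity check, not part of the original exercise):

```r
head(florida)  # first few counties and their vote totals
dim(florida)   # number of counties (rows) and variables (columns)
```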

Chapter 4 in QSS also discusses this example.

Using what you learned from the last section, try to complete the following steps:

  • Regress Buchanan 2000 votes (your Y) on Perot 1996 (your X) votes
  • Create a scatterplot of the two variables and add the regression line
  • Find and interpret the slope coefficient for the relationship between Perot and Buchanan votes
  • Calculate the root-mean-squared error for the regression and interpret this
Try on your own, then expand for the solution.

For every 1 additional vote Perot received in 1996, we expect Buchanan to receive 0.036 additional votes in 2000.

fit <- lm(Buchanan00 ~ Perot96, data = florida)
coef(fit)
## (Intercept)     Perot96 
##  1.34575212  0.03591504

In 1996, Perot received 8 million votes as a third-party candidate. Buchanan received fewer than half a million. Overall, Perot received more votes, but where Perot received votes in 1996 was positively correlated with where Buchanan received votes in 2000.

plot(x=florida$Perot96,
     y=florida$Buchanan00,
     ylab="Buchanan Votes 2000",
     xlab="Perot Votes 1996",  
     pch=20)
abline(fit, lwd=3, col="purple")

sigma(fit)
## [1] 316.3765

A typical prediction error is about 316.4 votes above or below the actual Buchanan total.

8.7.2 Multiple Predictors

Can we reduce the error by adding more variables?

fitnew <- lm(Buchanan00 ~ Perot96 + Dole96 + Clinton96, data = florida)
coef(fitnew)
##  (Intercept)      Perot96       Dole96    Clinton96 
## 20.572650070  0.030663207 -0.001559196  0.001865809

When we have multiple predictors, our interpretation of the coefficients changes slightly.

  • We now interpret the slope as the expected change in the outcome for a 1-unit change in the independent variable, holding all other variables constant (or “controlling” for all other variables)
  • For example, a 1-unit increase (a 1-vote increase) in the number of Perot voters in 1996 is associated with a 0.03 vote increase in the number of Buchanan votes in 2000, holding constant the number of Clinton and Dole votes a county received.

When we make predictions with multiple variables, we have to tell R where we want to set each variable’s value.

predict(fitnew, data.frame(Perot96=20000, Clinton96=300000, Dole96=300000))
##        1 
## 725.8208

See how the prediction changes if you shift Perot96 but keep the other variables where they are. That’s the idea of “controlling” for the other variables!
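For a linear model, shifting one predictor by a single unit while holding the others fixed moves the prediction by exactly that predictor’s coefficient. A quick check, using the same hypothetical county values as above:

```r
# Same hypothetical county, but with one extra Perot 1996 vote
p1 <- predict(fitnew, data.frame(Perot96 = 20000, Clinton96 = 300000, Dole96 = 300000))
p2 <- predict(fitnew, data.frame(Perot96 = 20001, Clinton96 = 300000, Dole96 = 300000))
p2 - p1  # equals the Perot96 coefficient, about 0.031
```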

The addition of the new variables, in this case, made very little difference in the RMSE.

sigma(fit)
## [1] 316.3765
sigma(fitnew)
## [1] 318.3798

Note: the value R generates through sigma is the residual standard error, which divides the sum of squared residuals by the residual degrees of freedom (n minus the number of coefficients) rather than by n, penalizing models that include more variables. You could also calculate the RMSE without this penalty by manually taking the square root of the mean of the squared residuals.
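For example, the unpenalized version can be computed directly from the residuals of the simple model and compared with sigma:

```r
sigma(fit)                # residual standard error (penalized for model size)
sqrt(mean(resid(fit)^2))  # plain RMSE: slightly smaller, since it divides by n
```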

With little change from the addition of predictors, let’s stick with the simpler model and explore the prediction errors.

plot(x=fitted(fit), # predicted outcome
     y=resid(fit),  # prediction error
     type="n", # makes the plot blank
     xlim = c(0, 1500), 
     ylim = c(-750, 2500), 
     xlab = "Predicted Buchanan Votes", 
     ylab = "Prediction Error")
abline(h = 0) # adds horizontal line
text(x=fitted(fit), y=resid(fit), labels = florida$county, cex=.8)
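If the county labels overlap too much to read, the county with the largest prediction error can also be pulled out directly:

```r
# County with the largest (most positive) residual from the simple model
florida$county[which.max(resid(fit))]  # the big outlier in the plot: PalmBeach
```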

How does the prediction error change if we remove Palm Beach County?

florida.pb <- subset(florida, subset = (county != "PalmBeach"))
fit2 <- lm(Buchanan00 ~ Perot96, data = florida.pb)
sigma(fit2)
## [1] 87.74994

My, oh my, our RMSE also goes way down if we remove Palm Beach. Something unique seems to be happening in that county. See this academic paper for an elaboration of the evidence that “The Butterfly [ballot] Did it.”
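One way to quantify just how unusual Palm Beach is: use the model fit without it to predict Palm Beach’s Buchanan vote from its Perot vote, then compare that prediction to the actual count (a short extension of the code above):

```r
pb <- subset(florida, county == "PalmBeach")
predict(fit2, newdata = pb)  # Buchanan votes we would expect from the other counties' pattern
pb$Buchanan00                # the actual Buchanan vote in Palm Beach, far above the prediction
```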

8.7.3 Confidence Intervals

Social scientists like to characterize the uncertainty in their predictions using what is called a “confidence interval.”

  • Confidence intervals show a range of values that are likely to contain the true value
predict(fit, data.frame(Perot96 = 13600), interval = "confidence")
##        fit      lwr      upr
## 1 489.7903 394.8363 584.7443

By default, R supplies the 95% confidence interval.

  • For example, our estimate is that for a county with 13,600 Perot votes in 1996, the expected Buchanan vote in 2000 is 489.79 votes.
    • The confidence interval is 394.84 to 584.74 votes, which means we believe there is a 95% chance that this interval contains the true value of the Buchanan 2000 vote count.
  • Instead of thinking about our prediction as just 489.79, we should think about the entire interval as having a good chance of including the true value.

Similarly, our coefficients also have uncertainty.

coef(fit)
## (Intercept)     Perot96 
##  1.34575212  0.03591504
confint(fit)
##                    2.5 %       97.5 %
## (Intercept) -98.03044506 100.72194929
## Perot96       0.02724733   0.04458275

For every 1 vote increase in the Perot 1996 vote, we expect a \(\hat \beta = 0.036\) increase in Buchanan votes. However, the confidence interval is 0.027 to 0.045.

  • We think there is a 95% chance that this interval (0.027 to 0.045) includes the true \(\beta\), which describes the rate of change in Buchanan 2000 votes for a given change in Perot 1996 votes.
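Under the hood, confint builds each interval from the coefficient estimate, its standard error, and a critical value from the t distribution. A hand-rolled version for the Perot96 slope (matching confint to rounding) looks like:

```r
est   <- coef(summary(fit))["Perot96", "Estimate"]
se    <- coef(summary(fit))["Perot96", "Std. Error"]
tcrit <- qt(0.975, df = df.residual(fit))     # critical value for a 95% interval
c(est - tcrit * se, est + tcrit * se)         # same bounds as confint(fit)["Perot96", ]
```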