4.6 Generating predictions from regression models
The regression coefficients tell us how much \(Y\) is expected to change for a one-unit change in \(x_k\). They do not immediately tell us the values we estimate our outcome (\(\hat Y\)) to take conditional on particular values of \(x_k\). While knowing that our independent variables have a significant effect on the outcome, along with the size of the coefficients, is often sufficient for testing our hypotheses, it can be helpful for interpretation’s sake to see the estimated values of the outcome. This will be particularly important once we get to models like logistic regression, where the coefficients won’t be immediately interpretable.
Recall that our equation for estimating values of our outcomes is:
\(\hat Y = X\hat \beta\)
This can also be written out in long form for any particular observation \(i\):
\(\hat y_i = \hat \alpha + \hat \beta_1*x_{1i} + \hat \beta_2*x_{2i} + \dots + \hat\beta_k*x_{ki}\)
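The examples below use a fitted regression object, fit.1, whose definition is not repeated in this section. Based on the variables that appear in the output below, it is presumably the bivariate model from earlier in the chapter, along these lines:
## Presumed earlier specification: Buchanan's 2000 votes regressed on
## Perot's 1996 votes in the Florida county data
fit.1 <- lm(Buchanan00 ~ Perot96, data = florida)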
The estimated values of our regression \(\hat Y\) are often called the “fitted values.” In R, you can identify the estimated values for each observation using the fitted() command.
## Y hat for the first observation in the data
fitted(fit.1)[1]
1
291.252
Again, this is just the multiplication of the matrix \(X\) and \(\hat \beta\). If we have already run a regression model in R, one shortcut for getting the \(X\) matrix is to use the model.matrix() command. We can get \(\hat \beta\) using the coef() command.
X <- model.matrix(fit.1)
head(X) # head() shows about the first six values of an object
(Intercept) Perot96
1 1 8072
2 1 667
3 1 5922
4 1 819
5 1 25249
6 1 38964
betahat <- coef(fit.1)
Our fitted values are then just
yhat <- X %*% betahat
head(yhat)
[,1]
1 291.25196
2 25.30108
3 214.03462
4 30.76017
5 908.16461
6 1400.73939
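Since this is the same calculation fitted() performs internally, the two sets of values should match. A quick check (as.numeric() drops the names and matrix dimensions so the comparison is purely numeric):
## Confirm the manual matrix multiplication reproduces fitted()
all.equal(as.numeric(yhat), as.numeric(fitted(fit.1)))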
If I want to generate an estimate for any particular observation, I could also just extract its specific value for Perot96.
florida$Perot96[1]
[1] 8072
Let’s estimate the Buchanan 2000 votes for the first county in the data, which had 8072 Perot 1996 votes. We can write this out as \(\widehat{Buchanan00}_1 = \hat \alpha + \hat \beta*Perot96_1\)
buch00hat <- coef(fit.1)[1] + coef(fit.1)[2]*florida$Perot96[1]
buch00hat
(Intercept)
291.252
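One small cosmetic note: R carries the “(Intercept)” name from coef(fit.1)[1] through the arithmetic, which is why it appears above the result. unname() strips it if you prefer a bare number:
## Drop the leftover "(Intercept)" name
unname(buch00hat)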
What is useful about this is that now that we have the coefficient estimates, we can apply them to any values of \(X\) we wish in order to generate estimates/predictions of the values \(Y\) will take given particular values of our independent variables.
One function that is useful for this (as a shortcut) is the predict(fit, newdata = newdataframe) function in R. It allows you to enter “newdata”, meaning values of the \(X\) variables for which you want to generate estimates of \(Y\) based on the coefficient estimates of your regression model.
For example, let’s repeat the calculation from above for Perot96 = 8072.
predict(fit.1, newdata = data.frame(Perot96 = 8072))
1
291.252
We can also generate confidence intervals around these estimates by adding interval = "confidence" to the command.
predict(fit.1, newdata = data.frame(Perot96 = 8072), interval="confidence")
fit lwr upr
1 291.252 213.7075 368.7964
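Related, though we do not use it below: interval = "prediction" produces the wider interval that accounts for the uncertainty about an individual observation, not just the uncertainty about the regression line.
## Prediction interval for a single county with Perot96 = 8072
predict(fit.1, newdata = data.frame(Perot96 = 8072), interval = "prediction")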
We can also generate multiple predictions simultaneously by supplying a vector of values in the predict() command. For example, let’s see the estimated Buchanan votes when the Perot 1996 votes take values from 1,000 to 10,000 in intervals of 1,000.
predict(fit.1, newdata = data.frame(Perot96 = c(1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000)))
1 2 3 4 5 6 7 8
37.26079 73.17583 109.09087 145.00591 180.92095 216.83599 252.75104 288.66608
9 10
324.58112 360.49616
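Typing out that vector by hand gets tedious. The seq() function builds the same sequence more compactly:
## Equivalent, using seq() to generate 1000, 2000, ..., 10000
predict(fit.1, newdata = data.frame(Perot96 = seq(1000, 10000, by = 1000)))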
The important thing to note about the predict() command is that if you have multiple independent variables, you have to specify the values you want each of them to take when generating the estimated values of \(Y\).
For example, let’s build a second model with Clinton96 as an additional predictor.

fit.2 <- lm(Buchanan00 ~ Perot96 + Clinton96, data = florida)

In order to generate the same prediction for different values of Perot 1996 votes, we need to tell R at what values we should “hold constant” Clinton96.
I.e., we want to see how hypothetical changes in Perot96 votes influence changes in Buchanan 2000 votes while leaving the Clinton votes identical. This is the lightswitch metaphor: flipping one switch while keeping the rest untouched.
There are two common approaches to doing this: 1) we can hold Clinton96 votes constant at their mean value in the data, or 2) we can keep Clinton96 at its observed values in the data. In linear regression, it’s not going to matter which approach you take. In other models we will talk about later, this distinction may matter more substantially because of how our quantities of interest change across different values of \(X\hat \beta\).
The first approach is easily implemented in predict().
predict(fit.2, newdata = data.frame(Perot96 = 8072, Clinton96 = mean(florida$Clinton96)))
1
283.997
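To see that predict() is doing nothing mysterious, we can reproduce this number by hand from the fit.2 coefficients:
## Manual version of the same prediction
coef(fit.2)["(Intercept)"] + coef(fit.2)["Perot96"]*8072 +
  coef(fit.2)["Clinton96"]*mean(florida$Clinton96)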
For the second approach, what we will do is generate an estimate for Buchanan’s votes in 2000 when Perot96 takes 8072 votes, keeping Clinton96’s votes at whatever values they currently take in the data. That is, we will generate \(n\) estimates for Buchanan’s votes when Perot takes 8072. Then, we will take the mean of these as our “average estimate” of Buchanan’s votes in 2000 based on Perot’s votes at a level of 8072. We can do this in one of two ways:
## Manipulating the X matrix
X <- model.matrix(fit.2)
## Replace Perot96 column with all 8072 values
X[, "Perot96"] <- 8072
head(X) # take a peek
(Intercept) Perot96 Clinton96
1 1 8072 40144
2 1 8072 2273
3 1 8072 17020
4 1 8072 3356
5 1 8072 80416
6 1 8072 320736
## Generate yhat
yhats <- X %*% coef(fit.2)
## take the mean
mean(yhats)
[1] 283.997
## Use predict
yhats <- predict(fit.2, newdata = data.frame(Perot96 = 8072, Clinton96 = florida$Clinton96))
mean(yhats)
[1] 283.997
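It is no accident that both approaches return exactly 283.997. Because the model is linear, the mean of the predictions equals the prediction at the mean:
\(\frac{1}{n}\sum_{i=1}^{n}(\hat \alpha + \hat \beta_1*8072 + \hat \beta_2*Clinton96_i) = \hat \alpha + \hat \beta_1*8072 + \hat \beta_2*\overline{Clinton96}\)
In nonlinear models, such as logistic regression, this equality breaks down, which is why the choice between the two approaches will matter there.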
Often, after we generate these predicted values, we want to display them for the whole world to see. You will get a chance to visualize values like this using the plotting functions in the problem sets. We have already seen one example of this in the simple bivariate case, when R plotted the bivariate regression line in section 4.3.2. However, the predict function extends our capabilities, letting us plot very specific values of \(X\) and \(\hat Y\) for bivariate or multiple regressions.
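As a preview of that kind of plot (a sketch only; the grid of 100 values and the base R plotting commands are just one reasonable choice), we could compute predictions along a grid of Perot96 values and draw the fitted line over the scatterplot:
## Sketch: fitted line from predict() drawn over the raw data
perot.grid <- seq(min(florida$Perot96), max(florida$Perot96), length.out = 100)
preds <- predict(fit.1, newdata = data.frame(Perot96 = perot.grid))
plot(florida$Perot96, florida$Buchanan00,
     xlab = "Perot 1996 votes", ylab = "Buchanan 2000 votes")
lines(perot.grid, preds)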
- The predict() function is also very relevant when we move to logistic, probit, etc. regressions. This is just the start of a beautiful friendship between you and predict() and associated functions.
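As a small taste of that friendship (a simulated toy example; none of these variables are in the florida data): with a logistic regression fit by glm(), predict() with type = "response" returns predicted probabilities rather than log-odds.
## Toy logistic regression on simulated data, not the florida data
set.seed(1234)
toy <- data.frame(x = rnorm(100))
toy$y <- rbinom(100, 1, plogis(0.5 + 1.2*toy$x))
fit.logit <- glm(y ~ x, data = toy, family = binomial)
## type = "response" gives predicted probabilities
predict(fit.logit, newdata = data.frame(x = c(-1, 0, 1)), type = "response")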