4.6 Generating predictions from regression models
The regression coefficients tell us how much \(Y\) is expected to change for a one-unit change in \(x_k\). They do not immediately tell us the values we estimate our outcome (\(\hat Y\)) to take conditional on particular values of \(x_k\). While knowing that our independent variables have a significant effect on the outcome, along with the size of the coefficients, is often sufficient for testing our hypotheses, it can be helpful for interpretation’s sake to see the estimated values of the outcome. This will be particularly important once we get to models like logistic regression, where the coefficients won’t be immediately interpretable.
Recall that our equation for estimating values of our outcomes is:
\(\hat Y = X\hat \beta\)
This can also be written out in long form for any particular observation \(i\):
\(\hat y_i = \hat \alpha + \hat \beta_1*x_{1i} + \hat \beta_2*x_{2i} + \dots + \hat\beta_k*x_{ki}\)
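The examples below use a fitted regression object, fit.1, whose definition is not repeated in this section. Based on the variables that appear in the output below, it is presumably the bivariate model from earlier in the chapter, along these lines:
## Presumed earlier specification: Buchanan's 2000 votes regressed on
## Perot's 1996 votes in the Florida county data
fit.1 <- lm(Buchanan00 ~ Perot96, data = florida)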
The estimated values of our regression \(\hat Y\) are often called the “fitted values.” In R, you can identify the estimated values for each observation using the fitted() command.
## Y hat for the first observation in the data
fitted(fit.1)[1]
1
291.252
Again, this is just the multiplication of the matrix \(X\) and \(\hat \beta\). If we have already run a regression model in R, one shortcut for getting the \(X\) matrix is to use the model.matrix() command. We can get \(\hat \beta\) using the coef() command.
X <- model.matrix(fit.1)
head(X) # head() shows about the first six values of an object
(Intercept) Perot96
1 1 8072
2 1 667
3 1 5922
4 1 819
5 1 25249
6 1 38964
betahat <- coef(fit.1)
Our fitted values are then just
yhat <- X %*% betahat
head(yhat)
[,1]
1 291.25196
2 25.30108
3 214.03462
4 30.76017
5 908.16461
6 1400.73939
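Since this is the same calculation fitted() performs internally, the two sets of values should match. A quick check (as.numeric() drops the names and matrix dimensions so the comparison is purely numeric):
## Confirm the manual matrix multiplication reproduces fitted()
all.equal(as.numeric(yhat), as.numeric(fitted(fit.1)))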
If I want to generate an estimate for any particular observation, I could also just extract its specific value for Perot96.
florida$Perot96[1]
[1] 8072
Let’s estimate the Buchanan 2000 votes for the first county in the data, which had 8072 Perot 1996 votes. We can write this out as \(\widehat{Buchanan00}_1 = \hat \alpha + \hat \beta*Perot96_1\)
buch00hat <- coef(fit.1)[1] + coef(fit.1)[2]*florida$Perot96[1]
buch00hat
(Intercept)
291.252
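One small cosmetic note: R carries the “(Intercept)” name from coef(fit.1)[1] through the arithmetic, which is why it appears above the result. unname() strips it if you prefer a bare number:
## Drop the leftover "(Intercept)" name
unname(buch00hat)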
What is useful about this is that now that we have the coefficient estimates, we can apply them to any values of \(X\) we wish in order to generate estimates/predictions of the values \(Y\) will take given particular values of our independent variables.
One function that is useful for this (as a shortcut) is the predict(fit, newdata = newdataframe) function in R. It allows you to enter “newdata”, meaning values of the \(X\) variables for which you want to generate estimates of \(Y\) based on the coefficient estimates of your regression model.
For example, let’s repeat the calculation from above for Perot96 = 8072.
predict(fit.1, newdata = data.frame(Perot96 = 8072))
1
291.252
We can also generate confidence intervals around these estimates by adding interval = "confidence" to the command.
predict(fit.1, newdata = data.frame(Perot96 = 8072), interval="confidence")
fit lwr upr
1 291.252 213.7075 368.7964
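Related, though we do not use it below: interval = "prediction" produces the wider interval that accounts for the uncertainty about an individual observation, not just the uncertainty about the regression line.
## Prediction interval for a single county with Perot96 = 8072
predict(fit.1, newdata = data.frame(Perot96 = 8072), interval = "prediction")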
We can also generate multiple predictions simultaneously by supplying a vector of values in the predict() command. For example, let’s see the estimated Buchanan votes when the Perot 1996 votes take values from 1,000 to 10,000 in intervals of 1,000.
predict(fit.1, newdata = data.frame(Perot96 = c(1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000)))
1 2 3 4 5 6 7 8
37.26079 73.17583 109.09087 145.00591 180.92095 216.83599 252.75104 288.66608
9 10
324.58112 360.49616
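Typing out that vector by hand gets tedious. The seq() function builds the same sequence more compactly:
## Equivalent, using seq() to generate 1000, 2000, ..., 10000
predict(fit.1, newdata = data.frame(Perot96 = seq(1000, 10000, by = 1000)))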
The important thing to note about the predict() command is that if you have multiple independent variables, you have to specify the values you want each of them to take when generating the estimated values of \(Y\).
For example, let’s build a second model with Clinton96 as an additional predictor.

fit.2 <- lm(Buchanan00 ~ Perot96 + Clinton96, data = florida)

In order to generate the same prediction for different values of Perot 1996 votes, we need to tell R at what values we should “hold constant” Clinton96.
I.e., we want to see how hypothetical changes in Perot96 votes influence changes in Buchanan 2000 votes while leaving the Clinton votes identical. This is the lightswitch metaphor: flipping one switch while keeping the rest untouched.
There are two common approaches to doing this: 1) we can hold Clinton96 votes constant at their mean value in the data, or 2) we can keep Clinton96 at its observed values in the data. In linear regression, it’s not going to matter which approach you take. In other models we will talk about later, this distinction may matter more substantially because of how our quantities of interest change across different values of \(X\hat \beta\).
The first approach is easily implemented in predict().
predict(fit.2, newdata = data.frame(Perot96 = 8072, Clinton96 = mean(florida$Clinton96)))
1
283.997
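To see that predict() is doing nothing mysterious, we can reproduce this number by hand from the fit.2 coefficients:
## Manual version of the same prediction
coef(fit.2)["(Intercept)"] + coef(fit.2)["Perot96"]*8072 +
  coef(fit.2)["Clinton96"]*mean(florida$Clinton96)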
For the second approach, what we will do is generate an estimate for Buchanan’s votes in 2000 when Perot96 takes 8072 votes, keeping Clinton96’s votes at whatever values they currently take in the data. That is, we will generate \(n\) estimates for Buchanan’s votes when Perot takes 8072. Then, we will take the mean of these as our “average estimate” of Buchanan’s votes in 2000 based on Perot’s votes at a level of 8072. We can do this in one of two ways:
## Manipulating the X matrix
X <- model.matrix(fit.2)
## Replace Perot96 column with all 8072 values
X[, "Perot96"] <- 8072
head(X) # take a peek
(Intercept) Perot96 Clinton96
1 1 8072 40144
2 1 8072 2273
3 1 8072 17020
4 1 8072 3356
5 1 8072 80416
6 1 8072 320736
## Generate yhat
yhats <- X %*% coef(fit.2)
## take the mean
mean(yhats)
[1] 283.997
## Use predict
yhats <- predict(fit.2, newdata = data.frame(Perot96 = 8072, Clinton96 = florida$Clinton96))
mean(yhats)
[1] 283.997
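It is no accident that both approaches return exactly 283.997. Because the model is linear, the mean of the predictions equals the prediction at the mean:
\(\frac{1}{n}\sum_{i=1}^{n}(\hat \alpha + \hat \beta_1*8072 + \hat \beta_2*Clinton96_i) = \hat \alpha + \hat \beta_1*8072 + \hat \beta_2*\overline{Clinton96}\)
In nonlinear models, such as logistic regression, this equality breaks down, which is why the choice between the two approaches will matter there.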
Often, after we generate these predicted values, we want to display them for the whole world to see. You will get a chance to visualize values like this using the plotting functions in the problem sets. We have already seen one example of this in the simple bivariate case, when R plotted the bivariate regression line in section 4.3.2. However, the predict function extends our capabilities, letting us plot very specific values of \(X\) and \(\hat Y\) for bivariate or multiple regressions.
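As a preview of that kind of plot (a sketch only; the grid of 100 values and the base R plotting commands are just one reasonable choice), we could compute predictions along a grid of Perot96 values and draw the fitted line over the scatterplot:
## Sketch: fitted line from predict() drawn over the raw data
perot.grid <- seq(min(florida$Perot96), max(florida$Perot96), length.out = 100)
preds <- predict(fit.1, newdata = data.frame(Perot96 = perot.grid))
plot(florida$Perot96, florida$Buchanan00,
     xlab = "Perot 1996 votes", ylab = "Buchanan 2000 votes")
lines(perot.grid, preds)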
- The predict() function is also very relevant when we move to logistic, probit, etc. regressions. This is just the start of a beautiful friendship between you and predict() and associated functions.
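As a small taste of that friendship (a simulated toy example; none of these variables are in the florida data): with a logistic regression fit by glm(), predict() with type = "response" returns predicted probabilities rather than log-odds.
## Toy logistic regression on simulated data, not the florida data
set.seed(1234)
toy <- data.frame(x = rnorm(100))
toy$y <- rbinom(100, 1, plogis(0.5 + 1.2*toy$x))
fit.logit <- glm(y ~ x, data = toy, family = binomial)
## type = "response" gives predicted probabilities
predict(fit.logit, newdata = data.frame(x = c(-1, 0, 1)), type = "response")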