4.7 Wrapping up OLS

Linear regression is a great way to explain the relationship between one or more independent variables and an outcome variable. However, there is no free lunch. We have already mentioned a couple of assumptions along the way. Below we summarize these and other assumptions. These are things you should be mindful of when you use linear regression in your own work. Some conditions that violate these assumptions also motivate why we will seek out alternative methods, such as those that rely on maximum likelihood estimation.

This is a good place to review Gelman section 3.6.

  • Exogeneity. This one we haven’t discussed yet, but it is an important assumption for interpreting our coefficients \(\hat \beta\) as “unbiased” estimates of the true parameters \(\beta\). We assume that the random error in the regression model, \(\epsilon\), is indeed random: uncorrelated with and independent of our independent variables \(X\). Formally:
    • \(\mathbb{E}(\epsilon| X) = \mathbb{E}(\epsilon) = 0\).
    • This can be violated, for example, when we suffer from Omitted Variable Bias due to having an “endogenous explanatory variable” that is correlated with some unobserved or unaccounted-for factor. This bias arises when some variable \(Z\) has been left out of the model and is therefore part of the unobserved error term, and \(Z\) is both correlated with (and a precursor of) our independent variables and a cause of our dependent variable. A failure to account for omitted variables can create bias in our coefficient estimates (a short simulation after this list of assumptions illustrates the problem). Concerns about omitted variable bias often prompt people to raise their hands in seminars and ask questions like, “Well, have you accounted for this? Have you accounted for that? How do you know it is \(X\) driving your results and not \(Z\)?” If we omit important covariates, we may wrongly attribute an effect to \(X\) when it was really the result of our omitted factor \(Z\). Messing discusses this here.
    • This is a really tough assumption. The only real way to guarantee the independence of your error term and the independent variables is if you have randomly assigned values to the independent variables (such as what you do when you randomly assign people to different treatment conditions in an experiment). Beyond random assignment, you have to rely on theory to understand what variables you need to account for in the regression model to be able to plausibly claim your estimate of the relationship between a given independent variable and the dependent variable is unbiased. Failing to control for important factors can lead to misleading results, such as what happens in Simpson’s paradox, referenced in the Messing piece.
    • Danger Note 1: The danger here, though, is that the desire to avoid omitted variable bias can tempt researchers to keep adding control after control after control to the regression model. However, model building in this way can be atheoretical and result in arbitrary fluctuations in the size of your coefficients and their significance. At its worst, it can lead to “p-hacking,” where researchers keep changing their models until they find the results they like. The Lenz and Sahn article on Canvas talks more about the dangers of arbitrarily adding controls to the model.
    • Danger Note 2: We also want to avoid adding “bad controls” to the model. Messing talks about this in the Medium article as it relates to collider bias. We want to avoid controlling for variables, say \(W\), that are actually consequences of both \(X\) and \(Y\) rather than causes of them; conditioning on such a variable can induce a spurious relationship between \(X\) and \(Y\).
    • Model building is a delicate enterprise that depends a lot on having a solid theory that guides the choice of variables.
  • Homoscedasticity. We saw this when defining the variance estimator for the OLS coefficients. We assume constant error variance. This can be violated when we think observations at certain values of our independent variables may have different magnitudes of error than observations at other values of our independent variables.
  • No correlation in the errors. The error terms are not correlated with each other. This can be violated in time series models (where we might think past, present, and future errors are correlated) or in cases where our observations are nested in some hierarchical structure (e.g., students within a school) and the errors within groups are correlated.
  • No perfect collinearity. The \(X\) matrix must be full rank: we cannot have linear dependence between the columns of our \(X\) matrix. We saw this in the tutorial when we tried to add the dummy variables for all of our racial groups into a regression at once. When there is perfect collinearity between variables, the regression cannot estimate all of the coefficients (in R, lm() will drop one of the offending columns and report NA for it).
    • We should also avoid situations where we have severe multicollinearity. This can happen when we include two or more variables in a regression model that are highly correlated (just not perfectly correlated). While the regression will still run in this case, it can inflate the standard errors of the coefficients, making it harder to detect significant effects. This is particularly problematic in smaller samples. (The collinearity sketch after this list shows both situations.)
  • Linearity. The relationship between the independent and dependent variables needs to be linear in the parameters: the model should be expressed as a sum of parameters (constants) multiplied by the independent variables. If instead the model requires products or other functions of the parameters themselves (e.g., \(\beta^2\)), it is no longer linear. Linearity also often refers to the shape of the relationship. Our coefficients tell us how much change we expect in the outcome for each one-unit change in an independent variable. We might think some relationships are nonlinear, meaning this rate of change varies across values of the independent variables. If that is the case, we need to change how our model is specified to account for this or change modeling approaches.
    • For example, perhaps as people get older (one-unit changes in age), they become more politically engaged, but at some age their political engagement starts to decline. This would mean the slope (the expected change in political engagement for each one-unit change in age) is not constant across all levels of age. In that case, the linearity assumption is violated by the curvature of the relationship between the independent and dependent variables. This is why you might see \(age^2\) or other nonlinear terms in regression equations to better model this curvature.
    • Likewise, perhaps each additional level of education doesn’t result in the same average increase in \(y\). If not, you could consider including categorical dummy variables for the different levels of education instead of treating education as a single numeric variable. (See the specification sketch after this list.)
  • Normality. We assume that the errors are normally distributed. As Gelman 3.6 notes, this is a less important assumption and is generally not required.
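
To make the omitted variable bias discussion concrete, here is a minimal simulation sketch. The data are made up (not from any course dataset): \(Z\) is an unobserved confounder that is a precursor of \(X\) and also affects \(Y\), so leaving it out biases the coefficient on \(X\).

set.seed(123)
n <- 1000
z <- rnorm(n)                     # unobserved confounder
x <- 0.7 * z + rnorm(n)           # x is correlated with z (z is a precursor of x)
y <- 2 * x + 3 * z + rnorm(n)     # the true effect of x on y is 2; z also affects y

coef(lm(y ~ x))["x"]        # omits z: the estimate is biased well above 2
coef(lm(y ~ x + z))["x"]    # includes z: the estimate is close to the true value of 2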
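
Similarly, a small sketch (again with simulated variables) of what perfect collinearity and severe multicollinearity look like in practice:

set.seed(42)
n <- 200
x1 <- rnorm(n)
x2 <- 2 * x1                      # perfectly collinear with x1
x3 <- x1 + rnorm(n, sd = 0.05)    # highly, but not perfectly, correlated with x1
y  <- 1 + x1 + rnorm(n)

coef(lm(y ~ x1 + x2))                   # perfect collinearity: R drops x2 and reports NA
summary(lm(y ~ x1 + x3))$coefficients   # severe multicollinearity: note the inflated standard errors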
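
And a sketch of the two specification fixes mentioned under the linearity assumption. The age, political engagement, and education variables here are hypothetical, simulated purely for illustration:

set.seed(7)
n <- 500
age  <- sample(18:90, n, replace = TRUE)
educ <- sample(c("HS", "Some college", "BA", "Grad"), n, replace = TRUE)
engagement <- 2 + 0.30 * age - 0.003 * age^2 + rnorm(n)   # engagement rises with age, then declines

# I(age^2) adds a quadratic term so the slope can change across values of age
summary(lm(engagement ~ age + I(age^2)))$coefficients

# factor() creates a dummy variable for each education level (one level serves as the reference)
coef(lm(engagement ~ factor(educ)))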

OLS Properties

Why we like OLS. When we meet our assumptions, OLS produces the best linear unbiased estimator (BLUE). A discussion of this here. We have linearity in our parameters (e.g., \(\beta\) and not \(\beta^2\)). Unbiasedness means that the expected value (i.e., the average over repeated samples) of our estimates equals the true value: \(\mathbb{E}(\hat \beta)= \beta\). Our estimates are also efficient, which has to do with the variance: not only are our estimates correct in expectation, they also have lower variance than any alternative linear unbiased estimator could achieve. If our assumptions fail, then we might no longer have BLUE. OLS estimates are also consistent, meaning that as the sample gets larger and larger, the estimates converge to the truth. The short simulation below illustrates unbiasedness and consistency.
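
Here is a minimal simulation sketch (with made-up data) of what unbiasedness and consistency mean in practice: across many repeated samples the estimates average out to the true \(\beta\), and a single estimate gets closer to the truth as the sample size grows.

set.seed(1234)
true_beta <- 2

# Unbiasedness: averaging the estimates over many repeated samples recovers the true value
beta_hats <- replicate(1000, {
  x <- rnorm(200)
  y <- 1 + true_beta * x + rnorm(200)
  coef(lm(y ~ x))["x"]
})
mean(beta_hats)

# Consistency: a single estimate converges toward the truth as the sample size grows
sapply(c(100, 1000, 100000), function(n) {
  x <- rnorm(n)
  y <- 1 + true_beta * x + rnorm(n)
  coef(lm(y ~ x))["x"]
})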

Now, a final hidden assumption in all of this is that the sample of our data is representative of the population we are trying to make inferences about. If that is not the case, then we may no longer be making unbiased inferences about that population, and further adjustments may be required (e.g., analyses of survey data sometimes use weights to adjust estimates to be more representative). A minimal sketch of the weighting idea follows.
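
As a minimal sketch of the weighting idea, with hypothetical data and an already-constructed weight column (not any specific survey):

d <- data.frame(y = rnorm(100), x = rnorm(100), wt = runif(100, 0.5, 2))
fit_w <- lm(y ~ x, data = d, weights = wt)   # weighted least squares point estimates
# For real survey data, dedicated tools such as the survey package are often preferred,
# since they also adjust the standard errors for the sampling design.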

When we violate these assumptions, OLS may no longer be best, and we may opt for other approaches. More soon!

4.7.1 Practice Problems

  1. Let’s use the florida data. Run a regression according to the following formula:
    • \(Buchanan00_i = \alpha + \beta_1*Perot96_i + \beta_2*Dole96_i + \beta_3*Gore00_i + \epsilon_i\)
  2. Report the coefficient for Perot96. What do you conclude about the null hypothesis that there is no relationship between 1996 Perot votes and 2000 Buchanan votes?
  3. What is the confidence interval for the Perot96 coefficient estimate?
  4. When Perot 1996 vote is 5500, what is the expected 2000 Buchanan vote?

4.7.2 Practice Problem Code for Solutions

# Q1: Fit the regression of 2000 Buchanan votes on 1996 Perot, 1996 Dole, and 2000 Gore votes
fit.practice <- lm(Buchanan00 ~ Perot96 + Dole96 + Gore00, data = florida)

# Q2: Coefficient on Perot96
coef(fit.practice)["Perot96"]
   Perot96 
0.02878927 
# Q3: 95% confidence interval for the Perot96 coefficient
confint(fit.practice)["Perot96", ]
      2.5 %      97.5 % 
0.004316382 0.053262150 
# The interval excludes zero, so we reject the null of no relationship at the .05 level.
# Q4: Average expected 2000 Buchanan vote when Perot96 is set to 5500 for every observation,
# holding the other covariates at their observed values
expbuch <- model.matrix(fit.practice)
expbuch[, "Perot96"] <- 5500
mean(expbuch %*% as.matrix(coef(fit.practice)))
[1] 211.1386