4.5 Uncertainty and Regression
We have now gone through the process of minimizing the sum of squared errors (\(\mathbf{e'e}\)) and deriving estimates for the OLS coefficients \(\hat \beta = (X'X)^{-1}X'Y\). In this section, we will discuss how to estimate the uncertainty around these coefficient estimates.
Where we are going:
- In the last section, we visited an example related to the 2000 election in Florida. We regressed county returns for Buchanan in 2000 (Y) on county returns for Perot in 1996 (X).
## Load Data
florida <- read.csv("https://raw.githubusercontent.com/ktmccabe/teachingdata/main/florida.csv")
fit.1 <- lm(Buchanan00 ~ Perot96, data = florida)
summary(fit.1)
Call:
lm(formula = Buchanan00 ~ Perot96, data = florida)
Residuals:
Min 1Q Median 3Q Max
-612.74 -65.96 1.94 32.88 2301.66
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.34575 49.75931 0.027 0.979
Perot96 0.03592 0.00434 8.275 9.47e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 316.4 on 65 degrees of freedom
Multiple R-squared: 0.513, Adjusted R-squared: 0.5055
F-statistic: 68.48 on 1 and 65 DF, p-value: 9.474e-12
The summary output of the model shows many different quantities in addition to the coefficient estimates. In particular, in the second column of the summary, we see the standard errors of the coefficients. Like many statistical software programs, the lm()
function neatly places these right next to the coefficients. We will now discuss how we get to these values.
4.5.1 Variance of the Coefficients
The standard error is the square root of the variance, representing the typical deviation we would expect to see between our estimates \(\hat \beta\) of the parameter \(\beta\) across repeated samples. So to get to the standard error, we just need to get to an estimate of the variance.
Let’s take the journey. First, the math. As should be starting to feel familiar, we begin with our regression equation, which describes the relationship between the independent variables and the dependent variable.
- Start with the model: \(Y = X\beta + \epsilon\)
- We want to generate uncertainty for our estimate of \(\hat \beta =(X'X)^{-1}X'Y\)
- Note: Conditional on fixed values of \(X\) (fixed because this is our data; we know \(X\) from our dataset), the only random component is \(\epsilon\).
- What does that mean? Essentially, the random error term in our regression equation is what gives us the uncertainty. If \(Y\) were a deterministic function of \(X\), we would have no need for it, but it is not. The relationship is not exact; it varies from sample to sample and is subject to random perturbations, which are represented by \(\epsilon\).
Below we go through how to arrive at the mathematical quantity representing the variance of \(\hat \beta\) which we will notate as \(\mathbf{V}(\hat\beta)\). The first part of the math below is just substituting terms:
\[\begin{align*} \mathbf{V}(\widehat{\beta}) &= \mathbf{V}((X^T X)^{-1} X^T Y) \\ &= \underbrace{\mathbf{V}((X^T X)^{-1} X^T (X\beta + \epsilon))}_\text{Sub in the expression for Y from above} \\ &= \underbrace{\mathbf{V}((X^T X)^{-1} X^T X \beta + (X^T X)^{-1} X^T \epsilon)}_\text{Distribute the term to the items in the parentheses} \\ &= \underbrace{\mathbf{V}(\beta + (X^T X)^{-1} X^T \epsilon)}_\text{Using the rules of inverses, the two terms next to $\beta$ cancel each other out} \end{align*}\]
The next part of the math requires us to use knowledge of the definition of variance and the rules associated. We draw on two in particular:
- The variance of a constant is zero.
- When you have a constant multiplied by a random variable, e.g., \(\mathbf{V}(4d)\), it can come out of the variance operator, but must be squared: \(16\mathbf{V}(d)\)
- Putting these together: \(\mathbf{V}(2 + 4d)= 16\mathbf{V}(d)\) (a quick simulation sketch below illustrates this)
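To make the two rules concrete, here is a small simulation sketch (not from the original text) checking that \(\mathbf{V}(2 + 4d) = 16\mathbf{V}(d)\) for a random variable \(d\):
## Simulate a random variable d and compare the two sides of the rule
set.seed(1234)
d <- rnorm(100000)
var(2 + 4 * d) # adding 2 changes nothing; multiplying by 4 scales the variance by 16
16 * var(d)    # the same quantity via the rule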
Knowing these rules, we can proceed: \[\begin{align*} \mathbf{V}(\widehat{\beta}) &= \mathbf{V}(\beta + (X^T X) ^{-1} X^T \epsilon) \\ &=\underbrace{ \mathbf{V}((X^T X) ^{-1} X^T \epsilon)}_\text{$\beta$ drops out because in a regression it is an unknown "parameter"-- it is constant, which means its variance is zero.}\\ &= \underbrace{(X^T X)^{-1}X^T \mathbf{V}( \epsilon) ((X^T X)^{-1}X^T)^T}_\text{We can move $(X^T X)^{-1}X^T$ out front because our data are fixed quantities, but in doing so, we have to "square" the matrix.}\\ &= (X^T X)^{-1}X^T \mathbf{V}( \epsilon) X (X^T X)^{-1} \end{align*}\]
The resulting quantity is our expression for \(\mathbf{V}(\hat \beta)\). However, in OLS, we make an additional assumption that allows us to further simplify the expression. We assume homoscedasticity, aka “constant” or “equal error variance,” which says that the variance of the errors is the same across observations: \(\mathbf{V}(\epsilon) = \sigma^2 I_n\).
- If we assume homoscedastic errors, then Var\((\epsilon) = \sigma^2 I_n\)
\[\begin{align*} \mathbf{V}(\widehat{\beta}) &= (X^T X)^{-1}X^T \mathbf{V}( \epsilon) X (X^T X)^{-1}\\ &= \underbrace{(X^T X)^{-1}X^T \sigma^2I_n X (X^T X)^{-1}}_\text{Assume homoskedasticity}\\ &= \underbrace{\sigma^2(X^T X)^{-1} X^T X (X^T X)^{-1}}_\text{Because it is a constant, we can move it out in front of the matrix multiplication, and then simplify the terms.} \\ &= \sigma^2(X^T X)^{-1} \end{align*}\]
All done! This expression: \(\sigma^2(X^T X)^{-1}\) represents the variance of our coefficient estimates. Note its dimensions: \(k \times k\). It has the same number of rows and columns as the number of our independent variables (plus the intercept).
There is one catch, though. How do we know what \(\sigma^2\) is? Well, we don’t. Just like the unknown parameter \(\beta\), we have to estimate it in our regression model.
Just like with the coefficients, we notate our estimate as \(\widehat{\sigma}^2\). Our estimate is based on the observed residual errors in the model and is as follows:
- \(\widehat{\sigma}^2 = \frac{1}{N-K}\sum_{i=1}^N \widehat{\epsilon}_i^2 = \frac{1}{N-K} \mathbf{e'e}\)
That means our estimate of the variance of the coefficients is contained in the matrix \(\hat \sigma^2(X^T X)^{-1}\).
Again, this is a \(k \times k\) matrix and is often called the variance-covariance matrix. We can extract this quantity from our linear models in R using vcov().
vcov(fit.1)
(Intercept) Perot96
(Intercept) 2475.9885768 -1.360074e-01
Perot96 -0.1360074 1.883619e-05
This is the same result we would get if we manually took the residuals and combined them with our \(X\) matrix according to the formula above:
X <- cbind(1, florida$Perot96)
e <- cbind(residuals(fit.1))
sigmahat <- ((t(e) %*% e) / (nrow(florida) - 2))
## tell R to stop treating sigmahat as a matrix
sigmahat <- as.numeric(sigmahat)
XtX <- solve(t(X) %*% X)
sigmahat * XtX
[,1] [,2]
[1,] 2475.9885768 -1.360074e-01
[2,] -0.1360074 1.883619e-05
The terms on the diagonal represent the variance of a particular coefficient in the model. The standard error of a particular coefficient \(k\) is: \(s.e.(\hat{\beta}_k) = \sqrt{[\widehat{\sigma}^2 (X'X)^{-1}]_{kk}}\). The off-diagonal components represent the covariances between the coefficients.
Recall that the standard error is just the square root of the variance. So, to get the nice standard errors we saw in the summary output, we can take the square root of the quantities on the diagonal of this matrix.
sqrt(diag(vcov(fit.1)))
(Intercept) Perot96
49.759306434 0.004340068
summary(fit.1)$coefficients[,2]
(Intercept) Perot96
49.759306434 0.004340068
Why should I care?
- Well, R actually doesn’t make it that easy to extract standard errors from the summary output. You can see above that the code for extracting the standard errors using what we know about them (they are the square root of the variance) is about as efficient as extracting the second column of the coefficient component of the model summary.
- Sometimes, we may think that the assumption of equal error variance is not feasible and that we have unequal error variance or “heteroscedasticity.” Researchers have developed alternative expressions to model unequal error variance. Generally, what this means is that they can no longer make that simplifying assumption; they have to stop at the step with the uglier expression \((X^T X)^{-1}X^T \mathbf{V}( \epsilon) X (X^T X)^{-1}\) and then assume something different about the structure of the errors in order to estimate the variance of the coefficients. These alternative variance estimators are generally what are referred to as “robust standard errors.” There are many different robust estimators, and you will likely come across them in your research (a brief sketch of one such estimator appears below).
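As an illustration (not something shown in the original output), one common way to compute heteroscedasticity-robust standard errors in R is with the sandwich and lmtest packages; the sketch below assumes both are installed.
## Assumes the sandwich and lmtest packages are installed
library(sandwich)
library(lmtest)
## HC1 is one common robust variance estimator
robust.vcov <- vcovHC(fit.1, type = "HC1")
## Coefficients with robust standard errors, t values, and p-values
coeftest(fit.1, vcov. = robust.vcov)
## Robust standard errors alone
sqrt(diag(robust.vcov))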
Some of you may have learned the formal, general definition of variance in terms of expected value: \(\mathbb{E}[(\widehat{m} - \mathbb{E}(\widehat{m}))^2 ]\). We could also start the derivation there. This is not required for the course, but it is below if you find it useful. In particular, it can help show why we end up needing to square a term when we move it outside the variance operator:
\[\begin{align*} \mathbf{V}(\widehat{\beta}) &= \mathbb{E}[(\widehat{\beta} - \mathbb{E}(\widehat{\beta}))^2] \\ &= \mathbb{E}[(\widehat{\beta} - \beta)^2]\\ &= \mathbb{E}[(\widehat{\beta} - \beta)(\widehat{\beta} - \beta)^T ] \\ &= \mathbb{E}[((X^T X)^{-1} X^T Y - \beta)((X^T X)^{-1} X^T Y - \beta)^T] \\ &= \mathbb{E}[((X^T X)^{-1} X^T(X\beta + \epsilon) - \beta)((X^T X)^{-1} X^T(X\beta + \epsilon) - \beta)^T]\\ &= \mathbb{E}[((X^T X)^{-1} X^T X\beta + (X^T X)^{-1} X^T\epsilon - \beta)((X^T X)^{-1} X^T X\beta + (X^T X)^{-1} X^T\epsilon - \beta)^T]\\ &= \mathbb{E}[(\beta + (X^T X)^{-1} X^T\epsilon - \beta)(\beta + (X^T X)^{-1} X^T\epsilon - \beta)^T]\\ &= \mathbb{E}[((X^T X)^{-1} X^T\epsilon)((X^T X)^{-1} X^T\epsilon)^T]\\ &= (X^T X)^{-1} X^T\mathbb{E}(\epsilon\epsilon^T)X(X^T X)^{-1}\\ &= \underbrace{(X^T X)^{-1} X^T\sigma^2 I_n X(X^T X)^{-1}}_\text{Assume homoskedasticity}\\ &= \sigma^2(X^T X)^{-1} X^T X(X^T X)^{-1}\\ &= \sigma^2(X^T X)^{-1} \end{align*}\]
Note: Along the way, in writing \(\mathbb{E}(\hat \beta) = \beta\), we have implicitly assumed that \(\hat \beta\) is an “unbiased” estimator of \(\beta\). This is not free. It depends on an assumption that the error term in the regression \(\epsilon\) is independent of our independent variables. This can be violated in some situations, such as when we have omitted variable bias, which is discussed at the end of our OLS section.
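To see what these derivations mean in practice, here is a small simulation sketch (not from the original text, and using hypothetical values for \(\beta\) and \(\sigma\)): holding \(X\) fixed at the Florida values and drawing homoscedastic errors repeatedly, the average of the \(\hat\beta\) estimates should be close to \(\beta\), and their spread should match \(\sigma^2(X'X)^{-1}\).
## Fix X at the Florida values and simulate repeated samples of Y
set.seed(1234)
X <- cbind(1, florida$Perot96)
beta <- c(1, 0.035) # hypothetical "true" parameters
sigma <- 300        # hypothetical error standard deviation
betahats <- replicate(5000, {
  y.sim <- X %*% beta + rnorm(nrow(X), mean = 0, sd = sigma)
  solve(t(X) %*% X) %*% t(X) %*% y.sim
})
## Average estimate is close to the true beta (unbiasedness)
rowMeans(betahats)
## Simulated variance of the slope is close to the corresponding entry of sigma^2 (X'X)^{-1}
var(betahats[2, ])
(sigma^2 * solve(t(X) %*% X))[2, 2]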
4.5.2 Hypothesis Testing
Most of the time in social science, we run a regression because we have some hypothesis about how a change in our independent variable affects our outcome variable.
In OLS, we can perform a hypothesis test for each independent variable in our data. The structure of the hypothesis test is:
- Null hypothesis: \(\beta_k = 0\)
- This essentially means that we don’t expect a particular \(x_k\) independent variable to have a relationship with our outcome variable.
- Alternative hypothesis: \(\beta_k \neq 0\)
- We do expect a positive or negative relationship between a particular \(x_k\) and the dependent variable.
We can use our estimates for \(\hat \beta\) coefficients and their standard errors to come to a conclusion about rejecting or failing to reject the null hypothesis of no relationship by using a t-test.
In a t-test, we take our coefficient estimates and divide them by the standard error in order to “standardize” them on a scale that we can use to determine how likely it is we would have observed a value of \(\hat \beta\) as extreme as or more extreme than the one we observed in a world where the true \(\beta = 0\). This is just like a t-test you might have encountered before for a difference in means between groups, except this time our estimate is \(\hat \beta\).
\[\begin{align*} t_{\hat \beta_k} &= \frac{\hat \beta_k}{s.e.(\hat \beta_k)} \end{align*}\]
Generally speaking, when \(t\) is about +/-2 or greater in magnitude, the coefficient will be “significant” at conventional levels (i.e., \(p <0.05\)), meaning it is very unlikely we would have observed a value as big as \(\hat \beta_k\) if the null hypothesis were true. Therefore, we can reject the null hypothesis. The short sketch below shows where the “about 2” rule of thumb comes from.
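Here is a quick illustration (not in the original): the two-sided 0.05 critical values from the t distribution hover around 2 for most degrees of freedom we encounter.
## Critical t values for a two-sided test at the 0.05 level, for several degrees of freedom
qt(.975, df = c(10, 30, 65, 1000))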
However, to get a specific quantity, we need to calculate the p-value, which depends on the t-statistic and the degrees of freedom in the model. The degrees of freedom in a regression model are \(N-k\): the number of observations minus the number of coefficients estimated (the independent variables plus the intercept).
In R, we can calculate p-values using the pt()
function. By default, most people use two-sided hypothesis tests for regression. So to do that, we are going to find the area in both tails beyond our t-value, or equivalently, multiply the area to the right of the absolute value of our t-statistic by 2.
## Let's say t was 2.05 and
## there were 32 observations and 3 variables in the regression plus an intercept
t <- 2.05
df.t <- 32 - 4
p.value <- 2 * (pt(abs(t), df = df.t, lower.tail = FALSE))
p.value
[1] 0.04983394
Let’s do this for the florida example. First, we can find t by dividing our coefficients by the standard errors.
t <- coef(fit.1) / (sqrt(diag(vcov(fit.1))))
t
(Intercept) Perot96
0.02704523 8.27522567
## Compare with output
summary(fit.1)$coefficients[, 3]
(Intercept) Perot96
0.02704523 8.27522567
We can then find the p-values.
t <- coef(fit.1) / (sqrt(diag(vcov(fit.1))))
df.t <- fit.1$df.residual
p.value <- 2 * (pt(abs(t), df = df.t, lower.tail = FALSE))
p.value
(Intercept) Perot96
9.785065e-01 9.473505e-12
summary(fit.1)$coefficients[, 4]
(Intercept) Perot96
9.785065e-01 9.473505e-12
We see that the coefficient for Perot96
is significant. The p-value is tiny. For small numbers, R automatically shifts to scientific notation. The 9.47e-12 means the p-value is essentially zero, with the stars in the summary output indicating that \(p < 0.001\). R will also output a test of the significance of the intercept using the same formula as all other coefficients. This generally does not have much interpretive value, so you are usually safe to ignore it.
Confidence Intervals
Instead of representing the significance using p-values, sometimes it is helpful to report confidence intervals around the coefficients. This can be particularly useful when visualizing the coefficients. The 95% confidence interval represents roughly 2 standard errors above and below the coefficient. The key thing to look for is whether it overlaps with zero (not significant) or does not (in which case the coefficient is significant).
The precise formula is
\(\widehat{\beta}_k\) confidence interval: \(\big(\widehat{\beta}_k - t_{crit.value} \times s.e._{\widehat{\beta}_k},\; \widehat{\beta}_k + t_{crit.value} \times s.e._{\widehat{\beta}_k}\big)\)
In R, we can use qt()
to get the specific critical value associated with a 95% confidence interval. This will be around 2, but fluctuates depending on the degrees of freedom in your model (which are a function of your sample size and how many variables you have in the model). R also has a shortcut confint()
function to extract the confidence intervals from the model. Below we do this for the Perot96
coefficient.
## Critical values from t distribution at .95 level
qt(.975, df = fit.1$df.residual) # n- k degrees of freedom
[1] 1.997138
## Shortcut
confint(fit.1)[2,]
2.5 % 97.5 %
0.02724733 0.04458275
## By hand
coef(fit.1)[2] - qt(.975, df = fit.1$df.residual)*sqrt(diag(vcov(fit.1)))[2]
Perot96
0.02724733
coef(fit.1)[2] + qt(.975, df = fit.1$df.residual)*sqrt(diag(vcov(fit.1)))[2]
Perot96
0.04458275
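As a quick illustration of the visualization point above (not part of the original), here is a minimal base-R sketch of a coefficient plot for Perot96 using the interval we just computed.
## Plot the Perot96 estimate with its 95% confidence interval
est <- coef(fit.1)[2]
ci <- confint(fit.1)[2, ]
plot(est, 1, xlim = range(c(ci, 0)), ylim = c(0.5, 1.5),
     pch = 19, yaxt = "n", ylab = "", xlab = "Coefficient estimate")
segments(x0 = ci[1], y0 = 1, x1 = ci[2], y1 = 1) # confidence interval
abline(v = 0, lty = 2)                           # reference line at zero
axis(2, at = 1, labels = "Perot96", las = 1)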
4.5.3 Goodness of Fit
A last noteworthy component of the standard regression output is the set of goodness of fit statistics. For this class, we can pay less attention to these, though there will be some analogues when we get into likelihood.
These are measures of how much of the total variation in our outcome measure can be explained by our model, as well as how far off our estimates are from the actual outcomes.
For the first two measures, R-squared and Adjusted R-squared, we draw on three quantities (a short sketch after the definitions below recomputes each of them for our model):
Total Sum of Squares–how much variance in \(Y_i\) is there to explain?
- \(TSS: \sum_{i=1}^N (Y_i -\overline{Y})^2\)
Estimated Sum of Squares–how much of this variance do we explain?
- \(ESS: \sum_{i=1}^N (\widehat Y_i -\overline{Y})^2\)
Residual Sum of Squares–how much variance is unexplained?
- \(RSS: \sum_{i=1}^N ( Y_i -\widehat Y_i)^2\)
\(TSS = ESS + RSS\)
Multiple R-squared: \(\frac{ESS}{TSS}\)
- This is a value from 0 to 1, representing the proportion of the variance in the outcome that can be explained by the model. Higher values are generally considered better, but there are many factors that can affect R-squared values. In most social science tasks where the goal is to engage in hypothesis testing of coefficients, this measure is of less value.
Adjusted R-squared: \(1 - \frac{\frac{RSS}{n - k}}{\frac{TSS}{n - 1}}\)
- This is essentially a penalized version of R-squared. When you add additional predictors to a model, the R-squared value can never decrease, even if the predictors are useless. The Adjusted R-squared adds a consideration for the degrees of freedom into the equation, creating a penalty for adding more and more predictors.
Residual standard error aka root mean squared error aka square root of the mean squared residual: \(r.s.e = \sqrt{\frac{RSS}{n-k}}\)
- This represents the typical deviation of an estimate of the outcome from the actual outcome. This quantity is often used to assess the quality of prediction exercises. It is used less often in social science tasks where the goal is hypothesis testing of the relationship between one or more independent variables and the outcome.
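To make these definitions concrete, the sketch below (not part of the original) recomputes each quantity by hand for fit.1 and compares it with what summary() reports.
## Recompute the goodness-of-fit quantities by hand for fit.1
y <- florida$Buchanan00
rss <- sum(residuals(fit.1)^2)        # residual sum of squares
tss <- sum((y - mean(y))^2)           # total sum of squares
ess <- tss - rss                      # explained (estimated) sum of squares
n <- nrow(florida)
k <- length(coef(fit.1))              # intercept + Perot96 = 2
ess / tss                             # Multiple R-squared
summary(fit.1)$r.squared
1 - (rss / (n - k)) / (tss / (n - 1)) # Adjusted R-squared
summary(fit.1)$adj.r.squared
sqrt(rss / (n - k))                   # residual standard error
summary(fit.1)$sigma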
F-Statistic
So far we have conducted hypothesis tests for each individual coefficient. We can also conduct a global hypothesis test, where the null hypothesis is that all coefficients are zero, with the alternative being that at least one coefficient is nonzero. This is the test represented by the F-statistic in the regression output.
The F-statistic helps us test the null hypothesis that all of the regression slopes are 0: \(H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0\)
- \(F_0 = \frac{ESS/(k - 1)}{RSS/(n - k)}\)
- The F-Statistic has two separate degrees of freedom.
- The model sum of squares degrees of freedom (ESS) are \(k - 1\).
- The residual error degrees of freedom (RSS) are \(n - k\).
- In a regression output, the model degrees of freedom are generally the first presented: “F-statistic: 3.595 on \((k - 1) = 1\) and \((n - k) = 48\) DF.”
Note: This test is different from our separate hypothesis tests that an individual regression slope \(\beta_k\) is 0. For those, we use the t-tests discussed above.
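For completeness, here is a short sketch (not part of the original) computing the F-statistic by hand for fit.1 and comparing it with the summary output; the p-value for the global test comes from the F distribution via pf().
## By-hand F-statistic for fit.1
y <- florida$Buchanan00
rss <- sum(residuals(fit.1)^2)
tss <- sum((y - mean(y))^2)
ess <- tss - rss
k <- length(coef(fit.1))
n <- nrow(florida)
f.stat <- (ess / (k - 1)) / (rss / (n - k))
f.stat
summary(fit.1)$fstatistic # value, numdf, dendf
## p-value for the global null hypothesis
pf(f.stat, df1 = k - 1, df2 = n - k, lower.tail = FALSE)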