5.2 Generalized Linear Models

Before we get into the details of deriving the estimators, we are going to discuss another connection between linear models and the types of models we will work with when we are using common maximum likelihood estimators.

Recall our linear model: \(y_i = \beta_0 + \beta_1x_{i1} + ... + \beta_kx_{ik} + \epsilon_i\)

  • \(Y\) is modelled by a linear function of explanatory variables \(X\)
  • \(\hat \beta\) is our estimate of how much \(X\) influences \(Y\) (the slope of the line)
  • On average, a one-unit change in \(X_{ik}\) is associated with a \(\hat \beta_{k}\) change in \(Y_i\)
  • The slope/rate of change is constant: it does not depend on where you are in \(X\). Every one-unit change has the same expected increase or decrease

Sometimes we are dealing with outcome data that are restricted or “limited” in some way such that this standard linear predictor will no longer make sense. If we keep changing \(X\), we may eventually generate predictions \(\hat y\) that extend above or below the plausible range of values for our actual observed outcomes.
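To see the problem concretely, below is a minimal sketch using simulated data (the sample size, seed, and variable names are made up for illustration): when we fit a straight line to a binary outcome, the fitted values can fall outside the plausible 0 to 1 range.

## Minimal sketch (simulated data): fitting a straight line to a binary outcome
set.seed(2138)
x <- rnorm(200, mean = 0, sd = 2)
y <- rbinom(200, size = 1, prob = plogis(x))  # outcome is restricted to 0 or 1

fit.linear <- lm(y ~ x)
range(predict(fit.linear))  # fitted values can fall below 0 or above 1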

The generalized linear model framework helps address this problem by adding two components: a nonlinear transformation and a probability model. This allows us to make predictions of our outcomes that retain the desired bounded qualities of our observed data. Generalized linear models include linear regression as a special case (a case where no nonlinear transformation is required), but, as the name suggests, the framework is much more general and can be applied to many different outcome structures.

5.2.1 GLM Model

In a GLM, we still have a “linear predictor”: \(\eta_i = \beta_0 + \beta_1x_{i1} + ... + \beta_kx_{ik}\)

  • But our \(Y_i\) might be restricted in some way (e.g., might be binary).
  • So, we now require a “link” function that tells us how \(Y\) depends on the linear predictor. This is the key to making sure our linear predictor, when transformed, will map into sensible units of \(Y\).

Our \(Y_i\) will also now be expressed in terms of a probability model, and it is this probability distribution that generates the randomness (the stochastic component of the model). For example, when we have binary outcome data, such as \(y_i =\) 1 or 0 for someone turning out to vote or not, we may try to estimate the probability that someone turns out to vote given certain explanatory variables. We can write this as \(Pr(Y_i = 1 | x_i)\).

In a GLM, we need a way to transform our linear predictor such that as we shift in values of \(X\hat \beta\), we stay within plausible probability ranges.

  • To do so, we use a “link” function to model the data.

    • For example, in logistic regression, our link function will be the “logit”:

    \[\begin{align*} Pr(Y_i = 1 | x_i) &= \pi_i\\ \eta_i = \text{logit}(\pi_i) &= \log \frac{\pi_i}{1-\pi_i} = \beta_0 + \beta_1x_{i1} + ... + \beta_kx_{ik} \end{align*}\]

  • One practical implication of this is that when we generate our coefficient estimates \(\hat \beta\), these will no longer be in units of \(y_i\) or even in units of probability. Instead, they will be in units as specified by the link function. In logistic regression, this means they will be in “logits.”

    • For every one-unit change in \(x_k\), we get a \(\hat \beta_k\) change in logits of \(y\)
  • However, the nice thing is that because we know the link function, with a little bit of work we can use the “response” function (the inverse of the link function) to transform our estimates back into the units of \(y_i\) that we care about.

\[\begin{align*} Pr(Y_i = 1 | x_i) = \pi_i &= g^{-1}(\eta_i) \\ &= \text{logit}^{-1}(\eta_i) \\ &= \frac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)} \end{align*}\]
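In R, the inverse logit is available as plogis() (and the logit itself as qlogis()). Below is a small illustration with made-up coefficient values, not estimates from any model in this section, showing how a linear predictor maps back into probabilities.

## Illustration with made-up coefficients: beta0 = -1, beta1 = 0.5
eta <- -1 + 0.5 * c(0, 1, 2)   # linear predictor (in logits) at x = 0, 1, 2
exp(eta) / (1 + exp(eta))      # inverse logit "by hand"
plogis(eta)                    # the same transformation with R's built-in function

Notice that while each one-unit change in \(x\) moves the linear predictor by the same amount (0.5 logits), the implied change in probability is not constant; this is the nonlinearity the link function introduces.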

5.2.2 Linking likelihood and the GLM

Let’s use \(\theta\) to represent the parameters of the pdf/pmf that we have deemed appropriate for our outcome data. As discussed before, we can write the likelihood for an observation as a probability statement.

  • \(\mathcal L (\theta | Y_i) = \Pr(Y=Y_i | \theta)\)

In social science, instead of thinking of these parameters as just constants (e.g., \(p\) or \(\mu\)), we generally believe that they vary according to our explanatory variables in \(X\). We think \(Y_i\) is distributed according to a particular probability function and that the parameters that shape that distribution are a function of the covariates.

  • \(Y_i \sim f(y_i | \theta_i)\) and \(\theta_i = g(X_i, \beta)\)

Each type of model we come across, guided by the structure of the dependent variable, will just have different formulas for each of these components.

Examples

| Model  | PDF | \(\theta_i\) = Link\(^{-1}(\eta_i)\) | \(\eta_i\) |
|--------|-----|--------------------------------------|------------|
| Linear | \(Y_i \sim \mathcal{N}(\mu_i,\sigma^2)\) | \(\mu_i = X_i^\prime\beta\) | \(\mu_i\) |
| Logit  | \(Y_i \sim \text{Bernoulli}(\pi_i)\) | \(\pi_i=\frac{\exp(X_i^\prime\beta)}{1+\exp(X_i^\prime\beta)}\) | \(\text{logit}(\pi_i)\) |
| Probit | \(Y_i \sim \text{Bernoulli}(\pi_i)\) | \(\pi_i = \Phi(X_i^\prime\beta)\) | \(\Phi^{-1}(\pi_i)\) |

These generalized linear models are then fit by maximum likelihood estimation, using an approach discussed in the next section where we use algorithms to choose the most likely values of the \(\beta\) parameters given the observed data.
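As a preview of that approach, here is a minimal sketch using simulated data (the sample size, seed, and "true" coefficient values are made up for illustration). We write the Bernoulli/logit log-likelihood as a function of \(\beta\), ask a general-purpose optimizer to maximize it, and compare the answer to glm().

## Minimal sketch (simulated data): maximizing a logit log-likelihood directly
set.seed(1234)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1 * x))  # "true" betas: -0.5 and 1

## The log-likelihood as a function of the beta parameters
loglik <- function(beta) {
  pi <- plogis(beta[1] + beta[2] * x)   # inverse logit of the linear predictor
  sum(dbinom(y, size = 1, prob = pi, log = TRUE))
}

## optim() minimizes by default; fnscale = -1 tells it to maximize instead
opt <- optim(c(0, 0), loglik, control = list(fnscale = -1))
opt$par

## Compare with glm() using a binomial family and logit link
coef(glm(y ~ x, family = binomial(link = "logit")))

The two sets of estimates should match up to numerical tolerance, which is the point: glm() is doing this maximization for us behind the scenes.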

Note: not all ML estimators can be written as generalized linear models, though many we use in political science are indeed GLMs. To be a GLM, the distribution we specify for the data generating process has to be part of the exponential family of probability distributions (fortunately, the Gaussian/normal, Poisson, Bernoulli, binomial, gamma, and negative binomial distributions are), and after that, we need the linear predictor and link function.

5.2.3 GLM in R

The way generalized linear models work in R is very similar to lm().

Below is a simple example where we will specify a linear model in lm() and glm() to compare.

## Load Data
florida <- read.csv("https://raw.githubusercontent.com/ktmccabe/teachingdata/main/florida.csv")

fit.lm <- lm(Buchanan00 ~ Perot96, data=florida)
fit.glm <- glm(Buchanan00 ~ Perot96, data=florida, 
               family=gaussian(link = "identity"))

For the glm, we just need to tell R the family of distributions we are using and the appropriate link function. In this example, we are going to use the normal (Gaussian) distribution to describe the data generating process for Buchanan00. This is appropriate for nice numeric continuous data, even if it isn’t perfectly normal. The normal model has a link function, but it is the special case where the link function is just the identity: no nonlinear transformation takes place. Therefore, we can still interpret the \(\hat \beta\) results in units of \(Y\) (votes in this case).
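If we want to double-check which family and link R used, the fitted glm object stores them (this just inspects the fit.glm object created above):

## Confirm the family and link stored in the fitted glm object
fit.glm$family$family
fit.glm$family$link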

In this special case, the \(\hat \beta\) estimates from lm() and glm() will be the same.

coef(fit.lm)
(Intercept)     Perot96 
 1.34575212  0.03591504 
coef(fit.glm)
(Intercept)     Perot96 
 1.34575212  0.03591504 

There are some differences in the mechanics of how we get to the results in each case, but we will explore those more in the next section. That is, these coefficients do not come out of thin air; just like in OLS, we have to work for them.