10.2 Poisson Model
So we are left unsatisfied with a linear model for our count data. What do we do? Fortunately, there are multiple probability distributions that may be appropriate for the data generating process behind counts, and we can estimate their parameters with maximum likelihood. Since we have been in Bernoulli world for a while, let’s refresh on the steps we take when approaching a potential maximum likelihood problem.
- What is the data generating process? Based on this, describe the probability distribution for \(Y_i\).
- Define the likelihood for a single observation
- Define the likelihood for all observations
- Find the log-likelihood
- Maximize the function with respect to (wrt) \(\theta\)
The first probability distribution we will consider is the Poisson. This is a discrete distribution (so we are in pmf instead of pdf territory). Let’s go step-by-step.
- What is the data generating process? Based on this, describe the probability distribution for \(Y_i\).
- \(Y_i \sim Pois(\lambda)\)
- \(\Pr(Y=Y_i|\lambda)= \frac{exp(-\lambda) \lambda^{Y_i}}{Y_i!}\), which describes the probability that the outcome takes a particular value, given \(\lambda\).
We assume no two events occur at the same time. We also assume the probability of an event occurring at a particular time is not a function of previous events: events happen independently. Sometimes, however, the occurrence of one event makes a subsequent event more or less likely. That would violate this assumption and is suggestive of overdispersion, which we will return to later.
- Gary King uses the example of text messaging. Receiving a text message now probably makes it more likely you will receive additional text messages (i.e., group chats)
Recall that in the normal case, we needed two parameters, \(\mu\) and \(\sigma^2\), to describe the shape of the distribution: the mean and the variance of a normally distributed outcome.
In the Poisson case, the shape of our distribution is defined by the single parameter \(\lambda\). A special feature of the Poisson probability distribution is that \(\lambda\) is both the parameter for the mean \(\mathbf E(Y) = \lambda\) and the variance \(\mathbf V(Y) = \lambda\).
- Try to remember this detail, because it will come up later when we assess if a Poisson process is actually a good approximation of the data generating process. It’s often not the case that the mean is the same as the variance; in real data, the mean can be quite a bad approximation of the variance.
Let’s look at an example of Poisson data to prove to ourselves that if data are Poisson the mean and variance are equivalent:
## We use rpois() to generate some count data according to the Poisson distribution.
## Let's specify lambda = 4
## This means the mean of the distribution will be 4 and the variance will be 4
## In any given sample, it might be slightly off from this
Y <- rpois(n=100000, lambda = 4)
mean(Y)
[1] 3.99614
var(Y)
[1] 4.020165
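For contrast, here is a small sketch (simulated data with made-up parameters, not part of the notes’ example) of what overdispersed counts look like: negative binomial draws with the same mean of 4 but a much larger variance, the kind of violation flagged above.
## A sketch (made-up parameters): overdispersed counts via the negative binomial
## With mu = 4 and size = 1, the mean is 4 but the variance is 4 + 4^2/1 = 20
set.seed(123)
Y_over <- rnbinom(n = 100000, mu = 4, size = 1)
mean(Y_over)  ## close to 4
var(Y_over)   ## far above 4 -- mean = variance would be a bad approximation here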
The pmf above describes the probability that Y takes on a particular value.
## For example given a Y ~ Pois(lambda = 4), let's look at the probability of getting particular counts using dpois, the pmf function for poisson
counts <- 0:16
probs <- dpois(counts, lambda = 4)
names(probs) <- counts
round(probs, digits=3)
0 1 2 3 4 5 6 7 8 9 10 11 12
0.018 0.073 0.147 0.195 0.195 0.156 0.104 0.060 0.030 0.013 0.005 0.002 0.001
13 14 15 16
0.000 0.000 0.000 0.000
## The probability is higher the closer we are to 4, the mean of the distribution
## Let's check our formula from above for the probability Y = 2
lambda <- 4
yi <- 2
dpois(yi, lambda=4)
[1] 0.1465251
## formula from above
exp(-lambda)*lambda^yi/factorial(yi)
[1] 0.1465251
The expected count here is 4, the mean of the distribution. The probability of any given count is specified according to the pmf above.
Adding covariates
In regression, we will consider \(\lambda\) (the expected count) to be a function of \(\mathbf x_i'\beta\), and we will estimate the expected outcome \(\mathbf E(Y_i | \mathbf x_i)\) given values of \(\mathbf x_i\).
However, just like in the logit/probit case, our parameter \(\lambda \neq \mathbf x_i'\beta\). Instead, it is a nonlinear function of \(\mathbf x_i'\beta\). Here, we just have a different link function. Specifically,
- \(\lambda_i = exp(\mathbf x_i^T\beta)\)
- \(\log \lambda_i = \eta_i = \mathbf x_i^T\beta\)
- \(\lambda_i = \mathbf E[Y_i | \mathbf x_i] = \exp(\beta_0 + \beta_1x_{i1} + ... + \beta_{k}x_{ik})\) is the expected number of events per a unit of time or space
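As a quick numeric illustration (the \(\eta_i\) values below are made up, not from the notes), the exp() link guarantees a positive expected count no matter the sign of the linear predictor, and the log maps it back:
## Hypothetical linear predictor values eta_i = x_i'beta
eta <- c(-2, 0, 3)
exp(eta)       ## implied expected counts lambda_i: always positive
log(exp(eta))  ## the log link recovers the linear predictor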
Analogy: recall that in the Bernoulli case, we had just the parameter \(\pi\), which described the expected probability of success for a single trial, and in logistic regression \(\pi_i = \frac{exp(\mathbf x_i^T\beta)}{1 + exp(\mathbf x_i^T\beta)}\). The Poisson case works the same way; we simply use a different link function.
OK, if we are using the existing functions in R, we can essentially stop here and proceed to glm (yes, we get to use our glm friend again). But let’s look at the likelihood to finish out the process.
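As a preview, here is a minimal glm() sketch on simulated data (the covariate, coefficient values, and seed are invented for illustration, not the notes’ example):
## Simulate counts whose log-mean is linear in x1, then fit a Poisson GLM
set.seed(1234)
n <- 1000
x1 <- rnorm(n)
lambda_i <- exp(0.5 + 0.8 * x1)   ## true beta_0 = 0.5, beta_1 = 0.8 (made up)
y <- rpois(n, lambda = lambda_i)
fit <- glm(y ~ x1, family = poisson(link = "log"))
coef(fit)                         ## should land close to 0.5 and 0.8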
- Define the likelihood for a single observation
This is just that pmf from above. For now, we will just write \(\lambda\), but we know eventually we will need to substitute it with our expression \(\lambda_i = exp(\mathbf x_i^T\beta)\).
\[\begin{align*} \mathcal L(\beta |Y_i)=\Pr(Y=Y_i|\lambda) \end{align*}\]
- Define the likelihood for all observations
Here, we need to multiply across all observations. To do this, we are assuming independence.
\[\begin{align*} \mathcal L(\beta |Y)&=\mathcal L(\beta|Y_1)\times\mathcal L(\beta|Y_2)\times \ldots \times \mathcal L(\beta|Y_{n})\\ \mathcal L(\beta|Y)&=\prod_{i=1}^N\mathcal L(\beta|Y_i)\\ &= \prod_{i = 1}^{N}\frac{1}{Y_i!}\lambda_i^{Y_i}\exp(-\lambda_i) \end{align*}\]
- Find the log-likelihood
We’ve seen this party trick before. Taking the \(\log\) gives us the sums:
\[\begin{align*} \ell(\beta|Y)&=\sum_{i=1}^{n}\log(\mathcal L(\beta|Y_i))\\ &= \sum_{i = 1}^{n}\log\left(\frac{1}{Y_i!}\lambda_i^{Y_i}\exp(-\lambda_i)\right)\\ &= \sum_{i = 1}^{n}\log \left(\frac{1}{Y_i!}\right) + Y_i\log(\lambda_i) - \lambda_i\\ &= \sum_{i = 1}^{n}Y_i\mathbf x_i^\top\beta - \exp(\mathbf x_i^\top\beta) - \log(Y_i!) \end{align*}\]
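To make the last line concrete, here is a sketch of the log-likelihood as an R function (pois_ll is a name introduced here; it assumes y is the vector of counts and X is the design matrix with a leading column of 1s):
## Poisson log-likelihood, following the final line of the derivation
pois_ll <- function(beta, y, X) {
  eta <- X %*% beta                       ## linear predictor x_i'beta
  sum(y * eta - exp(eta) - lfactorial(y)) ## sum of Y_i*eta_i - exp(eta_i) - log(Y_i!)
}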
- Maximize the function with respect to (wrt) \(\theta\)
Oof, this is where we take the derivative to find the score, \(S(\theta)\).
- Fortunately, the last term does not have a \(\beta\), so it falls out
- Recall, the derivative of \(e^{z}\) with respect to \(z\) is \(e^{z}\)
\[\begin{align*} \frac{\partial}{\partial \beta} \ell(\beta | Y, X) &= \frac{\partial}{\partial \beta} \sum_{i = 1}^{n}Y_i\mathbf x_i^\top\beta - \exp(\mathbf x_i^\top\beta) - \log(Y_i!)\\ &= \sum_{i = 1}^{n} (Y_i - \exp(\mathbf x_i^\top\beta))\mathbf x_i\\ &= X^TY - X^T\exp(X\beta) \text{, which is a } k \times 1 \text{ vector} \end{align*}\]
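In code, the last line of the score is simply (a sketch; pois_score is a name introduced here, matching the arguments of pois_ll above):
## Score (gradient) of the Poisson log-likelihood: X'(Y - exp(X beta)), a k x 1 vector
pois_score <- function(beta, y, X) {
  as.vector(t(X) %*% (y - exp(X %*% beta)))
}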
Setting this score to zero will not yield a closed form solution for \(\hat \beta\). Instead, we have to use numerical methods to estimate the parameters (e.g., think optim).
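For instance, here is a sketch of the numerical maximization with optim(), reusing pois_ll() and pois_score() from above and the simulated y and x1 from the earlier glm() sketch:
## fnscale = -1 turns optim's minimizer into a maximizer of the log-likelihood
X <- cbind(1, x1)
opt <- optim(par = c(0, 0), fn = pois_ll, gr = pois_score, y = y, X = X,
             method = "BFGS", control = list(fnscale = -1))
opt$par  ## should be essentially identical to coef(fit) from glm()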
The good thing is that now that we have taken the first derivative, we can take the second derivative to find the Hessian, which will allow us to estimate uncertainty.
\[\begin{align*} H = \frac{\partial^2 \ell(\beta | Y, X)}{\partial \beta \partial \beta^\top} &= - \sum_{i = 1}^{n} \mathbf x_i\mathbf x_i^\top\exp(\mathbf x_i^\top\beta)\\ &= - X^TVX \text{ where } V \text{ is the } n \times n \text{ diagonal matrix with entries } \exp(\mathbf x_i^\top\beta) \end{align*}\]
Note this is the \(k \times k\) matrix!
- For the variance estimates of our coefficients, we want the \((\mathbf E(-H))^{-1} = (X^TVX)^{-1}\)
- That is our vcov(fit) in the Poisson case
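Before the full translation, a quick sketch checking that formula against glm(), reusing X from the optim() sketch and the fitted fit object from the earlier glm() sketch (evaluating V at the glm estimates):
## (X'VX)^{-1} with V = diag(exp(X beta_hat)); compare with vcov(fit)
V_hat <- diag(as.vector(exp(X %*% coef(fit))))
solve(t(X) %*% V_hat %*% X)  ## should closely match vcov(fit)
vcov(fit)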
OK, let’s start translating this math into R.