5.1 What is likelihood?
Just as the derivation of the OLS coefficient estimators \(\hat \beta\) for the \(\beta\) parameters started with a goal (minimizing the sum of squared errors) in describing the relationship between variables, with likelihood we also start with a goal.
The “likelihood” is going to ask the question: What values of the unknown parameters make the data we see least surprising?
When we get into likelihood, we will be drawing more directly on concepts within probability. We start by making a choice about what type of data generating process best describes our outcome data. Eventually, our likelihood function represents the “probability density of the data given the parameters and predictors.” (Definition taken from Gelman et al. 2020, pg. 105).
- In MLE, we are going to choose parameter estimates \(\widehat{\theta}\) for \(\theta\) that maximize the likelihood that our data came from the particular distribution.
Already, we are placing a lot of emphasis on the nature of our outcome data. The nature of our likelihood will change depending on whether the data are dichotomous (e.g., similar to a set of coin flips that could be heads or tails), a count (e.g., similar to the number of events expected to occur over a certain interval), or a more continuous numeric distribution (e.g., where the probabilities of certain levels can be visualized similar to a bell curve). Each of these sets of data is generated through a different process, which is described by a particular probability function.
Example
This introduction is based on Ben Lambert’s video. I highly recommend watching this 8-9 minute video. Below we highlight a few of the key concepts and definitions.
Take the UK population of 70 million. We have a sample of this, and in our sample some observations are male and some female. How can we use what we have, a sample, to estimate the probability that an individual is male?
First, we can make a judgment about the data generating process. We can suppose there is some probability distribution function that determines the probability that an individual is male or female. A probability density function (PDF for continuous data, or PMF for discrete data) tells you the relative probability or likelihood of observing a particular value given the parameters of the distribution.
- Let’s call this \(f(y_i | p)\) where \(y_i = 1\) if male, 0 if female.
- We will say \(p\) is the probability an individual is male, so the function takes the form \(p^{y_i}(1 - p)^{1-y_i}\).
We are going to treat this like a toss of a coin, which has a Bernoulli distribution. (Every probability distribution we are dealing with is going to have an associated formula. You don’t need to memorize these. You can look them up on the internet when needed.) So, \(f()\) tells us the probability we would have gotten the value of the observation if we think of our observation as this Bernoulli toss of a coin. This is the likelihood for a single observation.
For example, we can now plug this into our Bernoulli function for the two possible cases of values for \(y_i\).
- \(f(1|p) = p^1(1-p)^{1-1} = p\) probability an individual is male
- \(f(0 |p)= p ^0(1-p)^{1-0} = 1-p\) probability individual is female
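To make this concrete, here is a minimal Python sketch (the function name `bernoulli_lik` is ours, purely for illustration) that evaluates the Bernoulli likelihood for a single observation and recovers the two cases above:

```python
# Bernoulli likelihood for a single observation:
# f(y_i | p) = p^y_i * (1 - p)^(1 - y_i)
def bernoulli_lik(y_i, p):
    return p**y_i * (1 - p)**(1 - y_i)

# With p = 0.3, the two possible cases match the formulas above:
print(bernoulli_lik(1, 0.3))  # f(1|p) = p     -> 0.3
print(bernoulli_lik(0, 0.3))  # f(0|p) = 1 - p -> 0.7
```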
Now we want an estimate using our entire sample of observations, not just a single \(i\). What if we have \(n\) observations? \(f(y_1, y_2, ... y_n | p)\). We can write the joint probability as all individual probabilities multiplied together if we assume our observations are independent. This will now represent the likelihood for all observations.
- \(L = P(Y_1 = y_1, Y_2=y_2, ..., Y_n = y_n)= \prod_{i=1}^n p^{y_i}(1 - p)^{1-y_i}\)
- This answers the question: what is the probability that \(Y_1\) took on the particular value \(y_1\), and so on for each observation?
- Ok, now we have the statement \(L = P(Y_1 = y_1, Y_2=y_2, ..., Y_n = y_n)\). This joint probability is the likelihood. (Technically it is proportional to the likelihood, but this detail will not matter for us. What this means is there is a hidden constant \(k(y)\) multiplied by the joint probability. Because likelihood is a relative concept, this constant can fall out.)
Generally, we don’t know \(p\). We are trying to estimate it. What we want to do is choose the \(\hat p\) to maximize the likelihood that we would have gotten this set of observations given that \(Y_i\) has a probability distribution as specified.
- We have used a buzz word: “maximize.” Just as in OLS, that should be our signal that a derivative should be taken so that we can find the quantities that represent the maximum.
- We differentiate \(L\) with respect to \(p\) and set the derivative to 0, which gives us \(\hat{p}\).
Our issue (or at least one of our issues) is that products are tough to differentiate. A chain rule disaster. Instead, we use a trick of taking the log of the likelihood: \(\log \prod ()\). Benefit: it turns the product into a sum, which is much easier, because \(\log ab = \log a + \log b\). So we will actually differentiate the \(\log\) of the likelihood. Yes, this is why we had logs as part of Section 3.
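As a quick numeric sketch (using a small made-up sample, not data from the text), we can evaluate the Bernoulli log-likelihood over a grid of candidate values for \(p\) and confirm it peaks at the sample mean, which is exactly the \(\hat p\) that setting the derivative to zero produces analytically:

```python
import math

# Hypothetical sample: 1 = male, 0 = female
y = [1, 0, 1, 1, 0, 1, 0, 1]

def log_lik(p, y):
    # log L = sum_i [ y_i*log(p) + (1 - y_i)*log(1 - p) ]
    return sum(yi * math.log(p) + (1 - yi) * math.log(1 - p) for yi in y)

# Crude grid search over candidate values of p
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=lambda p: log_lik(p, y))

print(p_hat)            # 0.625
print(sum(y) / len(y))  # the analytic MLE: the sample mean, 0.625
```

The grid search is just a stand-in for the calculus: it finds the same maximizer the derivative-based derivation gives in closed form.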
5.1.1 Summarizing Steps for Maximum Likelihood
Initial Setup
- What is the data generating process? This means think about the structure of the dependent variable. Is it continuous, is it a count, is it binary, is it ordered? Based on this, describe the probability distribution for \(Y_i\).
- Define the likelihood for a single observation
- Define the likelihood for all observations
- Find the log-likelihood
Then, the derivation begins! Yes, we’ve only just begun, but that first step of deciding the data generating process is huge.
Example: Count Data
Let’s say we are trying to understand the relationship between an independent variable and the number of news articles reported on a topic. This is count data. It goes from 0 to some positive number, never extending below 0. A distribution that is a good fit for this is the Poisson. We start by specifying this as our data generating process and looking up the Poisson probability mass function, which has the parameter \(\lambda\).
- Data Generating Process and probability mass function.
\[\begin{align*} &Y_i \stackrel{\rm i.i.d.}{\sim} Pois(\lambda)\Rightarrow\\ &\Pr(Y=Y_i|\lambda)=\frac{\exp(-\lambda) \lambda^{Y_i}}{Y_i!} \end{align*}\]
Note: we assume our observations are i.i.d. (independently and identically distributed). This assumption can be relaxed.
- What is the likelihood for a single observation?
\[\begin{align*} \mathcal L(\lambda|Y_i)=\Pr(Y=Y_i|\lambda) \end{align*}\]
- What is the likelihood for all observations?
\[\begin{align*} \mathcal L(\lambda|Y)&=\mathcal L(\lambda|Y_1)\times\mathcal L(\lambda|Y_2)\times \ldots \times \mathcal L(\lambda|Y_{N})\\ \mathcal L(\lambda|Y)&=\prod_{i=1}^N\mathcal L(\lambda|Y_i)\\ \end{align*}\]
- Easier to work with log-likelihood
\[\begin{align*} \ell(\lambda|Y)&=\sum_{i=1}^N \log(\mathcal L(\lambda|Y_i))\\ \end{align*}\]
Given observed data \(Y\), what is the likelihood it was generated from \(\lambda\)? We will be choosing estimates of the parameters that maximize the likelihood we would have seen these data. Generally, we will also consider parameters like \(\lambda\) to be functions of our covariates: the things we think help explain our outcome.
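Following the same recipe, here is a sketch (with a made-up sample of article counts) that evaluates the Poisson log-likelihood directly; its maximizer again turns out to be the sample mean:

```python
import math

# Hypothetical counts of news articles on a topic
y = [2, 0, 3, 1, 4, 2, 2]

def pois_log_lik(lam, y):
    # log of prod_i exp(-lam) * lam^y_i / y_i!
    return sum(-lam + yi * math.log(lam) - math.log(math.factorial(yi))
               for yi in y)

# Crude grid search over candidate values of lambda
grid = [i / 1000 for i in range(1, 10001)]
lam_hat = max(grid, key=lambda lam: pois_log_lik(lam, y))

print(lam_hat)          # 2.0, the sample mean
print(sum(y) / len(y))  # 2.0
```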
For additional practice, try to write down the likelihood of a single observation, the likelihood for all observations, and the log-likelihood for an outcome we believe is normally distributed. We have \(Y_i \sim N(\mu, \sigma^2)\). Our PDF is:
\[\begin{align*} f(Y_i | \theta) &= \frac{1}{\sigma\sqrt{2\pi}}e^{\frac{-(Y_i-\mu)^2}{2\sigma^2}} \end{align*}\]
You can take it from here.
Try on your own, then expand for the solution.
We have the PDF.
- Let’s write the likelihood for a single observation.
\[\begin{align*} L(\theta | Y_i) = L(\mu, \sigma^2 | Y_i) &= \frac{1}{\sigma\sqrt{2\pi}}e^{\frac{-(Y_i-\mu)^2}{2\sigma^2}} \end{align*}\]
- Let’s write the likelihood for all observations.
\[\begin{align*} L(\mu, \sigma^2 | Y) &= \prod_{i=1}^{N} \frac{1}{\sigma\sqrt{2\pi}}e^{\frac{-(Y_i-\mu)^2}{2\sigma^2}} \end{align*}\]
- Let’s write the log likelihood.
\[\begin{align*} \ell(\mu, \sigma^2 | Y) &= \sum_{i = 1}^N \log \Bigg( \frac{1}{\sigma\sqrt{2\pi}}e^{\frac{-(Y_i-\mu)^2}{2\sigma^2}}\Bigg)\\ &= \sum_{i = 1}^N \underbrace{\log \Bigg( \frac{1}{\sigma\sqrt{2\pi}}\Bigg) + \log e^{\frac{-(Y_i-\mu)^2}{2\sigma^2}}}_\text{Using the rule $\log ab = \log a + \log b$}\\ &= \underbrace{\sum_{i = 1}^N \log \frac{1}{\sigma\sqrt{2\pi}} - \frac{(Y_i-\mu)^2}{2\sigma^2}}_\text{The second term was of the form $\log e ^ a$, we can re-write as $a * \log e$. $\log e$ cancels to 1, leaving us with just $a$.}\\ &= \sum_{i = 1}^N \log \frac{1}{\sigma\sqrt{2\pi}} - \frac{(Y_i-\mathbf{x}_i'\beta)^2}{2\sigma^2} \end{align*}\]
Note how we usually will sub in \(X\beta\) or \(\mathbf{x_i'\beta}\) for the parameter because we think these will vary according to covariates in our data. The \(e\) in the Normal PDF is the base of the natural log. It is a mathematical constant. Sometimes you might see this written as \(exp\) instead of \(e\). In R, you can use exp(1) to get this constant. For example, \(\log e^2\) in R would be log(exp(2)).
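As a final check on the algebra above, here is a sketch (with made-up data and parameter values) confirming that the expanded log-likelihood from the last step equals the direct sum of the log of the Normal PDF:

```python
import math

# Hypothetical data and parameter values
y = [1.2, -0.4, 0.9, 2.1, 0.3]
mu, sigma2 = 0.8, 1.5
sigma = math.sqrt(sigma2)

# Direct sum: log of the Normal PDF at each observation
direct = sum(math.log(1 / (sigma * math.sqrt(2 * math.pi))
                      * math.exp(-(yi - mu)**2 / (2 * sigma2)))
             for yi in y)

# The expanded form from the derivation, using log(ab) = log a + log b
# and log(e^a) = a
expanded = sum(math.log(1 / (sigma * math.sqrt(2 * math.pi)))
               - (yi - mu)**2 / (2 * sigma2)
               for yi in y)

print(abs(direct - expanded) < 1e-9)  # True: the two forms agree
```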