6.1 Data Generating Process
Let’s say \(Y_i\) is a set of 0’s and 1’s indicating whether two states have experienced a dispute, an outcome common in IR studies.
\[\begin{gather*} Y_i = \begin{cases}1, \;\text{a dispute happened}\\ 0,\; \text{a dispute did not happen}\end{cases} \end{gather*}\]
- Outside of political science, for example, in the field of higher education, we might instead think about an outcome related to whether a student has (= 1) or has not (= 0) had a negative advising experience.
# Example of first 20 observations (Y1, Y2, ..., Y20)
1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
We need to align these data with a data generating process and distribution.
- Each \(Y_i\) is like a single trial, where a dispute occurs with some probability \(\pi\)
- This sounds like the Bernoulli distribution! \(Y_i \sim \text{Bernoulli}(\pi)\) (see the simulation sketch below)
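To make this concrete, below is a minimal R sketch of this data generating process. The dispute probability of 0.15 (the share of 1’s in the example above), the seed, and the sample size are illustrative choices, not values estimated from real data.

```r
# Minimal sketch of the DGP: each Y_i is an independent Bernoulli trial,
# i.e., a single coin flip with some probability pi of a dispute.
set.seed(1234)     # arbitrary seed for reproducibility
pi_true <- 0.15    # hypothetical dispute probability (illustrative)
n <- 20            # number of observations, matching the example above

y_sim <- rbinom(n, size = 1, prob = pi_true)  # size = 1 makes each draw Bernoulli
y_sim
```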
6.1.1 MLE Estimation
Let’s do the steps we saw in the previous section.
What is the data generating process? Based on this, describe the probability distribution for \(Y_i\).
- Note: if you are using a function like `glm()`, you can proceed directly there after this step. However, let’s work under the hood for a bit.
\[\begin{align*} Y_i \sim f(Y_i | \pi) &= Pr(Y_i = y_i |\pi) = \underbrace{\pi^{y_i}(1 -\pi)^{(1-y_i)}}_\text{pmf for Bernoulli} \end{align*}\] for \(y_i = 0,1\)
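As a quick check of the formula, the pmf written out by hand matches R’s built-in dbinom() with size = 1; the value of \(\pi\) below is arbitrary.

```r
# Sketch: hand-coded Bernoulli pmf vs. R's dbinom() with size = 1.
p <- 0.3                       # arbitrary value standing in for pi
y <- c(0, 1)                   # the two possible outcomes
p^y * (1 - p)^(1 - y)          # pmf written out by hand
dbinom(y, size = 1, prob = p)  # same values from R's built-in pmf
```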
Define the likelihood for a single observation
Define the likelihood for all observations
Find the log-likelihood
\[\begin{align*} \mathcal L( \pi | Y_i) &= \underbrace{\pi^{y_i}(1 -\pi)^{(1-y_i)}}_\text{Likelihood for single observation}\\ \mathcal L( \pi | Y) &= \underbrace{\prod_{i=1}^n\pi^{y_i}(1 -\pi)^{(1-y_i)}}_\text{Likelihood for all observations}\\ \ell( \pi | Y) &= \underbrace{\sum_{i=1}^n\log \pi^{y_i}(1 -\pi)^{(1-y_i)}}_\text{Log likelihood}\\ \hat \pi &= \text{Next step: arg max } \ell( \pi | Y) \text{ wrt $\pi$} \end{align*}\]
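Before adding covariates, here is a minimal sketch of these steps for the no-covariate case: code the log-likelihood as a function of \(\pi\) and maximize it numerically over (0, 1), using the 20 example observations from above. With no covariates, the maximizer should equal the sample mean.

```r
# The 20 example observations from above
y <- c(1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1)

# Log-likelihood for a constant pi: sum of the log Bernoulli pmf contributions
loglik_bern <- function(pi, y) {
  sum(y * log(pi) + (1 - y) * log(1 - pi))
}

# Search over (0, 1) for the value of pi that maximizes the log-likelihood
opt <- optimize(loglik_bern, interval = c(0.001, 0.999), y = y, maximum = TRUE)
opt$maximum  # numerical arg max
mean(y)      # analytical MLE for comparison: the sample proportion of 1's
```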
Add step: nonlinear transformation to \(X\beta\)
Note that because we are likely using covariates, we need to express our parameter as a function of \(X\beta\). Why? Because we don’t think there is a constant probability for a dispute. Instead, we think the probability of a dispute varies according to different independent variables, which are included in the \(X\) matrix and eventually will each have their own \(\beta_k\) relationship with the probability of dispute.
- Now that we are outside of linear territory, we cannot simply replace \(\pi\) with \(X\beta\) in the equation. Instead, \(\pi_i\) is a function of the covariates and coefficients: \(\pi_i = g(\mathbf{x}_i, \beta) \neq \mathbf{x}_i'\beta\)
This is because we need a transformation to map our linear predictor into sensible units of the outcome. The logit is one variety, as is the probit. Like two roads diverged in a yellow wood, this is the point in the process where we choose the transformation. For this first example, let’s apply the logit, whose inverse maps the linear predictor to values between 0 and 1 (a good thing for probability!), where:
\(\pi_i = \text{logit}^{-1}(\eta_i) = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)} = \frac{\exp(\mathbf{x}_i'\beta)}{1 + \exp(\mathbf{x}_i'\beta)}\)
\(\eta_i = \text{logit}(\pi_i) = \log\frac{\pi_i}{1-\pi_i} = \mathbf{x}_i'\beta\)
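Here is a small sketch of these two transformations in R, evaluated on an arbitrary grid of linear predictor values. Note that R’s built-in plogis() is exactly the inverse logit.

```r
# Sketch of the logit and inverse logit, using an arbitrary grid of eta values
eta <- seq(-4, 4, by = 2)

inv_logit <- function(eta) exp(eta) / (1 + exp(eta))  # maps the real line to (0, 1)
logit     <- function(pi) log(pi / (1 - pi))          # maps (0, 1) back to the real line

pi_vals <- inv_logit(eta)
pi_vals          # probabilities strictly between 0 and 1
logit(pi_vals)   # recovers eta
plogis(eta)      # R's built-in inverse logit gives the same values
```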
- Maximize the function with respect to (wrt) \(\theta\)
Where \(\pi_i = \frac{\exp(\mathbf{x}_i'\beta)}{1 + \exp(\mathbf{x}_i'\beta)}\)
\[\begin{align*} \hat \pi &= \text{arg max } \ell( \pi | Y) \text{ wrt $\pi$} \\ &= \text{arg max} \sum_{i=1}^n\log \pi_i^{y_i}(1 -\pi_i)^{(1-y_i)}\\ &= \text{arg max} \sum_{i=1}^n \underbrace{y_i \log \Big( \frac{\exp(\mathbf{x}_i'\beta)}{1 + \exp(\mathbf{x}_i'\beta)}\Big) + (1-y_i)\log \Big(1-\frac{\exp(\mathbf{x}_i'\beta)}{1 + \exp(\mathbf{x}_i'\beta)}\Big)}_\text{We replaced $\pi_i$ and used the rule $\log a^b = b \log a$ to bring down the $y_i$ terms.} \end{align*}\]
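Here is a minimal sketch of this maximization with optim(), on simulated data. The covariate, sample size, and “true” coefficients below are made-up illustrative values; the point is that \(\pi_i\) is expressed as the inverse logit of \(\mathbf{x}_i'\beta\) and the log-likelihood is maximized over \(\beta\).

```r
# Simulate illustrative data from a logit DGP (values chosen arbitrarily)
set.seed(1234)
n <- 500
x <- rnorm(n)
X <- cbind(1, x)                 # design matrix with an intercept column
beta_true <- c(-1, 0.5)          # hypothetical "true" coefficients
y <- rbinom(n, size = 1, prob = as.vector(plogis(X %*% beta_true)))

# Log-likelihood as a function of beta: replace pi_i with the inverse logit of x_i'beta
loglik_logit <- function(beta, y, X) {
  pi <- plogis(X %*% beta)
  sum(y * log(pi) + (1 - y) * log(1 - pi))
}

# optim() minimizes by default; fnscale = -1 flips it to maximization
fit <- optim(par = c(0, 0), fn = loglik_logit, y = y, X = X,
             method = "BFGS", control = list(fnscale = -1))
fit$par     # estimated beta, should be near c(-1, 0.5)
```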
At this point, we take the derivative with respect to \(\beta\). You can try this on your own, and, if you’re lucky, it may show up on a problem set near you. With a bit of work, we should get something that looks like the below, which we can represent in terms of a sum or in matrix notation.
\[\begin{align*} S(\theta) &= \sum_{i=1}^n (Y_i - \pi_i)\mathbf{x}^T_i\\ &= X^T(Y - \mathbf{\pi}) \end{align*}\]
You can note that the matrix notation retains the dimensions \(k \times 1\), which we would expect because we want to choose a set of \(k\) coefficients in our \(k \times 1\) vector \(\beta\). The score, as written in summation notation, also has length \(k\), but here we use the convention of writing \(\mathbf{x}_i^T\) in row vector representation instead of as a column vector. You could instead write the sum with the column vector \(\mathbf{x}_i\), which would give us the \(k \times 1\) dimensions directly. Either way we have \(k\) coefficients; these are just different notations.
- Take the second derivative of the log likelihood to get the “Hessian” and help estimate the uncertainty of the estimates. Again, we can represent this as a sum or in matrix notation, which can be easier when translating this into R code, where our data naturally live in matrices (see the sketch below).
\[\begin{align*} H(\theta) &= - \sum_{i=1}^n \mathbf{x}_i\mathbf{x}^T_i(\pi_i)(1 - \pi_i)\\ &= -X^TVX \end{align*}\] where \(V\) is an \(n \times n\) diagonal matrix whose \(i\)th diagonal element is \(\pi_i(1 - \pi_i)\)
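As a sketch, both quantities translate almost directly into R’s matrix operations. This reuses the simulated X, y, and the optim() estimate from the sketch above; at the MLE the score should be numerically close to zero.

```r
# Score and Hessian evaluated at the optim() estimate from the sketch above
beta_hat <- fit$par
pi_hat   <- as.vector(plogis(X %*% beta_hat))    # fitted probabilities

score   <- t(X) %*% (y - pi_hat)                 # k x 1; approximately zero at the MLE
V       <- diag(pi_hat * (1 - pi_hat))           # n x n diagonal weight matrix
hessian <- -t(X) %*% V %*% X                     # k x k

score
sqrt(diag(solve(-hessian)))                      # standard errors from the inverse of -H
```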
Once we have these quantities, we can optimize the function with an algorithm or go to `glm()` in R, which will do that for us.
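For comparison, here is a sketch of the glm() route, reusing the simulated y and x from above. Its coefficients and standard errors should match the by-hand optim() and Hessian results.

```r
# glm() does the optimization and the standard errors for us
fit_glm <- glm(y ~ x, family = binomial(link = "logit"))
coef(fit_glm)              # should match fit$par from optim()
sqrt(diag(vcov(fit_glm)))  # should match the standard errors from the Hessian sketch
```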