11.1 Sample Selection
Here are a few thought exercises to underscore the potential issues with common sources of data.
Graduate School Admissions Suppose we observe that college grades are uncorrelated with success in graduate school. Can we infer that college grades are irrelevant?
- No. Applicants admitted with low grades may not be representative of the population with low grades. Unmeasured variables (e.g., motivation) used in the admissions process might explain why those who enter graduate school with low grades do as well as those who enter with high grades.
- Selection into graduate school is not random
- Implication: there may be unmeasured factors that bias our inferences from the sample for which we do have complete data (graduate students)
- Solution: use a sample selection model to account for the non-random sample
What leads rivals to wage war?
Lemke and Reed (2001) argue that focusing only on rivals may lead to biased inference.
- Need a first DV: whether members of “great power” dyads are rivals
- In addition to a second: whether “great power rivals” wage war
“We discover that what makes great powers more likely to be rivals is statistically related to their propensity to experience war.” The results suggest that any analysis of the onset of war between rivals that fails to control for the prior influence of variables on the existence of rivalry almost surely produces inaccurate estimates.
11.1.1 How do we go about estimating this?
The technical details, followed by implementation.
Selection equation (is the individual selected into the grad school sample?)
- \(\zeta_i = z_i^T\gamma + \delta_i\)
- \(\zeta_i\) DV of selection equation
- \(z_i^T\) vector of covariates for selection equation
- \(\gamma\) vector of coefficients for selection equation
- \(\delta_i\) random disturbances
Outcome equation
- \(\xi_i = x_i^T\beta + \epsilon_i\)
- \(\xi_i\) DV of outcome equation (success in grad school)
- \(x_i^T\) vector of covariates for outcome equation
- \(\beta\) vector of coefficients for outcome equation
- \(\epsilon_i\) random disturbances
The Problem
We actually want estimates of \(Y_i\), not \(\xi_i\):
- \(Y_i = \begin{cases} \xi_i & \text{if } \zeta_i > 0 \text{ (selected into the sample)} \\ \text{unobserved} & \text{if } \zeta_i \leq 0 \end{cases}\)
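A small simulation makes the problem concrete (a sketch with a hypothetical data-generating process, written in Python with numpy rather than R): \(\xi_i\) exists for everyone, but \(Y_i\) is observed only for the selected, and because the two disturbances are correlated, naive OLS on the observed subsample is biased.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical DGP: x drives the outcome; z enters only the
# selection equation (this is the exclusion restriction)
x = rng.normal(size=n)
z = rng.normal(size=n)

# Correlated disturbances: delta_i (selection) and epsilon_i (outcome)
rho = 0.7
delta, eps = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n).T

# Selection equation: zeta_i = z_i'gamma + delta_i; selected when zeta_i > 0
zeta = 0.5 + 1.0 * x + 1.0 * z + delta
selected = zeta > 0

# Outcome equation: xi_i = x_i'beta + epsilon_i, with true beta = (2, 1)
xi = 2.0 + 1.0 * x + eps

# Naive OLS of Y_i on x_i using only the selected observations
X = np.column_stack([np.ones(selected.sum()), x[selected]])
beta_naive = np.linalg.lstsq(X, xi[selected], rcond=None)[0]
# The slope is attenuated (well below the true value of 1) and the
# intercept is inflated, because E[eps | selected] varies with x
```

The bias appears because, among the selected, observations with low \(x\) needed a large positive \(\delta_i\) (and hence, through \(\rho\), a large \(\epsilon_i\)) to make it into the sample.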
Two-step Estimation
- Define a dichotomous outcome to indicate whether the observation is in the sample or not: \(W_i = \begin{cases} 1 & \text{if } \zeta_i > 0 \\ 0 & \text{if } \zeta_i \leq 0 \end{cases}\)
- Fit a probit regression with \(W_i\) as the outcome with the linear predictor: \(\hat \psi_i = z_i^T\hat \gamma\)
- Calculate the “inverse Mills ratio,” \(\hat \eta_i = \frac{\phi(\hat \psi_i)}{\Phi(\hat \psi_i)}\)
- Note this is `dnorm()/pnorm()` in R: the ratio of the normal probability density function to the cumulative distribution function, evaluated at \(\hat \psi_i\) for each \(i\)
- Use \(\hat \eta_i\) as an auxiliary regressor in the OLS regression of \(Y_i\) on \(x_i^T\) for the observations where \(Y_i\) is observed.
- Note the SEs have to be adjusted (the usual OLS standard errors are incorrect because \(\hat \eta_i\) is itself estimated).
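The two steps above can be sketched end to end on simulated data (a minimal illustration, not a production implementation; in practice you would use a canned routine such as R's `sampleSelection` package, and note the second-step standard errors here are the unadjusted OLS ones):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 50_000

# Simulated data: z appears only in the selection equation (exclusion
# restriction); true outcome coefficients beta = (2, 1), rho = 0.7
x = rng.normal(size=n)
z = rng.normal(size=n)
rho = 0.7
delta, eps = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n).T
W = (0.5 + x + z + delta > 0).astype(float)   # W_i: in the sample or not
xi = 2.0 + 1.0 * x + eps                      # xi_i, observed only if W_i = 1

# Step 1: probit of W_i on the selection covariates (1, x, z),
# fit by maximizing the probit log-likelihood directly
Z = np.column_stack([np.ones(n), x, z])

def neg_loglik(gamma):
    p = np.clip(norm.cdf(Z @ gamma), 1e-10, 1 - 1e-10)
    return -np.sum(W * np.log(p) + (1 - W) * np.log(1 - p))

gamma_hat = minimize(neg_loglik, np.zeros(3), method="BFGS").x

# Step 2: inverse Mills ratio (dnorm()/pnorm() in R) from the step-1
# linear predictor, then OLS with eta_hat as the auxiliary regressor
psi_hat = Z @ gamma_hat
eta_hat = norm.pdf(psi_hat) / norm.cdf(psi_hat)

obs = W == 1
X = np.column_stack([np.ones(obs.sum()), x[obs], eta_hat[obs]])
coef = np.linalg.lstsq(X, xi[obs], rcond=None)[0]
# coef[0], coef[1] recover beta; coef[2] estimates rho * sigma_epsilon
```

The coefficient on \(\hat \eta_i\) estimates \(\rho_{\epsilon\delta}\sigma_\epsilon\), so a value near zero suggests little selection bias in the first place.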
11.1.2 Sample Selection Model Assumptions
Big assumptions
- Exclusion restriction – selection equation should contain at least one variable that predicts selection but not the outcome
- Errors in the probit equation are homoskedastic
- Error terms for the selection and outcome equations are allowed to be correlated (\(\rho_{\epsilon \delta}\)); selection bias arises when \(\rho_{\epsilon \delta} \neq 0\).
- \(\epsilon_i\) and \(\delta_i\) should be distributed as bivariate normal if using the MLE approach discussed below.
- \(\epsilon_i\) and \(\delta_i\) should be independent of the regressors in their equations
- Results can be sensitive to how you specify the selection equation
A useful discussion of these issues in IR is Simmons and Hopkins (2005). An extension of this model has also been developed for models where the outcome is dichotomous (see: bivariate probit models).