## install.packages("sampleSelection")
library(sampleSelection)
data("Mroz87")
11 Sample Selection Models
This section will provide a brief overview of models designed to address issues where we do not observe our full outcome data, or the outcome is censored or truncated in some way. In each of these cases, if we only use the standard methods discussed so far in the course, we may end up with biased estimates.
Here is a brief video overview.
Sample | Y | X | Example |
---|---|---|---|
Censored | y is known exactly only if some criterion defined in terms of y is met. | x variables are observed for the entire sample, regardless of whether y is observed exactly | If income is measured exactly only if above the poverty line. All other incomes are reported at the poverty line. |
Sample Selected | y is observed only if a criteria defined in terms of some other random variable (Z) is met. | the determinants of whether Z =1 are observed for the entire sample, regardless of whether y is observed or not | Survey data with item or unit non-response |
Truncated | y is known only if some criterion defined in terms of y is met | x variables are observed only if y is observed | Donations to political campaigns |
Here are supplemental resources
- King, Gary. 1998. Unifying political methodology: The likelihood theory of statistical inference. University of Michigan Press. Chapters: 9. Available online through Rutgers libraries.
- Fox, John. Applied Regression and Generalized Linear Models. Excerpts from Chapter 20 (see Canvas)
- The documentation for the
sampleSelection
package in R is here
11.1 Sample Selection
Here are a few thought exercises to underscore the potential issues with common sources of data.
Graduate School Admissions Suppose we observe that college grades are uncorrelated with success in graduate school. Can we infer that college grades are irrelevant?
- No. applicants admitted with low grades may not be representative of the population with low grades. Unmeasured variables (e.g. motivation) used in the admissions process might explain why those who enter graduate school with low grades do as well as those who enter graduate school with high grades.
- Selection into graduate school is not random
- Implication: may be unmeasured factors that bias our inferences from the sample we do have complete data about (graduate school students)
- Solution: May want to use a sample selection model to account for non-random sample
What leads rivals to wage war?
Lemke and Reed 2001 argue that if we only focus on rivals, this may lead to biased inference.
- Need first DV: whether members of “great power” dyads are rivals
- In addition to second: whether “great power rivals” wage war
“We discover that what makes great powers more likely to be rivals is statistically related to their propensity to experience war.” Results suggest that any analysis of the onset of war between rivals that fails to control for the prior influence of variables on the existence of rivalry almost surely produces inaccurate estimate.
11.1.1 How do we go about estimating this?
The technical details, followed by implementation.
Selection equation- into grad school sample?
- \(\zeta_i = z_i^T\gamma + \delta_i\)
- \(\zeta_i\) DV of selection equation
- \(z_i^T\) vector of covariates for selection equation
- \(\gamma\) vector of coefficients for selection equation
- \(\delta_i\) random disturbances
Outcome equation
- \(\xi_i = x_i^T\beta + \epsilon_i\)
- \(\xi_i\) DV of outcome equation (success in grad school)
- \(x_i^T\) vector of covariates for outcome equation
- \(\beta\) vector of coefficients for outcome equation
- \(\epsilon_i\) random disturbances
The Problem
We actually want estimates of \(Y_i\) not \(\xi_i\).
- \(Y_i =\)
Two-step Estimation
- Define a dichotomous outcome to indicate if in the sample or not \(W_i =\)
- Fit a probit regression with \(W_i\) as the outcome with the linear predictor: \(\hat \psi_i = z_i^T\hat \gamma\)
- Calculate the “inverse Mills ratio” of \(\hat \eta_i = \frac{\phi(\hat \psi_i)}{\Phi(\hat \psi_i)}\)
- Note this is
dnorm()/pnorm()
, ratio of the probability density function over the cumulative distribution function for each i
- Use \(\hat \eta_i\) as an auxiliary regressor of \(Y_i\) on the \(x_i^T\) for those where \(Y_i\) is observed.
- Note the SEs have to be adjusted (not just the standard OLS errors).
11.1.2 Sample Selection Model Assumptions
Big assumptions
- Exclusion restriction – selection equation should contain at least one variable that predicts selection but not the outcome
- Errors in the probit equation are homoskedastic
- Error terms for selection equation and outcome are correlated (\(\rho_{\epsilon \delta}\)).
- \(\epsilon_i\) and \(\delta_i\) should be distributed as bivariate normal if using the MLE approach discussed below.
- \(\epsilon_i\) and \(\delta_i\) should be independent of the regressors in their equations
- Results can be sensitive to how you specify the selection equation
A useful discussion in IR about these issues is Simmons and Hopkins (2005). An extension of this model has also been developed for models where the outcome is dichotomous (see: bivariate probit models)
11.2 Fitting Sample Selection in R
We can use functions in R to do this calculation for us. We will follow the example from the R Pubs resource, which uses data from
- Mroz, T. A. (1987) “The sensitivity of an empirical model of married women’s hours of work to economic and statistical assumptions.” Econometrica 55, 765–799.
Review the summary at the link above for information about the setup, where our outcome is married women’s wages, and the selection is labor force participation.
The selection variable is lfp
, a 0 or 1 variable indicating labor force participation.
table(Mroz87$lfp)
0 1
325 428
The sample selection issue is we do not observe wages for those out of the labor force. Thus, we can first estimate a model predicting labor force participation, and then a model predicting wages. We do so in the below code.
- Note, we need at least one variable in the selection equation that predicts selection but not wages. Here, we use
kids
for this.
$kids <- (Mroz87$kids5 + Mroz87$kids618) Mroz87
We fit the model by providing two regression formulas in the selection
function.
## 2-step estimator
<- selection(selection = lfp ~ age +
selection1 + kids + educ,
faminc outcome = wage ~ exper + age + educ + city,
data = Mroz87,
method = "2step")
summary(selection1)
--------------------------------------------
Tobit 2 model (sample selection model)
2-step Heckman / heckit estimation
753 observations (325 censored and 428 observed)
13 free parameters (df = 741)
Probit selection equation:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.377e-01 4.479e-01 0.307 0.758629
age -2.253e-02 6.898e-03 -3.266 0.001140 **
faminc 5.168e-06 4.150e-06 1.245 0.213428
kids -1.319e-01 3.768e-02 -3.500 0.000493 ***
educ 8.889e-02 2.285e-02 3.890 0.000109 ***
Outcome equation:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.45192 2.18010 -0.666 0.505627
exper 0.01747 0.02167 0.806 0.420270
age 0.01570 0.02524 0.622 0.534085
educ 0.41456 0.11376 3.644 0.000287 ***
city 0.41415 0.31925 1.297 0.194947
Multiple R-Squared:0.1261, Adjusted R-Squared:0.1158
Error terms:
Estimate Std. Error t value Pr(>|t|)
invMillsRatio -1.1666 1.5530 -0.751 0.453
sigma 3.2149 NA NA NA
rho -0.3629 NA NA NA
--------------------------------------------
If the coefficient estimate on the inverse mills ratio is non-zero, that suggests that the selection probability does influence wages. We do not necessarily have strong evidence here given the high p-value. This may vary by how to specify the model. Note also that the estimate for this coefficient is the multiplication of sigma*rho, where rho is the correlation of the errors between equations and sigma is the standard error of the residuals from the regression equation.
Maximum likelihood also allows us to estimate the equations simultaneously. We just specify ml
as the method.
## alternative using maximum likelihood
<- selection(selection = lfp ~ age +
selection2 + kids + educ,
faminc outcome = wage ~ exper + age + educ + city,
data = Mroz87,
method = "ml")
We can see what is going on under the hood by fitting the 2-step process manually. The only difference is our manual standard errors would be wrong.
## Selection equation
<- glm(lfp ~ age + faminc + kids + educ,
seleqn1 family=binomial(link="probit"), data=Mroz87)
## Calculate inverse Mills ratio by hand ##
$IMR <- dnorm(seleqn1$linear.predictors)/pnorm(seleqn1$linear.predictors)
Mroz87
## Outcome equation correcting for selection
<- lm(wage ~ exper + exper + age + educ + city+IMR , data=Mroz87,
outeqn1 subset=(lfp==1))
## Compare with the selection package results
coef(outeqn1)
(Intercept) exper age educ city IMR
-1.45197781 0.01747395 0.01569873 0.41456568 0.41415262 -1.16657579
How should we interpret these results?
- If variables are only in the outcome equation, like city and experience, we can interpret them like OLS coefficients.
- If variables appear in both equations, then we can also make an adjustment to have an estimate for the full average marginal effect, that also accounts for selection instead of just specifying the effect of the variable for those “that are selected”.
## Example
<- selection(selection = lfp ~ age +
selection3 + kids + educ,
faminc outcome = wage ~
+ age + educ + city, data = Mroz87,
exper method = "2step")
## average marginal effect:
<- selection3$coefficients[5]
beta.educ.sel <- selection3$coefficients[9]
beta.educ.out <- selection3$coefficients[11]
beta.IMR <- selection3$imrDelta
delta
<- beta.educ.out - (beta.educ.sel * beta.IMR * delta)
marginal.effect ## average marginal effect
mean(marginal.effect)
[1] 0.4757589
11.3 Heckman Example Using Survey Data
This example is from “Poverty and Divine Rewards: The Electoral Advantage of Islamist Political Parties” published in the American Journal of Political Science in 2019.
Abstract Political life in many Muslim-majority countries has been marked by the electoral dominance of Islamist parties. Recent attempts to explain why have highlighted their material and organizational factors, such as the provision of social services. In this article, we revive an older literature that emphasizes the appeal of these parties’ religious nature to voters experiencing economic hardship. Individuals suffering economic strain may vote for Islamists because they believe this to be an intrinsically virtuous act that will be met with divine rewards in the afterlife. We explore this hypothesis through a series of laboratory experiments in Tunisia. Individuals assigned to treatment conditions instilling feelings of economic strain exhibit greater support for Islamist parties, and this support is causally mediated by an expectation of divine compensation in the hereafter. The evidence suggests that the religious nature of Islamist parties may thus be an important factor in their electoral success.
We are going to replicate a small part of their analysis of an experiment:
- Experiment 2 induced economic strain by exposing participants (n = 201) to four hypothetical financial scenarios
- Half were randomly assigned to a “hard” condition, in which the four scenarios involved financial costs that were relatively high, whereas
- Half were assigned to an “easy” condition that involved substantially lower costs.
- One of the secondary dependent variables was: In Experiment 2, why they chose to vote for the party they did, giving them six options, including “Allah will be more pleased if I vote for this party than other parties.”
- For each answer option, we asked respondents for their level of agreement with the statement and
- subsequently asked them to rank each statement they agreed with in importance.
Let’s load the data and explore the variables.
<- read.csv("https://github.com/ktmccabe/teachingdata/raw/main/exp2.csv") exp2
The authors are looking to verify that pleasing Allah had something to do with Ennahda vote choice, particularly among poor voters. Let’s look at the variable votenahda
, which is a 0 or 1 outcome.
- 1 if plan to vote for Ennahda if elections held tomorrow, 0 if not
table(exp2$votenahda)
0 1
335 66
Whether a voter is poor is determined by if they fall below 7 on the variable inc
. Let’s subset our data to only examine poor voters.
<- subset(exp2, inc < 7) subdata
They want to understand if pleasing Allah was a top reason for voting for the party. This information is only available for those that agreed or strongly agreed with the statement, “Allah will be more pleased if I vote for this party than other parties.”
This information is in the variable voteAllah2
voteAllah2
: 1=strongly agree or agree, 0=otherwise
table(exp2$voteAllah2)
0 1
118 169
The ranking information is available in the variable, voteAllahrank3
:
voteAllahrank3
: 1 if voteAllahrank \(>\) 4 (top two reasons); NA if voteAllahrank = 0; 0 otherwise.
table(exp2$voteAllahrank3)
0 1
104 64
Let’s identify the sample selection issue.
voteAllahrank3
is only observed for those who strongly agreed or agreed with the statement
table(RankedTopTwo=exp2$voteAllahrank3, Agreed=exp2$voteAllah2)
Agreed
RankedTopTwo 0 1
0 0 104
1 0 64
What makes this a candidate for a Heckman sample selection model?
Try on your own, then expand.
- We are interested in estimating whether Ennahda voters are more likely than others to rank pleasing Allah among the top two reasons. Our desired outcome is \(Y_i =\)
voteAllahrank3
- \(Y_i\) only observed for those who met some criteria set by another random variable (in this case
voteAllah2
= 1). - However, we still have information on the independent variables for all respondents, regardless of whether they are a 0 or 1 on
voteAllah2
- \(Y_i\) only observed for those who met some criteria set by another random variable (in this case
Estimate the two-step process, following the authors in terms of what variables to include in each stage according to columns 2 and 3 in the table.
- Prior to fitting the model, subset the data to include only respondents who voted for
votenahda
,votenidaa
,votejabha
,voteirada
, indicated by respondents being coded as a 1 on these variables. These correspond to the party variables in columns 2 and 3.- other relevant variables are
treat
andquran
- other relevant variables are
Try on your own, then expand for the solution.
As the authors note, “The ranking is a two-step process, as respondents only get to rank factors that they agree with. To model their rankings, we therefore employ a Heckman selection model, analyzing first who agreed that pleasing Allah is important in their vote choice, and then analyzing second who ranked pleasing Allah as one of their top two factors.”
<- subset(subdata, votenahda==1 |
subdata2 ==1 | votejabha==1 |
votenidaa==1)
voteirada<- selection(selection = voteAllah2~votenahda+votejabha+treat+quran,
two outcome = voteAllahrank3~votenahda+votenidaa,
data=subdata2,
method="2step")
summary(two)
--------------------------------------------
Tobit 2 model (sample selection model)
2-step Heckman / heckit estimation
90 observations (31 censored and 59 observed)
11 free parameters (df = 80)
Probit selection equation:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.3418 0.2897 1.180 0.2415
votenahda 0.2442 0.3383 0.722 0.4725
votejabha -0.1369 0.3646 -0.375 0.7083
treat -0.4057 0.2906 -1.396 0.1665
quran 0.6495 0.2927 2.219 0.0293 *
Outcome equation:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.007066 0.248585 -0.028 0.9774
votenahda 0.312768 0.185545 1.686 0.0958 .
votenidaa 0.226522 0.161635 1.401 0.1650
Multiple R-Squared:0.0558, Adjusted R-Squared:0.0043
Error terms:
Estimate Std. Error t value Pr(>|t|)
invMillsRatio 0.3248 0.3356 0.968 0.336
sigma 0.5187 NA NA NA
rho 0.6263 NA NA NA
--------------------------------------------
How should we interpret the results in light of the researchers’ hypothesis that Ennahda voters would be more likely to rank pleasing Allah as a top reason?
Try on your own, then expand for the solution.
As the authors note describing the outcome equation, “results suggest that poor Ennahda voters were about 31% more likely to rank pleasing Allah among their top two factors (p = .096; see the SI, p. 22) than poor supporters of secular parties.”
We have to be careful here, that this is from the outcome equation and does not represent the marginal effect based on both selection and outcome processes.
11.4 Tobit Model aka Censored regression model
Dealing with outcomes that are “top-coded” or “bottom-coded” at a threshold value
Example: if income is top-coded as “above $250k”
\(Y_i\) =
\[\begin{cases} Y_i^*, \; Y_i < 250 \\ \text{ above 250 } \; Y_i \geq 250 \end{cases}\]We are interested in \(Y_i^*\): actual income, not censored income. Problem- it’s unobserved for part of the sample
- Example: want to use SAT as measure of aptitude, but scores capped between 200 and 800
- Example: want to measure support for candidate but legal maximum for campaign donations is $5000
- Example: want to measure like-dislike of candy bars, but candy bars consumed bottom-coded at 0
- Note: in classic tobit model, censoring happens at zero
For elaboration in R, see this UCLA resource and tobit()
in the AER package
11.4.1 Tobit Model Assumptions
- Assume homoskedastic and normally distributed errors
- When data are censored at zero (clumping at zero), assume same underlying stochastic process to determine
- whether the response is zero or positive
- as well as the value of a positive response
- Any variable which increases the probability of a non-zero value must also increase the mean of positive values.
- Should generally be used in cases where the dependent variable could take on negative values
An alternative model discussed in the count data section: “two-part” and hurdle model–appropriate when the 0 is a “true zero”
11.5 Truncated Models
Sample selection is determined by values of the \(Y\) variable. Do not observe x or y for truncated observations
- Example: interested in effect of education on income, but only sample people below a certain income.
- Example: studies of electoral success of newly formed political parties. Problem: likely only observe new parties whn their chance of success is likely high
- Example: studies using newspaper reports on social movements to study predictors of violence in these movements. Problem: newspapers select which movements to report on- likely those with chance of violence.
Solution? Requires information about the mechanism that leads to the incomplete (or truncated) data set