11.2 Fitting Sample Selection in R
We can use functions in R to do this calculation for us. We will follow the example from the R Pubs resource, which uses data from
- Mroz, T. A. (1987) “The sensitivity of an empirical model of married women’s hours of work to economic and statistical assumptions.” Econometrica 55, 765–799.
Review the summary at the link above for information about the setup, where our outcome is married women’s wages, and the selection is labor force participation.
## install.packages("sampleSelection")
library(sampleSelection)
data("Mroz87")
The selection variable is lfp
, a 0 or 1 variable indicating labor force participation.
table(Mroz87$lfp)
0 1
325 428
The sample selection issue is we do not observe wages for those out of the labor force. Thus, we can first estimate a model predicting labor force participation, and then a model predicting wages. We do so in the below code.
- Note, we need at least one variable in the selection equation that predicts selection but not wages. Here, we use
kids
for this.
$kids <- (Mroz87$kids5 + Mroz87$kids618) Mroz87
We fit the model by providing two regression formulas in the selection
function.
## 2-step estimator
<- selection(selection = lfp ~ age +
selection1 + kids + educ,
faminc outcome = wage ~ exper + age + educ + city,
data = Mroz87,
method = "2step")
summary(selection1)
--------------------------------------------
Tobit 2 model (sample selection model)
2-step Heckman / heckit estimation
753 observations (325 censored and 428 observed)
13 free parameters (df = 741)
Probit selection equation:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.377e-01 4.479e-01 0.307 0.758629
age -2.253e-02 6.898e-03 -3.266 0.001140 **
faminc 5.168e-06 4.150e-06 1.245 0.213428
kids -1.319e-01 3.768e-02 -3.500 0.000493 ***
educ 8.889e-02 2.285e-02 3.890 0.000109 ***
Outcome equation:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.45192 2.18010 -0.666 0.505627
exper 0.01747 0.02167 0.806 0.420270
age 0.01570 0.02524 0.622 0.534085
educ 0.41456 0.11376 3.644 0.000287 ***
city 0.41415 0.31925 1.297 0.194947
Multiple R-Squared:0.1261, Adjusted R-Squared:0.1158
Error terms:
Estimate Std. Error t value Pr(>|t|)
invMillsRatio -1.1666 1.5530 -0.751 0.453
sigma 3.2149 NA NA NA
rho -0.3629 NA NA NA
--------------------------------------------
If the coefficient estimate on the inverse mills ratio is non-zero, that suggests that the selection probability does influence wages. We do not necessarily have strong evidence here given the high p-value. This may vary by how to specify the model. Note also that the estimate for this coefficient is the multiplication of sigma*rho, where rho is the correlation of the errors between equations and sigma is the standard error of the residuals from the regression equation.
Maximum likelihood also allows us to estimate the equations simultaneously. We just specify ml
as the method.
## alternative using maximum likelihood
<- selection(selection = lfp ~ age +
selection2 + kids + educ,
faminc outcome = wage ~ exper + age + educ + city,
data = Mroz87,
method = "ml")
We can see what is going on under the hood by fitting the 2-step process manually. The only difference is our manual standard errors would be wrong.
## Selection equation
<- glm(lfp ~ age + faminc + kids + educ,
seleqn1 family=binomial(link="probit"), data=Mroz87)
## Calculate inverse Mills ratio by hand ##
$IMR <- dnorm(seleqn1$linear.predictors)/pnorm(seleqn1$linear.predictors)
Mroz87
## Outcome equation correcting for selection
<- lm(wage ~ exper + exper + age + educ + city+IMR , data=Mroz87,
outeqn1 subset=(lfp==1))
## Compare with the selection package results
coef(outeqn1)
(Intercept) exper age educ city IMR
-1.45197781 0.01747395 0.01569873 0.41456568 0.41415262 -1.16657579
How should we interpret these results?
- If variables are only in the outcome equation, like city and experience, we can interpret them like OLS coefficients.
- If variables appear in both equations, then we can also make an adjustment to have an estimate for the full average marginal effect, that also accounts for selection instead of just specifying the effect of the variable for those “that are selected”.
## Example
<- selection(selection = lfp ~ age +
selection3 + kids + educ,
faminc outcome = wage ~
+ age + educ + city, data = Mroz87,
exper method = "2step")
## average marginal effect:
<- selection3$coefficients[5]
beta.educ.sel <- selection3$coefficients[9]
beta.educ.out <- selection3$coefficients[11]
beta.IMR <- selection3$imrDelta
delta
<- beta.educ.out - (beta.educ.sel * beta.IMR * delta)
marginal.effect ## average marginal effect
mean(marginal.effect)
[1] 0.4757589