11.2 Fitting Sample Selection in R

We can use functions in R to do this calculation for us. We will follow the example from the R Pubs resource, which uses data from

  • Mroz, T. A. (1987) “The sensitivity of an empirical model of married women’s hours of work to economic and statistical assumptions.” Econometrica 55, 765–799.

Review the summary at the link above for information about the setup, where our outcome is married women’s wages, and the selection is labor force participation.

## install.packages("sampleSelection")
library(sampleSelection)

data("Mroz87")

The selection variable is lfp, a 0 or 1 variable indicating labor force participation.

table(Mroz87$lfp)

  0   1 
325 428 

The sample selection issue is we do not observe wages for those out of the labor force. Thus, we can first estimate a model predicting labor force participation, and then a model predicting wages. We do so in the below code.

  • Note, we need at least one variable in the selection equation that predicts selection but not wages. Here, we use kids for this.
Mroz87$kids <- (Mroz87$kids5 + Mroz87$kids618)

We fit the model by providing two regression formulas in the selection function.

## 2-step estimator
selection1 <- selection(selection = lfp ~ age  +
                          faminc + kids + educ,
                        outcome = wage ~ exper + age + educ + city, 
                        data = Mroz87,
                        method = "2step")
summary(selection1)
--------------------------------------------
Tobit 2 model (sample selection model)
2-step Heckman / heckit estimation
753 observations (325 censored and 428 observed)
13 free parameters (df = 741)
Probit selection equation:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.377e-01  4.479e-01   0.307 0.758629    
age         -2.253e-02  6.898e-03  -3.266 0.001140 ** 
faminc       5.168e-06  4.150e-06   1.245 0.213428    
kids        -1.319e-01  3.768e-02  -3.500 0.000493 ***
educ         8.889e-02  2.285e-02   3.890 0.000109 ***
Outcome equation:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.45192    2.18010  -0.666 0.505627    
exper        0.01747    0.02167   0.806 0.420270    
age          0.01570    0.02524   0.622 0.534085    
educ         0.41456    0.11376   3.644 0.000287 ***
city         0.41415    0.31925   1.297 0.194947    
Multiple R-Squared:0.1261,  Adjusted R-Squared:0.1158
   Error terms:
              Estimate Std. Error t value Pr(>|t|)
invMillsRatio  -1.1666     1.5530  -0.751    0.453
sigma           3.2149         NA      NA       NA
rho            -0.3629         NA      NA       NA
--------------------------------------------

If the coefficient estimate on the inverse mills ratio is non-zero, that suggests that the selection probability does influence wages. We do not necessarily have strong evidence here given the high p-value. This may vary by how to specify the model. Note also that the estimate for this coefficient is the multiplication of sigma*rho, where rho is the correlation of the errors between equations and sigma is the standard error of the residuals from the regression equation.

Maximum likelihood also allows us to estimate the equations simultaneously. We just specify ml as the method.

## alternative using maximum likelihood
selection2 <- selection(selection = lfp ~ age +
                          faminc + kids + educ,
                        outcome = wage ~ exper + age + educ + city, 
                        data = Mroz87,
                        method = "ml")

We can see what is going on under the hood by fitting the 2-step process manually. The only difference is our manual standard errors would be wrong.

## Selection equation
seleqn1 <- glm(lfp ~ age  + faminc + kids + educ,
               family=binomial(link="probit"), data=Mroz87)

## Calculate inverse Mills ratio by hand ##
Mroz87$IMR <- dnorm(seleqn1$linear.predictors)/pnorm(seleqn1$linear.predictors)

## Outcome equation correcting for selection 
outeqn1 <- lm(wage ~ exper + exper + age + educ + city+IMR , data=Mroz87,
              subset=(lfp==1))

## Compare with the selection package results
coef(outeqn1)
(Intercept)       exper         age        educ        city         IMR 
-1.45197781  0.01747395  0.01569873  0.41456568  0.41415262 -1.16657579 

How should we interpret these results?

  • If variables are only in the outcome equation, like city and experience, we can interpret them like OLS coefficients.
  • If variables appear in both equations, then we can also make an adjustment to have an estimate for the full average marginal effect, that also accounts for selection instead of just specifying the effect of the variable for those “that are selected”.
## Example
selection3 <- selection(selection = lfp ~ age +
                          faminc + kids + educ,
                        outcome = wage ~ 
                          exper + age + educ + city, data = Mroz87,
                        method = "2step")
## average marginal effect:
beta.educ.sel <- selection3$coefficients[5]
beta.educ.out <- selection3$coefficients[9]
beta.IMR <- selection3$coefficients[11]
delta <- selection3$imrDelta

marginal.effect <- beta.educ.out - (beta.educ.sel * beta.IMR * delta)
## average marginal effect
mean(marginal.effect)
[1] 0.4757589