10.8 Additional Considerations
10.8.1 Offset
We said that counts are often considered rates per interval. Sometimes when we have a count, we either know the greatest possible value an observation could take, or we think the count will be influenced by a particular variable that varies by observation. This is called the exposure. For example, we might look at the number of traffic accidents per year relative to population size.
In contrast, in the earlier example, counts of executive orders were simply measured over a span of one year.
We are going to use an example with an exposure from Gelman and Hill's study of stop-and-frisk as a policy. This example is based on a tutorial by Clay Ford.
The data include an outcome describing the number of stops in a given area. The regression model looks at the relationship between race/ethnicity (factor(eth)) and the number of stops made in a precinct.
Run the few lines below to load and prepare the data according to Gelman and Hill’s instructions:
## Stop and frisk. Does race/ethnicity influence number of stops?
## Prepare noisy data
dat <- read.table("http://www.stat.columbia.edu/~gelman/arm/examples/police/frisk_with_noise.dat",
                  header=TRUE, skip=6)
stops <- aggregate(cbind(stops, past.arrests) ~ eth + precinct, data=dat, sum)
The unit of analysis after we run this is the number of stops in a given precinct for a particular racial/ethnic group.
head(stops)
eth precinct stops past.arrests
1 1 1 202 980
2 2 1 102 295
3 3 1 81 381
4 1 2 132 753
5 2 2 144 557
6 3 2 71 431
It is possible that the number of stops is influenced by the number of past arrests in a particular unit of analysis. We might want to model the count per number of past arrests for an ethnic group in a precinct, instead of just the count per ethnic group in a precinct. Past arrests will be our exposure, and \(\log\)(past.arrests) will be our offset.
When we have an offset, our model changes to:
- \(\frac{\mathbf E(Y_i | X)}{N_i} = \exp(\mathbf x_i'\beta)\) or alternatively
- \(\mathbf E(Y_i | X) = \exp(\mathbf x_i'\beta + \log N_i)\).
- The exposure enters the right side of the equation as what is referred to as an "offset": the \(\log\) of the exposure.
- We will not get a regression coefficient for the offset because it is fixed to 1. However, we will still need to incorporate the offset when we estimate our outcomes.
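To see concretely that the offset's coefficient is fixed to 1, here is a minimal sketch with simulated data (the names N, x, and y are illustrative, not part of the stop-and-frisk data). If we instead enter \(\log N\) as an ordinary covariate, the model estimates a free coefficient for it, which in this simulation should land near 1:

```r
## Simulated exposure example (illustrative names, not the frisk data)
set.seed(1)
N <- sample(100:1000, 200, replace = TRUE)       # exposure
x <- rnorm(200)
y <- rpois(200, lambda = N * exp(-1 + 0.5 * x))  # counts proportional to N

fit.offset <- glm(y ~ x, family = poisson, offset = log(N))
fit.free   <- glm(y ~ x + log(N), family = poisson)

coef(fit.offset)          # no coefficient reported for the offset
coef(fit.free)["log(N)"]  # freely estimated; close to 1 in this simulation
```

The offset version reports only an intercept and a slope for x, while the free version estimates a third coefficient.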
Let's use a quasipoisson model this time. We enter the offset through the explicit offset argument. (Otherwise it would be treated like any other regression variable, and its coefficient would not be fixed to 1.)
## No natural limit for stops BUT might be influenced by past arrests
sf.1 <- glm(stops ~ factor(eth),
            data = stops,
            family = "quasipoisson",
            offset = log(past.arrests))
summary(sf.1)
Call:
glm(formula = stops ~ factor(eth), family = "quasipoisson", data = stops,
offset = log(past.arrests))
Deviance Residuals:
Min 1Q Median 3Q Max
-47.327 -7.740 -0.182 10.241 39.140
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.58809 0.05646 -10.416 <2e-16 ***
factor(eth)2 0.07021 0.09042 0.777 0.438
factor(eth)3 -0.16158 0.12767 -1.266 0.207
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasipoisson family taken to be 222.5586)
Null deviance: 46120 on 224 degrees of freedom
Residual deviance: 45437 on 222 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 5
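Because the offset puts the model on the scale of stops per past arrest, exponentiating the coefficients gives rates and rate ratios. Using the estimates printed above:

```r
## Rate interpretation of the printed estimates (stops per past arrest)
exp(-0.58809)  # baseline (eth 1) stop rate per past arrest, about 0.56
exp(0.07021)   # eth 2 stop rate relative to eth 1, about 1.07
exp(-0.16158)  # eth 3 stop rate relative to eth 1, about 0.85
```

For example, the model estimates that group 2 is stopped at about 1.07 times the per-past-arrest rate of group 1, though neither contrast is statistically significant here.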
When we calculate our quantities of interest, predict
will automatically incorporate the offset. When we calculate them manually, we need to explicitly enter the offset.
## Predicted counts with offsets exp(XB + log(N))
sf.count <- predict(sf.1, type = "response")
X <- model.matrix(sf.1)
B <- coef(sf.1)
sf.count.man <- exp(X %*% B + log(stops$past.arrests))
Note that \(e^{X\hat\beta + \log N} = e^{X\hat\beta} \times e^{\log N} = N \times e^{X\hat\beta}\). This means we could equivalently write:
sf.count.man2 <- exp(X %*% B)*stops$past.arrests
cbind(sf.count, sf.count.man, sf.count.man2)[1:6,]
sf.count
1 544.2816 544.2816 544.2816
2 175.7562 175.7562 175.7562
3 180.0316 180.0316 180.0316
4 418.2082 418.2082 418.2082
5 331.8515 331.8515 331.8515
6 203.6578 203.6578 203.6578
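As an aside, the dispersion parameter in the summary above (222.56, far above the value of 1 assumed by a plain Poisson model) is just the sum of squared Pearson residuals divided by the residual degrees of freedom. A sketch of that calculation on simulated overdispersed counts (illustrative names, not the stop-and-frisk data):

```r
## The quasipoisson dispersion parameter by hand:
## sum of squared Pearson residuals / residual degrees of freedom
set.seed(2)
N <- sample(100:1000, 200, replace = TRUE)
x <- rnorm(200)
y <- rnbinom(200, mu = N * exp(-1 + 0.5 * x), size = 2)  # overdispersed counts

fit <- glm(y ~ x, family = quasipoisson, offset = log(N))
disp.man <- sum(residuals(fit, type = "pearson")^2) / fit$df.residual
all.equal(disp.man, summary(fit)$dispersion)
```

The manual calculation matches the dispersion reported by summary(), and a value well above 1 is what motivates the quasipoisson (rather than Poisson) standard errors.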