10.8 Additional Considerations
10.8.1 Offset
We said that counts are often considered rates per interval. Sometimes when we have a count, we either know the greatest possible value an observation could take, or we think the count will be influenced by a particular variable that varies by observation. This is called the exposure. For example, we might look at the number of traffic accidents per year relative to population size.
In contrast, in the earlier example, counts of executive orders were simply measured over a span of one year.
We are going to use an example with an exposure from Gelman and Hill's study of stop-and-frisk as a policy. This example is based on a tutorial by Clay Ford.
The data include an outcome describing the number of stops in a given area. The regression model looks at the relationship between race/ethnicity (factor(eth)) and the number of stops made in a precinct.
Run the few lines below to load and prepare the data according to Gelman and Hill’s instructions:
## Stop and frisk. Does race/ethnicity influence number of stops?
## Prepare noisy data
dat <- read.table("http://www.stat.columbia.edu/~gelman/arm/examples/police/frisk_with_noise.dat",
                  header=TRUE, skip=6)
stops <- aggregate(cbind(stops, past.arrests) ~ eth + precinct, data=dat, sum)
The unit of analysis after we run this is the number of stops in a given precinct for a particular racial/ethnic group.
head(stops)
eth precinct stops past.arrests
1 1 1 202 980
2 2 1 102 295
3 3 1 81 381
4 1 2 132 753
5 2 2 144 557
6 3 2 71 431
It is possible that the number of stops is influenced by the number of past arrests in a particular unit of analysis. We might want to model the count per number of past arrests for an ethnic group in a precinct, instead of just the count per ethnic group in a precinct. Past arrests will be our exposure, and \(\log\)(past.arrests) will be our offset.
When we have an offset, our model changes to:
- \(\frac{\mathbf E(Y_i | X)}{N_i} = \exp(\mathbf x_i'\beta)\) or alternatively
- \(\mathbf E(Y_i | X) = \exp(\mathbf x_i'\beta + \log N_i)\).
- The exposure enters the right side of the equation as what is referred to as an "offset": the \(\log\) of the exposure.
- We will not get a regression coefficient for the offset because it is fixed to 1. However, we will still need to incorporate the offset when we estimate our outcomes.
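To see concretely that the offset's coefficient is fixed to 1, here is a minimal sketch with simulated data (the names N, x, and y are illustrative, not part of the stop-and-frisk data). If we instead enter \(\log N\) as an ordinary covariate, the model estimates a free coefficient for it, which in this simulation should land near 1:

```r
## Simulated exposure example (illustrative names, not the frisk data)
set.seed(1)
N <- sample(100:1000, 200, replace = TRUE)       # exposure
x <- rnorm(200)
y <- rpois(200, lambda = N * exp(-1 + 0.5 * x))  # counts proportional to N

fit.offset <- glm(y ~ x, family = poisson, offset = log(N))
fit.free   <- glm(y ~ x + log(N), family = poisson)

coef(fit.offset)          # no coefficient reported for the offset
coef(fit.free)["log(N)"]  # freely estimated; close to 1 in this simulation
```

The offset version reports only an intercept and a slope for x, while the free version estimates a third coefficient.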
Let's use a quasipoisson model this time. We enter the offset through the explicit offset argument. (Otherwise it would be treated like any other regression variable, and its coefficient would not be fixed to 1.)
## No natural limit for stops BUT might be influenced by past arrests
sf.1 <- glm(stops ~ factor(eth),
            data = stops,
            family = "quasipoisson",
            offset = log(past.arrests))
summary(sf.1)
Call:
glm(formula = stops ~ factor(eth), family = "quasipoisson", data = stops,
offset = log(past.arrests))
Deviance Residuals:
Min 1Q Median 3Q Max
-47.327 -7.740 -0.182 10.241 39.140
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.58809 0.05646 -10.416 <2e-16 ***
factor(eth)2 0.07021 0.09042 0.777 0.438
factor(eth)3 -0.16158 0.12767 -1.266 0.207
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasipoisson family taken to be 222.5586)
Null deviance: 46120 on 224 degrees of freedom
Residual deviance: 45437 on 222 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 5
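Because the offset puts the model on the scale of stops per past arrest, exponentiating the coefficients gives rates and rate ratios. Using the estimates printed above:

```r
## Rate interpretation of the printed estimates (stops per past arrest)
exp(-0.58809)  # baseline (eth 1) stop rate per past arrest, about 0.56
exp(0.07021)   # eth 2 stop rate relative to eth 1, about 1.07
exp(-0.16158)  # eth 3 stop rate relative to eth 1, about 0.85
```

For example, the model estimates that group 2 is stopped at about 1.07 times the per-past-arrest rate of group 1, though neither contrast is statistically significant here.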
When we calculate our quantities of interest, predict
will automatically incorporate the offset. When we calculate them manually, we need to explicitly enter the offset.
## Predicted counts with offsets exp(XB + log(N))
sf.count <- predict(sf.1, type = "response")
X <- model.matrix(sf.1)
B <- coef(sf.1)
sf.count.man <- exp(X %*% B + log(stops$past.arrests))
Note that \(e^{X\hat\beta + \log N} = e^{X\hat\beta} \times e^{\log N} = N \times e^{X\hat\beta}\). This means we could equivalently write:
sf.count.man2 <- exp(X %*% B)*stops$past.arrests
cbind(sf.count, sf.count.man, sf.count.man2)[1:6,]
sf.count
1 544.2816 544.2816 544.2816
2 175.7562 175.7562 175.7562
3 180.0316 180.0316 180.0316
4 418.2082 418.2082 418.2082
5 331.8515 331.8515 331.8515
6 203.6578 203.6578 203.6578
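As an aside, the dispersion parameter in the summary above (222.56, far above the value of 1 assumed by a plain Poisson model) is just the sum of squared Pearson residuals divided by the residual degrees of freedom. A sketch of that calculation on simulated overdispersed counts (illustrative names, not the stop-and-frisk data):

```r
## The quasipoisson dispersion parameter by hand:
## sum of squared Pearson residuals / residual degrees of freedom
set.seed(2)
N <- sample(100:1000, 200, replace = TRUE)
x <- rnorm(200)
y <- rnbinom(200, mu = N * exp(-1 + 0.5 * x), size = 2)  # overdispersed counts

fit <- glm(y ~ x, family = quasipoisson, offset = log(N))
disp.man <- sum(residuals(fit, type = "pearson")^2) / fit$df.residual
all.equal(disp.man, summary(fit)$dispersion)
```

The manual calculation matches the dispersion reported by summary(), and a value well above 1 is what motivates the quasipoisson (rather than Poisson) standard errors.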