10.9 How to think about Zero Counts
Sometimes our outcome data contain an excessive number of zeroes. For example, perhaps many people never post on social media at all, while a smaller number do post, and those who post may do so any positive number of times.
For data like these, we might think there are two decisions:
- Whether to post at all. This sounds like a binary model.
- How many times to post. This sounds more like a Poisson or negative binomial model.
A Poisson or negative binomial distribution will not necessarily assign high probability to zero: under a Poisson with mean \(\lambda\), for instance, \(Pr(Y_i = 0) = e^{-\lambda}\), which is small unless the mean count is itself small. For that reason, we might want a modeling strategy that accounts for excess zeroes. We will discuss two: hurdle models and zero-inflated Poisson/negative binomial models.
This video provides the overview.
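To make the point about excess zeroes concrete, here is a minimal sketch in plain Python (standard library only). The `posts` data are made up for illustration and the helper `poisson_pmf` is a hypothetical name, not part of any package. The sketch compares the observed share of zeroes with the share a Poisson distribution with the same mean would predict.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability of exactly k events when the mean is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Hypothetical counts of posts per person: most people never post.
posts = [0, 0, 0, 0, 0, 0, 0, 2, 5, 9, 14, 0, 0, 3, 0, 0, 7, 0, 0, 0]

mean_posts = sum(posts) / len(posts)
observed_zero_share = posts.count(0) / len(posts)

# A Poisson with the same mean predicts Pr(Y = 0) = exp(-mean).
poisson_zero_share = poisson_pmf(0, mean_posts)

print(f"mean number of posts      = {mean_posts:.2f}")
print(f"observed share of zeroes  = {observed_zero_share:.2f}")
print(f"Poisson-predicted zeroes  = {poisson_zero_share:.2f}")
```

With these invented numbers, 70% of the observations are zero, while a Poisson with the same mean of 2 would predict only about 14% zeroes. A gap of this kind is what motivates the two strategies below.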
10.9.1 Hurdle Models
Hurdle models may be useful when there are possibly sequential steps in achieving a positive count. The above example could motivate a hurdle model. First, someone decides whether to post, and then, if they do, they may post any positive number of times (\(>0\)).
- A binary model governs the probability of not posting (\(Pr(Y_i = 0) = \pi\)).
- A count model restricted to \(>0\) (“zero-truncated”) then describes the number of posts.
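The two parts of a hurdle model can be combined into a single probability function. The sketch below is plain Python (standard library only) and assumes a Poisson for the count part. The function names and the values of `pi_zero` and `lam` are hypothetical, chosen only for illustration. It is not a fitting routine, just the arithmetic behind the model.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability of exactly k events when the mean is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def hurdle_poisson_pmf(k: int, pi_zero: float, lam: float) -> float:
    """Pr(Y = k) under a hurdle model with a zero-truncated Poisson.

    pi_zero: probability of a zero (the hurdle is not cleared).
    lam:     rate of the Poisson that is truncated at zero.
    """
    if k == 0:
        # Every zero comes from the binary part of the model.
        return pi_zero
    # Zero-truncated Poisson: rescale by Pr(count > 0) = 1 - exp(-lam).
    truncated = poisson_pmf(k, lam) / (1.0 - math.exp(-lam))
    return (1.0 - pi_zero) * truncated

# Hypothetical values: 70% never post; posters follow a truncated Poisson(3).
pi_zero, lam = 0.7, 3.0
probs = [hurdle_poisson_pmf(k, pi_zero, lam) for k in range(60)]

print(f"Pr(Y = 0) = {probs[0]:.2f}")
print(f"Pr(Y = 1) = {probs[1]:.4f}")
print(f"total probability over 0..59 = {sum(probs):.6f}")
```

Note that in this setup the count part can never produce a zero, so the share of zeroes is determined entirely by the binary part.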
This post from the University of Virginia explains how to fit hurdle models in R.
10.9.2 Zero-Inflated Poisson/Negative Binomial
When you have excess zeroes, the intuitively named zero-inflated Poisson or negative binomial model may also be appropriate. These are “mixture models” because they mix two distributions: a Bernoulli and a Poisson or negative binomial. Here we think that there are two types of zeroes in the data.
- This is only appropriate to the extent that some observations are truly ineligible for a positive count, that is, they have zero probability of having a count \(>\) 0.
- For example, the UCLA R tutorial linked to below studies the number of fish a particular camping group caught at the park. Some people might not have gone fishing at all! This would be a case where some of the zeroes may reflect a separate process (the decision to fish).
- However, even among those who decide to go fishing, some people may still catch zero fish. Just like in a typical Poisson or negative binomial process, it is still possible to have a 0 count.
- Here, we just think there may be two processes explaining the zeroes, and a standard count model alone does not help explain that first process.
- The model therefore has two parts: a logistic regression model for whether an observation is one of the ineligible, “always zero” cases, and a count model for everyone else.
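The two sources of zeroes described above can be written out directly. The sketch below is plain Python (standard library only) and assumes a Poisson for the count part. The function names and the values of `p_always_zero` and `lam` are hypothetical, chosen to echo the fishing example. It illustrates the mixture arithmetic only, not how the model is estimated.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability of exactly k events when the mean is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def zip_pmf(k: int, p_always_zero: float, lam: float) -> float:
    """Pr(Y = k) under a zero-inflated Poisson model.

    p_always_zero: probability of being ineligible for a positive count
                   (for example, a group that did not go fishing).
    lam:           Poisson mean among the eligible observations.
    """
    count_part = (1.0 - p_always_zero) * poisson_pmf(k, lam)
    if k == 0:
        # Zeroes come from both the ineligible group and the count process.
        return p_always_zero + count_part
    return count_part

# Hypothetical values: 40% of groups never fish; the rest catch Poisson(1.5).
p_always_zero, lam = 0.4, 1.5

zeroes_from_not_fishing = p_always_zero
zeroes_from_bad_luck = (1.0 - p_always_zero) * math.exp(-lam)
total_zero = zip_pmf(0, p_always_zero, lam)
total_prob = sum(zip_pmf(k, p_always_zero, lam) for k in range(60))

print(f"zeroes from not fishing   = {zeroes_from_not_fishing:.3f}")
print(f"zeroes from catching none = {zeroes_from_bad_luck:.3f}")
print(f"total Pr(Y = 0)           = {total_zero:.3f}")
```

Unlike the hurdle setup, the count part here can still generate zeroes, so the overall share of zeroes is the sum of the two sources.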
These tutorials from UCLA here and here describe how one would fit these models in R.
I also recommend reading this blog post from Paul Allison on what to consider when choosing between count models. He argues that it often makes just as much sense, or even more, to stick with the overdispersed Poisson or negative binomial model unless you have a good reason to believe that some people have zero probability of having a positive count.