10.1 Overview of Count Data

Many of our dependent variables in social science may be considered counts:

- The number of arrests or traffic stops
- The number of bills passed
- The number of terrorist attacks
- The number of tweets
- The number of judges a president nominates per year

These variables share two features: they are discrete, and they range from 0 to some positive number.

Example: Your outcome data might look like this, where each observation \(i\) represents a count of some kind:

Y <- rpois(n=30, lambda = 2)
Y

 [1] 3 1 1 1 0 2 5 1 2 1 0 1 3 0 2 1 1 3 3 0 1 1 2 3 1 2 3 1 2 2
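
As a quick check on the two features just described, the sketch below redraws a sample like Y and confirms that every value is a non-negative whole number. The seed is an arbitrary choice added here for reproducibility, so the draws will differ from the output shown above.

```r
# Seed chosen arbitrarily so the sketch is reproducible (not part of the original example)
set.seed(123)
Y <- rpois(n = 30, lambda = 2)

# Counts are discrete: every value equals its rounded self
all(Y == round(Y))

# Counts are bounded below by 0
all(Y >= 0)

# Frequency of each observed count
table(Y)
```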

A common way one might approach modeling data of this type is to use OLS to estimate a linear model. After all, the data seem quasi-continuous! Sometimes, this might be just fine, but let’s think about some situations where this could go awry.

Oftentimes count data are heavily right-skewed and very sparse.
- For example, suppose we were interested in the number of times a social media user makes a comment on a discussion forum. It is very common for a large number of people to make close to zero posts, while a small share of users might make a larger number of posts.
- Particularly in small samples, OLS can struggle with heavily skewed data because several of the usual assumptions are likely to be violated: the error variance is unlikely to be homogeneous, the errors will not be normally distributed⁶, and the linearity assumption may well be suspect.
- When continuous data are heavily right-skewed (e.g., income often is), a common recommendation is to \(\log\)-transform the \(y\) variable before fitting a linear model. With count data, we can pursue other options. Moreover, if many of your counts are 0, this transformation is problematic anyway because \(\log(0)\) is undefined (R returns -Inf). The transformation won’t really work, as standard statistical software will often treat those values as missing.
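
To see the problem with zeros concretely, here is a minimal sketch in R. The vector of counts is made up for illustration, chosen only so that it includes some zeros.

```r
# Hypothetical counts that include zeros
y <- c(0, 0, 1, 3, 0, 7)

# log() maps each 0 to -Inf
log_y <- log(y)
log_y

# Number of observations whose logged value is not finite
sum(!is.finite(log_y))
```

In this toy vector, half of the observations have no usable logged value.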

Below is an example of this type of skew, sparsity, and clustering toward 0.
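
One way to produce data with this shape is to simulate it. The sketch below is an illustration under assumptions, not the data behind the original figure: it draws hypothetical forum comment counts from a negative binomial distribution, with the size, mean, and seed chosen only to give many zeros and a long right tail.

```r
# Simulated, hypothetical comment counts for 1,000 forum users
set.seed(42)
comments <- rnbinom(n = 1000, size = 0.2, mu = 2)

# Share of users with zero comments
mean(comments == 0)

# The mean sits above the median because of the long right tail
mean(comments)
median(comments)

# Bar plot of how often each count occurs
barplot(table(comments),
        xlab = "Number of comments",
        ylab = "Number of users")
```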

Nonsensical values?
- With OLS, there is also no guarantee that smaller estimated \(\hat y\) values from the regression line will stay non-negative even though we know that the actual count outcomes are always going to be non-negative.
- There is also no guarantee that larger \(\hat y\) values will stay within the possible upper range of \(y\) values.
- When data are heavily skewed, the regression line, which represents the “conditional mean” \(\mathbf E(Y | X = x)\) given some values of \(X\), might be a poor summary, because means are generally poor summaries of highly skewed data (e.g., picture how estimates of income at a given level of education would change if Bill Gates and Mark Zuckerberg are in your sample vs. if they are not).
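
The non-negativity problem in the first bullet can be checked directly. The sketch below is a simulation under assumptions (a hypothetical predictor x and Poisson counts whose mean grows exponentially in x; none of it comes from a real dataset). It fits OLS with lm() and inspects the smallest fitted value.

```r
# Simulated data: hypothetical predictor and Poisson counts
set.seed(2024)
n <- 500
x <- runif(n, min = 0, max = 3)
y <- rpois(n, lambda = exp(-2 + 1.5 * x))

# The observed counts are never negative
min(y)

# Fit a linear model by OLS
ols_fit <- lm(y ~ x)

# Smallest fitted value; nothing constrains it to be >= 0
min(fitted(ols_fit))

# Number of observations with a negative predicted count
sum(fitted(ols_fit) < 0)
```

In this simulation the fitted line dips below zero for small values of x, even though no observed count does.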

For more information on dealing with skewed data and non-normal errors in linear regression, see Chapter 12.1 of the Fox textbook, posted on Canvas.

⁶ Count data will always prevent the errors from being normally distributed, which can be problematic for estimates in small samples. In large samples, the uncertainty estimates will still approximate the correct values.