12.2 Panel and Hierarchical Data
\(Y_i, X_i\) now become \(Y_{it}, X_{it}\), where \(i\) is a unique index for the unit and \(t\) is a unique index for time (or the question, task, or other repeated measurement)
- For example, perhaps we observe a set of \(N\) European countries, each indexed by \(i\), over \(T = 10\) years, each indexed by \(t\).
- The dataset can be represented in “wide” or “long/stacked” format (see section 2 of the course notes); a small reshaping sketch follows this list.
- Things can get more complicated. Perhaps we observe a Member of Parliament in a year in a country: \(Y_{itj}, X_{itj}\). Now we have three indices for our variables.
- We also might not have variation over time but instead multiple levels of units: perhaps students (\(i\)) nested in schools (\(j\)).
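To make the wide vs. long distinction concrete, here is a minimal sketch in Python with pandas. The country/year/GDP columns and the numbers in them are made up for illustration; they are not the course dataset.

```python
import pandas as pd

# "Wide" format: one row per unit, one column per time period (illustrative values)
wide = pd.DataFrame({
    "country": ["France", "Germany"],
    "gdp_2018": [2.79, 3.97],
    "gdp_2019": [2.73, 3.89],
    "gdp_2020": [2.64, 3.89],
})

# Reshape to "long/stacked" format: one row per unit-time combination (i, t)
long = wide.melt(id_vars="country", var_name="year", value_name="gdp")
long["year"] = long["year"].str.replace("gdp_", "").astype(int)

print(long.sort_values(["country", "year"]))
```

Most panel estimators expect the long format, since each row then corresponds to a single \((i, t)\) pair.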
Why do we like this type of data in social science?
- We think some \(Z\) variable is related to some \(Y\) outcome.
- So we test this with data by comparing units that vary on \(Z\) to see if they also vary systematically on \(Y\).
- Our goal is to isolate \(Z\). We want to “control for” everything that differs between our units and could affect \(Y\), except for \(Z\), which varies.
- Problem: this can feel almost impossible in cross-sectional data
- Example: Bob and Suzy probably differ in a million ways, but we can only measure and account for so many covariates
- Example: France and Germany probably differ in a million ways, but we can only measure and account for so many covariates
- This makes comparing Bob vs. Suzy or France vs. Germany a somewhat suspect way to show how \(Z\) relates to \(Y\), since we cannot control for all possibly relevant factors.
Possible solution: enter repeated observations
- Idea: Bob in wave 1 is probably pretty similar to Bob in wave 2. This might be a more sensible comparison than Bob vs. Suzy.
- Idea: France in 1968 vs. France in 1970 is probably a more sensible comparison than France vs. Germany.
- So, perhaps we incorporate Bob vs. Bob and Suzy vs. Suzy; France vs. France and Germany vs. Germany comparisons instead of just making between-unit comparisons.
- Issue: we need to learn new methods to account for our grouped/repeated data structure.
Why?
Recall OLS
\[\begin{align*} Y_i = \beta_0 + \beta_1 x_i + \epsilon_i \end{align*}\]
Assumptions:
- \(E(\epsilon_i) = 0\);
- Errors uncorrelated with regressors: \(\text{Cov}(\epsilon_i, X_i) = 0\);
- Errors of different observations not correlated: \(\text{Cov}(\epsilon_i, \epsilon_j) = 0\) for \(i \neq j\);
- Constant error variance: \(V(\epsilon_i \mid X) = \sigma^2\).
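As a baseline, here is a minimal simulation sketch in Python (assuming numpy and statsmodels are available) in which these assumptions hold by construction, so OLS recovers the coefficients; the true values \(\beta_0 = 1\) and \(\beta_1 = 2\) are arbitrary choices for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000

# Cross-sectional data satisfying the assumptions: the error has mean zero,
# is independent of x, uncorrelated across observations, with constant variance
x = rng.normal(size=n)
eps = rng.normal(size=n)
y = 1.0 + 2.0 * x + eps  # true beta_0 = 1, beta_1 = 2

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(fit.params)  # estimates should be close to [1, 2]
```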
When we have grouped/longitudinal data, many of these assumptions are likely violated. For elaboration, see the video on the previous page, which discusses how unobserved factors related to particular geographical areas might influence the explanatory variable crime rate.
Let’s take a look at why.
What if we have \[\begin{align*} Y_{it} = \beta_0 + \beta_1 x_{it} + \epsilon_{it} \end{align*}\]
We could still treat this as OLS if we believe the assumptions hold. But often, we are concerned that our model actually looks like this:
\[\begin{align*} Y_{it} = \beta_0 + \beta_1x_{it}+ (c_i + \epsilon_{it}). \end{align*}\]
We think there may be unobserved characteristics \(c_i\) of our units \(i\) that are related to our explanators. If unaccounted for and left as part of the error term, they would bias our coefficients; see the simulation sketch below.
- Note: in OLS, we assume \(\text{Cov}(c_i + \epsilon_{it}, x_{it}) = 0\). Here, we believe there may be unmeasured factors that induce covariance between \(c_i\) and \(x_{it}\).
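To see the problem concretely, here is a minimal simulation sketch (again with made-up numbers, assuming numpy and statsmodels): the unit effect \(c_i\) is built to be correlated with \(x_{it}\), and pooled OLS that leaves \(c_i\) in the error term overstates the slope.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_units, n_periods = 200, 10

# Unobserved, time-constant unit characteristic c_i
c = rng.normal(size=n_units)

# x_it is correlated with c_i: units with larger c_i also tend to have larger x_it
x = c[:, None] + rng.normal(size=(n_units, n_periods))
eps = rng.normal(size=(n_units, n_periods))
y = 1.0 + 2.0 * x + c[:, None] + eps  # true slope = 2

# Pooled OLS treats (c_i + eps_it) as the error and ignores its covariance with x_it
pooled = sm.OLS(y.ravel(), sm.add_constant(x.ravel())).fit()
print(pooled.params)  # slope estimate is noticeably above 2
```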
One solution: fixed effects, which remove time-constant unobserved characteristics of our units; a sketch of the within (demeaning) transformation follows.
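Here is a minimal sketch of the within (demeaning) version of fixed effects on the same kind of simulated data: subtracting each unit's time-mean from \(y_{it}\) and \(x_{it}\) wipes out the time-constant \(c_i\), and the slope estimate returns to roughly its true value. The setup is illustrative, not a full fixed-effects implementation (for example, the standard errors would need a degrees-of-freedom correction).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n_units, n_periods = 200, 10

# Unobserved, time-constant unit effect c_i, correlated with x_it by construction
c = rng.normal(size=n_units)
x = c[:, None] + rng.normal(size=(n_units, n_periods))
y = 1.0 + 2.0 * x + c[:, None] + rng.normal(size=(n_units, n_periods))  # true slope = 2

# Within transformation: subtract each unit's time-mean, removing anything
# constant within a unit (including c_i and the overall intercept)
x_within = x - x.mean(axis=1, keepdims=True)
y_within = y - y.mean(axis=1, keepdims=True)

fe = sm.OLS(y_within.ravel(), x_within.reshape(-1, 1)).fit()
print(fe.params[0])  # close to the true slope of 2
```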