12.1 Week 10 Tutorial

In this section, we will use data from the CDC on national and state-level COVID hospitalizations and daily increases in positive cases between July 2020 and the end of September 2020.

Let’s load the national and state data.

library(rio)
national <- import("https://github.com/ktmccabe/teachingdata/blob/main/national.RData?raw=true")
states <- import("https://github.com/ktmccabe/teachingdata/blob/main/states.RData?raw=true")

head(national)
          date hospitalizedCurrently positiveIncrease
159 2020-09-30                 31021            44909
160 2020-09-29                 30601            36766
161 2020-09-28                 29696            35376
162 2020-09-27                 29579            34990
163 2020-09-26                 29670            47268
164 2020-09-25                 29888            55237
head(states)
           date hospitalizedCurrently positiveIncrease state
8849 2020-09-30                    53              104    AK
8850 2020-09-30                   776             1147    AL
8851 2020-09-30                   484              942    AR
8853 2020-09-30                   560              323    AZ
8854 2020-09-30                  3267             3200    CA
8855 2020-09-30                   264              511    CO

During this time period, the national level of hospitalizations was declining.

library(ggplot2)
ggplot(national, aes(x=date, y=hospitalizedCurrently))+
  geom_line()

However, the national data are aggregated from state-level data. These trends might look different if we looked within each state.

ggplot(states, aes(x=date, y=hospitalizedCurrently))+
  geom_line()+
  facet_wrap(~state, scales = "free")
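To see why an aggregate trend can hide what is happening inside groups, here is a minimal sketch using made-up numbers. The data frame `toy` and its two states "A" and "B" are hypothetical and are not taken from the CDC data.

```r
# Toy data with made-up numbers (not the CDC data): two hypothetical states
toy <- data.frame(
  date  = rep(as.Date("2020-07-01") + 0:3, times = 2),
  state = rep(c("A", "B"), each = 4),
  hospitalizedCurrently = c(1000, 900, 800, 700,  # state A: falling
                            100, 150, 200, 250)   # state B: rising
)

# Build a "national" series by summing across states within each date
toy_national <- aggregate(hospitalizedCurrently ~ date, data = toy, FUN = sum)

# Change from the first to the last date, nationally and within each state
national_change <- diff(toy_national$hospitalizedCurrently[c(1, 4)])
state_change <- tapply(toy$hospitalizedCurrently, toy$state,
                       function(x) x[length(x)] - x[1])

national_change  # negative: the aggregate falls
state_change     # A is negative, B is positive
```

The aggregate declines because the larger state dominates the sum, even though the smaller state is rising throughout.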

When we have data that are grouped in some way, it is important to consider how group-specific factors may influence the results.

Let’s look at the relationship between positive increases in cases and hospitalizations using the state-level data. Note that we are not epidemiologists, so this is not how you would want to model the relationship in practice. We will leave that to the experts. Nonetheless, it will give us some useful visualizations. The regression line below comes from the following regression, which pools across states and dates:

\(hospitalizedCurrently_{it} = \alpha + \beta \, positiveIncrease_{it} + \epsilon_{it}\)

Here \(i\) indexes states and \(t\) indexes dates.

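ggplot(states, aes(x=positiveIncrease, y=hospitalizedCurrently, color=state))+
  geom_point()+
  # color="black" overrides the state color mapping, so a single pooled line is drawn
  geom_smooth(aes(y=hospitalizedCurrently),method="lm", se=F, color="black")+
  # note: ylim() drops any observations outside the range before the line is fit
  ylim(0, 15000)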
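The black line in the plot corresponds to an ordinary `lm()` fit that ignores state. As a minimal sketch, the pooled regression can be estimated directly. The data frame `sim`, its three states, and all of its numbers are simulated stand-ins, not the tutorial's data.

```r
set.seed(1)

# Simulated stand-in for `states`: 3 hypothetical states, 30 days each
state <- rep(c("A", "B", "C"), each = 30)
base_cases <- c(A = 200, B = 1000, C = 3000)   # assumed typical daily cases
intercepts <- c(A = 200, B = 1500, C = 4000)   # assumed baseline hospitalizations
positiveIncrease <- unname(base_cases[state]) + runif(90, 0, 500)
hospitalizedCurrently <- unname(intercepts[state]) +
  0.5 * positiveIncrease + rnorm(90, sd = 50)
sim <- data.frame(state, positiveIncrease, hospitalizedCurrently)

# Pooled regression: one intercept and one slope for all states and dates
pooled <- lm(hospitalizedCurrently ~ positiveIncrease, data = sim)
coef(pooled)  # (Intercept) plays the role of alpha, positiveIncrease of beta

# Predicted hospitalizations at a hypothetical 1000 new cases
pred_1000 <- predict(pooled, newdata = data.frame(positiveIncrease = 1000))
```

With the tutorial's data, the analogous call would be `lm(hospitalizedCurrently ~ positiveIncrease, data = states)`.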

What issues do you have with that type of analysis?

  • How could we improve it?

Let’s look at the variation by state.

ggplot(states, aes(x=positiveIncrease, y=hospitalizedCurrently, color=state))+
  geom_point()+
  geom_smooth(method="lm", se=F)+
  facet_wrap(~state, scales = "free")

One thing we could do to account for the variation across states is to add controls for state. This is called adding “fixed effects.” It allows each state to have its own intercept in the regression, while the slope on positiveIncrease is constrained to be the same for every state.

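# State fixed effects: a separate intercept for each state, one common slope
fit <- lm(hospitalizedCurrently ~ positiveIncrease + as.factor(state), data=states)
# Fitted values, one per row used in the fit (assumes no rows were dropped for missing values)
fit.pred <- predict(fit)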

ggplot(states, aes(x=positiveIncrease, y=hospitalizedCurrently, color=state))+
  geom_point()+
  geom_smooth(aes(y=hospitalizedCurrently),method="lm", se=F, color="black")+
  geom_line(aes(y=fit.pred))+
  ylim(0, 15000)

ggplot(states, aes(x=positiveIncrease, y=hospitalizedCurrently, color=state))+
  geom_point()+
  geom_line(aes(y=fit.pred))+
  facet_wrap(~state, scales = "free")
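In equation form, the fixed-effects model replaces the single intercept with a state-specific one: \(hospitalizedCurrently_{it} = \alpha_i + \beta \, positiveIncrease_{it} + \epsilon_{it}\). The sketch below, on simulated data, illustrates two things: the fixed-effects slope is the same as the slope from a regression on within-state demeaned variables, and it can differ sharply from the pooled slope. The data frame `sim` and its values are assumptions made for illustration. The simulation is deliberately built so that states with more cases also have higher baseline hospitalizations.

```r
set.seed(1)

# Simulated stand-in for `states`: 3 hypothetical states, 30 days each
state <- rep(c("A", "B", "C"), each = 30)
base_cases <- c(A = 200, B = 1000, C = 3000)   # assumed typical daily cases
intercepts <- c(A = 200, B = 1500, C = 4000)   # assumed baseline hospitalizations
positiveIncrease <- unname(base_cases[state]) + runif(90, 0, 500)
# The true within-state slope is 0.5 in every state
hospitalizedCurrently <- unname(intercepts[state]) +
  0.5 * positiveIncrease + rnorm(90, sd = 50)
sim <- data.frame(state, positiveIncrease, hospitalizedCurrently)

# Pooled slope mixes within-state and between-state variation
pooled <- lm(hospitalizedCurrently ~ positiveIncrease, data = sim)

# Fixed effects: state-specific intercepts, common slope
fe <- lm(hospitalizedCurrently ~ positiveIncrease + as.factor(state), data = sim)

# Equivalent slope: subtract each state's mean from x and y, then regress
sim$x_dm <- sim$positiveIncrease - ave(sim$positiveIncrease, sim$state)
sim$y_dm <- sim$hospitalizedCurrently - ave(sim$hospitalizedCurrently, sim$state)
within <- lm(y_dm ~ x_dm, data = sim)

c(pooled        = unname(coef(pooled)["positiveIncrease"]),
  fixed_effects = unname(coef(fe)["positiveIncrease"]),
  demeaned      = unname(coef(within)["x_dm"]))
```

The fixed-effects and demeaned slopes agree because both use only variation within states, whereas the pooled slope also absorbs the differences between states.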

We could also add interactions with state, which will allow the slopes to vary.

# Interacting positiveIncrease with state gives each state its own intercept and its own slope
fit <- lm(hospitalizedCurrently ~ positiveIncrease*as.factor(state), data=states)
fit.pred <- predict(fit)


ggplot(states, aes(x=positiveIncrease, y=hospitalizedCurrently, color=state))+
  geom_point()+
  geom_line(aes(y=fit.pred))+
  facet_wrap(~state, scales = "free")
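One way to read the interaction model is that it reproduces the line you would get from fitting a separate regression within each state. The sketch below checks this on simulated data for one hypothetical state "B". The data frame `sim`, the state labels, and the slopes are assumptions made for illustration.

```r
set.seed(1)

# Simulated stand-in for `states`: 3 hypothetical states with different slopes
state <- rep(c("A", "B", "C"), each = 30)
base_cases <- c(A = 200, B = 1000, C = 3000)   # assumed typical daily cases
intercepts <- c(A = 200, B = 1500, C = 4000)   # assumed baseline hospitalizations
slopes     <- c(A = 0.2, B = 0.5, C = 1.0)     # assumed state-specific slopes
positiveIncrease <- unname(base_cases[state]) + runif(90, 0, 500)
hospitalizedCurrently <- unname(intercepts[state]) +
  unname(slopes[state]) * positiveIncrease + rnorm(90, sd = 50)
sim <- data.frame(state, positiveIncrease, hospitalizedCurrently)

# Interaction model: intercepts and slopes both vary by state
inter <- lm(hospitalizedCurrently ~ positiveIncrease * as.factor(state), data = sim)

# Slope for state B = baseline slope (state A) + B's interaction term
b <- coef(inter)
slope_B <- unname(b["positiveIncrease"] + b["positiveIncrease:as.factor(state)B"])

# Separate regression using only state B's rows
sep_B <- lm(hospitalizedCurrently ~ positiveIncrease,
            data = subset(sim, state == "B"))

c(interaction_model = slope_B,
  separate_fit      = unname(coef(sep_B)["positiveIncrease"]))

# Number of estimated coefficients: an intercept and a slope for each state
length(coef(inter))
```

The first state alphabetically serves as the baseline, so its slope is the plain `positiveIncrease` coefficient and every other state's slope is that value plus its interaction term.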

Any downsides to adding interactions?