12.1 Week 10 Tutorial
In this section, we use CDC data on national and state-level COVID hospitalizations and daily increases in positive cases from July 2020 through the end of September 2020.
Let’s load the national and state data.
library(rio)
# Load the national and state-level data frames from GitHub
national <- import("https://github.com/ktmccabe/teachingdata/blob/main/national.RData?raw=true")
states <- import("https://github.com/ktmccabe/teachingdata/blob/main/states.RData?raw=true")
head(national)
          date hospitalizedCurrently positiveIncrease
159 2020-09-30                 31021            44909
160 2020-09-29                 30601            36766
161 2020-09-28                 29696            35376
162 2020-09-27                 29579            34990
163 2020-09-26                 29670            47268
164 2020-09-25                 29888            55237
head(states)
           date hospitalizedCurrently positiveIncrease state
8849 2020-09-30                    53              104    AK
8850 2020-09-30                   776             1147    AL
8851 2020-09-30                   484              942    AR
8853 2020-09-30                   560              323    AZ
8854 2020-09-30                  3267             3200    CA
8855 2020-09-30                   264              511    CO
During this time period, the national level of hospitalizations was declining.
library(ggplot2)
ggplot(national, aes(x=date, y=hospitalizedCurrently))+
geom_line()
However, the national data represents aggregated state-level data. It is possible that these trends might look different if we looked within each state.
ggplot(states, aes(x=date, y=hospitalizedCurrently))+
geom_line()+
facet_wrap(~state, scales = "free")
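To make the aggregation point concrete, here is a minimal sketch using a made-up data frame. The `toy` values are hypothetical, not the CDC data, and the sketch assumes the national series is simply the sum of the state series on each date, which may not match exactly how the real national file was built.

```r
# Hypothetical toy data: two states observed on two dates
toy <- data.frame(
  date = as.Date(c("2020-09-29", "2020-09-29", "2020-09-30", "2020-09-30")),
  state = c("A", "B", "A", "B"),
  hospitalizedCurrently = c(100, 900, 150, 800)
)

# Sum hospitalizations across states within each date
toy_national <- aggregate(hospitalizedCurrently ~ date, data = toy, FUN = sum)
toy_national
```

In this toy example the total falls from 1000 to 950 even though hospitalizations in state A rise from 100 to 150, so an aggregate trend need not describe every group.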
When we have data that are grouped in some way, it is important to consider how group-specific factors may influence the results.
Let’s look at the relationship between daily increases in positive cases and current hospitalizations using the state-level data. Note that we are not epidemiologists, so this is not how you would want to model the relationship in practice; we will leave that to the experts. Nonetheless, it will give us some useful visualizations. The regression line below comes from the following regression, which pools across states and dates:
\(hospitalizedCurrently_{it} = \alpha + \beta \, positiveIncrease_{it} + \epsilon_{it}\)
ggplot(states, aes(x=positiveIncrease, y=hospitalizedCurrently, color=state))+
geom_point()+
geom_smooth(aes(y=hospitalizedCurrently),method="lm", se=F, color="black")+
ylim(0, 15000)
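The black line in the plot is what `lm()` would estimate from the pooled regression. As a rough sketch, the same kind of model can be fit directly to read off the intercept and slope. The `toy` data frame below is simulated and purely hypothetical; with the real data you would pass `data = states` instead.

```r
# Hypothetical toy data standing in for the state-level file
set.seed(1)
toy <- data.frame(
  state = rep(c("A", "B"), each = 50),
  positiveIncrease = c(runif(50, 0, 100), runif(50, 500, 1000))
)
toy$hospitalizedCurrently <- ifelse(toy$state == "A", 50, 2000) +
  0.5 * toy$positiveIncrease + rnorm(100, sd = 10)

# Pooled regression: one intercept and one slope for all observations,
# ignoring which state each row comes from
pooled <- lm(hospitalizedCurrently ~ positiveIncrease, data = toy)
coef(pooled)
```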
What issues do you have with that type of analysis?
- How could we improve it?
Let’s look at the variation by state.
ggplot(states, aes(x=positiveIncrease, y=hospitalizedCurrently, color=state))+
geom_point()+
geom_smooth(method="lm", se=F)+
facet_wrap(~state, scales = "free")
One thing we could do to account for the variation across states is to add controls for state. This is called adding “fixed effects.” It allows the intercept to differ for each state in the regression, while the slope is constrained to be the same across states.
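# Fit a regression with state fixed effects (a separate intercept for each state)
fit <- lm(hospitalizedCurrently ~ positiveIncrease + as.factor(state), data=states)
# Store the fitted values so we can plot them
fit.pred <- predict(fit)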
ggplot(states, aes(x=positiveIncrease, y=hospitalizedCurrently, color=state))+
geom_point()+
geom_smooth(aes(y=hospitalizedCurrently),method="lm", se=F, color="black")+
geom_line(aes(y=fit.pred))+
ylim(0, 15000)
ggplot(states, aes(x=positiveIncrease, y=hospitalizedCurrently, color=state))+
geom_point()+
geom_line(aes(y=fit.pred))+
facet_wrap(~state, scales = "free")
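To see what the fixed effects do to the estimates, the following sketch compares a pooled fit with a fixed-effects fit on simulated data. Everything in `toy` is hypothetical: two made-up states share a true slope of 0.5 but have very different baselines, an assumption chosen only for illustration.

```r
# Hypothetical toy data: same true slope (0.5), different state baselines
set.seed(1)
toy <- data.frame(
  state = rep(c("A", "B"), each = 50),
  positiveIncrease = c(runif(50, 0, 100), runif(50, 500, 1000))
)
toy$hospitalizedCurrently <- ifelse(toy$state == "A", 50, 2000) +
  0.5 * toy$positiveIncrease + rnorm(100, sd = 10)

# Pooled model: one intercept, one slope
pooled <- lm(hospitalizedCurrently ~ positiveIncrease, data = toy)

# Fixed-effects model: a separate intercept per state, one shared slope
fe <- lm(hospitalizedCurrently ~ positiveIncrease + as.factor(state), data = toy)

# Compare the slope on positiveIncrease across the two models
c(pooled = coef(pooled)[["positiveIncrease"]],
  fixed_effects = coef(fe)[["positiveIncrease"]])
```

In this simulated setup the pooled slope absorbs the gap between the two states' baselines, while the fixed-effects slope stays close to the value used to generate the data. The real data need not behave this way.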
We could also add interactions with state, which will allow the slopes to vary.
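# Interact positiveIncrease with state so each state gets its own intercept and slope
fit <- lm(hospitalizedCurrently ~ positiveIncrease*as.factor(state), data=states)
# Store the fitted values so we can plot them
fit.pred <- predict(fit)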
ggplot(states, aes(x=positiveIncrease, y=hospitalizedCurrently, color=state))+
geom_point()+
geom_line(aes(y=fit.pred))+
facet_wrap(~state, scales = "free")
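With interactions, the slope for a given state is the baseline slope plus that state's interaction coefficient. The sketch below works through this on simulated data. The `toy` data frame, its two states, and their slopes are all hypothetical, and it uses a factor column `state` directly rather than `as.factor(state)`, which changes the coefficient names.

```r
# Hypothetical toy data: state A has slope 0.5, state B has slope 2
set.seed(2)
toy <- data.frame(
  state = factor(rep(c("A", "B"), each = 50)),
  positiveIncrease = runif(100, 0, 100)
)
true_slope <- ifelse(toy$state == "A", 0.5, 2)
toy$hospitalizedCurrently <- 50 + true_slope * toy$positiveIncrease +
  rnorm(100, sd = 5)

# Interaction model: separate intercept and slope for each state
inter <- lm(hospitalizedCurrently ~ positiveIncrease * state, data = toy)
b <- coef(inter)

# Slope for the reference state (A), and for state B
slope_A <- b[["positiveIncrease"]]
slope_B <- b[["positiveIncrease"]] + b[["positiveIncrease:stateB"]]

# Fitting state B on its own gives the same slope estimate
sep_B <- lm(hospitalizedCurrently ~ positiveIncrease,
            data = subset(toy, state == "B"))
all.equal(slope_B, coef(sep_B)[["positiveIncrease"]])
```

The two slope estimates for state B should agree up to numerical tolerance, because a model that interacts every term with state reproduces the point estimates from running a separate regression within each state.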
Any downsides to adding interactions?