10.7 Week 8 Tutorial
For this week’s tutorial, we will use data from the article, “Less than you think: Prevalence and predictors of fake news dissemination on Facebook,” published in Science Advances by Andrew Guess, Jonathan Nagler, and Joshua Tucker, available here.
Abstract. So-called “fake news” has renewed concerns about the prevalence and effects of misinformation in political campaigns. Given the potential for widespread dissemination of this material, we examine the individual-level characteristics associated with sharing false articles during the 2016 U.S. presidential campaign. To do so, we uniquely link an original survey with respondents’ sharing activity as recorded in Facebook profile data. First and foremost, we find that sharing this content was a relatively rare activity. Conservatives were more likely to share articles from fake news domains, which in 2016 were largely pro-Trump in orientation, than liberals or moderates. We also find a strong age effect, which persists after controlling for partisanship and ideology: On average, users over 65 shared nearly seven times as many articles from fake news domains as the youngest age group.
The authors look at the demographic predictors of disseminating fake news using survey data and behavioral data on respondents’ Facebook sharing history.
The key outcome variable is `num_fake_shares`, representing the number of articles an individual shared from a list of fake news domains created by Buzzfeed.
- Key independent variables include `party`, `ideology` (factor), `age` (factor), `female`, `black`, `educ`, `faminc`, and `num_posts`.
Let’s load the data and look at the outcome.
library(rio)
fake <- import("https://github.com/ktmccabe/teachingdata/blob/main/sciencerep.RData?raw=true")
table(fake$num_fake_shares)
0 1 2 3 4 5 6 7 8 9 10 12 13 25 50
1090 63 12 8 5 1 1 2 1 2 2 1 1 1 1
It looks like our data are a count. Let’s try to visualize this distribution in a histogram. How would you describe this distribution?
library(ggplot2)
ggplot(fake, aes(num_fake_shares))+
geom_histogram(bins=50)+
theme_minimal()+
ggtitle("Distribution of Fake News Shares") +
ylab("Number of respondents") +
xlab("Number of fake news stories shared")
To filter the histogram by party, as in Fig. 1 of the paper:
library(ggplot2)
fakesub <- subset(fake, party %in% c("Democrat", "Republican", "Independent"))
ggplot(fakesub, aes(num_fake_shares, fill=party))+
geom_histogram(bins=50)+
theme_minimal()+
ggtitle("Distribution of Fake News Shares") +
ylab("Number of respondents") +
scale_fill_manual("", values = c("blue", "red", "darkgray")) +
xlab("Number of fake news stories shared") +
theme(legend.position = "bottom", legend.title = element_blank())
Warning: Removed 1499 rows containing non-finite values (stat_bin).
Let’s say we were interested in exploring the relationship between age and the number of fake news posts. It can always be useful to conduct more descriptive or exploratory analysis before hitting the heavy machinery. For example, the authors look at the average number of fake news posts shared by age group in Figure 2. We can do the same. Ours will look a little different because it’s unweighted:
## Base R version
meansbyage <- tapply(fake$num_fake_shares, fake$age, mean, na.rm=T)
barplot(meansbyage,
main = "Age and Fake News Shares",
ylab= "Mean number of fake news stories shared",
xlab="Age group",
ylim=c(0,.6),
col="black")
## ggplot version
library(tidyverse)
fake %>%
  filter(!is.na(age)) %>%
group_by(age) %>%
summarise(mean_fakes = mean(num_fake_shares, na.rm = TRUE)) %>%
ggplot(aes(x=age, y=mean_fakes)) +
geom_bar(stat = "identity") +
ggtitle("Age and Fake News Shares") +
ylab("Mean number of fake news stories shared") +
xlab("Age group") +
ylim(0, .6)+
theme_minimal()
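Because our figure is unweighted, a sketch of a weighted version (using the `weight_svypop_w3` survey weights that appear in the regressions below; the column name `mean_fakes_wtd` is just illustrative) could look like this:
## Weighted mean of fake news shares by age group (sketch)
fake %>%
  filter(!is.na(age), !is.na(num_fake_shares), !is.na(weight_svypop_w3)) %>%
  group_by(age) %>%
  summarise(mean_fakes_wtd = weighted.mean(num_fake_shares, w = weight_svypop_w3))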
- Let’s move to a regression approach to study this relationship. How could we model this relationship? What are the pros and cons of each approach?
- Would OLS work?
- Would a binary logit or probit work?
- What other approaches could work?
- Let’s replicate the first column of Table 2 in the authors’ analysis.
- What type of model do the authors use? Why might this be an appropriate model in their case?
- How should the coefficients be interpreted?
- What would be a quantity of interest?
Recall that for our primary count models, the regression equation is of the form:
- \(\log \lambda_i = \eta_i = \mathbf x_i^T\beta\)
- \(\lambda_i = \exp(\mathbf x_i^T\beta)\)
- \(\lambda_i = \mathbf E[Y_i | \mathbf x_i] = \exp(\beta_0 + \beta_1x_{i1} + ... + \beta_{k}x_{ik})\) is the expected number of events per unit of time or space
- Let’s fit the model using `glm`. Be careful to specify the right `family`.
- What is the dispersion parameter? What conceptual quantity does this represent?
We will fit a quasipoisson following the authors. Note: they apply survey weights through the `weights` argument of `glm()`. You could also use the `survey` package for this.
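(For reference, a minimal sketch of that `survey`-package route might look like the following; the object names `fake.design` and `model1.svy` are illustrative, not from the paper.)
## Sketch: design-based alternative with the survey package
library(survey)
fake.design <- svydesign(ids = ~1, weights = ~weight_svypop_w3,
                         data = subset(fake, !is.na(weight_svypop_w3)))
model1.svy <- svyglm(num_fake_shares ~ ideology + age + female + black + educ + faminc,
                     design = fake.design, family = quasipoisson())
summary(model1.svy)
For now, though, we will follow the approach of the authors: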
model1.qp <- glm(num_fake_shares ~ ideology + age + female + black + educ + faminc,
                 weights = weight_svypop_w3,
                 data = fake,
                 family = "quasipoisson")
summary(model1.qp)
Call:
glm(formula = num_fake_shares ~ ideology + age + female + black +
educ + faminc, family = "quasipoisson", data = fake, weights = weight_svypop_w3)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.4510 -0.5414 -0.3406 -0.1929 12.8243
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.416028 1.379091 -2.477 0.01341 *
ideologyVery liberal 0.486833 1.238165 0.393 0.69426
ideologyLiberal -1.127456 1.438798 -0.784 0.43345
ideologyModerate 0.332692 1.186454 0.280 0.77922
ideologyConservative 2.186850 1.154932 1.893 0.05857 .
ideologyV. conserv. 2.366213 1.158278 2.043 0.04132 *
age30-44 0.771712 0.811327 0.951 0.34174
age45-65 1.136384 0.764972 1.486 0.13771
ageOver 65 2.052391 0.766352 2.678 0.00752 **
female -0.113780 0.216606 -0.525 0.59950
black -0.879538 0.753713 -1.167 0.24351
educ -0.084881 0.081317 -1.044 0.29681
faminc -0.007145 0.008419 -0.849 0.39627
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasipoisson family taken to be 3.304861)
Null deviance: 1676.7 on 1040 degrees of freedom
Residual deviance: 1172.0 on 1028 degrees of freedom
(2459 observations deleted due to missingness)
AIC: NA
Number of Fisher Scoring iterations: 7
What was the form of the regression model?
library(equatiomatic)
equatiomatic::extract_eq(model1.qp, wrap=T, terms_per_line = 2)
\[ \begin{aligned} \log ({ E( \operatorname{num_fake_shares} ) }) &= \alpha + \beta_{1}(\operatorname{ideology}_{\operatorname{Very\ liberal}})\ + \\ &\quad \beta_{2}(\operatorname{ideology}_{\operatorname{Liberal}}) + \beta_{3}(\operatorname{ideology}_{\operatorname{Moderate}})\ + \\ &\quad \beta_{4}(\operatorname{ideology}_{\operatorname{Conservative}}) + \beta_{5}(\operatorname{ideology}_{\operatorname{V.\ conserv.}})\ + \\ &\quad \beta_{6}(\operatorname{age}_{\operatorname{30-44}}) + \beta_{7}(\operatorname{age}_{\operatorname{45-65}})\ + \\ &\quad \beta_{8}(\operatorname{age}_{\operatorname{Over\ 65}}) + \beta_{9}(\operatorname{female})\ + \\ &\quad \beta_{10}(\operatorname{black}) + \beta_{11}(\operatorname{educ})\ + \\ &\quad \beta_{12}(\operatorname{faminc}) \end{aligned} \]
We could have fit a regular poisson model. In that model, the dispersion parameter is taken to be 1, such that the conditional mean and variance are equal. In contrast, the quasipoisson allows the variance to be greater than the mean by estimating the dispersion parameter from the data.
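To see where this dispersion parameter comes from, one way to recover it by hand (a sketch) is the sum of squared Pearson residuals divided by the residual degrees of freedom, which should match the value reported in summary(model1.qp):
## Estimate of the dispersion parameter: Pearson chi-square / residual df
sum(residuals(model1.qp, type = "pearson")^2) / df.residual(model1.qp)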
Let’s see how much of an issue this is.
model1.p <- glm(num_fake_shares ~ ideology + age + female + black + educ + faminc,
                weights = weight_svypop_w3,
                data = fake,
                family = "poisson")
We can conduct a test of overdispersion in our model using `dispersiontest` from the AER package. If we have a significant result and a dispersion estimate greater than 1, this would suggest overdispersion. What should we conclude? What are the implications?
library(AER)
dispersiontest(model1.p)
Overdispersion test
data: model1.p
z = 2.9334, p-value = 0.001676
alternative hypothesis: true dispersion is greater than 1
sample estimates:
dispersion
3.027398
Note: The authors address different modeling choices in the paper. “We aggregated all shares to the individual respondent level so that our dependent variables are counts (i.e., number of fake news stories shared). To account for this feature of the data, as well as the highly skewed distribution of the counts, we primarily used Poisson or quasi-Poisson regressions to model the determinants of Facebook sharing behavior. We conducted dispersion tests on the count data and used quasi-Poisson models if the null hypothesis of no dispersion is rejected. Below, we included negative binomial and Ordinary Least Squares (OLS) regressions to show that our results are generally not sensitive to model choice. All models applied weights from YouGov to adjust for selection into the sample. We specifically used sample-matching weights produced for the third wave of the survey, which was closest to the Facebook encouragement sent to respondents (27). (Results also do not appear to be sensitive to the use of weights.)”
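Since the authors mention negative binomial regression as a robustness check, here is a minimal sketch of what that could look like with `MASS::glm.nb` (an illustration, not the authors’ replication code; glm.nb estimates the extra dispersion through a theta parameter):
## Negative binomial alternative (sketch)
library(MASS)
model1.nb <- glm.nb(num_fake_shares ~ ideology + age + female + black + educ + faminc,
                    weights = weight_svypop_w3,
                    data = fake)
summary(model1.nb)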
- Let’s generate some quantities of interest.
- What is the incidence rate ratio for sharing fake news of being Age 65+ vs. the reference category 18-29?
- What is the average number of fake news posts expected to be shared by age group?
Incidence rate ratios
## Incidence rate ratios are exp(Bj)
exp(coef(model1.qp)["ageOver 65"])
ageOver 65
7.786494
How should we interpret this?
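To attach uncertainty to this ratio, one option (a sketch based on a normal approximation on the log scale) is to exponentiate a Wald confidence interval for the coefficient:
## 95% confidence interval for the incidence rate ratio
exp(confint.default(model1.qp, parm = "ageOver 65"))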
Average Fake News Posts Shared by Age Group
library(prediction)
## Note: levels(fake$age) is a shortcut for putting in the vector of all possible age levels. Only works for factor variables.
count.byage <- prediction(model1.qp, at = list(age = levels(fake$age)), type = "response")
summary(count.byage)
at(age) Prediction SE z p lower upper
18-29 0.06575 0.04894 1.344 1.791e-01 -0.03017 0.1617
30-44 0.14225 0.04644 3.063 2.193e-03 0.05122 0.2333
45-65 0.20484 0.03468 5.907 3.493e-09 0.13687 0.2728
Over 65 0.51196 0.08653 5.917 3.285e-09 0.34237 0.6816
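The prediction package is averaging the predicted counts across respondents after setting everyone’s age to each level in turn. We can do the same calculation by hand with the model matrix: set the age dummies, compute exp(X %*% B), and average.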
## 65+: set everyone's age to Over 65 in the model matrix, then average exp(X %*% B)
X <- model.matrix(model1.qp)
X[, "age30-44"] <- 0
X[, "age45-65"] <- 0
X[, "ageOver 65"] <- 1
B <- coef(model1.qp)
eta <- X %*% B
avg.fake.65 <- mean(exp(eta))

## 45-65
X <- model.matrix(model1.qp)
X[, "age30-44"] <- 0
X[, "age45-65"] <- 1
X[, "ageOver 65"] <- 0
B <- coef(model1.qp)
eta <- X %*% B
avg.fake.45 <- mean(exp(eta))

## 30-44
X <- model.matrix(model1.qp)
X[, "age30-44"] <- 1
X[, "age45-65"] <- 0
X[, "ageOver 65"] <- 0
B <- coef(model1.qp)
eta <- X %*% B
avg.fake.30 <- mean(exp(eta))

## 18-29 (reference category: all age dummies set to 0)
X <- model.matrix(model1.qp)
X[, "age30-44"] <- 0
X[, "age45-65"] <- 0
X[, "ageOver 65"] <- 0
B <- coef(model1.qp)
eta <- X %*% B
avg.fake.18 <- mean(exp(eta))
## Gather results
c(avg.fake.18, avg.fake.30, avg.fake.45, avg.fake.65)
[1] 0.06574956 0.14224696 0.20484149 0.51195856
How could we calculate uncertainty?
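One common option (a sketch, not necessarily the authors’ approach) is to simulate coefficient vectors from their estimated sampling distribution and recompute the quantity of interest for each draw; a bootstrap over respondents would be another route. For example, for the Over 65 average:
## Simulation-based interval for the average predicted count, everyone set to Over 65 (sketch)
library(MASS)   # for mvrnorm
set.seed(1234)
B.sims <- mvrnorm(1000, mu = coef(model1.qp), Sigma = vcov(model1.qp))

X <- model.matrix(model1.qp)
X[, "age30-44"] <- 0
X[, "age45-65"] <- 0
X[, "ageOver 65"] <- 1

## Average exp(X %*% b) over respondents for each simulated coefficient vector b
sim.avgs <- apply(B.sims, 1, function(b) mean(exp(X %*% b)))
quantile(sim.avgs, c(0.025, 0.975))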