8 Ordinal Outcomes
This section will cover ordinal logit and probit regression. These are models that work well for outcome variables that include a set of (more than 2) ordered discrete categories.
Often, survey responses have this form of outcome (e.g., a likert scale from “strongly agree” to “strongly disagree”). We also might categorize behavioral responses in an ordinal way from “stay home” to “protest”, for example.
You can review the following for additional external resources on this section.
- King, Gary. 1998. Unifying political methodology: The likelihood theory of statistical inference. University of Michigan Press. Chapter 5.4.
- Resources for ordinal models in R here and here
- A section from Gelman and Hill Chapter 6, posted to Canvas.
8.1 Ordinal Outcome Data
Here is a motivating example for the use of ordered data from Paluck and Green “Deference, Dissent, and Dispute Resolution: An Experimental Intervention Using Mass Media to Change Norms and Behavior in Rwanda” which was published in the American Political Science Review in 2009. doi:10.1017/S0003055409990128
Abstract. Deference and dissent strike a delicate balance in any polity. Insufficient deference to authority may incapacitate government, whereas too much may allow leaders to orchestrate mass violence. Although cross-national and cross-temporal variation in deference to authority and willingness to express dissent has long been studied in political science, rarely have scholars studied programs designed to change these aspects of political culture. This study, situated in post-genocide Rwanda, reports a qualitative and quantitative assessment of one such attempt, a radio program aimed at discouraging blind obedience and reliance on direction from authorities and promoting independent thought and collective action in problem solving. Over the course of one year, this radio program or a comparable program dealing with HIV was randomly presented to pairs of communities, including communities of genocide survivors, Twa people, and imprisoned genocidaires … Although the radio program had little effect on many kinds of beliefs and attitudes, it had a substantial impact on listeners’ willingness to express dissent and the ways they resolved communal problems.
In a field experiment, the authors have randomly assigned participants in different research sites to listen to a radio program over the course of a year that varied in its message. As the authors note, “Because radios and batteries are relatively expensive for Rwandans, they usually listen to the radio in groups. Thus, we used a group-randomized design in which adults from a community listened together either to the treatment (reconciliation) program or to the control program (another entertainment-education radio soap opera about health and HIV).” The authors have 14 clusters with 40 individuals within each cluster.
- Treatment (treat): radio program with one of two messages, where 1=the treatment condition with a reconciliation message and 0=control, listening to a health message.
- Outcome (dissent): Willingness to Display Dissent, an ordered scale with four categories from 1 (“I should stay quiet”) to 4 (“I should dissent”).
Let’s load the data and look at the treatment and outcome.
library(rio)
pg <- import("https://github.com/ktmccabe/teachingdata/blob/main/paluckgreen.dta?raw=true")

## Let's treat the outcome as a factor
pg$dissent <- as.factor(pg$dissent)

## Let's visualize the outcome by group
library(ggplot2)
library(tidyverse)
pg %>%
  filter(is.na(dissent) == F) %>%
  ggplot(aes(x = dissent, group = factor(treat),
             fill = factor(treat))) +
  geom_bar(aes(y = ..prop..), stat = "count", position = "dodge", color = "black") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  scale_fill_brewer("Condition", labels = c("Control", "Treatment"), palette = "Paired") +
  scale_x_discrete(labels = c("Should Stay Quiet", "2", "3", "Should Dissent"))
We can see variation in the outcome, where some people are at the “stay quiet” end of the scale, while others are at the opposite end. We might have a few questions about the outcome:
- What is the probability of being in a particular category given a set of \(\mathbf{x_i'}\) values?
- Does the treatment influence likelihood of expressing dissent?
- Does the treatment significantly affect the probability of responding in a particular category?
What model should they use to help answer these questions?
One approach would be to use OLS.
- They could treat the 1 to 4 scale as continuous, from “Should stay quiet” to “Should dissent.”
- If they do this, the interpretation of the regression coefficients would be:
- Going from Control to Treatment (0 to 1) is associated with \(\hat \beta\) movement on this scale.
- What could be problematic here?
- Might go below or above scale points
- Distance between scale points might not be an equal interval
- Doesn’t answer the “probability” question, just describes movement up and down the scale.
A second approach could be to collapse the scale to be dichotomous and use logit/probit or a linear probability model.
- For example, they could recode the outcome as 0 (lean toward stay quiet/stay quiet) vs. 1 (lean toward dissent/dissent)
- Here, after converting the outcomes into probabilities, the interpretation would be
- Going from Control to Treatment (0 to 1) is associated with an average difference in predicted probability of dissent (\(Y_i = 1\))
- What could be problematic here?
- We lose information.
A third approach, and the focus of this section, would be to use an ordinal logistic or probit regression.
- This is appropriate when our goal, at least in part, is to estimate the probability of being in a specific category, and these categories have a natural ordering to them.
8.1.1 Ordinal Model
With ordered data, we have an outcome variable \(Y_i\) that can fall into different, ordered categories: \(Y_i \in \{C_1, C_2, \ldots, C_J \}\) with some probability.
The image above shows a distribution where the area under the curve sums to 1, with the area divided into 4 categories, separated by three cutpoints. The area represents probability mass. For example, the area to the left of the first cutpoint \(\zeta_1\) represents \(\Pr(Y_i^\ast \leq \zeta_1)\).
In our ordered model, we assume that there is a latent (unobserved) \(Y^\ast_i = X_i\beta + \epsilon\)
- This means we can still have a single model \(X_i \beta\), which determines what outcome level is achieved (this requires an assumption).
- where \(\epsilon\) is either assumed to be normally distributed (probit) or distributed logistically (logit), and corresponds to the fuzziness of the cutpoints (\(\zeta_j\)), which define in which category an outcome is observed to fall. Instead of fine lines, we estimate probabilistically in which category \(Y_i\) is predicted to fall.
We observe this category in our outcome: \(Y_i\).
- For example, we observe if someone said “should stay quiet” or “should dissent” vs. one of the two middle categories.
The \(\zeta_j\) are called “cutpoints”
- Need cutpoints that are distinct, but the distance between cutpoints does not have to be the same.
- This can be particularly useful when we have scales that have a natural ordering, but the distance between scale points might not have the same meaning or be the same (e.g., “Agree”,“Disagree”, “Revolt”). This is different from an interval variable, where we assume the difference between scale points carries the same meaning (e.g., credit score, cups of flour in a recipe).
- Note: There is no intercept in the linear prediction model in this case. Instead of the intercept, we have the specific cutpoints.
- Rule of thumb: estimation can become unstable with more than 3-5 categories.
8.1.2 Interpretation
Here is an example of what the ordered probit model output can look like from the authors’ paper. You can see coefficients similar to the models we’ve been working with before, but instead of an intercept, we have the different cutpoints; in this case, \(J - 1 = 3\) cutpoints for the \(J = 4\) categories.
We can interpret the coefficients as: a one-unit increase in \(x\) is associated with a \(\hat \beta\) increase/decrease in the linear predictor scale of \(Y^\ast\) (in log-odds for logit, or z-score standard deviations for probit).
- This gives us an initial sense (based on sign and significance) of how an independent variable positively or negatively affects the position of \(Y^\ast\). However, it does not give us any information about specific categories.
- Thus, alas, this is unsatisfying for a couple of reasons.
- \(Y_i^\ast\) is an unobserved variable (not the categories themselves)
- The scale is harder to interpret than probability
- Therefore, we will generally want to convert our estimates into probabilities (our quantities of interest)
- One wrinkle here is we now have \(J\) predicted probabilities to estimate, one for each category.
- A second wrinkle here is any change in the probability of being in the \(jth\) category of \(Y\) also affects the probabilities of being in the \(\neq jth\) categories because the probabilities of being in each category have to sum together to 1. (E.g., Increasing the probability that someone said “should dissent” affects the probability they said “should stay quiet.”)
8.2 Likelihood Framework
In the binary case, we wanted to estimate \(Pr(Y_i = 1 | X)\). Our goal in an ordinal model is to estimate the probability that \(Y_i\) is in a particular \(j\) category \(C_j\).
\(Pr(Y_i = C_j | X)\)
To do so, we are going to use the cumulative distribution functions to estimate the probability that \(Y_i\) is below a particular cutpoint \(\zeta_j\) or between two cutpoints \(\zeta_j\) and \(\zeta_{j+1}\).
Finding the predicted probability for a given \(j\) category can be written as follows: \(\Pr(Y_i = C_j|X_i) = \Pr(Y^\ast \le \zeta_{j}) - \Pr(Y^\ast \le \zeta_{j-1})\)
We can spell this out more explicitly for each \(j\) category:
- \(Pr(Y_i = C_{J} | X) = 1 - Pr(Y^\ast\leq \zeta_{J - 1} | X_i)\)
- \(Pr(Y_i = C_{3} | X) = Pr(Y^\ast \leq \zeta_{3} | X_i) - Pr(Y^\ast \leq \zeta_{2} | X_i)\)
- \(Pr(Y_i = C_{2} | X) = Pr(Y^\ast \leq \zeta_{2} | X_i) - Pr(Y^\ast \leq \zeta_{1} | X_i)\)
- \(Pr(Y_i = C_{1} | X) = Pr(Y^\ast \leq \zeta_{1} | X_i)\)
Just as in the binary case, we use our \(\Phi()\) (pnorm) or \(\frac{\exp(X\beta)}{1 + \exp(X\beta)}\) (plogis) functions to get our probabilities from the linear predictors. However, in this ordered case, we also have to include the estimate for the cutpoint \(\zeta_j\) when performing this operation. You can kind of think of this as having a separate intercept for each category instead of just one intercept in the binary case.
For example, in the ordinal logit case, the linear predictor is in the scale of log of the proportional-odds. We can write our regression as:
\(\log \frac{P(Y \leq j | X)}{P(Y > j | X)} = (\zeta_j - \eta)\) where \(\eta = x_1\beta_1 + x_2\beta_2 + ... + x_k\beta_k\)
- To get a probability we apply the plogis function: \(logit^{-1}(\zeta_j - \eta)\)
- Same for probit, but we use pnorm: \(probit^{-1}(\zeta_j - \eta)\) (see the short sketch below)
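To make this concrete, here is a minimal sketch with made-up numbers; the three cutpoints and the linear predictor below are hypothetical, not estimates from any model.

zeta <- c(-0.5, 0, 0.3) # hypothetical cutpoints
eta <- 0.25             # hypothetical linear predictor x'beta
## Category probabilities from differences in the cumulative logistic CDF
p <- c(plogis(zeta[1] - eta),
       plogis(zeta[2] - eta) - plogis(zeta[1] - eta),
       plogis(zeta[3] - eta) - plogis(zeta[2] - eta),
       1 - plogis(zeta[3] - eta))
p
sum(p) # the four probabilities sum to 1

Swapping plogis for pnorm gives the probit version of the same calculation.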
8.2.1 Likelihood
The likelihood of all observations, assuming independence is:
\(\mathcal L(\beta, \zeta | Y) = \prod_{i=1}^{N} Pr(Y_i = C_j)\) for a given category. To incorporate all \(J\) categories, we can write:
\(\mathcal L(\beta, \zeta | Y) = \prod_{i=1}^{N} \prod_{j=1}^{J} \{ \Pr(Y^\ast \le \zeta_{j}) - \Pr(Y^\ast \le \zeta_{j-1})\}^{\mathbf 1(Y_i=C_j)}\) where \(\mathbf 1(Y_i=C_j)\) is an indicator for whether or not a given \(Y_i\) is observed in the \(jth\) category.
Note that here instead of estimating just \(\beta\), we now also estimate the cutpoints \(\zeta\).
The log likelihood then just changes this to the sum:
\[\begin{align*} \ell(\beta, \zeta | Y) &= \sum_{i=1}^{N} \sum_{j=1}^{J}\mathbf 1(Y_i=C_j) \log\{ \Pr(Y^\ast \le \zeta_{j}) - \Pr(Y^\ast \le \zeta_{j-1})\}\\ &= \sum_{i=1}^{N} \sum_{j=1}^{J} \mathbf 1(Y_i=C_j) \log\{\Phi(\zeta_j - \mathbf x_i'\beta) - \Phi(\zeta_{j-1} - \mathbf x_i'\beta)\} \end{align*}\]

In addition to assuming independence of our observations, we assume each category has a nonzero probability of being observed and that the cutpoints are monotonically increasing: \(\zeta_j < \zeta_{j+1}\).
8.3 Fitting Ordinal Models in R
To fit an ordinal logit or probit model in R, we can use the MASS
package and polr
function.
You may need to install this in your RStudio the first time you use it.
install.packages("MASS")
When we are ready to run the model, we then open the package and use the function polr
. The first inputs are very similar to lm
and glm
. However, we add an argument specifying the method =
which can be “logistic” (the default) or “probit.” We also specify Hess=T
to make sure we get the uncertainty estimates with the results.
- By the way, why would that argument be called
Hess
?
library(MASS)
fit <- polr(as.factor(dissent) ~ treat, data = pg, Hess = T, method = "probit")
summary(fit)
Call:
polr(formula = as.factor(dissent) ~ treat, data = pg, Hess = T,
method = "probit")
Coefficients:
Value Std. Error t value
treat 0.2645 0.1021 2.589
Intercepts:
Value Std. Error t value
1|2 -0.5102 0.0761 -6.7017
2|3 -0.0576 0.0740 -0.7784
3|4 0.2697 0.0742 3.6344
Residual Deviance: 1253.844
AIC: 1261.844
(116 observations deleted due to missingness)
To get p-values, we can use the AER
package.
library(AER)
coeftest(fit)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
treat 0.264455 0.102143 2.5891 0.0096238 **
1|2 -0.510237 0.076135 -6.7017 2.06e-11 ***
2|3 -0.057570 0.073958 -0.7784 0.4363235
3|4 0.269665 0.074198 3.6344 0.0002786 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Or calculate manually
round(2*pnorm(abs(summary(fit)$coefficients[,3]),
lower.tail = F), digits=6)
treat 1|2 2|3 3|4
0.009624 0.000000 0.436323 0.000279
For details on how to fit this manually in R using optim
, expand.
# Grab X and Y, listwise deletion
pgsub <- na.omit(cbind(pg$dissent, pg$treat))
Y <- pgsub[, 1]
X <- pgsub[, 2]
## Log likelihood in R for this example
llikor <- function(par, Y, X){
  k <- 1 # X columns
  j <- 4 # categories
  beta <- par[1:k]
  zeta <- par[(k+1):length(par)]
  X <- cbind(X)
  beta <- cbind(beta)
  ## linear predictor
  eta <- X %*% beta
  ## indicator variables
  y1 <- ifelse(Y==1, 1, 0)
  y2 <- ifelse(Y==2, 1, 0)
  y3 <- ifelse(Y==3, 1, 0)
  y4 <- ifelse(Y==4, 1, 0)
  ## likelihood for each category
  l1 <- y1*log(pnorm(zeta[1] - eta))
  l2 <- y2*log(pnorm(zeta[2] - eta) - pnorm(zeta[1] - eta))
  l3 <- y3*log(pnorm(zeta[3] - eta) - pnorm(zeta[2] - eta))
  l4 <- y4*log(1 - pnorm(zeta[3] - eta))
  ## sum together
  llik <- sum(c(l1, l2, l3, l4))
  return(llik)
}
## Optimize
opt.out <- optim(par = c(.2, 1, 2, 3), fn = llikor, X = X, Y = Y, method = "BFGS",
                 control = list(fnscale = -1), hessian = T)
## Coefficients
opt.out$par[1]
[1] 0.2644911
coef(fit)
treat
0.2644546
## Cutpoints
opt.out$par[2:4]
[1] -0.51019872 -0.05757855 0.26969063
fit$zeta
1|2 2|3 3|4
-0.5102370 -0.0575699 0.2696646
## Standard errors
sqrt(diag(solve(- opt.out$hessian)))
[1] 0.10214310 0.07613473 0.07395763 0.07419776
sqrt(diag(vcov(fit)))
treat 1|2 2|3 3|4
0.10214309 0.07613503 0.07395770 0.07419775
8.3.1 Quantities of Interest
In discussing the results of Table 3 from the ordered probit analysis in the paper, the authors note, “A shift of .26 probits implies, for example, that a health group respondent with a 30% chance of strongly agreeing to dissent would move to a 40% chance if assigned to the reconciliation program group. This is a large, but not implausibly large, shift in opinion.”
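We can check that arithmetic ourselves. This is just a back-of-the-envelope calculation using the numbers from the quote, not part of the authors’ replication code:

## Start from a 30% chance and shift the latent index up by .26 probits
pnorm(qnorm(.30) + .26)

This returns roughly 0.40, matching the authors’ description.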
Similar to the binary case, we can convert our estimates to probability, but here we will do it for specific categories. Let’s first conduct an estimate of the difference in probability of being in the top category of dissent for those in the treatment vs. control conditions.
We will follow the formulas outlined above. For the top category (category 4 here) we have:
- \(Pr(Y_i = C_4) = 1 - Pr(Y^\ast \leq \zeta_{4-1} | X_i)\) where our estimate of \(Y^\ast = X\hat \beta\).
To calculate this we take \(1 -\) pnorm\((\zeta_{4-1} - X\hat \beta)\).

- Note that we use pnorm because this is a probit model. If it were a logistic model, we would use plogis instead.
- We are estimating the probability that our observation is in category 4, when setting \(X\) to particular values. In this case, we will estimate the probability when an observation is in the treatment or control group.
Let’s find \(X\)
X <- model.matrix(fit)
head(X)
(Intercept) treat
1 1 0
2 1 0
3 1 0
4 1 0
5 1 0
6 1 0
Let’s estimate the probability when someone is in the treatment group vs. control group. For this, we will set X to be 1 or 0. Because we only have one coefficient in our model, we actually don’t need the entire matrix. For example, once we set treat
to be 1, all observations (rows) are identical to each other:
"treat"] <- 1 # in treatment
X[, head(X)
(Intercept) treat
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 1
So instead of estimating the outcome for every single row and then taking the mean, we could just estimate the outcome for one row given that all rows are identical. In regressions where you have other covariates, those covariates will likely have different values, making the rows distinct. This is a special case where the below code would give identical results.
Xt <- cbind(1, 1) # A 1 for the intercept and 1 for the treatment
Xc <- cbind(1, 0) # A 1 for the intercept and 0 for the treatment (being in the control)
By default, R will include an Intercept in the model.matrix
. We actually do not want this when calculating ordinal regression quantities of interest because the cutpoints will be serving the purpose of the intercept. So we will remove this from \(X\), but we will use as.matrix()
to make sure R will still treat the object like a matrix.
- If you instead used the full \(X\) matrix, the only extra step you would need to do is to take the mean() of the probability estimates below (e.g., the mean of pp1.4). See the practice problem solution in 8.5 for an example.
X <- as.matrix(X[, -1])
head(X)
[,1]
1 1
2 1
3 1
4 1
5 1
6 1
Xt <- as.matrix(Xt[, -1])
Xt
[,1]
[1,] 1
Xc <- as.matrix(Xc[, -1])
Xc
[,1]
[1,] 0
## probability of being in top category
cutpoint3 <- fit$zeta[3] # grab the top cutpoint J - 1, 4-1=3
b <- coef(fit) # coefficients

pp1.4 <- 1 - pnorm(cutpoint3 - Xt %*% b) # prob of top category for treatment
pp1.4
[,1]
[1,] 0.4979215
pp0.4 <- 1 - pnorm(cutpoint3 - Xc %*% b) # prob of top category for control
pp0.4
[,1]
[1,] 0.3937091
## difference
pp1.4 - pp0.4
[,1]
[1,] 0.1042124
We recover that there is a 10 percentage point difference in the probability of being in the top category of dissent, predicted for those in the treatment vs. control.
If we were interested in other categories, we could similarly compare probabilities within those. For example, what about category 3?
- Here we just apply a different formula from 8.2: \(Pr(Y_i = C_3) = Pr(Y^\ast \leq \zeta_{3} | X_i) - Pr(Y^\ast \leq \zeta_{2} | X_i)\) where our estimate of \(Y^\ast = X\hat \beta\). To calculate this we take pnorm\((\zeta_{3} - X\hat \beta)\) - pnorm\((\zeta_{2} - X\hat \beta)\).
cutpoint3 <- fit$zeta[3]
cutpoint2 <- fit$zeta[2]

pp1.3 <- pnorm(cutpoint3 - Xt %*% b) - pnorm(cutpoint2 - Xt %*% b)
pp1.3
[,1]
[1,] 0.1283614
pp0.3 <- pnorm(cutpoint3 - Xc %*% b) - pnorm(cutpoint2 - Xc %*% b)
pp0.3
[,1]
[1,] 0.1292452
pp1.3 - pp0.3
[,1]
[1,] -0.000883825
These represent the estimated probabilities of being in category 3, or the estimated difference in probability of being in category 3 for the treatment vs. control condition.
What about category 2 or category 1?
Try on your own, and then expand for the solution.
- Category 2: Here we just apply a different formula from 8.2: \(Pr(Y_i = C_2) = Pr(Y^* \leq \zeta_{2} | X_i) - Pr(Y^* \leq \zeta_{1} | X_i)\).
- Category 1: Here we just apply a different formula from 8.2: \(Pr(Y_i = C_1) = Pr(Y^* \leq \zeta_{1} | X_i)\).
cutpoint1 <- fit$zeta[1]
cutpoint2 <- fit$zeta[2]

## category 2
pp1.2 <- pnorm(cutpoint2 - Xt %*% b) - pnorm(cutpoint1 - Xt %*% b)
pp1.2
[,1]
[1,] 0.1544561
pp0.2 <- pnorm(cutpoint2 - Xc %*% b) - pnorm(cutpoint1 - Xc %*% b)
pp0.2
[,1]
[1,] 0.1721029
pp1.2 - pp0.2
[,1]
[1,] -0.01764679
## category 1 (bottom category)
pp1.1 <- pnorm(cutpoint1 - Xt %*% b)
pp1.1
[,1]
[1,] 0.2192609
pp0.1 <- pnorm(cutpoint1 - Xc %*% b)
pp0.1
[,1]
[1,] 0.3049427
pp1.1 - pp0.1
[,1]
[1,] -0.08568175
We can use our R shortcuts for all categories. Here, to make sure we are in probabilities, we can use type="probs".
## predict
prs <- predict(fit, newdata = data.frame(treat = c(0,1)), type = "probs")
prs
1 2 3 4
1 0.3049427 0.1721029 0.1292452 0.3937091
2 0.2192609 0.1544561 0.1283614 0.4979215
The first row shows the probability of being in each category (1 through 4) for the first entry for the treat variable (0), implying that the observation was in the control group. The second row does the same thing for when the treat variable is 1, implying that the observation was in the treatment group. These should match up with our manual calculations.
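As a quick sanity check, each row of prs is a full probability distribution over the four categories, so each row should sum to 1:

rowSums(prs)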
The marginaleffects and prediction packages will also work to some extent, which is nice when you have covariates you want to hold at observed values. Note that we supply whatever the names of the levels of our variables are.

- It just so happens that in this case, both treat and our dissent outcomes have numbers for the categories. In other cases, you might have text as the category names, in which case you would enter those text values (e.g., “Somewhat Agree” or “Treatment condition”) as the variable or category labels.
## marginaleffects
library(marginaleffects)
## predicts for all categories
outme <- predictions(fit, newdata = datagrid(treat = c(0,1)))
outme
Group treat Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
1 0 0.305 0.0267 11.44 <0.001 98.2 0.2527 0.357
1 1 0.219 0.0245 8.95 <0.001 61.3 0.1713 0.267
2 0 0.172 0.0176 9.79 <0.001 72.8 0.1377 0.207
2 1 0.154 0.0164 9.41 <0.001 67.4 0.1223 0.187
3 0 0.129 0.0151 8.56 <0.001 56.3 0.0997 0.159
3 1 0.128 0.0150 8.54 <0.001 56.0 0.0989 0.158
4 0 0.394 0.0285 13.79 <0.001 141.4 0.3378 0.450
4 1 0.498 0.0313 15.91 <0.001 186.8 0.4366 0.559
Columns: rowid, group, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, dissent, treat
Type: probs
## to filter for a specific category, like category 4:
subset(outme, group==4)
Group Estimate Std. Error z Pr(>|z|) S CI low CI high
4 0.394 0.0285 13.8 <0.001 141.4 0.338 0.450
4 0.498 0.0313 15.9 <0.001 186.8 0.437 0.559
Columns: rowid, group, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, dissent, treat
## Prediction
library(prediction)
## can specify which category
outp <- prediction(fit, at = list(treat = c(0,1)),
                   category = "4")
summary(outp)$Prediction
[1] 0.3937091 0.4979215
If we wanted to compare treatment conditions to see the difference in probability, we would move to the avg_comparisons
function in marginaleffects
.
library(marginaleffects)
marg.me <- avg_comparisons(fit, type = "probs",
                           variables = list(treat = c(0,1)))
subset(marg.me, group == 3)
Group Term Contrast Estimate Std. Error z Pr(>|z|) S CI low CI high
3 treat 1 - 0 -0.000884 0.00186 -0.475 0.635 0.7 -0.00453 0.00276
Columns: term, group, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
Finally, the margins
package should also work in many cases. You can compare this output with the pp1.4 - pp0.4
calculation from above.
library(margins)
marg.obj <- summary(margins(fit, variables = "treat", change = c(0, 1),
                            vce = "bootstrap", category = "4"))
marg.obj
factor AME SE z p lower upper
treat 0.1042 0.0413 2.5244 0.0116 0.0233 0.1851
How can we visualize the results?
- Given that our independent variable of interest, treat, has just two categories, we could make a plot similar to the barplot at the beginning of the section.
- When our independent variable of interest has several categories, things can get a bit messier. We will do an example in a weekly tutorial.
How can we get uncertainty?
- Like before, we can use bootstrap or simulation, or ask marginaleffects or margins to do it for us. A simulation-based sketch follows below.
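Here is a minimal simulation-based sketch for the pp1.4 - pp0.4 difference from above. We assume the parameter order in vcov(fit) matches the summary output (the treat coefficient first, then the three cutpoints); the object names are ours:

library(MASS)
set.seed(1234)
## Draw parameters from their estimated sampling distribution
simpar <- mvrnorm(1000, mu = c(coef(fit), fit$zeta), Sigma = vcov(fit))
## Column 1 is the treat coefficient; column 4 is the top cutpoint (3|4)
sim.pp1 <- 1 - pnorm(simpar[, 4] - simpar[, 1] * 1) # treatment
sim.pp0 <- 1 - pnorm(simpar[, 4] - simpar[, 1] * 0) # control
quantile(sim.pp1 - sim.pp0, c(.025, .975)) # a 95% interval for the difference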
How can we incorporate covariates?
- Like before, we would just have a matrix of \(X\) covariates instead of a single column for the treatment.
- Like before, we could also calculate probabilities holding those covariates at observed values or setting them to designated values of interest.
How could we fit a logit version?
- We just switch to method = "logistic", and then we should be careful to also use plogis() in place of pnorm for the probability calculation.
- Note: ordered logits also have the benefit of having the odds ratio interpretation when we exponentiate the coefficients with exp(coef(fit)), as in the sketch below. See this post’s section on “Interpreting the odds ratio” halfway down the page for more information. Again, it is more common in political science to see probabilities than odds ratios, but different disciplines prefer different quantities of interest.
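A minimal sketch of the logit version and its odds-ratio interpretation (fit.logit is our own name; output not shown):

fit.logit <- polr(as.factor(dissent) ~ treat, data = pg,
                  Hess = T, method = "logistic")
exp(coef(fit.logit)) # proportional odds ratio for the treatment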
8.4 Assumptions
A key assumption for the ordinal models is Parallel lines/Proportional Odds: We only have one set of \(k \times 1\) \(\hat \beta\), not a separate set of coefficients for each ordinal category.
- This means that the relationship between the lowest versus all higher categories of the response variable are assumed to be the same as those that describe the relationship between the next lowest category and all higher categories, etc.
- For each \(X\) term included in the model, the coefficient ‘slopes’ are the same regardless of the threshold. If not, we would need a separate set of coefficients describing each pair of outcomes (e.g., Slopes for being in Cat 1 vs. Cat 2; Cat 2 vs. Cat 3, etc.)
- Even though we have different cutpoint values across categories, the effect of a one-unit change, such as going from control to treatment, is assumed to be parallel across response categories.
- For example, if theoretically, being a woman vs. a man has a positive effect on moving between Categories 3 and 4 in a particular model, but you believe it would have the opposite effect for moving from Category 1 to 2, this would suggest the ordinal model is not appropriate.
What can we do if the assumption is violated? Ignore it, go binary, go multinomial (discussed in the next section), or use a model that has been developed for relaxing this assumption (e.g., see the clm function in R, sketched below).
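For that last option, one possibility is the clm function in the ordinal package, whose nominal argument lets the treat effect vary across thresholds. A minimal sketch, assuming the ordinal package is installed (names are ours):

# install.packages("ordinal")
library(ordinal)
fit.clm <- clm(as.factor(dissent) ~ 1, nominal = ~ treat,
               data = pg, link = "logit")
summary(fit.clm)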
One test for this that has been developed for the ordered logit case is the Brant test.
fit.l <- polr(as.factor(dissent) ~ treat, data = pg,
              Hess = T, method = "probit")
## One way to test this
#install.packages("brant")
library(brant)
brant(fit.l,by.var=F)
--------------------------------------------
Test for X2 df probability
--------------------------------------------
Omnibus 3.93 2 0.14
treat 3.93 2 0.14
--------------------------------------------
H0: Parallel Regression Assumption holds
## Second way- compare fit of ordinal and multinom models
#install.packages("nnet")
library(nnet)
mlm <- multinom(as.factor(dissent) ~ treat, data = pg)
# weights: 12 (6 variable)
initial value 686.215709
iter 10 value 625.130759
final value 625.130585
converged
M1 <- logLik(fit.l)
M2 <- logLik(mlm)
G <- -2*(M1[1] - M2[1])
pchisq(G, 6 - 4, lower.tail = FALSE)
[1] 0.1667305
In both cases, our p-value is large enough that we cannot reject the null hypothesis, meaning that we are “okay” sticking with the assumption in this case.
8.5 Ordinal Practice Problems
Here are a few practice problems for ordered models.
- Try to replicate column 2 from Table 3 by adding factor(site_pr) to the regression model. Note that the standard errors will not exactly match. We will discuss why in a follow-up section.
- Calculate the average predicted probability of leaning toward staying quiet (category 2) for the treatment and control group, holding covariates at observed values.
  - Note: If you do this manually, you need to remove the intercept column that is automatically added to the model.matrix.
Try on your own, and then expand for the solution.
fit.c2 <- polr(as.factor(dissent) ~ treat + factor(site_pr),
               data = pg, method = "probit", Hess = T)
coef(fit.c2)
treat factor(site_pr)2 factor(site_pr)3 factor(site_pr)4
0.28484306 -0.17480464 -0.25513562 -0.17731799
factor(site_pr)5 factor(site_pr)6 factor(site_pr)7
-0.30547159 -0.11384681 0.04659582
## Option 1a
library(marginaleffects)
p1me <- avg_predictions(fit.c2, by = "treat",
                        newdata = datagridcf(treat = c(0,1)))
subset(p1me, group==2)
Group Estimate Std. Error z Pr(>|z|) S CI low CI high
2 0.173 0.0176 9.81 <0.001 73.0 0.138 0.207
2 0.154 0.0164 9.39 <0.001 67.2 0.122 0.186
Columns: group, treat, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
## Option 1b
library(prediction)
p1 <- prediction(fit.c2, at = list(treat = c(0,1)), category = 2)
summary(p1)$Prediction
[1] 0.1726136 0.1537839
## Option 2
X <- model.matrix(fit.c2)[, -1]
X[, "treat"] <- 1
bh <- coef(fit.c2)
eta <- X %*% bh
cutpoint2 <- fit.c2$zeta[2]
cutpoint1 <- fit.c2$zeta[1]
mean(pnorm(cutpoint2 - eta) - pnorm(cutpoint1 - eta))
[1] 0.1537839
"treat"] <-0
X[, <- coef(fit.c2)
bh <- X %*% bh
eta <- fit.c2$zeta[2]
cutpoint2 <- fit.c2$zeta[1]
cutpoint1 mean(pnorm(cutpoint2 - eta) - pnorm(cutpoint1 - eta))
[1] 0.1726136
8.5.1 A note on robust standard errors
In writing down our likelihood equations, we assume our observations are independent. However, that is not always the case. For example, in Paluck and Green’s article, the experimental design is based on clustered pairs, and the treatment is randomly assigned to one unit of a pair. Ideally, we want to account for this non-independence in our model.
In linear models (e.g., using lm
), this is often accomplished post-estimation by adjusting the standard errors. We might adjust them for clustering if we have a case like the authors’ or for suspected heteroskedasticity (maybe the errors are not constant, but instead, are larger for higher values of a given X variable).
Professor Jeffrey Wooldridge, an expert on econometrics, is a proponent of robust standard errors, given the stringent assumption of homoskedasticity.
There are many types of these “robust” standard errors, and we may encounter more later in the course. Generally, what each of these robust standard errors does is alter the nature of the variance covariance matrix from which we extract the standard errors. Recall in OLS, we have an expression for the variance of the coefficients that looks like the below. By assuming constant error variance, we can simplify it.
\[\begin{align*} \mathbf{V}(\widehat{\beta}) &= (X^T X)^{-1}X^T \mathbf{V}(\epsilon) X (X^T X)^{-1}\\ &= \underbrace{(X^T X)^{-1}X^T \sigma^2 I_n X (X^T X)^{-1}}_\text{Assume homoskedasticity}\\ &= \underbrace{\sigma^2(X^T X)^{-1} X^T X (X^T X)^{-1}}_\text{Because it is a constant, we can move it out in front of the matrix multiplication, and then simplify the terms.} \\ &= \sigma^2(X^T X)^{-1} \end{align*}\]

After we plug in \(\hat \sigma^2\) for \(\sigma^2\), we arrive at our variance-covariance matrix vcov: \(\hat \sigma^2(X^T X)^{-1}=\frac{1}{N-K} \mathbf{e'e}(X^T X)^{-1}\)
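As a quick self-contained check of this formula (on simulated data, not the authors’ data; names are ours), we can rebuild vcov() by hand:

set.seed(123)
x <- rnorm(100)
y <- 1 + 2*x + rnorm(100)
m <- lm(y ~ x)
Xm <- model.matrix(m)
## Plug in sigma^2 hat = e'e / (N - K)
sigma2hat <- sum(residuals(m)^2) / (nrow(Xm) - ncol(Xm))
all.equal(sigma2hat * solve(t(Xm) %*% Xm), vcov(m)) # should be TRUE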
When we don’t have constant error variance, we have to pause here:
\((X^T X)^{-1}X^T \mathbf{V}( \epsilon) X (X^T X)^{-1}\)
and assume something different about the structure of the error terms in \(\mathbf{V}( \epsilon)\) and end up with a different final expression for vcov
.
The robust estimators differ in how they specify an estimate for \(\mathbf{V}( \epsilon) = \Sigma\) and whether they take into account the degrees of freedom in the model (number of observations, number of variables, number of clusters, etc.)
For example, let’s fit the authors’ model as a linear model using MacKinnon and White HC1 robust standard errors. Those specify the following inside portion for an estimate of \(\mathbf{V}(\epsilon)\). This is what Stata uses as a default “robust” standard error:
\(\Sigma = \frac{n}{n-k} \text{diag}(\hat \epsilon_i^2)\)
This means we have a slightly different term for each observation instead of a constant estimate \(\hat \sigma^2\) for all observations.
fit.lin <- lm(as.numeric(dissent) ~ treat + factor(site_pr),
              data = pg)
## Option 1: Adjusting for heteroskedasticity
library(sandwich)
newvcov <- vcovHC(fit.lin, type = "HC1")
sqrt(diag(newvcov))
(Intercept) treat factor(site_pr)2 factor(site_pr)3
0.1661596 0.1145570 0.2128812 0.2158906
factor(site_pr)4 factor(site_pr)5 factor(site_pr)6 factor(site_pr)7
0.2164289 0.2176110 0.2175578 0.2085747
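To see what vcovHC is doing under the hood, here is a manual construction of the HC1 sandwich (a sketch with our own object names; it should reproduce the standard errors above):

Xmat <- model.matrix(fit.lin)
e <- residuals(fit.lin)
n <- nrow(Xmat)
k <- ncol(Xmat)
bread <- solve(t(Xmat) %*% Xmat)
meat <- t(Xmat) %*% (Xmat * e^2) # X' diag(e^2) X
hc1 <- (n / (n - k)) * bread %*% meat %*% bread
sqrt(diag(hc1))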
## Option 2: Adjusting for heteroskedasticity
library(estimatr)
fit.lin2 <- lm_robust(as.numeric(dissent) ~ treat + factor(site_pr),
                      data = pg, se_type = "stata")
sqrt(diag(vcov(fit.lin2)))
(Intercept) treat factor(site_pr)2 factor(site_pr)3
0.1661596 0.1145570 0.2128812 0.2158906
factor(site_pr)4 factor(site_pr)5 factor(site_pr)6 factor(site_pr)7
0.2164289 0.2176110 0.2175578 0.2085747
Another adjustment takes into account the grouped or “clustered” nature of the data. Here, we have a separate estimate for each group in the data. This can be implemented in R by using vcovCL
or the clusters
argument in lm_robust
.
## Option 1: Adjusting for clustering
library(sandwich)
newvcov <- vcovCL(fit.lin, type = "HC1", cadjust = T, cluster = pg$sitcode)
sqrt(diag(newvcov))
(Intercept) treat factor(site_pr)2 factor(site_pr)3
0.07685775 0.07066623 0.12388030 0.09271341
factor(site_pr)4 factor(site_pr)5 factor(site_pr)6 factor(site_pr)7
0.07174477 0.12894245 0.20605985 0.06962861
## Option 2: Adjusting for clustering
library(estimatr)
fit.lin2 <- lm_robust(as.numeric(dissent) ~ treat + factor(site_pr),
                      data = pg, se_type = "stata", clusters = sitcode)
sqrt(diag(vcov(fit.lin2)))
(Intercept) treat factor(site_pr)2 factor(site_pr)3
0.07685775 0.07066623 0.12388030 0.09271341
factor(site_pr)4 factor(site_pr)5 factor(site_pr)6 factor(site_pr)7
0.07174477 0.12894245 0.20605985 0.06962861
This is all great, but the authors don’t have a linear model, they have an ordered probit model. It turns out that people have developed similar robust standard error estimators to adjust the standard errors from these non-linear models. For more details, see slides from Molly Roberts on this point.
These take a slightly different form, but the intuition is the same. What we are altering is the structure of the variance-covariance matrix (which is a function of the Hessian in the likelihood case), making adjustments across all observations or clusters of observations.
We can also implement these in R using HC0 standard errors.
## Adjusting for clustered standard errors
fit.c2 <- polr(as.factor(dissent) ~ treat + factor(site_pr),
               data = pg, method = "probit", Hess = T)
clval <- vcovCL(fit.c2, type = "HC0", cluster = pg$sitcode)
You can compare these standard errors to column 2 from Table 3 in the paper.
sqrt(diag(clval))
treat factor(site_pr)2 factor(site_pr)3 factor(site_pr)4
0.06212521 0.11636326 0.07559847 0.06387804
factor(site_pr)5 factor(site_pr)6 factor(site_pr)7 1|2
0.10879657 0.16765260 0.06391195 0.07144512
2|3 3|4
0.07494704 0.08409084
The coeftest function in R, when applied with a new vcov, will reproduce the full regression output.
library(lmtest)
coeftest(fit.c2, vcov=clval)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
treat 0.284843 0.062125 4.5850 4.54e-06 ***
factor(site_pr)2 -0.174805 0.116363 -1.5022 0.1330371
factor(site_pr)3 -0.255136 0.075598 -3.3749 0.0007385 ***
factor(site_pr)4 -0.177318 0.063878 -2.7759 0.0055052 **
factor(site_pr)5 -0.305472 0.108797 -2.8077 0.0049892 **
factor(site_pr)6 -0.113847 0.167653 -0.6791 0.4970975
factor(site_pr)7 0.046596 0.063912 0.7291 0.4659633
1|2 -0.646720 0.071445 -9.0520 < 2.2e-16 ***
2|3 -0.190723 0.074947 -2.5448 0.0109348 *
3|4 0.138053 0.084091 1.6417 0.1006487
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
I got 99 problems, and standard errors are just one
However, robust standard errors cannot correct underlying model misspecification. Recall that the initial likelihood equation is what assumes independence; this is the equation we use for everything, not only for computing standard errors but also for estimating the coefficients themselves. If we truly believe we have dependence among our observations, we might consider using an entirely different likelihood function, one that incorporates this dependence, instead of just adjusting the standard errors after the fact.
Nonetheless, we may still think our model is good enough, even if not “correct.” So long as you recognize that the robust standard errors don’t correct for this underlying problem, some remain proponents of using robust standard errors in these cases. For example, Wooldridge recommends reporting robust standard errors.
8.6 Ordinal Tutorial
For this example, we will replicate a portion of the article “How Empathic Concern Fuels Partisan Polarization” by Elizabeth N. Simas, Scott Clifford, and Justin H. Kirkland, published in 2020 in the American Political Science Review. Replication files are available here.
Abstract. Over the past two decades, there has been a marked increase in partisan social polarization, leaving scholars in search of solutions to partisan conflict. The psychology of intergroup relations identifies empathy as one of the key mechanisms that reduces intergroup conflict, and some have suggested that a lack of empathy has contributed to partisan polarization. Yet, empathy may not always live up to this promise. We argue that, in practice, the experience of empathy is biased toward one’s ingroup and can actually exacerbate political polarization. First, using a large, national sample, we demonstrate that higher levels of dispositional empathic concern are associated with higher levels of affective polarization. Second, using an experimental design, we show that individuals high in empathic concern show greater partisan bias in evaluating contentious political events. Taken together, our results suggest that, contrary to popular views, higher levels of dispositional empathy actually facilitate partisan polarization.
We are going to replicate Study 1’s analysis testing Hypotheses 1 and 2. Here, the authors conduct an original survey fielded by YouGov during May 2016 with 1000 respondents.
Let’s load the data.
- How many observations do we have?
library(foreign)
emp <- read.dta("https://github.com/ktmccabe/teachingdata/blob/main/week6.dta?raw=true")
The authors’ first two hypotheses are:
- Empathic concern should predict more positive affect for copartisans, relative to outpartisans (H1).
- Empathic concern should increase negative affect for outpartisans (H2).
Outcome Measure: “To examine this type of partisan favoritism, we utilize responses to two questions asking respondents to rate the Democratic and Republican Parties on a seven-point scale ranging from ‘very favorable’ to ‘very unfavorable.’ We then subtract respondents’ ratings of the opposite party from their ratings of their own party to create an ordinal measure that ranges from six (highest inparty rating, lowest outparty rating) to −6 (lowest inparty rating and highest outparty rating).”
- affectpol: an ordinal measure that ranges from six (highest inparty rating, lowest outparty rating) to −6 (lowest inparty rating and highest outparty rating)
- outfav: rating of the opposing party on a seven-point scale ranging from “very favorable” to “very unfavorable”
Let’s take a look at these variables.
- Are they coded as the authors describe?
- What class are they currently?
table(emp$affectpol)
-6 -4 -3 -2 -1 0 1 2 3 4 5 6
1 1 4 5 11 106 80 80 103 148 137 116
class(emp$affectpol)
[1] "numeric"
table(emp$outfav)
1 2 3 4 5 6 7
431 163 73 85 16 16 8
class(emp$outfav)
[1] "numeric"
Independent Variables.

- empconc: mean of the empathic concern items from the Interpersonal Reactivity Index (IRI)
- Additional variables to measure other dimensions of empathy: empdist personal distress, emppers perspective taking, empfant fantasy
- Additional controls for strength of party identification pidext, ideological extremity ideoext, news interest news, a dummy variable for party membership dem, and demographics: educ, age, male, white, inc3miss (income)
What type of model could they use to test H1 and H2?
They choose to run an ordinal logistic regression. Let’s do as they do. To run an ordinal model in R, we need to make sure our outcome is ordinal! (meaning a factor variable)
emp$outfav <- as.factor(emp$outfav)
table(emp$outfav)
1 2 3 4 5 6 7
431 163 73 85 16 16 8
class(emp$outfav)
[1] "factor"
Go ahead and replicate the first regression in the table with all of the controls, making sure to treat income as a factor variable but education as a numeric variable.
- Note: your coefficients will not exactly match because the authors weight their data using survey weights.
library(MASS)
fit.emp <- polr(outfav ~ empconc + empdist + emppers + empfant +
                  pidext + ideoext + news + dem +
                  as.numeric(educ) + age + male + white + factor(inc3miss),
                data = emp, Hess = T, method = "logistic")
summary(fit.emp)
Call:
polr(formula = outfav ~ empconc + empdist + emppers + empfant +
+pidext + ideoext + news + dem + as.numeric(educ) + age +
male + white + factor(inc3miss), data = emp, Hess = T, method = "logistic")
Coefficients:
Value Std. Error t value
empconc -0.805771 0.492455 -1.63623
empdist 0.321857 0.413656 0.77808
emppers 0.433013 0.528121 0.81991
empfant -0.020821 0.408769 -0.05093
pidext -0.246698 0.095193 -2.59155
ideoext -0.556725 0.072652 -7.66288
news -0.557133 0.094192 -5.91487
dem 0.002573 0.156964 0.01639
as.numeric(educ) 0.071461 0.054630 1.30809
age -0.004130 0.004784 -0.86345
male -0.310108 0.160583 -1.93114
white -0.015491 0.176718 -0.08766
factor(inc3miss)2 -0.049340 0.196223 -0.25145
factor(inc3miss)3 -0.127770 0.200626 -0.63686
factor(inc3miss)4 -0.148389 0.242841 -0.61105
Intercepts:
Value Std. Error t value
1|2 -3.4355 0.6385 -5.3808
2|3 -2.3032 0.6308 -3.6513
3|4 -1.5674 0.6259 -2.5042
4|5 -0.2119 0.6325 -0.3351
5|6 0.3713 0.6467 0.5741
6|7 1.5460 0.7177 2.1541
Residual Deviance: 1828.862
AIC: 1870.862
(245 observations deleted due to missingness)
- Let’s take a look at how the weights affect the result by using the survey package.
  - There are many options in establishing an svydesign. Ours is a relatively simple case where all we have is a vector of weights. In other cases, samples might include information about the sampling units or strata.
  - Once we establish an svydesign object, we need to use svy commands for our operations, such as svymean or svyglm or svyolr.
## install.packages("survey", dependencies =T)
library(survey)
empd <- svydesign(ids = ~1, weights = emp$weight_group, data = emp)
fit.empw2 <- svyolr(outfav ~ empconc + empdist + emppers + empfant +
                      pidext + ideoext + news + dem +
                      as.numeric(educ) + age +
                      male + white + factor(inc3miss),
                    design = empd, method = "logistic")
summary(fit.empw2)
Call:
svyolr(outfav ~ empconc + empdist + emppers + empfant + +pidext +
ideoext + news + dem + as.numeric(educ) + age + male + white +
factor(inc3miss), design = empd, method = "logistic")
Coefficients:
Value Std. Error t value
empconc -1.4141333759 0.585070456 -2.4170309
empdist 0.7759996312 0.561768870 1.3813504
emppers 1.1700466598 0.672754809 1.7391874
empfant 0.9733692488 0.475484526 2.0471103
pidext -0.3281938237 0.124714241 -2.6315665
ideoext -0.4970353573 0.106530714 -4.6656531
news -0.5759629296 0.130609880 -4.4097960
dem -0.0456812052 0.195236242 -0.2339791
as.numeric(educ) 0.0284167135 0.065624317 0.4330211
age -0.0007305433 0.005493873 -0.1329742
male -0.1759866726 0.212493495 -0.8281979
white 0.0366164598 0.215144017 0.1701951
factor(inc3miss)2 -0.0470362687 0.226737718 -0.2074479
factor(inc3miss)3 -0.0794217660 0.246811912 -0.3217907
factor(inc3miss)4 -0.0588030650 0.330225820 -0.1780693
Intercepts:
Value Std. Error t value
1|2 -2.9118 0.9199 -3.1652
2|3 -1.7749 0.9029 -1.9658
3|4 -1.0853 0.9058 -1.1982
4|5 0.3080 0.8948 0.3442
5|6 0.9101 0.9006 1.0106
6|7 2.8740 0.9296 3.0918
(245 observations deleted due to missingness)
The weights seem to make a difference! Now we are closely matching what the authors report. The use of survey weights represents yet another point of researcher discretion.
Let’s use the weighted results and proceed to make them easier to interpret. Recall, H2 was: Empathic concern should increase negative affect for outpartisans (H2).
We want to show how negative affect toward the outparty changes across levels of empathic concern. How should we visualize this?
- Could calculate probabilities of being in each of the seven outfav categories across different levels of empathic concern.
- Could calculate probabilities of being in theoretically interesting outfav categories across different levels of empathic concern.
Note: in each case, we need to decide where to set our covariate values and potentially also calculate uncertainty estimates.
What do they do? (focus on the right side for the out-party measure)
Let’s estimate the probability of being in the lowest category for empathy values from 0 to 1 by .2 intervals. Let’s set all covariates at their observed values.
- We could do this in
predict
or manually. We will do it manually for now.
## Set covariates to particular values (here, we hold at observed)
X <- model.matrix(fit.empw2)
X <- X[, -1] # remove intercept
X[, "empconc"] <- 0
# Find Xb and zeta
coef(fit.empw2)
empconc empdist emppers empfant
-1.4141333759 0.7759996312 1.1700466598 0.9733692488
pidext ideoext news dem
-0.3281938237 -0.4970353573 -0.5759629296 -0.0456812052
as.numeric(educ) age male white
0.0284167135 -0.0007305433 -0.1759866726 0.0366164598
factor(inc3miss)2 factor(inc3miss)3 factor(inc3miss)4 1|2
-0.0470362687 -0.0794217660 -0.0588030650 -2.9118275702
2|3 3|4 4|5 5|6
-1.7749101884 -1.0852947154 0.3080104278 0.9101173775
6|7
2.8740344326
# this piece [1:ncol(X)] makes sure we omit the zetas
# this is only necessary for svyolr. The polr function already omits zetas from coef()
b <- coef(fit.empw2)[1:ncol(X)]
eta <- X %*% b
zeta <- fit.empw2$zeta

## Find Pr(lowest category)
emp0 <- mean(plogis(zeta[1] - eta))
emp0
[1] 0.3025648
## Repeat for each value of empconc of interest
"empconc"] <- .2
X[,# Find Xb and zeta
<- X %*% coef(fit.empw2)[1:ncol(X)]
eta <- fit.empw2$zeta
zeta ## Find Pr(lowest category)
<- mean(plogis(zeta[1] - eta))
emp2 emp2
[1] 0.3555723
## Or put it in a function to be faster
findpr <- function(val){
  X[, "empconc"] <- val
  # Find Xb and zeta
  eta <- X %*% coef(fit.empw2)[1:ncol(X)]
  zeta <- fit.empw2$zeta
  ## Find Pr(lowest category)
  pr <- mean(plogis(zeta[1] - eta))
  return(pr)
}

## Does it work? Test
findpr(0)
[1] 0.3025648
## Repeat for all values of empathy
emp.prs <- sapply(seq(0, 1, .2), findpr)
emp.prs
[1] 0.3025648 0.3555723 0.4116936 0.4696702 0.5281235 0.5856687
We can visualize these estimates similar to the authors.
plot(x=seq(0, 1, .2),
y=emp.prs,
ylim = c(.1, .7),
type="l",
xlab = "Empathic Concern",
ylab = "",
main = "Outparty Favorability \n Pr(Outparty Very Unfavorable)",
cex.main=.8)
We could add lines for all categories. We’ll just add it to the function for now.
findprall <- function(val){
  X[, "empconc"] <- val
  # Find Xb and zeta
  eta <- X %*% coef(fit.empw2)[1:ncol(X)]
  zeta <- fit.empw2$zeta
  ## Find Pr(7th lowest category)
  pr7 <- mean(1 - plogis(zeta[6] - eta))
  ## Find Pr(6th lowest category)
  pr6 <- mean(plogis(zeta[6] - eta) - plogis(zeta[5] - eta))
  ## Find Pr(5th lowest category)
  pr5 <- mean(plogis(zeta[5] - eta) - plogis(zeta[4] - eta))
  ## Find Pr(4th lowest category)
  pr4 <- mean(plogis(zeta[4] - eta) - plogis(zeta[3] - eta))
  ## Find Pr(3rd lowest category)
  pr3 <- mean(plogis(zeta[3] - eta) - plogis(zeta[2] - eta))
  ## Find Pr(2nd lowest category)
  pr2 <- mean(plogis(zeta[2] - eta) - plogis(zeta[1] - eta))
  ## Find Pr(lowest category)
  pr1 <- mean(plogis(zeta[1] - eta))
  return(c(pr1, pr2, pr3, pr4, pr5, pr6, pr7))
}

## Repeat for all values of empathy
emp.prsall <- sapply(seq(0, 1, .2), findprall)
A note on the survey weights:
- When we take the mean() of our estimates for each observation, we are treating each row of our \(X\) matrix as if it should have equal weight in this mean. This is normally fine, but with weighted data, we might instead want to weight the mean according to the survey weights we used in our regression. The function weighted.mean can facilitate this. Either approach is valid, but you may want to remember this distinction and the possibility that you could even incorporate the survey weights at this final stage, as in the hypothetical sketch below.
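A hypothetical sketch of that final-stage weighting, assuming w holds the survey weights for the rows of X (the estimation sample after listwise deletion); the indexing below is one possible way to line them up, not code from the original analysis:

w <- weights(empd)[as.numeric(rownames(X))] # assumed alignment of weights to X rows
weighted.mean(plogis(zeta[1] - eta), w = w)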
We can add these lines to the plot. Yikes! A bit messy. You can see why they focus on the first category only.
plot(x=seq(0, 1, .2),
y=emp.prsall[1,],
ylim = c(0, 1),
type="l",
xlab = "Empathic Concern",
ylab = "",
main = "Outparty Favorability \n Pr(Outparty Very Unfavorable)",
cex.main=.8)
points(x=seq(0, 1, .2), y=emp.prsall[2,], type="l", lty=2)
points(x=seq(0, 1, .2), y=emp.prsall[3,], type="l", lty=3)
points(x=seq(0, 1, .2), y=emp.prsall[4,], type="l", lty=4)
points(x=seq(0, 1, .2), y=emp.prsall[5,], type="l", lty=5)
points(x=seq(0, 1, .2), y=emp.prsall[6,], type="l", lty=6)
points(x=seq(0, 1, .2), y=emp.prsall[7,], type="l", lty=7)
legend("topleft", lty=1:7, c("Very Unfav", "2", "3", "4", "5", "6", "Very Fav"), cex=.6)
A last step would be to calculate uncertainty. Just like before, we could use simulation or the bootstrap method. In the bootstrap method, the manual calculations would be the “meat” of the bootstrap function that you used in the previous course notes section.
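A minimal bootstrap sketch along those lines (object names are ours): we resample rows of emp, refit the weighted model, and recompute the probability of the lowest category at empconc = 0, holding other covariates at observed values.

set.seed(1234)
boot.est <- rep(NA, 500)
for (i in 1:500) {
  ## Resample rows with replacement and rebuild the design and model
  emp.b <- emp[sample(1:nrow(emp), nrow(emp), replace = TRUE), ]
  empd.b <- svydesign(ids = ~1, weights = emp.b$weight_group, data = emp.b)
  fit.b <- svyolr(outfav ~ empconc + empdist + emppers + empfant +
                    pidext + ideoext + news + dem +
                    as.numeric(educ) + age + male + white + factor(inc3miss),
                  design = empd.b, method = "logistic")
  ## Recompute the quantity of interest on each resample
  X.b <- model.matrix(fit.b)[, -1] # remove intercept
  X.b[, "empconc"] <- 0
  eta.b <- X.b %*% coef(fit.b)[1:ncol(X.b)]
  boot.est[i] <- mean(plogis(fit.b$zeta[1] - eta.b))
}
quantile(boot.est, c(.025, .975)) # a 95% bootstrap interval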
A special note: The svyolr
function does not appear compatible with the prediction
function. As an alternative, we could fit the polr
model with an extra argument for weights. These should produce the same coefficient estimates, though the standard errors might be incorrect. You could potentially use this to generate the raw probability estimates. See for example, estimates for being in category 1 below:
fit.emp2 <- polr(outfav ~ empconc + empdist + emppers + empfant +
                   pidext + ideoext + news + dem +
                   as.numeric(educ) + age + male + white + factor(inc3miss),
                 data = emp, Hess = T, method = "logistic", weights = emp$weight_group)
Warning in eval(family$initialize): non-integer #successes in a binomial glm!
summary(fit.emp2)
Call:
polr(formula = outfav ~ empconc + empdist + emppers + empfant +
+pidext + ideoext + news + dem + as.numeric(educ) + age +
male + white + factor(inc3miss), data = emp, weights = emp$weight_group,
Hess = T, method = "logistic")
Coefficients:
Value Std. Error t value
empconc -1.4141936 0.479987 -2.9463
empdist 0.7759968 0.412555 1.8810
emppers 1.1700975 0.527905 2.2165
empfant 0.9733691 0.406788 2.3928
pidext -0.3281915 0.097894 -3.3525
ideoext -0.4970277 0.078991 -6.2922
news -0.5759557 0.094099 -6.1207
dem -0.0456691 0.161136 -0.2834
as.numeric(educ) 0.0284158 0.056125 0.5063
age -0.0007309 0.004667 -0.1566
male -0.1759908 0.164116 -1.0724
white 0.0366286 0.168809 0.2170
factor(inc3miss)2 -0.0470201 0.205414 -0.2289
factor(inc3miss)3 -0.0794134 0.207121 -0.3834
factor(inc3miss)4 -0.0587784 0.233469 -0.2518
Intercepts:
Value Std. Error t value
1|2 -2.9118 0.6315 -4.6109
2|3 -1.7749 0.6251 -2.8394
3|4 -1.0853 0.6211 -1.7473
4|5 0.3080 0.6257 0.4923
5|6 0.9101 0.6363 1.4303
6|7 2.8741 0.7763 3.7022
Residual Deviance: 1862.152
AIC: 1904.152
(245 observations deleted due to missingness)
## marginaleffects
preds.me <- avg_predictions(fit.emp2, by = "empconc",
                            newdata = datagridcf(empconc = seq(0, 1, .2)))
subset(preds.me, group==1)
Group Estimate Std. Error z Pr(>|z|) S CI low CI high
1 0.303 0.0616 4.92 <0.001 20.1 0.182 0.423
1 0.356 0.0485 7.34 <0.001 42.0 0.261 0.451
1 0.412 0.0333 12.36 <0.001 114.1 0.346 0.477
1 0.470 0.0200 23.54 <0.001 404.4 0.431 0.509
1 0.528 0.0201 26.33 <0.001 505.1 0.489 0.567
1 0.586 0.0333 17.60 <0.001 227.9 0.520 0.651
Columns: group, empconc, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
## prediction
preds <- prediction(fit.emp2, at = list(empconc = seq(0, 1, .2)), category = 1)
Warning in check_values(data, at): A 'at' value for 'empconc' is outside
observed data range (0.0357142873108387,1)!
round(summary(preds), digits=3)
at(empconc) Prediction SE z p lower upper
0.0 0.303 NA NA NA NA NA
0.2 0.356 NA NA NA NA NA
0.4 0.412 NA NA NA NA NA
0.6 0.470 NA NA NA NA NA
0.8 0.528 NA NA NA NA NA
1.0 0.586 NA NA NA NA NA
## Compare with the manual approach
round(emp.prsall[1,], digits=3) # category 1
[1] 0.303 0.356 0.412 0.470 0.528 0.586