13.2 Survey R package
Let’s assume for now that we have some survey data, and those data contain some information about the sampling process, as well as survey weights. With this information, we are ready to use the survey
package to conduct analysis.
The survey
package by Thomas Lumley allows you to declare the nature of the survey design.
- Load your dataset as usual into a dataframe in R
- Construct the variables you will use in analysis
- Declare to R the survey design by assigning your data frame as a new survey object.
- Perform analyses using functions from the
survey
package
svydesign
(data =
…,
weights=
…,
fpc=
…,
strata =
…,
id =
…)
data
: data frameweights
: sampling weightsfpc
: Finite population correction (less common)strata
: Formula or vector specifying strata (NULL by default)ids
: Primary sampling unit. Formula or data frame with cluster ids. Use ~1 or ~0 if no clusters.
13.2.1 Examples of specifying svydesigns
The easiest, but also common, way to specify a survey design occurs when you have survey data and a column of survey weights that the survey firm has provided– no clusters or strata.
Let’s look at this using data from within the survey
package. The apisrs
represents a simple random sample with a column pw
for survey sampling weights. We can specify the design below:
- See also, a detailed example from Pew analyzing one of their datasets with the survey package.
Example: Sample of students in California.
library(survey)
data(api) # loads several dataframes
## Sample with just weights, no clusters/strata
head(apisrs)
cds stype name sname snum
1039 15739081534155 H McFarland High McFarland High 1039
1124 19642126066716 E Stowers (Cecil Stowers (Cecil B.) Elementary 1124
2868 30664493030640 H Brea-Olinda Hig Brea-Olinda High 2868
1273 19644516012744 E Alameda Element Alameda Elementary 1273
4926 40688096043293 E Sunnyside Eleme Sunnyside Elementary 4926
2463 19734456014278 E Los Molinos Ele Los Molinos Elementary 2463
dname dnum cname cnum flag pcttest api00
1039 McFarland Unified 432 Kern 14 NA 98 462
1124 ABC Unified 1 Los Angeles 18 NA 100 878
2868 Brea-Olinda Unified 79 Orange 29 NA 98 734
1273 Downey Unified 187 Los Angeles 18 NA 99 772
4926 San Luis Coastal Unified 640 San Luis Obispo 39 NA 99 739
2463 Hacienda la Puente Unif 284 Los Angeles 18 NA 93 835
api99 target growth sch.wide comp.imp both awards meals ell yr.rnd
1039 448 18 14 No Yes No No 44 31 <NA>
1124 831 NA 47 Yes Yes Yes Yes 8 25 <NA>
2868 742 3 -8 No No No No 10 10 <NA>
1273 657 7 115 Yes Yes Yes Yes 70 25 <NA>
4926 719 4 20 Yes Yes Yes Yes 43 12 <NA>
2463 822 NA 13 Yes Yes Yes No 16 19 <NA>
mobility acs.k3 acs.46 acs.core pct.resp not.hsg hsg some.col col.grad
1039 6 NA NA 24 82 44 34 12 7
1124 15 19 30 NA 97 4 10 23 43
2868 7 NA NA 28 95 5 9 21 41
1273 23 23 NA NA 100 37 40 14 8
4926 12 20 29 NA 91 8 21 27 34
2463 13 19 29 NA 71 1 8 20 38
grad.sch avg.ed full emer enroll api.stu pw fpc
1039 3 1.91 71 35 477 429 30.97 6194
1124 21 3.66 90 10 478 420 30.97 6194
2868 24 3.71 83 18 1410 1287 30.97 6194
1273 1 1.96 85 18 342 291 30.97 6194
4926 10 3.17 100 0 217 189 30.97 6194
2463 34 3.96 75 20 258 211 30.97 6194
## Specify design
<- svydesign(data = apisrs,
apisrs_design weights = ~pw,
id = ~1)
Example: Sample of students in California. Samples stratified by School type
Here is an example with a stratified random sample with a known finite population variable. Here, we had the fpc
column and strata column stype
. We still supply the weights pw
.
## Use apistrat
<- svydesign(data = apistrat,
apistrat_design weights = ~pw,
fpc = ~fpc, # population of school types known
id = ~1, # no clusters
strata = ~stype)
Example: Sample of students in California by clustering.
Here is a more complex example.
- Samples clustered by school districts
dnum
. - Within district, five schools were sampled
snum
. - Fpc for district and school are
fpc1
andfpc2
<- svydesign(data = apiclus2,
apiclus_design weights = ~pw,
fpc = ~fpc1 + fpc2,
id = ~dnum + snum) # start from biggest clu
For more established surveys, often the codebook might include documentation on how to account for survey weights and sampling design. For example, the American National Election Study will generally include this guidance in recent codebooks. Here is an example accounting for the sampling design in the 2016 ANES.
You can use foreign
, rio
, or what we will use here, haven
to load the .dta data.
library(haven)
<- read_dta("https://github.com/ktmccabe/teachingdata/blob/main/anes_timeseries_2016_Stata12.dta?raw=true") an
One benefit of loading data through haven
and some of the newer packages is it will help retain value labels from the original data that R might not otherwise load. (E.g., labels that would show up in SPSS or Stata but seem lost in R.)
Let’s check this for an example: “How well does feminist describe you?
library(tidyverse)
$V161346 %>% attr("labels") an
-9. Refused
-9
-5. Interview breakoff (sufficient partial IW)
-5
1. Extremely well
1
2. Very well
2
3. Somewhat well
3
4. Not very well
4
5. Not at all
5
OK enough about that. Let’s specify the survey design.
<-
anes_design svydesign(
ids = ~V160202 ,
strata = ~V160201 ,
data = an,
weights = ~V160102,
nest = TRUE)
The nest
argument indicates if cluster ids should be relabeled to align with strata.
13.2.2 Working with survey designs
The good news is that once you have specified the survey design, the rest of the analysis will flow similarly across types of survey designs. The bad news is that you will need to use special functions to analyze the data instead of the functions we are used to working with. These generally start with svy
….. but the good news is, they are not that different from our normal functions!
Here are a few examples of common functions using the survey design.
## Means and uncertainty of variables
## Note: we put design where you would normally see data
svymean(~mobility, design = apiclus_design)
mean SE
mobility 18.154 1.471
SE(svymean(~mobility, design = apiclus_design))
mobility
mobility 1.471006
confint(svymean(~mobility, design = apiclus_design))
2.5 % 97.5 %
mobility 15.27039 21.03662
## Linear Regression
<- svyglm(mobility ~ meals + awards, family = "gaussian",
fit.lin design = apiclus_design)
## Logit Regression
<- svyglm(awards ~ meals,
fit.logit family = binomial(link = "logit"),
design = apiclus_design)
Warning in eval(family$initialize): non-integer #successes in a binomial glm!
For an example of ordinal logistic regression see section 8.6 of the course notes.
## Can also subset like normal with subset command
<- subset(apiclus_design, meals > 1) apiclus_design2
- The Pew research center has also written a tutorial for using various
tidyverse
tools with survey data. See here for details. - In addition, there has been a package
srvyr
recently developed that incorporates more tidyverse language with the same set of survey tools. See this tutorial for details and some analysis examples. - Our friends the
prediction
andmargins
packages have some capabilities for models fit through thesurvey
package.- With the
margins
package, you will want to specify thedesign
in order to make sure the average marginal effects reported are based on the survey design object. Example:
- With the
library(margins)
<-margins(fit.logit, variables = "meals",
res design=apiclus_design,
change="iqr")
summary(res)
factor AME SE z p lower upper
meals -0.1018 0.0592 -1.7208 0.0853 -0.2178 0.0142
13.2.3 Why do we need the survey
package?
The benefits of the survey package really come from the tool’s specialty in calculating appropriate variances for the survey design and accounting for very complex stratified or clustered designs. There are other weighting functions in R that will allow you to recover the same point estimates as the survey tools in some cases. However, it won’t always be the case that they will give you accurate measures of uncertainty.
If you don’t need measures of uncertainty, then other functions from base R, such as weighted.mean
can also work in cases where you just have a column of survey weights.
weighted.mean(apisrs$mobility, w=apisrs$pw)
[1] 17.46
svymean(~mobility, design = apisrs_design)
mean SE
mobility 17.46 0.6529