13 Survey Data

This section will provide a brief introduction to tools available in R for the analysis of survey data.

Here are some supplemental resources:

The primary survey package we will be using is from Thomas Lumley, appropriately called survey.
- Recently, there has been a package srvyr that incorporates more tidyverse language. See this tutorial for details.
The Pew Research center has released a tutorial for the analysis of their survey datasets. This can be applied more generally to analyze other types of survey data.

13.1 What’s the fuss?

Let’s first discuss why we might even want special methods for analyzing survey data in the first place.

When you have survey data, oftentimes as researchers are interested in making inferences from a sample available in the survey to a broader population. However, there may be many ways in which our sample is not representative of the population. The represenativeness of a survey is one component of what is called the “total survey error” framework.

Matt Salganik’s book Bit by Bit describes this in section 3.3

In the ideal case, we would draw a simple random sample from the population using probability sampling.

Sample n units from an N population where each unit has equal probability of being sampled. We can get sample statistics very simply in this case.
- For example, the simple mean of a variable measured in the sample would be equivalent to the expected value in the population.

\[\begin{align*} \bar y &= \frac{1}{n}\sum_{i=1}^n y_i\\ \mathbf E(\bar y) &= \mu \end{align*}\]

In this estimator, we assume that all sample units \(i\) represent the same number of units in the population. When could this go wrong? Basically, almost any time we actually try to sample from a population in practice. It is very, very hard to get simple random samples of the population.

Sometimes the way survey samples are collected are through complex designs. Example:

Stratified sample of the United States

Consider each county a “strata”
Conduct a random sample within strata
Why? If you conduct a simple random sample at the individual-level, just by chance you might not not sample within each county

Clustered random sample within the United States

Consider each county
Sample \(m\) number of counties (these are considered to be “clusters”)
Sample within each cluster

When you have generated a sample using cluster or stratified sampling, it is best to account for this data generating process in the analysis to get accurate estimates for sample averages and variances. The R package we will use will account for this.

13.1.1 Weighting Surveys

In many cases, our survey data are not perfectly representative of the population. In these cases, often a researcher might want to employ weights to adjust the estimates calculated from a sample in order to make them more accurate for the population of interest.

What was the problem: Each unit \(i\) in the sample no longer represents the same number of units in the cluster, strata, or population. Some types of people might be “overrepresented” in the sample relative to the population, and some might be underrepresented.
One Solution: Weights \(w\) to reflect/ adjust for the number of units in the population that the sampled unit \(i\) represents.
What do we need? Auxiliary information about the target population so we know how to adjust the data
- Where do we get this? Sometimes it is provided by survey firms or described in the codebook of existing surveys. Other times, you could consider constructing your own weights.

Example: \[\begin{align*} \bar y &= \sum_{i=1}^n y_i*w_i\\ \end{align*}\]

When will weights matter?

Weights will matter particularly when your data are unrepresentative on characteristics that directly influence your outcomes of interest.
- For example, let’s say you had a survey that was representative except for age. You had too many young people in the sample relative to the population. Let’s say you are interested in predicting the proportion of people who voted for Biden in 2020. If age did not matter for vote choice, then it might not matter that your survey was unrepresentative by age. But, if age does matter, then the unweighted estimate from your sample might be biased!

Where can weights go wrong?

Let’s be real here. Weighting is an art as much or more than a science.
- It is not always immediately obvious what the target population is (e.g., if the target population is the set of people who will vote in an election).
- It is not always immediately obvious how to get accurate data on this target population, even if known (e.g., not every population as up-to-date Census information).
- It is not always obvious which variables to choose to weight on (e.g., which demographics?)
- You may also not have all variables of interest available in your sample or at the population level.
- Missing cells. Even if you have all of the above, weighting can still be insufficient if your sample simply does not contain certain subgroups of the population or contains too few members of a certain subgroup of the population. (e.g., suppose your sample only includes 18-25 year olds – it will be hard to infer things about older populations.)

For example, The Upshot gave different polling firms the same survey data. They each came up with different estimates of vote choice in 2016 due to small differences in choices about weighting and identifying the target population.The linked article describes the different choices the pollsters likely made.

13.1.2 Broad Types of Survey Sampling Techniques

Survey sampling techniques are sometimes broken into two types: non-probability and probability sampling.

In probability samples, the probability that a respondent is selected for the survey is known, and this helps ensure that the sample will be representative. However, what is becoming increasingly less known, is the probability that a person will respond to a survey. Because this is less well-known, even probability samples will need adjustments for weighting to get accurate estimates.

Non-probability samples come in many different forms, such as those that use “quota sampling,” convenience samples from online labor markets, or more sophisticated algorithms for choosing respondents that reflect the population. Pew describes different types of non-probability samples here.

Matt Salganik, as part of the curriculum for the Summer Institutes in Computational Social Science, provides a nice overview of the tradeoffs between probability and non-probability samples and what this means for survey weighting.

Pew provides a somewhat more pessimistic take on nonprobability samples in this report.

13.2 Survey R package

Let’s assume for now that we have some survey data, and those data contain some information about the sampling process, as well as survey weights. With this information, we are ready to use the survey package to conduct analysis.

The survey package by Thomas Lumley allows you to declare the nature of the survey design.

Load your dataset as usual into a dataframe in R
Construct the variables you will use in analysis
Declare to R the survey design by assigning your data frame as a new survey object.
Perform analyses using functions from the survey package

svydesign(data = …, weights= …, fpc= …, strata = …, id =…)

data: data frame
weights: sampling weights
fpc: Finite population correction (less common)
strata: Formula or vector specifying strata (NULL by default)
ids: Primary sampling unit. Formula or data frame with cluster ids. Use ~1 or ~0 if no clusters.

13.2.1 Examples of specifying svydesigns

The easiest, but also common, way to specify a survey design occurs when you have survey data and a column of survey weights that the survey firm has provided– no clusters or strata.

Let’s look at this using data from within the survey package. The apisrs represents a simple random sample with a column pw for survey sampling weights. We can specify the design below:

See also, a detailed example from Pew analyzing one of their datasets with the survey package.

Example: Sample of students in California.

library(survey)
data(api) # loads several dataframes 

## Sample with just weights, no clusters/strata
head(apisrs)

                cds stype            name                         sname snum
1039 15739081534155     H  McFarland High                McFarland High 1039
1124 19642126066716     E Stowers (Cecil  Stowers (Cecil B.) Elementary 1124
2868 30664493030640     H Brea-Olinda Hig              Brea-Olinda High 2868
1273 19644516012744     E Alameda Element            Alameda Elementary 1273
4926 40688096043293     E Sunnyside Eleme          Sunnyside Elementary 4926
2463 19734456014278     E Los Molinos Ele        Los Molinos Elementary 2463
                        dname dnum           cname cnum flag pcttest api00
1039        McFarland Unified  432            Kern   14   NA      98   462
1124              ABC Unified    1     Los Angeles   18   NA     100   878
2868      Brea-Olinda Unified   79          Orange   29   NA      98   734
1273           Downey Unified  187     Los Angeles   18   NA      99   772
4926 San Luis Coastal Unified  640 San Luis Obispo   39   NA      99   739
2463  Hacienda la Puente Unif  284     Los Angeles   18   NA      93   835
     api99 target growth sch.wide comp.imp both awards meals ell yr.rnd
1039   448     18     14       No      Yes   No     No    44  31   <NA>
1124   831     NA     47      Yes      Yes  Yes    Yes     8  25   <NA>
2868   742      3     -8       No       No   No     No    10  10   <NA>
1273   657      7    115      Yes      Yes  Yes    Yes    70  25   <NA>
4926   719      4     20      Yes      Yes  Yes    Yes    43  12   <NA>
2463   822     NA     13      Yes      Yes  Yes     No    16  19   <NA>
     mobility acs.k3 acs.46 acs.core pct.resp not.hsg hsg some.col col.grad
1039        6     NA     NA       24       82      44  34       12        7
1124       15     19     30       NA       97       4  10       23       43
2868        7     NA     NA       28       95       5   9       21       41
1273       23     23     NA       NA      100      37  40       14        8
4926       12     20     29       NA       91       8  21       27       34
2463       13     19     29       NA       71       1   8       20       38
     grad.sch avg.ed full emer enroll api.stu    pw  fpc
1039        3   1.91   71   35    477     429 30.97 6194
1124       21   3.66   90   10    478     420 30.97 6194
2868       24   3.71   83   18   1410    1287 30.97 6194
1273        1   1.96   85   18    342     291 30.97 6194
4926       10   3.17  100    0    217     189 30.97 6194
2463       34   3.96   75   20    258     211 30.97 6194

## Specify design
apisrs_design <- svydesign(data = apisrs, 
                             weights = ~pw, 
                             id = ~1)

Example: Sample of students in California. Samples stratified by School type

Here is an example with a stratified random sample with a known finite population variable. Here, we had the fpc column and strata column stype. We still supply the weights pw.

## Use apistrat
apistrat_design <- svydesign(data = apistrat, 
                             weights = ~pw, 
                             fpc = ~fpc, # population of school types known
                             id = ~1, # no clusters
                             strata = ~stype)

Example: Sample of students in California by clustering.

Here is a more complex example.

Samples clustered by school districts dnum.
Within district, five schools were sampled snum.
Fpc for district and school are fpc1 and fpc2

apiclus_design <- svydesign(data = apiclus2, 
                             weights = ~pw, 
                             fpc = ~fpc1 + fpc2, 
                             id = ~dnum + snum) # start from biggest clu

For more established surveys, often the codebook might include documentation on how to account for survey weights and sampling design. For example, the American National Election Study will generally include this guidance in recent codebooks. Here is an example accounting for the sampling design in the 2016 ANES.

You can use foreign, rio, or what we will use here, haven to load the .dta data.

library(haven)
an <- read_dta("https://github.com/ktmccabe/teachingdata/blob/main/anes_timeseries_2016_Stata12.dta?raw=true")

One benefit of loading data through haven and some of the newer packages is it will help retain value labels from the original data that R might not otherwise load. (E.g., labels that would show up in SPSS or Stata but seem lost in R.)

Let’s check this for an example: “How well does feminist describe you?

library(tidyverse)
an$V161346 %>% attr("labels")

                                   -9. Refused 
                                            -9 
-5. Interview breakoff (sufficient partial IW) 
                                            -5 
                             1. Extremely well 
                                             1 
                                  2. Very well 
                                             2 
                              3. Somewhat well 
                                             3 
                              4. Not very well 
                                             4 
                                 5. Not at all 
                                             5

OK enough about that. Let’s specify the survey design.

anes_design <-
    svydesign( 
        ids = ~V160202 , 
        strata = ~V160201 , 
        data = an, 
        weights = ~V160102,
        nest = TRUE)

The nest argument indicates if cluster ids should be relabeled to align with strata.

13.2.2 Working with survey designs

The good news is that once you have specified the survey design, the rest of the analysis will flow similarly across types of survey designs. The bad news is that you will need to use special functions to analyze the data instead of the functions we are used to working with. These generally start with svy….. but the good news is, they are not that different from our normal functions!

Here are a few examples of common functions using the survey design.

## Means and uncertainty of variables
## Note: we put design where you would normally see data
svymean(~mobility, design = apiclus_design)

           mean    SE
mobility 18.154 1.471

SE(svymean(~mobility, design = apiclus_design))

         mobility
mobility 1.471006

confint(svymean(~mobility, design = apiclus_design))

            2.5 %   97.5 %
mobility 15.27039 21.03662

## Linear Regression
fit.lin <- svyglm(mobility ~ meals + awards, family = "gaussian",
       design = apiclus_design)

## Logit Regression
fit.logit <- svyglm(awards ~ meals, 
              family = binomial(link = "logit"),
       design = apiclus_design)

Warning in eval(family$initialize): non-integer #successes in a binomial glm!

For an example of ordinal logistic regression see section 8.6 of the course notes.

## Can also subset like normal with subset command
apiclus_design2 <- subset(apiclus_design, meals > 1)

The Pew research center has also written a tutorial for using various tidyverse tools with survey data. See here for details.
In addition, there has been a package srvyr recently developed that incorporates more tidyverse language with the same set of survey tools. See this tutorial for details and some analysis examples.
Our friends the marginaleffects, prediction and margins packages have some capabilities for models fit through the survey package.
- With the margins package, you will want to specify the design in order to make sure the average marginal effects reported are based on the survey design object. Example:

library(margins)
res <-margins(fit.logit, variables = "meals",
              design=apiclus_design, 
              change="iqr")
summary(res)

 factor     AME     SE       z      p   lower  upper
  meals -0.1018 0.0592 -1.7208 0.0853 -0.2178 0.0142

13.2.3 Why do we need the `survey` package?

The benefits of the survey package really come from the tool’s specialty in calculating appropriate variances for the survey design and accounting for very complex stratified or clustered designs. There are other weighting functions in R that will allow you to recover the same point estimates as the survey tools in some cases. However, it won’t always be the case that they will give you accurate measures of uncertainty.

If you don’t need measures of uncertainty, then other functions from base R, such as weighted.mean can also work in cases where you just have a column of survey weights.

weighted.mean(apisrs$mobility, w=apisrs$pw)

[1] 17.46

svymean(~mobility, design = apisrs_design)

          mean     SE
mobility 17.46 0.6529

13.3 Constructing your own weights

Sometimes you might have collected your own survey data, which you know is not representative, but no weights are provided for you.

If you think that weighting the data will be important, then you can construct your own weights! We can:

Collect information on our respondents to compare sample to population (including demographic questions on your survey)
Gather statistics on the population (e.g., Census data)
Construct weights to indicate how many units in the population each \(i\) unit in our sample should represent
- One example of constructing your own weights is through a process called Raking. Here, we find weights so that the weighted distributions of your variables in your sample match the distributions of the variables in the general population very closely. How?
  1. Decide the variables on which you want to weight your data (e.g., gender, education, age)
  2. Find the proportion that each subgroup within these variables exist in your target population (e.g., maybe .55 women, .45 men)
  3. Use a raking algorithm to adjust your sample data to, when weighted, reflect these distributions in the population. For example, perhaps your sample had 70% women. After raking, we would hope that when svymean is applied, you would find the weighted proportion in your data is 55%.

This R bloggers tutorial goes through an example using the survey package and function rake.

In addition, Pew has released an R package with similar weighting capabilities described here.

Raking is not the only type of weighting that is possible. Pew describes and evaluates other weighting processes here.

Several pollsters interested in measuring vote choice and public opinion have started turning to a process called multilevel regression and post-stratfication (MRP or, fondly, Mr. P).

This “multilevel” refers to the same multilevel modeling approach we discussed, using lme4. An R introduction to this is here with a companion paper here that has citations to political science examples using MRP.
The technique has been around for a long time, but now is starting to infiltrate mainstream polling in addition to political science.
- See Andy Gelman’s take on it here along with a debate with Nate Silver here.
- For example, CBS used MRP in their 2020 election tracking .
- Doug Rivers provided a short course on MRP at an AAPOR conference. Slides here.