13.2 Survey R package

Let’s assume for now that we have some survey data, and those data contain some information about the sampling process, as well as survey weights. With this information, we are ready to use the survey package to conduct analysis.

The survey package by Thomas Lumley allows you to declare the nature of the survey design.

  1. Load your dataset as usual into a dataframe in R
  2. Construct the variables you will use in analysis
  3. Declare to R the survey design by assigning your data frame as a new survey object.
  4. Perform analyses using functions from the survey package

svydesign(data = …, weights= …, fpc= …, strata = …, id =…)

  • data: data frame
  • weights: sampling weights
  • fpc: Finite population correction (less common)
  • strata: Formula or vector specifying strata (NULL by default)
  • ids: Primary sampling unit. Formula or data frame with cluster ids. Use ~1 or ~0 if no clusters.

13.2.1 Examples of specifying svydesigns

The easiest, but also common, way to specify a survey design occurs when you have survey data and a column of survey weights that the survey firm has provided– no clusters or strata.

Let’s look at this using data from within the survey package. The apisrs represents a simple random sample with a column pw for survey sampling weights. We can specify the design below:

  • See also, a detailed example from Pew analyzing one of their datasets with the survey package.

Example: Sample of students in California.

library(survey)
data(api) # loads several dataframes 

## Sample with just weights, no clusters/strata
head(apisrs)
                cds stype            name                         sname snum
1039 15739081534155     H  McFarland High                McFarland High 1039
1124 19642126066716     E Stowers (Cecil  Stowers (Cecil B.) Elementary 1124
2868 30664493030640     H Brea-Olinda Hig              Brea-Olinda High 2868
1273 19644516012744     E Alameda Element            Alameda Elementary 1273
4926 40688096043293     E Sunnyside Eleme          Sunnyside Elementary 4926
2463 19734456014278     E Los Molinos Ele        Los Molinos Elementary 2463
                        dname dnum           cname cnum flag pcttest api00
1039        McFarland Unified  432            Kern   14   NA      98   462
1124              ABC Unified    1     Los Angeles   18   NA     100   878
2868      Brea-Olinda Unified   79          Orange   29   NA      98   734
1273           Downey Unified  187     Los Angeles   18   NA      99   772
4926 San Luis Coastal Unified  640 San Luis Obispo   39   NA      99   739
2463  Hacienda la Puente Unif  284     Los Angeles   18   NA      93   835
     api99 target growth sch.wide comp.imp both awards meals ell yr.rnd
1039   448     18     14       No      Yes   No     No    44  31   <NA>
1124   831     NA     47      Yes      Yes  Yes    Yes     8  25   <NA>
2868   742      3     -8       No       No   No     No    10  10   <NA>
1273   657      7    115      Yes      Yes  Yes    Yes    70  25   <NA>
4926   719      4     20      Yes      Yes  Yes    Yes    43  12   <NA>
2463   822     NA     13      Yes      Yes  Yes     No    16  19   <NA>
     mobility acs.k3 acs.46 acs.core pct.resp not.hsg hsg some.col col.grad
1039        6     NA     NA       24       82      44  34       12        7
1124       15     19     30       NA       97       4  10       23       43
2868        7     NA     NA       28       95       5   9       21       41
1273       23     23     NA       NA      100      37  40       14        8
4926       12     20     29       NA       91       8  21       27       34
2463       13     19     29       NA       71       1   8       20       38
     grad.sch avg.ed full emer enroll api.stu    pw  fpc
1039        3   1.91   71   35    477     429 30.97 6194
1124       21   3.66   90   10    478     420 30.97 6194
2868       24   3.71   83   18   1410    1287 30.97 6194
1273        1   1.96   85   18    342     291 30.97 6194
4926       10   3.17  100    0    217     189 30.97 6194
2463       34   3.96   75   20    258     211 30.97 6194
## Specify design
apisrs_design <- svydesign(data = apisrs, 
                             weights = ~pw, 
                             id = ~1)

Example: Sample of students in California. Samples stratified by School type

Here is an example with a stratified random sample with a known finite population variable. Here, we had the fpc column and strata column stype. We still supply the weights pw.

## Use apistrat
apistrat_design <- svydesign(data = apistrat, 
                             weights = ~pw, 
                             fpc = ~fpc, # population of school types known
                             id = ~1, # no clusters
                             strata = ~stype)

Example: Sample of students in California by clustering.

Here is a more complex example.

  • Samples clustered by school districts dnum.
  • Within district, five schools were sampled snum.
  • Fpc for district and school are fpc1 and fpc2
apiclus_design <- svydesign(data = apiclus2, 
                             weights = ~pw, 
                             fpc = ~fpc1 + fpc2, 
                             id = ~dnum + snum) # start from biggest clu

For more established surveys, often the codebook might include documentation on how to account for survey weights and sampling design. For example, the American National Election Study will generally include this guidance in recent codebooks. Here is an example accounting for the sampling design in the 2016 ANES.

You can use foreign, rio, or what we will use here, haven to load the .dta data.

library(haven)
an <- read_dta("https://github.com/ktmccabe/teachingdata/blob/main/anes_timeseries_2016_Stata12.dta?raw=true")

One benefit of loading data through haven and some of the newer packages is it will help retain value labels from the original data that R might not otherwise load. (E.g., labels that would show up in SPSS or Stata but seem lost in R.)

Let’s check this for an example: “How well does feminist describe you?

library(tidyverse)
an$V161346 %>% attr("labels")
                                   -9. Refused 
                                            -9 
-5. Interview breakoff (sufficient partial IW) 
                                            -5 
                             1. Extremely well 
                                             1 
                                  2. Very well 
                                             2 
                              3. Somewhat well 
                                             3 
                              4. Not very well 
                                             4 
                                 5. Not at all 
                                             5 

OK enough about that. Let’s specify the survey design.

anes_design <-
    svydesign( 
        ids = ~V160202 , 
        strata = ~V160201 , 
        data = an, 
        weights = ~V160102,
        nest = TRUE)

The nest argument indicates if cluster ids should be relabeled to align with strata.

13.2.2 Working with survey designs

The good news is that once you have specified the survey design, the rest of the analysis will flow similarly across types of survey designs. The bad news is that you will need to use special functions to analyze the data instead of the functions we are used to working with. These generally start with svy….. but the good news is, they are not that different from our normal functions!

Here are a few examples of common functions using the survey design.

## Means and uncertainty of variables
## Note: we put design where you would normally see data
svymean(~mobility, design = apiclus_design)
           mean    SE
mobility 18.154 1.471
SE(svymean(~mobility, design = apiclus_design))
         mobility
mobility 1.471006
confint(svymean(~mobility, design = apiclus_design))
            2.5 %   97.5 %
mobility 15.27039 21.03662
## Linear Regression
fit.lin <- svyglm(mobility ~ meals + awards, family = "gaussian",
       design = apiclus_design)

## Logit Regression
fit.logit <- svyglm(awards ~ meals, 
              family = binomial(link = "logit"),
       design = apiclus_design)
Warning in eval(family$initialize): non-integer #successes in a binomial glm!

For an example of ordinal logistic regression see section 8.6 of the course notes.

## Can also subset like normal with subset command
apiclus_design2 <- subset(apiclus_design, meals > 1)
  • The Pew research center has also written a tutorial for using various tidyverse tools with survey data. See here for details.
  • In addition, there has been a package srvyr recently developed that incorporates more tidyverse language with the same set of survey tools. See this tutorial for details and some analysis examples.
  • Our friends the prediction and margins packages have some capabilities for models fit through the survey package.
    • With the margins package, you will want to specify the design in order to make sure the average marginal effects reported are based on the survey design object. Example:
library(margins)
res <-margins(fit.logit, variables = "meals",
              design=apiclus_design, 
              change="iqr")
summary(res)
 factor     AME     SE       z      p   lower  upper
  meals -0.1018 0.0592 -1.7208 0.0853 -0.2178 0.0142

13.2.3 Why do we need the survey package?

The benefits of the survey package really come from the tool’s specialty in calculating appropriate variances for the survey design and accounting for very complex stratified or clustered designs. There are other weighting functions in R that will allow you to recover the same point estimates as the survey tools in some cases. However, it won’t always be the case that they will give you accurate measures of uncertainty.

If you don’t need measures of uncertainty, then other functions from base R, such as weighted.mean can also work in cases where you just have a column of survey weights.

weighted.mean(apisrs$mobility, w=apisrs$pw)
[1] 17.46
svymean(~mobility, design = apisrs_design)
          mean     SE
mobility 17.46 0.6529