13.1 How to investigate new data
Real-world data often require some cleaning or “wrangling”
So you have some data…. AND it’s a mess!!!
A lot of the data we may encounter in courses has been simplified to allow students to focus on other concepts. We may have data that look like the following:
```r
nicedata <- data.frame(gender = c("Male", "Female", "Female", "Male"),
                       age = c(16, 20, 66, 44),
                       voterturnout = c(1, 0, 1, 0))
```
| gender | age | voterturnout |
|---|---|---|
| Male | 16 | 1 |
| Female | 20 | 0 |
| Female | 66 | 1 |
| Male | 44 | 0 |
In the real world, our data may hit us like a ton of bricks, like the below:
```r
uglydata <- data.frame(VV160002 = c(2, NA, 1, 2),
                       VV1400068 = c(16, 20, 66, 44),
                       VV20000 = c(1, NA, 1, NA))
```
| VV160002 | VV1400068 | VV20000 |
|---|---|---|
| 2 | 16 | 1 |
| NA | 20 | NA |
| 1 | 66 | 1 |
| 2 | 44 | NA |
A lot of common datasets we use in the social sciences are messy, uninformative, sprawling, misshaped, and/or incomplete. What do I mean by this?
- The data might have a lot of missing values. For example, we may have `NA` values in R, or perhaps a research firm has used some other notation for missing data, such as a 99.
- The variable names may be uninformative.
  - For example, there may be no way to know, just by looking at the data, which variable represents gender. We have to look at a codebook.
- Even if we can tell what a variable is, its categories may not be coded in a way that aligns with how we want to use the data for our research question.
  - For example, perhaps you are interested in the effect of a policy on people under 65 vs. those 65 and over. Well, your age variable might just be numeric. You will have to create a new variable that aligns with your theoretical interest.
- Datasets are often sprawling. Some datasets may have more than 1000 variables. It is hard to sort through all of them. Likewise, datasets may have millions of observations. We cannot practically look through all the values of a column to know what is there.
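Given all of this, a sensible first step with any new dataset is a quick overview before doing anything else. Below is a sketch using base R functions on the `uglydata` example (re-created here so the block runs on its own):

```r
## the ugly example data from above
uglydata <- data.frame(VV160002 = c(2, NA, 1, 2),
                       VV1400068 = c(16, 20, 66, 44),
                       VV20000 = c(1, NA, 1, NA))

## how many rows and columns?
dim(uglydata)

## variable names and types at a glance
str(uglydata)

## first few rows
head(uglydata)

## distribution of every column, including NA counts
summary(uglydata)
```

With a sprawling dataset of a thousand variables or a million rows, `str()` or `names()` is a far more practical first look than printing the whole data frame.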
13.1.1 Dealing with Uninformative Variable Names
Hopefully, there is an easy fix for dealing with uninformative variable names. I say “hopefully” because, ideally, the place where you downloaded the data will also include a codebook telling you what each variable name means and how the corresponding values are coded.
Unfortunately, this is not always the case. One thing you can do as a researcher: when you create a dataset for your own work, keep a record (in written form, in a Word document, PDF, or code file) of what each variable means (e.g., the survey question it corresponds to or the exact economic measure), as well as how its values are coded. This good practice helps you in the short term, as you pause and return to a project over the course of a year, and benefits other researchers in the long run after you finish your research.
For examples of large codebooks, you can view the 2016 American National Election Study Survey and click on a codebook.
I recommend that once you locate the definition of a variable of interest, rename the variable in your dataset to be informative. You can do this by creating a new variable or overwriting the name of the existing variable. You might also comment a note for yourself of what the values mean.
```r
## Option 1: create a new variable
## gender: 2 = Male, 1 = Female
uglydata$gender <- uglydata$VV160002

## Option 2: overwrite the existing name
names(uglydata)[1] <- "gender2"
```
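Relatedly, you can attach the codebook’s category labels directly with base R’s `factor()`. This is a sketch assuming, per the comment above, that `VV160002` is coded 2 = Male and 1 = Female:

```r
## the ugly example data from above
uglydata <- data.frame(VV160002 = c(2, NA, 1, 2),
                       VV1400068 = c(16, 20, 66, 44),
                       VV20000 = c(1, NA, 1, NA))

## map the numeric codes to informative labels (1 = Female, 2 = Male)
uglydata$gender <- factor(uglydata$VV160002,
                          levels = c(1, 2),
                          labels = c("Female", "Male"))

table(uglydata$gender)
```

Labeled factors make tables and plots self-explanatory, at the cost of the variable no longer being numeric.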
13.1.2 Dealing with Missing Data
When we have a column with missing data, it is best to do a few things:
- Try to quantify how much missing data there is and poke at the reason why data are missing.
- Is it minor non-response data?
- Or is it indicative of a more systematic issue? For example, maybe data from a whole group of people or countries is missing for certain variables.
- If the data are missing at a very minor rate and/or there is a logical explanation for the missing data that should not affect your research question, you may choose to “ignore” the missing data when performing common analyses, such as taking the mean or running a regression.
If we want to figure out how much missing data we have in a variable, we have a couple of approaches:
```r
## Summarize this variable
summary(uglydata$gender)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##   1.000   1.500   2.000   1.667   2.000   2.000       1

## What is the length of the subset of the variable where the data are missing
length(uglydata$gender[is.na(uglydata$gender) == T])

## [1] 1
```
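A more compact alternative: `is.na()` returns `TRUE`/`FALSE`, and `TRUE` counts as 1 in arithmetic, so `sum()` counts the missing values and `mean()` gives the proportion missing. A sketch on the raw `VV160002` column:

```r
## the ugly example data from above
uglydata <- data.frame(VV160002 = c(2, NA, 1, 2),
                       VV1400068 = c(16, 20, 66, 44),
                       VV20000 = c(1, NA, 1, NA))

## number of missing values in the column
sum(is.na(uglydata$VV160002))
## [1] 1

## proportion of the column that is missing
mean(is.na(uglydata$VV160002))
## [1] 0.25
```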
If we choose to ignore missing data, this can often be easily accomplished in common operations. For example, when taking the mean, we just add the argument `na.rm = T`:

```r
mean(uglydata$VV1400068, na.rm = T)

## [1] 36.5
```
We should always be careful with missing data to understand how R is treating it in a particular scenario.
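For instance, most base R summary functions return `NA` as soon as any input value is missing, which is easy to miss if you are not looking for it. A small sketch:

```r
## a column with missing values, as in the uglydata example above
VV20000 <- c(1, NA, 1, NA)

## without na.rm, the result is NA because the column has NAs
mean(VV20000)
## [1] NA

## with na.rm = TRUE, the NAs are dropped before averaging
mean(VV20000, na.rm = TRUE)
## [1] 1
```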
13.1.3 Dealing with Variable Codings that Aren’t Quite Right
Oftentimes, the variables in off-the-shelf datasets are not coded exactly the way we had imagined when operationalizing our concepts. Instead, we are going to have to wrangle the data to get them into shape.
This may involve creating new variables that recode certain values, creating new variables that collapse some values into a smaller number of categories, combining multiple variables into a single variable (e.g., representing the average), or setting some of the variable values to be missing (`NA`). All of these scenarios may come up when you are dealing with real data.
```r
## create variable indicating over 65 vs. under 65
## approach 1
uglydata$over65 <- NA
uglydata$over65[uglydata$VV1400068 >= 65] <- 1
uglydata$over65[uglydata$VV1400068 < 65] <- 0

## approach 2
uglydata$over65 <- ifelse(uglydata$VV1400068 >= 65, 1, 0)
```
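The last scenario mentioned above, setting certain values to missing, comes up when a dataset uses a sentinel code for non-response. A sketch with a hypothetical `income` column where 99 is assumed to mean “no response”:

```r
## hypothetical column where 99 is the survey's missing-data code
income <- c(30000, 99, 52000, 99, 41000)

## recode the sentinel value to NA so it cannot contaminate calculations
income[income == 99] <- NA

mean(income, na.rm = TRUE)
## [1] 41000
```

Had we left the 99s in place, the mean would have been badly distorted; this is exactly the kind of coding a codebook should warn you about.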
13.1.4 Dealing with Parts of Datasets
We may also want to limit our analysis to just small parts of datasets instead of the entire dataset. Recall the function `subset()`, which limits the data to only rows that meet certain criteria.
```r
## limit data to those over 65
over65 <- subset(uglydata, over65 == 1)
```
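`subset()` can also combine multiple conditions and keep only certain columns via its `select` argument. A sketch using the `uglydata` variables from above (re-created here so the block runs on its own), assuming `VV160002` is coded 1 = Female:

```r
## the ugly example data, plus the over65 indicator from above
uglydata <- data.frame(VV160002 = c(2, NA, 1, 2),
                       VV1400068 = c(16, 20, 66, 44),
                       VV20000 = c(1, NA, 1, NA))
uglydata$over65 <- ifelse(uglydata$VV1400068 >= 65, 1, 0)

## women (coded 1) who are 65 or over, keeping only two columns
olderwomen <- subset(uglydata, over65 == 1 & VV160002 == 1,
                     select = c(VV160002, over65))
olderwomen
```

Note that `subset()` drops rows where the condition evaluates to `NA`, which is usually, but not always, what you want with missing data.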