2.5 Exploratory Data Analysis Tools
One of the first things you may want to do when you have a new dataset is to explore! Get a sense of the variables you have, their class, and how they are coded. There are many functions in base R that help with this, such as summary(), table(), and descriptive statistics like mean() or quantile().
Let’s try this with the built-in mtcars data.
data("mtcars")
summary(mtcars$cyl)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   4.000   4.000   6.000   6.188   8.000   8.000
quantile(mtcars$wt)
##      0%     25%     50%     75%    100%
## 1.51300 2.58125 3.32500 3.61000 5.42400
mean(mtcars$mpg, na.rm = TRUE)
## [1] 20.09062
sd(mtcars$mpg, na.rm = TRUE)
## [1] 6.026948
table(gear = mtcars$gear, carb = mtcars$carb)
##     carb
## gear 1 2 3 4 6 8
##    3 3 4 3 5 0 0
##    4 4 4 0 4 0 0
##    5 0 2 0 1 1 1
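The summaries above look at one variable at a time. To check the class and coding of every variable in one pass, a minimal sketch using only base R functions might look like the following. The object name var_classes is just an illustrative choice.

```r
data("mtcars")

# Compact overview: dimensions, plus each variable's type and first few values
str(mtcars)

# Class of every column, returned as a named character vector
var_classes <- sapply(mtcars, class)
var_classes

# summary() also accepts a whole data frame and summarizes each column
summary(mtcars)

# Count missing values per column before relying on any statistics
colSums(is.na(mtcars))
```

If a column shows up with a class you did not expect (for example, a numeric code that should be treated as a category), that is usually worth resolving before moving on to plots or models.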
As discussed in the visualization section, you can also quickly describe univariate data with histograms, barplots, or density plots using base R or ggplot.
hist(mtcars$mpg, breaks = 20, main = "Histogram of MPG")
plot(density(mtcars$mpg, na.rm = TRUE),
     main = "Distribution of MPG")
barplot(table(mtcars$gear), main = "Barplot of Gears")
library(ggplot2)
ggplot(mtcars, aes(mpg)) +
  geom_histogram(bins = 20) +
  ggtitle("Histogram of MPG")
ggplot(mtcars, aes(mpg)) +
  geom_density() +
  ggtitle("Distribution of MPG")
ggplot(mtcars, aes(gear)) +
  geom_bar(stat = "count") +
  ggtitle("Barplot of Gears")
The dataReporter package in R makes this process easier when you want to summarize several variables at a time. See the package documentation for details.
Let’s try this with our resume dataset.
resume <- read.csv("https://raw.githubusercontent.com/ktmccabe/teachingdata/main/resume.csv")
You can keep all the variables, subset the data frame to include only certain variables before generating a report (recommended if you have a large number of columns), or indicate which variables to summarize through the useVar argument. If you have a LaTeX distribution installed on your computer, the function will automatically generate a PDF report. You can set the output argument to a different format, as in the examples below. Each time you generate a new report, you can give it a particular name with the file argument or a number with the vol argument.
library(dataReporter)
## PDF is the default when you have LaTeX installed
#makeDataReport(resume)
## specify word, volume 2
#makeDataReport(resume, output="word", vol=2)
## specify variables to summarize
#makeDataReport(resume, output="word", vol=3, useVar = c("call", "firstname"))
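The calls above use useVar to pick variables. The other option mentioned, subsetting the data frame first, can be sketched as follows. This example uses the built-in mtcars data rather than resume so that it runs without a download. The names keep_vars and mtcars_small and the choice of columns are illustrative assumptions, and the report call is left commented out, as above, because it writes files to your working directory.

```r
data("mtcars")

# Keep only the columns you want in the report (illustrative choice)
keep_vars <- c("mpg", "cyl", "wt")
mtcars_small <- mtcars[, keep_vars]

# Confirm the subset has the expected columns and still has every row
names(mtcars_small)
dim(mtcars_small)

## generate the report on the smaller data frame
# library(dataReporter)
# makeDataReport(mtcars_small, output = "word")
```

Subsetting first leaves you with a smaller object you can reuse for other summaries, while useVar keeps the original data frame intact and limits only what the report covers.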
One judgment call you may want to make is whether you want most of your categorical variables to be loaded as character or factor variables. When you use read.csv, R (version 4.0.0 and later) will default to reading in these types of “string” variables as character variables. You can override this by adding the stringsAsFactors argument if you desire.
resume2 <- read.csv("https://raw.githubusercontent.com/ktmccabe/teachingdata/main/resume.csv", stringsAsFactors = TRUE)
## compare
class(resume$firstname)
## [1] "character"
class(resume2$firstname)
## [1] "factor"
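To see why the choice matters in practice, here is a small sketch using a made-up data frame. The object toy and its values are hypothetical and are not taken from the resume data. It shows how to convert a single column after loading, and how summary() treats the two classes differently.

```r
# Hypothetical toy data standing in for a dataset with a string variable
toy <- data.frame(
  firstname = c("Ann", "Bo", "Ann", "Cy"),
  stringsAsFactors = FALSE
)
class(toy$firstname)    # character

# Convert one column to a factor after loading
toy$firstname_f <- factor(toy$firstname)
class(toy$firstname_f)  # factor

# A factor stores its set of categories as levels
levels(toy$firstname_f)

# summary() counts each level of a factor,
# but reports only length, class, and mode for a character vector
summary(toy$firstname_f)
summary(toy$firstname)

# table() gives category counts for either class
table(toy$firstname)
```

Converting individual columns with factor() lets you load everything as character and then decide variable by variable, instead of setting stringsAsFactors for the whole file.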