8.3 Step 1: Approach - Regression in R

A regression draws a “best fit line” between the points. This allows us – for any given OBP – to estimate the number of runs scored.

Our best prediction of the number of runs scored would be the spot on the purple line directly above a given OBP.

The regression model is \(Y = \alpha + \beta X + \epsilon\). Let’s demystify this.

A regression model describes the relationship between one or more independent variables \(X\) (explanatory variables) and an outcome variable \(Y\) (dependent variable).
- For example, the relationship between our independent variable, On Base Percentage, and our dependent variable, Runs Scored
We want to know what happens with our dependent variable \(Y\) if our independent variable \(X\) increases.
- As we increase our On Base Percentage, a regression model will help us estimate how much we should expect our Runs Scored to increase (or decrease)
\(\alpha\) and \(\beta\) are considered “parameters” – things we don’t know but want to estimate. These two numbers will define exactly how we think \(X\) and \(Y\) are related.
Real-world variables are rarely perfectly related, so we also have the \(\epsilon\) term, which describes the error in the model: the part of \(Y\) that \(\alpha + \beta X\) does not account for.
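To make the pieces of the model concrete, here is a minimal simulation sketch. The parameter values (alpha = -1000, beta = 5000), the error spread, and the OBP range are made up for illustration and are not taken from the baseball data.

```r
set.seed(1)  # make the simulated draws reproducible

# Hypothetical "true" parameters, chosen only for illustration
alpha <- -1000
beta  <- 5000

n   <- 100
obp <- runif(n, min = 0.300, max = 0.360)  # simulated X values
eps <- rnorm(n, mean = 0, sd = 25)         # the error term, epsilon
rs  <- alpha + beta * obp + eps            # Y = alpha + beta * X + epsilon

# The error is whatever is left after removing the line alpha + beta * X
leftover <- rs - (alpha + beta * obp)
all.equal(leftover, eps)  # TRUE, up to floating-point rounding
```

In real data we never observe alpha, beta, or eps directly. We only see X and Y, which is why the parameters have to be estimated.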

When we have data, we estimate \(Y\), \(\alpha\), and \(\beta\): \(\hat Y = \hat \alpha + \hat \beta X\).

The hat symbol over the letters (as in \(\hat Y\)) means those are our estimated values.

In R, the regression syntax is fit <- lm(y ~ x, data = mydata), where

fit is just whatever you want to call the output of the model,
y is the name of the dependent variable,
x is the name of the independent variable, and
mydata is whatever you have called your dataframe. E.g.:

fit <- lm(RS ~ OBP, data = baseball)
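Since the baseball data frame is not reproduced here, the following sketch fits the same kind of model to simulated data so that it can be run on its own. The data frame baseball_toy and the parameter values used to generate it are hypothetical.

```r
set.seed(42)

# Simulated stand-in for the real data (values are made up)
n <- 200
baseball_toy <- data.frame(OBP = runif(n, min = 0.300, max = 0.360))
baseball_toy$RS <- -1000 + 5000 * baseball_toy$OBP + rnorm(n, mean = 0, sd = 25)

# Same syntax as above: dependent variable ~ independent variable
fit_toy <- lm(RS ~ OBP, data = baseball_toy)

# coef() extracts the fitted intercept and slope; they should land
# near -1000 and 5000, but not exactly on them
coef(fit_toy)
```

Because the data were generated with a known intercept and slope, this is a quick way to see that lm() recovers values close to the ones that produced the data.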

Our model gives us the “coefficient” estimates for \(\hat \alpha\) and \(\hat \beta\).

coef(fit)

## (Intercept)         OBP 
##   -1076.602    5490.386

The first coefficient is \(\hat \alpha\), the intercept: the estimated value our dependent variable will take if our independent variable is 0.

Here, that is the estimated number of runs scored if a team had a 0.000 on base percentage. In our case, this value is estimated to be negative, which is impossible (but it would also be unusual for a team to have a 0.000 on base percentage). Therefore, the intercept isn’t inherently substantively interesting to us.

The second coefficient is \(\hat \beta\), the slope. This represents the expected change in our dependent variable for a 1-unit increase in our independent variable.

For example, if we go from a 0.000 on base percentage to a 1.000 on base percentage (a 1-unit increase), we would expect a 5490.4 increase in runs scored.
Note: slope can be positive or negative, similar to correlation.
Note: slope is in the units of the dependent variable (e.g., runs). It is not constrained to be between -1 and 1.
Here the slope is positive: the greater the OBP, the more runs we expect a team to score.
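Because OBP is a proportion, a full 1-unit change is far larger than the differences between real teams. A more practical reading rescales the slope to a smaller change, sketched here for a 0.010 increase in OBP using the slope reported above. The name beta_hat is ours.

```r
beta_hat <- 5490.386       # slope from coef(fit) above

# Expected change in runs scored for a 0.010 increase in OBP
beta_hat * 0.010           # about 54.9 runs
```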

8.3.1 Visualizing a regression

We can plot the regression using a scatterplot and abline().

plot(x=baseball$OBP, y=baseball$RS, 
     ylab = "Runs Scored",
     xlab =  "On Base Percentage", 
     main="Runs Scored by On Base Percentage",
     pch=20)

## Add regression line
abline(fit, lwd=3, col = "purple") # add line
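The line that abline() draws from a fitted model is the estimated intercept plus the estimated slope times X. The sketch below checks this on simulated data (baseball_toy and fit_toy are hypothetical stand-ins for the real data frame and model): fitted() returns the height of the line at each observed OBP.

```r
set.seed(42)

# Simulated stand-in for the real data (values are made up)
baseball_toy <- data.frame(OBP = runif(50, min = 0.300, max = 0.360))
baseball_toy$RS <- -1000 + 5000 * baseball_toy$OBP + rnorm(50, sd = 25)
fit_toy <- lm(RS ~ OBP, data = baseball_toy)

plot(x = baseball_toy$OBP, y = baseball_toy$RS,
     xlab = "On Base Percentage", ylab = "Runs Scored", pch = 20)
abline(fit_toy, lwd = 3, col = "purple")

# Fitted values sit exactly on the regression line
on_line <- coef(fit_toy)[1] + coef(fit_toy)[2] * baseball_toy$OBP
points(baseball_toy$OBP, fitted(fit_toy), col = "orange")
all.equal(unname(fitted(fit_toy)), unname(on_line))
```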

8.3.2 Making predictions with regression

A regression model allows us to estimate or “predict” values of our dependent variable for a given value of our independent variable.

The red dot represents our estimate (best prediction) of the number of runs scored if a team has an on base percentage of .300. In R, we can calculate this value using predict().

The syntax is predict(fit, data.frame(x = value)) where fit is the name of the model, x is the name of the independent variable, and value represents the value for the independent variable for which you want to predict your outcome (e.g., .300).

predict(fit, data.frame(OBP=.300))

##        1 
## 570.5137

Under the hood, this is just using the regression formula described above. For example, to estimate the number of runs scored for a .300 on base percentage, we take \(\hat \alpha + \hat \beta \times .300 = -1076.602 + 5490.386 \times .300 \approx 570.51\).

Below, we compare the output of predict() with the value we get by calculating the estimate manually.

predict(fit, data.frame(OBP=.300))

##        1 
## 570.5137

# a + b*.300
coef(fit)[1] +  coef(fit)[2]*.300

## (Intercept) 
##    570.5137
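predict() also accepts several values at once, which saves repeating the manual arithmetic. A sketch on simulated data follows; baseball_toy, fit_toy, and new_teams are hypothetical stand-ins, so the numbers will not match the output above.

```r
set.seed(42)

# Simulated stand-in for the real data (values are made up)
baseball_toy <- data.frame(OBP = runif(50, min = 0.300, max = 0.360))
baseball_toy$RS <- -1000 + 5000 * baseball_toy$OBP + rnorm(50, sd = 25)
fit_toy <- lm(RS ~ OBP, data = baseball_toy)

# One row per OBP value we want a prediction for
new_teams <- data.frame(OBP = c(.300, .320, .340))
from_predict <- predict(fit_toy, new_teams)

# Same calculation by hand: a + b * OBP, applied to the whole vector
by_hand <- coef(fit_toy)[1] + coef(fit_toy)[2] * new_teams$OBP

all.equal(unname(from_predict), unname(by_hand))
```

The column name in the new data frame has to match the name of the independent variable used in the model formula (here, OBP).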

Let’s say a team thought they needed about 900 runs scored to get to the playoffs, and they were pretty sure they could get a team on base percentage of .500. How many runs would they be expected to score with that OBP? Do you think they will make the playoffs?

Try on your own before reading the solution.

predict(fit, data.frame(OBP=.500))

##        1 
## 1668.591

The predicted 1668.6 runs is greater than 900, so we should feel good about our chances.
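The same line can be read in the other direction: rather than predicting runs for a given OBP, we can ask what OBP the fitted line associates with 900 runs. This is a sketch using the coefficients reported above; the variable names are ours, and rearranging the fitted line this way should be taken as a rough guide only.

```r
alpha_hat <- -1076.602   # intercept from coef(fit)
beta_hat  <- 5490.386    # slope from coef(fit)
target_runs <- 900

# Solve target_runs = alpha_hat + beta_hat * OBP for OBP
obp_needed <- (target_runs - alpha_hat) / beta_hat
round(obp_needed, 3)     # about 0.360
```

By this calculation, a team OBP of roughly .360 would already correspond to an expected 900 runs.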