8.2 Application: Baseball Predictions
For our first example, we will stay outside of politics and use regression to predict the success of a baseball team.
Moneyball is a $100 million Hollywood movie that is all about linear regression… and some baseball… and Brad Pitt, but really… it’s MOSTLY about linear regression.
The movie describes the Oakland A’s shift to using data to build their team. They make two observations: 1) To win baseball games, you need runs. 2) To score runs, you need to get on base. We can estimate what on-base percentage a team would need in order to score enough runs to make the playoffs in a typical season.
We will use regression to make these predictions.
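To make the idea concrete before turning to the full data, here is a minimal sketch of the logic: fit a line predicting runs scored from on-base percentage, then invert it to ask what OBP corresponds to a given run total. It uses only the six 2012 rows that appear in the `head(baseball)` output later in this section, so the numbers are illustrative only. The names `sample_rows`, `fit`, and `needed_obp`, and the run target of 750, are assumptions made for this example and are not taken from the original analysis.

```r
# Six rows copied from the head(baseball) output shown later in this section.
# Far too few observations for a real estimate; this only illustrates the steps.
sample_rows <- data.frame(
  Team = c("ARI", "ATL", "BAL", "BOS", "CHC", "CHW"),
  RS   = c(734, 700, 712, 734, 613, 748),
  OBP  = c(0.328, 0.320, 0.311, 0.315, 0.302, 0.318)
)

# Regress runs scored on on-base percentage
fit <- lm(RS ~ OBP, data = sample_rows)
intercept <- unname(coef(fit)["(Intercept)"])
slope     <- unname(coef(fit)["OBP"])

# Hypothetical run target, chosen only for illustration
target_runs <- 750

# Invert the fitted line: RS = intercept + slope * OBP
needed_obp <- (target_runs - intercept) / slope
needed_obp
```

With the full data set, the same steps would be applied to the complete `baseball` data frame, with a run target justified by the data rather than picked by hand.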
For a video explainer of the code for this application, see below. (Via youtube, you can speed up the playback to 1.5 or 2x speed.)
We use the baseball.csv data, which includes the following variables:
RS: runs scored
RA: runs allowed
W: wins
Playoffs: whether team made playoffs
OBP: on base percentage
BA: batting average
SLG: slugging percentage
baseball <- read.csv("baseball.csv")
head(baseball)
## Team League Year RS RA W OBP SLG BA Playoffs RankSeason
## 1 ARI NL 2012 734 688 81 0.328 0.418 0.259 0 NA
## 2 ATL NL 2012 700 600 94 0.320 0.389 0.247 1 4
## 3 BAL AL 2012 712 705 93 0.311 0.417 0.247 1 5
## 4 BOS AL 2012 734 806 69 0.315 0.415 0.260 0 NA
## 5 CHC NL 2012 613 759 61 0.302 0.378 0.240 0 NA
## 6 CHW AL 2012 748 676 85 0.318 0.422 0.255 0 NA
## RankPlayoffs G OOBP OSLG
## 1 NA 162 0.317 0.415
## 2 5 162 0.306 0.378
## 3 4 162 0.315 0.403
## 4 NA 162 0.331 0.428
## 5 NA 162 0.335 0.424
## 6 NA 162 0.319 0.405
Below we can see the first observation made: runs scored are highly correlated with team wins.
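As a rough check of this relationship, the sketch below computes the Pearson correlation between runs scored and wins. So that it runs on its own, it uses only the six 2012 rows printed by `head(baseball)` above; `sample_rows` and `rs_w_cor` are illustrative names, not part of the original code. Six rows are far too few to draw conclusions from.

```r
# Six rows copied from the head(baseball) output above (2012 season only)
sample_rows <- data.frame(
  Team = c("ARI", "ATL", "BAL", "BOS", "CHC", "CHW"),
  RS   = c(734, 700, 712, 734, 613, 748),
  W    = c(81, 94, 93, 69, 61, 85)
)

# Pearson correlation between runs scored and wins
rs_w_cor <- cor(sample_rows$RS, sample_rows$W)
rs_w_cor
```

With the full file loaded, the same check would be `cor(baseball$RS, baseball$W)`, and `plot(baseball$RS, baseball$W)` would draw the corresponding scatterplot.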
What the A’s noticed is that a team’s on-base percentage (OBP) is also highly correlated with runs scored. This aligns with conventional wisdom: players get a lot of hype when they achieve a high OBP.
"Hernandez is hitting .500 (16-for-32) with five homers, four doubles, nine RBI, nine runs scored and a .514 on-base percentage in seven postseason games." - NBC Boston
This correlation shows up in our data, too.
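One way to look at this is to compute the correlation of runs scored with each of the three batting measures in the data. The sketch below again uses only the six rows printed by `head(baseball)`, so that it is self-contained; `sample_rows` and `rs_cors` are illustrative names, and the resulting values should not be read as the full-data correlations.

```r
# Six rows copied from the head(baseball) output above (2012 season only)
sample_rows <- data.frame(
  Team = c("ARI", "ATL", "BAL", "BOS", "CHC", "CHW"),
  RS   = c(734, 700, 712, 734, 613, 748),
  OBP  = c(0.328, 0.320, 0.311, 0.315, 0.302, 0.318),
  SLG  = c(0.418, 0.389, 0.417, 0.415, 0.378, 0.422),
  BA   = c(0.259, 0.247, 0.247, 0.260, 0.240, 0.255)
)

# Correlation of each batting measure with runs scored
rs_cors <- sapply(
  sample_rows[, c("OBP", "SLG", "BA")],
  function(x) cor(x, sample_rows$RS)
)
rs_cors
```

Replacing `sample_rows` with the full `baseball` data frame gives the correlations for all teams and seasons in the file.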