15.3 Cross-Validation
Cross-validation is an approach for assessing how well a model generalizes, and thus for guarding against overfitting:
- Take data for which you know the answer – we call this “training data”.
- Randomly subset out a portion of the training data; this becomes our “test” data.
- Develop a model based on the remaining training data.
- Test the accuracy of the model on the test data (out-of-sample data that was not used to train the model).
- Repeat the process for different portions of the data.

Goal: See how well our model will generalize to new data (data the model hasn’t seen).
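The steps above can be sketched in plain Python. The dataset, the 75/25 split, the least-squares line, and the mean-squared-error metric below are all illustrative assumptions, not prescribed by the text:

```python
import random

# Toy dataset of (x, y) pairs with a roughly linear trend (illustrative)
random.seed(42)
data = [(x, 2 * x + random.gauss(0, 1)) for x in range(20)]

# Randomly subset out 25% of the data as the "test" set
random.shuffle(data)
cut = int(len(data) * 0.75)
train, test = data[:cut], data[cut:]

# Develop a model on the training data: a simple least-squares line
n = len(train)
mean_x = sum(x for x, _ in train) / n
mean_y = sum(y for _, y in train) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
         / sum((x - mean_x) ** 2 for x, _ in train))
intercept = mean_y - slope * mean_x

# Test accuracy on the held-out data: mean squared prediction error
mse = sum((y - (intercept + slope * x)) ** 2 for x, y in test) / len(test)
print(f"slope={slope:.2f}, test MSE={mse:.2f}")
```

Re-running with a different random split (the "repeat" step) gives a different test error; averaging those errors is what the cross-validation schemes below formalize.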
15.3.1 k-fold cross-validation
Divide your data into \(k\) folds (how many depends on how much data you have)
- Fit your model to \(k-1\) folds.
- See how well your model predicts the data in the \(k\)th fold.
- Repeat, leaving out a different fold each time.

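As a minimal sketch of the procedure, the helper below assigns observations to folds round-robin, fits on \(k-1\) folds, and scores on the held-out fold. The data values and the deliberately simple "model" (just the training mean, scored by mean squared error) are illustrative assumptions:

```python
def k_fold_cv(data, k, fit, score):
    # Assign observations to k folds round-robin
    folds = [data[i::k] for i in range(k)]
    errors = []
    for i in range(k):
        held_out = folds[i]                                   # the held-out fold
        training = [obs for j in range(k) if j != i           # the other k-1 folds
                    for obs in folds[j]]
        model = fit(training)                                 # fit on k-1 folds
        errors.append(score(model, held_out))                 # predict the k-th fold
    return errors

# Illustrative data; the "model" is just the mean of the training values
ys = [3.1, 2.9, 3.4, 2.8, 3.0, 3.3, 2.7, 3.2, 3.5, 2.6]
fit_mean = lambda train: sum(train) / len(train)
mse = lambda m, test: sum((y - m) ** 2 for y in test) / len(test)

fold_errors = k_fold_cv(ys, k=5, fit=fit_mean, score=mse)
print([round(e, 3) for e in fold_errors])
```

Averaging `fold_errors` gives a single cross-validated estimate of out-of-sample error.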
15.3.2 Leave-one-out cross-validation
Best for smaller datasets
- Fit your model to all but one observation in your data.
- See how well your model predicts the left-out observation.
- Repeat, leaving out each observation in turn.
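Leave-one-out is \(k\)-fold cross-validation with \(k\) equal to the number of observations. A sketch, again using an illustrative dataset and a training-mean "model" scored by squared error:

```python
def loocv(data, fit, score):
    errors = []
    for i in range(len(data)):
        held_out = data[i]                      # the one left-out observation
        training = data[:i] + data[i + 1:]      # all but one observation
        model = fit(training)
        errors.append(score(model, held_out))   # predict the left-out point
    return errors

# Illustrative small dataset; the "model" is just the training mean
ys = [3.1, 2.9, 3.4, 2.8, 3.0]
fit_mean = lambda train: sum(train) / len(train)
sq_err = lambda m, y: (y - m) ** 2

errors = loocv(ys, fit=fit_mean, score=sq_err)
print([round(e, 3) for e in errors])
```

With \(n\) observations this fits the model \(n\) times, which is why it suits smaller datasets.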