15.3 Cross-Validation
Cross-validation is an approach for addressing overfitting:
- Take data for which you know the answer – we call this “training data”.
- Randomly hold out a portion of the training data; this held-out portion becomes our “test” data.
- Develop a model based on the remaining training data.
- Test the accuracy of the model on the test data (out-of-sample data that was not used to train the model).
- Repeat the process, holding out different portions of the data.
Goal: See how well our model will generalize to new data (data the model hasn’t seen).
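As a concrete illustration, here is a minimal hold-out validation sketch, assuming a scikit-learn-style workflow in Python; the toy data `X`, `y` and the `LinearRegression` model are placeholders, not part of the original notes.

```python
# Minimal hold-out validation sketch; X, y and the model are toy placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                     # toy "training data" features
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Randomly hold out 25% of the data as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)  # develop the model on training data
preds = model.predict(X_test)                     # predict the out-of-sample data
print("hold-out MSE:", mean_squared_error(y_test, preds))
```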
15.3.1 k-fold cross-validation
Divide your data into \(k\) folds (how many depends on how much data you have; \(k = 5\) or \(k = 10\) are common choices).
- Fit your model to \(k-1\) of the folds.
- See how well your model predicts the data in the held-out \(k\)th fold.
- Repeat, leaving out a different fold each time, so every fold serves once as the test set (see the sketch after this list); averaging the \(k\) error estimates gives the cross-validated error.
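A hedged sketch of \(k\)-fold cross-validation, again assuming scikit-learn in Python; `KFold`, the toy data, and the model are illustrative assumptions.

```python
# k-fold cross-validation sketch; KFold, the data, and the model are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

k = 5
fold_mse = []
for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])  # fit to k-1 folds
    preds = model.predict(X[test_idx])                          # predict the kth fold
    fold_mse.append(mean_squared_error(y[test_idx], preds))

print("per-fold MSE:", fold_mse)
print("mean CV MSE:", np.mean(fold_mse))
```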
15.3.2 Leave-one-out cross-validation
Best suited to smaller datasets, since the model must be refit once per observation.
- Fit your model to all but one observation in your data.
- See how well your model predicts the left-out observation.
- Repeat until every observation has been left out exactly once (see the sketch after this list); the average prediction error is the leave-one-out estimate.
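A leave-one-out sketch in the same assumed scikit-learn style; `LeaveOneOut` and the toy data are illustrative, not the notes' own example.

```python
# Leave-one-out sketch; fits the model n times, leaving out one row per fit.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                      # small dataset, where LOOCV shines
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=30)

squared_errors = []
for train_idx, test_idx in LeaveOneOut().split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])             # predict the single left-out row
    squared_errors.append((y[test_idx][0] - pred[0]) ** 2)

print("LOOCV MSE:", np.mean(squared_errors))
```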