15.3 Cross-Validation

Cross-validation is an approach for detecting and addressing overfitting:

  • Take data for which you know the answer – we call this “training data”
  • Randomly subset out a portion of the training data. This will become our “test” data.
  • Develop a model based on the remaining training data.
  • Test the accuracy of the model on the test data (out-of-sample data that was not used to train the model).
  • Repeat the process for different portions of the data.

Goal: See how well our model will generalize to new data (data the model hasn’t seen).
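The steps above can be sketched in Python. This is a minimal illustration on a simulated dataset (the data, the 80/20 split, and the simple linear model are all assumptions made for the example, not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated "training data" where we know the answer: y = 2x + 1 plus noise
x = rng.uniform(0, 10, 100)
y = 2 * x + 1 + rng.normal(0, 1, 100)

# Randomly subset out 20% of the observations as "test" data
idx = rng.permutation(len(x))
test_idx, train_idx = idx[:20], idx[20:]

# Develop the model on the remaining training data only
slope, intercept = np.polyfit(x[train_idx], y[train_idx], 1)

# Test accuracy on the held-out (out-of-sample) data: mean squared error
pred = slope * x[test_idx] + intercept
mse = float(np.mean((y[test_idx] - pred) ** 2))
print(mse)
```

Repeating this with different random splits (the last bullet above) gives a sense of how variable the out-of-sample error is.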

15.3.1 k-fold cross-validation

Divide your data into \(k\) folds (the choice of \(k\) depends on how much data you have)

  • Fit your model to \(k-1\) folds
  • See how well your model predicts the data in the \(k\)th fold.
  • Can repeat, leaving out a different fold each time
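A sketch of \(k\)-fold cross-validation, again on a simulated dataset with a simple linear model (both are assumptions for illustration; `fold_mse` is just an illustrative name):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 2 * x + 1 + rng.normal(0, 1, 100)

k = 5
# Shuffle the indices, then divide them into k folds
folds = np.array_split(rng.permutation(len(x)), k)

fold_mse = []
for i in range(k):
    # Fit to the k-1 folds that exclude fold i
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    slope, intercept = np.polyfit(x[train_idx], y[train_idx], 1)
    # See how well the model predicts the held-out fold
    pred = slope * x[test_idx] + intercept
    fold_mse.append(float(np.mean((y[test_idx] - pred) ** 2)))

# Average out-of-sample error across the k folds
print(float(np.mean(fold_mse)))
```

Leaving out each fold in turn, as the loop does, yields one error estimate per fold; their average summarizes how well the model generalizes.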

15.3.2 Leave-one-out cross-validation

Best for smaller datasets

  • Fit your model to all but one observation in your data
  • See how well your model predicts the left-out observation
  • Can repeat, continuing to leave out one observation each time
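Leave-one-out cross-validation can be sketched the same way; it is \(k\)-fold with \(k\) equal to the number of observations. The small simulated dataset and linear model below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 30)   # a small dataset, where LOOCV is most useful
y = 2 * x + 1 + rng.normal(0, 1, 30)

errors = []
for i in range(len(x)):
    # Fit the model to all but observation i
    mask = np.arange(len(x)) != i
    slope, intercept = np.polyfit(x[mask], y[mask], 1)
    # See how well the model predicts the left-out observation
    errors.append(float((y[i] - (slope * x[i] + intercept)) ** 2))

# One squared prediction error per observation; average them
loocv_mse = float(np.mean(errors))
print(loocv_mse)
```

Each observation is left out exactly once, so the loop produces as many error estimates as there are data points.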