15.5 Extending Machine Learning

How do I know which variables matter? Below are examples of more complex machine learning methods; the sketch after this list shows how each can be fit through caret's common train() interface.

  • Random forests
  • Gradient boosting
  • LASSO
  • SVM
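
Each of these methods is available through caret by changing the method string passed to train(). The following is an illustrative sketch rather than code run in this chapter: it assumes the fitControl and don2 objects built in the worked example below, the object names (rf_fit, lasso_fit, svm_fit) are hypothetical, and each method requires its backing package to be installed.

library(caret)

## Same train() interface, different method strings (each needs its
## backing package installed):
##   random forests    -> method = "rf"        (randomForest)
##   gradient boosting -> method = "gbm"       (gbm)
##   LASSO             -> method = "glmnet"    (glmnet)
##   SVM               -> method = "svmRadial" (kernlab)
rf_fit    <- train(donation ~ ., data = don2, method = "rf",
                   trControl = fitControl)
lasso_fit <- train(donation ~ ., data = don2, method = "glmnet",
                   trControl = fitControl,
                   tuneGrid = expand.grid(alpha = 1,  # alpha = 1 gives the LASSO penalty
                                          lambda = 10^seq(-4, 0, length.out = 10)))
svm_fit   <- train(donation ~ ., data = don2, method = "svmRadial",
                   trControl = fitControl)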

Tradeoffs: machine learning methods can act as a “black box,” trading interpretability for predictive power (see here).

Here is a rough example using caret to fit one of these methods (gradient boosting) with 5-fold cross-validation. Be sure to read the documentation before using this in your own work. The performance metrics reported below (Accuracy and Kappa) are described here.
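
One note on those metrics before the output: Accuracy is the share of correct predictions, while Kappa rescales accuracy against what chance agreement alone would produce, which matters when the classes are as imbalanced as they are here. A minimal sketch (kappa_stat is a hypothetical helper, not part of caret):

## Kappa compares observed accuracy p_o with chance agreement p_e
kappa_stat <- function(p_o, p_e) (p_o - p_e) / (1 - p_e)
## A model that always predicts the majority class has p_o equal to p_e,
## so its Kappa is 0 even though its Accuracy looks high
kappa_stat(p_o = 0.964, p_e = 0.964)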

library(caret)

## Establish the type of training: 5-fold cross-validation
fitControl <- trainControl(method = "cv",
                           number = 5)

## Train the model. The . in the formula means "include all other
## variables as predictors," so subset to the outcome and a handful of
## predictors first.
library(tidyverse)
don2 <- don %>% dplyr::select(donation, Edsum, same_state, sameparty, NetWorth, peragsen)
don2$donation <- as.factor(ifelse(don2$donation == 1, "Donated", "Not Donated"))
don2 <- na.omit(don2)  # drop rows with missing values before fitting
mod_fit <- train(donation ~ ., data = don2,
                 method = 'gbm',      # stochastic gradient boosting
                 verbose = FALSE,
                 trControl = fitControl)
mod_fit
Stochastic Gradient Boosting 

52888 samples
    5 predictor
    2 classes: 'Donated', 'Not Donated' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 42310, 42311, 42310, 42311, 42310 
Resampling results across tuning parameters:

  interaction.depth  n.trees  Accuracy   Kappa     
  1                   50      0.9640561  0.00000000
  1                  100      0.9641506  0.06508933
  1                  150      0.9640750  0.10469663
  2                   50      0.9643208  0.07823384
  2                  100      0.9638292  0.08893512
  2                  150      0.9639049  0.08536604
  3                   50      0.9641318  0.07915004
  3                  100      0.9639238  0.07251411
  3                  150      0.9641128  0.07076335

Tuning parameter 'shrinkage' was held constant at a value of 0.1

Tuning parameter 'n.minobsinnode' was held constant at a value of 10
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were n.trees = 50, interaction.depth =
 2, shrinkage = 0.1 and n.minobsinnode = 10.
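
Returning to the opening question (which variables matter?), caret's varImp() reports each predictor's relative influence in the fitted boosting model, which partially opens the “black box.” A confusion matrix also helps explain the near-zero Kappa values above: with classes this imbalanced, a model that rarely predicts “Donated” can still reach about 96% accuracy. A minimal sketch using the mod_fit object fit above:

## Relative influence of each predictor in the boosting model
varImp(mod_fit)
plot(varImp(mod_fit))

## Confusion matrix on the training data. This reuses the data the model
## was fit on, so it is an optimistic check; the cross-validated metrics
## above are the more honest summary.
confusionMatrix(predict(mod_fit, newdata = don2), don2$donation)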

There are countless other ways to use machine learning. See here.