Mastering Machine Learning with R

Cross-validation and logistic regression

Our goal here is to build a model using 5-fold cross-validation. We'll utilize the caret package to establish our sampling scheme and to produce the final model. Start by creating the resampling scheme with trainControl():

> glm_control <-
caret::trainControl(method = "cv",
number = 5,
returnResamp = "final")

This object is passed as an argument to the train() function. We now produce our input features and our response variable (which must be a factor for caret to train this as logistic regression), set our random seed, and train the model. In the train() function, specify glm for Generalized Linear Model (GLM):

> x <- train_reduced[, -22]

> y <- as.factor(train_reduced$y)

> set.seed(1988)

> glm_fit <-
caret::train(x, y, method = "glm",
trControl = glm_control,
trace = FALSE)

When that's done grinding away, you can quickly check the results:

> glm_fit$results
parameter Accuracy Kappa AccuracySD KappaSD
1 none 0.9602 0.0002369 0.0001591 0.001743

Look at that, 96 percent accuracy! I know that's entirely meaningless, because if we just guessed that all labels in the response were zero, we would achieve 96 percent accuracy. That may seem obvious, but I've interviewed people with Data Science degrees who missed that fact. Kappa refers to what's known as Cohen's Kappa statistic. The Kappa statistic provides insight into this problem by adjusting the accuracy score to account for the model being correct by mere chance. The formula for the statistic is as follows:

Kappa = (percent of agreement - percent of chance agreement) / (1 - percent of chance agreement)
The percent of agreement is the rate at which the model's predictions agree with the actual class (in other words, accuracy), and the percent of chance agreement is the rate of agreement you would expect purely at random. The higher the statistic, the better the performance, with the maximum agreement being one. So, with this Kappa score, the model is pathetic.
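To make the formula concrete, here's a minimal sketch of Cohen's Kappa computed from a confusion table. The cohens_kappa() helper and the simulated labels are purely illustrative and not part of our modeling pipeline:

> cohens_kappa <- function(actual, predicted) {
    tab <- table(factor(actual, levels = 0:1),
                 factor(predicted, levels = 0:1))
    n <- sum(tab)
    p_agree <- sum(diag(tab)) / n                        # percent of agreement (accuracy)
    p_chance <- sum(rowSums(tab) * colSums(tab)) / n^2   # percent of chance agreement
    (p_agree - p_chance) / (1 - p_chance)
  }

> set.seed(1988)

> sim_actual <- rbinom(10000, 1, 0.04)  # roughly 4 percent positives, like our data

> sim_pred <- rep(0, 10000)             # always guess the majority class

> mean(sim_actual == sim_pred)          # accuracy of about 0.96

> cohens_kappa(sim_actual, sim_pred)    # 0 -- no agreement beyond chance

With no variation in the predictions, Kappa collapses to zero even though accuracy looks impressive, which is exactly the point of the statistic.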

Well, Kappa would be useful with more balanced labels. We're now left to find other ways to examine model results. It's always a good idea to compare the probability distributions of the different classes with a density or box plot.

Here we produce an elegant and colorful density plot on the training data:

> glm_train_pred <- predict(glm_fit, train, type = "prob")

> colnames(glm_train_pred) <- c("zero", "one")

> classifierplots::density_plot(train_reduced$y, glm_train_pred$one)

The preceding code produces a density plot of the predicted probability of dissatisfaction for each of the two classes.

This gives us an interesting look at what the model's doing. We don't see any predictive power until the predicted probability gets to around 7 percent. We can identify an optimal probability threshold to maximize our classification objective. There's an excellent function in the InformationValue package that we'll apply shortly. It allows for the determination of four different thresholds:

  • misclasserror: The default setting in the function; it identifies the threshold that minimizes the misclassification error
  • Ones: The threshold that maximizes the detection of 1s
  • Zeros: The threshold that maximizes the detection of 0s
  • Both: The threshold that maximizes Youden's Index, which is (sensitivity + specificity) - 1
Sensitivity = True Positives / (True Positives + False Negatives): also called the True Positive Rate or Recall, this measures how well we identify positive labels.
Specificity = True Negatives / (True Negatives + False Positives): also called the True Negative Rate, this measures how well we identify negative labels.
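To see those definitions in action, the following sketch computes sensitivity, specificity, and Youden's Index by hand over a handful of candidate cutoffs on the training predictions. The cutoff_metrics() helper and the grid of cutoffs are my own illustrative choices, not part of the InformationValue package:

> cutoff_metrics <- function(actual, prob, cutoffs) {
    sapply(cutoffs, function(cut) {
      pred <- ifelse(prob > cut, 1, 0)
      sens <- sum(pred == 1 & actual == 1) / sum(actual == 1)  # TP / (TP + FN)
      spec <- sum(pred == 0 & actual == 0) / sum(actual == 0)  # TN / (TN + FP)
      c(cutoff = cut, sensitivity = sens, specificity = spec,
        youden = sens + spec - 1)
    })
  }

> cutoff_metrics(train_reduced$y, glm_train_pred$one,
                 cutoffs = c(0.03, 0.06, 0.10, 0.25))

The cutoff with the largest youden value is what the function below is searching for when we ask it to optimize for Both.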

In this case, we shall take a look at the threshold for Both. We'll also ask for all of the diagnostics:

> glm_cutoff <-
InformationValue::optimalCutoff(
train_reduced$y,
glm_train_pred$one,
optimiseFor = 'Both',
returnDiagnostics = TRUE
)

If you click on glm_cutoff in your Global Environment or run View(glm_cutoff), you'll see a list of six different results; a quick way to pull them out programmatically follows the list:

  • optimalCutoff = 0.0606
  • A sensitivity table you can examine further on your own
  • Misclassification error = 0.2006
  • TPR = 0.6079
  • FPR = 0.1927
  • Specificity = 0.8073
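You can also extract these values from the returned list rather than scrolling through the viewer. The element names used below are assumptions based on the diagnostics listed above, so confirm them with names() if your version of the package differs:

> names(glm_cutoff)          # check the element names returned in your version

> glm_cutoff$optimalCutoff   # should return 0.0606

> glm_cutoff$TPR             # should return 0.6079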

If we select a cutoff of 0.0606, we'll achieve a True Positive Rate (TPR) of almost 61 percent. However, the False Positive Rate (FPR) will be over 19 percent.

Given the imbalance in the classes, that translates into a huge number of customers. A confusion matrix can demonstrate that fact:

> InformationValue::confusionMatrix(train_reduced$y, glm_train_pred$one, threshold = 0.0607)
0 1
0 47164 944
1 11247 1461

The training data contains 2,405 dissatisfied customers; at this threshold we correctly classify 1,461 of them, but at the cost of incorrectly flagging 11,247 satisfied customers. So where we decide to put an optimal threshold depends on the needs of the business. We'll see how to portray that differently during model comparison.
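One way to frame that business decision is to tabulate, for a few candidate cutoffs, how many customers we would flag and how many dissatisfied customers we would actually catch. The threshold_tradeoff() helper and the cutoff values here are hypothetical, chosen only for illustration:

> threshold_tradeoff <- function(actual, prob, cutoffs) {
    t(sapply(cutoffs, function(cut) {
      pred <- ifelse(prob > cut, 1, 0)
      c(cutoff = cut,
        flagged = sum(pred == 1),                        # customers we would contact
        caught = sum(pred == 1 & actual == 1),           # dissatisfied customers caught
        false_positives = sum(pred == 1 & actual == 0))  # satisfied customers flagged
    }))
  }

> threshold_tradeoff(train_reduced$y, glm_train_pred$one,
                     cutoffs = c(0.04, 0.0607, 0.10, 0.20))

A higher cutoff flags fewer customers but misses more of the dissatisfied ones; where to land on that trade-off is a business call, not a statistical one.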

Let's now see how the algorithm ranked variable importance:

> caret::varImp(glm_fit)
glm variable importance

only 20 most important variables shown (out of 21)

Overall
V2 100.0000
V103 70.2840
V141 33.2809
V105 18.0160
V24 13.1048
V129 12.4327
V55 10.7379
V34 8.7920
V45 8.5681
V124 7.1968
V122 5.9959
V109 5.8668
V33 4.8295
V125 3.6369
V131 1.5439
V126 0.8383
V47 0.7430
V132 0.4286
V66 0.3133
V31 0.0912

Our suspicious variable is number one in overall importance. I'd recommend trying other models with V2 dropped from any consideration.
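If you want to run that experiment, a minimal sketch might look like the following; glm_fit_noV2 is just an illustrative name, and I haven't tuned or evaluated it here:

> x_noV2 <- x[, setdiff(colnames(x), "V2")]

> set.seed(1988)

> glm_fit_noV2 <-
  caret::train(x_noV2, y, method = "glm",
               trControl = glm_control,
               trace = FALSE)

> glm_fit_noV2$results

But this is up to you, as I'm of the mindset right now to see how performance is on the test data: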

> glm_test_pred <- predict(glm_fit, test, type = "prob")

> colnames(glm_test_pred) <- c("zero", "one")

> classifierplots::density_plot(test$y, glm_test_pred$one)

The preceding code produces a density plot of the predicted probabilities on the test data, split by class.

Very similar results on the test data. What about a confusion matrix given our threshold determined during training? Let's see:

> InformationValue::confusionMatrix(test$y, glm_test_pred$one, threshold = 0.0607)
0 1
0 11710 227
1 2891 376

Consistent results! Now, let's examine this model's performance on the test data so that we can compare it to what the upcoming MARS model will produce. Two metrics should address the issue: Area Under the Curve (AUC) and log-loss. The AUC provides a useful indicator of performance; it can be shown that the AUC is equal to the probability that an observer will correctly identify the positive case when presented with a randomly chosen pair of cases in which one case is positive and one case is negative (Hanley JA & McNeil BJ, 1982). In our case, we'll swap the observer for our algorithm and evaluate accordingly. Log-loss is an effective metric because it takes into account the predicted probability and how much it deviates from the correct label. The following formula produces it:

LogLoss = -(1 / N) * sum(y_i * log(p_i) + (1 - y_i) * log(1 - p_i))

Here, y_i is the actual label (0 or 1) for observation i, and p_i is the predicted probability that the label is 1.
Like golf, a lower value is better. A perfect model would have a log-loss of 0; unlike accuracy, though, log-loss isn't capped at 1, and badly miscalibrated predictions can push it well above that. We can produce these values easily with the Metrics package:

> Metrics::auc(test$y, glm_test_pred$one)
[1] 0.7794

> Metrics::logLoss(test$y, glm_test_pred$one)
[1] 0.1499
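As a sanity check on the formula above, you can also reproduce the log-loss by hand; the manual_logloss() helper is purely illustrative:

> manual_logloss <- function(actual, prob) {
    # probabilities of exactly 0 or 1 would need clipping in practice
    -mean(actual * log(prob) + (1 - actual) * log(1 - prob))
  }

> manual_logloss(test$y, glm_test_pred$one)  # should closely match the value above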

Our AUC isn't that great, I would say. If the model were no better than a random guess, the AUC would be 0.5; if it were perfect, it would be 1. Our log-loss is only meaningful when we compare it to the upcoming model.
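If you want to convince yourself of the pairwise interpretation of AUC mentioned earlier, you can approximate it with a quick Monte Carlo check: sample positive and negative cases and count how often the positive case receives the higher predicted probability. This is a rough sketch of the idea, not how Metrics::auc actually computes it:

> set.seed(1988)

> pos_prob <- glm_test_pred$one[test$y == 1]

> neg_prob <- glm_test_pred$one[test$y == 0]

> mean(sample(pos_prob, 100000, replace = TRUE) >
       sample(neg_prob, 100000, replace = TRUE))  # should land near the 0.78 AUC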