Advanced Machine Learning with R

LASSO

It's a simple matter to update the code we used for ridge regression to accommodate LASSO. I'm going to change just two things: the random seed and the value of alpha, which I'll set to 1:

> set.seed(1876)

> lasso <- glmnet::cv.glmnet(
    x,
    y,
    nfolds = 5,
    type.measure = "auc",
    alpha = 1,
    family = "binomial"
  )
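To see why setting alpha to 1 matters, here is a minimal base R sketch of how an L1 penalty differs from a ridge penalty. It assumes the textbook special case of an orthonormal design, where the LASSO estimate is a soft-thresholded version of the least-squares estimate and the ridge estimate is a proportional shrinkage of it. The function names and the numbers are hypothetical, and this illustrates the effect of the penalty rather than reproducing how glmnet fits the binomial model.

```r
# Soft-thresholding: shrink towards zero by lambda and clip at exactly zero.
soft_threshold <- function(beta_ols, lambda) {
  sign(beta_ols) * pmax(abs(beta_ols) - lambda, 0)
}

# Ridge under the same assumption: proportional shrinkage, never exactly zero.
ridge_shrink <- function(beta_ols, lambda) {
  beta_ols / (1 + lambda)
}

beta_ols <- c(-0.9, 0.6, 0.15, -0.05)  # made-up least-squares estimates
lambda <- 0.2

lasso_coef <- soft_threshold(beta_ols, lambda)
ridge_coef <- ridge_shrink(beta_ols, lambda)

print(rbind(ols = beta_ols, lasso = lasso_coef, ridge = ridge_coef))
```

In this toy case, the two smallest estimates are set to exactly zero by the L1 penalty, while ridge keeps all four and only makes them smaller.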

The plot of the model is quite interesting:

> plot(lasso)

The output of the preceding code is as follows:

The numbers along the top of the plot show how many features have non-zero coefficients as lambda changes. At one standard error from the best lambda, just eight features are included!
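The counts shown on the plot are also stored on the fitted object, so you can read them off directly. The following is a sketch on simulated data, not the dataset used in this chapter: `sim_x`, `sim_y`, and `sim_lasso` are hypothetical names, and the signal structure is made up. It relies on the `nzero`, `lambda`, `lambda.min`, and `lambda.1se` components of a `cv.glmnet` object.

```r
library(glmnet)

set.seed(1876)
n <- 400
p <- 10
sim_x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("V", 1:p)))
# Only the first three columns carry signal in this toy setup.
sim_y <- rbinom(n, 1, plogis(sim_x[, 1] - sim_x[, 2] + 0.5 * sim_x[, 3]))

sim_lasso <- cv.glmnet(sim_x, sim_y, nfolds = 5, type.measure = "auc",
                       alpha = 1, family = "binomial")

# nzero holds the number of non-zero coefficients at each lambda on the path.
nz_min <- unname(sim_lasso$nzero[match(sim_lasso$lambda.min, sim_lasso$lambda)])
nz_1se <- unname(sim_lasso$nzero[match(sim_lasso$lambda.1se, sim_lasso$lambda)])

print(c(at_lambda_min = nz_min, at_lambda_1se = nz_1se))
```

Because `lambda.1se` is never smaller than `lambda.min`, it applies at least as much penalty, which is why it usually gives the sparser model.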

Let's have a gander at those coefficients:

> coef(lasso, s = "lambda.1se")
17 x 1 sparse Matrix of class "dgCMatrix"
                      1
(Intercept) -0.30046007
TwoFactor1  -0.53307368
TwoFactor2   0.52110703
Linear1      .
Linear2     -0.42669146
Linear3      0.35514853
Linear4     -0.20726177
Linear5      0.10381320
Linear6      .
Nonlinear1   0.10478862
Nonlinear2   .
Nonlinear3   .
Noise1       .
Noise2       .
Noise3       .
Noise4       .
random1     -0.06581589

Now, this looks much better. LASSO threw out those nonsense noise features (Noise1 through Noise4) and Linear1, although random1 is retained with a small coefficient. However, before we start congratulating ourselves, look at how Linear6 was constrained to zero. Does it need to be in the model or not? We could adjust the lambda value to see where it enters and what effect it has.
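One way to find out where a feature enters is to walk the coefficient path stored in the fitted object. This sketch again uses simulated data rather than the chapter's dataset, so `sim_x`, `sim_y`, `sim_lasso`, and the weak feature `V2` are all hypothetical stand-ins (think of `V2` as playing the role of Linear6). It assumes only the `glmnet.fit$beta` and `glmnet.fit$lambda` components of a `cv.glmnet` object.

```r
library(glmnet)

set.seed(1876)
n <- 400
sim_x <- matrix(rnorm(n * 6), n, 6, dimnames = list(NULL, paste0("V", 1:6)))
# V1 is strong, V2 is weak, and V3 to V6 are pure noise in this toy setup.
sim_y <- rbinom(n, 1, plogis(1.5 * sim_x[, "V1"] + 0.3 * sim_x[, "V2"]))

sim_lasso <- cv.glmnet(sim_x, sim_y, nfolds = 5, type.measure = "auc",
                       alpha = 1, family = "binomial")

# Coefficient path: one row per feature, one column per lambda (largest first).
path <- as.matrix(sim_lasso$glmnet.fit$beta)
lambdas <- sim_lasso$glmnet.fit$lambda

# Largest lambda at which each feature has a non-zero coefficient.
entry_lambda <- apply(path, 1, function(b) {
  if (any(b != 0)) max(lambdas[b != 0]) else NA_real_
})
print(sort(entry_lambda, decreasing = TRUE))

# Is the weak feature already in the model at the lambda.1se cutoff?
print(entry_lambda["V2"] >= sim_lasso$lambda.1se)
```

If you prefer a picture, `plot(sim_lasso$glmnet.fit, xvar = "lambda", label = TRUE)` draws the same path with one line per feature.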

It's time to check how it does on the training data:

> lasso_pred <-
    data.frame(predict(
      lasso,
      newx = x,
      type = "response",
      s = "lambda.1se"
    ))

> Metrics::auc(y, lasso_pred$X1)
[1] 0.8621664

> classifierplots::density_plot(y, lasso_pred$X1)

The output of the preceding code is as follows:

These results are quite similar to those from ridge regression. A proper evaluation, however, is done on the test data:

> lasso_test <-
    data.frame(predict(lasso, newx = as.matrix(test[, -17]),
                       type = "response", s = "lambda.1se"))

> Metrics::auc(test$y, lasso_test$X1)
[1] 0.8684276

> Metrics::logLoss(test$y, lasso_test$X1)
[1] 0.4512764
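If you want to see what these two numbers measure, here is a minimal base R sketch. The helper names `auc_by_rank` and `log_loss` are hypothetical, the labels and probabilities are made up, and the code is meant to show the definitions rather than to copy the internals of the Metrics package.

```r
# AUC via the rank-sum (Mann-Whitney) identity; ties receive average ranks.
auc_by_rank <- function(actual, predicted) {
  r <- rank(predicted)
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Log-loss: the mean negative Bernoulli log-likelihood of the predictions.
log_loss <- function(actual, predicted) {
  -mean(actual * log(predicted) + (1 - actual) * log(1 - predicted))
}

actual <- c(1, 0, 1, 1, 0, 0)                 # made-up labels
predicted <- c(0.9, 0.2, 0.6, 0.4, 0.5, 0.1)  # made-up probabilities

toy_auc <- auc_by_rank(actual, predicted)      # 8 of 9 pairs ranked correctly
toy_log_loss <- log_loss(actual, predicted)
print(c(auc = toy_auc, log_loss = toy_log_loss))
```

AUC depends only on how the predictions are ordered, whereas log-loss also penalizes probabilities that are confidently wrong, which is why it's worth reporting both.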

> classifierplots::density_plot(test$y, lasso_test$X1)

The output of the preceding code is as follows:

Compared with ridge regression, the LASSO model does have a slightly lower AUC and a marginally higher log-loss (0.45 versus 0.43). In the real world, I'm not sure that difference would be meaningful, given that we have a more parsimonious model with LASSO. I guess that's another trade-off to weigh alongside bias versus variance: predictive power versus complexity.

Speaking of complexity, let's move on to elastic net.