In this lab we classify the spam4 data again, but using different techniques than in the previous lab. For this lab you need the packages kernlab, tree, randomForest, fastAdaboost, e1071 and caret.

knitr::opts_chunk$set(eval=params$answers, 
                      message=FALSE)
library(kernlab)
library(tree)
library(randomForest)
library(fastAdaboost)
library(caret)
library(e1071)
library(dplyr)

Training the models

To compare our results, we use the same training and test set.

Load the spam data, and convert it to spam4 as in the previous lab.

s <- function(x)as.numeric(cut(x, c(-Inf, unique(quantile(x, probs=seq(0, 1, .1))))))

spam4 <- spam %>% 
  mutate(across(1:57, s))

Make the same vector with indices for the training set as in the previous lab.

set.seed(10) 
train <- createDataPartition(y = spam4$type, p = .75, list = FALSE)

Tree

Grow a tree on the training set. Save the object as train_tree, and plot it.

train_tree <- tree(type ~ ., data = spam4[train, ])
plot(train_tree)
text(train_tree, cex = 0.8)

The fitted may be overfitting. Before we can prune the tree we need to determine the optimal number of knots. Find out this number by performing cross-validation with the function cv.tree().

cv_tree <- cv.tree(train_tree)
cv_tree

Prune the tree using the optimal number of nodes from the cross-validation, and save the pruned tree object under an appropriate name.

prune_tree <- prune.tree(train_tree, best = cv_tree$size[which.min(cv_tree$dev)])

Bagging/random forest

Both models can be trained with the function randomForest(). The only argument that needs to be changed is mtry (see the function’s help page). To learn about the importance of the features, set the argument importance = TRUE.

Train the bagging model, save its object as train_bag, and display and interpret its content.

(train_bag <- randomForest(type ~ ., spam4[train, ], importance = TRUE, mtry = 57))

Train the random forest model, save its object as train_rf, display its content and compare it to that of the random forest object.

(train_rf  <- randomForest(type ~ ., spam4[train, ], importance = TRUE))

Display the variable importance plots of both objects for the 10 most important predictors, and interpret.

varImpPlot(train_bag, n.var = 10)
varImpPlot(train_rf, n.var = 10)

Boosting

For boosting we use the train() function with the “adaboost” method for tuning the nIter hyperparameter (see lecture slides).

Run the code on the slides, and provide a sequence of numbers for tuning nIter. Save the train object under an appropriate name.

train_boost <- train(type ~ ., 
                     data       = spam4[train, ], 
                     method     = "adaboost",
                     tuneGrid   = expand.grid(method = "Adaboost.M1", nIter = c(10, 25, 50)),
                     trControl  = trainControl(method = "cv", 
                                               number = 5)
)

Display and interpret the fitted object.

train_boost

Support Vector Machines

The svm() function offers four different kernels and has many hyperparameters that can be tuned. To restrict the choices, use the function tune("svm", . . .) to tune the cost parameter for an svm with a radial kernel. The default value for cost is 1, so try out a sequence of values in that neighborhood. Save the tune object under an appropriate name.

train_svm  <- tune("svm", type ~ ., 
                     data        = spam4[train, ], 
                     kernel      = "radial",
                     probability = TRUE, 
                     ranges      = list(cost = c(1:4)))

summary(train_svm)

Display the best model in the tune object.

train_svm$best.model

Display a classification plot for two features of your own choice, and interpret.

plot(train_svm$best.model, spam4, telnet ~ charDollar)

Testing the models

As in the previous labs, evaluate the five models (pruned tree, bagging, random forest, boosting and svm) by comparing the misclassification test error rates.

Save the class predictions of the models on the test set.

class_tree  <- predict(prune_tree,           spam4[-train, ], type = "class")

class_bag   <- predict(train_bag,            spam4[-train, ])

class_rf    <- predict(train_rf,             spam4[-train, ])

class_boost <- predict(train_boost,          spam4[-train, ])

class_svm   <- predict(train_svm$best.model, spam4[-train, ])

Obtain and display the confusion matrices.

cf <- function(obs = spam4[-train, 58], est)table(obs = obs, est = est)

(conf_tree <- cf(est = class_tree))

(conf_bag <- cf(est = class_bag))

(conf_rf <- cf(est = class_rf))

(conf_boost <- cf(est = class_boost))

(conf_svm <- cf(est = class_svm))

Compute, display and compare the misclassification test errors for the five models.

miscl <- function(x) 1 - sum(diag(x)) / sum(x)

data.frame(tree  = miscl(conf_tree),
           bag   = miscl(conf_bag),
           rf    = miscl(conf_rf),
           boost = miscl(conf_boost),
           svm   = miscl(conf_svm)) %>% 
  round(4)

Compare these test errors to those of the models in the previous lab. What do you conclude?

END OF LAB

Lab 3B

Tree-based methods and SVM’s

2021