In this lab we classify the spam4 data again, but using different techniques than in the previous lab. For this lab you need the packages kernlab, tree, randomForest, fastAdaboost, e1071 and caret.
knitr::opts_chunk$set(eval = params$answers, message = FALSE)
library(kernlab)
library(tree)
library(randomForest)
library(fastAdaboost)
library(caret)
library(e1071)
library(dplyr)
To compare our results, we use the same training and test set. Load the spam data, and convert it to spam4 as in the previous lab.
# discretize each predictor into at most ten quantile-based categories
s <- function(x) as.numeric(cut(x, c(-Inf, unique(quantile(x, probs = seq(0, 1, .1))))))
spam4 <- spam %>%
  mutate(across(1:57, s))
set.seed(10)
train <- createDataPartition(y = spam4$type, p = .75, list = FALSE)
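A quick optional check (not part of the lab text): createDataPartition() stratifies on type, so the spam/nonspam proportions should be nearly identical in the training and test rows.
# optional: the stratified split should preserve the class ratio
prop.table(table(spam4$type[train]))
prop.table(table(spam4$type[-train]))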
Grow a classification tree on the training data, save it as train_tree, and plot it.
train_tree <- tree(type ~ ., data = spam4[train, ])
plot(train_tree)
text(train_tree, cex = 0.8)
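If you also want a numeric summary of the fitted tree, the tree package provides a summary() method that reports the variables actually used, the number of terminal nodes, and the training misclassification rate (an optional addition to the plot).
# optional: variables used, terminal nodes, and training error
summary(train_tree)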
Find the optimal tree size by cross-validation with cv.tree(), and prune the tree accordingly.
cv_tree <- cv.tree(train_tree)
cv_tree
prune_tree <- prune.tree(train_tree, best = cv_tree$size[which.min(cv_tree$dev)])
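To see why this size wins, you can plot the cross-validated deviance that cv.tree() returns against tree size; the chosen size is the one with minimal deviance (an optional check).
# optional: CV deviance per candidate tree size
plot(cv_tree$size, cv_tree$dev, type = "b",
     xlab = "tree size (terminal nodes)", ylab = "CV deviance")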
Both bagging and random forests can be trained with the function randomForest(). The only argument that needs to be changed is mtry (see the function's help page). To learn about the importance of the features, set the argument importance = TRUE.
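For context (not in the original lab text): for classification, randomForest() defaults to mtry = floor(sqrt(p)), while setting mtry equal to the number of predictors turns the procedure into bagging.
# default mtry for classification with the 57 spam4 predictors
floor(sqrt(57))  # 7; bagging instead uses mtry = 57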
Train a bagging model on the training data, save it as train_bag, and display and interpret its content.
# bagging: consider all 57 predictors at every split
(train_bag <- randomForest(type ~ ., spam4[train, ], importance = TRUE, mtry = 57))
Train a random forest, save it as train_rf, display its content and compare it to that of the bagging object.
(train_rf <- randomForest(type ~ ., spam4[train, ], importance = TRUE))
varImpPlot(train_bag, n.var = 10)
varImpPlot(train_rf, n.var = 10)
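The numbers behind these plots come from importance(); sorting on MeanDecreaseAccuracy reproduces the ranking in the plot (an optional alternative view).
# optional: numeric importance scores behind varImpPlot()
imp <- importance(train_rf)
head(imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ], 10)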
For boosting we use the train() function with the "adaboost" method for tuning the nIter hyperparameter (see lecture slides).
Tune the number of boosting iterations nIter with cross-validation. Save the train object under an appropriate name.
train_boost <- train(type ~ .,
                     data = spam4[train, ],
                     method = "adaboost",
                     tuneGrid = expand.grid(method = "Adaboost.M1", nIter = c(10, 25, 50)),
                     trControl = trainControl(method = "cv", number = 5))
train_boost
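caret train objects also have a plot() method; plotting train_boost shows the cross-validated accuracy for each value of nIter, which makes the selected setting easy to spot (optional).
# optional: CV accuracy across the tuned values of nIter
plot(train_boost)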
The svm() function offers four different kernels and has many hyperparameters that can be tuned. To restrict the choices, use the function tune("svm", ...) to tune the cost parameter for an svm with a radial kernel. The default value for cost is 1, so try out a sequence of values in that neighborhood. Save the tune object under an appropriate name.
train_svm <- tune("svm", type ~ .,
                  data = spam4[train, ],
                  kernel = "radial",
                  probability = TRUE,
                  ranges = list(cost = 1:4))
summary(train_svm)
Display the best model stored in the tune object.
train_svm$best.model
plot(train_svm$best.model, spam4, telnet ~ charDollar)
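The tune object itself can be plotted as well; with a single tuned parameter this draws the estimated error against cost, an optional check that the chosen value is not on the edge of the grid.
# optional: cross-validated error for each value of cost
plot(train_svm)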
As in the previous labs, evaluate the five models (pruned tree, bagging, random forest, boosting and svm) by comparing the misclassification test error rates.
class_tree  <- predict(prune_tree, spam4[-train, ], type = "class")
class_bag   <- predict(train_bag, spam4[-train, ])
class_rf    <- predict(train_rf, spam4[-train, ])
class_boost <- predict(train_boost, spam4[-train, ])
class_svm   <- predict(train_svm$best.model, spam4[-train, ])
# confusion matrix of observed versus estimated classes on the test set
cf <- function(obs = spam4[-train, 58], est) table(obs = obs, est = est)
(conf_tree  <- cf(est = class_tree))
(conf_bag   <- cf(est = class_bag))
(conf_rf    <- cf(est = class_rf))
(conf_boost <- cf(est = class_boost))
(conf_svm   <- cf(est = class_svm))
# misclassification rate: 1 minus the proportion of correct predictions
miscl <- function(x) 1 - sum(diag(x)) / sum(x)
data.frame(tree = miscl(conf_tree),
bag = miscl(conf_bag),
rf = miscl(conf_rf),
boost = miscl(conf_boost),
svm = miscl(conf_svm)) %>%
round(4)
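If you want more than the raw error rate, caret's confusionMatrix() adds accuracy with a confidence interval, sensitivity and specificity from the same predictions; sketched here for the random forest, but any of the class_* vectors would work.
# optional: richer test-set summary, shown for the random forest
confusionMatrix(data = class_rf, reference = spam4$type[-train])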
END OF LAB