In this lab you predict mpg (miles per gallon) from displacement (engine displacement). The variables are in the data set Auto from the package ISLR. You compare the performance of three distinct models, a linear regression model, a quadratic regression model, and \(k\)-nearest neighbors, and you select the best model using the train/dev/test paradigm.
The lab is structured as follows:
Partition and visualize the data
Train the models
Test the models
This lab uses the packages ISLR, ggplot2 and caret, and optionally dplyr (if you are familiar with this package).

library(ggplot2)
library(caret)
library(ISLR)
library(dplyr)
Partition the data into a training set Train consisting of 80% of the data, and a test set Test consisting of the remaining 20% of the data. Use the caret function createDataPartition() to get the vector with the row numbers of the training set. At the top of the R chunk, set the seed to 1 for reproducibility.

set.seed(1)
inTrain <- createDataPartition(y = Auto$mpg, p = .8, list = FALSE)
Train <- Auto[inTrain, ]
Test <- Auto[-inTrain, ]
Make a scatterplot with displacement on the x-axis and mpg on the y-axis. Display the cases in the training set in blue, and the cases in the test set in red.

ggplot() +
  geom_point(Train, mapping = aes(displacement, mpg), col = "blue") +
  geom_point(Test, mapping = aes(displacement, mpg), col = "red") +
  theme_minimal()
The linear regression model to be fitted is
\[\widehat{mpg}=\beta_0+\beta_1\cdot{displacement}\]
The linear regression model is fitted in R with the function lm(). If you are unfamiliar with this function, check its help page before you begin.
Fit the linear regression model on the training set, and save the lm object under the name linear.

linear <- lm(mpg ~ displacement, data = Train)
summary(linear)
Compute the train MSE of the model linear (the residuals can be obtained with the function resid()), and display its value.

mean(resid(linear)^2)
The quadratic regression model to be fitted is
\[\widehat{mpg}=\beta_0+\beta_1\cdot{displacement}+\beta_2\cdot{displacement}^2\]
The square of displacement can be included in the model formula with I(displacement^2) (the I() function makes R perform the squaring literally, rather than interpreting ^ as formula syntax).
Fit the quadratic regression model on the training set, and save the lm object under the name quadratic.

quadratic <- lm(mpg ~ displacement + I(displacement^2), data = Train)
summary(quadratic)
Compute the train MSE of the model quadratic, and display its value.

mean(resid(quadratic)^2)
The KNN model does have a hyperparameter, the number of nearest neighbors \(k\), so cross-validation is necessary to determine the optimal value of \(k\).
The R code for training the KNN model with 5-fold cross-validation can be found in the lecture slides.

Train the KNN model on the training set with train(), and save the object under the name knn.

knn <- train(mpg ~ displacement,
             data = Train,
             method = "knn",
             tuneGrid = expand.grid(k = c(1, 2, 5, 10, 25, 50, 100)),
             trControl = trainControl(method = "cv", number = 5))
Print the object knn. Which value of \(k\) has the lowest RMSE? (The RMSE is the square root of the MSE.)

knn

The lowest cross-validated MSE can be obtained directly from the stored RMSE values:

min(knn$results$RMSE^2)
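To make the relationship between RMSE and MSE concrete, here is a small self-contained check; the residual values are made up purely for illustration:

```r
# Toy residuals, chosen only for illustration
res <- c(1, -2, 3)

mse  <- mean(res^2)   # mean squared error: (1 + 4 + 9) / 3
rmse <- sqrt(mse)     # root mean squared error

# Squaring the RMSE recovers the MSE
all.equal(rmse^2, mse)
```

This is why squaring the stored RMSE values in knn$results gives the corresponding MSE values.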
To visualize the predictions of the models, we first need to add the fitted values of the three models to the training set. The fitted values can be obtained with the function predict(<fitted object>, newdata = Train).
Add the fitted values of the three models to Train under the names pred_linear, pred_quadratic and pred_knn.

Train <- Train %>%
  mutate(pred_linear = predict(linear, Train),
         pred_quadratic = predict(quadratic, Train),
         pred_knn = predict(knn, Train))
Make a scatterplot of the training data, and add the fitted values with geom_line() for each of the three models. Give each line a different color.

ggplot(Train) +
  geom_point(aes(displacement, mpg)) +
  geom_line(aes(x = displacement, y = pred_linear), col = "blue") +
  geom_line(aes(x = displacement, y = pred_quadratic), col = "red") +
  geom_line(aes(x = displacement, y = pred_knn), col = "purple") +
  theme_minimal()
To test the models, you need to obtain the predictions of the models for the test set, and then compute the MSE for each model. The code for obtaining the predictions is the same as in the previous exercise, but now with Test as the data argument.
Add the predictions of the three models to Test.

Test <- Test %>%
  mutate(pred_linear = predict(linear, Test),
         pred_quadratic = predict(quadratic, Test),
         pred_knn = predict(knn, Test))
Compute the test MSE of each of the three models, and display their values.

Test %>% summarise(testMSE_linear = mean((mpg - pred_linear)^2),
                   testMSE_quadratic = mean((mpg - pred_quadratic)^2),
                   testMSE_knn = mean((mpg - pred_knn)^2))
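Once the three test MSEs are computed, the model with the lowest one is the winner. As a generic sketch (the MSE values below are invented for illustration, not results from this lab), which.min() picks it out of a named vector:

```r
# Hypothetical test MSEs, for illustration only
testMSE <- c(linear = 20.1, quadratic = 17.3, knn = 18.0)

# Name of the model with the smallest test MSE
names(which.min(testMSE))
```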
END OF LAB