knitr::opts_chunk$set(eval = params$answers, message = FALSE)
library(ggplot2)
library(gamlss)
library(splines)
library(rpart)
library(caret)
In this lab you visualize data, and fit polynomials, splines and regression trees. You need the packages ggplot2, gamlss, splines, rpart and caret. To get reproducible results, set the seed to 1 before you begin.
set.seed(1)
The data set for this exercise is gamlss::dbbmi (also see lecture slides). Check the help page of this data set for a description of the data.

Make a scatterplot with age on the x-axis and bmi on the y-axis.

ggplot(dbbmi, aes(x = age, y = bmi)) +
geom_point(alpha = 0.2) +
theme_minimal()
Add a polynomial regression line to the plot, and increase the degree of the polynomial until you find a satisfactory compromise between fit and model complexity (the optimal bias-variance trade-off).

ggplot(dbbmi, aes(x = age, y = bmi)) +
geom_point(alpha = 0.2) +
geom_line(aes(y = fitted(lm(bmi ~ poly(age, degree = 14)))), col="red", size = 1) +
theme_minimal()
Fit a cubic B-spline with bs() and a natural cubic spline with ns(), and compare the regression lines for equal df. What is the difference?

ggplot(dbbmi, aes(x = age, y = bmi)) +
geom_point(alpha = 0.2) +
geom_line(aes(y = fitted(lm(bmi ~ bs(age, df = 8)))), col="red", size = 1) +
geom_line(aes(y = fitted(lm(bmi ~ ns(age, df = 8)))), col="blue", size = 1) +
theme_minimal()
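One difference to look for: a natural spline is constrained to be linear beyond its boundary knots, while a B-spline is not. A minimal sketch of this (using made-up data, not the dbbmi set) that checks the linear extrapolation of ns():

```r
# A natural spline (ns) is linear beyond its boundary knots, so its
# predictions at equally spaced points outside the data range have
# second differences of ~0. Illustrative data only.
library(splines)

x <- seq(0, 10, length.out = 50)
y <- sin(x)
fit_ns <- lm(y ~ ns(x, df = 6))

xx <- data.frame(x = c(12, 13, 14))   # points beyond the range of x
p  <- predict(fit_ns, newdata = xx)
diff(p, differences = 2)              # ~0: exactly linear extrapolation
```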
Even for large df, it is difficult to get good predictions for the ages close to 0. You may get better results by experimenting with the position of the knots. Try to get a better result by fitting a cubic spline with a single, well-placed knot.

ggplot(dbbmi, aes(x = age, y = bmi)) +
geom_point(alpha = 0.2) +
geom_line(aes(y = fitted(lm(bmi ~ bs(age, knots = 0.4)))), col="red", size = 1) +
theme_minimal()
The data set Fortune500 for this exercise contains data on the 500 richest companies in the USA. Load it with the code below.

load(url("https://maartencruyff.github.io/S31/2A-Feature Space Expansion/data/Fortune500.Rdata"))
Make a scatterplot with Assets on the y-axis and Profits on the x-axis. Based on this plot, what seems to be the best choice for the degree of the polynomial of Profits?

ggplot(Fortune500, aes(Profits, Assets)) +
geom_point() +
theme_minimal()
The aim is to select the best model for predicting Assets from Profits. The candidates are polynomial models (linear, quadratic, cubic and quartic), a regression tree, and k-nearest neighbors. To simplify matters, we do not create a training and a test set, but perform a 10-fold cross-validation for each model using the entire sample, and select the model with the lowest cross-validated RMSE.
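For reference, the cross-validated RMSE that caret reports is just the root mean squared prediction error on the held-out observations. A minimal sketch of the computation (the function name rmse is ours, not caret's):

```r
# Root mean squared error between observed y and predictions yhat
rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))

rmse(c(1, 2, 3), c(1, 2, 5))  # sqrt(4/3), about 1.155
```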
We start with testing the performances of the linear, quadratic, cubic and quartic models in the traditional way, i.e. by testing the significance of the R^2-change when increasing the polynomial degree. So we do not cross-validate, but simply test the R^2 change of the four models with the function anova().
Compare the four nested models with the anova() function. A significant result means that the higher-order polynomial fits better than the lower-order one. Which polynomial model comes out as the best?

anova(lm(Assets ~ poly(Profits, 1), Fortune500),
lm(Assets ~ poly(Profits, 2), Fortune500),
lm(Assets ~ poly(Profits, 3), Fortune500),
lm(Assets ~ poly(Profits, 4), Fortune500))
ggplot(Fortune500, aes(Profits, Assets)) +
geom_point() +
geom_smooth(method = "lm", se = F, formula = y ~ poly(x, 4)) +
theme_minimal()
Now fit the four polynomial models with train() using 10-fold cross-validation, and compare the cross-validated RMSEs. Which model comes out as best now?

train(Assets ~ poly(Profits, 1), data = Fortune500,
      method = "lm",
      trControl = trainControl(method = "cv"))

train(Assets ~ poly(Profits, 2), data = Fortune500,
      method = "lm",
      trControl = trainControl(method = "cv"))

train(Assets ~ poly(Profits, 3), data = Fortune500,
      method = "lm",
      trControl = trainControl(method = "cv"))

train(Assets ~ poly(Profits, 4), data = Fortune500,
      method = "lm",
      trControl = trainControl(method = "cv"))
ggplot(Fortune500, aes(Profits, Assets)) +
geom_point() +
geom_smooth(method = "lm", se = F, formula = y ~ poly(x, 4)) +
geom_smooth(method = "lm", se = F, formula = y ~ x, col = "red") +
theme_minimal()
A regression tree partitions the predictor Profits into non-overlapping regions, and assigns the mean of Assets within each region as the predicted value.
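To see what this means, here is a hand-rolled version of a single-split tree prediction on made-up data (the split point and values are illustrative):

```r
# One split at x = 5 creates two regions; each observation is predicted
# with the mean of y in its own region
x <- c(1, 2, 3, 10, 11, 12)
y <- c(5, 6, 7, 20, 21, 22)
left <- x < 5
pred <- ifelse(left, mean(y[left]), mean(y[!left]))
pred  # 6 6 6 21 21 21
```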
Check the help page of the function rpart(), and fit a tree to the data using the "anova" method. Save the fitted object as tree.

tree <- rpart(Assets ~ Profits, data = Fortune500,
              method = "anova")
Plot the tree object, and interpret the plot.

plot(tree)
text(tree)
Cross-validate the tree object to find the optimal value for the hyperparameter cp, which controls the complexity of the tree. You can do this with the train() function with method = "rpart". Save the object as tree_cv.

tree_cv <- train(Assets ~ Profits,
                 data = Fortune500,
                 method = "rpart",
                 trControl = trainControl(method = "cv"))
Compute the predictions of tree_cv with the function predict(), and save them as pred_tree_cv.

pred_tree_cv <- predict(tree_cv, Fortune500)
ggplot(Fortune500, aes(Profits, Assets)) +
geom_point() +
geom_smooth(method = "lm", se = F, formula = y ~ poly(x, 4)) +
geom_smooth(method = "lm", se = F, formula = y ~ x, col = "red") +
geom_line(aes(Profits, pred_tree_cv), col = "brown") +
theme_minimal()
KNN assigns the mean of the k nearest neighbors as prediction, where k is a hyperparameter to be determined with cross-validation.
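As a quick illustration of the mechanics (toy data; k and the query point x0 are chosen arbitrarily):

```r
# Predict y at a new point x0 as the mean of the k nearest training points
x  <- c(1, 2, 3, 10, 11)
y  <- c(2, 4, 6, 20, 22)
x0 <- 2.5
k  <- 3
nn <- order(abs(x - x0))[1:k]  # indices of the k nearest neighbors
mean(y[nn])                    # (4 + 6 + 2) / 3 = 4
```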
Use the train() function to determine the optimal value of k. Save the object as knn_cv.

(knn_cv <- train(Assets ~ Profits,
                 data = Fortune500,
                 method = "knn",
                 tuneGrid = expand.grid(k = c(5, 7, 9, 11, 15, 20, 25, 30)),
                 trControl = trainControl(method = "cv", number = 5)))
Compute the predictions of knn_cv with predict(), and save them as fitted_knn_cv.

fitted_knn_cv <- predict(knn_cv)
ggplot(Fortune500, aes(Profits, Assets)) +
geom_point() +
geom_smooth(method = "lm", se = F, formula = y ~ poly(x, 4)) +
geom_smooth(method = "lm", se = F, formula = y ~ x, col = "red") +
geom_line(aes(Profits, pred_tree_cv), col = "brown") +
geom_line(aes(Profits, fitted_knn_cv), col = "green") +
theme_minimal()
END OF LAB