knitr::opts_chunk$set(eval=params$answers, message=FALSE)
library(ggplot2)
library(gamlss)
library(splines)
library(rpart)
library(caret)

In this lab you visualize data and fit polynomials, splines, and regression trees. You need the packages ggplot2, gamlss, splines, rpart, and caret. To get reproducible results, set the seed to 1 before you begin.

set.seed(1)

BMI

The data set for this exercise is gamlss::dbbmi (also see lecture slides). Check the help page of this data set for a description of the data.
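
A quick way to get oriented before plotting (dbbmi is lazy-loaded once gamlss is attached):

str(dbbmi)  # the variables age and bmi for Dutch boys
?dbbmi      # help page with the full description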

  1. Display a scatterplot with age on the x-axis and bmi on the y-axis.
ggplot(dbbmi, aes(x = age, y = bmi)) +
  geom_point(alpha = 0.2) + 
  theme_minimal()
  1. Display the previous plot with the regression line of a polynomial model. Experiment with the degree of the polynomial until you find a satisfactory compromise between fit and model complexity (the optimal bias-variance trade-off).
ggplot(dbbmi, aes(x = age, y = bmi)) +
  geom_point(alpha = 0.2) +
  geom_line(aes(y = fitted(lm(bmi ~ poly(age, degree = 14)))), col="red", size = 1) + 
  theme_minimal()
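One way to make the degree experiment systematic is to compare an information criterion over a range of degrees; a minimal sketch using AIC (lower is better):
sapply(1:15, function(d) AIC(lm(bmi ~ poly(age, degree = d), data = dbbmi)))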
  1. Now display the previous plot with the regression lines of both a cubic and a natural cubic spline. Experiment with the df and compare the regression lines for equal df. What is the difference?
ggplot(dbbmi, aes(x = age, y = bmi)) +
  geom_point(alpha = 0.2) +
  geom_line(aes(y = fitted(lm(bmi ~ bs(age, df = 8)))), col="red", size = 1) + 
  geom_line(aes(y = fitted(lm(bmi ~ ns(age, df = 8)))), col="blue", size = 1) + 
  theme_minimal()
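The difference shows up in how the two bases spend their degrees of freedom: a natural cubic spline is constrained to be linear beyond the boundary knots, which frees parameters for extra interior knots. A sketch to verify this for df = 8:
attr(bs(dbbmi$age, df = 8), "knots")  # cubic spline: df - 3 = 5 interior knots
attr(ns(dbbmi$age, df = 8), "knots")  # natural spline: df - 1 = 7 interior knots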
  1. Even for large values of df it is difficult to get good predictions for ages close to 0. You may get better results by experimenting with the position of the knots. Try to get a better result by fitting a cubic spline with a single, well-placed knot.
ggplot(dbbmi, aes(x = age, y = bmi)) +
  geom_point(alpha = 0.2) +
  geom_line(aes(y = fitted(lm(bmi ~ bs(age, knots = 0.4)))), col="red", size = 1) + 
  theme_minimal()
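To check whether the single well-placed knot helps, the two spline fits can be compared on AIC; a sketch (the knot at 0.4 is the choice made in the answer above):
AIC(lm(bmi ~ bs(age, df = 8), data = dbbmi),
    lm(bmi ~ bs(age, knots = 0.4), data = dbbmi))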

Fortune500 data

The data set Fortune500 for this exercise contains data on the 500 largest companies in the USA.

  1. Load the data into your workspace, either indirectly by first saving the file to your machine, or directly with the command load(url("https://maartencruyff.github.io/S31/2A-Feature Space Expansion/data/Fortune500.Rdata")).
load(url("https://maartencruyff.github.io/S31/2A-Feature Space Expansion/data/Fortune500.Rdata"))
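A sketch to verify what load() placed in the workspace:
ls()             # should now include an object named Fortune500
str(Fortune500)  # inspect its variables, including Assets and Profits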
  1. Visualize the relationship in a scatterplot with Assets on the y-axis and Profits on the x-axis. Based on this plot, what seems to be the best choice for the polynomial degree of Profits?
ggplot(Fortune500, aes(Profits, Assets)) +
  geom_point() +
  theme_minimal()

The aim is to select the best model for predicting Assets from Profits. The candidates are polynomial models (linear, quadratic, cubic and quartic), a regression tree, and k-nearest neighbors. To simplify matters, we do not create separate training and test sets; instead we perform a 10-fold cross-validation for each model using the entire sample, and select the model with the lowest cross-validated RMSE.
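
The criterion being minimized is the root mean squared error over the held-out folds, $\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$. The resampling scheme can be defined once and reused in every train() call below; a minimal sketch (the answers further on spell it out inline instead):

ctrl <- trainControl(method = "cv", number = 10)  # 10-fold CV, reusable across models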

Polynomial models

We start by testing the performance of the linear, quadratic, cubic and quartic models in the traditional way, i.e. by testing the significance of the R^2 change when increasing the polynomial degree. So we do not cross-validate yet; we simply test the R^2 change of the four models with the function anova().

  1. Fit each polynomial model in ascending order of the polynomial degree within the anova() function. A significant result means that the higher-order polynomial fits better than the lower-order one. Which polynomial model comes out as the best?
anova(lm(Assets ~ poly(Profits, 1), Fortune500),
      lm(Assets ~ poly(Profits, 2), Fortune500),
      lm(Assets ~ poly(Profits, 3), Fortune500),
      lm(Assets ~ poly(Profits, 4), Fortune500))
  1. Reproduce the previous scatter plot, and add the regression line of the polynomial model (in blue) that came out as best from the ANOVA test. How would you interpret this model in terms of bias and variance?
ggplot(Fortune500, aes(Profits, Assets)) +
  geom_point() +
  geom_smooth(method = "lm", se = F, formula = y  ~ poly(x, 4)) +
  theme_minimal()
  1. The method we just used is highly data-driven, and thus prone to overfitting. We will now use cross-validation to select the best model. Fit the four models separately with the function train() using 10-fold cross-validation, and compare the cross-validated RMSEs. Which model comes out as best now?
train(Assets ~ poly(Profits, 1), data = Fortune500,
      method = "lm",
      trControl = trainControl(method = "cv")
      )
train(Assets ~ poly(Profits, 2), data = Fortune500,
      method = "lm",
      trControl = trainControl(method = "cv")
      )
train(Assets ~ poly(Profits, 3), data = Fortune500,
      method = "lm",
      trControl = trainControl(method = "cv")
      )
train(Assets ~ poly(Profits, 4), data = Fortune500,
      method = "lm",
      trControl = trainControl(method = "cv")
      )
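If the four fits are saved, caret's resamples() tabulates their cross-validated RMSEs side by side; a sketch (the object name fits and the poly1-poly4 labels are choices made here, not part of the exercise):
fits <- lapply(1:4, function(d)
  train(Assets ~ poly(Profits, d), data = Fortune500,
        method = "lm",
        trControl = trainControl(method = "cv")))
summary(resamples(setNames(fits, paste0("poly", 1:4))))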
  1. Reproduce the previous scatter plot, and add the regression line of the polynomial model (in red) that came out as best in the cross-validation. How would you interpret this model in terms of bias and variance?
ggplot(Fortune500, aes(Profits, Assets)) +
  geom_point() +
  geom_smooth(method = "lm", se = F, formula = y  ~ poly(x, 4)) +
  geom_smooth(method = "lm", se = F, formula = y ~ x, col = "red") +
  theme_minimal()

Regression trees

A regression tree partitions the variable Profits into non-overlapping regions, and assigns the mean of Assets within each region as the predicted value.
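
To see this rule in miniature, here is a hand-rolled version with a single split at a hypothetical cutpoint (the value 500 is illustrative, not one that rpart chooses):

cutpoint <- 500  # hypothetical split on Profits
tapply(Fortune500$Assets, Fortune500$Profits > cutpoint, mean)  # mean Assets per region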

  1. Use the function rpart() (check its help page), and fit a tree to the data using the “anova” method. Save the fitted object as tree.
tree <- rpart(Assets ~ Profits, data = Fortune500, 
              method = "anova")
  1. Plot the tree object, and interpret the plot.
plot(tree)
text(tree)
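The printed tree is often easier to read than the base plot; each node shows the split rule and the mean of Assets in that node:
print(tree)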
  1. Trees tend to overfit, so we perform cross-validation on the tree object to find the optimal value of the hyperparameter cp, which controls the complexity of the tree. You can do this with the train() function with method = "rpart". Save the object as tree_cv.
tree_cv <- train(Assets ~ Profits, 
                 data = Fortune500, 
                 method = "rpart",
                 trControl = trainControl(method = "cv"))
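A sketch to inspect the cp value the cross-validation selected:
tree_cv$bestTune   # cp with the lowest cross-validated RMSE
plot(tree_cv)      # RMSE as a function of cp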
  1. Obtain the predictions of tree_cv with the function predict(), and save them as pred_tree_cv.
pred_tree_cv <- predict(tree_cv, Fortune500)
  1. Reproduce the previous scatter plot, and add the regression line of the tree (in brown).
ggplot(Fortune500, aes(Profits, Assets)) +
  geom_point() +
  geom_smooth(method = "lm", se = F, formula = y  ~ poly(x, 4)) +
  geom_smooth(method = "lm", se = F, formula = y ~ x, col = "red") +
  geom_line(aes(Profits, pred_tree_cv), col = "brown") +
  theme_minimal()

KNN

KNN assigns the mean of the k nearest neighbors as the prediction, where k is a hyperparameter to be determined with cross-validation.
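
To see the rule in miniature, here is a hand-rolled KNN prediction for a single query point (the values of x0 and k below are illustrative choices, not tuned):

x0 <- 100                                       # hypothetical Profits value
k  <- 5
nn <- order(abs(Fortune500$Profits - x0))[1:k]  # indices of the k nearest neighbors
mean(Fortune500$Assets[nn])                     # their mean Assets is the prediction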

  1. Perform a 10-fold cross-validation on the data with the train() function to determine the optimal value of k. Save the object as knn_cv.
(knn_cv <- train(Assets ~ Profits, 
                data = Fortune500, 
                method    = "knn",
                tuneGrid  = expand.grid(k = c(5, 7, 9, 11, 15, 20, 25, 30)),
                trControl = trainControl(method = "cv", number = 10))
)
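A sketch to see which k the cross-validation picked:
knn_cv$results    # cross-validated RMSE for each candidate k
knn_cv$bestTune   # the k with the lowest RMSE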
  1. Obtain the fitted values for the training data with predict(), and save them as pred_knn_cv.
pred_knn_cv <- predict(knn_cv)
  1. Reproduce the previous scatter plot, and add the KNN regression line (in green).
ggplot(Fortune500, aes(Profits, Assets)) +
  geom_point() +
  geom_smooth(method = "lm", se = F, formula = y  ~ poly(x, 4)) +
  geom_smooth(method = "lm", se = F, formula = y ~ x, col = "red") +
  geom_line(aes(Profits, pred_tree_cv), col = "brown") +
  geom_line(aes(Profits, pred_knn_cv), col = "green") +
  theme_minimal()
  1. Which of the fitted models do you prefer, and why?
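An evidence-based answer compares the cross-validated RMSEs of the candidates side by side; a sketch, assuming the polynomial fits were stored in the fits list from the earlier sketch:
summary(resamples(list(poly1 = fits[[1]], tree = tree_cv, knn = knn_cv)))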

END OF LAB