knitr::opts_chunk$set(eval=params$answers, message=FALSE)
library(stylo)
library(psych)
library(factoextra)
library(dplyr)

For this lab you need the packages stylo, psych, and factoextra.

Wine data

The wine data include 13 features of three different wine cultivars in Italy.

  1. Load the wine.Rdata file in workspace, and display a summary of the wine data.
load("data/wine.Rdata")
summary(wine)
  1. Perform six k-means cluster analyses on the 13 (standardized) features (so exclude the variable Cultivar) with 1 to 6 to centers, respectively, and save the objects. Then plot the total within sum of squares of the six analyses to determine the optimal number of clusters.
set.seed(1)
k_1 <- kmeans(scale(wine[, -1]), centers = 1, nstart = 10)
k_2 <- kmeans(scale(wine[, -1]), centers = 2, nstart = 10)
k_3 <- kmeans(scale(wine[, -1]), centers = 3, nstart = 10)
k_4 <- kmeans(scale(wine[, -1]), centers = 4, nstart = 10)
k_5 <- kmeans(scale(wine[, -1]), centers = 5, nstart = 10)
k_6 <- kmeans(scale(wine[, -1]), centers = 6, nstart = 10)

plot(1:6,  
       c(k_1$tot.withinss,
         k_2$tot.withinss, 
         k_3$tot.withinss, 
         k_4$tot.withinss, 
         k_5$tot.withinss, 
         k_6$tot.withinss),
     type = "b", ylab = "within clusters ss", xlab = "k")
  1. Display a confusion matrix of the three clusters of the k-means analysis with three centers with the three levels of the variable Cultivar of the wine data. Did the cluster analysis recover the cultivars?
table(obs = wine$Cultivar, est = k_3$cluster)

Novels

In the previous lab we did a PCA on the frequently used words used in the novels by the Bronthe sister and Jane Austin. In this lab we perform a k-means analysis on these data.

  1. Rerun the code on novels data that produces the freqs object.
k_1 <- kmeans(scale(freqs), centers = 1)
k_2 <- kmeans(scale(freqs), centers = 2)
k_3 <- kmeans(scale(freqs), centers = 3)
k_4 <- kmeans(scale(freqs), centers = 4)
k_5 <- kmeans(scale(freqs), centers = 5)
k_6 <- kmeans(scale(freqs), centers = 6)

plot(1:6,  
       c(k_1$tot.withinss,
         k_2$tot.withinss, 
         k_3$tot.withinss, 
         k_4$tot.withinss, 
         k_5$tot.withinss, 
         k_6$tot.withinss),
     type = "b", ylab = "within clusters ss", xlab = "k")
  1. Perform a k-means clustering on the freqs object with 4 clusters, i.e. one cluster for each author. Tabulate the row names of freqs (author_novel) against the cluster numbers. How well did the k-means solution do?
table(rownames(freqs), k_3$cluster)
  1. Visualize the clustering with the function fviz_cluster of the package factoextra. To make this function work, you first have to convert the class of the freqs object from stylo.data to data.frame.
fviz_cluster(k_3, scale(as.data.frame(unclass(freqs))))
  1. Also perform a hierarchical clustering on the freqs data, and plot the dendogram, and add rectangles for 4 clusters. Do the clusters correspond to the authors?
cl <- hclust(dist(scale(freqs)))
plot(cl)
rect.hclust(cl, 4)

Animals

The data set Animals contains the brain and body weights for 28 species of land animals. This data lends itself very well for making a taxonomy. We will perform a series of hierarchical cluster analyses with varying options.

  1. Load animals.Rdata in workspace, and display the rownames and the summary of the animals data.
load("animals.Rdata")
rownames(animals)
summary(animals)
  1. Plot the dendograms of the hierarchical cluster analyses on the original animals data, one with the default linkage method “complete”, and one one with “ward.D2” (if you like, you can also look at other linkage methods). What do you expect to see, given that you did not standardize the features?
plot(hclust(dist(animals), method = "complete"), cex = .8)
plot(hclust(dist(animals), method = "ward.D2"),  cex = .8)
  1. Re-plot the dendograms of the previous exercise , but now after standardizing the features? Does this improve the clustering?
plot(hclust(dist(scale(animals))), cex = .8)
plot(hclust(dist(scale(animals)), method = "ward.D2"), cex = .8)
  1. Display boxplots of the standardized features. You will see that the 1st two features have an extremely skewed distribution.
boxplot(scale(animals))
  1. Re-plot the dendograms of the standardized features, after applying a logarithmic transformation to the 1st two features of animals.
animals <- animals %>% mutate_at(vars(bw, brw), log)

plot(hclust(dist(scale(animals))), cex = .8)
plot(hclust(dist(scale(animals)), method = "ward.D2"), cex = .8)
  1. Which of the taxonomies you created above do you like best?

END OF LAB