In this lab you perform a PCA to a personality test, a bag of words and photo’s of faces.

For this lab you need the packages stylo and psych, and the data sets in Profession.Rdata and faces.RData. The package stylo includes nine novels by the Bronthe sisters and Jane Austin, and functions to compute the relative frequencies of the words used in these novels.

Warning for Mac OS users: the stylo package requires X11 support being installed. For details about installing X11, see the github page stylo: Installation Issues. If you do not want to install X11, do not install the package stylo, but instead load the file novels.Rdata which contains the data frame freqs that you need for the PCA.

knitr::opts_chunk$set(eval=params$answers)

library(stylo) # for Mac OS; only when X11 is installed
library(psych)
library(dplyr)

Personality types

The file “Profession.Rdata” contains the data set Profession with preference scores of 5389 subjects on 48 professions. The professions are subdivided in 6 sets of 8, with each set corresponding to different personality type (e.g. items R1 to R8 are professions that are supposed to be appealing to personality type “R = Realistic (Doers)”). The code book for the data is Profession.txt. The aim of the analysis is to find a principal component for each of the six personality types.

  1. Load the data file “Profession.Rdata”, and display the scree plot for the profession data. How many PCs does the plot suggest?
load("Profession.RData")

VSS.scree(Profession)
  1. There seem to be 7 or 8 PCs with an eigenvalue > 1, but we will ignore the 7th and 8th component. Perform a PCA that extracts 6 factors without rotation, and save the object as prof_pca.
prof_pca <- principal(Profession, nfactors = 6, rotate = "none")
  1. Display the loadings of the first 6 components with the function print(), and set the cutoff to 0.4. Do the loadings correspond to the personality types?
print(loadings(prof_pca), cutoff=0.4)
  1. Rerun the PCA with the default “varimax” rotation and save the result as prof_rot, and display the rotated component loadings. Did the rotation help interpretation of the personality types?
prof_rot <- principal(Profession, nfactors = 6)
print(loadings(prof_rot), cutoff = 0.4)
  1. To get some idea of the distribution of the PCs, display their means and standard deviations, rounded to 4 decimals.
data.frame(mean  = apply(prof_pca$scores, 2 , mean), sd = apply(prof_pca$scores, 2, sd)) %>% round(4)
  1. Rename the PCs in the names of the personality types in the code book (e.g. Doers, Thinkers, etc), and save the result.
d <- as.data.frame(prof_rot$scores) 
names(d) <- c("Organizer", "Doer", "Thinker", "Creator", "Helper", "Persuader") 
  1. Display the scores of a random sample of 5 persons from the data, and interpret this persons personality profile.
d[sample(5389, 5), ] %>%  round(1)

Text analysis

In this exercise you use PCA to analyze word frequencies in nine novels by the the three Bronthe sisters and Jane Austin. The data for this analysis are in the novels object of the package stylo. (Mac OS users without a X11 installation should skip exercise a.)

  1. Load the novels object with the command data(), and summarize the object with the function summary().
data(novels)
summary(novels)
  1. Open the R chunk for this exercise. The code prepares the data for the PCA. It extracts the relative frequencies of the most frequently used words. Run it and inspect the content of the object freqs. (Mac OS users without a X11 installation should instead load the file novels.Rdata.)
tokens   <- txt.to.words.ext(novels, 
                             preserve.case = FALSE)


frequent <- make.frequency.list(tokens, 
                                head = 10)

freqs    <- make.table.of.frequencies(tokens, 
                                      features = frequent)

# load("novels.Rdata") for Mac OS users without X11

attributes(freqs)
  1. Perform a PCA on the freqs object with the function prcomp(), and save the result.
author_pca <- prcomp(freqs)
  1. Display the scree plot of the prcomp object. According to the elbow criterion, how many PCs do we need to extract?
screeplot(author_pca, type="line")
  1. Display the biplot of PC1 and PC2, and interpret the plot. Which of the four authors uses “a” and “the” most frequently, and which authors are characterized by the frequent use of “and”?
biplot(author_pca, cex = .8)

Eigenfaces

The data set for this exercise is called faces and in the file faces.RData. The data for this analysis consist of 2410 pictures of faces of which the pixels are converted to gray scales in the range \((0, 255)\). The pictures all have 8064 pixels \((96\times84)\). The aim of the analysis to is to create a reduced number of principal components that summarize stereotype faces, a.k.a eigenfaces.

  1. Load this file into workspace with the function load().
load("faces.RData")
  1. To get some idea of what the data look like, display a summary columns of the first 5 columns (faces) of faces.
summary(faces[, 1:5])
  1. Open the R chunk for this exercise. It contains code that converts the raw data of the first 9 faces to 84 x 96 matrices, and displays an image of the face in gray scales. Run the code.
par(mfrow=c(3,3))
for(i in sample(1:2410, size=9)){
  face <- matrix(rev(faces[,i]), nrow = 84, ncol = 96)
  image(face, col = gray(0:255 / 255 ))
}
  1. Perform a PCA on faces using prcomp() (may take a while), and save the result.
faces_pca <- prcomp(faces)
  1. In the same way as exercise c, convert the first 9 components scores to 84 x 96 matrices, and display the images of the eigenfaces.
par(mfrow=c(3, 3))
for(i in 1:9){
  face <- matrix(rev(faces_pca$x[, i]), nrow = 84, ncol = 96)
  image(face, col = gray(0:255 / 255 ))
}
  1. In the same way, also display the eigenface images produced by the last 9 principal components.
par(mfrow=c(3, 3))
for(i in 2402:2410){
  face <- matrix(rev(faces_pca$x[, i]), nrow = 84, ncol = 96)
  image(face, col = gray(0:255 / 255 ))
}
  1. Include the raw component scores of the 1st as the sole argument in the plot() function, and also do this for the 2nd component. These plots do not have any sensible interpretation, but they just look nice!
par(mfrow=c(1, 2))
plot(faces_pca$x[, 1])
plot(faces_pca$x[, 2])

END OF LAB