Overview

For this lab you need the R package ggplot2. Install the package if you haven’t done that yet, and load it in workspace with the function library(). The aim of this lab is to get acquainted with building plots layer-by-layer with the ggplot() function, and with the basic geometries for displaying numeric and categorical variables.

For help on the ggplot2 functions, check the Data Visualization Cheat Sheet

if('ggplot2' %in% installed.packages() == FALSE) install.packages('ggplot2')
library(ggplot2)

Make the exercises in the R Markdown template 1A-Lab-Template.Rmd. You’ll find instructions on how to work with R Markdown documents in the template.

The code fo the exercises can be made visible by clicking the code, but first try to find the solution yourself.


Building plots

In this lab we use the data set txhousing (run ?txhousing at the prompt to see its help page).

Data and aesthetics

The data and aesthetics arguments tell ggplot() where to to find the data, and where to map the variables. Together they specify the axes of the plot array, but they do not make any plot yet.

  1. The first step in making plots the data specification with ggplot(data = txhousing). This creates an empty plot surface.
ggplot(data = txhousing) 
  1. The next step is to add one or more aesthetics. We start by mapping the variable volume to the x-axis. Check the result.
ggplot(data = txhousing, mapping = aes(x = volume))
  1. Next, we add anther aesthetic by mapping the variable sales to the y-axis. Again check the result.
ggplot(data = txhousing, mapping = aes(x = volume, y = sales))

Geoms

The geoms determine how the data are represented in the plot. There are many different geoms, including those for displaying histograms, lines, points, densities, bars, boxplots, etc.

  1. The previous plot includes the numeric variables volume and housing, and the common way to represent two numeric variables is a scatter plot. Find the appropriate geom to display a scatter plot of volume and housing. (ggplot generates a warning about missing values. You can prevent this by using na.omit(txhousing) as data specification.)
ggplot(data = na.omit(txhousing), mapping = aes(x = volume, y = sales)) +
  geom_point()
  1. Add a (non-linear) regression line to the previous plot that describes the relationship between volume and sales.
ggplot(data = na.omit(txhousing), mapping = aes(x = volume, y = sales)) +
  geom_point() +
  geom_smooth()
  1. The data are collected over the period 2000-2015. Use a 3rd aesthetic color to color the points in the scatter plot by year.
ggplot(data = na.omit(txhousing), mapping = aes(x = volume, y = sales, col = year)) +
  geom_point() +
  geom_smooth()

Scales, facets, themes

The scale on which the axis are represent can be changed by adding a scale layer to the plot. For example, scale_y_sqrt() transforms the y-axis to the square root of the original values, and scale_y_log10() takes the logarithm of the original values. Thes kind of transformations can be useful to reduce heteroscedasticity (non-constant variance around the regression line).

  1. Display the previous plot choosing the transformation that does the best job in reducing heteroscedasticity.
ggplot(data = na.omit(txhousing), mapping = aes(x = volume, y = sales, col = year)) +
  geom_point() +
  geom_smooth() +
  scale_y_log10() +
  scale_x_log10()

Facets are multiples of the plot, split up by the levels of categorical variable.

  1. Remove the variable year from the aesthetics, and use the facet_wrap() or facet_grid() function to make multiples by year.
ggplot(data = na.omit(txhousing), mapping = aes(x = volume, y = sales)) +
  geom_point() +
  geom_smooth() +
  scale_y_log10() +
  scale_x_log10() +
  facet_wrap(vars(year))

Themes are uses to change the look of the plot.

  1. Check the theme options in the cheat sheet, and choose a theme of your liking to display the previous plot.
ggplot(data = na.omit(txhousing), mapping = aes(x = volume, y = sales)) +
  geom_point() +
  geom_smooth() +
  scale_y_log10() +
  scale_x_log10() +
  facet_wrap(vars(year)) +
  theme_minimal()

Other geoms

In the previous exercise we used geoms for two numeric variables to display a scatter plot with a regression line. We will now briefly explore some geoms for categorical variables, and for combinations of numeric and categorical variables.

  1. Display a histogram of the variable sales, with the x-axis displayed on a logarithmic scale.
ggplot(data = na.omit(txhousing), mapping = aes(x = sales)) +
  geom_histogram() +
  scale_x_log10()
  1. Add the aesthetic fill = factor(month) to the plot
ggplot(data = na.omit(txhousing), mapping = aes(x = sales, fill = factor(month))) +
  geom_histogram() +
  scale_x_log10()
  1. The previous plot is a bit crowded with colors. A better option is to produce separate histograms for each month. Disply this plot.
ggplot(data = na.omit(txhousing), mapping = aes(x = sales)) +
  geom_histogram() +
  scale_x_log10() +
  facet_wrap(vars(month))
  1. Change the geom of the previous plot so that densities instead of histograms are displayed.
ggplot(data = na.omit(txhousing), mapping = aes(x = sales)) +
  geom_density() +
  scale_x_log10() +
  facet_wrap(vars(month))
  1. Display the same graph as in exercise b, but with a geom that creates boxplots for each month (specify month as factor).
ggplot(data = na.omit(txhousing), mapping = aes(x = factor(month), y = sales)) +
  geom_boxplot() +
  scale_y_log10()

END OF LAB.