AE 02: Visualizing penguins

Suggested answers

Important

These are suggested answers. Use this document as a reference only; it’s not designed to be an exhaustive key.

For all analyses, we’ll use the tidyverse and palmerpenguins packages.

library(tidyverse)
library(palmerpenguins)

The dataset we will visualize is called penguins. Let’s glimpse() at it.

glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
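
The NA entries in the output above matter for what follows: ggplot2 drops rows with missing values before plotting and warns when it does. As a quick sketch that is not part of the original exercise (`na_counts` is a name chosen here for illustration), the missing values per column can be counted like this:

```r
library(tidyverse)
library(palmerpenguins)

# Count the missing values in every column of penguins
na_counts <- penguins |>
  summarize(across(everything(), ~ sum(is.na(.x))))

na_counts
```

Columns with a nonzero count are the ones that will trigger "removed rows" warnings when mapped to an aesthetic.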

Visualizing penguin weights - Demo

Single variable

Note

Analyzing a single variable is called univariate analysis.

Create visualizations of the distribution of weights of penguins.

  1. Make a histogram. Set an appropriate binwidth.
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 250)

  1. Make a boxplot.
ggplot(penguins, aes(x = body_mass_g)) +
  geom_boxplot()

  1. Based on these, determine if each of the following statements about the shape of the distribution is true or false.
    • The distribution of penguin weights in this sample is left skewed. FALSE
    • The distribution of penguin weights in this sample is unimodal. TRUE
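
The visual judgments above can be cross-checked numerically. The sketch below is an addition to the exercise, not part of the suggested answers: it computes the Freedman-Diaconis rule of thumb as one possible starting point for a binwidth, and compares the mean with the median as a rough indicator of skew. The variable names (`mass`, `fd_binwidth`, and so on) are chosen here for illustration.

```r
library(tidyverse)
library(palmerpenguins)

mass <- penguins$body_mass_g

# Freedman-Diaconis rule of thumb: binwidth = 2 * IQR / n^(1/3)
n_obs <- sum(!is.na(mass))
fd_binwidth <- 2 * IQR(mass, na.rm = TRUE) / n_obs^(1 / 3)

# In a right-skewed distribution the mean is typically pulled above the median
mass_mean <- mean(mass, na.rm = TRUE)
mass_median <- median(mass, na.rm = TRUE)

c(fd_binwidth = fd_binwidth, mean = mass_mean, median = mass_median)
```

The heuristic only suggests an order of magnitude; rounding to a convenient value and checking the plot by eye, as done with `binwidth = 250` above, is still a judgment call. A mean above the median is consistent with the "left skewed" statement being false.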

Two variables

Note

Analyzing the relationship between two variables is called bivariate analysis.

Create visualizations of the distribution of weights of penguins by species.

  1. Make a single histogram. Set an appropriate binwidth.
ggplot(penguins, 
       aes(x = body_mass_g, fill = species)) +
  geom_histogram(binwidth = 250, alpha = 0.5, color = "black")

  1. Use multiple histograms via faceting, one for each species. Set an appropriate binwidth, add color as you see fit, and turn off legends if not needed.
ggplot(penguins, 
       aes(x = body_mass_g, fill = species)) +
  geom_histogram(binwidth = 250, show.legend = FALSE) +
  facet_wrap(~species, ncol = 1)

  1. Use side-by-side box plots. Add color as you see fit and turn off legends if not needed.
ggplot(penguins, 
       aes(x = species, y = body_mass_g, fill = species)) +
  geom_boxplot(show.legend = FALSE)

  1. Use density plots. Add color as you see fit.
ggplot(penguins, 
       aes(x = body_mass_g, fill = species)) +
  geom_density(alpha = 0.5)

  1. Use violin plots. Add color as you see fit and turn off legends if not needed.
ggplot(penguins, 
       aes(x = species, y = body_mass_g, fill = species)) +
  geom_violin(alpha = 0.5, show.legend = FALSE)

  1. Make a jittered scatter plot. Add color as you see fit and turn off legends if not needed.
ggplot(penguins, 
       aes(x = species, y = body_mass_g, color = species)) +
  geom_jitter(show.legend = FALSE)

  1. Use beeswarm plots. Add color as you see fit and turn off legends if not needed.
library(ggbeeswarm)

ggplot(penguins, 
       aes(x = species, y = body_mass_g, color = species)) +
  geom_beeswarm(show.legend = FALSE)

  1. Use multiple geoms on a single plot. Be deliberate about the order of plotting. Change the theme and the color scale of the plot. Finally, add informative labels.
ggplot(penguins, 
       aes(x = species, y = body_mass_g, color = species)) +
  geom_beeswarm(show.legend = FALSE) +
  geom_boxplot(show.legend = FALSE, alpha = 0.5) +
  scale_color_viridis_d(option = "D", end = 0.8) +
  theme_minimal() +
  labs(
    x = "Species",
    y = "Body mass (g)",
    title = "Body mass of penguins in Palmer Archipelago",
    subtitle = "By species"
  )
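
ggplot2 draws layers in the order they are added, so a later geom sits on top of an earlier one. The suggested answer draws the points first and overlays semi-transparent boxes. As one alternative ordering, sketched here for comparison rather than as the intended answer, the boxes can go underneath and the points on top. Hiding the boxplot outliers with `outlier.shape = NA` is an assumption of this sketch; it avoids drawing those observations twice, since the beeswarm layer already shows every point. `p_layers` is an illustrative name.

```r
library(tidyverse)
library(palmerpenguins)
library(ggbeeswarm)

# Layers are drawn in the order they are added: boxes first, points on top
p_layers <- ggplot(penguins,
                   aes(x = species, y = body_mass_g, color = species)) +
  geom_boxplot(show.legend = FALSE, outlier.shape = NA) +  # outliers hidden here
  geom_beeswarm(show.legend = FALSE, alpha = 0.5) +        # every observation shown
  scale_color_viridis_d(option = "D", end = 0.8) +
  theme_minimal()

p_layers
```

Which ordering reads better depends on whether the summary (boxes) or the raw data (points) should dominate the plot.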

Multiple variables

Note

Analyzing the relationship between three or more variables is called multivariate analysis.

  1. Facet the plot you created in the previous exercise by island. Adjust labels accordingly.
ggplot(penguins, 
       aes(x = species, y = body_mass_g, color = species)) +
  geom_beeswarm(show.legend = FALSE) +
  geom_boxplot(show.legend = FALSE, alpha = 0.5) +
  facet_wrap(~island) +
  scale_color_viridis_d(option = "D", end = 0.8) +
  theme_minimal() +
  labs(
    x = "Species",
    y = "Body mass (g)",
    title = "Body mass of penguins in Palmer Archipelago",
    subtitle = "By species and island"
  )
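
Not every species was observed on every island, which is why some positions in the faceted plot are empty. A minimal sketch for checking this (an addition to the exercise; `island_species` is an illustrative name) is to count the island and species combinations that occur in the data:

```r
library(tidyverse)
library(palmerpenguins)

# Number of penguins for each island-species combination present in the data
island_species <- penguins |>
  count(island, species)

island_species
```

Combinations that never occur do not appear in the result, so a short table here corresponds to gaps in the facets.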

Before you continue, let’s turn off the warnings the code chunks generate (for example, ggplot2 warns whenever it drops rows with missing values) and resize all figures. We’ll do this by editing the YAML.
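
The YAML edit itself is not reproduced in this document. As a rough equivalent in R, assuming the document is rendered with the knitr engine, the same effect can be sketched with global chunk options set in an early chunk. The figure dimensions below are placeholder assumptions, not values from the exercise.

```r
library(knitr)

# Apply to every chunk that follows: hide warnings and use one figure size.
# In Quarto YAML, the warning setting roughly corresponds to
# `execute: warning: false`, and figure size to options such as `fig-width`.
opts_chunk$set(
  warning = FALSE,   # do not print warnings in the rendered output
  fig.width = 7,     # figure width in inches (placeholder value)
  fig.asp = 0.618,   # height = width * 0.618 (placeholder value)
  out.width = "70%"  # displayed width in the rendered document (placeholder)
)
```

Setting these options once keeps individual chunks free of repeated settings; the YAML route described above achieves the same at the document level.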

Visualizing other variables - Your turn!

  1. Pick a single categorical variable from the data set and make a bar plot of its distribution.
  1. Pick two categorical variables and make a visualization of the relationship between them. Along with your code and output, provide an interpretation of the visualization.

Interpretation goes here…

  1. Make another plot that uses at least three variables. At least one should be numeric and at least one categorical. In 1-2 sentences, describe what the plot shows about the relationships between the variables you plotted. Don’t forget to label your code chunk.
# add code here

Interpretation goes here…
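
These prompts are open-ended, so there is no single expected answer. As one possible starting point for the first two, offered only as a sketch (the choice of `species` and `island`, and the object names, are assumptions made here), a bar plot and a filled bar plot might look like this:

```r
library(tidyverse)
library(palmerpenguins)

# One categorical variable: number of penguins of each species
p_species <- ggplot(penguins, aes(x = species)) +
  geom_bar()

# Two categorical variables: share of each species within each island
p_island_species <- ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion")

p_species
p_island_species
```

Using `position = "fill"` rescales each bar to the same height, which makes the species composition comparable across islands even though the islands have different numbers of penguins.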