Exam 1 Review

Lecture 10

Dr. Mine Çetinkaya-Rundel

Duke University
STA 199 - Fall 2022

9/29/22

Warm up

While you wait for class to begin…

Open your ae-07 project in RStudio, render your document, and commit and push.

Announcements

  • Exam 1 is released today at noon and is due at 2pm on Monday.

    • No TA OH during the exam.

    • I will have OH 4-5pm on Friday on Zoom: bit.ly/minezoom

    • Any clarification questions must be emailed to me only.

    • No Slack use during the exam, even about non-exam related questions.

From last time

Continue from last time: ae-07

  • Go to your container and open your ae-07 project.
  • Render, commit, and push.

Important

You might see an error. Read it and do as it says!

  • Pull.

  • Once again, render, commit, and push.

Exam 1 review

Logistics questions

  • Can we use outside sources for our code on the exam as long as we cite where it’s from?

Yes! However, you should be striving to solve the questions in the style that we learned. For example, ggplot2 is not the only plotting package in R, but we expect you to use ggplot2 when making plots, not another system.

  • Will content on lab 3 be on the exam? If so, will we be able to access an answer key at some point during the exam period?

Yes, and the answer key will be posted Friday at midnight.

  • When asked to replicate a graph, should we also adjust fig height/width?

Yes, though you shouldn’t worry about matching it exactly. The goal is for your plot to be legible and to have roughly the same shape: if the plot you’re replicating is wider than it is tall, the plot you’re submitting should be as well.

Packages

library(tidyverse)

Operators in R: <- vs. =

  • <-: assignment
  • =: equals (sets the value of an argument inside a function call)
# good
x <- 2

# works, but bad
x = 2

# doesn't work as intended: the new column is not named x
df <- df |>
  mutate(x <- 2)

# good
df <- df |>
  mutate(x = 2)
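
To see the difference for yourself, here is a minimal sketch. The tibble df_demo and its id column are made up for illustration; they are not from the course materials.

```r
library(tidyverse)

# Hypothetical two-row tibble, used only for illustration
df_demo <- tibble(id = c(1, 2))

# `<-` inside mutate() runs without an error, but the new column
# takes its name from the whole expression rather than being called x
wrong <- df_demo |>
  mutate(x <- 2)
names(wrong)

# `=` creates a column named x, as intended
right <- df_demo |>
  mutate(x = 2)
names(right)
```

Comparing names(wrong) and names(right) shows why we use = inside functions like mutate().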

Operators in R: = vs. == vs. %in%

  • ==: is equal to
  • %in%: in
x = c(1, 2, 3)
y = c(3, 4, 5)

# do elements in x equal those in y?
# check if each element in x is equal to the 
# corresponding element in y
x == y
[1] FALSE FALSE FALSE
# which elements in x are also in y?
# for each element in x, check if it is equal to any element in y
x %in% y
[1] FALSE FALSE  TRUE
# set x equal to y
x = y
x
[1] 3 4 5

%in% vs ==

df <- tibble(
  x = c(1, 2, 3, 4),
  y = c("a", "b", "c", "d")
  )
df
# A tibble: 4 × 2
      x y    
  <dbl> <chr>
1     1 a    
2     2 b    
3     3 c    
4     4 d    
# Filter for x is 2
df |>
  filter(x == 2)
# A tibble: 1 × 2
      x y    
  <dbl> <chr>
1     2 b    

%in% vs ==

# Filter for x is 2 or 3: doesn't work, since == compares
# x to c(2, 3) element by element, recycling c(2, 3)
df |>
  filter(x == c(2, 3))
# A tibble: 0 × 2
# … with 2 variables: x <dbl>, y <chr>
# Filter for x is 2 or 3
df |>
  filter(x %in% c(2, 3))
# A tibble: 2 × 2
      x y    
  <dbl> <chr>
1     2 b    
2     3 c    
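
Why does the == version return zero rows? A small sketch outside of filter(), using a plain vector with the same values as the x column above, shows what each operator computes:

```r
x <- c(1, 2, 3, 4)

# == recycles the shorter vector, so x is compared to c(2, 3, 2, 3)
# position by position: 1 vs 2, 2 vs 3, 3 vs 2, 4 vs 3
eq_result <- x == c(2, 3)
eq_result
# [1] FALSE FALSE FALSE FALSE

# %in% asks, for each element of x, whether it appears anywhere in c(2, 3)
in_result <- x %in% c(2, 3)
in_result
# [1] FALSE  TRUE  TRUE FALSE
```

filter() keeps the rows where the condition is TRUE, so the first version keeps nothing and the second keeps rows 2 and 3.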

Operators in R: |> vs. %>%

  • |>: pipe operator (newer – what we’ve been using in class)
  • %>%: pipe operator (older – what you see in the videos)
  • They effectively do the same thing
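
As a quick check, here is a sketch that runs the same pipeline with each pipe. The cutoff of 20 city MPG is an arbitrary value chosen for illustration.

```r
library(tidyverse)

# Same pipeline written with the native pipe...
native_result <- mpg |>
  filter(cty > 20) |>
  select(manufacturer, model, cty)

# ...and with the magrittr pipe used in the videos
magrittr_result <- mpg %>%
  filter(cty > 20) %>%
  select(manufacturer, model, cty)

# Both pipelines give the same tibble
identical(native_result, magrittr_result)
```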

Interpreting data visualizations I

Provide a 1-2 sentence interpretation of the relationship between city and highway mileage of cars.

ggplot(
  mpg,
  aes(x = cty, y = hwy)
) +
  geom_jitter(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "City MPG", 
    y = "Highway MPG"
  )

Interpreting data visualizations II

Provide a 1-2 sentence interpretation of the relationship between city and highway mileage of cars, taking into consideration whether they’re 4 wheel drive, front wheel drive, or rear wheel drive.

ggplot(
  mpg,
  aes(x = cty, y = hwy, color = drv)
) +
  geom_jitter(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "City MPG", 
    y = "Highway MPG"
  )

geom_jitter() vs. geom_point()

The same dataset is plotted with geom_jitter() and geom_point() below. Why do the two plots look different?
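
A sketch for drawing the two versions yourself, assuming the plots use the same cty and hwy variables from mpg as the earlier slides. geom_point() places each point at its exact coordinates, while geom_jitter() adds a small amount of random noise to each point's position.

```r
library(tidyverse)

# Points drawn at their exact (cty, hwy) coordinates
p_point <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()

# Same data, with a small amount of random noise added to each position
p_jitter <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_jitter()

p_point
p_jitter
```

The jittered plot changes slightly each time you render it, since the noise is random.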

distinct()

mpg |>
  select(cty, hwy)
# A tibble: 234 × 2
     cty   hwy
   <int> <int>
 1    18    29
 2    21    29
 3    20    31
 4    21    30
 5    16    26
 6    18    26
 7    18    27
 8    18    26
 9    16    25
10    20    28
# … with 224 more rows
mpg |>
  distinct(cty, hwy)
# A tibble: 78 × 2
     cty   hwy
   <int> <int>
 1    18    29
 2    21    29
 3    20    31
 4    21    30
 5    16    26
 6    18    26
 7    18    27
 8    16    25
 9    20    28
10    19    27
# … with 68 more rows
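
The 234 cars share only 78 distinct combinations of cty and hwy, so many points sit exactly on top of each other. One possible way to see how many cars share each combination is count(), sketched here:

```r
library(tidyverse)

# One row per distinct (cty, hwy) combination, with n = number of
# cars that share it, most common combinations first
overlap <- mpg |>
  count(cty, hwy, sort = TRUE)

overlap
```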

Working with categorical data

tshirts <- tibble(
  size = c("Large", "Medium", "Large", "Small", "Small", "Medium", "Small", "X-Large", "X-Small"),
  price = c(10, 15, 12, 18, 22, 13, 67, 12, 10)
)

ggplot(tshirts, aes(x = size)) +
  geom_bar()

fct_relevel()

Reorder levels based on an order you provide

tshirts |>
  mutate(size = fct_relevel(size, "X-Small", "Small", "Medium", "Large", "X-Large")) |>
  ggplot(aes(x = size)) +
  geom_bar()
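
To check the order without making a plot, you can look at the levels directly. This sketch uses a standalone vector, sizes, with the same values as the size column in tshirts:

```r
library(tidyverse)

# Same values as the size column in tshirts
sizes <- c("Large", "Medium", "Large", "Small", "Small", "Medium", "Small", "X-Large", "X-Small")

# By default, levels are in alphabetical order
default_levels <- levels(factor(sizes))
default_levels
# [1] "Large"   "Medium"  "Small"   "X-Large" "X-Small"

# fct_relevel() puts the levels in the order you list them
releveled <- fct_relevel(sizes, "X-Small", "Small", "Medium", "Large", "X-Large")
levels(releveled)
# [1] "X-Small" "Small"   "Medium"  "Large"   "X-Large"
```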

fct_reorder()

Reorder levels based on another variable

tshirts |>
  mutate(size = fct_reorder(size, price, .fun = mean)) |>
  ggplot(aes(x = size)) +
  geom_bar()
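
Here the levels are ordered by the mean price within each size, from lowest to highest. A sketch that computes those means and then checks the resulting level order:

```r
library(tidyverse)

tshirts <- tibble(
  size = c("Large", "Medium", "Large", "Small", "Small", "Medium", "Small", "X-Large", "X-Small"),
  price = c(10, 15, 12, 18, 22, 13, 67, 12, 10)
)

# Mean price for each size, sorted from lowest to highest
tshirts |>
  group_by(size) |>
  summarize(mean_price = mean(price)) |>
  arrange(mean_price)

# The levels after fct_reorder() follow the same order
reordered_levels <- tshirts |>
  mutate(size = fct_reorder(size, price, .fun = mean)) |>
  pull(size) |>
  levels()

reordered_levels
# [1] "X-Small" "Large"   "X-Large" "Medium"  "Small"
```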

fct_other()

Lump some levels into “Other”

tshirts |>
  mutate(size = fct_other(size, keep = c("Small", "Medium", "Large"))) |>
  ggplot(aes(x = size)) +
  geom_bar()
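
To see which observations ended up in “Other”, one option is to count the lumped variable, as in this sketch:

```r
library(tidyverse)

tshirts <- tibble(
  size = c("Large", "Medium", "Large", "Small", "Small", "Medium", "Small", "X-Large", "X-Small"),
  price = c(10, 15, 12, 18, 22, 13, 67, 12, 10)
)

# Keep Small, Medium, and Large; everything else becomes "Other"
lumped <- tshirts |>
  mutate(size = fct_other(size, keep = c("Small", "Medium", "Large"))) |>
  count(size)

lumped
```

The X-Large and X-Small shirts are now counted together in a single “Other” row.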

Pivoting and joining

Let’s visit https://www.garrickadenbuie.com/project/tidyexplain!
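
For a small hands-on version, here is a sketch with made-up data. The tibbles scores_wide and sections, and all of their values, are hypothetical and only meant to illustrate the two operations.

```r
library(tidyverse)

# Hypothetical wide data: one row per student, one column per exam
scores_wide <- tibble(
  student = c("A", "B"),
  exam_1 = c(90, 85),
  exam_2 = c(88, 92)
)

# Pivot longer: one row per student per exam
scores_long <- scores_wide |>
  pivot_longer(
    cols = c(exam_1, exam_2),
    names_to = "exam",
    values_to = "score"
  )

# Hypothetical second table with one row per student
sections <- tibble(
  student = c("A", "B"),
  section = c("01", "02")
)

# Join: add each student's section to every row of their scores
scores_joined <- scores_long |>
  left_join(sections, by = "student")

scores_joined
```

Pivoting changes the shape of a single data frame, while joining brings in columns from a second data frame based on a shared variable.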