Functions and iteration

Lecture 14

Dr. Mine Çetinkaya-Rundel

Duke University
STA 199 - Fall 2022

10/18/22

Warm up

While you wait for class to begin…

Clone your ae-11 project from GitHub, render your document, update your name, and commit and push.

Announcements

  • Project proposals due Friday
  • No in-person OH today (but I’ll be at the majors fair!). OH on Zoom 8-9pm tonight (Tuesday) and Wednesday. Will email a reminder with the Zoom link and add it to the OH sheet.
  • Make sure to review the video/reading for Thursday!

Midterm feedback followup

Can be difficult to catch up on AEs if bits are missed.

Will post AE solutions after class instead of waiting for the deadline. You still need to attempt them to get points.

From last time

Recap of ae-10

  • Use the SelectorGadget to identify tags for elements you want to grab
  • Use rvest to first read the whole page (into R) and then parse the object you’ve read in to the elements you’re interested in
  • Put the components together in a data frame (a tibble) and analyze it like you analyze any other data

A new R workflow

  • When working in a Quarto document, your analysis is re-run each time you render

  • If web scraping in a Quarto document, you’d be re-scraping the data each time you render, which is undesirable (and not nice to the website you're scraping)!

  • An alternative workflow:

    • Use an R script to save your code
    • Save interim data scraped using the code in the script as CSV or RDS files
    • Use the saved data in your analysis in your Quarto document
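
A minimal sketch of the alternative workflow above, under some assumptions: a small made-up data frame stands in for the scraped result, the file names are hypothetical, and `tempdir()` is used only so the example is self-contained. In practice the first part would sit in an R script and the second part in the Quarto document.

```r
# Part 1: would live in an R script (e.g. a hypothetical scrape.R), run once.
# A small made-up data frame stands in for the scraped result.
scraped <- data.frame(
  title  = c("Item A", "Item B"),
  rating = c(4.5, 3.0)
)

# Hypothetical file locations; tempdir() keeps the sketch self-contained.
csv_path <- file.path(tempdir(), "scraped.csv")
rds_path <- file.path(tempdir(), "scraped.rds")

write.csv(scraped, csv_path, row.names = FALSE)  # plain text, readable anywhere
saveRDS(scraped, rds_path)                       # stores the R object as-is

# Part 2: would live in the Quarto document; reads the saved file, no scraping.
from_csv <- read.csv(csv_path)
from_rds <- readRDS(rds_path)
```

Rendering the document then only reads a local file, so the website is not hit again on every render.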

Ethics: “Can you?” vs “Should you?”

“Can you?” vs “Should you?”

Challenges: Unreliable formatting

Challenges: Data broken into many pages

Workflow: Screen scraping vs. APIs

Two different scenarios for web scraping:

  • Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy)

  • Web APIs (application programming interfaces): the website offers a set of structured HTTP requests that return JSON or XML files
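
To make the contrast concrete, here is a small sketch that avoids any network access: an inline HTML string stands in for a page's source code, and an inline JSON string stands in for an API response. Both strings are made up for illustration, and the sketch assumes the rvest and jsonlite packages are installed.

```r
library(rvest)     # HTML parsing
library(jsonlite)  # JSON parsing

# Screen scraping: parse HTML source and pull out elements with a CSS selector.
page <- minimal_html('
  <ul>
    <li class="item">First</li>
    <li class="item">Second</li>
  </ul>
')
items_from_html <- page |>
  html_elements(".item") |>
  html_text2()

# Web API: the response is already structured, so parsing is a single call.
api_response <- '[{"name": "First"}, {"name": "Second"}]'
items_from_json <- fromJSON(api_response)$name
```

Both routes end with the same character vector; the difference is how much work it takes to locate the values in what the website sends back.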

Functions

Functions in R

What are some functions you’ve learned? What are their inputs, what are their outputs?

mean()

x <- c(1, 2, 3, 4, 5)
mean(x)
[1] 3

Custom function: multiply_by_two()

  • Decide on a goal: Multiply by two
  • Decide on the number and type of inputs: 1 (a numeric vector of length 1)
  • Decide on the number and type of outputs: 1 (a numeric vector of length 1)
multiply_by_two <- function(x){
  x * 2
}
multiply_by_two(1)
[1] 2
multiply_by_two(2)
[1] 4
multiply_by_two(3)
[1] 6

Custom function: multiply()

  • Decide on a goal: Multiply by a given value
  • Decide on the number and type of inputs: 2 (two numeric vectors of length 1)
  • Decide on the number and type of outputs: 1 (a numeric vector of length 1)
multiply <- function(x, y){
  x * y
}
multiply(1, 3)
[1] 3
multiply(2, 5)
[1] 10
multiply(10, 35)
[1] 350

Custom function: temp_convert()

  • Goal: Convert temperatures in degrees Fahrenheit to Celsius; subtract 32 and multiply by \(\frac{5}{9}\).
  • Number and type of inputs: 1 (a numeric vector of length 1)
  • Number and type of outputs: 1 (a numeric vector of length 1)
temp_convert <- function(temp_f){
  (temp_f - 32) * 5/9
}

Test out the function

temp_convert(32)   # freezing point
[1] 0
temp_convert(360)  # cake baking temperature
[1] 182.2222
temp_convert(98.6) # body temperature
[1] 37
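
Since the function is designed for numeric input, one possible extension is to check the input type before converting. The sketch below is only an illustration; `temp_convert_checked()` is a hypothetical name, not something defined elsewhere in these slides.

```r
# Hypothetical variant of temp_convert() that checks its input first.
temp_convert_checked <- function(temp_f) {
  # Stop with an error if the input is not numeric
  stopifnot(is.numeric(temp_f))
  (temp_f - 32) * 5 / 9
}

temp_convert_checked(212)  # boiling point of water
```

Calling `temp_convert_checked("32")` stops with an error, which is easier to act on than the generic error that `"32" - 32` would produce.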

Why do we need functions?

Repeat yourself:

# freezing point
(32 - 32) * (5/9)
[1] 0
# cake baking temperature
(360 - 32) * (5/9)
[1] 182.2222
# body temperature
(98.6 - 32) * (5/9)
[1] 37

Do not repeat yourself (DRY):

# freezing point
temp_convert(32)
[1] 0
# cake baking temperature
temp_convert(360)
[1] 182.2222
# body temperature
temp_convert(98.6)
[1] 37

Seriously, DRY!

Load package:

library(tidyverse)

Define input vector:

x <- c(32, 360, 98.6)
x
[1]  32.0 360.0  98.6

Map your function over the elements of the input vector:

map(x, temp_convert)
[[1]]
[1] 0

[[2]]
[1] 182.2222

[[3]]
[1] 37

Control the type of your output:

map_dbl(x, temp_convert)
[1]   0.0000 182.2222  37.0000

Iteration

To apply the same function to multiple values (stored in an object like a vector), use map() functions:

  • map() returns a list

  • map_lgl(), map_int(), map_dbl() and map_chr() return an atomic vector of the indicated type (logical, integer, double, or character, respectively)

  • map_dfr() and map_dfc() return a data frame created by row-binding and column-binding, respectively
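
The variants listed above can be sketched with the same input vector. The helper functions `is_above_boiling()` and `describe_temp()` are hypothetical and exist only to produce logical and character results; `map_dfr()` additionally assumes the dplyr package is installed.

```r
library(purrr)

temp_convert <- function(temp_f) {
  (temp_f - 32) * 5 / 9
}

x <- c(32, 360, 98.6)

# Hypothetical helpers, each returning a length-1 value of a different type
is_above_boiling <- function(temp_f) temp_convert(temp_f) > 100
describe_temp <- function(temp_f) paste0(round(temp_convert(temp_f), 1), " C")

above <- map_lgl(x, is_above_boiling)     # logical vector
temp_labels <- map_chr(x, describe_temp)  # character vector

# map_dfr() row-binds one small data frame per element (needs dplyr installed)
temps <- map_dfr(x, function(temp_f) {
  data.frame(temp_f = temp_f, temp_c = temp_convert(temp_f))
})
```

Choosing the typed variant documents what you expect back: if a function returns something of the wrong type or length, the typed `map_*()` call fails instead of silently returning a list.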

Coming soon…

  • Use a function that takes a data frame and the names of variables in that data frame
  • The function fits a regression model for predicting one specified variable from the others given
  • It reports the model results along with measurements of prediction error and other diagnostic values
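
As a rough preview of what such a function might look like, here is a sketch under stated assumptions: it uses a linear model via `lm()`, the built-in `mtcars` data, and in-sample RMSE as the prediction error. The function name `fit_and_report()` and these choices are illustrative, not the version the course will use.

```r
# Hypothetical helper: fit a linear model for `outcome` from `predictors`.
fit_and_report <- function(data, outcome, predictors) {
  # Build a formula such as mpg ~ wt + hp from character names
  model_formula <- reformulate(predictors, response = outcome)
  fit <- lm(model_formula, data = data)

  list(
    coefficients = coef(fit),
    r_squared    = summary(fit)$r.squared,
    rmse         = sqrt(mean(residuals(fit)^2))  # in-sample prediction error
  )
}

result <- fit_and_report(mtcars, outcome = "mpg", predictors = c("wt", "hp"))
result$rmse
```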

Coming now

  • Write a function that scrapes data from a single page and outputs a data frame with items of interest from that page
  • Map that function over multiple pages to get a bigger data frame with items of interest from all pages
  • Items of interest: Amazon reviews of Yankee Candles
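
The plan above can be sketched without touching a real website. In the example below, two inline HTML strings stand in for two downloaded pages; the markup, the class names, and the function name `scrape_page()` are all made up for illustration and do not reflect Amazon's actual page structure. With real pages, the function would take a URL and call `read_html()` instead of `minimal_html()`.

```r
library(rvest)
library(purrr)

# Two inline HTML strings stand in for two downloaded pages.
# The markup and class names are made up for this sketch.
pages <- list(
  '<div class="review"><span class="stars">5</span><p class="text">Smells great</p></div>
   <div class="review"><span class="stars">1</span><p class="text">No scent at all</p></div>',
  '<div class="review"><span class="stars">4</span><p class="text">Nice candle</p></div>'
)

# Hypothetical function: one page in, one data frame of reviews out
scrape_page <- function(html) {
  page <- minimal_html(html)
  data.frame(
    stars = page |> html_elements(".stars") |> html_text2() |> as.numeric(),
    text  = page |> html_elements(".text") |> html_text2()
  )
}

# Map over all pages and row-bind the results (map_dfr() needs dplyr installed)
reviews <- map_dfr(pages, scrape_page)
```

The single-page function can be written and tested on one page first; mapping it over many pages is then a one-line step.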

Application exercise

Yankee Candle reviews and COVID

Goal

  • Scrape data from multiple pages and organize it in a tidy format in R
  • Perform light text parsing to clean data
  • Summarize and visualize the data

ae-11

  • Go to the course GitHub org and find your ae-11 (repo name will be suffixed with your GitHub name).
  • Clone the repo in your container, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline – 3 days from today.