Models with multiple predictors

Lecture 19

Dr. Mine Çetinkaya-Rundel

Duke University
STA 199 - Fall 2022

11/3/22

Warm up

While you wait for class to begin…

  • Clone your ae-18 project from GitHub, render your document, update your name, and commit and push.
  • Post any questions you have about the material so far at sli.do / #sta199.

Announcements

  • Team evaluations open – due Sat night, 11:59pm (so we can review before Monday’s lab)
  • HW 5 will be posted soon
  • HW 6

Questions from last time

Q: What is a two-way table?

A: A table of frequencies for two categorical variables.

How do we go from the first table below (one row per species and sex combination) to the second (one row per species, one column per sex)?

library(palmerpenguins)
library(tidyverse)

penguins |>
  count(species, sex)
# A tibble: 8 × 3
  species   sex        n
  <fct>     <fct>  <int>
1 Adelie    female    73
2 Adelie    male      73
3 Adelie    <NA>       6
4 Chinstrap female    34
5 Chinstrap male      34
6 Gentoo    female    58
7 Gentoo    male      61
8 Gentoo    <NA>       5
# A tibble: 3 × 4
  species   female  male  `NA`
  <fct>      <int> <int> <int>
1 Adelie        73    73     6
2 Chinstrap     34    34    NA
3 Gentoo        58    61     5

Questions from last time

Q: How do we build a two-way table in a pipeline?

penguins |>
  count(species, sex) |>
  pivot_wider(names_from = sex, values_from = n)
# A tibble: 3 × 4
  species   female  male  `NA`
  <fct>      <int> <int> <int>
1 Adelie        73    73     6
2 Chinstrap     34    34    NA
3 Gentoo        58    61     5
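The `NA` in the Chinstrap row appears because `count()` produced no row for Chinstrap penguins with missing sex, so `pivot_wider()` has nothing to put in that cell. A minimal sketch of one way to show a count of 0 instead, using the `values_fill` argument of `pivot_wider()` (the object name `two_way` is chosen here for illustration only):

```r
library(palmerpenguins)
library(tidyverse)

# Count each species/sex combination, then spread sex across columns.
# values_fill = 0 fills cells that have no matching row in the counts
# (here: Chinstrap penguins with missing sex) with 0 instead of NA.
two_way <- penguins |>
  count(species, sex) |>
  pivot_wider(names_from = sex, values_from = n, values_fill = 0)

two_way
```

Whether 0 or `NA` is the better display depends on the audience; 0 is accurate here because the cell is a count of observed combinations.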

Questions from last time

Q: How do we know whether we can scrape data from a website?

A: Technically, with robotstxt::paths_allowed(), which checks whether a site's robots.txt permits access. That doesn’t address ethical considerations, though, which are just as important, if not more so.

robotstxt::paths_allowed("https://www.duke.edu/")
[1] TRUE

Questions from last time

Q: What is the proper notation for writing out a model?

  • Population models (truth):

\[ y = \beta_0 + \beta_1 \times x + \epsilon \]

  • Sample models (estimated):

\[ \hat{y} = b_0 + b_1 \times x \]
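To make the distinction concrete, here is a small simulation sketch. The "true" values \(\beta_0 = 2\) and \(\beta_1 = 3\), the sample size, and the seed are all made up for illustration; in practice the population values are unknown and only the estimates \(b_0\) and \(b_1\) are available.

```r
set.seed(199)  # arbitrary seed, for reproducibility only

# Population model (truth), with hypothetical parameter values
beta_0 <- 2
beta_1 <- 3

# Simulate one sample from the population model: y = beta_0 + beta_1 * x + epsilon
n <- 200
x <- runif(n, min = 0, max = 10)
epsilon <- rnorm(n, mean = 0, sd = 1)
y <- beta_0 + beta_1 * x + epsilon

# Sample model (estimated): y-hat = b_0 + b_1 * x
fit <- lm(y ~ x)
b <- coef(fit)
b  # estimates b_0 and b_1; close to, but not equal to, beta_0 and beta_1
```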

Goals

  • Fit and interpret models for predicting binary outcomes

  • Introduce overfitting and how to mitigate it by splitting the data into training and testing sets

Logistic regression

What is logistic regression?

  • Similar to linear regression, but…

  • A modeling tool for when the response is categorical

Modeling binary outcomes

  • Variables with binary outcomes follow the Bernoulli distribution:

    • \(y_i \sim Bern(p)\)

    • \(p\): Probability of success

    • \(1-p\): Probability of failure

  • We can’t model \(y\) directly, so instead we model \(p\)
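As a quick illustration of the points above, Bernoulli outcomes can be simulated in base R with `rbinom()` and `size = 1`. The value \(p = 0.3\) and the number of draws are arbitrary choices for this sketch.

```r
set.seed(199)  # arbitrary seed, for reproducibility only

p <- 0.3       # hypothetical probability of success
n <- 10000     # number of simulated observations

# Each draw is a single Bernoulli trial: 1 = success, 0 = failure
y <- rbinom(n, size = 1, prob = p)

table(y)   # only 0s and 1s are observed
mean(y)    # the sample proportion of successes, an estimate of p
```

Each individual `y` is only ever 0 or 1, which is why the quantity being modeled is `p` rather than `y` itself.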

Linear model

\[ p_i = \beta_0 + \beta_1 \times X_{1i} + \cdots + \epsilon \]

  • But remember that \(p\) must be between 0 and 1

  • We need a link function that transforms the linear model to have an appropriate range

This isn’t exactly what we need, though…

  • Recall, the goal is to take values between \(-\infty\) and \(\infty\) and map them to probabilities.

  • We need the opposite of the link function, that is, its inverse

  • Taking the inverse of the logit function will map arbitrary real values back to the range (0, 1)
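A minimal sketch of the logit and its inverse in R. The helper names `logit()` and `inv_logit()` are defined here for illustration; base R provides the same transformations as `qlogis()` and `plogis()`.

```r
# Logit: maps a probability in (0, 1) to a log-odds value on the real line
logit <- function(p) log(p / (1 - p))

# Inverse logit: maps any real value back to a probability in (0, 1)
inv_logit <- function(x) exp(x) / (1 + exp(x))

x <- c(-5, -1, 0, 1, 5)   # arbitrary real values
p <- inv_logit(x)
p                          # all values lie strictly between 0 and 1

# Applying the logit to those probabilities recovers the original values
all.equal(logit(p), x)

# The hand-written helper agrees with base R's plogis()
all.equal(p, plogis(x))
```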

Generalized linear model

  • We model the logit (log-odds) of \(p\):

\[ \text{logit}(p_i) = \log \bigg( \frac{p_i}{1 - p_i} \bigg) = \beta_0 + \beta_1 \times X_{1i} + \cdots \]

  • Then take the inverse to obtain the predicted \(p\):

\[ p_i = \frac{e^{\beta_0 + \beta_1 \times X_{1i} + \cdots}}{1 + e^{\beta_0 + \beta_1 \times X_{1i} + \cdots}} \]
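The two steps can be sketched with base R's `glm()` and `family = binomial`, which uses the logit link by default. The data below are simulated with made-up coefficients, and this is only one way to fit such a model; it is not necessarily the workflow used in the application exercise.

```r
set.seed(199)  # arbitrary seed, for reproducibility only

# Simulate a binary outcome whose log-odds are linear in x
# (the coefficient values -0.5 and 1.5 are hypothetical)
n <- 500
x <- rnorm(n)
p <- plogis(-0.5 + 1.5 * x)
y <- rbinom(n, size = 1, prob = p)

# Fit the logistic regression model
fit <- glm(y ~ x, family = binomial)
b <- coef(fit)   # estimated intercept and slope, on the log-odds scale

# Step 1: predicted log-odds (the linear predictor)
log_odds_hat <- predict(fit, type = "link")

# Step 2: apply the inverse logit to obtain predicted probabilities
manual_p_hat <- exp(log_odds_hat) / (1 + exp(log_odds_hat))

# predict(type = "response") performs the same back-transformation
p_hat <- predict(fit, type = "response")
all.equal(p_hat, manual_p_hat)
```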

A logistic model visualized

Takeaways

  • Generalized linear models allow us to fit models to predict non-continuous outcomes

  • Predicting binary outcomes requires modeling the log-odds of success, where p = probability of success

Review questions

What is a training data set?

“Sandbox” for model building. Build the model on these data.

What is a testing data set?

Held in reserve to test one or two chosen models and to evaluate their performance.
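A minimal base R sketch of a training/testing split. The simulated data, the 80/20 proportion, and the 0.5 classification cutoff are all assumptions made for illustration; packages such as rsample offer helpers for the same task.

```r
set.seed(199)  # arbitrary seed, for reproducibility only

# Hypothetical data: one predictor and a binary outcome
n <- 500
dat <- data.frame(x = rnorm(n))
dat$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * dat$x))

# Randomly assign 80% of rows to training; the rest form the testing set
train_rows <- sample(n, size = floor(0.8 * n))
train <- dat[train_rows, ]
test  <- dat[-train_rows, ]

# Build the model on the training data only
fit <- glm(y ~ x, data = train, family = binomial)

# Evaluate on the testing data: predict probabilities, classify at 0.5
p_test <- predict(fit, newdata = test, type = "response")
pred_class <- as.integer(p_test > 0.5)
test_accuracy <- mean(pred_class == test$y)
test_accuracy
```

Because the testing rows played no part in fitting, `test_accuracy` reflects performance on data the model has not seen, which is what guards against an overly optimistic assessment from an overfit model.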