library(tidyverse)
library(tidymodels)
Inference overview
<- read_csv("data/roller_coasters.csv") rc
In this application exercise, we will be looking at a roller coaster and amusement park database by Duane Marden.1 This database records multiple features of roller coasters. For the purpose of this activity, we will work with a random sample of 157 roller coasters.
Variable | Description |
---|---|
age_group |
1: Older (Built between 1900-1979) 2: Recent (1980-1999) 3: Newest (2000-current) |
coaster |
Name of the roller coaster |
park |
Name of the park where the roller coaster is located |
city |
City where the roller coaster is located |
state |
State where the roller coaster is located |
type |
Material of track (Steel or Wooden) |
design |
How a passenger is positioned in the roller coaster: Flying: a roller coaster ridden while parallel with the track. Inverted: a roller coaster which uses trains traveling beneath, rather than on top of, the track. Unlike a suspended roller coaster, an inverted roller coaster’s trains are rigidly attached to the track. Sit Down: a traditional roller coaster ridden while sitting down. Suspended: a roller coaster using trains which travel beneath the track and pivot on a swinging arm from side to side, exaggerating the track’s banks and turns. Stand Up: a coaster ridden while standing up instead of sitting down. Pipeline: a coaster where riders are positioned between the rails instead of above or below. Wing: a coaster where pairs of riders sit on either side of a roller coaster track in which nothing is above or below the riders. 4th Dimension: a coaster where the cars have the ability to rotate on an axis perpendicular to that of the track. |
year_opened |
Year when roller coaster opened |
top_speed |
Maximum speed of roller coaster (mph) |
max_height |
Highest point of roller coaster (ft) |
drop |
Length of largest gap between high and low points of roller coaster (ft) |
length |
Length of roller coaster track (ft) |
duration |
Time length of roller coaster ride (sec) |
inversions |
Whether or not roller coaster flips passengers at any point (Yes or No) |
num_of_inversions |
Number of times roller coaster flips passengers |
Research question 1
Is the true proportion of steel roller coasters opened before 1970 different than those opened after 1970?
- Demo: Create a new variable called
opened_recently
that has a response ofyes
if the roller coaster was opened after 1970, andno
if the roller coaster was opened during or before 1970. Save this new factor variable in therc
data set. Additionally, maketype
a factor variable.
# add code here
- Your turn: Based on the research question, which of the following is the appropriate null hypothesis?
\(H_0\): \(\mu = 0.5\)
\(H_0\): \(p > 0.5\)
\(H_0\): \(\mu_1 - \mu_2 = 0\)
\(H_0\): \(p_1 - p_2 = 0\)
Add response here.
- Demo: Write out the appropriate alternative hypothesis in both words and notation.
Add response here.
- Demo: Calculate the observed statistic.
# add code here
- Demo: How do we create a null distribution for this hypothesis test? Write out the steps below.
Add response here.
- Demo: Create the null distribution with 1,000 repeated samples.
# add code here
- Demo: Visualize the null distribution and shade in the area used to calculate the p-value.
# add code here
- Your turn: Calculate p-value. Then use the p-value to make your conclusion using a significance level of 0.1 and write out an appropriate conclusion below. Recall that the conclusion has 3 components.
- How the p-value compares to the significance level.
- The decision you make with respect to the hypotheses (reject \(H_0\) /fail to reject \(H_0\))
- The conclusion in the context of the alternative hypothesis.
Add response here.
# add code here
Research question 2
We want to investigate the relationship between how fast a roller coaster goes, and how long a roller coaster lasts. Specifically, we are interested in how well the duration of a roller coaster explains or predicts how fast it is.
- Question: Based on this research question, which two variables should we use from our data set? Which is our response?
Add response here.
- Demo: Fit a model predicting
top_speed
fromduration
and display the tidy summary output.
# add code here
- Your turn: Use your model to estimate the top speed of a roller coaster if their duration is 155 minutes
# add code here
- Demo: Write out the estimated model in proper notation and interpret the slope coefficient
Add response here.
- Question: The slope you calculated and interpeted above is a sample statistic (an estimate). It is possible (and quite likely) that the true slope of the relationship between durations and top speeds of all roller coasters is not exactly equal to this value. How can we quantify the uncertainty around this estimate?
Add response here.
- Your turn: Describe how we can construct a bootstrap distribution for the slope of the model predicting
top_speed
fromduration
.
Add response here.
- Demo: Now, construct this bootstrap distribution.
# add code here
- Demo: Create a 90% confidence interval by filling in the code below.
# add code here
- Demo: Interpret the confidence interval in the context of the problem.
Add response here.
- Demo: Now fit a model predicting log of
top_speed
fromduration
.
# add code here
- Demo: Write the estimated model and interpret the slope.
Add response here.
Research question 3
We are also interested in investigating if roller coasters opened after 1970 are faster than those opened before. For this question, we want to estimate the difference.
- Question: How is this different from question one? Should we make a confidence interval or conduct a hypothesis test?
Add response here.
- Demo: Now, use bootstrapping to estimate the difference between the true speeds of roller coasters before and after 1970.
# add code here
- Your turn: Create a 99% confidence interval and interpret it in the context of the data and the research question.
# add code here
Add response here.
Discussion
- Brainstorm the differences between generating new samples via bootstrapping and permuting. Write down the key differences below. Hint: Think back to the different sampling techniques between research question 1 and research questions 2 and 3.