library(tidyverse)
Lab 4 - Probability and Simpson’s Paradox
Learning goals
- Calculate single event, conditional, and “and” probabilities.
- Interpret probabilities in the context of the problem.
- Display a fundamental understanding of Simpson’s Paradox.
- Practice teamwork and collaboration on GitHub.
Getting started
All team members should clone the team GitHub repository for the lab. Then, one team member should edit the document YAML by adding the team name to the subtitle
field and adding the names of the team members contributing to lab to the author
field. Hopefully that’s everyone, but if someone doesn’t contribute during the lab session or throughout the week before the deadline, their name should not be added. If you have 4 members in your team, you can delete the line for the 5th team member. Then, this team member should render the document and commit and push the changes. All others should not touch the document at this stage.
title: "Lab 4 - Probability and Simpson’s Paradox"
subtitle: "Team name"
author:
- "Team member 1"
- "Team member 2"
- "Team member 3"
- "Team member 4"
- "Team member 5"
format: pdf
editor: visual
execute:
echo: fenced
Exercises
Exercise 1 - Probability and you
We use probabilities all the time when making decisions. As a group, provide two real world examples of when you’ve used probability to make decisions in your every day life. Think critically. Be creative.
Exercise 2 - Risk of coronary heart disease
This data set is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. We want to examine the relationship between various health characteristics and the risk of having heart disease.
- Load in the data set called
education-disease
and answer the following questions below. - How many levels of education are there in these data? How many levels of disease are there? Hint: The
distinct()
function might be helpful. - Convert the data to a two-way table where each cell is the number of people falling into each combination of Disease and Education. Hint: Use
count
andpivot_wider
. Your answer should be a 4x3 data frame with counts in each cell.
Using the summary table you created in part (c), answer the remaining questions. You do not have to use R functions for your calculations, you can use R as a calculator using the values from the summary table. Make sure to show your work, i.e., instead of reporting just the final answer, use R to calculate that in a way we can see the counts you’ve used along the way.
- What is the probability of a random individual being high risk for cardiovascular disease?
- What is the probability of a random individual having high school or GED education and not being high risk for cardiovascular disease?
- What is the probability that a random individual who is already high risk for cardiovascular disease has a college education?
Exercise 3 - Computer store
In a computer store, 30% of the computers in stock are laptops and 70% are desktops. Five percent of the laptops are on sale, while 10% of the desktops are on sale.1
- Fill in the code below to create a hypothetical two-way table to represent this situation. Assume the total number of computers is 1000.
<- tibble(
data type = c(),
sale = c(),
values = c()
)
|>
data pivot_wider(
names_from = ...,
values_from = ....
)
Calculate the probability of that a randomly selected computer will be a desktop, given that the computer is on sale.
In your own words, explain what this probability means.
Exercise 4 - Bike rentals
Bike sharing systems are new generation of traditional bike rentals where whole process from membership, rental and return back has become automatic. You are tasked to investigate the relationship between the temperature outside and the number of bikes rented in the Washington DC area between the years 2011 and 2022. You will be investigating data for the months June, July, September, and November.2
Below is a list of variables and their definitions:
Variable | Definition |
---|---|
season |
Numerical representation of Spring (2), Summer (3), and Fall (4) |
year |
Numerical representation of 2011 (0) or 2012 (1) |
month |
Month in which data were collected |
holiday |
Indicator variable for whether data were collected on a holiday (1) or not (0) |
weekday |
Numerical representation of day of week |
temp |
Temperature in Celsius |
count |
Number of bike rentals for that day |
- Read in the data. Then, create a scatter plot that investigates the relationship between the number of bikes rented and the temperature outside. Include a straight line of best fit to help discuss the discovered relationship. Summarize your findings in 2-3 sentences.
- Another researcher suggests to look at the relationship between bikes rented and temperature by each of the four months of interest. Recreate your plot in part a, and color the points by month. Include a straight line for each of the four months to help discuss each month’s relationship between bikes rented and temperature. In 3-4 sentences, summarize your findings.
Please watch the following video on Simpson’s Paradox here. After you do, please answer the following questions.
In your own words, summarize Simpson’s Paradox in 2-3 sentences.
Compare and contrast your findings from part (a) and part (b). What’s different?
Think critically about your answer to part d. What other context from this study could be creating this paradox? That is, identify a potential confounding variable in this study. Be sure to justify how your example could be a potential confounding variable by relating it back to both bike rentals and temperature.
Closing an issue with a commit
Go to your GitHub repository, you will see an issue with the title “Learn to close an issue with a commit”. Your goal is to close this issue with a commit to practice this workflow, which is the workflow you will use when you are addressing feedback on your projects.
- Go to the relevant section in your lab .qmd file.
- Delete the sentence that says “Delete me”.
- Render the document.
- Commit your changes from the git tab with the commit message “Delete sentence, closes #1.”
- Push your changes to your repo and observe that the issue is now closed and the commit associated with this move is linked from the issue.
GitHub allows you to close an issue directly with commits if the commit uses one of the following keywords followed bu the issue number (which you can find next to the issue title): close, closes, closed, fix, fixes, fixed, resolve, resolves, and resolved.
Submission
To submit your assignment:
- Go to http://www.gradescope.com and click Log in in the top right corner.
- Click School Credentials \(\rightarrow\) Duke NetID and log in using your NetID credentials.
- Click on your STA 199 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark all the pages associated with exercise. All the pages of your lab should be associated with at least one question (i.e., should be “checked”). If you do not do this, you will be subject to lose points on the assignment.
- Select the first page of your .pdf submission to be associated with the “Workflow & formatting” question.
Grading
Component | Points |
---|---|
Ex 1 | 4 |
Ex 2 | 14 |
Ex 3 | 11 |
Ex 4 | 16 |
Workflow & formatting | 5 |
Total | 50 |