Lab 4 - Probability and Simpson’s Paradox

Lab
Important

This lab is due Friday, October 28 at 11:59pm.

Learning goals

  • Calculate single event, conditional, and “and” probabilities.
  • Interpret probabilities in the context of the problem.
  • Display a fundamental understanding of Simpson’s Paradox.
  • Practice teamwork and collaboration on GitHub.
library(tidyverse)

Getting started

All team members should clone the team GitHub repository for the lab. Then, one team member should edit the document YAML by adding the team name to the subtitle field and adding the names of the team members contributing to lab to the author field. Hopefully that’s everyone, but if someone doesn’t contribute during the lab session or throughout the week before the deadline, their name should not be added. If you have 4 members in your team, you can delete the line for the 5th team member. Then, this team member should render the document and commit and push the changes. All others should not touch the document at this stage.

title: "Lab 4 - Probability and Simpson’s Paradox"
subtitle: "Team name"
author: 
  - "Team member 1"
  - "Team member 2"
  - "Team member 3"
  - "Team member 4"
  - "Team member 5"
format: pdf
editor: visual
execute:
  echo: fenced

Exercises

Important

Pick another member of the team write the answer to Exercise 1. All others should contribute to the discussion but only one person should type up the answer, render the document, commit, and push to GitHub. All others should not touch the document.

Exercise 1 - Probability and you

We use probabilities all the time when making decisions. As a group, provide two real world examples of when you’ve used probability to make decisions in your every day life. Think critically. Be creative.

Important

After the team member working on Exercise 1 renders, commits, and pushes, another team member should pull their changes and render the document. Then, they should write the answer to Exercise 2. All others should contribute to the discussion but only one person should type up the answer, render the document, commit, and push to GitHub. All others should not touch the document.

Exercise 2 - Risk of coronary heart disease

This data set is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. We want to examine the relationship between various health characteristics and the risk of having heart disease.

  1. Load in the data set called education-disease and answer the following questions below.
  2. How many levels of education are there in these data? How many levels of disease are there? Hint: The distinct() function might be helpful.
  3. Convert the data to a two-way table where each cell is the number of people falling into each combination of Disease and Education. Hint: Use count and pivot_wider. Your answer should be a 4x3 data frame with counts in each cell.

Using the summary table you created in part (c), answer the remaining questions. You do not have to use R functions for your calculations, you can use R as a calculator using the values from the summary table. Make sure to show your work, i.e., instead of reporting just the final answer, use R to calculate that in a way we can see the counts you’ve used along the way.

  1. What is the probability of a random individual being high risk for cardiovascular disease?
  2. What is the probability of a random individual having high school or GED education and not being high risk for cardiovascular disease?
  3. What is the probability that a random individual who is already high risk for cardiovascular disease has a college education?
Important

After the team member working on Exercise 2 renders, commits, and pushes, another team member should pull their changes and render the document. Then, they should write the answer to Exercise 3. All others should contribute to the discussion but only one person should type up the answer, render the document, commit, and push to GitHub. All others should not touch the document.

Exercise 3 - Computer store

In a computer store, 30% of the computers in stock are laptops and 70% are desktops. Five percent of the laptops are on sale, while 10% of the desktops are on sale.1

  1. Fill in the code below to create a hypothetical two-way table to represent this situation. Assume the total number of computers is 1000.
data <- tibble( 
  type = c(),
  sale = c(),
  values = c()
  )

data |>
  pivot_wider( 
    names_from = ...,
    values_from = ....
  )
  1. Calculate the probability of that a randomly selected computer will be a desktop, given that the computer is on sale.

  2. In your own words, explain what this probability means.

Important

After the team member working on Exercise 3 renders, commits, and pushes, another team member should pull their changes and render the document. Then, they should write the answer to Exercise 4. All others should contribute to the discussion but only one person should type up the answer, render the document, commit, and push to GitHub. All others should not touch the document.

Exercise 4 - Bike rentals

Bike sharing systems are new generation of traditional bike rentals where whole process from membership, rental and return back has become automatic. You are tasked to investigate the relationship between the temperature outside and the number of bikes rented in the Washington DC area between the years 2011 and 2022. You will be investigating data for the months June, July, September, and November.2

Below is a list of variables and their definitions:

Variable Definition
season Numerical representation of Spring (2), Summer (3), and Fall (4)
year Numerical representation of 2011 (0) or 2012 (1)
month Month in which data were collected
holiday Indicator variable for whether data were collected on a holiday (1) or not (0)
weekday Numerical representation of day of week
temp Temperature in Celsius
count Number of bike rentals for that day
  1. Read in the data. Then, create a scatter plot that investigates the relationship between the number of bikes rented and the temperature outside. Include a straight line of best fit to help discuss the discovered relationship. Summarize your findings in 2-3 sentences.
  2. Another researcher suggests to look at the relationship between bikes rented and temperature by each of the four months of interest. Recreate your plot in part a, and color the points by month. Include a straight line for each of the four months to help discuss each month’s relationship between bikes rented and temperature. In 3-4 sentences, summarize your findings.

Please watch the following video on Simpson’s Paradox here. After you do, please answer the following questions.

  1. In your own words, summarize Simpson’s Paradox in 2-3 sentences.

  2. Compare and contrast your findings from part (a) and part (b). What’s different?

  3. Think critically about your answer to part d. What other context from this study could be creating this paradox? That is, identify a potential confounding variable in this study. Be sure to justify how your example could be a potential confounding variable by relating it back to both bike rentals and temperature.

Important

After the team member working on Exercise 4 renders, commits, and pushes, all other team members should pull the changes and render the document. Finally, a team member different than the one responsible for typing up responses to Exercise 4 should do the last task outlined below.

Closing an issue with a commit

Go to your GitHub repository, you will see an issue with the title “Learn to close an issue with a commit”. Your goal is to close this issue with a commit to practice this workflow, which is the workflow you will use when you are addressing feedback on your projects.

  • Go to the relevant section in your lab .qmd file.
  • Delete the sentence that says “Delete me”.
  • Render the document.
  • Commit your changes from the git tab with the commit message “Delete sentence, closes #1.”
  • Push your changes to your repo and observe that the issue is now closed and the commit associated with this move is linked from the issue.

GitHub allows you to close an issue directly with commits if the commit uses one of the following keywords followed bu the issue number (which you can find next to the issue title): close, closes, closed, fix, fixes, fixed, resolve, resolves, and resolved.

Submission

Warning

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to render and push changes.

You must turn in a PDF file to the Gradescope page by the submission deadline to be considered “on time”.

Make sure your data are tidy! That is, your code should not be running off the pages and spaced properly. See: https://style.tidyverse.org/ggplot2.html.

To submit your assignment:

  • Go to http://www.gradescope.com and click Log in in the top right corner.
  • Click School Credentials \(\rightarrow\) Duke NetID and log in using your NetID credentials.
  • Click on your STA 199 course.
  • Click on the assignment, and you’ll be prompted to submit it.
  • Mark all the pages associated with exercise. All the pages of your lab should be associated with at least one question (i.e., should be “checked”). If you do not do this, you will be subject to lose points on the assignment.
  • Select the first page of your .pdf submission to be associated with the “Workflow & formatting” question.

Grading

Component Points
Ex 1 4
Ex 2 14
Ex 3 11
Ex 4 16
Workflow & formatting 5
Total 50
Note

The “Workflow & formatting” component assesses the reproducible workflow. This includes:

  • having at least 3 informative commit messages
  • labeling the code chunks
  • having readable code that does not exceed 80 characters, i.e., we can read all your code in the rendered PDF
  • each team member contributing to the repo with commits at least once
  • the issue being closed with a commit message

Footnotes

  1. This exercise was adapted from Mind on Statistics, 5th Ed. By Utts and Heckard.↩︎

  2. Data subset from Progress in Artificial Intelligence, 2013.↩︎