Lab 6 - Prediction and bootstrapping
Data and packages
The data can be found in the dsbox package, and it’s called `gss16`. Since the dataset is distributed with the package, we don’t need to load it separately; it becomes available to us when we load the package. We’ll use the tidyverse and tidymodels packages as well.
You can find out more about the dataset by inspecting its documentation, which you can access by running `?gss16` in the Console or using the Help menu in RStudio to search for `gss16`. You can also find this information here.

    library(tidyverse)
    library(tidymodels)
    library(dsbox)
Exercises
Exercise 1
- Create a new data frame called `gss16_advfront` that includes the variables `advfront`, `educ`, `polviews`, and `wrkstat`. Then, use the `drop_na()` function to remove rows that contain `NA`s from this new data frame. Sample code is provided below.

      gss16_advfront <- gss16 %>%
        select(___, ___, ___, ___) %>%
        drop_na()
- Re-level the `advfront` variable such that it has two levels: `"Strongly agree"` and `"Agree"` combined into a new level called `"Agree"`, and the remaining levels combined into `"Not agree"`. Then, re-order the levels in the following order: `"Agree"` and `"Not agree"`. Finally, `count()` how many times each new level appears in the `advfront` variable.
Hint: You can do this in various ways. One option is to use the `str_detect()` function to detect the existence of words. Note that these sometimes show up with lowercase first letters and sometimes with uppercase first letters. To detect either in the `str_detect()` function, you can use `[Aa]gree`. This is just one option, though; solve the problem however you like!
- Combine the levels of the `polviews` variable such that levels that have the word “liberal” in them are lumped into a level called `"Liberal"` and those that have the word “conservative” in them are lumped into a level called `"Conservative"`. Then, re-order the levels in the following order: `"Conservative"`, `"Moderate"`, and `"Liberal"`. Finally, `count()` how many times each new level appears in the `polviews` variable.
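To illustrate the `str_detect()` approach from the hint, here is a minimal sketch on a made-up data frame. The `toy` data, its `response` column, and the level spellings are assumptions for illustration only, so check the actual labels in `gss16` before adapting it. One thing to watch for: a pattern like `[Aa]gree` matches anywhere in a string, so it also matches a label such as "Disagree". The sketch anchors the pattern to avoid that.

```r
library(tidyverse)

# Hypothetical toy responses; the actual level labels in gss16 may differ,
# so inspect them first with count() or levels().
toy <- tibble(
  response = c("Strongly agree", "Agree", "Disagree",
               "Strongly disagree", "Dont know", "agree")
)

# An unanchored pattern matches "agree" inside "Disagree" as well
str_detect(c("Agree", "Disagree"), "[Aa]gree")

toy_releveled <- toy %>%
  mutate(
    # Match only strings that are exactly "agree"/"Agree",
    # optionally preceded by "Strongly "
    response = if_else(
      str_detect(response, "^(Strongly )?[Aa]gree$"),
      "Agree", "Not agree"
    ),
    # Put the levels in the requested order
    response = fct_relevel(response, "Agree", "Not agree")
  )

toy_releveled %>%
  count(response)
```

The same pattern of `mutate()`, `if_else()` or `case_when()`, and `fct_relevel()` should carry over to other variables whose levels need lumping, provided the patterns are checked against the real labels.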
Exercise 2
Specify a logistic regression model, using `"glm"` as the engine, that predicts `advfront` by `educ`. Name this specification `gss16_spec`. Report the tidy output below.
Write out the estimated model in proper notation.
Using your estimated model, predict the probability of agreeing with the following statement: Even if it brings no immediate benefits, scientific research that advances the frontiers of knowledge is necessary and should be supported by the federal government (`Agree` in `advfront`) if you have an education of 7 years.
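As a rough sketch of the workflow this exercise asks for, the example below specifies, fits, and predicts from a logistic regression on simulated data. The `toy` data frame and the names `x`, `y`, `toy_spec`, and `toy_fit` are hypothetical and not part of the lab. With the `"glm"` engine, the fitted equation has the form \(\log\left(\frac{p}{1-p}\right) = b_0 + b_1 x\), where \(p\) is the probability of the second factor level, so it is worth checking which level your coefficients describe.

```r
library(tidymodels)

set.seed(1)
# Hypothetical toy data: a two-level outcome and one numeric predictor
toy <- tibble(
  x = rnorm(200),
  y = factor(if_else(x + rnorm(200) > 0, "yes", "no"),
             levels = c("yes", "no"))
)

# Model specification: logistic regression with the "glm" engine
toy_spec <- logistic_reg() %>%
  set_engine("glm")

# Fit the specification to the data and show the coefficient table
toy_fit <- toy_spec %>%
  fit(y ~ x, data = toy)

tidy(toy_fit)

# Predicted probabilities for a new observation, one column per level
new_obs <- tibble(x = 0.5)
toy_pred <- predict(toy_fit, new_data = new_obs, type = "prob")
toy_pred

# The same probability by hand: glm models the log-odds of the
# second factor level (here "no"), so invert the logit for that level
b <- tidy(toy_fit)$estimate
p_no_manual <- plogis(b[1] + b[2] * 0.5)
p_no_manual
```

Computing the probability by hand next to `predict()` is a useful check that the written-out model and the fitted object agree.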
Exercise 3
Fit a new model that adds the additional explanatory variable `polviews`. Report the tidy output below.
Now, predict the probability of agreeing with the following statement: Even if it brings no immediate benefits, scientific research that advances the frontiers of knowledge is necessary and should be supported by the federal government (`Agree` in `advfront`) if you have an education of 7 years and are Conservative.
Exercise 4
In 2016, the GSS added a new question on harassment at work. The question is phrased as follows.
Over the past five years, have you been harassed by your superiors or co-workers at your job, for example, have you experienced any bullying, physical or psychological abuse?
Answers to this question are stored in the `harass5` variable in our data set.
Create a subset of the data that only contains `Yes` and `No` answers for the harassment question. How many responses chose each of these answers?
Describe how bootstrapping can be used to estimate the proportion of all Americans who have been harassed by their superiors or co-workers at their job.
Calculate a 95% bootstrap confidence interval for the proportion of Americans who have been harassed by their superiors or co-workers at their job. Use 1000 iterations when creating your bootstrap distribution. Interpret this interval in the context of the data.
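One possible way to build a bootstrap percentile interval for a proportion is sketched below. It assumes the infer package, which is attached with tidymodels, and uses a made-up sample. The `toy` data frame and its `answer` column are hypothetical, and the lab may expect a different approach.

```r
library(tidymodels)

set.seed(2)
# Hypothetical toy sample of 100 Yes/No answers (not the GSS data)
toy <- tibble(
  answer = factor(sample(c("Yes", "No"), size = 100,
                         replace = TRUE, prob = c(0.3, 0.7)))
)

# Resample the observed data with replacement 1000 times and
# compute the proportion of "Yes" in each resample
boot_dist <- toy %>%
  specify(response = answer, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop")

# The middle 95% of the bootstrap proportions gives the percentile interval
ci_95 <- boot_dist %>%
  get_confidence_interval(level = 0.95, type = "percentile")
ci_95
```

Changing `level` or `reps` in this sketch is a quick way to explore how the interval responds to each choice.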
Exercise 5
Where was your 95% confidence interval centered? Why does this make sense?
Now, calculate a 90% bootstrap confidence interval for the proportion of Americans who have been harassed by their superiors or co-workers at their job. Report the interval below. Is it wider or narrower than the 95% confidence interval?
Now, suppose you created a bootstrap distribution with 50,000 simulations instead of 1,000. What would you expect to change (if anything)?
- Center of the CI
- Width of the CI
Submission
To submit your assignment:
- Go to http://www.gradescope.com and click Log in in the top right corner.
- Click School Credentials → Duke NetID and log in using your NetID credentials.
- Click on your STA 199 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark all the pages associated with each exercise. All the pages of your lab should be associated with at least one question (i.e., should be “checked”). If you do not do this, you may lose points on the assignment.
- Select the first page of your .pdf submission to be associated with the “Workflow & formatting” question.
Grading
Component | Points |
---|---|
Ex 1 | 11 |
Ex 2 | 11 |
Ex 3 | 6 |
Ex 4 | 11 |
Ex 5 | 6 |
Workflow & formatting | 5 |
Total | 50 |