Tidying data

Lecture 7

Dr. Mine Çetinkaya-Rundel

Duke University
STA 199 - Fall 2022

9/20/22

Warm up

While you wait for class to begin…

  • Open your ae-05 project in RStudio, render your document, and commit and push.
  • Any questions from prepare materials? Go to slido.com / #sta199. You can also upvote others’ questions.

Announcements

  • My office hours this week 2-3 pm on Tuesday or by appointment
  • HW 2 typo fix
  • Comment on masking in class starting Thursday
  • SSMU Welcome Back event: Wed, Sep 21, 7:30 - 8:30 pm at Old Chem 201 (with food & refreshments at the Old Chem porch)
SSMU Fall get together flyer

Questions from last time 1

  • Can you explain the difference between primary and foreign keys?

A primary key is a variable that uniquely identifies an observation. A foreign key is the corresponding variable in another table. These keys might have the same name, but they don’t have to, e.g. by = c("student_email" = "student_id").

  • Will we be able to log into containers while not on duke WiFi?

Yes, you don’t have to be on Duke WiFi to connect to the containers.

  • How do you code “and/or”?

and is & and or is |.

Questions from last time 2

  • What is the difference between rendering and saving a document?

When you edit your Quarto file and you save it, your changes are saved but they’re not reflected in your output HTML or PDF file. When you render the document, the output is also updated to reflect those changes. When you hit render, RStudio automatically first saves your Quarto file, and then renders it. So I recommend you render early and often, both to save your changes, and also to make sure your changes did not introduce any errors into the document.

  • What does it mean to commit and push something?

We “commit” to take a snapshot of the files in our local repository, i.e. the files that are saved on the university servers you’re using. We “push” to get those changes to the remote repository, i.e. your repository on GitHub.

Tidying datasets

What makes a dataset “tidy”?

03:00

Application exercise

ae-05

  • Go to the course GitHub org and find your ae-05 (repo name will be suffixed with your GitHub name).
  • Clone the repo in your container, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline – 3 days from today.

Recap of AE

  • Data sets can’t be labeled as wide or long but they can be made wider or longer for a certain analysis that requires a certain format
  • When pivoting longer, variable names that turn into values are characters by default. If you need them to be in another format, you need to explicitly make that transformation, which you can do so within the pivot_longer() function.
  • You can tweak a plot forever, but at some point the tweaks are likely not very productive. However, you should always be critical of defaults (however pretty they might be) and see if you can improve the plot to better portray your data / results / what you want to communicate.