library(tidyverse)
library(nycflights13)
AE 03: Wrangling flights
To demonstrate data wrangling we will use flights
, a tibble in the nycflights13 R package. It includes characteristics of all flights departing from New York City (JFK, LGA, EWR) in 2013.
The data frame has over 336,000 observations (rows), 336776 observations to be exact, so we will not view the entire data frame. Instead we’ll use the commands below to help us explore the data.
glimpse(flights)
Rows: 336,776
Columns: 19
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
$ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
$ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
$ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
$ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
$ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
$ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
$ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
$ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
$ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
names(flights)
[1] "year" "month" "day" "dep_time"
[5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
[9] "arr_delay" "carrier" "flight" "tailnum"
[13] "origin" "dest" "air_time" "distance"
[17] "hour" "minute" "time_hour"
head(flights)
# A tibble: 6 × 19
year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 1 1 517 515 2 830 819 11 UA
2 2013 1 1 533 529 4 850 830 20 UA
3 2013 1 1 542 540 2 923 850 33 AA
4 2013 1 1 544 545 -1 1004 1022 -18 B6
5 2013 1 1 554 600 -6 812 837 -25 DL
6 2013 1 1 554 558 -4 740 728 12 UA
# … with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
# dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>, and abbreviated variable names ¹sched_dep_time,
# ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
The head()
function returns “A tibble: 6 x 19” and then the first six rows of the flights
data.
Tibble vs. data frame
A tibble is an opinionated version of the R
data frame. In other words, all tibbles are data frames, but not all data frames are tibbles!
There are two main differences between a tibble and a data frame:
- When you print a tibble, the first ten rows and all of the columns that fit on the screen will display, along with the type of each column.
Let’s look at the differences in the output when we type flights
(tibble) in the console versus typing cars
(data frame) in the console.
- Second, tibbles are somewhat more strict than data frames when it comes to subsetting data. You will get an error message if you try to access a variable that doesn’t exist in a tibble. You will get
NULL
if you try to access a variable that doesn’t exist in a data frame.
$apple flights
Warning: Unknown or uninitialised column: `apple`.
NULL
$apple cars
NULL
Data wrangling with dplyr
dplyr is the primary package in the tidyverse for data wrangling. Click here for the dplyr reference page. Click here for the dplyr cheatsheet.
Quick summary of key dplyr functions1:
Rows:
filter()
:chooses rows based on column values.slice()
: chooses rows based on location.arrange()
: changes the order of the rowssample_n()
: take a random subset of the rows
Columns:
select()
: changes whether or not a column is included.rename()
: changes the name of columns.mutate()
: changes the values of columns and creates new columns.
Groups of rows:
summarise()
: collapses a group into a single row.count()
: count unique values of one or more variables.group_by()
: perform calculations separately for each value of a variable
select()
- Demo: Make a data frame that only contains the variables
dep_delay
andarr_delay
.
# add code here
- Demo: Make a data frame that keeps every variable except
dep_delay
.
# add code here
- Demo: Make a data frame that includes all variables between
year
throughdep_delay
(inclusive). These are all variables that provide information about the departure of each flight.
# add code here
- Demo: Use the
select
helpercontains()
to make a data frame that includes the variables associated with the arrival, i.e., contains the string"arr\_"
in the name.
# add code here
The pipe
Before working with more data wrangling functions, let’s formally introduce the pipe. The pipe, |>
, is an operator (a tool) for passing information from one process to another. We will use |>
mainly in data pipelines to pass the output of the previous line of code as the first input of the next line of code.
When reading code “in English”, say “and then” whenever you see a pipe.
- Your turn (4 minutes): Run the following chunk and observe its output. Then, come up with a different way of obtaining the same output.
# add code here
slice()
- Demo: Display the first five rows of the
flights
data frame.
# add code here
- Demo: Display the last two rows of the
flights
data frame.
# add code here
arrange()
- Demo: Let’s arrange the data by departure delay, so the flights with the shortest departure delays will be at the top of the data frame.
Question: What does it mean for the dep_delay
to have a negative value?
# add code here
- Demo: Now let’s arrange the data by descending departure delay, so the flights with the longest departure delays will be at the top.
# add code here
- Your turn (5 minutes): Create a data frame that only includes the plane tail number (
tailnum
), carrier (carrier
), and departure delay for the flight with the longest departure delay. What is the plane tail number (tailnum
) for this flight? Share your response on Slack on the relevant thread on #general using code formatting.
# add code here
filter()
- Demo: Filter the data frame by selecting the rows where the destination airport is RDU.
# add code here
- Demo: We can also filter using more than one condition. Here we select all rows where the destination airport is RDU and the arrival delay is less than 0.
# add code here
We can do more complex tasks using logical operators:
operator | definition |
---|---|
< |
is less than? |
<= |
is less than or equal to? |
> |
is greater than? |
>= |
is greater than or equal to? |
== |
is exactly equal to? |
!= |
is not equal to? |
x & y |
is x AND y? |
x \| y |
is x OR y? |
is.na(x) |
is x NA? |
!is.na(x) |
is x not NA? |
x %in% y |
is x in y? |
!(x %in% y) |
is x not in y? |
!x |
is not x? |
The final operator only makes sense if x
is logical (TRUE / FALSE).
- Your turn (4 minutes): Describe what the code is doing in words. Share your response on Slack on the relevant thread on #general.
# add code here
count()
- Demo: Create a frequency table of the destination locations for flights from New York.
# add code here
- Demo: In which month was there the fewest number of flights? How many flights were there in that month?
# add code here
- Your turn (5 minutes): On which date (month + day) was there the largest number of flights? How many flights were there on that day? Share your response on Slack on the relevant thread on #general using code formatting.
# add code here
mutate()
Use mutate()
to create a new variable.
- Demo: In the code chunk below,
air_time
(minutes in the air) is converted to hours, and then new variablemph
is created, corresponding to the miles per hour of the flight.
# add code here
- Your turn (4 minutes): Create a new variable to calculate the percentage of flights in each month. What percentage of flights take place in July?
# add code here
summarize()
summarize()
collapses the rows into summary statistics and removes columns irrelevant to the calculation.
Be sure to name your columns!
# add code here
Question: Why did this code return NA
?
Let’s fix it!
# add code here
group_by()
group_by()
is used for grouped operations. It’s very powerful when paired with summarise()
to calculate summary statistics by group.
Here we find the mean and standard deviation of departure delay for each month.
# add code here
- Your turn (4 minutes): What is the median departure delay for each airports around NYC (
origin
)? Which airport has the shortest median departure delay? Share your response on Slack on the relevant thread.
# add code here
Additional Practice
- Create a new dataset that only contains flights that do not have a missing departure time. Include the columns
year
,month
,day
,dep_time
,dep_delay
, anddep_delay_hours
(the departure delay in hours). Hint: Note you may need to usemutate()
to make one or more of these variables.
# add code here
- For each airplane (uniquely identified by
tailnum
), use agroup_by()
paired withsummarize()
to find the sample size, mean, and standard deviation of flight distances. Then include only the top 5 and bottom 5 airplanes in terms of mean distance traveled per flight in the final data frame.
# add code here
Footnotes
From dplyr vignette↩︎