Use of this document

This is a study note for using \(dplyr\) package for data tranformation. For more details on the study material see https://learning.oreilly.com/library/view/r-for-data/9781491910382/ch03.html.

R Operators

R has several operators to perform tasks including arithmetic, logical and bitwise operations. In this article, you will learn about different R operators with the help of examples.

Arithmetic Operators

Operator Description
+ Addition
- Subtraction
* Multiplication
/ Division
^ Exponent
%% Remainder from division
%/% Integer Division

Relational Operators

Operator Description
< Less than
> Greater than
<= Less than or equal to
>= Greater than or equal to
== Equal to
!= Not equal to

Logical Operators:

Operator Description
! Logical NOT
& Element-wise logical AND
&& Logical AND
| Element-wise logical OR
\\ Logical OR
!(x & y),!x | !y De Morgan’s law

Prerequisites

library(nycflights13)
library(tidyverse)

2. dplyr Basics

In this chapter you are going to learn the five key dplyr functions that allow you to solve the vast majority of your data-manipulation challenges:

Short hand

  • x %in% y: This will select every row where x is one of the values in y.
  • is.na(x): determine if a value is missing.
  • %>%: x %>% f(y) turns into f(x, y).
  • as_tibble(): convert data to tibble form.

2.1 Filter Rows with: filter()

filter() allows you to subset observations based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame.

  • Format: filter(x,condition,condition)
  • Commonly used method: Logical Operators (!, &,&&, |, \\, !(x & y), !x | !y), missing value (!is.na())
filter(flights, month == 1, day == 1)
## # A tibble: 842 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 832 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
filter(flights, month == 11 | month == 12, day == 25) # filter(flights, month %in% c(11, 12), day == 25)
## # A tibble: 1,661 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    11    25        5           2359         6      436            445
##  2  2013    11    25      508            500         8      647            651
##  3  2013    11    25      513            515        -2      801            808
##  4  2013    11    25      537            540        -3      827            850
##  5  2013    11    25      542            545        -3      845            835
##  6  2013    11    25      550            600       -10      843            856
##  7  2013    11    25      550            600       -10      647            659
##  8  2013    11    25      551            550         1     1019           1027
##  9  2013    11    25      553            600        -7      657            719
## 10  2013    11    25      553            600        -7      826            846
## # … with 1,651 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
filter(flights, !is.na(dep_delay), !is.na(arr_delay))
## # A tibble: 327,346 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 327,336 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

2.2 Arrange Rows with arrange: arrange()

arrange() works similarly to filter() except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns.

  • Format: arrange(x,factor, subfactor,subsubfactor)
  • Commonly used method: desc()
arrange(flights, year, month, day)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
arrange(flights, desc(arr_delay))
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     9      641            900      1301     1242           1530
##  2  2013     6    15     1432           1935      1137     1607           2120
##  3  2013     1    10     1121           1635      1126     1239           1810
##  4  2013     9    20     1139           1845      1014     1457           2210
##  5  2013     7    22      845           1600      1005     1044           1815
##  6  2013     4    10     1100           1900       960     1342           2211
##  7  2013     3    17     2321            810       911      135           1020
##  8  2013     7    22     2257            759       898      121           1026
##  9  2013    12     5      756           1700       896     1058           2020
## 10  2013     5     3     1133           2055       878     1250           2215
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

2.3 Select Columns with: select()

select() allows you to rapidly zoom in on a useful subset using operations based on the names of the variables, often narrowing in on the variables you’re actually interested in.

  • Format: select(x,attribute, attribute, attribute)
  • Commonly used method: Arithmetic Operators (+, -, *, /, ^, %/%, %%), starts_with("abc"), ends_with("xyz"), contains("ijk"), matches("(.)\\1"), num_range("x", 1:3), rename(), everything()
select(flights, year, month, day) # select(flights, year:day)
## # A tibble: 336,776 x 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # … with 336,766 more rows
select(flights, -(year:day))
## # A tibble: 336,776 x 16
##    dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##       <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1      517            515         2      830            819        11 UA     
##  2      533            529         4      850            830        20 UA     
##  3      542            540         2      923            850        33 AA     
##  4      544            545        -1     1004           1022       -18 B6     
##  5      554            600        -6      812            837       -25 DL     
##  6      554            558        -4      740            728        12 UA     
##  7      555            600        -5      913            854        19 B6     
##  8      557            600        -3      709            723       -14 EV     
##  9      557            600        -3      838            846        -8 B6     
## 10      558            600        -2      753            745         8 AA     
## # … with 336,766 more rows, and 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
# rename tailnum by tail_num
rename(flights, tail_num = tailnum)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tail_num <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# move a important attribute column to the start of the data frame
select(flights, time_hour, air_time, everything())
## # A tibble: 336,776 x 19
##    time_hour           air_time  year month   day dep_time sched_dep_time
##    <dttm>                 <dbl> <int> <int> <int>    <int>          <int>
##  1 2013-01-01 05:00:00      227  2013     1     1      517            515
##  2 2013-01-01 05:00:00      227  2013     1     1      533            529
##  3 2013-01-01 05:00:00      160  2013     1     1      542            540
##  4 2013-01-01 05:00:00      183  2013     1     1      544            545
##  5 2013-01-01 06:00:00      116  2013     1     1      554            600
##  6 2013-01-01 05:00:00      150  2013     1     1      554            558
##  7 2013-01-01 06:00:00      158  2013     1     1      555            600
##  8 2013-01-01 06:00:00       53  2013     1     1      557            600
##  9 2013-01-01 06:00:00      140  2013     1     1      557            600
## 10 2013-01-01 06:00:00      138  2013     1     1      558            600
## # … with 336,766 more rows, and 12 more variables: dep_delay <dbl>,
## #   arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
## #   hour <dbl>, minute <dbl>

2.4 Add New Variables with: mutate()

mutate() always adds new columns at the end of your dataset so we’ll start by creating a narrower dataset so we can see the new variables.

  • Format: mutate(x,formula, formula)
  • Commonly used method: Arithmetic Operators (+, -, *, /, ^, %/%, %%), logs (log(), log2(), log10()), Offsets (lead(), lag()), Cumulative and rolling aggregates (cumsum(), cumprod(), cummin(), cummax()), Logical comparisons (<, <=, >, >=, !=), Ranking (min_rank(), row_number(), dense_rank(), percent_rank(), cume_dist(), ntile())
mutate(flights, 
       gain = arr_delay - dep_delay,
       speed = distance / air_time * 60)
## # A tibble: 336,776 x 21
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 13 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## #   gain <dbl>, speed <dbl>

2.5 Summaries with: summarize()

summarize() collapses a data frame to a single row

  • Format: summarize(x,newattr=formula, newattr=formula)
  • Commonly used method: counts (n(), sum(!is.na(x)), n_distinct), Measures of location (mean(), median(), mode()), Measures of spread (sd(x), IQR(x), mad(x)), Measures of rank (min(x), quantile(x, 0.25), max(x)), Measures of position (first(x), nth(x, 2), last(x)), missing value (na.rm = TRUE)
summarize(flights, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 x 1
##   delay
##   <dbl>
## 1  12.6

3. Piping the operation

You can use the Pipe (%>%, read “then”) to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom. piping considerably improves the readability of code.

3.1 Group/Ungroup by attribute with group_by()/ungroup()

group_by() group by multiple variables.
ungroup() remove grouping, and return to operations on ungrouped data.

  • Format: group_by(x, attribute, attribute) / ungroup(x)

3.2 Grouped Summary

To understand the message behind data, we have to go thought a list of process including inspecting, cleansing, transforming, and modeling data, maksing. the summary pipe provide a good standardized process of inspecting.

  • Format: filter() %>% group_by() %>% summarize() %>% filter() %>% arrange() and filter() %>% ggplot()
if(!is.tibble(flights)){as_tibble(flights)} # Convert to a tibble so it prints nicely
   
delays <- flights %>% # Filter outlier instances
  filter(dep_delay > 0) %>% # group by attributes
  group_by(month,tailnum) %>% # summarize instance by attributes
  summarize(count = n(),
            total_delay  = sum(dep_delay,na.rm = TRUE),
            mean_delay = mean(dep_delay,na.rm = TRUE)) %>% # arrange summary with order
  arrange(desc(total_delay))
## `summarise()` regrouping output by 'month' (override with `.groups` argument)
print(delays)
## # A tibble: 31,494 x 5
## # Groups:   month [12]
##    month tailnum count total_delay mean_delay
##    <int> <chr>   <int>       <dbl>      <dbl>
##  1     7 N184JB     29        2257       77.8
##  2     6 N504MQ     12        1889      157. 
##  3     4 N15980     27        1841       68.2
##  4     7 N355JB     34        1823       53.6
##  5     7 N198JB     32        1807       56.5
##  6     3 N16911     26        1802       69.3
##  7    12 N615QX     18        1759       97.7
##  8    12 N11535     17        1706      100. 
##  9     7 N721MQ     27        1703       63.1
## 10     3 N15973     16        1665      104. 
## # … with 31,484 more rows
#plot without filter
delays %>% # filter outlier instances
  ggplot(mapping = aes(x = month, y = mean_delay)) +
    geom_point() +
    geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

ggplot(data = delays, mapping = aes(x = mean_delay)) +
  geom_freqpoly(binwidth = 10)

# plot with filtering out small size sample
delays %>% # filter outlier instances
  filter(count > 10) %>% # ggplot for representation
  ggplot(mapping = aes(x = month, y = mean_delay)) +
    geom_point() +
    geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

3.3 Grouped Mutates (and Filters)

A grouped filter is a grouped mutate followed by an ungrouped filter.

  • Format: group_by() %>% filter() %>% mutate()
flights %>% # group by attribute
  group_by(dest) %>% # Find all groups bigger than a threshold
  filter(n() > 365) %>% # Find all instances bigger than zero
  filter(arr_delay > 0) %>% # Standardize to compute per group metrics:
  mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
  select(year:day, dest, arr_delay, prop_delay)
## # A tibble: 131,106 x 6
## # Groups:   dest [77]
##     year month   day dest  arr_delay prop_delay
##    <int> <int> <int> <chr>     <dbl>      <dbl>
##  1  2013     1     1 IAH          11  0.000111 
##  2  2013     1     1 IAH          20  0.000201 
##  3  2013     1     1 MIA          33  0.000235 
##  4  2013     1     1 ORD          12  0.0000424
##  5  2013     1     1 FLL          19  0.0000938
##  6  2013     1     1 ORD           8  0.0000283
##  7  2013     1     1 LAX           7  0.0000344
##  8  2013     1     1 DFW          31  0.000282 
##  9  2013     1     1 ATL          12  0.0000400
## 10  2013     1     1 DTW          16  0.000116 
## # … with 131,096 more rows

4. Relational data operation

To work with relational data you need verbs that work with pairs of tables. There are three families of verbs designed to work with relational data:

4.1 Key

There are two types of keys:

  • A primary key uniquely identifies an observation in its own table. For example, planes$tailnum is a primary key because it uniquely identifies each plane in the planes table.
  • A foreign key uniquely identifies an observation in another table. For example, the flights$tailnum is a foreign key because it appears in the flights table where it matches each flight to a unique plane.

4.2 Joins

A join is a way of connecting each row in x to zero, one, or more rows in y.

4.2.1 Mutating joins

A mutating join allows you to combine variables from two tables. It first matches observations by their keys, then copies across variables from one table to the other.

  • An inner join matches pairs of observations whenever their keys are equal:
  • An outer join keeps observations that appear in at least one of the tables
    • A left join keeps all observations in x.
    • A right join keeps all observations in y.
    • A full join keeps all observations in x and y.
x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)
x %>% 
  inner_join(y, by = "key")
## # A tibble: 2 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2
x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     1, "x4"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2"
)
left_join(x, y, by = "key")
## # A tibble: 4 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2   
## 3     2 x3    y2   
## 4     1 x4    y1
right_join(x, y, by = "key")
## # A tibble: 4 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2   
## 3     2 x3    y2   
## 4     1 x4    y1

4.2.2 Filtering joins

  • semi_join(x, y) keeps all observations in x that have a match in y.
  • anti_join(x, y) drops all observations in x that have a match in y

We will use the nycflights13 package to learn about relational data. nycflights13 contains four tibbles that are related to the flights table that you used in data transformation:

airlines
## # A tibble: 16 x 2
##    carrier name                       
##    <chr>   <chr>                      
##  1 9E      Endeavor Air Inc.          
##  2 AA      American Airlines Inc.     
##  3 AS      Alaska Airlines Inc.       
##  4 B6      JetBlue Airways            
##  5 DL      Delta Air Lines Inc.       
##  6 EV      ExpressJet Airlines Inc.   
##  7 F9      Frontier Airlines Inc.     
##  8 FL      AirTran Airways Corporation
##  9 HA      Hawaiian Airlines Inc.     
## 10 MQ      Envoy Air                  
## 11 OO      SkyWest Airlines Inc.      
## 12 UA      United Air Lines Inc.      
## 13 US      US Airways Inc.            
## 14 VX      Virgin America             
## 15 WN      Southwest Airlines Co.     
## 16 YV      Mesa Airlines Inc.
airports
## # A tibble: 1,458 x 8
##    faa   name                       lat    lon   alt    tz dst   tzone          
##    <chr> <chr>                    <dbl>  <dbl> <dbl> <dbl> <chr> <chr>          
##  1 04G   Lansdowne Airport         41.1  -80.6  1044    -5 A     America/New_Yo…
##  2 06A   Moton Field Municipal A…  32.5  -85.7   264    -6 A     America/Chicago
##  3 06C   Schaumburg Regional       42.0  -88.1   801    -6 A     America/Chicago
##  4 06N   Randall Airport           41.4  -74.4   523    -5 A     America/New_Yo…
##  5 09J   Jekyll Island Airport     31.1  -81.4    11    -5 A     America/New_Yo…
##  6 0A9   Elizabethton Municipal …  36.4  -82.2  1593    -5 A     America/New_Yo…
##  7 0G6   Williams County Airport   41.5  -84.5   730    -5 A     America/New_Yo…
##  8 0G7   Finger Lakes Regional A…  42.9  -76.8   492    -5 A     America/New_Yo…
##  9 0P2   Shoestring Aviation Air…  39.8  -76.6  1000    -5 U     America/New_Yo…
## 10 0S9   Jefferson County Intl     48.1 -123.    108    -8 A     America/Los_An…
## # … with 1,448 more rows
planes 
## # A tibble: 3,322 x 9
##    tailnum  year type          manufacturer   model  engines seats speed engine 
##    <chr>   <int> <chr>         <chr>          <chr>    <int> <int> <int> <chr>  
##  1 N10156   2004 Fixed wing m… EMBRAER        EMB-1…       2    55    NA Turbo-…
##  2 N102UW   1998 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
##  3 N103US   1999 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
##  4 N104UW   1999 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
##  5 N10575   2002 Fixed wing m… EMBRAER        EMB-1…       2    55    NA Turbo-…
##  6 N105UW   1999 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
##  7 N107US   1999 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
##  8 N108UW   1999 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
##  9 N109UW   1999 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
## 10 N110UW   1999 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
## # … with 3,312 more rows
weather
## # A tibble: 26,115 x 15
##    origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
##    <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
##  1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4 
##  2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
##  3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5 
##  4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7 
##  5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7 
##  6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5 
##  7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0 
##  8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4 
##  9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0 
## 10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8 
## # … with 26,105 more rows, and 5 more variables: wind_gust <dbl>, precip <dbl>,
## #   pressure <dbl>, visib <dbl>, time_hour <dttm>

Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:

top_dest <- flights %>%
  count(dest, sort = TRUE) %>%
  head(10)
# semi_join
flights %>% 
  filter(dest %in% top_dest$dest)
## # A tibble: 141,145 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      542            540         2      923            850
##  2  2013     1     1      554            600        -6      812            837
##  3  2013     1     1      554            558        -4      740            728
##  4  2013     1     1      555            600        -5      913            854
##  5  2013     1     1      557            600        -3      838            846
##  6  2013     1     1      558            600        -2      753            745
##  7  2013     1     1      558            600        -2      924            917
##  8  2013     1     1      558            600        -2      923            937
##  9  2013     1     1      559            559         0      702            706
## 10  2013     1     1      600            600         0      851            858
## # … with 141,135 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% 
  semi_join(top_dest)
## Joining, by = "dest"
## # A tibble: 141,145 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      542            540         2      923            850
##  2  2013     1     1      554            600        -6      812            837
##  3  2013     1     1      554            558        -4      740            728
##  4  2013     1     1      555            600        -5      913            854
##  5  2013     1     1      557            600        -3      838            846
##  6  2013     1     1      558            600        -2      753            745
##  7  2013     1     1      558            600        -2      924            917
##  8  2013     1     1      558            600        -2      923            937
##  9  2013     1     1      559            559         0      702            706
## 10  2013     1     1      600            600         0      851            858
## # … with 141,135 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# anti_join
flights %>%
  anti_join(planes, by = "tailnum") %>%
  count(tailnum, sort = TRUE)
## # A tibble: 722 x 2
##    tailnum     n
##    <chr>   <int>
##  1 <NA>     2512
##  2 N725MQ    575
##  3 N722MQ    513
##  4 N723MQ    507
##  5 N713MQ    483
##  6 N735MQ    396
##  7 N0EGMQ    371
##  8 N534MQ    364
##  9 N542MQ    363
## 10 N531MQ    349
## # … with 712 more rows

4.2.3 Duplicate key

  • One table has duplicate keys. This is useful when you want to add in additional information as there is typically a one-to-many relationship.
  • Both tables have duplicate keys. This is usually an error because in neither table do the keys uniquely identify an observation. When you join duplicated keys, you get all possible combinations, the Cartesian product.

4.2.4 defining the key column

For these examples, we’ll make it easier to see what’s going on in the examples by creating a narrower dataset:

  • by = NULL (default), uses all variables that appear in both tables, the so called natural join. For example, the flights and weather tables match on their common variables: year, month, day, hour and origin.

  • by = "x". This is like a natural join, but uses only some of the common variables. For example, flights and planes have year variables, but they mean different things so we only want to join by tailnum.

  • by = c("a" = "b") This will match variable a in table x to variable b in table y. The variables from x will be used in the output.

# dataset
flights2 <- flights %>% 
  select(year:day, hour, origin, dest, tailnum, carrier)
# by = NULL
flights2 %>% 
  left_join(weather)
## Joining, by = c("year", "month", "day", "hour", "origin")
## # A tibble: 336,776 x 18
##     year month   day  hour origin dest  tailnum carrier  temp  dewp humid
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <dbl> <dbl> <dbl>
##  1  2013     1     1     5 EWR    IAH   N14228  UA       39.0  28.0  64.4
##  2  2013     1     1     5 LGA    IAH   N24211  UA       39.9  25.0  54.8
##  3  2013     1     1     5 JFK    MIA   N619AA  AA       39.0  27.0  61.6
##  4  2013     1     1     5 JFK    BQN   N804JB  B6       39.0  27.0  61.6
##  5  2013     1     1     6 LGA    ATL   N668DN  DL       39.9  25.0  54.8
##  6  2013     1     1     5 EWR    ORD   N39463  UA       39.0  28.0  64.4
##  7  2013     1     1     6 EWR    FLL   N516JB  B6       37.9  28.0  67.2
##  8  2013     1     1     6 LGA    IAD   N829AS  EV       39.9  25.0  54.8
##  9  2013     1     1     6 JFK    MCO   N593JB  B6       37.9  27.0  64.3
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA       39.9  25.0  54.8
## # … with 336,766 more rows, and 7 more variables: wind_dir <dbl>,
## #   wind_speed <dbl>, wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## #   visib <dbl>, time_hour <dttm>
# by = "x"
flights2 %>% 
  left_join(planes, by = "tailnum")
## # A tibble: 336,776 x 16
##    year.x month   day  hour origin dest  tailnum carrier year.y type 
##     <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>    <int> <chr>
##  1   2013     1     1     5 EWR    IAH   N14228  UA        1999 Fixe…
##  2   2013     1     1     5 LGA    IAH   N24211  UA        1998 Fixe…
##  3   2013     1     1     5 JFK    MIA   N619AA  AA        1990 Fixe…
##  4   2013     1     1     5 JFK    BQN   N804JB  B6        2012 Fixe…
##  5   2013     1     1     6 LGA    ATL   N668DN  DL        1991 Fixe…
##  6   2013     1     1     5 EWR    ORD   N39463  UA        2012 Fixe…
##  7   2013     1     1     6 EWR    FLL   N516JB  B6        2000 Fixe…
##  8   2013     1     1     6 LGA    IAD   N829AS  EV        1998 Fixe…
##  9   2013     1     1     6 JFK    MCO   N593JB  B6        2004 Fixe…
## 10   2013     1     1     6 LGA    ORD   N3ALAA  AA          NA <NA> 
## # … with 336,766 more rows, and 6 more variables: manufacturer <chr>,
## #   model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>
# by = c("a" = "b")
flights2 %>% 
  left_join(airports, c("dest" = "faa"))
## # A tibble: 336,776 x 15
##     year month   day  hour origin dest  tailnum carrier name    lat   lon   alt
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr> <dbl> <dbl> <dbl>
##  1  2013     1     1     5 EWR    IAH   N14228  UA      Geor…  30.0 -95.3    97
##  2  2013     1     1     5 LGA    IAH   N24211  UA      Geor…  30.0 -95.3    97
##  3  2013     1     1     5 JFK    MIA   N619AA  AA      Miam…  25.8 -80.3     8
##  4  2013     1     1     5 JFK    BQN   N804JB  B6      <NA>   NA    NA      NA
##  5  2013     1     1     6 LGA    ATL   N668DN  DL      Hart…  33.6 -84.4  1026
##  6  2013     1     1     5 EWR    ORD   N39463  UA      Chic…  42.0 -87.9   668
##  7  2013     1     1     6 EWR    FLL   N516JB  B6      Fort…  26.1 -80.2     9
##  8  2013     1     1     6 LGA    IAD   N829AS  EV      Wash…  38.9 -77.5   313
##  9  2013     1     1     6 JFK    MCO   N593JB  B6      Orla…  28.4 -81.3    96
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA      Chic…  42.0 -87.9   668
## # … with 336,766 more rows, and 3 more variables: tz <dbl>, dst <chr>,
## #   tzone <chr>

4.3 Compare to merge()

base::merge() can perform all four types of mutating join, but the advantages of the specific dplyr verbs is that * they more clearly convey the intent of your code: the difference between the joins is really important but concealed in the arguments of merge(). * dplyr’s joins are considerably faster and don’t mess with the order of the rows.

dplyr merge
inner_join(x, y) merge(x, y)
left_join(x, y) merge(x, y, all.x = TRUE)
right_join(x, y) merge(x, y, all.y = TRUE)
full_join(x, y) merge(x, y, all.x = TRUE, all.y = TRUE)

4.4 translation to SQL

SQL is the inspiration for dplyr’s conventions, so the translation is straightforward:

dplyr SQL
inner_join(x, y, by = "z") SELECT * FROM x INNER JOIN y USING (z)
left_join(x, y, by = "z") SELECT * FROM x LEFT OUTER JOIN y USING (z)
right_join(x, y, by = "z") SELECT * FROM x RIGHT OUTER JOIN y USING (z)
full_join(x, y, by = "z") SELECT * FROM x FULL OUTER JOIN y USING (z)