Use of this document

This is a study note on using the \(purrr\) package for list manipulation and parallel mapping in R. For more details on the study material see:

Prerequisites

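library(purrr) # functional programming tools: map() and friends
library(repurrrsive) # provides examples of lists. We explore them below, to lay the groundwork for other lessons, and to demonstrate list inspection strategies
library(listviewer) # expose list exploration in a rendered .Rmd document
library(jsonlite) # JSON parsing and generation
library(dplyr) # data frame manipulation: mutate() and the %>% pipe
library(tibble) # tibble() and enframe()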

1. Background

1.1 Functional Programming

\(purrr\) enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors. If you’ve never heard of FP before, the best place to start is the family of map() functions which allow you to replace many for loops with code that is both more succinct and easier to read. The best place to learn about the map() functions is the iteration chapter in R for data science. (source: purrr Overview)

1.2 Relationship between Atomic Vector and List

listviewer::jsonedit(gh_users, mode = "view")

Lists are built on top of vectors, whereas data frames and tibbles are built on top of lists. Compared with a list, an atomic vector is therefore lower-level, while data frames and tibbles are higher-level. The following is an example of the data object architecture:

[Figure: data object architecture]

Lists are a step up in complexity from atomic vectors: each element can be any type, not just vectors. You construct lists by using list() instead of c(). Some properties of lists:

  • lists allow the most flexibility
  • a list element can be anything
  • very useful (and common) as output from analytic methods
  • lists can also be variables in a dataset
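
To make the difference between c() and list() concrete, here is a minimal sketch using only base R. The object names v, l and types are made up for illustration and do not come from the study material.

```r
# c() coerces all elements to one common type (here: character)
v <- c(1, "a", TRUE)
class(v)

# list() keeps each element's own type; an element can even be another list
l <- list(1, "a", TRUE, list(x = 1:3))

# inspect the class of every element of the list
types <- vapply(l, class, character(1))
types
```

The nested list in the last position is what makes lists suitable for holding output such as JSON from an API.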

The following function can be used to explore a list:

  • str(..., list.len = x, max.level = y)
str(got_chars, list.len = 3, max.level = 1)
## List of 30
##  $ :List of 18
##  $ :List of 18
##  $ :List of 18
##   [list output truncated]
str(got_chars, list.len = 3, max.level = 2)
## List of 30
##  $ :List of 18
##   ..$ url        : chr "https://www.anapioficeandfire.com/api/characters/1022"
##   ..$ id         : int 1022
##   ..$ name       : chr "Theon Greyjoy"
##   .. [list output truncated]
##  $ :List of 18
##   ..$ url        : chr "https://www.anapioficeandfire.com/api/characters/1052"
##   ..$ id         : int 1052
##   ..$ name       : chr "Tyrion Lannister"
##   .. [list output truncated]
##  $ :List of 18
##   ..$ url        : chr "https://www.anapioficeandfire.com/api/characters/1074"
##   ..$ id         : int 1074
##   ..$ name       : chr "Victarion Greyjoy"
##   .. [list output truncated]
##   [list output truncated]

2. Vectorized and “list-ized” operations

2.1 Vectorization

The natural reflex as a programmer may be to loop over all values of the vector and apply the function, but vectorization makes that unnecessary. (source: dummies)

vec <- c(9, 16, 25) # make a vector
# a generic for loop
output <- numeric(length(vec)) # preallocate the result vector
for (i in seq_along(vec)) {
  output[i] <- sqrt(vec[i])
}
output
## [1] 3 4 5
# Vectorized operations
sqrt(vec) 
## [1] 3 4 5
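
Vectorized functions such as sqrt() work on atomic vectors but not on lists, which is what motivates the "list-ized" tools in the next section. The sketch below uses base R only; lst, res and roots are illustrative names.

```r
lst <- list(9, 16, 25) # the same values as above, but stored in a list

# sqrt() is not defined for a list, so we capture the error instead of stopping
res <- tryCatch(sqrt(lst), error = function(e) "error")
res

# applying sqrt() element by element works; vapply() also checks the output type
roots <- vapply(lst, sqrt, numeric(1))
roots
```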

2.2 “List-ization”

purrr::map() applies a function to each element of a list or atomic vector and always returns a list. The closest base R function is lapply(). A template for basic map() usage:

map(YOUR_LIST, YOUR_FUNCTION)
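
As a small sketch of the template, assuming a toy list x that is not part of the study material, map() and lapply() give the same result for a plain function:

```r
library(purrr)

x <- list(a = 1:3, b = 4:6) # toy input, made up for illustration

out_map <- map(x, sum)       # purrr
out_lapply <- lapply(x, sum) # base R equivalent

identical(out_map, out_lapply)

# map() also accepts an atomic vector and still returns a list
map(c(1, 4, 9), sqrt)
```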

2.2.1 Output as atomic vector

If you expect map() to return output that can be turned into an atomic vector, it is best to use a type-specific variant. purrr will then alert you to any problems, i.e. if one or more results have the wrong type or length:

  • map_lgl(): returns a logical vector
  • map_chr(): returns a character vector
  • map_int(): returns an integer vector
  • map_dbl(): returns a double vector

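map() and its variants also accept a name or a position in place of a function, as a shortcut for extracting the element with that name or at that position: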

map_chr(got_chars[9:12], "name")
## [1] "Daenerys Targaryen" "Davos Seaworth"     "Arya Stark"        
## [4] "Arys Oakheart"
map_chr(got_chars[13:16], 3)
## [1] "Asha Greyjoy"    "Barristan Selmy" "Varamyr"         "Brandon Stark"
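
The type check mentioned above can be seen with a small sketch. The list people is a made-up stand-in for got_chars, and the error is caught with tryCatch() only so that the example runs to the end.

```r
library(purrr)

# toy data, made up for illustration
people <- list(
  list(name = "Ann", age = 31L),
  list(name = "Bob", age = 45L)
)

ages <- map_int(people, "age")       # integer vector
names_chr <- map_chr(people, "name") # character vector

# asking for integers where the elements are strings is an error
bad <- tryCatch(map_int(people, "name"), error = function(e) "type error")
bad
```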

2.2.2 Output as list

When no atomic output type is specified, map() returns a list by default. Elements can be extracted from each list component by 1) using a position or name, or 2) passing `[` as .f:

  • single element extraction
map(gh_users, 1)
## [[1]]
## [1] "gaborcsardi"
## 
## [[2]]
## [1] "jennybc"
## 
## [[3]]
## [1] "jtleek"
## 
## [[4]]
## [1] "juliasilge"
## 
## [[5]]
## [1] "leeper"
## 
## [[6]]
## [1] "masalmon"
map(gh_users, "login")
## [[1]]
## [1] "gaborcsardi"
## 
## [[2]]
## [1] "jennybc"
## 
## [[3]]
## [1] "jtleek"
## 
## [[4]]
## [1] "juliasilge"
## 
## [[5]]
## [1] "leeper"
## 
## [[6]]
## [1] "masalmon"
  • multiple element extraction
x1 <- map(gh_users, `[`, c(18, 1, 2, 21))
listviewer::jsonedit(x1, mode = "view")
x2 <- map(gh_users, `[`, c("name", "login", "id", "location"))
listviewer::jsonedit(x2, mode = "view")
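
Since the jsonedit() widgets are not visible outside a rendered document, here is a small sketch of what the two extraction styles return. The list users is a made-up stand-in for gh_users.

```r
library(purrr)

# toy data, made up for illustration
users <- list(
  list(login = "u1", id = 1L, site_admin = FALSE),
  list(login = "u2", id = 2L, site_admin = TRUE)
)

# a name extracts one element per user: a list of two strings
single <- map(users, "login")

# `[` keeps several elements per user: a list of two named sub-lists
multi <- map(users, `[`, c("login", "id"))

str(single)
str(multi)
```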

2.2.3 Output as data frame

map_dfr() returns a data frame by row-binding the per-element results, which is a natural structure when several variables are extracted from each list element.

map_dfr(gh_users, `[`, c("name", "login", "id", "location"))
## # A tibble: 6 x 4
##   name                   login             id location              
##   <chr>                  <chr>          <int> <chr>                 
## 1 Gábor Csárdi           gaborcsardi   660288 Chippenham, UK        
## 2 Jennifer (Jenny) Bryan jennybc       599454 Vancouver, BC, Canada 
## 3 Jeff L.                jtleek       1571674 Baltimore,MD          
## 4 Julia Silge            juliasilge  12505835 Salt Lake City, UT    
## 5 Thomas J. Leeper       leeper       3505428 London, United Kingdom
## 6 Maëlle Salmon          masalmon     8360597 Barcelona, Spain

Notice how the variables have been automatically type converted. It’s a beautiful thing. Until it’s not. When programming, it is safer, but more cumbersome, to explicitly specify type and build your data frame the usual way.

gh_users %>% {
  tibble(
       login = map_chr(., "login"),
        name = map_chr(., "name"),
          id = map_int(., "id"),
    location = map_chr(., "location")
  )
}
## # A tibble: 6 x 4
##   login       name                         id location              
##   <chr>       <chr>                     <int> <chr>                 
## 1 gaborcsardi Gábor Csárdi             660288 Chippenham, UK        
## 2 jennybc     Jennifer (Jenny) Bryan   599454 Vancouver, BC, Canada 
## 3 jtleek      Jeff L.                 1571674 Baltimore,MD          
## 4 juliasilge  Julia Silge            12505835 Salt Lake City, UT    
## 5 leeper      Thomas J. Leeper        3505428 London, United Kingdom
## 6 masalmon    Maëlle Salmon           8360597 Barcelona, Spain

3. Single mapping

3.1 List operation inside a data frame

listviewer::jsonedit(gh_repos, mode = "view")
# prepare data
# c(1, 4, 1) drills down per user: first repo -> 4th element (owner) -> 1st element (login)
unames <- map_chr(gh_repos, c(1, 4, 1))
udf <- gh_repos %>%
    set_names(unames) %>% 
    enframe("username", "gh_repos")
udf
## # A tibble: 6 x 2
##   username    gh_repos   
##   <chr>       <list>     
## 1 gaborcsardi <list [30]>
## 2 jennybc     <list [30]>
## 3 jtleek      <list [30]>
## 4 juliasilge  <list [26]>
## 5 leeper      <list [30]>
## 6 masalmon    <list [30]>

3.1.1 List-column operation inside a data frame

This shows that we know how to operate on a list-column inside a tibble:

udf %>% mutate(n_repos = map_int(gh_repos, length))
## # A tibble: 6 x 3
##   username    gh_repos    n_repos
##   <chr>       <list>        <int>
## 1 gaborcsardi <list [30]>      30
## 2 jennybc     <list [30]>      30
## 3 jtleek      <list [30]>      30
## 4 juliasilge  <list [26]>      26
## 5 leeper      <list [30]>      30
## 6 masalmon    <list [30]>      30

3.1.2 List-list operation inside a data frame

The data frame udf has 6 rows, one per user, and each row holds that user's list of repositories in gh_repos. For a single user, we extract a few fields from every repository:

one_user <- udf$gh_repos[[1]]
map_df(one_user, `[`, c("name", "fork", "open_issues"))
## # A tibble: 30 x 3
##    name        fork  open_issues
##    <chr>       <lgl>       <int>
##  1 after       FALSE           0
##  2 argufy      FALSE           6
##  3 ask         FALSE           4
##  4 baseimports FALSE           0
##  5 citest      TRUE            0
##  6 clisymbols  FALSE           0
##  7 cmaker      TRUE            0
##  8 cmark       TRUE            0
##  9 conditions  TRUE            0
## 10 crayon      FALSE           7
## # … with 20 more rows

To apply the above single-user operation to every row of the data frame, we call map() inside mutate(), with map_df() nested inside that map():

  • the outer map() iterates over the gh_repos list-column of udf, i.e. over users.
  • the inner map_df() iterates over one user's repositories and extracts name, fork and open_issues from each, giving one row per repository.
udf %>% 
  mutate(repo_info = gh_repos %>%
           map(. %>% map_df(`[`, c("name", "fork", "open_issues"))))
## # A tibble: 6 x 3
##   username    gh_repos    repo_info        
##   <chr>       <list>      <list>           
## 1 gaborcsardi <list [30]> <tibble [30 × 3]>
## 2 jennybc     <list [30]> <tibble [30 × 3]>
## 3 jtleek      <list [30]> <tibble [30 × 3]>
## 4 juliasilge  <list [26]> <tibble [26 × 3]>
## 5 leeper      <list [30]> <tibble [30 × 3]>
## 6 masalmon    <list [30]> <tibble [30 × 3]>
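
The nesting is easier to follow on a tiny example. The sketch below assumes a made-up list toy with two users and uses map_chr() for the inner step, so it illustrates the pattern rather than reproducing the code above.

```r
library(purrr)
library(dplyr)
library(tibble)

# toy data, made up for illustration: two users with 2 and 1 repositories
toy <- list(
  alice = list(list(name = "r1", fork = FALSE), list(name = "r2", fork = TRUE)),
  bob   = list(list(name = "r3", fork = FALSE))
)

df <- enframe(toy, "username", "repos")

out <- df %>%
  mutate(
    # outer level only: one number per user
    n_repos = map_int(repos, length),
    # outer map() over users, inner map_chr() over that user's repositories
    repo_names = map(repos, ~ map_chr(.x, "name"))
  )

out$repo_names
```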

3.2 Function specification

We demonstrate three more ways to specify general .f:

  • an existing function: map(aliases, paste, collapse = " | ")
  • a conventional anonymous function, defined on-the-fly, as usual: map(aliases, function(x) paste(x, collapse = " | "))
  • an anonymous function written as a formula: this is unique to purrr and provides a very concise way to define an anonymous function: map(aliases, ~ paste(.x, collapse = " | "))
# prepare data
aliases <- set_names(map(got_chars, "aliases"), map_chr(got_chars, "name"))
aliases <- aliases[c("Theon Greyjoy", "Asha Greyjoy", "Brienne of Tarth")]
aliases
## $`Theon Greyjoy`
## [1] "Prince of Fools" "Theon Turncloak" "Reek"            "Theon Kinslayer"
## 
## $`Asha Greyjoy`
## [1] "Esgred"                "The Kraken's Daughter"
## 
## $`Brienne of Tarth`
## [1] "The Maid of Tarth"  "Brienne the Beauty" "Brienne the Blue"

3.2.1 Existing function

Use a pre-existing function. Or, as here, define one ourselves, which gives a nice way to build in our specification for the collapse argument.

my_fun <- function(x) paste(x, collapse = " | ")
map(aliases, my_fun)
## $`Theon Greyjoy`
## [1] "Prince of Fools | Theon Turncloak | Reek | Theon Kinslayer"
## 
## $`Asha Greyjoy`
## [1] "Esgred | The Kraken's Daughter"
## 
## $`Brienne of Tarth`
## [1] "The Maid of Tarth | Brienne the Beauty | Brienne the Blue"

3.2.2 Anonymous function, conventional

Define an anonymous function on-the-fly, in the conventional way. Here we put our desired value for the collapse argument into the function definition itself.

map(aliases, function(x) paste(x, collapse = " | ")) 
## $`Theon Greyjoy`
## [1] "Prince of Fools | Theon Turncloak | Reek | Theon Kinslayer"
## 
## $`Asha Greyjoy`
## [1] "Esgred | The Kraken's Daughter"
## 
## $`Brienne of Tarth`
## [1] "The Maid of Tarth | Brienne the Beauty | Brienne the Blue"

Alternatively, you can simply name the function and provide collapse via the ... argument.

map(aliases, paste, collapse = " | ")
## $`Theon Greyjoy`
## [1] "Prince of Fools | Theon Turncloak | Reek | Theon Kinslayer"
## 
## $`Asha Greyjoy`
## [1] "Esgred | The Kraken's Daughter"
## 
## $`Brienne of Tarth`
## [1] "The Maid of Tarth | Brienne the Beauty | Brienne the Blue"

3.2.3 Anonymous function, formula

\(purrr\) provides a very concise way to define an anonymous function: as a formula. This should start with the ~ symbol and then look like a typical top-level expression, as you might write in a script. Use .x to refer to the input, i.e. an individual element of the primary vector or list.

map(aliases, ~ paste(.x, collapse = " | "))
## $`Theon Greyjoy`
## [1] "Prince of Fools | Theon Turncloak | Reek | Theon Kinslayer"
## 
## $`Asha Greyjoy`
## [1] "Esgred | The Kraken's Daughter"
## 
## $`Brienne of Tarth`
## [1] "The Maid of Tarth | Brienne the Beauty | Brienne the Blue"
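
As a quick check that the ways of specifying .f shown in this section are interchangeable, the sketch below compares them on a made-up list x. The helper name collapse_bar is hypothetical.

```r
library(purrr)

x <- list(a = c("p", "q"), b = c("r", "s", "t")) # toy input

collapse_bar <- function(v) paste(v, collapse = " | ") # hypothetical helper

r1 <- map_chr(x, collapse_bar)                            # named function
r2 <- map_chr(x, function(v) paste(v, collapse = " | "))  # anonymous function
r3 <- map_chr(x, paste, collapse = " | ")                 # extra argument via ...
r4 <- map_chr(x, ~ paste(.x, collapse = " | "))           # formula shorthand

identical(r1, r2) && identical(r2, r3) && identical(r3, r4)
```

Which form to use is mostly a matter of readability: the formula is the shortest, while a named function is easier to reuse and test.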

3.3 List to data frame

The tibble::enframe() function takes a named vector and promotes the names to a proper variable.

# using a formula anonymous function
map_chr(aliases, ~ paste(.x, collapse = " | ")) %>% 
  tibble::enframe(value = "aliases")
## # A tibble: 3 x 2
##   name             aliases                                                   
##   <chr>            <chr>                                                     
## 1 Theon Greyjoy    Prince of Fools | Theon Turncloak | Reek | Theon Kinslayer
## 2 Asha Greyjoy     Esgred | The Kraken's Daughter                            
## 3 Brienne of Tarth The Maid of Tarth | Brienne the Beauty | Brienne the Blue

4. Parallel mapping

map2() and pmap() have all the type-specific friends you would expect: map2_chr(), map2_lgl(), etc.

# prepare data
nms <- got_chars %>% 
  map_chr("name")
birth <- got_chars %>% 
  map_chr("born")

4.1 Map a function over two vectors or lists in parallel

map2(.x, .y, .f, ...)
map2(INPUT_ONE, INPUT_TWO, FUNCTION_TO_APPLY, OPTIONAL_OTHER_STUFF)
map2_chr(nms, birth, function(x, y) paste(x, "was born", y)) %>% head()
## [1] "Theon Greyjoy was born In 278 AC or 279 AC, at Pyke"    
## [2] "Tyrion Lannister was born In 273 AC, at Casterly Rock"  
## [3] "Victarion Greyjoy was born In 268 AC or before, at Pyke"
## [4] "Will was born "                                         
## [5] "Areo Hotah was born In 257 AC or before, at Norvos"     
## [6] "Chett was born At Hag's Mire"
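
The formula shorthand also works with map2(), where .x and .y refer to the current elements of the first and second input. A minimal sketch with made-up vectors (people, years):

```r
library(purrr)

# toy inputs, made up for illustration
people <- c("Ann", "Bob")
years <- c("1990", "1985")

out <- map2_chr(people, years, ~ paste(.x, "was born in", .y))
out

# a length-1 second input is recycled against the first
recycled <- map2_chr(people, "unknown", ~ paste0(.x, ": ", .y))
recycled
```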

4.2 Map a function over two or more vectors or lists in parallel

pmap(.l, .f, ...)
pmap(LIST_OF_INPUT_LISTS, FUNCTION_TO_APPLY, OPTIONAL_OTHER_STUFF)
df <- got_chars %>% {
  tibble::tibble(
    name = map_chr(., "name"),
    aliases = map(., "aliases"),
    allegiances = map(., "allegiances")
  )
}
my_fun <- function(name, aliases, allegiances) {
  paste(name, "has", length(aliases), "aliases and",
        length(allegiances), "allegiances")
}
df %>% 
  pmap_chr(my_fun) %>% 
  tail()
## [1] "Kevan Lannister has 1 aliases and 1 allegiances"
## [2] "Melisandre has 5 aliases and 0 allegiances"     
## [3] "Merrett Frey has 1 aliases and 1 allegiances"   
## [4] "Quentyn Martell has 4 aliases and 1 allegiances"
## [5] "Samwell Tarly has 7 aliases and 1 allegiances"  
## [6] "Sansa Stark has 3 aliases and 2 allegiances"
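
In the example above, pmap_chr() matches the columns of df to the arguments of my_fun by name. If the data frame has columns the function does not use, one option is to let the function absorb them with .... The sketch below assumes a made-up tibble toy_df and a hypothetical function describe.

```r
library(purrr)
library(tibble)

# toy data, made up for illustration; `unused` is not needed by the function
toy_df <- tibble(
  name = c("Ann", "Bob"),
  tags = list(c("x", "y"), character(0)),
  unused = c(1, 2)
)

# arguments are matched to columns by name; `...` swallows the extra column
describe <- function(name, tags, ...) {
  paste(name, "has", length(tags), "tags")
}

out <- pmap_chr(toy_df, describe)
out
```

Without the ... in describe(), the call would fail because the unused column is passed as an unexpected argument.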