Use of this document

This is a study note on using the plyr package for split-apply-combine data analysis. For more details on the study material, see https://www.jstatsoft.org/article/view/v040i01.

Prerequisites

# essential
library(plyr)
library(ggplot2)
library(stats)

1. Introduction

Many data analysis problems involve the application of a split-apply-combine strategy, where you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together. This insight gives rise to a new R package that allows you to smoothly apply this strategy, without having to worry about the type of structure in which your data is stored.
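
As a rough illustration of the strategy, the sketch below performs the three steps by hand in base R and then in a single plyr call. The data frame scores and its columns are made up for this note.

```r
library(plyr)

# Hypothetical toy data: two groups with three values each
scores <- data.frame(
  group = c("a", "a", "a", "b", "b", "b"),
  value = c(1, 2, 3, 10, 20, 30)
)

# Split: one data frame per group
pieces <- split(scores, scores$group)

# Apply: summarise each piece independently
results <- lapply(pieces, function(piece) {
  data.frame(group = piece$group[1], mean_value = mean(piece$value))
})

# Combine: stack the per-group results back together
by_hand <- do.call(rbind, results)

# The same three steps in a single plyr call
with_plyr <- ddply(scores, .(group), summarise, mean_value = mean(value))
```

Both by_hand and with_plyr hold one row per group with the group mean; the plyr version also takes care of the labelling.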

1.1 Assumption

plyr makes the strong assumption that each piece of data will be processed only once and independently of all other pieces. This means that you cannot use these tools when each iteration requires overlapping data (like a running mean), or when it depends on the previous iteration (as in a dynamic simulation). Loops are still most appropriate for these tasks.
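
For instance, a running mean reuses values across neighbouring windows, so the pieces overlap. A minimal base R sketch with a plain loop (x and window are made-up values for illustration):

```r
# Made-up input and window width
x <- c(2, 4, 6, 8, 10)
window <- 3

# Each window shares two values with the previous one, so the pieces
# are not independent and a loop is the natural tool
running_mean <- rep(NA_real_, length(x))
for (i in window:length(x)) {
  running_mean[i] <- mean(x[(i - window + 1):i])
}
print(running_mean)
```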

1.2 Data types

To understand and manipulate data with plyr, we need to know the differences between the three basic data structures in R: arrays, data frames, and lists.

1.2.1 Array

An array is a vector with an additional dim attribute that gives its dimensions. A matrix is the two-dimensional special case: a vector with a number of rows and a number of columns.

x <- matrix(c(1,2,3,4), nrow=2, ncol=2)
print(x)
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

1.2.2 Data Frame

A data frame is used for storing data tables. It is a list of vectors of equal length.

name <- c("Mike", "Lucy", "John") 
age <- c(20, 25, 30) 
student <- c(TRUE, FALSE, TRUE) 
df <- data.frame(name, age, student)  
print(df)
##   name age student
## 1 Mike  20    TRUE
## 2 Lucy  25   FALSE
## 3 John  30    TRUE

1.2.3 List

A list can contain elements of different types.

y <- list(name="Mike", gender="M", company="ProgramCreek")
print(y)
## $name
## [1] "Mike"
## 
## $gender
## [1] "M"
## 
## $company
## [1] "ProgramCreek"

The 12 main functions are named according to the type of input they accept and the type of output they produce. The first letter gives the input type and the second letter gives the output type:

  • a: array,
  • d: data frame,
  • l: list,
  • _: output is discarded.

We use the notation a*ply for functions with common input, a complete row of Table 2, and *aply for functions with common output, a column of Table 2.

1.3 The 12 key functions (Table 2)

Input \ Output   Array   Data frame   List    Discarded
Array            aaply   adply        alply   a_ply
Data frame       daply   ddply        dlply   d_ply
List             laply   ldply        llply   l_ply
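
To see one row of the table in action, the sketch below feeds the same data frame to the four d*ply functions. The data frame toy is made up for this note, and the processing function is just a group mean.

```r
library(plyr)

# Hypothetical toy data: group x has values 1 and 3, group y has value 5
toy <- data.frame(g = c("x", "x", "y"), v = c(1, 3, 5))

# Same input and same split, four different output types
as_array <- daply(toy, .(g), function(piece) mean(piece$v))
as_frame <- ddply(toy, .(g), function(piece) data.frame(mean_v = mean(piece$v)))
as_list  <- dlply(toy, .(g), function(piece) mean(piece$v))

# d_ply discards the result, so it is only useful for side effects
d_ply(toy, .(g), function(piece) cat(piece$g[1], mean(piece$v), "\n"))
```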

1.4 Summary of processing function restrictions and null output values for all output types (Table 3)

Output   Processing function restrictions   Null output
*aply    atomic array, or list              vector()
*dply    data frame, or atomic vector       data.frame()
*lply    none                               list()
*_ply    none                               -

1.5 Mapping between apply functions and plyr functions (Table 5)

Base function   Input   Output   plyr function
aggregate       d       d        ddply + colwise
apply           a       a/l      aaply / alply
by              d       l        dlply
lapply          l       l        llply
mapply          a       a/l      maply / mlply
replicate       r       a/l      raply / rlply
sapply          l       a        laply
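
A few rows of the table can be checked side by side. This is a minimal sketch on made-up inputs (values and m are illustrative names); it compares the computed values only, since the two families may label their output differently.

```r
library(plyr)

# Made-up inputs for illustration
values <- list(a = 1:3, b = 4:6)
m <- matrix(1:6, nrow = 2)

# sapply <-> laply: list in, array (here a vector) out
via_sapply <- sapply(values, sum)
via_laply  <- laply(values, sum)

# lapply <-> llply: list in, list out
via_lapply <- lapply(values, sum)
via_llply  <- llply(values, sum)

# apply <-> aaply: array in, array out (row sums here)
via_apply <- apply(m, 1, sum)
via_aaply <- aaply(m, 1, sum)
```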

2. Input

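Each type of input has different rules for how to split it up, and these rules are described in detail in the following sections. In short:

  • Arrays are sliced by dimension into lower-dimensional pieces.
  • Data frames are subsetted by combinations of variables.
  • Each element of a list is a piece.

The functions have either two or three main arguments, depending on the type of input: .data is the input to split, process and recombine; .margins (arrays) or .variables (data frames) describes how to split it; and .fun is the processing function applied to each piece. Lists need no splitting argument.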

2.1 Input: Array

  • a*ply(.data, .margins, .fun, ..., .progress = "none")
    • .margins = 1: Slice up into rows.
    • .margins = 2: Slice up into columns.
    • .margins = c(1,2): Slice up into individual cells.

# helper: length of a vector, or dimensions of an array
shape <- function(x) if (is.vector(x)) length(x) else dim(x)
x <- array(1:24, 2:4)
print(x)
## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12
## 
## , , 3
## 
##      [,1] [,2] [,3]
## [1,]   13   15   17
## [2,]   14   16   18
## 
## , , 4
## 
##      [,1] [,2] [,3]
## [1,]   19   21   23
## [2,]   20   22   24
shape(x)
## [1] 2 3 4
s.byx <- aaply(x, 1, function(y) 0)
s.byy <- aaply(x, 2, function(y) 0)
s.byz <- aaply(x, 3, function(y) 0)
shape(s.byx)
## [1] 2
shape(s.byy)
## [1] 3
shape(s.byz)
## [1] 4
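
The example above only slices along one dimension at a time. As a small sketch of .margins = c(1, 2), the code below sums over the third dimension for every (row, column) cell of the same array and compares the result with base apply():

```r
library(plyr)

x <- array(1:24, 2:4)

# One piece per (row, column) combination; each piece is the length-4
# vector x[i, j, ], which is reduced to a single number
cell_sums <- aaply(x, c(1, 2), sum)
dim(cell_sums)

# Base R equivalent, for comparison
base_sums <- apply(x, c(1, 2), sum)
```

The result has one entry per cell of the first two dimensions, so its shape is 2 by 3.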

2.1.1 Special case: m*ply

A special case of operating on arrays corresponds to the mapply() function of base R. mapply() seems rather different at first glance: it accepts multiple inputs as separate arguments, whereas a*ply takes a single array argument. However, the separate arguments to mapply() must have the same length, so conceptually the underlying data structure is the same. m*ply takes a matrix, list-array, or data frame, splits it up by rows, and calls the processing function with each piece supplied as its parameters.

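# generate a grid of parameter values and then evaluate them
df.expand <- expand.grid(mean = 1:5, sd = 1:5)
# as.data.frame(rnorm) relies on plyr's as.data.frame method for functions:
# it wraps rnorm so that each call returns a one-column data frame
sample.test <- mdply(df.expand, as.data.frame(rnorm), n = 10)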
head(df.expand)
##   mean sd
## 1    1  1
## 2    2  1
## 3    3  1
## 4    4  1
## 5    5  1
## 6    1  2
dim(df.expand)
## [1] 25  2
# 5 means * 5 sds = 25 parameter combinations, 10 draws each: 25 * 10 = 250 rows
nrow(sample.test)
## [1] 250

2.2 Input: Data frame

To split up a data frame into groups based on combinations of variables in the data set, you need to specify which variables (or functions of variables) to use in d*ply.

  • d*ply(.data, .variables, .fun, ..., .progress = "none")
    • .(var1) will split the data frame into groups defined by the value of the var1 variable.
    • If you use multiple variables, written as .(a, b, c), c("a", "b", "c"), or ~ a + b + c, the groups will be formed by the interaction of the variables, and the output will be labelled with all three variables.
      • Output as Array: there will be three dimensions whose dimension names will be the values of a, b, and c in .data;
      • Output as Data frame: there will be three extra columns with the values of a, b, and c;
      • Output as List: the element names will be the values of a, b, and c appended together separated by periods, along with a split_labels attribute which contains the splits as a data frame.
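
# keep player-seasons with at least 25 at-bats
baseball <- subset(baseball, ab >= 25)
# add cyear: number of years since each player's first season
baseball <- ddply(baseball, .(id), transform, cyear = year - min(year) + 1)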
# fitting a linear model to each player
model <- function(df) {
  lm(rbi / ab ~ cyear, data = df)
}
bmodels <- dlply(baseball, .(id), model)
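
The different ways of writing .variables can be compared on a small made-up data frame (toy and its columns are illustrative and not part of the baseball data). This sketch assumes, as the plyr documentation describes, that a quoted list, a character vector and a one-sided formula are all accepted and define the same groups:

```r
library(plyr)

# Hypothetical toy data with two splitting variables
toy <- data.frame(
  a = c(1, 1, 2, 2),
  b = c("u", "v", "u", "v"),
  v = 1:4
)

# Three spellings of the same split
by_quoted  <- ddply(toy, .(a, b), summarise, total = sum(v))
by_names   <- ddply(toy, c("a", "b"), summarise, total = sum(v))
by_formula <- ddply(toy, ~ a + b, summarise, total = sum(v))
```

Each result has one row per combination of a and b, with the splitting variables as label columns.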

2.3 Input: List

Lists are the simplest type of input to deal with because they are already naturally divided into pieces: The elements of the list. For this reason, the l*ply functions do not need an argument that describes how to break up the data structure.

  • l*ply(.data, .fun, ..., .progress = "none")

# intercept, slope and R^2 for each model (one model per player)
rsq <- function(x) summary(x)$r.squared
bcoefs <- ldply(bmodels, function(x) c(coef(x), rsquare = rsq(x)))
names(bcoefs)[2:3] <- c("intercept", "slope")
head(bcoefs)
##          id  intercept         slope     rsquare
## 1 aaronha01 0.18329371  0.0001478121 0.000862425
## 2 abernte02 0.00000000            NA 0.000000000
## 3 adairje01 0.08599261 -0.0007118756 0.010230121
## 4 adamsba01 0.06265402  0.0012002168 0.030184694
## 5 adamsbo03 0.08867684 -0.0019238835 0.108372596
## 6 adcocjo01 0.14564821  0.0027382939 0.229040266
subset(bcoefs, rsquare > 0.999)$id
##  [1] "bannifl01" "bedrost01" "burbada01" "carrocl02" "cookde01"  "davisma01"
##  [7] "jacksgr01" "lindbpa01" "oliveda02" "penaal01"  "powerte01" "splitpa01"
## [13] "violafr01" "wakefti01" "weathda01" "woodwi01"

2.3.1 Special case: r*ply

A special case of operating on lists corresponds to replicate() in base R, and is useful for drawing distributions of random numbers. This is a little different from the other plyr functions. Instead of the .data argument, it has .n, the number of replications to run, and instead of a function it accepts an expression, which is evaluated afresh for each replication.
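
A minimal sketch of r*ply, assuming we want the sampling distribution of the mean of 20 uniform draws (the seed and sizes are arbitrary choices for this note):

```r
library(plyr)

set.seed(1)  # arbitrary seed, only to make the run repeatable

# The expression mean(runif(20)) is re-evaluated for each of the 1000 runs
sample_means <- raply(1000, mean(runif(20)))
length(sample_means)

# rlply keeps each replication as a separate list element
draws <- rlply(3, runif(2))
length(draws)
```

Because the expression is evaluated afresh each time, every replication draws new random numbers rather than repeating one fixed value.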

3. Output

The output type defines how the pieces will be joined back together and how they will be labelled.

3.1 Output: Array

With array output the shape of the output array is determined by the input splits and the dimensionality of each individual result.

  • Array input: the dimension labels of the output array will be the same as the dimension labels of the splits.
  • List input: the list is treated like a 1-d array.
  • Data frame input: the output array gets a dimension for each variable in the split, labelled by the values of those variables.
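
For the data frame case, the sketch below splits a made-up data frame (toy, illustrative only) by two variables, so the resulting array should have two dimensions, one per splitting variable:

```r
library(plyr)

# Hypothetical toy data: every combination of a and b appears once
toy <- data.frame(
  a = c("p", "p", "q", "q"),
  b = c("u", "v", "u", "v"),
  v = c(1, 2, 3, 4)
)

# Two splitting variables and a scalar result per piece give a 2-d array
totals <- daply(toy, .(a, b), function(piece) sum(piece$v))
dim(totals)
dimnames(totals)

# Cells can be looked up by the values of the splitting variables
totals["q", "u"]
```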

3.2 Output: Data frame

When the output is a data frame, it will contain the results as well as additional label columns. These columns make it possible to merge the old and new data if required.

  • Array input: a column for each splitting dimension.
  • Data frame input: a column for each splitting variable.
  • List input: a column for the list names (if present).
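
The label columns can be inspected directly. This is a small sketch on made-up inputs (m and the list names first and second are illustrative); the exact name plyr gives the array label column is not asserted here, so the code simply prints the column names.

```r
library(plyr)

# Array input: one label column for the splitting dimension (rows here)
m <- matrix(1:6, nrow = 2)
row_totals <- adply(m, 1, function(row) data.frame(total = sum(row)))
names(row_totals)

# List input: the list names end up in a label column called .id
list_totals <- ldply(
  list(first = 1:3, second = 4:6),
  function(v) data.frame(total = sum(v))
)
names(list_totals)
```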

3.3 Output: List

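This is the simplest output format, where each processed piece is joined together in a list. The list also stores the labels associated with each piece. *lply is convenient for calculating complex objects (for example, fitted models) once, so that the pieces of interest can be extracted later, as with bmodels in Section 2.3.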

3.4 Output: Discarded

Sometimes it is convenient to operate on a list purely for the side effects. In this case *_ply is a little more efficient than abandoning the output of *lply because it does not store the intermediate results. The side effects include:

  • Caching
  • Output to screen/file
  • Plots

The .print argument of *_ply controls whether or not each result should be printed. ggplot2 plots are drawn only when printed, so set it to TRUE when plotting, e.g.
d_ply(.data, .variables, failwith(NA, .fun), .print = TRUE)

# save a plot for every player to a pdf
xlim <- range(baseball$cyear, na.rm=TRUE)
ylim <- range(baseball$rbi / baseball$ab, na.rm=TRUE)
plotpattern <- function(df) {
  qplot(cyear, rbi / ab, data = df, geom = "line",
        xlim = xlim, ylim = ylim)
  }
pdf("paths.pdf", width = 8, height = 4)
d_ply(baseball, .(reorder(id, rbi / ab)), failwith(NA, plotpattern), .print = TRUE)
dev.off()
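
The example above wraps the plotting function in failwith(), so that one failing piece does not stop the whole run. A minimal sketch of that behaviour, together with l_ply used purely for screen output (risky and the input values are made up for this note):

```r
library(plyr)

# Hypothetical function that fails on negative input
risky <- function(x) if (x < 0) stop("negative input") else sqrt(x)

# failwith() returns a wrapped function that yields the default (NA here)
# instead of raising an error; quiet = TRUE suppresses the error message
safe <- failwith(NA, risky, quiet = TRUE)
results <- laply(c(4, -1, 9), safe)
print(results)

# l_ply runs the function for its side effect only and stores nothing
printed <- capture.output(
  l_ply(1:3, function(i) cat("piece", i, "\n"))
)
```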