This is a study note on using the `plyr` package for data visualisation. For more details on the study material, see https://www.jstatsoft.org/article/view/v040i01.
# essential
library(plyr)
library(ggplot2)
library(stats)
Many data analysis problems involve the application of a split-apply-combine strategy, where you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together. This insight gives rise to a new R package that allows you to smoothly apply this strategy, without having to worry about the type of structure in which your data is stored.
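To make the strategy concrete, here is a minimal sketch that runs the three steps by hand in base R and then in a single plyr call. The data frame `scores` and its columns are invented for illustration and are not part of the study material.

```r
library(plyr)

# Hypothetical toy data, invented for illustration
scores <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 3, 2, 4, 6)
)

# Base R: split, apply and combine as three explicit steps
pieces  <- split(scores, scores$group)                  # split
results <- lapply(pieces, function(df) {                # apply
  data.frame(group = df$group[1], mean_value = mean(df$value))
})
base_out <- do.call(rbind, results)                     # combine

# plyr: the same three steps in one call
plyr_out <- ddply(scores, .(group), summarise, mean_value = mean(value))
```

Both results hold one row per group; the plyr version simply hides the bookkeeping.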
plyr makes the strong assumption that each piece of data will be processed only once and independently of all other pieces. This means that you cannot use these tools when each iteration requires overlapping data (like a running mean), or when it depends on the previous iteration (like in a dynamic simulation). Loops are still most appropriate for these tasks.
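As a small illustration of a task that does not fit this assumption, the sketch below computes a running mean with a plain loop. The vector `v` and the window size are arbitrary values chosen for the example.

```r
# Running mean over a window of 3 values.
# Neighbouring windows share elements, so the pieces are not independent
# and a loop is the natural tool here.
v <- c(2, 4, 6, 8, 10)   # arbitrary example values
window <- 3

running <- numeric(length(v) - window + 1)
for (i in seq_along(running)) {
  running[i] <- mean(v[i:(i + window - 1)])
}
```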
To understand and manipulate data with `plyr`, we need to know the differences between three basic data structures in R: the matrix, the data frame, and the list.
A matrix is a special kind of vector: a vector with a dimension attribute that records the number of rows and the number of columns.
x <- matrix(c(1,2,3,4), nrow=2, ncol=2)
print(x)
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
A data frame is used for storing data tables. It is a list of vectors of equal length.
name <- c("Mike", "Lucy", "John")
age <- c(20, 25, 30)
student <- c(TRUE, FALSE, TRUE)
df <- data.frame(name, age, student)
print(df)
## name age student
## 1 Mike 20 TRUE
## 2 Lucy 25 FALSE
## 3 John 30 TRUE
A list can contain elements of different types.
y <- list(name="Mike", gender="M", company="ProgramCreek")
print(y)
## $name
## [1] "Mike"
##
## $gender
## [1] "M"
##
## $company
## [1] "ProgramCreek"
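The differences between the three structures can be checked directly. The following is a minimal sketch using base R only; the object names are made up for the example.

```r
m   <- matrix(1:4, nrow = 2)                   # matrix
dat <- data.frame(x = 1:2, y = c("a", "b"))    # data frame
lst <- list(n = 1, s = "a")                    # list

dim(m)                # a matrix carries a dim attribute
is.list(dat)          # a data frame is a list of columns
sapply(dat, length)   # every column has the same length
sapply(lst, class)    # list elements may have different types
```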
The 12 main functions are named according to the type of input they accept and the type of output they produce:
- `a`: array
- `d`: data frame
- `l`: list
- `_`: output is discarded

We use the notation `a*ply` for functions with common input (a complete row of the table below) and `*aply` for functions with common output (a column of the table below).
| Input / Output | Array (`*aply`) | Data frame | List | Discarded |
|---|---|---|---|---|
| Array (`a*ply`) | aaply | adply | alply | a_ply |
| Data frame | daply | ddply | dlply | d_ply |
| List | laply | ldply | llply | l_ply |
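The naming scheme can be seen by feeding the same list to the four `l*ply` functions. This is a minimal sketch; the list `pieces` is a made-up input.

```r
library(plyr)

# Hypothetical input: a named list of numeric vectors
pieces <- list(a = 1:3, b = 4:6)

as_array <- laply(pieces, mean)   # list in, array (here a plain vector) out
as_df    <- ldply(pieces, function(v) data.frame(mean = mean(v)))  # list in, data frame out
as_list  <- llply(pieces, mean)   # list in, list out
l_ply(pieces, mean)               # list in, output discarded
```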

| Output | Processing function restrictions | Null output |
|---|---|---|
| `*aply` | atomic array, or list | `vector()` |
| `*dply` | data frame, or atomic vector | `data.frame()` |
| `*lply` | none | `list()` |
| `*_ply` | none | - |

| Base function | Input | Output | plyr function |
|---|---|---|---|
| aggregate | d | d | ddply + colwise |
| apply | a | a/l | aaply / alply |
| by | d | l | dlply |
| lapply | l | l | llply |
| mapply | a | a/l | maply / mlply |
| replicate | r | a/l | raply / rlply |
| sapply | l | a | laply |
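As an illustration of the first row of this table, the sketch below computes per-group column means once with `aggregate()` and once with `ddply()` plus `colwise()`. The data frame `dat` is invented for the example.

```r
library(plyr)

# Hypothetical toy data
dat <- data.frame(
  g = c("a", "a", "b"),
  x = c(1, 3, 5),
  y = c(2, 4, 6)
)

# Base R: group means of x and y
base_out <- aggregate(cbind(x, y) ~ g, data = dat, FUN = mean)

# plyr: colwise() turns mean() into a function that works column by column
plyr_out <- ddply(dat, .(g), colwise(mean))
```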
Each type of input has different rules for how to split it up, and these rules are described in detail in the following sections. In short:
- `a*ply()`: Arrays are sliced by dimension into lower-dimensional pieces.
- `d*ply()`: Data frames are subsetted by combinations of variables.
- `l*ply()`: Each element in a list is a piece.

The functions have either two or three main arguments, depending on the type of input:
- `.data` is the input, which will be split up, processed and recombined.
- `.variables` or `.margins` describes how to split up the input into pieces.
- `.fun` is the processing function, and is applied to each piece in turn.

Further arguments given in `...` are passed on to the processing function, and the `.progress` argument controls display of a progress bar.
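The sketch below shows how these arguments fit together in one call. The list `values` is a made-up input; `na.rm = TRUE` is an ordinary argument of `mean()` that travels through `...`.

```r
library(plyr)

# Hypothetical input with missing values
values <- list(a = c(1, NA, 3), b = c(4, 5, NA))

# .data = values, .fun = mean; na.rm = TRUE is forwarded to mean(),
# and .progress = "text" prints a text progress bar while the pieces run
means <- llply(values, mean, na.rm = TRUE, .progress = "text")
```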
a*ply(.data, .margins, .fun, ..., .progress = "none")
- `.margins = 1`: Slice up into rows.
- `.margins = 2`: Slice up into columns.
- `.margins = c(1,2)`: Slice up into individual cells.

shape <- function(x) if (is.vector(x)) length(x) else dim(x)
x <- array(1:24, 2:4)
print(x)
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
##
## , , 3
##
## [,1] [,2] [,3]
## [1,] 13 15 17
## [2,] 14 16 18
##
## , , 4
##
## [,1] [,2] [,3]
## [1,] 19 21 23
## [2,] 20 22 24
shape(x)
## [1] 2 3 4
s.byx <- aaply(x, 1, function(y) 0)
s.byy <- aaply(x, 2, function(y) 0)
s.byz <- aaply(x, 3, function(y) 0)
shape(s.byx)
## [1] 2
shape(s.byy)
## [1] 3
shape(s.byz)
## [1] 4
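The example above uses a constant function to show only the shapes. A minimal sketch with a real summary function, compared against base `apply()`, might look like this (the array is rebuilt so the block runs on its own):

```r
library(plyr)

x <- array(1:24, 2:4)

row_sums  <- aaply(x, 1, sum)         # one sum per slice along dimension 1
cell_sums <- aaply(x, c(1, 2), sum)   # one sum per (row, column) pair

# base R apply() computes the same numbers
base_row_sums  <- apply(x, 1, sum)
base_cell_sums <- apply(x, c(1, 2), sum)
```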
m*ply()
A special case of operating on arrays corresponds to the mapply function of base R. mapply seems rather different at first glance: it accepts multiple inputs as separate arguments, whereas a*ply takes a single array argument. However, the separate arguments to mapply() must have the same length, so conceptually it is the same underlying data structure. m*ply takes a matrix, list-array, or data frame, splits it up by rows and calls the processing function, supplying each piece as its parameters.
# generate a grid of parameter values and then evaluate them
df.expand <- expand.grid(mean = 1:5, sd = 1:5)
sample.test <- mdply(df.expand, as.data.frame(rnorm), n = 10)
head(df.expand)
## mean sd
## 1 1 1
## 2 2 1
## 3 3 1
## 4 4 1
## 5 5 1
## 6 1 2
dim(df.expand)
## [1] 25 2
# mean * sd * n = 5 * 5 * 10 = 250
nrow(sample.test)
## [1] 250
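To see how the columns of the input become the parameters of the processing function, here is a minimal deterministic sketch. The helper `power` and the grid are invented for illustration; the argument names of the helper have to match the column names of the grid.

```r
library(plyr)

grid <- expand.grid(base = 1:3, exponent = 1:2)

# Hypothetical helper: its arguments are named after the columns of grid
power <- function(base, exponent) base^exponent

as_df    <- mdply(grid, power)   # data frame: the input columns plus the result
as_array <- maply(grid, power)   # array with one dimension per input column
```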
To split up a data frame into groups based on combinations of variables in the data set, you need to specify which variables (or functions of variables) to use in `d*ply`.
d*ply(.data, .variables, .fun, ..., .progress = "none")
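- `.(var1)` will split the data frame into groups defined by the value of the `var1` variable.
- With multiple variables, for example `.(a, b, c)`, the groups will be formed by the interaction of the variables, and output will be labelled with all three variables.
- The variables can also be given as a character vector of column names, for example `c("var1", "var2")`.

When the output is a list (`dlply`), it carries a `split_labels` attribute which contains the splits as a data frame.

# keep only player seasons with at least 25 at bats
baseball <- subset(baseball, ab >= 25)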
baseball <- ddply(baseball, .(id), transform, cyear = year - min(year) + 1)
# fitting a linear model to each player
model <- function(df) {
lm(rbi / ab ~ cyear, data = df)
}
bmodels <- dlply(baseball, .(id), model)
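The list returned by `dlply` can be inspected like any other list. The sketch below repeats the steps above under different object names (`bb`, `models`) so that it runs on its own, then looks at the labels and at one fitted model.

```r
library(plyr)

# Same steps as above, on the baseball data shipped with plyr
bb <- subset(baseball, ab >= 25)
bb <- ddply(bb, .(id), transform, cyear = year - min(year) + 1)
models <- dlply(bb, .(id), function(df) lm(rbi / ab ~ cyear, data = df))

length(models)                       # one fitted model per player
head(attr(models, "split_labels"))   # data frame holding the id of each piece
coef(models[["aaronha01"]])          # elements are named by the split label
```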
Lists are the simplest type of input to deal with because they are already naturally divided into pieces: The elements of the list. For this reason, the l*ply functions do not need an argument that describes how to break up the data structure.
l*ply(.data, .fun, ..., .progress = "none")
# intercept, slope and R2 for each model, one model per player
rsq <- function(x) summary(x)$r.squared
bcoefs <- ldply(bmodels, function(x) c(coef(x), rsquare = rsq(x)))
names(bcoefs)[2:3] <- c("intercept", "slope")
head(bcoefs)
## id intercept slope rsquare
## 1 aaronha01 0.18329371 0.0001478121 0.000862425
## 2 abernte02 0.00000000 NA 0.000000000
## 3 adairje01 0.08599261 -0.0007118756 0.010230121
## 4 adamsba01 0.06265402 0.0012002168 0.030184694
## 5 adamsbo03 0.08867684 -0.0019238835 0.108372596
## 6 adcocjo01 0.14564821 0.0027382939 0.229040266
subset(bcoefs, rsquare > 0.999)$id
## [1] "bannifl01" "bedrost01" "burbada01" "carrocl02" "cookde01" "davisma01"
## [7] "jacksgr01" "lindbpa01" "oliveda02" "penaal01" "powerte01" "splitpa01"
## [13] "violafr01" "wakefti01" "weathda01" "woodwi01"
r*ply()
A special case of operating on lists corresponds to replicate() in base R, and is useful for drawing distributions of random numbers. It is a little different from the other plyr functions: instead of the .data argument it has .n, the number of replications to run, and instead of a function it accepts an expression, which is evaluated afresh for each replication.
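A minimal sketch of `raply()` next to its base R counterpart follows. The seed, the number of replications and the sample size are arbitrary choices for the example.

```r
library(plyr)

set.seed(1)  # arbitrary seed so the sketch is reproducible

# 100 replications; the expression is evaluated afresh each time
means <- raply(100, mean(rnorm(20)))

# base R equivalent
means_base <- replicate(100, mean(rnorm(20)))
```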
The output type defines how the pieces will be joined back together and how they will be labelled.
With array output the shape of the output array is determined by the input splits and the dimensionality of each individual result.
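The following sketch illustrates how the two ingredients combine: the array is split along its third dimension into four pieces, and each piece returns a vector of length two. The array is the same one used earlier, rebuilt here so the block is self-contained.

```r
library(plyr)

x <- array(1:24, 2:4)

# 4 pieces (dimension 3), each returning a length-2 vector (min and max):
# the split dimension comes first, followed by the dimension of each result
out <- aaply(x, 3, function(piece) range(piece))
dim(out)
```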
When the output is a data frame, it will contain the results as well as additional label columns. These columns make it possible to merge the old and new data if required.
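A small sketch of the label columns in use: the summary keeps the grouping column, which then serves as the key for merging back onto the original rows. The data frame `dat` is invented for the example.

```r
library(plyr)

# Hypothetical toy data
dat <- data.frame(g = c("a", "a", "b"), v = c(1, 3, 10))

# The result carries the label column g alongside the computed value
group_means <- ddply(dat, .(g), summarise, mean_v = mean(v))

# The label column is the key for joining the summary back to the old data
merged <- merge(dat, group_means, by = "g")
```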
List output is the simplest format: each processed piece is joined together in a list. The list also stores the labels associated with each piece. `llply` is convenient for calculating complex objects once (for example, models) and later extracting the pieces of interest from them.
Sometimes it is convenient to operate on a list purely for the side effects. In this case `*_ply` is a little more efficient than abandoning the output of `*lply` because it does not store the intermediate results. The `*_ply` functions have one additional argument:
- `.print`: controls whether or not each result should be printed.

d_ply(.data, .variables, failwith(NA, .fun), .print = TRUE)
# save a plot for every player to a pdf
xlim <- range(baseball$cyear, na.rm=TRUE)
ylim <- range(baseball$rbi / baseball$ab, na.rm=TRUE)
plotpattern <- function(df) {
qplot(cyear, rbi / ab, data = df, geom = "line",
xlim = xlim, ylim = ylim)
}
pdf("paths.pdf", width = 8, height = 4)
d_ply(baseball, .(reorder(id, rbi / ab)), failwith(NA, plotpattern), .print = TRUE)
dev.off()
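For a lighter-weight look at the same idea, the sketch below calls `d_ply()` purely for a printing side effect and confirms that nothing is collected. The data frame `dat` is made up for the example.

```r
library(plyr)

# Hypothetical toy data
dat <- data.frame(g = c("a", "a", "b"), v = c(1, 3, 10))

# The function is called only for its side effect (a line printed per group)
res <- d_ply(dat, .(g), function(piece) {
  cat(as.character(piece$g[1]), ": ", nrow(piece), " rows\n", sep = "")
})

is.null(res)   # the results are discarded rather than stored
```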