Title: Tools for Splitting, Applying and Combining Data
Version: 1.8.9
Description: A set of tools that solves a common set of problems: you need to break a big problem down into manageable pieces, operate on each piece and then put all the pieces back together. For example, you might want to fit a model to each spatial location or time point in your study, summarise data by panels or collapse high-dimensional arrays to simpler summary statistics. The development of 'plyr' has been generously supported by 'Becton Dickinson'.
License: MIT + file LICENSE
URL: http://had.co.nz/plyr, https://github.com/hadley/plyr
BugReports: https://github.com/hadley/plyr/issues
Depends: R (≥ 3.1.0)
Imports: Rcpp (≥ 0.11.0)
Suggests: abind, covr, doParallel, foreach, iterators, itertools, tcltk, testthat
LinkingTo: Rcpp
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.2.3
NeedsCompilation: yes
Packaged: 2023-09-27 13:58:04 UTC; hadleywickham
Author: Hadley Wickham [aut, cre]
Maintainer: Hadley Wickham <hadley@rstudio.com>
Repository: CRAN
Date/Publication: 2023-10-02 06:50:08 UTC
plyr: the split-apply-combine paradigm for R.
Description
The plyr package is a set of clean and consistent tools that implement the split-apply-combine pattern in R. This is an extremely common pattern in data analysis: you solve a complex problem by breaking it down into small pieces, doing something to each piece and then combining the results back together again.
Details
The plyr functions are named according to what sort of data structure they split up and what sort of data structure they return:
- a: array
- l: list
- d: data.frame
- m: multiple inputs
- r: repeat multiple times
- _: nothing
So ddply takes a data frame as input and returns a data frame as output, and l_ply takes a list as input and returns nothing as output.
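The naming scheme can be seen in a short sketch. It assumes plyr is installed and uses the built-in mtcars data; the anonymous functions are made up for illustration.

```r
library(plyr)

# d + d: data frame in, data frame out (one row per value of cyl)
per_cyl <- ddply(mtcars, "cyl", function(df) data.frame(mean_mpg = mean(df$mpg)))
per_cyl

# l + l: list in, list out
lens <- llply(list(a = 1:3, b = 1:5), length)
lens

# l + a: list in, array out (here a plain vector)
laply(list(a = 1:3, b = 1:5), length)

# l + _: list in, nothing out; called only for the side effect of printing
l_ply(list(a = 1:3, b = 1:5), function(x) cat(length(x), "\n"))
```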
Row names
By design, no plyr function will preserve row names: in general it is too hard to know what should be done with them for many of the operations supported by plyr. If you want to preserve row names, use name_rows to convert them into an explicit column in your data frame, perform the plyr operations, and then use name_rows again to convert the column back into row names.
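A minimal sketch of that round trip, assuming plyr is installed and using mtcars, whose row names hold the car models:

```r
library(plyr)

# arrange() drops the row names (car models) from mtcars
head(arrange(mtcars, cyl, disp), 3)

# Move the row names into a column, sort, then move them back
with_names <- name_rows(mtcars)           # row names become an explicit column
sorted     <- arrange(with_names, cyl, disp)
restored   <- name_rows(sorted)           # the column becomes row names again
head(restored, 3)
```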
Helpers
Plyr also provides a set of helper functions for common data analysis problems:
- arrange: re-order the rows of a data frame by specifying the columns to order by
- mutate: add new columns or modify existing columns, like transform, but new columns can refer to other columns that you just created
- summarise: like mutate, but creates a new data frame, not preserving any columns in the old data frame
- join: an adaptation of merge which is more similar to SQL, and has a much faster implementation if you only want to find the first match
- match_df: a version of join that, instead of returning the two tables combined together, only returns the rows in the first table that match the second
- colwise: make any function work column-wise on a data frame
- rename: easily rename columns in a data frame
- round_any: round a number to any degree of precision
- count: quickly count unique combinations and return the result as a data frame
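A few of these helpers in use. This is only a sketch, assuming plyr is installed; the data frame `df` is made up for illustration.

```r
library(plyr)

df <- data.frame(a = c(3, 1, 2))

# mutate() evaluates columns in order, so c can use the freshly created b
mutated <- mutate(df, b = a * 2, c = b + 1)
mutated

# summarise() keeps only the columns you ask for
summarise(df, total = sum(a), biggest = max(a))

# rename() takes a named character vector of the form c(old = "new")
renamed <- rename(df, c(a = "value"))
renamed

# round_any() rounds to a multiple of the given accuracy
round_any(137, 10)          # nearest multiple of 10
round_any(137, 10, floor)   # round down instead
```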
Quote variables to create a list of unevaluated expressions for later evaluation.
Description
This function is similar to ~ in that it is used to capture the name of variables, not their current value. This is used throughout plyr to specify the names of variables (or more complicated expressions).
Usage
.(..., .env = parent.frame())
Arguments
... |
unevaluated expressions to be recorded. Specify names if you want to set the names of the resultant variables |
.env |
environment in which unbound symbols in |
Details
Similar tricks can be performed with substitute, but when functions can be called in multiple ways it becomes increasingly tricky to ensure that the values are extracted from the correct frame. Substitute tricks also make it difficult to program against the functions that use them, while the quoted class provides as.quoted.character to convert strings to the appropriate data structure.
Value
list of symbol and language primitives
Examples
.(a, b, c)
.(first = a, second = b, third = c)
.(a ^ 2, b - d, log(c))
as.quoted(~ a + b + c)
as.quoted(a ~ b + c)
as.quoted(c("a", "b", "c"))
# Some examples using ddply - look at the column names
ddply(mtcars, "cyl", each(nrow, ncol))
ddply(mtcars, ~ cyl, each(nrow, ncol))
ddply(mtcars, .(cyl), each(nrow, ncol))
ddply(mtcars, .(log(cyl)), each(nrow, ncol))
ddply(mtcars, .(logcyl = log(cyl)), each(nrow, ncol))
ddply(mtcars, .(vs + am), each(nrow, ncol))
ddply(mtcars, .(vsam = vs + am), each(nrow, ncol))
Subset splits.
Description
Subset splits, ensuring that labels keep matching
Usage
## S3 method for class 'split'
x[i, ...]
Arguments
x |
split object |
i |
index |
... |
unused |
Split array, apply function, and discard results.
Description
For each slice of an array, apply function and discard results
Usage
a_ply(
.data,
.margins,
.fun = NULL,
...,
.expand = TRUE,
.progress = "none",
.inform = FALSE,
.print = FALSE,
.parallel = FALSE,
.paropts = NULL
)
Arguments
.data |
matrix, array or data frame to be processed |
.margins |
a vector giving the subscripts to split up |
.fun |
function to apply to each piece |
... |
other arguments passed on to |
.expand |
if |
.progress |
name of the progress bar to use, see
|
.inform |
produce informative error messages? This is turned off by default because it substantially slows processing speed, but is very useful for debugging |
.print |
automatically print each result? (default: |
.parallel |
if |
.paropts |
a list of additional options passed into
the |
Value
Nothing
Input
This function splits matrices, arrays and data frames by dimensions
Output
All output is discarded. This is useful for functions that you are calling purely for their side effects like displaying plots or saving output.
References
Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. https://www.jstatsoft.org/v40/i01/.
See Also
Other array input: aaply(), adply(), alply()
Other no output: d_ply(), l_ply(), m_ply()
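As a rough illustration of calling a_ply purely for side effects, the sketch below prints one line per slice. It assumes plyr is installed; the array `arr` is made up for illustration.

```r
library(plyr)

# A small 2 x 3 x 4 array, made up for illustration
arr <- array(1:24, dim = c(2, 3, 4))

# Print the sum of each slice along the third dimension;
# whatever the function returns is thrown away
res <- a_ply(arr, 3, function(slice) cat("slice sum:", sum(slice), "\n"))

# Nothing comes back
is.null(res)
```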
Split array, apply function, and return results in an array.
Description
For each slice of an array, apply function, keeping results as an array.
Usage
aaply(
.data,
.margins,
.fun = NULL,
...,
.expand = TRUE,
.progress = "none",
.inform = FALSE,
.drop = TRUE,
.parallel = FALSE,
.paropts = NULL
)
Arguments
.data |
matrix, array or data frame to be processed |
.margins |
a vector giving the subscripts to split up |
.fun |
function to apply to each piece |
... |
other arguments passed on to |
.expand |
if |
.progress |
name of the progress bar to use, see
|
.inform |
produce informative error messages? This is turned off by default because it substantially slows processing speed, but is very useful for debugging |
.drop |
should extra dimensions of length 1 in the output be
dropped, simplifying the output. Defaults to |
.parallel |
if |
.paropts |
a list of additional options passed into
the |
Details
This function is very similar to apply, except that it will always return an array, and when the function returns data structures with more than one dimension, those dimensions are added on to the highest dimensions, rather than the lowest dimensions. This makes aaply idempotent, so that aaply(input, X, identity) is equivalent to aperm(input, X).
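The difference in dimension ordering can be checked on a small array. This is a sketch that assumes plyr is installed; the 2 x 3 x 4 array `x` is made up for illustration.

```r
library(plyr)

x <- array(1:24, dim = c(2, 3, 4))

# Each piece x[i, , ] is a 3 x 4 matrix.
# apply() flattens each piece and puts its values in the first dimension:
dim(apply(x, 1, identity))

# aaply() keeps the split dimension first and appends the piece's dimensions:
res <- aaply(x, 1, identity)
dim(res)
```

Splitting on the first margin and applying identity therefore gives back an array with the same shape and values as the input.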
Value
if results are atomic with same type and dimensionality, a vector, matrix or array; otherwise, a list-array (a list with dimensions)
Warning
Contrary to alply and adply, passing a data frame as first argument to aaply may lead to unexpected results such as huge memory allocations.
Input
This function splits matrices, arrays and data frames by dimensions
Output
If there are no results, then this function will return a vector of length 0 (vector()).
References
Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. https://www.jstatsoft.org/v40/i01/.
See Also
Other array input: a_ply(), adply(), alply()
Other array output: daply(), laply(), maply()
Examples
dim(ozone)
aaply(ozone, 1, mean)
aaply(ozone, 1, mean, .drop = FALSE)
aaply(ozone, 3, mean)
aaply(ozone, c(1,2), mean)
dim(aaply(ozone, c(1,2), mean))
dim(aaply(ozone, c(1,2), mean, .drop = FALSE))
aaply(ozone, 1, each(min, max))
aaply(ozone, 3, each(min, max))
standardise <- function(x) (x - min(x)) / (max(x) - min(x))
aaply(ozone, 3, standardise)
aaply(ozone, 1:2, standardise)
aaply(ozone, 1:2, diff)
Split array, apply function, and return results in a data frame.
Description
For each slice of an array, apply function then combine results into a data frame.
Usage
adply(
.data,
.margins,
.fun = NULL,
...,
.expand = TRUE,
.progress = "none",
.inform = FALSE,
.parallel = FALSE,
.paropts = NULL,
.id = NA
)
Arguments
.data |
matrix, array or data frame to be processed |
.margins |
a vector giving the subscripts to split up |
.fun |
function to apply to each piece |
... |
other arguments passed on to |
.expand |
if |
.progress |
name of the progress bar to use, see
|
.inform |
produce informative error messages? This is turned off by default because it substantially slows processing speed, but is very useful for debugging |
.parallel |
if |
.paropts |
a list of additional options passed into
the |
.id |
name(s) of the index column(s).
Pass |
Value
A data frame, as described in the output section.
Input
This function splits matrices, arrays and data frames by dimensions
Output
The most unambiguous behaviour is achieved when .fun returns a data frame: in that case pieces will be combined with rbind.fill. If .fun returns an atomic vector of fixed length, it will be rbinded together and converted to a data frame. Any other values will result in an error.
If there are no results, then this function will return a data frame with zero rows and columns (data.frame()).
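The two supported return types can be compared on a small matrix. This is a sketch that assumes plyr is installed; the matrix `m` and the anonymous function are made up for illustration.

```r
library(plyr)

m <- matrix(1:6, nrow = 2)

# .fun returns a data frame: the pieces are stacked into one data frame
totals <- adply(m, 1, function(r) data.frame(total = sum(r)))
totals

# .fun returns an atomic vector of fixed length: one row per piece
ranges <- adply(m, 1, range)
ranges
```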
References
Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. https://www.jstatsoft.org/v40/i01/.
See Also
Other array input: a_ply(), aaply(), alply()
Other data frame output: ddply(), ldply(), mdply()
Split array, apply function, and return results in a list.
Description
For each slice of an array, apply function then combine results into a list.
Usage
alply(
.data,
.margins,
.fun = NULL,
...,
.expand = TRUE,
.progress = "none",
.inform = FALSE,
.parallel = FALSE,
.paropts = NULL,
.dims = FALSE
)
Arguments
.data |
matrix, array or data frame to be processed |
.margins |
a vector giving the subscripts to split up |
.fun |
function to apply to each piece |
... |
other arguments passed on to |
.expand |
if |
.progress |
name of the progress bar to use, see
|
.inform |
produce informative error messages? This is turned off by default because it substantially slows processing speed, but is very useful for debugging |
.parallel |
if |
.paropts |
a list of additional options passed into
the |
.dims |
if |
Details
The list will have "dims" and "dimnames" corresponding to the margins given. For instance alply(x, c(3,2), ...) where x has dims c(4,3,2) will give a result with dims c(2,3).
alply is somewhat similar to apply for cases where the results are not atomic.
Value
list of results
Input
This function splits matrices, arrays and data frames by dimensions
Output
If there are no results, then this function will return a list of length 0 (list()).
References
Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. https://www.jstatsoft.org/v40/i01/.
See Also
Other array input: a_ply(), aaply(), adply()
Other list output: dlply(), llply(), mlply()
Examples
alply(ozone, 3, quantile)
alply(ozone, 3, function(x) table(round(x)))
Dimensions.
Description
Consistent dimensions for vectors, matrices and arrays.
Usage
amv_dim(x)
Arguments
x |
array, matrix or vector |
Dimension names.
Description
Consistent dimnames for vectors, matrices and arrays.
Usage
amv_dimnames(x)
Arguments
x |
array, matrix or vector |
Details
Unlike dimnames, no part of the output will ever be null. If a component of dimnames is omitted, amv_dimnames will return an integer sequence of the appropriate length.
Order a data frame by its columns.
Description
This function completes the subsetting, transforming and ordering triad with a function that works in a similar way to subset and transform but for reordering a data frame by its columns. This saves a lot of typing!
Usage
arrange(df, ...)
Arguments
df |
data frame to reorder |
... |
expressions evaluated in the context of |
See Also
order
for sorting function in the base package
Examples
# sort mtcars data by cylinder and displacement
mtcars[with(mtcars, order(cyl, disp)), ]
# Same result using arrange: no need to use with(), as the context is implicit
# NOTE: plyr functions do NOT preserve row.names
arrange(mtcars, cyl, disp)
# Let's keep the row.names in this example
myCars = cbind(vehicle=row.names(mtcars), mtcars)
arrange(myCars, cyl, disp)
# Sort with displacement in descending order
arrange(myCars, cyl, desc(disp))
Make a function return a data frame.
Description
Create a new function that returns the existing function wrapped in a
data.frame with a single column, value
.
Usage
## S3 method for class 'function'
as.data.frame(x, row.names, optional, ...)
Arguments
x |
function to make return a data frame |
row.names |
necessary to match the generic, but not used |
optional |
necessary to match the generic, but not used |
... |
necessary to match the generic, but not used |
Details
This is useful when calling *dply functions with a function that returns a vector, and you want the output in rows, rather than columns. The value column is always created, even for empty inputs.
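The rows-versus-columns difference can be sketched as follows, assuming plyr is installed. The helper `mpg_range` is made up for illustration.

```r
library(plyr)

# A function returning a length-2 vector (made up for illustration)
mpg_range <- function(df) range(df$mpg)

# Unwrapped: the two values become two columns per group
wide <- ddply(mtcars, "cyl", mpg_range)
wide

# Wrapped: each value becomes its own row in a `value` column
long <- ddply(mtcars, "cyl", as.data.frame(mpg_range))
long
```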
Convert split list to regular list.
Description
Strip off label-related attributes to turn a split list into a regular list.
Usage
## S3 method for class 'split'
as.list(x, ...)
Arguments
x |
object to convert to a list |
... |
unused |
Convert input to quoted variables.
Description
Convert characters, formulas and calls to quoted .variables
Usage
as.quoted(x, env = parent.frame())
Arguments
x |
input to quote |
env |
environment in which unbound symbols in expression should be
evaluated. Defaults to the environment in which |
Details
This method is called by default on all plyr functions that take a .variables argument, so that equivalent forms can be used anywhere.
Currently conversions exist for character vectors, formulas and call objects.
Value
a list of quoted variables
See Also
Examples
as.quoted(c("a", "b", "log(d)"))
as.quoted(a ~ b + log(d))
Yearly batting records for all major league baseball players
Description
This data frame contains batting statistics for a subset of players collected from http://www.baseball-databank.org/. There are a total of 21,699 records, covering 1,228 players from 1871 to 2007. Only players with more than 15 seasons of play are included.
Usage
baseball
Format
A 21699 x 22 data frame
Variables
Variables:
id, unique player id
year, year of data
stint
team, team played for
lg, league
g, number of games
ab, number of times at bat
r, number of runs
h, hits, times reached base because of a batted, fair ball without error by the defense
X2b, hits on which the batter reached second base safely
X3b, hits on which the batter reached third base safely
hr, number of home runs
rbi, runs batted in
sb, stolen bases
cs, caught stealing
bb, base on balls (walk)
so, strike outs
ibb, intentional base on balls
hbp, hits by pitch
sh, sacrifice hits
sf, sacrifice flies
gidp, ground into double play
References
http://www.baseball-databank.org/
Examples
baberuth <- subset(baseball, id == "ruthba01")
baberuth$cyear <- baberuth$year - min(baberuth$year) + 1
calculate_cyear <- function(df) {
  mutate(df,
    cyear = year - min(year),
    cpercent = cyear / (max(year) - min(year))
  )
}
baseball <- ddply(baseball, .(id), calculate_cyear)
baseball <- subset(baseball, ab >= 25)
model <- function(df) {
  lm(rbi / ab ~ cyear, data=df)
}
model(baberuth)
models <- dlply(baseball, .(id), model)
Column-wise function.
Description
Turn a function that operates on a vector into a function that operates column-wise on a data.frame.
Usage
colwise(.fun, .cols = true, ...)
catcolwise(.fun, ...)
numcolwise(.fun, ...)
Arguments
.fun |
function |
.cols |
either a function that tests columns for inclusion, or a quoted object giving which columns to process |
... |
other arguments passed on to |
Details
catcolwise and numcolwise provide versions that only operate on discrete and numeric variables respectively.
Examples
# Count number of missing values
nmissing <- function(x) sum(is.na(x))
# Apply to every column in a data frame
colwise(nmissing)(baseball)
# This syntax looks a little different. It is shorthand for the
# following:
f <- colwise(nmissing)
f(baseball)
# This is particularly useful in conjunction with d*ply
ddply(baseball, .(year), colwise(nmissing))
# To operate only on specified columns, supply them as the second
# argument. Many different forms are accepted.
ddply(baseball, .(year), colwise(nmissing, .(sb, cs, so)))
ddply(baseball, .(year), colwise(nmissing, c("sb", "cs", "so")))
ddply(baseball, .(year), colwise(nmissing, ~ sb + cs + so))
# Alternatively, you can specify a boolean function that determines
# whether or not a column should be included
ddply(baseball, .(year), colwise(nmissing, is.character))
ddply(baseball, .(year), colwise(nmissing, is.numeric))
ddply(baseball, .(year), colwise(nmissing, is.discrete))
# These last two cases are particularly common, so some shortcuts are
# provided:
ddply(baseball, .(year), numcolwise(nmissing))
ddply(baseball, .(year), catcolwise(nmissing))
# You can supply additional arguments to either colwise, or the function
# it generates:
numcolwise(mean)(baseball, na.rm = TRUE)
numcolwise(mean, na.rm = TRUE)(baseball)
Compact list.
Description
Remove all NULL entries from a list
Usage
compact(l)
Arguments
l |
list |
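A minimal sketch, assuming plyr is installed; the list `results` is made up for illustration.

```r
library(plyr)

# A list in which some entries are NULL, e.g. steps that produced nothing
results <- list(a = 1, b = NULL, c = "x", d = NULL)

# Only the non-NULL entries remain
kept <- compact(results)
names(kept)
```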
Count the number of occurrences.
Description
Equivalent to as.data.frame(table(x)), but does not include combinations with zero counts.
Usage
count(df, vars = NULL, wt_var = NULL)
Arguments
df |
data frame to be processed |
vars |
variables to count unique values of |
wt_var |
optional variable to weight by - if this is non-NULL, count will sum up the value of this variable for each combination of id variables. |
Details
Speed-wise count is competitive with table for single variables, but it really comes into its own when summarising multiple dimensions because it only counts combinations that actually occur in the data.
Compared to table + as.data.frame, count also preserves the type of the identifier variables, instead of converting them to characters/factors.
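Both differences from table can be seen on a tiny data frame. This is a sketch that assumes plyr is installed; `df` is made up for illustration.

```r
library(plyr)

df <- data.frame(g = c(1L, 1L, 2L), h = c("a", "a", "b"))

# table() enumerates all 2 x 2 combinations, including those that never occur,
# and turns g into a factor
from_table <- as.data.frame(table(df))
from_table

# count() reports only the combinations present and keeps g as an integer
from_count <- count(df, c("g", "h"))
from_count
```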
Value
a data frame with label and freq columns
See Also
table
for related functionality in the base package
Examples
# Count of each value of "id" in the first 100 cases
count(baseball[1:100,], vars = "id")
# Count of ids, weighted by their "g" loading
count(baseball[1:100,], vars = "id", wt_var = "g")
count(baseball, "id", "ab")
count(baseball, "lg")
# How many stints do players do?
count(baseball, "stint")
# Count of times each player appeared in each of the years they played
count(baseball[1:100,], c("id", "year"))
# Count of counts
count(count(baseball[1:100,], c("id", "year")), "id", "freq")
count(count(baseball, c("id", "year")), "freq")
Create progress bar.
Description
Create progress bar object from text string.
Usage
create_progress_bar(name = "none", ...)
Arguments
name |
type of progress bar to create |
... |
other arguments passed onto progress bar function |
Details
Progress bars give feedback on how apply step is proceeding. This is mainly useful for long running functions, as for short functions, the time taken up by splitting and combining may be on the same order (or longer) as the apply step. Additionally, for short functions, the time needed to update the progress bar can significantly slow down the process. For the trivial examples below, using the tk progress bar slows things down by a factor of a thousand.
Note that the progress bar is approximate, and if the time taken by individual function applications is highly non-uniform it may not be very informative of the time left.
There are currently four types of progress bar: "none", "text", "tk", and "win". See the individual documentation for more details. In plyr functions, these can either be specified by name, or you can create the progress bar object yourself if you want more control over its appearance. See the examples.
See Also
progress_none, progress_text, progress_tk, progress_win
Examples
# No progress bar
l_ply(1:100, identity, .progress = "none")
## Not run:
# Use the Tcl/Tk interface
l_ply(1:100, identity, .progress = "tk")
## End(Not run)
# Text-based progress (|======|)
l_ply(1:100, identity, .progress = "text")
# Choose a progress character, run a length of time you can see
l_ply(1:10000, identity, .progress = progress_text(char = "."))
Split data frame, apply function, and discard results.
Description
For each subset of a data frame, apply function and discard results.
To apply a function for each row, use a_ply with .margins set to 1.
Usage
d_ply(
.data,
.variables,
.fun = NULL,
...,
.progress = "none",
.inform = FALSE,
.drop = TRUE,
.print = FALSE,
.parallel = FALSE,
.paropts = NULL
)
Arguments
.data |
data frame to be processed |
.variables |
variables to split data frame by, as |
.fun |
function to apply to each piece |
... |
other arguments passed on to |
.progress |
name of the progress bar to use, see
|
.inform |
produce informative error messages? This is turned off by default because it substantially slows processing speed, but is very useful for debugging |
.drop |
should combinations of variables that do not appear in the input data be preserved (FALSE) or dropped (TRUE, default) |
.print |
automatically print each result? (default: |
.parallel |
if |
.paropts |
a list of additional options passed into
the |
Value
Nothing
Input
This function splits data frames by variables.
Output
All output is discarded. This is useful for functions that you are calling purely for their side effects like displaying plots or saving output.
References
Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. https://www.jstatsoft.org/v40/i01/.
See Also
Other data frame input: daply(), ddply(), dlply()
Other no output: a_ply(), l_ply(), m_ply()
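A short sketch of d_ply used only for its side effects, assuming plyr is installed and using the built-in mtcars data:

```r
library(plyr)

# Print one summary line per cylinder group; nothing is returned
res <- d_ply(mtcars, "cyl", function(df) {
  cat("cyl =", df$cyl[1], "has", nrow(df), "cars\n")
})

is.null(res)
```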
Split data frame, apply function, and return results in an array.
Description
For each subset of a data frame, apply function then combine results into an array. daply with a function that operates column-wise is similar to aggregate.
To apply a function for each row, use aaply with .margins set to 1.
Usage
daply(
.data,
.variables,
.fun = NULL,
...,
.progress = "none",
.inform = FALSE,
.drop_i = TRUE,
.drop_o = TRUE,
.parallel = FALSE,
.paropts = NULL
)
Arguments
.data |
data frame to be processed |
.variables |
variables to split data frame by, as quoted variables, a formula or character vector |
.fun |
function to apply to each piece |
... |
other arguments passed on to |
.progress |
name of the progress bar to use, see
|
.inform |
produce informative error messages? This is turned off by default because it substantially slows processing speed, but is very useful for debugging |
.drop_i |
should combinations of variables that do not appear in the input data be preserved (FALSE) or dropped (TRUE, default) |
.drop_o |
should extra dimensions of length 1 in the output be
dropped, simplifying the output. Defaults to |
.parallel |
if |
.paropts |
a list of additional options passed into
the |
Value
if results are atomic with same type and dimensionality, a vector, matrix or array; otherwise, a list-array (a list with dimensions)
Input
This function splits data frames by variables.
Output
If there are no results, then this function will return a vector of length 0 (vector()).
References
Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. https://www.jstatsoft.org/v40/i01/.
See Also
Other array output: aaply(), laply(), maply()
Other data frame input: d_ply(), ddply(), dlply()
Examples
daply(baseball, .(year), nrow)
# Several different ways of summarising by variables that should not be
# included in the summary
daply(baseball[, c(2, 6:9)], .(year), colwise(mean))
daply(baseball[, 6:9], .(baseball$year), colwise(mean))
daply(baseball, .(year), function(df) colwise(mean)(df[, 6:9]))
Split data frame, apply function, and return results in a data frame.
Description
For each subset of a data frame, apply function then combine results into a data frame.
To apply a function for each row, use adply with .margins set to 1.
Usage
ddply(
.data,
.variables,
.fun = NULL,
...,
.progress = "none",
.inform = FALSE,
.drop = TRUE,
.parallel = FALSE,
.paropts = NULL
)
Arguments
.data |
data frame to be processed |
.variables |
variables to split data frame by, as |
.fun |
function to apply to each piece |
... |
other arguments passed on to |
.progress |
name of the progress bar to use, see
|
.inform |
produce informative error messages? This is turned off by default because it substantially slows processing speed, but is very useful for debugging |
.drop |
should combinations of variables that do not appear in the input data be preserved (FALSE) or dropped (TRUE, default) |
.parallel |
if |
.paropts |
a list of additional options passed into
the |
Value
A data frame, as described in the output section.
Input
This function splits data frames by variables.
Output
The most unambiguous behaviour is achieved when .fun returns a data frame: in that case pieces will be combined with rbind.fill. If .fun returns an atomic vector of fixed length, it will be rbinded together and converted to a data frame. Any other values will result in an error.
If there are no results, then this function will return a data frame with zero rows and columns (data.frame()).
References
Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. https://www.jstatsoft.org/v40/i01/.
See Also
tapply for similar functionality in the base package
Other data frame input: d_ply(), daply(), dlply()
Other data frame output: adply(), ldply(), mdply()
Examples
# Summarize a dataset by two variables
dfx <- data.frame(
  group = c(rep('A', 8), rep('B', 15), rep('C', 6)),
  sex = sample(c("M", "F"), size = 29, replace = TRUE),
  age = runif(n = 29, min = 18, max = 54)
)
# Note the use of the '.' function to allow
# group and sex to be used without quoting
ddply(dfx, .(group, sex), summarize,
  mean = round(mean(age), 2),
  sd = round(sd(age), 2))
# An example using a formula for .variables
ddply(baseball[1:100,], ~ year, nrow)
# Applying two functions; nrow and ncol
ddply(baseball, .(lg), c("nrow", "ncol"))
# Calculate mean runs batted in for each year
rbi <- ddply(baseball, .(year), summarise,
mean_rbi = mean(rbi, na.rm = TRUE))
# Plot a line chart of the result
plot(mean_rbi ~ year, type = "l", data = rbi)
# make new variable career_year based on the
# start year for each player (id)
base2 <- ddply(baseball, .(id), mutate,
career_year = year - min(year) + 1
)
Set defaults.
Description
Convenient method for combining a list of values with their defaults.
Usage
defaults(x, y)
Arguments
x |
list of values |
y |
defaults |
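A minimal sketch of how the two lists combine, assuming plyr is installed. The option names are made up for illustration.

```r
library(plyr)

user_opts    <- list(colour = "red")
default_opts <- list(colour = "black", size = 1)

# Values in the first list win; anything missing is taken from the defaults
opts <- defaults(user_opts, default_opts)
str(opts)
```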
Descending order.
Description
Transform a vector into a format that will be sorted in descending order.
Usage
desc(x)
Arguments
x |
vector to transform |
Examples
desc(1:10)
desc(factor(letters))
first_day <- seq(as.Date("1910/1/1"), as.Date("1920/1/1"), "years")
desc(first_day)
Number of dimensions.
Description
Number of dimensions of an array or vector
Usage
dims(x)
Arguments
x |
array |
Split data frame, apply function, and return results in a list.
Description
For each subset of a data frame, apply function then combine results into a list. dlply is similar to by except that the results are returned in a different format.
To apply a function for each row, use alply with .margins set to 1.
Usage
dlply(
.data,
.variables,
.fun = NULL,
...,
.progress = "none",
.inform = FALSE,
.drop = TRUE,
.parallel = FALSE,
.paropts = NULL
)
Arguments
.data |
data frame to be processed |
.variables |
variables to split data frame by, as |
.fun |
function to apply to each piece |
... |
other arguments passed on to |
.progress |
name of the progress bar to use, see
|
.inform |
produce informative error messages? This is turned off by default because it substantially slows processing speed, but is very useful for debugging |
.drop |
should combinations of variables that do not appear in the input data be preserved (FALSE) or dropped (TRUE, default) |
.parallel |
if |
.paropts |
a list of additional options passed into
the |
Value
list of results
Input
This function splits data frames by variables.
Output
If there are no results, then this function will return a list of length 0 (list()).
References
Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. https://www.jstatsoft.org/v40/i01/.
See Also
Other data frame input: d_ply(), daply(), ddply()
Other list output: alply(), llply(), mlply()
Examples
linmod <- function(df) {
  lm(rbi ~ year, data = mutate(df, year = year - min(year)))
}
models <- dlply(baseball, .(id), linmod)
models[[1]]
coef <- ldply(models, coef)
with(coef, plot(`(Intercept)`, year))
qual <- laply(models, function(mod) summary(mod)$r.squared)
hist(qual)
Aggregate multiple functions into a single function.
Description
Combine multiple functions into a single function returning a named vector of outputs. Note: you cannot supply additional parameters for the summary functions
Usage
each(...)
Arguments
... |
functions to combine. each function should produce a single number as output |
See Also
summarise
for applying summary functions to data
Examples
# Call min() and max() on the vector 1:10
each(min, max)(1:10)
# This syntax looks a little different. It is shorthand for the
# following:
f <- each(min, max)
f(1:10)
# Three equivalent ways to call min() and max() on the vector 1:10
each("min", "max")(1:10)
each(c("min", "max"))(1:10)
each(c(min, max))(1:10)
# Call length(), mean() and var() on a random normal vector
each(length, mean, var)(rnorm(100))
Check if a data frame is empty.
Description
Empty if it's null or it has 0 rows or columns
Usage
empty(df)
Arguments
df |
data frame to check |
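The cases named in the description can be checked directly. This sketch assumes plyr is installed and uses the built-in mtcars data.

```r
library(plyr)

empty(NULL)            # null input
empty(data.frame())    # zero rows and zero columns
empty(mtcars[0, ])     # columns but no rows
empty(mtcars)          # has both rows and columns
```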
Evaluate a quoted list of variables.
Description
Evaluates quoted variables in specified environment
Usage
eval.quoted(exprs, envir = NULL, enclos = NULL, try = FALSE)
Arguments
exprs |
quoted object to evaluate |
try |
if TRUE, return |
Value
a list
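A small sketch of evaluating quoted variables against a data frame, assuming plyr is installed and using the built-in mtcars data. The name `log_mpg` is made up for illustration.

```r
library(plyr)

# Quote two expressions, then evaluate them with mtcars supplying the variables
vals <- eval.quoted(.(cyl, log_mpg = log(mpg)), mtcars)

# One list element per quoted expression
str(vals)
```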
Fail with specified value.
Description
Modify a function so that it returns a default value when there is an error.
Usage
failwith(default = NULL, f, quiet = FALSE)
Arguments
default |
default value |
f |
function |
quiet |
should all error messages be suppressed? |
Value
a function
See Also
Examples
f <- function(x) if (x == 1) stop("Error!") else 1
## Not run:
f(1)
f(2)
## End(Not run)
safef <- failwith(NULL, f)
safef(1)
safef(2)
Capture current evaluation context.
Description
This function captures the current context, making it easier to use **ply with functions that do special evaluation and need access to the environment where ddply was called from.
Usage
here(f)
Arguments
f |
a function that does non-standard evaluation |
Author(s)
Peter Meilstrup, https://github.com/crowding
Examples
df <- data.frame(a = rep(c("a","b"), each = 10), b = 1:20)
f1 <- function(label) {
ddply(df, "a", mutate, label = paste(label, b))
}
## Not run: f1("name:")
# Doesn't work because mutate can't find label in the current scope
f2 <- function(label) {
ddply(df, "a", here(mutate), label = paste(label, b))
}
f2("name:")
# Works :)
Compute a unique numeric id for each unique row in a data frame.
Description
Properties:
- order(id) is equivalent to do.call(order, df)
- rows containing the same data have the same value
- if drop = FALSE then room for all possibilities
Usage
id(.variables, drop = FALSE)
Arguments
.variables |
list of variables |
drop |
drop unused factor levels? |
Value
a numeric vector with attribute n, giving total number of possibilities
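The properties listed in the description can be illustrated on a tiny data frame. This is a sketch only: the data frame df and its columns are made up for the example, and plyr is assumed to be installed.

```r
library(plyr)

# Hypothetical data: rows 1 and 3 are identical
df <- data.frame(a = c(2, 1, 2, 1), b = c(1, 1, 1, 2))

ids <- id(df)

# Rows containing the same data share an id
same_rows_same_id <- ids[1] == ids[3]

# Ordering by id matches ordering by all columns in turn
same_order <- identical(order(ids), do.call(order, df))

# With drop = FALSE, n leaves room for every combination (2 * 2 here)
n_possible <- attr(ids, "n")
```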
Numeric id for a vector.
Description
Numeric id for a vector.
Usage
id_var(x, drop = FALSE)
Construct an immutable data frame.
Description
An immutable data frame works like an ordinary data frame, except that when you subset it, it returns a reference to the original data frame, not a copy. This makes subsetting substantially faster and has a big impact when you are working with large datasets with many groups.
Usage
idata.frame(df)
Arguments
df |
a data frame |
Details
This method is still a little experimental, so please let me know if you run into any problems.
Value
an immutable data frame
Examples
system.time(dlply(baseball, "id", nrow))
system.time(dlply(idata.frame(baseball), "id", nrow))
An indexed array.
Description
Create an indexed array, a space-efficient way of indexing into a large array.
Usage
indexed_array(env, index)
Arguments
env |
environment containing data frame |
index |
list of indices |
An indexed data frame.
Description
Create an indexed list, a space-efficient way of indexing into a large data frame
Usage
indexed_df(data, index, vars)
Arguments
data |
environment containing data frame |
index |
list of indices |
vars |
a character vector giving the variables used for subsetting |
Determine if a vector is discrete.
Description
A discrete vector is a factor or a character vector
Usage
is.discrete(x)
Arguments
x |
vector to test |
Examples
is.discrete(1:10)
is.discrete(c("a", "b", "c"))
is.discrete(factor(c("a", "b", "c")))
Is a formula? Checks if argument is a formula
Description
Is a formula? Checks if argument is a formula
Usage
is.formula(x)
Split iterator that returns values, not indices.
Description
Split iterator that returns values, not indices.
Usage
isplit2(x, f, drop = FALSE, ...)
Warning
Deprecated, do not use in new code.
Join two data frames together.
Description
Join, like merge, is designed for the types of problems where you would use a SQL join.
Usage
join(x, y, by = NULL, type = "left", match = "all")
Arguments
x |
data frame |
y |
data frame |
by |
character vector of variable names to join by. If omitted, will match on all common variables. |
type |
type of join: left (default), right, inner or full. See details for more information. |
match |
how should duplicate ids be matched? Either match just the
|
Details
The four join types return:
- inner: only rows with matching keys in both x and y
- left: all rows in x, adding matching columns from y
- right: all rows in y, adding matching columns from x
- full: all rows in x with matching columns in y, then the rows of y that don't match x.
Note that from plyr 1.5, join
will (by default) return all matches,
not just the first match, as it did previously.
Unlike merge, preserves the order of x no matter what join type is used. If needed, rows from y will be added to the bottom. Join is often faster than merge, although it is somewhat less featureful - it currently offers no way to rename output or merge on different variables in the x and y data frames.
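As a rough illustration of the four join types, the sketch below joins two small data frames. The data frames x and y and their columns are invented for this example; they are not part of plyr.

```r
library(plyr)

# Two toy data frames sharing the key column "id" (hypothetical data)
x <- data.frame(id = c(1, 2, 3), a = c("x1", "x2", "x3"))
y <- data.frame(id = c(2, 3, 4), b = c("y2", "y3", "y4"))

inner <- join(x, y, by = "id", type = "inner")  # ids present in both x and y
left  <- join(x, y, by = "id", type = "left")   # every row of x
right <- join(x, y, by = "id", type = "right")  # every row of y
full  <- join(x, y, by = "id", type = "full")   # rows of x, then unmatched rows of y
```

In the left join, the row with id 1 has no partner in y, so its b value is filled with NA.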
Examples
first <- ddply(baseball, "id", summarise, first = min(year))
system.time(b2 <- merge(baseball, first, by = "id", all.x = TRUE))
system.time(b3 <- join(baseball, first, by = "id"))
b2 <- arrange(b2, id, year, stint)
b3 <- arrange(b3, id, year, stint)
stopifnot(all.equal(b2, b3))
Recursively join a list of data frames.
Description
Recursively join a list of data frames.
Usage
join_all(dfs, by = NULL, type = "left", match = "all")
Arguments
dfs |
A list of data frames. |
by |
character vector of variable names to join by. If omitted, will match on all common variables. |
type |
type of join: left (default), right, inner or full. See details for more information. |
match |
how should duplicate ids be matched? Either match just the
|
Examples
dfs <- list(
a = data.frame(x = 1:10, a = runif(10)),
b = data.frame(x = 1:10, b = runif(10)),
c = data.frame(x = 1:10, c = runif(10))
)
join_all(dfs)
join_all(dfs, "x")
Join keys. Given two data frames, create a unique key for each row.
Description
Join keys. Given two data frames, create a unique key for each row.
Usage
join.keys(x, y, by)
Arguments
x |
data frame |
y |
data frame |
by |
character vector of variable names to join by |
Split list, apply function, and discard results.
Description
For each element of a list, apply function and discard results
Usage
l_ply(
.data,
.fun = NULL,
...,
.progress = "none",
.inform = FALSE,
.print = FALSE,
.parallel = FALSE,
.paropts = NULL
)
Arguments
.data |
list to be processed |
.fun |
function to apply to each piece |
... |
other arguments passed on to |
.progress |
name of the progress bar to use, see
|
.inform |
produce informative error messages? This is turned off by default because it substantially slows processing speed, but is very useful for debugging |
.print |
automatically print each result? (default: |
.parallel |
if |
.paropts |
a list of additional options passed into
the |
Value
Nothing
Input
This function splits lists by elements.
Output
All output is discarded. This is useful for functions that you are calling purely for their side effects like displaying plots or saving output.
References
Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. https://www.jstatsoft.org/v40/i01/.
See Also
Other list input:
laply()
,
ldply()
,
llply()
Other no output:
a_ply()
,
d_ply()
,
m_ply()
Examples
l_ply(llply(mtcars, round), table, .print = TRUE)
l_ply(baseball, function(x) print(summary(x)))
Split list, apply function, and return results in an array.
Description
For each element of a list, apply function then combine results into an array.
Usage
laply(
.data,
.fun = NULL,
...,
.progress = "none",
.inform = FALSE,
.drop = TRUE,
.parallel = FALSE,
.paropts = NULL
)
Arguments
.data |
list to be processed |
.fun |
function to apply to each piece |
... |
other arguments passed on to |
.progress |
name of the progress bar to use, see
|
.inform |
produce informative error messages? This is turned off by default because it substantially slows processing speed, but is very useful for debugging |
.drop |
should extra dimensions of length 1 in the output be
dropped, simplifying the output. Defaults to |
.parallel |
if |
.paropts |
a list of additional options passed into
the |
Details
laply is similar in spirit to sapply except
that it will always return an array, and the output is transposed with
respect to sapply: each element of the list corresponds to a row,
not a column.
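The difference in orientation can be seen with a function that returns a length-2 vector. This is a minimal sketch; the helper sq is hypothetical and exists only for this example.

```r
library(plyr)

# A hypothetical function returning two numbers for each input
sq <- function(i) c(value = i, square = i^2)

by_row <- laply(1:3, sq)   # one row per list element
by_col <- sapply(1:3, sq)  # one column per list element

dim(by_row)
dim(by_col)
```

Transposing the sapply result gives the same values as the laply result.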
Value
if results are atomic with same type and dimensionality, a vector, matrix or array; otherwise, a list-array (a list with dimensions)
Input
This function splits lists by elements.
Output
If there are no results, then this function will return a vector of
length 0 (vector()
).
References
Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. https://www.jstatsoft.org/v40/i01/.
See Also
Other list input:
l_ply()
,
ldply()
,
llply()
Other array output:
aaply()
,
daply()
,
maply()
Examples
laply(baseball, is.factor)
# cf
ldply(baseball, is.factor)
colwise(is.factor)(baseball)
laply(seq_len(10), identity)
laply(seq_len(10), rep, times = 4)
laply(seq_len(10), matrix, nrow = 2, ncol = 2)
Split list, apply function, and return results in a data frame.
Description
For each element of a list, apply function then combine results into a data frame.
Usage
ldply(
.data,
.fun = NULL,
...,
.progress = "none",
.inform = FALSE,
.parallel = FALSE,
.paropts = NULL,
.id = NA
)
Arguments
.data |
list to be processed |
.fun |
function to apply to each piece |
... |
other arguments passed on to |
.progress |
name of the progress bar to use, see
|
.inform |
produce informative error messages? This is turned off by default because it substantially slows processing speed, but is very useful for debugging |
.parallel |
if |
.paropts |
a list of additional options passed into
the |
.id |
name of the index column (used if |
Value
A data frame, as described in the output section.
Input
This function splits lists by elements.
Output
The most unambiguous behaviour is achieved when .fun
returns a
data frame - in that case pieces will be combined with
rbind.fill
. If .fun
returns an atomic vector of
fixed length, it will be rbind
ed together and converted to a data
frame. Any other values will result in an error.
If there are no results, then this function will return a data
frame with zero rows and columns (data.frame()
).
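A small sketch of the data frame case described above, assuming plyr is installed. The list scores is made-up data; because the list is named and .id is left at its default, the names are expected to appear in an index column called .id.

```r
library(plyr)

# A named list of numeric vectors (hypothetical data)
scores <- list(a = 1:3, b = 4:6)

# .fun returns a one-row data frame, so the pieces are row-bound together
res <- ldply(scores, function(x) data.frame(mean = mean(x), n = length(x)))
res
```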
References
Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. https://www.jstatsoft.org/v40/i01/.
See Also
Other list input:
l_ply()
,
laply()
,
llply()
Other data frame output:
adply()
,
ddply()
,
mdply()
Experimental iterator based version of llply.
Description
Because iterators do not have known length, liply
starts by
allocating an output list of length 50, and then doubles that length
whenever it runs out of space. This gives O(n ln n) performance rather
than the O(n ^ 2) performance from the naive strategy of growing the list
each time.
Usage
liply(.iterator, .fun = NULL, ...)
Arguments
.iterator |
iterator object |
.fun |
function to apply to each piece |
... |
other arguments passed on to |
Warning
Deprecated, do not use in new code.
List to array.
Description
Reduce/simplify a list of homogeneous objects to an array
Usage
list_to_array(res, labels = NULL, .drop = FALSE)
Arguments
res |
list of input data |
labels |
a data frame of labels, one row for each element of res |
.drop |
should extra dimensions be dropped (TRUE) or preserved (FALSE) |
See Also
Other list simplification functions:
list_to_dataframe()
,
list_to_vector()
List to data frame.
Description
Reduce/simplify a list of homogeneous objects to a data frame. All
NULL
entries are removed. Remaining entries must be all atomic
or all data frames.
Usage
list_to_dataframe(res, labels = NULL, id_name = NULL, id_as_factor = FALSE)
Arguments
res |
list of input data |
labels |
a data frame of labels, one row for each element of res |
id_name |
the name of the index column, |
See Also
Other list simplification functions:
list_to_array()
,
list_to_vector()
List to vector.
Description
Reduce/simplify a list of homogeneous objects to a vector
Usage
list_to_vector(res)
Arguments
res |
list of input data |
See Also
Other list simplification functions:
list_to_array()
,
list_to_dataframe()
Split list, apply function, and return results in a list.
Description
For each element of a list, apply function, keeping results as a list.
Usage
llply(
.data,
.fun = NULL,
...,
.progress = "none",
.inform = FALSE,
.parallel = FALSE,
.paropts = NULL
)
Arguments
.data |
list to be processed |
.fun |
function to apply to each piece |
... |
other arguments passed on to |
.progress |
name of the progress bar to use, see
|
.inform |
produce informative error messages? This is turned off by default because it substantially slows processing speed, but is very useful for debugging |
.parallel |
if |
.paropts |
a list of additional options passed into
the |
Details
llply
is equivalent to lapply
except that it will
preserve labels and can display a progress bar.
Value
list of results
Input
This function splits lists by elements.
Output
If there are no results, then this function will return
a list of length 0 (list()
).
References
Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. https://www.jstatsoft.org/v40/i01/.
See Also
Other list input:
l_ply()
,
laply()
,
ldply()
Other list output:
alply()
,
dlply()
,
mlply()
Examples
llply(llply(mtcars, round), table)
llply(baseball, summary)
# Examples from ?lapply
x <- list(a = 1:10, beta = exp(-3:3), logic = c(TRUE,FALSE,FALSE,TRUE))
llply(x, mean)
llply(x, quantile, probs = 1:3/4)
Loop apply
Description
An optimised version of lapply for the special case of operating on
seq_len(n)
Usage
loop_apply(n, f, env = parent.frame())
Arguments
n |
length of sequence |
f |
function to apply to each integer |
env |
environment in which to evaluate function |
Call function with arguments in array or data frame, discarding results.
Description
Call a multi-argument function with values taken from columns of a data frame or array, and discard results.
Usage
m_ply(
.data,
.fun = NULL,
...,
.expand = TRUE,
.progress = "none",
.inform = FALSE,
.print = FALSE,
.parallel = FALSE,
.paropts = NULL
)
Arguments
.data |
matrix or data frame to use as source of arguments |
.fun |
function to apply to each piece |
... |
other arguments passed on to |
.expand |
should output be 1d (expand = FALSE), with an element for each row; or nd (expand = TRUE), with a dimension for each variable. |
.progress |
name of the progress bar to use, see
|
.inform |
produce informative error messages? This is turned off by default because it substantially slows processing speed, but is very useful for debugging |
.print |
automatically print each result? (default: |
.parallel |
if |
.paropts |
a list of additional options passed into
the |
Details
The m*ply
functions are the plyr
version of mapply
,
specialised according to the type of output they produce. These functions
are just a convenient wrapper around a*ply
with margins = 1
and .fun
wrapped in splat
.
Value
Nothing
Input
Call a multi-argument function with values taken from columns of a data frame or array
Output
All output is discarded. This is useful for functions that you are calling purely for their side effects like displaying plots or saving output.
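A minimal sketch of calling a function purely for its side effects. The data frame args and the vector seen are invented for the example; each row of args supplies one set of arguments.

```r
library(plyr)

# Each row supplies one set of arguments (hypothetical values)
args <- data.frame(x = c(1, 2), times = c(2, 1))

# Record a side effect per call; whatever .fun returns is thrown away
seen <- character(0)
out <- m_ply(args, function(x, times) {
  seen <<- c(seen, paste(rep(x, times), collapse = "-"))
})
seen
```

The side effects are kept in seen, while out holds nothing useful.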
References
Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. https://www.jstatsoft.org/v40/i01/.
See Also
Other multiple arguments input:
maply()
,
mdply()
,
mlply()
Other no output:
a_ply()
,
d_ply()
,
l_ply()
Call function with arguments in array or data frame, returning an array.
Description
Call a multi-argument function with values taken from columns of a data frame or array, and combine results into an array
Usage
maply(
.data,
.fun = NULL,
...,
.expand = TRUE,
.progress = "none",
.inform = FALSE,
.drop = TRUE,
.parallel = FALSE,
.paropts = NULL
)
Arguments
.data |
matrix or data frame to use as source of arguments |
.fun |
function to apply to each piece |
... |
other arguments passed on to |
.expand |
should output be 1d (expand = FALSE), with an element for each row; or nd (expand = TRUE), with a dimension for each variable. |
.progress |
name of the progress bar to use, see
|
.inform |
produce informative error messages? This is turned off by default because it substantially slows processing speed, but is very useful for debugging |
.drop |
should extra dimensions of length 1 in the output be
dropped, simplifying the output. Defaults to |
.parallel |
if |
.paropts |
a list of additional options passed into
the |
Details
The m*ply
functions are the plyr
version of mapply
,
specialised according to the type of output they produce. These functions
are just a convenient wrapper around a*ply
with margins = 1
and .fun
wrapped in splat
.
Value
if results are atomic with same type and dimensionality, a vector, matrix or array; otherwise, a list-array (a list with dimensions)
Input
Call a multi-argument function with values taken from columns of a data frame or array
Output
If there are no results, then this function will return a vector of
length 0 (vector()
).
References
Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. https://www.jstatsoft.org/v40/i01/.
See Also
Other multiple arguments input:
m_ply()
,
mdply()
,
mlply()
Other array output:
aaply()
,
daply()
,
laply()
Examples
maply(cbind(mean = 1:5, sd = 1:5), rnorm, n = 5)
maply(expand.grid(mean = 1:5, sd = 1:5), rnorm, n = 5)
maply(cbind(1:5, 1:5), rnorm, n = 5)
Replace specified values with new values, in a vector or factor.
Description
Items in x that match items in from will be replaced by
items in to, matched by position. For example, items in x that
match the first element in from will be replaced by the first
element of to.
Usage
mapvalues(x, from, to, warn_missing = TRUE)
Arguments
x |
the factor or vector to modify |
from |
a vector of the items to replace |
to |
a vector of replacement values |
warn_missing |
print a message if any of the old values are
not actually present in |
Details
If x
is a factor, the matching levels of the factor will be
replaced with the new values.
The related revalue
function works only on character vectors
and factors, but this function works on vectors of any type and factors.
See Also
revalue
to do the same thing but with a single
named vector instead of two separate vectors.
Examples
x <- c("a", "b", "c")
mapvalues(x, c("a", "c"), c("A", "C"))
# Works on factors
y <- factor(c("a", "b", "c", "a"))
mapvalues(y, c("a", "c"), c("A", "C"))
# Works on numeric vectors
z <- c(1, 4, 5, 9)
mapvalues(z, from = c(1, 5, 9), to = c(10, 50, 90))
Extract matching rows of a data frame.
Description
Match works in the same way as join, but instead of returning the combined dataset, it only returns the matching rows from the first dataset. This is particularly useful when you've summarised the data in some way and want to subset the original data by a characteristic of the subset.
Usage
match_df(x, y, on = NULL)
Arguments
x |
data frame to subset. |
y |
data frame defining matching rows. |
on |
variables to match on - by default will use all variables common to both data frames. |
Details
match_df shares the same semantics as join, not match:
- the match criterion is ==, not identical
- it doesn't work for columns that are not atomic vectors
- if there are no matches, the row will be omitted
Value
a data frame
See Also
join
to combine the columns from both x and y
and match
for the base function selecting matching items
Examples
# count the occurrences of each id in the baseball dataframe, then get the subset with a freq >25
longterm <- subset(count(baseball, "id"), freq > 25)
# longterm
# id freq
# 30 ansonca01 27
# 48 baineha01 27
# ...
# Select only rows from these longterm players from the baseball dataframe
# (match_df would default to matching on shared column names, but here "id" is set explicitly)
bb_longterm <- match_df(baseball, longterm, on="id")
bb_longterm[1:5,]
Call function with arguments in array or data frame, returning a data frame.
Description
Call a multi-argument function with values taken from columns of a data frame or array, and combine results into a data frame
Usage
mdply(
.data,
.fun = NULL,
...,
.expand = TRUE,
.progress = "none",
.inform = FALSE,
.parallel = FALSE,
.paropts = NULL
)
Arguments
.data |
matrix or data frame to use as source of arguments |
.fun |
function to apply to each piece |
... |
other arguments passed on to |
.expand |
should output be 1d (expand = FALSE), with an element for each row; or nd (expand = TRUE), with a dimension for each variable. |
.progress |
name of the progress bar to use, see
|
.inform |
produce informative error messages? This is turned off by default because it substantially slows processing speed, but is very useful for debugging |
.parallel |
if |
.paropts |
a list of additional options passed into
the |
Details
The m*ply
functions are the plyr
version of mapply
,
specialised according to the type of output they produce. These functions
are just a convenient wrapper around a*ply
with margins = 1
and .fun
wrapped in splat
.
Value
A data frame, as described in the output section.
Input
Call a multi-argument function with values taken from columns of a data frame or array
Output
The most unambiguous behaviour is achieved when .fun
returns a
data frame - in that case pieces will be combined with
rbind.fill
. If .fun
returns an atomic vector of
fixed length, it will be rbind
ed together and converted to a data
frame. Any other values will result in an error.
If there are no results, then this function will return a data
frame with zero rows and columns (data.frame()
).
References
Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. https://www.jstatsoft.org/v40/i01/.
See Also
Other multiple arguments input:
m_ply()
,
maply()
,
mlply()
Other data frame output:
adply()
,
ddply()
,
ldply()
Examples
mdply(data.frame(mean = 1:5, sd = 1:5), rnorm, n = 2)
mdply(expand.grid(mean = 1:5, sd = 1:5), rnorm, n = 2)
mdply(cbind(mean = 1:5, sd = 1:5), rnorm, n = 5)
mdply(cbind(mean = 1:5, sd = 1:5), as.data.frame(rnorm), n = 5)
Call function with arguments in array or data frame, returning a list.
Description
Call a multi-argument function with values taken from columns of a data frame or array, and combine results into a list.
Usage
mlply(
.data,
.fun = NULL,
...,
.expand = TRUE,
.progress = "none",
.inform = FALSE,
.parallel = FALSE,
.paropts = NULL
)
Arguments
.data |
matrix or data frame to use as source of arguments |
.fun |
function to apply to each piece |
... |
other arguments passed on to |
.expand |
should output be 1d (expand = FALSE), with an element for each row; or nd (expand = TRUE), with a dimension for each variable. |
.progress |
name of the progress bar to use, see
|
.inform |
produce informative error messages? This is turned off by default because it substantially slows processing speed, but is very useful for debugging |
.parallel |
if |
.paropts |
a list of additional options passed into
the |
Details
The m*ply
functions are the plyr
version of mapply
,
specialised according to the type of output they produce. These functions
are just a convenient wrapper around a*ply
with margins = 1
and .fun
wrapped in splat
.
Value
list of results
Input
Call a multi-argument function with values taken from columns of a data frame or array
Output
If there are no results, then this function will return
a list of length 0 (list()
).
References
Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. https://www.jstatsoft.org/v40/i01/.
See Also
Other multiple arguments input:
m_ply()
,
maply()
,
mdply()
Other list output:
alply()
,
dlply()
,
llply()
Examples
mlply(cbind(1:4, 4:1), rep)
mlply(cbind(1:4, times = 4:1), rep)
mlply(cbind(1:4, 4:1), seq)
mlply(cbind(1:4, length = 4:1), seq)
mlply(cbind(1:4, by = 4:1), seq, to = 20)
Mutate a data frame by adding new or replacing existing columns.
Description
This function is very similar to transform
but it executes
the transformations iteratively so that later transformations can use the
columns created by earlier transformations. Like transform, unnamed
components are silently dropped.
Usage
mutate(.data, ...)
Arguments
.data |
the data frame to transform |
... |
named parameters giving definitions of new columns. |
Details
Mutate seems to be considerably faster than transform for large data frames.
See Also
subset
, summarise
,
arrange
. For another somewhat different approach to
solving the same problem, see within
.
Examples
# Examples from transform
mutate(airquality, Ozone = -Ozone)
mutate(airquality, new = -Ozone, Temp = (Temp - 32) / 1.8)
# Things transform can't do
mutate(airquality, Temp = (Temp - 32) / 1.8, OzT = Ozone / Temp)
# mutate is rather faster than transform
system.time(transform(baseball, avg_ab = ab / g))
system.time(mutate(baseball, avg_ab = ab / g))
Toggle row names between explicit and implicit.
Description
Plyr functions ignore row names, so this function provides a way to preserve
them by converting them to an explicit column in the data frame. After the
plyr operation, you can then apply name_rows
again to convert back
from the explicit column to the implicit rownames
.
Usage
name_rows(df)
Arguments
df |
a data.frame, with either |
Examples
name_rows(mtcars)
name_rows(name_rows(mtcars))
df <- data.frame(a = sample(10))
arrange(df, a)
arrange(name_rows(df), a)
name_rows(arrange(name_rows(df), a))
Compute names of quoted variables.
Description
Figure out names of quoted variables, using specified names if they exist,
otherwise converting the values to character strings. This may create
variable names that can only be accessed using ``
.
Usage
## S3 method for class 'quoted'
names(x)
Number of unique values.
Description
Calculate number of unique values of a variable as efficiently as possible.
Usage
nunique(x)
Arguments
x |
vector |
Monthly ozone measurements over Central America.
Description
This data set is a subset of the data from the 2006 ASA Data expo challenge, https://community.amstat.org/jointscsg-section/dataexpo/dataexpo2006. The data are monthly ozone averages on a very coarse 24 by 24 grid covering Central America, from Jan 1995 to Dec 2000. The data are stored in a 3d array with the first two dimensions representing latitude and longitude, and the third representing time.
Usage
ozone
Format
A 24 x 24 x 72 numeric array
References
https://community.amstat.org/jointscsg-section/dataexpo/dataexpo2006
Examples
value <- ozone[1, 1, ]
time <- 1:72
month.abbr <- c("Jan", "Feb", "Mar", "Apr", "May",
"Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
month <- factor(rep(month.abbr, length = 72), levels = month.abbr)
year <- rep(1:6, each = 12)
deseasf <- function(value) lm(value ~ month - 1)
models <- alply(ozone, 1:2, deseasf)
coefs <- laply(models, coef)
dimnames(coefs)[[3]] <- month.abbr
names(dimnames(coefs))[3] <- "month"
deseas <- laply(models, resid)
dimnames(deseas)[[3]] <- 1:72
names(dimnames(deseas))[3] <- "time"
dim(coefs)
dim(deseas)
Deprecated Functions in Package plyr
Description
These functions are provided for compatibility with older versions of
plyr
only, and may be defunct as soon as the next release.
Details
Print quoted variables.
Description
Display the structure of quoted variables
Usage
## S3 method for class 'quoted'
print(x, ...)
Print split.
Description
Don't print labels, so it appears like a regular list
Usage
## S3 method for class 'split'
print(x, ...)
Arguments
x |
object to print |
... |
unused |
Null progress bar
Description
A progress bar that does nothing
Usage
progress_none()
Details
This is the default progress bar used by plyr functions. It's very simple to understand - it does nothing!
See Also
Other progress bars:
progress_text()
,
progress_time()
,
progress_tk()
,
progress_win()
Examples
l_ply(1:100, identity, .progress = "none")
Text progress bar.
Description
A textual progress bar
Usage
progress_text(style = 3, ...)
Arguments
style |
style of text bar, see Details section of |
... |
other arguments passed on to |
Details
This progress bar displays a textual progress bar that works on all
platforms. It is a thin wrapper around the built-in
setTxtProgressBar
and can be customised in the same way.
See Also
Other progress bars:
progress_none()
,
progress_time()
,
progress_tk()
,
progress_win()
Examples
l_ply(1:100, identity, .progress = "text")
l_ply(1:100, identity, .progress = progress_text(char = "-"))
Text progress bar with time.
Description
A textual progress bar that estimates time remaining. It displays the estimated time remaining and, when finished, total duration.
Usage
progress_time()
See Also
Other progress bars:
progress_none()
,
progress_text()
,
progress_tk()
,
progress_win()
Examples
l_ply(1:100, function(x) Sys.sleep(.01), .progress = "time")
Graphical progress bar, powered by Tk.
Description
A graphical progress bar displayed in a Tk window
Usage
progress_tk(title = "plyr progress", label = "Working...", ...)
Arguments
title |
window title |
label |
progress bar label (inside window) |
... |
other arguments passed on to |
Details
This graphical progress bar will appear in a separate window.
See Also
tkProgressBar
for the function that powers this progress bar
Other progress bars:
progress_none()
,
progress_text()
,
progress_time()
,
progress_win()
Examples
## Not run:
l_ply(1:100, identity, .progress = "tk")
l_ply(1:100, identity, .progress = progress_tk(width=400))
l_ply(1:100, identity, .progress = progress_tk(label=""))
## End(Not run)
Graphical progress bar, powered by Windows.
Description
A graphical progress bar displayed in a separate window
Usage
progress_win(title = "plyr progress", ...)
Arguments
title |
window title |
... |
other arguments passed on to |
Details
This graphical progress bar only works on Windows.
See Also
winProgressBar
for the function that powers this progress bar
Other progress bars:
progress_none()
,
progress_text()
,
progress_time()
,
progress_tk()
Examples
## Not run:
l_ply(1:100, identity, .progress = "win")
l_ply(1:100, identity, .progress = progress_win(title="Working..."))
## End(Not run)
Quick data frame.
Description
Experimental version of as.data.frame
that converts a
list to a data frame, but doesn't do any checks to make sure it's a
valid format. Much faster.
Usage
quickdf(list)
Arguments
list |
list to convert to data frame |
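A small sketch of the conversion, assuming plyr is installed. The list cols is made up; because no validity checks are performed, the caller is responsible for supplying equal-length, named columns.

```r
library(plyr)

# A named list whose elements all have the same length (hypothetical data)
cols <- list(a = 1:3, b = c("x", "y", "z"))

qdf <- quickdf(cols)
qdf
```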
Replicate expression and discard results.
Description
Evaluate expression n times then discard results
Usage
r_ply(.n, .expr, .progress = "none", .print = FALSE)
Arguments
.n |
number of times to evaluate the expression |
.expr |
expression to evaluate |
.progress |
name of the progress bar to use, see |
.print |
automatically print each result? (default: |
Details
This function runs an expression multiple times, discarding the results.
This function is equivalent to replicate, but never returns
anything.
References
Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. https://www.jstatsoft.org/v40/i01/.
Examples
r_ply(10, plot(runif(50)))
r_ply(25, hist(runif(1000)))
Replicate expression and return results in an array.
Description
Evaluate expression n times then combine results into an array
Usage
raply(.n, .expr, .progress = "none", .drop = TRUE)
Arguments
.n |
number of times to evaluate the expression |
.expr |
expression to evaluate |
.progress |
name of the progress bar to use, see |
.drop |
should extra dimensions of length 1 be dropped, simplifying the output. Defaults to |
Details
This function runs an expression multiple times, and combines the
results into an array. If there are no results, then this function
returns a vector of length 0 (vector()).
This function is equivalent to replicate
, but will always
return results as a vector, matrix or array.
Value
if results are atomic with same type and dimensionality, a vector, matrix or array; otherwise, a list-array (a list with dimensions)
References
Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. https://www.jstatsoft.org/v40/i01/.
Examples
raply(100, mean(runif(100)))
raply(100, each(mean, var)(runif(100)))
raply(10, runif(4))
raply(10, matrix(runif(4), nrow=2))
# See the central limit theorem in action
hist(raply(1000, mean(rexp(10))))
hist(raply(1000, mean(rexp(100))))
hist(raply(1000, mean(rexp(1000))))
Combine data.frames by row, filling in missing columns.
Description
rbind
s a list of data frames filling missing columns with NA.
Usage
rbind.fill(...)
Arguments
... |
input data frames to row bind together. The first argument can be a list of data frames, in which case all other arguments are ignored. Any NULL inputs are silently dropped. If all inputs are NULL, the output is NULL. |
Details
This is an enhancement to rbind
that adds in columns
that are not present in all inputs, accepts a list of data frames, and
operates substantially faster.
Column names and types in the output will appear in the order in which they were encountered.
Unordered factor columns will have their levels unified and character data bound with factors will be converted to character. POSIXct data will be converted to be in the same time zone. Array and matrix columns must have identical dimensions after the row count. Aside from these there are no general checks that each column is of consistent data type.
Value
a single data frame
See Also
Other binding functions:
rbind.fill.matrix()
Examples
rbind.fill(mtcars[c("mpg", "wt")], mtcars[c("wt", "cyl")])
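A smaller sketch may make the filling behaviour easier to see. The data frames a and b are made-up inputs, and stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version.

```r
library(plyr)

# Two hypothetical data frames that share only the column "y"
a <- data.frame(x = 1:2, y = c("p", "q"), stringsAsFactors = FALSE)
b <- data.frame(y = "r", z = TRUE, stringsAsFactors = FALSE)

out <- rbind.fill(a, b)

# Columns appear in the order they were first encountered: x, y, z
names(out)

# Cells with no source value are filled with NA
out

# A list of data frames is accepted as the first argument
same <- rbind.fill(list(a, b))
```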
Bind matrices by row, and fill missing columns with NA.
Description
The matrices are bound together using their column names or the column
indices (in that order of precedence). Numeric columns may be converted to
character beforehand, e.g. using format. If a matrix doesn't have
colnames, the column number is used. Note that this means that a
column with name "1" is merged with the first column of a matrix
without name and so on. The returned matrix will always have column names.
Usage
rbind.fill.matrix(...)
Arguments
... |
the matrices to rbind. The first argument can be a list of matrices, in which case all other arguments are ignored. |
Details
Vectors are converted to 1-column matrices.
Matrices of factors are not supported. (They are quite inconvenient anyway.) You may convert them first to either numeric or character matrices. If matrices of different types are merged, then normal conversion precedence will apply.
Row names are ignored.
Value
a matrix with column names
Author(s)
C. Beleites
See Also
Other binding functions:
rbind.fill()
Examples
A <- matrix (1:4, 2)
B <- matrix (6:11, 2)
A
B
rbind.fill.matrix (A, B)
colnames (A) <- c (3, 1)
A
rbind.fill.matrix (A, B)
rbind.fill.matrix (A, 99)
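The examples above rely on column positions. As a complementary sketch, the following binds two matrices by column name. The matrices P and Q and their column names are hypothetical.

```r
library(plyr)

# Hypothetical named matrices sharing only column "b"
P <- matrix(1:4, nrow = 2, dimnames = list(NULL, c("a", "b")))
Q <- matrix(5:6, nrow = 1, dimnames = list(NULL, c("b", "c")))

out <- rbind.fill.matrix(P, Q)

# The result has the union of the column names; gaps are NA
colnames(out)
out
```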
Replicate expression and return results in a data frame.
Description
Evaluate expression n times then combine results into a data frame
Usage
rdply(.n, .expr, .progress = "none", .id = NA)
Arguments
.n |
number of times to evaluate the expression |
.expr |
expression to evaluate |
.progress |
name of the progress bar to use, see
|
.id |
name of the index column. Pass |
Details
This function runs an expression multiple times, and combines the result into
a data frame. If there are no results, then this function returns a data
frame with zero rows and columns (data.frame()). This function is
equivalent to replicate, but will always return results as a
data frame.
Value
a data frame
References
Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. https://www.jstatsoft.org/v40/i01/.
Examples
rdply(20, mean(runif(100)))
rdply(20, each(mean, var)(runif(100)))
rdply(20, data.frame(x = runif(2)))
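The following sketch looks at the index column that rdply adds. It assumes that passing .id = NULL suppresses the index column; that behaviour is not stated in the text above, so treat it as an assumption to verify against your installed version.

```r
library(plyr)

# Each replicate returns a 2-row data frame, so 3 replicates give 6 rows
with_index <- rdply(3, data.frame(x = runif(2)))

# Assumption: .id = NULL drops the index column entirely
no_index <- rdply(3, data.frame(x = runif(2)), .id = NULL)

names(with_index)
nrow(with_index)
names(no_index)
```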
Reduce dimensions.
Description
Remove extraneous dimensions
Usage
reduce_dim(x)
Arguments
x |
array |
Modify names by name, not position.
Description
Modify names by name, not position.
Usage
rename(x, replace, warn_missing = TRUE, warn_duplicated = TRUE)
Arguments
x |
named object to modify |
replace |
named character vector, with new names as values, and old names as names. |
warn_missing |
print a message if any of the old names are
not actually present in |
warn_duplicated |
print a message if any name appears more
than once in |
Examples
x <- c("a" = 1, "b" = 2, d = 3, 4)
# Rename element "d" to "c", updating the variable "x" with the result
x <- rename(x, replace = c("d" = "c"))
x
# Rename column "disp" to "displacement"
rename(mtcars, c("disp" = "displacement"))
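Because matching is by name, the order of the entries in replace does not have to follow the order of the names in x. A minimal sketch, with a made-up named vector scores:

```r
library(plyr)

scores <- c(alpha = 1, beta = 2, gamma = 3)  # hypothetical data

# Entries in `replace` may come in any order; old names are the names,
# new names are the values
renamed <- rename(scores, c(gamma = "delta", alpha = "a"))
names(renamed)

# An old name that is absent leaves x unchanged; warn_missing = FALSE
# silences the message that would otherwise be printed
unchanged <- rename(scores, c(zeta = "z"), warn_missing = FALSE)
names(unchanged)
```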
Replace specified values with new values, in a factor or character vector.
Description
If x is a factor, the named levels of the factor will be
replaced with the new values.
Usage
revalue(x, replace = NULL, warn_missing = TRUE)
Arguments
x |
factor or character vector to modify |
replace |
named character vector, with new values as values, and old values as names. |
warn_missing |
print a message if any of the old values are
not actually present in |
Details
This function works only on character vectors and factors, but the
related mapvalues function works on vectors of any type and factors,
and instead of a named vector specifying the original and replacement values,
it takes two separate vectors.
See Also
mapvalues to replace values with vectors of any type
Examples
x <- c("a", "b", "c")
revalue(x, c(a = "A", c = "C"))
revalue(x, c("a" = "A", "c" = "C"))
y <- factor(c("a", "b", "c", "a"))
revalue(y, c(a = "A", c = "C"))
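To make the relationship with mapvalues concrete, the sketch below performs the same replacement both ways and then applies mapvalues to a numeric vector, which revalue does not handle. The vector codes is made up for illustration.

```r
library(plyr)

x <- c("a", "b", "c")

# revalue: one named vector, old values as names and new values as values
by_name <- revalue(x, c(a = "A", c = "C"))

# mapvalues: two parallel vectors, `from` and `to`
by_pair <- mapvalues(x, from = c("a", "c"), to = c("A", "C"))

identical(by_name, by_pair)

# mapvalues also works on non-character vectors
codes <- c(1, 2, 3, 2)
recoded <- mapvalues(codes, from = c(1, 3), to = c(10, 30))
recoded
```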
Replicate expression and return results in a list.
Description
Evaluate expression n times then combine results into a list
Usage
rlply(.n, .expr, .progress = "none")
Arguments
.n |
number of times to evaluate the expression |
.expr |
expression to evaluate |
.progress |
name of the progress bar to use, see |
Details
This function runs an expression multiple times, and combines the
result into a list. If there are no results, then this function will return
a list of length 0 (list()). This function is equivalent to
replicate, but will always return results as a list.
Value
list of results
References
Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. https://www.jstatsoft.org/v40/i01/.
Examples
mods <- rlply(100, lm(y ~ x, data=data.frame(x=rnorm(100), y=rnorm(100))))
hist(laply(mods, function(x) summary(x)$r.squared))
Round to multiple of any number.
Description
Round to multiple of any number.
Usage
round_any(x, accuracy, f = round)
Arguments
x |
numeric or date-time (POSIXct) vector to round |
accuracy |
number to round to; for POSIXct objects, a number of seconds |
f |
rounding function: floor, ceiling or round |
Examples
round_any(135, 10)
round_any(135, 100)
round_any(135, 25)
round_any(135, 10, floor)
round_any(135, 100, floor)
round_any(135, 25, floor)
round_any(135, 10, ceiling)
round_any(135, 100, ceiling)
round_any(135, 25, ceiling)
round_any(Sys.time() + 1:10, 5)
round_any(Sys.time() + 1:10, 5, floor)
round_any(Sys.time(), 3600)
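For plain numeric input, the results above are consistent with computing f(x / accuracy) * accuracy. The sketch below checks that relationship; the helper manual_round is hypothetical and is stated as an assumption about the behaviour, not as the package's source code.

```r
library(plyr)

# Hypothetical stand-in used only for comparison
manual_round <- function(x, accuracy, f = round) {
  f(x / accuracy) * accuracy
}

nearest_10 <- round_any(135, 10)           # round(13.5) * 10
nearest_25 <- round_any(135, 25)           # round(5.4) * 25
down_10    <- round_any(135, 10, floor)    # floor(13.5) * 10
up_25      <- round_any(135, 25, ceiling)  # ceiling(5.4) * 25

c(nearest_10, nearest_25, down_10, up_25)

# With the default f = round, exact halves follow base R's rule of
# rounding to the even digit, so 125 goes down to 120 here
half_case <- round_any(125, 10)
half_case
```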
‘Splat’ arguments to a function.
Description
Wraps a function in do.call, so instead of taking multiple arguments, it takes a single named list which will be interpreted as its arguments.
Usage
splat(flat)
Arguments
flat |
function to splat |
Details
This is useful when you want to pass a function a row of a data frame or array, and don't want to manually pull it apart in your function.
Value
a function
Examples
hp_per_cyl <- function(hp, cyl, ...) hp / cyl
splat(hp_per_cyl)(mtcars[1,])
splat(hp_per_cyl)(mtcars)
f <- function(mpg, wt, ...) data.frame(mw = mpg / wt)
ddply(mtcars, .(cyl), splat(f))
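A splatted function behaves like a do.call on the wrapped function. The sketch below compares the two; area and args are hypothetical names. The trailing ... in the function signature lets it ignore list elements it does not use.

```r
library(plyr)

# Hypothetical function; `...` swallows any unused elements
area <- function(width, height, ...) width * height

args <- list(width = 3, height = 4, label = "ignored")

via_splat  <- splat(area)(args)
via_docall <- do.call(area, args)

via_splat
identical(via_splat, via_docall)

# A one-row data frame is a list of columns, so it works the same way
from_row <- splat(area)(data.frame(width = 3, height = 4))
```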
Split indices.
Description
An optimised version of split for the special case of splitting row
indices into groups, as used by splitter_d.
Usage
split_indices(group, n = 0L)
Arguments
group |
integer indices |
n |
largest integer (may not appear in index). This is a hint: if
the largest value of |
Examples
split_indices(sample(10, 100, rep = TRUE))
split_indices(sample(10, 100, rep = TRUE), 10)
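With a small, fixed input the output is easier to read. The sketch below compares split_indices with the base R idiom it is meant to speed up. The last part assumes that supplying n larger than the largest group yields trailing empty groups, which is an inference from the description of n rather than something stated above.

```r
library(plyr)

group <- c(1L, 2L, 1L, 3L, 2L)  # integer group codes, 1-based

idx <- split_indices(group)

# Base R equivalent for this input
base_idx <- unname(split(seq_along(group), group))

idx

# Assumption: groups up to n that never occur come back empty
padded <- split_indices(group, 5L)
lengths(padded)
```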
Generate labels for split data frame.
Description
Create data frame giving labels for split data frame.
Usage
split_labels(splits, drop, id = plyr::id(splits, drop = TRUE))
Arguments
splits |
list of variables to split up by |
drop |
whether all possible combinations should be considered, or only those present in the data |
Split an array by .margins.
Description
Split a 2d or higher data structure into lower-d pieces based on the supplied margins.
Usage
splitter_a(data, .margins = 1L, .expand = TRUE, .id = NA)
Arguments
data |
>1d data structure (matrix, data.frame or array) |
.margins |
a vector giving the subscripts to split up |
.expand |
if splitting a data frame by row, should output be 1d (.expand = FALSE), with an element for each row; or nd (.expand = TRUE), with a dimension for each variable. |
.id |
names of the split label.
Pass |
Details
This is the workhorse of the a*ply functions. Given a >1 d
data structure (matrix, array, data.frame), it splits it into pieces
based on the subscripts that you supply. Each piece is a lower dimensional
slice.
The margins are specified in the same way as apply, but
splitter_a just splits up the data, while apply also
applies a function and combines the pieces back together. This function
also includes enough information to recreate the split from attributes on
the list of pieces.
Value
a list of lower-d slices, with attributes that record split details
See Also
Other splitter functions:
splitter_d()
Examples
plyr:::splitter_a(mtcars, 1)
plyr:::splitter_a(mtcars, 2)
plyr:::splitter_a(ozone, 2)
plyr:::splitter_a(ozone, 3)
plyr:::splitter_a(ozone, 1:2)
Split a data frame by variables.
Description
Split a data frame into pieces based on variables contained in that data frame
Usage
splitter_d(data, .variables = NULL, drop = TRUE)
Arguments
data |
data frame |
.variables |
a quoted list of variables |
drop |
drop unused factor levels? |
Details
This is the workhorse of the d*ply functions. Based on the variables
you supply, it breaks up a single data frame into a list of data frames,
each containing a single combination from the levels of the specified
variables.
This is basically a thin wrapper around split which
evaluates the variables in the context of the data, and includes enough
information to reconstruct the labelling of the data frame after
other operations.
Value
a list of data.frames, with attributes that record split details
See Also
. for quoting variables, split
Other splitter functions:
splitter_a()
Examples
plyr:::splitter_d(mtcars, .(cyl))
plyr:::splitter_d(mtcars, .(vs, am))
plyr:::splitter_d(mtcars, .(am, vs))
mtcars$cyl2 <- factor(mtcars$cyl, levels = c(2, 4, 6, 8, 10))
plyr:::splitter_d(mtcars, .(cyl2), drop = TRUE)
plyr:::splitter_d(mtcars, .(cyl2), drop = FALSE)
mtcars$cyl3 <- ifelse(mtcars$vs == 1, NA, mtcars$cyl)
plyr:::splitter_d(mtcars, .(cyl3))
plyr:::splitter_d(mtcars, .(cyl3, vs))
plyr:::splitter_d(mtcars, .(cyl3, vs), drop = FALSE)
Remove splitting variables from a data frame.
Description
This is useful when you want to perform some operation to every column in the data frame, except the variables that you have used to split it. These variables will be automatically added back on to the result when combining all results together.
Usage
strip_splits(df)
Arguments
df |
data frame produced by |
Examples
dlply(mtcars, c("vs", "am"))
dlply(mtcars, c("vs", "am"), strip_splits)
Summarise a data frame.
Description
Summarise works in an analogous way to mutate, except
instead of adding columns to an existing data frame, it creates a new
data frame. This is particularly useful in conjunction with
ddply as it makes it easy to perform group-wise summaries.
Usage
summarise(.data, ...)
Arguments
.data |
the data frame to be summarised |
... |
further arguments of the form var = value |
Note
Be careful when using existing variable names; the corresponding columns will be immediately updated with the new data and this can affect subsequent operations referring to those variables.
Examples
# Let's extract the number of teams and total period of time
# covered by the baseball dataframe
summarise(baseball,
duration = max(year) - min(year),
nteams = length(unique(team)))
# Combine with ddply to do that for each separate id
ddply(baseball, "id", summarise,
duration = max(year) - min(year),
nteams = length(unique(team)))
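The same pattern on the built-in mtcars data gives a result small enough to check by eye. The sketch also illustrates the point made in the Note: later expressions see the columns created by earlier ones. The column names mean_mpg, doubled and n are chosen only for illustration.

```r
library(plyr)

# A later expression can refer to a column created earlier in the call
overall <- summarise(mtcars,
  mean_mpg = mean(mpg),
  doubled  = mean_mpg * 2)
overall

# Group-wise summary: one row per value of cyl
by_cyl <- ddply(mtcars, "cyl", summarise,
  mean_mpg = mean(mpg),
  n        = length(mpg))
by_cyl
```

The result contains only the grouping variable and the newly defined columns; the original columns of mtcars are not carried over.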
Take a subset along an arbitrary dimension
Description
Take a subset along an arbitrary dimension
Usage
take(x, along, indices, drop = FALSE)
Arguments
x |
matrix or array to subset |
along |
dimension to subset along |
indices |
the indices to select |
drop |
should the dimensions of the array be simplified? Defaults
to |
Examples
x <- array(seq_len(3 * 4 * 5), c(3, 4, 5))
take(x, 3, 1)
take(x, 2, 1)
take(x, 1, 1)
take(x, 3, 1, drop = TRUE)
take(x, 2, 1, drop = TRUE)
take(x, 1, 1, drop = TRUE)
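For a fixed number of dimensions, take corresponds to ordinary bracket indexing with the other positions left empty. The sketch below checks this for the third dimension of a 3 x 4 x 5 array; the point of take is that the dimension can be chosen at run time.

```r
library(plyr)

x <- array(seq_len(3 * 4 * 5), c(3, 4, 5))

# First slice along dimension 3, keeping all three dimensions
kept <- take(x, 3, 1)
dim(kept)

# Equivalent hand-written indexing for this particular dimension
by_hand <- x[, , 1, drop = FALSE]
all(kept == by_hand)

# With drop = TRUE the length-1 dimension is removed
dropped <- take(x, 3, 1, drop = TRUE)
dim(dropped)
```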
Function that always returns true.
Description
Function that always returns true.
Usage
true(...)
Arguments
... |
all input ignored |
Value
TRUE
See Also
colwise
which uses it
Try, with default in case of error.
Description
try_default wraps try so that it returns a default value in the case of error.
tryNULL provides a useful special case when dealing with lists.
Usage
try_default(expr, default, quiet = FALSE)
tryNULL(expr)
Arguments
expr |
expression to try |
default |
default value in case of error |
quiet |
should errors be suppressed (TRUE) or printed (FALSE, default) |
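A minimal sketch of both functions follows. The error message "boom" and the variable names are made up for illustration.

```r
library(plyr)

# No error: the value of the expression is returned
ok <- try_default(log(10), default = NA_real_)

# Error: the default is returned instead of stopping execution
failed <- try_default(stop("boom"), default = NA_real_, quiet = TRUE)

# tryNULL returns NULL on error
nothing <- tryNULL(stop("boom"))

ok
failed
is.null(nothing)
```

Returning NULL on failure is convenient with lists because NULL elements can be removed afterwards.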
See Also
Apply with built in try. Uses compact, lapply and tryNULL
Description
Apply with built in try. Uses compact, lapply and tryNULL
Usage
tryapply(list, fun, ...)
Arguments
list |
list to apply function |
fun |
function |
... |
further arguments to |
Un-rowname.
Description
Strip rownames from an object
Usage
unrowname(x)
Arguments
x |
data frame |
Vector aggregate.
Description
This function is somewhat similar to tapply, but is designed for
use in conjunction with id. It is simpler in that it only
accepts a single grouping vector (use id if you have more)
and uses vapply internally, using the .default value
as the template.
Usage
vaggregate(.value, .group, .fun, ..., .default = NULL, .n = nlevels(.group))
Arguments
.value |
vector of values to aggregate |
.group |
grouping vector |
.fun |
aggregation function |
... |
other arguments passed on to |
.default |
default value used for missing groups. This argument is also used as the template for function output. |
.n |
total number of groups |
Details
vaggregate should be faster than tapply in most situations
because it avoids making a copy of the data.
Examples
# Some examples of use borrowed from ?tapply
n <- 17; fac <- factor(rep(1:3, length.out = n), levels = 1:5)
table(fac)
vaggregate(1:n, fac, sum)
vaggregate(1:n, fac, sum, .default = NA_integer_)
vaggregate(1:n, fac, range)
vaggregate(1:n, fac, range, .default = c(NA, NA) + 0)
vaggregate(1:n, fac, quantile)
# Unlike tapply, vaggregate does not support multi-d output:
tapply(warpbreaks$breaks, warpbreaks[,-1], sum)
vaggregate(warpbreaks$breaks, id(warpbreaks[,-1]), sum)
# But it is about 10x faster
x <- rnorm(1e6)
y1 <- sample.int(10, 1e6, replace = TRUE)
system.time(tapply(x, y1, mean))
system.time(vaggregate(x, y1, mean))
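The role of .default and .n is easiest to see on a tiny input. In the sketch below the grouping vector is already integer and .n is passed explicitly. The data are made up.

```r
library(plyr)

values <- c(1, 2, 3, 4)
groups <- c(1L, 2L, 1L, 2L)  # integer group codes

# Two groups, both present
sums <- vaggregate(values, groups, sum, .n = 2)
sums

# Same result as tapply for this input
as.vector(tapply(values, groups, sum))

# Declaring a third, empty group: it receives .default, which also
# serves as the template for the type and length of each result
with_gap <- vaggregate(values, groups, sum, .default = NA_real_, .n = 3)
with_gap
```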