Title: | Advanced and Fast Data Transformation |
Version: | 2.1.0 |
Date: | 2025-03-10 |
Description: | A C/C++ based package for advanced data transformation and statistical computing in R that is extremely fast, class-agnostic, robust and programmer friendly. Core functionality includes a rich set of S3 generic grouped and weighted statistical functions for vectors, matrices and data frames, which provide efficient low-level vectorizations, OpenMP multithreading, and skip missing values by default. These are integrated with fast grouping and ordering algorithms (also callable from C), and efficient data manipulation functions. The package also provides a flexible and rigorous approach to time series and panel data in R. It further includes fast functions for common statistical procedures, detailed (grouped, weighted) summary statistics, powerful tools to work with nested data, fast data object conversions, functions for memory efficient R programming, and helpers to effectively deal with variable labels, attributes, and missing data. It is well integrated with base R classes, 'dplyr'/'tibble', 'data.table', 'sf', 'units', 'plm' (panel-series and data frames), and 'xts'/'zoo'. |
URL: | https://sebkrantz.github.io/collapse/, https://github.com/SebKrantz/collapse |
BugReports: | https://github.com/SebKrantz/collapse/issues |
License: | GPL-2 | GPL-3 | file LICENSE [expanded from: GPL (≥ 2) | file LICENSE] |
Encoding: | UTF-8 |
LazyData: | true |
Depends: | R (≥ 4.1.0) |
Imports: | Rcpp (≥ 1.0.1) |
LinkingTo: | Rcpp |
Suggests: | fastverse, data.table, magrittr, kit, xts, zoo, plm, fixest, vars, RcppArmadillo, RcppEigen, tibble, dplyr, ggplot2, scales, microbenchmark, testthat, covr, knitr, rmarkdown, withr, bit64 |
VignetteBuilder: | knitr |
NeedsCompilation: | yes |
Packaged: | 2025-03-10 04:37:59 UTC; sebastiankrantz |
Author: | Sebastian Krantz |
Maintainer: | Sebastian Krantz <sebastian.krantz@graduateinstitute.ch> |
Repository: | CRAN |
Date/Publication: | 2025-03-10 11:40:02 UTC |
Advanced and Fast Data Transformation
Description
collapse is a C/C++ based package for data transformation and statistical computing in R. Its aims are:
To facilitate complex data transformation, exploration and computing tasks in R.
To help make R code fast, flexible, parsimonious and programmer friendly.
It is made compatible with the tidyverse, data.table, sf, units, xts/zoo, and the plm approach to panel data.
Getting Started
Read the short vignette on documentation resources, and check out the built in documentation.
Details
collapse provides an integrated suite of statistical and data manipulation functions that greatly extend and enhance the capabilities of base R. In a nutshell, collapse provides:
Fast C/C++ based (grouped, weighted) computations embedded in highly optimized R code.
More complex statistical, time series / panel data and recursive (list-processing) operations.
A flexible and generic approach supporting and preserving many R objects.
Optimized programming in standard and non-standard evaluation.
The statistical functions in collapse are S3 generic with core methods for vectors, matrices and data frames, and internally support grouped and weighted computations carried out in C/C++.
Additional methods and C-level features enable broad based compatibility with dplyr (grouped tibble), data.table, sf and plm panel data classes. Functions and core methods seek to preserve object attributes (including column attributes such as variable labels), ensuring flexibility and effective workflows with a very broad range of R objects (including most time-series classes). See also the vignette on collapse's handling of R objects.
Missing values are efficiently skipped at C/C++ level. The package default is na.rm = TRUE
. This can be changed using set_collapse(na.rm = FALSE)
. Missing weights are generally supported.
collapse installs with a built-in hierarchical documentation facilitating the use of the package.
The package is coded both in C and C++ and built with Rcpp, but also uses C/C++ functions from data.table, kit, fixest, weights, stats and RcppArmadillo / RcppEigen.
Author(s)
Maintainer: Sebastian Krantz sebastian.krantz@graduateinstitute.ch
Other contributors from packages collapse utilizes:
Matt Dowle, Arun Srinivasan and contributors worldwide (data.table)
Dirk Eddelbuettel and contributors worldwide (Rcpp, RcppArmadillo, RcppEigen)
Morgan Jacob (kit)
Laurent Berge (fixest)
Josh Pasek (weights)
R Core Team and contributors worldwide (stats)
I thank many people from diverse fields for helpful answers on Stackoverflow, Joris Meys for encouraging me and helping to set up the GitHub repository for collapse, and many other people for feature requests and helpful suggestions.
Developing / Bug Reporting
Please report issues at https://github.com/SebKrantz/collapse/issues.
Please send pull-requests to the 'development' branch of the repository.
Examples
## Note: this set of examples is is certainly non-exhaustive and does not
## showcase many recent features, but remains a very good starting point
## Let's start with some statistical programming
v <- iris$Sepal.Length
d <- num_vars(iris) # Saving numeric variables
f <- iris$Species # Factor
# Simple statistics
fmean(v) # vector
fmean(qM(d)) # matrix (qM is a faster as.matrix)
fmean(d) # data.frame
# Preserving data structure
fmean(qM(d), drop = FALSE) # Still a matrix
fmean(d, drop = FALSE) # Still a data.frame
# Weighted statistics, supported by most functions...
w <- abs(rnorm(fnrow(iris)))
fmean(d, w = w)
# Grouped statistics...
fmean(d, f)
# Groupwise-weighted statistics...
fmean(d, f, w)
# Simple Transformations...
head(fmode(d, TRA = "replace")) # Replacing values with the mode
head(fmedian(d, TRA = "-")) # Subtracting the median
head(fsum(d, TRA = "%")) # Computing percentages
head(fsd(d, TRA = "/")) # Dividing by the standard-deviation (scaling), etc...
# Weighted Transformations...
head(fnth(d, 0.75, w = w, TRA = "replace")) # Replacing by the weighted 3rd quartile
# Grouped Transformations...
head(fvar(d, f, TRA = "replace")) # Replacing values with the group variance
head(fsd(d, f, TRA = "/")) # Grouped scaling
head(fmin(d, f, TRA = "-")) # Setting the minimum value in each species to 0
head(fsum(d, f, TRA = "/")) # Dividing by the sum (proportions)
head(fmedian(d, f, TRA = "-")) # Groupwise de-median
head(ffirst(d, f, TRA = "%%")) # Taking modulus of first group-value, etc. ...
# Grouped and weighted transformations...
head(fsd(d, f, w, "/"), 3) # weighted scaling
head(fmedian(d, f, w, "-"), 3) # subtracting the weighted group-median
head(fmode(d, f, w, "replace"), 3) # replace with weighted statistical mode
## Some more advanced transformations...
head(fbetween(d)) # Averaging (faster t.: fmean(d, TRA = "replace"))
head(fwithin(d)) # Centering (faster than: fmean(d, TRA = "-"))
head(fwithin(d, f, w)) # Grouped and weighted (same as fmean(d, f, w, "-"))
head(fwithin(d, f, w, mean = 5)) # Setting a custom mean
head(fwithin(d, f, w, theta = 0.76)) # Quasi-centering i.e. d - theta*fbetween(d, f, w)
head(fwithin(d, f, w, mean = "overall.mean")) # Preserving the overall mean of the data
head(fscale(d)) # Scaling and centering
head(fscale(d, mean = 5, sd = 3)) # Custom scaling and centering
head(fscale(d, mean = FALSE, sd = 3)) # Mean preserving scaling
head(fscale(d, f, w)) # Grouped and weighted scaling and centering
head(fscale(d, f, w, mean = 5, sd = 3)) # Custom grouped and weighted scaling and centering
head(fscale(d, f, w, mean = FALSE, # Preserving group means
sd = "within.sd")) # and setting group-sd to fsd(fwithin(d, f, w), w = w)
head(fscale(d, f, w, mean = "overall.mean", # Full harmonization of group means and variances,
sd = "within.sd")) # while preserving the level and scale of the data.
head(get_vars(iris, 1:2)) # Use get_vars for fast selecting, gv is shortcut
head(fhdbetween(gv(iris, 1:2), gv(iris, 3:5))) # Linear prediction with factors and covariates
head(fhdwithin(gv(iris, 1:2), gv(iris, 3:5))) # Linear partialling out factors and covariates
ss(iris, 1:10, 1:2) # Similarly fsubset/ss for fast subsetting rows
# Simple Time-Computations..
head(flag(AirPassengers, -1:3)) # One lead and three lags
head(fdiff(EuStockMarkets, # Suitably lagged first and second differences
c(1, frequency(EuStockMarkets)), diff = 1:2))
head(fdiff(EuStockMarkets, rho = 0.87)) # Quasi-differences (x_t - rho*x_t-1)
head(fdiff(EuStockMarkets, log = TRUE)) # Log-differences
head(fgrowth(EuStockMarkets)) # Exact growth rates (percentage change)
head(fgrowth(EuStockMarkets, logdiff = TRUE)) # Log-difference growth rates (percentage change)
# Note that it is not necessary to use factors for grouping.
fmean(gv(mtcars, -c(2,8:9)), mtcars$cyl) # Can also use vector (internally converted using qF())
fmean(gv(mtcars, -c(2,8:9)),
gv(mtcars, c(2,8:9))) # or a list of vector (internally grouped using GRP())
g <- GRP(mtcars, ~ cyl + vs + am) # It is also possible to create grouping objects
print(g) # These are instructive to learn about the grouping,
plot(g) # and are directly handed down to C++ code
fmean(gv(mtcars, -c(2,8:9)), g) # This can speed up multiple computations over same groups
fsd(gv(mtcars, -c(2,8:9)), g)
# Factors can efficiently be created using qF()
f1 <- qF(mtcars$cyl) # Unlike GRP objects, factors are checked for NA's
f2 <- qF(mtcars$cyl, na.exclude = FALSE) # This can however be avoided through this option
class(f2) # Note the added class
library(microbenchmark)
microbenchmark(fmean(mtcars, f1), fmean(mtcars, f2)) # A minor difference, larger on larger data
with(mtcars, finteraction(cyl, vs, am)) # Efficient interactions of vectors and/or factors
finteraction(gv(mtcars, c(2,8:9))) # .. or lists of vectors/factors
# Simple row- or column-wise computations on matrices or data frames with dapply()
dapply(mtcars, quantile) # column quantiles
dapply(mtcars, quantile, MARGIN = 1) # Row-quantiles
# dapply preserves the data structure of any matrices / data frames passed
# Some fast matrix row/column functions are also provided by the matrixStats package
# Similarly, BY performs grouped comptations
BY(mtcars, f2, quantile)
BY(mtcars, f2, quantile, expand.wide = TRUE)
# For efficient (grouped) replacing and sweeping out computed statistics, use TRA()
sds <- fsd(mtcars)
head(TRA(mtcars, sds, "/")) # Simple scaling (if sd's not needed, use fsd(mtcars, TRA = "/"))
microbenchmark(TRA(mtcars, sds, "/"), sweep(mtcars, 2, sds, "/")) # A remarkable performance gain..
sds <- fsd(mtcars, f2)
head(TRA(mtcars, sds, "/", f2)) # Groupd scaling (if sd's not needed: fsd(mtcars, f2, TRA = "/"))
# All functions above perserve the structure of matrices / data frames
# If conversions are required, use these efficient functions:
mtcarsM <- qM(mtcars) # Matrix from data.frame
head(qDF(mtcarsM)) # data.frame from matrix columns
head(mrtl(mtcarsM, TRUE, "data.frame")) # data.frame from matrix rows, etc..
head(qDT(mtcarsM, "cars")) # Saving row.names when converting matrix to data.table
head(qDT(mtcars, "cars")) # Same use a data.frame
## Now let's get some real data and see how we can use this power for data manipulation
head(wlddev) # World Bank World Development Data: 216 countries, 61 years, 5 series (columns 9-13)
# Starting with some discriptive tools...
namlab(wlddev, class = TRUE) # Show variable names, labels and classes
fnobs(wlddev) # Observation count
pwnobs(wlddev) # Pairwise observation count
head(fnobs(wlddev, wlddev$country)) # Grouped observation count
fndistinct(wlddev) # Distinct values
descr(wlddev) # Describe data
varying(wlddev, ~ country) # Show which variables vary within countries
qsu(wlddev, pid = ~ country, # Panel-summarize columns 9 though 12 of this data
cols = 9:12, vlabels = TRUE) # (between and within countries)
qsu(wlddev, ~ region, ~ country, # Do all of that by region and also compute higher moments
cols = 9:12, higher = TRUE) # -> returns a 4D array
qsu(wlddev, ~ region, ~ country, cols = 9:12,
higher = TRUE, array = FALSE) |> # Return as a list of matrices..
unlist2d(c("Variable","Trans"), row.names = "Region") |> head()# and turn into a tidy data.frame
pwcor(num_vars(wlddev), P = TRUE) # Pairwise correlations with p-value
pwcor(fmean(num_vars(wlddev), wlddev$country), P = TRUE) # Correlating country means
pwcor(fwithin(num_vars(wlddev), wlddev$country), P = TRUE) # Within-country correlations
psacf(wlddev, ~country, ~year, cols = 9:12) # Panel-data Autocorrelation function
pspacf(wlddev, ~country, ~year, cols = 9:12) # Partial panel-autocorrelations
psmat(wlddev, ~iso3c, ~year, cols = 9:12) |> plot() # Convert panel to 3D array and plot
## collapse offers a few very efficent functions for data manipulation:
# Fast selecting and replacing columns
series <- get_vars(wlddev, 9:12) # Same as wlddev[9:12] but 2x faster
series <- fselect(wlddev, PCGDP:ODA) # Same thing: > 100x faster than dplyr::select
get_vars(wlddev, 9:12) <- series # Replace, 8x faster wlddev[9:12] <- series + replaces names
fselect(wlddev, PCGDP:ODA) <- series # Same thing
# Fast subsetting
head(fsubset(wlddev, country == "Ireland", -country, -iso3c))
head(fsubset(wlddev, country == "Ireland" & year > 1990, year, PCGDP:ODA))
ss(wlddev, 1:10, 1:10) # This is an order of magnitude faster than wlddev[1:10, 1:10]
# Fast transforming
head(ftransform(wlddev, ODA_GDP = ODA / PCGDP, ODA_LIFEEX = sqrt(ODA) / LIFEEX))
settransform(wlddev, ODA_GDP = ODA / PCGDP, ODA_LIFEEX = sqrt(ODA) / LIFEEX) # by reference
head(ftransform(wlddev, PCGDP = NULL, ODA = NULL, GINI_sum = fsum(GINI)))
head(ftransformv(wlddev, 9:12, log)) # Can also transform with lists of columns
head(ftransformv(wlddev, 9:12, fscale, apply = FALSE)) # apply = FALSE invokes fscale.data.frame
settransformv(wlddev, 9:12, fscale, apply = FALSE) # Changing the data by reference
ftransform(wlddev) <- fscale(gv(wlddev, 9:12)) # Same thing (using replacement method)
library(magrittr) # Same thing, using magrittr
wlddev %<>% ftransformv(9:12, fscale, apply = FALSE)
wlddev %>% ftransform(gv(., 9:12) |> # With compound pipes: Scaling and lagging
fscale() |> flag(0:2, iso3c, year)) |> head()
# Fast reordering
head(roworder(wlddev, -country, year))
head(colorder(wlddev, country, year))
# Fast renaming
head(frename(wlddev, country = Ctry, year = Yr))
setrename(wlddev, country = Ctry, year = Yr) # By reference
head(frename(wlddev, tolower, cols = 9:12))
# Fast grouping
fgroup_by(wlddev, Ctry, decade) |> fgroup_vars() |> head()
rm(wlddev) # .. but only works with collapse functions
## Now lets start putting things together
wlddev |> fsubset(year > 1990, region, income, PCGDP:ODA) |>
fgroup_by(region, income) |> fmean() # Fast aggregation using the mean
# Same thing using dplyr manipulation verbs
library(dplyr)
wlddev |> filter(year > 1990) |> select(region, income, PCGDP:ODA) |>
group_by(region,income) |> fmean() # This is already a lot faster than summarize_all(mean)
wlddev |> fsubset(year > 1990, region, income, PCGDP:POP) |>
fgroup_by(region, income) |> fmean(POP) # Weighted group means
wlddev |> fsubset(year > 1990, region, income, PCGDP:POP) |>
fgroup_by(region, income) |> fsd(POP) # Weighted group standard deviations
wlddev |> na_omit(cols = "POP") |> fgroup_by(region, income) |>
fselect(PCGDP:POP) |> fnth(0.75, POP) # Weighted group third quartile
wlddev |> fgroup_by(country) |> fselect(PCGDP:ODA) |>
fwithin() |> head() # Within transformation
wlddev |> fgroup_by(country) |> fselect(PCGDP:ODA) |>
fmedian(TRA = "-") |> head() # Grouped centering using the median
# Replacing data points by the weighted first quartile:
wlddev |> na_omit(cols = "POP") |> fgroup_by(country) |>
fselect(country, year, PCGDP:POP) %>%
ftransform(fselect(., -country, -year) |>
fnth(0.25, POP, "fill")) |> head()
wlddev |> fgroup_by(country) |> fselect(PCGDP:ODA) |> fscale() |> head() # Standardizing
wlddev |> fgroup_by(country) |> fselect(PCGDP:POP) |>
fscale(POP) |> head() # Weighted..
wlddev |> fselect(country, year, PCGDP:ODA) |> # Adding 1 lead and 2 lags of each variable
fgroup_by(country) |> flag(-1:2, year) |> head()
wlddev |> fselect(country, year, PCGDP:ODA) |> # Adding 1 lead and 10-year growth rates
fgroup_by(country) |> fgrowth(c(0:1,10), 1, year) |> head()
# etc...
# Aggregation with multiple functions
wlddev |> fsubset(year > 1990, region, income, PCGDP:ODA) |>
fgroup_by(region, income) %>% {
add_vars(fgroup_vars(., "unique"),
fmedian(., keep.group_vars = FALSE) |> add_stub("median_"),
fmean(., keep.group_vars = FALSE) |> add_stub("mean_"),
fsd(., keep.group_vars = FALSE) |> add_stub("sd_"))
} |> head()
# Transformation with multiple functions
wlddev |> fselect(country, year, PCGDP:ODA) |>
fgroup_by(country) %>% {
add_vars(fdiff(., c(1,10), 1, year) |> flag(0:2, year), # Sequence of lagged differences
ftransform(., fselect(., PCGDP:ODA) |> fwithin() |> add_stub("W.")) |>
flag(0:2, year, keep.ids = FALSE)) # Sequence of lagged demeaned vars
} |> head()
# With ftransform, can also easily do one or more grouped mutations on the fly..
settransform(wlddev, median_ODA = fmedian(ODA, list(region, income), TRA = "fill"))
settransform(wlddev, sd_ODA = fsd(ODA, list(region, income), TRA = "fill"),
mean_GDP = fmean(PCGDP, country, TRA = "fill"))
wlddev %<>% ftransform(fmedian(list(median_ODA = ODA, median_GDP = PCGDP),
list(region, income), TRA = "fill"))
# On a groped data frame it is also possible to grouped transform certain columns
# but perform aggregate operatins on others:
wlddev |> fgroup_by(region, income) %>%
ftransform(gmedian_GDP = fmedian(PCGDP, GRP(.), TRA = "replace"),
omedian_GDP = fmedian(PCGDP, TRA = "replace"), # "replace" preserves NA's
omedian_GDP_fill = fmedian(PCGDP)) |> tail()
rm(wlddev)
## For multi-type data aggregation, the function collap() offers ease and flexibility
# Aggregate this data by country and decade: Numeric columns with mean, categorical with mode
head(collap(wlddev, ~ country + decade, fmean, fmode))
# taking weighted mean and weighted mode:
head(collap(wlddev, ~ country + decade, fmean, fmode, w = ~ POP, wFUN = fsum))
# Multi-function aggregation of certain columns
head(collap(wlddev, ~ country + decade,
list(fmean, fmedian, fsd),
list(ffirst, flast), cols = c(3,9:12)))
# Customized Aggregation: Assign columns to functions
head(collap(wlddev, ~ country + decade,
custom = list(fmean = 9:10, fsd = 9:12, flast = 3, ffirst = 6:8)))
# For grouped data frames use collapg
wlddev |> fsubset(year > 1990, country, region, income, PCGDP:ODA) |>
fgroup_by(country) |> collapg(fmean, ffirst) |>
ftransform(AMGDP = PCGDP > fmedian(PCGDP, list(region, income), TRA = "fill"),
AMODA = ODA > fmedian(ODA, income, TRA = "replace_fill")) |> head()
## Additional flexibility for data transformation tasks is offerend by tidy transformation operators
# Within-transformation (centering on overall mean)
head(W(wlddev, ~ country, cols = 9:12, mean = "overall.mean"))
# Partialling out country and year fixed effects
head(HDW(wlddev, PCGDP + LIFEEX ~ qF(country) + qF(year)))
# Same, adding ODA as continuous regressor
head(HDW(wlddev, PCGDP + LIFEEX ~ qF(country) + qF(year) + ODA))
# Standardizing (scaling and centering) by country
head(STD(wlddev, ~ country, cols = 9:12))
# Computing 1 lead and 3 lags of the 4 series
head(L(wlddev, -1:3, ~ country, ~year, cols = 9:12))
# Computing the 1- and 10-year first differences
head(D(wlddev, c(1,10), 1, ~ country, ~year, cols = 9:12))
head(D(wlddev, c(1,10), 1:2, ~ country, ~year, cols = 9:12)) # ..first and second differences
# Computing the 1- and 10-year growth rates
head(G(wlddev, c(1,10), 1, ~ country, ~year, cols = 9:12))
# Adding growth rate variables to dataset
add_vars(wlddev) <- G(wlddev, c(1, 10), 1, ~ country, ~year, cols = 9:12, keep.ids = FALSE)
get_vars(wlddev, "G1.", regex = TRUE) <- NULL # Deleting again
# These operators can conveniently be used in regression formulas:
# Using a Mundlak (1978) procedure to estimate the effect of OECD on LIFEEX, controlling for PCGDP
lm(LIFEEX ~ log(PCGDP) + OECD + B(log(PCGDP), country),
wlddev |> fselect(country, OECD, PCGDP, LIFEEX) |> na_omit())
# Adding 10-year lagged life-expectancy to allow for some convergence effects (dynamic panel model)
lm(LIFEEX ~ L(LIFEEX, 10, country) + log(PCGDP) + OECD + B(log(PCGDP), country),
wlddev |> fselect(country, OECD, PCGDP, LIFEEX) |> na_omit())
# Tranformation functions and operators also support indexed data classes:
wldi <- findex_by(wlddev, country, year)
head(W(wldi$PCGDP)) # Country-demeaning
head(W(wldi, cols = 9:12))
head(W(wldi$PCGDP, effect = 2)) # Time-demeaning
head(W(wldi, effect = 2, cols = 9:12))
head(HDW(wldi$PCGDP)) # Country- and time-demeaning
head(HDW(wldi, cols = 9:12))
head(STD(wldi$PCGDP)) # Standardizing by country
head(STD(wldi, cols = 9:12))
head(L(wldi$PCGDP, -1:3)) # Panel-lags
head(L(wldi, -1:3, 9:12))
head(G(wldi$PCGDP)) # Panel-Growth rates
head(G(wldi, 1, 1, 9:12))
lm(Dlog(PCGDP) ~ L(Dlog(LIFEEX), 0:3), wldi) # Panel data regression
rm(wldi)
# Remove all objects used in this example section
rm(v, d, w, f, f1, f2, g, mtcarsM, sds, series, wlddev)
Apply Functions Across Multiple Columns
Description
across()
can be used inside fmutate
and fsummarise
to apply one or more functions to a selection of columns. It is overall very similar to dplyr::across
, but does not support some rlang
features, has some additional features (arguments), and is optimized to work with collapse's, .FAST_FUN
, yielding much faster computations.
Usage
across(.cols = NULL, .fns, ..., .names = NULL,
.apply = "auto", .transpose = "auto")
# acr(...) can be used to abbreviate across(...)
Arguments
.cols |
select columns using column names and expressions (e.g. |
.fns |
A function, character vector of functions or list of functions. Vectors / lists can be named to yield alternative names in the result (see |
... |
further arguments to |
.names |
controls the naming of computed columns. |
.apply |
controls whether functions are applied column-by-column ( |
.transpose |
with multiple |
Note
across()
does not support purr-style lambdas, and does not support dplyr
-style predicate functions e.g. across(where(is.numeric), sum)
, simply use across(is.numeric, sum)
. In contrast to dplyr
, you can also compute on grouping columns.
Also note that across()
is NOT a function in collapse but a known expression that is internally transformed by fsummarise()/fmutate()
into something else. Thus, it cannot be called using qualified names, i.e., collapse::across()
does not work and is not necessary if collapse is not attached.
See Also
fsummarise
, fmutate
, Fast Data Manipulation, Collapse Overview
Examples
# Basic (Weighted) Summaries
fsummarise(wlddev, across(PCGDP:GINI, fmean, w = POP))
wlddev |> fgroup_by(region, income) |>
fsummarise(across(PCGDP:GINI, fmean, w = POP))
# Note that for these we don't actually need across...
fselect(wlddev, PCGDP:GINI) |> fmean(w = wlddev$POP, drop = FALSE)
wlddev |> fgroup_by(region, income) |>
fselect(PCGDP:GINI, POP) |> fmean(POP, keep.w = FALSE)
collap(wlddev, PCGDP + LIFEEX + GINI ~ region + income, w = ~ POP, keep.w = FALSE)
# But if we want to use some base R function that reguires argument splitting...
wlddev |> na_omit(cols = "POP") |> fgroup_by(region, income) |>
fsummarise(across(PCGDP:GINI, weighted.mean, w = POP, na.rm = TRUE))
# Or if we want to apply different functions...
wlddev |> fgroup_by(region, income) |>
fsummarise(across(PCGDP:GINI, list(mu = fmean, sd = fsd), w = POP),
POP_sum = fsum(POP), OECD = fmean(OECD))
# Note that the above still detects fmean as a fast function, the names of the list
# are irrelevant, but the function name must be typed or passed as a character vector,
# Otherwise functions will be executed by groups e.g. function(x) fmean(x) won't vectorize
# Same, naming in a different way
wlddev |> fgroup_by(region, income) |>
fsummarise(across(PCGDP:GINI, list(mu = fmean, sd = fsd), w = POP, .names = "flip"),
sum_POP = fsum(POP), OECD = fmean(OECD))
# Or we want to do more advanced things..
# Such as nesting data frames..
qTBL(wlddev) |> fgroup_by(region, income) |>
fsummarise(across(c(PCGDP, LIFEEX, ODA),
function(x) list(Nest = list(x)),
.apply = FALSE))
# Or linear models..
qTBL(wlddev) |> fgroup_by(region, income) |>
fsummarise(across(c(PCGDP, LIFEEX, ODA),
function(x) list(Mods = list(lm(PCGDP ~., x))),
.apply = FALSE))
# Or cumputing grouped correlation matrices
qTBL(wlddev) |> fgroup_by(region, income) |>
fsummarise(across(c(PCGDP, LIFEEX, ODA),
function(x) qDF(pwcor(x), "Variable"), .apply = FALSE))
# Here calculating 1- and 10-year lags and growth rates of these variables
qTBL(wlddev) |> fgroup_by(country) |>
fmutate(across(c(PCGDP, LIFEEX, ODA), list(L, G),
n = c(1, 10), t = year, .names = FALSE))
# Same but variables in different order
qTBL(wlddev) |> fgroup_by(country) |>
fmutate(across(c(PCGDP, LIFEEX, ODA), list(L, G), n = c(1, 10),
t = year, .names = FALSE, .transpose = FALSE))
Fast Row/Column Arithmetic for Matrix-Like Objects
Description
Fast operators to perform row- or column-wise replacing and sweeping operations of vectors on matrices, data frames, lists. See also setop
for math by reference and setTRA
for sweeping by reference.
Usage
## Perform the operation with v and each row of X
X %rr% v # Replace rows of X with v
X %r+% v # Add v to each row of X
X %r-% v # Subtract v from each row of X
X %r*% v # Multiply each row of X with v
X %r/% v # Divide each row of X by v
## Perform a column-wise operation between V and X
X %cr% V # Replace columns of X with V
X %c+% V # Add V to columns of X
X %c-% V # Subtract V from columns of X
X %c*% V # Multiply columns of X with V
X %c/% V # Divide columns of X by V
Arguments
X |
a vector, matrix, data frame or list like object (with rows (r) columns (c) matching |
v |
for row operations: an atomic vector of matching |
V |
for column operations: a suitable scalar, vector, or matrix / data frame matching |
Details
With a matrix or data frame X
, the default behavior of R when calling X op v
(such as multiplication X * v
) is to perform the operation of v
with each column of X
. The equivalent operation is performed by X %cop% V
, with the difference that it computes significantly faster if X
/V
is a data frame / list. A more complex but frequently required task is to perform an operation with v
on each row of X
. This is provided based on efficient C++ code by the %rop%
set of functions, e.g. X %r*% v
efficiently multiplies v
to each row of X
.
Value
X
where the operation with v
/ V
was performed on each row or column. All attributes of X
are preserved.
Note
Computations and Output: These functions are all quite simple, they only work with X
on the LHS i.e. v %op% X
will likely fail. The row operations are simple wrappers around TRA
which provides more operations including grouped replacing and sweeping (where v
would be a matrix or data frame with less rows than X
being mapped to the rows of X
by grouping vectors). One consequence is that just like TRA
, row-wise mathematical operations (+, -, *, /) always yield numeric output, even if both X
and v
may be integer. This is different for column- operations which depend on base R and may also preserve integer data.
Rules of Arithmetic: Since these operators are defined as simple infix functions, the normal rules of arithmetic are not respected. So a %c+% b %c*% c
evaluates as (a %c+% b) %c*% c
. As with all chained infix operations, they are just evaluated sequentially from left to right.
Performance Notes: The function setop
and a related set of %op=%
operators as well as the setTRA
function can be used to perform these operations by reference, and are faster if copies of the output are not required!! Furthermore, for Fast Statistical Functions, using fmedian(X, TRA = "-")
will be a tiny bit faster than X %r-% fmedian(X)
. Also use fwithin(X)
for fast centering using the mean, and fscale(X)
for fast scaling and centering or mean-preserving scaling.
See Also
setop
, TRA
, dapply
, Efficient Programming, Data Transformations, Collapse Overview
Examples
## Using data frame's / lists
v <- mtcars$cyl
mtcars %cr% v
mtcars %c-% v
mtcars %r-% seq_col(mtcars)
mtcars %r-% lapply(mtcars, quantile, 0.28)
mtcars %c*% 5 # Significantly faster than mtcars * 5
mtcars %c*% mtcars # Significantly faster than mtcars * mtcars
## Using matrices
X <- qM(mtcars)
X %cr% v
X %c-% v
X %r-% dapply(X, quantile, 0.28)
## Chained Operations
library(magrittr) # Needed here to evaluate infix operators in sequence
mtcars %>% fwithin() %r-% rnorm(11) %c*% 5 %>%
tfm(mpg = fsum(mpg)) %>% qsu()
Split-Apply-Combine Computing
Description
BY
is an S3 generic that efficiently applies functions over vectors or matrix- and data frame columns by groups. Similar to dapply
it seeks to retain the structure and attributes of the data, but can also output to various standard formats. A simple parallelism is also available.
Usage
BY(x, ...)
## Default S3 method:
BY(x, g, FUN, ..., use.g.names = TRUE, sort = .op[["sort"]], reorder = TRUE,
expand.wide = FALSE, parallel = FALSE, mc.cores = 1L,
return = c("same", "vector", "list"))
## S3 method for class 'matrix'
BY(x, g, FUN, ..., use.g.names = TRUE, sort = .op[["sort"]], reorder = TRUE,
expand.wide = FALSE, parallel = FALSE, mc.cores = 1L,
return = c("same", "matrix", "data.frame", "list"))
## S3 method for class 'data.frame'
BY(x, g, FUN, ..., use.g.names = TRUE, sort = .op[["sort"]], reorder = TRUE,
expand.wide = FALSE, parallel = FALSE, mc.cores = 1L,
return = c("same", "matrix", "data.frame", "list"))
## S3 method for class 'grouped_df'
BY(x, FUN, ..., reorder = TRUE, keep.group_vars = TRUE, use.g.names = FALSE)
Arguments
x |
a vector, matrix, data frame or alike object. |
g |
a |
FUN |
a function, can be scalar- or vector-valued. For vector valued functions see also |
... |
further arguments to |
use.g.names |
logical. Make group-names and add to the result as names (default method) or row-names (matrix and data frame methods). For vector-valued functions (row-)names are only generated if the function itself creates names for the statistics e.g. |
sort |
logical. Sort the groups? Internally passed to |
reorder |
logical. If a vector-valued function is passed that preserves the data length, |
expand.wide |
logical. If |
parallel |
logical. |
mc.cores |
integer. Argument to |
return |
an integer or string indicating the type of object to return. The default |
keep.group_vars |
grouped_df method: Logical. |
Details
BY
is a re-implementation of the Split-Apply-Combine computing paradigm. It is faster than tapply
, by
, aggregate
and (d)plyr, and preserves data attributes just like dapply
.
It is principally a wrapper around lapply(gsplit(x, g), FUN, ...)
, that uses gsplit
for optimized splitting and also strongly optimizes on the internal code compared to base R functions. For more details look at the documentation for dapply
which works very similar (apart from the splitting performed in BY
). The function is intended for simple cases involving flexible computation of statistics across groups using a single function e.g. iris |> gby(Species) |> BY(IQR)
is simpler than iris |> gby(Species) |> smr(acr(.fns = IQR))
etc..
Value
X
where FUN
was applied to every column split by g
.
See Also
dapply
, collap
, Fast Statistical Functions, Data Transformations, Collapse Overview
Examples
v <- iris$Sepal.Length # A numeric vector
g <- GRP(iris$Species) # A grouping
## default vector method
BY(v, g, sum) # Sum by species
head(BY(v, g, scale)) # Scale by species (please use fscale instead)
BY(v, g, fquantile) # Species quantiles: by default stacked
BY(v, g, fquantile, expand.wide = TRUE) # Wide format
## matrix method
m <- qM(num_vars(iris))
BY(m, g, sum) # Also return as matrix
BY(m, g, sum, return = "data.frame") # Return as data.frame.. also works for computations below
head(BY(m, g, scale))
BY(m, g, fquantile)
BY(m, g, fquantile, expand.wide = TRUE)
ml <- BY(m, g, fquantile, expand.wide = TRUE, # Return as list of matrices
return = "list")
ml
# Unlisting to Data Frame
unlist2d(ml, idcols = "Variable", row.names = "Species")
## data.frame method
BY(num_vars(iris), g, sum) # Also returns a data.fram
BY(num_vars(iris), g, sum, return = 2) # Return as matrix.. also works for computations below
head(BY(num_vars(iris), g, scale))
BY(num_vars(iris), g, fquantile)
BY(num_vars(iris), g, fquantile, expand.wide = TRUE)
BY(num_vars(iris), g, fquantile, # Return as list of matrices
expand.wide = TRUE, return = "list")
## grouped data frame method
giris <- fgroup_by(iris, Species)
giris |> BY(sum) # Compute sum
giris |> BY(sum, use.g.names = TRUE, # Use row.names and
keep.group_vars = FALSE) # remove 'Species' and groups attribute
giris |> BY(sum, return = "matrix") # Return matrix
giris |> BY(sum, return = "matrix", # Matrix with row.names
use.g.names = TRUE)
giris |> BY(.quantile) # Compute quantiles (output is stacked)
giris |> BY(.quantile, names = TRUE, # Wide output
expand.wide = TRUE)
Advanced Data Aggregation
Description
collap
is a fast and versatile multi-purpose data aggregation command.
It performs simple and weighted aggregations, multi-type aggregations automatically applying different functions to numeric and categorical columns, multi-function aggregations applying multiple functions to each column, and fully custom aggregations where the user passes a list mapping functions to columns.
Usage
# Main function: allows formula and data input to `by` and `w` arguments
collap(X, by, FUN = fmean, catFUN = fmode, cols = NULL, w = NULL, wFUN = fsum,
custom = NULL, ..., keep.by = TRUE, keep.w = TRUE, keep.col.order = TRUE,
sort = .op[["sort"]], decreasing = FALSE, na.last = TRUE, return.order = sort,
method = "auto", parallel = FALSE, mc.cores = 2L,
return = c("wide","list","long","long_dupl"), give.names = "auto")
# Programmer function: allows column names and indices input to `by` and `w` arguments
collapv(X, by, FUN = fmean, catFUN = fmode, cols = NULL, w = NULL, wFUN = fsum,
custom = NULL, ..., keep.by = TRUE, keep.w = TRUE, keep.col.order = TRUE,
sort = .op[["sort"]], decreasing = FALSE, na.last = TRUE, return.order = sort,
method = "auto", parallel = FALSE, mc.cores = 2L,
return = c("wide","list","long","long_dupl"), give.names = "auto")
# Auxiliary function: for grouped data ('grouped_df') input + non-standard evaluation
collapg(X, FUN = fmean, catFUN = fmode, cols = NULL, w = NULL, wFUN = fsum,
custom = NULL, keep.group_vars = TRUE, ...)
Arguments
X |
a data frame, or an object coercible to data frame using |
by |
for |
FUN |
a function, list of functions (i.e. |
catFUN |
same as |
cols |
select columns to aggregate using a function, column names, indices or logical vector. Note: |
w |
weights. Can be passed as numeric vector or alternatively as formula i.e. |
wFUN |
same as |
custom |
a named list specifying a fully customized aggregation task. The names of the list are function names and the content columns to aggregate using this function (same input as |
keep.by , keep.group_vars |
logical. |
keep.w |
logical. |
keep.col.order |
logical. Retain original column order post-aggregation. |
sort , decreasing , na.last , return.order , method |
logical / character. Arguments passed to |
parallel |
logical. Use |
mc.cores |
integer. Argument to |
return |
character. Control the output format when aggregating with multiple functions or performing custom aggregation. "wide" (default) returns a wider data frame with added columns for each additional function. "list" returns a list of data frames - one for each function. "long" adds a column "Function" and row-binds the results from different functions using |
give.names |
logical. Create unique names of aggregated columns by adding a prefix 'FUN.var'. |
... |
additional arguments passed to all functions supplied to |
Details
collap
automatically checks each function passed to it whether it is a Fast Statistical Function (i.e. whether the function name is contained in .FAST_STAT_FUN
). If the function is a fast statistical function, collap
only does the grouping and then calls the function to carry out the grouped computations (vectorized in C/C++), resulting in high aggregation speeds, even with weights. If the function is not one of .FAST_STAT_FUN
, BY
is called internally to perform the computation. The resulting computations from each function are put into a list and recombined to produce the desired output format as controlled by the return
argument. This is substantially slower, particularly with many groups.
When setting parallel = TRUE
on a non-windows computer, aggregations will efficiently be parallelized at the column level using mclapply
utilizing mc.cores
cores. Some Fast Statistical Function support multithreading i.e. have an nthreads
argument that can be passed to collap
. Using C-level multithreading is much more effective than R-level parallelism, and also works on Windows, but the two should never be combined.
When the w
argument is used, the weights are passed to all functions except for wFUN
. This may be undesirable in settings like collap(data, ~ id, custom = list(fsum = ..., fmean = ...), w = ~ weights)
where we wish to aggregate some columns using the weighted mean, and others using a simple sum or another unweighted statistic.
Therefore it is possible to append Fast Statistical Functions by _uw
to yield an unweighted computation. So for the above example one can specify: collap(data, ~ id, custom = list(fsum_uw = ..., fmean = ...), w = ~ weights)
to get the weighted mean and the simple sum. Note that the _uw
functions are not available for use outside collap. Thus one also needs to quote them when passing to the FUN
or catFUN
arguments, e.g. use collap(data, ~ id, fmean, "fmode_uw", w = ~ weights)
.
Value
X
aggregated. If X
is not a data frame it is coerced to one using qDF
and then aggregated.
See Also
fsummarise
, BY
, Fast Statistical Functions, Collapse Overview
Examples
## A Simple Introduction --------------------------------------
head(iris)
collap(iris, ~ Species) # Default: FUN = fmean for numeric
collapv(iris, 5) # Same using collapv
collap(iris, ~ Species, fmedian) # Using the median
collap(iris, ~ Species, fmedian, keep.col.order = FALSE) # Groups in-front
collap(iris, Sepal.Width + Petal.Width ~ Species, fmedian) # Only '.Width' columns
collapv(iris, 5, cols = c(2, 4)) # Same using collapv
collap(iris, ~ Species, list(fmean, fmedian)) # Two functions
collap(iris, ~ Species, list(fmean, fmedian), return = "long") # Long format
collapv(iris, 5, custom = list(fmean = 1:2, fmedian = 3:4)) # Custom aggregation
collapv(iris, 5, custom = list(fmean = 1:2, fmedian = 3:4), # Raw output, no column reordering
return = "list")
collapv(iris, 5, custom = list(fmean = 1:2, fmedian = 3:4), # A strange choice..
return = "long")
collap(iris, ~ Species, w = ~ Sepal.Length) # Using Sepal.Length as weights, ..
weights <- abs(rnorm(fnrow(iris)))
collap(iris, ~ Species, w = weights) # Some random weights..
collap(iris, iris$Species, w = weights) # Note this behavior..
collap(iris, iris$Species, w = weights,
keep.by = FALSE, keep.w = FALSE)
## Multi-Type Aggregation --------------------------------------
head(wlddev) # World Development Panel Data
head(collap(wlddev, ~ country + decade)) # Aggregate by country and decade
head(collap(wlddev, ~ country + decade, fmedian, ffirst)) # Different functions
head(collap(wlddev, ~ country + decade, cols = is.numeric)) # Aggregate only numeric columns
head(collap(wlddev, ~ country + decade, cols = 9:13)) # Only the 5 series
head(collap(wlddev, PCGDP + LIFEEX ~ country + decade)) # Only GDP and life-expactancy
head(collap(wlddev, PCGDP + LIFEEX ~ country + decade, fsum)) # Using the sum instead
head(collap(wlddev, PCGDP + LIFEEX ~ country + decade, sum, # Same using base::sum -> slower!
na.rm = TRUE))
head(collap(wlddev, wlddev[c("country","decade")], fsum, # Same, exploring different inputs
cols = 9:10))
head(collap(wlddev[9:10], wlddev[c("country","decade")], fsum))
head(collapv(wlddev, c("country","decade"), fsum)) # ..names/indices with collapv
head(collapv(wlddev, c(1,5), fsum))
g <- GRP(wlddev, ~ country + decade) # Precomputing the grouping
head(collap(wlddev, g, keep.by = FALSE)) # This is slightly faster now
# Aggregate categorical data using not the mode but the last element
head(collap(wlddev, ~ country + decade, fmean, flast))
head(collap(wlddev, ~ country + decade, catFUN = flast, # Aggregate only categorical data
cols = is_categorical))
## Weighted Aggregation ----------------------------------------
# We aggregate to region level using population weights
head(collap(wlddev, ~ region + year, w = ~ POP)) # Takes weighted mean for numeric..
# ..and weighted mode for categorical data. The weight vector is aggregated using fsum
head(collap(wlddev, ~ region + year, w = ~ POP, # Aggregating weights using sum
wFUN = list(sum = fsum, max = fmax))) # and max (corresponding to mode)
## Multi-Function Aggregation ----------------------------------
head(collap(wlddev, ~ country + decade, list(mean = fmean, N = fnobs), # Saving mean and Nobs
cols = 9:13))
head(collap(wlddev, ~ country + decade, # Same using base R -> slower
list(mean = mean,
N = function(x, ...) sum(!is.na(x))),
cols = 9:13, na.rm = TRUE))
lapply(collap(wlddev, ~ country + decade, # List output format
list(mean = fmean, N = fnobs), cols = 9:13, return = "list"), head)
head(collap(wlddev, ~ country + decade, # Long output format
list(mean = fmean, N = fnobs), cols = 9:13, return = "long"))
head(collap(wlddev, ~ country + decade, # Also aggregating categorical data,
list(mean = fmean, N = fnobs), return = "long_dupl")) # and duplicating it 2 times
head(collap(wlddev, ~ country + decade, # Now also using 2 functions on
list(mean = fmean, N = fnobs), list(mode = fmode, last = flast), # categorical data
keep.col.order = FALSE))
head(collap(wlddev, ~ country + decade, # More functions, string input,
c("fmean","fsum","fnobs","fsd","fvar"), # parallelized execution
c("fmode","ffirst","flast","fndistinct"), # (choose more than 1 cores,
parallel = TRUE, mc.cores = 1L, # depending on your machine)
keep.col.order = FALSE))
## Custom Aggregation ------------------------------------------
head(collap(wlddev, ~ country + decade, # Custom aggregation
custom = list(fmean = 11:13, fsd = 9:10, fmode = 7:8)))
head(collap(wlddev, ~ country + decade, # Using column names
custom = list(fmean = "PCGDP", fsd = c("LIFEEX","GINI"),
flast = "date")))
head(collap(wlddev, ~ country + decade, # Weighted parallelized custom
custom = list(fmean = 9:12, fsd = 9:10, # aggregation
fmode = 7:8), w = ~ POP,
wFUN = list(fsum, fmax),
parallel = TRUE, mc.cores = 1L))
head(collap(wlddev, ~ country + decade, # No column reordering
custom = list(fmean = 9:12, fsd = 9:10,
fmode = 7:8), w = ~ POP,
wFUN = list(fsum, fmax),
parallel = TRUE, mc.cores = 1L, keep.col.order = FALSE))
## Piped Use --------------------------------------------------
iris |> fgroup_by(Species) |> collapg()
wlddev |> fgroup_by(country, decade) |> collapg() |> head()
wlddev |> fgroup_by(region, year) |> collapg(w = POP) |> head()
wlddev |> fgroup_by(country, decade) |> collapg(fmedian, flast) |> head()
wlddev |> fgroup_by(country, decade) |>
collapg(custom = list(fmean = 9:12, fmode = 5:7, flast = 3)) |> head()
Collapse Documentation & Overview
Description
The following table fully summarizes the contents of collapse. The documentation is structured hierarchically: This is the main overview page, linking to topical overview pages and associated function pages (unless functions are documented on the topic page).
Topics and Functions
Topic | Main Features / Keywords | Functions | ||
Fast Statistical Functions | Fast (grouped and weighted) statistical functions for vector, matrix, data frame and grouped data frames (class 'grouped_df', dplyr compatible). | fsum , fprod , fmean , fmedian , fmode , fvar , fsd , fmin , fmax , fnth , ffirst , flast , fnobs , fndistinct |
||
Fast Grouping and Ordering | Fast (ordered) groupings from vectors, data frames, lists. 'GRP' objects are efficient inputs for programming with collapse's fast functions. fgroup_by can attach them to a data frame, for fast dplyr-style grouped computations. Fast splitting of vectors based on 'GRP' objects. Fast radix-based ordering and hash-based grouping (the workhorses behind GRP ). Fast matching (rows) and unique values/rows, group counts, factor generation, vector grouping, interactions, dropping unused factor levels, generalized run-length type grouping and grouping of integer sequences and time vectors.
| GRP , as_factor_GRP , GRPN , GRPid , GRPnames , is_GRP , fgroup_by , group_by_vars , fgroup_vars , fungroup , gsplit , greorder , radixorder(v) , group(v) , fmatch , ckmatch , %!in% , %[!]iin% , funique , fnunique , fduplicated , any_duplicated , fcount(v) , qF , qG , is_qG , finteraction , fdroplevels , groupid , seqid , timeid |
||
Fast Data Manipulation | Fast and flexible select, subset, slice, summarise, mutate/transform, sort/reorder, combine, join, reshape, rename and relabel data. Some functions modify by reference and/or allow assignment. In addition a set of (standard evaluation) functions for fast selecting, replacing or adding data frame columns, including shortcuts to select and replace variables by data type. | fselect(<-) , fsubset/ss , fslice(v) , fsummarise , fmutate , across , (f/set)transform(v)(<-) , fcompute(v) , roworder(v) , colorder(v) , rowbind , join , pivot , (f/set)rename , (set)relabel , get_vars(<-) , add_vars(<-) , num_vars(<-) , cat_vars(<-) , char_vars(<-) , fact_vars(<-) , logi_vars(<-) , date_vars(<-) |
||
Quick Data Conversion | Quick conversions: data.frame <> data.table <> tibble <> matrix (row- or column-wise) <> list | array > matrix, data.frame, data.table, tibble | vector > factor, matrix, data.frame, data.table, tibble; and converting factors / all factor columns. | qDF , qDT , qTBL , qM , qF , mrtl , mctl , as_numeric_factor , as_integer_factor , as_character_factor |
||
Advanced Data Aggregation | Fast and easy (weighted and parallelized) aggregation of multi-type data, with different functions applied to numeric and categorical variables. Custom specifications allow mappings of functions to variables + renaming. | collap(v/g) |
||
Data Transformations | Fast row- and column- arithmetic and (object preserving) apply functionality for vectors, matrices and data frames. Fast (grouped) replacing and sweeping of statistics (by reference) and (grouped and weighted) scaling / standardizing, (higher-dimensional) between- and within-transformations (i.e. averaging and centering), linear prediction and partialling out. | %(r/c)r% , %(r/c)(+/-/*//)% , dapply , BY , (set)TRA , fscale/STD , fbetween/B , fwithin/W , fhdbetween/HDB , fhdwithin/HDW |
||
Linear Models | Fast (weighted) linear model fitting with 6 different solvers and a fast F-test to test exclusion restrictions on linear models with (large) factors. | flm , fFtest |
||
Time Series and Panel Series | Fast and class-agnostic indexed time series and panel data objects, check for irregularity in time series and panels, and efficient time-sequence to integer/factor conversion. Fast (sequences of) lags / leads and (lagged / leaded and iterated, quasi-, log-) differences, and (compounded) growth rates on (irregular) time series and panel data. Flexible cumulative sums. Panel data to array conversions. Multivariate panel- auto-, partial- and cross-correlation functions. |
findex_by , findex , unindex , reindex , is_irregular , to_plm , timeid ,
flag/L/F , fdiff/D/Dlog , fgrowth/G , fcumsum , psmat , psacf , pspacf , psccf |
||
Summary Statistics | Fast (grouped and weighted) summary statistics for cross-sectional and panel data. Fast (weighted) cross tabulation. Efficient detailed description of data frame. Fast check of variation in data (within groups / dimensions). (Weighted) pairwise correlations and covariances (with obs. and p-value), pairwise observation count. | qsu , qtab , descr , varying , pwcor , pwcov , pwnobs |
||
Other Statistical | Fast euclidean distance computations, (weighted) sample quantiles, and range of vector. | fdist , fquantile , frange |
||
List Processing | (Recursive) list search and checks, extraction of list-elements / list-subsetting, fast (recursive) splitting, list-transpose, apply functions to lists of data frames / data objects, and generalized recursive row-binding / unlisting in 2-dimensions / to data frame. | is_unlistable , ldepth , has_elem , get_elem , atomic_elem(<-) , list_elem(<-) , reg_elem , irreg_elem , rsplit , t_list , rapply2d , unlist2d , rowbind |
||
Recode and Replace Values | Recode multiple values (exact or regex matching) and replace NaN/Inf/-Inf and outliers (according to 1- or 2-sided threshold or standard-deviations) in vectors, matrices or data frames. Insert a value at arbitrary positions into vectors, matrices or data frames. | recode_num , recode_char , replace_na , replace_inf , replace_outliers , pad |
||
(Memory) Efficient Programming | Efficient comparisons of a vector/matrix with a value, and replacing values/rows in vector/matrix/DF (avoiding logical vectors or subsets), faster generation of initialized vectors, and fast mathematical operations on vectors/matrices/DF's with no copies at all. Fast missing value detection, (random) insertion and removal/replacement, lengths and C storage types, greatest common divisor of vector, nlevels for factors, nrow , ncol , dim (for data frames) and seq_along rows or columns. Fast vectorization of matrices and lists, and choleski inverse of symmetric PD matrix. |
anyv , allv , allNA , whichv , whichNA , %==% ,
%!=% , copyv , setv , alloc , setop , %+=% , %-=% , %*=% , %/=% , missing_cases , na_insert , na_rm , na_locf , na_focb , na_omit , vlengths , vtypes , vgcd , fnlevels , fnrow , fncol , fdim , seq_row , seq_col , vec , cinv |
||
Small (Helper) Functions | Multiple-assignment, non-standard concatenation, set and extract variable labels and classes, display variable names and labels together, add / remove prefix or postfix to / from column names, check exact or near / numeric equality of multiple objects or of all elements in a list, get names of functions called in an expression, return object with dimnames, row- or colnames efficiently set, or with all attributes removed, C-level functions to set and shallow-copy attributes, identify categorical (non-numeric) and date(-time) objects. | massign , %=% , .c , vlabels(<-) , setLabels , vclasses , namlab , add_stub , rm_stub , all_identical , all_obj_equal , all_funs , setDimnames , setRownames , setColnames , unattrib , setAttrib , setattrib , copyAttrib , copyMostAttrib , is_categorical , is_date |
||
Data and Global Macros | Groningen Growth and Development Centre 10-Sector Database, World Bank World Development dataset, and some global macros containing links to the topical documentation pages (including this page), all exported objects (excluding exported S3 methods and depreciated functions), all generic functions (excluding depreciated), the 2 datasets, depreciated functions, all fast functions, all fast statistical (scalar-valued) functions, and all transformation operators (these are not infix functions but function shortcuts resembling operators in a statistical sense, such as the lag/lead operators L /F , both wrapping flag , see .OPERATOR_FUN ). | GGDC10S, wlddev, .COLLAPSE_TOPICS, .COLLAPSE_ALL, .COLLAPSE_GENERIC, .COLLAPSE_DATA, .COLLAPSE_OLD, .FAST_FUN, .FAST_STAT_FUN, .OPERATOR_FUN |
||
Package Options | set_collapse /get_collapse can be used to globally set/get the defaults for na.rm , nthreads and sort , etc., arguments found in many functions, and to globally control the namespace with options 'mask' and 'remove': 'mask' can be used to mask base R/dplyr functions by export copies of equivalent collapse functions starting with "f" , removing the leading "f" (e.g. exporting subset <- fsubset ). 'remove' allows removing arbitrary functions from the exported namespace. options("collapse_unused_arg_action") sets the action taken by generic statistical functions when unknown arguments are passed to a method. The default is "warning" . | set_collapse , get_collapse |
||
Details
The added top-level documentation infrastructure in collapse allows you to effectively navigate the package.
Calling ?FUN
brings up the documentation page documenting the function, which contains links to associated topic pages and closely related functions. You can also call topical documentation pages directly from the console. The links to these pages are contained in the global macro .COLLAPSE_TOPICS
(e.g. calling help(.COLLAPSE_TOPICS[1])
brings up this page).
Author(s)
Maintainer: Sebastian Krantz sebastian.krantz@graduateinstitute.ch
See Also
collapse Package Options
Description
collapse is globally configurable to an extent few packages are: the default value of key function arguments governing the behavior of its algorithms, and the exported namespace, can be adjusted interactively through the set_collapse()
function.
These options are saved in an internal environment called .op
(for safety and performance reasons) visible in the documentation of some functions such as fmean
. The contents of this environment can be accessed using get_collapse()
.
There are also a few options that can be set using options
(retrievable using getOption
). These options mainly affect package startup behavior.
Usage
set_collapse(...)
get_collapse(opts = NULL)
Arguments
... |
either comma separated options, or a single list of options. The available options are:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
opts |
character. A vector of options to receive from |
Value
set_collapse()
returns the old content of .op
invisibly as a list. get_collapse()
, if called with only one option, returns the value of the option, and otherwise a list.
Options Set Using options()
-
"collapse_unused_arg_action"
regulates how generic functions (such as the Fast Statistical Functions) in the package react when an unknown argument is passed to a method. The default action is"warning"
which issues a warning. Other options are"error"
,"message"
or"none"
, whereby the latter enables silent swallowing of such arguments. -
"collapse_export_F"
, if set toTRUE
, exports the lead operatorF
in the package namespace when loading the package. The operator was exported by default until v1.9.0, but is now hidden inside the package due to too many problems withbase::F
. Alternatively, the operator can be accessed usingcollapse:::F
. -
"collapse_nthreads"
,"collapse_na_rm"
,"collapse_sort"
,"collapse_stable_algo"
,"collapse_verbose"
,"collapse_digits"
,"collapse_mask"
and"collapse_remove"
can be set before loading the package to initialize.op
with different defaults (e.g. using an.Rprofile
file). Once loaded, these options have no effect, and users need to useset_collapse()
to change them. See also the Note.
Note
Setting keywords "fast-fun", "fast-stat-fun", "fast-trfm-fun" or "all" with set_collapse(mask = ...)
will also adjust internal optimization flags, e.g. in (f)summarise
and (f)mutate
, so that these functions - and all expressions containing them - receive vectorized execution (see examples of (f)summarise
and (f)mutate
). Users should be aware of expressions like fmutate(mu = sum(var) / lenth(var))
: this usually gets executed by groups, but with these keywords set,this will be vectorized (like fmutate(mu = fsum(var) / lenth(var))
) implying grouped sum divided by overall length. In this case fmutate(mu = base::sum(var) / lenth(var))
needs to be specified to retain the original result.
Note that passing individual functions like set_collapse(mask = "(f)sum")
will not change internal optimization flags for these functions. This is to ensure consistency i.e. you can be either all in (by setting appropriate keywords) or all out when it comes to vectorized stats with basic R names.
Note also that masking does not change documentation links, so you need to look up the f- version of a function to get the right documentation.
A safe way to set options affecting startup behavior is by using a .Rprofile
file in your user or project directory (see also here, the user-level file is located at file.path(Sys.getenv("HOME"), ".Rprofile")
and can be edited using file.edit(Sys.getenv("HOME"), ".Rprofile")
), or by using a .fastverse
configuration file in the project directory.
options("collapse_remove")
does in fact remove functions from the namespace and cannot be reversed by set_collapse(remove = NULL)
once the package is loaded. It is only reversed by re-loading collapse.
See Also
Collapse Overview, collapse-package
Examples
# Setting new values
oldopts <- set_collapse(nthreads = 2, na.rm = FALSE)
# Getting the values
get_collapse()
get_collapse("nthreads")
# Resetting
set_collapse(oldopts)
rm(oldopts)
## Not run:
## This is a typical working setup I use:
library(fastverse)
# Loading other stats packages with fastverse_extend():
# displays versions, checks conflicts, and installs if unavailable
fastverse_extend(qs, fixest, grf, glmnet, install = TRUE)
# Now setting collapse options with some namespace modification
set_collapse(
nthreads = 4,
sort = FALSE,
mask = c("manip", "helper", "special", "mean", "scale"),
remove = "old"
)
# Final conflicts check (optional)
fastverse_conflicts()
# For some simpler scripts I also use
set_collapse(
nthreads = 4,
sort = FALSE,
mask = "all",
remove = c("old", "between") # I use data.table::between > fbetween
)
# This is now collapse code
mtcars |>
subset(mpg > 12) |>
group_by(cyl) |>
sum()
## End(Not run)
## Changing what happens with unused arguments
oldopts <- options(collapse_unused_arg_action = "message") # default: "warning"
fmean(mtcars$mpg, bla = 1)
# Now nothing happens, same as base R
options(collapse_unused_arg_action = "none")
fmean(mtcars$mpg, bla = 1)
mean(mtcars$mpg, bla = 1)
options(oldopts)
rm(oldopts)
Renamed Functions
Description
These functions were renamed (mostly during v1.6.0 update) to make the namespace more consistent.
Renaming
fNobs -> fnobs fNdistinct -> fndistinct fHDwithin -> fhdwithin fHDbetween -> fhdbetween replace_NA -> replace_na replace_Inf -> replace_inf
Fast Reordering of Data Frame Columns
Description
Efficiently reorder columns in a data frame. To do this fully by reference see also data.table::setcolorder
.
Usage
colorder(.X, ..., pos = "front")
colorderv(X, neworder = radixorder(names(X)),
pos = "front", regex = FALSE, ...)
Arguments
.X , X |
a data frame or list. | ||||||||||||||||||||||||||
... |
for | ||||||||||||||||||||||||||
neworder |
a vector of column names, positive indices, a suitable logical vector, a function such as | ||||||||||||||||||||||||||
pos |
integer or character. Different options regarding column arrangement if
| ||||||||||||||||||||||||||
regex |
logical. |
Value
.X/X
with columns reordered (no deep copies).
See Also
roworder
, Data Frame Manipulation, Collapse Overview
Examples
head(colorder(mtcars, vs, cyl:hp, am))
head(colorder(mtcars, vs, cyl:hp, am, pos = "end"))
head(colorder(mtcars, vs, cyl:hp, am, pos = "after"))
head(colorder(mtcars, vs, cyl, pos = "exchange"))
head(colorder(mtcars, vs, cyl:hp, new = am)) # renaming
## Same in standard evaluation
head(colorderv(mtcars, c(8, 2:4, 9)))
head(colorderv(mtcars, c(8, 2:4, 9), pos = "end"))
head(colorderv(mtcars, c(8, 2:4, 9), pos = "after"))
head(colorderv(mtcars, c(8, 2), pos = "exchange"))
Data Apply
Description
dapply
efficiently applies functions to columns or rows of matrix-like objects and by default returns an object of the same type and with the same attributes (unless the result is scalar and drop = TRUE
). Alternatively it is possible to return the result in a plain matrix or data.frame. A simple parallelism is also available.
Usage
dapply(X, FUN, ..., MARGIN = 2, parallel = FALSE, mc.cores = 1L,
return = c("same", "matrix", "data.frame"), drop = TRUE)
Arguments
X |
a matrix, data frame or alike object. |
FUN |
a function, can be scalar- or vector-valued. |
... |
further arguments to |
MARGIN |
integer. The margin which |
parallel |
logical. |
mc.cores |
integer. Argument to |
return |
an integer or string indicating the type of object to return. The default |
drop |
logical. If the result has only one row or one column, |
Details
dapply
is an efficient command to apply functions to rows or columns of data without loosing information (attributes) about the data or changing the classes or format of the data. It is principally an efficient wrapper around lapply
and works as follows:
Save the attributes of
X
.If
MARGIN = 2
(columns), convert matrices to plain lists of columns usingmctl
and remove all attributes from data frames.If
MARGIN = 1
(rows), convert matrices to plain lists of rows usingmrtl
. For data frames remove all attributes, efficiently convert to matrix usingdo.call(cbind, X)
and also convert to list of rows usingmrtl
.Call
lapply
ormclapply
on these plain lists (which is faster than callinglapply
on an object with attributes).depending on the requested output type, use
matrix
,unlist
ordo.call(cbind, ...)
to convert the result back to a matrix or list of columns.modify the relevant attributes accordingly and efficiently attach to the object again (no further checks).
The performance gain from working with plain lists makes dapply
not much slower than calling lapply
itself on a data frame. Because of the conversions involved, row-operations require some memory, but are still faster than apply
.
Value
X
where FUN
was applied to every row or column.
See Also
BY
, collap
, Fast Statistical Functions, Data Transformations, Collapse Overview
Examples
head(dapply(mtcars, log)) # Take natural log of each variable
head(dapply(mtcars, log, return = "matrix")) # Return as matrix
m <- as.matrix(mtcars)
head(dapply(m, log)) # Same thing
head(dapply(m, log, return = "data.frame")) # Return data frame from matrix
dapply(mtcars, sum); dapply(m, sum) # Computing sum of each column, return as vector
dapply(mtcars, sum, drop = FALSE) # This returns a data frame of 1 row
dapply(mtcars, sum, MARGIN = 1) # Compute row-sum of each column, return as vector
dapply(m, sum, MARGIN = 1) # Same thing for matrices, faster t. apply(m, 1, sum)
head(dapply(m, sum, MARGIN = 1, drop = FALSE)) # Gives matrix with one column
head(dapply(m, quantile, MARGIN = 1)) # Compute row-quantiles
dapply(m, quantile) # Column-quantiles
head(dapply(mtcars, quantile, MARGIN = 1)) # Same for data frames, output is also a data.frame
dapply(mtcars, quantile)
# With classed objects, we have to be a bit careful
## Not run:
dapply(EuStockMarkets, quantile) # This gives an error because the tsp attribute is misspecified
## End(Not run)
dapply(EuStockMarkets, quantile, return = "matrix") # These both work fine..
dapply(EuStockMarkets, quantile, return = "data.frame")
# Similarly for grouped tibbles and other data frame based classes
library(dplyr)
gmtcars <- group_by(mtcars,cyl,vs,am)
head(dapply(gmtcars, log)) # Still gives a grouped tibble back
dapply(gmtcars, quantile, MARGIN = 1) # Here it makes sense to keep the groups attribute
dapply(gmtcars, quantile) # This does not make much sense, ...
dapply(gmtcars, quantile, # better convert to plain data.frame:
return = "data.frame")
Data Transformations
Description
collapse provides an ensemble of functions to perform common data transformations efficiently and user friendly:
-
dapply
applies functions to rows or columns of matrices and data frames, preserving the data format. -
BY
is an S3 generic for efficient Split-Apply-Combine computing, similar todapply
. A set of arithmetic operators facilitates row-wise
%rr%
,%r+%
,%r-%
,%r*%
,%r/%
and column-wise%cr%
,%c+%
,%c-%
,%c*%
,%c/%
replacing and sweeping operations involving a vector and a matrix or data frame / list. Since v1.7, the operators%+=%
,%-=%
,%*=%
and%/=%
do column- and element- wise math by reference, and the functionsetop
can also perform sweeping out rows by reference.-
(set)TRA
is a more advanced S3 generic to efficiently perform (groupwise) replacing and sweeping out of statistics, either by creating a copy of the data or by reference. Supported operations are:Integer-id String-id Description 0 "na" or "replace_na" replace only missing values 1 "fill" or "replace_fill" replace everything 2 "replace" replace data but preserve missing values 3 "-" subtract 4 "-+" subtract group-statistics but add group-frequency weighted average of group statistics 5 "/" divide 6 "%" compute percentages 7 "+" add 8 "*" multiply 9 "%%" modulus 10 "-%%" subtract modulus All of collapse's Fast Statistical Functions have a built-in
TRA
argument for faster access (i.e. you can compute (groupwise) statistics and use them to transform your data with a single function call). -
fscale/STD
is an S3 generic to perform (groupwise and / or weighted) scaling / standardizing of data and is orders of magnitude faster thanscale
. -
fwithin/W
is an S3 generic to efficiently perform (groupwise and / or weighted) within-transformations / demeaning / centering of data. Similarlyfbetween/B
computes (groupwise and / or weighted) between-transformations / averages (also a lot faster thanave
). -
fhdwithin/HDW
, shorthand for 'higher-dimensional within transform', is an S3 generic to efficiently center data on multiple groups and partial-out linear models (possibly involving many levels of fixed effects and interactions). In other words,fhdwithin/HDW
efficiently computes residuals from linear models. Similarlyfhdbetween/HDB
, shorthand for 'higher-dimensional between transformation', computes the corresponding means or fitted values. -
flag/L/F
,fdiff/D/Dlog
andfgrowth/G
are S3 generics to compute sequences of lags / leads and suitably lagged and iterated (quasi-, log-) differences and growth rates on time series and panel data.fcumsum
flexibly computes (grouped, ordered) cumulative sums. More in Time Series and Panel Series. -
STD, W, B, HDW, HDB, L, D, Dlog
andG
are parsimonious wrappers around thef-
functions above representing the corresponding transformation 'operators'. They have additional capabilities when applied to data-frames (i.e. variable selection, formula input, auto-renaming and id-variable preservation), and are easier to employ in regression formulas, but are otherwise identical in functionality.
Table of Functions
Function / S3 Generic | Methods | Description | ||
dapply | No methods, works with matrices and data frames | Apply functions to rows or columns | ||
BY | default, matrix, data.frame, grouped_df | Split-Apply-Combine computing | ||
%(r/c)(r/+/-/*//)% | No methods, works with matrices and data frames / lists | Row- and column-arithmetic | ||
(set)TRA | default, matrix, data.frame, grouped_df | Replace and sweep out statistics (by reference) | ||
fscale/STD | default, matrix, data.frame, pseries, pdata.frame, grouped_df | Scale / standardize data | ||
fwithin/W | default, matrix, data.frame, pseries, pdata.frame, grouped_df | Demean / center data | ||
fbetween/B | default, matrix, data.frame, pseries, pdata.frame, grouped_df | Compute means / average data | ||
fhdwithin/HDW | default, matrix, data.frame, pseries, pdata.frame | High-dimensional centering and lm residuals | ||
fhdbetween/HDB | default, matrix, data.frame, pseries, pdata.frame | High-dimensional averages and lm fitted values | ||
flag/L/F , fdiff/D/Dlog , fgrowth/G , fcumsum | default, matrix, data.frame, pseries, pdata.frame, grouped_df | (Sequences of) lags / leads, differences, growth rates and cumulative sums |
See Also
Collapse Overview, Fast Statistical Functions, Time Series and Panel Series
Detailed Statistical Description of Data Frame
Description
descr
offers a fast and detailed description of each variable in a data frame. Since v1.9.0 it fully supports grouped and weighted computations.
Usage
descr(X, ...)
## Default S3 method:
descr(X, by = NULL, w = NULL, cols = NULL,
Ndistinct = TRUE, higher = TRUE, table = TRUE, sort.table = "freq",
Qprobs = c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99), Qtype = 7L,
label.attr = "label", stepwise = FALSE, ...)
## S3 method for class 'grouped_df'
descr(X, w = NULL,
Ndistinct = TRUE, higher = TRUE, table = TRUE, sort.table = "freq",
Qprobs = c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99), Qtype = 7L,
label.attr = "label", stepwise = FALSE, ...)
## S3 method for class 'descr'
as.data.frame(x, ..., gid = "Group")
## S3 method for class 'descr'
print(x, n = 14, perc = TRUE, digits = .op[["digits"]], t.table = TRUE, total = TRUE,
compact = FALSE, summary = !compact, reverse = FALSE, stepwise = FALSE, ...)
Arguments
X |
a (grouped) data frame or list of atomic vectors. Atomic vectors, matrices or arrays can be passed but will first be coerced to data frame using | |||||||||||||||||||||
by |
a factor, | |||||||||||||||||||||
w |
a numeric vector of (non-negative) weights. the default method also supports a one-sided formulas i.e. | |||||||||||||||||||||
cols |
select columns to describe using column names, indices a logical vector or selector function (e.g. | |||||||||||||||||||||
Ndistinct |
logical. | |||||||||||||||||||||
higher |
logical. Argument is passed down to | |||||||||||||||||||||
table |
logical. | |||||||||||||||||||||
sort.table |
an integer or character string specifying how the frequency table should be presented:
| |||||||||||||||||||||
Qprobs |
double. Probabilities for quantiles to compute on numeric variables, passed down to | |||||||||||||||||||||
Qtype |
integer. Quantile types 5-9 following Hyndman and Fan (1996) who recommended type 8, default 7 as in | |||||||||||||||||||||
label.attr |
character. The name of a label attribute to display for each variable (if variables are labeled). | |||||||||||||||||||||
... |
for | |||||||||||||||||||||
x |
an object of class 'descr'. | |||||||||||||||||||||
n |
integer. The maximum number of table elements to print for categorical variables. If the number of distinct elements is | |||||||||||||||||||||
perc |
logical. | |||||||||||||||||||||
digits |
integer. The number of decimals to print in statistics, quantiles and percentage tables. | |||||||||||||||||||||
t.table |
logical. | |||||||||||||||||||||
total |
logical. | |||||||||||||||||||||
compact |
logical. | |||||||||||||||||||||
summary |
logical. | |||||||||||||||||||||
reverse |
logical. | |||||||||||||||||||||
stepwise |
logical. | |||||||||||||||||||||
gid |
character. Name assigned to the group-id column, when describing data by groups. |
Details
descr
was heavily inspired by Hmisc::describe
, but is much faster and has more advanced statistical capabilities. It is principally a wrapper around qsu
, fquantile
(.quantile
), and fndistinct
for numeric variables, and computes frequency tables for categorical variables using qtab
. Date variables are summarized with fnobs
, fndistinct
and frange
.
Since v1.9.0 grouped and weighted computations are fully supported. The use of sampling weights will produce a weighted mean, sd, skewness and kurtosis, and weighted quantiles for numeric data. For categorical data, tables will display the sum of weights instead of the frequencies, and percentage tables as well as the percentage of missing values indicated next to 'Statistics' in print, be relative to the total sum of weights. All this can be done by groups. Grouped (weighted) quantiles are computed using BY
.
For larger datasets, calling the stepwise
option directly from descr()
is recommended, as precomputing the statistics for all variables before digesting the results can be time consuming.
The list-object returned from descr
can efficiently be converted to a tidy data frame using the as.data.frame
method. This representation will not include frequency tables computed for categorical variables.
Value
A 2-level nested list-based object of class 'descr'. The list has the same size as the dataset, and contains the statistics computed for each variable, which are themselves stored in a list containing the class, the label, the basic statistics and quantiles / tables computed for the variable (in matrix form).
The object has attributes attached providing the 'name' of the dataset, the number of rows in the dataset ('N'), an attribute 'arstat' indicating whether arrays of statistics where generated by passing arguments (e.g. pid
) down to qsu.default
, an attribute 'table' indicating whether table = TRUE
(i.e. the object could contain tables for categorical variables), and attributes 'groups' and/or 'weights' providing a GRP
object and/or weight vector for grouped and/or weighted data descriptions.
See Also
qsu
, qtab
, fquantile
, pwcor
, Summary Statistics, Fast Statistical Functions, Collapse Overview
Examples
## Simple Use
descr(iris)
descr(wlddev)
descr(GGDC10S)
# Some useful print options (also try stepwise argument)
print(descr(GGDC10S), reverse = TRUE, t.table = FALSE)
# For bigger data consider: descr(big_data, stepwise = TRUE)
# Generating a data frame
as.data.frame(descr(wlddev, table = FALSE))
## Weighted Desciptions
descr(wlddev, w = ~ replace_na(POP)) # replacing NA's with 0's for fquantile()
## Grouped Desciptions
descr(GGDC10S, ~ Variable)
descr(wlddev, ~ income)
print(descr(wlddev, ~ income), compact = TRUE)
## Grouped & Weighted Desciptions
descr(wlddev, ~ income, w = ~ replace_na(POP))
## Passing Arguments down to qsu.default: for Panel Data Statistics
descr(iris, pid = iris$Species)
descr(wlddev, pid = wlddev$iso3c)
Small Functions to Make R Programming More Efficient
Description
A small set of functions to address some common inefficiencies in R, such as the creation of logical vectors to compare quantities, unnecessary copies of objects in elementary mathematical or subsetting operations, obtaining information about objects (esp. data frames), or dealing with missing values.
Usage
anyv(x, value) # Faster than any(x == value). See also kit::panyv()
allv(x, value) # Faster than all(x == value). See also kit::pallv()
allNA(x) # Faster than all(is.na(x)). See also kit::pallNA()
whichv(x, value, # Faster than which(x == value)
invert = FALSE) # or which(x != value). See also Note (3)
whichNA(x, invert = FALSE) # Faster than which((!)is.na(x))
x %==% value # Infix for whichv(v, value, FALSE), use e.g. in fsubset()
x %!=% value # Infix for whichv(v, value, TRUE). See also Note (3)
alloc(value, n, # Fast rep_len(value, n) or replicate(n, value).
simplify = TRUE) # simplify only works if length(value) == 1. See Details.
copyv(X, v, R, ..., invert # Fast replace(X, v, R), replace(X, X (!/=)= v, R) or
= FALSE, vind1 = FALSE, # replace(X, (!)v, R[(!)v]). See Details and Note (4).
xlist = FALSE) # For multi-replacement see also kit::vswitch()
setv(X, v, R, ..., invert # Same for X[v] <- r, X[x (!/=)= v] <- r or
= FALSE, vind1 = FALSE, # x[(!)v] <- r[(!)v]. Modifies X by reference, fastest.
xlist = FALSE) # X/R/V can also be lists/DFs. See Details and Examples.
setop(X, op, V, ..., # Faster than X <- X +\-\*\/ V (modifies by reference)
rowwise = FALSE) # optionally can also add v to rows of a matrix or list
X %+=% V # Infix for setop(X, "+", V). See also Note (2)
X %-=% V # Infix for setop(X, "-", V). See also Note (2)
X %*=% V # Infix for setop(X, "*", V). See also Note (2)
X %/=% V # Infix for setop(X, "/", V). See also Note (2)
na_rm(x) # Fast: if(anyNA(x)) x[!is.na(x)] else x, last
na_locf(x, set = FALSE) # obs. carried forward and first obs. carried back.
na_focb(x, set = FALSE) # (by reference). These also support lists (NULL/empty)
na_omit(X, cols = NULL, # Faster na.omit for matrices and data frames,
na.attr = FALSE, # can use selected columns to check, attach indices,
prop = 0, ...) # and remove cases with a proportion of values missing
na_insert(X, prop = 0.1, # Insert missing values at random
value = NA)
missing_cases(X, cols=NULL, # The opposite of complete.cases(), faster for DF's.
prop = 0, count = FALSE) # See also kit::panyNA(), kit::pallNA(), kit::pcountNA()
vlengths(X, use.names=TRUE) # Faster lengths() and nchar() (in C, no method dispatch)
vtypes(X, use.names = TRUE) # Get data storage types (faster vapply(X, typeof, ...))
vgcd(x) # Greatest common divisor of positive integers or doubles
fnlevels(x) # Faster version of nlevels(x) (for factors)
fnrow(X) # Faster nrow for data frames (not faster for matrices)
fncol(X) # Faster ncol for data frames (not faster for matrices)
fdim(X) # Faster dim for data frames (not faster for matrices)
seq_row(X) # Fast integer sequences along rows of X
seq_col(X) # Fast integer sequences along columns of X
vec(X) # Vectorization (stacking) of matrix or data frame/list
cinv(x) # Choleski (fast) inverse of symmetric PD matrix, e.g. X'X
Arguments
X , V , R |
a vector, matrix or data frame. | ||||||||||||||||||||||||||
x , v |
a (atomic) vector or matrix ( | ||||||||||||||||||||||||||
value |
a single value of any (atomic) vector type. For | ||||||||||||||||||||||||||
invert |
logical. | ||||||||||||||||||||||||||
set |
logical. | ||||||||||||||||||||||||||
simplify |
logical. If | ||||||||||||||||||||||||||
vind1 |
logical. If | ||||||||||||||||||||||||||
xlist |
logical. If | ||||||||||||||||||||||||||
op |
an integer or character string indicating the operation to perform.
| ||||||||||||||||||||||||||
rowwise |
logical. | ||||||||||||||||||||||||||
cols |
select columns to check for missing values using column names, indices, a logical vector or a function (e.g. | ||||||||||||||||||||||||||
n |
integer. The length of the vector to allocate with | ||||||||||||||||||||||||||
na.attr |
logical. | ||||||||||||||||||||||||||
prop |
double. For | ||||||||||||||||||||||||||
count |
logical. | ||||||||||||||||||||||||||
use.names |
logical. Preserve names if | ||||||||||||||||||||||||||
... |
for |
Details
alloc
is a fusion of rep_len
and replicate
that is faster in both cases. If value
is a length one atomic vector (logical, integer, double, string, complex or raw) and simplify = TRUE
, the functionality is as rep_len(value, n)
i.e. the output is a length n
atomic vector with the same attributes as value
(apart from "names"
, "dim"
and "dimnames"
). For all other cases the functionality is as replicate(n, value, simplify = FALSE)
i.e. the output is a length-n
list of the objects. For efficiency reasons the object is not copied i.e. only the pointer to the object is replicated.
copyv
and setv
are designed to optimize operations that require replacing data in objects in the broadest sense. The only difference between them is that copyv
first deep-copies X
before doing replacements whereas setv
modifies X
in place and returns the result invisibly. There are 3 ways these functions can be used:
To replace a single value,
setv(X, v, R)
is an efficient alternative toX[X == v] <- R
, andcopyv(X, v, R)
is more efficient thanreplace(X, X == v, R)
. This can be inverted usingsetv(X, v, R, invert = TRUE)
, equivalent toX[X != v] <- R
.To do standard replacement with integer or logical indices i.e.
X[v] <- R
is more efficient usingsetv(X, v, R)
, and, ifv
is logical,setv(X, v, R, invert = TRUE)
is efficient forX[!v] <- R
. To distinguish this from use case (1) whenlength(v) == 1
, the argumentvind1 = TRUE
can be set to ensure thatv
is always interpreted as an index.To copy values from objects of equal size i.e.
setv(X, v, R)
is faster thanX[v] <- R[v]
, andsetv(X, v, R, invert = TRUE)
is faster thanX[!v] <- R[!v]
.
Both X
and R
can be atomic or data frames / lists. If X
is a list, the default behavior is to interpret it like a data frame, and apply setv/copyv
to each element/column of X
. If R
is also a list, this is done using mapply
. Thus setv/copyv
can also be used to replace elements or rows in data frames, or copy rows from equally sized frames. Note that for replacing subsets in data frames set
from data.table
provides a more convenient interface (and there is also copy
if you just want to deep-copy an object without any modifications to it).
If X
should not be interpreted like a data frame, setting xlist = TRUE
will interpret it like a 1D list-vector analogous to atomic vectors, except that use case (1) is not permitted i.e. no value comparisons on list elements.
Note
None of these functions (apart from
alloc
) currently support complex vectors.-
setop
and the operators%+=%
,%-=%
,%*=%
and%/=%
also work with integer data, but do not perform any integer related checks. R's integers are bounded between +-2,147,483,647 andNA_integer_
is stored as the value -2,147,483,648. Thus computations resulting in values exceeding +-2,147,483,647 will result in integer overflows, andNA_integer_
should not occur on either side of asetop
call. These are programmers functions and meant to provide the most efficient math possible to responsible users. It is possible to compare factors by the levels (e.g.
iris$Species %==% "setosa")
) or using integers (iris$Species %==% 1L
). The latter is slightly more efficient. Nothing special is implemented for other objects apart from basic types, e.g. for dates (which are stored as doubles) you need to generate a date object i.e.wlddev$date %==% as.Date("2019-01-01")
. Usingwlddev$date %==% "2019-01-01"
will giveinteger(0)
.-
setv/copyv
only allow positive integer indices being passed tov
, and, for efficiency reasons, they only check the first and the last index. Thus if there are indices in the middle that fall outside of the data range it will terminate R.
See Also
Data Transformations, Small (Helper) Functions, Collapse Overview
Examples
oldopts <- options(max.print = 70)
## Which value
whichNA(wlddev$PCGDP) # Same as which(is.na(wlddev$PCGDP))
whichNA(wlddev$PCGDP, invert = TRUE) # Same as which(!is.na(wlddev$PCGDP))
whichv(wlddev$country, "Chad") # Same as which(wlddev$county == "Chad")
wlddev$country %==% "Chad" # Same thing
whichv(wlddev$country, "Chad", TRUE) # Same as which(wlddev$county != "Chad")
wlddev$country %!=% "Chad" # Same thing
lvec <- wlddev$country == "Chad" # If we already have a logical vector...
whichv(lvec, FALSE) # is fastver than which(!lvec)
rm(lvec)
# Using the %==% operator can yield tangible performance gains
fsubset(wlddev, iso3c %==% "DEU") # 3x faster than:
fsubset(wlddev, iso3c == "DEU")
# With multiple categories we can use %iin%
fsubset(wlddev, iso3c %iin% c("DEU", "ITA", "FRA"))
## Math by reference: permissible types of operations
x <- alloc(1.0, 1e5) # Vector
x %+=% 1
x %+=% 1:1e5
xm <- matrix(alloc(1.0, 1e5), ncol = 100) # Matrix
xm %+=% 1
xm %+=% 1:1e3
setop(xm, "+", 1:100, rowwise = TRUE)
xm %+=% xm
xm %+=% 1:1e5
xd <- qDF(replicate(100, alloc(1.0, 1e3), simplify = FALSE)) # Data Frame
xd %+=% 1
xd %+=% 1:1e3
setop(xd, "+", 1:100, rowwise = TRUE)
xd %+=% xd
rm(x, xm, xd)
## setv() and copyv()
x <- rnorm(100)
y <- sample.int(10, 100, replace = TRUE)
setv(y, 5, 0) # Faster than y[y == 5] <- 0
setv(y, 4, x) # Faster than y[y == 4] <- x[y == 4]
setv(y, 20:30, y[40:50]) # Faster than y[20:30] <- y[40:50]
setv(y, 20:30, x) # Faster than y[20:30] <- x[20:30]
rm(x, y)
# Working with data frames, here returning copies of the frame
copyv(mtcars, 20:30, ss(mtcars, 10:20))
copyv(mtcars, 20:30, fscale(mtcars))
ftransform(mtcars, new = copyv(cyl, 4, vs))
# Column-wise:
copyv(mtcars, 2:3, fscale(mtcars), xlist = TRUE)
copyv(mtcars, 2:3, mtcars[4:5], xlist = TRUE)
## Missing values
mtc_na <- na_insert(mtcars, 0.15) # Set 15% of values missing at random
fnobs(mtc_na) # See observation count
missing_cases(mtc_na) # Fast equivalent to !complete.cases(mtc_na)
missing_cases(mtc_na, cols = 3:4) # Missing cases on certain columns?
missing_cases(mtc_na, count = TRUE) # Missing case count
missing_cases(mtc_na, prop = 0.8) # Cases with 80% or more missing
missing_cases(mtc_na, cols = 3:4, prop = 1) # Cases mssing columns 3 and 4
missing_cases(mtc_na, cols = 3:4, count = TRUE) # Missing case count on columns 3 and 4
na_omit(mtc_na) # 12x faster than na.omit(mtc_na)
na_omit(mtc_na, prop = 0.8) # Only remove cases missing 80% or more
na_omit(mtc_na, na.attr = TRUE) # Adds attribute with removed cases, like na.omit
na_omit(mtc_na, cols = .c(vs, am)) # Removes only cases missing vs or am
na_omit(qM(mtc_na)) # Also works for matrices
na_omit(mtc_na$vs, na.attr = TRUE) # Also works with vectors
na_rm(mtc_na$vs) # For vectors na_rm is faster ...
rm(mtc_na)
## Efficient vectorization
head(vec(EuStockMarkets)) # Atomic objects: no copy at all
head(vec(mtcars)) # Lists: directly in C
options(oldopts)
Fast Data Manipulation
Description
collapse provides the following functions for fast manipulation of (mostly) data frames.
-
fselect
is a much faster alternative todplyr::select
to select columns using expressions involving column names.get_vars
is a more versatile and programmer friendly function to efficiently select and replace columns by names, indices, logical vectors, regular expressions, or using functions to identify columns. -
num_vars
,cat_vars
,char_vars
,fact_vars
,logi_vars
anddate_vars
are convenience functions to efficiently select and replace columns by data type. -
add_vars
efficiently adds new columns at any position within a data frame (default at the end). This can be done vie replacement (i.e.add_vars(data) <- newdata
) or returning the appended data, e.g.,add_vars(data, newdata1, newdata2, ...)
. It is thus also an efficient alternative tocbind.data.frame
. -
rowbind
efficiently combines data frames / lists row-wise. The implementation is derived fromdata.table::rbindlist
, it is also a fast alternative torbind.data.frame
. -
join
provides fast, class-agnostic, and verbose table joins. -
pivot
efficiently reshapes data, supporting longer, wider and recast pivoting, as well as multi-column-pivots and pivots taking along variable labels. -
fsubset
is a much faster version ofsubset
to efficiently subset vectors, matrices and data frames. If the non-standard evaluation offered byfsubset
is not needed, the functionss
is a much faster and more secure alternative to[.data.frame
. -
fslice(v)
is a much faster alternative todplyr::slice_[head|tail|min|max]
for filtering/deduplicating matrix-like objects (by groups). -
fsummarise
is a much faster version ofdplyr::summarise
, especially when used together with the Fast Statistical Functions andfgroup_by
. -
fmutate
is a much faster version ofdplyr::mutate
, especially when used together with the Fast Statistical Functions, the fast Data Transformation Functions, andfgroup_by
. -
ftransform(v)
is a much faster version oftransform
, which also supports list input and nested pipelines.settransform(v)
does all of that by reference, i.e. it assigns to the calling environment.fcompute(v)
is similar toftransform(v)
but only returns modified/computed columns. -
roworder
is a fast substitute fordplyr::arrange
, but the syntax is inspired bydata.table::setorder
. -
colorder
efficiently reorders columns in a data frame, see alsodata.table::setcolorder
. -
frename
is a fast substitute fordplyr::rename
, to efficiently rename various objects.setrename
renames objects by reference.relabel
andsetrelabel
do the same thing for variable labels (see alsovlabels
).
Table of Functions
Function / S3 Generic | Methods | Description | ||
fselect(<-) | No methods, for data frames | Fast select or replace columns (non-standard evaluation) | ||
get_vars(<-) , num_vars(<-) , cat_vars(<-) , char_vars(<-) , fact_vars(<-) , logi_vars(<-) , date_vars(<-) | No methods, for data frames | Fast select or replace columns | ||
add_vars(<-) | No methods, for data frames | Fast add columns | ||
rowbind | No methods, for lists of lists/data frames | Fast row-binding lists | ||
join | No methods, for data frames | Fast table joins | ||
pivot | No methods, for data frames | Fast reshaping | ||
fsubset | default, matrix, data.frame, pseries, pdata.frame | Fast subset data (non-standard evaluation) | ||
ss | No methods, for data frames | Fast subset data frames | ||
fslice(v) | No methods, for matrices and data frames | Fast slicing of rows | ||
fsummarise | No methods, for data frames | Fast data aggregation | ||
fmutate , (f/set)transform(v)(<-) | No methods, for data frames | Compute, modify or delete columns (non-standard evaluation) | ||
fcompute(v) | No methods, for data frames | Compute or modify columns, returned in a new data frame (non-standard evaluation) | ||
roworder(v) | No methods, for data frames incl. pdata.frame | Reorder rows and return data frame (standard and non-standard evaluation) | ||
colorder(v) | No methods, for data frames | Reorder columns and return data frame (standard and non-standard evaluation) | ||
(f/set)rename , (set)relabel | No methods, for all objects with 'names' attribute | Rename and return object / relabel columns in a data frame. | ||
See Also
Collapse Overview, Quick Data Conversion, Recode and Replace Values
Fast Grouping and Ordering
Description
collapse provides the following functions to efficiently group and order data:
-
radixorder(v)
, provides fast radix-ordering through direct access to the methodorder(..., method = "radix")
, as well as the possibility to return some attributes very useful for grouping data and finding unique elements. The functionroworder(v)
efficiently reorders a data frame. -
group(v)
provides fast grouping in first-appearance order of rows, based on a hashing algorithm in C. Objects have class 'qG', see below. -
GRP
creates collapse grouping objects of class 'GRP' based onradixorder
orgroup
. 'GRP' objects form the central building block for grouped operations and programming in collapse and are very efficient inputs to all collapse functions supporting grouped operations. -
fgroup_by
provides a fast replacement fordplyr::group_by
, creating a grouped data frame (or data.table / tibble etc.) with a 'GRP' object attached. This grouped frame can be used for grouped operations using collapse's fast functions. -
fmatch
is a fast alternative tomatch
, which also supports matching of data frame rows. -
funique
is a faster version ofunique
. The data frame method also allows selecting unique rows according to a subset of the columns.fnunique
efficiently calculates the number of unique values/rows.fduplicated
is a fast alternative toduplicated
.any_duplicated
is a simpler and faster alternative toanyDuplicated
. -
fcount(v)
computes group counts based on a subset of columns in the data, and is a fast replacement fordplyr::count
. -
qF
, shorthand for 'quick-factor' implements very fast factor generation from atomic vectors using either radix orderingmethod = "radix"
or hashingmethod = "hash"
. Factors can also be used for efficient grouped programming with collapse functions, especially if they are generated usingqF(x, na.exclude = FALSE)
which assigns a level to missing values and attaches a class 'na.included' ensuring that no additional missing value checks are executed by collapse functions. -
qG
, shorthand for 'quick-group', generates a kind of factor-light without the levels attribute but instead an attribute providing the number of levels. Optionally the levels / groups can be attached, but without converting them to character. Objects have a class 'qG', which is also recognized in the collapse ecosystem. -
fdroplevels
is a substantially faster replacement fordroplevels
. -
finteraction
is a fast alternative tointeraction
implemented as a wrapper aroundas_factor_GRP(GRP(...))
. It can be used to generate a factor from multiple vectors, factors or a list of vectors / factors. Unused factor levels are always dropped. -
groupid
is a generalization ofdata.table::rleid
providing a run-length type group-id from atomic vectors. It is generalization as it also supports passing an ordering vector and skipping missing values. For exampleqF
andqG
withmethod = "radix"
are essentially implemented usinggroupid(x, radixorder(x))
. -
seqid
is a specialized function which creates a group-id from sequences of integer values. For any regular panel datasetgroupid(id, order(id, time))
andseqid(time, order(id, time))
provide the same id variable.seqid
is especially useful for identifying discontinuities in time-sequences. -
timeid
is a specialized function to convert integer or double vectors representing time (such as 'Date', 'POSIXct' etc.) to factor or 'qG' object based on the greatest common divisor of elements (thus preserving gaps in time intervals).
Table of Functions
Function / S3 Generic | Methods | Description | ||
radixorder(v) | No methods, for data frames and vectors | Radix-based ordering + grouping information | ||
roworder(v) | No methods, for data frames incl. pdata.frame | Row sorting/reordering | ||
group(v) | No methods, for data frames and vectors | Hash-based grouping + grouping information | ||
GRP | default, GRP, factor, qG, grouped_df, pseries, pdata.frame | Fast grouping and a flexible grouping object | ||
fgroup_by | No methods, for data frames | Fast grouped data frame | ||
fmatch | No methods, for vectors and data frames | Fast matching | ||
funique , fnunique , fduplicated , any_duplicated | default, data.frame, sf, pseries, pdata.frame, list | Fast (number of) unique values/rows | ||
fcount(v) | Internal generic, supports vectors, matrices, data.frames, lists, grouped_df and pdata.frame | Fast group counts | ||
qF | No methods, for vectors | Quick factor generation | ||
qG | No methods, for vectors | Quick grouping of vectors and a 'factor-light' class | ||
fdroplevels | factor, data.frame, list | Fast removal of unused factor levels | ||
finteraction | No methods, for data frames and vectors | Fast interactions | ||
groupid | No methods, for vectors | Run-length type group-id | ||
seqid | No methods, for integer vectors | Run-length type integer sequence-id | ||
timeid | No methods, for integer or double vectors | Integer-id from time/date sequences | ||
See Also
Collapse Overview, Data Frame Manipulation, Time Series and Panel Series
Fast (Grouped, Weighted) Statistical Functions for Matrix-Like Objects
Description
With fsum
, fprod
, fmean
, fmedian
, fmode
, fvar
, fsd
, fmin
, fmax
, fnth
, ffirst
, flast
, fnobs
and fndistinct
, collapse presents a coherent set of extremely fast and flexible statistical functions (S3 generics) to perform column-wise, grouped and weighted computations on vectors, matrices and data frames, with special support for grouped data frames / tibbles (dplyr) and data.table's.
Usage
## All functions (FUN) follow a common syntax in 4 methods: FUN(x, ...) ## Default S3 method: FUN(x, g = NULL, [w = NULL,] TRA = NULL, [na.rm = TRUE,] use.g.names = TRUE, [nthreads = 1L,] ...) ## S3 method for class 'matrix' FUN(x, g = NULL, [w = NULL,] TRA = NULL, [na.rm = TRUE,] use.g.names = TRUE, drop = TRUE, [nthreads = 1L,] ...) ## S3 method for class 'data.frame' FUN(x, g = NULL, [w = NULL,] TRA = NULL, [na.rm = TRUE,] use.g.names = TRUE, drop = TRUE, [nthreads = 1L,] ...) ## S3 method for class 'grouped_df' FUN(x, [w = NULL,] TRA = NULL, [na.rm = TRUE,] use.g.names = FALSE, keep.group_vars = TRUE, [keep.w = TRUE,] [stub = TRUE,] [nthreads = 1L,] ...)
Arguments
x | a vector, matrix, data frame or grouped data frame (class 'grouped_df'). | |
g | a factor, GRP object, atomic vector (internally converted to factor) or a list of vectors / factors (internally converted to a GRP object) used to group x . |
|
w | a numeric vector of (non-negative) weights, may contain missing values. Supported by fsum , fprod , fmean , fmedian , fnth , fvar , fsd and fmode . |
|
TRA | an integer or quoted operator indicating the transformation to perform:
0 - "na" | 1 - "fill" | 2 - "replace" | 3 - "-" | 4 - "-+" | 5 - "/" | 6 - "%" | 7 - "+" | 8 - "*" | 9 - "%%" | 10 - "-%%". See TRA . |
|
na.rm | logical. Skip missing values in x . Defaults to TRUE in all functions and implemented at very little computational cost. Not available for fnobs . |
|
use.g.names | logical. Make group-names and add to the result as names (default method) or row-names (matrix and data frame methods). No row-names are generated for data.table's. | |
nthreads | integer. The number of threads to utilize. Supported by fsum , fmean , fmedian , fnth , fmode and fndistinct . |
|
drop | matrix and data.frame methods: Logical. TRUE drops dimensions and returns an atomic vector if g = NULL and TRA = NULL . |
|
keep.group_vars | grouped_df method: Logical. FALSE removes grouping variables after computation. By default grouping variables are added, even if not present in the grouped_df. |
|
keep.w | grouped_df method: Logical. TRUE (default) also aggregates weights and saves them in a column, FALSE removes weighting variable after computation (if contained in grouped_df ). |
|
stub | grouped_df method: Character. If keep.w = TRUE and stub = TRUE (default), the aggregated weights column is prefixed by the name of the aggregation function (mostly "sum." ). Users can specify a different prefix through this argument, or set it to FALSE to avoid prefixing. |
|
... | arguments to be passed to or from other methods. If TRA is used, passing set = TRUE will transform data by reference and return the result invisibly (except for the grouped_df method which always returns visible output). |
|
Details
Please see the documentation of individual functions.
Value
x
suitably aggregated or transformed. Data frame column-attributes and overall attributes are generally preserved if the output is of the same data type.
Related Functionality
Panel-decomposed (i.e. between and within) statistics as well as grouped and weighted skewness and kurtosis are implemented in
qsu
.The vector-valued functions and operators
fcumsum
,fscale/STD
,fbetween/B
,fhdbetween/HDB
,fwithin/W
,fhdwithin/HDW
,flag/L/F
,fdiff/D/Dlog
andfgrowth/G
are grouped under Data Transformations and Time Series and Panel Series. These functions also support indexed data (plm).
Examples
## default vector method mpg <- mtcars$mpg fsum(mpg) # Simple sum fsum(mpg, TRA = "/") # Simple transformation: divide all values by the sum fsum(mpg, mtcars$cyl) # Grouped sum fmean(mpg, mtcars$cyl) # Grouped mean fmean(mpg, w = mtcars$hp) # Weighted mean, weighted by hp fmean(mpg, mtcars$cyl, mtcars$hp) # Grouped mean, weighted by hp fsum(mpg, mtcars$cyl, TRA = "/") # Proportions / division by group sums fmean(mpg, mtcars$cyl, mtcars$hp, # Subtract weighted group means, see also ?fwithin TRA = "-") ## data.frame method fsum(mtcars) fsum(mtcars, TRA = "%") # This computes percentages fsum(mtcars, mtcars[c(2,8:9)]) # Grouped column sum g <- GRP(mtcars, ~ cyl + vs + am) # Here precomputing the groups! fsum(mtcars, g) # Faster !! fmean(mtcars, g, mtcars$hp) fmean(mtcars, g, mtcars$hp, "-") # Demeaning by weighted group means.. fmean(fgroup_by(mtcars, cyl, vs, am), hp, "-") # Another way of doing it.. fmode(wlddev, drop = FALSE) # Compute statistical modes of variables in this data fmode(wlddev, wlddev$income) # Grouped statistical modes .. ## matrix method m <- qM(mtcars) fsum(m) fsum(m, g) # .. ## method for grouped data frames - created with dplyr::group_by or fgroup_by library(dplyr) mtcars |> group_by(cyl,vs,am) |> select(mpg,carb) |> fsum() mtcars |> fgroup_by(cyl,vs,am) |> fselect(mpg,carb) |> fsum() # equivalent and faster !! mtcars |> fgroup_by(cyl,vs,am) |> fsum(TRA = "%") mtcars |> fgroup_by(cyl,vs,am) |> fmean(hp) # weighted grouped mean, save sum of weights mtcars |> fgroup_by(cyl,vs,am) |> fmean(hp, keep.group_vars = FALSE)
Benchmark
## This compares fsum with data.table (2 threads) and base::rowsum # Starting with small data mtcDT <- qDT(mtcars) f <- qF(mtcars$cyl) library(microbenchmark) microbenchmark(mtcDT[, lapply(.SD, sum), by = f], rowsum(mtcDT, f, reorder = FALSE), fsum(mtcDT, f, na.rm = FALSE), unit = "relative") # expr min lq mean median uq max neval cld # mtcDT[, lapply(.SD, sum), by = f] 145.436928 123.542134 88.681111 98.336378 71.880479 85.217726 100 c # rowsum(mtcDT, f, reorder = FALSE) 2.833333 2.798203 2.489064 2.937889 2.425724 2.181173 100 b # fsum(mtcDT, f, na.rm = FALSE) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 100 a # Now larger data tdata <- qDT(replicate(100, rnorm(1e5), simplify = FALSE)) # 100 columns with 100.000 obs f <- qF(sample.int(1e4, 1e5, TRUE)) # A factor with 10.000 groups microbenchmark(tdata[, lapply(.SD, sum), by = f], rowsum(tdata, f, reorder = FALSE), fsum(tdata, f, na.rm = FALSE), unit = "relative") # expr min lq mean median uq max neval cld # tdata[, lapply(.SD, sum), by = f] 2.646992 2.975489 2.834771 3.081313 3.120070 1.2766475 100 c # rowsum(tdata, f, reorder = FALSE) 1.747567 1.753313 1.629036 1.758043 1.839348 0.2720937 100 b # fsum(tdata, f, na.rm = FALSE) 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000 100 a
See Also
Collapse Overview, Data Transformations, Time Series and Panel Series
Fast Between (Averaging) and (Quasi-)Within (Centering) Transformations
Description
fbetween
and fwithin
are S3 generics to efficiently obtain between-transformed (averaged) or (quasi-)within-transformed (demeaned) data. These operations can be performed groupwise and/or weighted. B
and W
are wrappers around fbetween
and fwithin
representing the 'between-operator' and the 'within-operator'.
(B
/ W
provide more flexibility than fbetween
/ fwithin
when applied to data frames (i.e. column subsetting, formula input, auto-renaming and id-variable-preservation capabilities...), but are otherwise identical.)
Usage
fbetween(x, ...)
fwithin(x, ...)
B(x, ...)
W(x, ...)
## Default S3 method:
fbetween(x, g = NULL, w = NULL, na.rm = .op[["na.rm"]], fill = FALSE, ...)
## Default S3 method:
fwithin(x, g = NULL, w = NULL, na.rm = .op[["na.rm"]], mean = 0, theta = 1, ...)
## Default S3 method:
B(x, g = NULL, w = NULL, na.rm = .op[["na.rm"]], fill = FALSE, ...)
## Default S3 method:
W(x, g = NULL, w = NULL, na.rm = .op[["na.rm"]], mean = 0, theta = 1, ...)
## S3 method for class 'matrix'
fbetween(x, g = NULL, w = NULL, na.rm = .op[["na.rm"]], fill = FALSE, ...)
## S3 method for class 'matrix'
fwithin(x, g = NULL, w = NULL, na.rm = .op[["na.rm"]], mean = 0, theta = 1, ...)
## S3 method for class 'matrix'
B(x, g = NULL, w = NULL, na.rm = .op[["na.rm"]], fill = FALSE, stub = .op[["stub"]], ...)
## S3 method for class 'matrix'
W(x, g = NULL, w = NULL, na.rm = .op[["na.rm"]], mean = 0, theta = 1,
stub = .op[["stub"]], ...)
## S3 method for class 'data.frame'
fbetween(x, g = NULL, w = NULL, na.rm = .op[["na.rm"]], fill = FALSE, ...)
## S3 method for class 'data.frame'
fwithin(x, g = NULL, w = NULL, na.rm = .op[["na.rm"]], mean = 0, theta = 1, ...)
## S3 method for class 'data.frame'
B(x, by = NULL, w = NULL, cols = is.numeric, na.rm = .op[["na.rm"]],
fill = FALSE, stub = .op[["stub"]], keep.by = TRUE, keep.w = TRUE, ...)
## S3 method for class 'data.frame'
W(x, by = NULL, w = NULL, cols = is.numeric, na.rm = .op[["na.rm"]],
mean = 0, theta = 1, stub = .op[["stub"]], keep.by = TRUE, keep.w = TRUE, ...)
# Methods for indexed data / compatibility with plm:
## S3 method for class 'pseries'
fbetween(x, effect = 1L, w = NULL, na.rm = .op[["na.rm"]], fill = FALSE, ...)
## S3 method for class 'pseries'
fwithin(x, effect = 1L, w = NULL, na.rm = .op[["na.rm"]], mean = 0, theta = 1, ...)
## S3 method for class 'pseries'
B(x, effect = 1L, w = NULL, na.rm = .op[["na.rm"]], fill = FALSE, ...)
## S3 method for class 'pseries'
W(x, effect = 1L, w = NULL, na.rm = .op[["na.rm"]], mean = 0, theta = 1, ...)
## S3 method for class 'pdata.frame'
fbetween(x, effect = 1L, w = NULL, na.rm = .op[["na.rm"]], fill = FALSE, ...)
## S3 method for class 'pdata.frame'
fwithin(x, effect = 1L, w = NULL, na.rm = .op[["na.rm"]], mean = 0, theta = 1, ...)
## S3 method for class 'pdata.frame'
B(x, effect = 1L, w = NULL, cols = is.numeric, na.rm = .op[["na.rm"]],
fill = FALSE, stub = .op[["stub"]], keep.ids = TRUE, keep.w = TRUE, ...)
## S3 method for class 'pdata.frame'
W(x, effect = 1L, w = NULL, cols = is.numeric, na.rm = .op[["na.rm"]],
mean = 0, theta = 1, stub = .op[["stub"]], keep.ids = TRUE, keep.w = TRUE, ...)
# Methods for grouped data frame / compatibility with dplyr:
## S3 method for class 'grouped_df'
fbetween(x, w = NULL, na.rm = .op[["na.rm"]], fill = FALSE,
keep.group_vars = TRUE, keep.w = TRUE, ...)
## S3 method for class 'grouped_df'
fwithin(x, w = NULL, na.rm = .op[["na.rm"]], mean = 0, theta = 1,
keep.group_vars = TRUE, keep.w = TRUE, ...)
## S3 method for class 'grouped_df'
B(x, w = NULL, na.rm = .op[["na.rm"]], fill = FALSE,
stub = .op[["stub"]], keep.group_vars = TRUE, keep.w = TRUE, ...)
## S3 method for class 'grouped_df'
W(x, w = NULL, na.rm = .op[["na.rm"]], mean = 0, theta = 1,
stub = .op[["stub"]], keep.group_vars = TRUE, keep.w = TRUE, ...)
Arguments
x |
a numeric vector, matrix, data frame, 'indexed_series' ('pseries'), 'indexed_frame' ('pdata.frame') or grouped data frame ('grouped_df'). |
g |
a factor, |
by |
B and W data.frame method: Same as g, but also allows one- or two-sided formulas i.e. |
w |
a numeric vector of (non-negative) weights. |
cols |
B/W (p)data.frame methods: Select columns to scale using a function, column names, indices or a logical vector. Default: All numeric columns. Note: |
na.rm |
logical. Skip missing values in |
effect |
plm methods: Select which panel identifier should be used as grouping variable. 1L takes the first variable in the index, 2L the second etc. Index variables can also be called by name using a character string. If more than one variable is supplied, the corresponding index-factors are interacted. |
stub |
character. A prefix/stub to add to the names of all transformed columns. |
fill |
option to |
mean |
option to |
theta |
option to |
keep.by , keep.ids , keep.group_vars |
B and W data.frame, pdata.frame and grouped_df methods: Logical. Retain grouping / panel-identifier columns in the output. For data frames this only works if grouping variables were passed in a formula. |
keep.w |
B and W data.frame, pdata.frame and grouped_df methods: Logical. Retain column containing the weights in the output. Only works if |
... |
arguments to be passed to or from other methods. |
Details
Without groups, fbetween
/B
replaces all data points in x
with their mean or weighted mean (if w
is supplied). Similarly fwithin/W
subtracts the (weighted) mean from all data points i.e. centers the data on the mean.
With groups supplied to g
, the replacement / centering performed by fbetween/B
| fwithin/W
becomes groupwise. In terms of panel data notation: If x
is a vector in such a panel dataset, xit
denotes a single data-point belonging to group i
in time-period t
(t
need not be a time-period). Then xi.
denotes x
, averaged over t
. fbetween
/B
now returns xi.
and fwithin
/W
returns x - xi.
. Thus for any data x
and any grouping vector g
: B(x,g) + W(x,g) = xi. + x - xi. = x
. In terms of variance, fbetween/B
only retains the variance between group averages, while fwithin
/W
, by subtracting out group means, only retains the variance within those groups.
The data replacement performed by fbetween
/B
can keep (default) or overwrite missing values (option fill = TRUE
) in x
. fwithin/W
can center data simply (default), or add back a mean after centering (option mean = value
), or add the overall mean in groupwise computations (option mean = "overall.mean"
). Let x..
denote the overall mean of x
, then fwithin
/W
with mean = "overall.mean"
returns x - xi. + x..
instead of x - xi.
. This is useful to get rid of group-differences but preserve the overall level of the data. In regression analysis, centering with mean = "overall.mean"
will only change the constant term. See Examples.
If theta != 1
, fwithin
/W
performs quasi-demeaning x - theta * xi.
. If mean = "overall.mean"
, x - theta * xi. + theta * x..
is returned, so that the mean of the partially demeaned data is still equal to the overall data mean x..
. A numeric value passed to mean
will simply be added back to the quasi-demeaned data i.e. x - theta * xi. + mean
.
Now in the case of a linear panel model y_{it} = \beta_0 + \beta_1 X_{it} + u_{it}
with u_{it} = \alpha_i + \epsilon_{it}
. If \alpha_i \neq \alpha = const.
(there exists individual heterogeneity), then pooled OLS is at least inefficient and inference on \beta_1
is invalid. If E[\alpha_i|X_{it}] = 0
(mean independence of individual heterogeneity \alpha_i
), the variance components or 'random-effects' estimator provides an asymptotically efficient FGLS solution by estimating a transformed model y_{it}-\theta y_{i.} = \beta_0 + \beta_1 (X_{it} - \theta X_{i.}) + (u_{it} - \theta u_{i.}
), where \theta = 1 - \frac{\sigma_\alpha}{\sqrt(\sigma^2_\alpha + T \sigma^2_\epsilon)}
. An estimate of \theta
can be obtained from the an estimate of \hat{u}_{it}
(the residuals from the pooled model). If E[\alpha_i|X_{it}] \neq 0
, pooled OLS is biased and inconsistent, and taking \theta = 1
gives an unbiased and consistent fixed-effects estimator of \beta_1
. See Examples.
Value
fbetween
/B
returns x
with every element replaced by its (groupwise) mean (xi.
). Missing values are preserved if fill = FALSE
(the default). fwithin/W
returns x
where every element was subtracted its (groupwise) mean (x - theta * xi. + mean
or, if mean = "overall.mean"
, x - theta * xi. + theta * x..
). See Details.
References
Mundlak, Yair. 1978. On the Pooling of Time Series and Cross Section Data. Econometrica 46 (1): 69-85.
See Also
fhdbetween/HDB and fhdwithin/HDW
, fscale/STD
, TRA
, Data Transformations, Collapse Overview
Examples
## Simple centering and averaging
head(fbetween(mtcars))
head(B(mtcars))
head(fwithin(mtcars))
head(W(mtcars))
all.equal(fbetween(mtcars) + fwithin(mtcars), mtcars)
## Groupwise centering and averaging
head(fbetween(mtcars, mtcars$cyl))
head(fwithin(mtcars, mtcars$cyl))
all.equal(fbetween(mtcars, mtcars$cyl) + fwithin(mtcars, mtcars$cyl), mtcars)
head(W(wlddev, ~ iso3c, cols = 9:13)) # Center the 5 series in this dataset by country
head(cbind(get_vars(wlddev,"iso3c"), # Same thing done manually using fwithin..
add_stub(fwithin(get_vars(wlddev,9:13), wlddev$iso3c), "W.")))
## Using B() and W() for fixed-effects regressions:
# Several ways of running the same regression with cyl-fixed effects
lm(W(mpg,cyl) ~ W(carb,cyl), data = mtcars) # Centering each individually
lm(mpg ~ carb, data = W(mtcars, ~ cyl, stub = FALSE)) # Centering the entire data
lm(mpg ~ carb, data = W(mtcars, ~ cyl, stub = FALSE, # Here only the intercept changes
mean = "overall.mean"))
lm(mpg ~ carb + B(carb,cyl), data = mtcars) # Procedure suggested by
# ..Mundlak (1978) - partialling out group averages amounts to the same as demeaning the data
plm::plm(mpg ~ carb, mtcars, index = "cyl", model = "within") # "Proof"..
# This takes the interaction of cyl, vs and am as fixed effects
lm(W(mpg) ~ W(carb), data = iby(mtcars, id = finteraction(cyl, vs, am)))
lm(mpg ~ carb, data = W(mtcars, ~ cyl + vs + am, stub = FALSE))
lm(mpg ~ carb + B(carb,list(cyl,vs,am)), data = mtcars)
# Now with cyl fixed effects weighted by hp:
lm(W(mpg,cyl,hp) ~ W(carb,cyl,hp), data = mtcars)
lm(mpg ~ carb, data = W(mtcars, ~ cyl, ~ hp, stub = FALSE))
lm(mpg ~ carb + B(carb,cyl,hp), data = mtcars) # WRONG ! Gives a different coefficient!!
## Manual variance components (random-effects) estimation
res <- HDW(mtcars, mpg ~ carb)[[1]] # Get residuals from pooled OLS
sig2_u <- fvar(res)
sig2_e <- fvar(fwithin(res, mtcars$cyl))
T <- length(res) / fndistinct(mtcars$cyl)
sig2_alpha <- sig2_u - sig2_e
theta <- 1 - sqrt(sig2_alpha) / sqrt(sig2_alpha + T * sig2_e)
lm(mpg ~ carb, data = W(mtcars, ~ cyl, theta = theta, mean = "overall.mean", stub = FALSE))
# A slightly different method to obtain theta...
plm::plm(mpg ~ carb, mtcars, index = "cyl", model = "random")
Efficiently Count Observations by Group
Description
A much faster replacement for dplyr::count
.
Usage
fcount(x, ..., w = NULL, name = "N", add = FALSE,
sort = FALSE, decreasing = FALSE)
fcountv(x, cols = NULL, w = NULL, name = "N", add = FALSE,
sort = FALSE, ...)
Arguments
x |
a data frame or list-like object, including 'grouped_df' or 'indexed_frame'. Atomic vectors or matrices can also be passed, but will be sent through |
... |
for |
cols |
select columns to count cases by, using column names, indices, a logical vector or a selector function (e.g. |
w |
a numeric vector of weights, may contain missing values. In |
name |
character. The name of the column containing the count or sum of weights. |
add |
|
sort , decreasing |
arguments passed to |
Value
If x
is a list, an object of the same type as x
with a column (name
) added at the end giving the count. Otherwise, if x
is atomic, a data frame returned from qDF(x)
with the count column added. By default (add = FALSE
) only the unique rows of x
of the columns used for counting are returned.
See Also
GRPN
, fnobs
, fndistinct
, Fast Grouping and Ordering, Collapse Overview
Examples
fcount(mtcars, cyl, vs, am)
fcountv(mtcars, cols = .c(cyl, vs, am))
fcount(mtcars, cyl, vs, am, sort = TRUE)
fcount(mtcars, cyl, vs, am, add = TRUE)
fcount(mtcars, cyl, vs, am, add = "group_vars")
## With grouped data
mtcars |> fgroup_by(cyl, vs, am) |> fcount()
mtcars |> fgroup_by(cyl, vs, am) |> fcount(add = TRUE)
mtcars |> fgroup_by(cyl, vs, am) |> fcount(add = "group_vars")
## With indexed data: by default counting on the first index variable
wlddev |> findex_by(country, year) |> fcount()
wlddev |> findex_by(country, year) |> fcount(add = TRUE)
# Use fcountv to pass additional arguments to GRP.pdata.frame,
# here using the effect argument to choose a different index variable
wlddev |> findex_by(country, year) |> fcountv(effect = "year")
wlddev |> findex_by(country, year) |> fcountv(add = "group_vars", effect = "year")
Fast (Grouped, Ordered) Cumulative Sum for Matrix-Like Objects
Description
fcumsum
is a generic function that computes the (column-wise) cumulative sum of x
, (optionally) grouped by g
and/or ordered by o
. Several options to deal with missing values are provided.
Usage
fcumsum(x, ...)
## Default S3 method:
fcumsum(x, g = NULL, o = NULL, na.rm = .op[["na.rm"]], fill = FALSE, check.o = TRUE, ...)
## S3 method for class 'matrix'
fcumsum(x, g = NULL, o = NULL, na.rm = .op[["na.rm"]], fill = FALSE, check.o = TRUE, ...)
## S3 method for class 'data.frame'
fcumsum(x, g = NULL, o = NULL, na.rm = .op[["na.rm"]], fill = FALSE, check.o = TRUE, ...)
# Methods for indexed data / compatibility with plm:
## S3 method for class 'pseries'
fcumsum(x, na.rm = .op[["na.rm"]], fill = FALSE, shift = "time", ...)
## S3 method for class 'pdata.frame'
fcumsum(x, na.rm = .op[["na.rm"]], fill = FALSE, shift = "time", ...)
# Methods for grouped data frame / compatibility with dplyr:
## S3 method for class 'grouped_df'
fcumsum(x, o = NULL, na.rm = .op[["na.rm"]], fill = FALSE, check.o = TRUE,
keep.ids = TRUE, ...)
Arguments
x |
a numeric vector / time series, (time series) matrix, data frame, 'indexed_series' ('pseries'), 'indexed_frame' ('pdata.frame') or grouped data frame ('grouped_df'). |
g |
a factor, |
o |
a vector or list of vectors providing the order in which the elements of |
na.rm |
logical. Skip missing values in |
fill |
if |
check.o |
logical. Programmers option: |
shift |
pseries / pdata.frame methods: character. |
keep.ids |
pdata.frame / grouped_df methods: Logical. Drop all identifiers from the output (which includes all grouping variables and variables passed to |
... |
arguments to be passed to or from other methods. |
Details
If na.rm = FALSE
, fcumsum
works like cumsum
and propagates missing values. The default na.rm = TRUE
skips missing values and computes the cumulative sum on the non-missing values. Missing values are kept. If fill = TRUE
, missing values are replaced with the previous value of the cumulative sum (starting from 0), computed on the non-missing values.
By default the cumulative sum is computed in the order in which elements appear in x
. If o
is provided, the cumulative sum is computed in the order given by radixorderv(o)
, without the need to first sort x
. This applies as well if groups are used (g
), in which case the cumulative sum is computed separately in each group.
The pseries and pdata.frame methods assume that the last factor in the index is the time-variable and the rest are grouping variables. The time-variable is passed to radixorderv
and used for ordered computation, so that cumulative sums are accurately computed regardless of whether the panel-data is ordered or balanced.
fcumsum
explicitly supports integers. Integers in R are bounded at bounded at +-2,147,483,647, and an integer overflow error will be provided if the cumulative sum (within any group) exceeds +-2,147,483,647. In that case data should be converted to double beforehand.
Value
the cumulative sum of values in x
, (optionally) grouped by g
and/or ordered by o
. See Details and Examples.
See Also
fdiff
, fgrowth
, Time Series and Panel Series, Collapse Overview
Examples
## Non-grouped
fcumsum(AirPassengers)
head(fcumsum(EuStockMarkets))
fcumsum(mtcars)
# Non-grouped but ordered
o <- order(rnorm(nrow(EuStockMarkets)))
all.equal(copyAttrib(fcumsum(EuStockMarkets[o, ], o = o)[order(o), ], EuStockMarkets),
fcumsum(EuStockMarkets))
## Grouped
head(with(wlddev, fcumsum(PCGDP, iso3c)))
## Grouped and ordered
head(with(wlddev, fcumsum(PCGDP, iso3c, year)))
head(with(wlddev, fcumsum(PCGDP, iso3c, year, fill = TRUE)))
Fast (Quasi-, Log-) Differences for Time Series and Panel Data
Description
fdiff
is a S3 generic to compute (sequences of) suitably lagged / leaded and iterated differences, quasi-differences or (quasi-)log-differences. The difference and log-difference operators D
and Dlog
also exists as parsimonious wrappers around fdiff
, providing more flexibility than fdiff
when applied to data frames.
Usage
fdiff(x, n = 1, diff = 1, ...)
D(x, n = 1, diff = 1, ...)
Dlog(x, n = 1, diff = 1, ...)
## Default S3 method:
fdiff(x, n = 1, diff = 1, g = NULL, t = NULL, fill = NA, log = FALSE, rho = 1,
stubs = TRUE, ...)
## Default S3 method:
D(x, n = 1, diff = 1, g = NULL, t = NULL, fill = NA, rho = 1,
stubs = .op[["stub"]], ...)
## Default S3 method:
Dlog(x, n = 1, diff = 1, g = NULL, t = NULL, fill = NA, rho = 1, stubs = .op[["stub"]],
...)
## S3 method for class 'matrix'
fdiff(x, n = 1, diff = 1, g = NULL, t = NULL, fill = NA, log = FALSE, rho = 1,
stubs = length(n) + length(diff) > 2L, ...)
## S3 method for class 'matrix'
D(x, n = 1, diff = 1, g = NULL, t = NULL, fill = NA, rho = 1,
stubs = .op[["stub"]], ...)
## S3 method for class 'matrix'
Dlog(x, n = 1, diff = 1, g = NULL, t = NULL, fill = NA, rho = 1, stubs = .op[["stub"]],
...)
## S3 method for class 'data.frame'
fdiff(x, n = 1, diff = 1, g = NULL, t = NULL, fill = NA, log = FALSE, rho = 1,
stubs = length(n) + length(diff) > 2L, ...)
## S3 method for class 'data.frame'
D(x, n = 1, diff = 1, by = NULL, t = NULL, cols = is.numeric,
fill = NA, rho = 1, stubs = .op[["stub"]], keep.ids = TRUE, ...)
## S3 method for class 'data.frame'
Dlog(x, n = 1, diff = 1, by = NULL, t = NULL, cols = is.numeric,
fill = NA, rho = 1, stubs = .op[["stub"]], keep.ids = TRUE, ...)
# Methods for indexed data / compatibility with plm:
## S3 method for class 'pseries'
fdiff(x, n = 1, diff = 1, fill = NA, log = FALSE, rho = 1,
stubs = length(n) + length(diff) > 2L, shift = "time", ...)
## S3 method for class 'pseries'
D(x, n = 1, diff = 1, fill = NA, rho = 1, stubs = .op[["stub"]], shift = "time", ...)
## S3 method for class 'pseries'
Dlog(x, n = 1, diff = 1, fill = NA, rho = 1, stubs = .op[["stub"]], shift = "time", ...)
## S3 method for class 'pdata.frame'
fdiff(x, n = 1, diff = 1, fill = NA, log = FALSE, rho = 1,
stubs = length(n) + length(diff) > 2L, shift = "time", ...)
## S3 method for class 'pdata.frame'
D(x, n = 1, diff = 1, cols = is.numeric, fill = NA, rho = 1, stubs = .op[["stub"]],
shift = "time", keep.ids = TRUE, ...)
## S3 method for class 'pdata.frame'
Dlog(x, n = 1, diff = 1, cols = is.numeric, fill = NA, rho = 1, stubs = .op[["stub"]],
shift = "time", keep.ids = TRUE, ...)
# Methods for grouped data frame / compatibility with dplyr:
## S3 method for class 'grouped_df'
fdiff(x, n = 1, diff = 1, t = NULL, fill = NA, log = FALSE, rho = 1,
stubs = length(n) + length(diff) > 2L, keep.ids = TRUE, ...)
## S3 method for class 'grouped_df'
D(x, n = 1, diff = 1, t = NULL, fill = NA, rho = 1, stubs = .op[["stub"]],
keep.ids = TRUE, ...)
## S3 method for class 'grouped_df'
Dlog(x, n = 1, diff = 1, t = NULL, fill = NA, rho = 1, stubs = .op[["stub"]],
keep.ids = TRUE, ...)
Arguments
x |
a numeric vector / time series, (time series) matrix, data frame, 'indexed_series' ('pseries'), 'indexed_frame' ('pdata.frame') or grouped data frame ('grouped_df'). |
n |
integer. A vector indicating the number of lags or leads. |
diff |
integer. A vector of integers > 1 indicating the order of differencing / log-differencing. |
g |
a factor, |
by |
data.frame method: Same as |
t |
a time vector or list of vectors. See |
cols |
data.frame method: Select columns to difference using a function, column names, indices or a logical vector. Default: All numeric variables. Note: |
fill |
value to insert when vectors are shifted. Default is |
log |
logical. |
rho |
double. Autocorrelation parameter. Set to a value between 0 and 1 for quasi-differencing. Any numeric value can be supplied. |
stubs |
logical. |
shift |
pseries / pdata.frame methods: character. |
keep.ids |
data.frame / pdata.frame / grouped_df methods: Logical. Drop all identifiers from the output (which includes all variables passed to |
... |
arguments to be passed to or from other methods. |
Details
By default, fdiff/D/Dlog
return x
with all columns differenced / log-differenced. Differences are computed as repeat(diff) x[i] - rho*x[i-n]
, and log-differences as log(x[i]) - rho*log(x[i-n])
for diff = 1
and repeat(diff-1) x[i] - rho*x[i-n]
is used to compute subsequent differences (usually diff = 1
for log-differencing). If rho < 1
, this becomes quasi- (or partial) differencing, which is a technique suggested by Cochrane and Orcutt (1949) to deal with serial correlation in regression models, where rho
is typically estimated by running a regression of the model residuals on the lagged residuals.
It is also possible to compute forward differences by passing negative n
values. n
also supports arbitrary vectors of integers (lags), and diff
supports positive sequences of integers (differences):
If more than one value is passed to n
and/or diff
, the data is expanded-wide as follows: If x
is an atomic vector or time series, a (time series) matrix is returned with columns ordered first by lag, then by difference. If x
is a matrix or data frame, each column is expanded in like manor such that the output has ncol(x)*length(n)*length(diff)
columns ordered first by column name, then by lag, then by difference.
For further computational details and efficiency considerations see the help page of flag
.
Value
x
differenced diff
times using lags n
of itself. Quasi and log-differences are toggled by the rho
and log
arguments or the Dlog
operator. Computations can be grouped by g/by
and/or ordered by t
. See Details and Examples.
References
Cochrane, D.; Orcutt, G. H. (1949). Application of Least Squares Regression to Relationships Containing Auto-Correlated Error Terms. Journal of the American Statistical Association. 44 (245): 32-61.
Prais, S. J. & Winsten, C. B. (1954). Trend Estimators and Serial Correlation. Cowles Commission Discussion Paper No. 383. Chicago.
See Also
flag/L/F
, fgrowth/G
, Time Series and Panel Series, Collapse Overview
Examples
## Simple Time Series: AirPassengers
D(AirPassengers) # 1st difference, same as fdiff(AirPassengers)
D(AirPassengers, -1) # Forward difference
Dlog(AirPassengers) # Log-difference
D(AirPassengers, 1, 2) # Second difference
Dlog(AirPassengers, 1, 2) # Second log-difference
D(AirPassengers, 12) # Seasonal difference (data is monthly)
D(AirPassengers, # Quasi-difference, see a better example below
rho = pwcor(AirPassengers, L(AirPassengers)))
head(D(AirPassengers, -2:2, 1:3)) # Sequence of leaded/lagged and iterated differences
# let's do some visual analysis
plot(AirPassengers) # Plot the series - seasonal pattern is evident
plot(stl(AirPassengers, "periodic")) # Seasonal decomposition
plot(D(AirPassengers,c(1,12),1:2)) # Plotting ordinary and seasonal first and second differences
plot(stl(window(D(AirPassengers,12), # Taking seasonal differences removes most seasonal variation
1950), "periodic"))
## Time Series Matrix of 4 EU Stock Market Indicators, recorded 260 days per year
plot(D(EuStockMarkets, c(0, 260))) # Plot series and annual differnces
mod <- lm(DAX ~., L(EuStockMarkets, c(0, 260))) # Regressing the DAX on its annual lag
summary(mod) # and the levels and annual lags others
r <- residuals(mod) # Obtain residuals
pwcor(r, L(r)) # Residual Autocorrelation
fFtest(r, L(r)) # F-test of residual autocorrelation
# (better use lmtest :: bgtest)
modCO <- lm(QD1.DAX ~., D(L(EuStockMarkets, c(0, 260)), # Cochrane-Orcutt (1949) estimation
rho = pwcor(r, L(r))))
summary(modCO)
rCO <- residuals(modCO)
fFtest(rCO, L(rCO)) # No more autocorrelation
## World Development Panel Data
head(fdiff(num_vars(wlddev), 1, 1, # Computes differences of numeric variables
wlddev$country, wlddev$year)) # fdiff requires external inputs..
head(D(wlddev, 1, 1, ~country, ~year)) # Differences of numeric variables
head(D(wlddev, 1, 1, ~country)) # Without t: Works because data is ordered
head(D(wlddev, 1, 1, PCGDP + LIFEEX ~ country, ~year)) # Difference of GDP & Life Expectancy
head(D(wlddev, 0:1, 1, ~ country, ~year, cols = 9:10)) # Same, also retaining original series
head(D(wlddev, 0:1, 1, ~ country, ~year, 9:10, # Dropping id columns
keep.ids = FALSE))
## Indexed computations:
wldi <- findex_by(wlddev, iso3c, year)
# Dynamic Panel Data Models:
summary(lm(D(PCGDP) ~ L(PCGDP) + D(LIFEEX), data = wldi)) # Simple case
summary(lm(Dlog(PCGDP) ~ L(log(PCGDP)) + Dlog(LIFEEX), data = wldi)) # In log-differneces
# Adding a lagged difference...
summary(lm(D(PCGDP) ~ L(D(PCGDP, 0:1)) + L(D(LIFEEX), 0:1), data = wldi))
summary(lm(Dlog(PCGDP) ~ L(Dlog(PCGDP, 0:1)) + L(Dlog(LIFEEX), 0:1), data = wldi))
# Same thing:
summary(lm(D1.PCGDP ~., data = L(D(wldi,0:1,1,9:10),0:1,keep.ids = FALSE)[,-1]))
## Grouped data
library(magrittr)
wlddev |> fgroup_by(country) |>
fselect(PCGDP,LIFEEX) |> fdiff(0:1,1:2) # Adding a first and second difference
wlddev |> fgroup_by(country) |>
fselect(year,PCGDP,LIFEEX) |> D(0:1,1:2,year) # Also using t (safer)
wlddev |> fgroup_by(country) |> # Dropping id's
fselect(year,PCGDP,LIFEEX) |> D(0:1,1:2,year, keep.ids = FALSE)
Fast and Flexible Distance Computations
Description
A fast and flexible replacement for dist
, to compute euclidean distances.
Usage
fdist(x, v = NULL, ..., method = "euclidean", nthreads = .op[["nthreads"]])
Arguments
x |
a numeric vector or matrix. Data frames/lists can be passed but will be converted to matrix using | ||||||||||||||||
v |
an (optional) numeric (double) vector such that | ||||||||||||||||
... |
not used. A placeholder for possible future arguments. | ||||||||||||||||
method |
an integer or character string indicating the method of computing distances.
| ||||||||||||||||
nthreads |
integer. The number of threads to use. If |
Value
If v = NULL
, a full lower-triangular distance matrix between the rows of x
is computed and returned as a 'dist' object (all methods apply, see dist
). Otherwise, a numeric vector of distances of each row of x
with v
is returned. See Examples.
Note
fdist
does not check for missing values, so NA
's will result in NA
distances.
kit::topn
is a suitable complimentary function to find nearest neighbors. It is very efficient and skips missing values by default.
See Also
flm
, Fast Statistical Functions, Collapse Overview
Examples
# Distance matrix
m = as.matrix(mtcars)
str(fdist(m)) # Same as dist(m)
# Distance with vector
d = fdist(m, fmean(m))
kit::topn(d, 5) # Index of 5 nearest neighbours
# Mahalanobis distance
m_mahal = t(forwardsolve(t(chol(cov(m))), t(m)))
fdist(m_mahal, fmean(m_mahal))
sqrt(unattrib(mahalanobis(m, fmean(m), cov(m))))
# Distance of two vectors
x <- rnorm(1e6)
y <- rnorm(1e6)
microbenchmark::microbenchmark(
fdist(x, y),
fdist(x, y, nthreads = 2),
sqrt(sum((x-y)^2))
)
Fast Removal of Unused Factor Levels
Description
A substantially faster replacement for droplevels
.
Usage
fdroplevels(x, ...)
## S3 method for class 'factor'
fdroplevels(x, ...)
## S3 method for class 'data.frame'
fdroplevels(x, ...)
Arguments
x |
a factor, or data frame / list containing one or more factors. |
... |
not used. |
Details
droplevels
passes a factor from which levels are to be dropped to factor
, which first calls unique
and then match
to drop unused levels. Both functions internally use a hash table, which is highly inefficient. fdroplevels
does not require mapping values at all, but uses a super fast boolean vector method to determine which levels are unused and remove those levels. In addition, if no unused levels are found, x
is simply returned. Any missing values found in x
are efficiently skipped in the process of checking and replacing levels. All other attributes of x
are preserved.
Value
x
with unused factor levels removed.
Note
If x
is malformed e.g. has too few levels, this function can cause a segmentation fault terminating the R session, thus only use with ordinary / proper factors.
See Also
qF
, funique
, Fast Grouping and Ordering, Collapse Overview
Examples
f <- iris$Species[1:100]
fdroplevels(f)
identical(fdroplevels(f), droplevels(f))
fNA <- na_insert(f)
fdroplevels(fNA)
identical(fdroplevels(fNA), droplevels(fNA))
identical(fdroplevels(ss(iris, 1:100)), droplevels(ss(iris, 1:100)))
Fast (Grouped) First and Last Value for Matrix-Like Objects
Description
ffirst
and flast
are S3 generic functions that (column-wise) returns the first and last values in x
, (optionally) grouped by g
. The TRA
argument can further be used to transform x
using its (groupwise) first and last values.
Usage
ffirst(x, ...)
flast(x, ...)
## Default S3 method:
ffirst(x, g = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, ...)
## Default S3 method:
flast(x, g = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, ...)
## S3 method for class 'matrix'
ffirst(x, g = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, drop = TRUE, ...)
## S3 method for class 'matrix'
flast(x, g = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, drop = TRUE, ...)
## S3 method for class 'data.frame'
ffirst(x, g = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, drop = TRUE, ...)
## S3 method for class 'data.frame'
flast(x, g = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, drop = TRUE, ...)
## S3 method for class 'grouped_df'
ffirst(x, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = FALSE, keep.group_vars = TRUE, ...)
## S3 method for class 'grouped_df'
flast(x, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = FALSE, keep.group_vars = TRUE, ...)
Arguments
x |
a vector, matrix, data frame or grouped data frame (class 'grouped_df'). |
g |
a factor, |
TRA |
an integer or quoted operator indicating the transformation to perform:
0 - "na" | 1 - "fill" | 2 - "replace" | 3 - "-" | 4 - "-+" | 5 - "/" | 6 - "%" | 7 - "+" | 8 - "*" | 9 - "%%" | 10 - "-%%". See |
na.rm |
logical. |
use.g.names |
logical. Make group-names and add to the result as names (default method) or row-names (matrix and data frame methods). No row-names are generated for data.table's. |
drop |
matrix and data.frame method: Logical. |
keep.group_vars |
grouped_df method: Logical. |
... |
arguments to be passed to or from other methods. If |
Value
ffirst
returns the first value in x
, grouped by g
, or (if TRA
is used) x
transformed by its first value, grouped by g
. Similarly flast
returns the last value in x
, ...
Note
Both functions are significantly faster if na.rm = FALSE
, particularly ffirst
which can take direct advantage of the 'group.starts' elements in GRP
objects.
See Also
Fast Statistical Functions, Collapse Overview
Examples
## default vector method
ffirst(airquality$Ozone) # Simple first value
ffirst(airquality$Ozone, airquality$Month) # Grouped first value
ffirst(airquality$Ozone, airquality$Month,
na.rm = FALSE) # Grouped first, but without skipping initial NA's
## data.frame method
ffirst(airquality)
ffirst(airquality, airquality$Month)
ffirst(airquality, airquality$Month, na.rm = FALSE) # Again first Ozone measurement in month 6 is NA
## matrix method
aqm <- qM(airquality)
ffirst(aqm)
ffirst(aqm, airquality$Month) # etc..
## method for grouped data frames - created with dplyr::group_by or fgroup_by
library(dplyr)
airquality |> group_by(Month) |> ffirst()
airquality |> group_by(Month) |> select(Ozone) |> ffirst(na.rm = FALSE)
# Note: All examples generalize to flast.
Fast (Weighted) F-test for Linear Models (with Factors)
Description
fFtest
computes an R-squared based F-test for the exclusion of the variables in exc
, where the full (unrestricted) model is defined by variables supplied to both exc
and X
. The test is efficient and designed for cases where both exc
and X
may contain multiple factors and continuous variables. There is also an efficient 2-part formula method.
Usage
fFtest(...) # Internal method dispatch: formula if is.call(..1) || is.call(..2)
## Default S3 method:
fFtest(y, exc, X = NULL, w = NULL, full.df = TRUE, ...)
## S3 method for class 'formula'
fFtest(formula, data = NULL, weights = NULL, ...)
Arguments
y |
a numeric vector: the dependent variable. |
exc |
a numeric vector, factor, numeric matrix or list / data frame of numeric vectors and/or factors: variables to test / exclude. |
X |
a numeric vector, factor, numeric matrix or list / data frame of numeric vectors and/or factors: covariates to include in both the restricted (without |
w |
numeric. A vector of (frequency) weights. |
formula |
a 2-part formula: |
data |
a named list or data frame. |
weights |
a weights vector or expression that results in a vector when evaluated in the |
full.df |
logical. If |
... |
other arguments passed to |
Details
Factors and continuous regressors are efficiently projected out using fhdwithin
, and the option full.df
regulates whether a degree of freedom is subtracted for each used factor level (equivalent to dummy-variable estimator / expanding factors), or only one degree of freedom per factor (treating factors as variables). The test automatically removes missing values and considers only the complete cases of y, exc
and X
. Unused factor levels in exc
and X
are dropped.
Note that an intercept is always added by fhdwithin
, so it is not necessary to include an intercept in data supplied to exc
/ X
.
Value
A 5 x 3 numeric matrix of statistics. The columns contain statistics:
the R-squared of the model
the numerator degrees of freedom i.e. the number of variables (k) and used factor levels if
full.df = TRUE
the denominator degrees of freedom: N - k - 1.
the F-statistic
the corresponding P-value
The rows show these statistics for:
the Full (unrestricted) Model (
y ~ exc + X
)the Restricted Model (
y ~ X
)the Exclusion Restriction of
exc
. The R-squared shown is simply the difference of the full and restricted R-Squared's, not the R-Squared of the modely ~ exc
.
If X = NULL
, only a vector of the same 5 statistics testing the model (y ~ exc
) is shown.
See Also
flm
, fhdwithin
, Data Transformations, Collapse Overview
Examples
## We could use fFtest as a simple seasonality test:
fFtest(AirPassengers, qF(cycle(AirPassengers))) # Testing for level-seasonality
fFtest(AirPassengers, qF(cycle(AirPassengers)), # Seasonality test around a cubic trend
poly(seq_along(AirPassengers), 3))
fFtest(fdiff(AirPassengers), qF(cycle(AirPassengers))) # Seasonality in first-difference
## A more classical example with only continuous variables
fFtest(mpg ~ cyl + vs | hp + carb, mtcars)
fFtest(mtcars$mpg, mtcars[c("cyl","vs")], mtcars[c("hp","carb")])
## Now encoding cyl and vs as factors
fFtest(mpg ~ qF(cyl) + qF(vs) | hp + carb, mtcars)
fFtest(mtcars$mpg, lapply(mtcars[c("cyl","vs")], qF), mtcars[c("hp","carb")])
## Using iris data: A factor and a continuous variable excluded
fFtest(Sepal.Length ~ Petal.Width + Species | Sepal.Width + Petal.Length, iris)
fFtest(iris$Sepal.Length, iris[4:5], iris[2:3])
## Testing the significance of country-FE in regression of GDP on life expectancy
fFtest(log(PCGDP) ~ iso3c | LIFEEX, wlddev)
fFtest(log(wlddev$PCGDP), wlddev$iso3c, wlddev$LIFEEX)
## Ok, country-FE are significant, what about adding time-FE
fFtest(log(PCGDP) ~ qF(year) | iso3c + LIFEEX, wlddev)
fFtest(log(wlddev$PCGDP), qF(wlddev$year), wlddev[c("iso3c","LIFEEX")])
# Same test done using lm:
data <- na_omit(get_vars(wlddev, c("iso3c","year","PCGDP","LIFEEX")))
full <- lm(PCGDP ~ LIFEEX + iso3c + qF(year), data)
rest <- lm(PCGDP ~ LIFEEX + iso3c, data)
anova(rest, full)
Fast Growth Rates for Time Series and Panel Data
Description
fgrowth
is a S3 generic to compute (sequences of) suitably lagged / leaded, iterated and compounded growth rates, obtained with via the exact method of computation or through log differencing. By default growth rates are provided in percentage terms, but any scale factor can be applied. The growth operator G
is a parsimonious wrapper around fgrowth
, and also provides more flexibility when applied to data frames.
Usage
fgrowth(x, n = 1, diff = 1, ...)
G(x, n = 1, diff = 1, ...)
## Default S3 method:
fgrowth(x, n = 1, diff = 1, g = NULL, t = NULL, fill = NA,
logdiff = FALSE, scale = 100, power = 1, stubs = TRUE, ...)
## Default S3 method:
G(x, n = 1, diff = 1, g = NULL, t = NULL, fill = NA, logdiff = FALSE,
scale = 100, power = 1, stubs = .op[["stub"]], ...)
## S3 method for class 'matrix'
fgrowth(x, n = 1, diff = 1, g = NULL, t = NULL, fill = NA,
logdiff = FALSE, scale = 100, power = 1,
stubs = length(n) + length(diff) > 2L, ...)
## S3 method for class 'matrix'
G(x, n = 1, diff = 1, g = NULL, t = NULL, fill = NA, logdiff = FALSE,
scale = 100, power = 1, stubs = .op[["stub"]], ...)
## S3 method for class 'data.frame'
fgrowth(x, n = 1, diff = 1, g = NULL, t = NULL, fill = NA,
logdiff = FALSE, scale = 100, power = 1,
stubs = length(n) + length(diff) > 2L, ...)
## S3 method for class 'data.frame'
G(x, n = 1, diff = 1, by = NULL, t = NULL, cols = is.numeric,
fill = NA, logdiff = FALSE, scale = 100, power = 1, stubs = .op[["stub"]],
keep.ids = TRUE, ...)
# Methods for indexed data / compatibility with plm:
## S3 method for class 'pseries'
fgrowth(x, n = 1, diff = 1, fill = NA, logdiff = FALSE, scale = 100,
power = 1, stubs = length(n) + length(diff) > 2L, shift = "time", ...)
## S3 method for class 'pseries'
G(x, n = 1, diff = 1, fill = NA, logdiff = FALSE, scale = 100,
power = 1, stubs = .op[["stub"]], shift = "time", ...)
## S3 method for class 'pdata.frame'
fgrowth(x, n = 1, diff = 1, fill = NA, logdiff = FALSE, scale = 100,
power = 1, stubs = length(n) + length(diff) > 2L, shift = "time", ...)
## S3 method for class 'pdata.frame'
G(x, n = 1, diff = 1, cols = is.numeric, fill = NA, logdiff = FALSE,
scale = 100, power = 1, stubs = .op[["stub"]], shift = "time", keep.ids = TRUE, ...)
# Methods for grouped data frame / compatibility with dplyr:
## S3 method for class 'grouped_df'
fgrowth(x, n = 1, diff = 1, t = NULL, fill = NA, logdiff = FALSE,
scale = 100, power = 1, stubs = length(n) + length(diff) > 2L,
keep.ids = TRUE, ...)
## S3 method for class 'grouped_df'
G(x, n = 1, diff = 1, t = NULL, fill = NA, logdiff = FALSE,
scale = 100, power = 1, stubs = .op[["stub"]], keep.ids = TRUE, ...)
Arguments
x |
a numeric vector / time series, (time series) matrix, data frame, 'indexed_series' ('pseries'), 'indexed_frame' ('pdata.frame') or grouped data frame ('grouped_df'). |
n |
integer. A vector indicating the number of lags or leads. |
diff |
integer. A vector of integers > 1 indicating the order of taking growth rates, e.g. |
g |
a factor, |
by |
data.frame method: Same as |
t |
a time vector or list of vectors. See |
cols |
data.frame method: Select columns to compute growth rates using a function, column names, indices or a logical vector. Default: All numeric variables. Note: |
fill |
value to insert when vectors are shifted. Default is |
logdiff |
logical. Compute log-difference growth rates instead of exact growth rates. See Details. |
scale |
logical. Scale factor post-applied to growth rates, default is 100 which gives growth rates in percentage terms. See Details. |
power |
numeric. Apply a power to annualize or compound growth rates e.g. |
stubs |
logical. |
shift |
pseries / pdata.frame methods: character. |
keep.ids |
data.frame / pdata.frame / grouped_df methods: Logical. Drop all identifiers from the output (which includes all variables passed to |
... |
arguments to be passed to or from other methods. |
Details
fgrowth/G
by default computes exact growth rates using repeat(diff) ((x[i]/x[i-n])^power - 1)*scale
, so for diff > 1
it computes growth rate of growth rates. If logdiff = TRUE
, approximate growth rates are computed using log(x[i]/x[i-n])*scale
for diff = 1
and repeat(diff-1) x[i] - x[i-n]
thereafter (usually diff = 1
for log-differencing). For further details see the help pages of fdiff
and flag
.
Value
x
where the growth rate was taken diff
times using lags n
of itself, scaled by scale
. Computations can be grouped by g/by
and/or ordered by t
. See Details and Examples.
See Also
flag/L/F
, fdiff/D/Dlog
, Time Series and Panel Series, Collapse Overview
Examples
## Simple Time Series: AirPassengers
G(AirPassengers) # Growth rate, same as fgrowth(AirPassengers)
G(AirPassengers, logdiff = TRUE) # Log-difference
G(AirPassengers, 1, 2) # Growth rate of growth rate
G(AirPassengers, 12) # Seasonal growth rate (data is monthly)
head(G(AirPassengers, -2:2, 1:3)) # Sequence of leaded/lagged and iterated growth rates
# let's do some visual analysis
plot(G(AirPassengers, c(0, 1, 12)))
plot(stl(window(G(AirPassengers, 12), # Taking seasonal growth rate removes most seasonal variation
1950), "periodic"))
## Time Series Matrix of 4 EU Stock Market Indicators, recorded 260 days per year
plot(G(EuStockMarkets,c(0,260))) # Plot series and annual growth rates
summary(lm(L260G1.DAX ~., G(EuStockMarkets,260))) # Annual growth rate of DAX regressed on the
# growth rates of the other indicators
## World Development Panel Data
head(fgrowth(num_vars(wlddev), 1, 1, # Computes growth rates of numeric variables
wlddev$country, wlddev$year)) # fgrowth requires external inputs..
head(G(wlddev, 1, 1, ~country, ~year)) # Growth of numeric variables, id's attached
head(G(wlddev, 1, 1, ~country)) # Without t: Works because data is ordered
head(G(wlddev, 1, 1, PCGDP + LIFEEX ~ country, ~year)) # Growth of GDP per Capita & Life Expectancy
head(G(wlddev, 0:1, 1, ~ country, ~year, cols = 9:10)) # Same, also retaining original series
head(G(wlddev, 0:1, 1, ~ country, ~year, 9:10, # Dropping id columns
keep.ids = FALSE))
Higher-Dimensional Centering and Linear Prediction
Description
fhdbetween
is a generalization of fbetween
to efficiently predict with multiple factors and linear models (i.e. predict with vectors/factors, matrices, or data frames/lists where the latter may contain multiple factor variables). Similarly, fhdwithin
is a generalization of fwithin
to center on multiple factors and partial-out linear models.
The corresponding operators HDB
and HDW
additionally allow to predict / partial out full lm()
formulas with interactions between variables.
Usage
fhdbetween(x, ...)
fhdwithin(x, ...)
HDB(x, ...)
HDW(x, ...)
## Default S3 method:
fhdbetween(x, fl, w = NULL, na.rm = .op[["na.rm"]], fill = FALSE, lm.method = "qr", ...)
## Default S3 method:
fhdwithin(x, fl, w = NULL, na.rm = .op[["na.rm"]], fill = FALSE, lm.method = "qr", ...)
## Default S3 method:
HDB(x, fl, w = NULL, na.rm = .op[["na.rm"]], fill = FALSE, lm.method = "qr", ...)
## Default S3 method:
HDW(x, fl, w = NULL, na.rm = .op[["na.rm"]], fill = FALSE, lm.method = "qr", ...)
## S3 method for class 'matrix'
fhdbetween(x, fl, w = NULL, na.rm = .op[["na.rm"]], fill = FALSE, lm.method = "qr", ...)
## S3 method for class 'matrix'
fhdwithin(x, fl, w = NULL, na.rm = .op[["na.rm"]], fill = FALSE, lm.method = "qr", ...)
## S3 method for class 'matrix'
HDB(x, fl, w = NULL, na.rm = .op[["na.rm"]], fill = FALSE, stub = .op[["stub"]],
lm.method = "qr", ...)
## S3 method for class 'matrix'
HDW(x, fl, w = NULL, na.rm = .op[["na.rm"]], fill = FALSE, stub = .op[["stub"]],
lm.method = "qr", ...)
## S3 method for class 'data.frame'
fhdbetween(x, fl, w = NULL, na.rm = .op[["na.rm"]], fill = FALSE,
variable.wise = FALSE, lm.method = "qr", ...)
## S3 method for class 'data.frame'
fhdwithin(x, fl, w = NULL, na.rm = .op[["na.rm"]], fill = FALSE,
variable.wise = FALSE, lm.method = "qr", ...)
## S3 method for class 'data.frame'
HDB(x, fl, w = NULL, cols = is.numeric, na.rm = .op[["na.rm"]], fill = FALSE,
variable.wise = FALSE, stub = .op[["stub"]], lm.method = "qr", ...)
## S3 method for class 'data.frame'
HDW(x, fl, w = NULL, cols = is.numeric, na.rm = .op[["na.rm"]], fill = FALSE,
variable.wise = FALSE, stub = .op[["stub"]], lm.method = "qr", ...)
# Methods for indexed data / compatibility with plm:
## S3 method for class 'pseries'
fhdbetween(x, effect = "all", w = NULL, na.rm = .op[["na.rm"]], fill = TRUE, ...)
## S3 method for class 'pseries'
fhdwithin(x, effect = "all", w = NULL, na.rm = .op[["na.rm"]], fill = TRUE, ...)
## S3 method for class 'pseries'
HDB(x, effect = "all", w = NULL, na.rm = .op[["na.rm"]], fill = TRUE, ...)
## S3 method for class 'pseries'
HDW(x, effect = "all", w = NULL, na.rm = .op[["na.rm"]], fill = TRUE, ...)
## S3 method for class 'pdata.frame'
fhdbetween(x, effect = "all", w = NULL, na.rm = .op[["na.rm"]], fill = TRUE,
variable.wise = TRUE, ...)
## S3 method for class 'pdata.frame'
fhdwithin(x, effect = "all", w = NULL, na.rm = .op[["na.rm"]], fill = TRUE,
variable.wise = TRUE, ...)
## S3 method for class 'pdata.frame'
HDB(x, effect = "all", w = NULL, cols = is.numeric, na.rm = .op[["na.rm"]],
fill = TRUE, variable.wise = TRUE, stub = .op[["stub"]], ...)
## S3 method for class 'pdata.frame'
HDW(x, effect = "all", w = NULL, cols = is.numeric, na.rm = .op[["na.rm"]],
fill = TRUE, variable.wise = TRUE, stub = .op[["stub"]], ...)
Arguments
x |
a numeric vector, matrix, data frame, 'indexed_series' ('pseries') or 'indexed_frame' ('pdata.frame'). |
fl |
a numeric vector, factor, matrix, data frame or list (which may or may not contain factors). In the |
w |
a vector of (non-negative) weights. |
cols |
data.frame methods: Select columns to center (partial-out) or predict using column names, indices, a logical vector or a function. Unless specified otherwise all numeric columns are selected. If |
na.rm |
remove missing values from both |
fill |
If |
variable.wise |
(p)data.frame methods: Setting |
effect |
plm methods: Select which panel identifiers should be used for centering. 1L takes the first variable in the index, 2L the second etc.. Index variables can also be called by name using a character vector. The keyword |
stub |
character. A prefix/stub to add to the names of all transformed columns. |
lm.method |
character. The linear fitting method. Supported are |
... |
further arguments passed to |
Details
fhdbetween/HDB
and fhdwithin/HDW
are powerful functions for high-dimensional linear prediction problems involving large factors and datasets, but can just as well handle ordinary regression problems. They are implemented as efficient wrappers around fbetween / fwithin
, flm
and some C++ code from the fixest
package that is imported for higher-order centering tasks (thus fixest
needs to be installed for problems involving more than one factor).
Intended areas of use are to efficiently obtain residuals and predicted values from data, and to prepare data for complex linear models involving multiple levels of fixed effects. Such models can now be fitted using (g)lm()
on data prepared with fhdwithin / HDW
(relying on bootstrapped SE's for inference, or implementing the appropriate corrections). See Examples.
If fl
is a vector or matrix, the result are identical to lm
i.e. fhdbetween / HDB
returns fitted(lm(x ~ fl))
and fhdwithin / HDW
residuals(lm(x ~ fl))
. If fl
is a list containing factors, all variables in x
and non-factor variables in fl
are centered on these factors using either fbetween / fwithin
for a single factor or fixest
C++ code for multiple factors. Afterwards the centered data is regressed on the centered predictors. If fl
is just a list of factors, fhdwithin/HDW
returns the centered data and fhdbetween/HDB
the corresponding means. Take as a most general example a list fl = list(fct1, fct2, ..., var1, var2, ...)
where fcti
are factors and vari
are continuous variables. The output of fhdwithin/HDW | fhdbetween/HDB
will then be identical to calling resid | fitted
on lm(x ~ fct1 + fct2 + ... + var1 + var2 + ...)
. The computations performed by fhdwithin/HDW
and fhdbetween/HDB
are however much faster and more memory efficient than lm
because factors are not passed to model.matrix
and expanded to matrices of dummies but projected out beforehand.
The formula interface to the data.frame method (only supported by the operators HDW | HDB
) provides ease of use and allows for additional modeling complexity. For example it is possible to project out formulas like HDW(data, ~ fct1*var1 + fct2:fct3 + var2:fct2:fct3 + var2:var3 + poly(var5,3)*fct5)
containing simple (:)
or full (*)
interactions of factors with continuous variables or polynomials of continuous variables, and two-or three-way interactions of factors and continuous variables. If the formula is one-sided as in the example above (the space left of (~)
is left empty), the formula is applied to all variables selected through cols
. The specification provided in cols
(default: all numeric variables not used in the formula) can be overridden by supplying one-or more dependent variables. For example HDW(data, var1 + var2 ~ fct1 + fct2)
will return a data.frame with var1
and var2
centered on fct1
and fct2
.
The special methods for 'indexed_series' (plm::pseries
) and 'indexed_frame's (plm::pdata.frame
) center a panel series or variables in a panel data frame on all panel-identifiers. By default in these methods fill = TRUE
and variable.wise = TRUE
, so missing values are kept. This change in the default arguments was done to ensure a coherent framework of functions and operators applied to plm panel data classes.
Value
HDB
returns fitted values of regressing x
on fl
. HDW
returns residuals. See Details and Examples.
Note
On the differences between fhdwithin/HDW
... and fwithin/W
...:
-
fhdwithin/HDW
can center data on multiple factors and also partial out continuous variables and factor-continuous interactions whilefwithin/W
only centers on one factor or the interaction of a set of factors, and does that very efficiently. -
HDW(data, ~ qF(group1) + qF(group2))
simultaneously centers numeric variables in data ongroup1
andgroup2
, whileW(data, ~ group1 + group2)
centers data on the interaction ofgroup1
andgroup2
. The equivalent operation inHDW
would be:HDW(data, ~ qF(group1):qF(group2))
. -
W
always does computations on the variable-wise complete observations (in both matrices and data frames), whereas by defaultHDW
removes all cases missing in eitherx
orfl
. In short,W(data, ~ group1 + group2)
is actually equivalent toHDW(data, ~ qF(group1):qF(group2), variable.wise = TRUE)
.HDW(data, ~ qF(group1):qF(group2))
would remove any missing cases. -
fbetween/B
andfwithin/W
have options to fill missing cases using group-averages and to add the overall mean back to group-demeaned data. These options are not available infhdbetween/HDB
andfhdwithin/HDW
. SinceHDB
andHDW
by default remove missing cases, they also don't have options to keep grouping-columns as inB
andW
.
See Also
fbetween, fwithin
, fscale
, TRA
, flm
, fFtest
, Data Transformations, Collapse Overview
Examples
HDW(mtcars$mpg, mtcars$carb) # Simple regression problems
HDW(mtcars$mpg, mtcars[-1])
HDW(mtcars$mpg, qM(mtcars[-1]))
head(HDW(qM(mtcars[3:4]), mtcars[1:2]))
head(HDW(iris[1:2], iris[3:4])) # Partialling columns 3 and 4 out of columns 1 and 2
head(HDW(iris[1:2], iris[3:5])) # Adding the Species factor -> fixed effect
head(HDW(wlddev, PCGDP + LIFEEX ~ iso3c + qF(year))) # Partialling out 2 fixed effects
head(HDW(wlddev, PCGDP + LIFEEX ~ iso3c + qF(year), variable.wise = TRUE)) # Variable-wise
head(HDW(wlddev, PCGDP + LIFEEX ~ iso3c + qF(year) + ODA)) # Adding ODA as a continuous regressor
head(HDW(wlddev, PCGDP + LIFEEX ~ iso3c:qF(decade) + qF(year) + ODA)) # Country-decade and year FE's
head(HDW(wlddev, PCGDP + LIFEEX ~ iso3c*year)) # Country specific time trends
head(HDW(wlddev, PCGDP + LIFEEX ~ iso3c*poly(year, 3))) # Country specific cubic trends
# More complex examples
lm(HDW.mpg ~ HDW.hp, data = HDW(mtcars, ~ factor(cyl)*carb + vs + wt:gear + wt:gear:carb))
lm(mpg ~ hp + factor(cyl)*carb + vs + wt:gear + wt:gear:carb, data = mtcars)
lm(HDW.mpg ~ HDW.hp, data = HDW(mtcars, ~ factor(cyl)*carb + vs + wt:gear))
lm(mpg ~ hp + factor(cyl)*carb + vs + wt:gear, data = mtcars)
lm(HDW.mpg ~ HDW.hp, data = HDW(mtcars, ~ cyl*carb + vs + wt:gear))
lm(mpg ~ hp + cyl*carb + vs + wt:gear, data = mtcars)
lm(HDW.mpg ~ HDW.hp, data = HDW(mtcars, mpg + hp ~ cyl*carb + factor(cyl)*poly(drat,2)))
lm(mpg ~ hp + cyl*carb + factor(cyl)*poly(drat,2), data = mtcars)
Fast Lags and Leads for Time Series and Panel Data
Description
flag
is an S3 generic to compute (sequences of) lags and leads. L
and F
are wrappers around flag
representing the lag- and lead-operators, such that L(x,-1) = F(x,1) = F(x)
and L(x,-3:3) = F(x,3:-3)
. L
and F
provide more flexibility than flag
when applied to data frames (i.e. column subsetting, formula input and id-variable-preservation capabilities...), but are otherwise identical.
Note: Since v1.9.0, F
is no longer exported, but can be accessed using collapse:::F
, or through setting options(collapse_export_F = TRUE)
before loading the package. The syntax is the same as L
.
Usage
flag(x, n = 1, ...)
L(x, n = 1, ...)
## Default S3 method:
flag(x, n = 1, g = NULL, t = NULL, fill = NA, stubs = TRUE, ...)
## Default S3 method:
L(x, n = 1, g = NULL, t = NULL, fill = NA, stubs = .op[["stub"]], ...)
## S3 method for class 'matrix'
flag(x, n = 1, g = NULL, t = NULL, fill = NA, stubs = length(n) > 1L, ...)
## S3 method for class 'matrix'
L(x, n = 1, g = NULL, t = NULL, fill = NA, stubs = .op[["stub"]], ...)
## S3 method for class 'data.frame'
flag(x, n = 1, g = NULL, t = NULL, fill = NA, stubs = length(n) > 1L, ...)
## S3 method for class 'data.frame'
L(x, n = 1, by = NULL, t = NULL, cols = is.numeric,
fill = NA, stubs = .op[["stub"]], keep.ids = TRUE, ...)
# Methods for indexed data / compatibility with plm:
## S3 method for class 'pseries'
flag(x, n = 1, fill = NA, stubs = length(n) > 1L, shift = "time", ...)
## S3 method for class 'pseries'
L(x, n = 1, fill = NA, stubs = .op[["stub"]], shift = "time", ...)
## S3 method for class 'pdata.frame'
flag(x, n = 1, fill = NA, stubs = length(n) > 1L, shift = "time", ...)
## S3 method for class 'pdata.frame'
L(x, n = 1, cols = is.numeric, fill = NA, stubs = .op[["stub"]],
shift = "time", keep.ids = TRUE, ...)
# Methods for grouped data frame / compatibility with dplyr:
## S3 method for class 'grouped_df'
flag(x, n = 1, t = NULL, fill = NA, stubs = length(n) > 1L, keep.ids = TRUE, ...)
## S3 method for class 'grouped_df'
L(x, n = 1, t = NULL, fill = NA, stubs = .op[["stub"]], keep.ids = TRUE, ...)
Arguments
x |
a vector / time series, (time series) matrix, data frame, 'indexed_series' ('pseries'), 'indexed_frame' ('pdata.frame') or grouped data frame ('grouped_df'). Data must not be numeric. |
n |
integer. A vector indicating the lags / leads to compute (passing negative integers to |
g |
a factor, |
by |
data.frame method: Same as |
t |
a time vector or list of vectors. Data frame methods also allows one-sided formula i.e. |
cols |
data.frame method: Select columns to lag using a function, column names, indices or a logical vector. Default: All numeric variables. Note: |
fill |
value to insert when vectors are shifted. Default is |
stubs |
logical. |
shift |
pseries / pdata.frame methods: character. |
keep.ids |
data.frame / pdata.frame / grouped_df methods: Logical. Drop all identifiers from the output (which includes all variables passed to |
... |
arguments to be passed to or from other methods. |
Details
If a single integer is passed to n
, and g/by
and t
are left empty, flag/L/F
just returns x
with all columns lagged / leaded by n
. If length(n)>1
, and x
is an atomic vector (time series), flag/L/F
returns a (time series) matrix with lags / leads computed in the same order as passed to n
. If instead x
is a matrix / data frame, a matrix / data frame with ncol(x)*length(n)
columns is returned where columns are sorted first by variable and then by lag (so all lags computed on a variable are grouped together). x
can be of any standard data type.
With groups/panel-identifiers supplied to g/by
, flag/L/F
efficiently computes a panel-lag/lead by shifting the entire vector(s) but inserting fill
elements in the right places. If t
is left empty, the data needs to be ordered such that all values belonging to a group are consecutive and in the right order. It is not necessary that the groups themselves are alphabetically ordered. If a time-variable is supplied to t
(or a list of time-variables uniquely identifying the time-dimension), the series / panel is fully identified and lags / leads can be securely computed even if the data is unordered / irregular.
Note that the t
argument is processed as follows: If is.factor(t) || (is.numeric(t) && !is.object(t))
(i.e. t
is a factor or plain numeric vector), it is assumed to represent unit timesteps (e.g. a 'year' variable in a typical dataset), and thus coerced to integer using as.integer(t)
and directly passed to C++ without further checks or transformations at the R-level. Otherwise, if is.object(t) && is.numeric(unclass(t))
(i.e. t
is a numeric time object, most likely 'Date' or 'POSIXct'), this object is passed through timeid
before going to C++. Else (e.g. t
is character), it is passed through qG
which performs ordered grouping. If t
is a list of multiple variables, it is passed through finteraction
. You can customize this behavior by calling any of these functions (including unclass/as.integer
) on your time variable beforehand.
At the C++ level, if both g/by
and t
are supplied, flag
works as follows: Use two initial passes to create an ordering through which the data are accessed. First-pass: Calculate minimum and maximum time-value for each individual. Second-pass: Generate an internal ordering vector (o
) by placing the current element index into the vector slot obtained by adding the cumulative group size and the current time-value subtracted its individual-minimum together. This method of computation is faster than any sort-based method and delivers optimal performance if the panel-id supplied to g/by
is already a factor variable, and if t
is an integer/factor variable. For irregular time/panel series, length(o) > length(x)
, and o
represents the unobserved 'complete series'. If length(o) > 1e7 && length(o) > 3*length(x)
, a warning is issued to make you aware of potential performance implications of the oversized ordering vector.
The 'indexed_series' ('pseries') and 'indexed_frame' ('pdata.frame') methods automatically utilize the identifiers attached to these objects, which are already factors, thus lagging is quite efficient. However, the internal ordering vector still needs to be computed, thus if data are known to be ordered and regularly spaced, using shift = "row"
to toggle a simple group-lag (same as utilizing g
but not t
in other methods) can yield a significant performance gain.
Value
x
lagged / leaded n
-times, grouped by g/by
, ordered by t
. See Details and Examples.
See Also
fdiff
, fgrowth
, Time Series and Panel Series, Collapse Overview
Examples
## Simple Time Series: AirPassengers
L(AirPassengers) # 1 lag
flag(AirPassengers) # Same
L(AirPassengers, -1) # 1 lead
head(L(AirPassengers, -1:3)) # 1 lead and 3 lags - output as matrix
## Time Series Matrix of 4 EU Stock Market Indicators, 1991-1998
tsp(EuStockMarkets) # Data is recorded on 260 days per year
freq <- frequency(EuStockMarkets)
plot(stl(EuStockMarkets[,"DAX"], freq)) # There is some obvious seasonality
head(L(EuStockMarkets, -1:3 * freq)) # 1 annual lead and 3 annual lags
summary(lm(DAX ~., data = L(EuStockMarkets,-1:3*freq))) # DAX regressed on its own annual lead,
# lags and the lead/lags of the other series
## World Development Panel Data
head(flag(wlddev, 1, wlddev$iso3c, wlddev$year)) # This lags all variables,
head(L(wlddev, 1, ~iso3c, ~year)) # This lags all numeric variables
head(L(wlddev, 1, ~iso3c)) # Without t: Works because data is ordered
head(L(wlddev, 1, PCGDP + LIFEEX ~ iso3c, ~year)) # This lags GDP per Capita & Life Expectancy
head(L(wlddev, 0:2, ~ iso3c, ~year, cols = 9:10)) # Same, also retaining original series
head(L(wlddev, 1:2, PCGDP + LIFEEX ~ iso3c, ~year, # Two lags, dropping id columns
keep.ids = FALSE))
# Regressing GDP on its's lags and life-Expectancy and its lags
summary(lm(PCGDP ~ ., L(wlddev, 0:2, ~iso3c, ~year, 9:10, keep.ids = FALSE)))
## Indexing the data: facilitates time-based computations
wldi <- findex_by(wlddev, iso3c, year)
head(L(wldi, 0:2, cols = 9:10)) # Again 2 lags of GDP and LIFEEX
head(L(wldi$PCGDP)) # Lagging an indexed series
summary(lm(PCGDP ~ L(PCGDP,1:2) + L(LIFEEX,0:2), wldi)) # Running the lm again
summary(lm(PCGDP ~ ., L(wldi, 0:2, 9:10, keep.ids = FALSE))) # Same thing
## Using grouped data:
library(magrittr)
wlddev |> fgroup_by(iso3c) |> fselect(PCGDP,LIFEEX) |> flag(0:2)
wlddev |> fgroup_by(iso3c) |> fselect(year,PCGDP,LIFEEX) |> flag(0:2,year) # Also using t (safer)
Fast (Weighted) Linear Model Fitting
Description
flm
is a fast linear model command that (by default) only returns a coefficient matrix. 6 different efficient fitting methods are implemented: 4 using base R linear algebra, and 2 utilizing the RcppArmadillo and RcppEigen packages. The function itself only has an overhead of 5-10 microseconds, and is thus well suited as a bootstrap workhorse.
Usage
flm(...) # Internal method dispatch: default if is.atomic(..1)
## Default S3 method:
flm(y, X, w = NULL, add.icpt = FALSE, return.raw = FALSE,
method = c("lm", "solve", "qr", "arma", "chol", "eigen"),
eigen.method = 3L, ...)
## S3 method for class 'formula'
flm(formula, data = NULL, weights = NULL, add.icpt = TRUE, ...)
Arguments
y |
a response vector or matrix. Multiple dependent variables are only supported by methods "lm", "solve", "qr" and "chol". | ||||||||||||||||||||||||||||||||||||
X |
a matrix of regressors. | ||||||||||||||||||||||||||||||||||||
w |
a weight vector. | ||||||||||||||||||||||||||||||||||||
add.icpt |
logical. | ||||||||||||||||||||||||||||||||||||
formula |
a | ||||||||||||||||||||||||||||||||||||
data |
a named list or data frame. | ||||||||||||||||||||||||||||||||||||
weights |
a weights vector or expression that results in a vector when evaluated in the | ||||||||||||||||||||||||||||||||||||
return.raw |
logical. | ||||||||||||||||||||||||||||||||||||
method |
an integer or character string specifying the method of computation:
| ||||||||||||||||||||||||||||||||||||
eigen.method |
integer. Select the method of computation used by
See | ||||||||||||||||||||||||||||||||||||
... |
further arguments passed to other methods. For the formula method further arguments passed to the default method. Additional arguments can also be passed to the default method e.g. |
Value
If return.raw = FALSE
, a matrix of coefficients with the rows corresponding to the columns of X
, otherwise the raw results from the various methods are returned.
Note
Method "qr" supports sparse matrices, so for an X
matrix with many dummy variables consider method "qr" passing as(X, "dgCMatrix")
instead of just X
.
See Also
fhdwithin/HDW
, fFtest
, Data Transformations, Collapse Overview
Examples
# Simple usage
coef <- flm(mpg ~ hp + carb, mtcars, w = wt)
# Same thing in programming usage
flm(mtcars$mpg, qM(mtcars[c("hp","carb")]), mtcars$wt, add.icpt = TRUE)
# Check this is correct
lmcoef <- coef(lm(mpg ~ hp + carb, weights = wt, mtcars))
all.equal(drop(coef), lmcoef)
# Multi-dependent variable (only some methods)
flm(cbind(mpg, qsec) ~ hp + carb, mtcars, w = wt)
# Returning raw results from solver: different for different methods
flm(mpg ~ hp + carb, mtcars, return.raw = TRUE)
flm(mpg ~ hp + carb, mtcars, method = "qr", return.raw = TRUE)
# Test that all methods give the same result
all_obj_equal(lapply(1:6, function(i)
flm(mpg ~ hp + carb, mtcars, w = wt, method = i)))
Fast Matching
Description
Fast matching of elements/rows in x
to elements/rows in table
.
This is a much faster replacement for match
that works
with atomic vectors and data frames / lists of equal-length vectors. It is the workhorse function of join
.
Usage
fmatch(x, table, nomatch = NA_integer_,
count = FALSE, overid = 1L)
# Check match: throws an informative error for non-matched elements
# Default message reflects frequent internal use to check data frame columns
ckmatch(x, table, e = "Unknown columns:", ...)
# Infix operators based on fmatch():
x %!in% table # Opposite of %in%
x %iin% table # = which(x %in% table), but more efficient
x %!iin% table # = which(x %!in% table), but more efficient
# Use set_collapse(mask = "%in%") to replace %in% with
# a much faster version based on fmatch()
Arguments
x |
a vector, list or data frame whose elements are matched against |
table |
a vector, list or data frame to match against. |
nomatch |
integer. Value to be returned in the case when no match is found. Default is |
count |
logical. Counts number of (unique) matches and attaches 4 attributes:
Note that computing these attributes requires an extra pass through the matching vector. Also note that these attributes contain no general information about whether either |
overid |
integer. If
|
e |
the error message thrown by |
... |
further arguments to |
Details
With data frames / lists, fmatch
compares the rows but moves through the data on a column-by-column basis (like a vectorized hash join algorithm). With two or more columns, the first two columns are hashed simultaneously for speed. Further columns can be added to this match. It is likely that the first 2, 3, 4 etc. columns of a data frame fully identify the data. After each column fmatch()
internally checks whether the table
rows that are still eligible for matching (eliminating nomatch
rows from earlier columns) are unique. If this is the case and overid = 0
, fmatch()
terminates early without considering further columns. This is efficient but may give undesirable/wrong results if considering further columns would turn some additional elements of the result vector into nomatch
values.
Value
Integer vector containing the positions of first matches of x
in table
. nomatch
is returned for elements of x
that have no match in table
. If count = TRUE
, the result has additional attributes and a class "qG"
.
See Also
join
, funique
, group
, Fast Grouping and Ordering, Collapse Overview
Examples
x <- c("b", "c", "a", "e", "f", "ff")
fmatch(x, letters)
fmatch(x, letters, nomatch = 0)
fmatch(x, letters, count = TRUE)
# Table 1
df1 <- data.frame(
id1 = c(1, 1, 2, 3),
id2 = c("a", "b", "b", "c"),
name = c("John", "Bob", "Jane", "Carl")
)
head(df1)
# Table 2
df2 <- data.frame(
id1 = c(1, 2, 3, 3),
id2 = c("a", "b", "c", "e"),
name = c("John", "Janne", "Carl", "Lynne")
)
head(df2)
# This gives an overidentification warning: columns 1:2 identify the data
if(FALSE) fmatch(df1, df2)
# This just runs through without warning
fmatch(df1, df2, overid = 2)
# This terminates computation after first 2 columns
fmatch(df1, df2, overid = 0)
fmatch(df1[1:2], df2[1:2]) # Same thing!
# -> note that here we get an additional match based on the unique ids,
# which we didn't get before because "Jane" != "Janne"
Fast (Grouped, Weighted) Mean for Matrix-Like Objects
Description
fmean
is a generic function that computes the (column-wise) mean of x
, (optionally) grouped by g
and/or weighted by w
.
The TRA
argument can further be used to transform x
using its (grouped, weighted) mean.
Usage
fmean(x, ...)
## Default S3 method:
fmean(x, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, nthreads = .op[["nthreads"]], ...)
## S3 method for class 'matrix'
fmean(x, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, drop = TRUE, nthreads = .op[["nthreads"]], ...)
## S3 method for class 'data.frame'
fmean(x, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, drop = TRUE, nthreads = .op[["nthreads"]], ...)
## S3 method for class 'grouped_df'
fmean(x, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = FALSE, keep.group_vars = TRUE,
keep.w = TRUE, stub = .op[["stub"]], nthreads = .op[["nthreads"]], ...)
Arguments
x |
a numeric vector, matrix, data frame or grouped data frame (class 'grouped_df'). |
g |
a factor, |
w |
a numeric vector of (non-negative) weights, may contain missing values. |
TRA |
an integer or quoted operator indicating the transformation to perform:
0 - "na" | 1 - "fill" | 2 - "replace" | 3 - "-" | 4 - "-+" | 5 - "/" | 6 - "%" | 7 - "+" | 8 - "*" | 9 - "%%" | 10 - "-%%". See |
na.rm |
logical. Skip missing values in |
use.g.names |
logical. Make group-names and add to the result as names (default method) or row-names (matrix and data frame methods). No row-names are generated for data.table's. |
nthreads |
integer. The number of threads to utilize. See Details of |
drop |
matrix and data.frame method: Logical. |
keep.group_vars |
grouped_df method: Logical. |
keep.w |
grouped_df method: Logical. Retain summed weighting variable after computation (if contained in |
stub |
character. If |
... |
arguments to be passed to or from other methods. If |
Details
The weighted mean is computed as sum(x * w) / sum(w)
, using a single pass in C. If na.rm = TRUE
, missing values will be removed from both x
and w
i.e. utilizing only x[complete.cases(x,w)]
and w[complete.cases(x,w)]
.
For further computational details see fsum
, which works equivalently.
Value
The (w
weighted) mean of x
, grouped by g
, or (if TRA
is used) x
transformed by its (grouped, weighted) mean.
See Also
fmedian
, fmode
, Fast Statistical Functions, Collapse Overview
Examples
## default vector method
mpg <- mtcars$mpg
fmean(mpg) # Simple mean
fmean(mpg, w = mtcars$hp) # Weighted mean: Weighted by hp
fmean(mpg, TRA = "-") # Simple transformation: demeaning (See also ?W)
fmean(mpg, mtcars$cyl) # Grouped mean
fmean(mpg, mtcars[8:9]) # another grouped mean.
g <- GRP(mtcars[c(2,8:9)])
fmean(mpg, g) # Pre-computing groups speeds up the computation
fmean(mpg, g, mtcars$hp) # Grouped weighted mean
fmean(mpg, g, TRA = "-") # Demeaning by group
fmean(mpg, g, mtcars$hp, "-") # Group-demeaning using weighted group means
## data.frame method
fmean(mtcars)
fmean(mtcars, g)
fmean(fgroup_by(mtcars, cyl, vs, am)) # Another way of doing it..
head(fmean(mtcars, g, TRA = "-")) # etc..
## matrix method
m <- qM(mtcars)
fmean(m)
fmean(m, g)
head(fmean(m, g, TRA = "-")) # etc..
## method for grouped data frames - created with dplyr::group_by or fgroup_by
mtcars |> fgroup_by(cyl,vs,am) |> fmean() # Ordinary
mtcars |> fgroup_by(cyl,vs,am) |> fmean(hp) # Weighted
mtcars |> fgroup_by(cyl,vs,am) |> fmean(hp, "-") # Weighted Transform
mtcars |> fgroup_by(cyl,vs,am) |>
fselect(mpg,hp) |> fmean(hp, "-") # Only mpg
Fast (Grouped) Maxima and Minima for Matrix-Like Objects
Description
fmax
and fmin
are generic functions that compute the (column-wise) maximum and minimum value of all values in x
, (optionally) grouped by g
. The TRA
argument can further be used to transform x
using its (grouped) maximum or minimum value.
Usage
fmax(x, ...)
fmin(x, ...)
## Default S3 method:
fmax(x, g = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, ...)
## Default S3 method:
fmin(x, g = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, ...)
## S3 method for class 'matrix'
fmax(x, g = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, drop = TRUE, ...)
## S3 method for class 'matrix'
fmin(x, g = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, drop = TRUE, ...)
## S3 method for class 'data.frame'
fmax(x, g = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, drop = TRUE, ...)
## S3 method for class 'data.frame'
fmin(x, g = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, drop = TRUE, ...)
## S3 method for class 'grouped_df'
fmax(x, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = FALSE, keep.group_vars = TRUE, ...)
## S3 method for class 'grouped_df'
fmin(x, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = FALSE, keep.group_vars = TRUE, ...)
Arguments
x |
a numeric vector, matrix, data frame or grouped data frame (class 'grouped_df'). |
g |
a factor, |
TRA |
an integer or quoted operator indicating the transformation to perform:
0 - "na" | 1 - "fill" | 2 - "replace" | 3 - "-" | 4 - "-+" | 5 - "/" | 6 - "%" | 7 - "+" | 8 - "*" | 9 - "%%" | 10 - "-%%". See |
na.rm |
logical. Skip missing values in |
use.g.names |
logical. Make group-names and add to the result as names (default method) or row-names (matrix and data frame methods). No row-names are generated for data.table's. |
drop |
matrix and data.frame method: Logical. |
keep.group_vars |
grouped_df method: Logical. |
... |
arguments to be passed to or from other methods. If |
Details
Missing-value removal as controlled by the na.rm
argument is done at no extra cost since in C++ any logical comparison involving NA
or NaN
evaluates to FALSE
. Large performance gains can nevertheless be achieved in the presence of missing values if na.rm = FALSE
, since then the corresponding computation is terminated once a NA
is encountered and NA
is returned (unlike max
and min
which just run through without any checks).
For further computational details see fsum
.
Value
fmax
returns the maximum value of x
, grouped by g
, or (if TRA
is used) x
transformed by its (grouped) maximum value. Analogous, fmin
returns the minimum value ...
See Also
Fast Statistical Functions, Collapse Overview
Examples
## default vector method
mpg <- mtcars$mpg
fmax(mpg) # Maximum value
fmin(mpg) # Minimum value (all examples below use fmax but apply to fmin)
fmax(mpg, TRA = "%") # Simple transformation: Take percentage of maximum value
fmax(mpg, mtcars$cyl) # Grouped maximum value
fmax(mpg, mtcars[c(2,8:9)]) # More groups..
g <- GRP(mtcars, ~ cyl + vs + am) # Precomputing groups gives more speed !
fmax(mpg, g)
fmax(mpg, g, TRA = "%") # Groupwise percentage of maximum value
fmax(mpg, g, TRA = "replace") # Groupwise replace by maximum value
## data.frame method
fmax(mtcars)
head(fmax(mtcars, TRA = "%"))
fmax(mtcars, g)
fmax(mtcars, g, use.g.names = FALSE) # No row-names generated
## matrix method
m <- qM(mtcars)
fmax(m)
head(fmax(m, TRA = "%"))
fmax(m, g) # etc..
## method for grouped data frames - created with dplyr::group_by or fgroup_by
mtcars |> fgroup_by(cyl,vs,am) |> fmax()
mtcars |> fgroup_by(cyl,vs,am) |> fmax("%")
mtcars |> fgroup_by(cyl,vs,am) |> fselect(mpg) |> fmax()
Fast (Grouped, Weighted) Statistical Mode for Matrix-Like Objects
Description
fmode
is a generic function and returns the (column-wise) statistical mode i.e. the most frequent value of x
, (optionally) grouped by g
and/or weighted by w
.
The TRA
argument can further be used to transform x
using its (grouped, weighted) mode. Ties between multiple possible modes can be resolved by taking the minimum, maximum, (default) first or last occurring mode.
Usage
fmode(x, ...)
## Default S3 method:
fmode(x, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, ties = "first", nthreads = .op[["nthreads"]], ...)
## S3 method for class 'matrix'
fmode(x, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, drop = TRUE, ties = "first", nthreads = .op[["nthreads"]], ...)
## S3 method for class 'data.frame'
fmode(x, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, drop = TRUE, ties = "first", nthreads = .op[["nthreads"]], ...)
## S3 method for class 'grouped_df'
fmode(x, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = FALSE, keep.group_vars = TRUE, keep.w = TRUE, stub = .op[["stub"]],
ties = "first", nthreads = .op[["nthreads"]], ...)
Arguments
x |
a vector, matrix, data frame or grouped data frame (class 'grouped_df'). | ||||||||||||||||||||||||||
g |
a factor, | ||||||||||||||||||||||||||
w |
a numeric vector of (non-negative) weights, may contain missing values. | ||||||||||||||||||||||||||
TRA |
an integer or quoted operator indicating the transformation to perform:
0 - "na" | 1 - "fill" | 2 - "replace" | 3 - "-" | 4 - "-+" | 5 - "/" | 6 - "%" | 7 - "+" | 8 - "*" | 9 - "%%" | 10 - "-%%". See | ||||||||||||||||||||||||||
na.rm |
logical. Skip missing values in | ||||||||||||||||||||||||||
use.g.names |
logical. Make group-names and add to the result as names (default method) or row-names (matrix and data frame methods). No row-names are generated for data.table's. | ||||||||||||||||||||||||||
ties |
an integer or character string specifying the method to resolve ties between multiple possible modes i.e. multiple values with the maximum frequency or sum of weights:
Note: | ||||||||||||||||||||||||||
nthreads |
integer. The number of threads to utilize. Parallelism is across groups for grouped computations and at the column-level otherwise. | ||||||||||||||||||||||||||
drop |
matrix and data.frame method: Logical. | ||||||||||||||||||||||||||
keep.group_vars |
grouped_df method: Logical. | ||||||||||||||||||||||||||
keep.w |
grouped_df method: Logical. Retain | ||||||||||||||||||||||||||
stub |
character. If | ||||||||||||||||||||||||||
... |
arguments to be passed to or from other methods. If |
Details
fmode
implements a pretty fast C-level hashing algorithm inspired by the kit package to find the statistical mode.
If na.rm = FALSE
, NA
is not removed but treated as any other value (i.e. its frequency is counted). If all values are NA
, NA
is always returned.
The weighted mode is computed by summing up the weights for all distinct values and choosing the value with the largest sum. If na.rm = TRUE
, missing values will be removed from both x
and w
i.e. utilizing only x[complete.cases(x,w)]
and w[complete.cases(x,w)]
.
It is possible that multiple values have the same mode (the maximum frequency or sum of weights). Typical cases are simply when all values are either all the same or all distinct. In such cases, the default option ties = "first"
returns the first occurring value in the data reaching the maximum frequency count or sum of weights. For example in a sample x = c(1, 3, 2, 2, 4, 4, 1, 7)
, the first mode is 2 as fmode
goes through the data from left to right. ties = "last"
on the other hand gives 1. It is also possible to take the minimum or maximum mode, i.e. fmode(x, ties = "min")
returns 1, and fmode(x, ties = "max")
returns 4. It should be noted that options ties = "min"
and ties = "max"
give unintuitive results for character data (no strict alphabetic sorting, similar to using <
and >
to compare character values in R). These options are also best avoided if missing values are counted (na.rm = FALSE
) since no proper logical comparison with missing values is possible: With numeric data it depends, since in C++ any comparison with NA_real_
evaluates to FALSE
, NA_real_
is chosen as the min or max mode only if it is also the first mode, and never otherwise. For integer data, NA_integer_
is stored as the smallest integer in C++, so it will always be chosen as the min mode and never as the max mode. For character data, NA_character_
is stored as the string "NA"
in C++ and thus the behavior depends on the other character content.
fmode
preserves all the attributes of the objects it is applied to (apart from names or row-names which are adjusted as necessary in grouped operations). If a data frame is passed to fmode
and drop = TRUE
(the default), unlist
will be called on the result, which might not be sensible depending on the data at hand.
Value
The (w
weighted) statistical mode of x
, grouped by g
, or (if TRA
is used) x
transformed by its (grouped, weighed) mode.
See Also
fmean
, fmedian
, Fast Statistical Functions, Collapse Overview
Examples
x <- c(1, 3, 2, 2, 4, 4, 1, 7, NA, NA, NA)
fmode(x) # Default is ties = "first"
fmode(x, ties = "last")
fmode(x, ties = "min")
fmode(x, ties = "max")
fmode(x, na.rm = FALSE) # Here NA is the mode, regardless of ties option
fmode(x[-length(x)], na.rm = FALSE) # Not anymore..
## World Development Data
attach(wlddev)
## default vector method
fmode(PCGDP) # Numeric mode
head(fmode(PCGDP, iso3c)) # Grouped numeric mode
head(fmode(PCGDP, iso3c, LIFEEX)) # Grouped and weighted numeric mode
fmode(region) # Factor mode
fmode(date) # Date mode (defaults to first value since panel is balanced)
fmode(country) # Character mode (also defaults to first value)
fmode(OECD) # Logical mode
# ..all the above can also be performed grouped and weighted
## matrix method
m <- qM(airquality)
fmode(m)
fmode(m, na.rm = FALSE) # NA frequency is also counted
fmode(m, airquality$Month) # Groupwise
fmode(m, w = airquality$Day) # Weighted: Later days in the month are given more weight
fmode(m>50, airquality$Month) # Groupwise logical mode
# etc..
## data.frame method
fmode(wlddev) # Calling unlist -> coerce to character vector
fmode(wlddev, drop = FALSE) # Gives one row
head(fmode(wlddev, iso3c)) # Grouped mode
head(fmode(wlddev, iso3c, LIFEEX)) # Grouped and weighted mode
detach(wlddev)
Fast (Grouped) Distinct Value Count for Matrix-Like Objects
Description
fndistinct
is a generic function that (column-wise) computes the number of distinct values in x
, (optionally) grouped by g
. It is significantly faster than length(unique(x))
. The TRA
argument can further be used to transform x
using its (grouped) distinct value count.
Usage
fndistinct(x, ...)
## Default S3 method:
fndistinct(x, g = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, nthreads = .op[["nthreads"]], ...)
## S3 method for class 'matrix'
fndistinct(x, g = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, drop = TRUE, nthreads = .op[["nthreads"]], ...)
## S3 method for class 'data.frame'
fndistinct(x, g = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, drop = TRUE, nthreads = .op[["nthreads"]], ...)
## S3 method for class 'grouped_df'
fndistinct(x, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = FALSE, keep.group_vars = TRUE, nthreads = .op[["nthreads"]], ...)
Arguments
x |
a vector, matrix, data frame or grouped data frame (class 'grouped_df'). |
g |
a factor, |
TRA |
an integer or quoted operator indicating the transformation to perform:
0 - "na" | 1 - "fill" | 2 - "replace" | 3 - "-" | 4 - "-+" | 5 - "/" | 6 - "%" | 7 - "+" | 8 - "*" | 9 - "%%" | 10 - "-%%". See |
na.rm |
logical. |
use.g.names |
logical. Make group-names and add to the result as names (default method) or row-names (matrix and data frame methods). No row-names are generated for data.table's. |
nthreads |
integer. The number of threads to utilize. Parallelism is across groups for grouped computations and at the column-level otherwise. |
drop |
matrix and data.frame method: Logical. |
keep.group_vars |
grouped_df method: Logical. |
... |
arguments to be passed to or from other methods. If |
Details
fndistinct
implements a pretty fast C-level hashing algorithm inspired by the kit package to find the number of distinct values.
If na.rm = TRUE
(the default), missing values will be skipped yielding substantial performance gains in data with many missing values. If na.rm = FALSE
, missing values will simply be treated as any other value and read into the hash-map. Thus with the former, a numeric vector c(1.25,NaN,3.56,NA)
will have a distinct value count of 2, whereas the latter will return a distinct value count of 4.
fndistinct
preserves all attributes of non-classed vectors / columns, and only the 'label' attribute (if available) of classed vectors / columns (i.e. dates or factors). When applied to data frames and matrices, the row-names are adjusted as necessary.
Value
Integer. The number of distinct values in x
, grouped by g
, or (if TRA
is used) x
transformed by its distinct value count, grouped by g
.
See Also
fnunique
, fnobs
, Fast Statistical Functions, Collapse Overview
Examples
## default vector method
fndistinct(airquality$Solar.R) # Simple distinct value count
fndistinct(airquality$Solar.R, airquality$Month) # Grouped distinct value count
## data.frame method
fndistinct(airquality)
fndistinct(airquality, airquality$Month)
fndistinct(wlddev) # Works with data of all types!
head(fndistinct(wlddev, wlddev$iso3c))
## matrix method
aqm <- qM(airquality)
fndistinct(aqm) # Also works for character or logical matrices
fndistinct(aqm, airquality$Month)
## method for grouped data frames - created with dplyr::group_by or fgroup_by
airquality |> fgroup_by(Month) |> fndistinct()
wlddev |> fgroup_by(country) |>
fselect(PCGDP,LIFEEX,GINI,ODA) |> fndistinct()
Fast (Grouped) Observation Count for Matrix-Like Objects
Description
fnobs
is a generic function that (column-wise) computes the number of non-missing values in x
, (optionally) grouped by g
. It is much faster than sum(!is.na(x))
. The TRA
argument can further be used to transform x
using its (grouped) observation count.
Usage
fnobs(x, ...)
## Default S3 method:
fnobs(x, g = NULL, TRA = NULL, use.g.names = TRUE, ...)
## S3 method for class 'matrix'
fnobs(x, g = NULL, TRA = NULL, use.g.names = TRUE, drop = TRUE, ...)
## S3 method for class 'data.frame'
fnobs(x, g = NULL, TRA = NULL, use.g.names = TRUE, drop = TRUE, ...)
## S3 method for class 'grouped_df'
fnobs(x, TRA = NULL, use.g.names = FALSE, keep.group_vars = TRUE, ...)
Arguments
x |
a vector, matrix, data frame or grouped data frame (class 'grouped_df'). |
g |
a factor, |
TRA |
an integer or quoted operator indicating the transformation to perform:
0 - "na" | 1 - "fill" | 2 - "replace" | 3 - "-" | 4 - "-+" | 5 - "/" | 6 - "%" | 7 - "+" | 8 - "*" | 9 - "%%" | 10 - "-%%". See |
use.g.names |
logical. Make group-names and add to the result as names (default method) or row-names (matrix and data frame methods). No row-names are generated for data.table's. |
drop |
matrix and data.frame method: Logical. |
keep.group_vars |
grouped_df method: Logical. |
... |
arguments to be passed to or from other methods. If |
Details
fnobs
preserves all attributes of non-classed vectors / columns, and only the 'label' attribute (if available) of classed vectors / columns (i.e. dates or factors). When applied to data frames and matrices, the row-names are adjusted as necessary.
Value
Integer. The number of non-missing observations in x
, grouped by g
, or (if TRA
is used) x
transformed by its number of non-missing observations, grouped by g
.
See Also
fndistinct
, Fast Statistical Functions, Collapse Overview
Examples
## default vector method
fnobs(airquality$Solar.R) # Simple Nobs
fnobs(airquality$Solar.R, airquality$Month) # Grouped Nobs
## data.frame method
fnobs(airquality)
fnobs(airquality, airquality$Month)
fnobs(wlddev) # Works with data of all types!
head(fnobs(wlddev, wlddev$iso3c))
## matrix method
aqm <- qM(airquality)
fnobs(aqm) # Also works for character or logical matrices
fnobs(aqm, airquality$Month)
## method for grouped data frames - created with dplyr::group_by or fgroup_by
airquality |> fgroup_by(Month) |> fnobs()
wlddev |> fgroup_by(country) |>
fselect(PCGDP,LIFEEX,GINI,ODA) |> fnobs()
Fast (Grouped, Weighted) N'th Element/Quantile for Matrix-Like Objects
Description
fnth
(column-wise) returns the n'th smallest element from a set of unsorted elements x
corresponding to an integer index (n
), or to a probability between 0 and 1. If n
is passed as a probability, ties can be resolved using the lower, upper, or average of the possible elements, or (default) continuous quantile estimation. For n > 1
, the lower element is always returned (as in sort(x, partial = n)[n]
). See Details.
fmedian
is a simple wrapper around fnth
, which fixes n = 0.5
and (default) ties = "mean"
, i.e., it averages eligible elements. See Details.
Usage
fnth(x, n = 0.5, ...)
fmedian(x, ...)
## Default S3 method:
fnth(x, n = 0.5, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, ties = "q7", nthreads = .op[["nthreads"]],
o = NULL, check.o = is.null(attr(o, "sorted")), ...)
## Default S3 method:
fmedian(x, ..., ties = "mean")
## S3 method for class 'matrix'
fnth(x, n = 0.5, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, drop = TRUE, ties = "q7", nthreads = .op[["nthreads"]], ...)
## S3 method for class 'matrix'
fmedian(x, ..., ties = "mean")
## S3 method for class 'data.frame'
fnth(x, n = 0.5, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, drop = TRUE, ties = "q7", nthreads = .op[["nthreads"]], ...)
## S3 method for class 'data.frame'
fmedian(x, ..., ties = "mean")
## S3 method for class 'grouped_df'
fnth(x, n = 0.5, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = FALSE, keep.group_vars = TRUE, keep.w = TRUE, stub = .op[["stub"]],
ties = "q7", nthreads = .op[["nthreads"]], ...)
## S3 method for class 'grouped_df'
fmedian(x, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = FALSE, keep.group_vars = TRUE, keep.w = TRUE, stub = .op[["stub"]],
ties = "mean", nthreads = .op[["nthreads"]], ...)
Arguments
x |
a numeric vector, matrix, data frame or grouped data frame (class 'grouped_df'). | ||||||||||||||||||||||||||
n |
the element to return using a single integer index such that | ||||||||||||||||||||||||||
g |
a factor, | ||||||||||||||||||||||||||
w |
a numeric vector of (non-negative) weights, may contain missing values only where | ||||||||||||||||||||||||||
TRA |
an integer or quoted operator indicating the transformation to perform:
0 - "na" | 1 - "fill" | 2 - "replace" | 3 - "-" | 4 - "-+" | 5 - "/" | 6 - "%" | 7 - "+" | 8 - "*" | 9 - "%%" | 10 - "-%%". See | ||||||||||||||||||||||||||
na.rm |
logical. Skip missing values in | ||||||||||||||||||||||||||
use.g.names |
logical. Make group-names and add to the result as names (default method) or row-names (matrix and data frame methods). No row-names are generated for data.table's. | ||||||||||||||||||||||||||
ties |
an integer or character string specifying the method to resolve ties between adjacent qualifying elements:
| ||||||||||||||||||||||||||
nthreads |
integer. The number of threads to utilize. Parallelism is across groups for grouped computations on vectors and data frames, and at the column-level otherwise. See Details. | ||||||||||||||||||||||||||
o |
integer. A valid ordering of | ||||||||||||||||||||||||||
check.o |
logical. | ||||||||||||||||||||||||||
drop |
matrix and data.frame method: Logical. | ||||||||||||||||||||||||||
keep.group_vars |
grouped_df method: Logical. | ||||||||||||||||||||||||||
keep.w |
grouped_df method: Logical. Retain | ||||||||||||||||||||||||||
stub |
character. If | ||||||||||||||||||||||||||
... |
for |
Details
fnth
uses a combination of quickselect, quicksort, and radixsort algorithms, combined with several (weighted) quantile estimation methods and, where possible, OpenMP multithreading:
without weights, quickselect is used to determine a (lower) order statistic. If
ties %!in% c("min", "max")
a second order statistic is found by taking the max of the upper part of the partitioned array, and the two statistics are averaged using a simple mean (ties = "mean"
), or weighted average according to aquantile
method (ties = "q4"-"q9"
). Forn = 0.5
, all supported quantile methods give the sample median. With matrices, multithreading is always across columns, for vectors and data frames it is across groups unlessis.null(g)
for data frames.with weights and no groups (
is.null(g)
),radixorder
is called internally (on each column ofx
). The ordering is used to sum the weights in order ofx
and determine weighted order statistics or quantiles. See details below. Multithreading is disabled asradixorder
cannot be called concurrently on the same memory stack.with weights and groups (
!is.null(g)
), R's quicksort algorithm is used to sort the data in each group and return an index which can be used to sum the weights in order and proceed as before. This is multithreaded across columns for matrices, and across groups otherwise.in
fnth.default
, an ordering ofx
can be supplied to 'o
' e.g.fnth(x, 0.75, o = radixorder(x))
. This dramatically speeds up the estimation both with and without weights, and is useful iffnth
is to be invoked repeatedly on the same data. With groups,o
needs to also account for the grouping e.g.fnth(x, 0.75, g, o = radixorder(g, x))
. Multithreading is possible across groups. See Examples.
If n > 1
, the result is equivalent to (column-wise) sort(x, partial = n)[n]
. Internally, n
is converted to a probability using p = (n-1)/(NROW(x)-1)
, and that probability is applied to the set of non-missing elements to find the as.integer(p*(fnobs(x)-1))+1L
'th element (which corresponds to option ties = "min"
).
When using grouped computations with n > 1
, n
is transformed to a probability p = (n-1)/(NROW(x)/ng-1)
(where ng
contains the number of unique groups in g
).
If weights are used and ties = "q4"-"q9"
, weighted continuous quantile estimation is done as described in fquantile
.
For ties %in% c("mean", "min", "max")
, a target partial sum of weights p*sum(w)
is calculated, and the weighted n'th element is the element k such that all elements smaller than k have a sum of weights <= p*sum(w)
, and all elements larger than k have a sum of weights <= (1 - p)*sum(w)
. If the partial-sum of weights (p*sum(w)
) is reached exactly for some element k, then (summing from the lower end) both k and k+1 would qualify as the weighted n'th element. If the weight of element k+1 is zero, k, k+1 and k+2 would qualify... . If n > 1
, k is chosen (consistent with the unweighted behavior).
If 0 < n < 1
, the ties
option regulates how to resolve such conflicts, yielding lower (ties = "min"
: k), upper (ties = "max"
: k+2) or average weighted (ties = "mean"
: mean(k, k+1, k+2)) n'th elements.
Thus, in the presence of zero weights, the weighted median (default ties = "mean"
) can be an arithmetic average of >2 qualifying elements.
For data frames, column-attributes and overall attributes are preserved if g
is used or drop = FALSE
.
Value
The (w
weighted) n'th element/quantile of x
, grouped by g
, or (if TRA
is used) x
transformed by its (grouped, weighted) n'th element/quantile.
See Also
fquantile
, fmean
, fmode
, Fast Statistical Functions, Collapse Overview
Examples
## default vector method
mpg <- mtcars$mpg
fnth(mpg) # Simple nth element: Median (same as fmedian(mpg))
fnth(mpg, 5) # 5th smallest element
sort(mpg, partial = 5)[5] # Same using base R, fnth is 2x faster.
fnth(mpg, 0.75) # Third quartile
fnth(mpg, 0.75, w = mtcars$hp) # Weighted third quartile: Weighted by hp
fnth(mpg, 0.75, TRA = "-") # Simple transformation: Subtract third quartile
fnth(mpg, 0.75, mtcars$cyl) # Grouped third quartile
fnth(mpg, 0.75, mtcars[c(2,8:9)]) # More groups..
g <- GRP(mtcars, ~ cyl + vs + am) # Precomputing groups gives more speed !
fnth(mpg, 0.75, g)
fnth(mpg, 0.75, g, mtcars$hp) # Grouped weighted third quartile
fnth(mpg, 0.75, g, TRA = "-") # Groupwise subtract third quartile
fnth(mpg, 0.75, g, mtcars$hp, "-") # Groupwise subtract weighted third quartile
## data.frame method
fnth(mtcars, 0.75)
head(fnth(mtcars, 0.75, TRA = "-"))
fnth(mtcars, 0.75, g)
fnth(fgroup_by(mtcars, cyl, vs, am), 0.75) # Another way of doing it..
fnth(mtcars, 0.75, g, use.g.names = FALSE) # No row-names generated
## matrix method
m <- qM(mtcars)
fnth(m, 0.75)
head(fnth(m, 0.75, TRA = "-"))
fnth(m, 0.75, g) # etc..
## method for grouped data frames - created with dplyr::group_by or fgroup_by
mtcars |> fgroup_by(cyl,vs,am) |> fnth(0.75)
mtcars |> fgroup_by(cyl,vs,am) |> fnth(0.75, hp) # Weighted
mtcars |> fgroup_by(cyl,vs,am) |> fnth(0.75, TRA = "/") # Divide by third quartile
mtcars |> fgroup_by(cyl,vs,am) |> fselect(mpg, hp) |> # Faster selecting
fnth(0.75, hp, "/") # Divide mpg by its third weighted group-quartile, using hp as weights
# Efficient grouped estimation of multiple quantiles
mtcars |> fgroup_by(cyl,vs,am) |>
fmutate(o = radixorder(GRPid(), mpg)) |>
fsummarise(mpg_Q1 = fnth(mpg, 0.25, o = o),
mpg_median = fmedian(mpg, o = o),
mpg_Q3 = fnth(mpg, 0.75, o = o))
## fmedian()
fmedian(mpg) # Simple median value
fmedian(mpg, w = mtcars$hp) # Weighted median: Weighted by hp
fmedian(mpg, TRA = "-") # Simple transformation: Subtract median value
fmedian(mpg, mtcars$cyl) # Grouped median value
fmedian(mpg, mtcars[c(2,8:9)]) # More groups..
fmedian(mpg, g)
fmedian(mpg, g, mtcars$hp) # Grouped weighted median
fmedian(mpg, g, TRA = "-") # Groupwise subtract median value
fmedian(mpg, g, mtcars$hp, "-") # Groupwise subtract weighted median value
## data.frame method
fmedian(mtcars)
head(fmedian(mtcars, TRA = "-"))
fmedian(mtcars, g)
fmedian(fgroup_by(mtcars, cyl, vs, am)) # Another way of doing it..
fmedian(mtcars, g, use.g.names = FALSE) # No row-names generated
## matrix method
fmedian(m)
head(fmedian(m, TRA = "-"))
fmedian(m, g) # etc..
## method for grouped data frames - created with dplyr::group_by or fgroup_by
mtcars |> fgroup_by(cyl,vs,am) |> fmedian()
mtcars |> fgroup_by(cyl,vs,am) |> fmedian(hp) # Weighted
mtcars |> fgroup_by(cyl,vs,am) |> fmedian(TRA = "-") # De-median
mtcars |> fgroup_by(cyl,vs,am) |> fselect(mpg, hp) |> # Faster selecting
fmedian(hp, "-") # Weighted de-median mpg, using hp as weights
Fast (Grouped, Weighted) Product for Matrix-Like Objects
Description
fprod
is a generic function that computes the (column-wise) product of all values in x
, (optionally) grouped by g
and/or weighted by w
. The TRA
argument can further be used to transform x
using its (grouped, weighted) product.
Usage
fprod(x, ...)
## Default S3 method:
fprod(x, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, ...)
## S3 method for class 'matrix'
fprod(x, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, drop = TRUE, ...)
## S3 method for class 'data.frame'
fprod(x, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, drop = TRUE, ...)
## S3 method for class 'grouped_df'
fprod(x, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = FALSE, keep.group_vars = TRUE,
keep.w = TRUE, stub = .op[["stub"]], ...)
Arguments
x |
a numeric vector, matrix, data frame or grouped data frame (class 'grouped_df'). |
g |
a factor, |
w |
a numeric vector of (non-negative) weights, may contain missing values. |
TRA |
an integer or quoted operator indicating the transformation to perform:
0 - "na" | 1 - "fill" | 2 - "replace" | 3 - "-" | 4 - "-+" | 5 - "/" | 6 - "%" | 7 - "+" | 8 - "*" | 9 - "%%" | 10 - "-%%". See |
na.rm |
logical. Skip missing values in |
use.g.names |
logical. Make group-names and add to the result as names (default method) or row-names (matrix and data frame methods). No row-names are generated for data.table's. |
drop |
matrix and data.frame method: Logical. |
keep.group_vars |
grouped_df method: Logical. |
keep.w |
grouped_df method: Logical. Retain product of weighting variable after computation (if contained in |
stub |
character. If |
... |
arguments to be passed to or from other methods. If |
Details
Non-grouped product computations internally utilize long-doubles in C, for additional numeric precision.
The weighted product is computed as prod(x * w)
, using a single pass in C. If na.rm = TRUE
, missing values will be removed from both x
and w
i.e. utilizing only x[complete.cases(x,w)]
and w[complete.cases(x,w)]
.
For further computational details see fsum
, which works equivalently.
Value
The (w
weighted) product of x
, grouped by g
, or (if TRA
is used) x
transformed by its (grouped, weighted) product.
See Also
fsum
, Fast Statistical Functions, Collapse Overview
Examples
## default vector method
mpg <- mtcars$mpg
fprod(mpg) # Simple product
fprod(mpg, w = mtcars$hp) # Weighted product
fprod(mpg, TRA = "/") # Simple transformation: Divide by product
fprod(mpg, mtcars$cyl) # Grouped product
fprod(mpg, mtcars$cyl, mtcars$hp) # Weighted grouped product
fprod(mpg, mtcars[c(2,8:9)]) # More groups..
g <- GRP(mtcars, ~ cyl + vs + am) # Precomputing groups gives more speed !
fprod(mpg, g)
fprod(mpg, g, TRA = "/") # Groupwise divide by product
## data.frame method
fprod(mtcars)
head(fprod(mtcars, TRA = "/"))
fprod(mtcars, g)
fprod(mtcars, g, use.g.names = FALSE) # No row-names generated
## matrix method
m <- qM(mtcars)
fprod(m)
head(fprod(m, TRA = "/"))
fprod(m, g) # etc..
## method for grouped data frames - created with dplyr::group_by or fgroup_by
mtcars |> fgroup_by(cyl,vs,am) |> fprod()
mtcars |> fgroup_by(cyl,vs,am) |> fprod(TRA = "/")
mtcars |> fgroup_by(cyl,vs,am) |> fselect(mpg) |> fprod()
Fast (Weighted) Sample Quantiles and Range
Description
A faster alternative to quantile
(written fully in C), that supports sampling weights, and can also quickly compute quantiles from an ordering vector (e.g. order(x)
). frange
provides a fast alternative to range
.
Usage
fquantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), w = NULL,
o = if(length(x) > 1e5L && length(probs) > log(length(x)))
radixorder(x) else NULL,
na.rm = .op[["na.rm"]], type = 7L, names = TRUE,
check.o = is.null(attr(o, "sorted")))
# Programmers version: no names, intelligent defaults, or checks
.quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), w = NULL, o = NULL,
na.rm = TRUE, type = 7L, names = FALSE, check.o = FALSE)
# Fast range (min and max)
frange(x, na.rm = .op[["na.rm"]], finite = FALSE)
.range(x, na.rm = TRUE, finite = FALSE)
Arguments
x |
a numeric or integer vector. |
probs |
numeric vector of probabilities with values in [0,1]. |
w |
a numeric vector of strictly positive sampling weights. Missing weights are only supported if |
o |
integer. An vector giving the ordering of the elements in |
na.rm |
logical. Remove missing values, default |
finite |
logical. Omit all non-finite values. |
type |
integer. Quantile types 4-9. See |
names |
logical. Generates names of the form |
check.o |
logical. If |
Details
fquantile
is implemented using a quickselect algorithm in C, inspired by data.table's gmedian
. The algorithm is applied incrementally to different sections of the array to find individual quantiles. If many quantile probabilities are requested, sorting the whole array with the fast radixorder
algorithm is more efficient. The default threshold for this (length(x) > 1e5L && length(probs) > log(length(x))
) is conservative, given that quickselect is generally more efficient on longitudinal data with similar values repeated by groups. With random data, my investigations yield that a threshold of length(probs) > log10(length(x))
would be more appropriate.
frange
is considerably more efficient than range
, requiring only one pass through the data instead of two. For probabilities 0 and 1, fquantile
internally calls frange
.
Following Hyndman and Fan (1996), the quantile type-i
quantile function of the sample X
can be written as a weighted average of two order statistics:
\hat{Q}_{X,i}(p) = (1 - \gamma) X_{(j)} + \gamma X_{(j + 1)}
where j = \lfloor pn + m \rfloor,\ m \in \mathbb{R}
and \gamma = pn + m - j,\ 0 \le \gamma \le 1
, with m
differing by quantile type (i
). For example, the default type 7 quantile estimator uses m = 1 - p
, see quantile
.
For weighted data with normalized weights w = \{w_1, ... w_n\}
, where w_k > 0
and \sum_k w_k = 1
, let \{w_{(1)}, ... w_{(n)}\}
be the weights for each order statistic and W_{(k)} = \operatorname{Weight}[X_j \le X_{(k)}] = \sum_{j=1}^k w_{(j)}
the cumulative weight for each order statistic.
We can then first find the largest value l
such that the cumulative normalized weight W_{(l)} \leq p
, and replace pn
with l + (p - W_{(l)})/w_{(l+1)}
, where w_{(l+1)}
is the weight of the next observation. This gives:
j = \lfloor l + \frac{p - W_{(l)}}{w_{(l+1)}} + m \rfloor
\gamma = l + \frac{p - W_{(l)}}{w_{(l+1)}} + m - j
For a more detailed exposition see these excellent notes by Matthew Kay. See also the R implementation of weighted quantiles type 7 in the Examples below.
Value
A vector of quantiles. If names = TRUE
, fquantile
generates names as paste0(round(probs * 100, 1), "%")
(in C).
Note
The new weighted quantile algorithm from v2.1.0 does not skip zero weights anymore as this is technically very difficult (it is not clear if j
hits a zero weight element whether one should move forward or backward to find an alternative). Thus, all non-missing elements are considered and weights should be strictly positive.
Author(s)
Sebastian Krantz based on notes by Matthew Kay.
References
Hyndman, R. J. and Fan, Y. (1996) Sample quantiles in statistical packages, American Statistician 50, 361–365. doi:10.2307/2684934.
Wicklin, R. (2017) Sample quantiles: A comparison of 9 definitions; SAS Blog. https://blogs.sas.com/content/iml/2017/05/24/definitions-sample-quantiles.html
Wikipedia: https://en.wikipedia.org/wiki/Quantile#Estimating_quantiles_from_a_sample
Weighted Quantiles by Matthew Kay: https://htmlpreview.github.io/?https://github.com/mjskay/uncertainty-examples/blob/master/weighted-quantiles.html
See Also
fnth
, Fast Statistical Functions, Collapse Overview
Examples
## Basic range and quantiles
frange(mtcars$mpg)
fquantile(mtcars$mpg)
## Checking computational equivalence to stats::quantile()
w = alloc(abs(rnorm(1)), 32)
o = radixorder(mtcars$mpg)
for (i in 5:9) print(all_obj_equal(fquantile(mtcars$mpg, type = i),
fquantile(mtcars$mpg, type = i, w = w),
fquantile(mtcars$mpg, type = i, o = o),
fquantile(mtcars$mpg, type = i, w = w, o = o),
quantile(mtcars$mpg, type = i)))
## Demonstaration: weighted quantiles type 7 in R
wquantile7R <- function(x, w, probs = c(0.25, 0.5, 0.75), na.rm = TRUE, names = TRUE) {
if(na.rm && anyNA(x)) { # Removing missing values (only in x)
cc = whichNA(x, invert = TRUE) # The C code first calls radixorder(x), which places
x = x[cc]; w = w[cc] # missing values last, so removing = early termination
}
o = radixorder(x) # Ordering
wo = proportions(w[o])
Wo = cumsum(wo) # Cumulative sum
res = sapply(probs, function(p) {
l = which.max(Wo > p) - 1L # Lower order statistic
s = l + (p - Wo[l])/wo[l+1L] + 1 - p
j = floor(s)
gamma = s - j
(1 - gamma) * x[o[j]] + gamma * x[o[j+1L]] # Weighted quantile
})
if(names) names(res) = paste0(as.integer(probs * 100), "%")
res
} # Note: doesn't work for min and max.
wquantile7R(mtcars$mpg, mtcars$wt)
all.equal(wquantile7R(mtcars$mpg, mtcars$wt),
fquantile(mtcars$mpg, c(0.25, 0.5, 0.75), mtcars$wt))
## Efficient grouped quantile estimation: use .quantile for less call overhead
BY(mtcars$mpg, mtcars$cyl, .quantile, names = TRUE, expand.wide = TRUE)
BY(mtcars, mtcars$cyl, .quantile, names = TRUE)
mtcars |> fgroup_by(cyl) |> BY(.quantile)
## With weights
BY(mtcars$mpg, mtcars$cyl, .quantile, w = mtcars$wt, names = TRUE, expand.wide = TRUE)
BY(mtcars, mtcars$cyl, .quantile, w = mtcars$wt, names = TRUE)
mtcars |> fgroup_by(cyl) |> fselect(-wt) |> BY(.quantile, w = mtcars$wt)
mtcars |> fgroup_by(cyl) |> fsummarise(across(-wt, .quantile, w = wt))
Fast Renaming and Relabelling Objects
Description
frename
returns a renamed shallow-copy, setrename
renames objects by reference. These functions also work with objects other than data frames that have a 'names' attribute. relabel
and setrelabel
do that same for labels attached to data frame columns.
Usage
frename(.x, ..., cols = NULL, .nse = TRUE)
rnm(.x, ..., cols = NULL, .nse = TRUE) # Shorthand for frename()
setrename(.x, ..., cols = NULL, .nse = TRUE)
relabel(.x, ..., cols = NULL, attrn = "label")
setrelabel(.x, ..., cols = NULL, attrn = "label")
Arguments
.x |
for |
... |
either tagged vector expressions of the form |
cols |
If |
.nse |
logical. |
attrn |
character. Name of attribute to store labels or retrieve labels from. |
Value
.x
renamed / relabelled. setrename
and setrelabel
return .x
invisibly.
Note
Note that both relabel
and setrelabel
modify .x
by reference. This is because labels are attached to columns themselves, making it impossible to avoid permanent modification by taking a shallow copy of the encompassing list / data.frame. On the other hand frename
makes a shallow copy whereas setrename
also modifies by reference.
See Also
Data Frame Manipulation, Collapse Overview
Examples
## Using tagged expressions
head(frename(iris, Sepal.Length = SL, Sepal.Width = SW,
Petal.Length = PL, Petal.Width = PW))
head(frename(iris, Sepal.Length = "S L", Sepal.Width = "S W",
Petal.Length = "P L", Petal.Width = "P W"))
## Since v2.0.0 this is also supported
head(frename(iris, SL = Sepal.Length, SW = Sepal.Width,
PL = Petal.Length, PW = Petal.Width))
## Using a function
head(frename(iris, tolower))
head(frename(iris, tolower, cols = 1:2))
head(frename(iris, tolower, cols = is.numeric))
head(frename(iris, paste, "new", sep = "_", cols = 1:2))
## Using vectors of names and programming
newname = "sepal_length"
head(frename(iris, Sepal.Length = newname, .nse = FALSE))
newnames = c("sepal_length", "sepal_width")
head(frename(iris, newnames, cols = 1:2))
newnames = c(Sepal.Length = "sepal_length", Sepal.Width = "sepal_width")
head(frename(iris, newnames, .nse = FALSE))
# Since v2.0.0, this works as well
newnames = c(sepal_length = "Sepal.Length", sepal_width = "Sepal.Width")
head(frename(iris, newnames, .nse = FALSE))
## Renaming by reference
# setrename(iris, tolower)
# head(iris)
# rm(iris)
# etc...
## Relabelling (by reference)
# namlab(relabel(wlddev, PCGDP = "GDP per Capita", LIFEEX = "Life Expectancy"))
# namlab(relabel(wlddev, toupper))
Fast (Grouped, Weighted) Scaling and Centering of Matrix-like Objects
Description
fscale
is a generic function to efficiently standardize (scale and center) data. STD
is a wrapper around fscale
representing the 'standardization operator', with more options than fscale
when applied to matrices and data frames. Standardization can be simple or groupwise, ordinary or weighted. Arbitrary target means and standard deviations can be set, with special options for grouped scaling and centering. It is also possible to scale data without centering i.e. perform mean-preserving scaling.
Usage
fscale(x, ...)
STD(x, ...)
## Default S3 method:
fscale(x, g = NULL, w = NULL, na.rm = .op[["na.rm"]], mean = 0, sd = 1, ...)
## Default S3 method:
STD(x, g = NULL, w = NULL, na.rm = .op[["na.rm"]], mean = 0, sd = 1, ...)
## S3 method for class 'matrix'
fscale(x, g = NULL, w = NULL, na.rm = .op[["na.rm"]], mean = 0, sd = 1, ...)
## S3 method for class 'matrix'
STD(x, g = NULL, w = NULL, na.rm = .op[["na.rm"]], mean = 0, sd = 1,
stub = .op[["stub"]], ...)
## S3 method for class 'data.frame'
fscale(x, g = NULL, w = NULL, na.rm = .op[["na.rm"]], mean = 0, sd = 1, ...)
## S3 method for class 'data.frame'
STD(x, by = NULL, w = NULL, cols = is.numeric, na.rm = .op[["na.rm"]],
mean = 0, sd = 1, stub = .op[["stub"]], keep.by = TRUE, keep.w = TRUE, ...)
# Methods for indexed data / compatibility with plm:
## S3 method for class 'pseries'
fscale(x, effect = 1L, w = NULL, na.rm = .op[["na.rm"]], mean = 0, sd = 1, ...)
## S3 method for class 'pseries'
STD(x, effect = 1L, w = NULL, na.rm = .op[["na.rm"]], mean = 0, sd = 1, ...)
## S3 method for class 'pdata.frame'
fscale(x, effect = 1L, w = NULL, na.rm = .op[["na.rm"]], mean = 0, sd = 1, ...)
## S3 method for class 'pdata.frame'
STD(x, effect = 1L, w = NULL, cols = is.numeric, na.rm = .op[["na.rm"]],
mean = 0, sd = 1, stub = .op[["stub"]], keep.ids = TRUE, keep.w = TRUE, ...)
# Methods for grouped data frame / compatibility with dplyr:
## S3 method for class 'grouped_df'
fscale(x, w = NULL, na.rm = .op[["na.rm"]], mean = 0, sd = 1,
keep.group_vars = TRUE, keep.w = TRUE, ...)
## S3 method for class 'grouped_df'
STD(x, w = NULL, na.rm = .op[["na.rm"]], mean = 0, sd = 1,
stub = .op[["stub"]], keep.group_vars = TRUE, keep.w = TRUE, ...)
Arguments
x |
a numeric vector, matrix, data frame, 'indexed_series' ('pseries'), 'indexed_frame' ('pdata.frame') or grouped data frame ('grouped_df'). |
g |
a factor, |
by |
STD data.frame method: Same as |
cols |
STD (p)data.frame method: Select columns to scale using a function, column names, indices or a logical vector. Default: All numeric columns. Note: |
w |
a numeric vector of (non-negative) weights. |
na.rm |
logical. Skip missing values in |
effect |
plm methods: Select which panel identifier should be used as group-id. 1L takes the first variable in the index, 2L the second etc.. Index variables can also be called by name using a character string. More than one variable can be supplied. |
stub |
character. A prefix/stub to add to the names of all transformed columns. |
mean |
the mean to center on (default is 0). If |
sd |
the standard deviation to scale the data to (default is 1). A numeric value different from 0 (i.e. |
keep.by , keep.ids , keep.group_vars |
data.frame, pdata.frame and grouped_df methods: Logical. Retain grouping / panel-identifier columns in the output. For |
keep.w |
data.frame, pdata.frame and grouped_df methods: Logical. Retain column containing the weights in the output. Only works if |
... |
arguments to be passed to or from other methods. |
Details
If g = NULL
, fscale
by default (column-wise) subtracts the mean or weighted mean (if w
is supplied) from all data points in x
, and then divides this difference by the standard deviation or frequency-weighted standard deviation. The result is that all columns in x
will have a (weighted) mean 0 and (weighted) standard deviation 1. Alternatively, data can be scaled to have a mean of mean
and a standard deviation of sd
. If mean = FALSE
the data is only scaled (not centered) such that the mean of the data is preserved.
Means and standard deviations are computed using Welford's numerically stable online algorithm.
With groups supplied to g
, this standardizing becomes groupwise, so that in each group (in each column) the data points will have mean mean
and standard deviation sd
. Naturally if mean = FALSE
then each group is just scaled and the mean is preserved. For centering without scaling see fwithin
.
If na.rm = FALSE
and a NA
or NaN
is encountered, the mean and sd for that group will be NA
, and all data points belonging to that group will also be NA
in the output.
If na.rm = TRUE
, means and sd's are computed (column-wise) on the available data points, and also the weight vector can have missing values. In that case, the weighted mean an sd are computed on (column-wise) complete.cases(x, w)
, and x
is scaled using these statistics. Note that fscale
will not insert a missing value in x
if the weight for that value is missing, rather, that value will be scaled using a weighted mean and standard-deviated computed without itself! (The intention here is that a few (randomly) missing weights shouldn't break the computation when na.rm = TRUE
, but it is not meant for weight vectors with many missing values. If you don't like this behavior, you should prepare your data using x[is.na(w), ] <- NA
, or impute your weight vector for non-missing x
).
Special options for grouped scaling are mean = "overall.mean"
and sd = "within.sd"
. The former group-centers vectors on the overall mean of the data (see fwithin
for more details) and the latter scales the data in each group to have the within-group standard deviation (= the standard deviation of the group-centered data). Thus scaling a grouped vector with options mean = "overall.mean"
and sd = "within.sd"
amounts to removing all differences in the mean and standard deviations between these groups. In weighted computations, mean = "overall.mean"
will subtract weighted group-means from the data and add the overall weighted mean of the data, whereas sd = "within.sd"
will compute the weighted within- standard deviation and apply it to each group.
Value
x
standardized (mean = mean, standard deviation = sd), grouped by g/by
, weighted with w
. See Details.
Note
For centering without scaling see fwithin/W
. For simple not mean-preserving scaling use fsd(..., TRA = "/")
. To sweep pre-computed means and scale-factors out of data see TRA
.
See Also
fwithin
, fsd
, TRA
, Fast Statistical Functions, Data Transformations, Collapse Overview
Examples
## Simple Scaling & Centering / Standardizing
head(fscale(mtcars)) # Doesn't rename columns
head(STD(mtcars)) # By default adds a prefix
qsu(STD(mtcars)) # See that is works
qsu(STD(mtcars, mean = 5, sd = 3)) # Assigning a mean of 5 and a standard deviation of 3
qsu(STD(mtcars, mean = FALSE)) # No centering: Scaling is mean-preserving
## Panel Data
head(fscale(get_vars(wlddev,9:12), wlddev$iso3c)) # Standardizing 4 series within each country
head(STD(wlddev, ~iso3c, cols = 9:12)) # Same thing using STD, id's added
pwcor(fscale(get_vars(wlddev,9:12), wlddev$iso3c)) # Correlaing panel series after standardizing
fmean(get_vars(wlddev, 9:12)) # This calculates the overall means
fsd(fwithin(get_vars(wlddev, 9:12), wlddev$iso3c)) # This calculates the within standard deviations
head(qsu(fscale(get_vars(wlddev, 9:12), # This group-centers on the overall mean and
wlddev$iso3c, # group-scales to the within standard deviation
mean = "overall.mean", sd = "within.sd"), # -> data harmonized in the first 2 moments
by = wlddev$iso3c))
## Indexed data
wldi <- findex_by(wlddev, iso3c, year)
head(STD(wldi)) # Standardizing all numeric variables by country
head(STD(wldi, effect = 2L)) # Standardizing all numeric variables by year
## Weighted Standardizing
weights = abs(rnorm(nrow(wlddev)))
head(fscale(get_vars(wlddev,9:12), wlddev$iso3c, weights))
head(STD(wlddev, ~iso3c, weights, 9:12))
# Grouped data
wlddev |> fgroup_by(iso3c) |> fselect(PCGDP,LIFEEX) |> STD()
wlddev |> fgroup_by(iso3c) |> fselect(PCGDP,LIFEEX) |> STD(weights) # weighted standardizing
wlddev |> fgroup_by(iso3c) |> fselect(PCGDP,LIFEEX,POP) |> STD(POP) # weighting by POP ->
# ..keeps the weight column unless keep.w = FALSE
Fast Select, Replace or Add Data Frame Columns
Description
Efficiently select and replace (or add) a subset of columns from (to) a data frame. This can be done by data type, or using expressions, column names, indices, logical vectors, selector functions or regular expressions matching column names.
Usage
## Select and replace variables, analgous to dplyr::select but significantly faster
fselect(.x, ..., return = "data")
fselect(x, ...) <- value
slt(.x, ..., return = "data") # Shorthand for fselect
slt(x, ...) <- value # Shorthand for fselect<-
## Select and replace columns by names, indices, logical vectors,
## regular expressions or using functions to identify columns
get_vars(x, vars, return = "data", regex = FALSE, rename = FALSE, ...)
gv(x, vars, return = "data", ...) # Shorthand for get_vars
gvr(x, vars, return = "data", ...) # Shorthand for get_vars(..., regex = TRUE)
get_vars(x, vars, regex = FALSE, ...) <- value
gv(x, vars, ...) <- value # Shorthand for get_vars<-
gvr(x, vars, ...) <- value # Shorthand for get_vars<-(..., regex = TRUE)
## Add columns at any position within a data.frame
add_vars(x, ..., pos = "end")
add_vars(x, pos = "end") <- value
av(x, ..., pos = "end") # Shorthand for add_vars
av(x, pos = "end") <- value # Shorthand for add_vars<-
## Select and replace columns by data type
num_vars(x, return = "data")
num_vars(x) <- value
nv(x, return = "data") # Shorthand for num_vars
nv(x) <- value # Shorthand for num_vars<-
cat_vars(x, return = "data") # Categorical variables, see is_categorical
cat_vars(x) <- value
char_vars(x, return = "data")
char_vars(x) <- value
fact_vars(x, return = "data")
fact_vars(x) <- value
logi_vars(x, return = "data")
logi_vars(x) <- value
date_vars(x, return = "data") # See is_date
date_vars(x) <- value
Arguments
x , .x |
a data frame or list. | ||||||||||||||||||||||||||||||||||||
value |
a data frame or list of columns whose dimensions exactly match those of the extracted subset of | ||||||||||||||||||||||||||||||||||||
vars |
a vector of column names, indices (can be negative), a suitable logical vector, or a vector of regular expressions matching column names (if | ||||||||||||||||||||||||||||||||||||
return |
an integer or string specifying what the selector function should return. The options are:
Note: replacement functions only replace data, however column names are replaced together with the data (if available). | ||||||||||||||||||||||||||||||||||||
regex |
logical. | ||||||||||||||||||||||||||||||||||||
rename |
logical. If | ||||||||||||||||||||||||||||||||||||
pos |
the position where columns are added in the data frame. | ||||||||||||||||||||||||||||||||||||
... |
for |
Details
get_vars(<-)
is around 2x faster than `[.data.frame`
and 8x faster than `[<-.data.frame`
, so the common operation data[cols] <- someFUN(data[cols])
can be made 10x more efficient (abstracting from computations performed by someFUN
) using get_vars(data, cols) <- someFUN(get_vars(data, cols))
or the shorthand gv(data, cols) <- someFUN(gv(data, cols))
.
Similarly type-wise operations like data[sapply(data, is.numeric)]
or data[sapply(data, is.numeric)] <- value
are facilitated and more efficient using num_vars(data)
and num_vars(data) <- value
or the shortcuts nv
and nv<-
etc.
fselect
provides an efficient alternative to dplyr::select
, allowing the selection of variables based on expressions evaluated within the data frame, see Examples. It is about 100x faster than dplyr::select
but also more simple as it does not provide special methods (except for 'sf' and 'data.table' which are handled internally) .
Finally, add_vars(data1, data2, data3, ...)
is a lot faster than cbind(data1, data2, data3, ...)
, and preserves the attributes of data1
(i.e. it is like adding columns to data1
). The replacement function add_vars(data) <- someFUN(get_vars(data, cols))
efficiently appends data
with computed columns. The pos
argument allows adding columns at positions other than the end (right) of the data frame, see Examples. Note that add_vars
does not check duplicated column names or NULL
columns, and does not evaluate expressions in a data environment, or replicate length 1 inputs like cbind
. All of this is provided by ftransform
.
All functions introduced here perform their operations class-independent. They all basically work like this: (1) save the attributes of x
, (2) unclass x
, (3) subset, replace or append x
as a list, (4) modify the "names" component of the attributes of x
accordingly and (5) efficiently attach the attributes again to the result from step (3).
Thus they can freely be applied to data.table's, grouped tibbles, panel data frames and other classes and will return an object of exactly the same class and the same attributes.
Note
In many cases functions here only check the length of the first column, which is one of the reasons why they are so fast. When lists of unequal-length columns are offered as replacements this yields a malformed data frame (which will also print a warning in the console i.e. you will notice that).
See Also
fsubset
, ftransform
, rowbind
, Data Frame Manipulation, Collapse Overview
Examples
## Wold Development Data
head(fselect(wlddev, Country = country, Year = year, ODA)) # Fast dplyr-like selecting
head(fselect(wlddev, -country, -year, -PCGDP))
head(fselect(wlddev, country, year, PCGDP:ODA))
head(fselect(wlddev, -(PCGDP:ODA)))
fselect(wlddev, country, year, PCGDP:ODA) <- NULL # Efficient deleting
head(wlddev)
rm(wlddev)
head(num_vars(wlddev)) # Select numeric variables
head(cat_vars(wlddev)) # Select categorical (non-numeric) vars
head(get_vars(wlddev, is_categorical)) # Same thing
num_vars(wlddev) <- num_vars(wlddev) # Replace Numeric Variables by themselves
get_vars(wlddev,is.numeric) <- get_vars(wlddev,is.numeric) # Same thing
head(get_vars(wlddev, 9:12)) # Select columns 9 through 12, 2x faster
head(get_vars(wlddev, -(9:12))) # All except columns 9 through 12
head(get_vars(wlddev, c("PCGDP","LIFEEX","GINI","ODA"))) # Select using column names
head(get_vars(wlddev, "[[:upper:]]", regex = TRUE)) # Same thing: match upper-case var. names
head(gvr(wlddev, "[[:upper:]]")) # Same thing
get_vars(wlddev, 9:12) <- get_vars(wlddev, 9:12) # 9x faster wlddev[9:12] <- wlddev[9:12]
add_vars(wlddev) <- STD(gv(wlddev,9:12), wlddev$iso3c) # Add Standardized columns 9 through 12
head(wlddev) # gv and av are shortcuts
get_vars(wlddev, 14:17) <- NULL # Efficient Deleting added columns again
av(wlddev, "front") <- STD(gv(wlddev,9:12), wlddev$iso3c) # Again adding in Front
head(wlddev)
get_vars(wlddev, 1:4) <- NULL # Deleting
av(wlddev,c(10,12,14,16)) <- W(wlddev,~iso3c, cols = 9:12, # Adding next to original variables
keep.by = FALSE)
head(wlddev)
get_vars(wlddev, c(10,12,14,16)) <- NULL # Deleting
head(add_vars(wlddev, new = STD(wlddev$PCGDP))) # Can also add columns like this
head(add_vars(wlddev, STD(nv(wlddev)), new = W(wlddev$PCGDP))) # etc...
head(add_vars(mtcars, mtcars, mpg = mtcars$mpg, mtcars), 2) # add_vars does not check names!
Fast Slicing of Matrix-Like Objects
Description
A fast function to extract rows from a matrix or data frame-like object (by groups).
Usage
fslice(x, ..., n = 1, how = "first", order.by = NULL,
na.rm = .op[["na.rm"]], sort = FALSE, with.ties = FALSE)
fslicev(x, cols = NULL, n = 1, how = "first", order.by = NULL,
na.rm = .op[["na.rm"]], sort = FALSE, with.ties = FALSE, ...)
Arguments
x |
a matrix, data frame or list-like object, including 'grouped_df'. |
... |
for |
cols |
select columns to group by, using column names, indices, a logical vector or a selector function (e.g. |
n |
integer or proportion (if < 1). Number of rows to select from each group. If a proportion is provided, it is converted to the equivalent number of rows. |
how |
character. Method to select rows. One of:
|
order.by |
vector or column name to order by when |
na.rm |
logical. If |
sort |
logical. If |
with.ties |
logical. If |
Value
A subset of x
containing the selected rows.
See Also
fsubset
, fcount
, Data Frame Manipulation, Collapse Overview
Examples
# Basic usage
fslice(mtcars, n = 3) # First 3 rows
fslice(mtcars, n = 3, how = "last") # Last 3 rows
fslice(mtcars, n = 0.1) # First 10% of rows
# Using order.by
fslice(mtcars, n = 3, how = "min", order.by = mpg) # 3 cars with lowest mpg
fslice(mtcars, n = 3, how = "max", order.by = mpg) # 3 cars with highest mpg
# With grouping
mtcars |> fslice(cyl, n = 2) # First 2 cars per cylinder
mtcars |> fslice(cyl, n = 2, sort = TRUE) # with sorting (slightly less efficient)
mtcars |> fslice(cyl, n = 2, how = "min", order.by = mpg) # 2 lowest mpg cars per cylinder
# Using with.ties
mtcars |> fslice(cyl, n = 1, how = "min", order.by = mpg, with.ties = TRUE)
# With grouped data
mtcars |>
fgroup_by(cyl) |>
fslice(n = 2, how = "max", order.by = mpg) # 2 highest mpg cars per cylinder
Fast Subsetting Matrix-Like Objects
Description
fsubset
returns subsets of vectors, matrices or data frames which meet conditions. It is programmed very efficiently and uses C source code from the data.table package.
The methods also provide enhanced functionality compared to subset
. The function ss
provides an (internal generic) programmers alternative to [
that does not drop dimensions and is significantly faster than [
for data frames.
Usage
fsubset(.x, ...)
sbt(.x, ...) # Shorthand for fsubset
## Default S3 method:
fsubset(.x, subset, ...)
## S3 method for class 'matrix'
fsubset(.x, subset, ..., drop = FALSE)
## S3 method for class 'data.frame'
fsubset(.x, subset, ...)
# Methods for indexed data / compatibility with plm:
## S3 method for class 'pseries'
fsubset(.x, subset, ..., drop.index.levels = "id")
## S3 method for class 'pdata.frame'
fsubset(.x, subset, ..., drop.index.levels = "id")
# Fast subsetting (replaces `[` with drop = FALSE, programmers choice)
ss(x, i, j, check = TRUE)
Arguments
.x |
object to be subsetted according to different methods. |
x |
a data frame / list, matrix or vector/array (only |
subset |
logical expression indicating elements or rows to keep:
missing values are taken as |
... |
For the matrix or data frame method: multiple comma-separated expressions indicating columns to select. Otherwise: further arguments to be passed to or from other methods. |
drop |
passed on to |
i |
positive or negative row-indices or a logical vector to subset the rows of |
j |
a vector of column names, positive or negative indices or a suitable logical vector to subset the columns of |
check |
logical. |
drop.index.levels |
character. Either |
Details
fsubset
is a generic function, with methods supplied for vectors, matrices, and data
frames (including lists). It represents an improvement over subset
in terms of both speed and functionality. The function ss
is an improvement of [
to subset (vectors) matrices and data frames without dropping dimensions. It is significantly faster than [.data.frame
.
For ordinary vectors, subset
can be integer or logical, subsetting is done in C and more efficient than [
for large vectors.
For matrices the implementation is all base-R but slightly more efficient and more versatile than subset.matrix
. Thus it is possible to subset
matrix rows using logical or integer vectors, or character vectors matching rownames. The drop
argument is passed on to the [
method for matrices.
For both matrices and data frames, the ...
argument can be used to subset columns, and is evaluated in a non-standard way. Thus it can support vectors of column names, indices or logical vectors, but also multiple comma separated column names passed without quotes, each of which may also be replaced by a sequence of columns i.e. col1:coln
, and new column names may be assigned e.g. fsubset(data, col1 > 20, newname = col2, col3:col6)
(see examples).
For data frames, the subset
argument is also evaluated in a non-standard way. Thus next to vector of row-indices or logical vectors, it supports logical expressions of the form col2 > 5 & col2 < col3
etc. (see examples). The data frame method is implemented in C, hence it is significantly faster than subset.data.frame
. If fast data frame subsetting is required but no non-standard evaluation, the function ss
is slightly simpler and faster.
Factors may have empty levels after subsetting; unused levels are not automatically removed. See fdroplevels
to drop all unused levels from a data frame.
Value
An object similar to .x/x
containing just the selected elements (for
a vector), rows and columns (for a matrix or data frame).
Note
ss
offers no support for indexed data. Use fsubset
with indices instead.
No replacement method fsubset<-
or ss<-
is offered in collapse. For efficient subset replacement (without copying) use data.table::set
, which can also be used with data frames and tibbles. To search and replace certain elements without copying, and to efficiently copy elements / rows from an equally sized vector / data frame, see setv
.
For subsetting columns alone, please also see selecting and replacing columns.
Note that the use of %==%
can yield significant performance gains on large data.
See Also
fselect
,
get_vars
,
ftransform
,
Data Frame Manipulation, Collapse Overview
Examples
fsubset(airquality, Temp > 90, Ozone, Temp)
fsubset(airquality, Temp > 90, OZ = Ozone, Temp) # With renaming
fsubset(airquality, Day == 1, -Temp)
fsubset(airquality, Day == 1, -(Day:Temp))
fsubset(airquality, Day == 1, Ozone:Wind)
fsubset(airquality, Day == 1 & !is.na(Ozone), Ozone:Wind, Month)
fsubset(airquality, Day %==% 1, -Temp) # Faster for big data, as %==% directly returns indices
ss(airquality, 1:10, 2:3) # Significantly faster than airquality[1:10, 2:3]
fsubset(airquality, 1:10, 2:3) # This is possible but not advised
Fast (Grouped, Weighted) Sum for Matrix-Like Objects
Description
fsum
is a generic function that computes the (column-wise) sum of all values in x
, (optionally) grouped by g
and/or weighted by w
(e.g. to calculate survey totals). The TRA
argument can further be used to transform x
using its (grouped, weighted) sum.
Usage
fsum(x, ...)
## Default S3 method:
fsum(x, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, fill = FALSE, nthreads = .op[["nthreads"]], ...)
## S3 method for class 'matrix'
fsum(x, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, drop = TRUE, fill = FALSE, nthreads = .op[["nthreads"]], ...)
## S3 method for class 'data.frame'
fsum(x, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, drop = TRUE, fill = FALSE, nthreads = .op[["nthreads"]], ...)
## S3 method for class 'grouped_df'
fsum(x, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = FALSE, keep.group_vars = TRUE, keep.w = TRUE, stub = .op[["stub"]],
fill = FALSE, nthreads = .op[["nthreads"]], ...)
Arguments
x |
a numeric vector, matrix, data frame or grouped data frame (class 'grouped_df'). |
g |
a factor, |
w |
a numeric vector of (non-negative) weights, may contain missing values. |
TRA |
an integer or quoted operator indicating the transformation to perform:
0 - "na" | 1 - "fill" | 2 - "replace" | 3 - "-" | 4 - "-+" | 5 - "/" | 6 - "%" | 7 - "+" | 8 - "*" | 9 - "%%" | 10 - "-%%". See |
na.rm |
logical. Skip missing values in |
use.g.names |
logical. Make group-names and add to the result as names (default method) or row-names (matrix and data frame methods). No row-names are generated for data.table's. |
fill |
logical. Initialize result with |
nthreads |
integer. The number of threads to utilize. See Details. |
drop |
matrix and data.frame method: Logical. |
keep.group_vars |
grouped_df method: Logical. |
keep.w |
grouped_df method: Logical. Retain summed weighting variable after computation (if contained in |
stub |
character. If |
... |
arguments to be passed to or from other methods. If |
Details
The weighted sum (e.g. survey total) is computed as sum(x * w)
, but in one pass and about twice as efficient. If na.rm = TRUE
, missing values will be removed from both x
and w
i.e. utilizing only x[complete.cases(x,w)]
and w[complete.cases(x,w)]
.
This all seamlessly generalizes to grouped computations, which are performed in a single pass (without splitting the data) and are therefore extremely fast. See Benchmark and Examples below.
When applied to data frames with groups or drop = FALSE
, fsum
preserves all column attributes. The attributes of the data frame itself are also preserved.
Since v1.6.0 fsum
explicitly supports integers. Integers are summed using the long long type in C which is bounded at +-9,223,372,036,854,775,807 (so ~4.3 billion times greater than the minimum/maximum R integer bounded at +-2,147,483,647). If the value of the sum is outside +-2,147,483,647, a double containing the result is returned, otherwise an integer is returned. With groups, an integer results vector is initialized, and an integer overflow error is provided if the sum in any group is outside +-2,147,483,647. Data needs to be coerced to double beforehand in such cases.
Multithreading, added in v1.8.0, applies at the column-level unless g = NULL
and nthreads > NCOL(x)
. Parallelism over groups is not available because sums are computed simultaneously within each group. nthreads = 1L
uses a serial version of the code, not parallel code running on one thread. This serial code is always used with less than 100,000 obs (length(x) < 100000
for vectors and matrices), because parallel execution itself has some overhead.
Value
The (w
weighted) sum of x
, grouped by g
, or (if TRA
is used) x
transformed by its (grouped, weighted) sum.
See Also
fprod
, fmean
, Fast Statistical Functions, Collapse Overview
Examples
## default vector method
mpg <- mtcars$mpg
fsum(mpg) # Simple sum
fsum(mpg, w = mtcars$hp) # Weighted sum (total): Weighted by hp
fsum(mpg, TRA = "%") # Simple transformation: obtain percentages of mpg
fsum(mpg, mtcars$cyl) # Grouped sum
fsum(mpg, mtcars$cyl, mtcars$hp) # Weighted grouped sum (total)
fsum(mpg, mtcars[c(2,8:9)]) # More groups..
g <- GRP(mtcars, ~ cyl + vs + am) # Precomputing groups gives more speed !
fsum(mpg, g)
fmean(mpg, g) == fsum(mpg, g) / fnobs(mpg, g)
fsum(mpg, g, TRA = "%") # Percentages by group
## data.frame method
fsum(mtcars)
fsum(mtcars, TRA = "%")
fsum(mtcars, g)
fsum(mtcars, g, TRA = "%")
## matrix method
m <- qM(mtcars)
fsum(m)
fsum(m, TRA = "%")
fsum(m, g)
fsum(m, g, TRA = "%")
## method for grouped data frames - created with dplyr::group_by or fgroup_by
mtcars |> fgroup_by(cyl,vs,am) |> fsum(hp) # Weighted grouped sum (total)
mtcars |> fgroup_by(cyl,vs,am) |> fsum(TRA = "%")
mtcars |> fgroup_by(cyl,vs,am) |> fselect(mpg) |> fsum()
## This compares fsum with data.table and base::rowsum
# Starting with small data
library(data.table)
opts <- set_collapse(nthreads = getDTthreads())
mtcDT <- qDT(mtcars)
f <- qF(mtcars$cyl)
library(microbenchmark)
microbenchmark(mtcDT[, lapply(.SD, sum), by = f],
rowsum(mtcDT, f, reorder = FALSE),
fsum(mtcDT, f, na.rm = FALSE), unit = "relative")
# Now larger data
tdata <- qDT(replicate(100, rnorm(1e5), simplify = FALSE)) # 100 columns with 100.000 obs
f <- qF(sample.int(1e4, 1e5, TRUE)) # A factor with 10.000 groups
microbenchmark(tdata[, lapply(.SD, sum), by = f],
rowsum(tdata, f, reorder = FALSE),
fsum(tdata, f, na.rm = FALSE), unit = "relative")
# Reset options
set_collapse(opts)
Fast Summarise
Description
fsummarise
is a much faster version of dplyr::summarise
, when used together with the Fast Statistical Functions.
fsummarize
and fsummarise
are synonyms.
Usage
fsummarise(.data, ..., keep.group_vars = TRUE, .cols = NULL)
fsummarize(.data, ..., keep.group_vars = TRUE, .cols = NULL)
smr(.data, ..., keep.group_vars = TRUE, .cols = NULL) # Shorthand
Arguments
.data |
a (grouped) data frame or named list of columns. Grouped data can be created with |
... |
name-value pairs of summary functions, |
keep.group_vars |
logical. |
.cols |
for expressions involving |
Value
If .data
is grouped by fgroup_by
or dplyr::group_by
, the result is a data frame of the same class and attributes with rows reduced to the number of groups. If .data
is not grouped, the result is a data frame of the same class and attributes with 1 row.
Note
Since v1.7, fsummarise
is fully featured, allowing expressions using functions and columns of the data as well as external scalar values (just like dplyr::summarise
). NOTE however that once a Fast Statistical Function is used, the execution will be vectorized instead of split-apply-combine computing over groups. Please see the first Example.
See Also
across
, collap
, Data Frame Manipulation, Fast Statistical Functions, Collapse Overview
Examples
## Since v1.7, fsummarise supports arbitrary expressions, and expressions
## containing fast statistical functions receive vectorized execution:
# (a) This is an expression using base R functions which is executed by groups
mtcars |> fgroup_by(cyl) |> fsummarise(res = mean(mpg) + min(qsec))
# (b) Here, the use of fmean causes the whole expression to be executed
# in a vectorized way i.e. the expression is translated to something like
# fmean(mpg, g = cyl) + min(mpg) and executed, thus the result is different
# from (a), because the minimum is calculated over the entire sample
mtcars |> fgroup_by(cyl) |> fsummarise(mpg = fmean(mpg) + min(qsec))
# (c) For fully vectorized execution, use fmin. This yields the same as (a)
mtcars |> fgroup_by(cyl) |> fsummarise(mpg = fmean(mpg) + fmin(qsec))
# More advanced use: vectorized grouped regression slopes: mpg ~ carb
mtcars |>
fgroup_by(cyl) |>
fmutate(dm_carb = fwithin(carb)) |>
fsummarise(beta = fsum(mpg, dm_carb) %/=% fsum(dm_carb^2))
# In across() statements it is fine to mix different functions, each will
# be executed on its own terms (i.e. vectorized for fmean and standard for sum)
mtcars |> fgroup_by(cyl) |> fsummarise(across(mpg:hp, list(fmean, sum)))
# Note that this still detects fmean as a fast function, the names of the list
# are irrelevant, but the function name must be typed or passed as a character vector,
# Otherwise functions will be executed by groups e.g. function(x) fmean(x) won't vectorize
mtcars |> fgroup_by(cyl) |> fsummarise(across(mpg:hp, list(mu = fmean, sum = sum)))
# We can force none-vectorized execution by setting .apply = TRUE
mtcars |> fgroup_by(cyl) |> fsummarise(across(mpg:hp, list(mu = fmean, sum = sum), .apply = TRUE))
# Another argument of across(): Order the result first by function, then by column
mtcars |> fgroup_by(cyl) |>
fsummarise(across(mpg:hp, list(mu = fmean, sum = sum), .transpose = FALSE))
# Since v1.9.0, can also evaluate arbitrary expressions
mtcars |> fgroup_by(cyl, vs, am) |>
fsummarise(mctl(cor(cbind(mpg, wt, carb)), names = TRUE))
# This can also be achieved using across():
corfun <- function(x) mctl(cor(x), names = TRUE)
mtcars |> fgroup_by(cyl, vs, am) |>
fsummarise(across(c(mpg, wt, carb), corfun, .apply = FALSE))
#----------------------------------------------------------------------------
# Examples that also work for pre 1.7 versions
# Simple use
fsummarise(mtcars, mean_mpg = fmean(mpg),
sd_mpg = fsd(mpg))
# Using base functions (not a big difference without groups)
fsummarise(mtcars, mean_mpg = mean(mpg),
sd_mpg = sd(mpg))
# Grouped use
mtcars |> fgroup_by(cyl) |>
fsummarise(mean_mpg = fmean(mpg),
sd_mpg = fsd(mpg))
# This is still efficient but quite a bit slower on large data (many groups)
mtcars |> fgroup_by(cyl) |>
fsummarise(mean_mpg = mean(mpg),
sd_mpg = sd(mpg))
# Weighted aggregation
mtcars |> fgroup_by(cyl) |>
fsummarise(w_mean_mpg = fmean(mpg, wt),
w_sd_mpg = fsd(mpg, wt))
## Can also group with dplyr::group_by, but at a conversion cost, see ?GRP
library(dplyr)
mtcars |> group_by(cyl) |>
fsummarise(mean_mpg = fmean(mpg),
sd_mpg = fsd(mpg))
# Again less efficient...
mtcars |> group_by(cyl) |>
fsummarise(mean_mpg = mean(mpg),
sd_mpg = sd(mpg))
Fast Transform and Compute Columns on a Data Frame
Description
ftransform
is a much faster version of transform
for data frames. It returns the data frame with new columns computed and/or existing columns modified or deleted. settransform
does all of that by reference. fcompute
computes and returns new columns. These functions evaluate all arguments simultaneously, allow list-input (nested pipelines) and disregard grouped data.
Catering to the tidyverse user, v1.7.0 introduced fmutate
, providing familiar functionality i.e. arguments are evaluated sequentially, computation on grouped data is done by groups, and functions can be applied to multiple columns using across
. See also the Details.
Usage
# dplyr-style mutate (sequential evaluation + across() feature)
fmutate(.data, ..., .keep = "all", .cols = NULL)
mtt(.data, ..., .keep = "all", .cols = NULL) # Shorthand for fmutate
# Modify and return data frame
ftransform(.data, ...)
ftransformv(.data, vars, FUN, ..., apply = TRUE)
tfm(.data, ...) # Shorthand for ftransform
tfmv(.data, vars, FUN, ..., apply = TRUE)
# Modify data frame by reference
settransform(.data, ...)
settransformv(.data, ...) # Same arguments as ftransformv
settfm(.data, ...) # Shorthand for settransform
settfmv(.data, ...)
# Replace/add modified columns in/to a data frame
ftransform(.data) <- value
tfm(.data) <- value # Shorthand for ftransform<-
# Compute columns, returned as a new data frame
fcompute(.data, ..., keep = NULL)
fcomputev(.data, vars, FUN, ..., apply = TRUE, keep = NULL)
Arguments
.data |
a data frame or named list of columns. |
... |
further arguments of the form |
vars |
variables to be transformed by applying |
FUN |
a single function yielding a result of length |
apply |
logical. |
value |
a named list of replacements, it will be treated like an evaluated list of |
keep |
select columns to preserve using column names, indices or a function (e.g. |
.keep |
either one of |
.cols |
for expressions involving |
Details
The ...
arguments to ftransform
are tagged
vector expressions, which are evaluated in the data frame
.data
. The tags are matched against names(.data)
, and for
those that match, the values replace the corresponding variable in
.data
, whereas the others are appended to .data
. It is also possible to delete columns by assigning NULL
to them, i.e. ftransform(data, colk = NULL)
removes colk
from the data. Note that names(.data)
and the names of the ...
arguments are checked for uniqueness beforehand, yielding an error if this is not the case.
Since collapse v1.3.0, is is also possible to pass a single named list to ...
, i.e. ftransform(data, newdata)
. This list will be treated like a list of tagged vector expressions. Note the different behavior: ftransform(data, list(newcol = col1))
is the same as ftransform(data, newcol = col1)
, whereas ftransform(data, newcol = as.list(col1))
creates a list column. Something like ftransform(data, as.list(col1))
gives an error because the list is not named. See Examples.
The function ftransformv
added in v1.3.2 provides a fast replacement for the functions dplyr::mutate_at
and dplyr::mutate_if
(without the grouping feature) facilitating mutations of groups of columns (dplyr::mutate_all
is already accounted for by dapply
). See Examples.
The function settransform
does all of that by reference, but uses base-R's copy-on modify semantics, which is equivalent to replacing the data with <-
(thus it is still memory efficient but the data will have a different memory address afterwards).
The function fcompute(v)
works just like ftransform(v)
, but returns only the changed / computed columns without modifying or appending the data in .data
. See Examples.
The function fmutate
added in v1.7.0, provides functionality familiar from dplyr 1.0.0 and higher. It evaluates tagged vector expressions sequentially and does operations by groups on a grouped frame (thus it is slower than ftransform
if you have many tagged expressions or a grouped data frame). Note however that collapse does not depend on rlang, so things like lambda expressions are not available. Note also that fmutate
operates differently on grouped data whether you use .FAST_FUN
or base R functions / functions from other packages. With .FAST_FUN
(including .OPERATOR_FUN
, excluding fhdbetween
/ fhdwithin
/ HDW
/ HDB
), fmutate
performs an efficient vectorized execution, i.e. the grouping object from the grouped data frame is passed to the g
argument of these functions, and for .FAST_STAT_FUN
also TRA = "replace_fill"
is set (if not overwritten by the user), yielding internal grouped computation by these functions without the need for splitting the data by groups. For base R and other functions, fmutate
performs classical split-apply combine computing i.e. the relevant columns of the data are selected and split into groups, the expression is evaluated for each group, and the result is recombined and suitably expanded to match the original data frame. Note that it is not possible to mix vectorized and standard execution in the same expression!! Vectorized execution is performed if any .FAST_FUN
or .OPERATOR_FUN
is part of the expression, thus a code like mtcars |> gby(cyl) |> fmutate(new = fmin(mpg) / min(mpg))
will be expanded to something like mtcars |> gby(cyl) |> ftransform(new = fmin(mpg, g = GRP(.), TRA = "replace_fill") / min(mpg))
and then executed, i.e. fmin(mpg)
will be executed in a vectorized way, and min(mpg)
will not be executed by groups at all.
Value
The modified data frame .data
, or, for fcompute
, a new data frame with the columns computed on .data
. All attributes of .data
are preserved.
Note
ftransform
ignores grouped data. This is on purpose as it allows non-grouped transformation inside a pipeline on grouped data, and affords greater flexibility and performance in programming with the .FAST_FUN
. In particular, you can run a nested pipeline inside ftransform
, and decide which expressions should be grouped, and you can use the ad-hoc grouping functionality of the .FAST_FUN
, allowing operations where different groupings are applied simultaneously in an expression. See Examples or the answer provided here.
fmutate
on the other hand supports grouped operations just like dplyr::mutate
, but works in two different ways depending on whether you use .FAST_FUN
in an expression or other functions. See the Examples.
See Also
across
, fsummarise
, Data Frame Manipulation, Collapse Overview
Examples
## fmutate() examples ---------------------------------------------------------------
# Please note that expressions are vectorized whenever they contain 'ANY' fast function
mtcars |>
fgroup_by(cyl, vs, am) |>
fmutate(mean_mpg = fmean(mpg), # Vectorized
mean_mpg_base = mean(mpg), # Non-vectorized
mpg_cumpr = fcumsum(mpg) / fsum(mpg), # Vectorized
mpg_cumpr_base = cumsum(mpg) / sum(mpg), # Non-vectorized
mpg_cumpr_mixed = fcumsum(mpg) / sum(mpg)) # Vectorized: division by overall sum
# Using across: here fmean() gets vectorized across both groups and columns (requiring a single
# call to fmean.data.frame which goes to C), whereas weighted.mean needs to be called many times.
mtcars |> fgroup_by(cyl, vs, am) |>
fmutate(across(disp:qsec, list(mu = fmean, mu2 = weighted.mean), w = wt, .names = "flip"))
# Can do more complex things...
mtcars |> fgroup_by(cyl) |>
fmutate(res = resid(lm(mpg ~ carb + hp, weights = wt)))
# Since v1.9.0: supports arbitrary expressions returning suitable lists
## Not run:
mtcars |> fgroup_by(cyl) |>
fmutate(broom::augment(lm(mpg ~ carb + hp, weights = wt)))
# Same thing using across() (supported before 1.9.0)
modelfun <- function(data) broom::augment(lm(mpg ~ carb + hp, data, weights = wt))
mtcars |> fgroup_by(cyl) |>
fmutate(across(c(mpg, carb, hp, wt), modelfun, .apply = FALSE))
## End(Not run)
## ftransform() / fcompute() examples: ----------------------------------------------
## ftransform modifies and returns a data.frame
head(ftransform(airquality, Ozone = -Ozone))
head(ftransform(airquality, new = -Ozone, Temp = (Temp-32)/1.8))
head(ftransform(airquality, new = -Ozone, new2 = 1, Temp = NULL)) # Deleting Temp
head(ftransform(airquality, Ozone = NULL, Temp = NULL)) # Deleting columns
# With collapse's grouped and weighted functions, complex operations are done on the fly
head(ftransform(airquality, # Grouped operations by month:
Ozone_Month_median = fmedian(Ozone, Month, TRA = "fill"),
Ozone_Month_sd = fsd(Ozone, Month, TRA = "replace"),
Ozone_Month_centered = fwithin(Ozone, Month)))
# Grouping by month and above/below average temperature in each month
head(ftransform(airquality, Ozone_Month_high_median =
fmedian(Ozone, list(Month, Temp > fbetween(Temp, Month)), TRA = "fill")))
## ftransformv can be used to modify multiple columns using a function
head(ftransformv(airquality, 1:3, log))
head(`[<-`(airquality, 1:3, value = lapply(airquality[1:3], log))) # Same thing in base R
head(ftransformv(airquality, 1:3, log, apply = FALSE))
head(`[<-`(airquality, 1:3, value = log(airquality[1:3]))) # Same thing in base R
# Using apply = FALSE yields meaningful performance gains with collapse functions
# This calls fwithin.default, and repeates the grouping by month 3 times:
head(ftransformv(airquality, 1:3, fwithin, Month))
# This calls fwithin.data.frame, and only groups one time -> 5x faster!
head(ftransformv(airquality, 1:3, fwithin, Month, apply = FALSE))
# This also works for grouped and panel data frames (calling fwithin.grouped_df)
airquality |> fgroup_by(Month) |>
ftransformv(1:3, fwithin, apply = FALSE) |> head()
# But this gives the WRONG result (calling fwithin.default). Need option apply = FALSE!!
airquality |> fgroup_by(Month) |>
ftransformv(1:3, fwithin) |> head()
# For grouped modification of single columns in a grouped dataset, we can use GRP():
library(magrittr)
airquality |> fgroup_by(Month) %>%
ftransform(W_Ozone = fwithin(Ozone, GRP(.)), # Grouped centering
sd_Ozone_m = fsd(Ozone, GRP(.), TRA = "replace"), # In-Month standard deviation
sd_Ozone = fsd(Ozone, TRA = "replace"), # Overall standard deviation
sd_Ozone2 = fsd(Ozone, TRA = "fill"), # Same, overwriting NA's
sd_Ozone3 = fsd(Ozone)) |> head() # Same thing (calling alloc())
## For more complex mutations we can use ftransform with compound pipes
airquality |> fgroup_by(Month) %>%
ftransform(get_vars(., 1:3) |> fwithin() |> flag(0:2)) |> head()
airquality %>% ftransform(STD(., cols = 1:3) |> replace_na(0)) |> head()
# The list argument feature also allows flexible operations creating multiple new columns
airquality |> # The variance of Wind and Ozone, by month, weighted by temperature:
ftransform(fvar(list(Wind_var = Wind, Ozone_var = Ozone), Month, Temp, "replace")) |> head()
# Same as above using a grouped data frame (a bit more complex)
airquality |> fgroup_by(Month) %>%
ftransform(fselect(., Wind, Ozone) |> fvar(Temp, "replace") |> add_stub("_var", FALSE)) |>
fungroup() |> head()
# This performs 2 different multi-column grouped operations (need c() to make it one list)
ftransform(airquality, c(fmedian(list(Wind_Day_median = Wind,
Ozone_Day_median = Ozone), Day, TRA = "replace"),
fsd(list(Wind_Month_sd = Wind,
Ozone_Month_sd = Ozone), Month, TRA = "replace"))) |> head()
## settransform(v) works like ftransform(v) but modifies a data frame in the global environment..
settransform(airquality, Ratio = Ozone / Temp, Ozone = NULL, Temp = NULL)
head(airquality)
rm(airquality)
# Grouped and weighted centering
settransformv(airquality, 1:3, fwithin, Month, Temp, apply = FALSE)
head(airquality)
rm(airquality)
# Suitably lagged first-differences
settransform(airquality, get_vars(airquality, 1:3) |> fdiff() |> flag(0:2))
head(airquality)
rm(airquality)
# Same as above using magrittr::`%<>%`
airquality %<>% ftransform(get_vars(., 1:3) |> fdiff() |> flag(0:2))
head(airquality)
rm(airquality)
# It is also possible to achieve the same thing via a replacement method (if needed)
ftransform(airquality) <- get_vars(airquality, 1:3) |> fdiff() |> flag(0:2)
head(airquality)
rm(airquality)
## fcompute only returns the modified / computed columns
head(fcompute(airquality, Ozone = -Ozone))
head(fcompute(airquality, new = -Ozone, Temp = (Temp-32)/1.8))
head(fcompute(airquality, new = -Ozone, new2 = 1))
# Can preserve existing columns, computed ones are added to the right if names are different
head(fcompute(airquality, new = -Ozone, new2 = 1, keep = 1:3))
# If given same name as preserved columns, preserved columns are replaced in order...
head(fcompute(airquality, Ozone = -Ozone, new = 1, keep = 1:3))
# Same holds for fcomputev
head(fcomputev(iris, is.numeric, log)) # Same as:
iris |> get_vars(is.numeric) |> dapply(log) |> head()
head(fcomputev(iris, is.numeric, log, keep = "Species")) # Adds in front
head(fcomputev(iris, is.numeric, log, keep = names(iris))) # Preserve order
# Keep a subset of the data, add standardized columns
head(fcomputev(iris, 3:4, STD, apply = FALSE, keep = names(iris)[3:5]))
Fast Unique Elements / Rows
Description
funique
is an efficient alternative to unique
(or unique.data.table, kit::funique, dplyr::distinct
).
fnunique
is an alternative to NROW(unique(x))
(or data.table::uniqueN, kit::uniqLen, dplyr::n_distinct
).
fduplicated
is an alternative to duplicated
(or duplicated.data.table
, kit::fduplicated
).
The collapse versions are versatile and highly competitive.
any_duplicated(x)
is faster than any(fduplicated(x))
. Note that for atomic vectors, anyDuplicated
is currently more efficient if there are duplicates at the beginning of the vector.
Usage
funique(x, ...)
## Default S3 method:
funique(x, sort = FALSE, method = "auto", ...)
## S3 method for class 'data.frame'
funique(x, cols = NULL, sort = FALSE, method = "auto", ...)
## S3 method for class 'sf'
funique(x, cols = NULL, sort = FALSE, method = "auto", ...)
# Methods for indexed data / compatibility with plm:
## S3 method for class 'pseries'
funique(x, sort = FALSE, method = "auto", drop.index.levels = "id", ...)
## S3 method for class 'pdata.frame'
funique(x, cols = NULL, sort = FALSE, method = "auto", drop.index.levels = "id", ...)
fnunique(x) # Fast NROW(unique(x)), for vectors and lists
fduplicated(x, all = FALSE) # Fast duplicated(x), for vectors and lists
any_duplicated(x) # Simple logical TRUE|FALSE duplicates check
Arguments
x |
a atomic vector or data frame / list of equal-length columns. | |||||||||||||||||||||
sort |
logical. | |||||||||||||||||||||
method |
an integer or character string specifying the method of computation:
| |||||||||||||||||||||
cols |
compute unique rows according to a subset of columns. Columns can be selected using column names, indices, a logical vector or a selector function (e.g. | |||||||||||||||||||||
... |
arguments passed to | |||||||||||||||||||||
drop.index.levels |
character. Either | |||||||||||||||||||||
all |
logical. |
Details
If all values/rows are already unique, then x
is returned. Otherwise a copy of x
with duplicate rows removed is returned. See group
for some additional computational details.
The sf method simply ignores the geometry column when determining unique values.
Methods for indexed data also subset the index accordingly.
any_duplicated
is currently simply implemented as fnunique(x) < NROW(x)
, which means it does not have facilities to terminate early, and users are advised to use anyDuplicated
with atomic vectors if chances are high that there are duplicates at the beginning of the vector. With no duplicate values or data frames, any_duplicated
is considerably faster than anyDuplicated
.
Value
funique
returns x
with duplicate elements/rows removed, fnunique
returns an integer giving the number of unique values/rows, fduplicated
gives a logical vector with TRUE
indicating duplicated elements/rows.
Note
These functions treat lists like data frames, unlike unique
which has a list method to determine uniqueness of (non-atomic/heterogeneous) elements in a list.
No matrix method is provided. Please use the alternatives provided in package kit with matrices.
See Also
fndistinct
, group
, Fast Grouping and Ordering, Collapse Overview.
Examples
funique(mtcars$cyl)
funique(gv(mtcars, c(2,8,9)))
funique(mtcars, cols = c(2,8,9))
fnunique(gv(mtcars, c(2,8,9)))
fduplicated(gv(mtcars, c(2,8,9)))
fduplicated(gv(mtcars, c(2,8,9)), all = TRUE)
any_duplicated(gv(mtcars, c(2,8,9)))
any_duplicated(mtcars)
Fast (Grouped, Weighted) Variance and Standard Deviation for Matrix-Like Objects
Description
fvar
and fsd
are generic functions that compute the (column-wise) variance and standard deviation of x
, (optionally) grouped by g
and/or frequency-weighted by w
. The TRA
argument can further be used to transform x
using its (grouped, weighted) variance/sd.
Usage
fvar(x, ...)
fsd(x, ...)
## Default S3 method:
fvar(x, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, stable.algo = .op[["stable.algo"]], ...)
## Default S3 method:
fsd(x, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, stable.algo = .op[["stable.algo"]], ...)
## S3 method for class 'matrix'
fvar(x, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, drop = TRUE, stable.algo = .op[["stable.algo"]], ...)
## S3 method for class 'matrix'
fsd(x, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, drop = TRUE, stable.algo = .op[["stable.algo"]], ...)
## S3 method for class 'data.frame'
fvar(x, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, drop = TRUE, stable.algo = .op[["stable.algo"]], ...)
## S3 method for class 'data.frame'
fsd(x, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = TRUE, drop = TRUE, stable.algo = .op[["stable.algo"]], ...)
## S3 method for class 'grouped_df'
fvar(x, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = FALSE, keep.group_vars = TRUE, keep.w = TRUE,
stub = .op[["stub"]], stable.algo = .op[["stable.algo"]], ...)
## S3 method for class 'grouped_df'
fsd(x, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
use.g.names = FALSE, keep.group_vars = TRUE, keep.w = TRUE,
stub = .op[["stub"]], stable.algo = .op[["stable.algo"]], ...)
Arguments
x |
a numeric vector, matrix, data frame or grouped data frame (class 'grouped_df'). |
g |
a factor, |
w |
a numeric vector of (non-negative) weights, may contain missing values. |
TRA |
an integer or quoted operator indicating the transformation to perform:
0 - "na" | 1 - "fill" | 2 - "replace" | 3 - "-" | 4 - "-+" | 5 - "/" | 6 - "%" | 7 - "+" | 8 - "*" | 9 - "%%" | 10 - "-%%". See |
na.rm |
logical. Skip missing values in |
use.g.names |
logical. Make group-names and add to the result as names (default method) or row-names (matrix and data frame methods). No row-names are generated for data.table's. |
drop |
matrix and data.frame method: Logical. |
keep.group_vars |
grouped_df method: Logical. |
keep.w |
grouped_df method: Logical. Retain summed weighting variable after computation (if contained in |
stub |
character. If |
stable.algo |
logical. |
... |
arguments to be passed to or from other methods. If |
Details
Welford's online algorithm used by default to compute the variance is well described here (the section Weighted incremental algorithm also shows how the weighted variance is obtained by this algorithm).
If stable.algo = FALSE
, the variance is computed in one-pass as (sum(x^2)-n*mean(x)^2)/(n-1)
, where sum(x^2)
is the sum of squares from which the expected sum of squares n*mean(x)^2
is subtracted, normalized by n-1
(Bessel's correction). This is numerically unstable if sum(x^2)
and n*mean(x)^2
are large numbers very close together, which will be the case for large n
, large x
-values and small variances (catastrophic cancellation occurs, leading to a loss of numeric precision). Numeric precision is however still maximized through the internal use of long doubles in C++, and the fast algorithm can be up to 4-times faster compared to Welford's method.
The weighted variance is computed with frequency weights as (sum(x^2*w)-sum(w)*weighted.mean(x,w)^2)/(sum(w)-1)
. If na.rm = TRUE
, missing values will be removed from both x
and w
i.e. utilizing only x[complete.cases(x,w)]
and w[complete.cases(x,w)]
.
For further computational detail see fsum
.
Value
fvar
returns the (w
weighted) variance of x
, grouped by g
, or (if TRA
is used) x
transformed by its (grouped, weighted) variance. fsd
computes the standard deviation of x
in like manor.
References
Welford, B. P. (1962). Note on a method for calculating corrected sums of squares and products. Technometrics. 4 (3): 419-420. doi:10.2307/1266577.
See Also
Fast Statistical Functions, Collapse Overview
Examples
## default vector method
fvar(mtcars$mpg) # Simple variance (all examples also hold for fvar!)
fsd(mtcars$mpg) # Simple standard deviation
fsd(mtcars$mpg, w = mtcars$hp) # Weighted sd: Weighted by hp
fsd(mtcars$mpg, TRA = "/") # Simple transformation: scaling (See also ?fscale)
fsd(mtcars$mpg, mtcars$cyl) # Grouped sd
fsd(mtcars$mpg, mtcars$cyl, mtcars$hp) # Grouped weighted sd
fsd(mtcars$mpg, mtcars$cyl, TRA = "/") # Scaling by group
fsd(mtcars$mpg, mtcars$cyl, mtcars$hp, "/") # Group-scaling using weighted group sds
## data.frame method
fsd(iris) # This works, although 'Species' is a factor variable
fsd(mtcars, drop = FALSE) # This works, all columns are numeric variables
fsd(iris[-5], iris[5]) # By Species: iris[5] is still a list, and thus passed to GRP()
fsd(iris[-5], iris[[5]]) # Same thing much faster: fsd recognizes 'Species' is a factor
head(fsd(iris[-5], iris[[5]], TRA = "/")) # Data scaled by species (see also fscale)
## matrix method
m <- qM(mtcars)
fsd(m)
fsd(m, mtcars$cyl) # etc..
## method for grouped data frames - created with dplyr::group_by or fgroup_by
mtcars |> fgroup_by(cyl,vs,am) |> fsd()
mtcars |> fgroup_by(cyl,vs,am) |> fsd(keep.group_vars = FALSE) # Remove grouping columns
mtcars |> fgroup_by(cyl,vs,am) |> fsd(hp) # Weighted by hp
mtcars |> fgroup_by(cyl,vs,am) |> fsd(hp, "/") # Weighted scaling transformation
Find and Extract / Subset List Elements
Description
A suite of functions to subset or extract from (potentially complex) lists and list-like structures. Subsetting may occur according to certain data types, using identifier functions, element names or regular expressions to search the list for certain objects.
-
atomic_elem
andlist_elem
are non-recursive functions to extract and replace the atomic and sub-list elements at the top-level of the list tree. -
reg_elem
is the recursive equivalent ofatomic_elem
and returns the 'regular' part of the list - with atomic elements in the final nodes.irreg_elem
returns all the non-regular elements (i.e. call and terms objects, formulas, etc...). See Examples. -
get_elem
returns the part of the list responding to either an identifier function, regular expression, exact element names or indices applied to all final objects.has_elem
checks for the existence of an element and returnsTRUE
if a match is found. See Examples.
Usage
## Non-recursive (top-level) subsetting and replacing
atomic_elem(l, return = "sublist", keep.class = FALSE)
atomic_elem(l) <- value
list_elem(l, return = "sublist", keep.class = FALSE)
list_elem(l) <- value
## Recursive separation of regular (atomic) and irregular (non-atomic) parts
reg_elem(l, recursive = TRUE, keep.tree = FALSE, keep.class = FALSE)
irreg_elem(l, recursive = TRUE, keep.tree = FALSE, keep.class = FALSE)
## Extract elements / subset list tree
get_elem(l, elem, recursive = TRUE, DF.as.list = FALSE, keep.tree = FALSE,
keep.class = FALSE, regex = FALSE, invert = FALSE, ...)
## Check for the existence of elements
has_elem(l, elem, recursive = TRUE, DF.as.list = FALSE, regex = FALSE, ...)
Arguments
l |
a list. | ||||||||||||||||||||||||||||||||||||
value |
a list of the same length as the extracted subset of | ||||||||||||||||||||||||||||||||||||
elem |
a function returning | ||||||||||||||||||||||||||||||||||||
return |
an integer or string specifying what the selector function should return. The options are:
Note: replacement functions only replace data, names are replaced together with the data. | ||||||||||||||||||||||||||||||||||||
recursive |
logical. Should the list search be recursive (i.e. go though all the elements), or just at the top-level? | ||||||||||||||||||||||||||||||||||||
DF.as.list |
logical. | ||||||||||||||||||||||||||||||||||||
keep.tree |
logical. | ||||||||||||||||||||||||||||||||||||
keep.class |
logical. For list-based objects: should the class be retained? This only works if these objects have a | ||||||||||||||||||||||||||||||||||||
regex |
logical. Should regular expression search be used on the list names, or only exact matches? | ||||||||||||||||||||||||||||||||||||
invert |
logical. Invert search i.e. exclude matched elements from the list? | ||||||||||||||||||||||||||||||||||||
... |
further arguments to |
Details
For a lack of better terminology, collapse defines 'regular' R objects as objects that are either atomic or a list. reg_elem
with recursive = TRUE
extracts the subset of the list tree leading up to atomic elements in the final nodes. This part of the list tree is unlistable - calling is_unlistable(reg_elem(l))
will be TRUE
for all lists l
. Conversely, all elements left behind by reg_elem
will be picked up be irreg_elem
. Thus is_unlistable(irreg_elem(l))
is always FALSE
for lists with irregular elements (otherwise irreg_elem
returns an empty list).
If keep.tree = TRUE
, reg_elem
, irreg_elem
and get_elem
always return the entire list tree, but cut off all of the branches not leading to the desired result. If keep.tree = FALSE
, top-level parts of the tree are omitted as far as possible. For example in a nested list with three levels and one data-matrix in one of the final branches, get_elem(l, is.matrix, keep.tree = TRUE)
will return a list (lres
) of depth 3, from which the matrix can be accessed as lres[[1]][[1]][[1]]
. This however does not make much sense. get_elem(l, is.matrix, keep.tree = FALSE)
will therefore figgure out that it can drop the entire tree and return just the matrix. keep.tree = FALSE
makes additional optimizations if matching elements are at far-apart corners in a nested structure, by only preserving the hierarchy if elements are above each other on the same branch. Thus for a list l <- list(list(2,list("a",1)),list(1,list("b",2)))
calling get_elem(l, is.character)
will just return list("a","b")
.
See Also
List Processing, Collapse Overview
Examples
m <- qM(mtcars)
get_elem(list(list(list(m))), is.matrix)
get_elem(list(list(list(m))), is.matrix, keep.tree = TRUE)
l <- list(list(2,list("a",1)),list(1,list("b",2)))
has_elem(l, is.logical)
has_elem(l, is.numeric)
get_elem(l, is.character)
get_elem(l, is.character, keep.tree = TRUE)
l <- lm(mpg ~ cyl + vs, data = mtcars)
str(reg_elem(l))
str(irreg_elem(l))
get_elem(l, is.matrix)
get_elem(l, "residuals")
get_elem(l, "fit", regex = TRUE)
has_elem(l, "tol")
get_elem(l, "tol")
Groningen Growth and Development Centre 10-Sector Database
Description
The GGDC 10-Sector Database provides a long-run internationally comparable dataset on sectoral productivity performance in Africa, Asia, and Latin America. Variables covered in the data set are annual series of value added (in local currency), and persons employed for 10 broad sectors.
Usage
data("GGDC10S")
Format
A data frame with 5027 observations on the following 16 variables.
Country
char: Country (43 countries)
Regioncode
char: ISO3 Region code
Region
char: Region (6 World Regions)
Variable
char: Variable (Value Added or Employment)
Year
num: Year (67 Years, 1947-2013)
AGR
num: Agriculture
MIN
num: Mining
MAN
num: Manufacturing
PU
num: Utilities
CON
num: Construction
WRT
num: Trade, restaurants and hotels
TRA
num: Transport, storage and communication
FIRE
num: Finance, insurance, real estate and business services
GOV
num: Government services
OTH
num: Community, social and personal services
SUM
num: Summation of sector GDP
Source
https://www.rug.nl/ggdc/productivity/10-sector/
References
Timmer, M. P., de Vries, G. J., & de Vries, K. (2015). "Patterns of Structural Change in Developing Countries." . In J. Weiss, & M. Tribe (Eds.), Routledge Handbook of Industry and Development. (pp. 65-83). Routledge.
See Also
Examples
namlab(GGDC10S, class = TRUE)
# aperm(qsu(GGDC10S, ~ Variable, ~ Variable + Country, vlabels = TRUE))
library(ggplot2)
## World Regions Structural Change Plot
GGDC10S |>
fmutate(across(AGR:OTH, `*`, 1 / SUM),
Variable = ifelse(Variable == "VA","Value Added Share", "Employment Share")) |>
replace_outliers(0, NA, "min") |>
collap( ~ Variable + Region + Year, cols = 6:15) |> qDT() |>
pivot(1:3, names = list(variable = "Sector"), na.rm = TRUE) |>
ggplot(aes(x = Year, y = value, fill = Sector)) +
geom_area(position = "fill", alpha = 0.9) + labs(x = NULL, y = NULL) +
theme_linedraw(base_size = 14) +
facet_grid(Variable ~ Region, scales = "free_x") +
scale_fill_manual(values = sub("#00FF66", "#00CC66", rainbow(10))) +
scale_x_continuous(breaks = scales::pretty_breaks(n = 7), expand = c(0, 0))+
scale_y_continuous(breaks = scales::pretty_breaks(n = 10), expand = c(0, 0),
labels = scales::percent) +
theme(axis.text.x = element_text(angle = 315, hjust = 0, margin = ggplot2::margin(t = 0)),
strip.background = element_rect(colour = "grey30", fill = "grey30"))
# A function to plot the structural change of an arbitrary country
plotGGDC <- function(ctry) {
GGDC10S |>
fsubset(Country == ctry, Variable, Year, AGR:SUM) |>
fmutate(across(AGR:OTH, `*`, 1 / SUM), SUM = NULL,
Variable = ifelse(Variable == "VA","Value Added Share", "Employment Share")) |>
replace_outliers(0, NA, "min") |> qDT() |>
pivot(1:2, names = list(variable = "Sector"), na.rm = TRUE) |>
ggplot(aes(x = Year, y = value, fill = Sector)) +
geom_area(position = "fill", alpha = 0.9) + labs(x = NULL, y = NULL) +
theme_linedraw(base_size = 14) + facet_wrap( ~ Variable) +
scale_fill_manual(values = sub("#00FF66", "#00CC66", rainbow(10))) +
scale_x_continuous(breaks = scales::pretty_breaks(n = 7), expand = c(0, 0)) +
scale_y_continuous(breaks = scales::pretty_breaks(n = 10), expand = c(0, 0),
labels = scales::percent) +
theme(axis.text.x = element_text(angle = 315, hjust = 0, margin = ggplot2::margin(t = 0)),
strip.background = element_rect(colour = "grey20", fill = "grey20"),
strip.text = element_text(face = "bold"))
}
plotGGDC("BWA")
Fast Hash-Based Grouping
Description
group()
scans the rows of a data frame (or atomic vector / list of atomic vectors), assigning to each unique row an integer id - starting with 1 and proceeding in first-appearance order of the rows. The function is written in C and optimized for R's data structures. It is the workhorse behind functions like GRP
/ fgroup_by
, collap
, qF
, qG
, finteraction
and funique
, when called with argument sort = FALSE
.
Usage
group(..., starts = FALSE, group.sizes = FALSE)
groupv(x, starts = FALSE, group.sizes = FALSE)
Arguments
... |
comma separated atomic vectors to group. Also supports a single list of vectors for backward compatibility. |
x |
an atomic vector or data frame / list of equal-length atomic vectors. |
starts |
logical. If |
group.sizes |
logical. If |
Details
A data frame is grouped on a column-by-column basis, starting from the leftmost column. For each new column the grouping vector obtained after the previous column is also fed back into the hash function so that unique values are determined on a running basis. The algorithm terminates as soon as the number of unique rows reaches the size of the data frame. Missing values are also grouped just like any other values. Invoking arguments starts
and/or group.sizes
requires an additional pass through the final grouping vector.
Value
An object is of class 'qG' see qG
.
Author(s)
The Hash Function and inspiration was taken from the excellent kit package by Morgan Jacob, the algorithm was developed by Sebastian Krantz.
See Also
radixorder
, GRPid
, Fast Grouping and Ordering, Collapse Overview
Examples
# Let's replicate what funique does
g <- groupv(wlddev, starts = TRUE)
if(attr(g, "N.groups") == fnrow(wlddev)) wlddev else
ss(wlddev, attr(g, "starts"))
Generate Run-Length Type Group-Id
Description
groupid
is an enhanced version of data.table::rleid
for atomic vectors. It generates a run-length type group-id where consecutive identical values are assigned the same integer. It is a generalization as it can be applied to unordered vectors, generate group id's starting from an arbitrary value, and skip missing values.
Usage
groupid(x, o = NULL, start = 1L, na.skip = FALSE, check.o = TRUE)
Arguments
x |
an atomic vector of any type. Attributes are not considered. |
o |
an (optional) integer ordering vector specifying the order by which to pass through |
start |
integer. The starting value of the resulting group-id. Default is starting from 1. |
na.skip |
logical. Skip missing values i.e. if |
check.o |
logical. Programmers option: |
Value
An integer vector of class 'qG'. See qG
.
See Also
seqid
, timeid
, qG
, Fast Grouping and Ordering, Collapse Overview
Examples
groupid(airquality$Month)
groupid(airquality$Month, start = 0)
groupid(wlddev$country)[1:100]
## Same thing since country is alphabetically ordered: (groupid is faster..)
all.equal(groupid(wlddev$country), qG(wlddev$country, na.exclude = FALSE))
## When data is unordered, group-id can be generated through an ordering..
uo <- order(rnorm(fnrow(airquality)))
monthuo <- airquality$Month[uo]
o <- order(monthuo)
groupid(monthuo, o)
identical(groupid(monthuo, o)[o], unattrib(groupid(airquality$Month)))
Fast Grouping / collapse Grouping Objects
Description
GRP
performs fast, ordered and unordered, groupings of vectors and data frames (or lists of vectors) using radixorder
or group
. The output is a list-like object of class 'GRP' which can be printed, plotted and used as an efficient input to all of collapse's fast statistical and transformation functions and operators (see macros .FAST_FUN
and .OPERATOR_FUN
), as well as to collap
, BY
and TRA
.
fgroup_by
is similar to dplyr::group_by
but faster and class-agnostic. It creates a grouped data frame with a 'GRP' object attached - for fast dplyr-like programming with collapse's fast functions.
There are also several conversion methods to and from 'GRP' objects. Notable among these is GRP.grouped_df
, which returns a 'GRP' object from a grouped data frame created with dplyr::group_by
or fgroup_by
, and the duo GRP.factor
and as_factor_GRP
.
gsplit
efficiently splits a vector based on a 'GRP' object, and greorder
helps to recombine the results. These are the workhorses behind functions like BY
, and collap
, fsummarise
and fmutate
when evaluated with base R and user-defined functions.
Usage
GRP(X, ...)
## Default S3 method:
GRP(X, by = NULL, sort = .op[["sort"]], decreasing = FALSE, na.last = TRUE,
return.groups = TRUE, return.order = sort, method = "auto",
call = TRUE, ...)
## S3 method for class 'factor'
GRP(X, ..., group.sizes = TRUE, drop = FALSE, return.groups = TRUE,
call = TRUE)
## S3 method for class 'qG'
GRP(X, ..., group.sizes = TRUE, return.groups = TRUE, call = TRUE)
## S3 method for class 'pseries'
GRP(X, effect = 1L, ..., group.sizes = TRUE, return.groups = TRUE,
call = TRUE)
## S3 method for class 'pdata.frame'
GRP(X, effect = 1L, ..., group.sizes = TRUE, return.groups = TRUE,
call = TRUE)
## S3 method for class 'grouped_df'
GRP(X, ..., return.groups = TRUE, call = TRUE)
# Identify 'GRP' objects
is_GRP(x)
## S3 method for class 'GRP'
length(x) # Length of data being grouped
GRPN(x, expand = TRUE, ...) # Group sizes (default: expanded to match data length)
GRPid(x, sort = FALSE, ...) # Group id (data length, same as GRP(.)$group.id)
GRPnames(x, force.char = TRUE, sep = ".") # Group names
as_factor_GRP(x, ordered = FALSE, sep = ".") # 'GRP'-object to (ordered) factor conversion
# Efficiently split a vector using a 'GRP' object
gsplit(x, g, use.g.names = FALSE, ...)
# Efficiently reorder y = unlist(gsplit(x, g)) such that identical(greorder(y, g), x)
greorder(x, g, ...)
# Fast, class-agnostic pendant to dplyr::group_by for use with fast functions, see details
fgroup_by(.X, ..., sort = .op[["sort"]], decreasing = FALSE, na.last = TRUE,
return.groups = TRUE, return.order = sort, method = "auto")
# Standard-evaluation analogue (slim wrapper around GRP.default(), for programming)
group_by_vars(X, by = NULL, ...)
# Shorthand for fgroup_by
gby(.X, ..., sort = .op[["sort"]], decreasing = FALSE, na.last = TRUE,
return.groups = TRUE, return.order = sort, method = "auto")
# Get grouping columns from a grouped data frame created with dplyr::group_by or fgroup_by
fgroup_vars(X, return = "data")
# Ungroup grouped data frame created with dplyr::group_by or fgroup_by
fungroup(X, ...)
## S3 method for class 'GRP'
print(x, n = 6, ...)
## S3 method for class 'GRP'
plot(x, breaks = "auto", type = "l", horizontal = FALSE, ...)
Arguments
X |
a vector, list of columns or data frame (default method), or a suitable object (conversion / extractor methods). | |||||||||||||||||||||||||||||||||||||||||
.X |
a data frame or list. | |||||||||||||||||||||||||||||||||||||||||
x , g |
a 'GRP' object. For | |||||||||||||||||||||||||||||||||||||||||
by |
if | |||||||||||||||||||||||||||||||||||||||||
sort |
logical. If | |||||||||||||||||||||||||||||||||||||||||
ordered |
logical. | |||||||||||||||||||||||||||||||||||||||||
decreasing |
logical. Should the sort order be increasing or decreasing? Can be a vector of length equal to the number of arguments in | |||||||||||||||||||||||||||||||||||||||||
na.last |
logical. If missing values are encountered in grouping vector/columns, assign them to the last group (argument passed to | |||||||||||||||||||||||||||||||||||||||||
return.groups |
logical. Include the unique groups in the created GRP object. | |||||||||||||||||||||||||||||||||||||||||
return.order |
logical. If | |||||||||||||||||||||||||||||||||||||||||
method |
character. The algorithm to use for grouping: either | |||||||||||||||||||||||||||||||||||||||||
group.sizes |
logical. | |||||||||||||||||||||||||||||||||||||||||
drop |
logical. | |||||||||||||||||||||||||||||||||||||||||
call |
logical. | |||||||||||||||||||||||||||||||||||||||||
expand |
logical. | |||||||||||||||||||||||||||||||||||||||||
force.char |
logical. Always output group names as character vector, even if a single numeric vector was passed to | |||||||||||||||||||||||||||||||||||||||||
sep |
character. The separator passed to | |||||||||||||||||||||||||||||||||||||||||
effect |
plm / indexed data methods: Select which panel identifier should be used as grouping variable. 1L takes the first variable in the index, 2L the second etc., identifiers can also be passed as a character string. More than one variable can be supplied. | |||||||||||||||||||||||||||||||||||||||||
return |
an integer or string specifying what
| |||||||||||||||||||||||||||||||||||||||||
use.g.names |
logical. | |||||||||||||||||||||||||||||||||||||||||
n |
integer. Number of groups to print out. | |||||||||||||||||||||||||||||||||||||||||
breaks |
integer. Number of breaks in the histogram of group-sizes. | |||||||||||||||||||||||||||||||||||||||||
type |
linetype for plot. | |||||||||||||||||||||||||||||||||||||||||
horizontal |
logical. | |||||||||||||||||||||||||||||||||||||||||
... |
for |
Details
GRP
is a central function in the collapse package because it provides, in the form of integer vectors, some key pieces of information to efficiently perform grouped operations at the C/C++
level.
Most statistical function require information about (1) the number of groups (2) an integer group-id indicating which values / rows belong to which group and (3) information about the size of each group. Provided with these, collapse's Fast Statistical Functions pre-allocate intermediate and result vectors of the right sizes and (in most cases) perform grouped statistical computations in a single pass through the data.
The sorting functionality of GRP.default
lets groups receive different integer-id's depending on whether the groups are sorted sort = TRUE
(FALSE
gives first-appearance order), and in which order (argument decreasing
). This affects the order of values/rows in the output whenever an aggregation is performed.
Other elements in the object provide information about whether the data was sorted by the variables defining the grouping (6) and the ordering vector (7). These also feed into optimizations in gsplit/greorder
that benefit the execution of base R functions across groups.
Complimentary to GRP
, the function fgroup_by
is a significantly faster and class-agnostic alternative to dplyr::group_by
for programming with collapse. It creates a grouped data frame with a 'GRP' object attached in a "groups"
attribute. This data frame has classes 'GRP_df', ..., 'grouped_df' and 'data.frame', where ... stands for any other classes the input frame inherits such as 'data.table', 'sf', 'tbl_df', 'indexed_frame' etc.. collapse functions with a 'grouped_df' method respond to 'grouped_df' objects created with either fgroup_by
or dplyr::group_by
. The method GRP.grouped_df
takes the "groups"
attribute from a 'grouped_df' and converts it to a 'GRP' object if created with dplyr::group_by
.
The 'GRP_df' class in front responds to print.GRP_df
which first calls print(fungroup(x), ...)
and prints one line below the object indicating the grouping variables, followed, in square brackets, by some statistics on the group sizes: [N | Mean (SD) Min-Max]
. The mean is rounded to a full number and the standard deviation (SD) to one digit. Minimum and maximum are only displayed if the SD is non-zero. There also exist a method [.GRP_df
which calls NextMethod
but makes sure that the grouping information is preserved or dropped depending on the dimensions of the result (subsetting rows or aggregation with data.table drops the grouping object).
GRP.default
supports vector and list input and will also return 'GRP' objects if passed. There is also a hidden method GRP.GRP
which simply returns grouping objects (no re-grouping functionality is offered).
Apart from GRP.grouped_df
there are several further conversion methods:
The conversion of factors to 'GRP' objects by GRP.factor
involves obtaining the number of groups calling ng <- fnlevels(f)
and then computing the count of each level using tabulate(f, ng)
. The integer group-id (2) is already given by the factor itself after removing the levels and class attributes and replacing any missing values with ng + 1L
. The levels are put in a list and moved to position (4) in the 'GRP' object, which is reserved for the unique groups. Finally, a sortedness check !is.unsorted(id)
is run on the group-id to check if the data represented by the factor was sorted (6). GRP.qG
works similarly (see also qG
), and the 'pseries' and 'pdata.frame' methods simply group one or more factors in the index (selected using the effect
argument) .
Creating a factor from a 'GRP' object using as_factor_GRP
does not involve any computations, but may involve interacting multiple grouping columns using the paste
function to produce unique factor levels.
Value
A list-like object of class ‘GRP’ containing information about the number of groups, the observations (rows) belonging to each group, the size of each group, the unique group names / definitions, whether the groups are ordered and data grouped is sorted or not, the ordering vector used to perform the ordering and the group start positions. The object is structured as follows:
List-index | Element-name | Content type | Content description | |||
[[1]] | N.groups | integer(1) | Number of Groups | |||
[[2]] | group.id | integer(NROW(X)) | An integer group-identifier | |||
[[3]] | group.sizes | integer(N.groups) | Vector of group sizes | |||
[[4]] | groups | unique(X) or NULL | Unique groups (same format as input, except for fgroup_by which uses a plain list, sorted if sort = TRUE ), or NULL if return.groups = FALSE |
|||
[[5]] | group.vars | character | The names of the grouping variables | |||
[[6]] | ordered | logical(2) | [1] Whether the groups are ordered: equal to the sort argument in the default method, or TRUE if converted objects inherit a class "ordered" and NA otherwise, [2] Whether the data (X ) is already sorted: the result of !is.unsorted(group.id) . If sort = FALSE (default method) the second entry is NA . |
|||
[[7]] | order | integer(NROW(X)) or NULL | Ordering vector from radixorder (with "starts" attribute), or NULL if return.order = FALSE |
|||
[[8]] | group.starts | integer(N.groups) or NULL | The first-occurrence positions/rows of the groups. Useful e.g. with ffirst(x, g, na.rm = FALSE) . NULL if return.groups = FALSE . |
|||
[[9]] | call | match.call() or NULL | The GRP() call, obtained from match.call() , or NULL if call = FALSE
|
See Also
radixorder
, group
, qF
, Fast Grouping and Ordering, Collapse Overview
Examples
## default method
GRP(mtcars$cyl)
GRP(mtcars, ~ cyl + vs + am) # Or GRP(mtcars, c("cyl","vs","am")) or GRP(mtcars, c(2,8:9))
g <- GRP(mtcars, ~ cyl + vs + am) # Saving the object
print(g) # Printing it
plot(g) # Plotting it
GRPnames(g) # Retain group names
GRPid(g) # Retain group id (same as g$group.id), useful inside fmutate()
fsum(mtcars, g) # Compute the sum of mtcars, grouped by variables cyl, vs and am
gsplit(mtcars$mpg, g) # Use the object to split a vector
gsplit(NULL, g) # The indices of the groups
identical(mtcars$mpg, # greorder and unlist undo the effect of gsplit
greorder(unlist(gsplit(mtcars$mpg, g)), g))
## Convert factor to GRP object and vice-versa
GRP(iris$Species)
as_factor_GRP(g)
## dplyr integration
library(dplyr)
mtcars |> group_by(cyl,vs,am) |> GRP() # Get GRP object from a dplyr grouped tibble
mtcars |> group_by(cyl,vs,am) |> fmean() # Grouped mean using dplyr grouping
mtcars |> fgroup_by(cyl,vs,am) |> fmean() # Faster alternative with collapse grouping
mtcars |> fgroup_by(cyl,vs,am) # Print method for grouped data frame
## Adding a column of group sizes.
mtcars |> fgroup_by(cyl,vs,am) |> fsummarise(Sizes = GRPN())
# Note: can also set_collapse(mask = "n") to use n() instead, see help("collapse-options")
# Other usage modes:
mtcars |> fgroup_by(cyl,vs,am) |> fmutate(Sizes = GRPN())
mtcars |> fmutate(Sizes = GRPN(list(cyl,vs,am))) # Same thing, slightly more efficient
## Various options for programming and interactive use
fgroup_by(GGDC10S, Variable, Decade = floor(Year / 10) * 10) |> head(3)
fgroup_by(GGDC10S, 1:3, 5) |> head(3)
fgroup_by(GGDC10S, c("Variable", "Country")) |> head(3)
fgroup_by(GGDC10S, is.character) |> head(3)
fgroup_by(GGDC10S, Country:Variable, Year) |> head(3)
fgroup_by(GGDC10S, Country:Region, Var = Variable, Year) |> head(3)
## Note that you can create a grouped data frame without materializing the unique grouping columns
fgroup_by(GGDC10S, Variable, Country, return.groups = FALSE) |> fmutate(across(AGR:SUM, fscale))
fgroup_by(GGDC10S, Variable, Country, return.groups = FALSE) |> fselect(AGR:SUM) |> fmean()
## Note also that setting sort = FALSE on unsorted data can be much faster... if not required...
library(microbenchmark)
microbenchmark(gby(GGDC10S, Variable, Country), gby(GGDC10S, Variable, Country, sort = FALSE))
Fast Indexed Time Series and Panels
Description
A fast and flexible indexed time series and panel data class that inherits from plm's 'pseries' and 'pdata.frame', but is more rigorous, natively handles irregularity, can be superimposed on any data.frame/list, matrix or vector, and supports ad-hoc computations inside data masking functions and model formulas.
Usage
## Create an 'indexed_frame' containing 'indexed_series'
findex_by(.X, ..., single = "auto", interact.ids = TRUE)
iby(.X, ..., single = "auto", interact.ids = TRUE) # Shorthand
## Retrieve the index ('index_df') from an 'indexed_frame' or 'indexed_series'
findex(x)
ix(x) # Shorthand
## Remove index from 'indexed_frame' or 'indexed_series' (i.e. get .X back)
unindex(x)
## Reindex 'indexed_frame' or 'indexed_series' (or index vectors / matrices)
reindex(x, index = findex(x), single = "auto")
## Check if 'indexed_frame', 'indexed_series', index or time vector is irregular
is_irregular(x, any_id = TRUE)
## Convert 'indexed_frame'/'indexed_series' to normal 'pdata.frame'/'pseries'
to_plm(x, row.names = FALSE)
# Subsetting & replacement methods: [(<-) methods call NextMethod().
# Also methods for fsubset, funique and roworder(v), na_omit (internal).
## S3 method for class 'indexed_series'
x[i, ..., drop.index.levels = "id"]
## S3 method for class 'indexed_frame'
x[i, ..., drop.index.levels = "id"]
## S3 replacement method for class 'indexed_frame'
x[i, j] <- value
## S3 method for class 'indexed_frame'
x$name
## S3 replacement method for class 'indexed_frame'
x$name <- value
## S3 method for class 'indexed_frame'
x[[i, ...]]
## S3 replacement method for class 'indexed_frame'
x[[i]] <- value
# Index subsetting and printing: optimized using ss()
## S3 method for class 'index_df'
x[i, j, drop = FALSE, drop.index.levels = "id"]
## S3 method for class 'index_df'
print(x, topn = 5, ...)
Arguments
.X |
a data frame or list-like object of equal-length columns. |
x |
an 'indexed_frame' or 'indexed_series'. |
... |
for |
single |
character. If only one indexing variable is supplied, this can be declared as |
interact.ids |
logical. If |
index |
and index (inherits 'pindex'), or an atomic vector or list of factors matching the data dimensions. Atomic vectors or lists with 1 factor will must be declared, see |
drop.index.levels |
character. Subset methods also subset the index (= a data.frame of factors), and this argument regulates which factor levels should be dropped: either |
any_id |
logical. For panel series: |
row.names |
logical. |
topn |
integer. The number of first and last rows to print. |
i , j , name , drop , value |
Arguments passed to |
Details
The first thing to note about these new 'indexed_frame', 'indexed_series' and 'index_df' classes is that they inherit plm's 'pdata.frame', 'pseries' and 'pindex' classes, respectively. They add, improve, and, in some cases, remove functionality offered by plm, with the aim of striking an optimal balance of flexibility and performance. The inheritance means that all 'pseries' and 'pdata.frame' methods in collapse, and also some methods in plm, apply to them. Where compatibility or performance considerations allow for it, collapse will continue to create methods for plm's classes instead of the new classes.
The use of these classes does not require much knowledge of plm, but as a basic background: A 'pdata.frame' is a data.frame with an index attribute: a data.frame of 2 factors identifying the individual and time-dimension of the data. When pulling a variable out of the pdata.frame using a method like $.pdata.frame
or [[.pdata.frame
(defined in plm), a 'pseries' is created by transferring the index attribute to the vector. Methods defined for functions like lag
/ flag
etc. use the index for correct computations on this panel data, also inside plm's estimation commands.
Main Features and Enhancements
The 'indexed_frame' and 'indexed_series' classes extend and enhance 'pdata.frame' and 'pseries' in a number of critical dimensions. Most notably they:
Support both time series and panel data, by allowing indexation of data with one, two or more variables.
Are class-agnostic: any data.frame/list (such as data.table, tibble, tsibble, sf etc.) can become an 'indexed_frame' and continue to function as usual for most use cases. Similarly, any vector or matrix (such as ts, mts, xts) can become an 'indexed_series'. This also allows for transient workflows e.g.
some_df |> findex_by(...) |> 'do something using collapse functions' |> unindex() |> 'continue working with some_df'
.Have a comprehensive and efficient set of methods for subsetting and manipulation, including methods for
fsubset
,funique
,roworder(v)
(internal) andna_omit
(internal,na.omit
also works but is slower). It is also possible to group indexed data withfgroup_by
for transformations e.g. usingfmutate
, but aggregation requiresunindex()
ing.-
Natively handle irregularity: time objects (such as 'Date', 'POSIXct' etc.) are passed to
timeid
, which efficiently determines the temporal structure by finding the greatest common divisor (GCD), and creates a time-factor with levels corresponding to a complete time-sequence. The latter is also done with plain numeric vectors, which are assumed to represent unit time steps (GDC = 1) and coerced to integer (but can also be passed throughtimeid
if non-unitary). Character time variables are converted to factor, which might also capture irregular gaps in panel series. Using this time-factor in the index, collapse's functions efficiently perform correct computations on irregular sequences and panels without the need to 'expand' the data / fill gaps.is_irregular
can be used to check for irregularity in the entire sequence / panel or separately for each individual in panel data. Support computations inside data-masking functions and formulas, by virtue of "deep indexation": Each variable inside an 'indexed_frame' is an 'indexed_series' which contains in its 'index_df' attribute an external pointer to the 'index_df' attribute of the frame. Functions operating on 'indexed_series' stored inside the frame (such as
with(data, flag(column))
) can fetch the index from this pointer. This allows worry-free application inside arbitrary data masking environments (with
,%$%
,attach
, etc..) and estimation commands (glm
,feols
,lmrob
etc..) without duplication of the index in memory. A limitation is that external pointers are only valid during the present R session, thus when saving an 'indexed_frame' and loading it again, you need to calldata = reindex(data)
before computing on it.
Indexed series also have simple Math and Ops methods, which apply the operation to the unindexed series and shallow copy the attributes of the original object to the result, unless the result it is a logical vector (from operations like !
, ==
etc.). For Ops methods, if the LHS object is an 'indexed_series' its attributes are taken, otherwise the attributes of the RHS object are taken.
Limits to plm Compatibility
In contrast to 'pseries' and 'pdata.frame's, 'indexed_series' and 'indexed_frames' do not have descriptive "names" or "row.names" attributes attached to them, mainly for efficiency reasons.
Furthermore, the index is stored in an attribute named 'index_df' (same as the class name), not 'index' as in plm, mainly to make these classes work with data.table, tsibble and xts, which also utilize 'index' attributes. This for the most part poses no problem to plm compatibility because plm source code fetches the index using attr(x, "index")
, and attr
by default performs partial matching.
A much greater obstacle in working with plm is that some internal plm code is hinged on there being no [.pseries
method, and the existence of [.indexed_series
limits the use of these classes in most plm estimation commands. Therefore the to_plm
function is provided to efficiently coerce the classes to ordinary plm objects before estimation. See Examples.
Overall these classes don't really benefit plm, especially given that collapse's plm methods also support native plm objects. However, they work very well inside other models and software, including stats models, fixest / lfe, and a whole bunch of time series and ML models. See Examples.
Performance Considerations
When indexing long time-series or panels with a single variable, setting single = "id" or "time"
avoids a potentially expensive call to anyDuplicated
. Note also that when panel-data are regular and sorted, omitting the time variable in the index can bring >= 2x performance improvements in operations like lagging and differencing (alternatively use shift = "row"
argument to flag
, fdiff
etc.) .
When dealing with long Date or POSIXct time sequences, it may also be that the internal processing by timeid
is slow simply because calling strftime
on these sequences to create factor levels is slow. In this case you may choose to generate an index factor with integer levels by passing timeid(t)
to findex_by
or reindex
(which by default generates a 'qG' object which is internally converted to factor using as_factor_qG
. The lazy evaluation of expressions like as.character(seq_len(nlev))
in modern R makes this extremely efficient).
With multiple id variables e.g. findex_by(data, id1, id2, id3, time)
, the default call to finteraction()
can be expensive because of pasting the levels together. In this case, users may gain performance by manually invoking finteraction()
(or its shorthand itn()
) with argument factor = FALSE
e.g. findex_by(data, ids = itn(id1, id2, id3, factor = FALSE), time)
. This will generate a factor with integer levels instead.
Print Method
The print methods for 'indexed_frame' and 'indexed_series' first call print(unindex(x), ...)
, followed by the index variables with the number of categories (index factor levels) in square brackets. If the time factor contains unused levels (= irregularity in the sequence), the square brackets indicate the number of used levels (periods), followed by the total number of levels (periods in the sequence) in parentheses.
See Also
timeid
,
Time Series and Panel Series, Collapse Overview
Examples
oldopts <- options(max.print = 70)
# Indexing panel data ----------------------------------------------------------
wldi <- findex_by(wlddev, iso3c, year)
wldi
wldi[1:100,1] # Works like a data frame
POP <- wldi$POP # indexed_series
qsu(POP) # Summary statistics
G(POP) # Population growth
STD(G(POP, c(1, 10))) # Within-standardized 1 and 10-year growth rates
psmat(POP) # Panel-Series Matrix
plot(psmat(log10(POP)))
POP[30:5000] # Subsetting indexed_series
Dlog(POP[30:5000]) # Log-difference of subset
psacf(identity(POP[30:5000])) # ACF of subset
L(Dlog(POP[30:5000], c(1, 10)), -1:1) # Multiple computations on subset
# Fast Statistical Functions don't have dedicated methods
# Thus for aggregation we need to unindex beforehand ...
fmean(unindex(POP))
wldi |> unindex() |>
fgroup_by(iso3c) |> num_vars() |> fmean()
library(magrittr)
# ... or unindex after taking group identifiers from the index
fmean(unindex(fgrowth(POP)), ix(POP)$iso3c)
wldi |> num_vars() %>%
fgroup_by(iso3c = ix(.)$iso3c) |>
unindex() |> fmean()
# With matrix methods it is easier as most attributes are dropped upon aggregation.
G(POP, c(1, 10)) %>% fmean(ix(.)$iso3c)
# Example of index with multiple ids
GGDC10S |> findex_by(Variable, Country, Year) |> head() # default is interact.ids = TRUE
GGDCi <- GGDC10S |> findex_by(Variable, Country, Year, interact.ids = FALSE)
head(GGDCi)
findex(GGDCi)
# The benefit is increased flexibility for summary statistics and data transformation
qsu(GGDCi, effect = "Country")
STD(GGDCi$SUM, effect = "Variable") # Standardizing by variable
STD(GGDCi$SUM, effect = c("Variable", "Year")) # ... by variable and year
# But time-based operations are a bit more expensive because of the necessary interactions
D(GGDCi$SUM)
# Panel-Data modelling ---------------------------------------------------------
# Linear model of 5-year annualized growth rates of GDP on Life Expactancy + 5y lag
lm(G(PCGDP, 5, p = 1/5) ~ L(G(LIFEEX, 5, p = 1/5), c(0, 5)), wldi) # p abbreviates "power"
# Same, adding time fixed effects via plm package: need to utilize to_plm function
plm::plm(G(PCGDP, 5, p = 1/5) ~ L(G(LIFEEX, 5, p = 1/5), c(0, 5)), to_plm(wldi), effect = "time")
# With country and time fixed effects via fixest
fixest::feols(G(PCGDP, 5, p=1/5) ~ L(G(LIFEEX, 5, p=1/5), c(0, 5)), wldi, fixef = .c(iso3c, year))
## Not run:
# Running a robust MM regression without fixed effects
robustbase::lmrob(G(PCGDP, 5, p = 1/5) ~ L(G(LIFEEX, 5, p = 1/5), c(0, 5)), wldi)
# Running a robust MM regression with country and time fixed effects
wldi |> fselect(PCGDP, LIFEEX) |>
fgrowth(5, power = 1/5) |> ftransform(LIFEEX_L5 = L(LIFEEX, 5)) |>
# drop abbreviates drop.index.levels (not strictly needed here but more consistent)
na_omit(drop = "all") |> fhdwithin(na.rm = FALSE) |> # For TFE use fwithin(effect = "year")
unindex() |> robustbase::lmrob(formula = PCGDP ~.) # using lm() gives same result as fixest
# Using a random forest model without fixed effects
# ranger does not support these kinds of formulas, thus we need some preprocessing...
wldi |> fselect(PCGDP, LIFEEX) |>
fgrowth(5, power = 1/5) |> ftransform(LIFEEX_L5 = L(LIFEEX, 5)) |>
unindex() |> na_omit() |> ranger::ranger(formula = PCGDP ~.)
## End(Not run)
# Indexing other data frame based classes --------------------------------------
library(tibble)
wlditbl <- qTBL(wlddev) |> findex_by(iso3c, year)
wlditbl[,2] # Works like a tibble...
wlditbl[[2]]
wlditbl[1:1000, 10]
head(wlditbl)
library(data.table)
wldidt <- qDT(wlddev) |> findex_by(iso3c, year)
wldidt[1:1000] # Works like a data.table...
wldidt[year > 2000]
wldidt[, .(sum_PCGDP = sum(PCGDP, na.rm = TRUE)), by = country] # Aggregation unindexes the result
wldidt[, lapply(.SD, sum, na.rm = TRUE), by = country, .SDcols = .c(PCGDP, LIFEEX)]
# This also works but is a bit inefficient since the index is subset and then dropped
# -> better unindex beforehand
wldidt[year > 2000, .(sum_PCGDP = sum(PCGDP, na.rm = TRUE)), by = country]
wldidt[, PCGDP_gr_5Y := G(PCGDP, 5, power = 1/5)] # Can add Variables by reference
# Note that .SD is a data.table of indexed_series, not an indexed_frame, so this is WRONG!
wldidt[, .c(PCGDP_gr_5Y, LIFEEX_gr_5Y) := G(slt(.SD, PCGDP, LIFEEX), 5, power = 1/5)]
# This gives the correct outcome
wldidt[, .c(PCGDP_gr_5Y, LIFEEX_gr_5Y) := lapply(slt(.SD, PCGDP, LIFEEX), G, 5, power = 1/5)]
## Not run:
library(sf)
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
nci <- findex_by(nc, SID74)
nci[1:10, "AREA"]
st_centroid(nci) # The geometry column is never indexed, thus sf computations work normally
st_coordinates(nci)
fmean(st_area(nci))
library(tsibble)
pedi <- findex_by(pedestrian, Sensor, Date_Time)
pedi[1:5, ]
findex(pedi) # Time factor with 17k levels from POSIXct
# Now here is a case where integer levels in the index can really speed things up
ix(iby(pedestrian, Sensor, timeid(Date_Time)))
library(microbenchmark)
microbenchmark(descriptive_levels = findex_by(pedestrian, Sensor, Date_Time),
integer_levels = findex_by(pedestrian, Sensor, timeid(Date_Time)))
# Data has irregularity
is_irregular(pedi)
is_irregular(pedi, any_id = FALSE) # irregularity in all sequences
# Manipulation such as lagging with tsibble/dplyr requires expanding rows and grouping
# Collapse can just compute correct lag on indexed series or frames
library(dplyr)
microbenchmark(
dplyr = fill_gaps(pedestrian) |> group_by_key() |> mutate(Lag_Count = lag(Count)),
collapse = fmutate(pedi, Lag_Count = flag(Count)), times = 10)
## End(Not run)
# Indexing Atomic objects ---------------------------------------------------------
## ts
print(AirPassengers)
AirPassengers[-(20:30)] # Ts class does not support irregularity, subsetting drops class
G(AirPassengers[-(20:30)], 12) # Annual Growth Rate: Wrong!
# Now indexing AirPassengers (identity() is a trick so that the index is named time(AirPassengers))
iAP <- reindex(AirPassengers, identity(time(AirPassengers)))
iAP
findex(iAP) # See the index
iAP[-(20:30)] # Subsetting
G(iAP[-(20:30)], 12) # Annual Growth Rate: Correct!
L(G(iAP[-(20:30)], c(0,1,12)), 0:1) # Lagged level, period and annual growth rates...
## xts
library(xts)
library(zoo) # Needed for as.yearmon() and index() functions
X <- wlddev |> fsubset(iso3c == "DEU", date, PCGDP:POP) %>% {
xts(num_vars(.), order.by = as.yearmon(.$date))
} |> ss(-(30:40)) %>% reindex(identity(index(.))) # Introducing a gap
# plot(G(unindex(X)))
diff(unindex(X)) # diff.xts gixes wrong result
fdiff(X) # fdiff gives right result
# But xts range-based subsets do not work...
## Not run:
X["1980/"]
## End(Not run)
# Thus a better way is not to index and perform ad-hoc omputations on the xts index
X <- unindex(X)
X["1980/"] %>% fdiff(t = index(.)) # xts index is internally processed by timeid()
## Of course you can also index plain vectors / matrices...
options(oldopts)
Unlistable Lists
Description
A (nested) list with atomic objects in all final nodes of the list-tree is unlistable - checked with is_unlistable
.
Usage
is_unlistable(l, DF.as.list = FALSE)
Arguments
l |
a list. |
DF.as.list |
logical. |
Details
is_unlistable
with DF.as.list = TRUE
is defined as all(rapply(l, is.atomic))
, whereas DF.as.list = FALSE
yields checking using all(unlist(rapply2d(l, function(x) is.atomic(x) || is.list(x)), use.names = FALSE))
, assuming that data frames are lists composed of atomic elements. If l
contains data frames, the latter can be a lot faster than applying is.atomic
to every data frame column.
Value
logical(1)
- TRUE
or FALSE
.
See Also
ldepth
, has_elem
, List Processing, Collapse Overview
Examples
l <- list(1, 2, list(3, 4, "b", FALSE))
is_unlistable(l)
l <- list(1, 2, list(3, 4, "b", FALSE, e ~ b))
is_unlistable(l)
Fast and Verbose Table Joins
Description
Join two data frame like objects x
and y
on
columns. Inspired by polars and by default uses a vectorized hash join algorithm (workhorse function fmatch
), with several verbose options.
Usage
join(x, y,
on = NULL,
how = "left",
suffix = NULL,
validate = "m:m",
multiple = FALSE,
sort = FALSE,
keep.col.order = TRUE,
drop.dup.cols = FALSE,
verbose = .op[["verbose"]],
require = NULL,
column = NULL,
attr = NULL,
...
)
Arguments
x |
a data frame-like object. The result will inherit the attributes of this object. |
y |
a data frame-like object to join with |
on |
character. vector of columns to join on. |
how |
character. Join type: |
suffix |
character(1 or 2). Suffix to add to duplicate column names. |
validate |
character. (Optional) check if join is of specified type. One of |
multiple |
logical. Handling of rows in |
sort |
logical. |
keep.col.order |
logical. Keep order of columns in |
drop.dup.cols |
instead of renaming duplicate columns in |
verbose |
integer. Prints information about the join. One of 0 (off), 1 (default, see Details) or 2 (additionally prints the classes of the |
require |
(optional) named list of the form |
column |
(optional) name for an extra column to generate in the output indicating which dataset a record came from. |
attr |
(optional) name for attribute providing information about the join performed (including the output of |
... |
further arguments to |
Details
If verbose > 0
, join
prints a compact summary of the join operation using cat
. If the names of x
and y
can be extracted (if as.character(substitute(x))
yields a single string) they will be displayed (otherwise 'x' and 'y' are used) followed by the respective join keys in brackets. This is followed by a summary of the records used from each table. If multiple = FALSE
, only the first matches from y
are used and counted here (or the first matches of x
if how = "right"
). Note that if how = "full"
any further matches are simply appended to the results table, thus it may make more sense to use multiple = TRUE
with the full join when suspecting multiple matches.
If multiple = TRUE
, join
performs a full cartesian product matching every key in x
to every matching key in y
. This can considerably increase the size of the resulting table. No memory checks are performed (your system will simply run out of memory; usually this should not terminate R).
In both cases, join
will also determine the average order of the join as the number of records used from each table divided by the number of unique matches and display it between the two tables at up to 2 digits. For example "<4:1.5>"
means that on average 4 records from x
match 1.5 records from y
, implying on average 4*1.5 = 6
records generated per unique match. If multiple = FALSE
"1st"
will be displayed for the using table (y
unless how = "right"
), indicating that there could be multiple matches but only the first is retained. Note that an order of '1' on either table must not imply that the key is unique as this value is generated from round(v, 2)
. To be sure about a keys uniqueness employ the validate
argument.
Value
A data frame-like object of the same type and attributes as x
. "row.names"
of x
are only preserved in left-join operations.
See Also
fmatch
, pivot
, Data Frame Manipulation, Fast Grouping and Ordering, Collapse Overview
Examples
df1 <- data.frame(
id1 = c(1, 1, 2, 3),
id2 = c("a", "b", "b", "c"),
name = c("John", "Jane", "Bob", "Carl"),
age = c(35, 28, 42, 50)
)
df2 <- data.frame(
id1 = c(1, 2, 3, 3),
id2 = c("a", "b", "c", "e"),
salary = c(60000, 55000, 70000, 80000),
dept = c("IT", "Marketing", "Sales", "IT")
)
# Different types of joins
for(i in c("l","i","r","f","s","a"))
join(df1, df2, how = i) |> print()
# With multiple matches
for(i in c("l","i","r","f","s","a"))
join(df1, df2, on = "id2", how = i, multiple = TRUE) |> print()
# Adding join column: useful esp. for full join
join(df1, df2, how = "f", column = TRUE)
# Custom column + rearranging
join(df1, df2, how = "f", column = list("join", c("x", "y", "x_y")), keep = FALSE)
# Attaching match attribute
str(join(df1, df2, attr = TRUE))
Determine the Depth / Level of Nesting of a List
Description
ldepth
provides the depth of a list or list-like structure.
Usage
ldepth(l, DF.as.list = FALSE)
Arguments
l |
a list. |
DF.as.list |
logical. |
Details
The depth or level or nesting of a list or list-like structure (e.g. a model object) is found by recursing down to the bottom of the list and adding an integer count of 1 for each level passed. For example the depth of a data frame is 1. If a data frame has list-columns, the depth is 2. However for reasons of efficiency, if l
is not a data frame and DF.as.list = FALSE
, data frames found inside l
will not be checked for list column's but assumed to have a depth of 1.
Value
A single integer indicating the depth of the list.
See Also
is_unlistable
, has_elem
, List Processing, Collapse Overview
Examples
l <- list(1, 2)
ldepth(l)
l <- list(1, 2, mtcars)
ldepth(l)
ldepth(l, DF.as.list = FALSE)
l <- list(1, 2, list(4, 5, list(6, mtcars)))
ldepth(l)
ldepth(l, DF.as.list = FALSE)
List Processing
Description
collapse provides the following set of functions to efficiently work with lists of R objects:
-
Search and Identification
-
is_unlistable
checks whether a (nested) list is composed of atomic objects in all final nodes, and thus unlistable to an atomic vector usingunlist
. -
ldepth
determines the level of nesting of the list (i.e. the maximum number of nodes of the list-tree). -
has_elem
searches elements in a list using element names, regular expressions applied to element names, or a function applied to the elements, and returnsTRUE
if any matches were found.
-
-
Subsetting
-
atomic_elem
examines the top-level of a list and returns a sublist with the atomic elements. Converselylist_elem
returns the sublist of elements which are themselves lists or list-like objects. -
reg_elem
andirreg_elem
are recursive versions of the former.reg_elem
extracts the 'regular' part of the list-tree leading to atomic elements in the final nodes, whileirreg_elem
extracts the 'irregular' part of the list tree leading to non-atomic elements in the final nodes. (Tip: try calling both on anlm
object). Naturally for all listsl
,is_unlistable(reg_elem(l))
evaluates toTRUE
. -
get_elem
extracts elements from a list using element names, regular expressions applied to element names, a function applied to the elements, or element-indices used to subset the lowest-level sub-lists. by default the result is presented as a simplified list containing all matching elements. With thekeep.tree
option howeverget_elem
can also be used to subset lists i.e. maintain the full tree but cut off non-matching branches.
-
-
Splitting and Transposition
-
rsplit
recursively splits a vector or data frame into subsets according to combinations of (multiple) vectors / factors - by default returning a (nested) list. Ifflatten = TRUE
, the list is flattened yielding the same result assplit
.rsplit
is also faster thansplit
, particularly for data frames. -
t_list
efficiently transposes nested lists of lists, such as those obtained from splitting a data frame by multiple variables usingrsplit
.
-
-
Apply Functions
-
Unlisting / Row-Binding
-
unlist2d
efficiently unlists unlistable lists in 2-dimensions and creates a data frame (or data.table) representation of the list. This is done by recursively flattening and row-binding R objects in the list while creating identifier columns for each level of the list-tree and (optionally) saving the row-names of the objects in a separate column.unlist2d
can thus also be understood as a recursive generalization ofdo.call(rbind, l)
, for lists of vectors, data frames, arrays or heterogeneous objects. A simpler version for non-recursive row-binding lists of lists / data.frames, is also available byrowbind
.
-
Table of Functions
Function | Description | |
is_unlistable | Checks if list is unlistable | |
ldepth | Level of nesting / maximum depth of list-tree | |
has_elem | Checks if list contains a certain element | |
get_elem | Subset list / extract certain elements | |
atomic_elem | Top-level subset atomic elements | |
list_elem | Top-level subset list/list-like elements | |
reg_elem | Recursive version of atomic_elem : Subset / extract 'regular' part of list |
|
irreg_elem | Subset / extract non-regular part of list | |
rsplit | Recursively split vectors or data frames / lists | |
t_list | Transpose lists of lists | |
rapply2d | Recursively apply functions to lists of data objects | |
unlist2d | Recursively unlist/row-bind lists of data objects in 2D, to data frame or data.table | |
rowbind | Non-recursive binding of lists of lists / data.frames. | |
See Also
Pad Matrix-Like Objects with a Value
Description
The pad
function inserts elements / rows filled with value
into a vector matrix or data frame X
at positions given by i
. It is particularly useful to expand objects returned by statistical procedures which remove missing values to the original data dimensions.
Usage
pad(X, i, value = NA, method = c("auto", "xpos", "vpos"))
Arguments
X |
a vector, matrix, data frame or list of equal-length columns. | ||||||||||||||||||||||||
i |
either an integer (positive or negative) or logical vector giving positions / rows of | ||||||||||||||||||||||||
value |
a scalar value to be replicated and inserted into | ||||||||||||||||||||||||
method |
an integer or string specifying the use of
|
Value
X
with elements / rows filled with value
inserted at positions given by i
.
See Also
append
, Recode and Replace Values, Small (Helper) Functions, Collapse Overview
Examples
v <- 1:3
pad(v, 1:2) # Automatic selection of method "vpos"
pad(v, -(1:2)) # Same thing
pad(v, c(TRUE, TRUE, FALSE, FALSE, FALSE)) # Same thing
pad(v, c(1, 3:4)) # Automatic selection of method "xpos"
pad(v, c(TRUE, FALSE, TRUE, TRUE, FALSE)) # Same thing
head(pad(wlddev, 1:3)) # Insert 3 missing rows at the beginning of the data
head(pad(wlddev, 2:4)) # ... at rows positions 2-4
# pad() is mostly useful for statistical models which only use the complete cases:
mod <- lm(LIFEEX ~ PCGDP, wlddev)
# Generating a residual column in the original data (automatic selection of method "vpos")
settfm(wlddev, resid = pad(resid(mod), mod$na.action))
# Another way to do it:
r <- resid(mod)
i <- as.integer(names(r))
resid2 <- pad(r, i) # automatic selection of method "xpos"
# here we need to add some elements as flast(i) < nrow(wlddev)
resid2 <- c(resid2, rep(NA, nrow(wlddev)-length(resid2)))
# See that these are identical:
identical(unattrib(wlddev$resid), resid2)
# Can also easily get a model matrix at the dimensions of the original data
mm <- pad(model.matrix(mod), mod$na.action)
Fast and Easy Data Reshaping
Description
pivot()
is collapse's data reshaping command. It combines longer-, wider-, and recast-pivoting functionality in a single parsimonious API. Notably, it can also accommodate variable labels.
Usage
pivot(data, # Summary of Documentation:
ids = NULL, # identifier cols to preserve
values = NULL, # cols containing the data
names = NULL, # name(s) of new col(s) | col(s) containing names
labels = NULL, # name of new labels col | col(s) containing labels
how = "longer", # method: "longer"/"l", "wider"/"w" or "recast"/"r"
na.rm = FALSE, # remove rows missing 'values' in reshaped data
factor = c("names", "labels"), # create new id col(s) as factor variable(s)?
check.dups = FALSE, # detect duplicate 'ids'+'names' combinations
# Only apply if how = "wider" or "recast"
FUN = "last", # aggregation function (internal or external)
FUN.args = NULL, # list of arguments passed to aggregation function
nthreads = .op[["nthreads"]], # minor gains as grouping remains serial
fill = NULL, # value to insert for unbalanced data (default NA/NULL)
drop = TRUE, # drop unused levels (=columns) if 'names' is factor
sort = FALSE, # "ids": sort 'ids' and/or "names": alphabetic casting
# Only applies if how = "wider" with multiple long columns ('values')
transpose = FALSE # "columns": applies t_list() before flattening, and/or
) # "names": sets names nami_colj. default: colj_nami
Arguments
data |
data frame-like object (list of equal-length columns). | ||||||||||||||||
ids |
identifier columns to keep. Specified using column names, indices, a logical vector or an identifier function e.g. | ||||||||||||||||
values |
columns containing the data to be reshaped. Specified like | ||||||||||||||||
names |
names of columns to generate, or retrieve variable names from:
| ||||||||||||||||
labels |
names of columns to generate, or retrieve variable labels from:
| ||||||||||||||||
how |
character. The pivoting method: one of | ||||||||||||||||
na.rm |
logical. | ||||||||||||||||
factor |
character. Whether to generate new 'names' and/or 'labels' columns as factor variables. This is generally recommended as factors are more memory efficient than character vectors and also faster in subsequent filtering and grouping. Internally, this argument is evaluated as | ||||||||||||||||
check.dups |
logical. | ||||||||||||||||
FUN |
function to aggregate values. At present, only a single function is allowed. Fast Statistical Functions receive vectorized execution. For maximum efficiency, a small set of internal functions is provided: | ||||||||||||||||
FUN.args |
(optional) list of arguments passed to | ||||||||||||||||
nthreads |
integer. if | ||||||||||||||||
fill |
if | ||||||||||||||||
drop |
logical. if | ||||||||||||||||
sort |
if | ||||||||||||||||
transpose |
if |
Details
Pivot wider essentially works as follows: compute g_rows = group(ids)
and also g_cols = group(names)
(using group
if sort = FALSE
). g_rows
gives the row-numbers of the wider data frame and g_cols
the column numbers.
Then, a C function generates a wide data frame and runs through each long column ('values'), assigning each value to the corresponding row and column in the wide frame. In this process FUN
is always applied. The default, "last"
, does nothing at all, i.e., if there are duplicates, some values are overwritten. "first"
works similarly just that the C-loop is executed the other way around. The other hard-coded options count, sum, average, or compare observations on the fly. Missing values are internally skipped for statistical functions as there is no way to distinguish an incoming NA
from an initial NA
- apart from counting occurrences using an internal structure of the same size as the result data frame which is costly and thus not implemented.
When passing an R-function to FUN
, the data is grouped using g_full = group(g_rows, g_cols)
, aggregated by groups, and expanded again to full length using TRA
before entering the reshaping algorithm. Thus, this is significantly more expensive than the optimized internal functions. With Fast Statistical Functions the aggregation is vectorized across groups, other functions are applied using BY
- by far the slowest option.
If check.dups = TRUE
, a check of the form fnunique(list(g_rows, g_cols)) < fnrow(data)
is run, and an informative warning is issued if duplicates are found.
Recast pivoting works similarly. In long pivots FUN
is ignored and the check simply amounts to fnunique(ids) < fnrow(data)
.
Value
A reshaped data frame with the same class and attributes (except for 'names'/'row-names') as the input frame.
Note
Leaving either 'ids' or 'values' empty will assign all other columns (except for "variable"
if how = "wider"|"recast"
) to the non-specified argument. It is also possible to leave both empty, e.g. for complete melting if how = "wider"
or data transposition if how = "recast"
(similar to data.table::transpose
but supporting multiple names columns and variable labels). See Examples.
pivot
currently does not support concurrently melting/pivoting longer to multiple columns. See data.table::melt
or pivot_longer
from tidyr or tidytable for an efficient alternative with this feature. It is also possible to achieve this with just a little bit of programming. An example is provided below.
See Also
collap
, vec
, rowbind
, unlist2d
, Data Frame Manipulation, Collapse Overview
Examples
# -------------------------------- PIVOT LONGER ---------------------------------
# Simple Melting (Reshaping Long)
pivot(mtcars) |> head()
pivot(iris, "Species") |> head()
pivot(iris, values = 1:4) |> head() # Same thing
# Using collapse's datasets
head(wlddev)
pivot(wlddev, 1:8, na.rm = TRUE) |> head()
pivot(wlddev, c("iso3c", "year"), c("PCGDP", "LIFEEX"), na.rm = TRUE) |> head()
head(GGDC10S)
pivot(GGDC10S, 1:5, names = list("Sectorcode", "Value"), na.rm = TRUE) |> head()
# Can also set by name: variable and/or value. Note that 'value' here remains lowercase
pivot(GGDC10S, 1:5, names = list(variable = "Sectorcode"), na.rm = TRUE) |> head()
# Melting including saving labels
pivot(GGDC10S, 1:5, na.rm = TRUE, labels = TRUE) |> head()
pivot(GGDC10S, 1:5, na.rm = TRUE, labels = "description") |> head()
# Also assigning new labels
pivot(GGDC10S, 1:5, na.rm = TRUE, labels = list("description",
c("Sector Code", "Sector Description", "Value"))) |> namlab()
# Can leave out value column by providing named vector of labels
pivot(GGDC10S, 1:5, na.rm = TRUE, labels = list("description",
c(variable = "Sector Code", description = "Sector Description"))) |> namlab()
# Now here is a nice example that is explicit and respects the dataset naming conventions
pivot(GGDC10S, ids = 1:5, na.rm = TRUE,
names = list(variable = "Sectorcode",
value = "Value"),
labels = list(name = "Sector",
new = c(Sectorcode = "GGDC10S Sector Code",
Sector = "Long Sector Description",
Value = "Employment or Value Added"))) |>
namlab(N = TRUE, Nd = TRUE, class = TRUE)
# Note that pivot() currently does not support melting to multiple columns
# But you can tackle the issue with a bit of programming:
wide <- pivot(GGDC10S, c("Country", "Year"), c("AGR", "MAN", "SUM"), "Variable",
how = "wider", na.rm = TRUE)
head(wide)
library(magrittr)
wide %>% {av(pivot(., 1:2, grep("_VA", names(.))), pivot(gvr(., "_EMP")))} |> head()
wide %>% {av(av(gv(., 1:2), rm_stub(gvr(., "_VA"), "_VA", pre = FALSE)) |>
pivot(1:2, names = list("Sectorcode", "VA"), labels = "Sector"),
EMP = vec(gvr(., "_EMP")))} |> head()
rm(wide)
# -------------------------------- PIVOT WIDER ---------------------------------
iris_long <- pivot(iris, "Species") # Getting a long frame
head(iris_long)
# If 'names'/'values' not supplied, searches for 'variable' and 'value' columns
pivot(iris_long, how = "wider")
# But here the records are not identified by 'Species': thus aggregation with last value:
pivot(iris_long, how = "wider", check = TRUE) # issues a warning
rm(iris_long)
# This works better, these two are inverse operations
wlddev |> pivot(1:8) |> pivot(how = "w") |> head()
# ...but not perfect, we loose labels
namlab(wlddev)
wlddev |> pivot(1:8) |> pivot(how = "w") |> namlab()
# But pivot() supports labels: these are perfect inverse operations
wlddev |> pivot(1:8, labels = "label") |> print(max = 50) |> # Notice the "label" column
pivot(how = "w", labels = "label") |> namlab()
# If the data does not have 'variable'/'value' cols: need to specify 'names'/'values'
# Using a single column:
pivot(GGDC10S, c("Country", "Year"), "SUM", "Variable", how = "w") |> head()
SUM_wide <- pivot(GGDC10S, c("Country", "Year"), "SUM", "Variable", how = "w", na.rm = TRUE)
head(SUM_wide) # na.rm = TRUE here removes all new rows completely missing data
tail(SUM_wide) # But there may still be NA's, notice the NA in the final row
# We could use fill to set another value
pivot(GGDC10S, c("Country", "Year"), "SUM", "Variable", how = "w",
na.rm = TRUE, fill = -9999) |> tail()
# This will keep the label of "SUM", unless we supply a column with new labels
namlab(SUM_wide)
# Such a column is not available here, but we could use "Variable" twice
pivot(GGDC10S, c("Country", "Year"), "SUM", "Variable", "Variable", how = "w",
na.rm = TRUE) |> namlab()
# Alternatively, can of course relabel ex-post
SUM_wide |> relabel(VA = "Value Added", EMP = "Employment") |> namlab()
rm(SUM_wide)
# Multiple-column pivots
pivot(GGDC10S, c("Country", "Year"), c("AGR", "MAN", "SUM"), "Variable", how = "w",
na.rm = TRUE) |> head()
# Here we may prefer a transposed column order
pivot(GGDC10S, c("Country", "Year"), c("AGR", "MAN", "SUM"), "Variable", how = "w",
na.rm = TRUE, transpose = "columns") |> head()
# Can also flip the order of names (independently of columns)
pivot(GGDC10S, c("Country", "Year"), c("AGR", "MAN", "SUM"), "Variable", how = "w",
na.rm = TRUE, transpose = "names") |> head()
# Can also enable both (complete transposition)
pivot(GGDC10S, c("Country", "Year"), c("AGR", "MAN", "SUM"), "Variable", how = "w",
na.rm = TRUE, transpose = TRUE) |> head() # or tranpose = c("columns", "names")
# Finally, here is a nice, simple way to reshape the entire dataset.
pivot(GGDC10S, values = 6:16, names = "Variable", na.rm = TRUE, how = "w") |>
namlab(N = TRUE, Nd = TRUE, class = TRUE)
# -------------------------------- PIVOT RECAST ---------------------------------
# Look at the data again
head(GGDC10S)
# Let's stack the sectors and instead create variable columns
pivot(GGDC10S, .c(Country, Regioncode, Region, Year),
names = list("Variable", "Sectorcode"), how = "r") |> head()
# Same thing (a bit easier)
pivot(GGDC10S, values = 6:16, names = list("Variable", "Sectorcode"), how = "r") |> head()
# Removing missing values
pivot(GGDC10S, values = 6:16, names = list("Variable", "Sectorcode"), how = "r",
na.rm = TRUE) |> head()
# Saving Labels
pivot(GGDC10S, values = 6:16, names = list("Variable", "Sectorcode"),
labels = list(to = "Sector"), how = "r", na.rm = TRUE) |> head()
# Supplying new labels for generated columns: as complete as it gets
pivot(GGDC10S, values = 6:16, names = list("Variable", "Sectorcode"),
labels = list(to = "Sector",
new = c(Sectorcode = "GGDC10S Sector Code",
Sector = "Long Sector Description",
VA = "Value Added",
EMP = "Employment")), how = "r", na.rm = TRUE) |>
namlab(N = TRUE, Nd = TRUE, class = TRUE)
# Now another (slightly unconventional) use case here is data transposition
# Let's get the data for Botswana
BWA <- GGDC10S |> fsubset(Country == "BWA", Variable, Year, AGR:SUM)
head(BWA)
# By supplying no ids or values, we are simply requesting a transpose operation
pivot(BWA, names = list(from = c("Variable", "Year"), to = "Sectorcode"), how = "r")
# Same with labels
pivot(BWA, names = list(from = c("Variable", "Year"), to = "Sectorcode"),
labels = list(to = "Sector"), how = "r")
# For simple cases, data.table::transpose() will be more efficient, but with multiple
# columns to generate names and/or variable labels to be saved/assigned, pivot() is handy
rm(BWA)
Auto- and Cross- Covariance and Correlation Function Estimation for Panel Series
Description
psacf
, pspacf
and psccf
compute (and by default plot) estimates of the auto-, partial auto- and cross- correlation or covariance functions for panel series. They are analogues to acf
, pacf
and ccf
.
Usage
psacf(x, ...)
pspacf(x, ...)
psccf(x, y, ...)
## Default S3 method:
psacf(x, g, t = NULL, lag.max = NULL, type = c("correlation", "covariance","partial"),
plot = TRUE, gscale = TRUE, ...)
## Default S3 method:
pspacf(x, g, t = NULL, lag.max = NULL, plot = TRUE, gscale = TRUE, ...)
## Default S3 method:
psccf(x, y, g, t = NULL, lag.max = NULL, type = c("correlation", "covariance"),
plot = TRUE, gscale = TRUE, ...)
## S3 method for class 'data.frame'
psacf(x, by, t = NULL, cols = is.numeric, lag.max = NULL,
type = c("correlation", "covariance","partial"), plot = TRUE, gscale = TRUE, ...)
## S3 method for class 'data.frame'
pspacf(x, by, t = NULL, cols = is.numeric, lag.max = NULL,
plot = TRUE, gscale = TRUE, ...)
# Methods for indexed data / compatibility with plm:
## S3 method for class 'pseries'
psacf(x, lag.max = NULL, type = c("correlation", "covariance","partial"),
plot = TRUE, gscale = TRUE, ...)
## S3 method for class 'pseries'
pspacf(x, lag.max = NULL, plot = TRUE, gscale = TRUE, ...)
## S3 method for class 'pseries'
psccf(x, y, lag.max = NULL, type = c("correlation", "covariance"),
plot = TRUE, gscale = TRUE, ...)
## S3 method for class 'pdata.frame'
psacf(x, cols = is.numeric, lag.max = NULL,
type = c("correlation", "covariance","partial"), plot = TRUE, gscale = TRUE, ...)
## S3 method for class 'pdata.frame'
pspacf(x, cols = is.numeric, lag.max = NULL, plot = TRUE, gscale = TRUE, ...)
Arguments
x , y |
a numeric vector, 'indexed_series' ('pseries'), data frame or 'indexed_frame' ('pdata.frame'). |
g |
a factor, |
by |
data.frame method: Same input as |
t |
a time vector or list of vectors. See |
cols |
data.frame method: Select columns using a function, column names, indices or a logical vector. Note: |
lag.max |
integer. Maximum lag at which to calculate the acf. Default is |
type |
character. String giving the type of acf to be computed. Allowed values are "correlation" (the default), "covariance" or "partial". |
plot |
logical. If |
gscale |
logical. Do a groupwise scaling / standardization of |
... |
further arguments to be passed to |
Details
If gscale = TRUE
data are standardized within each group (using fscale
) such that the group-mean is 0 and the group-standard deviation is 1. This is strongly recommended for most panels to get rid of individual-specific heterogeneity which would corrupt the ACF computations.
After scaling, psacf
, pspacf
and psccf
compute the ACF/CCF by creating a matrix of panel-lags of the series using flag
and then computing the covariance of this matrix with the series (x, y
) using cov
and pairwise-complete observations, and dividing by the variance (of x, y
). Creating the lag matrix may require a lot of memory on large data, but passing a sequence of lags to flag
and thus calling flag
and cov
one time is generally much faster than calling them lag.max
times. The partial ACF is computed from the ACF using a Yule-Walker decomposition, in the same way as in pacf
.
Value
An object of class 'acf', see acf
. The result is returned invisibly if plot = TRUE
.
See Also
Time Series and Panel Series, Collapse Overview
Examples
## World Development Panel Data
head(wlddev) # See also help(wlddev)
psacf(wlddev$PCGDP, wlddev$country, wlddev$year) # ACF of GDP per Capita
psacf(wlddev, PCGDP ~ country, ~year) # Same using data.frame method
psacf(wlddev$PCGDP, wlddev$country) # The Data is sorted, can omit t
pspacf(wlddev$PCGDP, wlddev$country) # Partial ACF
psccf(wlddev$PCGDP, wlddev$LIFEEX, wlddev$country) # CCF with Life-Expectancy at Birth
psacf(wlddev, PCGDP + LIFEEX + ODA ~ country, ~year) # ACF and CCF of GDP, LIFEEX and ODA
psacf(wlddev, ~ country, ~year, c(9:10,12)) # Same, using cols argument
pspacf(wlddev, ~ country, ~year, c(9:10,12)) # Partial ACF
## Using indexed data:
wldi <- findex_by(wlddev, iso3c, year) # Creating a indexed frame
PCGDP <- wldi$PCGDP # Indexed Series of GDP per Capita
LIFEEX <- wldi$LIFEEX # Indexed Series of Life Expectancy
psacf(PCGDP) # Same as above, more parsimonious
pspacf(PCGDP)
psccf(PCGDP, LIFEEX)
psacf(wldi[c(9:10,12)])
pspacf(wldi[c(9:10,12)])
Matrix / Array from Panel Series
Description
psmat
efficiently expands a panel-vector or 'indexed_series' ('pseries') into a matrix. If a data frame or 'indexed_frame' ('pdata.frame') is passed, psmat
returns a 3D array or a list of matrices.
Usage
psmat(x, ...)
## Default S3 method:
psmat(x, g, t = NULL, transpose = FALSE, fill = NULL, ...)
## S3 method for class 'data.frame'
psmat(x, by, t = NULL, cols = NULL, transpose = FALSE, fill = NULL, array = TRUE, ...)
# Methods for indexed data / compatibility with plm:
## S3 method for class 'pseries'
psmat(x, transpose = FALSE, fill = NULL, drop.index.levels = "none", ...)
## S3 method for class 'pdata.frame'
psmat(x, cols = NULL, transpose = FALSE, fill = NULL, array = TRUE,
drop.index.levels = "none", ...)
## S3 method for class 'psmat'
plot(x, legend = FALSE, colours = legend, labs = NULL, grid = FALSE, ...)
Arguments
x |
a vector, indexed series 'indexed_series' ('pseries'), data frame or 'indexed_frame' ('pdata.frame'). |
g |
a factor, |
by |
data.frame method: Same input as |
t |
same inputs as |
cols |
data.frame method: Select columns using a function, column names, indices or a logical vector. Note: |
transpose |
logical. |
fill |
element to fill empty slots of matrix / array if panel is unbalanced. |
array |
data.frame / pdata.frame methods: logical. |
drop.index.levels |
character. Either |
... |
arguments to be passed to or from other methods, or for the plot method additional arguments passed to |
legend |
logical. Automatically create a legend of panel-groups. |
colours |
either |
labs |
character. Provide a character-vector of variable labels / series titles when plotting an array. |
grid |
logical. Calls |
Details
If n > 2 index variables are attached to an indexed series or frame, the first n-1 variables in the index are interacted.
Value
A matrix or 3D array containing the data in x
, where by default the rows constitute the groups-ids (g/by
) and the columns the time variable or individual ids (t
). 3D arrays contain the variables in the 3rd dimension. The objects have a class 'psmat', and also a 'transpose' attribute indicating whether transpose = TRUE
.
Note
The pdata.frame
method only works for properly subsetted objects of class 'pdata.frame'. A list of 'pseries' won't work. There also exist simple aperm
and [
(subset) methods for 'psmat' objects. These differ from the default methods only by keeping the class and the 'transpose' attribute.
See Also
Time Series and Panel Series, Collapse Overview
Examples
## World Development Panel Data
head(wlddev) # View data
qsu(wlddev, pid = ~ iso3c, cols = 9:12, vlabels = TRUE) # Sumarizing data
str(psmat(wlddev$PCGDP, wlddev$iso3c, wlddev$year)) # Generating matrix of GDP
r <- psmat(wlddev, PCGDP ~ iso3c, ~ year) # Same thing using data.frame method
plot(r, main = vlabels(wlddev)[9], xlab = "Year") # Plot the matrix
str(r) # See srructure
str(psmat(wlddev$PCGDP, wlddev$iso3c)) # The Data is sorted, could omit t
str(psmat(wlddev$PCGDP, 216)) # This panel is also balanced, so
# ..indicating the number of groups would be sufficient to obtain a matrix
ar <- psmat(wlddev, ~ iso3c, ~ year, 9:12) # Get array of transposed matrices
str(ar)
plot(ar)
plot(ar, legend = TRUE)
plot(psmat(collap(wlddev, ~region+year, cols = 9:12), # More legible and fancy plot
~region, ~year), legend = TRUE,
labs = vlabels(wlddev)[9:12])
psml <- psmat(wlddev, ~ iso3c, ~ year, 9:12, array = FALSE) # This gives list of ps-matrices
head(unlist2d(psml, "Variable", "Country", id.factor = TRUE),2) # Using unlist2d, can generate DF
## Indexing simplifies things
wldi <- findex_by(wlddev, iso3c, year) # Creating an indexed frame
PCGDP <- wldi$PCGDP # An indexed_series of GDP per Capita
head(psmat(PCGDP), 2) # Same as above, more parsimonious
plot(psmat(PCGDP))
plot(psmat(wldi[9:12]))
plot(psmat(G(wldi[9:12]))) # Here plotting panel-growth rates
(Pairwise, Weighted) Correlations, Covariances and Observation Counts
Description
Computes (pairwise, weighted) Pearson's correlations, covariances and observation counts. Pairwise correlations and covariances can be computed together with observation counts and p-values, and output as 3D array (default) or list of matrices. pwcor
and pwcov
offer an elaborate print method.
Usage
pwcor(X, ..., w = NULL, N = FALSE, P = FALSE, array = TRUE, use = "pairwise.complete.obs")
pwcov(X, ..., w = NULL, N = FALSE, P = FALSE, array = TRUE, use = "pairwise.complete.obs")
pwnobs(X)
## S3 method for class 'pwcor'
print(x, digits = .op[["digits"]], sig.level = 0.05,
show = c("all","lower.tri","upper.tri"), spacing = 1L, return = FALSE, ...)
## S3 method for class 'pwcov'
print(x, digits = .op[["digits"]], sig.level = 0.05,
show = c("all","lower.tri","upper.tri"), spacing = 1L, return = FALSE, ...)
Arguments
X |
a matrix or data.frame, for |
x |
an object of class 'pwcor' / 'pwcov'. |
w |
numeric. A vector of (frequency) weights. |
N |
logical. |
P |
logical. |
array |
logical. If |
use |
argument passed to |
digits |
integer. The number of digits to round to in print. |
sig.level |
numeric. P-value threshold below which a |
show |
character. The part of the correlation / covariance matrix to display. |
spacing |
integer. Controls the spacing between different reported quantities in the printout of the matrix: 0 - compressed, 1 - single space, 2 - double space. |
return |
logical. |
... |
other arguments passed to |
Value
a numeric matrix, 3D array or list of matrices with the computed statistics. For pwcor
and pwcov
the object has a class 'pwcor' and 'pwcov', respectively.
Note
weights::wtd.cors
is imported for weighted pairwise correlations (written in C for speed). For weighted correlations with bootstrap SE's see weights::wtd.cor
(bootstrap can be slow). Weighted correlations for complex surveys are implemented in jtools::svycor
. An equivalent and faster implementation of pwcor
(without weights) is provided in Hmisc::rcorr
(written in Fortran).
See Also
qsu
, Summary Statistics, Collapse Overview
Examples
mna <- na_insert(mtcars)
pwcor(mna)
pwcov(mna)
pwnobs(mna)
pwcor(mna, N = TRUE)
pwcor(mna, P = TRUE)
pwcor(mna, N = TRUE, P = TRUE)
aperm(pwcor(mna, N = TRUE, P = TRUE))
print(pwcor(mna, N = TRUE, P = TRUE), digits = 3, sig.level = 0.01, show = "lower.tri")
pwcor(mna, N = TRUE, P = TRUE, array = FALSE)
print(pwcor(mna, N = TRUE, P = TRUE, array = FALSE), show = "lower.tri")
Fast Factor Generation, Interactions and Vector Grouping
Description
qF
, shorthand for 'quick-factor' implements very fast factor generation from atomic vectors using either radix ordering or index hashing followed by sorting.
qG
, shorthand for 'quick-group', generates a kind of factor-light without the levels attribute but instead an attribute providing the number of levels. Optionally the levels / groups can be attached, but without converting them to character (which can have large performance implications). Objects have a class 'qG'.
finteraction
generates a factor or 'qG' object by interacting multiple vectors or factors. In that process missing values are always replaced with a level and unused levels/combinations are always dropped.
collapse internally makes optimal use of factors and 'qG' objects when passed as grouping vectors to statistical functions (g/by
, or t
arguments) i.e. typically no further grouping or ordering is performed and objects are used directly by statistical C/C++ code.
Usage
qF(x, ordered = FALSE, na.exclude = TRUE, sort = .op[["sort"]], drop = FALSE,
keep.attr = TRUE, method = "auto")
qG(x, ordered = FALSE, na.exclude = TRUE, sort = .op[["sort"]],
return.groups = FALSE, method = "auto")
is_qG(x)
as_factor_qG(x, ordered = FALSE, na.exclude = TRUE)
finteraction(..., factor = TRUE, ordered = FALSE, sort = factor && .op[["sort"]],
method = "auto", sep = ".")
itn(...) # Shorthand for finteraction
Arguments
x |
a atomic vector, factor or quick-group. | ||||||||||||||||||||||||||
ordered |
logical. Adds a class 'ordered'. | ||||||||||||||||||||||||||
na.exclude |
logical. | ||||||||||||||||||||||||||
sort |
logical. | ||||||||||||||||||||||||||
drop |
logical. If | ||||||||||||||||||||||||||
keep.attr |
logical. If | ||||||||||||||||||||||||||
method |
an integer or character string specifying the method of computation:
Note that for | ||||||||||||||||||||||||||
return.groups |
logical. | ||||||||||||||||||||||||||
factor |
logical. | ||||||||||||||||||||||||||
sep |
character. The separator passed to | ||||||||||||||||||||||||||
... |
multiple atomic vectors or factors, or a single list of equal-length vectors or factors. See Details. |
Details
Whenever a vector is passed to a Fast Statistical Function such as fmean(mtcars, mtcars$cyl)
, is is grouped using qF
, or qG
if use.g.names = FALSE
.
qF
is a combination of as.factor
and factor
. Applying it to a vector i.e. qF(x)
gives the same result as as.factor(x)
. qF(x, ordered = TRUE)
generates an ordered factor (same as factor(x, ordered = TRUE)
), and qF(x, na.exclude = FALSE)
generates a level for missing values (same as factor(x, exclude = NULL)
). An important addition is that qF(x, na.exclude = FALSE)
also adds a class 'na.included'. This prevents collapse functions from checking missing values in the factor, and is thus computationally more efficient. Therefore factors used in grouped operations should preferably be generated using qF(x, na.exclude = FALSE)
. Setting sort = FALSE
gathers the levels in first-appearance order (unless method = "radix"
and x
is numeric, in which case the levels are always sorted). This often gives a noticeable speed improvement.
There are 3 internal methods of computation: radix ordering, hashing, and Rcpp sugar hashing. Radix ordering is done by combining the functions radixorder
and groupid
. It is generally faster than hashing for large numeric data and pre-sorted data (although there are exceptions). Hashing uses group
, followed by radixorder
on the unique elements if sort = TRUE
. It is generally fastest for character data. Rcpp hashing uses Rcpp::sugar::sort_unique
and Rcpp::sugar::match
. This is often less efficient than the former on large data, but the sorting properties (relying on std::sort
) may be superior in borderline cases where radixorder
fails to deliver exact lexicographic ordering of factor levels.
Regarding speed: In general qF
is around 5x faster than as.factor
on character data and about 30x faster on numeric data. Automatic method dispatch typically does a good job delivering optimal performance.
qG
is in the first place a programmers function. It generates a factor-'light' class 'qG' consisting of only an integer grouping vector and an attribute providing the number of groups. It is slightly faster and more memory efficient than GRP
for grouping atomic vectors, and also convenient as it can be stored in a data frame column, which are the main reasons for its existence.
finteraction
is simply a wrapper around as_factor_GRP(GRP.default(X))
, where X is replaced by the arguments in '...' combined in a list (so its not really an interaction function but just a multivariate grouping converted to factor, see GRP
for computational details). In general: All vectors, factors, or lists of vectors / factors passed can be interacted. Interactions always create a level for missing values and always drop unused levels.
Value
qF
returns an (ordered) factor. qG
returns an object of class 'qG': an integer grouping vector with an attribute "N.groups"
indicating the number of groups, and, if return.groups = TRUE
, an attribute "groups"
containing the vector of unique groups / elements in x
corresponding to the integer-id. finteraction
can return either.
Note
An efficient alternative for character vectors with multithreading support is provided by kit::charToFact
.
qG(x, sort = FALSE, na.exclude = FALSE, method = "hash")
internally calls group(x)
which can also be used directly and also supports multivariate groupings.
Neither qF
nor qG
reorder groups / factor levels. An exception was added in v1.7, when calling qF(f, sort = FALSE)
on a factor f
, the levels are recast in first appearance order. These objects can however be converted into one another using qF/qG
or the direct method as_factor_qG
(called inside qF
). It is also possible to add a class 'ordered' (ordered = TRUE
) and to create am extra level / integer for missing values (na.exclude = FALSE
) if factors or 'qG' objects are passed to qF
or qG
.
See Also
group
, groupid
, GRP
, Fast Grouping and Ordering, Collapse Overview
Examples
cylF <- qF(mtcars$cyl) # Factor from atomic vector
cylG <- qG(mtcars$cyl) # Quick-group from atomic vector
cylG # See the simple structure of this object
cf <- qF(wlddev$country) # Bigger data
cf2 <- qF(wlddev$country, na.exclude = FALSE) # With na.included class
dat <- num_vars(wlddev)
# cf2 is faster in grouped operations because no missing value check is performed
library(microbenchmark)
microbenchmark(fmax(dat, cf), fmax(dat, cf2))
finteraction(mtcars$cyl, mtcars$vs) # Interacting two variables (can be factors)
head(finteraction(mtcars)) # A more crude example..
finteraction(mtcars$cyl, mtcars$vs, factor = FALSE) # Returns 'qG', by default unsorted
group(mtcars$cyl, mtcars$vs) # Same thing
Fast (Grouped, Weighted) Summary Statistics for Cross-Sectional and Panel Data
Description
qsu
, shorthand for quick-summary, is an extremely fast summary command inspired by the (xt)summarize command in the STATA statistical software.
It computes a set of 7 statistics (nobs, mean, sd, min, max, skewness and kurtosis) using a numerically stable one-pass method generalized from Welford's Algorithm. Statistics can be computed weighted, by groups, and also within-and between entities (for panel data, see Details).
Usage
qsu(x, ...)
## Default S3 method:
qsu(x, g = NULL, pid = NULL, w = NULL, higher = FALSE,
array = TRUE, stable.algo = .op[["stable.algo"]], ...)
## S3 method for class 'matrix'
qsu(x, g = NULL, pid = NULL, w = NULL, higher = FALSE,
array = TRUE, stable.algo = .op[["stable.algo"]], ...)
## S3 method for class 'data.frame'
qsu(x, by = NULL, pid = NULL, w = NULL, cols = NULL, higher = FALSE,
array = TRUE, labels = FALSE, stable.algo = .op[["stable.algo"]], ...)
## S3 method for class 'grouped_df'
qsu(x, pid = NULL, w = NULL, higher = FALSE,
array = TRUE, labels = FALSE, stable.algo = .op[["stable.algo"]], ...)
# Methods for indexed data / compatibility with plm:
## S3 method for class 'pseries'
qsu(x, g = NULL, w = NULL, effect = 1L, higher = FALSE,
array = TRUE, stable.algo = .op[["stable.algo"]], ...)
## S3 method for class 'pdata.frame'
qsu(x, by = NULL, w = NULL, cols = NULL, effect = 1L, higher = FALSE,
array = TRUE, labels = FALSE, stable.algo = .op[["stable.algo"]], ...)
# Methods for compatibility with sf:
## S3 method for class 'sf'
qsu(x, by = NULL, pid = NULL, w = NULL, cols = NULL, higher = FALSE,
array = TRUE, labels = FALSE, stable.algo = .op[["stable.algo"]], ...)
## S3 method for class 'qsu'
as.data.frame(x, ..., gid = "Group", stringsAsFactors = TRUE)
## S3 method for class 'qsu'
print(x, digits = .op[["digits"]] + 2L, nonsci.digits = 9, na.print = "-",
return = FALSE, print.gap = 2, ...)
Arguments
x |
a vector, matrix, data frame, 'indexed_series' ('pseries') or 'indexed_frame' ('pdata.frame'). |
g |
a factor, |
by |
(p)data.frame method: Same as |
pid |
same input as |
w |
a vector of (non-negative) weights. Adding weights will compute the weighted mean, sd, skewness and kurtosis, and transform the data using weighted individual means if |
cols |
select columns to summarize using column names, indices, a logical vector or a function (e.g. |
higher |
logical. Add higher moments (skewness and kurtosis). |
array |
logical. If computations have more than 2 dimensions (up to a maximum of 4D: variables, statistics, groups and panel-decomposition) |
stable.algo |
logical. |
labels |
logical |
effect |
plm methods: Select which panel identifier should be used for between and within transformations of the data. 1L takes the first variable in the index, 2L the second etc.. Index variables can also be called by name using a character string. More than one variable can be supplied. |
... |
arguments to be passed to or from other methods. |
gid |
character. Name assigned to the group-id column, when summarising variables by groups. |
stringsAsFactors |
logical. Make factors from dimension names of 'qsu' array. Same as option to |
digits |
the number of digits to print after the comma/dot. |
nonsci.digits |
the number of digits to print before resorting to scientific notation (default is to print out numbers with up to 9 digits and print larger numbers scientifically). |
na.print |
character string to substitute for missing values. |
return |
logical. Don't print but instead return the formatted object. |
print.gap |
integer. Spacing between printed columns. Passed to |
Details
The algorithm used to compute statistics is well described here [see sections Welford's online algorithm, Weighted incremental algorithm and Higher-order statistics. Skewness and kurtosis are calculated as described in Higher-order statistics and are mathematically identical to those implemented in the moments package. Just note that qsu
computes the kurtosis (like momens::kurtosis
), not the excess-kurtosis (= kurtosis - 3) defined in Higher-order statistics. The Weighted incremental algorithm described can easily be generalized to higher-order statistics].
Grouped computations specified with g/by
are carried out extremely efficiently as in fsum
(in a single pass, without splitting the data).
If pid
is used, qsu
performs a panel-decomposition of each variable and computes 3 sets of statistics: Statistics computed on the 'Overall' (raw) data, statistics computed on the 'Between' - transformed (pid - averaged) data, and statistics computed on the 'Within' - transformed (pid - demeaned) data.
More formally, let x
(bold) be a panel vector of data for N
individuals indexed by i
, recorded for T
periods, indexed by t
. xit
then denotes a single data-point belonging to individual i
in time-period t
(t/T
must not represent time). Then xi.
denotes the average of all values for individual i
(averaged over t
), and by extension xN.
is the vector (length N
) of such averages for all individuals. If no groups are supplied to g/by
, the 'Between' statistics are computed on xN.
, the vector of individual averages. (This means that for a non-balanced panel or in the presence of missing values, the 'Overall' mean computed on x
can be slightly different than the 'Between' mean computed on xN.
, and the variance decomposition is not exact). If groups are supplied to g/by
, xN.
is expanded to the vector xi.
(length N x T
) by replacing each value xit
in x
with xi.
, while preserving missing values in x
. Grouped Between-statistics are then computed on xi.
, with the only difference that the number of observations ('Between-N') reported for each group is the number of distinct non-missing values of xi.
in each group (not the total number of non-missing values of xi.
in each group, which is already reported in 'Overall-N'). See Examples.
'Within' statistics are always computed on the vector x - xi. + x..
, where x..
is simply the 'Overall' mean computed from x
, which is added back to preserve the level of the data. The 'Within' mean computed on this data will always be identical to the 'Overall' mean. In the summary output, qsu
reports not 'N', which would be identical to the 'Overall-N', but 'T', the average number of time-periods of data available for each individual obtained as 'T' = 'Overall-N / 'Between-N'. When using weights (w
) with panel data (pid
), the 'Between' sum of weights is also simply the number of groups, and the 'Within' sum of weights is the 'Overall' sum of weights divided by the number of groups. See Examples.
Apart from 'N/T' and the extrema, the standard-deviations ('SD') computed on between- and within- transformed data are extremely valuable because they indicate how much of the variation in a panel-variable is between-individuals and how much of the variation is within-individuals (over time). At the extremes, variables that have common values across individuals (such as the time-variable(s) 't' in a balanced panel), can readily be identified as individual-invariant because the 'Between-SD' on this variable is 0 and the 'Within-SD' is equal to the 'Overall-SD'. Analogous, time-invariant individual characteristics (such as the individual-id 'i') have a 0 'Within-SD' and a 'Between-SD' equal to the 'Overall-SD'. See Examples.
For data frame methods, if labels = TRUE
, qsu
uses function(x) paste(names(x), setv(vlabels(x), NA, ""), sep = ": ")
to combine variable names and labels for display. Alternatively, the user can pass a custom function which will be applied to the data frame, e.g. using labels = vlabels
just displays the labels. See also vlabels
.
qsu
comes with its own print method which by default writes out up to 9 digits at 4 decimal places. Larger numbers are printed in scientific format. for numbers between 7 and 9 digits, an apostrophe (') is placed after the 6th digit to designate the millions. Missing values are printed using '-'.
The sf method simply ignores the geometry column.
Value
A vector, matrix, array or list of matrices of summary statistics. All matrices and arrays have a class 'qsu' and a class 'table' attached.
Note
In weighted summaries, observations with missing or zero weights are skipped, and thus do not affect any of the calculated statistics, including the observation count. This also implies that a logical vector passed to w
can be used to efficiently summarize a subset of the data.
Note
If weights w
are used together with pid
, transformed data is computed using weighted individual means i.e. weighted xi.
and weighted x..
. Weighted statistics are subsequently computed on this weighted-transformed data.
References
Welford, B. P. (1962). Note on a method for calculating corrected sums of squares and products. Technometrics. 4 (3): 419-420. doi:10.2307/1266577.
See Also
descr
, Summary Statistics, Fast Statistical Functions, Collapse Overview
Examples
## World Development Panel Data
# Simple Summaries -------------------------
qsu(wlddev) # Simple summary
qsu(wlddev, labels = TRUE) # Display variable labels
qsu(wlddev, higher = TRUE) # Add skewness and kurtosis
# Grouped Summaries ------------------------
qsu(wlddev, ~ region, labels = TRUE) # Statistics by World Bank Region
qsu(wlddev, PCGDP + LIFEEX ~ income) # Summarize GDP per Capita and Life Expectancy by
stats <- qsu(wlddev, ~ region + income, # World Bank Income Level
cols = 9:10, higher = TRUE) # Same variables, by both region and income
aperm(stats) # A different perspective on the same stats
# Grouped summary
wlddev |> fgroup_by(region) |> fselect(PCGDP, LIFEEX) |> qsu()
# Panel Data Summaries ---------------------
qsu(wlddev, pid = ~ iso3c, labels = TRUE) # Adding between and within countries statistics
# -> They show amongst other things that year and decade are individual-invariant,
# that we have GINI-data on only 161 countries, with only 8.42 observations per country on average,
# and that GDP, LIFEEX and GINI vary more between-countries, but ODA received varies more within
# countries over time.
# Let's do this manually for PCGDP:
x <- wlddev$PCGDP
g <- wlddev$iso3c
# This is the exact variance decomposion
all.equal(fvar(x), fvar(B(x, g)) + fvar(W(x, g)))
# What qsu does is calculate
r <- rbind(Overall = qsu(x),
Between = qsu(fmean(x, g)), # Aggregation instead of between-transform
Within = qsu(fwithin(x, g, mean = "overall.mean"))) # Same as qsu(W(x, g) + fmean(x))
r[3, 1] <- r[1, 1] / r[2, 1]
print.qsu(r)
# Proof:
qsu(x, pid = g)
# Using indexed data:
wldi <- findex_by(wlddev, iso3c, year) # Creating a Indexed Data Frame frame from this data
qsu(wldi) # Summary for pdata.frame -> qsu(wlddev, pid = ~ iso3c)
qsu(wldi$PCGDP) # Default summary for Panel Series
qsu(G(wldi$PCGDP)) # Summarizing GDP growth, see also ?G
# Grouped Panel Data Summaries -------------
qsu(wlddev, ~ region, ~ iso3c, cols = 9:12) # Panel-Statistics by region
psr <- qsu(wldi, ~ region, cols = 9:12) # Same on indexed data
psr # -> Gives a 4D array
psr[,"N/T",,] # Checking out the number of observations:
# In North america we only have 3 countries, for the GINI we only have 3.91 observations on average
# for 45 Sub-Saharan-African countries, etc..
psr[,"SD",,] # Considering only standard deviations
# -> In all regions variations in inequality (GINI) between countries are greater than variations
# in inequality within countries. The opposite is true for Life-Expectancy in all regions apart
# from Europe, etc..
# Again let's do this manually for PDGCP:
d <- cbind(Overall = x,
Between = fbetween(x, g),
Within = fwithin(x, g, mean = "overall.mean"))
r <- qsu(d, g = wlddev$region)
r[,"N","Between"] <- fndistinct(g[!is.na(x)], wlddev$region[!is.na(x)])
r[,"N","Within"] <- r[,"N","Overall"] / r[,"N","Between"]
r
# Proof:
qsu(wlddev, PCGDP ~ region, ~ iso3c)
# Weighted Summaries -----------------------
n <- nrow(wlddev)
weights <- abs(rnorm(n)) # Generate random weights
qsu(wlddev, w = weights, higher = TRUE) # Computed weighted mean, SD, skewness and kurtosis
weightsNA <- weights # Weights may contain missing values.. inserting 1000
weightsNA[sample.int(n, 1000)] <- NA
qsu(wlddev, w = weightsNA, higher = TRUE) # But now these values are removed from all variables
# Grouped and panel-summaries can also be weighted in the same manner
# Alternative Output Formats ---------------
# Simple case
as.data.frame(qsu(mtcars))
# For matrices can also use qDF/qDT/qTBL to assign custom name and get a character-id
qDF(qsu(mtcars), "car")
# DF from 3D array: do not combine with aperm(), might introduce wrong column labels
as.data.frame(stats, gid = "Region_Income")
# DF from 4D array: also no aperm()
as.data.frame(qsu(wlddev, ~ income, ~ iso3c, cols = 9:10), gid = "Region")
# Output as nested list
psrl <- qsu(wlddev, ~ income, ~ iso3c, cols = 9:10, array = FALSE)
psrl
# We can now use unlist2d to create a tidy data frame
unlist2d(psrl, c("Variable", "Trans"), row.names = "Income")
Fast (Weighted) Cross Tabulation
Description
A versatile and computationally more efficient replacement for table
. Notably, it also supports tabulations with frequency weights, and computation of a statistic over combinations of variables.
Usage
qtab(..., w = NULL, wFUN = NULL, wFUN.args = NULL,
dnn = "auto", sort = .op[["sort"]], na.exclude = TRUE,
drop = FALSE, method = "auto")
qtable(...) # Long-form. Use set_collapse(mask = "table") to replace table()
Arguments
... |
atomic vectors or factors spanning the table dimensions, (optionally) with tags for the dimension names, or a data frame / list of these. See Examples. |
w |
a single vector to aggregate over the table dimensions e.g. a vector of frequency weights. |
wFUN |
a function used to aggregate |
wFUN.args |
a list of (optional) further arguments passed to |
dnn |
the names of the table dimensions. Either passed directly as a character vector or list (internally
|
sort , na.exclude , drop , method |
arguments passed down to
|
Value
An array of class 'qtab' that inherits from 'table'. Thus all 'table' methods apply to it.
See Also
descr
, Summary Statistics, Fast Statistical Functions, Collapse Overview
Examples
## Basic use
qtab(iris$Species)
with(mtcars, qtab(vs, am))
qtab(mtcars[.c(vs, am)])
library(magrittr)
iris %$% qtab(Sepal.Length > mean(Sepal.Length), Species)
iris %$% qtab(AMSL = Sepal.Length > mean(Sepal.Length), Species)
## World after 2015
wlda15 <- wlddev |> fsubset(year >= 2015) |> collap(~ iso3c)
# Regions and income levels (country frequency)
wlda15 %$% qtab(region, income)
wlda15 %$% qtab(region, income, dnn = vlabels)
wlda15 %$% qtab(region, income, dnn = "namlab")
# Population (millions)
wlda15 %$% qtab(region, income, w = POP) |> divide_by(1e6)
# Life expectancy (years)
wlda15 %$% qtab(region, income, w = LIFEEX, wFUN = fmean)
# Life expectancy (years), weighted by population
wlda15 %$% qtab(region, income, w = LIFEEX, wFUN = fmean,
wFUN.args = list(w = POP))
# GDP per capita (constant 2010 US$): median
wlda15 %$% qtab(region, income, w = PCGDP, wFUN = fmedian,
wFUN.args = list(na.rm = TRUE))
# GDP per capita (constant 2010 US$): median, weighted by population
wlda15 %$% qtab(region, income, w = PCGDP, wFUN = fmedian,
wFUN.args = list(w = POP))
# Including OECD membership
tab <- wlda15 %$% qtab(region, income, OECD)
tab
# Various 'table' methods
tab |> addmargins()
tab |> marginSums(margin = c("region", "income"))
tab |> proportions()
tab |> proportions(margin = "income")
as.data.frame(tab) |> head(10)
ftable(tab, row.vars = c("region", "OECD"))
# Other options
tab |> fsum(TRA = "%") # Percentage table (on a matrix use fsum.default)
tab %/=% (sum(tab)/100) # Another way (division by reference, preserves integers)
tab
rm(tab, wlda15)
Quick Data Conversion
Description
Fast, flexible and precise conversion of common data objects, without method dispatch and extensive checks:
-
qDF
,qDT
andqTBL
convert vectors, matrices, higher-dimensional arrays and suitable lists to data frame, data.table and tibble, respectively. -
qM
converts vectors, higher-dimensional arrays, data frames and suitable lists to matrix. -
mctl
andmrtl
column- or row-wise convert a matrix to list, data frame or data.table. They are used internally byqDF/qDT/qTBL
,dapply
,BY
, etc... -
qF
converts atomic vectors to factor (documented on a separate page). -
as_numeric_factor
,as_integer_factor
, andas_character_factor
convert factors, or all factor columns in a data frame / list, to character or numeric (by converting the levels).
Usage
# Converting between matrices, data frames / tables / tibbles
qDF(X, row.names.col = FALSE, keep.attr = FALSE, class = "data.frame")
qDT(X, row.names.col = FALSE, keep.attr = FALSE, class = c("data.table", "data.frame"))
qTBL(X, row.names.col = FALSE, keep.attr = FALSE, class = c("tbl_df","tbl","data.frame"))
qM(X, row.names.col = NULL , keep.attr = FALSE, class = NULL, sep = ".")
# Programmer functions: matrix rows or columns to list / DF / DT - fully in C++
mctl(X, names = FALSE, return = "list")
mrtl(X, names = FALSE, return = "list")
# Converting factors or factor columns
as_numeric_factor(X, keep.attr = TRUE)
as_integer_factor(X, keep.attr = TRUE)
as_character_factor(X, keep.attr = TRUE)
Arguments
X |
a vector, factor, matrix, higher-dimensional array, data frame or list. | |||||||||||||||||||||
row.names.col |
can be used to add an column saving names or row.names when converting objects to data frame using | |||||||||||||||||||||
keep.attr |
logical. | |||||||||||||||||||||
class |
if a vector of classes is passed here, the converted object will be assigned these classes. If | |||||||||||||||||||||
sep |
character. Separator used for interacting multiple variables selected through | |||||||||||||||||||||
names |
logical. Should the list be named using row/column names from the matrix? | |||||||||||||||||||||
return |
an integer or string specifying what to return. The options are:
|
Details
Object conversions using these functions are maximally efficient and involve 3 consecutive steps: (1) Converting the storage mode / dimensions / data of the object, (2) converting / modifying the attributes and (3) modifying the class of the object:
(1) is determined by the choice of function and the optional row.names.col
argument. Higher-dimensional arrays are converted by expanding the second dimension (adding columns, same as as.matrix, as.data.frame, as.data.table
).
(2) is determined by the keep.attr
argument: keep.attr = TRUE
seeks to preserve the attributes of the object. Its effect is like copying attributes(converted) <- attributes(original)
, and then modifying the "dim", "dimnames", "names", "row.names"
and "levels"
attributes as necessitated by the conversion task. keep.attr = FALSE
only converts / assigns / removes these attributes and drops all others.
(3) is determined by the class
argument: Setting class = "myclass"
will yield a converted object of class "myclass"
, with any other / prior classes being removed by this replacement. Setting class = NULL
does NOT mean that a class NULL
is assigned (which would remove the class attribute), but rather that the default classes are assigned: qM
assigns no class, qDF
a class "data.frame"
, and qDT
a class c("data.table", "data.frame")
. At this point there is an interaction with keep.attr
: If keep.attr = TRUE
and class = NULL
and the object converted already inherits the respective default classes, then any other inherited classes will also be preserved (with qM(x, keep.attr = TRUE, class = NULL)
any class will be preserved if is.matrix(x)
evaluates to TRUE
.)
The default keep.attr = FALSE
ensures hard conversions so that all unnecessary attributes are dropped. Furthermore in qDF/qDT/qTBL
the default classes were explicitly assigned. This is to ensure that the default methods apply, even if the user chooses to preserve further attributes. For qM
a more lenient default setup was chosen to enable the full preservation of time series matrices with keep.attr = TRUE
. If the user wants to keep attributes attached to a matrix but make sure that all default methods work properly, either one of qM(x, keep.attr = TRUE, class = "matrix")
or unclass(qM(x, keep.attr = TRUE))
should be employed.
Value
qDF
- returns a data.frame
qDT
- returns a data.table
qTBL
- returns a tibble
qM
- returns a matrix
mctl
, mrtl
- return a list, data frame or data.table
qF
- returns a factor
as_numeric_factor
- returns X with factors converted to numeric (double) variables
as_integer_factor
- returns X with factors converted to integer variables
as_character_factor
- returns X with factors converted to character variables
See Also
Examples
## Basic Examples
mtcarsM <- qM(mtcars) # Matrix from data.frame
mtcarsDT <- qDT(mtcarsM) # data.table from matrix columns
mtcarsTBL <- qTBL(mtcarsM) # tibble from matrix columns
head(mrtl(mtcarsM, TRUE, "data.frame")) # data.frame from matrix rows, etc..
head(qDF(mtcarsM, "cars")) # Adding a row.names column when converting from matrix
head(qDT(mtcars, "cars")) # Saving row.names when converting data frame to data.table
head(qM(iris, "Species")) # Examples converting data to matrix, saving information
head(qM(GGDC10S, is.character)) # as rownames
head(qM(gv(GGDC10S, -(2:3)), 1:3, sep = "-")) # plm-style rownames
qDF(fmean(mtcars), c("cars", "mean")) # Data frame from named vector, with names
# mrtl() and mctl() are very useful for iteration over matrices
# Think of a coordninates matrix e.g. from sf::st_coordinates()
coord <- matrix(rnorm(10), ncol = 2, dimnames = list(NULL, c("X", "Y")))
# Then we can
for (d in mrtl(coord)) {
cat("lon =", d[1], ", lat =", d[2], fill = TRUE)
# do something complicated ...
}
rm(coord)
## Factors
cylF <- qF(mtcars$cyl) # Factor from atomic vector
cylF
# Factor to numeric conversions
identical(mtcars, as_numeric_factor(dapply(mtcars, qF)))
Fast Radix-Based Ordering
Description
A slight modification of order(..., method = "radix")
that is more programmer friendly and, importantly, provides features for ordered grouping of data (similar to data.table:::forderv
from which it descended).
Usage
radixorder(..., na.last = TRUE, decreasing = FALSE, starts = FALSE,
group.sizes = FALSE, sort = TRUE)
radixorderv(x, na.last = TRUE, decreasing = FALSE, starts = FALSE,
group.sizes = FALSE, sort = TRUE)
Arguments
... |
comma-separated atomic vectors to order. |
x |
an atomic vector or list of atomic vectors such as a data frame. |
na.last |
logical. for controlling the treatment of |
decreasing |
logical. Should the sort order be increasing or decreasing? Can be a vector of length equal to the number of arguments in |
starts |
logical. |
group.sizes |
logical. |
sort |
logical. This argument only affects character vectors / columns passed. If |
Value
An integer ordering vector with attributes: Unless na.last = NA
an attribute "sorted"
indicating whether the input data was already sorted is attached. If starts = TRUE
, "starts"
giving a vector of group starts in the ordered data, and if group.sizes = TRUE
, "group.sizes"
giving the vector of group sizes are attached. In either case an attribute "maxgrpn"
providing the size of the largest group is also attached.
Author(s)
The C code was taken - with slight modifications - from base R source code, and is originally due to data.table authors Matt Dowle and Arun Srinivasan.
See Also
Fast Grouping and Ordering, Collapse Overview
Examples
radixorder(mtcars$mpg)
head(mtcars[radixorder(mtcars$mpg), ])
radixorder(mtcars$cyl, mtcars$vs)
o <- radixorder(mtcars$cyl, mtcars$vs, starts = TRUE)
st <- attr(o, "starts")
head(mtcars[o, ])
mtcars[o[st], c("cyl", "vs")] # Unique groups
# Note that if attr(o, "sorted") == TRUE, then all(o[st] == st)
radixorder(rep(1:3, each = 3), starts = TRUE)
# Group sizes
radixorder(mtcars$cyl, mtcars$vs, group.sizes = TRUE)
# Both
radixorder(mtcars$cyl, mtcars$vs, starts = TRUE, group.sizes = TRUE)
Recursively Apply a Function to a List of Data Objects
Description
rapply2d
is a recursive version of lapply
with three differences to rapply
:
data frames (or other list-based objects specified in
classes
) are considered as atomic, not as (sub-)lists-
FUN
is applied to all 'atomic' objects in the nested list the result is not simplified / unlisted.
Usage
rapply2d(l, FUN, ..., classes = "data.frame")
Arguments
l |
a list. |
FUN |
a function that can be applied to all 'atomic' elements in l. |
... |
additional elements passed to FUN. |
classes |
character. Classes of list-based objects inside |
Value
A list of the same structure as l
, where FUN
was applied to all atomic elements and list-based objects of a class included in classes
.
Note
The main reason rapply2d
exists is to have a recursive function that out-of-the-box applies a function to a nested list of data frames.
For most other purposes rapply
, or by extension the excellent rrapply function / package, provide more advanced functionality and greater performance.
See Also
rsplit
, unlist2d
, List Processing, Collapse Overview
Examples
l <- list(mtcars, list(mtcars, as.matrix(mtcars)))
rapply2d(l, fmean)
unlist2d(rapply2d(l, fmean))
Recode and Replace Values in Matrix-Like Objects
Description
A small suite of functions to efficiently perform common recoding and replacing tasks in matrix-like objects.
Usage
recode_num(X, ..., default = NULL, missing = NULL, set = FALSE)
recode_char(X, ..., default = NULL, missing = NULL, regex = FALSE,
ignore.case = FALSE, fixed = FALSE, set = FALSE)
replace_na(X, value = 0, cols = NULL, set = FALSE, type = "const")
replace_inf(X, value = NA, replace.nan = FALSE, set = FALSE)
replace_outliers(X, limits, value = NA,
single.limit = c("sd", "mad", "min", "max"),
ignore.groups = FALSE, set = FALSE)
Arguments
X |
a vector, matrix, array, data frame or list of atomic objects. |
... |
comma-separated recode arguments of the form: |
default |
optional argument to specify a scalar value to replace non-matched elements with. |
missing |
optional argument to specify a scalar value to replace missing elements with. Note that to increase efficiency this is done before the rest of the recoding i.e. the recoding is performed on data where missing values are filled! |
set |
logical. |
type |
character. One of |
regex |
logical. If |
value |
a single (scalar) value to replace matching elements with. In |
cols |
select columns to replace missing values in using a function, column names, indices or a logical vector. |
replace.nan |
logical. |
limits |
either a vector of two-numeric values |
single.limit |
character, controls the behavior if
|
ignore.groups |
logical. If |
ignore.case , fixed |
logical. Passed to |
Details
-
recode_num
andrecode_char
can be used to efficiently recode multiple numeric or character values, respectively. The syntax is inspired bydplyr::recode
, but the functionality is enhanced in the following respects: (1) when passed a data frame / list, all appropriately typed columns will be recoded. (2) They preserve the attributes of the data object and of columns in a data frame / list, and (3)recode_char
also supports regular expression matching usinggrepl
. -
replace_na
efficiently replacesNA/NaN
with a value (default is0
). data can be multi-typed, in which case appropriate columns can be selected through thecols
argument. For numeric data a more versatile alternative is provided bydata.table::nafill
anddata.table::setnafill
. -
replace_inf
replacesInf/-Inf
(or optionallyNaN/Inf/-Inf
) with a value (default isNA
). It skips non-numeric columns in a data frame. -
replace_outliers
replaces values falling outside a 1- or 2-sided numeric threshold or outside a certain number of standard deviations or median absolute deviation with a value (default isNA
). It skips non-numeric columns in a data frame.
Note
These functions are not generic and do not offer support for factors or date(-time) objects. see dplyr::recode_factor
, forcats and other appropriate packages for dealing with these classes.
Simple replacing tasks on a vector can also effectively be handled by, setv
/ copyv
. Fast vectorized switches are offered by package kit (functions iif
, nif
, vswitch
, nswitch
) as well as data.table::fcase
and data.table::fifelse
. Using switches is more efficient than recode_*
, as recode_*
creates an internal copy of the object to enable cross-replacing.
Function TRA
, and the associated TRA
('transform') argument to Fast Statistical Functions also has option "replace_na"
, to replace missing values with a statistic computed on the non-missing observations, e.g. fmedian(airquality, TRA = "replace_na")
does median imputation.
See Also
pad
, Efficient Programming, Collapse Overview
Examples
recode_char(c("a","b","c"), a = "b", b = "c")
recode_char(month.name, ber = NA, regex = TRUE)
mtcr <- recode_num(mtcars, `0` = 2, `4` = Inf, `1` = NaN)
replace_inf(mtcr)
replace_inf(mtcr, replace.nan = TRUE)
replace_outliers(mtcars, c(2, 100)) # Replace all values below 2 and above 100 w. NA
replace_outliers(mtcars, c(2, 100), value = "clip") # Clipping outliers to the thresholds
replace_outliers(mtcars, 2, single.limit = "min") # Replace all value smaller than 2 with NA
replace_outliers(mtcars, 100, single.limit = "max") # Replace all value larger than 100 with NA
replace_outliers(mtcars, 2) # Replace all values above or below 2 column-
# standard-deviations from the column-mean w. NA
replace_outliers(fgroup_by(iris, Species), 2) # Passing a grouped_df, pseries or pdata.frame
# allows to remove outliers according to
# in-group standard-deviation. see ?fscale
Row-Bind Lists / Data Frame-Like Objects
Description
collapse's version of data.table::rbindlist
and rbind.data.frame
. The core code is copied from data.table, which deserves all credit for the implementation. rowbind
only binds lists/data.frame's. For a more flexible recursive version see unlist2d
. To combine lists column-wise see add_vars
or ftransform
(with replacement).
Usage
rowbind(..., idcol = NULL, row.names = FALSE,
use.names = TRUE, fill = FALSE, id.factor = "auto",
return = c("as.first", "data.frame", "data.table", "tibble", "list"))
Arguments
... |
a single list of list-like objects (data.frames) or comma separated objects (internally assembled using |
idcol |
character. The name of an id-column to be generated identifying the source of rows in the final object. Using |
row.names |
|
use.names |
logical. |
fill |
logical. |
id.factor |
if |
return |
an integer or string specifying what to return. |
Value
a long list or data frame-like object formed by combining the rows / elements of the input objects. The return
argument controls the exact format of the output.
See Also
unlist2d
, add_vars
, ftransform
, Data Frame Manipulation, Collapse Overview
Examples
# These are the same
rowbind(mtcars, mtcars)
rowbind(list(mtcars, mtcars))
# With id column
rowbind(mtcars, mtcars, idcol = "id")
rowbind(a = mtcars, b = mtcars, idcol = "id")
# With saving row-names
rowbind(mtcars, mtcars, row.names = "cars")
rowbind(a = mtcars, b = mtcars, idcol = "id", row.names = "cars")
# Filling up columns
rowbind(mtcars, mtcars[2:8], fill = TRUE)
Fast Reordering of Data Frame Rows
Description
A fast substitute for dplyr::arrange
, based on radixorder(v)
and inspired by data.table::setorder(v)
. It returns a sorted copy of the data frame, unless the data is already sorted in which case no copy is made. In addition, rows can be manually re-ordered. roworderv
is a programmers version that takes vectors/variables as input.
Use data.table::setorder(v)
to sort a data frame without creating a copy.
Usage
roworder(X, ..., na.last = TRUE, verbose = .op[["verbose"]])
roworderv(X, cols = NULL, neworder = NULL, decreasing = FALSE,
na.last = TRUE, pos = "front", verbose = .op[["verbose"]])
Arguments
X |
a data frame or list of equal-length columns. | ||||||||||||||||||||||||||
... |
comma-separated columns of | ||||||||||||||||||||||||||
cols |
select columns to sort by using a function, column names, indices or a logical vector. The default | ||||||||||||||||||||||||||
na.last |
logical. If | ||||||||||||||||||||||||||
decreasing |
logical. Should the sort order be increasing or decreasing? Can also be a vector of length equal to the number of arguments in | ||||||||||||||||||||||||||
neworder |
an ordering vector, can be | ||||||||||||||||||||||||||
pos |
integer or character. Different arrangement options if
| ||||||||||||||||||||||||||
verbose |
logical. |
Value
A copy of X
with rows reordered. If X
is already sorted, X
is simply returned.
Note
If you don't require a copy of the data, use data.table::setorder
(you can also use it in a piped call as it invisibly returns the data).
roworder(v)
has internal facilities to deal with grouped and indexed data. This is however inefficient (since in most cases data could be reordered before grouping/indexing), and therefore issues a message if verbose > 0L
.
See Also
colorder
, Data Frame Manipulation, Fast Grouping and Ordering, Collapse Overview
Examples
head(roworder(airquality, Month, -Ozone))
head(roworder(airquality, Month, -Ozone, na.last = NA)) # Removes the missing values in Ozone
## Same in standard evaluation
head(roworderv(airquality, c("Month", "Ozone"), decreasing = c(FALSE, TRUE)))
head(roworderv(airquality, c("Month", "Ozone"), decreasing = c(FALSE, TRUE), na.last = NA))
## Custom reordering
head(roworderv(mtcars, neworder = 3:4)) # Bring rows 3 and 4 to the front
head(roworderv(mtcars, neworder = 3:4, pos = "end")) # Bring them to the end
head(roworderv(mtcars, neworder = mtcars$vs == 1)) # Bring rows with vs == 1 to the top
Fast (Recursive) Splitting
Description
rsplit
(recursively) splits a vector, matrix or data frame into subsets according to combinations of (multiple) vectors / factors and returns a (nested) list. If flatten = TRUE
, the list is flattened yielding the same result as split
. rsplit
is implemented as a wrapper around gsplit
, and significantly faster than split
.
Usage
rsplit(x, ...)
## Default S3 method:
rsplit(x, fl, drop = TRUE, flatten = FALSE, use.names = TRUE, ...)
## S3 method for class 'matrix'
rsplit(x, fl, drop = TRUE, flatten = FALSE, use.names = TRUE,
drop.dim = FALSE, ...)
## S3 method for class 'data.frame'
rsplit(x, by, drop = TRUE, flatten = FALSE, cols = NULL,
keep.by = FALSE, simplify = TRUE, use.names = TRUE, ...)
Arguments
x |
a vector, matrix, data.frame or list like object. |
fl |
a |
by |
data.frame method: Same as |
drop |
logical. |
flatten |
logical. If |
use.names |
logical. |
drop.dim |
logical. |
cols |
data.frame method: Select columns to split using a function, column names, indices or a logical vector. Note: |
keep.by |
logical. If a formula is passed to |
simplify |
data.frame method: Logical. |
... |
further arguments passed to |
Value
a (nested) list containing the subsets of x
.
See Also
gsplit
, rapply2d
, unlist2d
, List Processing, Collapse Overview
Examples
rsplit(mtcars$mpg, mtcars$cyl)
rsplit(mtcars, mtcars$cyl)
rsplit(mtcars, mtcars[.c(cyl, vs, am)])
rsplit(mtcars, ~ cyl + vs + am, keep.by = TRUE) # Same thing
rsplit(mtcars, ~ cyl + vs + am)
rsplit(mtcars, ~ cyl + vs + am, flatten = TRUE)
rsplit(mtcars, mpg ~ cyl)
rsplit(mtcars, mpg ~ cyl, simplify = FALSE)
rsplit(mtcars, mpg + hp ~ cyl + vs + am)
rsplit(mtcars, mpg + hp ~ cyl + vs + am, keep.by = TRUE)
# Split this sectoral data, first by Variable (Emloyment and Value Added), then by Country
GGDCspl <- rsplit(GGDC10S, ~ Variable + Country, cols = 6:16)
str(GGDCspl)
# The nested list can be reassembled using unlist2d()
head(unlist2d(GGDCspl, idcols = .c(Variable, Country)))
rm(GGDCspl)
# Another example with mtcars (not as clean because of row.names)
nl <- rsplit(mtcars, mpg + hp ~ cyl + vs + am)
str(nl)
unlist2d(nl, idcols = .c(cyl, vs, am), row.names = "car")
rm(nl)
Generate Group-Id from Integer Sequences
Description
seqid
can be used to group sequences of integers in a vector, e.g. seqid(c(1:3, 5:7))
becomes c(rep(1,3), rep(2,3))
. It also supports increments > 1
, unordered sequences, and missing values in the sequence.
Some applications are to facilitate identification of, and grouped operations on, (irregular) time series and panels.
Usage
seqid(x, o = NULL, del = 1L, start = 1L, na.skip = FALSE,
skip.seq = FALSE, check.o = TRUE)
Arguments
x |
a factor or integer vector. Numeric vectors will be converted to integer i.e. rounded downwards. |
o |
an (optional) integer ordering vector specifying the order by which to pass through |
del |
integer. The integer deliminating two consecutive points in a sequence. |
start |
integer. The starting value of the resulting sequence id. Default is starting from 1. |
na.skip |
logical. |
skip.seq |
logical. If |
check.o |
logical. Programmers option: |
Details
seqid
was created primarily as a workaround to deal with problems of computing lagged values, differences and growth rates on irregularly spaced time series and panels before collapse version 1.5.0 (#26). Now flag
, fdiff
and fgrowth
natively support irregular data so this workaround is superfluous, except for iterated differencing which is not yet supported with irregular data.
The theory of the workaround was to express an irregular time series or panel series as a regular panel series with a group-id created such that the time-periods within each group are consecutive. seqid
makes this very easy: For an irregular panel with some gaps or repeated values in the time variable, an appropriate id variable can be generated using settransform(data, newid = seqid(time, radixorder(id, time)))
. Lags can then be computed using L(data, 1, ~newid, ~time)
etc.
In general, for any regularly spaced panel the identity given by identical(groupid(id, order(id, time)), seqid(time, order(id, time)))
should hold.
For the opposite operation of creating a new time-variable that is consecutive in each group, see data.table::rowid
.
Value
An integer vector of class 'qG'. See qG
.
See Also
timeid
, groupid
, qG
, Fast Grouping and Ordering, Collapse Overview
Examples
## This creates an irregularly spaced panel, with a gap in time for id = 2
data <- data.frame(id = rep(1:3, each = 4),
time = c(1:4, 1:2, 4:5, 1:4),
value = rnorm(12))
data
## This gave a gaps in time error previous to collapse 1.5.0
L(data, 1, value ~ id, ~time)
## Generating new id variable (here seqid(time) would suffice as data is sorted)
settransform(data, newid = seqid(time, order(id, time)))
data
## Lag the panel this way
L(data, 1, value ~ newid, ~time)
## A different possibility: Creating a consecutive time variable
settransform(data, newtime = data.table::rowid(id))
data
L(data, 1, value ~ id, ~newtime)
## With sorted data, the time variable can also just be omitted..
L(data, 1, value ~ id)
Small (Helper) Functions
Description
Convenience functions in the collapse package that help to deal with object attributes such as variable names and labels, object checking, metaprogramming, and that improve the workflow.
Usage
.c(...) # Non-standard concatenation i.e. .c(a, b) == c("a", "b")
nam %=% values # Multiple-assignment e.g. .c(x, y) %=% c(1, 2),
massign(nam, values, # can also assign to different environment.
envir = parent.frame())
vlabels(X, attrn = "label", # Get labels of variables in X, in attr(X[[i]], attrn)
use.names = TRUE)
vlabels(X, attrn = "label") <- value # Set labels of variables in X (by reference)
setLabels(X, value = NULL, # Set labels of variables in X (by reference) and return X
attrn = "label", cols = NULL)
vclasses(X, use.names = TRUE) # Get classes of variables in X
namlab(X, class = FALSE, # Return data frame of names and labels,
attrn = "label", N = FALSE, # and (optionally) classes, number of observations
Ndistinct = FALSE) # and number of non-missing distinct values
add_stub(X, stub, pre = TRUE, # Add a stub (i.e. prefix or postfix) to column names
cols = NULL)
rm_stub(X, stub, pre = TRUE, # Remove stub from column names, also supports general
regex = FALSE, # regex matching and removing of characters
cols = NULL, ...)
all_identical(...) # Check exact equality of multiple objects or list-elements
all_obj_equal(...) # Check near equality of multiple objects or list-elements
all_funs(expr) # Find all functions called in an R language expression
setRownames(object, # Set rownames of object and return object
nm = if(is.atomic(object)) seq_row(object) else NULL)
setColnames(object, nm) # Set colnames of object and return object
setDimnames(object, dn, # Set dimension names of object and return object
which = NULL)
unattrib(object) # Remove all attributes from object
setAttrib(object, a) # Replace all attributes with list of attributes 'a'
setattrib(object, a) # Same thing by reference, returning object invisibly
copyAttrib(to, from) # Copy all attributes from object 'from' to object 'to'
copyMostAttrib(to, from) # Copy most attributes from object 'from' to object 'to'
is_categorical(x) # The opposite of is.numeric
is_date(x) # Check if object is of class "Date", "POSIXlt" or "POSIXct"
Arguments
X |
a matrix or data frame (some functions also support vectors and arrays although that is less common). |
x |
a (atomic) vector. |
expr |
an expression of type "language" e.g. |
object , to , from |
a suitable R object. |
a |
a suitable list of attributes. |
attrn |
character. Name of attribute to store labels or retrieve labels from. |
N , Ndistinct |
logical. Options to display the number of observations or number of distinct non-missing values. |
value |
for |
use.names |
logical. Preserve names if |
cols |
integer. (optional) indices of columns to apply the operation to. Note that for these small functions this needs to be integer, whereas for other functions in the package this argument is more flexible. |
class |
logical. Also show the classes of variables in X in a column? |
stub |
a single character stub, i.e. "log.", which by default will be pre-applied to all variables or column names in X. |
pre |
logical. |
regex |
logical. Match pattern anywhere in names using a regular expression and remove it with |
nm |
a suitable vector of row- or column-names. |
dn |
a suitable vector or list of names for dimension(s). |
which |
integer. If |
nam |
character. A vector of object names. |
values |
a matching atomic vector or list of objects. |
envir |
the environment to assign into. |
... |
for |
Details
all_funs
is the opposite of all.vars
, to return the functions called rather than the variables in an expression. See Examples.
copyAttrib
and copyMostAttrib
take a shallow copy of the attribute list, i.e. they don't duplicate in memory the attributes themselves. They also, along with setAttrib
, take a shallow copy of lists passed to the to
argument, so that lists are not modified by reference. Atomic to
arguments are however modified by reference. The function setattrib
, added in v1.8.9, modifies the object
by reference i.e. no shallow copies are taken.
copyMostAttrib
copies all attributes except for "names"
, "dim"
and "dimnames"
(like the corresponding C-API function), and further only copies the "row.names"
attribute of data frames if known to be valid. Thus it is a suitable choice if objects should be of the same type but are not of equal dimensions.
See Also
Efficient Programming, Collapse Overview
Examples
## Non-standard concatenation
.c(a, b, "c d", e == f)
## Multiple assignment
.c(a, b) %=% list(1, 2)
.c(T, N) %=% dim(EuStockMarkets)
names(iris) %=% iris
list2env(iris) # Same thing
rm(list = c("a", "b", "T", "N", names(iris)))
## Variable labels
namlab(wlddev)
namlab(wlddev, class = TRUE, N = TRUE, Ndistinct = TRUE)
vlabels(wlddev)
vlabels(wlddev) <- vlabels(wlddev)
## Stub-renaming
log_mtc <- add_stub(log(mtcars), "log.")
head(log_mtc)
head(rm_stub(log_mtc, "log."))
rm(log_mtc)
## Setting dimension names of an object
head(setRownames(mtcars))
ar <- array(1:9, c(3,3,3))
setRownames(ar)
setColnames(ar, c("a","b","c"))
setDimnames(ar, c("a","b","c"), which = 3)
setDimnames(ar, list(c("d","e","f"), c("a","b","c")), which = 2:3)
setDimnames(ar, list(c("g","h","i"), c("d","e","f"), c("a","b","c")))
## Checking exact equality of multiple objects
all_identical(iris, iris, iris, iris)
l <- replicate(100, fmean(num_vars(iris), iris$Species), simplify = FALSE)
all_identical(l)
rm(l)
## Function names from expressions
ex = quote(sum(x) + mean(y) / z)
all.names(ex)
all.vars(ex)
all_funs(ex)
rm(ex)
Summary Statistics
Description
collapse provides the following functions to efficiently summarize and examine data:
-
qsu
, shorthand for quick-summary, is an extremely fast summary command inspired by the (xt)summarize command in the STATA statistical software. It computes a set of 7 statistics (nobs, mean, sd, min, max, skewness and kurtosis) using a numerically stable one-pass method. Statistics can be computed weighted, by groups, and also within-and between entities (for multilevel / panel data). -
qtab
, shorthand for quick-table, is a faster and more versatile alternative totable
. Notably, it also supports tabulations with frequency weights, as well as computing a statistic over combinations of variables. 'qtab's inherit the 'table' class, allowing for seamless application of 'table' methods. -
descr
computes a concise and detailed description of a data frame, including (sorted) frequency tables for categorical variables and various statistics and quantiles for numeric variables. It is inspired byHmisc::describe
, but about 10x faster. -
pwcor
,pwcov
andpwnobs
compute (weighted) pairwise correlations, covariances and observation counts on matrices and data frames. Pairwise correlations and covariances can be computed together with observation counts and p-values. The elaborate print method displays all of these statistics in a single correlation table. -
varying
very efficiently checks for the presence of any variation in data (optionally) within groups (such as panel-identifiers). A variable is variant if it has at least 2 distinct non-missing data points.
Table of Functions
Function / S3 Generic | Methods | Description | ||
qsu | default, matrix, data.frame, grouped_df, pseries, pdata.frame, sf | Fast (grouped, weighted, panel-decomposed) summary statistics | ||
qtab | No methods, for data frames or vectors | Fast (weighted) cross tabulation | ||
descr | default, grouped_df (default method handles most objects) | Detailed statistical description of data frame | ||
pwcor | No methods, for matrices or data frames | Pairwise (weighted) correlations | ||
pwcov | No methods, for matrices or data frames | Pairwise (weighted) covariances | ||
pwnobs | No methods, for matrices or data frames | Pairwise observation counts | ||
varying | default, matrix, data.frame, pseries, pdata.frame, grouped_df | Fast variation check | ||
See Also
Collapse Overview, Fast Statistical Functions
Efficient List Transpose
Description
t_list
turns a list of lists inside-out. The performance is quite efficient regardless of the size of the list.
Usage
t_list(l)
Arguments
l |
a list of lists. Elements inside the sublists can be heterogeneous, including further lists. |
Value
l
transposed such that the second layer of the list becomes the top layer and the top layer the second layer. See Examples.
Note
To transpose a data frame / list of atomic vectors see data.table::transpose()
.
See Also
rsplit
, List Processing, Collapse Overview
Examples
# Homogenous list of lists
l <- list(a = list(c = 1, d = 2), b = list(c = 3, d = 4))
str(l)
str(t_list(l))
# Heterogenous case
l2 <- list(a = list(c = 1, d = letters), b = list(c = 3:10, d = list(4, e = 5)))
attr(l2, "bla") <- "abc" # Attributes other than names are preserved
str(l2)
str(t_list(l2))
rm(l, l2)
Time Series and Panel Series
Description
collapse provides a flexible and powerful set of functions and classes to work with time-dependent data:
-
findex_by/iby
creates an 'indexed_frame': a flexible structure that can be imposed upon any data-frame like object and facilitates indexed (time-aware) computations on time series and panel data. Indexed frames are composed of 'indexed_series', which can also be created from vector and matrix-based objects using thereindex
function. Further functionsfindex/ix
,unindex
,is_irregular
andto_plm
help operate these classes, check for irregularity, and ensure plm compatibility. Methods are defined for various time series, data transformation and data manipulation functions in collapse. -
timeid
efficiently converts numeric time sequences, such as 'Date' or 'POSIXct' vectors, to a time-factor / integer id, where a unit-step represents the greatest common divisor of the underlying sequence. -
flag
, and the lag- and lead- operatorsL
andF
are S3 generics to efficiently compute sequences of lags and leads on regular or irregular / unbalanced time series and panel data. Similarly,
fdiff
,fgrowth
, and the operatorsD
,Dlog
andG
are S3 generics to efficiently compute sequences of suitably lagged / leaded and iterated differences, log-differences and growth rates.fdiff/D/Dlog
can also compute quasi-differences of the formx_t - \rho x_{t-1}
.-
fcumsum
is an S3 generic to efficiently compute cumulative sums on time series and panel data. In contrast tocumsum
, it can handle missing values and supports both grouped and indexed / ordered computations. -
psmat
is an S3 generic to efficiently convert panel-vectors / 'indexed_series' and data frames / 'indexed_frame's to panel series matrices and 3D arrays, respectively (where time, individuals and variables receive different dimensions, allowing for fast indexation, visualization, and computations). -
psacf
,pspacf
andpsccf
are S3 generics to compute estimates of the auto-, partial auto- and cross- correlation or covariance functions for panel-vectors / 'indexed_series', and multivariate versions for data frames / 'indexed_frame's.
Table of Functions
S3 Generic | Methods | Description | ||
findex_by/iby , findex/ix , reindex , unindex , is_irregular , to_plm | For vectors, matrices and data frames / lists. | Fast and flexible time series and panel data classes 'indexed_series' and 'indexed_frame'. | ||
timeid | For time sequences represented by integer or double vectors / objects. | Generate integer time-id/factor | ||
flag/L/F | default, matrix, data.frame, pseries, pdata.frame, grouped_df | Compute (sequences of) lags and leads | ||
fdiff/D/Dlog | default, matrix, data.frame, pseries, pdata.frame, grouped_df | Compute (sequences of lagged / leaded and iterated) (quasi-)differences or log-differences | ||
fgrowth/G | default, matrix, data.frame, pseries, pdata.frame, grouped_df | Compute (sequences of lagged / leaded and iterated) growth rates (exact, via log-differencing, or compounded) | ||
fcumsum | default, matrix, data.frame, pseries, pdata.frame, grouped_df | Compute cumulative sums | ||
psmat | default, pseries, data.frame, pdata.frame | Convert panel data to matrix / array | ||
psacf | default, pseries, data.frame, pdata.frame | Compute ACF on panel data | ||
pspacf | default, pseries, data.frame, pdata.frame | Compute PACF on panel data | ||
psccf | default, pseries, data.frame, pdata.frame | Compute CCF on panel data |
See Also
Collapse Overview, Data Transformations
Generate Integer-Id From Time/Date Sequences
Description
timeid
groups time vectors in a way that preserves the temporal structure. It generate an integer id where unit steps represent the greatest common divisor in the original sequence e.g c(4, 6, 10) -> c(1, 2, 4)
or c(0.25, 0.75, 1) -> c(1, 3, 4)
.
Usage
timeid(x, factor = FALSE, ordered = factor, extra = FALSE)
Arguments
x |
a numeric time object such as a |
factor |
logical. |
ordered |
logical. |
extra |
logical.
Note that returning these attributes does not incur additional computations. |
Details
Let range_x
and step_x
be the like-named attributes returned when extra = TRUE
, then, if factor = TRUE
, a complete sequence of levels is generated as seq(range_x[1], range_x[2], by = step_x) |> copyMostAttrib(x) |> as.character()
. If factor = FALSE
, the number of timesteps recorded in the "N.groups"
attribute is computed as (range_x[2]-range_x[1])/step_x + 1
, which is equal to the number of factor levels. In both cases the underlying integer id is the same and preserves gaps in time. Large gaps (strong irregularity) can result in many unused factor levels, the generation of which can become expensive. Using factor = FALSE
(the default) is thus more efficient.
Value
A factor or 'qG
' object, optionally with additional attributes attached.
See Also
seqid
, Indexing, Time Series and Panel Series, Collapse Overview
Examples
oldopts <- options(max.print = 30)
# A normal use case
timeid(wlddev$decade)
timeid(wlddev$decade, factor = TRUE)
timeid(wlddev$decade, extra = TRUE)
# Here a large number of levels is generated, which is expensive
timeid(wlddev$date, factor = TRUE)
tid <- timeid(wlddev$date, extra = TRUE) # Much faster
str(tid)
# The reason for step = 1 are leap years with 366 days every 4 years
diff(attr(tid, "unique"))
# So in this case simple factor generation gives a better result
qF(wlddev$date, ordered = TRUE, na.exclude = FALSE)
# The best way to deal with this data would be to convert it
# to zoo::yearmon and then use timeid:
timeid(zoo::as.yearmon(wlddev$date), factor = TRUE, extra = TRUE)
options(oldopts)
rm(oldopts, tid)
Transform Data by (Grouped) Replacing or Sweeping out Statistics
Description
TRA
is an S3 generic that efficiently transforms data by either (column-wise) replacing data values with supplied statistics or sweeping the statistics out of the data. TRA
supports grouped operations and data transformation by reference, and is thus a generalization of sweep
.
Usage
TRA(x, STATS, FUN = "-", ...)
setTRA(x, STATS, FUN = "-", ...) # Shorthand for invisible(TRA(..., set = TRUE))
## Default S3 method:
TRA(x, STATS, FUN = "-", g = NULL, set = FALSE, ...)
## S3 method for class 'matrix'
TRA(x, STATS, FUN = "-", g = NULL, set = FALSE, ...)
## S3 method for class 'data.frame'
TRA(x, STATS, FUN = "-", g = NULL, set = FALSE, ...)
## S3 method for class 'grouped_df'
TRA(x, STATS, FUN = "-", keep.group_vars = TRUE, set = FALSE, ...)
Arguments
x |
a atomic vector, matrix, data frame or grouped data frame (class 'grouped_df'). | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
STATS |
a matching set of summary statistics. See Details and Examples. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
FUN |
an integer or character string indicating the operation to perform. There are 11 supported operations:
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
g |
a factor, | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
set |
logical. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
keep.group_vars |
grouped_df method: Logical. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
... |
arguments to be passed to or from other methods. |
Details
Without groups (g = NULL
), TRA
is little more than a column based version of sweep
, albeit many times more efficient. In this case all methods support an atomic vector of statistics of length NCOL(x)
passed to STATS
. The matrix and data frame methods also support a 1-row matrix or 1-row data frame / list, respectively. TRA
always preserves all attributes of x
.
With groups passed to g
, STATS
needs to be of the same type as x
and of appropriate dimensions [such that NCOL(x) == NCOL(STATS)
and NROW(STATS)
equals the number of groups (i.e. the number of levels if g
is a factor)]. If this condition is satisfied, TRA
will assume that the first row of STATS
is the set of statistics computed on the first group/level of g
, the second row on the second group/level etc. and do groupwise replacing or sweeping out accordingly.
For example Let x = c(1.2, 4.6, 2.5, 9.1, 8.7, 3.3)
, g is an integer vector in 3 groups g = c(1,3,3,2,1,2)
and STATS = fmean(x,g) = c(4.95, 6.20, 3.55)
. Then out = TRA(x,STATS,"-",g) = c(-3.75, 1.05, -1.05, 2.90, 3.75, -2.90)
[same as fmean(x, g, TRA = "-")
] does the equivalent of the following for-loop: for(i in 1:6) out[i] = x[i] - STATS[g[i]]
.
Correct computation requires that g
as used in fmean
and g
passed to TRA
are exactly the same vector. Using g = c(1,3,3,2,1,2)
for fmean
and g = c(3,1,1,2,3,2)
for TRA
will not give the right result. The safest way of programming with TRA
is thus to repeatedly employ the same factor or GRP
object for all grouped computations. Atomic vectors passed to g
will be converted to factors (see qF
) and lists will be converted to GRP
objects. This is also done by all Fast Statistical Functions and BY
, thus together with these functions, TRA
can also safely be used with atomic- or list-groups (as long as all functions apply sorted grouping, which is the default in collapse).
If x
is a grouped data frame ('grouped_df'), TRA
matches the columns of x
and STATS
and also checks for grouping columns in x
and STATS
. TRA.grouped_df
will then only transform those columns in x
for which matching counterparts were found in STATS
(exempting grouping columns) and return x
again (with columns in the same order). If keep.group_vars = FALSE
, the grouping columns are dropped after computation, however the "groups" attribute is not dropped (it can be removed using fungroup()
or dplyr::ungroup()
).
Value
x
with columns replaced or swept out using STATS
, (optionally) grouped by g
.
Note
In most cases there is no need to call the TRA()
function, because of the TRA-argument to all Fast Statistical Functions (ensuring that the exact same grouping vector is used for computing statistics and subsequent transformation). In addition the functions fbetween/B
and fwithin/W
and fscale/STD
provide optimized solutions for frequent scaling, centering and averaging tasks.
See Also
sweep
, Fast Statistical Functions, Data Transformations, Collapse Overview
Examples
v <- iris$Sepal.Length # A numeric vector
f <- iris$Species # A factor
dat <- num_vars(iris) # Numeric columns
m <- qM(dat) # Matrix of numeric data
head(TRA(v, fmean(v))) # Simple centering [same as fmean(v, TRA = "-") or W(v)]
head(TRA(m, fmean(m))) # [same as sweep(m, 2, fmean(m)), fmean(m, TRA = "-") or W(m)]
head(TRA(dat, fmean(dat))) # [same as fmean(dat, TRA = "-") or W(dat)]
head(TRA(v, fmean(v), "replace")) # Simple replacing [same as fmean(v, TRA = "replace") or B(v)]
head(TRA(m, fmean(m), "replace")) # [same as sweep(m, 2, fmean(m)), fmean(m, TRA = 1L) or B(m)]
head(TRA(dat, fmean(dat), "replace")) # [same as fmean(dat, TRA = "replace") or B(dat)]
head(TRA(m, fsd(m), "/")) # Simple scaling... [same as fsd(m, TRA = "/")]...
# Note: All grouped examples also apply for v and dat...
head(TRA(m, fmean(m, f), "-", f)) # Centering [same as fmean(m, f, TRA = "-") or W(m, f)]
head(TRA(m, fmean(m, f), "replace", f)) # Replacing [same fmean(m, f, TRA = "replace") or B(m, f)]
head(TRA(m, fsd(m, f), "/", f)) # Scaling [same as fsd(m, f, TRA = "/")]
head(TRA(m, fmean(m, f), "-+", f)) # Centering on the overall mean ...
# [same as fmean(m, f, TRA = "-+") or
# W(m, f, mean = "overall.mean")]
head(TRA(TRA(m, fmean(m, f), "-", f), # Also the same thing done manually !!
fmean(m), "+"))
# Grouped data method
library(magrittr)
iris %>% fgroup_by(Species) %>% TRA(fmean(.))
iris %>% fgroup_by(Species) %>% fmean(TRA = "-") # Same thing
iris %>% fgroup_by(Species) %>% TRA(fmean(.)[c(2,4)]) # Only transforming 2 columns
iris %>% fgroup_by(Species) %>% TRA(fmean(.)[c(2,4)], # Dropping species column
keep.group_vars = FALSE)
Recursive Row-Binding / Unlisting in 2D - to Data Frame
Description
unlist2d
efficiently unlists lists of regular R objects (objects built up from atomic elements) and creates a data frame representation of the list through recursive flattening and intelligent row-binding operations. It is a full 2-dimensional generalization of unlist
, and best understood as a recursive generalization of do.call(rbind, ...)
.
It is a powerful tool to create a tidy data frame representation from (nested) lists of vectors, data frames, matrices, arrays or heterogeneous objects. For simple row-wise combining lists/data.frame's use the non-recursive rowbind
function.
Usage
unlist2d(l, idcols = ".id", row.names = FALSE, recursive = TRUE,
id.factor = FALSE, DT = FALSE)
Arguments
l |
a unlistable list (with atomic elements in all final nodes, see |
idcols |
a character stub or a vector of names for id-columns automatically added - one for each level of nesting in |
row.names |
|
recursive |
logical. if |
id.factor |
if |
DT |
logical. |
Details
The data frame representation created by unlist2d
is built as follows:
Recurse down to the lowest level of the list-tree, data frames are exempted and treated as a final (atomic) elements.
Identify the objects, if they are vectors, matrices or arrays convert them to data frame (in the case of atomic vectors each element becomes a column).
Row-bind these data frames using data.table's
rbindlist
function. Columns are matched by name. If the number of columns differ, fill empty spaces withNA
's. If!isFALSE(idcols)
, create id-columns on the left, filled with the object names or indices (if the (sub-)list is unnamed). If!isFALSE(row.names)
, store rownames of the objects (if available) in a separate column.Move up to the next higher level of the list-tree and repeat: Convert atomic objects to data frame and row-bind while matching all columns and filling unmatched ones with
NA
's. Create another id-column for each level of nesting passed through. If the list-tree is asymmetric, fill empty spaces in lower-level id columns withNA
's.
The result of this iterative procedure is a single data frame containing on the left side id-columns for each level of nesting (from higher to lower level), followed by a column containing all the rownames of the objects (if !isFALSE(row.names)
), followed by the data columns, matched at each level of recursion. Optimal results are obtained with symmetric lists of arrays, matrices or data frames, which unlist2d
efficiently binds into a beautiful data frame ready for plotting or further analysis. See examples below.
Value
A data frame or (if DT = TRUE
) a data.table.
Note
For lists of data frames unlist2d
works just like data.table::rbindlist(l, use.names = TRUE, fill = TRUE, idcol = ".id")
however for lists of lists unlist2d
does not produce the same output as data.table::rbindlist
because unlist2d
is a recursive function. You can use rowbind
as a faithful alternative to data.table::rbindlist
.
The function rrapply::rrapply(l, how = "melt"|"bind")
is a fast alternative (written fully in C) for nested lists of atomic elements.
See Also
rowbind
, rsplit
, rapply2d
, List Processing, Collapse Overview
Examples
## Basic Examples:
l <- list(mtcars, list(mtcars, mtcars))
tail(unlist2d(l))
unlist2d(rapply2d(l, fmean))
l = list(a = qM(mtcars[1:8]),
b = list(c = mtcars[4:11], d = list(e = mtcars[2:10], f = mtcars)))
tail(unlist2d(l, row.names = TRUE))
unlist2d(rapply2d(l, fmean))
unlist2d(rapply2d(l, fmean), recursive = FALSE)
## Groningen Growth and Development Center 10-Sector Database
head(GGDC10S) # See ?GGDC10S
namlab(GGDC10S, class = TRUE)
# Panel-Summarize this data by Variable (Emloyment and Value Added)
l <- qsu(GGDC10S, by = ~ Variable, # Output as list (instead of 4D array)
pid = ~ Variable + Country,
cols = 6:16, array = FALSE)
str(l, give.attr = FALSE) # A list of 2-levels with matrices of statistics
head(unlist2d(l)) # Default output, missing the variables (row-names)
head(unlist2d(l, row.names = TRUE)) # Here we go, but this is still not very nice
head(unlist2d(l, idcols = c("Sector","Trans"), # Now this is looking pretty good
row.names = "Variable"))
dat <- unlist2d(l, c("Sector","Trans"), # Id-columns can also be generated as factors
"Variable", id.factor = TRUE)
str(dat)
# Split this sectoral data, first by Variable (Emloyment and Value Added), then by Country
sdat <- rsplit(GGDC10S, ~ Variable + Country, cols = 6:16)
# Compute pairwise correlations between sectors and recombine:
dat <- unlist2d(rapply2d(sdat, pwcor),
idcols = c("Variable","Country"),
row.names = "Sector")
head(dat)
plot(hclust(as.dist(1-pwcor(dat[-(1:3)])))) # Using corrs. as distance metric to cluster sectors
# List of panel-series matrices
psml <- psmat(fsubset(GGDC10S, Variable == "VA"), ~Country, ~Year, cols = 6:16, array = FALSE)
# Recombining with unlist2d() (effectively like reshapig the data)
head(unlist2d(psml, idcols = "Sector", row.names = "Country"))
rm(l, dat, sdat, psml)
Fast Check of Variation in Data
Description
varying
is a generic function that (column-wise) checks for variation in the values of x
, (optionally) within the groups g
(e.g. a panel-identifier).
Usage
varying(x, ...)
## Default S3 method:
varying(x, g = NULL, any_group = TRUE, use.g.names = TRUE, ...)
## S3 method for class 'matrix'
varying(x, g = NULL, any_group = TRUE, use.g.names = TRUE, drop = TRUE, ...)
## S3 method for class 'data.frame'
varying(x, by = NULL, cols = NULL, any_group = TRUE, use.g.names = TRUE, drop = TRUE, ...)
# Methods for indexed data / compatibility with plm:
## S3 method for class 'pseries'
varying(x, effect = 1L, any_group = TRUE, use.g.names = TRUE, ...)
## S3 method for class 'pdata.frame'
varying(x, effect = 1L, cols = NULL, any_group = TRUE, use.g.names = TRUE,
drop = TRUE, ...)
# Methods for grouped data frame / compatibility with dplyr:
## S3 method for class 'grouped_df'
varying(x, any_group = TRUE, use.g.names = FALSE, drop = TRUE,
keep.group_vars = TRUE, ...)
# Methods for grouped data frame / compatibility with sf:
## S3 method for class 'sf'
varying(x, by = NULL, cols = NULL, any_group = TRUE, use.g.names = TRUE, drop = TRUE, ...)
Arguments
x |
a vector, matrix, data frame, 'indexed_series' ('pseries'), 'indexed_frame' ('pdata.frame') or grouped data frame ('grouped_df'). Data must not be numeric. |
g |
a factor, |
by |
same as |
any_group |
logical. If |
cols |
select columns using column names, indices or a function (e.g. |
use.g.names |
logical. Make group-names and add to the result as names (default method) or row-names (matrix and data frame methods). No row-names are generated for data.table's. |
drop |
matrix and data.frame methods: Logical. |
effect |
plm methods: Select the panel identifier by which variation in the data should be examined. 1L takes the first variable in the index, 2L the second etc.. Index variables can also be called by name. More than one index variable can be supplied, which will be interacted. |
keep.group_vars |
grouped_df method: Logical. |
... |
arguments to be passed to or from other methods. |
Details
Without groups passed to g
, varying
simply checks if there is any variation in the columns of x
and returns TRUE
for each column where this is the case and FALSE
otherwise. A set of data points is defined as varying if it contains at least 2 distinct non-missing values (such that a non-0 standard deviation can be computed on numeric data). varying
checks for variation in both numeric and non-numeric data.
If groups are supplied to g
(or alternatively a grouped_df to x
), varying
can operate in one of 2 modes:
If
any_group = TRUE
(the default),varying
checks each column for variation in any of the groups defined byg
, and returnsTRUE
if such within-variation was detected andFALSE
otherwise. Thus only one logical value is returned for each column and the computation on each column is terminated as soon as any variation within any group was found.If
any_group = FALSE
,varying
runs through the entire data checking each group for variation and returns, for each column inx
, a logical vector reporting the variation check for all groups. If a group contains only missing values, aNA
is returned for that group.
The sf method simply ignores the geometry column.
Value
A logical vector or (if !is.null(g)
and any_group = FALSE
), a matrix or data frame of logical vectors indicating whether the data vary (over the dimension supplied by g
).
See Also
Summary Statistics, Data Transformations, Collapse Overview
Examples
## Checks overall variation in all columns
varying(wlddev)
## Checks whether data are time-variant i.e. vary within country
varying(wlddev, ~ country)
## Same as above but done for each country individually, countries without data are coded NA
head(varying(wlddev, ~ country, any_group = FALSE))
World Development Dataset
Description
This dataset contains 5 indicators from the World Bank's World Development Indicators (WDI) database: (1) GDP per capita, (2) Life expectancy at birth, (3) GINI index, (4) Net ODA and official aid received and (5) Population. The panel data is balanced and covers 216 present and historic countries from 1960-2020 (World Bank aggregates and regional entities are excluded).
Apart from the indicators the data contains a number of identifiers (character country name, factor ISO3 country code, World Bank region and income level, numeric year and decade) and 2 generated variables: A logical variable indicating whether the country is an OECD member, and a fictitious variable stating the date the data was recorded. These variables were added so that all common data-types are represented in this dataset, making it an ideal test-dataset for certain collapse functions.
Usage
data("wlddev")
Format
A data frame with 13176 observations on the following 13 variables. All variables are labeled e.g. have a 'label' attribute.
country
chr Country Name
iso3c
fct Country Code
date
date Date Recorded (Fictitious)
year
int Year
decade
int Decade
region
fct World Bank Region
income
fct World Bank Income Level
OECD
log Is OECD Member Country?
PCGDP
num GDP per capita (constant 2010 US$)
LIFEEX
num Life expectancy at birth, total (years)
GINI
num GINI index (World Bank estimate)
ODA
num Net official development assistance and official aid received (constant 2018 US$)
POP
num Population, total
Source
https://data.worldbank.org/, accessed via the WDI
package. The codes for the series are c("NY.GDP.PCAP.KD", "SP.DYN.LE00.IN", "SI.POV.GINI", "DT.ODA.ALLD.KD", "SP.POP.TOTL")
.
See Also
Examples
data(wlddev)
# Panel-summarizing the 5 series
qsu(wlddev, pid = ~iso3c, cols = 9:13, vlabels = TRUE)
# By Region
qsu(wlddev, by = ~region, cols = 9:13, vlabels = TRUE)
# Panel-summary by region
qsu(wlddev, by = ~region, pid = ~iso3c, cols = 9:13, vlabels = TRUE)
# Pairwise correlations: Ovarall
print(pwcor(get_vars(wlddev, 9:13), N = TRUE, P = TRUE), show = "lower.tri")
# Pairwise correlations: Between Countries
print(pwcor(fmean(get_vars(wlddev, 9:13), wlddev$iso3c), N = TRUE, P = TRUE), show = "lower.tri")
# Pairwise correlations: Within Countries
print(pwcor(fwithin(get_vars(wlddev, 9:13), wlddev$iso3c), N = TRUE, P = TRUE), show = "lower.tri")