Encoding: UTF-8
Type: Package
Title: Statistical Tools for Filebacked Big Matrices
Version: 1.6.1
Date: 2024-09-10
Description: Easy-to-use, efficient, flexible and scalable statistical tools. Package bigstatsr provides and uses Filebacked Big Matrices via memory-mapping. It provides for instance matrix operations, Principal Component Analysis, sparse linear supervised models, utility functions and more <doi:10.1093/bioinformatics/bty185>.
License: GPL-3
Language: en-US
ByteCompile: TRUE
Depends: R (≥ 3.4)
Imports: bigassertr (≥ 0.1.1), bigparallelr (≥ 0.2.3), cowplot, foreach, ggplot2 (≥ 3.0), graphics, methods, ps (≥ 1.4), Rcpp, rmio (≥ 0.4), RSpectra, stats, tibble, utils
LinkingTo: Rcpp, RcppArmadillo, rmio
Suggests: bigmemory (≥ 4.5.33), bigreadr (≥ 0.2), covr, data.table, dplyr, glmnet, hexbin, memuse, ModelMetrics, plotly, ppcor, RhpcBLASctl, spelling (≥ 1.2), testthat
RoxygenNote: 7.3.2
URL: https://privefl.github.io/bigstatsr/
BugReports: https://github.com/privefl/bigstatsr/issues
Collate: 'AUC.R' 'FBM-attach.R' 'crochet.R' 'FBM.R' 'FBM-code256.R' 'FBM-copy.R' 'RcppExports.R' 'SVD.R' 'apply-parallelize.R' 'biglasso.R' 'bigstatsr-package.R' 'colstats.R' 'crossprodSelf.R' 'mult-mat.R' 'mult-vec.R' 'pcor.R' 'plot.R' 'predict.R' 'randomSVD.R' 'read-write.R' 'scaling.R' 'summary.R' 'tcrossprodSelf.R' 'transpose.R' 'univLinReg.R' 'univLogReg.R' 'utils-assert.R' 'utils.R' 'zzz.R'
NeedsCompilation: yes
Packaged: 2024-09-09 21:08:37 UTC; au639593
Author: Florian Privé [aut, cre], Michael Blum [ths], Hugues Aschard [ths]
Maintainer: Florian Privé <florian.prive.21@gmail.com>
Repository: CRAN
Date/Publication: 2024-09-09 21:40:02 UTC
bigstatsr: Statistical Tools for Filebacked Big Matrices
Description
Easy-to-use, efficient, flexible and scalable statistical tools. Package bigstatsr provides and uses Filebacked Big Matrices via memory-mapping. It provides for instance matrix operations, Principal Component Analysis, sparse linear supervised models, utility functions and more <doi:10.1093/bioinformatics/bty185>.
Arguments
X: An object of class FBM.
X.code: An object of class FBM.code256.
y.train: Vector of responses, corresponding to ind.train.
y01.train: Vector of responses, corresponding to ind.train. Must be only 0s and 1s.
ind.train: An optional vector of the row indices that are used, for the training part. If not specified, all rows are used. Don't use negative indices.
ind.row: An optional vector of the row indices that are used. If not specified, all rows are used. Don't use negative indices.
ind.col: An optional vector of the column indices that are used. If not specified, all columns are used. Don't use negative indices.
block.size: Maximum number of columns read at once. Default uses block_size.
ncores: Number of cores used. Default doesn't use parallelism. You may use nb_cores.
fun.scaling: A function with parameters X, ind.row and ind.col, and that returns a data.frame with $center and $scale for the columns corresponding to ind.col. Default doesn't use any scaling. You can also provide your own center and scale by using as_scaling_fun().
covar.train: Matrix of covariables to be added in each model to correct for confounders (e.g. the scores of PCA), corresponding to ind.train. You shouldn't have missing values in covariables.
covar.row: Matrix of covariables to be added in each model to correct for confounders (e.g. the scores of PCA), corresponding to ind.row. You shouldn't have missing values in covariables.
center: Vector of same length as ind.col to subtract from columns of X.
scale: Vector of same length as ind.col to divide from columns of X.
Matrix parallelization
Large matrix computations are made block-wise and won't be parallelized in order to not have to reduce the size of these blocks. Instead, you can use the MKL or OpenBLAS in order to accelerate these block matrix computations. You can control the number of cores used by these optimized matrix libraries with bigparallelr::set_blas_ncores().
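For example, a minimal sketch (assuming an optimized BLAS such as OpenBLAS or MKL is linked to your R installation):
# let the linked BLAS use 4 cores for subsequent block matrix computations
bigparallelr::set_blas_ncores(4)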
Author(s)
Maintainer: Florian Privé <florian.prive.21@gmail.com>
Other contributors:
Michael Blum [thesis advisor]
Hugues Aschard <hugues.aschard@pasteur.fr> [thesis advisor]
See Also
Useful links:
https://privefl.github.io/bigstatsr/
Report bugs at https://github.com/privefl/bigstatsr/issues
Scaling function creator
Description
Convenience function to create a function to be used as parameter fun.scaling
when you want to use your own precomputed center and scale.
Usage
as_scaling_fun(center.col, scale.col, ind.col = seq_along(center.col))
Arguments
center.col: Vector of centers corresponding to ind.col.
scale.col: Vector of scales corresponding to ind.col.
ind.col: Column indices for which these are provided.
Value
A function to be used as parameter fun.scaling.
Examples
fun.scaling <- as_scaling_fun(1:6, 2:7)
fun.scaling(NULL, NULL, 1:3) # first two parameters X and ind.row are not used here
fun.scaling2 <- as_scaling_fun(1:6, 2:7, ind.col = 6:1)
fun.scaling2(NULL, NULL, 1:3)
X <- big_attachExtdata()
sc <- big_scale()(X)
fun <- as_scaling_fun(center = sc$center, scale = sc$scale)
obj.svd <- big_randomSVD(X, fun.scaling = fun)
obj.svd2 <- big_randomSVD(X, fun.scaling = big_scale())
all.equal(obj.svd, obj.svd2)
Plotly text
Description
Convert a data.frame to plotly text
Usage
asPlotlyText(df)
Arguments
df: A data.frame.
Value
A character vector of the length of df's number of rows.
Examples
set.seed(1)
X <- big_attachExtdata()
svd <- big_SVD(X, big_scale(), k = 10)
p <- plot(svd, type = "scores")
pop <- rep(c("POP1", "POP2", "POP3"), c(143, 167, 207))
df <- data.frame(Population = pop, Index = 1:517)
plot(p2 <- p + ggplot2::aes(text = asPlotlyText(df)))
## Not run: plotly::ggplotly(p2, tooltip = "text")
AUC
Description
Compute the Area Under the ROC Curve (AUC) of a predictor and possibly its 95% confidence interval.
Usage
AUC(pred, target, digits = NULL)
AUCBoot(pred, target, nboot = 10000, seed = NA, digits = NULL)
Arguments
pred: Vector of predictions.
target: Vector of true labels (must have exactly two levels, no missing values).
digits: See round. Default doesn't use rounding.
nboot: Number of bootstrap samples used to evaluate the 95% CI. Default is 1e4.
seed: See set.seed. Use it for reproducibility. Default doesn't set any seed.
Details
Other packages provide ways to compute the AUC (see this answer). I chose to compute the AUC through its statistical definition as a probability:
P(score(x_{case}) > score(x_{control})).
Note that I consider equality between scores as a 50% probability of one being greater than the other.
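As a hedged illustration of this definition (not the package's actual implementation, which is more efficient), the probability can be estimated directly:
# naive estimate of P(score(case) > score(control)), counting ties as 1/2
auc_naive <- function(pred, target) {
  case    <- pred[target == 1]
  control <- pred[target == 0]
  mean(outer(case, control, ">") + 0.5 * outer(case, control, "=="))
}
auc_naive(c(0, 0), 0:1)                 # 0.5 (equality of scores)
auc_naive(c(0.2, 0.1, 1), c(0, 0, 1))   # 1 (perfect AUC)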
Value
The AUC, a probability, and possibly its 2.5% and 97.5% quantiles (95% CI).
Examples
set.seed(1)
AUC(c(0, 0), 0:1) # Equality of scores
AUC(c(0.2, 0.1, 1), c(0, 0, 1)) # Perfect AUC
x <- rnorm(100)
z <- rnorm(length(x), x, abs(x))
y <- as.numeric(z > 0)
print(AUC(x, y))
print(AUCBoot(x, y))
# Partial AUC
pAUC <- function(pred, target, p = 0.1) {
val.min <- min(target)
q <- quantile(pred[target == val.min], probs = 1 - p)
ind <- (target != val.min) | (pred > q)
bigstatsr::AUC(pred[ind], target[ind]) * p
}
pAUC(x, y)
pAUC(x, y, 0.2)
Split-Apply-Combine
Description
A Split-Apply-Combine strategy to apply common R functions to a Filebacked Big Matrix.
Usage
big_apply(
X,
a.FUN,
a.combine = NULL,
ind = cols_along(X),
ncores = 1,
block.size = block_size(nrow(X), ncores),
...
)
Arguments
X: An object of class FBM.
a.FUN: The function to be applied to each subset matrix. It must take a Filebacked Big Matrix as first argument and ind, a vector of indices, as second argument.
a.combine: Function to combine the results with do.call. This function should accept multiple arguments (...). For example, you can use c, cbind and rbind. This package also provides function plus to add multiple arguments together. The default is NULL, in which case the results are not combined and are returned as a list, each element being the result of a block.
ind: Initial vector of subsetting indices. Default is the vector of all column indices.
ncores: Number of cores used. Default doesn't use parallelism. You may use nb_cores.
block.size: Maximum number of columns (or rows, depending on how you use ind for subsetting) read at once. Default uses block_size.
...: Extra arguments to be passed to a.FUN.
Details
This function splits indices into parts, then applies a given function to each subset matrix and finally combines the results. If parallelization is used, this function first splits indices into parts for parallelization, then splits them again on each core, applies a given function to each part and finally combines the results (on each cluster and then from each cluster). See also the corresponding vignette.
See Also
big_parallelize, bigparallelr::split_parapply
Examples
X <- big_attachExtdata()
# get the means of each column
colMeans_sub <- function(X, ind) colMeans(X[, ind])
str(colmeans <- big_apply(X, a.FUN = colMeans_sub, a.combine = 'c'))
# get the norms of each column
colNorms_sub <- function(X, ind) sqrt(colSums(X[, ind]^2))
str(colnorms <- big_apply(X, colNorms_sub, a.combine = 'c'))
# get the sums of each row
# split along rows: need to change the "complete" `ind` parameter
str(rowsums <- big_apply(X, a.FUN = function(X, ind) rowSums(X[ind, ]),
ind = rows_along(X), a.combine = 'c',
block.size = 100))
# it is usually preferred to split along columns
# because matrices are stored by column.
str(rowsums2 <- big_apply(X, a.FUN = function(X, ind) rowSums(X[, ind]),
a.combine = 'plus'))
Attach a Filebacked Big Matrix
Description
Attach a Filebacked Big Matrix
Usage
big_attach(rdsfile)
big_attachExtdata()
Arguments
rdsfile: Path to a ".rds" file.
Value
The FBM object stored in the rdsfile.
Examples
# temporary FBM
X <- FBM(10, 10)$save()
rdsfile <- sub_bk(X$backingfile, ".rds")
X2 <- big_attach(rdsfile)
all.equal(X[], X2[])
Standard univariate statistics
Description
Standard univariate statistics for columns of a Filebacked Big Matrix. For now, the sum and var are implemented (the mean and sd can easily be deduced, see examples).
Usage
big_colstats(X, ind.row = rows_along(X), ind.col = cols_along(X), ncores = 1)
Arguments
X: An object of class FBM.
ind.row: An optional vector of the row indices that are used. If not specified, all rows are used. Don't use negative indices.
ind.col: An optional vector of the column indices that are used. If not specified, all columns are used. Don't use negative indices.
ncores: Number of cores used. Default doesn't use parallelism. You may use nb_cores.
Value
Data.frame of two numeric vectors sum and var with the corresponding column statistics.
Examples
set.seed(1)
X <- big_attachExtdata()
# Check the results
str(test <- big_colstats(X))
# Only with the first 100 rows
ind <- 1:100
str(test2 <- big_colstats(X, ind.row = ind))
plot(test$sum, test2$sum)
abline(lm(test2$sum ~ test$sum), col = "red", lwd = 2)
X.ind <- X[ind, ]
all.equal(test2$sum, colSums(X.ind))
all.equal(test2$var, apply(X.ind, 2, var))
# deduce mean and sd
# note that these are also implemented in big_scale()
means <- test2$sum / length(ind) # if using all rows,
# divide by nrow(X) instead
all.equal(means, colMeans(X.ind))
sds <- sqrt(test2$var)
all.equal(sds, apply(X.ind, 2, sd))
Copy as a Filebacked Big Matrix
Description
Deep copy of a Filebacked Big Matrix with possible subsetting. This should also work for any matrix-like object.
Usage
big_copy(
X,
ind.row = rows_along(X),
ind.col = cols_along(X),
type = typeof(X),
backingfile = tempfile(tmpdir = getOption("FBM.dir")),
block.size = block_size(length(ind.row)),
is_read_only = FALSE
)
Arguments
X: Could be any matrix-like object.
ind.row: An optional vector of the row indices that are used. If not specified, all rows are used. Don't use negative indices.
ind.col: An optional vector of the column indices that are used. If not specified, all columns are used. Don't use negative indices.
type: Type of the Filebacked Big Matrix (default is typeof(X)). Either "double", "float", "integer", "unsigned short", "unsigned char" or "raw".
backingfile: Path to the file storing the FBM data on disk. An extension ".bk" will be automatically added. Default stores in the temporary directory, which you can change using global option "FBM.dir".
block.size: Maximum number of columns read at once. Default uses block_size.
is_read_only: Whether the FBM is read-only? Default is FALSE.
Value
A copy of X as a new FBM object.
Examples
X <- FBM(10, 10, init = 1:100)
X[]
X2 <- big_copy(X, ind.row = 1:5)
X2[]
mat <- matrix(101:200, 10)
X3 <- big_copy(mat, type = "double") # as_FBM() would be faster here
X3[]
X.code <- big_attachExtdata()
class(X.code)
X2.code <- big_copy(X.code)
class(X2.code)
all.equal(X.code[], X2.code[])
Correlation
Description
Compute the (Pearson) correlation matrix of a Filebacked Big Matrix.
Usage
big_cor(
X,
ind.row = rows_along(X),
ind.col = cols_along(X),
block.size = block_size(nrow(X)),
backingfile = tempfile(tmpdir = getOption("FBM.dir"))
)
Arguments
X: An object of class FBM.
ind.row: An optional vector of the row indices that are used. If not specified, all rows are used. Don't use negative indices.
ind.col: An optional vector of the column indices that are used. If not specified, all columns are used. Don't use negative indices.
block.size: Maximum number of columns read at once. Default uses block_size.
backingfile: Path to the file storing the FBM data on disk. An extension ".bk" will be automatically added. Default stores in the temporary directory, which you can change using global option "FBM.dir".
Value
A temporary FBM, with the following two attributes:
- a numeric vector center of column scaling,
- a numeric vector scale of column scaling.
Matrix parallelization
Large matrix computations are made block-wise and won't be parallelized in order to not have to reduce the size of these blocks. Instead, you can use the MKL or OpenBLAS in order to accelerate these block matrix computations. You can control the number of cores used by these optimized matrix libraries with bigparallelr::set_blas_ncores().
Examples
X <- FBM(13, 17, init = rnorm(221))
# Comparing with cor
K <- big_cor(X)
class(K)
dim(K)
K$backingfile
true <- cor(X[])
all.equal(K[], true)
# Using only half of the data
n <- nrow(X)
ind <- sort(sample(n, n/2))
K2 <- big_cor(X, ind.row = ind)
true2 <- cor(X[ind, ])
all.equal(K2[], true2)
Counts for class FBM.code256
Description
Counts by columns (or rows) the number of each unique element of a FBM.code256.
Usage
big_counts(
X.code,
ind.row = rows_along(X.code),
ind.col = cols_along(X.code),
byrow = FALSE
)
Arguments
X.code: An object of class FBM.code256.
ind.row: An optional vector of the row indices that are used. If not specified, all rows are used. Don't use negative indices.
ind.col: An optional vector of the column indices that are used. If not specified, all columns are used. Don't use negative indices.
byrow: Count by rows rather than by columns? Default is FALSE.
Value
A matrix of counts of K x m (or n) elements, where
- K is the number of unique elements of the FBM.code256,
- n is its number of rows,
- m is its number of columns.
Beware that K is up to 256. So, if you apply this on a Filebacked Big Matrix of one million columns, you will create a matrix of nearly 1GB!
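As a quick check of that order of magnitude (assuming the counts are stored as 4-byte integers): 256 x 1e6 counts take about 256e6 x 4 bytes, i.e. roughly 1 GB.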
Examples
X <- big_attachExtdata()
class(X) # big_counts() is available for class FBM.code256 only
X[1:5, 1:8]
# by columns
big_counts(X, ind.row = 1:5, ind.col = 1:8)
# by rows
big_counts(X, ind.row = 1:5, ind.col = 1:8, byrow = TRUE)
Cross-product with a matrix
Description
Cross-product between a Filebacked Big Matrix and a matrix.
Usage
big_cprodMat(
X,
A.row,
ind.row = rows_along(X),
ind.col = cols_along(X),
ncores = 1,
block.size = block_size(nrow(X), ncores),
center = NULL,
scale = NULL
)
## S4 method for signature 'FBM,matrix'
crossprod(x, y)
## S4 method for signature 'FBM,matrix'
tcrossprod(x, y)
## S4 method for signature 'matrix,FBM'
crossprod(x, y)
## S4 method for signature 'matrix,FBM'
tcrossprod(x, y)
Arguments
X: An object of class FBM.
A.row: A matrix with length(ind.row) rows.
ind.row: An optional vector of the row indices that are used. If not specified, all rows are used. Don't use negative indices.
ind.col: An optional vector of the column indices that are used. If not specified, all columns are used. Don't use negative indices.
ncores: Number of cores used. Default doesn't use parallelism. You may use nb_cores.
block.size: Maximum number of columns read at once. Default uses block_size.
center: Vector of same length as ind.col to subtract from columns of X.
scale: Vector of same length as ind.col to divide from columns of X.
x: A 'double' FBM or a matrix.
y: A 'double' FBM or a matrix.
Value
X^T \cdot A.
Matrix parallelization
Large matrix computations are made block-wise and won't be parallelized in order to not have to reduce the size of these blocks. Instead, you can use the MKL or OpenBLAS in order to accelerate these block matrix computations. You can control the number of cores used by these optimized matrix libraries with bigparallelr::set_blas_ncores().
Examples
X <- big_attachExtdata()
n <- nrow(X)
m <- ncol(X)
A <- matrix(0, n, 10); A[] <- rnorm(length(A))
test <- big_cprodMat(X, A)
true <- crossprod(X[], A)
all.equal(test, true)
X2 <- big_copy(X, type = "double")
all.equal(crossprod(X2, A), true)
# subsetting
ind.row <- sample(n, n/2)
ind.col <- sample(m, m/2)
tryCatch(test2 <- big_cprodMat(X, A, ind.row, ind.col),
error = function(e) print(e))
# returns an error. You need to use the subset of A:
test2 <- big_cprodMat(X, A[ind.row, ], ind.row, ind.col)
true2 <- crossprod(X[ind.row, ind.col], A[ind.row, ])
all.equal(test2, true2)
Cross-product with a vector
Description
Cross-product between a Filebacked Big Matrix and a vector.
Usage
big_cprodVec(
X,
y.row,
ind.row = rows_along(X),
ind.col = cols_along(X),
center = NULL,
scale = NULL,
ncores = 1
)
Arguments
X: An object of class FBM.
y.row: A vector of same size as ind.row.
ind.row: An optional vector of the row indices that are used. If not specified, all rows are used. Don't use negative indices.
ind.col: An optional vector of the column indices that are used. If not specified, all columns are used. Don't use negative indices.
center: Vector of same length as ind.col to subtract from columns of X.
scale: Vector of same length as ind.col to divide from columns of X.
ncores: Number of cores used. Default doesn't use parallelism. You may use nb_cores.
Value
X^T \cdot y.
Examples
X <- big_attachExtdata()
n <- nrow(X)
m <- ncol(X)
y <- rnorm(n)
test <- big_cprodVec(X, y) # vector
true <- crossprod(X[], y) # one-column matrix
all.equal(test, as.numeric(true))
# subsetting
ind.row <- sample(n, n/2)
ind.col <- sample(m, m/2)
tryCatch(test2 <- big_cprodVec(X, y, ind.row, ind.col),
error = function(e) print(e))
# returns an error. You need to use the subset of y:
test2 <- big_cprodVec(X, y[ind.row], ind.row, ind.col)
true2 <- crossprod(X[ind.row, ind.col], y[ind.row])
all.equal(test2, as.numeric(true2))
Crossprod
Description
Compute X.row^T X.row for a Filebacked Big Matrix X after applying a particular scaling to it.
Usage
big_crossprodSelf(
X,
fun.scaling = big_scale(center = FALSE, scale = FALSE),
ind.row = rows_along(X),
ind.col = cols_along(X),
block.size = block_size(nrow(X)),
backingfile = tempfile(tmpdir = getOption("FBM.dir"))
)
## S4 method for signature 'FBM,missing'
crossprod(x, y)
Arguments
X: An object of class FBM.
fun.scaling: A function with parameters X, ind.row and ind.col, and that returns a data.frame with $center and $scale for the columns corresponding to ind.col. Default doesn't use any scaling. You can also provide your own center and scale by using as_scaling_fun().
ind.row: An optional vector of the row indices that are used. If not specified, all rows are used. Don't use negative indices.
ind.col: An optional vector of the column indices that are used. If not specified, all columns are used. Don't use negative indices.
block.size: Maximum number of columns read at once. Default uses block_size.
backingfile: Path to the file storing the FBM data on disk. An extension ".bk" will be automatically added. Default stores in the temporary directory, which you can change using global option "FBM.dir".
x: A 'double' FBM.
y: Missing.
Value
A temporary FBM, with the following two attributes:
- a numeric vector center of column scaling,
- a numeric vector scale of column scaling.
Matrix parallelization
Large matrix computations are made block-wise and won't be parallelized in order to not have to reduce the size of these blocks. Instead, you can use the MKL or OpenBLAS in order to accelerate these block matrix computations. You can control the number of cores used by these optimized matrix libraries with bigparallelr::set_blas_ncores().
Examples
X <- FBM(13, 17, init = rnorm(221))
true <- crossprod(X[])
# No scaling
K1 <- crossprod(X)
class(K1)
all.equal(K1, true)
K2 <- big_crossprodSelf(X)
class(K2)
K2$backingfile
all.equal(K2[], true)
# big_crossprodSelf() provides some scaling and subsetting
# Example using only half of the data:
n <- nrow(X)
ind <- sort(sample(n, n/2))
K3 <- big_crossprodSelf(X, fun.scaling = big_scale(), ind.row = ind)
true2 <- crossprod(scale(X[ind, ]))
all.equal(K3[], true2)
Increment an FBM
Description
Increment an FBM
Usage
big_increment(X, add, use_lock = FALSE)
Arguments
X: An FBM (of type double) to increment.
add: A matrix of same dimensions as X (or a vector of the same total size).
use_lock: Whether to use locks when incrementing. Default is FALSE. This is useful when incrementing in parallel.
Value
Returns nothing (NULL, invisibly).
Examples
X <- FBM(10, 10, init = 0)
mat <- matrix(rnorm(100), 10, 10)
big_increment(X, mat)
all.equal(X[], mat)
big_increment(X, mat)
all.equal(X[], 2 * mat)
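A hedged sketch (not from the package's own examples) of incrementing in parallel, where use_lock = TRUE should prevent concurrent writes from interleaving:
X <- FBM(10, 10, init = 0)
# each of the 4 calls below adds 1 to every entry of X, possibly in parallel
big_apply(X, function(X, ind) {
  big_increment(X, matrix(1, nrow(X), ncol(X)), use_lock = TRUE)
  NULL
}, ind = 1:4, block.size = 1, ncores = 2)
all(X[] == 4)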
Split-parApply-Combine
Description
A Split-Apply-Combine strategy to parallelize the evaluation of a function.
Usage
big_parallelize(
X,
p.FUN,
p.combine = NULL,
ind = cols_along(X),
ncores = nb_cores(),
...
)
Arguments
X: An object of class FBM.
p.FUN: The function to be applied to each subset matrix. It must take a Filebacked Big Matrix as first argument and ind, a vector of indices, as second argument.
p.combine: Function to combine the results with do.call. The default is NULL, in which case the results are not combined and are returned as a list, each element being the result of one of the cores.
ind: Initial vector of subsetting indices. Default is the vector of all column indices.
ncores: Number of cores used. Default uses nb_cores().
...: Extra arguments to be passed to p.FUN.
Details
This function splits indices into parts, then applies a given function to each part and finally combines the results.
Value
Returns a list of ncores elements, each element being the result of one of the cores, computed on a block. The elements of this list are then combined with do.call(p.combine, .) if p.combine is given.
See Also
big_apply, bigparallelr::split_parapply
Examples
## Not run: # CRAN is super slow when parallelism.
X <- big_attachExtdata()
### Computation on all the matrix
true <- big_colstats(X)
big_colstats_sub <- function(X, ind) {
big_colstats(X, ind.col = ind)
}
# 1. the computation is split along all the columns
# 2. for each part the computation is done, using `big_colstats`
# 3. the results (data.frames) are combined via `rbind`.
test <- big_parallelize(X, p.FUN = big_colstats_sub,
p.combine = 'rbind', ncores = 2)
all.equal(test, true)
### Computation on a part of the matrix
n <- nrow(X)
m <- ncol(X)
rows <- sort(sample(n, n/2)) # sort to provide some locality in accesses
cols <- sort(sample(m, m/2)) # idem
true2 <- big_colstats(X, ind.row = rows, ind.col = cols)
big_colstats_sub2 <- function(X, ind, rows, cols) {
big_colstats(X, ind.row = rows, ind.col = cols[ind])
}
# This doesn't work because, by default, the computation is spread
# along all columns. We must explicitly specify the `ind` parameter.
tryCatch(big_parallelize(X, p.FUN = big_colstats_sub2,
p.combine = 'rbind', ncores = 2,
rows = rows, cols = cols),
error = function(e) message(e))
# This now works, using `ind = seq_along(cols)`.
test2 <- big_parallelize(X, p.FUN = big_colstats_sub2,
p.combine = 'rbind', ncores = 2,
ind = seq_along(cols),
rows = rows, cols = cols)
all.equal(test2, true2)
## End(Not run)
Product with a matrix
Description
Product between a Filebacked Big Matrix and a matrix.
Usage
big_prodMat(
X,
A.col,
ind.row = rows_along(X),
ind.col = cols_along(X),
ncores = 1,
block.size = block_size(nrow(X), ncores),
center = NULL,
scale = NULL
)
## S4 method for signature 'FBM,matrix'
x %*% y
## S4 method for signature 'matrix,FBM'
x %*% y
Arguments
X: An object of class FBM.
A.col: A matrix with length(ind.col) rows.
ind.row: An optional vector of the row indices that are used. If not specified, all rows are used. Don't use negative indices.
ind.col: An optional vector of the column indices that are used. If not specified, all columns are used. Don't use negative indices.
ncores: Number of cores used. Default doesn't use parallelism. You may use nb_cores.
block.size: Maximum number of columns read at once. Default uses block_size.
center: Vector of same length as ind.col to subtract from columns of X.
scale: Vector of same length as ind.col to divide from columns of X.
x: A 'double' FBM or a matrix.
y: A 'double' FBM or a matrix.
Value
X \cdot A.
Matrix parallelization
Large matrix computations are made block-wise and won't be parallelized in order to not have to reduce the size of these blocks. Instead, you can use the MKL or OpenBLAS in order to accelerate these block matrix computations. You can control the number of cores used by these optimized matrix libraries with bigparallelr::set_blas_ncores().
Examples
X <- big_attachExtdata()
n <- nrow(X)
m <- ncol(X)
A <- matrix(0, m, 10); A[] <- rnorm(length(A))
test <- big_prodMat(X, A)
true <- X[] %*% A
all.equal(test, true)
X2 <- big_copy(X, type = "double")
all.equal(X2 %*% A, true)
# subsetting
ind.row <- sample(n, n/2)
ind.col <- sample(m, m/2)
tryCatch(test2 <- big_prodMat(X, A, ind.row, ind.col),
error = function(e) print(e))
# returns an error. You need to use the subset of A:
test2 <- big_prodMat(X, A[ind.col, ], ind.row, ind.col)
true2 <- X[ind.row, ind.col] %*% A[ind.col, ]
all.equal(test2, true2)
Product with a vector
Description
Product between a Filebacked Big Matrix and a vector.
Usage
big_prodVec(
X,
y.col,
ind.row = rows_along(X),
ind.col = cols_along(X),
center = NULL,
scale = NULL,
ncores = 1
)
Arguments
X: An object of class FBM.
y.col: A vector of same size as ind.col.
ind.row: An optional vector of the row indices that are used. If not specified, all rows are used. Don't use negative indices.
ind.col: An optional vector of the column indices that are used. If not specified, all columns are used. Don't use negative indices.
center: Vector of same length as ind.col to subtract from columns of X.
scale: Vector of same length as ind.col to divide from columns of X.
ncores: Number of cores used. Default doesn't use parallelism. You may use nb_cores.
Value
X \cdot y.
Examples
X <- big_attachExtdata()
n <- nrow(X)
m <- ncol(X)
y <- rnorm(m)
test <- big_prodVec(X, y) # vector
true <- X[] %*% y # one-column matrix
all.equal(test, as.numeric(true))
# subsetting
ind.row <- sample(n, n/2)
ind.col <- sample(m, m/2)
tryCatch(test2 <- big_prodVec(X, y, ind.row, ind.col),
error = function(e) print(e))
# returns an error. You need to use the subset of y:
test2 <- big_prodVec(X, y[ind.col], ind.row, ind.col)
true2 <- X[ind.row, ind.col] %*% y[ind.col]
all.equal(test2, as.numeric(true2))
Randomized partial SVD
Description
An algorithm for partial SVD (or PCA) of a Filebacked Big Matrix based on the
algorithm in RSpectra (by Yixuan Qiu and Jiali Mei).
This algorithm is linear in time in all dimensions and is very
memory-efficient. Thus, it can be used on very large Filebacked Big Matrices.
Usage
big_randomSVD(
X,
fun.scaling = big_scale(center = FALSE, scale = FALSE),
ind.row = rows_along(X),
ind.col = cols_along(X),
k = 10,
tol = 1e-04,
verbose = FALSE,
ncores = 1,
fun.prod = big_prodVec,
fun.cprod = big_cprodVec
)
Arguments
X: An object of class FBM.
fun.scaling: A function with parameters X, ind.row and ind.col, and that returns a data.frame with $center and $scale for the columns corresponding to ind.col. Default doesn't use any scaling. You can also provide your own center and scale by using as_scaling_fun().
ind.row: An optional vector of the row indices that are used. If not specified, all rows are used. Don't use negative indices.
ind.col: An optional vector of the column indices that are used. If not specified, all columns are used. Don't use negative indices.
k: Number of singular vectors/values to compute. Default is 10. This algorithm should be used to compute only a few singular vectors/values.
tol: Precision parameter of svds. Default is 1e-4.
verbose: Should some progress be printed? Default is FALSE.
ncores: Number of cores used. Default doesn't use parallelism. You may use nb_cores.
fun.prod: Function that takes 6 arguments (in this order) to compute the product of X and a vector. Default is big_prodVec (see the Usage above).
fun.cprod: Same as fun.prod, but for the product of the transpose of X and a vector. Default is big_cprodVec.
Value
A named list (an S3 class "big_SVD") of
-
d
, the singular values, -
u
, the left singular vectors, -
v
, the right singular vectors, -
niter
, the number of the iteration of the algorithm, -
nops
, number of Matrix-Vector multiplications used, -
center
, the centering vector, -
scale
, the scaling vector.
Note that to obtain the Principal Components, you must use predict on the result. See examples.
Note
The idea of using this Implicitly Restarted Arnoldi Method algorithm
comes from G. Abraham, Y. Qiu, and M. Inouye,
FlashPCA2: principal component analysis of biobank-scale genotype datasets,
bioRxiv: doi:10.1101/094714.
It proved to be faster than our implementation of the "blanczos" algorithm
in Rokhlin, V., Szlam, A., & Tygert, M. (2010).
A Randomized Algorithm for Principal Component Analysis.
SIAM Journal on Matrix Analysis and Applications, 31(3), 1100-1124.
doi:10.1137/080736417.
Examples
set.seed(1)
X <- big_attachExtdata()
K <- 10
# Using only half of the data for "training"
n <- nrow(X)
ind <- sort(sample(n, n/2))
test <- big_randomSVD(X, fun.scaling = big_scale(), ind.row = ind, k = K)
str(test)
pca <- prcomp(X[ind, ], center = TRUE, scale. = TRUE)
# same scaling
all.equal(test$center, pca$center)
all.equal(test$scale, pca$scale)
# use this function to predict scores
class(test)
scores <- predict(test)
# scores and loadings are the same or opposite
plot(scores, pca$x[, 1:K])
plot(test$v, pca$rotation[, 1:K])
plot(test$u)
plot(test, type = "scores")
# projecting on new data
ind2 <- setdiff(rows_along(X), ind)
scores.test2 <- predict(test, X, ind.row = ind2)
scores.test3 <- predict(pca, X[-ind, ])
plot(scores.test2, scores.test3[, 1:K])
Read a file as FBM
Description
Read a file as a Filebacked Big Matrix by using package {bigreadr}. For a mini-tutorial, please see this vignette.
Usage
big_read(
file,
select,
filter = NULL,
type = c("double", "float", "integer", "unsigned short", "unsigned char", "raw"),
backingfile = drop_ext(file),
...
)
Arguments
file: File to read.
select: Indices of columns to read (sorted). The length of select will be the number of columns of the resulting FBM.
filter: Vector used to subset the rows of each data frame.
type: Type of the Filebacked Big Matrix (default is "double"). Either "double", "float", "integer", "unsigned short", "unsigned char" or "raw".
backingfile: Path to the file storing the FBM data on disk. An extension ".bk" will be automatically added. Default uses file without its extension (drop_ext(file)).
...: Arguments passed on to the reading functions of package {bigreadr}.
Value
A Filebacked Big Matrix of type type with length(select) columns.
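Since this section has no examples, here is a hedged sketch (file names and dimensions are illustrative only):
csv <- tempfile(fileext = ".csv")
write.csv(matrix(rnorm(200), 20), csv, row.names = FALSE)
# read the first 5 columns into a filebacked matrix
X <- big_read(csv, select = 1:5, backingfile = tempfile())
dim(X)  # 20 x 5, stored on disk as a ".bk" file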
Some scaling functions
Description
Some scaling functions for a Filebacked Big Matrix to be used as the fun.scaling parameter of some functions of this package.
Usage
big_scale(center = TRUE, scale = TRUE)
Arguments
center: A logical value: whether to return means or 0s.
scale: A logical value: whether to return standard deviations or 1s. You can't use scale without using center.
Details
One could think of less common scalings, such as the "y-aware" scaling, which uses the inverse of the betas of column-wise linear regression as scaling (see this post for details). It would be easy to implement using big_colstats to get column means and big_univLinReg to get the betas (and then invert them), as in the sketch below.
Value
A new function that returns a data.frame of two vectors "center" and "scale", which are of the length of ind.col.
Examples
X <- big_attachExtdata()
# No scaling
big_noscale <- big_scale(center = FALSE, scale = FALSE)
class(big_noscale) # big_scale returns a new function
str(big_noscale(X))
big_noscale2 <- big_scale(center = FALSE)
str(big_noscale2(X)) # you can't scale without centering
# Centering
big_center <- big_scale(scale = FALSE)
str(big_center(X))
# + scaling
str(big_scale()(X))
Sparse linear regression
Description
Fit lasso (or elastic-net) penalized linear regression for a Filebacked Big Matrix. Covariables can be added (/!\ penalized by default /!\).
Usage
big_spLinReg(
X,
y.train,
ind.train = rows_along(X),
ind.col = cols_along(X),
covar.train = NULL,
base.train = NULL,
pf.X = NULL,
pf.covar = NULL,
alphas = 1,
power_scale = 1,
power_adaptive = 0,
K = 10,
ind.sets = NULL,
nlambda = 200,
nlam.min = 50,
n.abort = 10,
dfmax = 50000,
warn = TRUE,
ncores = 1,
...
)
Arguments
X: An object of class FBM.
y.train: Vector of responses, corresponding to ind.train.
ind.train: An optional vector of the row indices that are used, for the training part. If not specified, all rows are used. Don't use negative indices.
ind.col: An optional vector of the column indices that are used. If not specified, all columns are used. Don't use negative indices.
covar.train: Matrix of covariables to be added in each model to correct for confounders (e.g. the scores of PCA), corresponding to ind.train. You shouldn't have missing values in covariables.
base.train: Vector of base predictions. Model will be learned starting from these predictions. This can be useful if you want to previously fit a model with large-effect variables that you don't want to penalize.
pf.X: A multiplicative factor for the penalty applied to each coefficient. If supplied, pf.X must be a numeric vector of the same length as ind.col. Default is all 1.
pf.covar: Same as pf.X, but for covar.train.
alphas: The elastic-net mixing parameter that controls the relative contribution from the lasso (l1) and the ridge (l2) penalty. The penalty is defined as \alpha ||\beta||_1 + (1 - \alpha)/2 ||\beta||_2^2. alpha = 1 is the lasso penalty and alpha in between 0 (1e-4) and 1 is the elastic-net penalty. Default is 1. You can pass multiple values, and only one will be used (optimized by grid-search).
power_scale: When using lasso (alpha = 1), penalization to apply that is equivalent to scaling genotypes dividing by (standard deviation)^power_scale. Default is 1, corresponding to standard scaling. Using 0 would correspond to using unscaled variables and using 0.5 is Pareto scaling. If you pass multiple values, only one will be used (optimized by grid-search).
power_adaptive: Multiplicative penalty factor to apply to variables in the form of 1 / m_j^power_adaptive, where m_j is the marginal statistic for variable j. Default is 0, which effectively disables this option. If you pass multiple values, only one will be used (optimized by grid-search).
K: Number of sets used in the Cross-Model Selection and Averaging (CMSA) procedure. Default is 10.
ind.sets: Integer vectors of values between 1 and K specifying which set each index of the training set is in. Default randomly assigns these values.
nlambda: The number of lambda values. Default is 200.
nlam.min: Minimum number of lambda values to investigate. Default is 50.
n.abort: Number of lambda values for which prediction on the validation set must decrease before stopping. Default is 10.
dfmax: Upper bound for the number of nonzero coefficients. Default is 50e3.
warn: Whether to warn if some models may not have reached a minimum. Default is TRUE.
ncores: Number of cores used. Default doesn't use parallelism. You may use nb_cores.
...: Arguments passed on to COPY_biglasso_main (e.g. lambda.min.ratio, eps, max.iter).
Details
This is a modified version of one function of package biglasso. It adds the possibility to train models with covariables and to use many types of FBM (not only double ones). Yet, it only corresponds to screen = "SSR" (Sequential Strong Rules).
Also, to remove the choice of the lambda parameter, we introduce the Cross-Model Selection and Averaging (CMSA) procedure:
- This function separates the training set in K folds (e.g. 10).
- In turn,
  - each fold is considered as an inner validation set and the other (K - 1) folds form an inner training set,
  - the model is trained on the inner training set and the corresponding predictions (scores) for the inner validation set are computed,
  - the vector of scores which maximizes log-likelihood is determined,
  - the vector of coefficients corresponding to the previous vector of scores is chosen.
- The K resulting vectors of coefficients are then averaged into one final vector of coefficients.
Value
Return an object of class big_sp_list (a list of length(alphas) x K) that has 3 methods: predict, summary and plot.
References
Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J. and Tibshirani, R. J. (2012), Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74: 245-266. doi:10.1111/j.1467-9868.2011.01004.x.
Zeng, Y., and Breheny, P. (2017). The biglasso Package: A Memory- and Computation-Efficient Solver for Lasso Model Fitting with Big Data in R. doi:10.32614/RJ-2021-001.
Privé, F., Aschard, H., and Blum, M. G.B. (2019). Efficient implementation of penalized regression for genetic risk prediction. Genetics, 212: 65-74. doi:10.1534/genetics.119.302019.
Examples
set.seed(1)
# simulating some data
N <- 230
M <- 730
X <- FBM(N, M, init = rnorm(N * M, sd = 5))
y <- rowSums(X[, 1:10]) + rnorm(N)
covar <- matrix(rnorm(N * 3), N)
ind.train <- sort(sample(nrow(X), 150))
ind.test <- setdiff(rows_along(X), ind.train)
# fitting model for multiple lambdas and alphas
test <- big_spLinReg(X, y[ind.train], ind.train = ind.train,
covar.train = covar[ind.train, ],
alphas = c(1, 0.1), K = 3, warn = FALSE)
# peek at the models
plot(test)
summary(test, sort = TRUE)
summary(test, sort = TRUE)$message
# prediction for other data -> only the best alpha is used
summary(test, best.only = TRUE)
pred <- predict(test, X, ind.row = ind.test, covar.row = covar[ind.test, ])
plot(pred, y[ind.test], pch = 20); abline(0, 1, col = "red")
Sparse logistic regression
Description
Fit lasso (or elastic-net) penalized logistic regression for a Filebacked Big Matrix. Covariables can be added (/!\ penalized by default /!\).
Usage
big_spLogReg(
X,
y01.train,
ind.train = rows_along(X),
ind.col = cols_along(X),
covar.train = NULL,
base.train = NULL,
pf.X = NULL,
pf.covar = NULL,
alphas = 1,
power_scale = 1,
power_adaptive = 0,
K = 10,
ind.sets = NULL,
nlambda = 200,
nlam.min = 50,
n.abort = 10,
dfmax = 50000,
warn = TRUE,
ncores = 1,
...
)
Arguments
X: An object of class FBM.
y01.train: Vector of responses, corresponding to ind.train. Must be only 0s and 1s.
ind.train: An optional vector of the row indices that are used, for the training part. If not specified, all rows are used. Don't use negative indices.
ind.col: An optional vector of the column indices that are used. If not specified, all columns are used. Don't use negative indices.
covar.train: Matrix of covariables to be added in each model to correct for confounders (e.g. the scores of PCA), corresponding to ind.train. You shouldn't have missing values in covariables.
base.train: Vector of base predictions. Model will be learned starting from these predictions. This can be useful if you want to previously fit a model with large-effect variables that you don't want to penalize.
pf.X: A multiplicative factor for the penalty applied to each coefficient. If supplied, pf.X must be a numeric vector of the same length as ind.col. Default is all 1.
pf.covar: Same as pf.X, but for covar.train.
alphas: The elastic-net mixing parameter that controls the relative contribution from the lasso (l1) and the ridge (l2) penalty. The penalty is defined as \alpha ||\beta||_1 + (1 - \alpha)/2 ||\beta||_2^2. alpha = 1 is the lasso penalty and alpha in between 0 (1e-4) and 1 is the elastic-net penalty. Default is 1. You can pass multiple values, and only one will be used (optimized by grid-search).
power_scale: When using lasso (alpha = 1), penalization to apply that is equivalent to scaling genotypes dividing by (standard deviation)^power_scale. Default is 1, corresponding to standard scaling. Using 0 would correspond to using unscaled variables and using 0.5 is Pareto scaling. If you pass multiple values, only one will be used (optimized by grid-search).
power_adaptive: Multiplicative penalty factor to apply to variables in the form of 1 / m_j^power_adaptive, where m_j is the marginal statistic for variable j. Default is 0, which effectively disables this option. If you pass multiple values, only one will be used (optimized by grid-search).
K: Number of sets used in the Cross-Model Selection and Averaging (CMSA) procedure. Default is 10.
ind.sets: Integer vectors of values between 1 and K specifying which set each index of the training set is in. Default randomly assigns these values.
nlambda: The number of lambda values. Default is 200.
nlam.min: Minimum number of lambda values to investigate. Default is 50.
n.abort: Number of lambda values for which prediction on the validation set must decrease before stopping. Default is 10.
dfmax: Upper bound for the number of nonzero coefficients. Default is 50e3.
warn: Whether to warn if some models may not have reached a minimum. Default is TRUE.
ncores: Number of cores used. Default doesn't use parallelism. You may use nb_cores.
...: Arguments passed on to COPY_biglasso_main (e.g. lambda.min.ratio, eps, max.iter).
Details
This is a modified version of one function of package biglasso. It adds the possibility to train models with covariables and to use many types of FBM (not only double ones). Yet, it only corresponds to screen = "SSR" (Sequential Strong Rules).
Also, to remove the choice of the lambda parameter, we introduce the Cross-Model Selection and Averaging (CMSA) procedure:
- This function separates the training set in K folds (e.g. 10).
- In turn,
  - each fold is considered as an inner validation set and the other (K - 1) folds form an inner training set,
  - the model is trained on the inner training set and the corresponding predictions (scores) for the inner validation set are computed,
  - the vector of scores which maximizes log-likelihood is determined,
  - the vector of coefficients corresponding to the previous vector of scores is chosen.
- The K resulting vectors of coefficients are then averaged into one final vector of coefficients.
Value
Return an object of class big_sp_list (a list of length(alphas) x K) that has 3 methods: predict, summary and plot.
References
Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J. and Tibshirani, R. J. (2012), Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74: 245-266. doi:10.1111/j.1467-9868.2011.01004.x.
Zeng, Y., and Breheny, P. (2017). The biglasso Package: A Memory- and Computation-Efficient Solver for Lasso Model Fitting with Big Data in R. doi:10.32614/RJ-2021-001.
Privé, F., Aschard, H., and Blum, M. G.B. (2019). Efficient implementation of penalized regression for genetic risk prediction. Genetics, 212: 65-74. doi:10.1534/genetics.119.302019.
Examples
set.seed(2)
# simulating some data
N <- 230
M <- 730
X <- FBM(N, M, init = rnorm(N * M, sd = 5))
y01 <- as.numeric((rowSums(X[, 1:10]) + 2 * rnorm(N)) > 0)
covar <- matrix(rnorm(N * 3), N)
ind.train <- sort(sample(nrow(X), 150))
ind.test <- setdiff(rows_along(X), ind.train)
# fitting model for multiple lambdas and alphas
test <- big_spLogReg(X, y01[ind.train], ind.train = ind.train,
covar.train = covar[ind.train, ],
alphas = c(1, 0.1), K = 3, warn = FALSE)
# peek at the models
plot(test)
summary(test, sort = TRUE)
summary(test, sort = TRUE)$message
# prediction for other data -> only the best alpha is used
summary(test, best.only = TRUE)
pred <- predict(test, X, ind.row = ind.test, covar.row = covar[ind.test, ])
AUC(pred, y01[ind.test])
library(ggplot2)
qplot(pred, fill = as.logical(y01[ind.test]),
geom = "density", alpha = I(0.4)) +
labs(fill = "Case?") +
theme_bigstatsr() +
theme(legend.position = c(0.52, 0.8))
Partial SVD
Description
An algorithm for partial SVD (or PCA) of a Filebacked Big Matrix through the eigen decomposition of the covariance between variables (primal) or observations (dual). Use this algorithm only if there is one dimension that is much smaller than the other. Otherwise use big_randomSVD.
Usage
big_SVD(
X,
fun.scaling = big_scale(center = FALSE, scale = FALSE),
ind.row = rows_along(X),
ind.col = cols_along(X),
k = 10,
block.size = block_size(nrow(X))
)
Arguments
X: An object of class FBM.
fun.scaling: A function with parameters X, ind.row and ind.col, and that returns a data.frame with $center and $scale for the columns corresponding to ind.col. Default doesn't use any scaling. You can also provide your own center and scale by using as_scaling_fun().
ind.row: An optional vector of the row indices that are used. If not specified, all rows are used. Don't use negative indices.
ind.col: An optional vector of the column indices that are used. If not specified, all columns are used. Don't use negative indices.
k: Number of singular vectors/values to compute. Default is 10.
block.size: Maximum number of columns read at once. Default uses block_size.
Details
To get X = U \cdot D \cdot V^T,
- if the number of observations is small, this function computes K_2 = X \cdot X^T \approx U \cdot D^2 \cdot U^T and then V = X^T \cdot U \cdot D^{-1},
- if the number of variables is small, this function computes K_1 = X^T \cdot X \approx V \cdot D^2 \cdot V^T and then U = X \cdot V \cdot D^{-1},
- if both dimensions are large, use big_randomSVD instead.
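A hedged numeric check of the first identity with a plain matrix (illustration only, not how big_SVD is implemented internally):
X <- matrix(rnorm(50), 5, 10)            # n = 5 observations << m = 10 variables
e <- eigen(tcrossprod(X))                # K_2 = X X^T = U D^2 U^T
U <- e$vectors
D <- sqrt(e$values)
V <- crossprod(X, sweep(U, 2, D, "/"))   # V = X^T U D^{-1}
all.equal(abs(V), abs(svd(X)$v), check.attributes = FALSE)  # TRUE, up to signs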
Value
A named list (an S3 class "big_SVD") of
-
d
, the singular values, -
u
, the left singular vectors, -
v
, the right singular vectors, -
center
, the centering vector, -
scale
, the scaling vector.
Note that to obtain the Principal Components, you must use predict on the result. See examples.
Matrix parallelization
Large matrix computations are made block-wise and won't be parallelized in order to not have to reduce the size of these blocks. Instead, you can use the MKL or OpenBLAS in order to accelerate these block matrix computations. You can control the number of cores used by these optimized matrix libraries with bigparallelr::set_blas_ncores().
Examples
set.seed(1)
X <- big_attachExtdata()
n <- nrow(X)
# Using only half of the data
ind <- sort(sample(n, n/2))
test <- big_SVD(X, fun.scaling = big_scale(), ind.row = ind)
str(test)
plot(test$u)
pca <- prcomp(X[ind, ], center = TRUE, scale. = TRUE)
# same scaling
all.equal(test$center, pca$center)
all.equal(test$scale, pca$scale)
# scores and loadings are the same or opposite
# except for last eigenvalue which is equal to 0
# due to centering of columns
scores <- test$u %*% diag(test$d)
class(test)
scores2 <- predict(test) # use this function to predict scores
all.equal(scores, scores2)
dim(scores)
dim(pca$x)
tail(pca$sdev)
plot(scores2, pca$x[, 1:ncol(scores2)])
plot(test$v[1:100, ], pca$rotation[1:100, 1:ncol(scores2)])
# projecting on new data
X2 <- sweep(sweep(X[-ind, ], 2, test$center, '-'), 2, test$scale, '/')
scores.test <- X2 %*% test$v
ind2 <- setdiff(rows_along(X), ind)
scores.test2 <- predict(test, X, ind.row = ind2) # use this
all.equal(scores.test, scores.test2)
scores.test3 <- predict(pca, X[-ind, ])
plot(scores.test2, scores.test3[, 1:ncol(scores.test2)])
Tcrossprod
Description
Compute X.row X.row^T for a Filebacked Big Matrix X after applying a particular scaling to it.
Usage
big_tcrossprodSelf(
X,
fun.scaling = big_scale(center = FALSE, scale = FALSE),
ind.row = rows_along(X),
ind.col = cols_along(X),
block.size = block_size(nrow(X))
)
## S4 method for signature 'FBM,missing'
tcrossprod(x, y)
Arguments
X: An object of class FBM.
fun.scaling: A function with parameters X, ind.row and ind.col, and that returns a data.frame with $center and $scale for the columns corresponding to ind.col. Default doesn't use any scaling. You can also provide your own center and scale by using as_scaling_fun().
ind.row: An optional vector of the row indices that are used. If not specified, all rows are used. Don't use negative indices.
ind.col: An optional vector of the column indices that are used. If not specified, all columns are used. Don't use negative indices.
block.size: Maximum number of columns read at once. Default uses block_size.
x: A 'double' FBM.
y: Missing.
Value
A temporary FBM, with the following two attributes:
- a numeric vector center of column scaling,
- a numeric vector scale of column scaling.
Matrix parallelization
Large matrix computations are made block-wise and won't be parallelized in order to not have to reduce the size of these blocks. Instead, you can use the MKL or OpenBLAS in order to accelerate these block matrix computations. You can control the number of cores used by these optimized matrix libraries with bigparallelr::set_blas_ncores().
Examples
X <- FBM(13, 17, init = rnorm(221))
true <- tcrossprod(X[])
# No scaling
K1 <- tcrossprod(X)
class(K1)
all.equal(K1, true)
K2 <- big_tcrossprodSelf(X)
class(K2)
K2$backingfile
all.equal(K2[], true)
# big_tcrossprodSelf() provides some scaling and subsetting
# Example using only half of the data:
n <- nrow(X)
ind <- sort(sample(n, n/2))
K3 <- big_tcrossprodSelf(X, fun.scaling = big_scale(), ind.row = ind)
true2 <- tcrossprod(scale(X[ind, ]))
all.equal(K3[], true2)
Transpose an FBM
Description
This function implements a simple cache-oblivious algorithm for the transposition of a Filebacked Big Matrix.
Usage
big_transpose(X, backingfile = tempfile(tmpdir = getOption("FBM.dir")))
Arguments
X: An object of class FBM.
backingfile: Path to the file storing the FBM data on disk. An extension ".bk" will be automatically added. Default stores in the temporary directory, which you can change using global option "FBM.dir".
Value
The new transposed FBM. Dimensions and type are automatically determined from the input FBM.
Examples
X <- FBM(6, 5, init = rnorm(30))
X[]
Xt <- big_transpose(X)
identical(t(X[]), Xt[])
Column-wise linear regression
Description
Slopes of column-wise linear regressions of each column of a Filebacked Big Matrix, with some other associated statistics. Covariates can be added to correct for confounders.
Usage
big_univLinReg(
X,
y.train,
ind.train = rows_along(X),
ind.col = cols_along(X),
covar.train = NULL,
thr.eigval = 1e-04,
ncores = 1
)
Arguments
X: An object of class FBM.
y.train: Vector of responses, corresponding to ind.train.
ind.train: An optional vector of the row indices that are used, for the training part. If not specified, all rows are used. Don't use negative indices.
ind.col: An optional vector of the column indices that are used. If not specified, all columns are used. Don't use negative indices.
covar.train: Matrix of covariables to be added in each model to correct for confounders (e.g. the scores of PCA), corresponding to ind.train. You shouldn't have missing values in covariables.
thr.eigval: Threshold to remove "insignificant" singular vectors. Default is 1e-4.
ncores: Number of cores used. Default doesn't use parallelism. You may use nb_cores.
Value
A data.frame with 3 elements:
- the slopes of each regression,
- the standard errors of each slope,
- the t-scores associated with each slope.
This is also an object of class mhtest. See methods(class = "mhtest").
Examples
set.seed(1)
X <- big_attachExtdata()
n <- nrow(X)
y <- rnorm(n)
covar <- matrix(rnorm(n * 3), n)
X1 <- X[, 1] # only first column of the Filebacked Big Matrix
# Without covar
test <- big_univLinReg(X, y)
## New class `mhtest`
class(test)
attr(test, "transfo")
attr(test, "predict")
## plot results
plot(test)
plot(test, type = "Volcano")
## To get p-values associated with the test
test$p.value <- predict(test, log10 = FALSE)
str(test)
summary(lm(y ~ X1))$coefficients[2, ]
# With all data
str(big_univLinReg(X, y, covar.train = covar))
summary(lm(y ~ X1 + covar))$coefficients[2, ]
# With only half of the data
ind.train <- sort(sample(n, n/2))
str(big_univLinReg(X, y[ind.train],
covar.train = covar[ind.train, ],
ind.train = ind.train))
summary(lm(y ~ X1 + covar, subset = ind.train))$coefficients[2, ]
Column-wise logistic regression
Description
Slopes of column-wise logistic regressions of each column of a Filebacked Big Matrix, with some other associated statistics. Covariates can be added to correct for confounders.
Usage
big_univLogReg(
X,
y01.train,
ind.train = rows_along(X),
ind.col = cols_along(X),
covar.train = NULL,
tol = 1e-08,
maxiter = 20,
ncores = 1
)
Arguments
X: An object of class FBM.
y01.train: Vector of responses, corresponding to ind.train. Must be only 0s and 1s.
ind.train: An optional vector of the row indices that are used, for the training part. If not specified, all rows are used. Don't use negative indices.
ind.col: An optional vector of the column indices that are used. If not specified, all columns are used. Don't use negative indices.
covar.train: Matrix of covariables to be added in each model to correct for confounders (e.g. the scores of PCA), corresponding to ind.train. You shouldn't have missing values in covariables.
tol: Relative tolerance to assess convergence of the coefficient. Default is 1e-8.
maxiter: Maximum number of iterations before giving up. Default is 20.
ncores: Number of cores used. Default doesn't use parallelism. You may use nb_cores.
Details
If convergence is not reached by the main algorithm for some columns, the corresponding niter element is set to NA and a message is given. Then, glm is used instead for the corresponding column. If it can't converge either, all corresponding estimations are set to NA.
Value
A data.frame with 4 elements:
- the slopes of each regression,
- the standard errors of each slope,
- the number of iterations for each slope (if NA, the algorithm didn't converge, and glm was used instead),
- the z-scores associated with each slope.
This is also an object of class mhtest. See methods(class = "mhtest").
Examples
set.seed(1)
X <- big_attachExtdata()
n <- nrow(X)
y01 <- sample(0:1, size = n, replace = TRUE)
covar <- matrix(rnorm(n * 3), n)
X1 <- X[, 1] # only first column of the Filebacked Big Matrix
# Without covar
test <- big_univLogReg(X, y01)
## new class `mhtest`
class(test)
attr(test, "transfo")
attr(test, "predict")
## plot results
plot(test)
plot(test, type = "Volcano")
## To get p-values associated with the test
test$p.value <- predict(test, log10 = FALSE)
str(test)
summary(glm(y01 ~ X1, family = "binomial"))$coefficients[2, ]
# With all data
str(big_univLogReg(X, y01, covar.train = covar))
summary(glm(y01 ~ X1 + covar, family = "binomial"))$coefficients[2, ]
# With only half of the data
ind.train <- sort(sample(n, n/2))
str(big_univLogReg(X, y01[ind.train],
covar.train = covar[ind.train, ],
ind.train = ind.train))
summary(glm(y01 ~ X1 + covar, family = "binomial",
subset = ind.train))$coefficients[2, ]
Write an FBM to a file
Description
Write a file from a Filebacked Big Matrix (by parts).
Usage
big_write(
X,
file,
every_nrow,
...,
ind.row = rows_along(X),
ind.col = cols_along(X),
progress = FALSE
)
Arguments
X: An object of class FBM.
file: File to write to.
every_nrow: Number of rows to write at once.
...: Other arguments to be passed to data.table::fwrite, except x, file, append, row.names, col.names and showProgress.
ind.row: An optional vector of the row indices that are used. If not specified, all rows are used. Don't use negative indices.
ind.col: An optional vector of the column indices that are used. If not specified, all columns are used. Don't use negative indices.
progress: Show progress? Default is FALSE.
Value
Input parameter file, invisibly.
Examples
X <- big_attachExtdata()
csv <- big_write(X, tempfile(), every_nrow = 100, progress = interactive())
Determine a correct value for the block.size parameter
Description
It determines the value of block.size such that a matrix of doubles of size n x block.size takes less memory than getOption("bigstatsr.block.sizeGB") GigaBytes (default is 1GB).
Usage
block_size(n, ncores = 1)
Arguments
n: The number of rows.
ncores: The number of cores.
Value
An integer >= 1.
Examples
block_size(1e3)
block_size(1e6)
block_size(1e6, 6)
Sparse regression path
Description
Fit solution paths for linear or logistic regression models penalized by lasso (alpha = 1) or elastic-net (1e-4 < alpha < 1) over a grid of values for the regularization parameter lambda.
Usage
COPY_biglasso_main(
X,
y.train,
ind.train,
ind.col,
covar.train,
family = c("gaussian", "binomial"),
alphas = 1,
K = 10,
ind.sets = NULL,
nlambda = 200,
lambda.min.ratio = if (n > p) 1e-04 else 0.001,
nlam.min = 50,
n.abort = 10,
base.train = NULL,
pf.X = NULL,
pf.covar = NULL,
eps = 1e-05,
max.iter = 1000,
dfmax = 50000,
lambda.min = if (n > p) 1e-04 else 0.001,
power_scale = 1,
power_adaptive = 0,
return.all = FALSE,
warn = TRUE,
ncores = 1
)
Arguments
family: Either "gaussian" (linear) or "binomial" (logistic).
alphas: The elastic-net mixing parameter that controls the relative contribution from the lasso (l1) and the ridge (l2) penalty. The penalty is defined as \alpha ||\beta||_1 + (1 - \alpha)/2 ||\beta||_2^2. alpha = 1 is the lasso penalty and alpha in between 0 (1e-4) and 1 is the elastic-net penalty. Default is 1. You can pass multiple values, and only one will be used (optimized by grid-search).
K: Number of sets used in the Cross-Model Selection and Averaging (CMSA) procedure. Default is 10.
ind.sets: Integer vectors of values between 1 and K specifying which set each index of the training set is in. Default randomly assigns these values.
nlambda: The number of lambda values. Default is 200.
lambda.min.ratio: The smallest value for lambda, as a fraction of lambda.max. Default is 1e-4 if the number of observations is larger than the number of variables, and 1e-3 otherwise.
nlam.min: Minimum number of lambda values to investigate. Default is 50.
n.abort: Number of lambda values for which prediction on the validation set must decrease before stopping. Default is 10.
base.train: Vector of base predictions. Model will be learned starting from these predictions. This can be useful if you want to previously fit a model with large-effect variables that you don't want to penalize.
pf.X: A multiplicative factor for the penalty applied to each coefficient. If supplied, pf.X must be a numeric vector of the same length as ind.col. Default is all 1.
pf.covar: Same as pf.X, but for covar.train.
eps: Convergence threshold for inner coordinate descent. The algorithm iterates until the maximum change in the objective after any coefficient update is less than eps times the null deviance. Default is 1e-5.
max.iter: Maximum number of iterations. Default is 1000.
dfmax: Upper bound for the number of nonzero coefficients. Default is 50e3.
lambda.min: This parameter has been renamed lambda.min.ratio.
power_scale: When using lasso (alpha = 1), penalization to apply that is equivalent to scaling genotypes dividing by (standard deviation)^power_scale. Default is 1, corresponding to standard scaling. Using 0 would correspond to using unscaled variables and using 0.5 is Pareto scaling. If you pass multiple values, only one will be used (optimized by grid-search).
power_adaptive: Multiplicative penalty factor to apply to variables in the form of 1 / m_j^power_adaptive, where m_j is the marginal statistic for variable j. Default is 0, which effectively disables this option. If you pass multiple values, only one will be used (optimized by grid-search).
return.all: Deprecated. Now always return all models.
warn: Whether to warn if some models may not have reached a minimum. Default is TRUE.
Details
The objective function for linear regression (family = "gaussian") is
\frac{1}{2n} \textrm{RSS} + \textrm{penalty},
and for logistic regression (family = "binomial") it is
-\frac{1}{n} \textrm{loglike} + \textrm{penalty}.
Train one model
Description
Train one model
Usage
COPY_biglasso_part(
X,
y.train,
ind.train,
ind.col,
covar.train,
family,
lambda,
center,
scale,
resid,
alpha,
eps,
max.iter,
dfmax,
ind.val,
covar.val,
y.val,
n.abort,
nlam.min,
base.train,
base.val,
pf
)
Value
A named list with the following variables:
intercept: A vector of intercepts, corresponding to each lambda.
beta: The vector of coefficients that minimized the loss on the validation set.
iter: A vector of the number of iterations until convergence at each value of lambda.
lambda: The sequence of regularization parameter values in the path.
alpha: Input parameter.
loss: A vector containing either the residual sum of squares (for linear models) or negative log-likelihood (for logistic models) of the fitted model at each value of lambda.
loss.val: A vector containing the loss for the corresponding validation set.
message: Reason the fitting has stopped.
nb_active: The number of active (non-zero) variables along the regularization path.
nb_candidate: The number of candidate variables (used in the gradient descent) along the regularization path.
ind.train: Indices of the training set.
Numeric matrix from data frame
Description
Transform a data frame to a numeric matrix by one-hot encoding factors. The last factor value is always omitted to prevent having a singular matrix when adding a column of 1s (intercept) in models.
Usage
covar_from_df(df)
Arguments
df |
A data frame. |
Value
A numeric matrix.
Examples
mat <- covar_from_df(iris)
head(mat)
Create an Implementation of [ For Custom Matrix-Like Types
Description
extract is a function that converts different index types such as negative integer vectors or logical vectors passed to the [ function as i (e.g. X[i]) or as i and j (e.g. X[i, j]) into positive integer vectors. The converted indices are provided as the i parameter of extract_vector or the i and j parameters of extract_matrix, to facilitate implementing the extraction mechanism for custom matrix-like types.
Usage
Extract(extract_vector, extract_matrix)
Arguments
extract_vector |
A function in the form of function(x, i) that extracts elements for a vector of positive integer indices i. |
extract_matrix |
A function in the form of function(x, i, j) that extracts elements for vectors of positive integer indices i and j. |
Details
This idea initially comes from package crochet.
Value
A function in the form of function(x, i, j, ..., drop = TRUE) that is meant to be used as a method for [ for a custom type.
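Examples
A minimal sketch of how this is meant to be wired up, using a hypothetical matrix-like type MiniMat (an S4 wrapper around a plain matrix stored in an environment for reference semantics; introduced here purely for illustration, not part of the package):
setClass("MiniMat", slots = c(env = "environment"))
MiniMat <- function(data) new("MiniMat", env = list2env(list(data = data)))
# Extract() converts indices via dim(x), so the type needs a dim() method
setMethod("dim", "MiniMat", function(x) dim(x@env$data))
setMethod("[", signature("MiniMat", "ANY", "ANY", "ANY"), Extract(
  extract_vector = function(x, i)    x@env$data[i],
  extract_matrix = function(x, i, j) x@env$data[i, j]
))
M <- MiniMat(matrix(1:12, 3))
M[2, c(1, 3)]  # i and j arrive as positive integer vectors
M[-1, ]        # negative indices have already been converted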
Class FBM
Description
A reference class for storing and accessing matrix-like data stored in files on disk. This is very similar to Filebacked Big Matrices provided by the bigmemory package (see the corresponding vignette).
Convert a matrix (or a data frame) to an FBM.
Usage
FBM(
nrow,
ncol,
type = c("double", "float", "integer", "unsigned short", "unsigned char", "raw"),
init = NULL,
backingfile = tempfile(tmpdir = getOption("FBM.dir")),
create_bk = TRUE,
is_read_only = FALSE
)
as_FBM(
x,
type = c("double", "float", "integer", "unsigned short", "unsigned char", "raw"),
backingfile = tempfile(tmpdir = getOption("FBM.dir")),
is_read_only = FALSE
)
Arguments
nrow |
Number of rows. |
ncol |
Number of columns. |
type |
Type of the Filebacked Big Matrix (default is "double"). |
init |
Either a single value (e.g. 0) or as many values as the number of elements of the FBM. Default doesn't initialize the matrix. |
backingfile |
Path to the file storing the FBM data on disk. An extension ".bk" will be automatically added. Default stores in the temporary directory, which you can change using global option "FBM.dir". |
create_bk |
Whether to create a backingfile (the default) or use an
existing one (which should be named by the backingfile parameter and have the extension '.bk'). |
is_read_only |
Whether the FBM is read-only? Default is FALSE. |
x |
A matrix or a data frame (2-dimensional data). |
Details
An object of class FBM has many fields:
- $address: address of the external pointer containing the underlying C++ object for read-only mapping, to be used as a XPtr<FBM> in C++ code
- $extptr: (internal) use $address instead
- $address_rw: address of the external pointer containing the underlying C++ object for read/write mapping, to be used as a XPtr<FBM_RW> in C++ code
- $extptr_rw: (internal) use $address_rw instead
- $nrow: number of rows
- $ncol: number of columns
- $type: (internal) use $type_size or $type_chr instead
- $type_chr: FBM type as character, e.g. "double"
- $type_size: size of FBM type in bytes (e.g. "double" is 8 and "float" is 4)
- $backingfile or $bk: file with extension '.bk' that stores the numeric data of the FBM
- $rds: '.rds' file (which may not exist) corresponding to the '.bk' file
- $is_saved: whether this object is stored in $rds?
- $is_read_only: whether modifying the data is disallowed?
And some methods:
- $save(): Save the FBM object in $rds. Returns the FBM.
- $add_columns(<ncol_add>): Add some columns to the FBM by appending the backingfile with some data. Returns the FBM invisibly.
- $bm(): Get this object as a filebacked.big.matrix to be used by package {bigmemory}.
- $bm.desc(): Get this object as a filebacked.big.matrix descriptor to be used by package {bigmemory}.
- $check_write_permissions(): Error if the FBM is read-only.
Examples
mat <- matrix(1:4, 2)
X_from_mat <- as_FBM(mat)
## You can save this object in an .rds file to use it in another session
X_from_mat$is_saved
X_from_mat$save()
X_from_mat$is_saved
(rds <- X_from_mat$rds)
## Use big_attach() to load the FBM object in another session
X_from_mat <- big_attach(rds)
## Standard accessors
X <- FBM(10, 10)
typeof(X)
X[] <- rnorm(length(X))
X[, 1:6]
X[] <- 1:100
X[, 1]
X[1, ] # not recommended for large matrices
X[, -1]
X[, c(TRUE, FALSE)]
X[cbind(1:10, 1:10)] <- NA_real_
X[] # access as standard R matrix
X <- FBM(150, 5)
X[] <- iris ## you can replace with a df (but factors -> integers)
X2 <- as_FBM(iris)
identical(X[], X2[])
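## A hedged sketch of the $add_columns() method described above
## (assuming, as for FBM initialization, that appended columns start as zeros)
X3 <- FBM(10, 5, init = 0)
X3$add_columns(2)  # appends 2 columns to the backingfile
dim(X3)            # now 10 x 7
X3[1, ]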
Methods for the FBM class
Description
Methods for the FBM class
Accessor methods for class FBM. You can use positive and negative indices, logical indices (that are recycled) and also a matrix of indices (but only positive ones).
Dimension and type methods for class FBM.
Usage
## S4 method for signature 'FBM,ANY,ANY,ANY'
x[i, j, ..., drop = TRUE]
## S4 replacement method for signature 'FBM,ANY,ANY,ANY'
x[i, j, ...] <- value
## S4 method for signature 'FBM'
dim(x)
## S4 method for signature 'FBM'
length(x)
## S4 method for signature 'FBM'
typeof(x)
## S4 method for signature 'FBM'
diag(x)
Arguments
x |
An FBM object. |
i |
A vector of indices (or nothing). You can use positive and negative indices, logical indices (that are recycled) and also a matrix of indices (but only positive ones). |
j |
A vector of indices (or nothing). You can use positive and negative indices, logical indices (that are recycled). |
... |
Not used. Just to make nargs work. |
drop |
Whether to drop dimensions of the result that have length 1. |
value |
The values to replace. Should be of length 1 or of the same length as the subset to replace. |
Class FBM.code256
Description
A reference class for storing and accessing up to 256 arbitrary different values using a Filebacked Big Matrix of type unsigned char. Compared to a Filebacked Big Matrix, it adds a slot code which is used as a lookup table of size 256.
Usage
FBM.code256(
nrow,
ncol,
code = rep(NA_real_, 256),
init = NULL,
backingfile = tempfile(tmpdir = getOption("FBM.dir")),
create_bk = TRUE,
is_read_only = FALSE
)
add_code256(x, code)
Arguments
nrow |
Number of rows. |
ncol |
Number of columns. |
code |
A numeric vector (of length 256).
You should construct it with rep(NA_real_, 256) and then fill in the values you need. |
init |
Either a single value (e.g. 0) or as many values as the number of elements of the FBM. Default doesn't initialize the matrix. |
backingfile |
Path to the file storing the FBM data on disk. An extension ".bk" will be automatically added. Default stores in the temporary directory, which you can change using global option "FBM.dir". |
create_bk |
Whether to create a backingfile (the default) or use an
existing one (which should be named by the backingfile parameter and have the extension '.bk'). |
is_read_only |
Whether the FBM is read-only? Default is FALSE. |
x |
An FBM. |
Examples
X <- FBM(10, 10, type = "raw")
X[] <- sample(as.raw(0:3), size = length(X), replace = TRUE)
X[]
# From an FBM of type 'raw' ('unsigned char')
code <- rep(NA_real_, 256)
code[1:3] <- c(1, 3, 5)
X.code <- add_code256(X, code)
X.code[]
# Or directly
X.code2 <- FBM.code256(10, 10, code, init = sample(as.raw(0:3), 100, TRUE))
X.code2[]
# Get a new FBM.code256 object with another code (but same underlying data)
X.code3 <- X.code$copy(code = rnorm(256))
all.equal(X.code$code256, code)
Combine sets of coefficients
Description
Combine sets of coefficients
Usage
get_beta(betas, method = c("geometric-median", "mean-wise", "median-wise"))
Arguments
betas |
Matrix of coefficient vectors to be combined. |
method |
Method for combining vectors of coefficients. The default uses the geometric median. |
Value
A vector of resulting coefficients.
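Examples
No example ships with this entry; a toy sketch, assuming each column of betas holds one coefficient vector (e.g. one per CMSA set):
set.seed(1)
betas <- matrix(rnorm(10 * 4), nrow = 10)  # 4 vectors of 10 coefficients each
get_beta(betas, method = "geometric-median")
get_beta(betas, method = "mean-wise")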
Get coordinates on plot
Description
Get coordinates on a plot by mouse-clicking.
Usage
pasteLoc(nb, digits = c(3, 3))
Arguments
nb |
Number of positions. |
digits |
Vector of 2 integers indicating the number of decimal places (respectively for x and y coordinates). |
Value
A list of coordinates. Note that if you don't assign the result to a variable, it is printed as the command text that regenerates the list. This can be useful to get coordinates by mouse-clicking once, and then to reuse the code for convenience and reproducibility.
Examples
## Not run:
plot(runif(20, max = 5000))
# note the negative number for the rounding of $y
coord <- pasteLoc(3, digits = c(2, -1))
text(coord, c("a", "b", "c"))
## End(Not run)
Partial correlation
Description
Partial correlation between x and y, after having adjusted both for z.
Usage
pcor(x, y, z, alpha = 0.05)
Arguments
x |
A numeric vector. |
y |
A numeric vector. |
z |
A data frame, which can contain characters or factors. |
alpha |
Type-I error for the confidence interval (CI).
Default is 0.05, corresponding to a 95% CI. |
Value
The partial correlation, and the lower and upper bounds of its CI.
Examples
pcor(iris[[1]], iris[[2]], iris[-(1:2)])
Plot method
Description
Plot method for class big_sp_list
.
Usage
## S3 method for class 'big_sp_list'
plot(x, coeff = 1, ...)
Arguments
x |
An object of class big_sp_list. |
coeff |
Relative size of text. Default is 1. |
... |
Not used. |
Value
A ggplot2 object. You can plot it using the print method. You can modify it as you wish by adding layers. You might want to read this chapter to get more familiar with the package ggplot2.
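Examples
No example ships with this entry; a minimal sketch, assuming the example dataset attached by big_attachExtdata() (as in the other plot methods) and an illustrative response:
X <- big_attachExtdata()
y <- rnorm(nrow(X))
fit <- big_spLinReg(X, y, K = 5)
plot(fit)               # returns a ggplot2 object
plot(fit, coeff = 1.5)  # larger text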
Plot method
Description
Plot method for class big_SVD
.
Usage
## S3 method for class 'big_SVD'
plot(
x,
type = c("screeplot", "scores", "loadings"),
nval = length(x$d),
scores = c(1, 2),
loadings = 1,
ncol = NULL,
coeff = 1,
viridis = TRUE,
cols = 2,
...
)
Arguments
x |
An object of class big_SVD. |
type |
Either "screeplot" (the default), "scores" or "loadings". |
nval |
Number of singular values to plot. Default plots all computed. |
scores |
Vector of indices of the two PCs to plot. Default plots the first two PCs. If providing more than two, it produces many plots. |
loadings |
Indices of PC loadings to plot. Default plots the first vector of loadings. |
ncol |
If multiple vectors of loadings are to be plotted, this defines the number of columns of the resulting multiplot. |
coeff |
Relative size of text. Default is 1. |
viridis |
Deprecated argument. |
cols |
Deprecated. Use ncol instead. |
... |
Not used. |
Value
A ggplot2 object. You can plot it using the print method. You can modify it as you wish by adding layers. You might want to read this chapter to get more familiar with the package ggplot2.
See Also
big_SVD, big_randomSVD and asPlotlyText.
Examples
set.seed(1)
X <- big_attachExtdata()
svd <- big_SVD(X, big_scale(), k = 10)
# screeplots
plot(svd) # 3 PCs seems "significant"
plot(svd, coeff = 1.5) # larger font for papers
# scores plot
plot(svd, type = "scores") # first 2 PCs
plot(svd, type = "scores", scores = c(1, 3))
plot(svd, type = "scores", scores = 1:4, ncol = 2, coeff = 0.7)
## add color (recall that this returns a `ggplot2` object)
class(obj <- plot(svd, type = "scores"))
pop <- rep(c("POP1", "POP2", "POP3"), c(143, 167, 207))
library(ggplot2)
print(obj2 <- obj + aes(color = pop) + labs(color = "Population"))
## change the place of the legend
print(obj3 <- obj2 + theme(legend.position = c(0.82, 0.17)))
## change the title and the labels of the axes
obj3 + ggtitle("Yet another title") + xlab("with an other 'x' label")
# loadings
plot(svd, type = "loadings", loadings = 2)
## all loadings
plot(svd, type = "loadings", loadings = 1:2, coeff = 0.7, ncol = 1)
# Percentage of variance explained by the PCs
# See https://github.com/privefl/bigstatsr/issues/83
# dynamic plots, require the package **plotly**
## Not run: plotly::ggplotly(obj3)
Plot method
Description
Plot method for class mhtest
.
Usage
## S3 method for class 'mhtest'
plot(x, type = c("hist", "Manhattan", "Q-Q", "Volcano"), coeff = 1, ...)
Arguments
x |
An object of class mhtest. |
type |
Either "hist" (the default), "Manhattan", "Q-Q" or "Volcano". |
coeff |
Relative size of text. Default is 1. |
... |
Not used. |
Value
A ggplot2 object. You can plot it using the print method. You can modify it as you wish by adding layers. You might want to read this chapter to get more familiar with the package ggplot2.
See Also
big_univLinReg, big_univLogReg, plot.big_SVD and asPlotlyText.
Examples
set.seed(1)
X <- big_attachExtdata()
y <- rnorm(nrow(X))
test <- big_univLinReg(X, y)
plot(test)
plot(test, type = "Volcano")
plot(test, type = "Q-Q")
plot(test, type = "Manhattan")
plot(test, type = "Manhattan") + ggplot2::ggtitle(NULL)
Predict method
Description
Predict method for class big_sp
.
Usage
## S3 method for class 'big_sp'
predict(object, X, ind.row, ind.col, covar.row = NULL, ncores = 1, ...)
Arguments
object |
Object of class big_sp. |
X |
An object of class FBM. |
ind.row |
An optional vector of the row indices that are used. If not specified, all rows are used. Don't use negative indices. |
ind.col |
An optional vector of the column indices that are used. If not specified, all columns are used. Don't use negative indices. |
covar.row |
Matrix of covariables to be added in each model to correct
for confounders (e.g. the scores of PCA), corresponding to |
ncores |
Number of cores used. Default doesn't use parallelism. You may use nb_cores. |
... |
Not used. |
Value
A vector of scores, corresponding to ind.row.
See Also
big_spLinReg and big_spLogReg.
Predict method
Description
Predict method for class big_sp_list
.
Usage
## S3 method for class 'big_sp_list'
predict(
object,
X,
ind.row = rows_along(X),
ind.col = attr(object, "ind.col"),
covar.row = NULL,
proba = (attr(object, "family") == "binomial"),
base.row = NULL,
ncores = 1,
...
)
Arguments
object |
Object of class big_sp_list. |
X |
An object of class FBM. |
ind.row |
An optional vector of the row indices that are used. If not specified, all rows are used. Don't use negative indices. |
ind.col |
An optional vector of the column indices that are used. If not specified, all columns are used. Don't use negative indices. |
covar.row |
Matrix of covariables to be added in each model to correct
for confounders (e.g. the scores of PCA), corresponding to |
proba |
Whether to return probabilities? Default is TRUE for the "binomial" family. |
base.row |
Vector of base predictions, corresponding to ind.row. |
ncores |
Number of cores used. Default doesn't use parallelism. You may use nb_cores. |
... |
Not used. |
Value
A vector of scores, corresponding to ind.row.
See Also
big_spLinReg and big_spLogReg.
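Examples
No example ships with this entry; a minimal sketch, assuming the example dataset attached by big_attachExtdata() and a simulated binary response:
X <- big_attachExtdata()
y01 <- sample(0:1, size = nrow(X), replace = TRUE)
fit <- big_spLogReg(X, y01, K = 5)
pred <- predict(fit, X)  # probabilities, since the family is "binomial"
head(pred)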
Scores of PCA
Description
Get the scores of PCA associated with an svd decomposition (class big_SVD
).
Usage
## S3 method for class 'big_SVD'
predict(
object,
X = NULL,
ind.row = rows_along(X),
ind.col = cols_along(X),
block.size = block_size(nrow(X)),
...
)
Arguments
object |
A list returned by big_SVD or big_randomSVD. |
X |
An object of class FBM. |
ind.row |
An optional vector of the row indices that are used. If not specified, all rows are used. Don't use negative indices. |
ind.col |
An optional vector of the column indices that are used. If not specified, all columns are used. Don't use negative indices. |
block.size |
Maximum number of columns read at once. Default uses block_size. |
... |
Not used. |
Value
A matrix of size n \times K where n is the number of samples corresponding to indices in ind.row and K is the number of PCs computed in object. If X is not specified, this just returns the scores of the training set of object.
Examples
set.seed(1)
X <- big_attachExtdata()
n <- nrow(X)
# Using only half of the data
ind <- sort(sample(n, n/2))
test <- big_SVD(X, fun.scaling = big_scale(), ind.row = ind)
str(test)
plot(test$u)
pca <- prcomp(X[ind, ], center = TRUE, scale. = TRUE)
# same scaling
all.equal(test$center, pca$center)
all.equal(test$scale, pca$scale)
# scores and loadings are the same or opposite
# except for last eigenvalue which is equal to 0
# due to centering of columns
scores <- test$u %*% diag(test$d)
class(test)
scores2 <- predict(test) # use this function to predict scores
all.equal(scores, scores2)
dim(scores)
dim(pca$x)
tail(pca$sdev)
plot(scores2, pca$x[, 1:ncol(scores2)])
plot(test$v[1:100, ], pca$rotation[1:100, 1:ncol(scores2)])
# projecting on new data
X2 <- sweep(sweep(X[-ind, ], 2, test$center, '-'), 2, test$scale, '/')
scores.test <- X2 %*% test$v
ind2 <- setdiff(rows_along(X), ind)
scores.test2 <- predict(test, X, ind.row = ind2) # use this
all.equal(scores.test, scores.test2)
scores.test3 <- predict(pca, X[-ind, ])
plot(scores.test2, scores.test3[, 1:ncol(scores.test2)])
Predict method
Description
Predict method for class mhtest
.
Usage
## S3 method for class 'mhtest'
predict(object, scores = object$score, log10 = TRUE, ...)
Arguments
object |
An object of class mhtest. |
scores |
Raw scores (before transformation) that you want to transform to p-values. |
log10 |
Are p-values returned on the log10 scale? Default is TRUE. |
... |
Not used. |
Value
Vector of log10(p-values) associated with scores and object.
See Also
big_univLinReg and big_univLogReg.
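Examples
A minimal sketch, reusing the kind of simulated association test shown for the plot method above:
X <- big_attachExtdata()
y <- rnorm(nrow(X))
test <- big_univLinReg(X, y)
lpval <- predict(test)                # log10(p-values)
pval  <- predict(test, log10 = FALSE)
head(cbind(lpval, pval))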
Objects exported from other packages
Description
These objects are imported from other packages. Follow the links below to see their documentation.
- bigparallelr
- cowplot
Create an Implementation of [<- For Custom Matrix-Like Types
Description
replace is a function that converts different index types such as negative integer vectors or logical vectors passed to the [<- function as i (e.g. X[i]) or as i and j (e.g. X[i, j]) into positive integer vectors. The converted indices are provided as the i parameter of replace_vector or the i and j parameters of replace_matrix, to facilitate implementing the replacement mechanism for custom matrix-like types. Single values are recycled to match the replacement length.
Usage
Replace(replace_vector, replace_matrix)
Arguments
replace_vector |
A function in the form of function(x, i, value) that replaces elements at a vector of positive integer indices i. |
replace_matrix |
A function in the form of function(x, i, j, value) that replaces elements at vectors of positive integer indices i and j. |
Details
This idea initially comes from package crochet.
Value
A function in the form of function(x, i, j, ..., value) that is meant to be used as a method for [<- for a custom type.
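Examples
Continuing the hypothetical MiniMat sketch from Extract() above (again illustrative only, not part of the package), the matching [<- method could be wired up as follows:
setMethod("[<-", signature("MiniMat", "ANY", "ANY", "ANY"), Replace(
  replace_vector = function(x, i, value)    { x@env$data[i]    <- value; x },
  replace_matrix = function(x, i, j, value) { x@env$data[i, j] <- value; x }
))
M[1, c(2, 4)] <- 0  # single values are recycled, as described above
M[1, ]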
Replace extension '.bk'
Description
Replace extension '.bk'
Usage
sub_bk(path, replacement = "", stop_if_not_ext = TRUE)
Arguments
path |
String with extension '.bk'. |
replacement |
Replacement of '.bk'. Default replaces by nothing. |
stop_if_not_ext |
If TRUE (the default), throw an error if the extension of path is not '.bk'. |
Value
String with extension '.bk' replaced by replacement
.
Examples
path <- "toto.bk"
sub_bk(path)
sub_bk(path, ".rds")
Summary method
Description
Summary method for class big_sp_list
.
Usage
## S3 method for class 'big_sp_list'
summary(object, best.only = FALSE, sort = FALSE, ...)
Arguments
object |
An object of class big_sp_list. |
best.only |
Whether to return only the one row corresponding to the best
model? The best model is the one with the smallest $validation_loss. Default is FALSE. |
sort |
Whether to sort the rows by $validation_loss. Default is FALSE. |
... |
Not used. |
Value
A tibble with, for each $alpha, a mean $validation_loss, a mean vector of coefficients $beta, the corresponding number of non-zero coefficients $nb_var, and the reason for method completion $message.
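Examples
No example ships with this entry; a minimal sketch, reusing the kind of fit shown for the predict methods above:
X <- big_attachExtdata()
y <- rnorm(nrow(X))
fit <- big_spLinReg(X, y, alphas = c(1, 0.1), K = 5)
summary(fit)
summary(fit, best.only = TRUE)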
Theme ggplot2
Description
Theme ggplot2 used by this package.
Usage
theme_bigstatsr(size.rel = 1)
Arguments
size.rel |
Relative size. Default is 1. |
Examples
library(ggplot2)
(p <- ggplot(mapping = aes(x = 1:10, y = 1:10)) + geom_point())
p + theme_bw()
p + theme_bigstatsr()
Temporarily disable downcast warning
Description
Temporarily disable downcast warning
Usage
without_downcast_warning(expr)
Arguments
expr |
The expression to evaluate without downcast warning. |
Value
The result of the evaluated expression.
Examples
without_downcast_warning(FBM(10, 10, type = "integer", init = 1.5))