Title: | Stability Selection with Error Control |
Version: | 0.6-4 |
Date: | 2021-01-28 |
Description: | Resampling procedures to assess the stability of selected variables with additional finite sample error control for high-dimensional variable selection procedures such as Lasso or boosting. Both, standard stability selection (Meinshausen & Buhlmann, 2010, <doi:10.1111/j.1467-9868.2010.00740.x>) and complementary pairs stability selection with improved error bounds (Shah & Samworth, 2013, <doi:10.1111/j.1467-9868.2011.01034.x>) are implemented. The package can be combined with arbitrary user specified variable selection approaches. |
VignetteBuilder: | knitr |
Depends: | R (≥ 2.14.0), methods, stats, parallel |
Imports: | graphics, grDevices, utils |
Suggests: | glmnet, lars, mboost (> 2.3-0), gamboostLSS (≥ 1.2-0), TH.data, hdi, testthat, knitr, rmarkdown |
LazyData: | yes |
License: | GPL-2 |
URL: | https://github.com/hofnerb/stabs |
NeedsCompilation: | no |
Packaged: | 2021-01-28 09:19:53 UTC; Benjamin (PEI) |
Author: | Benjamin Hofner [aut, cre], Torsten Hothorn [aut] |
Maintainer: | Benjamin Hofner <benjamin.hofner@pei.de> |
Repository: | CRAN |
Date/Publication: | 2021-01-29 10:00:02 UTC |
Check if folds result from subsampling with p = 0.5.
Description
(Internal) function that checks if folds result from subsampling with p = 0.5 and adds complementary pairs if needed.
Usage
check_folds(folds, B, n, sampling.type)
Arguments
folds |
a weight matrix that represents the subsamples. |
B |
number of subsampling replicates. |
n |
the number of observations; needed for internal checks. |
sampling.type |
sampling type to be used. |
Details
This is an internal function used to check if folds
are
specified correctly. For details (e.g. on arguments) see
stabsel
.
Value
A matrix containing the folds
, possibly after adding the
complementary pairs.
References
B. Hofner, L. Boccuto and M. Goeker (2015), Controlling false
discoveries in high-dimensional situations: Boosting with stability
selection. BMC Bioinformatics, 16:144.
doi: 10.1186/s12859-015-0575-3.
See Also
For details see stabsel
.
Fit Functions for Stability Selection
Description
Functions that fit a model until q
variables are selected and
that returns the indices (and names) of the selected variables.
Usage
## package lars:
lars.lasso(x, y, q, ...)
lars.stepwise(x, y, q, ...)
## package glmnet:
glmnet.lasso(x, y, q, type = c("conservative", "anticonservative"), ...)
glmnet.lasso_maxCoef(x, y, q, ...)
Arguments
x |
a matrix containing the predictors or an object of class
|
y |
a vector or matrix containing the outcome. |
q |
number of (unique) selected variables (or groups of variables depending on the model) that are selected on each subsample. |
type |
a charachter vector specifying if the number of selected
variables per subsample is |
... |
additional arguments passed to the underlying fitting function.
See the example on |
Details
All fitting functions are named after the package and the type of
model that is fitted: package_name.model
, e.g.,
glmnet.lasso
stands for a lasso model that is fitted using the
package glmnet.
glmnet.lasso_maxCoef
fits a lasso model with a given penalty parameter
and returns the q
largest coefficients. If one wants to use
glmnet.lasso_maxCoef
, one must specify the penalty parameter
lambda
(via the ...
argument) or in
stabsel
via args.fitfun(lambda = )
. Note that usually,
the penalty parameter cannot be specified but is chosen such that q
variables are selected. For an example on how to use
glmnet.lasso_maxCoef
see stabsel
.
Value
A named list with elements
selected |
logical. A vector that indicates which variable was selected. |
path |
logical. A matrix that indicates which variable was selected in which step. Each row represents one variable, the columns represent the steps. |
See Also
stabsel
for stability selection itself.
Examples
if (require("TH.data")) {
## make data set available
data("bodyfat", package = "TH.data")
} else {
## simulate some data if TH.data not available.
## Note that results are non-sense with this data.
bodyfat <- matrix(rnorm(720), nrow = 72, ncol = 10)
}
if (require("lars")) {
## selected variables
lars.lasso(bodyfat[, -2], bodyfat[,2], q = 3)$selected
lars.stepwise(bodyfat[, -2], bodyfat[,2], q = 3)$selected
}
if (require("glmnet")) {
glmnet.lasso(bodyfat[, -2], bodyfat[,2], q = 3)$selected
## selection path
glmnet.lasso(bodyfat[, -2], bodyfat[,2], q = 3)$path
## Using the anticonservative glmnet.lasso (see args.fitfun):
stab.glmnet <- stabsel(x = bodyfat[, -2], y = bodyfat[,2],
fitfun = glmnet.lasso,
args.fitfun = list(type = "anticonservative"),
cutoff = 0.75, PFER = 1)
}
Method to Extract Parameters
Description
Extract stability selection parameters, i.e., tuning parameters, from
a stabsel
object.
Usage
parameters(object)
## extract parameters from a stabsel model
## (same as parameters(p) )
## S3 method for class 'stabsel'
stabsel_parameters(p, ...)
Arguments
object |
an object of class |
p |
an object of class |
... |
additional arguments, currently not used. |
Value
An object of class stabsel_parameters
with a special
print
method. See there for details.
See Also
stabsel
to run stability selection and
stabsel_parameters
for details on the parameters.
Plot and Print Methods for Stability Selection
Description
Display results of stability selection.
Usage
## S3 method for class 'stabsel'
plot(x, main = deparse(x$call), type = c("maxsel", "paths"),
xlab = NULL, ylab = NULL, col = NULL, ymargin = 10, np = sum(x$max > 0),
labels = NULL, ...)
## S3 method for class 'stabsel'
print(x, decreasing = FALSE, print.all = TRUE, ...)
Arguments
x |
object of class |
main |
main title for the plot. |
type |
plot type; either stability paths ( |
xlab , ylab |
labels for the x- and y-axis of the plot. Per
default, sensible labels are used depending on the |
col |
a vector of colors; Typically, one can specify a single color or one color for each variable. Per default, colors depend on the maximal selection frequency of the variable and range from grey to red. |
ymargin |
(temporarily) specifies the y margin of of the plot in
lines (see argument |
np |
number of variables to plot for the maximum selection
frequency plot ( |
labels |
variable labels for the plot; one label per variable / effect
must be specified. Per default, the names of |
decreasing |
logical. Should the selection frequencies be printed
in descending order ( |
print.all |
logical. Should all selection frequencies be displayed or only those that are greater than zero? |
... |
additional arguments to |
Details
This function implements the stability selection procedure by Meinshausen and Buehlmann (2010) and the improved error bounds by Shah and Samworth (2013).
Two of the three arguments cutoff
, q
and PFER
must be specified. The per-family error rate (PFER), i.e., the
expected number of false positives E(V)
, where V
is the
number of false positives, is bounded by the argument PFER
.
As controlling the PFER is more conservative as controlling the
family-wise error rate (FWER), the procedure also controlls the FWER,
i.e., the probability of selecting at least one non-influential
variable (or model component) is less than PFER
.
Value
An object of class stabsel
with a special print
method.
The object has the following elements:
phat |
selection probabilities. |
selected |
elements with maximal selection probability greater
|
max |
maximum of selection probabilities. |
cutoff |
cutoff used. |
q |
average number of selected variables used. |
PFER |
per-family error rate. |
sampling.type |
the sampling type used for stability selection. |
assumption |
the assumptions made on the selection probabilities. |
call |
the call. |
References
B. Hofner, L. Boccuto and M. Goeker (2015), Controlling false
discoveries in high-dimensional situations: Boosting with stability
selection. BMC Bioinformatics, 16:144.
doi: 10.1186/s12859-015-0575-3.
N. Meinshausen and P. Buehlmann (2010), Stability selection. Journal of the Royal Statistical Society, Series B, 72, 417–473.
R.D. Shah and R.J. Samworth (2013), Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society, Series B, 75, 55–80.
See Also
Examples
if (require("TH.data")) {
## make data set available
data("bodyfat", package = "TH.data")
} else {
## simulate some data if TH.data not available.
## Note that results are non-sense with this data.
bodyfat <- matrix(rnorm(720), nrow = 72, ncol = 10)
}
## set seed
set.seed(1234)
####################################################################
### using stability selection with Lasso methods:
if (require("lars")) {
(stab.lasso <- stabsel(x = bodyfat[, -2], y = bodyfat[,2],
fitfun = lars.lasso, cutoff = 0.75,
PFER = 1))
par(mfrow = c(2, 1))
plot(stab.lasso, ymargin = 6)
opar <- par(mai = par("mai") * c(1, 1, 1, 2.7))
plot(stab.lasso, type = "paths")
}
Run Stability Selection
Description
(Internal) function that is used to run stability selection (i.e. to apply the fit-function to the subsamples. This function is not intended to be directly called.
Usage
run_stabsel(fitter, args.fitter, n, p, cutoff, q, PFER, folds, B, assumption,
sampling.type, papply, verbose, FWER, eval, names,
mc.preschedule = FALSE, ...)
Arguments
fitter |
a function to fit the model on subsamples. See argument
|
args.fitter |
a named list containing additional arguments that are
passed to |
n |
the number of observations; needed for internal checks. |
p |
number of possible predictors (including intercept if applicable). |
cutoff |
cutoff between 0.5 and 1. |
q |
number of (unique) selected variables (or groups of variables depending on the model) that are selected on each subsample. |
PFER |
upper bound for the per-family error rate. |
folds |
a weight matrix that represents the subsamples. |
B |
number of subsampling replicates. |
assumption |
distributional assumption. |
sampling.type |
sampling type to be used. |
papply |
(parallel) apply function. |
verbose |
logical (default: |
FWER |
deprecated. Only for compatibility with older versions, use PFER instead. |
eval |
logical. Determines whether stability selection is evaluated. |
names |
variable names that are used to label the results. |
mc.preschedule |
preschedule tasks? |
... |
additional arguments to be passed to next function. |
Details
This is an internal function that fits the actual models to the
subsamples, i.e., this is the work horse that runs stability
selection. Usually, one should use stabsel
, which
internally calls run_stabsel
.
run_stabsel
can be used by expert users to implement stability
selection methods for new model types.
For details (e.g. on arguments) see stabsel
.
Value
An object of class stabsel
with the following elements:
phat |
selection probabilities. |
selected |
elements with maximal selection probability greater
|
max |
maximum of selection probabilities. |
cutoff |
cutoff used. |
q |
average number of selected variables used. |
PFER |
per-family error rate. |
p |
the number of effects subject to selection. |
sampling.type |
the sampling type used for stability selection. |
assumption |
the assumptions made on the selection probabilities. |
References
B. Hofner, L. Boccuto and M. Goeker (2015), Controlling false
discoveries in high-dimensional situations: Boosting with stability
selection. BMC Bioinformatics, 16:144.
doi: 10.1186/s12859-015-0575-3.
See Also
For details see stabsel
.
Method to Extract Selected Variables
Description
Extract selected variables from a stabsel
object.
Usage
selected(object, ...)
## S3 method for class 'stabsel'
selected(object, ...)
Arguments
object |
an object of class |
... |
additional arguments passed to specific |
Details
The ids of variables selected during the stability selection process
can be extracted using selected()
.
Stability Selection
Description
Selection of influential variables or model components with error control.
Usage
## generic stability selection funcion
stabsel(x, ...)
## a method to fit models with stability selection
## S3 method for class 'matrix'
stabsel(x, y, fitfun = lars.lasso,
args.fitfun = list(), cutoff, q, PFER,
folds = subsample(rep(1, nrow(x)), B = B),
B = ifelse(sampling.type == "MB", 100, 50),
assumption = c("unimodal", "r-concave", "none"),
sampling.type = c("SS", "MB"),
papply = mclapply, mc.preschedule = FALSE,
verbose = TRUE, FWER, eval = TRUE, ...)
## essentially a wrapper for data.frames (see details)
## S3 method for class 'data.frame'
stabsel(x, y, intercept = FALSE, ...)
Arguments
x |
a |
y |
a vector or matrix containing the outcome. |
intercept |
logical. If |
fitfun |
a function that takes the arguments |
args.fitfun |
a named list containing additional arguments that are
passed to the fitting function; see also argument |
cutoff |
cutoff between 0.5 and 1. Preferably a value between 0.6 and 0.9 should be used. |
q |
number of (unique) selected variables (or groups of variables depending on the model) that are selected on each subsample. |
PFER |
upper bound for the per-family error rate. This specifies the amount of falsely selected base-learners, which is tolerated. See details. |
folds |
a weight matrix with number of rows equal to the number
of observations, see |
assumption |
Defines the type of assumptions on the
distributions of the selection probabilities and simultaneous
selection probabilities. Only applicable for
|
sampling.type |
use sampling scheme of of Shah & Samworth
(2013), i.e., with complementarty pairs ( |
B |
number of subsampling replicates. Per default, we use 50
complementary pairs for the error bounds of Shah & Samworth (2013)
and 100 for the error bound derived in Meinshausen & Buehlmann
(2010). As we use |
papply |
(parallel) apply function, defaults to
|
mc.preschedule |
preschedule tasks if |
verbose |
logical (default: |
FWER |
deprecated. Only for compatibility with older versions, use PFER instead. |
eval |
logical. Determines whether stability selection is
evaluated ( |
... |
additional arguments to parallel apply methods such as
|
Details
This function implements the stability selection procedure by
Meinshausen and Buehlmann (2010) and the improved error bounds by Shah
and Samworth (2013). For details see also Hofner et al. (2014). The
error bounds are implemented in the function
stabsel_parameters
. Two of the three arguments
cutoff
, q
and PFER
must be specified. The
per-family error rate (PFER), i.e., the expected number of false
positives E(V)
, where V
is the number of false positives,
is bounded by the argument PFER
.
As controlling the PFER is more conservative as controlling the
family-wise error rate (FWER), the procedure also controlls the FWER,
i.e., the probability of selecting at least one non-influential
variable (or model component) is less than PFER
.
Predefined fitfuns
functions exist but more can be
easily implemented. Note that stepwise regression methods are usually
not advised as they tend to be relatively unstable. See example below.
The function stabsel
for data.frame
s is
essentially just a wrapper to the matrix
function with
the same argments. The only difference is that in a pre-processing
step, the data set is converted to a model matrix using the function
model.matrix
. The additional argument intercept
determines if an explicit intercept should be added to the model
matrix. This is often not neccessary but depends on the fitfun
.
Value
An object of class stabsel
with a special print
method.
The object has the following elements:
phat |
selection probabilities. |
selected |
elements with maximal selection probability greater
|
max |
maximum of selection probabilities. |
cutoff |
cutoff used. |
q |
average number of selected variables used. |
PFER |
(realized) upper bound for the per-family error rate. |
specifiedPFER |
specified upper bound for the per-family error rate. |
p |
the number of effects subject to selection. |
B |
the number of subsamples. |
sampling.type |
the sampling type used for stability selection. |
assumption |
the assumptions made on the selection probabilities. |
call |
the call. |
References
B. Hofner, L. Boccuto and M. Goeker (2015), Controlling false
discoveries in high-dimensional situations: Boosting with stability
selection. BMC Bioinformatics, 16:144.
doi: 10.1186/s12859-015-0575-3.
N. Meinshausen and P. Buehlmann (2010), Stability selection. Journal of the Royal Statistical Society, Series B, 72, 417–473.
R.D. Shah and R.J. Samworth (2013), Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society, Series B, 75, 55–80.
See Also
stabsel_parameters
for the computation of error bounds,
stabsel.stabsel
for the fast re-computation of
parameters of a fitted stabsel
object,
fitfun
for available fitting functions and
plot.stabsel
for available plot functions
Examples
if (require("TH.data")) {
## make data set available
data("bodyfat", package = "TH.data")
} else {
## simulate some data if TH.data not available.
## Note that results are non-sense with this data.
bodyfat <- matrix(rnorm(720), nrow = 72, ncol = 10)
}
## set seed
set.seed(1234)
####################################################################
### using stability selection with Lasso methods:
if (require("lars")) {
(stab.lasso <- stabsel(x = bodyfat[, -2], y = bodyfat[,2],
fitfun = lars.lasso, cutoff = 0.75,
PFER = 1))
(stab.stepwise <- stabsel(x = bodyfat[, -2], y = bodyfat[,2],
fitfun = lars.stepwise, cutoff = 0.75,
PFER = 1))
par(mfrow = c(2, 1))
plot(stab.lasso, main = "Lasso")
plot(stab.stepwise, main = "Stepwise Selection")
## --> stepwise selection seems to be quite unstable even in this low
## dimensional example!
}
## set seed (again to make results comparable)
set.seed(1234)
if (require("glmnet")) {
(stab.glmnet <- stabsel(x = bodyfat[, -2], y = bodyfat[,2],
fitfun = glmnet.lasso, cutoff = 0.75,
PFER = 1))
par(mfrow = c(2, 1))
plot(stab.glmnet, main = "Lasso (glmnet)")
if (exists("stab.lasso"))
plot(stab.lasso, main = "Lasso (lars)")
}
## Select variables with maximum coefficients based on lasso estimate
set.seed(1234) # reset seed
if (require("glmnet")) {
## use cross-validated lambda
lambda.min <- cv.glmnet(x = as.matrix(bodyfat[, -2]), y = bodyfat[,2])$lambda.min
(stab.maxCoef <- stabsel(x = bodyfat[, -2], y = bodyfat[,2],
fitfun = glmnet.lasso_maxCoef,
# specify additional parameters to fitfun
args.fitfun = list(lambda = lambda.min),
cutoff = 0.75, PFER = 1))
## WARNING: Using a fixed penalty (lambda) is usually not permitted and
## not sensible. See ?fitfun for details.
## now compare standard lasso with "maximal parameter estimates" from lasso
par(mfrow = c(2, 1))
plot(stab.maxCoef, main = "Lasso (glmnet; Maximum Coefficients)")
plot(stab.glmnet, main = "Lasso (glmnet)")
## --> very different results.
}
####################################################################
### using stability selection directly on computed boosting models
### from mboost
if (require("mboost")) {
### low-dimensional example
mod <- glmboost(DEXfat ~ ., data = bodyfat)
## compute cutoff ahead of running stabsel to see if it is a sensible
## parameter choice.
## p = ncol(bodyfat) - 1 (= Outcome) + 1 ( = Intercept)
stabsel_parameters(q = 3, PFER = 1, p = ncol(bodyfat) - 1 + 1,
sampling.type = "MB")
## the same:
stabsel(mod, q = 3, PFER = 1, sampling.type = "MB", eval = FALSE)
### Do not test the following code per default on CRAN as it takes some time to run:
## now run stability selection
(sbody <- stabsel(mod, q = 3, PFER = 1, sampling.type = "MB"))
opar <- par(mai = par("mai") * c(1, 1, 1, 2.7))
plot(sbody)
par(opar)
plot(sbody, type = "maxsel", ymargin = 6)
}
Compute Error Bounds for Stability Selection
Description
Compute the missing parameter from the two given parameters in order to assess suitability of the parameter constellation
Usage
stabsel_parameters(p, ...)
## Default S3 method:
stabsel_parameters(p, cutoff, q, PFER,
B = ifelse(sampling.type == "MB", 100, 50),
assumption = c("unimodal", "r-concave", "none"),
sampling.type = c("SS", "MB"),
verbose = FALSE, FWER, ...)
## S3 method for class 'stabsel_parameters'
print(x, heading = TRUE, ...)
Arguments
p |
number of possible predictors (including intercept if applicable). |
cutoff |
cutoff between 0.5 and 1. Preferably a value between 0.6 and 0.9 should be used. |
q |
number of (unique) selected variables (or groups of variables depending on the model) that are selected on each subsample. |
PFER |
upper bound for the per-family error rate. This specifies the amount of falsely selected base-learners, which is tolerated. See details. |
B |
number of subsampling replicates. Per default, we use 50
complementary pairs for the error bounds of Shah & Samworth (2013)
and 100 for the error bound derived in Meinshausen & Buehlmann
(2010). As we use |
assumption |
Defines the type of assumptions on the
distributions of the selection probabilities and simultaneous
selection probabilities. Only applicable for
|
sampling.type |
use sampling scheme of of Shah & Samworth
(2013), i.e., with complementarty pairs ( |
verbose |
logical (default: |
FWER |
deprecated. Only for compatibility with older versions, use PFER instead. |
x |
an object of class |
heading |
logical. Specifies if a heading line should be printed. |
... |
additional arguments to be passed to next function. |
Details
This function implements the error bounds for stability selection by Meinshausen and Buehlmann (2010) and the improved error bounds by Shah and Samworth (2013). For details see also Hofner et al. (2014).
Two of the three arguments cutoff
, q
and PFER
must be specified. The per-family error rate (PFER), i.e., the
expected number of false positives E(V)
, where V
is the
number of false positives, is bounded by the argument PFER
.
For more details see also stabsel
.
Value
An object of class stabsel_parameters
with a special print
method.
The object has the following elements:
cutoff |
cutoff used. |
q |
average number of selected variables used. |
PFER |
(realized) upper bound for the per-family error rate. |
specifiedPFER |
specified upper bound for the per-family error rate. |
p |
the number of effects subject to selection. |
B |
the number of subsamples. |
sampling.type |
the sampling type used for stability selection. |
assumption |
the assumptions made on the selection probabilities. |
References
B. Hofner, L. Boccuto and M. Goeker (2015), Controlling false
discoveries in high-dimensional situations: Boosting with stability
selection. BMC Bioinformatics, 16:144.
doi: 10.1186/s12859-015-0575-3.
N. Meinshausen and P. Buehlmann (2010), Stability selection. Journal of the Royal Statistical Society, Series B, 72, 417–473.
R.D. Shah and R.J. Samworth (2013), Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society, Series B, 75, 55–80.
See Also
For more details see also stabsel
.
Change Parameters of Stability Selection
Description
Method to change the parameters cutoff
, PFER
and
assumption
of stability selection that can be altered without
the need to re-run the subsampling process.
Usage
## S3 method for class 'stabsel'
stabsel(x, cutoff, PFER, assumption = x$assumption, ...)
Arguments
x |
an object that results from a call to |
cutoff |
cutoff between 0.5 and 1. Preferably a value between 0.6 and 0.9 should be used. |
PFER |
upper bound for the per-family error rate. This specifies the amount of falsely selected base-learners, which is tolerated. See details. |
assumption |
Defines the type of assumptions on the
distributions of the selection probabilities and simultaneous
selection probabilities. Only applicable for
|
... |
additional arguments that are currently ignored. |
Details
This function allows to alter the parameters cutoff
,
PFER
and assumption
of a fitted stability selection
result. All other parameters are re-used from the original stability
selection results. The missing paramter is computed and the selected
variables are updated accordingly.
Value
An object of class stabsel
. For details see there.
See Also
stabsel
for the generic function,
stabsel_parameters
for the computation of error bounds,
fitfun
for available fitting functions and
plot.stabsel
for available plot functions
Examples
if (require("TH.data")) {
## make data set available
data("bodyfat", package = "TH.data")
} else {
## simulate some data if TH.data not available.
## Note that results are non-sense with this data.
bodyfat <- matrix(rnorm(720), nrow = 72, ncol = 10)
}
## set seed
set.seed(1234)
####################################################################
### using stability selection with Lasso methods:
if (require("lars")) {
(stab.lasso <- stabsel(x = bodyfat[, -2], y = bodyfat[,2],
fitfun = lars.lasso, cutoff = 0.75,
PFER = 1))
par(mfrow = c(2, 1))
plot(stab.lasso)
## now change the PFER and the assumption:
(stab.lasso_cf0.93_rconc <- stabsel(stab.lasso, cutoff = 0.93,
assumption = "r-concave"))
plot(stab.lasso_cf0.93_rconc)
## the cutoff did change and hence the PFER and the selected
## variables
}
Draw Random Subsamples
Description
Set up weight matrix for subsampling with sample proportion 1/2
to be used with stabsel
.
Usage
subsample(weights, B = 100, strata = NULL)
Arguments
weights |
a numeric vector of weights for the model to be cross-validated. |
B |
number of folds, per default 25 for |
strata |
a factor of the same length as |
Details
The function subsample
can be used to build an appropriate
weight matrix to be used with stabsel
. See there for
more details.
If strata
is defined sampling is performed in each stratum
separately thus preserving the distribution of the strata
variable in each fold.
See Also
Examples
## just a low-dimensional example
subsample(weights = rep(1, 10), B = 50)