Title: | A Common API to Modeling and Analysis Functions |
Version: | 1.3.1 |
Maintainer: | Max Kuhn <max@posit.co> |
Description: | A common interface is provided to allow users to specify a model without having to remember the different argument names across different functions or computational engines (e.g. 'R', 'Spark', 'Stan', 'H2O', etc). |
License: | MIT + file LICENSE |
URL: | https://github.com/tidymodels/parsnip, https://parsnip.tidymodels.org/ |
BugReports: | https://github.com/tidymodels/parsnip/issues |
Depends: | R (≥ 3.6) |
Imports: | cli, dplyr (≥ 1.1.0), generics (≥ 0.1.2), ggplot2, globals, glue, hardhat (≥ 1.4.1), lifecycle, magrittr, pillar, prettyunits, purrr (≥ 1.0.0), rlang (≥ 1.1.0), sparsevctrs (≥ 0.2.0), stats, tibble (≥ 2.1.1), tidyr (≥ 1.3.0), utils, vctrs (≥ 0.6.0), withr |
Suggests: | bench, C50, covr, dials (≥ 1.1.0), earth, ggrepel, keras, kernlab, kknn, knitr, LiblineaR, MASS, Matrix, methods, mgcv, modeldata, nlme, prodlim, ranger (≥ 0.12.0), remotes, rmarkdown, rpart, sparklyr (≥ 1.0.0), survival, tensorflow, testthat (≥ 3.0.0), xgboost (≥ 1.5.0.1) |
VignetteBuilder: | knitr |
ByteCompile: | true |
Config/Needs/website: | brulee, C50, dbarts, earth, glmnet, keras, kernlab, kknn, LiblineaR, mgcv, nnet, parsnip, quantreg, randomForest, ranger, rpart, rstanarm, tidymodels/tidymodels, tidyverse/tidytemplate, rstudio/reticulate, xgboost, rmarkdown |
Config/rcmdcheck/ignore-inconsequential-notes: | true |
Config/testthat/edition: | 3 |
Encoding: | UTF-8 |
LazyData: | true |
RoxygenNote: | 7.3.2 |
NeedsCompilation: | no |
Packaged: | 2025-03-11 19:17:07 UTC; max |
Author: | Max Kuhn [aut, cre], Davis Vaughan [aut], Emil Hvitfeldt [ctb], Posit Software, PBC [cph, fnd] |
Repository: | CRAN |
Date/Publication: | 2025-03-12 00:10:02 UTC |
parsnip
Description
The goal of parsnip is to provide a tidy, unified interface to models that can be used to try a range of models without getting bogged down in the syntactical minutiae of the underlying packages.
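For instance, the same specification can be handed to different engines without changing the rest of the modeling code. A minimal sketch using the built-in mtcars data:
library(parsnip)
# Declare the type of model once...
spec <- linear_reg()
# ...then choose an engine and fit
spec %>% set_engine("lm") %>% fit(mpg ~ wt + hp, data = mtcars)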
Author(s)
Maintainer: Max Kuhn max@posit.co
Authors:
Davis Vaughan davis@posit.co
Other contributors:
Emil Hvitfeldt emil.hvitfeldt@posit.co [contributor]
Posit Software, PBC [copyright holder, funder]
See Also
Useful links:
https://parsnip.tidymodels.org/
https://github.com/tidymodels/parsnip
Report bugs at https://github.com/tidymodels/parsnip/issues
Helper functions for checking the penalty of glmnet models
Description
These functions are for developer use.
.check_glmnet_penalty_fit() checks that the model specification for fitting a glmnet model contains a single value.
.check_glmnet_penalty_predict() checks that the penalty value used for prediction is valid. If called by predict(), it needs to be a single value. Multiple values are allowed for multi_predict().
Usage
.check_glmnet_penalty_fit(x, call = rlang::caller_env())
.check_glmnet_penalty_predict(
penalty = NULL,
object,
multi = FALSE,
call = rlang::caller_env()
)
Arguments
x | An object of class model_spec. |
penalty | A penalty value to check. |
object | An object of class model_fit. |
multi | A logical indicating if multiple values are allowed. |
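A brief sketch of how the fit-time check behaves (the failing call is shown commented out):
# A specification with a single penalty value passes the check
spec <- linear_reg(penalty = 0.1) %>% set_engine("glmnet")
.check_glmnet_penalty_fit(spec)
# A specification without a penalty value would signal an error here:
# .check_glmnet_penalty_fit(linear_reg() %>% set_engine("glmnet"))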
Helper functions to convert between formula and matrix interface
Description
Functions to take a formula interface and get the resulting objects (y, x, weights, etc) back, or the other way around. The functions are intended for developer use. For the most part, this emulates the internals of lm() (and also see the notes at https://developer.r-project.org/model-fitting-functions.html).
.convert_form_to_xy_fit() and .convert_xy_to_form_fit() are for when the data are created for modeling. .convert_form_to_xy_fit() saves both the data objects as well as the objects needed when new data are predicted (e.g. terms, etc.).
.convert_form_to_xy_new() and .convert_xy_to_form_new() are used when new samples are being predicted and only require the predictors to be available.
Usage
.convert_form_to_xy_fit(
formula,
data,
...,
na.action = na.omit,
indicators = "traditional",
composition = "data.frame",
remove_intercept = TRUE,
call = rlang::caller_env()
)
.convert_form_to_xy_new(
object,
new_data,
na.action = na.pass,
composition = "data.frame",
call = rlang::caller_env()
)
.convert_xy_to_form_fit(
x,
y,
weights = NULL,
y_name = "..y",
remove_intercept = TRUE,
call = rlang::caller_env()
)
.convert_xy_to_form_new(object, new_data)
Arguments
formula | An object of class formula: a symbolic description of the model to be fit. |
data | A data frame containing all relevant variables (e.g. outcome(s), predictors, case weights, etc). |
... | Additional arguments. |
na.action | A function which indicates what should happen when the data contain NAs. |
indicators | A string describing whether and how to create indicator/dummy variables from factor predictors. Possible options are "none", "traditional", and "one_hot". |
composition | A string describing whether the resulting x and y should be returned as a "data.frame" or a "matrix". |
remove_intercept | A logical indicating whether to remove the intercept column after the model matrix is created. |
object | A model fit. |
new_data | A rectangular data object, such as a data frame. |
x | A matrix, sparse matrix, or data frame of predictors. Only some models have support for sparse matrix input. |
y | A vector, matrix or data frame of outcome data. |
weights | A numeric vector containing the weights. |
y_name | A string specifying the name of the outcome. |
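A sketch of the formula-to-x/y conversion (the element names in the comment are illustrative):
converted <- .convert_form_to_xy_fit(mpg ~ ., data = mtcars)
# Inspect what was saved for fitting and for future predictions,
# e.g. x, y, weights, and terms
names(converted)
head(converted$x)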
Extract survival status
Description
Extract the status from a survival::Surv()
object.
Arguments
surv | A single survival::Surv() object. |
Value
A numeric vector.
Extract survival time
Description
Extract the time component(s) from a survival::Surv()
object.
Arguments
surv | A single survival::Surv() object. |
Value
A vector when the type is "right" or "left" and a tibble otherwise.
Obtain names of prediction columns for a fitted model or workflow
Description
.get_prediction_column_names()
returns a list that has the names of the
columns for the primary prediction types for a model.
Usage
.get_prediction_column_names(x, syms = FALSE)
Arguments
x | A fitted parsnip model (an object of class "model_fit") or a fitted workflow. |
syms | Should the column names be converted to symbols? Defaults to FALSE. |
Value
A list with elements "estimate" and "probabilities".
Examples
library(dplyr)
library(modeldata)
data("two_class_dat")
levels(two_class_dat$Class)
lr_fit <- logistic_reg() %>% fit(Class ~ ., data = two_class_dat)
.get_prediction_column_names(lr_fit)
.get_prediction_column_names(lr_fit, syms = TRUE)
Translate names of model tuning parameters
Description
This function creates a key that connects the identifiers users make for tuning parameter names, the standardized parsnip parameter names, and the argument names to the underlying fit function for the engine.
Usage
.model_param_name_key(object, as_tibble = TRUE)
Arguments
object |
A workflow or parsnip model specification. |
as_tibble |
A logical. Should the results be in a tibble (the default) or in a list that can facilitate renaming grid objects? |
Value
A tibble with columns user, parsnip, and engine, or a list with named character vectors user_to_parsnip and parsnip_to_engine.
Examples
mod <-
linear_reg(penalty = tune("regularization"), mixture = tune()) %>%
set_engine("glmnet")
mod %>% .model_param_name_key()
rn <- mod %>% .model_param_name_key(as_tibble = FALSE)
rn
grid <- tidyr::crossing(regularization = c(0, 1), mixture = (0:3) / 3)
grid %>%
dplyr::rename(!!!rn$user_to_parsnip)
grid %>%
dplyr::rename(!!!rn$user_to_parsnip) %>%
dplyr::rename(!!!rn$parsnip_to_engine)
Organize glmnet predictions
Description
This function is for developer use and organizes predictions from glmnet models.
Usage
.organize_glmnet_pred(x, object)
Arguments
x | Predictions as returned by the predict() method for glmnet objects. |
object | An object of class model_fit. |
Add a column of row numbers to a data frame
Description
Add a column of row numbers to a data frame
Usage
add_rowindex(x)
Arguments
x |
A data frame |
Value
The same data frame with a column of 1-based integers named .row.
Examples
mtcars %>% add_rowindex()
Augment data with predictions
Description
augment()
will add column(s) for predictions to the given data.
Usage
## S3 method for class 'model_fit'
augment(x, new_data, eval_time = NULL, ...)
Arguments
x | A model fit produced by fit.model_spec() or fit_xy.model_spec(). |
new_data |
A data frame or matrix. |
eval_time |
For censored regression models, a vector of time points at which the survival probability is estimated. |
... |
Not currently used. |
Details
Regression
For regression models, a .pred column is added. If x was created using fit.model_spec() and new_data contains a regression outcome column, a .resid column is also added.
Classification
For classification models, the results can include a column called .pred_class as well as class probability columns named .pred_{level}. This depends on which prediction types are available for the model.
Censored Regression
For these models, predictions for the expected time and survival probability are created (if the model engine supports them). If the model supports survival prediction, the eval_time argument is required.
If survival predictions are created and new_data contains a survival::Surv() object, additional columns for inverse probability of censoring weights (IPCW) are also created (see the tidymodels.org page in the references below). This enables the user to compute performance metrics in the yardstick package.
References
https://www.tidymodels.org/learn/statistics/survival-metrics/
Examples
car_trn <- mtcars[11:32,]
car_tst <- mtcars[ 1:10,]
reg_form <-
linear_reg() %>%
set_engine("lm") %>%
fit(mpg ~ ., data = car_trn)
reg_xy <-
linear_reg() %>%
set_engine("lm") %>%
fit_xy(car_trn[, -1], car_trn$mpg)
augment(reg_form, car_tst)
augment(reg_form, car_tst[, -1])
augment(reg_xy, car_tst)
augment(reg_xy, car_tst[, -1])
# ------------------------------------------------------------------------------
data(two_class_dat, package = "modeldata")
cls_trn <- two_class_dat[-(1:10), ]
cls_tst <- two_class_dat[ 1:10 , ]
cls_form <-
logistic_reg() %>%
set_engine("glm") %>%
fit(Class ~ ., data = cls_trn)
cls_xy <-
logistic_reg() %>%
set_engine("glm") %>%
fit_xy(cls_trn[, -3],
cls_trn$Class)
augment(cls_form, cls_tst)
augment(cls_form, cls_tst[, -3])
augment(cls_xy, cls_tst)
augment(cls_xy, cls_tst[, -3])
Automatic Machine Learning
Description
auto_ml()
defines an automated searching and tuning process where
many models of different families are trained and ranked given their
performance on the training data.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
h2o¹²
¹ The default engine. ² Requires a parsnip extension package for classification and regression.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
auto_ml(mode = "unknown", engine = "h2o")
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", or "classification". |
engine |
A single character string specifying what computational engine to use for fitting. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode and engine are captured as quosures. To pass values programmatically, use the injection operator like so:
value <- 1
auto_ml(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, h2o engine details
Create a ggplot for a model object
Description
This method provides a good visualization method for model results. Currently, only methods for glmnet models are implemented.
Usage
## S3 method for class 'model_fit'
autoplot(object, ...)
## S3 method for class 'glmnet'
autoplot(object, ..., min_penalty = 0, best_penalty = NULL, top_n = 3L)
Arguments
object |
A model fit object. |
... | For autoplot.glmnet(), options to pass to ggrepel::geom_label_repel(). Otherwise, this argument is ignored. |
min_penalty | A single, non-negative number for the smallest penalty value that should be shown in the plot. |
best_penalty | A single, non-negative number that will show a vertical line marker. If left NULL, no line is shown. |
top_n | A non-negative integer for how many model predictors to label. The top predictors are ranked by their absolute coefficient value. For multinomial or multivariate models, the top_n terms are selected within each class or response. |
Details
The glmnet package will need to be attached or loaded for
its autoplot()
method to work correctly.
Value
A ggplot object with penalty on the x-axis and coefficients on the y-axis. For multinomial or multivariate models, the plot is faceted.
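A minimal sketch, assuming the glmnet package is installed (fitting the model loads glmnet, which the autoplot() method requires):
library(ggplot2)
glmnet_fit <-
  linear_reg(penalty = 0.1, mixture = 1) %>%
  set_engine("glmnet") %>%
  fit(mpg ~ ., data = mtcars)
autoplot(glmnet_fit, top_n = 2)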
Ensembles of MARS models
Description
bag_mars()
defines an ensemble of generalized linear models that use
artificial features for some predictors. These features resemble hinge
functions and the result is a model that is a segmented regression in small
dimensions. This function can fit classification and regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
earth¹²
¹ The default engine. ² Requires a parsnip extension package for classification and regression.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
bag_mars(
mode = "unknown",
num_terms = NULL,
prod_degree = NULL,
prune_method = NULL,
engine = "earth"
)
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", or "classification". |
num_terms |
The number of features that will be retained in the final model, including the intercept. |
prod_degree |
The highest possible interaction degree. |
prune_method |
The pruning method. |
engine |
A single character string specifying what computational engine to use for fitting. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode and engine are captured as quosures. To pass values programmatically, use the injection operator like so:
value <- 1
bag_mars(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, earth engine details
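Examples
A specification can be created without the engine packages; fitting requires the baguette and earth packages:
bag_mars(num_terms = 10, prod_degree = 1)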
Ensembles of neural networks
Description
bag_mlp()
defines an ensemble of single layer, feed-forward neural networks.
This function can fit classification and regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
nnet¹²
¹ The default engine. ² Requires a parsnip extension package for classification and regression.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
bag_mlp(
mode = "unknown",
hidden_units = NULL,
penalty = NULL,
epochs = NULL,
engine = "nnet"
)
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", or "classification". |
hidden_units | An integer for the number of units in the hidden layer. |
penalty |
A non-negative numeric value for the amount of weight decay. |
epochs |
An integer for the number of training iterations. |
engine |
A single character string specifying what computational engine to use for fitting. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode and engine are captured as quosures. To pass values programmatically, use the injection operator like so:
value <- 1
bag_mlp(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, nnet engine details
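Examples
A specification sketch (fitting requires the baguette and nnet packages):
bag_mlp(hidden_units = 5, epochs = 100) %>% set_mode("classification")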
Ensembles of decision trees
Description
bag_tree()
defines an ensemble of decision trees. This function can fit
classification, regression, and censored regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
rpart¹², C5.0²
¹ The default engine. ² Requires a parsnip extension package for censored regression, classification, and regression.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
bag_tree(
mode = "unknown",
cost_complexity = 0,
tree_depth = NULL,
min_n = 2,
class_cost = NULL,
engine = "rpart"
)
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", "classification", or "censored regression". |
cost_complexity | A positive number for the cost/complexity parameter (a.k.a. Cp) used by CART models (specific engines only). |
tree_depth |
An integer for the maximum depth of the tree (i.e. number of splits) (specific engines only). |
min_n |
An integer for the minimum number of data points in a node that is required for the node to be split further. |
class_cost |
A non-negative scalar for a class cost (where a cost of 1 means no extra cost). This is useful for when the first level of the outcome factor is the minority class. If this is not the case, values between zero and one can be used to bias to the second level of the factor. |
engine |
A single character string specifying what computational engine to use for fitting. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode and engine are captured as quosures. To pass values programmatically, use the injection operator like so:
value <- 1
bag_tree(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, rpart engine details
, C5.0 engine details
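Examples
A specification sketch (fitting requires the baguette extension package):
bag_tree(min_n = 5) %>%
  set_engine("rpart") %>%
  set_mode("regression")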
Bayesian additive regression trees (BART)
Description
bart()
defines a tree ensemble model that uses Bayesian analysis to
assemble the ensemble. This function can fit classification and regression
models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
dbarts¹
¹ The default engine.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
bart(
mode = "unknown",
engine = "dbarts",
trees = NULL,
prior_terminal_node_coef = NULL,
prior_terminal_node_expo = NULL,
prior_outcome_range = NULL
)
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", or "classification". |
engine |
A single character string specifying what computational engine to use for fitting. |
trees |
An integer for the number of trees contained in the ensemble. |
prior_terminal_node_coef |
A coefficient for the prior probability that a node is a terminal node. Values are usually between zero and one with a default of 0.95. This affects the baseline probability; smaller numbers make the probabilities larger overall. See Details below. |
prior_terminal_node_expo |
An exponent in the prior probability that a node is a terminal node. Values are usually non-negative with a default of 2. This affects the rate that the prior probability decreases as the depth of the tree increases. Larger values make deeper trees less likely. |
prior_outcome_range |
A positive value that defines the width of a prior that the predicted outcome is within a certain range. For regression it is related to the observed range of the data; the prior is the number of standard deviations of a Gaussian distribution defined by the observed range of the data. For classification, it is defined as the range of +/-3 (assumed to be on the logit scale). The default value is 2. |
Details
The prior for the terminal node probability is expressed as prior = a * (1 + d)^(-b) where d is the depth of the node, a is prior_terminal_node_coef, and b is prior_terminal_node_expo. See the Examples section below for an example graph of the prior probability of a terminal node for different values of these parameters.
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode and engine are captured as quosures. To pass values programmatically, use the injection operator like so:
value <- 1
bart(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, dbarts engine details
Examples
show_engines("bart")
bart(mode = "regression", trees = 5)
# ------------------------------------------------------------------------------
# Examples for terminal node prior
library(ggplot2)
library(dplyr)
prior_test <- function(coef = 0.95, expo = 2, depths = 1:10) {
tidyr::crossing(coef = coef, expo = expo, depth = depths) %>%
mutate(
`terminal node prior` = coef * (1 + depth)^(-expo),
coef = format(coef),
expo = format(expo))
}
prior_test(coef = c(0.05, 0.5, .95), expo = c(1/2, 1, 2)) %>%
ggplot(aes(depth, `terminal node prior`, col = coef)) +
geom_line() +
geom_point() +
facet_wrap(~ expo)
Developer functions for predictions via BART models
Description
Developer functions for predictions via BART models
Usage
dbart_predict_calc(obj, new_data, type, level = 0.95, std_err = FALSE)
Arguments
obj |
A parsnip object. |
new_data |
A rectangular data object, such as a data frame. |
type |
A single character value or NULL. |
level |
Confidence level. |
std_err |
Attach column for standard error of prediction or not. |
Boosted trees
Description
boost_tree()
defines a model that creates a series of decision trees
forming an ensemble. Each tree depends on the results of previous trees.
All trees in the ensemble are combined to produce a final prediction. This
function can fit classification, regression, and censored regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
¹ The default engine. ² Requires a parsnip extension package for censored regression, classification, and regression.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
boost_tree(
mode = "unknown",
engine = "xgboost",
mtry = NULL,
trees = NULL,
min_n = NULL,
tree_depth = NULL,
learn_rate = NULL,
loss_reduction = NULL,
sample_size = NULL,
stop_iter = NULL
)
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", "classification", or "censored regression". |
engine |
A single character string specifying what computational engine to use for fitting. |
mtry |
A number for the number (or proportion) of predictors that will be randomly sampled at each split when creating the tree models (specific engines only). |
trees |
An integer for the number of trees contained in the ensemble. |
min_n |
An integer for the minimum number of data points in a node that is required for the node to be split further. |
tree_depth |
An integer for the maximum depth of the tree (i.e. number of splits) (specific engines only). |
learn_rate |
A number for the rate at which the boosting algorithm adapts from iteration-to-iteration (specific engines only). This is sometimes referred to as the shrinkage parameter. |
loss_reduction |
A number for the reduction in the loss function required to split further (specific engines only). |
sample_size | A number for the number (or proportion) of data that is exposed to the fitting routine. For xgboost, the sampling is done at each iteration while C5.0 samples once during training. |
stop_iter |
The number of iterations without improvement before stopping (specific engines only). |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode and engine are captured as quosures. To pass values programmatically, use the injection operator like so:
value <- 1
boost_tree(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, xgboost engine details
, C5.0 engine details
, h2o engine details
, lightgbm engine details
, mboost engine details
, spark engine details
,
xgb_train()
, C5.0_train()
Examples
show_engines("boost_tree")
boost_tree(mode = "classification", trees = 20)
C5.0 rule-based classification models
Description
C5_rules()
defines a model that derives feature rules from a tree for
prediction. A single tree or boosted ensemble can be used. This function can
fit classification models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
C5.0¹²
¹ The default engine. ² Requires a parsnip extension package.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
C5_rules(mode = "classification", trees = NULL, min_n = NULL, engine = "C5.0")
Arguments
mode |
A single character string for the type of model. The only possible value for this model is "classification". |
trees |
A non-negative integer (no greater than 100) for the number of members of the ensemble. |
min_n |
An integer between zero and nine for the minimum number of data points in a node that are required for the node to be split further. |
engine |
A single character string specifying what computational engine to use for fitting. |
Details
C5.0 is a classification model that is an extension of the C4.5
model of Quinlan (1993). It has tree- and rule-based versions that also
include boosting capabilities. C5_rules()
enables the version of the model
that uses a series of rules (see the examples below). To make a set of
rules, an initial C5.0 tree is created and flattened into rules. The rules
are pruned, simplified, and ordered. Rule sets are created within each
iteration of boosting.
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode and engine are captured as quosures. To pass values programmatically, use the injection operator like so:
value <- 1
C5_rules(argument = !!value)
References
Quinlan R (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers.
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
C50::C5.0()
, C50::C5.0Control()
,
fit()
, set_engine()
, update()
, C5.0 engine details
Examples
show_engines("C5_rules")
C5_rules()
Boosted trees via C5.0
Description
C5.0_train
is a wrapper for the C5.0()
function in the
C50 package that fits tree-based models
where all of the model arguments are in the main function.
Usage
C5.0_train(x, y, weights = NULL, trials = 15, minCases = 2, sample = 0, ...)
Arguments
x |
A data frame or matrix of predictors. |
y |
A factor vector with 2 or more levels |
weights |
An optional numeric vector of case weights. Note that the data used for the case weights will not be used as a splitting variable in the model (see https://www.rulequest.com/see5-info.html for Quinlan's notes on case weights). |
trials |
An integer specifying the number of boosting iterations. A value of one indicates that a single model is used. |
minCases |
An integer for the smallest number of samples that must be put in at least two of the splits. |
sample |
A value in the range (0, .999) that specifies the random proportion of the data that should be used to train the model. By default, all the samples are used for model training. Samples not used for training are used to evaluate the accuracy of the model in the printed output. A value of zero means that all the training data are used. |
... |
Other arguments to pass. |
Value
A fitted C5.0 model.
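Examples
A minimal usage sketch, assuming the C50 and modeldata packages are installed (the column names follow modeldata::two_class_dat):
data(two_class_dat, package = "modeldata")
C5.0_train(
  x = two_class_dat[, c("A", "B")],
  y = two_class_dat$Class,
  trials = 5
)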
Using case weights with parsnip
Description
Case weights are positive numeric values that determine how much influence each data point has during the model fitting process. There are a variety of situations where case weights can be used.
Details
tidymodels packages differentiate how different types of case weights should be used during the entire data analysis process, including preprocessing data, model fitting, performance calculations, etc.
The tidymodels packages require users to convert their numeric vectors to a vector class that reflects how these should be used. For example, there are some situations where the weights should not affect operations such as centering and scaling or other preprocessing operations.
The types of weights allowed in tidymodels are:
- Frequency weights via hardhat::frequency_weights()
- Importance weights via hardhat::importance_weights()
More types can be added by request.
For parsnip, the fit() and fit_xy() functions contain a case_weights argument that takes these data. For Spark models, the argument value should be a character value.
See Also
frequency_weights()
, importance_weights()
, fit()
, fit_xy()
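A minimal sketch of passing frequency weights to fit() (the weight values here are arbitrary):
cars_wts <- mtcars
cars_wts$wts <- hardhat::frequency_weights(rep(c(1L, 2L), length.out = nrow(cars_wts)))
linear_reg() %>%
  fit(mpg ~ wt + hp, data = cars_wts, case_weights = cars_wts$wts)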
Determine if case weights are used
Description
Not all modeling engines can incorporate case weights into their calculations. This function can determine whether they can be used.
Usage
case_weights_allowed(spec)
Arguments
spec |
A parsnip model specification. |
Value
A single logical.
Examples
case_weights_allowed(linear_reg())
case_weights_allowed(linear_reg(engine = "keras"))
Calculations for inverse probability of censoring weights (IPCW)
Description
The method of Graf et al (1999) is used to compute weights at specific evaluation times that can be used to help measure a model's time-dependent performance (e.g. the time-dependent Brier score or the area under the ROC curve). This is an internal function.
Usage
.censoring_weights_graf(object, ...)
## Default S3 method:
.censoring_weights_graf(object, ...)
## S3 method for class 'model_fit'
.censoring_weights_graf(
object,
predictions,
cens_predictors = NULL,
trunc = 0.05,
eps = 10^-10,
...
)
Arguments
object |
A fitted parsnip model object or fitted workflow with a mode of "censored regression". |
predictions | A data frame with a column containing a survival::Surv() object. |
cens_predictors | Not currently used. A potential future slot for models with informative censoring based on columns in predictions. |
trunc | A potential lower bound for the probability of censoring to avoid very large weight values. |
eps | A small value that is subtracted from the evaluation time when computing the censoring probabilities. See Details below. |
Details
A probability that the data are censored immediately prior to a specific time is computed. To do this, we must determine what time to make the prediction. There are two time values for each row of the data set: the observed time (either censored or not) and the time that the model is being evaluated at (e.g. the survival function prediction at some time point), which is constant across rows.
From Graf et al (1999) there are three cases:
If the observed time is a censoring time and that is before the evaluation time, the data point should make no contribution to the performance metric (their "category 3"). These values have a missing value for their probability estimate (and also for their weight column).
If the observed time corresponds to an actual event, and that time is prior to the evaluation time (category 1), the probability of being censored is predicted at the observed time (minus an epsilon).
If the observed time is after the evaluation time (category 2), regardless of the status, the probability of being censored is predicted at the evaluation time (minus an epsilon).
The epsilon is used since we would not have actual information at time t for a data point being predicted at time t (only data prior to time t should be available).
After the censoring probability is computed, the trunc option is used to avoid using numbers pathologically close to zero. After this, the weight is computed by inverting the censoring probability.
The eps argument is used to avoid information leakage when computing the censoring probability. Subtracting a small number avoids using data that would not be known at the time of prediction. For example, if we are making survival probability predictions at eval_time = 3.0, we would not know the probability of being censored at that exact time (since it has not occurred yet).
When creating weights by inverting probabilities, there is the risk that a few cases will have severe outliers due to probabilities close to zero. To mitigate this, the trunc argument can be used to put a cap on the weights. If the smallest probability is greater than trunc, the probabilities with values less than trunc are given that value. Otherwise, trunc is adjusted to be half of the smallest probability and that value is used as the lower bound.
Note that if there are n rows in data and t time points, the resulting data, once unnested, has n * t rows. Computations will not easily scale well as t becomes very large.
Value
The same data are returned with the pred tibbles containing several new columns:
- .weight_time: the time at which the inverse censoring probability weights are computed. This is a function of the observed time and the time of analysis (i.e., eval_time). See Details for more information.
- .pred_censored: the probability of being censored at .weight_time.
- .weight_censored: the inverse of the censoring probability.
References
Graf, E., Schmoor, C., Sauerbrei, W. and Schumacher, M. (1999), Assessment and comparison of prognostic classification schemes for survival data. Statist. Med., 18: 2529-2545.
Check to ensure that ellipses are empty
Description
Check to ensure that ellipses are empty
Usage
check_empty_ellipse(...)
Arguments
... |
Extra arguments. |
Value
If an error is not thrown (from non-empty ellipses), a NULL list.
Condense control object into strictly smaller control object
Description
This function is used to help the hierarchy of control functions used throughout the tidymodels packages. It is now assumed that each control function is either a subset or a superset of another control function.
Usage
condense_control(x, ref, ..., call = rlang::caller_env())
Arguments
x |
A control object to be condensed. |
ref |
A control object that is used to determine what element should be kept. |
call | The execution environment of a currently running function, e.g. rlang::caller_env(). |
Value
A control object with the same elements and classes of ref, with values of x.
Examples
ctrl <- control_parsnip(catch = TRUE)
ctrl$allow_par <- TRUE
str(ctrl)
ctrl <- condense_control(ctrl, control_parsnip())
str(ctrl)
Control the fit function
Description
Pass options to the fit.model_spec()
function to control its
output and computations
Usage
control_parsnip(verbosity = 1L, catch = FALSE)
Arguments
verbosity | An integer to control how verbose the output is. For a value of zero, no messages or output are shown when packages are loaded or when the model is fit. For a value of 1, package loading is quiet but model fits can produce output to the screen (depending on if they contain their own verbose-type argument). |
catch | A logical where a value of TRUE will evaluate the model inside of try(silent = TRUE). If the model fails, an object is still returned (without an error) that inherits the class "try-error". |
Value
An S3 object with class "control_parsnip" that is a named list with the results of the function call
Examples
control_parsnip(verbosity = 2L)
Convenience function for intervals
Description
Convenience function for intervals
Usage
convert_stan_interval(x, level = 0.95, lower = TRUE)
Arguments
x |
A fitted model object |
level |
Level of uncertainty for intervals |
lower |
A logical: should the lower bound of the interval be computed (as opposed to the upper bound)? |
A wrapper function for conditional inference tree models
Description
These functions are slightly different APIs for partykit::ctree()
and
partykit::cforest()
that have several important arguments as top-level
arguments (as opposed to being specified in partykit::ctree_control()
).
Usage
ctree_train(
formula,
data,
weights = NULL,
minsplit = 20L,
maxdepth = Inf,
teststat = "quadratic",
testtype = "Bonferroni",
mincriterion = 0.95,
...
)
cforest_train(
formula,
data,
weights = NULL,
minsplit = 20L,
maxdepth = Inf,
teststat = "quadratic",
testtype = "Univariate",
mincriterion = 0,
mtry = ceiling(sqrt(ncol(data) - 1)),
ntree = 500L,
...
)
Arguments
formula |
A symbolic description of the model to be fit. |
data |
A data frame containing the variables in the model. |
weights | A vector of weights whose length is the same as nrow(data). |
minsplit | The minimum sum of weights in a node in order to be considered for splitting. |
maxdepth | Maximum depth of the tree. The default of Inf means that no restriction is applied to tree depth. |
teststat | A character specifying the type of the test statistic to be applied. |
testtype | A character specifying how to compute the distribution of the test statistic. |
mincriterion | The value of the test statistic (for testtype = "Teststatistic"), or 1 - p-value (for other values of testtype), that must be exceeded in order to implement a split. |
... | Other options to pass to partykit::ctree() or partykit::cforest(). |
mtry | Number of input variables randomly sampled as candidates at each node for random forest like algorithms. The default uses the square root of the number of predictors. |
ntree | Number of trees to grow in a forest. |
Value
An object of class party (for ctree) or cforest.
Examples
if (rlang::is_installed(c("modeldata", "partykit"))) {
data(bivariate, package = "modeldata")
ctree_train(Class ~ ., data = bivariate_train)
ctree_train(Class ~ ., data = bivariate_train, maxdepth = 1)
}
Cubist rule-based regression models
Description
cubist_rules()
defines a model that derives simple feature rules from a tree
ensemble and creates regression models within each rule. This function can fit
regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
Cubist¹²
¹ The default engine. ² Requires a parsnip extension package.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
cubist_rules(
mode = "regression",
committees = NULL,
neighbors = NULL,
max_rules = NULL,
engine = "Cubist"
)
Arguments
mode |
A single character string for the type of model. The only possible value for this model is "regression". |
committees |
A non-negative integer (no greater than 100) for the number of members of the ensemble. |
neighbors |
An integer between zero and nine for the number of training set instances that are used to adjust the model-based prediction. |
max_rules |
The largest number of rules. |
engine |
A single character string specifying what computational engine to use for fitting. |
Details
Cubist is a rule-based ensemble regression model. A basic model tree (Quinlan, 1992) is created that has a separate linear regression model corresponding to each terminal node. The paths along the model tree are flattened into rules and these rules are simplified and pruned. The parameter min_n is the primary method for controlling the size of each tree while max_rules controls the number of rules.
Cubist ensembles are created using committees, which are similar to boosting. After the first model in the committee is created, the second model uses a modified version of the outcome data based on whether the previous model under- or over-predicted the outcome. For iteration m, a new outcome y* is computed by adjusting the observed outcome in the direction opposite to the previous model's error. If a sample is under-predicted on the previous iteration, the outcome is adjusted so that the next time it is more likely to be over-predicted to compensate. This adjustment continues for each ensemble iteration. See Kuhn and Johnson (2013) for details.
After the model is created, there is also an option for a post-hoc adjustment that uses the training set (Quinlan, 1993). When a new sample is predicted by the model, it can be modified by its nearest neighbors in the original training set. For K neighbors, the model-based predicted value is adjusted using a weighted average that involves each neighbor's training set prediction t and a weight w that is inverse to the distance to the neighbor.
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode and engine are captured as quosures. To pass values programmatically, use the injection operator like so:
value <- 1
cubist_rules(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
Quinlan R (1992). "Learning with Continuous Classes." Proceedings of the 5th Australian Joint Conference On Artificial Intelligence, pp. 343-348.
Quinlan R (1993)."Combining Instance-Based and Model-Based Learning." Proceedings of the Tenth International Conference on Machine Learning, pp. 236-243.
Kuhn M and Johnson K (2013). Applied Predictive Modeling. Springer.
See Also
Cubist::cubist()
, Cubist::cubistControl()
, fit()
, set_engine()
, update()
, Cubist engine details
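Examples
A specification sketch (fitting requires the rules extension package and Cubist):
cubist_rules(committees = 2, neighbors = 3)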
Decision trees
Description
decision_tree()
defines a model as a set of if/then
statements that
creates a tree-based structure. This function can fit classification,
regression, and censored regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
¹ The default engine. ² Requires a parsnip extension package for censored regression, classification, and regression.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
decision_tree(
mode = "unknown",
engine = "rpart",
cost_complexity = NULL,
tree_depth = NULL,
min_n = NULL
)
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", "classification", or "censored regression". |
engine |
A single character string specifying what computational engine to use for fitting. |
cost_complexity | A positive number for the cost/complexity parameter (a.k.a. Cp) used by CART models (specific engines only). |
tree_depth |
An integer for maximum depth of the tree. |
min_n |
An integer for the minimum number of data points in a node that are required for the node to be split further. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode and engine are captured as quosures. To pass values programmatically, use the injection operator like so:
value <- 1
decision_tree(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, rpart engine details
, C5.0 engine details
, partykit engine details
, spark engine details
Examples
show_engines("decision_tree")
decision_tree(mode = "classification", tree_depth = 5)
Data Set Characteristics Available when Fitting Models
Description
When using the fit()
functions there are some
variables that will be available for use in arguments. For
example, if the user would like to choose an argument value
based on the current number of rows in a data set, the .obs()
function can be used. See Details below.
Usage
.cols()
.preds()
.obs()
.lvls()
.facts()
.x()
.y()
.dat()
Details
Existing functions:
- .obs(): The current number of rows in the data set.
- .preds(): The number of columns in the data set that are associated with the predictors prior to dummy variable creation.
- .cols(): The number of predictor columns available after dummy variables are created (if any).
- .facts(): The number of factor predictors in the data set.
- .lvls(): If the outcome is a factor, this is a table with the counts for each level (and NA otherwise).
- .x(): The predictors returned in the format given. Either a data frame or a matrix.
- .y(): The known outcomes returned in the format given. Either a vector, matrix, or data frame.
- .dat(): A data frame containing all of the predictors and the outcomes. If fit_xy() was used, the outcomes are attached as the column ..y.
For example, if you use the model formula circumference ~ .
with the
built-in Orange
data, the values would be
.preds() = 2 (the 2 remaining columns in `Orange`)
.cols() = 5 (1 numeric column + 4 from Tree dummy variables)
.obs() = 35
.lvls() = NA (no factor outcome)
.facts() = 1 (the Tree predictor)
.y() = <vector> (circumference as a vector)
.x() = <data.frame> (The other 2 columns as a data frame)
.dat() = <data.frame> (The full data set)
If the formula Tree ~ .
were used:
.preds() = 2 (the 2 numeric columns in `Orange`)
.cols() = 2 (same)
.obs() = 35
.lvls() = c("1" = 7, "2" = 7, "3" = 7, "4" = 7, "5" = 7)
.facts() = 0
.y() = <vector> (Tree as a vector)
.x() = <data.frame> (The other 2 columns as a data frame)
.dat() = <data.frame> (The full data set)
To use these in a model fit, pass them to a model specification.
The evaluation is delayed until the time when the
model is run via fit()
(and the variables listed above are available).
For example:
library(modeldata)
data("lending_club")
rand_forest(mode = "classification", mtry = .cols() - 2)
When no descriptors are found, the computation of the descriptor values is not executed.
Automatic machine learning via h2o
Description
h2o::h2o.automl defines an automated model training process and returns a leaderboard of the models with the best performance.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has no tuning parameters.
Engine arguments of interest
- max_runtime_secs and max_models: control the maximum running time and the number of models to build in the automatic process.
- exclude_algos and include_algos: a character vector indicating the excluded or included algorithms during model building. To see a full list of supported models, see the details section in h2o::h2o.automl().
- validation: A number between 0 and 1 specifying the proportion of training data reserved as a validation set. This is used by h2o for performance assessment and potential early stopping.
Translation from parsnip to the original package (regression)
agua::h2o_train_auto()
is a wrapper around
h2o::h2o.automl()
.
auto_ml() %>%
  set_engine("h2o") %>%
  set_mode("regression") %>%
  translate()

## Automatic Machine Learning Model Specification (regression)
##
## Computational engine: h2o
##
## Model fit template:
## agua::h2o_train_auto(x = missing_arg(), y = missing_arg(), weights = missing_arg(),
##     validation_frame = missing_arg(), verbosity = NULL)
Translation from parsnip to the original package (classification)
auto_ml() %>%
  set_engine("h2o") %>%
  set_mode("classification") %>%
  translate()

## Automatic Machine Learning Model Specification (classification)
##
## Computational engine: h2o
##
## Model fit template:
## agua::h2o_train_auto(x = missing_arg(), y = missing_arg(), weights = missing_arg(),
##     validation_frame = missing_arg(), verbosity = NULL)
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Initializing h2o
To use the h2o engine with tidymodels, please run h2o::h2o.init() first. By default, this connects R to the local h2o server. This needs to be done in every new R session. You can also connect to a remote h2o server with an IP address; for more details see h2o::h2o.init().
You can control the number of threads in the thread pool used by h2o with the nthreads argument. By default, it uses all CPUs on the host.
This is different from the usual parallel processing mechanism in tidymodels for tuning: while tidymodels parallelizes over resamples, h2o parallelizes over hyperparameter combinations for a given resample.
h2o will automatically shut down the local h2o instance started by R when R is terminated. To manually stop the h2o server, run h2o::h2o.shutdown().
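A sketch of a typical session setup (assumes the agua and h2o packages are installed):
library(agua)
h2o::h2o.init()  # start or connect to the local h2o server
auto_ml() %>% set_mode("regression")
# h2o::h2o.shutdown(prompt = FALSE)  # optional: stop the server manually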
Saving fitted model objects
Models fitted with this engine may require native serialization methods to be properly saved and/or passed between R sessions. To learn more about preparing fitted models for serialization, see the bundle package.
Bagged MARS via earth
Description
baguette::bagger() creates a collection of MARS models forming an ensemble. All models in the ensemble are combined to produce a final prediction.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 3 tuning parameters:
- prod_degree: Degree of Interaction (type: integer, default: 1L)
- prune_method: Pruning Method (type: character, default: "backward")
- num_terms: # Model Terms (type: integer, default: see below)
The default value of num_terms depends on the number of predictor columns. For a data frame x, the default is min(200, max(20, 2 * ncol(x))) + 1 (see earth::earth() and the reference below).
Translation from parsnip to the original package (regression)
The baguette extension package is required to fit this model.
bag_mars(
  num_terms = integer(1),
  prod_degree = integer(1),
  prune_method = character(1)
) %>%
  set_engine("earth") %>%
  set_mode("regression") %>%
  translate()

## Bagged MARS Model Specification (regression)
##
## Main Arguments:
##   num_terms = integer(1)
##   prod_degree = integer(1)
##   prune_method = character(1)
##
## Computational engine: earth
##
## Model fit template:
## baguette::bagger(formula = missing_arg(), data = missing_arg(),
##     weights = missing_arg(), nprune = integer(1), degree = integer(1),
##     pmethod = character(1), base_model = "MARS")
Translation from parsnip to the original package (classification)
The baguette extension package is required to fit this model.
library(baguette)
bag_mars(
  num_terms = integer(1),
  prod_degree = integer(1),
  prune_method = character(1)
) %>%
  set_engine("earth") %>%
  set_mode("classification") %>%
  translate()

## Bagged MARS Model Specification (classification)
##
## Main Arguments:
##   num_terms = integer(1)
##   prod_degree = integer(1)
##   prune_method = character(1)
##
## Computational engine: earth
##
## Model fit template:
## baguette::bagger(formula = missing_arg(), data = missing_arg(),
##     weights = missing_arg(), nprune = integer(1), degree = integer(1),
##     pmethod = character(1), base_model = "MARS")
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() arguments have arguments called case_weights that expect vectors of case weights.
Note that the earth package documentation has: “In the current implementation, building models with weights can be slow.”
References
Breiman, L. 1996. “Bagging predictors”. Machine Learning. 24 (2): 123-140
Friedman, J. 1991. “Multivariate Adaptive Regression Splines.” The Annals of Statistics, vol. 19, no. 1, pp. 1-67.
Milborrow, S. “Notes on the earth package.”
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Bagged neural networks via nnet
Description
baguette::bagger() creates a collection of neural networks forming an ensemble. All networks in the ensemble are combined to produce a final prediction.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 3 tuning parameters:
- hidden_units: # Hidden Units (type: integer, default: 10L)
- penalty: Amount of Regularization (type: double, default: 0.0)
- epochs: # Epochs (type: integer, default: 1000L)
These defaults are set by the baguette package and are different from those in nnet::nnet().
Translation from parsnip to the original package (classification)
The baguette extension package is required to fit this model.
library(baguette)
bag_mlp(penalty = double(1), hidden_units = integer(1)) %>%
  set_engine("nnet") %>%
  set_mode("classification") %>%
  translate()

## Bagged Neural Network Model Specification (classification)
##
## Main Arguments:
##   hidden_units = integer(1)
##   penalty = double(1)
##
## Computational engine: nnet
##
## Model fit template:
## baguette::bagger(formula = missing_arg(), data = missing_arg(),
##     weights = missing_arg(), size = integer(1), decay = double(1),
##     base_model = "nnet")
Translation from parsnip to the original package (regression)
The baguette extension package is required to fit this model.
library(baguette)
bag_mlp(penalty = double(1), hidden_units = integer(1)) %>%
  set_engine("nnet") %>%
  set_mode("regression") %>%
  translate()

## Bagged Neural Network Model Specification (regression)
##
## Main Arguments:
##   hidden_units = integer(1)
##   penalty = double(1)
##
## Computational engine: nnet
##
## Model fit template:
## baguette::bagger(formula = missing_arg(), data = missing_arg(),
##     weights = missing_arg(), size = integer(1), decay = double(1),
##     base_model = "nnet")
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
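One way to do this is with a normalization step in a recipe; a sketch, assuming the recipes package is installed:
library(recipes)
rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_numeric_predictors())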
Case weights
The underlying model implementation does not allow for case weights.
References
Breiman L. 1996. “Bagging predictors”. Machine Learning. 24 (2): 123-140
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Bagged trees via C5.0
Description
baguette::bagger() creates a collection of decision trees forming an ensemble. All trees in the ensemble are combined to produce a final prediction.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 1 tuning parameter:
- min_n: Minimal Node Size (type: integer, default: 2L)
Translation from parsnip to the original package (classification)
The baguette extension package is required to fit this model.
library(baguette)
bag_tree(min_n = integer()) %>%
  set_engine("C5.0") %>%
  set_mode("classification") %>%
  translate()

## Bagged Decision Tree Model Specification (classification)
##
## Main Arguments:
##   cost_complexity = 0
##   min_n = integer()
##
## Computational engine: C5.0
##
## Model fit template:
## baguette::bagger(x = missing_arg(), y = missing_arg(), weights = missing_arg(),
##     minCases = integer(), base_model = "C5.0")
Preprocessing requirements
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c}
vs {b, d}
) when splitting at a node. Dummy variables
are not required for this model.
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() arguments have arguments called case_weights that expect vectors of case weights.
References
Breiman, L. 1996. “Bagging predictors”. Machine Learning. 24 (2): 123-140
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Bagged trees via rpart
Description
baguette::bagger() and ipred::bagging() create collections of decision
trees forming an ensemble. All trees in the ensemble are combined to
produce a final prediction.
Details
For this engine, there are multiple modes: classification, regression, and censored regression
Tuning Parameters
This model has 4 tuning parameters:

- class_cost: Class Cost (type: double, default: (see below))
- tree_depth: Tree Depth (type: integer, default: 30L)
- min_n: Minimal Node Size (type: integer, default: 2L)
- cost_complexity: Cost-Complexity Parameter (type: double, default: 0.01)
For the class_cost parameter, the value can be a non-negative scalar
for a class cost (where a cost of 1 means no extra cost). This is
useful when the first level of the outcome factor is the minority
class. If this is not the case, values between zero and one can be
used to bias to the second level of the factor. A sketch of supplying
this argument follows.
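This is a minimal sketch; the cost value of 2 is purely illustrative:

library(baguette)

# a cost of 1 means no extra cost; larger values place extra weight
# on the first (minority) class
bag_tree(class_cost = 2, min_n = 10) %>%
  set_engine("rpart") %>%
  set_mode("classification")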
Translation from parsnip to the original package (classification)
The baguette extension package is required to fit this model.
library(baguette) bag_tree(tree_depth = integer(1), min_n = integer(1), cost_complexity = double(1)) %>% set_engine("rpart") %>% set_mode("classification") %>% translate()
## Bagged Decision Tree Model Specification (classification) ## ## Main Arguments: ## cost_complexity = double(1) ## tree_depth = integer(1) ## min_n = integer(1) ## ## Computational engine: rpart ## ## Model fit template: ## baguette::bagger(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg(), cp = double(1), maxdepth = integer(1), ## minsplit = integer(1), base_model = "CART")
Translation from parsnip to the original package (regression)
The baguette extension package is required to fit this model.
library(baguette) bag_tree(tree_depth = integer(1), min_n = integer(1), cost_complexity = double(1)) %>% set_engine("rpart") %>% set_mode("regression") %>% translate()
## Bagged Decision Tree Model Specification (regression) ## ## Main Arguments: ## cost_complexity = double(1) ## tree_depth = integer(1) ## min_n = integer(1) ## ## Computational engine: rpart ## ## Model fit template: ## baguette::bagger(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg(), cp = double(1), maxdepth = integer(1), ## minsplit = integer(1), base_model = "CART")
Translation from parsnip to the original package (censored regression)
The censored extension package is required to fit this model.
library(censored) bag_tree(tree_depth = integer(1), min_n = integer(1), cost_complexity = double(1)) %>% set_engine("rpart") %>% set_mode("censored regression") %>% translate()
## Bagged Decision Tree Model Specification (censored regression) ## ## Main Arguments: ## cost_complexity = double(1) ## tree_depth = integer(1) ## min_n = integer(1) ## ## Computational engine: rpart ## ## Model fit template: ## ipred::bagging(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg(), cp = double(1), maxdepth = integer(1), ## minsplit = integer(1))
Preprocessing requirements
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c} vs {b, d}) when splitting at a node. Dummy variables
are not required for this model.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() arguments have arguments called case_weights
that expect vectors of case weights.
Other details
Predictions of type "time" are predictions of the median survival
time.
References
Breiman, L. 1996. “Bagging predictors”. Machine Learning. 24 (2): 123-140
Hothorn T, Lausen B, Benner A, Radespiel-Troeger M. 2004. Bagging Survival Trees. Statistics in Medicine, 23(1), 77–91.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Bayesian additive regression trees via dbarts
Description
dbarts::bart() creates an ensemble of tree-based models whose training
and assembly is determined using Bayesian analysis.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 4 tuning parameters:

- trees: # Trees (type: integer, default: 200L)
- prior_terminal_node_coef: Terminal Node Prior Coefficient (type: double, default: 0.95)
- prior_terminal_node_expo: Terminal Node Prior Exponent (type: double, default: 2.00)
- prior_outcome_range: Prior for Outcome Range (type: double, default: 2.00)
Important engine-specific options
Some relevant arguments that can be passed to set_engine():

- keepevery, n.thin: Every keepevery draw is kept to be returned to
  the user. Useful for “thinning” samples.
- ntree, n.trees: The number of trees in the sum-of-trees formulation.
- ndpost, n.samples: The number of posterior draws after burn in;
  ndpost / keepevery will actually be returned.
- nskip, n.burn: Number of MCMC iterations to be treated as burn in.
- nchain, n.chains: Integer specifying how many independent tree sets
  and fits should be calculated.
- nthread, n.threads: Integer specifying how many threads to use.
  Depending on the CPU architecture, using more than the number of
  chains can degrade performance for small/medium data sets. As such,
  some calculations may be executed single threaded regardless.
- combinechains, combineChains: Logical; if TRUE, samples will be
  returned in arrays of dimensions equal to nchain times ndpost times
  the number of observations.
Translation from parsnip to the original package (classification)
bart( trees = integer(1), prior_terminal_node_coef = double(1), prior_terminal_node_expo = double(1), prior_outcome_range = double(1) ) %>% set_engine("dbarts") %>% set_mode("classification") %>% translate()
## BART Model Specification (classification) ## ## Main Arguments: ## trees = integer(1) ## prior_terminal_node_coef = double(1) ## prior_terminal_node_expo = double(1) ## prior_outcome_range = double(1) ## ## Computational engine: dbarts ## ## Model fit template: ## dbarts::bart(x = missing_arg(), y = missing_arg(), ntree = integer(1), ## base = double(1), power = double(1), k = double(1), verbose = FALSE, ## keeptrees = TRUE, keepcall = FALSE)
Translation from parsnip to the original package (regression)
bart( trees = integer(1), prior_terminal_node_coef = double(1), prior_terminal_node_expo = double(1), prior_outcome_range = double(1) ) %>% set_engine("dbarts") %>% set_mode("regression") %>% translate()
## BART Model Specification (regression) ## ## Main Arguments: ## trees = integer(1) ## prior_terminal_node_coef = double(1) ## prior_terminal_node_expo = double(1) ## prior_outcome_range = double(1) ## ## Computational engine: dbarts ## ## Model fit template: ## dbarts::bart(x = missing_arg(), y = missing_arg(), ntree = integer(1), ## base = double(1), power = double(1), k = double(1), verbose = FALSE, ## keeptrees = TRUE, keepcall = FALSE)
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), parsnip will convert factor columns to
indicators. dbarts::bart() will also convert the factors to indicators
if the user does not create them first.
References
Chipman, George, McCulloch. “BART: Bayesian additive regression trees.” Ann. Appl. Stat. 4 (1) 266 - 298, March 2010.
Boosted trees via C5.0
Description
C50::C5.0()
creates a series of classification trees forming an
ensemble. Each tree depends on the results of previous trees. All trees in
the ensemble are combined to produce a final prediction.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 3 tuning parameters:

- trees: # Trees (type: integer, default: 15L)
- min_n: Minimal Node Size (type: integer, default: 2L)
- sample_size: Proportion Observations Sampled (type: double, default: 1.0)
The implementation of C5.0 limits the number of trees to be between 1 and 100.
Translation from parsnip to the original package (classification)
boost_tree(trees = integer(), min_n = integer(), sample_size = numeric()) %>% set_engine("C5.0") %>% set_mode("classification") %>% translate()
## Boosted Tree Model Specification (classification) ## ## Main Arguments: ## trees = integer() ## min_n = integer() ## sample_size = numeric() ## ## Computational engine: C5.0 ## ## Model fit template: ## parsnip::C5.0_train(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## trials = integer(), minCases = integer(), sample = numeric())
C5.0_train() is a wrapper around C50::C5.0() that makes it easier to
run this model.
Preprocessing requirements
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c} vs {b, d}) when splitting at a node. Dummy variables
are not required for this model.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() arguments have arguments called case_weights
that expect vectors of case weights.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
Other details
Early stopping
By default, early stopping is used. To use the complete set of
boosting iterations, pass earlyStopping = FALSE to set_engine(). Also,
it is unlikely that early stopping will occur if sample_size = 1.
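For example, a minimal sketch of disabling the early stopping (the number of trees is illustrative):

boost_tree(trees = 50) %>%
  set_engine("C5.0", earlyStopping = FALSE) %>%
  set_mode("classification")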
Examples
The “Fitting and Predicting with parsnip” article contains examples
for boost_tree() with the "C5.0" engine.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Boosted trees via h2o
Description
h2o::h2o.xgboost() creates a series of decision trees
forming an ensemble. Each tree depends on the results of previous trees.
All trees in the ensemble are combined to produce a final prediction.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 8 tuning parameters:

- trees: # Trees (type: integer, default: 50)
- tree_depth: Tree Depth (type: integer, default: 6)
- min_n: Minimal Node Size (type: integer, default: 1)
- learn_rate: Learning Rate (type: double, default: 0.3)
- sample_size: # Observations Sampled (type: integer, default: 1)
- mtry: # Randomly Selected Predictors (type: integer, default: 1)
- loss_reduction: Minimum Loss Reduction (type: double, default: 0)
- stop_iter: # Iterations Before Stopping (type: integer, default: 0)
min_n represents the fewest allowed observations in a terminal node;
h2o::h2o.xgboost() allows only one row in a leaf by default.

stop_iter controls early stopping rounds based on the convergence of
the engine parameter stopping_metric. By default, h2o::h2o.xgboost()
does not use early stopping. When stop_iter is not 0,
h2o::h2o.xgboost() uses logloss for classification, deviance for
regression, and anomaly score for Isolation Forest. This is mostly
useful alongside the engine parameter validation, the proportion of
the data used for a train-validation split: parsnip will split the
data and pass the two data frames to h2o, and h2o::h2o.xgboost() will
then evaluate the metric and the early stopping criteria on the
validation set.
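For instance, a minimal sketch pairing stop_iter with a validation split (the values are illustrative):

library(agua)

boost_tree(trees = 500, stop_iter = 5) %>%
  set_engine("h2o", validation = 0.1) %>%
  set_mode("classification")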
Translation from parsnip to the original package (regression)
agua::h2o_train_xgboost() is a wrapper around h2o::h2o.xgboost().
The agua extension package is required to fit this model.
boost_tree( mtry = integer(), trees = integer(), tree_depth = integer(), learn_rate = numeric(), min_n = integer(), loss_reduction = numeric(), stop_iter = integer() ) %>% set_engine("h2o") %>% set_mode("regression") %>% translate()
## Boosted Tree Model Specification (regression) ## ## Main Arguments: ## mtry = integer() ## trees = integer() ## min_n = integer() ## tree_depth = integer() ## learn_rate = numeric() ## loss_reduction = numeric() ## stop_iter = integer() ## ## Computational engine: h2o ## ## Model fit template: ## agua::h2o_train_xgboost(x = missing_arg(), y = missing_arg(), ## weights = missing_arg(), validation_frame = missing_arg(), ## col_sample_rate = integer(), ntrees = integer(), min_rows = integer(), ## max_depth = integer(), learn_rate = numeric(), min_split_improvement = numeric(), ## stopping_rounds = integer())
Translation from parsnip to the original package (classification)
The agua extension package is required to fit this model.
boost_tree( mtry = integer(), trees = integer(), tree_depth = integer(), learn_rate = numeric(), min_n = integer(), loss_reduction = numeric(), stop_iter = integer() ) %>% set_engine("h2o") %>% set_mode("classification") %>% translate()
## Boosted Tree Model Specification (classification) ## ## Main Arguments: ## mtry = integer() ## trees = integer() ## min_n = integer() ## tree_depth = integer() ## learn_rate = numeric() ## loss_reduction = numeric() ## stop_iter = integer() ## ## Computational engine: h2o ## ## Model fit template: ## agua::h2o_train_xgboost(x = missing_arg(), y = missing_arg(), ## weights = missing_arg(), validation_frame = missing_arg(), ## col_sample_rate = integer(), ntrees = integer(), min_rows = integer(), ## max_depth = integer(), learn_rate = numeric(), min_split_improvement = numeric(), ## stopping_rounds = integer())
Preprocessing
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c}
vs {b, d}
) when splitting at a node. Dummy variables
are not required for this model.
Non-numeric predictors (i.e., factors) are internally converted to numeric. In the classification context, non-numeric outcomes (i.e., factors) are also internally converted to numeric.
Interpreting mtry
The mtry argument denotes the number of predictors that will be
randomly sampled at each split when creating tree models.

Some engines, such as "xgboost", "xrf", and "lightgbm", interpret
their analogue to the mtry argument as the proportion of predictors
that will be randomly sampled at each split rather than the count. In
some settings, such as when tuning over preprocessors that influence
the number of predictors, this parameterization is quite helpful:
interpreting mtry as a proportion means that [0, 1] is always a valid
range for that parameter, regardless of input data.

parsnip and its extensions accommodate this parameterization using the
counts argument: a logical indicating whether mtry should be
interpreted as the number of predictors that will be randomly sampled
at each split. TRUE indicates that mtry will be interpreted in its
sense as a count; FALSE indicates that the argument will be
interpreted in its sense as a proportion.

mtry is a main model argument for boost_tree() and rand_forest(), and
thus should not have an engine-specific interface. So, regardless of
engine, counts defaults to TRUE. For engines that support the
proportion interpretation (currently "xgboost" and "xrf", via the
rules package, and "lightgbm" via the bonsai package) the user can
pass the counts = FALSE argument to set_engine() to supply mtry values
within [0, 1].
Initializing h2o
To use the h2o engine with tidymodels, please run h2o::h2o.init()
first. By default, this connects R to the local h2o server. This needs
to be done in every new R session. You can also connect to a remote
h2o server with an IP address; for more details see h2o::h2o.init().

You can control the number of threads in the thread pool used by h2o
with the nthreads argument. By default, it uses all CPUs on the host.
This is different from the usual parallel processing mechanism in
tidymodels for tuning: tidymodels parallelizes over resamples, while
h2o parallelizes over hyperparameter combinations for a given
resample.

h2o will automatically shut down the local h2o instance started by R
when R is terminated. To manually stop the h2o server, run
h2o::h2o.shutdown().
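A minimal sketch of such a session (the nthreads value is illustrative):

library(h2o)

h2o::h2o.init(nthreads = 4)   # connect R to a local h2o server

# ... fit and predict parsnip models with the "h2o" engine ...

h2o::h2o.shutdown(prompt = FALSE)   # manually stop the server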
Saving fitted model objects
Models fitted with this engine may require native serialization methods to be properly saved and/or passed between R sessions. To learn more about preparing fitted models for serialization, see the bundle package.
Boosted trees via lightgbm
Description
lightgbm::lgb.train() creates a series of decision trees
forming an ensemble. Each tree depends on the results of previous trees.
All trees in the ensemble are combined to produce a final prediction.
Details
For this engine, there are multiple modes: regression and classification
Tuning Parameters
This model has 6 tuning parameters:

- tree_depth: Tree Depth (type: integer, default: -1)
- trees: # Trees (type: integer, default: 100)
- learn_rate: Learning Rate (type: double, default: 0.1)
- mtry: # Randomly Selected Predictors (type: integer, default: see below)
- min_n: Minimal Node Size (type: integer, default: 20)
- loss_reduction: Minimum Loss Reduction (type: double, default: 0)
The mtry parameter gives the number of predictors that will be
randomly sampled at each split. The default is to use all predictors.

Rather than as a number, lightgbm::lgb.train()’s feature_fraction
argument encodes mtry as the proportion of predictors that will be
randomly sampled at each split. parsnip translates mtry, supplied as
the number of predictors, to a proportion under the hood. That is, the
user should still supply the argument as mtry to boost_tree(), and do
so in its sense as a number rather than a proportion; before passing
mtry to lightgbm::lgb.train(), parsnip will convert the mtry value to
a proportion.

Note that parsnip’s translation can be overridden via the counts
argument, supplied to set_engine(). By default, counts is set to TRUE,
but supplying the argument counts = FALSE allows the user to supply
mtry as a proportion rather than a number.
Translation from parsnip to the original package (regression)
The bonsai extension package is required to fit this model.
boost_tree( mtry = integer(), trees = integer(), tree_depth = integer(), learn_rate = numeric(), min_n = integer(), loss_reduction = numeric() ) %>% set_engine("lightgbm") %>% set_mode("regression") %>% translate()
## Boosted Tree Model Specification (regression) ## ## Main Arguments: ## mtry = integer() ## trees = integer() ## min_n = integer() ## tree_depth = integer() ## learn_rate = numeric() ## loss_reduction = numeric() ## ## Computational engine: lightgbm ## ## Model fit template: ## bonsai::train_lightgbm(x = missing_arg(), y = missing_arg(), ## weights = missing_arg(), feature_fraction_bynode = integer(), ## num_iterations = integer(), min_data_in_leaf = integer(), ## max_depth = integer(), learning_rate = numeric(), min_gain_to_split = numeric(), ## verbose = -1, num_threads = 0, seed = sample.int(10^5, 1), ## deterministic = TRUE)
Translation from parsnip to the original package (classification)
The bonsai extension package is required to fit this model.
boost_tree( mtry = integer(), trees = integer(), tree_depth = integer(), learn_rate = numeric(), min_n = integer(), loss_reduction = numeric() ) %>% set_engine("lightgbm") %>% set_mode("classification") %>% translate()
## Boosted Tree Model Specification (classification) ## ## Main Arguments: ## mtry = integer() ## trees = integer() ## min_n = integer() ## tree_depth = integer() ## learn_rate = numeric() ## loss_reduction = numeric() ## ## Computational engine: lightgbm ## ## Model fit template: ## bonsai::train_lightgbm(x = missing_arg(), y = missing_arg(), ## weights = missing_arg(), feature_fraction_bynode = integer(), ## num_iterations = integer(), min_data_in_leaf = integer(), ## max_depth = integer(), learning_rate = numeric(), min_gain_to_split = numeric(), ## verbose = -1, num_threads = 0, seed = sample.int(10^5, 1), ## deterministic = TRUE)
bonsai::train_lightgbm() is a wrapper around lightgbm::lgb.train()
(and other functions) that makes it easier to run this model.
Other details
Preprocessing
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c} vs {b, d}) when splitting at a node. Dummy variables
are not required for this model.
Non-numeric predictors (i.e., factors) are internally converted to numeric. In the classification context, non-numeric outcomes (i.e., factors) are also internally converted to numeric.
Interpreting mtry
The mtry argument denotes the number of predictors that will be
randomly sampled at each split when creating tree models.

Some engines, such as "xgboost", "xrf", and "lightgbm", interpret
their analogue to the mtry argument as the proportion of predictors
that will be randomly sampled at each split rather than the count. In
some settings, such as when tuning over preprocessors that influence
the number of predictors, this parameterization is quite helpful:
interpreting mtry as a proportion means that [0, 1] is always a valid
range for that parameter, regardless of input data.

parsnip and its extensions accommodate this parameterization using the
counts argument: a logical indicating whether mtry should be
interpreted as the number of predictors that will be randomly sampled
at each split. TRUE indicates that mtry will be interpreted in its
sense as a count; FALSE indicates that the argument will be
interpreted in its sense as a proportion.

mtry is a main model argument for boost_tree() and rand_forest(), and
thus should not have an engine-specific interface. So, regardless of
engine, counts defaults to TRUE. For engines that support the
proportion interpretation (currently "xgboost" and "xrf", via the
rules package, and "lightgbm" via the bonsai package) the user can
pass the counts = FALSE argument to set_engine() to supply mtry values
within [0, 1].
Bagging
The sample_size argument is translated to the bagging_fraction
parameter in the param argument of lgb.train. The argument is
interpreted by lightgbm as a proportion rather than a count, so bonsai
internally reparameterizes the sample_size argument with
dials::sample_prop() during tuning.

To effectively enable bagging, the user would also need to set the
bagging_freq argument to lightgbm. bagging_freq defaults to 0, which
means bagging is disabled, and a bagging_freq argument of k means that
the booster will perform bagging at every kth boosting iteration.
Thus, by default, the sample_size argument would be ignored without
setting this argument manually. Other boosting libraries, like
xgboost, do not have an analogous argument to bagging_freq and use
k = 1 when the analogue to bagging_fraction is in (0, 1). bonsai will
thus automatically set bagging_freq = 1 in set_engine("lightgbm", ...)
if sample_size (i.e. bagging_fraction) is not equal to 1 and no
bagging_freq value is supplied. This default can be overridden by
setting the bagging_freq argument to set_engine() manually.
Verbosity
bonsai quiets much of the logging output from lightgbm::lgb.train()
by default. With default settings, logged warnings and errors will
still be passed on to the user. To mute all logging during training,
including warnings and errors, set quiet = TRUE.
Sparse Data
This model can utilize sparse data during model fitting and prediction.
Both sparse matrices such as dgCMatrix from the Matrix package and
sparse tibbles from the sparsevctrs package are supported. See
sparse_data for more information.
Examples
The “Introduction to bonsai” article contains examples of
boost_tree() with the "lightgbm" engine.
References
LightGBM: A Highly Efficient Gradient Boosting Decision Tree.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Boosted trees
Description
mboost::blackboost() fits a series of decision trees forming an ensemble.
Each tree depends on the results of previous trees. All trees in the
ensemble are combined to produce a final prediction.
Details
For this engine, there is a single mode: censored regression
Tuning Parameters
This model has 5 tuning parameters:

- mtry: # Randomly Selected Predictors (type: integer, default: see below)
- trees: # Trees (type: integer, default: 100L)
- tree_depth: Tree Depth (type: integer, default: 2L)
- min_n: Minimal Node Size (type: integer, default: 10L)
- loss_reduction: Minimum Loss Reduction (type: double, default: 0)
The mtry parameter is related to the number of predictors. The default
is to use all predictors.
Translation from parsnip to the original package (censored regression)
The censored extension package is required to fit this model.
library(censored) boost_tree() %>% set_engine("mboost") %>% set_mode("censored regression") %>% translate()
## Boosted Tree Model Specification (censored regression) ## ## Computational engine: mboost ## ## Model fit template: ## censored::blackboost_train(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg(), family = mboost::CoxPH())
censored::blackboost_train() is a wrapper around mboost::blackboost()
(and other functions) that makes it easier to run this model.
Preprocessing requirements
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c} vs {b, d}) when splitting at a node. Dummy variables
are not required for this model.
Other details
Predictions of type "time" are predictions of the mean survival time.
References
Buehlmann P, Hothorn T. 2007. Boosting algorithms: regularization, prediction and model fitting. Statistical Science, 22(4), 477–505.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Boosted trees via Spark
Description
sparklyr::ml_gradient_boosted_trees() creates a series of decision trees
forming an ensemble. Each tree depends on the results of previous trees.
All trees in the ensemble are combined to produce a final prediction.
Details
For this engine, there are multiple modes: classification and regression. However, multiclass classification is not supported yet.
Tuning Parameters
This model has 7 tuning parameters:

- tree_depth: Tree Depth (type: integer, default: 5L)
- trees: # Trees (type: integer, default: 20L)
- learn_rate: Learning Rate (type: double, default: 0.1)
- mtry: # Randomly Selected Predictors (type: integer, default: see below)
- min_n: Minimal Node Size (type: integer, default: 1L)
- loss_reduction: Minimum Loss Reduction (type: double, default: 0.0)
- sample_size: # Observations Sampled (type: double, default: 1.0)
The mtry parameter is related to the number of predictors. The default
depends on the model mode. For classification, the square root of the
number of predictors is used and for regression, one third of the
predictors are sampled.
Translation from parsnip to the original package (regression)
boost_tree( mtry = integer(), trees = integer(), min_n = integer(), tree_depth = integer(), learn_rate = numeric(), loss_reduction = numeric(), sample_size = numeric() ) %>% set_engine("spark") %>% set_mode("regression") %>% translate()
## Boosted Tree Model Specification (regression) ## ## Main Arguments: ## mtry = integer() ## trees = integer() ## min_n = integer() ## tree_depth = integer() ## learn_rate = numeric() ## loss_reduction = numeric() ## sample_size = numeric() ## ## Computational engine: spark ## ## Model fit template: ## sparklyr::ml_gradient_boosted_trees(x = missing_arg(), formula = missing_arg(), ## type = "regression", feature_subset_strategy = integer(), ## max_iter = integer(), min_instances_per_node = min_rows(integer(0), ## x), max_depth = integer(), step_size = numeric(), min_info_gain = numeric(), ## subsampling_rate = numeric(), seed = sample.int(10^5, 1))
Translation from parsnip to the original package (classification)
boost_tree( mtry = integer(), trees = integer(), min_n = integer(), tree_depth = integer(), learn_rate = numeric(), loss_reduction = numeric(), sample_size = numeric() ) %>% set_engine("spark") %>% set_mode("classification") %>% translate()
## Boosted Tree Model Specification (classification) ## ## Main Arguments: ## mtry = integer() ## trees = integer() ## min_n = integer() ## tree_depth = integer() ## learn_rate = numeric() ## loss_reduction = numeric() ## sample_size = numeric() ## ## Computational engine: spark ## ## Model fit template: ## sparklyr::ml_gradient_boosted_trees(x = missing_arg(), formula = missing_arg(), ## type = "classification", feature_subset_strategy = integer(), ## max_iter = integer(), min_instances_per_node = min_rows(integer(0), ## x), max_depth = integer(), step_size = numeric(), min_info_gain = numeric(), ## subsampling_rate = numeric(), seed = sample.int(10^5, 1))
Preprocessing requirements
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c} vs {b, d}) when splitting at a node. Dummy variables
are not required for this model.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() arguments have arguments called case_weights
that expect vectors of case weights.
Note that, for spark engines, the case_weight argument value should be
a character string to specify the column with the numeric case weights.
Other details
For models created using the "spark" engine, there are several things
to consider.

- Only the formula interface via fit() is available; using fit_xy()
  will generate an error.
- The predictions will always be in a Spark table format. The names
  will be the same as documented but without the dots.
- There is no equivalent to factor columns in Spark tables so class
  predictions are returned as character columns.
- To retain the model object for a new R session (via save()), the
  model$fit element of the parsnip object should be serialized via
  ml_save(object$fit) and separately saved to disk. In a new session,
  the object can be reloaded and reattached to the parsnip object, as
  sketched below.
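A minimal sketch of that last step; here fitted is assumed to be a parsnip model fit made with this engine, and sc an active Spark connection:

library(sparklyr)

# save the underlying Spark model separately from the parsnip object
ml_save(fitted$fit, path = "gbt_model")

# in a new session, reload it and reattach it to the parsnip object
fitted$fit <- ml_load(sc, path = "gbt_model")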
References
Luraschi, J, K Kuo, and E Ruiz. 2019. Mastering Spark with R. O’Reilly Media
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Boosted trees via xgboost
Description
xgboost::xgb.train() creates a series of decision trees forming an
ensemble. Each tree depends on the results of previous trees. All trees in
the ensemble are combined to produce a final prediction.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 8 tuning parameters:

- tree_depth: Tree Depth (type: integer, default: 6L)
- trees: # Trees (type: integer, default: 15L)
- learn_rate: Learning Rate (type: double, default: 0.3)
- mtry: # Randomly Selected Predictors (type: integer, default: see below)
- min_n: Minimal Node Size (type: integer, default: 1L)
- loss_reduction: Minimum Loss Reduction (type: double, default: 0.0)
- sample_size: Proportion Observations Sampled (type: double, default: 1.0)
- stop_iter: # Iterations Before Stopping (type: integer, default: Inf)
For mtry, the default value of NULL translates to using all available
columns.
Translation from parsnip to the original package (regression)
boost_tree( mtry = integer(), trees = integer(), min_n = integer(), tree_depth = integer(), learn_rate = numeric(), loss_reduction = numeric(), sample_size = numeric(), stop_iter = integer() ) %>% set_engine("xgboost") %>% set_mode("regression") %>% translate()
## Boosted Tree Model Specification (regression) ## ## Main Arguments: ## mtry = integer() ## trees = integer() ## min_n = integer() ## tree_depth = integer() ## learn_rate = numeric() ## loss_reduction = numeric() ## sample_size = numeric() ## stop_iter = integer() ## ## Computational engine: xgboost ## ## Model fit template: ## parsnip::xgb_train(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## colsample_bynode = integer(), nrounds = integer(), min_child_weight = integer(), ## max_depth = integer(), eta = numeric(), gamma = numeric(), ## subsample = numeric(), early_stop = integer(), nthread = 1, ## verbose = 0)
Translation from parsnip to the original package (classification)
boost_tree( mtry = integer(), trees = integer(), min_n = integer(), tree_depth = integer(), learn_rate = numeric(), loss_reduction = numeric(), sample_size = numeric(), stop_iter = integer() ) %>% set_engine("xgboost") %>% set_mode("classification") %>% translate()
## Boosted Tree Model Specification (classification) ## ## Main Arguments: ## mtry = integer() ## trees = integer() ## min_n = integer() ## tree_depth = integer() ## learn_rate = numeric() ## loss_reduction = numeric() ## sample_size = numeric() ## stop_iter = integer() ## ## Computational engine: xgboost ## ## Model fit template: ## parsnip::xgb_train(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## colsample_bynode = integer(), nrounds = integer(), min_child_weight = integer(), ## max_depth = integer(), eta = numeric(), gamma = numeric(), ## subsample = numeric(), early_stop = integer(), nthread = 1, ## verbose = 0)
xgb_train() is a wrapper around xgboost::xgb.train() (and other
functions) that makes it easier to run this model.
Preprocessing requirements
xgboost does not have a means to translate factor predictors to
grouped splits. Factor/categorical predictors need to be converted to
numeric values (e.g., dummy or indicator variables) for this engine.
When using the formula method via fit.model_spec(), parsnip will
convert factor columns to indicators using a one-hot encoding.

For classification, non-numeric outcomes (i.e., factors) are
internally converted to numeric. For binary classification, the
event_level argument of set_engine() can be set to either "first" or
"second" to specify which level should be used as the event. This can
be helpful when a watchlist is used to monitor performance from within
the xgboost training process.
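For example, a sketch of setting the event level:

boost_tree() %>%
  set_engine("xgboost", event_level = "second") %>%
  set_mode("classification")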
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() arguments have arguments called case_weights
that expect vectors of case weights.
Sparse Data
This model can utilize sparse data during model fitting and prediction.
Both sparse matrices such as dgCMatrix from the Matrix package and
sparse tibbles from the sparsevctrs package are supported. See
sparse_data for more information.
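A minimal sketch of fitting from a sparse matrix (the simulated data are purely illustrative):

library(Matrix)

x <- Matrix::rsparsematrix(nrow = 100, ncol = 10, density = 0.2)
colnames(x) <- paste0("x", 1:10)
y <- rnorm(100)

boost_tree(trees = 10) %>%
  set_engine("xgboost") %>%
  set_mode("regression") %>%
  fit_xy(x = x, y = y)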
Other details
Interfacing with the params argument
The xgboost function that parsnip indirectly wraps,
xgboost::xgb.train(), takes most arguments via the params list
argument. To supply engine-specific arguments that are documented in
xgboost::xgb.train() as arguments to be passed via params, supply the
list elements directly as named arguments to set_engine() rather than
as elements in params. For example, pass a non-default evaluation
metric like this:
# good boost_tree() %>% set_engine("xgboost", eval_metric = "mae")
## Boosted Tree Model Specification (unknown mode) ## ## Engine-Specific Arguments: ## eval_metric = mae ## ## Computational engine: xgboost
…rather than this:
# bad boost_tree() %>% set_engine("xgboost", params = list(eval_metric = "mae"))
## Boosted Tree Model Specification (unknown mode) ## ## Engine-Specific Arguments: ## params = list(eval_metric = "mae") ## ## Computational engine: xgboost
parsnip will then route arguments as needed. In the case that
arguments are passed to params via set_engine(), parsnip will warn and
re-route the arguments as needed. Note, though, that arguments passed
to params cannot be tuned.
Sparse matrices
xgboost requires the data to be in a sparse format. If your predictor
data are already in this format, then use fit_xy.model_spec() to pass
it to the model function. Otherwise, parsnip converts the data to this
format.
Parallel processing
By default, the model is trained without parallel processing. This can
be changed by passing the nthread parameter to set_engine(). However,
it is unwise to combine this with external parallel processing when
using the package.
Interpreting mtry
The mtry argument denotes the number of predictors that will be
randomly sampled at each split when creating tree models.

Some engines, such as "xgboost", "xrf", and "lightgbm", interpret
their analogue to the mtry argument as the proportion of predictors
that will be randomly sampled at each split rather than the count. In
some settings, such as when tuning over preprocessors that influence
the number of predictors, this parameterization is quite helpful:
interpreting mtry as a proportion means that [0, 1] is always a valid
range for that parameter, regardless of input data.

parsnip and its extensions accommodate this parameterization using the
counts argument: a logical indicating whether mtry should be
interpreted as the number of predictors that will be randomly sampled
at each split. TRUE indicates that mtry will be interpreted in its
sense as a count; FALSE indicates that the argument will be
interpreted in its sense as a proportion.

mtry is a main model argument for boost_tree() and rand_forest(), and
thus should not have an engine-specific interface. So, regardless of
engine, counts defaults to TRUE. For engines that support the
proportion interpretation (currently "xgboost" and "xrf", via the
rules package, and "lightgbm" via the bonsai package) the user can
pass the counts = FALSE argument to set_engine() to supply mtry values
within [0, 1].
Early stopping
The stop_iter argument allows the model to prematurely stop training
if the objective function does not improve within early_stop
iterations.

The best way to use this feature is in conjunction with an internal
validation set. To do this, pass the validation parameter of
xgb_train() via the parsnip set_engine() function. This is the
proportion of the training set that should be reserved for measuring
performance (and stopping early).

If the model specification has early_stop >= trees, early_stop is
converted to trees - 1 and a warning is issued.
Note that, since the validation argument provides an alternative
interface to watchlist, the watchlist argument is guarded by parsnip
and will be ignored (with a warning) if passed.
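A sketch of pairing stop_iter with an internal validation set (the values are illustrative):

boost_tree(trees = 500, stop_iter = 10) %>%
  set_engine("xgboost", validation = 0.2) %>%
  set_mode("regression")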
Objective function
parsnip chooses the objective function based on the characteristics of
the outcome. To use a different loss, pass the objective argument to
set_engine() directly.
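For example, a sketch that swaps in a standard xgboost objective:

boost_tree() %>%
  set_engine("xgboost", objective = "count:poisson") %>%
  set_mode("regression")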
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
Models fitted with this engine may require native serialization methods to be properly saved and/or passed between R sessions. To learn more about preparing fitted models for serialization, see the bundle package.
Examples
The “Fitting and Predicting with parsnip” article contains examples
for boost_tree() with the "xgboost" engine.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
C5.0 rule-based classification models
Description
C50::C5.0() fits a model that derives feature rules from a tree for
prediction. A single tree or boosted ensemble can be used.
rules::c5_fit() is a wrapper around this function.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 2 tuning parameters:

- trees: # Trees (type: integer, default: 1L)
- min_n: Minimal Node Size (type: integer, default: 2L)
Note that C5.0 has a tool for early stopping during boosting, where
fewer iterations of boosting are performed than the number requested.
C5_rules() turns this feature off (although it can be re-enabled using
C50::C5.0Control()).
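A minimal sketch of re-enabling it, assuming the control object is passed through to C50::C5.0():

library(rules)

C5_rules(trees = 20) %>%
  set_engine("C5.0", control = C50::C5.0Control(earlyStopping = TRUE)) %>%
  set_mode("classification")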
Translation from parsnip to the underlying model call (classification)
The rules extension package is required to fit this model.
library(rules) C5_rules( trees = integer(1), min_n = integer(1) ) %>% set_engine("C5.0") %>% set_mode("classification") %>% translate()
## C5.0 Model Specification (classification) ## ## Main Arguments: ## trees = integer(1) ## min_n = integer(1) ## ## Computational engine: C5.0 ## ## Model fit template: ## rules::c5_fit(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## trials = integer(1), minCases = integer(1))
Preprocessing requirements
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c}
vs {b, d}
) when splitting at a node. Dummy variables
are not required for this model.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() arguments have arguments called case_weights
that expect vectors of case weights.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
References
Quinlan R (1992). “Learning with Continuous Classes.” Proceedings of the 5th Australian Joint Conference On Artificial Intelligence, pp. 343-348.
Quinlan R (1993). “Combining Instance-Based and Model-Based Learning.” Proceedings of the Tenth International Conference on Machine Learning, pp. 236-243.
Kuhn M and Johnson K (2013). Applied Predictive Modeling. Springer.
Cubist rule-based regression models
Description
Cubist::cubist() fits a model that derives simple feature rules from a
tree ensemble and creates regression models within each rule.
rules::cubist_fit() is a wrapper around this function.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This model has 3 tuning parameters:

- committees: # Committees (type: integer, default: 1L)
- neighbors: # Nearest Neighbors (type: integer, default: 0L)
- max_rules: Max. Rules (type: integer, default: NA_integer_)
Translation from parsnip to the underlying model call (regression)
The rules extension package is required to fit this model.
library(rules) cubist_rules( committees = integer(1), neighbors = integer(1), max_rules = integer(1) ) %>% set_engine("Cubist") %>% set_mode("regression") %>% translate()
## Cubist Model Specification (regression) ## ## Main Arguments: ## committees = integer(1) ## neighbors = integer(1) ## max_rules = integer(1) ## ## Computational engine: Cubist ## ## Model fit template: ## rules::cubist_fit(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## committees = integer(1), neighbors = integer(1), max_rules = integer(1))
Preprocessing requirements
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c}
vs {b, d}
) when splitting at a node. Dummy variables
are not required for this model.
References
Quinlan R (1992). “Learning with Continuous Classes.” Proceedings of the 5th Australian Joint Conference On Artificial Intelligence, pp. 343-348.
Quinlan R (1993). “Combining Instance-Based and Model-Based Learning.” Proceedings of the Tenth International Conference on Machine Learning, pp. 236-243.
Kuhn M and Johnson K (2013). Applied Predictive Modeling. Springer.
Decision trees via C5.0
Description
C50::C5.0() fits a model as a set of if/then statements that creates a
tree-based structure.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 1 tuning parameter:

- min_n: Minimal Node Size (type: integer, default: 2L)
Translation from parsnip to the original package (classification)
decision_tree(min_n = integer()) %>% set_engine("C5.0") %>% set_mode("classification") %>% translate()
## Decision Tree Model Specification (classification) ## ## Main Arguments: ## min_n = integer() ## ## Computational engine: C5.0 ## ## Model fit template: ## parsnip::C5.0_train(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## minCases = integer(), trials = 1)
C5.0_train() is a wrapper around C50::C5.0() that makes it easier to
run this model.
Preprocessing requirements
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c} vs {b, d}) when splitting at a node. Dummy variables
are not required for this model.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() arguments have arguments called case_weights
that expect vectors of case weights.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
Examples
The “Fitting and Predicting with parsnip” article contains examples
for decision_tree() with the "C5.0" engine.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Decision trees via partykit
Description
partykit::ctree() fits a model as a set of if/then statements that creates a
tree-based structure using hypothesis testing methods.
Details
For this engine, there are multiple modes: censored regression, regression, and classification
Tuning Parameters
This model has 2 tuning parameters:

- tree_depth: Tree Depth (type: integer, default: see below)
- min_n: Minimal Node Size (type: integer, default: 20L)
The tree_depth parameter defaults to 0, which means no restrictions
are applied to tree depth.
An engine-specific parameter for this model is:

- mtry: the number of predictors, selected at random, that are
  evaluated for splitting. The default is to use all predictors; a
  sketch of supplying it follows.
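library(bonsai)

# a minimal sketch; the mtry value is purely illustrative
decision_tree(min_n = 20) %>%
  set_engine("partykit", mtry = 3) %>%
  set_mode("classification")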
Translation from parsnip to the original package (regression)
The bonsai extension package is required to fit this model.
library(bonsai) decision_tree(tree_depth = integer(1), min_n = integer(1)) %>% set_engine("partykit") %>% set_mode("regression") %>% translate()
## Decision Tree Model Specification (regression) ## ## Main Arguments: ## tree_depth = integer(1) ## min_n = integer(1) ## ## Computational engine: partykit ## ## Model fit template: ## parsnip::ctree_train(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg(), maxdepth = integer(1), minsplit = min_rows(0L, ## data))
Translation from parsnip to the original package (classification)
The bonsai extension package is required to fit this model.
library(bonsai) decision_tree(tree_depth = integer(1), min_n = integer(1)) %>% set_engine("partykit") %>% set_mode("classification") %>% translate()
## Decision Tree Model Specification (classification) ## ## Main Arguments: ## tree_depth = integer(1) ## min_n = integer(1) ## ## Computational engine: partykit ## ## Model fit template: ## parsnip::ctree_train(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg(), maxdepth = integer(1), minsplit = min_rows(0L, ## data))
parsnip::ctree_train() is a wrapper around partykit::ctree() (and
other functions) that makes it easier to run this model.
Translation from parsnip to the original package (censored regression)
The censored extension package is required to fit this model.
library(censored) decision_tree(tree_depth = integer(1), min_n = integer(1)) %>% set_engine("partykit") %>% set_mode("censored regression") %>% translate()
## Decision Tree Model Specification (censored regression) ## ## Main Arguments: ## tree_depth = integer(1) ## min_n = integer(1) ## ## Computational engine: partykit ## ## Model fit template: ## parsnip::ctree_train(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg(), maxdepth = integer(1), minsplit = min_rows(0L, ## data))
censored::cond_inference_surv_ctree() is a wrapper around
partykit::ctree() (and other functions) that makes it easier to run
this model.
Preprocessing requirements
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c} vs {b, d}) when splitting at a node. Dummy variables
are not required for this model.
Other details
Predictions of type "time" are predictions of the median survival
time.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Decision trees via CART
Description
rpart::rpart() fits a model as a set of if/then statements that
creates a tree-based structure.
Details
For this engine, there are multiple modes: classification, regression, and censored regression
Tuning Parameters
This model has 3 tuning parameters:

- tree_depth: Tree Depth (type: integer, default: 30L)
- min_n: Minimal Node Size (type: integer, default: 2L)
- cost_complexity: Cost-Complexity Parameter (type: double, default: 0.01)
Translation from parsnip to the original package (classification)
decision_tree(tree_depth = integer(1), min_n = integer(1), cost_complexity = double(1)) %>% set_engine("rpart") %>% set_mode("classification") %>% translate()
## Decision Tree Model Specification (classification) ## ## Main Arguments: ## cost_complexity = double(1) ## tree_depth = integer(1) ## min_n = integer(1) ## ## Computational engine: rpart ## ## Model fit template: ## rpart::rpart(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), ## cp = double(1), maxdepth = integer(1), minsplit = min_rows(0L, ## data))
Translation from parsnip to the original package (regression)
decision_tree(tree_depth = integer(1), min_n = integer(1), cost_complexity = double(1)) %>% set_engine("rpart") %>% set_mode("regression") %>% translate()
## Decision Tree Model Specification (regression) ## ## Main Arguments: ## cost_complexity = double(1) ## tree_depth = integer(1) ## min_n = integer(1) ## ## Computational engine: rpart ## ## Model fit template: ## rpart::rpart(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), ## cp = double(1), maxdepth = integer(1), minsplit = min_rows(0L, ## data))
Translation from parsnip to the original package (censored regression)
The censored extension package is required to fit this model.
library(censored) decision_tree( tree_depth = integer(1), min_n = integer(1), cost_complexity = double(1) ) %>% set_engine("rpart") %>% set_mode("censored regression") %>% translate()
## Decision Tree Model Specification (censored regression) ## ## Main Arguments: ## cost_complexity = double(1) ## tree_depth = integer(1) ## min_n = integer(1) ## ## Computational engine: rpart ## ## Model fit template: ## pec::pecRpart(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg(), cp = double(1), maxdepth = integer(1), ## minsplit = min_rows(0L, data))
Preprocessing requirements
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c} vs {b, d}) when splitting at a node. Dummy variables
are not required for this model.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() arguments have arguments called case_weights
that expect vectors of case weights.
Other details
Predictions of type "time" are predictions of the mean survival time.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
Examples
The “Fitting and Predicting with parsnip” article contains examples
for decision_tree() with the "rpart" engine.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Decision trees via Spark
Description
sparklyr::ml_decision_tree() fits a model as a set of if/then
statements that creates a tree-based structure.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 2 tuning parameters:
-
tree_depth
: Tree Depth (type: integer, default: 5L) -
min_n
: Minimal Node Size (type: integer, default: 1L)
Translation from parsnip to the original package (classification)
decision_tree(tree_depth = integer(1), min_n = integer(1)) %>% set_engine("spark") %>% set_mode("classification") %>% translate()
## Decision Tree Model Specification (classification) ## ## Main Arguments: ## tree_depth = integer(1) ## min_n = integer(1) ## ## Computational engine: spark ## ## Model fit template: ## sparklyr::ml_decision_tree_classifier(x = missing_arg(), formula = missing_arg(), ## max_depth = integer(1), min_instances_per_node = min_rows(0L, ## x), seed = sample.int(10^5, 1))
Translation from parsnip to the original package (regression)
decision_tree(tree_depth = integer(1), min_n = integer(1)) %>% set_engine("spark") %>% set_mode("regression") %>% translate()
## Decision Tree Model Specification (regression) ## ## Main Arguments: ## tree_depth = integer(1) ## min_n = integer(1) ## ## Computational engine: spark ## ## Model fit template: ## sparklyr::ml_decision_tree_regressor(x = missing_arg(), formula = missing_arg(), ## max_depth = integer(1), min_instances_per_node = min_rows(0L, ## x), seed = sample.int(10^5, 1))
Preprocessing requirements
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c} vs {b, d}) when splitting at a node. Dummy variables
are not required for this model.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() arguments have arguments called case_weights
that expect vectors of case weights.
Note that, for spark engines, the case_weight argument value should be
a character string to specify the column with the numeric case weights.
Other details
For models created using the "spark" engine, there are several things
to consider.

- Only the formula interface via fit() is available; using fit_xy()
  will generate an error.
- The predictions will always be in a Spark table format. The names
  will be the same as documented but without the dots.
- There is no equivalent to factor columns in Spark tables so class
  predictions are returned as character columns.
- To retain the model object for a new R session (via save()), the
  model$fit element of the parsnip object should be serialized via
  ml_save(object$fit) and separately saved to disk. In a new session,
  the object can be reloaded and reattached to the parsnip object.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Flexible discriminant analysis via earth
Description
mda::fda() (in conjunction with earth::earth()) can fit a nonlinear
discriminant analysis model that uses nonlinear features created using
multivariate adaptive regression splines (MARS). This function can fit
classification models.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 3 tuning parameters:

- num_terms: # Model Terms (type: integer, default: (see below))
- prod_degree: Degree of Interaction (type: integer, default: 1L)
- prune_method: Pruning Method (type: character, default: ‘backward’)
The default value of num_terms depends on the number of columns (p):
min(200, max(20, 2 * p)) + 1. Note that num_terms = 1 is an
intercept-only model.
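As a quick check of that formula, with p = 10 predictor columns:

p <- 10
min(200, max(20, 2 * p)) + 1
## [1] 21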
Translation from parsnip to the original package
The discrim extension package is required to fit this model.
library(discrim) discrim_flexible( num_terms = integer(0), prod_degree = integer(0), prune_method = character(0) ) %>% translate()
## Flexible Discriminant Model Specification (classification) ## ## Main Arguments: ## num_terms = integer(0) ## prod_degree = integer(0) ## prune_method = character(0) ## ## Computational engine: earth ## ## Model fit template: ## mda::fda(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), ## nprune = integer(0), degree = integer(0), pmethod = character(0), ## method = earth::earth)
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), parsnip will convert factor columns to
indicators.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() arguments have arguments called case_weights
that expect vectors of case weights.
References
Hastie, Tibshirani & Buja (1994) Flexible Discriminant Analysis by Optimal Scoring, Journal of the American Statistical Association, 89:428, 1255-1270
Friedman (1991). Multivariate Adaptive Regression Splines. The Annals of Statistics, 19(1), 1-67.
Linear discriminant analysis via MASS
Description
MASS::lda()
fits a model that estimates a multivariate
distribution for the predictors separately for the data in each class
(Gaussian with a common covariance matrix). Bayes' theorem is used
to compute the probability of each class, given the predictor values.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This engine has no tuning parameters.
Translation from parsnip to the original package
The discrim extension package is required to fit this model.
library(discrim) discrim_linear() %>% set_engine("MASS") %>% translate()
## Linear Discriminant Model Specification (classification) ## ## Computational engine: MASS ## ## Model fit template: ## MASS::lda(formula = missing_arg(), data = missing_arg())
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), parsnip will convert factor columns to
indicators.
Variance calculations are used in these computations so zero-variance predictors (i.e., with a single unique value) should be eliminated before fitting the model.
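A minimal sketch of a fit, assuming the discrim extension package is installed (iris serves only as an illustrative dataset):
library(discrim)

lda_fit <- discrim_linear() %>%
  set_engine("MASS") %>%
  fit(Species ~ ., data = iris)

predict(lda_fit, head(iris), type = "prob")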
Case weights
The underlying model implementation does not allow for case weights.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Linear discriminant analysis via flexible discriminant analysis
Description
mda::fda() (in conjunction with mda::gen.ridge()) can fit a linear
discriminant analysis model that penalizes the predictor coefficients with a
quadratic penalty (i.e., a ridge or weight decay approach).
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 1 tuning parameter (its use is sketched below):

- penalty: Amount of Regularization (type: double, default: 1.0)
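For example, a penalized specification might be fit as in this sketch (assuming the discrim and mda packages are installed; the penalty value is arbitrary):
library(discrim)

discrim_linear(penalty = 0.1) %>%
  set_engine("mda") %>%
  fit(Species ~ ., data = iris)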
Translation from parsnip to the original package
The discrim extension package is required to fit this model.
library(discrim) discrim_linear(penalty = numeric(0)) %>% set_engine("mda") %>% translate()
## Linear Discriminant Model Specification (classification) ## ## Main Arguments: ## penalty = numeric(0) ## ## Computational engine: mda ## ## Model fit template: ## mda::fda(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), ## lambda = numeric(0), method = mda::gen.ridge, keep.fitted = FALSE)
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), parsnip will convert factor columns to
indicators.
Variance calculations are used in these computations so zero-variance predictors (i.e., with a single unique value) should be eliminated before fitting the model.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() functions have arguments called case_weights
that expect vectors of case weights.
References
Hastie, Tibshirani & Buja (1994) Flexible Discriminant Analysis by Optimal Scoring, Journal of the American Statistical Association, 89:428, 1255-1270
Linear discriminant analysis via James-Stein-type shrinkage estimation
Description
sda::sda() can fit a linear discriminant analysis model that spans the
range between classical discriminant analysis and diagonal discriminant
analysis.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This engine has no tuning parameter arguments in discrim_linear().
However, there are a few engine-specific parameters that can be set or
optimized when calling set_engine() (see the sketch after this list):

- lambda: the shrinkage parameter for the correlation matrix. This maps to the parameter dials::shrinkage_correlation().
- lambda.var: the shrinkage parameter for the predictor variances. This maps to dials::shrinkage_variance().
- lambda.freqs: the shrinkage parameter for the class frequencies. This maps to dials::shrinkage_frequencies().
- diagonal: a logical to make the model covariance diagonal or not. This maps to dials::diagonal_covariance().
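A hedged sketch of setting two of these engine arguments (assuming the discrim and sda packages are installed; the values are arbitrary):
library(discrim)

discrim_linear() %>%
  set_engine("sda", lambda = 0.5, diagonal = TRUE) %>%
  fit(Species ~ ., data = iris)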
Translation from parsnip to the original package
The discrim extension package is required to fit this model.
library(discrim) discrim_linear() %>% set_engine("sda") %>% translate()
## Linear Discriminant Model Specification (classification) ## ## Computational engine: sda ## ## Model fit template: ## sda::sda(Xtrain = missing_arg(), L = missing_arg(), verbose = FALSE)
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), parsnip will convert factor columns to
indicators.
Variance calculations are used in these computations so zero-variance predictors (i.e., with a single unique value) should be eliminated before fitting the model.
Case weights
The underlying model implementation does not allow for case weights.
References
Ahdesmaki, M., and K. Strimmer. 2010. Feature selection in omics prediction problems using cat scores and false non-discovery rate control. Ann. Appl. Stat. 4: 503-519. Preprint.
Linear discriminant analysis via regularization
Description
Functions in the sparsediscrim package fit different types of linear discriminant analysis models that regularize the estimates (like the mean or covariance).
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 1 tuning parameter:

- regularization_method: Regularization Method (type: character, default: ‘diagonal’)

The possible values of this parameter, and the functions that they execute, are (one is sketched below):

- "diagonal": sparsediscrim::lda_diag()
- "min_distance": sparsediscrim::lda_emp_bayes_eigen()
- "shrink_mean": sparsediscrim::lda_shrink_mean()
- "shrink_cov": sparsediscrim::lda_shrink_cov()
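A brief sketch of selecting one of these methods (assuming the discrim and sparsediscrim packages are installed):
library(discrim)

discrim_linear(regularization_method = "shrink_cov") %>%
  set_engine("sparsediscrim") %>%
  fit(Species ~ ., data = iris)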
Translation from parsnip to the original package
The discrim extension package is required to fit this model.
library(discrim) discrim_linear(regularization_method = character(0)) %>% set_engine("sparsediscrim") %>% translate()
## Linear Discriminant Model Specification (classification) ## ## Main Arguments: ## regularization_method = character(0) ## ## Computational engine: sparsediscrim ## ## Model fit template: ## discrim::fit_regularized_linear(x = missing_arg(), y = missing_arg(), ## regularization_method = character(0))
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), parsnip will convert factor columns to
indicators.
Variance calculations are used in these computations so zero-variance predictors (i.e., with a single unique value) should be eliminated before fitting the model.
Case weights
The underlying model implementation does not allow for case weights.
References
- lda_diag(): Dudoit, Fridlyand and Speed (2002) Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data, Journal of the American Statistical Association, 97:457, 77-87.
- lda_shrink_mean(): Tong, Chen, Zhao, Improved mean estimation and its application to diagonal discriminant analysis, Bioinformatics, Volume 28, Issue 4, 15 February 2012, Pages 531-537.
- lda_shrink_cov(): Pang, Tong and Zhao (2009), Shrinkage-based Diagonal Discriminant Analysis and Its Applications in High-Dimensional Data. Biometrics, 65, 1021-1029.
- lda_emp_bayes_eigen(): Srivastava and Kubokawa (2007), Comparison of Discrimination Methods for High Dimensional Data, Journal of the Japan Statistical Society, 37:1, 123-134.
Quadratic discriminant analysis via MASS
Description
MASS::qda()
fits a model that estimates a multivariate
distribution for the predictors separately for the data in each class
(Gaussian with separate covariance matrices). Bayes' theorem is used
to compute the probability of each class, given the predictor values.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This engine has no tuning parameters.
Translation from parsnip to the original package
The discrim extension package is required to fit this model.
library(discrim) discrim_quad() %>% set_engine("MASS") %>% translate()
## Quadratic Discriminant Model Specification (classification) ## ## Computational engine: MASS ## ## Model fit template: ## MASS::qda(formula = missing_arg(), data = missing_arg())
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), parsnip will convert factor columns to
indicators.
Variance calculations are used in these computations within each outcome class. For this reason, zero-variance predictors (i.e., with a single unique value) within each class should be eliminated before fitting the model.
Case weights
The underlying model implementation does not allow for case weights.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Quadratic discriminant analysis via regularization
Description
Functions in the sparsediscrim package fit different types of quadratic discriminant analysis models that regularize the estimates (like the mean or covariance).
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 1 tuning parameter:

- regularization_method: Regularization Method (type: character, default: ‘diagonal’)

The possible values of this parameter, and the functions that they execute, are:

- "diagonal": sparsediscrim::qda_diag()
- "shrink_mean": sparsediscrim::qda_shrink_mean()
- "shrink_cov": sparsediscrim::qda_shrink_cov()
Translation from parsnip to the original package
The discrim extension package is required to fit this model.
library(discrim) discrim_quad(regularization_method = character(0)) %>% set_engine("sparsediscrim") %>% translate()
## Quadratic Discriminant Model Specification (classification) ## ## Main Arguments: ## regularization_method = character(0) ## ## Computational engine: sparsediscrim ## ## Model fit template: ## discrim::fit_regularized_quad(x = missing_arg(), y = missing_arg(), ## regularization_method = character(0))
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), parsnip will convert factor columns to
indicators.
Variance calculations are used in these computations within each outcome class. For this reason, zero-variance predictors (i.e., with a single unique value) within each class should be eliminated before fitting the model.
Case weights
The underlying model implementation does not allow for case weights.
References
- qda_diag(): Dudoit, Fridlyand and Speed (2002) Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data, Journal of the American Statistical Association, 97:457, 77-87.
- qda_shrink_mean(): Tong, Chen, Zhao, Improved mean estimation and its application to diagonal discriminant analysis, Bioinformatics, Volume 28, Issue 4, 15 February 2012, Pages 531-537.
- qda_shrink_cov(): Pang, Tong and Zhao (2009), Shrinkage-based Diagonal Discriminant Analysis and Its Applications in High-Dimensional Data. Biometrics, 65, 1021-1029.
Regularized discriminant analysis via klaR
Description
klaR::rda() fits a model that estimates a multivariate
distribution for the predictors separately for the data in each class. The
structure of the model can be LDA, QDA, or some amalgam of the two. Bayes'
theorem is used to compute the probability of each class, given the
predictor values.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 2 tuning parameters:

- frac_common_cov: Fraction of the Common Covariance Matrix (type: double, default: (see below))
- frac_identity: Fraction of the Identity Matrix (type: double, default: (see below))

Some special cases for the RDA model:

- frac_identity = 0 and frac_common_cov = 1 is a linear discriminant analysis (LDA) model (illustrated in the sketch below).
- frac_identity = 0 and frac_common_cov = 0 is a quadratic discriminant analysis (QDA) model.
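For instance, the first special case could be requested as in this sketch (assuming the discrim and klaR packages are installed):
library(discrim)

# frac_identity = 0 and frac_common_cov = 1 recovers an LDA-like model
discrim_regularized(frac_common_cov = 1, frac_identity = 0) %>%
  set_engine("klaR") %>%
  fit(Species ~ ., data = iris)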
Translation from parsnip to the original package
The discrim extension package is required to fit this model.
library(discrim) discrim_regularized(frac_identity = numeric(0), frac_common_cov = numeric(0)) %>% set_engine("klaR") %>% translate()
## Regularized Discriminant Model Specification (classification) ## ## Main Arguments: ## frac_common_cov = numeric(0) ## frac_identity = numeric(0) ## ## Computational engine: klaR ## ## Model fit template: ## klaR::rda(formula = missing_arg(), data = missing_arg(), lambda = numeric(0), ## gamma = numeric(0))
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), parsnip will convert factor columns to
indicators.
Variance calculations are used in these computations within each outcome class. For this reason, zero-variance predictors (i.e., with a single unique value) within each class should be eliminated before fitting the model.
Case weights
The underlying model implementation does not allow for case weights.
References
Friedman, J (1989). Regularized Discriminant Analysis. Journal of the American Statistical Association, 84, 165-175.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Generalized additive models via mgcv
Description
mgcv::gam()
fits a generalized linear model with additive smoother terms
for continuous predictors.
Details
For this engine, there are multiple modes: regression and classification
Tuning Parameters
This model has 2 tuning parameters:

- select_features: Select Features? (type: logical, default: FALSE)
- adjust_deg_free: Smoothness Adjustment (type: double, default: 1.0)
Translation from parsnip to the original package (regression)
gen_additive_mod(adjust_deg_free = numeric(1), select_features = logical(1)) %>% set_engine("mgcv") %>% set_mode("regression") %>% translate()
## GAM Model Specification (regression) ## ## Main Arguments: ## select_features = logical(1) ## adjust_deg_free = numeric(1) ## ## Computational engine: mgcv ## ## Model fit template: ## mgcv::gam(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), ## select = logical(1), gamma = numeric(1))
Translation from parsnip to the original package (classification)
gen_additive_mod(adjust_deg_free = numeric(1), select_features = logical(1)) %>% set_engine("mgcv") %>% set_mode("classification") %>% translate()
## GAM Model Specification (classification) ## ## Main Arguments: ## select_features = logical(1) ## adjust_deg_free = numeric(1) ## ## Computational engine: mgcv ## ## Model fit template: ## mgcv::gam(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), ## select = logical(1), gamma = numeric(1), family = stats::binomial(link = "logit"))
Model fitting
This model should be used with a model formula so that smooth terms can be specified. For example:
library(mgcv) gen_additive_mod() %>% set_engine("mgcv") %>% set_mode("regression") %>% fit(mpg ~ wt + gear + cyl + s(disp, k = 10), data = mtcars)
## parsnip model object ## ## ## Family: gaussian ## Link function: identity ## ## Formula: ## mpg ~ wt + gear + cyl + s(disp, k = 10) ## ## Estimated degrees of freedom: ## 7.52 total = 11.52 ## ## GCV score: 4.225228
The smoothness of the terms will need to be manually specified (e.g.,
using s(x, df = 10)) in the formula. Tuning can be accomplished using
the adjust_deg_free parameter.
When using a workflow, pass the model formula to the formula argument
of workflows::add_model(), and a simplified preprocessing formula
elsewhere.
spec <- gen_additive_mod() %>% set_engine("mgcv") %>% set_mode("regression") workflow() %>% add_model(spec, formula = mpg ~ wt + gear + cyl + s(disp, k = 10)) %>% add_formula(mpg ~ wt + gear + cyl + disp) %>% fit(data = mtcars) %>% extract_fit_engine()
## ## Family: gaussian ## Link function: identity ## ## Formula: ## mpg ~ wt + gear + cyl + s(disp, k = 10) ## ## Estimated degrees of freedom: ## 7.52 total = 11.52 ## ## GCV score: 4.225228
To learn more about the differences between these formulas, see
?model_formula.
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), parsnip will convert factor columns to
indicators.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() functions have arguments called case_weights
that expect vectors of case weights.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
References
Ross, W. 2021. Generalized Additive Models in R: A Free, Interactive Course using mgcv
Wood, S. 2017. Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC.
Linear regression via brulee
Description
brulee::brulee_linear_reg()
uses ordinary least squares to fit models with
numeric outcomes.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This model has 2 tuning parameters:

- penalty: Amount of Regularization (type: double, default: 0.001)
- mixture: Proportion of Lasso Penalty (type: double, default: 0.0)
The use of the L1 penalty (a.k.a. the lasso penalty) does not force parameters to be strictly zero (as it does in packages such as glmnet). The zeroing out of parameters is a specific feature of the optimization method used in those packages.
Other engine arguments of interest (a few are sketched below):

- optimizer(): The optimization method. See brulee::brulee_linear_reg().
- epochs(): An integer for the number of passes through the training set.
- learn_rate(): A number used to accelerate the gradient descent process.
- momentum(): A number used to incorporate historical gradient information during optimization (optimizer = "SGD" only).
- batch_size(): An integer for the number of training set points in each batch.
- stop_iter(): A non-negative integer for how many iterations with no improvement before stopping (default: 5L).
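A minimal sketch of passing a few of these arguments (assuming the brulee package and its torch backend are installed; the values shown are arbitrary):
library(parsnip)

linear_reg(penalty = 0.001) %>%
  set_engine("brulee", epochs = 50, learn_rate = 0.05, stop_iter = 5) %>%
  fit(mpg ~ ., data = mtcars)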
Translation from parsnip to the original package (regression)
linear_reg(penalty = double(1)) %>% set_engine("brulee") %>% translate()
## Linear Regression Model Specification (regression) ## ## Main Arguments: ## penalty = double(1) ## ## Computational engine: brulee ## ## Model fit template: ## brulee::brulee_linear_reg(x = missing_arg(), y = missing_arg(), ## penalty = double(1))
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), parsnip will convert factor columns to
indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
Case weights
The underlying model implementation does not allow for case weights.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Linear regression via generalized estimating equations (GEE)
Description
gee::gee()
uses generalized least squares to fit different types of models
with errors that are not independent.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This model has no formal tuning parameters. It may be beneficial to determine the appropriate correlation structure to use, but this typically does not affect the predicted value of the model. It does have an effect on the inferential results and parameter covariance values.
Translation from parsnip to the original package
The multilevelmod extension package is required to fit this model.
library(multilevelmod) linear_reg() %>% set_engine("gee") %>% set_mode("regression") %>% translate()
## Linear Regression Model Specification (regression) ## ## Computational engine: gee ## ## Model fit template: ## multilevelmod::gee_fit(formula = missing_arg(), data = missing_arg(), ## family = gaussian)
multilevelmod::gee_fit() is a wrapper around gee::gee().
Preprocessing requirements
There are no specific preprocessing needs. However, it is helpful to keep the clustering/subject identifier column as factor or character (instead of making them into dummy variables). See the examples in the next section.
Other details
The model cannot accept case weights.
Both gee::gee() and geepack::geeglm() specify the id/cluster variable
using an argument id that requires a vector. parsnip doesn’t work that
way, so we enable this model to be fit using an artificial function
id_var() to be used in the formula. So, in the original package, the
call would look like:
gee(breaks ~ tension, id = wool, data = warpbreaks, corstr = "exchangeable")
With parsnip, we suggest using the formula method when fitting:
library(tidymodels) linear_reg() %>% set_engine("gee", corstr = "exchangeable") %>% fit(breaks ~ tension + id_var(wool), data = warpbreaks)
When using tidymodels infrastructure, it may be better to use a
workflow. In this case, you can add the appropriate columns using
add_variables()
then supply the GEE formula when adding the model:
library(tidymodels) gee_spec <- linear_reg() %>% set_engine("gee", corstr = "exchangeable") gee_wflow <- workflow() %>% # The data are included as-is using: add_variables(outcomes = breaks, predictors = c(tension, wool)) %>% add_model(gee_spec, formula = breaks ~ tension + id_var(wool)) fit(gee_wflow, data = warpbreaks)
The gee::gee() function always prints out warnings and output even
when silent = TRUE. The parsnip "gee" engine, by contrast, silences
all console output coming from gee::gee(), even if silent = FALSE.
Also, because of issues with the gee() function, a supplementary call
to glm() is needed to get the rank and QR decomposition objects so
that predict() can be used.
Case weights
The underlying model implementation does not allow for case weights.
References
Liang, K.Y. and Zeger, S.L. (1986) Longitudinal data analysis using generalized linear models. Biometrika, 73 13–22.
Zeger, S.L. and Liang, K.Y. (1986) Longitudinal data analysis for discrete and continuous outcomes. Biometrics, 42 121–130.
Linear regression via glm
Description
stats::glm()
fits a generalized linear model for numeric outcomes. A
linear combination of the predictors is used to model the numeric outcome
via a link function.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This engine has no tuning parameters but you can set the family
parameter (and/or link) as an engine argument (see below).
Translation from parsnip to the original package
linear_reg() %>% set_engine("glm") %>% translate()
## Linear Regression Model Specification (regression) ## ## Computational engine: glm ## ## Model fit template: ## stats::glm(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), ## family = stats::gaussian)
To use a non-default family and/or link, pass them as arguments to
set_engine():
linear_reg() %>% set_engine("glm", family = stats::poisson(link = "sqrt")) %>% translate()
## Linear Regression Model Specification (regression) ## ## Engine-Specific Arguments: ## family = stats::poisson(link = "sqrt") ## ## Computational engine: glm ## ## Model fit template: ## stats::glm(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), ## family = stats::poisson(link = "sqrt"))
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), parsnip will convert factor columns to
indicators.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() functions have arguments called case_weights
that expect vectors of case weights.
However, the documentation in stats::glm() assumes that a specific type
of case weights is being used: “Non-NULL weights can be used to indicate
that different observations have different dispersions (with the values
in weights being inversely proportional to the dispersions); or
equivalently, when the elements of weights are positive integers w_i,
that each response y_i is the mean of w_i unit-weight observations. For
a binomial GLM prior weights are used to give the number of trials when
the response is the proportion of successes: they would rarely be used
for a Poisson GLM.”
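As a hedged sketch, integer frequency weights of this kind could be supplied through the case_weights argument of fit() using hardhat::frequency_weights(); the weight values below are purely illustrative:
library(parsnip)
library(hardhat)

wts <- frequency_weights(rep(1:2, length.out = nrow(mtcars)))

linear_reg() %>%
  set_engine("glm") %>%
  fit(mpg ~ wt + disp, data = mtcars, case_weights = wts)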
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
Examples
The “Fitting and Predicting with parsnip” article contains examples
for linear_reg() with the "glm" engine.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Linear regression via generalized mixed models
Description
The "glmer"
engine estimates fixed and random effect regression parameters
using maximum likelihood (or restricted maximum likelihood) estimation.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This model has no tuning parameters.
Translation from parsnip to the original package
The multilevelmod extension package is required to fit this model.
library(multilevelmod) linear_reg() %>% set_engine("glmer") %>% set_mode("regression") %>% translate()
## Linear Regression Model Specification (regression) ## ## Computational engine: glmer ## ## Model fit template: ## lme4::glmer(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), ## family = stats::gaussian)
Note that using this engine with a linear link function will result in a warning:
calling glmer() with family=gaussian (identity link) as a shortcut to lmer() is deprecated; please call lmer() directly
Predicting new samples
This model can use subject-specific coefficient estimates to make
predictions (i.e. partial pooling). For example, this equation shows
the linear predictor (\eta) for a random intercept:
\eta_{i} = (\beta_0 + b_{0i}) + \beta_1x_{i1}
where i denotes the i-th independent experimental unit
(e.g. subject). When the model has seen subject i, it can use that
subject’s data to adjust the population intercept to be more specific
to that subject’s results.
What happens when data are being predicted for a subject that was not used in the model fit? In that case, this package uses only the population parameter estimates for prediction:
\hat{\eta}_{i'} = \hat{\beta}_0 + \hat{\beta}_1x_{i'1}
Depending on what covariates are in the model, this might have the effect of making the same prediction for all new samples. The population parameters are the “best estimate” for a subject that was not included in the model fit.
The tidymodels framework deliberately constrains predictions for new data to not use the training set or other data (to prevent information leakage).
Preprocessing requirements
There are no specific preprocessing needs. However, it is helpful to keep the clustering/subject identifier column as factor or character (instead of making them into dummy variables). See the examples in the next section.
Other details
The model can accept case weights.
With parsnip, we suggest using the formula method when fitting:
library(tidymodels) data("riesby") linear_reg() %>% set_engine("glmer") %>% fit(depr_score ~ week + (1|subject), data = riesby)
When using tidymodels infrastructure, it may be better to use a
workflow. In this case, you can add the appropriate columns using
add_variables()
then supply the typical formula when adding the model:
library(tidymodels) glmer_spec <- linear_reg() %>% set_engine("glmer") glmer_wflow <- workflow() %>% # The data are included as-is using: add_variables(outcomes = depr_score, predictors = c(week, subject)) %>% add_model(glmer_spec, formula = depr_score ~ week + (1|subject)) fit(glmer_wflow, data = riesby)
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() functions have arguments called case_weights
that expect vectors of case weights.
References
J Pinheiro, and D Bates. 2000. Mixed-effects models in S and S-PLUS. Springer, New York, NY
West, B, K Welch, and A Galecki. 2014. Linear Mixed Models: A Practical Guide Using Statistical Software. CRC Press.
Thorson, J, Minto, C. 2015, Mixed effects: a unifying framework for statistical modelling in fisheries biology. ICES Journal of Marine Science, Volume 72, Issue 5, Pages 1245–1256.
Harrison, XA, Donaldson, L, Correa-Cano, ME, Evans, J, Fisher, DN, Goodwin, CED, Robinson, BS, Hodgson, DJ, Inger, R. 2018. A brief introduction to mixed effects modelling and multi-model inference in ecology. PeerJ 6:e4794.
DeBruine LM, Barr DJ. Understanding Mixed-Effects Models Through Data Simulation. 2021. Advances in Methods and Practices in Psychological Science.
Linear regression via glmnet
Description
glmnet::glmnet()
uses regularized least squares to fit models with numeric outcomes.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This model has 2 tuning parameters:

- penalty: Amount of Regularization (type: double, default: see below)
- mixture: Proportion of Lasso Penalty (type: double, default: 1.0)

A value of mixture = 1 corresponds to a pure lasso model, while
mixture = 0 indicates ridge regression.
The penalty parameter has no default and requires a single numeric
value. For more details about this, and the glmnet model in general,
see glmnet-details.
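For example, a minimal sketch with a fixed penalty (assuming glmnet is installed; the values are arbitrary):
library(parsnip)

linear_reg(penalty = 0.01, mixture = 1) %>%
  set_engine("glmnet") %>%
  fit(mpg ~ ., data = mtcars)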
Translation from parsnip to the original package
linear_reg(penalty = double(1), mixture = double(1)) %>% set_engine("glmnet") %>% translate()
## Linear Regression Model Specification (regression) ## ## Main Arguments: ## penalty = 0 ## mixture = double(1) ## ## Computational engine: glmnet ## ## Model fit template: ## glmnet::glmnet(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## alpha = double(1), family = "gaussian")
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), parsnip will convert factor columns to
indicators.
Predictors should have the same scale. One way to achieve this is to
center and scale each so that each predictor has mean zero and a
variance of one. By default, glmnet::glmnet() uses the argument
standardize = TRUE to center and scale the data.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() functions have arguments called case_weights
that expect vectors of case weights.
Sparse Data
This model can utilize sparse data during model fitting and prediction.
Both sparse matrices such as dgCMatrix from the Matrix package and
sparse tibbles from the sparsevctrs package are supported. See
sparse_data for more information.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
Examples
The “Fitting and Predicting with parsnip” article contains examples
for linear_reg() with the "glmnet" engine.
References
Hastie, T, R Tibshirani, and M Wainwright. 2015. Statistical Learning with Sparsity. CRC Press.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Linear regression via generalized least squares
Description
The "gls"
engine estimates linear regression for models where the rows of the
data are not independent.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This model has no tuning parameters.
Translation from parsnip to the original package
The multilevelmod extension package is required to fit this model.
library(multilevelmod) linear_reg() %>% set_engine("gls") %>% set_mode("regression") %>% translate()
## Linear Regression Model Specification (regression) ## ## Computational engine: gls ## ## Model fit template: ## nlme::gls(formula = missing_arg(), data = missing_arg())
Preprocessing requirements
There are no specific preprocessing needs. However, it is helpful to keep the clustering/subject identifier column as factor or character (instead of making them into dummy variables). See the examples in the next section.
Other details
The model can accept case weights.
With parsnip, we suggest using the fixed effects formula method when
fitting, but the details of the correlation structure should be passed
to set_engine()
since it is an irregular (but required) argument:
library(tidymodels) # load nlme to be able to use the `cor*()` functions library(nlme) data("riesby") linear_reg() %>% set_engine("gls", correlation = corCompSymm(form = ~ 1 | subject)) %>% fit(depr_score ~ week, data = riesby)
## parsnip model object ## ## Generalized least squares fit by REML ## Model: depr_score ~ week ## Data: data ## Log-restricted-likelihood: -765.0148 ## ## Coefficients: ## (Intercept) week ## -4.953439 -2.119678 ## ## Correlation Structure: Compound symmetry ## Formula: ~1 | subject ## Parameter estimate(s): ## Rho ## 0.6820145 ## Degrees of freedom: 250 total; 248 residual ## Residual standard error: 6.868785
When using tidymodels infrastructure, it may be better to use a
workflow. In this case, you can add the appropriate columns using
add_variables()
then supply the typical formula when adding the model:
library(tidymodels) gls_spec <- linear_reg() %>% set_engine("gls", correlation = corCompSymm(form = ~ 1 | subject)) gls_wflow <- workflow() %>% # The data are included as-is using: add_variables(outcomes = depr_score, predictors = c(week, subject)) %>% add_model(gls_spec, formula = depr_score ~ week) fit(gls_wflow, data = riesby)
Case weights
The underlying model implementation does not allow for case weights.
References
J Pinheiro, and D Bates. 2000. Mixed-effects models in S and S-PLUS. Springer, New York, NY
Linear regression via h2o
Description
This model uses regularized least squares to fit models with numeric outcomes.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This model has 2 tuning parameters:

- mixture: Proportion of Lasso Penalty (type: double, default: see below)
- penalty: Amount of Regularization (type: double, default: see below)
By default, when not given a fixed penalty, h2o::h2o.glm() uses a
heuristic approach to select the optimal value of penalty based on
training data. Setting the engine parameter lambda_search to TRUE
enables an efficient version of the grid search; see more details at
https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/lambda_search.html.
The choice of mixture depends on the engine parameter solver, which is
automatically chosen given training data and the specification of other
model parameters. When solver is set to 'L-BFGS', mixture defaults to 0
(ridge regression) and 0.5 otherwise.
Translation from parsnip to the original package
agua::h2o_train_glm() for linear_reg() is a wrapper around
h2o::h2o.glm() with family = "gaussian".
linear_reg(penalty = 1, mixture = 0.5) %>% set_engine("h2o") %>% translate()
## Linear Regression Model Specification (regression) ## ## Main Arguments: ## penalty = 1 ## mixture = 0.5 ## ## Computational engine: h2o ## ## Model fit template: ## agua::h2o_train_glm(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## validation_frame = missing_arg(), lambda = 1, alpha = 0.5, ## family = "gaussian")
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), parsnip will convert factor columns to
indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
By default, h2o::h2o.glm() uses the argument standardize = TRUE to
center and scale the data.
Initializing h2o
To use the h2o engine with tidymodels, please run h2o::h2o.init()
first. By default, this connects R to the local h2o server. This needs
to be done in every new R session. You can also connect to a remote h2o
server with an IP address; for more details see h2o::h2o.init().
You can control the number of threads in the thread pool used by h2o
with the nthreads argument. By default, it uses all CPUs on the host.
This is different from the usual parallel processing mechanism in
tidymodels for tuning: while tidymodels parallelizes over resamples,
h2o parallelizes over hyperparameter combinations for a given resample.
h2o will automatically shut down the local h2o instance started by R
when R is terminated. To manually stop the h2o server, run
h2o::h2o.shutdown().
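Putting these steps together, a hedged sketch (assuming the agua and h2o packages are installed and a local server can be started):
library(parsnip)
library(agua)  # registers the "h2o" engine

h2o::h2o.init(nthreads = -1)  # -1 uses all CPUs on the host

linear_reg(penalty = 1, mixture = 0.5) %>%
  set_engine("h2o") %>%
  fit(mpg ~ ., data = mtcars)

h2o::h2o.shutdown(prompt = FALSE)  # optional manual shutdown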
Saving fitted model objects
Models fitted with this engine may require native serialization methods to be properly saved and/or passed between R sessions. To learn more about preparing fitted models for serialization, see the bundle package.
Linear regression via keras/tensorflow
Description
This model uses regularized least squares to fit models with numeric outcomes.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This model has one tuning parameter:

- penalty: Amount of Regularization (type: double, default: 0.0)

For penalty, the amount of regularization is only the L2 penalty
(i.e., ridge or weight decay).
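A minimal sketch of a fit (assuming the keras package and a working tensorflow installation; epochs is an engine argument passed on to parsnip::keras_mlp()):
library(parsnip)

linear_reg(penalty = 0.1) %>%
  set_engine("keras", epochs = 20) %>%
  fit(mpg ~ ., data = mtcars)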
Translation from parsnip to the original package
linear_reg(penalty = double(1)) %>% set_engine("keras") %>% translate()
## Linear Regression Model Specification (regression) ## ## Main Arguments: ## penalty = double(1) ## ## Computational engine: keras ## ## Model fit template: ## parsnip::keras_mlp(x = missing_arg(), y = missing_arg(), penalty = double(1), ## hidden_units = 1, act = "linear")
keras_mlp()
is a parsnip wrapper around keras code for
neural networks. This model fits a linear regression as a network with a
single hidden unit.
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), parsnip will convert factor columns to
indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
Case weights
The underlying model implementation does not allow for case weights.
Examples
The “Fitting and Predicting with parsnip” article contains examples
for linear_reg() with the "keras" engine.
References
Hoerl, A., & Kennard, R. (2000). Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 42(1), 80-86.
Linear regression via lm
Description
stats::lm()
uses ordinary least squares to fit models with numeric outcomes.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This engine has no tuning parameters.
Translation from parsnip to the original package
linear_reg() %>% set_engine("lm") %>% translate()
## Linear Regression Model Specification (regression) ## ## Computational engine: lm ## ## Model fit template: ## stats::lm(formula = missing_arg(), data = missing_arg(), weights = missing_arg())
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), parsnip will convert factor columns to
indicators.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() functions have arguments called case_weights
that expect vectors of case weights.
However, the documentation in stats::lm() assumes that a specific type
of case weights is being used: “Non-NULL weights can be used to indicate
that different observations have different variances (with the values in
weights being inversely proportional to the variances); or equivalently,
when the elements of weights are positive integers w_i, that each
response y_i is the mean of w_i unit-weight observations (including the
case that there are w_i observations equal to y_i and the data have been
summarized). However, in the latter case, notice that within-group
variation is not used. Therefore, the sigma estimate and residual
degrees of freedom may be suboptimal; in the case of replication
weights, even wrong. Hence, standard errors and analysis of variance
tables should be treated with care” (emphasis added).
Depending on your application, the degrees of freedom for the model (and other statistics) might be incorrect.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
Examples
The “Fitting and Predicting with parsnip” article contains examples
for linear_reg() with the "lm" engine.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Linear regression via mixed models
Description
The "lme"
engine estimates fixed and random effect regression parameters
using maximum likelihood (or restricted maximum likelihood) estimation.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This model has no tuning parameters.
Translation from parsnip to the original package
The multilevelmod extension package is required to fit this model.
library(multilevelmod) linear_reg() %>% set_engine("lme") %>% set_mode("regression") %>% translate()
## Linear Regression Model Specification (regression) ## ## Computational engine: lme ## ## Model fit template: ## nlme::lme(fixed = missing_arg(), data = missing_arg())
Predicting new samples
This model can use subject-specific coefficient estimates to make
predictions (i.e. partial pooling). For example, this equation shows
the linear predictor (\eta) for a random intercept:
\eta_{i} = (\beta_0 + b_{0i}) + \beta_1x_{i1}
where i denotes the i-th independent experimental unit
(e.g. subject). When the model has seen subject i, it can use that
subject’s data to adjust the population intercept to be more specific
to that subject’s results.
What happens when data are being predicted for a subject that was not used in the model fit? In that case, this package uses only the population parameter estimates for prediction:
\hat{\eta}_{i'} = \hat{\beta}_0 + \hat{\beta}_1x_{i'1}
Depending on what covariates are in the model, this might have the effect of making the same prediction for all new samples. The population parameters are the “best estimate” for a subject that was not included in the model fit.
The tidymodels framework deliberately constrains predictions for new data to not use the training set or other data (to prevent information leakage).
Preprocessing requirements
There are no specific preprocessing needs. However, it is helpful to keep the clustering/subject identifier column as factor or character (instead of making them into dummy variables). See the examples in the next section.
Other details
The model can accept case weights.
With parsnip, we suggest using the fixed effects formula method when
fitting, but the random effects formula should be passed to
set_engine()
since it is an irregular (but required) argument:
library(tidymodels) data("riesby") linear_reg() %>% set_engine("lme", random = ~ 1|subject) %>% fit(depr_score ~ week, data = riesby)
When using tidymodels infrastructure, it may be better to use a
workflow. In this case, you can add the appropriate columns using
add_variables()
then supply the typical formula when adding the model:
library(tidymodels) lme_spec <- linear_reg() %>% set_engine("lme", random = ~ 1|subject) lme_wflow <- workflow() %>% # The data are included as-is using: add_variables(outcomes = depr_score, predictors = c(week, subject)) %>% add_model(lme_spec, formula = depr_score ~ week) fit(lme_wflow, data = riesby)
Case weights
The underlying model implementation does not allow for case weights.
References
J Pinheiro, and D Bates. 2000. Mixed-effects models in S and S-PLUS. Springer, New York, NY
West, B, K Welch, and A Galecki. 2014. Linear Mixed Models: A Practical Guide Using Statistical Software. CRC Press.
Thorson, J, Minto, C. 2015, Mixed effects: a unifying framework for statistical modelling in fisheries biology. ICES Journal of Marine Science, Volume 72, Issue 5, Pages 1245–1256.
Harrison, XA, Donaldson, L, Correa-Cano, ME, Evans, J, Fisher, DN, Goodwin, CED, Robinson, BS, Hodgson, DJ, Inger, R. 2018. A brief introduction to mixed effects modelling and multi-model inference in ecology. PeerJ 6:e4794.
DeBruine LM, Barr DJ. Understanding Mixed-Effects Models Through Data Simulation. 2021. Advances in Methods and Practices in Psychological Science.
Linear regression via mixed models
Description
The "lmer"
engine estimates fixed and random effect regression parameters
using maximum likelihood (or restricted maximum likelihood) estimation.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This model has no tuning parameters.
Translation from parsnip to the original package
The multilevelmod extension package is required to fit this model.
library(multilevelmod) linear_reg() %>% set_engine("lmer") %>% set_mode("regression") %>% translate()
## Linear Regression Model Specification (regression) ## ## Computational engine: lmer ## ## Model fit template: ## lme4::lmer(formula = missing_arg(), data = missing_arg(), weights = missing_arg())
Predicting new samples
This model can use subject-specific coefficient estimates to make
predictions (i.e. partial pooling). For example, this equation shows
the linear predictor (\eta) for a random intercept:
\eta_{i} = (\beta_0 + b_{0i}) + \beta_1x_{i1}
where i denotes the i-th independent experimental unit
(e.g. subject). When the model has seen subject i, it can use that
subject’s data to adjust the population intercept to be more specific
to that subject’s results.
What happens when data are being predicted for a subject that was not used in the model fit? In that case, this package uses only the population parameter estimates for prediction:
\hat{\eta}_{i'} = \hat{\beta}_0 + \hat{\beta}_1x_{i'1}
Depending on what covariates are in the model, this might have the effect of making the same prediction for all new samples. The population parameters are the “best estimate” for a subject that was not included in the model fit.
The tidymodels framework deliberately constrains predictions for new data to not use the training set or other data (to prevent information leakage).
Preprocessing requirements
There are no specific preprocessing needs. However, it is helpful to keep the clustering/subject identifier column as factor or character (instead of making them into dummy variables). See the examples in the next section.
Other details
The model can accept case weights.
With parsnip, we suggest using the formula method when fitting:
library(tidymodels) data("riesby") linear_reg() %>% set_engine("lmer") %>% fit(depr_score ~ week + (1|subject), data = riesby)
When using tidymodels infrastructure, it may be better to use a
workflow. In this case, you can add the appropriate columns using
add_variables()
then supply the typical formula when adding the model:
library(tidymodels) lmer_spec <- linear_reg() %>% set_engine("lmer") lmer_wflow <- workflow() %>% # The data are included as-is using: add_variables(outcomes = depr_score, predictors = c(week, subject)) %>% add_model(lmer_spec, formula = depr_score ~ week + (1|subject)) fit(lmer_wflow, data = riesby)
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() functions have arguments called case_weights
that expect vectors of case weights.
References
J Pinheiro, and D Bates. 2000. Mixed-effects models in S and S-PLUS. Springer, New York, NY
West, B, K Welch, and A Galecki. 2014. Linear Mixed Models: A Practical Guide Using Statistical Software. CRC Press.
Thorson, J, Minto, C. 2015, Mixed effects: a unifying framework for statistical modelling in fisheries biology. ICES Journal of Marine Science, Volume 72, Issue 5, Pages 1245–1256.
Harrison, XA, Donaldson, L, Correa-Cano, ME, Evans, J, Fisher, DN, Goodwin, CED, Robinson, BS, Hodgson, DJ, Inger, R. 2018. A brief introduction to mixed effects modelling and multi-model inference in ecology. PeerJ 6:e4794.
DeBruine LM, Barr DJ. Understanding Mixed-Effects Models Through Data Simulation. 2021. Advances in Methods and Practices in Psychological Science.
Linear quantile regression via the quantreg package
Description
quantreg::rq()
optimizes quantile loss to fit models with numeric outcomes.
Details
For this engine, there is a single mode: quantile regression
This model has the same structure as the model fit by lm(), but
instead of optimizing the sum of squared errors, it optimizes “quantile
loss” in order to produce better estimates of the predictive
distribution.
Tuning Parameters
This engine has no tuning parameters.
Translation from parsnip to the original package
This model only works with the "quantile regression" mode and requires
users to specify which areas of the distribution to predict via the
quantile_levels argument. For example:
linear_reg() %>% set_engine("quantreg") %>% set_mode("quantile regression", quantile_levels = (1:3) / 4) %>% translate()
## Linear Regression Model Specification (quantile regression) ## ## Computational engine: quantreg ## ## Model fit template: ## quantreg::rq(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), ## tau = quantile_levels) ## Quantile levels: 0.25, 0.5, and 0.75.
Output format
When multiple quantile levels are predicted, there are multiple
predicted values for each row of new data. The predict() method for
this mode produces a column named .pred_quantile that has a special
class of "quantile_pred", and it contains the predictions for each row.
For example:
library(modeldata) rlang::check_installed("quantreg") n <- nrow(Chicago) Chicago <- Chicago %>% select(ridership, Clark_Lake) Chicago_train <- Chicago[1:(n - 7), ] Chicago_test <- Chicago[(n - 6):n, ] qr_fit <- linear_reg() %>% set_engine("quantreg") %>% set_mode("quantile regression", quantile_levels = (1:3) / 4) %>% fit(ridership ~ Clark_Lake, data = Chicago_train) qr_fit
## parsnip model object ## ## Call: ## quantreg::rq(formula = ridership ~ Clark_Lake, tau = quantile_levels, ## data = data) ## ## Coefficients: ## tau= 0.25 tau= 0.50 tau= 0.75 ## (Intercept) -0.2064189 0.2051549 0.8112286 ## Clark_Lake 0.9820582 0.9862306 0.9777820 ## ## Degrees of freedom: 5691 total; 5689 residual
qr_pred <- predict(qr_fit, Chicago_test) qr_pred
## # A tibble: 7 x 1 ## .pred_quantile ## <qtls(3)> ## 1 [21.1] ## 2 [21.4] ## 3 [21.7] ## 4 [21.4] ## 5 [19.5] ## 6 [6.88] ## # i 1 more row
We can unnest these values and/or convert them to a rectangular format:
as_tibble(qr_pred$.pred_quantile)
## # A tibble: 21 x 3 ## .pred_quantile .quantile_levels .row ## <dbl> <dbl> <int> ## 1 20.6 0.25 1 ## 2 21.1 0.5 1 ## 3 21.5 0.75 1 ## 4 20.9 0.25 2 ## 5 21.4 0.5 2 ## 6 21.8 0.75 2 ## # i 15 more rows
as.matrix(qr_pred$.pred_quantile)
## [,1] [,2] [,3] ## [1,] 20.590627 21.090561 21.517717 ## [2,] 20.863639 21.364733 21.789541 ## [3,] 21.190665 21.693148 22.115142 ## [4,] 20.879352 21.380513 21.805185 ## [5,] 19.047814 19.541193 19.981622 ## [6,] 6.435241 6.875033 7.423968 ## [7,] 6.062058 6.500265 7.052411
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), parsnip will convert factor columns to
indicators.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() functions have arguments called case_weights
that expect vectors of case weights.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
Examples
The “Fitting and Predicting with parsnip” article contains examples
for linear_reg() with the "quantreg" engine.
References
Waldmann, E. (2018). Quantile regression: a short story on how and why. Statistical Modelling, 18(3-4), 203-218.
Linear regression via spark
Description
sparklyr::ml_linear_regression()
uses regularized least squares to fit
models with numeric outcomes.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This model has 2 tuning parameters:

- penalty: Amount of Regularization (type: double, default: 0.0)
- mixture: Proportion of Lasso Penalty (type: double, default: 0.0)

For penalty, the amount of regularization includes both the L1 penalty
(i.e., lasso) and the L2 penalty (i.e., ridge or weight decay). As for
mixture (see the sketch below):

- mixture = 1 specifies a pure lasso model,
- mixture = 0 specifies a ridge regression model, and
- 0 < mixture < 1 specifies an elastic net model, interpolating lasso and ridge.
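As a hedged sketch, a fit against Spark data might look like the following (assuming a local Spark installation via sparklyr; the connection and table objects are placeholders):
library(sparklyr)
library(parsnip)

sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars)

linear_reg(penalty = 0.01, mixture = 0.5) %>%
  set_engine("spark") %>%
  fit(mpg ~ wt + disp, data = mtcars_tbl)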
Translation from parsnip to the original package
linear_reg(penalty = double(1), mixture = double(1)) %>% set_engine("spark") %>% translate()
## Linear Regression Model Specification (regression) ## ## Main Arguments: ## penalty = double(1) ## mixture = double(1) ## ## Computational engine: spark ## ## Model fit template: ## sparklyr::ml_linear_regression(x = missing_arg(), formula = missing_arg(), ## weights = missing_arg(), reg_param = double(1), elastic_net_param = double(1))
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), parsnip will convert factor columns to
indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
By default, ml_linear_regression() uses the argument
standardization = TRUE to center and scale the data.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() functions have arguments called case_weights
that expect vectors of case weights.
Note that, for spark engines, the case_weight argument value should be
a character string to specify the column with the numeric case weights.
Other details
For models created using the "spark"
engine, there are several things
to consider.
- Only the formula interface via fit() is available; using fit_xy() will generate an error.
- The predictions will always be in a Spark table format. The names will be the same as documented but without the dots.
- There is no equivalent to factor columns in Spark tables, so class predictions are returned as character columns.
- To retain the model object for a new R session (via save()), the model$fit element of the parsnip object should be serialized via ml_save(object$fit) and separately saved to disk. In a new session, the object can be reloaded and reattached to the parsnip object.
References
Luraschi, J, K Kuo, and E Ruiz. 2019. Mastering Spark with R. O’Reilly Media
Hastie, T, R Tibshirani, and M Wainwright. 2015. Statistical Learning with Sparsity. CRC Press.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Linear regression via Bayesian Methods
Description
The "stan"
engine estimates regression parameters using Bayesian estimation.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This engine has no tuning parameters.
Important engine-specific options
Some relevant arguments that can be passed to set_engine()
:
- chains: A positive integer specifying the number of Markov chains. The default is 4.
- iter: A positive integer specifying the number of iterations for each chain (including warmup). The default is 2000.
- seed: The seed for random number generation.
- cores: Number of cores to use when executing the chains in parallel.
- prior: The prior distribution for the (non-hierarchical) regression coefficients. The "stan" engine does not fit any hierarchical terms. See the "stan_glmer" engine from the multilevelmod package for that type of model.
- prior_intercept: The prior distribution for the intercept (after centering all predictors).
See rstan::sampling() and rstanarm::priors() for more information on these and other options.
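For example, a hedged sketch of passing these options through set_engine() (the specific values are illustrative only):

linear_reg() %>%
  set_engine(
    "stan",
    chains = 4,                                # number of Markov chains
    iter = 5000,                               # iterations per chain, including warmup
    seed = 1234,                               # reproducible sampling
    prior_intercept = rstanarm::normal(0, 10)  # prior for the (centered) intercept
  )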
Translation from parsnip to the original package
linear_reg() %>%
  set_engine("stan") %>%
  translate()

## Linear Regression Model Specification (regression)
##
## Computational engine: stan
##
## Model fit template:
## rstanarm::stan_glm(formula = missing_arg(), data = missing_arg(),
##     weights = missing_arg(), family = stats::gaussian, refresh = 0)
Note that the refresh default prevents logging of the estimation process. Change this value in set_engine() to show the MCMC logs.
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values (e.g., dummy or indicator variables) for this engine. When using the formula method via fit(), parsnip will convert factor columns to indicators.
Other details
For prediction, the "stan" engine can compute posterior intervals analogous to confidence and prediction intervals. In these instances, the units are the original outcome. When std_error = TRUE, the standard deviation of the posterior distribution (or posterior predictive distribution, as appropriate) is returned.
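A hedged sketch of requesting these intervals (stan_fit and new_pts are placeholders for a fitted "stan" model and new data):

# Posterior intervals for the mean and for new observations:
predict(stan_fit, new_data = new_pts, type = "conf_int", level = 0.95, std_error = TRUE)
predict(stan_fit, new_data = new_pts, type = "pred_int", level = 0.95, std_error = TRUE)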
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() functions have arguments called case_weights that expect vectors of case weights.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for linear_reg()
with the "stan"
engine.
References
McElreath, R. 2020. Statistical Rethinking. CRC Press.
Linear regression via hierarchical Bayesian methods
Description
The "stan_glmer"
engine estimates hierarchical regression parameters using
Bayesian estimation.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This model has no tuning parameters.
Important engine-specific options
Some relevant arguments that can be passed to set_engine()
:
-
chains
: A positive integer specifying the number of Markov chains. The default is 4. -
iter
: A positive integer specifying the number of iterations for each chain (including warmup). The default is 2000. -
seed
: The seed for random number generation. -
cores
: Number of cores to use when executing the chains in parallel. -
prior
: The prior distribution for the (non-hierarchical) regression coefficients. -
prior_intercept
: The prior distribution for the intercept (after centering all predictors).
See ?rstanarm::stan_glmer and ?rstan::sampling for more information.
Translation from parsnip to the original package
The multilevelmod extension package is required to fit this model.
library(multilevelmod)

linear_reg() %>%
  set_engine("stan_glmer") %>%
  set_mode("regression") %>%
  translate()

## Linear Regression Model Specification (regression)
##
## Computational engine: stan_glmer
##
## Model fit template:
## rstanarm::stan_glmer(formula = missing_arg(), data = missing_arg(),
##     weights = missing_arg(), family = stats::gaussian, refresh = 0)
Predicting new samples
This model can use subject-specific coefficient estimates to make predictions (i.e., partial pooling). For example, this equation shows the linear predictor (\eta) for a random intercept:

\eta_{i} = (\beta_0 + b_{0i}) + \beta_1 x_{i1}

where i denotes the i-th independent experimental unit (e.g., subject). When the model has seen subject i, it can use that subject's data to adjust the population intercept to be more specific to that subject's results.

What happens when data are being predicted for a subject that was not used in the model fit? In that case, this package uses only the population parameter estimates for prediction:

\hat{\eta}_{i'} = \hat{\beta}_0 + \hat{\beta}_1 x_{i'1}

Depending on what covariates are in the model, this might have the effect of making the same prediction for all new samples. The population parameters are the “best estimate” for a subject that was not included in the model fit.
The tidymodels framework deliberately constrains predictions for new data to not use the training set or other data (to prevent information leakage).
Preprocessing requirements
There are no specific preprocessing needs. However, it is helpful to keep the clustering/subject identifier column as factor or character (instead of making them into dummy variables). See the examples in the next section.
Other details
The model can accept case weights.
With parsnip, we suggest using the formula method when fitting:
library(tidymodels)
data("riesby")

linear_reg() %>%
  set_engine("stan_glmer") %>%
  fit(depr_score ~ week + (1|subject), data = riesby)
When using tidymodels infrastructure, it may be better to use a
workflow. In this case, you can add the appropriate columns using
add_variables()
then supply the typical formula when adding the model:
library(tidymodels)

glmer_spec <-
  linear_reg() %>%
  set_engine("stan_glmer")

glmer_wflow <-
  workflow() %>%
  # The data are included as-is using:
  add_variables(outcomes = depr_score, predictors = c(week, subject)) %>%
  add_model(glmer_spec, formula = depr_score ~ week + (1|subject))

fit(glmer_wflow, data = riesby)
For prediction, the "stan_glmer" engine can compute posterior intervals analogous to confidence and prediction intervals. In these instances, the units are the original outcome. When std_error = TRUE, the standard deviation of the posterior distribution (or posterior predictive distribution, as appropriate) is returned.
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() functions have arguments called case_weights that expect vectors of case weights.
References
McElreath, R. 2020. Statistical Rethinking. CRC Press.
Sorensen, T, Vasishth, S. 2016. Bayesian linear mixed models using Stan: A tutorial for psychologists, linguists, and cognitive scientists, arXiv:1506.06201.
Logistic regression via brulee
Description
brulee::brulee_logistic_reg()
fits a generalized linear model for binary
outcomes. A linear combination of the predictors is used to model the log
odds of an event.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 2 tuning parameters:
- penalty: Amount of Regularization (type: double, default: 0.001)
- mixture: Proportion of Lasso Penalty (type: double, default: 0.0)
The use of the L1 penalty (a.k.a. the lasso penalty) does not force parameters to be strictly zero (as it does in packages such as glmnet). The zeroing out of parameters is a specific feature of the optimization method used in those packages.
Other engine arguments of interest:
- optimizer: The optimization method. See brulee::brulee_linear_reg().
- epochs: An integer for the number of passes through the training set.
- learn_rate: A number used to accelerate the gradient descent process.
- momentum: A number used to incorporate historical gradient information during optimization (optimizer = "SGD" only).
- batch_size: An integer for the number of training set points in each batch.
- stop_iter: A non-negative integer for how many iterations with no improvement before stopping (default: 5L).
- class_weights: Numeric class weights. See brulee::brulee_logistic_reg().
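A hedged sketch of passing a few of these through set_engine() (the values are illustrative only):

logistic_reg(penalty = 0.001) %>%
  set_engine(
    "brulee",
    epochs = 200,      # passes through the training set
    learn_rate = 0.05, # step size for gradient descent
    stop_iter = 10     # early-stopping patience
  )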
Translation from parsnip to the original package (classification)
logistic_reg(penalty = double(1)) %>%
  set_engine("brulee") %>%
  translate()

## Logistic Regression Model Specification (classification)
##
## Main Arguments:
##   penalty = double(1)
##
## Computational engine: brulee
##
## Model fit template:
## brulee::brulee_logistic_reg(x = missing_arg(), y = missing_arg(),
##     penalty = double(1))
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values (e.g., dummy or indicator variables) for this engine. When using the formula method via fit(), parsnip will convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
Case weights
The underlying model implementation does not allow for case weights.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Logistic regression via generalized estimating equations (GEE)
Description
gee::gee()
uses generalized least squares to fit different types of models
with errors that are not independent.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has no formal tuning parameters. It may be beneficial to determine the appropriate correlation structure to use, but this typically does not affect the predicted value of the model. It does have an effect on the inferential results and parameter covariance values.
Translation from parsnip to the original package
The multilevelmod extension package is required to fit this model.
library(multilevelmod)

logistic_reg() %>%
  set_engine("gee") %>%
  translate()

## Logistic Regression Model Specification (classification)
##
## Computational engine: gee
##
## Model fit template:
## multilevelmod::gee_fit(formula = missing_arg(), data = missing_arg(),
##     family = binomial)
multilevelmod::gee_fit() is a wrapper around gee::gee().
Preprocessing requirements
There are no specific preprocessing needs. However, it is helpful to keep the clustering/subject identifier column as factor or character (instead of making them into dummy variables). See the examples in the next section.
Other details
The model cannot accept case weights.
Both gee::gee() and geepack::geeglm() specify the id/cluster variable using an argument id that requires a vector. parsnip doesn't work that way, so we enable this model to be fit using an artificial function id_var() to be used in the formula. So, in the original package, the call would look like:
gee(breaks ~ tension, id = wool, data = warpbreaks, corstr = "exchangeable")
With parsnip, we suggest using the formula method when fitting:

library(tidymodels)
data("toenail", package = "HSAUR3")

logistic_reg() %>%
  set_engine("gee", corstr = "exchangeable") %>%
  fit(outcome ~ treatment * visit + id_var(patientID), data = toenail)
When using tidymodels infrastructure, it may be better to use a
workflow. In this case, you can add the appropriate columns using
add_variables()
then supply the GEE formula when adding the model:
library(tidymodels)

gee_spec <-
  logistic_reg() %>%
  set_engine("gee", corstr = "exchangeable")

gee_wflow <-
  workflow() %>%
  # The data are included as-is using:
  add_variables(outcomes = outcome, predictors = c(treatment, visit, patientID)) %>%
  add_model(gee_spec, formula = outcome ~ treatment * visit + id_var(patientID))

fit(gee_wflow, data = toenail)
The gee::gee() function always prints out warnings and output, even when silent = TRUE. The parsnip "gee" engine, by contrast, silences all console output coming from gee::gee(), even if silent = FALSE.
Also, because of issues with the gee()
function, a supplementary call
to glm()
is needed to get the rank and QR decomposition objects so
that predict()
can be used.
Case weights
The underlying model implementation does not allow for case weights.
References
Liang, K.Y. and Zeger, S.L. (1986) Longitudinal data analysis using generalized linear models. Biometrika, 73, 13–22.
Zeger, S.L. and Liang, K.Y. (1986) Longitudinal data analysis for discrete and continuous outcomes. Biometrics, 42, 121–130.
Logistic regression via glm
Description
stats::glm()
fits a generalized linear model for binary outcomes. A
linear combination of the predictors is used to model the log odds of an
event.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This engine has no tuning parameters, but you can set the family parameter (and/or link) as an engine argument (see below).
Translation from parsnip to the original package
logistic_reg() %>% set_engine("glm") %>% translate()
## Logistic Regression Model Specification (classification) ## ## Computational engine: glm ## ## Model fit template: ## stats::glm(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), ## family = stats::binomial)
To use a non-default family and/or link, pass them as arguments to set_engine():
logistic_reg() %>% set_engine("glm", family = stats::binomial(link = "probit")) %>% translate()
## Logistic Regression Model Specification (classification) ## ## Engine-Specific Arguments: ## family = stats::binomial(link = "probit") ## ## Computational engine: glm ## ## Model fit template: ## stats::glm(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), ## family = stats::binomial(link = "probit"))
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values (e.g., dummy or indicator variables) for this engine. When using the formula method via fit(), parsnip will convert factor columns to indicators.
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() functions have arguments called case_weights that expect vectors of case weights.
However, the documentation in stats::glm() assumes that a specific type of case weight is being used: “Non-NULL weights can be used to indicate that different observations have different dispersions (with the values in weights being inversely proportional to the dispersions); or equivalently, when the elements of weights are positive integers w_i, that each response y_i is the mean of w_i unit-weight observations. For a binomial GLM prior weights are used to give the number of trials when the response is the proportion of successes: they would rarely be used for a Poisson GLM.”
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
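A hedged sketch of that reduction (glm_fit is a placeholder for a logistic_reg() model fit with the "glm" engine):

library(butcher)

weigh(glm_fit)                 # list the heaviest components of the fitted object
small_fit <- butcher(glm_fit)  # drop components that are not needed for prediction
saveRDS(small_fit, file = "glm_fit.rds")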
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for logistic_reg()
with the "glm"
engine.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Logistic regression via mixed models
Description
The "glmer"
engine estimates fixed and random effect regression parameters
using maximum likelihood (or restricted maximum likelihood) estimation.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has no tuning parameters.
Translation from parsnip to the original package
The multilevelmod extension package is required to fit this model.
library(multilevelmod)

logistic_reg() %>%
  set_engine("glmer") %>%
  translate()

## Logistic Regression Model Specification (classification)
##
## Computational engine: glmer
##
## Model fit template:
## lme4::glmer(formula = missing_arg(), data = missing_arg(), weights = missing_arg(),
##     family = binomial)
Predicting new samples
This model can use subject-specific coefficient estimates to make predictions (i.e., partial pooling). For example, this equation shows the linear predictor (\eta) for a random intercept:

\eta_{i} = (\beta_0 + b_{0i}) + \beta_1 x_{i1}

where i denotes the i-th independent experimental unit (e.g., subject). When the model has seen subject i, it can use that subject's data to adjust the population intercept to be more specific to that subject's results.

What happens when data are being predicted for a subject that was not used in the model fit? In that case, this package uses only the population parameter estimates for prediction:

\hat{\eta}_{i'} = \hat{\beta}_0 + \hat{\beta}_1 x_{i'1}

Depending on what covariates are in the model, this might have the effect of making the same prediction for all new samples. The population parameters are the “best estimate” for a subject that was not included in the model fit.
The tidymodels framework deliberately constrains predictions for new data to not use the training set or other data (to prevent information leakage).
Preprocessing requirements
There are no specific preprocessing needs. However, it is helpful to keep the clustering/subject identifier column as factor or character (instead of making them into dummy variables). See the examples in the next section.
Other details
The model can accept case weights.
With parsnip, we suggest using the formula method when fitting:
library(tidymodels)
data("toenail", package = "HSAUR3")

logistic_reg() %>%
  set_engine("glmer") %>%
  fit(outcome ~ treatment * visit + (1 | patientID), data = toenail)
When using tidymodels infrastructure, it may be better to use a
workflow. In this case, you can add the appropriate columns using
add_variables()
then supply the typical formula when adding the model:
library(tidymodels)

glmer_spec <-
  logistic_reg() %>%
  set_engine("glmer")

glmer_wflow <-
  workflow() %>%
  # The data are included as-is using:
  add_variables(outcomes = outcome, predictors = c(treatment, visit, patientID)) %>%
  add_model(glmer_spec, formula = outcome ~ treatment * visit + (1 | patientID))

fit(glmer_wflow, data = toenail)
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() functions have arguments called case_weights that expect vectors of case weights.
References
J Pinheiro, and D Bates. 2000. Mixed-effects models in S and S-PLUS. Springer, New York, NY
West, B, K Welch, and A Galecki. 2014. Linear Mixed Models: A Practical Guide Using Statistical Software. CRC Press.
Thorson, J, Minto, C. 2015, Mixed effects: a unifying framework for statistical modelling in fisheries biology. ICES Journal of Marine Science, Volume 72, Issue 5, Pages 1245–1256.
Harrison, XA, Donaldson, L, Correa-Cano, ME, Evans, J, Fisher, DN, Goodwin, CED, Robinson, BS, Hodgson, DJ, Inger, R. 2018. A brief introduction to mixed effects modelling and multi-model inference in ecology. PeerJ 6:e4794.
DeBruine LM, Barr DJ. Understanding Mixed-Effects Models Through Data Simulation. 2021. Advances in Methods and Practices in Psychological Science.
Logistic regression via glmnet
Description
glmnet::glmnet()
fits a generalized linear model for binary outcomes. A
linear combination of the predictors is used to model the log odds of an
event.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 2 tuning parameters:
- penalty: Amount of Regularization (type: double, default: see below)
- mixture: Proportion of Lasso Penalty (type: double, default: 1.0)
The penalty parameter has no default and requires a single numeric value. For more details about this, and the glmnet model in general, see glmnet-details. As for mixture:
- mixture = 1 specifies a pure lasso model,
- mixture = 0 specifies a ridge regression model, and
- 0 < mixture < 1 specifies an elastic net model, interpolating lasso and ridge.
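A hedged sketch of the single-value requirement at fit time (the penalty values are illustrative; two_class_dat is from the modeldata package):

library(tidymodels)
data(two_class_dat)

glmnet_fit <-
  logistic_reg(penalty = 0.01, mixture = 1) %>%  # a single penalty is needed to fit
  set_engine("glmnet") %>%
  fit(Class ~ A + B, data = two_class_dat)

# Several penalty values can be evaluated at prediction time:
multi_predict(glmnet_fit, two_class_dat, penalty = c(0.001, 0.01, 0.1))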
Translation from parsnip to the original package
logistic_reg(penalty = double(1), mixture = double(1)) %>%
  set_engine("glmnet") %>%
  translate()

## Logistic Regression Model Specification (classification)
##
## Main Arguments:
##   penalty = 0
##   mixture = double(1)
##
## Computational engine: glmnet
##
## Model fit template:
## glmnet::glmnet(x = missing_arg(), y = missing_arg(), weights = missing_arg(),
##     alpha = double(1), family = "binomial")
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values (e.g., dummy or indicator variables) for this engine. When using the formula method via fit(), parsnip will convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one. By default, glmnet::glmnet() uses the argument standardize = TRUE to center and scale the data.
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() functions have arguments called case_weights that expect vectors of case weights.
Sparse Data
This model can utilize sparse data during model fitting and prediction. Both sparse matrices such as dgCMatrix from the Matrix package and sparse tibbles from the sparsevctrs package are supported. See sparse_data for more information.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for logistic_reg()
with the "glmnet"
engine.
References
Hastie, T, R Tibshirani, and M Wainwright. 2015. Statistical Learning with Sparsity. CRC Press.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Logistic regression via h2o
Description
h2o::h2o.glm()
fits a generalized linear model for binary outcomes.
A linear combination of the predictors is used to model the log odds of an
event.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 2 tuning parameters:
- mixture: Proportion of Lasso Penalty (type: double, default: see below)
- penalty: Amount of Regularization (type: double, default: see below)
By default, when not given a fixed penalty, h2o::h2o.glm() uses a heuristic approach to select the optimal value of penalty based on training data. Setting the engine parameter lambda_search to TRUE enables an efficient version of the grid search; see more details at https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/lambda_search.html.
The choice of mixture depends on the engine parameter solver, which is automatically chosen given training data and the specification of other model parameters. When solver is set to 'L-BFGS', mixture defaults to 0 (ridge regression), and 0.5 otherwise.
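A hedged sketch of turning on the penalty search:

logistic_reg() %>%
  set_engine("h2o", lambda_search = TRUE)  # let h2o search over penalty values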
Translation from parsnip to the original package
agua::h2o_train_glm() for logistic_reg() is a wrapper around h2o::h2o.glm(). h2o automatically picks the link function and distribution family for binomial responses.
logistic_reg() %>% set_engine("h2o") %>% translate()
## Logistic Regression Model Specification (classification) ## ## Computational engine: h2o ## ## Model fit template: ## agua::h2o_train_glm(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## validation_frame = missing_arg(), family = "binomial")
To use a non-default argument in h2o::h2o.glm(), pass it in as an engine argument to set_engine():
logistic_reg() %>% set_engine("h2o", compute_p_values = TRUE) %>% translate()
## Logistic Regression Model Specification (classification) ## ## Engine-Specific Arguments: ## compute_p_values = TRUE ## ## Computational engine: h2o ## ## Model fit template: ## agua::h2o_train_glm(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## validation_frame = missing_arg(), compute_p_values = TRUE, ## family = "binomial")
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values (e.g., dummy or indicator variables) for this engine. When using the formula method via fit(), parsnip will convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one. By default, h2o::h2o.glm() uses the argument standardize = TRUE to center and scale all numeric columns.
Initializing h2o
To use the h2o engine with tidymodels, please run h2o::h2o.init() first. By default, this connects R to the local h2o server. This needs to be done in every new R session. You can also connect to a remote h2o server with an IP address; for more details see h2o::h2o.init().
You can control the number of threads in the thread pool used by h2o with the nthreads argument. By default, it uses all CPUs on the host.
This is different from the usual parallel processing mechanism in tidymodels for tuning: while tidymodels parallelizes over resamples, h2o parallelizes over hyperparameter combinations for a given resample.
h2o will automatically shut down the local h2o instance started by R when R is terminated. To manually stop the h2o server, run h2o::h2o.shutdown().
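A hedged sketch of a typical session (the thread count is illustrative):

library(h2o)

h2o.init(nthreads = 4)        # connect to (or start) a local h2o server with 4 threads
# ... fit and predict with the "h2o" engine ...
h2o.shutdown(prompt = FALSE)  # manually stop the local server when done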
Saving fitted model objects
Models fitted with this engine may require native serialization methods to be properly saved and/or passed between R sessions. To learn more about preparing fitted models for serialization, see the bundle package.
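A hedged sketch using the bundle package (h2o_fit is a placeholder for a model fit with this engine):

library(bundle)

bundled <- bundle(h2o_fit)              # create a serializable stand-in
saveRDS(bundled, file = "h2o_fit.rds")

# In a new R session:
h2o_fit <- unbundle(readRDS("h2o_fit.rds"))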
Logistic regression via keras
Description
keras_mlp()
fits a generalized linear model for binary outcomes. A
linear combination of the predictors is used to model the log odds of an
event.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has one tuning parameter:
- penalty: Amount of Regularization (type: double, default: 0.0)
For penalty, the amount of regularization is only the L2 penalty (i.e., ridge or weight decay).
Translation from parsnip to the original package
logistic_reg(penalty = double(1)) %>% set_engine("keras") %>% translate()
## Logistic Regression Model Specification (classification) ## ## Main Arguments: ## penalty = double(1) ## ## Computational engine: keras ## ## Model fit template: ## parsnip::keras_mlp(x = missing_arg(), y = missing_arg(), penalty = double(1), ## hidden_units = 1, act = "linear")
keras_mlp() is a parsnip wrapper around keras code for neural networks. This model fits a logistic regression as a network with a single hidden unit.
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values (e.g., dummy or indicator variables) for this engine. When using the formula method via fit(), parsnip will convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
Case weights
The underlying model implementation does not allow for case weights.
Saving fitted model objects
Models fitted with this engine may require native serialization methods to be properly saved and/or passed between R sessions. To learn more about preparing fitted models for serialization, see the bundle package.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for logistic_reg()
with the "keras"
engine.
References
Hoerl, A., & Kennard, R. (2000). Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 42(1), 80-86.
Logistic regression via LiblineaR
Description
LiblineaR::LiblineaR()
fits a generalized linear model for binary outcomes. A
linear combination of the predictors is used to model the log odds of an
event.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 2 tuning parameters:
- penalty: Amount of Regularization (type: double, default: see below)
- mixture: Proportion of Lasso Penalty (type: double, default: 0)
For LiblineaR models, the value for mixture can either be 0 (for ridge) or 1 (for lasso) but not other intermediate values. In the LiblineaR::LiblineaR() documentation, these correspond to types 0 (L2-regularized) and 6 (L1-regularized).
Be aware that the LiblineaR engine regularizes the intercept. Other regularized regression models do not, which will result in different parameter estimates.
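A hedged sketch of the two valid settings (the penalty value is illustrative):

# Ridge (L2) regularization: LiblineaR type 0
logistic_reg(penalty = 0.1, mixture = 0) %>%
  set_engine("LiblineaR")

# Lasso (L1) regularization: LiblineaR type 6
logistic_reg(penalty = 0.1, mixture = 1) %>%
  set_engine("LiblineaR")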
Translation from parsnip to the original package
logistic_reg(penalty = double(1), mixture = double(1)) %>%
  set_engine("LiblineaR") %>%
  translate()

## Logistic Regression Model Specification (classification)
##
## Main Arguments:
##   penalty = double(1)
##   mixture = double(1)
##
## Computational engine: LiblineaR
##
## Model fit template:
## LiblineaR::LiblineaR(x = missing_arg(), y = missing_arg(), cost = Inf,
##     type = double(1), verbose = FALSE)
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values (e.g., dummy or indicator variables) for this engine. When using the formula method via fit(), parsnip will convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
Sparse Data
This model can utilize sparse data during model fitting and prediction. Both sparse matrices such as dgCMatrix from the Matrix package and sparse tibbles from the sparsevctrs package are supported. See sparse_data for more information.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for logistic_reg()
with the "LiblineaR"
engine.
References
Hastie, T, R Tibshirani, and M Wainwright. 2015. Statistical Learning with Sparsity. CRC Press.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Logistic regression via spark
Description
sparklyr::ml_logistic_regression()
fits a generalized linear model for
binary outcomes. A linear combination of the predictors is used to model the
log odds of an event.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 2 tuning parameters:
- penalty: Amount of Regularization (type: double, default: 0.0)
- mixture: Proportion of Lasso Penalty (type: double, default: 0.0)
For penalty, the amount of regularization includes both the L1 penalty (i.e., lasso) and the L2 penalty (i.e., ridge or weight decay). As for mixture:
- mixture = 1 specifies a pure lasso model,
- mixture = 0 specifies a ridge regression model, and
- 0 < mixture < 1 specifies an elastic net model, interpolating lasso and ridge.
Translation from parsnip to the original package
logistic_reg(penalty = double(1), mixture = double(1)) %>%
  set_engine("spark") %>%
  translate()

## Logistic Regression Model Specification (classification)
##
## Main Arguments:
##   penalty = double(1)
##   mixture = double(1)
##
## Computational engine: spark
##
## Model fit template:
## sparklyr::ml_logistic_regression(x = missing_arg(), formula = missing_arg(),
##     weights = missing_arg(), reg_param = double(1), elastic_net_param = double(1),
##     family = "binomial")
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values (e.g., dummy or indicator variables) for this engine. When using the formula method via fit(), parsnip will convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one. By default, ml_logistic_regression() uses the argument standardization = TRUE to center and scale the data.
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() functions have arguments called case_weights that expect vectors of case weights.
Note that, for spark engines, the case_weight argument value should be a character string that specifies the column with the numeric case weights.
Other details
For models created using the "spark" engine, there are several things to consider.
- Only the formula interface, via fit(), is available; using fit_xy() will generate an error.
- The predictions will always be in a Spark table format. The names will be the same as documented but without the dots.
- There is no equivalent to factor columns in Spark tables, so class predictions are returned as character columns.
- To retain the model object for a new R session (via save()), the model$fit element of the parsnip object should be serialized via ml_save(object$fit) and separately saved to disk. In a new session, the object can be reloaded and reattached to the parsnip object.
References
Luraschi, J, K Kuo, and E Ruiz. 2019. Mastering Spark with R. O’Reilly Media.
Hastie, T, R Tibshirani, and M Wainwright. 2015. Statistical Learning with Sparsity. CRC Press.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Logistic regression via stan
Description
rstanarm::stan_glm()
fits a generalized linear model for binary outcomes.
A linear combination of the predictors is used to model the log odds of an
event.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This engine has no tuning parameters.
Important engine-specific options
Some relevant arguments that can be passed to set_engine()
:
-
chains
: A positive integer specifying the number of Markov chains. The default is 4. -
iter
: A positive integer specifying the number of iterations for each chain (including warmup). The default is 2000. -
seed
: The seed for random number generation. -
cores
: Number of cores to use when executing the chains in parallel. -
prior
: The prior distribution for the (non-hierarchical) regression coefficients. This"stan"
engine does not fit any hierarchical terms. -
prior_intercept
: The prior distribution for the intercept (after centering all predictors).
See rstan::sampling() and rstanarm::priors() for more information on these and other options.
Translation from parsnip to the original package
logistic_reg() %>% set_engine("stan") %>% translate()
## Logistic Regression Model Specification (classification) ## ## Computational engine: stan ## ## Model fit template: ## rstanarm::stan_glm(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg(), family = stats::binomial, refresh = 0)
Note that the refresh default prevents logging of the estimation process. Change this value in set_engine() to show the MCMC logs.
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values (e.g., dummy or indicator variables) for this engine. When using the formula method via fit(), parsnip will convert factor columns to indicators.
Other details
For prediction, the "stan" engine can compute posterior intervals analogous to confidence and prediction intervals. In these instances, the units are the original outcome. When std_error = TRUE, the standard deviation of the posterior distribution (or posterior predictive distribution, as appropriate) is returned.
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() functions have arguments called case_weights that expect vectors of case weights.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for logistic_reg()
with the "stan"
engine.
References
McElreath, R. 2020. Statistical Rethinking. CRC Press.
Logistic regression via hierarchical Bayesian methods
Description
The "stan_glmer"
engine estimates hierarchical regression parameters using
Bayesian estimation.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has no tuning parameters.
Important engine-specific options
Some relevant arguments that can be passed to set_engine()
:
-
chains
: A positive integer specifying the number of Markov chains. The default is 4. -
iter
: A positive integer specifying the number of iterations for each chain (including warmup). The default is 2000. -
seed
: The seed for random number generation. -
cores
: Number of cores to use when executing the chains in parallel. -
prior
: The prior distribution for the (non-hierarchical) regression coefficients. -
prior_intercept
: The prior distribution for the intercept (after centering all predictors).
See ?rstanarm::stan_glmer and ?rstan::sampling for more information.
Translation from parsnip to the original package
The multilevelmod extension package is required to fit this model.
library(multilevelmod)

logistic_reg() %>%
  set_engine("stan_glmer") %>%
  translate()

## Logistic Regression Model Specification (classification)
##
## Computational engine: stan_glmer
##
## Model fit template:
## rstanarm::stan_glmer(formula = missing_arg(), data = missing_arg(),
##     weights = missing_arg(), family = stats::binomial, refresh = 0)
Predicting new samples
This model can use subject-specific coefficient estimates to make predictions (i.e., partial pooling). For example, this equation shows the linear predictor (\eta) for a random intercept:

\eta_{i} = (\beta_0 + b_{0i}) + \beta_1 x_{i1}

where i denotes the i-th independent experimental unit (e.g., subject). When the model has seen subject i, it can use that subject's data to adjust the population intercept to be more specific to that subject's results.

What happens when data are being predicted for a subject that was not used in the model fit? In that case, this package uses only the population parameter estimates for prediction:

\hat{\eta}_{i'} = \hat{\beta}_0 + \hat{\beta}_1 x_{i'1}

Depending on what covariates are in the model, this might have the effect of making the same prediction for all new samples. The population parameters are the “best estimate” for a subject that was not included in the model fit.
The tidymodels framework deliberately constrains predictions for new data to not use the training set or other data (to prevent information leakage).
Preprocessing requirements
There are no specific preprocessing needs. However, it is helpful to keep the clustering/subject identifier column as factor or character (instead of making them into dummy variables). See the examples in the next section.
Other details
The model can accept case weights.
With parsnip, we suggest using the formula method when fitting:
library(tidymodels)
data("toenail", package = "HSAUR3")

logistic_reg() %>%
  set_engine("stan_glmer") %>%
  fit(outcome ~ treatment * visit + (1 | patientID), data = toenail)
When using tidymodels infrastructure, it may be better to use a
workflow. In this case, you can add the appropriate columns using
add_variables()
then supply the typical formula when adding the model:
library(tidymodels)

glmer_spec <-
  logistic_reg() %>%
  set_engine("stan_glmer")

glmer_wflow <-
  workflow() %>%
  # The data are included as-is using:
  add_variables(outcomes = outcome, predictors = c(treatment, visit, patientID)) %>%
  add_model(glmer_spec, formula = outcome ~ treatment * visit + (1 | patientID))

fit(glmer_wflow, data = toenail)
For prediction, the "stan_glmer" engine can compute posterior intervals analogous to confidence and prediction intervals. In these instances, the units are the original outcome. When std_error = TRUE, the standard deviation of the posterior distribution (or posterior predictive distribution, as appropriate) is returned.
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() functions have arguments called case_weights that expect vectors of case weights.
References
McElreath, R. 2020. Statistical Rethinking. CRC Press.
Sorensen, T, Vasishth, S. 2016. Bayesian linear mixed models using Stan: A tutorial for psychologists, linguists, and cognitive scientists, arXiv:1506.06201.
Multivariate adaptive regression splines (MARS) via earth
Description
earth::earth()
fits a generalized linear model that uses artificial features for
some predictors. These features resemble hinge functions and the result is
a model that is a segmented regression in small dimensions.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 3 tuning parameters:
- num_terms: # Model Terms (type: integer, default: see below)
- prod_degree: Degree of Interaction (type: integer, default: 1L)
- prune_method: Pruning Method (type: character, default: ‘backward’)
Parsnip changes the default range for num_terms to c(50, 500).
Translation from parsnip to the original package (regression)
mars(num_terms = integer(1), prod_degree = integer(1), prune_method = character(1)) %>%
  set_engine("earth") %>%
  set_mode("regression") %>%
  translate()

## MARS Model Specification (regression)
##
## Main Arguments:
##   num_terms = integer(1)
##   prod_degree = integer(1)
##   prune_method = character(1)
##
## Computational engine: earth
##
## Model fit template:
## earth::earth(formula = missing_arg(), data = missing_arg(), weights = missing_arg(),
##     nprune = integer(1), degree = integer(1), pmethod = character(1),
##     keepxy = TRUE)
Translation from parsnip to the original package (classification)
mars(num_terms = integer(1), prod_degree = integer(1), prune_method = character(1)) %>%
  set_engine("earth") %>%
  set_mode("classification") %>%
  translate()

## MARS Model Specification (classification)
##
## Main Arguments:
##   num_terms = integer(1)
##   prod_degree = integer(1)
##   prune_method = character(1)
##
## Engine-Specific Arguments:
##   glm = list(family = stats::binomial)
##
## Computational engine: earth
##
## Model fit template:
## earth::earth(formula = missing_arg(), data = missing_arg(), weights = missing_arg(),
##     nprune = integer(1), degree = integer(1), pmethod = character(1),
##     glm = list(family = stats::binomial), keepxy = TRUE)
An alternate method for using MARS for categorical outcomes can be found in discrim_flexible().
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values (e.g., dummy or indicator variables) for this engine. When using the formula method via fit(), parsnip will convert factor columns to indicators.
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() functions have arguments called case_weights that expect vectors of case weights.
Note that the earth package documentation has: “In the current implementation, building models with weights can be slow.”
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for mars()
with the "earth"
engine.
References
Friedman, J. 1991. “Multivariate Adaptive Regression Splines.” The Annals of Statistics, vol. 19, no. 1, pp. 1-67.
Milborrow, S. “Notes on the earth package.”
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Multilayer perceptron via brulee
Description
brulee::brulee_mlp()
fits a neural network.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 7 tuning parameters:
- epochs: # Epochs (type: integer, default: 100L)
- hidden_units: # Hidden Units (type: integer, default: 3L)
- activation: Activation Function (type: character, default: ‘relu’)
- penalty: Amount of Regularization (type: double, default: 0.001)
- mixture: Proportion of Lasso Penalty (type: double, default: 0.0)
- dropout: Dropout Rate (type: double, default: 0.0)
- learn_rate: Learning Rate (type: double, default: 0.01)
The use of the L1 penalty (a.k.a. the lasso penalty) does not force parameters to be strictly zero (as it does in packages such as glmnet). The zeroing out of parameters is a specific feature of the optimization method used in those packages.
Both penalty and dropout should not be used in the same model.
Other engine arguments of interest:
- momentum: A number used to incorporate historical gradient information during optimization.
- batch_size: An integer for the number of training set points in each batch.
- class_weights: Numeric class weights. See brulee::brulee_mlp().
- stop_iter: A non-negative integer for how many iterations with no improvement before stopping (default: 5L).
- rate_schedule: A function to change the learning rate over epochs. See brulee::schedule_decay_time() for details.
Translation from parsnip to the original package (regression)
mlp(
  hidden_units = integer(1),
  penalty = double(1),
  dropout = double(1),
  epochs = integer(1),
  learn_rate = double(1),
  activation = character(1)
) %>%
  set_engine("brulee") %>%
  set_mode("regression") %>%
  translate()

## Single Layer Neural Network Model Specification (regression)
##
## Main Arguments:
##   hidden_units = integer(1)
##   penalty = double(1)
##   dropout = double(1)
##   epochs = integer(1)
##   activation = character(1)
##   learn_rate = double(1)
##
## Computational engine: brulee
##
## Model fit template:
## brulee::brulee_mlp(x = missing_arg(), y = missing_arg(), hidden_units = integer(1),
##     penalty = double(1), dropout = double(1), epochs = integer(1),
##     activation = character(1), learn_rate = double(1))
Note that parsnip automatically sets linear activation in the last layer.
Translation from parsnip to the original package (classification)
mlp(
  hidden_units = integer(1),
  penalty = double(1),
  dropout = double(1),
  epochs = integer(1),
  learn_rate = double(1),
  activation = character(1)
) %>%
  set_engine("brulee") %>%
  set_mode("classification") %>%
  translate()

## Single Layer Neural Network Model Specification (classification)
##
## Main Arguments:
##   hidden_units = integer(1)
##   penalty = double(1)
##   dropout = double(1)
##   epochs = integer(1)
##   activation = character(1)
##   learn_rate = double(1)
##
## Computational engine: brulee
##
## Model fit template:
## brulee::brulee_mlp(x = missing_arg(), y = missing_arg(), hidden_units = integer(1),
##     penalty = double(1), dropout = double(1), epochs = integer(1),
##     activation = character(1), learn_rate = double(1))
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values (e.g., dummy or indicator variables) for this engine. When using the formula method via fit(), parsnip will convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
Case weights
The underlying model implementation does not allow for case weights.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Multilayer perceptron via brulee with two hidden layers
Description
brulee::brulee_mlp_two_layer()
fits a neural network (with version 0.3.0.9000 or higher of brulee).
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 7 tuning parameters:
- epochs: # Epochs (type: integer, default: 100L)
- hidden_units: # Hidden Units (type: integer, default: 3L)
- activation: Activation Function (type: character, default: ‘relu’)
- penalty: Amount of Regularization (type: double, default: 0.001)
- mixture: Proportion of Lasso Penalty (type: double, default: 0.0)
- dropout: Dropout Rate (type: double, default: 0.0)
- learn_rate: Learning Rate (type: double, default: 0.01)
The use of the L1 penalty (a.k.a. the lasso penalty) does not force parameters to be strictly zero (as it does in packages such as glmnet). The zeroing out of parameters is a specific feature of the optimization method used in those packages.
Both penalty and dropout should not be used in the same model.
Other engine arguments of interest:
- hidden_units_2 and activation_2 control the format of the second layer.
- momentum: A number used to incorporate historical gradient information during optimization.
- batch_size: An integer for the number of training set points in each batch.
- class_weights: Numeric class weights. See brulee::brulee_mlp().
- stop_iter: A non-negative integer for how many iterations with no improvement before stopping (default: 5L).
- rate_schedule: A function to change the learning rate over epochs. See brulee::schedule_decay_time() for details.
Translation from parsnip to the original package (regression)
mlp(
  hidden_units = integer(1),
  penalty = double(1),
  dropout = double(1),
  epochs = integer(1),
  learn_rate = double(1),
  activation = character(1)
) %>%
  set_engine("brulee_two_layer",
             hidden_units_2 = integer(1),
             activation_2 = character(1)) %>%
  set_mode("regression") %>%
  translate()

## Single Layer Neural Network Model Specification (regression)
##
## Main Arguments:
##   hidden_units = integer(1)
##   penalty = double(1)
##   dropout = double(1)
##   epochs = integer(1)
##   activation = character(1)
##   learn_rate = double(1)
##
## Engine-Specific Arguments:
##   hidden_units_2 = integer(1)
##   activation_2 = character(1)
##
## Computational engine: brulee_two_layer
##
## Model fit template:
## brulee::brulee_mlp_two_layer(x = missing_arg(), y = missing_arg(),
##     hidden_units = integer(1), penalty = double(1), dropout = double(1),
##     epochs = integer(1), activation = character(1), learn_rate = double(1),
##     hidden_units_2 = integer(1), activation_2 = character(1))
Note that parsnip automatically sets the linear activation in the last layer.
Translation from parsnip to the original package (classification)
mlp(
  hidden_units = integer(1),
  penalty = double(1),
  dropout = double(1),
  epochs = integer(1),
  learn_rate = double(1),
  activation = character(1)
) %>%
  set_engine("brulee_two_layer",
             hidden_units_2 = integer(1),
             activation_2 = character(1)) %>%
  set_mode("classification") %>%
  translate()

## Single Layer Neural Network Model Specification (classification)
##
## Main Arguments:
##   hidden_units = integer(1)
##   penalty = double(1)
##   dropout = double(1)
##   epochs = integer(1)
##   activation = character(1)
##   learn_rate = double(1)
##
## Engine-Specific Arguments:
##   hidden_units_2 = integer(1)
##   activation_2 = character(1)
##
## Computational engine: brulee_two_layer
##
## Model fit template:
## brulee::brulee_mlp_two_layer(x = missing_arg(), y = missing_arg(),
##     hidden_units = integer(1), penalty = double(1), dropout = double(1),
##     epochs = integer(1), activation = character(1), learn_rate = double(1),
##     hidden_units_2 = integer(1), activation_2 = character(1))
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values (e.g., dummy or indicator variables) for this engine. When using the formula method via fit(), parsnip will convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
Case weights
The underlying model implementation does not allow for case weights.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Multilayer perceptron via h2o
Description
h2o::h2o.deeplearning()
fits a feed-forward neural network.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 6 tuning parameters:
- hidden_units: # Hidden Units (type: integer, default: 200L)
- penalty: Amount of Regularization (type: double, default: 0.0)
- dropout: Dropout Rate (type: double, default: 0.5)
- epochs: # Epochs (type: integer, default: 10)
- activation: Activation Function (type: character, default: see below)
- learn_rate: Learning Rate (type: double, default: 0.005)
The naming of activation functions in h2o::h2o.deeplearning() differs from parsnip's conventions. Currently, only “relu” and “tanh” are supported; these will be converted internally to “Rectifier” and “Tanh” before being passed to the fitting function.
penalty corresponds to the l2 penalty. h2o::h2o.deeplearning() also supports specifying the l1 penalty directly with the engine argument l1.
Other engine arguments of interest:
- stopping_rounds controls early stopping rounds based on the convergence of another engine parameter stopping_metric. By default, h2o::h2o.deeplearning() stops training if the simple moving average of length 5 of the stopping_metric does not improve for 5 scoring events. This is mostly useful when used alongside the engine parameter validation, the proportion of the train-validation split; parsnip will split and pass the two data frames to h2o, and h2o::h2o.deeplearning() will evaluate the metric and early stopping criteria on the validation set.
- h2o uses a 50% dropout ratio controlled by dropout for hidden layers by default. h2o::h2o.deeplearning() provides an engine argument input_dropout_ratio for dropout ratios in the input layer, which defaults to 0.
Translation from parsnip to the original package (regression)
agua::h2o_train_mlp() is a wrapper around h2o::h2o.deeplearning().
mlp(
  hidden_units = integer(1),
  penalty = double(1),
  dropout = double(1),
  epochs = integer(1),
  learn_rate = double(1),
  activation = character(1)
) %>%
  set_engine("h2o") %>%
  set_mode("regression") %>%
  translate()

## Single Layer Neural Network Model Specification (regression)
##
## Main Arguments:
##   hidden_units = integer(1)
##   penalty = double(1)
##   dropout = double(1)
##   epochs = integer(1)
##   activation = character(1)
##   learn_rate = double(1)
##
## Computational engine: h2o
##
## Model fit template:
## agua::h2o_train_mlp(x = missing_arg(), y = missing_arg(), weights = missing_arg(),
##     validation_frame = missing_arg(), hidden = integer(1), l2 = double(1),
##     hidden_dropout_ratios = double(1), epochs = integer(1), activation = character(1),
##     rate = double(1))
Translation from parsnip to the original package (classification)
mlp(
  hidden_units = integer(1),
  penalty = double(1),
  dropout = double(1),
  epochs = integer(1),
  learn_rate = double(1),
  activation = character(1)
) %>%
  set_engine("h2o") %>%
  set_mode("classification") %>%
  translate()

## Single Layer Neural Network Model Specification (classification)
##
## Main Arguments:
##   hidden_units = integer(1)
##   penalty = double(1)
##   dropout = double(1)
##   epochs = integer(1)
##   activation = character(1)
##   learn_rate = double(1)
##
## Computational engine: h2o
##
## Model fit template:
## agua::h2o_train_mlp(x = missing_arg(), y = missing_arg(), weights = missing_arg(),
##     validation_frame = missing_arg(), hidden = integer(1), l2 = double(1),
##     hidden_dropout_ratios = double(1), epochs = integer(1), activation = character(1),
##     rate = double(1))
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values (e.g., dummy or indicator variables) for this engine. When using the formula method via fit(), parsnip will convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one. By default, h2o::h2o.deeplearning() uses the argument standardize = TRUE to center and scale all numeric columns.
Initializing h2o
To use the h2o engine with tidymodels, please run h2o::h2o.init() first. By default, this connects R to the local h2o server. This needs to be done in every new R session. You can also connect to a remote h2o server with an IP address; for more details see h2o::h2o.init().
You can control the number of threads in the thread pool used by h2o with the nthreads argument. By default, it uses all CPUs on the host.
This is different from the usual parallel processing mechanism in tidymodels for tuning: while tidymodels parallelizes over resamples, h2o parallelizes over hyperparameter combinations for a given resample.
h2o will automatically shut down the local h2o instance started by R when R is terminated. To manually stop the h2o server, run h2o::h2o.shutdown().
Saving fitted model objects
Models fitted with this engine may require native serialization methods to be properly saved and/or passed between R sessions. To learn more about preparing fitted models for serialization, see the bundle package.
Multilayer perceptron via keras
Description
keras_mlp()
fits a single layer, feed-forward neural network.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 5 tuning parameters:
- hidden_units: # Hidden Units (type: integer, default: 5L)
- penalty: Amount of Regularization (type: double, default: 0.0)
- dropout: Dropout Rate (type: double, default: 0.0)
- epochs: # Epochs (type: integer, default: 20L)
- activation: Activation Function (type: character, default: ‘softmax’)
Translation from parsnip to the original package (regression)
mlp(
  hidden_units = integer(1),
  penalty = double(1),
  dropout = double(1),
  epochs = integer(1),
  activation = character(1)
) %>%
  set_engine("keras") %>%
  set_mode("regression") %>%
  translate()

## Single Layer Neural Network Model Specification (regression)
##
## Main Arguments:
##   hidden_units = integer(1)
##   penalty = double(1)
##   dropout = double(1)
##   epochs = integer(1)
##   activation = character(1)
##
## Computational engine: keras
##
## Model fit template:
## parsnip::keras_mlp(x = missing_arg(), y = missing_arg(), hidden_units = integer(1),
##     penalty = double(1), dropout = double(1), epochs = integer(1),
##     activation = character(1))
Translation from parsnip to the original package (classification)
mlp( hidden_units = integer(1), penalty = double(1), dropout = double(1), epochs = integer(1), activation = character(1) ) %>% set_engine("keras") %>% set_mode("classification") %>% translate()
## Single Layer Neural Network Model Specification (classification) ## ## Main Arguments: ## hidden_units = integer(1) ## penalty = double(1) ## dropout = double(1) ## epochs = integer(1) ## activation = character(1) ## ## Computational engine: keras ## ## Model fit template: ## parsnip::keras_mlp(x = missing_arg(), y = missing_arg(), hidden_units = integer(1), ## penalty = double(1), dropout = double(1), epochs = integer(1), ## activation = character(1))
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
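A minimal recipe sketch that covers both requirements (the data frame dat and the outcome y are hypothetical):

library(tidymodels)
rec <- recipe(y ~ ., data = dat) %>%
  # convert factor predictors to indicator columns
  step_dummy(all_nominal_predictors()) %>%
  # center and scale so each predictor has mean zero and unit variance
  step_normalize(all_numeric_predictors())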
Case weights
The underlying model implementation does not allow for case weights.
Saving fitted model objects
Models fitted with this engine may require native serialization methods to be properly saved and/or passed between R sessions. To learn more about preparing fitted models for serialization, see the bundle package.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for mlp()
with the "keras"
engine.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Multilayer perceptron via nnet
Description
nnet::nnet()
fits a single layer, feed-forward neural network.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 3 tuning parameters:
- hidden_units: # Hidden Units (type: integer, default: none)
- penalty: Amount of Regularization (type: double, default: 0.0)
- epochs: # Epochs (type: integer, default: 100L)
Note that, in nnet::nnet(), the maximum number of parameters is controlled by an argument with a fairly low default, MaxNWts = 1000. For some models, you may need to pass a larger value in via set_engine() so that the model does not fail, as shown below.
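For example, a minimal sketch of raising that limit (the value 5000 is illustrative):

mlp(hidden_units = 10, epochs = 100) %>%
  # raise the weight limit for larger networks
  set_engine("nnet", MaxNWts = 5000) %>%
  set_mode("classification")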
Translation from parsnip to the original package (regression)
mlp( hidden_units = integer(1), penalty = double(1), epochs = integer(1) ) %>% set_engine("nnet") %>% set_mode("regression") %>% translate()
## Single Layer Neural Network Model Specification (regression) ## ## Main Arguments: ## hidden_units = integer(1) ## penalty = double(1) ## epochs = integer(1) ## ## Computational engine: nnet ## ## Model fit template: ## nnet::nnet(formula = missing_arg(), data = missing_arg(), size = integer(1), ## decay = double(1), maxit = integer(1), trace = FALSE, linout = TRUE)
Note that parsnip automatically sets linear activation in the last layer.
Translation from parsnip to the original package (classification)
mlp( hidden_units = integer(1), penalty = double(1), epochs = integer(1) ) %>% set_engine("nnet") %>% set_mode("classification") %>% translate()
## Single Layer Neural Network Model Specification (classification) ## ## Main Arguments: ## hidden_units = integer(1) ## penalty = double(1) ## epochs = integer(1) ## ## Computational engine: nnet ## ## Model fit template: ## nnet::nnet(formula = missing_arg(), data = missing_arg(), size = integer(1), ## decay = double(1), maxit = integer(1), trace = FALSE, linout = FALSE)
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
Case weights
The underlying model implementation does not allow for case weights.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for mlp()
with the "nnet"
engine.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Multinomial regression via brulee
Description
brulee::brulee_multinomial_reg()
fits a model that uses linear predictors
to predict multiclass data using the multinomial distribution.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 2 tuning parameters:
- penalty: Amount of Regularization (type: double, default: 0.001)
- mixture: Proportion of Lasso Penalty (type: double, default: 0.0)
The use of the L1 penalty (a.k.a. the lasso penalty) does not force parameters to be strictly zero (as it does in packages such as glmnet). The zeroing out of parameters is a specific feature of the optimization method used in those packages.
Other engine arguments of interest (a sketch follows this list):
- optimizer(): The optimization method. See brulee::brulee_linear_reg().
- epochs(): An integer for the number of passes through the training set.
- learn_rate(): A number used to accelerate the gradient descent process.
- momentum(): A number for using historical gradient information during optimization (optimizer = "SGD" only).
- batch_size(): An integer for the number of training set points in each batch.
- stop_iter(): A non-negative integer for how many iterations with no improvement before stopping (default: 5L).
- class_weights(): Numeric class weights. See brulee::brulee_multinomial_reg().
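As a sketch, these are supplied through set_engine(); the argument values here are illustrative:

multinom_reg(penalty = 0.001) %>%
  # engine arguments are passed through to brulee::brulee_multinomial_reg()
  set_engine("brulee", optimizer = "SGD", momentum = 0.9, stop_iter = 5) %>%
  translate()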
Translation from parsnip to the original package (classification)
multinom_reg(penalty = double(1)) %>% set_engine("brulee") %>% translate()
## Multinomial Regression Model Specification (classification) ## ## Main Arguments: ## penalty = double(1) ## ## Computational engine: brulee ## ## Model fit template: ## brulee::brulee_multinomial_reg(x = missing_arg(), y = missing_arg(), ## penalty = double(1))
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
Case weights
The underlying model implementation does not allow for case weights.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Multinomial regression via glmnet
Description
glmnet::glmnet()
fits a model that uses linear predictors to predict
multiclass data using the multinomial distribution.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 2 tuning parameters:
- penalty: Amount of Regularization (type: double, default: see below)
- mixture: Proportion of Lasso Penalty (type: double, default: 1.0)
The penalty
parameter has no default and requires a single numeric
value. For more details about this, and the glmnet
model in general,
see glmnet-details. As for mixture:
- mixture = 1 specifies a pure lasso model,
- mixture = 0 specifies a ridge regression model, and
- 0 < mixture < 1 specifies an elastic net model, interpolating lasso and ridge.
Translation from parsnip to the original package
multinom_reg(penalty = double(1), mixture = double(1)) %>% set_engine("glmnet") %>% translate()
## Multinomial Regression Model Specification (classification) ## ## Main Arguments: ## penalty = 0 ## mixture = double(1) ## ## Computational engine: glmnet ## ## Model fit template: ## glmnet::glmnet(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## alpha = double(1), family = "multinomial")
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to
center and scale each so that each predictor has mean zero and a
variance of one. By default, glmnet::glmnet()
uses
the argument standardize = TRUE
to center and scale the data.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for multinom_reg()
with the "glmnet"
engine.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples
on tidymodels.org
.
The fit()
and fit_xy()
functions have arguments called
case_weights
that expect vectors of case weights.
Sparse Data
This model can utilize sparse data during model fitting and prediction.
Both sparse matrices such as dgCMatrix from the Matrix
package and
sparse tibbles from the sparsevctrs
package are supported. See
sparse_data for more information.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
References
Hastie, T, R Tibshirani, and M Wainwright. 2015. Statistical Learning with Sparsity. CRC Press.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Multinomial regression via h2o
Description
h2o::h2o.glm()
fits a model that uses linear predictors to predict
multiclass data for multinomial responses.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 2 tuning parameters:
- mixture: Proportion of Lasso Penalty (type: double, default: see below)
- penalty: Amount of Regularization (type: double, default: see below)
By default, when not given a fixed penalty
,
h2o::h2o.glm()
uses a heuristic approach to select
the optimal value of penalty
based on training data. Setting the
engine parameter lambda_search
to TRUE
enables an efficient version
of the grid search; see more details at
https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/lambda_search.html.
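A sketch of enabling this search (penalty is deliberately left unset so that h2o selects it):

multinom_reg() %>%
  # lambda_search is an h2o engine argument, not a main parsnip argument
  set_engine("h2o", lambda_search = TRUE) %>%
  translate()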
The choice of mixture
depends on the engine parameter solver
, which
is automatically chosen given training data and the specification of
other model parameters. When solver
is set to 'L-BFGS'
, mixture
defaults to 0 (ridge regression) and 0.5 otherwise.
Translation from parsnip to the original package
agua::h2o_train_glm()
for multinom_reg()
is
a wrapper around h2o::h2o.glm()
with
family = 'multinomial'
.
multinom_reg(penalty = double(1), mixture = double(1)) %>% set_engine("h2o") %>% translate()
## Multinomial Regression Model Specification (classification) ## ## Main Arguments: ## penalty = double(1) ## mixture = double(1) ## ## Computational engine: h2o ## ## Model fit template: ## agua::h2o_train_glm(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## validation_frame = missing_arg(), lambda = double(1), alpha = double(1), ## family = "multinomial")
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
By default, h2o::h2o.glm()
uses the argument
standardize = TRUE
to center and scale the data.
Initializing h2o
To use the h2o engine with tidymodels, please run h2o::h2o.init()
first. By default, this connects R to the local h2o server. This needs to be done in every new R session. You can also connect to a remote h2o server with an IP address; for more details see
h2o::h2o.init()
.
You can control the number of threads in the thread pool used by h2o
with the nthreads
argument. By default, it uses all CPUs on the host.
This is different from the usual parallel processing mechanism in
tidymodels for tuning: while tidymodels parallelizes over resamples, h2o
parallelizes over hyperparameter combinations for a given resample.
h2o will automatically shut down the local h2o instance started by R
when R is terminated. To manually stop the h2o server, run
h2o::h2o.shutdown()
.
Multinomial regression via keras
Description
keras_mlp()
fits a model that uses linear predictors to predict
multiclass data using the multinomial distribution.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has one tuning parameter:
- penalty: Amount of Regularization (type: double, default: 0.0)
For penalty, the amount of regularization includes only the L2 penalty (i.e., ridge or weight decay).
Translation from parsnip to the original package
multinom_reg(penalty = double(1)) %>% set_engine("keras") %>% translate()
## Multinomial Regression Model Specification (classification) ## ## Main Arguments: ## penalty = double(1) ## ## Computational engine: keras ## ## Model fit template: ## parsnip::keras_mlp(x = missing_arg(), y = missing_arg(), penalty = double(1), ## hidden_units = 1, act = "linear")
keras_mlp()
is a parsnip wrapper around keras code for
neural networks. This model fits a linear regression as a network with a
single hidden unit.
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
Case weights
The underlying model implementation does not allow for case weights.
Saving fitted model objects
Models fitted with this engine may require native serialization methods to be properly saved and/or passed between R sessions. To learn more about preparing fitted models for serialization, see the bundle package.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for multinom_reg()
with the "keras"
engine.
References
Hoerl, A., & Kennard, R. (2000). Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 42(1), 80-86.
Multinomial regression via nnet
Description
nnet::multinom()
fits a model that uses linear predictors to predict
multiclass data using the multinomial distribution.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 1 tuning parameter:
- penalty: Amount of Regularization (type: double, default: 0.0)
For penalty
, the amount of regularization includes only the L2 penalty
(i.e., ridge or weight decay).
Translation from parsnip to the original package
multinom_reg(penalty = double(1)) %>% set_engine("nnet") %>% translate()
## Multinomial Regression Model Specification (classification) ## ## Main Arguments: ## penalty = double(1) ## ## Computational engine: nnet ## ## Model fit template: ## nnet::multinom(formula = missing_arg(), data = missing_arg(), ## decay = double(1), trace = FALSE)
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for multinom_reg()
with the "nnet"
engine.
Case weights
The underlying model implementation does not allow for case weights.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
References
Luraschi, J, K Kuo, and E Ruiz. 2019. Mastering Spark with R. O’Reilly Media
Hastie, T, R Tibshirani, and M Wainwright. 2015. Statistical Learning with Sparsity. CRC Press.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Multinomial regression via spark
Description
sparklyr::ml_logistic_regression()
fits a model that uses linear
predictors to predict multiclass data using the multinomial distribution.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 2 tuning parameters:
- penalty: Amount of Regularization (type: double, default: 0.0)
- mixture: Proportion of Lasso Penalty (type: double, default: 0.0)
For penalty
, the amount of regularization includes both the L1 penalty
(i.e., lasso) and the L2 penalty (i.e., ridge or weight decay). As for
mixture:
- mixture = 1 specifies a pure lasso model,
- mixture = 0 specifies a ridge regression model, and
- 0 < mixture < 1 specifies an elastic net model, interpolating lasso and ridge.
Translation from parsnip to the original package
multinom_reg(penalty = double(1), mixture = double(1)) %>% set_engine("spark") %>% translate()
## Multinomial Regression Model Specification (classification) ## ## Main Arguments: ## penalty = double(1) ## mixture = double(1) ## ## Computational engine: spark ## ## Model fit template: ## sparklyr::ml_logistic_regression(x = missing_arg(), formula = missing_arg(), ## weights = missing_arg(), reg_param = double(1), elastic_net_param = double(1), ## family = "multinomial")
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
By default, ml_logistic_regression() uses the argument standardization = TRUE to center and scale the data.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples
on tidymodels.org
.
The fit()
and fit_xy()
functions have arguments called
case_weights
that expect vectors of case weights.
Note that, for spark engines, the case_weight
argument value should be
a character string to specify the column with the numeric case weights.
Other details
For models created using the "spark"
engine, there are several things
to consider.
- Only the formula interface via fit() is available; using fit_xy() will generate an error.
- The predictions will always be in a Spark table format. The names will be the same as documented but without the dots.
- There is no equivalent to factor columns in Spark tables so class predictions are returned as character columns.
- To retain the model object for a new R session (via save()), the model$fit element of the parsnip object should be serialized via ml_save(object$fit) and separately saved to disk. In a new session, the object can be reloaded and reattached to the parsnip object, as sketched below.
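A sketch of that save/reload cycle (the object names fit_spark and sc and the path are hypothetical; sc is an active Spark connection):

library(sparklyr)
# save the underlying Spark model from the parsnip fit
ml_save(fit_spark$fit, path = "multinom_model", overwrite = TRUE)
# in a new session, reload it and reattach it to the parsnip object
fit_spark$fit <- ml_load(sc, path = "multinom_model")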
References
Luraschi, J, K Kuo, and E Ruiz. 2019. Mastering Spark with R. O’Reilly Media
Hastie, T, R Tibshirani, and M Wainwright. 2015. Statistical Learning with Sparsity. CRC Press.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Naive Bayes models via h2o
Description
h2o::h2o.naiveBayes()
fits a model that uses Bayes' theorem to compute
the probability of each class, given the predictor values.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 1 tuning parameter:
- Laplace: Laplace Correction (type: double, default: 0.0)
h2o::h2o.naiveBayes()
provides several engine
arguments to deal with imbalances and rare classes:
- balance_classes: A logical value controlling over/under-sampling (for imbalanced data). Defaults to FALSE.
- class_sampling_factors: The over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes to be TRUE.
- min_sdev: The minimum standard deviation to use for observations without enough data; must be greater than 1e-10.
- min_prob: The minimum probability to use for observations with not enough data.
Translation from parsnip to the original package
The agua extension package is required to fit this model.
agua::h2o_train_nb()
is a wrapper around
h2o::h2o.naiveBayes()
.
naive_Bayes(Laplace = numeric(0)) %>% set_engine("h2o") %>% translate()
## Naive Bayes Model Specification (classification) ## ## Main Arguments: ## Laplace = numeric(0) ## ## Computational engine: h2o ## ## Model fit template: ## agua::h2o_train_nb(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## validation_frame = missing_arg(), laplace = numeric(0))
Initializing h2o
To use the h2o engine with tidymodels, please run h2o::h2o.init()
first. By default, this connects R to the local h2o server. This needs to be done in every new R session. You can also connect to a remote h2o server with an IP address; for more details see
h2o::h2o.init()
.
You can control the number of threads in the thread pool used by h2o
with the nthreads
argument. By default, it uses all CPUs on the host.
This is different from the usual parallel processing mechanism in
tidymodels for tuning: while tidymodels parallelizes over resamples, h2o
parallelizes over hyperparameter combinations for a given resample.
h2o will automatically shut down the local h2o instance started by R
when R is terminated. To manually stop the h2o server, run
h2o::h2o.shutdown()
.
Saving fitted model objects
Models fitted with this engine may require native serialization methods to be properly saved and/or passed between R sessions. To learn more about preparing fitted models for serialization, see the bundle package.
Naive Bayes models via klaR
Description
klaR::NaiveBayes()
fits a model that uses Bayes' theorem to compute the
probability of each class, given the predictor values.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 2 tuning parameters:
- smoothness: Kernel Smoothness (type: double, default: 1.0)
- Laplace: Laplace Correction (type: double, default: 0.0)
Note that the engine argument usekernel
is set to TRUE
by default
when using the klaR
engine.
Translation from parsnip to the original package
The discrim extension package is required to fit this model.
library(discrim) naive_Bayes(smoothness = numeric(0), Laplace = numeric(0)) %>% set_engine("klaR") %>% translate()
## Naive Bayes Model Specification (classification) ## ## Main Arguments: ## smoothness = numeric(0) ## Laplace = numeric(0) ## ## Computational engine: klaR ## ## Model fit template: ## discrim::klar_bayes_wrapper(x = missing_arg(), y = missing_arg(), ## adjust = numeric(0), fL = numeric(0), usekernel = TRUE)
Preprocessing requirements
The columns for qualitative predictors should always be represented as factors (as opposed to dummy/indicator variables). When the predictors are factors, the underlying code treats them as multinomial data and appropriately computes their conditional distributions.
Variance calculations are used in these computations so zero-variance predictors (i.e., with a single unique value) should be eliminated before fitting the model.
Case weights
The underlying model implementation does not allow for case weights.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Naive Bayes models via naivebayes
Description
naivebayes::naive_bayes()
fits a model that uses Bayes' theorem to compute
the probability of each class, given the predictor values.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 2 tuning parameters:
- smoothness: Kernel Smoothness (type: double, default: 1.0)
- Laplace: Laplace Correction (type: double, default: 0.0)
Note that the engine argument usekernel
is set to TRUE
by default
when using the naivebayes
engine.
Translation from parsnip to the original package
The discrim extension package is required to fit this model.
library(discrim) naive_Bayes(smoothness = numeric(0), Laplace = numeric(0)) %>% set_engine("naivebayes") %>% translate()
## Naive Bayes Model Specification (classification) ## ## Main Arguments: ## smoothness = numeric(0) ## Laplace = numeric(0) ## ## Computational engine: naivebayes ## ## Model fit template: ## naivebayes::naive_bayes(x = missing_arg(), y = missing_arg(), ## adjust = numeric(0), laplace = numeric(0), usekernel = TRUE)
Preprocessing requirements
The columns for qualitative predictors should always be represented as factors (as opposed to dummy/indicator variables). When the predictors are factors, the underlying code treats them as multinomial data and appropriately computes their conditional distributions.
For count data, integers can be estimated using a Poisson distribution
if the argument usepoisson = TRUE
is passed as an engine argument.
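For example, a sketch of passing that engine argument:

library(discrim)
naive_Bayes(smoothness = 1) %>%
  # usepoisson is passed through to naivebayes::naive_bayes()
  set_engine("naivebayes", usepoisson = TRUE) %>%
  translate()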
Variance calculations are used in these computations so zero-variance predictors (i.e., with a single unique value) should be eliminated before fitting the model.
Case weights
The underlying model implementation does not allow for case weights.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
K-nearest neighbors via kknn
Description
kknn::train.kknn()
fits a model that uses the K
most similar data points
from the training set to predict new samples.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 3 tuning parameters:
- neighbors: # Nearest Neighbors (type: integer, default: 5L)
- weight_func: Distance Weighting Function (type: character, default: ‘optimal’)
- dist_power: Minkowski Distance Order (type: double, default: 2.0)
Parsnip changes the default range for neighbors to c(1, 15) and dist_power to c(1/10, 2).
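A sketch of inspecting those adjusted ranges for a tunable specification (requires the dials package, loaded here with tidymodels):

library(tidymodels)
knn_spec <- nearest_neighbor(neighbors = tune(), dist_power = tune()) %>%
  set_engine("kknn") %>%
  set_mode("regression")
# show the parameter set, including the adjusted default ranges
extract_parameter_set_dials(knn_spec)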
Translation from parsnip to the original package (regression)
nearest_neighbor( neighbors = integer(1), weight_func = character(1), dist_power = double(1) ) %>% set_engine("kknn") %>% set_mode("regression") %>% translate()
## K-Nearest Neighbor Model Specification (regression) ## ## Main Arguments: ## neighbors = integer(1) ## weight_func = character(1) ## dist_power = double(1) ## ## Computational engine: kknn ## ## Model fit template: ## kknn::train.kknn(formula = missing_arg(), data = missing_arg(), ## ks = min_rows(0L, data, 5), kernel = character(1), distance = double(1))
min_rows() will adjust the number of neighbors if the chosen value is not consistent with the actual data dimensions.
Translation from parsnip to the original package (classification)
nearest_neighbor( neighbors = integer(1), weight_func = character(1), dist_power = double(1) ) %>% set_engine("kknn") %>% set_mode("classification") %>% translate()
## K-Nearest Neighbor Model Specification (classification) ## ## Main Arguments: ## neighbors = integer(1) ## weight_func = character(1) ## dist_power = double(1) ## ## Computational engine: kknn ## ## Model fit template: ## kknn::train.kknn(formula = missing_arg(), data = missing_arg(), ## ks = min_rows(0L, data, 5), kernel = character(1), distance = double(1))
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for nearest_neighbor()
with the "kknn"
engine.
Case weights
The underlying model implementation does not allow for case weights.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
References
Hechenbichler K. and Schliep K.P. (2004) Weighted k-Nearest-Neighbor Techniques and Ordinal Classification, Discussion Paper 399, SFB 386, Ludwig-Maximilians University Munich
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Partial least squares via mixOmics
Description
The mixOmics package can fit several different types of PLS models.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 2 tuning parameters:
-
predictor_prop
: Proportion of Predictors (type: double, default: see below) -
num_comp
: # Components (type: integer, default: 2L)
Translation from parsnip to the underlying model call (regression)
The plsmod extension package is required to fit this model.
library(plsmod) pls(num_comp = integer(1), predictor_prop = double(1)) %>% set_engine("mixOmics") %>% set_mode("regression") %>% translate()
## PLS Model Specification (regression) ## ## Main Arguments: ## predictor_prop = double(1) ## num_comp = integer(1) ## ## Computational engine: mixOmics ## ## Model fit template: ## plsmod::pls_fit(x = missing_arg(), y = missing_arg(), predictor_prop = double(1), ## ncomp = integer(1))
plsmod::pls_fit()
is a function that:
- Determines the number of predictors in the data.
- Adjusts num_comp if the value is larger than the number of factors.
- Determines whether sparsity is required based on the value of predictor_prop.
- Sets the keepX argument of mixOmics::spls() for sparse models.
Translation from parsnip to the underlying model call (classification)
The plsmod extension package is required to fit this model.
library(plsmod) pls(num_comp = integer(1), predictor_prop = double(1)) %>% set_engine("mixOmics") %>% set_mode("classification") %>% translate()
## PLS Model Specification (classification) ## ## Main Arguments: ## predictor_prop = double(1) ## num_comp = integer(1) ## ## Computational engine: mixOmics ## ## Model fit template: ## plsmod::pls_fit(x = missing_arg(), y = missing_arg(), predictor_prop = double(1), ## ncomp = integer(1))
In this case, plsmod::pls_fit()
has the same role
as above but eventually targets mixOmics::plsda()
or
mixOmics::splsda()
.
Installing mixOmics
This package is available via the Bioconductor repository and is not accessible via CRAN. You can install using:
if (!require("remotes", quietly = TRUE)) { install.packages("remotes") } remotes::install_bioc("mixOmics")
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Variance calculations are used in these computations so zero-variance predictors (i.e., with a single unique value) should be eliminated before fitting the model.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
Case weights
The underlying model implementation does not allow for case weights.
References
Rohart F and Gautier B and Singh A and Le Cao K-A (2017). “mixOmics: An R package for ’omics feature selection and multiple data integration.” PLoS computational biology, 13(11), e1005752.
Poisson regression via generalized estimating equations (GEE)
Description
gee::gee()
uses generalized least squares to fit different types of models
with errors that are not independent.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This model has no formal tuning parameters. It may be beneficial to determine the appropriate correlation structure to use, but this typically does not affect the predicted value of the model. It does have an effect on the inferential results and parameter covariance values.
Translation from parsnip to the original package
The multilevelmod extension package is required to fit this model.
library(multilevelmod) poisson_reg(engine = "gee") %>% set_engine("gee") %>% translate()
## Poisson Regression Model Specification (regression) ## ## Computational engine: gee ## ## Model fit template: ## multilevelmod::gee_fit(formula = missing_arg(), data = missing_arg(), ## family = stats::poisson)
multilevelmod::gee_fit() is a wrapper around gee::gee().
Preprocessing requirements
There are no specific preprocessing needs. However, it is helpful to keep the clustering/subject identifier column as factor or character (instead of making them into dummy variables). See the examples in the next section.
Case weights
The underlying model implementation does not allow for case weights.
Other details
Both gee::gee() and geepack::geeglm() specify the id/cluster variable using an argument id that requires a vector. parsnip doesn’t work that way, so we enable this model to be fit using an artificial function id_var() to be used in the formula. So, in the original package, the call would look like:
gee(breaks ~ tension, id = wool, data = warpbreaks, corstr = "exchangeable")
With parsnip, we suggest using the formula method when fitting:
library(tidymodels) poisson_reg() %>% set_engine("gee", corstr = "exchangeable") %>% fit(y ~ time + x + id_var(subject), data = longitudinal_counts)
When using tidymodels infrastructure, it may be better to use a
workflow. In this case, you can add the appropriate columns using
add_variables()
then supply the GEE formula when adding the model:
library(tidymodels) gee_spec <- poisson_reg() %>% set_engine("gee", corstr = "exchangeable") gee_wflow <- workflow() %>% # The data are included as-is using: add_variables(outcomes = y, predictors = c(time, x, subject)) %>% add_model(gee_spec, formula = y ~ time + x + id_var(subject)) fit(gee_wflow, data = longitudinal_counts)
The gee::gee()
function always prints out warnings and output even
when silent = TRUE
. The parsnip "gee"
engine, by contrast, silences
all console output coming from gee::gee()
, even if silent = FALSE
.
Also, because of issues with the gee()
function, a supplementary call
to glm()
is needed to get the rank and QR decomposition objects so
that predict()
can be used.
References
Liang, K.Y. and Zeger, S.L. (1986) Longitudinal data analysis using generalized linear models. Biometrika, 73 13–22.
Zeger, S.L. and Liang, K.Y. (1986) Longitudinal data analysis for discrete and continuous outcomes. Biometrics, 42 121–130.
Poisson regression via glm
Description
stats::glm()
uses maximum likelihood to fit a model for count data.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This engine has no tuning parameters.
Translation from parsnip to the underlying model call (regression)
The poissonreg extension package is required to fit this model.
library(poissonreg) poisson_reg() %>% set_engine("glm") %>% translate()
## Poisson Regression Model Specification (regression) ## ## Computational engine: glm ## ## Model fit template: ## stats::glm(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), ## family = stats::poisson)
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples
on tidymodels.org
.
The fit()
and fit_xy()
functions have arguments called
case_weights
that expect vectors of case weights.
However, the documentation in stats::glm()
assumes that a specific type of case weights is being used: “Non-NULL weights
can be used to indicate that different observations have different
dispersions (with the values in weights being inversely proportional to
the dispersions); or equivalently, when the elements of weights are
positive integers w_i
, that each response y_i
is the mean of w_i
unit-weight observations. For a binomial GLM prior weights are used to
give the number of trials when the response is the proportion of
successes: they would rarely be used for a Poisson GLM.”
If frequency weights are being used in your application, the
glm_grouped()
model (and corresponding engine) may be
more appropriate.
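As a sketch of the case-weight mechanism, weights are created with a hardhat constructor such as hardhat::frequency_weights() and passed to fit() (the data frame counts_df and its column n_rep are hypothetical):

library(tidymodels)
library(poissonreg)
# a vector of frequency (replication) counts, one per row of counts_df
wts <- hardhat::frequency_weights(counts_df$n_rep)
poisson_reg() %>%
  set_engine("glm") %>%
  fit(y ~ x1 + x2, data = counts_df, case_weights = wts)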
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
Poisson regression via mixed models
Description
The "glmer"
engine estimates fixed and random effect regression parameters
using maximum likelihood (or restricted maximum likelihood) estimation.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This model has no tuning parameters.
Translation from parsnip to the original package
The multilevelmod extension package is required to fit this model.
library(multilevelmod) poisson_reg(engine = "glmer") %>% set_engine("glmer") %>% translate()
## Poisson Regression Model Specification (regression) ## ## Computational engine: glmer ## ## Model fit template: ## lme4::glmer(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), ## family = stats::poisson)
Predicting new samples
This model can use subject-specific coefficient estimates to make
predictions (i.e. partial pooling). For example, this equation shows the
linear predictor (\eta
) for a random intercept:
\eta_{i} = (\beta_0 + b_{0i}) + \beta_1x_{i1}
where i
denotes the i
th independent experimental unit
(e.g. subject). When the model has seen subject i
, it can use that
subject’s data to adjust the population intercept to be more specific
to that subject’s results.
What happens when data are being predicted for a subject that was not used in the model fit? In that case, this package uses only the population parameter estimates for prediction:
\hat{\eta}_{i'} = \hat{\beta}_0 + \hat{\beta}_1x_{i'1}
Depending on what covariates are in the model, this might have the effect of making the same prediction for all new samples. The population parameters are the “best estimate” for a subject that was not included in the model fit.
The tidymodels framework deliberately constrains predictions for new data to not use the training set or other data (to prevent information leakage).
Preprocessing requirements
There are no specific preprocessing needs. However, it is helpful to keep the clustering/subject identifier column as factor or character (instead of making them into dummy variables). See the examples in the next section.
Other details
The model can accept case weights.
With parsnip, we suggest using the formula method when fitting:
library(tidymodels) poisson_reg() %>% set_engine("glmer") %>% fit(y ~ time + x + (1 | subject), data = longitudinal_counts)
When using tidymodels infrastructure, it may be better to use a
workflow. In this case, you can add the appropriate columns using
add_variables()
then supply the typical formula when adding the model:
library(tidymodels) glmer_spec <- poisson_reg() %>% set_engine("glmer") glmer_wflow <- workflow() %>% # The data are included as-is using: add_variables(outcomes = y, predictors = c(time, x, subject)) %>% add_model(glmer_spec, formula = y ~ time + x + (1 | subject)) fit(glmer_wflow, data = longitudinal_counts)
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples
on tidymodels.org
.
The fit()
and fit_xy()
functions have arguments called
case_weights
that expect vectors of case weights.
References
J Pinheiro, and D Bates. 2000. Mixed-effects models in S and S-PLUS. Springer, New York, NY
West, B, K Welch, and A Galecki. 2014. Linear Mixed Models: A Practical Guide Using Statistical Software. CRC Press.
Thorson, J, Minto, C. 2015, Mixed effects: a unifying framework for statistical modelling in fisheries biology. ICES Journal of Marine Science, Volume 72, Issue 5, Pages 1245–1256.
Harrison, XA, Donaldson, L, Correa-Cano, ME, Evans, J, Fisher, DN, Goodwin, CED, Robinson, BS, Hodgson, DJ, Inger, R. 2018. A brief introduction to mixed effects modelling and multi-model inference in ecology. PeerJ 6:e4794.
DeBruine LM, Barr DJ. Understanding Mixed-Effects Models Through Data Simulation. 2021. Advances in Methods and Practices in Psychological Science.
Poisson regression via glmnet
Description
glmnet::glmnet()
uses penalized maximum likelihood to fit a model for
count data.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This model has 2 tuning parameters:
- penalty: Amount of Regularization (type: double, default: see below)
- mixture: Proportion of Lasso Penalty (type: double, default: 1.0)
The penalty
parameter has no default and requires a single numeric
value. For more details about this, and the glmnet
model in general,
see glmnet-details. As for mixture:
- mixture = 1 specifies a pure lasso model,
- mixture = 0 specifies a ridge regression model, and
- 0 < mixture < 1 specifies an elastic net model, interpolating lasso and ridge.
Translation from parsnip to the original package
The poissonreg extension package is required to fit this model.
library(poissonreg) poisson_reg(penalty = double(1), mixture = double(1)) %>% set_engine("glmnet") %>% translate()
## Poisson Regression Model Specification (regression) ## ## Main Arguments: ## penalty = 0 ## mixture = double(1) ## ## Computational engine: glmnet ## ## Model fit template: ## glmnet::glmnet(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## alpha = double(1), family = "poisson")
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to
center and scale each so that each predictor has mean zero and a
variance of one. By default, glmnet::glmnet()
uses the argument
standardize = TRUE
to center and scale the data.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples
on tidymodels.org
.
The fit()
and fit_xy()
functions have arguments called
case_weights
that expect vectors of case weights.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
Poisson regression via h2o
Description
h2o::h2o.glm()
uses penalized maximum likelihood to fit a model for
count data.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This model has 2 tuning parameters:
- mixture: Proportion of Lasso Penalty (type: double, default: see below)
- penalty: Amount of Regularization (type: double, default: see below)
By default, when not given a fixed penalty
,
h2o::h2o.glm()
uses a heuristic approach to select
the optimal value of penalty
based on training data. Setting the
engine parameter lambda_search
to TRUE
enables an efficient version
of the grid search; see more details at
https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/lambda_search.html.
The choice of mixture
depends on the engine parameter solver
, which
is automatically chosen given training data and the specification of
other model parameters. When solver
is set to 'L-BFGS'
, mixture
defaults to 0 (ridge regression) and 0.5 otherwise.
Translation from parsnip to the original package
agua::h2o_train_glm()
for poisson_reg()
is
a wrapper around h2o::h2o.glm()
with
family = 'poisson'
.
The agua extension package is required to fit this model.
library(poissonreg) poisson_reg(penalty = double(1), mixture = double(1)) %>% set_engine("h2o") %>% translate()
## Poisson Regression Model Specification (regression) ## ## Main Arguments: ## penalty = double(1) ## mixture = double(1) ## ## Computational engine: h2o ## ## Model fit template: ## agua::h2o_train_glm(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## validation_frame = missing_arg(), lambda = double(1), alpha = double(1), ## family = "poisson")
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
By default, h2o::h2o.glm()
uses the argument standardize = TRUE
to
center and scale all numerical columns.
Initializing h2o
To use the h2o engine with tidymodels, please run h2o::h2o.init()
first. By default, this connects R to the local h2o server. This needs to be done in every new R session. You can also connect to a remote h2o server with an IP address; for more details see
h2o::h2o.init()
.
You can control the number of threads in the thread pool used by h2o
with the nthreads
argument. By default, it uses all CPUs on the host.
This is different from the usual parallel processing mechanism in
tidymodels for tuning: while tidymodels parallelizes over resamples, h2o
parallelizes over hyperparameter combinations for a given resample.
h2o will automatically shut down the local h2o instance started by R
when R is terminated. To manually stop the h2o server, run
h2o::h2o.shutdown()
.
Saving fitted model objects
Models fitted with this engine may require native serialization methods to be properly saved and/or passed between R sessions. To learn more about preparing fitted models for serialization, see the bundle package.
Poisson regression via pscl
Description
pscl::hurdle()
uses maximum likelihood estimation to fit a model for
count data that has separate model terms for predicting the counts and for
predicting the probability of a zero count.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This engine has no tuning parameters.
Translation from parsnip to the underlying model call (regression)
The poissonreg extension package is required to fit this model.
library(poissonreg) poisson_reg() %>% set_engine("hurdle") %>% translate()
## Poisson Regression Model Specification (regression) ## ## Computational engine: hurdle ## ## Model fit template: ## pscl::hurdle(formula = missing_arg(), data = missing_arg(), weights = missing_arg())
Preprocessing and special formulas for zero-inflated Poisson models
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Specifying the statistical model details
For this particular model, a special formula is used to specify which
columns affect the counts and which affect the model for the probability
of zero counts. These sets of terms are separated by a bar. For example,
y ~ x | z
. This type of formula is not used by the base R
infrastructure (e.g. model.matrix()
)
When fitting a parsnip model with this engine directly, the formula method is required and the formula is just passed through. For example:
library(tidymodels) tidymodels_prefer() data("bioChemists", package = "pscl") poisson_reg() %>% set_engine("hurdle") %>% fit(art ~ fem + mar | ment, data = bioChemists)
## parsnip model object ## ## ## Call: ## pscl::hurdle(formula = art ~ fem + mar | ment, data = data) ## ## Count model coefficients (truncated poisson with log link): ## (Intercept) femWomen marMarried ## 0.847598 -0.237351 0.008846 ## ## Zero hurdle model coefficients (binomial with logit link): ## (Intercept) ment ## 0.24871 0.08092
However, when using a workflow, the best approach is to avoid using
workflows::add_formula()
and use
workflows::add_variables()
in
conjunction with a model formula:
data("bioChemists", package = "pscl") spec <- poisson_reg() %>% set_engine("hurdle") workflow() %>% add_variables(outcomes = c(art), predictors = c(fem, mar, ment)) %>% add_model(spec, formula = art ~ fem + mar | ment) %>% fit(data = bioChemists) %>% extract_fit_engine()
## ## Call: ## pscl::hurdle(formula = art ~ fem + mar | ment, data = data) ## ## Count model coefficients (truncated poisson with log link): ## (Intercept) femWomen marMarried ## 0.847598 -0.237351 0.008846 ## ## Zero hurdle model coefficients (binomial with logit link): ## (Intercept) ment ## 0.24871 0.08092
The reason for this is that
workflows::add_formula()
will try to
create the model matrix and either fail or create dummy variables
prematurely.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples
on tidymodels.org
.
The fit()
and fit_xy()
functions have arguments called
case_weights
that expect vectors of case weights.
Poisson regression via stan
Description
rstanarm::stan_glm()
uses Bayesian estimation to fit a model for
count data.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This engine has no tuning parameters.
Important engine-specific options
Some relevant arguments that can be passed to set_engine():
- chains: A positive integer specifying the number of Markov chains. The default is 4.
- iter: A positive integer specifying the number of iterations for each chain (including warmup). The default is 2000.
- seed: The seed for random number generation.
- cores: Number of cores to use when executing the chains in parallel.
- prior: The prior distribution for the (non-hierarchical) regression coefficients. The "stan" engine does not fit any hierarchical terms.
- prior_intercept: The prior distribution for the intercept (after centering all predictors).
See rstan::sampling()
and
rstanarm::priors()
for more information on these
and other options.
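A sketch of passing several of these options (the values shown are illustrative):

library(poissonreg)
poisson_reg() %>%
  # engine arguments are forwarded to rstanarm::stan_glm()
  set_engine("stan", chains = 4, iter = 2000, seed = 1234,
             prior = rstanarm::normal(0, 1)) %>%
  translate()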
Translation from parsnip to the original package
The poissonreg extension package is required to fit this model.
library(poissonreg) poisson_reg() %>% set_engine("stan") %>% translate()
## Poisson Regression Model Specification (regression) ## ## Computational engine: stan ## ## Model fit template: ## rstanarm::stan_glm(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg(), family = stats::poisson)
Note that the refresh
default prevents logging of the estimation
process. Change this value in set_engine()
to show the MCMC logs.
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Other details
For prediction, the "stan"
engine can compute posterior intervals
analogous to confidence and prediction intervals. In these instances,
the units are the original outcome. When std_error = TRUE
, the
standard deviation of the posterior distribution (or posterior
predictive distribution as appropriate) is returned.
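For example, a sketch of requesting those intervals from a fitted model (fit_stan and new_counts are hypothetical objects):

# posterior interval analogous to a confidence interval, with its standard error
predict(fit_stan, new_data = new_counts, type = "conf_int", std_error = TRUE)
# posterior predictive interval analogous to a prediction interval
predict(fit_stan, new_data = new_counts, type = "pred_int")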
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples
on tidymodels.org
.
The fit()
and fit_xy()
functions have arguments called
case_weights
that expect vectors of case weights.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for poisson_reg()
with the "stan"
engine.
References
McElreath, R. 2020. Statistical Rethinking. CRC Press.
Poisson regression via hierarchical Bayesian methods
Description
The "stan_glmer"
engine estimates hierarchical regression parameters using
Bayesian estimation.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This model has no tuning parameters.
Important engine-specific options
Some relevant arguments that can be passed to set_engine():
- chains: A positive integer specifying the number of Markov chains. The default is 4.
- iter: A positive integer specifying the number of iterations for each chain (including warmup). The default is 2000.
- seed: The seed for random number generation.
- cores: Number of cores to use when executing the chains in parallel.
- prior: The prior distribution for the (non-hierarchical) regression coefficients.
- prior_intercept: The prior distribution for the intercept (after centering all predictors).
See ?rstanarm::stan_glmer
and ?rstan::sampling
for more information.
Translation from parsnip to the original package
The multilevelmod extension package is required to fit this model.
library(multilevelmod) poisson_reg(engine = "stan_glmer") %>% set_engine("stan_glmer") %>% translate()
## Poisson Regression Model Specification (regression) ## ## Computational engine: stan_glmer ## ## Model fit template: ## rstanarm::stan_glmer(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg(), family = stats::poisson, refresh = 0)
Predicting new samples
This model can use subject-specific coefficient estimates to make
predictions (i.e. partial pooling). For example, this equation shows the
linear predictor (\eta
) for a random intercept:
\eta_{i} = (\beta_0 + b_{0i}) + \beta_1x_{i1}
where i
denotes the i
th independent experimental unit
(e.g. subject). When the model has seen subject i
, it can use that
subject’s data to adjust the population intercept to be more specific
to that subject’s results.
What happens when data are being predicted for a subject that was not used in the model fit? In that case, this package uses only the population parameter estimates for prediction:
\hat{\eta}_{i'} = \hat{\beta}_0 + \hat{\beta}_1x_{i'1}
Depending on what covariates are in the model, this might have the effect of making the same prediction for all new samples. The population parameters are the “best estimate” for a subject that was not included in the model fit.
The tidymodels framework deliberately constrains predictions for new data to not use the training set or other data (to prevent information leakage).
Preprocessing requirements
There are no specific preprocessing needs. However, it is helpful to keep the clustering/subject identifier column as factor or character (instead of making them into dummy variables). See the examples in the next section.
Other details
The model can accept case weights.
With parsnip, we suggest using the formula method when fitting:
library(tidymodels) poisson_reg() %>% set_engine("stan_glmer") %>% fit(y ~ time + x + (1 | subject), data = longitudinal_counts)
When using tidymodels infrastructure, it may be better to use a
workflow. In this case, you can add the appropriate columns using
add_variables()
then supply the typical formula when adding the model:
library(tidymodels) glmer_spec <- poisson_reg() %>% set_engine("stan_glmer") glmer_wflow <- workflow() %>% # The data are included as-is using: add_variables(outcomes = y, predictors = c(time, x, subject)) %>% add_model(glmer_spec, formula = y ~ time + x + (1 | subject)) fit(glmer_wflow, data = longitudinal_counts)
For prediction, the "stan_glmer"
engine can compute posterior
intervals analogous to confidence and prediction intervals. In these
instances, the units are the original outcome. When std_error = TRUE
,
the standard deviation of the posterior distribution (or posterior
predictive distribution as appropriate) is returned.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples
on tidymodels.org
.
The fit()
and fit_xy()
functions have arguments called
case_weights
that expect vectors of case weights.
References
McElreath, R. 2020. Statistical Rethinking. CRC Press.
Sorensen, T, Vasishth, S. 2016. Bayesian linear mixed models using Stan: A tutorial for psychologists, linguists, and cognitive scientists, arXiv:1506.06201.
Poisson regression via pscl
Description
pscl::zeroinfl()
uses maximum likelihood estimation to fit a model for
count data that has separate model terms for predicting the counts and for
predicting the probability of a zero count.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This engine has no tuning parameters.
Translation from parsnip to the underlying model call (regression)
The poissonreg extension package is required to fit this model.
library(poissonreg)

poisson_reg() %>%
  set_engine("zeroinfl") %>%
  translate()
## Poisson Regression Model Specification (regression) ## ## Computational engine: zeroinfl ## ## Model fit template: ## pscl::zeroinfl(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg())
Preprocessing and special formulas for zero-inflated Poisson models
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Specifying the statistical model details
For this particular model, a special formula is used to specify which
columns affect the counts and which affect the model for the probability
of zero counts. These sets of terms are separated by a bar. For example,
y ~ x | z. This type of formula is not used by the base R infrastructure (e.g. model.matrix()).
When fitting a parsnip model with this engine directly, the formula method is required and the formula is just passed through. For example:
library(tidymodels)
tidymodels_prefer()

data("bioChemists", package = "pscl")

poisson_reg() %>%
  set_engine("zeroinfl") %>%
  fit(art ~ fem + mar | ment, data = bioChemists)
## parsnip model object ## ## ## Call: ## pscl::zeroinfl(formula = art ~ fem + mar | ment, data = data) ## ## Count model coefficients (poisson with log link): ## (Intercept) femWomen marMarried ## 0.82840 -0.21365 0.02576 ## ## Zero-inflation model coefficients (binomial with logit link): ## (Intercept) ment ## -0.363 -0.166
However, when using a workflow, the best approach is to avoid using
workflows::add_formula()
and use
workflows::add_variables()
in
conjunction with a model formula:
data("bioChemists", package = "pscl") spec <- poisson_reg() %>% set_engine("zeroinfl") workflow() %>% add_variables(outcomes = c(art), predictors = c(fem, mar, ment)) %>% add_model(spec, formula = art ~ fem + mar | ment) %>% fit(data = bioChemists) %>% extract_fit_engine()
## ## Call: ## pscl::zeroinfl(formula = art ~ fem + mar | ment, data = data) ## ## Count model coefficients (poisson with log link): ## (Intercept) femWomen marMarried ## 0.82840 -0.21365 0.02576 ## ## Zero-inflation model coefficients (binomial with logit link): ## (Intercept) ment ## -0.363 -0.166
The reason for this is that
workflows::add_formula()
will try to
create the model matrix and either fail or create dummy variables
prematurely.
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.

The fit() and fit_xy() functions have arguments called case_weights that expect vectors of case weights.
Proportional hazards regression
Description
glmnet::glmnet()
fits a regularized Cox proportional hazards model.
Details
For this engine, there is a single mode: censored regression
Tuning Parameters
This model has 2 tuning parameters:
- penalty: Amount of Regularization (type: double, default: see below)
- mixture: Proportion of Lasso Penalty (type: double, default: 1.0)
The penalty parameter has no default and requires a single numeric value. For more details about this, and the glmnet model in general, see glmnet-details. As for mixture:

- mixture = 1 specifies a pure lasso model,
- mixture = 0 specifies a ridge regression model, and
- 0 < mixture < 1 specifies an elastic net model, interpolating lasso and ridge.
Translation from parsnip to the original package
The censored extension package is required to fit this model.
library(censored)

proportional_hazards(penalty = double(1), mixture = double(1)) %>%
  set_engine("glmnet") %>%
  translate()
## Proportional Hazards Model Specification (censored regression) ## ## Main Arguments: ## penalty = 0 ## mixture = double(1) ## ## Computational engine: glmnet ## ## Model fit template: ## censored::coxnet_train(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg(), alpha = double(1))
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to
center and scale each so that each predictor has mean zero and a
variance of one. By default, glmnet::glmnet()
uses
the argument standardize = TRUE
to center and scale the data.
Other details
The model does not fit an intercept.
The model formula (which is required) can include special terms, such
as survival::strata()
. This allows the baseline
hazard to differ between groups contained in the function. (To learn
more about using special terms in formulas with tidymodels, see
?model_formula.) The column used inside strata() is treated as qualitative no matter its type. This is different from the syntax offered by glmnet (i.e., glmnet::stratifySurv()), which is not recommended here.
For example, in this model, the numeric column rx
is used to estimate
two different baseline hazards for each value of the column:
library(survival)
library(censored)
library(dplyr)
library(tidyr)

mod <- proportional_hazards(penalty = 0.01) %>%
  set_engine("glmnet", nlambda = 5) %>%
  fit(Surv(futime, fustat) ~ age + ecog.ps + strata(rx), data = ovarian)

pred_data <- data.frame(age = c(50, 50), ecog.ps = c(1, 1), rx = c(1, 2))

# Different survival probabilities for different values of 'rx'
predict(mod, pred_data, type = "survival", time = 500) %>%
  bind_cols(pred_data) %>%
  unnest(.pred)
## # A tibble: 2 x 5 ## .eval_time .pred_survival age ecog.ps rx ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 500 0.666 50 1 1 ## 2 500 0.769 50 1 2
Note that columns used in the strata()
function will also be
estimated in the regular portion of the model (i.e., within the linear
predictor).
Predictions of type "time"
are predictions of the mean survival time.
Linear predictor values
Since risk regression and parametric survival models estimate different characteristics (e.g. relative hazard versus event time), their linear predictors point in opposite directions.

For example, for parametric models, the linear predictor increases with time; for proportional hazards models, the linear predictor decreases with time (since hazard increases). As such, the linear predictors for these two quantities will have opposite signs.
tidymodels does not treat different models differently when computing
performance metrics. To standardize across model types, the default for
proportional hazards models is to have increasing values with time. As
a result, the sign of the linear predictor will be the opposite of the
value produced by the predict()
method in the engine package.
This behavior can be changed by using the increasing
argument when
calling predict()
on a model object.
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.

The fit() and fit_xy() functions have arguments called case_weights that expect vectors of case weights.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
References
Simon N, Friedman J, Hastie T, Tibshirani R. 2011. “Regularization Paths for Cox’s Proportional Hazards Model via Coordinate Descent.” Journal of Statistical Software, Articles 39 (5): 1–13.
Hastie T, Tibshirani R, Wainwright M. 2015. Statistical Learning with Sparsity. CRC Press.
Kuhn M, Johnson K. 2013. Applied Predictive Modeling. Springer.
Proportional hazards regression
Description
survival::coxph()
fits a Cox proportional hazards model.
Details
For this engine, there is a single mode: censored regression
Tuning Parameters
This model has no tuning parameters.
Translation from parsnip to the original package
The censored extension package is required to fit this model.
library(censored)

proportional_hazards() %>%
  set_engine("survival") %>%
  set_mode("censored regression") %>%
  translate()
## Proportional Hazards Model Specification (censored regression) ## ## Computational engine: survival ## ## Model fit template: ## survival::coxph(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg(), x = TRUE, model = TRUE)
Other details
The model does not fit an intercept.
The main interface for this model uses the formula method since the model specification typically involves the use of survival::Surv().

The model formula can include special terms, such as survival::strata(). This allows the baseline hazard to differ between groups contained in the function. The column used inside strata() is treated as qualitative no matter its type. To learn more about using special terms in formulas with tidymodels, see ?model_formula.
For example, in this model, the numeric column rx
is used to estimate
two different baseline hazards for each value of the column:
library(survival)

proportional_hazards() %>%
  fit(Surv(futime, fustat) ~ age + strata(rx), data = ovarian) %>%
  extract_fit_engine() %>%
  # Two different hazards for each value of 'rx'
  basehaz()
## hazard time strata ## 1 0.02250134 59 rx=1 ## 2 0.05088586 115 rx=1 ## 3 0.09467873 156 rx=1 ## 4 0.14809975 268 rx=1 ## 5 0.30670509 329 rx=1 ## 6 0.46962698 431 rx=1 ## 7 0.46962698 448 rx=1 ## 8 0.46962698 477 rx=1 ## 9 1.07680229 638 rx=1 ## 10 1.07680229 803 rx=1 ## 11 1.07680229 855 rx=1 ## 12 1.07680229 1040 rx=1 ## 13 1.07680229 1106 rx=1 ## 14 0.05843331 353 rx=2 ## 15 0.12750063 365 rx=2 ## 16 0.12750063 377 rx=2 ## 17 0.12750063 421 rx=2 ## 18 0.23449656 464 rx=2 ## 19 0.35593895 475 rx=2 ## 20 0.50804209 563 rx=2 ## 21 0.50804209 744 rx=2 ## 22 0.50804209 769 rx=2 ## 23 0.50804209 770 rx=2 ## 24 0.50804209 1129 rx=2 ## 25 0.50804209 1206 rx=2 ## 26 0.50804209 1227 rx=2
Note that columns used in the strata()
function will not be estimated
in the regular portion of the model (i.e., within the linear predictor).
Predictions of type "time"
are predictions of the mean survival time.
Linear predictor values
Since risk regression and parametric survival models estimate different characteristics (e.g. relative hazard versus event time), their linear predictors point in opposite directions.

For example, for parametric models, the linear predictor increases with time; for proportional hazards models, the linear predictor decreases with time (since hazard increases). As such, the linear predictors for these two quantities will have opposite signs.
tidymodels does not treat different models differently when computing
performance metrics. To standardize across model types, the default for
proportional hazards models is to have increasing values with time. As
a result, the sign of the linear predictor will be the opposite of the
value produced by the predict()
method in the engine package.
This behavior can be changed by using the increasing
argument when
calling predict()
on a model object.
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.

The fit() and fit_xy() functions have arguments called case_weights that expect vectors of case weights.
References
Andersen P, Gill R. 1982. Cox’s regression model for counting processes, a large sample study. Annals of Statistics 10, 1100-1120.
Oblique random survival forests via aorsf
Description
aorsf::orsf()
fits a model that creates a large number of oblique decision
trees, each de-correlated from the others. The final prediction uses all
predictions from the individual trees and combines them.
Details
For this engine, there are multiple modes: censored regression, classification, and regression
Tuning Parameters
This model has 3 tuning parameters:
- trees: # Trees (type: integer, default: 500L)
- min_n: Minimal Node Size (type: integer, default: 5L)
- mtry: # Randomly Selected Predictors (type: integer, default: ceiling(sqrt(n_predictors)))
Additionally, this model has one engine-specific tuning parameter:
- split_min_stat: Minimum test statistic required to split a node. Defaults are 3.841459 for censored regression (which is roughly a p-value of 0.05) and 0 for classification and regression. For classification, this tuning parameter should be between 0 and 1, and for regression it should be greater than or equal to 0. Higher values of this parameter cause trees grown by aorsf to have less depth.
Translation from parsnip to the original package (censored regression)
The censored extension package is required to fit this model.
library(censored)

rand_forest() %>%
  set_engine("aorsf") %>%
  set_mode("censored regression") %>%
  translate()
## Random Forest Model Specification (censored regression) ## ## Computational engine: aorsf ## ## Model fit template: ## aorsf::orsf(formula = missing_arg(), data = missing_arg(), weights = missing_arg())
Translation from parsnip to the original package (regression)
The bonsai extension package is required to fit this model.
library(bonsai)

rand_forest() %>%
  set_engine("aorsf") %>%
  set_mode("regression") %>%
  translate()
## Random Forest Model Specification (regression) ## ## Computational engine: aorsf ## ## Model fit template: ## aorsf::orsf(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), ## n_thread = 1, verbose_progress = FALSE)
Translation from parsnip to the original package (classification)
The bonsai extension package is required to fit this model.
library(bonsai)

rand_forest() %>%
  set_engine("aorsf") %>%
  set_mode("classification") %>%
  translate()
## Random Forest Model Specification (classification) ## ## Computational engine: aorsf ## ## Model fit template: ## aorsf::orsf(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), ## n_thread = 1, verbose_progress = FALSE)
Preprocessing requirements
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c}
vs {b, d}
) when splitting at a node. Dummy variables
are not required for this model.
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.

The fit() and fit_xy() functions have arguments called case_weights that expect vectors of case weights.
Other details
Predictions of survival probability at a time exceeding the maximum observed event time are the predicted survival probability at the maximum observed time in the training data.
The class predict method in aorsf uses the standard ‘each tree gets one vote’ approach, which is usually, but not always, consistent with picking the class that has the highest predicted probability. This inconsistency is acceptable in aorsf because it intentionally applies the traditional class prediction method for random forests, but in tidymodels it is preferable to embrace consistency. Thus, we opted to always make the predicted class consistent with the predicted probability by making the predicted class a function of predicted probability (see tidymodels/bonsai#78).
References
Jaeger BC, Long DL, Long DM, Sims M, Szychowski JM, Min YI, Mcclure LA, Howard G, Simon N. Oblique random survival forests. Annals of applied statistics 2019 Sep; 13(3):1847-83. DOI: 10.1214/19-AOAS1261
Jaeger BC, Welden S, Lenoir K, Pajewski NM. aorsf: An R package for supervised learning using the oblique random survival forest. Journal of Open Source Software 2022, 7(77), 4705.
Jaeger BC, Welden S, Lenoir K, Speiser JL, Segar MW, Pandey A, Pajewski NM. Accelerated and interpretable oblique random survival forests. arXiv e-prints 2022 Aug; arXiv-2208. URL: https://arxiv.org/abs/2208.01129
Random forests via h2o
Description
h2o::h2o.randomForest()
fits a model that creates a large number of
decision trees, each independent of the others. The final prediction uses all
predictions from the individual trees and combines them.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 3 tuning parameters:
- trees: # Trees (type: integer, default: 50L)
- min_n: Minimal Node Size (type: integer, default: 1)
- mtry: # Randomly Selected Predictors (type: integer, default: see below)
mtry
depends on the number of columns and the model mode. The default
in h2o::h2o.randomForest()
is
floor(sqrt(ncol(x)))
for classification and floor(ncol(x)/3)
for
regression.
Translation from parsnip to the original package (regression)
agua::h2o_train_rf()
is a wrapper around
h2o::h2o.randomForest()
.
rand_forest( mtry = integer(1), trees = integer(1), min_n = integer(1) ) %>% set_engine("h2o") %>% set_mode("regression") %>% translate()
## Random Forest Model Specification (regression) ## ## Main Arguments: ## mtry = integer(1) ## trees = integer(1) ## min_n = integer(1) ## ## Computational engine: h2o ## ## Model fit template: ## agua::h2o_train_rf(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## validation_frame = missing_arg(), mtries = integer(1), ntrees = integer(1), ## min_rows = integer(1))
min_rows() and min_cols() will adjust the values of min_n and mtry, respectively, if the chosen values are not consistent with the actual data dimensions.
Translation from parsnip to the original package (classification)
rand_forest( mtry = integer(1), trees = integer(1), min_n = integer(1) ) %>% set_engine("h2o") %>% set_mode("classification") %>% translate()
## Random Forest Model Specification (classification) ## ## Main Arguments: ## mtry = integer(1) ## trees = integer(1) ## min_n = integer(1) ## ## Computational engine: h2o ## ## Model fit template: ## agua::h2o_train_rf(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## validation_frame = missing_arg(), mtries = integer(1), ntrees = integer(1), ## min_rows = integer(1))
Preprocessing requirements
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c}
vs {b, d}
) when splitting at a node. Dummy variables
are not required for this model.
Initializing h2o
To use the h2o engine with tidymodels, please run h2o::h2o.init() first. By default, this connects R to the local h2o server. This needs to be done in every new R session. You can also connect to a remote h2o server with an IP address; for more details, see h2o::h2o.init().

You can control the number of threads in the thread pool used by h2o with the nthreads argument. By default, it uses all CPUs on the host. This is different from the usual parallel processing mechanism in tidymodels for tuning: while tidymodels parallelizes over resamples, h2o parallelizes over hyperparameter combinations for a given resample.
h2o will automatically shut down the local h2o instance started by R when R is terminated. To manually stop the h2o server, run h2o::h2o.shutdown().
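A minimal sketch of a local h2o session (the nthreads value here is only an illustration; the default, -1, uses all CPUs):

h2o::h2o.init(nthreads = 2)

# ... fit and predict with "h2o" engine models ...

h2o::h2o.shutdown(prompt = FALSE)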
Saving fitted model objects
Models fitted with this engine may require native serialization methods to be properly saved and/or passed between R sessions. To learn more about preparing fitted models for serialization, see the bundle package.
Random forests via partykit
Description
partykit::cforest()
fits a model that creates a large number of decision
trees, each independent of the others. The final prediction uses all
predictions from the individual trees and combines them.
Details
For this engine, there are multiple modes: censored regression, regression, and classification
Tuning Parameters
This model has 3 tuning parameters:
- trees: # Trees (type: integer, default: 500L)
- min_n: Minimal Node Size (type: integer, default: 20L)
- mtry: # Randomly Selected Predictors (type: integer, default: 5L)
Translation from parsnip to the original package (regression)
The bonsai extension package is required to fit this model.
library(bonsai)

rand_forest() %>%
  set_engine("partykit") %>%
  set_mode("regression") %>%
  translate()
## Random Forest Model Specification (regression) ## ## Computational engine: partykit ## ## Model fit template: ## parsnip::cforest_train(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg())
Translation from parsnip to the original package (classification)
The bonsai extension package is required to fit this model.
library(bonsai)

rand_forest() %>%
  set_engine("partykit") %>%
  set_mode("classification") %>%
  translate()
## Random Forest Model Specification (classification) ## ## Computational engine: partykit ## ## Model fit template: ## parsnip::cforest_train(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg())
parsnip::cforest_train()
is a wrapper around
partykit::cforest()
(and other functions) that
makes it easier to run this model.
Translation from parsnip to the original package (censored regression)
The censored extension package is required to fit this model.
library(censored)

rand_forest() %>%
  set_engine("partykit") %>%
  set_mode("censored regression") %>%
  translate()
## Random Forest Model Specification (censored regression) ## ## Computational engine: partykit ## ## Model fit template: ## parsnip::cforest_train(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg())
censored::cond_inference_surv_cforest()
is a wrapper around
partykit::cforest()
(and other functions) that
makes it easier to run this model.
Preprocessing requirements
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c}
vs {b, d}
) when splitting at a node. Dummy variables
are not required for this model.
Other details
Predictions of type "time"
are predictions of the median survival
time.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Random forests via randomForest
Description
randomForest::randomForest()
fits a model that creates a large number of
decision trees, each independent of the others. The final prediction uses all
predictions from the individual trees and combines them.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 3 tuning parameters:
- mtry: # Randomly Selected Predictors (type: integer, default: see below)
- trees: # Trees (type: integer, default: 500L)
- min_n: Minimal Node Size (type: integer, default: see below)
mtry
depends on the number of columns and the model mode. The default
in randomForest::randomForest()
is
floor(sqrt(ncol(x)))
for classification and floor(ncol(x)/3)
for
regression.
min_n
depends on the mode. For regression, a value of 5 is the
default. For classification, a value of 10 is used.
Translation from parsnip to the original package (regression)
rand_forest( mtry = integer(1), trees = integer(1), min_n = integer(1) ) %>% set_engine("randomForest") %>% set_mode("regression") %>% translate()
## Random Forest Model Specification (regression) ## ## Main Arguments: ## mtry = integer(1) ## trees = integer(1) ## min_n = integer(1) ## ## Computational engine: randomForest ## ## Model fit template: ## randomForest::randomForest(x = missing_arg(), y = missing_arg(), ## mtry = min_cols(~integer(1), x), ntree = integer(1), nodesize = min_rows(~integer(1), ## x))
min_rows() and min_cols() will adjust the values of min_n and mtry, respectively, if the chosen values are not consistent with the actual data dimensions.
Translation from parsnip to the original package (classification)
rand_forest( mtry = integer(1), trees = integer(1), min_n = integer(1) ) %>% set_engine("randomForest") %>% set_mode("classification") %>% translate()
## Random Forest Model Specification (classification) ## ## Main Arguments: ## mtry = integer(1) ## trees = integer(1) ## min_n = integer(1) ## ## Computational engine: randomForest ## ## Model fit template: ## randomForest::randomForest(x = missing_arg(), y = missing_arg(), ## mtry = min_cols(~integer(1), x), ntree = integer(1), nodesize = min_rows(~integer(1), ## x))
Preprocessing requirements
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c}
vs {b, d}
) when splitting at a node. Dummy variables
are not required for this model.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for rand_forest()
with the "randomForest"
engine.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Random forests via ranger
Description
ranger::ranger()
fits a model that creates a large number of decision
trees, each independent of the others. The final prediction uses all
predictions from the individual trees and combines them.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 3 tuning parameters:
- mtry: # Randomly Selected Predictors (type: integer, default: see below)
- trees: # Trees (type: integer, default: 500L)
- min_n: Minimal Node Size (type: integer, default: see below)
mtry
depends on the number of columns. The default in
ranger::ranger()
is floor(sqrt(ncol(x)))
.
min_n
depends on the mode. For regression, a value of 5 is the
default. For classification, a value of 10 is used.
Translation from parsnip to the original package (regression)
rand_forest( mtry = integer(1), trees = integer(1), min_n = integer(1) ) %>% set_engine("ranger") %>% set_mode("regression") %>% translate()
## Random Forest Model Specification (regression) ## ## Main Arguments: ## mtry = integer(1) ## trees = integer(1) ## min_n = integer(1) ## ## Computational engine: ranger ## ## Model fit template: ## ranger::ranger(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## mtry = min_cols(~integer(1), x), num.trees = integer(1), ## min.node.size = min_rows(~integer(1), x), num.threads = 1, ## verbose = FALSE, seed = sample.int(10^5, 1))
min_rows() and min_cols() will adjust the values of min_n and mtry, respectively, if the chosen values are not consistent with the actual data dimensions.
Translation from parsnip to the original package (classification)
rand_forest( mtry = integer(1), trees = integer(1), min_n = integer(1) ) %>% set_engine("ranger") %>% set_mode("classification") %>% translate()
## Random Forest Model Specification (classification) ## ## Main Arguments: ## mtry = integer(1) ## trees = integer(1) ## min_n = integer(1) ## ## Computational engine: ranger ## ## Model fit template: ## ranger::ranger(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## mtry = min_cols(~integer(1), x), num.trees = integer(1), ## min.node.size = min_rows(~integer(1), x), num.threads = 1, ## verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
Note that a ranger
probability forest is always fit (unless the
probability
argument is changed by the user via
set_engine()
).
Preprocessing requirements
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c}
vs {b, d}
) when splitting at a node. Dummy variables
are not required for this model.
Other notes
By default, parallel processing is turned off. When tuning, it is more efficient to parallelize over the resamples and tuning parameters. To parallelize the construction of the trees within the ranger model, change the num.threads argument via set_engine().
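For example, a sketch that lets ranger build trees on four threads:

rand_forest(trees = 1000) %>%
  set_engine("ranger", num.threads = 4) %>%
  set_mode("regression")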
For ranger
confidence intervals, the intervals are constructed using
the form estimate +/- z * std_error
. For classification probabilities,
these values can fall outside of [0, 1]
and will be coerced to be in
this range.
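A sketch of interval predictions; setting keep.inbag = TRUE is an assumption here, since ranger needs the in-bag counts to compute standard errors:

set.seed(1)
rf_fit <- rand_forest(trees = 500) %>%
  set_engine("ranger", keep.inbag = TRUE) %>%
  set_mode("regression") %>%
  fit(mpg ~ ., data = mtcars)

predict(rf_fit, new_data = mtcars[1:3, ], type = "conf_int", level = 0.95)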
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.

The fit() and fit_xy() functions have arguments called case_weights that expect vectors of case weights.
Sparse Data
This model can utilize sparse data during model fitting and prediction.
Both sparse matrices such as dgCMatrix from the Matrix
package and
sparse tibbles from the sparsevctrs
package are supported. See
sparse_data for more information.
While this engine supports sparse data as an input, it doesn't use it any differently than dense data. Hence there is no reason to convert back and forth.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for rand_forest()
with the "ranger"
engine.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Random forests via spark
Description
sparklyr::ml_random_forest()
fits a model that creates a large number of
decision trees, each independent of the others. The final prediction uses all
predictions from the individual trees and combines them.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 3 tuning parameters:
- mtry: # Randomly Selected Predictors (type: integer, default: see below)
- trees: # Trees (type: integer, default: 20L)
- min_n: Minimal Node Size (type: integer, default: 1L)
mtry
depends on the number of columns and the model mode. The default
in sparklyr::ml_random_forest()
is
floor(sqrt(ncol(x)))
for classification and floor(ncol(x)/3)
for
regression.
Translation from parsnip to the original package (regression)
rand_forest( mtry = integer(1), trees = integer(1), min_n = integer(1) ) %>% set_engine("spark") %>% set_mode("regression") %>% translate()
## Random Forest Model Specification (regression) ## ## Main Arguments: ## mtry = integer(1) ## trees = integer(1) ## min_n = integer(1) ## ## Computational engine: spark ## ## Model fit template: ## sparklyr::ml_random_forest(x = missing_arg(), formula = missing_arg(), ## type = "regression", feature_subset_strategy = integer(1), ## num_trees = integer(1), min_instances_per_node = min_rows(~integer(1), ## x), seed = sample.int(10^5, 1))
min_rows() and min_cols() will adjust the values of min_n and mtry, respectively, if the chosen values are not consistent with the actual data dimensions.
Translation from parsnip to the original package (classification)
rand_forest( mtry = integer(1), trees = integer(1), min_n = integer(1) ) %>% set_engine("spark") %>% set_mode("classification") %>% translate()
## Random Forest Model Specification (classification) ## ## Main Arguments: ## mtry = integer(1) ## trees = integer(1) ## min_n = integer(1) ## ## Computational engine: spark ## ## Model fit template: ## sparklyr::ml_random_forest(x = missing_arg(), formula = missing_arg(), ## type = "classification", feature_subset_strategy = integer(1), ## num_trees = integer(1), min_instances_per_node = min_rows(~integer(1), ## x), seed = sample.int(10^5, 1))
Preprocessing requirements
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c}
vs {b, d}
) when splitting at a node. Dummy variables
are not required for this model.
Other details
For models created using the "spark"
engine, there are several things
to consider.
- Only the formula interface via fit() is available; using fit_xy() will generate an error.
- The predictions will always be in a Spark table format. The names will be the same as documented but without the dots.
- There is no equivalent to factor columns in Spark tables so class predictions are returned as character columns.
- To retain the model object for a new R session (via save()), the model$fit element of the parsnip object should be serialized via ml_save(object$fit) and separately saved to disk. In a new session, the object can be reloaded and reattached to the parsnip object, as sketched below.
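A minimal sketch of the save/reload round trip (spark_fit is a hypothetical fitted parsnip model and sc an open Spark connection):

# Save the Spark-side model separately from the R object:
sparklyr::ml_save(spark_fit$fit, path = "rf_model_dir")

# In a new session, reload and reattach it to the parsnip object:
spark_fit$fit <- sparklyr::ml_load(sc, path = "rf_model_dir")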
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.

The fit() and fit_xy() functions have arguments called case_weights that expect vectors of case weights.
Note that, for spark engines, the case_weight
argument value should be
a character string to specify the column with the numeric case weights.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
RuleFit models via h2o
Description
h2o::h2o.rulefit()
fits a model that derives simple feature rules from a tree
ensemble and uses the rules as features to a regularized (LASSO) model. agua::h2o_train_rule()
is a wrapper around this function.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 3 tuning parameters:
- trees: # Trees (type: integer, default: 50L)
- tree_depth: Tree Depth (type: integer, default: 3L)
- penalty: Amount of Regularization (type: double, default: 0)

Note that penalty for the h2o engine in rule_fit() corresponds to the L1 penalty (LASSO).
Other engine arguments of interest:
- algorithm: The algorithm to use to generate rules; should be one of “AUTO”, “DRF”, or “GBM”. Defaults to “AUTO”.
- min_rule_length: Minimum length of tree depth, the opposite of tree_depth; defaults to 3.
- max_num_rules: The maximum number of rules to return. The default value of -1 means the number of rules is selected by diminishing returns in model deviance.
- model_type: The type of base learners in the ensemble; should be one of “rules_and_linear”, “rules”, or “linear”. Defaults to “rules_and_linear”.
Translation from parsnip to the underlying model call (regression)
agua::h2o_train_rule()
is a wrapper around
h2o::h2o.rulefit()
.
The agua extension package is required to fit this model.
library(agua)

rule_fit(
  trees = integer(1),
  tree_depth = integer(1),
  penalty = numeric(1)
) %>%
  set_engine("h2o") %>%
  set_mode("regression") %>%
  translate()
## RuleFit Model Specification (regression) ## ## Main Arguments: ## trees = integer(1) ## tree_depth = integer(1) ## penalty = numeric(1) ## ## Computational engine: h2o ## ## Model fit template: ## agua::h2o_train_rule(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## validation_frame = missing_arg(), rule_generation_ntrees = integer(1), ## max_rule_length = integer(1), lambda = numeric(1))
Translation from parsnip to the underlying model call (classification)
agua::h2o_train_rule()
for rule_fit()
is a
wrapper around h2o::h2o.rulefit()
.
The agua extension package is required to fit this model.
rule_fit( trees = integer(1), tree_depth = integer(1), penalty = numeric(1) ) %>% set_engine("h2o") %>% set_mode("classification") %>% translate()
## RuleFit Model Specification (classification) ## ## Main Arguments: ## trees = integer(1) ## tree_depth = integer(1) ## penalty = numeric(1) ## ## Computational engine: h2o ## ## Model fit template: ## agua::h2o_train_rule(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## validation_frame = missing_arg(), rule_generation_ntrees = integer(1), ## max_rule_length = integer(1), lambda = numeric(1))
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Other details
To use the h2o engine with tidymodels, please run h2o::h2o.init() first. By default, this connects R to the local h2o server. This needs to be done in every new R session. You can also connect to a remote h2o server with an IP address; for more details, see h2o::h2o.init().

You can control the number of threads in the thread pool used by h2o with the nthreads argument. By default, it uses all CPUs on the host. This is different from the usual parallel processing mechanism in tidymodels for tuning: while tidymodels parallelizes over resamples, h2o parallelizes over hyperparameter combinations for a given resample.
h2o will automatically shut down the local h2o instance started by R when R is terminated. To manually stop the h2o server, run h2o::h2o.shutdown().
Saving fitted model objects
Models fitted with this engine may require native serialization methods to be properly saved and/or passed between R sessions. To learn more about preparing fitted models for serialization, see the bundle package.
RuleFit models via xrf
Description
xrf::xrf()
fits a model that derives simple feature rules from a tree
ensemble and uses the rules as features to a regularized model. rules::xrf_fit()
is a wrapper around this function.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 8 tuning parameters:
- mtry: Proportion Randomly Selected Predictors (type: double, default: see below)
- trees: # Trees (type: integer, default: 15L)
- min_n: Minimal Node Size (type: integer, default: 1L)
- tree_depth: Tree Depth (type: integer, default: 6L)
- learn_rate: Learning Rate (type: double, default: 0.3)
- loss_reduction: Minimum Loss Reduction (type: double, default: 0.0)
- sample_size: Proportion Observations Sampled (type: double, default: 1.0)
- penalty: Amount of Regularization (type: double, default: 0.1)
Translation from parsnip to the underlying model call (regression)
The rules extension package is required to fit this model.
library(rules)

rule_fit(
  mtry = numeric(1),
  trees = integer(1),
  min_n = integer(1),
  tree_depth = integer(1),
  learn_rate = numeric(1),
  loss_reduction = numeric(1),
  sample_size = numeric(1),
  penalty = numeric(1)
) %>%
  set_engine("xrf") %>%
  set_mode("regression") %>%
  translate()
## RuleFit Model Specification (regression) ## ## Main Arguments: ## mtry = numeric(1) ## trees = integer(1) ## min_n = integer(1) ## tree_depth = integer(1) ## learn_rate = numeric(1) ## loss_reduction = numeric(1) ## sample_size = numeric(1) ## penalty = numeric(1) ## ## Computational engine: xrf ## ## Model fit template: ## rules::xrf_fit(formula = missing_arg(), data = missing_arg(), ## xgb_control = missing_arg(), colsample_bynode = numeric(1), ## nrounds = integer(1), min_child_weight = integer(1), max_depth = integer(1), ## eta = numeric(1), gamma = numeric(1), subsample = numeric(1), ## lambda = numeric(1))
Translation from parsnip to the underlying model call (classification)
The rules extension package is required to fit this model.
library(rules)

rule_fit(
  mtry = numeric(1),
  trees = integer(1),
  min_n = integer(1),
  tree_depth = integer(1),
  learn_rate = numeric(1),
  loss_reduction = numeric(1),
  sample_size = numeric(1),
  penalty = numeric(1)
) %>%
  set_engine("xrf") %>%
  set_mode("classification") %>%
  translate()
## RuleFit Model Specification (classification) ## ## Main Arguments: ## mtry = numeric(1) ## trees = integer(1) ## min_n = integer(1) ## tree_depth = integer(1) ## learn_rate = numeric(1) ## loss_reduction = numeric(1) ## sample_size = numeric(1) ## penalty = numeric(1) ## ## Computational engine: xrf ## ## Model fit template: ## rules::xrf_fit(formula = missing_arg(), data = missing_arg(), ## xgb_control = missing_arg(), colsample_bynode = numeric(1), ## nrounds = integer(1), min_child_weight = integer(1), max_depth = integer(1), ## eta = numeric(1), gamma = numeric(1), subsample = numeric(1), ## lambda = numeric(1))
Differences from the xrf package
Note that, per the documentation in ?xrf
, transformations of the
response variable are not supported. To use these with rule_fit()
, we
recommend using a recipe instead of the formula method.
Also, there are several configuration differences in how xrf()
is fit
between that package and the wrapper used in rules. Some differences
in default values are:
| parameter | xrf | rules |
|-----------|-----|-------|
| trees | 100 | 15 |
| max_depth | 3 | 6 |
These differences will create a disparity in the values of the penalty argument that glmnet uses. Also, rules can set penalty directly, whereas xrf, by default, uses an internal 5-fold cross-validation to determine it.
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Other details
Interpreting mtry
The mtry
argument denotes the number of predictors that will be
randomly sampled at each split when creating tree models.
Some engines, such as "xgboost", "xrf", and "lightgbm", interpret their analogue to the mtry argument as the proportion of predictors that will be randomly sampled at each split rather than the count. In some settings, such as when tuning over preprocessors that influence the number of predictors, this parameterization is quite helpful: interpreting mtry as a proportion means that [0, 1] is always a valid range for that parameter, regardless of the input data.

parsnip and its extensions accommodate this parameterization using the counts argument: a logical indicating whether mtry should be interpreted as the number of predictors that will be randomly sampled at each split. TRUE indicates that mtry will be interpreted as a count; FALSE indicates that it will be interpreted as a proportion.
mtry
is a main model argument for
boost_tree()
and
rand_forest()
, and thus should not have an
engine-specific interface. So, regardless of engine, counts
defaults
to TRUE
. For engines that support the proportion interpretation
(currently "xgboost"
and "xrf"
, via the rules package, and
"lightgbm"
via the bonsai package) the user can pass the
counts = FALSE
argument to set_engine()
to supply mtry
values
within [0, 1]
.
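For example, a sketch that supplies mtry as a proportion for the "xrf" engine:

library(rules)

rule_fit(mtry = 0.7, trees = 20) %>%
  set_engine("xrf", counts = FALSE) %>%
  set_mode("regression")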
Early stopping
The stop_iter argument allows the model to prematurely stop training if the objective function does not improve within early_stop iterations.
The best way to use this feature is in conjunction with an internal
validation set. To do this, pass the validation
parameter of
xgb_train()
via the parsnip
set_engine()
function. This is the
proportion of the training set that should be reserved for measuring
performance (and stopping early).
If the model specification has early_stop >= trees
, early_stop
is
converted to trees - 1
and a warning is issued.
Case weights
The underlying model implementation does not allow for case weights.
References
Friedman and Popescu. “Predictive learning via rule ensembles.” Ann. Appl. Stat. 2 (3): 916–954, September 2008.
Parametric survival regression
Description
flexsurv::flexsurvreg()
fits a parametric survival model.
Details
For this engine, there is a single mode: censored regression
Tuning Parameters
This model has 1 tuning parameter:

- dist: Distribution (type: character, default: ‘weibull’)
Translation from parsnip to the original package
The censored extension package is required to fit this model.
library(censored)

survival_reg(dist = character(1)) %>%
  set_engine("flexsurv") %>%
  set_mode("censored regression") %>%
  translate()
## Parametric Survival Regression Model Specification (censored regression) ## ## Main Arguments: ## dist = character(1) ## ## Computational engine: flexsurv ## ## Model fit template: ## flexsurv::flexsurvreg(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg(), dist = character(1))
Other details
The main interface for this model uses the formula method since the model specification typically involves the use of survival::Surv().

For this engine, stratification cannot be specified via survival::strata(); please see flexsurv::flexsurvreg() for alternative specifications.
Predictions of type "time"
are predictions of the mean survival time.
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.

The fit() and fit_xy() functions have arguments called case_weights that expect vectors of case weights.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
References
Jackson, C. 2016. flexsurv: A Platform for Parametric Survival Modeling in R. Journal of Statistical Software, 70(8), 1–33.
Flexible parametric survival regression
Description
flexsurv::flexsurvspline()
fits a flexible parametric survival model.
Details
For this engine, there is a single mode: censored regression
Tuning Parameters
This model has one engine-specific tuning parameter:
- k: Number of knots in the spline. The default is k = 0.
Translation from parsnip to the original package
The censored extension package is required to fit this model.
library(censored)

survival_reg() %>%
  set_engine("flexsurvspline") %>%
  set_mode("censored regression") %>%
  translate()
## Parametric Survival Regression Model Specification (censored regression) ## ## Computational engine: flexsurvspline ## ## Model fit template: ## flexsurv::flexsurvspline(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg())
Other details
The main interface for this model uses the formula method since the model specification typically involves the use of survival::Surv().

For this engine, stratification cannot be specified via survival::strata(); please see flexsurv::flexsurvspline() for alternative specifications.
Predictions of type "time"
are predictions of the mean survival time.
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.

The fit() and fit_xy() functions have arguments called case_weights that expect vectors of case weights.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
References
Jackson, C. 2016. flexsurv: A Platform for Parametric Survival Modeling in R. Journal of Statistical Software, 70(8), 1–33.
Parametric survival regression
Description
survival::survreg()
fits a parametric survival model.
Details
For this engine, there is a single mode: censored regression
Tuning Parameters
This model has 1 tuning parameter:

- dist: Distribution (type: character, default: ‘weibull’)
Translation from parsnip to the original package
The censored extension package is required to fit this model.
library(censored)

survival_reg(dist = character(1)) %>%
  set_engine("survival") %>%
  set_mode("censored regression") %>%
  translate()
## Parametric Survival Regression Model Specification (censored regression) ## ## Main Arguments: ## dist = character(1) ## ## Computational engine: survival ## ## Model fit template: ## survival::survreg(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg(), dist = character(1), model = TRUE)
Other details
In the translated syntax above, note that model = TRUE
is needed to
produce quantile predictions when there is a stratification variable and
can be overridden in other cases.
The main interface for this model uses the formula method since the model specification typically involves the use of survival::Surv().

The model formula can include special terms, such as survival::strata(). This allows the model scale parameter to differ between groups contained in the function. The column used inside strata() is treated as qualitative no matter its type. To learn more about using special terms in formulas with tidymodels, see ?model_formula.
For example, in this model, the numeric column rx
is used to estimate
two different scale parameters for each value of the column:
library(survival)

survival_reg() %>%
  fit(Surv(futime, fustat) ~ age + strata(rx), data = ovarian) %>%
  extract_fit_engine()
## Call: ## survival::survreg(formula = Surv(futime, fustat) ~ age + strata(rx), ## data = data, model = TRUE) ## ## Coefficients: ## (Intercept) age ## 12.8734120 -0.1033569 ## ## Scale: ## rx=1 rx=2 ## 0.7695509 0.4703602 ## ## Loglik(model)= -89.4 Loglik(intercept only)= -97.1 ## Chisq= 15.36 on 1 degrees of freedom, p= 8.88e-05 ## n= 26
Predictions of type "time"
are predictions of the mean survival time.
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.

The fit() and fit_xy() functions have arguments called case_weights that expect vectors of case weights.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
References
Kalbfleisch, J. D. and Prentice, R. L. 2002 The statistical analysis of failure time data, Wiley.
Linear support vector machines (SVMs) via kernlab
Description
kernlab::ksvm()
fits a support vector machine model. For classification,
the model tries to maximize the width of the margin between classes.
For regression, the model optimizes a robust loss function that is only
affected by very large model residuals.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 2 tuning parameters:
- cost: Cost (type: double, default: 1.0)
- margin: Insensitivity Margin (type: double, default: 0.1)
Parsnip changes the default range for cost to c(-10, 5).
Translation from parsnip to the original package (regression)
svm_linear( cost = double(1), margin = double(1) ) %>% set_engine("kernlab") %>% set_mode("regression") %>% translate()
## Linear Support Vector Machine Model Specification (regression) ## ## Main Arguments: ## cost = double(1) ## margin = double(1) ## ## Computational engine: kernlab ## ## Model fit template: ## kernlab::ksvm(x = missing_arg(), data = missing_arg(), C = double(1), ## epsilon = double(1), kernel = "vanilladot")
Translation from parsnip to the original package (classification)
svm_linear( cost = double(1) ) %>% set_engine("kernlab") %>% set_mode("classification") %>% translate()
## Linear Support Vector Machine Model Specification (classification) ## ## Main Arguments: ## cost = double(1) ## ## Computational engine: kernlab ## ## Model fit template: ## kernlab::ksvm(x = missing_arg(), data = missing_arg(), C = double(1), ## kernel = "vanilladot", prob.model = TRUE)
The margin
parameter does not apply to classification models.
Note that the "kernlab"
engine does not naturally estimate class
probabilities. To produce them, the decision values of the model are
converted to probabilities using Platt scaling. This method fits an
additional model on top of the SVM model. When fitting the Platt scaling
model, random numbers are used that are not reproducible or controlled
by R’s random number stream.
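A sketch of a classification fit; the class probabilities come from the Platt-scaling step (prob.model = TRUE in the template above):

svm_fit <- svm_linear(cost = 1) %>%
  set_engine("kernlab") %>%
  set_mode("classification") %>%
  fit(Species ~ ., data = iris)

# These probabilities may differ slightly between fits because the
# Platt-scaling step is not controlled by R's random number stream:
predict(svm_fit, new_data = iris[1:3, ], type = "prob")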
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
Case weights
The underlying model implementation does not allow for case weights.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for svm_linear()
with the "kernlab"
engine.
References
Lin, HT, and R Weng. “A Note on Platt’s Probabilistic Outputs for Support Vector Machines.”
Karatzoglou, A, Smola, A, Hornik, K, and A Zeileis. 2004. “kernlab - An S4 Package for Kernel Methods in R.”, Journal of Statistical Software.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Linear support vector machines (SVMs) via LiblineaR
Description
LiblineaR::LiblineaR()
fits a support vector machine model. For classification,
the model tries to maximize the width of the margin between classes.
For regression, the model optimizes a robust loss function that is only
affected by very large model residuals.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 2 tuning parameters:
- cost: Cost (type: double, default: 1.0)
- margin: Insensitivity Margin (type: double, default: no default)
This engine fits models that are L2-regularized for L2-loss. In the
LiblineaR::LiblineaR()
documentation, these
are types 1 (classification) and 11 (regression).
Parsnip changes the default range for cost to c(-10, 5).
Translation from parsnip to the original package (regression)
svm_linear( cost = double(1), margin = double(1) ) %>% set_engine("LiblineaR") %>% set_mode("regression") %>% translate()
## Linear Support Vector Machine Model Specification (regression) ## ## Main Arguments: ## cost = double(1) ## margin = double(1) ## ## Computational engine: LiblineaR ## ## Model fit template: ## LiblineaR::LiblineaR(x = missing_arg(), y = missing_arg(), C = double(1), ## svr_eps = double(1), type = 11)
Translation from parsnip to the original package (classification)
svm_linear( cost = double(1) ) %>% set_engine("LiblineaR") %>% set_mode("classification") %>% translate()
## Linear Support Vector Machine Model Specification (classification) ## ## Main Arguments: ## cost = double(1) ## ## Computational engine: LiblineaR ## ## Model fit template: ## LiblineaR::LiblineaR(x = missing_arg(), y = missing_arg(), C = double(1), ## type = 1)
The margin
parameter does not apply to classification models.
Note that the LiblineaR
engine does not produce class probabilities.
When optimizing the model using the tune package, the default metrics
require class probabilities. To use the tune_*()
functions, a metric
set must be passed as an argument that only contains metrics for hard
class predictions (e.g., accuracy).
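A sketch of such a metric set (folds is a hypothetical rsample resampling object and class a hypothetical outcome column):

library(tune)
library(yardstick)

svm_spec <- svm_linear(cost = tune()) %>%
  set_engine("LiblineaR") %>%
  set_mode("classification")

tune_grid(svm_spec, class ~ ., resamples = folds,
          metrics = metric_set(accuracy))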
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
Case weights
The underlying model implementation does not allow for case weights.
Sparse Data
This model can utilize sparse data during model fitting and prediction.
Both sparse matrices such as dgCMatrix from the Matrix
package and
sparse tibbles from the sparsevctrs
package are supported. See
sparse_data for more information.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for svm_linear()
with the "LiblineaR"
engine.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Polynomial support vector machines (SVMs) via kernlab
Description
kernlab::ksvm()
fits a support vector machine model. For classification,
the model tries to maximize the width of the margin between classes.
For regression, the model optimizes a robust loss function that is only
affected by very large model residuals.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 4 tuning parameters:
- cost: Cost (type: double, default: 1.0)
- degree: Degree of Interaction (type: integer, default: 1L)
- scale_factor: Scale Factor (type: double, default: 1.0)
- margin: Insensitivity Margin (type: double, default: 0.1)
Parsnip changes the default range for cost to c(-10, 5).
Translation from parsnip to the original package (regression)
svm_poly( cost = double(1), degree = integer(1), scale_factor = double(1), margin = double(1) ) %>% set_engine("kernlab") %>% set_mode("regression") %>% translate()
## Polynomial Support Vector Machine Model Specification (regression) ## ## Main Arguments: ## cost = double(1) ## degree = integer(1) ## scale_factor = double(1) ## margin = double(1) ## ## Computational engine: kernlab ## ## Model fit template: ## kernlab::ksvm(x = missing_arg(), data = missing_arg(), C = double(1), ## epsilon = double(1), kernel = "polydot", kpar = list(degree = ~integer(1), ## scale = ~double(1)))
Translation from parsnip to the original package (classification)
svm_poly( cost = double(1), degree = integer(1), scale_factor = double(1) ) %>% set_engine("kernlab") %>% set_mode("classification") %>% translate()
## Polynomial Support Vector Machine Model Specification (classification) ## ## Main Arguments: ## cost = double(1) ## degree = integer(1) ## scale_factor = double(1) ## ## Computational engine: kernlab ## ## Model fit template: ## kernlab::ksvm(x = missing_arg(), data = missing_arg(), C = double(1), ## kernel = "polydot", prob.model = TRUE, kpar = list(degree = ~integer(1), ## scale = ~double(1)))
The margin
parameter does not apply to classification models.
Note that the "kernlab"
engine does not naturally estimate class
probabilities. To produce them, the decision values of the model are
converted to probabilities using Platt scaling. This method fits an
additional model on top of the SVM model. When fitting the Platt scaling
model, random numbers are used that are not reproducible or controlled
by R’s random number stream.
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
Case weights
The underlying model implementation does not allow for case weights.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for svm_poly()
with the "kernlab"
engine.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
References
Lin, HT, and R Weng. “A Note on Platt’s Probabilistic Outputs for Support Vector Machines”
Karatzoglou, A, Smola, A, Hornik, K, and A Zeileis. 2004. “kernlab - An S4 Package for Kernel Methods in R.”, Journal of Statistical Software.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Radial basis function support vector machines (SVMs) via kernlab
Description
kernlab::ksvm()
fits a support vector machine model. For classification,
the model tries to maximize the width of the margin between classes.
For regression, the model optimizes a robust loss function that is only
affected by very large model residuals.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 3 tuning parameters:
-
cost
: Cost (type: double, default: 1.0) -
rbf_sigma
: Radial Basis Function sigma (type: double, default: see below) -
margin
: Insensitivity Margin (type: double, default: 0.1)
There is no default for the radial basis function kernel parameter.
kernlab estimates it from the data using a heuristic method. See
kernlab::sigest()
. This method uses random
numbers so, without setting the seed before fitting, the model will not
be reproducible.
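Because of this, set a seed just before fitting if the results need to be repeatable. A minimal sketch:
set.seed(1)
svm_rbf(cost = 1) %>%
  set_engine("kernlab") %>%
  set_mode("regression") %>%
  fit(mpg ~ ., data = mtcars)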
Parsnip changes the default range for cost
to c(-10, 5)
.
Translation from parsnip to the original package (regression)
svm_rbf( cost = double(1), rbf_sigma = double(1), margin = double(1) ) %>% set_engine("kernlab") %>% set_mode("regression") %>% translate()
## Radial Basis Function Support Vector Machine Model Specification (regression) ## ## Main Arguments: ## cost = double(1) ## rbf_sigma = double(1) ## margin = double(1) ## ## Computational engine: kernlab ## ## Model fit template: ## kernlab::ksvm(x = missing_arg(), data = missing_arg(), C = double(1), ## epsilon = double(1), kernel = "rbfdot", kpar = list(sigma = ~double(1)))
Translation from parsnip to the original package (classification)
svm_rbf( cost = double(1), rbf_sigma = double(1) ) %>% set_engine("kernlab") %>% set_mode("classification") %>% translate()
## Radial Basis Function Support Vector Machine Model Specification (classification) ## ## Main Arguments: ## cost = double(1) ## rbf_sigma = double(1) ## ## Computational engine: kernlab ## ## Model fit template: ## kernlab::ksvm(x = missing_arg(), data = missing_arg(), C = double(1), ## kernel = "rbfdot", prob.model = TRUE, kpar = list(sigma = ~double(1)))
The margin
parameter does not apply to classification models.
Note that the "kernlab"
engine does not naturally estimate class
probabilities. To produce them, the decision values of the model are
converted to probabilities using Platt scaling. This method fits an
additional model on top of the SVM model. When fitting the Platt scaling
model, random numbers are used that are not reproducible or controlled
by R’s random number stream.
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
Case weights
The underlying model implementation does not allow for case weights.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for svm_rbf()
with the "kernlab"
engine.
References
Lin, HT, and R Weng. “A Note on Platt’s Probabilistic Outputs for Support Vector Machines”
Karatzoglou, A, Smola, A, Hornik, K, and A Zeileis. 2004. “kernlab - An S4 Package for Kernel Methods in R.”, Journal of Statistical Software.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Flexible discriminant analysis
Description
discrim_flexible()
defines a model that fits a discriminant analysis model
that can use nonlinear features created using multivariate adaptive
regression splines (MARS). This function can fit classification models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
earth¹²
¹ The default engine. ² Requires a parsnip extension package.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
discrim_flexible(
mode = "classification",
num_terms = NULL,
prod_degree = NULL,
prune_method = NULL,
engine = "earth"
)
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", or "classification". |
num_terms |
The number of features that will be retained in the final model, including the intercept. |
prod_degree |
The highest possible interaction degree. |
prune_method |
The pruning method. |
engine |
A single character string specifying what computational engine to use for fitting. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1 discrim_flexible(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, earth engine details
Linear discriminant analysis
Description
discrim_linear()
defines a model that estimates a multivariate
distribution for the predictors separately for the data in each class
(usually Gaussian with a common covariance matrix). Bayes' theorem is used
to compute the probability of each class, given the predictor values. This
function can fit classification models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
¹ The default engine. ² Requires a parsnip extension package.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
discrim_linear(
mode = "classification",
penalty = NULL,
regularization_method = NULL,
engine = "MASS"
)
Arguments
mode |
A single character string for the type of model. The only possible value for this model is "classification". |
penalty |
An non-negative number representing the amount of regularization used by some of the engines. |
regularization_method |
A character string for the type of regularized
estimation. Possible values are: " |
engine |
A single character string specifying what computational engine to use for fitting. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1 discrim_linear(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, MASS engine details
, mda engine details
, sda engine details
, sparsediscrim engine details
Quadratic discriminant analysis
Description
discrim_quad()
defines a model that estimates a multivariate
distribution for the predictors separately for the data in each class
(usually Gaussian with separate covariance matrices). Bayes' theorem is used
to compute the probability of each class, given the predictor values. This
function can fit classification models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
¹ The default engine. ² Requires a parsnip extension package.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
discrim_quad(
mode = "classification",
regularization_method = NULL,
engine = "MASS"
)
Arguments
mode |
A single character string for the type of model. The only possible value for this model is "classification". |
regularization_method |
A character string for the type of regularized
estimation. Possible values are: " |
engine |
A single character string specifying what computational engine to use for fitting. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1 discrim_quad(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, MASS engine details
, sparsediscrim engine details
Regularized discriminant analysis
Description
discrim_regularized()
defines a model that estimates a multivariate
distribution for the predictors separately for the data in each class. The
structure of the model can be LDA, QDA, or some amalgam of the two. Bayes'
theorem is used to compute the probability of each class, given the
predictor values. This function can fit classification models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
klaR¹²
¹ The default engine. ² Requires a parsnip extension package.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
discrim_regularized(
mode = "classification",
frac_common_cov = NULL,
frac_identity = NULL,
engine = "klaR"
)
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", or "classification". |
frac_common_cov , frac_identity |
Numeric values between zero and one. |
engine |
A single character string specifying what computational engine to use for fitting. |
Details
There are many ways of regularizing models. For example, one form of regularization is to penalize model parameters. Similarly, the classic James–Stein regularization approach shrinks the model structure to a less complex form.
The model fits a very specific type of regularized model by Friedman (1989) that uses two types of regularization. One modulates how class-specific the covariance matrix should be. This allows the model to balance between LDA and QDA. The second regularization component shrinks the covariance matrix towards the identity matrix.
For the penalization approach, discrim_linear()
with a mda
engine can be
used. Other regularization methods can be used with discrim_linear()
and
discrim_quad()
via the sparsediscrim
engine for those functions.
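As a hedged sketch of a specification that sits between the two extremes (it assumes the discrim extension package and klaR are installed; frac_common_cov = 1 pools the covariance as in LDA, while 0 is QDA-like):
library(discrim)
rda_spec <-
  discrim_regularized(frac_common_cov = 0.5, frac_identity = 0) %>%
  set_engine("klaR")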
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1 discrim_regularized(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
Friedman, J (1989). Regularized Discriminant Analysis. Journal of the American Statistical Association, 84, 165-175.
See Also
fit()
, set_engine()
, update()
, klaR engine details
Tools for documenting engines
Description
parsnip has a fairly complex documentation system where the engines for each model have detailed documentation about the syntax, tuning parameters, preprocessing needs, and so on.
The functions below are called from .R
files to programmatically
generate content in the help files for a model.
-
make_engine_list()
identifies engines for a model and creates a bulleted list of links to those specific help files. -
make_seealso_list()
creates a set of links for the "See Also" list at the bottom of the help pages. -
find_engine_files()
is a function, used by the above, to find the engines for each model function.
Usage
find_engine_files(mod)
make_engine_list(mod)
make_seealso_list(mod, pkg = "parsnip")
Arguments
mod |
A character string for the model file (e.g. "linear_reg") |
pkg |
A character string for the package where the function is invoked. |
Details
parsnip includes a document (README-DOCS.md
) with step-by-step instructions
and details. See the code below to determine where it is installed (or see
the References section).
Most parsnip users will not need to use these functions or documentation.
Value
make_engine_list()
returns a character string that creates a
bulleted list of links to more specific help files.
make_seealso_list()
returns a formatted character string of links.
find_engine_files()
returns a tibble.
References
https://github.com/tidymodels/parsnip/blob/main/inst/README-DOCS.md
Examples
# See this file for step-by-step instructions.
system.file("README-DOCS.md", package = "parsnip")
# Code examples:
make_engine_list("linear_reg")
cat(make_engine_list("linear_reg"))
Evaluate parsnip model arguments
Description
Evaluate parsnip model arguments
Usage
eval_args(spec, ...)
Arguments
spec |
|
... |
Not used. |
Extract elements of a parsnip model object
Description
These functions extract various elements from a parsnip object. If they do not exist yet, an error is thrown.
-
extract_spec_parsnip()
returns the parsnip model specification. -
extract_fit_engine()
returns the engine-specific fit embedded within a parsnip model fit. For example, when using linear_reg()
with the "lm"
engine, this returns the underlying lm
object. -
extract_parameter_dials()
returns a single dials parameter object. -
extract_parameter_set_dials()
returns a set of dials parameter objects. -
extract_fit_time()
returns a tibble with fit times. The fit times correspond to the time for the parsnip engine to fit and do not include other portions of the elapsed time in fit.model_spec().
Usage
## S3 method for class 'model_fit'
extract_spec_parsnip(x, ...)
## S3 method for class 'model_fit'
extract_fit_engine(x, ...)
## S3 method for class 'model_spec'
extract_parameter_set_dials(x, ...)
## S3 method for class 'model_spec'
extract_parameter_dials(x, parameter, ...)
## S3 method for class 'model_fit'
extract_fit_time(x, summarize = TRUE, ...)
Arguments
x |
A parsnip |
... |
Not currently used. |
parameter |
A single string for the parameter ID. |
summarize |
A logical for whether the elapsed fit time should be
returned as a single row or multiple rows. Doesn't support |
Details
Extracting the underlying engine fit can be helpful for describing the
model (via print()
, summary()
, plot()
, etc.) or for variable
importance/explainers.
However, users should not invoke the predict()
method on an extracted
model. There may be preprocessing operations that parsnip has executed on
the data prior to giving it to the model. Bypassing these can lead to errors
or silently generating incorrect predictions.
Good:
parsnip_fit %>% predict(new_data)
Bad:
parsnip_fit %>% extract_fit_engine() %>% predict(new_data)
Value
The extracted value from the parsnip object, x
, as described in the description
section.
Examples
lm_spec <- linear_reg() %>% set_engine("lm")
lm_fit <- fit(lm_spec, mpg ~ ., data = mtcars)
lm_spec
extract_spec_parsnip(lm_fit)
extract_fit_engine(lm_fit)
lm(mpg ~ ., data = mtcars)
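# extract_fit_time() reports the engine-level elapsed fit time. A hedged
# sketch (not run; availability depends on the parsnip version in use):
# extract_fit_time(lm_fit)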
Control the fit function
Description
Pass options to the fit.model_spec()
function to control its
output and computations
Usage
fit_control(verbosity = 1L, catch = FALSE)
Arguments
verbosity |
An integer to control how verbose the output is. For a
value of zero, no messages or output are shown when packages are loaded or
when the model is fit. For a value of 1, package loading is quiet but model
fits can produce output to the screen (depending on if they contain their
own |
catch |
A logical where a value of |
Details
fit_control()
is deprecated in favor of control_parsnip()
.
Value
An S3 object with class "control_parsnip" that is a named list with the results of the function call
Examples
fit_control(verbosity = 2L)
Fit a Model Specification to a Dataset
Description
fit()
and fit_xy()
take a model specification, translate the required
code by substituting arguments, and execute the model fit
routine.
Usage
## S3 method for class 'model_spec'
fit(
object,
formula,
data,
case_weights = NULL,
control = control_parsnip(),
...
)
## S3 method for class 'model_spec'
fit_xy(object, x, y, case_weights = NULL, control = control_parsnip(), ...)
Arguments
object |
An object of class |
formula |
An object of class |
data |
Optional, depending on the interface (see Details below). A data frame containing all relevant variables (e.g. outcome(s), predictors, case weights, etc). Note: when needed, a named argument should be used. |
case_weights |
An optional classed vector of numeric case weights. This
must return |
control |
A named list with elements |
... |
Not currently used; values passed here will be
ignored. Other options required to fit the model should be
passed using |
x |
A matrix, sparse matrix, or data frame of predictors. Only some
models have support for sparse matrix input. See |
y |
A vector, matrix or data frame of outcome data. |
Details
fit()
and fit_xy()
substitute the current arguments in the model
specification into the computational engine's code, check them
for validity, then fit the model using the data and the
engine-specific code. Different model functions have different
interfaces (e.g. formula or x
/y
) and these functions translate
between the interface used when fit()
or fit_xy()
was invoked and the one
required by the underlying model.
When possible, these functions attempt to avoid making copies of the
data. For example, if the underlying model uses a formula and
fit()
is invoked, the original data are referenced
when the model is fit. However, if the underlying model uses
something else, such as x
/y
, the formula is evaluated and
the data are converted to the required format. In this case, any
calls in the resulting model objects reference the temporary
objects used to fit the model.
If the model engine has not been set, the model's default engine will be used
(as discussed on each model page). If the verbosity
option of
control_parsnip()
is greater than zero, a warning will be produced.
If you would like to use an alternative method for generating contrasts when
supplying a formula to fit()
, set the global option contrasts
to your
preferred method. For example, you might set it to:
options(contrasts = c(unordered = "contr.helmert", ordered = "contr.poly"))
.
See the help page for stats::contr.treatment()
for more possible contrast
types.
For models with "censored regression"
modes, an additional computation is
executed and saved in the parsnip object. The censor_probs
element contains
a "reverse Kaplan-Meier" curve that models the probability of censoring. This
may be used later to compute inverse probability censoring weights for
performance measures.
Sparse data is supported, with the use of the x
argument in fit_xy()
. See
allow_sparse_x
column of get_encoding()
for sparse input
compatibility.
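For example, here is a hedged sketch of passing a dgCMatrix to an engine that allows sparse predictors (glmnet, assuming it is installed):
library(Matrix)
x_sp <- as(as.matrix(mtcars[, -1]), "CsparseMatrix")
linear_reg(penalty = 0.1) %>%
  set_engine("glmnet") %>%
  fit_xy(x = x_sp, y = mtcars$mpg)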
Value
A model_fit
object that contains several elements:
-
lvl
: If the outcome is a factor, this contains the factor levels at the time of model fitting. -
ordered
: If the outcome is a factor, was it an ordered factor? -
spec
: The model specification object (object
in the call to fit
) -
fit
: When the model is executed without error, this is the model object. Otherwise, it is a try-error
object with the error message. -
preproc
: Any objects needed to convert between a formula and non-formula interface (such as the terms
object)
The return value will also have a class related to the fitted model (e.g.
"_glm"
) before the base class of "model_fit"
.
See Also
set_engine()
, control_parsnip()
, model_spec
, model_fit
Examples
# Although `glm()` only has a formula interface, different
# methods for specifying the model can be used
library(dplyr)
library(modeldata)
data("lending_club")
lr_mod <- logistic_reg()
using_formula <-
lr_mod %>%
set_engine("glm") %>%
fit(Class ~ funded_amnt + int_rate, data = lending_club)
using_xy <-
lr_mod %>%
set_engine("glm") %>%
fit_xy(x = lending_club[, c("funded_amnt", "int_rate")],
y = lending_club$Class)
using_formula
using_xy
Internal functions that format predictions
Description
These are used to ensure that we have appropriate column names inside of tibbles.
Usage
format_num(x)
format_class(x)
format_classprobs(x)
format_time(x)
format_survival(x)
format_linear_pred(x)
format_hazard(x)
ensure_parsnip_format(x, col_name, overwrite = TRUE)
Arguments
x |
A data frame or vector (depending on the context and function). |
col_name |
A string for a prediction column name. |
overwrite |
A logical for whether to overwrite the column name. |
Value
A tibble
Generalized additive models (GAMs)
Description
gen_additive_mod()
defines a model that can use smoothed functions of
numeric predictors in a generalized linear model. This function can fit
classification and regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
mgcv¹
¹ The default engine.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
gen_additive_mod(
mode = "unknown",
select_features = NULL,
adjust_deg_free = NULL,
engine = "mgcv"
)
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", or "classification". |
select_features |
|
adjust_deg_free |
If |
engine |
A single character string specifying what computational engine to use for fitting. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1 gen_additive_mod(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, mgcv engine details
Examples
show_engines("gen_additive_mod")
gen_additive_mod()
Working with the parsnip model environment
Description
These functions read and write to the environment where the package stores information about model specifications.
Usage
get_model_env()
get_from_env(items)
set_in_env(...)
set_env_val(name, value)
Arguments
items |
A character string of objects in the model environment. |
... |
Named values that will be assigned to the model environment. |
name |
A single character value for a new symbol in the model environment. |
value |
A single value for a new value in the model environment. |
References
"How to build a parsnip model" https://www.tidymodels.org/learn/develop/models/
Examples
# Access the model data:
current_code <- get_model_env()
ls(envir = current_code)
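# Look up the registered engine/mode combinations for a single model:
get_from_env("linear_reg")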
Construct a single row summary "glance" of a model, fit, or other object
Description
This method glances the model in a parsnip model object, if it exists.
Usage
## S3 method for class 'model_fit'
glance(x, ...)
Arguments
x |
model or other R object to convert to single-row data frame |
... |
other arguments passed to methods |
Value
a tibble
Fit a grouped binomial outcome from a data set with case weights
Description
stats::glm()
assumes that a tabular data set with case weights corresponds
to "different observations have different dispersions" (see ?glm
).
In some cases, the case weights reflect that the same covariate pattern was
observed multiple times (i.e., frequency weights). In this case,
stats::glm()
expects the data to be formatted as the number of events for
each factor level so that the outcome can be given to the formula as
cbind(events_1, events_2)
.
glm_grouped()
converts data with integer case weights to the expected
"number of events" format for binomial data.
Usage
glm_grouped(formula, data, weights, ...)
Arguments
formula |
A formula object with one outcome that is a two-level factor. |
data |
A data frame with the outcomes and predictors (but not case weights). |
weights |
An integer vector of weights whose length is the same as the
number of rows in |
... |
Options to pass to |
Value
A object produced by stats::glm()
.
Examples
#----------------------------------------------------------------------------
# The same data set formatted three ways
# First with basic case weights that, from ?glm, are used inappropriately.
ucb_weighted <- as.data.frame(UCBAdmissions)
ucb_weighted$Freq <- as.integer(ucb_weighted$Freq)
head(ucb_weighted)
nrow(ucb_weighted)
# Format when yes/no data are in individual rows (probably still inappropriate)
library(tidyr)
ucb_long <- uncount(ucb_weighted, Freq)
head(ucb_long)
nrow(ucb_long)
# Format where the outcome is formatted as number of events
ucb_events <-
ucb_weighted %>%
tidyr::pivot_wider(
id_cols = c(Gender, Dept),
names_from = Admit,
values_from = Freq,
values_fill = 0L
)
head(ucb_events)
nrow(ucb_events)
#----------------------------------------------------------------------------
# Different model fits
# Treat data as separate Bernoulli data:
glm(Admit ~ Gender + Dept, data = ucb_long, family = binomial)
# Weights produce the same statistics
glm(
Admit ~ Gender + Dept,
data = ucb_weighted,
family = binomial,
weights = ucb_weighted$Freq
)
# Data as binomial "x events out of n trials" format. Note that, to get the same
# coefficients, the order of the levels must be reversed.
glm(
cbind(Rejected, Admitted) ~ Gender + Dept,
data = ucb_events,
family = binomial
)
# The new function starts with frequency weights and gets to the correct format:
glm_grouped(Admit ~ Gender + Dept, data = ucb_weighted, weights = ucb_weighted$Freq)
Technical aspects of the glmnet model
Description
glmnet is a popular statistical model for regularized generalized linear models. These notes reflect common questions about this particular model.
tidymodels and glmnet
The implementation of the glmnet package has some nice features. For
example, one of the main tuning parameters, the regularization penalty,
does not need to be specified when fitting the model. The package fits a
compendium of values, called the regularization path. These values
depend on the data set and the value of alpha
, the mixture parameter
between a pure ridge model (alpha = 0
) and a pure lasso model
(alpha = 1
). When predicting, any penalty values can be simultaneously
predicted, even those that are not exactly on the regularization path.
For those, the model interpolates between the closest path values to
produce a prediction. There is an argument called lambda
to the
glmnet()
function that is used to specify the path.
In the discussion below, linear_reg()
is used. The information is true
for all parsnip models that have a "glmnet"
engine.
Fitting and predicting using parsnip
Recall that tidymodels uses standardized parameter names across models
chosen to be low on jargon. The argument penalty
is the equivalent of
what glmnet calls the lambda
value and mixture
is the same as their
alpha
value.
In tidymodels, our predict()
methods are defined to make one
prediction at a time. For this model, that means predictions are for a
single penalty value. For this reason, models that have glmnet engines
require the user to always specify a single penalty value when the model
is defined. For example, for linear regression:
linear_reg(penalty = 1) %>% set_engine("glmnet")
When the predict()
method is called, it automatically uses the penalty
that was given when the model was defined. For example:
library(tidymodels) fit <- linear_reg(penalty = 1) %>% set_engine("glmnet") %>% fit(mpg ~ ., data = mtcars) # predict at penalty = 1 predict(fit, mtcars[1:3,])
## # A tibble: 3 x 1 ## .pred ## <dbl> ## 1 22.2 ## 2 21.5 ## 3 24.9
However, any penalty values can be predicted simultaneously using the
multi_predict()
method:
# predict at c(0.00, 0.01) multi_predict(fit, mtcars[1:3,], penalty = c(0.00, 0.01))
## # A tibble: 3 x 1 ## .pred ## <list> ## 1 <tibble [2 x 2]> ## 2 <tibble [2 x 2]> ## 3 <tibble [2 x 2]>
# unnested: multi_predict(fit, mtcars[1:3,], penalty = c(0.00, 0.01)) %>% add_rowindex() %>% unnest(cols = ".pred")
## # A tibble: 6 x 3 ## penalty .pred .row ## <dbl> <dbl> <int> ## 1 0 22.6 1 ## 2 0.01 22.5 1 ## 3 0 22.1 2 ## 4 0.01 22.1 2 ## 5 0 26.3 3 ## 6 0.01 26.3 3
Where did lambda
go?
It may appear odd that the lambda
value does not get used in the fit:
linear_reg(penalty = 1) %>% set_engine("glmnet") %>% translate()
## Linear Regression Model Specification (regression) ## ## Main Arguments: ## penalty = 1 ## ## Computational engine: glmnet ## ## Model fit template: ## glmnet::glmnet(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## family = "gaussian")
Internally, the value of penalty = 1
is saved in the parsnip object
and no value is set for lambda
. This enables the full path to be fit
by glmnet()
. See the section below about setting the path.
How do I set the regularization path?
Regardless of what value you use for penalty
, the full coefficient
path is used when glmnet::glmnet()
is called.
What if you want to manually set this path? Normally, you would pass a
vector to lambda
in glmnet::glmnet()
.
parsnip models that use a glmnet
engine can use a special optional
argument called path_values
. This is not an argument to
glmnet::glmnet()
; it is used by parsnip to
independently set the path.
For example, we have found that if you want a pure ridge regression
model (i.e., mixture = 0
), you can get the wrong coefficients if the
path does not contain zero (see issue #431).
If we want to use our own path, the argument is passed as an engine-specific option:
coef_path_values <- c(0, 10^seq(-5, 1, length.out = 7)) fit_ridge <- linear_reg(penalty = 1, mixture = 0) %>% set_engine("glmnet", path_values = coef_path_values) %>% fit(mpg ~ ., data = mtcars) all.equal(sort(fit_ridge$fit$lambda), coef_path_values)
## [1] TRUE
# predict at penalty = 1 predict(fit_ridge, mtcars[1:3,])
## # A tibble: 3 x 1 ## .pred ## <dbl> ## 1 22.1 ## 2 21.8 ## 3 26.6
Tidying the model object
broom::tidy()
is a function that gives a summary of
the object as a tibble.
tl;dr tidy()
on a glmnet
model produced by parsnip gives the
coefficients for the value given by penalty
.
When parsnip makes a model, it gives it an extra class. Using the tidy()
method on that object produces coefficients for the penalty that was
originally requested:
tidy(fit)
## # A tibble: 11 x 3 ## term estimate penalty ## <chr> <dbl> <dbl> ## 1 (Intercept) 35.3 1 ## 2 cyl -0.872 1 ## 3 disp 0 1 ## 4 hp -0.0101 1 ## 5 drat 0 1 ## 6 wt -2.59 1 ## # i 5 more rows
Note that there is a tidy()
method for glmnet
objects in the broom
package. If this is used directly on the underlying glmnet
object, it
returns all of the coefficients on the path:
# Use the basic tidy() method for glmnet all_tidy_coefs <- broom:::tidy.glmnet(fit$fit) all_tidy_coefs
## # A tibble: 640 x 5 ## term step estimate lambda dev.ratio ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 1 20.1 5.15 0 ## 2 (Intercept) 2 21.6 4.69 0.129 ## 3 (Intercept) 3 23.2 4.27 0.248 ## 4 (Intercept) 4 24.7 3.89 0.347 ## 5 (Intercept) 5 26.0 3.55 0.429 ## 6 (Intercept) 6 27.2 3.23 0.497 ## # i 634 more rows
length(unique(all_tidy_coefs$lambda))
## [1] 79
This can be nice for plots but it might not contain the penalty value that you are interested in.
Tools for models that predict on sub-models
Description
has_multi_predict()
tests to see if an object can make multiple
predictions on submodels from the same object. multi_predict_args()
returns the names of the arguments to multi_predict()
for this model
(if any).
Usage
has_multi_predict(object, ...)
## Default S3 method:
has_multi_predict(object, ...)
## S3 method for class 'model_fit'
has_multi_predict(object, ...)
## S3 method for class 'workflow'
has_multi_predict(object, ...)
multi_predict_args(object, ...)
## Default S3 method:
multi_predict_args(object, ...)
## S3 method for class 'model_fit'
multi_predict_args(object, ...)
## S3 method for class 'workflow'
multi_predict_args(object, ...)
Arguments
object |
An object to test. |
... |
Not currently used. |
Value
has_multi_predict()
returns a single logical value while
multi_predict_args()
returns a character vector of argument names (or NA
if none exist).
Examples
lm_model_idea <- linear_reg() %>% set_engine("lm")
has_multi_predict(lm_model_idea)
lm_model_fit <- fit(lm_model_idea, mpg ~ ., data = mtcars)
has_multi_predict(lm_model_fit)
multi_predict_args(lm_model_fit)
library(kknn)
knn_fit <-
nearest_neighbor(mode = "regression", neighbors = 5) %>%
set_engine("kknn") %>%
fit(mpg ~ ., mtcars)
multi_predict_args(knn_fit)
multi_predict(knn_fit, mtcars[1, -1], neighbors = 1:4)$.pred
Activation functions for neural networks in keras
Description
Activation functions for neural networks in keras
Usage
keras_activations()
Value
A character vector of values.
Simple interface to MLP models via keras
Description
Instead of building a keras
model sequentially, keras_mlp
can be used to
create a feedforward network with a single hidden layer. Regularization is
via either weight decay or dropout.
Usage
keras_mlp(
x,
y,
hidden_units = 5,
penalty = 0,
dropout = 0,
epochs = 20,
activation = "softmax",
seeds = sample.int(10^5, size = 3),
...
)
Arguments
x |
A data frame or matrix of predictors |
y |
A vector (factor or numeric) or matrix (numeric) of outcome data. |
hidden_units |
An integer for the number of hidden units. |
penalty |
A non-negative real number for the amount of weight decay. Either
this parameter or |
dropout |
The proportion of parameters to set to zero. Either
this parameter or |
epochs |
An integer for the number of passes through the data. |
activation |
A character string for the type of activation function between layers. |
seeds |
A vector of three positive integers to control randomness of the calculations. |
... |
Additional named arguments to pass to |
Value
A keras
model object.
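A hedged usage sketch (not run; it requires keras with a configured backend, and the argument values are only illustrative):
# keras_mlp(
#   x = as.matrix(mtcars[, -1]), y = mtcars$mpg,
#   hidden_units = 10, epochs = 5, activation = "relu"
# )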
Wrapper for keras class predictions
Description
Wrapper for keras class predictions
Usage
keras_predict_classes(object, x)
Arguments
object |
A keras model fit |
x |
A data set. |
Knit engine-specific documentation
Description
Knit engine-specific documentation
Usage
knit_engine_docs(pattern = NULL)
Arguments
pattern |
A regular expression to specify which files to knit. The default knits all engine documentation files. |
Details
This function will check whether the known parsnip extension packages, engine-specific packages, and a few other ancillary packages are installed. Users will be prompted to install anything required to create the engine documentation.
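A hedged usage sketch (not run; it re-knits only the matching documentation files):
# knit_engine_docs(pattern = "linear_reg")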
Value
A tibble with column file
for the file name and result
(a
character vector that echos the output file name or, when there is
a failure, the error message).
Linear regression
Description
linear_reg()
defines a model that can predict numeric values from
predictors using a linear function. This function can fit regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
¹ The default engine. ² Requires a parsnip extension package for regression.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
linear_reg(mode = "regression", engine = "lm", penalty = NULL, mixture = NULL)
Arguments
mode |
A single character string for the type of model. The only possible value for this model is "regression". |
engine |
A single character string specifying what computational engine
to use for fitting. Possible engines are listed below. The default for this
model is |
penalty |
A non-negative number representing the total amount of regularization (specific engines only). |
mixture |
A number between zero and one (inclusive) denoting the proportion of L1 regularization (i.e. lasso) in the model.
Available for specific engines only. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1 linear_reg(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, lm engine details
, brulee engine details
, gee engine details
, glm engine details
, glmer engine details
, glmnet engine details
, gls engine details
, h2o engine details
, keras engine details
, lme engine details
, lmer engine details
, quantreg engine details
, spark engine details
, stan engine details
, stan_glmer engine details
Examples
show_engines("linear_reg")
linear_reg()
Locate and show errors/warnings in engine-specific documentation
Description
Locate and show errors/warnings in engine-specific documentation
Usage
list_md_problems()
Value
A tibble with column file
for the file name, line
indicating
the line where the error/warning occurred, and problem
showing the
error/warning message.
Logistic regression
Description
logistic_reg()
defines a generalized linear model for binary outcomes. A
linear combination of the predictors is used to model the log odds of an
event. This function can fit classification models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
¹ The default engine. ² Requires a parsnip extension package.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
logistic_reg(
mode = "classification",
engine = "glm",
penalty = NULL,
mixture = NULL
)
Arguments
mode |
A single character string for the type of model. The only possible value for this model is "classification". |
engine |
A single character string specifying what computational engine
to use for fitting. Possible engines are listed below. The default for this
model is |
penalty |
A non-negative number representing the total
amount of regularization (specific engines only).
For |
mixture |
A number between zero and one (inclusive) giving the proportion of L1 regularization (i.e. lasso) in the model.
Available for specific engines only. For |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1 logistic_reg(argument = !!value)
This model fits a classification model for binary outcomes; for
multiclass outcomes, see multinom_reg()
.
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, glm engine details
, brulee engine details
, gee engine details
, glmer engine details
, glmnet engine details
, h2o engine details
, keras engine details
, LiblineaR engine details
, spark engine details
, stan engine details
, stan_glmer engine details
Examples
show_engines("logistic_reg")
logistic_reg()
Make a parsnip call expression
Description
Make a parsnip call expression
Usage
make_call(fun, ns, args, ...)
Arguments
fun |
A character string of a function name. |
ns |
A character string of a package name. |
args |
A named list of argument values. |
Details
The arguments are spliced into the ns::fun()
call. If they are
missing, null, or a single logical, they are not spliced.
Value
A call.
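A hedged sketch of constructing and then evaluating such a call (the argument values are only illustrative):
cl <- make_call("lm", ns = "stats", args = list(formula = mpg ~ ., data = quote(mtcars)))
cl
# eval(cl) would then fit the model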
Prepend a new class
Description
This adds an extra class to a base class of "model_spec".
Usage
make_classes(prefix)
Arguments
prefix |
A character string for a class. |
Value
A character vector.
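A hedged sketch of the expected behavior (the prefix is illustrative):
make_classes("my_model")
# c("my_model", "model_spec")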
Multivariate adaptive regression splines (MARS)
Description
mars()
defines a generalized linear model that uses artificial features for
some predictors. These features resemble hinge functions and the result is
a model that is a segmented regression in small dimensions. This function can
fit classification and regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
earth¹
¹ The default engine.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
mars(
mode = "unknown",
engine = "earth",
num_terms = NULL,
prod_degree = NULL,
prune_method = NULL
)
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", or "classification". |
engine |
A single character string specifying what computational engine to use for fitting. |
num_terms |
The number of features that will be retained in the final model, including the intercept. |
prod_degree |
The highest possible interaction degree. |
prune_method |
The pruning method. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1 mars(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, earth engine details
Examples
show_engines("mars")
mars(mode = "regression", num_terms = 5)
Reformat quantile predictions
Description
Reformat quantile predictions
Usage
matrix_to_quantile_pred(x, object)
Arguments
x |
A matrix of predictions with rows as samples and columns as quantile levels. |
object |
A parsnip |
Determine largest value of mtry from formula.
This function potentially caps the value of mtry
based on a formula and
data set. This is a safe approach for survival and/or multivariate models.
Description
Determine largest value of mtry from formula.
This function potentially caps the value of mtry
based on a formula and
data set. This is a safe approach for survival and/or multivariate models.
Usage
max_mtry_formula(mtry, formula, data)
Arguments
mtry |
An initial value of |
formula |
A model formula. |
data |
The training set (data frame). |
Value
A value for mtry
.
Examples
# should be 9
max_mtry_formula(200, cbind(wt, mpg) ~ ., data = mtcars)
Fuzzy conversions
Description
These are substitutes for as.matrix()
and as.data.frame()
that leave
a sparse matrix as-is.
Usage
maybe_matrix(x)
maybe_data_frame(x)
Arguments
x |
A data frame, matrix, or sparse matrix. |
Value
A data frame, matrix, or sparse matrix.
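A hedged sketch of the intended behavior (assuming the Matrix package):
library(Matrix)
maybe_matrix(mtcars)   # converted to a numeric matrix
x_sp <- as(as.matrix(mtcars), "CsparseMatrix")
maybe_data_frame(x_sp) # left as-is, still sparse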
Execution-time data dimension checks
Description
For some tuning parameters, the range of values depend on the data
dimensions (e.g. mtry
). Some packages will fail if the parameter values are
outside of these ranges. Since the model might receive resampled versions of
the data, these ranges can't be set prior to the point where the model is
fit. These functions check the possible range of the data and adjust them
if needed (with a warning).
Usage
min_cols(num_cols, source)
min_rows(num_rows, source, offset = 0)
Arguments
num_cols , num_rows |
The parameter value requested by the user. |
source |
A data frame for the data to be used in the fit. If the source is named "data", it is assumed that one column of the data corresponds to an outcome (and is subtracted off). |
offset |
A number subtracted off of the number of rows available in the data. |
Value
An integer (and perhaps a warning).
Examples
nearest_neighbor(neighbors = 100) %>%
set_engine("kknn") %>%
set_mode("regression") %>%
translate()
library(ranger)
rand_forest(mtry = 2, min_n = 100, trees = 3) %>%
set_engine("ranger") %>%
set_mode("regression") %>%
fit(mpg ~ ., data = mtcars)
Single layer neural network
Description
mlp()
defines a multilayer perceptron model (a.k.a. a single layer,
feed-forward neural network). This function can fit classification and
regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
¹ The default engine. ² Requires a parsnip extension package for classification and regression.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
mlp(
mode = "unknown",
engine = "nnet",
hidden_units = NULL,
penalty = NULL,
dropout = NULL,
epochs = NULL,
activation = NULL,
learn_rate = NULL
)
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", or "classification". |
engine |
A single character string specifying what computational engine to use for fitting. |
hidden_units |
An integer for the number of units in the hidden layer. |
penalty |
A non-negative numeric value for the amount of weight decay. |
dropout |
A number between 0 (inclusive) and 1 denoting the proportion of model parameters randomly set to zero during model training. |
epochs |
An integer for the number of training iterations. |
activation |
A single character string denoting the type of relationship between the original predictors and the hidden unit layer. The activation function between the hidden and output layers is automatically set to either "linear" or "softmax" depending on the type of outcome. Possible values depend on the engine being used. |
learn_rate |
A number for the rate at which the boosting algorithm adapts from iteration-to-iteration (specific engines only). This is sometimes referred to as the shrinkage parameter. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1 mlp(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, nnet engine details
, brulee engine details
, brulee_two_layer engine details
, h2o engine details
, keras engine details
Examples
show_engines("mlp")
mlp(mode = "classification", penalty = 0.01)
parsnip model specification database
Description
This is used in the RStudio add-in and captures information about model specifications in various R packages.
Value
model_db |
a data frame |
Examples
data(model_db)
Model Fit Objects
Description
Model fits are trained model specifications that are
ready to predict on new data. Model fits have class
model_fit
and, usually, a subclass referring to the engine
used to fit the model.
Details
An object with class "model_fit"
is a container for
information about a model that has been fit to the data.
The main elements of the object are:
-
lvl
: A vector of factor levels when the outcome is a factor. This is NULL when the outcome is not a factor vector. -
spec
: A model_spec
object. -
fit
: The object produced by the fitting function. -
preproc
: This contains any data-specific information required to process a new sample point for prediction. For example, if the underlying model function requires arguments x and y and the user passed a formula to fit, the preproc object would contain items such as the terms object and so on. When no information is required, this is NA.
As discussed in the documentation for model_spec
, the
original arguments to the specification are saved as quosures.
These are evaluated for the model_fit
object prior to fitting.
If the resulting model object prints its call, any user-defined
options are shown in the call preceded by a tilde (see the
example below). This is a result of the use of quosures in the
specification.
This class and structure is the basis for how parsnip stores model objects after seeing the data and applying a model.
Examples
# Keep the `x` matrix if the data are not too big.
spec_obj <-
linear_reg() %>%
set_engine("lm", x = ifelse(.obs() < 500, TRUE, FALSE))
spec_obj
fit_obj <- fit(spec_obj, mpg ~ ., data = mtcars)
fit_obj
nrow(fit_obj$fit$x)
Formulas with special terms in tidymodels
Description
In R, formulas provide a compact, symbolic notation to specify model terms.
Many modeling functions in R make use of "specials",
or nonstandard notations used in formulas. Specials are defined and handled as
a special case by a given modeling package. For example, the mgcv package,
which provides support for
generalized additive models in R, defines a
function s()
to be in-lined into formulas. It can be used like so:
mgcv::gam(mpg ~ wt + s(disp, k = 5), data = mtcars)
In this example, the s()
special defines a smoothing term that the mgcv
package knows to look for when preprocessing model input.
The parsnip package can handle most specials without issue. The analogous code for specifying this generalized additive model with the parsnip "mgcv" engine looks like:
gen_additive_mod() %>% set_mode("regression") %>% set_engine("mgcv") %>% fit(mpg ~ wt + s(disp, k = 5), data = mtcars)
However, parsnip is often used in conjunction with the greater tidymodels package ecosystem, which defines its own pre-processing infrastructure and functionality via packages like hardhat and recipes. The specials defined in many modeling packages introduce conflicts with that infrastructure.
To support specials while also maintaining consistent syntax elsewhere in the ecosystem, tidymodels delineates between two types of formulas: preprocessing formulas and model formulas. Preprocessing formulas specify the input variables, while model formulas determine the model structure.
Example
To create the preprocessing formula from the model formula, just remove the specials, retaining references to input variables themselves. For example:
model_formula <- mpg ~ wt + s(disp, k = 5) preproc_formula <- mpg ~ wt + disp
-
With parsnip, use the model formula:
model_spec <- gen_additive_mod() %>% set_mode("regression") %>% set_engine("mgcv") model_spec %>% fit(model_formula, data = mtcars)
-
With recipes, use the preprocessing formula only:
library(recipes) recipe(preproc_formula, mtcars)
The recipes package supplies a large variety of preprocessing techniques that may replace the need for specials altogether, in some cases.
-
With workflows, use the preprocessing formula everywhere, but pass the model formula to the
formula
argument in add_model():
library(workflows) wflow <- workflow() %>% add_formula(preproc_formula) %>% add_model(model_spec, formula = model_formula) fit(wflow, data = mtcars)
The workflow will then pass the model formula to parsnip, using the preprocessor formula elsewhere. We would still use the preprocessing formula if we had added a recipe preprocessor using
add_recipe()
instead of a formula via add_formula()
.
Print helper for model objects
Description
A common format function that prints information about the model object (e.g. arguments, calls, packages, etc).
Usage
model_printer(x, ...)
Arguments
x |
A model object. |
... |
Not currently used. |
Model Specifications
Description
The parsnip package splits the process of fitting models into two steps:
Specify how a model will be fit using a model specification
Fit a model using the model specification
This is a different approach to many other model interfaces in R, like lm()
,
where both the specification of the model and the fitting happens in one
function call. Splitting the process into two steps allows users to
iteratively define model specifications throughout the model development
process.
This intermediate object that defines how the model will be fit is called
a model specification and has class model_spec
. Model type functions,
like linear_reg()
or boost_tree()
, return model_spec
objects.
Fitted model objects, resulting from passing a model_spec
to
fit() or fit_xy(), have
class model_fit
, and contain the original model_spec
objects inside
them. See ?model_fit for more on that object type, and
?extract_spec_parsnip to
extract model_spec
s from model_fit
s.
Details
An object with class "model_spec"
is a container for
information about a model that will be fit.
The main elements of the object are:
-
args
: A vector of the main arguments for the model. The names of these arguments may be different from their counterparts in the underlying model function. For example, for a glmnet model, the argument name for the amount of the penalty is called "penalty" instead of "lambda" to make it more general and usable across different types of models (and to not be specific to a particular model function). The elements of args can be marked for optimization with tune() from the tune package. For more information see https://www.tidymodels.org/start/tuning/. If left to their defaults (NULL), the arguments will use the underlying model function's default values. As discussed below, the arguments in args are captured as quosures and are not immediately executed. -
- ...: Optional model-function-specific parameters. As with args, these are quosures and can be marked for tuning with tune().
- mode: The type of model, such as "regression" or "classification". Other modes will be added once the package adds more functionality.
- method: This is a slot that is filled in later by the model's constructor function. It generally contains lists of information that are used to create the fit and prediction code, as well as required packages and similar data.
- engine: This character string declares exactly what software will be used. It can be a package name or a technology type.
This class and structure are the basis for how parsnip stores model objects prior to seeing the data.
Argument Details
An important detail to understand when creating model specifications is that they are intended to be functionally independent of the data. While it is true that some tuning parameters are data dependent, the model specification does not interact with the data at all.
Most R functions immediately evaluate their arguments. For example, when calling mean(dat_vec), the object
, the object
dat_vec
is immediately evaluated inside of the function.
parsnip model functions do not do this. For example, using
rand_forest(mtry = ncol(mtcars) - 1)
does not execute ncol(mtcars) - 1
when creating the specification.
This can be seen in the output:
> rand_forest(mtry = ncol(mtcars) - 1)
Random Forest Model Specification (unknown)

Main Arguments:
  mtry = ncol(mtcars) - 1
The model functions save the argument expressions and their
associated environments (a.k.a. a quosure) to be evaluated later
when either fit.model_spec()
or fit_xy.model_spec()
are
called with the actual data.
The consequence of this strategy is that any data required to get the parameter values must be available when the model is fit. The two main ways that this can fail are if:
The data have been modified between the creation of the model specification and when the model fit function is invoked.
If the model specification is saved and loaded into a new session where those same data objects do not exist.
The best way to avoid these issues is to not reference any data
objects in the global environment but to use data descriptors
such as .cols()
. Another way of writing the previous
specification is
rand_forest(mtry = .cols() - 1)
This is not dependent on any specific data object and is evaluated immediately before the model fitting process begins.
One less advantageous approach to solving this issue is to use quasiquotation. This would insert the actual R object into the model specification and might be the best idea when the data object is small. For example, using
rand_forest(mtry = ncol(!!mtcars) - 1)
would work (and be reproducible between sessions) but embeds
the entire mtcars data set into the mtry
expression:
> rand_forest(mtry = ncol(!!mtcars) - 1)
Random Forest Model Specification (unknown)

Main Arguments:
  mtry = ncol(structure(list(mpg = c(21, 21, 22.8, 21.4, 18.7, <snip>
However, if there were an object with the number of columns in it, this wouldn't be too bad:
> mtry_val <- ncol(mtcars) - 1
> mtry_val
[1] 10
> rand_forest(mtry = !!mtry_val)
Random Forest Model Specification (unknown)

Main Arguments:
  mtry = 10
More information on quosures and quasiquotation can be found at https://adv-r.hadley.nz/quasiquotation.html.
Model predictions across many sub-models
Description
For some models, predictions can be made on sub-models in the model object.
Usage
multi_predict(object, ...)
## Default S3 method:
multi_predict(object, ...)
## S3 method for class '_xgb.Booster'
multi_predict(object, new_data, type = NULL, trees = NULL, ...)
## S3 method for class '_C5.0'
multi_predict(object, new_data, type = NULL, trees = NULL, ...)
## S3 method for class '_elnet'
multi_predict(object, new_data, type = NULL, penalty = NULL, ...)
## S3 method for class '_lognet'
multi_predict(object, new_data, type = NULL, penalty = NULL, ...)
## S3 method for class '_multnet'
multi_predict(object, new_data, type = NULL, penalty = NULL, ...)
## S3 method for class '_glmnetfit'
multi_predict(object, new_data, type = NULL, penalty = NULL, ...)
## S3 method for class '_earth'
multi_predict(object, new_data, type = NULL, num_terms = NULL, ...)
## S3 method for class '_torch_mlp'
multi_predict(object, new_data, type = NULL, epochs = NULL, ...)
## S3 method for class '_train.kknn'
multi_predict(object, new_data, type = NULL, neighbors = NULL, ...)
Arguments
object |
A model fit. |
... |
Optional arguments to pass to |
new_data |
A rectangular data object, such as a data frame. |
type |
A single character value or |
trees |
An integer vector for the number of trees in the ensemble. |
penalty |
A numeric vector of penalty values. |
num_terms |
An integer vector for the number of MARS terms to retain. |
epochs |
An integer vector for the number of training epochs. |
neighbors |
An integer vector for the number of nearest neighbors. |
Value
A tibble with the same number of rows as the data being predicted.
There is a list-column named .pred
that contains tibbles with
multiple rows per sub-model. Note that, within the tibbles, the column names
follow the usual standard based on prediction type
(i.e. .pred_class
for
type = "class"
and so on).
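As an illustration, here is a minimal sketch of sub-model prediction with a glmnet linear regression fit; it assumes the glmnet package is installed. A single fit is reused to predict at several penalty values:

library(parsnip)

lin_fit <-
  linear_reg(penalty = 0.01) %>%
  set_engine("glmnet") %>%
  fit(mpg ~ ., data = mtcars)

# One row per observation; the `.pred` list-column holds a tibble with
# one row per penalty value
multi_predict(lin_fit, new_data = mtcars[1:3, -1], penalty = c(0.01, 0.1))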
Multinomial regression
Description
multinom_reg()
defines a model that uses linear predictors to predict
multiclass data using the multinomial distribution. This function can fit
classification models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
¹ The default engine. ² Requires a parsnip extension package.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
multinom_reg(
mode = "classification",
engine = "nnet",
penalty = NULL,
mixture = NULL
)
Arguments
mode |
A single character string for the type of model. The only possible value for this model is "classification". |
engine |
A single character string specifying what computational engine
to use for fitting. Possible engines are listed below. The default for this
model is |
penalty |
A non-negative number representing the total
amount of regularization (specific engines only).
For |
mixture |
A number between zero and one (inclusive) giving the proportion of L1 regularization (i.e. lasso) in the model.
Available for specific engines only. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1
multinom_reg(argument = !!value)
This model fits a classification model for multiclass outcomes; for
binary outcomes, see logistic_reg()
.
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, nnet engine details
, brulee engine details
, glmnet engine details
, h2o engine details
, keras engine details
, spark engine details
Examples
show_engines("multinom_reg")
multinom_reg()
Naive Bayes models
Description
naive_Bayes()
defines a model that uses Bayes' theorem to compute the
probability of each class, given the predictor values. This function can fit
classification models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
¹ The default engine. ² Requires a parsnip extension package.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
naive_Bayes(
mode = "classification",
smoothness = NULL,
Laplace = NULL,
engine = "klaR"
)
Arguments
mode |
A single character string for the prediction outcome mode. The only possible value for this model is "classification". |
smoothness |
A non-negative number representing the relative smoothness of the class boundary. Smaller values result in more flexible boundaries and larger values generate class boundaries that are less adaptable. |
Laplace |
A non-negative value for the Laplace correction to smoothing low-frequency counts. |
engine |
A single character string specifying what computational engine to use for fitting. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1
naive_Bayes(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, klaR engine details
, h2o engine details
, naivebayes engine details
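Examples

A hedged sketch; specifying and printing the model works on its own, although fitting with the default "klaR" engine requires the discrim extension package to be loaded:

show_engines("naive_Bayes")

naive_Bayes(smoothness = 1.2)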
K-nearest neighbors
Description
nearest_neighbor()
defines a model that uses the K
most similar data
points from the training set to predict new samples. This function can
fit classification and regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
kknn¹
¹ The default engine.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
nearest_neighbor(
mode = "unknown",
engine = "kknn",
neighbors = NULL,
weight_func = NULL,
dist_power = NULL
)
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", or "classification". |
engine |
A single character string specifying what computational engine to use for fitting. |
neighbors |
A single integer for the number of neighbors
to consider (often called |
weight_func |
A single character for the type of kernel function used
to weight distances between samples. Valid choices are: |
dist_power |
A single number for the parameter used in calculating Minkowski distance. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1
nearest_neighbor(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, kknn engine details
Examples
show_engines("nearest_neighbor")
nearest_neighbor(neighbors = 11)
Null model
Description
null_model()
defines a simple, non-informative model. It doesn't have any
main arguments. This function can fit classification and regression models.
Usage
null_model(mode = "classification", engine = "parsnip")
Arguments
mode |
A single character string for the type of model. The only
possible values for this model are |
engine |
A single character string specifying what computational engine
to use for fitting. Possible engines are listed below. The default for this
model is |
Engine Details
Engines may have pre-set default arguments when executing the model fit call. For this type of model, the templates of the fit calls are below:
parsnip
null_model() %>%
  set_engine("parsnip") %>%
  set_mode("regression") %>%
  translate()

## Null Model Specification (regression)
##
## Computational engine: parsnip
##
## Model fit template:
## parsnip::nullmodel(x = missing_arg(), y = missing_arg())

null_model() %>%
  set_engine("parsnip") %>%
  set_mode("classification") %>%
  translate()

## Null Model Specification (classification)
##
## Computational engine: parsnip
##
## Model fit template:
## parsnip::nullmodel(x = missing_arg(), y = missing_arg())
Examples
null_model(mode = "regression")
Functions required for parsnip-adjacent packages
Description
These functions are helpful when creating new packages that will register new model specifications.
Usage
null_value(x)
show_fit(model, eng)
check_args(object, call = rlang::caller_env())
update_dot_check(...)
new_model_spec(
cls,
args,
eng_args,
mode,
user_specified_mode = TRUE,
method,
engine,
user_specified_engine = TRUE
)
check_final_param(x, call = rlang::caller_env())
update_main_parameters(args, param, call = rlang::caller_env())
update_engine_parameters(eng_args, fresh, ...)
print_model_spec(x, cls = class(x)[1], desc = get_model_desc(cls), ...)
update_spec(
object,
parameters,
args_enquo_list,
fresh,
cls,
...,
call = caller_env()
)
is_varying(x)
Fit a simple, non-informative model
Description
Fit a single mean or largest class model. nullmodel()
is the underlying
computational function for the null_model()
specification.
Usage
nullmodel(x, ...)
## Default S3 method:
nullmodel(x = NULL, y, ...)
## S3 method for class 'nullmodel'
print(x, ...)
## S3 method for class 'nullmodel'
predict(object, new_data = NULL, type = NULL, ...)
Arguments
x |
An optional matrix or data frame of predictors. These values are not used in the model fit |
... |
Optional arguments (not yet used) |
y |
A numeric vector (for regression) or factor (for classification) of outcomes |
object |
An object of class |
new_data |
A matrix or data frame of predictors (only used to determine the number of predictions to return) |
type |
Either "raw" (for regression), "class" or "prob" (for classification) |
Details
nullmodel()
emulates other model building functions, but returns the
simplest model possible given a training set: a single mean for numeric
outcomes and the most prevalent class for factor outcomes. When class
probabilities are requested, the percentage of the training set samples with
the most prevalent class is returned.
Value
The output of nullmodel()
is a list of class nullmodel
with elements
call |
the function call |
value |
the mean of
|
levels |
when |
pct |
when |
n |
the number of elements in |
predict.nullmodel()
returns either a factor or numeric vector
depending on the class of y
. All predictions are always the same.
Examples
outcome <- factor(sample(letters[1:2],
size = 100,
prob = c(.1, .9),
replace = TRUE))
useless <- nullmodel(y = outcome)
useless
predict(useless, matrix(NA, nrow = 5))
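For a numeric outcome, a small sketch showing that the prediction is always the training-set mean:

num_outcome <- rnorm(100, mean = 10)
reg_mod <- nullmodel(y = num_outcome)
reg_mod

# Three identical predictions, all equal to mean(num_outcome)
predict(reg_mod, matrix(NA, nrow = 3))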
Start an RStudio Addin that can write model specifications
Description
parsnip_addin()
starts a process in the RStudio IDE Viewer window
that allows users to write code for parsnip model specifications from
various R packages. The new code is written to the current document at the
location of the cursor.
Usage
parsnip_addin()
Partial least squares (PLS)
Description
pls()
defines a partial least squares model that uses latent variables to
model the data. It is similar to a supervised version of principal component analysis.
This function can fit classification and regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
mixOmics¹²
¹ The default engine. ² Requires a parsnip extension package for classification and regression.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
pls(
mode = "unknown",
predictor_prop = NULL,
num_comp = NULL,
engine = "mixOmics"
)
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", or "classification". |
predictor_prop |
The maximum proportion of original predictors that can have non-zero coefficients for each PLS component (via regularization). This value is used for all PLS components for X. |
num_comp |
The number of PLS components to retain. |
engine |
A single character string specifying what computational engine to use for fitting. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1
pls(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, mixOmics engine details
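Examples

A hedged sketch of specifying a PLS regression; the "mixOmics" engine is registered by the plsmod extension package:

library(plsmod)

pls(num_comp = 2, predictor_prop = 0.5) %>%
  set_mode("regression")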
Poisson regression models
Description
poisson_reg()
defines a generalized linear model for count data that follow
a Poisson distribution. This function can fit regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
¹ The default engine. ² Requires a parsnip extension package.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
poisson_reg(
mode = "regression",
penalty = NULL,
mixture = NULL,
engine = "glm"
)
Arguments
mode |
A single character string for the type of model. The only possible value for this model is "regression". |
penalty |
A non-negative number representing the total
amount of regularization ( |
mixture |
A number between zero and one (inclusive) giving the proportion of L1 regularization (i.e. lasso) in the model.
Available for |
engine |
A single character string specifying what computational engine to use for fitting. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1
poisson_reg(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, glm engine details
, gee engine details
, glmer engine details
, glmnet engine details
, h2o engine details
, hurdle engine details
, stan engine details
, stan_glmer engine details
, zeroinfl engine details
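Examples

A hedged sketch; the engines for poisson_reg() are registered by the poissonreg extension package:

library(poissonreg)

show_engines("poisson_reg")

poisson_reg() %>%
  set_engine("glm")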
Other predict methods.
Description
These are internal functions not meant to be directly called by the user.
Usage
## S3 method for class 'model_fit'
predict_class(object, new_data, ...)
## S3 method for class 'model_fit'
predict_classprob(object, new_data, ...)
## S3 method for class 'model_fit'
predict_hazard(object, new_data, eval_time, time = deprecated(), ...)
## S3 method for class 'model_fit'
predict_confint(object, new_data, level = 0.95, std_error = FALSE, ...)
predict_confint(object, ...)
predict_predint(object, ...)
## S3 method for class 'model_fit'
predict_predint(object, new_data, level = 0.95, std_error = FALSE, ...)
predict_predint(object, ...)
## S3 method for class 'model_fit'
predict_linear_pred(object, new_data, ...)
predict_linear_pred(object, ...)
## S3 method for class 'model_fit'
predict_numeric(object, new_data, ...)
predict_numeric(object, ...)
## S3 method for class 'model_fit'
predict_quantile(
object,
new_data,
quantile_levels = NULL,
quantile = deprecated(),
interval = "none",
level = 0.95,
...
)
## S3 method for class 'model_fit'
predict_survival(
object,
new_data,
eval_time,
time = deprecated(),
interval = "none",
level = 0.95,
...
)
predict_survival(object, ...)
## S3 method for class 'model_fit'
predict_time(object, new_data, ...)
predict_time(object, ...)
Arguments
object |
A model fit. |
new_data |
A rectangular data object, such as a data frame. |
... |
Additional
|
level |
A single numeric value between zero and one for the interval estimates. |
std_error |
A single logical for whether the standard error should be returned (assuming that the model can compute it). |
quantile, quantile_levels |
A vector of values between 0 and 1 for the
quantile to be predicted. If the model has a |
Model predictions
Description
Apply a model to create different types of predictions.
predict()
can be used for all types of models and uses the
"type" argument for more specificity.
Usage
## S3 method for class 'model_fit'
predict(object, new_data, type = NULL, opts = list(), ...)
## S3 method for class 'model_fit'
predict_raw(object, new_data, opts = list(), ...)
predict_raw(object, ...)
Arguments
object |
A model fit. |
new_data |
A rectangular data object, such as a data frame. |
type |
A single character value or |
opts |
A list of optional arguments to the underlying
predict function that will be used when |
... |
Additional
|
Details
For type = NULL, predict() uses:

- type = "numeric" for regression models,
- type = "class" for classification, and
- type = "time" for censored regression.
Interval predictions
When using type = "conf_int"
and type = "pred_int"
, the options
level
and std_error
can be used. The latter is a logical for an
extra column of standard error values (if available).
Censored regression predictions
For censored regression, a numeric vector for eval_time
is required when
survival or hazard probabilities are requested. The time values are required
to be unique, finite, non-missing, and non-negative. The predict()
functions will adjust the values to fit this specification by removing
offending points (with a warning).
predict.model_fit()
does not require the outcome to be present. For
performance metrics on the predicted survival probability, inverse probability
of censoring weights (IPCW) are required (see the tidymodels.org
reference
below). Those require the outcome and are thus not returned by predict()
.
They can be added via augment.model_fit()
if new_data
contains a column
with the outcome as a Surv
object.
Also, when type = "linear_pred"
, censored regression models will by default
be formatted such that the linear predictor increases with time. This may
have the opposite sign as what the underlying model's predict()
method
produces. Set increasing = FALSE
to suppress this behavior.
Value
With the exception of type = "raw", the result of predict.model_fit() is a tibble that has as many rows as there are rows in new_data and has standardized column names, as described below:
For type = "numeric"
, the tibble has a .pred
column for a single
outcome and .pred_Yname
columns for a multivariate outcome.
For type = "class"
, the tibble has a .pred_class
column.
For type = "prob"
, the tibble has .pred_classlevel
columns.
For type = "conf_int"
and type = "pred_int"
, the tibble has
.pred_lower
and .pred_upper
columns with an attribute for
the confidence level. In the case where intervals can be
produced for class probabilities (or other non-scalar outputs),
the columns are named .pred_lower_classlevel
and so on.
For type = "quantile"
, the tibble has a .pred
column, which is
a list-column. Each list element contains a tibble with columns
.pred
and .quantile
(and perhaps other columns).
For type = "time"
, the tibble has a .pred_time
column.
For type = "survival"
, the tibble has a .pred
column, which is
a list-column. Each list element contains a tibble with columns
.eval_time
and .pred_survival
(and perhaps other columns).
For type = "hazard"
, the tibble has a .pred
column, which is
a list-column. Each list element contains a tibble with columns
.eval_time
and .pred_hazard
(and perhaps other columns).
Using type = "raw"
with predict.model_fit()
will return
the unadulterated results of the prediction function.
In the case of Spark-based models, the same convention is used, except that 1) no dots appear in column names (since Spark table columns cannot contain dots) and 2) vectors are never returned, only type-specific prediction columns.
When the model fit failed and the error was captured, the
predict()
function will return the same structure as above but
filled with missing values. This does not currently work for
multivariate models.
References
https://www.tidymodels.org/learn/statistics/survival-metrics/
Examples
library(dplyr)
lm_model <-
linear_reg() %>%
set_engine("lm") %>%
fit(mpg ~ ., data = mtcars %>% dplyr::slice(11:32))
pred_cars <-
mtcars %>%
dplyr::slice(1:10) %>%
dplyr::select(-mpg)
predict(lm_model, pred_cars)
predict(
lm_model,
pred_cars,
type = "conf_int",
level = 0.90
)
predict(
lm_model,
pred_cars,
type = "raw",
opts = list(type = "terms")
)
Prepare data based on parsnip encoding information
Description
Prepare data based on parsnip encoding information
Usage
prepare_data(object, new_data)
Arguments
object |
A parsnip model object |
new_data |
A data frame |
Value
A data frame or matrix
Proportional hazards regression
Description
proportional_hazards()
defines a model for the hazard function
as a multiplicative function of covariates times a baseline hazard. This
function can fit censored regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
¹ The default engine. ² Requires a parsnip extension package.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
proportional_hazards(
mode = "censored regression",
engine = "survival",
penalty = NULL,
mixture = NULL
)
Arguments
mode |
A single character string for the prediction outcome mode. The only possible value for this model is "censored regression". |
engine |
A single character string specifying what computational engine to use for fitting. |
penalty |
A non-negative number representing the total amount of regularization (specific engines only). |
mixture |
A number between zero and one (inclusive) denoting the proportion of L1 regularization (i.e. lasso) in the model.
Available for specific engines only. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1
proportional_hazards(argument = !!value)
Since survival models typically involve censoring (and require the use of
survival::Surv()
objects), the fit.model_spec()
function will require that the
survival model be specified via the formula interface.
Proportional hazards models include the Cox model.
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, survival engine details
, glmnet engine details
Examples
show_engines("proportional_hazards")
proportional_hazards(mode = "censored regression")
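Because the formula interface is required, a fit sketch looks like the following; this assumes the censored extension package (which registers the censored regression engines) and the survival package are installed:

library(censored)
library(survival)

proportional_hazards() %>%
  set_engine("survival") %>%
  fit(Surv(time, status) ~ age + sex, data = lung)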
Random forest
Description
rand_forest()
defines a model that creates a large number of decision
trees, each independent of the others. The final prediction uses all
predictions from the individual trees and combines them. This function can fit
classification, regression, and censored regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
¹ The default engine. ² Requires a parsnip extension package for censored regression, classification, and regression.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
rand_forest(
mode = "unknown",
engine = "ranger",
mtry = NULL,
trees = NULL,
min_n = NULL
)
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", "classification", or "censored regression". |
engine |
A single character string specifying what computational engine to use for fitting. |
mtry |
An integer for the number of predictors that will be randomly sampled at each split when creating the tree models. |
trees |
An integer for the number of trees contained in the ensemble. |
min_n |
An integer for the minimum number of data points in a node that are required for the node to be split further. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1
rand_forest(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, ranger engine details
, aorsf engine details
, h2o engine details
, partykit engine details
, randomForest engine details
, spark engine details
Examples
show_engines("rand_forest")
rand_forest(mode = "classification", trees = 2000)
Objects exported from other packages
Description
These objects are imported from other packages. Follow the links below to see their documentation.
- generics: augment, fit, fit_xy, glance, required_pkgs, tidy, varying_args
- ggplot2: autoplot
- hardhat: contr_one_hot, extract_fit_engine, extract_fit_time, extract_parameter_dials, extract_parameter_set_dials, extract_spec_parsnip, frequency_weights, importance_weights, tune
- magrittr: %>%
Repair a model call object
Description
When the user passes a formula to fit()
and the underlying model function
uses a formula, the call object produced by fit()
may not be usable by
other functions. For example, some arguments may still be quosures and the
data
portion of the call will not correspond to the original data.
Usage
repair_call(x, data)
Arguments
x |
A fitted parsnip model. An error will occur if the underlying model
does not have a |
data |
A data object that is relevant to the call. In most cases, this is the data frame that was given to parsnip for the model fit (i.e., the training set data). The name of this data object is inserted into the call. |
Details
repair_call() can adjust the model object's call to be usable by other functions and methods.
Value
A modified parsnip
fitted model.
Examples
fitted_model <-
linear_reg() %>%
set_engine("lm", model = TRUE) %>%
fit(mpg ~ ., data = mtcars)
# In this call, note that `data` is not `mtcars` and the `model = ~TRUE`
# indicates that the `model` argument is an rlang quosure.
fitted_model$fit$call
# All better:
repair_call(fitted_model, mtcars)$fit$call
Determine required packages for a model
Description
Usage
req_pkgs(x, ...)
Arguments
x |
A model specification or fit. |
... |
Not used. |
Details
This function has been deprecated in favor of required_pkgs()
.
Value
A character string of package names (if any).
Determine required packages for a model
Description
Determine required packages for a model
Usage
## S3 method for class 'model_spec'
required_pkgs(x, infra = TRUE, ...)
## S3 method for class 'model_fit'
required_pkgs(x, infra = TRUE, ...)
Arguments
x |
A model specification or fit. |
infra |
Should parsnip itself be included in the result? |
... |
Not used. |
Value
A character vector
Examples
should_fail <- try(required_pkgs(linear_reg(engine = NULL)), silent = TRUE)
should_fail
linear_reg() %>%
set_engine("glmnet") %>%
required_pkgs()
linear_reg() %>%
set_engine("glmnet") %>%
required_pkgs(infra = FALSE)
linear_reg() %>%
set_engine("lm") %>%
fit(mpg ~ ., data = mtcars) %>%
required_pkgs()
RuleFit models
Description
rule_fit()
defines a model that derives simple feature rules from a tree
ensemble and uses them as features in a regularized model. This function can
fit classification and regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
¹ The default engine. ² Requires a parsnip extension package for classification and regression.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
rule_fit(
mode = "unknown",
mtry = NULL,
trees = NULL,
min_n = NULL,
tree_depth = NULL,
learn_rate = NULL,
loss_reduction = NULL,
sample_size = NULL,
stop_iter = NULL,
penalty = NULL,
engine = "xrf"
)
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", or "classification". |
mtry |
A number for the number (or proportion) of predictors that will be randomly sampled at each split when creating the tree models (specific engines only). |
trees |
An integer for the number of trees contained in the ensemble. |
min_n |
An integer for the minimum number of data points in a node that is required for the node to be split further. |
tree_depth |
An integer for the maximum depth of the tree (i.e. number of splits) (specific engines only). |
learn_rate |
A number for the rate at which the boosting algorithm adapts from iteration-to-iteration (specific engines only). This is sometimes referred to as the shrinkage parameter. |
loss_reduction |
A number for the reduction in the loss function required to split further (specific engines only). |
sample_size |
A number for the number (or proportion) of data that is
exposed to the fitting routine. For |
stop_iter |
The number of iterations without improvement before stopping (specific engines only). |
penalty |
L1 regularization parameter. |
engine |
A single character string specifying what computational engine to use for fitting. |
Details
The RuleFit model creates a regression model of rules in two stages. The first stage uses a tree-based model to generate a set of rules that can be filtered, modified, and simplified. These rules are then added as predictors to a regularized generalized linear model that can also conduct feature selection during model training.
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1
rule_fit(argument = !!value)
References
Friedman, J. H., and Popescu, B. E. (2008). "Predictive learning via rule ensembles." The Annals of Applied Statistics, 2(3), 916-954.
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
xrf::xrf.formula()
, fit()
, set_engine()
, update()
, xrf engine details
, h2o engine details
Examples
show_engines("rule_fit")
rule_fit()
Change elements of a model specification
Description
set_args()
can be used to modify the arguments of a model specification while
set_mode()
is used to change the model's mode.
Usage
set_args(object, ...)
set_mode(object, mode, ...)
## S3 method for class 'model_spec'
set_mode(object, mode, quantile_levels = NULL, ...)
Arguments
object |
|
... |
One or more named model arguments. |
mode |
A character string for the model type (e.g. "classification" or "regression") |
quantile_levels |
A vector of values between zero and one (only for the
|
Details
set_args()
will replace existing values of the arguments.
Value
An updated model object.
Examples
rand_forest()
rand_forest() %>%
set_args(mtry = 3, importance = TRUE) %>%
set_mode("regression")
linear_reg() %>%
set_mode("quantile regression", quantile_levels = c(0.2, 0.5, 0.8))
Declare a computational engine and specific arguments
Description
set_engine()
is used to specify which package or system will be used
to fit the model, along with any arguments specific to that software.
Usage
set_engine(object, engine, ...)
Arguments
object |
|
engine |
A character string for the software that should be used to fit the model. This is highly dependent on the type of model (e.g. linear regression, random forest, etc.). |
... |
Any optional arguments associated with the chosen computational
engine. These are captured as quosures and can be tuned with |
Details
In parsnip,

- the model type differentiates basic modeling approaches, such as random forests, logistic regression, linear support vector machines, etc.,
- the mode denotes in what kind of modeling context it will be used (most commonly, classification or regression), and
- the computational engine indicates how the model is fit, such as with a specific R package implementation or even methods outside of R like Keras or Stan.
Use show_engines()
to get a list of possible engines for the model of
interest.
Modeling functions in parsnip separate model arguments into two categories:

- Main arguments are more commonly used and tend to be available across engines. These names are standardized to work with different engines in a consistent way, so you can use the parsnip main argument trees instead of the heterogeneous arguments for this parameter from the ranger and randomForest packages (num.trees and ntree, respectively). Set these in your model type function, like rand_forest(trees = 2000).
- Engine arguments are either specific to a particular engine or used more rarely; there is no change for these argument names from the underlying engine. The ... argument of set_engine() allows any engine-specific argument to be passed directly to the engine fitting function, like set_engine("ranger", importance = "permutation").
Value
An updated model specification.
Examples
# First, set main arguments using the standardized names
logistic_reg(penalty = 0.01, mixture = 1/3) %>%
# Now specify how you want to fit the model with another argument
set_engine("glmnet", nlambda = 10) %>%
translate()
# Many models have possible engine-specific arguments
decision_tree(tree_depth = 5) %>%
set_engine("rpart", parms = list(prior = c(.65,.35))) %>%
set_mode("classification") %>%
translate()
Tools to Register Models
Description
These functions are similar to constructors and can be used to validate that there are no conflicts with the underlying model structures used by the package.
Usage
set_new_model(model)
set_model_mode(model, mode)
set_model_engine(model, mode, eng)
set_model_arg(model, eng, parsnip, original, func, has_submodel)
set_dependency(model, eng, pkg = "parsnip", mode = NULL)
get_dependency(model)
set_fit(model, mode, eng, value)
get_fit(model)
set_pred(model, mode, eng, type, value)
get_pred_type(model, type)
show_model_info(model)
pred_value_template(pre = NULL, post = NULL, func, ...)
set_encoding(model, mode, eng, options)
get_encoding(model)
Arguments
model |
A single character string for the model type (e.g.
|
mode |
A single character string for the model mode (e.g. "regression"). |
eng |
A single character string for the model engine. |
parsnip |
A single character string for the "harmonized" argument name that parsnip exposes. |
original |
A single character string for the argument name that underlying model function uses. |
func |
A named character vector that describes how to call
a function. |
has_submodel |
A single logical for whether the argument can make predictions on multiple submodels at once. |
pkg |
An optional character string for a package name. |
value |
A list that conforms to the |
type |
A single character value for the type of prediction. Possible
values are: |
pre , post |
Optional functions for pre- and post-processing of prediction results. |
... |
Optional arguments that should be passed into the |
options |
A list of options for engine-specific preprocessing encodings. See Details below. |
Details
These functions are available for users to add their own models or engines (in a package or otherwise) so that they can be accessed using parsnip. This is more thoroughly documented on the package web site (see references below).
In short, parsnip
stores an environment object that contains
all of the information and code about how models are used (e.g.
fitting, predicting, etc). These functions can be used to add
models to that environment as well as helper functions that can
be used to make sure that the model data is in the right
format.
check_model_exists()
checks the model value and ensures that the model has
already been registered. check_model_doesnt_exist()
checks the model value
and also checks to see if it is novel in the environment.
The options for engine-specific encodings dictate how the predictors should be
handled. These options ensure that the data
that parsnip
gives to the underlying model allows for a model fit that is
as similar as possible to what it would have produced directly.
For example, if fit()
is used to fit a model that does not have
a formula interface, typically some predictor preprocessing must
be conducted. glmnet
is a good example of this.
There are four options that can be used for the encodings:
predictor_indicators
describes whether and how to create indicator/dummy
variables from factor predictors. There are three options: "none"
(do not
expand factor predictors), "traditional"
(apply the standard
model.matrix()
encodings), and "one_hot"
(create the complete set
including the baseline level for all factors). This encoding only affects
cases when fit.model_spec()
is used and the underlying model has an x/y
interface.
Another option is compute_intercept
; this controls whether model.matrix()
should include the intercept in its formula. This affects more than the
inclusion of an intercept column. With an intercept, model.matrix()
computes dummy variables for all but one factor levels. Without an
intercept, model.matrix()
computes a full set of indicators for the
first factor variable, but an incomplete set for the remainder.
Next, the option remove_intercept
will remove the intercept column
after model.matrix()
is finished. This can be useful if the model
function (e.g. lm()
) automatically generates an intercept.
Finally, allow_sparse_x
specifies whether the model function can natively
accommodate a sparse matrix representation for predictors during fitting
and tuning.
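To make the registration workflow concrete, here is a hedged sketch loosely following the "How to build a parsnip model" article referenced below. The model name ("mixture_da"), engine, and argument mapping are illustrative only, and a complete registration would also supply set_fit() and set_pred() definitions:

set_new_model("mixture_da")
set_model_mode(model = "mixture_da", mode = "classification")
set_model_engine("mixture_da", mode = "classification", eng = "mda")
set_dependency("mixture_da", eng = "mda", pkg = "mda")

# Map the harmonized argument name to the engine's native name; the
# `func` entry (a dials-style parameter constructor) is illustrative
set_model_arg(
  model = "mixture_da",
  eng = "mda",
  parsnip = "sub_classes",
  original = "subclasses",
  func = list(pkg = "dials", fun = "sub_classes"),
  has_submodel = FALSE
)

# Declare how predictors should be encoded for this engine
set_encoding(
  model = "mixture_da",
  mode = "classification",
  eng = "mda",
  options = list(
    predictor_indicators = "traditional",
    compute_intercept = TRUE,
    remove_intercept = TRUE,
    allow_sparse_x = FALSE
  )
)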
References
"How to build a parsnip model" https://www.tidymodels.org/learn/develop/models/
Examples
# set_new_model("shallow_learning_model")
# Show the information about a model:
show_model_info("rand_forest")
Set seed in R and TensorFlow at the same time
Description
Some Keras models require seeds to be set in both R and TensorFlow to achieve reproducible results. This function sets both seeds at the same time using version-appropriate functions.
Usage
set_tf_seed(seed)
Arguments
seed |
An integer value. |
Print the model call
Description
Print the model call
Usage
show_call(object)
Arguments
object |
A "model_spec" object. |
Value
A character string.
Display currently available engines for a model
Description
The possible engines for a model can depend on what packages are loaded.
Some parsnip extension packages add engines to existing models. For example,
the poissonreg package adds additional engines for the poisson_reg()
model and these are not available unless poissonreg is loaded.
Usage
show_engines(x)
Arguments
x |
The name of a parsnip model (e.g., "linear_reg", "mars", etc.) |
Value
A tibble.
Examples
show_engines("linear_reg")
Using sparse data with parsnip
Description
You can figure out whether a given model engine supports sparse data by
calling get_encoding("name of model")
and looking at the allow_sparse_x
column.
Details
Using sparse data for model fitting and prediction shouldn't require any
additional configurations. Just pass in a sparse matrix such as dgCMatrix
from the Matrix
package or a sparse tibble from the sparsevctrs package
to the data argument of fit()
, fit_xy()
, and predict()
.
Models that don't support sparse data will try to convert to non-sparse data, with warnings. If conversion isn't possible, an informative error will be thrown.
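A minimal sketch with an engine whose allow_sparse_x encoding is TRUE (xgboost); this assumes the Matrix and xgboost packages are installed:

library(Matrix)

# A sparse matrix of the mtcars predictors
x_sparse <- Matrix(as.matrix(mtcars[, -1]), sparse = TRUE)

boost_tree(trees = 10) %>%
  set_mode("regression") %>%
  set_engine("xgboost") %>%
  fit_xy(x = x_sparse, y = mtcars$mpg)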
Model Specification Checking
Description
The helpers spec_is_possible()
, spec_is_loaded()
, and
prompt_missing_implementation()
provide tooling for checking
model specifications. In addition to the spec
, engine
, and mode
arguments, the functions take arguments user_specified_engine
and
user_specified_mode
, denoting whether the user themselves has
specified the engine or mode, respectively.
Usage
spec_is_possible(
spec,
engine = spec$engine,
user_specified_engine = spec$user_specified_engine,
mode = spec$mode,
user_specified_mode = spec$user_specified_mode
)
spec_is_loaded(
spec,
engine = spec$engine,
user_specified_engine = spec$user_specified_engine,
mode = spec$mode,
user_specified_mode = spec$user_specified_mode
)
prompt_missing_implementation(
spec,
engine = spec$engine,
user_specified_engine = spec$user_specified_engine,
mode = spec$mode,
user_specified_mode = spec$user_specified_mode,
prompt,
...
)
Details
spec_is_possible()
checks against the union of
the current parsnip model environment and
the
model_info_table
of "pre-registered" model specifications
to determine whether a model is well-specified. See
parsnip:::model_info_table
for this table.
spec_is_loaded()
checks only against the current parsnip model environment.
spec_is_possible()
is executed automatically on new_model_spec()
,
set_mode()
, and set_engine()
, and spec_is_loaded()
is executed
automatically in print.model_spec()
, among other places. spec_is_possible()
should be used when a model specification is still "in progress" of being
specified, while spec_is_loaded()
should only be called when parsnip or an
extension receives some indication that the user is "done" specifying a model
specification: at print, fit, addition to a workflow, or extract_*()
, for
example.
When spec_is_loaded()
is FALSE
, the prompt_missing_implementation()
helper will construct an informative message to prompt users to load or
install needed packages. Its prompt
argument refers to the prompting
function to use, usually cli::cli_inform or cli::cli_abort, and the
ellipses are passed to that function.
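A small sketch of how these helpers behave; the bag_tree()/"rpart" combination is assumed here because that engine implementation is registered by the baguette extension package:

spec <- bag_tree() %>% set_engine("rpart")

# TRUE: this model/engine/mode combination is "pre-registered"
spec_is_possible(spec)

# FALSE unless baguette has been loaded to register the implementation
spec_is_loaded(spec)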
Wrapper for stan confidence intervals
Description
Wrapper for stan confidence intervals
Usage
stan_conf_int(object, newdata)
Arguments
object |
A stan model fit |
newdata |
A data set. |
Parametric survival regression
Description
This function is deprecated in favor of survival_reg()
which uses the
"censored regression"
mode.
surv_reg()
defines a parametric survival model.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
surv_reg(mode = "regression", engine = "survival", dist = NULL)
Arguments
mode |
A single character string for the prediction outcome mode. The only possible value for this model is "regression". |
engine |
A single character string specifying what computational engine to use for fitting. |
dist |
A character string for the probability distribution of the outcome. The default is "weibull". |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1
surv_reg(argument = !!value)
Since survival models typically involve censoring (and require the use of
survival::Surv()
objects), the fit.model_spec()
function will require that the
survival model be specified via the formula interface.
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
Parametric survival regression
Description
survival_reg()
defines a parametric survival model. This function can fit
censored regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
¹ The default engine. ² Requires a parsnip extension package.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
survival_reg(mode = "censored regression", engine = "survival", dist = NULL)
Arguments
mode |
A single character string for the prediction outcome mode. The only possible value for this model is "censored regression". |
engine |
A single character string specifying what computational engine to use for fitting. |
dist |
A character string for the probability distribution of the outcome. The default is "weibull". |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1
survival_reg(argument = !!value)
Since survival models typically involve censoring (and require the use of
survival::Surv()
objects), the fit.model_spec()
function will require that the
survival model be specified via the formula interface.
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, survival engine details
, flexsurv engine details
, flexsurvspline engine details
Examples
show_engines("survival_reg")
survival_reg(mode = "censored regression", dist = "weibull")
Linear support vector machines
Description
svm_linear()
defines a support vector machine model. For classification,
the model tries to maximize the width of the margin between classes (using a
linear class boundary). For regression, the model optimizes a robust loss
function that is only affected by very large model residuals and uses a
linear fit. This function can fit classification and regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
¹ The default engine.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
svm_linear(mode = "unknown", engine = "LiblineaR", cost = NULL, margin = NULL)
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", or "classification". |
engine |
A single character string specifying what computational engine to use for fitting. |
cost |
A positive number for the cost of predicting a sample within or on the wrong side of the margin |
margin |
A positive number for the epsilon in the SVM insensitive loss function (regression only) |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1
svm_linear(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, LiblineaR engine details
, kernlab engine details
Examples
show_engines("svm_linear")
svm_linear(mode = "classification")
Polynomial support vector machines
Description
svm_poly()
defines a support vector machine model. For classification,
the model tries to maximize the width of the margin between classes using a
polynomial class boundary. For regression, the model optimizes a robust loss
function that is only affected by very large model residuals and uses polynomial
functions of the predictors. This function can fit classification and
regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
¹ The default engine.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
svm_poly(
mode = "unknown",
engine = "kernlab",
cost = NULL,
degree = NULL,
scale_factor = NULL,
margin = NULL
)
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", or "classification". |
engine |
A single character string specifying what computational engine to use for fitting. |
cost |
A positive number for the cost of predicting a sample within or on the wrong side of the margin |
degree |
A positive number for polynomial degree. |
scale_factor |
A positive number for the polynomial scaling factor. |
margin |
A positive number for the epsilon in the SVM insensitive loss function (regression only) |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1
svm_poly(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, kernlab engine details
Examples
show_engines("svm_poly")
svm_poly(mode = "classification", degree = 1.2)
Radial basis function support vector machines
Description
svm_rbf()
defines a support vector machine model. For classification,
the model tries to maximize the width of the margin between classes using a
nonlinear class boundary. For regression, the model optimizes a robust loss
function that is only affected by very large model residuals and uses
nonlinear functions of the predictors. The function can fit classification
and regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
¹ The default engine.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
svm_rbf(
mode = "unknown",
engine = "kernlab",
cost = NULL,
rbf_sigma = NULL,
margin = NULL
)
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", or "classification". |
engine |
A single character string specifying what computational engine
to use for fitting. Possible engines are listed below. The default for this
model is |
cost |
A positive number for the cost of predicting a sample within or on the wrong side of the margin |
rbf_sigma |
A positive number for the radial basis function. |
margin |
A positive number for the epsilon in the SVM insensitive loss function (regression only) |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1
svm_rbf(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, kernlab engine details
Examples
show_engines("svm_rbf")
svm_rbf(mode = "classification", rbf_sigma = 0.2)
tidy methods for glmnet models
Description
tidy() methods for the various glmnet models that return the coefficients for the specific penalty value used by the parsnip model fit.
Usage
## S3 method for class '_elnet'
tidy(x, penalty = NULL, ...)
## S3 method for class '_lognet'
tidy(x, penalty = NULL, ...)
## S3 method for class '_multnet'
tidy(x, penalty = NULL, ...)
## S3 method for class '_fishnet'
tidy(x, penalty = NULL, ...)
## S3 method for class '_coxnet'
tidy(x, penalty = NULL, ...)
Arguments
x |
A fitted parsnip model that used the glmnet engine. |
penalty |
A single numeric value. If none is given, the value specified in the model specification is used. |
... |
Not used |
Value
A tibble with columns term, estimate, and penalty. When a multinomial mode is used, an additional class column is included.
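For illustration, a minimal sketch, assuming the glmnet package is installed:
library(parsnip)
glmnet_fit <-
  linear_reg(penalty = 0.1) %>%
  set_engine("glmnet") %>%
  fit(mpg ~ ., data = mtcars)
tidy(glmnet_fit)                 # coefficients at penalty = 0.1 from the specification
tidy(glmnet_fit, penalty = 0.05) # override with a different single penalty value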
tidy methods for LiblineaR models
Description
tidy() methods for the various LiblineaR models that return the coefficients from the parsnip model fit.
Usage
## S3 method for class '_LiblineaR'
tidy(x, ...)
Arguments
x |
A fitted parsnip model that used the LiblineaR engine. |
... |
Not used |
Value
A tibble with columns term and estimate.
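A parallel sketch (illustrative, assuming the LiblineaR package is installed), using a two-class subset of iris:
library(parsnip)
two_class <- iris[iris$Species != "setosa", ]
two_class$Species <- droplevels(two_class$Species)
ll_fit <-
  svm_linear(mode = "classification") %>%
  set_engine("LiblineaR") %>%
  fit(Species ~ ., data = two_class)
tidy(ll_fit)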
Turn a parsnip model object into a tidy tibble
Description
This method tidies the model in a parsnip model object, if it exists.
Usage
## S3 method for class 'model_fit'
tidy(x, ...)
Arguments
x |
An object to be converted into a tidy tibble::tibble(). |
... |
Additional arguments to tidying method. |
Value
A tibble.
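A minimal sketch (illustrative; the tidier for the underlying lm object is supplied by the broom package, which is assumed to be installed):
library(parsnip)
lm_fit <- fit(linear_reg(), mpg ~ wt + hp, data = mtcars)
tidy(lm_fit)  # dispatches to the tidy() method for the underlying lm object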
Tidy method for null models
Description
Return the results of nullmodel as a tibble.
Usage
## S3 method for class 'nullmodel'
tidy(x, ...)
Arguments
x |
A nullmodel object. |
... |
Not used. |
Value
A tibble with column value.
Examples
nullmodel(mtcars[,-1], mtcars$mpg) %>% tidy()
Resolve a Model Specification for a Computational Engine
Description
translate() will translate a model specification into a code object that is specific to a particular engine (e.g. an R package). It translates generic parameters to their counterparts.
Usage
translate(x, ...)
## Default S3 method:
translate(x, engine = x$engine, ...)
Arguments
x |
A model specification. |
... |
Not currently used. |
engine |
The computational engine for the model (see ?set_engine). |
Details
translate() produces a template call that lacks the specific argument values (such as data, etc). These are filled in once fit() is called with the specifics of the data for the model. The call may also include tune() arguments if these are in the specification. To handle the tune() arguments, you need to use the tune package. For more information see https://www.tidymodels.org/start/tuning/
It does contain the resolved argument names that are specific to the model fitting function/engine.
This function can be useful when you need to understand how parsnip goes from a generic model specification to a model fitting function.
Note: this function is used internally and users should only use it to understand what the underlying syntax would be. It should not be used to modify the model specification.
Examples
lm_spec <- linear_reg(penalty = 0.01)
# `penalty` is translated to `lambda`
translate(lm_spec, engine = "glmnet")
# `penalty` not applicable for this model.
translate(lm_spec, engine = "lm")
# `penalty` is translated to `reg_param`
translate(lm_spec, engine = "spark")
# with a placeholder for an unknown argument value:
translate(linear_reg(penalty = tune(), mixture = tune()), engine = "glmnet")
Succinct summary of parsnip object
Description
type_sum controls how objects are shown when inside tibble columns.
Usage
## S3 method for class 'model_spec'
type_sum(x)
## S3 method for class 'model_fit'
type_sum(x)
Arguments
x |
A model_spec or model_fit object. |
Details
For model_spec objects, the summary is "spec[?]" or "spec[+]". The former indicates that either the model mode has not been declared or that the specification has tune() parameters. Otherwise, the latter is shown.
For fitted models, either "fit[x]" or "fit[+]" are used, where the "x" implies that the model fit failed in some way.
Value
A character value.
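A short sketch of these summaries (illustrative; pillar::type_sum() is the generic these methods implement):
library(parsnip)
pillar::type_sum(linear_reg())                  # "spec[+]": mode declared, no tune() parameters
pillar::type_sum(linear_reg(penalty = tune()))  # "spec[?]": specification has a tune() parameter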
Save information about models
Description
This function writes a tab delimited file to the package to capture information about the known models. This information includes packages in the tidymodels GitHub repository as well as packages that are known to work well with tidymodels packages (e.g. not only parsnip but also tune, etc.). There may be more model definitions in other extension packages that are not included here.
These data are used to document engines for each model function man page.
Usage
update_model_info_file(path = "inst/models.tsv")
Arguments
path |
A character string for the location of the tab delimited file. |
Details
See our model implementation guidelines on best practices for modeling and modeling packages.
It is highly recommended that the known parsnip extension packages are loaded.
The unexported parsnip function extensions() will list these.
Updating a model specification
Description
If parameters of a model specification need to be modified, update() can be used in lieu of recreating the object from scratch.
Usage
## S3 method for class 'bag_mars'
update(
object,
parameters = NULL,
num_terms = NULL,
prod_degree = NULL,
prune_method = NULL,
fresh = FALSE,
...
)
## S3 method for class 'bag_mlp'
update(
object,
parameters = NULL,
hidden_units = NULL,
penalty = NULL,
epochs = NULL,
fresh = FALSE,
...
)
## S3 method for class 'bag_tree'
update(
object,
parameters = NULL,
cost_complexity = NULL,
tree_depth = NULL,
min_n = NULL,
class_cost = NULL,
fresh = FALSE,
...
)
## S3 method for class 'bart'
update(
object,
parameters = NULL,
trees = NULL,
prior_terminal_node_coef = NULL,
prior_terminal_node_expo = NULL,
prior_outcome_range = NULL,
fresh = FALSE,
...
)
## S3 method for class 'boost_tree'
update(
object,
parameters = NULL,
mtry = NULL,
trees = NULL,
min_n = NULL,
tree_depth = NULL,
learn_rate = NULL,
loss_reduction = NULL,
sample_size = NULL,
stop_iter = NULL,
fresh = FALSE,
...
)
## S3 method for class 'C5_rules'
update(
object,
parameters = NULL,
trees = NULL,
min_n = NULL,
fresh = FALSE,
...
)
## S3 method for class 'cubist_rules'
update(
object,
parameters = NULL,
committees = NULL,
neighbors = NULL,
max_rules = NULL,
fresh = FALSE,
...
)
## S3 method for class 'decision_tree'
update(
object,
parameters = NULL,
cost_complexity = NULL,
tree_depth = NULL,
min_n = NULL,
fresh = FALSE,
...
)
## S3 method for class 'discrim_flexible'
update(
object,
num_terms = NULL,
prod_degree = NULL,
prune_method = NULL,
fresh = FALSE,
...
)
## S3 method for class 'discrim_linear'
update(
object,
penalty = NULL,
regularization_method = NULL,
fresh = FALSE,
...
)
## S3 method for class 'discrim_quad'
update(object, regularization_method = NULL, fresh = FALSE, ...)
## S3 method for class 'discrim_regularized'
update(
object,
frac_common_cov = NULL,
frac_identity = NULL,
fresh = FALSE,
...
)
## S3 method for class 'gen_additive_mod'
update(
object,
select_features = NULL,
adjust_deg_free = NULL,
parameters = NULL,
fresh = FALSE,
...
)
## S3 method for class 'linear_reg'
update(
object,
parameters = NULL,
penalty = NULL,
mixture = NULL,
fresh = FALSE,
...
)
## S3 method for class 'logistic_reg'
update(
object,
parameters = NULL,
penalty = NULL,
mixture = NULL,
fresh = FALSE,
...
)
## S3 method for class 'mars'
update(
object,
parameters = NULL,
num_terms = NULL,
prod_degree = NULL,
prune_method = NULL,
fresh = FALSE,
...
)
## S3 method for class 'mlp'
update(
object,
parameters = NULL,
hidden_units = NULL,
penalty = NULL,
dropout = NULL,
epochs = NULL,
activation = NULL,
learn_rate = NULL,
fresh = FALSE,
...
)
## S3 method for class 'multinom_reg'
update(
object,
parameters = NULL,
penalty = NULL,
mixture = NULL,
fresh = FALSE,
...
)
## S3 method for class 'naive_Bayes'
update(object, smoothness = NULL, Laplace = NULL, fresh = FALSE, ...)
## S3 method for class 'nearest_neighbor'
update(
object,
parameters = NULL,
neighbors = NULL,
weight_func = NULL,
dist_power = NULL,
fresh = FALSE,
...
)
## S3 method for class 'pls'
update(
object,
parameters = NULL,
predictor_prop = NULL,
num_comp = NULL,
fresh = FALSE,
...
)
## S3 method for class 'poisson_reg'
update(
object,
parameters = NULL,
penalty = NULL,
mixture = NULL,
fresh = FALSE,
...
)
## S3 method for class 'proportional_hazards'
update(
object,
parameters = NULL,
penalty = NULL,
mixture = NULL,
fresh = FALSE,
...
)
## S3 method for class 'rand_forest'
update(
object,
parameters = NULL,
mtry = NULL,
trees = NULL,
min_n = NULL,
fresh = FALSE,
...
)
## S3 method for class 'rule_fit'
update(
object,
parameters = NULL,
mtry = NULL,
trees = NULL,
min_n = NULL,
tree_depth = NULL,
learn_rate = NULL,
loss_reduction = NULL,
sample_size = NULL,
penalty = NULL,
fresh = FALSE,
...
)
## S3 method for class 'surv_reg'
update(object, parameters = NULL, dist = NULL, fresh = FALSE, ...)
## S3 method for class 'survival_reg'
update(object, parameters = NULL, dist = NULL, fresh = FALSE, ...)
## S3 method for class 'svm_linear'
update(
object,
parameters = NULL,
cost = NULL,
margin = NULL,
fresh = FALSE,
...
)
## S3 method for class 'svm_poly'
update(
object,
parameters = NULL,
cost = NULL,
degree = NULL,
scale_factor = NULL,
margin = NULL,
fresh = FALSE,
...
)
## S3 method for class 'svm_rbf'
update(
object,
parameters = NULL,
cost = NULL,
rbf_sigma = NULL,
margin = NULL,
fresh = FALSE,
...
)
Arguments
object |
A model specification. |
parameters |
A 1-row tibble or named list with main parameters to update. Use either parameters or the main arguments directly when updating. If the main arguments are used, these will supersede the values in parameters. Also, using engine arguments in this object will result in an error. |
num_terms |
The number of features that will be retained in the final model, including the intercept. |
prod_degree |
The highest possible interaction degree. |
prune_method |
The pruning method. |
fresh |
A logical for whether the arguments should be modified in-place or replaced wholesale. |
... |
Not used for update(). |
hidden_units |
An integer for the number of units in the hidden layer. |
penalty |
A non-negative number representing the amount of regularization used by some of the engines. |
epochs |
An integer for the number of training iterations. |
cost_complexity |
A positive number for the cost/complexity parameter (a.k.a. Cp). |
tree_depth |
An integer for maximum depth of the tree. |
min_n |
An integer for the minimum number of data points in a node that are required for the node to be split further. |
class_cost |
A non-negative scalar for a class cost (where a cost of 1 means no extra cost). This is useful for when the first level of the outcome factor is the minority class. If this is not the case, values between zero and one can be used to bias to the second level of the factor. |
trees |
An integer for the number of trees contained in the ensemble. |
prior_terminal_node_coef |
A coefficient for the prior probability that a node is a terminal node. |
prior_terminal_node_expo |
An exponent in the prior probability that a node is a terminal node. |
prior_outcome_range |
A positive value that defines the width of a prior that the predicted outcome is within a certain range. For regression it is related to the observed range of the data; the prior is the number of standard deviations of a Gaussian distribution defined by the observed range of the data. For classification, it is defined as the range of +/-3 (assumed to be on the logit scale). The default value is 2. |
mtry |
A number for the number (or proportion) of predictors that will be randomly sampled at each split when creating the tree models (specific engines only). |
learn_rate |
A number for the rate at which the boosting algorithm adapts from iteration-to-iteration (specific engines only). This is sometimes referred to as the shrinkage parameter. |
loss_reduction |
A number for the reduction in the loss function required to split further (specific engines only). |
sample_size |
A number for the number (or proportion) of data that is exposed to the fitting routine. For xgboost, the sampling is done at each iteration while C5.0 samples once during training. |
stop_iter |
The number of iterations without improvement before stopping (specific engines only). |
committees |
A non-negative integer (no greater than 100) for the number of members of the ensemble. |
neighbors |
An integer between zero and nine for the number of training set instances that are used to adjust the model-based prediction. |
max_rules |
The largest number of rules. |
regularization_method |
A character string for the type of regularized estimation. Possible values are: "diagonal", "min_distance", "shrink_cov", and "shrink_mean" (sparsediscrim engine only). |
frac_common_cov , frac_identity |
Numeric values between zero and one. |
select_features |
TRUE or FALSE. If TRUE, the model has the ability to eliminate a predictor (via penalization). Increasing adjust_deg_free will increase the likelihood of removing predictors. |
adjust_deg_free |
If select_features = TRUE, then acts as a multiplier for smoothness. Increase beyond 1 for more smoothness. |
mixture |
A number between zero and one (inclusive) denoting the proportion of L1 regularization (i.e. lasso) in the model.
Available for specific engines only. |
dropout |
A number between 0 (inclusive) and 1 denoting the proportion of model parameters randomly set to zero during model training. |
activation |
A single character string denoting the type of relationship between the original predictors and the hidden unit layer. The activation function between the hidden and output layers is automatically set to either "linear" or "softmax" depending on the type of outcome. Possible values depend on the engine being used. |
smoothness |
A non-negative number representing the relative smoothness of the class boundary. Smaller values result in more flexible boundaries and larger values generate class boundaries that are less adaptable. |
Laplace |
A non-negative value for the Laplace correction to smoothing low-frequency counts. |
weight_func |
A single character for the type of kernel function used to weight distances between samples. Valid choices are: "rectangular", "triangular", "epanechnikov", "biweight", "triweight", "cos", "inv", "gaussian", "rank", or "optimal". |
dist_power |
A single number for the parameter used in calculating Minkowski distance. |
predictor_prop |
The maximum proportion of original predictors that can have non-zero coefficients for each PLS component (via regularization). This value is used for all PLS components for X. |
num_comp |
The number of PLS components to retain. |
dist |
A character string for the probability distribution of the outcome. The default is "weibull". |
cost |
A positive number for the cost of predicting a sample within or on the wrong side of the margin. |
margin |
A positive number for the epsilon in the SVM insensitive loss function (regression only). |
degree |
A positive number for polynomial degree. |
scale_factor |
A positive number for the polynomial scaling factor. |
rbf_sigma |
A positive number for the radial basis function. |
Value
An updated model specification.
Examples
# ------------------------------------------------------------------------------
model <- C5_rules(trees = 10, min_n = 2)
model
update(model, trees = 1)
update(model, trees = 1, fresh = TRUE)
# ------------------------------------------------------------------------------
model <- cubist_rules(committees = 10, neighbors = 2)
model
update(model, committees = 1)
update(model, committees = 1, fresh = TRUE)
model <- pls(predictor_prop = 0.1)
model
update(model, predictor_prop = 1)
update(model, predictor_prop = 1, fresh = TRUE)
# ------------------------------------------------------------------------------
model <- rule_fit(trees = 10, min_n = 2)
model
update(model, trees = 1)
update(model, trees = 1, fresh = TRUE)
model <- boost_tree(mtry = 10, min_n = 3)
model
update(model, mtry = 1)
update(model, mtry = 1, fresh = TRUE)
param_values <- tibble::tibble(mtry = 10, tree_depth = 5)
model %>% update(param_values)
model %>% update(param_values, mtry = 3)
param_values$verbose <- 0
# Fails due to engine argument
# model %>% update(param_values)
model <- linear_reg(penalty = 10, mixture = 0.1)
model
update(model, penalty = 1)
update(model, penalty = 1, fresh = TRUE)
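To make the fresh semantics concrete, a short sketch (illustrative, not part of the original examples):
# With fresh = FALSE (the default), arguments not named in the call are retained;
# with fresh = TRUE, the specification is replaced wholesale.
spec <- svm_rbf(cost = 10, rbf_sigma = 0.1)
update(spec, cost = 1)               # rbf_sigma = 0.1 is kept
update(spec, cost = 1, fresh = TRUE) # rbf_sigma is reset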
A placeholder function for argument values
Description
varying() is used when a parameter will be specified at a later date.
Usage
varying()
Determine varying arguments
Description
varying_args() takes a model specification or a recipe and returns a tibble of information on all possible varying arguments and whether or not they are actually varying.
The id column is determined differently depending on whether a model_spec or a recipe is used. For a model_spec, the first class is used. For a recipe, the unique step id is used.
Usage
## S3 method for class 'model_spec'
varying_args(object, full = TRUE, ...)
## S3 method for class 'recipe'
varying_args(object, full = TRUE, ...)
## S3 method for class 'step'
varying_args(object, full = TRUE, ...)
Arguments
object |
A model_spec, recipe, or step object. |
full |
A single logical. Should all possible varying parameters be returned? If FALSE, only the parameters that are actually varying are returned. |
... |
Not currently used. |
Value
A tibble with columns for the parameter name (name), whether it contains any varying value (varying), the id for the object (id), and the class that was used to call the method (type).
Examples
# List all possible varying args for the random forest spec
rand_forest() %>% varying_args()
# mtry is now recognized as varying
rand_forest(mtry = varying()) %>% varying_args()
# Even engine specific arguments can vary
rand_forest() %>%
set_engine("ranger", sample.fraction = varying()) %>%
varying_args()
# List only the arguments that actually vary
rand_forest() %>%
set_engine("ranger", sample.fraction = varying()) %>%
varying_args(full = FALSE)
rand_forest() %>%
set_engine(
"randomForest",
strata = Class,
sampsize = varying()
) %>%
varying_args()
Boosted trees via xgboost
Description
xgb_train() and xgb_predict() are wrappers for xgboost tree-based models where all of the model arguments are in the main function.
Usage
xgb_train(
x,
y,
weights = NULL,
max_depth = 6,
nrounds = 15,
eta = 0.3,
colsample_bynode = NULL,
colsample_bytree = NULL,
min_child_weight = 1,
gamma = 0,
subsample = 1,
validation = 0,
early_stop = NULL,
counts = TRUE,
event_level = c("first", "second"),
...
)
xgb_predict(object, new_data, ...)
Arguments
x |
A data frame or matrix of predictors |
y |
A vector (factor or numeric) or matrix (numeric) of outcome data. |
max_depth |
An integer for the maximum depth of the tree. |
nrounds |
An integer for the number of boosting iterations. |
eta |
A numeric value between zero and one to control the learning rate. |
colsample_bynode |
Subsampling proportion of columns for each node
within each tree. See the |
colsample_bytree |
Subsampling proportion of columns for each tree.
See the |
min_child_weight |
A numeric value for the minimum sum of instance weights needed in a child to continue to split. |
gamma |
A number for the minimum loss reduction required to make a further partition on a leaf node of the tree |
subsample |
Subsampling proportion of rows. By default, all of the training data are used. |
validation |
The proportion of the data that are used for performance assessment and potential early stopping. |
early_stop |
An integer or NULL. If not NULL, it is the number of training iterations without improvement before stopping. If validation is used, performance is based on the validation set; otherwise, the training set is used. |
counts |
A logical. If FALSE, colsample_bynode and colsample_bytree are assumed to be proportions of the number of columns (instead of counts). |
event_level |
For binary classification, this is a single string of either "first" or "second" to pass along describing which level of the outcome should be considered the "event". |
... |
Other options to pass to xgb.train() or xgboost's method for predict(). |
new_data |
A rectangular data object, such as a data frame. |
Value
A fitted xgboost object.
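These wrappers are exported for developer use; a minimal sketch of calling them directly, assuming the xgboost package is installed:
library(parsnip)
x <- as.matrix(mtcars[, -1])
y <- mtcars$mpg
# Train a small boosted regression model, then predict on a few rows
booster <- xgb_train(x, y, nrounds = 20, max_depth = 3)
xgb_predict(booster, new_data = x[1:3, , drop = FALSE])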