Title: | A Common API to Modeling and Analysis Functions |
Version: | 1.3.1 |
Maintainer: | Max Kuhn <max@posit.co> |
Description: | A common interface is provided to allow users to specify a model without having to remember the different argument names across different functions or computational engines (e.g. 'R', 'Spark', 'Stan', 'H2O', etc). |
License: | MIT + file LICENSE |
URL: | https://github.com/tidymodels/parsnip, https://parsnip.tidymodels.org/ |
BugReports: | https://github.com/tidymodels/parsnip/issues |
Depends: | R (≥ 3.6) |
Imports: | cli, dplyr (≥ 1.1.0), generics (≥ 0.1.2), ggplot2, globals, glue, hardhat (≥ 1.4.1), lifecycle, magrittr, pillar, prettyunits, purrr (≥ 1.0.0), rlang (≥ 1.1.0), sparsevctrs (≥ 0.2.0), stats, tibble (≥ 2.1.1), tidyr (≥ 1.3.0), utils, vctrs (≥ 0.6.0), withr |
Suggests: | bench, C50, covr, dials (≥ 1.1.0), earth, ggrepel, keras, kernlab, kknn, knitr, LiblineaR, MASS, Matrix, methods, mgcv, modeldata, nlme, prodlim, ranger (≥ 0.12.0), remotes, rmarkdown, rpart, sparklyr (≥ 1.0.0), survival, tensorflow, testthat (≥ 3.0.0), xgboost (≥ 1.5.0.1) |
VignetteBuilder: | knitr |
ByteCompile: | true |
Config/Needs/website: | brulee, C50, dbarts, earth, glmnet, keras, kernlab, kknn, LiblineaR, mgcv, nnet, parsnip, quantreg, randomForest, ranger, rpart, rstanarm, tidymodels/tidymodels, tidyverse/tidytemplate, rstudio/reticulate, xgboost, rmarkdown |
Config/rcmdcheck/ignore-inconsequential-notes: | true |
Config/testthat/edition: | 3 |
Encoding: | UTF-8 |
LazyData: | true |
RoxygenNote: | 7.3.2 |
NeedsCompilation: | no |
Packaged: | 2025-03-11 19:17:07 UTC; max |
Author: | Max Kuhn [aut, cre], Davis Vaughan [aut], Emil Hvitfeldt [ctb], Posit Software, PBC [cph, fnd] |
Repository: | CRAN |
Date/Publication: | 2025-03-12 00:10:02 UTC |
parsnip
Description
The goal of parsnip is to provide a tidy, unified interface to models that can be used to try a range of models without getting bogged down in the syntactical minutiae of the underlying packages.
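For instance, the same specification can be handed to different engines without changing the rest of the modeling code. A minimal sketch using the built-in mtcars data:
library(parsnip)
# Declare the type of model once...
spec <- linear_reg()
# ...then choose an engine and fit
spec %>% set_engine("lm") %>% fit(mpg ~ wt + hp, data = mtcars)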
Author(s)
Maintainer: Max Kuhn max@posit.co
Authors:
Davis Vaughan davis@posit.co
Other contributors:
Emil Hvitfeldt emil.hvitfeldt@posit.co [contributor]
Posit Software, PBC [copyright holder, funder]
See Also
Useful links:
https://parsnip.tidymodels.org/
https://github.com/tidymodels/parsnip
Report bugs at https://github.com/tidymodels/parsnip/issues
Helper functions for checking the penalty of glmnet models
Description
These functions are for developer use.
.check_glmnet_penalty_fit() checks that the model specification for fitting a glmnet model contains a single value.
.check_glmnet_penalty_predict() checks that the penalty value used for prediction is valid. If called by predict(), it needs to be a single value. Multiple values are allowed for multi_predict().
Usage
.check_glmnet_penalty_fit(x, call = rlang::caller_env())
.check_glmnet_penalty_predict(
penalty = NULL,
object,
multi = FALSE,
call = rlang::caller_env()
)
Arguments
x | An object of class model_spec. |
penalty | A penalty value to check. |
object | An object of class model_fit. |
multi | A logical indicating if multiple values are allowed. |
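A brief sketch of how the fit-time check behaves (the failing call is shown commented out):
# A specification with a single penalty value passes the check
spec <- linear_reg(penalty = 0.1) %>% set_engine("glmnet")
.check_glmnet_penalty_fit(spec)
# A specification without a penalty value would signal an error here:
# .check_glmnet_penalty_fit(linear_reg() %>% set_engine("glmnet"))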
Helper functions to convert between formula and matrix interface
Description
Functions to take a formula interface and get the resulting objects (y, x, weights, etc) back, or the other way around. The functions are intended for developer use. For the most part, this emulates the internals of lm() (and also see the notes at https://developer.r-project.org/model-fitting-functions.html).
.convert_form_to_xy_fit() and .convert_xy_to_form_fit() are for when the data are created for modeling. .convert_form_to_xy_fit() saves both the data objects as well as the objects needed when new data are predicted (e.g. terms, etc.).
.convert_form_to_xy_new() and .convert_xy_to_form_new() are used when new samples are being predicted and only require the predictors to be available.
Usage
.convert_form_to_xy_fit(
formula,
data,
...,
na.action = na.omit,
indicators = "traditional",
composition = "data.frame",
remove_intercept = TRUE,
call = rlang::caller_env()
)
.convert_form_to_xy_new(
object,
new_data,
na.action = na.pass,
composition = "data.frame",
call = rlang::caller_env()
)
.convert_xy_to_form_fit(
x,
y,
weights = NULL,
y_name = "..y",
remove_intercept = TRUE,
call = rlang::caller_env()
)
.convert_xy_to_form_new(object, new_data)
Arguments
formula | An object of class formula: a symbolic description of the model to be fit. |
data | A data frame containing all relevant variables (e.g. outcome(s), predictors, case weights, etc). |
... | Additional arguments. |
na.action | A function which indicates what should happen when the data contain NAs. |
indicators | A string describing whether and how to create indicator/dummy variables from factor predictors. Possible options are "none", "traditional", and "one_hot". |
composition | A string describing whether the resulting x and y should be returned as a "data.frame" or a "matrix". |
remove_intercept | A logical indicating whether to remove the intercept column after the model matrix is created. |
object | A model fit. |
new_data | A rectangular data object, such as a data frame. |
x | A matrix, sparse matrix, or data frame of predictors. Only some models have support for sparse matrix input. |
y | A vector, matrix or data frame of outcome data. |
weights | A numeric vector containing the weights. |
y_name | A string specifying the name of the outcome. |
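A sketch of the formula-to-x/y conversion (the element names in the comment are illustrative):
converted <- .convert_form_to_xy_fit(mpg ~ ., data = mtcars)
# Inspect what was saved for fitting and for future predictions,
# e.g. x, y, weights, and terms
names(converted)
head(converted$x)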
Extract survival status
Description
Extract the status from a survival::Surv()
object.
Arguments
surv | A single survival::Surv() object. |
Value
A numeric vector.
Extract survival time
Description
Extract the time component(s) from a survival::Surv()
object.
Arguments
surv | A single survival::Surv() object. |
Value
A vector when the type is "right" or "left" and a tibble otherwise.
Obtain names of prediction columns for a fitted model or workflow
Description
.get_prediction_column_names()
returns a list that has the names of the
columns for the primary prediction types for a model.
Usage
.get_prediction_column_names(x, syms = FALSE)
Arguments
x | A fitted parsnip model (an object of class "model_fit") or a fitted workflow. |
syms | Should the column names be converted to symbols? Defaults to FALSE. |
Value
A list with elements "estimate" and "probabilities".
Examples
library(dplyr)
library(modeldata)
data("two_class_dat")
levels(two_class_dat$Class)
lr_fit <- logistic_reg() %>% fit(Class ~ ., data = two_class_dat)
.get_prediction_column_names(lr_fit)
.get_prediction_column_names(lr_fit, syms = TRUE)
Translate names of model tuning parameters
Description
This function creates a key that connects the identifiers users make for tuning parameter names, the standardized parsnip parameter names, and the argument names to the underlying fit function for the engine.
Usage
.model_param_name_key(object, as_tibble = TRUE)
Arguments
object |
A workflow or parsnip model specification. |
as_tibble |
A logical. Should the results be in a tibble (the default) or in a list that can facilitate renaming grid objects? |
Value
A tibble with columns user, parsnip, and engine, or a list with named character vectors user_to_parsnip and parsnip_to_engine.
Examples
mod <-
linear_reg(penalty = tune("regularization"), mixture = tune()) %>%
set_engine("glmnet")
mod %>% .model_param_name_key()
rn <- mod %>% .model_param_name_key(as_tibble = FALSE)
rn
grid <- tidyr::crossing(regularization = c(0, 1), mixture = (0:3) / 3)
grid %>%
dplyr::rename(!!!rn$user_to_parsnip)
grid %>%
dplyr::rename(!!!rn$user_to_parsnip) %>%
dplyr::rename(!!!rn$parsnip_to_engine)
Organize glmnet predictions
Description
This function is for developer use and organizes predictions from glmnet models.
Usage
.organize_glmnet_pred(x, object)
Arguments
x | Predictions as returned by the predict() method for glmnet objects. |
object | An object of class model_fit. |
Add a column of row numbers to a data frame
Description
Add a column of row numbers to a data frame
Usage
add_rowindex(x)
Arguments
x |
A data frame |
Value
The same data frame with a column of 1-based integers named .row.
Examples
mtcars %>% add_rowindex()
Augment data with predictions
Description
augment()
will add column(s) for predictions to the given data.
Usage
## S3 method for class 'model_fit'
augment(x, new_data, eval_time = NULL, ...)
Arguments
x | A model fit produced by fit.model_spec() or fit_xy.model_spec(). |
new_data |
A data frame or matrix. |
eval_time |
For censored regression models, a vector of time points at which the survival probability is estimated. |
... |
Not currently used. |
Details
Regression
For regression models, a .pred column is added. If x was created using fit.model_spec() and new_data contains a regression outcome column, a .resid column is also added.
Classification
For classification models, the results can include a column called .pred_class as well as class probability columns named .pred_{level}. This depends on which prediction types are available for the model.
Censored Regression
For these models, predictions for the expected time and survival probability are created (if the model engine supports them). If the model supports survival prediction, the eval_time argument is required.
If survival predictions are created and new_data contains a survival::Surv() object, additional columns for inverse probability of censoring weights (IPCW) are also created (see the tidymodels.org page in the references below). This enables the user to compute performance metrics in the yardstick package.
References
https://www.tidymodels.org/learn/statistics/survival-metrics/
Examples
car_trn <- mtcars[11:32,]
car_tst <- mtcars[ 1:10,]
reg_form <-
linear_reg() %>%
set_engine("lm") %>%
fit(mpg ~ ., data = car_trn)
reg_xy <-
linear_reg() %>%
set_engine("lm") %>%
fit_xy(car_trn[, -1], car_trn$mpg)
augment(reg_form, car_tst)
augment(reg_form, car_tst[, -1])
augment(reg_xy, car_tst)
augment(reg_xy, car_tst[, -1])
# ------------------------------------------------------------------------------
data(two_class_dat, package = "modeldata")
cls_trn <- two_class_dat[-(1:10), ]
cls_tst <- two_class_dat[ 1:10 , ]
cls_form <-
logistic_reg() %>%
set_engine("glm") %>%
fit(Class ~ ., data = cls_trn)
cls_xy <-
logistic_reg() %>%
set_engine("glm") %>%
fit_xy(cls_trn[, -3],
cls_trn$Class)
augment(cls_form, cls_tst)
augment(cls_form, cls_tst[, -3])
augment(cls_xy, cls_tst)
augment(cls_xy, cls_tst[, -3])
Automatic Machine Learning
Description
auto_ml()
defines an automated searching and tuning process where
many models of different families are trained and ranked given their
performance on the training data.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
h2o¹²
¹ The default engine. ² Requires a parsnip extension package for classification and regression.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
auto_ml(mode = "unknown", engine = "h2o")
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", or "classification". |
engine |
A single character string specifying what computational engine to use for fitting. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode and engine are captured as quosures. To pass values programmatically, use the injection operator like so:
value <- 1
auto_ml(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, h2o engine details
Create a ggplot for a model object
Description
This method provides a good visualization method for model results. Currently, only methods for glmnet models are implemented.
Usage
## S3 method for class 'model_fit'
autoplot(object, ...)
## S3 method for class 'glmnet'
autoplot(object, ..., min_penalty = 0, best_penalty = NULL, top_n = 3L)
Arguments
object |
A model fit object. |
... | For autoplot.glmnet(), options to pass to ggrepel::geom_label_repel(). Otherwise, this argument is ignored. |
min_penalty | A single, non-negative number for the smallest penalty value that should be shown in the plot. |
best_penalty | A single, non-negative number that will show a vertical line marker. If left NULL, no line is shown. |
top_n | A non-negative integer for how many model predictors to label. The top predictors are ranked by their absolute coefficient value. For multinomial or multivariate models, the top_n terms are selected within each class or response. |
Details
The glmnet package will need to be attached or loaded for
its autoplot()
method to work correctly.
Value
A ggplot object with penalty on the x-axis and coefficients on the y-axis. For multinomial or multivariate models, the plot is faceted.
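A minimal sketch, assuming the glmnet package is installed (fitting the model loads glmnet, which the autoplot() method requires):
library(ggplot2)
glmnet_fit <-
  linear_reg(penalty = 0.1, mixture = 1) %>%
  set_engine("glmnet") %>%
  fit(mpg ~ ., data = mtcars)
autoplot(glmnet_fit, top_n = 2)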
Ensembles of MARS models
Description
bag_mars()
defines an ensemble of generalized linear models that use
artificial features for some predictors. These features resemble hinge
functions and the result is a model that is a segmented regression in small
dimensions. This function can fit classification and regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
earth¹²
¹ The default engine. ² Requires a parsnip extension package for classification and regression.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
bag_mars(
mode = "unknown",
num_terms = NULL,
prod_degree = NULL,
prune_method = NULL,
engine = "earth"
)
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", or "classification". |
num_terms |
The number of features that will be retained in the final model, including the intercept. |
prod_degree |
The highest possible interaction degree. |
prune_method |
The pruning method. |
engine |
A single character string specifying what computational engine to use for fitting. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode and engine are captured as quosures. To pass values programmatically, use the injection operator like so:
value <- 1
bag_mars(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, earth engine details
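Examples
A specification can be created without the engine packages; fitting requires the baguette and earth packages:
bag_mars(num_terms = 10, prod_degree = 1)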
Ensembles of neural networks
Description
bag_mlp()
defines an ensemble of single layer, feed-forward neural networks.
This function can fit classification and regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
nnet¹²
¹ The default engine. ² Requires a parsnip extension package for classification and regression.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
bag_mlp(
mode = "unknown",
hidden_units = NULL,
penalty = NULL,
epochs = NULL,
engine = "nnet"
)
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", or "classification". |
hidden_units | An integer for the number of units in the hidden layer. |
penalty |
A non-negative numeric value for the amount of weight decay. |
epochs |
An integer for the number of training iterations. |
engine |
A single character string specifying what computational engine to use for fitting. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode and engine are captured as quosures. To pass values programmatically, use the injection operator like so:
value <- 1
bag_mlp(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, nnet engine details
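Examples
A specification sketch (fitting requires the baguette and nnet packages):
bag_mlp(hidden_units = 5, epochs = 100) %>% set_mode("classification")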
Ensembles of decision trees
Description
bag_tree()
defines an ensemble of decision trees. This function can fit
classification, regression, and censored regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
rpart¹², C5.0²
¹ The default engine. ² Requires a parsnip extension package for censored regression, classification, and regression.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
bag_tree(
mode = "unknown",
cost_complexity = 0,
tree_depth = NULL,
min_n = 2,
class_cost = NULL,
engine = "rpart"
)
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", "classification", or "censored regression". |
cost_complexity | A positive number for the cost/complexity parameter (a.k.a. Cp) used by CART models (specific engines only). |
tree_depth |
An integer for the maximum depth of the tree (i.e. number of splits) (specific engines only). |
min_n |
An integer for the minimum number of data points in a node that is required for the node to be split further. |
class_cost |
A non-negative scalar for a class cost (where a cost of 1 means no extra cost). This is useful for when the first level of the outcome factor is the minority class. If this is not the case, values between zero and one can be used to bias to the second level of the factor. |
engine |
A single character string specifying what computational engine to use for fitting. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode and engine are captured as quosures. To pass values programmatically, use the injection operator like so:
value <- 1
bag_tree(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, rpart engine details
, C5.0 engine details
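Examples
A specification sketch (fitting requires the baguette extension package):
bag_tree(min_n = 5) %>%
  set_engine("rpart") %>%
  set_mode("regression")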
Bayesian additive regression trees (BART)
Description
bart()
defines a tree ensemble model that uses Bayesian analysis to
assemble the ensemble. This function can fit classification and regression
models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
dbarts¹
¹ The default engine.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
bart(
mode = "unknown",
engine = "dbarts",
trees = NULL,
prior_terminal_node_coef = NULL,
prior_terminal_node_expo = NULL,
prior_outcome_range = NULL
)
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", or "classification". |
engine |
A single character string specifying what computational engine to use for fitting. |
trees |
An integer for the number of trees contained in the ensemble. |
prior_terminal_node_coef |
A coefficient for the prior probability that a node is a terminal node. Values are usually between zero and one with a default of 0.95. This affects the baseline probability; smaller numbers make the probabilities larger overall. See Details below. |
prior_terminal_node_expo |
An exponent in the prior probability that a node is a terminal node. Values are usually non-negative with a default of 2. This affects the rate that the prior probability decreases as the depth of the tree increases. Larger values make deeper trees less likely. |
prior_outcome_range |
A positive value that defines the width of a prior that the predicted outcome is within a certain range. For regression it is related to the observed range of the data; the prior is the number of standard deviations of a Gaussian distribution defined by the observed range of the data. For classification, it is defined as the range of +/-3 (assumed to be on the logit scale). The default value is 2. |
Details
The prior for the terminal node probability is expressed as prior = a * (1 + d)^(-b) where d is the depth of the node, a is prior_terminal_node_coef, and b is prior_terminal_node_expo. See the Examples section below for an example graph of the prior probability of a terminal node for different values of these parameters.
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode and engine are captured as quosures. To pass values programmatically, use the injection operator like so:
value <- 1
bart(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, dbarts engine details
Examples
show_engines("bart")
bart(mode = "regression", trees = 5)
# ------------------------------------------------------------------------------
# Examples for terminal node prior
library(ggplot2)
library(dplyr)
prior_test <- function(coef = 0.95, expo = 2, depths = 1:10) {
tidyr::crossing(coef = coef, expo = expo, depth = depths) %>%
mutate(
`terminal node prior` = coef * (1 + depth)^(-expo),
coef = format(coef),
expo = format(expo))
}
prior_test(coef = c(0.05, 0.5, .95), expo = c(1/2, 1, 2)) %>%
ggplot(aes(depth, `terminal node prior`, col = coef)) +
geom_line() +
geom_point() +
facet_wrap(~ expo)
Developer functions for predictions via BART models
Description
Developer functions for predictions via BART models
Usage
dbart_predict_calc(obj, new_data, type, level = 0.95, std_err = FALSE)
Arguments
obj |
A parsnip object. |
new_data |
A rectangular data object, such as a data frame. |
type |
A single character value or NULL. |
level |
Confidence level. |
std_err |
Attach column for standard error of prediction or not. |
Boosted trees
Description
boost_tree()
defines a model that creates a series of decision trees
forming an ensemble. Each tree depends on the results of previous trees.
All trees in the ensemble are combined to produce a final prediction. This
function can fit classification, regression, and censored regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
¹ The default engine. ² Requires a parsnip extension package for censored regression, classification, and regression.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
boost_tree(
mode = "unknown",
engine = "xgboost",
mtry = NULL,
trees = NULL,
min_n = NULL,
tree_depth = NULL,
learn_rate = NULL,
loss_reduction = NULL,
sample_size = NULL,
stop_iter = NULL
)
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", "classification", or "censored regression". |
engine |
A single character string specifying what computational engine to use for fitting. |
mtry |
A number for the number (or proportion) of predictors that will be randomly sampled at each split when creating the tree models (specific engines only). |
trees |
An integer for the number of trees contained in the ensemble. |
min_n |
An integer for the minimum number of data points in a node that is required for the node to be split further. |
tree_depth |
An integer for the maximum depth of the tree (i.e. number of splits) (specific engines only). |
learn_rate |
A number for the rate at which the boosting algorithm adapts from iteration-to-iteration (specific engines only). This is sometimes referred to as the shrinkage parameter. |
loss_reduction |
A number for the reduction in the loss function required to split further (specific engines only). |
sample_size | A number for the number (or proportion) of data that is exposed to the fitting routine. For xgboost, the sampling is done at each iteration while C5.0 samples once during training. |
stop_iter |
The number of iterations without improvement before stopping (specific engines only). |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode and engine are captured as quosures. To pass values programmatically, use the injection operator like so:
value <- 1
boost_tree(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, xgboost engine details
, C5.0 engine details
, h2o engine details
, lightgbm engine details
, mboost engine details
, spark engine details
,
xgb_train()
, C5.0_train()
Examples
show_engines("boost_tree")
boost_tree(mode = "classification", trees = 20)
C5.0 rule-based classification models
Description
C5_rules()
defines a model that derives feature rules from a tree for
prediction. A single tree or boosted ensemble can be used. This function can
fit classification models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
C5.0¹²
¹ The default engine. ² Requires a parsnip extension package.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
C5_rules(mode = "classification", trees = NULL, min_n = NULL, engine = "C5.0")
Arguments
mode |
A single character string for the type of model. The only possible value for this model is "classification". |
trees |
A non-negative integer (no greater than 100) for the number of members of the ensemble. |
min_n |
An integer between zero and nine for the minimum number of data points in a node that are required for the node to be split further. |
engine |
A single character string specifying what computational engine to use for fitting. |
Details
C5.0 is a classification model that is an extension of the C4.5
model of Quinlan (1993). It has tree- and rule-based versions that also
include boosting capabilities. C5_rules()
enables the version of the model
that uses a series of rules (see the examples below). To make a set of
rules, an initial C5.0 tree is created and flattened into rules. The rules
are pruned, simplified, and ordered. Rule sets are created within each
iteration of boosting.
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode and engine are captured as quosures. To pass values programmatically, use the injection operator like so:
value <- 1
C5_rules(argument = !!value)
References
Quinlan R (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers.
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
C50::C5.0()
, C50::C5.0Control()
,
fit()
, set_engine()
, update()
, C5.0 engine details
Examples
show_engines("C5_rules")
C5_rules()
Boosted trees via C5.0
Description
C5.0_train
is a wrapper for the C5.0()
function in the
C50 package that fits tree-based models
where all of the model arguments are in the main function.
Usage
C5.0_train(x, y, weights = NULL, trials = 15, minCases = 2, sample = 0, ...)
Arguments
x |
A data frame or matrix of predictors. |
y |
A factor vector with 2 or more levels |
weights |
An optional numeric vector of case weights. Note that the data used for the case weights will not be used as a splitting variable in the model (see https://www.rulequest.com/see5-info.html for Quinlan's notes on case weights). |
trials |
An integer specifying the number of boosting iterations. A value of one indicates that a single model is used. |
minCases |
An integer for the smallest number of samples that must be put in at least two of the splits. |
sample |
A value in the range (0, .999) that specifies the random proportion of the data that should be used to train the model. By default, all the samples are used for model training. Samples not used for training are used to evaluate the accuracy of the model in the printed output. A value of zero means that all the training data are used. |
... |
Other arguments to pass. |
Value
A fitted C5.0 model.
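Examples
A minimal usage sketch, assuming the C50 and modeldata packages are installed (the column names follow modeldata::two_class_dat):
data(two_class_dat, package = "modeldata")
C5.0_train(
  x = two_class_dat[, c("A", "B")],
  y = two_class_dat$Class,
  trials = 5
)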
Using case weights with parsnip
Description
Case weights are positive numeric values that determine how much influence each data point has during the model fitting process. There are a variety of situations where case weights can be used.
Details
tidymodels packages differentiate how different types of case weights should be used during the entire data analysis process, including preprocessing data, model fitting, performance calculations, etc.
The tidymodels packages require users to convert their numeric vectors to a vector class that reflects how these should be used. For example, there are some situations where the weights should not affect operations such as centering and scaling or other preprocessing operations.
The types of weights allowed in tidymodels are:
- Frequency weights via hardhat::frequency_weights()
- Importance weights via hardhat::importance_weights()
More types can be added by request.
For parsnip, the fit() and fit_xy() functions contain a case_weights argument that takes these data. For Spark models, the argument value should be a character value.
See Also
frequency_weights()
, importance_weights()
, fit()
, fit_xy()
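A minimal sketch of passing frequency weights to fit() (the weight values here are arbitrary):
cars_wts <- mtcars
cars_wts$wts <- hardhat::frequency_weights(rep(c(1L, 2L), length.out = nrow(cars_wts)))
linear_reg() %>%
  fit(mpg ~ wt + hp, data = cars_wts, case_weights = cars_wts$wts)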
Determine if case weights are used
Description
Not all modeling engines can incorporate case weights into their calculations. This function can determine whether they can be used.
Usage
case_weights_allowed(spec)
Arguments
spec |
A parsnip model specification. |
Value
A single logical.
Examples
case_weights_allowed(linear_reg())
case_weights_allowed(linear_reg(engine = "keras"))
Calculations for inverse probability of censoring weights (IPCW)
Description
The method of Graf et al (1999) is used to compute weights at specific evaluation times that can be used to help measure a model's time-dependent performance (e.g. the time-dependent Brier score or the area under the ROC curve). This is an internal function.
Usage
.censoring_weights_graf(object, ...)
## Default S3 method:
.censoring_weights_graf(object, ...)
## S3 method for class 'model_fit'
.censoring_weights_graf(
object,
predictions,
cens_predictors = NULL,
trunc = 0.05,
eps = 10^-10,
...
)
Arguments
object |
A fitted parsnip model object or fitted workflow with a mode of "censored regression". |
predictions | A data frame with a column containing a survival::Surv() object. |
cens_predictors | Not currently used. A potential future slot for models with informative censoring based on columns in predictions. |
trunc | A potential lower bound for the probability of censoring to avoid very large weight values. |
eps | A small value that is subtracted from the evaluation time when computing the censoring probabilities. See Details below. |
Details
A probability that the data are censored immediately prior to a specific time is computed. To do this, we must determine what time to make the prediction. There are two time values for each row of the data set: the observed time (either censored or not) and the time that the model is being evaluated at (e.g. the survival function prediction at some time point), which is constant across rows.
From Graf et al (1999) there are three cases:
If the observed time is a censoring time and that is before the evaluation time, the data point should make no contribution to the performance metric (their "category 3"). These values have a missing value for their probability estimate (and also for their weight column).
If the observed time corresponds to an actual event, and that time is prior to the evaluation time (category 1), the probability of being censored is predicted at the observed time (minus an epsilon).
If the observed time is after the evaluation time (category 2), regardless of the status, the probability of being censored is predicted at the evaluation time (minus an epsilon).
The epsilon is used since we would not have actual information at time t for a data point being predicted at time t (only data prior to time t should be available).
After the censoring probability is computed, the trunc option is used to avoid using numbers pathologically close to zero. After this, the weight is computed by inverting the censoring probability.
The eps argument is used to avoid information leakage when computing the censoring probability. Subtracting a small number avoids using data that would not be known at the time of prediction. For example, if we are making survival probability predictions at eval_time = 3.0, we would not know the probability of being censored at that exact time (since it has not occurred yet).
When creating weights by inverting probabilities, there is the risk that a few cases will have severe outliers due to probabilities close to zero. To mitigate this, the trunc argument can be used to put a cap on the weights. If the smallest probability is greater than trunc, the probabilities with values less than trunc are given that value. Otherwise, trunc is adjusted to be half of the smallest probability and that value is used as the lower bound.
Note that if there are n rows in data and t time points, the resulting data, once unnested, has n * t rows. Computations will not easily scale well as t becomes very large.
Value
The same data are returned with the pred tibbles containing several new columns:
- .weight_time: the time at which the inverse censoring probability weights are computed. This is a function of the observed time and the time of analysis (i.e., eval_time). See Details for more information.
- .pred_censored: the probability of being censored at .weight_time.
- .weight_censored: the inverse of the censoring probability.
References
Graf, E., Schmoor, C., Sauerbrei, W. and Schumacher, M. (1999), Assessment and comparison of prognostic classification schemes for survival data. Statist. Med., 18: 2529-2545.
Check to ensure that ellipses are empty
Description
Check to ensure that ellipses are empty
Usage
check_empty_ellipse(...)
Arguments
... |
Extra arguments. |
Value
If an error is not thrown (from non-empty ellipses), a NULL list.
Condense control object into strictly smaller control object
Description
This function is used to help the hierarchy of control functions used throughout the tidymodels packages. It is now assumed that each control function is either a subset or a superset of another control function.
Usage
condense_control(x, ref, ..., call = rlang::caller_env())
Arguments
x |
A control object to be condensed. |
ref |
A control object that is used to determine what element should be kept. |
call | The execution environment of a currently running function, e.g. rlang::caller_env(). |
Value
A control object with the same elements and classes of ref, with values of x.
Examples
ctrl <- control_parsnip(catch = TRUE)
ctrl$allow_par <- TRUE
str(ctrl)
ctrl <- condense_control(ctrl, control_parsnip())
str(ctrl)
Control the fit function
Description
Pass options to the fit.model_spec()
function to control its
output and computations
Usage
control_parsnip(verbosity = 1L, catch = FALSE)
Arguments
verbosity | An integer to control how verbose the output is. For a value of zero, no messages or output are shown when packages are loaded or when the model is fit. For a value of 1, package loading is quiet but model fits can produce output to the screen (depending on if they contain their own verbose-type argument). |
catch | A logical where a value of TRUE will evaluate the model inside of try(silent = TRUE). If the model fails, an object is still returned (without an error) that inherits the class "try-error". |
Value
An S3 object with class "control_parsnip" that is a named list with the results of the function call
Examples
control_parsnip(verbosity = 2L)
Convenience function for intervals
Description
Convenience function for intervals
Usage
convert_stan_interval(x, level = 0.95, lower = TRUE)
Arguments
x |
A fitted model object |
level |
Level of uncertainty for intervals |
lower |
A logical: should the lower bound of the interval be computed (as opposed to the upper bound)? |
A wrapper function for conditional inference tree models
Description
These functions are slightly different APIs for partykit::ctree()
and
partykit::cforest()
that have several important arguments as top-level
arguments (as opposed to being specified in partykit::ctree_control()
).
Usage
ctree_train(
formula,
data,
weights = NULL,
minsplit = 20L,
maxdepth = Inf,
teststat = "quadratic",
testtype = "Bonferroni",
mincriterion = 0.95,
...
)
cforest_train(
formula,
data,
weights = NULL,
minsplit = 20L,
maxdepth = Inf,
teststat = "quadratic",
testtype = "Univariate",
mincriterion = 0,
mtry = ceiling(sqrt(ncol(data) - 1)),
ntree = 500L,
...
)
Arguments
formula |
A symbolic description of the model to be fit. |
data |
A data frame containing the variables in the model. |
weights | A vector of weights whose length is the same as nrow(data). |
minsplit | The minimum sum of weights in a node in order to be considered for splitting. |
maxdepth | Maximum depth of the tree. The default of Inf means that no restriction is applied to tree depth. |
teststat | A character specifying the type of the test statistic to be applied. |
testtype | A character specifying how to compute the distribution of the test statistic. |
mincriterion | The value of the test statistic (for testtype = "Teststatistic"), or 1 - p-value (for other values of testtype), that must be exceeded in order to implement a split. |
... | Other options to pass to partykit::ctree() or partykit::cforest(). |
mtry | Number of input variables randomly sampled as candidates at each node for random forest like algorithms. The default uses the square root of the number of predictors. |
ntree | Number of trees to grow in a forest. |
Value
An object of class party (for ctree) or cforest.
Examples
if (rlang::is_installed(c("modeldata", "partykit"))) {
data(bivariate, package = "modeldata")
ctree_train(Class ~ ., data = bivariate_train)
ctree_train(Class ~ ., data = bivariate_train, maxdepth = 1)
}
Cubist rule-based regression models
Description
cubist_rules()
defines a model that derives simple feature rules from a tree
ensemble and creates regression models within each rule. This function can fit
regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
Cubist¹²
¹ The default engine. ² Requires a parsnip extension package.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
cubist_rules(
mode = "regression",
committees = NULL,
neighbors = NULL,
max_rules = NULL,
engine = "Cubist"
)
Arguments
mode |
A single character string for the type of model. The only possible value for this model is "regression". |
committees |
A non-negative integer (no greater than 100) for the number of members of the ensemble. |
neighbors |
An integer between zero and nine for the number of training set instances that are used to adjust the model-based prediction. |
max_rules |
The largest number of rules. |
engine |
A single character string specifying what computational engine to use for fitting. |
Details
Cubist is a rule-based ensemble regression model. A basic model tree (Quinlan, 1992) is created that has a separate linear regression model corresponding to each terminal node. The paths along the model tree are flattened into rules and these rules are simplified and pruned. The parameter min_n is the primary method for controlling the size of each tree while max_rules controls the number of rules.
Cubist ensembles are created using committees, which are similar to boosting. After the first model in the committee is created, the second model uses a modified version of the outcome data based on whether the previous model under- or over-predicted the outcome. For iteration m, a new outcome y* is computed by adjusting the observed outcome in the direction opposite to the previous model's error. If a sample is under-predicted on the previous iteration, the outcome is adjusted so that the next time it is more likely to be over-predicted to compensate. This adjustment continues for each ensemble iteration. See Kuhn and Johnson (2013) for details.
After the model is created, there is also an option for a post-hoc adjustment that uses the training set (Quinlan, 1993). When a new sample is predicted by the model, it can be modified by its nearest neighbors in the original training set. For K neighbors, the model-based predicted value is adjusted using a weighted average that involves each neighbor's training set prediction t and a weight w that is inverse to the distance to the neighbor.
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode and engine are captured as quosures. To pass values programmatically, use the injection operator like so:
value <- 1
cubist_rules(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
Quinlan R (1992). "Learning with Continuous Classes." Proceedings of the 5th Australian Joint Conference On Artificial Intelligence, pp. 343-348.
Quinlan R (1993)."Combining Instance-Based and Model-Based Learning." Proceedings of the Tenth International Conference on Machine Learning, pp. 236-243.
Kuhn M and Johnson K (2013). Applied Predictive Modeling. Springer.
See Also
Cubist::cubist()
, Cubist::cubistControl()
, fit()
, set_engine()
, update()
, Cubist engine details
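Examples
A specification sketch (fitting requires the rules extension package and Cubist):
cubist_rules(committees = 2, neighbors = 3)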
Decision trees
Description
decision_tree()
defines a model as a set of if/then
statements that
creates a tree-based structure. This function can fit classification,
regression, and censored regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
¹ The default engine. ² Requires a parsnip extension package for censored regression, classification, and regression.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
decision_tree(
mode = "unknown",
engine = "rpart",
cost_complexity = NULL,
tree_depth = NULL,
min_n = NULL
)
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", "classification", or "censored regression". |
engine |
A single character string specifying what computational engine to use for fitting. |
cost_complexity | A positive number for the cost/complexity parameter (a.k.a. Cp) used by CART models (specific engines only). |
tree_depth |
An integer for maximum depth of the tree. |
min_n |
An integer for the minimum number of data points in a node that are required for the node to be split further. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode and engine are captured as quosures. To pass values programmatically, use the injection operator like so:
value <- 1
decision_tree(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, rpart engine details
, C5.0 engine details
, partykit engine details
, spark engine details
Examples
show_engines("decision_tree")
decision_tree(mode = "classification", tree_depth = 5)
Data Set Characteristics Available when Fitting Models
Description
When using the fit()
functions there are some
variables that will be available for use in arguments. For
example, if the user would like to choose an argument value
based on the current number of rows in a data set, the .obs()
function can be used. See Details below.
Usage
.cols()
.preds()
.obs()
.lvls()
.facts()
.x()
.y()
.dat()
Details
Existing functions:
- .obs(): The current number of rows in the data set.
- .preds(): The number of columns in the data set that are associated with the predictors prior to dummy variable creation.
- .cols(): The number of predictor columns available after dummy variables are created (if any).
- .facts(): The number of factor predictors in the data set.
- .lvls(): If the outcome is a factor, this is a table with the counts for each level (and NA otherwise).
- .x(): The predictors returned in the format given. Either a data frame or a matrix.
- .y(): The known outcomes returned in the format given. Either a vector, matrix, or data frame.
- .dat(): A data frame containing all of the predictors and the outcomes. If fit_xy() was used, the outcomes are attached as the column ..y.
For example, if you use the model formula circumference ~ .
with the
built-in Orange
data, the values would be
.preds() = 2 (the 2 remaining columns in `Orange`)
.cols() = 5 (1 numeric column + 4 from Tree dummy variables)
.obs() = 35
.lvls() = NA (no factor outcome)
.facts() = 1 (the Tree predictor)
.y() = <vector> (circumference as a vector)
.x() = <data.frame> (The other 2 columns as a data frame)
.dat() = <data.frame> (The full data set)
If the formula Tree ~ .
were used:
.preds() = 2 (the 2 numeric columns in `Orange`)
.cols() = 2 (same)
.obs() = 35
.lvls() = c("1" = 7, "2" = 7, "3" = 7, "4" = 7, "5" = 7)
.facts() = 0
.y() = <vector> (Tree as a vector)
.x() = <data.frame> (The other 2 columns as a data frame)
.dat() = <data.frame> (The full data set)
To use these in a model fit, pass them to a model specification.
The evaluation is delayed until the time when the
model is run via fit()
(and the variables listed above are available).
For example:
library(modeldata)
data("lending_club")
rand_forest(mode = "classification", mtry = .cols() - 2)
When no descriptors are found, the computation of the descriptor values is not executed.
Automatic machine learning via h2o
Description
h2o::h2o.automl defines an automated model training process and returns a leaderboard of the models with the best performance.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has no tuning parameters.
Engine arguments of interest
- max_runtime_secs and max_models: control the maximum running time and the number of models to build in the automatic process.
- exclude_algos and include_algos: a character vector indicating the excluded or included algorithms during model building. To see a full list of supported models, see the details section in h2o::h2o.automl().
- validation: A number between 0 and 1 specifying the proportion of training data reserved as a validation set. This is used by h2o for performance assessment and potential early stopping.
Translation from parsnip to the original package (regression)
agua::h2o_train_auto()
is a wrapper around
h2o::h2o.automl()
.
auto_ml() %>%
  set_engine("h2o") %>%
  set_mode("regression") %>%
  translate()

## Automatic Machine Learning Model Specification (regression)
##
## Computational engine: h2o
##
## Model fit template:
## agua::h2o_train_auto(x = missing_arg(), y = missing_arg(), weights = missing_arg(),
##     validation_frame = missing_arg(), verbosity = NULL)
Translation from parsnip to the original package (classification)
auto_ml() %>%
  set_engine("h2o") %>%
  set_mode("classification") %>%
  translate()

## Automatic Machine Learning Model Specification (classification)
##
## Computational engine: h2o
##
## Model fit template:
## agua::h2o_train_auto(x = missing_arg(), y = missing_arg(), weights = missing_arg(),
##     validation_frame = missing_arg(), verbosity = NULL)
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Initializing h2o
To use the h2o engine with tidymodels, please run h2o::h2o.init() first. By default, this connects R to the local h2o server. This needs to be done in every new R session. You can also connect to a remote h2o server with an IP address; for more details see h2o::h2o.init().
You can control the number of threads in the thread pool used by h2o with the nthreads argument. By default, it uses all CPUs on the host.
This is different from the usual parallel processing mechanism in tidymodels for tuning: while tidymodels parallelizes over resamples, h2o parallelizes over hyperparameter combinations for a given resample.
h2o will automatically shut down the local h2o instance started by R when R is terminated. To manually stop the h2o server, run h2o::h2o.shutdown().
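A sketch of a typical session setup (assumes the agua and h2o packages are installed):
library(agua)
h2o::h2o.init()  # start or connect to the local h2o server
auto_ml() %>% set_mode("regression")
# h2o::h2o.shutdown(prompt = FALSE)  # optional: stop the server manually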
Saving fitted model objects
Models fitted with this engine may require native serialization methods to be properly saved and/or passed between R sessions. To learn more about preparing fitted models for serialization, see the bundle package.
Bagged MARS via earth
Description
baguette::bagger() creates a collection of MARS models forming an ensemble. All models in the ensemble are combined to produce a final prediction.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 3 tuning parameters:
- prod_degree: Degree of Interaction (type: integer, default: 1L)
- prune_method: Pruning Method (type: character, default: "backward")
- num_terms: # Model Terms (type: integer, default: see below)
The default value of num_terms depends on the number of predictor columns. For a data frame x, the default is min(200, max(20, 2 * ncol(x))) + 1 (see earth::earth() and the reference below).
Translation from parsnip to the original package (regression)
The baguette extension package is required to fit this model.
bag_mars(
  num_terms = integer(1),
  prod_degree = integer(1),
  prune_method = character(1)
) %>%
  set_engine("earth") %>%
  set_mode("regression") %>%
  translate()

## Bagged MARS Model Specification (regression)
##
## Main Arguments:
##   num_terms = integer(1)
##   prod_degree = integer(1)
##   prune_method = character(1)
##
## Computational engine: earth
##
## Model fit template:
## baguette::bagger(formula = missing_arg(), data = missing_arg(),
##     weights = missing_arg(), nprune = integer(1), degree = integer(1),
##     pmethod = character(1), base_model = "MARS")
Translation from parsnip to the original package (classification)
The baguette extension package is required to fit this model.
library(baguette)
bag_mars(
  num_terms = integer(1),
  prod_degree = integer(1),
  prune_method = character(1)
) %>%
  set_engine("earth") %>%
  set_mode("classification") %>%
  translate()

## Bagged MARS Model Specification (classification)
##
## Main Arguments:
##   num_terms = integer(1)
##   prod_degree = integer(1)
##   prune_method = character(1)
##
## Computational engine: earth
##
## Model fit template:
## baguette::bagger(formula = missing_arg(), data = missing_arg(),
##     weights = missing_arg(), nprune = integer(1), degree = integer(1),
##     pmethod = character(1), base_model = "MARS")
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() arguments have arguments called case_weights that expect vectors of case weights.
Note that the earth package documentation has: “In the current implementation, building models with weights can be slow.”
References
Breiman, L. 1996. “Bagging predictors”. Machine Learning. 24 (2): 123-140
Friedman, J. 1991. “Multivariate Adaptive Regression Splines.” The Annals of Statistics, vol. 19, no. 1, pp. 1-67.
Milborrow, S. “Notes on the earth package.”
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Bagged neural networks via nnet
Description
baguette::bagger() creates a collection of neural networks forming an ensemble. All networks in the ensemble are combined to produce a final prediction.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 3 tuning parameters:
- hidden_units: # Hidden Units (type: integer, default: 10L)
- penalty: Amount of Regularization (type: double, default: 0.0)
- epochs: # Epochs (type: integer, default: 1000L)
These defaults are set by the baguette package and are different from those in nnet::nnet().
Translation from parsnip to the original package (classification)
The baguette extension package is required to fit this model.
library(baguette)
bag_mlp(penalty = double(1), hidden_units = integer(1)) %>%
  set_engine("nnet") %>%
  set_mode("classification") %>%
  translate()

## Bagged Neural Network Model Specification (classification)
##
## Main Arguments:
##   hidden_units = integer(1)
##   penalty = double(1)
##
## Computational engine: nnet
##
## Model fit template:
## baguette::bagger(formula = missing_arg(), data = missing_arg(),
##     weights = missing_arg(), size = integer(1), decay = double(1),
##     base_model = "nnet")
Translation from parsnip to the original package (regression)
The baguette extension package is required to fit this model.
library(baguette)
bag_mlp(penalty = double(1), hidden_units = integer(1)) %>%
  set_engine("nnet") %>%
  set_mode("regression") %>%
  translate()

## Bagged Neural Network Model Specification (regression)
##
## Main Arguments:
##   hidden_units = integer(1)
##   penalty = double(1)
##
## Computational engine: nnet
##
## Model fit template:
## baguette::bagger(formula = missing_arg(), data = missing_arg(),
##     weights = missing_arg(), size = integer(1), decay = double(1),
##     base_model = "nnet")
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
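One way to do this is with a normalization step in a recipe; a sketch, assuming the recipes package is installed:
library(recipes)
rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_numeric_predictors())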
Case weights
The underlying model implementation does not allow for case weights.
References
Breiman L. 1996. “Bagging predictors”. Machine Learning. 24 (2): 123-140
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Bagged trees via C5.0
Description
baguette::bagger() creates a collection of decision trees forming an ensemble. All trees in the ensemble are combined to produce a final prediction.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 1 tuning parameter:
- min_n: Minimal Node Size (type: integer, default: 2L)
Translation from parsnip to the original package (classification)
The baguette extension package is required to fit this model.
library(baguette)
bag_tree(min_n = integer()) %>%
  set_engine("C5.0") %>%
  set_mode("classification") %>%
  translate()

## Bagged Decision Tree Model Specification (classification)
##
## Main Arguments:
##   cost_complexity = 0
##   min_n = integer()
##
## Computational engine: C5.0
##
## Model fit template:
## baguette::bagger(x = missing_arg(), y = missing_arg(), weights = missing_arg(),
##     minCases = integer(), base_model = "C5.0")
Preprocessing requirements
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c}
vs {b, d}
) when splitting at a node. Dummy variables
are not required for this model.
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() arguments have arguments called case_weights that expect vectors of case weights.
References
Breiman, L. 1996. “Bagging predictors”. Machine Learning. 24 (2): 123-140
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Bagged trees via rpart
Description
baguette::bagger() and ipred::bagging() create collections of decision
trees forming an ensemble. All trees in the ensemble are combined to
produce a final prediction.
Details
For this engine, there are multiple modes: classification, regression, and censored regression
Tuning Parameters
This model has 4 tuning parameters:

- class_cost: Class Cost (type: double, default: (see below))
- tree_depth: Tree Depth (type: integer, default: 30L)
- min_n: Minimal Node Size (type: integer, default: 2L)
- cost_complexity: Cost-Complexity Parameter (type: double, default: 0.01)
For the class_cost parameter, the value can be a non-negative scalar
for a class cost (where a cost of 1 means no extra cost). This is
useful when the first level of the outcome factor is the minority
class. If this is not the case, values between zero and one can be
used to bias to the second level of the factor. A sketch of supplying
this argument follows.
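This is a minimal sketch; the cost value of 2 is purely illustrative:

library(baguette)

# a cost of 1 means no extra cost; larger values place extra weight
# on the first (minority) class
bag_tree(class_cost = 2, min_n = 10) %>%
  set_engine("rpart") %>%
  set_mode("classification")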
Translation from parsnip to the original package (classification)
The baguette extension package is required to fit this model.
library(baguette) bag_tree(tree_depth = integer(1), min_n = integer(1), cost_complexity = double(1)) %>% set_engine("rpart") %>% set_mode("classification") %>% translate()
## Bagged Decision Tree Model Specification (classification) ## ## Main Arguments: ## cost_complexity = double(1) ## tree_depth = integer(1) ## min_n = integer(1) ## ## Computational engine: rpart ## ## Model fit template: ## baguette::bagger(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg(), cp = double(1), maxdepth = integer(1), ## minsplit = integer(1), base_model = "CART")
Translation from parsnip to the original package (regression)
The baguette extension package is required to fit this model.
library(baguette) bag_tree(tree_depth = integer(1), min_n = integer(1), cost_complexity = double(1)) %>% set_engine("rpart") %>% set_mode("regression") %>% translate()
## Bagged Decision Tree Model Specification (regression) ## ## Main Arguments: ## cost_complexity = double(1) ## tree_depth = integer(1) ## min_n = integer(1) ## ## Computational engine: rpart ## ## Model fit template: ## baguette::bagger(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg(), cp = double(1), maxdepth = integer(1), ## minsplit = integer(1), base_model = "CART")
Translation from parsnip to the original package (censored regression)
The censored extension package is required to fit this model.
library(censored) bag_tree(tree_depth = integer(1), min_n = integer(1), cost_complexity = double(1)) %>% set_engine("rpart") %>% set_mode("censored regression") %>% translate()
## Bagged Decision Tree Model Specification (censored regression) ## ## Main Arguments: ## cost_complexity = double(1) ## tree_depth = integer(1) ## min_n = integer(1) ## ## Computational engine: rpart ## ## Model fit template: ## ipred::bagging(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg(), cp = double(1), maxdepth = integer(1), ## minsplit = integer(1))
Preprocessing requirements
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c} vs {b, d}) when splitting at a node. Dummy variables
are not required for this model.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() arguments have arguments called case_weights
that expect vectors of case weights.
Other details
Predictions of type "time" are predictions of the median survival
time.
References
Breiman, L. 1996. “Bagging predictors”. Machine Learning. 24 (2): 123-140
Hothorn T, Lausen B, Benner A, Radespiel-Troeger M. 2004. Bagging Survival Trees. Statistics in Medicine, 23(1), 77–91.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Bayesian additive regression trees via dbarts
Description
dbarts::bart() creates an ensemble of tree-based models whose training
and assembly is determined using Bayesian analysis.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 4 tuning parameters:

- trees: # Trees (type: integer, default: 200L)
- prior_terminal_node_coef: Terminal Node Prior Coefficient (type: double, default: 0.95)
- prior_terminal_node_expo: Terminal Node Prior Exponent (type: double, default: 2.00)
- prior_outcome_range: Prior for Outcome Range (type: double, default: 2.00)
Important engine-specific options
Some relevant arguments that can be passed to set_engine():

- keepevery, n.thin: Every keepevery draw is kept to be returned to
  the user. Useful for “thinning” samples.
- ntree, n.trees: The number of trees in the sum-of-trees formulation.
- ndpost, n.samples: The number of posterior draws after burn in;
  ndpost / keepevery will actually be returned.
- nskip, n.burn: Number of MCMC iterations to be treated as burn in.
- nchain, n.chains: Integer specifying how many independent tree sets
  and fits should be calculated.
- nthread, n.threads: Integer specifying how many threads to use.
  Depending on the CPU architecture, using more than the number of
  chains can degrade performance for small/medium data sets. As such,
  some calculations may be executed single threaded regardless.
- combinechains, combineChains: Logical; if TRUE, samples will be
  returned in arrays of dimensions equal to nchain times ndpost times
  the number of observations.
Translation from parsnip to the original package (classification)
bart( trees = integer(1), prior_terminal_node_coef = double(1), prior_terminal_node_expo = double(1), prior_outcome_range = double(1) ) %>% set_engine("dbarts") %>% set_mode("classification") %>% translate()
## BART Model Specification (classification) ## ## Main Arguments: ## trees = integer(1) ## prior_terminal_node_coef = double(1) ## prior_terminal_node_expo = double(1) ## prior_outcome_range = double(1) ## ## Computational engine: dbarts ## ## Model fit template: ## dbarts::bart(x = missing_arg(), y = missing_arg(), ntree = integer(1), ## base = double(1), power = double(1), k = double(1), verbose = FALSE, ## keeptrees = TRUE, keepcall = FALSE)
Translation from parsnip to the original package (regression)
bart( trees = integer(1), prior_terminal_node_coef = double(1), prior_terminal_node_expo = double(1), prior_outcome_range = double(1) ) %>% set_engine("dbarts") %>% set_mode("regression") %>% translate()
## BART Model Specification (regression) ## ## Main Arguments: ## trees = integer(1) ## prior_terminal_node_coef = double(1) ## prior_terminal_node_expo = double(1) ## prior_outcome_range = double(1) ## ## Computational engine: dbarts ## ## Model fit template: ## dbarts::bart(x = missing_arg(), y = missing_arg(), ntree = integer(1), ## base = double(1), power = double(1), k = double(1), verbose = FALSE, ## keeptrees = TRUE, keepcall = FALSE)
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), parsnip will convert factor columns to
indicators. dbarts::bart() will also convert the factors to indicators
if the user does not create them first.
References
Chipman, George, McCulloch. “BART: Bayesian additive regression trees.” Ann. Appl. Stat. 4 (1) 266 - 298, March 2010.
Boosted trees via C5.0
Description
C50::C5.0()
creates a series of classification trees forming an
ensemble. Each tree depends on the results of previous trees. All trees in
the ensemble are combined to produce a final prediction.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 3 tuning parameters:

- trees: # Trees (type: integer, default: 15L)
- min_n: Minimal Node Size (type: integer, default: 2L)
- sample_size: Proportion Observations Sampled (type: double, default: 1.0)
The implementation of C5.0 limits the number of trees to be between 1 and 100.
Translation from parsnip to the original package (classification)
boost_tree(trees = integer(), min_n = integer(), sample_size = numeric()) %>% set_engine("C5.0") %>% set_mode("classification") %>% translate()
## Boosted Tree Model Specification (classification) ## ## Main Arguments: ## trees = integer() ## min_n = integer() ## sample_size = numeric() ## ## Computational engine: C5.0 ## ## Model fit template: ## parsnip::C5.0_train(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## trials = integer(), minCases = integer(), sample = numeric())
C5.0_train() is a wrapper around C50::C5.0() that makes it easier to
run this model.
Preprocessing requirements
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c} vs {b, d}) when splitting at a node. Dummy variables
are not required for this model.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() arguments have arguments called case_weights
that expect vectors of case weights.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
Other details
Early stopping
By default, early stopping is used. To use the complete set of
boosting iterations, pass earlyStopping = FALSE to set_engine(). Also,
it is unlikely that early stopping will occur if sample_size = 1.
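For example, a minimal sketch of disabling the early stopping (the number of trees is illustrative):

boost_tree(trees = 50) %>%
  set_engine("C5.0", earlyStopping = FALSE) %>%
  set_mode("classification")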
Examples
The “Fitting and Predicting with parsnip” article contains examples
for boost_tree() with the "C5.0" engine.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Boosted trees via h2o
Description
h2o::h2o.xgboost() creates a series of decision trees
forming an ensemble. Each tree depends on the results of previous trees.
All trees in the ensemble are combined to produce a final prediction.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 8 tuning parameters:

- trees: # Trees (type: integer, default: 50)
- tree_depth: Tree Depth (type: integer, default: 6)
- min_n: Minimal Node Size (type: integer, default: 1)
- learn_rate: Learning Rate (type: double, default: 0.3)
- sample_size: # Observations Sampled (type: integer, default: 1)
- mtry: # Randomly Selected Predictors (type: integer, default: 1)
- loss_reduction: Minimum Loss Reduction (type: double, default: 0)
- stop_iter: # Iterations Before Stopping (type: integer, default: 0)
min_n represents the fewest allowed observations in a terminal node;
h2o::h2o.xgboost() allows only one row in a leaf by default.

stop_iter controls early stopping rounds based on the convergence of
the engine parameter stopping_metric. By default, h2o::h2o.xgboost()
does not use early stopping. When stop_iter is not 0,
h2o::h2o.xgboost() uses logloss for classification, deviance for
regression, and anomaly score for Isolation Forest. This is mostly
useful alongside the engine parameter validation, the proportion of
the data used for a train-validation split: parsnip will split the
data and pass the two data frames to h2o, and h2o::h2o.xgboost() will
then evaluate the metric and the early stopping criteria on the
validation set.
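For instance, a minimal sketch pairing stop_iter with a validation split (the values are illustrative):

library(agua)

boost_tree(trees = 500, stop_iter = 5) %>%
  set_engine("h2o", validation = 0.1) %>%
  set_mode("classification")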
Translation from parsnip to the original package (regression)
agua::h2o_train_xgboost() is a wrapper around h2o::h2o.xgboost().
The agua extension package is required to fit this model.
boost_tree( mtry = integer(), trees = integer(), tree_depth = integer(), learn_rate = numeric(), min_n = integer(), loss_reduction = numeric(), stop_iter = integer() ) %>% set_engine("h2o") %>% set_mode("regression") %>% translate()
## Boosted Tree Model Specification (regression) ## ## Main Arguments: ## mtry = integer() ## trees = integer() ## min_n = integer() ## tree_depth = integer() ## learn_rate = numeric() ## loss_reduction = numeric() ## stop_iter = integer() ## ## Computational engine: h2o ## ## Model fit template: ## agua::h2o_train_xgboost(x = missing_arg(), y = missing_arg(), ## weights = missing_arg(), validation_frame = missing_arg(), ## col_sample_rate = integer(), ntrees = integer(), min_rows = integer(), ## max_depth = integer(), learn_rate = numeric(), min_split_improvement = numeric(), ## stopping_rounds = integer())
Translation from parsnip to the original package (classification)
The agua extension package is required to fit this model.
boost_tree( mtry = integer(), trees = integer(), tree_depth = integer(), learn_rate = numeric(), min_n = integer(), loss_reduction = numeric(), stop_iter = integer() ) %>% set_engine("h2o") %>% set_mode("classification") %>% translate()
## Boosted Tree Model Specification (classification) ## ## Main Arguments: ## mtry = integer() ## trees = integer() ## min_n = integer() ## tree_depth = integer() ## learn_rate = numeric() ## loss_reduction = numeric() ## stop_iter = integer() ## ## Computational engine: h2o ## ## Model fit template: ## agua::h2o_train_xgboost(x = missing_arg(), y = missing_arg(), ## weights = missing_arg(), validation_frame = missing_arg(), ## col_sample_rate = integer(), ntrees = integer(), min_rows = integer(), ## max_depth = integer(), learn_rate = numeric(), min_split_improvement = numeric(), ## stopping_rounds = integer())
Preprocessing
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c}
vs {b, d}
) when splitting at a node. Dummy variables
are not required for this model.
Non-numeric predictors (i.e., factors) are internally converted to numeric. In the classification context, non-numeric outcomes (i.e., factors) are also internally converted to numeric.
Interpreting mtry
The mtry argument denotes the number of predictors that will be
randomly sampled at each split when creating tree models.

Some engines, such as "xgboost", "xrf", and "lightgbm", interpret
their analogue to the mtry argument as the proportion of predictors
that will be randomly sampled at each split rather than the count. In
some settings, such as when tuning over preprocessors that influence
the number of predictors, this parameterization is quite helpful:
interpreting mtry as a proportion means that [0, 1] is always a valid
range for that parameter, regardless of input data.

parsnip and its extensions accommodate this parameterization using the
counts argument: a logical indicating whether mtry should be
interpreted as the number of predictors that will be randomly sampled
at each split. TRUE indicates that mtry will be interpreted in its
sense as a count; FALSE indicates that the argument will be
interpreted in its sense as a proportion.

mtry is a main model argument for boost_tree() and rand_forest(), and
thus should not have an engine-specific interface. So, regardless of
engine, counts defaults to TRUE. For engines that support the
proportion interpretation (currently "xgboost" and "xrf", via the
rules package, and "lightgbm" via the bonsai package) the user can
pass the counts = FALSE argument to set_engine() to supply mtry values
within [0, 1].
Initializing h2o
To use the h2o engine with tidymodels, please run h2o::h2o.init()
first. By default, this connects R to the local h2o server. This needs
to be done in every new R session. You can also connect to a remote
h2o server with an IP address; for more details see h2o::h2o.init().

You can control the number of threads in the thread pool used by h2o
with the nthreads argument. By default, it uses all CPUs on the host.
This is different from the usual parallel processing mechanism in
tidymodels for tuning: tidymodels parallelizes over resamples, while
h2o parallelizes over hyperparameter combinations for a given
resample.

h2o will automatically shut down the local h2o instance started by R
when R is terminated. To manually stop the h2o server, run
h2o::h2o.shutdown().
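A minimal sketch of such a session (the nthreads value is illustrative):

library(h2o)

h2o::h2o.init(nthreads = 4)   # connect R to a local h2o server

# ... fit and predict parsnip models with the "h2o" engine ...

h2o::h2o.shutdown(prompt = FALSE)   # manually stop the server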
Saving fitted model objects
Models fitted with this engine may require native serialization methods to be properly saved and/or passed between R sessions. To learn more about preparing fitted models for serialization, see the bundle package.
Boosted trees via lightgbm
Description
lightgbm::lgb.train() creates a series of decision trees
forming an ensemble. Each tree depends on the results of previous trees.
All trees in the ensemble are combined to produce a final prediction.
Details
For this engine, there are multiple modes: regression and classification
Tuning Parameters
This model has 6 tuning parameters:

- tree_depth: Tree Depth (type: integer, default: -1)
- trees: # Trees (type: integer, default: 100)
- learn_rate: Learning Rate (type: double, default: 0.1)
- mtry: # Randomly Selected Predictors (type: integer, default: see below)
- min_n: Minimal Node Size (type: integer, default: 20)
- loss_reduction: Minimum Loss Reduction (type: double, default: 0)
The mtry parameter gives the number of predictors that will be
randomly sampled at each split. The default is to use all predictors.

Rather than as a number, lightgbm::lgb.train()’s feature_fraction
argument encodes mtry as the proportion of predictors that will be
randomly sampled at each split. parsnip translates mtry, supplied as
the number of predictors, to a proportion under the hood. That is, the
user should still supply the argument as mtry to boost_tree(), and do
so in its sense as a number rather than a proportion; before passing
mtry to lightgbm::lgb.train(), parsnip will convert the mtry value to
a proportion.

Note that parsnip’s translation can be overridden via the counts
argument, supplied to set_engine(). By default, counts is set to TRUE,
but supplying the argument counts = FALSE allows the user to supply
mtry as a proportion rather than a number.
Translation from parsnip to the original package (regression)
The bonsai extension package is required to fit this model.
boost_tree( mtry = integer(), trees = integer(), tree_depth = integer(), learn_rate = numeric(), min_n = integer(), loss_reduction = numeric() ) %>% set_engine("lightgbm") %>% set_mode("regression") %>% translate()
## Boosted Tree Model Specification (regression) ## ## Main Arguments: ## mtry = integer() ## trees = integer() ## min_n = integer() ## tree_depth = integer() ## learn_rate = numeric() ## loss_reduction = numeric() ## ## Computational engine: lightgbm ## ## Model fit template: ## bonsai::train_lightgbm(x = missing_arg(), y = missing_arg(), ## weights = missing_arg(), feature_fraction_bynode = integer(), ## num_iterations = integer(), min_data_in_leaf = integer(), ## max_depth = integer(), learning_rate = numeric(), min_gain_to_split = numeric(), ## verbose = -1, num_threads = 0, seed = sample.int(10^5, 1), ## deterministic = TRUE)
Translation from parsnip to the original package (classification)
The bonsai extension package is required to fit this model.
boost_tree( mtry = integer(), trees = integer(), tree_depth = integer(), learn_rate = numeric(), min_n = integer(), loss_reduction = numeric() ) %>% set_engine("lightgbm") %>% set_mode("classification") %>% translate()
## Boosted Tree Model Specification (classification) ## ## Main Arguments: ## mtry = integer() ## trees = integer() ## min_n = integer() ## tree_depth = integer() ## learn_rate = numeric() ## loss_reduction = numeric() ## ## Computational engine: lightgbm ## ## Model fit template: ## bonsai::train_lightgbm(x = missing_arg(), y = missing_arg(), ## weights = missing_arg(), feature_fraction_bynode = integer(), ## num_iterations = integer(), min_data_in_leaf = integer(), ## max_depth = integer(), learning_rate = numeric(), min_gain_to_split = numeric(), ## verbose = -1, num_threads = 0, seed = sample.int(10^5, 1), ## deterministic = TRUE)
bonsai::train_lightgbm() is a wrapper around lightgbm::lgb.train()
(and other functions) that makes it easier to run this model.
Other details
Preprocessing
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c} vs {b, d}) when splitting at a node. Dummy variables
are not required for this model.
Non-numeric predictors (i.e., factors) are internally converted to numeric. In the classification context, non-numeric outcomes (i.e., factors) are also internally converted to numeric.
Interpreting mtry
The mtry argument denotes the number of predictors that will be
randomly sampled at each split when creating tree models.

Some engines, such as "xgboost", "xrf", and "lightgbm", interpret
their analogue to the mtry argument as the proportion of predictors
that will be randomly sampled at each split rather than the count. In
some settings, such as when tuning over preprocessors that influence
the number of predictors, this parameterization is quite helpful:
interpreting mtry as a proportion means that [0, 1] is always a valid
range for that parameter, regardless of input data.

parsnip and its extensions accommodate this parameterization using the
counts argument: a logical indicating whether mtry should be
interpreted as the number of predictors that will be randomly sampled
at each split. TRUE indicates that mtry will be interpreted in its
sense as a count; FALSE indicates that the argument will be
interpreted in its sense as a proportion.

mtry is a main model argument for boost_tree() and rand_forest(), and
thus should not have an engine-specific interface. So, regardless of
engine, counts defaults to TRUE. For engines that support the
proportion interpretation (currently "xgboost" and "xrf", via the
rules package, and "lightgbm" via the bonsai package) the user can
pass the counts = FALSE argument to set_engine() to supply mtry values
within [0, 1].
Bagging
The sample_size argument is translated to the bagging_fraction
parameter in the param argument of lgb.train. The argument is
interpreted by lightgbm as a proportion rather than a count, so bonsai
internally reparameterizes the sample_size argument with
dials::sample_prop() during tuning.

To effectively enable bagging, the user would also need to set the
bagging_freq argument to lightgbm. bagging_freq defaults to 0, which
means bagging is disabled, and a bagging_freq argument of k means that
the booster will perform bagging at every kth boosting iteration.
Thus, by default, the sample_size argument would be ignored without
setting this argument manually. Other boosting libraries, like
xgboost, do not have an analogous argument to bagging_freq and use
k = 1 when the analogue to bagging_fraction is in (0, 1). bonsai will
thus automatically set bagging_freq = 1 in set_engine("lightgbm", ...)
if sample_size (i.e. bagging_fraction) is not equal to 1 and no
bagging_freq value is supplied. This default can be overridden by
setting the bagging_freq argument to set_engine() manually.
Verbosity
bonsai quiets much of the logging output from lightgbm::lgb.train()
by default. With default settings, logged warnings and errors will
still be passed on to the user. To mute all logging during training,
including warnings and errors, set quiet = TRUE.
Sparse Data
This model can utilize sparse data during model fitting and prediction.
Both sparse matrices such as dgCMatrix from the Matrix package and
sparse tibbles from the sparsevctrs package are supported. See
sparse_data for more information.
Examples
The “Introduction to bonsai” article contains examples of
boost_tree() with the "lightgbm" engine.
References
LightGBM: A Highly Efficient Gradient Boosting Decision Tree.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Boosted trees
Description
mboost::blackboost() fits a series of decision trees forming an ensemble.
Each tree depends on the results of previous trees. All trees in the
ensemble are combined to produce a final prediction.
Details
For this engine, there is a single mode: censored regression
Tuning Parameters
This model has 5 tuning parameters:

- mtry: # Randomly Selected Predictors (type: integer, default: see below)
- trees: # Trees (type: integer, default: 100L)
- tree_depth: Tree Depth (type: integer, default: 2L)
- min_n: Minimal Node Size (type: integer, default: 10L)
- loss_reduction: Minimum Loss Reduction (type: double, default: 0)
The mtry parameter is related to the number of predictors. The default
is to use all predictors.
Translation from parsnip to the original package (censored regression)
The censored extension package is required to fit this model.
library(censored) boost_tree() %>% set_engine("mboost") %>% set_mode("censored regression") %>% translate()
## Boosted Tree Model Specification (censored regression) ## ## Computational engine: mboost ## ## Model fit template: ## censored::blackboost_train(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg(), family = mboost::CoxPH())
censored::blackboost_train() is a wrapper around mboost::blackboost()
(and other functions) that makes it easier to run this model.
Preprocessing requirements
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c} vs {b, d}) when splitting at a node. Dummy variables
are not required for this model.
Other details
Predictions of type "time" are predictions of the mean survival time.
References
Buehlmann P, Hothorn T. 2007. Boosting algorithms: regularization, prediction and model fitting. Statistical Science, 22(4), 477–505.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Boosted trees via Spark
Description
sparklyr::ml_gradient_boosted_trees() creates a series of decision trees
forming an ensemble. Each tree depends on the results of previous trees.
All trees in the ensemble are combined to produce a final prediction.
Details
For this engine, there are multiple modes: classification and regression. However, multiclass classification is not supported yet.
Tuning Parameters
This model has 7 tuning parameters:

- tree_depth: Tree Depth (type: integer, default: 5L)
- trees: # Trees (type: integer, default: 20L)
- learn_rate: Learning Rate (type: double, default: 0.1)
- mtry: # Randomly Selected Predictors (type: integer, default: see below)
- min_n: Minimal Node Size (type: integer, default: 1L)
- loss_reduction: Minimum Loss Reduction (type: double, default: 0.0)
- sample_size: # Observations Sampled (type: double, default: 1.0)
The mtry parameter is related to the number of predictors. The default
depends on the model mode. For classification, the square root of the
number of predictors is used and for regression, one third of the
predictors are sampled.
Translation from parsnip to the original package (regression)
boost_tree( mtry = integer(), trees = integer(), min_n = integer(), tree_depth = integer(), learn_rate = numeric(), loss_reduction = numeric(), sample_size = numeric() ) %>% set_engine("spark") %>% set_mode("regression") %>% translate()
## Boosted Tree Model Specification (regression) ## ## Main Arguments: ## mtry = integer() ## trees = integer() ## min_n = integer() ## tree_depth = integer() ## learn_rate = numeric() ## loss_reduction = numeric() ## sample_size = numeric() ## ## Computational engine: spark ## ## Model fit template: ## sparklyr::ml_gradient_boosted_trees(x = missing_arg(), formula = missing_arg(), ## type = "regression", feature_subset_strategy = integer(), ## max_iter = integer(), min_instances_per_node = min_rows(integer(0), ## x), max_depth = integer(), step_size = numeric(), min_info_gain = numeric(), ## subsampling_rate = numeric(), seed = sample.int(10^5, 1))
Translation from parsnip to the original package (classification)
boost_tree( mtry = integer(), trees = integer(), min_n = integer(), tree_depth = integer(), learn_rate = numeric(), loss_reduction = numeric(), sample_size = numeric() ) %>% set_engine("spark") %>% set_mode("classification") %>% translate()
## Boosted Tree Model Specification (classification) ## ## Main Arguments: ## mtry = integer() ## trees = integer() ## min_n = integer() ## tree_depth = integer() ## learn_rate = numeric() ## loss_reduction = numeric() ## sample_size = numeric() ## ## Computational engine: spark ## ## Model fit template: ## sparklyr::ml_gradient_boosted_trees(x = missing_arg(), formula = missing_arg(), ## type = "classification", feature_subset_strategy = integer(), ## max_iter = integer(), min_instances_per_node = min_rows(integer(0), ## x), max_depth = integer(), step_size = numeric(), min_info_gain = numeric(), ## subsampling_rate = numeric(), seed = sample.int(10^5, 1))
Preprocessing requirements
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c} vs {b, d}) when splitting at a node. Dummy variables
are not required for this model.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() arguments have arguments called case_weights
that expect vectors of case weights.
Note that, for spark engines, the case_weight argument value should be
a character string to specify the column with the numeric case weights.
Other details
For models created using the "spark" engine, there are several things
to consider.

- Only the formula interface via fit() is available; using fit_xy()
  will generate an error.
- The predictions will always be in a Spark table format. The names
  will be the same as documented but without the dots.
- There is no equivalent to factor columns in Spark tables so class
  predictions are returned as character columns.
- To retain the model object for a new R session (via save()), the
  model$fit element of the parsnip object should be serialized via
  ml_save(object$fit) and separately saved to disk. In a new session,
  the object can be reloaded and reattached to the parsnip object, as
  sketched below.
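A minimal sketch of that last step; here fitted is assumed to be a parsnip model fit made with this engine, and sc an active Spark connection:

library(sparklyr)

# save the underlying Spark model separately from the parsnip object
ml_save(fitted$fit, path = "gbt_model")

# in a new session, reload it and reattach it to the parsnip object
fitted$fit <- ml_load(sc, path = "gbt_model")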
References
Luraschi, J, K Kuo, and E Ruiz. 2019. Mastering Spark with R. O’Reilly Media
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Boosted trees via xgboost
Description
xgboost::xgb.train() creates a series of decision trees forming an
ensemble. Each tree depends on the results of previous trees. All trees in
the ensemble are combined to produce a final prediction.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 8 tuning parameters:

- tree_depth: Tree Depth (type: integer, default: 6L)
- trees: # Trees (type: integer, default: 15L)
- learn_rate: Learning Rate (type: double, default: 0.3)
- mtry: # Randomly Selected Predictors (type: integer, default: see below)
- min_n: Minimal Node Size (type: integer, default: 1L)
- loss_reduction: Minimum Loss Reduction (type: double, default: 0.0)
- sample_size: Proportion Observations Sampled (type: double, default: 1.0)
- stop_iter: # Iterations Before Stopping (type: integer, default: Inf)
For mtry, the default value of NULL translates to using all available
columns.
Translation from parsnip to the original package (regression)
boost_tree( mtry = integer(), trees = integer(), min_n = integer(), tree_depth = integer(), learn_rate = numeric(), loss_reduction = numeric(), sample_size = numeric(), stop_iter = integer() ) %>% set_engine("xgboost") %>% set_mode("regression") %>% translate()
## Boosted Tree Model Specification (regression) ## ## Main Arguments: ## mtry = integer() ## trees = integer() ## min_n = integer() ## tree_depth = integer() ## learn_rate = numeric() ## loss_reduction = numeric() ## sample_size = numeric() ## stop_iter = integer() ## ## Computational engine: xgboost ## ## Model fit template: ## parsnip::xgb_train(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## colsample_bynode = integer(), nrounds = integer(), min_child_weight = integer(), ## max_depth = integer(), eta = numeric(), gamma = numeric(), ## subsample = numeric(), early_stop = integer(), nthread = 1, ## verbose = 0)
Translation from parsnip to the original package (classification)
boost_tree( mtry = integer(), trees = integer(), min_n = integer(), tree_depth = integer(), learn_rate = numeric(), loss_reduction = numeric(), sample_size = numeric(), stop_iter = integer() ) %>% set_engine("xgboost") %>% set_mode("classification") %>% translate()
## Boosted Tree Model Specification (classification) ## ## Main Arguments: ## mtry = integer() ## trees = integer() ## min_n = integer() ## tree_depth = integer() ## learn_rate = numeric() ## loss_reduction = numeric() ## sample_size = numeric() ## stop_iter = integer() ## ## Computational engine: xgboost ## ## Model fit template: ## parsnip::xgb_train(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## colsample_bynode = integer(), nrounds = integer(), min_child_weight = integer(), ## max_depth = integer(), eta = numeric(), gamma = numeric(), ## subsample = numeric(), early_stop = integer(), nthread = 1, ## verbose = 0)
xgb_train() is a wrapper around xgboost::xgb.train() (and other
functions) that makes it easier to run this model.
Preprocessing requirements
xgboost does not have a means to translate factor predictors to
grouped splits. Factor/categorical predictors need to be converted to
numeric values (e.g., dummy or indicator variables) for this engine.
When using the formula method via fit.model_spec(), parsnip will
convert factor columns to indicators using a one-hot encoding.

For classification, non-numeric outcomes (i.e., factors) are
internally converted to numeric. For binary classification, the
event_level argument of set_engine() can be set to either "first" or
"second" to specify which level should be used as the event. This can
be helpful when a watchlist is used to monitor performance from within
the xgboost training process.
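For example, a sketch of setting the event level:

boost_tree() %>%
  set_engine("xgboost", event_level = "second") %>%
  set_mode("classification")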
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() arguments have arguments called case_weights
that expect vectors of case weights.
Sparse Data
This model can utilize sparse data during model fitting and prediction.
Both sparse matrices such as dgCMatrix from the Matrix package and
sparse tibbles from the sparsevctrs package are supported. See
sparse_data for more information.
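A minimal sketch of fitting from a sparse matrix (the simulated data are purely illustrative):

library(Matrix)

x <- Matrix::rsparsematrix(nrow = 100, ncol = 10, density = 0.2)
colnames(x) <- paste0("x", 1:10)
y <- rnorm(100)

boost_tree(trees = 10) %>%
  set_engine("xgboost") %>%
  set_mode("regression") %>%
  fit_xy(x = x, y = y)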
Other details
Interfacing with the params argument
The xgboost function that parsnip indirectly wraps,
xgboost::xgb.train(), takes most arguments via the params list
argument. To supply engine-specific arguments that are documented in
xgboost::xgb.train() as arguments to be passed via params, supply the
list elements directly as named arguments to set_engine() rather than
as elements in params. For example, pass a non-default evaluation
metric like this:
# good boost_tree() %>% set_engine("xgboost", eval_metric = "mae")
## Boosted Tree Model Specification (unknown mode) ## ## Engine-Specific Arguments: ## eval_metric = mae ## ## Computational engine: xgboost
…rather than this:
# bad boost_tree() %>% set_engine("xgboost", params = list(eval_metric = "mae"))
## Boosted Tree Model Specification (unknown mode) ## ## Engine-Specific Arguments: ## params = list(eval_metric = "mae") ## ## Computational engine: xgboost
parsnip will then route arguments as needed. In the case that
arguments are passed to params via set_engine(), parsnip will warn and
re-route the arguments as needed. Note, though, that arguments passed
to params cannot be tuned.
Sparse matrices
xgboost requires the data to be in a sparse format. If your predictor
data are already in this format, then use fit_xy.model_spec() to pass
it to the model function. Otherwise, parsnip converts the data to this
format.
Parallel processing
By default, the model is trained without parallel processing. This can
be changed by passing the nthread parameter to set_engine(). However,
it is unwise to combine this with external parallel processing when
using the package.
Interpreting mtry
The mtry argument denotes the number of predictors that will be
randomly sampled at each split when creating tree models.

Some engines, such as "xgboost", "xrf", and "lightgbm", interpret
their analogue to the mtry argument as the proportion of predictors
that will be randomly sampled at each split rather than the count. In
some settings, such as when tuning over preprocessors that influence
the number of predictors, this parameterization is quite helpful:
interpreting mtry as a proportion means that [0, 1] is always a valid
range for that parameter, regardless of input data.

parsnip and its extensions accommodate this parameterization using the
counts argument: a logical indicating whether mtry should be
interpreted as the number of predictors that will be randomly sampled
at each split. TRUE indicates that mtry will be interpreted in its
sense as a count; FALSE indicates that the argument will be
interpreted in its sense as a proportion.

mtry is a main model argument for boost_tree() and rand_forest(), and
thus should not have an engine-specific interface. So, regardless of
engine, counts defaults to TRUE. For engines that support the
proportion interpretation (currently "xgboost" and "xrf", via the
rules package, and "lightgbm" via the bonsai package) the user can
pass the counts = FALSE argument to set_engine() to supply mtry values
within [0, 1].
Early stopping
The stop_iter argument allows the model to prematurely stop training
if the objective function does not improve within early_stop
iterations.

The best way to use this feature is in conjunction with an internal
validation set. To do this, pass the validation parameter of
xgb_train() via the parsnip set_engine() function. This is the
proportion of the training set that should be reserved for measuring
performance (and stopping early).

If the model specification has early_stop >= trees, early_stop is
converted to trees - 1 and a warning is issued.
Note that, since the validation argument provides an alternative
interface to watchlist, the watchlist argument is guarded by parsnip
and will be ignored (with a warning) if passed.
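A sketch of pairing stop_iter with an internal validation set (the values are illustrative):

boost_tree(trees = 500, stop_iter = 10) %>%
  set_engine("xgboost", validation = 0.2) %>%
  set_mode("regression")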
Objective function
parsnip chooses the objective function based on the characteristics of
the outcome. To use a different loss, pass the objective argument to
set_engine() directly.
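For example, a sketch that swaps in a standard xgboost objective:

boost_tree() %>%
  set_engine("xgboost", objective = "count:poisson") %>%
  set_mode("regression")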
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
Models fitted with this engine may require native serialization methods to be properly saved and/or passed between R sessions. To learn more about preparing fitted models for serialization, see the bundle package.
Examples
The “Fitting and Predicting with parsnip” article contains examples
for boost_tree() with the "xgboost" engine.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
C5.0 rule-based classification models
Description
C50::C5.0() fits a model that derives feature rules from a tree for
prediction. A single tree or boosted ensemble can be used.
rules::c5_fit() is a wrapper around this function.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 2 tuning parameters:

- trees: # Trees (type: integer, default: 1L)
- min_n: Minimal Node Size (type: integer, default: 2L)
Note that C5.0 has a tool for early stopping during boosting, where
fewer iterations of boosting are performed than the number requested.
C5_rules() turns this feature off (although it can be re-enabled using
C50::C5.0Control()).
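A minimal sketch of re-enabling it, assuming the control object is passed through to C50::C5.0():

library(rules)

C5_rules(trees = 20) %>%
  set_engine("C5.0", control = C50::C5.0Control(earlyStopping = TRUE)) %>%
  set_mode("classification")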
Translation from parsnip to the underlying model call (classification)
The rules extension package is required to fit this model.
library(rules) C5_rules( trees = integer(1), min_n = integer(1) ) %>% set_engine("C5.0") %>% set_mode("classification") %>% translate()
## C5.0 Model Specification (classification) ## ## Main Arguments: ## trees = integer(1) ## min_n = integer(1) ## ## Computational engine: C5.0 ## ## Model fit template: ## rules::c5_fit(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## trials = integer(1), minCases = integer(1))
Preprocessing requirements
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c}
vs {b, d}
) when splitting at a node. Dummy variables
are not required for this model.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() arguments have arguments called case_weights
that expect vectors of case weights.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
References
Quinlan R (1992). “Learning with Continuous Classes.” Proceedings of the 5th Australian Joint Conference On Artificial Intelligence, pp. 343-348.
Quinlan R (1993). “Combining Instance-Based and Model-Based Learning.” Proceedings of the Tenth International Conference on Machine Learning, pp. 236-243.
Kuhn M and Johnson K (2013). Applied Predictive Modeling. Springer.
Cubist rule-based regression models
Description
Cubist::cubist() fits a model that derives simple feature rules from a
tree ensemble and creates regression models within each rule.
rules::cubist_fit() is a wrapper around this function.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This model has 3 tuning parameters:

- committees: # Committees (type: integer, default: 1L)
- neighbors: # Nearest Neighbors (type: integer, default: 0L)
- max_rules: Max. Rules (type: integer, default: NA_integer_)
Translation from parsnip to the underlying model call (regression)
The rules extension package is required to fit this model.
library(rules) cubist_rules( committees = integer(1), neighbors = integer(1), max_rules = integer(1) ) %>% set_engine("Cubist") %>% set_mode("regression") %>% translate()
## Cubist Model Specification (regression) ## ## Main Arguments: ## committees = integer(1) ## neighbors = integer(1) ## max_rules = integer(1) ## ## Computational engine: Cubist ## ## Model fit template: ## rules::cubist_fit(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## committees = integer(1), neighbors = integer(1), max_rules = integer(1))
Preprocessing requirements
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c}
vs {b, d}
) when splitting at a node. Dummy variables
are not required for this model.
References
Quinlan R (1992). “Learning with Continuous Classes.” Proceedings of the 5th Australian Joint Conference On Artificial Intelligence, pp. 343-348.
Quinlan R (1993). “Combining Instance-Based and Model-Based Learning.” Proceedings of the Tenth International Conference on Machine Learning, pp. 236-243.
Kuhn M and Johnson K (2013). Applied Predictive Modeling. Springer.
Decision trees via C5.0
Description
C50::C5.0() fits a model as a set of if/then statements that creates a
tree-based structure.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 1 tuning parameter:

- min_n: Minimal Node Size (type: integer, default: 2L)
Translation from parsnip to the original package (classification)
decision_tree(min_n = integer()) %>% set_engine("C5.0") %>% set_mode("classification") %>% translate()
## Decision Tree Model Specification (classification) ## ## Main Arguments: ## min_n = integer() ## ## Computational engine: C5.0 ## ## Model fit template: ## parsnip::C5.0_train(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## minCases = integer(), trials = 1)
C5.0_train() is a wrapper around C50::C5.0() that makes it easier to
run this model.
Preprocessing requirements
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c} vs {b, d}) when splitting at a node. Dummy variables
are not required for this model.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() arguments have arguments called case_weights
that expect vectors of case weights.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
Examples
The “Fitting and Predicting with parsnip” article contains examples
for decision_tree() with the "C5.0" engine.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Decision trees via partykit
Description
partykit::ctree() fits a model as a set of if/then statements that creates a
tree-based structure using hypothesis testing methods.
Details
For this engine, there are multiple modes: censored regression, regression, and classification
Tuning Parameters
This model has 2 tuning parameters:

- tree_depth: Tree Depth (type: integer, default: see below)
- min_n: Minimal Node Size (type: integer, default: 20L)
The tree_depth parameter defaults to 0, which means no restrictions
are applied to tree depth.
An engine-specific parameter for this model is:

- mtry: the number of predictors, selected at random, that are
  evaluated for splitting. The default is to use all predictors; a
  sketch of supplying it follows.
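library(bonsai)

# a minimal sketch; the mtry value is purely illustrative
decision_tree(min_n = 20) %>%
  set_engine("partykit", mtry = 3) %>%
  set_mode("classification")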
Translation from parsnip to the original package (regression)
The bonsai extension package is required to fit this model.
library(bonsai) decision_tree(tree_depth = integer(1), min_n = integer(1)) %>% set_engine("partykit") %>% set_mode("regression") %>% translate()
## Decision Tree Model Specification (regression) ## ## Main Arguments: ## tree_depth = integer(1) ## min_n = integer(1) ## ## Computational engine: partykit ## ## Model fit template: ## parsnip::ctree_train(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg(), maxdepth = integer(1), minsplit = min_rows(0L, ## data))
Translation from parsnip to the original package (classification)
The bonsai extension package is required to fit this model.
library(bonsai) decision_tree(tree_depth = integer(1), min_n = integer(1)) %>% set_engine("partykit") %>% set_mode("classification") %>% translate()
## Decision Tree Model Specification (classification) ## ## Main Arguments: ## tree_depth = integer(1) ## min_n = integer(1) ## ## Computational engine: partykit ## ## Model fit template: ## parsnip::ctree_train(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg(), maxdepth = integer(1), minsplit = min_rows(0L, ## data))
parsnip::ctree_train() is a wrapper around partykit::ctree() (and
other functions) that makes it easier to run this model.
Translation from parsnip to the original package (censored regression)
The censored extension package is required to fit this model.
library(censored) decision_tree(tree_depth = integer(1), min_n = integer(1)) %>% set_engine("partykit") %>% set_mode("censored regression") %>% translate()
## Decision Tree Model Specification (censored regression) ## ## Main Arguments: ## tree_depth = integer(1) ## min_n = integer(1) ## ## Computational engine: partykit ## ## Model fit template: ## parsnip::ctree_train(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg(), maxdepth = integer(1), minsplit = min_rows(0L, ## data))
censored::cond_inference_surv_ctree() is a wrapper around
partykit::ctree() (and other functions) that makes it easier to run
this model.
Preprocessing requirements
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c} vs {b, d}) when splitting at a node. Dummy variables
are not required for this model.
Other details
Predictions of type "time" are predictions of the median survival
time.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Decision trees via CART
Description
rpart::rpart() fits a model as a set of if/then statements that
creates a tree-based structure.
Details
For this engine, there are multiple modes: classification, regression, and censored regression
Tuning Parameters
This model has 3 tuning parameters:

- tree_depth: Tree Depth (type: integer, default: 30L)
- min_n: Minimal Node Size (type: integer, default: 2L)
- cost_complexity: Cost-Complexity Parameter (type: double, default: 0.01)
Translation from parsnip to the original package (classification)
decision_tree(tree_depth = integer(1), min_n = integer(1), cost_complexity = double(1)) %>% set_engine("rpart") %>% set_mode("classification") %>% translate()
## Decision Tree Model Specification (classification) ## ## Main Arguments: ## cost_complexity = double(1) ## tree_depth = integer(1) ## min_n = integer(1) ## ## Computational engine: rpart ## ## Model fit template: ## rpart::rpart(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), ## cp = double(1), maxdepth = integer(1), minsplit = min_rows(0L, ## data))
Translation from parsnip to the original package (regression)
decision_tree(tree_depth = integer(1), min_n = integer(1), cost_complexity = double(1)) %>% set_engine("rpart") %>% set_mode("regression") %>% translate()
## Decision Tree Model Specification (regression) ## ## Main Arguments: ## cost_complexity = double(1) ## tree_depth = integer(1) ## min_n = integer(1) ## ## Computational engine: rpart ## ## Model fit template: ## rpart::rpart(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), ## cp = double(1), maxdepth = integer(1), minsplit = min_rows(0L, ## data))
Translation from parsnip to the original package (censored regression)
The censored extension package is required to fit this model.
library(censored) decision_tree( tree_depth = integer(1), min_n = integer(1), cost_complexity = double(1) ) %>% set_engine("rpart") %>% set_mode("censored regression") %>% translate()
## Decision Tree Model Specification (censored regression) ## ## Main Arguments: ## cost_complexity = double(1) ## tree_depth = integer(1) ## min_n = integer(1) ## ## Computational engine: rpart ## ## Model fit template: ## pec::pecRpart(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg(), cp = double(1), maxdepth = integer(1), ## minsplit = min_rows(0L, data))
Preprocessing requirements
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c} vs {b, d}) when splitting at a node. Dummy variables
are not required for this model.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() arguments have arguments called case_weights
that expect vectors of case weights.
Other details
Predictions of type "time" are predictions of the mean survival time.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
Examples
The “Fitting and Predicting with parsnip” article contains examples
for decision_tree() with the "rpart" engine.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Decision trees via Spark
Description
sparklyr::ml_decision_tree() fits a model as a set of if/then
statements that creates a tree-based structure.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 2 tuning parameters:
-
tree_depth
: Tree Depth (type: integer, default: 5L) -
min_n
: Minimal Node Size (type: integer, default: 1L)
Translation from parsnip to the original package (classification)
decision_tree(tree_depth = integer(1), min_n = integer(1)) %>% set_engine("spark") %>% set_mode("classification") %>% translate()
## Decision Tree Model Specification (classification) ## ## Main Arguments: ## tree_depth = integer(1) ## min_n = integer(1) ## ## Computational engine: spark ## ## Model fit template: ## sparklyr::ml_decision_tree_classifier(x = missing_arg(), formula = missing_arg(), ## max_depth = integer(1), min_instances_per_node = min_rows(0L, ## x), seed = sample.int(10^5, 1))
Translation from parsnip to the original package (regression)
decision_tree(tree_depth = integer(1), min_n = integer(1)) %>% set_engine("spark") %>% set_mode("regression") %>% translate()
## Decision Tree Model Specification (regression) ## ## Main Arguments: ## tree_depth = integer(1) ## min_n = integer(1) ## ## Computational engine: spark ## ## Model fit template: ## sparklyr::ml_decision_tree_regressor(x = missing_arg(), formula = missing_arg(), ## max_depth = integer(1), min_instances_per_node = min_rows(0L, ## x), seed = sample.int(10^5, 1))
Preprocessing requirements
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c} vs {b, d}) when splitting at a node. Dummy variables
are not required for this model.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() arguments have arguments called case_weights
that expect vectors of case weights.
Note that, for spark engines, the case_weight argument value should be
a character string to specify the column with the numeric case weights.
Other details
For models created using the "spark" engine, there are several things
to consider.

- Only the formula interface via fit() is available; using fit_xy()
  will generate an error.
- The predictions will always be in a Spark table format. The names
  will be the same as documented but without the dots.
- There is no equivalent to factor columns in Spark tables so class
  predictions are returned as character columns.
- To retain the model object for a new R session (via save()), the
  model$fit element of the parsnip object should be serialized via
  ml_save(object$fit) and separately saved to disk. In a new session,
  the object can be reloaded and reattached to the parsnip object.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Flexible discriminant analysis via earth
Description
mda::fda() (in conjunction with earth::earth()) can fit a nonlinear
discriminant analysis model that uses nonlinear features created using
multivariate adaptive regression splines (MARS). This function can fit
classification models.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 3 tuning parameters:

- num_terms: # Model Terms (type: integer, default: (see below))
- prod_degree: Degree of Interaction (type: integer, default: 1L)
- prune_method: Pruning Method (type: character, default: ‘backward’)
The default value of num_terms depends on the number of columns (p):
min(200, max(20, 2 * p)) + 1. Note that num_terms = 1 is an
intercept-only model.
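As a quick check of that formula, with p = 10 predictor columns:

p <- 10
min(200, max(20, 2 * p)) + 1
## [1] 21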
Translation from parsnip to the original package
The discrim extension package is required to fit this model.
library(discrim) discrim_flexible( num_terms = integer(0), prod_degree = integer(0), prune_method = character(0) ) %>% translate()
## Flexible Discriminant Model Specification (classification) ## ## Main Arguments: ## num_terms = integer(0) ## prod_degree = integer(0) ## prune_method = character(0) ## ## Computational engine: earth ## ## Model fit template: ## mda::fda(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), ## nprune = integer(0), degree = integer(0), pmethod = character(0), ## method = earth::earth)
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), parsnip will convert factor columns to
indicators.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() arguments have arguments called case_weights
that expect vectors of case weights.
References
Hastie, Tibshirani & Buja (1994) Flexible Discriminant Analysis by Optimal Scoring, Journal of the American Statistical Association, 89:428, 1255-1270
Friedman (1991). Multivariate Adaptive Regression Splines. The Annals of Statistics, 19(1), 1-67.
Linear discriminant analysis via MASS
Description
MASS::lda()
fits a model that estimates a multivariate
distribution for the predictors separately for the data in each class
(Gaussian with a common covariance matrix). Bayes' theorem is used
to compute the probability of each class, given the predictor values.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This engine has no tuning parameters.
Translation from parsnip to the original package
The discrim extension package is required to fit this model.
library(discrim) discrim_linear() %>% set_engine("MASS") %>% translate()
## Linear Discriminant Model Specification (classification) ## ## Computational engine: MASS ## ## Model fit template: ## MASS::lda(formula = missing_arg(), data = missing_arg())
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), parsnip will convert factor columns to
indicators.
Variance calculations are used in these computations so zero-variance predictors (i.e., with a single unique value) should be eliminated before fitting the model.
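A minimal sketch of a fit, assuming the discrim extension package is installed (iris serves only as an illustrative dataset):
library(discrim)

lda_fit <- discrim_linear() %>%
  set_engine("MASS") %>%
  fit(Species ~ ., data = iris)

predict(lda_fit, head(iris), type = "prob")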
Case weights
The underlying model implementation does not allow for case weights.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Linear discriminant analysis via flexible discriminant analysis
Description
mda::fda() (in conjunction with mda::gen.ridge()) can fit a linear
discriminant analysis model that penalizes the predictor coefficients with a
quadratic penalty (i.e., a ridge or weight decay approach).
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 1 tuning parameter (its use is sketched below):

- penalty: Amount of Regularization (type: double, default: 1.0)
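For example, a penalized specification might be fit as in this sketch (assuming the discrim and mda packages are installed; the penalty value is arbitrary):
library(discrim)

discrim_linear(penalty = 0.1) %>%
  set_engine("mda") %>%
  fit(Species ~ ., data = iris)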
Translation from parsnip to the original package
The discrim extension package is required to fit this model.
library(discrim) discrim_linear(penalty = numeric(0)) %>% set_engine("mda") %>% translate()
## Linear Discriminant Model Specification (classification) ## ## Main Arguments: ## penalty = numeric(0) ## ## Computational engine: mda ## ## Model fit template: ## mda::fda(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), ## lambda = numeric(0), method = mda::gen.ridge, keep.fitted = FALSE)
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), parsnip will convert factor columns to
indicators.
Variance calculations are used in these computations so zero-variance predictors (i.e., with a single unique value) should be eliminated before fitting the model.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() functions have arguments called case_weights
that expect vectors of case weights.
References
Hastie, Tibshirani & Buja (1994) Flexible Discriminant Analysis by Optimal Scoring, Journal of the American Statistical Association, 89:428, 1255-1270
Linear discriminant analysis via James-Stein-type shrinkage estimation
Description
sda::sda() can fit a linear discriminant analysis model that spans the
range between classical discriminant analysis and diagonal discriminant
analysis.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This engine has no tuning parameter arguments in discrim_linear().
However, there are a few engine-specific parameters that can be set or
optimized when calling set_engine() (see the sketch after this list):

- lambda: the shrinkage parameter for the correlation matrix. This maps to the parameter dials::shrinkage_correlation().
- lambda.var: the shrinkage parameter for the predictor variances. This maps to dials::shrinkage_variance().
- lambda.freqs: the shrinkage parameter for the class frequencies. This maps to dials::shrinkage_frequencies().
- diagonal: a logical to make the model covariance diagonal or not. This maps to dials::diagonal_covariance().
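A hedged sketch of setting two of these engine arguments (assuming the discrim and sda packages are installed; the values are arbitrary):
library(discrim)

discrim_linear() %>%
  set_engine("sda", lambda = 0.5, diagonal = TRUE) %>%
  fit(Species ~ ., data = iris)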
Translation from parsnip to the original package
The discrim extension package is required to fit this model.
library(discrim) discrim_linear() %>% set_engine("sda") %>% translate()
## Linear Discriminant Model Specification (classification) ## ## Computational engine: sda ## ## Model fit template: ## sda::sda(Xtrain = missing_arg(), L = missing_arg(), verbose = FALSE)
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), parsnip will convert factor columns to
indicators.
Variance calculations are used in these computations so zero-variance predictors (i.e., with a single unique value) should be eliminated before fitting the model.
Case weights
The underlying model implementation does not allow for case weights.
References
Ahdesmaki, M., and K. Strimmer. 2010. Feature selection in omics prediction problems using cat scores and false non-discovery rate control. Ann. Appl. Stat. 4: 503-519. Preprint.
Linear discriminant analysis via regularization
Description
Functions in the sparsediscrim package fit different types of linear discriminant analysis models that regularize the estimates (like the mean or covariance).
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 1 tuning parameter:

- regularization_method: Regularization Method (type: character, default: ‘diagonal’)

The possible values of this parameter, and the functions that they execute, are (one is sketched below):

- "diagonal": sparsediscrim::lda_diag()
- "min_distance": sparsediscrim::lda_emp_bayes_eigen()
- "shrink_mean": sparsediscrim::lda_shrink_mean()
- "shrink_cov": sparsediscrim::lda_shrink_cov()
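A brief sketch of selecting one of these methods (assuming the discrim and sparsediscrim packages are installed):
library(discrim)

discrim_linear(regularization_method = "shrink_cov") %>%
  set_engine("sparsediscrim") %>%
  fit(Species ~ ., data = iris)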
Translation from parsnip to the original package
The discrim extension package is required to fit this model.
library(discrim) discrim_linear(regularization_method = character(0)) %>% set_engine("sparsediscrim") %>% translate()
## Linear Discriminant Model Specification (classification) ## ## Main Arguments: ## regularization_method = character(0) ## ## Computational engine: sparsediscrim ## ## Model fit template: ## discrim::fit_regularized_linear(x = missing_arg(), y = missing_arg(), ## regularization_method = character(0))
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), parsnip will convert factor columns to
indicators.
Variance calculations are used in these computations so zero-variance predictors (i.e., with a single unique value) should be eliminated before fitting the model.
Case weights
The underlying model implementation does not allow for case weights.
References
- lda_diag(): Dudoit, Fridlyand and Speed (2002) Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data, Journal of the American Statistical Association, 97:457, 77-87.
- lda_shrink_mean(): Tong, Chen, Zhao, Improved mean estimation and its application to diagonal discriminant analysis, Bioinformatics, Volume 28, Issue 4, 15 February 2012, Pages 531-537.
- lda_shrink_cov(): Pang, Tong and Zhao (2009), Shrinkage-based Diagonal Discriminant Analysis and Its Applications in High-Dimensional Data. Biometrics, 65, 1021-1029.
- lda_emp_bayes_eigen(): Srivastava and Kubokawa (2007), Comparison of Discrimination Methods for High Dimensional Data, Journal of the Japan Statistical Society, 37:1, 123-134.
Quadratic discriminant analysis via MASS
Description
MASS::qda()
fits a model that estimates a multivariate
distribution for the predictors separately for the data in each class
(Gaussian with separate covariance matrices). Bayes' theorem is used
to compute the probability of each class, given the predictor values.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This engine has no tuning parameters.
Translation from parsnip to the original package
The discrim extension package is required to fit this model.
library(discrim) discrim_quad() %>% set_engine("MASS") %>% translate()
## Quadratic Discriminant Model Specification (classification) ## ## Computational engine: MASS ## ## Model fit template: ## MASS::qda(formula = missing_arg(), data = missing_arg())
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), parsnip will convert factor columns to
indicators.
Variance calculations are used in these computations within each outcome class. For this reason, zero-variance predictors (i.e., with a single unique value) within each class should be eliminated before fitting the model.
Case weights
The underlying model implementation does not allow for case weights.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Quadratic discriminant analysis via regularization
Description
Functions in the sparsediscrim package fit different types of quadratic discriminant analysis models that regularize the estimates (like the mean or covariance).
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 1 tuning parameter:

- regularization_method: Regularization Method (type: character, default: ‘diagonal’)

The possible values of this parameter, and the functions that they execute, are:

- "diagonal": sparsediscrim::qda_diag()
- "shrink_mean": sparsediscrim::qda_shrink_mean()
- "shrink_cov": sparsediscrim::qda_shrink_cov()
Translation from parsnip to the original package
The discrim extension package is required to fit this model.
library(discrim) discrim_quad(regularization_method = character(0)) %>% set_engine("sparsediscrim") %>% translate()
## Quadratic Discriminant Model Specification (classification) ## ## Main Arguments: ## regularization_method = character(0) ## ## Computational engine: sparsediscrim ## ## Model fit template: ## discrim::fit_regularized_quad(x = missing_arg(), y = missing_arg(), ## regularization_method = character(0))
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), parsnip will convert factor columns to
indicators.
Variance calculations are used in these computations within each outcome class. For this reason, zero-variance predictors (i.e., with a single unique value) within each class should be eliminated before fitting the model.
Case weights
The underlying model implementation does not allow for case weights.
References
- qda_diag(): Dudoit, Fridlyand and Speed (2002) Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data, Journal of the American Statistical Association, 97:457, 77-87.
- qda_shrink_mean(): Tong, Chen, Zhao, Improved mean estimation and its application to diagonal discriminant analysis, Bioinformatics, Volume 28, Issue 4, 15 February 2012, Pages 531-537.
- qda_shrink_cov(): Pang, Tong and Zhao (2009), Shrinkage-based Diagonal Discriminant Analysis and Its Applications in High-Dimensional Data. Biometrics, 65, 1021-1029.
Regularized discriminant analysis via klaR
Description
klaR::rda() fits a model that estimates a multivariate
distribution for the predictors separately for the data in each class. The
structure of the model can be LDA, QDA, or some amalgam of the two. Bayes'
theorem is used to compute the probability of each class, given the
predictor values.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 2 tuning parameters:

- frac_common_cov: Fraction of the Common Covariance Matrix (type: double, default: (see below))
- frac_identity: Fraction of the Identity Matrix (type: double, default: (see below))

Some special cases for the RDA model:

- frac_identity = 0 and frac_common_cov = 1 is a linear discriminant analysis (LDA) model (illustrated in the sketch below).
- frac_identity = 0 and frac_common_cov = 0 is a quadratic discriminant analysis (QDA) model.
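For instance, the first special case could be requested as in this sketch (assuming the discrim and klaR packages are installed):
library(discrim)

# frac_identity = 0 and frac_common_cov = 1 recovers an LDA-like model
discrim_regularized(frac_common_cov = 1, frac_identity = 0) %>%
  set_engine("klaR") %>%
  fit(Species ~ ., data = iris)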
Translation from parsnip to the original package
The discrim extension package is required to fit this model.
library(discrim) discrim_regularized(frac_identity = numeric(0), frac_common_cov = numeric(0)) %>% set_engine("klaR") %>% translate()
## Regularized Discriminant Model Specification (classification) ## ## Main Arguments: ## frac_common_cov = numeric(0) ## frac_identity = numeric(0) ## ## Computational engine: klaR ## ## Model fit template: ## klaR::rda(formula = missing_arg(), data = missing_arg(), lambda = numeric(0), ## gamma = numeric(0))
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), parsnip will convert factor columns to
indicators.
Variance calculations are used in these computations within each outcome class. For this reason, zero-variance predictors (i.e., with a single unique value) within each class should be eliminated before fitting the model.
Case weights
The underlying model implementation does not allow for case weights.
References
Friedman, J (1989). Regularized Discriminant Analysis. Journal of the American Statistical Association, 84, 165-175.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Generalized additive models via mgcv
Description
mgcv::gam()
fits a generalized linear model with additive smoother terms
for continuous predictors.
Details
For this engine, there are multiple modes: regression and classification
Tuning Parameters
This model has 2 tuning parameters:

- select_features: Select Features? (type: logical, default: FALSE)
- adjust_deg_free: Smoothness Adjustment (type: double, default: 1.0)
Translation from parsnip to the original package (regression)
gen_additive_mod(adjust_deg_free = numeric(1), select_features = logical(1)) %>% set_engine("mgcv") %>% set_mode("regression") %>% translate()
## GAM Model Specification (regression) ## ## Main Arguments: ## select_features = logical(1) ## adjust_deg_free = numeric(1) ## ## Computational engine: mgcv ## ## Model fit template: ## mgcv::gam(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), ## select = logical(1), gamma = numeric(1))
Translation from parsnip to the original package (classification)
gen_additive_mod(adjust_deg_free = numeric(1), select_features = logical(1)) %>% set_engine("mgcv") %>% set_mode("classification") %>% translate()
## GAM Model Specification (classification) ## ## Main Arguments: ## select_features = logical(1) ## adjust_deg_free = numeric(1) ## ## Computational engine: mgcv ## ## Model fit template: ## mgcv::gam(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), ## select = logical(1), gamma = numeric(1), family = stats::binomial(link = "logit"))
Model fitting
This model should be used with a model formula so that smooth terms can be specified. For example:
library(mgcv) gen_additive_mod() %>% set_engine("mgcv") %>% set_mode("regression") %>% fit(mpg ~ wt + gear + cyl + s(disp, k = 10), data = mtcars)
## parsnip model object ## ## ## Family: gaussian ## Link function: identity ## ## Formula: ## mpg ~ wt + gear + cyl + s(disp, k = 10) ## ## Estimated degrees of freedom: ## 7.52 total = 11.52 ## ## GCV score: 4.225228
The smoothness of the terms will need to be manually specified (e.g.,
using s(x, df = 10)) in the formula. Tuning can be accomplished using
the adjust_deg_free parameter.
When using a workflow, pass the model formula to the formula argument
of workflows::add_model(), and a simplified preprocessing formula
elsewhere.
spec <- gen_additive_mod() %>% set_engine("mgcv") %>% set_mode("regression") workflow() %>% add_model(spec, formula = mpg ~ wt + gear + cyl + s(disp, k = 10)) %>% add_formula(mpg ~ wt + gear + cyl + disp) %>% fit(data = mtcars) %>% extract_fit_engine()
## ## Family: gaussian ## Link function: identity ## ## Formula: ## mpg ~ wt + gear + cyl + s(disp, k = 10) ## ## Estimated degrees of freedom: ## 7.52 total = 11.52 ## ## GCV score: 4.225228
To learn more about the differences between these formulas, see
?model_formula.
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), parsnip will convert factor columns to
indicators.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() functions have arguments called case_weights
that expect vectors of case weights.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
References
Ross, W. 2021. Generalized Additive Models in R: A Free, Interactive Course using mgcv
Wood, S. 2017. Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC.
Linear regression via brulee
Description
brulee::brulee_linear_reg()
uses ordinary least squares to fit models with
numeric outcomes.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This model has 2 tuning parameters:

- penalty: Amount of Regularization (type: double, default: 0.001)
- mixture: Proportion of Lasso Penalty (type: double, default: 0.0)
The use of the L1 penalty (a.k.a. the lasso penalty) does not force parameters to be strictly zero (as it does in packages such as glmnet). The zeroing out of parameters is a specific feature of the optimization method used in those packages.
Other engine arguments of interest (a few are sketched below):

- optimizer(): The optimization method. See brulee::brulee_linear_reg().
- epochs(): An integer for the number of passes through the training set.
- learn_rate(): A number used to accelerate the gradient descent process.
- momentum(): A number used to incorporate historical gradient information during optimization (optimizer = "SGD" only).
- batch_size(): An integer for the number of training set points in each batch.
- stop_iter(): A non-negative integer for how many iterations with no improvement before stopping (default: 5L).
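A minimal sketch of passing a few of these arguments (assuming the brulee package and its torch backend are installed; the values shown are arbitrary):
library(parsnip)

linear_reg(penalty = 0.001) %>%
  set_engine("brulee", epochs = 50, learn_rate = 0.05, stop_iter = 5) %>%
  fit(mpg ~ ., data = mtcars)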
Translation from parsnip to the original package (regression)
linear_reg(penalty = double(1)) %>% set_engine("brulee") %>% translate()
## Linear Regression Model Specification (regression) ## ## Main Arguments: ## penalty = double(1) ## ## Computational engine: brulee ## ## Model fit template: ## brulee::brulee_linear_reg(x = missing_arg(), y = missing_arg(), ## penalty = double(1))
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), parsnip will convert factor columns to
indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
Case weights
The underlying model implementation does not allow for case weights.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Linear regression via generalized estimating equations (GEE)
Description
gee::gee()
uses generalized least squares to fit different types of models
with errors that are not independent.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This model has no formal tuning parameters. It may be beneficial to determine the appropriate correlation structure to use, but this typically does not affect the predicted value of the model. It does have an effect on the inferential results and parameter covariance values.
Translation from parsnip to the original package
The multilevelmod extension package is required to fit this model.
library(multilevelmod) linear_reg() %>% set_engine("gee") %>% set_mode("regression") %>% translate()
## Linear Regression Model Specification (regression) ## ## Computational engine: gee ## ## Model fit template: ## multilevelmod::gee_fit(formula = missing_arg(), data = missing_arg(), ## family = gaussian)
multilevelmod::gee_fit() is a wrapper around gee::gee().
Preprocessing requirements
There are no specific preprocessing needs. However, it is helpful to keep the clustering/subject identifier column as factor or character (instead of making them into dummy variables). See the examples in the next section.
Other details
The model cannot accept case weights.
Both gee::gee() and geepack::geeglm() specify the id/cluster variable
using an argument id that requires a vector. parsnip doesn’t work that
way, so we enable this model to be fit using an artificial function
id_var() to be used in the formula. So, in the original package, the
call would look like:
gee(breaks ~ tension, id = wool, data = warpbreaks, corstr = "exchangeable")
With parsnip, we suggest using the formula method when fitting:
library(tidymodels) linear_reg() %>% set_engine("gee", corstr = "exchangeable") %>% fit(breaks ~ tension + id_var(wool), data = warpbreaks)
When using tidymodels infrastructure, it may be better to use a
workflow. In this case, you can add the appropriate columns using
add_variables()
then supply the GEE formula when adding the model:
library(tidymodels) gee_spec <- linear_reg() %>% set_engine("gee", corstr = "exchangeable") gee_wflow <- workflow() %>% # The data are included as-is using: add_variables(outcomes = breaks, predictors = c(tension, wool)) %>% add_model(gee_spec, formula = breaks ~ tension + id_var(wool)) fit(gee_wflow, data = warpbreaks)
The gee::gee() function always prints out warnings and output even
when silent = TRUE. The parsnip "gee" engine, by contrast, silences
all console output coming from gee::gee(), even if silent = FALSE.
Also, because of issues with the gee() function, a supplementary call
to glm() is needed to get the rank and QR decomposition objects so
that predict() can be used.
Case weights
The underlying model implementation does not allow for case weights.
References
Liang, K.Y. and Zeger, S.L. (1986) Longitudinal data analysis using generalized linear models. Biometrika, 73 13–22.
Zeger, S.L. and Liang, K.Y. (1986) Longitudinal data analysis for discrete and continuous outcomes. Biometrics, 42 121–130.
Linear regression via glm
Description
stats::glm()
fits a generalized linear model for numeric outcomes. A
linear combination of the predictors is used to model the numeric outcome
via a link function.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This engine has no tuning parameters but you can set the family
parameter (and/or link) as an engine argument (see below).
Translation from parsnip to the original package
linear_reg() %>% set_engine("glm") %>% translate()
## Linear Regression Model Specification (regression) ## ## Computational engine: glm ## ## Model fit template: ## stats::glm(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), ## family = stats::gaussian)
To use a non-default family and/or link, pass them as arguments to
set_engine():
linear_reg() %>% set_engine("glm", family = stats::poisson(link = "sqrt")) %>% translate()
## Linear Regression Model Specification (regression) ## ## Engine-Specific Arguments: ## family = stats::poisson(link = "sqrt") ## ## Computational engine: glm ## ## Model fit template: ## stats::glm(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), ## family = stats::poisson(link = "sqrt"))
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), parsnip will convert factor columns to
indicators.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() functions have arguments called case_weights
that expect vectors of case weights.
However, the documentation in stats::glm() assumes that a specific type
of case weights is being used: “Non-NULL weights can be used to indicate
that different observations have different dispersions (with the values
in weights being inversely proportional to the dispersions); or
equivalently, when the elements of weights are positive integers w_i,
that each response y_i is the mean of w_i unit-weight observations. For
a binomial GLM prior weights are used to give the number of trials when
the response is the proportion of successes: they would rarely be used
for a Poisson GLM.”
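As a hedged sketch, integer frequency weights of this kind could be supplied through the case_weights argument of fit() using hardhat::frequency_weights(); the weight values below are purely illustrative:
library(parsnip)
library(hardhat)

wts <- frequency_weights(rep(1:2, length.out = nrow(mtcars)))

linear_reg() %>%
  set_engine("glm") %>%
  fit(mpg ~ wt + disp, data = mtcars, case_weights = wts)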
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
Examples
The “Fitting and Predicting with parsnip” article contains examples
for linear_reg() with the "glm" engine.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Linear regression via generalized mixed models
Description
The "glmer"
engine estimates fixed and random effect regression parameters
using maximum likelihood (or restricted maximum likelihood) estimation.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This model has no tuning parameters.
Translation from parsnip to the original package
The multilevelmod extension package is required to fit this model.
library(multilevelmod) linear_reg() %>% set_engine("glmer") %>% set_mode("regression") %>% translate()
## Linear Regression Model Specification (regression) ## ## Computational engine: glmer ## ## Model fit template: ## lme4::glmer(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), ## family = stats::gaussian)
Note that using this engine with a linear link function will result in a warning:
calling glmer() with family=gaussian (identity link) as a shortcut to lmer() is deprecated; please call lmer() directly
Predicting new samples
This model can use subject-specific coefficient estimates to make
predictions (i.e. partial pooling). For example, this equation shows
the linear predictor (\eta) for a random intercept:
\eta_{i} = (\beta_0 + b_{0i}) + \beta_1x_{i1}
where i denotes the i-th independent experimental unit
(e.g. subject). When the model has seen subject i, it can use that
subject’s data to adjust the population intercept to be more specific
to that subject’s results.
What happens when data are being predicted for a subject that was not used in the model fit? In that case, this package uses only the population parameter estimates for prediction:
\hat{\eta}_{i'} = \hat{\beta}_0 + \hat{\beta}_1x_{i'1}
Depending on what covariates are in the model, this might have the effect of making the same prediction for all new samples. The population parameters are the “best estimate” for a subject that was not included in the model fit.
The tidymodels framework deliberately constrains predictions for new data to not use the training set or other data (to prevent information leakage).
Preprocessing requirements
There are no specific preprocessing needs. However, it is helpful to keep the clustering/subject identifier column as factor or character (instead of making them into dummy variables). See the examples in the next section.
Other details
The model can accept case weights.
With parsnip, we suggest using the formula method when fitting:
library(tidymodels) data("riesby") linear_reg() %>% set_engine("glmer") %>% fit(depr_score ~ week + (1|subject), data = riesby)
When using tidymodels infrastructure, it may be better to use a
workflow. In this case, you can add the appropriate columns using
add_variables()
then supply the typical formula when adding the model:
library(tidymodels) glmer_spec <- linear_reg() %>% set_engine("glmer") glmer_wflow <- workflow() %>% # The data are included as-is using: add_variables(outcomes = depr_score, predictors = c(week, subject)) %>% add_model(glmer_spec, formula = depr_score ~ week + (1|subject)) fit(glmer_wflow, data = riesby)
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() functions have arguments called case_weights
that expect vectors of case weights.
References
J Pinheiro, and D Bates. 2000. Mixed-effects models in S and S-PLUS. Springer, New York, NY
West, B, K Welch, and A Galecki. 2014. Linear Mixed Models: A Practical Guide Using Statistical Software. CRC Press.
Thorson, J, Minto, C. 2015, Mixed effects: a unifying framework for statistical modelling in fisheries biology. ICES Journal of Marine Science, Volume 72, Issue 5, Pages 1245–1256.
Harrison, XA, Donaldson, L, Correa-Cano, ME, Evans, J, Fisher, DN, Goodwin, CED, Robinson, BS, Hodgson, DJ, Inger, R. 2018. A brief introduction to mixed effects modelling and multi-model inference in ecology. PeerJ 6:e4794.
DeBruine LM, Barr DJ. Understanding Mixed-Effects Models Through Data Simulation. 2021. Advances in Methods and Practices in Psychological Science.
Linear regression via glmnet
Description
glmnet::glmnet()
uses regularized least squares to fit models with numeric outcomes.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This model has 2 tuning parameters:

- penalty: Amount of Regularization (type: double, default: see below)
- mixture: Proportion of Lasso Penalty (type: double, default: 1.0)

A value of mixture = 1 corresponds to a pure lasso model, while
mixture = 0 indicates ridge regression.
The penalty parameter has no default and requires a single numeric
value. For more details about this, and the glmnet model in general,
see glmnet-details.
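For example, a minimal sketch with a fixed penalty (assuming glmnet is installed; the values are arbitrary):
library(parsnip)

linear_reg(penalty = 0.01, mixture = 1) %>%
  set_engine("glmnet") %>%
  fit(mpg ~ ., data = mtcars)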
Translation from parsnip to the original package
linear_reg(penalty = double(1), mixture = double(1)) %>% set_engine("glmnet") %>% translate()
## Linear Regression Model Specification (regression) ## ## Main Arguments: ## penalty = 0 ## mixture = double(1) ## ## Computational engine: glmnet ## ## Model fit template: ## glmnet::glmnet(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## alpha = double(1), family = "gaussian")
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), parsnip will convert factor columns to
indicators.
Predictors should have the same scale. One way to achieve this is to
center and scale each so that each predictor has mean zero and a
variance of one. By default, glmnet::glmnet() uses the argument
standardize = TRUE to center and scale the data.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() functions have arguments called case_weights
that expect vectors of case weights.
Sparse Data
This model can utilize sparse data during model fitting and prediction.
Both sparse matrices such as dgCMatrix from the Matrix package and
sparse tibbles from the sparsevctrs package are supported. See
sparse_data for more information.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
Examples
The “Fitting and Predicting with parsnip” article contains examples
for linear_reg() with the "glmnet" engine.
References
Hastie, T, R Tibshirani, and M Wainwright. 2015. Statistical Learning with Sparsity. CRC Press.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Linear regression via generalized least squares
Description
The "gls"
engine estimates linear regression for models where the rows of the
data are not independent.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This model has no tuning parameters.
Translation from parsnip to the original package
The multilevelmod extension package is required to fit this model.
library(multilevelmod) linear_reg() %>% set_engine("gls") %>% set_mode("regression") %>% translate()
## Linear Regression Model Specification (regression) ## ## Computational engine: gls ## ## Model fit template: ## nlme::gls(formula = missing_arg(), data = missing_arg())
Preprocessing requirements
There are no specific preprocessing needs. However, it is helpful to keep the clustering/subject identifier column as factor or character (instead of making them into dummy variables). See the examples in the next section.
Other details
The model can accept case weights.
With parsnip, we suggest using the fixed effects formula method when
fitting, but the details of the correlation structure should be passed
to set_engine()
since it is an irregular (but required) argument:
library(tidymodels) # load nlme to be able to use the `cor*()` functions library(nlme) data("riesby") linear_reg() %>% set_engine("gls", correlation = corCompSymm(form = ~ 1 | subject)) %>% fit(depr_score ~ week, data = riesby)
## parsnip model object ## ## Generalized least squares fit by REML ## Model: depr_score ~ week ## Data: data ## Log-restricted-likelihood: -765.0148 ## ## Coefficients: ## (Intercept) week ## -4.953439 -2.119678 ## ## Correlation Structure: Compound symmetry ## Formula: ~1 | subject ## Parameter estimate(s): ## Rho ## 0.6820145 ## Degrees of freedom: 250 total; 248 residual ## Residual standard error: 6.868785
When using tidymodels infrastructure, it may be better to use a
workflow. In this case, you can add the appropriate columns using
add_variables()
then supply the typical formula when adding the model:
library(tidymodels) gls_spec <- linear_reg() %>% set_engine("gls", correlation = corCompSymm(form = ~ 1 | subject)) gls_wflow <- workflow() %>% # The data are included as-is using: add_variables(outcomes = depr_score, predictors = c(week, subject)) %>% add_model(gls_spec, formula = depr_score ~ week) fit(gls_wflow, data = riesby)
Case weights
The underlying model implementation does not allow for case weights.
References
J Pinheiro, and D Bates. 2000. Mixed-effects models in S and S-PLUS. Springer, New York, NY
Linear regression via h2o
Description
This model uses regularized least squares to fit models with numeric outcomes.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This model has 2 tuning parameters:

- mixture: Proportion of Lasso Penalty (type: double, default: see below)
- penalty: Amount of Regularization (type: double, default: see below)
By default, when not given a fixed penalty, h2o::h2o.glm() uses a
heuristic approach to select the optimal value of penalty based on
training data. Setting the engine parameter lambda_search to TRUE
enables an efficient version of the grid search; see more details at
https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/lambda_search.html.
The choice of mixture depends on the engine parameter solver, which is
automatically chosen given training data and the specification of other
model parameters. When solver is set to 'L-BFGS', mixture defaults to 0
(ridge regression) and 0.5 otherwise.
Translation from parsnip to the original package
agua::h2o_train_glm() for linear_reg() is a wrapper around
h2o::h2o.glm() with family = "gaussian".
linear_reg(penalty = 1, mixture = 0.5) %>% set_engine("h2o") %>% translate()
## Linear Regression Model Specification (regression) ## ## Main Arguments: ## penalty = 1 ## mixture = 0.5 ## ## Computational engine: h2o ## ## Model fit template: ## agua::h2o_train_glm(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## validation_frame = missing_arg(), lambda = 1, alpha = 0.5, ## family = "gaussian")
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), parsnip will convert factor columns to
indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
By default, h2o::h2o.glm() uses the argument standardize = TRUE to
center and scale the data.
Initializing h2o
To use the h2o engine with tidymodels, please run h2o::h2o.init()
first. By default, this connects R to the local h2o server. This needs
to be done in every new R session. You can also connect to a remote h2o
server with an IP address; for more details see h2o::h2o.init().
You can control the number of threads in the thread pool used by h2o
with the nthreads argument. By default, it uses all CPUs on the host.
This is different from the usual parallel processing mechanism in
tidymodels for tuning: while tidymodels parallelizes over resamples,
h2o parallelizes over hyperparameter combinations for a given resample.
h2o will automatically shut down the local h2o instance started by R
when R is terminated. To manually stop the h2o server, run
h2o::h2o.shutdown().
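Putting these steps together, a hedged sketch (assuming the agua and h2o packages are installed and a local server can be started):
library(parsnip)
library(agua)  # registers the "h2o" engine

h2o::h2o.init(nthreads = -1)  # -1 uses all CPUs on the host

linear_reg(penalty = 1, mixture = 0.5) %>%
  set_engine("h2o") %>%
  fit(mpg ~ ., data = mtcars)

h2o::h2o.shutdown(prompt = FALSE)  # optional manual shutdown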
Saving fitted model objects
Models fitted with this engine may require native serialization methods to be properly saved and/or passed between R sessions. To learn more about preparing fitted models for serialization, see the bundle package.
Linear regression via keras/tensorflow
Description
This model uses regularized least squares to fit models with numeric outcomes.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This model has one tuning parameter:

- penalty: Amount of Regularization (type: double, default: 0.0)

For penalty, the amount of regularization is only the L2 penalty
(i.e., ridge or weight decay).
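A minimal sketch of a fit (assuming the keras package and a working tensorflow installation; epochs is an engine argument passed on to parsnip::keras_mlp()):
library(parsnip)

linear_reg(penalty = 0.1) %>%
  set_engine("keras", epochs = 20) %>%
  fit(mpg ~ ., data = mtcars)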
Translation from parsnip to the original package
linear_reg(penalty = double(1)) %>% set_engine("keras") %>% translate()
## Linear Regression Model Specification (regression) ## ## Main Arguments: ## penalty = double(1) ## ## Computational engine: keras ## ## Model fit template: ## parsnip::keras_mlp(x = missing_arg(), y = missing_arg(), penalty = double(1), ## hidden_units = 1, act = "linear")
keras_mlp()
is a parsnip wrapper around keras code for
neural networks. This model fits a linear regression as a network with a
single hidden unit.
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), parsnip will convert factor columns to
indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
Case weights
The underlying model implementation does not allow for case weights.
Examples
The “Fitting and Predicting with parsnip” article contains examples
for linear_reg() with the "keras" engine.
References
Hoerl, A., & Kennard, R. (2000). Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 42(1), 80-86.
Linear regression via lm
Description
stats::lm()
uses ordinary least squares to fit models with numeric outcomes.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This engine has no tuning parameters.
Translation from parsnip to the original package
linear_reg() %>% set_engine("lm") %>% translate()
## Linear Regression Model Specification (regression) ## ## Computational engine: lm ## ## Model fit template: ## stats::lm(formula = missing_arg(), data = missing_arg(), weights = missing_arg())
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), parsnip will convert factor columns to
indicators.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() functions have arguments called case_weights
that expect vectors of case weights.
However, the documentation in stats::lm() assumes that a specific type
of case weights is being used: “Non-NULL weights can be used to indicate
that different observations have different variances (with the values in
weights being inversely proportional to the variances); or equivalently,
when the elements of weights are positive integers w_i, that each
response y_i is the mean of w_i unit-weight observations (including the
case that there are w_i observations equal to y_i and the data have been
summarized). However, in the latter case, notice that within-group
variation is not used. Therefore, the sigma estimate and residual
degrees of freedom may be suboptimal; in the case of replication
weights, even wrong. Hence, standard errors and analysis of variance
tables should be treated with care” (emphasis added).
Depending on your application, the degrees of freedom for the model (and other statistics) might be incorrect.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
Examples
The “Fitting and Predicting with parsnip” article contains examples
for linear_reg() with the "lm" engine.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Linear regression via mixed models
Description
The "lme"
engine estimates fixed and random effect regression parameters
using maximum likelihood (or restricted maximum likelihood) estimation.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This model has no tuning parameters.
Translation from parsnip to the original package
The multilevelmod extension package is required to fit this model.
library(multilevelmod) linear_reg() %>% set_engine("lme") %>% set_mode("regression") %>% translate()
## Linear Regression Model Specification (regression) ## ## Computational engine: lme ## ## Model fit template: ## nlme::lme(fixed = missing_arg(), data = missing_arg())
Predicting new samples
This model can use subject-specific coefficient estimates to make
predictions (i.e. partial pooling). For example, this equation shows
the linear predictor (\eta) for a random intercept:
\eta_{i} = (\beta_0 + b_{0i}) + \beta_1x_{i1}
where i denotes the i-th independent experimental unit
(e.g. subject). When the model has seen subject i, it can use that
subject’s data to adjust the population intercept to be more specific
to that subject’s results.
What happens when data are being predicted for a subject that was not used in the model fit? In that case, this package uses only the population parameter estimates for prediction:
\hat{\eta}_{i'} = \hat{\beta}_0 + \hat{\beta}_1x_{i'1}
Depending on what covariates are in the model, this might have the effect of making the same prediction for all new samples. The population parameters are the “best estimate” for a subject that was not included in the model fit.
The tidymodels framework deliberately constrains predictions for new data to not use the training set or other data (to prevent information leakage).
Preprocessing requirements
There are no specific preprocessing needs. However, it is helpful to keep the clustering/subject identifier column as factor or character (instead of making them into dummy variables). See the examples in the next section.
Other details
The model can accept case weights.
With parsnip, we suggest using the fixed effects formula method when
fitting, but the random effects formula should be passed to
set_engine()
since it is an irregular (but required) argument:
library(tidymodels) data("riesby") linear_reg() %>% set_engine("lme", random = ~ 1|subject) %>% fit(depr_score ~ week, data = riesby)
When using tidymodels infrastructure, it may be better to use a
workflow. In this case, you can add the appropriate columns using
add_variables()
then supply the typical formula when adding the model:
library(tidymodels) lme_spec <- linear_reg() %>% set_engine("lme", random = ~ 1|subject) lme_wflow <- workflow() %>% # The data are included as-is using: add_variables(outcomes = depr_score, predictors = c(week, subject)) %>% add_model(lme_spec, formula = depr_score ~ week) fit(lme_wflow, data = riesby)
Case weights
The underlying model implementation does not allow for case weights.
References
J Pinheiro, and D Bates. 2000. Mixed-effects models in S and S-PLUS. Springer, New York, NY
West, B, K Welch, and A Galecki. 2014. Linear Mixed Models: A Practical Guide Using Statistical Software. CRC Press.
Thorson, J, Minto, C. 2015, Mixed effects: a unifying framework for statistical modelling in fisheries biology. ICES Journal of Marine Science, Volume 72, Issue 5, Pages 1245–1256.
Harrison, XA, Donaldson, L, Correa-Cano, ME, Evans, J, Fisher, DN, Goodwin, CED, Robinson, BS, Hodgson, DJ, Inger, R. 2018. A brief introduction to mixed effects modelling and multi-model inference in ecology. PeerJ 6:e4794.
DeBruine LM, Barr DJ. Understanding Mixed-Effects Models Through Data Simulation. 2021. Advances in Methods and Practices in Psychological Science.
Linear regression via mixed models
Description
The "lmer"
engine estimates fixed and random effect regression parameters
using maximum likelihood (or restricted maximum likelihood) estimation.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This model has no tuning parameters.
Translation from parsnip to the original package
The multilevelmod extension package is required to fit this model.
library(multilevelmod) linear_reg() %>% set_engine("lmer") %>% set_mode("regression") %>% translate()
## Linear Regression Model Specification (regression) ## ## Computational engine: lmer ## ## Model fit template: ## lme4::lmer(formula = missing_arg(), data = missing_arg(), weights = missing_arg())
Predicting new samples
This model can use subject-specific coefficient estimates to make
predictions (i.e. partial pooling). For example, this equation shows
the linear predictor (\eta) for a random intercept:
\eta_{i} = (\beta_0 + b_{0i}) + \beta_1x_{i1}
where i denotes the i-th independent experimental unit
(e.g. subject). When the model has seen subject i, it can use that
subject’s data to adjust the population intercept to be more specific
to that subject’s results.
What happens when data are being predicted for a subject that was not used in the model fit? In that case, this package uses only the population parameter estimates for prediction:
\hat{\eta}_{i'} = \hat{\beta}_0 + \hat{\beta}_1x_{i'1}
Depending on what covariates are in the model, this might have the effect of making the same prediction for all new samples. The population parameters are the “best estimate” for a subject that was not included in the model fit.
The tidymodels framework deliberately constrains predictions for new data to not use the training set or other data (to prevent information leakage).
Preprocessing requirements
There are no specific preprocessing needs. However, it is helpful to keep the clustering/subject identifier column as factor or character (instead of making them into dummy variables). See the examples in the next section.
Other details
The model can accept case weights.
With parsnip, we suggest using the formula method when fitting:
library(tidymodels) data("riesby") linear_reg() %>% set_engine("lmer") %>% fit(depr_score ~ week + (1|subject), data = riesby)
When using tidymodels infrastructure, it may be better to use a
workflow. In this case, you can add the appropriate columns using
add_variables()
then supply the typical formula when adding the model:
library(tidymodels) lmer_spec <- linear_reg() %>% set_engine("lmer") lmer_wflow <- workflow() %>% # The data are included as-is using: add_variables(outcomes = depr_score, predictors = c(week, subject)) %>% add_model(lmer_spec, formula = depr_score ~ week + (1|subject)) fit(lmer_wflow, data = riesby)
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() functions have arguments called case_weights
that expect vectors of case weights.
References
J Pinheiro, and D Bates. 2000. Mixed-effects models in S and S-PLUS. Springer, New York, NY
West, B, K Welch, and A Galecki. 2014. Linear Mixed Models: A Practical Guide Using Statistical Software. CRC Press.
Thorson, J, Minto, C. 2015, Mixed effects: a unifying framework for statistical modelling in fisheries biology. ICES Journal of Marine Science, Volume 72, Issue 5, Pages 1245–1256.
Harrison, XA, Donaldson, L, Correa-Cano, ME, Evans, J, Fisher, DN, Goodwin, CED, Robinson, BS, Hodgson, DJ, Inger, R. 2018. A brief introduction to mixed effects modelling and multi-model inference in ecology. PeerJ 6:e4794.
DeBruine LM, Barr DJ. Understanding Mixed-Effects Models Through Data Simulation. 2021. Advances in Methods and Practices in Psychological Science.
Linear quantile regression via the quantreg package
Description
quantreg::rq()
optimizes quantile loss to fit models with numeric outcomes.
Details
For this engine, there is a single mode: quantile regression
This model has the same structure as the model fit by lm(), but
instead of optimizing the sum of squared errors, it optimizes “quantile
loss” in order to produce better estimates of the predictive
distribution.
Tuning Parameters
This engine has no tuning parameters.
Translation from parsnip to the original package
This model only works with the "quantile regression" mode and requires
users to specify which areas of the distribution to predict via the
quantile_levels argument. For example:
linear_reg() %>% set_engine("quantreg") %>% set_mode("quantile regression", quantile_levels = (1:3) / 4) %>% translate()
## Linear Regression Model Specification (quantile regression) ## ## Computational engine: quantreg ## ## Model fit template: ## quantreg::rq(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), ## tau = quantile_levels) ## Quantile levels: 0.25, 0.5, and 0.75.
Output format
When multiple quantile levels are predicted, there are multiple
predicted values for each row of new data. The predict() method for
this mode produces a column named .pred_quantile that has a special
class of "quantile_pred", and it contains the predictions for each row.
For example:
library(modeldata) rlang::check_installed("quantreg") n <- nrow(Chicago) Chicago <- Chicago %>% select(ridership, Clark_Lake) Chicago_train <- Chicago[1:(n - 7), ] Chicago_test <- Chicago[(n - 6):n, ] qr_fit <- linear_reg() %>% set_engine("quantreg") %>% set_mode("quantile regression", quantile_levels = (1:3) / 4) %>% fit(ridership ~ Clark_Lake, data = Chicago_train) qr_fit
## parsnip model object ## ## Call: ## quantreg::rq(formula = ridership ~ Clark_Lake, tau = quantile_levels, ## data = data) ## ## Coefficients: ## tau= 0.25 tau= 0.50 tau= 0.75 ## (Intercept) -0.2064189 0.2051549 0.8112286 ## Clark_Lake 0.9820582 0.9862306 0.9777820 ## ## Degrees of freedom: 5691 total; 5689 residual
qr_pred <- predict(qr_fit, Chicago_test) qr_pred
## # A tibble: 7 x 1 ## .pred_quantile ## <qtls(3)> ## 1 [21.1] ## 2 [21.4] ## 3 [21.7] ## 4 [21.4] ## 5 [19.5] ## 6 [6.88] ## # i 1 more row
We can unnest these values and/or convert them to a rectangular format:
as_tibble(qr_pred$.pred_quantile)
## # A tibble: 21 x 3 ## .pred_quantile .quantile_levels .row ## <dbl> <dbl> <int> ## 1 20.6 0.25 1 ## 2 21.1 0.5 1 ## 3 21.5 0.75 1 ## 4 20.9 0.25 2 ## 5 21.4 0.5 2 ## 6 21.8 0.75 2 ## # i 15 more rows
as.matrix(qr_pred$.pred_quantile)
## [,1] [,2] [,3] ## [1,] 20.590627 21.090561 21.517717 ## [2,] 20.863639 21.364733 21.789541 ## [3,] 21.190665 21.693148 22.115142 ## [4,] 20.879352 21.380513 21.805185 ## [5,] 19.047814 19.541193 19.981622 ## [6,] 6.435241 6.875033 7.423968 ## [7,] 6.062058 6.500265 7.052411
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), parsnip will convert factor columns to
indicators.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() functions have arguments called case_weights
that expect vectors of case weights.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
Examples
The “Fitting and Predicting with parsnip” article contains examples
for linear_reg() with the "quantreg" engine.
References
Waldmann, E. (2018). Quantile regression: a short story on how and why. Statistical Modelling, 18(3-4), 203-218.
Linear regression via spark
Description
sparklyr::ml_linear_regression()
uses regularized least squares to fit
models with numeric outcomes.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This model has 2 tuning parameters:

- penalty: Amount of Regularization (type: double, default: 0.0)
- mixture: Proportion of Lasso Penalty (type: double, default: 0.0)

For penalty, the amount of regularization includes both the L1 penalty
(i.e., lasso) and the L2 penalty (i.e., ridge or weight decay). As for
mixture (see the sketch below):

- mixture = 1 specifies a pure lasso model,
- mixture = 0 specifies a ridge regression model, and
- 0 < mixture < 1 specifies an elastic net model, interpolating lasso and ridge.
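As a hedged sketch, a fit against Spark data might look like the following (assuming a local Spark installation via sparklyr; the connection and table objects are placeholders):
library(sparklyr)
library(parsnip)

sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars)

linear_reg(penalty = 0.01, mixture = 0.5) %>%
  set_engine("spark") %>%
  fit(mpg ~ wt + disp, data = mtcars_tbl)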
Translation from parsnip to the original package
linear_reg(penalty = double(1), mixture = double(1)) %>% set_engine("spark") %>% translate()
## Linear Regression Model Specification (regression) ## ## Main Arguments: ## penalty = double(1) ## mixture = double(1) ## ## Computational engine: spark ## ## Model fit template: ## sparklyr::ml_linear_regression(x = missing_arg(), formula = missing_arg(), ## weights = missing_arg(), reg_param = double(1), elastic_net_param = double(1))
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), parsnip will convert factor columns to
indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
By default, ml_linear_regression() uses the argument
standardization = TRUE to center and scale the data.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() functions have arguments called case_weights
that expect vectors of case weights.
Note that, for spark engines, the case_weight argument value should be
a character string to specify the column with the numeric case weights.
Other details
For models created using the "spark"
engine, there are several things
to consider.
- Only the formula interface via fit() is available; using fit_xy() will generate an error.
- The predictions will always be in a Spark table format. The names will be the same as documented but without the dots.
- There is no equivalent to factor columns in Spark tables, so class predictions are returned as character columns.
- To retain the model object for a new R session (via save()), the model$fit element of the parsnip object should be serialized via ml_save(object$fit) and separately saved to disk. In a new session, the object can be reloaded and reattached to the parsnip object.
References
Luraschi, J, K Kuo, and E Ruiz. 2019. Mastering Spark with R. O’Reilly Media
Hastie, T, R Tibshirani, and M Wainwright. 2015. Statistical Learning with Sparsity. CRC Press.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Linear regression via Bayesian Methods
Description
The "stan"
engine estimates regression parameters using Bayesian estimation.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This engine has no tuning parameters.
Important engine-specific options
Some relevant arguments that can be passed to set_engine()
:
- chains: A positive integer specifying the number of Markov chains. The default is 4.
- iter: A positive integer specifying the number of iterations for each chain (including warmup). The default is 2000.
- seed: The seed for random number generation.
- cores: Number of cores to use when executing the chains in parallel.
- prior: The prior distribution for the (non-hierarchical) regression coefficients. The "stan" engine does not fit any hierarchical terms. See the "stan_glmer" engine from the multilevelmod package for that type of model.
- prior_intercept: The prior distribution for the intercept (after centering all predictors).
See rstan::sampling() and rstanarm::priors() for more information on these and other options.
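For example, a hedged sketch of passing these options through set_engine() (the specific values are illustrative only):

linear_reg() %>%
  set_engine(
    "stan",
    chains = 4,                                # number of Markov chains
    iter = 5000,                               # iterations per chain, including warmup
    seed = 1234,                               # reproducible sampling
    prior_intercept = rstanarm::normal(0, 10)  # prior for the (centered) intercept
  )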
Translation from parsnip to the original package
linear_reg() %>%
  set_engine("stan") %>%
  translate()

## Linear Regression Model Specification (regression)
##
## Computational engine: stan
##
## Model fit template:
## rstanarm::stan_glm(formula = missing_arg(), data = missing_arg(),
##     weights = missing_arg(), family = stats::gaussian, refresh = 0)
Note that the refresh default prevents logging of the estimation process. Change this value in set_engine() to show the MCMC logs.
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values (e.g., dummy or indicator variables) for this engine. When using the formula method via fit(), parsnip will convert factor columns to indicators.
Other details
For prediction, the "stan" engine can compute posterior intervals analogous to confidence and prediction intervals. In these instances, the units are the original outcome. When std_error = TRUE, the standard deviation of the posterior distribution (or posterior predictive distribution, as appropriate) is returned.
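A hedged sketch of requesting these intervals (stan_fit and new_pts are placeholders for a fitted "stan" model and new data):

# Posterior intervals for the mean and for new observations:
predict(stan_fit, new_data = new_pts, type = "conf_int", level = 0.95, std_error = TRUE)
predict(stan_fit, new_data = new_pts, type = "pred_int", level = 0.95, std_error = TRUE)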
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() functions have arguments called case_weights that expect vectors of case weights.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for linear_reg()
with the "stan"
engine.
References
McElreath, R. 2020. Statistical Rethinking. CRC Press.
Linear regression via hierarchical Bayesian methods
Description
The "stan_glmer"
engine estimates hierarchical regression parameters using
Bayesian estimation.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This model has no tuning parameters.
Important engine-specific options
Some relevant arguments that can be passed to set_engine()
:
-
chains
: A positive integer specifying the number of Markov chains. The default is 4. -
iter
: A positive integer specifying the number of iterations for each chain (including warmup). The default is 2000. -
seed
: The seed for random number generation. -
cores
: Number of cores to use when executing the chains in parallel. -
prior
: The prior distribution for the (non-hierarchical) regression coefficients. -
prior_intercept
: The prior distribution for the intercept (after centering all predictors).
See ?rstanarm::stan_glmer and ?rstan::sampling for more information.
Translation from parsnip to the original package
The multilevelmod extension package is required to fit this model.
library(multilevelmod)

linear_reg() %>%
  set_engine("stan_glmer") %>%
  set_mode("regression") %>%
  translate()

## Linear Regression Model Specification (regression)
##
## Computational engine: stan_glmer
##
## Model fit template:
## rstanarm::stan_glmer(formula = missing_arg(), data = missing_arg(),
##     weights = missing_arg(), family = stats::gaussian, refresh = 0)
Predicting new samples
This model can use subject-specific coefficient estimates to make predictions (i.e., partial pooling). For example, this equation shows the linear predictor (\eta) for a random intercept:

\eta_{i} = (\beta_0 + b_{0i}) + \beta_1 x_{i1}

where i denotes the i-th independent experimental unit (e.g., subject). When the model has seen subject i, it can use that subject's data to adjust the population intercept to be more specific to that subject's results.

What happens when data are being predicted for a subject that was not used in the model fit? In that case, this package uses only the population parameter estimates for prediction:

\hat{\eta}_{i'} = \hat{\beta}_0 + \hat{\beta}_1 x_{i'1}

Depending on what covariates are in the model, this might have the effect of making the same prediction for all new samples. The population parameters are the “best estimate” for a subject that was not included in the model fit.
The tidymodels framework deliberately constrains predictions for new data to not use the training set or other data (to prevent information leakage).
Preprocessing requirements
There are no specific preprocessing needs. However, it is helpful to keep the clustering/subject identifier column as factor or character (instead of making them into dummy variables). See the examples in the next section.
Other details
The model can accept case weights.
With parsnip, we suggest using the formula method when fitting:
library(tidymodels)
data("riesby")

linear_reg() %>%
  set_engine("stan_glmer") %>%
  fit(depr_score ~ week + (1|subject), data = riesby)
When using tidymodels infrastructure, it may be better to use a
workflow. In this case, you can add the appropriate columns using
add_variables()
then supply the typical formula when adding the model:
library(tidymodels)

glmer_spec <-
  linear_reg() %>%
  set_engine("stan_glmer")

glmer_wflow <-
  workflow() %>%
  # The data are included as-is using:
  add_variables(outcomes = depr_score, predictors = c(week, subject)) %>%
  add_model(glmer_spec, formula = depr_score ~ week + (1|subject))

fit(glmer_wflow, data = riesby)
For prediction, the "stan_glmer" engine can compute posterior intervals analogous to confidence and prediction intervals. In these instances, the units are the original outcome. When std_error = TRUE, the standard deviation of the posterior distribution (or posterior predictive distribution, as appropriate) is returned.
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() functions have arguments called case_weights that expect vectors of case weights.
References
McElreath, R. 2020. Statistical Rethinking. CRC Press.
Sorensen, T, Vasishth, S. 2016. Bayesian linear mixed models using Stan: A tutorial for psychologists, linguists, and cognitive scientists, arXiv:1506.06201.
Logistic regression via brulee
Description
brulee::brulee_logistic_reg()
fits a generalized linear model for binary
outcomes. A linear combination of the predictors is used to model the log
odds of an event.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 2 tuning parameters:
- penalty: Amount of Regularization (type: double, default: 0.001)
- mixture: Proportion of Lasso Penalty (type: double, default: 0.0)
The use of the L1 penalty (a.k.a. the lasso penalty) does not force parameters to be strictly zero (as it does in packages such as glmnet). The zeroing out of parameters is a specific feature of the optimization method used in those packages.
Other engine arguments of interest:
- optimizer: The optimization method. See brulee::brulee_linear_reg().
- epochs: An integer for the number of passes through the training set.
- learn_rate: A number used to accelerate the gradient descent process.
- momentum: A number used to incorporate historical gradient information during optimization (optimizer = "SGD" only).
- batch_size: An integer for the number of training set points in each batch.
- stop_iter: A non-negative integer for how many iterations with no improvement before stopping (default: 5L).
- class_weights: Numeric class weights. See brulee::brulee_logistic_reg().
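A hedged sketch of passing a few of these through set_engine() (the values are illustrative only):

logistic_reg(penalty = 0.001) %>%
  set_engine(
    "brulee",
    epochs = 200,      # passes through the training set
    learn_rate = 0.05, # step size for gradient descent
    stop_iter = 10     # early-stopping patience
  )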
Translation from parsnip to the original package (classification)
logistic_reg(penalty = double(1)) %>%
  set_engine("brulee") %>%
  translate()

## Logistic Regression Model Specification (classification)
##
## Main Arguments:
##   penalty = double(1)
##
## Computational engine: brulee
##
## Model fit template:
## brulee::brulee_logistic_reg(x = missing_arg(), y = missing_arg(),
##     penalty = double(1))
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values (e.g., dummy or indicator variables) for this engine. When using the formula method via fit(), parsnip will convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
Case weights
The underlying model implementation does not allow for case weights.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Logistic regression via generalized estimating equations (GEE)
Description
gee::gee()
uses generalized least squares to fit different types of models
with errors that are not independent.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has no formal tuning parameters. It may be beneficial to determine the appropriate correlation structure to use, but this typically does not affect the predicted value of the model. It does have an effect on the inferential results and parameter covariance values.
Translation from parsnip to the original package
The multilevelmod extension package is required to fit this model.
library(multilevelmod)

logistic_reg() %>%
  set_engine("gee") %>%
  translate()

## Logistic Regression Model Specification (classification)
##
## Computational engine: gee
##
## Model fit template:
## multilevelmod::gee_fit(formula = missing_arg(), data = missing_arg(),
##     family = binomial)
multilevelmod::gee_fit() is a wrapper around gee::gee().
Preprocessing requirements
There are no specific preprocessing needs. However, it is helpful to keep the clustering/subject identifier column as factor or character (instead of making them into dummy variables). See the examples in the next section.
Other details
The model cannot accept case weights.
Both gee::gee() and geepack::geeglm() specify the id/cluster variable using an argument id that requires a vector. parsnip doesn't work that way, so we enable this model to be fit using an artificial function id_var() to be used in the formula. So, in the original package, the call would look like:
gee(breaks ~ tension, id = wool, data = warpbreaks, corstr = "exchangeable")
With parsnip, we suggest using the formula method when fitting:

library(tidymodels)
data("toenail", package = "HSAUR3")

logistic_reg() %>%
  set_engine("gee", corstr = "exchangeable") %>%
  fit(outcome ~ treatment * visit + id_var(patientID), data = toenail)
When using tidymodels infrastructure, it may be better to use a
workflow. In this case, you can add the appropriate columns using
add_variables()
then supply the GEE formula when adding the model:
library(tidymodels)

gee_spec <-
  logistic_reg() %>%
  set_engine("gee", corstr = "exchangeable")

gee_wflow <-
  workflow() %>%
  # The data are included as-is using:
  add_variables(outcomes = outcome, predictors = c(treatment, visit, patientID)) %>%
  add_model(gee_spec, formula = outcome ~ treatment * visit + id_var(patientID))

fit(gee_wflow, data = toenail)
The gee::gee() function always prints out warnings and output, even when silent = TRUE. The parsnip "gee" engine, by contrast, silences all console output coming from gee::gee(), even if silent = FALSE.
Also, because of issues with the gee()
function, a supplementary call
to glm()
is needed to get the rank and QR decomposition objects so
that predict()
can be used.
Case weights
The underlying model implementation does not allow for case weights.
References
Liang, K.Y. and Zeger, S.L. (1986) Longitudinal data analysis using generalized linear models. Biometrika, 73, 13–22.
Zeger, S.L. and Liang, K.Y. (1986) Longitudinal data analysis for discrete and continuous outcomes. Biometrics, 42, 121–130.
Logistic regression via glm
Description
stats::glm()
fits a generalized linear model for binary outcomes. A
linear combination of the predictors is used to model the log odds of an
event.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This engine has no tuning parameters, but you can set the family parameter (and/or link) as an engine argument (see below).
Translation from parsnip to the original package
logistic_reg() %>% set_engine("glm") %>% translate()
## Logistic Regression Model Specification (classification) ## ## Computational engine: glm ## ## Model fit template: ## stats::glm(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), ## family = stats::binomial)
To use a non-default family and/or link, pass them as arguments to set_engine():
logistic_reg() %>% set_engine("glm", family = stats::binomial(link = "probit")) %>% translate()
## Logistic Regression Model Specification (classification) ## ## Engine-Specific Arguments: ## family = stats::binomial(link = "probit") ## ## Computational engine: glm ## ## Model fit template: ## stats::glm(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), ## family = stats::binomial(link = "probit"))
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values (e.g., dummy or indicator variables) for this engine. When using the formula method via fit(), parsnip will convert factor columns to indicators.
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() functions have arguments called case_weights that expect vectors of case weights.
However, the documentation in stats::glm() assumes that a specific type of case weight is being used: “Non-NULL weights can be used to indicate that different observations have different dispersions (with the values in weights being inversely proportional to the dispersions); or equivalently, when the elements of weights are positive integers w_i, that each response y_i is the mean of w_i unit-weight observations. For a binomial GLM prior weights are used to give the number of trials when the response is the proportion of successes: they would rarely be used for a Poisson GLM.”
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
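A hedged sketch of that reduction (glm_fit is a placeholder for a logistic_reg() model fit with the "glm" engine):

library(butcher)

weigh(glm_fit)                 # list the heaviest components of the fitted object
small_fit <- butcher(glm_fit)  # drop components that are not needed for prediction
saveRDS(small_fit, file = "glm_fit.rds")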
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for logistic_reg()
with the "glm"
engine.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Logistic regression via mixed models
Description
The "glmer"
engine estimates fixed and random effect regression parameters
using maximum likelihood (or restricted maximum likelihood) estimation.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has no tuning parameters.
Translation from parsnip to the original package
The multilevelmod extension package is required to fit this model.
library(multilevelmod)

logistic_reg() %>%
  set_engine("glmer") %>%
  translate()

## Logistic Regression Model Specification (classification)
##
## Computational engine: glmer
##
## Model fit template:
## lme4::glmer(formula = missing_arg(), data = missing_arg(), weights = missing_arg(),
##     family = binomial)
Predicting new samples
This model can use subject-specific coefficient estimates to make predictions (i.e., partial pooling). For example, this equation shows the linear predictor (\eta) for a random intercept:

\eta_{i} = (\beta_0 + b_{0i}) + \beta_1 x_{i1}

where i denotes the i-th independent experimental unit (e.g., subject). When the model has seen subject i, it can use that subject's data to adjust the population intercept to be more specific to that subject's results.

What happens when data are being predicted for a subject that was not used in the model fit? In that case, this package uses only the population parameter estimates for prediction:

\hat{\eta}_{i'} = \hat{\beta}_0 + \hat{\beta}_1 x_{i'1}

Depending on what covariates are in the model, this might have the effect of making the same prediction for all new samples. The population parameters are the “best estimate” for a subject that was not included in the model fit.
The tidymodels framework deliberately constrains predictions for new data to not use the training set or other data (to prevent information leakage).
Preprocessing requirements
There are no specific preprocessing needs. However, it is helpful to keep the clustering/subject identifier column as factor or character (instead of making them into dummy variables). See the examples in the next section.
Other details
The model can accept case weights.
With parsnip, we suggest using the formula method when fitting:
library(tidymodels)
data("toenail", package = "HSAUR3")

logistic_reg() %>%
  set_engine("glmer") %>%
  fit(outcome ~ treatment * visit + (1 | patientID), data = toenail)
When using tidymodels infrastructure, it may be better to use a
workflow. In this case, you can add the appropriate columns using
add_variables()
then supply the typical formula when adding the model:
library(tidymodels)

glmer_spec <-
  logistic_reg() %>%
  set_engine("glmer")

glmer_wflow <-
  workflow() %>%
  # The data are included as-is using:
  add_variables(outcomes = outcome, predictors = c(treatment, visit, patientID)) %>%
  add_model(glmer_spec, formula = outcome ~ treatment * visit + (1 | patientID))

fit(glmer_wflow, data = toenail)
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() functions have arguments called case_weights that expect vectors of case weights.
References
J Pinheiro, and D Bates. 2000. Mixed-effects models in S and S-PLUS. Springer, New York, NY
West, B, K Welch, and A Galecki. 2014. Linear Mixed Models: A Practical Guide Using Statistical Software. CRC Press.
Thorson, J, Minto, C. 2015, Mixed effects: a unifying framework for statistical modelling in fisheries biology. ICES Journal of Marine Science, Volume 72, Issue 5, Pages 1245–1256.
Harrison, XA, Donaldson, L, Correa-Cano, ME, Evans, J, Fisher, DN, Goodwin, CED, Robinson, BS, Hodgson, DJ, Inger, R. 2018. A brief introduction to mixed effects modelling and multi-model inference in ecology. PeerJ 6:e4794.
DeBruine LM, Barr DJ. Understanding Mixed-Effects Models Through Data Simulation. 2021. Advances in Methods and Practices in Psychological Science.
Logistic regression via glmnet
Description
glmnet::glmnet()
fits a generalized linear model for binary outcomes. A
linear combination of the predictors is used to model the log odds of an
event.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 2 tuning parameters:
- penalty: Amount of Regularization (type: double, default: see below)
- mixture: Proportion of Lasso Penalty (type: double, default: 1.0)
The penalty parameter has no default and requires a single numeric value. For more details about this, and the glmnet model in general, see glmnet-details. As for mixture:
- mixture = 1 specifies a pure lasso model,
- mixture = 0 specifies a ridge regression model, and
- 0 < mixture < 1 specifies an elastic net model, interpolating lasso and ridge.
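A hedged sketch of the single-value requirement at fit time (the penalty values are illustrative; two_class_dat is from the modeldata package):

library(tidymodels)
data(two_class_dat)

glmnet_fit <-
  logistic_reg(penalty = 0.01, mixture = 1) %>%  # a single penalty is needed to fit
  set_engine("glmnet") %>%
  fit(Class ~ A + B, data = two_class_dat)

# Several penalty values can be evaluated at prediction time:
multi_predict(glmnet_fit, two_class_dat, penalty = c(0.001, 0.01, 0.1))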
Translation from parsnip to the original package
logistic_reg(penalty = double(1), mixture = double(1)) %>%
  set_engine("glmnet") %>%
  translate()

## Logistic Regression Model Specification (classification)
##
## Main Arguments:
##   penalty = 0
##   mixture = double(1)
##
## Computational engine: glmnet
##
## Model fit template:
## glmnet::glmnet(x = missing_arg(), y = missing_arg(), weights = missing_arg(),
##     alpha = double(1), family = "binomial")
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values (e.g., dummy or indicator variables) for this engine. When using the formula method via fit(), parsnip will convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one. By default, glmnet::glmnet() uses the argument standardize = TRUE to center and scale the data.
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() functions have arguments called case_weights that expect vectors of case weights.
Sparse Data
This model can utilize sparse data during model fitting and prediction. Both sparse matrices such as dgCMatrix from the Matrix package and sparse tibbles from the sparsevctrs package are supported. See sparse_data for more information.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for logistic_reg()
with the "glmnet"
engine.
References
Hastie, T, R Tibshirani, and M Wainwright. 2015. Statistical Learning with Sparsity. CRC Press.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Logistic regression via h2o
Description
h2o::h2o.glm()
fits a generalized linear model for binary outcomes.
A linear combination of the predictors is used to model the log odds of an
event.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 2 tuning parameters:
- mixture: Proportion of Lasso Penalty (type: double, default: see below)
- penalty: Amount of Regularization (type: double, default: see below)
By default, when not given a fixed penalty, h2o::h2o.glm() uses a heuristic approach to select the optimal value of penalty based on training data. Setting the engine parameter lambda_search to TRUE enables an efficient version of the grid search; see more details at https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/lambda_search.html.
The choice of mixture depends on the engine parameter solver, which is automatically chosen given training data and the specification of other model parameters. When solver is set to 'L-BFGS', mixture defaults to 0 (ridge regression), and 0.5 otherwise.
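A hedged sketch of turning on the penalty search:

logistic_reg() %>%
  set_engine("h2o", lambda_search = TRUE)  # let h2o search over penalty values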
Translation from parsnip to the original package
agua::h2o_train_glm() for logistic_reg() is a wrapper around h2o::h2o.glm(). h2o automatically picks the link function and distribution family for binomial responses.
logistic_reg() %>% set_engine("h2o") %>% translate()
## Logistic Regression Model Specification (classification) ## ## Computational engine: h2o ## ## Model fit template: ## agua::h2o_train_glm(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## validation_frame = missing_arg(), family = "binomial")
To use a non-default argument in h2o::h2o.glm(), pass it in as an engine argument to set_engine():
logistic_reg() %>% set_engine("h2o", compute_p_values = TRUE) %>% translate()
## Logistic Regression Model Specification (classification) ## ## Engine-Specific Arguments: ## compute_p_values = TRUE ## ## Computational engine: h2o ## ## Model fit template: ## agua::h2o_train_glm(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## validation_frame = missing_arg(), compute_p_values = TRUE, ## family = "binomial")
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values (e.g., dummy or indicator variables) for this engine. When using the formula method via fit(), parsnip will convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one. By default, h2o::h2o.glm() uses the argument standardize = TRUE to center and scale all numeric columns.
Initializing h2o
To use the h2o engine with tidymodels, please run h2o::h2o.init() first. By default, this connects R to the local h2o server. This needs to be done in every new R session. You can also connect to a remote h2o server with an IP address; for more details see h2o::h2o.init().
You can control the number of threads in the thread pool used by h2o with the nthreads argument. By default, it uses all CPUs on the host.
This is different from the usual parallel processing mechanism in tidymodels for tuning: while tidymodels parallelizes over resamples, h2o parallelizes over hyperparameter combinations for a given resample.
h2o will automatically shut down the local h2o instance started by R when R is terminated. To manually stop the h2o server, run h2o::h2o.shutdown().
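A hedged sketch of a typical session (the thread count is illustrative):

library(h2o)

h2o.init(nthreads = 4)        # connect to (or start) a local h2o server with 4 threads
# ... fit and predict with the "h2o" engine ...
h2o.shutdown(prompt = FALSE)  # manually stop the local server when done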
Saving fitted model objects
Models fitted with this engine may require native serialization methods to be properly saved and/or passed between R sessions. To learn more about preparing fitted models for serialization, see the bundle package.
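A hedged sketch using the bundle package (h2o_fit is a placeholder for a model fit with this engine):

library(bundle)

bundled <- bundle(h2o_fit)              # create a serializable stand-in
saveRDS(bundled, file = "h2o_fit.rds")

# In a new R session:
h2o_fit <- unbundle(readRDS("h2o_fit.rds"))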
Logistic regression via keras
Description
keras_mlp()
fits a generalized linear model for binary outcomes. A
linear combination of the predictors is used to model the log odds of an
event.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has one tuning parameter:
- penalty: Amount of Regularization (type: double, default: 0.0)
For penalty, the amount of regularization is only the L2 penalty (i.e., ridge or weight decay).
Translation from parsnip to the original package
logistic_reg(penalty = double(1)) %>% set_engine("keras") %>% translate()
## Logistic Regression Model Specification (classification) ## ## Main Arguments: ## penalty = double(1) ## ## Computational engine: keras ## ## Model fit template: ## parsnip::keras_mlp(x = missing_arg(), y = missing_arg(), penalty = double(1), ## hidden_units = 1, act = "linear")
keras_mlp() is a parsnip wrapper around keras code for neural networks. This model fits a logistic regression as a network with a single hidden unit.
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values (e.g., dummy or indicator variables) for this engine. When using the formula method via fit(), parsnip will convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
Case weights
The underlying model implementation does not allow for case weights.
Saving fitted model objects
Models fitted with this engine may require native serialization methods to be properly saved and/or passed between R sessions. To learn more about preparing fitted models for serialization, see the bundle package.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for logistic_reg()
with the "keras"
engine.
References
Hoerl, A., & Kennard, R. (2000). Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 42(1), 80-86.
Logistic regression via LiblineaR
Description
LiblineaR::LiblineaR()
fits a generalized linear model for binary outcomes. A
linear combination of the predictors is used to model the log odds of an
event.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 2 tuning parameters:
- penalty: Amount of Regularization (type: double, default: see below)
- mixture: Proportion of Lasso Penalty (type: double, default: 0)
For LiblineaR models, the value for mixture can either be 0 (for ridge) or 1 (for lasso) but not other intermediate values. In the LiblineaR::LiblineaR() documentation, these correspond to types 0 (L2-regularized) and 6 (L1-regularized).
Be aware that the LiblineaR engine regularizes the intercept. Other regularized regression models do not, which will result in different parameter estimates.
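A hedged sketch of the two valid settings (the penalty value is illustrative):

# Ridge (L2) regularization: LiblineaR type 0
logistic_reg(penalty = 0.1, mixture = 0) %>%
  set_engine("LiblineaR")

# Lasso (L1) regularization: LiblineaR type 6
logistic_reg(penalty = 0.1, mixture = 1) %>%
  set_engine("LiblineaR")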
Translation from parsnip to the original package
logistic_reg(penalty = double(1), mixture = double(1)) %>%
  set_engine("LiblineaR") %>%
  translate()

## Logistic Regression Model Specification (classification)
##
## Main Arguments:
##   penalty = double(1)
##   mixture = double(1)
##
## Computational engine: LiblineaR
##
## Model fit template:
## LiblineaR::LiblineaR(x = missing_arg(), y = missing_arg(), cost = Inf,
##     type = double(1), verbose = FALSE)
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values (e.g., dummy or indicator variables) for this engine. When using the formula method via fit(), parsnip will convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
Sparse Data
This model can utilize sparse data during model fitting and prediction. Both sparse matrices such as dgCMatrix from the Matrix package and sparse tibbles from the sparsevctrs package are supported. See sparse_data for more information.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for logistic_reg()
with the "LiblineaR"
engine.
References
Hastie, T, R Tibshirani, and M Wainwright. 2015. Statistical Learning with Sparsity. CRC Press.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Logistic regression via spark
Description
sparklyr::ml_logistic_regression()
fits a generalized linear model for
binary outcomes. A linear combination of the predictors is used to model the
log odds of an event.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 2 tuning parameters:
- penalty: Amount of Regularization (type: double, default: 0.0)
- mixture: Proportion of Lasso Penalty (type: double, default: 0.0)
For penalty, the amount of regularization includes both the L1 penalty (i.e., lasso) and the L2 penalty (i.e., ridge or weight decay). As for mixture:
- mixture = 1 specifies a pure lasso model,
- mixture = 0 specifies a ridge regression model, and
- 0 < mixture < 1 specifies an elastic net model, interpolating lasso and ridge.
Translation from parsnip to the original package
logistic_reg(penalty = double(1), mixture = double(1)) %>%
  set_engine("spark") %>%
  translate()

## Logistic Regression Model Specification (classification)
##
## Main Arguments:
##   penalty = double(1)
##   mixture = double(1)
##
## Computational engine: spark
##
## Model fit template:
## sparklyr::ml_logistic_regression(x = missing_arg(), formula = missing_arg(),
##     weights = missing_arg(), reg_param = double(1), elastic_net_param = double(1),
##     family = "binomial")
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values (e.g., dummy or indicator variables) for this engine. When using the formula method via fit(), parsnip will convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one. By default, ml_logistic_regression() uses the argument standardization = TRUE to center and scale the data.
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() functions have arguments called case_weights that expect vectors of case weights.
Note that, for spark engines, the case_weight argument value should be a character string that specifies the column with the numeric case weights.
Other details
For models created using the "spark" engine, there are several things to consider.
- Only the formula interface, via fit(), is available; using fit_xy() will generate an error.
- The predictions will always be in a Spark table format. The names will be the same as documented but without the dots.
- There is no equivalent to factor columns in Spark tables, so class predictions are returned as character columns.
- To retain the model object for a new R session (via save()), the model$fit element of the parsnip object should be serialized via ml_save(object$fit) and separately saved to disk. In a new session, the object can be reloaded and reattached to the parsnip object.
References
Luraschi, J, K Kuo, and E Ruiz. 2019. Mastering Spark with R. O’Reilly Media.
Hastie, T, R Tibshirani, and M Wainwright. 2015. Statistical Learning with Sparsity. CRC Press.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Logistic regression via stan
Description
rstanarm::stan_glm()
fits a generalized linear model for binary outcomes.
A linear combination of the predictors is used to model the log odds of an
event.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This engine has no tuning parameters.
Important engine-specific options
Some relevant arguments that can be passed to set_engine()
:
-
chains
: A positive integer specifying the number of Markov chains. The default is 4. -
iter
: A positive integer specifying the number of iterations for each chain (including warmup). The default is 2000. -
seed
: The seed for random number generation. -
cores
: Number of cores to use when executing the chains in parallel. -
prior
: The prior distribution for the (non-hierarchical) regression coefficients. This"stan"
engine does not fit any hierarchical terms. -
prior_intercept
: The prior distribution for the intercept (after centering all predictors).
See rstan::sampling() and rstanarm::priors() for more information on these and other options.
Translation from parsnip to the original package
logistic_reg() %>% set_engine("stan") %>% translate()
## Logistic Regression Model Specification (classification) ## ## Computational engine: stan ## ## Model fit template: ## rstanarm::stan_glm(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg(), family = stats::binomial, refresh = 0)
Note that the refresh default prevents logging of the estimation process. Change this value in set_engine() to show the MCMC logs.
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values (e.g., dummy or indicator variables) for this engine. When using the formula method via fit(), parsnip will convert factor columns to indicators.
Other details
For prediction, the "stan" engine can compute posterior intervals analogous to confidence and prediction intervals. In these instances, the units are the original outcome. When std_error = TRUE, the standard deviation of the posterior distribution (or posterior predictive distribution, as appropriate) is returned.
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() functions have arguments called case_weights that expect vectors of case weights.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for logistic_reg()
with the "stan"
engine.
References
McElreath, R. 2020. Statistical Rethinking. CRC Press.
Logistic regression via hierarchical Bayesian methods
Description
The "stan_glmer"
engine estimates hierarchical regression parameters using
Bayesian estimation.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has no tuning parameters.
Important engine-specific options
Some relevant arguments that can be passed to set_engine()
:
-
chains
: A positive integer specifying the number of Markov chains. The default is 4. -
iter
: A positive integer specifying the number of iterations for each chain (including warmup). The default is 2000. -
seed
: The seed for random number generation. -
cores
: Number of cores to use when executing the chains in parallel. -
prior
: The prior distribution for the (non-hierarchical) regression coefficients. -
prior_intercept
: The prior distribution for the intercept (after centering all predictors).
See ?rstanarm::stan_glmer and ?rstan::sampling for more information.
Translation from parsnip to the original package
The multilevelmod extension package is required to fit this model.
library(multilevelmod)

logistic_reg() %>%
  set_engine("stan_glmer") %>%
  translate()

## Logistic Regression Model Specification (classification)
##
## Computational engine: stan_glmer
##
## Model fit template:
## rstanarm::stan_glmer(formula = missing_arg(), data = missing_arg(),
##     weights = missing_arg(), family = stats::binomial, refresh = 0)
Predicting new samples
This model can use subject-specific coefficient estimates to make predictions (i.e., partial pooling). For example, this equation shows the linear predictor (\eta) for a random intercept:

\eta_{i} = (\beta_0 + b_{0i}) + \beta_1 x_{i1}

where i denotes the i-th independent experimental unit (e.g., subject). When the model has seen subject i, it can use that subject's data to adjust the population intercept to be more specific to that subject's results.

What happens when data are being predicted for a subject that was not used in the model fit? In that case, this package uses only the population parameter estimates for prediction:

\hat{\eta}_{i'} = \hat{\beta}_0 + \hat{\beta}_1 x_{i'1}

Depending on what covariates are in the model, this might have the effect of making the same prediction for all new samples. The population parameters are the “best estimate” for a subject that was not included in the model fit.
The tidymodels framework deliberately constrains predictions for new data to not use the training set or other data (to prevent information leakage).
Preprocessing requirements
There are no specific preprocessing needs. However, it is helpful to keep the clustering/subject identifier column as factor or character (instead of making them into dummy variables). See the examples in the next section.
Other details
The model can accept case weights.
With parsnip, we suggest using the formula method when fitting:
library(tidymodels)
data("toenail", package = "HSAUR3")

logistic_reg() %>%
  set_engine("stan_glmer") %>%
  fit(outcome ~ treatment * visit + (1 | patientID), data = toenail)
When using tidymodels infrastructure, it may be better to use a
workflow. In this case, you can add the appropriate columns using
add_variables()
then supply the typical formula when adding the model:
library(tidymodels)

glmer_spec <-
  logistic_reg() %>%
  set_engine("stan_glmer")

glmer_wflow <-
  workflow() %>%
  # The data are included as-is using:
  add_variables(outcomes = outcome, predictors = c(treatment, visit, patientID)) %>%
  add_model(glmer_spec, formula = outcome ~ treatment * visit + (1 | patientID))

fit(glmer_wflow, data = toenail)
For prediction, the "stan_glmer" engine can compute posterior intervals analogous to confidence and prediction intervals. In these instances, the units are the original outcome. When std_error = TRUE, the standard deviation of the posterior distribution (or posterior predictive distribution, as appropriate) is returned.
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() functions have arguments called case_weights that expect vectors of case weights.
References
McElreath, R. 2020. Statistical Rethinking. CRC Press.
Sorensen, T, Vasishth, S. 2016. Bayesian linear mixed models using Stan: A tutorial for psychologists, linguists, and cognitive scientists, arXiv:1506.06201.
Multivariate adaptive regression splines (MARS) via earth
Description
earth::earth()
fits a generalized linear model that uses artificial features for
some predictors. These features resemble hinge functions and the result is
a model that is a segmented regression in small dimensions.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 3 tuning parameters:
- num_terms: # Model Terms (type: integer, default: see below)
- prod_degree: Degree of Interaction (type: integer, default: 1L)
- prune_method: Pruning Method (type: character, default: ‘backward’)
Parsnip changes the default range for num_terms to c(50, 500).
Translation from parsnip to the original package (regression)
mars(num_terms = integer(1), prod_degree = integer(1), prune_method = character(1)) %>%
  set_engine("earth") %>%
  set_mode("regression") %>%
  translate()

## MARS Model Specification (regression)
##
## Main Arguments:
##   num_terms = integer(1)
##   prod_degree = integer(1)
##   prune_method = character(1)
##
## Computational engine: earth
##
## Model fit template:
## earth::earth(formula = missing_arg(), data = missing_arg(), weights = missing_arg(),
##     nprune = integer(1), degree = integer(1), pmethod = character(1),
##     keepxy = TRUE)
Translation from parsnip to the original package (classification)
mars(num_terms = integer(1), prod_degree = integer(1), prune_method = character(1)) %>%
  set_engine("earth") %>%
  set_mode("classification") %>%
  translate()

## MARS Model Specification (classification)
##
## Main Arguments:
##   num_terms = integer(1)
##   prod_degree = integer(1)
##   prune_method = character(1)
##
## Engine-Specific Arguments:
##   glm = list(family = stats::binomial)
##
## Computational engine: earth
##
## Model fit template:
## earth::earth(formula = missing_arg(), data = missing_arg(), weights = missing_arg(),
##     nprune = integer(1), degree = integer(1), pmethod = character(1),
##     glm = list(family = stats::binomial), keepxy = TRUE)
An alternate method for using MARS for categorical outcomes can be found in discrim_flexible().
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values (e.g., dummy or indicator variables) for this engine. When using the formula method via fit(), parsnip will convert factor columns to indicators.
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.
The fit() and fit_xy() functions have arguments called case_weights that expect vectors of case weights.
Note that the earth package documentation has: “In the current implementation, building models with weights can be slow.”
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for mars()
with the "earth"
engine.
References
Friedman, J. 1991. “Multivariate Adaptive Regression Splines.” The Annals of Statistics, vol. 19, no. 1, pp. 1-67.
Milborrow, S. “Notes on the earth package.”
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Multilayer perceptron via brulee
Description
brulee::brulee_mlp()
fits a neural network.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 7 tuning parameters:
- epochs: # Epochs (type: integer, default: 100L)
- hidden_units: # Hidden Units (type: integer, default: 3L)
- activation: Activation Function (type: character, default: ‘relu’)
- penalty: Amount of Regularization (type: double, default: 0.001)
- mixture: Proportion of Lasso Penalty (type: double, default: 0.0)
- dropout: Dropout Rate (type: double, default: 0.0)
- learn_rate: Learning Rate (type: double, default: 0.01)
The use of the L1 penalty (a.k.a. the lasso penalty) does not force parameters to be strictly zero (as it does in packages such as glmnet). The zeroing out of parameters is a specific feature of the optimization method used in those packages.
Both penalty and dropout should not be used in the same model.
Other engine arguments of interest:
- momentum: A number used to incorporate historical gradient information during optimization.
- batch_size: An integer for the number of training set points in each batch.
- class_weights: Numeric class weights. See brulee::brulee_mlp().
- stop_iter: A non-negative integer for how many iterations with no improvement before stopping (default: 5L).
- rate_schedule: A function to change the learning rate over epochs. See brulee::schedule_decay_time() for details.
Translation from parsnip to the original package (regression)
mlp(
  hidden_units = integer(1),
  penalty = double(1),
  dropout = double(1),
  epochs = integer(1),
  learn_rate = double(1),
  activation = character(1)
) %>%
  set_engine("brulee") %>%
  set_mode("regression") %>%
  translate()

## Single Layer Neural Network Model Specification (regression)
##
## Main Arguments:
##   hidden_units = integer(1)
##   penalty = double(1)
##   dropout = double(1)
##   epochs = integer(1)
##   activation = character(1)
##   learn_rate = double(1)
##
## Computational engine: brulee
##
## Model fit template:
## brulee::brulee_mlp(x = missing_arg(), y = missing_arg(), hidden_units = integer(1),
##     penalty = double(1), dropout = double(1), epochs = integer(1),
##     activation = character(1), learn_rate = double(1))
Note that parsnip automatically sets linear activation in the last layer.
Translation from parsnip to the original package (classification)
mlp(
  hidden_units = integer(1),
  penalty = double(1),
  dropout = double(1),
  epochs = integer(1),
  learn_rate = double(1),
  activation = character(1)
) %>%
  set_engine("brulee") %>%
  set_mode("classification") %>%
  translate()

## Single Layer Neural Network Model Specification (classification)
##
## Main Arguments:
##   hidden_units = integer(1)
##   penalty = double(1)
##   dropout = double(1)
##   epochs = integer(1)
##   activation = character(1)
##   learn_rate = double(1)
##
## Computational engine: brulee
##
## Model fit template:
## brulee::brulee_mlp(x = missing_arg(), y = missing_arg(), hidden_units = integer(1),
##     penalty = double(1), dropout = double(1), epochs = integer(1),
##     activation = character(1), learn_rate = double(1))
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values (e.g., dummy or indicator variables) for this engine. When using the formula method via fit(), parsnip will convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
Case weights
The underlying model implementation does not allow for case weights.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Multilayer perceptron via brulee with two hidden layers
Description
brulee::brulee_mlp_two_layer()
fits a neural network (with version 0.3.0.9000 or higher of brulee).
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 7 tuning parameters:
- epochs: # Epochs (type: integer, default: 100L)
- hidden_units: # Hidden Units (type: integer, default: 3L)
- activation: Activation Function (type: character, default: ‘relu’)
- penalty: Amount of Regularization (type: double, default: 0.001)
- mixture: Proportion of Lasso Penalty (type: double, default: 0.0)
- dropout: Dropout Rate (type: double, default: 0.0)
- learn_rate: Learning Rate (type: double, default: 0.01)
The use of the L1 penalty (a.k.a. the lasso penalty) does not force parameters to be strictly zero (as it does in packages such as glmnet). The zeroing out of parameters is a specific feature of the optimization method used in those packages.
Both penalty and dropout should not be used in the same model.
Other engine arguments of interest:
- hidden_units_2 and activation_2 control the format of the second layer.
- momentum: A number used to incorporate historical gradient information during optimization.
- batch_size: An integer for the number of training set points in each batch.
- class_weights: Numeric class weights. See brulee::brulee_mlp().
- stop_iter: A non-negative integer for how many iterations with no improvement before stopping (default: 5L).
- rate_schedule: A function to change the learning rate over epochs. See brulee::schedule_decay_time() for details.
Translation from parsnip to the original package (regression)
mlp(
  hidden_units = integer(1),
  penalty = double(1),
  dropout = double(1),
  epochs = integer(1),
  learn_rate = double(1),
  activation = character(1)
) %>%
  set_engine("brulee_two_layer",
             hidden_units_2 = integer(1),
             activation_2 = character(1)) %>%
  set_mode("regression") %>%
  translate()

## Single Layer Neural Network Model Specification (regression)
##
## Main Arguments:
##   hidden_units = integer(1)
##   penalty = double(1)
##   dropout = double(1)
##   epochs = integer(1)
##   activation = character(1)
##   learn_rate = double(1)
##
## Engine-Specific Arguments:
##   hidden_units_2 = integer(1)
##   activation_2 = character(1)
##
## Computational engine: brulee_two_layer
##
## Model fit template:
## brulee::brulee_mlp_two_layer(x = missing_arg(), y = missing_arg(),
##     hidden_units = integer(1), penalty = double(1), dropout = double(1),
##     epochs = integer(1), activation = character(1), learn_rate = double(1),
##     hidden_units_2 = integer(1), activation_2 = character(1))
Note that parsnip automatically sets the linear activation in the last layer.
Translation from parsnip to the original package (classification)
mlp(
  hidden_units = integer(1),
  penalty = double(1),
  dropout = double(1),
  epochs = integer(1),
  learn_rate = double(1),
  activation = character(1)
) %>%
  set_engine("brulee_two_layer",
             hidden_units_2 = integer(1),
             activation_2 = character(1)) %>%
  set_mode("classification") %>%
  translate()

## Single Layer Neural Network Model Specification (classification)
##
## Main Arguments:
##   hidden_units = integer(1)
##   penalty = double(1)
##   dropout = double(1)
##   epochs = integer(1)
##   activation = character(1)
##   learn_rate = double(1)
##
## Engine-Specific Arguments:
##   hidden_units_2 = integer(1)
##   activation_2 = character(1)
##
## Computational engine: brulee_two_layer
##
## Model fit template:
## brulee::brulee_mlp_two_layer(x = missing_arg(), y = missing_arg(),
##     hidden_units = integer(1), penalty = double(1), dropout = double(1),
##     epochs = integer(1), activation = character(1), learn_rate = double(1),
##     hidden_units_2 = integer(1), activation_2 = character(1))
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values (e.g., dummy or indicator variables) for this engine. When using the formula method via fit(), parsnip will convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
Case weights
The underlying model implementation does not allow for case weights.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Multilayer perceptron via h2o
Description
h2o::h2o.deeplearning()
fits a feed-forward neural network.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 6 tuning parameters:
- hidden_units: # Hidden Units (type: integer, default: 200L)
- penalty: Amount of Regularization (type: double, default: 0.0)
- dropout: Dropout Rate (type: double, default: 0.5)
- epochs: # Epochs (type: integer, default: 10)
- activation: Activation Function (type: character, default: see below)
- learn_rate: Learning Rate (type: double, default: 0.005)
The naming of activation functions in h2o::h2o.deeplearning() differs from parsnip's conventions. Currently, only “relu” and “tanh” are supported; these will be converted internally to “Rectifier” and “Tanh” before being passed to the fitting function.
penalty corresponds to the l2 penalty. h2o::h2o.deeplearning() also supports specifying the l1 penalty directly with the engine argument l1.
Other engine arguments of interest:
- stopping_rounds controls early stopping rounds based on the convergence of another engine parameter stopping_metric. By default, h2o::h2o.deeplearning() stops training if the simple moving average of length 5 of the stopping_metric does not improve for 5 scoring events. This is mostly useful when used alongside the engine parameter validation, the proportion of the train-validation split; parsnip will split and pass the two data frames to h2o, and h2o::h2o.deeplearning() will evaluate the metric and early stopping criteria on the validation set.
- h2o uses a 50% dropout ratio controlled by dropout for hidden layers by default. h2o::h2o.deeplearning() provides an engine argument input_dropout_ratio for dropout ratios in the input layer, which defaults to 0.
Translation from parsnip to the original package (regression)
agua::h2o_train_mlp() is a wrapper around h2o::h2o.deeplearning().
mlp(
  hidden_units = integer(1),
  penalty = double(1),
  dropout = double(1),
  epochs = integer(1),
  learn_rate = double(1),
  activation = character(1)
) %>%
  set_engine("h2o") %>%
  set_mode("regression") %>%
  translate()

## Single Layer Neural Network Model Specification (regression)
##
## Main Arguments:
##   hidden_units = integer(1)
##   penalty = double(1)
##   dropout = double(1)
##   epochs = integer(1)
##   activation = character(1)
##   learn_rate = double(1)
##
## Computational engine: h2o
##
## Model fit template:
## agua::h2o_train_mlp(x = missing_arg(), y = missing_arg(), weights = missing_arg(),
##     validation_frame = missing_arg(), hidden = integer(1), l2 = double(1),
##     hidden_dropout_ratios = double(1), epochs = integer(1), activation = character(1),
##     rate = double(1))
Translation from parsnip to the original package (classification)
mlp(
  hidden_units = integer(1),
  penalty = double(1),
  dropout = double(1),
  epochs = integer(1),
  learn_rate = double(1),
  activation = character(1)
) %>%
  set_engine("h2o") %>%
  set_mode("classification") %>%
  translate()

## Single Layer Neural Network Model Specification (classification)
##
## Main Arguments:
##   hidden_units = integer(1)
##   penalty = double(1)
##   dropout = double(1)
##   epochs = integer(1)
##   activation = character(1)
##   learn_rate = double(1)
##
## Computational engine: h2o
##
## Model fit template:
## agua::h2o_train_mlp(x = missing_arg(), y = missing_arg(), weights = missing_arg(),
##     validation_frame = missing_arg(), hidden = integer(1), l2 = double(1),
##     hidden_dropout_ratios = double(1), epochs = integer(1), activation = character(1),
##     rate = double(1))
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values (e.g., dummy or indicator variables) for this engine. When using the formula method via fit(), parsnip will convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one. By default, h2o::h2o.deeplearning() uses the argument standardize = TRUE to center and scale all numeric columns.
Initializing h2o
To use the h2o engine with tidymodels, please run h2o::h2o.init() first. By default, this connects R to the local h2o server. This needs to be done in every new R session. You can also connect to a remote h2o server with an IP address; for more details see h2o::h2o.init().
You can control the number of threads in the thread pool used by h2o with the nthreads argument. By default, it uses all CPUs on the host.
This is different from the usual parallel processing mechanism in tidymodels for tuning: while tidymodels parallelizes over resamples, h2o parallelizes over hyperparameter combinations for a given resample.
h2o will automatically shut down the local h2o instance started by R when R is terminated. To manually stop the h2o server, run h2o::h2o.shutdown().
Saving fitted model objects
Models fitted with this engine may require native serialization methods to be properly saved and/or passed between R sessions. To learn more about preparing fitted models for serialization, see the bundle package.
Multilayer perceptron via keras
Description
keras_mlp()
fits a single layer, feed-forward neural network.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 5 tuning parameters:
- hidden_units: # Hidden Units (type: integer, default: 5L)
- penalty: Amount of Regularization (type: double, default: 0.0)
- dropout: Dropout Rate (type: double, default: 0.0)
- epochs: # Epochs (type: integer, default: 20L)
- activation: Activation Function (type: character, default: ‘softmax’)
Translation from parsnip to the original package (regression)
mlp(
  hidden_units = integer(1),
  penalty = double(1),
  dropout = double(1),
  epochs = integer(1),
  activation = character(1)
) %>%
  set_engine("keras") %>%
  set_mode("regression") %>%
  translate()

## Single Layer Neural Network Model Specification (regression)
##
## Main Arguments:
##   hidden_units = integer(1)
##   penalty = double(1)
##   dropout = double(1)
##   epochs = integer(1)
##   activation = character(1)
##
## Computational engine: keras
##
## Model fit template:
## parsnip::keras_mlp(x = missing_arg(), y = missing_arg(), hidden_units = integer(1),
##     penalty = double(1), dropout = double(1), epochs = integer(1),
##     activation = character(1))
Translation from parsnip to the original package (classification)
mlp( hidden_units = integer(1), penalty = double(1), dropout = double(1), epochs = integer(1), activation = character(1) ) %>% set_engine("keras") %>% set_mode("classification") %>% translate()
## Single Layer Neural Network Model Specification (classification) ## ## Main Arguments: ## hidden_units = integer(1) ## penalty = double(1) ## dropout = double(1) ## epochs = integer(1) ## activation = character(1) ## ## Computational engine: keras ## ## Model fit template: ## parsnip::keras_mlp(x = missing_arg(), y = missing_arg(), hidden_units = integer(1), ## penalty = double(1), dropout = double(1), epochs = integer(1), ## activation = character(1))
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
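A minimal recipe sketch that covers both requirements (the data frame dat and the outcome y are hypothetical):

library(tidymodels)
rec <- recipe(y ~ ., data = dat) %>%
  # convert factor predictors to indicator columns
  step_dummy(all_nominal_predictors()) %>%
  # center and scale so each predictor has mean zero and unit variance
  step_normalize(all_numeric_predictors())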
Case weights
The underlying model implementation does not allow for case weights.
Saving fitted model objects
Models fitted with this engine may require native serialization methods to be properly saved and/or passed between R sessions. To learn more about preparing fitted models for serialization, see the bundle package.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for mlp()
with the "keras"
engine.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Multilayer perceptron via nnet
Description
nnet::nnet()
fits a single layer, feed-forward neural network.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 3 tuning parameters:
- hidden_units: # Hidden Units (type: integer, default: none)
- penalty: Amount of Regularization (type: double, default: 0.0)
- epochs: # Epochs (type: integer, default: 100L)
Note that, in nnet::nnet(), the maximum number of parameters is controlled by an argument with a fairly low default, MaxNWts = 1000. For some models, you may need to pass a larger value in via set_engine() so that the model does not fail, as shown below.
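For example, a minimal sketch of raising that limit (the value 5000 is illustrative):

mlp(hidden_units = 10, epochs = 100) %>%
  # raise the weight limit for larger networks
  set_engine("nnet", MaxNWts = 5000) %>%
  set_mode("classification")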
Translation from parsnip to the original package (regression)
mlp( hidden_units = integer(1), penalty = double(1), epochs = integer(1) ) %>% set_engine("nnet") %>% set_mode("regression") %>% translate()
## Single Layer Neural Network Model Specification (regression) ## ## Main Arguments: ## hidden_units = integer(1) ## penalty = double(1) ## epochs = integer(1) ## ## Computational engine: nnet ## ## Model fit template: ## nnet::nnet(formula = missing_arg(), data = missing_arg(), size = integer(1), ## decay = double(1), maxit = integer(1), trace = FALSE, linout = TRUE)
Note that parsnip automatically sets linear activation in the last layer.
Translation from parsnip to the original package (classification)
mlp( hidden_units = integer(1), penalty = double(1), epochs = integer(1) ) %>% set_engine("nnet") %>% set_mode("classification") %>% translate()
## Single Layer Neural Network Model Specification (classification) ## ## Main Arguments: ## hidden_units = integer(1) ## penalty = double(1) ## epochs = integer(1) ## ## Computational engine: nnet ## ## Model fit template: ## nnet::nnet(formula = missing_arg(), data = missing_arg(), size = integer(1), ## decay = double(1), maxit = integer(1), trace = FALSE, linout = FALSE)
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
Case weights
The underlying model implementation does not allow for case weights.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for mlp()
with the "nnet"
engine.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Multinomial regression via brulee
Description
brulee::brulee_multinomial_reg()
fits a model that uses linear predictors
to predict multiclass data using the multinomial distribution.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 2 tuning parameters:
- penalty: Amount of Regularization (type: double, default: 0.001)
- mixture: Proportion of Lasso Penalty (type: double, default: 0.0)
The use of the L1 penalty (a.k.a. the lasso penalty) does not force parameters to be strictly zero (as it does in packages such as glmnet). The zeroing out of parameters is a specific feature of the optimization method used in those packages.
Other engine arguments of interest (a sketch follows this list):
- optimizer(): The optimization method. See brulee::brulee_linear_reg().
- epochs(): An integer for the number of passes through the training set.
- learn_rate(): A number used to accelerate the gradient descent process.
- momentum(): A number for using historical gradient information during optimization (optimizer = "SGD" only).
- batch_size(): An integer for the number of training set points in each batch.
- stop_iter(): A non-negative integer for how many iterations with no improvement before stopping (default: 5L).
- class_weights(): Numeric class weights. See brulee::brulee_multinomial_reg().
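As a sketch, these are supplied through set_engine(); the argument values here are illustrative:

multinom_reg(penalty = 0.001) %>%
  # engine arguments are passed through to brulee::brulee_multinomial_reg()
  set_engine("brulee", optimizer = "SGD", momentum = 0.9, stop_iter = 5) %>%
  translate()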
Translation from parsnip to the original package (classification)
multinom_reg(penalty = double(1)) %>% set_engine("brulee") %>% translate()
## Multinomial Regression Model Specification (classification) ## ## Main Arguments: ## penalty = double(1) ## ## Computational engine: brulee ## ## Model fit template: ## brulee::brulee_multinomial_reg(x = missing_arg(), y = missing_arg(), ## penalty = double(1))
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
Case weights
The underlying model implementation does not allow for case weights.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Multinomial regression via glmnet
Description
glmnet::glmnet()
fits a model that uses linear predictors to predict
multiclass data using the multinomial distribution.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 2 tuning parameters:
- penalty: Amount of Regularization (type: double, default: see below)
- mixture: Proportion of Lasso Penalty (type: double, default: 1.0)
The penalty
parameter has no default and requires a single numeric
value. For more details about this, and the glmnet
model in general,
see glmnet-details. As for mixture:
- mixture = 1 specifies a pure lasso model,
- mixture = 0 specifies a ridge regression model, and
- 0 < mixture < 1 specifies an elastic net model, interpolating lasso and ridge.
Translation from parsnip to the original package
multinom_reg(penalty = double(1), mixture = double(1)) %>% set_engine("glmnet") %>% translate()
## Multinomial Regression Model Specification (classification) ## ## Main Arguments: ## penalty = 0 ## mixture = double(1) ## ## Computational engine: glmnet ## ## Model fit template: ## glmnet::glmnet(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## alpha = double(1), family = "multinomial")
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to
center and scale each so that each predictor has mean zero and a
variance of one. By default, glmnet::glmnet()
uses
the argument standardize = TRUE
to center and scale the data.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for multinom_reg()
with the "glmnet"
engine.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples
on tidymodels.org
.
The fit()
and fit_xy()
functions have arguments called
case_weights
that expect vectors of case weights.
Sparse Data
This model can utilize sparse data during model fitting and prediction.
Both sparse matrices such as dgCMatrix from the Matrix
package and
sparse tibbles from the sparsevctrs
package are supported. See
sparse_data for more information.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
References
Hastie, T, R Tibshirani, and M Wainwright. 2015. Statistical Learning with Sparsity. CRC Press.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Multinomial regression via h2o
Description
h2o::h2o.glm()
fits a model that uses linear predictors to predict
multiclass data for multinomial responses.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 2 tuning parameters:
- mixture: Proportion of Lasso Penalty (type: double, default: see below)
- penalty: Amount of Regularization (type: double, default: see below)
By default, when not given a fixed penalty
,
h2o::h2o.glm()
uses a heuristic approach to select
the optimal value of penalty
based on training data. Setting the
engine parameter lambda_search
to TRUE
enables an efficient version
of the grid search; see more details at
https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/lambda_search.html.
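A sketch of enabling this search (penalty is deliberately left unset so that h2o selects it):

multinom_reg() %>%
  # lambda_search is an h2o engine argument, not a main parsnip argument
  set_engine("h2o", lambda_search = TRUE) %>%
  translate()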
The choice of mixture
depends on the engine parameter solver
, which
is automatically chosen given training data and the specification of
other model parameters. When solver
is set to 'L-BFGS'
, mixture
defaults to 0 (ridge regression) and 0.5 otherwise.
Translation from parsnip to the original package
agua::h2o_train_glm()
for multinom_reg()
is
a wrapper around h2o::h2o.glm()
with
family = 'multinomial'
.
multinom_reg(penalty = double(1), mixture = double(1)) %>% set_engine("h2o") %>% translate()
## Multinomial Regression Model Specification (classification) ## ## Main Arguments: ## penalty = double(1) ## mixture = double(1) ## ## Computational engine: h2o ## ## Model fit template: ## agua::h2o_train_glm(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## validation_frame = missing_arg(), lambda = double(1), alpha = double(1), ## family = "multinomial")
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
By default, h2o::h2o.glm()
uses the argument
standardize = TRUE
to center and scale the data.
Initializing h2o
To use the h2o engine with tidymodels, please run h2o::h2o.init()
first. By default, this connects R to the local h2o server. This needs to be done in every new R session. You can also connect to a remote h2o server with an IP address; for more details see
h2o::h2o.init()
.
You can control the number of threads in the thread pool used by h2o
with the nthreads
argument. By default, it uses all CPUs on the host.
This is different from the usual parallel processing mechanism in
tidymodels for tuning: while tidymodels parallelizes over resamples, h2o
parallelizes over hyperparameter combinations for a given resample.
h2o will automatically shut down the local h2o instance started by R
when R is terminated. To manually stop the h2o server, run
h2o::h2o.shutdown()
.
Multinomial regression via keras
Description
keras_mlp()
fits a model that uses linear predictors to predict
multiclass data using the multinomial distribution.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has one tuning parameter:
- penalty: Amount of Regularization (type: double, default: 0.0)
For penalty, the amount of regularization includes only the L2 penalty (i.e., ridge or weight decay).
Translation from parsnip to the original package
multinom_reg(penalty = double(1)) %>% set_engine("keras") %>% translate()
## Multinomial Regression Model Specification (classification) ## ## Main Arguments: ## penalty = double(1) ## ## Computational engine: keras ## ## Model fit template: ## parsnip::keras_mlp(x = missing_arg(), y = missing_arg(), penalty = double(1), ## hidden_units = 1, act = "linear")
keras_mlp()
is a parsnip wrapper around keras code for
neural networks. This model fits a linear regression as a network with a
single hidden unit.
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
Case weights
The underlying model implementation does not allow for case weights.
Saving fitted model objects
Models fitted with this engine may require native serialization methods to be properly saved and/or passed between R sessions. To learn more about preparing fitted models for serialization, see the bundle package.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for multinom_reg()
with the "keras"
engine.
References
Hoerl, A., & Kennard, R. (2000). Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 42(1), 80-86.
Multinomial regression via nnet
Description
nnet::multinom()
fits a model that uses linear predictors to predict
multiclass data using the multinomial distribution.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 1 tuning parameter:
- penalty: Amount of Regularization (type: double, default: 0.0)
For penalty
, the amount of regularization includes only the L2 penalty
(i.e., ridge or weight decay).
Translation from parsnip to the original package
multinom_reg(penalty = double(1)) %>% set_engine("nnet") %>% translate()
## Multinomial Regression Model Specification (classification) ## ## Main Arguments: ## penalty = double(1) ## ## Computational engine: nnet ## ## Model fit template: ## nnet::multinom(formula = missing_arg(), data = missing_arg(), ## decay = double(1), trace = FALSE)
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for multinom_reg()
with the "nnet"
engine.
Case weights
The underlying model implementation does not allow for case weights.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
References
Luraschi, J, K Kuo, and E Ruiz. 2019. Mastering Spark with R. O’Reilly Media
Hastie, T, R Tibshirani, and M Wainwright. 2015. Statistical Learning with Sparsity. CRC Press.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Multinomial regression via spark
Description
sparklyr::ml_logistic_regression()
fits a model that uses linear
predictors to predict multiclass data using the multinomial distribution.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 2 tuning parameters:
- penalty: Amount of Regularization (type: double, default: 0.0)
- mixture: Proportion of Lasso Penalty (type: double, default: 0.0)
For penalty
, the amount of regularization includes both the L1 penalty
(i.e., lasso) and the L2 penalty (i.e., ridge or weight decay). As for
mixture:
- mixture = 1 specifies a pure lasso model,
- mixture = 0 specifies a ridge regression model, and
- 0 < mixture < 1 specifies an elastic net model, interpolating lasso and ridge.
Translation from parsnip to the original package
multinom_reg(penalty = double(1), mixture = double(1)) %>% set_engine("spark") %>% translate()
## Multinomial Regression Model Specification (classification) ## ## Main Arguments: ## penalty = double(1) ## mixture = double(1) ## ## Computational engine: spark ## ## Model fit template: ## sparklyr::ml_logistic_regression(x = missing_arg(), formula = missing_arg(), ## weights = missing_arg(), reg_param = double(1), elastic_net_param = double(1), ## family = "multinomial")
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
By default, ml_logistic_regression() uses the argument standardization = TRUE to center and scale the data.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples
on tidymodels.org
.
The fit()
and fit_xy()
functions have arguments called
case_weights
that expect vectors of case weights.
Note that, for spark engines, the case_weight
argument value should be
a character string to specify the column with the numeric case weights.
Other details
For models created using the "spark"
engine, there are several things
to consider.
- Only the formula interface via fit() is available; using fit_xy() will generate an error.
- The predictions will always be in a Spark table format. The names will be the same as documented but without the dots.
- There is no equivalent to factor columns in Spark tables so class predictions are returned as character columns.
- To retain the model object for a new R session (via save()), the model$fit element of the parsnip object should be serialized via ml_save(object$fit) and separately saved to disk. In a new session, the object can be reloaded and reattached to the parsnip object, as sketched below.
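A sketch of that save/reload cycle (the object names fit_spark and sc and the path are hypothetical; sc is an active Spark connection):

library(sparklyr)
# save the underlying Spark model from the parsnip fit
ml_save(fit_spark$fit, path = "multinom_model", overwrite = TRUE)
# in a new session, reload it and reattach it to the parsnip object
fit_spark$fit <- ml_load(sc, path = "multinom_model")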
References
Luraschi, J, K Kuo, and E Ruiz. 2019. Mastering Spark with R. O’Reilly Media
Hastie, T, R Tibshirani, and M Wainwright. 2015. Statistical Learning with Sparsity. CRC Press.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Naive Bayes models via h2o
Description
h2o::h2o.naiveBayes()
fits a model that uses Bayes' theorem to compute
the probability of each class, given the predictor values.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 1 tuning parameter:
- Laplace: Laplace Correction (type: double, default: 0.0)
h2o::h2o.naiveBayes()
provides several engine
arguments to deal with imbalances and rare classes:
- balance_classes: A logical value controlling over/under-sampling (for imbalanced data). Defaults to FALSE.
- class_sampling_factors: The over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes to be TRUE.
- min_sdev: The minimum standard deviation to use for observations without enough data; must be greater than 1e-10.
- min_prob: The minimum probability to use for observations with not enough data.
Translation from parsnip to the original package
The agua extension package is required to fit this model.
agua::h2o_train_nb()
is a wrapper around
h2o::h2o.naiveBayes()
.
naive_Bayes(Laplace = numeric(0)) %>% set_engine("h2o") %>% translate()
## Naive Bayes Model Specification (classification) ## ## Main Arguments: ## Laplace = numeric(0) ## ## Computational engine: h2o ## ## Model fit template: ## agua::h2o_train_nb(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## validation_frame = missing_arg(), laplace = numeric(0))
Initializing h2o
To use the h2o engine with tidymodels, please run h2o::h2o.init()
first. By default, this connects R to the local h2o server. This needs to be done in every new R session. You can also connect to a remote h2o server with an IP address; for more details see
h2o::h2o.init()
.
You can control the number of threads in the thread pool used by h2o
with the nthreads
argument. By default, it uses all CPUs on the host.
This is different from the usual parallel processing mechanism in
tidymodels for tuning: while tidymodels parallelizes over resamples, h2o
parallelizes over hyperparameter combinations for a given resample.
h2o will automatically shut down the local h2o instance started by R
when R is terminated. To manually stop the h2o server, run
h2o::h2o.shutdown()
.
Saving fitted model objects
Models fitted with this engine may require native serialization methods to be properly saved and/or passed between R sessions. To learn more about preparing fitted models for serialization, see the bundle package.
Naive Bayes models via klaR
Description
klaR::NaiveBayes()
fits a model that uses Bayes' theorem to compute the
probability of each class, given the predictor values.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 2 tuning parameters:
- smoothness: Kernel Smoothness (type: double, default: 1.0)
- Laplace: Laplace Correction (type: double, default: 0.0)
Note that the engine argument usekernel
is set to TRUE
by default
when using the klaR
engine.
Translation from parsnip to the original package
The discrim extension package is required to fit this model.
library(discrim) naive_Bayes(smoothness = numeric(0), Laplace = numeric(0)) %>% set_engine("klaR") %>% translate()
## Naive Bayes Model Specification (classification) ## ## Main Arguments: ## smoothness = numeric(0) ## Laplace = numeric(0) ## ## Computational engine: klaR ## ## Model fit template: ## discrim::klar_bayes_wrapper(x = missing_arg(), y = missing_arg(), ## adjust = numeric(0), fL = numeric(0), usekernel = TRUE)
Preprocessing requirements
The columns for qualitative predictors should always be represented as factors (as opposed to dummy/indicator variables). When the predictors are factors, the underlying code treats them as multinomial data and appropriately computes their conditional distributions.
Variance calculations are used in these computations so zero-variance predictors (i.e., with a single unique value) should be eliminated before fitting the model.
Case weights
The underlying model implementation does not allow for case weights.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Naive Bayes models via naivebayes
Description
naivebayes::naive_bayes()
fits a model that uses Bayes' theorem to compute
the probability of each class, given the predictor values.
Details
For this engine, there is a single mode: classification
Tuning Parameters
This model has 2 tuning parameters:
- smoothness: Kernel Smoothness (type: double, default: 1.0)
- Laplace: Laplace Correction (type: double, default: 0.0)
Note that the engine argument usekernel
is set to TRUE
by default
when using the naivebayes
engine.
Translation from parsnip to the original package
The discrim extension package is required to fit this model.
library(discrim) naive_Bayes(smoothness = numeric(0), Laplace = numeric(0)) %>% set_engine("naivebayes") %>% translate()
## Naive Bayes Model Specification (classification) ## ## Main Arguments: ## smoothness = numeric(0) ## Laplace = numeric(0) ## ## Computational engine: naivebayes ## ## Model fit template: ## naivebayes::naive_bayes(x = missing_arg(), y = missing_arg(), ## adjust = numeric(0), laplace = numeric(0), usekernel = TRUE)
Preprocessing requirements
The columns for qualitative predictors should always be represented as factors (as opposed to dummy/indicator variables). When the predictors are factors, the underlying code treats them as multinomial data and appropriately computes their conditional distributions.
For count data, integers can be estimated using a Poisson distribution
if the argument usepoisson = TRUE
is passed as an engine argument.
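For example, a sketch of passing that engine argument:

library(discrim)
naive_Bayes(smoothness = 1) %>%
  # usepoisson is passed through to naivebayes::naive_bayes()
  set_engine("naivebayes", usepoisson = TRUE) %>%
  translate()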
Variance calculations are used in these computations so zero-variance predictors (i.e., with a single unique value) should be eliminated before fitting the model.
Case weights
The underlying model implementation does not allow for case weights.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
K-nearest neighbors via kknn
Description
kknn::train.kknn()
fits a model that uses the K
most similar data points
from the training set to predict new samples.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 3 tuning parameters:
- neighbors: # Nearest Neighbors (type: integer, default: 5L)
- weight_func: Distance Weighting Function (type: character, default: ‘optimal’)
- dist_power: Minkowski Distance Order (type: double, default: 2.0)
Parsnip changes the default range for neighbors to c(1, 15) and dist_power to c(1/10, 2).
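A sketch of inspecting those adjusted ranges for a tunable specification (requires the dials package, loaded here with tidymodels):

library(tidymodels)
knn_spec <- nearest_neighbor(neighbors = tune(), dist_power = tune()) %>%
  set_engine("kknn") %>%
  set_mode("regression")
# show the parameter set, including the adjusted default ranges
extract_parameter_set_dials(knn_spec)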
Translation from parsnip to the original package (regression)
nearest_neighbor( neighbors = integer(1), weight_func = character(1), dist_power = double(1) ) %>% set_engine("kknn") %>% set_mode("regression") %>% translate()
## K-Nearest Neighbor Model Specification (regression) ## ## Main Arguments: ## neighbors = integer(1) ## weight_func = character(1) ## dist_power = double(1) ## ## Computational engine: kknn ## ## Model fit template: ## kknn::train.kknn(formula = missing_arg(), data = missing_arg(), ## ks = min_rows(0L, data, 5), kernel = character(1), distance = double(1))
min_rows() will adjust the number of neighbors if the chosen value is not consistent with the actual data dimensions.
Translation from parsnip to the original package (classification)
nearest_neighbor( neighbors = integer(1), weight_func = character(1), dist_power = double(1) ) %>% set_engine("kknn") %>% set_mode("classification") %>% translate()
## K-Nearest Neighbor Model Specification (classification) ## ## Main Arguments: ## neighbors = integer(1) ## weight_func = character(1) ## dist_power = double(1) ## ## Computational engine: kknn ## ## Model fit template: ## kknn::train.kknn(formula = missing_arg(), data = missing_arg(), ## ks = min_rows(0L, data, 5), kernel = character(1), distance = double(1))
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for nearest_neighbor()
with the "kknn"
engine.
Case weights
The underlying model implementation does not allow for case weights.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
References
Hechenbichler K. and Schliep K.P. (2004) Weighted k-Nearest-Neighbor Techniques and Ordinal Classification, Discussion Paper 399, SFB 386, Ludwig-Maximilians University Munich
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Partial least squares via mixOmics
Description
The mixOmics package can fit several different types of PLS models.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 2 tuning parameters:
-
predictor_prop
: Proportion of Predictors (type: double, default: see below) -
num_comp
: # Components (type: integer, default: 2L)
Translation from parsnip to the underlying model call (regression)
The plsmod extension package is required to fit this model.
library(plsmod) pls(num_comp = integer(1), predictor_prop = double(1)) %>% set_engine("mixOmics") %>% set_mode("regression") %>% translate()
## PLS Model Specification (regression) ## ## Main Arguments: ## predictor_prop = double(1) ## num_comp = integer(1) ## ## Computational engine: mixOmics ## ## Model fit template: ## plsmod::pls_fit(x = missing_arg(), y = missing_arg(), predictor_prop = double(1), ## ncomp = integer(1))
plsmod::pls_fit()
is a function that:
- Determines the number of predictors in the data.
- Adjusts num_comp if the value is larger than the number of factors.
- Determines whether sparsity is required based on the value of predictor_prop.
- Sets the keepX argument of mixOmics::spls() for sparse models.
Translation from parsnip to the underlying model call (classification)
The plsmod extension package is required to fit this model.
library(plsmod) pls(num_comp = integer(1), predictor_prop = double(1)) %>% set_engine("mixOmics") %>% set_mode("classification") %>% translate()
## PLS Model Specification (classification) ## ## Main Arguments: ## predictor_prop = double(1) ## num_comp = integer(1) ## ## Computational engine: mixOmics ## ## Model fit template: ## plsmod::pls_fit(x = missing_arg(), y = missing_arg(), predictor_prop = double(1), ## ncomp = integer(1))
In this case, plsmod::pls_fit()
has the same role
as above but eventually targets mixOmics::plsda()
or
mixOmics::splsda()
.
Installing mixOmics
This package is available via the Bioconductor repository and is not accessible via CRAN. You can install using:
if (!require("remotes", quietly = TRUE)) { install.packages("remotes") } remotes::install_bioc("mixOmics")
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Variance calculations are used in these computations so zero-variance predictors (i.e., with a single unique value) should be eliminated before fitting the model.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
Case weights
The underlying model implementation does not allow for case weights.
References
Rohart F and Gautier B and Singh A and Le Cao K-A (2017). “mixOmics: An R package for ’omics feature selection and multiple data integration.” PLoS computational biology, 13(11), e1005752.
Poisson regression via generalized estimating equations (GEE)
Description
gee::gee()
uses generalized least squares to fit different types of models
with errors that are not independent.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This model has no formal tuning parameters. It may be beneficial to determine the appropriate correlation structure to use, but this typically does not affect the predicted value of the model. It does have an effect on the inferential results and parameter covariance values.
Translation from parsnip to the original package
The multilevelmod extension package is required to fit this model.
library(multilevelmod) poisson_reg(engine = "gee") %>% set_engine("gee") %>% translate()
## Poisson Regression Model Specification (regression) ## ## Computational engine: gee ## ## Model fit template: ## multilevelmod::gee_fit(formula = missing_arg(), data = missing_arg(), ## family = stats::poisson)
multilevelmod::gee_fit() is a wrapper around gee::gee().
Preprocessing requirements
There are no specific preprocessing needs. However, it is helpful to keep the clustering/subject identifier column as factor or character (instead of making them into dummy variables). See the examples in the next section.
Case weights
The underlying model implementation does not allow for case weights.
Other details
Both gee::gee() and geepack::geeglm() specify the id/cluster variable using an argument id that requires a vector. parsnip doesn’t work that way, so we enable this model to be fit using an artificial function id_var() to be used in the formula. So, in the original package, the call would look like:
gee(breaks ~ tension, id = wool, data = warpbreaks, corstr = "exchangeable")
With parsnip, we suggest using the formula method when fitting:
library(tidymodels) poisson_reg() %>% set_engine("gee", corstr = "exchangeable") %>% fit(y ~ time + x + id_var(subject), data = longitudinal_counts)
When using tidymodels infrastructure, it may be better to use a
workflow. In this case, you can add the appropriate columns using
add_variables()
then supply the GEE formula when adding the model:
library(tidymodels) gee_spec <- poisson_reg() %>% set_engine("gee", corstr = "exchangeable") gee_wflow <- workflow() %>% # The data are included as-is using: add_variables(outcomes = y, predictors = c(time, x, subject)) %>% add_model(gee_spec, formula = y ~ time + x + id_var(subject)) fit(gee_wflow, data = longitudinal_counts)
The gee::gee()
function always prints out warnings and output even
when silent = TRUE
. The parsnip "gee"
engine, by contrast, silences
all console output coming from gee::gee()
, even if silent = FALSE
.
Also, because of issues with the gee()
function, a supplementary call
to glm()
is needed to get the rank and QR decomposition objects so
that predict()
can be used.
References
Liang, K.Y. and Zeger, S.L. (1986) Longitudinal data analysis using generalized linear models. Biometrika, 73 13–22.
Zeger, S.L. and Liang, K.Y. (1986) Longitudinal data analysis for discrete and continuous outcomes. Biometrics, 42 121–130.
Poisson regression via glm
Description
stats::glm()
uses maximum likelihood to fit a model for count data.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This engine has no tuning parameters.
Translation from parsnip to the underlying model call (regression)
The poissonreg extension package is required to fit this model.
library(poissonreg) poisson_reg() %>% set_engine("glm") %>% translate()
## Poisson Regression Model Specification (regression) ## ## Computational engine: glm ## ## Model fit template: ## stats::glm(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), ## family = stats::poisson)
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples
on tidymodels.org
.
The fit()
and fit_xy()
functions have arguments called
case_weights
that expect vectors of case weights.
However, the documentation in stats::glm()
assumes that a specific type of case weights is being used: “Non-NULL weights
can be used to indicate that different observations have different
dispersions (with the values in weights being inversely proportional to
the dispersions); or equivalently, when the elements of weights are
positive integers w_i
, that each response y_i
is the mean of w_i
unit-weight observations. For a binomial GLM prior weights are used to
give the number of trials when the response is the proportion of
successes: they would rarely be used for a Poisson GLM.”
If frequency weights are being used in your application, the
glm_grouped()
model (and corresponding engine) may be
more appropriate.
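As a sketch of the case-weight mechanism, weights are created with a hardhat constructor such as hardhat::frequency_weights() and passed to fit() (the data frame counts_df and its column n_rep are hypothetical):

library(tidymodels)
library(poissonreg)
# a vector of frequency (replication) counts, one per row of counts_df
wts <- hardhat::frequency_weights(counts_df$n_rep)
poisson_reg() %>%
  set_engine("glm") %>%
  fit(y ~ x1 + x2, data = counts_df, case_weights = wts)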
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
Poisson regression via mixed models
Description
The "glmer"
engine estimates fixed and random effect regression parameters
using maximum likelihood (or restricted maximum likelihood) estimation.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This model has no tuning parameters.
Translation from parsnip to the original package
The multilevelmod extension package is required to fit this model.
library(multilevelmod) poisson_reg(engine = "glmer") %>% set_engine("glmer") %>% translate()
## Poisson Regression Model Specification (regression) ## ## Computational engine: glmer ## ## Model fit template: ## lme4::glmer(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), ## family = stats::poisson)
Predicting new samples
This model can use subject-specific coefficient estimates to make
predictions (i.e. partial pooling). For example, this equation shows the
linear predictor (\eta
) for a random intercept:
\eta_{i} = (\beta_0 + b_{0i}) + \beta_1x_{i1}
where i
denotes the i
th independent experimental unit
(e.g. subject). When the model has seen subject i
, it can use that
subject’s data to adjust the population intercept to be more specific
to that subject’s results.
What happens when data are being predicted for a subject that was not used in the model fit? In that case, this package uses only the population parameter estimates for prediction:
\hat{\eta}_{i'} = \hat{\beta}_0 + \hat{\beta}_1x_{i'1}
Depending on what covariates are in the model, this might have the effect of making the same prediction for all new samples. The population parameters are the “best estimate” for a subject that was not included in the model fit.
The tidymodels framework deliberately constrains predictions for new data to not use the training set or other data (to prevent information leakage).
Preprocessing requirements
There are no specific preprocessing needs. However, it is helpful to keep the clustering/subject identifier column as factor or character (instead of making them into dummy variables). See the examples in the next section.
Other details
The model can accept case weights.
With parsnip, we suggest using the formula method when fitting:
library(tidymodels) poisson_reg() %>% set_engine("glmer") %>% fit(y ~ time + x + (1 | subject), data = longitudinal_counts)
When using tidymodels infrastructure, it may be better to use a
workflow. In this case, you can add the appropriate columns using
add_variables()
then supply the typical formula when adding the model:
library(tidymodels) glmer_spec <- poisson_reg() %>% set_engine("glmer") glmer_wflow <- workflow() %>% # The data are included as-is using: add_variables(outcomes = y, predictors = c(time, x, subject)) %>% add_model(glmer_spec, formula = y ~ time + x + (1 | subject)) fit(glmer_wflow, data = longitudinal_counts)
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples
on tidymodels.org
.
The fit()
and fit_xy()
functions have arguments called
case_weights
that expect vectors of case weights.
References
J Pinheiro, and D Bates. 2000. Mixed-effects models in S and S-PLUS. Springer, New York, NY
West, B, K Welch, and A Galecki. 2014. Linear Mixed Models: A Practical Guide Using Statistical Software. CRC Press.
Thorson, J, Minto, C. 2015, Mixed effects: a unifying framework for statistical modelling in fisheries biology. ICES Journal of Marine Science, Volume 72, Issue 5, Pages 1245–1256.
Harrison, XA, Donaldson, L, Correa-Cano, ME, Evans, J, Fisher, DN, Goodwin, CED, Robinson, BS, Hodgson, DJ, Inger, R. 2018. A brief introduction to mixed effects modelling and multi-model inference in ecology. PeerJ 6:e4794.
DeBruine LM, Barr DJ. Understanding Mixed-Effects Models Through Data Simulation. 2021. Advances in Methods and Practices in Psychological Science.
Poisson regression via glmnet
Description
glmnet::glmnet()
uses penalized maximum likelihood to fit a model for
count data.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This model has 2 tuning parameters:
- penalty: Amount of Regularization (type: double, default: see below)
- mixture: Proportion of Lasso Penalty (type: double, default: 1.0)
The penalty
parameter has no default and requires a single numeric
value. For more details about this, and the glmnet
model in general,
see glmnet-details. As for mixture:
- mixture = 1 specifies a pure lasso model,
- mixture = 0 specifies a ridge regression model, and
- 0 < mixture < 1 specifies an elastic net model, interpolating lasso and ridge.
Translation from parsnip to the original package
The poissonreg extension package is required to fit this model.
library(poissonreg) poisson_reg(penalty = double(1), mixture = double(1)) %>% set_engine("glmnet") %>% translate()
## Poisson Regression Model Specification (regression) ## ## Main Arguments: ## penalty = 0 ## mixture = double(1) ## ## Computational engine: glmnet ## ## Model fit template: ## glmnet::glmnet(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## alpha = double(1), family = "poisson")
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to
center and scale each so that each predictor has mean zero and a
variance of one. By default, glmnet::glmnet()
uses the argument
standardize = TRUE
to center and scale the data.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples
on tidymodels.org
.
The fit()
and fit_xy()
functions have arguments called
case_weights
that expect vectors of case weights.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
Poisson regression via h2o
Description
h2o::h2o.glm()
uses penalized maximum likelihood to fit a model for
count data.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This model has 2 tuning parameters:
- mixture: Proportion of Lasso Penalty (type: double, default: see below)
- penalty: Amount of Regularization (type: double, default: see below)
By default, when not given a fixed penalty
,
h2o::h2o.glm()
uses a heuristic approach to select
the optimal value of penalty
based on training data. Setting the
engine parameter lambda_search
to TRUE
enables an efficient version
of the grid search; see more details at
https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/lambda_search.html.
The choice of mixture
depends on the engine parameter solver
, which
is automatically chosen given training data and the specification of
other model parameters. When solver
is set to 'L-BFGS'
, mixture
defaults to 0 (ridge regression) and 0.5 otherwise.
Translation from parsnip to the original package
agua::h2o_train_glm()
for poisson_reg()
is
a wrapper around h2o::h2o.glm()
with
family = 'poisson'
.
The agua extension package is required to fit this model.
library(poissonreg) poisson_reg(penalty = double(1), mixture = double(1)) %>% set_engine("h2o") %>% translate()
## Poisson Regression Model Specification (regression) ## ## Main Arguments: ## penalty = double(1) ## mixture = double(1) ## ## Computational engine: h2o ## ## Model fit template: ## agua::h2o_train_glm(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## validation_frame = missing_arg(), lambda = double(1), alpha = double(1), ## family = "poisson")
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
By default, h2o::h2o.glm()
uses the argument standardize = TRUE
to
center and scale all numerical columns.
Initializing h2o
To use the h2o engine with tidymodels, please run h2o::h2o.init()
first. By default, this connects R to the local h2o server. This needs to be done in every new R session. You can also connect to a remote h2o server with an IP address; for more details see
h2o::h2o.init()
.
You can control the number of threads in the thread pool used by h2o
with the nthreads
argument. By default, it uses all CPUs on the host.
This is different from the usual parallel processing mechanism in
tidymodels for tuning: while tidymodels parallelizes over resamples, h2o
parallelizes over hyperparameter combinations for a given resample.
h2o will automatically shut down the local h2o instance started by R
when R is terminated. To manually stop the h2o server, run
h2o::h2o.shutdown()
.
Saving fitted model objects
Models fitted with this engine may require native serialization methods to be properly saved and/or passed between R sessions. To learn more about preparing fitted models for serialization, see the bundle package.
Poisson regression via pscl
Description
pscl::hurdle()
uses maximum likelihood estimation to fit a model for
count data that has separate model terms for predicting the counts and for
predicting the probability of a zero count.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This engine has no tuning parameters.
Translation from parsnip to the underlying model call (regression)
The poissonreg extension package is required to fit this model.
library(poissonreg) poisson_reg() %>% set_engine("hurdle") %>% translate()
## Poisson Regression Model Specification (regression) ## ## Computational engine: hurdle ## ## Model fit template: ## pscl::hurdle(formula = missing_arg(), data = missing_arg(), weights = missing_arg())
Preprocessing and special formulas for zero-inflated Poisson models
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Specifying the statistical model details
For this particular model, a special formula is used to specify which
columns affect the counts and which affect the model for the probability
of zero counts. These sets of terms are separated by a bar. For example,
y ~ x | z
. This type of formula is not used by the base R
infrastructure (e.g. model.matrix()
)
When fitting a parsnip model with this engine directly, the formula method is required and the formula is just passed through. For example:
library(tidymodels) tidymodels_prefer() data("bioChemists", package = "pscl") poisson_reg() %>% set_engine("hurdle") %>% fit(art ~ fem + mar | ment, data = bioChemists)
## parsnip model object ## ## ## Call: ## pscl::hurdle(formula = art ~ fem + mar | ment, data = data) ## ## Count model coefficients (truncated poisson with log link): ## (Intercept) femWomen marMarried ## 0.847598 -0.237351 0.008846 ## ## Zero hurdle model coefficients (binomial with logit link): ## (Intercept) ment ## 0.24871 0.08092
However, when using a workflow, the best approach is to avoid using
workflows::add_formula()
and use
workflows::add_variables()
in
conjunction with a model formula:
data("bioChemists", package = "pscl") spec <- poisson_reg() %>% set_engine("hurdle") workflow() %>% add_variables(outcomes = c(art), predictors = c(fem, mar, ment)) %>% add_model(spec, formula = art ~ fem + mar | ment) %>% fit(data = bioChemists) %>% extract_fit_engine()
## ## Call: ## pscl::hurdle(formula = art ~ fem + mar | ment, data = data) ## ## Count model coefficients (truncated poisson with log link): ## (Intercept) femWomen marMarried ## 0.847598 -0.237351 0.008846 ## ## Zero hurdle model coefficients (binomial with logit link): ## (Intercept) ment ## 0.24871 0.08092
The reason for this is that
workflows::add_formula()
will try to
create the model matrix and either fail or create dummy variables
prematurely.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples
on tidymodels.org
.
The fit()
and fit_xy()
functions have arguments called
case_weights
that expect vectors of case weights.
Poisson regression via stan
Description
rstanarm::stan_glm()
uses Bayesian estimation to fit a model for
count data.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This engine has no tuning parameters.
Important engine-specific options
Some relevant arguments that can be passed to set_engine():
- chains: A positive integer specifying the number of Markov chains. The default is 4.
- iter: A positive integer specifying the number of iterations for each chain (including warmup). The default is 2000.
- seed: The seed for random number generation.
- cores: Number of cores to use when executing the chains in parallel.
- prior: The prior distribution for the (non-hierarchical) regression coefficients. The "stan" engine does not fit any hierarchical terms.
- prior_intercept: The prior distribution for the intercept (after centering all predictors).
See rstan::sampling()
and
rstanarm::priors()
for more information on these
and other options.
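A sketch of passing several of these options (the values shown are illustrative):

library(poissonreg)
poisson_reg() %>%
  # engine arguments are forwarded to rstanarm::stan_glm()
  set_engine("stan", chains = 4, iter = 2000, seed = 1234,
             prior = rstanarm::normal(0, 1)) %>%
  translate()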
Translation from parsnip to the original package
The poissonreg extension package is required to fit this model.
library(poissonreg) poisson_reg() %>% set_engine("stan") %>% translate()
## Poisson Regression Model Specification (regression) ## ## Computational engine: stan ## ## Model fit template: ## rstanarm::stan_glm(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg(), family = stats::poisson)
Note that the refresh
default prevents logging of the estimation
process. Change this value in set_engine()
to show the MCMC logs.
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Other details
For prediction, the "stan"
engine can compute posterior intervals
analogous to confidence and prediction intervals. In these instances,
the units are the original outcome. When std_error = TRUE
, the
standard deviation of the posterior distribution (or posterior
predictive distribution as appropriate) is returned.
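For example, a sketch of requesting those intervals from a fitted model (fit_stan and new_counts are hypothetical objects):

# posterior interval analogous to a confidence interval, with its standard error
predict(fit_stan, new_data = new_counts, type = "conf_int", std_error = TRUE)
# posterior predictive interval analogous to a prediction interval
predict(fit_stan, new_data = new_counts, type = "pred_int")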
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples
on tidymodels.org
.
The fit()
and fit_xy()
functions have arguments called
case_weights
that expect vectors of case weights.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for poisson_reg()
with the "stan"
engine.
References
McElreath, R. 2020. Statistical Rethinking. CRC Press.
Poisson regression via hierarchical Bayesian methods
Description
The "stan_glmer"
engine estimates hierarchical regression parameters using
Bayesian estimation.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This model has no tuning parameters.
Important engine-specific options
Some relevant arguments that can be passed to set_engine():
- chains: A positive integer specifying the number of Markov chains. The default is 4.
- iter: A positive integer specifying the number of iterations for each chain (including warmup). The default is 2000.
- seed: The seed for random number generation.
- cores: Number of cores to use when executing the chains in parallel.
- prior: The prior distribution for the (non-hierarchical) regression coefficients.
- prior_intercept: The prior distribution for the intercept (after centering all predictors).
See ?rstanarm::stan_glmer
and ?rstan::sampling
for more information.
Translation from parsnip to the original package
The multilevelmod extension package is required to fit this model.
library(multilevelmod) poisson_reg(engine = "stan_glmer") %>% set_engine("stan_glmer") %>% translate()
## Poisson Regression Model Specification (regression) ## ## Computational engine: stan_glmer ## ## Model fit template: ## rstanarm::stan_glmer(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg(), family = stats::poisson, refresh = 0)
Predicting new samples
This model can use subject-specific coefficient estimates to make
predictions (i.e. partial pooling). For example, this equation shows the
linear predictor (\eta
) for a random intercept:
\eta_{i} = (\beta_0 + b_{0i}) + \beta_1x_{i1}
where i
denotes the i
th independent experimental unit
(e.g. subject). When the model has seen subject i
, it can use that
subject’s data to adjust the population intercept to be more specific
to that subject’s results.
What happens when data are being predicted for a subject that was not used in the model fit? In that case, this package uses only the population parameter estimates for prediction:
\hat{\eta}_{i'} = \hat{\beta}_0 + \hat{\beta}_1x_{i'1}
Depending on what covariates are in the model, this might have the effect of making the same prediction for all new samples. The population parameters are the “best estimate” for a subject that was not included in the model fit.
The tidymodels framework deliberately constrains predictions for new data to not use the training set or other data (to prevent information leakage).
Preprocessing requirements
There are no specific preprocessing needs. However, it is helpful to keep the clustering/subject identifier column as factor or character (instead of making them into dummy variables). See the examples in the next section.
Other details
The model can accept case weights.
With parsnip, we suggest using the formula method when fitting:
library(tidymodels) poisson_reg() %>% set_engine("stan_glmer") %>% fit(y ~ time + x + (1 | subject), data = longitudinal_counts)
When using tidymodels infrastructure, it may be better to use a
workflow. In this case, you can add the appropriate columns using
add_variables()
then supply the typical formula when adding the model:
library(tidymodels) glmer_spec <- poisson_reg() %>% set_engine("stan_glmer") glmer_wflow <- workflow() %>% # The data are included as-is using: add_variables(outcomes = y, predictors = c(time, x, subject)) %>% add_model(glmer_spec, formula = y ~ time + x + (1 | subject)) fit(glmer_wflow, data = longitudinal_counts)
For prediction, the "stan_glmer"
engine can compute posterior
intervals analogous to confidence and prediction intervals. In these
instances, the units are the original outcome. When std_error = TRUE
,
the standard deviation of the posterior distribution (or posterior
predictive distribution as appropriate) is returned.
Case weights
This model can utilize case weights during model fitting. To use them,
see the documentation in case_weights and the examples
on tidymodels.org
.
The fit()
and fit_xy()
functions have arguments called
case_weights
that expect vectors of case weights.
References
McElreath, R. 2020. Statistical Rethinking. CRC Press.
Sorensen, T, Vasishth, S. 2016. Bayesian linear mixed models using Stan: A tutorial for psychologists, linguists, and cognitive scientists, arXiv:1506.06201.
Poisson regression via pscl
Description
pscl::zeroinfl()
uses maximum likelihood estimation to fit a model for
count data that has separate model terms for predicting the counts and for
predicting the probability of a zero count.
Details
For this engine, there is a single mode: regression
Tuning Parameters
This engine has no tuning parameters.
Translation from parsnip to the underlying model call (regression)
The poissonreg extension package is required to fit this model.
library(poissonreg)

poisson_reg() %>%
  set_engine("zeroinfl") %>%
  translate()
## Poisson Regression Model Specification (regression) ## ## Computational engine: zeroinfl ## ## Model fit template: ## pscl::zeroinfl(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg())
Preprocessing and special formulas for zero-inflated Poisson models
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Specifying the statistical model details
For this particular model, a special formula is used to specify which
columns affect the counts and which affect the model for the probability
of zero counts. These sets of terms are separated by a bar. For example,
y ~ x | z. This type of formula is not used by the base R infrastructure (e.g. model.matrix()).
When fitting a parsnip model with this engine directly, the formula method is required and the formula is just passed through. For example:
library(tidymodels)
tidymodels_prefer()

data("bioChemists", package = "pscl")

poisson_reg() %>%
  set_engine("zeroinfl") %>%
  fit(art ~ fem + mar | ment, data = bioChemists)
## parsnip model object ## ## ## Call: ## pscl::zeroinfl(formula = art ~ fem + mar | ment, data = data) ## ## Count model coefficients (poisson with log link): ## (Intercept) femWomen marMarried ## 0.82840 -0.21365 0.02576 ## ## Zero-inflation model coefficients (binomial with logit link): ## (Intercept) ment ## -0.363 -0.166
However, when using a workflow, the best approach is to avoid using
workflows::add_formula()
and use
workflows::add_variables()
in
conjunction with a model formula:
data("bioChemists", package = "pscl") spec <- poisson_reg() %>% set_engine("zeroinfl") workflow() %>% add_variables(outcomes = c(art), predictors = c(fem, mar, ment)) %>% add_model(spec, formula = art ~ fem + mar | ment) %>% fit(data = bioChemists) %>% extract_fit_engine()
## ## Call: ## pscl::zeroinfl(formula = art ~ fem + mar | ment, data = data) ## ## Count model coefficients (poisson with log link): ## (Intercept) femWomen marMarried ## 0.82840 -0.21365 0.02576 ## ## Zero-inflation model coefficients (binomial with logit link): ## (Intercept) ment ## -0.363 -0.166
The reason for this is that
workflows::add_formula()
will try to
create the model matrix and either fail or create dummy variables
prematurely.
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.

The fit() and fit_xy() functions have arguments called case_weights that expect vectors of case weights.
Proportional hazards regression
Description
glmnet::glmnet()
fits a regularized Cox proportional hazards model.
Details
For this engine, there is a single mode: censored regression
Tuning Parameters
This model has 2 tuning parameters:
- penalty: Amount of Regularization (type: double, default: see below)
- mixture: Proportion of Lasso Penalty (type: double, default: 1.0)
The penalty parameter has no default and requires a single numeric value. For more details about this, and the glmnet model in general, see glmnet-details. As for mixture:

- mixture = 1 specifies a pure lasso model,
- mixture = 0 specifies a ridge regression model, and
- 0 < mixture < 1 specifies an elastic net model, interpolating lasso and ridge.
Translation from parsnip to the original package
The censored extension package is required to fit this model.
library(censored)

proportional_hazards(penalty = double(1), mixture = double(1)) %>%
  set_engine("glmnet") %>%
  translate()
## Proportional Hazards Model Specification (censored regression) ## ## Main Arguments: ## penalty = 0 ## mixture = double(1) ## ## Computational engine: glmnet ## ## Model fit template: ## censored::coxnet_train(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg(), alpha = double(1))
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to
center and scale each so that each predictor has mean zero and a
variance of one. By default, glmnet::glmnet()
uses
the argument standardize = TRUE
to center and scale the data.
Other details
The model does not fit an intercept.
The model formula (which is required) can include special terms, such
as survival::strata()
. This allows the baseline
hazard to differ between groups contained in the function. (To learn
more about using special terms in formulas with tidymodels, see
?model_formula.) The column used inside strata() is treated as qualitative no matter its type. This is different from the syntax offered by glmnet (i.e., glmnet::stratifySurv()), which is not recommended here.
For example, in this model, the numeric column rx
is used to estimate
two different baseline hazards for each value of the column:
library(survival)
library(censored)
library(dplyr)
library(tidyr)

mod <- proportional_hazards(penalty = 0.01) %>%
  set_engine("glmnet", nlambda = 5) %>%
  fit(Surv(futime, fustat) ~ age + ecog.ps + strata(rx), data = ovarian)

pred_data <- data.frame(age = c(50, 50), ecog.ps = c(1, 1), rx = c(1, 2))

# Different survival probabilities for different values of 'rx'
predict(mod, pred_data, type = "survival", time = 500) %>%
  bind_cols(pred_data) %>%
  unnest(.pred)
## # A tibble: 2 x 5 ## .eval_time .pred_survival age ecog.ps rx ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 500 0.666 50 1 1 ## 2 500 0.769 50 1 2
Note that columns used in the strata()
function will also be
estimated in the regular portion of the model (i.e., within the linear
predictor).
Predictions of type "time"
are predictions of the mean survival time.
Linear predictor values
Since risk regression and parametric survival models estimate different characteristics (e.g. relative hazard versus event time), their linear predictors point in opposite directions.

For example, for parametric models, the linear predictor increases with time; for proportional hazards models, the linear predictor decreases with time (since hazard increases). As such, the linear predictors for these two quantities will have opposite signs.
tidymodels does not treat different models differently when computing
performance metrics. To standardize across model types, the default for
proportional hazards models is to have increasing values with time. As
a result, the sign of the linear predictor will be the opposite of the
value produced by the predict()
method in the engine package.
This behavior can be changed by using the increasing
argument when
calling predict()
on a model object.
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.

The fit() and fit_xy() functions have arguments called case_weights that expect vectors of case weights.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
References
Simon N, Friedman J, Hastie T, Tibshirani R. 2011. “Regularization Paths for Cox’s Proportional Hazards Model via Coordinate Descent.” Journal of Statistical Software, Articles 39 (5): 1–13.
Hastie T, Tibshirani R, Wainwright M. 2015. Statistical Learning with Sparsity. CRC Press.
Kuhn M, Johnson K. 2013. Applied Predictive Modeling. Springer.
Proportional hazards regression
Description
survival::coxph()
fits a Cox proportional hazards model.
Details
For this engine, there is a single mode: censored regression
Tuning Parameters
This model has no tuning parameters.
Translation from parsnip to the original package
The censored extension package is required to fit this model.
library(censored)

proportional_hazards() %>%
  set_engine("survival") %>%
  set_mode("censored regression") %>%
  translate()
## Proportional Hazards Model Specification (censored regression) ## ## Computational engine: survival ## ## Model fit template: ## survival::coxph(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg(), x = TRUE, model = TRUE)
Other details
The model does not fit an intercept.
The main interface for this model uses the formula method since the model specification typically involves the use of survival::Surv().

The model formula can include special terms, such as survival::strata(). This allows the baseline hazard to differ between groups contained in the function. The column used inside strata() is treated as qualitative no matter its type. To learn more about using special terms in formulas with tidymodels, see ?model_formula.
For example, in this model, the numeric column rx
is used to estimate
two different baseline hazards for each value of the column:
library(survival)

proportional_hazards() %>%
  fit(Surv(futime, fustat) ~ age + strata(rx), data = ovarian) %>%
  extract_fit_engine() %>%
  # Two different hazards for each value of 'rx'
  basehaz()
## hazard time strata ## 1 0.02250134 59 rx=1 ## 2 0.05088586 115 rx=1 ## 3 0.09467873 156 rx=1 ## 4 0.14809975 268 rx=1 ## 5 0.30670509 329 rx=1 ## 6 0.46962698 431 rx=1 ## 7 0.46962698 448 rx=1 ## 8 0.46962698 477 rx=1 ## 9 1.07680229 638 rx=1 ## 10 1.07680229 803 rx=1 ## 11 1.07680229 855 rx=1 ## 12 1.07680229 1040 rx=1 ## 13 1.07680229 1106 rx=1 ## 14 0.05843331 353 rx=2 ## 15 0.12750063 365 rx=2 ## 16 0.12750063 377 rx=2 ## 17 0.12750063 421 rx=2 ## 18 0.23449656 464 rx=2 ## 19 0.35593895 475 rx=2 ## 20 0.50804209 563 rx=2 ## 21 0.50804209 744 rx=2 ## 22 0.50804209 769 rx=2 ## 23 0.50804209 770 rx=2 ## 24 0.50804209 1129 rx=2 ## 25 0.50804209 1206 rx=2 ## 26 0.50804209 1227 rx=2
Note that columns used in the strata()
function will not be estimated
in the regular portion of the model (i.e., within the linear predictor).
Predictions of type "time"
are predictions of the mean survival time.
Linear predictor values
Since risk regression and parametric survival models estimate different characteristics (e.g. relative hazard versus event time), their linear predictors point in opposite directions.

For example, for parametric models, the linear predictor increases with time; for proportional hazards models, the linear predictor decreases with time (since hazard increases). As such, the linear predictors for these two quantities will have opposite signs.
tidymodels does not treat different models differently when computing
performance metrics. To standardize across model types, the default for
proportional hazards models is to have increasing values with time. As
a result, the sign of the linear predictor will be the opposite of the
value produced by the predict()
method in the engine package.
This behavior can be changed by using the increasing
argument when
calling predict()
on a model object.
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.

The fit() and fit_xy() functions have arguments called case_weights that expect vectors of case weights.
References
Andersen P, Gill R. 1982. Cox’s regression model for counting processes, a large sample study. Annals of Statistics 10, 1100-1120.
Oblique random survival forests via aorsf
Description
aorsf::orsf()
fits a model that creates a large number of oblique decision
trees, each de-correlated from the others. The final prediction uses all
predictions from the individual trees and combines them.
Details
For this engine, there are multiple modes: censored regression, classification, and regression
Tuning Parameters
This model has 3 tuning parameters:
- trees: # Trees (type: integer, default: 500L)
- min_n: Minimal Node Size (type: integer, default: 5L)
- mtry: # Randomly Selected Predictors (type: integer, default: ceiling(sqrt(n_predictors)))
Additionally, this model has one engine-specific tuning parameter:
- split_min_stat: Minimum test statistic required to split a node. Defaults are 3.841459 for censored regression (which is roughly a p-value of 0.05) and 0 for classification and regression. For classification, this tuning parameter should be between 0 and 1, and for regression it should be greater than or equal to 0. Higher values of this parameter cause trees grown by aorsf to have less depth.
Translation from parsnip to the original package (censored regression)
The censored extension package is required to fit this model.
library(censored)

rand_forest() %>%
  set_engine("aorsf") %>%
  set_mode("censored regression") %>%
  translate()
## Random Forest Model Specification (censored regression) ## ## Computational engine: aorsf ## ## Model fit template: ## aorsf::orsf(formula = missing_arg(), data = missing_arg(), weights = missing_arg())
Translation from parsnip to the original package (regression)
The bonsai extension package is required to fit this model.
library(bonsai)

rand_forest() %>%
  set_engine("aorsf") %>%
  set_mode("regression") %>%
  translate()
## Random Forest Model Specification (regression) ## ## Computational engine: aorsf ## ## Model fit template: ## aorsf::orsf(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), ## n_thread = 1, verbose_progress = FALSE)
Translation from parsnip to the original package (classification)
The bonsai extension package is required to fit this model.
library(bonsai)

rand_forest() %>%
  set_engine("aorsf") %>%
  set_mode("classification") %>%
  translate()
## Random Forest Model Specification (classification) ## ## Computational engine: aorsf ## ## Model fit template: ## aorsf::orsf(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), ## n_thread = 1, verbose_progress = FALSE)
Preprocessing requirements
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c}
vs {b, d}
) when splitting at a node. Dummy variables
are not required for this model.
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.

The fit() and fit_xy() functions have arguments called case_weights that expect vectors of case weights.
Other details
Predictions of survival probability at a time exceeding the maximum observed event time are the predicted survival probability at the maximum observed time in the training data.
The class predict method in aorsf uses the standard ‘each tree gets one vote’ approach, which is usually, but not always, consistent with picking the class that has the highest predicted probability. This inconsistency is acceptable in aorsf because it intentionally applies the traditional class prediction method for random forests, but in tidymodels it is preferable to embrace consistency. Thus, we opted to always make the predicted class consistent with the predicted probability by making the predicted class a function of predicted probability (see tidymodels/bonsai#78).
References
Jaeger BC, Long DL, Long DM, Sims M, Szychowski JM, Min YI, Mcclure LA, Howard G, Simon N. Oblique random survival forests. Annals of applied statistics 2019 Sep; 13(3):1847-83. DOI: 10.1214/19-AOAS1261
Jaeger BC, Welden S, Lenoir K, Pajewski NM. aorsf: An R package for supervised learning using the oblique random survival forest. Journal of Open Source Software 2022, 7(77), 4705.
Jaeger BC, Welden S, Lenoir K, Speiser JL, Segar MW, Pandey A, Pajewski NM. Accelerated and interpretable oblique random survival forests. arXiv e-prints 2022 Aug; arXiv-2208. URL: https://arxiv.org/abs/2208.01129
Random forests via h2o
Description
h2o::h2o.randomForest()
fits a model that creates a large number of
decision trees, each independent of the others. The final prediction uses all
predictions from the individual trees and combines them.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 3 tuning parameters:
- trees: # Trees (type: integer, default: 50L)
- min_n: Minimal Node Size (type: integer, default: 1)
- mtry: # Randomly Selected Predictors (type: integer, default: see below)
mtry
depends on the number of columns and the model mode. The default
in h2o::h2o.randomForest()
is
floor(sqrt(ncol(x)))
for classification and floor(ncol(x)/3)
for
regression.
Translation from parsnip to the original package (regression)
agua::h2o_train_rf()
is a wrapper around
h2o::h2o.randomForest()
.
rand_forest( mtry = integer(1), trees = integer(1), min_n = integer(1) ) %>% set_engine("h2o") %>% set_mode("regression") %>% translate()
## Random Forest Model Specification (regression) ## ## Main Arguments: ## mtry = integer(1) ## trees = integer(1) ## min_n = integer(1) ## ## Computational engine: h2o ## ## Model fit template: ## agua::h2o_train_rf(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## validation_frame = missing_arg(), mtries = integer(1), ntrees = integer(1), ## min_rows = integer(1))
min_rows() and min_cols() will adjust the values of min_n and mtry, respectively, if the chosen values are not consistent with the actual data dimensions.
Translation from parsnip to the original package (classification)
rand_forest( mtry = integer(1), trees = integer(1), min_n = integer(1) ) %>% set_engine("h2o") %>% set_mode("classification") %>% translate()
## Random Forest Model Specification (classification) ## ## Main Arguments: ## mtry = integer(1) ## trees = integer(1) ## min_n = integer(1) ## ## Computational engine: h2o ## ## Model fit template: ## agua::h2o_train_rf(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## validation_frame = missing_arg(), mtries = integer(1), ntrees = integer(1), ## min_rows = integer(1))
Preprocessing requirements
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c}
vs {b, d}
) when splitting at a node. Dummy variables
are not required for this model.
Initializing h2o
To use the h2o engine with tidymodels, please run h2o::h2o.init() first. By default, this connects R to the local h2o server. This needs to be done in every new R session. You can also connect to a remote h2o server with an IP address; for more details, see h2o::h2o.init().

You can control the number of threads in the thread pool used by h2o with the nthreads argument. By default, it uses all CPUs on the host. This is different from the usual parallel processing mechanism in tidymodels for tuning: while tidymodels parallelizes over resamples, h2o parallelizes over hyperparameter combinations for a given resample.
h2o will automatically shut down the local h2o instance started by R when R is terminated. To manually stop the h2o server, run h2o::h2o.shutdown().
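A minimal sketch of a local h2o session (the nthreads value here is only an illustration; the default, -1, uses all CPUs):

h2o::h2o.init(nthreads = 2)

# ... fit and predict with "h2o" engine models ...

h2o::h2o.shutdown(prompt = FALSE)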
Saving fitted model objects
Models fitted with this engine may require native serialization methods to be properly saved and/or passed between R sessions. To learn more about preparing fitted models for serialization, see the bundle package.
Random forests via partykit
Description
partykit::cforest()
fits a model that creates a large number of decision
trees, each independent of the others. The final prediction uses all
predictions from the individual trees and combines them.
Details
For this engine, there are multiple modes: censored regression, regression, and classification
Tuning Parameters
This model has 3 tuning parameters:
- trees: # Trees (type: integer, default: 500L)
- min_n: Minimal Node Size (type: integer, default: 20L)
- mtry: # Randomly Selected Predictors (type: integer, default: 5L)
Translation from parsnip to the original package (regression)
The bonsai extension package is required to fit this model.
library(bonsai)

rand_forest() %>%
  set_engine("partykit") %>%
  set_mode("regression") %>%
  translate()
## Random Forest Model Specification (regression) ## ## Computational engine: partykit ## ## Model fit template: ## parsnip::cforest_train(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg())
Translation from parsnip to the original package (classification)
The bonsai extension package is required to fit this model.
library(bonsai)

rand_forest() %>%
  set_engine("partykit") %>%
  set_mode("classification") %>%
  translate()
## Random Forest Model Specification (classification) ## ## Computational engine: partykit ## ## Model fit template: ## parsnip::cforest_train(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg())
parsnip::cforest_train()
is a wrapper around
partykit::cforest()
(and other functions) that
makes it easier to run this model.
Translation from parsnip to the original package (censored regression)
The censored extension package is required to fit this model.
library(censored)

rand_forest() %>%
  set_engine("partykit") %>%
  set_mode("censored regression") %>%
  translate()
## Random Forest Model Specification (censored regression) ## ## Computational engine: partykit ## ## Model fit template: ## parsnip::cforest_train(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg())
censored::cond_inference_surv_cforest()
is a wrapper around
partykit::cforest()
(and other functions) that
makes it easier to run this model.
Preprocessing requirements
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c}
vs {b, d}
) when splitting at a node. Dummy variables
are not required for this model.
Other details
Predictions of type "time"
are predictions of the median survival
time.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Random forests via randomForest
Description
randomForest::randomForest()
fits a model that creates a large number of
decision trees, each independent of the others. The final prediction uses all
predictions from the individual trees and combines them.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 3 tuning parameters:
- mtry: # Randomly Selected Predictors (type: integer, default: see below)
- trees: # Trees (type: integer, default: 500L)
- min_n: Minimal Node Size (type: integer, default: see below)
mtry
depends on the number of columns and the model mode. The default
in randomForest::randomForest()
is
floor(sqrt(ncol(x)))
for classification and floor(ncol(x)/3)
for
regression.
min_n
depends on the mode. For regression, a value of 5 is the
default. For classification, a value of 10 is used.
Translation from parsnip to the original package (regression)
rand_forest( mtry = integer(1), trees = integer(1), min_n = integer(1) ) %>% set_engine("randomForest") %>% set_mode("regression") %>% translate()
## Random Forest Model Specification (regression) ## ## Main Arguments: ## mtry = integer(1) ## trees = integer(1) ## min_n = integer(1) ## ## Computational engine: randomForest ## ## Model fit template: ## randomForest::randomForest(x = missing_arg(), y = missing_arg(), ## mtry = min_cols(~integer(1), x), ntree = integer(1), nodesize = min_rows(~integer(1), ## x))
min_rows() and min_cols() will adjust the values of min_n and mtry, respectively, if the chosen values are not consistent with the actual data dimensions.
Translation from parsnip to the original package (classification)
rand_forest( mtry = integer(1), trees = integer(1), min_n = integer(1) ) %>% set_engine("randomForest") %>% set_mode("classification") %>% translate()
## Random Forest Model Specification (classification) ## ## Main Arguments: ## mtry = integer(1) ## trees = integer(1) ## min_n = integer(1) ## ## Computational engine: randomForest ## ## Model fit template: ## randomForest::randomForest(x = missing_arg(), y = missing_arg(), ## mtry = min_cols(~integer(1), x), ntree = integer(1), nodesize = min_rows(~integer(1), ## x))
Preprocessing requirements
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c}
vs {b, d}
) when splitting at a node. Dummy variables
are not required for this model.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for rand_forest()
with the "randomForest"
engine.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Random forests via ranger
Description
ranger::ranger()
fits a model that creates a large number of decision
trees, each independent of the others. The final prediction uses all
predictions from the individual trees and combines them.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 3 tuning parameters:
- mtry: # Randomly Selected Predictors (type: integer, default: see below)
- trees: # Trees (type: integer, default: 500L)
- min_n: Minimal Node Size (type: integer, default: see below)
mtry
depends on the number of columns. The default in
ranger::ranger()
is floor(sqrt(ncol(x)))
.
min_n
depends on the mode. For regression, a value of 5 is the
default. For classification, a value of 10 is used.
Translation from parsnip to the original package (regression)
rand_forest( mtry = integer(1), trees = integer(1), min_n = integer(1) ) %>% set_engine("ranger") %>% set_mode("regression") %>% translate()
## Random Forest Model Specification (regression) ## ## Main Arguments: ## mtry = integer(1) ## trees = integer(1) ## min_n = integer(1) ## ## Computational engine: ranger ## ## Model fit template: ## ranger::ranger(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## mtry = min_cols(~integer(1), x), num.trees = integer(1), ## min.node.size = min_rows(~integer(1), x), num.threads = 1, ## verbose = FALSE, seed = sample.int(10^5, 1))
min_rows() and min_cols() will adjust the values of min_n and mtry, respectively, if the chosen values are not consistent with the actual data dimensions.
Translation from parsnip to the original package (classification)
rand_forest( mtry = integer(1), trees = integer(1), min_n = integer(1) ) %>% set_engine("ranger") %>% set_mode("classification") %>% translate()
## Random Forest Model Specification (classification) ## ## Main Arguments: ## mtry = integer(1) ## trees = integer(1) ## min_n = integer(1) ## ## Computational engine: ranger ## ## Model fit template: ## ranger::ranger(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## mtry = min_cols(~integer(1), x), num.trees = integer(1), ## min.node.size = min_rows(~integer(1), x), num.threads = 1, ## verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
Note that a ranger
probability forest is always fit (unless the
probability
argument is changed by the user via
set_engine()
).
Preprocessing requirements
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c}
vs {b, d}
) when splitting at a node. Dummy variables
are not required for this model.
Other notes
By default, parallel processing is turned off. When tuning, it is more efficient to parallelize over the resamples and tuning parameters. To parallelize the construction of the trees within the ranger model, change the num.threads argument via set_engine().
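For example, a sketch that lets ranger build trees on four threads:

rand_forest(trees = 1000) %>%
  set_engine("ranger", num.threads = 4) %>%
  set_mode("regression")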
For ranger
confidence intervals, the intervals are constructed using
the form estimate +/- z * std_error
. For classification probabilities,
these values can fall outside of [0, 1]
and will be coerced to be in
this range.
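A sketch of interval predictions; setting keep.inbag = TRUE is an assumption here, since ranger needs the in-bag counts to compute standard errors:

set.seed(1)
rf_fit <- rand_forest(trees = 500) %>%
  set_engine("ranger", keep.inbag = TRUE) %>%
  set_mode("regression") %>%
  fit(mpg ~ ., data = mtcars)

predict(rf_fit, new_data = mtcars[1:3, ], type = "conf_int", level = 0.95)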
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.

The fit() and fit_xy() functions have arguments called case_weights that expect vectors of case weights.
Sparse Data
This model can utilize sparse data during model fitting and prediction.
Both sparse matrices such as dgCMatrix from the Matrix
package and
sparse tibbles from the sparsevctrs
package are supported. See
sparse_data for more information.
While this engine supports sparse data as an input, it doesn't use it any differently than dense data. Hence there is no reason to convert back and forth.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for rand_forest()
with the "ranger"
engine.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Random forests via spark
Description
sparklyr::ml_random_forest()
fits a model that creates a large number of
decision trees, each independent of the others. The final prediction uses all
predictions from the individual trees and combines them.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 3 tuning parameters:
- mtry: # Randomly Selected Predictors (type: integer, default: see below)
- trees: # Trees (type: integer, default: 20L)
- min_n: Minimal Node Size (type: integer, default: 1L)
mtry
depends on the number of columns and the model mode. The default
in sparklyr::ml_random_forest()
is
floor(sqrt(ncol(x)))
for classification and floor(ncol(x)/3)
for
regression.
Translation from parsnip to the original package (regression)
rand_forest( mtry = integer(1), trees = integer(1), min_n = integer(1) ) %>% set_engine("spark") %>% set_mode("regression") %>% translate()
## Random Forest Model Specification (regression) ## ## Main Arguments: ## mtry = integer(1) ## trees = integer(1) ## min_n = integer(1) ## ## Computational engine: spark ## ## Model fit template: ## sparklyr::ml_random_forest(x = missing_arg(), formula = missing_arg(), ## type = "regression", feature_subset_strategy = integer(1), ## num_trees = integer(1), min_instances_per_node = min_rows(~integer(1), ## x), seed = sample.int(10^5, 1))
min_rows() and min_cols() will adjust the values of min_n and mtry, respectively, if the chosen values are not consistent with the actual data dimensions.
Translation from parsnip to the original package (classification)
rand_forest( mtry = integer(1), trees = integer(1), min_n = integer(1) ) %>% set_engine("spark") %>% set_mode("classification") %>% translate()
## Random Forest Model Specification (classification) ## ## Main Arguments: ## mtry = integer(1) ## trees = integer(1) ## min_n = integer(1) ## ## Computational engine: spark ## ## Model fit template: ## sparklyr::ml_random_forest(x = missing_arg(), formula = missing_arg(), ## type = "classification", feature_subset_strategy = integer(1), ## num_trees = integer(1), min_instances_per_node = min_rows(~integer(1), ## x), seed = sample.int(10^5, 1))
Preprocessing requirements
This engine does not require any special encoding of the predictors.
Categorical predictors can be partitioned into groups of factor levels
(e.g. {a, c}
vs {b, d}
) when splitting at a node. Dummy variables
are not required for this model.
Other details
For models created using the "spark"
engine, there are several things
to consider.
- Only the formula interface via fit() is available; using fit_xy() will generate an error.
- The predictions will always be in a Spark table format. The names will be the same as documented but without the dots.
- There is no equivalent to factor columns in Spark tables so class predictions are returned as character columns.
- To retain the model object for a new R session (via save()), the model$fit element of the parsnip object should be serialized via ml_save(object$fit) and separately saved to disk. In a new session, the object can be reloaded and reattached to the parsnip object, as sketched below.
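A minimal sketch of the save/reload round trip (spark_fit is a hypothetical fitted parsnip model and sc an open Spark connection):

# Save the Spark-side model separately from the R object:
sparklyr::ml_save(spark_fit$fit, path = "rf_model_dir")

# In a new session, reload and reattach it to the parsnip object:
spark_fit$fit <- sparklyr::ml_load(sc, path = "rf_model_dir")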
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.

The fit() and fit_xy() functions have arguments called case_weights that expect vectors of case weights.
Note that, for spark engines, the case_weight
argument value should be
a character string to specify the column with the numeric case weights.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
RuleFit models via h2o
Description
h2o::h2o.rulefit()
fits a model that derives simple feature rules from a tree
ensemble and uses the rules as features to a regularized (LASSO) model. agua::h2o_train_rule()
is a wrapper around this function.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 3 tuning parameters:
- trees: # Trees (type: integer, default: 50L)
- tree_depth: Tree Depth (type: integer, default: 3L)
- penalty: Amount of Regularization (type: double, default: 0)

Note that penalty for the h2o engine in rule_fit() corresponds to the L1 penalty (LASSO).
Other engine arguments of interest:
- algorithm: The algorithm to use to generate rules; should be one of “AUTO”, “DRF”, or “GBM”. Defaults to “AUTO”.
- min_rule_length: Minimum length of tree depth, the opposite of tree_depth; defaults to 3.
- max_num_rules: The maximum number of rules to return. The default value of -1 means the number of rules is selected by diminishing returns in model deviance.
- model_type: The type of base learners in the ensemble; should be one of “rules_and_linear”, “rules”, or “linear”. Defaults to “rules_and_linear”.
Translation from parsnip to the underlying model call (regression)
agua::h2o_train_rule()
is a wrapper around
h2o::h2o.rulefit()
.
The agua extension package is required to fit this model.
library(agua)

rule_fit(
  trees = integer(1),
  tree_depth = integer(1),
  penalty = numeric(1)
) %>%
  set_engine("h2o") %>%
  set_mode("regression") %>%
  translate()
## RuleFit Model Specification (regression) ## ## Main Arguments: ## trees = integer(1) ## tree_depth = integer(1) ## penalty = numeric(1) ## ## Computational engine: h2o ## ## Model fit template: ## agua::h2o_train_rule(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## validation_frame = missing_arg(), rule_generation_ntrees = integer(1), ## max_rule_length = integer(1), lambda = numeric(1))
Translation from parsnip to the underlying model call (classification)
agua::h2o_train_rule()
for rule_fit()
is a
wrapper around h2o::h2o.rulefit()
.
The agua extension package is required to fit this model.
rule_fit( trees = integer(1), tree_depth = integer(1), penalty = numeric(1) ) %>% set_engine("h2o") %>% set_mode("classification") %>% translate()
## RuleFit Model Specification (classification) ## ## Main Arguments: ## trees = integer(1) ## tree_depth = integer(1) ## penalty = numeric(1) ## ## Computational engine: h2o ## ## Model fit template: ## agua::h2o_train_rule(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## validation_frame = missing_arg(), rule_generation_ntrees = integer(1), ## max_rule_length = integer(1), lambda = numeric(1))
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Other details
To use the h2o engine with tidymodels, please run h2o::h2o.init() first. By default, this connects R to the local h2o server. This needs to be done in every new R session. You can also connect to a remote h2o server with an IP address; for more details, see h2o::h2o.init().

You can control the number of threads in the thread pool used by h2o with the nthreads argument. By default, it uses all CPUs on the host. This is different from the usual parallel processing mechanism in tidymodels for tuning: while tidymodels parallelizes over resamples, h2o parallelizes over hyperparameter combinations for a given resample.
h2o will automatically shut down the local h2o instance started by R when R is terminated. To manually stop the h2o server, run h2o::h2o.shutdown().
Saving fitted model objects
Models fitted with this engine may require native serialization methods to be properly saved and/or passed between R sessions. To learn more about preparing fitted models for serialization, see the bundle package.
RuleFit models via xrf
Description
xrf::xrf()
fits a model that derives simple feature rules from a tree
ensemble and uses the rules as features to a regularized model. rules::xrf_fit()
is a wrapper around this function.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 8 tuning parameters:
- mtry: Proportion Randomly Selected Predictors (type: double, default: see below)
- trees: # Trees (type: integer, default: 15L)
- min_n: Minimal Node Size (type: integer, default: 1L)
- tree_depth: Tree Depth (type: integer, default: 6L)
- learn_rate: Learning Rate (type: double, default: 0.3)
- loss_reduction: Minimum Loss Reduction (type: double, default: 0.0)
- sample_size: Proportion Observations Sampled (type: double, default: 1.0)
- penalty: Amount of Regularization (type: double, default: 0.1)
Translation from parsnip to the underlying model call (regression)
The rules extension package is required to fit this model.
library(rules)

rule_fit(
  mtry = numeric(1),
  trees = integer(1),
  min_n = integer(1),
  tree_depth = integer(1),
  learn_rate = numeric(1),
  loss_reduction = numeric(1),
  sample_size = numeric(1),
  penalty = numeric(1)
) %>%
  set_engine("xrf") %>%
  set_mode("regression") %>%
  translate()
## RuleFit Model Specification (regression) ## ## Main Arguments: ## mtry = numeric(1) ## trees = integer(1) ## min_n = integer(1) ## tree_depth = integer(1) ## learn_rate = numeric(1) ## loss_reduction = numeric(1) ## sample_size = numeric(1) ## penalty = numeric(1) ## ## Computational engine: xrf ## ## Model fit template: ## rules::xrf_fit(formula = missing_arg(), data = missing_arg(), ## xgb_control = missing_arg(), colsample_bynode = numeric(1), ## nrounds = integer(1), min_child_weight = integer(1), max_depth = integer(1), ## eta = numeric(1), gamma = numeric(1), subsample = numeric(1), ## lambda = numeric(1))
Translation from parsnip to the underlying model call (classification)
The rules extension package is required to fit this model.
library(rules)

rule_fit(
  mtry = numeric(1),
  trees = integer(1),
  min_n = integer(1),
  tree_depth = integer(1),
  learn_rate = numeric(1),
  loss_reduction = numeric(1),
  sample_size = numeric(1),
  penalty = numeric(1)
) %>%
  set_engine("xrf") %>%
  set_mode("classification") %>%
  translate()
## RuleFit Model Specification (classification) ## ## Main Arguments: ## mtry = numeric(1) ## trees = integer(1) ## min_n = integer(1) ## tree_depth = integer(1) ## learn_rate = numeric(1) ## loss_reduction = numeric(1) ## sample_size = numeric(1) ## penalty = numeric(1) ## ## Computational engine: xrf ## ## Model fit template: ## rules::xrf_fit(formula = missing_arg(), data = missing_arg(), ## xgb_control = missing_arg(), colsample_bynode = numeric(1), ## nrounds = integer(1), min_child_weight = integer(1), max_depth = integer(1), ## eta = numeric(1), gamma = numeric(1), subsample = numeric(1), ## lambda = numeric(1))
Differences from the xrf package
Note that, per the documentation in ?xrf
, transformations of the
response variable are not supported. To use these with rule_fit()
, we
recommend using a recipe instead of the formula method.
Also, there are several configuration differences in how xrf()
is fit
between that package and the wrapper used in rules. Some differences
in default values are:
| parameter | xrf | rules |
|-----------|-----|-------|
| trees | 100 | 15 |
| max_depth | 3 | 6 |
These differences will create a disparity in the values of the penalty argument that glmnet uses. Also, rules can set penalty directly, whereas xrf, by default, uses an internal 5-fold cross-validation to determine it.
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Other details
Interpreting mtry
The mtry
argument denotes the number of predictors that will be
randomly sampled at each split when creating tree models.
Some engines, such as "xgboost", "xrf", and "lightgbm", interpret their analogue to the mtry argument as the proportion of predictors that will be randomly sampled at each split rather than the count. In some settings, such as when tuning over preprocessors that influence the number of predictors, this parameterization is quite helpful: interpreting mtry as a proportion means that [0, 1] is always a valid range for that parameter, regardless of the input data.

parsnip and its extensions accommodate this parameterization using the counts argument: a logical indicating whether mtry should be interpreted as the number of predictors that will be randomly sampled at each split. TRUE indicates that mtry will be interpreted as a count; FALSE indicates that it will be interpreted as a proportion.
mtry
is a main model argument for
boost_tree()
and
rand_forest()
, and thus should not have an
engine-specific interface. So, regardless of engine, counts
defaults
to TRUE
. For engines that support the proportion interpretation
(currently "xgboost"
and "xrf"
, via the rules package, and
"lightgbm"
via the bonsai package) the user can pass the
counts = FALSE
argument to set_engine()
to supply mtry
values
within [0, 1]
.
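For example, a sketch that supplies mtry as a proportion for the "xrf" engine:

library(rules)

rule_fit(mtry = 0.7, trees = 20) %>%
  set_engine("xrf", counts = FALSE) %>%
  set_mode("regression")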
Early stopping
The stop_iter argument allows the model to prematurely stop training if the objective function does not improve within early_stop iterations.
The best way to use this feature is in conjunction with an internal
validation set. To do this, pass the validation
parameter of
xgb_train()
via the parsnip
set_engine()
function. This is the
proportion of the training set that should be reserved for measuring
performance (and stopping early).
If the model specification has early_stop >= trees
, early_stop
is
converted to trees - 1
and a warning is issued.
Case weights
The underlying model implementation does not allow for case weights.
References
Friedman and Popescu. “Predictive learning via rule ensembles.” Ann. Appl. Stat. 2 (3): 916–954, September 2008.
Parametric survival regression
Description
flexsurv::flexsurvreg()
fits a parametric survival model.
Details
For this engine, there is a single mode: censored regression
Tuning Parameters
This model has 1 tuning parameter:

- dist: Distribution (type: character, default: ‘weibull’)
Translation from parsnip to the original package
The censored extension package is required to fit this model.
library(censored)

survival_reg(dist = character(1)) %>%
  set_engine("flexsurv") %>%
  set_mode("censored regression") %>%
  translate()
## Parametric Survival Regression Model Specification (censored regression) ## ## Main Arguments: ## dist = character(1) ## ## Computational engine: flexsurv ## ## Model fit template: ## flexsurv::flexsurvreg(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg(), dist = character(1))
Other details
The main interface for this model uses the formula method since the model specification typically involves the use of survival::Surv().

For this engine, stratification cannot be specified via survival::strata(); please see flexsurv::flexsurvreg() for alternative specifications.
Predictions of type "time"
are predictions of the mean survival time.
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.

The fit() and fit_xy() functions have arguments called case_weights that expect vectors of case weights.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
References
Jackson, C. 2016. flexsurv: A Platform for Parametric Survival Modeling in R. Journal of Statistical Software, 70(8), 1–33.
Flexible parametric survival regression
Description
flexsurv::flexsurvspline()
fits a flexible parametric survival model.
Details
For this engine, there is a single mode: censored regression
Tuning Parameters
This model has one engine-specific tuning parameter:
- k: Number of knots in the spline. The default is k = 0.
Translation from parsnip to the original package
The censored extension package is required to fit this model.
library(censored)

survival_reg() %>%
  set_engine("flexsurvspline") %>%
  set_mode("censored regression") %>%
  translate()
## Parametric Survival Regression Model Specification (censored regression) ## ## Computational engine: flexsurvspline ## ## Model fit template: ## flexsurv::flexsurvspline(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg())
Other details
The main interface for this model uses the formula method since the model specification typically involves the use of survival::Surv().

For this engine, stratification cannot be specified via survival::strata(); please see flexsurv::flexsurvspline() for alternative specifications.
Predictions of type "time"
are predictions of the mean survival time.
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.

The fit() and fit_xy() functions have arguments called case_weights that expect vectors of case weights.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
References
Jackson, C. 2016. flexsurv: A Platform for Parametric Survival Modeling in R. Journal of Statistical Software, 70(8), 1–33.
Parametric survival regression
Description
survival::survreg()
fits a parametric survival model.
Details
For this engine, there is a single mode: censored regression
Tuning Parameters
This model has 1 tuning parameter:

- dist: Distribution (type: character, default: ‘weibull’)
Translation from parsnip to the original package
The censored extension package is required to fit this model.
library(censored)

survival_reg(dist = character(1)) %>%
  set_engine("survival") %>%
  set_mode("censored regression") %>%
  translate()
## Parametric Survival Regression Model Specification (censored regression) ## ## Main Arguments: ## dist = character(1) ## ## Computational engine: survival ## ## Model fit template: ## survival::survreg(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg(), dist = character(1), model = TRUE)
Other details
In the translated syntax above, note that model = TRUE
is needed to
produce quantile predictions when there is a stratification variable and
can be overridden in other cases.
The main interface for this model uses the formula method since the model specification typically involves the use of survival::Surv().

The model formula can include special terms, such as survival::strata(). This allows the model scale parameter to differ between groups contained in the function. The column used inside strata() is treated as qualitative no matter its type. To learn more about using special terms in formulas with tidymodels, see ?model_formula.
For example, in this model, the numeric column rx
is used to estimate
two different scale parameters for each value of the column:
library(survival)

survival_reg() %>%
  fit(Surv(futime, fustat) ~ age + strata(rx), data = ovarian) %>%
  extract_fit_engine()
## Call: ## survival::survreg(formula = Surv(futime, fustat) ~ age + strata(rx), ## data = data, model = TRUE) ## ## Coefficients: ## (Intercept) age ## 12.8734120 -0.1033569 ## ## Scale: ## rx=1 rx=2 ## 0.7695509 0.4703602 ## ## Loglik(model)= -89.4 Loglik(intercept only)= -97.1 ## Chisq= 15.36 on 1 degrees of freedom, p= 8.88e-05 ## n= 26
Predictions of type "time"
are predictions of the mean survival time.
Case weights
This model can utilize case weights during model fitting. To use them, see the documentation in case_weights and the examples on tidymodels.org.

The fit() and fit_xy() functions have arguments called case_weights that expect vectors of case weights.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
References
Kalbfleisch, J. D. and Prentice, R. L. 2002 The statistical analysis of failure time data, Wiley.
Linear support vector machines (SVMs) via kernlab
Description
kernlab::ksvm()
fits a support vector machine model. For classification,
the model tries to maximize the width of the margin between classes.
For regression, the model optimizes a robust loss function that is only
affected by very large model residuals.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 2 tuning parameters:
- cost: Cost (type: double, default: 1.0)
- margin: Insensitivity Margin (type: double, default: 0.1)
Parsnip changes the default range for cost to c(-10, 5).
Translation from parsnip to the original package (regression)
svm_linear( cost = double(1), margin = double(1) ) %>% set_engine("kernlab") %>% set_mode("regression") %>% translate()
## Linear Support Vector Machine Model Specification (regression) ## ## Main Arguments: ## cost = double(1) ## margin = double(1) ## ## Computational engine: kernlab ## ## Model fit template: ## kernlab::ksvm(x = missing_arg(), data = missing_arg(), C = double(1), ## epsilon = double(1), kernel = "vanilladot")
Translation from parsnip to the original package (classification)
svm_linear( cost = double(1) ) %>% set_engine("kernlab") %>% set_mode("classification") %>% translate()
## Linear Support Vector Machine Model Specification (classification) ## ## Main Arguments: ## cost = double(1) ## ## Computational engine: kernlab ## ## Model fit template: ## kernlab::ksvm(x = missing_arg(), data = missing_arg(), C = double(1), ## kernel = "vanilladot", prob.model = TRUE)
The margin
parameter does not apply to classification models.
Note that the "kernlab"
engine does not naturally estimate class
probabilities. To produce them, the decision values of the model are
converted to probabilities using Platt scaling. This method fits an
additional model on top of the SVM model. When fitting the Platt scaling
model, random numbers are used that are not reproducible or controlled
by R’s random number stream.
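A sketch of a classification fit; the class probabilities come from the Platt-scaling step (prob.model = TRUE in the template above):

svm_fit <- svm_linear(cost = 1) %>%
  set_engine("kernlab") %>%
  set_mode("classification") %>%
  fit(Species ~ ., data = iris)

# These probabilities may differ slightly between fits because the
# Platt-scaling step is not controlled by R's random number stream:
predict(svm_fit, new_data = iris[1:3, ], type = "prob")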
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
Case weights
The underlying model implementation does not allow for case weights.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for svm_linear()
with the "kernlab"
engine.
References
Lin, HT, and R Weng. “A Note on Platt’s Probabilistic Outputs for Support Vector Machines.”
Karatzoglou, A, Smola, A, Hornik, K, and A Zeileis. 2004. “kernlab - An S4 Package for Kernel Methods in R.”, Journal of Statistical Software.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Linear support vector machines (SVMs) via LiblineaR
Description
LiblineaR::LiblineaR()
fits a support vector machine model. For classification,
the model tries to maximize the width of the margin between classes.
For regression, the model optimizes a robust loss function that is only
affected by very large model residuals.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 2 tuning parameters:
- cost: Cost (type: double, default: 1.0)
- margin: Insensitivity Margin (type: double, default: no default)
This engine fits models that are L2-regularized for L2-loss. In the
LiblineaR::LiblineaR()
documentation, these
are types 1 (classification) and 11 (regression).
Parsnip changes the default range for cost to c(-10, 5).
Translation from parsnip to the original package (regression)
svm_linear( cost = double(1), margin = double(1) ) %>% set_engine("LiblineaR") %>% set_mode("regression") %>% translate()
## Linear Support Vector Machine Model Specification (regression) ## ## Main Arguments: ## cost = double(1) ## margin = double(1) ## ## Computational engine: LiblineaR ## ## Model fit template: ## LiblineaR::LiblineaR(x = missing_arg(), y = missing_arg(), C = double(1), ## svr_eps = double(1), type = 11)
Translation from parsnip to the original package (classification)
svm_linear( cost = double(1) ) %>% set_engine("LiblineaR") %>% set_mode("classification") %>% translate()
## Linear Support Vector Machine Model Specification (classification) ## ## Main Arguments: ## cost = double(1) ## ## Computational engine: LiblineaR ## ## Model fit template: ## LiblineaR::LiblineaR(x = missing_arg(), y = missing_arg(), C = double(1), ## type = 1)
The margin
parameter does not apply to classification models.
Note that the LiblineaR
engine does not produce class probabilities.
When optimizing the model using the tune package, the default metrics
require class probabilities. To use the tune_*()
functions, a metric
set must be passed as an argument that only contains metrics for hard
class predictions (e.g., accuracy).
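A sketch of such a metric set (folds is a hypothetical rsample resampling object and class a hypothetical outcome column):

library(tune)
library(yardstick)

svm_spec <- svm_linear(cost = tune()) %>%
  set_engine("LiblineaR") %>%
  set_mode("classification")

tune_grid(svm_spec, class ~ ., resamples = folds,
          metrics = metric_set(accuracy))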
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
Case weights
The underlying model implementation does not allow for case weights.
Sparse Data
This model can utilize sparse data during model fitting and prediction.
Both sparse matrices such as dgCMatrix from the Matrix
package and
sparse tibbles from the sparsevctrs
package are supported. See
sparse_data for more information.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for svm_linear()
with the "LiblineaR"
engine.
References
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Polynomial support vector machines (SVMs) via kernlab
Description
kernlab::ksvm()
fits a support vector machine model. For classification,
the model tries to maximize the width of the margin between classes.
For regression, the model optimizes a robust loss function that is only
affected by very large model residuals.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 4 tuning parameters:
- cost: Cost (type: double, default: 1.0)
- degree: Degree of Interaction (type: integer, default: 1L)
- scale_factor: Scale Factor (type: double, default: 1.0)
- margin: Insensitivity Margin (type: double, default: 0.1)
Parsnip changes the default range for cost to c(-10, 5).
Translation from parsnip to the original package (regression)
svm_poly( cost = double(1), degree = integer(1), scale_factor = double(1), margin = double(1) ) %>% set_engine("kernlab") %>% set_mode("regression") %>% translate()
## Polynomial Support Vector Machine Model Specification (regression) ## ## Main Arguments: ## cost = double(1) ## degree = integer(1) ## scale_factor = double(1) ## margin = double(1) ## ## Computational engine: kernlab ## ## Model fit template: ## kernlab::ksvm(x = missing_arg(), data = missing_arg(), C = double(1), ## epsilon = double(1), kernel = "polydot", kpar = list(degree = ~integer(1), ## scale = ~double(1)))
Translation from parsnip to the original package (classification)
svm_poly( cost = double(1), degree = integer(1), scale_factor = double(1) ) %>% set_engine("kernlab") %>% set_mode("classification") %>% translate()
## Polynomial Support Vector Machine Model Specification (classification) ## ## Main Arguments: ## cost = double(1) ## degree = integer(1) ## scale_factor = double(1) ## ## Computational engine: kernlab ## ## Model fit template: ## kernlab::ksvm(x = missing_arg(), data = missing_arg(), C = double(1), ## kernel = "polydot", prob.model = TRUE, kpar = list(degree = ~integer(1), ## scale = ~double(1)))
The margin
parameter does not apply to classification models.
Note that the "kernlab"
engine does not naturally estimate class
probabilities. To produce them, the decision values of the model are
converted to probabilities using Platt scaling. This method fits an
additional model on top of the SVM model. When fitting the Platt scaling
model, random numbers are used that are not reproducible or controlled
by R’s random number stream.
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
Case weights
The underlying model implementation does not allow for case weights.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for svm_poly()
with the "kernlab"
engine.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
References
Lin, HT, and R Weng. “A Note on Platt’s Probabilistic Outputs for Support Vector Machines”
Karatzoglou, A, Smola, A, Hornik, K, and A Zeileis. 2004. “kernlab - An S4 Package for Kernel Methods in R.”, Journal of Statistical Software.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Radial basis function support vector machines (SVMs) via kernlab
Description
kernlab::ksvm()
fits a support vector machine model. For classification,
the model tries to maximize the width of the margin between classes.
For regression, the model optimizes a robust loss function that is only
affected by very large model residuals.
Details
For this engine, there are multiple modes: classification and regression
Tuning Parameters
This model has 3 tuning parameters:
-
cost
: Cost (type: double, default: 1.0) -
rbf_sigma
: Radial Basis Function sigma (type: double, default: see below) -
margin
: Insensitivity Margin (type: double, default: 0.1)
There is no default for the radial basis function kernel parameter.
kernlab estimates it from the data using a heuristic method. See
kernlab::sigest()
. This method uses random
numbers so, without setting the seed before fitting, the model will not
be reproducible.
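Because of this, set a seed just before fitting if the results need to be repeatable. A minimal sketch:
set.seed(1)
svm_rbf(cost = 1) %>%
  set_engine("kernlab") %>%
  set_mode("regression") %>%
  fit(mpg ~ ., data = mtcars)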
Parsnip changes the default range for cost
to c(-10, 5)
.
Translation from parsnip to the original package (regression)
svm_rbf( cost = double(1), rbf_sigma = double(1), margin = double(1) ) %>% set_engine("kernlab") %>% set_mode("regression") %>% translate()
## Radial Basis Function Support Vector Machine Model Specification (regression) ## ## Main Arguments: ## cost = double(1) ## rbf_sigma = double(1) ## margin = double(1) ## ## Computational engine: kernlab ## ## Model fit template: ## kernlab::ksvm(x = missing_arg(), data = missing_arg(), C = double(1), ## epsilon = double(1), kernel = "rbfdot", kpar = list(sigma = ~double(1)))
Translation from parsnip to the original package (classification)
svm_rbf( cost = double(1), rbf_sigma = double(1) ) %>% set_engine("kernlab") %>% set_mode("classification") %>% translate()
## Radial Basis Function Support Vector Machine Model Specification (classification) ## ## Main Arguments: ## cost = double(1) ## rbf_sigma = double(1) ## ## Computational engine: kernlab ## ## Model fit template: ## kernlab::ksvm(x = missing_arg(), data = missing_arg(), C = double(1), ## kernel = "rbfdot", prob.model = TRUE, kpar = list(sigma = ~double(1)))
The margin
parameter does not apply to classification models.
Note that the "kernlab"
engine does not naturally estimate class
probabilities. To produce them, the decision values of the model are
converted to probabilities using Platt scaling. This method fits an
additional model on top of the SVM model. When fitting the Platt scaling
model, random numbers are used that are not reproducible or controlled
by R’s random number stream.
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit()
, parsnip will
convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
Case weights
The underlying model implementation does not allow for case weights.
Saving fitted model objects
This model object contains data that are not required to make predictions. When saving the model for the purpose of prediction, the size of the saved object might be substantially reduced by using functions from the butcher package.
Examples
The “Fitting and Predicting with parsnip” article contains
examples
for svm_rbf()
with the "kernlab"
engine.
References
Lin, HT, and R Weng. “A Note on Platt’s Probabilistic Outputs for Support Vector Machines”
Karatzoglou, A, Smola, A, Hornik, K, and A Zeileis. 2004. “kernlab - An S4 Package for Kernel Methods in R.”, Journal of Statistical Software.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Flexible discriminant analysis
Description
discrim_flexible()
defines a model that fits a discriminant analysis model
that can use nonlinear features created using multivariate adaptive
regression splines (MARS). This function can fit classification models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
earth¹²
¹ The default engine. ² Requires a parsnip extension package.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
discrim_flexible(
mode = "classification",
num_terms = NULL,
prod_degree = NULL,
prune_method = NULL,
engine = "earth"
)
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", or "classification". |
num_terms |
The number of features that will be retained in the final model, including the intercept. |
prod_degree |
The highest possible interaction degree. |
prune_method |
The pruning method. |
engine |
A single character string specifying what computational engine to use for fitting. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1 discrim_flexible(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, earth engine details
Linear discriminant analysis
Description
discrim_linear()
defines a model that estimates a multivariate
distribution for the predictors separately for the data in each class
(usually Gaussian with a common covariance matrix). Bayes' theorem is used
to compute the probability of each class, given the predictor values. This
function can fit classification models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
¹ The default engine. ² Requires a parsnip extension package.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
discrim_linear(
mode = "classification",
penalty = NULL,
regularization_method = NULL,
engine = "MASS"
)
Arguments
mode |
A single character string for the type of model. The only possible value for this model is "classification". |
penalty |
An non-negative number representing the amount of regularization used by some of the engines. |
regularization_method |
A character string for the type of regularized
estimation. Possible values are: " |
engine |
A single character string specifying what computational engine to use for fitting. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1 discrim_linear(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, MASS engine details
, mda engine details
, sda engine details
, sparsediscrim engine details
Quadratic discriminant analysis
Description
discrim_quad()
defines a model that estimates a multivariate
distribution for the predictors separately for the data in each class
(usually Gaussian with separate covariance matrices). Bayes' theorem is used
to compute the probability of each class, given the predictor values. This
function can fit classification models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
¹ The default engine. ² Requires a parsnip extension package.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
discrim_quad(
mode = "classification",
regularization_method = NULL,
engine = "MASS"
)
Arguments
mode |
A single character string for the type of model. The only possible value for this model is "classification". |
regularization_method |
A character string for the type of regularized
estimation. Possible values are: " |
engine |
A single character string specifying what computational engine to use for fitting. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1 discrim_quad(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, MASS engine details
, sparsediscrim engine details
Regularized discriminant analysis
Description
discrim_regularized()
defines a model that estimates a multivariate
distribution for the predictors separately for the data in each class. The
structure of the model can be LDA, QDA, or some amalgam of the two. Bayes'
theorem is used to compute the probability of each class, given the
predictor values. This function can fit classification models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
klaR¹²
¹ The default engine. ² Requires a parsnip extension package.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
discrim_regularized(
mode = "classification",
frac_common_cov = NULL,
frac_identity = NULL,
engine = "klaR"
)
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", or "classification". |
frac_common_cov , frac_identity |
Numeric values between zero and one. |
engine |
A single character string specifying what computational engine to use for fitting. |
Details
There are many ways of regularizing models. For example, one form of regularization is to penalize model parameters. Similarly, the classic James–Stein regularization approach shrinks the model structure to a less complex form.
The model fits a very specific type of regularized model by Friedman (1989) that uses two types of regularization. One modulates how class-specific the covariance matrix should be. This allows the model to balance between LDA and QDA. The second regularization component shrinks the covariance matrix towards the identity matrix.
For the penalization approach, discrim_linear()
with a mda
engine can be
used. Other regularization methods can be used with discrim_linear()
and
discrim_quad()
via the sparsediscrim
engine for those functions.
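As a hedged sketch of a specification that sits between the two extremes (it assumes the discrim extension package and klaR are installed; frac_common_cov = 1 pools the covariance as in LDA, while 0 is QDA-like):
library(discrim)
rda_spec <-
  discrim_regularized(frac_common_cov = 0.5, frac_identity = 0) %>%
  set_engine("klaR")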
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1 discrim_regularized(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
Friedman, J (1989). Regularized Discriminant Analysis. Journal of the American Statistical Association, 84, 165-175.
See Also
fit()
, set_engine()
, update()
, klaR engine details
Tools for documenting engines
Description
parsnip has a fairly complex documentation system where the engines for each model have detailed documentation about the syntax, tuning parameters, preprocessing needs, and so on.
The functions below are called from .R
files to programmatically
generate content in the help files for a model.
-
make_engine_list()
identifies engines for a model and creates a bulleted list of links to those specific help files. -
make_seealso_list()
creates a set of links for the "See Also" list at the bottom of the help pages. -
find_engine_files()
is a function, used by the above, to find the engines for each model function.
Usage
find_engine_files(mod)
make_engine_list(mod)
make_seealso_list(mod, pkg = "parsnip")
Arguments
mod |
A character string for the model file (e.g. "linear_reg") |
pkg |
A character string for the package where the function is invoked. |
Details
parsnip includes a document (README-DOCS.md
) with step-by-step instructions
and details. See the code below to determine where it is installed (or see
the References section).
Most parsnip users will not need to use these functions or documentation.
Value
make_engine_list()
returns a character string that creates a
bulleted list of links to more specific help files.
make_seealso_list()
returns a formatted character string of links.
find_engine_files()
returns a tibble.
References
https://github.com/tidymodels/parsnip/blob/main/inst/README-DOCS.md
Examples
# See this file for step-by-step instructions.
system.file("README-DOCS.md", package = "parsnip")
# Code examples:
make_engine_list("linear_reg")
cat(make_engine_list("linear_reg"))
Evaluate parsnip model arguments
Description
Evaluate parsnip model arguments
Usage
eval_args(spec, ...)
Arguments
spec |
|
... |
Not used. |
Extract elements of a parsnip model object
Description
These functions extract various elements from a parsnip object. If they do not exist yet, an error is thrown.
-
extract_spec_parsnip()
returns the parsnip model specification. -
extract_fit_engine()
returns the engine-specific fit embedded within a parsnip model fit. For example, when using linear_reg()
with the "lm"
engine, this returns the underlying lm
object. -
extract_parameter_dials()
returns a single dials parameter object. -
extract_parameter_set_dials()
returns a set of dials parameter objects. -
extract_fit_time()
returns a tibble with fit times. The fit times correspond to the time for the parsnip engine to fit and do not include other portions of the elapsed time in fit.model_spec().
Usage
## S3 method for class 'model_fit'
extract_spec_parsnip(x, ...)
## S3 method for class 'model_fit'
extract_fit_engine(x, ...)
## S3 method for class 'model_spec'
extract_parameter_set_dials(x, ...)
## S3 method for class 'model_spec'
extract_parameter_dials(x, parameter, ...)
## S3 method for class 'model_fit'
extract_fit_time(x, summarize = TRUE, ...)
Arguments
x |
A parsnip |
... |
Not currently used. |
parameter |
A single string for the parameter ID. |
summarize |
A logical for whether the elapsed fit time should be
returned as a single row or multiple rows. Doesn't support |
Details
Extracting the underlying engine fit can be helpful for describing the
model (via print()
, summary()
, plot()
, etc.) or for variable
importance/explainers.
However, users should not invoke the predict()
method on an extracted
model. There may be preprocessing operations that parsnip has executed on
the data prior to giving it to the model. Bypassing these can lead to errors
or silently generating incorrect predictions.
Good:
parsnip_fit %>% predict(new_data)
Bad:
parsnip_fit %>% extract_fit_engine() %>% predict(new_data)
Value
The extracted value from the parsnip object, x
, as described in the description
section.
Examples
lm_spec <- linear_reg() %>% set_engine("lm")
lm_fit <- fit(lm_spec, mpg ~ ., data = mtcars)
lm_spec
extract_spec_parsnip(lm_fit)
extract_fit_engine(lm_fit)
lm(mpg ~ ., data = mtcars)
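# extract_fit_time() reports the engine-level elapsed fit time. A hedged
# sketch (not run; availability depends on the parsnip version in use):
# extract_fit_time(lm_fit)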
Control the fit function
Description
Pass options to the fit.model_spec()
function to control its
output and computations
Usage
fit_control(verbosity = 1L, catch = FALSE)
Arguments
verbosity |
An integer to control how verbose the output is. For a
value of zero, no messages or output are shown when packages are loaded or
when the model is fit. For a value of 1, package loading is quiet but model
fits can produce output to the screen (depending on if they contain their
own |
catch |
A logical where a value of |
Details
fit_control()
is deprecated in favor of control_parsnip()
.
Value
An S3 object with class "control_parsnip" that is a named list with the results of the function call
Examples
fit_control(verbosity = 2L)
Fit a Model Specification to a Dataset
Description
fit()
and fit_xy()
take a model specification, translate the required
code by substituting arguments, and execute the model fit
routine.
Usage
## S3 method for class 'model_spec'
fit(
object,
formula,
data,
case_weights = NULL,
control = control_parsnip(),
...
)
## S3 method for class 'model_spec'
fit_xy(object, x, y, case_weights = NULL, control = control_parsnip(), ...)
Arguments
object |
An object of class |
formula |
An object of class |
data |
Optional, depending on the interface (see Details below). A data frame containing all relevant variables (e.g. outcome(s), predictors, case weights, etc). Note: when needed, a named argument should be used. |
case_weights |
An optional classed vector of numeric case weights. This
must return |
control |
A named list with elements |
... |
Not currently used; values passed here will be
ignored. Other options required to fit the model should be
passed using |
x |
A matrix, sparse matrix, or data frame of predictors. Only some
models have support for sparse matrix input. See |
y |
A vector, matrix or data frame of outcome data. |
Details
fit()
and fit_xy()
substitute the current arguments in the model
specification into the computational engine's code, check them
for validity, then fit the model using the data and the
engine-specific code. Different model functions have different
interfaces (e.g. formula or x
/y
) and these functions translate
between the interface used when fit()
or fit_xy()
was invoked and the one
required by the underlying model.
When possible, these functions attempt to avoid making copies of the
data. For example, if the underlying model uses a formula and
fit()
is invoked, the original data are referenced
when the model is fit. However, if the underlying model uses
something else, such as x
/y
, the formula is evaluated and
the data are converted to the required format. In this case, any
calls in the resulting model objects reference the temporary
objects used to fit the model.
If the model engine has not been set, the model's default engine will be used
(as discussed on each model page). If the verbosity
option of
control_parsnip()
is greater than zero, a warning will be produced.
If you would like to use an alternative method for generating contrasts when
supplying a formula to fit()
, set the global option contrasts
to your
preferred method. For example, you might set it to:
options(contrasts = c(unordered = "contr.helmert", ordered = "contr.poly"))
.
See the help page for stats::contr.treatment()
for more possible contrast
types.
For models with "censored regression"
modes, an additional computation is
executed and saved in the parsnip object. The censor_probs
element contains
a "reverse Kaplan-Meier" curve that models the probability of censoring. This
may be used later to compute inverse probability censoring weights for
performance measures.
Sparse data is supported, with the use of the x
argument in fit_xy()
. See
allow_sparse_x
column of get_encoding()
for sparse input
compatibility.
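For example, here is a hedged sketch of passing a dgCMatrix to an engine that allows sparse predictors (glmnet, assuming it is installed):
library(Matrix)
x_sp <- as(as.matrix(mtcars[, -1]), "CsparseMatrix")
linear_reg(penalty = 0.1) %>%
  set_engine("glmnet") %>%
  fit_xy(x = x_sp, y = mtcars$mpg)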
Value
A model_fit
object that contains several elements:
-
lvl
: If the outcome is a factor, this contains the factor levels at the time of model fitting. -
ordered
: If the outcome is a factor, was it an ordered factor? -
spec
: The model specification object (object
in the call to fit
) -
fit
: When the model is executed without error, this is the model object. Otherwise, it is a try-error
object with the error message. -
preproc
: Any objects needed to convert between a formula and non-formula interface (such as the terms
object)
The return value will also have a class related to the fitted model (e.g.
"_glm"
) before the base class of "model_fit"
.
See Also
set_engine()
, control_parsnip()
, model_spec
, model_fit
Examples
# Although `glm()` only has a formula interface, different
# methods for specifying the model can be used
library(dplyr)
library(modeldata)
data("lending_club")
lr_mod <- logistic_reg()
using_formula <-
lr_mod %>%
set_engine("glm") %>%
fit(Class ~ funded_amnt + int_rate, data = lending_club)
using_xy <-
lr_mod %>%
set_engine("glm") %>%
fit_xy(x = lending_club[, c("funded_amnt", "int_rate")],
y = lending_club$Class)
using_formula
using_xy
Internal functions that format predictions
Description
These are used to ensure that we have appropriate column names inside of tibbles.
Usage
format_num(x)
format_class(x)
format_classprobs(x)
format_time(x)
format_survival(x)
format_linear_pred(x)
format_hazard(x)
ensure_parsnip_format(x, col_name, overwrite = TRUE)
Arguments
x |
A data frame or vector (depending on the context and function). |
col_name |
A string for a prediction column name. |
overwrite |
A logical for whether to overwrite the column name. |
Value
A tibble
Generalized additive models (GAMs)
Description
gen_additive_mod()
defines a model that can use smoothed functions of
numeric predictors in a generalized linear model. This function can fit
classification and regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
mgcv¹
¹ The default engine.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
gen_additive_mod(
mode = "unknown",
select_features = NULL,
adjust_deg_free = NULL,
engine = "mgcv"
)
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", or "classification". |
select_features |
|
adjust_deg_free |
If |
engine |
A single character string specifying what computational engine to use for fitting. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1 gen_additive_mod(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, mgcv engine details
Examples
show_engines("gen_additive_mod")
gen_additive_mod()
Working with the parsnip model environment
Description
These functions read and write to the environment where the package stores information about model specifications.
Usage
get_model_env()
get_from_env(items)
set_in_env(...)
set_env_val(name, value)
Arguments
items |
A character string of objects in the model environment. |
... |
Named values that will be assigned to the model environment. |
name |
A single character value for a new symbol in the model environment. |
value |
A single value for a new value in the model environment. |
References
"How to build a parsnip model" https://www.tidymodels.org/learn/develop/models/
Examples
# Access the model data:
current_code <- get_model_env()
ls(envir = current_code)
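# Look up the registered engine/mode combinations for a single model:
get_from_env("linear_reg")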
Construct a single row summary "glance" of a model, fit, or other object
Description
This method glances the model in a parsnip model object, if it exists.
Usage
## S3 method for class 'model_fit'
glance(x, ...)
Arguments
x |
model or other R object to convert to single-row data frame |
... |
other arguments passed to methods |
Value
a tibble
Fit a grouped binomial outcome from a data set with case weights
Description
stats::glm()
assumes that a tabular data set with case weights corresponds
to "different observations have different dispersions" (see ?glm
).
In some cases, the case weights reflect that the same covariate pattern was
observed multiple times (i.e., frequency weights). In this case,
stats::glm()
expects the data to be formatted as the number of events for
each factor level so that the outcome can be given to the formula as
cbind(events_1, events_2)
.
glm_grouped()
converts data with integer case weights to the expected
"number of events" format for binomial data.
Usage
glm_grouped(formula, data, weights, ...)
Arguments
formula |
A formula object with one outcome that is a two-level factor. |
data |
A data frame with the outcomes and predictors (but not case weights). |
weights |
An integer vector of weights whose length is the same as the
number of rows in |
... |
Options to pass to |
Value
A object produced by stats::glm()
.
Examples
#----------------------------------------------------------------------------
# The same data set formatted three ways
# First with basic case weights that, from ?glm, are used inappropriately.
ucb_weighted <- as.data.frame(UCBAdmissions)
ucb_weighted$Freq <- as.integer(ucb_weighted$Freq)
head(ucb_weighted)
nrow(ucb_weighted)
# Format when yes/no data are in individual rows (probably still inappropriate)
library(tidyr)
ucb_long <- uncount(ucb_weighted, Freq)
head(ucb_long)
nrow(ucb_long)
# Format where the outcome is formatted as number of events
ucb_events <-
ucb_weighted %>%
tidyr::pivot_wider(
id_cols = c(Gender, Dept),
names_from = Admit,
values_from = Freq,
values_fill = 0L
)
head(ucb_events)
nrow(ucb_events)
#----------------------------------------------------------------------------
# Different model fits
# Treat data as separate Bernoulli data:
glm(Admit ~ Gender + Dept, data = ucb_long, family = binomial)
# Weights produce the same statistics
glm(
Admit ~ Gender + Dept,
data = ucb_weighted,
family = binomial,
weights = ucb_weighted$Freq
)
# Data as binomial "x events out of n trials" format. Note that, to get the same
# coefficients, the order of the levels must be reversed.
glm(
cbind(Rejected, Admitted) ~ Gender + Dept,
data = ucb_events,
family = binomial
)
# The new function starts with frequency weights and gets to the correct format:
glm_grouped(Admit ~ Gender + Dept, data = ucb_weighted, weights = ucb_weighted$Freq)
Technical aspects of the glmnet model
Description
glmnet is a popular statistical model for regularized generalized linear models. These notes reflect common questions about this particular model.
tidymodels and glmnet
The implementation of the glmnet package has some nice features. For
example, one of the main tuning parameters, the regularization penalty,
does not need to be specified when fitting the model. The package fits a
compendium of values, called the regularization path. These values
depend on the data set and the value of alpha
, the mixture parameter
between a pure ridge model (alpha = 0
) and a pure lasso model
(alpha = 1
). When predicting, any penalty values can be simultaneously
predicted, even those that are not exactly on the regularization path.
For those, the model interpolates between the closest path values to
produce a prediction. There is an argument called lambda
to the
glmnet()
function that is used to specify the path.
In the discussion below, linear_reg()
is used. The information is true
for all parsnip models that have a "glmnet"
engine.
Fitting and predicting using parsnip
Recall that tidymodels uses standardized parameter names across models
chosen to be low on jargon. The argument penalty
is the equivalent of
what glmnet calls the lambda
value and mixture
is the same as their
alpha
value.
In tidymodels, our predict()
methods are defined to make one
prediction at a time. For this model, that means predictions are for a
single penalty value. For this reason, models that have glmnet engines
require the user to always specify a single penalty value when the model
is defined. For example, for linear regression:
linear_reg(penalty = 1) %>% set_engine("glmnet")
When the predict()
method is called, it automatically uses the penalty
that was given when the model was defined. For example:
library(tidymodels) fit <- linear_reg(penalty = 1) %>% set_engine("glmnet") %>% fit(mpg ~ ., data = mtcars) # predict at penalty = 1 predict(fit, mtcars[1:3,])
## # A tibble: 3 x 1 ## .pred ## <dbl> ## 1 22.2 ## 2 21.5 ## 3 24.9
However, any penalty values can be predicted simultaneously using the
multi_predict()
method:
# predict at c(0.00, 0.01) multi_predict(fit, mtcars[1:3,], penalty = c(0.00, 0.01))
## # A tibble: 3 x 1 ## .pred ## <list> ## 1 <tibble [2 x 2]> ## 2 <tibble [2 x 2]> ## 3 <tibble [2 x 2]>
# unnested: multi_predict(fit, mtcars[1:3,], penalty = c(0.00, 0.01)) %>% add_rowindex() %>% unnest(cols = ".pred")
## # A tibble: 6 x 3 ## penalty .pred .row ## <dbl> <dbl> <int> ## 1 0 22.6 1 ## 2 0.01 22.5 1 ## 3 0 22.1 2 ## 4 0.01 22.1 2 ## 5 0 26.3 3 ## 6 0.01 26.3 3
Where did lambda
go?
It may appear odd that the lambda
value does not get used in the fit:
linear_reg(penalty = 1) %>% set_engine("glmnet") %>% translate()
## Linear Regression Model Specification (regression) ## ## Main Arguments: ## penalty = 1 ## ## Computational engine: glmnet ## ## Model fit template: ## glmnet::glmnet(x = missing_arg(), y = missing_arg(), weights = missing_arg(), ## family = "gaussian")
Internally, the value of penalty = 1
is saved in the parsnip object
and no value is set for lambda
. This enables the full path to be fit
by glmnet()
. See the section below about setting the path.
How do I set the regularization path?
Regardless of what value you use for penalty
, the full coefficient
path is used when glmnet::glmnet()
is called.
What if you want to manually set this path? Normally, you would pass a
vector to lambda
in glmnet::glmnet()
.
parsnip models that use a glmnet
engine can use a special optional
argument called path_values
. This is not an argument to
glmnet::glmnet()
; it is used by parsnip to
independently set the path.
For example, we have found that if you want a pure ridge regression
model (i.e., mixture = 0
), you can get the wrong coefficients if the
path does not contain zero (see issue #431).
If we want to use our own path, the argument is passed as an engine-specific option:
coef_path_values <- c(0, 10^seq(-5, 1, length.out = 7)) fit_ridge <- linear_reg(penalty = 1, mixture = 0) %>% set_engine("glmnet", path_values = coef_path_values) %>% fit(mpg ~ ., data = mtcars) all.equal(sort(fit_ridge$fit$lambda), coef_path_values)
## [1] TRUE
# predict at penalty = 1 predict(fit_ridge, mtcars[1:3,])
## # A tibble: 3 x 1 ## .pred ## <dbl> ## 1 22.1 ## 2 21.8 ## 3 26.6
Tidying the model object
broom::tidy()
is a function that gives a summary of
the object as a tibble.
tl;dr tidy()
on a glmnet
model produced by parsnip gives the
coefficients for the value given by penalty
.
When parsnip makes a model, it gives it an extra class. Using the tidy()
method on that object produces coefficients for the penalty that was
originally requested:
tidy(fit)
## # A tibble: 11 x 3 ## term estimate penalty ## <chr> <dbl> <dbl> ## 1 (Intercept) 35.3 1 ## 2 cyl -0.872 1 ## 3 disp 0 1 ## 4 hp -0.0101 1 ## 5 drat 0 1 ## 6 wt -2.59 1 ## # i 5 more rows
Note that there is a tidy()
method for glmnet
objects in the broom
package. If this is used directly on the underlying glmnet
object, it
returns all of the coefficients on the path:
# Use the basic tidy() method for glmnet all_tidy_coefs <- broom:::tidy.glmnet(fit$fit) all_tidy_coefs
## # A tibble: 640 x 5 ## term step estimate lambda dev.ratio ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 1 20.1 5.15 0 ## 2 (Intercept) 2 21.6 4.69 0.129 ## 3 (Intercept) 3 23.2 4.27 0.248 ## 4 (Intercept) 4 24.7 3.89 0.347 ## 5 (Intercept) 5 26.0 3.55 0.429 ## 6 (Intercept) 6 27.2 3.23 0.497 ## # i 634 more rows
length(unique(all_tidy_coefs$lambda))
## [1] 79
This can be nice for plots but it might not contain the penalty value that you are interested in.
Tools for models that predict on sub-models
Description
has_multi_predict()
tests to see if an object can make multiple
predictions on submodels from the same object. multi_predict_args()
returns the names of the arguments to multi_predict()
for this model
(if any).
Usage
has_multi_predict(object, ...)
## Default S3 method:
has_multi_predict(object, ...)
## S3 method for class 'model_fit'
has_multi_predict(object, ...)
## S3 method for class 'workflow'
has_multi_predict(object, ...)
multi_predict_args(object, ...)
## Default S3 method:
multi_predict_args(object, ...)
## S3 method for class 'model_fit'
multi_predict_args(object, ...)
## S3 method for class 'workflow'
multi_predict_args(object, ...)
Arguments
object |
An object to test. |
... |
Not currently used. |
Value
has_multi_predict()
returns a single logical value while
multi_predict_args()
returns a character vector of argument names (or NA
if none exist).
Examples
lm_model_idea <- linear_reg() %>% set_engine("lm")
has_multi_predict(lm_model_idea)
lm_model_fit <- fit(lm_model_idea, mpg ~ ., data = mtcars)
has_multi_predict(lm_model_fit)
multi_predict_args(lm_model_fit)
library(kknn)
knn_fit <-
nearest_neighbor(mode = "regression", neighbors = 5) %>%
set_engine("kknn") %>%
fit(mpg ~ ., mtcars)
multi_predict_args(knn_fit)
multi_predict(knn_fit, mtcars[1, -1], neighbors = 1:4)$.pred
Activation functions for neural networks in keras
Description
Activation functions for neural networks in keras
Usage
keras_activations()
Value
A character vector of values.
Simple interface to MLP models via keras
Description
Instead of building a keras
model sequentially, keras_mlp
can be used to
create a feedforward network with a single hidden layer. Regularization is
via either weight decay or dropout.
Usage
keras_mlp(
x,
y,
hidden_units = 5,
penalty = 0,
dropout = 0,
epochs = 20,
activation = "softmax",
seeds = sample.int(10^5, size = 3),
...
)
Arguments
x |
A data frame or matrix of predictors |
y |
A vector (factor or numeric) or matrix (numeric) of outcome data. |
hidden_units |
An integer for the number of hidden units. |
penalty |
A non-negative real number for the amount of weight decay. Either
this parameter or |
dropout |
The proportion of parameters to set to zero. Either
this parameter or |
epochs |
An integer for the number of passes through the data. |
activation |
A character string for the type of activation function between layers. |
seeds |
A vector of three positive integers to control randomness of the calculations. |
... |
Additional named arguments to pass to |
Value
A keras
model object.
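A hedged usage sketch (not run; it requires keras with a configured backend, and the argument values are only illustrative):
# keras_mlp(
#   x = as.matrix(mtcars[, -1]), y = mtcars$mpg,
#   hidden_units = 10, epochs = 5, activation = "relu"
# )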
Wrapper for keras class predictions
Description
Wrapper for keras class predictions
Usage
keras_predict_classes(object, x)
Arguments
object |
A keras model fit |
x |
A data set. |
Knit engine-specific documentation
Description
Knit engine-specific documentation
Usage
knit_engine_docs(pattern = NULL)
Arguments
pattern |
A regular expression to specify which files to knit. The default knits all engine documentation files. |
Details
This function will check whether the known parsnip extension packages, engine-specific packages, and a few other ancillary packages are installed. Users will be prompted to install anything required to create the engine documentation.
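A hedged usage sketch (not run; it re-knits only the matching documentation files):
# knit_engine_docs(pattern = "linear_reg")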
Value
A tibble with column file
for the file name and result
(a
character vector that echos the output file name or, when there is
a failure, the error message).
Linear regression
Description
linear_reg()
defines a model that can predict numeric values from
predictors using a linear function. This function can fit regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
¹ The default engine. ² Requires a parsnip extension package for regression.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
linear_reg(mode = "regression", engine = "lm", penalty = NULL, mixture = NULL)
Arguments
mode |
A single character string for the type of model. The only possible value for this model is "regression". |
engine |
A single character string specifying what computational engine
to use for fitting. Possible engines are listed below. The default for this
model is |
penalty |
A non-negative number representing the total amount of regularization (specific engines only). |
mixture |
A number between zero and one (inclusive) denoting the proportion of L1 regularization (i.e. lasso) in the model.
Available for specific engines only. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1 linear_reg(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, lm engine details
, brulee engine details
, gee engine details
, glm engine details
, glmer engine details
, glmnet engine details
, gls engine details
, h2o engine details
, keras engine details
, lme engine details
, lmer engine details
, quantreg engine details
, spark engine details
, stan engine details
, stan_glmer engine details
Examples
show_engines("linear_reg")
linear_reg()
Locate and show errors/warnings in engine-specific documentation
Description
Locate and show errors/warnings in engine-specific documentation
Usage
list_md_problems()
Value
A tibble with column file
for the file name, line
indicating
the line where the error/warning occurred, and problem
showing the
error/warning message.
Logistic regression
Description
logistic_reg()
defines a generalized linear model for binary outcomes. A
linear combination of the predictors is used to model the log odds of an
event. This function can fit classification models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
¹ The default engine. ² Requires a parsnip extension package.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
logistic_reg(
mode = "classification",
engine = "glm",
penalty = NULL,
mixture = NULL
)
Arguments
mode |
A single character string for the type of model. The only possible value for this model is "classification". |
engine |
A single character string specifying what computational engine
to use for fitting. Possible engines are listed below. The default for this
model is |
penalty |
A non-negative number representing the total
amount of regularization (specific engines only).
For |
mixture |
A number between zero and one (inclusive) giving the proportion of L1 regularization (i.e. lasso) in the model.
Available for specific engines only. For |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1 logistic_reg(argument = !!value)
This model fits a classification model for binary outcomes; for
multiclass outcomes, see multinom_reg()
.
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, glm engine details
, brulee engine details
, gee engine details
, glmer engine details
, glmnet engine details
, h2o engine details
, keras engine details
, LiblineaR engine details
, spark engine details
, stan engine details
, stan_glmer engine details
Examples
show_engines("logistic_reg")
logistic_reg()
Make a parsnip call expression
Description
Make a parsnip call expression
Usage
make_call(fun, ns, args, ...)
Arguments
fun |
A character string of a function name. |
ns |
A character string of a package name. |
args |
A named list of argument values. |
Details
The arguments are spliced into the ns::fun()
call. If they are
missing, null, or a single logical, they are not spliced.
Value
A call.
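A hedged sketch of constructing and then evaluating such a call (the argument values are only illustrative):
cl <- make_call("lm", ns = "stats", args = list(formula = mpg ~ ., data = quote(mtcars)))
cl
# eval(cl) would then fit the model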
Prepend a new class
Description
This adds an extra class to a base class of "model_spec".
Usage
make_classes(prefix)
Arguments
prefix |
A character string for a class. |
Value
A character vector.
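A hedged sketch of the expected behavior (the prefix is illustrative):
make_classes("my_model")
# c("my_model", "model_spec")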
Multivariate adaptive regression splines (MARS)
Description
mars()
defines a generalized linear model that uses artificial features for
some predictors. These features resemble hinge functions and the result is
a model that is a segmented regression in small dimensions. This function can
fit classification and regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
earth¹
¹ The default engine.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
mars(
mode = "unknown",
engine = "earth",
num_terms = NULL,
prod_degree = NULL,
prune_method = NULL
)
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", or "classification". |
engine |
A single character string specifying what computational engine to use for fitting. |
num_terms |
The number of features that will be retained in the final model, including the intercept. |
prod_degree |
The highest possible interaction degree. |
prune_method |
The pruning method. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1 mars(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, earth engine details
Examples
show_engines("mars")
mars(mode = "regression", num_terms = 5)
Reformat quantile predictions
Description
Reformat quantile predictions
Usage
matrix_to_quantile_pred(x, object)
Arguments
x |
A matrix of predictions with rows as samples and columns as quantile levels. |
object |
A parsnip |
Determine largest value of mtry from formula.
This function potentially caps the value of mtry
based on a formula and
data set. This is a safe approach for survival and/or multivariate models.
Description
Determine largest value of mtry from formula.
This function potentially caps the value of mtry
based on a formula and
data set. This is a safe approach for survival and/or multivariate models.
Usage
max_mtry_formula(mtry, formula, data)
Arguments
mtry |
An initial value of |
formula |
A model formula. |
data |
The training set (data frame). |
Value
A value for mtry
.
Examples
# should be 9
max_mtry_formula(200, cbind(wt, mpg) ~ ., data = mtcars)
Fuzzy conversions
Description
These are substitutes for as.matrix()
and as.data.frame()
that leave
a sparse matrix as-is.
Usage
maybe_matrix(x)
maybe_data_frame(x)
Arguments
x |
A data frame, matrix, or sparse matrix. |
Value
A data frame, matrix, or sparse matrix.
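A hedged sketch of the intended behavior (assuming the Matrix package):
library(Matrix)
maybe_matrix(mtcars)   # converted to a numeric matrix
x_sp <- as(as.matrix(mtcars), "CsparseMatrix")
maybe_data_frame(x_sp) # left as-is, still sparse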
Execution-time data dimension checks
Description
For some tuning parameters, the range of values depend on the data
dimensions (e.g. mtry
). Some packages will fail if the parameter values are
outside of these ranges. Since the model might receive resampled versions of
the data, these ranges can't be set prior to the point where the model is
fit. These functions check the possible range of the data and adjust them
if needed (with a warning).
Usage
min_cols(num_cols, source)
min_rows(num_rows, source, offset = 0)
Arguments
num_cols , num_rows |
The parameter value requested by the user. |
source |
A data frame for the data to be used in the fit. If the source is named "data", it is assumed that one column of the data corresponds to an outcome (and is subtracted off). |
offset |
A number subtracted off of the number of rows available in the data. |
Value
An integer (and perhaps a warning).
Examples
nearest_neighbor(neighbors = 100) %>%
set_engine("kknn") %>%
set_mode("regression") %>%
translate()
library(ranger)
rand_forest(mtry = 2, min_n = 100, trees = 3) %>%
set_engine("ranger") %>%
set_mode("regression") %>%
fit(mpg ~ ., data = mtcars)
Single layer neural network
Description
mlp()
defines a multilayer perceptron model (a.k.a. a single layer,
feed-forward neural network). This function can fit classification and
regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
¹ The default engine. ² Requires a parsnip extension package for classification and regression.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
mlp(
mode = "unknown",
engine = "nnet",
hidden_units = NULL,
penalty = NULL,
dropout = NULL,
epochs = NULL,
activation = NULL,
learn_rate = NULL
)
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", or "classification". |
engine |
A single character string specifying what computational engine to use for fitting. |
hidden_units |
An integer for the number of units in the hidden layer. |
penalty |
A non-negative numeric value for the amount of weight decay. |
dropout |
A number between 0 (inclusive) and 1 denoting the proportion of model parameters randomly set to zero during model training. |
epochs |
An integer for the number of training iterations. |
activation |
A single character string denoting the type of relationship between the original predictors and the hidden unit layer. The activation function between the hidden and output layers is automatically set to either "linear" or "softmax" depending on the type of outcome. Possible values depend on the engine being used. |
learn_rate |
A number for the rate at which the boosting algorithm adapts from iteration-to-iteration (specific engines only). This is sometimes referred to as the shrinkage parameter. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1 mlp(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, nnet engine details
, brulee engine details
, brulee_two_layer engine details
, h2o engine details
, keras engine details
Examples
show_engines("mlp")
mlp(mode = "classification", penalty = 0.01)
parsnip model specification database
Description
This is used in the RStudio add-in and captures information about model specifications in various R packages.
Value
model_db |
a data frame |
Examples
data(model_db)
Model Fit Objects
Description
Model fits are trained model specifications that are
ready to predict on new data. Model fits have class
model_fit
and, usually, a subclass referring to the engine
used to fit the model.
Details
An object with class "model_fit"
is a container for
information about a model that has been fit to the data.
The main elements of the object are:
-
lvl
: A vector of factor levels when the outcome is a factor. This is NULL when the outcome is not a factor vector. -
spec
: A model_spec
object. -
fit
: The object produced by the fitting function. -
preproc
: This contains any data-specific information required to process a new sample point for prediction. For example, if the underlying model function requires arguments x and y and the user passed a formula to fit, the preproc object would contain items such as the terms object and so on. When no information is required, this is NA.
As discussed in the documentation for model_spec
, the
original arguments to the specification are saved as quosures.
These are evaluated for the model_fit
object prior to fitting.
If the resulting model object prints its call, any user-defined
options are shown in the call preceded by a tilde (see the
example below). This is a result of the use of quosures in the
specification.
This class and structure is the basis for how parsnip stores model objects after seeing the data and applying a model.
Examples
# Keep the `x` matrix if the data are not too big.
spec_obj <-
linear_reg() %>%
set_engine("lm", x = ifelse(.obs() < 500, TRUE, FALSE))
spec_obj
fit_obj <- fit(spec_obj, mpg ~ ., data = mtcars)
fit_obj
nrow(fit_obj$fit$x)
Formulas with special terms in tidymodels
Description
In R, formulas provide a compact, symbolic notation to specify model terms.
Many modeling functions in R make use of "specials",
or nonstandard notations used in formulas. Specials are defined and handled as
a special case by a given modeling package. For example, the mgcv package,
which provides support for
generalized additive models in R, defines a
function s()
to be in-lined into formulas. It can be used like so:
mgcv::gam(mpg ~ wt + s(disp, k = 5), data = mtcars)
In this example, the s()
special defines a smoothing term that the mgcv
package knows to look for when preprocessing model input.
The parsnip package can handle most specials without issue. The analogous code for specifying this generalized additive model with the parsnip "mgcv" engine looks like:
gen_additive_mod() %>% set_mode("regression") %>% set_engine("mgcv") %>% fit(mpg ~ wt + s(disp, k = 5), data = mtcars)
However, parsnip is often used in conjunction with the greater tidymodels package ecosystem, which defines its own pre-processing infrastructure and functionality via packages like hardhat and recipes. The specials defined in many modeling packages introduce conflicts with that infrastructure.
To support specials while also maintaining consistent syntax elsewhere in the ecosystem, tidymodels delineates between two types of formulas: preprocessing formulas and model formulas. Preprocessing formulas specify the input variables, while model formulas determine the model structure.
Example
To create the preprocessing formula from the model formula, just remove the specials, retaining references to input variables themselves. For example:
model_formula <- mpg ~ wt + s(disp, k = 5) preproc_formula <- mpg ~ wt + disp
-
With parsnip, use the model formula:
model_spec <- gen_additive_mod() %>% set_mode("regression") %>% set_engine("mgcv") model_spec %>% fit(model_formula, data = mtcars)
-
With recipes, use the preprocessing formula only:
library(recipes) recipe(preproc_formula, mtcars)
The recipes package supplies a large variety of preprocessing techniques that may replace the need for specials altogether, in some cases.
-
With workflows, use the preprocessing formula everywhere, but pass the model formula to the
formula
argument in add_model():
library(workflows) wflow <- workflow() %>% add_formula(preproc_formula) %>% add_model(model_spec, formula = model_formula) fit(wflow, data = mtcars)
The workflow will then pass the model formula to parsnip, using the preprocessor formula elsewhere. We would still use the preprocessing formula if we had added a recipe preprocessor using
add_recipe()
instead of a formula via add_formula()
.
Print helper for model objects
Description
A common format function that prints information about the model object (e.g. arguments, calls, packages, etc).
Usage
model_printer(x, ...)
Arguments
x |
A model object. |
... |
Not currently used. |
Model Specifications
Description
The parsnip package splits the process of fitting models into two steps:
Specify how a model will be fit using a model specification
Fit a model using the model specification
This is a different approach to many other model interfaces in R, like lm()
,
where both the specification of the model and the fitting happens in one
function call. Splitting the process into two steps allows users to
iteratively define model specifications throughout the model development
process.
This intermediate object that defines how the model will be fit is called
a model specification and has class model_spec
. Model type functions,
like linear_reg()
or boost_tree()
, return model_spec
objects.
Fitted model objects, resulting from passing a model_spec
to
fit() or fit_xy(), have
class model_fit
, and contain the original model_spec
objects inside
them. See ?model_fit for more on that object type, and
?extract_spec_parsnip to
extract model_spec
s from model_fit
s.
Details
An object with class "model_spec"
is a container for
information about a model that will be fit.
The main elements of the object are:
-
args
: A vector of the main arguments for the model. The names of these arguments may be different from their counterparts in the underlying model function. For example, for a glmnet model, the argument name for the amount of the penalty is called "penalty" instead of "lambda" to make it more general and usable across different types of models (and to not be specific to a particular model function). The elements of args can be marked for optimization with tune() from the tune package. For more information see https://www.tidymodels.org/start/tuning/. If left to their defaults (NULL), the arguments will use the underlying model function's default values. As discussed below, the arguments in args are captured as quosures and are not immediately executed. -
- ...: Optional model-function-specific parameters. As with args, these are quosures and can be marked for tuning with tune().
- mode: The type of model, such as "regression" or "classification". Other modes will be added once the package adds more functionality.
- method: This is a slot that is filled in later by the model's constructor function. It generally contains lists of information that are used to create the fit and prediction code, as well as required packages and similar data.
- engine: This character string declares exactly what software will be used. It can be a package name or a technology type.
This class and structure are the basis for how parsnip stores model objects prior to seeing the data.
Argument Details
An important detail to understand when creating model specifications is that they are intended to be functionally independent of the data. While it is true that some tuning parameters are data dependent, the model specification does not interact with the data at all.
Most R functions immediately evaluate their arguments. For example, when calling mean(dat_vec), the object
, the object
dat_vec
is immediately evaluated inside of the function.
parsnip model functions do not do this. For example, using
rand_forest(mtry = ncol(mtcars) - 1)
does not execute ncol(mtcars) - 1
when creating the specification.
This can be seen in the output:
> rand_forest(mtry = ncol(mtcars) - 1)
Random Forest Model Specification (unknown)

Main Arguments:
  mtry = ncol(mtcars) - 1
The model functions save the argument expressions and their
associated environments (a.k.a. a quosure) to be evaluated later
when either fit.model_spec()
or fit_xy.model_spec()
are
called with the actual data.
The consequence of this strategy is that any data required to get the parameter values must be available when the model is fit. The two main ways that this can fail are if:
The data have been modified between the creation of the model specification and when the model fit function is invoked.
If the model specification is saved and loaded into a new session where those same data objects do not exist.
The best way to avoid these issues is to not reference any data
objects in the global environment but to use data descriptors
such as .cols()
. Another way of writing the previous
specification is
rand_forest(mtry = .cols() - 1)
This is not dependent on any specific data object and is evaluated immediately before the model fitting process begins.
One less advantageous approach to solving this issue is to use quasiquotation. This would insert the actual R object into the model specification and might be the best idea when the data object is small. For example, using
rand_forest(mtry = ncol(!!mtcars) - 1)
would work (and be reproducible between sessions) but embeds
the entire mtcars data set into the mtry
expression:
> rand_forest(mtry = ncol(!!mtcars) - 1)
Random Forest Model Specification (unknown)

Main Arguments:
  mtry = ncol(structure(list(mpg = c(21, 21, 22.8, 21.4, 18.7, <snip>
However, if there were an object with the number of columns in it, this wouldn't be too bad:
> mtry_val <- ncol(mtcars) - 1
> mtry_val
[1] 10
> rand_forest(mtry = !!mtry_val)
Random Forest Model Specification (unknown)

Main Arguments:
  mtry = 10
More information on quosures and quasiquotation can be found at https://adv-r.hadley.nz/quasiquotation.html.
Model predictions across many sub-models
Description
For some models, predictions can be made on sub-models in the model object.
Usage
multi_predict(object, ...)
## Default S3 method:
multi_predict(object, ...)
## S3 method for class '_xgb.Booster'
multi_predict(object, new_data, type = NULL, trees = NULL, ...)
## S3 method for class '_C5.0'
multi_predict(object, new_data, type = NULL, trees = NULL, ...)
## S3 method for class '_elnet'
multi_predict(object, new_data, type = NULL, penalty = NULL, ...)
## S3 method for class '_lognet'
multi_predict(object, new_data, type = NULL, penalty = NULL, ...)
## S3 method for class '_multnet'
multi_predict(object, new_data, type = NULL, penalty = NULL, ...)
## S3 method for class '_glmnetfit'
multi_predict(object, new_data, type = NULL, penalty = NULL, ...)
## S3 method for class '_earth'
multi_predict(object, new_data, type = NULL, num_terms = NULL, ...)
## S3 method for class '_torch_mlp'
multi_predict(object, new_data, type = NULL, epochs = NULL, ...)
## S3 method for class '_train.kknn'
multi_predict(object, new_data, type = NULL, neighbors = NULL, ...)
Arguments
object |
A model fit. |
... |
Optional arguments to pass to |
new_data |
A rectangular data object, such as a data frame. |
type |
A single character value or |
trees |
An integer vector for the number of trees in the ensemble. |
penalty |
A numeric vector of penalty values. |
num_terms |
An integer vector for the number of MARS terms to retain. |
epochs |
An integer vector for the number of training epochs. |
neighbors |
An integer vector for the number of nearest neighbors. |
Value
A tibble with the same number of rows as the data being predicted.
There is a list-column named .pred
that contains tibbles with
multiple rows per sub-model. Note that, within the tibbles, the column names
follow the usual standard based on prediction type
(i.e. .pred_class
for
type = "class"
and so on).
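As an illustration, here is a minimal sketch of sub-model prediction with a glmnet linear regression fit; it assumes the glmnet package is installed. A single fit is reused to predict at several penalty values:

library(parsnip)

lin_fit <-
  linear_reg(penalty = 0.01) %>%
  set_engine("glmnet") %>%
  fit(mpg ~ ., data = mtcars)

# One row per observation; the `.pred` list-column holds a tibble with
# one row per penalty value
multi_predict(lin_fit, new_data = mtcars[1:3, -1], penalty = c(0.01, 0.1))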
Multinomial regression
Description
multinom_reg()
defines a model that uses linear predictors to predict
multiclass data using the multinomial distribution. This function can fit
classification models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
¹ The default engine. ² Requires a parsnip extension package.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
multinom_reg(
mode = "classification",
engine = "nnet",
penalty = NULL,
mixture = NULL
)
Arguments
mode |
A single character string for the type of model. The only possible value for this model is "classification". |
engine |
A single character string specifying what computational engine
to use for fitting. Possible engines are listed below. The default for this
model is |
penalty |
A non-negative number representing the total
amount of regularization (specific engines only).
For |
mixture |
A number between zero and one (inclusive) giving the proportion of L1 regularization (i.e. lasso) in the model.
Available for specific engines only. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1
multinom_reg(argument = !!value)
This model fits a classification model for multiclass outcomes; for
binary outcomes, see logistic_reg()
.
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, nnet engine details
, brulee engine details
, glmnet engine details
, h2o engine details
, keras engine details
, spark engine details
Examples
show_engines("multinom_reg")
multinom_reg()
Naive Bayes models
Description
naive_Bayes()
defines a model that uses Bayes' theorem to compute the
probability of each class, given the predictor values. This function can fit
classification models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
¹ The default engine. ² Requires a parsnip extension package.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
naive_Bayes(
mode = "classification",
smoothness = NULL,
Laplace = NULL,
engine = "klaR"
)
Arguments
mode |
A single character string for the prediction outcome mode. The only possible value for this model is "classification". |
smoothness |
A non-negative number representing the relative smoothness of the class boundary. Smaller values result in more flexible boundaries and larger values generate class boundaries that are less adaptable. |
Laplace |
A non-negative value for the Laplace correction to smoothing low-frequency counts. |
engine |
A single character string specifying what computational engine to use for fitting. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1
naive_Bayes(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, klaR engine details
, h2o engine details
, naivebayes engine details
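Examples

A hedged sketch; specifying and printing the model works on its own, although fitting with the default "klaR" engine requires the discrim extension package to be loaded:

show_engines("naive_Bayes")

naive_Bayes(smoothness = 1.2)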
K-nearest neighbors
Description
nearest_neighbor()
defines a model that uses the K
most similar data
points from the training set to predict new samples. This function can
fit classification and regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
kknn¹
¹ The default engine.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
nearest_neighbor(
mode = "unknown",
engine = "kknn",
neighbors = NULL,
weight_func = NULL,
dist_power = NULL
)
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", or "classification". |
engine |
A single character string specifying what computational engine to use for fitting. |
neighbors |
A single integer for the number of neighbors
to consider (often called |
weight_func |
A single character for the type of kernel function used
to weight distances between samples. Valid choices are: |
dist_power |
A single number for the parameter used in calculating Minkowski distance. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1
nearest_neighbor(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, kknn engine details
Examples
show_engines("nearest_neighbor")
nearest_neighbor(neighbors = 11)
Null model
Description
null_model()
defines a simple, non-informative model. It doesn't have any
main arguments. This function can fit classification and regression models.
Usage
null_model(mode = "classification", engine = "parsnip")
Arguments
mode |
A single character string for the type of model. The only
possible values for this model are |
engine |
A single character string specifying what computational engine
to use for fitting. Possible engines are listed below. The default for this
model is |
Engine Details
Engines may have pre-set default arguments when executing the model fit call. For this type of model, the templates of the fit calls are below:
parsnip
null_model() %>%
  set_engine("parsnip") %>%
  set_mode("regression") %>%
  translate()

## Null Model Specification (regression)
##
## Computational engine: parsnip
##
## Model fit template:
## parsnip::nullmodel(x = missing_arg(), y = missing_arg())

null_model() %>%
  set_engine("parsnip") %>%
  set_mode("classification") %>%
  translate()

## Null Model Specification (classification)
##
## Computational engine: parsnip
##
## Model fit template:
## parsnip::nullmodel(x = missing_arg(), y = missing_arg())
Examples
null_model(mode = "regression")
Functions required for parsnip-adjacent packages
Description
These functions are helpful when creating new packages that will register new model specifications.
Usage
null_value(x)
show_fit(model, eng)
check_args(object, call = rlang::caller_env())
update_dot_check(...)
new_model_spec(
cls,
args,
eng_args,
mode,
user_specified_mode = TRUE,
method,
engine,
user_specified_engine = TRUE
)
check_final_param(x, call = rlang::caller_env())
update_main_parameters(args, param, call = rlang::caller_env())
update_engine_parameters(eng_args, fresh, ...)
print_model_spec(x, cls = class(x)[1], desc = get_model_desc(cls), ...)
update_spec(
object,
parameters,
args_enquo_list,
fresh,
cls,
...,
call = caller_env()
)
is_varying(x)
Fit a simple, non-informative model
Description
Fit a single mean or largest class model. nullmodel()
is the underlying
computational function for the null_model()
specification.
Usage
nullmodel(x, ...)
## Default S3 method:
nullmodel(x = NULL, y, ...)
## S3 method for class 'nullmodel'
print(x, ...)
## S3 method for class 'nullmodel'
predict(object, new_data = NULL, type = NULL, ...)
Arguments
x |
An optional matrix or data frame of predictors. These values are not used in the model fit |
... |
Optional arguments (not yet used) |
y |
A numeric vector (for regression) or factor (for classification) of outcomes |
object |
An object of class |
new_data |
A matrix or data frame of predictors (only used to determine the number of predictions to return) |
type |
Either "raw" (for regression), "class" or "prob" (for classification) |
Details
nullmodel()
emulates other model building functions, but returns the
simplest model possible given a training set: a single mean for numeric
outcomes and the most prevalent class for factor outcomes. When class
probabilities are requested, the percentage of the training set samples with
the most prevalent class is returned.
Value
The output of nullmodel()
is a list of class nullmodel
with elements
call |
the function call |
value |
the mean of
|
levels |
when |
pct |
when |
n |
the number of elements in |
predict.nullmodel()
returns either a factor or numeric vector
depending on the class of y
. All predictions are always the same.
Examples
outcome <- factor(sample(letters[1:2],
size = 100,
prob = c(.1, .9),
replace = TRUE))
useless <- nullmodel(y = outcome)
useless
predict(useless, matrix(NA, nrow = 5))
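For a numeric outcome, a small sketch showing that the prediction is always the training-set mean:

num_outcome <- rnorm(100, mean = 10)
reg_mod <- nullmodel(y = num_outcome)
reg_mod

# Three identical predictions, all equal to mean(num_outcome)
predict(reg_mod, matrix(NA, nrow = 3))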
Start an RStudio Addin that can write model specifications
Description
parsnip_addin()
starts a process in the RStudio IDE Viewer window
that allows users to write code for parsnip model specifications from
various R packages. The new code is written to the current document at the
location of the cursor.
Usage
parsnip_addin()
Partial least squares (PLS)
Description
pls()
defines a partial least squares model that uses latent variables to
model the data. It is similar to a supervised version of principal component analysis.
This function can fit classification and regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
mixOmics¹²
¹ The default engine. ² Requires a parsnip extension package for classification and regression.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
pls(
mode = "unknown",
predictor_prop = NULL,
num_comp = NULL,
engine = "mixOmics"
)
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", or "classification". |
predictor_prop |
The maximum proportion of original predictors that can have non-zero coefficients for each PLS component (via regularization). This value is used for all PLS components for X. |
num_comp |
The number of PLS components to retain. |
engine |
A single character string specifying what computational engine to use for fitting. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1
pls(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, mixOmics engine details
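Examples

A hedged sketch of specifying a PLS regression; the "mixOmics" engine is registered by the plsmod extension package:

library(plsmod)

pls(num_comp = 2, predictor_prop = 0.5) %>%
  set_mode("regression")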
Poisson regression models
Description
poisson_reg()
defines a generalized linear model for count data that follow
a Poisson distribution. This function can fit regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
¹ The default engine. ² Requires a parsnip extension package.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
poisson_reg(
mode = "regression",
penalty = NULL,
mixture = NULL,
engine = "glm"
)
Arguments
mode |
A single character string for the type of model. The only possible value for this model is "regression". |
penalty |
A non-negative number representing the total
amount of regularization ( |
mixture |
A number between zero and one (inclusive) giving the proportion of L1 regularization (i.e. lasso) in the model.
Available for |
engine |
A single character string specifying what computational engine to use for fitting. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1
poisson_reg(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, glm engine details
, gee engine details
, glmer engine details
, glmnet engine details
, h2o engine details
, hurdle engine details
, stan engine details
, stan_glmer engine details
, zeroinfl engine details
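Examples

A hedged sketch; the engines for poisson_reg() are registered by the poissonreg extension package:

library(poissonreg)

show_engines("poisson_reg")

poisson_reg() %>%
  set_engine("glm")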
Other predict methods.
Description
These are internal functions not meant to be directly called by the user.
Usage
## S3 method for class 'model_fit'
predict_class(object, new_data, ...)
## S3 method for class 'model_fit'
predict_classprob(object, new_data, ...)
## S3 method for class 'model_fit'
predict_hazard(object, new_data, eval_time, time = deprecated(), ...)
## S3 method for class 'model_fit'
predict_confint(object, new_data, level = 0.95, std_error = FALSE, ...)
predict_confint(object, ...)
predict_predint(object, ...)
## S3 method for class 'model_fit'
predict_predint(object, new_data, level = 0.95, std_error = FALSE, ...)
predict_predint(object, ...)
## S3 method for class 'model_fit'
predict_linear_pred(object, new_data, ...)
predict_linear_pred(object, ...)
## S3 method for class 'model_fit'
predict_numeric(object, new_data, ...)
predict_numeric(object, ...)
## S3 method for class 'model_fit'
predict_quantile(
object,
new_data,
quantile_levels = NULL,
quantile = deprecated(),
interval = "none",
level = 0.95,
...
)
## S3 method for class 'model_fit'
predict_survival(
object,
new_data,
eval_time,
time = deprecated(),
interval = "none",
level = 0.95,
...
)
predict_survival(object, ...)
## S3 method for class 'model_fit'
predict_time(object, new_data, ...)
predict_time(object, ...)
Arguments
object |
A model fit. |
new_data |
A rectangular data object, such as a data frame. |
... |
Additional
|
level |
A single numeric value between zero and one for the interval estimates. |
std_error |
A single logical for whether the standard error should be returned (assuming that the model can compute it). |
quantile, quantile_levels |
A vector of values between 0 and 1 for the
quantile to be predicted. If the model has a |
Model predictions
Description
Apply a model to create different types of predictions.
predict()
can be used for all types of models and uses the
"type" argument for more specificity.
Usage
## S3 method for class 'model_fit'
predict(object, new_data, type = NULL, opts = list(), ...)
## S3 method for class 'model_fit'
predict_raw(object, new_data, opts = list(), ...)
predict_raw(object, ...)
Arguments
object |
A model fit. |
new_data |
A rectangular data object, such as a data frame. |
type |
A single character value or |
opts |
A list of optional arguments to the underlying
predict function that will be used when |
... |
Additional
|
Details
For type = NULL, predict() uses:

- type = "numeric" for regression models,
- type = "class" for classification, and
- type = "time" for censored regression.
Interval predictions
When using type = "conf_int"
and type = "pred_int"
, the options
level
and std_error
can be used. The latter is a logical for an
extra column of standard error values (if available).
Censored regression predictions
For censored regression, a numeric vector for eval_time
is required when
survival or hazard probabilities are requested. The time values are required
to be unique, finite, non-missing, and non-negative. The predict()
functions will adjust the values to fit this specification by removing
offending points (with a warning).
predict.model_fit()
does not require the outcome to be present. For
performance metrics on the predicted survival probability, inverse probability
of censoring weights (IPCW) are required (see the tidymodels.org
reference
below). Those require the outcome and are thus not returned by predict()
.
They can be added via augment.model_fit()
if new_data
contains a column
with the outcome as a Surv
object.
Also, when type = "linear_pred"
, censored regression models will by default
be formatted such that the linear predictor increases with time. This may
have the opposite sign as what the underlying model's predict()
method
produces. Set increasing = FALSE
to suppress this behavior.
Value
With the exception of type = "raw", the result of predict.model_fit() is a tibble that has as many rows as there are rows in new_data and has standardized column names, as described below:
For type = "numeric"
, the tibble has a .pred
column for a single
outcome and .pred_Yname
columns for a multivariate outcome.
For type = "class"
, the tibble has a .pred_class
column.
For type = "prob"
, the tibble has .pred_classlevel
columns.
For type = "conf_int"
and type = "pred_int"
, the tibble has
.pred_lower
and .pred_upper
columns with an attribute for
the confidence level. In the case where intervals can be
produced for class probabilities (or other non-scalar outputs),
the columns are named .pred_lower_classlevel
and so on.
For type = "quantile"
, the tibble has a .pred
column, which is
a list-column. Each list element contains a tibble with columns
.pred
and .quantile
(and perhaps other columns).
For type = "time"
, the tibble has a .pred_time
column.
For type = "survival"
, the tibble has a .pred
column, which is
a list-column. Each list element contains a tibble with columns
.eval_time
and .pred_survival
(and perhaps other columns).
For type = "hazard"
, the tibble has a .pred
column, which is
a list-column. Each list element contains a tibble with columns
.eval_time
and .pred_hazard
(and perhaps other columns).
Using type = "raw"
with predict.model_fit()
will return
the unadulterated results of the prediction function.
In the case of Spark-based models, the same convention is used, except that 1) no dots appear in column names (since Spark table columns cannot contain dots) and 2) vectors are never returned, only type-specific prediction columns.
When the model fit failed and the error was captured, the
predict()
function will return the same structure as above but
filled with missing values. This does not currently work for
multivariate models.
References
https://www.tidymodels.org/learn/statistics/survival-metrics/
Examples
library(dplyr)
lm_model <-
linear_reg() %>%
set_engine("lm") %>%
fit(mpg ~ ., data = mtcars %>% dplyr::slice(11:32))
pred_cars <-
mtcars %>%
dplyr::slice(1:10) %>%
dplyr::select(-mpg)
predict(lm_model, pred_cars)
predict(
lm_model,
pred_cars,
type = "conf_int",
level = 0.90
)
predict(
lm_model,
pred_cars,
type = "raw",
opts = list(type = "terms")
)
Prepare data based on parsnip encoding information
Description
Prepare data based on parsnip encoding information
Usage
prepare_data(object, new_data)
Arguments
object |
A parsnip model object |
new_data |
A data frame |
Value
A data frame or matrix
Proportional hazards regression
Description
proportional_hazards()
defines a model for the hazard function
as a multiplicative function of covariates times a baseline hazard. This
function can fit censored regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
¹ The default engine. ² Requires a parsnip extension package.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
proportional_hazards(
mode = "censored regression",
engine = "survival",
penalty = NULL,
mixture = NULL
)
Arguments
mode |
A single character string for the prediction outcome mode. The only possible value for this model is "censored regression". |
engine |
A single character string specifying what computational engine to use for fitting. |
penalty |
A non-negative number representing the total amount of regularization (specific engines only). |
mixture |
A number between zero and one (inclusive) denoting the proportion of L1 regularization (i.e. lasso) in the model.
Available for specific engines only. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1
proportional_hazards(argument = !!value)
Since survival models typically involve censoring (and require the use of
survival::Surv()
objects), the fit.model_spec()
function will require that the
survival model be specified via the formula interface.
Proportional hazards models include the Cox model.
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, survival engine details
, glmnet engine details
Examples
show_engines("proportional_hazards")
proportional_hazards(mode = "censored regression")
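Because the formula interface is required, a fit sketch looks like the following; this assumes the censored extension package (which registers the censored regression engines) and the survival package are installed:

library(censored)
library(survival)

proportional_hazards() %>%
  set_engine("survival") %>%
  fit(Surv(time, status) ~ age + sex, data = lung)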
Random forest
Description
rand_forest()
defines a model that creates a large number of decision
trees, each independent of the others. The final prediction uses all
predictions from the individual trees and combines them. This function can fit
classification, regression, and censored regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
¹ The default engine. ² Requires a parsnip extension package for censored regression, classification, and regression.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
rand_forest(
mode = "unknown",
engine = "ranger",
mtry = NULL,
trees = NULL,
min_n = NULL
)
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", "classification", or "censored regression". |
engine |
A single character string specifying what computational engine to use for fitting. |
mtry |
An integer for the number of predictors that will be randomly sampled at each split when creating the tree models. |
trees |
An integer for the number of trees contained in the ensemble. |
min_n |
An integer for the minimum number of data points in a node that are required for the node to be split further. |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1
rand_forest(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, ranger engine details
, aorsf engine details
, h2o engine details
, partykit engine details
, randomForest engine details
, spark engine details
Examples
show_engines("rand_forest")
rand_forest(mode = "classification", trees = 2000)
Objects exported from other packages
Description
These objects are imported from other packages. Follow the links below to see their documentation.
- generics: augment, fit, fit_xy, glance, required_pkgs, tidy, varying_args
- ggplot2: autoplot
- hardhat: contr_one_hot, extract_fit_engine, extract_fit_time, extract_parameter_dials, extract_parameter_set_dials, extract_spec_parsnip, frequency_weights, importance_weights, tune
- magrittr: %>%
Repair a model call object
Description
When the user passes a formula to fit()
and the underlying model function
uses a formula, the call object produced by fit()
may not be usable by
other functions. For example, some arguments may still be quosures and the
data
portion of the call will not correspond to the original data.
Usage
repair_call(x, data)
Arguments
x |
A fitted parsnip model. An error will occur if the underlying model
does not have a |
data |
A data object that is relevant to the call. In most cases, this is the data frame that was given to parsnip for the model fit (i.e., the training set data). The name of this data object is inserted into the call. |
Details
repair_call() can adjust the model object's call to be usable by other functions and methods.
Value
A modified parsnip
fitted model.
Examples
fitted_model <-
linear_reg() %>%
set_engine("lm", model = TRUE) %>%
fit(mpg ~ ., data = mtcars)
# In this call, note that `data` is not `mtcars` and the `model = ~TRUE`
# indicates that the `model` argument is an rlang quosure.
fitted_model$fit$call
# All better:
repair_call(fitted_model, mtcars)$fit$call
Determine required packages for a model
Description
Usage
req_pkgs(x, ...)
Arguments
x |
A model specification or fit. |
... |
Not used. |
Details
This function has been deprecated in favor of required_pkgs()
.
Value
A character string of package names (if any).
Determine required packages for a model
Description
Determine required packages for a model
Usage
## S3 method for class 'model_spec'
required_pkgs(x, infra = TRUE, ...)
## S3 method for class 'model_fit'
required_pkgs(x, infra = TRUE, ...)
Arguments
x |
A model specification or fit. |
infra |
Should parsnip itself be included in the result? |
... |
Not used. |
Value
A character vector
Examples
should_fail <- try(required_pkgs(linear_reg(engine = NULL)), silent = TRUE)
should_fail
linear_reg() %>%
set_engine("glmnet") %>%
required_pkgs()
linear_reg() %>%
set_engine("glmnet") %>%
required_pkgs(infra = FALSE)
linear_reg() %>%
set_engine("lm") %>%
fit(mpg ~ ., data = mtcars) %>%
required_pkgs()
RuleFit models
Description
rule_fit()
defines a model that derives simple feature rules from a tree
ensemble and uses them as features in a regularized model. This function can
fit classification and regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
¹ The default engine. ² Requires a parsnip extension package for classification and regression.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
rule_fit(
mode = "unknown",
mtry = NULL,
trees = NULL,
min_n = NULL,
tree_depth = NULL,
learn_rate = NULL,
loss_reduction = NULL,
sample_size = NULL,
stop_iter = NULL,
penalty = NULL,
engine = "xrf"
)
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", or "classification". |
mtry |
A number for the number (or proportion) of predictors that will be randomly sampled at each split when creating the tree models (specific engines only). |
trees |
An integer for the number of trees contained in the ensemble. |
min_n |
An integer for the minimum number of data points in a node that is required for the node to be split further. |
tree_depth |
An integer for the maximum depth of the tree (i.e. number of splits) (specific engines only). |
learn_rate |
A number for the rate at which the boosting algorithm adapts from iteration-to-iteration (specific engines only). This is sometimes referred to as the shrinkage parameter. |
loss_reduction |
A number for the reduction in the loss function required to split further (specific engines only). |
sample_size |
A number for the number (or proportion) of data that is
exposed to the fitting routine. For |
stop_iter |
The number of iterations without improvement before stopping (specific engines only). |
penalty |
L1 regularization parameter. |
engine |
A single character string specifying what computational engine to use for fitting. |
Details
The RuleFit model creates a regression model of rules in two stages. The first stage uses a tree-based model to generate a set of rules that can be filtered, modified, and simplified. These rules are then added as predictors to a regularized generalized linear model that can also conduct feature selection during model training.
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1
rule_fit(argument = !!value)
References
Friedman, J. H., and Popescu, B. E. (2008). "Predictive learning via rule ensembles." The Annals of Applied Statistics, 2(3), 916-954.
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
xrf::xrf.formula()
, fit()
, set_engine()
, update()
, xrf engine details
, h2o engine details
Examples
show_engines("rule_fit")
rule_fit()
Change elements of a model specification
Description
set_args()
can be used to modify the arguments of a model specification while
set_mode()
is used to change the model's mode.
Usage
set_args(object, ...)
set_mode(object, mode, ...)
## S3 method for class 'model_spec'
set_mode(object, mode, quantile_levels = NULL, ...)
Arguments
object |
|
... |
One or more named model arguments. |
mode |
A character string for the model type (e.g. "classification" or "regression") |
quantile_levels |
A vector of values between zero and one (only for the
|
Details
set_args()
will replace existing values of the arguments.
Value
An updated model object.
Examples
rand_forest()
rand_forest() %>%
set_args(mtry = 3, importance = TRUE) %>%
set_mode("regression")
linear_reg() %>%
set_mode("quantile regression", quantile_levels = c(0.2, 0.5, 0.8))
Declare a computational engine and specific arguments
Description
set_engine()
is used to specify which package or system will be used
to fit the model, along with any arguments specific to that software.
Usage
set_engine(object, engine, ...)
Arguments
object |
|
engine |
A character string for the software that should be used to fit the model. This is highly dependent on the type of model (e.g. linear regression, random forest, etc.). |
... |
Any optional arguments associated with the chosen computational
engine. These are captured as quosures and can be tuned with |
Details
In parsnip,

- the model type differentiates basic modeling approaches, such as random forests, logistic regression, linear support vector machines, etc.,
- the mode denotes in what kind of modeling context it will be used (most commonly, classification or regression), and
- the computational engine indicates how the model is fit, such as with a specific R package implementation or even methods outside of R like Keras or Stan.
Use show_engines()
to get a list of possible engines for the model of
interest.
Modeling functions in parsnip separate model arguments into two categories:

- Main arguments are more commonly used and tend to be available across engines. These names are standardized to work with different engines in a consistent way, so you can use the parsnip main argument trees instead of the heterogeneous arguments for this parameter from the ranger and randomForest packages (num.trees and ntree, respectively). Set these in your model type function, like rand_forest(trees = 2000).
- Engine arguments are either specific to a particular engine or used more rarely; there is no change for these argument names from the underlying engine. The ... argument of set_engine() allows any engine-specific argument to be passed directly to the engine fitting function, like set_engine("ranger", importance = "permutation").
Value
An updated model specification.
Examples
# First, set main arguments using the standardized names
logistic_reg(penalty = 0.01, mixture = 1/3) %>%
# Now specify how you want to fit the model with another argument
set_engine("glmnet", nlambda = 10) %>%
translate()
# Many models have possible engine-specific arguments
decision_tree(tree_depth = 5) %>%
set_engine("rpart", parms = list(prior = c(.65,.35))) %>%
set_mode("classification") %>%
translate()
Tools to Register Models
Description
These functions are similar to constructors and can be used to validate that there are no conflicts with the underlying model structures used by the package.
Usage
set_new_model(model)
set_model_mode(model, mode)
set_model_engine(model, mode, eng)
set_model_arg(model, eng, parsnip, original, func, has_submodel)
set_dependency(model, eng, pkg = "parsnip", mode = NULL)
get_dependency(model)
set_fit(model, mode, eng, value)
get_fit(model)
set_pred(model, mode, eng, type, value)
get_pred_type(model, type)
show_model_info(model)
pred_value_template(pre = NULL, post = NULL, func, ...)
set_encoding(model, mode, eng, options)
get_encoding(model)
Arguments
model |
A single character string for the model type (e.g.
|
mode |
A single character string for the model mode (e.g. "regression"). |
eng |
A single character string for the model engine. |
parsnip |
A single character string for the "harmonized" argument name that parsnip exposes. |
original |
A single character string for the argument name that underlying model function uses. |
func |
A named character vector that describes how to call
a function. |
has_submodel |
A single logical for whether the argument can make predictions on multiple submodels at once. |
pkg |
An optional character string for a package name. |
value |
A list that conforms to the |
type |
A single character value for the type of prediction. Possible
values are: |
pre , post |
Optional functions for pre- and post-processing of prediction results. |
... |
Optional arguments that should be passed into the |
options |
A list of options for engine-specific preprocessing encodings. See Details below. |
Details
These functions are available for users to add their own models or engines (in a package or otherwise) so that they can be accessed using parsnip. This is more thoroughly documented on the package web site (see references below).
In short, parsnip
stores an environment object that contains
all of the information and code about how models are used (e.g.
fitting, predicting, etc). These functions can be used to add
models to that environment as well as helper functions that can
be used to make sure that the model data is in the right
format.
check_model_exists()
checks the model value and ensures that the model has
already been registered. check_model_doesnt_exist()
checks the model value
and also checks to see if it is novel in the environment.
The options for engine-specific encodings dictate how the predictors should be
handled. These options ensure that the data
that parsnip
gives to the underlying model allows for a model fit that is
as similar as possible to what it would have produced directly.
For example, if fit()
is used to fit a model that does not have
a formula interface, typically some predictor preprocessing must
be conducted. glmnet
is a good example of this.
There are four options that can be used for the encodings:
predictor_indicators
describes whether and how to create indicator/dummy
variables from factor predictors. There are three options: "none"
(do not
expand factor predictors), "traditional"
(apply the standard
model.matrix()
encodings), and "one_hot"
(create the complete set
including the baseline level for all factors). This encoding only affects
cases when fit.model_spec()
is used and the underlying model has an x/y
interface.
Another option is compute_intercept
; this controls whether model.matrix()
should include the intercept in its formula. This affects more than the
inclusion of an intercept column. With an intercept, model.matrix()
computes dummy variables for all but one factor levels. Without an
intercept, model.matrix()
computes a full set of indicators for the
first factor variable, but an incomplete set for the remainder.
Next, the option remove_intercept
will remove the intercept column
after model.matrix()
is finished. This can be useful if the model
function (e.g. lm()
) automatically generates an intercept.
Finally, allow_sparse_x
specifies whether the model function can natively
accommodate a sparse matrix representation for predictors during fitting
and tuning.
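To make the registration workflow concrete, here is a hedged sketch loosely following the "How to build a parsnip model" article referenced below. The model name ("mixture_da"), engine, and argument mapping are illustrative only, and a complete registration would also supply set_fit() and set_pred() definitions:

set_new_model("mixture_da")
set_model_mode(model = "mixture_da", mode = "classification")
set_model_engine("mixture_da", mode = "classification", eng = "mda")
set_dependency("mixture_da", eng = "mda", pkg = "mda")

# Map the harmonized argument name to the engine's native name; the
# `func` entry (a dials-style parameter constructor) is illustrative
set_model_arg(
  model = "mixture_da",
  eng = "mda",
  parsnip = "sub_classes",
  original = "subclasses",
  func = list(pkg = "dials", fun = "sub_classes"),
  has_submodel = FALSE
)

# Declare how predictors should be encoded for this engine
set_encoding(
  model = "mixture_da",
  mode = "classification",
  eng = "mda",
  options = list(
    predictor_indicators = "traditional",
    compute_intercept = TRUE,
    remove_intercept = TRUE,
    allow_sparse_x = FALSE
  )
)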
References
"How to build a parsnip model" https://www.tidymodels.org/learn/develop/models/
Examples
# set_new_model("shallow_learning_model")
# Show the information about a model:
show_model_info("rand_forest")
Set seed in R and TensorFlow at the same time
Description
Some Keras models require seeds to be set in both R and TensorFlow to achieve reproducible results. This function sets both seeds at the same time using version-appropriate functions.
Usage
set_tf_seed(seed)
Arguments
seed |
An integer value. |
Print the model call
Description
Print the model call
Usage
show_call(object)
Arguments
object |
A "model_spec" object. |
Value
A character string.
Display currently available engines for a model
Description
The possible engines for a model can depend on what packages are loaded.
Some parsnip extension packages add engines to existing models. For example,
the poissonreg package adds additional engines for the poisson_reg()
model and these are not available unless poissonreg is loaded.
Usage
show_engines(x)
Arguments
x |
The name of a parsnip model (e.g., "linear_reg", "mars", etc.) |
Value
A tibble.
Examples
show_engines("linear_reg")
Using sparse data with parsnip
Description
You can figure out whether a given model engine supports sparse data by
calling get_encoding("name of model")
and looking at the allow_sparse_x
column.
Details
Using sparse data for model fitting and prediction shouldn't require any
additional configurations. Just pass in a sparse matrix such as dgCMatrix
from the Matrix
package or a sparse tibble from the sparsevctrs package
to the data argument of fit()
, fit_xy()
, and predict()
.
Models that don't support sparse data will try to convert to non-sparse data, with warnings. If conversion isn't possible, an informative error will be thrown.
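A minimal sketch with an engine whose allow_sparse_x encoding is TRUE (xgboost); this assumes the Matrix and xgboost packages are installed:

library(Matrix)

# A sparse matrix of the mtcars predictors
x_sparse <- Matrix(as.matrix(mtcars[, -1]), sparse = TRUE)

boost_tree(trees = 10) %>%
  set_mode("regression") %>%
  set_engine("xgboost") %>%
  fit_xy(x = x_sparse, y = mtcars$mpg)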
Model Specification Checking
Description
The helpers spec_is_possible()
, spec_is_loaded()
, and
prompt_missing_implementation()
provide tooling for checking
model specifications. In addition to the spec
, engine
, and mode
arguments, the functions take arguments user_specified_engine
and
user_specified_mode
, denoting whether the user themselves has
specified the engine or mode, respectively.
Usage
spec_is_possible(
spec,
engine = spec$engine,
user_specified_engine = spec$user_specified_engine,
mode = spec$mode,
user_specified_mode = spec$user_specified_mode
)
spec_is_loaded(
spec,
engine = spec$engine,
user_specified_engine = spec$user_specified_engine,
mode = spec$mode,
user_specified_mode = spec$user_specified_mode
)
prompt_missing_implementation(
spec,
engine = spec$engine,
user_specified_engine = spec$user_specified_engine,
mode = spec$mode,
user_specified_mode = spec$user_specified_mode,
prompt,
...
)
Details
spec_is_possible()
checks against the union of
the current parsnip model environment and
the
model_info_table
of "pre-registered" model specifications
to determine whether a model is well-specified. See
parsnip:::model_info_table
for this table.
spec_is_loaded()
checks only against the current parsnip model environment.
spec_is_possible()
is executed automatically on new_model_spec()
,
set_mode()
, and set_engine()
, and spec_is_loaded()
is executed
automatically in print.model_spec()
, among other places. spec_is_possible()
should be used when a model specification is still "in progress" of being
specified, while spec_is_loaded()
should only be called when parsnip or an
extension receives some indication that the user is "done" specifying a model
specification: at print, fit, addition to a workflow, or extract_*()
, for
example.
When spec_is_loaded()
is FALSE
, the prompt_missing_implementation()
helper will construct an informative message to prompt users to load or
install needed packages. Its prompt
argument refers to the prompting
function to use, usually cli::cli_inform or cli::cli_abort, and the
ellipses are passed to that function.
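A small sketch of how these helpers behave; the bag_tree()/"rpart" combination is assumed here because that engine implementation is registered by the baguette extension package:

spec <- bag_tree() %>% set_engine("rpart")

# TRUE: this model/engine/mode combination is "pre-registered"
spec_is_possible(spec)

# FALSE unless baguette has been loaded to register the implementation
spec_is_loaded(spec)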
Wrapper for stan confidence intervals
Description
Wrapper for stan confidence intervals
Usage
stan_conf_int(object, newdata)
Arguments
object |
A stan model fit |
newdata |
A data set. |
Parametric survival regression
Description
This function is deprecated in favor of survival_reg()
which uses the
"censored regression"
mode.
surv_reg()
defines a parametric survival model.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
surv_reg(mode = "regression", engine = "survival", dist = NULL)
Arguments
mode |
A single character string for the prediction outcome mode. The only possible value for this model is "regression". |
engine |
A single character string specifying what computational engine to use for fitting. |
dist |
A character string for the probability distribution of the outcome. The default is "weibull". |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1
surv_reg(argument = !!value)
Since survival models typically involve censoring (and require the use of
survival::Surv()
objects), the fit.model_spec()
function will require that the
survival model be specified via the formula interface.
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
Parametric survival regression
Description
survival_reg()
defines a parametric survival model. This function can fit
censored regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
¹ The default engine. ² Requires a parsnip extension package.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
survival_reg(mode = "censored regression", engine = "survival", dist = NULL)
Arguments
mode |
A single character string for the prediction outcome mode. The only possible value for this model is "censored regression". |
engine |
A single character string specifying what computational engine to use for fitting. |
dist |
A character string for the probability distribution of the outcome. The default is "weibull". |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1
survival_reg(argument = !!value)
Since survival models typically involve censoring (and require the use of
survival::Surv()
objects), the fit.model_spec()
function will require that the
survival model be specified via the formula interface.
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, survival engine details
, flexsurv engine details
, flexsurvspline engine details
Examples
show_engines("survival_reg")
survival_reg(mode = "censored regression", dist = "weibull")
Linear support vector machines
Description
svm_linear()
defines a support vector machine model. For classification,
the model tries to maximize the width of the margin between classes (using a
linear class boundary). For regression, the model optimizes a robust loss
function that is only affected by very large model residuals and uses a
linear fit. This function can fit classification and regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
¹ The default engine.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
svm_linear(mode = "unknown", engine = "LiblineaR", cost = NULL, margin = NULL)
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", or "classification". |
engine |
A single character string specifying what computational engine to use for fitting. |
cost |
A positive number for the cost of predicting a sample within or on the wrong side of the margin |
margin |
A positive number for the epsilon in the SVM insensitive loss function (regression only) |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1
svm_linear(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, LiblineaR engine details
, kernlab engine details
Examples
show_engines("svm_linear")
svm_linear(mode = "classification")
Polynomial support vector machines
Description
svm_poly()
defines a support vector machine model. For classification,
the model tries to maximize the width of the margin between classes using a
polynomial class boundary. For regression, the model optimizes a robust loss
function that is only affected by very large model residuals and uses polynomial
functions of the predictors. This function can fit classification and
regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
¹ The default engine.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
svm_poly(
mode = "unknown",
engine = "kernlab",
cost = NULL,
degree = NULL,
scale_factor = NULL,
margin = NULL
)
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", or "classification". |
engine |
A single character string specifying what computational engine to use for fitting. |
cost |
A positive number for the cost of predicting a sample within or on the wrong side of the margin |
degree |
A positive number for polynomial degree. |
scale_factor |
A positive number for the polynomial scaling factor. |
margin |
A positive number for the epsilon in the SVM insensitive loss function (regression only) |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1
svm_poly(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, kernlab engine details
Examples
show_engines("svm_poly")
svm_poly(mode = "classification", degree = 1.2)
Radial basis function support vector machines
Description
svm_rbf()
defines a support vector machine model. For classification,
the model tries to maximize the width of the margin between classes using a
nonlinear class boundary. For regression, the model optimizes a robust loss
function that is only affected by very large model residuals and uses
nonlinear functions of the predictors. The function can fit classification
and regression models.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
¹ The default engine.
More information on how parsnip is used for modeling is at https://www.tidymodels.org/.
Usage
svm_rbf(
mode = "unknown",
engine = "kernlab",
cost = NULL,
rbf_sigma = NULL,
margin = NULL
)
Arguments
mode |
A single character string for the prediction outcome mode. Possible values for this model are "unknown", "regression", or "classification". |
engine |
A single character string specifying what computational engine
to use for fitting. Possible engines are listed below. The default for this
model is |
cost |
A positive number for the cost of predicting a sample within or on the wrong side of the margin |
rbf_sigma |
A positive number for the radial basis function. |
margin |
A positive number for the epsilon in the SVM insensitive loss function (regression only) |
Details
This function only defines what type of model is being fit. Once an engine
is specified, the method to fit the model is also defined. See
set_engine()
for more on setting the engine, including how to set engine
arguments.
The model is not trained or fit until the fit()
function is used
with the data.
Each of the arguments in this function other than mode
and engine
are
captured as quosures. To pass values
programmatically, use the injection operator like so:
value <- 1
svm_rbf(argument = !!value)
References
https://www.tidymodels.org, Tidy Modeling with R, searchable table of parsnip models
See Also
fit()
, set_engine()
, update()
, kernlab engine details
Examples
show_engines("svm_rbf")
svm_rbf(mode = "classification", rbf_sigma = 0.2)
tidy methods for glmnet models
Description
tidy() methods for the various glmnet models that return the coefficients for the specific penalty value used by the parsnip model fit.
Usage
## S3 method for class '_elnet'
tidy(x, penalty = NULL, ...)
## S3 method for class '_lognet'
tidy(x, penalty = NULL, ...)
## S3 method for class '_multnet'
tidy(x, penalty = NULL, ...)
## S3 method for class '_fishnet'
tidy(x, penalty = NULL, ...)
## S3 method for class '_coxnet'
tidy(x, penalty = NULL, ...)
Arguments
x |
A fitted parsnip model that used the glmnet engine. |
penalty |
A single numeric value. If none is given, the value specified in the model specification is used. |
... |
Not used |
Value
A tibble with columns term, estimate, and penalty. When a multinomial mode is used, an additional class column is included.
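For illustration, a minimal sketch, assuming the glmnet package is installed:
library(parsnip)
glmnet_fit <-
  linear_reg(penalty = 0.1) %>%
  set_engine("glmnet") %>%
  fit(mpg ~ ., data = mtcars)
tidy(glmnet_fit)                 # coefficients at penalty = 0.1 from the specification
tidy(glmnet_fit, penalty = 0.05) # override with a different single penalty value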
tidy methods for LiblineaR models
Description
tidy() methods for the various LiblineaR models that return the coefficients from the parsnip model fit.
Usage
## S3 method for class '_LiblineaR'
tidy(x, ...)
Arguments
x |
A fitted parsnip model that used the LiblineaR engine. |
... |
Not used |
Value
A tibble with columns term and estimate.
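A parallel sketch (illustrative, assuming the LiblineaR package is installed), using a two-class subset of iris:
library(parsnip)
two_class <- iris[iris$Species != "setosa", ]
two_class$Species <- droplevels(two_class$Species)
ll_fit <-
  svm_linear(mode = "classification") %>%
  set_engine("LiblineaR") %>%
  fit(Species ~ ., data = two_class)
tidy(ll_fit)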
Turn a parsnip model object into a tidy tibble
Description
This method tidies the model in a parsnip model object, if it exists.
Usage
## S3 method for class 'model_fit'
tidy(x, ...)
Arguments
x |
An object to be converted into a tidy tibble::tibble(). |
... |
Additional arguments to tidying method. |
Value
A tibble.
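A minimal sketch (illustrative; the tidier for the underlying lm object is supplied by the broom package, which is assumed to be installed):
library(parsnip)
lm_fit <- fit(linear_reg(), mpg ~ wt + hp, data = mtcars)
tidy(lm_fit)  # dispatches to the tidy() method for the underlying lm object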
Tidy method for null models
Description
Return the results of nullmodel as a tibble.
Usage
## S3 method for class 'nullmodel'
tidy(x, ...)
Arguments
x |
A nullmodel object. |
... |
Not used. |
Value
A tibble with column value.
Examples
nullmodel(mtcars[,-1], mtcars$mpg) %>% tidy()
Resolve a Model Specification for a Computational Engine
Description
translate() will translate a model specification into a code object that is specific to a particular engine (e.g. an R package). It translates generic parameters to their counterparts.
Usage
translate(x, ...)
## Default S3 method:
translate(x, engine = x$engine, ...)
Arguments
x |
A model specification. |
... |
Not currently used. |
engine |
The computational engine for the model (see ?set_engine). |
Details
translate() produces a template call that lacks the specific argument values (such as data, etc). These are filled in once fit() is called with the specifics of the data for the model. The call may also include tune() arguments if these are in the specification. To handle the tune() arguments, you need to use the tune package. For more information see https://www.tidymodels.org/start/tuning/
It does contain the resolved argument names that are specific to the model fitting function/engine.
This function can be useful when you need to understand how parsnip goes from a generic model specification to a model fitting function.
Note: this function is used internally and users should only use it to understand what the underlying syntax would be. It should not be used to modify the model specification.
Examples
lm_spec <- linear_reg(penalty = 0.01)
# `penalty` is translated to `lambda`
translate(lm_spec, engine = "glmnet")
# `penalty` not applicable for this model.
translate(lm_spec, engine = "lm")
# `penalty` is translated to `reg_param`
translate(lm_spec, engine = "spark")
# with a placeholder for an unknown argument value:
translate(linear_reg(penalty = tune(), mixture = tune()), engine = "glmnet")
Succinct summary of parsnip object
Description
type_sum controls how objects are shown when inside tibble columns.
Usage
## S3 method for class 'model_spec'
type_sum(x)
## S3 method for class 'model_fit'
type_sum(x)
Arguments
x |
A model_spec or model_fit object. |
Details
For model_spec objects, the summary is "spec[?]" or "spec[+]". The former indicates that either the model mode has not been declared or that the specification has tune() parameters. Otherwise, the latter is shown.
For fitted models, either "fit[x]" or "fit[+]" are used, where the "x" implies that the model fit failed in some way.
Value
A character value.
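A short sketch of these summaries (illustrative; pillar::type_sum() is the generic these methods implement):
library(parsnip)
pillar::type_sum(linear_reg())                  # "spec[+]": mode declared, no tune() parameters
pillar::type_sum(linear_reg(penalty = tune()))  # "spec[?]": specification has a tune() parameter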
Save information about models
Description
This function writes a tab delimited file to the package to capture information about the known models. This information includes packages in the tidymodels GitHub repository as well as packages that are known to work well with tidymodels packages (e.g. not only parsnip but also tune, etc.). There may be more model definitions in other extension packages that are not included here.
These data are used to document engines for each model function man page.
Usage
update_model_info_file(path = "inst/models.tsv")
Arguments
path |
A character string for the location of the tab delimited file. |
Details
See our model implementation guidelines on best practices for modeling and modeling packages.
It is highly recommended that the known parsnip extension packages are loaded.
The unexported parsnip function extensions() will list these.
Updating a model specification
Description
If parameters of a model specification need to be modified, update() can be used in lieu of recreating the object from scratch.
Usage
## S3 method for class 'bag_mars'
update(
object,
parameters = NULL,
num_terms = NULL,
prod_degree = NULL,
prune_method = NULL,
fresh = FALSE,
...
)
## S3 method for class 'bag_mlp'
update(
object,
parameters = NULL,
hidden_units = NULL,
penalty = NULL,
epochs = NULL,
fresh = FALSE,
...
)
## S3 method for class 'bag_tree'
update(
object,
parameters = NULL,
cost_complexity = NULL,
tree_depth = NULL,
min_n = NULL,
class_cost = NULL,
fresh = FALSE,
...
)
## S3 method for class 'bart'
update(
object,
parameters = NULL,
trees = NULL,
prior_terminal_node_coef = NULL,
prior_terminal_node_expo = NULL,
prior_outcome_range = NULL,
fresh = FALSE,
...
)
## S3 method for class 'boost_tree'
update(
object,
parameters = NULL,
mtry = NULL,
trees = NULL,
min_n = NULL,
tree_depth = NULL,
learn_rate = NULL,
loss_reduction = NULL,
sample_size = NULL,
stop_iter = NULL,
fresh = FALSE,
...
)
## S3 method for class 'C5_rules'
update(
object,
parameters = NULL,
trees = NULL,
min_n = NULL,
fresh = FALSE,
...
)
## S3 method for class 'cubist_rules'
update(
object,
parameters = NULL,
committees = NULL,
neighbors = NULL,
max_rules = NULL,
fresh = FALSE,
...
)
## S3 method for class 'decision_tree'
update(
object,
parameters = NULL,
cost_complexity = NULL,
tree_depth = NULL,
min_n = NULL,
fresh = FALSE,
...
)
## S3 method for class 'discrim_flexible'
update(
object,
num_terms = NULL,
prod_degree = NULL,
prune_method = NULL,
fresh = FALSE,
...
)
## S3 method for class 'discrim_linear'
update(
object,
penalty = NULL,
regularization_method = NULL,
fresh = FALSE,
...
)
## S3 method for class 'discrim_quad'
update(object, regularization_method = NULL, fresh = FALSE, ...)
## S3 method for class 'discrim_regularized'
update(
object,
frac_common_cov = NULL,
frac_identity = NULL,
fresh = FALSE,
...
)
## S3 method for class 'gen_additive_mod'
update(
object,
select_features = NULL,
adjust_deg_free = NULL,
parameters = NULL,
fresh = FALSE,
...
)
## S3 method for class 'linear_reg'
update(
object,
parameters = NULL,
penalty = NULL,
mixture = NULL,
fresh = FALSE,
...
)
## S3 method for class 'logistic_reg'
update(
object,
parameters = NULL,
penalty = NULL,
mixture = NULL,
fresh = FALSE,
...
)
## S3 method for class 'mars'
update(
object,
parameters = NULL,
num_terms = NULL,
prod_degree = NULL,
prune_method = NULL,
fresh = FALSE,
...
)
## S3 method for class 'mlp'
update(
object,
parameters = NULL,
hidden_units = NULL,
penalty = NULL,
dropout = NULL,
epochs = NULL,
activation = NULL,
learn_rate = NULL,
fresh = FALSE,
...
)
## S3 method for class 'multinom_reg'
update(
object,
parameters = NULL,
penalty = NULL,
mixture = NULL,
fresh = FALSE,
...
)
## S3 method for class 'naive_Bayes'
update(object, smoothness = NULL, Laplace = NULL, fresh = FALSE, ...)
## S3 method for class 'nearest_neighbor'
update(
object,
parameters = NULL,
neighbors = NULL,
weight_func = NULL,
dist_power = NULL,
fresh = FALSE,
...
)
## S3 method for class 'pls'
update(
object,
parameters = NULL,
predictor_prop = NULL,
num_comp = NULL,
fresh = FALSE,
...
)
## S3 method for class 'poisson_reg'
update(
object,
parameters = NULL,
penalty = NULL,
mixture = NULL,
fresh = FALSE,
...
)
## S3 method for class 'proportional_hazards'
update(
object,
parameters = NULL,
penalty = NULL,
mixture = NULL,
fresh = FALSE,
...
)
## S3 method for class 'rand_forest'
update(
object,
parameters = NULL,
mtry = NULL,
trees = NULL,
min_n = NULL,
fresh = FALSE,
...
)
## S3 method for class 'rule_fit'
update(
object,
parameters = NULL,
mtry = NULL,
trees = NULL,
min_n = NULL,
tree_depth = NULL,
learn_rate = NULL,
loss_reduction = NULL,
sample_size = NULL,
penalty = NULL,
fresh = FALSE,
...
)
## S3 method for class 'surv_reg'
update(object, parameters = NULL, dist = NULL, fresh = FALSE, ...)
## S3 method for class 'survival_reg'
update(object, parameters = NULL, dist = NULL, fresh = FALSE, ...)
## S3 method for class 'svm_linear'
update(
object,
parameters = NULL,
cost = NULL,
margin = NULL,
fresh = FALSE,
...
)
## S3 method for class 'svm_poly'
update(
object,
parameters = NULL,
cost = NULL,
degree = NULL,
scale_factor = NULL,
margin = NULL,
fresh = FALSE,
...
)
## S3 method for class 'svm_rbf'
update(
object,
parameters = NULL,
cost = NULL,
rbf_sigma = NULL,
margin = NULL,
fresh = FALSE,
...
)
Arguments
object |
A model specification. |
parameters |
A 1-row tibble or named list with main parameters to update. Use either parameters or the main arguments directly when updating. If the main arguments are used, these will supersede the values in parameters. Also, using engine arguments in this object will result in an error. |
num_terms |
The number of features that will be retained in the final model, including the intercept. |
prod_degree |
The highest possible interaction degree. |
prune_method |
The pruning method. |
fresh |
A logical for whether the arguments should be modified in-place or replaced wholesale. |
... |
Not used for update(). |
hidden_units |
An integer for the number of units in the hidden layer. |
penalty |
A non-negative number representing the amount of regularization used by some of the engines. |
epochs |
An integer for the number of training iterations. |
cost_complexity |
A positive number for the cost/complexity parameter (a.k.a. Cp). |
tree_depth |
An integer for maximum depth of the tree. |
min_n |
An integer for the minimum number of data points in a node that are required for the node to be split further. |
class_cost |
A non-negative scalar for a class cost (where a cost of 1 means no extra cost). This is useful for when the first level of the outcome factor is the minority class. If this is not the case, values between zero and one can be used to bias to the second level of the factor. |
trees |
An integer for the number of trees contained in the ensemble. |
prior_terminal_node_coef |
A coefficient for the prior probability that a node is a terminal node. |
prior_terminal_node_expo |
An exponent in the prior probability that a node is a terminal node. |
prior_outcome_range |
A positive value that defines the width of a prior that the predicted outcome is within a certain range. For regression it is related to the observed range of the data; the prior is the number of standard deviations of a Gaussian distribution defined by the observed range of the data. For classification, it is defined as the range of +/-3 (assumed to be on the logit scale). The default value is 2. |
mtry |
A number for the number (or proportion) of predictors that will be randomly sampled at each split when creating the tree models (specific engines only). |
learn_rate |
A number for the rate at which the boosting algorithm adapts from iteration-to-iteration (specific engines only). This is sometimes referred to as the shrinkage parameter. |
loss_reduction |
A number for the reduction in the loss function required to split further (specific engines only). |
sample_size |
A number for the number (or proportion) of data that is exposed to the fitting routine. For xgboost, the sampling is done at each iteration while C5.0 samples once during training. |
stop_iter |
The number of iterations without improvement before stopping (specific engines only). |
committees |
A non-negative integer (no greater than 100) for the number of members of the ensemble. |
neighbors |
An integer between zero and nine for the number of training set instances that are used to adjust the model-based prediction. |
max_rules |
The largest number of rules. |
regularization_method |
A character string for the type of regularized estimation. Possible values are: "diagonal", "min_distance", "shrink_cov", and "shrink_mean" (sparsediscrim engine only). |
frac_common_cov , frac_identity |
Numeric values between zero and one. |
select_features |
TRUE or FALSE. If TRUE, the model has the ability to eliminate a predictor (via penalization). Increasing adjust_deg_free will increase the likelihood of removing predictors. |
adjust_deg_free |
If select_features = TRUE, then acts as a multiplier for smoothness. Increase beyond 1 for more smoothness. |
mixture |
A number between zero and one (inclusive) denoting the proportion of L1 regularization (i.e. lasso) in the model.
Available for specific engines only. |
dropout |
A number between 0 (inclusive) and 1 denoting the proportion of model parameters randomly set to zero during model training. |
activation |
A single character string denoting the type of relationship between the original predictors and the hidden unit layer. The activation function between the hidden and output layers is automatically set to either "linear" or "softmax" depending on the type of outcome. Possible values depend on the engine being used. |
smoothness |
A non-negative number representing the relative smoothness of the class boundary. Smaller values result in more flexible boundaries and larger values generate class boundaries that are less adaptable. |
Laplace |
A non-negative value for the Laplace correction to smoothing low-frequency counts. |
weight_func |
A single character for the type of kernel function used to weight distances between samples. Valid choices are: "rectangular", "triangular", "epanechnikov", "biweight", "triweight", "cos", "inv", "gaussian", "rank", or "optimal". |
dist_power |
A single number for the parameter used in calculating Minkowski distance. |
predictor_prop |
The maximum proportion of original predictors that can have non-zero coefficients for each PLS component (via regularization). This value is used for all PLS components for X. |
num_comp |
The number of PLS components to retain. |
dist |
A character string for the probability distribution of the outcome. The default is "weibull". |
cost |
A positive number for the cost of predicting a sample within or on the wrong side of the margin. |
margin |
A positive number for the epsilon in the SVM insensitive loss function (regression only). |
degree |
A positive number for polynomial degree. |
scale_factor |
A positive number for the polynomial scaling factor. |
rbf_sigma |
A positive number for the radial basis function. |
Value
An updated model specification.
Examples
# ------------------------------------------------------------------------------
model <- C5_rules(trees = 10, min_n = 2)
model
update(model, trees = 1)
update(model, trees = 1, fresh = TRUE)
# ------------------------------------------------------------------------------
model <- cubist_rules(committees = 10, neighbors = 2)
model
update(model, committees = 1)
update(model, committees = 1, fresh = TRUE)
model <- pls(predictor_prop = 0.1)
model
update(model, predictor_prop = 1)
update(model, predictor_prop = 1, fresh = TRUE)
# ------------------------------------------------------------------------------
model <- rule_fit(trees = 10, min_n = 2)
model
update(model, trees = 1)
update(model, trees = 1, fresh = TRUE)
model <- boost_tree(mtry = 10, min_n = 3)
model
update(model, mtry = 1)
update(model, mtry = 1, fresh = TRUE)
param_values <- tibble::tibble(mtry = 10, tree_depth = 5)
model %>% update(param_values)
model %>% update(param_values, mtry = 3)
param_values$verbose <- 0
# Fails due to engine argument
# model %>% update(param_values)
model <- linear_reg(penalty = 10, mixture = 0.1)
model
update(model, penalty = 1)
update(model, penalty = 1, fresh = TRUE)
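To make the fresh semantics concrete, a short sketch (illustrative, not part of the original examples):
# With fresh = FALSE (the default), arguments not named in the call are retained;
# with fresh = TRUE, the specification is replaced wholesale.
spec <- svm_rbf(cost = 10, rbf_sigma = 0.1)
update(spec, cost = 1)               # rbf_sigma = 0.1 is kept
update(spec, cost = 1, fresh = TRUE) # rbf_sigma is reset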
A placeholder function for argument values
Description
varying() is used when a parameter will be specified at a later date.
Usage
varying()
Determine varying arguments
Description
varying_args() takes a model specification or a recipe and returns a tibble of information on all possible varying arguments and whether or not they are actually varying.
The id column is determined differently depending on whether a model_spec or a recipe is used. For a model_spec, the first class is used. For a recipe, the unique step id is used.
Usage
## S3 method for class 'model_spec'
varying_args(object, full = TRUE, ...)
## S3 method for class 'recipe'
varying_args(object, full = TRUE, ...)
## S3 method for class 'step'
varying_args(object, full = TRUE, ...)
Arguments
object |
A model_spec, recipe, or step object. |
full |
A single logical. Should all possible varying parameters be returned? If FALSE, only the parameters that are actually varying are returned. |
... |
Not currently used. |
Value
A tibble with columns for the parameter name (name), whether it contains any varying value (varying), the id for the object (id), and the class that was used to call the method (type).
Examples
# List all possible varying args for the random forest spec
rand_forest() %>% varying_args()
# mtry is now recognized as varying
rand_forest(mtry = varying()) %>% varying_args()
# Even engine specific arguments can vary
rand_forest() %>%
set_engine("ranger", sample.fraction = varying()) %>%
varying_args()
# List only the arguments that actually vary
rand_forest() %>%
set_engine("ranger", sample.fraction = varying()) %>%
varying_args(full = FALSE)
rand_forest() %>%
set_engine(
"randomForest",
strata = Class,
sampsize = varying()
) %>%
varying_args()
Boosted trees via xgboost
Description
xgb_train() and xgb_predict() are wrappers for xgboost tree-based models where all of the model arguments are in the main function.
Usage
xgb_train(
x,
y,
weights = NULL,
max_depth = 6,
nrounds = 15,
eta = 0.3,
colsample_bynode = NULL,
colsample_bytree = NULL,
min_child_weight = 1,
gamma = 0,
subsample = 1,
validation = 0,
early_stop = NULL,
counts = TRUE,
event_level = c("first", "second"),
...
)
xgb_predict(object, new_data, ...)
Arguments
x |
A data frame or matrix of predictors |
y |
A vector (factor or numeric) or matrix (numeric) of outcome data. |
max_depth |
An integer for the maximum depth of the tree. |
nrounds |
An integer for the number of boosting iterations. |
eta |
A numeric value between zero and one to control the learning rate. |
colsample_bynode |
Subsampling proportion of columns for each node
within each tree. See the |
colsample_bytree |
Subsampling proportion of columns for each tree.
See the |
min_child_weight |
A numeric value for the minimum sum of instance weights needed in a child to continue to split. |
gamma |
A number for the minimum loss reduction required to make a further partition on a leaf node of the tree |
subsample |
Subsampling proportion of rows. By default, all of the training data are used. |
validation |
The proportion of the data that are used for performance assessment and potential early stopping. |
early_stop |
An integer or NULL. If not NULL, it is the number of training iterations without improvement before stopping. If validation is used, performance is based on the validation set; otherwise, the training set is used. |
counts |
A logical. If FALSE, colsample_bynode and colsample_bytree are assumed to be proportions of the number of columns (instead of counts). |
event_level |
For binary classification, this is a single string of either "first" or "second" to pass along describing which level of the outcome should be considered the "event". |
... |
Other options to pass to xgb.train() or xgboost's method for predict(). |
new_data |
A rectangular data object, such as a data frame. |
Value
A fitted xgboost object.
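These wrappers are exported for developer use; a minimal sketch of calling them directly, assuming the xgboost package is installed:
library(parsnip)
x <- as.matrix(mtcars[, -1])
y <- mtcars$mpg
# Train a small boosted regression model, then predict on a few rows
booster <- xgb_train(x, y, nrounds = 20, max_depth = 3)
xgb_predict(booster, new_data = x[1:3, , drop = FALSE])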