Maintainer: | Mark van der Loo <mark.vanderloo@gmail.com> |
License: | GPL-3 |
Title: | Simple Imputation |
Type: | Package |
LazyLoad: | yes |
Description: | Easy to use interfaces to a number of imputation methods that fit in the not-a-pipe operator of the 'magrittr' package. |
Version: | 0.2.9 |
Depends: | R (≥ 4.0.0) |
Imports: | stats, utils, MASS, rpart, gower, VIM, randomForest, glmnet, missForest, norm |
URL: | https://github.com/markvanderloo/simputation |
BugReports: | https://github.com/markvanderloo/simputation/issues |
Suggests: | tinytest, knitr, rmarkdown, dplyr |
RoxygenNote: | 7.3.2 |
VignetteBuilder: | knitr |
NeedsCompilation: | no |
Packaged: | 2024-12-16 15:32:14 UTC; mark |
Author: | Mark van der Loo [aut, cre] |
Repository: | CRAN |
Date/Publication: | 2024-12-16 16:10:02 UTC |
simputation
Description
A package to make imputation simpler.
Details
To get started, see the introductory vignette.
Author(s)
Maintainer: Mark van der Loo mark.vanderloo@gmail.com
See Also
Useful links:
Report bugs at https://github.com/markvanderloo/simputation/issues
A deparse replacement that always returns a length-1 vector
Description
A deparse replacement that always returns a length-1 vector.
Usage
deparse(...)
Arguments
... |
Arguments passed on to |
Value
The deparsed string
Examples
long_formula <- this_is_a_formula_with_long_variables ~
the_test_is_checking_if_deparse_will_return +
multiple_strings_or_not
simputation:::deparse(long_formula)
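For context, the following base-R sketch shows the behaviour this replacement guards against: base::deparse splits a long expression over several strings, whereas collapsing the pieces yields a single string. The helper deparse_one is a hypothetical stand-in written for this illustration; it is not the package's internal implementation.

```r
long_formula <- this_is_a_formula_with_long_variables ~
  the_test_is_checking_if_deparse_will_return +
  multiple_strings_or_not

# base::deparse breaks long expressions into several strings
pieces <- base::deparse(long_formula)
length(pieces)

# hypothetical helper: collapse the pieces into one string
deparse_one <- function(...) paste(base::deparse(...), collapse = " ")
one <- deparse_one(long_formula)
length(one)
```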
Alternative to 'predict' returning values of correct type.
Description
The default predict function doesn't always return the predicted variable by default. For example, when estimating a binomial model using glm, by default the log-odds are returned. foretell wraps predict while setting options so that the actual predicted value is returned.
Usage
foretell(object, ...)
## Default S3 method:
foretell(object, ...)
## S3 method for class 'glm'
foretell(object, newdata = NULL, type, ...)
## S3 method for class 'rpart'
foretell(object, newdata, type, ...)
Arguments
object |
A model object,( |
... |
Further arguments passed to |
newdata |
|
type |
|
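To make the motivation in the Description concrete, here is a base-R sketch of what predict returns for a binomial glm by default and with type = "response". It does not call foretell itself; the simulated data and the variable names d, m, link and prob are assumptions made for the illustration.

```r
set.seed(1)
d <- data.frame(x = rnorm(50))
d$y <- rbinom(50, 1, plogis(0.5 + d$x))

m <- glm(y ~ x, family = binomial, data = d)

# default: predictions on the scale of the linear predictor (log-odds)
link <- predict(m, newdata = d)

# type = "response": predictions on the scale of the outcome (probabilities)
prob <- predict(m, newdata = d, type = "response")

# applying the inverse logit to the log-odds recovers the probabilities
all.equal(plogis(link), prob)
```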
Show the number of (remaining) missing values.
Description
Quick indication of the amount and location of missing values.
The function uses na_status to print the missing values, but returns the original x (invisibly) and therefore can be used in an imputation pipeline to peek at the NA's status.
Usage
glimpse_na(x, show_only_missing = TRUE, ...)
lhs %?>% rhs
Arguments
x |
an R object carrying data (e.g. |
show_only_missing |
if |
... |
arguments passed to |
lhs |
left hand side of pipe |
rhs |
right hand side of pipe |
Details
glimpse_na is especially helpful when interactively adding imputation methods. glimpse_na is named after glimpse in dplyr.
Operator %?>% is syntactic sugar: it inserts a glimpse_na in the pipe.
Examples
irisNA <- iris
irisNA[1:3,1] <- irisNA[3:7,2] <- NA
# How many NA's?
na_status(irisNA)
# add an imputation method one at a time
iris_imputed <-
irisNA |>
glimpse_na() # same as above
# ok, glimpse_na says "Sepal.Width" has NA's
# fix that:
iris_imputed <-
irisNA |>
impute_const(Sepal.Width ~ 7) |>
glimpse_na() # end NA
# Sepal.Length still has NA's
iris_imputed <-
irisNA |>
impute_const(Sepal.Width ~ 7) |>
impute_cart(Sepal.Length ~ .) |>
glimpse_na() # end NA
# in an existing imputation pipeline we can peek with
# glimpse_na or %?>%
iris_imputed <-
irisNA |>
glimpse_na() |> # shows the begin NA
impute_const(Sepal.Width ~ 7) |>
glimpse_na() |> # after 1 imputation
impute_cart(Sepal.Length ~ .) |>
glimpse_na() # end NA
# or
iris_imputed <-
irisNA %?>%
impute_const(Sepal.Width ~ 7) %?>%
impute_cart(Sepal.Length ~ .)
na_status(iris_imputed)
Impute using a previously fitted model.
Description
Impute one or more variables using a single R object representing a previously fitted model.
Usage
impute(dat, formula, predictor = foretell, ...)
impute_(dat, variables, model, predictor = foretell, ...)
Arguments
dat |
|
formula |
|
predictor |
|
... |
Extra arguments passed to |
variables |
|
model |
A model object. |
Model specification
Formulas are of the form
IMPUTED_VARIABLES ~ MODEL_OBJECT
The left-hand-side of the formula object lists the variable or variables to be imputed. The right-hand-side must be a model object for which an S3 predict method is implemented. Alternatively, one can specify a custom predicting function. This function must accept at least a model and a dataset, and return one predicted value for each row in the dataset.
foretell implements useful predict methods for cases where by default the predicted output is not of the same type as the predicted variable (e.g. when using certain link functions in glm).
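A custom predicting function as described above could look like the following sketch. The name round_predict is hypothetical, and the sketch assumes that the data is handed to the predictor through an argument called newdata, as in the foretell usage shown earlier.

```r
library(simputation)

irisNA <- iris
irisNA[1:3, "Sepal.Length"] <- NA

fit <- lm(Sepal.Length ~ Sepal.Width + Species, data = irisNA)

# hypothetical predictor: takes a model and a dataset, and returns
# one value per row, rounded to the precision of the measurements
round_predict <- function(object, newdata, ...) {
  round(predict(object, newdata = newdata, ...), 1)
}

out <- impute(irisNA, Sepal.Length ~ fit, predictor = round_predict)
head(out)
```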
Details
impute_ is an explicit version of impute that works better in programming contexts, especially in cases involving nonstandard evaluation.
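A sketch of the programmatic form, assuming that variables takes a character vector of column names. Because the variables and the model are passed as ordinary arguments, no formula has to be constructed.

```r
library(simputation)

irisNA <- iris
irisNA[1:3, "Sepal.Length"] <- NA

fit  <- lm(Sepal.Length ~ Sepal.Width + Species, data = irisNA)
vars <- "Sepal.Length"   # column names held in a variable

out <- impute_(irisNA, variables = vars, model = fit)
head(out)
```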
See Also
Other imputation: impute_cart(), impute_hotdeck, impute_lm()
Examples
irisNA <- iris
irisNA[1:3,1] <- NA
my_model <- lm(Sepal.Length ~ Sepal.Width + Species, data=irisNA)
impute(irisNA, Sepal.Length ~ my_model)
Decision Tree Imputation
Description
Imputation based on CART models or Random Forests.
Usage
impute_cart(
dat,
formula,
add_residual = c("none", "observed", "normal"),
cp,
na_action = na.rpart,
impute_all = FALSE,
...
)
impute_rf(
dat,
formula,
add_residual = c("none", "observed", "normal"),
na_action = na.omit,
impute_all = FALSE,
...
)
Arguments
dat |
|
formula |
|
add_residual |
|
cp |
The complexity parameter used to |
na_action |
|
impute_all |
|
... |
further arguments passed to
|
Model specification
Formulas are of the form
IMPUTED_VARIABLES ~ MODEL_SPECIFICATION [ | GROUPING_VARIABLES ]
The left-hand-side of the formula object lists the variable or variables to be imputed. Variables on the right-hand-side are used as predictors in the CART or random forest model.
If grouping variables are specified, the data set is split according to the values of those variables, and model estimation and imputation occur independently for each group.
Grouping using dplyr::group_by is also supported. If groups are defined in both the formula and using dplyr::group_by, the data is grouped by the union of grouping variables. Any missing value in one of the grouping variables results in an error.
Methodology
CART imputation by impute_cart can be used for numerical, categorical, or mixed data. Missing values are estimated using a Classification and Regression Tree as specified by Breiman, Friedman, Stone and Olshen (1984). This means that prediction is fairly robust against missingness in predictors.
Random Forest imputation with impute_rf can be used for numerical, categorical, or mixed data. Missing values are estimated using a Random Forest model as specified by Breiman (2001).
References
Breiman, L., Friedman, J., Stone, C.J. and Olshen, R.A., 1984. Classification and regression trees. CRC press.
Breiman, L., 2001. Random forests. Machine learning, 45(1), pp.5-32.
See Also
Other imputation: impute(), impute_hotdeck, impute_lm()
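As a rough illustration of the two tree-based methods described under Methodology, the sketch below imputes a numeric and a categorical variable with impute_cart and a numeric variable with impute_rf. The choice of predictors is arbitrary and made only for this example.

```r
library(simputation)

irisNA <- iris
irisNA[1:3, "Sepal.Length"] <- NA
irisNA[5:8, "Species"] <- NA

# CART: a numeric and a categorical target, each predicted
# from the two petal measurements
a <- impute_cart(irisNA, Sepal.Length + Species ~ Petal.Length + Petal.Width)

# random forest for the numeric target (seed set because the
# forest is built with random sampling)
set.seed(1)
b <- impute_rf(irisNA, Sepal.Length ~ Petal.Length + Petal.Width)

# remaining missing values per column
colSums(is.na(a))
colSums(is.na(b))
```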
Hot deck imputation
Description
Hot-deck imputation methods include random and sequential hot deck, k-nearest neighbours imputation and predictive mean matching.
Usage
impute_rhd(
dat,
formula,
pool = c("complete", "univariate", "multivariate"),
prob,
backend = getOption("simputation.hdbackend", default = c("simputation", "VIM")),
...
)
impute_shd(
dat,
formula,
pool = c("complete", "univariate", "multivariate"),
order = c("locf", "nocb"),
backend = getOption("simputation.hdbackend", default = c("simputation", "VIM")),
...
)
impute_pmm(
dat,
formula,
predictor = impute_lm,
pool = c("complete", "univariate", "multivariate"),
...
)
impute_knn(
dat,
formula,
pool = c("complete", "univariate", "multivariate"),
k = 5,
backend = getOption("simputation.hdbackend", default = c("simputation", "VIM")),
...
)
Arguments
dat |
|
formula |
|
pool |
|
prob |
|
backend |
|
... |
further arguments passed to
|
order |
|
predictor |
|
k |
|
Model specification
Formulas are of the form
IMPUTED_VARIABLES ~ MODEL_SPECIFICATION [ | GROUPING_VARIABLES ]
The left-hand-side of the formula object lists the variable or variables to be imputed. The interpretation of the independent variables on the right-hand-side depends on the imputation method.
- impute_rhd: Variables in MODEL_SPECIFICATION and/or GROUPING_VARIABLES are used to split the data set into groups prior to imputation. Use ~ 1 to specify that no grouping is to be applied.
- impute_shd: Variables in MODEL_SPECIFICATION are used to sort the data. When multiple variables are specified, each variable after the first serves as tie-breaker for the previous one.
- impute_knn: The predictors are used to determine Gower's distance between records (see gower_topn). This may include the variables to be imputed.
- impute_pmm: Predictive mean matching. The MODEL_SPECIFICATION is passed through to the predictor function.
If grouping variables are specified, the data set is split according to the values of those variables, and model estimation and imputation occur independently for each group.
Grouping using dplyr::group_by is also supported. If groups are defined in both the formula and using dplyr::group_by, the data is grouped by the union of grouping variables. Any missing value in one of the grouping variables results in an error.
Methodology
Random hot deck imputation with impute_rhd can be applied to numeric, categorical or mixed data. A missing value is copied from a sampled record. Optionally samples are taken within a group, or with non-uniform sampling probabilities. See Andridge and Little (2010) for an overview of hot deck imputation methods.
Sequential hot deck imputation with impute_shd can be applied to numeric, categorical, or mixed data. The dataset is sorted using the ‘predictor variables’. Missing values or combinations thereof are copied from the previous record where the value(s) are available in the case of LOCF and from the next record in the case of NOCB.
Predictive mean matching with impute_pmm can be applied to numeric data. Missing values or combinations thereof are first imputed using a predictive model. Next, these predictions are replaced with observed (combinations of) values nearest to the prediction. The nearest value is the observed value with the smallest absolute deviation from the prediction.
K-nearest neighbour imputation with impute_knn can be applied to numeric, categorical, or mixed data. For each record containing missing values, the k most similar completed records are determined based on Gower's (1971) similarity coefficient. From these records the actual donor is sampled.
Using the VIM backend
The VIM package has efficient implementations of several popular imputation methods. In particular, its random and sequential hotdeck implementation is faster and more memory-efficient than that of the current package. Moreover, VIM offers more fine-grained control over the imputation process than simputation.
If you have this package installed, it can be used by setting backend="VIM" for functions supporting this option. Alternatively, one can set options(simputation.hdbackend="VIM") so it becomes the default.
Simputation will map the simputation call to a function in the VIM package. In particular:
- impute_rhd is mapped to VIM::hotdeck where imputed variables are passed to the variable argument and the union of predictor and grouping variables are passed to domain_var. Extra arguments in ... are passed to VIM::hotdeck as well. Argument pool is ignored.
- impute_shd is mapped to VIM::hotdeck where imputed variables are passed to the variable argument, predictor variables to ord_var and grouping variables to domain_var. Extra arguments in ... are passed to VIM::hotdeck as well. Arguments pool and order are ignored. In VIM the donor pool is determined on a per-variable basis, equivalent to setting pool="univariate" with the simputation backend. VIM is LOCF-based. Differences between simputation and VIM are likely to occur when the sorting variables contain missing values.
- impute_knn is mapped to VIM::kNN where imputed variables are passed to variable, predictor variables are passed to dist_var and grouping variables are ignored with a message. Extra arguments in ... are passed to VIM::kNN as well. Argument pool is ignored. Note that simputation adheres strictly to Gower's original definition of the distance measure, while VIM uses a generalized variant that can take ordered factors into account.
By default, VIM's imputation functions add indicator variables to the original data to trace what values have been imputed. This is switched off by default for consistency with the rest of the simputation package, but it may be turned on again by setting imp_var=TRUE.
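A minimal sketch of the session-wide setting described in this section. It only sets, reads and restores the option, and does not call any imputation function; the variable names old and current are chosen for the illustration. For a single call, the same choice is made with backend="VIM".

```r
# set the default hot deck backend, remembering the previous value
old <- options(simputation.hdbackend = "VIM")

# this is the value that the backend arguments pick up by default
current <- getOption("simputation.hdbackend")
current

# restore the previous setting
options(old)
```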
References
Andridge, R.R. and Little, R.J., 2010. A review of hot deck imputation for survey non-response. International statistical review, 78(1), pp.40-64.
Gower, J.C., 1971. A general coefficient of similarity and some of its properties. Biometrics, pp.857–871.
See Also
Other imputation: impute(), impute_cart(), impute_lm()
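The four hot deck methods documented here can be tried side by side with a sketch like the one below, using the default simputation backend. The choice of predictors and of the rows set to NA is arbitrary and made only for this example.

```r
library(simputation)

set.seed(1)  # donors are sampled at random
irisNA <- iris
irisNA[c(1, 60, 120), "Sepal.Length"] <- NA

# random hot deck: donors are drawn within the same Species
a <- impute_rhd(irisNA, Sepal.Length ~ Species)

# sequential hot deck: sort by Petal.Length, copy from the previous record
b <- impute_shd(irisNA, Sepal.Length ~ Petal.Length, order = "locf")

# k nearest neighbours: donor sampled from the k = 3 closest records
d <- impute_knn(irisNA, Sepal.Length ~ Petal.Length + Petal.Width + Species, k = 3)

# predictive mean matching with the default impute_lm predictor
e <- impute_pmm(irisNA, Sepal.Length ~ Petal.Length + Species)

# remaining missing values after random hot deck
colSums(is.na(a))
```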
(Robust) Linear Regression Imputation
Description
Regression imputation methods including linear regression, robust linear regression with M-estimators, regularized regression with lasso/elasticnet/ridge regression.
Usage
impute_lm(
dat,
formula,
add_residual = c("none", "observed", "normal"),
na_action = na.omit,
impute_all = FALSE,
...
)
impute_rlm(
dat,
formula,
add_residual = c("none", "observed", "normal"),
na_action = na.omit,
impute_all = FALSE,
...
)
impute_en(
dat,
formula,
add_residual = c("none", "observed", "normal"),
na_action = na.omit,
impute_all = FALSE,
family = c("gaussian", "poisson"),
s = 0.01,
...
)
Arguments
dat |
|
formula |
|
add_residual |
|
na_action |
|
impute_all |
|
... |
further arguments passed to |
family |
Response type for elasticnet / lasso regression. For
|
s |
The value of |
Value
dat, but imputed where possible.
Model specification
Formulas are of the form
IMPUTED_VARIABLES ~ MODEL_SPECIFICATION [ | GROUPING_VARIABLES ]
The left-hand-side of the formula object lists the variable or variables to be imputed. The right-hand side, excluding the optional GROUPING_VARIABLES, is the model specification for the underlying predictor.
If grouping variables are specified, the data set is split according to the values of those variables, and model estimation and imputation occur independently for each group.
Grouping using dplyr::group_by is also supported. If groups are defined in both the formula and using dplyr::group_by, the data is grouped by the union of grouping variables. Any missing value in one of the grouping variables results in an error.
Grouping is ignored for impute_const.
Methodology
Linear regression model imputation with impute_lm can be used to impute numerical variables based on numerical and/or categorical predictors. Several common imputation methods, including ratio and (group) mean imputation can be expressed this way. See lm for details on possible model specification.
Robust linear regression through M-estimation with impute_rlm can be used to impute numerical variables employing numerical and/or categorical predictors. In M-estimation, the minimization of the squares of residuals is replaced with an alternative convex function of the residuals that decreases the influence of outliers.
Also see e.g. Huber (2011).
Lasso/elastic net/ridge regression imputation with impute_en can be used to impute numerical variables employing numerical and/or categorical predictors. For this method, the regression coefficients are found by minimizing the sum of squares of residuals augmented with a penalty term depending on the size of the coefficients. For lasso regression (Tibshirani, 1996), the penalty term is the sum of absolute values of the coefficients. For ridge regression (Hoerl and Kennard, 1970), the penalty term is the sum of squares of the coefficients. Elasticnet regression (Zou and Hastie, 2005) allows switching from lasso to ridge by penalizing by a weighted sum of the sum-of-squares and sum-of-absolute-values terms.
References
Huber, P.J., 2011. Robust statistics (pp. 1248-1251). Springer Berlin Heidelberg.
Hoerl, A.E. and Kennard, R.W., 1970. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), pp.55-67.
Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pp.267-288.
Zou, H. and Hastie, T., 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), pp.301-320.
See Also
Getting started with simputation,
Other imputation: impute(), impute_cart(), impute_hotdeck
Examples
data(iris)
irisNA <- iris
irisNA[1:4, "Sepal.Length"] <- NA
irisNA[3:7, "Sepal.Width"] <- NA
# impute a single variable (Sepal.Length)
i1 <- impute_lm(irisNA, Sepal.Length ~ Sepal.Width + Species)
# impute both Sepal.Length and Sepal.Width, using robust linear regression
i2 <- impute_rlm(irisNA, Sepal.Length + Sepal.Width ~ Species + Petal.Length)
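The Methodology section notes that (group) mean imputation can be expressed as a linear model. One way to sketch this is with an intercept-only model, optionally estimated per group; the names m1, m2, overall_mean and setosa_mean are chosen for the illustration.

```r
library(simputation)

irisNA <- iris
irisNA[1:4, "Sepal.Length"] <- NA

# mean imputation: an intercept-only model predicts the observed mean
m1 <- impute_lm(irisNA, Sepal.Length ~ 1)

# group mean imputation: the same model, estimated per Species
m2 <- impute_lm(irisNA, Sepal.Length ~ 1 | Species)

# reference values computed directly from the observed data
overall_mean <- mean(irisNA$Sepal.Length, na.rm = TRUE)
setosa_mean  <- mean(irisNA$Sepal.Length[irisNA$Species == "setosa"], na.rm = TRUE)

head(m1)
head(m2)
```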
Impute (group-wise) medians
Description
Impute medians or group-wise medians.
Usage
impute_median(
dat,
formula,
add_residual = c("none", "observed", "normal"),
type = 7,
...
)
Arguments
dat |
|
formula |
|
add_residual |
|
type |
|
... |
Currently not used. |
Model Specification
Formulas are of the form
IMPUTED_VARIABLES ~ MODEL_SPECIFICATION [ | GROUPING_VARIABLES ]
The left-hand-side of the formula object lists the variable or variables to be imputed. Variables in MODEL_SPECIFICATION and/or GROUPING_VARIABLES are used to split the data set into groups prior to imputation. Use ~ 1 to specify that no grouping is to be applied.
Examples
# group-wise median imputation
irisNA <- iris
irisNA[1:3,1] <- irisNA[4:7,2] <- NA
a <- impute_median(irisNA, Sepal.Length ~ Species)
head(a)
# group-wise median imputation, all variables except species
a <- impute_median(irisNA, . - Species ~ Species)
head(a)
Multivariate, model-based imputation
Description
Models that simultaneously optimize imputation of multiple variables. Methods include imputation based on EM-estimation of multivariate normal parameters, imputation based on iterative Random Forest estimates and stochastic imputation based on bootstrapped EM-estimation of multivariate normal parameters.
Usage
impute_em(dat, formula, verbose = 0, ...)
impute_mf(dat, formula, ...)
Arguments
dat |
|
formula |
|
verbose |
|
... |
Options passed to
|
Model specification
Formulas are of the form
[IMPUTED_VARIABLES] ~ MODEL_SPECIFICATION [ | GROUPING_VARIABLES ]
When IMPUTED_VARIABLES is empty, every variable in MODEL_SPECIFICATION will be imputed. When IMPUTED_VARIABLES is specified, all variables in IMPUTED_VARIABLES and MODEL_SPECIFICATION are part of the model, but only the IMPUTED_VARIABLES are imputed in the output.
GROUPING_VARIABLES specify what categorical variables are used to split-impute-combine the data. Grouping using dplyr::group_by is also supported. If groups are defined in both the formula and using dplyr::group_by, the data is grouped by the union of grouping variables. Any missing value in one of the grouping variables results in an error.
Methodology
EM-based imputation with impute_em only works for numerical variables. These variables are assumed to follow a multivariate normal distribution for which the means and covariance matrix are estimated based on the EM-algorithm of Dempster, Laird and Rubin (1977). The imputations are the expected values for missing values, conditional on the value of the estimated parameters.
Multivariate Random Forest imputation with impute_mf works for numerical, categorical or mixed data types. It is based on the algorithm of Stekhoven and Buehlmann (2012). Missing values are imputed using a rough guess after which a predictive random forest is trained and used to re-impute the missing values. This is iterated until convergence.
References
Dempster, Arthur P., Nan M. Laird, and Donald B. Rubin. "Maximum likelihood from incomplete data via the EM algorithm." Journal of the royal statistical society. Series B (methodological) (1977): 1-38.
Stekhoven, D.J. and Buehlmann, P., 2012. MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), pp.112-118.
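A sketch of both multivariate methods under the formula convention described in Model specification. The variable choices are arbitrary and made only for this example. impute_em is given numeric variables only, while impute_mf also receives the categorical Species as part of the model.

```r
library(simputation)

irisNA <- iris
irisNA[1:3, "Sepal.Length"] <- NA
irisNA[5:8, "Sepal.Width"] <- NA

# EM: numeric variables only; both incomplete columns are imputed
a <- impute_em(irisNA, Sepal.Length + Sepal.Width ~ Petal.Length + Petal.Width)

# iterative random forest: mixed types allowed; only Sepal.Length
# is listed as an imputed variable (seed set because forests are random)
set.seed(1)
b <- impute_mf(irisNA, Sepal.Length ~ Sepal.Width + Petal.Length + Species)

# remaining missing values per column
colSums(is.na(a))
colSums(is.na(b))
```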
Impute by variable derivation
Description
Impute missing values by a constant, by copying another variable, or by computing transformations from other variables.
Usage
impute_proxy(dat, formula, add_residual = c("none", "observed", "normal"), ...)
impute_const(dat, formula, add_residual = c("none", "observed", "normal"), ...)
Arguments
dat |
|
formula |
|
add_residual |
|
... |
Currently unused |
Model Specification
Formulas are of the form
IMPUTED_VARIABLES ~ MODEL_SPECIFICATION [ | GROUPING_VARIABLES ]
The left-hand-side of the formula object lists the variable or variables to be imputed.
For impute_const, the MODEL_SPECIFICATION is a single value and GROUPING_VARIABLES are ignored.
For impute_proxy, the MODEL_SPECIFICATION is a variable or expression in terms of variables in the dataset that must result in either a single number or a vector of length nrow(dat).
If grouping variables are specified, the data set is split according to the values of those variables, and model estimation and imputation occur independently for each group.
Grouping using dplyr::group_by is also supported. If groups are defined in both the formula and using dplyr::group_by, the data is grouped by the union of grouping variables. Any missing value in one of the grouping variables results in an error.
Examples
irisNA <- iris
irisNA[1:3,1] <- irisNA[3:7,2] <- NA
# impute a constant
a <- impute_const(irisNA, Sepal.Width ~ 7)
head(a)
a <- impute_proxy(irisNA, Sepal.Width ~ 7)
head(a)
# copy a value from another variable (where available)
a <- impute_proxy(irisNA, Sepal.Width ~ Sepal.Length)
head(a)
# group mean imputation
a <- impute_proxy(irisNA
, Sepal.Length ~ mean(Sepal.Length,na.rm=TRUE) | Species)
head(a)
# random hot deck imputation
a <- impute_proxy(irisNA, Sepal.Length ~ mean(Sepal.Length, na.rm=TRUE)
, add_residual = "observed")
# ratio imputation (but use impute_lm for that)
a <- impute_proxy(irisNA,
Sepal.Length ~ mean(Sepal.Length,na.rm=TRUE)/mean(Sepal.Width,na.rm=TRUE) * Sepal.Width)
Show the number of (remaining) missing values.
Description
Quick indication of the amount and location of missing values.
Usage
na_status(
x,
show_only_missing = TRUE,
sort_columns = show_only_missing,
show_message = TRUE,
...
)
Arguments
x |
an R object carrying data (e.g. |
show_only_missing |
if |
sort_columns |
If |
show_message |
if |
... |
arguments to be passed to other methods. |
Value
data.frame with the column and number of NA's
Examples
irisNA <- iris
irisNA[1:3,1] <- irisNA[3:7,2] <- NA
na_status(irisNA)
# impute a constant
a <- impute_const(irisNA, Sepal.Width ~ 7)
na_status(a)
Rough imputation for handling missing predictors.
Description
This function is re-exported from randomForest::na.roughfix when available. Otherwise it will throw a warning and resort to options("na.action").
Usage
na.roughfix(object, ...)
Arguments
object |
an R object carrying data (e.g. |
... |
arguments to be passed to other methods. |
print output of simputation_capabilities
Description
print output of simputation_capabilities
Usage
## S3 method for class 'simputation.capabilities'
print(x, ...)
Arguments
x |
an R object |
... |
unused |
Objects exported from other packages
Description
These objects are imported from other packages. Follow the links below to see their documentation.
- rpart
Capabilities depending on suggested packages.
Description
This function has become unnecessary as of simputation 0.2.8 and higher. It will be removed from future versions.
Usage
simputation_capabilities()
simputation_suggests(lib.loc = NULL)
Arguments
lib.loc |
Where to check whether a package is installed (passed to
|
Value
For simputation_capabilities: a named character vector of class simputation.capabilities. The class attribute allows pretty-printing of the output.
For simputation_suggests: a logical vector, stating which suggested packages are currently installed (TRUE) or not (FALSE).
Details
simputation_capabilities calls every impute_ function and grabs the warning message (if any) stating that a package is missing.
simputation_suggests checks which of the suggested packages implementing statistical models are available.