Title: | Extra 'Recipes' for Text Processing |
Version: | 1.1.0 |
Description: | Converting text to numerical features requires specifically created procedures, which are implemented as steps according to the 'recipes' package. These steps allow for tokenization, filtering, counting (tf and tfidf) and feature hashing. |
License: | MIT + file LICENSE |
URL: | https://github.com/tidymodels/textrecipes, https://textrecipes.tidymodels.org/ |
BugReports: | https://github.com/tidymodels/textrecipes/issues |
Depends: | R (≥ 3.6), recipes (≥ 1.2.0) |
Imports: | cli, lifecycle, dplyr, generics (≥ 0.1.0), magrittr, Matrix, purrr, rlang (≥ 1.1.0), SnowballC, sparsevctrs (≥ 0.3.0), tibble, tokenizers, vctrs, glue |
Suggests: | covr, data.table, dials (≥ 1.2.0), hardhat, janitor, knitr, modeldata, reticulate, rmarkdown, sentencepiece, spacyr, stopwords, stringi, testthat (≥ 3.0.0), text2vec, tokenizers.bpe, udpipe, wordpiece |
VignetteBuilder: | knitr |
Config/Needs/website: | tidyverse/tidytemplate, reticulate |
Config/testthat/edition: | 3 |
Encoding: | UTF-8 |
LazyData: | true |
RoxygenNote: | 7.3.2 |
SystemRequirements: | "GNU make" |
NeedsCompilation: | yes |
Packaged: | 2025-03-18 15:37:27 UTC; emilhvitfeldt |
Author: | Emil Hvitfeldt |
Maintainer: | Emil Hvitfeldt <emil.hvitfeldt@posit.co> |
Repository: | CRAN |
Date/Publication: | 2025-03-18 16:10:02 UTC |
textrecipes: Extra 'Recipes' for Text Processing
Description
Converting text to numerical features requires specifically created procedures, which are implemented as steps according to the 'recipes' package. These steps allow for tokenization, filtering, counting (tf and tfidf) and feature hashing.
Author(s)
Maintainer: Emil Hvitfeldt emil.hvitfeldt@posit.co (ORCID)
Other contributors:
Michael W. Kearney kearneymw@missouri.edu (author of count_functions) [copyright holder]
Posit Software, PBC [copyright holder, funder]
See Also
Useful links:
Report bugs at https://github.com/tidymodels/textrecipes/issues
Pipe operator
Description
See magrittr::%>%
for details.
Usage
lhs %>% rhs
Role Selection
Description
all_tokenized() selects all token variables, all_tokenized_predictors() selects all predictor token variables.
Usage
all_tokenized()
all_tokenized_predictors()
See Also
List of all feature counting functions
Description
List of all feature counting functions
Usage
count_functions
Format
Named list of all feature counting functions
n_words
Number of words.
n_uq_words
Number of unique words.
n_charS
Number of characters. Not counting urls, hashtags, mentions or white spaces.
n_uq_charS
Number of unique characters. Not counting urls, hashtags, mentions or white spaces.
n_digits
Number of digits.
n_hashtags
Number of hashtags, word preceded by a '#'.
n_uq_hashtags
Number of unique hashtags, word preceded by a '#'.
n_mentions
Number of mentions, word preceded by a '@'.
n_uq_mentions
Number of unique mentions, word preceded by a '@'.
n_commas
Number of commas.
n_periods
Number of periods.
n_exclaims
Number of exclamation points.
n_extraspaces
Number of times more than one consecutive space has been used.
n_caps
Number of upper case characters.
n_lowers
Number of lower case characters.
n_urls
Number of urls.
n_uq_urls
Number of unique urls.
n_nonasciis
Number of non ascii characters.
n_puncts
Number of punctuation characters, not including exclamation points, periods and commas.
first_person
Number of "first person" words.
first_personp
Number of "first person plural" words.
second_person
Number of "second person" words.
second_personp
Number of "second person plural" words.
third_person
Number of "third person" words.
to_be
Number of "to be" words.
prepositions
Number of preposition words.
Details
In this function we refer to "first person", "first person plural" and so on. This list describes what words are contained in each group.
- first person
I, me, myself, my, mine, this.
- first person plural
we, us, our, ours, these.
- second person
you, yours, your, yourself.
- second person plural
he, she, it, its, his, hers.
- third person
they, them, theirs, their, they're, their's, those, that.
- to be
am, is, are, was, were, being, been, be, were, be.
- prepositions
about, below, excepting, off, toward, above, beneath, on, under, across, from, onto, underneath, after, between, in, out, until, against, beyond, outside, up, along, but, inside, over, upon, among, by, past, around, concerning, regarding, with, at, despite, into, since, within, down, like, through, without, before, during, near, throughout, behind, except, of, to, for.
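To make the idea of a counting function concrete, here is a minimal base R sketch of three counters. The helpers `split_ws()`, `count_words()`, `count_hashtags()` and `count_first_person()` are hypothetical names introduced for illustration and are not the functions stored in `count_functions`; splitting on whitespace and stripping surrounding punctuation are assumptions, so results may differ from the package's own counters.

```r
# Hypothetical helpers; NOT the implementations stored in count_functions.
first_person_words <- c("i", "me", "myself", "my", "mine", "this")

# Split one string on runs of whitespace and drop empty pieces.
split_ws <- function(x) {
  tokens <- strsplit(x, "[[:space:]]+")[[1]]
  tokens[nzchar(tokens)]
}

# Number of whitespace-separated words.
count_words <- function(x) length(split_ws(x))

# Number of words preceded by a '#'.
count_hashtags <- function(x) sum(grepl("^#[[:alnum:]_]+", split_ws(x)))

# Lower-case, strip surrounding punctuation, then match against the word list.
count_first_person <- function(x) {
  tokens <- tolower(gsub("^[[:punct:]]+|[[:punct:]]+$", "", split_ws(x)))
  sum(tokens %in% first_person_words)
}

txt <- "I think this is my favourite #rstats package!"
count_words(txt)
count_hashtags(txt)
count_first_person(txt)
```

Each counter takes a string and returns a single number, which is the shape a feature counting function needs in order to produce one numeric column per function.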
Sample sentences with emojis
Description
This data set is primarily used for examples.
Usage
emoji_samples
Format
tibble with 1 column
N-gram generator
Description
N-gram generator
Usage
ngram(x, n, n_min, delim)
Objects exported from other packages
Description
These objects are imported from other packages. Follow the links below to see their documentation.
- generics
S3 methods for tracking which additional packages are needed for steps.
Description
Recipe-adjacent packages always list themselves as a required package so that the steps can function properly within parallel processing schemes.
Usage
## S3 method for class 'step_clean_levels'
required_pkgs(x, ...)
## S3 method for class 'step_clean_names'
required_pkgs(x, ...)
## S3 method for class 'step_dummy_hash'
required_pkgs(x, ...)
## S3 method for class 'step_lda'
required_pkgs(x, ...)
## S3 method for class 'step_lemma'
required_pkgs(x, ...)
## S3 method for class 'step_ngram'
required_pkgs(x, ...)
## S3 method for class 'step_pos_filter'
required_pkgs(x, ...)
## S3 method for class 'step_sequence_onehot'
required_pkgs(x, ...)
## S3 method for class 'step_stem'
required_pkgs(x, ...)
## S3 method for class 'step_stopwords'
required_pkgs(x, ...)
## S3 method for class 'step_text_normalization'
required_pkgs(x, ...)
## S3 method for class 'step_textfeature'
required_pkgs(x, ...)
## S3 method for class 'step_texthash'
required_pkgs(x, ...)
## S3 method for class 'step_tf'
required_pkgs(x, ...)
## S3 method for class 'step_tfidf'
required_pkgs(x, ...)
## S3 method for class 'step_tokenfilter'
required_pkgs(x, ...)
## S3 method for class 'step_tokenize'
required_pkgs(x, ...)
## S3 method for class 'step_tokenize_bpe'
required_pkgs(x, ...)
## S3 method for class 'step_tokenize_sentencepiece'
required_pkgs(x, ...)
## S3 method for class 'step_tokenize_wordpiece'
required_pkgs(x, ...)
## S3 method for class 'step_tokenmerge'
required_pkgs(x, ...)
## S3 method for class 'step_untokenize'
required_pkgs(x, ...)
## S3 method for class 'step_word_embeddings'
required_pkgs(x, ...)
Arguments
x |
A recipe step |
Value
A character vector
Show token output of recipe
Description
Returns the tokens as a list of character vectors of a recipe. This function can be useful for diagnostics during recipe construction but should not be used in final recipe steps. Note that this function will both prep() and bake() the recipe it is used on.
Usage
show_tokens(rec, var, n = 6L)
Arguments
rec |
A recipe object |
var |
name of variable |
n |
Number of elements to return. |
Value
A list of character vectors
Examples
library(recipes)
text_tibble <- tibble(text = c("This is words", "They are nice!"))
recipe(~text, data = text_tibble) %>%
step_tokenize(text) %>%
show_tokens(text)
library(modeldata)
data(tate_text)
recipe(~., data = tate_text) %>%
step_tokenize(medium) %>%
show_tokens(medium)
Clean Categorical Levels
Description
step_clean_levels()
creates a specification of a recipe step that will
clean nominal data (character or factor) so the levels consist only of
letters, numbers, and the underscore.
Usage
step_clean_levels(
recipe,
...,
role = NA,
trained = FALSE,
clean = NULL,
skip = FALSE,
id = rand_id("clean_levels")
)
Arguments
recipe |
A recipes::recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose which
variables are affected by the step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
clean |
A named character vector to clean and recode categorical levels.
This is |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Details
The new levels are cleaned and then reset with dplyr::recode_factor(). When data to be processed contains novel levels (i.e., not contained in the training set), they are converted to missing.
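The behaviour for novel levels can be sketched in base R. The cleaning rule in `clean_level()` and the helper `recode_levels()` are hypothetical stand-ins chosen for illustration; the rule the step actually applies may differ in its details.

```r
# Hypothetical cleaning rule; the package's exact rule may differ.
clean_level <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x)  # anything that is not a letter or digit becomes "_"
  gsub("^_+|_+$", "", x)           # trim leading and trailing underscores
}

train <- factor(c("New York", "San-Francisco", "New York"))

# Named vector learned from the training data:
# names are the cleaned levels, values are the original levels.
clean <- setNames(levels(train), clean_level(levels(train)))

recode_levels <- function(x, clean) {
  # Levels not seen during training have no match and become NA.
  idx <- match(as.character(x), clean)
  factor(names(clean)[idx], levels = names(clean))
}

recode_levels(train, clean)

test <- factor(c("New York", "Los Angeles"))
recode_levels(test, clean)
```

Because the mapping is fixed when the recipe is trained, "Los Angeles" has no cleaned counterpart and is returned as a missing value.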
Value
An updated version of recipe
with the new step added
to the sequence of existing steps (if any).
Tidying
When you tidy()
this step, a tibble is returned with
columns terms
, original
, value
, and id
:
- terms
character, the selectors or variables selected
- original
character, the original levels
- value
character, the cleaned levels
- id
character, id of this step
Case weights
The underlying operation does not allow for case weights.
See Also
step_clean_names()
, recipes::step_factor2string()
,
recipes::step_string2factor()
, recipes::step_regex()
,
recipes::step_unknown()
, recipes::step_novel()
, recipes::step_other()
Other Steps for Text Cleaning:
step_clean_names()
Examples
library(recipes)
library(modeldata)
data(Smithsonian)
smith_tr <- Smithsonian[1:15, ]
smith_te <- Smithsonian[16:20, ]
rec <- recipe(~., data = smith_tr)
rec <- rec %>%
step_clean_levels(name)
rec <- prep(rec, training = smith_tr)
cleaned <- bake(rec, smith_tr)
tidy(rec, number = 1)
# novel levels are replaced with missing
bake(rec, smith_te)
Clean Variable Names
Description
step_clean_names()
creates a specification of a recipe step that will
clean variable names so the names consist only of letters, numbers, and the
underscore.
Usage
step_clean_names(
recipe,
...,
role = NA,
trained = FALSE,
clean = NULL,
skip = FALSE,
id = rand_id("clean_names")
)
Arguments
recipe |
A recipes::recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose which
variables are affected by the step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
clean |
A named character vector to clean variable names. This is |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Value
An updated version of recipe
with the new step added
to the sequence of existing steps (if any).
Tidying
When you tidy()
this step, a tibble is returned with
columns terms
, value
, and id
:
- terms
character, the new clean variable names
- value
character, the original variable names
- id
character, id of this step
Case weights
The underlying operation does not allow for case weights.
See Also
step_clean_levels()
, recipes::step_factor2string()
,
recipes::step_string2factor()
, recipes::step_regex()
,
recipes::step_unknown()
, recipes::step_novel()
, recipes::step_other()
Other Steps for Text Cleaning:
step_clean_levels()
Examples
library(recipes)
data(airquality)
air_tr <- tibble(airquality[1:100, ])
air_te <- tibble(airquality[101:153, ])
rec <- recipe(~., data = air_tr)
rec <- rec %>%
step_clean_names(all_predictors())
rec <- prep(rec, training = air_tr)
tidy(rec, number = 1)
bake(rec, air_tr)
bake(rec, air_te)
Indicator Variables via Feature Hashing
Description
step_dummy_hash()
creates a specification of a recipe step that will
convert factors or character columns into a series of binary (or signed
binary) indicator columns.
Usage
step_dummy_hash(
recipe,
...,
role = "predictor",
trained = FALSE,
columns = NULL,
signed = TRUE,
num_terms = 32L,
collapse = FALSE,
prefix = "dummyhash",
sparse = "auto",
keep_original_cols = FALSE,
skip = FALSE,
id = rand_id("dummy_hash")
)
Arguments
recipe |
A recipes::recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose which
variables are affected by the step. See |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the function assumes that the new columns created by the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
columns |
A character string of variable names that will
be populated (eventually) by the |
signed |
A logical, indicating whether to use a signed hash-function (generating values of -1, 0, or 1), to reduce collisions when hashing. Defaults to TRUE. |
num_terms |
An integer, the number of variables to output. Defaults to 32. |
collapse |
A logical; should all of the selected columns be collapsed into a single column to create a single set of hashed features? |
prefix |
A character string that will be the prefix to the resulting new variables. See notes below. |
sparse |
A single string. Should the columns produced be sparse vectors.
Can take the values |
keep_original_cols |
A logical to keep the original variables in the
output. Defaults to |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Details
Feature hashing, or the hashing trick, is a transformation of a text variable into a new set of numerical variables. This is done by applying a hashing function over the values of the factor levels and using the hash values as feature indices. This allows for a low memory representation of the data and can be very helpful when a qualitative predictor has many levels or is expected to have new levels during prediction. This implementation is done using the MurmurHash3 method.
The argument num_terms
controls the number of indices that the hashing
function will map to. This is the tuning parameter for this transformation.
Since the hashing function can map two different tokens to the same index,
a higher value of num_terms
will result in a lower chance of collision.
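The relationship between num_terms and collisions can be seen in a small base R sketch. The hash in `toy_hash()` is a simple polynomial hash used only for illustration and is not MurmurHash3; `hash_features()` is likewise a hypothetical helper that produces unsigned indicators only.

```r
# Toy string hash (NOT MurmurHash3): polynomial hash kept small with modulo.
toy_hash <- function(x) {
  h <- 0
  for (code in utf8ToInt(x)) {
    h <- (h * 31 + code) %% 2147483647
  }
  h
}

# Map each value to one of `num_terms` columns and set that column to 1.
hash_features <- function(values, num_terms) {
  out <- matrix(0L, nrow = length(values), ncol = num_terms)
  for (i in seq_along(values)) {
    index <- toy_hash(values[i]) %% num_terms + 1
    out[i, index] <- 1L
  }
  out
}

levels_seen <- c("alpha", "beta", "gamma", "delta", "epsilon")
small <- hash_features(levels_seen, num_terms = 2)
large <- hash_features(levels_seen, num_terms = 32)

# Number of distinct columns in use; fewer columns than levels means collisions.
sum(colSums(small) > 0)
sum(colSums(large) > 0)
```

With only two columns, five levels must share indices, so collisions are unavoidable. A level never seen before still maps to one of the existing columns, which is why hashing copes with new levels at prediction time.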
The new components will have names that begin with prefix, then the name of the variable, followed by the tokens all separated by -. The variable names are padded with zeros. For example if prefix = "hash", and if num_terms < 10, their names will be hash1 - hash9. If num_terms = 101, their names will be hash001 - hash101.
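The zero-padding convention can be reproduced with a few lines of base R. `padded_names()` is a hypothetical helper written for this illustration; it only covers the numeric suffix and not the variable name part of the column names.

```r
# Hypothetical helper mirroring the zero-padding convention described above.
padded_names <- function(num_terms, prefix) {
  # Pad every index to the width of the largest index.
  width <- nchar(format(num_terms, scientific = FALSE))
  paste0(prefix, formatC(seq_len(num_terms), width = width, flag = "0"))
}

padded_names(9, "hash")
head(padded_names(101, "hash"), 3)
tail(padded_names(101, "hash"), 1)
```

Padding keeps the generated columns in numeric order when they are sorted alphabetically.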
Value
An updated version of recipe
with the new step added
to the sequence of existing steps (if any).
Tidying
When you tidy()
this step, a tibble is returned with
columns terms
, value
, num_terms
, collapse
, and id
:
- terms
character, the selectors or variables selected
- value
logical, whether a signed hashing was performed
- num_terms
integer, number of terms
- collapse
logical, were the columns collapsed
- id
character, id of this step
Tuning Parameters
This step has 2 tuning parameters:
- signed: Signed Hash Value (type: logical, default: TRUE)
- num_terms: # Hash Features (type: integer, default: 32)
Sparse data
This step produces sparse columns if sparse = "yes" is set. The default value "auto" won't trigger production of sparse columns if a recipe is recipes::prep()ed, but allows for a workflow to toggle to "yes" or "no" depending on whether the model supports recipes::sparse_data and if the model is expected to run faster with the data.
The mechanism for determining how much sparsity is produced isn't perfect, and there will be times when you want to manually overwrite by setting sparse = "yes" or sparse = "no".
Case weights
The underlying operation does not allow for case weights.
References
Kilian Weinberger; Anirban Dasgupta; John Langford; Alex Smola; Josh Attenberg (2009).
Kuhn and Johnson (2019), Chapter 7, https://bookdown.org/max/FES/encoding-predictors-with-many-categories.html
See Also
Other Steps for Numeric Variables From Characters:
step_sequence_onehot()
,
step_textfeature()
Examples
library(recipes)
library(modeldata)
data(grants)
grants_rec <- recipe(~sponsor_code, data = grants_other) %>%
step_dummy_hash(sponsor_code)
grants_obj <- grants_rec %>%
prep()
bake(grants_obj, grants_test)
tidy(grants_rec, number = 1)
tidy(grants_obj, number = 1)
Calculate LDA Dimension Estimates of Tokens
Description
step_lda()
creates a specification of a recipe step that will return the
lda dimension estimates of a text variable.
Usage
step_lda(
recipe,
...,
role = "predictor",
trained = FALSE,
columns = NULL,
lda_models = NULL,
num_topics = 10L,
prefix = "lda",
keep_original_cols = FALSE,
skip = FALSE,
id = rand_id("lda")
)
Arguments
recipe |
A recipes::recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose which
variables are affected by the step. See |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the function assumes that the new columns created by the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
columns |
A character string of variable names that will
be populated (eventually) by the |
lda_models |
A WarpLDA model object from the text2vec package. If left to NULL, the default, it will train its model based on the training data. Look at the examples for how to fit a WarpLDA model. |
num_topics |
integer desired number of latent topics. |
prefix |
A prefix for generated column names, defaults to "lda". |
keep_original_cols |
A logical to keep the original variables in the
output. Defaults to |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Value
An updated version of recipe
with the new step added
to the sequence of existing steps (if any).
Tidying
When you tidy()
this step, a tibble is returned with
columns terms
, num_topics
, and id
:
- terms
character, the selectors or variables selected
- num_topics
integer, number of topics
- id
character, id of this step
Case weights
The underlying operation does not allow for case weights.
Source
https://arxiv.org/abs/1301.3781
See Also
Other Steps for Numeric Variables From Tokens:
step_texthash()
,
step_tf()
,
step_tfidf()
,
step_word_embeddings()
Examples
library(recipes)
library(modeldata)
data(tate_text)
tate_rec <- recipe(~., data = tate_text) %>%
step_tokenize(medium) %>%
step_lda(medium)
tate_obj <- tate_rec %>%
prep()
bake(tate_obj, new_data = NULL) %>%
slice(1:2)
tidy(tate_rec, number = 2)
tidy(tate_obj, number = 2)
# Changing the number of topics.
recipe(~., data = tate_text) %>%
step_tokenize(medium, artist) %>%
step_lda(medium, artist, num_topics = 20) %>%
prep() %>%
bake(new_data = NULL) %>%
slice(1:2)
# Supplying a pre-trained LDA model trained using text2vec
library(text2vec)
tokens <- word_tokenizer(tolower(tate_text$medium))
it <- itoken(tokens, ids = seq_along(tate_text$medium))
v <- create_vocabulary(it)
dtm <- create_dtm(it, vocab_vectorizer(v))
lda_model <- LDA$new(n_topics = 15)
recipe(~., data = tate_text) %>%
step_tokenize(medium, artist) %>%
step_lda(medium, artist, lda_models = lda_model) %>%
prep() %>%
bake(new_data = NULL) %>%
slice(1:2)
Lemmatization of Token Variables
Description
step_lemma()
creates a specification of a recipe step that will extract
the lemmatization of a token
variable.
Usage
step_lemma(
recipe,
...,
role = NA,
trained = FALSE,
columns = NULL,
skip = FALSE,
id = rand_id("lemma")
)
Arguments
recipe |
A recipes::recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose which
variables are affected by the step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
columns |
A character string of variable names that will
be populated (eventually) by the |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Details
This step doesn't perform lemmatization by itself, but rather lets you extract the lemma attribute of the token variable. To be able to use step_lemma you need to use a tokenization method that includes lemmatization. Currently using the "spacyr" engine in step_tokenize() provides lemmatization and works well with step_lemma.
Value
An updated version of recipe
with the new step added
to the sequence of existing steps (if any).
Tidying
When you tidy()
this step, a tibble is returned with
columns terms
and id
:
- terms
character, the selectors or variables selected
- id
character, id of this step
Case weights
The underlying operation does not allow for case weights.
See Also
step_tokenize()
to turn characters into tokens
Other Steps for Token Modification:
step_ngram()
,
step_pos_filter()
,
step_stem()
,
step_stopwords()
,
step_tokenfilter()
,
step_tokenmerge()
Examples
## Not run:
library(recipes)
short_data <- data.frame(text = c(
"This is a short tale,",
"With many cats and ladies."
))
rec_spec <- recipe(~text, data = short_data) %>%
step_tokenize(text, engine = "spacyr") %>%
step_lemma(text) %>%
step_tf(text)
rec_prepped <- prep(rec_spec)
bake(rec_prepped, new_data = NULL)
## End(Not run)
Generate n-grams From Token Variables
Description
step_ngram()
creates a specification of a recipe step that will convert a
token
variable into a token
variable of
ngrams.
Usage
step_ngram(
recipe,
...,
role = NA,
trained = FALSE,
columns = NULL,
num_tokens = 3L,
min_num_tokens = 3L,
delim = "_",
skip = FALSE,
id = rand_id("ngram")
)
Arguments
recipe |
A recipes::recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose which
variables are affected by the step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
columns |
A character string of variable names that will
be populated (eventually) by the |
num_tokens |
The number of tokens in the n-gram. This must be an integer greater than or equal to 1. Defaults to 3. |
min_num_tokens |
The minimum number of tokens in the n-gram. This must
be an integer greater than or equal to 1 and smaller than |
delim |
The separator between words in an n-gram. Defaults to "_". |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Details
The use of this step will leave the ordering of the tokens meaningless. If min_num_tokens < num_tokens then the tokens will be ordered in increasing fashion with respect to the number of tokens in the n-gram. If min_num_tokens = 1 and num_tokens = 3 then the output will contain all the 1-grams followed by all the 2-grams followed by all the 3-grams.
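The ordering described above can be sketched in base R. `make_ngrams()` is a hypothetical helper written for this illustration and is not the routine the package uses; it assumes n_min is less than or equal to n.

```r
# Hypothetical base R n-gram builder; not the package's own routine.
# Assumes n_min <= n.
make_ngrams <- function(tokens, n = 3L, n_min = 3L, delim = "_") {
  out <- character(0)
  for (size in seq(n_min, n)) {
    # A sequence shorter than `size` has no n-grams of that size.
    if (length(tokens) < size) next
    starts <- seq_len(length(tokens) - size + 1L)
    grams <- vapply(
      starts,
      function(i) paste(tokens[i:(i + size - 1L)], collapse = delim),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

tokens <- c("oil", "paint", "on", "canvas")

# All 1-grams, then all 2-grams, then all 3-grams.
make_ngrams(tokens, n = 3, n_min = 1)

# Defaults: 3-grams only.
make_ngrams(tokens)
```

The sizes are emitted one after another, so the result no longer follows the reading order of the original text.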
Value
An updated version of recipe
with the new step added
to the sequence of existing steps (if any).
Tidying
When you tidy()
this step, a tibble is returned with
columns terms
and id
:
- terms
character, the selectors or variables selected
- id
character, id of this step
Tuning Parameters
This step has 1 tuning parameter:
- num_tokens: Number of tokens (type: integer, default: 3)
Case weights
The underlying operation does not allow for case weights.
See Also
step_tokenize()
to turn characters into tokens
Other Steps for Token Modification:
step_lemma()
,
step_pos_filter()
,
step_stem()
,
step_stopwords()
,
step_tokenfilter()
,
step_tokenmerge()
Examples
library(recipes)
library(modeldata)
data(tate_text)
tate_rec <- recipe(~., data = tate_text) %>%
step_tokenize(medium) %>%
step_ngram(medium)
tate_obj <- tate_rec %>%
prep()
bake(tate_obj, new_data = NULL, medium) %>%
slice(1:2)
bake(tate_obj, new_data = NULL) %>%
slice(2) %>%
pull(medium)
tidy(tate_rec, number = 2)
tidy(tate_obj, number = 2)
Part of Speech Filtering of Token Variables
Description
step_pos_filter()
creates a specification of a recipe step that will
filter a token
variable based on part of speech tags.
Usage
step_pos_filter(
recipe,
...,
role = NA,
trained = FALSE,
columns = NULL,
keep_tags = "NOUN",
skip = FALSE,
id = rand_id("pos_filter")
)
Arguments
recipe |
A recipes::recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose which
variables are affected by the step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
columns |
A character string of variable names that will
be populated (eventually) by the |
keep_tags |
Character variable of part of speech tags to keep. See details for complete list of tags. Defaults to "NOUN". |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Details
Possible part of speech tags for spacyr
engine are: "ADJ", "ADP", "ADV",
"AUX", "CONJ", "CCONJ", "DET", "INTJ", "NOUN", "NUM", "PART", "PRON",
"PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X" and "SPACE". For more
information look here
https://github.com/explosion/spaCy/blob/master/spacy/glossary.py.
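Conceptually, the filter keeps the tokens whose tag appears in keep_tags. The sketch below uses hand-written tags rather than output from spacyr, and `filter_pos()` is a hypothetical helper, so it only illustrates the selection logic.

```r
# Hypothetical tokens with parallel part-of-speech tags (written by hand,
# not produced by a tagger).
tokens <- c("with", "many", "cats", "and", "ladies", ".")
tags   <- c("ADP", "ADJ", "NOUN", "CCONJ", "NOUN", "PUNCT")

# Keep only the tokens whose tag is listed in keep_tags.
filter_pos <- function(tokens, tags, keep_tags = "NOUN") {
  tokens[tags %in% keep_tags]
}

filter_pos(tokens, tags)
filter_pos(tokens, tags, keep_tags = c("NOUN", "ADJ"))
```

Passing several tags widens the filter; the tokens keep their original relative order.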
Value
An updated version of recipe
with the new step added
to the sequence of existing steps (if any).
Tidying
When you tidy()
this step, a tibble is returned with
columns terms
and id
:
- terms
character, the selectors or variables selected
- id
character, id of this step
Case weights
The underlying operation does not allow for case weights.
See Also
step_tokenize()
to turn characters into tokens
Other Steps for Token Modification:
step_lemma()
,
step_ngram()
,
step_stem()
,
step_stopwords()
,
step_tokenfilter()
,
step_tokenmerge()
Examples
## Not run:
library(recipes)
short_data <- data.frame(text = c(
"This is a short tale,",
"With many cats and ladies."
))
rec_spec <- recipe(~text, data = short_data) %>%
step_tokenize(text, engine = "spacyr") %>%
step_pos_filter(text, keep_tags = "NOUN") %>%
step_tf(text)
rec_prepped <- prep(rec_spec)
bake(rec_prepped, new_data = NULL)
## End(Not run)
Positional One-Hot encoding of Tokens
Description
step_sequence_onehot()
creates a specification of a recipe step that will
take a string and do one hot encoding for each character by position.
Usage
step_sequence_onehot(
recipe,
...,
role = "predictor",
trained = FALSE,
columns = NULL,
sequence_length = 100,
padding = "pre",
truncating = "pre",
vocabulary = NULL,
prefix = "seq1hot",
keep_original_cols = FALSE,
skip = FALSE,
id = rand_id("sequence_onehot")
)
Arguments
recipe |
A recipes::recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose which
variables are affected by the step. See |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the function assumes that the new columns created by the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
columns |
A character string of variable names that will
be populated (eventually) by the |
sequence_length |
A numeric, number of characters to keep before discarding. Defaults to 100. |
padding |
'pre' or 'post', pad either before or after each sequence. Defaults to 'pre'. |
truncating |
'pre' or 'post', remove values from sequences larger than sequence_length either in the beginning or in the end of the sequence. Defaults to 'pre'. |
vocabulary |
A character vector, characters to be mapped to integers.
Characters not in the vocabulary will be encoded as 0. Defaults to
|
prefix |
A prefix for generated column names, defaults to "seq1hot". |
keep_original_cols |
A logical to keep the original variables in the
output. Defaults to |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Details
The string will be capped by the sequence_length argument, strings shorter than sequence_length will be padded with empty characters. The encoding will assign an integer to each character in the vocabulary, and will encode accordingly. Characters not in the vocabulary will be encoded as 0.
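The interplay of sequence_length, padding, truncating and vocabulary can be sketched in base R. `encode_sequence()` is a hypothetical helper; representing padded positions with 0 is an assumption made for this illustration.

```r
# Hypothetical helper: map a sequence to integer codes of fixed length.
# Assumption: padded positions are represented by 0.
encode_sequence <- function(items, vocabulary, sequence_length,
                            padding = "pre", truncating = "pre") {
  codes <- match(items, vocabulary, nomatch = 0L)  # unknown items become 0
  n <- length(codes)
  if (n > sequence_length) {
    # "pre" drops items from the start, "post" drops them from the end.
    codes <- if (truncating == "pre") {
      tail(codes, sequence_length)
    } else {
      head(codes, sequence_length)
    }
  } else if (n < sequence_length) {
    pad <- integer(sequence_length - n)  # zeros for the empty positions
    codes <- if (padding == "pre") c(pad, codes) else c(codes, pad)
  }
  codes
}

vocab <- c("oil", "on", "canvas", "paper")

# Shorter than sequence_length: padded at the front by default.
encode_sequence(c("oil", "on", "canvas"), vocab, sequence_length = 5)

# Longer than sequence_length: the first items are dropped by default,
# and "ink" would be encoded as 0 because it is not in the vocabulary.
encode_sequence(c("ink", "on", "paper", "on", "canvas", "oil"), vocab,
                sequence_length = 4)
```

Every input ends up with exactly sequence_length positions, which is what allows the step to produce a fixed set of output columns.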
Value
An updated version of recipe
with the new step added
to the sequence of existing steps (if any).
Tidying
When you tidy()
this step, a tibble is returned with
columns terms
, vocabulary
, token
, and id
:
- terms
character, the selectors or variables selected
- vocabulary
integer, index
- token
character, text corresponding to the index
- id
character, id of this step
Case weights
The underlying operation does not allow for case weights.
Source
https://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification.pdf
See Also
Other Steps for Numeric Variables From Characters:
step_dummy_hash()
,
step_textfeature()
Examples
library(recipes)
library(modeldata)
data(tate_text)
tate_rec <- recipe(~medium, data = tate_text) %>%
step_tokenize(medium) %>%
step_tokenfilter(medium) %>%
step_sequence_onehot(medium)
tate_obj <- tate_rec %>%
prep()
bake(tate_obj, new_data = NULL)
tidy(tate_rec, number = 3)
tidy(tate_obj, number = 3)
Stemming of Token Variables
Description
step_stem()
creates a specification of a recipe step that will convert a
token
variable to have its stemmed version.
Usage
step_stem(
recipe,
...,
role = NA,
trained = FALSE,
columns = NULL,
options = list(),
custom_stemmer = NULL,
skip = FALSE,
id = rand_id("stem")
)
Arguments
recipe |
A recipes::recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose which
variables are affected by the step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
columns |
A character string of variable names that will
be populated (eventually) by the |
options |
A list of options passed to the stemmer function. |
custom_stemmer |
A custom stemming function. If none is provided it will default to "SnowballC". |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Details
Words tend to have different forms depending on context, such as organize, organizes, and organizing. In many situations it is beneficial to have these words condensed into one to allow for a smaller pool of words. Stemming is the act of chopping off the end of words using a set of heuristics.
Note that the stemming will only be done at the end of the word and will therefore not work reliably on ngrams or sentences.
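The note about n-grams can be illustrated with a toy suffix stripper. `toy_stem()` is a hypothetical function and a deliberate simplification; it is not the SnowballC algorithm used by default.

```r
# Toy suffix stripper (an assumption for illustration; NOT the SnowballC
# algorithm): remove a trailing "es", "e" or "ing".
toy_stem <- function(x) sub("(es|e|ing)$", "", x)

# Different forms collapse to the same stem.
toy_stem(c("organize", "organizes", "organizing"))

# Only the end of the string is touched, so the first word of an n-gram
# keeps its suffix.
toy_stem("organizing_organizes")
```

For this reason it is usually preferable to stem the tokens first and build n-grams afterwards.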
Value
An updated version of recipe
with the new step added
to the sequence of existing steps (if any).
Tidying
When you tidy()
this step, a tibble is returned with
columns terms
, is_custom_stemmer
, and id
:
- terms
character, the selectors or variables selected
- is_custom_stemmer
logical, indicate if custom stemmer was used
- id
character, id of this step
Case weights
The underlying operation does not allow for case weights.
See Also
step_tokenize()
to turn characters into tokens
Other Steps for Token Modification:
step_lemma()
,
step_ngram()
,
step_pos_filter()
,
step_stopwords()
,
step_tokenfilter()
,
step_tokenmerge()
Examples
library(recipes)
library(modeldata)
data(tate_text)
tate_rec <- recipe(~., data = tate_text) %>%
step_tokenize(medium) %>%
step_stem(medium)
tate_obj <- tate_rec %>%
prep()
bake(tate_obj, new_data = NULL, medium) %>%
slice(1:2)
bake(tate_obj, new_data = NULL) %>%
slice(2) %>%
pull(medium)
tidy(tate_rec, number = 2)
tidy(tate_obj, number = 2)
# Using custom stemmer. Here a custom stemmer that removes the last letter
# if it is a "s".
remove_s <- function(x) gsub("s$", "", x)
tate_rec <- recipe(~., data = tate_text) %>%
step_tokenize(medium) %>%
step_stem(medium, custom_stemmer = remove_s)
tate_obj <- tate_rec %>%
prep()
bake(tate_obj, new_data = NULL, medium) %>%
slice(1:2)
bake(tate_obj, new_data = NULL) %>%
slice(2) %>%
pull(medium)
Filtering of Stop Words for Tokens Variables
Description
step_stopwords()
creates a specification of a recipe step that will
filter a token
variable for stop words.
Usage
step_stopwords(
recipe,
...,
role = NA,
trained = FALSE,
columns = NULL,
language = "en",
keep = FALSE,
stopword_source = "snowball",
custom_stopword_source = NULL,
skip = FALSE,
id = rand_id("stopwords")
)
Arguments
recipe |
A recipes::recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose which
variables are affected by the step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
columns |
A character string of variable names that will
be populated (eventually) by the |
language |
A character to indicate the language of stop words by ISO 639-1 coding scheme. |
keep |
A logical. Specifies whether to keep the stop words or discard them. |
stopword_source |
A character to indicate the stop words source as
listed in |
custom_stopword_source |
A character vector giving a custom list of words that caters to the user's specific problem. |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Details
Stop words are words that are sometimes removed before natural language processing tasks. While the term usually refers to the most common words in a language, there is no universal stop word list.
The argument custom_stopword_source
allows you to pass a character vector
to filter against. With the keep
argument you can choose to retain the listed words
instead of removing them, so combining these two arguments lets you select
exactly the words you want.
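The interaction between a custom word list and keep can be sketched in base R as follows. This is an illustration of the filtering logic described above, not the package's internal code; the token vector is made up for the example.

```r
# Made-up tokens for one observation.
tokens <- c("twice", "upon", "a", "time", "oil", "paint")
custom_words <- c("twice", "upon")

# keep = FALSE (the default): drop tokens found in the list.
removed <- tokens[!tokens %in% custom_words]
removed

# keep = TRUE: retain only tokens found in the list.
kept <- tokens[tokens %in% custom_words]
kept
```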
Value
An updated version of recipe
with the new step added
to the sequence of existing steps (if any).
Tidying
When you tidy()
this step, a tibble is returned with
columns terms
, value
, keep
, and id
:
- terms
character, the selectors or variables selected
- value
character, name of stop word list
- keep
logical, whether stop words are removed or kept
- id
character, id of this step
Case weights
The underlying operation does not allow for case weights.
See Also
step_tokenize()
to turn characters into tokens
Other Steps for Token Modification:
step_lemma()
,
step_ngram()
,
step_pos_filter()
,
step_stem()
,
step_tokenfilter()
,
step_tokenmerge()
Examples
library(recipes)
library(modeldata)
data(tate_text)
tate_rec <- recipe(~., data = tate_text) %>%
step_tokenize(medium) %>%
step_stopwords(medium)
tate_obj <- tate_rec %>%
prep()
bake(tate_obj, new_data = NULL, medium) %>%
slice(1:2)
bake(tate_obj, new_data = NULL) %>%
slice(2) %>%
pull(medium)
tidy(tate_rec, number = 2)
tidy(tate_obj, number = 2)
# With a custom stop words list
tate_rec <- recipe(~., data = tate_text) %>%
step_tokenize(medium) %>%
step_stopwords(medium, custom_stopword_source = c("twice", "upon"))
tate_obj <- tate_rec %>%
prep(training = tate_text)
bake(tate_obj, new_data = NULL) %>%
slice(2) %>%
pull(medium)
Normalization of Character Variables
Description
step_text_normalization()
creates a specification of a recipe step that
will perform Unicode Normalization on character variables.
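Why normalization matters can be seen in base R with the two spellings used in the example for this step. The sketch below only inspects the code points; it does not perform normalization itself.

```r
precomposed <- "sch\u00f6n"  # o-umlaut as a single code point
decomposed <- "scho\u0308n"  # "o" followed by a combining diaeresis

# The two strings render alike but consist of different code points.
codes_precomposed <- utf8ToInt(precomposed)
codes_decomposed <- utf8ToInt(decomposed)
length(codes_precomposed)
length(codes_decomposed)

# Without normalization they are treated as different values.
same_before_normalization <- identical(precomposed, decomposed)
same_before_normalization
```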
Usage
step_text_normalization(
recipe,
...,
role = NA,
trained = FALSE,
columns = NULL,
normalization_form = "nfc",
skip = FALSE,
id = rand_id("text_normalization")
)
Arguments
recipe |
A recipes::recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose which
variables are affected by the step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
columns |
A character string of variable names that will
be populated (eventually) by the |
normalization_form |
A single character string determining the Unicode
Normalization. Must be one of "nfc", "nfd", "nfkd", "nfkc", or
"nfkc_casefold". Defaults to "nfc". See |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Value
An updated version of recipe
with the new step added
to the sequence of existing steps (if any).
Tidying
When you tidy()
this step, a tibble is returned with
columns terms
, normalization_form
, and id
:
- terms
character, the selectors or variables selected
- normalization_form
character, type of normalization
- id
character, id of this step
Case weights
The underlying operation does not allow for case weights.
See Also
step_texthash()
for feature hashing.
Examples
library(recipes)
sample_data <- tibble(text = c("sch\U00f6n", "scho\U0308n"))
rec <- recipe(~., data = sample_data) %>%
step_text_normalization(text)
prepped <- rec %>%
prep()
bake(prepped, new_data = NULL, text) %>%
slice(1:2)
bake(prepped, new_data = NULL) %>%
slice(2) %>%
pull(text)
tidy(rec, number = 1)
tidy(prepped, number = 1)
Calculate Set of Text Features
Description
step_textfeature()
creates a specification of a recipe step that will
extract a number of numeric features of a text column.
Usage
step_textfeature(
recipe,
...,
role = "predictor",
trained = FALSE,
columns = NULL,
extract_functions = count_functions,
prefix = "textfeature",
keep_original_cols = FALSE,
skip = FALSE,
id = rand_id("textfeature")
)
Arguments
recipe |
A recipes::recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose which
variables are affected by the step. See |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the function assumes that the new columns created from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
columns |
A character string of variable names that will
be populated (eventually) by the |
extract_functions |
A named list of feature extracting functions.
Defaults to |
prefix |
A prefix for generated column names, defaults to "textfeature". |
keep_original_cols |
A logical to keep the original variables in the
output. Defaults to |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Details
This step will take a character column and return a number of numeric
columns equal to the number of functions in the list passed to the
extract_functions
argument.
All the functions passed to extract_functions
must take a character vector
as input and return a numeric vector of the same length, otherwise an error
will be thrown.
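The contract for extract_functions can be checked by hand before passing a list to the step. The sketch below uses two hypothetical feature functions (n_chars and n_spaces are names chosen for this illustration) and verifies that each returns a numeric vector of the same length as its input.

```r
# Hypothetical feature functions; the list names become part of the
# generated column names.
my_functions <- list(
  n_chars = function(x) nchar(x),
  n_spaces = function(x) lengths(regmatches(x, gregexpr(" ", x, fixed = TRUE)))
)

text <- c("Oil on canvas", "Bronze", "")

# Apply every function to the same character vector.
features <- lapply(my_functions, function(f) f(text))

# Each result must be numeric and as long as the input.
ok <- vapply(
  features,
  function(v) is.numeric(v) && length(v) == length(text),
  logical(1)
)
ok
```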
Value
An updated version of recipe
with the new step added
to the sequence of existing steps (if any).
Tidying
When you tidy()
this step, a tibble is returned with
columns terms
, functions
, and id
:
- terms
character, the selectors or variables selected
- functions
character, name of feature functions
- id
character, id of this step
Case weights
The underlying operation does not allow for case weights.
See Also
Other Steps for Numeric Variables From Characters:
step_dummy_hash()
,
step_sequence_onehot()
Examples
library(recipes)
library(modeldata)
data(tate_text)
tate_rec <- recipe(~., data = tate_text) %>%
step_textfeature(medium)
tate_obj <- tate_rec %>%
prep()
bake(tate_obj, new_data = NULL) %>%
slice(1:2)
bake(tate_obj, new_data = NULL) %>%
pull(textfeature_medium_n_words)
tidy(tate_rec, number = 1)
tidy(tate_obj, number = 1)
# Using custom extraction functions
nchar_round_10 <- function(x) round(nchar(x) / 10) * 10
recipe(~., data = tate_text) %>%
step_textfeature(medium,
extract_functions = list(nchar10 = nchar_round_10)
) %>%
prep() %>%
bake(new_data = NULL)
Feature Hashing of Tokens
Description
step_texthash()
creates a specification of a recipe step that will
convert a token
variable into multiple numeric variables
using the hashing trick.
Usage
step_texthash(
recipe,
...,
role = "predictor",
trained = FALSE,
columns = NULL,
signed = TRUE,
num_terms = 1024L,
prefix = "texthash",
sparse = "auto",
keep_original_cols = FALSE,
skip = FALSE,
id = rand_id("texthash")
)
Arguments
recipe |
A recipes::recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose which
variables are affected by the step. See |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the function assumes that the new columns created from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
columns |
A character string of variable names that will
be populated (eventually) by the |
signed |
A logical, indicating whether to use a signed hash-function to reduce collisions when hashing. Defaults to TRUE. |
num_terms |
An integer, the number of variables to output. Defaults to 1024. |
prefix |
A character string that will be the prefix to the resulting new variables. See notes below. |
sparse |
A single string. Should the columns produced be sparse vectors.
Can take the values |
keep_original_cols |
A logical to keep the original variables in the
output. Defaults to |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Details
Feature hashing, or the hashing trick, is a transformation of a text variable into a new set of numerical variables. This is done by applying a hashing function over the tokens and using the hash values as feature indices. This allows for a low memory representation of the text. This implementation is done using the MurmurHash3 method.
The argument num_terms
controls the number of indices that the hashing
function will map to. This is the tuning parameter for this transformation.
Since the hashing function can map two different tokens to the same index,
a higher value of num_terms
will result in a lower chance of collision.
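The role of num_terms can be sketched in base R. The hash below is a toy polynomial string hash chosen for brevity; it is an assumption of this example and not the MurmurHash3 function the step uses, and the sketch ignores the signed option.

```r
# Toy polynomial string hash; a stand-in for MurmurHash3.
toy_hash <- function(token) {
  h <- 0
  for (code in utf8ToInt(token)) {
    h <- (h * 31 + code) %% 2147483647
  }
  h
}

# Count tokens into num_terms buckets indexed by hash value.
hash_tokens <- function(tokens, num_terms) {
  out <- numeric(num_terms)
  for (token in tokens) {
    index <- toy_hash(token) %% num_terms + 1  # map hash to 1..num_terms
    out[index] <- out[index] + 1
  }
  out
}

tokens <- c("oil", "on", "canvas", "oil", "paint")
small <- hash_tokens(tokens, num_terms = 4)
large <- hash_tokens(tokens, num_terms = 1024)

# Fewer non-empty buckets than unique tokens means a collision occurred.
sum(small > 0)
sum(large > 0)
```

The output always has num_terms columns regardless of how many distinct tokens exist, which is what keeps the representation small.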
The new components will have names that begin with prefix
, then
the name of the variable, followed by the tokens all separated by
-
. The variable names are padded with zeros. For example if
prefix = "hash"
, and if num_terms < 10
, their names will be
hash1
- hash9
. If num_terms = 101
, their names will be
hash001
- hash101
.
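The zero padding in the example above can be reproduced with a few lines of base R. hash_names is a hypothetical helper for illustration, and it only covers the prefix and padded index shown in the example.

```r
# Hypothetical helper: pad the index to the width of num_terms.
hash_names <- function(prefix, num_terms) {
  width <- nchar(as.character(num_terms))
  paste0(prefix, formatC(seq_len(num_terms), width = width, flag = "0"))
}

names_9 <- hash_names("hash", 9)
names_101 <- hash_names("hash", 101)
names_9[c(1, 9)]
names_101[c(1, 101)]
```

Padding keeps the columns in numeric order when they are sorted alphabetically.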
Value
An updated version of recipe
with the new step added
to the sequence of existing steps (if any).
Tidying
When you tidy()
this step, a tibble is returned with
columns terms
, value, length, and id
:
- terms
character, the selectors or variables selected
- value
logical, is it signed?
- length
integer, number of terms
- id
character, id of this step
Tuning Parameters
This step has 2 tuning parameters:
- signed
: Signed Hash Value (type: logical, default: TRUE)
- num_terms
: # Hash Features (type: integer, default: 1024)
Sparse data
This step produces sparse columns if sparse = "yes"
is set. The
default value "auto"
won't trigger production of sparse columns if a recipe
is recipes::prep()
ed, but allows for a workflow to toggle to "yes"
or
"no"
depending on whether the model supports recipes::sparse_data and if
the model is expected to run faster with the data.
The mechanism for determining how much sparsity is produced isn't perfect,
and there will be times when you want to manually override it by setting
sparse = "yes"
or sparse = "no"
.
Case weights
The underlying operation does not allow for case weights.
References
Kilian Weinberger; Anirban Dasgupta; John Langford; Alex Smola; Josh Attenberg (2009). "Feature Hashing for Large Scale Multitask Learning".
See Also
step_tokenize()
to turn characters into tokens
step_text_normalization()
to perform text normalization.
Other Steps for Numeric Variables From Tokens:
step_lda()
,
step_tf()
,
step_tfidf()
,
step_word_embeddings()
Examples
library(recipes)
library(modeldata)
data(tate_text)
tate_rec <- recipe(~., data = tate_text) %>%
step_tokenize(medium) %>%
step_tokenfilter(medium, max_tokens = 10) %>%
step_texthash(medium)
tate_obj <- tate_rec %>%
prep()
bake(tate_obj, tate_text)
tidy(tate_rec, number = 3)
tidy(tate_obj, number = 3)
Term frequency of Tokens
Description
sparse = "yes"
doesn't take effect when
weight_scheme = "double normalization"
as it doesn't produce sparse data.
Usage
step_tf(
recipe,
...,
role = "predictor",
trained = FALSE,
columns = NULL,
weight_scheme = "raw count",
weight = 0.5,
vocabulary = NULL,
res = NULL,
prefix = "tf",
sparse = "auto",
keep_original_cols = FALSE,
skip = FALSE,
id = rand_id("tf")
)
Arguments
recipe |
A recipes::recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose which
variables are affected by the step. See |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the function assumes that the new columns created from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
columns |
A character string of variable names that will
be populated (eventually) by the |
weight_scheme |
A character determining the weighting scheme for the term frequency calculations. Must be one of "binary", "raw count", "term frequency", "log normalization" or "double normalization". Defaults to "raw count". |
weight |
A numeric weight used if |
vocabulary |
A character vector of strings to be considered. |
res |
The words that will be used to calculate the term frequency will
be stored here once this preprocessing step has been trained by
|
prefix |
A character string that will be the prefix to the resulting new variables. See notes below. |
sparse |
A single string. Should the columns produced be sparse vectors.
Can take the values |
keep_original_cols |
A logical to keep the original variables in the
output. Defaults to |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Details
step_tf()
creates a specification of a recipe step that will convert a
token
variable into multiple variables containing the token
counts.
It is strongly advised to use step_tokenfilter before using step_tf to limit the number of variables created; otherwise you might run into memory issues. A good strategy is to start with a low token count and go up according to how much RAM you want to use.
Term frequency is a weight of how many times each token appears in each
observation. There are different ways to calculate the weight and this step
can do it in a couple of ways. Setting the argument weight_scheme
to
"binary" will result in a set of binary variables denoting if a token is
present in the observation. "raw count" will count the times a token is
present in the observation. "term frequency" will divide the count by the
total number of words in the document to limit the effect of the document
length, as longer documents tend to have the word present more times but not
necessarily at a higher percentage. "log normalization" takes the log of 1
plus the count; 1 is added to avoid taking the log of 0. Finally "double
normalization" is the raw frequency divided by the raw frequency of the most
occurring term in the document. This is then multiplied by weight
and
weight
is added to the result. This is again done to prevent a bias towards
longer documents.
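The five weighting schemes can be written out in base R for a single observation. This is a sketch that follows the descriptions above, not the package's internal code; the tokens and vocabulary are made up for the example.

```r
# One tokenized observation and a fixed vocabulary (made-up data).
tokens <- c("oil", "on", "canvas", "oil")
vocabulary <- c("canvas", "oil", "on", "paper")
weight <- 0.5

# Raw counts per vocabulary entry, including zeros for absent tokens.
counts <- as.numeric(table(factor(tokens, levels = vocabulary)))
names(counts) <- vocabulary

schemes <- list(
  "binary" = as.numeric(counts > 0),
  "raw count" = counts,
  "term frequency" = counts / sum(counts),
  "log normalization" = log(1 + counts),
  "double normalization" = weight + weight * counts / max(counts)
)
schemes
```

Under "double normalization" a token that never occurs still receives the value weight rather than 0, so no entries are zero; this is consistent with that scheme not producing sparse data.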
The new components will have names that begin with prefix
, then
the name of the variable, followed by the tokens all separated by
-
. The variable names are padded with zeros. For example if
prefix = "hash"
, and if num_terms < 10
, their names will be
hash1
- hash9
. If num_terms = 101
, their names will be
hash001
- hash101
.
Value
An updated version of recipe
with the new step added
to the sequence of existing steps (if any).
Tidying
When you tidy()
this step, a tibble is returned with
columns terms
, value
, and id
:
- terms
character, the selectors or variables selected
- value
character, the weighting scheme
- id
character, id of this step
Tuning Parameters
This step has 2 tuning parameters:
- weight_scheme
: Term Frequency Weight Method (type: character, default: raw count)
- weight
: Weight (type: double, default: 0.5)
Sparse data
This step produces sparse columns if sparse = "yes"
is set. The
default value "auto"
won't trigger production of sparse columns if a recipe
is recipes::prep()
ed, but allows for a workflow to toggle to "yes"
or
"no"
depending on whether the model supports recipes::sparse_data and if
the model is expected to run faster with the data.
The mechanism for determining how much sparsity is produced isn't perfect,
and there will be times when you want to manually override it by setting
sparse = "yes"
or sparse = "no"
.
Case weights
The underlying operation does not allow for case weights.
See Also
step_tokenize()
to turn characters into tokens
Other Steps for Numeric Variables From Tokens:
step_lda()
,
step_texthash()
,
step_tfidf()
,
step_word_embeddings()
Examples
library(recipes)
library(modeldata)
data(tate_text)
tate_rec <- recipe(~., data = tate_text) %>%
step_tokenize(medium) %>%
step_tf(medium)
tate_obj <- tate_rec %>%
prep()
bake(tate_obj, tate_text)
tidy(tate_rec, number = 2)
tidy(tate_obj, number = 2)
Term Frequency-Inverse Document Frequency of Tokens
Description
step_tfidf()
creates a specification of a recipe step that will convert a
token
variable into multiple variables containing the term
frequency-inverse document frequency of tokens.
Usage
step_tfidf(
recipe,
...,
role = "predictor",
trained = FALSE,
columns = NULL,
vocabulary = NULL,
res = NULL,
smooth_idf = TRUE,
norm = "l1",
sublinear_tf = FALSE,
prefix = "tfidf",
sparse = "auto",
keep_original_cols = FALSE,
skip = FALSE,
id = rand_id("tfidf")
)
Arguments
recipe |
A recipes::recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose which
variables are affected by the step. See |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the function assumes that the new columns created from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
columns |
A character string of variable names that will
be populated (eventually) by the |
vocabulary |
A character vector of strings to be considered. |
res |
The words that will be used to calculate the term frequency will
be stored here once this preprocessing step has been trained by
|
smooth_idf |
A logical. If TRUE, smooths IDF weights by adding one to document frequencies, as if an extra document containing every term in the collection exactly once had been seen. This prevents division by zero. |
norm |
A character, defines the type of normalization to apply to term vectors. "l1" by default, i.e., scale by the number of words in the document. Must be one of c("l1", "l2", "none"). |
sublinear_tf |
A logical, apply sublinear term-frequency scaling, i.e., replace the term frequency with 1 + log(TF). Defaults to FALSE. |
prefix |
A character string that will be the prefix to the resulting new variables. See notes below. |
sparse |
A single string. Should the columns produced be sparse vectors.
Can take the values |
keep_original_cols |
A logical to keep the original variables in the
output. Defaults to |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Details
It is strongly advised to use step_tokenfilter before using step_tfidf to limit the number of variables created; otherwise you may run into memory issues. A good strategy is to start with a low token count and increase depending on how much RAM you want to use.
Term frequency-inverse document frequency is the product of two statistics: the term frequency (TF) and the inverse document frequency (IDF).
Term frequency measures how many times each token appears in each observation.
Inverse document frequency is a measure of how informative a word is, i.e., how common or rare the word is across all the observations. If a word appears in all the observations it might not give that much insight, but if it only appears in some it might help differentiate between observations.
The IDF is defined as follows: idf = log(1 + (# documents in the corpus) / (# documents where the term appears))
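The IDF formula above can be worked through on a tiny corpus in base R. This sketch applies the formula exactly as stated, without the smooth_idf adjustment, and combines it with a length-normalized term frequency for the first document; the corpus is made up for the example.

```r
# Three tokenized documents (made-up data).
docs <- list(
  c("oil", "on", "canvas"),
  c("oil", "on", "paper"),
  c("bronze")
)
n_docs <- length(docs)
vocabulary <- sort(unique(unlist(docs)))

# Number of documents in which each term appears.
doc_freq <- vapply(
  vocabulary,
  function(term) sum(vapply(docs, function(d) term %in% d, logical(1))),
  numeric(1)
)

# idf = log(1 + (# documents) / (# documents where the term appears))
idf <- log(1 + n_docs / doc_freq)
idf

# Term frequency of the first document, scaled by its length,
# multiplied by the IDF weights.
tf_doc1 <- as.numeric(table(factor(docs[[1]], levels = vocabulary))) / length(docs[[1]])
tfidf_doc1 <- tf_doc1 * idf
tfidf_doc1
```

Terms that occur in fewer documents, such as "canvas" here, receive a larger IDF weight than terms shared by several documents, such as "oil".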
The new components will have names that begin with prefix
, then
the name of the variable, followed by the tokens all separated by
-
. The variable names are padded with zeros. For example if
prefix = "hash"
, and if num_terms < 10
, their names will be
hash1
- hash9
. If num_terms = 101
, their names will be
hash001
- hash101
.
Value
An updated version of recipe
with the new step added
to the sequence of existing steps (if any).
Tidying
When you tidy()
this step, a tibble is returned with
columns terms
, token
, weight
, and id
:
- terms
character, the selectors or variables selected
- token
character, name of token
- weight
numeric, the calculated IDF weight
- id
character, id of this step
Sparse data
This step produces sparse columns if sparse = "yes"
is set. The
default value "auto"
won't trigger production of sparse columns if a recipe
is recipes::prep()
ed, but allows for a workflow to toggle to "yes"
or
"no"
depending on whether the model supports recipes::sparse_data and if
the model is expected to run faster with the data.
The mechanism for determining how much sparsity is produced isn't perfect,
and there will be times when you want to manually override it by setting
sparse = "yes"
or sparse = "no"
.
Case weights
The underlying operation does not allow for case weights.
See Also
step_tokenize()
to turn characters into tokens
Other Steps for Numeric Variables From Tokens:
step_lda()
,
step_texthash()
,
step_tf()
,
step_word_embeddings()
Examples
library(recipes)
library(modeldata)
data(tate_text)
tate_rec <- recipe(~., data = tate_text) %>%
step_tokenize(medium) %>%
step_tfidf(medium)
tate_obj <- tate_rec %>%
prep()
bake(tate_obj, tate_text)
tidy(tate_rec, number = 2)
tidy(tate_obj, number = 2)
Filter Tokens Based on Term Frequency
Description
step_tokenfilter()
creates a specification of a recipe step that will
filter a token
variable based on frequency.
Usage
step_tokenfilter(
recipe,
...,
role = NA,
trained = FALSE,
columns = NULL,
max_times = Inf,
min_times = 0,
percentage = FALSE,
max_tokens = 100,
filter_fun = NULL,
res = NULL,
skip = FALSE,
id = rand_id("tokenfilter")
)
Arguments
recipe |
A recipes::recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose which
variables are affected by the step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
columns |
A character string of variable names that will
be populated (eventually) by the |
max_times |
An integer. Maximum number of times a word can appear before it is removed. |
min_times |
An integer. Minimum number of times a word must appear to avoid being removed. |
percentage |
A logical. Should max_times and min_times be interpreted as percentages instead of counts? |
max_tokens |
An integer. Will only keep the top max_tokens tokens after filtering done by max_times and min_times. Defaults to 100. |
filter_fun |
A function. This function should take a vector of
characters, and return a logical vector of the same length. This function
will be applied to each observation of the data set. Defaults to |
res |
The words that will be kept will be stored here once this
preprocessing step has been trained by |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Details
This step allows you to limit the tokens you are looking at by filtering on
their occurrence in the corpus. You are able to exclude tokens if they appear
too many times or too few times in the data. It can be specified as counts
using max_times
and min_times
or as percentages by setting percentage
as TRUE
. In addition, one can keep only the max_tokens
most frequently used
tokens. If max_tokens
is set to Inf
then all the tokens will be used.
This will generally lead to very large data sets when the tokens are words
or trigrams. A good strategy is to start with a low token count and go up
according to how much RAM you want to use.
It is strongly advised to filter before using step_tf or step_tfidf to limit the number of variables created.
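The order of operations described above (count range first, then the top max_tokens) can be sketched in base R. This illustrates the logic on a made-up corpus and is not the package's internal code.

```r
# Three tokenized documents (made-up data).
docs <- list(
  c("oil", "on", "canvas"),
  c("oil", "on", "paper"),
  c("oil", "paint", "wood")
)
min_times <- 2
max_times <- Inf
max_tokens <- 1

# Count each token across the whole corpus, most frequent first.
token_counts <- sort(table(unlist(docs)), decreasing = TRUE)

# First keep tokens whose count lies within [min_times, max_times].
in_range <- token_counts[token_counts >= min_times & token_counts <= max_times]

# Then keep at most max_tokens of the most frequent remaining tokens.
retained <- names(in_range)[seq_len(min(max_tokens, length(in_range)))]

# Apply the retained vocabulary to every document.
filtered <- lapply(docs, function(tokens) tokens[tokens %in% retained])
filtered
```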
Value
An updated version of recipe
with the new step added
to the sequence of existing steps (if any).
Tidying
When you tidy()
this step, a tibble is returned with
columns terms
, value
, and id
:
- terms
character, the selectors or variables selected
- value
integer, number of unique tokens
- id
character, id of this step
Tuning Parameters
This step has 3 tuning parameters:
- max_times
: Maximum Token Frequency (type: integer, default: Inf)
- min_times
: Minimum Token Frequency (type: integer, default: 0)
- max_tokens
: # Retained Tokens (type: integer, default: 100)
Case weights
The underlying operation does not allow for case weights.
See Also
step_tokenize()
to turn characters into tokens
Other Steps for Token Modification:
step_lemma()
,
step_ngram()
,
step_pos_filter()
,
step_stem()
,
step_stopwords()
,
step_tokenmerge()
Examples
library(recipes)
library(modeldata)
data(tate_text)
tate_rec <- recipe(~., data = tate_text) %>%
step_tokenize(medium) %>%
step_tokenfilter(medium)
tate_obj <- tate_rec %>%
prep()
bake(tate_obj, new_data = NULL, medium) %>%
slice(1:2)
bake(tate_obj, new_data = NULL) %>%
slice(2) %>%
pull(medium)
tidy(tate_rec, number = 2)
tidy(tate_obj, number = 2)
Tokenization of Character Variables
Description
step_tokenize()
creates a specification of a recipe step that will
convert a character predictor into a token
variable.
Usage
step_tokenize(
recipe,
...,
role = NA,
trained = FALSE,
columns = NULL,
training_options = list(),
options = list(),
token = "words",
engine = "tokenizers",
custom_token = NULL,
skip = FALSE,
id = rand_id("tokenize")
)
Arguments
recipe |
A recipes::recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose which
variables are affected by the step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
columns |
A character string of variable names that will
be populated (eventually) by the |
training_options |
A list of options passed to the tokenizer when it is being trained. Only applicable for engine == "tokenizers.bpe". |
options |
A list of options passed to the tokenizer. |
token |
Unit for tokenizing. See details for options. Defaults to "words". |
engine |
Package that will be used for tokenization. See details for options. Defaults to "tokenizers". |
custom_token |
User supplied tokenizer. Use of this argument will overwrite the token and engine arguments. Must take a character vector as input and output a list of character vectors. |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Details
Tokenization is the act of splitting a character vector into smaller parts to
be further analyzed. This step uses the tokenizers
package which includes
heuristics on how to split the text into paragraph tokens, word tokens,
among others. textrecipes
keeps the tokens as a token
variable and other steps will do their tasks on those token
variables before transforming them back to numeric variables.
Working with textrecipes
will almost always start by calling
step_tokenize
followed by modifying and filtering steps. This is not always
the case as you sometimes want to apply pre-tokenization steps; this can
be done with recipes::step_mutate()
.
Value
An updated version of recipe
with the new step added
to the sequence of existing steps (if any).
Engines
The choice of engine
determines the possible choices of token
.
The examples below use this small example data set:
text_tibble <- tibble(
  text = c("This is words", "They are nice!")
)
tokenizers
The tokenizers package is the default engine
and it comes with the
following unit of token
. All of these options correspond to a function in
the tokenizers package.
"words" (default)
"characters"
"character_shingles"
"ngrams"
"skip_ngrams"
"sentences"
"lines"
"paragraphs"
"regex"
"ptb" (Penn Treebank)
"word_stems"
The default tokenizer is "words"
which splits the text into a series of
words. By using step_tokenize()
without setting any arguments you get word
tokens
recipe(~ text, data = text_tibble) %>%
  step_tokenize(text) %>%
  show_tokens(text)
#> [[1]]
#> [1] "this" "is" "words"
#>
#> [[2]]
#> [1] "they" "are" "nice"
This tokenizer has arguments that change how the tokenization occurs and can
be accessed using the options
argument by passing a named list. Here we are
telling tokenizers::tokenize_words that we don't want to turn the words to
lowercase
recipe(~ text, data = text_tibble) %>%
  step_tokenize(text, options = list(lowercase = FALSE)) %>%
  show_tokens(text)
#> [[1]]
#> [1] "This" "is" "words"
#>
#> [[2]]
#> [1] "They" "are" "nice"
We can also stop removing punctuation.
recipe(~ text, data = text_tibble) %>%
  step_tokenize(text, options = list(strip_punct = FALSE, lowercase = FALSE)) %>%
  show_tokens(text)
#> [[1]]
#> [1] "This" "is" "words"
#>
#> [[2]]
#> [1] "They" "are" "nice" "!"
The tokenizer can be changed by setting a different token
. Here we change
it to return character tokens.
recipe(~ text, data = text_tibble) %>%
  step_tokenize(text, token = "characters") %>%
  show_tokens(text)
#> [[1]]
#> [1] "t" "h" "i" "s" "i" "s" "w" "o" "r" "d" "s"
#>
#> [[2]]
#> [1] "t" "h" "e" "y" "a" "r" "e" "n" "i" "c" "e"
It is worth noting that not all these token methods are appropriate but are included for completeness.
spacyr
"words"
tokenizers.bpe
The tokenizers.bpe engine performs Byte Pair Encoding Text Tokenization.
"words"
This tokenizer is trained on the training set and will thus need to be passed
training arguments. These are passed to the training_options
argument and
the most important one is vocab_size
. This determines the number of unique
tokens the tokenizer will produce. It is generally set to a much higher
value, typically in the thousands, but is set to 22 here for demonstration
purposes.
recipe(~ text, data = text_tibble) %>%
  step_tokenize(
    text,
    engine = "tokenizers.bpe",
    training_options = list(vocab_size = 22)
  ) %>%
  show_tokens(text)
#> [[1]]
#> [1] "_Th" "is" "_" "is" "_" "w" "o" "r" "d" "s"
#>
#> [[2]]
#> [1] "_Th" "e" "y" "_" "a" "r" "e" "_" "n" "i" "c" "e"
#> [13] "!"
udpipe
"words"
custom_token
Sometimes you need to perform tokenization that is not covered by the
supported engines. In that case you can use the custom_token
argument to
pass a function in that performs the tokenization you want.
Below is an example of a very simple space tokenization. This is a very fast way of tokenizing.
space_tokenizer <- function(x) {
  strsplit(x, " +")
}
recipe(~ text, data = text_tibble) %>%
  step_tokenize(
    text,
    custom_token = space_tokenizer
  ) %>%
  show_tokens(text)
#> [[1]]
#> [1] "This" "is" "words"
#>
#> [[2]]
#> [1] "They" "are" "nice!"
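Any function with the same shape can be used in place of the space tokenizer. As a further sketch, comma_tokenizer below is a hypothetical tokenizer for comma-separated text; the check at the end confirms that it meets the requirement stated for custom_token, namely a character vector in and a list of character vectors out.

```r
# Hypothetical tokenizer: split on commas and drop following whitespace.
comma_tokenizer <- function(x) {
  strsplit(x, ",\\s*")
}

media <- c("oil, canvas", "bronze,wood, steel")
tokens <- comma_tokenizer(media)
tokens

# One list element per input string, each a character vector.
valid <- is.list(tokens) &&
  length(tokens) == length(media) &&
  all(vapply(tokens, is.character, logical(1)))
valid
```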
Tidying
When you tidy()
this step, a tibble is returned with
columns terms
, value
, and id
:
- terms
character, the selectors or variables selected
- value
character, unit of tokenization
- id
character, id of this step
Tuning Parameters
This step has 1 tuning parameter:
-
token
: Token Unit (type: character, default: words)
Case weights
The underlying operation does not allow for case weights.
See Also
step_untokenize()
to untokenize.
Other Steps for Tokenization:
step_tokenize_bpe()
,
step_tokenize_sentencepiece()
,
step_tokenize_wordpiece()
Examples
library(recipes)
library(modeldata)
data(tate_text)
tate_rec <- recipe(~., data = tate_text) %>%
step_tokenize(medium)
tate_obj <- tate_rec %>%
prep()
bake(tate_obj, new_data = NULL, medium) %>%
slice(1:2)
bake(tate_obj, new_data = NULL) %>%
slice(2) %>%
pull(medium)
tidy(tate_rec, number = 1)
tidy(tate_obj, number = 1)
tate_obj_chars <- recipe(~., data = tate_text) %>%
step_tokenize(medium, token = "characters") %>%
prep()
bake(tate_obj_chars, new_data = NULL) %>%
slice(2) %>%
pull(medium)
BPE Tokenization of Character Variables
Description
step_tokenize_bpe()
creates a specification of a recipe step that will
convert a character predictor into a token
variable using
Byte Pair Encoding.
Usage
step_tokenize_bpe(
recipe,
...,
role = NA,
trained = FALSE,
columns = NULL,
vocabulary_size = 1000,
options = list(),
res = NULL,
skip = FALSE,
id = rand_id("tokenize_bpe")
)
Arguments
recipe |
A recipes::recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose which
variables are affected by the step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
columns |
A character string of variable names that will
be populated (eventually) by the |
vocabulary_size |
Integer, indicating the number of tokens in the final vocabulary. Defaults to 1000. Highly encouraged to be tuned. |
options |
A list of options passed to the tokenizer. |
res |
The fitted |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Value
An updated version of recipe
with the new step added
to the sequence of existing steps (if any).
Tidying
When you tidy()
this step, a tibble is returned with
columns terms
and id
:
- terms
character, the selectors or variables selected
- id
character, id of this step
Tuning Parameters
This step has 1 tuning parameter:
-
vocabulary_size
: # Unique Tokens in Vocabulary (type: integer, default: 1000)
Case weights
The underlying operation does not allow for case weights.
See Also
step_untokenize()
to untokenize.
Other Steps for Tokenization:
step_tokenize()
,
step_tokenize_sentencepiece()
,
step_tokenize_wordpiece()
Examples
library(recipes)
library(modeldata)
data(tate_text)
tate_rec <- recipe(~., data = tate_text) %>%
step_tokenize_bpe(medium)
tate_obj <- tate_rec %>%
prep()
bake(tate_obj, new_data = NULL, medium) %>%
slice(1:2)
bake(tate_obj, new_data = NULL) %>%
slice(2) %>%
pull(medium)
tidy(tate_rec, number = 1)
tidy(tate_obj, number = 1)
Sentencepiece Tokenization of Character Variables
Description
step_tokenize_sentencepiece()
creates a specification of a recipe step
that will convert a character predictor into a token
variable using SentencePiece tokenization.
Usage
step_tokenize_sentencepiece(
recipe,
...,
role = NA,
trained = FALSE,
columns = NULL,
vocabulary_size = 1000,
options = list(),
res = NULL,
skip = FALSE,
id = rand_id("tokenize_sentencepiece")
)
Arguments
recipe |
A recipes::recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose which
variables are affected by the step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
columns |
A character string of variable names that will
be populated (eventually) by the |
vocabulary_size |
Integer, indicating the number of tokens in the final vocabulary. Defaults to 1000. Highly encouraged to be tuned. |
options |
A list of options passed to the tokenizer. |
res |
The fitted |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Details
If you are running into errors, you can investigate the progress of the
compiled code by setting options = list(verbose = TRUE)
. This can reveal if
sentencepiece ran correctly or not.
Value
An updated version of recipe
with the new step added
to the sequence of existing steps (if any).
Tidying
When you tidy()
this step, a tibble is returned with
columns terms
and id
:
- terms
character, the selectors or variables selected
- id
character, id of this step
Case weights
The underlying operation does not allow for case weights.
See Also
step_untokenize()
to untokenize.
Other Steps for Tokenization:
step_tokenize()
,
step_tokenize_bpe()
,
step_tokenize_wordpiece()
Examples
library(recipes)
library(modeldata)
data(tate_text)
tate_rec <- recipe(~., data = tate_text) %>%
step_tokenize_sentencepiece(medium)
tate_obj <- tate_rec %>%
prep()
bake(tate_obj, new_data = NULL, medium) %>%
slice(1:2)
bake(tate_obj, new_data = NULL) %>%
slice(2) %>%
pull(medium)
tidy(tate_rec, number = 1)
tidy(tate_obj, number = 1)
Wordpiece Tokenization of Character Variables
Description
step_tokenize_wordpiece()
creates a specification of a recipe step that
will convert a character predictor into a token
variable
using WordPiece tokenization.
Usage
step_tokenize_wordpiece(
recipe,
...,
role = NA,
trained = FALSE,
columns = NULL,
vocab = wordpiece::wordpiece_vocab(),
unk_token = "[UNK]",
max_chars = 100,
skip = FALSE,
id = rand_id("tokenize_wordpiece")
)
Arguments
recipe |
A recipes::recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose which
variables are affected by the step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
columns |
A character string of variable names that will
be populated (eventually) by the |
vocab |
Character vector of vocabulary tokens. Defaults to
wordpiece::wordpiece_vocab(). |
unk_token |
Token to represent unknown words. Defaults to "[UNK]". |
max_chars |
Integer, Maximum length of word recognized. Defaults to 100. |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Value
An updated version of recipe
with the new step added
to the sequence of existing steps (if any).
Tidying
When you tidy()
this step, a tibble is returned with
columns terms
and id
:
- terms
character, the selectors or variables selected
- id
character, id of this step
Case weights
The underlying operation does not allow for case weights.
See Also
step_untokenize()
to untokenize.
Other Steps for Tokenization:
step_tokenize()
,
step_tokenize_bpe()
,
step_tokenize_sentencepiece()
Examples
library(recipes)
library(modeldata)
data(tate_text)
tate_rec <- recipe(~., data = tate_text) %>%
step_tokenize_wordpiece(medium)
tate_obj <- tate_rec %>%
prep()
bake(tate_obj, new_data = NULL, medium) %>%
slice(1:2)
bake(tate_obj, new_data = NULL) %>%
slice(2) %>%
pull(medium)
tidy(tate_rec, number = 1)
tidy(tate_obj, number = 1)
Combine Multiple Token Variables Into One
Description
step_tokenmerge()
creates a specification of a recipe step that will take
multiple token
variables and combine them into one
token
variable.
Usage
step_tokenmerge(
recipe,
...,
role = "predictor",
trained = FALSE,
columns = NULL,
prefix = "tokenmerge",
keep_original_cols = FALSE,
skip = FALSE,
id = rand_id("tokenmerge")
)
Arguments
recipe |
A recipes::recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose which
variables are affected by the step. See |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the function assumes that the new columns created by the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
columns |
A character string of variable names that will
be populated (eventually) by the |
prefix |
A prefix for generated column names, defaults to "tokenmerge". |
keep_original_cols |
A logical to keep the original variables in the
output. Defaults to FALSE. |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Value
An updated version of recipe
with the new step added
to the sequence of existing steps (if any).
Tidying
When you tidy()
this step, a tibble is returned with
columns terms
and id
:
- terms
character, the selectors or variables selected
- id
character, id of this step
Case weights
The underlying operation does not allow for case weights.
See Also
step_tokenize()
to turn characters into tokens
Other Steps for Token Modification:
step_lemma()
,
step_ngram()
,
step_pos_filter()
,
step_stem()
,
step_stopwords()
,
step_tokenfilter()
Examples
library(recipes)
library(modeldata)
data(tate_text)
tate_rec <- recipe(~., data = tate_text) %>%
step_tokenize(medium, artist) %>%
step_tokenmerge(medium, artist)
tate_obj <- tate_rec %>%
prep()
bake(tate_obj, new_data = NULL)
tidy(tate_rec, number = 2)
tidy(tate_obj, number = 2)
Untokenization of Token Variables
Description
step_untokenize()
creates a specification of a recipe step that will
convert a token
variable into a character predictor.
Usage
step_untokenize(
recipe,
...,
role = NA,
trained = FALSE,
columns = NULL,
sep = " ",
skip = FALSE,
id = rand_id("untokenize")
)
Arguments
recipe |
A recipes::recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose which
variables are affected by the step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
columns |
A character string of variable names that will
be populated (eventually) by the |
sep |
A character to determine how the tokens should be separated when
pasted together. Defaults to " ". |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Details
This step will turn a token vector back into a character vector. It calls
paste internally to put the tokens back together into a single character
string.
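The effect can be sketched in base R, assuming the tokens are held as a plain list of character vectors rather than a tokenlist object. The helper name untokenize_sketch is hypothetical, and this is an illustration of the idea rather than the package's internal code.

```r
# Hypothetical helper: collapse each character vector of tokens into a single
# string, mirroring the role of the `sep` argument.
untokenize_sketch <- function(tokens, sep = " ") {
  vapply(tokens, paste, character(1), collapse = sep)
}

tokens <- list(c("this", "is", "words"), c("they", "are", "nice"))
untokenize_sketch(tokens)
untokenize_sketch(tokens, sep = "_")
```

Because the tokens are simply pasted together, anything the tokenizer removed earlier (punctuation, casing, repeated spaces) is not restored.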
Value
An updated version of recipe
with the new step added
to the sequence of existing steps (if any).
Tidying
When you tidy()
this step, a tibble is returned with
columns terms
, value
, and id
:
- terms
character, the selectors or variables selected
- value
character, separator used for collapsing
- id
character, id of this step
Case weights
The underlying operation does not allow for case weights.
See Also
step_tokenize()
to turn characters into tokens
Examples
library(recipes)
library(modeldata)
data(tate_text)
tate_rec <- recipe(~., data = tate_text) %>%
step_tokenize(medium) %>%
step_untokenize(medium)
tate_obj <- tate_rec %>%
prep()
bake(tate_obj, new_data = NULL, medium) %>%
slice(1:2)
bake(tate_obj, new_data = NULL) %>%
slice(2) %>%
pull(medium)
tidy(tate_rec, number = 2)
tidy(tate_obj, number = 2)
Pretrained Word Embeddings of Tokens
Description
step_word_embeddings()
creates a specification of a recipe step that will
convert a token
variable into word-embedding dimensions by
aggregating the vectors of each token from a pre-trained embedding.
Usage
step_word_embeddings(
recipe,
...,
role = "predictor",
trained = FALSE,
columns = NULL,
embeddings,
aggregation = c("sum", "mean", "min", "max"),
aggregation_default = 0,
prefix = "wordembed",
keep_original_cols = FALSE,
skip = FALSE,
id = rand_id("word_embeddings")
)
Arguments
recipe |
A recipes::recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose which
variables are affected by the step. See |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the function assumes that the new columns created by the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
columns |
A character string of variable names that will
be populated (eventually) by the |
embeddings |
A tibble of pre-trained word embeddings, such as those returned by the embedding_glove function from the textdata package. The first column should contain tokens, and additional columns should contain embedding vectors. |
aggregation |
A character giving the name of the aggregation function to use. Must be one of "sum", "mean", "min", and "max". Defaults to "sum". |
aggregation_default |
A numeric denoting the default value for cases where no words are matched in the embedding. Defaults to 0. |
prefix |
A character string that will be the prefix to the resulting new variables. See notes below. |
keep_original_cols |
A logical to keep the original variables in the
output. Defaults to FALSE. |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Details
Word embeddings map words (or other tokens) into a high-dimensional feature space. This function maps pre-trained word embeddings onto the tokens in your data.
The argument embeddings
provides the pre-trained vectors. Each dimension
present in this tibble becomes a new feature column, with each column
aggregated across each row of your text using the function supplied in the
aggregation
argument.
The new components will have names that begin with prefix, then the name of
the aggregation function, then the name of the variable from the embeddings
tibble (usually something like "d7"). For example, using the default
"wordembed" prefix, and the GloVe embeddings from the textdata package
(where the column names are d1, d2, etc), new columns would be
wordembed_d1, wordembed_d2, etc.
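The aggregation described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the helper aggregate_embeddings and its arguments fun and default are hypothetical names, tokens are assumed to be already lowercased, and the embedding table is a plain data frame with tokens in the first column.

```r
# Toy embedding table: first column holds tokens, remaining columns hold the
# embedding dimensions (the layout described for the `embeddings` argument).
embeddings <- data.frame(
  tokens = c("the", "cat", "ran"),
  d1 = c(1, 0, 0),
  d2 = c(0, 1, 0),
  d3 = c(0, 0, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: aggregate the embedding rows matched by one document's
# tokens; fall back to `default` when no token is found in the table.
aggregate_embeddings <- function(tokens, embeddings, fun = sum, default = 0) {
  rows <- match(tokens, embeddings$tokens, nomatch = 0)
  dims <- embeddings[rows, -1, drop = FALSE]
  if (nrow(dims) == 0) {
    out <- rep(default, ncol(dims))
    names(out) <- names(dims)
    return(out)
  }
  # Apply the aggregation function to each embedding dimension
  vapply(dims, fun, numeric(1))
}

docs <- list(c("the"), c("the", "cat"), c("the", "cat", "ran"), c("dog"))
# One row per document, one column per embedding dimension
t(vapply(docs, aggregate_embeddings, numeric(3), embeddings = embeddings))
```

The last document contains no token from the table, so every dimension takes the default value, which plays the role of aggregation_default. Passing fun = mean instead of sum gives the per-dimension average of the matched vectors.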
Value
An updated version of recipe
with the new step added
to the sequence of existing steps (if any).
Tidying
When you tidy()
this step, a tibble is returned with
columns terms
, embedding_rows
, aggregation
, and id
:
- terms
character, the selectors or variables selected
- embedding_rows
integer, number of rows in embedding
- aggregation
character, aggregation
- id
character, id of this step
Case weights
The underlying operation does not allow for case weights.
See Also
step_tokenize()
to turn characters into tokens
Other Steps for Numeric Variables From Tokens:
step_lda()
,
step_texthash()
,
step_tf()
,
step_tfidf()
Examples
library(recipes)
embeddings <- tibble(
tokens = c("the", "cat", "ran"),
d1 = c(1, 0, 0),
d2 = c(0, 1, 0),
d3 = c(0, 0, 1)
)
sample_data <- tibble(
text = c(
"The.",
"The cat.",
"The cat ran."
),
text_label = c("fragment", "fragment", "sentence")
)
rec <- recipe(text_label ~ ., data = sample_data) %>%
step_tokenize(text) %>%
step_word_embeddings(text, embeddings = embeddings)
obj <- rec %>%
prep()
bake(obj, sample_data)
tidy(rec, number = 2)
tidy(obj, number = 2)
Create Token Object
Description
A tokenlist object is a thin wrapper around a list of character vectors, with a few attributes.
Usage
tokenlist(tokens = list(), lemma = NULL, pos = NULL)
Arguments
tokens |
List of character vectors |
lemma |
List of character vectors, must be same size and shape as |
pos |
List of character vectors, must be same size and shape as |
Value
a tokenlist object.
Examples
abc <- list(letters, LETTERS)
tokenlist(abc)
unclass(tokenlist(abc))
tibble(text = tokenlist(abc))
library(tokenizers)
library(modeldata)
data(tate_text)
tokens <- tokenize_words(as.character(tate_text$medium))
tokenlist(tokens)
tunable methods for textrecipes
Description
These functions define what parameters can be tuned for specific steps.
They also define the recommended objects from the dials
package that can
be used to generate new parameter values and other characteristics.
Usage
## S3 method for class 'step_dummy_hash'
tunable(x, ...)
## S3 method for class 'step_ngram'
tunable(x, ...)
## S3 method for class 'step_texthash'
tunable(x, ...)
## S3 method for class 'step_tf'
tunable(x, ...)
## S3 method for class 'step_tokenfilter'
tunable(x, ...)
## S3 method for class 'step_tokenize'
tunable(x, ...)
## S3 method for class 'step_tokenize_bpe'
tunable(x, ...)
Arguments
x |
A recipe step object |
... |
Not used. |
Value
A tibble object.