Type: | Package |
Title: | Interface to 'TensorFlow' Datasets |
Version: | 2.17.0 |
Description: | Interface to 'TensorFlow' Datasets, a high-level library for building complex input pipelines from simple, re-usable pieces. See https://www.tensorflow.org/guide for additional details. |
License: | Apache License 2.0 |
URL: | https://github.com/rstudio/tfdatasets |
BugReports: | https://github.com/rstudio/tfdatasets/issues |
SystemRequirements: | TensorFlow >= 1.4 (https://www.tensorflow.org/) |
Encoding: | UTF-8 |
LazyData: | true |
Depends: | R (≥ 3.1) |
Imports: | reticulate (≥ 1.10), tensorflow (≥ 1.13.1), magrittr, rlang, tidyselect, stats, generics, vctrs |
RoxygenNote: | 7.3.2 |
Suggests: | testthat, knitr, keras3, rsample, rmarkdown, Metrics, dplyr, progress, tfestimators |
VignetteBuilder: | knitr |
NeedsCompilation: | no |
Packaged: | 2024-07-16 18:09:10 UTC; tomasz |
Author: | Tomasz Kalinowski [ctb, cph, cre],
Daniel Falbel [ctb, cph],
JJ Allaire [aut, cph],
Yuan Tang |
Maintainer: | Tomasz Kalinowski <tomasz@posit.co> |
Repository: | CRAN |
Date/Publication: | 2024-07-16 22:30:01 UTC |
Pipe operator
Description
See magrittr::%>% for more details.
Usage
lhs %>% rhs
Find all nominal variables.
Description
Currently we only consider "string" type as nominal.
Usage
all_nominal()
See Also
Other Selectors: all_numeric(), has_type()
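For instance, a minimal sketch (assuming the hearts dataset and feature_spec() workflow shown elsewhere in this manual) that selects every string column for a vocabulary step:
spec <- feature_spec(hearts, target ~ .) %>%
  step_categorical_column_with_vocabulary_list(all_nominal())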
Specify all numeric variables.
Description
Find all the variables with the following types: "float16", "float32", "float64", "int16", "int32", "int64", "half", "double".
Usage
all_numeric()
See Also
Other Selectors: all_nominal(), has_type()
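For instance, a minimal sketch (assuming the hearts dataset and scaler_standard() from this package) that normalizes every numeric column in one step:
spec <- feature_spec(hearts, target ~ .) %>%
  step_numeric_column(all_numeric(), normalizer_fn = scaler_standard())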
Convert tf_dataset to an iterator that yields R arrays.
Description
Convert tf_dataset to an iterator that yields R arrays.
Usage
as_array_iterator(dataset)
Arguments
dataset |
A tensorflow dataset |
Value
An iterable. Use iterate() or iter_next() to access values from the iterator.
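A minimal usage sketch:
ds <- range_dataset(0, 3)
it <- as_array_iterator(ds)
while (!is.null(x <- iter_next(it)))
  print(x)  # prints the R arrays 0, 1, 2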
Get the single element of the dataset.
Description
The function enables you to use a TF Dataset in a stateless "tensor-in tensor-out" expression, without creating an iterator. This makes it easy to transform tensors using the optimized TF Dataset abstraction on top of them.
Usage
## S3 method for class 'tensorflow.python.data.ops.dataset_ops.DatasetV2'
as_tensor(x, ..., name = NULL)
## S3 method for class 'tensorflow.python.data.ops.dataset_ops.DatasetV2'
as.array(x, ...)
Arguments
x |
A TF Dataset |
... |
passed on to |
name |
(Optional.) A name for the TensorFlow operation. |
Details
For example, consider a preprocess_batch() which takes as input a batch of raw features and returns the processed features.
preprocess_one_case <- function(x) x + 100

preprocess_batch <- function(raw_features) {
  batch_size <- dim(raw_features)[1]
  ds <- raw_features %>%
    tensor_slices_dataset() %>%
    dataset_map(preprocess_one_case, num_parallel_calls = batch_size) %>%
    dataset_batch(batch_size)
  as_tensor(ds)
}

raw_features <- array(seq(prod(4, 5)), c(4, 5))
preprocess_batch(raw_features)
In the above example, the batch of raw_features
was converted to a TF
Dataset. Next, each of the raw_feature cases in the batch was mapped using
the preprocess_one_case and the processed features were grouped into a single
batch. The final dataset contains only one element which is a batch of all
the processed features.
Note: The dataset should contain only one element. Now, instead of creating
an iterator for the dataset and retrieving the batch of features, the
as_tensor()
function is used to skip the iterator creation process and
directly output the batch of features.
This can be particularly useful when your tensor transformations are expressed as TF Dataset operations, and you want to use those transformations while serving your model.
Add the tf_dataset class to a dataset
Description
Calling this function on a dataset adds the "tf_dataset" class to the dataset object. All datasets returned by functions in the tfdatasets package call this function on the dataset before returning it.
Usage
as_tf_dataset(dataset)
Arguments
dataset |
A dataset |
Value
A dataset with class "tf_dataset"
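A sketch of when you might call this directly, e.g. after constructing a dataset with the lower-level Python API (hypothetical illustration):
# a dataset built via reticulate lacks the R class
ds <- tf$data$Dataset$range(10L) %>% as_tf_dataset()
class(ds)  # now includes "tf_dataset"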
Creates a dataset that deterministically chooses elements from datasets.
Description
Creates a dataset that deterministically chooses elements from datasets.
Usage
choose_from_datasets(datasets, choice_dataset, stop_on_empty_dataset = TRUE)
Arguments
datasets |
A non-empty list of tf.data.Dataset objects with compatible structure. |
choice_dataset |
A tf.data.Dataset of scalar tf.int64 tensors between 0 and length(datasets) - 1. |
stop_on_empty_dataset |
If TRUE, selection stops if it encounters an empty dataset; if FALSE, it skips empty datasets. Defaults to TRUE. |
Value
Returns a dataset that interleaves elements from datasets according to the values of choice_dataset.
Examples
## Not run:
datasets <- list(tensors_dataset("foo") %>% dataset_repeat(),
tensors_dataset("bar") %>% dataset_repeat(),
tensors_dataset("baz") %>% dataset_repeat())
# Define a dataset containing `[0, 1, 2, 0, 1, 2, 0, 1, 2]`.
choice_dataset <- range_dataset(0, 3) %>% dataset_repeat(3)
result <- choose_from_datasets(datasets, choice_dataset)
result %>% as_array_iterator() %>% iterate(function(s) s$decode()) %>% print()
# [1] "foo" "bar" "baz" "foo" "bar" "baz" "foo" "bar" "baz"
## End(Not run)
Combines consecutive elements of this dataset into batches.
Description
The components of the resulting element will have an additional outer
dimension, which will be batch_size
(or N %% batch_size
for the last
element if batch_size
does not divide the number of input elements N
evenly and drop_remainder
is FALSE
). If your program depends on the
batches having the same outer dimension, you should set the drop_remainder
argument to TRUE
to prevent the smaller batch from being produced.
Usage
dataset_batch(
dataset,
batch_size,
drop_remainder = FALSE,
num_parallel_calls = NULL,
deterministic = NULL
)
Arguments
dataset |
A dataset |
batch_size |
An integer, representing the number of consecutive elements of this dataset to combine in a single batch. |
drop_remainder |
(Optional.) A boolean, representing whether the last
batch should be dropped in the case it has fewer than batch_size elements; the default behavior is not to drop the smaller batch. |
num_parallel_calls |
(Optional.) A scalar integer, representing the
number of batches to compute asynchronously in parallel. If not specified,
batches will be computed sequentially. If the value tf$data$AUTOTUNE is used, the number of parallel calls is set dynamically based on available resources. |
deterministic |
(Optional.) When num_parallel_calls is specified, this boolean controls the order in which the transformation produces elements. If set to FALSE, the transformation is allowed to yield elements out of order to trade determinism for performance. If not specified, the behavior defaults to deterministic. |
Value
A dataset
Note
If your program requires data to have a statically known shape (e.g.,
when using XLA), you should use drop_remainder=TRUE
. Without
drop_remainder=TRUE
the shape of the output dataset will have an unknown
leading dimension due to the possibility of a smaller final batch.
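For instance, a minimal sketch of both behaviors:
ds_ragged <- range_dataset(0, 10) %>% dataset_batch(3)
# batches of shape (3), (3), (3), then a final smaller batch of shape (1)
ds_static <- range_dataset(0, 10) %>% dataset_batch(3, drop_remainder = TRUE)
# only the three full batches; the batch dimension is statically 3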
See Also
Other dataset methods: dataset_cache(), dataset_collect(), dataset_concatenate(), dataset_decode_delim(), dataset_filter(), dataset_interleave(), dataset_map(), dataset_map_and_batch(), dataset_padded_batch(), dataset_prefetch(), dataset_prefetch_to_device(), dataset_reduce(), dataset_repeat(), dataset_shuffle(), dataset_shuffle_and_repeat(), dataset_skip(), dataset_take(), dataset_take_while(), dataset_window()
A transformation that buckets elements in a Dataset
by length
Description
A transformation that buckets elements in a Dataset
by length
Usage
dataset_bucket_by_sequence_length(
dataset,
element_length_func,
bucket_boundaries,
bucket_batch_sizes,
padded_shapes = NULL,
padding_values = NULL,
pad_to_bucket_boundary = FALSE,
no_padding = FALSE,
drop_remainder = FALSE,
name = NULL
)
Arguments
dataset |
A tf.data.Dataset |
element_length_func |
function from element in the Dataset to tf$int32; determines the length of the element, which will determine the bucket it goes into. |
bucket_boundaries |
integers, upper length boundaries of the buckets. |
bucket_batch_sizes |
integers, batch size per bucket. Length should be
length(bucket_boundaries) + 1. |
padded_shapes |
Nested structure of tf$TensorShape to pass to dataset_padded_batch(). If not provided, variable-length dimensions are padded out to the maximum length in each batch. |
padding_values |
Values to pad with, passed to
dataset_padded_batch(). Defaults to padding with 0. |
pad_to_bucket_boundary |
bool, if FALSE, will pad dimensions with unknown size to the maximum length in the batch. If TRUE, will pad dimensions with unknown size to the bucket boundary minus 1 (i.e., the maximum length in each bucket), and the caller must ensure that the source dataset does not contain any elements with length longer than max(bucket_boundaries). |
no_padding |
boolean, indicates whether to pad the batch features (features
need to be either of type tf.sparse.SparseTensor or of same shape). |
drop_remainder |
(Optional.) A logical scalar, representing
whether the last batch should be dropped in the case it has fewer than
batch_size elements; the default behavior is not to drop the smaller batch. |
name |
(Optional.) A name for the tf.data operation. |
Details
Elements of the Dataset
are grouped together by length and then are padded
and batched.
This is useful for sequence tasks in which the elements have variable length. Grouping together elements that have similar lengths reduces the total fraction of padding in a batch which increases training step efficiency.
Below is an example that bucketizes the input data into the 3 buckets "[0, 3), [3, 5), [5, Inf)" based on sequence length, with batch size 2.
Examples
## Not run:
dataset <- list(c(0),
c(1, 2, 3, 4),
c(5, 6, 7),
c(7, 8, 9, 10, 11),
c(13, 14, 15, 16, 17, 18, 19, 20),
c(21, 22)) %>%
lapply(as.array) %>% lapply(as_tensor, "int32") %>%
lapply(tensors_dataset) %>%
Reduce(dataset_concatenate, .)
dataset %>%
dataset_bucket_by_sequence_length(
element_length_func = function(elem) tf$shape(elem)[1],
bucket_boundaries = c(3, 5),
bucket_batch_sizes = c(2, 2, 2)
) %>%
as_array_iterator() %>%
iterate(print)
# [,1] [,2] [,3] [,4]
# [1,] 1 2 3 4
# [2,] 5 6 7 0
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# [1,] 7 8 9 10 11 0 0 0
# [2,] 13 14 15 16 17 18 19 20
# [,1] [,2]
# [1,] 0 0
# [2,] 21 22
## End(Not run)
Caches the elements in this dataset.
Description
Caches the elements in this dataset.
Usage
dataset_cache(dataset, filename = NULL)
Arguments
dataset |
A dataset |
filename |
String with the name of a directory on the filesystem to use for caching tensors in this Dataset. If a filename is not provided, the dataset will be cached in memory. |
Value
A dataset
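A sketch of both in-memory and on-disk caching (the cache path is hypothetical):
ds <- range_dataset(0, 5) %>% dataset_map(function(x) x * 2L)
ds_mem  <- ds %>% dataset_cache()                       # cache in memory
ds_disk <- ds %>% dataset_cache(filename = tempfile())  # cache on disk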
See Also
Other dataset methods: dataset_batch(), dataset_collect(), dataset_concatenate(), dataset_decode_delim(), dataset_filter(), dataset_interleave(), dataset_map(), dataset_map_and_batch(), dataset_padded_batch(), dataset_prefetch(), dataset_prefetch_to_device(), dataset_reduce(), dataset_repeat(), dataset_shuffle(), dataset_shuffle_and_repeat(), dataset_skip(), dataset_take(), dataset_take_while(), dataset_window()
Collects a dataset
Description
Iterates through the dataset, collecting every element into a list. This is useful for looking at the full result of the dataset. Note: you may run out of memory if your dataset is too big.
Usage
dataset_collect(dataset, iter_max = Inf)
Arguments
dataset |
A dataset |
iter_max |
Maximum number of iterations. |
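A minimal usage sketch:
elems <- range_dataset(0, 5) %>% dataset_collect()
length(elems)  # 5
first10 <- range_dataset(0, 100) %>% dataset_collect(iter_max = 10)
length(first10)  # 10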
See Also
Other dataset methods: dataset_batch(), dataset_cache(), dataset_concatenate(), dataset_decode_delim(), dataset_filter(), dataset_interleave(), dataset_map(), dataset_map_and_batch(), dataset_padded_batch(), dataset_prefetch(), dataset_prefetch_to_device(), dataset_reduce(), dataset_repeat(), dataset_shuffle(), dataset_shuffle_and_repeat(), dataset_skip(), dataset_take(), dataset_take_while(), dataset_window()
Creates a dataset by concatenating given dataset with this dataset.
Description
Creates a dataset by concatenating given dataset with this dataset.
Usage
dataset_concatenate(dataset, ...)
Arguments
dataset, ... |
Datasets to be concatenated. |
Value
A dataset
Note
Input dataset and dataset to be concatenated should have same nested structures and output types.
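A minimal sketch:
ds <- tensor_slices_dataset(1:3) %>%
  dataset_concatenate(tensor_slices_dataset(4:6))
# yields 1, 2, 3, 4, 5, 6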
See Also
Other dataset methods: dataset_batch(), dataset_cache(), dataset_collect(), dataset_decode_delim(), dataset_filter(), dataset_interleave(), dataset_map(), dataset_map_and_batch(), dataset_padded_batch(), dataset_prefetch(), dataset_prefetch_to_device(), dataset_reduce(), dataset_repeat(), dataset_shuffle(), dataset_shuffle_and_repeat(), dataset_skip(), dataset_take(), dataset_take_while(), dataset_window()
Transform a dataset with delimited text lines into a dataset with named columns
Description
Transform a dataset with delimited text lines into a dataset with named columns
Usage
dataset_decode_delim(dataset, record_spec, parallel_records = NULL)
Arguments
dataset |
Dataset containing delimited text lines (e.g. a CSV) |
record_spec |
Specification of column names and types (see delim_record_spec()). |
parallel_records |
(Optional) An integer, representing the number of records to decode in parallel. If not specified, records will be processed sequentially. |
See Also
Other dataset methods: dataset_batch(), dataset_cache(), dataset_collect(), dataset_concatenate(), dataset_filter(), dataset_interleave(), dataset_map(), dataset_map_and_batch(), dataset_padded_batch(), dataset_prefetch(), dataset_prefetch_to_device(), dataset_reduce(), dataset_repeat(), dataset_shuffle(), dataset_shuffle_and_repeat(), dataset_skip(), dataset_take(), dataset_take_while(), dataset_window()
Enumerates the elements of this dataset
Description
Enumerates the elements of this dataset
Usage
dataset_enumerate(dataset, start = 0L)
Arguments
dataset |
A tensorflow dataset |
start |
An integer (coerced to a tf$int64 scalar tensor), the value from which enumeration starts. |
Details
It is similar to Python's enumerate(): it transforms a sequence of elements into a sequence of list(index, element), where index is an integer that indicates the position of the element in the sequence.
Examples
## Not run:
dataset <- tensor_slices_dataset(100:103) %>%
dataset_enumerate()
iterator <- reticulate::as_iterator(dataset)
reticulate::iter_next(iterator) # list(0, 100)
reticulate::iter_next(iterator) # list(1, 101)
reticulate::iter_next(iterator) # list(2, 102)
reticulate::iter_next(iterator) # list(3, 103)
reticulate::iter_next(iterator) # NULL (iterator exhausted)
reticulate::iter_next(iterator) # NULL (iterator exhausted)
## End(Not run)
Filter a dataset by a predicate
Description
Filter a dataset by a predicate
Usage
dataset_filter(dataset, predicate)
Arguments
dataset |
A dataset |
predicate |
A function mapping a nested structure of tensors (having
shapes and types defined by output_shapes() and output_types()) to a scalar tf$bool tensor. |
Details
Note that the functions used inside the predicate must be
tensor operations (e.g. tf$not_equal
, tf$less
, etc.). R
generic methods for relational operators (e.g. <
, >
, <=
,
etc.) and logical operators (e.g. !
, &
, |
, etc.) are
provided so you can use shorthand syntax for most common
comparisons (this is illustrated by the example below).
Value
A dataset composed of records that matched the predicate.
See Also
Other dataset methods: dataset_batch(), dataset_cache(), dataset_collect(), dataset_concatenate(), dataset_decode_delim(), dataset_interleave(), dataset_map(), dataset_map_and_batch(), dataset_padded_batch(), dataset_prefetch(), dataset_prefetch_to_device(), dataset_reduce(), dataset_repeat(), dataset_shuffle(), dataset_shuffle_and_repeat(), dataset_skip(), dataset_take(), dataset_take_while(), dataset_window()
Examples
## Not run:
dataset <- text_line_dataset("mtcars.csv", record_spec = mtcars_spec) %>%
dataset_filter(function(record) {
record$mpg >= 20
})
dataset <- text_line_dataset("mtcars.csv", record_spec = mtcars_spec) %>%
dataset_filter(function(record) {
record$mpg >= 20 & record$cyl >= 6L
})
## End(Not run)
Maps map_func across this dataset and flattens the result.
Description
Maps map_func across this dataset and flattens the result.
Usage
dataset_flat_map(dataset, map_func)
Arguments
dataset |
A dataset |
map_func |
A function mapping a nested structure of tensors (having
shapes and types defined by output_shapes() and output_types()) to a dataset. |
Value
A dataset
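For instance, a sketch that flattens a dataset of vectors into a dataset of scalars:
# each element of `ds` is a length-3 vector; the mapped function returns
# a dataset per element, which dataset_flat_map() flattens
ds <- tensor_slices_dataset(matrix(1:6, ncol = 3, byrow = TRUE)) %>%
  dataset_flat_map(function(x) tensor_slices_dataset(x))
# yields the scalars 1, 2, 3, 4, 5, 6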
Group windows of elements by key and reduce them
Description
Group windows of elements by key and reduce them
Usage
dataset_group_by_window(
dataset,
key_func,
reduce_func,
window_size = NULL,
window_size_func = NULL,
name = NULL
)
Arguments
dataset |
a TF Dataset |
key_func |
A function mapping a nested structure of tensors (having
shapes and types defined by output_shapes() and output_types()) to a scalar tf$int64 tensor. |
reduce_func |
A function mapping a key and a dataset of up to
window_size consecutive elements matching that key to another dataset. |
window_size |
A tf$int64 scalar tensor, representing the number of consecutive elements matching the same key to combine in a single batch, which will be passed to reduce_func. Mutually exclusive with window_size_func. |
window_size_func |
A function mapping a key to a tf$int64 scalar tensor, representing the number of consecutive elements matching the same key to combine in a single batch. Mutually exclusive with window_size. |
name |
(Optional.) A name for the Tensorflow operation. |
Details
This transformation maps each consecutive element in a dataset to a
key using key_func()
and groups the elements by key. It then applies
reduce_func()
to at most window_size_func(key)
elements matching the same
key. All except the final window for each key will contain
window_size_func(key)
elements; the final window may be smaller.
You may provide either a constant window_size
or a window size determined
by the key through window_size_func
.
window_size <- 5
dataset <- range_dataset(to = 10) %>%
  dataset_group_by_window(
    key_func = function(x) x %% 2,
    reduce_func = function(key, ds) dataset_batch(ds, window_size),
    window_size = window_size
  )
it <- as_array_iterator(dataset)
while (!is.null(elem <- iter_next(it)))
  print(elem)
#> tf.Tensor([0 2 4 6 8], shape=(5), dtype=int64)
#> tf.Tensor([1 3 5 7 9], shape=(5), dtype=int64)
Maps map_func across this dataset, and interleaves the results
Description
Maps map_func across this dataset, and interleaves the results
Usage
dataset_interleave(dataset, map_func, cycle_length, block_length = 1)
Arguments
dataset |
A dataset |
map_func |
A function mapping a nested structure of tensors (having
shapes and types defined by output_shapes() and output_types()) to a dataset. |
cycle_length |
The number of elements from this dataset that will be processed concurrently. |
block_length |
The number of consecutive elements to produce from each input element before cycling to another input element. |
Details
The cycle_length
and block_length
arguments control the order in which
elements are produced. cycle_length
controls the number of input elements
that are processed concurrently. In general, this transformation will apply
map_func
to cycle_length
input elements, open iterators on the returned
dataset objects, and cycle through them producing block_length
consecutive
elements from each iterator, and consuming the next input element each time
it reaches the end of an iterator.
See Also
Other dataset methods: dataset_batch(), dataset_cache(), dataset_collect(), dataset_concatenate(), dataset_decode_delim(), dataset_filter(), dataset_map(), dataset_map_and_batch(), dataset_padded_batch(), dataset_prefetch(), dataset_prefetch_to_device(), dataset_reduce(), dataset_repeat(), dataset_shuffle(), dataset_shuffle_and_repeat(), dataset_skip(), dataset_take(), dataset_take_while(), dataset_window()
Examples
## Not run:
dataset <- tensor_slices_dataset(c(1,2,3,4,5)) %>%
dataset_interleave(cycle_length = 2, block_length = 4, function(x) {
tensors_dataset(x) %>%
dataset_repeat(6)
})
# resulting dataset (newlines indicate "block" boundaries):
c(1, 1, 1, 1,
2, 2, 2, 2,
1, 1,
2, 2,
3, 3, 3, 3,
4, 4, 4, 4,
3, 3,
4, 4,
5, 5, 5, 5,
5, 5
)
## End(Not run)
Map a function across a dataset.
Description
Map a function across a dataset.
Usage
dataset_map(dataset, map_func, num_parallel_calls = NULL)
Arguments
dataset |
A dataset |
map_func |
A function mapping a nested structure of tensors (having
shapes and types defined by output_shapes() and output_types()) to another nested structure of tensors. |
num_parallel_calls |
(Optional) An integer, representing the number of elements to process in parallel. If not specified, elements will be processed sequentially. |
Value
A dataset
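A minimal sketch (tf$data$AUTOTUNE assumes a recent TensorFlow):
ds <- range_dataset(0, 5) %>%
  dataset_map(function(x) x * 2L, num_parallel_calls = tf$data$AUTOTUNE)
# yields 0, 2, 4, 6, 8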
See Also
Other dataset methods: dataset_batch(), dataset_cache(), dataset_collect(), dataset_concatenate(), dataset_decode_delim(), dataset_filter(), dataset_interleave(), dataset_map_and_batch(), dataset_padded_batch(), dataset_prefetch(), dataset_prefetch_to_device(), dataset_reduce(), dataset_repeat(), dataset_shuffle(), dataset_shuffle_and_repeat(), dataset_skip(), dataset_take(), dataset_take_while(), dataset_window()
Fused implementation of dataset_map() and dataset_batch()
Description
Maps map_func across batch_size consecutive elements of this dataset and then combines them into a batch. Functionally, it is equivalent to dataset_map() followed by dataset_batch(). However, by fusing the two transformations together, the implementation can be more efficient.
Usage
dataset_map_and_batch(
dataset,
map_func,
batch_size,
num_parallel_batches = NULL,
drop_remainder = FALSE,
num_parallel_calls = NULL
)
Arguments
dataset |
A dataset |
map_func |
A function mapping a nested structure of tensors (having
shapes and types defined by output_shapes() and output_types()) to another nested structure of tensors. |
batch_size |
An integer, representing the number of consecutive elements of this dataset to combine in a single batch. |
num_parallel_batches |
(Optional) An integer, representing the number of batches to create in parallel. On one hand, higher values can help mitigate the effect of stragglers. On the other hand, higher values can increase contention if CPU is scarce. |
drop_remainder |
(Optional.) A boolean, representing whether the last
batch should be dropped in the case it has fewer than batch_size elements; the default behavior is not to drop the smaller batch. |
num_parallel_calls |
(Optional) An integer, representing the number of elements to process in parallel. If not specified, elements will be processed sequentially. |
See Also
Other dataset methods: dataset_batch(), dataset_cache(), dataset_collect(), dataset_concatenate(), dataset_decode_delim(), dataset_filter(), dataset_interleave(), dataset_map(), dataset_padded_batch(), dataset_prefetch(), dataset_prefetch_to_device(), dataset_reduce(), dataset_repeat(), dataset_shuffle(), dataset_shuffle_and_repeat(), dataset_skip(), dataset_take(), dataset_take_while(), dataset_window()
Get or Set Dataset Options
Description
Get or Set Dataset Options
Usage
dataset_options(dataset, ...)
Arguments
dataset |
a tensorflow dataset |
... |
Valid values include:
|
Details
The options are "global" in the sense they apply to the entire dataset. If options are set multiple times, they are merged as long as different options do not use different non-default values.
Value
If values are supplied to ...
, returns a tf.data.Dataset
with the
given options set/updated. Otherwise, returns the currently set options for
the dataset.
Examples
## Not run:
# pass options directly:
range_dataset(0, 10) %>%
dataset_options(
experimental_deterministic = FALSE,
threading.private_threadpool_size = 10
)
# pass options as a named list:
opts <- list(
experimental_deterministic = FALSE,
threading.private_threadpool_size = 10
)
range_dataset(0, 10) %>%
dataset_options(opts)
# pass a tf.data.Options() instance
opts <- tf$data$Options()
opts$experimental_deterministic <- FALSE
opts$threading$private_threadpool_size <- 10L
range_dataset(0, 10) %>%
dataset_options(opts)
# get currently set options
range_dataset(0, 10) %>% dataset_options()
## End(Not run)
Combines consecutive elements of this dataset into padded batches.
Description
Combines consecutive elements of this dataset into padded batches.
Usage
dataset_padded_batch(
dataset,
batch_size,
padded_shapes = NULL,
padding_values = NULL,
drop_remainder = FALSE,
name = NULL
)
Arguments
dataset |
A dataset |
batch_size |
An integer, representing the number of consecutive elements of this dataset to combine in a single batch. |
padded_shapes |
(Optional.) A (nested) structure of
tf$TensorShape or tf$int64 vector tensor-like objects representing the shape to which the respective component of each input element should be padded prior to batching. Any unknown dimensions will be padded to the maximum size of that dimension in each batch. If unset, all dimensions of all components are padded to the maximum size in the batch. |
padding_values |
(Optional.) A (nested) structure of scalar-shaped
tf$Tensor, representing the padding values to use for the respective components. NULL represents that the (nested) structure should be padded with default values (0 for numeric types, the empty string for string types). |
drop_remainder |
(Optional.) A boolean scalar, representing
whether the last batch should be dropped in the case it has fewer than
batch_size elements; the default behavior is not to drop the smaller batch. |
name |
(Optional.) A name for the tf.data operation. Requires tensorflow version >= 2.7. |
Details
This transformation combines multiple consecutive elements of the input dataset into a single element.
Like dataset_batch()
, the components of the resulting element will
have an additional outer dimension, which will be batch_size
(or
N %% batch_size
for the last element if batch_size
does not divide the
number of input elements N
evenly and drop_remainder
is FALSE
). If
your program depends on the batches having the same outer dimension, you
should set the drop_remainder
argument to TRUE
to prevent the smaller
batch from being produced.
Unlike dataset_batch()
, the input elements to be batched may have
different shapes, and this transformation will pad each component to the
respective shape in padded_shapes
. The padded_shapes
argument
determines the resulting shape for each dimension of each component in an
output element:
- If the dimension is a constant, the component will be padded out to that length in that dimension.
- If the dimension is unknown, the component will be padded out to the maximum length of all elements in that dimension.
See also tf$data$experimental$dense_to_sparse_batch
, which combines
elements that may have different shapes into a tf$sparse$SparseTensor
.
Value
A tf_dataset
See Also
Other dataset methods: dataset_batch(), dataset_cache(), dataset_collect(), dataset_concatenate(), dataset_decode_delim(), dataset_filter(), dataset_interleave(), dataset_map(), dataset_map_and_batch(), dataset_prefetch(), dataset_prefetch_to_device(), dataset_reduce(), dataset_repeat(), dataset_shuffle(), dataset_shuffle_and_repeat(), dataset_skip(), dataset_take(), dataset_take_while(), dataset_window()
Examples
## Not run:
A <- range_dataset(1, 5, dtype = tf$int32) %>%
dataset_map(function(x) tf$fill(list(x), x))
# Pad to the smallest per-batch size that fits all elements.
B <- A %>% dataset_padded_batch(2)
B %>% as_array_iterator() %>% iterate(print)
# Pad to a fixed size.
C <- A %>% dataset_padded_batch(2, padded_shapes=5)
C %>% as_array_iterator() %>% iterate(print)
# Pad with a custom value.
D <- A %>% dataset_padded_batch(2, padded_shapes=5, padding_values = -1L)
D %>% as_array_iterator() %>% iterate(print)
# Pad with a single value and multiple components.
E <- zip_datasets(A, A) %>% dataset_padded_batch(2, padding_values = -1L)
E %>% as_array_iterator() %>% iterate(print)
## End(Not run)
Creates a Dataset that prefetches elements from this dataset.
Description
Creates a Dataset that prefetches elements from this dataset.
Usage
dataset_prefetch(dataset, buffer_size = tf$data$AUTOTUNE)
Arguments
dataset |
A dataset |
buffer_size |
An integer, representing the maximum number elements that will be buffered when prefetching. |
Value
A dataset
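A minimal sketch; prefetching typically sits at the end of the pipeline:
ds <- range_dataset(0, 100) %>%
  dataset_batch(10) %>%
  dataset_prefetch()  # buffer size defaults to tf$data$AUTOTUNE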
See Also
Other dataset methods: dataset_batch(), dataset_cache(), dataset_collect(), dataset_concatenate(), dataset_decode_delim(), dataset_filter(), dataset_interleave(), dataset_map(), dataset_map_and_batch(), dataset_padded_batch(), dataset_prefetch_to_device(), dataset_reduce(), dataset_repeat(), dataset_shuffle(), dataset_shuffle_and_repeat(), dataset_skip(), dataset_take(), dataset_take_while(), dataset_window()
A transformation that prefetches dataset values to the given device
Description
A transformation that prefetches dataset values to the given device
Usage
dataset_prefetch_to_device(dataset, device, buffer_size = NULL)
Arguments
dataset |
A dataset |
device |
A string. The name of a device to which elements will be prefetched (e.g. "/gpu:0"). |
buffer_size |
(Optional.) The number of elements to buffer on device. Defaults to an automatically chosen value. |
Value
A dataset
Note
Although the transformation creates a dataset, the transformation must be the final dataset in the input pipeline.
See Also
Other dataset methods: dataset_batch(), dataset_cache(), dataset_collect(), dataset_concatenate(), dataset_decode_delim(), dataset_filter(), dataset_interleave(), dataset_map(), dataset_map_and_batch(), dataset_padded_batch(), dataset_prefetch(), dataset_reduce(), dataset_repeat(), dataset_shuffle(), dataset_shuffle_and_repeat(), dataset_skip(), dataset_take(), dataset_take_while(), dataset_window()
Prepare a dataset for analysis
Description
Transform a dataset with named columns into a list with features (x
) and
response (y
) elements.
Usage
dataset_prepare(
dataset,
x,
y = NULL,
named = TRUE,
named_features = FALSE,
parallel_records = NULL,
batch_size = NULL,
num_parallel_batches = NULL,
drop_remainder = FALSE
)
Arguments
dataset |
A dataset |
x |
Features to include. When named_features is FALSE, all features will be stacked into a single tensor, so they must share a single data type. |
y |
(Optional). Response variable. |
named |
|
named_features |
|
parallel_records |
(Optional) An integer, representing the number of records to decode in parallel. If not specified, records will be processed sequentially. |
batch_size |
(Optional). Batch size if you would like to fuse the
dataset_prepare() operation together with a dataset_batch() (fusing generally improves overall training performance). |
num_parallel_batches |
(Optional) An integer, representing the number of batches to create in parallel. On one hand, higher values can help mitigate the effect of stragglers. On the other hand, higher values can increase contention if CPU is scarce. |
drop_remainder |
(Optional.) A boolean, representing whether the last
batch should be dropped in the case it has fewer than batch_size elements; the default behavior is not to drop the smaller batch. |
Value
A dataset. The dataset will have a structure of either:
- When named_features is TRUE: list(x = list(feature_name = feature_values, ...), y = response_values)
- When named_features is FALSE: list(x = features_array, y = response_values), where features_array is a Rank 2 array of shape (batch_size, num_features).
Note that the y element will be omitted when y is NULL.
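A sketch using the hearts dataset from this package:
ds <- tensor_slices_dataset(hearts) %>%
  dataset_prepare(x = c(age, chol), y = target, batch_size = 32)
# each element: list(x = <batch of features>, y = <batch of responses>)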
See Also
input_fn() for use with tfestimators.
Reduces the input dataset to a single element.
Description
The transformation calls reduce_func successively on every element of the input dataset until the dataset is exhausted, aggregating information in its internal state. The initial_state argument is used for the initial state and the final state is returned as the result.
Usage
dataset_reduce(dataset, initial_state, reduce_func)
Arguments
dataset |
A dataset |
initial_state |
An element representing the initial state of the transformation. |
reduce_func |
A function that maps (old_state, input_element) to new_state. It must take two arguments and return a new state. The structure of new_state must match the structure of initial_state. |
Value
A dataset element.
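For instance, a sketch computing a sum (the int64 initial state matches the dtype produced by range_dataset()):
total <- range_dataset(0, 10) %>%
  dataset_reduce(as_tensor(0, dtype = "int64"), function(state, x) state + x)
total  # tf.Tensor(45, shape=(), dtype=int64)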
See Also
Other dataset methods: dataset_batch(), dataset_cache(), dataset_collect(), dataset_concatenate(), dataset_decode_delim(), dataset_filter(), dataset_interleave(), dataset_map(), dataset_map_and_batch(), dataset_padded_batch(), dataset_prefetch(), dataset_prefetch_to_device(), dataset_repeat(), dataset_shuffle(), dataset_shuffle_and_repeat(), dataset_skip(), dataset_take(), dataset_take_while(), dataset_window()
A transformation that resamples a dataset to a target distribution.
Description
A transformation that resamples a dataset to a target distribution.
Usage
dataset_rejection_resample(
dataset,
class_func,
target_dist,
initial_dist = NULL,
seed = NULL,
name = NULL
)
Arguments
dataset |
A tf.data.Dataset |
class_func |
A function mapping an element of the input dataset to a
scalar tf$int32 tensor. Values should be in [0, num_classes). |
target_dist |
A floating point type tensor, shaped [num_classes]. |
initial_dist |
(Optional.) A floating point type tensor, shaped
[num_classes]. If not provided, the true class distribution is estimated live in a streaming fashion. |
seed |
(Optional.) Integer seed for the resampler. |
name |
(Optional.) A name for the tf.data operation. |
Value
A tf.Dataset
Examples
## Not run:
initial_dist <- c(.5, .5)
target_dist <- c(.6, .4)
num_classes <- length(initial_dist)
num_samples <- 100000
data <- sample.int(num_classes, num_samples, prob = initial_dist, replace = TRUE)
dataset <- tensor_slices_dataset(data)
tally <- c(0, 0)
`add<-` <- function (x, value) x + value
# tfautograph::autograph({
# for(i in dataset)
# add(tally[as.numeric(i)]) <- 1
# })
dataset %>%
as_array_iterator() %>%
iterate(function(i) {
add(tally[i]) <<- 1
}, simplify = FALSE)
# The value of `tally` will be close to c(50000, 50000) as
# per the `initial_dist` distribution.
tally # c(50287, 49713)
tally <- c(0, 0)
dataset %>%
dataset_rejection_resample(
class_func = function(x) (x-1) %% 2,
target_dist = target_dist,
initial_dist = initial_dist
) %>%
as_array_iterator() %>%
iterate(function(element) {
names(element) <- c("class_id", "i")
add(tally[element$i]) <<- 1
}, simplify = FALSE)
# The value of tally will now be close to c(75000, 50000)
# thus satisfying the target_dist distribution.
tally # c(74822, 49921)
## End(Not run)
Repeats a dataset count times.
Description
Repeats a dataset count times.
Usage
dataset_repeat(dataset, count = NULL)
Arguments
dataset |
A dataset |
count |
(Optional.) An integer value representing the number of times
the elements of this dataset should be repeated. The default behavior (if
count is NULL or -1) is for the elements to be repeated indefinitely. |
Value
A dataset
See Also
Other dataset methods: dataset_batch(), dataset_cache(), dataset_collect(), dataset_concatenate(), dataset_decode_delim(), dataset_filter(), dataset_interleave(), dataset_map(), dataset_map_and_batch(), dataset_padded_batch(), dataset_prefetch(), dataset_prefetch_to_device(), dataset_reduce(), dataset_shuffle(), dataset_shuffle_and_repeat(), dataset_skip(), dataset_take(), dataset_take_while(), dataset_window()
A transformation that scans a function across an input dataset
Description
A transformation that scans a function across an input dataset
Usage
dataset_scan(dataset, initial_state, scan_func)
Arguments
dataset |
A tensorflow dataset |
initial_state |
A nested structure of tensors, representing the initial state of the accumulator. |
scan_func |
A function that maps (old_state, input_element) to (new_state, output_element). It must take two arguments and return a pair of nested structures of tensors. The structure of new_state must match the structure of initial_state. |
Details
This transformation is a stateful relative of dataset_map()
.
In addition to mapping scan_func
across the elements of the input dataset,
scan()
accumulates one or more state tensors, whose initial values are
initial_state
.
Examples
## Not run:
initial_state <- as_tensor(0, dtype="int64")
scan_func <- function(state, i) list(state + i, state + i)
dataset <- range_dataset(0, 10) %>%
dataset_scan(initial_state, scan_func)
reticulate::iterate(dataset, as.array) %>%
unlist()
# 0 1 3 6 10 15 21 28 36 45
## End(Not run)
Creates a dataset that includes only 1 / num_shards of this dataset.
Description
This dataset operator is very useful when running distributed training, as it allows each worker to read a unique subset.
Usage
dataset_shard(dataset, num_shards, index)
Arguments
dataset |
A dataset |
num_shards |
An integer representing the number of shards operating in parallel. |
index |
An integer representing the worker index. |
Value
A dataset
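For instance, a sketch of a two-worker split:
d <- range_dataset(0, 10)
worker_0 <- d %>% dataset_shard(num_shards = 2, index = 0)  # 0, 2, 4, 6, 8
worker_1 <- d %>% dataset_shard(num_shards = 2, index = 1)  # 1, 3, 5, 7, 9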
Randomly shuffles the elements of this dataset.
Description
Randomly shuffles the elements of this dataset.
Usage
dataset_shuffle(
dataset,
buffer_size,
seed = NULL,
reshuffle_each_iteration = NULL
)
Arguments
dataset |
A dataset |
buffer_size |
An integer, representing the number of elements from this dataset from which the new dataset will sample. |
seed |
(Optional) An integer, representing the random seed that will be used to create the distribution. |
reshuffle_each_iteration |
(Optional) A boolean, which if true indicates
that the dataset should be pseudorandomly reshuffled each time it is iterated
over. (Defaults to TRUE.) |
Value
A dataset
See Also
Other dataset methods: dataset_batch(), dataset_cache(), dataset_collect(), dataset_concatenate(), dataset_decode_delim(), dataset_filter(), dataset_interleave(), dataset_map(), dataset_map_and_batch(), dataset_padded_batch(), dataset_prefetch(), dataset_prefetch_to_device(), dataset_reduce(), dataset_repeat(), dataset_shuffle_and_repeat(), dataset_skip(), dataset_take(), dataset_take_while(), dataset_window()
Shuffles and repeats a dataset returning a new permutation for each epoch.
Description
Shuffles and repeats a dataset returning a new permutation for each epoch.
Usage
dataset_shuffle_and_repeat(dataset, buffer_size, count = NULL, seed = NULL)
Arguments
dataset |
A dataset |
buffer_size |
An integer, representing the number of elements from this dataset from which the new dataset will sample. |
count |
(Optional.) An integer value representing the number of times
the elements of this dataset should be repeated. The default behavior (if
count is NULL or -1) is for the elements to be repeated indefinitely. |
seed |
(Optional) An integer, representing the random seed that will be used to create the distribution. |
See Also
Other dataset methods: dataset_batch(), dataset_cache(), dataset_collect(), dataset_concatenate(), dataset_decode_delim(), dataset_filter(), dataset_interleave(), dataset_map(), dataset_map_and_batch(), dataset_padded_batch(), dataset_prefetch(), dataset_prefetch_to_device(), dataset_reduce(), dataset_repeat(), dataset_shuffle(), dataset_skip(), dataset_take(), dataset_take_while(), dataset_window()
Creates a dataset that skips count elements from this dataset
Description
Creates a dataset that skips count elements from this dataset
Usage
dataset_skip(dataset, count)
Arguments
dataset |
A dataset |
count |
An integer, representing the number of elements of this dataset that should be skipped to form the new dataset. If count is greater than the size of this dataset, the new dataset will contain no elements. If count is -1, skips the entire dataset. |
Value
A dataset
See Also
Other dataset methods: dataset_batch(), dataset_cache(), dataset_collect(), dataset_concatenate(), dataset_decode_delim(), dataset_filter(), dataset_interleave(), dataset_map(), dataset_map_and_batch(), dataset_padded_batch(), dataset_prefetch(), dataset_prefetch_to_device(), dataset_reduce(), dataset_repeat(), dataset_shuffle(), dataset_shuffle_and_repeat(), dataset_take(), dataset_take_while(), dataset_window()
Persist the output of a dataset
Description
Persist the output of a dataset
Usage
dataset_snapshot(
dataset,
path,
compression = c("AUTO", "GZIP", "SNAPPY", "None"),
reader_func = NULL,
shard_func = NULL
)
Arguments
dataset |
A tensorflow dataset |
path |
Required. A directory to use for storing/loading the snapshot to/from. |
compression |
Optional. The type of compression to apply to the snapshot
written to disk. Supported options are "AUTO", "GZIP", "SNAPPY", or "None". Defaults to "AUTO", which attempts to pick an appropriate compression algorithm for the dataset. |
reader_func |
Optional. A function to control how to read data from snapshot shards. |
shard_func |
Optional. A function to control how to shard data when writing a snapshot. |
Details
The snapshot API allows users to transparently persist the output of their preprocessing pipeline to disk, and materialize the pre-processed data on a different training run.
This API enables repeated preprocessing steps to be consolidated, and allows re-use of already processed data, trading off disk storage and network bandwidth for freeing up more valuable CPU resources and accelerator compute time.
https://github.com/tensorflow/community/blob/master/rfcs/20200107-tf-data-snapshot.md has detailed design documentation of this feature.
Users can specify various options to control the behavior of snapshot,
including how snapshots are read from and written to by passing in
user-defined functions to the reader_func
and shard_func
parameters.
shard_func
is a user specified function that maps input elements to
snapshot shards.
NUM_SHARDS <- parallel::detectCores()

dataset %>%
  dataset_enumerate() %>%
  dataset_snapshot(
    "/path/to/snapshot/dir",
    shard_func = function(index, ds_elem) index %% NUM_SHARDS) %>%
  dataset_map(function(index, ds_elem) ds_elem)
reader_func
is a user specified function that accepts a single argument:
a Dataset of Datasets, each representing a "split" of elements of the
original dataset. The cardinality of the input dataset matches the
number of the shards specified in the shard_func
. The function
should return a Dataset of elements of the original dataset.
Users may want to specify this function to control how snapshot files should be read from disk, including the amount of shuffling and parallelism.
Here is an example of a standard reader function a user can define. This function enables both dataset shuffling and parallel reading of datasets:
user_reader_func <- function(datasets) {
  num_cores <- parallel::detectCores()
  datasets %>%
    dataset_shuffle(num_cores) %>%
    dataset_interleave(function(x) x, num_parallel_calls = AUTOTUNE)
}

dataset <- dataset %>%
  dataset_snapshot("/path/to/snapshot/dir", reader_func = user_reader_func)
By default, snapshot parallelizes reads by the number of cores available on the system, but will not attempt to shuffle the data.
Creates a dataset with at most count elements from this dataset
Description
Creates a dataset with at most count elements from this dataset
Usage
dataset_take(dataset, count)
Arguments
dataset |
A dataset |
count |
Integer representing the number of elements of this dataset that
should be taken to form the new dataset. If count is -1, or if count is greater than the size of this dataset, the new dataset will contain all elements of this dataset. |
Value
A dataset
See Also
Other dataset methods: dataset_batch(), dataset_cache(), dataset_collect(), dataset_concatenate(), dataset_decode_delim(), dataset_filter(), dataset_interleave(), dataset_map(), dataset_map_and_batch(), dataset_padded_batch(), dataset_prefetch(), dataset_prefetch_to_device(), dataset_reduce(), dataset_repeat(), dataset_shuffle(), dataset_shuffle_and_repeat(), dataset_skip(), dataset_take_while(), dataset_window()
A transformation that stops dataset iteration based on a predicate.
Description
A transformation that stops dataset iteration based on a predicate.
Usage
dataset_take_while(dataset, predicate, name = NULL)
Arguments
dataset |
A TF dataset |
predicate |
A function that maps a nested structure of tensors (having
shapes and types defined by output_shapes() and output_types()) to a scalar tf$bool tensor. |
name |
(Optional.) A name for the tf.data operation. |
Details
Example usage:
range_dataset(from = 0, to = 10) %>%
  dataset_take_while(~ .x < 5) %>%
  as_array_iterator() %>%
  iterate(simplify = FALSE) %>%
  str()
#> List of 5
#> $ : num 0
#> $ : num 1
#> $ : num 2
#> $ : num 3
#> $ : num 4
Value
A TF Dataset
See Also
Other dataset methods: dataset_batch(), dataset_cache(), dataset_collect(), dataset_concatenate(), dataset_decode_delim(), dataset_filter(), dataset_interleave(), dataset_map(), dataset_map_and_batch(), dataset_padded_batch(), dataset_prefetch(), dataset_prefetch_to_device(), dataset_reduce(), dataset_repeat(), dataset_shuffle(), dataset_shuffle_and_repeat(), dataset_skip(), dataset_take(), dataset_window()
Unbatch a dataset
Description
Splits elements of a dataset into multiple elements.
Usage
dataset_unbatch(dataset, name = NULL)
Arguments
dataset |
A dataset |
name |
(Optional.) A name for the tf.data operation. |
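A minimal sketch of the round trip with dataset_batch():
batched <- range_dataset(0, 6) %>% dataset_batch(3)  # elements of shape (3)
unbatched <- batched %>% dataset_unbatch()           # scalar elements again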
A transformation that discards duplicate elements of a Dataset.
Description
Use this transformation to produce a dataset that contains one instance of each unique element in the input (See example).
Usage
dataset_unique(dataset, name = NULL)
Arguments
dataset |
A tf.Dataset. |
name |
(Optional.) A name for the tf.data operation. |
Value
A tf.Dataset
Note
This transformation only supports datasets which fit into memory and have elements of either tf.int32, tf.int64 or tf.string type.
Examples
## Not run:
c(0, 37, 2, 37, 2, 1) %>% as_tensor("int32") %>%
tensor_slices_dataset() %>%
dataset_unique() %>%
as_array_iterator() %>% iterate() %>% sort()
# [1] 0 1 2 37
## End(Not run)
Transform the dataset using the provided spec.
Description
Prepares the dataset to be used directly in a model. The transformed dataset is prepared to return tuples (x, y) that can be used directly in Keras.
Usage
dataset_use_spec(dataset, spec)
Arguments
dataset |
A TensorFlow dataset. |
spec |
A feature specification created with |
Value
A TensorFlow dataset.
See Also
- feature_spec() to initialize the feature specification.
- fit.FeatureSpec() to create a TensorFlow dataset prepared for modeling.
- steps for a list of all implemented steps.
Other Feature Spec Functions: feature_spec(), fit.FeatureSpec(), step_bucketized_column(), step_categorical_column_with_hash_bucket(), step_categorical_column_with_identity(), step_categorical_column_with_vocabulary_file(), step_categorical_column_with_vocabulary_list(), step_crossed_column(), step_embedding_column(), step_indicator_column(), step_numeric_column(), step_remove_column(), step_shared_embeddings_column(), steps
Examples
## Not run:
library(tfdatasets)
data(hearts)
hearts <- tensor_slices_dataset(hearts) %>% dataset_batch(32)
# use the formula interface
spec <- feature_spec(hearts, target ~ age) %>%
step_numeric_column(age)
spec_fit <- fit(spec)
final_dataset <- hearts %>% dataset_use_spec(spec_fit)
## End(Not run)
Combines input elements into a dataset of windows.
Description
Combines input elements into a dataset of windows.
Usage
dataset_window(dataset, size, shift = NULL, stride = 1, drop_remainder = FALSE)
Arguments
dataset |
A dataset |
size |
representing the number of elements of the input dataset to combine into a window. |
shift |
representing the forward shift of the sliding window in each iteration. Defaults to size. |
stride |
representing the stride of the input elements in the sliding window. |
drop_remainder |
representing whether a window should be dropped in
case its size is smaller than size. |
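A minimal sketch; note that each window is itself a dataset:
windows <- range_dataset(0, 7) %>%
  dataset_window(size = 3, shift = 1, drop_remainder = TRUE)
# each element is a dataset; the first window holds 0, 1, 2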
See Also
Other dataset methods: dataset_batch(), dataset_cache(), dataset_collect(), dataset_concatenate(), dataset_decode_delim(), dataset_filter(), dataset_interleave(), dataset_map(), dataset_map_and_batch(), dataset_padded_batch(), dataset_prefetch(), dataset_prefetch_to_device(), dataset_reduce(), dataset_repeat(), dataset_shuffle(), dataset_shuffle_and_repeat(), dataset_skip(), dataset_take(), dataset_take_while()
Specification for reading a record from a text file with delimited values
Description
Specification for reading a record from a text file with delimited values
Usage
delim_record_spec(
example_file,
delim = ",",
skip = 0,
names = NULL,
types = NULL,
defaults = NULL
)
csv_record_spec(
example_file,
skip = 0,
names = NULL,
types = NULL,
defaults = NULL
)
tsv_record_spec(
example_file,
skip = 0,
names = NULL,
types = NULL,
defaults = NULL
)
Arguments
example_file |
File that provides an example of the records to be read. If you don't explicitly specify names and types (or defaults) then this file will be read to generate default values. |
delim |
Character delimiter to separate fields in a record (defaults to ",") |
skip |
Number of lines to skip before reading data. Note that if
names is explicitly provided and the file contains a header row, skip should be at least 1 so the header is not read as data. |
names |
Character vector with column names (or NULL to automatically detect column names from the first row of example_file). |
types |
Column types. If NULL (the default), types are inferred from example_file. Types can be explicitly specified in a character vector as "integer", "double", and "character". Alternatively, you can use a compact string representation where each character represents one column: c = character, i = integer, d = double. |
defaults |
List of default values which are used when data is
missing from a record. |
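A sketch of pairing a record spec with text_line_dataset() (mtcars.csv is a hypothetical file containing the mtcars columns):
mtcars_spec <- csv_record_spec("mtcars.csv")
dataset <- text_line_dataset("mtcars.csv", record_spec = mtcars_spec)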
Dense Features
Description
Retrieves the dense features from a spec.
Usage
dense_features(spec)
Arguments
spec |
A feature specification created with |
Value
A list of feature columns.
Creates a feature specification.
Description
Used to initialize a feature columns specification.
Usage
feature_spec(dataset, x, y = NULL)
Arguments
dataset |
A TensorFlow dataset. |
x |
Features to include. Can use tidyselect selection helpers or a formula. |
y |
(Optional) The response variable. Can also be specified using
a |
Details
After creating the feature_spec
object you can add steps using the
step
functions.
Value
a FeatureSpec
object.
See Also
- fit.FeatureSpec() to fit the FeatureSpec.
- dataset_use_spec() to create a TensorFlow dataset prepared for modeling.
- steps for a list of all implemented steps.
Other Feature Spec Functions: dataset_use_spec(), fit.FeatureSpec(), step_bucketized_column(), step_categorical_column_with_hash_bucket(), step_categorical_column_with_identity(), step_categorical_column_with_vocabulary_file(), step_categorical_column_with_vocabulary_list(), step_crossed_column(), step_embedding_column(), step_indicator_column(), step_numeric_column(), step_remove_column(), step_shared_embeddings_column(), steps
Examples
## Not run:
library(tfdatasets)
data(hearts)
hearts <- tensor_slices_dataset(hearts) %>% dataset_batch(32)
# use the formula interface
spec <- feature_spec(hearts, target ~ .)
# select using `tidyselect` helpers
spec <- feature_spec(hearts, x = c(thal, age), y = target)
## End(Not run)
A dataset of all files matching a pattern
Description
A dataset of all files matching a pattern
Usage
file_list_dataset(file_pattern, shuffle = NULL, seed = NULL)
Arguments
file_pattern |
A string, representing the filename pattern that will be matched. |
shuffle |
(Optional) If |
seed |
(Optional) An integer, representing the random seed that will be used to create the distribution. |
Details
For example, if we had the following files on our filesystem:
/path/to/dir/a.txt
/path/to/dir/b.csv
/path/to/dir/c.csv
If we pass "/path/to/dir/*.csv"
as the file_pattern
, the dataset would produce:
/path/to/dir/b.csv
/path/to/dir/c.csv
Value
A dataset of strings corresponding to file names
Note
The shuffle
and seed
arguments only apply for TensorFlow >= v1.8
Fits a feature specification.
Description
This function will fit
the specification. Depending
on the steps added to the specification it will compute
for example, the levels of categorical features, normalization
constants, etc.
Usage
## S3 method for class 'FeatureSpec'
fit(object, dataset = NULL, ...)
Arguments
object |
A feature specification created with |
dataset |
(Optional) A TensorFlow dataset. If NULL (the default), the dataset provided when the feature_spec was created is used. |
... |
(unused) |
Value
a fitted FeatureSpec
object.
See Also
- feature_spec() to initialize the feature specification.
- dataset_use_spec() to create a TensorFlow dataset prepared for modeling.
- steps for a list of all implemented steps.
Other Feature Spec Functions: dataset_use_spec(), feature_spec(), step_bucketized_column(), step_categorical_column_with_hash_bucket(), step_categorical_column_with_identity(), step_categorical_column_with_vocabulary_file(), step_categorical_column_with_vocabulary_list(), step_crossed_column(), step_embedding_column(), step_indicator_column(), step_numeric_column(), step_remove_column(), step_shared_embeddings_column(), steps
Examples
## Not run:
library(tfdatasets)
data(hearts)
hearts <- tensor_slices_dataset(hearts) %>% dataset_batch(32)
# use the formula interface
spec <- feature_spec(hearts, target ~ age) %>%
step_numeric_column(age)
spec_fit <- fit(spec)
spec_fit
## End(Not run)
A dataset of fixed-length records from one or more binary files.
Description
A dataset of fixed-length records from one or more binary files.
Usage
fixed_length_record_dataset(
filenames,
record_bytes,
header_bytes = NULL,
footer_bytes = NULL,
buffer_size = NULL
)
Arguments
filenames |
A string tensor containing one or more filenames. |
record_bytes |
An integer representing the number of bytes in each record. |
header_bytes |
(Optional) An integer scalar representing the number of bytes to skip at the start of a file. |
footer_bytes |
(Optional) An integer scalar representing the number of bytes to ignore at the end of a file. |
buffer_size |
(Optional) An integer scalar representing the number of bytes to buffer when reading. |
Value
A dataset
Identify the type of the variable.
Description
Can only be used inside the steps specifications to find variables by type.
Usage
has_type(match = "float32")
Arguments
match |
A list of types to match. |
See Also
Other Selectors: all_nominal(), all_numeric()
Heart Disease Data Set
Description
Heart disease (angiographic disease status) dataset.
Usage
hearts
Format
A data frame with 303 rows and 14 variables:
- age
age in years
- sex
sex (1 = male; 0 = female)
- cp
chest pain type: Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic
- trestbps
resting blood pressure (in mm Hg on admission to the hospital)
- chol
serum cholesterol in mg/dl
- fbs
(fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
- restecg
resting electrocardiographic results: Value 0: normal, Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
- thalach
maximum heart rate achieved
- exang
exercise induced angina (1 = yes; 0 = no)
- oldpeak
ST depression induced by exercise relative to rest
- slope
the slope of the peak exercise ST segment: Value 1: upsloping, Value 2: flat, Value 3: downsloping
- ca
number of major vessels (0-3) colored by fluoroscopy
- thal
3 = normal; 6 = fixed defect; 7 = reversible defect
- target
diagnosis of heart disease (angiographic disease status)
Source
https://archive.ics.uci.edu/ml/datasets/heart+Disease
References
The authors of the databases have requested that any publications resulting from the use of the data include the names of the principal investigator responsible for the data collection at each institution. They would be:
Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
V.A. Medical Center, Long Beach and Cleveland Clinic Foundation:Robert Detrano, M.D., Ph.D.
Construct a tfestimators input function from a dataset
Description
Construct a tfestimators input function from a dataset
Usage
input_fn.tf_dataset(dataset, features, response = NULL)
Arguments
dataset |
A dataset |
features |
The names of feature variables to be used. |
response |
The name of the response variable. |
Details
Creating an input_fn from a dataset requires that the dataset
consist of a set of named output tensors (e.g. like the dataset
produced by the tfrecord_dataset()
or text_line_dataset()
function).
Value
An input_fn suitable for use with tfestimators train, evaluate, and predict methods
Get next element from iterator
Description
Returns a nested list of tensors that when evaluated will yield the next element(s) in the dataset.
Usage
iterator_get_next(iterator, name = NULL)
Arguments
iterator |
An iterator |
name |
(Optional) A name for the created operation. |
Value
A nested list of tensors
See Also
Other iterator functions: iterator_initializer(), iterator_make_initializer(), iterator_string_handle(), make-iterator
An operation that should be run to initialize this iterator.
Description
An operation that should be run to initialize this iterator.
Usage
iterator_initializer(iterator)
Arguments
iterator |
An iterator |
See Also
Other iterator functions: iterator_get_next(), iterator_make_initializer(), iterator_string_handle(), make-iterator
Create an operation that can be run to initialize this iterator
Description
Create an operation that can be run to initialize this iterator
Usage
iterator_make_initializer(iterator, dataset, name = NULL)
Arguments
iterator |
An iterator |
dataset |
A dataset |
name |
(Optional) A name for the created operation. |
Value
A tf$Operation that can be run to initialize this iterator on the given dataset.
See Also
Other iterator functions: iterator_get_next(), iterator_initializer(), iterator_string_handle(), make-iterator
String-valued tensor that represents this iterator
Description
String-valued tensor that represents this iterator
Usage
iterator_string_handle(iterator, name = NULL)
Arguments
iterator |
An iterator |
name |
(Optional) A name for the created operation. |
Value
Scalar tensor of type string
See Also
Other iterator functions: iterator_get_next(), iterator_initializer(), iterator_make_initializer(), make-iterator
Creates a list of inputs from a dataset
Description
DEPRECATED: Use keras3::layer_feature_space()
instead.
Usage
layer_input_from_dataset(dataset)
Arguments
dataset |
a TensorFlow dataset or a data.frame |
Details
Create a list of Keras input layers that can be used together with keras::layer_dense_features().
Value
a list of Keras input layers
Examples
## Not run:
library(tfdatasets)
data(hearts)
hearts <- tensor_slices_dataset(hearts) %>% dataset_batch(32)
# use the formula interface
spec <- feature_spec(hearts, target ~ age + slope) %>%
step_numeric_column(age, slope) %>%
step_bucketized_column(age, boundaries = c(10, 20, 30))
spec <- fit(spec)
dataset <- hearts %>% dataset_use_spec(spec)
input <- layer_input_from_dataset(dataset)
## End(Not run)
Get Dataset length
Description
Returns the length of the dataset.
Usage
## S3 method for class 'tf_dataset'
length(x)
## S3 method for class 'tensorflow.python.data.ops.dataset_ops.DatasetV2'
length(x)
Arguments
x |
a tf_dataset or tf.data.Dataset object |
Value
Either Inf
if the dataset is infinite, NA
if the dataset length
is unknown, or an R numeric if it is known.
Examples
## Not run:
range_dataset(0, 42) %>% length()
# 42
range_dataset(0, 42) %>% dataset_repeat() %>% length()
# Inf
range_dataset(0, 42) %>% dataset_repeat() %>%
dataset_filter(function(x) TRUE) %>% length()
# NA
## End(Not run)
Reads CSV files into a batched dataset
Description
Reads CSV files into a dataset, where each element is a (features, labels) list that corresponds to a batch of CSV rows. The features dictionary maps feature column names to tensors containing the corresponding feature data, and labels is a tensor containing the batch's label data.
Usage
make_csv_dataset(
file_pattern,
batch_size,
column_names = NULL,
column_defaults = NULL,
label_name = NULL,
select_columns = NULL,
field_delim = ",",
use_quote_delim = TRUE,
na_value = "",
header = TRUE,
num_epochs = NULL,
shuffle = TRUE,
shuffle_buffer_size = 10000,
shuffle_seed = NULL,
prefetch_buffer_size = 1,
num_parallel_reads = 1,
num_parallel_parser_calls = 2,
sloppy = FALSE,
num_rows_for_inference = 100
)
Arguments
file_pattern |
List of files or glob patterns of file paths containing CSV records. |
batch_size |
An integer representing the number of records to combine in a single batch. |
column_names |
An optional list of strings that corresponds to the CSV columns, in order. One per column of the input record. If this is not provided, infers the column names from the first row of the records. These names will be the keys of the features dict of each dataset element. |
column_defaults |
An optional list of default values for the CSV fields. One item
per selected column of the input record. Each item in the list is either a valid CSV
dtype (integer, numeric, or string), or a tensor with one of the
aforementioned types. The tensor can either be a scalar default value (if the column
is optional), or an empty tensor (if the column is required). If a dtype is provided
instead of a tensor, the column is also treated as required. If this list is not
provided, tries to infer types based on reading the first |
label_name |
An optional string corresponding to the label column. If provided, the data for this column is returned as a separate tensor from the features dictionary, so that the dataset complies with the format expected by TF Estimators and Keras. |
select_columns |
(Ignored if using TensorFlow version < 1.8.) An optional list of
integer indices or string column names, that specifies a subset of columns of CSV data
to select. If column names are provided, these must correspond to names provided in
column_names or inferred from the file header lines. |
field_delim |
An optional string. Defaults to ",". Char delimiter to separate fields in a record. |
use_quote_delim |
An optional bool. Defaults to TRUE. If FALSE, treats double quotation marks as regular characters inside the string fields. |
na_value |
Additional string to recognize as NA/NaN. |
header |
A bool that indicates whether the first rows of provided CSV files correspond to header lines with column names, and should not be included in the data. |
num_epochs |
An integer specifying the number of times this dataset is repeated. If NULL, cycles through the dataset forever. |
shuffle |
A bool that indicates whether the input should be shuffled. |
shuffle_buffer_size |
Buffer size to use for shuffling. A large buffer size ensures better shuffling, but increases memory usage and startup time. |
shuffle_seed |
Randomization seed to use for shuffling. |
prefetch_buffer_size |
An int specifying the number of feature batches to prefetch for performance improvement. Recommended value is the number of batches consumed per training step. |
num_parallel_reads |
Number of threads used to read CSV records from files. If >1, the results will be interleaved. |
num_parallel_parser_calls |
(Ignored if using TensorFlow version 1.11 or later.) Number of parallel invocations of the CSV parsing function on CSV records. |
sloppy |
If TRUE, reading performance will be improved at the cost of deterministic ordering. |
num_rows_for_inference |
Number of rows of a file to use for type inference if
record_defaults is not provided. If NULL, reads all the rows of all the files. Defaults to 100. |
Value
A dataset, where each element is a (features, labels) list that corresponds to a batch of batch_size CSV rows. The features dictionary maps feature column names to tensors containing the corresponding column data, and labels is a tensor containing the column data for the label column specified by label_name.
Creates an iterator for enumerating the elements of this dataset.
Description
Creates an iterator for enumerating the elements of this dataset.
Usage
make_iterator_one_shot(dataset)
make_iterator_initializable(dataset, shared_name = NULL)
make_iterator_from_structure(
output_types,
output_shapes = NULL,
shared_name = NULL
)
make_iterator_from_string_handle(
string_handle,
output_types,
output_shapes = NULL
)
Arguments
dataset |
A dataset |
shared_name |
(Optional) If non-empty, the returned iterator will be shared under the given name across multiple sessions that share the same devices (e.g. when using a remote server). |
output_types |
A nested structure of tf$DType objects corresponding to each component of an element of this iterator. |
output_shapes |
(Optional) A nested structure of tf$TensorShape objects corresponding to each component of an element of this dataset. If omitted, each component will have an unconstrained shape. |
string_handle |
A scalar tensor of type string that evaluates
to a handle produced by the |
Value
An Iterator over the elements of this dataset.
Initialization
For make_iterator_one_shot(), the returned iterator will be initialized automatically. A "one-shot" iterator does not currently support re-initialization.
For make_iterator_initializable(), the returned iterator will be in an uninitialized state, and you must run the object returned from iterator_initializer() before using it.
For make_iterator_from_structure(), the returned iterator is not bound to a particular dataset, and it has no initializer. To initialize the iterator, run the operation returned by iterator_make_initializer().
See Also
Other iterator functions: iterator_get_next(), iterator_initializer(), iterator_make_initializer(), iterator_string_handle()
Tensor(s) for retrieving the next batch from a dataset
Description
Tensor(s) for retrieving the next batch from a dataset
Usage
next_batch(dataset)
Arguments
dataset |
A dataset |
Details
To access the underlying data within the dataset you iteratively evaluate the tensor(s) to read batches of data.
Note that in many cases you won't need to explicitly evaluate the tensors. Rather, you will pass the tensors to another function that will perform the evaluation (e.g. the Keras layer_input() and compile() functions).
If you do need to perform iteration manually by evaluating the tensors, there are a couple of possible approaches to controlling/detecting when iteration should end.
One approach is to create a dataset that yields batches infinitely (traversing the dataset multiple times with different batches randomly drawn). In this case you'd use another mechanism like a global step counter or detecting a learning plateau.
Another approach is to detect when all batches have been yielded from the dataset. When the tensor reaches the end of iteration a runtime error will occur. You can catch and ignore the error when it occurs by wrapping your iteration code in the with_dataset() function.
See the examples below for a demonstration of each of these methods of iteration.
Value
Tensor(s) that can be evaluated to yield the next batch of training data.
Examples
## Not run:
# iteration with 'infinite' dataset and explicit step counter
library(tfdatasets)
dataset <- text_line_dataset("mtcars.csv", record_spec = mtcars_spec) %>%
dataset_prepare(x = c(mpg, disp), y = cyl) %>%
dataset_shuffle(5000) %>%
dataset_batch(128) %>%
dataset_repeat() # repeat infinitely
batch <- next_batch(dataset)
steps <- 200
for (i in 1:steps) {
# use batch$x and batch$y tensors
}
# iteration that detects and ignores end of iteration error
library(tfdatasets)
dataset <- text_line_dataset("mtcars.csv", record_spec = mtcars_spec) %>%
dataset_prepare(x = c(mpg, disp), y = cyl) %>%
dataset_batch(128) %>%
dataset_repeat(10)
batch <- next_batch(dataset)
with_dataset({
while(TRUE) {
# use batch$x and batch$y tensors
}
})
## End(Not run)
Output types and shapes
Description
Output types and shapes
Usage
output_types(object)
output_shapes(object)
Arguments
object |
A dataset or iterator |
Value
output_types() returns the type of each component of an element of this object; output_shapes() returns the shape of each component of an element of this object.
Creates a Dataset of pseudorandom values
Description
Creates a Dataset of pseudorandom values
Usage
random_integer_dataset(seed = NULL)
Arguments
seed |
(Optional) If specified, the dataset produces a deterministic sequence of values. |
Details
The dataset generates a sequence of uniformly distributed integer values (dtype int64).
Creates a dataset of a step-separated range of values.
Description
Creates a dataset of a step-separated range of values.
Usage
range_dataset(from = 0, to = 0, by = 1, ..., dtype = tf$int64)
Arguments
from |
Range start |
to |
Range end (exclusive) |
by |
Increment of the sequence |
... |
ignored |
dtype |
Output dtype. (Optional, default: tf$int64) |
Read a dataset from a set of files
Description
Read files into a dataset, optionally processing them in parallel.
Usage
read_files(
files,
reader,
...,
parallel_files = 1,
parallel_interleave = 1,
num_shards = NULL,
shard_index = NULL
)
Arguments
files |
List of filenames or glob pattern for files (e.g. "*.csv") |
reader |
Function that maps a file into a dataset (e.g. text_line_dataset() or tfrecord_dataset()). |
... |
Additional arguments to pass to reader. |
parallel_files |
An integer, number of files to process in parallel |
parallel_interleave |
An integer, number of consecutive records to produce from each file before cycling to another file. |
num_shards |
An integer representing the number of shards operating in parallel. |
shard_index |
An integer representing the worker index. Shard indexes are 0-based, so with e.g. 8 shards the valid indexes are 0-7. |
Value
A dataset
Objects exported from other packages
Description
These objects are imported from other packages. Follow the links below to see their documentation.
- generics
- reticulate
- tensorflow
- tidyselect: contains, ends_with, everything, matches, num_range, one_of, starts_with
Samples elements at random from the datasets in datasets.
Description
Samples elements at random from the datasets in datasets.
Usage
sample_from_datasets(
datasets,
weights = NULL,
seed = NULL,
stop_on_empty_dataset = TRUE
)
Arguments
datasets |
A list of tf_dataset objects with compatible structure. |
weights |
(Optional.) A list of |
seed |
(Optional.) An integer, representing the random seed that will be used to create the distribution. |
stop_on_empty_dataset |
If |
Value
A dataset that interleaves elements from datasets at random, according to weights if provided, otherwise with uniform probability.
List of pre-made scalers
Description
- scaler_standard(): mean and standard deviation normalizer.
- scaler_min_max(): min-max normalizer.
See Also
Other scaler: scaler_min_max(), scaler_standard()
Creates an instance of a min max scaler
Description
This scaler will learn the min and max of the numeric variable and use this to create a normalizer_fn.
Usage
scaler_min_max()
See Also
scaler for a complete list of normalizers.
Other scaler: scaler_standard()
Creates an instance of a standard scaler
Description
This scaler will learn the mean and the standard deviation and use this to create a normalizer_fn.
Usage
scaler_standard()
See Also
scaler for a complete list of normalizers.
Other scaler: scaler_min_max()
Selectors
Description
List of selectors that can be used to specify variables inside steps.
Usage
cur_info_env
Format
An object of class environment of length 0.
Splits each rank-N tf$SparseTensor in this dataset row-wise.
Description
Splits each rank-N tf$SparseTensor in this dataset row-wise.
Usage
sparse_tensor_slices_dataset(sparse_tensor)
Arguments
sparse_tensor |
A tf$SparseTensor. |
Value
A dataset of rank-(N-1) sparse tensors.
See Also
Other tensor datasets: tensor_slices_dataset(), tensors_dataset()
A dataset consisting of the results from a SQL query
Description
A dataset consisting of the results from a SQL query
Usage
sql_record_spec(names, types)
sql_dataset(driver_name, data_source_name, query, record_spec)
sqlite_dataset(filename, query, record_spec)
Arguments
names |
Names of columns returned from the query |
types |
List of tf$DType objects representing the types of the columns returned from the query. |
driver_name |
String containing the database type. Currently, the only supported value is 'sqlite'. |
data_source_name |
String containing a connection string to connect to the database. |
query |
String containing the SQL query to execute. |
record_spec |
Names and types of database columns |
filename |
Filename for the database |
Value
A dataset
Creates bucketized columns
Description
Use this step to create bucketized columns from numeric columns.
Usage
step_bucketized_column(spec, ..., boundaries)
Arguments
spec |
A feature specification created with feature_spec(). |
... |
Comma-separated list of variable names to apply the step to; selectors can also be used. |
boundaries |
A sorted list or tuple of floats specifying the boundaries. |
Value
a FeatureSpec
object.
See Also
steps for a complete list of allowed steps.
Other Feature Spec Functions: dataset_use_spec(), feature_spec(), fit.FeatureSpec(), step_categorical_column_with_hash_bucket(), step_categorical_column_with_identity(), step_categorical_column_with_vocabulary_file(), step_categorical_column_with_vocabulary_list(), step_crossed_column(), step_embedding_column(), step_indicator_column(), step_numeric_column(), step_remove_column(), step_shared_embeddings_column(), steps
Examples
## Not run:
library(tfdatasets)
data(hearts)
hearts <- tensor_slices_dataset(hearts) %>% dataset_batch(32)
# use the formula interface
spec <- feature_spec(hearts, target ~ age) %>%
step_numeric_column(age) %>%
step_bucketized_column(age, boundaries = c(10, 20, 30))
spec_fit <- fit(spec)
final_dataset <- hearts %>% dataset_use_spec(spec_fit)
## End(Not run)
Creates a categorical column with hash buckets specification
Description
Represents sparse feature where ids are set by hashing.
Usage
step_categorical_column_with_hash_bucket(
spec,
...,
hash_bucket_size,
dtype = tf$string
)
Arguments
spec |
A feature specification created with feature_spec(). |
... |
Comma-separated list of variable names to apply the step to; selectors can also be used. |
hash_bucket_size |
An int > 1. The number of buckets. |
dtype |
The type of features. Only string and integer types are supported. |
Value
a FeatureSpec
object.
See Also
steps for a complete list of allowed steps.
Other Feature Spec Functions: dataset_use_spec(), feature_spec(), fit.FeatureSpec(), step_bucketized_column(), step_categorical_column_with_identity(), step_categorical_column_with_vocabulary_file(), step_categorical_column_with_vocabulary_list(), step_crossed_column(), step_embedding_column(), step_indicator_column(), step_numeric_column(), step_remove_column(), step_shared_embeddings_column(), steps
Examples
## Not run:
library(tfdatasets)
data(hearts)
hearts <- tensor_slices_dataset(hearts) %>% dataset_batch(32)
# use the formula interface
spec <- feature_spec(hearts, target ~ thal) %>%
step_categorical_column_with_hash_bucket(thal, hash_bucket_size = 3)
spec_fit <- fit(spec)
final_dataset <- hearts %>% dataset_use_spec(spec_fit)
## End(Not run)
Create a categorical column with identity
Description
Use this when your inputs are integers in the range [0-num_buckets).
Usage
step_categorical_column_with_identity(
spec,
...,
num_buckets,
default_value = NULL
)
Arguments
spec |
A feature specification created with feature_spec(). |
... |
Comma-separated list of variable names to apply the step to; selectors can also be used. |
num_buckets |
Range of inputs and outputs is [0, num_buckets). |
default_value |
If NULL, this column's graph operations will fail for out-of-range inputs. Otherwise, this value must be in the range [0, num_buckets), and will replace out-of-range inputs. |
Value
a FeatureSpec
object.
See Also
steps for a complete list of allowed steps.
Other Feature Spec Functions: dataset_use_spec(), feature_spec(), fit.FeatureSpec(), step_bucketized_column(), step_categorical_column_with_hash_bucket(), step_categorical_column_with_vocabulary_file(), step_categorical_column_with_vocabulary_list(), step_crossed_column(), step_embedding_column(), step_indicator_column(), step_numeric_column(), step_remove_column(), step_shared_embeddings_column(), steps
Examples
## Not run:
library(tfdatasets)
data(hearts)
hearts$thal <- as.integer(as.factor(hearts$thal)) - 1L
hearts <- tensor_slices_dataset(hearts) %>% dataset_batch(32)
# use the formula interface
spec <- feature_spec(hearts, target ~ thal) %>%
step_categorical_column_with_identity(thal, num_buckets = 5)
spec_fit <- fit(spec)
final_dataset <- hearts %>% dataset_use_spec(spec_fit)
## End(Not run)
Creates a categorical column with vocabulary file
Description
Use this function when the vocabulary of a categorical variable is written to a file.
Usage
step_categorical_column_with_vocabulary_file(
spec,
...,
vocabulary_file,
vocabulary_size = NULL,
dtype = tf$string,
default_value = NULL,
num_oov_buckets = 0L
)
Arguments
spec |
A feature specification created with feature_spec(). |
... |
Comma-separated list of variable names to apply the step to; selectors can also be used. |
vocabulary_file |
The vocabulary file name. |
vocabulary_size |
Number of the elements in the vocabulary. This
must be no greater than the length of vocabulary_file. |
dtype |
The type of features. Only string and integer types are supported. |
default_value |
The integer ID value to return for out-of-vocabulary
feature values, defaults to -1. |
num_oov_buckets |
Non-negative integer, the number of out-of-vocabulary
buckets. All out-of-vocabulary inputs will be assigned IDs in the range
[vocabulary_size, vocabulary_size + num_oov_buckets). |
Value
a FeatureSpec
object.
See Also
steps for a complete list of allowed steps.
Other Feature Spec Functions: dataset_use_spec(), feature_spec(), fit.FeatureSpec(), step_bucketized_column(), step_categorical_column_with_hash_bucket(), step_categorical_column_with_identity(), step_categorical_column_with_vocabulary_list(), step_crossed_column(), step_embedding_column(), step_indicator_column(), step_numeric_column(), step_remove_column(), step_shared_embeddings_column(), steps
Examples
## Not run:
library(tfdatasets)
data(hearts)
file <- tempfile()
writeLines(unique(hearts$thal), file)
hearts <- tensor_slices_dataset(hearts) %>% dataset_batch(32)
# use the formula interface
spec <- feature_spec(hearts, target ~ thal) %>%
step_categorical_column_with_vocabulary_file(thal, vocabulary_file = file)
spec_fit <- fit(spec)
final_dataset <- hearts %>% dataset_use_spec(spec_fit)
## End(Not run)
Creates a categorical column specification
Description
Creates a categorical column specification
Usage
step_categorical_column_with_vocabulary_list(
spec,
...,
vocabulary_list = NULL,
dtype = NULL,
default_value = -1L,
num_oov_buckets = 0L
)
Arguments
spec |
A feature specification created with feature_spec(). |
... |
Comma-separated list of variable names to apply the step to; selectors can also be used. |
vocabulary_list |
An ordered iterable defining the vocabulary. Each
feature is mapped to the index of its value (if present) in vocabulary_list.
Must be castable to dtype. |
dtype |
The type of features. Only string and integer types are supported.
If NULL, dtype will be inferred from vocabulary_list. |
default_value |
The integer ID value to return for out-of-vocabulary feature
values, defaults to -1L. |
num_oov_buckets |
Non-negative integer, the number of out-of-vocabulary buckets.
All out-of-vocabulary inputs will be assigned IDs in the range
[length(vocabulary_list), length(vocabulary_list) + num_oov_buckets). |
Value
a FeatureSpec
object.
See Also
steps for a complete list of allowed steps.
Other Feature Spec Functions: dataset_use_spec(), feature_spec(), fit.FeatureSpec(), step_bucketized_column(), step_categorical_column_with_hash_bucket(), step_categorical_column_with_identity(), step_categorical_column_with_vocabulary_file(), step_crossed_column(), step_embedding_column(), step_indicator_column(), step_numeric_column(), step_remove_column(), step_shared_embeddings_column(), steps
Examples
## Not run:
library(tfdatasets)
data(hearts)
hearts <- tensor_slices_dataset(hearts) %>% dataset_batch(32)
# use the formula interface
spec <- feature_spec(hearts, target ~ thal) %>%
step_categorical_column_with_vocabulary_list(thal)
spec_fit <- fit(spec)
final_dataset <- hearts %>% dataset_use_spec(spec_fit)
## End(Not run)
Creates crosses of categorical columns
Description
Use this step to create crosses between categorical columns.
Usage
step_crossed_column(spec, ..., hash_bucket_size, hash_key = NULL)
Arguments
spec |
A feature specification created with feature_spec(). |
... |
Comma-separated list of variable names to apply the step to; selectors can also be used. |
hash_bucket_size |
An int > 1. The number of buckets. |
hash_key |
(optional) Specify the hash_key that will be used by the FingerprintCat64 function to combine the crosses fingerprints on SparseCrossOp. |
Value
a FeatureSpec
object.
See Also
steps for a complete list of allowed steps.
Other Feature Spec Functions: dataset_use_spec(), feature_spec(), fit.FeatureSpec(), step_bucketized_column(), step_categorical_column_with_hash_bucket(), step_categorical_column_with_identity(), step_categorical_column_with_vocabulary_file(), step_categorical_column_with_vocabulary_list(), step_embedding_column(), step_indicator_column(), step_numeric_column(), step_remove_column(), step_shared_embeddings_column(), steps
Examples
## Not run:
library(tfdatasets)
data(hearts)
hearts <- tensor_slices_dataset(hearts) %>% dataset_batch(32)
# use the formula interface; cross the bucketized age with the thal category
spec <- feature_spec(hearts, target ~ age + thal) %>%
  step_numeric_column(age) %>%
  step_bucketized_column(age, boundaries = c(10, 20, 30)) %>%
  step_categorical_column_with_vocabulary_list(thal) %>%
  step_crossed_column(c(thal, bucketized_age), hash_bucket_size = 10)
spec_fit <- fit(spec)
final_dataset <- hearts %>% dataset_use_spec(spec_fit)
## End(Not run)
Creates embeddings columns
Description
Use this step to create embedding columns from categorical columns.
Usage
step_embedding_column(
spec,
...,
dimension = function(x) {
as.integer(x^0.25)
},
combiner = "mean",
initializer = NULL,
ckpt_to_load_from = NULL,
tensor_name_in_ckpt = NULL,
max_norm = NULL,
trainable = TRUE
)
Arguments
spec |
A feature specification created with feature_spec(). |
... |
Comma-separated list of variable names to apply the step to; selectors can also be used. |
dimension |
An integer specifying dimension of the embedding, must be > 0. Can also be a function of the size of the vocabulary. |
combiner |
A string specifying how to reduce if there are multiple entries in
a single row. Currently 'mean', 'sqrtn' and 'sum' are supported, with 'mean' the
default. 'sqrtn' often achieves good accuracy, in particular with bag-of-words
columns. Each of these can be thought of as an example-level normalization on
the column. For more information, see the TensorFlow documentation for embedding columns. |
initializer |
A variable initializer function to be used in embedding
variable initialization. If not specified, defaults to
tf$truncated_normal_initializer with mean 0.0 and standard deviation 1/sqrt(dimension). |
ckpt_to_load_from |
String representing checkpoint name/pattern from
which to restore column weights. Required if tensor_name_in_ckpt is not NULL. |
tensor_name_in_ckpt |
Name of the Tensor in ckpt_to_load_from from which to
restore the column weights. Required if ckpt_to_load_from is not NULL. |
max_norm |
If not NULL, embedding values are l2-normalized to this value. |
trainable |
Whether or not the embedding is trainable. Default is TRUE. |
Value
a FeatureSpec
object.
See Also
steps for a complete list of allowed steps.
Other Feature Spec Functions: dataset_use_spec(), feature_spec(), fit.FeatureSpec(), step_bucketized_column(), step_categorical_column_with_hash_bucket(), step_categorical_column_with_identity(), step_categorical_column_with_vocabulary_file(), step_categorical_column_with_vocabulary_list(), step_crossed_column(), step_indicator_column(), step_numeric_column(), step_remove_column(), step_shared_embeddings_column(), steps
Examples
## Not run:
library(tfdatasets)
data(hearts)
hearts <- tensor_slices_dataset(hearts) %>% dataset_batch(32)
# use the formula interface
spec <- feature_spec(hearts, target ~ thal) %>%
step_categorical_column_with_vocabulary_list(thal) %>%
step_embedding_column(thal, dimension = 3)
spec_fit <- fit(spec)
final_dataset <- hearts %>% dataset_use_spec(spec_fit)
## End(Not run)
Creates Indicator Columns
Description
Use this step to create indicator columns from categorical columns.
Usage
step_indicator_column(spec, ...)
Arguments
spec |
A feature specification created with feature_spec(). |
... |
Comma-separated list of variable names to apply the step to; selectors can also be used. |
Value
a FeatureSpec
object.
See Also
steps for a complete list of allowed steps.
Other Feature Spec Functions: dataset_use_spec(), feature_spec(), fit.FeatureSpec(), step_bucketized_column(), step_categorical_column_with_hash_bucket(), step_categorical_column_with_identity(), step_categorical_column_with_vocabulary_file(), step_categorical_column_with_vocabulary_list(), step_crossed_column(), step_embedding_column(), step_numeric_column(), step_remove_column(), step_shared_embeddings_column(), steps
Examples
## Not run:
library(tfdatasets)
data(hearts)
hearts <- tensor_slices_dataset(hearts) %>% dataset_batch(32)
# use the formula interface
spec <- feature_spec(hearts, target ~ thal) %>%
step_categorical_column_with_vocabulary_list(thal) %>%
step_indicator_column(thal)
spec_fit <- fit(spec)
final_dataset <- hearts %>% dataset_use_spec(spec_fit)
## End(Not run)
Creates a numeric column specification
Description
step_numeric_column creates a numeric column specification. It can also be used to normalize numeric columns.
Usage
step_numeric_column(
spec,
...,
shape = 1L,
default_value = NULL,
dtype = tf$float32,
normalizer_fn = NULL
)
Arguments
spec |
A feature specification created with feature_spec(). |
... |
Comma-separated list of variable names to apply the step to; selectors can also be used. |
shape |
An iterable of integers specifying the shape of the Tensor. An integer can be given,
which means a single-dimension Tensor with the given width. The Tensor representing the column will
have the shape of [batch_size] + shape. |
default_value |
A single value compatible with dtype, used as the default value when parsing. |
dtype |
defines the type of values. Default value is tf$float32. |
normalizer_fn |
If not NULL, a function that can be used to normalize the value of the tensor after default_value is applied for parsing. |
Value
a FeatureSpec
object.
See Also
steps for a complete list of allowed steps.
Other Feature Spec Functions: dataset_use_spec(), feature_spec(), fit.FeatureSpec(), step_bucketized_column(), step_categorical_column_with_hash_bucket(), step_categorical_column_with_identity(), step_categorical_column_with_vocabulary_file(), step_categorical_column_with_vocabulary_list(), step_crossed_column(), step_embedding_column(), step_indicator_column(), step_remove_column(), step_shared_embeddings_column(), steps
Examples
## Not run:
library(tfdatasets)
data(hearts)
hearts <- tensor_slices_dataset(hearts) %>% dataset_batch(32)
# use the formula interface
spec <- feature_spec(hearts, target ~ age) %>%
step_numeric_column(age, normalizer_fn = scaler_standard())
spec_fit <- fit(spec)
final_dataset <- hearts %>% dataset_use_spec(spec_fit)
## End(Not run)
Creates a step that can remove columns
Description
Removes features of the feature specification.
Usage
step_remove_column(spec, ...)
Arguments
spec |
A feature specification created with feature_spec(). |
... |
Comma-separated list of variable names to apply the step to; selectors can also be used. |
Value
a FeatureSpec
object.
See Also
steps for a complete list of allowed steps.
Other Feature Spec Functions: dataset_use_spec(), feature_spec(), fit.FeatureSpec(), step_bucketized_column(), step_categorical_column_with_hash_bucket(), step_categorical_column_with_identity(), step_categorical_column_with_vocabulary_file(), step_categorical_column_with_vocabulary_list(), step_crossed_column(), step_embedding_column(), step_indicator_column(), step_numeric_column(), step_shared_embeddings_column(), steps
Examples
## Not run:
library(tfdatasets)
data(hearts)
hearts <- tensor_slices_dataset(hearts) %>% dataset_batch(32)
# use the formula interface
spec <- feature_spec(hearts, target ~ age) %>%
step_numeric_column(age, normalizer_fn = scaler_standard()) %>%
step_bucketized_column(age, boundaries = c(20, 50)) %>%
step_remove_column(age)
spec_fit <- fit(spec)
final_dataset <- hearts %>% dataset_use_spec(spec_fit)
## End(Not run)
Creates shared embeddings for categorical columns
Description
This is similar to step_embedding_column, except that it produces a list of embedding columns that share the same embedding weights.
Usage
step_shared_embeddings_column(
spec,
...,
dimension,
combiner = "mean",
initializer = NULL,
shared_embedding_collection_name = NULL,
ckpt_to_load_from = NULL,
tensor_name_in_ckpt = NULL,
max_norm = NULL,
trainable = TRUE
)
Arguments
spec |
A feature specification created with feature_spec(). |
... |
Comma-separated list of variable names to apply the step to; selectors can also be used. |
dimension |
An integer specifying dimension of the embedding, must be > 0. Can also be a function of the size of the vocabulary. |
combiner |
A string specifying how to reduce if there are multiple entries in
a single row. Currently 'mean', 'sqrtn' and 'sum' are supported, with 'mean' the
default. 'sqrtn' often achieves good accuracy, in particular with bag-of-words
columns. Each of these can be thought of as an example-level normalization on
the column. For more information, see the TensorFlow documentation for embedding columns. |
initializer |
A variable initializer function to be used in embedding
variable initialization. If not specified, defaults to
tf$truncated_normal_initializer with mean 0.0 and standard deviation 1/sqrt(dimension). |
shared_embedding_collection_name |
Optional collective name of these columns. If not given, a reasonable name will be chosen based on the names of categorical_columns. |
ckpt_to_load_from |
String representing checkpoint name/pattern from
which to restore column weights. Required if tensor_name_in_ckpt is not NULL. |
tensor_name_in_ckpt |
Name of the Tensor in ckpt_to_load_from from which to
restore the column weights. Required if |
max_norm |
If not NULL, embedding values are l2-normalized to this value. |
trainable |
Whether or not the embedding is trainable. Default is TRUE. |
Value
a FeatureSpec
object.
Note
Does not work in eager mode.
See Also
steps for a complete list of allowed steps.
Other Feature Spec Functions: dataset_use_spec(), feature_spec(), fit.FeatureSpec(), step_bucketized_column(), step_categorical_column_with_hash_bucket(), step_categorical_column_with_identity(), step_categorical_column_with_vocabulary_file(), step_categorical_column_with_vocabulary_list(), step_crossed_column(), step_embedding_column(), step_indicator_column(), step_numeric_column(), step_remove_column(), steps
Steps for feature columns specification.
Description
List of steps that can be used to specify columns in the feature_spec
interface.
Steps
- step_numeric_column() to define numeric columns.
- step_categorical_column_with_vocabulary_list() to define categorical columns.
- step_categorical_column_with_hash_bucket() to define categorical columns where ids are set by hashing.
- step_categorical_column_with_identity() to define categorical columns represented by integers in the range [0-num_buckets).
- step_categorical_column_with_vocabulary_file() to define categorical columns when their vocabulary is available in a file.
- step_indicator_column() to create indicator columns from categorical columns.
- step_embedding_column() to create embeddings columns from categorical columns.
- step_bucketized_column() to create bucketized columns from numeric columns.
- step_crossed_column() to perform crosses of categorical columns.
- step_shared_embeddings_column() to share embeddings between a list of categorical columns.
- step_remove_column() to remove columns from the specification.
See Also
- selectors for a list of selectors that can be used to specify variables.
Other Feature Spec Functions: dataset_use_spec(), feature_spec(), fit.FeatureSpec(), step_bucketized_column(), step_categorical_column_with_hash_bucket(), step_categorical_column_with_identity(), step_categorical_column_with_vocabulary_file(), step_categorical_column_with_vocabulary_list(), step_crossed_column(), step_embedding_column(), step_indicator_column(), step_numeric_column(), step_remove_column(), step_shared_embeddings_column()
Creates a dataset whose elements are slices of the given tensors.
Description
Creates a dataset whose elements are slices of the given tensors.
Usage
tensor_slices_dataset(tensors)
Arguments
tensors |
A nested structure of tensors, each having the same size in the first dimension. |
Value
A dataset.
See Also
Other tensor datasets: sparse_tensor_slices_dataset(), tensors_dataset()
Creates a dataset with a single element, comprising the given tensors.
Description
Creates a dataset with a single element, comprising the given tensors.
Usage
tensors_dataset(tensors)
Arguments
tensors |
A nested structure of tensors. |
Value
A dataset.
See Also
Other tensor datasets: sparse_tensor_slices_dataset(), tensor_slices_dataset()
A dataset comprising lines from one or more text files.
Description
A dataset comprising lines from one or more text files.
Usage
text_line_dataset(
filenames,
compression_type = NULL,
record_spec = NULL,
parallel_records = NULL
)
Arguments
filenames |
String(s) specifying one or more filenames |
compression_type |
A string, one of: NULL (no compression), "ZLIB", or "GZIP". |
record_spec |
(Optional) Specification used to decode delimited text lines
into records (see csv_record_spec()). |
parallel_records |
(Optional) An integer, representing the number of records to decode in parallel. If not specified, records will be processed sequentially. |
Value
A dataset
A dataset comprising records from one or more TFRecord files.
Description
A dataset comprising records from one or more TFRecord files.
Usage
tfrecord_dataset(
filenames,
compression_type = NULL,
buffer_size = NULL,
num_parallel_reads = NULL
)
Arguments
filenames |
String(s) specifying one or more filenames |
compression_type |
A string, one of: NULL (no compression), "ZLIB", or "GZIP". |
buffer_size |
An integer representing the number of bytes in the read buffer. (0 means no buffering). |
num_parallel_reads |
An integer representing the number of files to read in parallel. Defaults to reading files sequentially. |
Details
If the dataset encodes a set of TFExample instances, then they can be decoded into named records using the dataset_map() function (see example below).
Examples
## Not run:
# Creates a dataset that reads all of the examples from two files, and extracts
# the image and label features.
filenames <- c("/var/data/file1.tfrecord", "/var/data/file2.tfrecord")
dataset <- tfrecord_dataset(filenames) %>%
dataset_map(function(example_proto) {
features <- list(
image = tf$FixedLenFeature(shape(), tf$string, default_value = ""),
label = tf$FixedLenFeature(shape(), tf$int32, default_value = 0L)
)
tf$parse_single_example(example_proto, features)
})
## End(Not run)
Execute code that traverses a dataset until an out of range condition occurs
Description
Execute code that traverses a dataset until an out of range condition occurs
Usage
until_out_of_range(expr)
out_of_range_handler(e)
Arguments
expr |
Expression to execute (will be executed multiple times until the condition occurs) |
e |
Error object |
Details
When a dataset iterator reaches the end, an out of range runtime error will occur. This function will catch and ignore the error when it occurs.
Examples
## Not run:
library(tfdatasets)
dataset <- text_line_dataset("mtcars.csv", record_spec = mtcars_spec) %>%
dataset_batch(128) %>%
dataset_repeat(10) %>%
dataset_prepare(x = c(mpg, disp), y = cyl)
iter <- make_iterator_one_shot(dataset)
next_batch <- iterator_get_next(iter)
until_out_of_range({
batch <- sess$run(next_batch)
# use batch$x and batch$y tensors
})
## End(Not run)
Execute code that traverses a dataset
Description
Execute code that traverses a dataset
Usage
with_dataset(expr)
Arguments
expr |
Expression to execute |
Details
When a dataset iterator reaches the end, an out of range runtime error will occur. You can catch and ignore the error when it occurs by wrapping your iteration code in a call to with_dataset() (see the example below for an illustration).
Examples
## Not run:
library(tfdatasets)
dataset <- text_line_dataset("mtcars.csv", record_spec = mtcars_spec) %>%
dataset_prepare(x = c(mpg, disp), y = cyl) %>%
dataset_batch(128) %>%
dataset_repeat(10)
iter <- make_iterator_one_shot(dataset)
next_batch <- iterator_get_next(iter)
with_dataset({
while(TRUE) {
batch <- sess$run(next_batch)
# use batch$x and batch$y tensors
}
})
## End(Not run)
Creates a dataset by zipping together the given datasets.
Description
Merges datasets together into pairs or tuples that contain an element from each dataset.
Usage
zip_datasets(...)
Arguments
... |
Datasets to zip (or a single argument with a list or list of lists of datasets). |
Value
A dataset