Help for package tidyr

Title:

Tidy Messy Data

Version:

1.3.1

Description:

Tools to help to create tidy data, where each column is a variable, each row is an observation, and each cell contains a single value. 'tidyr' contains tools for changing the shape (pivoting) and hierarchy (nesting and 'unnesting') of a dataset, turning deeply nested lists into rectangular data frames ('rectangling'), and extracting values out of string columns. It also includes tools for working with missing values (both implicit and explicit).

License:

MIT + file LICENSE

URL:

https://tidyr.tidyverse.org, https://github.com/tidyverse/tidyr

BugReports:

https://github.com/tidyverse/tidyr/issues

Depends:

R (≥ 3.6)

Imports:

cli (≥ 3.4.1), dplyr (≥ 1.0.10), glue, lifecycle (≥ 1.0.3), magrittr, purrr (≥ 1.0.1), rlang (≥ 1.1.1), stringr (≥ 1.5.0), tibble (≥ 2.1.1), tidyselect (≥ 1.2.0), utils, vctrs (≥ 0.5.2)

Suggests:

covr, data.table, knitr, readr, repurrrsive (≥ 1.1.0), rmarkdown, testthat (≥ 3.0.0)

LinkingTo:

cpp11 (≥ 0.4.0)

VignetteBuilder:

knitr

Config/Needs/website:

tidyverse/tidytemplate

Config/testthat/edition:

Encoding:

UTF-8

LazyData:

true

RoxygenNote:

7.3.0

NeedsCompilation:

yes

Packaged:

2024-01-23 14:27:23 UTC; hadleywickham

Author:

Hadley Wickham [aut, cre], Davis Vaughan [aut], Maximilian Girlich [aut], Kevin Ushey [ctb], Posit Software, PBC [cph, fnd]

Maintainer:

Hadley Wickham <hadley@posit.co>

Repository:

CRAN

Date/Publication:

2024-01-24 14:50:09 UTC

tidyr: Tidy Messy Data

Description

Author(s)

Maintainer: Hadley Wickham hadley@posit.co

Authors:

Davis Vaughan davis@posit.co
Maximilian Girlich

Other contributors:

Kevin Ushey kevin@posit.co [contributor]
Posit Software, PBC [copyright holder, funder]

Pipe operator

Description

See magrittr::%>% for more details.

Usage

lhs %>% rhs

Song rankings for Billboard top 100 in the year 2000

Description

Song rankings for Billboard top 100 in the year 2000

Usage

billboard

Format

A dataset with variables:

artist: Artist name
track: Song name
date.enter: Date the song entered the top 100
wk1 – wk76: Rank of the song in each week after it entered

Source

The "Whitburn" project, https://waxy.org/2008/05/the_whitburn_project/, (downloaded April 2008)

Check assumptions about a pivot `spec`

Description

check_pivot_spec() is a developer-facing helper function for validating the pivot spec used in pivot_longer_spec() or pivot_wider_spec(). It is only useful if you are extending pivot_longer() or pivot_wider() with new S3 methods.

check_pivot_spec() makes the following assertions:

spec must be a data frame.
spec must have a character column named .name.
spec must have a character column named .value.
The .name column must be unique.
The .name and .value columns must be the first two columns in the data frame, and will be reordered if that is not true.

Usage

check_pivot_spec(spec, call = caller_env())

Arguments

spec

A specification data frame. This is useful for more complex pivots because it gives you greater control over how metadata stored in the columns become column names in the result.

Must be a data frame containing character .name and .value columns. Additional columns in spec should be named to match columns in the long format of the dataset and contain values corresponding to columns pivoted from the wide format. The special .seq variable is used to disambiguate rows internally; it is automatically removed after pivoting.

Examples

# A valid spec
spec <- tibble(.name = "a", .value = "b", foo = 1)
check_pivot_spec(spec)

spec <- tibble(.name = "a")
try(check_pivot_spec(spec))

# `.name` and `.value` are forced to be the first two columns
spec <- tibble(foo = 1, .value = "b", .name = "a")
check_pivot_spec(spec)

Chop and unchop

Description

Chopping and unchopping preserve the width of a data frame, changing its length. chop() makes df shorter by converting rows within each group into list-columns. unchop() makes df longer by expanding list-columns so that each element of the list-column gets its own row in the output. chop() and unchop() are building blocks for more complicated functions (like unnest(), unnest_longer(), and unnest_wider()) and are generally more suitable for programming than interactive data analysis.

Usage

chop(data, cols, ..., error_call = current_env())

unchop(
  data,
  cols,
  ...,
  keep_empty = FALSE,
  ptype = NULL,
  error_call = current_env()
)

Arguments

data

A data frame.

cols

<tidy-select> Columns to chop or unchop.

For unchop(), each column should be a list-column containing generalised vectors (e.g. any mix of NULLs, atomic vectors, S3 vectors, lists, or data frames).

...

These dots are for future extensions and must be empty.

error_call

The execution environment of a currently running function, e.g. caller_env(). The function will be mentioned in error messages as the source of the error. See the call argument of abort() for more information.

keep_empty

By default, you get one row of output for each element of the list that you are unchopping/unnesting. This means that if there's a size-0 element (like NULL or an empty data frame or vector), then that entire row will be dropped from the output. If you want to preserve all rows, use keep_empty = TRUE to replace size-0 elements with a single row of missing values.

ptype

Optionally, a named list of column name-prototype pairs to coerce cols to, overriding the default that will be guessed from combining the individual values. Alternatively, a single empty ptype can be supplied, which will be applied to all cols.
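A minimal sketch of the ptype argument described above, using a small made-up data frame. By default the output type is guessed from the elements; supplying a prototype overrides that guess.

```r
library(tidyr)

# Hypothetical input: `y` is a list-column of integer vectors
df <- tibble(x = 1:2, y = list(1L, 2:3))

# Default: the type is guessed by combining the elements (integer here)
default <- unchop(df, y)

# A single empty prototype is applied to every unchopped column
as_dbl <- unchop(df, y, ptype = double())

# The same request, written as a named list of name-prototype pairs
as_dbl_named <- unchop(df, y, ptype = list(y = double()))

typeof(default$y)
typeof(as_dbl$y)
```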

Details

Generally, unchopping is more useful than chopping because it simplifies a complex data structure, and nest()ing is usually more appropriate than chop()ing since it better preserves the connections between observations.

chop() creates list-columns of class vctrs::list_of() to ensure consistent behaviour when the chopped data frame is emptied. For instance, this helps recover the original column types after a round trip through chop() and unchop(). Because <list_of> keeps track of the type of its elements, unchop() is able to reconstitute the correct vector type even for empty list-columns.
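The round trip described above can be sketched with an empty data frame. The column names are illustrative only.

```r
library(tidyr)

# An empty data frame whose column types are known
df <- tibble(x = integer(), y = character())

# chop() stores `y` as a <list_of>, which remembers the element type
chopped <- chop(df, y)
class(chopped$y)

# unchop() uses the remembered type, even though there are no elements
# left to inspect
roundtrip <- unchop(chopped, y)
typeof(roundtrip$y)
```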

Examples

# Chop ----------------------------------------------------------------------
df <- tibble(x = c(1, 1, 1, 2, 2, 3), y = 1:6, z = 6:1)
# Note that we get one row of output for each unique combination of
# non-chopped variables
df %>% chop(c(y, z))
# cf nest
df %>% nest(data = c(y, z))

# Unchop --------------------------------------------------------------------
df <- tibble(x = 1:4, y = list(integer(), 1L, 1:2, 1:3))
df %>% unchop(y)
df %>% unchop(y, keep_empty = TRUE)

# unchop will error if the types are not compatible:
df <- tibble(x = 1:2, y = list("1", 1:3))
try(df %>% unchop(y))

# Unchopping a list-col of data frames must generate a df-col because
# unchop leaves the column names unchanged
df <- tibble(x = 1:3, y = list(NULL, tibble(x = 1), tibble(y = 1:2)))
df %>% unchop(y)
df %>% unchop(y, keep_empty = TRUE)

Data from the Centers for Medicare & Medicaid Services

Description

Two datasets from public data provided by the Centers for Medicare & Medicaid Services, https://data.cms.gov.

cms_patient_experience contains some lightly cleaned data from "Hospice - Provider Data", which provides a list of hospice agencies along with some data on quality of patient care, https://data.cms.gov/provider-data/dataset/252m-zfp9.
cms_patient_care contains data from "Doctors and Clinicians Quality Payment Program PY 2020 Virtual Group Public Reporting", https://data.cms.gov/provider-data/dataset/8c70-d353.

Usage

cms_patient_experience

cms_patient_care

Format

cms_patient_experience is a data frame with 500 observations and five variables:

org_pac_id,org_nm: Organisation ID and name
measure_cd,measure_title: Measure code and title
prf_rate: Measure performance rate

cms_patient_care is a data frame with 252 observations and five variables:

ccn,facility_name: Facility ID and name
measure_abbr: Abbreviated measurement title, suitable for use as variable name
score: Measure score
type: Whether score refers to the rating out of 100 ("observed"), or the maximum possible value of the raw score ("denominator")

Examples

cms_patient_experience %>%
  dplyr::distinct(measure_cd, measure_title)

cms_patient_experience %>%
  pivot_wider(
    id_cols = starts_with("org"),
    names_from = measure_cd,
    values_from = prf_rate
  )

cms_patient_care %>%
  pivot_wider(
    names_from = type,
    values_from = score
  )

cms_patient_care %>%
  pivot_wider(
    names_from = measure_abbr,
    values_from = score
  )

cms_patient_care %>%
  pivot_wider(
    names_from = c(measure_abbr, type),
    values_from = score
  )

Complete a data frame with missing combinations of data

Description

Turns implicit missing values into explicit missing values. This is a wrapper around expand(), dplyr::full_join() and replace_na() that's useful for completing missing combinations of data.
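As a rough illustration of the wrapper relationship described above, the sketch below builds the same result by hand. The data frame is made up, and the hand-written pipeline approximates the idea rather than reproducing the internal implementation.

```r
library(tidyr)

# Hypothetical data: the combination (g = 2, item = "b") is implicitly missing
df <- tibble(
  g = c(1, 1, 2),
  item = c("a", "b", "a"),
  value = c(10, 20, 30)
)

# One step
completed <- complete(df, g, item, fill = list(value = 0))

# Roughly the same thing, assembled from the three pieces
by_hand <- df %>%
  expand(g, item) %>%
  dplyr::full_join(df, by = c("g", "item")) %>%
  replace_na(list(value = 0))

completed
```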

Usage

complete(data, ..., fill = list(), explicit = TRUE)

Arguments

data

A data frame.

...

<data-masking> Specification of columns to expand or complete. Columns can be atomic vectors or lists.

To find all unique combinations of x, y and z, including those not present in the data, supply each variable as a separate argument: expand(df, x, y, z) or complete(df, x, y, z).
To find only the combinations that occur in the data, use nesting: expand(df, nesting(x, y, z)).
You can combine the two forms. For example, expand(df, nesting(school_id, student_id), date) would produce a row for each present school-student combination for all possible dates.

When used with factors, expand() and complete() use the full set of levels, not just those that appear in the data. If you want to use only the values seen in the data, use forcats::fct_drop().

When used with continuous variables, you may need to fill in values that do not appear in the data: to do so use expressions like year = 2010:2020 or year = full_seq(year, 1).

fill

A named list that for each variable supplies a single value to use instead of NA for missing combinations.

explicit

Should both implicit (newly created) and explicit (pre-existing) missing values be filled by fill? By default, this is TRUE, but if set to FALSE this will limit the fill to only implicit missing values.

Grouped data frames

With grouped data frames created by dplyr::group_by(), complete() operates within each group. Because of this, you cannot complete a grouping column.

Examples

df <- tibble(
  group = c(1:2, 1, 2),
  item_id = c(1:2, 2, 3),
  item_name = c("a", "a", "b", "b"),
  value1 = c(1, NA, 3, 4),
  value2 = 4:7
)
df

# Combinations --------------------------------------------------------------
# Generate all possible combinations of `group`, `item_id`, and `item_name`
# (whether or not they appear in the data)
df %>% complete(group, item_id, item_name)

# Cross all possible `group` values with the unique pairs of
# `(item_id, item_name)` that already exist in the data
df %>% complete(group, nesting(item_id, item_name))

# Within each `group`, generate all possible combinations of
# `item_id` and `item_name` that occur in that group
df %>%
  dplyr::group_by(group) %>%
  complete(item_id, item_name)

# Supplying values for new rows ---------------------------------------------
# Use `fill` to replace NAs with some value. By default, affects both new
# (implicit) and pre-existing (explicit) missing values.
df %>%
  complete(
    group,
    nesting(item_id, item_name),
    fill = list(value1 = 0, value2 = 99)
  )

# Limit the fill to only the newly created (i.e. previously implicit)
# missing values with `explicit = FALSE`
df %>%
  complete(
    group,
    nesting(item_id, item_name),
    fill = list(value1 = 0, value2 = 99),
    explicit = FALSE
  )

Completed construction in the US in 2018

Description

Completed construction in the US in 2018

Usage

construction

Format

A dataset with variables:

Year,Month: Record date
1 unit, 2 to 4 units, 5 units or more: Number of completed units of each size
Northeast,Midwest,South,West: Number of completed units in each region

Source

Completions of "New Residential Construction" found in Table 5 at https://www.census.gov/construction/nrc/xls/newresconst.xls (downloaded March 2019)

Deprecated SE versions of main verbs

Description

tidyr used to offer twin versions of each verb suffixed with an underscore. These versions had standard evaluation (SE) semantics: rather than taking arguments by code, like NSE verbs, they took arguments by value. Their purpose was to make it possible to program with tidyr. However, tidyr now uses tidy evaluation semantics. NSE verbs still capture their arguments, but you can now unquote parts of these arguments. This offers full programmability with NSE verbs. Thus, the underscored versions are now superfluous.

Unquoting triggers immediate evaluation of its operand and inlines the result within the captured expression. This result can be a value or an expression to be evaluated later with the rest of the argument. See vignette("programming", "dplyr") for more information.
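A short sketch of programming with the current verbs instead of the underscored ones. The data frame and the col_name variable are made up for illustration; rlang::sym() and all_of() are the only helpers assumed.

```r
library(tidyr)

df <- tibble(year = c(2000, NA, 2001, NA), sales = c(1, 2, 3, 4))

# A column name held as a string, e.g. received as a function argument
col_name <- "year"

# Convert the string to a symbol and unquote it with `!!`
filled <- fill(df, !!rlang::sym(col_name))

# For tidy-select arguments, all_of() is an alternative to unquoting
filled_all_of <- fill(df, all_of(col_name))

filled
```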

Usage

complete_(data, cols, fill = list(), ...)

drop_na_(data, vars)

expand_(data, dots, ...)

crossing_(x)

nesting_(x)

extract_(
  data,
  col,
  into,
  regex = "([[:alnum:]]+)",
  remove = TRUE,
  convert = FALSE,
  ...
)

fill_(data, fill_cols, .direction = c("down", "up"))

gather_(
  data,
  key_col,
  value_col,
  gather_cols,
  na.rm = FALSE,
  convert = FALSE,
  factor_key = FALSE
)

nest_(...)

separate_rows_(data, cols, sep = "[^[:alnum:].]+", convert = FALSE)

separate_(
  data,
  col,
  into,
  sep = "[^[:alnum:]]+",
  remove = TRUE,
  convert = FALSE,
  extra = "warn",
  fill = "warn",
  ...
)

spread_(
  data,
  key_col,
  value_col,
  fill = NA,
  convert = FALSE,
  drop = TRUE,
  sep = NULL
)

unite_(data, col, from, sep = "_", remove = TRUE)

unnest_(...)

Arguments

data

A data frame

fill

A named list that for each variable supplies a single value to use instead of NA for missing combinations.

...

<data-masking> Specification of columns to expand or complete. Columns can be atomic vectors or lists.

To find all unique combinations of x, y and z, including those not present in the data, supply each variable as a separate argument: expand(df, x, y, z) or complete(df, x, y, z).
To find only the combinations that occur in the data, use nesting: expand(df, nesting(x, y, z)).
You can combine the two forms. For example, expand(df, nesting(school_id, student_id), date) would produce a row for each present school-student combination for all possible dates.

When used with factors, expand() and complete() use the full set of levels, not just those that appear in the data. If you want to use only the values seen in the data, use forcats::fct_drop().

When used with continuous variables, you may need to fill in values that do not appear in the data: to do so use expressions like year = 2010:2020 or year = full_seq(year, 1).

vars, cols, col

Name of columns.

x

For nesting_ and crossing_ a list of variables.

into

Names of new variables to create as a character vector. Use NA to omit the variable in the output.

regex

A string representing a regular expression used to extract the desired values. There should be one group (defined by ()) for each element of into.

remove

If TRUE, remove input column from output data frame.

convert

If TRUE, will run type.convert() with as.is = TRUE on new columns. This is useful if the component columns are integer, numeric or logical.

NB: this will cause string "NA"s to be converted to NAs.

fill_cols

Character vector of column names.

.direction

Direction in which to fill missing values. Currently either "down" (the default), "up", "downup" (i.e. first down and then up) or "updown" (first up and then down).

key_col, value_col

Strings giving names of key and value cols.

gather_cols

Character vector giving column names to be gathered into a pair of key-value columns.

na.rm

If TRUE, will remove rows from output where the value column is NA.

factor_key

If FALSE, the default, the key values will be stored as a character vector. If TRUE, will be stored as a factor, which preserves the original ordering of the columns.

sep

Separator delimiting collapsed values.

extra

If sep is a character vector, this controls what happens when there are too many pieces. There are three valid options:

"warn" (the default): emit a warning and drop extra values.
"drop": drop any extra values without a warning.
"merge": only splits at most length(into) times

drop

If FALSE, will keep factor levels that don't appear in the data, filling in missing combinations with fill.

from

Names of existing columns as a character vector.

Drop rows containing missing values

Description

drop_na() drops rows where any column specified by ... contains a missing value.

Usage

drop_na(data, ...)

Arguments

data

A data frame.

...

<tidy-select> Columns to inspect for missing values. If empty, all columns are used.

Details

Another way to interpret drop_na() is that it keeps only the "complete" rows, i.e. rows in which none of the inspected columns contains a missing value. Internally, this completeness is computed through vctrs::vec_detect_complete().
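The relationship to vctrs::vec_detect_complete() can be sketched as follows. The manual subsetting is an illustration of the idea, not the exact internal code.

```r
library(tidyr)

df <- tibble(x = c(1, 2, NA), y = c("a", NA, "b"))

# One logical value per row: TRUE when no column in that row is missing
complete_rows <- vctrs::vec_detect_complete(df)
complete_rows

# Keeping only those rows by hand gives the same rows as drop_na()
by_hand <- df[complete_rows, ]
dropped <- drop_na(df)
```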

Examples

df <- tibble(x = c(1, 2, NA), y = c("a", NA, "b"))
df %>% drop_na()
df %>% drop_na(x)

vars <- "y"
df %>% drop_na(x, any_of(vars))

Expand data frame to include all possible combinations of values

Description

expand() generates all combinations of variables found in a dataset. It is paired with nesting() and crossing() helpers. crossing() is a wrapper around expand_grid() that de-duplicates and sorts its inputs; nesting() is a helper that only finds combinations already present in the data.

expand() is often useful in conjunction with joins:

use it with right_join() to convert implicit missing values to explicit missing values (e.g., fill in gaps in your data frame).
use it with anti_join() to figure out which combinations are missing (e.g., identify gaps in your data frame).

Usage

expand(data, ..., .name_repair = "check_unique")

crossing(..., .name_repair = "check_unique")

nesting(..., .name_repair = "check_unique")

Arguments

data

A data frame.

...

<data-masking> Specification of columns to expand or complete. Columns can be atomic vectors or lists.

To find all unique combinations of x, y and z, including those not present in the data, supply each variable as a separate argument: expand(df, x, y, z) or complete(df, x, y, z).
To find only the combinations that occur in the data, use nesting: expand(df, nesting(x, y, z)).
You can combine the two forms. For example, expand(df, nesting(school_id, student_id), date) would produce a row for each present school-student combination for all possible dates.

When used with factors, expand() and complete() use the full set of levels, not just those that appear in the data. If you want to use only the values seen in the data, use forcats::fct_drop().

When used with continuous variables, you may need to fill in values that do not appear in the data: to do so use expressions like year = 2010:2020 or year = full_seq(year, 1).

.name_repair

Treatment of problematic column names:

"minimal": No name repair or checks, beyond basic existence,
"unique": Make sure names are unique and not empty,
"check_unique": (default value), no name repair, but check they are unique,
"universal": Make the names unique and syntactic
a function: apply custom name repair (e.g., .name_repair = make.names for names in the style of base R).
A purrr-style anonymous function, see rlang::as_function()

This argument is passed on as repair to vctrs::vec_as_names(). See there for more details on these terms and the strategies used to enforce them.

Grouped data frames

With grouped data frames created by dplyr::group_by(), expand() operates within each group. Because of this, you cannot expand on a grouping column.

Examples

# Finding combinations ------------------------------------------------------
fruits <- tibble(
  type = c("apple", "orange", "apple", "orange", "orange", "orange"),
  year = c(2010, 2010, 2012, 2010, 2011, 2012),
  size = factor(
    c("XS", "S", "M", "S", "S", "M"),
    levels = c("XS", "S", "M", "L")
  ),
  weights = rnorm(6, as.numeric(size) + 2)
)

# All combinations, including factor levels that are not used
fruits %>% expand(type)
fruits %>% expand(size)
fruits %>% expand(type, size)
fruits %>% expand(type, size, year)

# Only combinations that already appear in the data
fruits %>% expand(nesting(type))
fruits %>% expand(nesting(size))
fruits %>% expand(nesting(type, size))
fruits %>% expand(nesting(type, size, year))

# Other uses ----------------------------------------------------------------
# Use with `full_seq()` to fill in values of continuous variables
fruits %>% expand(type, size, full_seq(year, 1))
fruits %>% expand(type, size, 2010:2013)

# Use `anti_join()` to determine which observations are missing
all <- fruits %>% expand(type, size, year)
all
all %>% dplyr::anti_join(fruits)

# Use with `right_join()` to fill in missing rows (like `complete()`)
fruits %>% dplyr::right_join(all)

# Use with `group_by()` to expand within each group
fruits %>%
  dplyr::group_by(type) %>%
  expand(year, size)

Create a tibble from all combinations of inputs

Description

expand_grid() is heavily motivated by expand.grid(). Compared to expand.grid(), it:

Produces sorted output (by varying the first column the slowest, rather than the fastest).
Returns a tibble, not a data frame.
Never converts strings to factors.
Does not add any additional attributes.
Can expand any generalised vector, including data frames.

Usage

expand_grid(..., .name_repair = "check_unique")

Arguments

...

Name-value pairs. The name will become the column name in the output.

.name_repair

Treatment of problematic column names:

"minimal": No name repair or checks, beyond basic existence,
"unique": Make sure names are unique and not empty,
"check_unique": (default value), no name repair, but check they are unique,
"universal": Make the names unique and syntactic
a function: apply custom name repair (e.g., .name_repair = make.names for names in the style of base R).
A purrr-style anonymous function, see rlang::as_function()

This argument is passed on as repair to vctrs::vec_as_names(). See there for more details on these terms and the strategies used to enforce them.

Value

A tibble with one column for each input in .... The output will have one row for each combination of the inputs, i.e. the size will be equal to the product of the sizes of the inputs. This implies that if any input has length 0, the output will have zero rows.
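A small sketch of the size rule stated above, using made-up inputs.

```r
library(tidyr)

# 3 values crossed with 2 values gives 3 * 2 rows
grid <- expand_grid(x = 1:3, y = c("a", "b"))
nrow(grid)

# The first column varies slowest
grid$x

# Any zero-length input yields zero rows
empty <- expand_grid(x = 1:3, y = character())
nrow(empty)
```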

Examples

expand_grid(x = 1:3, y = 1:2)
expand_grid(l1 = letters, l2 = LETTERS)

# Can also expand data frames
expand_grid(df = tibble(x = 1:2, y = c(2, 1)), z = 1:3)
# And matrices
expand_grid(x1 = matrix(1:4, nrow = 2), x2 = matrix(5:8, nrow = 2))

Extract a character column into multiple columns using regular expression groups

Description

extract() has been superseded in favour of separate_wider_regex() because it has a more polished API and better handling of problems. Superseded functions will not go away, but will only receive critical bug fixes.

Given a regular expression with capturing groups, extract() turns each group into a new column. If the groups don't match, or the input is NA, the output will be NA.

Usage

extract(
  data,
  col,
  into,
  regex = "([[:alnum:]]+)",
  remove = TRUE,
  convert = FALSE,
  ...
)

Arguments

data

A data frame.

col

<tidy-select> Column to expand.

into

Names of new variables to create as a character vector. Use NA to omit the variable in the output.

regex

A string representing a regular expression used to extract the desired values. There should be one group (defined by ()) for each element of into.

remove

If TRUE, remove input column from output data frame.

convert

If TRUE, will run type.convert() with as.is = TRUE on new columns. This is useful if the component columns are integer, numeric or logical.

NB: this will cause string "NA"s to be converted to NAs.

...

Additional arguments passed on to methods.
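The effect of convert = TRUE, including the note about the string "NA", can be sketched as follows. The input values and the new column names are made up.

```r
library(tidyr)

df <- tibble(x = c("a-1", "b-2", "c-NA"))

# Without conversion both new columns are character
as_chr <- extract(df, x, c("key", "n"), "([[:alpha:]]+)-([[:alnum:]]+)")

# With conversion, `n` becomes integer and the string "NA" becomes a real NA
converted <- extract(
  df, x, c("key", "n"),
  "([[:alpha:]]+)-([[:alnum:]]+)",
  convert = TRUE
)

typeof(as_chr$n)
typeof(converted$n)
```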

Examples

df <- tibble(x = c(NA, "a-b", "a-d", "b-c", "d-e"))
df %>% extract(x, "A")
df %>% extract(x, c("A", "B"), "([[:alnum:]]+)-([[:alnum:]]+)")

# Now recommended
df %>%
  separate_wider_regex(
    x,
    patterns = c(A = "[[:alnum:]]+", "-", B = "[[:alnum:]]+")
  )

# If no match, NA:
df %>% extract(x, c("A", "B"), "([a-d]+)-([a-d]+)")

Extract numeric component of variable.

Description

DEPRECATED: please use readr::parse_number() instead.

Usage

extract_numeric(x)

Arguments

x

A character vector (or a factor).

Fill in missing values with previous or next value

Description

Fills missing values in selected columns using the next or previous entry. This is useful in the common output format where values are not repeated, and are only recorded when they change.

Usage

fill(data, ..., .direction = c("down", "up", "downup", "updown"))

Arguments

data

A data frame.

...

<tidy-select> Columns to fill.

.direction

Direction in which to fill missing values. Currently either "down" (the default), "up", "downup" (i.e. first down and then up) or "updown" (first up and then down).

Details

Missing values are replaced in atomic vectors; NULLs are replaced in lists.
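A brief sketch of both cases described above, with a made-up data frame containing one atomic column and one list-column.

```r
library(tidyr)

df <- tibble(
  id = 1:4,
  num = c(1, NA, NA, 4),              # atomic vector: NA is the missing value
  lst = list("a", NULL, NULL, "d")    # list: NULL is the missing value
)

filled <- fill(df, num, lst)

filled$num
filled$lst
```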

Grouped data frames

With grouped data frames created by dplyr::group_by(), fill() will be applied within each group, meaning that it won't fill across group boundaries.

Examples

# direction = "down" --------------------------------------------------------
# Value (year) is recorded only when it changes
sales <- tibble::tribble(
  ~quarter, ~year, ~sales,
  "Q1",    2000,    66013,
  "Q2",      NA,    69182,
  "Q3",      NA,    53175,
  "Q4",      NA,    21001,
  "Q1",    2001,    46036,
  "Q2",      NA,    58842,
  "Q3",      NA,    44568,
  "Q4",      NA,    50197,
  "Q1",    2002,    39113,
  "Q2",      NA,    41668,
  "Q3",      NA,    30144,
  "Q4",      NA,    52897,
  "Q1",    2004,    32129,
  "Q2",      NA,    67686,
  "Q3",      NA,    31768,
  "Q4",      NA,    49094
)
# `fill()` defaults to replacing missing data from top to bottom
sales %>% fill(year)

# direction = "up" ----------------------------------------------------------
# Value (pet_type) is missing above
tidy_pets <- tibble::tribble(
  ~rank, ~pet_type, ~breed,
  1L,        NA,    "Boston Terrier",
  2L,        NA,    "Retrievers (Labrador)",
  3L,        NA,    "Retrievers (Golden)",
  4L,        NA,    "French Bulldogs",
  5L,        NA,    "Bulldogs",
  6L,     "Dog",    "Beagles",
  1L,        NA,    "Persian",
  2L,        NA,    "Maine Coon",
  3L,        NA,    "Ragdoll",
  4L,        NA,    "Exotic",
  5L,        NA,    "Siamese",
  6L,     "Cat",    "American Short"
)

# For values that are missing above you can use `.direction = "up"`
tidy_pets %>%
  fill(pet_type, .direction = "up")

# direction = "downup" ------------------------------------------------------
# Value (n_squirrels) is missing above and below within a group
squirrels <- tibble::tribble(
  ~group,    ~name,     ~role,     ~n_squirrels,
  1,      "Sam",    "Observer",   NA,
  1,     "Mara", "Scorekeeper",    8,
  1,    "Jesse",    "Observer",   NA,
  1,      "Tom",    "Observer",   NA,
  2,     "Mike",    "Observer",   NA,
  2,  "Rachael",    "Observer",   NA,
  2,  "Sydekea", "Scorekeeper",   14,
  2, "Gabriela",    "Observer",   NA,
  3,  "Derrick",    "Observer",   NA,
  3,     "Kara", "Scorekeeper",    9,
  3,    "Emily",    "Observer",   NA,
  3, "Danielle",    "Observer",   NA
)

# The values are inconsistently missing by position within the group
# Use .direction = "downup" to fill missing values in both directions
squirrels %>%
  dplyr::group_by(group) %>%
  fill(n_squirrels, .direction = "downup") %>%
  dplyr::ungroup()

# Using `.direction = "updown"` accomplishes the same goal in this example

Fish encounters

Description

Information about fish swimming down a river: each station represents an autonomous monitor that records if a tagged fish was seen at that location. Fish travel in one direction (migrating downstream). Information about misses is just as important as hits, but is not directly recorded in this form of the data.

Usage

fish_encounters

Format

A dataset with variables:

fish: Fish identifier
station: Measurement station
seen: Was the fish seen? (1 if yes, and true for all rows)

Source

Dataset provided by Myfanwy Johnston; more details at https://fishsciences.github.io/post/visualizing-fish-encounter-histories/
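Because misses are not recorded directly, one way to make them explicit is to fill the absent fish-station combinations with 0. The sketch below shows two possible approaches; both rest on the assumption that an absent row means "not seen".

```r
library(tidyr)

# Wide form: one column per station, 0 where a fish was not recorded
wide <- fish_encounters %>%
  pivot_wider(
    names_from = station,
    values_from = seen,
    values_fill = 0
  )

# Long form: add the absent fish-station rows and set `seen` to 0
long <- fish_encounters %>%
  complete(fish, station, fill = list(seen = 0))

wide
```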

Create the full sequence of values in a vector

Description

This is useful if you want to fill in missing values that should have been observed but weren't. For example, full_seq(c(1, 2, 4, 6), 1) will return 1:6.

Usage

full_seq(x, period, tol = 1e-06)

Arguments

x

A numeric vector.

period

Gap between each observation. The existing data will be checked to ensure that it is actually of this periodicity.

tol

Numerical tolerance for checking periodicity.
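A short sketch of the periodicity check described for period. The input vectors are made up.

```r
library(tidyr)

# Gaps in a regular sequence are filled in
filled <- full_seq(c(1, 2, 4, 6), 1)
filled

# The period does not have to be 1
tens <- full_seq(c(0, 10, 30), 10)
tens

# Values that do not fit the stated period cause an error
res <- try(full_seq(c(1, 2.5, 4), 1), silent = TRUE)
inherits(res, "try-error")
```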

Examples

full_seq(c(1, 2, 4, 5, 10), 1)

Gather columns into key-value pairs

Description

Development on gather() is complete, and for new code we recommend switching to pivot_longer(), which is easier to use, more featureful, and still under active development. df %>% gather("key", "value", x, y, z) is equivalent to df %>% pivot_longer(c(x, y, z), names_to = "key", values_to = "value").

See more details in vignette("pivot").
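The equivalence mentioned above can be checked on a small made-up data frame. The two functions return the same rows in a different order (gather() stacks one column at a time, pivot_longer() works row by row), so the sketch sorts both results before comparing.

```r
library(tidyr)

df <- tibble(id = 1:2, x = c(1, 2), y = c(3, 4), z = c(5, 6))

old <- df %>% gather("key", "value", x, y, z)
new <- df %>% pivot_longer(c(x, y, z), names_to = "key", values_to = "value")

# Same rows once the ordering difference is removed
old_sorted <- dplyr::arrange(old, id, key)
new_sorted <- dplyr::arrange(new, id, key)
all.equal(as.data.frame(old_sorted), as.data.frame(new_sorted))
```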

Usage

gather(
  data,
  key = "key",
  value = "value",
  ...,
  na.rm = FALSE,
  convert = FALSE,
  factor_key = FALSE
)

Arguments

data

A data frame.

key, value

Names of new key and value columns, as strings or symbols.

This argument is passed by expression and supports quasiquotation (you can unquote strings and symbols). The name is captured from the expression with rlang::ensym() (note that this kind of interface where symbols do not represent actual objects is now discouraged in the tidyverse; we support it here for backward compatibility).

...

A selection of columns. If empty, all variables are selected. You can supply bare variable names, select all variables between x and z with x:z, exclude y with -y. For more options, see the dplyr::select() documentation. See also the section on selection rules below.

na.rm

If TRUE, will remove rows from output where the value column is NA.

convert

If TRUE will automatically run type.convert() on the key column. This is useful if the column types are actually numeric, integer, or logical.

factor_key

If FALSE, the default, the key values will be stored as a character vector. If TRUE, will be stored as a factor, which preserves the original ordering of the columns.

Rules for selection

Arguments for selecting columns are passed to tidyselect::vars_select() and are treated specially. Unlike other verbs, selecting functions make a strict distinction between data expressions and context expressions.

A data expression is either a bare name like x or an expression like x:y or c(x, y). In a data expression, you can only refer to columns from the data frame.
Everything else is a context expression in which you can only refer to objects that you have defined with <-.

For instance, col1:col3 is a data expression that refers to data columns, while seq(start, end) is a context expression that refers to objects from the context.

If you need to refer to contextual objects from a data expression, you can use all_of() or any_of(). These functions are used to select data-variables whose names are stored in an env-variable. For instance, all_of(a) selects the variables listed in the character vector a. For more details, see the tidyselect::select_helpers() documentation.
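A minimal sketch of selecting columns whose names are stored in a character vector. The stocks data frame and the cols variable are made up for illustration.

```r
library(tidyr)

stocks <- tibble(
  time = 1:3,
  X = c(1, 2, 3),
  Y = c(4, 5, 6),
  Z = c(7, 8, 9)
)

# Column names held in an env-variable
cols <- c("X", "Y")

# all_of() bridges from the env-variable to the data columns
long <- gather(stocks, "stock", "price", all_of(cols))
long
```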

Examples

# From https://stackoverflow.com/questions/1181060
stocks <- tibble(
  time = as.Date("2009-01-01") + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)

gather(stocks, "stock", "price", -time)
stocks %>% gather("stock", "price", -time)

# get first observation for each Species in iris data -- base R
mini_iris <- iris[c(1, 51, 101), ]
# gather Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
gather(mini_iris, key = "flower_att", value = "measurement",
       Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
# same result but less verbose
gather(mini_iris, key = "flower_att", value = "measurement", -Species)
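
The selection rules above can be sketched as follows. This is a minimal illustration, not part of the original examples: measure_cols is a hypothetical env-variable that holds the column names, and all_of() is used to refer to it from a data expression. factor_key = TRUE is added to show the key stored as a factor in the original column order.

```r
library(tidyr)

mini_iris <- iris[c(1, 51, 101), ]

# Hypothetical env-variable holding the names of the columns to gather
measure_cols <- c("Sepal.Length", "Sepal.Width")

# all_of() bridges from the context (measure_cols) to the data columns;
# factor_key = TRUE keeps the key as a factor in column order
long <- gather(
  mini_iris,
  key = "flower_att",
  value = "measurement",
  all_of(measure_cols),
  factor_key = TRUE
)
long
```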

Hoist values out of list-columns

Description

hoist() allows you to selectively pull components of a list-column into their own top-level columns, using the same syntax as purrr::pluck().

Learn more in vignette("rectangle").

Usage

hoist(
  .data,
  .col,
  ...,
  .remove = TRUE,
  .simplify = TRUE,
  .ptype = NULL,
  .transform = NULL
)

Arguments

.data

A data frame.

.col

<tidy-select> List-column to extract components from.

...

<dynamic-dots> Components of .col to turn into columns in the form col_name = "pluck_specification". You can pluck by name with a character vector, by position with an integer vector, or with a combination of the two with a list. See purrr::pluck() for details.

The column names must be unique in a call to hoist(), although existing columns with the same name will be overwritten. When plucking with a single string you can choose to omit the name, i.e. hoist(df, col, "x") is short-hand for hoist(df, col, x = "x").

.remove

If TRUE, the default, will remove extracted components from .col. This ensures that each value lives only in one place. If all components are removed from .col, then .col will be removed from the result entirely.

.simplify

If TRUE, will attempt to simplify lists of length-1 vectors to an atomic vector. Can also be a named list containing TRUE or FALSE declaring whether or not to attempt to simplify a particular column. If a named list is provided, the default for any unspecified columns is TRUE.

.ptype

Optionally, a named list of prototypes declaring the desired output type of each component. Alternatively, a single empty prototype can be supplied, which will be applied to all components. Use this argument if you want to check that each element has the type you expect when simplifying.

If a ptype has been specified, but simplify = FALSE or simplification isn't possible, then a list-of column will be returned and each element will have type ptype.

.transform

Optionally, a named list of transformation functions applied to each component. Alternatively, a single function can be supplied, which will be applied to all components. Use this argument if you want to transform or parse individual elements as they are extracted.

When both ptype and transform are supplied, the transform is applied before the ptype.

Examples

df <- tibble(
  character = c("Toothless", "Dory"),
  metadata = list(
    list(
      species = "dragon",
      color = "black",
      films = c(
        "How to Train Your Dragon",
        "How to Train Your Dragon 2",
        "How to Train Your Dragon: The Hidden World"
      )
    ),
    list(
      species = "blue tang",
      color = "blue",
      films = c("Finding Nemo", "Finding Dory")
    )
  )
)
df

# Extract only specified components
df %>% hoist(metadata,
  "species",
  first_film = list("films", 1L),
  third_film = list("films", 3L)
)
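
The .remove, .ptype, and .transform arguments are not exercised above. The following is a small sketch on made-up data (the column names id, info, name, and score are assumptions chosen for illustration): score is stored as a string and parsed with .transform, name is checked against a character prototype, and .remove = FALSE keeps the original list-column intact.

```r
library(tidyr)

# Illustrative data: `score` is stored as a string inside the list-column
df <- tibble(
  id = 1:2,
  info = list(
    list(name = "a", score = "10"),
    list(name = "b", score = "20")
  )
)

out <- hoist(
  df, info,
  "name",
  score = "score",
  .transform = list(score = as.integer), # parse each element as it is extracted
  .ptype = list(name = character()),     # confirm `name` simplifies to character
  .remove = FALSE                        # keep hoisted components in `info`
)
out
```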

Household data

Description

This dataset is based on an example in vignette("datatable-reshape", package = "data.table")

Usage

household

Format

A data frame with 5 rows and 5 columns:

family: Family identifier
dob_child1: Date of birth of first child
dob_child2: Date of birth of second child
name_child1: Name of first child
name_child2: Name of second child

Nest rows into a list-column of data frames

Description

Nesting creates a list-column of data frames; unnesting flattens it back out into regular columns. Nesting is implicitly a summarising operation: you get one row for each group defined by the non-nested columns. This is useful in conjunction with other summaries that work with whole datasets, most notably models.

Learn more in vignette("nest").

Usage

nest(.data, ..., .by = NULL, .key = NULL, .names_sep = NULL)

Arguments

.data

A data frame.

...

<tidy-select> Columns to nest; these will appear in the inner data frames.

Specified using name-variable pairs of the form new_col = c(col1, col2, col3). The right hand side can be any valid tidyselect expression.

If not supplied, then ... is derived as all columns not selected by .by, and will use the column name from .key.

Deprecated: previously you could write df %>% nest(x, y, z). Convert to df %>% nest(data = c(x, y, z)).

.by

<tidy-select> Columns to nest by; these will remain in the outer data frame.

.by can be used in place of or in conjunction with columns supplied through ....

If not supplied, then .by is derived as all columns not selected by ....

.key

The name of the resulting nested column. Only applicable when ... isn't specified, i.e. in the case of df %>% nest(.by = x).

If NULL, then "data" will be used by default.

.names_sep

If NULL, the default, the inner names will come from the former outer names. If a string, the new inner names will use the outer names with names_sep automatically stripped. This makes names_sep roughly symmetric between nesting and unnesting.

Details

If neither ... nor .by are supplied, nest() will nest all variables, and will use the column name supplied through .key.

New syntax

tidyr 1.0.0 introduced a new syntax for nest() and unnest() that's designed to be more similar to other functions. Converting to the new syntax should be straightforward (guided by the message you'll receive) but if you just need to run an old analysis, you can easily revert to the previous behaviour using nest_legacy() and unnest_legacy() as follows:

library(tidyr)
nest <- nest_legacy
unnest <- unnest_legacy

Grouped data frames

df %>% nest(data = c(x, y)) specifies the columns to be nested; i.e. the columns that will appear in the inner data frame. df %>% nest(.by = c(x, y)) specifies the columns to nest by; i.e. the columns that will remain in the outer data frame. An alternative way to achieve the latter is to nest() a grouped data frame created by dplyr::group_by(). The grouping variables remain in the outer data frame and the others are nested. The result preserves the grouping of the input.

Variables supplied to nest() will override grouping variables so that df %>% group_by(x, y) %>% nest(data = !z) will be equivalent to df %>% nest(data = !z).

You can't supply .by with a grouped data frame, as the groups already represent what you are nesting by.
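
A minimal sketch of the difference described above, using a small made-up data frame: nesting a grouped data frame keeps the grouping on the result, while .by produces the same outer columns without grouping.

```r
library(tidyr)

df <- tibble(x = c(1, 1, 2), y = 1:3, z = 3:1)

# Grouped input: `x` stays in the outer data frame and the result is grouped
by_group <- df %>%
  dplyr::group_by(x) %>%
  nest()

# `.by`: same outer/inner split, but the result is not grouped
by_arg <- df %>% nest(.by = x)

dplyr::is_grouped_df(by_group)
dplyr::is_grouped_df(by_arg)
```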

Examples

df <- tibble(x = c(1, 1, 1, 2, 2, 3), y = 1:6, z = 6:1)

# Specify variables to nest using name-variable pairs.
# Note that we get one row of output for each unique combination of
# non-nested variables.
df %>% nest(data = c(y, z))

# Specify variables to nest by (rather than variables to nest) using `.by`
df %>% nest(.by = x)

# In this case, since `...` isn't used you can specify the resulting column
# name with `.key`
df %>% nest(.by = x, .key = "cols")

# Use tidyselect syntax and helpers, just like in `dplyr::select()`
df %>% nest(data = any_of(c("y", "z")))

# `...` and `.by` can be used together to drop columns you no longer need,
# or to include the columns you are nesting by in the inner data frame too.
# This drops `z`:
df %>% nest(data = y, .by = x)
# This includes `x` in the inner data frame:
df %>% nest(data = everything(), .by = x)

# Multiple nesting structures can be specified at once
iris %>%
  nest(petal = starts_with("Petal"), sepal = starts_with("Sepal"))
iris %>%
  nest(width = contains("Width"), length = contains("Length"))

# Nesting a grouped data frame nests all variables apart from the group vars
fish_encounters %>%
  dplyr::group_by(fish) %>%
  nest()

# That is similar to `nest(.by = )`, except here the result isn't grouped
fish_encounters %>%
  nest(.by = fish)

# Nesting is often useful for creating per group models
mtcars %>%
  nest(.by = cyl) %>%
  dplyr::mutate(models = lapply(data, function(df) lm(mpg ~ wt, data = df)))
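
The .names_sep argument is not shown in the examples above. As a sketch on made-up column names (g, x_a, x_b are assumptions), the following strips the shared prefix when nesting and restores it with the matching names_sep argument of unnest():

```r
library(tidyr)

df <- tibble(g = c(1, 1, 2), x_a = 1:3, x_b = 4:6)

# The outer name plus separator ("x_") is stripped from the inner names
nested <- nest(df, x = c(x_a, x_b), .names_sep = "_")
names(nested$x[[1]])

# unnest() with the same separator pastes the outer name back on
back <- unnest(nested, x, names_sep = "_")
names(back)
```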

Legacy versions of `nest()` and `unnest()`

Description

tidyr 1.0.0 introduced a new syntax for nest() and unnest(). The majority of existing usage should be automatically translated to the new syntax with a warning. However, if you need to quickly roll back to the previous behaviour, these functions provide the previous interface. To make old code work as is, add the following code to the top of your script:

library(tidyr)
nest <- nest_legacy
unnest <- unnest_legacy

Usage

nest_legacy(data, ..., .key = "data")

unnest_legacy(data, ..., .drop = NA, .id = NULL, .sep = NULL, .preserve = NULL)

Arguments

data

A data frame.

...

Specification of columns to unnest. Use bare variable names or functions of variables. If omitted, defaults to all list-cols.

.key

The name of the new column, as a string or symbol. This argument is passed by expression and supports quasiquotation (you can unquote strings and symbols). The name is captured from the expression with rlang::ensym() (note that this kind of interface where symbols do not represent actual objects is now discouraged in the tidyverse; we support it here for backward compatibility).

.drop

Should additional list columns be dropped? By default, unnest() will drop them if unnesting the specified columns requires the rows to be duplicated.

.id

Data frame identifier - if supplied, will create a new column with name .id, giving a unique identifier. This is most useful if the list column is named.

.sep

If non-NULL, the names of unnested data frame columns will combine the name of the original list-col with the names from the nested data frame, separated by .sep.

.preserve

Optionally, list-columns to preserve in the output. These will be duplicated in the same way as atomic vectors. This has dplyr::select() semantics so you can preserve multiple variables with .preserve = c(x, y) or .preserve = starts_with("list").

Examples

# Nest and unnest are inverses
df <- tibble(x = c(1, 1, 2), y = 3:1)
df %>% nest_legacy(y)
df %>% nest_legacy(y) %>% unnest_legacy()

# nesting -------------------------------------------------------------------
as_tibble(iris) %>% nest_legacy(!Species)
as_tibble(chickwts) %>% nest_legacy(weight)

# unnesting -----------------------------------------------------------------
df <- tibble(
  x = 1:2,
  y = list(
    tibble(z = 1),
    tibble(z = 3:4)
  )
)
df %>% unnest_legacy(y)

# You can also unnest multiple columns simultaneously
df <- tibble(
  a = list(c("a", "b"), "c"),
  b = list(1:2, 3),
  c = c(11, 22)
)
df %>% unnest_legacy(a, b)
# If you omit the column names, it'll unnest all list-cols
df %>% unnest_legacy()
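
When migrating old code, it can help to compare the legacy call with its current equivalent. The sketch below assumes the simple case of nesting a single column; the variable names old and new are illustrative only.

```r
library(tidyr)

df <- tibble(x = c(1, 1, 2), y = 3:1)

# Legacy interface: bare column names select what to nest
old <- nest_legacy(df, y)

# Current interface: name-variable pairs
new <- nest(df, data = y)

names(old)
names(new)
```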

Pack and unpack

Description

Packing and unpacking preserve the length of a data frame, changing its width. pack() makes df narrow by collapsing a set of columns into a single df-column. unpack() makes data wider by expanding df-columns back out into individual columns.

Usage

pack(.data, ..., .names_sep = NULL, .error_call = current_env())

unpack(
  data,
  cols,
  ...,
  names_sep = NULL,
  names_repair = "check_unique",
  error_call = current_env()
)

Arguments

...

For pack(), <tidy-select> columns to pack, specified using name-variable pairs of the form new_col = c(col1, col2, col3). The right hand side can be any valid tidy select expression.

For unpack(), these dots are for future extensions and must be empty.

data, .data

A data frame.

cols

<tidy-select> Columns to unpack.

names_sep, .names_sep

If NULL, the default, the names will be left as is. In pack(), inner names will come from the former outer names; in unpack(), the new outer names will come from the inner names.

If a string, the inner and outer names will be used together. In unpack(), the names of the new outer columns will be formed by pasting together the outer and the inner column names, separated by names_sep. In pack(), the new inner names will have the outer names + names_sep automatically stripped. This makes names_sep roughly symmetric between packing and unpacking.

names_repair

Used to check that output data frame has valid names. Must be one of the following options:

"minimal": no name repair or checks, beyond basic existence.
"unique": make sure names are unique and not empty.
"check_unique": (the default) no name repair, but check they are unique.
"universal": make the names unique and syntactic.
a function: apply custom name repair.
tidyr_legacy: use the name repair from tidyr 0.8.
a formula: a purrr-style anonymous function (see rlang::as_function()).

See vctrs::vec_as_names() for more details on these terms and the strategies used to enforce them.

error_call, .error_call

Details

Generally, unpacking is more useful than packing because it simplifies a complex data structure. Currently, few functions work with df-cols, and they are mostly a curiosity, but seem worth exploring further because they mimic the nested column headers that are so popular in Excel.

Examples

# Packing -------------------------------------------------------------------
# It's not currently clear why you would ever want to pack columns
# since few functions work with this sort of data.
df <- tibble(x1 = 1:3, x2 = 4:6, x3 = 7:9, y = 1:3)
df
df %>% pack(x = starts_with("x"))
df %>% pack(x = c(x1, x2, x3), y = y)

# .names_sep allows you to strip off common prefixes; this
# acts as a natural inverse to names_sep in unpack()
iris %>%
  as_tibble() %>%
  pack(
    Sepal = starts_with("Sepal"),
    Petal = starts_with("Petal"),
    .names_sep = "."
  )

# Unpacking -----------------------------------------------------------------
df <- tibble(
  x = 1:3,
  y = tibble(a = 1:3, b = 3:1),
  z = tibble(X = c("a", "b", "c"), Y = runif(3), Z = c(TRUE, FALSE, NA))
)
df
df %>% unpack(y)
df %>% unpack(c(y, z))
df %>% unpack(c(y, z), names_sep = "_")
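
The symmetry of names_sep between packing and unpacking can be sketched as a round trip. The data frame below is made up for illustration; only pack() and unpack() as documented above are used.

```r
library(tidyr)

df <- tibble(x_a = 1:3, x_b = 4:6, y = 7:9)

# pack(): the "x_" prefix is stripped from the inner names
packed <- pack(df, x = starts_with("x"), .names_sep = "_")
names(packed$x)

# unpack(): outer and inner names are pasted back together
unpacked <- unpack(packed, x, names_sep = "_")
names(unpacked)
```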

Pivot data from wide to long

Description

pivot_longer() "lengthens" data, increasing the number of rows and decreasing the number of columns. The inverse transformation is pivot_wider().

Learn more in vignette("pivot").

Usage

pivot_longer(
  data,
  cols,
  ...,
  cols_vary = "fastest",
  names_to = "name",
  names_prefix = NULL,
  names_sep = NULL,
  names_pattern = NULL,
  names_ptypes = NULL,
  names_transform = NULL,
  names_repair = "check_unique",
  values_to = "value",
  values_drop_na = FALSE,
  values_ptypes = NULL,
  values_transform = NULL
)

Arguments

data

A data frame to pivot.

cols

<tidy-select> Columns to pivot into longer format.

...

Additional arguments passed on to methods.

cols_vary

When pivoting cols into longer format, how should the output rows be arranged relative to their original row number?

"fastest", the default, keeps individual rows from cols close together in the output. This often produces intuitively ordered output when you have at least one key column from data that is not involved in the pivoting process.
"slowest" keeps individual columns from cols close together in the output. This often produces intuitively ordered output when you utilize all of the columns from data in the pivoting process.

names_to

A character vector specifying the new column or columns to create from the information stored in the column names of data specified by cols.

If length 0, or if NULL is supplied, no columns will be created.
If length 1, a single column will be created which will contain the column names specified by cols.
If length >1, multiple columns will be created. In this case, one of names_sep or names_pattern must be supplied to specify how the column names should be split. There are also two additional character values you can take advantage of:
- NA will discard the corresponding component of the column name.
- ".value" indicates that the corresponding component of the column name defines the name of the output column containing the cell values, overriding values_to entirely.

names_prefix

A regular expression used to remove matching text from the start of each variable name.

names_sep, names_pattern

If names_to contains multiple values, these arguments control how the column name is broken up.

names_sep takes the same specification as separate(), and can either be a numeric vector (specifying positions to break on), or a single string (specifying a regular expression to split on).

names_pattern takes the same specification as extract(), a regular expression containing matching groups (()).

If these arguments do not give you enough control, use pivot_longer_spec() to create a spec object and process manually as needed.

names_ptypes, values_ptypes

Optionally, a list of column name-prototype pairs. Alternatively, a single empty prototype can be supplied, which will be applied to all columns. A prototype (or ptype for short) is a zero-length vector (like integer() or numeric()) that defines the type, class, and attributes of a vector. Use these arguments if you want to confirm that the created columns are the types that you expect. Note that if you want to change (instead of confirm) the types of specific columns, you should use names_transform or values_transform instead.

names_transform, values_transform

Optionally, a list of column name-function pairs. Alternatively, a single function can be supplied, which will be applied to all columns. Use these arguments if you need to change the types of specific columns. For example, names_transform = list(week = as.integer) would convert a character variable called week to an integer.

If not specified, the type of the columns generated from names_to will be character, and the type of the variables generated from values_to will be the common type of the input columns used to generate them.

names_repair

What happens if the output has invalid column names? The default, "check_unique" is to error if the columns are duplicated. Use "minimal" to allow duplicates in the output, or "unique" to de-duplicate by adding numeric suffixes. See vctrs::vec_as_names() for more options.

values_to

A string specifying the name of the column to create from the data stored in cell values. If names_to is a character containing the special .value sentinel, this value will be ignored, and the name of the value column will be derived from part of the existing column names.

values_drop_na

If TRUE, will drop rows that contain only NAs in the values_to column. This effectively converts explicit missing values to implicit missing values, and should generally be used only when missing values in data were created by its structure.
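
As a small sketch of the cols_vary argument described above, the following pivots a made-up two-column data frame both ways so the row ordering can be compared directly:

```r
library(tidyr)

df <- tibble(x = 1:2, y = 3:4)

# "fastest": values from the same original row stay together
fast <- pivot_longer(df, everything(), cols_vary = "fastest")
fast$name

# "slowest": values from the same original column stay together
slow <- pivot_longer(df, everything(), cols_vary = "slowest")
slow$name
```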

Details

pivot_longer() is an updated approach to gather(), designed to be both simpler to use and to handle more use cases. We recommend you use pivot_longer() for new code; gather() isn't going away but is no longer under active development.

Examples

# See vignette("pivot") for examples and explanation

# Simplest case where column names are character data
relig_income
relig_income %>%
  pivot_longer(!religion, names_to = "income", values_to = "count")

# Slightly more complex case where columns have common prefix,
# and missing values are structural so should be dropped.
billboard
billboard %>%
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    names_prefix = "wk",
    values_to = "rank",
    values_drop_na = TRUE
  )

# Multiple variables stored in column names
who %>% pivot_longer(
  cols = new_sp_m014:newrel_f65,
  names_to = c("diagnosis", "gender", "age"),
  names_pattern = "new_?(.*)_(.)(.*)",
  values_to = "count"
)

# Multiple observations per row. Since all columns are used in the pivoting
# process, we'll use `cols_vary` to keep values from the original columns
# close together in the output.
anscombe
anscombe %>%
  pivot_longer(
    everything(),
    cols_vary = "slowest",
    names_to = c(".value", "set"),
    names_pattern = "(.)(.)"
  )
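
The names_transform argument is described above but not demonstrated. A minimal sketch on made-up data (the columns id, wk1, and wk2 are assumptions modelled on the billboard example) converts the week names to integers while pivoting:

```r
library(tidyr)

df <- tibble(id = 1:2, wk1 = c(5, 3), wk2 = c(NA, 4))

long <- pivot_longer(
  df,
  cols = starts_with("wk"),
  names_to = "week",
  names_prefix = "wk",                       # strip the prefix first
  names_transform = list(week = as.integer), # then convert "1", "2" to integer
  values_to = "rank",
  values_drop_na = TRUE                      # drop the structural NA
)
long
```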

Pivot data from wide to long using a spec

Description

This is a low level interface to pivoting, inspired by the cdata package, that allows you to describe pivoting with a data frame.

Usage

pivot_longer_spec(
  data,
  spec,
  ...,
  cols_vary = "fastest",
  names_repair = "check_unique",
  values_drop_na = FALSE,
  values_ptypes = NULL,
  values_transform = NULL,
  error_call = current_env()
)

build_longer_spec(
  data,
  cols,
  ...,
  names_to = "name",
  values_to = "value",
  names_prefix = NULL,
  names_sep = NULL,
  names_pattern = NULL,
  names_ptypes = NULL,
  names_transform = NULL,
  error_call = current_env()
)

Arguments

data

A data frame to pivot.

spec

A specification data frame. This is useful for more complex pivots because it gives you greater control on how metadata stored in the column names turns into columns in the result.

...

These dots are for future extensions and must be empty.

cols_vary

When pivoting cols into longer format, how should the output rows be arranged relative to their original row number?

"fastest", the default, keeps individual rows from cols close together in the output. This often produces intuitively ordered output when you have at least one key column from data that is not involved in the pivoting process.
"slowest" keeps individual columns from cols close together in the output. This often produces intuitively ordered output when you utilize all of the columns from data in the pivoting process.

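names_repair

What happens if the output has invalid column names? The default, "check_unique" is to error if the columns are duplicated. Use "minimal" to allow duplicates in the output, or "unique" to de-duplicate by adding numeric suffixes. See vctrs::vec_as_names() for more options.

values_drop_na

If TRUE, will drop rows that contain only NAs in the values_to column. This effectively converts explicit missing values to implicit missing values, and should generally be used only when missing values in data were created by its structure.

error_call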

cols

<tidy-select> Columns to pivot into longer format.

names_to

A character vector specifying the new column or columns to create from the information stored in the column names of data specified by cols.

If length 0, or if NULL is supplied, no columns will be created.
If length 1, a single column will be created which will contain the column names specified by cols.
If length >1, multiple columns will be created. In this case, one of names_sep or names_pattern must be supplied to specify how the column names should be split. There are also two additional character values you can take advantage of:
- NA will discard the corresponding component of the column name.
- ".value" indicates that the corresponding component of the column name defines the name of the output column containing the cell values, overriding values_to entirely.

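values_to

A string specifying the name of the column to create from the data stored in cell values. If names_to is a character containing the special .value sentinel, this value will be ignored, and the name of the value column will be derived from part of the existing column names.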

names_prefix

A regular expression used to remove matching text from the start of each variable name.

names_sep, names_pattern

If names_to contains multiple values, these arguments control how the column name is broken up.

names_sep takes the same specification as separate(), and can either be a numeric vector (specifying positions to break on), or a single string (specifying a regular expression to split on).

names_pattern takes the same specification as extract(), a regular expression containing matching groups (()).

If these arguments do not give you enough control, use pivot_longer_spec() to create a spec object and process manually as needed.

names_ptypes, values_ptypes

names_transform, values_transform

Examples

# See vignette("pivot") for examples and explanation

# Use `build_longer_spec()` to build `spec` using similar syntax to `pivot_longer()`
# and run `pivot_longer_spec()` based on `spec`.
spec <- relig_income %>% build_longer_spec(
  cols = !religion,
  names_to = "income",
  values_to = "count"
)
spec

pivot_longer_spec(relig_income, spec)

# Is equivalent to:
relig_income %>% pivot_longer(
  cols = !religion,
  names_to = "income",
  values_to = "count"
)
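
A spec can also be written by hand rather than built with build_longer_spec(). The sketch below assumes a made-up data frame with columns x_1 and x_2; the spec maps each input column (.name) to an output value column (.value) and records the extra metadata (here a hypothetical time column) to attach to each row.

```r
library(tidyr)

df <- tibble(id = 1:2, x_1 = c(1, 2), x_2 = c(3, 4))

# Hand-written spec: one row per input column
spec <- tibble(
  .name = c("x_1", "x_2"), # columns in `df` to pivot
  .value = "x",            # name of the output value column
  time = c(1L, 2L)         # metadata stored alongside each value
)

out <- pivot_longer_spec(df, spec)
out
```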

Pivot data from long to wide

Description

pivot_wider() "widens" data, increasing the number of columns and decreasing the number of rows. The inverse transformation is pivot_longer().

Learn more in vignette("pivot").

Usage

pivot_wider(
  data,
  ...,
  id_cols = NULL,
  id_expand = FALSE,
  names_from = name,
  names_prefix = "",
  names_sep = "_",
  names_glue = NULL,
  names_sort = FALSE,
  names_vary = "fastest",
  names_expand = FALSE,
  names_repair = "check_unique",
  values_from = value,
  values_fill = NULL,
  values_fn = NULL,
  unused_fn = NULL
)

Arguments

data

A data frame to pivot.

...

Additional arguments passed on to methods.

id_cols

<tidy-select> A set of columns that uniquely identify each observation. Typically used when you have redundant variables, i.e. variables whose values are perfectly correlated with existing variables.

Defaults to all columns in data except for the columns specified through names_from and values_from. If a tidyselect expression is supplied, it will be evaluated on data after removing the columns specified through names_from and values_from.

id_expand

Should the values in the id_cols columns be expanded by expand() before pivoting? This results in more rows; the output will contain a complete expansion of all possible values in id_cols. Implicit factor levels that aren't represented in the data will become explicit. Additionally, the row values corresponding to the expanded id_cols will be sorted.

names_from, values_from

<tidy-select> A pair of arguments describing which column (or columns) to get the name of the output column (names_from), and which column (or columns) to get the cell values from (values_from).

If values_from contains multiple columns, the name of each values_from column will be added to the front of the output column name.

names_prefix

String added to the start of every variable name. This is particularly useful if names_from is a numeric vector and you want to create syntactic variable names.

names_sep

If names_from or values_from contains multiple variables, this will be used to join their values together into a single string to use as a column name.

names_glue

Instead of names_sep and names_prefix, you can supply a glue specification that uses the names_from columns (and special .value) to create custom column names.

names_sort

Should the column names be sorted? If FALSE, the default, column names are ordered by first appearance.

names_vary

When names_from identifies a column (or columns) with multiple unique values, and multiple values_from columns are provided, in what order should the resulting column names be combined?

"fastest" varies names_from values fastest, resulting in a column naming scheme of the form: value1_name1, value1_name2, value2_name1, value2_name2. This is the default.
"slowest" varies names_from values slowest, resulting in a column naming scheme of the form: value1_name1, value2_name1, value1_name2, value2_name2.

names_expand

Should the values in the names_from columns be expanded by expand() before pivoting? This results in more columns; the output will contain column names corresponding to a complete expansion of all possible values in names_from. Implicit factor levels that aren't represented in the data will become explicit. Additionally, the column names will be sorted, identical to what names_sort would produce.

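names_repair

What happens if the output has invalid column names? The default, "check_unique" is to error if the columns are duplicated. Use "minimal" to allow duplicates in the output, or "unique" to de-duplicate by adding numeric suffixes. See vctrs::vec_as_names() for more options.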

values_fill

Optionally, a (scalar) value that specifies what each value should be filled in with when missing.

This can be a named list if you want to apply different fill values to different value columns.

values_fn

Optionally, a function applied to the value in each cell in the output. You will typically use this when the combination of id_cols and names_from columns does not uniquely identify an observation.

This can be a named list if you want to apply different aggregations to different values_from columns.

unused_fn

Optionally, a function applied to summarize the values from the unused columns (i.e. columns not identified by id_cols, names_from, or values_from).

The default drops all unused columns from the result.

This can be a named list if you want to apply different aggregations to different unused columns.

id_cols must be supplied for unused_fn to be useful, since otherwise all unspecified columns will be considered id_cols.

This is similar to grouping by the id_cols then summarizing the unused columns using unused_fn.
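
A minimal sketch of id_cols combined with unused_fn, on made-up data (the columns id, key, val, and stamp are assumptions). Because id_cols is supplied explicitly, stamp counts as an unused column and is summarized with max instead of being dropped:

```r
library(tidyr)

df <- tibble(
  id = c(1, 1, 2, 2),
  key = c("a", "b", "a", "b"),
  val = 1:4,
  stamp = c(10, 20, 30, 40)
)

out <- pivot_wider(
  df,
  id_cols = id,       # without this, `stamp` would be treated as an id column
  names_from = key,
  values_from = val,
  unused_fn = max     # keep the latest `stamp` per `id`
)
out
```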

Details

pivot_wider() is an updated approach to spread(), designed to be both simpler to use and to handle more use cases. We recommend you use pivot_wider() for new code; spread() isn't going away but is no longer under active development.

Examples

# See vignette("pivot") for examples and explanation

fish_encounters
fish_encounters %>%
  pivot_wider(names_from = station, values_from = seen)
# Fill in missing values
fish_encounters %>%
  pivot_wider(names_from = station, values_from = seen, values_fill = 0)

# Generate column names from multiple variables
us_rent_income
us_rent_income %>%
  pivot_wider(
    names_from = variable,
    values_from = c(estimate, moe)
  )

# You can control whether `names_from` values vary fastest or slowest
# relative to the `values_from` column names using `names_vary`.
us_rent_income %>%
  pivot_wider(
    names_from = variable,
    values_from = c(estimate, moe),
    names_vary = "slowest"
  )

# When there are multiple `names_from` or `values_from`, you can
# use `names_sep` or `names_glue` to control the output variable names
us_rent_income %>%
  pivot_wider(
    names_from = variable,
    names_sep = ".",
    values_from = c(estimate, moe)
  )
us_rent_income %>%
  pivot_wider(
    names_from = variable,
    names_glue = "{variable}_{.value}",
    values_from = c(estimate, moe)
  )

# Can perform aggregation with `values_fn`
warpbreaks <- as_tibble(warpbreaks[c("wool", "tension", "breaks")])
warpbreaks
warpbreaks %>%
  pivot_wider(
    names_from = wool,
    values_from = breaks,
    values_fn = mean
  )

# Can pass an anonymous function to `values_fn` when you
# need to supply additional arguments
warpbreaks$breaks[1] <- NA
warpbreaks %>%
  pivot_wider(
    names_from = wool,
    values_from = breaks,
    values_fn = ~ mean(.x, na.rm = TRUE)
  )
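
The names_expand argument is not covered by the examples above. As a sketch on made-up data (the columns id, day, and n are assumptions), a factor level that never occurs in the data still becomes a column, and values_fill supplies the missing cells:

```r
library(tidyr)

df <- tibble(
  id = 1:2,
  day = factor(c("Mon", "Wed"), levels = c("Mon", "Tue", "Wed")),
  n = c(5, 7)
)

out <- pivot_wider(
  df,
  names_from = day,
  values_from = n,
  names_expand = TRUE, # create a column for the unobserved level "Tue"
  values_fill = 0
)
out
```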

Pivot data from long to wide using a spec

Description

This is a low level interface to pivoting, inspired by the cdata package, that allows you to describe pivoting with a data frame.

Usage

pivot_wider_spec(
  data,
  spec,
  ...,
  names_repair = "check_unique",
  id_cols = NULL,
  id_expand = FALSE,
  values_fill = NULL,
  values_fn = NULL,
  unused_fn = NULL,
  error_call = current_env()
)

build_wider_spec(
  data,
  ...,
  names_from = name,
  values_from = value,
  names_prefix = "",
  names_sep = "_",
  names_glue = NULL,
  names_sort = FALSE,
  names_vary = "fastest",
  names_expand = FALSE,
  error_call = current_env()
)

Arguments

data

A data frame to pivot.

spec

A specification data frame. This is useful for more complex pivots because it gives you greater control on how metadata stored in the columns become column names in the result.

...

These dots are for future extensions and must be empty.

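names_repair

What happens if the output has invalid column names? The default, "check_unique" is to error if the columns are duplicated. Use "minimal" to allow duplicates in the output, or "unique" to de-duplicate by adding numeric suffixes. See vctrs::vec_as_names() for more options.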

id_cols

<tidy-select> A set of columns that uniquely identifies each observation. Defaults to all columns in data except for the columns specified in spec$.value and the columns of the spec that aren't named .name or .value. Typically used when you have redundant variables, i.e. variables whose values are perfectly correlated with existing variables.

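id_expand

Should the values in the id_cols columns be expanded by expand() before pivoting? This results in more rows; the output will contain a complete expansion of all possible values in id_cols. Implicit factor levels that aren't represented in the data will become explicit. Additionally, the row values corresponding to the expanded id_cols will be sorted.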

values_fill

Optionally, a (scalar) value that specifies what each value should be filled in with when missing.

This can be a named list if you want to apply different fill values to different value columns.

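values_fn

Optionally, a function applied to the value in each cell in the output. You will typically use this when the combination of id_cols and names_from columns does not uniquely identify an observation.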

This can be a named list if you want to apply different aggregations to different values_from columns.

unused_fn

Optionally, a function applied to summarize the values from the unused columns (i.e. columns not identified by id_cols, names_from, or values_from).

The default drops all unused columns from the result.

This can be a named list if you want to apply different aggregations to different unused columns.

id_cols must be supplied for unused_fn to be useful, since otherwise all unspecified columns will be considered id_cols.

This is similar to grouping by the id_cols then summarizing the unused columns using unused_fn.

error_call

names_from, values_from

If values_from contains multiple columns, the name of each value column will be added to the front of the output column name.

names_prefix

String added to the start of every variable name. This is particularly useful if names_from is a numeric vector and you want to create syntactic variable names.

names_sep

If names_from or values_from contains multiple variables, this will be used to join their values together into a single string to use as a column name.

names_glue

Instead of names_sep and names_prefix, you can supply a glue specification that uses the names_from columns (and special .value) to create custom column names.

names_sort

Should the column names be sorted? If FALSE, the default, column names are ordered by first appearance.

names_vary

When names_from identifies a column (or columns) with multiple unique values, and multiple values_from columns are provided, in what order should the resulting column names be combined?

"fastest" varies names_from values fastest, resulting in a column naming scheme of the form: value1_name1, value1_name2, value2_name1, value2_name2. This is the default.
"slowest" varies names_from values slowest, resulting in a column naming scheme of the form: value1_name1, value2_name1, value1_name2, value2_name2.

names_expand

Examples

# See vignette("pivot") for examples and explanation

us_rent_income
spec1 <- us_rent_income %>%
  build_wider_spec(names_from = variable, values_from = c(estimate, moe))
spec1

us_rent_income %>%
  pivot_wider_spec(spec1)

# Is equivalent to
us_rent_income %>%
  pivot_wider(names_from = variable, values_from = c(estimate, moe))

# `pivot_wider_spec()` provides more control over column names and output format
# instead of creating columns with estimate_ and moe_ prefixes,
# keep original variable name for estimates and attach _moe as suffix
spec2 <- tibble(
  .name = c("income", "rent", "income_moe", "rent_moe"),
  .value = c("estimate", "estimate", "moe", "moe"),
  variable = c("income", "rent", "income", "rent")
)

us_rent_income %>%
  pivot_wider_spec(spec2)
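
As a rough sketch of how names_vary affects the generated spec, the two orderings can be compared on the same us_rent_income data. The object names spec_fast and spec_slow are only illustrative.

```r
library(tidyr)

# Default ("fastest"): names_from values vary fastest within each
# values_from column
spec_fast <- build_wider_spec(
  us_rent_income,
  names_from = variable,
  values_from = c(estimate, moe)
)

# "slowest": the values_from columns are kept together for each
# names_from value
spec_slow <- build_wider_spec(
  us_rent_income,
  names_from = variable,
  values_from = c(estimate, moe),
  names_vary = "slowest"
)

# Same set of output names, different order
spec_fast$.name
spec_slow$.name
```

Either spec can then be passed to pivot_wider_spec(); only the column order of the result should differ.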

Objects exported from other packages

Description

These objects are imported from other packages. Follow the links below to see their documentation.

tibble: as_tibble, tibble, tribble
tidyselect: all_of, any_of, contains, ends_with, everything, last_col, matches, num_range, one_of, starts_with

Pew religion and income survey

Description

Pew religion and income survey

Usage

relig_income

Format

A dataset with variables:

religion: Name of religion
<$10k-Don't know/refused: Number of respondents with income range in column name

Source

Downloaded from https://www.pewresearch.org/religion/religious-landscape-study/ (downloaded November 2009)

Replace NAs with specified values

Description

Replace NAs with specified values

Usage

replace_na(data, replace, ...)

Arguments

data

A data frame or vector.

replace

If data is a data frame, replace takes a named list of values, with one value for each column that has missing values to be replaced. Each value in replace will be cast to the type of the column in data that it is being used as a replacement in.

If data is a vector, replace takes a single value. This single value replaces all of the missing values in the vector. replace will be cast to the type of data.

...

Additional arguments for methods. Currently unused.

Value

replace_na() returns an object with the same type as data.

Examples

# Replace NAs in a data frame
df <- tibble(x = c(1, 2, NA), y = c("a", NA, "b"))
df %>% replace_na(list(x = 0, y = "unknown"))

# Replace NAs in a vector
df %>% dplyr::mutate(x = replace_na(x, 0))
# OR
df$x %>% replace_na(0)
df$y %>% replace_na("unknown")

# Replace NULLs in a list: NULLs are the list-col equivalent of NAs
df_list <- tibble(z = list(1:5, NULL, 10:20))
df_list %>% replace_na(list(z = list(5)))

Separate a character column into multiple columns with a regular expression or numeric locations

Description

separate() has been superseded in favour of separate_wider_position() and separate_wider_delim() because the two functions make the two uses more obvious, the API is more polished, and the handling of problems is better. Superseded functions will not go away, but will only receive critical bug fixes.

Given either a regular expression or a vector of character positions, separate() turns a single character column into multiple columns.

Usage

separate(
  data,
  col,
  into,
  sep = "[^[:alnum:]]+",
  remove = TRUE,
  convert = FALSE,
  extra = "warn",
  fill = "warn",
  ...
)

Arguments

data

A data frame.

col

<tidy-select> Column to expand.

into

Names of new variables to create as character vector. Use NA to omit the variable in the output.

sep

Separator between columns.

If character, sep is interpreted as a regular expression. The default value is a regular expression that matches any sequence of non-alphanumeric values.

If numeric, sep is interpreted as character positions to split at. Positive values start at 1 at the far-left of the string; negative values start at -1 at the far-right of the string. The length of sep should be one less than the length of into.

remove

If TRUE, remove input column from output data frame.

convert

If TRUE, will run type.convert() with as.is = TRUE on new columns. This is useful if the component columns are integer, numeric or logical.

NB: this will cause string "NA"s to be converted to NAs.

extra

If sep is a character vector, this controls what happens when there are too many pieces. There are three valid options:

"warn" (the default): emit a warning and drop extra values.
"drop": drop any extra values without a warning.
"merge": only splits at most length(into) times

fill

If sep is a character vector, this controls what happens when there are not enough pieces. There are three valid options:

"warn" (the default): emit a warning and fill from the right
"right": fill with missing values on the right
"left": fill with missing values on the left

...

Additional arguments passed on to methods.

Examples

# If you want to split by any non-alphanumeric value (the default):
df <- tibble(x = c(NA, "x.y", "x.z", "y.z"))
df %>% separate(x, c("A", "B"))

# If you just want the second variable:
df %>% separate(x, c(NA, "B"))

# We now recommend separate_wider_delim() instead:
df %>% separate_wider_delim(x, ".", names = c("A", "B"))
df %>% separate_wider_delim(x, ".", names = c(NA, "B"))

# Controlling uneven splits -------------------------------------------------
# If every row doesn't split into the same number of pieces, use
# the extra and fill arguments to control what happens:
df <- tibble(x = c("x", "x y", "x y z", NA))
df %>% separate(x, c("a", "b"))
# The same behaviour as previous, but drops the extra piece (z) without warnings:
df %>% separate(x, c("a", "b"), extra = "drop", fill = "right")
# Opposite of previous, keeping the extra piece (z) and filling left:
df %>% separate(x, c("a", "b"), extra = "merge", fill = "left")
# Or you can keep all three:
df %>% separate(x, c("a", "b", "c"))

# To only split a specified number of times use extra = "merge":
df <- tibble(x = c("x: 123", "y: error: 7"))
df %>% separate(x, c("key", "value"), ": ", extra = "merge")

# Controlling column types --------------------------------------------------
# convert = TRUE detects column classes:
df <- tibble(x = c("x:1", "x:2", "y:4", "z", NA))
df %>% separate(x, c("key", "value"), ":") %>% str()
df %>% separate(x, c("key", "value"), ":", convert = TRUE) %>% str()
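
The examples above all use a character sep. As a minimal sketch of the numeric form described under sep, the following splits at fixed character positions. The data and the names left and right are made up for illustration.

```r
library(tidyr)

df <- tibble(x = c("abcd", "efgh"))

# Positive position: split after the 2nd character, counting from the left
left <- df %>% separate(x, c("a", "b"), sep = 2)
left

# Negative position: split before the last character, counting from the right
right <- df %>% separate(x, c("a", "b"), sep = -1)
right
```

For new code, separate_wider_position() covers the same use case with named widths.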

Split a string into rows

Description

Each of these functions takes a string and splits it into multiple rows:

separate_longer_delim() splits by a delimiter.
separate_longer_position() splits by a fixed width.

Usage

separate_longer_delim(data, cols, delim, ...)

separate_longer_position(data, cols, width, ..., keep_empty = FALSE)

Arguments

data

A data frame.

cols

<tidy-select> Columns to separate.

delim

For separate_longer_delim(), a string giving the delimiter between values. By default, it is interpreted as a fixed string; use stringr::regex() and friends to split in other ways.

...

These dots are for future extensions and must be empty.

width

For separate_longer_position(), an integer giving the number of characters to split by.

keep_empty

By default, you'll get ceiling(nchar(x) / width) rows for each observation. If nchar(x) is zero, this means the entire input row will be dropped from the output. If you want to preserve all rows, use keep_empty = TRUE to replace size-0 elements with a missing value.

Value

A data frame based on data. It has the same columns, but different rows.

Examples

df <- tibble(id = 1:4, x = c("x", "x y", "x y z", NA))
df %>% separate_longer_delim(x, delim = " ")

# You can separate multiple columns at once if they have the same structure
df <- tibble(id = 1:3, x = c("x", "x y", "x y z"), y = c("a", "a b", "a b c"))
df %>% separate_longer_delim(c(x, y), delim = " ")

# Or instead split by a fixed length
df <- tibble(id = 1:3, x = c("ab", "def", ""))
df %>% separate_longer_position(x, 1)
df %>% separate_longer_position(x, 2)
df %>% separate_longer_position(x, 2, keep_empty = TRUE)
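
The delim argument is a fixed string by default. A small sketch of the regular-expression alternative mentioned under delim, assuming made-up input where values are separated by commas or semicolons with optional spaces:

```r
library(tidyr)

df <- tibble(id = 1:2, x = c("a, b;c", "d ;e"))

# Treat a comma or semicolon, with optional surrounding whitespace,
# as the delimiter
out <- df %>%
  separate_longer_delim(x, delim = stringr::regex("\\s*[,;]\\s*"))
out
```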

Separate a collapsed column into multiple rows

Description

separate_rows() has been superseded in favour of separate_longer_delim() because it has a more consistent API with other separate functions. Superseded functions will not go away, but will only receive critical bug fixes.

If a variable contains observations with multiple delimited values, separate_rows() separates the values and places each one in its own row.

Usage

separate_rows(data, ..., sep = "[^[:alnum:].]+", convert = FALSE)

Arguments

data

A data frame.

...

<tidy-select> Columns to separate across multiple rows

sep

Separator delimiting collapsed values.

convert

If TRUE will automatically run type.convert() on the key column. This is useful if the column types are actually numeric, integer, or logical.

Examples

df <- tibble(
  x = 1:3,
  y = c("a", "d,e,f", "g,h"),
  z = c("1", "2,3,4", "5,6")
)
separate_rows(df, y, z, convert = TRUE)

# Now recommended
df %>%
  separate_longer_delim(c(y, z), delim = ",")

Split a string into columns

Description

Each of these functions takes a string column and splits it into multiple new columns:

separate_wider_delim() splits by delimiter.
separate_wider_position() splits at fixed widths.
separate_wider_regex() splits with regular expression matches.

These functions are equivalent to separate() and extract(), but use stringr as the underlying string manipulation engine, and their interfaces reflect what we've learned from unnest_wider() and unnest_longer().

Usage

separate_wider_delim(
  data,
  cols,
  delim,
  ...,
  names = NULL,
  names_sep = NULL,
  names_repair = "check_unique",
  too_few = c("error", "debug", "align_start", "align_end"),
  too_many = c("error", "debug", "drop", "merge"),
  cols_remove = TRUE
)

separate_wider_position(
  data,
  cols,
  widths,
  ...,
  names_sep = NULL,
  names_repair = "check_unique",
  too_few = c("error", "debug", "align_start"),
  too_many = c("error", "debug", "drop"),
  cols_remove = TRUE
)

separate_wider_regex(
  data,
  cols,
  patterns,
  ...,
  names_sep = NULL,
  names_repair = "check_unique",
  too_few = c("error", "debug", "align_start"),
  cols_remove = TRUE
)

Arguments

data

A data frame.

cols

<tidy-select> Columns to separate.

delim

For separate_wider_delim(), a string giving the delimiter between values. By default, it is interpreted as a fixed string; use stringr::regex() and friends to split in other ways.

...

These dots are for future extensions and must be empty.

names

For separate_wider_delim(), a character vector of output column names. Use NA if there are components that you don't want to appear in the output; the number of non-NA elements determines the number of new columns in the result.

names_sep

If supplied, output names will be composed of the input column name followed by the separator followed by the new column name. Required when cols selects multiple columns.

For separate_wider_delim() you can specify names_sep instead of names, in which case the names will be generated from the source column name, names_sep, and a numeric suffix.

names_repair

Used to check that output data frame has valid names. Must be one of the following options:

"minimal": no name repair or checks, beyond basic existence,
"unique": make sure names are unique and not empty,
"check_unique": (the default), no name repair, but check they are unique,
"universal": make the names unique and syntactic,
a function: apply custom name repair.
tidyr_legacy: use the name repair from tidyr 0.8.
a formula: a purrr-style anonymous function (see rlang::as_function()).

See vctrs::vec_as_names() for more details on these terms and the strategies used to enforce them.

too_few

What should happen if a value separates into too few pieces?

"error", the default, will throw an error.
"debug" adds additional columns to the output to help you locate and resolve the underlying problem. This option is intended to help you debug and address the issue, and should not generally remain in your final code.
"align_start" aligns starts of short matches, adding NA on the end to pad to the correct length.
"align_end" (separate_wider_delim() only) aligns the ends of short matches, adding NA at the start to pad to the correct length.

too_many

What should happen if a value separates into too many pieces?

"error", the default, will throw an error.
"debug" will add additional columns to the output to help you locate and resolve the underlying problem.
"drop" will silently drop any extra pieces.
"merge" (separate_wider_delim() only) will merge together any additional pieces.

cols_remove

Should the input cols be removed from the output? Always FALSE if too_few or too_many are set to "debug".

widths

A named numeric vector where the names become column names, and the values specify the column width. Unnamed components will match, but not be included in the output.

patterns

A named character vector where the names become column names and the values are regular expressions that match the contents of the vector. Unnamed components will match, but not be included in the output.

Value

A data frame based on data. It has the same rows, but different columns:

The primary purpose of the functions is to create new columns from components of the string. For separate_wider_delim() the names of new columns come from names. For separate_wider_position() the names come from the names of widths. For separate_wider_regex() the names come from the names of patterns.
If too_few or too_many is "debug", the output will contain additional columns useful for debugging:
- {col}_ok: a logical vector which tells you if the input was ok or not. Use to quickly find the problematic rows.
- {col}_remainder: any text remaining after separation.
- {col}_pieces, {col}_width, {col}_matches: number of pieces, number of characters, and number of matches for separate_wider_delim(), separate_wider_position() and separate_wider_regex() respectively.
If cols_remove = TRUE (the default), the input cols will be removed from the output.

Examples

df <- tibble(id = 1:3, x = c("m-123", "f-455", "f-123"))
# There are three basic ways to split up a string into pieces:
# 1. with a delimiter
df %>% separate_wider_delim(x, delim = "-", names = c("gender", "unit"))
# 2. by length
df %>% separate_wider_position(x, c(gender = 1, 1, unit = 3))
# 3. defining each component with a regular expression
df %>% separate_wider_regex(x, c(gender = ".", ".", unit = "\\d+"))

# Sometimes you split on the "last" delimiter
df <- tibble(var = c("race_1", "race_2", "age_bucket_1", "age_bucket_2"))
# _delim won't help because it always splits on the first delimiter
try(df %>% separate_wider_delim(var, "_", names = c("var1", "var2")))
df %>% separate_wider_delim(var, "_", names = c("var1", "var2"), too_many = "merge")
# Instead, you can use _regex
df %>% separate_wider_regex(var, c(var1 = ".*", "_", var2 = ".*"))
# this works because * is greedy; you can mimic the _delim behaviour with .*?
df %>% separate_wider_regex(var, c(var1 = ".*?", "_", var2 = ".*"))

# If the number of components varies, it's most natural to split into rows
df <- tibble(id = 1:4, x = c("x", "x y", "x y z", NA))
df %>% separate_longer_delim(x, delim = " ")
# But separate_wider_delim() provides some tools to deal with the problem
# The default behaviour tells you that there's a problem
try(df %>% separate_wider_delim(x, delim = " ", names = c("a", "b")))
# You can get additional insight by using the debug options
df %>%
  separate_wider_delim(
    x,
    delim = " ",
    names = c("a", "b"),
    too_few = "debug",
    too_many = "debug"
  )

# But you can suppress the errors
df %>%
  separate_wider_delim(
    x,
    delim = " ",
    names = c("a", "b"),
    too_few = "align_start",
    too_many = "merge"
  )

# Or choose to automatically name the columns, producing as many as needed
df %>% separate_wider_delim(x, delim = " ", names_sep = "", too_few = "align_start")
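
The examples above pad short rows on the right. As a sketch of the "align_end" option described under too_few, the following pads on the left instead. The input and column names are illustrative only.

```r
library(tidyr)

df <- tibble(x = c("a b c", "b c", "c"))

# Short rows are aligned to the last column; missing leading pieces become NA
out <- df %>%
  separate_wider_delim(
    x,
    delim = " ",
    names = c("first", "second", "third"),
    too_few = "align_end"
  )
out
```

This can be useful when the final component is always present but leading components are optional.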

Some data about the Smith family

Description

A small demo dataset describing John and Mary Smith.

Usage

smiths

Format

A data frame with 2 rows and 5 columns.

Spread a key-value pair across multiple columns

Description

Development on spread() is complete, and for new code we recommend switching to pivot_wider(), which is easier to use, more featureful, and still under active development. df %>% spread(key, value) is equivalent to df %>% pivot_wider(names_from = key, values_from = value)

See more details in vignette("pivot").

Usage

spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE, sep = NULL)

Arguments

data

A data frame.

key, value

<tidy-select> Columns to use for key and value.

fill

If set, missing values will be replaced with this value. Note that there are two types of missingness in the input: explicit missing values (i.e. NA), and implicit missings, rows that simply aren't present. Both types of missing value will be replaced by fill.

convert

If TRUE, type.convert() with as.is = TRUE will be run on each of the new columns. This is useful if the value column was a mix of variables that was coerced to a string. If the class of the value column was factor or date, note that will not be true of the new columns that are produced, which are coerced to character before type conversion.

drop

If FALSE, will keep factor levels that don't appear in the data, filling in missing combinations with fill.

sep

If NULL, the column names will be taken from the values of key variable. If non-NULL, the column names will be given by "<key_name><sep><key_value>".

Examples

stocks <- tibble(
  time = as.Date("2009-01-01") + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)
stocksm <- stocks %>% gather(stock, price, -time)
stocksm %>% spread(stock, price)
stocksm %>% spread(time, price)

# Spread and gather are complements
df <- tibble(x = c("a", "b"), y = c(3, 4), z = c(5, 6))
df %>%
  spread(x, y) %>%
  gather("x", "y", a:b, na.rm = TRUE)

# Use 'convert = TRUE' to produce variables of mixed type
df <- tibble(
  row = rep(c(1, 51), each = 3),
  var = rep(c("Sepal.Length", "Species", "Species_num"), 2),
  value = c(5.1, "setosa", 1, 7.0, "versicolor", 2)
)
df %>% spread(var, value) %>% str()
df %>% spread(var, value, convert = TRUE) %>% str()
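
To illustrate the equivalence with pivot_wider() stated in the description, here is a minimal sketch on made-up data where the key values already appear in sorted order, so both calls should produce the same column order:

```r
library(tidyr)

df <- tibble(
  id = c(1, 1, 2, 2),
  key = c("a", "b", "a", "b"),
  value = c(10, 20, 30, 40)
)

# Superseded interface
wide_old <- df %>% spread(key, value)

# Recommended replacement
wide_new <- df %>% pivot_wider(names_from = key, values_from = value)

wide_old
wide_new
```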

Example tabular representations

Description

Data sets that demonstrate multiple ways to layout the same tabular data.

Usage

table1

table2

table3

table4a

table4b

table5

Details

table1, table2, table3, table4a, table4b, and table5 all display the number of TB cases documented by the World Health Organization in Afghanistan, Brazil, and China between 1999 and 2000. The data contains values associated with four variables (country, year, cases, and population), but each table organizes the values in a different layout.

The data is a subset of the data contained in the World Health Organization Global Tuberculosis Report

Source

https://www.who.int/teams/global-tuberculosis-programme/data

Argument type: data-masking

Description

This page describes the <data-masking> argument modifier which indicates that the argument uses data masking, a sub-type of tidy evaluation. If you've never heard of tidy evaluation before, start with the practical introduction in https://r4ds.hadley.nz/functions.html#data-frame-functions then read more about the underlying theory in https://rlang.r-lib.org/reference/topic-data-mask.html.

Key techniques

To allow the user to supply the column name in a function argument, embrace the argument, e.g. filter(df, {{ var }}).

dist_summary <- function(df, var) {
  df %>%
    summarise(n = n(), min = min({{ var }}), max = max({{ var }}))
}
mtcars %>% dist_summary(mpg)
mtcars %>% group_by(cyl) %>% dist_summary(mpg)

To work with a column name recorded as a string, use the .data pronoun, e.g. summarise(df, mean = mean(.data[[var]])).

for (var in names(mtcars)) {
  mtcars %>% count(.data[[var]]) %>% print()
}

lapply(names(mtcars), function(var) mtcars %>% count(.data[[var]]))

To suppress R CMD check NOTEs about unknown variables use .data$var instead of var:
```
# has NOTE
df %>% mutate(z = x + y)

# no NOTE
df %>% mutate(z = .data$x + .data$y)
```
You'll also need to import .data from rlang with (e.g.) @importFrom rlang .data.

Dot-dot-dot (...)

... automatically provides indirection, so you can use it as is (i.e. without embracing) inside a function:

grouped_mean <- function(df, var, ...) {
  df %>%
    group_by(...) %>%
    summarise(mean = mean({{ var }}))
}

You can also use := instead of = to enable a glue-like syntax for creating variables from user supplied data:

var_name <- "l100km"
mtcars %>% mutate("{var_name}" := 235 / mpg)

summarise_mean <- function(df, var) {
  df %>%
    summarise("mean_of_{{var}}" := mean({{ var }}))
}
mtcars %>% group_by(cyl) %>% summarise_mean(mpg)

Learn more in https://rlang.r-lib.org/reference/topic-data-mask-programming.html.

Legacy name repair

Description

Ensures all column names are unique using the approach found in tidyr 0.8.3 and earlier. Only use this function if you want to preserve the naming strategy, otherwise you're better off adopting the new tidyverse standard with names_repair = "universal".

Usage

tidyr_legacy(nms, prefix = "V", sep = "")

Arguments

nms

Character vector of names

prefix

Prefix to use for unnamed column

sep

Separator to use between name and unique suffix

Examples

df <- tibble(x = 1:2, y = list(tibble(x = 3:5), tibble(x = 4:7)))

# Doesn't work because it would produce a data frame with two
# columns called x
## Not run: 
unnest(df, y)

## End(Not run)

# The new tidyverse standard:
unnest(df, y, names_repair = "universal")

# The old tidyr approach
unnest(df, y, names_repair = tidyr_legacy)
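
tidyr_legacy() can also be called directly on a character vector, which may help when checking what names it would produce. A small sketch, using a made-up vector with a duplicate and an empty name:

```r
library(tidyr)

nms <- c("x", "x", "")

# Duplicates get a unique suffix; empty names are replaced using `prefix`
fixed <- tidyr_legacy(nms)
fixed
```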

Argument type: tidy-select

Description

This page describes the <tidy-select> argument modifier which indicates that the argument uses tidy selection, a sub-type of tidy evaluation. If you've never heard of tidy evaluation before, start with the practical introduction in https://r4ds.hadley.nz/functions.html#data-frame-functions then read more about the underlying theory in https://rlang.r-lib.org/reference/topic-data-mask.html.

Overview of selection features

tidyselect implements a DSL for selecting variables. It provides helpers for selecting variables:

var1:var10: variables lying between var1 on the left and var10 on the right.

starts_with("a"): names that start with "a".
ends_with("z"): names that end with "z".
contains("b"): names that contain "b".
matches("x.y"): names that match regular expression x.y.
num_range(x, 1:4): names following the pattern, x1, x2, ..., x4.
all_of(vars)/any_of(vars): matches names stored in the character vector vars. all_of(vars) will error if the variables aren't present; any_of(vars) will match just the variables that exist.
everything(): all variables.
last_col(): furthest column on the right.
where(is.numeric): all variables where is.numeric() returns TRUE.

As well as operators for combining those selections:

!selection: only variables that don't match selection.
selection1 & selection2: only variables included in both selection1 and selection2.
selection1 | selection2: all variables that match either selection1 or selection2.

Key techniques

If you want the user to supply a tidyselect specification in a function argument, you need to tunnel the selection through the function argument. This is done by embracing the function argument {{ }}, e.g. unnest(df, {{ vars }}).
If you have a character vector of column names, use all_of() or any_of(), depending on whether or not you want unknown variable names to cause an error, e.g. unnest(df, all_of(vars)), unnest(df, !any_of(vars)).
To suppress R CMD check NOTEs about unknown variables use "var" instead of var:

# has NOTE
df %>% select(x, y, z)

# no NOTE
df %>% select("x", "y", "z")
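
The first two techniques can be sketched together as follows. The helper longer_by() and the example data are hypothetical and only serve to show embracing and all_of() side by side.

```r
library(tidyr)

# Hypothetical helper: tunnels a user-supplied selection into pivot_longer()
longer_by <- function(df, cols) {
  pivot_longer(df, {{ cols }}, names_to = "name", values_to = "value")
}

df <- tibble(id = 1:2, a = c(1, 2), b = c(3, 4))

# Selection written by the caller as a tidyselect expression
by_expr <- longer_by(df, a:b)

# Selection stored as a character vector of column names
vars <- c("a", "b")
by_chr <- pivot_longer(df, all_of(vars), names_to = "name", values_to = "value")

by_expr
by_chr
```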

"Uncount" a data frame

Description

Performs the opposite operation to dplyr::count(), duplicating rows according to a weighting variable (or expression).

Usage

uncount(data, weights, ..., .remove = TRUE, .id = NULL)

Arguments

data

A data frame, tibble, or grouped tibble.

weights

A vector of weights. Evaluated in the context of data; supports quasiquotation.

...

Additional arguments passed on to methods.

.remove

If TRUE, and weights is the name of a column in data, then this column is removed.

.id

Supply a string to create a new variable which gives a unique identifier for each created row.

Examples

df <- tibble(x = c("a", "b"), n = c(1, 2))
uncount(df, n)
uncount(df, n, .id = "id")

# You can also use constants
uncount(df, 2)

# Or expressions
uncount(df, 2 / n)
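
Since uncount() is described as the opposite of dplyr::count(), a quick round trip can serve as a sanity check. This is only a sketch on the same small data frame as above:

```r
library(tidyr)

df <- tibble(x = c("a", "b"), n = c(1, 2))

# Expand to one row per unit of weight
expanded <- uncount(df, n)

# Counting the expanded rows recovers the original weights
recounted <- dplyr::count(expanded, x)
recounted
```

Note that count() returns integer counts, so the recovered n column may differ in type from the original.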

Unite multiple columns into one by pasting strings together

Description

Convenience function to paste together multiple columns into one.

Usage

unite(data, col, ..., sep = "_", remove = TRUE, na.rm = FALSE)

Arguments

data

A data frame.

col

The name of the new column, as a string or symbol.

...

<tidy-select> Columns to unite

sep

Separator to use between values.

remove

If TRUE, remove input columns from output data frame.

na.rm

If TRUE, missing values will be removed prior to uniting each value.

Examples

df <- expand_grid(x = c("a", NA), y = c("b", NA))
df

df %>% unite("z", x:y, remove = FALSE)
# To remove missing values:
df %>% unite("z", x:y, na.rm = TRUE, remove = FALSE)

# Separate is almost the complement of unite
df %>%
  unite("xy", x:y) %>%
  separate(xy, c("x", "y"))
# (but note `x` and `y` now contain "NA" not NA)

Unnest a list-column of data frames into rows and columns

Description

Unnest expands a list-column containing data frames into rows and columns.

Usage

unnest(
  data,
  cols,
  ...,
  keep_empty = FALSE,
  ptype = NULL,
  names_sep = NULL,
  names_repair = "check_unique",
  .drop = deprecated(),
  .id = deprecated(),
  .sep = deprecated(),
  .preserve = deprecated()
)

Arguments

data

A data frame.

cols

<tidy-select> List-columns to unnest.

When selecting multiple columns, values from the same row will be recycled to their common size.

...

Deprecated: previously you could write df %>% unnest(x, y, z). Convert to df %>% unnest(c(x, y, z)). If you previously created a new variable in unnest() you'll now need to do it explicitly with mutate(). Convert df %>% unnest(y = fun(x, y, z)) to df %>% mutate(y = fun(x, y, z)) %>% unnest(y).

keep_empty

ptype

names_sep

If NULL, the default, the outer names will come from the inner names. If a string, the outer names will be formed by pasting together the outer and the inner column names, separated by names_sep.

names_repair

Used to check that output data frame has valid names. Must be one of the following options:

"minimal": no name repair or checks, beyond basic existence,
"unique": make sure names are unique and not empty,
"check_unique": (the default), no name repair, but check they are unique,
"universal": make the names unique and syntactic,
a function: apply custom name repair.
tidyr_legacy: use the name repair from tidyr 0.8.
a formula: a purrr-style anonymous function (see rlang::as_function()).

See vctrs::vec_as_names() for more details on these terms and the strategies used to enforce them.

.drop, .preserve

Deprecated: all list-columns are now preserved. If there are any that you don't want in the output, use select() to remove them prior to unnesting.

.id

Deprecated: convert df %>% unnest(x, .id = "id") to df %>% mutate(id = names(x)) %>% unnest(x).

.sep

Deprecated: use names_sep instead.

New syntax

tidyr 1.0.0 introduced a new syntax for nest() and unnest(). If you just need to run an old analysis, you can revert to the previous behaviour using nest_legacy() and unnest_legacy() as follows:

library(tidyr)
nest <- nest_legacy
unnest <- unnest_legacy

Examples

# unnest() is designed to work with lists of data frames
df <- tibble(
  x = 1:3,
  y = list(
    NULL,
    tibble(a = 1, b = 2),
    tibble(a = 1:3, b = 3:1, c = 4)
  )
)
# unnest() recycles input rows for each row of the list-column
# and adds a column for each column
df %>% unnest(y)

# input rows with 0 rows in the list-column will usually disappear,
# but you can keep them (generating NAs) with keep_empty = TRUE:
df %>% unnest(y, keep_empty = TRUE)

# Multiple columns ----------------------------------------------------------
# You can unnest multiple columns simultaneously
df <- tibble(
  x = 1:2,
  y = list(
    tibble(a = 1, b = 2),
    tibble(a = 3:4, b = 5:6)
  ),
  z = list(
    tibble(c = 1, d = 2),
    tibble(c = 3:4, d = 5:6)
  )
)
df %>% unnest(c(y, z))

# Compare with unnesting one column at a time, which generates
# the Cartesian product
df %>%
  unnest(y) %>%
  unnest(z)
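
The names_sep argument described above has no example in this section. As a minimal sketch on illustrative data, it prefixes each inner column name with the name of the list-column:

```r
library(tidyr)

df <- tibble(
  x = 1:2,
  y = list(
    tibble(a = 1, b = 2),
    tibble(a = 3:4, b = 5:6)
  )
)

# Inner columns a and b become y_a and y_b
out <- df %>% unnest(y, names_sep = "_")
names(out)
```

This is one way to avoid name clashes between outer and inner columns without relying on names_repair.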

Automatically call `unnest_wider()` or `unnest_longer()`

Description

unnest_auto() picks between unnest_wider() or unnest_longer() by inspecting the inner names of the list-col:

If all elements are unnamed, it uses unnest_longer(indices_include = FALSE).
If all elements are named, and there's at least one name in common across all components, it uses unnest_wider().
Otherwise, it falls back to unnest_longer(indices_include = TRUE).

It's handy for very rapid interactive exploration but I don't recommend using it in scripts, because it will succeed even if the underlying data radically changes.

Usage

unnest_auto(data, col)

Arguments

data

A data frame.

col

<tidy-select> List-column to unnest.
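
As a rough illustration of the first two rules in the description, the following sketch applies unnest_auto() to one unnamed and one consistently named list-column. The data frames are made up for the example.

```r
library(tidyr)

# Unnamed elements: handled like unnest_longer(indices_include = FALSE)
df_unnamed <- tibble(x = 1:2, y = list(1:2, 3:5))
long <- unnest_auto(df_unnamed, y)
long

# Named elements with names in common: handled like unnest_wider()
df_named <- tibble(
  x = 1:2,
  y = list(list(a = 1, b = 2), list(a = 3, b = 4))
)
wide <- unnest_auto(df_named, y)
wide
```

Each call also prints a message naming the function it chose, which can be copied into a script in place of unnest_auto().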

Unnest a list-column into rows

Description

unnest_longer() turns each element of a list-column into a row. It is most naturally suited to list-columns where the elements are unnamed and the length of each element varies from row to row.

unnest_longer() generally preserves the number of columns of data while modifying the number of rows.

Learn more in vignette("rectangle").

Usage

unnest_longer(
  data,
  col,
  values_to = NULL,
  indices_to = NULL,
  indices_include = NULL,
  keep_empty = FALSE,
  names_repair = "check_unique",
  simplify = TRUE,
  ptype = NULL,
  transform = NULL
)

Arguments

data

A data frame.

col

<tidy-select> List-column(s) to unnest.

When selecting multiple columns, values from the same row will be recycled to their common size.

values_to

A string giving the column name (or names) to store the unnested values in. If multiple columns are specified in col, this can also be a glue string containing "{col}" to provide a template for the column names. The default, NULL, gives the output columns the same names as the input columns.

indices_to

A string giving the column name (or names) to store the inner names or positions (if not named) of the values. If multiple columns are specified in col, this can also be a glue string containing "{col}" to provide a template for the column names. The default, NULL, gives the output columns the same names as values_to, but suffixed with "_id".

indices_include

A single logical value specifying whether or not to add an index column. If any value has inner names, the index column will be a character vector of those names, otherwise it will be an integer vector of positions. If NULL, defaults to TRUE if any value has inner names or if indices_to is provided.

If indices_to is provided, then indices_include can't be FALSE.

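keep_empty

By default, you get one row of output for each element of the list that you are unnesting. This means that a size-0 element (like NULL or an empty vector) causes its entire row to be dropped from the output. To preserve all rows, use keep_empty = TRUE to replace size-0 elements with a single row of missing values.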

names_repair

Used to check that output data frame has valid names. Must be one of the following options:

"minimal": no name repair or checks, beyond basic existence,
"unique": make sure names are unique and not empty,
"check_unique": (the default), no name repair, but check they are unique,
"universal": make the names unique and syntactic,
a function: apply custom name repair,
"tidyr_legacy": use the name repair from tidyr 0.8,
a formula: a purrr-style anonymous function (see rlang::as_function()).

See vctrs::vec_as_names() for more details on these terms and the strategies used to enforce them.

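simplify

If TRUE, will attempt to simplify lists of length-1 vectors to an atomic vector. Can also be a named list containing TRUE or FALSE declaring whether or not to attempt to simplify a particular column. If a named list is provided, the default for any unspecified columns is TRUE.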

ptype

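Optionally, a named list of prototypes declaring the desired output type of each component. Alternatively, a single empty prototype can be supplied, which will be applied to all components. Use this argument if you want to check that each element has the type you expect when simplifying.

If a ptype has been specified, but simplify = FALSE or simplification isn't possible, then a list-of column will be returned and each element will have type ptype.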

transform

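Optionally, a named list of transformation functions applied to each component. Alternatively, a single function can be supplied, which will be applied to all components. Use this argument if you want to transform or parse individual elements as they are extracted.

When both ptype and transform are supplied, the transform is applied before the ptype.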
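
The "{col}" templates accepted by values_to and indices_to, and the effect of indices_include on unnamed elements, can be sketched as follows. The data frames and the "_value" and "_name" suffixes are hypothetical choices made for this illustration.

```r
library(tidyr)

# Hypothetical input: two aligned list-columns with named elements
df <- tibble(
  x = 1:2,
  y = list(c(a = 1, b = 2), c(a = 3, b = 4)),
  z = list(c(a = 5, b = 6), c(a = 7, b = 8))
)

# "{col}" is replaced by the name of each selected column in turn
out <- df %>%
  unnest_longer(
    c(y, z),
    values_to = "{col}_value",
    indices_to = "{col}_name"
  )
names(out)

# Unnamed elements get no index column by default, so request the
# integer positions explicitly
pos <- tibble(x = 1:2, y = list(c(10, 20), 30)) %>%
  unnest_longer(y, indices_include = TRUE)
pos$y_id
```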

Examples

# `unnest_longer()` is useful when each component of the list should
# form a row
df <- tibble(
  x = 1:4,
  y = list(NULL, 1:3, 4:5, integer())
)
df %>% unnest_longer(y)

# Note that empty values like `NULL` and `integer()` are dropped by
# default. If you'd like to keep them, set `keep_empty = TRUE`.
df %>% unnest_longer(y, keep_empty = TRUE)

# If the inner vectors are named, the names are copied to an `_id` column
df <- tibble(
  x = 1:2,
  y = list(c(a = 1, b = 2), c(a = 10, b = 11, c = 12))
)
df %>% unnest_longer(y)

# Multiple columns ----------------------------------------------------------
# If columns are aligned, you can unnest simultaneously
df <- tibble(
  x = 1:2,
  y = list(1:2, 3:4),
  z = list(5:6, 7:8)
)
df %>%
  unnest_longer(c(y, z))

# This is important because sequential unnesting would generate the
# Cartesian product of the rows
df %>%
  unnest_longer(y) %>%
  unnest_longer(z)

Unnest a list-column into columns

Description

unnest_wider() turns each element of a list-column into a column. It is most naturally suited to list-columns where every element is named, and the names are consistent from row to row. unnest_wider() preserves the rows of data while modifying the columns.

Learn more in vignette("rectangle").

Usage

unnest_wider(
  data,
  col,
  names_sep = NULL,
  simplify = TRUE,
  strict = FALSE,
  names_repair = "check_unique",
  ptype = NULL,
  transform = NULL
)

Arguments

data

A data frame.

col

<tidy-select> List-column(s) to unnest.

When selecting multiple columns, values from the same row will be recycled to their common size.

names_sep

If NULL, the default, the names will be left as is. If a string, the outer and inner names will be pasted together using names_sep as a separator.

If any values being unnested are unnamed, then names_sep must be supplied, otherwise an error is thrown. When names_sep is supplied, names are automatically generated for unnamed values as an increasing sequence of integers.

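simplify

If TRUE, will attempt to simplify lists of length-1 vectors to an atomic vector. Can also be a named list containing TRUE or FALSE declaring whether or not to attempt to simplify a particular column. If a named list is provided, the default for any unspecified columns is TRUE.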

strict

A single logical specifying whether or not to apply strict vctrs typing rules. If FALSE, typed empty values (like list() or integer()) nested within list-columns will be treated like NULL and will not contribute to the type of the unnested column. This is useful when working with JSON, where empty values tend to lose their type information and show up as list().

names_repair

Used to check that output data frame has valid names. Must be one of the following options:

"minimal": no name repair or checks, beyond basic existence,
"unique": make sure names are unique and not empty,
"check_unique": (the default), no name repair, but check they are unique,
"universal": make the names unique and syntactic,
a function: apply custom name repair,
"tidyr_legacy": use the name repair from tidyr 0.8,
a formula: a purrr-style anonymous function (see rlang::as_function()).

See vctrs::vec_as_names() for more details on these terms and the strategies used to enforce them.

ptype

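Optionally, a named list of prototypes declaring the desired output type of each component. Alternatively, a single empty prototype can be supplied, which will be applied to all components. Use this argument if you want to check that each element has the type you expect when simplifying.

If a ptype has been specified, but simplify = FALSE or simplification isn't possible, then a list-of column will be returned and each element will have type ptype.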

transform

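Optionally, a named list of transformation functions applied to each component. Alternatively, a single function can be supplied, which will be applied to all components. Use this argument if you want to transform or parse individual elements as they are extracted.

When both ptype and transform are supplied, the transform is applied before the ptype.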
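
The ordering of transform and ptype can be illustrated with a small sketch. It assumes that both arguments accept a named list keyed by component name. The data frame df, with every field stored as a string, is a hypothetical input made up for this illustration.

```r
library(tidyr)

# Hypothetical records in which every field arrived as a string
df <- tibble(
  id = 1:2,
  meta = list(
    list(n = "1", ok = "TRUE"),
    list(n = "2", ok = "FALSE")
  )
)

# Parse each component as it is extracted
parsed <- df %>%
  unnest_wider(meta, transform = list(n = as.integer, ok = as.logical))

# The transform runs before the ptype, so the integer prototype is
# checked against values that have already been parsed
checked <- df %>%
  unnest_wider(
    meta,
    transform = list(n = as.integer),
    ptype = list(n = integer())
  )
```

Components without an entry in either list, such as ok in the second call, are left as extracted.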

Examples

df <- tibble(
  character = c("Toothless", "Dory"),
  metadata = list(
    list(
      species = "dragon",
      color = "black",
      films = c(
        "How to Train Your Dragon",
        "How to Train Your Dragon 2",
        "How to Train Your Dragon: The Hidden World"
      )
    ),
    list(
      species = "blue tang",
      color = "blue",
      films = c("Finding Nemo", "Finding Dory")
    )
  )
)
df

# Turn all components of metadata into columns
df %>% unnest_wider(metadata)

# Choose not to simplify list-cols of length-1 elements
df %>% unnest_wider(metadata, simplify = FALSE)
df %>% unnest_wider(metadata, simplify = list(color = FALSE))

# You can also widen unnamed list-cols:
df <- tibble(
  x = 1:3,
  y = list(NULL, 1:3, 4:5)
)
# but you must supply `names_sep` to do so, which generates automatic names:
df %>% unnest_wider(y, names_sep = "_")

# 0-length elements ---------------------------------------------------------
# The defaults of `unnest_wider()` treat empty types (like `list()`) as `NULL`.
json <- list(
  list(x = 1:2, y = 1:2),
  list(x = list(), y = 3:4),
  list(x = 3L, y = list())
)

df <- tibble(json = json)
df %>%
  unnest_wider(json)

# To instead enforce strict vctrs typing rules, use `strict`
df %>%
  unnest_wider(json, strict = TRUE)
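
names_sep is required for unnamed elements, but it can also be supplied for named ones, in which case the outer column name is pasted in front of each inner name. A minimal sketch, using a hypothetical data frame with a size column:

```r
library(tidyr)

# Hypothetical input: each element of `size` has inner names w and h
df <- tibble(
  id = 1:2,
  size = list(c(w = 1, h = 2), c(w = 3, h = 4))
)

# Default: inner names are used as is
plain <- df %>% unnest_wider(size)
names(plain)

# With names_sep: outer and inner names are pasted together
prefixed <- df %>% unnest_wider(size, names_sep = "_")
names(prefixed)
```

Prefixing is one way to avoid name clashes when an inner name duplicates an existing column.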

US rent and income data

Description

Captured from the 2017 American Community Survey using the tidycensus package.

Usage

us_rent_income

Format

A dataset with variables:

GEOID: FIPS state identifier
NAME: Name of state
variable: Variable name: income = median yearly income, rent = median monthly rent
estimate: Estimated value
moe: 90% margin of error
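
Because variable holds two measures per state, the data can be widened so that each state occupies a single row. The following is a sketch of one way to do that with pivot_wider(). The object name wide is chosen for this illustration.

```r
library(tidyr)

# Spread the income and rent measures into separate columns, keeping
# both the estimate and its margin of error
wide <- us_rent_income %>%
  pivot_wider(
    names_from = variable,
    values_from = c(estimate, moe)
  )
names(wide)
```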

World Health Organization TB data

Description

A subset of data from the World Health Organization Global Tuberculosis Report, and accompanying global populations. who uses the original codes from the World Health Organization. The column names for columns 5 through 60 are made by combining new_ with:

the method of diagnosis (rel = relapse, sn = negative pulmonary smear, sp = positive pulmonary smear, ep = extrapulmonary),
gender (f = female, m = male), and
age group (014 = 0-14 yrs of age, 1524 = 15-24, 2534 = 25-34, 3544 = 35-44 years of age, 4554 = 45-54, 5564 = 55-64, 65 = 65 years or older).

who2 is a lightly modified version that makes teaching the basics easier by tweaking the variables to be slightly more consistent and dropping iso2 and iso3. newrel is replaced by new_rel, and a _ is added after the gender.

Usage

who

who2

population

Format

`who`

A data frame with 7,240 rows and 60 columns:

country: Country name
iso2, iso3: 2 & 3 letter ISO country codes
year: Year
new_sp_m014 - newrel_f65: Counts of new TB cases recorded by group. Column names encode three variables that describe the group.

`who2`

A data frame with 7,240 rows and 58 columns.

`population`

A data frame with 4,060 rows and three columns:

country: Country name
year: Year
population: Population

Source

https://www.who.int/teams/global-tuberculosis-programme/data
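
The three variables encoded in the who column names can be separated with pivot_longer(). The sketch below assumes that the count columns of who run from new_sp_m014 to newrel_f65 (the relapse columns lacking the underscore that who2 adds). The regular expression and the output names diagnosis, gender, age and count are choices made for this illustration.

```r
library(tidyr)

long <- who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65,
    names_to = c("diagnosis", "gender", "age"),
    # "new", an optional "_", the diagnosis code, "_", a one-letter
    # gender code, then the age group
    names_pattern = "new_?(.*)_(.)(.*)",
    values_to = "count"
  )

unique(long$diagnosis)
unique(long$gender)
```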

Population data from the World Bank

Description

Data about population from the World Bank.

Usage

world_bank_pop

Format

A dataset with variables:

country: Three letter country code
indicator: Indicator name: SP.POP.GROW = population growth, SP.POP.TOTL = total population, SP.URB.GROW = urban population growth, SP.URB.TOTL = total urban population
2000-2018: Value for each year

Source

Dataset from the World Bank data bank: https://data.worldbank.org
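
Since each year is stored in its own column, a common first step is to gather the year columns into a single variable. A minimal sketch follows. It selects every column except country and indicator rather than naming the year range, and the output names year and value are choices made for this illustration.

```r
library(tidyr)

long <- world_bank_pop %>%
  pivot_longer(
    cols = !c(country, indicator),
    names_to = "year",
    values_to = "value"
  )

# The year is taken from the column names, so it is stored as character
class(long$year)
```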
