Type: Package
Title: A 'dplyr' Back End for Databases
Version: 2.5.0
Description: A 'dplyr' back end for databases that allows you to work with remote database tables as if they are in-memory data frames. Basic features work with any database that has a 'DBI' back end; more advanced features require 'SQL' translation to be provided by the package author.
License: MIT + file LICENSE
URL: https://dbplyr.tidyverse.org/, https://github.com/tidyverse/dbplyr
BugReports: https://github.com/tidyverse/dbplyr/issues
Depends: R (≥ 3.6)
Imports: blob (≥ 1.2.0), cli (≥ 3.6.1), DBI (≥ 1.1.3), dplyr (≥ 1.1.2), glue (≥ 1.6.2), lifecycle (≥ 1.0.3), magrittr, methods, pillar (≥ 1.9.0), purrr (≥ 1.0.1), R6 (≥ 2.2.2), rlang (≥ 1.1.1), tibble (≥ 3.2.1), tidyr (≥ 1.3.0), tidyselect (≥ 1.2.1), utils, vctrs (≥ 0.6.3), withr (≥ 2.5.0)
Suggests: bit64, covr, knitr, Lahman, nycflights13, odbc (≥ 1.4.2), RMariaDB (≥ 1.2.2), rmarkdown, RPostgres (≥ 1.4.5), RPostgreSQL, RSQLite (≥ 2.3.1), testthat (≥ 3.1.10)
VignetteBuilder: knitr
Config/Needs/website: tidyverse/tidytemplate
Config/testthat/edition: 3
Config/testthat/parallel: TRUE
Encoding: UTF-8
Language: en-gb
RoxygenNote: 7.3.1
Collate: 'db-sql.R' 'utils-check.R' 'import-standalone-types-check.R' 'import-standalone-obj-type.R' 'utils.R' 'sql.R' 'escape.R' 'translate-sql-cut.R' 'translate-sql-quantile.R' 'translate-sql-string.R' 'translate-sql-paste.R' 'translate-sql-helpers.R' 'translate-sql-window.R' 'translate-sql-conditional.R' 'backend-.R' 'backend-access.R' 'backend-hana.R' 'backend-hive.R' 'backend-impala.R' 'verb-copy-to.R' 'backend-mssql.R' 'backend-mysql.R' 'backend-odbc.R' 'backend-oracle.R' 'backend-postgres.R' 'backend-postgres-old.R' 'backend-redshift.R' 'backend-snowflake.R' 'backend-spark-sql.R' 'backend-sqlite.R' 'backend-teradata.R' 'build-sql.R' 'data-cache.R' 'data-lahman.R' 'data-nycflights13.R' 'db-escape.R' 'db-io.R' 'db.R' 'dbplyr.R' 'explain.R' 'ident.R' 'import-standalone-s3-register.R' 'join-by-compat.R' 'join-cols-compat.R' 'lazy-join-query.R' 'lazy-ops.R' 'lazy-query.R' 'lazy-select-query.R' 'lazy-set-op-query.R' 'memdb.R' 'optimise-utils.R' 'pillar.R' 'progress.R' 'sql-build.R' 'query-join.R' 'query-select.R' 'query-semi-join.R' 'query-set-op.R' 'query.R' 'reexport.R' 'remote.R' 'rows.R' 'schema.R' 'simulate.R' 'sql-clause.R' 'sql-expr.R' 'src-sql.R' 'src_dbi.R' 'table-name.R' 'tbl-lazy.R' 'tbl-sql.R' 'test-frame.R' 'testthat.R' 'tidyeval-across.R' 'tidyeval.R' 'translate-sql.R' 'utils-format.R' 'verb-arrange.R' 'verb-compute.R' 'verb-count.R' 'verb-distinct.R' 'verb-do-query.R' 'verb-do.R' 'verb-expand.R' 'verb-fill.R' 'verb-filter.R' 'verb-group_by.R' 'verb-head.R' 'verb-joins.R' 'verb-mutate.R' 'verb-pivot-longer.R' 'verb-pivot-wider.R' 'verb-pull.R' 'verb-select.R' 'verb-set-ops.R' 'verb-slice.R' 'verb-summarise.R' 'verb-uncount.R' 'verb-window.R' 'zzz.R'
NeedsCompilation: no
Packaged: 2024-03-15 17:27:18 UTC; hadleywickham
Author: Hadley Wickham [aut, cre], Maximilian Girlich [aut], Edgar Ruiz [aut], Posit Software, PBC [cph, fnd]
Maintainer: Hadley Wickham <hadley@posit.co>
Repository: CRAN
Date/Publication: 2024-03-19 17:00:05 UTC

dbplyr: A 'dplyr' Back End for Databases

Description


A 'dplyr' back end for databases that allows you to work with remote database tables as if they are in-memory data frames. Basic features work with any database that has a 'DBI' back end; more advanced features require 'SQL' translation to be provided by the package author.

Author(s)

Maintainer: Hadley Wickham hadley@posit.co

Authors:

  • Maximilian Girlich

  • Edgar Ruiz

Other contributors:

  • Posit Software, PBC [copyright holder, funder]

See Also

Useful links:

  • https://dbplyr.tidyverse.org/

  • https://github.com/tidyverse/dbplyr

  • Report bugs at https://github.com/tidyverse/dbplyr/issues


Arrange rows by column values

Description

This is a method for the dplyr arrange() generic. It generates the ⁠ORDER BY⁠ clause of the SQL query. It also affects the window_order() of windowed expressions in mutate.tbl_lazy().

Note that ⁠ORDER BY⁠ clauses cannot generally appear in subqueries, which means that you should arrange() as late as possible in your pipelines.

Usage

## S3 method for class 'tbl_lazy'
arrange(.data, ..., .by_group = FALSE)

Arguments

.data

A lazy data frame backed by a database query.

...

<data-masking> Variables, or functions of variables. Use desc() to sort a variable in descending order.

.by_group

If TRUE, will sort first by grouping variable. Applies to grouped data frames only.

Value

Another tbl_lazy. Use show_query() to see the generated query, and use collect() to execute the query and return data to R.

Missing values

Unlike R, most databases sort NA (NULL) values to the front. You can override this behaviour by explicitly sorting on is.na(x).

Examples

library(dplyr, warn.conflicts = FALSE)

db <- memdb_frame(a = c(3, 4, 1, 2), b = c(5, 1, 2, NA))
db %>% arrange(a) %>% show_query()

# Note that NAs are sorted first
db %>% arrange(b)
# override by sorting on is.na() first
db %>% arrange(is.na(b), b)
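
# The .by_group argument isn't demonstrated above; a minimal sketch on a
# grouped table (db2 is a new example table):
db2 <- memdb_frame(g = c(1, 2, 1, 2), x = c(4, 3, 2, 1))
db2 %>% group_by(g) %>% arrange(x, .by_group = TRUE) %>% show_query()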

Backend: MS Access

Description

See vignette("translation-function") and vignette("translation-verb") for details of overall translation technology. Key differences for this backend are a number of custom translations.

Use simulate_access() with lazy_frame() to see simulated SQL without connecting to a live Access database.

Usage

simulate_access()

Examples

library(dplyr, warn.conflicts = FALSE)
lf <- lazy_frame(x = 1, y = 2, z = "a", con = simulate_access())

lf %>% head()
lf %>% mutate(y = as.numeric(y), z = sqrt(x^2 + 10))
lf %>% mutate(a = paste0(z, " times"))

Backend: SAP HANA

Description

See vignette("translation-function") and vignette("translation-verb") for details of overall translation technology. Key differences for this backend are a number of custom translations.

Use simulate_hana() with lazy_frame() to see simulated SQL without connecting to a live SAP HANA database.

Usage

simulate_hana()

Examples

library(dplyr, warn.conflicts = FALSE)

lf <- lazy_frame(a = TRUE, b = 1, c = 2, d = "z", con = simulate_hana())
lf %>% transmute(x = paste0(d, " times"))

Backend: Hive

Description

See vignette("translation-function") and vignette("translation-verb") for details of overall translation technology. Key differences for this backend are a scattering of custom translations provided by users.

Use simulate_hive() with lazy_frame() to see simulated SQL without connecting to a live Hive database.

Usage

simulate_hive()

Examples

library(dplyr, warn.conflicts = FALSE)

lf <- lazy_frame(a = TRUE, b = 1, d = 2, c = "z", con = simulate_hive())
lf %>% transmute(x = cot(b))
lf %>% transmute(x = bitwShiftL(c, 1L))
lf %>% transmute(x = str_replace_all(c, "a", "b"))

lf %>% summarise(x = median(d, na.rm = TRUE))
lf %>% summarise(x = var(c, na.rm = TRUE))

Backend: Impala

Description

See vignette("translation-function") and vignette("translation-verb") for details of overall translation technology. Key differences for this backend are a scattering of custom translations provided by users, mostly focussed on bitwise operations.

Use simulate_impala() with lazy_frame() to see simulated SQL without connecting to a live Impala database.

Usage

simulate_impala()

Examples

library(dplyr, warn.conflicts = FALSE)

lf <- lazy_frame(a = TRUE, b = 1, c = 2, d = "z", con = simulate_impala())
lf %>% transmute(X = bitwNot(bitwOr(b, c)))

Backend: SQL Server

Description

See vignette("translation-function") and vignette("translation-verb") for details of overall translation technology. Key differences for this backend are a number of custom translations.

Use simulate_mssql() with lazy_frame() to see simulated SQL without connecting to a live SQL Server database.

Usage

simulate_mssql(version = "15.0")

Arguments

version

Version of MS SQL to simulate. Currently the only difference is that 15.0 and above use TRY_CAST() instead of CAST().

Bit vs boolean

SQL Server uses two incompatible types to represent TRUE and FALSE values:

  • The BOOLEAN type is the result of logical comparisons (e.g. x > y) and can be used in WHERE clauses but not to create new columns in SELECT.

  • The BIT type is a special type of numeric column used to store TRUE and FALSE values, but can't be used in WHERE clauses.

dbplyr does its best to automatically create the correct type when needed, but can't do it 100% correctly because it does not have a full type inference system. This means that you may need to manually do conversions from time to time.

Examples

library(dplyr, warn.conflicts = FALSE)

lf <- lazy_frame(a = TRUE, b = 1, c = 2, d = "z", con = simulate_mssql())
lf %>% head()
lf %>% transmute(x = paste(b, c, d))

# Can use boolean as is:
lf %>% filter(c > d)
# Need to convert from boolean to bit:
lf %>% transmute(x = c > d)
# Can use boolean as is:
lf %>% transmute(x = ifelse(c > d, "c", "d"))

Backend: MySQL/MariaDB

Description

See vignette("translation-function") and vignette("translation-verb") for details of overall translation technology. Key differences for this backend are a number of custom translations.

Use simulate_mysql() with lazy_frame() to see simulated SQL without connecting to a live MySQL/MariaDB database.

Usage

simulate_mysql()

simulate_mariadb()

Examples

library(dplyr, warn.conflicts = FALSE)

lf <- lazy_frame(a = TRUE, b = 1, c = 2, d = "z", con = simulate_mysql())
lf %>% transmute(x = paste0(d, " times"))

Backend: ODBC

Description

See vignette("translation-function") and vignette("translation-verb") for details of overall translation technology. Key differences for this backend are minor translations for common data types.

Use simulate_odbc() with lazy_frame() to see simulated SQL without connecting to a live database.

Usage

simulate_odbc()

Examples

library(dplyr, warn.conflicts = FALSE)

lf <- lazy_frame(a = TRUE, b = 1, d = 2, c = "z", con = simulate_odbc())
lf %>% transmute(x = as.numeric(b))
lf %>% transmute(x = as.integer(b))
lf %>% transmute(x = as.character(b))

Backend: Oracle

Description

See vignette("translation-function") and vignette("translation-verb") for details of overall translation technology. Key differences for this backend are a number of custom translations.

Note that versions of Oracle prior to 23c have limited support for TRUE and FALSE and you may need to use 1 and 0 instead. See https://oracle-base.com/articles/23c/boolean-data-type-23c for more details.

Use simulate_oracle() with lazy_frame() to see simulated SQL without connecting to a live Oracle database.

Usage

simulate_oracle()

Examples

library(dplyr, warn.conflicts = FALSE)

lf <- lazy_frame(a = TRUE, b = 1, c = 2, d = "z", con = simulate_oracle())
lf %>% transmute(x = paste0(c, " times"))
lf %>% setdiff(lf)

Backend: PostgreSQL

Description

See vignette("translation-function") and vignette("translation-verb") for details of overall translation technology. Key differences for this backend are a number of custom translations.

Use simulate_postgres() with lazy_frame() to see simulated SQL without connecting to a live PostgreSQL database.

Usage

simulate_postgres()

Examples

library(dplyr, warn.conflicts = FALSE)

lf <- lazy_frame(a = TRUE, b = 1, c = 2, d = "z", con = simulate_postgres())
lf %>% summarise(x = sd(b, na.rm = TRUE))
lf %>% summarise(y = cor(b, c), z = cov(b, c))

Backend: Redshift

Description

Base translations come from the PostgreSQL backend. There are generally few differences, apart from string manipulation.

Use simulate_redshift() with lazy_frame() to see simulated SQL without connecting to a live Redshift database.

Usage

simulate_redshift()

Examples

library(dplyr, warn.conflicts = FALSE)

lf <- lazy_frame(a = TRUE, b = 1, c = 2, d = "z", con = simulate_redshift())
lf %>% transmute(x = paste(c, " times"))
lf %>% transmute(x = substr(c, 2, 3))
lf %>% transmute(x = str_replace_all(c, "a", "z"))

Backend: Snowflake

Description

See vignette("translation-function") and vignette("translation-verb") for details of overall translation technology.

Use simulate_snowflake() with lazy_frame() to see simulated SQL without connecting to a live Snowflake database.

Usage

simulate_snowflake()

Examples

library(dplyr, warn.conflicts = FALSE)

lf <- lazy_frame(a = TRUE, b = 1, c = 2, d = "z", con = simulate_snowflake())
lf %>% transmute(x = paste0(d, " times"))

Backend: Databricks Spark SQL

Description

See vignette("translation-function") and vignette("translation-verb") for details of overall translation technology. Key differences for this backend are better translation of statistical aggregate functions (e.g. var(), median()) and use of temporary views instead of temporary tables when copying data.

Use simulate_spark_sql() with lazy_frame() to see simulated SQL without connecting to a live Spark SQL database.

Usage

simulate_spark_sql()

Examples

library(dplyr, warn.conflicts = FALSE)

lf <- lazy_frame(a = TRUE, b = 1, d = 2, c = "z", con = simulate_spark_sql())

lf %>% summarise(x = median(d, na.rm = TRUE))
lf %>% summarise(x = var(c, na.rm = TRUE), .by = d)

lf %>% mutate(x = first(c))
lf %>% mutate(x = first(c), .by = d)

Backend: SQLite

Description

See vignette("translation-function") and vignette("translation-verb") for details of overall translation technology. Key differences for this backend are a number of custom translations.

Use simulate_sqlite() with lazy_frame() to see simulated SQL without connecting to a live SQLite database.

Usage

simulate_sqlite()

Examples

library(dplyr, warn.conflicts = FALSE)

lf <- lazy_frame(a = TRUE, b = 1, c = 2, d = "z", con = simulate_sqlite())
lf %>% transmute(x = paste(c, " times"))
lf %>% transmute(x = log(b), y = log(b, base = 2))

Backend: Teradata

Description

See vignette("translation-function") and vignette("translation-verb") for details of overall translation technology. Key differences for this backend are a number of custom translations.

Use simulate_teradata() with lazy_frame() to see simulated SQL without connecting to a live Teradata database.

Usage

simulate_teradata()

Examples

library(dplyr, warn.conflicts = FALSE)

lf <- lazy_frame(a = TRUE, b = 1, c = 2, d = "z", con = simulate_teradata())
lf %>% head()

Build a SQL string.

Description

This is a convenience function that should prevent SQL injection attacks (which in the context of dplyr are most likely to be accidental rather than deliberate) by automatically escaping all expressions in the input, while treating bare strings as SQL. This is unlikely to prevent any serious attack, but should make it unlikely that you produce invalid SQL.

Usage

build_sql(..., .env = parent.frame(), con = sql_current_con())

Arguments

...

Input to convert to SQL. Use sql() to preserve user input as is (dangerous), and ident() to label user input as SQL identifiers (safe).

.env

the environment in which to evaluate the arguments. Should not be needed in typical use.

con

database connection; used to select correct quoting characters.

Details

This function should only be used when generating SELECT clauses or other high-level queries, or for other syntax that has no R equivalent. For individual function translations, prefer sql_expr().

Examples

con <- simulate_dbi()
build_sql("SELECT * FROM TABLE", con = con)
x <- "TABLE"
build_sql("SELECT * FROM ", x, con = con)
build_sql("SELECT * FROM ", ident(x), con = con)
build_sql("SELECT * FROM ", sql(x), con = con)

# http://xkcd.com/327/
name <- "Robert'); DROP TABLE Students;--"
build_sql("INSERT INTO Students (Name) VALUES (", name, ")", con = con)

Compute results of a query

Description

These are methods for the dplyr generics collapse(), compute(), and collect(). collapse() creates a subquery, compute() stores the results in a remote table, and collect() executes the query and downloads the data into R.

Usage

## S3 method for class 'tbl_sql'
collapse(x, ...)

## S3 method for class 'tbl_sql'
compute(
  x,
  name = NULL,
  temporary = TRUE,
  unique_indexes = list(),
  indexes = list(),
  analyze = TRUE,
  ...,
  cte = FALSE
)

## S3 method for class 'tbl_sql'
collect(x, ..., n = Inf, warn_incomplete = TRUE, cte = FALSE)

Arguments

x

A lazy data frame backed by a database query.

...

other parameters passed to methods.

name

Table name in remote database.

temporary

Should the table be temporary (TRUE, the default) or persistent (FALSE)?

unique_indexes

a list of character vectors. Each element of the list will create a new unique index over the specified column(s). Duplicate rows will result in failure.

indexes

a list of character vectors. Each element of the list will create a new index.

analyze

if TRUE (the default), will automatically ANALYZE the new table so that the query optimiser has useful information.

cte

[Experimental] Use common table expressions in the generated SQL?

n

Number of rows to fetch. Defaults to Inf, meaning all rows.

warn_incomplete

Warn if n is less than the number of result rows?

Examples

library(dplyr, warn.conflicts = FALSE)

db <- memdb_frame(a = c(3, 4, 1, 2), b = c(5, 1, 2, NA))
db %>% filter(a <= 2) %>% collect()
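
# The example above only exercises collect(); a minimal sketch contrasting
# the three verbs on the same query:
db2 <- db %>% filter(a <= 2)
db2 %>% collapse()   # wraps the current query in a subquery
db2 %>% compute()    # stores the result in a temporary remote table
db2 %>% collect()    # executes the query and returns a local tibble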

Complete a SQL table with missing combinations of data

Description

Turns implicit missing values into explicit missing values. This is a method for the tidyr::complete() generic.

Usage

## S3 method for class 'tbl_lazy'
complete(data, ..., fill = list())

Arguments

data

A lazy data frame backed by a database query.

...

Specification of columns to expand. See tidyr::expand for more details.

fill

A named list that for each variable supplies a single value to use instead of NA for missing combinations.

Value

Another tbl_lazy. Use show_query() to see the generated query, and use collect() to execute the query and return data to R.

Examples


df <- memdb_frame(
  group = c(1:2, 1),
  item_id = c(1:2, 2),
  item_name = c("a", "b", "b"),
  value1 = 1:3,
  value2 = 4:6
)

df %>% tidyr::complete(group, nesting(item_id, item_name))

# You can also choose to fill in missing values
df %>% tidyr::complete(group, nesting(item_id, item_name), fill = list(value1 = 0))


Use a local data frame in a dbplyr query

Description

This is an alternative to copy_to() that does not need write access and is faster for small data.

Usage

copy_inline(con, df, types = NULL)

Arguments

con

A database connection.

df

A local data frame. The data is written directly in the SQL query so it should be small.

types

A named character vector of SQL data types to use for the columns. The data types are backend specific. For example for Postgres this could be c(id = "bigint", created_at = "timestamp", values = "integer[]"). If NULL, the default, the types are determined from df.

Details

It writes the data directly into the SQL query via the VALUES clause.

Value

A tbl_lazy.

See Also

copy_to() to copy the data into a new database table.

Examples

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

copy_inline(con, df)

copy_inline(con, df) %>% dplyr::show_query()
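
# The types argument isn't shown above; a hedged sketch supplying SQLite
# column types explicitly:
copy_inline(con, df, types = c(x = "INTEGER", y = "TEXT")) %>%
  dplyr::show_query()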

Copy a local data frame to a remote database

Description

This is an implementation of the dplyr copy_to() generic and is mostly a wrapper around DBI::dbWriteTable().

It is useful for copying small amounts of data to a database for examples, experiments, and joins. By default, it creates temporary tables which are only visible within the current connection to the database.

Usage

## S3 method for class 'src_sql'
copy_to(
  dest,
  df,
  name = deparse(substitute(df)),
  overwrite = FALSE,
  types = NULL,
  temporary = TRUE,
  unique_indexes = NULL,
  indexes = NULL,
  analyze = TRUE,
  ...,
  in_transaction = TRUE
)

Arguments

dest

remote data source

df

A local data frame, a tbl_sql from same source, or a tbl_sql from another source. If from another source, all data must transition through R in one pass, so it is only suitable for transferring small amounts of data.

name

Name of new remote table. Use a string to create the table in the current catalog/schema. Use I() if you want to create it in a specific catalog/schema, e.g. I("schema.table").

overwrite

If TRUE, will overwrite an existing table with name name. If FALSE, will throw an error if name already exists.

types

a character vector giving variable types to use for the columns. See https://www.sqlite.org/datatype3.html for available types.

temporary

if TRUE, will create a temporary table that is local to this connection and will be automatically deleted when the connection expires

unique_indexes

a list of character vectors. Each element of the list will create a new unique index over the specified column(s). Duplicate rows will result in failure.

indexes

a list of character vectors. Each element of the list will create a new index.

analyze

if TRUE (the default), will automatically ANALYZE the new table so that the query optimiser has useful information.

...

other parameters passed to methods.

in_transaction

Should the table creation be wrapped in a transaction? This typically makes things faster, but you may want to suppress it if the database doesn't support transactions, or if you're wrapping in a transaction higher up (and your database doesn't support nested transactions).

Value

Another tbl_lazy. Use show_query() to see the generated query, and use collect() to execute the query and return data to R.

See Also

copy_inline() to use small data in an SQL query without actually writing to a table.

Examples

library(dplyr, warn.conflicts = FALSE)

df <- data.frame(x = 1:5, y = letters[5:1])
db <- copy_to(src_memdb(), df)
db

df2 <- data.frame(y = c("a", "d"), fruit = c("apple", "date"))
# copy_to() is called automatically if you set copy = TRUE
# in the join functions
db %>% left_join(df2, copy = TRUE)

Count observations by group

Description

These are methods for the dplyr count() and tally() generics. They wrap up group_by.tbl_lazy(), summarise.tbl_lazy() and, optionally, arrange.tbl_lazy().

Usage

## S3 method for class 'tbl_lazy'
count(x, ..., wt = NULL, sort = FALSE, name = NULL)

## S3 method for class 'tbl_lazy'
add_count(x, ..., wt = NULL, sort = FALSE, name = NULL, .drop = NULL)

## S3 method for class 'tbl_lazy'
tally(x, wt = NULL, sort = FALSE, name = NULL)

Arguments

x

A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr).

...

<data-masking> Variables to group by.

wt

<data-masking> Frequency weights. Can be NULL or a variable:

  • If NULL (the default), counts the number of rows in each group.

  • If a variable, computes sum(wt) for each group.

sort

If TRUE, will show the largest groups at the top.

name

The name of the new column in the output.

If omitted, it will default to n. If there's already a column called n, it will use nn. If there's a column called n and nn, it'll use nnn, and so on, adding ns until it gets a new name.

.drop

Not supported for lazy tables.

Examples

library(dplyr, warn.conflicts = FALSE)

db <- memdb_frame(g = c(1, 1, 1, 2, 2), x = c(4, 3, 6, 9, 2))
db %>% count(g) %>% show_query()
db %>% count(g, wt = x) %>% show_query()
db %>% count(g, wt = x, sort = TRUE) %>% show_query()

Database I/O generics

Description

These generics are responsible for getting data into and out of the database. They should be used as a last resort - only use them when you can't make a backend work by providing methods for DBI generics, or for dbplyr's SQL generation generics. They tend to be most needed when a backend has special handling of temporary tables.

Usage

db_copy_to(
  con,
  table,
  values,
  ...,
  overwrite = FALSE,
  types = NULL,
  temporary = TRUE,
  unique_indexes = NULL,
  indexes = NULL,
  analyze = TRUE,
  in_transaction = TRUE
)

db_compute(
  con,
  table,
  sql,
  ...,
  overwrite = FALSE,
  temporary = TRUE,
  unique_indexes = list(),
  indexes = list(),
  analyze = TRUE,
  in_transaction = TRUE
)

db_collect(con, sql, n = -1, warn_incomplete = TRUE, ...)

db_table_temporary(con, table, temporary, ...)

See Also

Other generic: db-sql, db_connection_describe(), sql_escape_logical()
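
Examples

# As an illustration of when these generics earn their keep: a backend whose
# temporary tables must be named with a leading "#" might override
# db_table_temporary(). A hypothetical sketch (MyConnection is a made-up
# class; real backends dispatch on their DBI connection class):
db_table_temporary.MyConnection <- function(con, table, temporary, ...) {
  if (temporary) {
    # Prefix the name instead of emitting CREATE TEMPORARY TABLE
    table <- paste0("#", table)
    temporary <- FALSE
  }
  list(table = table, temporary = temporary)
}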


Miscellaneous database generics

Description

These generics cover a miscellaneous set of backend hooks: describing the connection, choosing join suffixes, rendering SQL, reading column types, and declaring which dbplyr edition a backend implements.
Usage

db_connection_describe(con, ...)

sql_join_suffix(con, suffix, ...)

db_sql_render(con, sql, ..., cte = FALSE, sql_options = NULL)

db_col_types(con, table, call)

dbplyr_edition(con)

Details

dbplyr 2.0.0

dbplyr 2.0.0 renamed a number of generics so that they could be cleanly moved from dplyr to dbplyr. If you have an existing backend, you'll need to rename your methods accordingly; for example, dplyr::db_desc() became dbplyr::db_connection_describe().

See Also

Other generic: db-sql, db_copy_to(), sql_escape_logical()
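
Examples

# Two of these generics are needed by essentially every backend; a minimal
# hypothetical sketch (MyConnection is a made-up class):

# Opt in to the dbplyr 2.0.0 API
dbplyr_edition.MyConnection <- function(con) 2L

# Short string shown when printing a tbl from this connection
db_connection_describe.MyConnection <- function(con, ...) {
  "mydb [user@host/dbname]"
}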


SQL escaping/quoting generics

Description

These generics translate individual values into SQL. The core generics are DBI::dbQuoteIdentifier() and DBI::dbQuoteString() for quoting identifiers and strings, but dbplyr needs additional tools for inserting logical, date, date-time, and raw values into queries.

Usage

sql_escape_logical(con, x)

sql_escape_date(con, x)

sql_escape_datetime(con, x)

sql_escape_raw(con, x)

See Also

Other generic: db-sql, db_connection_describe(), db_copy_to()

Examples

con <- simulate_dbi()
sql_escape_logical(con, c(TRUE, FALSE, NA))
sql_escape_date(con, Sys.Date())
sql_escape_datetime(con, Sys.time())
sql_escape_raw(con, charToRaw("hi"))
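
# As an example of providing a method: a backend without a native boolean
# type might render logicals as integers. A hypothetical sketch
# (MyConnection is a made-up class):
sql_escape_logical.MyConnection <- function(con, x) {
  y <- ifelse(x, "1", "0")
  y[is.na(x)] <- "NULL"
  y
}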

SQL generation generics

Description

These generics fall into several families - SQL translation, tables, query manipulation, query indentation, and query generation; see Usage below for the full list.

Usage

sql_expr_matches(con, x, y, ...)

sql_translation(con)

sql_random(con)

sql_table_analyze(con, table, ...)

sql_table_index(
  con,
  table,
  columns,
  name = NULL,
  unique = FALSE,
  ...,
  call = caller_env()
)

sql_query_explain(con, sql, ...)

sql_query_fields(con, sql, ...)

sql_query_save(con, sql, name, temporary = TRUE, ...)

sql_query_wrap(con, from, name = NULL, ..., lvl = 0)

sql_indent_subquery(from, con, lvl = 0)

sql_query_rows(con, sql, ...)

supports_window_clause(con)

db_supports_table_alias_with_as(con)

sql_query_select(
  con,
  select,
  from,
  where = NULL,
  group_by = NULL,
  having = NULL,
  window = NULL,
  order_by = NULL,
  limit = NULL,
  distinct = FALSE,
  ...,
  subquery = FALSE,
  lvl = 0
)

sql_query_join(
  con,
  x,
  y,
  select,
  type = "inner",
  by = NULL,
  na_matches = FALSE,
  ...,
  lvl = 0
)

sql_query_multi_join(con, x, joins, table_names, by_list, select, ..., lvl = 0)

sql_query_semi_join(con, x, y, anti, by, where, vars, ..., lvl = 0)

sql_query_set_op(con, x, y, method, ..., all = FALSE, lvl = 0)

sql_query_union(con, x, unions, ..., lvl = 0)

sql_returning_cols(con, cols, table, ...)

dbplyr 2.0.0

Many ⁠dplyr::db_*⁠ generics have been replaced by ⁠dbplyr::sql_*⁠ generics. To update your backend, you'll need to extract the SQL generation out of your existing code, and place it in a new method for a dbplyr sql_ generic.

The query generating functions have also changed names. Their behaviour is unchanged, so you just need to rename the generic and import from dbplyr instead of dplyr.

Learn more in vignette("backend-2.0")

See Also

Other generic: db_connection_describe(), db_copy_to(), sql_escape_logical()
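
Examples

# To give a feel for the pattern: a hypothetical sql_table_analyze() method
# for a backend with Hive-style statistics collection (a sketch, not a
# drop-in implementation; MyConnection is a made-up class):
sql_table_analyze.MyConnection <- function(con, table, ...) {
  build_sql("ANALYZE TABLE ", table, " COMPUTE STATISTICS", con = con)
}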


"Uncount" a database table

Description

This is a method for the tidyr uncount() generic. It uses a temporary table, so your database user needs permissions to create one.

Usage

dbplyr_uncount(data, weights, .remove = TRUE, .id = NULL)

Arguments

data

A lazy data frame backed by a database query.

weights

A vector of weights. Evaluated in the context of data; supports quasiquotation.

.remove

If TRUE, and weights is the name of a column in data, then this column is removed.

.id

Supply a string to create a new variable which gives a unique identifier for each created row.

Examples

df <- memdb_frame(x = c("a", "b"), n = c(1, 2))
dbplyr_uncount(df, n)
dbplyr_uncount(df, n, .id = "id")

# You can also use constants
dbplyr_uncount(df, 2)

# Or expressions
dbplyr_uncount(df, 2 / n)

Subset rows using their positions

Description

These are methods for the dplyr generics slice_min(), slice_max(), and slice_sample(). They are translated to SQL using filter() and window functions (ROW_NUMBER(), MIN_RANK(), or CUME_DIST(), depending on the arguments). slice(), slice_head(), and slice_tail() are not supported since database tables have no intrinsic order.

If data is grouped, the operation will be performed on each group so that (e.g.) slice_min(db, x, n = 3) will select the three rows with the smallest value of x in each group.

Usage

## S3 method for class 'tbl_lazy'
slice_min(
  .data,
  order_by,
  ...,
  n,
  prop,
  by = NULL,
  with_ties = TRUE,
  na_rm = TRUE
)

## S3 method for class 'tbl_lazy'
slice_max(
  .data,
  order_by,
  ...,
  n,
  by = NULL,
  prop,
  with_ties = TRUE,
  na_rm = TRUE
)

## S3 method for class 'tbl_lazy'
slice_sample(.data, ..., n, prop, by = NULL, weight_by = NULL, replace = FALSE)

Arguments

.data

A lazy data frame backed by a database query.

order_by

Variable or function of variables to order by.

...

Not used.

n, prop

Provide either n, the number of rows, or prop, the proportion of rows to select. If neither are supplied, n = 1 will be used.

If n is greater than the number of rows in the group (or prop > 1), the result will be silently truncated to the group size. If the proportion of a group size is not an integer, it is rounded down.

by

[Experimental]

<tidy-select> Optionally, a selection of columns to group by for just this operation, functioning as an alternative to group_by(). For details and examples, see ?dplyr_by.

with_ties

Should ties be kept together? The default, TRUE, may return more rows than you request. Use FALSE to ignore ties, and return the first n rows.

na_rm

Should missing values in order_by be removed from the result? If FALSE, NA values are sorted to the end (like in arrange()), so they will only be included if there are insufficient non-missing values to reach n/prop.

weight_by, replace

Not supported for database backends.

Examples

library(dplyr, warn.conflicts = FALSE)

db <- memdb_frame(x = 1:3, y = c(1, 1, 2))
db %>% slice_min(x) %>% show_query()
db %>% slice_max(x) %>% show_query()
db %>% slice_sample() %>% show_query()

db %>% group_by(y) %>% slice_min(x) %>% show_query()

# By default, ties are included so you may get more rows
# than you expect
db %>% slice_min(y, n = 1)
db %>% slice_min(y, n = 1, with_ties = FALSE)

# Non-integer group sizes are rounded down
db %>% slice_min(x, prop = 0.5)

Subset distinct/unique rows

Description

This is a method for the dplyr distinct() generic. It adds the DISTINCT clause to the SQL query.

Usage

## S3 method for class 'tbl_lazy'
distinct(.data, ..., .keep_all = FALSE)

Arguments

.data

A lazy data frame backed by a database query.

...

<data-masking> Optional variables to use when determining uniqueness. If there are multiple rows for a given combination of inputs, only the first row will be preserved. If omitted, all variables in the data frame are used.

.keep_all

If TRUE, keep all variables in .data. If a combination of ... is not distinct, this keeps the first row of values.

Value

Another tbl_lazy. Use show_query() to see the generated query, and use collect() to execute the query and return data to R.

Examples

library(dplyr, warn.conflicts = FALSE)

db <- memdb_frame(x = c(1, 1, 2, 2), y = c(1, 2, 1, 1))
db %>% distinct() %>% show_query()
db %>% distinct(x) %>% show_query()

Perform arbitrary computation on remote backend

Description

Perform arbitrary computation on remote backend

Usage

## S3 method for class 'tbl_sql'
do(.data, ..., .chunk_size = 10000L)

Arguments

.data

a tbl

...

Expressions to apply to each group. If named, results will be stored in a new column. If unnamed, must return a data frame. You can use . to refer to the current group. You cannot mix named and unnamed arguments.

.chunk_size

The size of each chunk to pull into R. If this number is too big, the process will be slow because R has to allocate and free a lot of memory. If it's too small, it will be slow, because of the overhead of talking to the database.
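
Examples

# There are no examples above; a minimal sketch using memdb_frame() as
# elsewhere in this manual:
library(dplyr, warn.conflicts = FALSE)

db <- memdb_frame(g = c(1, 1, 2), x = c(2, 4, 6))
# An unnamed argument must return a data frame; `.` is the current group
db %>% group_by(g) %>% do(head(., 1))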


Escape/quote a string.

Description

escape() requires you to provide a database connection to control the details of escaping. escape_ansi() uses the SQL 92 ANSI standard.

Usage

escape(x, parens = NA, collapse = " ", con = NULL)

escape_ansi(x, parens = NA, collapse = "")

sql_vector(x, parens = NA, collapse = " ", con = NULL)

Arguments

x

An object to escape. Existing sql vectors will be left as is, character vectors are escaped with single quotes, numeric vectors have trailing .0 added if they're whole numbers, identifiers are escaped with double quotes.

parens, collapse

Controls behaviour when multiple values are supplied. parens should be a logical flag, or if NA, will wrap in parens if length > 1.

Default behaviour: lists are always wrapped in parens and separated by commas, identifiers are separated by commas and never wrapped, atomic vectors are separated by spaces and wrapped in parens if needed.

con

Database connection.

Examples

# Doubles vs. integers
escape_ansi(1:5)
escape_ansi(c(1, 5.4))

# String vs known sql vs. sql identifier
escape_ansi("X")
escape_ansi(sql("X"))
escape_ansi(ident("X"))

# Escaping is idempotent
escape_ansi("X")
escape_ansi(escape_ansi("X"))
escape_ansi(escape_ansi(escape_ansi("X")))

Expand SQL tables to include all possible combinations of values

Description

This is a method for the tidyr::expand() generic. It doesn't sort the result explicitly, so the order might be different to what expand() returns for data frames.

Usage

## S3 method for class 'tbl_lazy'
expand(data, ..., .name_repair = "check_unique")

Arguments

data

A lazy data frame backed by a database query.

...

Specification of columns to expand. See tidyr::expand for more details.

.name_repair

Treatment of problematic column names:

  • "minimal": No name repair or checks, beyond basic existence,

  • "unique": Make sure names are unique and not empty,

  • "check_unique": (default value), no name repair, but check they are unique,

  • "universal": Make the names unique and syntactic

  • a function: apply custom name repair (e.g., .name_repair = make.names for names in the style of base R).

  • A purrr-style anonymous function, see rlang::as_function()

This argument is passed on as repair to vctrs::vec_as_names(). See there for more details on these terms and the strategies used to enforce them.

Value

Another tbl_lazy. Use show_query() to see the generated query, and use collect() to execute the query and return data to R.

Examples


fruits <- memdb_frame(
  type   = c("apple", "orange", "apple", "orange", "orange", "orange"),
  year   = c(2010, 2010, 2012, 2010, 2010, 2012),
  size = c("XS", "S",  "M", "S", "S", "M"),
  weights = rnorm(6)
)

# All possible combinations ---------------------------------------
fruits %>% tidyr::expand(type)
fruits %>% tidyr::expand(type, size)

# Only combinations that already appear in the data ---------------
fruits %>% tidyr::expand(nesting(type, size))


Fill in missing values with previous or next value

Description

Fill in missing values with previous or next value

Usage

## S3 method for class 'tbl_lazy'
fill(.data, ..., .direction = c("down", "up", "updown", "downup"))

Arguments

.data

A lazy data frame backed by a database query.

...

Columns to fill.

.direction

Direction in which to fill missing values: "down" (the default), "up", "downup" (first down, then up), or "updown" (first up, then down). Note that "up" does not work when .data is sorted by non-numeric columns. As a workaround, reverse the ordering yourself beforehand; for example, replace arrange(x, desc(y)) by arrange(desc(x), y).

Examples


squirrels <- tibble::tribble(
  ~group,    ~name,     ~role,     ~n_squirrels, ~ n_squirrels2,
  1,      "Sam",    "Observer",   NA,                 1,
  1,     "Mara", "Scorekeeper",    8,                NA,
  1,    "Jesse",    "Observer",   NA,                NA,
  1,      "Tom",    "Observer",   NA,                 4,
  2,     "Mike",    "Observer",   NA,                NA,
  2,  "Rachael",    "Observer",   NA,                 6,
  2,  "Sydekea", "Scorekeeper",   14,                NA,
  2, "Gabriela",    "Observer",   NA,                NA,
  3,  "Derrick",    "Observer",   NA,                NA,
  3,     "Kara", "Scorekeeper",    9,                 10,
  3,    "Emily",    "Observer",   NA,                NA,
  3, "Danielle",    "Observer",   NA,                NA
)
squirrels$id <- 1:12

tbl_memdb(squirrels) %>%
  window_order(id) %>%
  tidyr::fill(
    n_squirrels,
    n_squirrels2,
  )


Subset rows using column values

Description

This is a method for the dplyr filter() generic. It generates the WHERE clause of the SQL query.

Usage

## S3 method for class 'tbl_lazy'
filter(.data, ..., .by = NULL, .preserve = FALSE)

Arguments

.data

A lazy data frame backed by a database query.

...

<data-masking> Expressions that return a logical value and are defined in terms of the variables in .data. If multiple expressions are included, they are combined with the & operator. Only rows for which all conditions evaluate to TRUE are kept.

.by

[Experimental]

<tidy-select> Optionally, a selection of columns to group by for just this operation, functioning as an alternative to group_by(). For details and examples, see ?dplyr_by.

.preserve

Not supported by this method.

Value

Another tbl_lazy. Use show_query() to see the generated query, and use collect() to execute the query and return data to R.

Examples

library(dplyr, warn.conflicts = FALSE)

db <- memdb_frame(x = c(2, NA, 5, NA, 10), y = 1:5)
db %>% filter(x < 5) %>% show_query()
db %>% filter(is.na(x)) %>% show_query()

Extract and check the RETURNING rows

Description

[Experimental]

get_returned_rows() extracts the RETURNING rows produced by rows_insert(), rows_append(), rows_update(), rows_upsert(), or rows_delete() if these are called with the returning argument. An error is raised if this information is not available.

has_returned_rows() checks if x has stored RETURNING rows produced by rows_insert(), rows_append(), rows_update(), rows_upsert(), or rows_delete().

Usage

get_returned_rows(x)

has_returned_rows(x)

Arguments

x

A lazy tbl.

Value

For get_returned_rows(), a tibble.

For has_returned_rows(), a scalar logical.

Examples

library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbExecute(con, "CREATE TABLE Info (
   id INTEGER PRIMARY KEY AUTOINCREMENT,
   number INTEGER
)")
info <- tbl(con, "Info")

rows1 <- copy_inline(con, data.frame(number = c(1, 5)))
rows_insert(info, rows1, conflict = "ignore", in_place = TRUE)
info

# If the table has an auto-incrementing primary key, you can use
# the returning argument + get_returned_rows() to access its value
rows2 <- copy_inline(con, data.frame(number = c(13, 27)))
info <- rows_insert(
  info,
  rows2,
  conflict = "ignore",
  in_place = TRUE,
  returning = id
)
info
get_returned_rows(info)

Group by one or more variables

Description

This is a method for the dplyr group_by() generic. It is translated to the ⁠GROUP BY⁠ clause of the SQL query when used with summarise() and to the ⁠PARTITION BY⁠ clause of window functions when used with mutate().

Usage

## S3 method for class 'tbl_lazy'
group_by(.data, ..., .add = FALSE, add = deprecated(), .drop = TRUE)

Arguments

.data

A lazy data frame backed by a database query.

...

<data-masking> Variables, or computations of variables, to group by. Computations are always done on the ungrouped data.

.add

When FALSE, the default, group_by() will override existing groups. To add to the existing groups, use .add = TRUE.

This argument was previously called add, but that prevented creating a new grouping variable called add, and conflicts with our naming conventions.

add

Deprecated. Please use .add instead.

.drop

Not supported by this method.

Examples

library(dplyr, warn.conflicts = FALSE)

db <- memdb_frame(g = c(1, 1, 1, 2, 2), x = c(4, 3, 6, 9, 2))
db %>%
  group_by(g) %>%
  summarise(n()) %>%
  show_query()

db %>%
  group_by(g) %>%
  mutate(x2 = x / sum(x, na.rm = TRUE)) %>%
  show_query()

Subset the first rows

Description

This is a method for the head() generic. It is usually translated to the LIMIT clause of the SQL query. Because LIMIT is not an official part of the SQL specification, some databases use other clauses like TOP or ⁠FETCH ROWS⁠.

Note that databases don't really have a sense of row order, so what "first" means is subject to interpretation. Most databases will respect ordering performed with arrange(), but it's not guaranteed. tail() is not supported at all because the situation is even murkier for the "last" rows.

Usage

## S3 method for class 'tbl_lazy'
head(x, n = 6L, ...)

Arguments

x

A lazy data frame backed by a database query.

n

Number of rows to return

...

Not used.

Value

Another tbl_lazy. Use show_query() to see the generated query, and use collect() to execute the query and return data to R.

Examples

library(dplyr, warn.conflicts = FALSE)

db <- memdb_frame(x = 1:100)
db %>% head() %>% show_query()

# Pretend we have data in a SQL server database
db2 <- lazy_frame(x = 1:100, con = simulate_mssql())
db2 %>% head() %>% show_query()

Flag a character vector as SQL identifiers

Description

ident() takes strings and turns them into database identifiers (e.g. table or column names), quoting them using the identifier rules for your database. ident_q() does the same, but assumes the names have already been quoted, preventing them from being quoted again.

These are generally for internal use only; if you need to supply a table name that is qualified with schema or catalog, or has already been quoted for some other reason, use I().

Usage

ident(...)

is.ident(x)

Arguments

...

A character vector, or name-value pairs.

x

An object.

Examples

# SQL92 quotes strings with '
escape_ansi("x")

# And identifiers with "
ident("x")
escape_ansi(ident("x"))

# You can supply multiple inputs
ident(a = "x", b = "y")
ident_q(a = "x", b = "y")

Declare a identifier as being pre-quoted.

Description

No longer needed; please use sql() instead.

Usage

ident_q(...)

Refer to a table in another schema/catalog

Description

in_schema() and in_catalog() can be used to refer to tables outside of the current catalog/schema. However, we now recommend using I() as it's typically less typing.

Usage

in_schema(schema, table)

in_catalog(catalog, schema, table)

Arguments

catalog, schema, table

Names of catalog, schema, and table. These will be automatically quoted; use sql() to pass a raw name that won't get quoted.

Examples

# Previously:
in_schema("my_schema", "my_table")
in_catalog("my_catalog", "my_schema", "my_table")
in_schema(sql("my_schema"), sql("my_table"))

# Now
I("my_schema.my_table")
I("my_catalog.my_schema.my_table")
I("my_schema.my_table")

# Example using schemas with SQLite
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Add auxiliary schema
tmp <- tempfile()
DBI::dbExecute(con, paste0("ATTACH '", tmp, "' AS aux"))

library(dplyr, warn.conflicts = FALSE)
copy_to(con, iris, "df", temporary = FALSE)
copy_to(con, mtcars, I("aux.df"), temporary = FALSE)

con %>% tbl("df")
con %>% tbl(I("aux.df"))

SQL set operations

Description

These are methods for the dplyr generics dplyr::intersect(), dplyr::union(), and dplyr::setdiff(). They are translated to INTERSECT, UNION, and EXCEPT respectively.

Usage

## S3 method for class 'tbl_lazy'
intersect(x, y, copy = FALSE, ..., all = FALSE)

## S3 method for class 'tbl_lazy'
union(x, y, copy = FALSE, ..., all = FALSE)

## S3 method for class 'tbl_lazy'
union_all(x, y, copy = FALSE, ...)

## S3 method for class 'tbl_lazy'
setdiff(x, y, copy = FALSE, ..., all = FALSE)

Arguments

x, y

A pair of lazy data frames backed by database queries.

copy

If x and y are not from the same data source, and copy is TRUE, then y will be copied into a temporary table in the same database as x. ⁠*_join()⁠ will automatically run ANALYZE on the created table in the hope that this will make your queries as efficient as possible by giving more data to the query planner.

This allows you to join tables across srcs, but it's a potentially expensive operation so you must opt into it.

...

Not currently used; provided for future extensions.

all

If TRUE, includes all matches in output, not just unique rows.


Table paths

Description

dbplyr standardises all the ways of referring to a table (i.e. a single string, a string wrapped in I(), a DBI::Id() and the results of in_schema() and in_catalog()) into a table "path" of the form table, schema.table, or catalog.schema.table. A table path is always suitable for inlining into a query, so user input is quoted unless it is wrapped in I().

This is primarily for internal usage, but you may need to work with it if you're implementing a backend, and you need to compute with the table path, not just pass it on unchanged to some other dbplyr function.

A table_path object can technically be a vector of table paths, but you will never see this in table paths constructed from user inputs.

Usage

is_table_path(x)

table_path_name(x, con)

table_path_components(x, con)

check_table_path(x, error_arg = caller_arg(x), error_call = caller_env())

as_table_path(x, con, error_arg = caller_arg(x), error_call = caller_env())

Join SQL tables

Description

These are methods for the dplyr join generics. They are translated to the following SQL queries:

  • inner_join(x, y): SELECT * FROM x JOIN y ON x.a = y.a

  • left_join(x, y): SELECT * FROM x LEFT JOIN y ON x.a = y.a

  • right_join(x, y): SELECT * FROM x RIGHT JOIN y ON x.a = y.a

  • full_join(x, y): SELECT * FROM x FULL JOIN y ON x.a = y.a

  • semi_join(x, y): SELECT * FROM x WHERE EXISTS (SELECT 1 FROM y WHERE x.a = y.a)

  • anti_join(x, y): SELECT * FROM x WHERE NOT EXISTS (SELECT 1 FROM y WHERE x.a = y.a)

Usage

## S3 method for class 'tbl_lazy'
inner_join(
  x,
  y,
  by = NULL,
  copy = FALSE,
  suffix = NULL,
  ...,
  keep = NULL,
  na_matches = c("never", "na"),
  multiple = NULL,
  unmatched = "drop",
  relationship = NULL,
  sql_on = NULL,
  auto_index = FALSE,
  x_as = NULL,
  y_as = NULL
)

## S3 method for class 'tbl_lazy'
left_join(
  x,
  y,
  by = NULL,
  copy = FALSE,
  suffix = NULL,
  ...,
  keep = NULL,
  na_matches = c("never", "na"),
  multiple = NULL,
  unmatched = "drop",
  relationship = NULL,
  sql_on = NULL,
  auto_index = FALSE,
  x_as = NULL,
  y_as = NULL
)

## S3 method for class 'tbl_lazy'
right_join(
  x,
  y,
  by = NULL,
  copy = FALSE,
  suffix = NULL,
  ...,
  keep = NULL,
  na_matches = c("never", "na"),
  multiple = NULL,
  unmatched = "drop",
  relationship = NULL,
  sql_on = NULL,
  auto_index = FALSE,
  x_as = NULL,
  y_as = NULL
)

## S3 method for class 'tbl_lazy'
full_join(
  x,
  y,
  by = NULL,
  copy = FALSE,
  suffix = NULL,
  ...,
  keep = NULL,
  na_matches = c("never", "na"),
  multiple = NULL,
  relationship = NULL,
  sql_on = NULL,
  auto_index = FALSE,
  x_as = NULL,
  y_as = NULL
)

## S3 method for class 'tbl_lazy'
cross_join(
  x,
  y,
  ...,
  copy = FALSE,
  suffix = c(".x", ".y"),
  x_as = NULL,
  y_as = NULL
)

## S3 method for class 'tbl_lazy'
semi_join(
  x,
  y,
  by = NULL,
  copy = FALSE,
  ...,
  na_matches = c("never", "na"),
  sql_on = NULL,
  auto_index = FALSE,
  x_as = NULL,
  y_as = NULL
)

## S3 method for class 'tbl_lazy'
anti_join(
  x,
  y,
  by = NULL,
  copy = FALSE,
  ...,
  na_matches = c("never", "na"),
  sql_on = NULL,
  auto_index = FALSE,
  x_as = NULL,
  y_as = NULL
)

Arguments

x, y

A pair of lazy data frames backed by database queries.

by

A join specification created with join_by(), or a character vector of variables to join by.

If NULL, the default, ⁠*_join()⁠ will perform a natural join, using all variables in common across x and y. A message lists the variables so that you can check they're correct; suppress the message by supplying by explicitly.

To join on different variables between x and y, use a join_by() specification. For example, join_by(a == b) will match x$a to y$b.

To join by multiple variables, use a join_by() specification with multiple expressions. For example, join_by(a == b, c == d) will match x$a to y$b and x$c to y$d. If the column names are the same between x and y, you can shorten this by listing only the variable names, like join_by(a, c).

join_by() can also be used to perform inequality, rolling, and overlap joins. See the documentation at ?join_by for details on these types of joins.

For simple equality joins, you can alternatively specify a character vector of variable names to join by. For example, by = c("a", "b") joins x$a to y$a and x$b to y$b. If variable names differ between x and y, use a named character vector like by = c("x_a" = "y_a", "x_b" = "y_b").

To perform a cross-join, generating all combinations of x and y, see cross_join().

copy

If x and y are not from the same data source, and copy is TRUE, then y will be copied into a temporary table in the same database as x. ⁠*_join()⁠ will automatically run ANALYZE on the created table in the hope that this will make your queries as efficient as possible by giving more data to the query planner.

This allows you to join tables across srcs, but it's a potentially expensive operation so you must opt into it.

suffix

If there are non-joined duplicate variables in x and y, these suffixes will be added to the output to disambiguate them. Should be a character vector of length 2.

...

Other parameters passed onto methods.

keep

Should the join keys from both x and y be preserved in the output?

  • If NULL, the default, joins on equality retain only the keys from x, while joins on inequality retain the keys from both inputs.

  • If TRUE, all keys from both inputs are retained.

  • If FALSE, only keys from x are retained. For right and full joins, the data in key columns corresponding to rows that only exist in y are merged into the key columns from x. Can't be used when joining on inequality conditions.

na_matches

Should NA (NULL) values match one another? The default, "never", is how databases usually work. "na" makes the joins behave like the dplyr join functions, merge(), match(), and %in%.

multiple, unmatched

Unsupported in database backends. As a workaround for multiple use a unique key and for unmatched a foreign key constraint.

relationship

Unsupported in database backends.

sql_on

A custom join predicate as an SQL expression. Usually joins use column equality, but you can perform more complex queries by supplying sql_on, which should be a SQL expression that uses LHS and RHS aliases to refer to the left-hand side and right-hand side of the join respectively.

auto_index

if copy is TRUE, automatically create indices for the variables in by. This may speed up the join if there are matching indexes in x.

x_as, y_as

Aliases to use for x and y respectively. Default to "LHS" and "RHS".

Value

Another tbl_lazy. Use show_query() to see the generated query, and use collect() to execute the query and return data to R.

Examples

library(dplyr, warn.conflicts = FALSE)

band_db <- tbl_memdb(dplyr::band_members)
instrument_db <- tbl_memdb(dplyr::band_instruments)
band_db %>% left_join(instrument_db) %>% show_query()

# Can join with local data frames by setting copy = TRUE
band_db %>%
  left_join(dplyr::band_instruments, copy = TRUE)

# Unlike R, joins in SQL don't usually match NAs (NULLs)
db <- memdb_frame(x = c(1, 2, NA))
label <- memdb_frame(x = c(1, NA), label = c("one", "missing"))
db %>% left_join(label, by = "x")
# But you can activate R's usual behaviour with the na_matches argument
db %>% left_join(label, by = "x", na_matches = "na")

# By default, joins are equijoins, but you can use `sql_on` to
# express richer relationships
db1 <- memdb_frame(x = 1:5)
db2 <- memdb_frame(x = 1:3, y = letters[1:3])
db1 %>% left_join(db2) %>% show_query()
db1 %>% left_join(db2, sql_on = "LHS.x < RHS.x") %>% show_query()

Cache and retrieve an src_sqlite of the Lahman baseball database.

Description

This creates an interesting database using data from the Lahman baseball data source, provided by Sean Lahman, and made easily available in R through the Lahman package by Michael Friendly, Dennis Murphy and Martin Monkman. See the documentation for that package for details of the individual tables.

Usage

lahman_sqlite(path = NULL)

lahman_postgres(dbname = "lahman", host = "localhost", ...)

lahman_mysql(dbname = "lahman", ...)

copy_lahman(con, ...)

has_lahman(type, ...)

lahman_srcs(..., quiet = NULL)

Arguments

...

Other arguments passed to src on first load. For MySQL and PostgreSQL, the defaults assume you have a local server with lahman database already created. For lahman_srcs(), character vector of names giving srcs to generate.

type

src type.

quiet

if TRUE, suppress messages about databases failing to connect.

Examples

# Connect to a local sqlite database, if already created

library(dplyr)

if (has_lahman("sqlite")) {
  lahman_sqlite()
  batting <- tbl(lahman_sqlite(), "Batting")
  batting
}

# Connect to a local postgres database with lahman database, if available
if (has_lahman("postgres")) {
  lahman_postgres()
  batting <- tbl(lahman_postgres(), "Batting")
}


Build and render SQL from a sequence of lazy operations

Description

sql_build() creates a select_query S3 object that is rendered to a SQL string by sql_render(). The output from sql_build() is designed to be easy to test, as it's database agnostic and has a hierarchical structure. Outside of testing, however, you should always call sql_render().

Usage

lazy_multi_join_query(
  x,
  joins,
  table_names,
  vars,
  group_vars = op_grps(x),
  order_vars = op_sort(x),
  frame = op_frame(x),
  call = caller_env()
)

lazy_rf_join_query(
  x,
  y,
  type,
  by,
  table_names,
  vars,
  group_vars = op_grps(x),
  order_vars = op_sort(x),
  frame = op_frame(x),
  call = caller_env()
)

lazy_semi_join_query(
  x,
  y,
  vars,
  anti,
  by,
  where,
  group_vars = op_grps(x),
  order_vars = op_sort(x),
  frame = op_frame(x),
  call = caller_env()
)

lazy_query(
  query_type,
  x,
  ...,
  group_vars = op_grps(x),
  order_vars = op_sort(x),
  frame = op_frame(x)
)

lazy_select_query(
  x,
  select = NULL,
  where = NULL,
  group_by = NULL,
  having = NULL,
  order_by = NULL,
  limit = NULL,
  distinct = FALSE,
  group_vars = NULL,
  order_vars = NULL,
  frame = NULL,
  select_operation = c("select", "mutate", "summarise"),
  message_summarise = NULL
)

lazy_set_op_query(x, y, type, all, call = caller_env())

lazy_union_query(x, unions, call = caller_env())

sql_build(op, con = NULL, ..., sql_options = NULL)

sql_render(
  query,
  con = NULL,
  ...,
  sql_options = NULL,
  subquery = FALSE,
  lvl = 0
)

sql_optimise(x, con = NULL, ..., subquery = FALSE)

join_query(
  x,
  y,
  select,
  ...,
  type = "inner",
  by = NULL,
  suffix = c(".x", ".y"),
  na_matches = FALSE
)

select_query(
  from,
  select = sql("*"),
  where = character(),
  group_by = character(),
  having = character(),
  window = character(),
  order_by = character(),
  limit = NULL,
  distinct = FALSE,
  from_alias = NULL
)

semi_join_query(
  x,
  y,
  vars,
  anti = FALSE,
  by = NULL,
  where = NULL,
  na_matches = FALSE
)

set_op_query(x, y, type, all = FALSE)

union_query(x, unions)

Arguments

...

Other arguments passed on to the methods. Not currently used.

op

A sequence of lazy operations

con

A database connection. The default NULL uses a set of rules that should be very similar to ANSI 92, and allows for testing without an active database connection.

sql_options

SQL rendering options generated by sql_options().

subquery

Is this SQL going to be used in a subquery? This is important because you can place a bare table name in a subquery and ORDER BY does not work in subqueries.

Details

sql_build() is generic over the lazy operations, lazy_ops, and generates an S3 object that represents the query. sql_render() takes a query object and then calls a function that is generic over the database. For example, sql_build.op_mutate() generates a select_query, and sql_render.select_query() calls sql_select(), which has different methods for different databases. The default methods should generate ANSI 92 SQL where possible, so backends only need to override the methods if the backend is not ANSI compliant.
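
Examples

# A minimal sketch of the build/render split, using a simulated connection
# so no database is needed:
library(dplyr, warn.conflicts = FALSE)

lf <- lazy_frame(x = 1, y = 2)
lf %>% mutate(z = x + y) %>% sql_build()   # hierarchical select_query object
lf %>% mutate(z = x + y) %>% sql_render()  # the final SQL string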


Lazy operations

Description

This set of S3 classes describes the action of dplyr verbs. These are currently used for SQL sources to separate the description of operations in R from their computation in SQL. This API is very new so it is likely to evolve in the future.

Usage

lazy_base_query(x, vars, class = character(), ...)

op_grps(op)

op_vars(op)

op_sort(op)

op_frame(op)

Details

op_vars() and op_grps() compute the variables and groups from a sequence of lazy operations. op_sort() and op_frame() tracks the order and frame for use in window functions.
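
Examples

# A brief sketch of these accessors on a simulated lazy frame (the output
# shown in comments is indicative):
library(dplyr, warn.conflicts = FALSE)

lf <- lazy_frame(g = 1, x = 2) %>% group_by(g)
op_vars(lf)  # c("g", "x")
op_grps(lf)  # "g"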


Create a database table in temporary in-memory database.

Description

memdb_frame() works like tibble::tibble(), but instead of creating a new data frame in R, it creates a table in src_memdb().

Usage

memdb_frame(..., .name = unique_table_name())

tbl_memdb(df, name = deparse(substitute(df)))

src_memdb()

Arguments

...

<dynamic-dots> A set of name-value pairs. These arguments are processed with rlang::quos() and support unquote via !! and unquote-splice via !!!. Use ⁠:=⁠ to create columns that start with a dot.

Arguments are evaluated sequentially. You can refer to previously created elements directly or using the .data pronoun. To refer explicitly to objects in the calling environment, use !! or .env, e.g. !!.data or .env$.data for the special case of an object named .data.

df

Data frame to copy

name, .name

Name of table in database: defaults to a random name that's unlikely to conflict with an existing table.

Examples

library(dplyr)
df <- memdb_frame(x = runif(100), y = runif(100))
df %>% arrange(x)
df %>% arrange(x) %>% show_query()

mtcars_db <- tbl_memdb(mtcars)
mtcars_db %>% group_by(cyl) %>% summarise(n = n()) %>% show_query()

Create, modify, and delete columns

Description

These are methods for the dplyr mutate() and transmute() generics. They are translated to computed expressions in the SELECT clause of the SQL query.

Usage

## S3 method for class 'tbl_lazy'
mutate(
  .data,
  ...,
  .by = NULL,
  .keep = c("all", "used", "unused", "none"),
  .before = NULL,
  .after = NULL
)

Arguments

.data

A lazy data frame backed by a database query.

...

<data-masking> Name-value pairs. The name gives the name of the column in the output, and the value gives the expression used to compute it.

.by

[Experimental]

<tidy-select> Optionally, a selection of columns to group by for just this operation, functioning as an alternative to group_by(). For details and examples, see ?dplyr_by.

.keep

Control which columns from .data are retained in the output. Grouping columns and columns created by ... are always kept.

  • "all" retains all columns from .data. This is the default.

  • "used" retains only the columns used in ... to create new columns. This is useful for checking your work, as it displays inputs and outputs side-by-side.

  • "unused" retains only the columns not used in ... to create new columns. This is useful if you generate new columns, but no longer need the columns used to generate them.

  • "none" doesn't retain any extra columns from .data. Only the grouping variables and columns created by ... are kept.

.before, .after

<tidy-select> Optionally, control where new columns should appear (the default is to add to the right hand side). See relocate() for more details.

Value

Another tbl_lazy. Use show_query() to see the generated query, and use collect() to execute the query and return data to R.

Examples

library(dplyr, warn.conflicts = FALSE)

db <- memdb_frame(x = 1:5, y = 5:1)
db %>%
  mutate(a = (x + y) / 2, b = sqrt(x^2L + y^2L)) %>%
  show_query()

# dbplyr automatically creates subqueries as needed
db %>%
  mutate(x1 = x + 1, x2 = x1 * 2) %>%
  show_query()
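
# The experimental `.by` argument performs a per-operation grouping;
# a sketch: aggregate functions become window functions partitioned
# by the `.by` columns
db %>%
  mutate(avg = mean(x), .by = y) %>%
  show_query()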

Produce a comma-separated string from the parameters

Description

Produces a comma-separated string from the parameters.

Usage

named_commas(x)

Database versions of the nycflights13 data

Description

These functions cache the data from the nycflights13 database in a local database, for use in examples and vignettes. Indexes are created to make joining tables on natural keys efficient.

Usage

nycflights13_sqlite(path = NULL)

nycflights13_postgres(dbname = "nycflights13", ...)

has_nycflights13(type = c("sqlite", "postgres"), ...)

copy_nycflights13(con, ...)

Arguments

path

location of SQLite database file

dbname, ...

Arguments passed on to src_postgres()
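
Examples

# A sketch (requires the nycflights13 and RSQLite packages); the
# SQLite database is created and cached locally on first use:
if (requireNamespace("nycflights13", quietly = TRUE)) {
  con <- nycflights13_sqlite()
  dplyr::tbl(con, "flights")
}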


Partially evaluate an expression.

Description

This function partially evaluates an expression, using information from the tbl to determine whether names refer to local expressions or remote variables. This simplifies SQL translation because expressions don't need to carry around their environment - all relevant information is incorporated into the expression.

Usage

partial_eval(call, data, env = caller_env(), vars = deprecated(), error_call)

Arguments

call

an unevaluated expression, as produced by quote()

data

A lazy data frame backed by a database query.

env

environment in which to search for local values

vars

[Deprecated]: Pass a lazy frame to data instead.

Symbol substitution

partial_eval() needs to guess if you're referring to a variable on the server (remote), or in the current environment (local). It's not possible to do this 100% perfectly. partial_eval() uses the following heuristic:

  • If the tbl variables are known, and the symbol matches a tbl variable, then remote.

  • If the symbol is defined locally, local.

  • Otherwise, remote.

You can override the guesses using local() and remote() to force computation, or by using the .data and .env pronouns of tidy evaluation.

Examples

lf <- lazy_frame(year = 1980, id = 1)
partial_eval(quote(year > 1980), data = lf)

ids <- c("ansonca01", "forceda01", "mathebo01")
partial_eval(quote(id %in% ids), lf)

# cf.
partial_eval(quote(id == .data$id), lf)

# You can use local() or .env to disambiguate between local and remote
# variables: otherwise remote is always preferred
year <- 1980
partial_eval(quote(year > year), lf)
partial_eval(quote(year > local(year)), lf)
partial_eval(quote(year > .env$year), lf)

# Functions are always assumed to be remote. Use local to force evaluation
# in R.
f <- function(x) x + 1
partial_eval(quote(year > f(1980)), lf)
partial_eval(quote(year > local(f(1980))), lf)

Pivot data from wide to long

Description

pivot_longer() "lengthens" data, increasing the number of rows and decreasing the number of columns. The inverse transformation is tidyr::pivot_wider().

Learn more in vignette("pivot", "tidyr").

While most functionality is identical, there are some differences from pivot_longer() on local data frames:

  • the output is sorted differently/not explicitly,

  • the coercion of mixed column types is left to the database,

  • values_ptypes is not supported.

Note that build_longer_spec() and pivot_longer_spec() do not work with remote tables.

Usage

## S3 method for class 'tbl_lazy'
pivot_longer(
  data,
  cols,
  ...,
  cols_vary,
  names_to = "name",
  names_prefix = NULL,
  names_sep = NULL,
  names_pattern = NULL,
  names_ptypes = NULL,
  names_transform = NULL,
  names_repair = "check_unique",
  values_to = "value",
  values_drop_na = FALSE,
  values_ptypes,
  values_transform = NULL
)

Arguments

data

A data frame to pivot.

cols

Columns to pivot into longer format.

...

Additional arguments passed on to methods.

cols_vary

Unsupported; included for compatibility with the generic.

names_to

A string specifying the name of the column to create from the data stored in the column names of data.

names_prefix

A regular expression used to remove matching text from the start of each variable name.

names_sep, names_pattern

If names_to contains multiple values, these arguments control how the column name is broken up.

names_ptypes

A list of column name-prototype pairs.

names_transform, values_transform

A list of column name-function pairs.

names_repair

What happens if the output has invalid column names?

values_to

A string specifying the name of the column to create from the data stored in cell values. If names_to is a character containing the special .value sentinel, this value will be ignored, and the name of the value column will be derived from part of the existing column names.

values_drop_na

If TRUE, will drop rows that contain only NAs in the values_to column.

values_ptypes

Not supported.

Details

The SQL translation basically works as follows:

  1. split the specification by its key columns, i.e. by variables crammed into the column names.

  2. for each part of the split specification, transmute() data into the following columns:

    • id columns, i.e. columns that are not pivotted

    • key columns

    • value columns, i.e. columns that are pivotted

  3. combine all the parts with union_all() (shown in the Examples below).

Examples


# See vignette("pivot") for examples and explanation

# Simplest case where column names are character data
memdb_frame(
  id = c("a", "b"),
  x = 1:2,
  y = 3:4
) %>%
  tidyr::pivot_longer(-id)
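
# The union_all() translation described in Details is visible with
# show_query(); a sketch:
memdb_frame(
  id = c("a", "b"),
  x = 1:2,
  y = 3:4
) %>%
  tidyr::pivot_longer(-id) %>%
  show_query()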


Pivot data from long to wide

Description

pivot_wider() "widens" data, increasing the number of columns and decreasing the number of rows. The inverse transformation is pivot_longer(). Learn more in vignette("pivot", "tidyr").

Note that pivot_wider() is not and cannot be lazy because we need to look at the data to figure out what the new column names will be. If you have a long-running query you have two options:

  • (temporarily) store the result of the query via compute(), or

  • create a spec up front and use dbplyr_pivot_wider_spec() (a sketch appears at the end of the Examples below).

Usage

## S3 method for class 'tbl_lazy'
pivot_wider(
  data,
  ...,
  id_cols = NULL,
  id_expand = FALSE,
  names_from = name,
  names_prefix = "",
  names_sep = "_",
  names_glue = NULL,
  names_sort = FALSE,
  names_vary = "fastest",
  names_expand = FALSE,
  names_repair = "check_unique",
  values_from = value,
  values_fill = NULL,
  values_fn = ~max(.x, na.rm = TRUE),
  unused_fn = NULL
)

dbplyr_pivot_wider_spec(
  data,
  spec,
  ...,
  names_repair = "check_unique",
  id_cols = NULL,
  id_expand = FALSE,
  values_fill = NULL,
  values_fn = ~max(.x, na.rm = TRUE),
  unused_fn = NULL,
  error_call = current_env()
)

Arguments

data

A lazy data frame backed by a database query.

...

Unused; included for compatibility with generic.

id_cols

A set of columns that uniquely identifies each observation.

id_expand

Unused; included for compatibility with the generic.

names_from, values_from

A pair of arguments describing which column (or columns) to get the name of the output column (names_from), and which column (or columns) to get the cell values from (values_from).

If values_from contains multiple values, the value will be added to the front of the output column.

names_prefix

String added to the start of every variable name.

names_sep

If names_from or values_from contains multiple variables, this will be used to join their values together into a single string to use as a column name.

names_glue

Instead of names_sep and names_prefix, you can supply a glue specification that uses the names_from columns (and special .value) to create custom column names.

names_sort

Should the column names be sorted? If FALSE, the default, column names are ordered by first appearance.

names_vary

When names_from identifies a column (or columns) with multiple unique values, and multiple values_from columns are provided, in what order should the resulting column names be combined?

  • "fastest" varies names_from values fastest, resulting in a column naming scheme of the form: ⁠value1_name1, value1_name2, value2_name1, value2_name2⁠. This is the default.

  • "slowest" varies names_from values slowest, resulting in a column naming scheme of the form: ⁠value1_name1, value2_name1, value1_name2, value2_name2⁠.

names_expand

Should the values in the names_from columns be expanded by expand() before pivoting? This results in more columns: the output will contain column names corresponding to a complete expansion of all possible values in names_from. Additionally, the column names will be sorted, identical to what names_sort would produce.

names_repair

What happens if the output has invalid column names?

values_fill

Optionally, a (scalar) value that specifies what each value should be filled in with when missing.

values_fn

A function applied to the value in each cell in the output; the default is max(). In contrast to local data frames, it must not be NULL.

unused_fn

Optionally, a function applied to summarize the values from the unused columns (i.e. columns not identified by id_cols, names_from, or values_from).

The default drops all unused columns from the result.

This can be a named list if you want to apply different aggregations to different unused columns.

id_cols must be supplied for unused_fn to be useful, since otherwise all unspecified columns will be considered id_cols.

This is similar to grouping by the id_cols then summarizing the unused columns using unused_fn.

spec

A specification data frame. This is useful for more complex pivots because it gives you greater control on how metadata stored in the columns become column names in the result.

Must be a data frame containing character .name and .value columns. Additional columns in spec should be named to match columns in the long format of the dataset and contain values corresponding to columns pivoted from the wide format. The special .seq variable is used to disambiguate rows internally; it is automatically removed after pivoting.

error_call

The execution environment of a currently running function, e.g. caller_env(). The function will be mentioned in error messages as the source of the error. See the call argument of abort() for more information.

Details

The big difference from pivot_wider() for local data frames is that values_fn must not be NULL. By default it is max(), which yields the same results as for local data frames if the combination of id_cols and value column uniquely identifies an observation. Note that, unlike with local data frames, you do not get a warning if an observation is not uniquely identified.

The translation to SQL code basically works as follows:

  1. Get unique keys in names_from column.

  2. For each key value generate an expression of the form:

    value_fn(
      CASE WHEN (`names from column` = `key value`)
      THEN (`value column`)
      END
    ) AS `output column`
    
  3. Group data by id columns.

  4. Summarise the grouped data with the expressions from step 2.

Examples


memdb_frame(
  id = 1,
  key = c("x", "y"),
  value = 1:2
) %>%
  tidyr::pivot_wider(
    id_cols = id,
    names_from = key,
    values_from = value
  )
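
# For a long-running query you can instead build the spec up front and
# pivot with dbplyr_pivot_wider_spec(); a sketch (the spec needs .name
# and .value columns plus one column per names_from variable, here `key`):
spec <- tibble::tibble(
  .name = c("x", "y"),
  .value = "value",
  key = c("x", "y")
)

memdb_frame(
  id = 1,
  key = c("x", "y"),
  value = 1:2
) %>%
  dbplyr_pivot_wider_spec(spec)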


Extract a single column

Description

This is a method for the dplyr pull() generic. It evaluates the query retrieving just the specified column.

Usage

## S3 method for class 'tbl_sql'
pull(.data, var = -1, name = NULL, ...)

Arguments

.data

A lazy data frame backed by a database query.

var

A variable specified as:

  • a literal variable name

  • a positive integer, giving the position counting from the left

  • a negative integer, giving the position counting from the right.

The default returns the last column (on the assumption that's the column you've created most recently).

This argument is taken by expression and supports quasiquotation (you can unquote column names and column locations).

name

An optional parameter that specifies the column to be used as names for a named vector. Specified in a similar manner as var.

...

For use by methods.

Value

A vector of data.

Examples

library(dplyr, warn.conflicts = FALSE)

db <- memdb_frame(x = 1:5, y = 5:1)
db %>%
  mutate(z = x + y * 2) %>%
  pull()

Objects exported from other packages

Description

These objects are imported from other packages. Follow the links below to see their documentation.

magrittr

%>%


Metadata about a remote table

Description

remote_name() gives the unescaped name of the remote table, or NULL if it is a query (created by sql()) or already escaped (created by ident_q()). remote_table() gives the remote table or the query. remote_query() gives the text of the query, and remote_query_plan() the query plan (as computed by the remote database). remote_src() and remote_con() give the dplyr source and DBI connection respectively.

Usage

remote_name(x, null_if_local = TRUE)

remote_table(x, null_if_local = TRUE)

remote_src(x)

remote_con(x)

remote_query(x, cte = FALSE, sql_options = NULL)

remote_query_plan(x, ...)

Arguments

x

Remote table, currently must be a tbl_sql.

null_if_local

Return NULL if the remote table is created via tbl_lazy() or lazy_frame()?

cte

[Deprecated] Use the sql_options argument instead.

sql_options

[Experimental] SQL rendering options generated by sql_options().

...

Additional arguments passed on to methods.

Value

The value, or NULL if it is not a remote table or the property is not applicable. For example, computed queries do not have a "name".

Examples

mf <- memdb_frame(x = 1:5, y = 5:1, .name = "blorp")
remote_name(mf)
remote_src(mf)
remote_con(mf)
remote_query(mf)

mf2 <- dplyr::filter(mf, x > 3)
remote_name(mf2)
remote_src(mf2)
remote_con(mf2)
remote_query(mf2)

Replace NAs with specified values

Description

This is a method for the tidyr::replace_na() generic.

Usage

## S3 method for class 'tbl_lazy'
replace_na(data, replace = list(), ...)

Arguments

data

A lazy data frame backed by a database query.

replace

A named list of values, with one value for each column that has NA values to be replaced.

...

Unused; included for compatibility with generic.

Value

Another tbl_lazy. Use show_query() to see the generated query, and use collect() to execute the query and return data to R.

Examples


df <- memdb_frame(x = c(1, 2, NA), y = c("a", NA, "b"))
df %>% tidyr::replace_na(list(x = 0, y = "unknown"))


Edit individual rows in the underlying database table

Description

These are methods for the dplyr rows_insert(), rows_append(), rows_update(), rows_patch(), rows_upsert(), and rows_delete() generics.

When in_place = TRUE these verbs do not generate SELECT queries, but instead directly modify the underlying data using INSERT, UPDATE, or DELETE operators. This will require that you have write access to the database: the connection needs permission to insert, modify or delete rows, but not to alter the structure of the table.

The default, in_place = FALSE, generates equivalent lazy tables (using SELECT queries) that allow previewing the result without actually modifying the underlying table on the database.

Usage

## S3 method for class 'tbl_lazy'
rows_insert(
  x,
  y,
  by = NULL,
  ...,
  conflict = c("error", "ignore"),
  copy = FALSE,
  in_place = FALSE,
  returning = NULL,
  method = NULL
)

## S3 method for class 'tbl_lazy'
rows_append(x, y, ..., copy = FALSE, in_place = FALSE, returning = NULL)

## S3 method for class 'tbl_lazy'
rows_update(
  x,
  y,
  by = NULL,
  ...,
  unmatched = c("error", "ignore"),
  copy = FALSE,
  in_place = FALSE,
  returning = NULL
)

## S3 method for class 'tbl_lazy'
rows_patch(
  x,
  y,
  by = NULL,
  ...,
  unmatched = c("error", "ignore"),
  copy = FALSE,
  in_place = FALSE,
  returning = NULL
)

## S3 method for class 'tbl_lazy'
rows_upsert(
  x,
  y,
  by = NULL,
  ...,
  copy = FALSE,
  in_place = FALSE,
  returning = NULL,
  method = NULL
)

## S3 method for class 'tbl_lazy'
rows_delete(
  x,
  y,
  by = NULL,
  ...,
  unmatched = c("error", "ignore"),
  copy = FALSE,
  in_place = FALSE,
  returning = NULL
)

Arguments

x

A lazy table. For in_place = TRUE, this must be a table instantiated with tbl() or compute(), not a lazy query. The remote_name() function is used to determine the name of the table to be updated.

y

A lazy table, data frame, or data frame extensions (e.g. a tibble).

by

An unnamed character vector giving the key columns. The key columns must exist in both x and y. Keys typically uniquely identify each row, but this is only enforced for the key values of y when rows_update(), rows_patch(), or rows_upsert() are used.

By default, we use the first column in y, since the first column is a reasonable place to put an identifier variable.

...

Other parameters passed onto methods.

conflict

For rows_insert(), how should keys in y that conflict with keys in x be handled? A conflict arises if there is a key in y that already exists in x.

One of:

  • "error", the default, is not supported for database tables. To get the same behaviour add a unique index on the by columns and use rows_append().

  • "ignore" will ignore rows in y with keys that conflict with keys in x.

copy

If x and y are not from the same data source, and copy is TRUE, then y will be copied into the same src as x. This allows you to join tables across srcs, but it is a potentially expensive operation so you must opt into it.

in_place

Should x be modified in place? If FALSE will generate a SELECT query that returns the modified table; if TRUE will modify the underlying table using a DML operation (INSERT, UPDATE, DELETE or similar).

returning

Columns to return. See get_returned_rows() for details.

method

A string specifying the method to use. This is only relevant for in_place = TRUE.

unmatched

For rows_update(), rows_patch(), and rows_delete(), how should keys in y that are unmatched by the keys in x be handled?

One of:

  • "error", the default, is not supported for database tables. Add a foreign key constraint on the by columns of y to let the database check this behaviour for you.

  • "ignore" will ignore rows in y with keys that are unmatched by the keys in x.

Value

A new tbl_lazy of the modified data. With in_place = FALSE, the result is a lazy query that prints visibly, because the purpose of this operation is to preview the results. With in_place = TRUE, x is returned invisibly, because the purpose of this operation is the side effect of modifying rows in the table behind x.

Examples

library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbExecute(con, "CREATE TABLE Ponies (
   id INTEGER PRIMARY KEY AUTOINCREMENT,
   name TEXT,
   cutie_mark TEXT
)")

ponies <- tbl(con, "Ponies")

applejack <- copy_inline(con, data.frame(
  name = "Apple Jack",
  cutie_mark = "three apples"
))

# The default behavior is to generate a SELECT query
rows_insert(ponies, applejack, conflict = "ignore")
# And the original table is left unchanged:
ponies

# You can also choose to modify the table with in_place = TRUE:
rows_insert(ponies, applejack, conflict = "ignore", in_place = TRUE)
# In this case `rows_insert()` returns nothing and the underlying
# data is modified
ponies

Subset, rename, and reorder columns using their names

Description

These are methods for the dplyr select(), rename(), and relocate() generics. They generate the SELECT clause of the SQL query.

These functions do not support predicate functions, i.e. you cannot use where(is.numeric) to select all numeric variables.

Usage

## S3 method for class 'tbl_lazy'
select(.data, ...)

## S3 method for class 'tbl_lazy'
rename(.data, ...)

## S3 method for class 'tbl_lazy'
rename_with(.data, .fn, .cols = everything(), ...)

## S3 method for class 'tbl_lazy'
relocate(.data, ..., .before = NULL, .after = NULL)

Arguments

.data

A lazy data frame backed by a database query.

...

<tidy-select> One or more unquoted expressions separated by commas. For select(), variable names can be used as if they were positions; for rename(), use new_name = old_name syntax.

.fn

A function used to transform the selected .cols. Should return a character vector the same length as the input.

.cols

<tidy-select> Columns to rename; defaults to all columns.

.before, .after

<tidy-select> Destination of columns selected by .... Supplying neither will move columns to the left-hand side; specifying both is an error.

Examples

library(dplyr, warn.conflicts = FALSE)

db <- memdb_frame(x = 1, y = 2, z = 3)
db %>% select(-y) %>% show_query()
db %>% relocate(z) %>% show_query()
db %>% rename(first = x, last = z) %>% show_query()

Simulate database connections

Description

These functions generate S3 objects that have been designed to simulate the action of a database connection, without actually having the database available. Obviously, this simulation can only be incomplete, but most importantly it allows us to simulate SQL generation for any database without actually connecting to it.

Usage

simulate_dbi(class = character(), ...)

Details

Simulated SQL always quotes identifiers with `x`, and strings with 'x'.
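
Examples

# A quick sketch of the quoting rules, using translate_sql():
con <- simulate_dbi()
translate_sql(x == "a", con = con)
# `x` = 'a'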


SQL escaping.

Description

These functions are critical when writing functions that translate R functions to sql functions. Typically a conversion function should escape all its inputs and return an sql object.

Usage

sql(...)

is.sql(x)

as.sql(x, con)

Arguments

...

Character vectors that will be combined into a single SQL expression.

x

Object to coerce

con

Needed when x is directly supplied from the user so that schema specifications can be quoted using the correct identifiers.
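
Examples

# A short sketch:
con <- simulate_dbi()
sql("SELECT * FROM foo")  # marked as SQL, so it will not be escaped
is.sql(sql("x"))          # TRUE
as.sql("my_table", con)   # coerces a string to a table identifier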


Generate SQL from R expressions

Description

Low-level building block for generating SQL from R expressions. Strings are escaped; names become bare SQL identifiers. User infix functions have ⁠%⁠ stripped.

Usage

sql_expr(x, con = sql_current_con())

sql_call2(.fn, ..., con = sql_current_con())

Arguments

x

A quasiquoted expression

con

Connection to use for escaping. Will be set automatically when called from a function translation.

.fn

Function name (as string, call, or symbol)

...

Arguments to function

Details

Using sql_expr() in a package will require use of globalVariables() to avoid ⁠R CMD check⁠ NOTES. This is a small amount of additional pain, which I think is worthwhile because it leads to more readable translation code.

Examples

con <- simulate_dbi() # not necessary when writing translations

sql_expr(f(x + 1), con = con)
sql_expr(f("x", "y"), con = con)
sql_expr(f(x, y), con = con)

x <- ident("x")
sql_expr(f(!!x, y), con = con)

sql_expr(cast("x" %as% DECIMAL), con = con)
sql_expr(round(x) %::% numeric, con = con)

sql_call2("+", quote(x), 1, con = con)
sql_call2("+", "x", 1, con = con)

Options for generating SQL

Description

Options for generating SQL

Usage

sql_options(cte = FALSE, use_star = TRUE, qualify_all_columns = FALSE)

Arguments

cte

If FALSE, the default, subqueries are used. If TRUE, common table expressions are used.

use_star

If TRUE, the default, * is used to select all columns of a table. If FALSE all columns are explicitly selected.

qualify_all_columns

If FALSE, the default, columns are only qualified with the table they come from if the same column name appears in multiple tables. If TRUE, all columns are qualified with their table name.

Value

A <dbplyr_sql_options> object.

Examples

library(dplyr, warn.conflicts = FALSE)
lf1 <- lazy_frame(key = 1, a = 1, b = 2)
lf2 <- lazy_frame(key = 1, a = 1, c = 3)

result <- left_join(lf1, lf2, by = "key") %>%
  filter(c >= 3)

show_query(result)
sql_options <- sql_options(cte = TRUE, qualify_all_columns = TRUE)
show_query(result, sql_options = sql_options)
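
# use_star = FALSE spells out every column instead of `SELECT *`;
# a sketch:
show_query(result, sql_options = sql_options(use_star = FALSE))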

Generate SQL for Insert, Update, Upsert, and Delete

Description

These functions generate the SQL used in rows_*(in_place = TRUE).

Usage

sql_query_insert(
  con,
  table,
  from,
  insert_cols,
  by,
  ...,
  conflict = c("error", "ignore"),
  returning_cols = NULL,
  method = NULL
)

sql_query_append(con, table, from, insert_cols, ..., returning_cols = NULL)

sql_query_update_from(
  con,
  table,
  from,
  by,
  update_values,
  ...,
  returning_cols = NULL
)

sql_query_upsert(
  con,
  table,
  from,
  by,
  update_cols,
  ...,
  returning_cols = NULL,
  method = NULL
)

sql_query_delete(con, table, from, by, ..., returning_cols = NULL)

Arguments

con

Database connection.

table

Table to update. Must be a table identifier. Use a string to refer to tables in the current schema/catalog or I() to refer to tables in other schemas/catalogs.

from

Table or query that contains the new data. Either a table identifier or SQL.

insert_cols

Names of columns to insert.

by

An unnamed character vector giving the key columns. The key columns must exist in both x and y. Keys typically uniquely identify each row, but this is only enforced for the key values of y when rows_update(), rows_patch(), or rows_upsert() are used.

By default, we use the first column in y, since the first column is a reasonable place to put an identifier variable.

...

Other parameters passed onto methods.

conflict

For rows_insert(), how should keys in y that conflict with keys in x be handled? A conflict arises if there is a key in y that already exists in x.

One of:

  • "error", the default, will error if there are any keys in y that conflict with keys in x.

  • "ignore" will ignore rows in y with keys that conflict with keys in x.

returning_cols

Optional. Names of columns to return.

method

Optional. The method to use.

update_values

A named SQL vector that specifies how to update the columns.

update_cols

Names of columns to update.

Details

Insert Methods

"where_not_exists"

The default for most databases.

INSERT INTO x_name
SELECT *
FROM y
WHERE NOT EXISTS <match on by columns>

"on_conflict"

Supported by:

  • Postgres

  • SQLite

This method uses the ⁠ON CONFLICT⁠ clause and therefore requires a unique index on the columns specified in by.

Upsert Methods

"merge"

The upsert method according to the SQL standard. It uses the MERGE statement:

MERGE INTO x_name
USING y
  ON <match on by columns>
WHEN MATCHED THEN
  UPDATE SET ...
WHEN NOT MATCHED THEN
  INSERT ...

"on_conflict"

Supported by:

  • Postgres

  • SQLite

This method uses the ⁠ON CONFLICT⁠ clause and therefore requires a unique index on the columns specified in by.

"cte_update"

Supported by:

  • Postgres

  • SQLite

The classical way to upsert in Postgres and SQLite before support for ⁠ON CONFLICT⁠ was added. The update is done in a CTE clause and the unmatched values are then inserted outside of the CTE.

Value

A SQL query.

Examples

sql_query_upsert(
  con = simulate_postgres(),
  table = ident("airlines"),
  from = ident("df"),
  by = "carrier",
  update_cols = "name"
)
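
# A sketch of the analogous insert query:
sql_query_insert(
  con = simulate_postgres(),
  table = ident("airlines"),
  from = ident("df"),
  insert_cols = c("carrier", "name"),
  by = "carrier",
  conflict = "ignore"
)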

Helper function for quoting sql elements.

Description

If the quote character is present in the string, it will be doubled. NAs will be replaced with NULL.

Usage

sql_quote(x, quote)

Arguments

x

Character vector to escape.

quote

Single quoting character.

Examples

sql_quote("abc", "'")
sql_quote("I've had a good day", "'")
sql_quote(c("abc", NA), "'")

Create an sql translator

Description

When creating a package that maps to a new SQL based src, you'll often want to provide some additional mappings from common R commands to the commands that your tbl provides. These functions make that easy.

Usage

sql_substr(f = "SUBSTR")

sql_str_sub(subset_f = "SUBSTR", length_f = "LENGTH", optional_length = TRUE)

sql_paste(default_sep, f = "CONCAT_WS")

sql_paste_infix(default_sep, op, cast)

sql_variant(
  scalar = sql_translator(),
  aggregate = sql_translator(),
  window = sql_translator()
)

sql_translator(..., .funs = list(), .parent = new.env(parent = emptyenv()))

sql_infix(f, pad = TRUE)

sql_prefix(f, n = NULL)

sql_aggregate(f, f_r = f)

sql_aggregate_2(f)

sql_aggregate_n(f, f_r = f)

sql_not_supported(f)

sql_cast(type)

sql_try_cast(type)

sql_log()

sql_cot()

sql_runif(rand_expr, n = n(), min = 0, max = 1)

base_scalar

base_agg

base_win

base_no_win

base_odbc_scalar

base_odbc_agg

base_odbc_win

Arguments

f

the name of the sql function as a string

scalar, aggregate, window

The three families of functions that an SQL variant can supply.

..., .funs

named functions, used to add custom converters from standard R functions to sql functions. Specify individually in ..., or provide a list of .funs

.parent

the sql variant that this variant should inherit from. Defaults to base_agg which provides a standard set of mappings for the most common operators and functions.

pad

If TRUE, the default, pad the infix operator with spaces.

n

for sql_prefix(), an optional number of arguments to expect. Will signal error if not correct.

f_r

the name of the r function being translated as a string

Helper functions

sql_infix() and sql_prefix() create default SQL infix and prefix functions given the name of the SQL function. They don't perform any input checking, but do correctly escape their input, and are useful for quickly providing default wrappers for a new SQL variant.

See Also

win_over() for helper functions for window functions.

sql() for an example of a more customised sql conversion function.

Examples

# An example of adding some mappings for the statistical functions that
# postgresql provides: http://bit.ly/K5EdTn

postgres_agg <- sql_translator(.parent = base_agg,
  cor = sql_aggregate_2("CORR"),
  cov = sql_aggregate_2("COVAR_SAMP"),
  sd =  sql_aggregate("STDDEV_SAMP", "sd"),
  var = sql_aggregate("VAR_SAMP", "var")
)

# Next we have to simulate a connection that uses this variant
con <- simulate_dbi("TestCon")
sql_translation.TestCon <- function(x) {
  sql_variant(
    base_scalar,
    postgres_agg,
    base_no_win
  )
}

translate_sql(cor(x, y), con = con, window = FALSE)
translate_sql(sd(income / years), con = con, window = FALSE)

# Any functions not explicitly listed in the converter will be translated
# to sql as is, so you don't need to convert all functions.
translate_sql(regr_intercept(y, x), con = con)

Database src

Description

[Superseded]

Since you can generate a tbl() directly from a DBI connection, we no longer recommend using src_dbi().

Usage

src_dbi(con, auto_disconnect = FALSE)

Arguments

con

An object that inherits from DBI::DBIConnection, typically generated by DBI::dbConnect

auto_disconnect

Should the connection be automatically closed when the src is deleted? Set to TRUE if you initialize the connection in the call to src_dbi(). Pass NA to auto-disconnect but print a message when this happens.

Value

An S3 object with class src_dbi, src_sql, src.


Create a "sql src" object

Description

Deprecated: please use a DBIConnection object directly instead.

Usage

src_sql(subclass, con, ...)

Arguments

subclass

name of subclass. "src_sql" is an abstract base class, so you must supply this value. src_ is automatically prepended to the class name

con

the connection object

...

fields used by object


Summarise each group to one row

Description

This is a method for the dplyr summarise() generic. It generates the SELECT clause of the SQL query, and generally needs to be combined with group_by().

Usage

## S3 method for class 'tbl_lazy'
summarise(.data, ..., .by = NULL, .groups = NULL)

Arguments

.data

A lazy data frame backed by a database query.

...

<data-masking> Name-value pairs of summary functions. The name gives the name of the column in the result; the value should be an expression that returns a single value.

.by

[Experimental]

<tidy-select> Optionally, a selection of columns to group by for just this operation, functioning as an alternative to group_by(). For details and examples, see ?dplyr_by.

.groups

[Experimental] Grouping structure of the result.

  • "drop_last": dropping the last level of grouping. This was the only supported option before version 1.0.0.

  • "drop": All levels of grouping are dropped.

  • "keep": Same grouping structure as .data.

When .groups is not specified, it defaults to "drop_last".

In addition, a message informs you of that choice, unless the result is ungrouped, the option "dplyr.summarise.inform" is set to FALSE, or when summarise() is called from a function in a package.

Value

Another tbl_lazy. Use show_query() to see the generated query, and use collect() to execute the query and return data to R.

Examples

library(dplyr, warn.conflicts = FALSE)

db <- memdb_frame(g = c(1, 1, 1, 2, 2), x = c(4, 3, 6, 9, 2))
db %>%
  summarise(n()) %>%
  show_query()

db %>%
  group_by(g) %>%
  summarise(n()) %>%
  show_query()

Create a local lazy tibble

Description

These functions are useful for testing SQL generation without having to have an active database connection. See simulate_dbi() for a list of available database simulations.

Usage

tbl_lazy(df, con = NULL, ..., name = "df")

lazy_frame(..., con = NULL, .name = "df")

Examples

library(dplyr)
df <- data.frame(x = 1, y = 2)

df_sqlite <- tbl_lazy(df, con = simulate_sqlite())
df_sqlite %>% summarise(x = sd(x, na.rm = TRUE)) %>% show_query()
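
# lazy_frame() skips the local data frame entirely; a sketch showing
# backend-specific SQL generation (SQL Server uses TOP, not LIMIT):
lf <- lazy_frame(x = 1, y = 2, con = simulate_mssql())
lf %>% head(5) %>% show_query()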

Create an SQL tbl (abstract)

Description

Generally, you should no longer need to provide a custom tbl() method. The default tbl.DBIConnection method should work in most cases.

Usage

tbl_sql(subclass, src, from, ..., vars = NULL, check_from = deprecated())

Arguments

subclass

name of subclass

...

needed for agreement with generic. Not otherwise used.

vars

Provide column names as a character vector to avoid retrieving them from the database. Mainly useful for better performance when creating multiple tbl objects.

check_from

[Deprecated]


Use dplyr verbs with a remote database table

Description

All data manipulation on SQL tbls is lazy: the verbs will not actually run the query or retrieve the data unless you ask for it; instead they return a new tbl_dbi object. Use compute() to run the query and save the results in a temporary table in the database, or use collect() to retrieve the results to R. You can see the query with show_query().

Usage

## S3 method for class 'src_dbi'
tbl(src, from, ...)

Arguments

src

A DBIConnection object produced by DBI::dbConnect().

from

Either a table identifier or a literal sql() string.

Use a string to identify a table in the current schema/catalog. We recommend using I() to identify a table outside the default catalog or schema, e.g. I("schema.table") or I("catalog.schema.table"). You can also use in_schema()/in_catalog() or DBI::Id().

...

Passed on to tbl_sql()

Details

For best performance, the database should have an index on the variables that you are grouping by. Use explain() to check that the database is using the indexes that you expect.

There is one verb that is not lazy: do() is eager because it must pull the data into R.

Examples

library(dplyr)

# Connect to a temporary in-memory SQLite database
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Add some data
copy_to(con, mtcars)
DBI::dbListTables(con)

# To retrieve a single table from a source, use `tbl()`
con %>% tbl("mtcars")

# Use `I()` for qualified table names
con %>% tbl(I("temp.mtcars")) %>% head(1)

# You can also pass raw SQL if you want a more sophisticated query
con %>% tbl(sql("SELECT * FROM mtcars WHERE cyl = 8"))

# If you just want a temporary in-memory database, use src_memdb()
src2 <- src_memdb()

# To show off the full features of dplyr's database integration,
# we'll copy the Lahman batting data into the database.

if (requireNamespace("Lahman", quietly = TRUE)) {
batting <- copy_to(con, Lahman::Batting)
batting

# Basic data manipulation verbs work in the same way as with a tibble
batting %>% filter(yearID > 2005, G > 130)
batting %>% select(playerID:lgID)
batting %>% arrange(playerID, desc(yearID))
batting %>% summarise(G = mean(G), n = n())

# There are a few exceptions. For example, databases give integer results
# when dividing one integer by another. Multiply by 1 to fix the problem
batting %>%
  select(playerID:lgID, AB, R, G) %>%
  mutate(
   R_per_game1 = R / G,
   R_per_game2 = R * 1.0 / G
 )

# All operations are lazy: they don't do anything until you request the
# data, either by `print()`ing it (which shows the first ten rows),
# or by `collect()`ing the results locally.
system.time(recent <- filter(batting, yearID > 2010))
system.time(collect(recent))

# You can see the query that dplyr creates with show_query()
batting %>%
  filter(G > 0) %>%
  group_by(playerID) %>%
  summarise(n = n()) %>%
  show_query()
}

Infrastructure for testing dplyr

Description

Register testing sources, then use test_load() to load an existing data frame into each source. To create a new table in each source, use test_frame().

Usage

test_register_src(name, src)

test_register_con(name, ...)

src_test(name)

test_load(
  df,
  name = unique_table_name(),
  srcs = test_srcs$get(),
  ignore = character()
)

test_frame(..., srcs = test_srcs$get(), ignore = character())

Examples

## Not run: 
test_register_src("sqlite", {
  DBI::dbConnect(RSQLite::SQLite(), ":memory:", create = TRUE)
})

test_frame(x = 1:3, y = 3:1)
test_load(mtcars)

## End(Not run)

Translate an expression to SQL

Description

dbplyr translates commonly used base functions including logical (!, &, |), arithmetic (^), and comparison (!=) operators, as well as common summary (mean(), var()), and transformation (log()) functions. All other functions will be preserved as is. R's infix functions (e.g. ⁠%like%⁠) will be converted to their SQL equivalents (e.g. LIKE).

Learn more in vignette("translation-function").

Usage

translate_sql(
  ...,
  con,
  vars_group = NULL,
  vars_order = NULL,
  vars_frame = NULL,
  window = TRUE
)

translate_sql_(
  dots,
  con,
  vars_group = NULL,
  vars_order = NULL,
  vars_frame = NULL,
  window = TRUE,
  context = list()
)

Arguments

..., dots

Expressions to translate. translate_sql() automatically quotes them for you. translate_sql_() expects a list of already quoted objects.

con

An optional database connection to control the details of the translation. The default, NULL, generates ANSI SQL.

vars_group, vars_order, vars_frame

Parameters used in the OVER expression of windowed functions.

window

Use FALSE to suppress generation of the OVER statement used for window functions. This is necessary when generating SQL for a grouped summary.

context

Use to carry information for special translation cases. For example, MS SQL needs a different conversion for is.na() in WHERE vs. SELECT clauses. Expects a list.

Examples

con <- simulate_dbi()

# Regular maths is translated in a very straightforward way
translate_sql(x + 1, con = con)
translate_sql(sin(x) + tan(y), con = con)

# Note that all variable names are escaped
translate_sql(like == "x", con = con)
# In ANSI SQL: "" quotes variable _names_, '' quotes strings

# Logical operators are converted to their sql equivalents
translate_sql(x < 5 & !(y >= 5), con = con)
# xor() doesn't have a direct SQL equivalent
translate_sql(xor(x, y), con = con)

# If is translated into case when
translate_sql(if (x > 5) "big" else "small", con = con)

# Infix functions are passed onto SQL with % removed
translate_sql(first %like% "Had%", con = con)
translate_sql(first %is% NA, con = con)
translate_sql(first %in% c("John", "Roger", "Robert"), con = con)

# And be careful if you really want integers
translate_sql(x == 1, con = con)
translate_sql(x == 1L, con = con)

# If you have an already quoted object, use translate_sql_:
x <- quote(y + 1 / sin(t))
translate_sql_(list(x), con = simulate_dbi())
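
# `context` carries extra information for special translation cases;
# a sketch: the MS SQL backend inspects the `clause` entry so that
# is.na() translates differently in WHERE than in SELECT
translate_sql_(
  list(quote(is.na(x))),
  con = simulate_mssql(),
  context = list(clause = "WHERE")
)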

# Windowed translation --------------------------------------------
# Known window functions automatically get OVER()
translate_sql(mpg > mean(mpg), con = con)

# Suppress this with window = FALSE
translate_sql(mpg > mean(mpg), window = FALSE, con = con)

# vars_group controls partition:
translate_sql(mpg > mean(mpg), vars_group = "cyl", con = con)

# and vars_order controls ordering for those functions that need it
translate_sql(cumsum(mpg), con = con)
translate_sql(cumsum(mpg), vars_order = "mpg", con = con)

Generate SQL expression for window functions

Description

win_over() makes it easy to generate the window function specification. win_absent(), win_rank(), win_aggregate(), and win_cumulative() provide helpers for constructing common types of window functions. win_current_group() and win_current_order() allow you to access the grouping and order context set up by group_by() and arrange().

Usage

win_over(
  expr,
  partition = NULL,
  order = NULL,
  frame = NULL,
  con = sql_current_con()
)

win_rank(f, empty_order = FALSE)

win_aggregate(f)

win_aggregate_2(f)

win_cumulative(f)

win_absent(f)

win_current_group()

win_current_order()

win_current_frame()

win_rank_tdata(f)

Arguments

expr

The window expression

partition

Variables to partition over

order

Variables to order by

frame

A numeric vector of length two defining the frame.

f

The name of an sql function as a string

empty_order

A logical value indicating whether to order by NULL if order is not specified

Examples

con <- simulate_dbi()

win_over(sql("avg(x)"), con = con)
win_over(sql("avg(x)"), "y", con = con)
win_over(sql("avg(x)"), order = "y", con = con)
win_over(sql("avg(x)"), order = c("x", "y"), con = con)
win_over(sql("avg(x)"), frame = c(-Inf, 0), order = "y", con = con)

Override window order and frame

Description

These allow you to override the ⁠ORDER BY⁠ clause and the window frame of window functions generated by grouped mutates.

Usage

window_order(.data, ...)

window_frame(.data, from = -Inf, to = Inf)

Arguments

.data

A lazy data frame backed by a database query.

...

Variables to order by

from, to

Bounds of the frame.

Examples

library(dplyr, warn.conflicts = FALSE)

db <- memdb_frame(g = rep(1:2, each = 5), y = runif(10), z = 1:10)
db %>%
  window_order(y) %>%
  mutate(z = cumsum(y)) %>%
  show_query()

db %>%
  group_by(g) %>%
  window_frame(-3, 0) %>%
  window_order(z) %>%
  mutate(z = sum(y)) %>%
  show_query()