class: center, middle # Lazy Evaluation ## Data Analysis with R and Python ### Deepayan Sarkar
--- layout: true # Lazy Evaluation --- * Consider the following function
$$ \newcommand{\sub}{_} \newcommand{\nseq}[1]{ {#1}\sub{1}, {#1}\sub{2}, \dotsc, {#1}\sub{n} } $$
```r choose1 <- function(u, a, b) { if (u < 0.5) a else b } ``` --- * Chooses and returns one of two arguments depending on a third argument -- * Could be used for treatment randomization ```r choose1(rbinom(1, size = 1, prob = 0.3), "placebo", "treatment") ``` ``` [1] "placebo" ``` ```r choose1(rbinom(1, size = 1, prob = 0.3), "placebo", "treatment") ``` ``` [1] "placebo" ``` ```r choose1(rbinom(1, size = 1, prob = 0.3), "placebo", "treatment") ``` ``` [1] "placebo" ``` ```r choose1(rbinom(1, size = 1, prob = 0.3), "placebo", "treatment") ``` ``` [1] "treatment" ``` --- * How does this function work? -- * Maybe * The function is called with some arguments * R matches and evaluates the arguments `u`, `a`, and `b` * Body of function executed with these values -- * Does this sound reasonable? * How can we check? [Code demo](lazy-evaluation-demo.R) * What happens in Python? --- layout: true # Non-standard Evaluation --- * Suppose we have the following problem: * Find all subsets of size 2 from a set of size 5 --- * Easy to create all _ordered_ pairs using the `expand.grid()` function .scrollable500[ ```r g <- expand.grid(a = 1:5, b = 1:5) g ``` ``` a b 1 1 1 2 2 1 3 3 1 4 4 1 5 5 1 6 1 2 7 2 2 8 3 2 9 4 2 10 5 2 11 1 3 12 2 3 13 3 3 14 4 3 15 5 3 16 1 4 17 2 4 18 3 4 19 4 4 20 5 4 21 1 5 22 2 5 23 3 5 24 4 5 25 5 5 ``` ] --- * But we want to retain only the $5 \choose 2$ distinct combinations of size 2 * Also not difficult: ```r g$b > g$a # find the rows for which $a > b$ ``` ``` [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE [15] FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE ``` -- ```r g[g$b > g$a, ] # Then use this as a row index ``` ``` a b 6 1 2 11 1 3 12 2 3 16 1 4 17 2 4 18 3 4 21 1 5 22 2 5 23 3 5 24 4 5 ``` --- * As noted before, referring to `g` multiple times not ideal * We can avoid doing this using the `eval()` function --- * We can replace ```r g$b > g$a ``` ``` [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE [15] FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE ``` * by ```r eval(quote(b > a), g) ``` ``` [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE [15] FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE ``` -- * A general function that makes this even simpler is the `with()` function ```r with(g, b > a) ``` ``` [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE [15] FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE ``` --- * We might guess that this is done using `substitute()` -- * We can verify this by looking at the `with` function by typing its name ```r with ``` ``` function (data, expr, ...) UseMethod("with")
``` --- * Actually this just tells us that `with()` is a _generic_ function * The _method_ we want is the default method ```r getS3method("with", "default") ``` ``` function (data, expr, ...) eval(substitute(expr), data, enclos = parent.frame())
``` --- * All of this can now be combined to get all the $5 \choose 2$ distinct 2-subsets ```r g[with(g, b > a), ] ``` ``` a b 6 1 2 11 1 3 12 2 3 16 1 4 17 2 4 18 3 4 21 1 5 22 2 5 23 3 5 24 4 5 ``` --- * This is a very common use case: obtaining a subset a dataset * There is an even simpler way using the `subset()` function ```r subset(g, b > a) ``` ``` a b 6 1 2 11 1 3 12 2 3 16 1 4 17 2 4 18 3 4 21 1 5 22 2 5 23 3 5 24 4 5 ``` --- * A popular add-on package __dplyr__ has a similar function called `filter()` ```r dplyr::filter(g, b > a) ``` ``` a b 1 1 2 2 1 3 3 2 3 4 1 4 5 2 4 6 3 4 7 1 5 8 2 5 9 3 5 10 4 5 ``` -- * Here `dplyr::filter` is the _namespace_ notation `package::function` * Allows us to use the function without attaching the whole package * Similar to the `.` accessor in Python --- * The __dplyr__ package is very useful for routine data manipulation * We will discuss it again later * The idea of quoted expressions and _non-standard evaluation_ is essential in many other contexts