R works using what is known as a REPL (Read-Eval-Print-Loop)
R waits for user input when it starts
User types input expression and presses Enter
R reads input and tries to evaluate
Evaluation is either successful or produces an error
R starts to wait again for further user input
Today's talk takes a closer look at how the evaluation step works
Useful links: Source (R + Markdown), Makefile (conversion using knitr and pandoc), Extracted R code
Basic R programming
You should be comfortable with writing functions
If you know how to use eval() and quote(), you already know too much
Statistical concepts used for illustration
Maximum likelihood, regularization
Probability density, histograms
Linear regression
sqrt(x)
Error in eval(expr, envir, enclos): object 'x' not found
This expression involves two "symbols": sqrt and x
To be able to evaluate the expression, R must be able to "find" both symbols
find("sqrt")
[1] "package:base"
find("x") # not found
character(0)
Let us define x and try again:
x <- 10
sqrt(x)
[1] 3.162278
x <- -1
sqrt(x)
Warning in sqrt(x): NaNs produced
[1] NaN
x <- -1+0i
sqrt(x)
[1] 0+1i
x <- "-1"
sqrt(x)
Error in sqrt(x): non-numeric argument to mathematical function
Now both sqrt and x are found:
find("sqrt")
[1] "package:base"
find("x")
[1] ".GlobalEnv"
Suppose we want to define a "smarter" sqrt()
First check what sqrt() does, so we can mimic it:
sqrt
function (x) .Primitive("sqrt")
What does .Primitive() do? See help(.Primitive) (but we don't really need to know)
We can define our own sqrt() as follows:
sqrt <- function(x)
{
    orig.sqrt <- .Primitive("sqrt")
    if (is.character(x)) x <- as.numeric(x)
    if (is.numeric(x) && any(x < 0))
        x <- complex(real = x, imaginary = 0)
    orig.sqrt(x)
}
x <- -1
sqrt(x)
[1] 0+1i
find("x")
[1] ".GlobalEnv"
find("sqrt")
[1] ".GlobalEnv" "package:base"
The original sqrt() is not overwritten; instead, a new one is defined in ".GlobalEnv"
Evaluation now finds the sqrt in ".GlobalEnv" and not the one in "package:base"
Where exactly does R look for symbols? The search path:
search()
[1] ".GlobalEnv"        "package:knitr"     "package:stats"     "package:graphics"
[5] "package:grDevices" "package:utils"     "package:datasets"  "package:methods"
[9] "Autoloads"         "package:base"
What exactly are the objects in the search path?
They are known as "environments", which are central to how evaluation works in R
In essence, an environment is a collection of symbols (variables) bound to certain values
Each environment usually also has an associated "enclosing environment" or "parent environment"
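As a minimal sketch (not from the talk; names are illustrative), symbols can be bound and looked up in an explicitly created environment:
demo.env <- new.env(parent = baseenv()) # enclosing environment: base
assign("z", 42, envir = demo.env)       # bind symbol 'z' to 42 in demo.env
get("z", envir = demo.env)              # 42, found directly in demo.env
exists("sum", envir = demo.env)         # TRUE, found via the enclosing chain
exists("z", envir = globalenv())        # FALSE, 'z' lives only in demo.env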
.GlobalEnv is the environment where user-defined variables are stored
The chain of enclosing environments starting from .GlobalEnv is identical to the search path:
e <- .GlobalEnv
while (TRUE) { str(e, give.attr = FALSE); e <- parent.env(e) }
<environment: R_GlobalEnv>
<environment: package:knitr>
<environment: package:stats>
<environment: package:graphics>
<environment: package:grDevices>
<environment: package:utils>
<environment: package:datasets>
<environment: package:methods>
<environment: 0x7f9063aa8bb8>
<environment: base>
<environment: R_EmptyEnv>
Error in parent.env(e): the empty environment has no parent
search() # compare
[1] ".GlobalEnv"        "package:knitr"     "package:stats"     "package:graphics"
[5] "package:grDevices" "package:utils"     "package:datasets"  "package:methods"
[9] "Autoloads"         "package:base"
new.env() explicitly creates environments
Variables inside environments can be accessed using $ (like lists or data frames)
e1 <- new.env()
e1$x <- seq(0, 2 * pi, length.out = 101)
e1$y <- cos(e1$x)
plot(y ~ x, data = e1, type = "l", ylim = c(-1, 1)) # formula interface
Environments are an exception to R's duplicate-on-copy rule
A copy of an environment is the same environment, not a new one
e2 <- e1
range(e1$y)
[1] -1 1
e2$y <- abs(e2$y)
range(e1$y) # changes even though e1$y has not been directly modified
[1] 1.608123e-16 1.000000e+00
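If an independent copy is actually needed, one idiom (a sketch, not from the original slides) is to round-trip through a list:
e3 <- list2env(as.list(e2), envir = new.env(parent = parent.env(e2)))
e3$y <- -e3$y # modify the copy
range(e2$y)   # unchanged: e3 really is a distinct environment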
The variables in an environment can be listed using ls():
ls(e2)
[1] "x" "y"
ls.str(e2) ## summary of all objects in e2
x : num [1:101] 0 0.0628 0.1257 0.1885 0.2513 ...
y : num [1:101] 1 0.998 0.992 0.982 0.969 ...
Environments are central to how R functions work
Functions in R are actually "closures"
In addition to the argument list and body (code), a function also has an associated environment
This is usually the environment in which the function was defined
environment(sqrt)
<environment: R_GlobalEnv>
environment(base::sqrt) # special "primitive" (C) function, so no environment
NULL
environment(hist)
<environment: namespace:graphics>
Every time a function is invoked, a new environment is created
This is the "evaluation environment" and contains the local variables of the function
The environment of the function is the environment enclosing this "evaluation environment"
While the function is running, evaluation tries to find symbols as follows:
First, in the evaluation environment
Next, in its enclosing environment, i.e., the environment of the function
Then, in the enclosing environment of that environment, and so on
This behaviour is called "lexical scoping"
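To make these rules concrete, here is a minimal sketch (variable names are illustrative, not from the talk):
a <- 1                      # bound in .GlobalEnv
f <- function() {
    b <- 2                  # bound in f's evaluation environment
    g <- function() a + b   # 'b' found in f's evaluation environment,
                            # 'a' found in .GlobalEnv, one level further up
    g()
}
f() # 3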
Do you need to know all this? Depends!
You can use R productively with only a superficial understanding of evaluation
However, these details often help explain and diagnose seemingly odd behaviour
They can also be a lot of fun
Given positive-valued (possibly skewed) data, one may want to transform them to look Normal
The Box-Cox transformation conveniently combines power transforms and log in a single family
Parameterized by a single parameter (lambda):
boxcox <- function(x) # applies Box-Cox transformation on data 'x'
{
    if (lambda == 0) log(x)
    else (x^lambda - 1) / lambda
}
b <- function(x) # computes histogram bins given data 'x'
{
    n <- sum(is.finite(x))
    r <- extendrange(x)
    seq(r[1], r[2], length.out = round(sqrt(n)) + 1)
}
x <- rlnorm(1000)
hist(boxcox(x), breaks = b) # error, as it should be
Error in boxcox(x): object 'lambda' not found
But it "works" if a variable lambda happens to be defined:
lambda <- 0
hist(boxcox(x), breaks = b) # notice that 'b' is a function! (See ?hist)
Normally, depending on this kind of scoping behaviour should be unnecessary
In fact, a well-designed function should not depend on arbitrary variables
However, there is one situation where this evaluation rule becomes important
When a function creates and returns or uses another function!
Probably not very common...
But only because we are limited in our thinking
We already use functions as arguments to other functions
Thinking "functionally" opens up many other interesting approaches
This is particularly useful for quick prototyping
fboxcox <- function(lambda)
{
    f <- function(x)
    {
        if (lambda == 0) log(x)
        else (x^lambda - 1) / lambda
    }
    return(f)
}
boxcox.sqrt <- fboxcox(0.5)
boxcox.log <- fboxcox(0)
par(mfrow = c(1, 2))
hist(boxcox.sqrt(x), breaks = b) # equivalently, hist(fboxcox(0.5)(x), breaks = b)
hist(boxcox.log(x), breaks = b)  # hist(fboxcox(0)(x), breaks = b)
boxcox.sqrt and boxcox.log are both functions (created by another function, fboxcox)
boxcox.sqrt
function(x)
    {
        if (lambda == 0) log(x)
        else (x^lambda - 1) / lambda
    }
<environment: 0x7f90676098a8>
boxcox.log
function(x)
    {
        if (lambda == 0) log(x)
        else (x^lambda - 1) / lambda
    }
<bytecode: 0x7f90636851f0>
<environment: 0x7f9067669b98>
environment(boxcox.sqrt)
<environment: 0x7f90676098a8>
environment(boxcox.log)
<environment: 0x7f9067669b98>
This is because they were created in different executions of fboxcox()
Although not immediately apparent, these environments retain the value of lambda
ls.str(environment(boxcox.sqrt))
f : function (x)
lambda : num 0.5
ls.str(environment(boxcox.log))
f : function (x)
lambda : num 0
Without lexical scoping, this would not have been possible
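The captured value can also be read directly from the closure's environment; a small illustration (not in the original):
environment(boxcox.sqrt)$lambda # 0.5, read straight out of the closure
## assigning to it would change what boxcox.sqrt computes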
Another example from The R Paper:
y <- 123
f <- function(x) {
    y <- x * x
    g <- function() print(y)
    g()
}
f(10)
[1] 100
(S, which looks up free variables in the global frame rather than the defining environment, would print 123)
We want to transform data to look as Normal as possible, but we don't know what the correct transformation should be
A standard textbook solution is to look for an "optimum" choice within the Box-Cox family
A maximum-likelihood approach: assume that for some \(\lambda\), the transformed data \(y_i = g_\lambda(x_i)\) are i.i.d. \(N(\mu, \sigma^2)\)
A little mathematics tells us that the log-likelihood, as a function of \(\lambda\) alone, is (up to a constant)
\[ \ell(\lambda) = -\frac{n}{2} \log \hat{\sigma}^2_\lambda + (\lambda - 1) \sum_i \log x_i \]
There is no closed form solution for the maximizer
The optimum (MLE) \(\hat{\lambda}\) must be found numerically, where
\[ \hat{\sigma}^2_\lambda = \frac{1}{n} \sum_i (y_i - \bar{y})^2, \quad y_i = g_\lambda(x_i) \]
Suppose we have a general numerical optimizer that can optimize a function of one argument
Make a function that converts data into a function suitable for optimization:
negllBoxCox <- function(x)
{
    n <- length(x)
    slx <- sum(log(x)) # compute only once
    function(lambda)
    {
        y <- fboxcox(lambda)(x) # note use of fboxcox() defined earlier
        sy <- mean((y - mean(y))^2)
        n * log(sy) / 2 - (lambda - 1) * slx # ignoring constant terms
    }
}
x <- rlnorm(100) # log-normal, so true lambda = 0
f <- negllBoxCox(x)
f(0)
[1] 4.311438
optimize(f, lower = -10, upper = 10)
$minimum
[1] -0.1608655

$objective
[1] 2.230694
## OR optim(par = 1, fn = f) for alternative methods
We can even plot the function f
The Vectorize() function creates a vectorized function from a non-vectorized one
plot(Vectorize(f), from = -2, to = 2, n = 1000, las = 1)
lambda.hat <- replicate(1000,
    optimize(negllBoxCox(rnorm(100, mean = 10)),
             lower = -10, upper = 10)$minimum)
hist(lambda.hat, breaks = b, las = 1)
Another common use case: optimization of a loss function
Consider the following general approach to linear regression given data \((x_i, y_i)\): estimate \(\beta\) by minimizing \(\sum_i L(y_i - \beta_1 - \beta_2 x_i)\) for some loss function \(L\)
Common choices for \(L\): \(L(u) = u^2\) (least squares) and \(L(u) = |u|\) (least absolute deviations)
make.objective <- function(x, y, L = abs)
{
    function(beta)
    {
        sum(L(y - beta[1] - beta[2] * x))
    }
}
data(phones, package = "MASS")
obj.lad <- with(phones, make.objective(year, calls, L = abs))
obj.lse <- with(phones, make.objective(year, calls, L = function(u) u*u))
obj.lad
function(beta)
    {
        sum(L(y - beta[1] - beta[2] * x))
    }
<environment: 0x7f9061b191e8>
environment(obj.lse)$L
function(u) u*u
<environment: 0x7f9061bad478>
environment(obj.lad)$L
function (x) .Primitive("abs")
fm.lad <- optim(obj.lad, par = c(0,0))
fm.lse <- optim(obj.lse, par = c(0,0))
str(fm.lad)
List of 5
 $ par        : num [1:2] -66.05 1.35
 $ value      : num 844
 $ counts     : Named int [1:2] 117 NA
  ..- attr(*, "names")= chr [1:2] "function" "gradient"
 $ convergence: int 0
 $ message    : NULL
str(fm.lse)
List of 5
 $ par        : num [1:2] -260.12 5.04
 $ value      : num 69544
 $ counts     : Named int [1:2] 109 NA
  ..- attr(*, "names")= chr [1:2] "function" "gradient"
 $ convergence: int 0
 $ message    : NULL
plot(calls ~ year, phones)
abline(fm.lad$par, col = "blue")
abline(fm.lse$par, col = "red") # This is what we would get from lm()
Suppose we want to plot a histogram
One important parameter is the bin width
Fixing starting point, what is an optimal choice?
Maximum Likelihood fails because it "overfits": the likelihood keeps increasing as the bins get narrower, so it needs regularization
Options: penalize the likelihood (e.g., using AIC), or use cross-validation
The Akaike Information Criterion is a popular model selection criterion
Simple definition: \(AIC = 2k - 2 \log L\), where \(k\) is the number of parameters (here the number of bins \(B\)); the lower the better
Our prototype: make a function of \(B\) that computes the AIC of the corresponding histogram
chooseBinsAIC <- function(x, p = 0.25)
{
    x <- x[is.finite(x)]
    n <- length(x)
    r <- extendrange(x)
    d <- r[2] - r[1]
    histAIC <- function(B)
    {
        b <- seq(r[1], r[2], length.out = B + 1)
        h <- hist(x, breaks = b, plot = FALSE)
        ## likelihood is simply product of density^counts
        keep <- h$counts > 0 # skip 0 counts (by convention 0 log 0 == 0)
        logL <- with(h, sum(counts[keep] * log(density[keep])))
        2 * B - 2 * logL
    }
    Bvec <- seq(floor(sqrt(n)*p), ceiling(sqrt(n)/p))
    list(range = r, B = Bvec,
         AIC = unname(sapply(Bvec, histAIC)))
}
xx <- c(rnorm(500), rnorm(300, mean = 6, sd = 1.5))
ll <- chooseBinsAIC(xx, p = 0.125)
str(ll)
List of 3
 $ range: num [1:2] -3.79 11.35
 $ B    : int [1:225] 3 4 5 6 7 8 9 10 11 12 ...
 $ AIC  : num [1:225] 4146 4115 3876 3736 3723 ...
with(ll, {
    plot(AIC ~ B, type = "o")
    abline(v = B[which.min(AIC)])
})
(nclass.aic <- with(ll, B[which.min(AIC)]))
[1] 26
nclass.Sturges(xx)
[1] 11
nclass.FD(xx)
[1] 12
nclass.scott(xx)
[1] 12
breaks <- seq(ll$range[1], ll$range[2], length.out = nclass.aic + 1)
hist(xx, breaks = breaks)
Denote the histogram for bin width \(h\) by \(\hat{f}_h\), and the unknown true density by \(f\)
Want to choose \(h\) to minimize the integrated square error
\[ \int (\hat{f}_h - f)^2 = \int \hat{f}_h^2 - 2 \int \hat{f}_h f + \int f^2 \]
The third term is unknown but constant (as \(h\) varies), so it can be ignored
The first term can be calculated from \(\hat{f}_h\)
The second term involves the unknown \(f\), but \(\int \hat{f}_h f = E_f \hat{f}_h(X)\) can be estimated from the data
Leave-one-out estimator: average of \(\hat{f}_{h,-i}(x_i)\), where \(\hat{f}_{h,-i}\) is the histogram computed without the \(i\)-th observation
The estimated integrated square error (without the constant term) simplifies to
\[ \frac{2}{(n-1)h} - \frac{n+1}{(n-1)\,n^2\,h} \sum_j c_j^2 \]
where \(c_j\) is the number of observations in the \(j\)-th bin
chooseBins <- function(x, p = 0.25)
{
    x <- x[is.finite(x)]
    n <- length(x)
    r <- extendrange(x)
    d <- r[2] - r[1]
    CVE <- function(B)
    {
        h <- d / B
        breaks <- seq(r[1], r[2], length.out = B+1)
        k <- findInterval(x, breaks) # runtime O(n * log(B))
        counts <- table(factor(k, levels = 1:B))
        2 / ((n-1)*h) - sum(counts^2) * (n+1) / ((n-1) * n^2 * h)
    }
    Bvec <- seq(floor(sqrt(n)*p), ceiling(sqrt(n)/p))
    list(range = r, B = Bvec,
         CVE = unname(sapply(Bvec, CVE)))
}
## xx <- c(rnorm(500), rnorm(300, mean = 6, sd = 1.5))
ll <- chooseBins(xx, p = 0.125)
str(ll)
List of 3
 $ range: num [1:2] -3.79 11.35
 $ B    : int [1:225] 3 4 5 6 7 8 9 10 11 12 ...
 $ CVE  : num [1:225] -0.0831 -0.0802 -0.1034 -0.1207 -0.1195 ...
with(ll, {
    plot(CVE ~ B, type = "o")
    abline(v = B[which.min(CVE)])
})
(nclass.cv <- with(ll, B[which.min(CVE)]))
[1] 20
nclass.Sturges(xx)
[1] 11
nclass.FD(xx)
[1] 12
nclass.scott(xx)
[1] 12
breaks <- seq(ll$range[1], ll$range[2], length.out = nclass.cv + 1)
hist(xx, breaks = breaks)
rm(list = c("x", "y")) # remove x, y if defined earlier
plot(x, y, type = "l")
Error in plot(x, y, type = "l"): object 'x' not found
e <- quote(plot(x, y, type = "l"))
e
plot(x, y, type = "l")
e[[1]]
plot
e[[2]]
x
e[[3]]
y
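Quoted expressions behave like lists and can even be modified; a small illustration (not in the original):
ee <- e
ee[[1]] <- quote(lines) # replace the function being called
ee
## lines(x, y, type = "l")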
Quoted expressions can be evaluated using eval():
eval(e)
Error in plot(x, y, type = "l"): object 'x' not found
Fails because the current environment (.GlobalEnv) does not contain variables x and y
But recall: environment e2 defined earlier contains them
ls.str(e2)
x : num [1:101] 0 0.0628 0.1257 0.1885 0.2513 ...
y : num [1:101] 1 0.998 0.992 0.982 0.969 ...
eval(e, envir = e2)
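eval() also accepts a list or data frame in place of an environment, which is essentially the mechanism behind with() (its definition is shown at the end); a minimal illustration:
eval(quote(x + y), envir = list(x = 1, y = 2)) # 3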
How does the replicate() function work?
eval() is the key to how replicate() and with() work:
my.replicate <- function(N, e)
{
    ans <- numeric(N)
    for (i in 1:N) ans[[i]] <- eval(e)
    ans
}
my.replicate(5, max(rnorm(10))) # not what we wanted
[1] 2.1574 2.1574 2.1574 2.1574 2.1574
my.replicate(5, quote(max(rnorm(10))))
[1] 0.5284321 0.6858938 1.3573033 0.8895072 1.6434051
The substitute() trick: non-standard evaluation (using lazy argument evaluation)
my.replicate <- function(N, e)
{
    expr <- substitute(e) # avoids the need to quote()
    ans <- numeric(N)
    for (i in 1:N) ans[[i]] <- eval(expr)
    ans
}
str(x <- my.replicate(1000, max(rnorm(20))))
num [1:1000] 1.648 2.524 1.636 1.346 0.729 ...
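What substitute() captures can be seen directly; a minimal illustration (not in the original):
show.expr <- function(e) substitute(e)
show.expr(max(rnorm(20))) # the unevaluated call: max(rnorm(20))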
set.seed(123)
rm(sqrt) # remove the sqrt() we defined earlier
chooseOne <- function(a, b, u = runif(1))
{
    if (u < 0.5) a else b
}
Only one of a and b is evaluated in any particular execution:
chooseOne(sqrt(2), sqrt(-2)) # no warning message
[1] 1.414214
chooseOne(sqrt(2), sqrt(-2)) # warning message
Warning in sqrt(-2): NaNs produced
[1] NaN
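The same laziness can be seen even more starkly; a minimal sketch (names illustrative, not from the talk):
first <- function(a, b) a          # 'b' is never used...
first(1, stop("never evaluated"))  # ...so this returns 1 without an error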
?environment
?eval
?expression
?quote
?bquote
?substitute
?with
with.default
function (data, expr, ...)
eval(substitute(expr), data, enclos = parent.frame())
<bytecode: 0x7f9061b19fb0>
<environment: namespace:base>
replicate
function (n, expr, simplify = "array")
sapply(integer(n), eval.parent(substitute(function(...) expr)),
    simplify = simplify)
<bytecode: 0x7f9061ac9e08>
<environment: namespace:base>