class: center, middle

# Numerical Optimization for LAD regression

## Statistics I — Data Exploration

### Deepayan Sarkar
---
$$
\newcommand{\sub}{_}
\newcommand{\set}[1]{\left\lbrace {#1} \right\rbrace}
\newcommand{\abs}[1]{\left\lvert {#1} \right\rvert}
\newcommand{\nseq}[1]{ {#1}\sub{1}, {#1}\sub{2}, \dotsc, {#1}\sub{n} }
$$
# Motivation

* Given paired bivariate data $(X\sub{i}, Y\sub{i}), i = 1, 2, \dotsc, n$

* To find $a, b$ that minimize the sum of absolute errors

$$
\lambda(a, b) = \sum\limits\sub{i=1}^n \abs{ Y\sub{i} - a - b X\sub{i}}
$$

* A closed form solution is not available

---

# Loss function in R

```r
lad_loss_function <- function(x, y)
{
    if (length(x) != length(y)) stop("Lengths of x and y do not match")
    skip <- is.na(x) | is.na(y)
    x <- x[!skip]
    y <- y[!skip]
    lambda <- function(par)
    {
        if (length(par) != 2) stop("par must have exactly two elements")
        e <- abs(y - par[1] - par[2] * x)
        sum(e)
    }
    lambda
}
data(Davis, package = "carData")
DavisLoss <- with(Davis, lad_loss_function(x = height, y = weight))
```

---
layout: true

# Loss function surface

---

* Evaluate loss function on a grid

```r
g <- expand.grid(a = seq(-500, 500, length.out = 200),
                 b = seq(-3, 3, length.out = 200))
g$loss <- 0
for (i in seq_along(g$loss)) {
    g$loss[i] <- DavisLoss(c(g$a[i], g$b[i]))
}
```

---

```r
levelplot(loss ~ a + b, data = g)
```

*(figure: level plot of the loss surface over $a$ and $b$)*

---

```r
wireframe(loss ~ a + b, data = g, shade = TRUE)
```

*(figure: wireframe plot of the loss surface over $a$ and $b$)*

---
layout: true

# Useful trick: change parameterization of line

---

* "Center" the $X$ data so that the Y axis passes through the data

```r
median(Davis$height)
```

```
[1] 169.5
```

* Modified problem: to find $c, d$ that minimize the sum of absolute errors

$$
\lambda(c, d) = \sum\limits\sub{i=1}^n \abs{ Y\sub{i} - c - d (X\sub{i} - 170)}
$$

* It is easy to see that $b = d$ and $a = c - 170 d$

---

* Loss function for modified parameters

```r
DavisLoss2 <- with(Davis, lad_loss_function(x = height - 170, y = weight))
g2 <- expand.grid(c = seq(0, 150, length.out = 200),
                  d = seq(-3, 3, length.out = 200))
g2$loss <- 0
for (i in seq_along(g2$loss)) {
    g2$loss[i] <- DavisLoss2(c(g2$c[i], g2$d[i]))
}
```

---

```r
p <- levelplot(loss ~ c + d, data = g2)
p
```

*(figure: level plot of the loss surface over $c$ and $d$)*

---

* Question: Can we do better than grid search?

--

* Suppose we have an initial guess, say $c = d = 0$

* Can we try to "improve" our current guess?

--

* We will try a very simple approach called _gradient descent_

---
layout: true

# Gradient descent

---

* Consider a differentiable function $f(x)$ of one variable, with starting point $x\sub{0}$

*(figure: a function $f(x)$ with starting point $x\sub{0}$)*

---

* Consider a differentiable function $f(x)$ of one variable, with starting point $x\sub{0}$

*(figure: a function $f(x)$ with starting point $x\sub{0}$)*

* In which direction should we move so that $f$ decreases?

* Natural to move in the direction in which the tangent decreases

---

* Basic approach:

  * Compute slope $f^{\prime}$

  * If $f^{\prime} > 0$, move left (decrease $x$)

  * If $f^{\prime} < 0$, move right (increase $x$)

--

* But don't move too much if $f^{\prime} \approx 0$

* Why?

--

* Because we don't want to move away from a minimum
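---

* A tiny one-dimensional illustration of this idea (a minimal sketch; the function $f(x) = (x - 2)^2$, the starting point, and the step size $0.1$ are arbitrary choices for illustration)

```r
f <- function(x) (x - 2)^2         # a simple differentiable function
fprime <- function(x) 2 * (x - 2)  # its derivative
x <- 0                             # starting point
for (i in 1:25) {
    x <- x - 0.1 * fprime(x)       # step against the slope, scaled by a small factor
}
x  # close to the true minimizer 2
```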
---

* Gradient descent algorithm:

  * Current value: $x\sub{n}$

  * Update: $x\sub{n+1} = x\sub{n} - \gamma f^{\prime}(x\sub{n})$

  * Repeat until some stopping criterion is met

* $\gamma$ is known as the "learning rate"

* If $f^{\prime}(x\sub{n})$ is unknown, approximate it (e.g., by a finite difference)

---

* Implementation (alternating between coordinates)

```r
gdoptim1 <- function(L, par = c(0, 0), niter = 50, gamma = 0.1, eps = 0.001)
{
    L_current <- L(par)
    res <- data.frame(iter = c(0, seq_len(niter)), L = L_current,
                      par1 = par[1], par2 = par[2])
    for (i in seq_len(niter)) {
        if (i %% 2 == 0) {
            ## finite-difference approximation to the partial derivative w.r.t. par[1]
            slope1 <- (L(par + c(eps, 0)) - L_current) / eps
            par[1] <- par[1] - gamma * slope1
        }
        else {
            ## finite-difference approximation to the partial derivative w.r.t. par[2]
            slope2 <- (L(par + c(0, eps)) - L_current) / eps
            par[2] <- par[2] - gamma * slope2
        }
        L_current <- L(par)
        res$L[i + 1] <- L_current
        res$par1[i + 1] <- par[1]
        res$par2[i + 1] <- par[2]
    }
    res
}
```

---

```r
res1_cd <- gdoptim1(DavisLoss2, par = c(0, 0), gamma = 0.1)
p + xyplot(par2 ~ par1, res1_cd, type = "o", col = "red")
```

*(figure: coordinatewise descent path on the $(c, d)$ loss surface, $\gamma = 0.1$)*

---

```r
res1_cd <- gdoptim1(DavisLoss2, par = c(0, 0), gamma = 0.01, niter = 500)
p + xyplot(par2 ~ par1, res1_cd, type = "o", col = "red")
```

*(figure: coordinatewise descent path, $\gamma = 0.01$, 500 iterations)*

---

```r
res1_cd <- gdoptim1(DavisLoss2, par = c(0, 0), gamma = 0.001, niter = 5000)
p + xyplot(par2 ~ par1, res1_cd, type = "l", col = "red")
```

*(figure: coordinatewise descent path, $\gamma = 0.001$, 5000 iterations)*

---

* Implementation (both coordinates together)

```r
gdoptim2 <- function(L, par = c(0, 0), niter = 50, gamma = 0.1, eps = 0.001)
{
    L_current <- L(par)
    res <- data.frame(iter = c(0, seq_len(niter)), L = L_current,
                      par1 = par[1], par2 = par[2])
    for (i in seq_len(niter)) {
        ## finite-difference approximations to both partial derivatives
        slope1 <- (L(par + c(eps, 0)) - L_current) / eps
        slope2 <- (L(par + c(0, eps)) - L_current) / eps
        par <- par - gamma * c(slope1, slope2)
        L_current <- L(par)
        res$L[i + 1] <- L_current
        res$par1[i + 1] <- par[1]
        res$par2[i + 1] <- par[2]
    }
    res
}
```

* This is in fact the direction of steepest descent, which is convenient

---

```r
res2_cd <- gdoptim2(DavisLoss2, par = c(0, 0), gamma = 0.1)
p + xyplot(par2 ~ par1, res2_cd, type = "b", col = "red")
```

*(figure: steepest descent path on the $(c, d)$ loss surface, $\gamma = 0.1$)*

---

```r
res2_cd <- gdoptim2(DavisLoss2, par = c(0, 0), gamma = 0.01, niter = 500)
p + xyplot(par2 ~ par1, res2_cd, type = "b", col = "red")
```

*(figure: steepest descent path, $\gamma = 0.01$, 500 iterations)*

---

```r
res2_cd <- gdoptim2(DavisLoss2, par = c(0, 0), gamma = 0.001, niter = 5000)
p + xyplot(par2 ~ par1, res2_cd, type = "l", col = "red")
```

*(figure: steepest descent path, $\gamma = 0.001$, 5000 iterations)*
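---

* Since $b = d$ and $a = c - 170 d$, we can map the final $(c, d)$ iterate back to the original parameterization (a minimal sketch using `res2_cd` from the previous run)

```r
## last row of the iteration history returned by gdoptim2
last <- tail(res2_cd, 1)
## convert (c, d) back to (a, b): b = d, a = c - 170 * d
c(a = last$par1 - 170 * last$par2, b = last$par2)
```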
---

* What about the original parameterization?

```r
res2_ab <- gdoptim2(DavisLoss, par = c(0, 0), gamma = 0.00001, niter = 500)
levelplot(loss ~ a + b, data = g) +
    xyplot(par2 ~ par1, res2_ab, type = "b", col = "red")
```

*(figure: steepest descent path on the $(a, b)$ loss surface)*

---

* Summary

  * Gradient descent is a simple optimization method

  * Can be effective as long as we are careful

--

* There are many other optimization methods

  * Each has its own advantages and disadvantages

  * You will learn more next semester

---
layout: true

# Specialized alternative: IRLS

---

* This particular problem has a nice "intuitive" solution

* Rewrite the loss function as follows:

$$
\lambda(a, b) = \sum\limits\sub{i=1}^n \abs{ Y\sub{i} - a - b X\sub{i}} =
\sum\limits\sub{i=1}^n \frac{1}{\abs{Y\sub{i} - a - b X\sub{i}}} \left( Y\sub{i} - a - b X\sub{i} \right)^2
$$

--

* This suggests the following iterative algorithm:

  * Start with initial / current values $\hat{a}\sub{n}, \hat{b}\sub{n}$

  * Calculate $w^{(n)}\sub{i} = 1 / \abs{Y\sub{i} - \hat{a}\sub{n} - \hat{b}\sub{n} X\sub{i}}$

  * Find $\hat{a}\sub{n+1}, \hat{b}\sub{n+1}$ that minimize $\lambda\sub{n}(a, b) = \sum\limits\sub{i=1}^n w^{(n)}\sub{i} \left( Y\sub{i} - a - b X\sub{i} \right)^2$

---

* The last step can be solved using calculus (requires a little work, but is similar to OLS)

* Known as _Weighted Least Squares_ (WLS)

--

* The resulting iterative procedure is known as _Iteratively Reweighted Least Squares_ (IRLS)

--

* Exercise: Implement this and study its behaviour

---
layout: true

# IRLS for Davis data

---

```r
irls_cd <- with(Davis, irls(height - 170, weight, par = c(0, 0), niter = 10))
levelplot(loss ~ c + d, data = g2) +
    xyplot(par2 ~ par1, irls_cd, type = "b", col = "red")
```

*(figure: IRLS iterates on the $(c, d)$ loss surface)*

---

```r
irls_ab <- with(Davis, irls(height, weight, par = c(0, 0), niter = 10))
levelplot(loss ~ a + b, data = g) +
    xyplot(par2 ~ par1, irls_ab, type = "b", col = "red")
```

*(figure: IRLS iterates on the $(a, b)$ loss surface)*

---
layout: false
class: center, middle

# Questions?
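---

# Appendix: a possible implementation of irls()

* The `irls()` function used on the previous slides is left as an exercise; the sketch below is one possible implementation, consistent with how it is called above (the guard against zero residuals, `1e-6`, is an ad hoc choice)

```r
irls <- function(x, y, par = c(0, 0), niter = 10)
{
    ## record the iteration history, as gdoptim1/gdoptim2 do
    res <- data.frame(iter = c(0, seq_len(niter)), par1 = par[1], par2 = par[2])
    for (i in seq_len(niter)) {
        ## weights from the current residuals (guarding against division by zero)
        w <- 1 / pmax(abs(y - par[1] - par[2] * x), 1e-6)
        ## weighted least squares step
        par <- unname(coef(lm(y ~ x, weights = w)))
        res$par1[i + 1] <- par[1]
        res$par2[i + 1] <- par[2]
    }
    res
}
```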