class: center, middle

# Numerical Optimization for LAD regression

## Statistics I — Data Exploration

### Deepayan Sarkar
---
$$
\newcommand{\sub}{_}
\newcommand{\set}[1]{\left\lbrace {#1} \right\rbrace}
\newcommand{\abs}[1]{\left\lvert {#1} \right\rvert}
\newcommand{\nseq}[1]{ {#1}\sub{1}, {#1}\sub{2}, \dotsc, {#1}\sub{n} }
$$
# Motivation

* Given paired bivariate data $(X\sub{i}, Y\sub{i}), i = 1, 2, \dotsc, n$

* To find $a, b$ that minimize the sum of absolute errors

$$
\lambda(a, b) = \sum\limits\sub{i=1}^n \abs{ Y\sub{i} - a - b X\sub{i}}
$$

* A closed form solution is not available

---

# Loss function in R

```r
lad_loss_function <- function(x, y)
{
    if (length(x) != length(y)) stop("Lengths of x and y do not match")
    skip <- is.na(x) | is.na(y)
    x <- x[!skip]
    y <- y[!skip]
    lambda <- function(par)
    {
        if (length(par) != 2) stop("par must have exactly two elements")
        e <- abs(y - par[1] - par[2] * x)
        sum(e)
    }
    lambda
}
data(Davis, package = "carData")
DavisLoss <- with(Davis, lad_loss_function(x = height, y = weight))
```

---
layout: true

# Loss function surface

---

* Evaluate loss function on a grid

```r
g <- expand.grid(a = seq(-500, 500, length.out = 200),
                 b = seq(-3, 3, length.out = 200))
g$loss <- 0
for (i in seq_along(g$loss)) {
    g$loss[i] <- DavisLoss(c(g$a[i], g$b[i]))
}
```

---

```r
levelplot(loss ~ a + b, data = g)
```

*(figure: level plot of the loss surface over $a$ and $b$)*

---

```r
wireframe(loss ~ a + b, data = g, shade = TRUE)
```

*(figure: wireframe plot of the loss surface over $a$ and $b$)*

---
layout: true

# Useful trick: change parameterization of line

---

* "Center" the $X$ data so that the Y axis passes through the data

```r
median(Davis$height)
```

```
[1] 169.5
```

* Modified problem: to find $c, d$ that minimize the sum of absolute errors

$$
\lambda(c, d) = \sum\limits\sub{i=1}^n \abs{ Y\sub{i} - c - d (X\sub{i} - 170)}
$$

* It is easy to see that $b = d$ and $a = c - 170 d$

---

* Loss function for modified parameters

```r
DavisLoss2 <- with(Davis, lad_loss_function(x = height - 170, y = weight))
g2 <- expand.grid(c = seq(0, 150, length.out = 200),
                  d = seq(-3, 3, length.out = 200))
g2$loss <- 0
for (i in seq_along(g2$loss)) {
    g2$loss[i] <- DavisLoss2(c(g2$c[i], g2$d[i]))
}
```

---

```r
p <- levelplot(loss ~ c + d, data = g2)
p
```

*(figure: level plot of the loss surface over $c$ and $d$)*

---

* Question: Can we do better than grid search?

--

* Suppose we have an initial guess, say $c = d = 0$

* Can we try to "improve" our current guess?

--

* We will try a very simple approach called _gradient descent_

---
layout: true

# Gradient descent

---

* Consider a differentiable function $f(x)$ of one variable, with starting point $x\sub{0}$

*(figure: a function $f(x)$ with starting point $x\sub{0}$)*

---

* Consider a differentiable function $f(x)$ of one variable, with starting point $x\sub{0}$

*(figure: a function $f(x)$ with starting point $x\sub{0}$)*

* In which direction should we move so that $f$ decreases?

* Natural to move in the direction in which the tangent decreases

---

* Basic approach:

  * Compute slope $f^{\prime}$

  * If $f^{\prime} > 0$, move left (decrease $x$)

  * If $f^{\prime} < 0$, move right (increase $x$)

--

* But don't move too much if $f^{\prime} \approx 0$

* Why?

--

* Because we don't want to move away from a minimum
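---

* A tiny one-dimensional illustration of this idea (a minimal sketch; the function $f(x) = (x - 2)^2$, the starting point, and the step size $0.1$ are arbitrary choices for illustration)

```r
f <- function(x) (x - 2)^2         # a simple differentiable function
fprime <- function(x) 2 * (x - 2)  # its derivative
x <- 0                             # starting point
for (i in 1:25) {
    x <- x - 0.1 * fprime(x)       # step against the slope, scaled by a small factor
}
x  # close to the true minimizer 2
```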
---

* Gradient descent algorithm:

  * Current value: $x\sub{n}$

  * Update: $x\sub{n+1} = x\sub{n} - \gamma f^{\prime}(x\sub{n})$

  * Repeat until some stopping criterion is met

* $\gamma$ is known as the "learning rate"

* If $f^{\prime}(x\sub{n})$ is unknown, approximate it (e.g., by a finite difference)

---

* Implementation (alternating between coordinates)

```r
gdoptim1 <- function(L, par = c(0, 0), niter = 50, gamma = 0.1, eps = 0.001)
{
    L_current <- L(par)
    res <- data.frame(iter = c(0, seq_len(niter)), L = L_current,
                      par1 = par[1], par2 = par[2])
    for (i in seq_len(niter)) {
        if (i %% 2 == 0) {
            ## finite-difference approximation to the partial derivative w.r.t. par[1]
            slope1 <- (L(par + c(eps, 0)) - L_current) / eps
            par[1] <- par[1] - gamma * slope1
        }
        else {
            ## finite-difference approximation to the partial derivative w.r.t. par[2]
            slope2 <- (L(par + c(0, eps)) - L_current) / eps
            par[2] <- par[2] - gamma * slope2
        }
        L_current <- L(par)
        res$L[i + 1] <- L_current
        res$par1[i + 1] <- par[1]
        res$par2[i + 1] <- par[2]
    }
    res
}
```

---

```r
res1_cd <- gdoptim1(DavisLoss2, par = c(0, 0), gamma = 0.1)
p + xyplot(par2 ~ par1, res1_cd, type = "o", col = "red")
```

*(figure: coordinatewise descent path on the $(c, d)$ loss surface, $\gamma = 0.1$)*

---

```r
res1_cd <- gdoptim1(DavisLoss2, par = c(0, 0), gamma = 0.01, niter = 500)
p + xyplot(par2 ~ par1, res1_cd, type = "o", col = "red")
```

*(figure: coordinatewise descent path, $\gamma = 0.01$, 500 iterations)*

---

```r
res1_cd <- gdoptim1(DavisLoss2, par = c(0, 0), gamma = 0.001, niter = 5000)
p + xyplot(par2 ~ par1, res1_cd, type = "l", col = "red")
```

*(figure: coordinatewise descent path, $\gamma = 0.001$, 5000 iterations)*

---

* Implementation (both coordinates together)

```r
gdoptim2 <- function(L, par = c(0, 0), niter = 50, gamma = 0.1, eps = 0.001)
{
    L_current <- L(par)
    res <- data.frame(iter = c(0, seq_len(niter)), L = L_current,
                      par1 = par[1], par2 = par[2])
    for (i in seq_len(niter)) {
        ## finite-difference approximations to both partial derivatives
        slope1 <- (L(par + c(eps, 0)) - L_current) / eps
        slope2 <- (L(par + c(0, eps)) - L_current) / eps
        par <- par - gamma * c(slope1, slope2)
        L_current <- L(par)
        res$L[i + 1] <- L_current
        res$par1[i + 1] <- par[1]
        res$par2[i + 1] <- par[2]
    }
    res
}
```

* This is in fact the direction of steepest descent, which is convenient

---

```r
res2_cd <- gdoptim2(DavisLoss2, par = c(0, 0), gamma = 0.1)
p + xyplot(par2 ~ par1, res2_cd, type = "b", col = "red")
```

*(figure: steepest descent path on the $(c, d)$ loss surface, $\gamma = 0.1$)*

---

```r
res2_cd <- gdoptim2(DavisLoss2, par = c(0, 0), gamma = 0.01, niter = 500)
p + xyplot(par2 ~ par1, res2_cd, type = "b", col = "red")
```

*(figure: steepest descent path, $\gamma = 0.01$, 500 iterations)*

---

```r
res2_cd <- gdoptim2(DavisLoss2, par = c(0, 0), gamma = 0.001, niter = 5000)
p + xyplot(par2 ~ par1, res2_cd, type = "l", col = "red")
```

*(figure: steepest descent path, $\gamma = 0.001$, 5000 iterations)*
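---

* Since $b = d$ and $a = c - 170 d$, we can map the final $(c, d)$ iterate back to the original parameterization (a minimal sketch using `res2_cd` from the previous run)

```r
## last row of the iteration history returned by gdoptim2
last <- tail(res2_cd, 1)
## convert (c, d) back to (a, b): b = d, a = c - 170 * d
c(a = last$par1 - 170 * last$par2, b = last$par2)
```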
---

* What about the original parameterization?

```r
res2_ab <- gdoptim2(DavisLoss, par = c(0, 0), gamma = 0.00001, niter = 500)
levelplot(loss ~ a + b, data = g) +
    xyplot(par2 ~ par1, res2_ab, type = "b", col = "red")
```

*(figure: steepest descent path on the $(a, b)$ loss surface)*

---

* Summary

  * Gradient descent is a simple optimization method

  * Can be effective as long as we are careful

--

* There are many other optimization methods

  * Each has its own advantages and disadvantages

  * You will learn more next semester

---
layout: true

# Specialized alternative: IRLS

---

* This particular problem has a nice "intuitive" solution

* Rewrite the loss function as follows:

$$
\lambda(a, b) = \sum\limits\sub{i=1}^n \abs{ Y\sub{i} - a - b X\sub{i}} =
\sum\limits\sub{i=1}^n \frac{1}{\abs{Y\sub{i} - a - b X\sub{i}}} \left( Y\sub{i} - a - b X\sub{i} \right)^2
$$

--

* This suggests the following iterative algorithm:

  * Start with initial / current values $\hat{a}\sub{n}, \hat{b}\sub{n}$

  * Calculate $w^{(n)}\sub{i} = 1 / \abs{Y\sub{i} - \hat{a}\sub{n} - \hat{b}\sub{n} X\sub{i}}$

  * Find $\hat{a}\sub{n+1}, \hat{b}\sub{n+1}$ that minimize $\lambda\sub{n}(a, b) = \sum\limits\sub{i=1}^n w^{(n)}\sub{i} \left( Y\sub{i} - a - b X\sub{i} \right)^2$

---

* The last step can be solved using calculus (requires a little work, but is similar to OLS)

* Known as _Weighted Least Squares_ (WLS)

--

* The resulting iterative procedure is known as _Iteratively Reweighted Least Squares_ (IRLS)

--

* Exercise: Implement this and study its behaviour

---
layout: true

# IRLS for Davis data

---

```r
irls_cd <- with(Davis, irls(height - 170, weight, par = c(0, 0), niter = 10))
levelplot(loss ~ c + d, data = g2) +
    xyplot(par2 ~ par1, irls_cd, type = "b", col = "red")
```

*(figure: IRLS iterates on the $(c, d)$ loss surface)*

---

```r
irls_ab <- with(Davis, irls(height, weight, par = c(0, 0), niter = 10))
levelplot(loss ~ a + b, data = g) +
    xyplot(par2 ~ par1, irls_ab, type = "b", col = "red")
```

*(figure: IRLS iterates on the $(a, b)$ loss surface)*

---
layout: false
class: center, middle

# Questions?
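---

# Appendix: a possible implementation of irls()

* The `irls()` function used on the previous slides is left as an exercise; the sketch below is one possible implementation, consistent with how it is called above (the guard against zero residuals, `1e-6`, is an ad hoc choice)

```r
irls <- function(x, y, par = c(0, 0), niter = 10)
{
    ## record the iteration history, as gdoptim1/gdoptim2 do
    res <- data.frame(iter = c(0, seq_len(niter)), par1 = par[1], par2 = par[2])
    for (i in seq_len(niter)) {
        ## weights from the current residuals (guarding against division by zero)
        w <- 1 / pmax(abs(y - par[1] - par[2] * x), 1e-6)
        ## weighted least squares step
        par <- unname(coef(lm(y ~ x, weights = w)))
        res$par1[i + 1] <- par[1]
        res$par2[i + 1] <- par[2]
    }
    res
}
```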