Using ggplot2

# Using ggplot2

## Data Analysis with R and Python

### Deepayan Sarkar

---

# ggplot2

---

* Traditional graphics and lattice share a common philosophy

* Plotting functions are named after plot type
	
	* Example: `graphics::hist()` and `lattice::histogram()`

* ggplot approaches data visualization in a different way

<div>
$$
\newcommand{\sub}{_}
$$
</div>

---

* Want to recreate the following histogram using ggplot2

```r
data(NHANES, package = "NHANES")
lattice::histogram( ~ BPSysAve | Gender, data = NHANES,
                    layout = c(1, 2), nint = 25)
```

![plot of chunk unnamed-chunk-1](figures/7-ggplot-unnamed-chunk-1-1.svg)

---

* The gg in ggplot stands for _grammar of graphics_

* Approach:

* Define an abstract grammar for visualization

* Consists of various predefined components

* Plots are constructed by combining these components

---

* We start by specifying two of those components:

* The dataset to be used

* The faceting variable

```r
library(package = "ggplot2")
p <- ggplot(data = NHANES) + facet_wrap(~ Gender, ncol = 1)
```

---

* Note that we have assigned the result to a variable

* This is an important feature of __lattice__ and __ggplot__

* Traditional graphics calls plot something as soon as they are called

* This is not true for __lattice__ and __ggplot__
	
	* Instead, they create objects that contain the information required for a plot

---

* Actual plot is created when they are 'printed' (blank because it doesn't have any geoms)

```r
p
```

![plot of chunk unnamed-chunk-3](figures/7-ggplot-unnamed-chunk-3-1.svg)

---

* Fortunately, there is a "histogram" geom  we can use

```r
p + geom_histogram(aes(x = BPSysAve), bins = 25)
```

```
Warning: Removed 1449 rows containing non-finite values (`stat_bin()`).
```

![plot of chunk unnamed-chunk-4](figures/7-ggplot-unnamed-chunk-4-1.svg)

---

# Stats

---

* This approach hides an intermediate step

* We map average systolic blood pressure to the x-coordinate (aesthetic mapping)

* However, the y-coordinate mapping is not explicitly mapped

* What is actually plotted here are summary statistics (bin counts)

* Even the x-coordinate is no longer the raw data, but only the bin locations

* So essentially raw data has been summarized to create some new data (which is plotted)

* Such statistical summaries are called __"stats"__ in ggplot2 terminology

---

* We can make this operation explicit as follows

```r
p + stat_bin(aes(x = BPSysAve), bins = 25, na.rm = TRUE)
```

![plot of chunk unnamed-chunk-5](figures/7-ggplot-unnamed-chunk-5-1.svg)

---

* This 'stat' is applied by default in `geom_histogram()`

```r
str(geom_histogram)
```

```
function (mapping = NULL, data = NULL, stat = "bin", position = "stack", ..., binwidth = NULL, 
    bins = NULL, na.rm = FALSE, orientation = NA, show.legend = NA, inherit.aes = TRUE)  
```

* Similarly, `stat_bin()` has a default 'geom'

```r
str(stat_bin)
```

```
function (mapping = NULL, data = NULL, geom = "bar", position = "stack", ..., binwidth = NULL, 
    bins = NULL, center = NULL, boundary = NULL, breaks = NULL, closed = c("right", 
        "left"), pad = FALSE, na.rm = FALSE, orientation = NA, show.legend = NA, 
    inherit.aes = TRUE)  
```

---

* Thus, the last plot we created is actually

```r
p + stat_bin(aes(x = BPSysAve), bins = 25, geom = "bar", na.rm = TRUE)
```

![plot of chunk unnamed-chunk-8](figures/7-ggplot-unnamed-chunk-8-1.svg)

---

* We can change the geom to get a different visualization of the same summary

```r
p + stat_bin(aes(x = BPSysAve), bins = 25, geom = "point", na.rm = TRUE)
```

![plot of chunk unnamed-chunk-9](figures/7-ggplot-unnamed-chunk-9-1.svg)

---

* A more useful choice of geom is "line" (result is a reasonable substitute for histograms)

```r
p + stat_bin(aes(x = BPSysAve), bins = 25, geom = "line", na.rm = TRUE)
```

![plot of chunk unnamed-chunk-10](figures/7-ggplot-unnamed-chunk-10-1.svg)

---

* These plots show bin counts

* As we saw earlier, histograms can also be viewed as _density_ estimators

* Read help page of `stat_bin()`: it also computes estimated density

---

* For these, y-axis mapping must be `density`, not `count`

```r
p + stat_bin(aes(x = BPSysAve, y = after_stat(density)), bins = 25, geom = "line", na.rm = TRUE)
```

![plot of chunk unnamed-chunk-11](figures/7-ggplot-unnamed-chunk-11-1.svg)

---

* Alternative method: average shifted histogram / kernel density estimation

```r
p + stat_density(aes(x = BPSysAve), na.rm = TRUE)
```

![plot of chunk unnamed-chunk-12](figures/7-ggplot-unnamed-chunk-12-1.svg)

---

* Again, replacing the default geom ("area") by "line" gives more familiar density plot

```r
p + stat_density(aes(x = BPSysAve), geom = "line", na.rm = TRUE)
```

![plot of chunk unnamed-chunk-13](figures/7-ggplot-unnamed-chunk-13-1.svg)

---

# Aesthetic mapping

---

* Important feature of ggplot2: _aesthetic mapping_

* Aesthetic attributes of any geom can be mapped to specific variables
	
	* This includes, color, fill, size, shape, etc.
	
	* Details depend on specific geom

```r
str(geom_line)
```

```
function (mapping = NULL, data = NULL, stat = "identity", position = "identity", 
    na.rm = FALSE, orientation = NA, show.legend = NA, inherit.aes = TRUE, ...)  
```

---

* Example: map `Race1` (categorical variable) to line color

```r
p + stat_density(aes(x = BPSysAve, color = Race1), geom = "line", na.rm = TRUE)
```

![plot of chunk unnamed-chunk-15](figures/7-ggplot-unnamed-chunk-15-1.svg)

---

* Example: Default `position = "stack"` is useful for default geom "area"

```r
p + stat_density(aes(x = BPSysAve, fill = Race1), geom = "area", na.rm = TRUE)
```

![plot of chunk unnamed-chunk-16](figures/7-ggplot-unnamed-chunk-16-1.svg)

---

```r
str(stat_density)
```

```
function (mapping = NULL, data = NULL, geom = "area", position = "stack", ..., bw = "nrd0", 
    adjust = 1, kernel = "gaussian", n = 512, trim = FALSE, na.rm = FALSE, bounds = c(-Inf, 
        Inf), orientation = NA, show.legend = NA, inherit.aes = TRUE)  
```

---

```r
p + stat_density(aes(x = BPSysAve, color = Race1),
                 geom = "line", position = "identity", na.rm = TRUE)
```

![plot of chunk unnamed-chunk-18](figures/7-ggplot-unnamed-chunk-18-1.svg)

---

* The idea of aesthetic mapping of variables to visual attributes is very powerful

* We will demonstrate this with the gapminder data

```r
library(gapminder)
str(gapminder)
```

```
tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
 $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
 $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
 $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
```

---

* Create a blank ggplot object with data and a faceting variable (year)

```r
pgm <- ggplot(data = gapminder) + facet_wrap(~ year)
```

---

* Scatter plot to study relation between `lifeExp` and `gdpPercap`

```r
pgm + geom_point(aes(x = gdpPercap, y = lifeExp))
```

![plot of chunk unnamed-chunk-21](figures/7-ggplot-unnamed-chunk-21-1.svg)

---

* `gdpPercap` is highly skewed, so take logarithm

```r
pgm + geom_point(aes(x = log(gdpPercap), y = lifeExp))
```

![plot of chunk unnamed-chunk-22](figures/7-ggplot-unnamed-chunk-22-1.svg)

---

* Other attributes: map continent to color

```r
pgm + geom_point(aes(x = log(gdpPercap), y = lifeExp, color = continent))
```

![plot of chunk unnamed-chunk-23](figures/7-ggplot-unnamed-chunk-23-1.svg)

---

* Another continuous variable is population: map to _size_ of points (bubble chart)

```r
pgm + geom_point(aes(x = log(gdpPercap), y = lifeExp, color = continent, size = pop))
```

![plot of chunk unnamed-chunk-24](figures/7-ggplot-unnamed-chunk-24-1.svg)