# A Sample Session in R

## Data Analysis with R and Python

### Deepayan Sarkar

---

# Starting and Interacting with R

* R is typically used interactively

* When we start R, the command window (or console) displays a prompt, typically `>`

* We use R by entering an expression to be evaluated

* R evaluates the expression and prints the result

``` r
1 + 2
```

```
[1] 3
```

* It then provides a new prompt and waits for more input

* To **quit** R, you can use the command `q()`

<div>
$$
\newcommand{\sub}{_}
$$
</div>

---

# Infix and Prefix Notation

* R uses **infix** notation for standard arithmetic operations, e.g.,

```r
1 + 2
```

* The corresponding **prefix** notation would look something like

```
+ 1 2
```

* This is actually what R does internally, using _function notation_

``` r
`+`(1, 2)
```

```
[1] 3
```

* In general, R expressions are typically function calls of the form `f(a, b)`
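
* As a small sketch of the same idea (not part of the original session), other arithmetic operators can also be called in function notation by wrapping their names in backticks:

``` r
# Infix operators are ordinary functions; backticks let us call them by name
`*`(3, 4)    # same as 3 * 4
`-`(10, 4)   # same as 10 - 4
`^`(2, 5)    # same as 2^5

# Nested calls follow the usual rules: this is (1 + 2) * 3
`*`(`+`(1, 2), 3)
```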

---

# Basic principles: Data types

---

* R can handle many different kinds of data

* Basic classification: *simple data* and *compound data*

* **Simple Data** includes:

* Numbers (numeric values, including integers and floating-point numbers):
```r
1     # a whole number (stored as a double; write 1L for an integer)
-3.14 # a floating point number
```
--

*  Logical values:
```r
 TRUE # true  
FALSE # false (or T and F shortcuts)
```
--

* Strings (enclosed in single or double quotes):
```r
"This is a string 1 2 3 4"
```
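
* To check how R stores a simple value, one option is `typeof()`. This is a minimal sketch added for illustration; note that a plain literal such as `1` is stored as a floating-point number unless it carries the `L` suffix:

``` r
# typeof() reports how a value is stored internally
typeof(1)       # "double": plain numeric literals are floating point
typeof(1L)      # "integer": the L suffix requests an integer
typeof(-3.14)   # "double"
typeof(TRUE)    # "logical"
typeof("abc")   # "character"
```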

---

* **Compound Data** primarily consists of

* **vectors** : ordered collections of elements of the same type

* **lists** : ordered collections with elements of possibly different types

--
* We commonly define compound data using the concatenate function `c()`:

``` r
c(1, 2, 3)
```

```
[1] 1 2 3
```
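
* The difference between vectors and lists shows up when the elements have mixed types. A short sketch (the names `v` and `l` are only for illustration):

``` r
# c() builds a vector, so mixed inputs are coerced to one common type
v <- c(1, "a", TRUE)
v            # all three elements are now strings
typeof(v)    # "character"

# list() keeps each element's own type
l <- list(1, "a", TRUE)
sapply(l, typeof)   # "double" "character" "logical"
```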

---

* We can also have **Symbols** which are used for naming variables or functions:

```r
x  
gdp.data  
this_is_a_symbol
```

---

# The REPL

* An R session involves interaction between the user and the console (_listener_)

* When we enter an expression, the listener passes it to the _evaluator_

* Basic rule:

* Everything is evaluated

* The results are (usually) printed

* Once done, listener goes back to listening

<br/>

This is known as the __Read-Eval-Print-Loop__ or __REPL__


---

# Evaluation rules

---

* Numbers and strings evaluate to themselves:

``` r
10
```

```
[1] 10
```

``` r
"Hello"  
```

```
[1] "Hello"
```

---

* Expressions can involve _functions_

``` r
sqrt(10)
```

```
[1] 3.162278
```

``` r
nchar("Hello")
```

```
[1] 5
```

* Functions are applied using parentheses

* Function name precedes arguments, mirroring **prefix** notation

* For example ` min(x, y) ` instead of ` min x y `

* Parentheses are needed when the number of arguments of a function is unknown
	
	* Compare ` min(pi, sqrt(10)) ` and ` min pi sqrt 10 `

---

* Evaluation can be **suppressed** by _quoting_ an expression

``` r
quote(min(pi, sqrt(10)))
```

```
min(pi, sqrt(10))
```

* This turns out to be a very interesting feature that we will return to later

* Python does **not** have a similar feature (R and Python are otherwise very similar)
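
* A quoted expression is an ordinary R object that can be stored, inspected, and evaluated later. A minimal sketch, with `e` as an illustrative name:

``` r
e <- quote(min(pi, sqrt(10)))  # store the unevaluated expression
class(e)   # "call"
e[[1]]     # first component: the function name, as a symbol
eval(e)    # evaluate it now; the result is the smaller value, pi
```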

---

# Elementary Statistical Operations

Fundamental numerical and graphical statistical operations in R

---

# Example dataset: World Social Indicators, 1960

---

country|gnppc|pctlit_adult|highered100k
:-------|----:|-----------:|-----------:
Nepal|45|5|56
Afghanistan|50|2.5|12
Laos|50|17.5|4
Ethiopia|55|2.5|5
Burma|57|47.5|63
Libya|60|13|49
Sudan|60|9|34
Tanganyika|61|7.5|9
Uganda|64|27.5|14
Pakistan|70|13|165
China|73|47.5|69
India|73|19.3|220
South Vietnam|76|17.5|83
Nigeria|78|10|4
Kenya|87|22.5|5
Madagascar|88|33.5|21
Congo|92|37.5|4
Thailand|96|68|251
Bolivia|99|32.1|166
Cambodia|99|17.5|18


---

* Data from a small subset of countries

* All have relatively low per capita GNP

* Variables

* `gnppc` : per capita GNP (around 1957)

* `pctlit_adult` : adult literacy (%) around 1960

* `highered100k` : enrollment in higher education per 100,000 population

---

# Simple univariate calculations

---

* Simplest statistical data: univariate

* Usually consists of groups of numbers

* We first consider only the data on enrollment in higher education, which are

```
56 12 4 5 63 49 34 9 14 165 69 220 83 4 5 21 4 251 166 18
```

* In R, we represent this data as a vector using `c()` (combine):

``` r
c(56, 12, 4, 5, 63, 49, 34, 9, 14, 165, 69, 220, 83, 4, 5, 21, 4, 251, 166, 18)
```

```
 [1]  56  12   4   5  63  49  34   9  14 165  69 220  83   4   5  21   4 251 166
[20]  18
```

---

* The `mean()` function computes the average (**arithmetic mean**) of a vector of numbers.

``` r
mean(c(56, 12, 4, 5, 63, 49, 34, 9, 14, 165, 69, 220, 83, 4, 5, 21, 4, 251, 166, 18))
```

```
[1] 62.6
```

* The **median** of these numbers can be calculated using `median()`:

``` r
median(c(56, 12, 4, 5, 63, 49, 34, 9, 14, 165, 69, 220, 83, 4, 5, 21, 4, 251, 166, 18))
```

```
[1] 27.5
```

---

* This requires retyping the data every time

* We can avoid this by assigning it a _name_ to reference by

* Done using the assignment operator `<-` or the (mostly) equivalent `=` operator.

``` r
higher.educ <- c(56, 12, 4, 5, 63, 49, 34, 9, 14, 165, 69, 220, 83, 4, 5, 21, 4, 251, 166, 18)
```

* This is known as a **variable assignment**

---

* The symbol `higher.educ` now holds the vector of 20 numbers

* If we evaluate the symbol, R returns its value.

``` r
higher.educ
```

```
 [1]  56  12   4   5  63  49  34   9  14 165  69 220  83   4   5  21   4 251 166
[20]  18
```

---

* We can easily compute numerical descriptive statistics.

``` r
mean(higher.educ)  
```

```
[1] 62.6
```

``` r
median(higher.educ)  
```

```
[1] 27.5
```

``` r
sd(higher.educ) # Standard deviation  
```

```
[1] 76.57222
```

``` r
IQR(higher.educ) # Interquartile range
```

```
[1] 64.5
```
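
* To see what `sd()` and `IQR()` compute, the same numbers can be reproduced from their definitions. This is a sketch for illustration; `sd_manual` and `iqr_manual` are names introduced here, and `quantile()` is used with its default settings:

``` r
higher.educ <- c(56, 12, 4, 5, 63, 49, 34, 9, 14, 165, 69, 220, 83, 4, 5, 21, 4, 251, 166, 18)

# Standard deviation from its definition (divisor n - 1)
n <- length(higher.educ)
dev <- higher.educ - mean(higher.educ)
sd_manual <- sqrt(sum(dev^2) / (n - 1))

# Interquartile range as the distance between the two quartiles
q <- quantile(higher.educ, c(0.25, 0.75))
iqr_manual <- unname(q[2] - q[1])

c(sd_manual, iqr_manual)   # should match sd() and IQR()
```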

---

# Vectorized arithmetic

---

* R also supports **elementwise arithmetic operations** on vectors

* For example, we can add 1 to each value using

``` r
1 + higher.educ
```

```
 [1]  57  13   5   6  64  50  35  10  15 166  70 221  84   5   6  22   5 252 167
[20]  19
```

* We can calculate the natural logarithms of the values

``` r
log(higher.educ)
```

```
 [1] 4.025352 2.484907 1.386294 1.609438 4.143135 3.891820 3.526361 2.197225
 [9] 2.639057 5.105945 4.234107 5.393628 4.418841 1.386294 1.609438 3.044522
[17] 1.386294 5.525453 5.111988 2.890372
```

---

* Functions can be nested, as we have been doing

``` r
mean(log(higher.educ))
```

```
[1] 3.300523
```

``` r
median(log(higher.educ))
```

```
[1] 3.285441
```

---

* Expressions with simple nested functions can be written in _pipeline_ notation

* Sometimes easier to follow because order of application is left-to-right (like _postfix_ notation)

``` r
higher.educ |> log() |> mean()
```

```
[1] 3.300523
```

``` r
higher.educ |> log() |> median()
```

```
[1] 3.285441
```

``` r
higher.educ |> log() |> mean() |> exp()
```

```
[1] 27.12684
```

``` r
higher.educ |> log() |> median() |> exp()
```

```
[1] 26.72078
```
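
* The pipeline form is just another way of writing nested calls. The sketch below assumes a version of R that supports the native `|>` pipe (4.1 or later); quoting a pipeline shows the nested call that R actually evaluates:

``` r
higher.educ <- c(56, 12, 4, 5, 63, 49, 34, 9, 14, 165, 69, 220, 83, 4, 5, 21, 4, 251, 166, 18)

# The pipe is rewritten into ordinary nested calls before evaluation
quote(higher.educ |> log() |> mean())   # mean(log(higher.educ))

# Extra arguments go inside the call on the right-hand side
higher.educ |> log(base = 10) |> mean()
mean(log(higher.educ, base = 10))       # same value
```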

---

# Arithmetic Mean, Geometric Mean, and Median

* What does the following tell us about the data?

``` r
higher.educ |> mean() # arithmetic mean
```

```
[1] 62.6
```

``` r
higher.educ |> log() |> mean() |> exp() # geometric mean
```

```
[1] 27.12684
```

``` r
higher.educ |> median() # median
```

```
[1] 27.5
```

---

# Another dataset: Average monthly PM 2.5 levels

* Recorded at an air quality monitoring station in R.K.Puram (Delhi)

* Over a 3-year period, from January 2021 to December 2023.

``` r
pm25 <- c(288, 223, 167, 156, 126, 120, 102, 106, 83, 114, 259, 282, 
          234, 183, 174, 176, 160, 139, 102, 99, 110, 173, 245, 250, 260, 
          190, 150, 164, 161, 144, 115, 138, 123, 182, 323, 280)
```

---

# Some numerical summaries

``` r
mean(pm25)  
```

```
[1] 175.0278
```

``` r
median(pm25)  
```

```
[1] 162.5
```

``` r
sd(pm25)  
```

```
[1] 63.83796
```

``` r
IQR(pm25)
```

```
[1] 103.5
```

* Graphical summaries give better idea of distribution

---

# Histogram

* The function `hist()` draws a histogram of the data

``` r
hist(pm25) # Produces a histogram plot
```

![plot of chunk pm25-hist](figures/1-rsession-pm25-hist-1.svg)

---

# Five-Number Summary

* Standard quartiles + extreme values are useful to judge symmetry

* Useful to compare transformations

``` r
fivenum(pm25)
```

```
[1]  83.0 121.5 162.5 228.5 323.0
```

``` r
fivenum(sqrt(pm25))
```

```
[1]  9.110434 11.022494 12.747413 15.115122 17.972201
```

``` r
fivenum(log(pm25))
```

```
[1] 4.418841 4.799838 5.090635 5.431246 5.777652
```
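
* One way to compare the three scales is to look at how far the upper and lower quartiles lie from the median. The helper `quartile_ratio()` below is hypothetical (it is not an R built-in or part of the original session); a value of 1 would mean the middle half of the data is symmetric about the median:

``` r
pm25 <- c(288, 223, 167, 156, 126, 120, 102, 106, 83, 114, 259, 282,
          234, 183, 174, 176, 160, 139, 102, 99, 110, 173, 245, 250, 260,
          190, 150, 164, 161, 144, 115, 138, 123, 182, 323, 280)

# (upper quartile - median) / (median - lower quartile)
quartile_ratio <- function(x) {
    f <- fivenum(x)
    (f[4] - f[3]) / (f[3] - f[2])
}

quartile_ratio(pm25)
quartile_ratio(sqrt(pm25))
quartile_ratio(log(pm25))
```

* Computed from the five-number summaries shown above, the ratio is largest for the raw data and smallest on the log scale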

---

# Box-and-Whisker Plot

``` r
par(mfrow = c(1, 3))
boxplot(pm25, main = "PM25")
boxplot(sqrt(pm25), main = "sqrt(PM25)")
boxplot(log(pm25), main = "log(PM25)")
```

![plot of chunk pm25-boxplot](figures/1-rsession-pm25-boxplot-1.svg)

---

# Time Series Plots

---

* We often plot observations against time (or the order in which they were obtained)

* Helps to convey serial correlation or trend

* The `plot()` function creates a scatterplot of two variables

* To use it, we need a sequence of integers for the time variable

---

* Useful function that generates a sequence: `seq()` or the shorthand `:` operator

``` r
time <- 0:35  
plot(time, pm25)
```

![plot of chunk pm25-ts1](figures/1-rsession-pm25-ts1-1.svg)

---

* It is common to connect points by lines, using the `type` argument, to emphasize the trend

``` r
plot(time, pm25, type = "o") # "o" stands for 'overplotted' (points and lines)
```

![plot of chunk pm25-ts2](figures/1-rsession-pm25-ts2-1.svg)

---

# Scatter plots

---

* General scatter plots show points with coordinates given by two variables

* Very useful for examining the relationship between two numerical variables

* Recall: `higher.educ` from social indicators data

* Additionally define the `adult.lit` variable to contain corresponding adult literacy (%).

``` r
adult.lit <- c(5, 2.5, 17.5, 2.5, 47.5, 13, 9, 7.5, 27.5, 13, 47.5,
               19.3, 17.5, 10, 22.5, 33.5, 37.5, 68, 32.1, 17.5)
```

---

* Scatter plot of `higher.educ` against `adult.lit`

``` r
plot(adult.lit, higher.educ)
```

![plot of chunk pm25-alit-hieduc](figures/1-rsession-pm25-alit-hieduc-1.svg)

---

# Plotting Functions

---

* Sometimes we are interested in plotting functions; e.g., plot $\sin(x)$ from $-\pi$ to $+\pi$

``` r
x_points <- seq(-pi, pi, length.out = 50) # equally spaced grid  
plot(x_points, sin(x_points), type = "l")
```

![plot of chunk plot-sin-grid](figures/1-rsession-plot-sin-grid-1.svg)

---

* It is also possible to plot functions (of one argument) directly

``` r
plot(sin, from = -2 * pi, to = 2 * pi)
```

![plot of chunk plot-sin-fun](figures/1-rsession-plot-sin-fun-1.svg)

---

We can also define a new function to plot as follows (more details later).

``` r
f <- function(x) { 2 * x + 3 * x^2 - x^3 }
plot(f, from = -10, to = 10)
```

![plot of chunk plot-custom-fun](figures/1-rsession-plot-custom-fun-1.svg)

---

# Example: Loss Function

---

* The mean and median can be viewed as solutions that minimize a _loss function_

* Sample mean of $X\sub{1}, X\sub{2}, \dotsc, X\sub{n}$:

$$
\arg \min\sub{\theta} \sum\limits\sub{i=1}^n (X\sub{i} - \theta)^2
$$

* Sample median of $X\sub{1}, X\sub{2}, \dotsc, X\sub{n}$:

$$
\arg \min\sub{\theta} \sum\limits\sub{i=1}^n \lvert X\sub{i} - \theta \rvert
$$

---

* The mean and median can be viewed as solutions that minimize a _loss function_

* Sample mean of $X\sub{1}, X\sub{2}, \dotsc, X\sub{n}$:

$$
\arg \min\sub{\theta} L\sub{1}(\theta) \ \text{ where } L\sub{1}(\theta) = \sum\limits\sub{i=1}^n (X\sub{i} - \theta)^2
$$

* Sample median of $X\sub{1}, X\sub{2}, \dotsc, X\sub{n}$:

$$
\arg \min\sub{\theta}  L\sub{2}(\theta) \ \text{ where } L\sub{2}(\theta) = \sum\limits\sub{i=1}^n \lvert X\sub{i} - \theta \rvert
$$

* What do the functions $L\sub{1}$ and $L\sub{2}$ look like?

---

* How can we define $L\sub{1}$?

``` r
SSD <- function(theta) {
    S <- 0
    n <- length(higher.educ)
    for (i in 1:n) { # for loop
        S <- S + (higher.educ[i] - theta)^2 # indexing, scope
    }
    S # value returned by function
}
```

* Useful approach in general, but **not** recommended in R

---

* Implementation using vectorization

``` r
SSD <- function(theta) {
    dev <- higher.educ - theta
    sum(dev * dev)
}
```

* Uses the fact that `-` and `*` operate elementwise on vectors

* True for most mathematical functions as well
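
* A quick check that the two implementations agree. In this sketch they are given separate, hypothetical names (`SSD_loop` and `SSD_vec`) so that both can exist at once:

``` r
higher.educ <- c(56, 12, 4, 5, 63, 49, 34, 9, 14, 165, 69, 220, 83, 4, 5, 21, 4, 251, 166, 18)

# Loop version: accumulate one squared deviation at a time
SSD_loop <- function(theta) {
    S <- 0
    for (i in seq_along(higher.educ)) {
        S <- S + (higher.educ[i] - theta)^2
    }
    S
}

# Vectorized version: elementwise subtraction and multiplication, then sum
SSD_vec <- function(theta) {
    dev <- higher.educ - theta
    sum(dev * dev)
}

# Both versions should agree for any single value of theta
all.equal(SSD_loop(50), SSD_vec(50))
all.equal(SSD_loop(62.6), SSD_vec(62.6))
```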

---

* How can we plot `SSD`?

``` r
theta_vals <- seq(0, 100, length.out = 201)
plot(theta_vals, SSD(theta_vals), type = "l")
```

```
Warning in higher.educ - theta: longer object length is not a multiple of
shorter object length
```

```
Error in xy.coords(x, y, xlabel, ylabel, log): 'x' and 'y' lengths differ
```

* Can you guess _why_ this fails?

---

* The function `SSD()` is not _vectorized_

* In such cases, we cannot avoid a for loop

``` r
SSD_vals1 <- numeric(length(theta_vals)) # numeric vector, one slot per theta value
for (i in seq_along(theta_vals)) {
    SSD_vals1[i] <- SSD(theta_vals[i])
}
```

* But this is a special kind of for loop known as _mapping_

* Here we _apply_ the same function on each element of a vector

* There is a function called `sapply()` which makes this very easy

``` r
SSD_vals2 <- sapply(theta_vals, SSD) # evaluates SSD(x) for each x in theta_vals
```
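
* A related function, `vapply()`, does the same mapping but also checks that each result has the expected type and length. The sketch below uses it in place of `sapply()` (a swapped-in alternative, not the approach used in this session) and then looks up the grid point with the smallest loss:

``` r
higher.educ <- c(56, 12, 4, 5, 63, 49, 34, 9, 14, 165, 69, 220, 83, 4, 5, 21, 4, 251, 166, 18)
SSD <- function(theta) {
    dev <- higher.educ - theta
    sum(dev * dev)
}
theta_vals <- seq(0, 100, length.out = 201)

# numeric(1) declares that every call must return a single number
SSD_vals <- vapply(theta_vals, SSD, numeric(1))
length(SSD_vals)                  # one value per element of theta_vals

# Grid point with the smallest loss, to compare with the sample mean
theta_vals[which.min(SSD_vals)]
mean(higher.educ)
```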

---

``` r
plot(theta_vals, SSD_vals2, type = "l")
```

![plot of chunk plot_loss_ssd](figures/1-rsession-plot_loss_ssd-1.svg)

---

``` r
SAD <- function(theta) {
    dev <- higher.educ - theta
    sum(abs(dev))
}
SAD_vals <- sapply(theta_vals, SAD)
```

---

``` r
plot(theta_vals, SAD_vals, type = "l")
```

![plot of chunk plot_loss_sad](figures/1-rsession-plot_loss_sad-1.svg)
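
* A numerical check of the claim that the median minimizes the sum of absolute deviations (a sketch on the same grid). With an even number of observations, every value between the two middle observations gives the same minimum, so the minimizer is not unique:

``` r
higher.educ <- c(56, 12, 4, 5, 63, 49, 34, 9, 14, 165, 69, 220, 83, 4, 5, 21, 4, 251, 166, 18)
SAD <- function(theta) {
    dev <- higher.educ - theta
    sum(abs(dev))
}
theta_vals <- seq(0, 100, length.out = 201)
SAD_vals <- sapply(theta_vals, SAD)

# No grid point should do better than the sample median
SAD(median(higher.educ))
min(SAD_vals)
```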

---

# Generating and Modifying Data

Generating systematic and random data, modifying existing data

---

# Generating Random Data (Simulation)

* R provides functions for generating pseudo-random numbers

* `runif(n)` generates `n` Uniform random variables

``` r
runif(10)
```

```
 [1] 0.69707840 0.79136846 0.61254284 0.36829765 0.09293093 0.37830032
 [7] 0.81270137 0.62896807 0.18946650 0.42166522
```

* `rnorm(n)` similarly generates `n` Standard Normal random variables, e.g., `rnorm(25)`

* Each call produces fresh values, as a second call to `runif()` shows

``` r
runif(25)
```

```
 [1] 0.465512227 0.292726838 0.140677985 0.031051590 0.858157426 0.909276547
 [7] 0.325394625 0.032395882 0.190905378 0.450959374 0.548433178 0.467726122
[13] 0.192999747 0.158144519 0.587087643 0.470371093 0.432812775 0.980726274
[19] 0.526989399 0.928303930 0.483526011 0.053993177 0.775922364 0.099876461
[25] 0.008726112
```
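
* Because the numbers are pseudo-random, a simulation can be made repeatable by fixing the seed with `set.seed()`. A minimal sketch; the seed value 123 and the `mean` and `sd` values are arbitrary choices for illustration:

``` r
# Fixing the seed makes the pseudo-random stream repeatable
set.seed(123)
u1 <- runif(5)
set.seed(123)
u2 <- runif(5)
identical(u1, u2)  # TRUE: same seed, same numbers

# rnorm() also accepts a mean and a standard deviation
z <- rnorm(1000, mean = 10, sd = 2)
mean(z)            # close to 10, but not exactly equal
sd(z)              # close to 2
```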

---

# Generating Systematic Data

---

* We have seen `seq(start, end)` (or `start:end`) for equally spaced integer sequences

``` r
seq(10, 19.5)
```

```
 [1] 10 11 12 13 14 15 16 17 18 19
```

``` r
1:pi
```

```
[1] 1 2 3
```

* Also `seq(a, b, length.out = n)` for general equally spaced sequences

``` r
seq(1, pi, length.out = 10)
```

```
 [1] 1.000000 1.237955 1.475909 1.713864 1.951819 2.189774 2.427728 2.665683
 [9] 2.903638 3.141593
```

---

* The `rep()` function is useful for generating sequences with specific patterns

* If we want to repeat a sequence:

``` r
rep(c(1, 2, 3), 2)  
```

```
[1] 1 2 3 1 2 3
```

* If we want to repeat each element a specified number of times:

``` r
rep(c(1, 2, 3), times = c(3, 2, 1))  
```

```
[1] 1 1 1 2 2 3
```
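
* A few related variants, added here as a sketch: the `each` argument of `rep()`, the `by` argument of `seq()`, and `seq_len()`:

``` r
rep(c(1, 2, 3), each = 2)    # 1 1 2 2 3 3
seq(0, 1, by = 0.25)         # 0.00 0.25 0.50 0.75 1.00
seq_len(5)                   # 1 2 3 4 5
```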

---

# Forming Subsets and Deleting Cases

---

* R uses bracket indexing `[]` to select elements from a vector or list

* An important difference from Python is that **R uses 1-based indexing**, and not 0-based indexing

* Suppose we define a vector `x`:

``` r
x <- c(3, 7, 5, 9, 12, 3, 14, 2)
```

* To retrieve the second element (index 2), we can use

``` r
x[[2]]
```

```
[1] 7
```

``` r
x[2]
```

```
[1] 7
```

---

* To retrieve a _group_ of elements, we must use the second form (single brackets), with a
vector as index:

``` r
x[c(1, 3)]
```

```
[1] 3 5
```

* To exclude elements, we use negative indices

* To exclude the 3rd element:

``` r
x[-3]
```

```
[1]  3  7  9 12  3 14  2
```

---

* We can also use **logical indexing**

* To select all elements of `x` that are greater than 3:

``` r
x[x > 3]
```

```
[1]  7  5  9 12 14
```
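
* Logical indexing works because the comparison itself is vectorized and returns one `TRUE` or `FALSE` per element. A short sketch using the same `x`:

``` r
x <- c(3, 7, 5, 9, 12, 3, 14, 2)

x > 3                 # logical vector, one value per element
which(x > 3)          # positions of the TRUE values: 2 3 4 5 7
sum(x > 3)            # number of elements greater than 3: 5
x[x > 3 & x < 10]     # conditions can be combined: 7 5 9
```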

---

# Combining Several Vectors

* To combine several short vectors into a single longer vector, use `c()`:

``` r
z1 <- c(1, 2, 3)  
z2 <- c(4)
z3 <- c(5, 6, 7, 8)  
c(z1, z2, z3)  
```

```
[1] 1 2 3 4 5 6 7 8
```

---

# Modifying Data: Replace values in existing vector

* R uses subsetting combined with assignment

* To change the `12` (the 5th element) in `x` to `11`:

``` r
x
```

```
[1]  3  7  5  9 12  3 14  2
```

``` r
x[5] <- 11  
x  
```

```
[1]  3  7  5  9 11  3 14  2
```

* To change elements 1 and 3 to `15` and `16`:

``` r
x[c(1, 3)] <- c(15, 16)  
x  
```

```
[1] 15  7 16  9 11  3 14  2
```

---

# Reference versus copy

* R copies vectors upon modification (does not modify in-place). For example:

``` r
x
```

```
[1] 15  7 16  9 11  3 14  2
```

``` r
y <- x # y is a copy  
x[3] <- 100  
x
```

```
[1]  15   7 100   9  11   3  14   2
```

``` r
y
```

```
[1] 15  7 16  9 11  3 14  2
```

* This behavior (implicit copying on modification) simplifies many tasks

* Python does **not** copy implicitly in such situations

* If required, copies must be made explicitly
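
* The same copy-on-modification behavior applies to function arguments. In this sketch, `zero_first()` is a hypothetical helper that modifies its argument; the caller's vector is left untouched:

``` r
x <- c(15, 7, 16, 9)

zero_first <- function(v) {
    v[1] <- 0   # changes the function's own copy only
    v
}

y <- zero_first(x)
x   # still 15 7 16 9
y   # 0 7 16 9
```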

---

# Useful Features

Interacting with the R environment

---

# Getting Help

* Online help is available for most R functions

* You can use the `?` operator followed by the function name, or the `help()` function

```r
?median  
help("median")
```

* You may not always know the exact function name beforehand

* You can still use the `??` operator to search the documentation for keywords

```r
??normal
```

---

# Listing and Undefining Variables

* To find out which variables we have defined in the current session:

``` r
ls()
```

```
 [1] "adult.lit"   "f"           "higher.educ" "i"           "pm25"       
 [6] "SAD"         "SAD_vals"    "showCall"    "SSD"         "SSD_vals1"  
[11] "SSD_vals2"   "theta_vals"  "time"        "x"           "x_points"   
[16] "y"           "z1"          "z2"          "z3"         
```

* To remove a variable to free up memory / clean up your workspace:

``` r
rm(theta_vals, SSD_vals1, SSD_vals2, SSD)
ls()
```

```
 [1] "adult.lit"   "f"           "higher.educ" "i"           "pm25"       
 [6] "SAD"         "SAD_vals"    "showCall"    "time"        "x"          
[11] "x_points"    "y"           "z1"          "z2"          "z3"         
```

---

# Saving Your Work

* R provides mechanisms to save variables and record sessions.

* To save variables for later use:

```r
save(higher.educ, pm25, file = "examples.rda")
```

* This saves the specified variables to a file in a special binary format

* Can be reloaded later in a different R session using `load("examples.rda")`
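
* A round trip can be sketched as follows. The vector is a shortened copy for illustration, and a temporary file (via `tempfile()`) is used instead of `"examples.rda"` so that nothing in the working directory is overwritten:

``` r
higher.educ <- c(56, 12, 4, 5)       # shortened copy for illustration
path <- tempfile(fileext = ".rda")   # temporary file location

save(higher.educ, file = path)       # write the variable to disk
rm(higher.educ)                      # remove it from the workspace
load(path)                           # restore it under its original name
higher.educ
```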

---

# Loading files

* Data files saved in R (using `save()`) can be read in using `load()`

```r
load("examples.rda")
```

* R code can also be saved in a file (typically with extension `.R`)

* We can run such a script, as a series of commands, using

```r
source("/path/to/script.R")
```

* Good practice: Open an "R Script" to write / edit code instead of typing at the prompt

* Saving this file keeps a record of what you have done

---

# Importing data stored in other formats

* Small datasets can be typed in at the R console to illustrate basic usage

* Real world datasets are too large for this to be feasible

* Typically distributed in a variety of formats

* Easiest to import: text formats such as

* CSV (comma-separated values)

* JSON (JavaScript Object Notation)

* Often distributed in proprietary or specialized formats meant for specific software:

* `.xls` or `.xlsx` files exported by Microsoft Excel

* `.xpt` files exported by SAS

* `.sav` files exported by SPSS

* `.dta` files exported by Stata.

---

# Importing data stored in proprietary formats

* Not always guaranteed that R will be able to read data from such files

* But most common formats are supported (through add-on packages)

* See the [R Data Import/Export](https://cran.isid.ac.in/doc/manuals/r-devel/R-data.html) manual

* Also covers interacting with data stored in Database Management Systems (useful for large datasets)

* Most data import methods will import datasets as _data frames_

* Data frames basically combine multiple columns in a single container
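
* As a minimal sketch of importing a text format, the code below writes a two-row CSV to a temporary file and reads it back with `read.csv()`. The file contents are a small excerpt of the social indicators table, typed in for illustration:

``` r
# Write a tiny CSV to a temporary file, then read it back
path <- tempfile(fileext = ".csv")
writeLines(c("country,gnppc,pctlit_adult",
             "Nepal,45,5",
             "Afghanistan,50,2.5"), path)

d <- read.csv(path)
class(d)   # "data.frame"
names(d)   # column names taken from the header line
d$gnppc    # a numeric column
```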

---

# Data Frames

---

* Can be constructed explicitly using the `data.frame()` function

* Example: combine `higher.educ` and `adult.lit` along with country names

``` r
dsocial <-
    data.frame(country = c("Nepal", "Afghanistan", "Laos", "Ethiopia",
                           "Burma", "Libya", "Sudan", "Tanganyika",
                           "Uganda", "Pakistan", "China", "India",
                           "South Vietnam", "Nigeria", "Kenya", "Madagascar",
                           "Congo", "Thailand", "Bolivia", "Cambodia"),
               hedu = higher.educ,
               adlit = adult.lit)
```

---

* `dsocial` is now like a matrix / spreadsheet

``` r
dsocial
```

```
         country hedu adlit
1          Nepal   56   5.0
2    Afghanistan   12   2.5
3           Laos    4  17.5
4       Ethiopia    5   2.5
5          Burma   63  47.5
6          Libya   49  13.0
7          Sudan   34   9.0
8     Tanganyika    9   7.5
9         Uganda   14  27.5
10      Pakistan  165  13.0
11         China   69  47.5
12         India  220  19.3
13 South Vietnam   83  17.5
14       Nigeria    4  10.0
15         Kenya    5  22.5
16    Madagascar   21  33.5
17         Congo    4  37.5
18      Thailand  251  68.0
19       Bolivia  166  32.1
20      Cambodia   18  17.5
```


---

* Individual "columns" can be extracted using the `$` operator

``` r
dsocial$country
```

```
 [1] "Nepal"         "Afghanistan"   "Laos"          "Ethiopia"     
 [5] "Burma"         "Libya"         "Sudan"         "Tanganyika"   
 [9] "Uganda"        "Pakistan"      "China"         "India"        
[13] "South Vietnam" "Nigeria"       "Kenya"         "Madagascar"   
[17] "Congo"         "Thailand"      "Bolivia"       "Cambodia"     
```

``` r
dsocial$adlit
```

```
 [1]  5.0  2.5 17.5  2.5 47.5 13.0  9.0  7.5 27.5 13.0 47.5 19.3 17.5 10.0 22.5
[16] 33.5 37.5 68.0 32.1 17.5
```

``` r
mean(dsocial$adlit)
```

```
[1] 22.52
```
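
* Rows can be selected with the same bracket and logical indexing used for vectors. The sketch below uses a five-row excerpt of `dsocial`, stored under the illustrative name `d`:

``` r
d <- data.frame(country = c("Nepal", "Afghanistan", "Laos", "Ethiopia", "Burma"),
                hedu = c(56, 12, 4, 5, 63),
                adlit = c(5, 2.5, 17.5, 2.5, 47.5))

d[d$adlit > 10, ]            # rows where adult literacy exceeds 10%
d$country[d$hedu > 50]       # "Nepal" "Burma"
d[["hedu"]]                  # same column as d$hedu
nrow(d); ncol(d)             # 5 rows, 3 columns
```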

---

* Can also be imported from file

``` r
social_indicators <-
    read.csv("https://deepayan.github.io/BSDS/2026-01-DARP/slides/data/social-indicators-1964.csv",
             comment.char = "#")
head(social_indicators)
```

```
      Country GNP.per.Capita Percent.Urban Percent.Adult.Literacy
1       Nepal             45           4.4                    5.0
2 Afghanistan             50           7.5                    2.5
3        Laos             50           4.0                   17.5
4        Togo             50           4.5                    7.5
5    Ethiopia             55           1.7                    2.5
6       Burma             57          10.0                   47.5
  Higher.Ed.per.100000 Inhabitants.per.Physician Radios.per.1000
1                   56                     72000              NA
2                   12                     41000             1.7
3                    4                    100000             8.0
4                   NA                     58000             4.3
5                    5                    117000             4.5
6                   63                     15000             5.6
```

---

# Plotting data in data frames

---

* Possible using what we already know (but not recommended)

``` r
plot(sqrt(dsocial$hedu), sqrt(dsocial$adlit))
```

![plot of chunk plot-socind-default](figures/1-rsession-plot-socind-default-1.svg)

---

* Formula interface (used extensively in R)

``` r
plot(sqrt(adlit) ~ sqrt(hedu), data = dsocial)
```

![plot of chunk plot-socind-formula](figures/1-rsession-plot-socind-formula-1.svg)

---

* Formula interface in __lattice__ add-on package

``` r
lattice::xyplot(sqrt(adlit) ~ sqrt(hedu), data = dsocial)
```

![plot of chunk xyplot-socind](figures/1-rsession-xyplot-socind-1.svg)

---

* Similar approach in __ggplot2__ add-on package

``` r
ggplot2::ggplot(dsocial, mapping = ggplot2::aes(x = sqrt(hedu), y = sqrt(adlit))) + ggplot2::geom_point()
```

![plot of chunk ggplot-socind](figures/1-rsession-ggplot-socind-1.svg)

---

* Same plot for all countries in the original source: World Handbook of
  Political and Social Indicators, 1964

``` r
lattice::xyplot(sqrt(Percent.Adult.Literacy) ~ sqrt(Higher.Ed.per.100000), data = social_indicators)
```

![plot of chunk xyplot-socind-all](figures/1-rsession-xyplot-socind-all-1.svg)

---

* Data in original units

``` r
lattice::xyplot(Percent.Adult.Literacy ~ Higher.Ed.per.100000, data = social_indicators)
```

![plot of chunk xyplot-socind-raw](figures/1-rsession-xyplot-socind-raw-1.svg)

---

# Questions?