class: center, middle

# Data Collection and Summarization

## Statistics I — Data Exploration

### Deepayan Sarkar
---
$$ \newcommand{\sub}{_} \newcommand{\set}[1]{\left\lbrace {#1} \right\rbrace} \newcommand{\nseq}[1]{ {#1}\sub{1}, {#1}\sub{2}, \dotsc, {#1}\sub{n} } $$
# Goals

* Where do data come from?

* Can we classify datasets by how they were collected?

* How can we summarize data numerically or graphically?

---
layout: true

# Sources of data

---

* Traditional data types

    * Categorical - nominal / ordered
    * Numeric - discrete / continuous

--

* Data "modes" that are more difficult to analyse (but are of increasing interest)

    * Free text
    * Images
    * Sound
    * Many others

--

* But how are such data typically collected?

---

* Key methodologies

    * Census
    * Sample survey
    * Observational studies / Case-control studies
    * Randomized studies / Randomized controlled trials

* Key concepts

    * Observational units
    * Population
    * Sample

---
layout: true

# Some scenarios

---

* We will consider some specific scenarios along with a question of interest

* In each case, we want to plan a data collection experiment that will answer the question

---

* What is the leading cause of death for adults in India?

    * Does it vary by sex?
    * Does it vary by state? By district?
    * Does it vary by education level? By income?
    * Do the answers change over time?

--

* What is the target population?
* Can we do a census?
* Can we do a survey?
* How should we sample?
* What are the observational units?
* Observational or randomized? Controls?

---

* How much money does a typical adult in India make?

    * Does it vary by sex?
    * Does it vary by state? By district?
    * Does it vary by education level? By employment status?
    * Do the answers change over time?

--

* What is the target population?
* Can we do a census?
* Can we do a survey?
* How should we sample?
* What are the observational units?
* Observational or randomized? Controls?

---

* Consider a "random" person who will be born in India in 2025

* How likely is it that they will be born on January 1? January 2? ... December 31?

--

* What is the target population?
* Can we do a census?
* Can we do a survey?
* How should we sample?
* What are the observational units?
* Observational or randomized? Controls?

--

* Here is a [relevant dataset](https://www.ons.gov.uk/visualisations/nesscontent/dvc307/line_chart/data.csv) (CSV) from [England and Wales](https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/livebirths/articles/howpopularisyourbirthday/2015-12-18)

* Can we use these data to answer the same question for England and Wales?

--

* A similar but more important question: how do we predict the __air quality__ in Delhi from October to December?

---

* A medicine factory produces tablets that are supposed to have a certain chemical composition

* A regulator wants to check whether the actual composition is within acceptable limits

* However, the only way to test a tablet is "destructive", i.e., the tablet cannot be used afterward

--

* What is the target population?
* Can we do a census?
* Can we do a survey?
* How should we sample?
* What are the observational units?
* Observational or randomized? Controls?

---

* A new yearly injection can potentially delay the onset of diabetes

* However, both the amount of benefit and the possible side effects are currently unknown

* We want to plan an experiment to study both

--

* What is the target population?
* Can we do a census?
* Can we do a survey?
* How should we sample?
* What are the observational units?
* Observational or randomized? Controls?

---

* Your friend loves a particular type of pizza from Dominos

* He also claims he can differentiate between the pizzas from two different outlets in your neighbourhood

* You want to test whether this claim is true

--

* What is the target population?
* Can we do a census?
* Can we do a survey?
* How should we sample?
* What are the observational units?
* Observational or randomized? Controls?

---

* What is the GDP of India? What is the average life expectancy in India?

* How do they compare with other countries?

* How have they changed over time?

--

* How do we measure?

* More basic question: how are they defined?

--

* These are by definition __summary__ measures

* Example: [TSV data](https://deepayan.github.io/BSDS/2024-01-DE/data/gapminder.tsv) and [visualization](https://www.gapminder.org/tools/) from GapMinder

* The actual process to calculate these from unit-level data is probably quite complicated

---
layout: true

# What do we actually do with data?

---

* The process of data analysis often involves just calculating various summary measures

--

* In fact, the technical name for a summary measure computed from data is a __statistic__

* In this course, we will mostly learn about commonly used summary statistics

--

* But it is also important to remember that summary statistics are usually computed from _samples_

* We need to be careful when we use them to make conclusions about a larger _population_

---

* Example: Are birthdays equally likely?

    * Let $X$ be the smallest birth month frequency in a 'random sample' of $n = 65$ people
    * We take one sample, where we observe $X = 3$. What can we conclude?

--

```r
min_count <- function(n, pmonths = NULL) {
    ## birth months of n random people (1 = Jan, ..., 12 = Dec),
    ## equally likely if pmonths is NULL
    b <- sample(1:12, n, prob = pmonths, replace = TRUE)
    T <- table(b)
    if (length(T) < 12) {
        return(0) # some month did not appear at all
    } else {
        return(min(T))
    }
}
```

---

```r
replicate(10000, min_count(65)) |> table() |> prop.table() |> barplot()
```

---

```r
replicate(10000, min_count(65, pmonths = c(31, 28, 31, 30, 31, 30,
                                           31, 31, 30, 31, 30, 31) / 365)) |>
    table() |> prop.table() |> barplot()
```

---

* Example: Are birthdays equally likely?

    * Let $X$ be the smallest birth month frequency in a 'random sample' of $n = 65$ people
    * We take one sample, where we observe $X = 3$. What can we conclude?

--

* We __cannot__ conclude that birthdays are _not_ equally likely

---
layout: true

# What statistics are useful?

---

* This is a natural and important question

* The answer depends on the type of data and the problem of interest

---

* How can we summarize categorical variables such as _cause of death_?

--

* We usually want to find the probabilities of various categories

* The natural summary statistic is the sample proportion

* The 'most likely' category is known as the __mode__

---

* How can we summarize numeric variables such as _income_?

* This is much more difficult to answer

* Common summary statistics: __mean__ and __median__

* But these capture only limited aspects of the distribution

--

* We will learn about mean, median, and similar summary statistics

* But first we will learn about the _empirical distribution_

---
layout: true

# Empirical distribution

---

* Suppose we have observed values $\nseq{X}$ of some variable

* Assume that the values of $\nseq{X}$ are all _distinct_

* The empirical distribution is a probability distribution whose

    * Sample space is the set $\set{ \nseq{X} }$
    * All elements in the sample space are equally likely (each has probability $1/n$)

--

* This is adjusted suitably if values are not distinct

    * The sample space consists of all distinct values
    * If the $i$th value appears $k\sub{i}$ times, the corresponding probability is $k\sub{i} / n$
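--

* A minimal sketch of this definition in R, using a few made-up values: the empirical probabilities are just the relative frequencies

```r
x <- c(3, 5, 3, 2, 5, 3)  # hypothetical observed values (n = 6)
prop.table(table(x))      # distinct values with probabilities k_i / n
```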
---

* The empirical distribution contains all information in the data...

* as long as the observations are "independent" (a 'random sample')

* As it is a probability distribution, we can apply the tools we have to study distributions

--

* Why is the empirical distribution important?

    * We have seen that sample proportions 'converge' to probabilities as sample size $n$ increases
    * More generally, the empirical distribution also 'converges' to the population distribution

--

* This is one of the _main justifications_ for using observed data to 'infer' about the underlying population

--

* This convergence is mathematically more complicated (and will not be discussed in this course)

* However, understanding this convergence graphically is important
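--

* A small simulation sketch of the first kind of convergence (assuming a fair die): the sample proportion of sixes should approach $1/6 \approx 0.167$

```r
## proportion of sixes in n rolls of a fair die, for increasing n
sapply(c(10, 100, 1000, 10000, 100000),
       function(n) mean(sample(1:6, n, replace = TRUE) == 6))
```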
---
layout: true

# Visualizing the empirical distribution

---

* Visualizing the empirical distribution is important but challenging

--

* Let us consider the height data we collected in our survey

```r
survey <- read.csv("https://deepayan.github.io/BSDS/2024-01-DE/data/bsds-survey.csv")
height <- survey$height
height
```

```
 [1] 160.000 181.000 155.000 167.000 185.000 173.000 176.000 173.000 175.000
[10] 183.000 178.000 175.000 165.000 190.000 175.000 168.000 164.000 182.000
[19] 168.000 168.000 180.000 170.000 172.000 180.000 180.000 130.000 166.000
[28] 176.000 175.000 175.000 169.500 170.000 152.400 169.000 180.000 168.000
[37] 192.024 163.000 162.000 175.000 176.000 169.000 175.000 165.000 182.000
[46] 157.000 170.000 173.000 172.000 178.000 178.000 176.000 178.000 167.640
[55] 162.000 182.000 165.000 175.000 186.000 178.000 178.000 172.000 167.640
[64] 159.000 159.000
```

* How do we visualize it?

---

```r
stripchart(height, method = "stack")
```

---

* For comparison, let us simulate data uniformly between 150 and 195

```r
n <- length(height)
u <- sample(150:195, n, replace = TRUE)
stripchart(list(unif = u, obs = height), method = "stack")
```

---

* Are the two samples similar or very different? Not very easy to say

--

* Another possible way to compare

```r
par(mfrow = c(1, 2)); plot(u, ylim = c(150, 195)); plot(height, ylim = c(150, 195))
```

---

* There seems to be some qualitative difference

* This difference becomes clearer if we plot _sorted_ data

```r
par(mfrow = c(1, 2)); plot(sort(u), ylim = c(150, 195)); plot(sort(height), ylim = c(150, 195))
```

---

* It is more traditional to swap the axes and convert the y-axis to a probability scale

```r
par(mfrow = c(1, 2)); plot(sort(u), ppoints(n), xlim = c(150, 195)); plot(sort(height), ppoints(n), xlim = c(150, 195))
```

---

* This is equivalent to the _empirical cumulative distribution function_ (ECDF)

```r
par(mfrow = c(1, 2)); plot(ecdf(u), xlim = c(150, 195)); plot(ecdf(height), xlim = c(150, 195))
```

---
layout: true

# The empirical cumulative distribution function (ECDF)

---

* Height data from a different source (the NHANES survey)

```r
library(NHANES)
dim(NHANES)
```

```
[1] 10000    76
```

```r
nhsub1 <- subset(NHANES, !is.na(Height)) # remove missing height values
n1 <- nrow(nhsub1)
n1
```

```
[1] 9647
```

---

* ECDF of height of a random sample of Americans

```r
plot(sort(nhsub1$Height), ppoints(n1), type = "l")
```

---

* Low heights are from children

* We are interested in the height distribution among adults

```r
nhsub2 <- subset(nhsub1, Age >= 20) # remove children
n2 <- nrow(nhsub2)
n2
```

```
[1] 7182
```

---

* ECDF of height of a random sample of American adults

```r
plot(sort(nhsub2$Height), ppoints(n2), type = "l")
```

---

* ECDF of the magic distribution: Normal

```r
plot(sort(rnorm(n2)), ppoints(n2), type = "l")
```

---
layout: true

# Quantile-Quantile plots

---

* These ECDF plots essentially show sorted data vs equally spaced probabilities

--

* Useful alternative: plot sorted data against sorted data from the 'Normal distribution'

```r
par(mfrow = c(1, 2)); plot(sort(rnorm(n2)), sort(nhsub2$Height))
```

---

* How do we get data from the 'Normal distribution'? Simulate using `rnorm()`

* Problem: Every time we simulate, we will get a slightly different plot

```r
par(mfrow = c(1, 2)); plot(sort(rnorm(n2)), sort(nhsub2$Height))
```

---

* Solution: Instead of simulating, use something called 'theoretical quantiles' of the Normal distribution

* The result is known as a _Normal Quantile-Quantile Plot_ or simply _Normal Q-Q plot_

```r
par(mfrow = c(1, 2)); plot(sort(rnorm(n2)), sort(nhsub2$Height)); qqnorm(nhsub2$Height)
```
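--

* The 'theoretical quantiles' can in fact be computed directly with `qnorm()`; a minimal sketch (reusing `n2` and `nhsub2` from before) that essentially reproduces `qqnorm()`:

```r
## Normal quantiles at equally spaced probabilities, plotted against sorted data
plot(qnorm(ppoints(n2)), sort(nhsub2$Height))
```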
---

* As before, the lattice package has a different implementation that is more powerful

```r
qqmath(~ Height, data = nhsub2, grid = TRUE, aspect = 1,
       groups = Gender, auto.key = TRUE)
```

---
layout: true

# Questions

---

* This raises several important questions:

    * What is the Normal distribution and why should we compare with it?
    * What are quantiles?
    * What are theoretical quantiles?
    * What can Q-Q plots tell us?

--

* The first two questions will be discussed in your probability course

* We will give very vague answers for now

---

* What is the Normal distribution?

    * The Normal distribution is a very simple _continuous_ distribution
    * It is best for now to think of it as an _approximation_ to various distributions that we observe in real life
    * We will see examples of this soon

---

* What are quantiles?

    * In Q-Q plots we plot sorted data
    * Sorted data have a special name in statistics: _order statistics_

--

* Specifically, the $k$-th value of the $X\sub{i}$ in sorted order is known as the $k$-th order statistic

* The standard notation for the $k$-th order statistic is $X\sub{(k)}$

--

* Roughly speaking, quantiles are order statistics, but identified by their _relative rank_ (like percentiles)

---

* Examples of quantiles:

    * The $90$-th percentile is the same as the $0.9$ quantile
    * This is defined as the number $Q$ such that 90% of the data is less than (or equal to) $Q$
    * In a sample of size $n = 100$, this would be $X\sub{(90)}$
    * In a sample of size $n = 1000$, this would be $X\sub{(900)}$

--

* In a sample of size $n = 7182$, this would be ??

--

```r
0.9 * 7182
```

```
[1] 6463.8
```

---

* Several approximations are available; see `help(quantile)`

```r
with(subset(nhsub2, Gender == "male"),
     quantile(Height, probs = c(0.2, 0.4, 0.6, 0.8)))
```

```
  20%   40%   60%   80% 
169.7 173.9 177.5 181.8 
```

```r
with(subset(nhsub2, Gender == "female"),
     quantile(Height, probs = c(0.2, 0.4, 0.6, 0.8)))
```

```
   20%    40%    60%    80% 
155.80 160.30 164.00 168.16 
```

---

* What are theoretical quantiles?

    * Similar idea, but with probability instead of relative rank
    * The $p$-th quantile is the number $Q$ such that $P(X \leq Q) = p$
    * Needs to be modified suitably to account for discrete jumps

--

* R has _quantile functions_ for all standard distributions

```r
qbinom(c(0.2, 0.4, 0.6, 0.8), size = 20, prob = 0.25)
```

```
[1] 3 4 5 7
```

```r
qbinom(c(0.2, 0.4, 0.6, 0.8), size = 200, prob = 0.25)
```

```
[1] 45 48 51 55
```

```r
qnorm(c(0.2, 0.4, 0.6, 0.8))
```

```
[1] -0.8416212 -0.2533471  0.2533471  0.8416212
```
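--

* These quantile functions are (essentially) inverses of the corresponding cumulative probability functions; a quick sanity check for the continuous case:

```r
## the p-th quantile Q satisfies P(X <= Q) = p, so this should return the input probabilities
pnorm(qnorm(c(0.2, 0.4, 0.6, 0.8)))
```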
---

* Why are Q-Q plots useful?

    * Sample quantiles converge to theoretical quantiles of the underlying population
    * Q-Q plots compare order statistics with corresponding theoretical quantiles

--

* Why not the ECDF?

    * This has to do with _human perception_
    * We find it easier to detect departures from a straight line (as opposed to a curve)

--

* In principle, the theoretical quantiles can be from any appropriate distribution

* The Normal distribution has been found to be the most useful default choice

---

* Two final points before moving on from Q-Q plots

    * In what sense do sample quantiles converge to population quantiles?
    * Why is the Normal distribution a useful reference for Q-Q plots?

* We will illustrate both of these through simulation

* Theoretical answers will be discussed later in other courses

---
layout: true

# Convergence of sample quantiles

---

* Suppose data are generated from the Normal distribution

* Ideally, the Q-Q plot should look like a straight line

* However, this will rarely happen for small sample sizes

---

```r
library(latticeExtra)
c(qqmath(~ rnorm(  10)), qqmath(~ rnorm(  10)), qqmath(~ rnorm(  10))) |>
    update(grid = TRUE, aspect = 1)
```

---

```r
c(qqmath(~ rnorm(  50)), qqmath(~ rnorm(  50)), qqmath(~ rnorm(  50))) |>
    update(grid = TRUE, aspect = 1)
```

---

```r
c(qqmath(~ rnorm( 100)), qqmath(~ rnorm( 100)), qqmath(~ rnorm( 100))) |>
    update(grid = TRUE, aspect = 1)
```

---

```r
c(qqmath(~ rnorm( 500)), qqmath(~ rnorm( 500)), qqmath(~ rnorm( 500))) |>
    update(grid = TRUE, aspect = 1)
```

---

```r
c(qqmath(~ rnorm(5000)), qqmath(~ rnorm(5000)), qqmath(~ rnorm(5000))) |>
    update(grid = TRUE, aspect = 1)
```

---

* Exercise:

    * Try this for more sample sizes
    * Repeat several times to understand variability (as a function of sample size)

--

* Q-Q plots can compare with quantiles of other distributions as well

* Let's try with the Binomial

---

```r
size <- 100; prob <- 0.25; x <- rbinom(  10, size = size, prob = prob)
qqmath(~ x, distribution = function(p) qbinom(p, size = size, prob = prob),
       grid = TRUE, aspect = 1)
```

---

```r
size <- 100; prob <- 0.25; x <- rbinom(  50, size = size, prob = prob)
qqmath(~ x, distribution = function(p) qbinom(p, size = size, prob = prob),
       grid = TRUE, aspect = 1)
```

---

```r
size <- 100; prob <- 0.25; x <- rbinom( 100, size = size, prob = prob)
qqmath(~ x, distribution = function(p) qbinom(p, size = size, prob = prob),
       grid = TRUE, aspect = 1)
```

---

```r
size <- 100; prob <- 0.25; x <- rbinom( 500, size = size, prob = prob)
qqmath(~ x, distribution = function(p) qbinom(p, size = size, prob = prob),
       grid = TRUE, aspect = 1)
```

---

```r
size <- 100; prob <- 0.25; x <- rbinom(5000, size = size, prob = prob)
qqmath(~ x, distribution = function(p) qbinom(p, size = size, prob = prob),
       grid = TRUE, aspect = 1)
```

---

* So we see a similar pattern:

    * More variable for low sample sizes
    * Converges to a straight line as sample size increases

--

* But what happens if we compare with Normal quantiles instead of Binomial quantiles?
---

```r
size <- 100; prob <- 0.25; x <- rbinom(  10, size = size, prob = prob)
qqmath(~ x, grid = TRUE, aspect = 1)
```

---

```r
size <- 100; prob <- 0.25; x <- rbinom(  50, size = size, prob = prob)
qqmath(~ x, grid = TRUE, aspect = 1)
```

---

```r
size <- 100; prob <- 0.25; x <- rbinom( 100, size = size, prob = prob)
qqmath(~ x, grid = TRUE, aspect = 1)
```

---

```r
size <- 100; prob <- 0.25; x <- rbinom( 500, size = size, prob = prob)
qqmath(~ x, grid = TRUE, aspect = 1)
```

---

```r
size <- 100; prob <- 0.25; x <- rbinom(5000, size = size, prob = prob)
qqmath(~ x, grid = TRUE, aspect = 1)
```

---

* This suggests that the Binomial$(n = 100, p = 0.25)$ distribution is well approximated by the Normal

--

* This approximation improves with the number of Bernoulli trials $n$

* Let's try Binomial$(n = 1000, p = 0.25)$

---

```r
size <- 1000; prob <- 0.25; x <- rbinom(5000, size = size, prob = prob)
qqmath(~ x, grid = TRUE, aspect = 1)
```

---

* But the approximation is not as good for, say, Binomial$(n = 1000, p = 0.0025)$

```r
size <- 1000; prob <- 0.0025; x <- rbinom(5000, size = size, prob = prob)
qqmath(~ x, grid = TRUE, aspect = 1)
```

---

* Comparing to the true distribution will lead to a better visual fit

```r
qqmath(~ x, grid = TRUE, aspect = 1,
       distribution = function(p) qbinom(p, size = size, prob = prob))
```

---

* Summary

    * Sample quantiles (order statistics) converge to "theoretical quantiles" as sample size increases
    * Theoretical quantiles essentially determine the underlying population

--

* This means that looking at sample quantiles (through the ECDF) tells us about the underlying population

* Q-Q plots transform the ECDF to compare with a reference distribution (usually Normal)

--

* The Normal distribution is a good reference in a surprisingly wide range of situations

* However, there are also many situations where it is not appropriate

* Systematic patterns in the Normal Q-Q plot can suggest alternative distributions (later)

---
layout: true

# Comparing multiple samples

---

* Q-Q plots are typically used to compare a sample to a reference distribution

* We are usually more interested in comparing two or more samples

---

* This is possible, as we saw before: compare heights of males and females in NHANES data

```r
qqmath(~ Height, data = NHANES, grid = TRUE, aspect = 1,
       groups = Gender, auto.key = TRUE, subset = Age >= 20)
```
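--

* A more direct option is a _two-sample Q-Q plot_, plotting the quantiles of one sample against those of the other; a minimal sketch using the lattice function `qq()` (with the `NA`-free subset `nhsub2` from before):

```r
## quantiles of male heights vs quantiles of female heights
qq(Gender ~ Height, data = nhsub2, aspect = 1)
```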
---

* The difference can be summarized as a _constant_ shift in all the quantiles

* This can also be seen in a smaller subset of quantiles:

```r
s <- with(subset(NHANES, Age >= 20), split(Height, Gender))
str(s)
```

```
List of 2
 $ female: num [1:3683] 168 167 167 167 148 ...
 $ male  : num [1:3552] 165 165 165 170 182 ...
```

```r
lapply(s, quantile, probs = c(0.2, 0.4, 0.6, 0.8), na.rm = TRUE)
```

```
$female
   20%    40%    60%    80% 
155.80 160.30 164.00 168.16 

$male
  20%   40%   60%   80% 
169.7 173.9 177.5 181.8 
```

---

* It is traditional to summarize a sample with a specific set of __five__ quantiles

* This is known as the [__five number summary__](https://en.wikipedia.org/wiki/Five-number_summary), and can be produced by `fivenum()`

```r
lapply(s, fivenum)
```

```
$female
[1] 134.5 157.2 162.1 167.0 184.5

$male
[1] 149.4 170.9 175.6 180.5 200.4
```

---

* These are actually just the quantiles corresponding to probabilities $0, \frac14, \frac12, \frac34, 1$

* In other words: minimum, _first quartile_, median, _third quartile_, maximum

```r
lapply(s, quantile, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
```

```
$female
     0%     25%     50%     75%    100% 
134.500 157.200 162.100 166.975 184.500 

$male
    0%    25%    50%    75%   100% 
 149.4  170.9  175.6  180.5  200.4 
```

---

* We usually compare these quantiles _graphically_ using a __box and whisker plot__

```r
bwplot(Gender ~ Height, data = NHANES, subset = Age >= 20, coef = 0)
```

---

* The _compactness_ of this design makes it easier to compare multiple groups together

```r
bwplot(Race1 ~ Height | Gender, data = NHANES, subset = Age >= 20, coef = 0)
```

---

* Common axes make comparison easier

```r
bwplot(Height ~ Gender | Race1, data = NHANES, subset = Age >= 20,
       layout = c(5, 1), coef = 0)
```

---

* Similar plot for a different variable: `Weight`

```r
bwplot(Weight ~ Gender | Race1, data = NHANES, subset = Age >= 20,
       layout = c(5, 1), coef = 0)
```

---

* What happens if we drop the `coef = 0`? It additionally plots "outlier" data points that would be too extreme if the data were Normal

```r
bwplot(Weight ~ Gender | Race1, data = NHANES, subset = Age >= 20, layout = c(5, 1))
```

---

* We can go back to Q-Q plots if we see something unusual

```r
qqmath(~ Weight | Race1, data = NHANES, subset = Age >= 20, layout = c(5, 1),
       groups = Gender, auto.key = TRUE, grid = TRUE)
```

---

* Sometimes transforming the data can make the distribution "closer" to Normal

```r
qqmath(~ Weight^(1/3) | Race1, data = NHANES, subset = Age >= 20, layout = c(5, 1),
       groups = Gender, auto.key = TRUE, grid = TRUE)
```

---

* Why do we care about the Normal distribution?

--

* Mainly because

    * It is a good approximation for many observed distributions
    * It is defined by just __two__ parameters, which can be recovered from the median and the length of the box
    * The length of the box is known as the __IQR__ or _inter-quartile range_
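--

* A small sketch of why the median and the IQR are enough: for any Normal distribution, the IQR is a fixed multiple ($\approx 1.35$) of the standard deviation

```r
diff(qnorm(c(0.25, 0.75)))             # IQR of the standard Normal, ~ 1.349
IQR(rnorm(100000, mean = 160, sd = 7)) # should be close to 1.349 * 7
```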
--

* Many statistical problems have "nice" solutions when the underlying population is Normal

* So one focus of initial "data exploration" is to decide if observed data look like Normal

--

* This means that we need to be able to tell when data are _not_ like Normal

* This is possible from a Q-Q plot, but it is not as easy to describe the __type of departure__

--

* For this, another tool known as the __histogram__ is more useful

---
layout: true

# Data binning and histograms

---

* The idea of binning is useful in a variety of contexts

* Motivation from a data summary perspective:

    * Categorical or discrete data can be summarized by table counts (without loss of information)
    * Binning provides a similar summary for continuous data (with some loss of information)

--

* Another motivation is probabilistic, which we will discuss later

---

* Consider the NHANES height data (after removing `NA`-s and rows with Age less than 20)

```r
str(nhsub2$Height) # raw data ~ 7000 records
```

```
 num [1:7182] 165 165 165 168 167 ...
```

```r
T1 <- table(nhsub2$Height); str(T1) # table of unique values (lossless summary) ~ 500 values
```

```
 'table' int [1:510(1d)] 1 1 1 1 1 2 2 6 1 1 ...
 - attr(*, "dimnames")=List of 1
  ..$ : chr [1:510] "134.5" "137.3" "139.8" "139.9" ...
```

--

```r
T2 <- table(round(nhsub2$Height)); str(T2) # table of rounded values (summary with some loss) ~ 60 values
```

```
 'table' int [1:62(1d)] 1 1 3 4 7 6 16 13 16 21 ...
 - attr(*, "dimnames")=List of 1
  ..$ : chr [1:62] "134" "137" "140" "141" ...
```

---

```r
xyplot(as.numeric(T1) ~ as.numeric(names(T1)), type = "h")
```

---

```r
xyplot(as.numeric(T2) ~ as.numeric(names(T2)), type = "h")
```

---

```r
xyplot(as.numeric(T2) ~ as.numeric(names(T2)), type = "h",
       panel = panel.barchart, horizontal = FALSE, origin = 0)
```

---

```r
T3 <- table(10 * round(nhsub2$Height / 10)) # round to the nearest 10
xyplot(as.numeric(T3) ~ as.numeric(names(T3)), type = "h",
       panel = panel.barchart, box.width = 8, horizontal = FALSE, origin = 0)
```

---

* Histograms basically generalize this kind of rounding

* They are determined by a sequence of 'break points' defining bins (covering all the data)

* The data are summarized by the number (or relative frequency) of data points falling in each bin

--

* Data points may coincide with bin boundaries (break points)

* We need to be careful to specify which bin such data points should be counted in

--

* Histogram bins _usually_ all have the same width (the break points are equally spaced)

---

* Counting the number of data points in each bin is not trivial

* R has three related functions for this:

    * `cut()` converts numeric data into categories defined by breakpoints
    * `findInterval()` is similar but more efficient with a different return value
    * `hist()` computes the bin _counts_ directly

---

```r
cut(nhsub2$Height, breaks = seq(125, 205, by = 10)) |> head(30)
```

```
 [1] (155,165] (155,165] (155,165] (165,175] (165,175] (165,175] (165,175]
 [8] (165,175] (175,185] (165,175] (145,155] (175,185] (175,185] (165,175]
[15] (165,175] (165,175] (155,165] (175,185] (175,185] (175,185] (165,175]
[22] (165,175] (165,175] (145,155] (175,185] (145,155] (145,155] (165,175]
[29] (155,165] (155,165]
8 Levels: (125,135] (135,145] (145,155] (155,165] (165,175] ... (195,205]
```

--

```r
cut(nhsub2$Height, breaks = seq(125, 205, by = 10)) |> table()
```

```
(125,135] (135,145] (145,155] (155,165] (165,175] (175,185] (185,195] (195,205] 
        1        41       591      2048      2450      1658       377        16 
```
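--

* For comparison, a minimal sketch of `findInterval()` with the same breakpoints: it returns a bin _index_ for each data point instead of a factor

```r
## note: bins are left-closed here, unlike the right-closed bins of cut()
findInterval(nhsub2$Height, seq(125, 205, by = 10)) |> head(10)
```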
---

```r
hist(nhsub2$Height, breaks = seq(125, 205, by = 10), plot = FALSE) |> str()
```

```
List of 6
 $ breaks  : num [1:9] 125 135 145 155 165 175 185 195 205
 $ counts  : int [1:8] 1 41 591 2048 2450 1658 377 16
 $ density : num [1:8] 1.39e-05 5.71e-04 8.23e-03 2.85e-02 3.41e-02 ...
 $ mids    : num [1:8] 130 140 150 160 170 180 190 200
 $ xname   : chr "nhsub2$Height"
 $ equidist: logi TRUE
 - attr(*, "class")= chr "histogram"
```

---

```r
histogram(~ Height, NHANES, breaks = seq(125, 205, by = 10), subset = Age >= 20,
          type = "count")
```

---

```r
histogram(~ Height, NHANES, breaks = seq(125, 205, by = 1), subset = Age >= 20,
          type = "percent")
```

---

```r
histogram(~ Height, NHANES, nint = 30, subset = Age >= 20) # specify number of bins
```

---

```r
histogram(~ Height | Gender, NHANES, nint = 30, subset = Age >= 20, layout = c(1, 2))
```

---
layout: false

# Histograms vs bar charts

* In some ways similar, but also different

    * Bar charts are meant for discrete or categorical data (with pre-specified categories)
    * Histograms are meant for continuous data

--

* This is reflected in the lack of _gaps_ between bars in a histogram

--

* Relation with probability theory

    * Bar charts estimate the population _probability mass function_ (categorical / discrete)
    * Histograms estimate the population _probability density function_ (continuous)
    * We will discuss these theoretical connections later
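--

* A hint of this connection (a sketch, using the lattice `histogram()` as above): with `type = "density"`, bar heights are rescaled so that the total bar _area_ is 1, as for a probability density

```r
histogram(~ Height, NHANES, nint = 30, subset = Age >= 20, type = "density")
```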
---

# Counting bin frequencies

* Easy with computers

* Not so easy without one

--

* Example: height data from survey

```r
survey <- read.csv("https://deepayan.github.io/BSDS/2024-01-DE/data/bsds-survey.csv")
height <- round(survey$height)
height
```

```
 [1] 160 181 155 167 185 173 176 173 175 183 178 175 165 190 175 168 164 182 168
[20] 168 180 170 172 180 180 130 166 176 175 175 170 170 152 169 180 168 192 163
[39] 162 175 176 169 175 165 182 157 170 173 172 178 178 176 178 168 162 182 165
[58] 175 186 178 178 172 168 159 159
```

---

# Stem and leaf plot

* A very crude way of counting

```r
stem(height)
```

```

  The decimal point is 1 digit(s) to the right of the |

  13 | 0
  14 | 
  15 | 25799
  16 | 022345556788888899
  17 | 0000222333555555556666888888
  18 | 00001222356
  19 | 02
```

---
layout: true

# Histograms for Normal data

---

```r
c(histogram(~ rnorm(  10)), histogram(~ rnorm(  10)), histogram(~ rnorm(  10))) |>
    update(grid = TRUE, aspect = 1)
```

---

```r
c(histogram(~ rnorm(  50)), histogram(~ rnorm(  50)), histogram(~ rnorm(  50))) |>
    update(grid = TRUE, aspect = 1)
```

---

```r
c(histogram(~ rnorm( 100)), histogram(~ rnorm( 100)), histogram(~ rnorm( 100))) |>
    update(grid = TRUE, aspect = 1)
```

---

```r
c(histogram(~ rnorm( 500)), histogram(~ rnorm( 500)), histogram(~ rnorm( 500))) |>
    update(grid = TRUE, aspect = 1)
```

---

```r
c(histogram(~ rnorm(5000)), histogram(~ rnorm(5000)), histogram(~ rnorm(5000))) |>
    update(grid = TRUE, aspect = 1)
```

---
layout: true

# Histograms for NHANES data

---

```r
histogram(~ Height | Race1 + Gender, NHANES, subset = Age >= 20)
```

---

```r
histogram(~ Weight | Race1 + Gender, NHANES, subset = Age >= 20)
```

---

```r
histogram(~ Weight^(1/3) | Race1 + Gender, NHANES, subset = Age >= 20)
```

---
layout: true

# Example: Investment simulation

---

* A person starts with a pool of 100 rupees that they want to invest for 30 years

* Every year, they have the option of choosing either

    * A safe scheme, which is guaranteed to grow an investment of amount $X$ to $1.08 X$
    * A risky scheme, which turns an investment of $X$ into either $1.5 X$ or $X / 1.5$ with equal probability

--

* Suppose the person chooses either of the schemes randomly each year (with equal probability)

* Let $Y$ denote the amount after 30 years

* What does the distribution of $Y$ look like?

---

```r
sim_investment <- function(n = 30) {
    ## yearly growth factors: safe with probability 0.5,
    ## otherwise risky (up or down with equal probability)
    f <- sample(c(1.08, 1.5, 1/1.5), n, replace = TRUE,
                prob = c(0.5, 0.25, 0.25))
    100 * prod(f)
}
Y <- replicate(10000, sim_investment(30))
```

---

```r
qqmath(~ Y, grid = TRUE, aspect = 1)
```

---

```r
histogram(~ Y, grid = TRUE)
```

---

```r
qqmath(~ log(Y), grid = TRUE, aspect = 1)
```

---

```r
histogram(~ log(Y), grid = TRUE)
```

---
layout: true

# Exercises

---

* In the previous scheme, suppose the person chooses the safe strategy with probability 0.9 every year

* How does the distribution of $Y$ change?

--

* What other strategies can you think of?

* How would you _compare_ two strategies in terms of risk and reward?

---

* Consider a very large fixed number $M = 10^6 = 1000000$. Fix $n = 5001$.

* Draw a _random sample_ $\nseq{X}$ from the discrete uniform distribution on $\set{0, 1, 2, \dotsc, M}$

* Define $\nseq{Y}$ as $Y\sub{i} = \frac{X\sub{i}}{M}$

* Define

    * $M\sub{1} = \frac1n \sum\limits\sub{i=1}^n Y\sub{i}$ (the sample mean)
    * $M\sub{2} = Y\sub{(2501)}$ (the sample median)
    * $M\sub{3} = Y\sub{(1)}$ (the sample minimum)
    * $M\sub{4} = Y\sub{(n)}$ (the sample maximum)

---

* What do the distributions of $M\sub{1}, M\sub{2}, M\sub{3}, M\sub{4}$ look like?

* Do the distributions change depending on whether you sample with or without replacement?

--

* Suppose we change the underlying distribution of $\nseq{X}$ to Binomial$(M, \frac12)$

* How do the distributions change?

---
layout: false
class: center, middle

# Questions?