class: center, middle

# Data Collection and Summarization

## Statistics I — Data Exploration

### Deepayan Sarkar
---
$$ \newcommand{\sub}{_} \newcommand{\set}[1]{\left\lbrace {#1} \right\rbrace} \newcommand{\nseq}[1]{ {#1}\sub{1}, {#1}\sub{2}, \dotsc, {#1}\sub{n} } $$
# Goals

* Where do data come from?

* Can we classify datasets by how they were collected?

* How can we summarize data numerically or graphically?

---
layout: true

# Sources of data

---

* Traditional data types

    * Categorical - nominal / ordered
    * Numeric - discrete / continuous

--

* Data "modes" that are more difficult to analyse (but are of increasing interest)

    * Free text
    * Images
    * Sound
    * Many others

--

* But how are such data typically collected?

---

* Key methodologies

    * Census
    * Sample survey
    * Observational studies / Case-control studies
    * Randomized studies / Randomized controlled trials

* Key concepts

    * Observational units
    * Population
    * Sample

---
layout: true

# Some scenarios

---

* We will consider some specific scenarios along with a question of interest

* In each case, we want to plan a data collection experiment that will answer the question

---

* What is the leading cause of death for adults in India?

    * Does it vary by sex?
    * Does it vary by state? By district?
    * Does it vary by education level? By income?
    * Do the answers change over time?

--

* What is the target population?
* Can we do a census?
* Can we do a survey?
* How should we sample?
* What are the observational units?
* Observational or randomized? Controls?

---

* How much money does a typical adult in India make?

    * Does it vary by sex?
    * Does it vary by state? By district?
    * Does it vary by education level? By employment status?
    * Do the answers change over time?

--

* What is the target population?
* Can we do a census?
* Can we do a survey?
* How should we sample?
* What are the observational units?
* Observational or randomized? Controls?

---

* Consider a "random" person who will be born in India in 2025

* How likely is it that they will be born on January 1? January 2? ... December 31?

--

* What is the target population?
* Can we do a census?
* Can we do a survey?
* How should we sample?
* What are the observational units?
* Observational or randomized? Controls?

--

* Here is a [relevant dataset](https://www.ons.gov.uk/visualisations/nesscontent/dvc307/line_chart/data.csv) (CSV) from [England and Wales](https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/livebirths/articles/howpopularisyourbirthday/2015-12-18)

* Can we use these data to answer the same question for England and Wales?

--

* A similar but more important question: how do we predict the __air quality__ in Delhi from October to December?

---

* A medicine factory produces tablets that are supposed to have a certain chemical composition

* A regulator wants to check whether the actual composition is within acceptable limits

* However, the only way to test a tablet is "destructive", i.e., the tablet cannot be used afterward

--

* What is the target population?
* Can we do a census?
* Can we do a survey?
* How should we sample?
* What are the observational units?
* Observational or randomized? Controls?

---

* A new yearly injection can potentially delay the onset of diabetes

* However, both the amount of benefit and the possible side effects are currently unknown

* We want to plan an experiment to study both

--

* What is the target population?
* Can we do a census?
* Can we do a survey?
* How should we sample?
* What are the observational units?
* Observational or randomized? Controls?

---

* Your friend loves a particular type of pizza from Dominos

* He also claims he can differentiate between the pizzas from two different outlets in your neighbourhood

* You want to test whether this claim is true

--

* What is the target population?
* Can we do a census?
* Can we do a survey?
* How should we sample?
* What are the observational units?
* Observational or randomized? Controls?

---

* What is the GDP of India? What is the average life expectancy in India?

* How do they compare with other countries?

* How have they changed over time?

--

* How do we measure?

* More basic question: how are they defined?

--

* These are by definition __summary__ measures

* Example: [TSV data](https://deepayan.github.io/BSDS/2024-01-DE/data/gapminder.tsv) and [visualization](https://www.gapminder.org/tools/) from GapMinder

* The actual process to calculate these from unit-level data is probably quite complicated

---
layout: true

# What do we actually do with data?

---

* The process of data analysis often involves just calculating various summary measures

--

* In fact, the technical name for a summary measure computed from data is a __statistic__

* In this course, we will mostly learn about commonly used summary statistics

--

* But it is also important to remember that summary statistics are usually computed from _samples_

* We need to be careful when we use them to make conclusions about a larger _population_

---

* Example: Are birthdays equally likely?

    * Let $X$ be the smallest birth month frequency in a 'random sample' of $n = 65$ people
    * We take one sample, where we observe $X = 3$. What can we conclude?

--

```r
min_count <- function(n, pmonths = NULL) {
    ## birth months of n random people (1 = Jan, ..., 12 = Dec),
    ## equally likely if pmonths is NULL
    b <- sample(1:12, n, prob = pmonths, replace = TRUE)
    T <- table(b)
    if (length(T) < 12) {
        return(0) # some month did not appear at all
    } else {
        return(min(T))
    }
}
```

---

```r
replicate(10000, min_count(65)) |> table() |> prop.table() |> barplot()
```

---

```r
replicate(10000, min_count(65, pmonths = c(31, 28, 31, 30, 31, 30,
                                           31, 31, 30, 31, 30, 31) / 365)) |>
    table() |> prop.table() |> barplot()
```

---

* Example: Are birthdays equally likely?

    * Let $X$ be the smallest birth month frequency in a 'random sample' of $n = 65$ people
    * We take one sample, where we observe $X = 3$. What can we conclude?

--

* We __cannot__ conclude that birthdays are _not_ equally likely

---
layout: true

# What statistics are useful?

---

* This is a natural and important question

* The answer depends on the type of data and the problem of interest

---

* How can we summarize categorical variables such as _cause of death_?

--

* We usually want to find the probabilities of various categories

* The natural summary statistic is the sample proportion

* The 'most likely' category is known as the __mode__

---

* How can we summarize numeric variables such as _income_?

* This is much more difficult to answer

* Common summary statistics: __mean__ and __median__

* But these capture only limited aspects of the distribution

--

* We will learn about mean, median, and similar summary statistics

* But first we will learn about the _empirical distribution_

---
layout: true

# Empirical distribution

---

* Suppose we have observed values $\nseq{X}$ of some variable

* Assume that the values of $\nseq{X}$ are all _distinct_

* The empirical distribution is a probability distribution whose

    * Sample space is the set $\set{ \nseq{X} }$
    * All elements in the sample space are equally likely (each has probability $1/n$)

--

* This is adjusted suitably if values are not distinct

    * The sample space consists of all distinct values
    * If the $i$th value appears $k\sub{i}$ times, the corresponding probability is $k\sub{i} / n$
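--

* A minimal sketch of this definition in R, using a few made-up values: the empirical probabilities are just the relative frequencies

```r
x <- c(3, 5, 3, 2, 5, 3)  # hypothetical observed values (n = 6)
prop.table(table(x))      # distinct values with probabilities k_i / n
```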
---

* The empirical distribution contains all information in the data...

* as long as the observations are "independent" (a 'random sample')

* As it is a probability distribution, we can apply the tools we have to study distributions

--

* Why is the empirical distribution important?

    * We have seen that sample proportions 'converge' to probabilities as sample size $n$ increases
    * More generally, the empirical distribution also 'converges' to the population distribution

--

* This is one of the _main justifications_ for using observed data to 'infer' about the underlying population

--

* This convergence is mathematically more complicated (and will not be discussed in this course)

* However, understanding this convergence graphically is important
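--

* A small simulation sketch of the first kind of convergence (assuming a fair die): the sample proportion of sixes should approach $1/6 \approx 0.167$

```r
## proportion of sixes in n rolls of a fair die, for increasing n
sapply(c(10, 100, 1000, 10000, 100000),
       function(n) mean(sample(1:6, n, replace = TRUE) == 6))
```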
---
layout: true

# Visualizing the empirical distribution

---

* Visualizing the empirical distribution is important but challenging

--

* Let us consider the height data we collected in our survey

```r
survey <- read.csv("https://deepayan.github.io/BSDS/2024-01-DE/data/bsds-survey.csv")
height <- survey$height
height
```

```
 [1] 160.000 181.000 155.000 167.000 185.000 173.000 176.000 173.000 175.000
[10] 183.000 178.000 175.000 165.000 190.000 175.000 168.000 164.000 182.000
[19] 168.000 168.000 180.000 170.000 172.000 180.000 180.000 130.000 166.000
[28] 176.000 175.000 175.000 169.500 170.000 152.400 169.000 180.000 168.000
[37] 192.024 163.000 162.000 175.000 176.000 169.000 175.000 165.000 182.000
[46] 157.000 170.000 173.000 172.000 178.000 178.000 176.000 178.000 167.640
[55] 162.000 182.000 165.000 175.000 186.000 178.000 178.000 172.000 167.640
[64] 159.000 159.000
```

* How do we visualize it?

---

```r
stripchart(height, method = "stack")
```

---

* For comparison, let us simulate data uniformly between 150 and 195

```r
n <- length(height)
u <- sample(150:195, n, replace = TRUE)
stripchart(list(unif = u, obs = height), method = "stack")
```

---

* Are the two samples similar or very different? Not very easy to say

--

* Another possible way to compare

```r
par(mfrow = c(1, 2)); plot(u, ylim = c(150, 195)); plot(height, ylim = c(150, 195))
```

---

* There seems to be some qualitative difference

* This difference becomes clearer if we plot _sorted_ data

```r
par(mfrow = c(1, 2)); plot(sort(u), ylim = c(150, 195)); plot(sort(height), ylim = c(150, 195))
```

---

* It is more traditional to swap the axes and convert the y-axis to a probability scale

```r
par(mfrow = c(1, 2)); plot(sort(u), ppoints(n), xlim = c(150, 195)); plot(sort(height), ppoints(n), xlim = c(150, 195))
```

---

* This is equivalent to the _empirical cumulative distribution function_ (ECDF)

```r
par(mfrow = c(1, 2)); plot(ecdf(u), xlim = c(150, 195)); plot(ecdf(height), xlim = c(150, 195))
```

---
layout: true

# The empirical cumulative distribution function (ECDF)

---

* Height data from a different source (the NHANES survey)

```r
library(NHANES)
dim(NHANES)
```

```
[1] 10000    76
```

```r
nhsub1 <- subset(NHANES, !is.na(Height)) # remove missing height values
n1 <- nrow(nhsub1)
n1
```

```
[1] 9647
```

---

* ECDF of height of a random sample of Americans

```r
plot(sort(nhsub1$Height), ppoints(n1), type = "l")
```

---

* Low heights are from children

* We are interested in the height distribution among adults

```r
nhsub2 <- subset(nhsub1, Age >= 20) # remove children
n2 <- nrow(nhsub2)
n2
```

```
[1] 7182
```

---

* ECDF of height of a random sample of American adults

```r
plot(sort(nhsub2$Height), ppoints(n2), type = "l")
```

---

* ECDF of the magic distribution: Normal

```r
plot(sort(rnorm(n2)), ppoints(n2), type = "l")
```

---
layout: true

# Quantile-Quantile plots

---

* These ECDF plots essentially show sorted data vs equally spaced probabilities

--

* Useful alternative: plot sorted data against sorted data from the 'Normal distribution'

```r
par(mfrow = c(1, 2)); plot(sort(rnorm(n2)), sort(nhsub2$Height))
```

---

* How do we get data from the 'Normal distribution'? Simulate using `rnorm()`

* Problem: Every time we simulate, we will get a slightly different plot

```r
par(mfrow = c(1, 2)); plot(sort(rnorm(n2)), sort(nhsub2$Height))
```

---

* Solution: Instead of simulating, use something called 'theoretical quantiles' of the Normal distribution

* The result is known as a _Normal Quantile-Quantile Plot_ or simply _Normal Q-Q plot_

```r
par(mfrow = c(1, 2)); plot(sort(rnorm(n2)), sort(nhsub2$Height)); qqnorm(nhsub2$Height)
```
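--

* The 'theoretical quantiles' can in fact be computed directly with `qnorm()`; a minimal sketch (reusing `n2` and `nhsub2` from before) that essentially reproduces `qqnorm()`:

```r
## Normal quantiles at equally spaced probabilities, plotted against sorted data
plot(qnorm(ppoints(n2)), sort(nhsub2$Height))
```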
---

* As before, the lattice package has a different implementation that is more powerful

```r
qqmath(~ Height, data = nhsub2, grid = TRUE, aspect = 1,
       groups = Gender, auto.key = TRUE)
```

---
layout: true

# Questions

---

* This raises several important questions:

    * What is the Normal distribution and why should we compare with it?
    * What are quantiles?
    * What are theoretical quantiles?
    * What can Q-Q plots tell us?

--

* The first two questions will be discussed in your probability course

* We will give very vague answers for now

---

* What is the Normal distribution?

    * The Normal distribution is a very simple _continuous_ distribution
    * It is best for now to think of it as an _approximation_ to various distributions that we observe in real life
    * We will see examples of this soon

---

* What are quantiles?

    * In Q-Q plots we plot sorted data
    * Sorted data have a special name in statistics: _order statistics_

--

* Specifically, the $k$-th value of the $X\sub{i}$ in sorted order is known as the $k$-th order statistic

* The standard notation for the $k$-th order statistic is $X\sub{(k)}$

--

* Roughly speaking, quantiles are order statistics, but identified by their _relative rank_ (like percentiles)

---

* Examples of quantiles:

    * The $90$-th percentile is the same as the $0.9$ quantile
    * This is defined as the number $Q$ such that 90% of the data is less than (or equal to) $Q$
    * In a sample of size $n = 100$, this would be $X\sub{(90)}$
    * In a sample of size $n = 1000$, this would be $X\sub{(900)}$

--

* In a sample of size $n = 7182$, this would be ??

--

```r
0.9 * 7182
```

```
[1] 6463.8
```

---

* Several approximations are available; see `help(quantile)`

```r
with(subset(nhsub2, Gender == "male"),
     quantile(Height, probs = c(0.2, 0.4, 0.6, 0.8)))
```

```
  20%   40%   60%   80% 
169.7 173.9 177.5 181.8 
```

```r
with(subset(nhsub2, Gender == "female"),
     quantile(Height, probs = c(0.2, 0.4, 0.6, 0.8)))
```

```
   20%    40%    60%    80% 
155.80 160.30 164.00 168.16 
```

---

* What are theoretical quantiles?

    * Similar idea, but with probability instead of relative rank
    * The $p$-th quantile is the number $Q$ such that $P(X \leq Q) = p$
    * Needs to be modified suitably to account for discrete jumps

--

* R has _quantile functions_ for all standard distributions

```r
qbinom(c(0.2, 0.4, 0.6, 0.8), size = 20, prob = 0.25)
```

```
[1] 3 4 5 7
```

```r
qbinom(c(0.2, 0.4, 0.6, 0.8), size = 200, prob = 0.25)
```

```
[1] 45 48 51 55
```

```r
qnorm(c(0.2, 0.4, 0.6, 0.8))
```

```
[1] -0.8416212 -0.2533471  0.2533471  0.8416212
```
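--

* These quantile functions are (essentially) inverses of the corresponding cumulative probability functions; a quick sanity check for the continuous case:

```r
## the p-th quantile Q satisfies P(X <= Q) = p, so this should return the input probabilities
pnorm(qnorm(c(0.2, 0.4, 0.6, 0.8)))
```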
---

* Why are Q-Q plots useful?

    * Sample quantiles converge to theoretical quantiles of the underlying population
    * Q-Q plots compare order statistics with corresponding theoretical quantiles

--

* Why not the ECDF?

    * This has to do with _human perception_
    * We find it easier to detect departures from a straight line (as opposed to a curve)

--

* In principle, the theoretical quantiles can be from any appropriate distribution

* The Normal distribution has been found to be the most useful default choice

---

* Two final points before moving on from Q-Q plots

    * In what sense do sample quantiles converge to population quantiles?
    * Why is the Normal distribution a useful reference for Q-Q plots?

* We will illustrate both of these through simulation

* Theoretical answers will be discussed later in other courses

---
layout: true

# Convergence of sample quantiles

---

* Suppose data are generated from the Normal distribution

* Ideally, the Q-Q plot should look like a straight line

* However, this will rarely happen for small sample sizes

---

```r
library(latticeExtra)
c(qqmath(~ rnorm(  10)), qqmath(~ rnorm(  10)), qqmath(~ rnorm(  10))) |>
    update(grid = TRUE, aspect = 1)
```

---

```r
c(qqmath(~ rnorm(  50)), qqmath(~ rnorm(  50)), qqmath(~ rnorm(  50))) |>
    update(grid = TRUE, aspect = 1)
```

---

```r
c(qqmath(~ rnorm( 100)), qqmath(~ rnorm( 100)), qqmath(~ rnorm( 100))) |>
    update(grid = TRUE, aspect = 1)
```

---

```r
c(qqmath(~ rnorm( 500)), qqmath(~ rnorm( 500)), qqmath(~ rnorm( 500))) |>
    update(grid = TRUE, aspect = 1)
```

---

```r
c(qqmath(~ rnorm(5000)), qqmath(~ rnorm(5000)), qqmath(~ rnorm(5000))) |>
    update(grid = TRUE, aspect = 1)
```

---

* Exercise:

    * Try this for more sample sizes
    * Repeat several times to understand variability (as a function of sample size)

--

* Q-Q plots can compare with quantiles of other distributions as well

* Let's try with the Binomial

---

```r
size <- 100; prob <- 0.25; x <- rbinom(  10, size = size, prob = prob)
qqmath(~ x, distribution = function(p) qbinom(p, size = size, prob = prob),
       grid = TRUE, aspect = 1)
```

---

```r
size <- 100; prob <- 0.25; x <- rbinom(  50, size = size, prob = prob)
qqmath(~ x, distribution = function(p) qbinom(p, size = size, prob = prob),
       grid = TRUE, aspect = 1)
```

---

```r
size <- 100; prob <- 0.25; x <- rbinom( 100, size = size, prob = prob)
qqmath(~ x, distribution = function(p) qbinom(p, size = size, prob = prob),
       grid = TRUE, aspect = 1)
```

---

```r
size <- 100; prob <- 0.25; x <- rbinom( 500, size = size, prob = prob)
qqmath(~ x, distribution = function(p) qbinom(p, size = size, prob = prob),
       grid = TRUE, aspect = 1)
```

---

```r
size <- 100; prob <- 0.25; x <- rbinom(5000, size = size, prob = prob)
qqmath(~ x, distribution = function(p) qbinom(p, size = size, prob = prob),
       grid = TRUE, aspect = 1)
```

---

* So we see a similar pattern:

    * More variable for low sample sizes
    * Converges to a straight line as sample size increases

--

* But what happens if we compare with Normal quantiles instead of Binomial quantiles?
---

```r
size <- 100; prob <- 0.25; x <- rbinom(  10, size = size, prob = prob)
qqmath(~ x, grid = TRUE, aspect = 1)
```

---

```r
size <- 100; prob <- 0.25; x <- rbinom(  50, size = size, prob = prob)
qqmath(~ x, grid = TRUE, aspect = 1)
```

---

```r
size <- 100; prob <- 0.25; x <- rbinom( 100, size = size, prob = prob)
qqmath(~ x, grid = TRUE, aspect = 1)
```

---

```r
size <- 100; prob <- 0.25; x <- rbinom( 500, size = size, prob = prob)
qqmath(~ x, grid = TRUE, aspect = 1)
```

---

```r
size <- 100; prob <- 0.25; x <- rbinom(5000, size = size, prob = prob)
qqmath(~ x, grid = TRUE, aspect = 1)
```

---

* This suggests that the Binomial$(n = 100, p = 0.25)$ distribution is well approximated by the Normal

--

* This approximation improves with the number of Bernoulli trials $n$

* Let's try Binomial$(n = 1000, p = 0.25)$

---

```r
size <- 1000; prob <- 0.25; x <- rbinom(5000, size = size, prob = prob)
qqmath(~ x, grid = TRUE, aspect = 1)
```

---

* But the approximation is not as good for, say, Binomial$(n = 1000, p = 0.0025)$

```r
size <- 1000; prob <- 0.0025; x <- rbinom(5000, size = size, prob = prob)
qqmath(~ x, grid = TRUE, aspect = 1)
```

---

* Comparing to the true distribution will lead to a better visual fit

```r
qqmath(~ x, grid = TRUE, aspect = 1,
       distribution = function(p) qbinom(p, size = size, prob = prob))
```

---

* Summary

    * Sample quantiles (order statistics) converge to "theoretical quantiles" as sample size increases
    * Theoretical quantiles essentially determine the underlying population

--

* This means that looking at sample quantiles (through the ECDF) tells us about the underlying population

* Q-Q plots transform the ECDF to compare with a reference distribution (usually Normal)

--

* The Normal distribution is a good reference in a surprisingly wide range of situations

* However, there are also many situations where it is not appropriate

* Systematic patterns in the Normal Q-Q plot can suggest alternative distributions (later)

---
layout: true

# Comparing multiple samples

---

* Q-Q plots are typically used to compare a sample to a reference distribution

* We are usually more interested in comparing two or more samples

---

* This is possible, as we saw before: compare heights of males and females in NHANES data

```r
qqmath(~ Height, data = NHANES, grid = TRUE, aspect = 1,
       groups = Gender, auto.key = TRUE, subset = Age >= 20)
```
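--

* A more direct option is a _two-sample Q-Q plot_, plotting the quantiles of one sample against those of the other; a minimal sketch using the lattice function `qq()` (with the `NA`-free subset `nhsub2` from before):

```r
## quantiles of male heights vs quantiles of female heights
qq(Gender ~ Height, data = nhsub2, aspect = 1)
```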
---

* The difference can be summarized as a _constant_ shift in all the quantiles

* This can also be seen in a smaller subset of quantiles:

```r
s <- with(subset(NHANES, Age >= 20), split(Height, Gender))
str(s)
```

```
List of 2
 $ female: num [1:3683] 168 167 167 167 148 ...
 $ male  : num [1:3552] 165 165 165 170 182 ...
```

```r
lapply(s, quantile, probs = c(0.2, 0.4, 0.6, 0.8), na.rm = TRUE)
```

```
$female
   20%    40%    60%    80% 
155.80 160.30 164.00 168.16 

$male
  20%   40%   60%   80% 
169.7 173.9 177.5 181.8 
```

---

* It is traditional to summarize a sample with a specific set of __five__ quantiles

* This is known as the [__five number summary__](https://en.wikipedia.org/wiki/Five-number_summary), and can be produced by `fivenum()`

```r
lapply(s, fivenum)
```

```
$female
[1] 134.5 157.2 162.1 167.0 184.5

$male
[1] 149.4 170.9 175.6 180.5 200.4
```

---

* These are actually just the quantiles corresponding to probabilities $0, \frac14, \frac12, \frac34, 1$

* In other words: minimum, _first quartile_, median, _third quartile_, maximum

```r
lapply(s, quantile, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
```

```
$female
     0%     25%     50%     75%    100% 
134.500 157.200 162.100 166.975 184.500 

$male
    0%    25%    50%    75%   100% 
 149.4  170.9  175.6  180.5  200.4 
```

---

* We usually compare these quantiles _graphically_ using a __box and whisker plot__

```r
bwplot(Gender ~ Height, data = NHANES, subset = Age >= 20, coef = 0)
```

---

* The _compactness_ of this design makes it easier to compare multiple groups together

```r
bwplot(Race1 ~ Height | Gender, data = NHANES, subset = Age >= 20, coef = 0)
```

---

* Common axes make comparison easier

```r
bwplot(Height ~ Gender | Race1, data = NHANES, subset = Age >= 20,
       layout = c(5, 1), coef = 0)
```

---

* Similar plot for a different variable: `Weight`

```r
bwplot(Weight ~ Gender | Race1, data = NHANES, subset = Age >= 20,
       layout = c(5, 1), coef = 0)
```

---

* What happens if we drop the `coef = 0`? It additionally plots "outlier" data points that would be too extreme if the data were Normal

```r
bwplot(Weight ~ Gender | Race1, data = NHANES, subset = Age >= 20, layout = c(5, 1))
```

---

* We can go back to Q-Q plots if we see something unusual

```r
qqmath(~ Weight | Race1, data = NHANES, subset = Age >= 20, layout = c(5, 1),
       groups = Gender, auto.key = TRUE, grid = TRUE)
```

---

* Sometimes transforming the data can make the distribution "closer" to Normal

```r
qqmath(~ Weight^(1/3) | Race1, data = NHANES, subset = Age >= 20, layout = c(5, 1),
       groups = Gender, auto.key = TRUE, grid = TRUE)
```

---

* Why do we care about the Normal distribution?

--

* Mainly because

    * It is a good approximation for many observed distributions
    * It is defined by just __two__ parameters, which can be recovered from the median and the length of the box
    * The length of the box is known as the __IQR__ or _inter-quartile range_
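--

* A small sketch of why the median and the IQR are enough: for any Normal distribution, the IQR is a fixed multiple ($\approx 1.35$) of the standard deviation

```r
diff(qnorm(c(0.25, 0.75)))             # IQR of the standard Normal, ~ 1.349
IQR(rnorm(100000, mean = 160, sd = 7)) # should be close to 1.349 * 7
```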
--

* Many statistical problems have "nice" solutions when the underlying population is Normal

* So one focus of initial "data exploration" is to decide if observed data look like Normal

--

* This means that we need to be able to tell when data are _not_ like Normal

* This is possible from a Q-Q plot, but it is not as easy to describe the __type of departure__

--

* For this, another tool known as the __histogram__ is more useful

---
layout: true

# Data binning and histograms

---

* The idea of binning is useful in a variety of contexts

* Motivation from a data summary perspective:

    * Categorical or discrete data can be summarized by table counts (without loss of information)
    * Binning provides a similar summary for continuous data (with some loss of information)

--

* Another motivation is probabilistic, which we will discuss later

---

* Consider the NHANES height data (after removing `NA`-s and rows with Age less than 20)

```r
str(nhsub2$Height) # raw data ~ 7000 records
```

```
 num [1:7182] 165 165 165 168 167 ...
```

```r
T1 <- table(nhsub2$Height); str(T1) # table of unique values (lossless summary) ~ 500 values
```

```
 'table' int [1:510(1d)] 1 1 1 1 1 2 2 6 1 1 ...
 - attr(*, "dimnames")=List of 1
  ..$ : chr [1:510] "134.5" "137.3" "139.8" "139.9" ...
```

--

```r
T2 <- table(round(nhsub2$Height)); str(T2) # table of rounded values (summary with some loss) ~ 60 values
```

```
 'table' int [1:62(1d)] 1 1 3 4 7 6 16 13 16 21 ...
 - attr(*, "dimnames")=List of 1
  ..$ : chr [1:62] "134" "137" "140" "141" ...
```

---

```r
xyplot(as.numeric(T1) ~ as.numeric(names(T1)), type = "h")
```

---

```r
xyplot(as.numeric(T2) ~ as.numeric(names(T2)), type = "h")
```

---

```r
xyplot(as.numeric(T2) ~ as.numeric(names(T2)), type = "h",
       panel = panel.barchart, horizontal = FALSE, origin = 0)
```

---

```r
T3 <- table(10 * round(nhsub2$Height / 10)) # round to the nearest 10
xyplot(as.numeric(T3) ~ as.numeric(names(T3)), type = "h",
       panel = panel.barchart, box.width = 8, horizontal = FALSE, origin = 0)
```

---

* Histograms basically generalize this kind of rounding

* They are determined by a sequence of 'break points' defining bins (covering all the data)

* The data are summarized by the number (or relative frequency) of data points falling in each bin

--

* Data points may coincide with bin boundaries (break points)

* We need to be careful to specify which bin such data points should be counted in

--

* Histogram bins _usually_ all have the same width (the break points are equally spaced)

---

* Counting the number of data points in each bin is not trivial

* R has three related functions for this:

    * `cut()` converts numeric data into categories defined by breakpoints
    * `findInterval()` is similar but more efficient with a different return value
    * `hist()` computes the bin _counts_ directly

---

```r
cut(nhsub2$Height, breaks = seq(125, 205, by = 10)) |> head(30)
```

```
 [1] (155,165] (155,165] (155,165] (165,175] (165,175] (165,175] (165,175]
 [8] (165,175] (175,185] (165,175] (145,155] (175,185] (175,185] (165,175]
[15] (165,175] (165,175] (155,165] (175,185] (175,185] (175,185] (165,175]
[22] (165,175] (165,175] (145,155] (175,185] (145,155] (145,155] (165,175]
[29] (155,165] (155,165]
8 Levels: (125,135] (135,145] (145,155] (155,165] (165,175] ... (195,205]
```

--

```r
cut(nhsub2$Height, breaks = seq(125, 205, by = 10)) |> table()
```

```
(125,135] (135,145] (145,155] (155,165] (165,175] (175,185] (185,195] (195,205] 
        1        41       591      2048      2450      1658       377        16 
```
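--

* For comparison, a minimal sketch of `findInterval()` with the same breakpoints: it returns a bin _index_ for each data point instead of a factor

```r
## note: bins are left-closed here, unlike the right-closed bins of cut()
findInterval(nhsub2$Height, seq(125, 205, by = 10)) |> head(10)
```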
---

```r
hist(nhsub2$Height, breaks = seq(125, 205, by = 10), plot = FALSE) |> str()
```

```
List of 6
 $ breaks  : num [1:9] 125 135 145 155 165 175 185 195 205
 $ counts  : int [1:8] 1 41 591 2048 2450 1658 377 16
 $ density : num [1:8] 1.39e-05 5.71e-04 8.23e-03 2.85e-02 3.41e-02 ...
 $ mids    : num [1:8] 130 140 150 160 170 180 190 200
 $ xname   : chr "nhsub2$Height"
 $ equidist: logi TRUE
 - attr(*, "class")= chr "histogram"
```

---

```r
histogram(~ Height, NHANES, breaks = seq(125, 205, by = 10), subset = Age >= 20,
          type = "count")
```

---

```r
histogram(~ Height, NHANES, breaks = seq(125, 205, by = 1), subset = Age >= 20,
          type = "percent")
```

---

```r
histogram(~ Height, NHANES, nint = 30, subset = Age >= 20) # specify number of bins
```

---

```r
histogram(~ Height | Gender, NHANES, nint = 30, subset = Age >= 20, layout = c(1, 2))
```

---
layout: false

# Histograms vs bar charts

* In some ways similar, but also different

    * Bar charts are meant for discrete or categorical data (with pre-specified categories)
    * Histograms are meant for continuous data

--

* This is reflected in the lack of _gaps_ between bars in a histogram

--

* Relation with probability theory

    * Bar charts estimate the population _probability mass function_ (categorical / discrete)
    * Histograms estimate the population _probability density function_ (continuous)
    * We will discuss these theoretical connections later
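--

* A hint of this connection (a sketch, using the lattice `histogram()` as above): with `type = "density"`, bar heights are rescaled so that the total bar _area_ is 1, as for a probability density

```r
histogram(~ Height, NHANES, nint = 30, subset = Age >= 20, type = "density")
```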
---

# Counting bin frequencies

* Easy with computers

* Not so easy without one

--

* Example: height data from survey

```r
survey <- read.csv("https://deepayan.github.io/BSDS/2024-01-DE/data/bsds-survey.csv")
height <- round(survey$height)
height
```

```
 [1] 160 181 155 167 185 173 176 173 175 183 178 175 165 190 175 168 164 182 168
[20] 168 180 170 172 180 180 130 166 176 175 175 170 170 152 169 180 168 192 163
[39] 162 175 176 169 175 165 182 157 170 173 172 178 178 176 178 168 162 182 165
[58] 175 186 178 178 172 168 159 159
```

---

# Stem and leaf plot

* A very crude way of counting

```r
stem(height)
```

```

  The decimal point is 1 digit(s) to the right of the |

  13 | 0
  14 | 
  15 | 25799
  16 | 022345556788888899
  17 | 0000222333555555556666888888
  18 | 00001222356
  19 | 02
```

---
layout: true

# Histograms for Normal data

---

```r
c(histogram(~ rnorm(  10)), histogram(~ rnorm(  10)), histogram(~ rnorm(  10))) |>
    update(grid = TRUE, aspect = 1)
```

---

```r
c(histogram(~ rnorm(  50)), histogram(~ rnorm(  50)), histogram(~ rnorm(  50))) |>
    update(grid = TRUE, aspect = 1)
```

---

```r
c(histogram(~ rnorm( 100)), histogram(~ rnorm( 100)), histogram(~ rnorm( 100))) |>
    update(grid = TRUE, aspect = 1)
```

---

```r
c(histogram(~ rnorm( 500)), histogram(~ rnorm( 500)), histogram(~ rnorm( 500))) |>
    update(grid = TRUE, aspect = 1)
```

---

```r
c(histogram(~ rnorm(5000)), histogram(~ rnorm(5000)), histogram(~ rnorm(5000))) |>
    update(grid = TRUE, aspect = 1)
```

---
layout: true

# Histograms for NHANES data

---

```r
histogram(~ Height | Race1 + Gender, NHANES, subset = Age >= 20)
```

---

```r
histogram(~ Weight | Race1 + Gender, NHANES, subset = Age >= 20)
```

---

```r
histogram(~ Weight^(1/3) | Race1 + Gender, NHANES, subset = Age >= 20)
```

---
layout: true

# Example: Investment simulation

---

* A person starts with a pool of 100 rupees that they want to invest for 30 years

* Every year, they have the option of choosing either

    * A safe scheme, which is guaranteed to grow an investment of amount $X$ to $1.08 X$
    * A risky scheme, which turns an investment of $X$ into either $1.5 X$ or $X / 1.5$ with equal probability

--

* Suppose the person chooses either of the schemes randomly each year (with equal probability)

* Let $Y$ denote the amount after 30 years

* What does the distribution of $Y$ look like?

---

```r
sim_investment <- function(n = 30) {
    ## yearly growth factors: safe with probability 0.5,
    ## otherwise risky (up or down with equal probability)
    f <- sample(c(1.08, 1.5, 1/1.5), n, replace = TRUE,
                prob = c(0.5, 0.25, 0.25))
    100 * prod(f)
}
Y <- replicate(10000, sim_investment(30))
```

---

```r
qqmath(~ Y, grid = TRUE, aspect = 1)
```

---

```r
histogram(~ Y, grid = TRUE)
```

---

```r
qqmath(~ log(Y), grid = TRUE, aspect = 1)
```

---

```r
histogram(~ log(Y), grid = TRUE)
```

---
layout: true

# Exercises

---

* In the previous scheme, suppose the person chooses the safe strategy with probability 0.9 every year

* How does the distribution of $Y$ change?

--

* What other strategies can you think of?

* How would you _compare_ two strategies in terms of risk and reward?

---

* Consider a very large fixed number $M = 10^6 = 1000000$. Fix $n = 5001$.

* Draw a _random sample_ $\nseq{X}$ from the discrete uniform distribution on $\set{0, 1, 2, \dotsc, M}$

* Define $\nseq{Y}$ as $Y\sub{i} = \frac{X\sub{i}}{M}$

* Define

    * $M\sub{1} = \frac1n \sum\limits\sub{i=1}^n Y\sub{i}$ (the sample mean)
    * $M\sub{2} = Y\sub{(2501)}$ (the sample median)
    * $M\sub{3} = Y\sub{(1)}$ (the sample minimum)
    * $M\sub{4} = Y\sub{(n)}$ (the sample maximum)

---

* What do the distributions of $M\sub{1}, M\sub{2}, M\sub{3}, M\sub{4}$ look like?

* Do the distributions change depending on whether you sample with or without replacement?

--

* Suppose we change the underlying distribution of $\nseq{X}$ to Binomial$(M, \frac12)$

* How do the distributions change?

---
layout: false
class: center, middle

# Questions?