class: center, middle

# Continuous Distributions
## Statistics I — Data Exploration
### Deepayan Sarkar
---
$$
\newcommand{\sub}{_}
\newcommand{\set}[1]{\left\lbrace {#1} \right\rbrace}
\newcommand{\abs}[1]{\left\lvert {#1} \right\rvert}
\newcommand{\nseq}[1]{ {#1}\sub{1}, {#1}\sub{2}, \dotsc, {#1}\sub{n} }
$$
# Discrete vs Continuous distributions

* We have mentioned the Normal distribution several times

* But most specific examples we have considered are from discrete populations

--

* Discrete populations are more natural

* Why do we need the Normal distribution?

--

* One way to justify the Normal and other continuous distributions is as useful approximations

---
layout: true

# Continuous distributions

---

* Example: Discrete Uniform

* Equally likely outcomes on a finite set $S = \set{ s\sub{1}, \dotsc, s\sub{n} }$

* Simplest example: $S = \set{ 1, 2, \dotsc, n }$

---

* Distribution (probability mass function) $P(X = s)$ of Discrete Uniform on $S = \set{ 1, 2, \dotsc, 10 }$

---

* Distribution (probability mass function) $P(X = s)$ of Discrete Uniform on $S = \set{ 1, 2, \dotsc, 100 }$

---

* Cumulative Distribution Function $F(s) = P(X \leq s)$ of Discrete Uniform on $S = \set{ 1, 2, \dotsc, 10 }$

---

* Cumulative Distribution Function $F(s) = P(X \leq s)$ of Discrete Uniform on $S = \set{ 1, 2, \dotsc, 100 }$

---

* Natural limiting approximation: Uniform on the interval $S = (0, 1)$

---

* Cumulative Distribution Function $F(s) = P(X \leq s)$ of Discrete Uniform on $S = \set{ 1, 2, \dotsc, n }$

$$
F(s) = P(X \leq s) = \frac{s}{n} = \sum\limits\sub{x = 1}^s P(X = x) = \sum\limits\sub{x = 1}^s \frac{1}{n} \text{ for } s \leq n
$$

--

* Cumulative Distribution Function $F(s) = P(X \leq s)$ of Uniform on the interval $S = (0, 1)$

$$
F(s) = P(X \leq s) = s = \int\limits\sub{0}^s f(x) \, \text{d}x = \int\limits\sub{0}^s 1 \, \text{d}x \text{ for } s \leq 1
$$

--

* This gives probabilities for intervals and other events that can be derived from them

* We call this the Uniform$(0, 1)$ distribution, or simply $U(0, 1)$

---

* In general, probabilities can be computed from any "density function" $f$ using

$$
F(s) = P(X \leq s) = \int\limits\sub{-\infty}^s f(x) \, \text{d}x
$$

* To ensure that the total probability equals 1, $f$ must satisfy $f(x) \geq 0$ for all $x \in \mathbb{R}$, and

$$
\int\limits\sub{-\infty}^\infty f(x) \, \text{d}x = 1
$$

--

* Many such $f$ can be constructed, but which ones are important?

---
layout: true

# Statistics based on Uniform$(0, 1)$

---

* Suppose $\nseq{U} \sim U(0, 1)$ independently

* How do summary statistics based on $\nseq{U}$ behave?

--

* Statistics we have studied: sample mean, sample median, other sample quantiles

* These give rise to three fundamental distributions: Beta, Normal, and Exponential

---

* The distribution of the $k$-th order statistic $U\sub{(k)}$ is Beta$(k, n-k+1)$

* The density $f: \mathbb{R} \to [0, \infty)$ of Beta$(a, b)$ is

$$
f(x) =
\begin{cases}
C(a, b) \, x^{a-1} (1-x)^{b-1} & x \in [0, 1] \cr
0 & x \not\in [0, 1]
\end{cases}
$$

* This is an exact distribution, not an approximation

---

* What happens to the distribution of $U\sub{(1)}$ as $n$ grows?

--

* What happens to the distribution of $n U\sub{(1)}$?

--

* The limiting distribution is the Exponential distribution, with density $f(x) = e^{-x} 1\sub{\set{ x > 0 }}$

---

* What happens to the distribution of the median $U\sub{(n/2)}$?

--

* To avoid ambiguity, assume $n = 2k - 1$ is odd

* The median is then $U\sub{(k)} \sim \text{Beta}(k, k)$

* What happens to the distribution of the (standardized)

$$
2 \sqrt{2k + 1} \left( U\sub{(k)} - \frac12 \right)
$$

--

* The limiting distribution is the Normal distribution, with density $f(x) = C e^{-x^2 / 2}$

---

* In both these cases this is easy to "verify" by calculating $f(x)$ explicitly

* It is also possible to verify empirically using Q-Q plots

--

* It can also be shown that the sample mean $\bar{U}$ has a limiting Normal distribution

* The required standardization is similar to that for the median:

$$
\sqrt{12 n} \left( \bar{U} - \frac12 \right)
$$

* Direct calculation of the density of $\bar{U}$ is difficult (it can be done using the [Convolution Theorem](https://en.wikipedia.org/wiki/Convolution_theorem))

* Easy to verify empirically using Q-Q plots

---
layout: false

# Central Limit Theorems

* The sample mean and the sample median have limiting Normal distributions in many other cases

* These results are known as Central Limit Theorems, and are very useful in
practice

---

# Estimation of density

* Natural question: How to estimate the density $f$ from data $\nseq{X}$?

* Two approaches:

    * Binning

    * Point-wise estimation

---
layout: false
class: center, middle

# Questions?
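
---

# Appendix: checking the limits by simulation

The limiting distributions discussed earlier can also be checked by direct simulation, as a complement to Q-Q plots. The following is a minimal sketch in Python (not part of the original slides; the sample size `n`, replication count `nrep`, and seed are arbitrary choices): $n U\sub{(1)}$ should behave approximately like an Exponential variable, and $\sqrt{12n} \left( \bar{U} - \frac12 \right)$ approximately like a standard Normal variable.

```python
# Sketch: simulate Uniform(0, 1) samples and check two limiting
# distributions empirically.  `n` and `nrep` are arbitrary choices.
import random
import statistics

random.seed(2024)
n, nrep = 500, 2000

# n * U_(1) should be approximately Exponential(1),
# which has mean 1 and median log(2) ~ 0.693
scaled_min = [n * min(random.random() for _ in range(n)) for _ in range(nrep)]
print(statistics.mean(scaled_min))    # close to 1
print(statistics.median(scaled_min))  # close to 0.693

# sqrt(12 n) * (mean(U) - 1/2) should be approximately N(0, 1)
z = [(12 * n) ** 0.5 * (statistics.fmean(random.random() for _ in range(n)) - 0.5)
     for _ in range(nrep)]
print(statistics.mean(z))   # close to 0
print(statistics.stdev(z))  # close to 1
```

Comparing summary statistics is a cruder check than a Q-Q plot, but it requires no plotting machinery; the same simulated values could be fed directly into a Q-Q plot against Exponential or Normal quantiles.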