Distributions

# Distributions

## Statistics I — Data Exploration

### Deepayan Sarkar

---

<div>
$$
\newcommand{\sub}{_}
$$
</div>

# Goals

* Some preliminary concepts

* Visualization

* Probability distributions

* Data Frame (R equivalent of a spreadsheet)

---

# Probability distribution

---

* Example: _Number of distinct birthdays_ in a group of $n$ people

* This is different from the earlier problem (probability of all distinct)

* One way to simulate:

```r
distinct_birthdays <- function(n) {
    sample(1:365, n, replace = TRUE) |> unique() |> length()
}
```

* The `|>` notation represents a "pipe" operator

* This is equivalent to

```r
distinct_birthdays <- function(n) {
    length(unique(sample(1:365, n, replace = TRUE)))
}
```

---

* This is a 'random' quantity — depends on the sample drawn

* We can repeat the simulation experiment and obtain 'data'

```r
d <- replicate(25, distinct_birthdays(23))
d # each element represents result of one simulation
```

```
 [1] 21 23 22 22 23 22 21 23 21 22 22 21 23 21 23 23 22 22 22 22 23 23 23 23 23
```

* How can we plot this data?

* Do we even want to?

---

# The `plot()` function

---

```r
plot(d) # not very useful in this case
```

![plot of chunk unnamed-chunk-3](figures/distribution-unnamed-chunk-3-1.svg)

---

```r
plot(sort(d), pch = 16) # sorting makes it slightly more useful 
```

![plot of chunk unnamed-chunk-4](figures/distribution-unnamed-chunk-4-1.svg)

---

# The `stripchart()` function

---

```r
stripchart(d) # also mostly useless
```

![plot of chunk unnamed-chunk-5](figures/distribution-unnamed-chunk-5-1.svg)

---

```r
stripchart(d, method = "jitter") # but stripchart() has some useful options
```

![plot of chunk unnamed-chunk-6](figures/distribution-unnamed-chunk-6-1.svg)

---

```r
stripchart(d, method = "stack") # more useful: the "stack" method
```

![plot of chunk unnamed-chunk-7](figures/distribution-unnamed-chunk-7-1.svg)

---

* Let's try with a larger $n$ and more replications

```r
d50 <- replicate(100, distinct_birthdays(50))
d50
```

```
  [1] 46 48 48 49 47 46 45 45 46 44 48 45 48 49 46 48 46 47 44 45 45 48 45 47 48
 [26] 47 49 46 47 47 46 48 47 45 48 47 45 48 46 48 46 47 47 49 47 49 48 45 46 46
 [51] 45 46 49 47 49 49 46 43 50 49 47 48 47 48 47 44 45 49 46 46 48 45 47 47 46
 [76] 49 43 48 48 49 46 49 46 48 46 47 48 47 44 45 47 47 49 46 47 48 47 49 46 46
```

---

```r
stripchart(d50, method = "stack", pch = 16)
```

![plot of chunk unnamed-chunk-9](figures/distribution-unnamed-chunk-9-1.svg)

---

# Tables: `barplot()` and `dotchart()`

---

* By stacking, we are basically counting the "frequency" of each value

* This can be done directly using the `table()` function

```r
table(d50)
```

```
d50
43 44 45 46 47 48 49 50 
 2  4 13 22 23 20 15  1 
```

---

* We are actually interested in just the proportion

```r
table(d50) |> prop.table()
```

```
d50
  43   44   45   46   47   48   49   50 
0.02 0.04 0.13 0.22 0.23 0.20 0.15 0.01 
```

* Tables can be plotted using "bar charts" and "dot plots"

---

```r
table(d50) |> prop.table() |> barplot()
```

![plot of chunk unnamed-chunk-12](figures/distribution-unnamed-chunk-12-1.svg)

---

```r
table(d50) |> prop.table() |> dotchart(pch = 16)
```

![plot of chunk unnamed-chunk-13](figures/distribution-unnamed-chunk-13-1.svg)

---

# Alternative functions: `barchart()` and `dotplot()`

---

* Did you notice that the names are switched around?

* R comes with an add-on package called `lattice` which has alternative implementations

* To use it, you need to "load" the `lattice` package

```r
library(package = "lattice")
```

---

```r
table(d50) |> prop.table() |>
    barchart()
```

![plot of chunk unnamed-chunk-15](figures/distribution-unnamed-chunk-15-1.svg)

---

```r
table(d50) |> prop.table() |>
    barchart(horizontal = FALSE)
```

![plot of chunk unnamed-chunk-16](figures/distribution-unnamed-chunk-16-1.svg)

---

```r
replicate(100, distinct_birthdays(50)) |> table() |> prop.table() |>
    barchart(horizontal = FALSE) # another simulation
```

![plot of chunk unnamed-chunk-17](figures/distribution-unnamed-chunk-17-1.svg)

---

```r
replicate(100000, distinct_birthdays(50)) |> table() |> prop.table() |>
    barchart(horizontal = FALSE) # increase replications
```

![plot of chunk unnamed-chunk-18](figures/distribution-unnamed-chunk-18-1.svg)

---

```r
replicate(100000, distinct_birthdays(60)) |> table() |> prop.table() |>
    barchart(horizontal = FALSE) # n = 60
```

![plot of chunk unnamed-chunk-19](figures/distribution-unnamed-chunk-19-1.svg)

---

```r
replicate(100000, distinct_birthdays(70)) |> table() |> prop.table() |>
    barchart(horizontal = FALSE) # n = 70
```

![plot of chunk unnamed-chunk-20](figures/distribution-unnamed-chunk-20-1.svg)

---

```r
replicate(100000, distinct_birthdays(80)) |> table() |> prop.table() |>
    barchart(horizontal = FALSE) # n = 80
```

![plot of chunk unnamed-chunk-21](figures/distribution-unnamed-chunk-21-1.svg)

---

```r
replicate(100000, distinct_birthdays(80)) |> table() |> prop.table() |>
    dotplot(horizontal = FALSE) # n = 70
```

![plot of chunk unnamed-chunk-22](figures/distribution-unnamed-chunk-22-1.svg)

---

# Distribution

---

* What we are really trying to visualize is a "probability distribution"

* Let $X$ be the number of distinct birthdays in a group of 80 people

* What is $P(X = 80)$ ?

* What is $P(X = 79)$ ?

* What is $P(X = 78)$ ?

* What is the most likely value of $X$ ?

* What is the expected value of $X$ ?

* In this case, true answer is difficult to calculate

* Simulation will give us answers that will be close (by "law of large numbers")

---

# The Binomial distribution

---

* An important distribution: number of successes in $n$ independent trials (e.g., coin tosses)

* The probabily of success in one trial is $p$ (which is known, but not necessarily $\frac12$)

* We can simulate exactly as before

```r
n <- 80
sample(c(0, 1), size = n, replace = TRUE)
```

```
 [1] 0 1 0 0 0 1 1 0 1 0 1 0 1 1 0 1 0 0 1 0 1 1 1 0 1 0 0 1 0 0 0 0 0 1 0 1 0 1
[39] 0 0 0 0 1 0 1 0 0 0 0 0 1 1 1 1 1 1 0 1 0 0 1 1 0 1 0 0 1 0 0 0 0 1 0 0 1 0
[77] 1 0 1 0
```

```r
sample(c(0, 1), size = n, replace = TRUE) |> sum()
```

```
[1] 46
```

---

```r
replicate(100000, sum(sample(c(0, 1), size = 80, replace = TRUE))) |>
    table()
```

```

21   22   23   24   25   26   27   28   29   30   31   32   33   34   35   36 
   2    2    6   10   38   60  136  239  434  726 1205 1782 2659 3658 4784 5892 
  37   38   39   40   41   42   43   44   45   46   47   48   49   50   51   52 
7251 8080 8662 8814 8671 8209 7101 6019 4734 3571 2666 1836 1190  678  424  244 
  53   54   55   56   57   58   61 
 111   54   34   11    3    2    2 
```

---

```r
replicate(100000, sum(sample(c(0, 1), size = 80, replace = TRUE))) |>
    table() |> prop.table()
```

```

20      22      23      24      25      26      27      28      29      30 
0.00001 0.00002 0.00006 0.00018 0.00036 0.00054 0.00130 0.00251 0.00474 0.00704 
     31      32      33      34      35      36      37      38      39      40 
0.01232 0.01750 0.02637 0.03629 0.04804 0.05985 0.07122 0.08134 0.08798 0.08892 
     41      42      43      44      45      46      47      48      49      50 
0.08378 0.08122 0.07117 0.06127 0.04640 0.03669 0.02666 0.01760 0.01171 0.00764 
     51      52      53      54      55      56      57      59 
0.00442 0.00233 0.00137 0.00060 0.00039 0.00011 0.00003 0.00002 
```

---

```r
replicate(100000, sum(sample(c(0, 1), size = 80, replace = TRUE))) |> 
    table() |> prop.table() |> barchart(horizontal = FALSE)
```

![plot of chunk unnamed-chunk-26](figures/distribution-unnamed-chunk-26-1.svg)

---

```r
replicate(100000, sum(sample(c(0, 1), size = 80, replace = TRUE, prob = c(0.1, 0.9)))) |> 
    table() |> prop.table() |> barchart(horizontal = FALSE)
```

![plot of chunk unnamed-chunk-27](figures/distribution-unnamed-chunk-27-1.svg)

---

* In this example, the exact probabilities are actually easy to calculate

* Let $X$ be the number of successes in $n$ independent trials with probabily of success $p$. Then,

$$
P(X = k) = { n \choose k } \, p^k \, (1-p)^{n - k}\,, \, k = 0, 1, 2, \dotsc, n
$$

---

* To compare, we need both simulation probabilities and exact probabilities

* This is where data frames become useful

* Data frames are like a spreadsheet, containing multiple columns

* Every column must have a name, and must have the same length

---

* Example: Proportions observed in simulation

```r
X <- replicate(100000, sum(sample(c(0, 1), size = 80, replace = TRUE, prob = c(0.1, 0.9))))
T <- prop.table(table(X))
```

```r
T
```

```
X
     57      59      60      61      62      63      64      65      66      67 
0.00001 0.00002 0.00008 0.00020 0.00055 0.00151 0.00334 0.00690 0.01471 0.02722 
     68      69      70      71      72      73      74      75      76      77 
0.04652 0.07295 0.10284 0.13129 0.14676 0.14455 0.12236 0.09032 0.05227 0.02471 
     78      79      80 
0.00865 0.00203 0.00021 
```

```r
str(T)
```

```
 'table' num [1:23(1d)] 0.00001 0.00002 0.00008 0.0002 0.00055 ...
 - attr(*, "dimnames")=List of 1
  ..$ X: chr [1:23] "57" "59" "60" "61" ...
```

---

```r
df <- data.frame(k = as.integer(names(T)), proportion = as.numeric(T))
df
```

```
    k proportion
1  57    0.00001
2  59    0.00002
3  60    0.00008
4  61    0.00020
5  62    0.00055
6  63    0.00151
7  64    0.00334
8  65    0.00690
9  66    0.01471
10 67    0.02722
11 68    0.04652
12 69    0.07295
13 70    0.10284
14 71    0.13129
15 72    0.14676
16 73    0.14455
17 74    0.12236
18 75    0.09032
19 76    0.05227
20 77    0.02471
21 78    0.00865
22 79    0.00203
23 80    0.00021
```

]

---

```r
df$probability <- with(df, choose(80, k) * 0.9^k * 0.1^(80-k))
df
```

```
    k proportion  probability
1  57    0.00001 1.683886e-06
2  59    0.00002 2.016825e-05
3  60    0.00008 6.353000e-05
4  61    0.00020 1.874656e-04
5  62    0.00055 5.170421e-04
6  63    0.00151 1.329537e-03
7  64    0.00334 3.178424e-03
8  65    0.00690 7.041432e-03
9  66    0.01471 1.440293e-02
10 67    0.02722 2.708610e-02
11 68    0.04652 4.660403e-02
12 69    0.07295 7.294544e-02
13 70    0.10284 1.031657e-01
14 71    0.13129 1.307734e-01
15 72    0.14676 1.471201e-01
16 73    0.14455 1.451048e-01
17 74    0.12236 1.235351e-01
18 75    0.09032 8.894529e-02
19 76    0.05227 5.266497e-02
20 77    0.02471 2.462259e-02
21 78    0.00865 8.523203e-03
22 79    0.00203 1.941996e-03
23 80    0.00021 2.184745e-04
```

]

---

```r
with(df, plot(k, proportion, pch = 16))
with(df, lines(k, probability, col = "blue"))
```

![plot of chunk unnamed-chunk-32](figures/distribution-unnamed-chunk-32-1.svg)

---

# Questions?