class: center, middle # Distributions ## Statistics I — Data Exploration ### Deepayan Sarkar
---
$$ \newcommand{\sub}{_} $$
# Goals * Some preliminary concepts * Visualization * Probability distributions * Data Frame (R equivalent of a spreadsheet) --- layout: true # Probability distribution --- * Example: _Number of distinct birthdays_ in a group of $n$ people * This is different from the earlier problem (probability of all distinct) -- * One way to simulate: ```r distinct_birthdays <- function(n) { sample(1:365, n, replace = TRUE) |> unique() |> length() } ``` * The `|>` notation represents a "pipe" operator * This is equivalent to ```r distinct_birthdays <- function(n) { length(unique(sample(1:365, n, replace = TRUE))) } ``` --- * This is a 'random' quantity — depends on the sample drawn -- * We can repeat the simulation experiment and obtain 'data' ```r d <- replicate(25, distinct_birthdays(23)) d # each element represents result of one simulation ``` ``` [1] 21 23 22 22 23 22 21 23 21 22 22 21 23 21 23 23 22 22 22 22 23 23 23 23 23 ``` -- * How can we plot this data? * Do we even want to? --- layout: true # The `plot()` function --- ```r plot(d) # not very useful in this case ```  --- ```r plot(sort(d), pch = 16) # sorting makes it slightly more useful ```  --- layout: true # The `stripchart()` function --- ```r stripchart(d) # also mostly useless ```  --- ```r stripchart(d, method = "jitter") # but stripchart() has some useful options ```  --- ```r stripchart(d, method = "stack") # more useful: the "stack" method ```  --- * Let's try with a larger $n$ and more replications ```r d50 <- replicate(100, distinct_birthdays(50)) d50 ``` ``` [1] 46 48 48 49 47 46 45 45 46 44 48 45 48 49 46 48 46 47 44 45 45 48 45 47 48 [26] 47 49 46 47 47 46 48 47 45 48 47 45 48 46 48 46 47 47 49 47 49 48 45 46 46 [51] 45 46 49 47 49 49 46 43 50 49 47 48 47 48 47 44 45 49 46 46 48 45 47 47 46 [76] 49 43 48 48 49 46 49 46 48 46 47 48 47 44 45 47 47 49 46 47 48 47 49 46 46 ``` --- ```r stripchart(d50, method = "stack", pch = 16) ```  --- layout: true # Tables: `barplot()` and `dotchart()` --- * By stacking, we are basically counting the "frequency" of each value -- * This can be done directly using the `table()` function ```r table(d50) ``` ``` d50 43 44 45 46 47 48 49 50 2 4 13 22 23 20 15 1 ``` --- * We are actually interested in just the proportion ```r table(d50) |> prop.table() ``` ``` d50 43 44 45 46 47 48 49 50 0.02 0.04 0.13 0.22 0.23 0.20 0.15 0.01 ``` * Tables can be plotted using "bar charts" and "dot plots" --- ```r table(d50) |> prop.table() |> barplot() ```  --- ```r table(d50) |> prop.table() |> dotchart(pch = 16) ```  --- layout: true # Alternative functions: `barchart()` and `dotplot()` --- * Did you notice that the names are switched around? -- * R comes with an add-on package called `lattice` which has alternative implementations * To use it, you need to "load" the `lattice` package ```r library(package = "lattice") ``` --- ```r table(d50) |> prop.table() |> barchart() ```  --- ```r table(d50) |> prop.table() |> barchart(horizontal = FALSE) ```  --- ```r replicate(100, distinct_birthdays(50)) |> table() |> prop.table() |> barchart(horizontal = FALSE) # another simulation ```  --- ```r replicate(100000, distinct_birthdays(50)) |> table() |> prop.table() |> barchart(horizontal = FALSE) # increase replications ```  --- ```r replicate(100000, distinct_birthdays(60)) |> table() |> prop.table() |> barchart(horizontal = FALSE) # n = 60 ```  --- ```r replicate(100000, distinct_birthdays(70)) |> table() |> prop.table() |> barchart(horizontal = FALSE) # n = 70 ```  --- ```r replicate(100000, distinct_birthdays(80)) |> table() |> prop.table() |> barchart(horizontal = FALSE) # n = 80 ```  --- ```r replicate(100000, distinct_birthdays(80)) |> table() |> prop.table() |> dotplot(horizontal = FALSE) # n = 70 ```  --- layout: true # Distribution --- * What we are really trying to visualize is a "probability distribution" -- * Let $X$ be the number of distinct birthdays in a group of 80 people * What is $P(X = 80)$ ? -- * What is $P(X = 79)$ ? * What is $P(X = 78)$ ? -- * What is the most likely value of $X$ ? * What is the expected value of $X$ ? -- * In this case, true answer is difficult to calculate * Simulation will give us answers that will be close (by "law of large numbers") --- layout: true # The Binomial distribution --- * An important distribution: number of successes in $n$ independent trials (e.g., coin tosses) * The probabily of success in one trial is $p$ (which is known, but not necessarily $\frac12$) -- * We can simulate exactly as before ```r n <- 80 sample(c(0, 1), size = n, replace = TRUE) ``` ``` [1] 0 1 0 0 0 1 1 0 1 0 1 0 1 1 0 1 0 0 1 0 1 1 1 0 1 0 0 1 0 0 0 0 0 1 0 1 0 1 [39] 0 0 0 0 1 0 1 0 0 0 0 0 1 1 1 1 1 1 0 1 0 0 1 1 0 1 0 0 1 0 0 0 0 1 0 0 1 0 [77] 1 0 1 0 ``` ```r sample(c(0, 1), size = n, replace = TRUE) |> sum() ``` ``` [1] 46 ``` --- ```r replicate(100000, sum(sample(c(0, 1), size = 80, replace = TRUE))) |> table() ``` ``` 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 2 2 6 10 38 60 136 239 434 726 1205 1782 2659 3658 4784 5892 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 7251 8080 8662 8814 8671 8209 7101 6019 4734 3571 2666 1836 1190 678 424 244 53 54 55 56 57 58 61 111 54 34 11 3 2 2 ``` --- ```r replicate(100000, sum(sample(c(0, 1), size = 80, replace = TRUE))) |> table() |> prop.table() ``` ``` 20 22 23 24 25 26 27 28 29 30 0.00001 0.00002 0.00006 0.00018 0.00036 0.00054 0.00130 0.00251 0.00474 0.00704 31 32 33 34 35 36 37 38 39 40 0.01232 0.01750 0.02637 0.03629 0.04804 0.05985 0.07122 0.08134 0.08798 0.08892 41 42 43 44 45 46 47 48 49 50 0.08378 0.08122 0.07117 0.06127 0.04640 0.03669 0.02666 0.01760 0.01171 0.00764 51 52 53 54 55 56 57 59 0.00442 0.00233 0.00137 0.00060 0.00039 0.00011 0.00003 0.00002 ``` --- ```r replicate(100000, sum(sample(c(0, 1), size = 80, replace = TRUE))) |> table() |> prop.table() |> barchart(horizontal = FALSE) ```  --- ```r replicate(100000, sum(sample(c(0, 1), size = 80, replace = TRUE, prob = c(0.1, 0.9)))) |> table() |> prop.table() |> barchart(horizontal = FALSE) ```  --- * In this example, the exact probabilities are actually easy to calculate * Let $X$ be the number of successes in $n$ independent trials with probabily of success $p$. Then, $$ P(X = k) = { n \choose k } \, p^k \, (1-p)^{n - k}\,, \, k = 0, 1, 2, \dotsc, n $$ --- * To compare, we need both simulation probabilities and exact probabilities * This is where data frames become useful -- * Data frames are like a spreadsheet, containing multiple columns * Every column must have a name, and must have the same length --- * Example: Proportions observed in simulation ```r X <- replicate(100000, sum(sample(c(0, 1), size = 80, replace = TRUE, prob = c(0.1, 0.9)))) T <- prop.table(table(X)) ``` -- ```r T ``` ``` X 57 59 60 61 62 63 64 65 66 67 0.00001 0.00002 0.00008 0.00020 0.00055 0.00151 0.00334 0.00690 0.01471 0.02722 68 69 70 71 72 73 74 75 76 77 0.04652 0.07295 0.10284 0.13129 0.14676 0.14455 0.12236 0.09032 0.05227 0.02471 78 79 80 0.00865 0.00203 0.00021 ``` ```r str(T) ``` ``` 'table' num [1:23(1d)] 0.00001 0.00002 0.00008 0.0002 0.00055 ... - attr(*, "dimnames")=List of 1 ..$ X: chr [1:23] "57" "59" "60" "61" ... ``` --- .scrollable500[ ```r df <- data.frame(k = as.integer(names(T)), proportion = as.numeric(T)) df ``` ``` k proportion 1 57 0.00001 2 59 0.00002 3 60 0.00008 4 61 0.00020 5 62 0.00055 6 63 0.00151 7 64 0.00334 8 65 0.00690 9 66 0.01471 10 67 0.02722 11 68 0.04652 12 69 0.07295 13 70 0.10284 14 71 0.13129 15 72 0.14676 16 73 0.14455 17 74 0.12236 18 75 0.09032 19 76 0.05227 20 77 0.02471 21 78 0.00865 22 79 0.00203 23 80 0.00021 ``` ] --- .scrollable500[ ```r df$probability <- with(df, choose(80, k) * 0.9^k * 0.1^(80-k)) df ``` ``` k proportion probability 1 57 0.00001 1.683886e-06 2 59 0.00002 2.016825e-05 3 60 0.00008 6.353000e-05 4 61 0.00020 1.874656e-04 5 62 0.00055 5.170421e-04 6 63 0.00151 1.329537e-03 7 64 0.00334 3.178424e-03 8 65 0.00690 7.041432e-03 9 66 0.01471 1.440293e-02 10 67 0.02722 2.708610e-02 11 68 0.04652 4.660403e-02 12 69 0.07295 7.294544e-02 13 70 0.10284 1.031657e-01 14 71 0.13129 1.307734e-01 15 72 0.14676 1.471201e-01 16 73 0.14455 1.451048e-01 17 74 0.12236 1.235351e-01 18 75 0.09032 8.894529e-02 19 76 0.05227 5.266497e-02 20 77 0.02471 2.462259e-02 21 78 0.00865 8.523203e-03 22 79 0.00203 1.941996e-03 23 80 0.00021 2.184745e-04 ``` ] --- ```r with(df, plot(k, proportion, pch = 16)) with(df, lines(k, probability, col = "blue")) ```  --- layout: false class: center, middle # Questions?