class: center, middle

# Basic Usage of R

## Data Analysis with R and Python

### Deepayan Sarkar

---

# R is a full programming language

* Variables
* Functions
* Control flow structures
    * For loops, while loops
    * If-then-else (branching)

--

* Distinguishing features
    * Focus on _vectors_ and _vectorized operations_
    * Treatment of _functions_ at par with other object types

$$ \newcommand{\sub}{_} $$
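
---

# Vectors and functions: a first sketch

* A minimal sketch of the two distinguishing features named earlier, vectorized operations and functions as ordinary objects
* The values in `x` and the helper `square()` are made up for illustration only; they do not come from any dataset used in these slides

```r
# Vectorized arithmetic: the operation applies to every element at once
x <- c(1, 4, 9, 16)        # illustrative values (assumption)
y <- sqrt(x) + 1           # no explicit loop needed

# Functions are ordinary objects: they can be assigned to variables
# and passed as arguments to other functions
f <- sqrt
z <- sapply(x, f)          # calls f on each element of x in turn

# 'square' is a hypothetical helper defined only for this sketch
square <- function(u) u^2
w <- square(y - 1)         # recovers the original values of x
```

* Here `y`, `z`, and `w` are all computed without writing a loop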
---

# R is easily extensible

* Most standard data analysis methods are already implemented
* Can be extended by writing add-on packages
* Thousands of add-on packages are available

---

# Major concepts we need to know

* Variables (in the context of programming)

--

* Data structures needed for data analysis

--

* Functions (set of instructions for performing a procedure)

---

# Variables

* Variables are symbols that may be associated with different values
* Computations involving variables are done using their current value

```r
sqrt(x)
```

```
Error: object 'x' not found
```

---

# Variables

* Variables are symbols that may be associated with different values
* Computations involving variables are done using their current value

```r
x <- 10 # assignment
sqrt(x)
```

```
[1] 3.162278
```

```r
x <- -1
sqrt(x)
```

```
Warning in sqrt(x): NaNs produced
```

```
[1] NaN
```

```r
x <- -1+0i
sqrt(x)
```

```
[1] 0+1i
```

---

# Data structures for data analysis

* Vectors
* Matrices
* Data frames (a spreadsheet-like data set)
* Lists (general collection of objects)

---

# Atomic vectors

* Indexed collection of homogeneous scalars, which can be
    * Numeric / Integer / Complex
    * Character
    * Logical (`TRUE` / `FALSE`)

--

* Missing values are allowed, indicated as `NA`

--

* Elements are indexed starting from 1

--

* The $i$th element of vector `x` can be extracted using `x[[i]]`
* There are also more sophisticated forms of (vector) indexing

---

# Atomic vectors: examples

```r
month.name # built-in
```

```
 [1] "January"   "February"  "March"     "April"     "May"       "June"      "July"
 [8] "August"    "September" "October"   "November"  "December"
```

--

```r
x <- rnorm(10)
x
```

```
 [1]  0.1835270  0.6246906 -1.7680396  1.1349834 -0.9381819  0.3000937  0.7678891
 [8]  1.5019791 -1.4474599  0.1974796
```

--

```r
x[[3]] # third element of x
```

```
[1] -1.76804
```

---

# Atomic vectors: examples

```r
str(x) # useful function
```

```
 num [1:10] 0.184 0.625 -1.768 1.135 -0.938 ...
```

```r
str(month.name)
```

```
 chr [1:12] "January" "February" "March" "April" "May" "June" "July" "August" ...
```

---

# Creating atomic vectors

* Constructor functions

```r
numeric(10)
```

```
 [1] 0 0 0 0 0 0 0 0 0 0
```

```r
logical(5)
```

```
[1] FALSE FALSE FALSE FALSE FALSE
```

```r
character(5)
```

```
[1] "" "" "" "" ""
```

---

# Scalars are also vectors

* "Scalars" are just vectors of length 1

```r
str(numeric(2))
```

```
 num [1:2] 0 0
```

```r
str(numeric(1))
```

```
 num 0
```

```r
str(0)
```

```
 num 0
```

---

# Vectors can have zero length

* Vectors can have length zero

```r
numeric(0)
```

```
numeric(0)
```

```r
logical(0)
```

```
logical(0)
```

```r
length(character(0))
```

```
[1] 0
```

--

```r
length(NULL)
```

```
[1] 0
```

---

# Combining vectors using `c()`

* Vectors can also be created by combining smaller vectors
* For example, vectors `x` and `y` can be combined using `c(x, y)`

```r
c(1:5, numeric(3))
```

```
[1] 1 2 3 4 5 0 0 0
```

--

* Any number of vectors can be combined
* This is a common way to build up a vector using scalars

```r
c(2, 4, 6, 9, 11)
```

```
[1]  2  4  6  9 11
```

---

# Combining vectors of different types

* An atomic vector cannot hold elements of different types
* Combining vectors of different types converts (coerces) all elements to a single common type

```r
c(1:5, c(TRUE, FALSE))
```

```
[1] 1 2 3 4 5 1 0
```

```r
c(1:5, month.name[[1]])
```

```
[1] "1"       "2"       "3"       "4"       "5"       "January"
```

--

```r
c(1:5, c(TRUE, FALSE), month.name[[1]])
```

```
[1] "1"       "2"       "3"       "4"       "5"       "TRUE"    "FALSE"   "January"
```

```r
c(c(1:5, TRUE, FALSE), month.name[[1]])
```

```
[1] "1"       "2"       "3"       "4"       "5"       "1"       "0"       "January"
```

---

# Example: Our first dataset

* Life expectancy in different countries over time

| year| Australia| France|  India| Zimbabwe|
|----:|---------:|------:|------:|--------:|
| 1952|    69.120| 67.410| 37.373|   48.451|
| 1957|    70.330| 68.930| 40.249|   50.469|
| 1962|    70.930| 70.510| 43.605|   52.358|
| 1967|    71.100| 71.550| 47.193|   53.995|
| 1972|    71.930| 72.380| 50.651|   55.635|
| 1977|    73.490| 73.830| 54.208|   57.674|
| 1982|    74.740| 74.890| 56.596|   60.363|
| 1987|    76.320| 76.340| 58.553|   62.351|
| 1992|    77.560| 77.460| 60.223|   60.377|
| 1997|    78.830| 78.640| 61.765|   46.809|
| 2002|    80.370| 79.590| 62.879|   39.989|
| 2007|    81.235| 80.657| 64.698|   43.487|

---

layout: true

# Life Expectancy in France

---

```r
year <- c(1952, 1957, 1962, 1967, 1972, 1977,
          1982, 1987, 1992, 1997, 2002, 2007)
year
```

```
 [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
```

--

```r
lexp_france <- c(67.41, 68.93, 70.51, 71.55, 72.38, 73.83,
                 74.89, 76.34, 77.46, 78.64, 79.59, 80.657)
lexp_france
```

```
 [1] 67.410 68.930 70.510 71.550 72.380 73.830 74.890 76.340 77.460 78.640 79.590 80.657
```

---

```r
year <- seq(1952, 2007, by = 5)
year
```

```
 [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
```

---

```r
plot(year, lexp_france, pch = 16)
```

---

```r
lexp_france[[2]] - lexp_france[[1]]
```

```
[1] 1.52
```

---

```r
c(lexp_france[[2]] - lexp_france[[1]],
  lexp_france[[3]] - lexp_france[[2]],
  lexp_france[[4]] - lexp_france[[3]],
  lexp_france[[5]] - lexp_france[[4]],
  lexp_france[[6]] - lexp_france[[5]],
  lexp_france[[7]] - lexp_france[[6]],
  lexp_france[[8]] - lexp_france[[7]],
  lexp_france[[9]] - lexp_france[[8]],
  lexp_france[[10]] - lexp_france[[9]],
  lexp_france[[11]] - lexp_france[[10]],
  lexp_france[[12]] - lexp_france[[11]])
```

```
 [1] 1.520 1.580 1.040 0.830 1.450 1.060 1.450 1.120 1.180 0.950 1.067
```

---

```r
d <- numeric(0)
```

--

```r
for (i in 1:11)
{
    d <- c(d, lexp_france[[i+1]] - lexp_france[[i]])
}
d
```

```
 [1] 1.520 1.580 1.040 0.830 1.450 1.060 1.450 1.120 1.180 0.950 1.067
```

---

```r
lexp_france[-1] - lexp_france[-12]
```

```
 [1] 1.520 1.580 1.040 0.830 1.450 1.060 1.450 1.120 1.180 0.950 1.067
```

--

```r
diff(lexp_france)
```

```
 [1] 1.520 1.580 1.040 0.830 1.450 1.060 1.450 1.120 1.180 0.950 1.067
```

---

```r
d <- diff(lexp_france)
median(d)
```

```
[1] 1.12
```

```r
mean(d)
```

```
[1] 1.204273
```

---

```r
plot(d, pch = 16, type = "o", ylab = "difference", xlab = "period")
```

---

layout: false

# Types of vector indexing

* Indexing refers to extracting subsets of data
* R supports several kinds of indexing:
    * Indexing with a vector of positive integers
    * Indexing with a vector of negative integers
    * Indexing with a logical vector
    * Indexing with a vector of names

---

# The empty index

* A vector indexing operation has the form `x[index]`

--

* The most basic form is an empty index, which selects all elements

```r
month.name[]
```

```
 [1] "January"   "February"  "March"     "April"     "May"       "June"      "July"
 [8] "August"    "September" "October"   "November"  "December"
```

---

# Indexing with an integer vector

* For integer indexing, `index` is an integer vector

```r
month.name[c(2, 4, 6, 9, 11)]
```

```
[1] "February"  "April"     "June"      "September" "November"
```

--

* Elements can be repeated

```r
month.name[c(2, 2, 6, 4, 6, 11)]
```

```
[1] "February" "February" "June"     "April"    "June"     "November"
```

---

# Indexing with an integer vector

* "Out-of-bounds" indexing gives `NA` (missing)

```r
month.name[13]
```

```
[1] NA
```

--

```r
seq(1, by = 2, length.out = 8)
```

```
[1]  1  3  5  7  9 11 13 15
```

--

```r
month.name[seq(1, by = 2, length.out = 8)]
```

```
[1] "January"   "March"     "May"       "July"      "September" "November"  NA
[8] NA
```

---

# Indexing with an integer vector

* Indexing with a scalar (vector of length 1) also works:

```r
month.name[2]
```

```
[1] "February"
```

--

* This is usually the same as `x[[index]]`

```r
month.name[[2]]
```

```
[1] "February"
```

--

* However, the two forms behave differently when the index is out of bounds

```r
month.name[15]
```

```
[1] NA
```

```r
month.name[[15]]
```

```
Error in month.name[[15]]: subscript out of bounds
```

---

# Indexing with a vector of negative integers

* Negative integers omit the specified entries

```r
month.name[-2]
```

```
 [1] "January"   "March"     "April"     "May"       "June"      "July"      "August"
 [8] "September" "October"   "November"  "December"
```

```r
month.name[-c(2, 4, 6, 9, 11)]
```

```
[1] "January"  "March"    "May"      "July"     "August"   "October"  "December"
```

--

* Cannot be mixed with positive integers

```r
month.name[c(2, -3)]
```

```
Error in month.name[c(2, -3)]: only 0's may be mixed with negative subscripts
```

---

# Indexing with 0

* Zero has a special meaning: it doesn't select anything

```r
month.name[0]
```

```
character(0)
```

```r
month.name[integer(0)] ## a zero-length index also selects nothing
```

```
character(0)
```

```r
month.name[c(1, 2, 0, 11, 12)]
```

```
[1] "January"  "February" "November" "December"
```

```r
month.name[-c(1, 2, 0, 11, 12)]
```

```
[1] "March"     "April"     "May"       "June"      "July"      "August"    "September"
[8] "October"
```

---

# Indexing with a logical vector

* Indexing by logical vector: select `TRUE` elements

```r
month.name[c(TRUE, FALSE, FALSE)] # index replicated
```

```
[1] "January" "April"   "July"    "October"
```

---

# Indexing with a logical vector

* Indexing by logical vector: select `TRUE` elements

```r
i <- substring(month.name, 1, 1) == "J"
i
```

```
 [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
```

--

```r
month.name[i]
```

```
[1] "January" "June"    "July"
```

---

# Indexing with a logical vector

```r
(x <- rnorm(20))
```

```
 [1]  0.4551969 -0.4193646  0.1298366  1.7835150 -0.2016912 -0.6367525 -0.4775448
 [8] -0.2687966  0.2538650  0.8750837 -2.1376604 -0.6995476  0.0897731  0.6781549
[15]  0.4202109  0.6365381  1.1706711  0.7683105  0.4760369  1.1587456
```

```r
x > 0
```

```
 [1]  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
[15]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
```

```r
x[x > 0]
```

```
 [1] 0.4551969 0.1298366 1.7835150 0.2538650 0.8750837 0.0897731 0.6781549 0.4202109
 [9] 0.6365381 1.1706711 0.7683105 0.4760369 1.1587456
```

```r
mean(x[x > 0])
```

```
[1] 0.6843029
```

---

# Converting a logical index vector to integer

* Logical indexing may be replaced by integer indexing using `which()`

```r
i
```

```
 [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
```

```r
which(i)
```

```
[1] 1 6 7
```

--

```r
month.name[ which(i) ]
```

```
[1] "January" "June"    "July"
```

--

```r
month.name[ -which(i) ] # same as month.name[ !i ]
```

```
[1] "February"  "March"     "April"     "May"       "August"    "September" "October"
[8] "November"  "December"
```

---

# Converting a logical index vector to integer

* But be careful about zero-length indices

```r
which(substring(month.name, 1, 1) == "B")
```

```
integer(0)
```

```r
month.name[ which( substring(month.name, 1, 1) == "B") ]
```

```
character(0)
```

```r
-which(substring(month.name, 1, 1) == "B")
```

```
integer(0)
```

```r
month.name[ -which( substring(month.name, 1, 1) == "B") ] # selects nothing, not all 12 months
```

```
character(0)
```

---

layout: true

# Indexing with a vector of names

---

- Vectors can optionally have names, one for each element
- These are usually informative labels

--

- Example: quantiles of a Normal random sample

```r
x <- rnorm(100)
qx <- quantile(x)
qx
```

```
         0%         25%         50%         75%        100%
-2.45118171 -0.70671319  0.05292767  0.49462338  1.90524820
```

```r
names(qx)
```

```
[1] "0%"   "25%"  "50%"  "75%"  "100%"
```

--

```r
names(x) # no names
```

```
NULL
```

---

- When present, names may be used to identify elements
- Indexing with names works in the same way as positive integers
- Instead of position, the corresponding named element is selected

```r
qx[["50%"]] ## extracting a single element using scalar indexing
```

```
[1] 0.05292767
```

```r
qx["50%"] ## extracting a single element with vector indexing
```

```
       50%
0.05292767
```

```r
qx[c("25%", "75%")] ## extracting multiple elements
```

```
       25%        75%
-0.7067132  0.4946234
```

---

* Inter-quartile range

```r
diff(qx[c("25%", "75%")])
```

```
     75%
1.201337
```

--

```r
IQR(x)
```

```
[1] 1.201337
```

---

- Unmatched names are treated like out-of-bounds indices

```r
qx[["95%"]]
```

```
Error in qx[["95%"]]: subscript out of bounds
```

```r
qx["95%"]
```

```
<NA>
  NA
```
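
---

- A possible way to avoid `NA` results from unmatched names is to keep only the requested names that are actually present, using `%in%`
- This is a sketch only: the vector `v` and the names in `wanted` are made-up illustration values, not part of the quantile example

```r
# Hypothetical named vector, used only for this illustration
v <- c(a = 1.5, b = 2.5, c = 3.5)
wanted <- c("a", "z")               # "z" is not a name of v

present <- wanted %in% names(v)     # logical: which requested names exist
safe <- v[ wanted[present] ]        # index only with names that exist
unsafe <- v[ wanted ]               # unmatched name gives a missing value
```

- Here `safe` contains only the element named `"a"`, while `unsafe` has a second, missing element for the unmatched `"z"`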