Data Frames and Control ======================================================== author: John S date: 16 September 2015 font-family: 'Garamond' Agenda === - Making and working with data frames - Conditionals: switching between different calculations - Iteration: Doing something over and over - Vectorizing: Avoiding explicit iteration In Our Last Thrilling Episode === - Vectors: series of values all of the same type `v[5]`, `v["name"] - Arrays: multi-dimensional generalization of vectors `a[5,6,2]`, `a[,6,]`, `a[rowname, colname, layername]` - Matrices: special 2D arrays with matrix math `m[5,6]`, `m[,6]`, `m[,colname]` - Lists: series of values of mixed types `l[[3]]`, `l$name` - Dataframes: hybrid of matrix and list Dataframes, Encore === - 2D tables of data - Each case/unit is a row - Each variable is a column - Variables can be of any type (numbers, text, Booleans, ...) - Both rows and columns can get names Creating an example dataframe === ```{r} library(datasets) states <- data.frame(state.x77, abb=state.abb, region=state.region, division=state.division) ``` `data.frame()` is combining here a pre-existing matrix (`state.x77`), a vector of characters (`state.abb`), and two vectors of qualitative categorical variables (**factors**; `state.region`, `state.division`) Column names are preserved or guessed if not explicitly set === ```{r} colnames(states) states[1,] ``` Dataframe access === - By row and column index ```{r} states[49,3] ``` - By row and column names ```{r} states["Wisconsin","Illiteracy"] ``` Dataframe access (cont'd) === - All of a row: ```{r} states["Wisconsin",] ``` Exercise: what class is `states["Wisconsin",]`? Dataframe access (cont'd.) === - All of a column: ```{r} head(states[,3]) head(states[,"Illiteracy"]) head(states$Illiteracy) ``` Dataframe access (cont'd.) === - Rows matching a condition: ```{r} states[states$division=="New England", "Illiteracy"] states[states$region=="South", "Illiteracy"] ``` Replacing values === Parts or all of the dataframe can be assigned to: ```{r} summary(states$HS.Grad) states$HS.Grad <- states$HS.Grad/100 summary(states$HS.Grad) states$HS.Grad <- 100*states$HS.Grad ``` with() === What percentage of literate adults graduated HS? ```{r} head(100*(states$HS.Grad/(100-states$Illiteracy))) ``` `with()` takes a data frame and evaluates an expression "inside" it: ```{r} with(states, head(100*(HS.Grad/(100-Illiteracy)))) ``` Data arguments === Lots of functions take `data` arguments, and look variables up in that data frame: ```{r} plot(Illiteracy~Frost, data=states) ``` $R^2 =0.45$, $p \approx {10}^{-7}$ Conditionals === Have the computer decide what to do next - Mathematically: \[ |x| = \left\{ \begin{array}{cl} x & \mathrm{if}~x\geq 0 \\ -x &\mathrm{if}~ x < 0\end{array}\right. ~,~ \psi(x) = \left\{ \begin{array}{cl} x^2 & \mathrm{if}~|x|\leq 1\\ 2|x|-1 &\mathrm{if}~ |x| > 1\end{array}\right. \] Exercise: plot $\psi$ in R - Computationally: ``` if the country code is not "US", multiply prices by current exchange rate ``` if() === Simplest conditional: ``` if (x >= 0) { x } else { -x } ``` Condition in `if` needs to give _one_ `TRUE` or `FALSE` value `else` clause is optional one-line actions don't need braces ``` if (x >= 0) x else -x ``` Nested if() === `if` can *nest* arbitrarily deeply: ``` if (x^2 < 1) { x^2 } else { if (x >= 0) { 2*x-1 } else { -2*x-1 } } ``` Can get ugly though Combining Booleans: && and || === `&` work `|` like `+` or `*`: combine terms element-wise Flow control wants *one* Boolean value, and to skip calculating what's not needed `&&` and `||` give _one_ Boolean, lazily: ```{r} (0 > 0) && (all.equal(42%%6, 169%%13)) ``` This *never* evaluates the complex expression on the right Use `&&` and `||` for control, `&` and `|` for subsetting Iteration === Repeat similar actions multiple times: ```{r} table.of.logarithms <- vector(length=7,mode="numeric") table.of.logarithms for (i in 1:length(table.of.logarithms)) { table.of.logarithms[i] <- log(i) } table.of.logarithms ``` for() === ``` for (i in 1:length(table.of.logarithms)) { table.of.logarithms[i] <- log(i) } ``` `for` increments a **counter** (here `i`) along a vector (here `1:length(table.of.logarithms)`) and **loops through** the **body* until it runs through the vector "**iterates over** the vector" N.B., there is a better way to do this job! The body of the for() loop === Can contain just about anything, including: - if() clauses - other for() loops (nested iteration) Nested iteration example === ``` c <- matrix(0, nrow=nrow(a), ncol=ncol(b)) if (ncol(a) == nrow(b)) { for (i in 1:nrow(c)) { for (j in 1:ncol(c)) { for (k in 1:ncol(a)) { c[i,j] <- c[i,j] + a[i,k]*b[k,j] } } } } else { stop("matrices a and b non-conformable") } ``` while(): conditional iteration === ``` while (max(x) - 1 > 1e-06) { x <- sqrt(x) } ``` Condition in the argument to `while` must be a single Boolean value (like `if`) Body is looped over until the condition is `FALSE` so can loop forever Loop never begins unless the condition starts `TRUE` for() vs. while() === for() is better when the number of times to repeat (values to iterate over) is clear in advance while() is better when you can recognize when to stop once you're there, even if you can't guess it to begin with Every for() could be replaced with a while() Exercise: show this Avoiding iteration === R has many ways of _avoiding_ iteration, by acting on whole objects - It's conceptually clearer - It leads to simpler code - It's faster (sometimes a little, sometimes drastically) Vectorized arithmetic === How many languages add 2 vectors: ``` c <- vector(length(a)) for (i in 1:length(a)) { c[i] <- a[i] + b[i] } ``` How R adds 2 vectors: ``` a+b ``` or a triple `for()` loop for matrix multiplication vs. `a %*% b` Advantages of vectorizing === - Clarity: the syntax is about _what_ we're doing - Concision: we write less - Abstraction: the syntax hides _how the computer does it_ - Generality: same syntax works for numbers, vectors, arrays, ... - Speed: modifying big vectors over and over is slow in R; work gets done by optimized low-level code Vectorized calculations === Many functions are set up to vectorize automatically ```{r} abs(-3:3) log(1:7) ``` See also `apply()` from last week We'll come back to this in great detail later Vectorized conditions: ifelse() === ``` ifelse(x^2 > 1, 2*abs(x)-1, x^2) ``` 1st argument is a Boolean vector, then pick from the 2nd or 3rd vector arguments as `TRUE` or `FALSE` Summary === - Dataframes - `if`, nested `if`, `switch` - Iteration: `for`, `while` - Avoiding iteration with whole-object ("vectorized") operations What Is Truth? === 0 counts as `FALSE`; other numeric values count as `TRUE`; the strings "TRUE" and "FALSE" count as you'd hope; most everything else gives an error Advice: Don't play games here; try to make sure control expressions are getting Boolean values Conversely, in arithmetic, `FALSE` is 0 and `TRUE` is 1 ```{r} mean(states$Murder > 7) ``` switch() === give a variable to select on and a value for each option (and other) ```{r} ccc <- c("b","QQ","a","A","bb") for(ch in ccc) cat(ch,":", switch(EXPR = ch, a =, A = 1, b = 2:3, "Otherwise: last"),"\n") ``` Unconditional iteration === ``` repeat { print("Help! I am Dr. Morris Culpepper, trapped in an endless loop!") } ``` "Manual" control over iteration === ``` repeat { if (watched) { next() } print("Help! I am Dr. Morris Culpepper, trapped in an endless loop!") if (rescued) { break() } } ``` `break()` exits the loop; `next()` skips the rest of the body and goes back into the loop both work with `for()` and `while()` as well Exercise: how would you replace `while()` with `repeat()`?