Iteration and functions

Learning Objectives

You will learn about functions and iteration by using simulation to calculate a power analysis for ANOVA on a simple two-group design.

Basic

work with base::rep() and base::seq() function
create your own data frames with tibble::tibble()
repeat commands and handle result using purrr::rerun(), purrr::map_*(), purrr::walk()
write your own custom functions with function()

Intermediate

repeat commands having multiple arguments using purrr::map2_*() and purrr::pmap_*()
create nested data frames using dplyr::group_by() and tidyr::nest()
work with nested data frames in dplyr
simulated data from a simple two-group design

Advanced

capture and deal with errors using ‘adverb’ functions purrr::safely() and purrr::possibly()

Prep

Read Chapters 19 and 21 of R for Data Science

Resources

RStudio Apply Functions Cheatsheet

Background

In the next two lectures, we are going to learn more about iteration (doing the same commands over and over) and custom functions through a data simulation exercise, which will also lead us into more traditional statistical topics. Along the way you will also learn more about how to create vectors and tables in R.

# libraries needed for these examples
library(tidyverse)  ## contains purrr, tidyr, dplyr

Phase 1: Do all the things once

Easy

Use rep() to create a vector of alternating "A" and "B" values of length 24.

rep(c("A", "B"), 12)

##  [1] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A"
## [18] "B" "A" "B" "A" "B" "A" "B"

Use rep() to create a vector of 12 "A" values followed by 12 "B" values

rep(c("A", "B"), each = 12)

##  [1] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "B" "B" "B" "B" "B"
## [18] "B" "B" "B" "B" "B" "B" "B"

Use rep() to create a vector of 11 "A" values followed by 3 "B" values

rep(c("A", "B"), c(11, 3))

##  [1] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "B" "B" "B"

Less easy

The built-in variable letters contains the 26 letters of the English alphabet. Create a vector where the repetition for each letter is determined by its position in the alphabet; in other words: "a" "b" "b" "c" "c" "c" etc. (Hint: ?seq_len() and ?seq_along())

rep(letters, seq_along(letters))

##   [1] "a" "b" "b" "c" "c" "c" "d" "d" "d" "d" "e" "e" "e" "e" "e" "f" "f"
##  [18] "f" "f" "f" "f" "g" "g" "g" "g" "g" "g" "g" "h" "h" "h" "h" "h" "h"
##  [35] "h" "h" "i" "i" "i" "i" "i" "i" "i" "i" "i" "j" "j" "j" "j" "j" "j"
##  [52] "j" "j" "j" "j" "k" "k" "k" "k" "k" "k" "k" "k" "k" "k" "k" "l" "l"
##  [69] "l" "l" "l" "l" "l" "l" "l" "l" "l" "l" "m" "m" "m" "m" "m" "m" "m"
##  [86] "m" "m" "m" "m" "m" "m" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n"
## [103] "n" "n" "n" "o" "o" "o" "o" "o" "o" "o" "o" "o" "o" "o" "o" "o" "o"
## [120] "o" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p"
## [137] "q" "q" "q" "q" "q" "q" "q" "q" "q" "q" "q" "q" "q" "q" "q" "q" "q"
## [154] "r" "r" "r" "r" "r" "r" "r" "r" "r" "r" "r" "r" "r" "r" "r" "r" "r"
## [171] "r" "s" "s" "s" "s" "s" "s" "s" "s" "s" "s" "s" "s" "s" "s" "s" "s"
## [188] "s" "s" "s" "t" "t" "t" "t" "t" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [205] "t" "t" "t" "t" "t" "t" "u" "u" "u" "u" "u" "u" "u" "u" "u" "u" "u"
## [222] "u" "u" "u" "u" "u" "u" "u" "u" "u" "u" "v" "v" "v" "v" "v" "v" "v"
## [239] "v" "v" "v" "v" "v" "v" "v" "v" "v" "v" "v" "v" "v" "v" "v" "w" "w"
## [256] "w" "w" "w" "w" "w" "w" "w" "w" "w" "w" "w" "w" "w" "w" "w" "w" "w"
## [273] "w" "w" "w" "w" "x" "x" "x" "x" "x" "x" "x" "x" "x" "x" "x" "x" "x"
## [290] "x" "x" "x" "x" "x" "x" "x" "x" "x" "x" "x" "y" "y" "y" "y" "y" "y"
## [307] "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y"
## [324] "y" "y" "z" "z" "z" "z" "z" "z" "z" "z" "z" "z" "z" "z" "z" "z" "z"
## [341] "z" "z" "z" "z" "z" "z" "z" "z" "z" "z" "z"

Create a vector of 10 random normal deviates using rnorm(), and store these values in err. Then use tibble::tibble() to create a table including err as a column. Your table should mimic the structure of the table below (your values for err will differ), with the constraint that Y = mu + eff + err. Store this table in dat.
```
dat <- tibble(mu = 100,
       eff = rep(c(-3, 3), each = 5),
       A = rep(c("A1", "A2"), each = 5),
       err = rnorm(10),
       Y = mu + eff + err) %>%
  select(Y, mu:err)
```
The table you created above is what is known as a decomposition matrix for a linear model where Y is the dependent variable and A is the independent variable with two levels. The main effects of A are -3 for A1 and 3 for A2; a 6 unit difference. mu ( \(\mu\) ) is the grand mean and err is the residual. Run a one-factor ANOVA on the above data using the aov() function. Run summary() on the result. NB: you must make A into a factor first, using as.factor().
```
   mod <- aov(Y ~ A, dat %>% mutate(A = as.factor(A)))

summary(mod)
```
```
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## A            1  96.82   96.82   56.01 7.04e-05 ***
## Residuals    8  13.83    1.73                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
Now tidy the results into a table using broom::tidy(), pull out the \(p\)-value and store it in pval.
```
pval <- broom::tidy(mod) %>% pull(p.value) %>% pluck(1)
pval
```
```
## [1] 7.035458e-05
```

Phase 2: Create functions for easy re-use

Now we are going to wrap up the code we created above into two custom functions: sim_data(), which will generate a tibble() with randomly generated two group data; and run_anova() which will run an anova on a data table. But let’s first get some practice creating functions.

Here is an example of a function that computes the sum of two numbers x and y and then divides this sum by a third number z.

xy_over_z <- function(x, y, z) { # <- function args x y z
  ## function body
  cat(sprintf("x = %0.3f; y = %0.3f; z = %0.3f", x, y, z), "\n")
  (x + y) / z
}

The function xy_over_z takes 3 arguments: x, y, and z. The value that is returned to the calling process is the last value that is computed by the function, which in this case is also the only computation: (x + y) / z.

Try it out with some different arguments.

xy_over_z(1, 3, 2)

## x = 1.000; y = 3.000; z = 2.000

## [1] 2

xy_over_z(1999, 42771, 308)

## x = 1999.000; y = 42771.000; z = 308.000

## [1] 145.3571

xy_over_z(log(4), log(4), log(2))

## x = 1.386; y = 1.386; z = 0.693

## [1] 4

xy_over_z(10, 27, 3)

## x = 10.000; y = 27.000; z = 3.000

## [1] 12.33333

xy_over_z(z = 10, y = 27, x = 3)

## x = 3.000; y = 27.000; z = 10.000

## [1] 3

Something important to understand: except the return value, whatever happens in the function stays in the function. You can’t use a function to modify existing values. Try it:

x <- 999L
print(x)

## [1] 999

xy_over_z(1, 3, 2)

## x = 1.000; y = 3.000; z = 2.000

## [1] 2

print(x)

## [1] 999

Always remember: you can read values outside of the function (often not a good idea) but any modifications you make to the values will be local to the function body and will thus be thrown away once the function call completes.

Flow control. Notice a flaw with our function? What happens if the user runs the function with z = 0?

xy_over_z(1, 3, 0)

## x = 1.000; y = 3.000; z = 0.000

## [1] Inf

It returns the value Inf (infinity) which… is perhaps undesirable. Maybe in that case it should return NA (not available) instead.

You can control the flow of operations in your functions with an if (test) {} else {} statement. If the value of test is TRUE, it will run the code within the first brackets; if FALSE, it will skip that code and run the code in the brackets following else. For instance:

xy_over_z <- function(x, y, z) { # <- function args x y z
  ## function body
  cat(sprintf("x = %0.3f; y = %0.3f; z = %0.3f", x, y, z), "\n")
  if (z == 0) {  ## test is either TRUE or FALSE
    NA  ## execute only if z == 0
  } else {
    (x + y) / z   ## execute otherwise
  }
}

xy_over_z(1, 3, 0)

## x = 1.000; y = 3.000; z = 0.000

## [1] NA

Easy

Default values. Modify the function definition of xy_over_z() so that the default value for z is 10, and test that it works with xy_over_z(4, 6).

xy_over_z <- function(x, y, z = 10) { # <- function args x y z
  ## function body
  cat(sprintf("x = %0.3f; y = %0.3f; z = %0.3f", x, y, z), "\n")
  (x + y) / z
}

xy_over_z(4, 6)

## x = 4.000; y = 6.000; z = 10.000

## [1] 1

Madlibs. Create a function madlibs() that takes a noun, a verb, and an adjective, and puts them into a sentence using paste(). Try it out.

madlibs <- function(noun, verb, adjective) {
  paste("The", adjective, noun, "would like to", verb, "as much as is humanly possible.")
}

madlibs("special investigator", "prosecute", "virtuous")

## [1] "The virtuous special investigator would like to prosecute as much as is humanly possible."

madlibs("pumpkin", "twerk", "flabby")

## [1] "The flabby pumpkin would like to twerk as much as is humanly possible."

Now comes the fun part.

Put the code you used to create dat as the body of a new function sim_data() with no arguments. Call the function twice to make sure it works and that it gives you different data each time.
```
sim_data <- function() {
  tibble(mu = 100,
     eff = rep(c(-3, 3), each = 5),
     A = rep(c("A1", "A2"), each = 5),
     err = rnorm(10),
     Y = mu + eff + err) %>%
  select(Y, mu:err)
}

sim_data()
```
```
sim_data()
```

Less easy

Now re-write sim_data() to allow the user to change the values of mu, eff, and the standard deviation of err. The use the values you used before as defaults.
```
sim_data <- function(mu = 100, eff = 3, sd = 1) {
  tibble(mu = mu,
     eff = rep(c(-eff, eff), each = 5),
     A = rep(c("A1", "A2"), each = 5),
     err = rnorm(10, sd = sd),
     Y = mu + eff + err) %>%
  select(Y, mu:err)
}

sim_data()
```
```
sim_data(0, 10, 40)
```
Next, re-write it again to allow the user to change the number of observations per group through an argument n_per_group, which defaults to 5. While you’re at it, make sure that A is also defined as a factor (will save time later). Run it a few times with varying arguments to check.
```
sim_data <- function(mu = 100, eff = 3, sd = 1, n_per_group = 5) {
  tibble(mu = mu,
     eff = rep(c(-eff, eff), each = n_per_group),
     A = factor(rep(c("A1", "A2"), each = n_per_group)),
     err = rnorm(2 * n_per_group, sd = sd),
     Y = mu + eff + err) %>%
  select(Y, mu:err)
}

sim_data(n = 2)
```
```
sim_data(0, 10, 40, 20)
```
OK now let’s wrap the code you created to run an ANOVA and get the p-value in the function run_anova() (part 1, problems 6 and 7). It should have one argument, x, which is the data on which the anova is to be performed, and it should return a single value: the \(p\)-value. Call it a few times with x as a simulated dataset to make sure it works as you intended, and that the p-values are changing.
```
run_anova <- function(x) {
  mod <- aov(Y ~ A, x)
  broom::tidy(mod) %>% pull(p.value) %>% pluck(1)
}

run_anova(sim_data())
```
```
## [1] 7.115453e-08
```
Modify run_anova() so that it accepts an additional argument all_stats (default FALSE) which determines whether the whole table from broom::tidy() is returned (when TRUE), or just the \(p\)-value (when FALSE).
```
run_anova <- function(x, all_stats = FALSE) {
  mod <- aov(Y ~ A, x)
  stats_tbl <- broom::tidy(mod)
  if (all_stats) {
    stats_tbl
  } else {
    stats_tbl %>% pull(p.value) %>% pluck(1)
  }
}

run_anova(sim_data())
```
```
## [1] 2.101921e-06
```
```
run_anova(sim_data(), TRUE)
```

Phase 3: Iterating

Now we’re going to pull together what we’ve done so far and attempt to perform a power analysis through simulation!

Power is the probability of rejecting a false null hypothesis, usually denoted as \(1 - \beta\) where \(\beta\) is the probability of a Type II error (retaining \(H_0\) when it is false). It relates to the sensitivity of an analysis.

For simple designs, like our two-group situation, power can be calculated analytically. However, for more complicated designs, it is often necessary to run simulations to estimate power. So although the approach here is overkill (for a two-group design you could use the function pow.t.test()) it more generalizable.

Our simulation-based power estimate will involve three main steps:

Generate nmc number of datasets, where nmc is a large number (typically >1000);
Run the analysis on the datasets and store the results
Analyze the results from step 2

First let’s learn about some of the functions used for iteration. These functions live in the purrr package which is part of the tidyverse.

The purrr::map* functions are probably the most important. purrr::map() takes a list as its first object and iterates through the elements of the list, passing each element to the function name specified as the second argument and storing the results. You can choose one of the various map() functions depending on how you want the return values stored. map_chr() returns a vector of character strings; map_dbl() returns a vector of real numbers; map_int() returns a vector of integers.

map() is appropriate when you have varying input and want to do something to each element of the input and keep the result. Sometimes you don’t have any input, but just want to do the same thing \(X\) times and keep the result. In this situation, use purrr::rerun(). Other times, you might have varying input, but want to call a function not for the result it will return, but for its side effect (e.g., save a file, or print a message to the user). In this situation, use purrr::walk(). Take a moment to familiarize yourself with the help pages for these functions.

Create a list dsets containing 5 datasets using purrr::rerun() on your sim_data() function.
```
dsets <- rerun(5, sim_data())
```

Now use map() on dsets to call run_anova() on each of the 5 datasets.

map(dsets, run_anova)

## [[1]]
## [1] 4.173647e-05
## 
## [[2]]
## [1] 4.485345e-06
## 
## [[3]]
## [1] 2.591919e-07
## 
## [[4]]
## [1] 1.793923e-05
## 
## [[5]]
## [1] 6.30033e-05

Re-write the map() command above so that it returns a vector of type double instead of a list. At the same time, re-write it so that it performs steps 1 and 2 in a single line.
```
map_dbl(rerun(5, sim_data()), run_anova)
```
```
## [1] 1.743741e-07 6.675640e-07 1.635759e-05 4.098155e-06 1.869699e-05
```
Write code to count the proportion of runs for which the \(p\)-value is less than .05. This is your estimate of power, given the effect size.
```
n_runs <- 5L
pvals <- map_dbl(rerun(n_runs, sim_data()), run_anova)
psig <- sum(pvals < .05) / n_runs
```
Now put all of the code you have created in this section into a new function sim_power() which gives the user control over all parameters (mu, eff, n_per_group, sd, n_runs). If you need to debug your function, put browser() as the first line of the function body and then call it from the console.
```
sim_power <- function(eff = 0, sd = 6, mu = 0, n_per_group = 10, n_runs = 1000L) {
  pvals <- map_dbl(rerun(n_runs, sim_data(mu, eff, sd, n_per_group)), 
                   run_anova)
  sum(pvals < .05) / n_runs
}
```
Check your function by running scenarios where the effect takes on various values, including zero. Compare your results to pow.t.test().