Probability and simulating data

Learning Objectives

Basic

Understand what types of data are best modeled by different distributions
- uniform
- binomial
- normal
- poisson
Generate and plot data randomly sampled from the above distributions
Test sampled distributions against a null hypothesis
- exact binomial test
- t-test (1-sample, independent samples, paired samples)
- correlation (pearson, kendall and spearman)
Define the following statistical terms:
- p-value
- alpha
- power
- false positive (type I error)
- false negative (type II error)
- confidence interval

Intermediate

Create a function to generate a sample with specific properties and run an inferential test
Calculate power using replicate and a sampling function
Calculate the minimum sample size for a specific power level and design

Advanced

Generate 3+ variables from a multivariate normal distribution and plot them

Prep

Chapter 21: Iteration of R for Data Science

Resources

Improving your statistical inferences on Coursera (week 1)

Class Notes

Simulating data is a very powerful way to test your understanding of statistical concepts. We are going to use simulations to learn the basics of probability.

# libraries needed for these examples
library(tidyverse)
library(MASS)

Uniform Distribution

The uniform distribution is the simplest distribution. All numbers in the range have an equal probability of being sampled.

Sample continuous distribution

runif(n, min=0, max=1)

Use runif() to sample from a continuous uniform distribution.

runif(10, min = 0, max = 1)

##  [1] 0.4629114 0.7160526 0.3703031 0.3297495 0.6740526 0.8923179 0.4092995
##  [8] 0.1899743 0.9876491 0.1418342

Sample discrete distribution

sample(x, size, replace = FALSE, prob = NULL)

Use sample() to sample from a discrete distribution.

Simulate a single roll of a 6-sided die.

sample(6, 1)

## [1] 5

Simulate 10 rolls of a 6-sided die. Set replace to TRUE so each roll is independent. See what happens if you set replace to FALSE.

sample(6, 10, replace = TRUE)

##  [1] 3 5 6 3 2 6 1 3 3 5

You can also use sample to sample from a list of named outcomes.

pet_types <- c("cat", "dog", "ferret", "bird", "fish")
sample(pet_types, 10, replace = TRUE)

##  [1] "cat"    "fish"   "fish"   "ferret" "dog"    "fish"   "cat"   
##  [8] "fish"   "dog"    "fish"

Ferrets are a much less common pet than cats and dogs, so our sample isn’t very realistic. You can set the probabilities of each item in the list with the prob argument.

pet_types <- c("cat", "dog", "ferret", "bird", "fish")
pet_prob <- c(0.3, 0.4, 0.1, 0.1, 0.1)
sample(pet_types, 10, replace = TRUE, prob = pet_prob)

##  [1] "cat"    "dog"    "dog"    "dog"    "ferret" "cat"    "dog"   
##  [8] "dog"    "dog"    "dog"

Binomial Distribution

The binomial distribution is useful for modeling binary data, where each observation can have one of two outcomes, like success/failure, yes/no or head/tails.

Sample distribution

rbinom(n, size, prob)

The rbinom function will generate a random binomial distribution.

n = number of observations
size = number of trials
prob = probability of success on each trial

Coin flips are a typical example of a binomial distribution, where we can assign head to 1 and tails to 0.

20 individual coin flips of a fair coin

rbinom(20, 1, 0.5)

##  [1] 1 1 1 1 0 1 0 0 1 1 0 0 1 1 0 0 1 0 1 1

20 individual coin flips of a baised (0.75) coin

rbinom(20, 1, 0.75)

##  [1] 1 1 1 1 1 0 1 1 0 1 1 1 1 0 1 0 1 0 0 1

You can generate the total number of heads in 1 set of 20 coin flips by setting size to 20 and n to 1.

rbinom(1, 20, 0.75)

## [1] 14

You can generate more sets of 20 coin flips by increasing the n.

rbinom(10, 20, 0.5)

##  [1]  7  8 11 11  7 11 10 15  8  8

You should always check your randomly generated data to check that it makes sense. For large samples, it’s easiest to do that graphically. A histogram is usually the best choice for plotting binomial data.

flips <- rbinom(1000, 20, 0.5)

ggplot() +
  geom_histogram(
    aes(flips), 
    binwidth = 1, 
    fill = "white", 
    color = "black"
  )

Run the simulation above several times, noting how the histogram changes. Try changing the values of n, size, and prob.

Exact binomial test

binom.test(x, n, p)

You can test a binomial distribution against a specific probability using the exact binomial test.

x = the number of successes
n = the number of trials
p = hypothesised probability of success

Here we can test a series of 10 coin flips from a fair coin and a biased coin against the hypothesised probability of 0.5 (even odds).

n <- 10
fair_coin <- rbinom(1, n, 0.5)
biased_coin <- rbinom(1, n, 0.6)

binom.test(fair_coin, n, p = 0.5)

## 
##  Exact binomial test
## 
## data:  fair_coin and n
## number of successes = 3, number of trials = 10, p-value = 0.3438
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.06673951 0.65245285
## sample estimates:
## probability of success 
##                    0.3

binom.test(biased_coin, n, p = 0.5)

## 
##  Exact binomial test
## 
## data:  biased_coin and n
## number of successes = 6, number of trials = 10, p-value = 0.7539
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.2623781 0.8784477
## sample estimates:
## probability of success 
##                    0.6

Run the code above several times, noting the p-values for the fair and biased coins. Alternatively, you can simulate coin flips online and build up a graph of results and p-values.

How does the p-value vary for the fair and biased coins?
What happens to the confidence intervals if you increase n from 10 to 100?
What criterion would you use to tell if the observed data indicate the coin is fair or biased?
How often do you conclude the fair coin is biased (false positives)?
How often do you conclude the biased coin is fair (false negatives)?

False positives & negatives

The probability that a test concludes the fair coin is biased is called the false positive rate (or Type I Error Rate). The alpha is the false positive rate we accept for a test. This is traditionally set at 0.05, but there are good arguments for setting a different criterion in some circumstances.

The probability that a test concludes the biased coin is fair is called the false negative rate (of Type II Error Rate). The power of a test is 1 minus its false negative rate (i.e., the true positive rate). Power depends on how biased the coin is and how many samples we take.

Sampling function

To estimate these rates, we need to repeat the sampling above many times. A function is ideal for repeating the exact same procedure over and over. Set the arguments of the function to variables that you might want to change. Here, we will want to estimate power for:

different sample sizes (n)
different coin biases (bias)
different hypothesised probabilities (p, defaults to 0.5)

sim_binom_test <- function(n, bias, p = 0.5) {
  coin <- rbinom(1, n, bias)
  btest <- binom.test(coin, n, p)
  
  btest$p.value
}

Once you’ve created your function, test it a few times, changing the values.

sim_binom_test(100, 0.6)

## [1] 0.01203298

Calculate power

Then you can use the replicate() function to run it many times and save all the output values. You can calculate the power of your analysis by checking the proportion of your simulated analyses that have a p-value less than your alpha (the probability of rejecting the null hypothesis when the null hypothesis is true).

my_reps <- replicate(1e4, sim_binom_test(100, 0.6))

mean(my_reps < 0.05)

## [1] 0.4595

1e4 is just scientific notation for a 1 followed by 4 zeros (10000). When youre running simulations, you usually want to run a lot of them and it’s a pain to keep track of whether you’ve typed 5 or 6 zeros (100000 vs 1000000) and this will change your running time by an order of magnitude.

Normal Distribution

Sample distribution

rnorm(n, mean, sd)

We can simulate a normal distribution of size n if we know the mean and standard deviation (sd). A density plot is usually the best way to visualise this type of data if your n is large.

dv <- rnorm(1e5, 10, 2)

ggplot() +
  geom_density(aes(dv), fill = "white") +
  geom_vline(xintercept = mean(dv), color = "red") +
  geom_vline(xintercept = quantile(dv, .5 - (.6827/2)), color = "darkgreen") +
  geom_vline(xintercept = quantile(dv, .5 + (.6827/2)), color = "darkgreen") +
  geom_vline(xintercept = quantile(dv, .5 - (.9545/2)), color = "blue") +
  geom_vline(xintercept = quantile(dv, .5 + (.9545/2)), color = "blue") +
  geom_vline(xintercept = quantile(dv, .5 - (.9973/2)), color = "purple") +
  geom_vline(xintercept = quantile(dv, .5 + (.9973/2)), color = "purple") +
  scale_x_continuous(
    limits = c(0,20), 
    breaks = seq(0,20)
  )

Run the simulation above several times, noting how the density plot changes. What do the vertical lines represent? Try changing the values of n, mean, and sd.

T-test

t.test(x, y, alternative, mu, paired)

Use a t-test to compare the mean of one distribution to a null hypothesis (one-sample t-test), compare the means of two samples (independent-samples t-test), or compare pairs of values (paired-samples t-test).

You can run a one-sample t-test comparing the mean of your data to mu. Here is a simulated distribution with a mean of 0.5 and an SD of 1, creating an effect size of 0.5 SD when tested against a mu of 0. Run the simulation a few times to see how often the t-test returns a significant p-value (or run it in the shiny app).

sim_norm <- rnorm(100, 0.5, 1)
t.test(sim_norm, mu = 0)

## 
##  One Sample t-test
## 
## data:  sim_norm
## t = 3.9531, df = 99, p-value = 0.0001449
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.2048914 0.6178703
## sample estimates:
## mean of x 
## 0.4113809

Run an independent-samples t-test by comparing two lists of values.

a <- rnorm(100, 0.5, 1)
b <- rnorm(100, 0.7, 1)
t_ind <- t.test(a, b, paired = FALSE)
t_ind

## 
##  Welch Two Sample t-test
## 
## data:  a and b
## t = -3.4277, df = 197.94, p-value = 0.0007404
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6423706 -0.1731738
## sample estimates:
## mean of x mean of y 
## 0.3417206 0.7494927

The paired argument defaults to FALSE, but it’s good practice to always explicitly set it so you are never confused about what type of test you are performing.

Sampling function

We can use the names() function to find out the names of all the t.test parameters and use this to just get one type of data, like the test statistic (e.g., t-value).

names(t_ind)

## [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
## [6] "null.value"  "alternative" "method"      "data.name"

t_ind$statistic

##        t 
## -3.42771

Alternatively, use broom::tidy() to convert the output into a tidy table.

broom::tidy(t_ind)

If you want to run the simulation many times and record information each time, first you need to turn your simulation into a function.

sim_t_ind <- function(n, m1, sd1, m2, sd2) {
  v1 <- rnorm(n, m1, sd1)
  v2 <- rnorm(n, m2, sd2)
  t_ind <- t.test(v1, v2, paired = FALSE)
  
  return(t_ind$p.value)
}

Run it a few times to check that it gives you sensible values.

sim_t_ind(100, 0.7, 1, 0.5, 1)

## [1] 0.8879239

Now replicate the simulation 1000 times.

my_reps <- replicate(1e4, sim_t_ind(100, 0.7, 1, 0.5, 1))

alpha <- 0.05
power <- mean(my_reps < alpha)
power

## [1] 0.2882

Run the code above several times. How much does the power value fluctuate? How many replications do you need to run to get a reliable estimate of power?

Compare your power estimate from simluation to a power calculation using power.t.test(). Here, delta is the difference between m1 and m2 above.

power.t.test(n = 100, delta = 0.2, sd = 1, sig.level = alpha, type = "two.sample")

## 
##      Two-sample t test power calculation 
## 
##               n = 100
##           delta = 0.2
##              sd = 1
##       sig.level = 0.05
##           power = 0.2902664
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

You can plot the distribution of p-values.

ggplot() + 
  geom_histogram(
    aes(my_reps), 
    binwidth = 0.05, 
    boundary = 0,
    fill = "white", 
    color = "black"
  )

What do you think the distribution of p-values is when there is no effect (i.e., the means are identical)? Check this yourself.

Make sure the boundary argument is set to 0 for p-value histograms. See what happens with a null effect if boundary is not set.

Bivariate Normal

Correlation

You can test if two continuous variables are related to each other using the cor() function.

Below is a quick and dirty way to generate two correlated variables. x is drawn from a normal distribution, while y is the sum of x and another value drawn from a random normal distribution. We’ll learn later how to generate specific correlations in simulated data.

n <- 100 # number of random samples

x <- rnorm(n, 0, 1)
y <- x + rnorm(n, 0, 1)

cor(x, y)

## [1] 0.6836834

cor() defaults to Pearson’s correlations. Set the method argument to use Kendall or Spearman correlations.

cor(x, y, method = "spearman")

## [1] 0.569769

Sample distribution

What if we want to sample from a population with specific relationships between variables? We can sample from a bivariate normal distribution using the MASS package,

n <- 1000 # number of random samples
rho <- 0.5 # population correlation between the two variables

mu <- c(10, 20) # the means of the samples
stdevs <- c(5, 6) # the SDs of the samples

# correlation matrix
cor_mat <- matrix(c(  1, rho, 
                    rho,   1), 2) 

# create the covariance matrix
sigma <- (stdevs %*% t(stdevs)) * cor_mat

# sample from bivariate normal distribution
bvn <- mvrnorm(n, mu, sigma) 

cor(bvn) # check correlation matrix

##           [,1]      [,2]
## [1,] 1.0000000 0.4864211
## [2,] 0.4864211 1.0000000

Plot your sampled variables to check everything worked like you expect. You need to convert the output of mvnorm into a tibble in order to use it in ggplot.

bvn %>%
  as_tibble() %>%
  ggplot(aes(V1, V2)) +
    geom_point(alpha = 0.5) + 
    geom_smooth(method = "lm") +
    geom_density2d()

Multivariate Normal

Sample distribution

n <- 200 # number of random samples
rho1_2 <- 0.5 # correlation betwen v1 and v2
rho1_3 <- 0 # correlation betwen v1 and v3
rho2_3 <- 0.7 # correlation betwen v2 and v3

mu <- c(10, 20, 30) # the means of the samples
stdevs <- c(8, 9, 10) # the SDs of the samples

# correlation matrix
cor_mat <- matrix(c(     1, rho1_2, rho1_3, 
                    rho1_2,      1, rho2_3,
                    rho1_3, rho2_3,      1), 3) 

sigma <- (stdevs %*% t(stdevs)) * cor_mat
bvn3 <- mvrnorm(n, mu, sigma)

cor(bvn3) # check correlation matrix

##             [,1]      [,2]        [,3]
## [1,]  1.00000000 0.4757010 -0.05956941
## [2,]  0.47570101 1.0000000  0.67172523
## [3,] -0.05956941 0.6717252  1.00000000

3D Plots

You can use the plotly library to make a 3D graph.

library(plotly)

marker_style = list(
    color = "#ff0000", 
    line = list(
      color = "#444", 
      width = 1
    ), 
    opacity = 0.5,
    size = 5
  )

bvn3 %>%
  as_tibble() %>%
  plot_ly(x = ~V1, y = ~V2, z = ~V3, marker = marker_style) %>%
  add_markers()

Example

This example uses the Growth Chart Data Tables from the US CDC.

Load & wrangle

We have to do a little data wrangling first. Have a look at the data after you import it and relabel Sex to male and female instead of 1 and 2. Also convert Agemos (age in months) to years. Relabel the column 0 as mean and calculate a new column named sd as the difference between columns 1 and 0.

height_age <- read_csv("https://www.cdc.gov/growthcharts/data/zscore/zstatage.csv") %>%
  filter(Sex %in% c(1,2)) %>%
  mutate(
    sex = recode(Sex, "1" = "male", "2" = "female"),
    age = as.numeric(Agemos)/12,
    sd = `1` - `0`
  ) %>%
  dplyr::select(sex, age, mean = `0`, sd)

If you run the code above without putting dplyr:: before the select() function, you will get an error message. This is because the MASS package also has a function called select() and, since we loaded MASS after tidyverse, the MASS function is the default. When you loaded MASS, you should have seen a warning like “The following object is masked from ‘package:dplyr’: select”. You can use functions with the same name from different packages by specifying the package before the function name, separated by two colons.

Plot

Plot your new data frame to see how mean height changes with age for boys and girls.

ggplot(height_age, aes(age, mean, color = sex)) +
  geom_smooth(aes(ymin = mean - sd, ymax = mean + sd), stat="identity")

Get means and SDs

Create new variables for the means and SDs for 20-year-old men and women.

height_sub <- height_age %>% filter(age == 20)

m_mean <- height_sub %>% filter(sex == "male") %>% pull(mean)
m_sd   <- height_sub %>% filter(sex == "male") %>% pull(sd)
f_mean <- height_sub %>% filter(sex == "female") %>% pull(mean)
f_sd   <- height_sub %>% filter(sex == "female") %>% pull(sd)

height_sub

Simulate a population

Simulate 50 random male heights and 50 radom female heights using the rnorm() function and the means and SDs above. Plot the data.

sim_height <- tibble(
  male = rnorm(50, m_mean, m_sd),
  female = rnorm(50, f_mean, f_sd)
) %>%
  gather("sex", "height", male:female)

ggplot(sim_height) +
  geom_density(aes(height, fill = sex), alpha = 0.5) +
  xlim(125, 225)

Run the simulation above several times, noting how the density plot changes. Try changing the age you’re simulating.

Analyse simulated data

Use the sim_t_ind(n, m1, sd1, m2, sd2) function we created above to generate one simulation with a sample size of 50 in each group using the means and SDs of male and female 14-year-olds.

height_sub <- height_age %>% filter(age == 14)
m_mean <- height_sub %>% filter(sex == "male") %>% pull(mean)
m_sd   <- height_sub %>% filter(sex == "male") %>% pull(sd)
f_mean <- height_sub %>% filter(sex == "female") %>% pull(mean)
f_sd   <- height_sub %>% filter(sex == "female") %>% pull(sd)

sim_t_ind(50, m_mean, m_sd, f_mean, f_sd)

## [1] 0.3352524

Replicate simulation

Now replicate this 1e4 times using the replicate() function. This function will save the returned p-values in a list (my_reps). We can then check what proportion of those p-values are less than our alpha value. This is the power of our test.

my_reps <- replicate(1e4, sim_t_ind(50, m_mean, m_sd, f_mean, f_sd))

alpha <- 0.05
power <- mean(my_reps < alpha)
power

## [1] 0.6469

One-tailed prediction

This design has about 65% power to detect the sex difference in height (with a 2-tailed test). Modify the sim_t_ind function for a 1-tailed prediction.

You could just set alternative equal to “greater” in the function, but it might be better to add the alternative argument to your function (giving it the same default value as t.test) and change the value of alternative in the function to alternative.

sim_t_ind <- function(n, m1, sd1, m2, sd2, alternative = "two.sided") {
  v1 <- rnorm(n, m1, sd1)
  v2 <- rnorm(n, m2, sd2)
  t_ind <- t.test(v1, v2, paired = FALSE, alternative = alternative)
  
  return(t_ind$p.value)
}

my_reps <- replicate(1e4, sim_t_ind(50, m_mean, m_sd, f_mean, f_sd, "greater"))
mean(my_reps < alpha)

## [1] 0.761

Range of sample sizes

What if we want to find out what sample size will give us 80% power? We can try trial and error. We know the number should be slightly larger than 50. But you can search more systematically by repeating your power calculation for a range of sample sizes.

This might seem like overkill for a t-test, where you can easily look up sample size calculators online, but it is a valuable skill to learn for when your analyses become more complicated.

Start with a relatively low number of replications and/or more spread-out samples to estimate where you should be looking more specifically. Then you can repeat with a narrower/denser range of sample sizes and more iterations.

alpha <- 0.05
power_table <- tibble(
  n = seq(20, 100, by = 5)
) %>%
  mutate(power = map_dbl(n, function(n) {
    ps <- replicate(1e3, sim_t_ind(n, m_mean, m_sd, f_mean, f_sd, "greater"))
    mean(ps < alpha)
  }))

ggplot(power_table, aes(n, power)) +
  geom_smooth() +
  geom_point() +
  geom_hline(yintercept = 0.8)

Now we can narrow down our search to values around 55 (plus or minus 5) and increase the number of replications from 1e3 to 1e4.

power_table <- tibble(
  n = seq(50, 60)
) %>%
  mutate(power = map_dbl(n, function(n) {
    ps <- replicate(1e3, sim_t_ind(n, m_mean, m_sd, f_mean, f_sd, "greater"))
    mean(ps < alpha)
  }))

##ggplot(power_table, aes(n, power)) +
##  geom_smooth() +
##  geom_point() +
##  geom_hline(yintercept = 0.8) +
##  scale_x_continuous(breaks = sample_size)

Exercises

Generate and compare random data

A. Generate 50 data points from a normal distribution with a mean of 0 and SD of 1 (variable a).

n <- 50
a <- rnorm(n, 0, 1)

B. Generate another variable (b) that is equal to the sum of a and another 50 data points from a normal distribution with a mean of 0.5 and SD of 1.

b <- a + rnorm(n, 0.5, 1)

C. Run a paired-samples t-test comparing a and b.

t.test(a, b, paired = TRUE)

## 
##  Paired t-test
## 
## data:  a and b
## t = -4.1626, df = 49, p-value = 0.0001268
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.7892376 -0.2753051
## sample estimates:
## mean of the differences 
##              -0.5322713

Calculate power for a two-tailed t-test with an alpha of .05 for detecting a difference between two independent samples of 50 with an effect size of 0.3?

my_reps <- replicate(1e4, sim_t_ind(50, 0, 1, 0.3, 1))
mean(my_reps < 0.05)

## [1] 0.3202

power.t.test(n = 50, delta = 0.3, type = "two.sample")

## 
##      Two-sample t test power calculation 
## 
##               n = 50
##           delta = 0.3
##              sd = 1
##       sig.level = 0.05
##           power = 0.3175171
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

Modify the sim_t_ind function to handle different sample sizes. Use it to calculate the power of the following design:

20 observations from a normal distribution with mean of 10 and SD of 4

30 observations from a normal distribution with a mean of 13 and an SD of 4.5

sim_t_ind <- function(n1, m1, sd1, n2, m2, sd2, 
                       alternative = "two.sided") {
  v1 <- rnorm(n1, m1, sd1)
  v2 <- rnorm(n2, m2, sd2)
  t_ind <- t.test(v1, v2, 
                  alternative = alternative, 
                  paired = FALSE)

  return(t_ind$p.value)
}

my_reps <- replicate(1e4, sim_t_ind(20, 10, 4, 30, 13, 4.5))
mean(my_reps < 0.05)

## [1] 0.6695

Calculate power for a two-tailed t-test with an alpha of .005 for detecting a sex difference in height in a sample of 10 male and 10 female 20-year-olds?

height_sub <- height_age %>% filter(age == 20)
m_mean <- height_sub %>% filter(sex == "male") %>% pull(mean)
m_sd   <- height_sub %>% filter(sex == "male") %>% pull(sd)
f_mean <- height_sub %>% filter(sex == "female") %>% pull(mean)
f_sd   <- height_sub %>% filter(sex == "female") %>% pull(sd)

my_reps <- replicate(1e4, sim_t_ind(10, m_mean, m_sd, 10, f_mean, f_sd))
mean(my_reps < 0.005)

## [1] 0.8579

What is the poisson distribution? How do you sample from a poisson distribution?
The [poisson distribution(https://en.wikipedia.org/wiki/Poisson_distribution) is good for modeling the rate of something, like the number of texts you receive per day. Then you can test hypotheses like you receive more texts on weekends than weekdays. The poisson distribution gets more like a normal distribution when the rate gets higher, so it’s most useful for low-rate events.
```
lambda <- 4 # lambda sets the mean of the poisson distribution
tibble (
  x = rpois(1000, lambda)
)%>%
  ggplot(aes(x)) +
  geom_histogram(fill="white", color="black", binwidth = 1)
```
Demonstrate to yourself that the binomial distribution looks like the normal distribution when the number of trials is greater than 10.

You can calculate the equivalent mean for the normal distribution as the number of trials times the probability of success (my_mean <- trials * prob) and the equivalent SD as the square root of the mean times one minus probability of success (my_sd <- sqrt(mean * (1 - prob))).
```
n <- 1e5  # use a large n to get good estimates of the distributions
trials <- 50
prob <- 0.8
my_mean <- trials * prob
my_sd <- sqrt(my_mean * (1 - prob))

tibble (
  normal = rnorm(n, my_mean, my_sd),
  binomial = rbinom(n, trials, prob)
) %>%
  gather("distribution", "value", normal:binomial) %>%
  ggplot(aes(value, color=distribution)) +
  geom_freqpoly(binwidth = 1)
```
Correlation power simulation

A. Write a function to create a pair of variables of any size with any specified correlation. Hint: modify the code from the Bivariate Normal section above.
```
bvn2 <- function(n = 100, rho = 0, m1 = 0, m2 = 0, sd1 = 1, sd2 = 1) {
  # n = number of random samples
  # rho = population correlation between the two variables

  mu <- c(m1, m2) # the means of the samples
  stdevs <- c(sd1, sd2) # the SDs of the samples

  # correlation matrix
  cor_mat <- matrix(c(  1, rho, 
                      rho,   1), 2) 

  # create the covariance matrix
  sigma <- (stdevs %*% t(stdevs)) * cor_mat

  # sample from bivariate normal distribution
  mvrnorm(n, mu, sigma) %>% as.tibble() 
}
```
Notice that we converted the output of mvrnorm() to a tibble. This makes it easier to deal with these variables in further steps.
B. Use cor.test to test the Pearson’s product-moment correlation between two variables generated with your function having an n of 50 and a rho of 0.45.
```
my_vars <- bvn2(50, 0.45)

cor.test(my_vars$V1, my_vars$V2)
```
```
## 
##  Pearson's product-moment correlation
## 
## data:  my_vars$V1 and my_vars$V2
## t = 3.8792, df = 48, p-value = 0.0003188
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2432788 0.6750897
## sample estimates:
##       cor 
## 0.4885442
```
```
bvn2(50, 0.45) %$%
  cor.test(V1, V2) %>%
  broom::tidy()
```
Notice we used a new type of pipe (%$%), which allows you to use each column of a table in the next function by name.
C. Make a new function that calculates power for cor.test. Remember, there are many, many ways to do things in R. The solution below is only one of many. The important thing is to create your functin step-by-step, checking the accuracy at each step.
```
power.cor.test <- function(reps = 1000, n = 50, rho = 0.5, alpha = 0.05, method = "pearson"){
  power <- tibble(
    data = rerun(reps, bvn2(n, rho))
  ) %>%
    mutate(power = map(data, function(data) {
      data %$%
        cor.test(V1, V2, method = method) %>%
        broom::tidy()
    })) %>%
    unnest(power) %>%
    mutate(sig = p.value < alpha)

  tibble(
    n = n,
    rho = rho,
    alpha = alpha,
    power = mean(power$sig),
    method = method,
    reps = reps
  )
}

 power.cor.test(1000, 30, 0.5, method = "spearman")
```