Learning Objectives

Introduction to R/RStudio

  1. Understand the R console and the components of the RStudio IDE
  2. Use R as a calculator
  3. Create vectors and store them as variables
  4. Understand the basic data types (integer, double, character, factor)
  5. Understand vectorized operations
  6. Call functions and supply arguments

Developing an analysis in R/RStudio

  1. Appropriately structure an R script
  2. Create a simple data table
  3. Load data from CSV and Excel files
  4. Create and compile an Rmarkdown document

Prep

Class Notes

What is R?

A programming environment for data processing and statistical analysis

  • free and open-source
  • community supported
  • continually evolving
  • promotes reproducible research

Interacting with R

The Base R Console

R was developed almost two decades ago, and has a longer history as a derivative of the scripting language S, which was developed at Bell Labs in the 70s and 80s (S-PLUS is a commercial implementation of S). “Base R” consists of a “Read Evaluate Print Loop” (REPL) command interpreter, in which you type in text commands, which are evaluated, and the results of which are printed to the screen. This “R Console” window looks something like this.

The RStudio Integrated Development Environment (IDE)

However, when you are developing a script, you will want to work in a text editor and send commands to the console, rather than typing directly into the console. Developing an analysis script in R is essentially an exercise in programming, and for developing code it is best to use an Integrated Development Environment or IDE. An IDE provides additional functionality that wraps around the basic console.

The IDE that is highly recommended for this class is by RStudio (http://www.rstudio.com) and is depicted above. This IDE provides multiple windows in addition to the console that greatly facilitate developing code. In addition to the console (appearing as the bottom left window in the above figure), there is a script editor (top left), which provides syntax highlighting, autocompletion, and pop-up tool tips, a window showing functions and objects residing in memory for the session in the “Environment” tab (top right window in the figure), and a window that shows plots, files in the working directory, available add-on packages, and documentation (bottom right).

You will install both base R and RStudio, but will interact with R through the RStudio IDE. You will have icons for both RStudio and for a very primitive IDE called “R commander” which comes packaged with R. R commander is not as sophisticated or user-friendly as RStudio, so make sure you launch the RStudio IDE and not R commander by clicking on the correct icon. Launching RStudio will also launch the R console, so that is all you need to click.

ALWAYS REMEMBER: Launch R through the RStudio IDE

If you are an experienced programmer, you might want to consider using Emacs + ESS + Org Mode as an IDE instead of RStudio. (In fact, the document you are currently reading was written in emacs using org-mode markdown and exported to HTML). See this link if you want to go this advanced route.

Installing R

It is recommended that you at least attempt to install R and RStudio on your own workstation. In the long run, it will be better to have it on your own system, and moreover, it won’t cost you anything. However, you don’t have to be that ambitious. There are workstations in the Boyd Orr labs that have R/RStudio installed; additionally, library workstations may also have copies installed. The upside of using these workstations is that everything has been installed and tested. The downside (apart from the obvious one of not being able to take them home with you) is that you will have limited ability to configure the installation to your needs because you lack access privileges. There may be some packages that won’t install, and those that do install successfully will be wiped away after you log out. These annoyances can be avoided by having your own version.

Installing R and RStudio is very easy. The sections below explain how, but in case you find it confusing, there is a helpful YouTube video here.

Installing Base R

Install base R from one of the mirrors near you. They are listed at http://cran.r-project.org/mirrors.html. If a particular mirror is down, try another one. Once you have chosen a mirror, choose the download link for your operating system (Linux, Mac OS X, or Windows) and install ‘base’ binaries for distribution. If you are using Linux or Mac OS, you are done; skip to the next section on RStudio. If you are installing the Windows version, after you install R, you should also install RTools. Follow the link below (on the same page on the mirror where you downloaded base R) to RTools, and then click on a ‘frozen’ version nearest to the top of the list (Rtools33.exe at the time of writing, but there might be a later frozen version by the time you are reading this).

Installing RStudio

This is very easy: just go to https://www.rstudio.com/products/rstudio/download3/ and download the RStudio Desktop (Free License) version for your operating system.

Additional tweaks you might want to try

Although installing R and RStudio is itself very easy, there is an additional optional tweak that may not be so easy but you might want to try it. This is installing the LaTeX typesetting system so that you can produce PDF reports from RStudio. Without this additional tweak, you will be able to produce reports in HTML but not PDF. To generate PDF reports, you will additionally need:

  1. pandoc, and
  2. LaTeX, a typesetting language, available through

Again, these additional tweaks are optional, and if you have problems installing these, don’t get hung up on it; you can just generate HTML reports, and if you want a PDF, just use one of the Boyd Orr computers.

Developing Reproducible Scripts

Here is what an R script looks like. Don’t worry about the details for now.

# load add-on packages
library(tidyverse)

# define custom functions
cumulativeToTarget <- function(x) {
    sessID <- x$SessionID[1]
    # etc... do some other stuff
    return(res)
}

## SCRIPT BEGINS HERE
load(file = "pog.RData")

pog2 <- pog %>% filter(ms >= -200 & ms <= 1000) %>%
  filter(FrameID <= 600) %>% 
  select(-ms) %>%
  do(cumulativeToTarget(.)) %>% 
  ungroup %>%
  mutate(ms = (FrameID-1) * 2 - 200, ID = factor(ID))

save(pog2, file = "pog2.RData")

All scripts will have the following structure:

  • load in any add-on packages you need to use
  • define any custom functions
  • load in the data you will be working with
  • work with the data
  • save anything you need to save

It’s best if you follow the above convention when developing your own scripts.

Configure RStudio for Maximum Reproducibility

In this class, you will be learning how to develop reproducible scripts. This means scripts that completely and transparently perform some analysis from start to finish in a way that yields the same result for different people using the same software on different computers. And transparency is a key value of science, as embodied in the “trust but verify” motto. When you do things reproducibly, others can understand and check your work. This benefits science, but there is a selfish reason, too: the most important person who will benefit from a reproducible script is your future self. When you return to an analysis after two weeks of vacation, you will thank your earlier self for doing things in a transparent, reproducible way, as you can easily pick up right where you left off.

There are two tweaks that you should make to your RStudio installation to maximize reproducibility. First, go to the settings menu and uncheck the box that says “Restore .RData into workspace at startup”. If you keep things around in your workspace, things will get messy, and unexpected things will happen. You should always start with a clear workspace. Second, because you never want to save your workspace when you exit, set the option for saving the workspace on exit to “Never”. The only things you want to save are your scripts.

Reproducible reports with RStudio and RMarkdown

We will be working toward producing reproducible reports following the principles of “literate programming”. The basic idea is to have the text of the report together in a single document along with the R code needed to perform all analyses and generate the tables. The report is then ’compiled’ from the original format into some other, more portable format, such as HTML or PDF. This is different from traditional cutting and pasting approaches where, for instance, you create a graph in Microsoft Excel or a statistics program like SPSS and then paste it into Microsoft Word.

We will be using RMarkdown to create reproducible reports, which enables interleaving text with R code blocks.

You can read more about Donald Knuth’s idea about literate programming at this Wikipedia page, and about the RMarkdown format here.

A reproducible script will contain sections of code in code blocks. A code block is delimited using three backtick symbols in a row, like so:

This is just some text before the code block

```{r blockname}
# now we are inside the R code block
rnorm(10)  # generate some random numbers
```

now we're back outside the code block

If you open up a new RMarkdown file from a template, you will see an example document with several code blocks in it.

To create an HTML or PDF report from an rmarkdown (rmd) document, you compile it. Compiling a document is called ’knitting’ in RStudio. There is a button that looks like a ball of yarn with needles through it that you click on to compile your file into a report. Try it with the template file and see what happens!

Typing in commands

We are first going to learn about how to interact with the console. In general, you will be developing R scripts or R markdown files, rather than working directly in the console window. However, you can consider the console a kind of ’sandbox’ where you can try out lines of code and adapt them until you get them to do what you want. Then you can copy them back into the script editor.

Mostly, however, you will be typing into the script editor window (either into an R script or an RMarkdown file) and then sending the commands to the console by placing the cursor on the line and holding down the Ctrl key while you press Enter. The Ctrl+Enter key sequence sends the command in the script to the console.

Warming up: Use R as a calculator

One simple way to learn about the R console is to use it as a calculator. Enter the lines of code below and see if your results match. Be prepared to make lots of typos (at first) :/

## REPL: Read/Evaluate/Print Loop
## R prints results back at you
1 + 1
## [1] 2

The R console remembers a history of the commands you typed in the past. Use the up and down arrow keys on your keyboard to scroll backwards and forwards through your history. It’s a lot faster than re-typing.

1 + 1 + 3
## [1] 5

You can break up math expressions over multiple lines; R waits for a complete expression before processing it.

## here comes a long expression
## let's break it over multiple lines
1 + 2 + 3 + 4 + 5 + 6 +
    7 + 8 + 9 +
    10
## [1] 55
"Good afternoon"
## [1] "Good afternoon"

You can break up text over multiple lines; R waits for a close quote before processing it.

"There is nothing in the world 
that makes people so unhappy as fear.  
The misfortune that befalls us is 
seldom, or never, as bad as that 
which we fear.

- Friedrich Schiller"
## [1] "There is nothing in the world \nthat makes people so unhappy as fear.  \nThe misfortune that befalls us is \nseldom, or never, as bad as that \nwhich we fear.\n\n- Friedrich Schiller"

You can add comments to an R script with the ’#’ symbol. The R interpreter will ignore characters from the # symbol to the end of the line.

## comments: any text from '#' on is ignored until end of line
22 / 7  # approximation to pi
## [1] 3.142857

Storing results in a variable

Often you want to store the result of some computation for later use. You can store it in a variable. There are some important things to consider when naming your variables.

  • capitalization matters (myVar is different from myvar)
  • don’t use spaces or special characters (^&"'*+?) etc.; use the ’_’ where you would use a space (e.g., my_var is a legal variable name)
  • must begin with a letter (m2 is a valid name, but 2m is not)

Use the assignment operator <- to assign the value on the right to the variable named on the left.

## use the assignment operator '<-'
## R stores the number in the variable
x <- 5

Now that we have set x to a value, we can do something with it:

x * 2
## [1] 10
## R evaluates the expression and stores the result in the variable
boring_calculation <- 2 + 2

Note that it doesn’t print the result back at you when it’s stored. To view the result, just type the variable name on a blank line.

boring_calculation
## [1] 4
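The naming rules can be tried out directly in the console. The sketch below uses made-up variable names (myVar, myvar, my_var) purely for illustration; the last assignment is left as a comment because a name starting with a digit is a syntax error.

```r
# capitalization matters: these are two different variables
myVar <- 1
myvar <- 2
myVar == myvar
## [1] FALSE

# underscores are fine where you would otherwise want a space
my_var <- myVar + myvar
my_var
## [1] 3

# a name may not begin with a digit; uncommenting the next line gives a syntax error
# 2m <- 5
```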

Whitespace

# R waits until next line for evaluation
(3 + 2) *
     5
## [1] 25
# often useful to spread function arguments over multiple lines
library(cowsay)
say("This function call is far too wide to fit all on one line",
    "stretchycat")

When you see > at the beginning of a line, that means R is waiting for you to start a new command. However, if you see a + instead of > at the start of the line, that means R is waiting for you to finish a command you started on a previous line. If you want to cancel whatever command you started, just press the Esc key in the console window and you’ll get back to the > command prompt.

The workspace

Anytime you assign something to a new variable, R creates a new object in your workspace. Objects in your workspace exist until you end your session; then they disappear forever (unless you save them).

ls()  # print the objects in the workspace
##  [1] "boring_calculation" "dat"                "dsets"             
##  [4] "g"                  "madlibs"            "mod"               
##  [7] "n_runs"             "nsig"               "pval"              
## [10] "pvals"              "run_anova"          "sim_data"          
## [13] "sim_power"          "x"                  "xy_over_z"
rm("x")   # remove the object named x from the workspace

rm(list = ls()) # clear out the workspace

Vectors

One of the most fundamental data types in R is the vector. A vector in R is like a vector in math: a set of ordered elements. All of the elements in a vector must be of the same data type (numeric, character, factor). You can create a vector by enclosing the elements in c(...), as shown below.

## put information into a vector using c(...)
c(1, 2, 3)
## [1] 1 2 3
c("this", "is", "cool")
## [1] "this" "is"   "cool"
## what happens when you mix types?
c(2, "good", 2, "b", "true")
## [1] "2"    "good" "2"    "b"    "true"
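One way to confirm what happened in the last example is to ask R for the class of the result. This is a small sketch using class(); the variable name mixed is only for illustration.

```r
mixed <- c(2, "good", 2, "b", "true")
class(mixed)        # the numbers were converted to text
## [1] "character"

class(c(1, 2, 3))   # all numbers, so the vector stays numeric
## [1] "numeric"
```

Because every element of a vector must share one type, R silently converts the numbers to character strings rather than raising an error.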

Vectorized Operations

R performs calculations on vectors in a special way. Let’s look at an example using \(z\)-scores. A \(z\)-score is a deviation score (a score minus a mean) divided by a standard deviation. You will learn more about these concepts later in the course. Let’s say we have a set of four IQ scores.

## example IQ scores: mu = 100, sigma = 15
iq <- c(86, 101, 127, 99)

If we want to subtract the mean from these four scores, we just use the following code:

iq - 100
## [1] -14   1  27  -1

This subtracts 100 from each element of the vector. R automatically assumes that this is what you wanted to do; it is called a vectorized operation and it makes it possible to express operations more efficiently.

To calculate \(z\)-scores we use the formula:

\(z = \frac{X - \mu}{\sigma}\)

where X are the scores, \(\mu\) is the mean, and \(\sigma\) is the standard deviation. We can express this formula in R as follows:

## z-scores
(iq - 100) / 15
## [1] -0.93333333  0.06666667  1.80000000 -0.06666667

You can see that it computed all four \(z\)-scores with a single line of code. Very efficient!
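The example above plugs in known population values (mean 100, standard deviation 15). As a variation, and only as a sketch, the same vectorized pattern works if you instead standardize against the mean and standard deviation of the four scores themselves, using the built-in mean() and sd() functions. The name iq_z_sample is chosen for illustration.

```r
iq <- c(86, 101, 127, 99)

# subtract the sample mean from every score,
# then divide every result by the sample standard deviation
iq_z_sample <- (iq - mean(iq)) / sd(iq)
iq_z_sample

# scores standardized this way have mean 0 and sd 1 (up to rounding error)
round(mean(iq_z_sample), 10)
sd(iq_z_sample)
```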

Add-on packages

One of the great things about R is that it is user extensible: anyone can create a new add-on software package that extends its functionality. There are currently thousands of add-on packages that R users have created to solve many different kinds of problems, or just simply to have fun. There are packages for data visualisation, machine learning, neuroimaging, eyetracking, web scraping, and playing games such as Sudoku.

Add-on packages are not distributed with base R, but have to be downloaded and installed from an archive, in the same way that you would, for instance, download and install a fitness app on your smartphone.

The main repository where packages reside is called CRAN, the Comprehensive R Archive Network. A package has to pass strict tests devised by the R core team to be allowed to be part of the CRAN archive. You can install from the CRAN archive through R using the install.packages() function.

There is an important distinction between installing a package and loading a package.

  • Installing a package is done using install.packages(). This is like installing an app on your smartphone: you only have to do it once and the app will remain installed until you remove it. For instance, if you want to use Facebook on your phone you install it once from the App Store or Play Store, and you don’t have to re-install it each time you want to use it. Once you launch the app, it will run in the background until you close it or restart your phone. Likewise, when you install a package, the package will be available (but not loaded) every time you open up R.

  • Loading a package: This is done using library(packagename). This is like launching an app on your phone: the functionality is only there when the app is launched and remains there until you close the app or restart. Likewise, when you run library(packagename) within a session, the functionality of the package referred to by packagename will be made available for your R session. The next time you start R, you will need to run the library() function again if you want to access its functionality.

You may only be able to permanently install packages if you are using R on your own system; you may not be able to do this on public workstations because you will lack the appropriate privileges.

Try installing the package fortunes on your system:

install.packages("fortunes")

If you don’t get an error message, the installation was successful.
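The difference between installed and loaded can also be checked from the console. The sketch below uses requireNamespace(), a base R function that reports whether a package can be found on your system; is_installed is a name made up for this example.

```r
# TRUE if 'fortunes' is installed on this machine, FALSE if not
is_installed <- requireNamespace("fortunes", quietly = TRUE)
is_installed

# attach the package for this session only if it is actually there
if (is_installed) {
  library(fortunes)
}
```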

You can then access the functionality of fortunes for your current R session as follows:

library(fortunes)

Once you have typed this, you can run the function fortune(), which spouts random wisdom from one of the R help lists:

fortune()
## 
## My best advice regarding R^2 statistics with nonlinear models is, as Nancy
## Reagan suggested, "Just say no.".
##    -- Douglas Bates
##       R-help (August 2000)

Note that we will use the convention package::function() and package::object to indicate in which add-on package a function or object resides. For instance, if you see readr::read_csv(), that refers to the function read_csv() in the readr add-on package. If you see a function introduced without a package name, that means it is part of the base R system and not an add-on package (depending on the context). Sometimes I will make this explicit by using base in the place of the package name; for instance, I might refer to round() in base as base::round(). (Some functions that ship with R live in other bundled packages; rnorm(), for example, is in the stats package, so its full name is stats::rnorm().)
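As a quick sketch of this notation using only packages that ship with R (so nothing needs installing), the calls below name the package explicitly. The values passed in are arbitrary.

```r
# round() lives in the base package
base::round(3.14159, 2)
## [1] 3.14

# median() lives in the stats package, which is also distributed with R
stats::median(c(86, 101, 127, 99))
## [1] 100
```

Writing package::function() also lets you call a function from an installed package without first loading that package with library().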

Getting help

# these methods are all equivalent ways of getting help
help("say") # if package 'cowsay' is loaded
?say
help("say", package="cowsay") # if cowsay not loaded

??say # search for help files with "say"

# start up help in a browser
help.start()

Working with files

Working Directory

When developing an analysis, you usually want to have all of your scripts and data files in one subtree of your computer’s directory structure. Usually there is a single working directory where your data and scripts are stored.

  • All references to data files in your scripts will be relative to the top level of this directory tree; always use relative paths, and never use absolute paths.

  • Never set or change your working directory in a script; always store your main script file in the top-level directory and manually set your working directory to that location.

For instance, if on a Windows machine your data and scripts live in the directory C:\Carla's_files\thesis22\my_thesis\new_analysis, you will set your working directory to new_analysis in one of two ways: (1) by going to the Session pull down menu in RStudio and choosing Set Working Directory, or (2) by typing setwd("C:/Carla's_files/thesis22/my_thesis/new_analysis") in the console window (note the forward slashes; backslashes inside an R string are treated as escape characters).

Never put the setwd() command in your script, because others will not have the same directory tree as you (and when your laptop dies and you get a new one, neither will you).

If your script needs a file in a subdirectory of new_analysis, say, analysis2/dat.rds, load it in using a relative path:

dat <- readRDS("analysis2/dat.rds")  # right way

Do not load it in using an absolute path:

dat <- readRDS("C:/Carla's_files/thesis22/my_thesis/new_analysis/analysis2/dat.rds")   # wrong

Also note the convention of using forward slashes, unlike the Windows specific convention of using backward slashes. This is to make references to files platform independent.
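Here is a small sketch of building a relative path in a platform-independent way. file.path() and file.exists() are base R functions; the directory and file names are the hypothetical ones from the example above, so file.exists() will return FALSE unless you happen to have such a file.

```r
# where R currently thinks you are
getwd()

# file.path() joins the pieces with forward slashes on every platform
rel_path <- file.path("analysis2", "dat.rds")
rel_path
## [1] "analysis2/dat.rds"

# check that the file is there before trying to read it
file.exists(rel_path)
```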

Loading Data

There are many different types of files that you might work with when doing data analysis. These different file types are usually distinguished by the three letter extension following a period at the end of the file name. Here are some examples of different types of files and the functions you would use to read them in or write them out.

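| Extension   | File Type              | Reading              | Writing            |
|-------------|------------------------|----------------------|--------------------|
| .csv        | Comma-separated values | readr::read_csv()    | readr::write_csv() |
| .xls, .xlsx | Excel workbook         | readxl::read_excel() | N/A                |
| .rds        | R binary file          | readRDS()            | saveRDS()          |
| .RData      | R binary file          | load()               | save()             |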

Note: following the conventions introduced above in the section about add-on packages, readr::read_csv() refers to the read_csv() function in the readr package, and readxl::read_excel() refers to the function read_excel() in the package readxl.
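To see one reading/writing pair in action, here is a minimal sketch of a round trip through an .rds file. It uses a tiny made-up table and a temporary file from tempfile() so that nothing is written to your working directory; the names dat, dat2, and tmp are chosen for the example.

```r
# a tiny made-up table, just for illustration
dat <- data.frame(id = 1:3, score = c(86, 101, 127))

tmp <- tempfile(fileext = ".rds")  # a temporary file name
saveRDS(dat, tmp)                  # write the object out
dat2 <- readRDS(tmp)               # read it back in

identical(dat, dat2)
## [1] TRUE
```

Unlike load(), readRDS() returns the object, so you choose the variable name it is stored under when you read it in.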

Probably the most common file type you will encounter is .csv (comma-separated values). As the name suggests, a CSV file distinguishes which values go with which variable by separating them with commas, and text values are sometimes enclosed in double quotes. The first line of a file usually provides the names of the variables. For example, here are the first few lines of a CSV containing Scottish baby names (see the page at National Records Scotland):

yr,sex,FirstForename,number,rank,position
1974,B,David,1794,1,1
1974,B,John,1528,2,2
1974,B,Paul,1260,3,3
1974,B,Mark,1234,4,4
1974,B,James,1202,5,5
1974,B,Andrew,1067,6,6
1974,B,Scott,1060,7,7
1974,B,Steven,1020,8,8
1974,B,Robert,885,9,9
1974,B,Stephen,866,10,10

There are six variables in this dataset, and their names are given in the first line of the file: yr, sex, FirstForename, number, rank, and position. You can see that the values for each of these variables are given in order, separated by commas, on each subsequent line of the file.

When you read in CSV files, it is best practice to use the readr::read_csv() function. The readr package is automatically loaded as part of the tidyverse package, which we will be using in almost every script. Note that you would normally want to store the result of the read_csv() function to a variable, like so:

library(tidyverse)
dat <- read_csv("my_data_file.csv")

Once loaded, you can view your data using the data viewer. In the upper right hand window of RStudio, under the Environment tab, you will see the object dat listed.

If you click on the View icon, it will bring up a table view of the data you loaded in the top left pane of RStudio.

This allows you to check that the data have been loaded in properly. You can close the tab when you’re done looking at it—it won’t remove the object.

Writing Data

If you have data that you want to save to a CSV file, use readr::write_csv(), as follows.

write_csv(dat, "my_data_file2.csv")

This will save the data in CSV format to your working directory.

Calling functions

R has a lot of built-in functions that are useful, like round() for rounding numbers, and sort() for sorting them. Here are some examples of how to use these functions.

iq_z <- (iq - 100) / 15

sort(iq_z)
## [1] -0.93333333 -0.06666667  0.06666667  1.80000000
round(iq_z, 2)
## [1] -0.93  0.07  1.80 -0.07

If we want to sort the scores before rounding them, we can embed the sort(iq_z) function call into the first argument of round().

round(sort(iq_z), 2)
## [1] -0.93 -0.07  0.07  1.80

Function syntax

Functions have the following generic syntax:

functionname(arg1, arg2, arg3, ...)

Each function has named arguments which may or may not have default values. Arguments without default values are mandatory; arguments with default values are optional. If an optional argument is not specified, it will take on the default value. You can override default values by supplying your own.

Arguments can be specified by:

  • position (unnamed)
  • name

Most functions return a value, but may also produce ’side effects’ like printing to the console.

To illustrate, the function rnorm() generates random numbers from the standard normal distribution. The help page for rnorm() (accessed by typing ?rnorm in the console) shows that it has the syntax

rnorm(n, mean = 0, sd = 1)

where n is the number of randomly generated numbers you want, mean is the mean of the distribution, and sd is the standard deviation. The default mean is 0, and the default standard deviation is 1. There is no default for n which means you’ll get an error if you don’t specify it:

rnorm()

Error in rnorm() : argument "n" is missing, with no default

If you want 10 random numbers from a distribution with a mean of 0 and a standard deviation of 1, you can just use the defaults.

rnorm(10)
##  [1]  0.5567344 -0.2724490 -1.8576433  0.9593917 -0.7632288 -0.1959521
##  [7]  0.6607789 -0.7162494  1.2624202  0.2645942

If you want 10 numbers from a distribution with a mean of 100:

rnorm(10, 100)
##  [1]  99.75245 101.30849  98.28152  99.67349  97.76897 101.85810 101.04115
##  [8]  99.49127  99.95403 100.25107

This would be an equivalent but more verbose way of calling the function:

rnorm(n = 10, mean = 100)
##  [1] 100.65942 100.12341 100.19610 101.73296  99.19302 101.17099 100.02426
##  [8] 100.49153  99.10591  99.45094

We don’t need to name the arguments because R will recognize that we intended to fill in the first and second arguments by their position in the function call. However, if we want to change the default for an argument coming later in the list, then we need to name it. For instance, if we wanted to keep the default mean = 0 but change the standard deviation to 100 we would do it this way:

rnorm(10, sd = 100)
##  [1]   12.167649  140.066700   77.636256 -104.272579   -5.454245
##  [6]   41.408819   69.634572  -42.626199  133.713950  135.423620
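Because rnorm() returns different numbers each time, the outputs above cannot show directly that the positional and named calls are equivalent. One way to check, sketched here, is to reset the random number generator with set.seed() before each call. The seed value 1 is arbitrary, and the variable names are made up for the example.

```r
set.seed(1)
by_position <- rnorm(10, 100)

set.seed(1)
by_name <- rnorm(n = 10, mean = 100)

# named arguments can even be given out of order
set.seed(1)
reordered <- rnorm(mean = 100, n = 10)

identical(by_position, by_name)
## [1] TRUE
identical(by_position, reordered)
## [1] TRUE
```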

Pipes

Pipes (%>%) are very useful for stringing together a sequence of commands in R. They might be a bit confusing at first but they are worth learning because they will make your code more readable and efficient. Because pipes are a relatively recent innovation, the %>% operator is not part of base R. That means you need to load an add-on package to use it. Although the “home” package of the pipe operator is a package called magrittr, more commonly you will gain access to it by loading the tidyverse package (library("tidyverse")). If you get either of the following errors in your script:

Error: unexpected SPECIAL in "%>%"
or
Error: could not find function "%>%"

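you probably tried to use %>% before doing library("tidyverse") (the second error), or began a line with %>% instead of ending the previous line with it (the first error, which is a syntax error).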

It is easiest to understand how to use pipes through an example. Let’s say that we want to sample 5 random integers between 1 and 10 (with replacement), figure out which unique numbers were sampled, and then sort them in descending order. We will need to call three functions in a sequence: sample() to generate the integers, unique() to figure out which unique integers were sampled (because the same integer may have been sampled multiple times), and then sort() with decreasing = TRUE to put them in descending order. So we might write code like this:

x <- sample(1:10, 5, replace = TRUE)
y <- unique(x)
sort(y, TRUE) # set second argument to 'TRUE' so that sort order is descending
## [1] 10  8  6  4

While there is nothing wrong with this code, it required us to define variables x and y which we won’t ever need again, and which clutter up our environment. To avoid this you could rewrite this code using nested function calls like so:

sort(unique(sample(1:10, 5, replace = TRUE)), TRUE)
## [1] 9 8 5 4

(If the above call looks confusing, it should!) The call to sample() is embedded within a call to unique() which in turn is embedded within a call to sort(). The functions are executed from most embedded (the “bottom”) to least embedded (the “top”), starting with the function sample(), whose result is then passed in as the first argument to unique(), whose result in turn is passed in as the first argument to sort(); notice the second argument of sort (TRUE) is all the way at the end of the statement, making it hard to figure out which of the three functions it belongs to. We read from left to right; however, understanding this code requires us to work our way from right to left, which is unnatural. Moreover it is simply an ugly line of code.

This is where pipes come in. You can re-write the original code using pipes like so:

sample(1:10, 5, replace = TRUE) %>% 
  unique() %>% 
  sort(TRUE)
## [1] 10  9  7  4  2

R will calculate the result of sample(1:10, 5, replace = TRUE) and then pass this result as the first argument of unique(); then, the result of unique() will in turn be passed along as the first argument of sort() with the second argument set to TRUE. The thing to note here is that for any function call on the right hand side of a pipe, you should omit the first argument and start with the second, because the pipe automatically places the result of the call on the left in that spot.
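To convince yourself that the piped and nested versions do the same thing, you can run both on the same sample. The sketch below makes two assumptions that are not part of the notes above: it fixes the random seed with set.seed() (the value 42 is arbitrary), and it uses the native pipe |>, available in R 4.1.0 and later, so that it runs without loading any package. With the tidyverse loaded, %>% can be substituted for |> here.

```r
set.seed(42)
draws <- sample(1:10, 5, replace = TRUE)

# nested version: read from the inside out
nested <- sort(unique(draws), TRUE)

# piped version: read from top to bottom
piped <- draws |>
  unique() |>
  sort(decreasing = TRUE)

identical(nested, piped)
## [1] TRUE
```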

Some basic data types

Below is a list of different data types in R.

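| type              | description                         | example                |
|-------------------|-------------------------------------|------------------------|
| double            | floating point value                | .333337                |
| integer           | integer                             | -1, 0, 1               |
| numeric           | any real number (integer or double) | 1, .5, -.222           |
| logical (boolean) | assertion of truth/falsity          | TRUE, FALSE            |
| character         | text string                         | "hello world", 'howdy' |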

Wonder what type a particular variable is? Use class() to find out.
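A short sketch of class() applied to each of the types listed above, together with the related base R function typeof(), which reports how the value is stored internally. The values are arbitrary.

```r
class(1L)              # the L suffix asks for an integer
## [1] "integer"
class(1)               # plain numbers are stored as doubles, reported as "numeric"
## [1] "numeric"
typeof(1)
## [1] "double"
class(TRUE)
## [1] "logical"
class("hello world")
## [1] "character"
```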

Container types

Here are some examples of different container types in R. Container types are structures that hold values. You will learn more about these as we go along.

Vector

  • Defining
x <- 1 # is a vector w/one element

v1 <- 7:12 # nums from 7 to 12
v2 <- c(first=1, second=2, third=3)
v3 <- c(a="one", b="two", c="three")
mode(v3)
length(v1)
names(v2)
  • Accessing
v1[4]
v2["second"]

v1[c(1:3,5)]
v2[c("second","third")]
v3[c(TRUE, FALSE, TRUE)]

List

  • Defining
albums <- 
  list(
    Michael.Jackson = c(
      "Off the Wall",
      "Thriller",
      "Bad",
      "Dangerous"
    ),
    Nirvana = c(
      "Bleach",
      "Nevermind",
      "In Utero"
    )
  )  
names(albums)
length(albums)
  • Accessing
albums[[1]]
albums[["Nirvana"]]
albums[c(2, 1)]
albums[c(TRUE, FALSE)]
  • Operations
# lapply is like a 'for' loop
lapply(albums, length) # apply function "length" to each element
lapply(albums, sample)
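lapply() always returns a list. When each result is a single value, sapply(), a close relative in base R, simplifies the result to a named vector, which is often easier to read. A minimal sketch, repeating the albums definition so that it runs on its own (n_albums is a name made up for the example):

```r
albums <- list(
  Michael.Jackson = c("Off the Wall", "Thriller", "Bad", "Dangerous"),
  Nirvana = c("Bleach", "Nevermind", "In Utero")
)

# one count per list element, returned as a named integer vector
# (4 for Michael.Jackson, 3 for Nirvana)
n_albums <- sapply(albums, length)
n_albums

# the names carry over, so you can index by artist
n_albums["Nirvana"]
```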

Matrix

  • Definition
mx <- matrix(1:6, nrow=2)
mx2 <- matrix(albums[[1]], ncol=2)
  • Accessing
mx[1,3] # element 1,3
mx[2,] # second row
mx[,3] # third column
  • Operations
mx + 1 # vectorized
mx^2 
t(mx) # transpose rows and columns
c(mx) # convert back to a vector
apply(mx, 1, sum)
apply(mx, 2, sum)
  • Elements must all be of the same type (numeric or character)
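Since the matrix code above is shown without output, here is a sketch of what the definition and the two apply() calls produce. matrix() fills column by column, which is why the first row is 1, 3, 5.

```r
mx <- matrix(1:6, nrow = 2)
mx
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

apply(mx, 1, sum)   # 1 = apply the function over rows
## [1]  9 12
apply(mx, 2, sum)   # 2 = apply the function over columns
## [1]  3  7 11
```

For sums specifically, base R also provides rowSums() and colSums(), which give the same totals.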

Data.frame

  • Define
x <- data.frame(
  ID = 1:3,
  Month = c("Jan","Feb","Mar")
)
# note: Month is a factor in R < 4.0.0,
# but a character vector in R >= 4.0.0
# internal representation as list
is.list(x)
is.data.frame(x)
nrow(x)               
colnames(x) # or just names(x)
# rows can also have names
# (not recommended)
x$Abbr <- c("J","F","M") # new column
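  • Accessing
x[1,]               # first row
x[,2]               # second column, returned as a vector
x[,c("ID","Month")] # columns selected by name
x$Month             # a single column by name
x[1,2]              # row 1, column 2
subset(x, ID==2)    # rows where ID equals 2
x[x$Month=="Mar",]  # rows where Month is "Mar"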
  • Operations

    You’ll learn about data frame operations in the tidyr and dplyr lessons.
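Returning to the Month column defined above: whether text columns become factors depends on the stringsAsFactors argument of data.frame(), whose default changed from TRUE to FALSE in R 4.0.0. The sketch below shows both behaviours; x_default and x_factor are names chosen just for this example.

```r
x_default <- data.frame(ID = 1:3, Month = c("Jan", "Feb", "Mar"))
class(x_default$Month)  # "character" in R >= 4.0.0, "factor" in older versions

# Ask for factors explicitly
x_factor <- data.frame(ID = 1:3, Month = c("Jan", "Feb", "Mar"),
                       stringsAsFactors = TRUE)
class(x_factor$Month)   # "factor"
levels(x_factor$Month)  # levels are sorted alphabetically: "Feb" "Jan" "Mar"
```

Setting stringsAsFactors explicitly makes a script behave the same way regardless of the R version it runs on.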

Exercises

Calling functions and supplying arguments

We will be working with the cowsay add-on package. For an overview of its functions, run help(package = "cowsay").

Check to see if there are any vignettes available for this package.

vignette(package = "cowsay")

Load in and read the vignette to get an idea of how the package works.

vignette("cowsay_tutorial", package = "cowsay")

Your first task is to develop a reproducible script that accomplishes the tasks below. Compile the RMarkdown (rmd) document into HTML. Make sure the report includes the code in addition to the output.

Important! Try to perform each task with the shortest function call you can, taking advantage of the function defaults, and include the results in an R script.
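If you are unsure how defaults shorten a call, here is a minimal sketch using a made-up function. announce() is hypothetical and not part of cowsay; it only mimics the idea of a function with two arguments that both have defaults.

```r
# 'announce' is a hypothetical function used only to illustrate defaults
announce <- function(what = "Hello world!", by = "cat") {
  paste0(by, " says: ", what)
}

announce()                    # both defaults used
announce("FEED ME")           # 'what' matched by position, 'by' keeps its default
announce("FEED ME", "shark")  # both arguments matched by position
announce(by = "shark")        # a named argument lets you skip 'what'
```

Check the help page of a function to see which arguments have defaults and in what order they appear.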

  1. Make a cat say, “FEED ME”

    library(cowsay) # load the package once, at the top of the script
    
    say("FEED ME") # "cat" is the default creature
  2. Make a shark say “Hello world!”

    say("Hello world!", "shark")
  3. Make anything produce a famous quote

    say("If you want to know what God thinks of money, just look at the people he gave it to. ~Dorothy Parker", 
        "grumpycat")
  4. Make clippy warn the user about the impending apocalypse

    say("It looks like you are trying to annihilate the planet with a particle beam. Are you sure you want to do this?", "clippy")
  5. Make a cat produce a random quote from an R coder. You should get a different quote every time you run the code (hint: read the documentation for cowsay::say()).

    say("fortune")
  6. Define a variable creature and assign to it the value of one of the types of creatures accepted by the say() function. Then use the variable to output the current time.

    creature <- "spider"
    
    say(base::date(), creature)
  7. Change the value of the variable creature to a different creature, and make it display the time.

    creature <- "buffalo"
    
    say(base::date(), creature)
  8. Restart R and re-run the script to check whether it is reproducible.

  9. Advanced: Create an RMarkdown file that includes each answer below its question heading (questions 1-7 only), and compile it to HTML.

Tabular data: Creating tibbles and data import

  1. Create a tibble with the name, age, and sex of 3-5 friends or family members.

    What are three ways to look at the data and table setup?

    library(tidyverse) # provides read_csv(), tibble(), and glimpse()
    
    # you can do this with an inline csv file using read_csv
    family <- read_csv("name,  age, sex
                        Lisa,   40, female
                        Ben,    41, male
                        Robbie, 10, male")
    
    # or you can do this with the tibble function
    family <- tibble(name = c("Lisa", "Ben", "Robbie"),
                     age = c(40, 41, 10),
                     sex = c("female", "male", "male") )
    
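    glimpse(family) # compact summary: one line per column
    head(family)    # the first rows of the table
    View(family)    # opens a spreadsheet-style viewer in RStudio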
  2. Download the dataset disgust_scores.csv and read it into a table.

    disgust <- read_csv("data/disgust_scores.csv")
  3. Override the default column specifications to skip the id column.

    my_cols <- cols(
      id = col_skip()
    )
    
    disgust <- read_csv("data/disgust_scores.csv", col_types = my_cols)
    head(disgust)
  4. How many rows and columns are in the dataset from question 3?

    ## gives rows as "Observations" and columns as "Variables"
    ## (newer versions of glimpse() label these "Rows" and "Columns")
    glimpse(disgust)
    ## Observations: 20,000
    ## Variables: 5
    ## $ user_id  <int> 1, 155324, 155366, 155370, 155386, 155409, 155427, 15...
    ## $ date     <date> 2008-07-10, 2008-07-11, 2008-07-12, 2008-07-12, 2008...
    ## $ moral    <dbl> 1.428571, 3.000000, 5.571429, 5.714286, 1.428571, 4.1...
    ## $ pathogen <dbl> 2.714286, 2.571429, 4.000000, 4.857143, 3.857143, 4.1...
    ## $ sexual   <dbl> 1.7142857, 1.8571429, 0.4285714, 4.7142857, 3.7142857...
    ## returns an integer vector c(rows, cols)
    dim(disgust)
    ## [1] 20000     5
    ## returns the number of rows
    nrow(disgust)
    ## [1] 20000
    ## returns the number of columns
    ncol(disgust)
    ## [1] 5
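Because dim() returns a plain integer vector, you can index into it to pull out either number. Here is a minimal sketch with a small made-up data frame; scores and size are illustrative names and the values are not taken from the disgust dataset.

```r
# 'scores' is a small made-up data frame, not the disgust dataset
scores <- data.frame(id = 1:4, moral = c(1.4, 3.0, 5.6, 5.7))

size <- dim(scores)
is.integer(size)  # TRUE: dim() gives an integer vector
size[1]           # number of rows, same as nrow(scores)
size[2]           # number of columns, same as ncol(scores)
```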