Data cleaning

Learning Objectives

Basic

Understand the concept of “tidy data”
Be able to use the 4 basic tidyr verbs

Intermediate

Be able to chain functions to load and tidy a dataset without creating several intermediate datasets
Be able to use arguments like sep, extra, and convert to handle less straightforward data cleaning

Advanced

Be able to use regular expressions to separate complex columns

Prep

Read Tidy Data

Resources

Data wrangling cheat sheet
Chapter 12: Tidy Data in R for Data Science

Class Notes

Setup

# libraries needed
library(tidyverse)
library(readxl)

Load Data

Get data on infant mortality rates from the CSV file infmort.csv in the directory data.

infmort <- read_csv("data/infmort.csv")
glimpse(infmort)

## Observations: 5,044
## Variables: 3
## $ Country                                                                                     <chr> ...
## $ Year                                                                                        <int> ...
## $ `Infant mortality rate (probability of dying between birth and age 1 per 1000 live births)` <chr> ...

Get data on maternal mortality from from the excel file matmort.xls in the directory data

matmort <- read_xls("data/matmort.xls")
glimpse(matmort)

## Observations: 181
## Variables: 4
## $ Country <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Argent...
## $ `1990`  <chr> "1 340 [ 878 - 1 950]", "71 [ 58 -  88]", "216 [ 141 -...
## $ `2000`  <chr> "1 100 [ 745 - 1 570]", "43 [ 33 -  56]", "170 [ 118 -...
## $ `2015`  <chr> "396 [ 253 -  620]", "29 [ 16 -  46]", "140 [ 82 -  24...

Get data on country codes from https://raw.githubusercontent.com/lukes/ISO-3166-Countries-with-Regional-Codes/master/all/all.csv

ccodes <- read_csv("https://raw.githubusercontent.com/lukes/ISO-3166-Countries-with-Regional-Codes/master/all/all.csv")
glimpse(ccodes)

## Observations: 249
## Variables: 9
## $ name              <chr> "Afghanistan", "Åland Islands", "Albania", "...
## $ `alpha-2`         <chr> "AF", "AX", "AL", "DZ", "AS", "AD", "AO", "A...
## $ `alpha-3`         <chr> "AFG", "ALA", "ALB", "DZA", "ASM", "AND", "A...
## $ `country-code`    <chr> "004", "248", "008", "012", "016", "020", "0...
## $ `iso_3166-2`      <chr> "ISO 3166-2:AF", "ISO 3166-2:AX", "ISO 3166-...
## $ region            <chr> "Asia", "Europe", "Europe", "Africa", "Ocean...
## $ `sub-region`      <chr> "Southern Asia", "Northern Europe", "Souther...
## $ `region-code`     <chr> "142", "150", "150", "002", "009", "150", "0...
## $ `sub-region-code` <chr> "034", "154", "039", "015", "061", "039", "0...

gather()

gather(data, key = "key", value = "value", ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)

matmort is in wide format, with a separate column for each year. Change it to long format, with a row for each County/Year observation.

key is what you want to call the row headers; it’s “year” in this example.
value is what you want to call the values in the gathered columns; they’re “stats” in this example.
The ... refers to the columns you want to gather. You can refer to them by their names, like col1, col2, col3, col4 or col1:col4 or by their numbers, like 8, 9, 10 or 8:10.

This example is complicated because the column names to gather are numbers. If the column names are non-standard (e.g., have spaces, start with numbers, or have special characters), you can enclose them in backticks (`) like the example below.

matmort_long <- matmort %>%
  gather("Year", "stats", `1990`:`2015`)

glimpse(matmort_long)

## Observations: 543
## Variables: 3
## $ Country <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Argent...
## $ Year    <chr> "1990", "1990", "1990", "1990", "1990", "1990", "1990"...
## $ stats   <chr> "1 340 [ 878 - 1 950]", "71 [ 58 -  88]", "216 [ 141 -...

separate()

separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn")

The data in the stats column is in a crazy format with some sort of confidence interval in brackets and lots of extra spaces. We don’t need any of the spaces, so first we’ll remove them with mutate.

The separate function will separate your data on anything that is not a number or letter, so try it first without specifying the sep argument. The into argument is a list of the new column names.

matmort_split <- matmort_long %>%
  mutate(stats = gsub(" ", "", stats)) %>%
  separate(stats, c("rate", "ci_low", "ci_hi"))

glimpse(matmort_split)

## Observations: 543
## Variables: 5
## $ Country <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Argent...
## $ Year    <chr> "1990", "1990", "1990", "1990", "1990", "1990", "1990"...
## $ rate    <chr> "1340", "71", "216", "1160", "72", "58", "8", "8", "64...
## $ ci_low  <chr> "878", "58", "141", "627", "64", "51", "7", "7", "56",...
## $ ci_hi   <chr> "1950", "88", "327", "2020", "80", "65", "9", "10", "7...

The gsub(pattern, replacement, x) function is a flexible way to do search and replace. The example above replaces all occurances of the pattern " " (a space), with the replacement “” (nothing), in the string x (the stats column). Use sub() instead if you only want to replace the first occurance of a pattern. We only used a simple pattern here, but you can use more complicated regex patterns to replace, for example, all even numbers (e.g., gsub("[:02468:]", "*", "id = 123456")) or all occurances of the word colour in US or UK spelling (e.g., gsub("colo(u)?r", "***", "replace color, colour, or colours, but not collors")).

Handle spare columns with `extra`

The previous example should have given you an error warning about “Too many values at 543 locations”. This is because separate splits the column at the brackets and dashes, so the text “100[90-110]” would split into four values c(“100”, “90”, “110”, “”), but we only specified 3 new columns. The fourth value is always empty (just the part after the last bracket), so we are happy to drop it, but separate generates a warning so you don’t do that accidentally. You can turn off the warning by adding the extra argument and setting it to “drop”. Look at the help for ??tidyr::separate to see what the other options do.

matmort_split <- matmort_long %>%
  mutate(stats = gsub(" ", "", stats)) %>%
  separate(stats, c("rate", "ci_low", "ci_hi"), extra = "drop")

glimpse(matmort_split)

## Observations: 543
## Variables: 5
## $ Country <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Argent...
## $ Year    <chr> "1990", "1990", "1990", "1990", "1990", "1990", "1990"...
## $ rate    <chr> "1340", "71", "216", "1160", "72", "58", "8", "8", "64...
## $ ci_low  <chr> "878", "58", "141", "627", "64", "51", "7", "7", "56",...
## $ ci_hi   <chr> "1950", "88", "327", "2020", "80", "65", "9", "10", "7...

Set delimiters with `sep`

Now do the same with infmort. It’s already in long format, so you don’t need to use gather, but the third column has a crazy long name, so we can just refer to it by its column number (3).

infmort_split <- infmort %>%
  separate(3, c("rate", "ci_low", "ci_hi"), extra = "drop")

glimpse(infmort_split)

## Observations: 5,044
## Variables: 5
## $ Country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanis...
## $ Year    <int> 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, ...
## $ rate    <chr> "66", "68", "69", "71", "73", "75", "76", "78", "80", ...
## $ ci_low  <chr> "3", "1", "9", "7", "4", "1", "8", "6", "4", "3", "4",...
## $ ci_hi   <chr> "52", "55", "58", "61", "64", "66", "69", "71", "73", ...

Wait, that didn’t work at all! It split the column on spaces, brackets, and full stops. We just want to split on the spaces, brackets and dashes. So we need to manually set sep to what the delimiters are. Also, once there are more than a few arguments specified for a function, it’s easier to read them if you put one argument on each line.

You can use regular expressions to separate complex columns. Here, we want to separate on dashes and brackets. You can separate on a list of delimiters by putting them in parentheses, separated by “|”. It’s a little more complicated because brackets have a special meaning in regex, so you need to “escape” the left one with two backslashes “\\”.

infmort_split <- infmort %>%
  separate(
    col = 3, 
    into = c("rate", "ci_low", "ci_hi"), 
    extra = "drop", 
    sep = "(\\[|-|])"
  )

glimpse(infmort_split)

## Observations: 5,044
## Variables: 5
## $ Country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanis...
## $ Year    <int> 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, ...
## $ rate    <chr> "66.3 ", "68.1 ", "69.9 ", "71.7 ", "73.4 ", "75.1 ", ...
## $ ci_low  <chr> "52.7", "55.7", "58.7", "61.6", "64.4", "66.9", "69.0"...
## $ ci_hi   <chr> "83.9", "83.6", "83.5", "83.7", "84.2", "85.1", "86.1"...

Fix data types with `convert`

That’s better. Notice the next to Year, rate, ci_low and ci_hi. That means these columns hold characters (like words), not numbers or integers. This can cause problems when you try to do thigs like average the numbers (you can’t average words), so we can fix it by adding the argument convert and setting it to TRUE.

infmort_split <- infmort %>%
  separate(3, c("rate", "ci_low", "ci_hi"), extra = "drop", sep = "(\\[|-|])", convert = TRUE)

glimpse(infmort_split)

## Observations: 5,044
## Variables: 5
## $ Country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanis...
## $ Year    <int> 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, ...
## $ rate    <dbl> 66.3, 68.1, 69.9, 71.7, 73.4, 75.1, 76.8, 78.6, 80.4, ...
## $ ci_low  <dbl> 52.7, 55.7, 58.7, 61.6, 64.4, 66.9, 69.0, 71.2, 73.4, ...
## $ ci_hi   <dbl> 83.9, 83.6, 83.5, 83.7, 84.2, 85.1, 86.1, 87.3, 88.9, ...

Do the same for matmort.

matmort_split <- matmort_long %>%
  mutate(stats = gsub(" ", "", stats)) %>%
  separate(stats, c("rate", "ci_low", "ci_hi"), extra = "drop", convert = TRUE)

glimpse(matmort_split)

## Observations: 543
## Variables: 5
## $ Country <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Argent...
## $ Year    <chr> "1990", "1990", "1990", "1990", "1990", "1990", "1990"...
## $ rate    <int> 1340, 71, 216, 1160, 72, 58, 8, 8, 64, 46, 26, 569, 58...
## $ ci_low  <int> 878, 58, 141, 627, 64, 51, 7, 7, 56, 34, 20, 446, 47, ...
## $ ci_hi   <int> 1950, 88, 327, 2020, 80, 65, 9, 10, 74, 61, 33, 715, 7...

All in one step

We can chain all the steps above together, since we don’t need those intermediate dataframes.

infmort <- read_csv("data/infmort.csv") %>%
  separate(
    3, 
    c("rate", "ci_low", "ci_hi"), 
    extra = "drop", 
    sep = "(\\[|-|])", 
    convert = TRUE
  )

matmort <- read_xls("data/matmort.xls") %>%
  gather("Year", "stats", `1990`:`2015`) %>%
  mutate(stats = gsub(" ", "", stats)) %>%
  separate(
    stats, 
    c("rate", "ci_low", "ci_hi"), 
    extra = "drop", 
    convert = TRUE
  )

glimpse(matmort)

## Observations: 543
## Variables: 5
## $ Country <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Argent...
## $ Year    <chr> "1990", "1990", "1990", "1990", "1990", "1990", "1990"...
## $ rate    <int> 1340, 71, 216, 1160, 72, 58, 8, 8, 64, 46, 26, 569, 58...
## $ ci_low  <int> 878, 58, 141, 627, 64, 51, 7, 7, 56, 34, 20, 446, 47, ...
## $ ci_hi   <int> 1950, 88, 327, 2020, 80, 65, 9, 10, 74, 61, 33, 715, 7...

glimpse(infmort)

## Observations: 5,044
## Variables: 5
## $ Country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanis...
## $ Year    <int> 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, ...
## $ rate    <dbl> 66.3, 68.1, 69.9, 71.7, 73.4, 75.1, 76.8, 78.6, 80.4, ...
## $ ci_low  <dbl> 52.7, 55.7, 58.7, 61.6, 64.4, 66.9, 69.0, 71.2, 73.4, ...
## $ ci_hi   <dbl> 83.9, 83.6, 83.5, 83.7, 84.2, 85.1, 86.1, 87.3, 88.9, ...

spread()

spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE, sep = NULL)

You can reverse the processes above, as well. For example, you can convert data from long format into wide format.

key is the column that contains your new column headers
value is the column that contains the values in the new spread columns

Let’s spread out the infant mortality rate by year.

infmort_wide <- infmort %>%
  spread(Year, rate)

glimpse(infmort_wide)

## Observations: 4,934
## Variables: 29
## $ Country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanis...
## $ ci_low  <dbl> 52.7, 55.7, 58.7, 61.6, 64.4, 66.9, 69.0, 71.2, 73.4, ...
## $ ci_hi   <dbl> 83.9, 83.6, 83.5, 83.7, 84.2, 85.1, 86.1, 87.3, 88.9, ...
## $ `1990`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `1991`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `1992`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `1993`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `1994`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `1995`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `1996`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `1997`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `1998`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `1999`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `2000`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `2001`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `2002`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 91...
## $ `2003`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 89, NA...
## $ `2004`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 86.7, NA, ...
## $ `2005`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 84.4, NA, NA, ...
## $ `2006`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 82.3, NA, NA, NA, ...
## $ `2007`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 80.4, NA, NA, NA, NA, ...
## $ `2008`  <dbl> NA, NA, NA, NA, NA, NA, NA, 78.6, NA, NA, NA, NA, NA, ...
## $ `2009`  <dbl> NA, NA, NA, NA, NA, NA, 76.8, NA, NA, NA, NA, NA, NA, ...
## $ `2010`  <dbl> NA, NA, NA, NA, NA, 75.1, NA, NA, NA, NA, NA, NA, NA, ...
## $ `2011`  <dbl> NA, NA, NA, NA, 73.4, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ `2012`  <dbl> NA, NA, NA, 71.7, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ `2013`  <dbl> NA, NA, 69.9, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ `2014`  <dbl> NA, 68.1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ `2015`  <dbl> 66.3, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...

Nope, that didn’t work at all, but it’s a really common mistake when spreading data. This is because spread matches on all the remaining columns, so Afghanistan with ci_low of 52.7 is treated as a different observation than Afghanistan with ci_low of 55.7. We can fix this by merging the rate, ci_low and ci_hi columns back together.

unite()

unite(data, col, ..., sep = "_", remove = TRUE)

infmort_united <- infmort %>%
  unite(rate_ci, rate, ci_low, ci_hi)

glimpse(infmort_united)

## Observations: 5,044
## Variables: 3
## $ Country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanis...
## $ Year    <int> 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, ...
## $ rate_ci <chr> "66.3_52.7_83.9", "68.1_55.7_83.6", "69.9_58.7_83.5", ...

Control separation with `sep`

unite() separates merged names with an underscore by default. Set the sep argument if you want to change that.

infmort_united <- infmort %>%
  unite(rate_ci, rate, ci_low, ci_hi, sep = ", ")

glimpse(infmort_united)

## Observations: 5,044
## Variables: 3
## $ Country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanis...
## $ Year    <int> 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, ...
## $ rate_ci <chr> "66.3, 52.7, 83.9", "68.1, 55.7, 83.6", "69.9, 58.7, 8...

What if you want to put it back into the format “rate [ci_low - ci_hi]”? Then, mutate and paste are a better choice than unite, but you have to get rid of the rate, ci_low and ci_hi columns with select. You’ll learn more about these function in the Data Manipulation lesson.

infmort_united <- infmort %>%
  mutate(rate_ci = paste0(rate, " [", ci_low, " - ", ci_hi, "]"))

glimpse(infmort_united)

## Observations: 5,044
## Variables: 6
## $ Country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanis...
## $ Year    <int> 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, ...
## $ rate    <dbl> 66.3, 68.1, 69.9, 71.7, 73.4, 75.1, 76.8, 78.6, 80.4, ...
## $ ci_low  <dbl> 52.7, 55.7, 58.7, 61.6, 64.4, 66.9, 69.0, 71.2, 73.4, ...
## $ ci_hi   <dbl> 83.9, 83.6, 83.5, 83.7, 84.2, 85.1, 86.1, 87.3, 88.9, ...
## $ rate_ci <chr> "66.3 [52.7 - 83.9]", "68.1 [55.7 - 83.6]", "69.9 [58....

Now let’s try spreading on year again. Notice here we’re uniting columns rate:ci_hi, instead of rate, ci_low, ci_hi. The colon just says to select all the columns between the first and last named ones. Check the help documentation for ??tidyr::unite and ??tidyr::select to see other ways to select columns.

infmort_wide <- infmort %>%
  unite(rate_ci, rate:ci_hi, sep = ", ") %>%
  spread(Year, rate_ci)

glimpse(infmort_wide)

## Observations: 194
## Variables: 27
## $ Country <chr> "Afghanistan", "Albania", "Algeria", "Andorra", "Angol...
## $ `1990`  <chr> "122.5, 111.6, 135.5", "35.1, 31.3, 39.2", "39.7, 37.1...
## $ `1991`  <chr> "118.3, 108, 129.9", "33.7, 30.2, 37.6", "38.8, 36.1, ...
## $ `1992`  <chr> "114.4, 104.6, 125.2", "32.5, 29.2, 36.2", "38.1, 35.4...
## $ `1993`  <chr> "110.9, 101.4, 120.9", "31.4, 28.2, 34.9", "37.5, 34.9...
## $ `1994`  <chr> "107.7, 98.6, 117.2", "30.3, 27.1, 33.8", "36.9, 34.6,...
## $ `1995`  <chr> "105, 96.2, 114.1", "29.1, 26, 32.7", "36.3, 34.2, 38....
## $ `1996`  <chr> "102.7, 94.5, 111.3", "27.9, 24.8, 31.5", "35.7, 34, 3...
## $ `1997`  <chr> "100.7, 92.9, 109.1", "26.8, 23.6, 30.4", "35.1, 33.8,...
## $ `1998`  <chr> "98.9, 91.4, 107.2", "25.5, 22.4, 29.2", "34.7, 33.7, ...
## $ `1999`  <chr> "97.2, 89.9, 105.4", "24.4, 21.2, 28.1", "34.4, 33.5, ...
## $ `2000`  <chr> "95.4, 88.2, 103.6", "23.2, 20, 27", "33.9, 33.2, 34.7...
## $ `2001`  <chr> "93.4, 86.3, 101.6", "22.1, 18.8, 26", "33.3, 32.7, 34...
## $ `2002`  <chr> "91.2, 84.3, 99.3", "21, 17.6, 25.1", "32.4, 31.8, 33"...
## $ `2003`  <chr> "89, 82.1, 97", "20, 16.5, 24.3", "31.3, 30.7, 31.9", ...
## $ `2004`  <chr> "86.7, 79.9, 94.8", "19.1, 15.4, 23.8", "30.1, 29.5, 3...
## $ `2005`  <chr> "84.4, 77.7, 92.6", "18.3, 14.2, 23.4", "28.8, 28.3, 2...
## $ `2006`  <chr> "82.3, 75.5, 90.7", "17.4, 13.2, 23.1", "27.6, 27, 28....
## $ `2007`  <chr> "80.4, 73.4, 88.9", "16.7, 12.1, 22.9", "26.4, 25.9, 2...
## $ `2008`  <chr> "78.6, 71.2, 87.3", "16, 11.2, 22.7", "25.3, 24.8, 25....
## $ `2009`  <chr> "76.8, 69, 86.1", "15.4, 10.5, 22.6", "24.3, 23.8, 24....
## $ `2010`  <chr> "75.1, 66.9, 85.1", "14.8, 9.8, 22.4", "23.5, 23, 23.9...
## $ `2011`  <chr> "73.4, 64.4, 84.2", "14.3, 9.1, 22.3", "22.8, 22.4, 23...
## $ `2012`  <chr> "71.7, 61.6, 83.7", "13.8, 8.5, 22.2", "22.4, 22, 22.9...
## $ `2013`  <chr> "69.9, 58.7, 83.5", "13.3, 7.9, 22.1", "22.1, 21.7, 22...
## $ `2014`  <chr> "68.1, 55.7, 83.6", "12.9, 7.5, 22.1", "22, 21.3, 22.7...
## $ `2015`  <chr> "66.3, 52.7, 83.9", "12.5, 7, 22.2", "21.9, 20.8, 23",...

Exercises

Tidy the data from personality.csv.

These data are from a 5-factor (OCEAN) personality questionnaire. Each question is labelled with the domain (Op = openness, Co = concientiousness, Ex = extraversion, Ag = agreeableness, and Ne = neuroticism) and the question number.

Basic: Load the data and convert from wide to long format.

The resulting dataframe should have the columns: user_id, date, question, and score.
```
ocean <- read_csv("data/personality.csv") %>%
  gather("question", "score", Op1:Ex9)
```
Basic: Split the question column into two columns: domain and qnumber.

There is no character to split on, here, but you can separate a column after a specific number of characters by setting sep to an integer. For example, to split “abcde” after the third character, use sep = 3, which results in c(“abc”, “de”). You can also use negative number to split before the nth character from the right. For example, to split a column that has words of various lengths and 2-digit suffixes (like “lisa03”“,”amanda38“), you can use sep = -2.
```
ocean_sep <- ocean %>%
  separate(question, c("domain", "qnumber"), sep = 2)
```
Basic: Put the domain and qnumber columns back together into a new column named domain_n. Make it in a format like “Op_Q1”.
```
ocean_unite <- ocean_sep %>%
  unite("domain_n", domain, qnumber, sep = "_Q")
```

Basic: Convert back to wide format.

ocean_spread <- ocean_unite %>%
  spread(domain_n, score)

Intermediate: Chain all the steps above using pipes.

ocean <- read_csv("data/personality.csv") %>%
  gather("question", "score", Op1:Ex9) %>%
  separate(question, c("domain", "qnumber"), sep = 2) %>%
  unite("domain_n", domain, qnumber, sep = "_Q") %>%
  spread(domain_n, score)

Intermediate: Debug the following code:

Load the data from sensation_seeking.csv.

ss <- read_csv(data/sensation_seeking.csv)

ss <- read_csv("data/sensation_seeking.csv")

Convert from wide to long format.

ss_long <- gather(ss, "question", "score")

ss_long <- gather(ss, "question", "score", sss1:sss14)

Split the question column into two columns: domain and qnumber.

ss_sep <- ss_long %>%
  separate(question, domain, qnumber, sep = 3)

ss_sep <- ss_long %>%
  separate(question, c("domain", "qnumber"), sep = 3)

Put the id and user_id columns together into a new column named super_id. Make it in a format like “id-user_id”.

ss_unite <- ss_sep %>%
  unite(id, user_id, "super_id", sep = "-")

ss_unite <- ss_sep %>%
  unite("super_id", id, user_id, sep = "-")

Convert back to wide format.

ss_wide <- ss_unite %>%
  spreadr(qnumber, score)

ss_wide <- ss_unite %>%
  spread(qnumber, score)

Intermediate: Load the dataset family_composition.csv.

The columns oldbro through twinsis give the number of siblings of that age and sex. Put this into long format and create separate columns for sibling age (old, young, twin) and sex (bro, sis).
```
family <- read_csv("data/family_composition.csv") %>%
  gather("sibtype", "n", oldbro:twinsis) %>%
  separate(sibtype, c("sibage", "sibsex"), sep = -4)
```

Advanced: Tidy the data from eye_descriptions.csv. This dataset contains descriptions of the eyes of 50 people. Some raters wrote more than one description per face, separated by commas, semicolons, or slashes.

Create a dataset with separate columns for face_id, description, and number of description.

eyes <- read_csv("data/eye_descriptions.csv") %>%
  gather("face_id", "description", t1:t50) %>%
  separate(description, c("d1", "d2", "d3"), sep = "(,|;|\\/)+", extra = "merge") %>%
  gather("desc_n", "description", d1:d3) %>%
  filter(!is.na(description))  # gets rid of rows with no description

I’m bored

Using the dataset from question 6, calculate how many siblings of each sex each person has, narrow the dataset down to people with fewer than 6 siblings, and generate at least two different ways to graph this.

family %>%
  group_by(user_id, sex, sibsex) %>%
  summarise(n = sum(n)) %>%
  group_by(user_id) %>%
  filter(sex %in% c("male", "female"), sum(n) < 6) %>%
  ggplot(aes(n, fill = sibsex)) +
  geom_histogram(binwidth = 1, colour = "black", position = "dodge")

family %>%
  group_by(user_id, sex, sibsex) %>%
  summarise(n = sum(n)) %>%
  filter(sex %in% c("male", "female")) %>%
  spread(sibsex, n) %>%
  filter(bro + sis < 6) %>%
  ggplot(aes(bro, sis)) +
  geom_bin2d(binwidth = c(1,1))

If you’re fluent with the tidyverse (if not, we’ll get to this next week), create a list of the most common descriptions from the dtaaset in question 7. Remove useless descriptions and merge redundant descriptions.
```
eyes %>%
  mutate(
    description = trimws(description), # get rid of white space around string
    description = tolower(description) # make all characters lowercase
  ) %>%
  group_by(description) %>%
  summarise(n = n()) %>%               # count occurances of each description
  arrange(desc(n)) %>%                 # sort by count (descending)
  filter(nchar(description) > 1)       # get rid of 1-character descriptions
```