tidyr
verbs
# libraries needed
library(tidyverse)
library(readxl)
Get data on infant mortality rates from the CSV file infmort.csv
in the directory data
.
infmort <- read_csv("data/infmort.csv")
glimpse(infmort)
## Observations: 5,044
## Variables: 3
## $ Country <chr> ...
## $ Year <int> ...
## $ `Infant mortality rate (probability of dying between birth and age 1 per 1000 live births)` <chr> ...
Get data on maternal mortality from from the excel file matmort.xls
in the directory data
matmort <- read_xls("data/matmort.xls")
glimpse(matmort)
## Observations: 181
## Variables: 4
## $ Country <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Argent...
## $ `1990` <chr> "1 340 [ 878 - 1 950]", "71 [ 58 - 88]", "216 [ 141 -...
## $ `2000` <chr> "1 100 [ 745 - 1 570]", "43 [ 33 - 56]", "170 [ 118 -...
## $ `2015` <chr> "396 [ 253 - 620]", "29 [ 16 - 46]", "140 [ 82 - 24...
Get data on country codes from https://raw.githubusercontent.com/lukes/ISO-3166-Countries-with-Regional-Codes/master/all/all.csv
ccodes <- read_csv("https://raw.githubusercontent.com/lukes/ISO-3166-Countries-with-Regional-Codes/master/all/all.csv")
glimpse(ccodes)
## Observations: 249
## Variables: 9
## $ name <chr> "Afghanistan", "Åland Islands", "Albania", "...
## $ `alpha-2` <chr> "AF", "AX", "AL", "DZ", "AS", "AD", "AO", "A...
## $ `alpha-3` <chr> "AFG", "ALA", "ALB", "DZA", "ASM", "AND", "A...
## $ `country-code` <chr> "004", "248", "008", "012", "016", "020", "0...
## $ `iso_3166-2` <chr> "ISO 3166-2:AF", "ISO 3166-2:AX", "ISO 3166-...
## $ region <chr> "Asia", "Europe", "Europe", "Africa", "Ocean...
## $ `sub-region` <chr> "Southern Asia", "Northern Europe", "Souther...
## $ `region-code` <chr> "142", "150", "150", "002", "009", "150", "0...
## $ `sub-region-code` <chr> "034", "154", "039", "015", "061", "039", "0...
gather(data, key = "key", value = "value", ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)
matmort
is in wide format, with a separate column for each year. Change it to long format, with a row for each County/Year observation.
key
is what you want to call the row headers; it’s “year” in this example.value
is what you want to call the values in the gathered columns; they’re “stats” in this example....
refers to the columns you want to gather. You can refer to them by their names, like col1, col2, col3, col4
or col1:col4
or by their numbers, like 8, 9, 10
or 8:10
.This example is complicated because the column names to gather are numbers. If the column names are non-standard (e.g., have spaces, start with numbers, or have special characters), you can enclose them in backticks (`) like the example below.
matmort_long <- matmort %>%
gather("Year", "stats", `1990`:`2015`)
glimpse(matmort_long)
## Observations: 543
## Variables: 3
## $ Country <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Argent...
## $ Year <chr> "1990", "1990", "1990", "1990", "1990", "1990", "1990"...
## $ stats <chr> "1 340 [ 878 - 1 950]", "71 [ 58 - 88]", "216 [ 141 -...
separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn")
The data in the stats
column is in a crazy format with some sort of confidence interval in brackets and lots of extra spaces. We don’t need any of the spaces, so first we’ll remove them with mutate
.
The separate
function will separate your data on anything that is not a number or letter, so try it first without specifying the sep
argument. The into
argument is a list of the new column names.
matmort_split <- matmort_long %>%
mutate(stats = gsub(" ", "", stats)) %>%
separate(stats, c("rate", "ci_low", "ci_hi"))
glimpse(matmort_split)
## Observations: 543
## Variables: 5
## $ Country <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Argent...
## $ Year <chr> "1990", "1990", "1990", "1990", "1990", "1990", "1990"...
## $ rate <chr> "1340", "71", "216", "1160", "72", "58", "8", "8", "64...
## $ ci_low <chr> "878", "58", "141", "627", "64", "51", "7", "7", "56",...
## $ ci_hi <chr> "1950", "88", "327", "2020", "80", "65", "9", "10", "7...
The gsub(pattern, replacement, x)
function is a flexible way to do search and replace. The example above replaces all occurances of the pattern
" " (a space), with the replacement
“” (nothing), in the string x
(the stats
column). Use sub()
instead if you only want to replace the first occurance of a pattern. We only used a simple pattern here, but you can use more complicated regex patterns to replace, for example, all even numbers (e.g., gsub("[:02468:]", "*", "id = 123456")
) or all occurances of the word colour in US or UK spelling (e.g., gsub("colo(u)?r", "***", "replace color, colour, or colours, but not collors")
).
extra
The previous example should have given you an error warning about “Too many values at 543 locations”. This is because separate
splits the column at the brackets and dashes, so the text “100[90-110]” would split into four values c(“100”, “90”, “110”, “”), but we only specified 3 new columns. The fourth value is always empty (just the part after the last bracket), so we are happy to drop it, but separate
generates a warning so you don’t do that accidentally. You can turn off the warning by adding the extra
argument and setting it to “drop”. Look at the help for ??tidyr::separate
to see what the other options do.
matmort_split <- matmort_long %>%
mutate(stats = gsub(" ", "", stats)) %>%
separate(stats, c("rate", "ci_low", "ci_hi"), extra = "drop")
glimpse(matmort_split)
## Observations: 543
## Variables: 5
## $ Country <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Argent...
## $ Year <chr> "1990", "1990", "1990", "1990", "1990", "1990", "1990"...
## $ rate <chr> "1340", "71", "216", "1160", "72", "58", "8", "8", "64...
## $ ci_low <chr> "878", "58", "141", "627", "64", "51", "7", "7", "56",...
## $ ci_hi <chr> "1950", "88", "327", "2020", "80", "65", "9", "10", "7...
sep
Now do the same with infmort
. It’s already in long format, so you don’t need to use gather
, but the third column has a crazy long name, so we can just refer to it by its column number (3).
infmort_split <- infmort %>%
separate(3, c("rate", "ci_low", "ci_hi"), extra = "drop")
glimpse(infmort_split)
## Observations: 5,044
## Variables: 5
## $ Country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanis...
## $ Year <int> 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, ...
## $ rate <chr> "66", "68", "69", "71", "73", "75", "76", "78", "80", ...
## $ ci_low <chr> "3", "1", "9", "7", "4", "1", "8", "6", "4", "3", "4",...
## $ ci_hi <chr> "52", "55", "58", "61", "64", "66", "69", "71", "73", ...
Wait, that didn’t work at all! It split the column on spaces, brackets, and full stops. We just want to split on the spaces, brackets and dashes. So we need to manually set sep
to what the delimiters are. Also, once there are more than a few arguments specified for a function, it’s easier to read them if you put one argument on each line.
You can use regular expressions to separate complex columns. Here, we want to separate on dashes and brackets. You can separate on a list of delimiters by putting them in parentheses, separated by “|”. It’s a little more complicated because brackets have a special meaning in regex, so you need to “escape” the left one with two backslashes “\\”.
infmort_split <- infmort %>%
separate(
col = 3,
into = c("rate", "ci_low", "ci_hi"),
extra = "drop",
sep = "(\\[|-|])"
)
glimpse(infmort_split)
## Observations: 5,044
## Variables: 5
## $ Country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanis...
## $ Year <int> 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, ...
## $ rate <chr> "66.3 ", "68.1 ", "69.9 ", "71.7 ", "73.4 ", "75.1 ", ...
## $ ci_low <chr> "52.7", "55.7", "58.7", "61.6", "64.4", "66.9", "69.0"...
## $ ci_hi <chr> "83.9", "83.6", "83.5", "83.7", "84.2", "85.1", "86.1"...
convert
That’s better. Notice the Year
, rate
, ci_low
and ci_hi
. That means these columns hold characters (like words), not numbers or integers. This can cause problems when you try to do thigs like average the numbers (you can’t average words), so we can fix it by adding the argument convert
and setting it to TRUE
.
infmort_split <- infmort %>%
separate(3, c("rate", "ci_low", "ci_hi"), extra = "drop", sep = "(\\[|-|])", convert = TRUE)
glimpse(infmort_split)
## Observations: 5,044
## Variables: 5
## $ Country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanis...
## $ Year <int> 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, ...
## $ rate <dbl> 66.3, 68.1, 69.9, 71.7, 73.4, 75.1, 76.8, 78.6, 80.4, ...
## $ ci_low <dbl> 52.7, 55.7, 58.7, 61.6, 64.4, 66.9, 69.0, 71.2, 73.4, ...
## $ ci_hi <dbl> 83.9, 83.6, 83.5, 83.7, 84.2, 85.1, 86.1, 87.3, 88.9, ...
Do the same for matmort
.
matmort_split <- matmort_long %>%
mutate(stats = gsub(" ", "", stats)) %>%
separate(stats, c("rate", "ci_low", "ci_hi"), extra = "drop", convert = TRUE)
glimpse(matmort_split)
## Observations: 543
## Variables: 5
## $ Country <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Argent...
## $ Year <chr> "1990", "1990", "1990", "1990", "1990", "1990", "1990"...
## $ rate <int> 1340, 71, 216, 1160, 72, 58, 8, 8, 64, 46, 26, 569, 58...
## $ ci_low <int> 878, 58, 141, 627, 64, 51, 7, 7, 56, 34, 20, 446, 47, ...
## $ ci_hi <int> 1950, 88, 327, 2020, 80, 65, 9, 10, 74, 61, 33, 715, 7...
We can chain all the steps above together, since we don’t need those intermediate dataframes.
infmort <- read_csv("data/infmort.csv") %>%
separate(
3,
c("rate", "ci_low", "ci_hi"),
extra = "drop",
sep = "(\\[|-|])",
convert = TRUE
)
matmort <- read_xls("data/matmort.xls") %>%
gather("Year", "stats", `1990`:`2015`) %>%
mutate(stats = gsub(" ", "", stats)) %>%
separate(
stats,
c("rate", "ci_low", "ci_hi"),
extra = "drop",
convert = TRUE
)
glimpse(matmort)
## Observations: 543
## Variables: 5
## $ Country <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Argent...
## $ Year <chr> "1990", "1990", "1990", "1990", "1990", "1990", "1990"...
## $ rate <int> 1340, 71, 216, 1160, 72, 58, 8, 8, 64, 46, 26, 569, 58...
## $ ci_low <int> 878, 58, 141, 627, 64, 51, 7, 7, 56, 34, 20, 446, 47, ...
## $ ci_hi <int> 1950, 88, 327, 2020, 80, 65, 9, 10, 74, 61, 33, 715, 7...
glimpse(infmort)
## Observations: 5,044
## Variables: 5
## $ Country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanis...
## $ Year <int> 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, ...
## $ rate <dbl> 66.3, 68.1, 69.9, 71.7, 73.4, 75.1, 76.8, 78.6, 80.4, ...
## $ ci_low <dbl> 52.7, 55.7, 58.7, 61.6, 64.4, 66.9, 69.0, 71.2, 73.4, ...
## $ ci_hi <dbl> 83.9, 83.6, 83.5, 83.7, 84.2, 85.1, 86.1, 87.3, 88.9, ...
spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE, sep = NULL)
You can reverse the processes above, as well. For example, you can convert data from long format into wide format.
key
is the column that contains your new column headersvalue
is the column that contains the values in the new spread columnsLet’s spread out the infant mortality rate by year.
infmort_wide <- infmort %>%
spread(Year, rate)
glimpse(infmort_wide)
## Observations: 4,934
## Variables: 29
## $ Country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanis...
## $ ci_low <dbl> 52.7, 55.7, 58.7, 61.6, 64.4, 66.9, 69.0, 71.2, 73.4, ...
## $ ci_hi <dbl> 83.9, 83.6, 83.5, 83.7, 84.2, 85.1, 86.1, 87.3, 88.9, ...
## $ `1990` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `1991` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `1992` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `1993` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `1994` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `1995` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `1996` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `1997` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `1998` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `1999` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `2000` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `2001` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `2002` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 91...
## $ `2003` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 89, NA...
## $ `2004` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 86.7, NA, ...
## $ `2005` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 84.4, NA, NA, ...
## $ `2006` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 82.3, NA, NA, NA, ...
## $ `2007` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 80.4, NA, NA, NA, NA, ...
## $ `2008` <dbl> NA, NA, NA, NA, NA, NA, NA, 78.6, NA, NA, NA, NA, NA, ...
## $ `2009` <dbl> NA, NA, NA, NA, NA, NA, 76.8, NA, NA, NA, NA, NA, NA, ...
## $ `2010` <dbl> NA, NA, NA, NA, NA, 75.1, NA, NA, NA, NA, NA, NA, NA, ...
## $ `2011` <dbl> NA, NA, NA, NA, 73.4, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ `2012` <dbl> NA, NA, NA, 71.7, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ `2013` <dbl> NA, NA, 69.9, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ `2014` <dbl> NA, 68.1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ `2015` <dbl> 66.3, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
Nope, that didn’t work at all, but it’s a really common mistake when spreading data. This is because spread
matches on all the remaining columns, so Afghanistan with ci_low
of 52.7 is treated as a different observation than Afghanistan with ci_low
of 55.7. We can fix this by merging the rate
, ci_low
and ci_hi
columns back together.
unite(data, col, ..., sep = "_", remove = TRUE)
infmort_united <- infmort %>%
unite(rate_ci, rate, ci_low, ci_hi)
glimpse(infmort_united)
## Observations: 5,044
## Variables: 3
## $ Country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanis...
## $ Year <int> 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, ...
## $ rate_ci <chr> "66.3_52.7_83.9", "68.1_55.7_83.6", "69.9_58.7_83.5", ...
sep
unite()
separates merged names with an underscore by default. Set the sep
argument if you want to change that.
infmort_united <- infmort %>%
unite(rate_ci, rate, ci_low, ci_hi, sep = ", ")
glimpse(infmort_united)
## Observations: 5,044
## Variables: 3
## $ Country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanis...
## $ Year <int> 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, ...
## $ rate_ci <chr> "66.3, 52.7, 83.9", "68.1, 55.7, 83.6", "69.9, 58.7, 8...
What if you want to put it back into the format “rate [ci_low - ci_hi]”? Then, mutate
and paste
are a better choice than unite
, but you have to get rid of the rate
, ci_low
and ci_hi
columns with select
. You’ll learn more about these function in the Data Manipulation lesson.
infmort_united <- infmort %>%
mutate(rate_ci = paste0(rate, " [", ci_low, " - ", ci_hi, "]"))
glimpse(infmort_united)
## Observations: 5,044
## Variables: 6
## $ Country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanis...
## $ Year <int> 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, ...
## $ rate <dbl> 66.3, 68.1, 69.9, 71.7, 73.4, 75.1, 76.8, 78.6, 80.4, ...
## $ ci_low <dbl> 52.7, 55.7, 58.7, 61.6, 64.4, 66.9, 69.0, 71.2, 73.4, ...
## $ ci_hi <dbl> 83.9, 83.6, 83.5, 83.7, 84.2, 85.1, 86.1, 87.3, 88.9, ...
## $ rate_ci <chr> "66.3 [52.7 - 83.9]", "68.1 [55.7 - 83.6]", "69.9 [58....
Now let’s try spreading on year again. Notice here we’re uniting columns rate:ci_hi
, instead of rate, ci_low, ci_hi
. The colon just says to select all the columns between the first and last named ones. Check the help documentation for ??tidyr::unite
and ??tidyr::select
to see other ways to select columns.
infmort_wide <- infmort %>%
unite(rate_ci, rate:ci_hi, sep = ", ") %>%
spread(Year, rate_ci)
glimpse(infmort_wide)
## Observations: 194
## Variables: 27
## $ Country <chr> "Afghanistan", "Albania", "Algeria", "Andorra", "Angol...
## $ `1990` <chr> "122.5, 111.6, 135.5", "35.1, 31.3, 39.2", "39.7, 37.1...
## $ `1991` <chr> "118.3, 108, 129.9", "33.7, 30.2, 37.6", "38.8, 36.1, ...
## $ `1992` <chr> "114.4, 104.6, 125.2", "32.5, 29.2, 36.2", "38.1, 35.4...
## $ `1993` <chr> "110.9, 101.4, 120.9", "31.4, 28.2, 34.9", "37.5, 34.9...
## $ `1994` <chr> "107.7, 98.6, 117.2", "30.3, 27.1, 33.8", "36.9, 34.6,...
## $ `1995` <chr> "105, 96.2, 114.1", "29.1, 26, 32.7", "36.3, 34.2, 38....
## $ `1996` <chr> "102.7, 94.5, 111.3", "27.9, 24.8, 31.5", "35.7, 34, 3...
## $ `1997` <chr> "100.7, 92.9, 109.1", "26.8, 23.6, 30.4", "35.1, 33.8,...
## $ `1998` <chr> "98.9, 91.4, 107.2", "25.5, 22.4, 29.2", "34.7, 33.7, ...
## $ `1999` <chr> "97.2, 89.9, 105.4", "24.4, 21.2, 28.1", "34.4, 33.5, ...
## $ `2000` <chr> "95.4, 88.2, 103.6", "23.2, 20, 27", "33.9, 33.2, 34.7...
## $ `2001` <chr> "93.4, 86.3, 101.6", "22.1, 18.8, 26", "33.3, 32.7, 34...
## $ `2002` <chr> "91.2, 84.3, 99.3", "21, 17.6, 25.1", "32.4, 31.8, 33"...
## $ `2003` <chr> "89, 82.1, 97", "20, 16.5, 24.3", "31.3, 30.7, 31.9", ...
## $ `2004` <chr> "86.7, 79.9, 94.8", "19.1, 15.4, 23.8", "30.1, 29.5, 3...
## $ `2005` <chr> "84.4, 77.7, 92.6", "18.3, 14.2, 23.4", "28.8, 28.3, 2...
## $ `2006` <chr> "82.3, 75.5, 90.7", "17.4, 13.2, 23.1", "27.6, 27, 28....
## $ `2007` <chr> "80.4, 73.4, 88.9", "16.7, 12.1, 22.9", "26.4, 25.9, 2...
## $ `2008` <chr> "78.6, 71.2, 87.3", "16, 11.2, 22.7", "25.3, 24.8, 25....
## $ `2009` <chr> "76.8, 69, 86.1", "15.4, 10.5, 22.6", "24.3, 23.8, 24....
## $ `2010` <chr> "75.1, 66.9, 85.1", "14.8, 9.8, 22.4", "23.5, 23, 23.9...
## $ `2011` <chr> "73.4, 64.4, 84.2", "14.3, 9.1, 22.3", "22.8, 22.4, 23...
## $ `2012` <chr> "71.7, 61.6, 83.7", "13.8, 8.5, 22.2", "22.4, 22, 22.9...
## $ `2013` <chr> "69.9, 58.7, 83.5", "13.3, 7.9, 22.1", "22.1, 21.7, 22...
## $ `2014` <chr> "68.1, 55.7, 83.6", "12.9, 7.5, 22.1", "22, 21.3, 22.7...
## $ `2015` <chr> "66.3, 52.7, 83.9", "12.5, 7, 22.2", "21.9, 20.8, 23",...
Tidy the data from personality.csv.
These data are from a 5-factor (OCEAN) personality questionnaire. Each question is labelled with the domain (Op = openness, Co = concientiousness, Ex = extraversion, Ag = agreeableness, and Ne = neuroticism) and the question number.
Basic: Load the data and convert from wide to long format.
The resulting dataframe should have the columns: user_id
, date
, question
, and score
.
ocean <- read_csv("data/personality.csv") %>%
gather("question", "score", Op1:Ex9)
Basic: Split the question
column into two columns: domain
and qnumber
.
There is no character to split on, here, but you can separate a column after a specific number of characters by setting sep
to an integer. For example, to split “abcde” after the third character, use sep = 3
, which results in c(“abc”, “de”). You can also use negative number to split before the nth character from the right. For example, to split a column that has words of various lengths and 2-digit suffixes (like “lisa03”“,”amanda38“), you can use sep = -2
.
ocean_sep <- ocean %>%
separate(question, c("domain", "qnumber"), sep = 2)
Basic: Put the domain and qnumber columns back together into a new column named domain_n
. Make it in a format like “Op_Q1”.
ocean_unite <- ocean_sep %>%
unite("domain_n", domain, qnumber, sep = "_Q")
Basic: Convert back to wide format.
ocean_spread <- ocean_unite %>%
spread(domain_n, score)
ocean <- read_csv("data/personality.csv") %>%
gather("question", "score", Op1:Ex9) %>%
separate(question, c("domain", "qnumber"), sep = 2) %>%
unite("domain_n", domain, qnumber, sep = "_Q") %>%
spread(domain_n, score)
Intermediate: Debug the following code:
Load the data from sensation_seeking.csv.
ss <- read_csv(data/sensation_seeking.csv)
ss <- read_csv("data/sensation_seeking.csv")
Convert from wide to long format.
ss_long <- gather(ss, "question", "score")
ss_long <- gather(ss, "question", "score", sss1:sss14)
Split the question
column into two columns: domain
and qnumber
.
ss_sep <- ss_long %>%
separate(question, domain, qnumber, sep = 3)
ss_sep <- ss_long %>%
separate(question, c("domain", "qnumber"), sep = 3)
Put the id
and user_id
columns together into a new column named super_id
. Make it in a format like “id-user_id”.
ss_unite <- ss_sep %>%
unite(id, user_id, "super_id", sep = "-")
ss_unite <- ss_sep %>%
unite("super_id", id, user_id, sep = "-")
Convert back to wide format.
ss_wide <- ss_unite %>%
spreadr(qnumber, score)
ss_wide <- ss_unite %>%
spread(qnumber, score)
Intermediate: Load the dataset family_composition.csv.
The columns oldbro
through twinsis
give the number of siblings of that age and sex. Put this into long format and create separate columns for sibling age (old, young, twin) and sex (bro, sis).
family <- read_csv("data/family_composition.csv") %>%
gather("sibtype", "n", oldbro:twinsis) %>%
separate(sibtype, c("sibage", "sibsex"), sep = -4)
Advanced: Tidy the data from eye_descriptions.csv. This dataset contains descriptions of the eyes of 50 people. Some raters wrote more than one description per face, separated by commas, semicolons, or slashes.
Create a dataset with separate columns for face_id, description, and number of description.
eyes <- read_csv("data/eye_descriptions.csv") %>%
gather("face_id", "description", t1:t50) %>%
separate(description, c("d1", "d2", "d3"), sep = "(,|;|\\/)+", extra = "merge") %>%
gather("desc_n", "description", d1:d3) %>%
filter(!is.na(description)) # gets rid of rows with no description
I’m bored
Using the dataset from question 6, calculate how many siblings of each sex each person has, narrow the dataset down to people with fewer than 6 siblings, and generate at least two different ways to graph this.
family %>%
group_by(user_id, sex, sibsex) %>%
summarise(n = sum(n)) %>%
group_by(user_id) %>%
filter(sex %in% c("male", "female"), sum(n) < 6) %>%
ggplot(aes(n, fill = sibsex)) +
geom_histogram(binwidth = 1, colour = "black", position = "dodge")
family %>%
group_by(user_id, sex, sibsex) %>%
summarise(n = sum(n)) %>%
filter(sex %in% c("male", "female")) %>%
spread(sibsex, n) %>%
filter(bro + sis < 6) %>%
ggplot(aes(bro, sis)) +
geom_bin2d(binwidth = c(1,1))
If you’re fluent with the tidyverse (if not, we’ll get to this next week), create a list of the most common descriptions from the dtaaset in question 7. Remove useless descriptions and merge redundant descriptions.
eyes %>%
mutate(
description = trimws(description), # get rid of white space around string
description = tolower(description) # make all characters lowercase
) %>%
group_by(description) %>%
summarise(n = n()) %>% # count occurances of each description
arrange(desc(n)) %>% # sort by count (descending)
filter(nchar(description) > 1) # get rid of 1-character descriptions