Learning Objectives

  1. Be able to use the 6 main dplyr one-table verbs:
    • select()
    • filter()
    • arrange()
    • mutate()
    • summarise()
    • group_by()
  2. Also know these additional one-table verbs:
    • rename()
    • distinct()
    • count()
    • slice()
    • pull()

Prep

Setup

You’ll need the following two packages. Might as well convert to a tibble to make printing prettier.

  library("tidyverse")
  library("ukbabynames")

  ## convert to a tibble
  ukb <- as_tibble(ukbabynames)

In-class tasks

Easy

  1. How many records are in the dataset?

      count(ukb) ## or: nrow(ukb)
  2. Remove the column rank from the dataset.

      ukb %>%
        select(-rank)
  3. What is the range of birth years contained in the dataset?

      ukb %>%
        summarise(minyear = min(year),
                  maxyear = max(year))
  4. Pull out the babies named Hermione.

    ukb %>%
      filter(name == "Hermione")
  5. Sort the dataset by sex and then by year (descending) and then by rank (descending).

    ukb %>%
      arrange(sex, desc(year), desc(rank))
  6. Create a new column, decade, that contains the decade of birth (1990, 2000, 2010). Hint: see ?floor

      ukb %>%
        mutate(decade = floor(year / 10) * 10)
  7. Pull out the male babies named Courtney that were born between 1998 and 2001 (inclusive).

        ukb %>%
          filter(name == "Courtney", sex == "M",
                 year >= 1998, year <= 2001)
  8. How many distinct names are represented in the dataset?

    ## how many distinct names are represented in the dataset?
    ukb %>%
      distinct(name) %>%
      count()

Intermediate

  1. Pull out all the female babies named Frankie that were born before 1999 or after 2010

      ukb %>%
        filter(name == "Frankie", sex == "F",
              (year < 1999) | (year > 2010))
  2. How many total babies in the dataset were named ‘Emily’?

      ## 
      ukb %>%
        filter(name == "Emily") %>%
        summarise(tot = sum(n)) %>%
        pull(tot)
    ## [1] 102250
  3. How many distinct names are there for each sex?

      ukb %>% 
        group_by(sex) %>%
        distinct(name) %>%
        count()
  4. What is the most popular name in the dataset?

      ukb %>%
        group_by(name) %>%
        summarise(tot = sum(n)) %>%
        arrange(desc(tot)) %>%
        slice(1) %>%
        pull(name)
    ## [1] "Jack"
  5. How many babies were born each year for each sex? Make a plot.

      babes_per_year <- ukb %>%
        group_by(year, sex) %>%
        summarise(tot = sum(n))
    
      ggplot(babes_per_year, aes(year, tot, color = sex)) +
        geom_line()

  6. Create a column prop that contains the proportion of babies that were given a particular name for a given sex in a given year, then ungroup() the resulting table.

      ukb_prop <- ukb %>%
        group_by(year, sex) %>%
        mutate(p = n / sum(n)) %>%
        ungroup()
    
      ## TODO double check that you did it right by making sure the props 
      ## for each year/sex combo sum to 1

Advanced

  1. Make a frequency histogram for the final letter of each name, broken down by sex. Are certain final letters more “gendered” than others?

      last_letter <- ukb %>%
        mutate(lastchar = substr(name, nchar(name), nchar(name))) %>%
        filter(lastchar %in% letters) %>%
        count(sex, lastchar) %>%
        arrange(lastchar)
    
      ggplot(last_letter, aes(lastchar, nn, fill = sex)) +
        geom_bar(stat = "identity")

  2. Calculate the top 5 boys and girls names for each decade and print out the whole table.

      ukb %>%
        mutate(decade = floor(year / 10) * 10) %>%
        group_by(decade, sex, name) %>%
        summarise(tot_n = sum(n)) %>%
        arrange(desc(tot_n)) %>%
        slice(1:5) %>%
        ungroup() %>%
        print(n = +Inf)
    ## # A tibble: 30 x 4
    ##    decade   sex    name tot_n
    ##     <dbl> <chr>   <chr> <dbl>
    ##  1   1990     F   Chloe 33235
    ##  2   1990     F   Emily 26341
    ##  3   1990     F  Sophie 24111
    ##  4   1990     F   Megan 23363
    ##  5   1990     F Jessica 23319
    ##  6   1990     M    Jack 40554
    ##  7   1990     M  Thomas 38004
    ##  8   1990     M   James 37183
    ##  9   1990     M  Daniel 34052
    ## 10   1990     M  Joshua 30532
    ## 11   2000     F   Emily 51663
    ## 12   2000     F   Chloe 50965
    ## 13   2000     F Jessica 48023
    ## 14   2000     F  Olivia 44937
    ## 15   2000     F  Sophie 44885
    ## 16   2000     M    Jack 82540
    ## 17   2000     M  Thomas 70475
    ## 18   2000     M  Joshua 69428
    ## 19   2000     M   James 59247
    ## 20   2000     M  Oliver 56990
    ## 21   2010     F  Amelia 32397
    ## 22   2010     F  Olivia 28977
    ## 23   2010     F   Emily 24246
    ## 24   2010     F Jessica 21690
    ## 25   2010     F    Lily 21379
    ## 26   2010     M  Oliver 42642
    ## 27   2010     M   Harry 38128
    ## 28   2010     M    Jack 37492
    ## 29   2010     M Charlie 31009
    ## 30   2010     M   Jacob 29667
  3. Pull out the top 5 boys and girls names for the final year of the dataset, and plot the historical trajectory of their popularity, with separate graphs for boys and girls. (Hint: This might require merging tables, which you will learn about next week.)

      top_names <- ukb %>%
        filter(year == max(year), rank <= 5) %>%
        select(sex, name)
    
      ukb %>%
        inner_join(top_names, c("sex", "name")) %>%
        ggplot(aes(year, n, color = name)) +
          geom_line() +
          facet_wrap(~sex)

  4. What are the 10 most androgynous names in the UK? Discard any names that were given to less than 5000 babies total. Calculate an “androgyny index” for each name as log(F + .5) - log(M + .5) where F is the number of female and M is the number of male babies. This index will be zero for perfect gender balance, positive for skewed female, and negative for skewed male.

      andro <- ukb %>%
        group_by(sex, name) %>%
        summarise(tot = sum(n)) %>%
        ungroup() %>%
        spread(sex, tot, fill = 0) %>%
        mutate(N = F + M)
    
      andro_gbal <- andro %>%
        filter(N >= 5000) %>%
        mutate(gbal = log(F + .5) - log(M + .5),
               tot = F + M) %>%
        arrange(abs(gbal)) %>%
        slice(1:10)
  5. Which girl name has increased the most in popularity, if you directly compare the first year of the dataset to the last year? Which girl name has decreased the most? (Only comare names that were given to at least 500 babies in at least one of the years covered by the dataset.)

      name_pop <- ukb %>%
        filter(sex == "F", (year == 1996) | (year == 2015)) %>%
        spread(year, n, fill = 0) %>%
        filter(`2015` > 500 | `1996` > 500) %>%
        mutate(chg = `2015` - `1996`)
    
      name_pop %>%
        arrange(desc(chg)) %>%
        slice(1) %>%
        pull(name)
    ## [1] "Amelia"
      name_pop %>%
        arrange(chg) %>%
        slice(1) %>%
        pull(name)
    ## [1] "Sophie"
  6. Calculate the proportion of names that are androgynous for each year in the dataset (were given to both male and female babies) and then plot the historical trend.

      p_andro <- ukb %>%
        select(-rank) %>%
        spread(sex, n, fill = 0) %>%
        mutate(is_andro = (F != 0) & (M != 0)) %>%
        group_by(year) %>%
        summarise(p = mean(is_andro))
    
      ggplot(p_andro, aes(year, p)) + geom_line()

  7. Naming diversity trends. Calculate a naming diversity index (number of names divided by number of babies) for each year and sex in the dataset. Plot the historical trend for naming diversity.

      ndiversity <- ukb %>%
        group_by(year, sex) %>%
        summarise(n_names = n_distinct(name),
                  n_babies = sum(n),
                  d_index = n_names / n_babies) %>%
        ungroup()
    
      ggplot(ndiversity,
              aes(year, d_index, color = sex)) + geom_line()

Other things you can try: The disgust dataset

Some of the tasks below involve the use of the lubridate package for working with dates and times. See R4DS:

# libraries needed for these examples
library(lubridate)
library(tidyverse)

These examples will use data from disgust.csv, which contains data from the Three Domain Disgust Scale. Each participant is identified by a unique user_id and each questionnaire completion has a unique id.

disgust <- read_csv("data/disgust.csv")

Questionnaire Instructions: The following items describe a variety of concepts. Please rate how disgusting you find the concepts described in the items, where 0 means that you do not find the concept disgusting at all, and 6 means that you find the concept extremely disgusting.

colname question
moral1 Shoplifting a candy bar from a convenience store
moral2 Stealing from a neighbor
moral3 A student cheating to get good grades
moral4 Deceiving a friend
moral5 Forging someone’s signature on a legal document
moral6 Cutting to the front of a line to purchase the last few tickets to a show
moral7 Intentionally lying during a business transaction
sexual1 Hearing two strangers having sex
sexual2 Performing oral sex
sexual3 Watching a pornographic video
sexual4 Finding out that someone you don’t like has sexual fantasies about you
sexual5 Bringing someone you just met back to your room to have sex
sexual6 A stranger of the opposite sex intentionally rubbing your thigh in an elevator
sexual7 Having anal sex with someone of the opposite sex
pathogen1 Stepping on dog poop
pathogen2 Sitting next to someone who has red sores on their arm
pathogen3 Shaking hands with a stranger who has sweaty palms
pathogen4 Seeing some mold on old leftovers in your refrigerator
pathogen5 Standing close to a person who has body odor
pathogen6 Seeing a cockroach run across the floor
pathogen7 Accidentally touching a person’s bloody cut

select()

Select columns by name or number.

You can select each column individually, separated by commas (e.g., col1, col2). You can also select all columns between two columns by separating them with a colon (e.g., start_col:end_col).

moral <- disgust %>% select(user_id, moral1:moral7)
glimpse(moral)
## Observations: 20,000
## Variables: 8
## $ user_id <int> 0, 1, 2, 2118, 2311, 3630, 4458, 4651, 4976, 5469, 606...
## $ moral1  <int> 5, 2, 1, 0, 4, 1, 3, 2, 6, 0, 4, 1, 1, 4, 2, 1, NA, 3,...
## $ moral2  <int> 6, 2, 1, 1, 4, 5, 4, 4, 6, 1, 5, 2, 4, 4, 5, 3, NA, 5,...
## $ moral3  <int> 4, 1, 1, 1, 4, NA, 3, 3, 6, 3, 6, 2, 3, 4, 4, 2, NA, 4...
## $ moral4  <int> 6, 2, 1, 1, 4, 5, 4, 5, 0, 4, 5, 2, 4, 4, 6, 4, NA, 5,...
## $ moral5  <int> 5, 1, NA, 1, 4, 5, 4, 5, 6, 1, 5, 1, 3, 3, 5, 5, NA, 3...
## $ moral6  <int> 5, 1, NA, 2, 4, 5, 3, 5, 0, 0, 4, 2, 4, 3, 4, 4, NA, 4...
## $ moral7  <int> 6, 1, 1, 1, 4, 1, 3, 3, 0, 1, 4, 1, 3, 2, 5, 4, NA, 3,...

You can select colmns by number, which is useful when the column names are long or complicated.

sexual <- disgust %>% select(2, 11:17)
glimpse(sexual)
## Observations: 20,000
## Variables: 8
## $ user_id <int> 0, 1, 2, 2118, 2311, 3630, 4458, 4651, 4976, 5469, 606...
## $ sexual1 <int> 4, 3, 1, 4, 2, 0, 2, 4, 0, 3, 3, 1, 1, 4, 2, 0, NA, 1,...
## $ sexual2 <int> 0, 1, NA, 3, 1, 5, 4, 2, 0, 5, 0, 0, 0, 2, 1, 0, NA, 3...
## $ sexual3 <int> 1, 1, 1, 0, 2, 0, 3, 4, 0, 2, 1, 0, 0, 1, 0, 0, NA, 3,...
## $ sexual4 <int> 0, 2, NA, 6, 1, 2, 2, 6, 0, 4, 6, 3, 1, 3, 2, 0, NA, 5...
## $ sexual5 <int> 1, 1, 1, 0, 1, 0, 1, 6, 0, 6, 3, 0, 0, 2, 1, 0, NA, 5,...
## $ sexual6 <int> 4, 2, NA, 3, 1, 1, 5, 6, 0, 6, 5, 4, 2, 3, 4, 0, NA, 3...
## $ sexual7 <int> 5, 2, NA, 5, 5, 0, 4, 2, 0, 5, 3, 4, 1, 6, 3, 0, NA, 3...

You can use a minus symbol to unselect columns, leaving all of the other columns. If you want to exclude a span of columns, put parentheses around the span first (e.g., -(moral1:moral7), not -moral1:moral7).

pathogen <- disgust %>% select(-id, -date, -(moral1:sexual7))
glimpse(pathogen)
## Observations: 20,000
## Variables: 8
## $ user_id   <int> 0, 1, 2, 2118, 2311, 3630, 4458, 4651, 4976, 5469, 6...
## $ pathogen1 <int> 6, 3, NA, 5, 5, 6, 6, 5, 6, 5, 6, 4, 4, 5, 3, 5, NA,...
## $ pathogen2 <int> 1, 2, NA, 6, 5, 3, 4, 6, 6, 2, 5, 3, 4, 2, 3, 4, NA,...
## $ pathogen3 <int> 6, 3, 1, 4, 4, 1, 4, 6, 6, 4, 5, 1, 3, 0, 2, 3, NA, ...
## $ pathogen4 <int> 5, 3, NA, 6, 4, 1, 3, 4, 6, 4, 5, 1, 4, 4, 4, 1, NA,...
## $ pathogen5 <int> 4, 2, NA, 5, 5, 3, 3, 6, 0, 2, 5, 4, 5, 5, 5, 4, NA,...
## $ pathogen6 <int> 5, 3, NA, 5, 4, 1, 2, 1, 0, 2, 5, 1, 4, 5, 1, 2, NA,...
## $ pathogen7 <int> 6, 3, NA, 4, 3, 0, 3, 6, 6, 6, 5, 3, 5, 4, 5, 3, NA,...

select() helpers

You can select columns based on criteria about the column names.

starts_with()

Select columns that start with a character string.

u <- disgust %>% select(starts_with("u"))
glimpse(u)
## Observations: 20,000
## Variables: 1
## $ user_id <int> 0, 1, 2, 2118, 2311, 3630, 4458, 4651, 4976, 5469, 606...

ends_with()

Select columns that end with a character string.

firstq <- disgust %>% select(ends_with("1"))
glimpse(firstq)
## Observations: 20,000
## Variables: 3
## $ moral1    <int> 5, 2, 1, 0, 4, 1, 3, 2, 6, 0, 4, 1, 1, 4, 2, 1, NA, ...
## $ sexual1   <int> 4, 3, 1, 4, 2, 0, 2, 4, 0, 3, 3, 1, 1, 4, 2, 0, NA, ...
## $ pathogen1 <int> 6, 3, NA, 5, 5, 6, 6, 5, 6, 5, 6, 4, 4, 5, 3, 5, NA,...

contains()

Select columns that contain a character string.

pathogen <- disgust %>% select(contains("pathogen"))
glimpse(pathogen)
## Observations: 20,000
## Variables: 7
## $ pathogen1 <int> 6, 3, NA, 5, 5, 6, 6, 5, 6, 5, 6, 4, 4, 5, 3, 5, NA,...
## $ pathogen2 <int> 1, 2, NA, 6, 5, 3, 4, 6, 6, 2, 5, 3, 4, 2, 3, 4, NA,...
## $ pathogen3 <int> 6, 3, 1, 4, 4, 1, 4, 6, 6, 4, 5, 1, 3, 0, 2, 3, NA, ...
## $ pathogen4 <int> 5, 3, NA, 6, 4, 1, 3, 4, 6, 4, 5, 1, 4, 4, 4, 1, NA,...
## $ pathogen5 <int> 4, 2, NA, 5, 5, 3, 3, 6, 0, 2, 5, 4, 5, 5, 5, 4, NA,...
## $ pathogen6 <int> 5, 3, NA, 5, 4, 1, 2, 1, 0, 2, 5, 1, 4, 5, 1, 2, NA,...
## $ pathogen7 <int> 6, 3, NA, 4, 3, 0, 3, 6, 6, 6, 5, 3, 5, 4, 5, 3, NA,...

num_range(prefix, range, width = NULL)

Select columns with a name that matches the pattern prefix#.

moral2_4 <- disgust %>% select(num_range("moral", 2:4))
glimpse(moral2_4)
## Observations: 20,000
## Variables: 3
## $ moral2 <int> 6, 2, 1, 1, 4, 5, 4, 4, 6, 1, 5, 2, 4, 4, 5, 3, NA, 5, ...
## $ moral3 <int> 4, 1, 1, 1, 4, NA, 3, 3, 6, 3, 6, 2, 3, 4, 4, 2, NA, 4,...
## $ moral4 <int> 6, 2, 1, 1, 4, 5, 4, 5, 0, 4, 5, 2, 4, 4, 6, 4, NA, 5, ...

Use width to set the number of digits with leading zeros. For example, num_range("var_", 8:10, width=2) selects columns var_08, var_09, and var_10.

filter()

Select rows by matching column criteria.

Select all rows where the user_id is 1 (that’s Lisa).

disgust %>% filter(user_id == 1)

Remember to use == and not = to check if two things are equivalent. A single = assigns the righthand value to the lefthand variable and (usually) evaluates to TRUE.

You can select on multiple criteria by separating them with commas.

amoral <- disgust %>% filter(
  moral1 == 0, 
  moral2 == 0,
  moral3 == 0, 
  moral4 == 0,
  moral5 == 0,
  moral6 == 0,
  moral7 == 0
)

You can use the symbols &, |, and ! to mean “and”, “or”, and “not”. You can also use other operators to make equations.

# everyone who chose either 0 or 7 for question moral1
moral_extremes <- disgust %>% 
  filter(moral1 == 0 | moral1 == 7)

# everyone who chose the same answer for all moral questions
moral_consistent <- disgust %>% 
  filter(
    moral2 == moral1 & 
      moral3 == moral1 & 
      moral4 == moral1 &
      moral5 == moral1 &
      moral6 == moral1 &
      moral7 == moral1
  )

# everyone who did not answer 7 for all 7 moral questions
moral_no_ceiling <- disgust %>%
  filter(moral1+moral2+moral3+moral4+moral5+moral6+moral7 != 7*7)

Sometimes you need to exclude some participant IDs for reasons that can’t be described in code. the %in% operator is useful here for testing if a column value is in a list. Surround the equation with parentheses and put ! in front to test that a value is not in the list.

no_researchers <- disgust %>%
  filter(!(user_id %in% c(1,2)))

You can use the lubridate package to work with dates. For example, you can use the year() function to return just the year from the date column and then select only data collected in 2010.

disgust2010 <- disgust  %>%
  filter(year(date) == 2010)

Or select data from at least 5 years ago. You can use the range function to check the minimum and maxiumum dates in the resulting dataset.

disgust_5ago <- disgust %>%
  filter(date < today() - dyears(5))

range(disgust_5ago$date)
## [1] "2008-07-10" "2012-11-03"

arrange()

Sort your dataset using arrange().

disgust_order <- disgust %>%
  arrange(id)

head(disgust_order)

Reverse the order using desc()

disgust_order <- disgust %>%
  arrange(desc(id))

head(disgust_order)

mutate()

Add new columns. This is one of the most useful functions in the tidyverse.

Refer to other columns by their names (unquoted). You can add more than one column, just separate the columns with a comma. Once you make a new column, you can use it in further column definitions e.g., total below).

disgust_total <- disgust %>%
  mutate(
    pathogen = pathogen1 + pathogen2 + pathogen3 + pathogen4 + pathogen5 + pathogen6 + pathogen7,
    moral = moral1 + moral2 + moral3 + moral4 + moral5 + moral6 + moral7,
    sexual = sexual1 + sexual2 + sexual3 + sexual4 + sexual5 + sexual6 + sexual7,
    total = pathogen + moral + sexual,
    user_id = paste0("U", user_id)
  )

You can overwrite a column by giving a new column the same name as the old column. Make sure that you mean to do this and that you aren’t trying to use the old column value after you redefine it.

summarise()

Create summary statistics for the dataset. Check the Data Wrangling Cheat Sheet or the Data Transformation Cheat Sheet for various summary functions. Some common ones are: mean(), sd(), n(), sum(), and quantile().

disgust_total %>%
  summarise(
    n = n(),
    q25 = quantile(total, .25, na.rm = TRUE),
    q50 = quantile(total, .50, na.rm = TRUE),
    q75 = quantile(total, .75, na.rm = TRUE),
    avg_total = mean(total, na.rm = TRUE),
    sd_total  = sd(total, na.rm = TRUE),
    min_total = min(total, na.rm = TRUE),
    max_total = max(total, na.rm = TRUE)
  )

group_by()

Create subsets of the data. You can use this to create summaries, like the mean value for all of your experimental groups.

Here, we’ll use mutate to create a new column called year, group by year, and calculate the average scores.

disgust_total %>%
  mutate(year = year(date)) %>%
  group_by(year) %>%
  summarise(
    n = n(),
    avg_total = mean(total, na.rm = TRUE),
    sd_total  = sd(total, na.rm = TRUE),
    min_total = min(total, na.rm = TRUE),
    max_total = max(total, na.rm = TRUE)
  )

You can use filter after group_by. The following example returns the lowest total score from each year.

disgust_total %>%
  mutate(year = year(date)) %>%
  select(user_id, year, total) %>%
  group_by(year) %>%
  filter(rank(total) == 1) %>%
  arrange(year)

You can also use mutate after group_by. The following example calculates subject-mean-centered scores by grouping the scores by user_id and then subtracting the group-specific mean from each score. Note the use of gather to tidy the data into a long format first.

disgust_smc <- disgust %>%
  gather("question", "score", moral1:pathogen7) %>%
  group_by(user_id) %>%
  mutate(score_smc = score - mean(score, na.rm = TRUE))

All Together

A lot of what we did above would be easier if the data were tidy, so let’s do that first. Then we can use group_by to calculate the domain scores.

It is good practice to use ungroup() after using group_by and summarise. Forgetting to ungroup the dataset won’t affect some further processing, but can really mess up other things.

Then we can spread out the 3 domains, calculate the total score, remove any rows with a missing (NA) total, and calculate mean values by year.

disgust_tidy <- read_csv("data/disgust.csv") %>%
  gather("question", "score", moral1:pathogen7) %>%
  separate(question, c("domain","q_num"), sep = -2) %>%
  group_by(id, user_id, date, domain) %>%
  summarise(score = mean(score)) %>%
  ungroup() 

disgust_tidy2 <- disgust_tidy %>%
  spread(domain, score) %>%
  mutate(
    total = moral + sexual + pathogen,
    year = year(date)
  ) %>%
  filter(!is.na(total)) %>%
  arrange(user_id) 

disgust_tidy3 <- disgust_tidy2 %>%
  group_by(year) %>%
  summarise(
    n = n(),
    avg_pathogen = mean(pathogen),
    avg_moral = mean(moral),
    avg_sexual = mean(sexual),
    first_user = first(user_id),
    last_user = last(user_id)
  )

disgust_tidy3