
R^6: moRe Readable dplyR summaRize indentation Recommendation Ramble

Taking a cue from hrbrmstr, here’s a quick post!

I’m encouraging others in the R community to don [R^6] and do their own small, focused posts on topics that would help the R community learn things.


dplyr pipelines can sometimes be long and confusing, especially with multiple group_by calls in succession. I propose we start indenting the grouped portions of the pipeline.
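To make the idea concrete, here is a minimal sketch on toy data. The `users` data frame and its columns are invented purely for illustration, and the `.groups` argument assumes dplyr 1.0.0 or later. Each `group_by()` stays at the pipeline's base indent, and the verbs that operate on that grouping are indented one level further.

```r
library(dplyr)

# Toy data, invented for illustration: one row per user sign-up.
users <- data.frame(
  uid   = c(1, 2, 3, 4, 5, 6),
  year  = c(2016, 2016, 2016, 2017, 2017, 2017),
  month = c("Jan", "Jan", "Feb", "Jan", "Feb", "Feb")
)

monthly <- users %>%
  group_by(year, month) %>%
    summarize(n_new = n_distinct(uid), .groups = "drop") %>% # new users per year-month
  group_by(month) %>%
    arrange(year) %>%
    mutate(prev_year = lag(n_new)) %>% # same month, previous year
    ungroup()

print(monthly)
```

R ignores the extra whitespace, so the indentation is purely a reading aid: it marks where each grouped block starts and where `ungroup()` closes it.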

The following code is an example of a standard reporting pipeline that counts new user creation by month. The indents clearly show the two steps: summarize users by year-month, then add the goal based on last year’s numbers.

getUsers() %>% # Returns a data frame
  mutate(date  = as.Date(created), # created is a date-time object
         ay    = date_to_ay(date),
         month = date_to_aymonth(date)) %>%
  filter(date >= start_date,
         date <= end_date) %>%
  group_by(ay, month) %>%
    summarize(n_new    = n_distinct(uid),
              date     = as.Date(min(date))) %>% #New users by month
  group_by(month) %>% #Cheat to get monthly goal
    arrange(date) %>%
    mutate(goal = new_user.goal * lag(n_new)) %>%
    ungroup() %>%