RetroSearch Browse

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Showing content from https://martinctc.github.io/blog/comparing-common-operations-in-dplyr-and-data.table/ below:

Website Navigation

Comparing Common Operations in dplyr and data.table – Musings on R – A blog on all things R and Data Science by Martin Chan

Background

This post compares common data manipulation operations in dplyr and data.table.

For new-comers to R who are not aware, there are many ways to do the same thing in R. Depending on the purpose of the code (readability vs creating functions) and the size of the data, I for one often find myself switching from one flavour (or dialect) of R data manipulation to another. Generally, I prefer the dplyr style for its readability and intuitiveness (for myself), data.table for its speed in grouping and summarising operations,¹ and base R when I am writing functions. This is by no means the R community consensus by the way (perfectly aware that I am venturing into a total minefield),² but is more of a representation of how I personally navigate the messy (but awesome) R world.

In this post, I am going to list out some of the most common data manipulations in both styles:

group_by(), summarise() (a single column)
group_by(), summarise_at() (multiple columns)
filter(), mutate()
mutate_at() (changing multiple columns)
Row-wise operations
Vectorised multiple if-else (case_when())
Function-writing: referencing a column with string

There is a vast amount of resources out there on the internet on the comparison of dplyr and data.table. For those who love to get into the details, I would really recommend Atrebas’s seminal blog post that gives a comprehensive tour of dplyr and data.table, comparing the code side-by-side. I would also recommend this comparison of the three R dialects by Jason Mercer, which not only includes base R in its comparison, but also goes into a fair bit of detail on elements such as piping/chaining (%>%). There’s also a very excellent cheat sheet from DataCamp, linked here.

Why write a new blog post then, you ask? One key (selfish / self-centred) reason is that I myself often refer to my blog for an aide-memoire on how to do a certain thing in R, and my notes are optimised to only contain my most frequently used code. They also contain certain idiosyncracies in the way that I code (e.g. using pipes with data.table), which I’d like to be upfront about - and would at the same time very much welcome any discussion on it. It is perhaps also justifiable that I at least attempted to build on and unify the work of others in this post, which I have argued as what is ultimately important in relation of duplicated R artefacts.

Rambling on… so here we go!

To make it easy to reproduce results, I am going to just stick to the good ol’ mtcars and iris datasets which come shipped with R. I will also err on the side of verbosity and load the packages at the beginning of each code chunk, as if each code chunk is its own independent R session.

1. group_by(), summarise() (a single column)

Analysis: Maximum MPG (mpg) value for each cylinder type in the mtcars dataset.
Operations: Summarise with the max() function by group.

To group by and summarise values, you would run something like this in dplyr:

library(dplyr)

mtcars %>%
    group_by(cyl) %>%
    summarise(max_mpg = max(mpg), .groups = "drop_last")

You could do the same in data.table, and still use magrittr pipes:

library(data.table)
library(magrittr) # Or any package that imports the pipe (`%>%`)

mtcars %>%
    as.data.table() %>%
    .[,.(max_mpg = max(mpg)), by = cyl]

2. group_by(), summarise_at() (multiple columns)

Analysis: Average mean value for Sepal.Width and Sepal.Length for each iris Species in the iris dataset.
Operations: Summarise with the mean() function by group.

Note: this is slightly different from the scenario above because the “summarisation” is applied to multiple columns.

In dplyr:

library(dplyr)

# Option 1
iris %>%
    group_by(Species) %>%
    summarise_at(vars(contains("Sepal")),~mean(.))

# Option 2
iris %>%
  group_by(Species) %>%
  summarise(across(contains("Sepal"), mean), .groups = "drop_last")

In data.table with pipes:

library(data.table)
library(magrittr) # Or any package that imports the pipe (`%>%`)

# Option 1
iris %>%
    as.data.table() %>%
    .[,lapply(.SD, mean), by = Species, .SDcols = c("Sepal.Length", "Sepal.Width")]
    
# Option 2
iris %>%
  as.data.table() %>%
  .[,lapply(.SD, mean), by = Species, .SDcols = names(.) %like% "Sepal"]

3. filter(), mutate()

Analysis: Find out what the multiple of Sepal.Width and Sepal.Length would be for the iris species setosa.
Operations: Filter by Species=="setosa" and create a new column called Sepal_Index.

In dplyr:

library(dplyr)

iris %>%
    filter(Species == "setosa") %>%
    mutate(Sepal_Index = Sepal.Width * Sepal.Length)

In data.table:

library(data.table)
library(magrittr) # Or any package that imports the pipe (`%>%`)

iris %>%
    as.data.table() %>%
    .[, Species := as.character(Species)] %>%
    .[Species == "setosa"] %>%
    .[, Sepal_Index := Sepal.Width * Sepal.Length] %>%
  .[]

4. mutate_at() (changing multiple columns)

Analysis: Multiply Sepal.Width and Sepal.Length by 100.
Operations: As above

In dplyr:

library(dplyr)

# Option 1
iris %>%
    mutate_at(vars(Sepal.Length, Sepal.Width), ~.*100)

# Option 2
iris %>%
  mutate(across(starts_with("Sepal"), ~.*100))

In data.table with pipes:

library(data.table)
library(magrittr) # Or any package that imports the pipe (`%>%`)


sepal_vars <- c("Sepal.Length", "Sepal.Width")

iris %>%
  as.data.table() %>%
  .[,as.vector(sepal_vars) := lapply(.SD, function(x) x * 100), .SDcols = sepal_vars] %>%
  .[]

5. Row-wise operations

This is always an awkward one, even for dplyr. For this, I will list a couple of options for row-wise calculations.

Analysis: Create a TotalSize column by summing all four columns of Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width.
Operations: As above

In dplyr:

library(dplyr)

# Option 1 - use `rowwise()`
iris %>%
  rowwise() %>%
  mutate(TotalSize = sum(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width))

# Option 2 - use `apply()` and `select()`
# Select all columns BUT `Species`
iris %>%
  mutate(TotalSize = select(., -Species) %>% apply(MARGIN = 1, FUN = sum))

# Option 3 - `rowwise()` and `c_across()`
# Select all columns BUT `Species`
iris %>%
  rowwise() %>%
  mutate(TotalSize = sum(c_across(-Species)))

In data.table with pipes:

library(data.table)
library(magrittr) # Or any package that imports the pipe (`%>%`)

# Get all the column names in Species except for `Species`
all_vars <- names(iris)[names(iris) != "Species"]

iris %>%
  as.data.table() %>%
  .[, "Sepal_Total" := apply(.SD, 1, sum), .SDcols = all_vars] %>%
  .[]

6. Vectorised multiple if-else (case_when())

Analysis: Classify an Age into different categories
Operations: Create a new column called AgeLabel based on the Age variable

In dplyr:

library(dplyr)

age_data <- tibble(Age = seq(1, 100))

age_data %>%
  mutate(AgeLabel = case_when(Age < 18 ~ "0 - 17",
                              Age < 35 ~ "18 - 34",
                              Age < 65 ~ "35 - 64",
                              TRUE ~ "65+"))

In data.table:

library(data.table)
library(magrittr) # Or any package that imports the pipe (`%>%`)

# Option 1 - without pipes
age_data <- data.table(Age = 0:100)
age_data[, AgeLabel := "65+"]
age_data[Age < 65, AgeLabel := "35-64"]
age_data[Age < 35, AgeLabel := "18-34"]
age_data[Age < 18, AgeLabel := "0-17"]        

# Option 2 - with pipes
age_data2 <- data.table(Age = 0:100)

age_data2 %>%
  .[, AgeLabel := "65+"] %>%
  .[Age < 65, AgeLabel := "35-64"] %>%
  .[Age < 35, AgeLabel := "18-34"] %>%
  .[Age < 18, AgeLabel := "0-17"] %>%
  .[]

One thing to note is that there are two options here - Option 2 with and Option 1 without using magrittr pipes. The reason why Option 1 is possible without any assignment (<-) is because of reference semantics in data.table. When := is used in data.table, a change is made to the data.table object via ‘modify by reference’, without creating a copy of the data.table object; when you assign it to a new object, that is referred to as ‘modify by copy’.

As Tyson Barrett nicely summarises, this ‘modifying by reference’ behaviour in data.table is partly what makes it efficient, but can be surprising if you do not expect or understand it; however, the good news is that data.table gives you the option whether to modify by reference or by making a copy.

7. Function-writing: referencing a column with string

Requirement: Create a function that will multiply a column by three. A string should be supplied to the argument to specify the column to be multiplied. The function returns the original data frame with the modified column.

Here, I intentionally name the packages explicitly within the function and not load them, as it’s best practice for functions to be able to run on their own without loading in an entire library.

In dplyr:

multiply_three <- function(data, variable){
  
  dplyr::mutate(data, !!rlang::sym(variable) := !!rlang::sym(variable) * 3)
}

multiply_three(iris, "Sepal.Length")

In data.table:

(See https://stackoverflow.com/questions/45982595/r-using-get-and-data-table-within-a-user-defined-function)

multiply_three <- function(data, variable){
  
  dt <- data.table::as.data.table(data)
  dt[, as.character(substitute(variable)) := get(variable) * 3]
  dt[] # Print
}

multiply_three(iris, "Sepal.Length")

End Note

This is it! For anything with greater detail, please consult the blogs and cheat sheets I recommended at the beginning of this blog post. I’d say this covers 65% (not a strictly empirical statistic) of my needs for data manipulation, so I hope this is of some help to you. (The gather() vs melt() vs pivot_longer() subject is a whole other beast, and ought to be dealt with in another post)

RetroSearch is an open source project built by @garambo | Open a GitHub Issue

Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo

HTML: 3.2 | Encoding: UTF-8 | Version: 0.7.4