This post compares common data manipulation operations in dplyr and data.table.
For new-comers to R who are not aware, there are many ways to do the same thing in R. Depending on the purpose of the code (readability vs creating functions) and the size of the data, I for one often find myself switching from one flavour (or dialect) of R data manipulation to another. Generally, I prefer the dplyr style for its readability and intuitiveness (for myself), data.table for its speed in grouping and summarising operations,1 and base R when I am writing functions. This is by no means the R community consensus by the way (perfectly aware that I am venturing into a total minefield),2 but is more of a representation of how I personally navigate the messy (but awesome) R world.
In this post, I am going to list out some of the most common data manipulations in both styles:
group_by()
, summarise()
(a single column)group_by()
, summarise_at()
(multiple columns)filter()
, mutate()
mutate_at()
(changing multiple columns)case_when()
)There is a vast amount of resources out there on the internet on the comparison of dplyr and data.table. For those who love to get into the details, I would really recommend Atrebas’s seminal blog post that gives a comprehensive tour of dplyr and data.table, comparing the code side-by-side. I would also recommend this comparison of the three R dialects by Jason Mercer, which not only includes base R in its comparison, but also goes into a fair bit of detail on elements such as piping/chaining (%>%
). There’s also a very excellent cheat sheet from DataCamp, linked here.
Why write a new blog post then, you ask? One key (selfish / self-centred) reason is that I myself often refer to my blog for an aide-memoire on how to do a certain thing in R, and my notes are optimised to only contain my most frequently used code. They also contain certain idiosyncracies in the way that I code (e.g. using pipes with data.table), which I’d like to be upfront about - and would at the same time very much welcome any discussion on it. It is perhaps also justifiable that I at least attempted to build on and unify the work of others in this post, which I have argued as what is ultimately important in relation of duplicated R artefacts.
Rambling on… so here we go!
To make it easy to reproduce results, I am going to just stick to the good ol’ mtcars and iris datasets which come shipped with R. I will also err on the side of verbosity and load the packages at the beginning of each code chunk, as if each code chunk is its own independent R session.
1.group_by()
, summarise()
(a single column)
mpg
) value for each cylinder type in the mtcars dataset.max()
function by group.To group by and summarise values, you would run something like this in dplyr:
library(dplyr)
mtcars %>%
group_by(cyl) %>%
summarise(max_mpg = max(mpg), .groups = "drop_last")
You could do the same in data.table, and still use magrittr pipes:
library(data.table)
library(magrittr) # Or any package that imports the pipe (`%>%`)
mtcars %>%
as.data.table() %>%
.[,.(max_mpg = max(mpg)), by = cyl]
2. group_by()
, summarise_at()
(multiple columns)
Sepal.Width
and Sepal.Length
for each iris Species
in the iris dataset.mean()
function by group.Note: this is slightly different from the scenario above because the “summarisation” is applied to multiple columns.
In dplyr:
library(dplyr)
# Option 1
iris %>%
group_by(Species) %>%
summarise_at(vars(contains("Sepal")),~mean(.))
# Option 2
iris %>%
group_by(Species) %>%
summarise(across(contains("Sepal"), mean), .groups = "drop_last")
In data.table with pipes:
library(data.table)
library(magrittr) # Or any package that imports the pipe (`%>%`)
# Option 1
iris %>%
as.data.table() %>%
.[,lapply(.SD, mean), by = Species, .SDcols = c("Sepal.Length", "Sepal.Width")]
# Option 2
iris %>%
as.data.table() %>%
.[,lapply(.SD, mean), by = Species, .SDcols = names(.) %like% "Sepal"]
3. filter()
, mutate()
Sepal.Width
and Sepal.Length
would be for the iris species setosa
.Species=="setosa"
and create a new column called Sepal_Index
.In dplyr:
library(dplyr)
iris %>%
filter(Species == "setosa") %>%
mutate(Sepal_Index = Sepal.Width * Sepal.Length)
In data.table:
library(data.table)
library(magrittr) # Or any package that imports the pipe (`%>%`)
iris %>%
as.data.table() %>%
.[, Species := as.character(Species)] %>%
.[Species == "setosa"] %>%
.[, Sepal_Index := Sepal.Width * Sepal.Length] %>%
.[]
4. mutate_at()
(changing multiple columns)
Sepal.Width
and Sepal.Length
by 100.In dplyr:
library(dplyr)
# Option 1
iris %>%
mutate_at(vars(Sepal.Length, Sepal.Width), ~.*100)
# Option 2
iris %>%
mutate(across(starts_with("Sepal"), ~.*100))
In data.table with pipes:
library(data.table)
library(magrittr) # Or any package that imports the pipe (`%>%`)
sepal_vars <- c("Sepal.Length", "Sepal.Width")
iris %>%
as.data.table() %>%
.[,as.vector(sepal_vars) := lapply(.SD, function(x) x * 100), .SDcols = sepal_vars] %>%
.[]
5. Row-wise operations
This is always an awkward one, even for dplyr. For this, I will list a couple of options for row-wise calculations.
TotalSize
column by summing all four columns of Sepal.Length
, Sepal.Width
, Petal.Length
, and Petal.Width
.In dplyr:
library(dplyr)
# Option 1 - use `rowwise()`
iris %>%
rowwise() %>%
mutate(TotalSize = sum(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width))
# Option 2 - use `apply()` and `select()`
# Select all columns BUT `Species`
iris %>%
mutate(TotalSize = select(., -Species) %>% apply(MARGIN = 1, FUN = sum))
# Option 3 - `rowwise()` and `c_across()`
# Select all columns BUT `Species`
iris %>%
rowwise() %>%
mutate(TotalSize = sum(c_across(-Species)))
In data.table with pipes:
library(data.table)
library(magrittr) # Or any package that imports the pipe (`%>%`)
# Get all the column names in Species except for `Species`
all_vars <- names(iris)[names(iris) != "Species"]
iris %>%
as.data.table() %>%
.[, "Sepal_Total" := apply(.SD, 1, sum), .SDcols = all_vars] %>%
.[]
6. Vectorised multiple if-else (case_when()
)
Age
into different categoriesAgeLabel
based on the Age
variableIn dplyr:
library(dplyr)
age_data <- tibble(Age = seq(1, 100))
age_data %>%
mutate(AgeLabel = case_when(Age < 18 ~ "0 - 17",
Age < 35 ~ "18 - 34",
Age < 65 ~ "35 - 64",
TRUE ~ "65+"))
In data.table:
library(data.table)
library(magrittr) # Or any package that imports the pipe (`%>%`)
# Option 1 - without pipes
age_data <- data.table(Age = 0:100)
age_data[, AgeLabel := "65+"]
age_data[Age < 65, AgeLabel := "35-64"]
age_data[Age < 35, AgeLabel := "18-34"]
age_data[Age < 18, AgeLabel := "0-17"]
# Option 2 - with pipes
age_data2 <- data.table(Age = 0:100)
age_data2 %>%
.[, AgeLabel := "65+"] %>%
.[Age < 65, AgeLabel := "35-64"] %>%
.[Age < 35, AgeLabel := "18-34"] %>%
.[Age < 18, AgeLabel := "0-17"] %>%
.[]
One thing to note is that there are two options here - Option 2 with and Option 1 without using magrittr pipes. The reason why Option 1 is possible without any assignment (<-
) is because of reference semantics in data.table. When :=
is used in data.table, a change is made to the data.table object via ‘modify by reference’, without creating a copy of the data.table object; when you assign it to a new object, that is referred to as ‘modify by copy’.
As Tyson Barrett nicely summarises, this ‘modifying by reference’ behaviour in data.table is partly what makes it efficient, but can be surprising if you do not expect or understand it; however, the good news is that data.table gives you the option whether to modify by reference or by making a copy.
7. Function-writing: referencing a column with stringHere, I intentionally name the packages explicitly within the function and not load them, as it’s best practice for functions to be able to run on their own without loading in an entire library.
In dplyr:
multiply_three <- function(data, variable){
dplyr::mutate(data, !!rlang::sym(variable) := !!rlang::sym(variable) * 3)
}
multiply_three(iris, "Sepal.Length")
In data.table:
multiply_three <- function(data, variable){
dt <- data.table::as.data.table(data)
dt[, as.character(substitute(variable)) := get(variable) * 3]
dt[] # Print
}
multiply_three(iris, "Sepal.Length")
End Note
This is it! For anything with greater detail, please consult the blogs and cheat sheets I recommended at the beginning of this blog post. I’d say this covers 65% (not a strictly empirical statistic) of my needs for data manipulation, so I hope this is of some help to you. (The gather()
vs melt()
vs pivot_longer()
subject is a whole other beast, and ought to be dealt with in another post)
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4