Having an ordered column seems to severely slow down a group_by %>% summarize operation:
library(dplyr)
library(data.table)
n <- 1e5
d <- tibble(x=sample(2000, n, TRUE),
            y=sample(800, n, TRUE),
            z=sample(5, n, TRUE) %>% ordered(levels=c("4", "1", "3", "2", "5")),
            val=runif(n))
system.time({
  y_dt <- data.table(d, key=c("x", "y")) %>%
    `[`(, .(w=val[which.max(z)]), by=list(x, y)) %>%
    as_tibble()
})
#> user system elapsed
#> 0.110 0.003 0.112
system.time({
  y_dp <- d %>%
    group_by(x, y) %>%
    summarize(w = val[which.max(z)])
})
#> user system elapsed
#> 14.117 12.375 26.591
If I change the type of z to a simple integer, the problem goes away:
# %<>% needs magrittr attached; plain reassignment works with dplyr alone
d <- d %>% mutate(z=as.integer(z))
system.time({
  y_dp <- d %>%
    group_by(x, y) %>%
    summarize(w = val[which.max(z)])
})
#> user system elapsed
#> 0.456 0.044 0.502
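As a sanity check on the integer workaround, here is a minimal sketch on a smaller data set. The sizes and the names by_factor and by_code are illustrative assumptions. as.integer() on an ordered factor returns the level codes, which follow the order of the levels, so which.max() should pick the same row either way:

```r
library(dplyr)

set.seed(1)
n <- 1000
d_small <- tibble(x = sample(20, n, TRUE),
                  y = sample(8, n, TRUE),
                  z = ordered(sample(5, n, TRUE), levels = c("4", "1", "3", "2", "5")),
                  val = runif(n))

# Summarize using the ordered factor directly
by_factor <- d_small %>%
  group_by(x, y) %>%
  summarize(w = val[which.max(z)]) %>%
  ungroup()

# Summarize using the integer level codes instead
by_code <- d_small %>%
  mutate(z = as.integer(z)) %>%
  group_by(x, y) %>%
  summarize(w = val[which.max(z)]) %>%
  ungroup()

# Both approaches should select the same value in every group
stopifnot(isTRUE(all.equal(by_factor$w, by_code$w)))
```

If the check passes, converting z before grouping changes only the timing, not the result.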
It is interesting to me that z isn't even used in the grouping operation. Maybe this happens because of copying data subsets for use in the summarize operation?
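One rough way to probe that guess outside of dplyr is to time per-group subsetting of an ordered factor against the same subsetting of its integer codes, in base R. This is only a sketch: it assumes the cost is dominated by subsetting each group's slice, which may not be what dplyr does internally, and the names z_ord, z_int, and groups are made up for illustration:

```r
set.seed(1)
n <- 2e4
x <- sample(2000, n, TRUE)
y <- sample(800, n, TRUE)
z_ord <- ordered(sample(5, n, TRUE), levels = c("4", "1", "3", "2", "5"))
z_int <- as.integer(z_ord)

# Row indices for each (x, y) group
groups <- split(seq_len(n), paste(x, y))

# Subset once per group: the factor dispatches to `[.factor`,
# while the integer vector uses plain internal subsetting
t_ord <- system.time(for (i in groups) z_ord[i])
t_int <- system.time(for (i in groups) z_int[i])

print(rbind(ordered = t_ord[1:3], integer = t_int[1:3]))
```

A large gap here would be consistent with the subsetting explanation, but it would not prove that this is where dplyr spends its time.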
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin17.7.0 (64-bit)
Running under: macOS High Sierra 10.13.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /usr/local/Cellar/openblas/0.3.6_1/lib/libopenblasp-r0.3.6.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.12.2 dplyr_0.8.3
loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 packrat_0.5.0 crayon_1.3.4 assertthat_0.2.1 R6_2.4.0
[6] magrittr_1.5 pillar_1.4.2 rlang_0.4.0 rstudioapi_0.10 tools_3.6.0
[11] glue_1.3.1 purrr_0.3.2 compiler_3.6.0 pkgconfig_2.0.2 tidyselect_0.2.5
[16] tibble_2.1.3
(Note: I originally had a reprex that exhibited the #4458 bug (see SO post https://stackoverflow.com/q/57097806/169947), but on my real data set it turns out to still be an issue in dplyr version 0.8.3.)