Showing content from https://github.com/tidyverse/dplyr/issues/4501 below:

group_by() performance problem with ordered factors · Issue #4501 · tidyverse/dplyr · GitHub

Having an ordered factor column seems to badly slow down a group_by() %>% summarize() operation:

library(dplyr)
library(data.table)

n <- 1e5

d <- tibble(x=sample(2000, n, TRUE),
            y=sample(800, n, TRUE),
            z=sample(5, n, TRUE) %>% ordered(levels=c("4", "1", "3", "2", "5")),
            val=runif(n))

system.time({
  y_dt <- data.table(d, key=c("x", "y")) %>%
    `[`(, .(w=val[which.max(z)]), by=list(x, y)) %>%
    as_tibble()
})
#>    user  system elapsed 
#>   0.110   0.003   0.112 

system.time({
  y_dp <- d %>%
    group_by(x, y) %>%
    summarize(w = val[which.max(z)])
})
#>    user  system elapsed 
#>  14.117  12.375  26.591 

If I change the type of z to a simple integer, the problem goes away:

# %<>% needs library(magrittr), which is not attached above, so assign explicitly
d <- d %>% mutate(z=as.integer(z))
system.time({
  y_dp <- d %>%
    group_by(x, y) %>%
    summarize(w = val[which.max(z)])
})
#>    user  system elapsed 
#>   0.456   0.044   0.502 

It is interesting to me that z isn't even used in the grouping operation. Maybe this happens because of copying data subsets for use in the summarize operation?
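As a possible workaround, here is a minimal sketch that leaves the ordered column in place and ranks on its integer level codes through a temporary column, then checks that both versions pick the same rows. The column name z_code, the seed, and the smaller n are assumptions added for illustration; they are not part of the original report. It also assumes which.max() ranks an ordered factor by its level codes, as the integer comparison above suggests.

```r
library(dplyr)

set.seed(1)  # assumption: seed added only so the sketch is reproducible
n <- 1e4     # smaller than the reported 1e5 so the slow path finishes quickly

d <- tibble(x = sample(200, n, TRUE),
            y = sample(80, n, TRUE),
            z = ordered(sample(5, n, TRUE), levels = c("4", "1", "3", "2", "5")),
            val = runif(n))

# Path reported as slow on dplyr 0.8.3: which.max() applied to the ordered factor.
slow <- d %>%
  group_by(x, y) %>%
  summarize(w = val[which.max(z)]) %>%
  ungroup()

# Workaround sketch: rank on the integer level codes in a temporary column
# (z_code is a hypothetical name), keeping z itself unchanged.
fast <- d %>%
  mutate(z_code = as.integer(z)) %>%
  group_by(x, y) %>%
  summarize(w = val[which.max(z_code)]) %>%
  ungroup()

# Both versions should select the same value in every group.
identical(slow$w, fast$w)
```

Wrapping each pipeline in system.time() shows whether the gap reproduces on a given dplyr version. This only sidesteps the slowdown and does not explain its cause.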

> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin17.7.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /usr/local/Cellar/openblas/0.3.6_1/lib/libopenblasp-r0.3.6.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.12.2 dplyr_0.8.3      

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1       packrat_0.5.0    crayon_1.3.4     assertthat_0.2.1 R6_2.4.0        
 [6] magrittr_1.5     pillar_1.4.2     rlang_0.4.0      rstudioapi_0.10  tools_3.6.0     
[11] glue_1.3.1       purrr_0.3.2      compiler_3.6.0   pkgconfig_2.0.2  tidyselect_0.2.5
[16] tibble_2.1.3 

(Note: my original reprex exhibited the #4458 bug (see the SO post https://stackoverflow.com/q/57097806/169947), but on my real data set the slowdown turns out to persist in dplyr version 0.8.3.)

