I recently integrated the new `on` functionality into some code of mine that was being dragged down by repetitive key switching (see here for some context), so I was excited for `on` to (potentially) speed things up. I was quite surprised to find that the code actually ran about 30% slower (45 minutes instead of 35) using `on`.
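For context, here is a minimal sketch (toy data of my own, not the benchmark below) of the two join styles being compared: an ad hoc join via `on=` versus keying both tables first and joining on the key.

```r
library(data.table)

# toy tables, purely illustrative
A <- data.table(x = c("a", "b", "c"), v = 1:3)
B <- data.table(x = c("a", "c"), w = c(10, 20))

# ad hoc join: no keys needed, join column supplied via on=
A[B, on = "x"]

# keyed join: set the key on both tables, then join implicitly on the key
setkey(A, x)
setkey(B, x)
A[B]
```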
I was able to reproduce this using large data.table
s beefed up from @jangorecki's join_on
tests:
```r
library(data.table)
library(microbenchmark)  # for get_nanotime()

nn <- 1e6   # rows in DT1
mm <- 1e2   # rows in DT2
times <- 50L
set.seed(45L)

DT1 <- data.table(x = sample(letters[1:3], nn, TRUE), y = sample(6:10, nn, TRUE),
                  a = sample(100, nn, TRUE), b = runif(nn))
DT2 <- CJ(x = letters[1:3], y = 6:10)[, mul := sample(20, 15)][sample(15L, mm, TRUE)]

times2 <- times1 <- numeric(times)
for (ii in 1:times) {
  # ad hoc join with on=
  cp1 <- copy(DT1); cp2 <- copy(DT2)
  strt <- get_nanotime()
  cp1[cp2, on = "x", allow.cartesian = TRUE]
  stp <- get_nanotime()
  times1[ii] <- stp - strt

  # key both tables, then join (keying time is included in the timing)
  cp1 <- copy(DT1); cp2 <- copy(DT2)
  strt <- get_nanotime()
  setkey(cp1, x)[setkey(cp2, x), allow.cartesian = TRUE]
  stp <- get_nanotime()
  times2[ii] <- stp - strt
}
```
```r
> median(times1)/median(times2)
[1] 1.274535
```
So, about 27% slower here. Maybe I'm not understanding the purpose of `on`, but I thought the double-keyed approach should basically be an upper bound on how long `on` takes (a quick check that the two approaches return the same rows is sketched at the end). And indeed `on` is faster when the tables are smaller:
```r
nn <- 1e3
> median(times1)/median(times2)
[1] 0.9491699
```
So, roughly 5% faster when `DT1` is smaller.
```r
nn <- 1e6; mm <- 5
> median(times1)/median(times2)
[1] 0.9394226
```
Roughly 7% faster when `DT2` is smaller.
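For completeness, here is a quick sanity check (not part of the timings above) that both approaches return the same rows, so the gap is purely about speed rather than about doing different work. This assumes a data.table version that has the f* set operations; with nn = 1e6 the cartesian result is large, so a smaller nn may be more convenient.

```r
cp1 <- copy(DT1); cp2 <- copy(DT2)
res_on <- cp1[cp2, on = "x", allow.cartesian = TRUE]

cp1 <- copy(DT1); cp2 <- copy(DT2)
res_key <- setkey(cp1, x)[setkey(cp2, x), allow.cartesian = TRUE]

# TRUE if both results contain the same rows, ignoring row order
fsetequal(res_on, res_key)
```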