I recently integrated the new `on` functionality into some code of mine that was being dragged down by repetitive key switching (see here for some context), so I was excited for `on` to (potentially) speed things up. I was quite surprised to find that the code actually ran about 30% slower (45 minutes instead of 35) using `on`.
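For context, here is a minimal sketch (toy data of my own, not the benchmark below) of the two join styles being compared: an ad hoc join via `on=` versus keying both tables first and joining on the key.

```r
library(data.table)

# toy tables, purely illustrative
A <- data.table(x = c("a", "b", "c"), v = 1:3)
B <- data.table(x = c("a", "c"), w = c(10, 20))

# ad hoc join: no keys needed, join column supplied via on=
A[B, on = "x"]

# keyed join: set the key on both tables, then join implicitly on the key
setkey(A, x)
setkey(B, x)
A[B]
```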
I was able to reproduce this using large data.table
s beefed up from @jangorecki's join_on
tests:
```r
library(data.table)
library(microbenchmark)  # for get_nanotime()

nn <- 1e6   # rows in DT1
mm <- 1e2   # rows in DT2
times <- 50L
set.seed(45L)

DT1 <- data.table(x = sample(letters[1:3], nn, TRUE), y = sample(6:10, nn, TRUE),
                  a = sample(100, nn, TRUE), b = runif(nn))
DT2 <- CJ(x = letters[1:3], y = 6:10)[, mul := sample(20, 15)][sample(15L, mm, TRUE)]

times2 <- times1 <- numeric(times)
for (ii in 1:times) {
  # ad hoc join with on=
  cp1 <- copy(DT1); cp2 <- copy(DT2)
  strt <- get_nanotime()
  cp1[cp2, on = "x", allow.cartesian = TRUE]
  stp <- get_nanotime()
  times1[ii] <- stp - strt

  # key both tables, then join (keying time is included in the timing)
  cp1 <- copy(DT1); cp2 <- copy(DT2)
  strt <- get_nanotime()
  setkey(cp1, x)[setkey(cp2, x), allow.cartesian = TRUE]
  stp <- get_nanotime()
  times2[ii] <- stp - strt
}
```
```r
> median(times1)/median(times2)
[1] 1.274535
```
So, about 27% slower here. Maybe I'm not understanding the purpose of `on`, but I thought the double-keyed approach should basically be an upper bound on how long `on` takes (a quick check that the two approaches return the same rows is sketched at the end). And indeed `on` is faster when the tables are smaller:
```r
nn <- 1e3
> median(times1)/median(times2)
[1] 0.9491699
```
So, roughly 5% faster when `DT1` is smaller.
```r
nn <- 1e6; mm <- 5
> median(times1)/median(times2)
[1] 0.9394226
```
Roughly 7% faster when `DT2` is smaller.
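For completeness, here is a quick sanity check (not part of the timings above) that both approaches return the same rows, so the gap is purely about speed rather than about doing different work. This assumes a data.table version that has the f* set operations; with nn = 1e6 the cartesian result is large, so a smaller nn may be more convenient.

```r
cp1 <- copy(DT1); cp2 <- copy(DT2)
res_on <- cp1[cp2, on = "x", allow.cartesian = TRUE]

cp1 <- copy(DT1); cp2 <- copy(DT2)
res_key <- setkey(cp1, x)[setkey(cp2, x), allow.cartesian = TRUE]

# TRUE if both results contain the same rows, ignoring row order
fsetequal(res_on, res_key)
```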