Get Numba 0.11 (not 0.12 yet) from numba.pydata.org. Now we can JIT-compile this code with LLVM:
# plain NumPy version
import numpy as np

def foobar(mixinsize, count, xs, mixins, acc):
    for i in xrange(count):
        k = xs[i]
        acc[k:k + mixinsize] += mixins[i, :]

# LLVM compiled version
from numba import jit, void, int64, double
signature = void(int64, int64, int64[:], double[:, :], double[:])
foobar_jit = jit(signature)(foobar)
if __name__ == "__main__":
    from time import clock

    blocksize = 1000  # Chosen at runtime.
    mixinsize = 100   # Chosen at runtime.
    count = 100000    # Chosen at runtime.

    xs = np.random.randint(0, blocksize + 1, count)
    mixins = np.empty((count, mixinsize))
    acc = np.zeros(blocksize + mixinsize)

    t0 = clock()
    foobar(mixinsize, count, xs, mixins, acc)
    t1 = clock()
    print("elapsed time: %g ms" % (1000*(t1-t0),))

    t2 = clock()
    foobar_jit(mixinsize, count, xs, mixins, acc)
    t3 = clock()
    print("elapsed time with numba jit: %g ms" % (1000*(t3-t2),))
    print("speedup factor: %g" % ((t1-t0)/(t3-t2),))
$ python test_numba.py
elapsed time: 590.632 ms
elapsed time with numba jit: 12.31 ms
speedup factor: 47.9799
OK, so that is almost a 50x speedup from just three additional lines of Python code. (Since we passed an explicit signature, jit compiles the function up front, so the timed call does not include compilation overhead.)
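Those three lines can equivalently be written as a decorator; here is a minimal sketch of that form (foobar_jit2 is just an illustrative name):

from numba import jit, void, int64, double

# equivalent decorator form: the explicit signature triggers
# compilation when the decorator is applied
@jit(void(int64, int64, int64[:], double[:, :], double[:]))
def foobar_jit2(mixinsize, count, xs, mixins, acc):
    for i in xrange(count):
        k = xs[i]
        acc[k:k + mixinsize] += mixins[i, :]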
Now we can also test a plain C version for comparison, using clang/LLVM as the compiler.
void foobar(long mixinsize, long count,
            long *xs, double *mixins, double *accumulator)
{
    long i, j;
    double *cur, *acc;
    for (i = 0; i < count; i++) {
        acc = accumulator + xs[i];
        cur = mixins + i*mixinsize;
        for (j = 0; j < mixinsize; j++)
            *acc++ += *cur++;
    }
}
We load the shared library with ctypes and declare the argument types using ndpointer from numpy.ctypeslib:

from numpy.ctypeslib import ndpointer
import ctypes

so = ctypes.CDLL('./plainc.so')
foobar_c = so.foobar
foobar_c.restype = None
foobar_c.argtypes = (
    ctypes.c_long,
    ctypes.c_long,
    ndpointer(dtype=np.int64, ndim=1),
    ndpointer(dtype=np.float64, ndim=2),
    ndpointer(dtype=np.float64, ndim=1),
)
t4 = clock()
foobar_c(mixinsize, count, xs, mixins, acc)
t5 = clock()
print("elapsed time with plain C: %g ms" % (1000*(t5-t4),))
We compile the C code to a shared library and run the benchmark again:
$ cc -Ofast -shared -m64 -o plainc.so plainc.c
$ python test_numba.py
elapsed time: 599.136 ms
elapsed time with numba jit: 11.958 ms
speedup factor: 50.1034
elapsed time with plain C: 5.472 ms
So Numba is about half as fast as the plain C version when the latter is optimized with -Ofast. For comparison, the C run time with -O2 was about 8 ms, which puts the Numba JIT-compiled Python at roughly two thirds of the speed of C with -O2 (8 ms / 12 ms ≈ 0.67). That is not bad for just three additional lines of Python code.
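The -O2 figure presumably comes from rebuilding the shared library with the optimization flag swapped, along these lines:
$ cc -O2 -shared -m64 -o plainc.so plainc.c
$ python test_numba.py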
For comparison, we can also look at a plain Python version:
def foobar_py(mixinsize, count, xs, mixins, acc):
    for i in xrange(count):
        k = xs[i]
        for j in xrange(mixinsize):
            acc[j+k] += mixins[i][j]

# convert NumPy arrays to plain Python lists
# (in Python 2, map returns a list)
_xs = map(int, xs)
_mixins = [map(float, mixins[i, :]) for i in xrange(count)]
_acc = map(float, acc)

t6 = clock()
foobar_py(mixinsize, count, _xs, _mixins, _acc)
t7 = clock()
print("elapsed time with plain Python: %g ms" % (1000*(t7-t6),))
This Python code executed in 1775 ms. Thus, relative to plain Python we got roughly a 3x speedup using NumPy, a 150x speedup using Numba, and a 325x speedup using C with -Ofast.
A word of caution from Donald Knuth, who attributed this to C. A. R. Hoare: "Premature optimization is the root of all evil in computer programming." While these might seem like impressive relative speedups, the absolute gain from going down this route only saved us some milliseconds of CPU time. Was it really worth my time to save the CPU that amount of labour? Is it worth your time? Decide for yourself.