While experimenting with the KMeans class I observed hugely varying running times for the same problem under different versions of numpy. A systematic check revealed the following:
Statistics (figures are repeatable with little variance):

```
time:  7.67, Err: 0.01360200 python= 3.6.13 sklearn= 0.23.2 numpy 1.16.4
time:  8.20, Err: 0.01360200 python= 3.6.13 sklearn= 0.23.2 numpy 1.18.5
time: 40.85, Err: 0.01360200 python= 3.6.13 sklearn= 0.23.2 numpy 1.19.0
time:  9.69, Err: 0.01361583 python= 3.6.13 sklearn= 0.24.2 numpy 1.16.4
time: 10.21, Err: 0.01361583 python= 3.6.13 sklearn= 0.24.2 numpy 1.18.5
time: 17.68, Err: 0.01361583 python= 3.6.13 sklearn= 0.24.2 numpy 1.19.0
time:  7.74, Err: 0.01360200 python= 3.7.10 sklearn= 0.23.2 numpy 1.18.5
time: 40.00, Err: 0.01360200 python= 3.7.10 sklearn= 0.23.2 numpy 1.19.0
time: 40.47, Err: 0.01360200 python= 3.7.10 sklearn= 0.23.2 numpy 1.21.1
time: 10.37, Err: 0.01361583 python= 3.7.10 sklearn= 0.24.2 numpy 1.18.5
time: 17.80, Err: 0.01361583 python= 3.7.10 sklearn= 0.24.2 numpy 1.19.0
time: 17.54, Err: 0.01361583 python= 3.7.10 sklearn= 0.24.2 numpy 1.21.1
time:  7.82, Err: 0.01360200 python= 3.8.10 sklearn= 0.23.2 numpy 1.18.5
time: 40.54, Err: 0.01360200 python= 3.8.10 sklearn= 0.23.2 numpy 1.19.0
time: 40.29, Err: 0.01360200 python= 3.8.10 sklearn= 0.23.2 numpy 1.21.1
time: 10.25, Err: 0.01361583 python= 3.8.10 sklearn= 0.24.2 numpy 1.18.5
time: 17.06, Err: 0.01361583 python= 3.8.10 sklearn= 0.24.2 numpy 1.19.0
time: 17.20, Err: 0.01361583 python= 3.8.10 sklearn= 0.24.2 numpy 1.21.1
time: 14.45, Err: 0.01360200 python= 3.9.6 sklearn= 0.23.2 numpy 1.21.1
time: 17.44, Err: 0.01361583 python= 3.9.6 sklearn= 0.24.2 numpy 1.21.1
```
Steps/Code to Reproduce
Here is the Python benchmark I used to produce each of the lines above. It generates three random data sets (1000, 5000, and 10000 points) and runs k-means++ with k in {50, 100}. The running time of the fit() method is aggregated, as is the normalized error (inertia divided by the number of samples).
```python
# file 'bench.py'
from sklearn.cluster import KMeans
import numpy as np
from time import time
import sys
import sklearn

def bench():
    np.random.seed(2)  # fix random generator
    # generate datasets of size 1000, 5000, 10000
    # run k-means++ with k = 50, 100
    Ds = []
    for n in [1000, 5000, 10000]:
        Ds.append(np.random.random(size=(n, 2)))
    ks = [50, 100]
    t_total = 0
    err = 0
    for D in Ds:
        for k in ks:
            km = KMeans(n_clusters=k)
            t0 = time()
            km.fit(D)
            t = time() - t0
            t_total += t
            err += km.inertia_ / len(D)
    return f"time: {t_total:5.2f}, Err: {err:.8f}"

print(bench(), "python=", sys.version.split()[0],
      "sklearn=", sklearn.__version__,
      "numpy", np.__version__)
```
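As a side note, the Err figures are repeatable because KMeans is called without random_state, so it draws its seeds from numpy's global RNG, which np.random.seed(2) fixes. A single, explicitly seeded benchmark iteration would look like the sketch below (a hypothetical variant, not the benchmark as run; the exact inertia values differ because the seeding path changes):

```python
from sklearn.cluster import KMeans
import numpy as np

# Hypothetical variant of one benchmark iteration with explicit seeding.
np.random.seed(2)
D = np.random.random(size=(1000, 2))        # smallest dataset from bench()
km = KMeans(n_clusters=50, random_state=2)  # int seed pins KMeans' own RNG
km.fit(D)
print(km.inertia_ / len(D))                 # normalized error for this run
```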
The environments were created with conda; e.g., the one for Python 3.9 and scikit-learn 0.24.2 was set up as follows:
```
conda create --name py3.9-sk0.24.2 python=3.9
conda activate py3.9-sk0.24.2
pip install scikit-learn==0.24.2
pip install numpy==1.21.1
```
Then the benchmark above was run in each environment.
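Assuming the script is saved as bench.py, per the comment at its top, the invocation would be:

```
python bench.py
```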
Expected Results
I was expecting similar execution times across the different numpy versions, or even better performance for newer versions.
Actual Results
The performance dropped sharply when switching from numpy 1.18.5 to 1.19.0. This was the case for Python 3.6, 3.7, and 3.8, and for both scikit-learn 0.23.2 and 0.24.2. The performance hit was larger for scikit-learn 0.23.2 than for scikit-learn 0.24.2.
With numpy 1.18.5, the performance of scikit-learn 0.23.2 was better than that of scikit-learn 0.24.2.
The computed error values were identical for a given version of scikit-learn and differed only slightly between scikit-learn 0.23.2 and 0.24.2.
Question: What could be the reason for the huge performance deterioration with all numpy versions from 1.19.0 onward?
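One way to narrow this down (my assumption about where to look, not something the benchmark above tests) would be to check which BLAS/OpenMP runtimes each environment actually loads, since the library bundled with the numpy wheels may have changed at 1.19.0. A minimal sketch using numpy's show_config and threadpoolctl (installed as a scikit-learn dependency since 0.23):

```python
# Diagnostic sketch: list the BLAS/OpenMP runtimes loaded in this environment.
import numpy as np
from threadpoolctl import threadpool_info

np.show_config()  # build-time BLAS/LAPACK configuration of this numpy
for lib in threadpool_info():
    print(lib["internal_api"], lib.get("version"), "num_threads:", lib["num_threads"])
```

To rule out thread oversubscription, the benchmark could also be re-run with all thread pools pinned to a single thread:

```
OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 MKL_NUM_THREADS=1 python bench.py
```

If the 1.18.5-to-1.19.0 gap disappears under this setting, the regression would point to a threading interaction rather than a slower numerical kernel.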
Versions
(see table above)