While experimenting with the KMeans class I observed hugely varying running times for the same problem under different versions of numpy. A systematic check revealed the following:
Statistics (figures are repeatable with little variance):

```
time:  7.67, Err: 0.01360200 python= 3.6.13 sklearn= 0.23.2 numpy 1.16.4
time:  8.20, Err: 0.01360200 python= 3.6.13 sklearn= 0.23.2 numpy 1.18.5
time: 40.85, Err: 0.01360200 python= 3.6.13 sklearn= 0.23.2 numpy 1.19.0
time:  9.69, Err: 0.01361583 python= 3.6.13 sklearn= 0.24.2 numpy 1.16.4
time: 10.21, Err: 0.01361583 python= 3.6.13 sklearn= 0.24.2 numpy 1.18.5
time: 17.68, Err: 0.01361583 python= 3.6.13 sklearn= 0.24.2 numpy 1.19.0
time:  7.74, Err: 0.01360200 python= 3.7.10 sklearn= 0.23.2 numpy 1.18.5
time: 40.00, Err: 0.01360200 python= 3.7.10 sklearn= 0.23.2 numpy 1.19.0
time: 40.47, Err: 0.01360200 python= 3.7.10 sklearn= 0.23.2 numpy 1.21.1
time: 10.37, Err: 0.01361583 python= 3.7.10 sklearn= 0.24.2 numpy 1.18.5
time: 17.80, Err: 0.01361583 python= 3.7.10 sklearn= 0.24.2 numpy 1.19.0
time: 17.54, Err: 0.01361583 python= 3.7.10 sklearn= 0.24.2 numpy 1.21.1
time:  7.82, Err: 0.01360200 python= 3.8.10 sklearn= 0.23.2 numpy 1.18.5
time: 40.54, Err: 0.01360200 python= 3.8.10 sklearn= 0.23.2 numpy 1.19.0
time: 40.29, Err: 0.01360200 python= 3.8.10 sklearn= 0.23.2 numpy 1.21.1
time: 10.25, Err: 0.01361583 python= 3.8.10 sklearn= 0.24.2 numpy 1.18.5
time: 17.06, Err: 0.01361583 python= 3.8.10 sklearn= 0.24.2 numpy 1.19.0
time: 17.20, Err: 0.01361583 python= 3.8.10 sklearn= 0.24.2 numpy 1.21.1
time: 14.45, Err: 0.01360200 python= 3.9.6 sklearn= 0.23.2 numpy 1.21.1
time: 17.44, Err: 0.01361583 python= 3.9.6 sklearn= 0.24.2 numpy 1.21.1
```
Steps/Code to Reproduce
Here is the Python benchmark I used to produce each of the lines above. It generates three random data sets (1000, 5000, and 10000 points) and runs k-means++ with k in {50, 100}. The running time of the fit() method is aggregated, as is the normalized error (inertia divided by the number of samples).
```python
# file 'bench.py'
from sklearn.cluster import KMeans
import numpy as np
from time import time
import sys
import sklearn

def bench():
    np.random.seed(2)  # fix random generator
    # generate datasets of size 1000, 5000, 10000
    # run k-means++ with k = 50, 100
    Ds = []
    for n in [1000, 5000, 10000]:
        Ds.append(np.random.random(size=(n, 2)))
    ks = [50, 100]
    t_total = 0
    err = 0
    for D in Ds:
        for k in ks:
            km = KMeans(n_clusters=k)
            t0 = time()
            km.fit(D)
            t = time() - t0
            t_total += t
            err += km.inertia_ / len(D)
    return f"time: {t_total:5.2f}, Err: {err:.8f}"

print(bench(), "python=", sys.version.split()[0],
      "sklearn=", sklearn.__version__,
      "numpy", np.__version__)
```
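As a side note, the Err figures are repeatable because KMeans is called without random_state, so it draws its seeds from numpy's global RNG, which np.random.seed(2) fixes. A single, explicitly seeded benchmark iteration would look like the sketch below (a hypothetical variant, not the benchmark as run; the exact inertia values differ because the seeding path changes):

```python
from sklearn.cluster import KMeans
import numpy as np

# Hypothetical variant of one benchmark iteration with explicit seeding.
np.random.seed(2)
D = np.random.random(size=(1000, 2))        # smallest dataset from bench()
km = KMeans(n_clusters=50, random_state=2)  # int seed pins KMeans' own RNG
km.fit(D)
print(km.inertia_ / len(D))                 # normalized error for this run
```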
The environments were created with conda; e.g., the one for Python 3.9 and scikit-learn 0.24.2 was set up as follows:
```
conda create --name py3.9-sk0.24.2 python=3.9
conda activate py3.9-sk0.24.2
pip install scikit-learn==0.24.2
pip install numpy==1.21.1
```
Then the benchmark above was run in each environment.
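Assuming the script is saved as bench.py, per the comment at its top, the invocation would be:

```
python bench.py
```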
Expected Results
I was expecting similar execution times across the different numpy versions, or even better performance for newer versions.
Actual Results
The performance dropped sharply when switching from numpy 1.18.5 to 1.19.0. This was the case for Python 3.6, 3.7, and 3.8, and for both scikit-learn 0.23.2 and 0.24.2. The performance hit was larger for scikit-learn 0.23.2 than for scikit-learn 0.24.2.
With numpy 1.18.5, the performance of scikit-learn 0.23.2 was better than that of scikit-learn 0.24.2.
The computed error values were identical for a given version of scikit-learn and differed only slightly between scikit-learn 0.23.2 and 0.24.2.
Question: What could be the reason for the huge performance deterioration with all numpy versions from 1.19.0 onward?
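One way to narrow this down (my assumption about where to look, not something the benchmark above tests) would be to check which BLAS/OpenMP runtimes each environment actually loads, since the library bundled with the numpy wheels may have changed at 1.19.0. A minimal sketch using numpy's show_config and threadpoolctl (installed as a scikit-learn dependency since 0.23):

```python
# Diagnostic sketch: list the BLAS/OpenMP runtimes loaded in this environment.
import numpy as np
from threadpoolctl import threadpool_info

np.show_config()  # build-time BLAS/LAPACK configuration of this numpy
for lib in threadpool_info():
    print(lib["internal_api"], lib.get("version"), "num_threads:", lib["num_threads"])
```

To rule out thread oversubscription, the benchmark could also be re-run with all thread pools pinned to a single thread:

```
OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 MKL_NUM_THREADS=1 python bench.py
```

If the 1.18.5-to-1.19.0 gap disappears under this setting, the regression would point to a threading interaction rather than a slower numerical kernel.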
Versions
(see table above)