Aggregations on timestamp[ns][pyarrow] extremely slow compared to datetime64[ns] (pandas-dev/pandas issue #55031)

Pandas version checks

Reproducible Example

See the test below, which shows that the min aggregation (the same holds for other aggregation functions such as mean, max, first, last, ...) on a timestamp[ns][pyarrow] column is extremely slow compared to the equivalent operation on datetime64[ns].

import timeit
import numpy as np
import pandas as pd
import pyarrow as pa

def create_sample_dataframe(rows=2000):
    data = {
        'int_col1': np.random.randint(1, 10000000, size=rows),
        'int_col2': np.random.randint(1, 10000000, size=rows),
        'int_col3': np.random.randint(1, 10000000, size=rows),
        'int_col4': np.random.randint(1, 10000000, size=rows),
        'start_time': pd.to_datetime(np.random.randint(1_000_000_000, 2_000_000_000, size=rows), unit='s')
    }
    return pd.DataFrame(data)


# Function to perform aggregation and measure execution time
def measure_execution_time(df, use_pyarrow=False):
    if use_pyarrow:
        # Convert the 'start_time' column to timestamp[ns][pyarrow]
        df['start_time'] = df['start_time'].astype(pd.ArrowDtype(pa.timestamp(unit="ns")))
    else:
        # Keep the 'start_time' column as numpy-backed datetime64[ns]
        df['start_time'] = pd.to_datetime(df['start_time'], unit='ns')

    # Group by the four int64 columns and aggregate the 'start_time' column using min
    start_time = timeit.default_timer()
    result = df.groupby(['int_col1', 'int_col2', 'int_col3', 'int_col4'])['start_time'].min()
    end_time = timeit.default_timer()
    return end_time - start_time

# Create a sample dataframe
df = create_sample_dataframe(100000)

# In the case of datetime64[ns] dtype:
print(measure_execution_time(df, use_pyarrow=False))
# 0.1226182999998855 sec

# In the case of timestamp[ns][pyarrow] dtype:
print(measure_execution_time(df, use_pyarrow=True))
# 8.103240000000369 sec
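
A possible mitigation, until the groupby path handles Arrow-backed timestamps as fast as numpy-backed ones, is to round-trip through datetime64[ns]: cast the Arrow column to the numpy dtype, aggregate, then cast the result back. The sketch below is only an illustration under that assumption and reuses the imports and column names from the example above; groupby_min_via_numpy is a hypothetical helper name, and its timing is not measured here.

# Workaround sketch (not part of the original report): round-trip the
# Arrow-backed timestamp column through numpy datetime64[ns] so the
# groupby hits the fast numpy path, then cast the result back.
def groupby_min_via_numpy(df):  # hypothetical helper name
    arrow_dtype = pd.ArrowDtype(pa.timestamp(unit="ns"))
    # Cast to the numpy-backed dtype before grouping
    tmp = df.assign(start_time=df['start_time'].astype('datetime64[ns]'))
    result = tmp.groupby(['int_col1', 'int_col2', 'int_col3', 'int_col4'])['start_time'].min()
    # Cast the aggregated values back to the Arrow timestamp dtype
    return result.astype(arrow_dtype)

The returned Series keeps the timestamp[ns][pyarrow] dtype, so downstream code that expects the Arrow-backed column should be unaffected.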
Installed Versions

INSTALLED VERSIONS

commit : ba1cccd
python : 3.10.12.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.90.1-microsoft-standard-WSL2
Version : #1 SMP Fri Jan 27 02:56:13 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.0
numpy : 1.24.3
pytz : 2022.7
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.2.1
Cython : None
pytest : 7.4.0
hypothesis : 6.82.0
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.2
html5lib : None
pymysql : None
psycopg2 : 2.8.6
jinja2 : 3.1.2
IPython : 8.12.2
pandas_datareader : None
bs4 : None
bottleneck : 1.3.5
dataframe-api-compat: None
fastparquet : None
fsspec : 2023.4.0
gcsfs : 2023.4.0
matplotlib : None
numba : 0.57.1
numexpr : 2.8.4
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 12.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.4.39
tables : None
tabulate : None
xarray : None
xlrd : 2.0.1
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

Prior Performance

No response

