Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this issue exists on the latest version of pandas.
- I have confirmed this issue exists on the main branch of pandas.
 
Reproducible Example
Converting the columns to pyarrow-backed dtypes makes the groupby-resample in this example drastically slower: 1.68 s per loop with the default NumPy dtypes versus 36.4 s per loop with pyarrow dtypes. I did a bit of profiling, and `agg` looks like the culprit. I did not test on the main branch, because PEP 668 makes installing a development build cumbersome. Two existing issues may be related but are not the same: #50121, #46505
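For anyone who wants to repeat the profiling step, here is a minimal sketch using `cProfile` from the standard library. It is not the exact session used for this report: `make_frame`, `resample_5min` and `profile_call` are hypothetical helper names, the frame is a smaller stand-in with only two data columns, and `use_pyarrow=True` assumes pyarrow is installed.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd


def make_frame(n_sids: int = 20, use_pyarrow: bool = False) -> pd.DataFrame:
    """Build a small frame shaped like the one in this report (hypothetical helper)."""
    rng = np.random.default_rng(123)
    timestamps = pd.date_range("2023-01-01", "2023-01-02", freq="1min")
    frames = [
        pd.DataFrame({
            "sid": sid,
            "timestamp": timestamps,
            "A": rng.integers(50, 150, len(timestamps)),
            "E": rng.integers(50, 150, len(timestamps)),
        })
        for sid in range(1000, 1000 + n_sids)
    ]
    df = pd.concat(frames, ignore_index=True)
    if use_pyarrow:
        # Assumption: pyarrow is installed; these are the dtypes used in the report.
        df = df.astype({"sid": "uint16[pyarrow]",
                        "A": "int16[pyarrow]",
                        "E": "int16[pyarrow]"})
    return df


def resample_5min(df: pd.DataFrame) -> pd.DataFrame:
    """Same groupby-resample-agg chain as in the report, on two columns only."""
    return (
        df.sort_values(["sid", "timestamp"])
        .set_index("timestamp")
        .groupby("sid")
        .resample("5min", label="left", closed="left")
        .agg({"A": "first", "E": "sum"})
        .reset_index()
    )


def profile_call(func, *args, top: int = 10) -> str:
    """Run func(*args) under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    buf = io.StringIO()
    pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(top)
    return buf.getvalue()


if __name__ == "__main__":
    # Compare this report with one for make_frame(use_pyarrow=True).
    print(profile_call(resample_5min, make_frame()))
```

Comparing the two reports side by side should show where the extra cumulative time goes in the pyarrow case.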
```python
from datetime import datetime

import numpy as np
import pandas as pd

symbols = 1000
start = datetime(2023, 1, 1)
end = datetime(2023, 1, 2)
data_cols = ['A', 'B', 'C', 'D', 'E']
agg_props = {'A': 'first', 'B': 'max', 'C': 'min', 'D': 'last', 'E': 'sum'}
base, sample = '1min', '5min'


def pandas_resample(df: pd.DataFrame):
    return (df
            .sort_values(['sid', 'timestamp'])
            .set_index('timestamp')
            .groupby('sid')
            .resample(sample, label='left', closed='left')
            .agg(agg_props)
            .reset_index()
            )


_rng = np.random.default_rng(123)
timestamps = pd.date_range(start, end, freq=base)
df = pd.DataFrame({
    'timestamp': pd.DatetimeIndex(timestamps),
    **{_col: _rng.integers(50, 150, len(timestamps)) for _col in data_cols},
})
ids = pd.DataFrame({'sid': _rng.integers(1000, 2000, symbols)})
df['id'] = 1
ids['id'] = 1
full_df = ids.merge(df, on='id').drop(columns=['id'])
```

```
%timeit pandas_resample(full_df.copy())
1.68 s ± 50.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

```python
full_df['sid'] = full_df['sid'].astype("uint16[pyarrow]")
full_df['A'] = full_df['A'].astype("int16[pyarrow]")
full_df['B'] = full_df['B'].astype("int16[pyarrow]")
full_df['C'] = full_df['C'].astype("int16[pyarrow]")
full_df['D'] = full_df['D'].astype("int16[pyarrow]")
full_df['E'] = full_df['E'].astype("int16[pyarrow]")
```

```
%timeit pandas_resample(full_df.copy())
36.4 s ± 1.26 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Installed Versions
INSTALLED VERSIONS
commit : 965ceca
 python : 3.11.3.final.0
 python-bits : 64
 OS : Linux
 OS-release : 6.4.2-arch1-1
 Version : #1 SMP PREEMPT_DYNAMIC Thu, 06 Jul 2023 18:35:54 +0000
 machine : x86_64
 processor :
 byteorder : little
 LC_ALL : en_GB.UTF-8
 LANG : en_GB.UTF-8
 LOCALE : en_GB.UTF-8
pandas : 2.0.2
 numpy : 1.25.0
 pytz : 2023.3
 dateutil : 2.8.2
 setuptools : 68.0.0
 pip : 23.1.2
 Cython : 0.29.36
 pytest : 7.4.0
 hypothesis : 6.75.3
 sphinx : 7.0.1
 blosc : None
 feather : None
 xlsxwriter : None
 lxml.etree : 4.9.2
 html5lib : 1.1
 pymysql : None
 psycopg2 : None
 jinja2 : 3.1.2
 IPython : 8.14.0
 pandas_datareader: None
 bs4 : 4.12.2
 bottleneck : None
 brotli : None
 fastparquet : None
 fsspec : 2022.11.0
 gcsfs : None
 matplotlib : 3.7.1
 numba : None
 numexpr : None
 odfpy : None
 openpyxl : 3.1.2
 pandas_gbq : None
 pyarrow : 10.0.1
 pyreadstat : None
 pyxlsb : None
 s3fs : None
 scipy : 1.11.1
 snappy : None
 sqlalchemy : None
 tables : None
 tabulate : None
 xarray : 2023.1.0
 xlrd : None
 zstandard : None
 tzdata : 2023.3
 qtpy : None
 pyqt5 : None
Prior Performance
No response