Closed
Labels
Benchmark (Performance (ASV) benchmarks), Categorical (Categorical Data Type), Groupby
Description
we have a number of groupby benchmarks with categoricals, but I think we need a comprehensive set to exercise combinations of:
- groupby on cat/object columns
- cython functions (e.g. first/max/...)
- .agg variants of the cython functions

xref SO
In [4]: import pandas as pd
   ...: import numpy as np
   ...: animals = ['Dog', 'Cat']
   ...: days = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
   ...: N = 1000000
   ...: df = pd.DataFrame({'animals': np.array(animals).take(np.random.randint(0, len(animals), size=N)),
   ...:                    'days': np.array(days).take(np.random.randint(0, len(days), size=N))})
   ...: df2 = df.copy()
   ...: df2['animals'] = df2['animals'].astype('category')
   ...:
   ...: df3 = df2.copy()
   ...: df3['animals'] = df3['animals'].cat.codes
   ...:
   ...: # group on object, aggregate cat
   ...: print('groupby on object')
   ...: %timeit df.groupby('days').agg({'animals': 'first'})
   ...: %timeit df2.groupby('days').agg({'animals': 'first'})
   ...:
   ...: # group on cat, aggregate cat
   ...: print('groupby on cat / codes / agg')
   ...: %timeit df.groupby('animals').agg({'animals': 'first'})
   ...: %timeit df2.groupby('animals').agg({'animals': 'first'})
   ...: %timeit df3.groupby('animals').agg({'animals': 'first'})
   ...:
   ...: print('groupby on cat / codes / cython')
   ...: %timeit df2.groupby('animals').first()
   ...: %timeit df3.groupby('animals').first()
   ...:
[1] groupby on object
270 ms +- 5.22 ms per loop (mean +- std. dev. of 7 runs, 1 loop each)
118 ms +- 1.96 ms per loop (mean +- std. dev. of 7 runs, 10 loops each)
[2] groupby on cat / codes / agg
147 ms +- 2.53 ms per loop (mean +- std. dev. of 7 runs, 10 loops each)
69.1 ms +- 1.56 ms per loop (mean +- std. dev. of 7 runs, 10 loops each)
22.2 ms +- 838 us per loop (mean +- std. dev. of 7 runs, 10 loops each)
[3] groupby on cat / codes / cython
156 ms +- 4.32 ms per loop (mean +- std. dev. of 7 runs, 1 loop each)
169 ms +- 4.8 ms per loop (mean +- std. dev. of 7 runs, 1 loop each)

so [3] should be as fast as [2], culprit is here
we could have multiple PRs to solve this issue (benchmark & perf fix)
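For the benchmark half, the combinations listed in the description could be covered by a single parametrized class along these lines. This is only a rough sketch in the usual ASV style (`params`, `param_names`, `setup`, `time_*`). The class name `GroupByCategoricalCombos`, the parameter labels, the reduced `N`, and the choice of `days` as the aggregated column are all illustrative assumptions, not anything taken from the existing benchmark suite.

```python
import numpy as np
import pandas as pd


class GroupByCategoricalCombos:
    """Hypothetical ASV-style benchmark; all names here are illustrative."""

    # ASV times every combination of these parameter values.
    params = [
        ["object", "category", "codes"],  # dtype of the grouping key
        ["first", "max"],                 # reduction to apply
        ["direct", "agg"],                # gb.first() vs gb.agg({...: 'first'})
    ]
    param_names = ["key_dtype", "func", "style"]

    N = 100_000  # assumption: smaller than the 1,000,000 rows used in the issue

    def setup(self, key_dtype, func, style):
        rng = np.random.default_rng(0)
        animals = np.array(["Dog", "Cat"], dtype=object)
        days = np.array(
            ["Sunday", "Monday", "Tuesday", "Wednesday",
             "Thursday", "Friday", "Saturday"],
            dtype=object,
        )
        df = pd.DataFrame(
            {
                "animals": animals.take(rng.integers(0, len(animals), size=self.N)),
                "days": days.take(rng.integers(0, len(days), size=self.N)),
            },
            dtype=object,
        )
        if key_dtype == "category":
            df["animals"] = df["animals"].astype("category")
        elif key_dtype == "codes":
            # Integer codes of the categorical, as df3 in the issue.
            df["animals"] = df["animals"].astype("category").cat.codes
        self.df = df

    def time_groupby(self, key_dtype, func, style):
        # observed=False is passed explicitly so the categorical key
        # does not rely on the version-dependent default.
        gb = self.df.groupby("animals", observed=False)
        if style == "direct":
            getattr(gb, func)()
        else:
            gb.agg({"days": func})
```

Aggregating `days` instead of `animals` is a deliberate deviation from the snippet above: it lets `max` run for every key dtype, since `max` is not defined for an unordered categorical column.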