Description
Code Sample, a copy-pastable example if possible
```python
# Imports
import pandas as pd
import numpy as np

# Define x and bins
x = np.linspace(0, 1, 100)
bins = np.arange(0, 1, 0.1)

# Create Series object with the x-data
series = pd.Series(x, name='X')

# Group series by provided bins
data_cut = pd.cut(series, bins, include_lowest=True)
grp = series.groupby(by=data_cut)

# Calculate the 0.5 quantile of this group
perc = grp.quantile()

# For every group, print the 0.5 quantile as determined by that group and its values
for group, indices in grp.indices.items():
    print()
    print("Bin:", group)
    print("Group quantile:", perc.loc[group])
    grp_series = grp.get_group(group)
    print("Group values quantile:", grp_series.quantile())
```

Problem description
NOTE: I see that several other open issues already discuss problems with groupby and quantile in v1.0.3, but I figured I would post this anyway, as it is a simple and easily reproducible example.
Executing the code above in pandas v1.0.3 (or any v1.0.x version) prints two quantiles per group that disagree for every group, even though they should be identical.
Instead, the quantile computed through the GroupBy object appears to be shifted by one group.
The shift becomes two groups when using bins = np.arange(0, 0.9, 0.1) and disappears when using bins = np.arange(0, 1.1, 0.1) (or the equivalent bins = np.linspace(0, 1, 11)).
The problem does not occur in pandas v0.24.x, which gives the correct output.
Additionally, replacing x and bins with

```python
# Define x and bins
x = np.linspace(1, 10, 100)
bins = 10**np.arange(0, 1, 0.1)
```

to use logarithmically spaced bins (which are not of equal width) does not produce a simple shift, but values that at first glance make no sense at all.
After some more testing, it seems that the quantile method of a GroupBy object first goes through all values that are not in any group (their group indices are NaN) and only then goes through the remaining data normally.
I am not sure whether that is helpful.
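To make that observation easier to check, here is a minimal sketch that counts how many values fall outside every bin for the three bin choices mentioned above. The helper name `count_unbinned` is made up for this illustration and is not part of pandas. The sketch only counts values; it does not prove that the unbinned values cause the shift.

```python
import numpy as np
import pandas as pd


def count_unbinned(values, bins):
    """Return how many entries of `values` fall outside every bin (hypothetical helper)."""
    cut = pd.cut(pd.Series(values), bins, include_lowest=True)
    # pd.cut marks values outside all bins as NaN
    return int(cut.isna().sum())


x = np.linspace(0, 1, 100)

# The three bin definitions discussed in this report
candidates = {
    "np.arange(0, 1, 0.1)": np.arange(0, 1, 0.1),
    "np.arange(0, 0.9, 0.1)": np.arange(0, 0.9, 0.1),
    "np.linspace(0, 1, 11)": np.linspace(0, 1, 11),
}

for label, bins in candidates.items():
    print(label, "->", count_unbinned(x, bins), "unbinned values")
```

For this x, the counts should be 10, 20 and 0. Each bin holds 10 values, so these counts are consistent with the observed shifts of one group, two groups and no shift.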
Expected Output
The expected output is that the two different values that are printed are always the same for each group, regardless of the bins that are used.
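One way to state this expectation as a check is sketched below. It computes a reference quantile per bin without calling GroupBy.quantile() and compares it against the grouped result. It assumes the grouped result lists the bins in category order, and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

x = np.linspace(0, 1, 100)
bins = np.arange(0, 1, 0.1)
series = pd.Series(x, name="X")
data_cut = pd.cut(series, bins, include_lowest=True)

# Integer code of the bin each value belongs to; -1 marks values outside every bin
codes = data_cut.cat.codes

# Reference: 0.5 quantile of each bin, selected with a plain boolean mask
reference = {
    interval: series[codes == k].quantile()
    for k, interval in enumerate(data_cut.cat.categories)
}

# Result under test: the same quantile computed through GroupBy
# (assumed to be ordered by category, matching `reference`)
grouped = series.groupby(data_cut, observed=False).quantile().to_numpy()

for (interval, expected), actual in zip(reference.items(), grouped):
    status = "ok" if np.isclose(actual, expected) else "MISMATCH"
    print(interval, "expected:", expected, "groupby:", actual, status)
```

On an affected version, the lines are expected to report MISMATCH; on a version without the problem, every line should report ok.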
Output of pd.show_versions()
 INSTALLED VERSIONS
commit : None
 python : 3.7.4.final.0
 python-bits : 64
 OS : Linux
 OS-release : 4.15.0-91-generic
 machine : x86_64
 processor : x86_64
 byteorder : little
 LC_ALL : None
 LANG : nl_NL.UTF-8
 LOCALE : nl_NL.UTF-8
pandas : 1.0.3
 numpy : 1.18.1
 pytz : 2019.3
 dateutil : 2.8.1
 pip : 20.0.2
 setuptools : 46.0.0.post20200309
 Cython : None
 pytest : 5.4.1
 hypothesis : None
 sphinx : 2.4.4
 blosc : None
 feather : None
 xlsxwriter : None
 lxml.etree : None
 html5lib : None
 pymysql : None
 psycopg2 : None
 jinja2 : 2.11.1
 IPython : 7.13.0
 pandas_datareader: None
 bs4 : None
 bottleneck : None
 fastparquet : None
 gcsfs : None
 lxml.etree : None
 matplotlib : 3.2.0
 numexpr : None
 odfpy : None
 openpyxl : None
 pandas_gbq : None
 pyarrow : None
 pytables : None
 pytest : 5.4.1
 pyxlsb : None
 s3fs : None
 scipy : 1.4.1
 sqlalchemy : None
 tables : None
 tabulate : None
 xarray : None
 xlrd : None
 xlwt : None
 xlsxwriter : None
 numba : None