BUG: groupby.describe on a frame with duplicate column names #50846
mroeschke merged 34 commits into pandas-dev:main from rhshadrach:groupby_select_obj_dup_cols Feb 3, 2023
Commits (34, all by rhshadrach; this view shows changes from 21 commits)
aa9c9e1 REF: groupby Series selection with as_index=False
7d00d07 GH#
fd62b4e Merge branch 'main' of https://github.com/pandas-dev/pandas into seri…
c0891db Merge branch 'main' into series_as_index_false
6bcfb12 Merge branch 'main' of https://github.com/pandas-dev/pandas into seri…
41399ad type-hinting fixes
c26957d WIP
f2b538e Merge branch 'main' of https://github.com/pandas-dev/pandas into owe_…
1860c4d WIP
e42e222 WIP
0bdf009 BUG: groupby.describe on a frame with duplicate column names
185e4f8 cleanup
d2b965f test fixup
932e3c8 Fix type-hint for _group_selection
5139df8 Merge branch 'main' of https://github.com/pandas-dev/pandas into grou…
8f132cd Merge branch 'groupby_select_obj_dup_cols' of https://github.com/rhsh…
eeea6fc Merge branch 'groupby_select_obj_dup_cols' of https://github.com/rhsh…
feb6661 Merge branch 'main' of https://github.com/pandas-dev/pandas into grou…
83f12b7 Speedup
c37a1ab refinement
973b893 Merge branch 'main' into groupby_select_obj_dup_cols
78a3d5f Merge branch 'main' of https://github.com/pandas-dev/pandas into grou…
4dafe5a cleanup, faster implementation
0959c1b Merge branch 'main' into groupby_select_obj_dup_cols
2fc97b2 Merge branch 'main' into groupby_select_obj_dup_cols
d5df78c Make group_selection a Boolean flag
62bb1fb Merge branch 'groupby_select_obj_dup_cols' of https://github.com/rhsh…
8d6df54 Avoid resetting cache
62540af Improve test
f7a6973 Merge branch 'main' of https://github.com/pandas-dev/pandas into grou…
615d9c6 Merge branch 'main' of https://github.com/pandas-dev/pandas into grou…
88a9ec9 Merge branch 'main' of https://github.com/pandas-dev/pandas into grou…
359d7ff Rework test
d1d2610 Merge branch 'groupby_select_obj_dup_cols' of https://github.com/rhsh…
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | @@ -1254,6 +1254,27 @@ def test_describe_with_duplicate_output_column_names(as_index, keys): | |
| tm.assert_frame_equal(result, expected) | ||
| | ||
| | ||
| def test_describe_duplicate_columns(): | ||
| # GH#50806 | ||
| df = DataFrame([[0, 1, 2, 3]]) | ||
| df.columns = [0, 1, 2, 0] | ||
| gb = df.groupby(df[1]) | ||
| result = gb.describe(percentiles=[]) | ||
| | ||
| columns = ["count", "mean", "std", "min", "50%", "max"] | ||
| frames = [ | ||
| DataFrame([[1.0, val, np.nan, val, val, val]], index=[1], columns=columns) | ||
| for val in (0.0, 2.0, 3.0) | ||
| ] | ||
| expected = pd.concat(frames, axis=1) | ||
| expected.columns = MultiIndex( | ||
| levels=[[0, 2], columns], | ||
| codes=[6 * [0] + 6 * [1] + 6 * [0], 3 * list(range(6))], | ||
| ) | ||
| expected.index.names = [1] | ||
| tm.assert_frame_equal(result, expected) | ||
| | ||
| | ||
| def test_groupby_mean_no_overflow(): | ||
| # Regression test for (#22487) | ||
| df = DataFrame( | ||
| | @@ -1594,3 +1615,29 @@ def test_multiindex_group_all_columns_when_empty(groupby_func): | |
| result = method(*args).index | ||
| expected = df.index | ||
| tm.assert_index_equal(result, expected) | ||
| | ||
| | ||
| def test_duplicate_columns(request, groupby_func, as_index): | ||
| # GH#50806 | ||
| if groupby_func == "corrwith": | ||
| msg = "GH#50845 - corrwith fails when there are duplicate columns" | ||
| request.node.add_marker(pytest.mark.xfail(reason=msg)) | ||
| df = DataFrame([[1, 3, 6], [1, 4, 7], [2, 5, 8]], columns=list("abb")) | ||
| args = get_groupby_method_args(groupby_func, df) | ||
| gb = df.groupby("a", as_index=as_index) | ||
| result = getattr(gb, groupby_func)(*args) | ||
| | ||
| if groupby_func in ("size", "ngroup", "cumcount"): | ||
| expected = getattr( | ||
| df.take([0, 1], axis=1).groupby("a", as_index=as_index), groupby_func | ||
| ||
| )(*args) | ||
| tm.assert_equal(result, expected) | ||
| else: | ||
| expected_df = df.copy() | ||
| expected_df.columns = ["a", "b", "c"] | ||
| expected_args = get_groupby_method_args(groupby_func, expected_df) | ||
| expected = getattr(expected_df.groupby("a", as_index=as_index), groupby_func)( | ||
| *expected_args | ||
| ) | ||
| expected = expected.rename(columns={"c": "b"}) | ||
| tm.assert_frame_equal(result, expected) | ||
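
The comparison strategy used by `test_duplicate_columns` above can be tried outside the test suite. The following is a minimal sketch, assuming a pandas version that includes this fix; it swaps in `sum` as one representative groupby method and uses the public `pd.testing.assert_frame_equal` rather than the internal `tm` helper. The variable name `renamed` is illustrative and does not appear in the PR.

```python
import pandas as pd

# Same frame as in the test: the label "b" appears twice.
df = pd.DataFrame([[1, 3, 6], [1, 4, 7], [2, 5, 8]], columns=list("abb"))

# Label-based selection cannot tell the two "b" columns apart:
# it returns a two-column DataFrame rather than a Series.
both_b = df["b"]

# Build the expected result on a copy with unique labels,
# then rename "c" back to "b" so the labels match the original frame.
renamed = df.copy()
renamed.columns = ["a", "b", "c"]
expected = renamed.groupby("a").sum().rename(columns={"c": "b"})

# Grouping the frame with duplicate labels should give the same result.
result = df.groupby("a").sum()
pd.testing.assert_frame_equal(result, expected)
```

Computing the expectation on uniquely labelled columns keeps the reference result independent of how duplicate labels are handled internally, which is presumably why the test takes this route.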
I think this is the same as _obj_with_exclusions, which should already be cached, so we could avoid making a copy
This is great - not only that, but we can also avoid all the code that determines `_group_selection`. I've turned `_group_selection` into a Boolean flag.
Should we expect this change to affect the timings in the OP?
ASVs updated; essentially the same results.
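
The caching point raised in the review thread can be illustrated with a toy model. This is a sketch only: `TinyGroupBy`, `key_positions`, and `computations` are hypothetical names, and the class is not pandas' `GroupBy`. It assumes the general idea that the object with grouping columns excluded is computed once, by position so that duplicate labels survive, and then reused instead of being rebuilt.

```python
from functools import cached_property


class TinyGroupBy:
    """Hypothetical stand-in used for illustration; not pandas' GroupBy."""

    def __init__(self, columns, rows, key_positions):
        self.columns = list(columns)
        self.rows = [list(row) for row in rows]
        self.key_positions = set(key_positions)
        self.computations = 0  # counts how often the selection is built

    @cached_property
    def obj_with_exclusions(self):
        # Select by position rather than by label, so that two columns
        # sharing the label "b" are each kept exactly once.
        self.computations += 1
        keep = [i for i in range(len(self.columns)) if i not in self.key_positions]
        labels = [self.columns[i] for i in keep]
        data = [[row[i] for i in keep] for row in self.rows]
        return labels, data


gb = TinyGroupBy(list("abb"), [[1, 3, 6], [1, 4, 7], [2, 5, 8]], key_positions=[0])
first = gb.obj_with_exclusions
second = gb.obj_with_exclusions  # served from the cache, no second build
```

Because the second access returns the cached object, any method that needs the non-grouping columns can share it without making another copy.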