BUG: GroupBy.count() and GroupBy.sum() incorrectly return NaN instead of 0 for missing categories #35201
- `black pandas`
- `git diff upstream/master -u -- "*.py" | flake8 --diff`

Behavioural Changes
Fixing two related bugs: when grouping on multiple categoricals, `.sum()` and `.count()` would return `NaN` for the missing categories, but they are expected to return `0`. Both bugs are fixed.

Tests
Tests were added in PR #35022 when these bugs were discovered, and they were marked with an `xfail`. For this PR the `xfail` markers are removed and the tests pass normally. In addition, a few other existing tests were expecting `sum()` to return `NaN`; these have been updated so that they now expect `0` (which is the desired behaviour).

Pivot
The change to `.sum()` also impacts `df.pivot_table()` if it is called with `aggfunc=sum` and is pivoted on a Categorical column with `observed=False`. This is not explicitly mentioned in either of the bugs, but it does make the behaviour consistent (i.e. the sum of a missing category is zero, not `NaN`). One test in test_pivot.py was updated to reflect this change.

Default Behaviour
Because `df.groupby()` and `df.pivot_table()` have `observed=False` as the default, the default behaviour will change for a user calling `df.groupby().sum()` or `df.pivot_table(..., aggfunc='sum')` if they are grouping/pivoting on a categorical with missing categories. Previously the default would give them `NaN` for the missing categories; now the default will give them `0`.

What is the appropriate way to highlight/document this change to the default behaviour?
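
To make the change concrete, here is a minimal sketch of the kind of call that is affected. The frame and its column names (`cat_1`, `cat_2`, `value`) are made up for illustration and are not taken from the test suite. `observed=False` is passed explicitly so the snippet does not depend on the default. The values shown for the missing categories depend on whether the installed pandas includes this fix, so no output is quoted.

```python
import pandas as pd

# Hypothetical data: cat_1 declares a category "c" that never occurs, so the
# combinations ("b", "y"), ("c", "x") and ("c", "y") are missing from the data.
df = pd.DataFrame(
    {
        "cat_1": pd.Categorical(["a", "a", "b"], categories=["a", "b", "c"]),
        "cat_2": pd.Categorical(["x", "y", "x"], categories=["x", "y"]),
        "value": [1, 2, 3],
    }
)

# Grouping on multiple categoricals with observed=False yields one row per
# combination of categories (3 * 2 = 6), including the missing ones.
grouped = df.groupby(["cat_1", "cat_2"], observed=False)["value"]
sums = grouped.sum()
counts = grouped.count()

# With the fix, the missing combinations are expected to show 0 rather than NaN.
print(sums)
print(counts)

# The same aggregation through pivot_table; cells for missing categories are
# likewise expected to hold 0 rather than NaN once .sum() is fixed.
table = df.pivot_table(
    values="value",
    index="cat_1",
    columns="cat_2",
    aggfunc="sum",
    observed=False,
)
print(table)
```

A user who relied on the previous `NaN` results for missing categories would see different numbers from exactly these calls, which is why the change seems worth calling out.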