-
- Notifications
You must be signed in to change notification settings - Fork 19.3k
BUG: Categorical.unique() preserves categories #15439
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| you are changing |
| res = cat.unique() | ||
| exp = Index(["a", "b", "c"]) | ||
| self.assert_index_equal(res.categories, exp) | ||
| tm.assert_categorical_equal(res, Categorical(exp)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nothing should be changing in this file
| tm.assert_index_equal(df.groupby('A', sort=False).first().index, index) | ||
| | ||
| # ordered=False | ||
| df = DataFrame({'A': pd.Categorical(list('ba'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC from the issue there are 4 cases? (product of sort=True/False, ordered=True/False)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
groupby above and below is called twice (sort=True/False) for each case of ordered. There are four assertions.
0d0d0bf to 8aea45f Compare | tm.assert_index_equal(df.groupby('A', sort=False).first().index, index) | ||
| | ||
| # ordered=False | ||
| df = DataFrame({'A': pd.Categorical(list('ba'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
groupby above and below is called twice (sort=True/False) for each case of ordered. There are four assertions.
pandas/core/groupby.py Outdated
| else: | ||
| cat = self.grouper.unique() | ||
| cat.add_categories([c for c in self.grouper.categories | ||
| if c not in cat.categories], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How strongly would self.grouper.categories[~self.grouper.categories.isin(cat.categories)] be preferred?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i like that better! (also vectorized).
as an aside, prob should add an asv for categorical grouping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kind of like this? I don't know what I'm doing. 8) flake8 diff surely doesn't pass now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can do:
git diff master | flake8 --diff for flake8 checking locally :>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, but the asv additions cannot.
bafd99d to 81a2183 Compare doc/source/whatsnew/v0.20.0.txt Outdated
| | ||
| - Bug in ``resample``, where a non-string ```loffset`` argument would not be applied when resampling a timeseries (:issue:`13218`) | ||
| | ||
| - Bug in ``.groupby`` where ```.groupby(categorical, sort=False)`` would raise ``ValueError`` due to non-matching categories (:issue:`13179`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need a subsection for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A whole subsection for an obvious bug? What would the title be?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how is this an obvious bug? a sub-section is not very long just showing what it did before and what it does now. Your whatsnew is so non-obvious now (and just changing wording is not going to help). An example is worth 1000 words.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant obvious as in the combination of sort=False and a missing category in the data causing:
ValueError: items in new_categories are not the same as in old categories
There is no other change in behavior except this now works as expected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its not obvious at all. you can simply use the test example, show the old behavior and the new
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Is this, by chance, a subsection of Bug Fixes? It doesn't feel like api_breaking nor a particularly mentionable enhancement ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this?
pandas/core/categorical.py Outdated
| | ||
| # unlike np.unique, unique1d does not sort | ||
| unique_codes = unique1d(self.codes) | ||
| unique_codes = unique1d(self.codes).astype(self.codes.dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use .astype(copy=False)
this is actually a bug (separate)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs a ticket?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually you don't need this change. This is just for taking (which needs to be int64 anyhow; it will upcast it later if its not).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case of groupby it will be upcast, but in general, .unique() should be type preserving?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, never mind. :)
pandas/core/groupby.py Outdated
| # fix bug #GH8868 sort=False being ignored in categorical | ||
| # groupby | ||
| else: | ||
| cat = self.grouper.unique() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm I am wondering why we are doing any of this. isn't self.grouper.categories the end result that we want anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but in case of sort=False, some expect the result categories sorted in the order of encounter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh, ok for that case then (though try to use vectorized operations here). In fact probably create a method on Categorical, call it ._codes_for_grouping(sort=True/False) where you can handle this in a method (rather than clutter it up here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is vectorized, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.grouper can also be a CategoricalIndex. How do I get one from the other?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep that is fine (though return a new object, never use inplace).
add the method to a CI as well (which just calls it on the .values), its a pass thru
Codecov Report
@@ Coverage Diff @@ ## master #15439 +/- ## ========================================== + Coverage 90.37% 90.37% +<.01% ========================================== Files 135 135 Lines 49476 49481 +5 ========================================== + Hits 44713 44720 +7 + Misses 4763 4761 -2
Continue to review full report at Codecov.
|
8f47bd9 to a7cd756 Compare pandas/core/categorical.py Outdated
| doc=_categories_doc) | ||
| | ||
| def _codes_for_groupby(self, sort=False): | ||
| """ Return a Categorical adjusted for groupby """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you can add a doc-string (Parameters, Returns), just like its a publick method.
a7cd756 to 52c1bc8 Compare
jreback left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks pretty good
just like to give the docs another pass
| def _codes_for_groupby(self, sort): | ||
| """ | ||
| Return a Categorical adjusted for groupby | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adjusted is not very clear ; can you make this a bit more clear (and the comments below)
pretend you have never seen this and see it from a new reader perspective
4842bc7 to 82ff3d4 Compare
jorisvandenbossche left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
I only have one question: should we have different behaviour of sort=False depending on whether the categorical is ordered or not?
Because now the sort=False seems to be ignored in case of an ordered categorical?
doc/source/whatsnew/v0.20.0.txt Outdated
| GroupBy on Categoricals | ||
| ^^^^^^^^^^^^^^^^^^^^^^^ | ||
| | ||
| In previous version, ``.groupby(..., sort=False)`` would fail with a ``ValueError`` when grouping on a categorical series with some categories not appearing in the data. (:issue:`13179`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
previous version -> previous versions
82ff3d4 to 55733b8 Compare
hmm, @kernc can you take a look at the original issue (in the reference from the code) and see if you can divine what is happening? |
| Indeed, the order is retained by Semantically, I don't think the order should be lost if the categorical is ordered. The relationships This lgtm; please edit to your liking. |
| thanks @kernc nice PR! keep em coming! |
closes pandas-dev#13179 Author: Kernc <kerncece@gmail.com> Closes pandas-dev#15439 from kernc/Categorical.unique-nostrip-unused and squashes the following commits: 55733b8 [Kernc] fixup! BUG: Fix .groupby(categorical, sort=False) failing 2aec326 [Kernc] fixup! BUG: Fix .groupby(categorical, sort=False) failing c813146 [Kernc] PERF: add asv for categorical grouping 0c550e6 [Kernc] BUG: Fix .groupby(categorical, sort=False) failing
git diff upstream/master | flake8 --diff