-
- Notifications
You must be signed in to change notification settings - Fork 19.3k
ENH union_categoricals supports ignore_order GH13410 #15219
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
jorisvandenbossche left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
Added a few comments. Can you add a whatsnew note in v0.20.0.txt and add something to the docs about this (http://pandas.pydata.org/pandas-docs/stable/categorical.html#unioning)
Can you also add a test case with differing categories (not only different order)? So a case that would raise in case of ignore_ordered=False (eg the first test case of test_union_categoricals_sort but with ordered categories)
pandas/tools/tests/test_concat.py Outdated
| tm.assert_categorical_equal(res, exp) | ||
| | ||
| res = union_categoricals([c1, c1], ignore_order=True) | ||
| exp = Categorical([1, 2, 3, 1, 2, 3], ordered=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the difference with this and the test case above? (ordered=False is the default)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was intended to test two ordered categoricals with identical categories and orders. The above was a mixed ordered and unordered with identical categories.
I did remove ordered=False from this test and other tests since it is the default.
pandas/tools/tests/test_concat.py Outdated
| tm.assert_categorical_equal(res, exp) | ||
| | ||
| c1 = Categorical([1, 2, 3], categories=[3, 2, 1], ordered=True) | ||
| c2 = Categorical([1, 2, 3], ordered=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can leave this out, and use the c1 and c2 from above (but just swap them in the code [c2, c1])
pandas/types/concat.py Outdated
| If true, resulting categories will be lexsorted, otherwise | ||
| they will be ordered as they appear in the data. | ||
| ignore_order: boolean, default False | ||
| If true, ordered categories will be ignored. Results in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"ordered categories" -> "the ordered attribute of the categorical" / "whether the categorical is ordered or not" ? as the categories itself are not ignored, only its "orderedness"
pandas/types/concat.py Outdated
| indexer = categories.get_indexer(first.categories) | ||
| new_codes = take_1d(indexer, new_codes, fill_value=-1) | ||
| elif all(not c.ordered for c in to_union): | ||
| elif ignore_order | all(not c.ordered for c in to_union): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| -> or
| raise TypeError('Categorical.ordered must be the same') | ||
| | ||
| if ignore_order: | ||
| ordered = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think ordered is already False? (line 263) Is this still needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The if statement on line 264 can be entered if the ordered categoricals have the same categories and order.
is_dtype_equal checks categories and ordering
| Thanks for the comments. Pull request has been updated. |
Codecov Report
@@ Coverage Diff @@ ## master #15219 +/- ## ========================================== + Coverage 90.37% 90.37% +<.01% ========================================== Files 135 135 Lines 49464 49466 +2 ========================================== + Hits 44702 44705 +3 + Misses 4762 4761 -1
Continue to review full report at Codecov.
|
| @jorisvandenbossche Any other actions on my part? |
| @js3711 minor corrections and a flake issue you can see
ping when pushed / green. |
| can you rebase / update |
| I'll update this weekend. Apologies for the delay. |
6dc5bb8 to d278d62 Compare
jreback left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. just a couple of minor doc changes, and you have a flake error. you can use git diff master | flake8 --diff to see this
| union_categoricals([c1, c2]) | ||
| | ||
| def test_union_categoricals_ignore_order(self): | ||
| c1 = Categorical([1, 2, 3], ordered=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add the issue number here as a comment
doc/source/whatsnew/v0.20.0.txt Outdated
| - ``.select_dtypes()`` now allows the string 'datetimetz' to generically select datetimes with tz (:issue:`14910`) | ||
| - ``pd.merge_asof()`` gained the option ``direction='backward'|'forward'|'nearest'`` (:issue:`14887`) | ||
| | ||
| - ``ignore_ordered`` argument added to ``pd.types.concat.union_categoricals``; setting the argument to true will ignore the ordered attribute of unioned categoricals (:issue:`13410`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you reverse this
pd.types.concat.....union_categoricals(..) has gained the ignore_ordered=False argument.
| they will be ordered as they appear in the data. | ||
| ignore_order: boolean, default False | ||
| If true, the ordered attribute of the Categoricals will be ignored. | ||
| Results in an unordered categorical. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a versionadded 0.20.0 tag
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add the versionadded tag
| c2 = Categorical([1, 2, 3], ordered=False) | ||
| | ||
| res = union_categoricals([c1, c2], ignore_order=True) | ||
| exp = Categorical([1, 2, 3, 1, 2, 3]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you an an explicit test with ignore_order=False that raises (there are tests in other sections, but should have one that explicityly specifies)
| union_categoricals([c1, c2]) | ||
| | ||
| def test_union_categoricals_ignore_order(self): | ||
| c1 = Categorical([1, 2, 3], ordered=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the issue number as a comment. pls add a tests with ignore_order=False (explicity set), and additional one with no ignore_order kw passed (to tests the defaults)
doc/source/whatsnew/v0.20.0.txt Outdated
| - HTML table output skips ``colspan`` or ``rowspan`` attribute if equal to 1. (:issue:`15403`) | ||
| | ||
| .. _ISO 8601 duration: https://en.wikipedia.org/wiki/ISO_8601#Durations | ||
| - ``ignore_ordered`` argument added to ``pd.types.concat.union_categoricals``; setting the argument to true will ignore the ordered attribute of unioned categoricals (:issue:`13410`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add in a :ref: to the docs you added.
| @jreback thanks for the feedback. I made the suggested changes and submitted a new PR. |
| thanks for the PR @js3711 |
xref pandas-dev#13410 (ignore_order portion) Author: Justin Solinsky <justinsolinsky@Justins-MacBook-Pro.local> Closes pandas-dev#15219 from js3711/GH13410-ENHunion_categoricals and squashes the following commits: e9d00de [Justin Solinsky] GH15219 Documentation fixes based on feedback d278d62 [Justin Solinsky] ENH union_categoricals supports ignore_order GH13410 9b827ef [Justin Solinsky] ENH union_categoricals supports ignore_order GH13410
xref #13410 (ignore_order portion)