-
- Notifications
You must be signed in to change notification settings - Fork 19.3k
API: add "level=" argument to MultiIndex.unique() #17897
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: add "level=" argument to MultiIndex.unique() #17897
Conversation
pandas/core/indexes/multi.py Outdated
| Parameters | ||
| ---------- | ||
| level : int level | ||
| unique : bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to add a version added tag
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(done)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need a versionadded tag
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added it, and then removed it on @jorisvandenbossche 's suggestion because this is an internal method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the default value in the doc string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
u can remove the versionadded tag here
| From #17881 , @jreback 's comments:
I'm not a fan either...
... but I think this would be pretty confusing ("used" as in "this label appears in the index" vs. "used" as "this label is used multiple times") |
| Alternatives:
|
| how about |
you mean "levels" or "labels"? Anyway, not following you: |
57aa7e6 to 8a69543 Compare Codecov Report
@@ Coverage Diff @@ ## master #17897 +/- ## ========================================== - Coverage 91.23% 91.22% -0.02% ========================================== Files 163 163 Lines 50105 50107 +2 ========================================== - Hits 45715 45708 -7 - Misses 4390 4399 +9
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@ ## master #17897 +/- ## ========================================== - Coverage 91.38% 91.34% -0.05% ========================================== Files 164 164 Lines 49797 49812 +15 ========================================== - Hits 45508 45501 -7 - Misses 4289 4311 +22
Continue to review full report at Codecov.
|
doc/source/whatsnew/v0.21.0.txt Outdated
| - Pandas no longer registers matplotlib converters on import. The converters | ||
| will be registered and used when the first plot is draw (:issue:`17710`) | ||
| - Setting on a column with a scalar value and 0-len index now raises a ``ValueError`` (:issue:`16823`) | ||
| - :func:`MultiIndex.get_level_values` now supports the `unique` argument (:issue:`17896`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should unique have double ticks here, ``unique``?
| Answering @jorisvandenbossche from here:
I'm open to alternatives, but let me summarize my rationale for working on
All this said, an alternative which I would find elegant enough would be to add an argument |
| Two additional elements in favor of
|
| OK, trying to clarify my rationale for objecting adding it to So indeed we have to distinguish the
You could indeed see it that way (unique "effective values"), but I would rather see it as the used "unique labels". If you look at it like that, it clearly pertains to the first .. In that sense, suppose we had a method like
This actually sounds like a nice alternative. |
or |
Good point. Still, it's recent and hence probably not very well known. And in principle less efficient for short indexes with multiple long levels. Anyway, I'm OK with changing this PR to |
seems reasonable. alternatively, |
e5a4635 to 3617e2a Compare
I'm afraid it would be confusing (I would love to swap its name with |
| values = self._get_level_values(level) | ||
| return values | ||
| | ||
| def unique(self, level=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add the level=None kw to Index.unique as well (for compat. It should raise if its not None).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, would you mention it in the (Index) docs?
pandas/tests/indexes/test_multi.py Outdated
| expected = Index(['foo', 'bar', 'baz', 'qux'], | ||
| name='first') | ||
| tm.assert_index_equal(result, expected) | ||
| assert result.name == 'first' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is already done by assert_index_equal
pandas/tests/indexes/test_multi.py Outdated
| arrays = [['a', 'b', 'b'], [2, np.nan, 2]] | ||
| index = pd.MultiIndex.from_arrays(arrays) | ||
| values = index.unique(level=1) | ||
| expected = np.array([2], dtype=np.int64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a normal index, the NaN is included in the uniques.
pandas/tests/indexes/test_multi.py Outdated
| index = pd.MultiIndex.from_arrays(arrays) | ||
| values = index.unique(level=1) | ||
| expected = np.array([2], dtype=np.int64) | ||
| tm.assert_numpy_array_equal(values.values, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are you testing here the .values and not comparing with an Index object ?
pandas/core/indexes/multi.py Outdated
| unique : bool | ||
| if True, drop duplicated values | ||
| .. versionadded:: 0.21.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not needed for an internal method IMO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, I see that actually @jreback asked to add it above. But still, I don't think this should be added for internal private functions, as internal code should never have to care about the pandas version
pandas/core/indexes/multi.py Outdated
| Parameters | ||
| ---------- | ||
| level : int, optional, defaults None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should you be able to both specify int or level name ?
a07e2d8 to f0f7874 Compare | The first |
| can you rebase and will take a look. |
pandas/core/indexes/base.py Outdated
| def unique(self): | ||
| def unique(self, level=None): | ||
| if level not in {0, self.name, None}: | ||
| raise ValueError("Level {} not found".format(level)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the doc string for unique updated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, because you never replied to my question
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure what the question is
pandas/core/indexes/multi.py Outdated
| Parameters | ||
| ---------- | ||
| level : int level | ||
| unique : bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need a versionadded tag
pandas/tests/indexes/test_base.py Outdated
| tm.assert_index_equal(result, index_with_name) | ||
| | ||
| def test_unique(self): | ||
| idx = pd.Index([2, 3, 2, 1], name='my_index') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add the issue number
pandas/tests/indexes/test_base.py Outdated
| def test_unique(self): | ||
| idx = pd.Index([2, 3, 2, 1], name='my_index') | ||
| expected = pd.Index([2, 3, 1], name='my_index') | ||
| for level in 0, 'my_index', None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you integrete with the other tests of unique in indexes/common.py and/or tests/test_base.py
| exp = pd.MultiIndex.from_arrays([['a'], ['a']]) | ||
| tm.assert_index_equal(res, exp) | ||
| | ||
| # GH #17896 - with level= argument |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try with named level, try on an already unique level as well. ideally even more corner cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this a separate test and parametrize
f0f7874 to 284649a Compare doc/source/whatsnew/v0.22.0.txt Outdated
| - Bug in :func:`PeriodIndex.truncate` which raises ``TypeError`` when ``PeriodIndex`` is monotonic (:issue:`17717`) | ||
| - Bug in ``DataFrame.groupby`` where key as tuple in a ``MultiIndex`` were interpreted as a list of keys (:issue:`17979`) | ||
| - | ||
| - :func:`MultiIndex.unique` now supports the ``level=`` argument (:issue:`17896`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move to other enhancements
pandas/core/indexes/base.py Outdated
| def unique(self): | ||
| def unique(self, level=None): | ||
| if level not in {0, self.name, None}: | ||
| raise ValueError("Level {} not found".format(level)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the doc string for unique updated?
pandas/core/indexes/multi.py Outdated
| Parameters | ||
| ---------- | ||
| level : int level | ||
| unique : bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the default value in the doc string
pandas/core/indexes/multi.py Outdated
| Parameters | ||
| ---------- | ||
| level : int level | ||
| unique : bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
u can remove the versionadded tag here
| assert idx.has_duplicates | ||
| | ||
| def test_unique(self): | ||
| # GH 17896 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
really like to use the indices fixture here to test all Index types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It assigns no name to the indexes... shall I change it so it does?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can try that
| exp = pd.MultiIndex.from_arrays([['a'], ['a']]) | ||
| tm.assert_index_equal(res, exp) | ||
| | ||
| # GH #17896 - with level= argument |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this a separate test and parametrize
pandas/tests/indexes/test_multi.py Outdated
| tm.assert_index_equal(result, expected) | ||
| | ||
| # With already unique level | ||
| mi = pd.MultiIndex.from_arrays([[1, 3, 2, 4], [1, 1, 1, 2]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test with each level of the mi
pandas/tests/indexes/test_multi.py Outdated
| tm.assert_index_equal(result, expected) | ||
| | ||
| @pytest.mark.xfail(reason='GH 17924 (returns Int64Index with float data)') | ||
| def test_unique_with_nans(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this test
103d192 to d5f4324 Compare | The second commit changes the behavior of If instead the current behavior is preferred, that commit can be dropped, and I will just need to exclude the |
how did this come up? why do you think that the consistency should diverge here? |
My new test compares
I'm not in favor of breaking consistency (between |
d5f4324 to 861867c Compare | The behavior of |
| ( @jreback : ping) |
| Will take a look at this tomorrow! |
pandas/core/indexes/base.py Outdated
| level : int or str, optional, default None | ||
| only return values from specified level (for MultiIndex) | ||
| .. versionadded:: 0.21.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0.22.0
pandas/tests/indexes/common.py Outdated
| with tm.assert_raises_regex(ValueError, msg): | ||
| indices.unique(level=level) | ||
| | ||
| def test_unique_na(self, indices): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are not using indices here, so don't add it as an argument
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good in general!
Added a bunch of minor comments
doc/source/whatsnew/v0.22.0.txt Outdated
| | ||
| - Better support for :func:`Dataframe.style.to_excel` output with the ``xlsxwriter`` engine. (:issue:`16149`) | ||
| - :func:`pandas.tseries.frequencies.to_offset` now accepts leading '+' signs e.g. '+1h'. (:issue:`18171`) | ||
| - :func:`MultiIndex.unique` now supports the ``level=`` argument (:issue:`17896`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add briefly what it does, something like ".. to get the unique values of a single index level" ?
pandas/core/indexes/category.py Outdated
| def is_monotonic_decreasing(self): | ||
| return Index(self.codes).is_monotonic_decreasing | ||
| | ||
| @Appender(base._shared_docs['unique'] % _index_doc_kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be the shared docs of 'index_unique' ?
pandas/core/indexes/base.py Outdated
| Series.unique | ||
| """) | ||
| | ||
| @Appender(base._shared_docs['index_unique'] % _index_doc_kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the _index_doc_kwargs are doing something here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, you're right. But it was already the case, so I thought it was a choice of consistency ("you don't need to change this line if you change the docs making the interpolation meaningful"). Shall I remove it?
pandas/core/indexes/base.py Outdated
| | ||
| @Appender(base._shared_docs['unique'] % _index_doc_kwargs) | ||
| def unique(self): | ||
| base._shared_docs['index_unique'] = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be in _index_shared_docs instead of base._shared_docs ? (because base are the ones shared with series)
pandas/core/indexes/base.py Outdated
| Parameters | ||
| ---------- | ||
| level : int or str, optional, default None | ||
| only return values from specified level (for MultiIndex) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Start sentence with capital letter
pandas/core/indexes/multi.py Outdated
| values = self._get_level_values(level) | ||
| return values | ||
| | ||
| @Appender(base._shared_docs['index_unique'] % _index_doc_kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
base._shared_docs -> index_shared_docs
pandas/core/indexes/base.py Outdated
| @Appender(base._shared_docs['index_unique'] % _index_doc_kwargs) | ||
| def unique(self, level=None): | ||
| if level not in {0, self.name, None}: | ||
| raise ValueError("Level {} not found".format(level)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have _get_level_number for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or rather _validate_index_level which is used in _get_level_number
| tm.assert_index_equal(index.get_level_values(1), exp) | ||
| | ||
| def test_get_level_values_na(self): | ||
| @pytest.mark.xfail(reason='GH 17924 (returns Int64Index with float data)') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we xfail this? (I mean, why don't we keep asserting for now that it is float)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assertions compare a correctly built index (float64 data in Float64Index) with an invalid index (float64 data in Int64Index, due to #17924 ). Maybe I'm missing something, but it's not obvious to me what we would assert - plus, I see it as an added value to xfail for a buggy behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
fbf9eff to 337e942 Compare | needs a rebase |
337e942 to 50f199d Compare There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last minor comments, otherwise looks good!
pandas/core/indexes/category.py Outdated
| @Appender(_index_shared_docs['index_unique'] % _index_doc_kwargs) | ||
| def unique(self, level=None): | ||
| if level not in {0, self.name, None}: | ||
| raise ValueError("Level {} not found".format(level)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use here _validate_index_level as well?
| tm.assert_index_equal(index.get_level_values(1), exp) | ||
| | ||
| def test_get_level_values_na(self): | ||
| @pytest.mark.xfail(reason='GH 17924 (returns Int64Index with float data)') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
50f199d to feb65ed Compare | ping |
| Thanks! |
git diff master -u -- "*.py" | flake8 --diff