Skip to content

Conversation

@WillAyd
Copy link
Member

@WillAyd WillAyd commented Apr 19, 2019

Would ideally like to combine first, nth and last implementations. Consider this a precursor

@WillAyd WillAyd added this to the 0.25.0 milestone Apr 19, 2019
@codecov
Copy link

codecov bot commented Apr 19, 2019

Codecov Report

Merging #26152 into master will decrease coverage by <.01%.
The diff coverage is 93.75%.

Impacted file tree graph

@@ Coverage Diff @@ ## master #26152 +/- ## ========================================== - Coverage 91.97% 91.96% -0.01%  ========================================== Files 175 175 Lines 52368 52371 +3 ========================================== - Hits 48164 48163 -1  - Misses 4204 4208 +4
Flag Coverage Δ
#multiple 90.52% <93.75%> (ø) ⬆️
#single 40.69% <12.5%> (-0.15%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby/groupby.py 97.24% <93.75%> (+0.01%) ⬆️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 96.9% <0%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9feb3ad...c2a0b8e. Read the comment docs.

def nth(self, n, dropna=None):
def nth(self,
n: Union[int, List[int]],
dropna: Optional[Union[bool, str]] = None) -> DataFrame:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Options[str] here

@Substitution(see_also=_common_see_also)
def nth(self, n, dropna=None):
def nth(self,
n: Union[int, Collection[int]],
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could arguably go list-like here if we wanted to try and add to _typing but didn't want to stray too far.

The documentation mentions only supporting lists, though the actual implementation makes accommodations for lists, sets or tuples

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so generally we would use Sequence then

Copy link
Member Author

@WillAyd WillAyd Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was thinking that but the isinstance check in the method body also includes set, which wouldn't qualify as a Sequence.

I reverted it to List to keep it simple since Collection isn't available from typing until 3.6 and that is what the documentation shows anyway

@jreback
Copy link
Contributor

jreback commented Apr 20, 2019

@WillAyd has a failure in the checks (not the numpy_dev issue)

mask = mask_left | mask_right

ids, _, _ = self.grouper.group_info
mask = mask & (ids != -1) # Drop NA values in grouping
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you put the comment on the prior line


# old behaviour, but with all and any support for DataFrames.
# modified in GH 7559 to have better perf
n = cast(int, n)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would really really avoid the need to do this, you can instead assign to a new variable that is the correct type

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem. So in your mind would cast only be suitable for expressions without assignment or should we avoid altogether?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would really avoid having to cast at all; this is just plain confusing as its not 'code' but for annotation purposes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've refactored the branches to appease the type checker without cast but also arguably make the code more clear. Makes the diff a little larger than originally but I think for the better

mask_left = np.in1d(self._cumcount_array(), nth_array)
mask_right = np.in1d(self._cumcount_array(ascending=False) + 1,
-nth_values)
-nth_array)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note I had to change this because nth_values was inferred as List[int] upon initial assignment. This prevents reassignment from potentially obfuscating the type checker and code intent

nth_values = [n]
elif isinstance(n, (set, list, tuple)):
nth_values = list(set(n))
if dropna is not None:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of explicitly checking if dropna is not None this condition was refactored to sit in a branch that follows the implicit condition of if dropna.

There is however a slight behavior difference between this PR and master, where now these are equivalent:

>>> df = pd.DataFrame([[0, 1], [0, 2]], columns=['foo', 'bar']) >>> df.groupby('foo').nth([0], dropna=None) bar foo 0 1 >>> df.groupby('foo').nth([0], dropna=False) bar foo 0 1

Whereas on master these would not yield the same thing:

>>> df = pd.DataFrame([[0, 1], [0, 2]], columns=['foo', 'bar']) >>> df.groupby('foo').nth([0], dropna=None) bar foo 0 1 >>> df.groupby('foo').nth([0], dropna=False) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/williamayd/clones/pandas/pandas/core/groupby/groupby.py", line 1626, in nth "dropna option with a list of nth values is not supported") ValueError: dropna option with a list of nth values is not supported

I think the new behavior is more consistent and preferred. Might be worth a whatsnew

@jreback
Copy link
Contributor

jreback commented Apr 26, 2019

looks reasonable but i need to look again

axis = self.grouper.axis
grouper = axis[axis.isin(dropped.index)]

else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you de-dent this here and remove the else which is unecessary; add a comment though of what this branch is

@jreback jreback merged commit 7120725 into pandas-dev:master May 5, 2019
@jreback
Copy link
Contributor

jreback commented May 5, 2019

thanks @WillAyd

@WillAyd WillAyd deleted the groupby-nth-nan branch January 16, 2020 00:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

2 participants