BUG: first/last lose timezone in groupby with as_index=False #21573

reidy-p · 2018-06-21T12:39:21Z

closes BUG: first() loses the timezone in groupby #15884
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

reidy-p · 2018-06-21T12:43:21Z

pandas/core/groupby/groupby.py

 def _wrap_agged_blocks(self, items, blocks):
     if not self.as_index:
-        index = np.arange(blocks[0].values.shape[1])
+        if blocks[0].values.ndim > 1:


In a case such as:

pd.DataFrame({'time': [pd.Timestamp('2012-01-01 13:00:00+00:00')], 'A': [3]}).groupby('A', as_index=False).first()

blocks[0].values is a DatetimeIndex rather than a 2-D array, so indexing shape[1] on the DTI raises an index-out-of-range error. The compat routine for first or last is then called, which leads to the timezone being lost.
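To make the failure mode concrete, here is a minimal sketch. It uses a standalone DatetimeIndex as a stand-in for blocks[0].values and does not reproduce the real block internals. The helper name n_result_rows is hypothetical, and its 1-D branch is an assumption, since the diff line above only shows the ndim condition.

```python
import numpy as np
import pandas as pd

# Stand-in for blocks[0].values in the tz-aware case: a 1-D DatetimeIndex.
dti = pd.DatetimeIndex([pd.Timestamp('2012-01-01 13:00:00+00:00')])
# Stand-in for an ordinary 2-D block of values (1 column, 3 result rows).
arr = np.zeros((1, 3))


def n_result_rows(values):
    """Hypothetical helper mirroring the ndim guard proposed in the diff."""
    if values.ndim > 1:
        return values.shape[1]
    # Assumed handling of the 1-D case; not taken from the PR.
    return values.shape[0]


try:
    dti.shape[1]  # a 1-D shape tuple has no second entry
    raised = False
except IndexError:
    raised = True

print(dti.ndim, dti.shape, raised)
print(n_result_rows(dti), n_result_rows(arr))
```

The point of the sketch is only that shape[1] is safe for 2-D values and not for 1-D ones, so any code that assumes two dimensions needs either a guard or an index that works for both.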

> the compat routine for first or last are called which leads to the timezone being lost.

Is there a case where DatetimeTZ data can legitimately go through the compat routine (maybe with as_index=True)? It would be great for the dtype to be preserved there as well.

Yeah, I have been wondering about the same thing. The following case goes through the compat routine and consequently loses the timezone:

In [2]: df = pd.DataFrame({'group': [1, 1, 2],
   ...:                    'category_string': pd.Series(list('abc')).astype('category'),
   ...:                    'datetimetz': pd.date_range('20130101', periods=3, tz='US/Eastern')})

In [3]: df.groupby('group').first()
Out[3]:
      category_string          datetimetz
group
1                   a 2013-01-01 05:00:00
2                   c 2013-01-03 05:00:00

But if we exclude the categorical column it doesn't go through the compat routine and preserves the timezone information:

In [4]: df[['group', 'datetimetz']].groupby('group').first()
Out[4]:
                     datetimetz
group
1     2013-01-01 00:00:00-05:00
2     2013-01-03 00:00:00-05:00

So if we have the categorical column do we want to legitimately go through the compat routine? And, if so, should we preserve the timezone in the compat routine? I think this might actually be quite straightforward (see #15885)
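The two cases above can be compared side by side with a small script. This is only a sketch: the helper name keeps_timezone is hypothetical, and the result of the first call depends on the installed pandas version (on the version discussed in this thread it would report the timezone as lost).

```python
import pandas as pd

df = pd.DataFrame({
    'group': [1, 1, 2],
    'category_string': pd.Series(list('abc')).astype('category'),
    'datetimetz': pd.date_range('20130101', periods=3, tz='US/Eastern'),
})


def keeps_timezone(frame):
    """Return True if 'datetimetz' is still tz-aware after groupby().first()."""
    result = frame.groupby('group').first()
    # A tz-aware column has a dtype with a non-None ``tz`` attribute;
    # naive datetime64 and object dtypes do not.
    return getattr(result['datetimetz'].dtype, 'tz', None) is not None


print(pd.__version__)
print('with categorical column:   ', keeps_timezone(df))
print('without categorical column:', keeps_timezone(df[['group', 'datetimetz']]))
```

Checking the dtype rather than the printed repr avoids depending on how a particular version formats timestamps.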

jschendel · 2018-06-21T16:53:20Z

doc/source/whatsnew/v0.24.0.txt

 ^^^^^^^^^^^^^^^^^^^^^^^^

-
+- Bug in :func:`pandas.core.groupby.first` and :func:`pandas.core.groupby.last` with ``as_index=False`` leading to the loss of timezone information (:issue:`15884`)


I think the :func: links should be pandas.core.groupby.GroupBy.first (and likewise for last)

Yep, you're right, thanks!
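A quick way to see why the original :func: targets would not resolve is to check where first and last actually live. This is a sketch against an installed pandas; the alias groupby_pkg is just a local name.

```python
import pandas.core.groupby as groupby_pkg

# The package namespace itself has no ``first`` attribute ...
print(hasattr(groupby_pkg, 'first'))
# ... while the GroupBy class exposes both methods.
print(hasattr(groupby_pkg.GroupBy, 'first'), hasattr(groupby_pkg.GroupBy, 'last'))
```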

codecov · 2018-06-21T19:17:53Z

Codecov Report

Merging #21573 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #21573   +/-   ##
=======================================
  Coverage    91.9%    91.9%
=======================================
  Files         153      153
  Lines       49549    49549
=======================================
  Hits        45539    45539
  Misses       4010     4010

Flag	Coverage Δ
#multiple	`90.3% <100%> (ø)`	⬆️
#single	`41.78% <0%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/groupby/groupby.py	`92.66% <100%> (ø)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 028c9c0...ef5d5b1.

jreback · 2018-06-22T10:32:16Z

pandas/core/groupby/groupby.py

 def _wrap_agged_blocks(self, items, blocks):
     if not self.as_index:
-        index = np.arange(blocks[0].values.shape[1])
+        index = np.arange(blocks[0].values.shape[-1])
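The reason shape[-1] works for both layouts can be shown without any groupby machinery. In this sketch a tz-aware DatetimeIndex and a plain NumPy array stand in for the two kinds of block values; neither is taken from the pandas internals.

```python
import numpy as np
import pandas as pd

dti = pd.date_range('2012-01-01', periods=3, tz='UTC')  # 1-D, shape (3,)
arr = np.zeros((2, 3))                                   # 2-D, shape (2, 3)

# The last axis is the result length in both cases, so no ndim branch is needed.
index_1d = np.arange(dti.shape[-1])
index_2d = np.arange(arr.shape[-1])
print(index_1d, index_2d)
```

For 2-D values shape[-1] and shape[1] are the same entry, so the change should not alter the existing path.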


@reidy-p I pushed a simplification, but we may need some additional tests that exercise this when a column is selected,

e.g. df.groupby('id', as_index=False)['foo'].first()

Nice simplification. I added some new tests.
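For reference, a check along the lines requested above might look like the following. This is a sketch, not the tests that were added in the PR: the function name check_first_last_keep_tz and the sample data are made up, and only the column names 'id' and 'foo' come from the example in the review comment.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['a', 'a', 'b'],
    'foo': pd.date_range('2018-01-01', periods=3, tz='UTC'),
})


def check_first_last_keep_tz(frame):
    """Assert that 'foo' stays tz-aware for first/last with as_index=False."""
    for method in ('first', 'last'):
        grouped = frame.groupby('id', as_index=False)
        whole = getattr(grouped, method)()            # whole-frame path
        selected = getattr(grouped['foo'], method)()  # column-selected path
        for result in (whole, selected):
            assert getattr(result['foo'].dtype, 'tz', None) is not None, method


check_first_last_keep_tz(df)
```

On a pandas version that still has the bug, the assertion would be expected to fail for at least one of the two paths.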

mroeschke · 2018-06-22T17:02:09Z

@reidy-p

> So if we have the categorical column do we want to legitimately go through the compat routine? And, if so, should we preserve the timezone in the compat routine?

I am not too familiar with the conditions under which the data gets passed through the compat routine, but timezones should be preserved there too. Yes, it looks like #15885 should solve that issue (and offer a performance boost for Categoricals as well, xref #19026).

jreback · 2018-06-22T23:01:46Z

thanks @reidy-p very nice!

reidy-p changed the title ~~BUG: first/last lose timezone in groupby~~ BUG: first/last lose timezone in groupby with as_index=False Jun 21, 2018

jschendel reviewed Jun 21, 2018

mroeschke added Bug Timezones Timezone data dtype labels Jun 21, 2018

jreback requested changes Jun 22, 2018

jreback added this to the 0.24.0 milestone Jun 22, 2018

reidy-p and others added 2 commits June 22, 2018 15:20

BUG: first/last lose timezone in groupby

4d7e1cf

simplify

f46ce84

reidy-p force-pushed the groupby_tz branch 2 times, most recently from 7bb7a3b to 26ed691 Compare June 22, 2018 15:55

Fix whatsnew and add tests

7408aab

reidy-p force-pushed the groupby_tz branch from 26ed691 to 7408aab Compare June 22, 2018 15:57

lint

ef5d5b1

jreback approved these changes Jun 22, 2018

jreback merged commit c6347c4 into pandas-dev:master Jun 22, 2018

mroeschke mentioned this pull request Jun 23, 2018

BUG: groupby.first/last loses timezone information followup #21603

Closed

reidy-p deleted the groupby_tz branch June 23, 2018 22:13

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018

BUG: first/last lose timezone in groupby with as_index=False (pandas-…

594b75e

…dev#21573)
