Skip to content

Conversation

@reidy-p
Copy link
Contributor

@reidy-p reidy-p commented Jun 21, 2018

def _wrap_agged_blocks(self, items, blocks):
if not self.as_index:
index = np.arange(blocks[0].values.shape[1])
if blocks[0].values.ndim > 1:
Copy link
Contributor Author

@reidy-p reidy-p Jun 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a case such as:

pd.DataFrame({'time': [pd.Timestamp('2012-01-01 13:00:00+00:00')], 'A': [3]}).groupby('A', as_index=False).first() 

the blocks[0].values is a DatetimeIndex and not an array so trying to call shape[1] on a DTI results in an index out of range error and then the compat routine for first or last are called which leads to the timezone being lost.

Copy link
Member

@mroeschke mroeschke Jun 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the compat routine for first or last are called which leads to the timezone being lost.

Is there a case where DatetimeTZ data can legitimately go through the compat routine (maybe with as_index=True?) It would be great for the data type to also be preserved there as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I have also been wondering about the same thing. The following case goes through the compat routine and consequently loses the timezone:

In [2]: df = pd.DataFrame({'group': [1, 1, 2], 'category_string': pd.Series(list('abc')).astype('category'), 'datetimetz': pd.date_range('20130101', periods=3, tz='US/Eastern'}) In [3]: df.groupby('group').first() Out[3]: category_string datetimetz group 1 a 2013-01-01 05:00:00 2 c 2013-01-03 05:00:00 

But if we exclude the categorical column it doesn't go through the compat routine and preserves the timezone information:

In[4]: df[['group', 'datetimetz']].groupby('group').first() Out[4]: datetimetz group 1 2013-01-01 00:00:00-05:00 2 2013-01-03 00:00:00-05:00 

So if we have the categorical column do we want to legitimately go through the compat routine? And, if so, should we preserve the timezone in the compat routine? I think this might actually be quite straightforward (see #15885)

@reidy-p reidy-p changed the title BUG: first/last lose timezone in groupby BUG: first/last lose timezone in groupby with as_index=False Jun 21, 2018
^^^^^^^^^^^^^^^^^^^^^^^^

-
- Bug in :func:`pandas.core.groupby.first` and :func:`pandas.core.groupby.last` with ``as_index=False`` leading to the loss of timezone information (:issue:`15884`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the :func: links should be pandas.core.groupby.GroupBy.first (and likewise for last)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, you're right, thanks!

@mroeschke mroeschke added Bug Timezones Timezone data dtype labels Jun 21, 2018
@codecov
Copy link

codecov bot commented Jun 21, 2018

Codecov Report

Merging #21573 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@ Coverage Diff @@ ## master #21573 +/- ## ======================================= Coverage 91.9% 91.9% ======================================= Files 153 153 Lines 49549 49549 ======================================= Hits 45539 45539 Misses 4010 4010
Flag Coverage Δ
#multiple 90.3% <100%> (ø) ⬆️
#single 41.78% <0%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/groupby/groupby.py 92.66% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 028c9c0...ef5d5b1. Read the comment docs.

def _wrap_agged_blocks(self, items, blocks):
if not self.as_index:
index = np.arange(blocks[0].values.shape[1])
index = np.arange(blocks[0].values.shape[-1])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@reidy-p pushed a simplification. but maybe need some additional tests that do this when a column is selected

e.g. df.groupby('id', as_index=False)['foo'].first()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice simplification. I added some new tests.

@jreback jreback added this to the 0.24.0 milestone Jun 22, 2018
@reidy-p reidy-p force-pushed the groupby_tz branch 2 times, most recently from 7bb7a3b to 26ed691 Compare June 22, 2018 15:55
@mroeschke
Copy link
Member

@reidy-p

So if we have the categorical column do we want to legitimately go through the compat routine? And, if so, should we preserve the timezone in the compat routine?

I am not too familiar of the conditions in which the data gets passed through the compat routine, but timezones should be preserved in the compat routine. Yes, looks like #15885 should solve that issue (and offer a performance boost for Categoriacals as well xref #19026)

@jreback jreback merged commit c6347c4 into pandas-dev:master Jun 22, 2018
@jreback
Copy link
Contributor

jreback commented Jun 22, 2018

thanks @reidy-p very nice!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Bug Timezones Timezone data dtype

4 participants