Skip to content

Conversation

@paul-mannino
Copy link
Contributor

@codecov-io
Copy link

codecov-io commented Sep 15, 2016

Current coverage is 85.24% (diff: 100%)

Merging #14225 into master will increase coverage by <.01%

@@ master #14225 diff @@ ========================================== Files 140 140 Lines 50559 50560 +1 Methods 0 0 Messages 0 0 Branches 0 0 ========================================== + Hits 43099 43101 +2  + Misses 7460 7459 -1  Partials 0 0 

Powered by Codecov. Last update e8357a1...42affd5

@jreback jreback added Bug Groupby Categorical Categorical Data Type labels Sep 15, 2016
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than adjusting the hash table value counts, I think you could use np.bincount here? - see Categorical.value_counts

Copy link
Contributor Author

@paul-mannino paul-mannino Sep 18, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per your suggestion, I borrowed some of the logic from Categorical.value_counts. The mask.all() check seemed to make the code run about as fast (or slower) in this case, so I left it out.

Copy link
Contributor

@chris-b1 chris-b1 Sep 17, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because agg intercepts methods (i.e. np.median is interpreted as 'median'), this doesn't fail on master - wrap the targop in a lambda.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add a test that hits the public api (e.g. df.groupby(...).median())?

Copy link
Contributor

@chris-b1 chris-b1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...

@chris-b1
Copy link
Contributor

lgtm, @jreback?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the purpose of this?
when is this hit?

Copy link
Contributor Author

@paul-mannino paul-mannino Sep 18, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The _algos.groupby_indices function expects a counts array that is the same length as the categories. If one of the categories has no values in it (empty bins), the _hash.value_count_int64 function would just return a shorter count array (with no 0 entries for empty categories/bins). My initial approach was to pad the array with 0s in the right places, but per chris-b1's suggestion, I threw out this code and used np.bincount instead.

@jreback
Copy link
Contributor

jreback commented Sep 18, 2016

lgtm

merge away

@chris-b1 chris-b1 merged commit db9dc65 into pandas-dev:master Sep 18, 2016
yarikoptic added a commit to neurodebian/pandas that referenced this pull request Oct 13, 2016
* commit 'v0.19.0rc1-25-ga7469cf': (471 commits) ENH: Add divmod to series and index. (pandas-dev#14208) Fix generator tests to run (pandas-dev#14245) BUG: GH13629 Binned groupby median function with empty bins (pandas-dev#14225) TST/TEMP: fix pyqt to 4.x for plotting tests (pandas-dev#14240) DOC: added example to Series.map showing use of na_action parameter (GH14231) DOC: split docstring into multiple lines in excel.py (pandas-dev#14073) MAINT: Use __module__ in _DeprecatedModule. (pandas-dev#14181) ENH: Allow true_values and false_values options in read_excel (pandas-dev#14002) DOC: fix incorrect example in unstack docstring (GH14206) (pandas-dev#14211) BUG: iloc fails with non lex-sorted MultiIndex pandas-dev#13797 BUG: add check for infinity in __call__ of EngFormatter In gbq.to_gbq allow the DataFrame column order to differ from schema BLD: require cython if tempita is needed DOC: add source links to api docs (pandas-dev#14200) BUG: compat with Stata ver 111 Fix: F999 dictionary key '2000q4' repeated with different values (pandas-dev#14198) BLD: Test for Python 3.5 with C locale BUG: DatetimeTZBlock can't assign values near dst boundary BUG: union_categorical with Series and cat idx BUG: fix str.contains for series containing only nan values ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Bug Categorical Categorical Data Type Groupby

4 participants