-  
 -   Notifications  
You must be signed in to change notification settings  - Fork 19.2k
 
BUG: mean overflows for integer dtypes (fixes #10155) #10172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
   pandas/core/nanops.py  Outdated    
 There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For float32 input, the sum should still be float32.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh ... you are right. This means we actually have more to fix than the cases for int types. Because dtype_max would be float64 even when the inputs are float32... I'll update shortly.
4155bbb to 4ac5a80   Compare   |   There are technically two changes now: old behavior
 new behavior
 Both points in the old behavior are not consistent with   |  
|   LGTM... just needs a release note  |  
|   ok - in another issue should audit the rest of the nan funcs for overflow / dtype preserving effects 
  |  
|   So I was actually quite puzzled by why the unit tests wouldn't pass for some of the virtual environments. And I think I tracked down the problem ... seems like a bug in  I verified that   |  
a2e4822 to 3d85fc7   Compare   |   Ok I've updated this with a numpy version check, travis should pass now. Also added a release note.  |  
   pandas/core/nanops.py  Outdated    
 There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you changing the count?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh that's because for float dtypes count can be higher precision and therefore changing the dtype of the returned result. E.g. for float32 input count is still float64 and therefore the result would be cast to float64 when we do sum / count, and that's not what we want
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't count always be integer type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, seems we are casting that, don't really remember why. You can try changing get_count to simply return an int64 which I think will always be correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback actually changing count to int64 may not help, this is what I'm seeing:
In [1]: import numpy as np In [2]: (np.float32(100) / np.int64(100)).dtype Out[2]: dtype('float64') In [3]: (np.float32(100) / np.int32(100)).dtype Out[3]: dtype('float64') In [4]: (np.float32(100) / np.float32(100)).dtype Out[4]: dtype('float32') It still gets cast to float64 unless we have the denominator as float32 as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, ok.
maybe better to add pass thru the dtype to get_counts then as well (and cast to that inside get_counts), to avoid this specific code all over the place
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure sounds good, will update shortly
|   @jreback I added an optional   |  
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does numpy do in < 1.9.0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for numpy < 1.9.0: (wrong result)
In [1]: import numpy as np In [2]: np.__version__ Out[2]: '1.8.2' In [3]: a = 20150515061816532 In [4]: arr = np.array(np.ones(500) * a, dtype=np.int64) In [5]: arr.mean() Out[5]: 20150515061816464.0 numpy >= 1.9.0: (correct result)
In [1]: import numpy as np In [2]: np.__version__ Out[2]: '1.9.0' In [3]: a = 20150515061816532 In [4]: arr = np.array(np.ones(500) * a, dtype=np.int64) In [5]: arr.mean() Out[5]: 20150515061816532.0 There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, ok
|   Indeed, I think this is good to go... I'll wait a little bit and then merge.  |  
|   yep, lgtm  |  
BUG: mean overflows for integer dtypes (fixes #10155)
|   thanks @mortada !  |  
|   cool thanks guys!  |  
closes #10155