-  
-   Notifications  You must be signed in to change notification settings 
- Fork 19.2k
BUG: fix polluted window in skewness computation #62863
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
BUG: fix polluted window in skewness computation #62863
Conversation
This change uses online update for the mean instead of computing the centralized values. Additionally, it checks for possible catastrophic cancellation by big changes in 3rd central moment.
| Reproduction of #47461 import pandas as pd data = [ -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, 2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, 7729964027, 7729965641, 7729965103, 7729966179, 7729968330, 7729969406, 7729971020, 7729969406, 7729969944, 7729969406, 7729969406, 7729969944, 7729971558, 7729972633, 7729973171, 7729973171, 7729973171, 7729969406, ] df = pd.DataFrame(data, columns=["data"]) print(df.data.rolling(10).skew()[-20:]) # 34 NaN # 35 NaN # 36 3.162278e+00 # 37 1.778781e+00 # 38 1.035098e+00 # 39 4.841229e-01 # 40 2.426355e-13 # 41 -4.841229e-01 # 42 -1.035098e+00 # 43 -1.778781e+00 # 44 -3.162278e+00 # 45 -3.928221e-01 # 46 -6.794734e-01 # 47 -1.272001e+00 # 48 -1.034738e+00 # 49 8.839369e-01 # 50 9.294772e-01 # 51 4.561840e-01 # 52 1.688976e-01 # 53 1.688976e-01 # Name: data, dtype: float64 print(df.data.tail(10).skew()) # 0.16889762763904356 | 
| The window from the example #62863 (comment) was recomputed 3 times. 
 | 
| cdef: | ||
| float64_t result, dnobs | ||
| float64_t A, B, C, R | ||
| float64_t moments_ratio, correction | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks you for using descriptive names for these variables!
| xxx[0] = t | ||
| m3_update = delta_n * (term1 * (n - 2.0) - 3.0 * m2[0]) | ||
| new_m3 = m3[0] + m3_update | ||
| if fabs(m3_update) + fabs(m3[0]) > 1e10 * fabs(new_m3): | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How did you come up with the 1e10 threshold?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It loses signifcant digits when 
| n = <float64_t>(nobs[0]) | ||
| delta = val - mean[0] | ||
| delta_n = delta / n | ||
| term1 = delta_n * delta * (n + 1.0) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be interesting to use fused add multiply fma where applicable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't figure out an effective way to reformulate the computation of term1 to use fma. But I managed to change the update functions for the 3rd moment. Now I am using fma(term1, n + 2.0, -3.0 * m2[0])
| I feel like this PR is changing a lot of the current behavior in the skew computation. I will minimize the changes to focus on numerical stability. | 
doc/source/whatsnew/vX.X.X.rstfile if fixing a bug or adding a new feature.This PR changes the algorithm used to compute the skewness. Previously, the algorithm used computed the summation of$X^{3}$ , $X^{2}$ , and $X$ , and used kahan summation for stability. The problem is that $EX^3$  - $(EX)^3$  can suffer heavilly from catastrophic cancellation.
Now, it uses a Welford-like algorithm to compute up to the 3rd central moment, and continuously check for stability when updating the 3rd central moment, by checking for severe loss of significant digits when updating its value. If it detects a stability problem, it recomputes the window.
Benchmark
asv continuous -f 1.1 -E virtualenv:3.13 upstream/main HEAD -b rolling.MethodsNote
The performance increase is mainly due to the fact that it didn't have to recompute the window. This change can have a worse case scenario of$O(n * \text{window\_size})$