
Conversation

@Alvaro-Kothe (Member) commented Oct 26, 2025


This PR changes the algorithm used to compute the skewness. Previously, the algorithm computed the sums of $X^{3}$, $X^{2}$, and $X$, and used Kahan summation for stability. The problem is that $E[X^3] - (E[X])^3$ can suffer heavily from catastrophic cancellation.

Now, it uses a Welford-like algorithm to compute up to the 3rd central moment and continuously checks for stability when updating the 3rd central moment, by detecting a severe loss of significant digits in the updated value. If it detects a stability problem, it recomputes the window.
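For reference, the one-pass update can be sketched as follows. This is a minimal Python illustration of the standard Welford-style (Pébay) third-moment recurrence, not pandas' actual Cython kernel, which additionally handles removing values from a rolling window and the stability check described above:

```python
import math

def skew_one_pass(xs):
    """One-pass (Welford-style) update of the mean, M2 and M3.

    Illustrative sketch only: pandas' kernel also supports value
    removal for rolling windows and a cancellation check on M3.
    """
    n = 0
    mean = m2 = m3 = 0.0
    for x in xs:
        n += 1
        delta = x - mean
        delta_n = delta / n
        term1 = delta * delta_n * (n - 1)
        mean += delta_n
        # M3 must be updated before M2, since it uses the old M2.
        m3 += term1 * delta_n * (n - 2) - 3.0 * delta_n * m2
        m2 += term1
    # Adjusted Fisher-Pearson sample skewness (the convention pandas uses).
    return n * math.sqrt(n - 1) / (n - 2) * m3 / m2 ** 1.5
```

The centered quantities $M_2 = \sum (x_i - \bar{x})^2$ and $M_3 = \sum (x_i - \bar{x})^3$ are maintained incrementally, so no second pass over the data is needed.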

Benchmark

```
asv continuous -f 1.1 -E virtualenv:3.13 upstream/main HEAD -b rolling.Methods
```

| Change | Before [ea75dd7] | After [488e99e] `<fix/polluted-window-skew>` | Ratio | Benchmark (Parameter) |
|---|---|---|---|---|
| - | 1.41±0.01ms | 1.21±0.01ms | 0.86 | `rolling.Methods.time_method('DataFrame', ('expanding', {}), 'float', 'skew')` |
| - | 1.45±0.01ms | 1.24±0.01ms | 0.86 | `rolling.Methods.time_method('DataFrame', ('expanding', {}), 'int', 'skew')` |
| - | 1.41±0ms | 1.22±0.03ms | 0.86 | `rolling.Methods.time_method('Series', ('expanding', {}), 'int', 'skew')` |
| - | 1.39±0.01ms | 1.18±0.08ms | 0.85 | `rolling.Methods.time_method('Series', ('expanding', {}), 'float', 'skew')` |

Note

The performance increase is mainly due to the fact that the window no longer has to be recomputed most of the time. In the worst case, this change is $O(n \cdot \text{window\_size})$.

This change uses an online update for the mean instead of computing the centered values. Additionally, it checks for possible catastrophic cancellation by watching for large changes in the 3rd central moment.
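When the cancellation check trips, the stable fallback is a direct recomputation over the current window, which is where the $O(n \cdot \text{window\_size})$ worst case comes from. A hypothetical sketch of such a two-pass recompute (the function name and shape are mine, not the PR's):

```python
def recompute_moments(window):
    """Two-pass recomputation of mean, M2 and M3 over one window.

    Hypothetical sketch of the O(window_size) fallback used when the
    incremental M3 update loses too many significant digits.
    """
    n = len(window)
    mean = sum(window) / n          # first pass: mean
    m2 = sum((x - mean) ** 2 for x in window)  # second pass: central moments
    m3 = sum((x - mean) ** 3 for x in window)
    return mean, m2, m3

print(recompute_moments([1.0, 2.0, 3.0]))  # (2.0, 2.0, 0.0)
```

Since recomputation only happens when instability is detected, the common case stays $O(n)$.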
@Alvaro-Kothe (Member, Author) commented:

Reproduction of #47461

```python
import pandas as pd

data = [
    -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, 2,
    -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2,
    -2, -2, -2, -2, -2, -2, -2, -2,
    7729964027, 7729965641, 7729965103, 7729966179, 7729968330,
    7729969406, 7729971020, 7729969406, 7729969944, 7729969406,
    7729969406, 7729969944, 7729971558, 7729972633, 7729973171,
    7729973171, 7729973171, 7729969406,
]
df = pd.DataFrame(data, columns=["data"])
print(df.data.rolling(10).skew()[-20:])
# 34         NaN
# 35         NaN
# 36    3.162278e+00
# 37    1.778781e+00
# 38    1.035098e+00
# 39    4.841229e-01
# 40    2.426355e-13
# 41   -4.841229e-01
# 42   -1.035098e+00
# 43   -1.778781e+00
# 44   -3.162278e+00
# 45   -3.928221e-01
# 46   -6.794734e-01
# 47   -1.272001e+00
# 48   -1.034738e+00
# 49    8.839369e-01
# 50    9.294772e-01
# 51    4.561840e-01
# 52    1.688976e-01
# 53    1.688976e-01
# Name: data, dtype: float64

print(df.data.tail(10).skew())
# 0.16889762763904356
```
@eicchen (Contributor) commented Oct 26, 2025

Could I have you review my PR for #61416? The work is already done, so it should save some repeated effort.

PR: #61481

@Alvaro-Kothe (Member, Author) commented Oct 26, 2025

The window from the example #62863 (comment) was recomputed 3 times:

  1. at index 23, removing 2. The 3rd moment dropped from 46.08 to -7e-15
  2. at index 40, adding 7729968330. The 3rd moment dropped from 2e29 to 1e17.
  3. at index 45, removing -2. The 3rd moment increased from -3e29 to 4e14.
@Alvaro-Kothe added the Window (rolling, ewma, expanding) and Reduction Operations (sum, mean, min, max, etc.) labels on Oct 27, 2025
```cython
cdef:
    float64_t result, dnobs
    float64_t A, B, C, R
    float64_t moments_ratio, correction
```
Review comment (Member):

Thank you for using descriptive names for these variables!

```cython
xxx[0] = t
m3_update = delta_n * (term1 * (n - 2.0) - 3.0 * m2[0])
new_m3 = m3[0] + m3_update
if fabs(m3_update) + fabs(m3[0]) > 1e10 * fabs(new_m3):
```
Review comment (Member):

How did you come up with the 1e10 threshold?

Reply (Member, Author):

It loses significant digits when $|A| + |B| \gg |A - B|$, so I figured that a change of 10 significant digits in base 10 should be considered a catastrophic cancellation.
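To make the threshold concrete, here is a small sketch of the check (mirroring the shape of the condition in the quoted diff; the function name is mine), applied to the M3 drop from 2e29 to 1e17 reported earlier in this thread:

```python
def severe_cancellation(m3_old, m3_update, threshold=1e10):
    """Flag a loss of roughly 10 significant decimal digits:
    |A| + |B| much larger than |A + B|, with A = m3_old and
    B = m3_update (the condition from the quoted diff hunk)."""
    new_m3 = m3_old + m3_update
    return abs(m3_update) + abs(m3_old) > threshold * abs(new_m3)

# The drop from 2e29 to 1e17 (an update of about -2e29) trips the check:
print(severe_cancellation(2e29, -2e29 + 1e17))  # True
# A benign update does not:
print(severe_cancellation(46.0, 1.5))           # False
```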

```cython
n = <float64_t>(nobs[0])
delta = val - mean[0]
delta_n = delta / n
term1 = delta_n * delta * (n + 1.0)
```
Review comment (Member):

It might be interesting to use fused multiply-add (`fma`) where applicable.

Reply (Member, Author):

I couldn't figure out an effective way to reformulate the computation of `term1` to use `fma`, but I managed to change the update function for the 3rd moment. Now I am using `fma(term1, n + 2.0, -3.0 * m2[0])`.
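As an aside, `math.fma` is only available starting with CPython 3.13, so a portable Python sketch of this pattern needs a fallback (the fallback computes the same value but rounds twice, losing the single-rounding guarantee; the helper name below is mine):

```python
import math

# math.fma (fused multiply-add with a single rounding) was added in
# Python 3.13; on older versions fall back to a*b + c, which rounds twice.
fma = getattr(math, "fma", lambda a, b, c: a * b + c)

def m3_update(term1, n, m2):
    # Same shape as the update quoted above: term1 * (n + 2.0) - 3.0 * m2,
    # evaluated with at most one rounding when the real fma is available.
    return fma(term1, n + 2.0, -3.0 * m2)

print(m3_update(2.0, 3.0, 1.0))  # 2.0 * 5.0 - 3.0 = 7.0
```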

@Alvaro-Kothe (Member, Author) commented:

I feel like this PR is changing a lot of the current behavior in the skew computation. I will minimize the changes to focus on numerical stability.

