Before making predictions, we need to make sure our data is ready.
A raw time series often contains trends or fluctuations that can mislead a forecasting model.
The ARIMA model has one key requirement: it only works properly with stationary series.
- A stationary series is one whose statistical properties (mean, variance, autocorrelation) remain stable over time.
- A non-stationary series, on the other hand, changes significantly over time (for example, with a strong trend or seasonality).

Without this preparation, ARIMA may produce biased or unreliable forecasts.
In the previous article (How Time Series Reveal the Future: An Introduction to the ARIMA Model), we explored what a time series is, its components (trend, seasonality, noise), and the intuition behind ARIMA.
We also visualized the AirPassengers dataset, which showed a steady upward trend and yearly seasonality.
👉 But for ARIMA to work, our data must satisfy one key condition: stationarity.
That’s exactly what this article is about: transforming a non-stationary series into a stationary one using simple techniques (differencing, statistical tests).
In other words: after observing, we now move on to preparing.
Simplified Theory
What is stationarity?
A stationary series is one whose statistical properties (mean, variance, autocorrelation) remain stable over time.
👉 Example: daily winter temperatures in a city (fluctuating around a stable mean).
A non-stationary series changes too much over time:
- **Trend** (e.g., a constant increase in smartphone sales).
- **Seasonality** (e.g., ice cream sales peaking every summer).
ARIMA assumes the series is stationary; otherwise, it “believes” past trends will continue forever, leading to biased forecasts.
Differencing
To make a series stationary, we use differencing:

Y'_t = Y_t - Y_{t-1}

In other words, each value is replaced by the **change between two successive periods**.
- This removes linear trends.
- For strong seasonality, we can apply seasonal differencing (e.g., difference with the value one year before).
Example: instead of analyzing raw monthly sales, we analyze the month-to-month change.
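Here is a minimal sketch of both kinds of differencing in pandas, using a synthetic monthly series (the Nile dataset used below is annual, so it has no seasonal component to difference away):

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series: linear trend + yearly seasonal cycle
idx = pd.date_range("2020-01-01", periods=48, freq="MS")
t = np.arange(48)
sales = pd.Series(100 + 2 * t + 20 * np.sin(2 * np.pi * t / 12), index=idx)

diff1 = sales.diff().dropna()     # first difference: removes the linear trend
diff12 = sales.diff(12).dropna()  # seasonal difference (lag 12): removes the yearly cycle
```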
Statistical tests (ADF & KPSS)
To check whether a series is stationary, we use two complementary tests.

ADF (Augmented Dickey-Fuller test)
- Null hypothesis (H₀): the series is non-stationary.
- If p-value < 0.05 → reject H₀ → the series is stationary.

KPSS (Kwiatkowski-Phillips-Schmidt-Shin test)
- Null hypothesis (H₀): the series is stationary.
- If p-value < 0.05 → reject H₀ → the series is non-stationary.
In practice:
- We apply both tests for robustness.
- If ADF and KPSS disagree, we refine with additional transformations.
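The four possible combinations of verdicts can be wrapped in a small helper. This is only a sketch: the function name and labels are ours, following the usual joint reading of the two tests.

```python
def interpret_tests(adf_p, kpss_p, alpha=0.05):
    # Combine ADF and KPSS p-values into a single verdict (hypothetical helper)
    adf_stationary = adf_p < alpha      # ADF rejects the unit root
    kpss_stationary = kpss_p >= alpha   # KPSS does not reject stationarity
    if adf_stationary and kpss_stationary:
        return "stationary"
    if not adf_stationary and not kpss_stationary:
        return "non-stationary -> difference the series"
    if adf_stationary:  # KPSS disagrees
        return "difference-stationary -> differencing should help"
    return "trend-stationary -> consider removing the trend"
```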
Hands-on in Python
We’ll use a simple time series dataset: the annual flow of the Nile River (built into statsmodels).
```python
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller, kpss
from statsmodels.datasets import nile

# Load the Nile dataset
data = nile.load_pandas().data
data.index = pd.date_range(start="1871", periods=len(data), freq="Y")
series = data['volume']

# Plot the series
plt.figure(figsize=(10, 4))
plt.plot(series)
plt.title("Annual Nile River Flow (1871–1970)")
plt.show()
```

Check stationarity with ADF & KPSS
```python
def adf_test(series):
    result = adfuller(series, autolag='AIC')
    print("ADF Statistic:", result[0])
    print("p-value:", result[1])
    if result[1] < 0.05:
        print("✅ The series is stationary (ADF).")
    else:
        print("❌ The series is NON-stationary (ADF).")

def kpss_test(series):
    result = kpss(series, regression='c', nlags="auto")
    print("KPSS Statistic:", result[0])
    print("p-value:", result[1])
    if result[1] < 0.05:
        print("❌ The series is NON-stationary (KPSS).")
    else:
        print("✅ The series is stationary (KPSS).")

print("ADF Test")
adf_test(series)
print("\nKPSS Test")
kpss_test(series)
```

Apply differencing
```python
series_diff = series.diff().dropna()

plt.figure(figsize=(10, 4))
plt.plot(series_diff)
plt.title("Differenced series (1st difference)")
plt.show()

print("ADF Test after differencing")
adf_test(series_diff)
print("\nKPSS Test after differencing")
kpss_test(series_diff)
```

Understanding ARIMA and its parameters
The ARIMA(p, d, q) model combines three parts:
AR (AutoRegressive, p)
- Uses past values to predict the future.
- Example: if p = 2, the current value depends on the last 2 values.
Formula: Y_t = \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \epsilon_t
I (Integrated, d)
Number of differences applied to make the series stationary.
Example:
- d = 0 → no differencing.
- d = 1 → one difference applied.
MA (Moving Average, q)
Uses past errors (residuals) for prediction.
Example: if q = 2, the prediction depends on the last two errors.
Formula: Y_t = \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \epsilon_t
In short:
- p = memory of past values
- d = degree of differencing
- q = memory of past errors
Example in Python
```python
from statsmodels.tsa.arima.model import ARIMA

# Fit an ARIMA(1,1,1) model
model = ARIMA(series, order=(1, 1, 1))
fit = model.fit()
print(fit.summary())
```

Typical output:
- AR (p) → past values effect
- I (d) → differencing applied
- MA (q) → past errors effect
- AIC/BIC → model quality (lower = better)
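To compare several models programmatically, both criteria are also available as attributes on the fitted results object:

```python
# Lower is better for both criteria
print("AIC:", fit.aic)
print("BIC:", fit.bic)
```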
Choosing the best parameters (p,d,q)
One of the main challenges with ARIMA is selecting the right p, d, q.
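One pragmatic baseline, before turning to the ACF/PACF plots below, is a small brute-force search that keeps the order with the lowest AIC. A minimal sketch (`series` is the Nile series loaded earlier; the grid bounds are arbitrary choices):

```python
import itertools
import warnings

from statsmodels.tsa.arima.model import ARIMA

warnings.filterwarnings("ignore")  # hide convergence warnings during the search

best_aic, best_order = float("inf"), None
for order in itertools.product(range(3), range(2), range(3)):  # candidate (p, d, q)
    try:
        aic = ARIMA(series, order=order).fit().aic
    except Exception:
        continue  # some orders may fail to estimate
    if aic < best_aic:
        best_aic, best_order = aic, order

print(f"Best order: {best_order} (AIC = {best_aic:.1f})")
```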
Choosing p and q with ACF & PACF
- ACF → helps to choose q (MA part).
- PACF → helps to choose p (AR part).
Simple rules:
- The lag where the PACF cuts off → a good candidate for p.
- The lag where the ACF cuts off → a good candidate for q.
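In practice, both plots come from statsmodels. A short sketch on the differenced Nile series from earlier (`series_diff`):

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Always inspect the stationary (differenced) series, not the raw one
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(series_diff, lags=20, ax=axes[0])   # a cutoff here suggests q
plot_pacf(series_diff, lags=20, ax=axes[1])  # a cutoff here suggests p
plt.show()
```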


