Mubarak Mohamed
Preparing Your Data for the ARIMA Model: The Secret Step to Reliable Forecasts

Before making predictions, we need to make sure our data is ready.
A raw time series often contains trends or fluctuations that can mislead a forecasting model.

The ARIMA model has one key requirement: it only works properly with stationary series.

  • A stationary series is one whose statistical properties (mean, variance, autocorrelation) remain stable over time.
  • A non-stationary series, on the other hand, changes significantly over time (for example, with a strong trend or seasonality).

Without this preparation, ARIMA may produce biased or unreliable forecasts.

In the previous article (How Time Series Reveal the Future: An Introduction to the ARIMA Model), we explored what a time series is, its components (trend, seasonality, noise), and the intuition behind ARIMA.
We also visualized the AirPassengers dataset, which showed a steady upward trend and yearly seasonality.

👉 But for ARIMA to work, our data must satisfy one key condition: stationarity.
That’s exactly what this article is about: transforming a non-stationary series into a stationary one using simple techniques (differencing, statistical tests).
In other words: after observing, we now move on to preparing.

Simplified Theory

  1. What is stationarity?

A stationary series is one whose statistical properties (mean, variance, autocorrelation) remain stable over time.
👉 Example: daily winter temperatures in a city (hovering around a stable mean with small fluctuations).

A non-stationary series has statistical properties that change over time, for example:

  • Trend (e.g., a constant increase in smartphone sales).
  • Seasonality (e.g., ice cream sales peaking every summer).

ARIMA assumes the series is stationary: otherwise, it “believes” past trends will continue forever, leading to biased forecasts.

  2. Differencing

To make a series stationary, we use differencing:

Y'_t = Y_t - Y_{t-1}

In other words, each value is replaced by the change between two successive periods.

  • This removes linear trends.
  • For strong seasonality, we can apply seasonal differencing (e.g., difference with the value one year before); see the sketch after the example below.

Example: instead of analyzing raw monthly sales, we analyze the month-to-month change.
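To make this concrete, here is a minimal pandas sketch of both first and seasonal differencing. The `sales` series is a hypothetical monthly toy example (a pure linear trend), not real data:

import pandas as pd

# Hypothetical monthly series with a linear trend (toy data)
idx = pd.date_range("2020-01", periods=36, freq="MS")
sales = pd.Series(range(36), index=idx, dtype=float)

# First difference: Y'_t = Y_t - Y_{t-1} (removes a linear trend)
first_diff = sales.diff().dropna()

# Seasonal difference: Y_t - Y_{t-12} (removes yearly seasonality)
seasonal_diff = sales.diff(12).dropna()

On this toy series, the first difference is constant, which is exactly what "removing a linear trend" means.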

  3. Statistical tests (ADF & KPSS)

To check if a series is stationary, we use two complementary tests:

ADF (Augmented Dickey-Fuller test)

  • Null hypothesis (H₀): the series is non-stationary.
  • If p-value < 0.05 → reject H₀ → the series is stationary.

KPSS (Kwiatkowski-Phillips-Schmidt-Shin test)

  • Null hypothesis (H₀): the series is stationary.
  • If p-value < 0.05 → reject H₀ → the series is non-stationary.

In practice:

  • We apply both tests for robustness.
  • If ADF and KPSS disagree, we refine with additional transformations.
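As a sketch, the two verdicts can be combined into one helper. The function name check_stationarity and the 0.05 threshold are choices for illustration, not a standard API:

from statsmodels.tsa.stattools import adfuller, kpss

def check_stationarity(series, alpha=0.05):
    adf_p = adfuller(series, autolag='AIC')[1]               # H0: non-stationary
    kpss_p = kpss(series, regression='c', nlags="auto")[1]   # H0: stationary
    adf_says_stationary = adf_p < alpha
    kpss_says_stationary = kpss_p >= alpha
    if adf_says_stationary and kpss_says_stationary:
        return "stationary"
    if not adf_says_stationary and not kpss_says_stationary:
        return "non-stationary → difference the series"
    return "tests disagree → try further transformations"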

Hands-on in Python

We’ll use a simple time series dataset: the annual flow of the Nile River (built into statsmodels).

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller, kpss
from statsmodels.datasets import nile

# Load the Nile dataset (annual flow, 1871–1970)
data = nile.load_pandas().data
data.index = pd.date_range(start="1871", periods=len(data), freq="Y")
series = data['volume']

# Plot the series
plt.figure(figsize=(10, 4))
plt.plot(series)
plt.title("Annual Nile River Flow (1871–1970)")
plt.show()

Check stationarity with ADF & KPSS

def adf_test(series):
    result = adfuller(series, autolag='AIC')
    print("ADF Statistic:", result[0])
    print("p-value:", result[1])
    if result[1] < 0.05:
        print("✅ The series is stationary (ADF).")
    else:
        print("❌ The series is NON-stationary (ADF).")

def kpss_test(series):
    result = kpss(series, regression='c', nlags="auto")
    print("KPSS Statistic:", result[0])
    print("p-value:", result[1])
    if result[1] < 0.05:
        print("❌ The series is NON-stationary (KPSS).")
    else:
        print("✅ The series is stationary (KPSS).")

print("ADF Test")
adf_test(series)
print("\nKPSS Test")
kpss_test(series)

Apply differencing

# First difference of the series
series_diff = series.diff().dropna()

plt.figure(figsize=(10, 4))
plt.plot(series_diff)
plt.title("Differenced series (1st difference)")
plt.show()

print("ADF Test after differencing")
adf_test(series_diff)
print("\nKPSS Test after differencing")
kpss_test(series_diff)

Understanding ARIMA and its parameters

The ARIMA(p, d, q) model combines three parts:

  1. AR (AutoRegressive, p)

  • Uses past values to predict the future.
  • Example: if p = 2, the current value depends on the last 2 values.
  • Formula: Y_t = \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \epsilon_t

  2. I (Integrated, d)

  • Number of differences applied to make the series stationary.
  • Example: d = 0 → no differencing; d = 1 → one difference applied.

  3. MA (Moving Average, q)

  • Uses past errors (residuals) for prediction.
  • Example: if q = 2, the prediction depends on the last 2 errors.
  • Formula: Y_t = \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2}

In short:

  • p = memory of past values (AR order)
  • d = degree of differencing
  • q = memory of past errors (MA order)
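To build intuition for what p and q control, here is a small sketch that simulates a pure AR(2) and a pure MA(2) process with statsmodels' ArmaProcess; the coefficient values are arbitrary illustrations:

import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess

np.random.seed(42)

# ArmaProcess expects lag-polynomial coefficients, including the
# zero-lag 1, with AR signs flipped: (1 - 0.6L - 0.3L^2) on the AR side.
ar2 = ArmaProcess(ar=[1, -0.6, -0.3], ma=[1])   # AR(2): phi1=0.6, phi2=0.3
ma2 = ArmaProcess(ar=[1], ma=[1, 0.5, 0.4])     # MA(2): theta1=0.5, theta2=0.4

ar_sample = ar2.generate_sample(nsample=200)    # each value depends on its last 2 values
ma_sample = ma2.generate_sample(nsample=200)    # each value depends on the last 2 errors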

Example in Python

from statsmodels.tsa.arima.model import ARIMA

# Fit an ARIMA(1,1,1) model on the raw series
# (d=1 handles the differencing internally)
model = ARIMA(series, order=(1, 1, 1))
fit = model.fit()
print(fit.summary())

Typical output:

  • AR (p) → effect of past values
  • I (d) → number of differences applied
  • MA (q) → effect of past errors
  • AIC/BIC → model quality (lower is better)
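Once the model is fitted, forecasting is one call away. A minimal sketch, with a 10-step horizon chosen purely for illustration:

# Point forecasts for the next 10 years
print(fit.forecast(steps=10))

# Forecasts with confidence intervals
pred = fit.get_forecast(steps=10)
print(pred.predicted_mean)
print(pred.conf_int())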

Choosing the best parameters (p,d,q)

One of the main challenges with ARIMA is selecting the right p, d, q.


Choosing p and q with ACF & PACF

  • ACF (autocorrelation function) → helps choose q (the MA part).
  • PACF (partial autocorrelation function) → helps choose p (the AR part).

Simple rules:

  • PACF cutoff → good candidate for p.
  • ACF cutoff → good candidate for q.
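A minimal sketch of plotting both functions on the differenced Nile series from the hands-on section (series_diff defined above):

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(series_diff, lags=20, ax=axes[0])    # a cutoff after lag q suggests the MA order
plot_pacf(series_diff, lags=20, ax=axes[1])   # a cutoff after lag p suggests the AR order
plt.show()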
