felix-nack/hybrid-ml-statistical-forecasting

Hybrid ML-Statistical Forecasting Framework

A time series forecasting framework that combines machine learning and statistical methods, using clustering to decide which model family forecasts each series.

Overview

This framework automatically assigns forecasting models to time series based on their intrinsic characteristics. It uses K-Means clustering to categorize time series into two groups:

  • Noisy/irregular series → Global Random Forest (MLForecast)
  • Seasonal/stable series → Weighted Statistical Ensemble

The aim of this hybrid approach is to forecast each time series with a method suited to its characteristics, rather than applying a single model to every series.

Key Features

  • Automated Model Selection: Clustering-based assignment eliminates manual model selection
  • Hybrid Modeling: Combines ML (Random Forest) with statistical methods (ARIMA, ETS, Seasonal Naive, etc.)
  • Hyperparameter Optimization: Automated tuning using Optuna (100 trials)
  • Feature Importance Analysis: Identifies and uses only relevant predictive features
  • Cross-Validation: Rigorous evaluation with multiple time windows
  • Performance Monitoring: Automated detection of model degradation over time
  • Exogenous Variables: Incorporates external features (promotions, holidays, store metadata)

Methodology

1. Time Series Characterization

Each time series is analyzed for:

  • Seasonality strength (weekly patterns)
  • Trend strength (directional movement)
  • Residual variability (unexplained noise)
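The three characteristics above can be sketched with a simple additive decomposition. This is a minimal, dependency-free illustration, not the framework's own code: the function name `strength_features`, the centered moving-average trend, and the variance-ratio strength measure `1 - Var(R) / Var(component + R)` are all assumptions, since the README does not say which decomposition or formula is used.

```python
from statistics import mean, pvariance

def strength_features(y, period=7):
    """Return (seasonal_strength, trend_strength, residual_variance) for one series.

    Hypothetical helper: assumes an odd period and an additive decomposition.
    """
    n = len(y)
    half = period // 2
    idx = range(half, n - half)
    # Trend estimate: centered moving average over one full period.
    trend = [mean(y[t - half:t + half + 1]) for t in idx]
    detrended = [y[t] - tr for t, tr in zip(idx, trend)]
    # Seasonal estimate: mean detrended value at each position in the weekly cycle.
    by_pos = [[] for _ in range(period)]
    for t, d in zip(idx, detrended):
        by_pos[t % period].append(d)
    pos_mean = [mean(v) for v in by_pos]
    seasonal = [pos_mean[t % period] for t in idx]
    resid = [d - s for d, s in zip(detrended, seasonal)]
    var_r = pvariance(resid)

    def strength(component):
        # Share of variance in (component + residual) not explained by the residual.
        denom = pvariance([c + r for c, r in zip(component, resid)])
        return 0.0 if denom == 0 else max(0.0, 1.0 - var_r / denom)

    return strength(seasonal), strength(trend), var_r
```

A series that repeats the same seven values every week yields a seasonal strength of 1 and a residual variance of 0 under this definition.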

2. Clustering & Model Assignment

  • K-Means clustering (k=2) groups series by similarity
  • High residual variability cluster → Random Forest
    • Noisy, irregular patterns unsuitable for classical statistical models
    • Leverages lag features and static exogenous variables
  • Low residual variability cluster → InverseMSE-Best3 Ensemble
    • Strong weekly seasonality, low trend, stable patterns
    • Best-performing 3 models weighted by inverse MSE
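The assignment step can be illustrated as follows. This is a sketch under stated assumptions: `kmeans_two` is a small pure-Python version of Lloyd's algorithm with a deterministic start, `assign_models` and the labels `"random_forest"` and `"stat_ensemble"` are hypothetical names, and feature scaling is left out for brevity. The framework itself may rely on a library implementation of K-Means.

```python
from math import dist
from statistics import mean

def kmeans_two(points, iters=50):
    """Lloyd's algorithm with k=2; returns one label (0 or 1) per point."""
    # Deterministic start: the points with the lowest and highest last feature.
    centers = [min(points, key=lambda p: p[-1]), max(points, key=lambda p: p[-1])]
    labels = []
    for _ in range(iters):
        new_labels = [0 if dist(p, centers[0]) <= dist(p, centers[1]) else 1
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = tuple(mean(col) for col in zip(*members))
    return labels

def assign_models(features):
    """features: {series_id: (seasonal_strength, trend_strength, residual_variability)}"""
    ids = list(features)
    labels = kmeans_two([features[i] for i in ids])

    def avg_resid(k):
        vals = [features[i][2] for i, lab in zip(ids, labels) if lab == k]
        return mean(vals) if vals else float("-inf")

    # The cluster with the higher mean residual variability goes to the Random Forest.
    noisy = max((0, 1), key=avg_resid)
    return {i: "random_forest" if lab == noisy else "stat_ensemble"
            for i, lab in zip(ids, labels)}
```

Because the cluster labels returned by K-Means are arbitrary, the sketch identifies the noisy cluster by its mean residual variability instead of relying on the label value.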

3. Random Forest Training

  • Global model trained on all matching time series (IDs 1-99 + similar IDs 100+)
  • Feature engineering: Lags, rolling means, date features, static variables
  • Optuna optimization: 100 trials for hyperparameters and feature selection
  • Feature importance: Removes non-contributing features
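To make the lag and rolling-mean features concrete, here is a minimal sketch that turns one daily series into supervised rows. The helper `build_lag_rows`, the lags (1 and 7), and the 7-day window are illustrative assumptions; the README names MLForecast but does not list its actual feature configuration.

```python
from statistics import mean

def build_lag_rows(y, lags=(1, 7), window=7):
    """Turn one daily series into supervised rows that use only past values."""
    start = max(max(lags), window)  # first index with every feature available
    rows = []
    for t in range(start, len(y)):
        row = {f"lag{k}": y[t - k] for k in lags}
        # Rolling mean over the `window` values ending at t-1, so the target never leaks in.
        row[f"rolling_mean{window}"] = mean(y[t - window:t])
        row["target"] = y[t]
        rows.append(row)
    return rows
```

For a global model, rows built this way for every series in the cluster would be stacked into one training table, together with date features and static store variables.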

4. Statistical Ensemble

Base models evaluated per time series:

  • Naive
  • Window Average
  • Random Walk with Drift
  • Auto ARIMA
  • Auto ETS
  • Seasonal Naive
  • Seasonal Window Average

Top 3 models selected and weighted by inverse MSE for each series.
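The weighting rule can be sketched in a few lines. The function name `inverse_mse_best3` and the input layout (one cross-validation MSE and one forecast list per model) are assumptions for illustration; the small floor on the MSE only guards against division by zero.

```python
def inverse_mse_best3(mse_by_model, forecasts, top_k=3):
    """Combine the top_k models with the lowest MSE, weighted by 1/MSE.

    mse_by_model: {model_name: cross-validation MSE}
    forecasts:    {model_name: list of forecasts, same horizon for every model}
    """
    best = sorted(mse_by_model, key=mse_by_model.get)[:top_k]
    inv = {m: 1.0 / max(mse_by_model[m], 1e-12) for m in best}
    total = sum(inv.values())
    weights = {m: v / total for m, v in inv.items()}  # weights sum to 1
    horizon = len(forecasts[best[0]])
    combined = [sum(weights[m] * forecasts[m][i] for m in best)
                for i in range(horizon)]
    return weights, combined
```

With MSEs of 1, 2, and 4 for the three best models, the weights are 4/7, 2/7, and 1/7, so the most accurate model dominates without silencing the others.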

5. Cross-Validation & Evaluation

  • Multiple windows: Time-series cross-validation (7 windows)
  • Metrics: MSE, MAPE
  • Horizon: 56-day forecast (8 weeks)
  • Weekly aggregation: Daily forecasts aggregated to weekly totals
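The window layout and the weekly aggregation can be sketched as below. Both helpers are hypothetical: `cv_windows` assumes the last window ends at the final observation and earlier windows step back by `step_size`, and `weekly_totals` assumes weeks are consecutive 7-day blocks starting at the first forecast day, not calendar weeks.

```python
def cv_windows(n_obs, h=56, step_size=56, n_windows=7):
    """Return (train_end, test_end) index pairs, oldest window first.

    Training uses observations [0, train_end); testing uses [train_end, test_end).
    """
    windows = []
    for i in range(n_windows):
        test_end = n_obs - (n_windows - 1 - i) * step_size
        train_end = test_end - h
        if train_end <= 0:
            raise ValueError("series too short for the requested windows")
        windows.append((train_end, test_end))
    return windows

def weekly_totals(daily, days_per_week=7):
    """Sum consecutive blocks of daily forecasts into weekly totals."""
    return [sum(daily[i:i + days_per_week])
            for i in range(0, len(daily), days_per_week)]
```

With the default settings, seven non-overlapping 56-day test windows need at least 393 daily observations per series (7 × 56 for testing plus at least one for training).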

Project Structure

├── forecasting_framework.py    # Main pipeline orchestration
├── forecasting_utils.py        # Core functions and utilities
├── requirements.txt            # Python dependencies
├── saved_models/               # Trained model artifacts (created on first run)
├── combined_forecasts.csv      # Daily forecasts (output)
├── weekly_forecast.csv         # Weekly aggregated forecasts (output)
└── cv_results_summary.csv      # Cross-validation metrics (output)

Installation

Install all required dependencies:

pip install -r requirements.txt

Usage

Configuration

Edit the following parameters in forecasting_framework.py:

RETRAIN_MODELS = False          # Set True to retrain (slower)
MODEL_SAVE_PATH = 'saved_models'
h = 56                           # Forecast horizon (days)
step_size = 56                   # Step between cross-validation windows (days)
n_windows = 7                    # Cross-validation windows
freq = 'D'                       # Forecast frequency

Input Files (Required)

  • sales_data.csv: Historical sales with columns: store_id, date, sales, open, promo, state_holiday, school_holiday
  • future_values.csv: Future exogenous variables with same columns (excluding sales)
  • metadata.csv: Store metadata: store_id, store_type, assortment, competition_distance

Output Files

  • combined_forecasts.csv: Daily forecasts per store
  • weekly_forecast.csv: Weekly aggregated forecasts (8 weeks)
  • cv_results_summary.csv: Model evaluation metrics

Performance Monitoring

The framework automatically checks for accuracy degradation by comparing the current MSE against baseline thresholds. If the MSE is more than 10% worse than the baseline, a warning is issued listing the affected store IDs.
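A degradation check of this kind might look like the following sketch. The function name `degraded_stores` and the dictionary inputs are assumptions, as is the reading of the 10% rule as a relative increase in MSE over the baseline.

```python
def degraded_stores(current_mse, baseline_mse, tolerance=0.10):
    """Return the sorted store IDs whose MSE rose by more than `tolerance` over baseline.

    current_mse, baseline_mse: {store_id: MSE}
    """
    flagged = []
    for store_id, baseline in baseline_mse.items():
        current = current_mse.get(store_id)
        if current is None or baseline <= 0:
            continue  # nothing to compare against
        if (current - baseline) / baseline > tolerance:
            flagged.append(store_id)
    return sorted(flagged)
```

The returned list could then be passed to `warnings.warn` or a logger to report the affected stores.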


Note: The first run must use RETRAIN_MODELS = True; it trains and saves the models and has a long runtime. Subsequent runs with RETRAIN_MODELS = False load the saved models and execute faster.
