Hybrid ML-Statistical Forecasting Framework

A time series forecasting framework that combines machine learning and statistical methods, using clustering to choose a model for each series.

Overview

This framework automatically assigns forecasting models to time series based on their intrinsic characteristics. It uses K-Means clustering to categorize time series into two groups:

Noisy/irregular series → Global Random Forest (MLForecast)
Seasonal/stable series → Weighted Statistical Ensemble

This hybrid approach is intended to forecast each time series with a method suited to its characteristics, instead of forcing a single model onto every series.

Key Features

Automated Model Selection: Clustering-based assignment eliminates manual model selection
Hybrid Modeling: Combines ML (Random Forest) with statistical methods (ARIMA, ETS, Seasonal Naive, etc.)
Hyperparameter Optimization: Automated tuning using Optuna (100 trials)
Feature Importance Analysis: Identifies and uses only relevant predictive features
Cross-Validation: Time-series cross-validation over 7 windows
Performance Monitoring: Automated detection of model degradation over time
Exogenous Variables: Incorporates external features (promotions, holidays, store metadata)

Methodology

1. Time Series Characterization

Each time series is analyzed for:

Seasonality strength (weekly patterns)
Trend strength (directional movement)
Residual variability (unexplained noise)
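The README does not say how these three quantities are computed. A minimal sketch, assuming a classical additive decomposition with a weekly period and the widely used strength measure max(0, 1 - Var(R) / Var(C + R)), where R is the residual and C is the seasonal or trend component. The function name `characterize` and the moving-average decomposition are illustrative assumptions, not the framework's actual code.

```python
import numpy as np

def _strength(var_resid, var_total):
    # Strength of a component: share of variance not explained by the residual.
    return 0.0 if var_total == 0 else max(0.0, 1.0 - var_resid / var_total)

def characterize(y, period=7):
    """Return (seasonal_strength, trend_strength, residual_std) for one series.

    Hypothetical helper: assumes an odd period (7 for weekly patterns in daily data).
    """
    y = np.asarray(y, dtype=float)
    half = period // 2
    # Trend: centered moving average over one full period.
    trend = np.convolve(y, np.ones(period) / period, mode="valid")
    core = y[half:len(y) - half]              # observations aligned with the trend
    detrended = core - trend
    # Seasonal component: mean detrended value at each position in the cycle.
    pos = (np.arange(len(core)) + half) % period
    means = np.array([detrended[pos == p].mean() for p in range(period)])
    means -= means.mean()
    seasonal = means[pos]
    resid = detrended - seasonal
    var_r = resid.var()
    seasonal_strength = _strength(var_r, (seasonal + resid).var())
    trend_strength = _strength(var_r, (trend + resid).var())
    return seasonal_strength, trend_strength, resid.std()
```

A series with a clean weekly cycle should score close to 1 on seasonal strength, while a series dominated by noise scores near 0 and shows a larger residual standard deviation.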

2. Clustering & Model Assignment

K-Means clustering (k=2) groups series by similarity
High residual variability cluster → Random Forest
- Noisy, irregular patterns unsuitable for classical statistical models
- Leverages lag features and static exogenous variables
Low residual variability cluster → InverseMSE-Best3 Ensemble
- Strong weekly seasonality, low trend, stable patterns
- Best-performing 3 models weighted by inverse MSE
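The assignment rule above can be sketched with scikit-learn. This is an illustration under assumptions: the feature order, the standardization step, and the labels `"random_forest"` and `"stat_ensemble"` are hypothetical and not taken from the framework's code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def assign_models(features):
    """Map each series to a model family.

    features: array of shape (n_series, 3) with columns
    [seasonal_strength, trend_strength, residual_variability] (assumed order).
    """
    # Standardize so that no single feature dominates the distance (an assumption).
    X = StandardScaler().fit_transform(np.asarray(features, dtype=float))
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    # The cluster whose centre has the higher residual variability is the noisy one.
    noisy_cluster = int(np.argmax(km.cluster_centers_[:, 2]))
    return ["random_forest" if label == noisy_cluster else "stat_ensemble"
            for label in km.labels_]
```

Identifying the noisy cluster from its centre, not from the raw label, matters because K-Means label numbers are arbitrary and can swap between runs.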

3. Random Forest Training

Global model trained on all matching time series (IDs 1-99 + similar IDs 100+)
Feature engineering: Lags, rolling means, date features, static variables
Optuna optimization: 100 trials for hyperparameters and feature selection
Feature importance: Removes non-contributing features
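As an illustration of the feature engineering step, the following pandas sketch builds lag, rolling-mean, and date features per store. The lag values (7 and 14), the window length, and the function name are assumptions; the framework itself may delegate this to MLForecast.

```python
import pandas as pd

def add_lag_features(df, lags=(7, 14), window=7):
    """Add per-store lag and rolling-mean columns to a long-format frame.

    Expects columns store_id, date (datetime), sales. Hypothetical helper.
    """
    df = df.sort_values(["store_id", "date"]).copy()
    grouped = df.groupby("store_id")["sales"]
    for lag in lags:
        # shift() within each store, so values never leak across stores.
        df[f"lag_{lag}"] = grouped.shift(lag)
    # Shift by one day before rolling so the feature never includes the target itself.
    df[f"rolling_mean_{window}"] = grouped.transform(
        lambda s: s.shift(1).rolling(window).mean()
    )
    df["day_of_week"] = df["date"].dt.dayofweek   # Monday = 0
    return df
```

The first rows of each store contain missing values for the lag columns, because no earlier observations exist; they are usually dropped before training.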

4. Statistical Ensemble

Base models evaluated per time series:

Naive
Window Average
Random Walk with Drift
Auto ARIMA
Auto ETS
Seasonal Naive
Seasonal Window Average

Top 3 models selected and weighted by inverse MSE for each series.
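Inverse-MSE weighting gives each selected model a weight proportional to 1 / MSE, normalized so the weights sum to 1. A minimal sketch for a single series; the function name, the dictionary inputs, and the small `eps` guard against a zero MSE are illustrative assumptions.

```python
import numpy as np

def inverse_mse_best3(mse_by_model, forecasts_by_model, top_k=3, eps=1e-12):
    """Combine the top_k models with the lowest MSE, weighted by inverse MSE.

    mse_by_model: {model_name: cross-validation MSE}
    forecasts_by_model: {model_name: array of forecasts over the horizon}
    """
    best = sorted(mse_by_model, key=mse_by_model.get)[:top_k]
    inv = np.array([1.0 / max(mse_by_model[m], eps) for m in best])
    weights = inv / inv.sum()                      # weights sum to 1
    combined = sum(
        w * np.asarray(forecasts_by_model[m], dtype=float)
        for w, m in zip(weights, best)
    )
    return dict(zip(best, weights)), combined
```

For example, MSE values of 1, 2, and 4 yield weights of 4/7, 2/7, and 1/7, so the most accurate model contributes the most to the combined forecast.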

5. Cross-Validation & Evaluation

Multiple windows: Time-series cross-validation (7 windows)
Metrics: MSE, MAPE
Horizon: 56-day forecast (8 weeks)
Weekly aggregation: Daily forecasts aggregated to weekly totals
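To make the window layout concrete, the sketch below computes the training cutoff date of each cross-validation window. It assumes the windows are laid out backward from the last observed date, so the final window tests on the last 56 days; the function name `cv_cutoffs` is hypothetical.

```python
import pandas as pd

def cv_cutoffs(last_date, h=56, step_size=56, n_windows=7):
    """Return the last training date of each window, oldest first.

    Each window trains on data up to its cutoff and is evaluated on the
    following h days.
    """
    last_date = pd.Timestamp(last_date)
    cutoffs = []
    for i in range(n_windows):
        offset = h + (n_windows - 1 - i) * step_size
        cutoffs.append(last_date - pd.Timedelta(days=offset))
    return cutoffs
```

With h = 56, step_size = 56, and n_windows = 7, the test periods do not overlap and together cover the final 392 days of history, so each series needs more than 392 daily observations to leave any data for training the earliest window.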

Project Structure

├── forecasting_framework.py    # Main pipeline orchestration
├── forecasting_utils.py        # Core functions and utilities
├── requirements.txt            # Python dependencies
├── saved_models/               # Trained model artifacts (created on first run)
├── combined_forecasts.csv      # Daily forecasts (output)
├── weekly_forecast.csv         # Weekly aggregated forecasts (output)
└── cv_results_summary.csv      # Cross-validation metrics (output)

Installation

Install all required dependencies:

pip install -r requirements.txt

Usage

Configuration

Edit the following parameters in forecasting_framework.py:

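RETRAIN_MODELS = False            # Set True to retrain (slower); False loads saved models
MODEL_SAVE_PATH = 'saved_models'  # Directory for trained model artifacts
h = 56                            # Forecast horizon (days)
step_size = 56                    # Days between consecutive cross-validation windows
n_windows = 7                     # Number of cross-validation windows
freq = 'D'                        # Forecast frequency (daily)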

Input Files (Required)

sales_data.csv: Historical sales with columns: store_id, date, sales, open, promo, state_holiday, school_holiday
future_values.csv: Future exogenous variables with the same columns as sales_data.csv, excluding sales
metadata.csv: Store metadata: store_id, store_type, assortment, competition_distance
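One way the sales and metadata inputs could be combined into a single training frame is sketched here. Renaming to `unique_id`, `ds`, and `y` follows the default column convention of the Nixtla libraries (MLForecast, StatsForecast); whether the framework performs exactly this step is an assumption, and `prepare_training_frame` is a hypothetical name.

```python
import pandas as pd

def prepare_training_frame(sales, metadata):
    """Attach static store metadata to the sales history.

    sales: columns store_id, date, sales, open, promo, state_holiday, school_holiday
    metadata: columns store_id, store_type, assortment, competition_distance
    """
    # validate= raises if metadata contains duplicate store_id rows.
    df = sales.merge(metadata, on="store_id", how="left", validate="many_to_one")
    df["date"] = pd.to_datetime(df["date"])
    return df.rename(columns={"store_id": "unique_id", "date": "ds", "sales": "y"})

# Typical use (file names from the list above):
# train = prepare_training_frame(pd.read_csv("sales_data.csv"),
#                                pd.read_csv("metadata.csv"))
```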

Output Files

combined_forecasts.csv: Daily forecasts per store
weekly_forecast.csv: Weekly aggregated forecasts (8 weeks)
cv_results_summary.csv: Model evaluation metrics
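The step from the daily to the weekly output can be sketched as follows. It assumes that weeks are consecutive 7-day blocks counted from the first forecast day, not calendar weeks, and that the daily file has a `forecast` column; both are assumptions about the framework's output format.

```python
import pandas as pd

def to_weekly(daily):
    """Sum daily forecasts into consecutive 7-day blocks per store.

    daily: columns store_id, date, forecast (column names are hypothetical).
    """
    daily = daily.sort_values(["store_id", "date"]).copy()
    # Number the days 0, 1, 2, ... within each store, then bucket them by 7.
    daily["week"] = daily.groupby("store_id").cumcount() // 7 + 1
    return daily.groupby(["store_id", "week"], as_index=False)["forecast"].sum()
```

A 56-day horizon produces exactly 8 weekly rows per store.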

Performance Monitoring

The framework automatically checks for accuracy degradation by comparing the current MSE against baseline thresholds. If the MSE worsens by more than 10%, a warning is issued listing the affected store IDs.
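A minimal sketch of such a check, assuming per-store MSE values held in dictionaries and a relative tolerance of 10%. The function name and the data layout are hypothetical.

```python
import warnings

def degraded_stores(current_mse, baseline_mse, tolerance=0.10):
    """Return store IDs whose MSE rose by more than `tolerance` over the baseline."""
    flagged = []
    for store_id, baseline in baseline_mse.items():
        current = current_mse.get(store_id)
        if current is None or baseline <= 0:
            continue                      # no comparison possible for this store
        if (current - baseline) / baseline > tolerance:
            flagged.append(store_id)
    flagged.sort()
    if flagged:
        warnings.warn(f"Forecast accuracy degraded for stores: {flagged}")
    return flagged
```

Since a higher MSE means worse accuracy, a performance drop corresponds to a relative increase in MSE.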

Note: The first run must use RETRAIN_MODELS = True, which trains and saves the models and takes considerably longer. Subsequent runs with RETRAIN_MODELS = False load the saved models and execute faster.
