A sophisticated time series forecasting framework that combines machine learning and statistical methods through intelligent clustering-based model selection.
This framework automatically assigns forecasting models to time series based on their intrinsic characteristics. It uses K-Means clustering to categorize time series into two groups:
- Noisy/irregular series → Global Random Forest (MLForecast)
- Seasonal/stable series → Weighted Statistical Ensemble
This hybrid approach ensures each time series is forecasted with the most suitable method, improving overall prediction accuracy.
- Automated Model Selection: Clustering-based assignment eliminates manual model selection
- Hybrid Modeling: Combines ML (Random Forest) with statistical methods (ARIMA, ETS, Seasonal Naive, etc.)
- Hyperparameter Optimization: Automated tuning using Optuna (100 trials)
- Feature Importance Analysis: Identifies and uses only relevant predictive features
- Cross-Validation: Rigorous evaluation with multiple time windows
- Performance Monitoring: Automated detection of model degradation over time
- Exogenous Variables: Incorporates external features (promotions, holidays, store metadata)
Each time series is analyzed for:
- Seasonality strength (weekly patterns)
- Trend strength (directional movement)
- Residual variability (unexplained noise)
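The README does not say how these three quantities are computed. As a rough sketch only, the code below assumes a classical additive decomposition with a centered 7-day moving average as the trend, and the common "1 - Var(residual) / Var(component + residual)" definition of strength. The function name `strength_features` and its `period` parameter are hypothetical and are not taken from `forecasting_utils.py`.

```python
from statistics import mean, pstdev, pvariance


def strength_features(y, period=7):
    """Illustrative decomposition-based features (assumed, not the project's code)."""
    half = period // 2
    idx = list(range(half, len(y) - half))

    # Trend: centered moving average over one full period (period must be odd).
    trend = {t: mean(y[t - half:t + half + 1]) for t in idx}
    detrended = {t: y[t] - trend[t] for t in idx}

    # Seasonal component: mean detrended value per position in the cycle.
    by_pos = {}
    for t in idx:
        by_pos.setdefault(t % period, []).append(detrended[t])
    pos_mean = {p: mean(v) for p, v in by_pos.items()}
    centre = mean(list(pos_mean.values()))
    seasonal = {t: pos_mean[t % period] - centre for t in idx}

    # Residual: what is left after removing trend and seasonality.
    resid = {t: detrended[t] - seasonal[t] for t in idx}

    def strength(component):
        total = pvariance([component[t] + resid[t] for t in idx])
        if total == 0:
            return 0.0
        return max(0.0, 1.0 - pvariance(list(resid.values())) / total)

    return {
        "seasonal_strength": strength(seasonal),
        "trend_strength": strength(trend),
        "residual_variability": pstdev(list(resid.values())),
    }
```

A series needs at least a few full weeks of history for these estimates to be meaningful. In practice the residual variability would probably be scaled (for example by the series mean) so that stores with different sales volumes are comparable.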
- K-Means clustering (k=2) groups series by similarity
- High residual variability cluster → Random Forest
  - Noisy, irregular patterns unsuitable for classical statistical models
  - Leverages lag features and static exogenous variables
- Low residual variability cluster → InverseMSE-Best3 Ensemble
  - Strong weekly seasonality, low trend, stable patterns
  - Best-performing 3 models weighted by inverse MSE
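The assignment rule above can be sketched without any dependencies. This is a minimal illustration, not the project's implementation: it uses a plain Lloyd's-algorithm k-means with k=2 and a deterministic start in place of a library K-Means, and it assumes each series is described by a tuple `(seasonal_strength, trend_strength, residual_variability)` whose values are already on comparable scales. The names `two_means` and `assign_models` are hypothetical.

```python
from math import dist
from statistics import mean


def two_means(points, n_iter=50):
    """Lloyd's algorithm for k=2 on equal-length feature tuples."""
    # Deterministic start (a simplification): seed the centroids with the
    # points that have the smallest and largest value of the last feature.
    centroids = [min(points, key=lambda p: p[-1]),
                 max(points, key=lambda p: p[-1])]
    labels = []
    for _ in range(n_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min((0, 1), key=lambda k: dist(p, centroids[k])) for p in points]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(tuple(mean(c) for c in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


def assign_models(features):
    """Map each series ID to 'random_forest' or 'ensemble'."""
    ids = list(features)
    labels, centroids = two_means([features[i] for i in ids])
    # The cluster whose centroid has the higher residual variability is the noisy one.
    noisy = max((0, 1), key=lambda k: centroids[k][-1])
    return {i: "random_forest" if lab == noisy else "ensemble"
            for i, lab in zip(ids, labels)}
```

For example, `assign_models({"a": (0.9, 0.1, 0.05), "c": (0.2, 0.3, 0.9)})` sends the stable series `a` to the ensemble and the noisy series `c` to the Random Forest.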
- Global model trained on all matching time series (IDs 1-99 + similar IDs 100+)
- Feature engineering: Lags, rolling means, date features, static variables
- Optuna optimization: 100 trials for hyperparameters and feature selection
- Feature importance: Removes non-contributing features
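To make the feature-engineering step concrete, here is a small dependency-free sketch of lag, rolling-mean and date features for one daily series. The lag values (7 and 14), the 7-day window and the function name `make_features` are illustrative assumptions. The README does not list the actual lags, and the real pipeline builds its features through MLForecast rather than by hand.

```python
from datetime import date, timedelta
from statistics import mean


def make_features(start, values, lags=(7, 14), window=7):
    """Build one feature row per day that has enough history behind it."""
    rows = []
    warmup = max(max(lags), window)
    for t in range(warmup, len(values)):
        day = start + timedelta(days=t)
        row = {"date": day, "y": values[t], "day_of_week": day.weekday()}
        for lag in lags:
            row[f"lag_{lag}"] = values[t - lag]
        # Rolling mean over the `window` days strictly before t,
        # so the target y[t] never leaks into its own features.
        row[f"rolling_mean_{window}"] = mean(values[t - window:t])
        rows.append(row)
    return rows
```

Static variables such as `store_type` or `assortment` would simply be repeated on every row of the corresponding store.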
Base models evaluated per time series:
- Naive
- Window Average
- Random Walk with Drift
- Auto ARIMA
- Auto ETS
- Seasonal Naive
- Seasonal Window Average
For each series, the top 3 models are selected and weighted by inverse MSE.
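The weighting can be written in a few lines. The sketch below assumes that each model's cross-validation MSE for a given series is available as a dictionary, and that weights are normalized to sum to 1. The function names and the example model labels are hypothetical.

```python
def inverse_mse_weights(cv_mse, top_k=3):
    """Keep the top_k models with the lowest MSE and weight them by 1 / MSE."""
    best = sorted(cv_mse, key=cv_mse.get)[:top_k]
    # Assumes every MSE is strictly positive; a perfect score of 0 would need
    # special handling (for example a small epsilon).
    inverse = {model: 1.0 / cv_mse[model] for model in best}
    total = sum(inverse.values())
    return {model: value / total for model, value in inverse.items()}


def combine(forecasts, weights):
    """Weighted average of per-model forecasts of equal length."""
    horizon = len(next(iter(forecasts.values())))
    return [sum(weights[m] * forecasts[m][i] for m in weights)
            for i in range(horizon)]
```

With MSEs of 4.0, 2.0, 1.0 and 10.0 for four models, the worst model is dropped and the remaining three receive weights 1/7, 2/7 and 4/7, so the most accurate model contributes the most.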
- Multiple windows: Time-series cross-validation (7 windows)
- Metrics: MSE, MAPE
- Horizon: 56-day forecast (8 weeks)
- Weekly aggregation: Daily forecasts aggregated to weekly totals
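As an illustration of the evaluation layout, the sketch below computes rolling-origin window boundaries for a 56-day horizon, a 56-day step and 7 windows, and sums a daily forecast into weekly totals. It assumes the last window ends at the final observation and that a forecast starts on the first day of a week. Both are assumptions, and the helper names are hypothetical.

```python
def cv_windows(n_obs, h=56, step_size=56, n_windows=7):
    """Return (train_end, test_end) index pairs, oldest window first.

    Training data for a window is y[:train_end]; the test block is
    y[train_end:test_end].
    """
    windows = []
    for i in range(n_windows):
        test_end = n_obs - (n_windows - 1 - i) * step_size
        train_end = test_end - h
        if train_end <= 0:
            raise ValueError("not enough history for the requested windows")
        windows.append((train_end, test_end))
    return windows


def weekly_totals(daily, days_per_week=7):
    """Sum consecutive blocks of 7 daily values into weekly totals."""
    return [sum(daily[i:i + days_per_week])
            for i in range(0, len(daily), days_per_week)]
```

With these settings the seven test blocks do not overlap and together cover the last 392 days of history, so each series needs more than 392 observations.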
├── forecasting_framework.py # Main pipeline orchestration
├── forecasting_utils.py # Core functions and utilities
├── requirements.txt # Python dependencies
├── saved_models/ # Trained model artifacts (created on first run)
├── combined_forecasts.csv # Daily forecasts (output)
├── weekly_forecast.csv # Weekly aggregated forecasts (output)
└── cv_results_summary.csv # Cross-validation metrics (output)
Install all required dependencies:
pip install -r requirements.txt
Edit the following parameters in forecasting_framework.py:
RETRAIN_MODELS = False # Set True to retrain (slower)
MODEL_SAVE_PATH = 'saved_models'
h = 56 # Forecast horizon (days)
step_size = 56
n_windows = 7 # Cross-validation windows
freq = 'D' # Forecast frequency
Input files:
- sales_data.csv: Historical sales with columns: store_id, date, sales, open, promo, state_holiday, school_holiday
- future_values.csv: Future exogenous variables with the same columns (excluding sales)
- metadata.csv: Store metadata: store_id, store_type, assortment, competition_distance
Output files:
- combined_forecasts.csv: Daily forecasts per store
- weekly_forecast.csv: Weekly aggregated forecasts (8 weeks)
- cv_results_summary.csv: Model evaluation metrics
The framework automatically checks for accuracy degradation by comparing current MSE against baseline thresholds. If performance drops >10%, a warning is issued with affected store IDs.
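A check of this kind could look roughly like the following. The sketch assumes that "performance drops >10%" means the current MSE exceeds the stored baseline MSE by more than 10% in relative terms, and that both are kept per store ID. The function name `degraded_stores` and the `tolerance` parameter are hypothetical.

```python
def degraded_stores(current_mse, baseline_mse, tolerance=0.10):
    """Return the sorted store IDs whose MSE rose by more than `tolerance`."""
    flagged = []
    for store_id, baseline in baseline_mse.items():
        current = current_mse.get(store_id)
        # Skip stores without a current score or without a usable baseline.
        if current is None or baseline <= 0:
            continue
        if (current - baseline) / baseline > tolerance:
            flagged.append(store_id)
    return sorted(flagged)
```

The returned list can then be passed to `warnings.warn` or a logger so that the affected store IDs appear in the run output.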
Note: The first run with RETRAIN_MODELS = True trains and saves the models, which takes considerably longer. Subsequent runs with RETRAIN_MODELS = False load the saved models and execute faster.