Research Project | Forecasting Seminar | TUM | Winter 2025/26
This repository contains a comprehensive numerical study investigating whether XGBoost suffers from overfitting in time series forecasting tasks, and how it can be prevented through regularization and hyperparameter tuning.
Does XGBoost have an overfitting problem, and how can it be prevented?
How stable does the generalization of XGBoost vs. RandomForest remain as data noise increases?
- Compares default and tuned configurations of two models (XGBoost and RandomForest)
- Tests robustness under increasing noise levels (0%, 5%, 10%, 20%, 50%)
- Measures generalization gap as indicator of overfitting
RQ2.1: Which XGBoost hyperparameters have the strongest quantitative effect on overfitting?
- Evaluates 7 XGBoost hyperparameters across 3 levels (low, default, high)
- Quantifies each hyperparameter's impact on the generalization gap via Random Forest feature importance
- Uses ceteris paribus (all else equal) experimental design
RQ2.2: How do these XGBoost hyperparameters affect predictive performance (test RMSE)?
- Assesses trade-offs between accuracy and stability
- Identifies hyperparameters that improve both metrics vs. those with conflicts
Rossmann Store Sales Dataset
- Time series: 2,000 retail stores
- Frequency: Daily observations
- Target: Daily sales per store (`y`)
- Features:
  - Lagged: Customer counts (lag-56 to avoid data leakage)
  - Temporal: Day of week, day of year, month, quarter
  - Static: Store type, assortment level, competition distance
  - Dynamic: Promotion, open/closed indicators, school holidays
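The lagged customer feature can be sketched as follows. This is a minimal illustration, not the repository's code: the column names `store_id`, `ds`, and `customers` and the helper `add_lagged_customers` are assumptions. Because the lag equals the 56-day forecast horizon, every lagged value used inside a forecast window was observed before the forecast origin.

```python
import pandas as pd


def add_lagged_customers(df: pd.DataFrame, lag: int = 56) -> pd.DataFrame:
    """Add a per-store lagged customer count (hypothetical column names)."""
    out = df.sort_values(["store_id", "ds"]).copy()
    # Shift within each store so one store's history never fills another's rows.
    out[f"customers_lag{lag}"] = out.groupby("store_id")["customers"].shift(lag)
    return out


# Toy frame: 2 stores with 60 consecutive days each.
dates = pd.date_range("2015-01-01", periods=60, freq="D")
toy = pd.DataFrame({
    "store_id": [1] * 60 + [2] * 60,
    "ds": list(dates) * 2,
    "customers": list(range(60)) + list(range(100, 160)),
})
lagged = add_lagged_customers(toy)
```

The first 56 rows of each store have no lagged value and are left as missing; how the repository handles those rows is not shown here.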
.
├── README.md # This file
├── requirements.txt # Python dependencies
├── main.py # Main experiment runner
├── data/ # Input datasets
│ ├── sales_data.csv # Daily sales per store
│ ├── metadata.csv # Store attributes
│ └── monthly/ # (Supporting data)
├── src/ # Source code
│ ├── __init__.py
│ └── forecasting_utils.py # Data loading & cross-validation logic
├── results/ # Experiment outputs
│ ├── question1_results.csv # RQ1 evaluation metrics
│ ├── question2_results.csv # RQ2 evaluation metrics
│ └── *_feature_importance.csv # Feature importance tables
└── plots/ # Visualizations & analysis scripts
    └── results_interpretation_v2.py # Analysis & plotting
- Data Loading & Preprocessing (`src/forecasting_utils.py`)
  - Merge sales + metadata with left join
  - Fill missing open/closed indicators using store-specific weekday patterns
  - Create lagged customer feature (lag-56) to avoid future information leakage
  - One-hot encode categorical features for tree models
- Noise Injection (`apply_noise_level`)
  - Add Gaussian noise to sales target
  - Noise level = factor × σ(y), e.g., 0.1 = 10% of standard deviation
- Time Series Cross-Validation (`evaluate_mlforecast_timeseries_manual_cv`)
  - Rolling-origin cross-validation with 8 windows
  - Forecast horizon: 56 days (8 weeks)
  - Step size: 56 days (no overlap)
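The noise-injection and cross-validation steps above can be sketched with NumPy. The function names `apply_noise` and `rolling_origin_windows` are hypothetical and are not the repository's `apply_noise_level` or `evaluate_mlforecast_timeseries_manual_cv`. The sketch assumes an expanding training window (all observations before the forecast origin) and the population standard deviation for σ(y); the repository may differ on both points.

```python
import numpy as np


def apply_noise(y, factor, seed=0):
    """Return y plus Gaussian noise with standard deviation factor * std(y)."""
    y = np.asarray(y, dtype=float)
    if factor == 0:
        return y.copy()
    rng = np.random.default_rng(seed)
    return y + rng.normal(loc=0.0, scale=factor * y.std(), size=y.shape)


def rolling_origin_windows(n_obs, h=56, step_size=56, n_windows=8):
    """Yield (train_end, test_end) index bounds, oldest window first.

    Training uses observations [0, train_end); testing uses
    [train_end, test_end). The last window ends at the final observation.
    """
    for i in range(n_windows):
        test_end = n_obs - (n_windows - 1 - i) * step_size
        train_end = test_end - h
        if train_end <= 0:
            raise ValueError("not enough observations for the requested windows")
        yield train_end, test_end


# Example with an assumed series length of 1,000 daily observations.
windows = list(rolling_origin_windows(n_obs=1000))
```

With `h` equal to `step_size`, each test window starts exactly where the previous one ended, so the eight test windows do not overlap.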
- In-sample RMSE: Train RMSE (fit quality)
- Test RMSE: Forecast RMSE (generalization)
- Generalization Gap: Test RMSE - Train RMSE (overfitting indicator)
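As a small worked example of these metrics, the sketch below computes both RMSE values and the gap on toy numbers. The helper names are hypothetical; the optional normalization (gap divided by test RMSE) follows the `USE_NORMALIZED_GAP` setting described in the configuration.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def generalization_gap(train_rmse, test_rmse, normalized=False):
    """Test RMSE minus train RMSE, optionally divided by test RMSE."""
    gap = test_rmse - train_rmse
    return gap / test_rmse if normalized else gap


# Toy values: in-sample errors of (-10, 10, 0), forecast errors of (-30, 30, 0).
train_rmse = rmse([100, 200, 300], [110, 190, 300])
test_rmse = rmse([100, 200, 300], [130, 170, 300])
gap = generalization_gap(train_rmse, test_rmse)
normalized_gap = generalization_gap(train_rmse, test_rmse, normalized=True)
```

A positive gap means the model fits the training data better than it forecasts; the larger the gap, the stronger the indication of overfitting.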
- Models: XGBoost (default + tuned) vs. RandomForest (default + tuned)
- Noise levels: 0%, 5%, 10%, 20%, 50%
- Tuning: 10-iteration RandomizedSearchCV on 10% subset
- Total experiments: 2 models × 2 configs × 5 noise levels = 20 scenarios
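The scenario count above can be checked by enumerating the grid. This is only an illustration of the arithmetic; the dictionary keys are hypothetical and do not reflect how `main.py` stores its scenarios.

```python
from itertools import product

models = ["XGBoost", "RandomForest"]
configs = ["default", "tuned"]
noise_levels = [0.0, 0.05, 0.10, 0.20, 0.50]  # factors applied to the std of y

# One entry per (model, configuration, noise level) combination.
scenarios = [
    {"model": m, "config": c, "noise": n}
    for m, c, n in product(models, configs, noise_levels)
]
```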
- Base model: XGBoost
- Hyperparameters (7 total):
  - `learning_rate`: [0.01, 0.3, 0.6]
  - `max_depth`: [3, 6, 9]
  - `min_child_weight`: [0.5, 1.0, 10.0]
  - `subsample`: [0.5, 1.0, 1.0]
  - `reg_lambda`: [0.1, 1.0, 10.0]
  - `reg_alpha`: [0.0, 0.0, 5.0]
  - `n_estimators`: [50, 100, 500]
- Variation: 3³ = 27 combinations per noise level
- Design: Ceteris paribus (all others at default)
- Targets: Generalization gap + Test RMSE
- Analysis: Random Forest feature importance (1,000 trees)
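One way to read the ceteris paribus design is as a one-at-a-time sweep: each hyperparameter takes its three levels while all others stay at the default (middle) level. The sketch below follows that strict reading, which is an assumption; it produces 7 × 3 = 21 runs, so it does not reproduce the 27 combinations quoted above, and the repository's own enumeration may differ.

```python
# (low, default, high) levels as listed above.
GRID = {
    "learning_rate": (0.01, 0.3, 0.6),
    "max_depth": (3, 6, 9),
    "min_child_weight": (0.5, 1.0, 10.0),
    "subsample": (0.5, 1.0, 1.0),
    "reg_lambda": (0.1, 1.0, 10.0),
    "reg_alpha": (0.0, 0.0, 5.0),
    "n_estimators": (50, 100, 500),
}
DEFAULTS = {name: levels[1] for name, levels in GRID.items()}


def one_at_a_time(grid, defaults):
    """Yield (name, level_label, params), varying one hyperparameter per run."""
    for name, levels in grid.items():
        for label, value in zip(("low", "default", "high"), levels):
            params = dict(defaults)  # all other hyperparameters stay at default
            params[name] = value
            yield name, label, params


runs = list(one_at_a_time(GRID, DEFAULTS))
# Some runs coincide, e.g. every "default" level and the repeated
# subsample / reg_alpha values, so fewer distinct models are trained.
unique_param_sets = {tuple(sorted(p.items())) for _, _, p in runs}
```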
- Python 3.8+
- See `requirements.txt`
git clone <repo-url>
cd BF_XGBoost
pip install -r requirements.txt
# Edit main.py: set SELECTED_RQ = 1
python main.py
# Edit main.py: set SELECTED_RQ = 2
python main.py
# Visualize findings & generate feature importance plots
cd plots
python results_interpretation_v2.py # Edit INPUT_FILE to switch between RQ1/RQ2

Outputs:
- Feature importance tables: `results/*_feature_importance.csv`
- Plots:
  - `plots/robustness_rq1_*.png`: Noise robustness
  - `plots/rmse_comparison_*.png`: Train vs. test RMSE
  - `plots/question*_feature_importance.png`: Hyperparameter impact
  - `plots/hparam_tradeoff_final.png`: Accuracy vs. stability tradeoff
Answer: No. Default parameters are excellent at preventing overfitting.
- XGBoost with default hyperparameters maintains a low generalization gap across all tested noise levels
- Built-in regularization provides strong overfitting protection without manual tuning (by default only the L2 term is active, `reg_lambda = 1`; the L1 term `reg_alpha` defaults to 0)
- Trade-off: Default parameters may leave some predictive performance on the table compared to tuned alternatives
Key Findings:
| Scenario | Winner | Note |
|---|---|---|
| Default parameters | XGBoost | Prevents overfitting very well without tuning |
| Tuning required | RandomForest | Needs hyperparameter tuning to match/exceed XGBoost |
| Noisy data (high noise) | Tuned RandomForest | Best generalization gap on Rossmann dataset |
| Computational efficiency | XGBoost | Little time/computing power → use XGBoost |
- XGBoost generalization remains stable as noise increases
- RandomForest (tuned) achieves best robustness but requires careful hyperparameter optimization
- XGBoost is more practical when computing resources or tuning time are limited
RQ2.1: Strongest Drivers of Overfitting (Generalization Gap)
The following hyperparameters have the biggest quantitative effect on controlling overfitting:
- `learning_rate`: Controls boosting speed; lower rates improve generalization
- `max_depth`: Limits tree complexity; shallower trees reduce overfitting
- `n_estimators`: Number of boosting rounds; more rounds can increase overfitting if the other parameters are not tuned
Secondary factors:
- `reg_lambda`: L2 regularization; prevents weights from becoming too large
- `reg_alpha`: L1 regularization; can reduce feature importance variance
- `min_child_weight`, `subsample`: Moderate effect on generalization gap
RQ2.2: Trade-offs with Predictive Performance (Test RMSE)
- `learning_rate` & `n_estimators`: Often in tension with overfitting control
  - Lower learning rate → better generalization, slightly worse RMSE
  - Fewer estimators → less overfitting, but potential underfitting
- `max_depth`: Sweet spot exists
  - Too shallow → underfitting (high bias)
  - Too deep → overfitting (high variance)
- Regularization (`reg_lambda`, `reg_alpha`): Generally win-win
  - Improve both generalization gap AND test RMSE in many cases
Primary Strategies:
- Increase regularization: Tune `reg_lambda` and `reg_alpha` upward
- Control model complexity: Reduce `max_depth` and `n_estimators`
- Reduce learning rate: Lower `learning_rate` for slower, more stable boosting
- Use default parameters: Often sufficient for good generalization
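The strategies above can be expressed as overrides on the default parameter set. The sketch below is illustrative only: the helper `conservative_params` is hypothetical, and the override values are simply the most regularizing levels from the RQ2 grid, not tuned recommendations. Applying all of them at once may underfit, so in practice they would be adjusted one at a time.

```python
XGB_DEFAULTS = {
    "learning_rate": 0.3,
    "max_depth": 6,
    "min_child_weight": 1.0,
    "subsample": 1.0,
    "reg_lambda": 1.0,
    "reg_alpha": 0.0,
    "n_estimators": 100,
}


def conservative_params(defaults):
    """Return a copy of the defaults shifted toward stronger regularization."""
    params = dict(defaults)
    params.update(
        reg_lambda=10.0,     # stronger L2 penalty
        reg_alpha=5.0,       # switch on the L1 penalty
        max_depth=3,         # shallower trees
        n_estimators=50,     # fewer boosting rounds
        learning_rate=0.01,  # slower boosting
    )
    return params


conservative = conservative_params(XGB_DEFAULTS)
```

With the `xgboost` package installed, such a dictionary could be passed to the scikit-learn style estimator as `xgboost.XGBRegressor(**conservative)`.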
For Maximum Robustness to Noisy Data:
- Use tuned RandomForest or carefully tuned XGBoost
- Tune RandomForest: `max_depth`, `min_samples_leaf`, `max_features`
- Prioritize generalization gap metric over raw RMSE improvements
| File | Purpose |
|---|---|
| `main.py` | Main experiment orchestrator; generates results CSVs |
| `src/forecasting_utils.py` | Data loading, preprocessing, cross-validation |
| `plots/results_interpretation_v2.py` | Analysis, visualization, feature importance |
| `results/*.csv` | Experiment results (metrics per scenario) |
| `plots/*.png` | Visualizations (plots, feature importance) |
SELECTED_RQ = 1 # Which research question to run (1 or 2)
h = 56 # Forecast horizon (days)
step_size = 56 # CV window step size
n_windows = 8 # Number of rolling windows
n_jobs = -1 # Parallel workers

INPUT_FILE = "results/question1_results.csv" # or "question2_results.csv"
USE_NORMALIZED_GAP = True # Gap / Test RMSE
N_ESTIMATORS = 1000 # Random Forest trees