XGBoost Overfitting Analysis: A Numerical Study

Research Project | Forecasting Seminar | TUM | Winter 2025/26

Overview

This repository contains a numerical study of whether XGBoost overfits in time series forecasting tasks, and how overfitting can be prevented through regularization and hyperparameter tuning.

Research Questions

Fundamental Question

Does XGBoost have an overfitting problem, and how can it be prevented?

Research Question 1 (RQ1)

How stable does the generalization of XGBoost vs. RandomForest remain as data noise increases?

Compares default and tuned configurations of both models
Tests robustness under increasing noise levels (0%, 5%, 10%, 20%, 50%)
Measures generalization gap as indicator of overfitting

Research Question 2 (RQ2)

RQ2.1: Which XGBoost hyperparameters have the strongest quantitative effect on overfitting?

Evaluates 7 XGBoost hyperparameters across 3 levels (low, default, high)
Ranks each hyperparameter's impact on the generalization gap using Random Forest feature importance
Uses ceteris paribus (all else equal) experimental design

RQ2.2: How do these XGBoost hyperparameters affect predictive performance (test RMSE)?

Assesses trade-offs between accuracy and stability
Identifies hyperparameters that improve both metrics vs. those with conflicts

Dataset

Rossmann Store Sales Dataset

Time series: 2,000 retail stores
Frequency: Daily observations
Target: Daily sales per store (y)
Features:
- Lagged: Customer counts (lag-56 to avoid data leakage)
- Temporal: Day of week, day of year, month, quarter
- Static: Store type, assortment level, competition distance
- Dynamic: Promotion, open/closed indicators, school holidays
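
The lagged customer feature could be built per store along the following lines. This is a minimal sketch, not the code in src/forecasting_utils.py; the column names `store`, `date`, and `customers` and the helper `add_lagged_customers` are assumptions made for illustration.

```python
import pandas as pd


def add_lagged_customers(df: pd.DataFrame, lag: int = 56) -> pd.DataFrame:
    """Add a per-store lagged customer count (hypothetical column names)."""
    out = df.sort_values(["store", "date"]).copy()
    # Shift within each store so a row only sees the value from `lag` rows earlier.
    # With one row per store and day, `lag` rows correspond to `lag` days.
    out[f"customers_lag{lag}"] = out.groupby("store")["customers"].shift(lag)
    return out


# Tiny illustration with two stores and a lag of 2 days.
toy = pd.DataFrame(
    {
        "store": [1, 1, 1, 2, 2, 2],
        "date": pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-03"] * 2),
        "customers": [10, 20, 30, 100, 200, 300],
    }
)
lagged = add_lagged_customers(toy, lag=2)
```

The first `lag` rows of each store have no lagged value, so they would need to be dropped or imputed before training.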

Project Structure

.
├── README.md                           # This file
├── requirements.txt                    # Python dependencies
├── main.py                             # Main experiment runner
├── data/                               # Input datasets
│   ├── sales_data.csv                  # Daily sales per store
│   ├── metadata.csv                    # Store attributes
│   └── monthly/                        # (Supporting data)
├── src/                                # Source code
│   ├── __init__.py
│   └── forecasting_utils.py            # Data loading & cross-validation logic
├── results/                            # Experiment outputs
│   ├── question1_results.csv           # RQ1 evaluation metrics
│   ├── question2_results.csv           # RQ2 evaluation metrics
│   └── *_feature_importance.csv        # Feature importance tables
└── plots/                              # Visualizations & analysis scripts
    └── results_interpretation_v2.py    # Analysis & plotting

Methodology

Data Pipeline

Data Loading & Preprocessing (src/forecasting_utils.py)
- Merge sales + metadata with left join
- Fill missing open/closed indicators using store-specific weekday patterns
- Create lagged customer feature (lag-56, matching the 56-day forecast horizon) to avoid future information leakage
- One-hot encode categorical features for tree models

Noise Injection (apply_noise_level)
- Add Gaussian noise to sales target
- Noise standard deviation = factor × σ(y); e.g., a factor of 0.1 adds noise with 10% of the target's standard deviation

Time Series Cross-Validation (evaluate_mlforecast_timeseries_manual_cv)
- Rolling-origin cross-validation with 8 windows
- Forecast horizon: 56 days (8 weeks)
- Step size: 56 days (no overlap between test windows)
- In-sample RMSE: Train RMSE (fit quality)
- Test RMSE: Forecast RMSE (generalization)
- Generalization Gap: Test RMSE - Train RMSE (overfitting indicator)
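
The noise injection, window layout, and gap metric described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the repository's implementation: the function names `add_gaussian_noise`, `rolling_origin_windows`, `rmse`, and `generalization_gap` are hypothetical, and the expanding training window (all observations before each test window) is an assumption.

```python
import math
import random
import statistics


def add_gaussian_noise(y, factor, seed=0):
    """Add zero-mean Gaussian noise with standard deviation factor * std(y)."""
    if factor == 0:
        return list(y)
    rng = random.Random(seed)
    sigma = factor * statistics.pstdev(y)
    return [value + rng.gauss(0.0, sigma) for value in y]


def rolling_origin_windows(n_obs, h=56, step_size=56, n_windows=8):
    """Return (train_end, test_end) index pairs, oldest window first.

    Training uses observations [0, train_end) (assumed expanding window);
    testing uses [train_end, test_end). The last window ends at n_obs.
    """
    windows = []
    for i in range(n_windows):
        # Count back from the end of the series in steps of step_size.
        test_end = n_obs - (n_windows - 1 - i) * step_size
        train_end = test_end - h
        if train_end <= 0:
            raise ValueError("series too short for the requested windows")
        windows.append((train_end, test_end))
    return windows


def rmse(actual, predicted):
    """Root mean squared error of two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def generalization_gap(train_rmse, test_rmse):
    """Test RMSE minus train RMSE; larger values suggest more overfitting."""
    return test_rmse - train_rmse


windows = rolling_origin_windows(n_obs=1000)
noisy = add_gaussian_noise([100.0, 120.0, 90.0, 110.0], factor=0.1)
```

With `h` equal to `step_size`, each test window starts where the previous one ended, which is what "no overlap" means here.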

Experimental Design

RQ1: Model Comparison

Models: XGBoost (default + tuned) vs. RandomForest (default + tuned)
Noise levels: 0%, 5%, 10%, 20%, 50%
Tuning: 10-iteration RandomizedSearchCV on 10% subset
Total experiments: 2 models × 2 configs × 5 noise levels = 20 scenarios
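
The scenario count can be checked by enumerating the grid. The snippet below is only a sketch of that bookkeeping; the dictionary keys and variable names are illustrative and not taken from main.py.

```python
from itertools import product

MODELS = ["XGBoost", "RandomForest"]
CONFIGS = ["default", "tuned"]
NOISE_LEVELS = [0.0, 0.05, 0.10, 0.20, 0.50]

# One scenario per (model, configuration, noise level) combination.
scenarios = [
    {"model": model, "config": config, "noise": noise}
    for model, config, noise in product(MODELS, CONFIGS, NOISE_LEVELS)
]
print(len(scenarios))  # 20
```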

RQ2: Hyperparameter Sensitivity

Base model: XGBoost
Hyperparameters (7 total):
- learning_rate: [0.01, 0.3, 0.6]
- max_depth: [3, 6, 9]
- min_child_weight: [0.5, 1.0, 10.0]
- subsample: [0.5, 1.0, 1.0]
- reg_lambda: [0.1, 1.0, 10.0]
- reg_alpha: [0.0, 0.0, 5.0]
- n_estimators: [50, 100, 500]
Variation: 7 hyperparameters × 3 levels = 21 settings per noise level, one hyperparameter varied at a time
Design: Ceteris paribus (all others at default)
Targets: Generalization gap + Test RMSE
Analysis: Random Forest feature importance (1,000 trees)
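
A ceteris paribus design over the levels listed above can be enumerated as follows. This is a sketch under the assumption of a strict one-at-a-time reading, where the middle value of each triple is the default; `one_at_a_time` is a hypothetical helper, not a function from this repository.

```python
LEVELS = {
    "learning_rate": [0.01, 0.3, 0.6],
    "max_depth": [3, 6, 9],
    "min_child_weight": [0.5, 1.0, 10.0],
    "subsample": [0.5, 1.0, 1.0],
    "reg_lambda": [0.1, 1.0, 10.0],
    "reg_alpha": [0.0, 0.0, 5.0],
    "n_estimators": [50, 100, 500],
}
# Middle entry of each triple is treated as the default level.
DEFAULTS = {name: values[1] for name, values in LEVELS.items()}


def one_at_a_time(levels, defaults):
    """Vary one hyperparameter per run and keep all others at their default."""
    runs = []
    for name, values in levels.items():
        for value in values:
            params = dict(defaults)
            params[name] = value
            runs.append({"varied": name, "params": params})
    return runs


runs = one_at_a_time(LEVELS, DEFAULTS)
# Some settings coincide with the all-default configuration
# (e.g. subsample = 1.0, reg_alpha = 0.0), so fewer runs are distinct.
distinct = {tuple(sorted(run["params"].items())) for run in runs}
print(len(runs), len(distinct))  # 21 13
```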

Installation & Setup

Requirements

Python 3.8+
See requirements.txt

Installation

git clone <repo-url>
cd BF_XGBoost
pip install -r requirements.txt

Usage

Run Full Pipeline

RQ1: Model Robustness

# Edit main.py: set SELECTED_RQ = 1
python main.py

RQ2: Hyperparameter Sensitivity

# Edit main.py: set SELECTED_RQ = 2
python main.py

Analyze Results

# Visualize findings & generate feature importance plots
cd plots
python results_interpretation_v2.py  # Edit INPUT_FILE to switch between RQ1/RQ2

Outputs:

Feature importance tables: results/*_feature_importance.csv
Plots:
- plots/robustness_rq1_*.png — Noise robustness
- plots/rmse_comparison_*.png — Train vs. test RMSE
- plots/question*_feature_importance.png — Hyperparameter impact
- plots/hparam_tradeoff_final.png — Accuracy vs. stability tradeoff

Key Findings

Fundamental Question: Does XGBoost Have an Overfitting Problem?

Answer: Not in this study. With default hyperparameters, XGBoost kept the generalization gap low.

XGBoost with default hyperparameters maintains a low generalization gap across different noise levels
Built-in regularization provides strong overfitting protection without manual tuning; note that only L2 (reg_lambda = 1) is active by default, while L1 (reg_alpha) defaults to 0
Trade-off: Default parameters may leave some predictive performance on the table compared to tuned alternatives

Research Question 1: Noise Robustness (XGBoost vs. RandomForest)

Key Findings:

| Scenario | Winner | Note |
| --- | --- | --- |
| Default parameters | XGBoost | Prevents overfitting very well without tuning |
| Tuning required | RandomForest | Needs hyperparameter tuning to match/exceed XGBoost |
| Noisy data (high noise) | Tuned RandomForest | Best generalization gap on Rossmann dataset |
| Computational efficiency | XGBoost | Little time/computing power → use XGBoost |

XGBoost generalization remains stable as noise increases
RandomForest (tuned) achieves best robustness but requires careful hyperparameter optimization
XGBoost is more practical when computing resources or tuning time are limited

Research Question 2: Hyperparameter Sensitivity (RQ2.1 & RQ2.2)

RQ2.1: Strongest Drivers of Overfitting (Generalization Gap)

The following hyperparameters have the biggest quantitative effect on controlling overfitting:

learning_rate: Controls the step size of each boosting round; lower rates improve generalization
max_depth: Limits tree complexity; shallower trees reduce overfitting
n_estimators: Number of boosting rounds; more rounds can increase overfitting if the other parameters are not tuned

Secondary factors:

reg_lambda: L2 regularization on leaf weights; prevents weights from becoming too large
reg_alpha: L1 regularization on leaf weights; pushes small weights toward zero
min_child_weight, subsample: Moderate effect on generalization gap

RQ2.2: Trade-offs with Predictive Performance (Test RMSE)

learning_rate & n_estimators: Often in tension with overfitting control
- Lower learning rate → better generalization, slightly worse RMSE
- Fewer estimators → less overfitting, but potential underfitting
max_depth: Sweet spot exists
- Too shallow → underfitting (high bias)
- Too deep → overfitting (high variance)
Regularization (reg_lambda, reg_alpha): Generally win-win
- Improve both generalization gap AND test RMSE in many cases

How Can Overfitting Be Prevented?

Primary Strategies:

Increase regularization: Tune reg_lambda and reg_alpha upward
Control model complexity: Reduce max_depth and n_estimators
Reduce learning rate: Lower learning_rate for slower, more stable boosting
Use default parameters: Often sufficient for good generalization
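
As a starting point, the strategies above could be expressed as overrides on top of the default settings. The values below are taken from the low/high levels of the RQ2 grid purely for illustration; they are not tuned recommendations from this study, and `regularized_params` is a hypothetical helper. learning_rate and n_estimators are left at their defaults here because the findings describe them as being in tension with test RMSE.

```python
# Default levels as listed in the RQ2 hyperparameter grid.
DEFAULT_PARAMS = {
    "learning_rate": 0.3,
    "max_depth": 6,
    "min_child_weight": 1.0,
    "subsample": 1.0,
    "reg_lambda": 1.0,
    "reg_alpha": 0.0,
    "n_estimators": 100,
}

# Illustrative overrides: stronger regularization and shallower trees.
REGULARIZED_OVERRIDES = {
    "max_depth": 3,
    "reg_lambda": 10.0,
    "reg_alpha": 5.0,
}


def regularized_params(defaults, overrides):
    """Merge overrides into the defaults, rejecting unknown parameter names."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"unknown hyperparameters: {sorted(unknown)}")
    return {**defaults, **overrides}


params = regularized_params(DEFAULT_PARAMS, REGULARIZED_OVERRIDES)
# If xgboost is installed, the dict can be passed on as keyword arguments:
#   from xgboost import XGBRegressor
#   model = XGBRegressor(**params)
```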

For Maximum Robustness to Noisy Data:

Use tuned RandomForest or carefully tuned XGBoost
Tune RandomForest: max_depth, min_samples_leaf, max_features
Prioritize generalization gap metric over raw RMSE improvements
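
A randomized search over the three RandomForest parameters named above might look like the sketch below. The search space, the synthetic data, the small forest size, and the use of `TimeSeriesSplit` are all assumptions made for illustration; only the 10-iteration budget is taken from the experimental design.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Hypothetical search space for the three parameters named above.
param_distributions = {
    "max_depth": [4, 8, 16, None],
    "min_samples_leaf": [1, 5, 20],
    "max_features": [0.3, 0.6, 1.0],
}

# Synthetic stand-in for the training data.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=300)

search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=0),
    param_distributions=param_distributions,
    n_iter=10,  # same iteration budget as in the study
    cv=TimeSeriesSplit(n_splits=3),  # folds respect temporal order
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(X, y)
best = search.best_params_
```

To follow the advice of prioritizing the generalization gap, the selection criterion would have to compare train and test error per candidate rather than test error alone; the sketch above selects on validation RMSE only.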

File Descriptions

| File | Purpose |
| --- | --- |
| `main.py` | Main experiment orchestrator; generates results CSVs |
| `src/forecasting_utils.py` | Data loading, preprocessing, cross-validation |
| `plots/results_interpretation_v2.py` | Analysis, visualization, feature importance |
| `results/*.csv` | Experiment results (metrics per scenario) |
| `plots/*.png` | Visualizations (plots, feature importance) |

Configuration

Adjustable Parameters (main.py)

SELECTED_RQ = 1               # Which research question to run (1 or 2)
h = 56                        # Forecast horizon (days)
step_size = 56                # CV window step size
n_windows = 8                 # Number of rolling windows
n_jobs = -1                   # Parallel workers

Analysis Parameters (plots/results_interpretation_v2.py)

INPUT_FILE = "results/question1_results.csv"  # or "results/question2_results.csv"
USE_NORMALIZED_GAP = True                     # Gap / Test RMSE
N_ESTIMATORS = 1000                           # Random Forest trees
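
The two analysis settings can be illustrated as follows: the normalized gap divides the generalization gap by the test RMSE, and a Random Forest fitted on hyperparameter values ranks their influence on that gap. This is a sketch on synthetic data, not the logic of results_interpretation_v2.py; `normalized_gap` and `rank_hyperparameters` are hypothetical names, and the example uses 200 trees instead of 1000 to keep it fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def normalized_gap(train_rmse, test_rmse):
    """Generalization gap divided by test RMSE."""
    return (test_rmse - train_rmse) / test_rmse


def rank_hyperparameters(X, gap, names, n_estimators=1000, seed=0):
    """Fit a forest on hyperparameter values and sort them by importance."""
    forest = RandomForestRegressor(n_estimators=n_estimators, random_state=seed)
    forest.fit(X, gap)
    importances = forest.feature_importances_
    order = np.argsort(importances)[::-1]
    return [(names[i], float(importances[i])) for i in order]


# Synthetic stand-in for a results table: the gap depends on max_depth only.
rng = np.random.default_rng(0)
X = np.column_stack(
    [rng.choice([3, 6, 9], 200), rng.choice([0.01, 0.3, 0.6], 200)]
)
gap = 0.05 * X[:, 0] + rng.normal(0.0, 0.01, 200)
ranking = rank_hyperparameters(
    X, gap, ["max_depth", "learning_rate"], n_estimators=200
)
```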
