felix-nack/XGBoost-Numerical-Study

XGBoost Overfitting Analysis: A Numerical Study

Research Project | Forecasting Seminar | TUM | Winter 2025/26

Overview

This repository contains a comprehensive numerical study investigating whether XGBoost suffers from overfitting in time series forecasting tasks, and how it can be prevented through regularization and hyperparameter tuning.

Research Questions

Fundamental Question

Does XGBoost have an overfitting problem, and how can it be prevented?

Research Question 1 (RQ1)

How stable does the generalization of XGBoost vs. RandomForest remain as data noise increases?

  • Compares default and tuned configurations of both models
  • Tests robustness under increasing noise levels (0%, 5%, 10%, 20%, 50%)
  • Measures generalization gap as indicator of overfitting

Research Question 2 (RQ2)

RQ2.1: Which XGBoost hyperparameters have the strongest quantitative effect on overfitting?

  • Evaluates 7 XGBoost hyperparameters across 3 levels (low, default, high)
  • Measures impact on generalization gap via Random Forest feature importance
  • Uses ceteris paribus (all else equal) experimental design

RQ2.2: How do these XGBoost hyperparameters affect predictive performance (test RMSE)?

  • Assesses trade-offs between accuracy and stability
  • Identifies hyperparameters that improve both metrics vs. those with conflicts

Dataset

Rossmann Store Sales Dataset

  • Time series: 2,000 retail stores
  • Frequency: Daily observations
  • Target: Daily sales per store (y)
  • Features:
    • Lagged: Customer counts (lag-56 to avoid data leakage)
    • Temporal: Day of week, day of year, month, quarter
    • Static: Store type, assortment level, competition distance
    • Dynamic: Promotion, open/closed indicators, school holidays
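
The lagged customer feature can be illustrated with a minimal sketch. It uses only the standard library and hypothetical data; the helper name `lag_series` and the pandas column names in the comment are assumptions, not taken from src/forecasting_utils.py.

```python
from typing import List, Optional

LAG = 56  # same length as the 56-day forecast horizon used in this study


def lag_series(values: List[float], lag: int = LAG) -> List[Optional[float]]:
    """Shift one store's series forward by `lag` steps.

    Position t of the result holds the value observed at t - lag. The first
    `lag` positions have no history and are filled with None.
    """
    if lag <= 0:
        raise ValueError("lag must be positive")
    padding: List[Optional[float]] = [None] * min(lag, len(values))
    return padding + list(values[: max(len(values) - lag, 0)])


# Hypothetical daily customer counts for a single store (60 days).
customers = [float(100 + day) for day in range(60)]
customers_lag56 = lag_series(customers)

# With pandas, a per-store equivalent would be along the lines of:
#   df["customers_lag56"] = df.groupby("store")["customers"].shift(56)
# where the column names "store" and "customers" are assumptions.
```

Because the lag equals the forecast horizon, every value the feature needs inside a 56-day test window was already observed before the forecast origin.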

Project Structure

.
├── README.md                           # This file
├── requirements.txt                    # Python dependencies
├── main.py                             # Main experiment runner
├── data/                               # Input datasets
│   ├── sales_data.csv                  # Daily sales per store
│   ├── metadata.csv                    # Store attributes
│   └── monthly/                        # (Supporting data)
├── src/                                # Source code
│   ├── __init__.py
│   └── forecasting_utils.py            # Data loading & cross-validation logic
├── results/                            # Experiment outputs
│   ├── question1_results.csv           # RQ1 evaluation metrics
│   ├── question2_results.csv           # RQ2 evaluation metrics
│   └── *_feature_importance.csv        # Feature importance tables
└── plots/                              # Visualizations & analysis scripts
    └── results_interpretation_v2.py    # Analysis & plotting

Methodology

Data Pipeline

  1. Data Loading & Preprocessing (src/forecasting_utils.py)

    • Merge sales + metadata with left join
    • Fill missing open/closed indicators using store-specific weekday patterns
    • Create lagged customer feature (lag-56) to avoid future information leakage
    • One-hot encode categorical features for tree models
  2. Noise Injection (apply_noise_level)

    • Add zero-mean Gaussian noise to the sales target
    • Noise standard deviation = factor × σ(y), e.g., a factor of 0.1 adds noise with 10% of the target's standard deviation
  3. Time Series Cross-Validation (evaluate_mlforecast_timeseries_manual_cv)

    • Rolling-origin cross-validation with 8 windows
    • Forecast horizon: 56 days (8 weeks)
    • Step size: 56 days (no overlap)
    • In-sample RMSE: Train RMSE (fit quality)
    • Test RMSE: Forecast RMSE (generalization)
    • Generalization Gap: Test RMSE - Train RMSE (overfitting indicator)
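
The rolling-origin scheme and the gap metric above can be sketched as follows. This is an illustration under stated assumptions, not the code in evaluate_mlforecast_timeseries_manual_cv: it assumes an expanding training window that starts at the first observation and a last test window that ends at the final observation. The function names and the series length of 900 are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def rolling_origin_windows(
    n_obs: int, h: int = 56, step_size: int = 56, n_windows: int = 8
) -> List[Tuple[int, int]]:
    """Return (cutoff, test_end) index pairs, oldest window first.

    Training uses observations [0, cutoff); testing uses [cutoff, test_end).
    The last window ends at the final observation.
    """
    first_cutoff = n_obs - h - (n_windows - 1) * step_size
    if first_cutoff <= 0:
        raise ValueError("series too short for the requested windows")
    return [
        (first_cutoff + i * step_size, first_cutoff + i * step_size + h)
        for i in range(n_windows)
    ]


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error of two equally long, non-empty sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def generalization_gap(train_rmse: float, test_rmse: float) -> float:
    """Test RMSE minus train RMSE; larger values suggest more overfitting."""
    return test_rmse - train_rmse


# Hypothetical series length; the defaults mirror h, step_size and n_windows.
windows = rolling_origin_windows(n_obs=900)
```

With step_size equal to h, consecutive test windows are adjacent and never overlap, so the eight windows together cover the last 448 days of the series.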

Experimental Design

RQ1: Model Comparison

  • Models: XGBoost (default + tuned) vs. RandomForest (default + tuned)
  • Noise levels: 0%, 5%, 10%, 20%, 50%
  • Tuning: 10-iteration RandomizedSearchCV on 10% subset
  • Total experiments: 2 models × 2 configs × 5 noise levels = 20 scenarios
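
As a rough illustration of how the five noise levels could be applied to the target, the sketch below adds zero-mean Gaussian noise whose standard deviation is a fraction of the target's standard deviation. It is not the repository's apply_noise_level; the function name `add_gaussian_noise`, the use of the population standard deviation, the seed handling, and the sales values are all assumptions.

```python
import random
import statistics
from typing import List, Sequence

NOISE_LEVELS = [0.0, 0.05, 0.10, 0.20, 0.50]  # 0%, 5%, 10%, 20%, 50%


def add_gaussian_noise(
    y: Sequence[float], factor: float, seed: int = 0
) -> List[float]:
    """Return y plus zero-mean Gaussian noise with std = factor * std(y)."""
    if factor < 0:
        raise ValueError("factor must be non-negative")
    if factor == 0:
        return list(y)  # 0% noise leaves the target unchanged
    rng = random.Random(seed)  # local generator keeps runs reproducible
    noise_std = factor * statistics.pstdev(y)
    return [value + rng.gauss(0.0, noise_std) for value in y]


# Hypothetical daily sales for one store with a weekly pattern (28 days).
sales = [float(5000 + 100 * (day % 7)) for day in range(28)]
noisy_targets = {level: add_gaussian_noise(sales, level) for level in NOISE_LEVELS}
```

Scaling the noise by σ(y) keeps the levels comparable across targets with different magnitudes.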

RQ2: Hyperparameter Sensitivity

  • Base model: XGBoost
  • Hyperparameters (7 total):
    • learning_rate: [0.01, 0.3, 0.6]
    • max_depth: [3, 6, 9]
    • min_child_weight: [0.5, 1.0, 10.0]
    • subsample: [0.5, 1.0, 1.0]
    • reg_lambda: [0.1, 1.0, 10.0]
    • reg_alpha: [0.0, 0.0, 5.0]
    • n_estimators: [50, 100, 500]
  • Variation: 3³ = 27 combinations per noise level
  • Design: Ceteris paribus (all others at default)
  • Targets: Generalization gap + Test RMSE
  • Analysis: Random Forest feature importance (1,000 trees)
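
One possible reading of the ceteris paribus design is a one-at-a-time grid: each hyperparameter is varied over its three levels while all others stay at their default (middle) level. The sketch below builds such a grid from the levels listed above. The function name is hypothetical, and the grid in main.py may be constructed differently: this reading yields 21 runs (13 distinct configurations), not the 27 combinations stated above.

```python
from typing import Dict, List, Tuple, Union

Number = Union[int, float]

# (low, default, high) per hyperparameter, as listed above.
LEVELS: Dict[str, Tuple[Number, Number, Number]] = {
    "learning_rate": (0.01, 0.3, 0.6),
    "max_depth": (3, 6, 9),
    "min_child_weight": (0.5, 1.0, 10.0),
    "subsample": (0.5, 1.0, 1.0),
    "reg_lambda": (0.1, 1.0, 10.0),
    "reg_alpha": (0.0, 0.0, 5.0),
    "n_estimators": (50, 100, 500),
}


def one_at_a_time_configs(
    levels: Dict[str, Tuple[Number, Number, Number]]
) -> List[Dict[str, Number]]:
    """Vary one hyperparameter at a time; keep the rest at the middle level."""
    defaults = {name: values[1] for name, values in levels.items()}
    configs = []
    for name, values in levels.items():
        for value in values:
            config = dict(defaults)
            config[name] = value
            configs.append(config)
    return configs


configs = one_at_a_time_configs(LEVELS)
# Several runs coincide with the all-default configuration, so deduplicate.
unique_configs = {tuple(sorted(config.items())) for config in configs}
```

Deduplicating matters here because the default level appears once per hyperparameter, and subsample and reg_alpha each list their default value twice.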

Installation & Setup

Requirements

  • Python 3.8+
  • See requirements.txt

Installation

git clone <repo-url>
cd XGBoost-Numerical-Study
pip install -r requirements.txt

Usage

Run Full Pipeline

RQ1: Model Robustness

# Edit main.py: set SELECTED_RQ = 1
python main.py

RQ2: Hyperparameter Sensitivity

# Edit main.py: set SELECTED_RQ = 2
python main.py

Analyze Results

# Visualize findings & generate feature importance plots
cd plots
python results_interpretation_v2.py  # Edit INPUT_FILE to switch between RQ1/RQ2

Outputs:

  • Feature importance tables: results/*_feature_importance.csv
  • Plots:
    • plots/robustness_rq1_*.png — Noise robustness
    • plots/rmse_comparison_*.png — Train vs. test RMSE
    • plots/question*_feature_importance.png — Hyperparameter impact
    • plots/hparam_tradeoff_final.png — Accuracy vs. stability tradeoff

Key Findings

Fundamental Question: Does XGBoost Have an Overfitting Problem?

Answer: Not with default parameters in this study. The defaults kept the generalization gap low without any tuning.

  • XGBoost with default hyperparameters maintains a low generalization gap across the tested noise levels
  • Built-in regularization provides strong overfitting protection without manual tuning (L2 is active by default via reg_lambda = 1.0; L1 via reg_alpha is off by default at 0.0)
  • Trade-off: Default parameters may give up some predictive performance compared to tuned alternatives

Research Question 1: Noise Robustness (XGBoost vs. RandomForest)

Key Findings:

| Scenario | Winner | Note |
|---|---|---|
| Default parameters | XGBoost | Prevents overfitting very well without tuning |
| Tuning required | RandomForest | Needs hyperparameter tuning to match/exceed XGBoost |
| Noisy data (high noise) | Tuned RandomForest | Best generalization gap on Rossmann dataset |
| Computational efficiency | XGBoost | Little time/computing power → use XGBoost |

  • XGBoost generalization remains stable as noise increases
  • RandomForest (tuned) achieves best robustness but requires careful hyperparameter optimization
  • XGBoost is more practical when computing resources or tuning time are limited

Research Question 2: Hyperparameter Sensitivity (RQ2.1 & RQ2.2)

RQ2.1: Strongest Drivers of Overfitting (Generalization Gap)

The following hyperparameters have the biggest quantitative effect on controlling overfitting:

  1. learning_rate: Controls the step size of each boosting round; lower rates improve generalization
  2. max_depth: Limits tree complexity; shallower trees reduce overfitting
  3. n_estimators: Number of boosting rounds; more rounds can increase overfitting if the other parameters are not tuned

Secondary factors:

  • reg_lambda: L2 regularization; penalizes large leaf weights
  • reg_alpha: L1 regularization; shrinks leaf weights toward zero
  • min_child_weight, subsample: Moderate effect on generalization gap

RQ2.2: Trade-offs with Predictive Performance (Test RMSE)

  • learning_rate & n_estimators: Often in tension with overfitting control

    • Lower learning rate → better generalization, slightly worse RMSE
    • Fewer estimators → less overfitting, but potential underfitting
  • max_depth: Sweet spot exists

    • Too shallow → underfitting (high bias)
    • Too deep → overfitting (high variance)
  • Regularization (reg_lambda, reg_alpha): Generally win-win

    • Improve both generalization gap AND test RMSE in many cases

How Can Overfitting Be Prevented?

Primary Strategies:

  1. Increase regularization: Tune reg_lambda and reg_alpha upward
  2. Control model complexity: Reduce max_depth and n_estimators
  3. Reduce learning rate: Lower learning_rate for slower, more stable boosting
  4. Use default parameters: Often sufficient for good generalization
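
The strategies above can be expressed as a parameter set. The sketch below contrasts the XGBoost defaults for the studied parameters with a more conservative variant. The conservative values are illustrative assumptions, not tuned results from this study.

```python
from typing import Dict, Union

Number = Union[int, float]

# Defaults of the XGBoost scikit-learn wrapper for the parameters studied here.
DEFAULT_PARAMS: Dict[str, Number] = {
    "learning_rate": 0.3,
    "max_depth": 6,
    "n_estimators": 100,
    "reg_lambda": 1.0,
    "reg_alpha": 0.0,
}

# Illustrative, more conservative settings (assumed values, not study results).
CONSERVATIVE_PARAMS: Dict[str, Number] = {
    **DEFAULT_PARAMS,
    "reg_lambda": 10.0,    # strategy 1: stronger L2 regularization
    "reg_alpha": 5.0,      # strategy 1: enable L1 regularization
    "max_depth": 3,        # strategy 2: shallower trees
    "learning_rate": 0.1,  # strategy 3: slower boosting
}

# If xgboost is installed, either dict can be passed to the regressor:
#   from xgboost import XGBRegressor
#   model = XGBRegressor(**CONSERVATIVE_PARAMS)
```

Any such setting should be checked against both the generalization gap and the test RMSE, since a smaller gap can come at the cost of underfitting.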

For Maximum Robustness to Noisy Data:

  • Use tuned RandomForest or carefully tuned XGBoost
  • Tune RandomForest: max_depth, min_samples_leaf, max_features
  • Prioritize generalization gap metric over raw RMSE improvements

File Descriptions

| File | Purpose |
|---|---|
| main.py | Main experiment orchestrator; generates results CSVs |
| src/forecasting_utils.py | Data loading, preprocessing, cross-validation |
| plots/results_interpretation_v2.py | Analysis, visualization, feature importance |
| results/*.csv | Experiment results (metrics per scenario) |
| plots/*.png | Visualizations (plots, feature importance) |

Configuration

Adjustable Parameters (main.py)

SELECTED_RQ = 1               # Which research question to run (1 or 2)
h = 56                        # Forecast horizon (days)
step_size = 56                # CV window step size
n_windows = 8                 # Number of rolling windows
n_jobs = -1                   # Parallel workers

Analysis Parameters (plots/results_interpretation_v2.py)

INPUT_FILE = "results/question1_results.csv"  # or "results/question2_results.csv"
USE_NORMALIZED_GAP = True                     # Gap / Test RMSE
N_ESTIMATORS = 1000                           # Random Forest trees
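
The normalized gap enabled by USE_NORMALIZED_GAP divides the generalization gap by the test RMSE. A minimal sketch of that calculation follows. The CSV excerpt is made-up sample data, and its column names are assumptions rather than the actual schema of the results files.

```python
import csv
import io

# Hypothetical excerpt; values and column names are illustrative only.
SAMPLE_RESULTS = """model,noise_level,train_rmse,test_rmse
example_model,0.0,800.0,1000.0
example_model,0.5,900.0,1200.0
"""


def normalized_gap(train_rmse: float, test_rmse: float) -> float:
    """Generalization gap as a fraction of test RMSE."""
    if test_rmse <= 0:
        raise ValueError("test_rmse must be positive")
    return (test_rmse - train_rmse) / test_rmse


rows = list(csv.DictReader(io.StringIO(SAMPLE_RESULTS)))
gaps = [
    normalized_gap(float(row["train_rmse"]), float(row["test_rmse"]))
    for row in rows
]
```

Normalizing makes gaps comparable across noise levels, where the absolute RMSE values differ.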
