H2O version, Operating System and Environment
- H2O version: 3.46.0.9
- Local run: Ubuntu 22.04, OpenJDK 11, 20 logical CPUs, 15 GiB RAM
- Distributed run: Docker Compose with 3 H2O nodes
- Docker node config: 2 GB heap per node, nthreads=4 per node, OpenJDK 17
- Not running on Kubernetes or Hadoop
I have tested this on H2O 3.46.0.9.
Actual behavior
H2OGradientBoostingEstimator(distribution="quantile") is much slower than gaussian and tweedie on the same regression dataset and near-identical GBM settings.
The slowdown appears on a single-node run and becomes much larger on a 3-node Docker H2O cluster. Predictions and metrics are consistent across the local and Docker runs, so this looks like a performance issue rather than a correctness issue.
I originally noticed this while using GBM with quantile loss on a much larger work dataset. Training was so slow that the model was almost impractical to train. I cannot share the company dataset or environment, so I created this smaller reproducible setup to demonstrate the same pattern and ask for guidance on whether this is expected or can be improved.
Summary from profiling/results/cpu_vs_docker3_comparison_superconductivity.csv (added in zip file):
| model | CPU JVM train seconds | CPU ms/tree | Docker 3-node JVM train seconds | Docker 3-node ms/tree |
|---|---|---|---|---|
| gaussian | 62 | 37.736 | 220 | 133.901 |
| tweedie | 59 | 35.393 | 223 | 133.773 |
| quantile | 166 | 83.000 | 1216 | 608.000 |
Note: gaussian and tweedie stopped before 2000 trees in these runs, while quantile built all 2000 trees. For that reason I am comparing ms_per_tree, not only total train time.
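To make the per-tree comparison concrete, here is a minimal sketch that recomputes the quantile slowdown ratios from the ms/tree figures in the summary table. The helper names (`per_tree_ms`, `slowdown`) are hypothetical and are not part of the scripts in profiling/.

```python
# ms/tree values copied from the summary table above.
ms_per_tree = {
    "cpu": {"gaussian": 37.736, "tweedie": 35.393, "quantile": 83.000},
    "docker3": {"gaussian": 133.901, "tweedie": 133.773, "quantile": 608.000},
}


def per_tree_ms(train_seconds, trees_built):
    """Average training time per tree, in milliseconds."""
    return 1000.0 * train_seconds / trees_built


def slowdown(env, baseline):
    """Ratio of quantile ms/tree to the baseline distribution's ms/tree."""
    row = ms_per_tree[env]
    return row["quantile"] / row[baseline]


# Quantile built all 2000 trees, so its table entries can be cross-checked.
assert per_tree_ms(166, 2000) == ms_per_tree["cpu"]["quantile"]
assert per_tree_ms(1216, 2000) == ms_per_tree["docker3"]["quantile"]

for env in ms_per_tree:
    for baseline in ("gaussian", "tweedie"):
        print(f"{env}: quantile vs {baseline} = {slowdown(env, baseline):.2f}x")
```

Comparing ratios rather than raw seconds keeps the different tree counts of the three runs out of the picture.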
Expected behavior
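I expected quantile loss to be somewhat more expensive than gaussian/tweedie, but not to show this large a per-tree gap (roughly 2.2x on a single node and about 4.5x on the 3-node cluster, based on the ms/tree figures above), especially with the same data, model settings, and tree budget.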
If this is expected behavior, it would be useful to understand which part of the quantile implementation makes distributed GBM training more expensive and whether there are recommended settings to reduce the overhead.
I am filing this as a performance investigation rather than a confirmed bug. If maintainers think this belongs in Discussions instead, I am happy to move or reframe it.
Steps to reproduce
The reproduction scripts and outputs are in profiling/.
Run single-node CPU:
cd profiling
python run_cpu.py
Run Docker 3-node:
cd profiling/docker_3node
./run_docker_3node.sh
Generate comparison and flamegraph tables:
cd profiling
python compare_results.py
python analyze_flamegraphs.py
All three losses use the same GBM settings except for distribution-specific parameters:
ntrees=2000
max_depth=12
learn_rate=0.005
min_rows=1
nbins=100
sample_rate=1.0
col_sample_rate=1.0
score_tree_interval=10
stopping_rounds=0
Distributions tested:
distribution="gaussian"
distribution="tweedie", tweedie_power=1.5
distribution="quantile", quantile_alpha=0.5
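For readers who do not want to open the zip, the settings above can be expressed roughly as follows with the H2O Python API. This is a sketch rather than the actual content of run_cpu.py: the helper names `build_params` and `train_all`, and the arguments `train`, `valid`, and `response`, are assumptions for illustration.

```python
# Shared GBM settings, copied from the list above.
COMMON_PARAMS = {
    "ntrees": 2000,
    "max_depth": 12,
    "learn_rate": 0.005,
    "min_rows": 1,
    "nbins": 100,
    "sample_rate": 1.0,
    "col_sample_rate": 1.0,
    "score_tree_interval": 10,
    "stopping_rounds": 0,
}

# Distribution-specific parameters, one entry per tested loss.
DISTRIBUTION_PARAMS = {
    "gaussian": {"distribution": "gaussian"},
    "tweedie": {"distribution": "tweedie", "tweedie_power": 1.5},
    "quantile": {"distribution": "quantile", "quantile_alpha": 0.5},
}


def build_params(name):
    """Merge the shared settings with one distribution's parameters."""
    return {**COMMON_PARAMS, **DISTRIBUTION_PARAMS[name]}


def train_all(train, valid, response):
    """Train one GBM per distribution on the same H2OFrames.

    `train`, `valid`, and `response` are hypothetical arguments; the
    caller is assumed to have started H2O and loaded the frames.
    """
    # Imported lazily so the parameter helpers work without h2o installed.
    from h2o.estimators import H2OGradientBoostingEstimator

    models = {}
    for name in DISTRIBUTION_PARAMS:
        model = H2OGradientBoostingEstimator(**build_params(name))
        model.train(y=response, training_frame=train, validation_frame=valid)
        models[name] = model
    return models
```

Keeping the shared settings in one dictionary makes it easy to confirm that only the distribution-specific keys differ between runs.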
I can provide a zip containing the reproduction scripts, Docker files, processed train/validation/test CSVs, result summaries, and selected flamegraphs.
Upload logs
No exception is thrown. I can upload H2O logs if useful, but the most relevant artifacts are probably:
- profiling/results/cpu_vs_docker3_comparison_superconductivity.csv
- profiling/results/combined_training_summary_superconductivity.csv
- profiling/results/analysis/cpu_distribution_comparison.csv
- CPU and Docker 3-node async-profiler flamegraphs for each distribution
Screenshots
N/A
Additional context
I will include the profiling flamegraphs and summary CSVs in the reproduction zip (profiling.zip).
From the profiling output, the expensive part is around leaf assignment and/or the quantile aggregation work that happens after rows are assigned to leaves. The strongest profiling signals are around:
- hex/quantile/Quantile$Histo
- hex/quantile/Quantile$StratifiedQuantilesTask
- high MRTask activity in quantile runs
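As background on why a per-leaf quantile step can cost more than a mean, here is a small pure-Python sketch of the general technique. It is not H2O's implementation: the function names are hypothetical, and the linear interpolation between neighbouring values is an assumption. With squared-error loss the optimal leaf value is the mean of the residuals, which can be merged from per-chunk (sum, count) pairs. With quantile loss the optimal leaf value is the alpha-quantile of the residuals in that leaf, which needs the values themselves (or a histogram approximation) grouped by leaf.

```python
import math
from collections import defaultdict


def leaf_means(leaf_ids, residuals):
    """Per-leaf mean from running (sum, count) pairs; a single pass suffices."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for leaf, r in zip(leaf_ids, residuals):
        sums[leaf] += r
        counts[leaf] += 1
    return {leaf: sums[leaf] / counts[leaf] for leaf in sums}


def leaf_quantiles(leaf_ids, residuals, alpha=0.5):
    """Per-leaf alpha-quantile; rows are grouped by leaf, then sorted."""
    groups = defaultdict(list)
    for leaf, r in zip(leaf_ids, residuals):
        groups[leaf].append(r)
    result = {}
    for leaf, values in groups.items():
        values.sort()
        # Linear interpolation between the two nearest order statistics
        # (an assumption made for this sketch).
        pos = alpha * (len(values) - 1)
        lo, hi = math.floor(pos), math.ceil(pos)
        result[leaf] = values[lo] + (pos - lo) * (values[hi] - values[lo])
    return result


leaf_ids = [0, 0, 0, 1, 1]
residuals = [1.0, 2.0, 10.0, -1.0, 3.0]
print(leaf_means(leaf_ids, residuals))
print(leaf_quantiles(leaf_ids, residuals, alpha=0.5))
```

If the extra work in a distributed setting comes from moving or summarising per-leaf values across nodes, that would be consistent with the gap growing on the 3-node cluster, but I have not confirmed this from the H2O source.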