GBM quantile regression is much slower than gaussian/tweedie

**H2O version, Operating System and Environment**

- H2O version: `3.46.0.9`
- Local run: Ubuntu 22.04, OpenJDK 11, 20 logical CPUs, 15 GiB RAM
- Distributed run: Docker Compose with 3 H2O nodes
- Docker node config: 2 GB heap per node, `nthreads=4` per node, OpenJDK 17
- Not running on Kubernetes or Hadoop

I have tested this on H2O `3.46.0.9`.

**Actual behavior**
`H2OGradientBoostingEstimator(distribution="quantile")` is much slower than `gaussian` and `tweedie` on the same regression dataset with near-identical GBM settings.

The slowdown appears on a single-node run and becomes much larger on a 3-node Docker H2O cluster. Predictions and metrics are consistent across the local and Docker runs, so this looks like a performance issue rather than a correctness issue.

I originally noticed this while using GBM with quantile loss on a much larger work dataset. Training was so slow that the model was almost impractical to train. I cannot share the company dataset or environment, so I created this smaller reproducible setup to demonstrate the same pattern and ask for guidance on whether this is expected or can be improved.

Summary from `profiling/results/cpu_vs_docker3_comparison_superconductivity.csv` (included in the attached zip file):

| model | CPU JVM train seconds | CPU ms/tree | Docker 3-node JVM train seconds | Docker 3-node ms/tree |
|---|---:|---:|---:|---:|
| gaussian | 62 | 37.736 | 220 | 133.901 |
| tweedie | 59 | 35.393 | 223 | 133.773 |
| quantile | 166 | 83.000 | 1216 | 608.000 |

Note: gaussian and tweedie stopped before 2000 trees in these runs, while quantile built all 2000 trees. For that reason I am comparing `ms_per_tree`, not only total train time.
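
To make the per-tree comparison explicit, here is a small sketch that computes the slowdown ratios from the `ms/tree` columns of the table above. The dictionary layout and the `slowdown` helper are only illustrative names, not part of the reproduction scripts.

```python
# Per-tree timings (ms/tree) copied from the summary table above.
ms_per_tree = {
    "gaussian": {"cpu": 37.736, "docker3": 133.901},
    "tweedie": {"cpu": 35.393, "docker3": 133.773},
    "quantile": {"cpu": 83.000, "docker3": 608.000},
}


def slowdown(model, baseline, env):
    """Return how many times slower `model` is than `baseline` per tree."""
    return ms_per_tree[model][env] / ms_per_tree[baseline][env]


for env in ("cpu", "docker3"):
    for baseline in ("gaussian", "tweedie"):
        ratio = slowdown("quantile", baseline, env)
        print(f"{env}: quantile vs {baseline} = {ratio:.2f}x per tree")
```

By this measure quantile is roughly 2.2x to 2.3x slower per tree on the single-node run and roughly 4.5x slower per tree on the 3-node Docker cluster.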

**Expected behavior**
I expected quantile loss to be somewhat more expensive than gaussian/tweedie, but not to show this large a per-tree gap, especially with the same data, model settings, and tree budget.

If this is expected behavior, it would be useful to understand which part of the quantile implementation makes distributed GBM training more expensive and whether there are recommended settings to reduce the overhead.

I am filing this as a performance investigation rather than a confirmed bug. If maintainers think this belongs in Discussions instead, I am happy to move or reframe it.


**Steps to reproduce**
The reproduction scripts and outputs are in `profiling/`.

Run single-node CPU:

```bash
cd profiling
python run_cpu.py
```

Run Docker 3-node:

```bash
cd profiling/docker_3node
./run_docker_3node.sh
```

Generate comparison and flamegraph tables:

```bash
cd profiling
python compare_results.py
python analyze_flamegraphs.py
```

All three losses use the same GBM settings except for distribution-specific parameters:

```python
ntrees=2000
max_depth=12
learn_rate=0.005
min_rows=1
nbins=100
sample_rate=1.0
col_sample_rate=1.0
score_tree_interval=10
stopping_rounds=0
```

Distributions tested:

```python
distribution="gaussian"
distribution="tweedie", tweedie_power=1.5
distribution="quantile", quantile_alpha=0.5
```
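
For readers who do not want to open the zip, this is a minimal sketch of how the settings above could be combined into one estimator per distribution. It is not the actual `run_cpu.py`; the helper names `gbm_params` and `train_and_time` and the frame arguments are assumptions for illustration, and the timing here is client-side wall-clock time rather than the JVM train seconds reported in the table.

```python
# Shared settings, copied from the list above.
COMMON_PARAMS = dict(
    ntrees=2000,
    max_depth=12,
    learn_rate=0.005,
    min_rows=1,
    nbins=100,
    sample_rate=1.0,
    col_sample_rate=1.0,
    score_tree_interval=10,
    stopping_rounds=0,
)

# Distribution-specific parameters, copied from the list above.
DISTRIBUTIONS = {
    "gaussian": {"distribution": "gaussian"},
    "tweedie": {"distribution": "tweedie", "tweedie_power": 1.5},
    "quantile": {"distribution": "quantile", "quantile_alpha": 0.5},
}


def gbm_params(name):
    """Merge the shared settings with one distribution's own parameters."""
    return {**COMMON_PARAMS, **DISTRIBUTIONS[name]}


def train_and_time(name, train, valid, x, y):
    """Train one GBM and return (model, wall-clock seconds).

    Hypothetical helper: requires a running H2O cluster and H2OFrames
    for `train` and `valid`. h2o is imported lazily so the parameter
    helpers above can be used without it.
    """
    import time
    from h2o.estimators import H2OGradientBoostingEstimator

    model = H2OGradientBoostingEstimator(**gbm_params(name))
    start = time.perf_counter()
    model.train(x=x, y=y, training_frame=train, validation_frame=valid)
    return model, time.perf_counter() - start
```

Calling `train_and_time` once per key in `DISTRIBUTIONS` on the same frames should give a like-for-like comparison, since only the distribution-specific parameters differ.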

The attached zip contains the reproduction scripts, Docker files, processed train/validation/test CSVs, result summaries, and selected flamegraphs.

**Upload logs**
No exception is thrown. I can upload H2O logs if useful, but the most relevant artifacts are probably:

- `profiling/results/cpu_vs_docker3_comparison_superconductivity.csv`
- `profiling/results/combined_training_summary_superconductivity.csv`
- `profiling/results/analysis/cpu_distribution_comparison.csv`
- CPU and Docker 3-node async-profiler flamegraphs for each distribution

**Screenshots**
N/A

**Additional context**
I have included the profiling flamegraphs and summary CSVs in the reproduction zip: [profiling.zip](https://github.com/user-attachments/files/28480412/profiling.zip)

From the profiling output, the expensive part appears to be leaf assignment and/or the quantile aggregation work that happens after rows are assigned to leaves. The strongest profiling signals are around:

- `hex/quantile/Quantile$Histo`
- `hex/quantile/Quantile$StratifiedQuantilesTask`
- high `MRTask` activity in quantile runs
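
To illustrate why the post-assignment step might cost more for quantile loss, here is a conceptual sketch in plain Python. It is not H2O's implementation. It assumes the standard gradient boosting formulation, where a squared-error leaf value is a mean of residuals and a quantile-loss leaf value is the alpha-quantile of the residuals in that leaf. The function names are hypothetical.

```python
from collections import defaultdict


def leaf_means(leaf_ids, residuals):
    """Mean residual per leaf: one streaming pass of sums and counts."""
    sums, counts = defaultdict(float), defaultdict(int)
    for leaf, r in zip(leaf_ids, residuals):
        sums[leaf] += r
        counts[leaf] += 1
    return {leaf: sums[leaf] / counts[leaf] for leaf in sums}


def leaf_quantiles(leaf_ids, residuals, alpha=0.5):
    """Alpha-quantile of residuals per leaf.

    Unlike the mean, this needs each leaf's residuals grouped and
    ordered. This sketch simply sorts each group in memory.
    """
    groups = defaultdict(list)
    for leaf, r in zip(leaf_ids, residuals):
        groups[leaf].append(r)

    result = {}
    for leaf, values in groups.items():
        values.sort()
        # Linear interpolation between neighbouring order statistics.
        pos = alpha * (len(values) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(values) - 1)
        result[leaf] = values[lo] + (pos - lo) * (values[hi] - values[lo])
    return result


leaf_ids = [0, 0, 0, 1, 1]
residuals = [1.0, 2.0, 9.0, 4.0, 6.0]
print(leaf_means(leaf_ids, residuals))
print(leaf_quantiles(leaf_ids, residuals, alpha=0.5))
```

Sums and counts can be reduced across nodes in a single pass, while a per-leaf quantile cannot be combined that simply. If H2O computes these quantiles with distributed histogram passes per tree, as the class names above suggest, that could explain both the extra `MRTask` activity and the larger gap on the 3-node cluster, but I have not confirmed this in the source.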

