feat(engine): add HPA-style stabilization of scaling recommendations #1353

Open
ev-shindin wants to merge 6 commits into llm-d:main from ev-shindin:feat/scaling-stabilization

@ev-shindin

Collaborator

Fixes #1352

Proposed Changes

  • 🎁 Add internal/stabilization, a clean-room port of the Kubernetes HPA configurable scaling behavior: a trailing stabilization window (min over the scale-up window as a floor, max over the scale-down window as a ceiling), per-period Pods/Percent rate policies with selectPolicy, a tolerance deadband, and a min/max clamp. It reuses the autoscaling/v2 behavior types, so it is configured the familiar HPA way.
  • 🎁 Wire stabilization into the V2 engine right after the optimizer and before the enforcer, keyed per namespace/model/variant[/role], with a structured stabilization-decision cycle log. Gated by a new enableStabilization saturation-config flag, default off. Intended deployment contract: WVA owns stabilization, HPA behavior delays set to 0.
  • 🧹 Extract VariantDecision.ActionForTarget and reuse it from both the enforcer and the stabilizer (removes a duplicated action-recompute switch).
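For readers who have not opened the diff, the extracted helper might look roughly like the sketch below. Only the name VariantDecision.ActionForTarget comes from this PR; the field names, the Action constants, the method signature, and the retarget helper are assumptions made for illustration.

```go
package main

import "fmt"

// Action names are illustrative; the real constants may differ.
type Action string

const (
	ActionScaleUp   Action = "scale-up"
	ActionScaleDown Action = "scale-down"
	ActionNoChange  Action = "no-change"
)

// VariantDecision is a reduced stand-in for the engine's decision type.
type VariantDecision struct {
	CurrentReplicas int32
	TargetReplicas  int32
	Action          Action
}

// ActionForTarget derives the action implied by moving from the current
// replica count to target, so every caller that rewrites a target shares
// one rule instead of its own switch.
func (d VariantDecision) ActionForTarget(target int32) Action {
	switch {
	case target > d.CurrentReplicas:
		return ActionScaleUp
	case target < d.CurrentReplicas:
		return ActionScaleDown
	default:
		return ActionNoChange
	}
}

// retarget (hypothetical) shows the call pattern: set the new target,
// then recompute the action from it.
func retarget(d VariantDecision, target int32) VariantDecision {
	d.TargetReplicas = target
	d.Action = d.ActionForTarget(target)
	return d
}

func main() {
	d := VariantDecision{CurrentReplicas: 4, TargetReplicas: 9, Action: ActionScaleUp}
	// A stabilizer that holds the target at the current count also
	// changes the action, which is why the recompute must be shared.
	fmt.Println(retarget(d, 4).Action)
}
```

Under these assumptions, both the stabilizer and the enforcer would call the same method after adjusting a target, which is the duplication the bullet above refers to.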

This is the foundation for owning scale-down so the priority-weighted rescale proposal (#1238) can actuate cleanly, and a prerequisite for direct /scale actuation in the 1→N range. Per-policy config via the ConfigMap, direct actuation, and the V1 path are follow-ups (see #1352).

Pre-review Checklist

  • E2E tests for any new behavior: N/A while default-off (no behavior change on upgrade). Covered by internal/stabilization unit specs and the saturation engine envtest suite; E2E will accompany enabling it by default / direct actuation.
  • Docs PR for any user-facing impact: developer design note included (docs/plans/engine/stabilization.md); no user-facing impact while the flag is off. User-facing docs will accompany the per-policy config follow-up.
  • Proposal PR for any new enhancement or change to existing behavior: rationale tracked in #1352 ("Add HPA-style stabilization of scaling recommendations in WVA") and related to the rescale proposal #1238 ("docs(proposals): add priority-weighted rescale proposal"); a dedicated proposal for the operator-facing config surface is a follow-up.

Release Note

Added opt-in HPA-style stabilization of WVA scaling recommendations (a trailing
scale-down window plus per-period rate policies), enabled via
`enableStabilization` in the saturation scaling config. Disabled by default.

Docs

No user-visible impact while default-off; user-facing documentation will accompany the per-policy configuration follow-up.

The optimizer emits a fresh per-variant replica target every cycle, which
flaps when load is noisy. Today WVA leans on a downstream HPA to damp this;
stabilizing in-process is the prerequisite for WVA actuating /scale directly.

Add internal/stabilization, a clean-room port of the Kubernetes HPA
configurable scaling behavior: a trailing stabilization window (min over the
scale-up window as a floor, max over the scale-down window as a ceiling), a
tolerance deadband, per-period Pods/Percent rate policies with selectPolicy,
and a min/max clamp. The autoscaling/v2 behavior types are reused so operators
configure damping the familiar HPA way; the algorithm is reimplemented because
the upstream logic is unexported in k8s.io/kubernetes, which is not a
consumable module.
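The trailing-window step described above can be sketched as follows. This mirrors the general shape of the HPA rule (lowest recent recommendation as a floor, highest as a ceiling, then clamp the current count into that range); the type and function names are hypothetical and are not taken from internal/stabilization.

```go
package main

import (
	"fmt"
	"time"
)

// recommendation is one timestamped optimizer target (hypothetical type).
type recommendation struct {
	at       time.Time
	replicas int32
}

// stabilize clamps the current replica count into [floor, ceiling], where
// floor is the lowest recommendation inside the scale-up window and ceiling
// is the highest inside the scale-down window. Both start at the new desired
// value, so floor <= desired <= ceiling always holds.
func stabilize(history []recommendation, now time.Time, current, desired int32,
	upWindow, downWindow time.Duration) int32 {
	floor, ceiling := desired, desired
	upCutoff, downCutoff := now.Add(-upWindow), now.Add(-downWindow)
	for _, r := range history {
		if r.at.After(upCutoff) && r.replicas < floor {
			floor = r.replicas
		}
		if r.at.After(downCutoff) && r.replicas > ceiling {
			ceiling = r.replicas
		}
	}
	out := current
	if out < floor {
		out = floor // scale up only as far as every recent recommendation agrees
	}
	if out > ceiling {
		out = ceiling // scale down only to the highest recent recommendation
	}
	return out
}

func main() {
	now := time.Now()
	history := []recommendation{
		{at: now.Add(-200 * time.Second), replicas: 8},
		{at: now.Add(-30 * time.Second), replicas: 5},
	}
	// The optimizer now wants 3, but a recommendation of 8 is still inside
	// the 300s down window, so the current count is held.
	fmt.Println(stabilize(history, now, 8, 3, 0, 300*time.Second))
}
```

With a zero scale-up window the floor equals the new desired value, so upward moves pass through immediately while downward moves wait for the window to drain.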

Wire it into the V2 engine right after the optimizer and before the enforcer,
keyed namespace/variant[/role] so disaggregated P/D targets damp
independently, emitting a stabilization-decision cycle log. Gated by a new
enableStabilization saturation-config flag (default off), so upgrades see no
behavior change; when enabled, the HPA default behavior is applied.

Includes table-driven Ginkgo specs with an injectable clock and a design note
under docs/plans/engine.

Signed-off-by: Evgeny Shindin <evgensh@il.ibm.com>

Fixes from the multi-agent review of the previous commit:

- stabilization: hold one lock for the whole Stabilize call so concurrent
  same-key callers observe and update per-key history atomically (was three
  separate critical sections with a TOCTOU on the rate budget).
- stabilization: give each direction its own SelectPolicy pointer in
  DefaultBehavior so a caller overriding one direction by dereference cannot
  corrupt the other.
- stabilization: add Forget(active) and call it each cycle so per-key history
  is bounded to the live set of variants; replace the hand-rolled pruneEvents
  with slices.DeleteFunc.
- interfaces: add VariantDecision.ActionForTarget and use it from both the
  stabilizer's retargetDecision and the enforcer's updateDecisionAction,
  removing the duplicated action-recompute switch.
- engine: use ptr.Deref for the optional replica bounds instead of a local
  helper.

Tests: scale-down selectPolicy table (incl. Disabled), scale-from-zero
tolerance guard, Forget retain/drop, engine applyStabilization enable-gate /
key construction / action recompute, and EnableStabilization config-merge
(including the intentional true-sticky semantics).

Signed-off-by: Evgeny Shindin <evgensh@il.ibm.com>

Address the round-2 review nits (no behavior change):

- stabilization: rename Forget(active) to Retain(active) — the argument is the
  set of keys to keep, so the name now matches the semantics.
- engine: move the "0 means no floor / no cap" note from retargetDecision's doc
  (which never sees the bounds) to the ptr.Deref call site that does.

Tests:
- interfaces: direct table test for VariantDecision.ActionForTarget covering
  the >, <, and == (no-change) branches.
- stabilization: Retain now also covers the empty-active-set eviction and the
  rate-event-budget reset (the prior specs seeded desired==current, so no scale
  event was recorded and the event-map clearing was never exercised).
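A minimal sketch of the bounded-history idea behind Retain, assuming a mutex-guarded map of per-key scale events. The Stabilizer layout, the scaleEvent type, and the pruneEvents signature are hypothetical; slices.DeleteFunc is the standard-library call (Go 1.21+) named in the earlier commit.

```go
package main

import (
	"fmt"
	"slices"
	"sync"
	"time"
)

// scaleEvent records one applied replica change (hypothetical type).
type scaleEvent struct {
	at     time.Time
	change int32
}

// Stabilizer holds per-key history behind a single lock.
type Stabilizer struct {
	mu     sync.Mutex
	events map[string][]scaleEvent
}

// Retain keeps history only for the keys in active and drops the rest,
// so memory is bounded by the live set of variants. An empty active set
// evicts everything.
func (s *Stabilizer) Retain(active map[string]struct{}) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for key := range s.events {
		if _, ok := active[key]; !ok {
			delete(s.events, key) // deleting during range is safe in Go
		}
	}
}

// pruneEvents drops events older than cutoff in place.
func pruneEvents(events []scaleEvent, cutoff time.Time) []scaleEvent {
	return slices.DeleteFunc(events, func(e scaleEvent) bool {
		return e.at.Before(cutoff)
	})
}

func main() {
	now := time.Now()
	s := &Stabilizer{events: map[string][]scaleEvent{
		"ns/model/a": {{at: now.Add(-90 * time.Second), change: 2}, {at: now, change: 1}},
		"ns/model/b": {{at: now, change: 4}},
	}}
	s.Retain(map[string]struct{}{"ns/model/a": {}})
	kept := pruneEvents(s.events["ns/model/a"], now.Add(-60*time.Second))
	fmt.Println(len(s.events), len(kept))
}
```

Dropping a key also resets its rate budget, which is the case the updated specs exercise by seeding a real scale event before calling Retain.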

Signed-off-by: Evgeny Shindin <evgensh@il.ibm.com>

Headline fix from the third review round: the per-period rate limit
reconstructed the period-start replica count from only the same-direction
scale events (current - added for up, current + removed for down). The HPA
uses both directions (current - added + removed) for each. With the old
formula, an up-then-down oscillation inside one policy period drove the
period-start baseline negative and pinned the next legitimate scale-up low
(e.g. 2->6 then 6->2 then a spike back to 6 was capped at 2). Dormant under
the wired defaults because the 300s down window suppresses the intervening
down event, but a real divergence under responsive custom configs. Added a
regression test reproducing the stuck-low scenario.
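The arithmetic behind the stuck-low case can be reproduced with a small sketch. It assumes the scale-up limit is the larger of a 4-pod policy and a 100% policy (matching the magnitudes quoted in this PR); the function names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// periodStart reconstructs the replica count at the start of the policy
// period from the current count and the events recorded inside the period,
// using both directions as the HPA does.
func periodStart(current, added, removed int32) int32 {
	return current - added + removed
}

// upLimit returns the highest count a scale-up may reach this period,
// taking the larger of a Pods policy and a Percent policy.
func upLimit(start, pods, percent int32) int32 {
	byPods := start + pods
	byPercent := int32(math.Ceil(float64(start) * (1 + float64(percent)/100)))
	if byPods > byPercent {
		return byPods
	}
	return byPercent
}

func main() {
	// Inside one period: 2->6 (added 4), then 6->2 (removed 4); current is 2.
	current, added, removed := int32(2), int32(4), int32(4)

	oldStart := current - added // same-direction only: 2 - 4 = -2
	newStart := periodStart(current, added, removed)

	fmt.Println("old baseline:", oldStart, "limit:", upLimit(oldStart, 4, 100))
	fmt.Println("new baseline:", newStart, "limit:", upLimit(newStart, 4, 100))
}
```

With the same-direction formula the baseline is -2 and the limit is 2, so the spike back to 6 is capped at 2. With both directions the baseline is 2 and the limit is 6.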

Also:
- include the model in the stabilizer key (namespace/model/variant[/role]) so
  two models in a namespace can never share a history bucket.
- correct the defaults documentation: the magnitudes (4 pods, 100%) and the
  300s down window match the HPA, but the 60s policy period is a deliberate
  choice for WVA's ~30s optimize cadence, not the controller's 15s default
  (the autoscaling/v2 API doc itself states 60s; the controller defaults to
  15s). Dropped the inaccurate "mirrors upstream" wording.
- clarify the recordScaleEvent comment about the post-stabilization enforcer
  (benign under-count, never less safe) and add a doc comment to reason().
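As an illustration of the key layout named in the first bullet above (namespace/model/variant[/role]), a builder could look like the sketch below. The function name and the empty-string convention for "no role" are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// stabilizerKey builds namespace/model/variant[/role]. The role segment is
// appended only for disaggregated deployments, so prefill and decode
// targets keep separate histories.
func stabilizerKey(namespace, model, variant, role string) string {
	parts := []string{namespace, model, variant}
	if role != "" {
		parts = append(parts, role)
	}
	return strings.Join(parts, "/")
}

func main() {
	fmt.Println(stabilizerKey("prod", "model-a", "variant-1", ""))
	fmt.Println(stabilizerKey("prod", "model-a", "variant-1", "decode"))
}
```

If any component can itself contain a slash, a struct key would avoid ambiguity; whether that applies here depends on how model identifiers are normalized, which this sketch does not assume.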

Tests: cross-direction budgeting regression, down-direction tolerance,
current-inside-band freeze, percent scale-down multi-cycle, reason() strings,
and the applyStabilization pass-through (no-retarget) path.
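The tolerance cases listed in these tests can be pictured with a short sketch. It assumes the deadband compares the desired/current ratio against 1, as the HPA does with its usage ratio, and that a current count of zero bypasses the check; the actual guard in internal/stabilization may be expressed differently.

```go
package main

import (
	"fmt"
	"math"
)

// withinTolerance reports whether desired is close enough to current that
// the change should be suppressed. The check is symmetric, so it covers
// both the up and the down direction.
func withinTolerance(current, desired int32, tolerance float64) bool {
	if current == 0 {
		// Scale-from-zero guard: no ratio exists, so never suppress.
		return false
	}
	ratio := float64(desired) / float64(current)
	return math.Abs(ratio-1) <= tolerance
}

func main() {
	fmt.Println(withinTolerance(100, 105, 0.1)) // small upward drift
	fmt.Println(withinTolerance(100, 95, 0.1))  // small downward drift
	fmt.Println(withinTolerance(10, 12, 0.1))   // real change
	fmt.Println(withinTolerance(0, 1, 0.1))     // scale from zero
}
```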

Signed-off-by: Evgeny Shindin <evgensh@il.ibm.com>

Round-4 review confirmed the cross-direction rate-budget fix correct with no
regressions. Remaining items were doc/test only:

- fix the stale Args.Key doc example (namespace/model/variant[/role]) and a
  run-on in the package doc after the spec URL.
- add the symmetric scale-down cross-direction regression test (an intervening
  up event must not inflate the down period-start baseline) and a test for the
  scale-down floor-to-zero branch (a Pods policy removing more than the
  period-start count clamps to 0).
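Both scale-down cases above can be checked by hand with a sketch like the following. It models only a Pods policy and uses made-up numbers; the function names are hypothetical.

```go
package main

import "fmt"

// periodStart reconstructs the period-start count from both directions.
func periodStart(current, added, removed int32) int32 {
	return current - added + removed
}

// podsDownLimit returns the lowest count a scale-down may reach this period
// under a Pods policy, floored at zero when the policy would remove more
// pods than the period started with.
func podsDownLimit(start, pods int32) int32 {
	limit := start - pods
	if limit < 0 {
		return 0
	}
	return limit
}

func main() {
	// Inside one period: 6->2 (removed 4), then 2->6 (added 4); current is 6.
	// Counting only the removals gives a baseline of 6 + 4 = 10.
	fmt.Println(podsDownLimit(6+4, 4))
	// Counting both directions gives a baseline of 6.
	fmt.Println(podsDownLimit(periodStart(6, 4, 4), 4))
	// A policy removing 5 pods from a baseline of 3 floors at zero.
	fmt.Println(podsDownLimit(3, 5))
}
```

With the inflated baseline of 10 the limit is 6, so no scale-down is possible from 6; with the corrected baseline the limit is 2.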

Signed-off-by: Evgeny Shindin <evgensh@il.ibm.com>

ev-shindin added this to the v0.9.0 release milestone Jun 28, 2026
ev-shindin self-assigned this Jun 28, 2026
ev-shindin requested a review from biranofer June 28, 2026 09:29

The scaleDecision test helper always received the same namespace, variant
name, and role, which golangci-lint's unparam flags. Reduce it to the two
parameters that actually vary (current, target) and inline the constants.

Signed-off-by: Evgeny Shindin <evgensh@il.ibm.com>
@ev-shindin

Collaborator Author

/ok-to-test

@github-actions

Contributor

🚀 Kind E2E (full) triggered by /ok-to-test

View the Kind E2E workflow run

@github-actions

Contributor

🚀 OpenShift E2E — approve and run (/ok-to-test)

View the OpenShift E2E workflow run

@github-actions

Contributor

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

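| Resource | Total | Allocated | Available |
|----------|-------|-----------|-----------|
| GPUs     | 50    | 29        | 21        |

| Cluster       | Value                     |
|---------------|---------------------------|
| Nodes         | 16 (7 with GPUs)          |
| Total CPU     | 993 cores                 |
| Total Memory  | 10383 Gi                  |
| GPUs required | 4 (min) / 6 (recommended) |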

@lionelvillard

Collaborator

Is that really necessary? Users will have to configure stabilization in two different places. Is that intended?
