llm-d · ev-shindin · Jun 28, 2026 · Jun 28, 2026 · Jun 28, 2026 · Jun 28, 2026
diff --git a/docs/plans/engine/stabilization.md b/docs/plans/engine/stabilization.md
@@ -0,0 +1,106 @@
+# HPA-style stabilization of optimizer recommendations
+
+## Motivation
+
+The optimizer produces a fresh per-variant replica target every optimize cycle
+(~30s). Acting on those targets directly is prone to flapping: a momentary load
+spike scales up and the next cycle scales straight back down. Today WVA leans on
+a downstream HorizontalPodAutoscaler to absorb this — WVA emits a desired-replica
+metric and the HPA's own stabilization window damps it.
+
+Stabilizing inside WVA is the prerequisite for WVA ever actuating `/scale`
+directly (the 1→N range), because patching the scale subresource without a
+stabilization window would lose the damping the HPA provides today and reproduce
+the flapping. This change implements that stabilization step; direct actuation is
+a separate, later step.
+
+## Design
+
+A small, self-contained `internal/stabilization` package ports the Kubernetes
+HorizontalPodAutoscaler **configurable scaling behavior** so an operator can
+express damping the familiar HPA way.
+
+- **Config types are reused** from `k8s.io/api/autoscaling/v2`
+  (`HorizontalPodAutoscalerBehavior`, `HPAScalingRules`, `HPAScalingPolicy`).
+  Already a dependency; maps 1:1 if WVA ever generates HPAs.
+- **The algorithm is clean-room.** The upstream behavior logic lives on
+  unexported methods of `HorizontalController` in `k8s.io/kubernetes`, which is
+  not consumable as a module. The behavior is documented at
+  [configurable-scaling-behavior](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#configurable-scaling-behavior),
+  so it is reimplemented rather than imported. Knative's KPA was also considered
+  and rejected (wrong domain — it smooths a request/concurrency metric, not a
+  replica recommendation — and heavy Knative coupling).
+
+### Algorithm
+
+`Stabilizer.Stabilize` damps one raw recommendation for one scale target. The
+pipeline, in order, matches the HPA:
+
+1. **Trailing stabilization window.** Per key, recent raw recommendations are
+   retained. The smallest recommendation over the scale-up window is a floor
+   (scale up only once every recent recommendation agrees); the largest over the
+   scale-down window is a ceiling (scale down only after the window decays). The
+   current replica count is clamped into `[floor, ceiling]`. Default windows:
+   scale-up `0s` (immediate), scale-down `300s`.
+2. **Tolerance deadband.** A change within the configured per-direction
+   tolerance fraction of current replicas is suppressed. Default `0` (disabled),
+   since the optimizer already decided to act.
+3. **Per-period rate policies.** `Pods` and `Percent` policies cap the change
+   per `periodSeconds`, budgeting against changes already actuated within the
+   period (the period start is reconstructed as `current − added + removed`
+   using both scale-up and scale-down events, as the HPA does). `selectPolicy`
+   `Max`/`Min`/`Disabled` combines competing policies. Defaults: scale-up the
+   higher of `+4 pods` or `+100%` per `60s`; scale-down `-100%` per `60s`. The
+   magnitudes and the `300s` down window match the HPA; the `60s` period is
+   chosen for WVA's ~30s optimize cadence rather than the controller's `15s`
+   default (which would make the budget a near no-op at that cadence).
+4. **Min/max clamp** to the variant's bounds.
+
+The `Stabilizer` is long-lived and concurrency-safe; it keeps the trailing
+windows and rate budgets in memory, mirroring the HPA controller, which keeps
+the same state in memory on the elected leader.
+
+### Wiring
+
+The stabilizer is applied in the V2 engine's `optimizeV2`, immediately after the
+optimizer produces decisions (and the `scaling-decision` cycle log) and **before**
+the scale-to-zero / minimum-replica enforcer. This ordering is deliberate and
+matches the HPA: stabilize the recommendation, then let the hard bounds have the
+last word. The existing emit path (Prometheus desired-replica gauge, internal
+decision cache, CRD status patch) carries the stabilized value unchanged.
+
+Each variant is keyed `namespace/model/variant[/role]`, so disaggregated
+prefill/decode scale targets are damped independently.
+
+A `stabilization-decision` structured log line is emitted per model per cycle,
+grouped like the sibling `scaling-decision` line, with `curr`, `raw`
+(optimizer recommendation) and `final` (stabilized) per variant.
+
+### Configuration
+
+Gated by `enableStabilization` in the saturation scaling config (default
+`false`), mirroring how `enableLimiter` gates the GPU limiter. When disabled the
+engine emits the optimizer's raw recommendation exactly as before — no
+operator-facing behavior change on upgrade. When enabled, the HPA default
+behavior is applied. Exposing the full per-policy behavior through config is a
+follow-up.
+
+## Known interactions
+
+- **Double stabilization.** If a downstream HPA also stabilizes the emitted
+  metric, lag compounds. The intended contract once this is enabled is that WVA
+  owns stabilization and the HPA `behavior` is set to near-passthrough. Until
+  that is wired and documented for operators, `enableStabilization` defaults off.
+- **Scale-to-zero.** The enforcer runs after stabilization and applies
+  scale-to-zero / minimum-replica floors as a hard override, so an idle model
+  still scales to zero even if the window is holding replicas high. The trailing
+  window records the pre-enforcement recommendation; recovery from zero is
+  handled by the scale-from-zero path.
+
+## Testing
+
+`internal/stabilization` has table-driven Ginkgo specs with an injectable clock
+covering: immediate default scale-up, the 300s scale-down hold-and-release, the
+scale-up window delay, per-period Pods/Percent rate caps, `selectPolicy`
+Max/Min/Disabled, the tolerance deadband, min/max clamping, and per-key
+isolation.
diff --git a/internal/config/saturation_scaling.go b/internal/config/saturation_scaling.go
@@ -33,6 +33,15 @@ type SaturationScalingConfig struct {
 	// Default is false (limiter disabled).
 	EnableLimiter bool `yaml:"enableLimiter,omitempty"`
 
+	// EnableStabilization: When true, applies HPA-style stabilization (a trailing
+	// scale-down window plus per-period rate policies) to the optimizer's
+	// per-variant targets before they are emitted, damping flapping. The
+	// HorizontalPodAutoscaler defaults are used: immediate scale-up rate-limited
+	// to the higher of +4 pods or +100% per 60s, and scale-down held by a 300s
+	// stabilization window. Default is false, preserving the prior behavior of
+	// emitting the optimizer's raw recommendation each cycle.
+	EnableStabilization bool `yaml:"enableStabilization,omitempty"`
+
 	// AnalyzerName selects which saturation analyzer to use.
 	// "saturation" uses the V2 token-based analyzer.
 	// Empty string (default) uses the V1 percentage-based analyzer.
@@ -198,6 +207,9 @@ func (c *SaturationScalingConfig) Merge(override SaturationScalingConfig) {
 	if override.EnableLimiter {
 		c.EnableLimiter = override.EnableLimiter
 	}
+	if override.EnableStabilization {
+		c.EnableStabilization = override.EnableStabilization
+	}
 	if override.Priority != 0 {
 		c.Priority = override.Priority
 	}

diff --git a/internal/config/saturation_scaling_test.go b/internal/config/saturation_scaling_test.go
@@ -397,6 +397,19 @@ var _ = Describe("SaturationScalingConfig", func() {
 			Expect(base.ModelID).To(Equal("llama-70b"))
 			Expect(base.Namespace).To(Equal("production"))
 		})
+
+		It("should propagate EnableStabilization from the override", func() {
+			base := SaturationScalingConfig{}
+			base.Merge(SaturationScalingConfig{EnableStabilization: true})
+			Expect(base.EnableStabilization).To(BeTrue())
+		})
+
+		It("should not clear EnableStabilization already set to true", func() {
+			base := SaturationScalingConfig{EnableStabilization: true}
+			base.Merge(SaturationScalingConfig{EnableStabilization: false})
+			Expect(base.EnableStabilization).To(BeTrue(),
+				"a false override must not revert an enabled flag (matches EnableLimiter semantics)")
+		})
 	})
 
 	Context("IsV2", func() {

diff --git a/internal/engines/pipeline/enforcer.go b/internal/engines/pipeline/enforcer.go
@@ -182,15 +182,7 @@ func (e *Enforcer) ensureMinimumReplicasOnDecisions(
 // CurrentReplicas after enforcement and refreshes its reason. SetDecisionReason
 // is the single place that writes d.Action.
 func updateDecisionAction(d *interfaces.VariantDecision, optimizerName, policyType string, metricsEmitter *metrics.MetricsEmitter) {
-	var action interfaces.SaturationAction
-	switch {
-	case d.TargetReplicas > d.CurrentReplicas:
-		action = interfaces.ActionScaleUp
-	case d.TargetReplicas < d.CurrentReplicas:
-		action = interfaces.ActionScaleDown
-	default:
-		action = interfaces.ActionNoChange
-	}
+	action := d.ActionForTarget()
 	// Preserve the decision's original reason category (e.g. saturation-only on
 	// the V1 path) instead of hardcoding V2, so an enforced decision is not
 	// mis-attributed in the reason metric label. Fall back to V2 if it was unset.

diff --git a/internal/engines/saturation/engine.go b/internal/engines/saturation/engine.go
@@ -50,6 +50,7 @@ import (
 	"github.com/llm-d/llm-d-workload-variant-autoscaler/internal/logging"
 	"github.com/llm-d/llm-d-workload-variant-autoscaler/internal/metrics"
 	"github.com/llm-d/llm-d-workload-variant-autoscaler/internal/saturation"
+	"github.com/llm-d/llm-d-workload-variant-autoscaler/internal/stabilization"
 	"github.com/llm-d/llm-d-workload-variant-autoscaler/internal/utils"
 	"github.com/llm-d/llm-d-workload-variant-autoscaler/internal/utils/scaletarget"
 )
@@ -136,6 +137,12 @@ type Engine struct {
 	// ScaleToZeroEnforcer applies scale-to-zero and minimum replica enforcement
 	ScaleToZeroEnforcer *pipeline.Enforcer
 
+	// stabilizer damps the optimizer's raw per-variant targets with HPA-style
+	// scaling behavior. It is long-lived so its trailing recommendation windows
+	// and per-period rate budgets persist across optimize cycles. Applied only
+	// when EnableStabilization is set in the saturation config.
+	stabilizer *stabilization.Stabilizer
+
 	// GPULimiter constrains scaling decisions based on available GPU resources.
 	// Only applied when EnableLimiter is true in the saturation config.
 	GPULimiter pipeline.Limiter
@@ -229,6 +236,7 @@ func NewEngine(client client.Client, apiReader client.Reader, scheme *runtime.Sc
 		Config:                  cfg,
 		ReplicaMetricsCollector: collector.NewReplicaMetricsCollector(promSource, client, recorder, podLocator),
 		ScaleToZeroEnforcer:     pipeline.NewEnforcer(requestCountFunc),
+		stabilizer:              stabilization.New(),
 		GPULimiter:              gpuLimiter,
 		metricsRegistry:         metricsRegistry,
 		saturationV2Analyzer:    satV2,
@@ -870,6 +878,10 @@ func (e *Engine) optimizeV2(
 		"decisionCount", len(allDecisions),
 		"modelCount", len(requests))
 
+	// Damp the raw recommendations with HPA-style stabilization before the
+	// enforcer and emit path see them. No-op unless enabled in config.
+	e.applyStabilization(ctx, allDecisions)
+
 	// Stage 3: Apply enforcer per-model (directly on decisions)
 	for _, req := range requests {
 		// Skip scale-to-zero enforcement if any variant has minReplicas > 0

diff --git a/internal/engines/saturation/stabilization.go b/internal/engines/saturation/stabilization.go
@@ -0,0 +1,111 @@
+package saturation
+
+import (
+	"context"
+
+	"k8s.io/utils/ptr"
+	ctrl "sigs.k8s.io/controller-runtime"
+
+	"github.com/llm-d/llm-d-workload-variant-autoscaler/internal/interfaces"
+	"github.com/llm-d/llm-d-workload-variant-autoscaler/internal/stabilization"
+)
+
+// applyStabilization damps the optimizer's raw per-variant targets with
+// HPA-style scaling behavior before the enforcer and emit path consume them.
+// It mutates decisions in place and is a no-op unless EnableStabilization is set
+// in the global ("default") saturation config — mirroring how the GPU limiter is
+// gated. Each variant (per role, when disaggregated) is stabilized independently
+// against its own trailing recommendation window and per-period rate budget,
+// retained on the Engine's long-lived stabilizer.
+func (e *Engine) applyStabilization(ctx context.Context, decisions []interfaces.VariantDecision) {
+	if e.stabilizer == nil || len(decisions) == 0 {
+		return
+	}
+
+	cfg, ok := e.Config.SaturationConfig()["default"]
+	if !ok || !cfg.EnableStabilization {
+		return
+	}
+
+	logger := ctrl.LoggerFrom(ctx)
+	behavior := stabilization.DefaultBehavior()
+
+	// Bound the stabilizer's per-key history to the variants live this cycle, so
+	// keys for deleted variants do not accumulate.
+	active := make(map[string]struct{}, len(decisions))
+	for i := range decisions {
+		active[stabilizationKey(&decisions[i])] = struct{}{}
+	}
+	e.stabilizer.Retain(active)
+
+	type stabilizationEntry struct {
+		Name  string `json:"name"`
+		Role  string `json:"role,omitempty"`
+		Curr  int    `json:"curr"`
+		Raw   int    `json:"raw"`   // optimizer recommendation
+		Final int    `json:"final"` // after stabilization
+	}
+	type modelKey struct{ ns, modelID string }
+	grouped := make(map[modelKey][]stabilizationEntry)
+
+	for i := range decisions {
+		d := &decisions[i]
+		res := e.stabilizer.Stabilize(stabilization.Args{
+			Key:             stabilizationKey(d),
+			CurrentReplicas: int32(d.CurrentReplicas),
+			DesiredReplicas: int32(d.TargetReplicas),
+			Behavior:        behavior,
+			// 0 means no floor / no cap: scale-to-zero and minimum-replica
+			// enforcement remain the enforcer's job, not the stabilizer's.
+			MinReplicas: int32(ptr.Deref(d.MinReplicas, 0)),
+			MaxReplicas: int32(ptr.Deref(d.MaxReplicas, 0)),
+		})
+
+		raw := d.TargetReplicas
+		final := int(res.Replicas)
+		if final != raw {
+			d.TargetReplicas = final
+			retargetDecision(d)
+		}
+
+		k := modelKey{d.Namespace, d.ModelID}
+		grouped[k] = append(grouped[k], stabilizationEntry{
+			Name:  d.VariantName,
+			Role:  d.Role,
+			Curr:  d.CurrentReplicas,
+			Raw:   raw,
+			Final: final,
+		})
+	}
+
+	for k, entries := range grouped {
+		logger.Info("stabilization-decision",
+			"modelID", k.modelID,
+			"namespace", k.ns,
+			"decisions", entries,
+		)
+	}
+}
+
+// stabilizationKey identifies a scale target across cycles. It includes the
+// model so that, even if two models in a namespace ever expose a same-named
+// variant, their histories never collide. Disaggregated P/D variants have one
+// scale target per role, so the role is part of the key too.
+func stabilizationKey(d *interfaces.VariantDecision) string {
+	key := d.Namespace + "/" + d.ModelID + "/" + d.VariantName
+	if d.Role != "" {
+		key += "/" + d.Role
+	}
+	return key
+}
+
+// retargetDecision recomputes a decision's Action after stabilization changed
+// its TargetReplicas, preserving the original reason category so the decision is
+// not mis-attributed in the reason metric label.
+func retargetDecision(d *interfaces.VariantDecision) {
+	category := d.ReasonCategory()
+	if category == "" {
+		category = interfaces.DecisionReasonV2
+	}
+	d.SetDecisionReason(d.ActionForTarget(), category, string(category)+" (stabilized)")
+}