Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
106 changes: 106 additions & 0 deletions docs/plans/engine/stabilization.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,106 @@
# HPA-style stabilization of optimizer recommendations

## Motivation

The optimizer produces a fresh per-variant replica target every optimize cycle
(~30s). Acting on those targets directly is prone to flapping: a momentary load
spike scales up and the next cycle scales straight back down. Today WVA leans on
a downstream HorizontalPodAutoscaler to absorb this β€” WVA emits a desired-replica
metric and the HPA's own stabilization window damps it.

Stabilizing inside WVA is the prerequisite for WVA ever actuating `/scale`
directly (the 1β†’N range), because patching the scale subresource without a
stabilization window would lose the damping the HPA provides today and reproduce
the flapping. This change implements that stabilization step; direct actuation is
a separate, later step.

## Design

A small, self-contained `internal/stabilization` package ports the Kubernetes
HorizontalPodAutoscaler **configurable scaling behavior** so an operator can
express damping the familiar HPA way.

- **Config types are reused** from `k8s.io/api/autoscaling/v2`
(`HorizontalPodAutoscalerBehavior`, `HPAScalingRules`, `HPAScalingPolicy`).
Already a dependency; maps 1:1 if WVA ever generates HPAs.
- **The algorithm is clean-room.** The upstream behavior logic lives on
unexported methods of `HorizontalController` in `k8s.io/kubernetes`, which is
not consumable as a module. The behavior is documented at
[configurable-scaling-behavior](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#configurable-scaling-behavior),
so it is reimplemented rather than imported. Knative's KPA was also considered
and rejected (wrong domain β€” it smooths a request/concurrency metric, not a
replica recommendation β€” and heavy Knative coupling).

### Algorithm

`Stabilizer.Stabilize` damps one raw recommendation for one scale target. The
pipeline, in order, matches the HPA:

1. **Trailing stabilization window.** Per key, recent raw recommendations are
retained. The smallest recommendation over the scale-up window is a floor
(scale up only once every recent recommendation agrees); the largest over the
scale-down window is a ceiling (scale down only after the window decays). The
current replica count is clamped into `[floor, ceiling]`. Default windows:
scale-up `0s` (immediate), scale-down `300s`.
2. **Tolerance deadband.** A change within the configured per-direction
tolerance fraction of current replicas is suppressed. Default `0` (disabled),
since the optimizer already decided to act.
3. **Per-period rate policies.** `Pods` and `Percent` policies cap the change
per `periodSeconds`, budgeting against changes already actuated within the
period (the period start is reconstructed as `current βˆ’ added + removed`
using both scale-up and scale-down events, as the HPA does). `selectPolicy`
`Max`/`Min`/`Disabled` combines competing policies. Defaults: scale-up the
higher of `+4 pods` or `+100%` per `60s`; scale-down `-100%` per `60s`. The
magnitudes and the `300s` down window match the HPA; the `60s` period is
chosen for WVA's ~30s optimize cadence rather than the controller's `15s`
default (which would make the budget a near no-op at that cadence).
4. **Min/max clamp** to the variant's bounds.

The `Stabilizer` is long-lived and concurrency-safe; it keeps the trailing
windows and rate budgets in memory, mirroring the HPA controller, which keeps
the same state in memory on the elected leader.

### Wiring

The stabilizer is applied in the V2 engine's `optimizeV2`, immediately after the
optimizer produces decisions (and the `scaling-decision` cycle log) and **before**
the scale-to-zero / minimum-replica enforcer. This ordering is deliberate and
matches the HPA: stabilize the recommendation, then let the hard bounds have the
last word. The existing emit path (Prometheus desired-replica gauge, internal
decision cache, CRD status patch) carries the stabilized value unchanged.

Each variant is keyed `namespace/model/variant[/role]`, so disaggregated
prefill/decode scale targets are damped independently.

A `stabilization-decision` structured log line is emitted per model per cycle,
grouped like the sibling `scaling-decision` line, with `curr`, `raw`
(optimizer recommendation) and `final` (stabilized) per variant.

### Configuration

Gated by `enableStabilization` in the saturation scaling config (default
`false`), mirroring how `enableLimiter` gates the GPU limiter. When disabled the
engine emits the optimizer's raw recommendation exactly as before β€” no
operator-facing behavior change on upgrade. When enabled, the HPA default
behavior is applied. Exposing the full per-policy behavior through config is a
follow-up.

## Known interactions

- **Double stabilization.** If a downstream HPA also stabilizes the emitted
metric, lag compounds. The intended contract once this is enabled is that WVA
owns stabilization and the HPA `behavior` is set to near-passthrough. Until
that is wired and documented for operators, `enableStabilization` defaults off.
- **Scale-to-zero.** The enforcer runs after stabilization and applies
scale-to-zero / minimum-replica floors as a hard override, so an idle model
still scales to zero even if the window is holding replicas high. The trailing
window records the pre-enforcement recommendation; recovery from zero is
handled by the scale-from-zero path.

## Testing

`internal/stabilization` has table-driven Ginkgo specs with an injectable clock
covering: immediate default scale-up, the 300s scale-down hold-and-release, the
scale-up window delay, per-period Pods/Percent rate caps, `selectPolicy`
Max/Min/Disabled, the tolerance deadband, min/max clamping, and per-key
isolation.
12 changes: 12 additions & 0 deletions internal/config/saturation_scaling.go
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,15 @@ type SaturationScalingConfig struct {
// Default is false (limiter disabled).
EnableLimiter bool `yaml:"enableLimiter,omitempty"`

// EnableStabilization: When true, applies HPA-style stabilization (a trailing
// scale-down window plus per-period rate policies) to the optimizer's
// per-variant targets before they are emitted, damping flapping. The
// HorizontalPodAutoscaler defaults are used: immediate scale-up rate-limited
// to the higher of +4 pods or +100% per 60s, and scale-down held by a 300s
// stabilization window. Default is false, preserving the prior behavior of
// emitting the optimizer's raw recommendation each cycle.
EnableStabilization bool `yaml:"enableStabilization,omitempty"`

// AnalyzerName selects which saturation analyzer to use.
// "saturation" uses the V2 token-based analyzer.
// Empty string (default) uses the V1 percentage-based analyzer.
Expand Down Expand Up @@ -198,6 +207,9 @@ func (c *SaturationScalingConfig) Merge(override SaturationScalingConfig) {
if override.EnableLimiter {
c.EnableLimiter = override.EnableLimiter
}
if override.EnableStabilization {
c.EnableStabilization = override.EnableStabilization
}
if override.Priority != 0 {
c.Priority = override.Priority
}
Expand Down
13 changes: 13 additions & 0 deletions internal/config/saturation_scaling_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -397,6 +397,19 @@ var _ = Describe("SaturationScalingConfig", func() {
Expect(base.ModelID).To(Equal("llama-70b"))
Expect(base.Namespace).To(Equal("production"))
})

It("should propagate EnableStabilization from the override", func() {
base := SaturationScalingConfig{}
base.Merge(SaturationScalingConfig{EnableStabilization: true})
Expect(base.EnableStabilization).To(BeTrue())
})

It("should not clear EnableStabilization already set to true", func() {
base := SaturationScalingConfig{EnableStabilization: true}
base.Merge(SaturationScalingConfig{EnableStabilization: false})
Expect(base.EnableStabilization).To(BeTrue(),
"a false override must not revert an enabled flag (matches EnableLimiter semantics)")
})
})

Context("IsV2", func() {
Expand Down
10 changes: 1 addition & 9 deletions internal/engines/pipeline/enforcer.go
Original file line number Diff line number Diff line change
Expand Up @@ -182,15 +182,7 @@ func (e *Enforcer) ensureMinimumReplicasOnDecisions(
// CurrentReplicas after enforcement and refreshes its reason. SetDecisionReason
// is the single place that writes d.Action.
func updateDecisionAction(d *interfaces.VariantDecision, optimizerName, policyType string, metricsEmitter *metrics.MetricsEmitter) {
var action interfaces.SaturationAction
switch {
case d.TargetReplicas > d.CurrentReplicas:
action = interfaces.ActionScaleUp
case d.TargetReplicas < d.CurrentReplicas:
action = interfaces.ActionScaleDown
default:
action = interfaces.ActionNoChange
}
action := d.ActionForTarget()
// Preserve the decision's original reason category (e.g. saturation-only on
// the V1 path) instead of hardcoding V2, so an enforced decision is not
// mis-attributed in the reason metric label. Fall back to V2 if it was unset.
Expand Down
12 changes: 12 additions & 0 deletions internal/engines/saturation/engine.go
Original file line number Diff line number Diff line change
Expand Up @@ -50,6 +50,7 @@ import (
"github.com/llm-d/llm-d-workload-variant-autoscaler/internal/logging"
"github.com/llm-d/llm-d-workload-variant-autoscaler/internal/metrics"
"github.com/llm-d/llm-d-workload-variant-autoscaler/internal/saturation"
"github.com/llm-d/llm-d-workload-variant-autoscaler/internal/stabilization"
"github.com/llm-d/llm-d-workload-variant-autoscaler/internal/utils"
"github.com/llm-d/llm-d-workload-variant-autoscaler/internal/utils/scaletarget"
)
Expand Down Expand Up @@ -136,6 +137,12 @@ type Engine struct {
// ScaleToZeroEnforcer applies scale-to-zero and minimum replica enforcement
ScaleToZeroEnforcer *pipeline.Enforcer

// stabilizer damps the optimizer's raw per-variant targets with HPA-style
// scaling behavior. It is long-lived so its trailing recommendation windows
// and per-period rate budgets persist across optimize cycles. Applied only
// when EnableStabilization is set in the saturation config.
stabilizer *stabilization.Stabilizer

// GPULimiter constrains scaling decisions based on available GPU resources.
// Only applied when EnableLimiter is true in the saturation config.
GPULimiter pipeline.Limiter
Expand Down Expand Up @@ -229,6 +236,7 @@ func NewEngine(client client.Client, apiReader client.Reader, scheme *runtime.Sc
Config: cfg,
ReplicaMetricsCollector: collector.NewReplicaMetricsCollector(promSource, client, recorder, podLocator),
ScaleToZeroEnforcer: pipeline.NewEnforcer(requestCountFunc),
stabilizer: stabilization.New(),
GPULimiter: gpuLimiter,
metricsRegistry: metricsRegistry,
saturationV2Analyzer: satV2,
Expand Down Expand Up @@ -870,6 +878,10 @@ func (e *Engine) optimizeV2(
"decisionCount", len(allDecisions),
"modelCount", len(requests))

// Damp the raw recommendations with HPA-style stabilization before the
// enforcer and emit path see them. No-op unless enabled in config.
e.applyStabilization(ctx, allDecisions)

// Stage 3: Apply enforcer per-model (directly on decisions)
for _, req := range requests {
// Skip scale-to-zero enforcement if any variant has minReplicas > 0
Expand Down
111 changes: 111 additions & 0 deletions internal/engines/saturation/stabilization.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,111 @@
package saturation

import (
"context"

"k8s.io/utils/ptr"
ctrl "sigs.k8s.io/controller-runtime"

"github.com/llm-d/llm-d-workload-variant-autoscaler/internal/interfaces"
"github.com/llm-d/llm-d-workload-variant-autoscaler/internal/stabilization"
)

// applyStabilization damps the optimizer's raw per-variant targets with
// HPA-style scaling behavior before the enforcer and emit path consume them.
// It mutates decisions in place and is a no-op unless EnableStabilization is set
// in the global ("default") saturation config β€” mirroring how the GPU limiter is
// gated. Each variant (per role, when disaggregated) is stabilized independently
// against its own trailing recommendation window and per-period rate budget,
// retained on the Engine's long-lived stabilizer.
func (e *Engine) applyStabilization(ctx context.Context, decisions []interfaces.VariantDecision) {
if e.stabilizer == nil || len(decisions) == 0 {
return
}

cfg, ok := e.Config.SaturationConfig()["default"]
if !ok || !cfg.EnableStabilization {
return
}

logger := ctrl.LoggerFrom(ctx)
behavior := stabilization.DefaultBehavior()

// Bound the stabilizer's per-key history to the variants live this cycle, so
// keys for deleted variants do not accumulate.
active := make(map[string]struct{}, len(decisions))
for i := range decisions {
active[stabilizationKey(&decisions[i])] = struct{}{}
}
e.stabilizer.Retain(active)

type stabilizationEntry struct {
Name string `json:"name"`
Role string `json:"role,omitempty"`
Curr int `json:"curr"`
Raw int `json:"raw"` // optimizer recommendation
Final int `json:"final"` // after stabilization
}
type modelKey struct{ ns, modelID string }
grouped := make(map[modelKey][]stabilizationEntry)

for i := range decisions {
d := &decisions[i]
res := e.stabilizer.Stabilize(stabilization.Args{
Key: stabilizationKey(d),
CurrentReplicas: int32(d.CurrentReplicas),
DesiredReplicas: int32(d.TargetReplicas),
Behavior: behavior,
// 0 means no floor / no cap: scale-to-zero and minimum-replica
// enforcement remain the enforcer's job, not the stabilizer's.
MinReplicas: int32(ptr.Deref(d.MinReplicas, 0)),
MaxReplicas: int32(ptr.Deref(d.MaxReplicas, 0)),
})

raw := d.TargetReplicas
final := int(res.Replicas)
if final != raw {
d.TargetReplicas = final
retargetDecision(d)
}

k := modelKey{d.Namespace, d.ModelID}
grouped[k] = append(grouped[k], stabilizationEntry{
Name: d.VariantName,
Role: d.Role,
Curr: d.CurrentReplicas,
Raw: raw,
Final: final,
})
}

for k, entries := range grouped {
logger.Info("stabilization-decision",
"modelID", k.modelID,
"namespace", k.ns,
"decisions", entries,
)
}
}

// stabilizationKey identifies a scale target across cycles. It includes the
// model so that, even if two models in a namespace ever expose a same-named
// variant, their histories never collide. Disaggregated P/D variants have one
// scale target per role, so the role is part of the key too.
func stabilizationKey(d *interfaces.VariantDecision) string {
key := d.Namespace + "/" + d.ModelID + "/" + d.VariantName
if d.Role != "" {
key += "/" + d.Role
}
return key
}

// retargetDecision recomputes a decision's Action after stabilization changed
// its TargetReplicas, preserving the original reason category so the decision is
// not mis-attributed in the reason metric label.
func retargetDecision(d *interfaces.VariantDecision) {
category := d.ReasonCategory()
if category == "" {
category = interfaces.DecisionReasonV2
}
d.SetDecisionReason(d.ActionForTarget(), category, string(category)+" (stabilized)")
}
Loading
Loading