test(e2e): KV cache saturation scale-up to maxReplicas=10 by asm582 · Pull Request #1283 · llm-d/llm-d-workload-variant-autoscaler

asm582 · 2026-06-16T18:25:51Z

Summary

- Adds a new `full`-labelled e2e test that proves WVA scales a Deployment from 1 → 10 replicas when every pod continuously reports 95% KV cache usage via `llm-d-inference-sim --fake-metrics`.
- Each new replica inherits `kv-cache-usage=0.95` from the Deployment pod template, so no per-pod configuration is needed.
- The V1 saturation engine never gains spare KV headroom (`avgSpareKv=0 < kvSpareTrigger=0.10`) and keeps recommending +1 until the VA `maxReplicas=10` clamp fires.

How the test works

| Config | Value | Why |
|---|---|---|
| `kv-cache-usage` (fake-metrics) | `0.95` | All replicas saturated (≥ threshold 0.80) |
| `kvCacheThreshold` | `0.80` | 0.95 ≥ 0.80 → `nonSaturatedCount=0` |
| `kvSpareTrigger` | `0.10` | `avgSpareKv=0 < 0.10` → `ShouldScaleUp=true` |
| `maxReplicas` (VA + HPA) | `10` | Caps the scale-up ladder |
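
The decision chain in the table can be sketched in Go as follows. This is a minimal illustration, not the controller's code: the `saturationDecision` type and `analyze` function are hypothetical names, and spare KV is assumed to be `kvCacheThreshold - usage`, averaged over non-saturated replicas only.

```go
package main

import "fmt"

// saturationDecision is a hypothetical stand-in for the V1 engine's output.
type saturationDecision struct {
	NonSaturatedCount int
	AvgSpareKv        float64
	ShouldScaleUp     bool
}

// analyze applies the rule from the table: a replica is saturated when its
// KV usage is at or above kvCacheThreshold; spare KV is averaged over the
// remaining replicas; scale-up is recommended when that average falls below
// kvSpareTrigger. With no non-saturated replicas the average is taken as 0.
func analyze(kvUsage []float64, kvCacheThreshold, kvSpareTrigger float64) saturationDecision {
	spareSum := 0.0
	nonSaturated := 0
	for _, u := range kvUsage {
		if u < kvCacheThreshold { // replica still has headroom
			spareSum += kvCacheThreshold - u
			nonSaturated++
		}
	}
	avgSpareKv := 0.0
	if nonSaturated > 0 {
		avgSpareKv = spareSum / float64(nonSaturated)
	}
	return saturationDecision{
		NonSaturatedCount: nonSaturated,
		AvgSpareKv:        avgSpareKv,
		ShouldScaleUp:     avgSpareKv < kvSpareTrigger,
	}
}

func main() {
	// Every replica reports 0.95, as in the fake-metrics setup.
	d := analyze([]float64{0.95, 0.95, 0.95}, 0.80, 0.10)
	fmt.Printf("nonSaturated=%d avgSpareKv=%.2f scaleUp=%t\n",
		d.NonSaturatedCount, d.AvgSpareKv, d.ShouldScaleUp)
	// prints: nonSaturated=0 avgSpareKv=0.00 scaleUp=true
}
```

Because the inputs never change as replicas are added, the same decision repeats on every evaluation until the `maxReplicas` clamp applies.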

Three It blocks

- Initial detection: V1 path selected in controller logs, VA `DesiredOptimizedAlloc.NumReplicas > 1`
- Continuous climb: desired replicas logged step-by-step (1→2→…), asserted ≥ 5 mid-ladder
- Final cap: VA desired == 10 AND `Deployment.Spec.Replicas == 10` via HPA (end-to-end proof)
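
The ladder these three blocks observe can be modelled with a few lines of Go. This is only a sketch of the expected behaviour under persistent saturation (one replica added per step, clamped at `maxReplicas`); `nextDesired` and `climb` are hypothetical helpers, not functions from the test file.

```go
package main

import "fmt"

// nextDesired models one evaluation under persistent saturation:
// recommend current+1, clamped to maxReplicas.
func nextDesired(current, maxReplicas int) int {
	if current >= maxReplicas {
		return maxReplicas
	}
	return current + 1
}

// climb returns the sequence of desired replica counts from start up to
// the point where the clamp holds the value at maxReplicas.
func climb(start, maxReplicas int) []int {
	ladder := []int{start}
	for cur := start; cur < maxReplicas; {
		cur = nextDesired(cur, maxReplicas)
		ladder = append(ladder, cur)
	}
	return ladder
}

func main() {
	ladder := climb(1, 10)
	fmt.Println(ladder) // prints: [1 2 3 4 5 6 7 8 9 10]

	// The three checkpoints the It blocks assert on.
	fmt.Println("initial detection:", ladder[1] > 1)              // true
	fmt.Println("mid-ladder:", ladder[4] >= 5)                    // true
	fmt.Println("final cap:", ladder[len(ladder)-1] == 10)        // true
	fmt.Println("clamp holds:", nextDesired(10, 10) == 10)        // true
}
```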

Running

# Focused run (cluster already set up):
make test-e2e-full FOCUS="KV cache saturation"

# With cluster creation:
make create-kind-cluster && DEPLOY_LWS=true make deploy-e2e-infra && make test-e2e-full FOCUS="KV cache saturation"

Test plan

- Verify test passes on kind-emulator with `USE_SIMULATOR=true`
- Verify test is skipped cleanly when `USE_SIMULATOR=false`
- Verify AfterAll restores the saturation ConfigMap and deletes all resources

🤖 Generated with Claude Code

… support

Introduces a Poisson-equivalent (k=1.0) Gamma arrival workload for stress-testing prefill saturation under realistic bursty traffic, along with the harness and Makefile infrastructure needed to run it without modifying the cloned llm-d-benchmark repo.

New files under hack/benchmark/:
- scenarios/prefill_heavy_gamma/prefill_heavy_gamma.yaml: workload scenario
- scenarios/prefill_heavy_gamma/prefill_heavy_gamma_trace.jsonl: pre-generated trace
- scenarios/prefill_heavy_gamma/gen_gamma_trace.py: trace generator script
- harnesses/guidellm-llm-d-benchmark.sh: extended harness with guidellm replay profile support and runtime upgrade from a pinned commit
- patches/20_harness_pod.yaml.j2.patch: injects ConfigMap harness override into the pod command so it takes precedence over the image-baked script
- patches/step_06_create_profile_configmap.py.patch: switches ConfigMap creation to server-side apply to avoid the 262 KB annotation size limit

Makefile changes:
- Add BENCHMARK_ANALYZE (default true) and BENCHMARK_SKIP (default false) flags to benchmark-run
- Add BENCHMARK_REPLAY_SUPPORT (default false): when true, applies the hack/benchmark patches just-in-time and restores the cloned repo after the run, keeping llm-d-benchmark pristine between invocations
- Add GUIDELLM_REPLAY_COMMIT to pin the guidellm commit used for the runtime upgrade; propagated via LLMDBENCH_GUIDELLM_REPLAY_COMMIT env var
- Fix benchmark-standup to reference workload-autoscaling.yaml (renamed from inference-scheduling-wva.yaml in llm-d-benchmark v0.6.3)
- Fix trace lookup to check the scenario directory directly before falling back to the traces/ subdirectory

Run the gamma workload:

    make benchmark-run \
      BENCHMARK_NAMESPACE=<ns> \
      MODEL_ID=<model> \
      BENCHMARK_WORKLOAD=prefill_heavy_gamma.yaml \
      BENCHMARK_SCENARIOS_DIR=$(pwd)/hack/benchmark/scenarios/prefill_heavy_gamma \
      BENCHMARK_REPLAY_SUPPORT=true

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

step_06_create_profile_configmap.py hits the 262 KB last-applied-configuration annotation limit for all guidellm runs, not just replay workloads. Apply the server-side apply patch unconditionally and only gate the harness script / pod-template patches behind BENCHMARK_REPLAY_SUPPORT.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…tionally

guidellm v0.6.0 (the image default) does not resolve REPLACE_ENV_* env var placeholders in scenario YAML fields such as target. The upgraded guidellm (commit 73d91f311bd9) does, which is why the gamma run succeeded but standard prefill/decode/symmetrical runs fail with '--target: Field required'.

Fix by:
- Adding _resolve_env() to the harness script to expand REPLACE_ENV_<VAR> placeholders via bash indirect variable expansion, independent of guidellm version
- Applying all three patches unconditionally on every benchmark-run (not just when BENCHMARK_REPLAY_SUPPORT=true): the harness script and pod template patches are needed for all workloads, not only replay ones
- BENCHMARK_REPLAY_SUPPORT still gates only the guidellm upgrade (injecting LLMDBENCH_GUIDELLM_REPLAY_COMMIT)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Adds a full e2e test that verifies WVA adds replicas incrementally when every pod in the fleet reports 95% KV cache usage via llm-d-sim --fake-metrics, and caps at maxReplicas=10.

Each new replica inherits kv-cache-usage=0.95 from the Deployment pod template, so the V1 saturation engine never gains spare KV headroom and keeps recommending +1 until the VA maxReplicas clamp kicks in.

Three It blocks prove the full chain:
- Initial detection: V1 path selected, VA desired > 1
- Continuous climb: desired replicas logged step-by-step, asserted >= 5
- Final cap: VA desired == 10 AND Deployment spec.replicas == 10 via HPA

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

asm582 · 2026-06-16T18:26:38Z

KV cache saturation scale-up to maxReplicas should detect KV cache saturation (0.95) and recommend initial scale-up [full]
/Users/abhishekmalvankar/benchmarking/inferno-autoscaler/test/e2e/kv_cache_saturation_scaleup_test.go:191
  STEP: Snapshotting existing saturation ConfigMap for restore in AfterAll @ 06/16/26 14:21:34.351
  STEP: Creating model service with kv-cache-usage=0.95 fake metrics @ 06/16/26 14:21:34.353
  STEP: Waiting for initial replica to be ready @ 06/16/26 14:21:34.422
  STEP: Creating VA with minReplicas=1, maxReplicas=10 @ 06/16/26 14:21:39.432
  2026-06-16T14:21:39-04:00     INFO    VariantAutoscaling is deprecated and will be removed in a future release. Migrate to the annotation-based path (add llm-d.ai/managed=true to your HPA or ScaledObject). See docs/developer-guide/migrating-from-va-crd.md for migration steps.
  STEP: Creating HPA / ScaledObject with maxReplicas=10 so the Deployment actually scales @ 06/16/26 14:21:39.439
  STEP: Installing V1 saturation config (no analyzerName) with kvCacheThreshold=0.80 @ 06/16/26 14:21:39.456
  STEP: Asserting controller logs show V1 path selected for our model @ 06/16/26 14:21:39.463
  STEP: Waiting for VA to receive a positive desired allocation @ 06/16/26 14:22:09.547
    VA progress (saturation-kvcache-va): MetricsAvailable=True reason=MetricsFound message="Saturation metrics data is available for scaling decisions"
    VA progress (saturation-kvcache-va): DesiredOptimizedAlloc replicas=2 accelerator="H100"
  STEP: Asserting VA recommends more than 1 replica when kv-cache-usage is 0.95 @ 06/16/26 14:22:09.55
    Initial detection: VA desired replicas=2 (kv-cache-usage=0.95)
• [35.205 seconds]
------------------------------
KV cache saturation scale-up to maxReplicas should continuously scale up as each added replica also reports 95% KV cache usage [full]
/Users/abhishekmalvankar/benchmarking/inferno-autoscaler/test/e2e/kv_cache_saturation_scaleup_test.go:216
  STEP: Observing VA desired replicas climbing past the midpoint (>=5) as saturation persists @ 06/16/26 14:22:09.553
    KV saturation scale-up: WVA desired replicas 1 → 2 (each new replica starts at kv-cache-usage=0.95)
    KV saturation scale-up: WVA desired replicas 2 → 3 (each new replica starts at kv-cache-usage=0.95)
    KV saturation scale-up: WVA desired replicas 3 → 4 (each new replica starts at kv-cache-usage=0.95)
    KV saturation scale-up: WVA desired replicas 4 → 5 (each new replica starts at kv-cache-usage=0.95)
• [150.185 seconds]

Adds TestAnalyzeModelSaturation_EPPRoutingEffect to prove that the V1 percentage-based analyzer stops recommending scale-up when an intelligent router (EPP) concentrates traffic on a hot pod while leaving freshly-added pods cold.

Root cause: V1 computes avgSpareKv only over NON-saturated replicas. One cold pod (kv≈0.05) produces avgSpareKv=0.75 >> kvSpareTrigger=0.10, masking the hot pod's saturation entirely. Scale-up resumes only once the cold pod's KV usage crosses the threshold:

    coldPodKv > kvCacheThreshold - kvSpareTrigger → 0.80 - 0.10 = 0.70

The test covers:
- baseline (1 hot pod alone)
- 1 hot + 1 cold suppressed
- 1 hot + 4 cold suppressed
- exact boundary (kv=0.70, no scale-up)
- one tick above boundary (kv=0.71, scale-up resumes)
- all-saturated (the --fake-metrics e2e test path)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
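
The masking effect described in this commit can be reproduced with a small table-driven sketch. It is not the actual `TestAnalyzeModelSaturation_EPPRoutingEffect`; `shouldScaleUp` is a hypothetical helper that assumes spare KV is `kvCacheThreshold - usage`, averaged over non-saturated replicas, with a strict `<` comparison against `kvSpareTrigger`.

```go
package main

import "fmt"

// shouldScaleUp averages spare KV over non-saturated replicas only and
// recommends scale-up when that average is strictly below kvSpareTrigger.
// With no non-saturated replicas the average is taken as 0.
func shouldScaleUp(kvUsage []float64, kvCacheThreshold, kvSpareTrigger float64) (float64, bool) {
	spareSum, nonSaturated := 0.0, 0
	for _, u := range kvUsage {
		if u < kvCacheThreshold {
			spareSum += kvCacheThreshold - u
			nonSaturated++
		}
	}
	avgSpareKv := 0.0
	if nonSaturated > 0 {
		avgSpareKv = spareSum / float64(nonSaturated)
	}
	return avgSpareKv, avgSpareKv < kvSpareTrigger
}

func main() {
	cases := []struct {
		name string
		kv   []float64
	}{
		{"1 hot pod alone", []float64{0.95}},
		{"1 hot + 1 cold", []float64{0.95, 0.05}},
		{"1 hot + 4 cold", []float64{0.95, 0.05, 0.05, 0.05, 0.05}},
		{"boundary, cold at 0.70", []float64{0.95, 0.70}},
		{"one tick above, cold at 0.71", []float64{0.95, 0.71}},
		{"all saturated", []float64{0.95, 0.95, 0.95}},
	}
	for _, c := range cases {
		avg, up := shouldScaleUp(c.kv, 0.80, 0.10)
		fmt.Printf("%-30s avgSpareKv=%.2f scaleUp=%t\n", c.name, avg, up)
	}
}
```

A single cold pod is enough to suppress scale-up because the hot pod contributes nothing to the average once it is classified as saturated.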

asm582 and others added 4 commits June 16, 2026 11:13

asm582 wants to merge 5 commits into llm-d:main from asm582:feat/kv-cache-saturation-scaleup-test
