test(e2e): KV cache saturation scale-up to maxReplicas=10 #1283

Draft
asm582 wants to merge 5 commits into llm-d:main from asm582:feat/kv-cache-saturation-scaleup-test
asm582 (Collaborator) commented Jun 16, 2026

Summary

  • Adds a new e2e test, labelled full, that verifies WVA scales a Deployment from 1 → 10 replicas when every pod continuously reports 95% KV cache usage via llm-d-inference-sim --fake-metrics
  • Each new replica inherits kv-cache-usage=0.95 from the Deployment pod template — no per-pod configuration needed
  • The V1 saturation engine never gains spare KV headroom (avgSpareKv=0 < kvSpareTrigger=0.10) and keeps recommending +1 until the VA maxReplicas=10 clamp fires

How the test works

| Config | Value | Why |
|---|---|---|
| kv-cache-usage (fake-metrics) | 0.95 | All replicas saturated (≥ threshold 0.80) |
| kvCacheThreshold | 0.80 | 0.95 ≥ 0.80 → nonSaturatedCount=0 |
| kvSpareTrigger | 0.10 | avgSpareKv=0 < 0.10 → ShouldScaleUp=true |
| maxReplicas (VA + HPA) | 10 | Caps the scale-up ladder |
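
The decision chain in the table can be sketched in a few lines of Go. This is a minimal illustration, not the WVA analyzer: the type and function names (`saturationConfig`, `decision`, `analyze`) are hypothetical, spare KV is assumed to be `kvCacheThreshold - kvUsage` averaged over non-saturated replicas, and avgSpareKv is assumed to be 0 when no replica is below the threshold, as the summary states. Any other signal the real engine considers is ignored here.

```go
package main

import "fmt"

// saturationConfig mirrors the two ConfigMap values used by this test.
// The struct itself is hypothetical, for illustration only.
type saturationConfig struct {
	KvCacheThreshold float64
	KvSpareTrigger   float64
}

// decision holds the intermediate values named in the table above.
type decision struct {
	NonSaturatedCount int
	AvgSpareKv        float64
	ShouldScaleUp     bool
}

// analyze applies the assumed V1 rule: average the spare KV over
// non-saturated replicas only, then scale up if that average is
// below the spare trigger.
func analyze(kvUsage []float64, cfg saturationConfig) decision {
	var d decision
	spareSum := 0.0
	for _, kv := range kvUsage {
		if kv >= cfg.KvCacheThreshold {
			continue // saturated replica: contributes no spare capacity
		}
		d.NonSaturatedCount++
		spareSum += cfg.KvCacheThreshold - kv
	}
	if d.NonSaturatedCount > 0 {
		d.AvgSpareKv = spareSum / float64(d.NonSaturatedCount)
	}
	// With zero non-saturated replicas AvgSpareKv stays 0 (assumption).
	d.ShouldScaleUp = d.AvgSpareKv < cfg.KvSpareTrigger
	return d
}

func main() {
	cfg := saturationConfig{KvCacheThreshold: 0.80, KvSpareTrigger: 0.10}
	// Every replica reports 0.95, as in the --fake-metrics setup.
	d := analyze([]float64{0.95, 0.95, 0.95}, cfg)
	fmt.Printf("nonSaturated=%d avgSpareKv=%.2f shouldScaleUp=%v\n",
		d.NonSaturatedCount, d.AvgSpareKv, d.ShouldScaleUp)
}
```

Because every new pod inherits the same 0.95 value, adding replicas never changes the inputs to this rule, which is what keeps the recommendation at +1.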

Three It blocks

  1. Initial detection — V1 path selected in controller logs, VA DesiredOptimizedAlloc.NumReplicas > 1
  2. Continuous climb — desired replicas logged step-by-step (1→2→…), asserted ≥ 5 mid-ladder
  3. Final cap — VA desired == 10 AND Deployment.Spec.Replicas == 10 via HPA (end-to-end proof)
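
The ladder these three blocks observe can be modelled as a loop that adds one replica per step until the maxReplicas clamp stops it. The sketch below is a simplified model under the assumption that all replicas report the same fixed KV usage; `ladder` is a hypothetical helper, not code from the test.

```go
package main

import "fmt"

// ladder returns the sequence of desired replica counts, starting at
// minReplicas and adding one per step while the (identical) replicas
// stay saturated, clamped at maxReplicas.
func ladder(kvUsage, kvCacheThreshold float64, minReplicas, maxReplicas int) []int {
	steps := []int{minReplicas}
	current := minReplicas
	// Saturation never clears because every new replica reports kvUsage too.
	for current < maxReplicas && kvUsage >= kvCacheThreshold {
		current++
		steps = append(steps, current)
	}
	return steps
}

func main() {
	steps := ladder(0.95, 0.80, 1, 10)
	fmt.Println(steps) // [1 2 3 4 5 6 7 8 9 10]

	// The three checks correspond loosely to the three It blocks.
	fmt.Println("initial detection (>1):", steps[1] > 1)
	fmt.Println("mid-ladder (>=5):", steps[4] >= 5)
	fmt.Println("final cap (==10):", steps[len(steps)-1] == 10)
}
```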

Running

# Focused run (cluster already set up):
make test-e2e-full FOCUS="KV cache saturation"

# With cluster creation:
make create-kind-cluster && DEPLOY_LWS=true make deploy-e2e-infra && make test-e2e-full FOCUS="KV cache saturation"

Test plan

  • Verify test passes on kind-emulator with USE_SIMULATOR=true
  • Verify test is skipped cleanly when USE_SIMULATOR=false
  • Verify AfterAll restores the saturation ConfigMap and deletes all resources

🤖 Generated with Claude Code

asm582 and others added 4 commits June 16, 2026 11:13
… support

Introduces a Poisson-equivalent (k=1.0) Gamma arrival workload for
stress-testing prefill saturation under realistic bursty traffic, along
with the harness and Makefile infrastructure needed to run it without
modifying the cloned llm-d-benchmark repo.

New files under hack/benchmark/:
- scenarios/prefill_heavy_gamma/prefill_heavy_gamma.yaml — workload scenario
- scenarios/prefill_heavy_gamma/prefill_heavy_gamma_trace.jsonl — pre-generated trace
- scenarios/prefill_heavy_gamma/gen_gamma_trace.py — trace generator script
- harnesses/guidellm-llm-d-benchmark.sh — extended harness with guidellm
  replay profile support and runtime upgrade from a pinned commit
- patches/20_harness_pod.yaml.j2.patch — injects ConfigMap harness override
  into the pod command so it takes precedence over the image-baked script
- patches/step_06_create_profile_configmap.py.patch — switches ConfigMap
  creation to server-side apply to avoid the 262 KB annotation size limit

Makefile changes:
- Add BENCHMARK_ANALYZE (default true) and BENCHMARK_SKIP (default false)
  flags to benchmark-run
- Add BENCHMARK_REPLAY_SUPPORT (default false): when true, applies the
  hack/benchmark patches just-in-time and restores the cloned repo after
  the run, keeping llm-d-benchmark pristine between invocations
- Add GUIDELLM_REPLAY_COMMIT to pin the guidellm commit used for the
  runtime upgrade; propagated via LLMDBENCH_GUIDELLM_REPLAY_COMMIT env var
- Fix benchmark-standup to reference workload-autoscaling.yaml (renamed
  from inference-scheduling-wva.yaml in llm-d-benchmark v0.6.3)
- Fix trace lookup to check the scenario directory directly before falling
  back to the traces/ subdirectory

Run the gamma workload:
  make benchmark-run \
    BENCHMARK_NAMESPACE=<ns> \
    MODEL_ID=<model> \
    BENCHMARK_WORKLOAD=prefill_heavy_gamma.yaml \
    BENCHMARK_SCENARIOS_DIR=$(pwd)/hack/benchmark/scenarios/prefill_heavy_gamma \
    BENCHMARK_REPLAY_SUPPORT=true

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

step_06_create_profile_configmap.py hits the 262 KB
last-applied-configuration annotation limit for all guidellm runs, not
just replay workloads. Apply the server-side apply patch unconditionally
and only gate the harness script / pod-template patches behind
BENCHMARK_REPLAY_SUPPORT.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…tionally

guidellm v0.6.0 (the image default) does not resolve REPLACE_ENV_* env var
placeholders in scenario YAML fields such as target. The upgraded guidellm
(commit 73d91f311bd9) does, which is why the gamma run succeeded but standard
prefill/decode/symmetrical runs fail with '--target: Field required'.

Fix by:
- Adding _resolve_env() to the harness script to expand REPLACE_ENV_<VAR>
  placeholders via bash indirect variable expansion, independent of guidellm
  version
- Applying all three patches unconditionally on every benchmark-run (not just
  when BENCHMARK_REPLAY_SUPPORT=true): the harness script and pod template
  patches are needed for all workloads, not only replay ones
- BENCHMARK_REPLAY_SUPPORT still gates only the guidellm upgrade (injecting
  LLMDBENCH_GUIDELLM_REPLAY_COMMIT)
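
The harness does this substitution in bash. Purely to illustrate the rule, here is the same idea in Go: every `REPLACE_ENV_<VAR>` token is replaced by the value of `<VAR>`, and unknown variables are left untouched. The function name `resolveEnv`, the regular expression, and the `DEMO_TARGET_URL` variable are assumptions for this sketch, not taken from the harness script.

```go
package main

import (
	"fmt"
	"os"
	"regexp"
)

// placeholder matches REPLACE_ENV_<VAR>, where <VAR> is assumed to be a
// shell-style identifier (letters, digits, underscores).
var placeholder = regexp.MustCompile(`REPLACE_ENV_([A-Za-z_][A-Za-z0-9_]*)`)

// resolveEnv expands each placeholder using lookup. Placeholders whose
// variable is not set are returned unchanged so the failure stays visible.
func resolveEnv(s string, lookup func(string) (string, bool)) string {
	return placeholder.ReplaceAllStringFunc(s, func(m string) string {
		name := placeholder.FindStringSubmatch(m)[1]
		if v, ok := lookup(name); ok {
			return v
		}
		return m
	})
}

func main() {
	// DEMO_TARGET_URL is a made-up variable for this example.
	os.Setenv("DEMO_TARGET_URL", "http://localhost:8000")
	fmt.Println(resolveEnv("target: REPLACE_ENV_DEMO_TARGET_URL", os.LookupEnv))
}
```

Resolving placeholders before the scenario reaches guidellm is what makes the result independent of the guidellm version.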

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Adds a full e2e test that verifies WVA adds replicas incrementally when
every pod in the fleet reports 95% KV cache usage via llm-d-sim
--fake-metrics, and caps at maxReplicas=10.

Each new replica inherits kv-cache-usage=0.95 from the Deployment pod
template, so the V1 saturation engine never gains spare KV headroom and
keeps recommending +1 until the VA maxReplicas clamp kicks in.

Three It blocks prove the full chain:
- Initial detection: V1 path selected, VA desired > 1
- Continuous climb: desired replicas logged step-by-step, asserted >= 5
- Final cap: VA desired == 10 AND Deployment spec.replicas == 10 via HPA

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

asm582 (Collaborator, Author) commented Jun 16, 2026

KV cache saturation scale-up to maxReplicas should detect KV cache saturation (0.95) and recommend initial scale-up [full]
/Users/abhishekmalvankar/benchmarking/inferno-autoscaler/test/e2e/kv_cache_saturation_scaleup_test.go:191
  STEP: Snapshotting existing saturation ConfigMap for restore in AfterAll @ 06/16/26 14:21:34.351
  STEP: Creating model service with kv-cache-usage=0.95 fake metrics @ 06/16/26 14:21:34.353
  STEP: Waiting for initial replica to be ready @ 06/16/26 14:21:34.422
  STEP: Creating VA with minReplicas=1, maxReplicas=10 @ 06/16/26 14:21:39.432
  2026-06-16T14:21:39-04:00     INFO    VariantAutoscaling is deprecated and will be removed in a future release. Migrate to the annotation-based path (add llm-d.ai/managed=true to your HPA or ScaledObject). See docs/developer-guide/migrating-from-va-crd.md for migration steps.
  STEP: Creating HPA / ScaledObject with maxReplicas=10 so the Deployment actually scales @ 06/16/26 14:21:39.439
  STEP: Installing V1 saturation config (no analyzerName) with kvCacheThreshold=0.80 @ 06/16/26 14:21:39.456
  STEP: Asserting controller logs show V1 path selected for our model @ 06/16/26 14:21:39.463
  STEP: Waiting for VA to receive a positive desired allocation @ 06/16/26 14:22:09.547
  2026-06-16T14:22:09-04:00     INFO    VariantAutoscaling is deprecated and will be removed in a future release. Migrate to the annotation-based path (add llm-d.ai/managed=true to your HPA or ScaledObject). See docs/developer-guide/migrating-from-va-crd.md for migration steps.
    VA progress (saturation-kvcache-va): MetricsAvailable=True reason=MetricsFound message="Saturation metrics data is available for scaling decisions"
    VA progress (saturation-kvcache-va): DesiredOptimizedAlloc replicas=2 accelerator="H100"
  STEP: Asserting VA recommends more than 1 replica when kv-cache-usage is 0.95 @ 06/16/26 14:22:09.55
  2026-06-16T14:22:09-04:00     INFO    VariantAutoscaling is deprecated and will be removed in a future release. Migrate to the annotation-based path (add llm-d.ai/managed=true to your HPA or ScaledObject). See docs/developer-guide/migrating-from-va-crd.md for migration steps.
    Initial detection: VA desired replicas=2 (kv-cache-usage=0.95)
• [35.205 seconds]
------------------------------
KV cache saturation scale-up to maxReplicas should continuously scale up as each added replica also reports 95% KV cache usage [full]
/Users/abhishekmalvankar/benchmarking/inferno-autoscaler/test/e2e/kv_cache_saturation_scaleup_test.go:216
  STEP: Observing VA desired replicas climbing past the midpoint (>=5) as saturation persists @ 06/16/26 14:22:09.553
  2026-06-16T14:22:09-04:00     INFO    VariantAutoscaling is deprecated and will be removed in a future release. Migrate to the annotation-based path (add llm-d.ai/managed=true to your HPA or ScaledObject). See docs/developer-guide/migrating-from-va-crd.md for migration steps.
    KV saturation scale-up: WVA desired replicas 1 → 2 (each new replica starts at kv-cache-usage=0.95)
  2026-06-16T14:23:09-04:00     INFO    VariantAutoscaling is deprecated and will be removed in a future release. Migrate to the annotation-based path (add llm-d.ai/managed=true to your HPA or ScaledObject). See docs/developer-guide/migrating-from-va-crd.md for migration steps.
    KV saturation scale-up: WVA desired replicas 2 → 3 (each new replica starts at kv-cache-usage=0.95)
  2026-06-16T14:24:09-04:00     INFO    VariantAutoscaling is deprecated and will be removed in a future release. Migrate to the annotation-based path (add llm-d.ai/managed=true to your HPA or ScaledObject). See docs/developer-guide/migrating-from-va-crd.md for migration steps.
    KV saturation scale-up: WVA desired replicas 3 → 4 (each new replica starts at kv-cache-usage=0.95)
  2026-06-16T14:24:39-04:00     INFO    VariantAutoscaling is deprecated and will be removed in a future release. Migrate to the annotation-based path (add llm-d.ai/managed=true to your HPA or ScaledObject). See docs/developer-guide/migrating-from-va-crd.md for migration steps.
    KV saturation scale-up: WVA desired replicas 4 → 5 (each new replica starts at kv-cache-usage=0.95)
• [150.185 seconds]

Adds TestAnalyzeModelSaturation_EPPRoutingEffect to prove that the V1
percentage-based analyzer stops recommending scale-up when an intelligent
router (EPP) concentrates traffic on a hot pod while leaving freshly-added
pods cold.

Root cause: V1 computes avgSpareKv only over NON-saturated replicas.
One cold pod (kv≈0.05) produces avgSpareKv=0.75 >> kvSpareTrigger=0.10,
masking the hot pod's saturation entirely.  Scale-up resumes only once
the cold pod's spare KV falls below kvSpareTrigger, i.e. when:

  coldPodKv > kvCacheThreshold - kvSpareTrigger  →  0.80 - 0.10 = 0.70

The test covers: baseline (1 hot pod alone), 1 hot + 1 cold suppressed,
1 hot + 4 cold suppressed, exact boundary (kv=0.70, no scale-up),
one tick above boundary (kv=0.71, scale-up resumes), and all-saturated
(the --fake-metrics e2e test path).
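
The boundary cases listed above follow from a short derivation. It assumes spare KV per replica is the threshold minus the replica's KV usage, which is consistent with the 0.05 → 0.75 figure in this message; the symbols T, s, and N are introduced here for brevity.

```latex
\begin{aligned}
T &= \text{kvCacheThreshold} = 0.80, \qquad s = \text{kvSpareTrigger} = 0.10 \\
N &= \{\, i : kv_i < T \,\} \quad \text{(non-saturated replicas only)} \\
\text{avgSpareKv} &= \frac{1}{|N|} \sum_{i \in N} \left( T - kv_i \right) \\
\text{ShouldScaleUp} &\iff \text{avgSpareKv} < s \\[4pt]
\text{One cold pod:} \quad T - kv_{\text{cold}} < s &\iff kv_{\text{cold}} > T - s = 0.70 \\[4pt]
kv_{\text{cold}} = 0.05 &: \quad 0.80 - 0.05 = 0.75 \not< 0.10 \quad \text{(scale-up suppressed)} \\
kv_{\text{cold}} = 0.70 &: \quad 0.80 - 0.70 = 0.10 \not< 0.10 \quad \text{(exact boundary, no scale-up)} \\
kv_{\text{cold}} = 0.71 &: \quad 0.80 - 0.71 = 0.09 < 0.10 \quad \text{(scale-up resumes)}
\end{aligned}
```

The hot pod never appears in the sum because it is excluded from N, which is why its saturation is masked.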

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>