test(e2e): KV cache saturation scale-up to maxReplicas=10 #1283
Draft
asm582 wants to merge 5 commits into
… support
Introduces a Poisson-equivalent (k=1.0) Gamma arrival workload for
stress-testing prefill saturation under realistic bursty traffic, along
with the harness and Makefile infrastructure needed to run it without
modifying the cloned llm-d-benchmark repo.
New files under hack/benchmark/:
- scenarios/prefill_heavy_gamma/prefill_heavy_gamma.yaml — workload scenario
- scenarios/prefill_heavy_gamma/prefill_heavy_gamma_trace.jsonl — pre-generated trace
- scenarios/prefill_heavy_gamma/gen_gamma_trace.py — trace generator script
- harnesses/guidellm-llm-d-benchmark.sh — extended harness with guidellm
replay profile support and runtime upgrade from a pinned commit
- patches/20_harness_pod.yaml.j2.patch — injects ConfigMap harness override
into the pod command so it takes precedence over the image-baked script
- patches/step_06_create_profile_configmap.py.patch — switches ConfigMap
creation to server-side apply to avoid the 262 KB annotation size limit
Makefile changes:
- Add BENCHMARK_ANALYZE (default true) and BENCHMARK_SKIP (default false)
flags to benchmark-run
- Add BENCHMARK_REPLAY_SUPPORT (default false): when true, applies the
hack/benchmark patches just-in-time and restores the cloned repo after
the run, keeping llm-d-benchmark pristine between invocations
- Add GUIDELLM_REPLAY_COMMIT to pin the guidellm commit used for the
runtime upgrade; propagated via LLMDBENCH_GUIDELLM_REPLAY_COMMIT env var
- Fix benchmark-standup to reference workload-autoscaling.yaml (renamed
from inference-scheduling-wva.yaml in llm-d-benchmark v0.6.3)
- Fix trace lookup to check the scenario directory directly before falling
back to the traces/ subdirectory
Run the gamma workload:
make benchmark-run \
BENCHMARK_NAMESPACE=<ns> \
MODEL_ID=<model> \
BENCHMARK_WORKLOAD=prefill_heavy_gamma.yaml \
BENCHMARK_SCENARIOS_DIR=$(pwd)/hack/benchmark/scenarios/prefill_heavy_gamma \
BENCHMARK_REPLAY_SUPPORT=true
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
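The "Poisson-equivalent (k=1.0)" wording above can be illustrated with a small sketch. This is not the contents of gen_gamma_trace.py; the function names, the 5 req/s rate, and the seed are assumptions chosen for illustration. It only shows why a Gamma inter-arrival distribution with shape k=1.0 reduces to exponential gaps, which is what a Poisson arrival process produces.

```python
import random


def gamma_interarrivals(rate_rps, shape_k, n, seed=0):
    """Return n inter-arrival gaps (seconds) whose mean is 1 / rate_rps.

    Hypothetical helper: Gamma(shape=k, scale=1/(k*rate)) has mean 1/rate
    for any k. With k = 1.0 it is the exponential distribution, i.e. the
    gaps of a Poisson process; k < 1 would be burstier, k > 1 smoother.
    """
    rng = random.Random(seed)
    scale = 1.0 / (shape_k * rate_rps)
    return [rng.gammavariate(shape_k, scale) for _ in range(n)]


def arrival_timestamps(gaps):
    """Turn gaps into cumulative arrival times, as a replay trace would need."""
    t = 0.0
    stamps = []
    for gap in gaps:
        t += gap
        stamps.append(t)
    return stamps


gaps = gamma_interarrivals(rate_rps=5.0, shape_k=1.0, n=20000)
stamps = arrival_timestamps(gaps)
mean_gap = sum(gaps) / len(gaps)  # close to 1 / 5.0 = 0.2 seconds
```

Keeping the scale tied to `1/(k*rate)` means the shape can be changed to vary burstiness without changing the average request rate.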
step_06_create_profile_configmap.py hits the 262 KB last-applied-configuration annotation limit for all guidellm runs, not just replay workloads. Apply the server-side apply patch unconditionally and only gate the harness script / pod-template patches behind BENCHMARK_REPLAY_SUPPORT.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tionally
guidellm v0.6.0 (the image default) does not resolve REPLACE_ENV_* env var placeholders in scenario YAML fields such as target. The upgraded guidellm (commit 73d91f311bd9) does, which is why the gamma run succeeded but standard prefill/decode/symmetrical runs fail with '--target: Field required'.
Fix by:
- Adding _resolve_env() to the harness script to expand REPLACE_ENV_<VAR> placeholders via bash indirect variable expansion, independent of guidellm version
- Applying all three patches unconditionally on every benchmark-run (not just when BENCHMARK_REPLAY_SUPPORT=true): the harness script and pod template patches are needed for all workloads, not only replay ones
- BENCHMARK_REPLAY_SUPPORT still gates only the guidellm upgrade (injecting LLMDBENCH_GUIDELLM_REPLAY_COMMIT)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
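The placeholder expansion described above is done in bash in the harness. The sketch below is a Python analogue of the same idea, not the harness code. The exact placeholder grammar, the variable name TARGET_URL, and the choice to leave unset variables untouched are all assumptions.

```python
import os
import re

# Assumed grammar: REPLACE_ENV_ followed by a shell-style variable name.
_PLACEHOLDER = re.compile(r"REPLACE_ENV_([A-Za-z_][A-Za-z0-9_]*)")


def resolve_env(text, env=None):
    """Replace each REPLACE_ENV_<VAR> with the value of <VAR>.

    Hypothetical analogue of the harness's _resolve_env(). Placeholders
    whose variable is not set are left as-is (an assumption), so a missing
    variable stays visible in the rendered scenario.
    """
    env = os.environ if env is None else env

    def _substitute(match):
        name = match.group(1)
        return env.get(name, match.group(0))

    return _PLACEHOLDER.sub(_substitute, text)


rendered = resolve_env(
    "target: REPLACE_ENV_TARGET_URL",
    env={"TARGET_URL": "http://example.invalid:8000"},
)
```

Resolving placeholders before the scenario reaches guidellm is what makes the behaviour independent of the guidellm version in the image.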
Adds a full e2e test that verifies WVA adds replicas incrementally when every pod in the fleet reports 95% KV cache usage via llm-d-sim --fake-metrics, and caps at maxReplicas=10.
Each new replica inherits kv-cache-usage=0.95 from the Deployment pod template, so the V1 saturation engine never gains spare KV headroom and keeps recommending +1 until the VA maxReplicas clamp kicks in.
Three It blocks prove the full chain:
- Initial detection: V1 path selected, VA desired > 1
- Continuous climb: desired replicas logged step-by-step, asserted >= 5
- Final cap: VA desired == 10 AND Deployment spec.replicas == 10 via HPA
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
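The climb described above can be traced with a toy model. This is a simplified sketch, not the WVA engine: it treats a pod as saturated when its KV usage is at or above the threshold (an assumption), ignores every other signal the engine may use, and steps one replica per tick. The function name and the tick limit are hypothetical.

```python
def desired_ladder(fake_kv=0.95, kv_cache_threshold=0.80,
                   max_replicas=10, max_ticks=20):
    """Return the sequence of desired replica counts, starting from 1.

    Every replica reports the same fake KV usage because it comes from the
    pod template, so adding a replica never creates spare headroom.
    """
    replicas = 1
    ladder = [replicas]
    for _ in range(max_ticks):
        usages = [fake_kv] * replicas
        all_saturated = all(u >= kv_cache_threshold for u in usages)
        if not all_saturated:
            break  # some pod has headroom; this toy model stops climbing
        clamped = min(replicas + 1, max_replicas)  # the maxReplicas clamp
        if clamped == replicas:
            break  # already at the cap
        replicas = clamped
        ladder.append(replicas)
    return ladder


ladder = desired_ladder()
```

With the defaults, the ladder passes through 5 on the way up and ends at 10, which mirrors the mid-ladder and final-cap assertions of the It blocks.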
Adds TestAnalyzeModelSaturation_EPPRoutingEffect to prove that the V1 percentage-based analyzer stops recommending scale-up when an intelligent router (EPP) concentrates traffic on a hot pod while leaving freshly-added pods cold.
Root cause: V1 computes avgSpareKv only over NON-saturated replicas. One cold pod (kv≈0.05) produces avgSpareKv=0.75 >> kvSpareTrigger=0.10, masking the hot pod's saturation entirely. Scale-up resumes only once the cold pod's KV usage crosses the threshold:
coldPodKv > kvCacheThreshold - kvSpareTrigger → 0.80 - 0.10 = 0.70
The test covers:
- baseline (1 hot pod alone)
- 1 hot + 1 cold suppressed
- 1 hot + 4 cold suppressed
- exact boundary (kv=0.70, no scale-up)
- one tick above boundary (kv=0.71, scale-up resumes)
- all-saturated (the --fake-metrics e2e test path)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
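The masking effect above can be reproduced with a few lines of arithmetic. The sketch below models only the KV part of the decision as described in this commit message; the real analyzer is Go and may consider other signals. The function names, the hot-pod value 0.95, and the "saturated means usage >= threshold" rule are assumptions.

```python
KV_CACHE_THRESHOLD = 0.80
KV_SPARE_TRIGGER = 0.10


def avg_spare_kv(usages, threshold=KV_CACHE_THRESHOLD):
    """Average spare KV over NON-saturated replicas only; 0.0 if none."""
    spare = [threshold - u for u in usages if u < threshold]
    return sum(spare) / len(spare) if spare else 0.0


def should_scale_up(usages, threshold=KV_CACHE_THRESHOLD,
                    trigger=KV_SPARE_TRIGGER):
    """Scale up only when the average spare KV drops below the trigger."""
    return avg_spare_kv(usages, threshold) < trigger


HOT = 0.95  # assumed usage of the pod the router keeps busy

baseline = should_scale_up([HOT])                 # hot pod alone
one_cold = should_scale_up([HOT, 0.05])           # cold pod masks the hot one
four_cold = should_scale_up([HOT] + [0.05] * 4)   # still masked
at_boundary = should_scale_up([HOT, 0.70])        # spare is not below 0.10
above_boundary = should_scale_up([HOT, 0.71])     # spare falls below 0.10
all_saturated = should_scale_up([HOT] * 3)        # the --fake-metrics path
```

Because saturated pods are excluded from the average, a single cold pod at 0.05 contributes a spare of 0.75 on its own, no matter how many hot pods sit beside it.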
Summary
- Adds a `full`-labelled e2e test that proves WVA scales a Deployment from 1 → 10 replicas when every pod continuously reports 95% KV cache usage via `llm-d-inference-sim --fake-metrics`
- Each new replica inherits `kv-cache-usage=0.95` from the Deployment pod template, so no per-pod configuration is needed
- The V1 saturation engine never gains spare KV headroom (`avgSpareKv=0 < kvSpareTrigger=0.10`) and keeps recommending `+1` until the VA `maxReplicas=10` clamp fires
How the test works
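| Parameter | Value | Effect |
|---|---|---|
| `kv-cache-usage` (fake-metrics) | `0.95` | |
| `kvCacheThreshold` | `0.80` | `nonSaturatedCount=0` |
| `kvSpareTrigger` | `0.10` | `avgSpareKv=0 < 0.10` → `ShouldScaleUp=true` |
| `maxReplicas` (VA + HPA) | `10` | |
Three It blocks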
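- Initial detection: V1 path selected, `DesiredOptimizedAlloc.NumReplicas > 1`
- Continuous climb: desired replicas logged step-by-step (`1→2→…`), asserted `≥ 5` mid-ladder
- Final cap: VA desired `== 10` AND `Deployment.Spec.Replicas == 10` via HPA (end-to-end proof)
Running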
Test plan
- `USE_SIMULATOR=true`
- `USE_SIMULATOR=false`
- `AfterAll` restores the saturation ConfigMap and deletes all resources
🤖 Generated with Claude Code