fix(optimizer): prefer cheaper variant when it covers overflow at lower absolute cost #1252

Open
biranofer wants to merge 1 commit into llm-d:main from biranofer:fix/cost-aware-optimizer-fallback

Conversation

@biranofer

Collaborator

Summary

  • costAwareScaleUp greedily picked the most cost-efficient variant (lowest cost/perReplicaCapacity) but overpaid when a cheaper-by-absolute-cost variant could cover the required overflow at lower total spend
  • Add a per-iteration fallback check: if covering the overflow with the cheapest variant costs strictly less than one additional replica of the current variant, round down and let the loop assign the remainder to the cheaper variant
  • Guard added: fallback only applies when the cheapest variant has headroom below its maxReplicas cap

Details

The N-1 terms cancel algebraically, so the check reduces to:

fallbackReplicas * cheapest.Cost < vc.Cost

On equality the more efficient variant is kept (strict <) — at equal cost it provides more capacity per dollar.
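The cancellation can be spelled out with a small sketch. It assumes the unreduced comparison is between N replicas of the current variant and N-1 replicas of it plus fallbackReplicas of the cheapest variant; the PR description does not write that form out, so treat it as an assumption. The `variant` type, the `preferFallback` helper, and the `cheapestAvailable` headroom parameter are hypothetical stand-ins, not the optimizer's actual code. The costs are taken from the test plan (primary cost=10, v2 cost=5).

```go
package main

import "fmt"

// variant is a hypothetical stand-in for the optimizer's variant type.
// Only Cost mirrors a field named in the PR description.
type variant struct {
	Name string
	Cost float64 // absolute cost of one replica
}

// preferFallback reports whether the overflow should go to the cheapest
// variant instead of one more replica of the current variant.
//
// Assumed unreduced form:
//   (n-1)*current.Cost + fallbackReplicas*cheapest.Cost < n*current.Cost
// Subtracting (n-1)*current.Cost from both sides leaves the check below.
func preferFallback(current, cheapest variant, fallbackReplicas, cheapestAvailable int) bool {
	// Guard: the cheapest variant needs headroom below its maxReplicas cap.
	if fallbackReplicas > cheapestAvailable {
		return false
	}
	// Strict <: on equal cost the more efficient current variant is kept.
	return float64(fallbackReplicas)*cheapest.Cost < current.Cost
}

func main() {
	primary := variant{Name: "primary", Cost: 10}
	v2 := variant{Name: "v2", Cost: 5}

	fmt.Println(preferFallback(primary, v2, 1, 10)) // 1*5 < 10: true
	fmt.Println(preferFallback(primary, v2, 2, 10)) // 2*5 < 10 fails on equality: false
	fmt.Println(preferFallback(primary, v2, 1, 0))  // no headroom: false
}
```

Note that N does not appear in the reduced check at all, which is why the helper takes no replica count for the current variant.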

Fixes #1251

Test plan

  • Existing unit tests pass (make test — all 113 specs green including TestScaleToZero/should respect maxReplicas during scale-up)
  • Empirically validated on pokprod001: two-variant Llama-3.1-8B (primary TP=2 cost=10, v2 TP=1 cost=5), rate=15 req/s, 25-min run — primary held at peak=2 replicas, v2 absorbed all remaining demand up to max=10, 98.6% SLO

🤖 Generated with Claude Code

…er absolute cost

costAwareScaleUp greedily allocated replicas to the most cost-efficient
variant (lowest cost/perReplicaCapacity). This missed cases where a
cheaper-by-absolute-cost variant can cover the required capacity overflow
at lower total spend — e.g. 1 replica at cost 10 vs 1 replica at cost 5
for the same demand.

Add a per-iteration check: before committing to the Nth replica of the
current variant, compare its cost against the cost of covering the overflow
with the cheapest-by-absolute-cost variant. If the fallback is strictly
cheaper AND has remaining headroom (below its maxReplicas cap), round down
and let the loop assign remaining demand to the cheaper variant.

The N-1 terms cancel, so the condition reduces to:
  fallbackReplicas * cheapest.Cost < vc.Cost

On equality the more efficient variant is kept (strict <), which is correct
since at equal cost it provides more capacity per dollar.

Fixes: llm-d#1251

Signed-off-by: Biran <biran@il.ibm.com>
Signed-off-by: biran <biranofer@gmail.com>
@biranofer biranofer requested a review from ev-shindin June 9, 2026 18:09
@biranofer biranofer self-assigned this Jun 9, 2026
Comment thread internal/engines/pipeline/cost_aware_optimizer.go
@biranofer

biranofer commented Jun 10, 2026

Collaborator Author

I'm considering adding a CS (Cost Save) configuration parameter in [0..1] that prefers the cheapest variant only if it saves more than a fraction CS of the cost, i.e.
if fallbackReplicas <= cheapestAvailable && float64(fallbackReplicas)*cheapestVC.Cost < (1-CS)*vc.Cost { replicasNeeded-- }

A reasonable default could be 0.2.
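As an illustration of how the proposed threshold would change the decision, here is a minimal sketch. The function name `preferFallbackWithSavings` and its parameters are hypothetical; only the inequality comes from the comment above. With CS=0 it reduces to the strict check in the PR description.

```go
package main

import "fmt"

// preferFallbackWithSavings is a hypothetical helper: it prefers the cheapest
// variant only when covering the overflow with it costs less than
// (1-cs) times one replica of the current variant. cs is expected in [0, 1].
func preferFallbackWithSavings(currentCost, cheapestCost float64, fallbackReplicas, cheapestAvailable int, cs float64) bool {
	// Same headroom guard as in the proposed condition.
	if fallbackReplicas > cheapestAvailable {
		return false
	}
	return float64(fallbackReplicas)*cheapestCost < (1-cs)*currentCost
}

func main() {
	// Current variant costs 10 per replica; with cs=0.2 the threshold is 8.
	fmt.Println(preferFallbackWithSavings(10, 5, 1, 10, 0.2)) // 5 < 8: true
	fmt.Println(preferFallbackWithSavings(10, 9, 1, 10, 0.2)) // 9 < 8: false
	fmt.Println(preferFallbackWithSavings(10, 9, 1, 10, 0))   // 9 < 10: true
}
```

The middle case shows the intended effect: a saving of 1 out of 10 is below the 20% bar, so the more efficient variant is kept.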

@deanlorenz deanlorenz left a comment

Collaborator

  • please add unit tests.
  • please resolve conflicts after rebase to main.

NIT

  • cheapestVariantCapacity does not break ties, so the chosen variant may depend on iteration order. This may be harder to debug and test.
  • if the current variant is capped, then the cheapest must cover all remaining demand, which is unlikely to be cheaper than one replica of the current variant.
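One possible way to address the tie-breaking nit, sketched under assumptions: the `variant` type and the `cheapest` helper below are hypothetical and do not reflect the actual signature of cheapestVariantCapacity. The idea is to fall back to a stable secondary key (here the variant name) when costs are equal, so the result does not depend on input order.

```go
package main

import "fmt"

// variant is a hypothetical stand-in for the optimizer's variant type.
type variant struct {
	Name string
	Cost float64
}

// cheapest returns the variant with the lowest Cost, breaking ties by Name
// so the result is the same for any input order. ok is false for empty input.
func cheapest(vs []variant) (best variant, ok bool) {
	if len(vs) == 0 {
		return variant{}, false
	}
	best = vs[0]
	for _, v := range vs[1:] {
		if v.Cost < best.Cost || (v.Cost == best.Cost && v.Name < best.Name) {
			best = v
		}
	}
	return best, true
}

func main() {
	a := []variant{{Name: "v2", Cost: 5}, {Name: "v1", Cost: 5}, {Name: "primary", Cost: 10}}
	b := []variant{{Name: "primary", Cost: 10}, {Name: "v1", Cost: 5}, {Name: "v2", Cost: 5}}

	ca, _ := cheapest(a)
	cb, _ := cheapest(b)
	fmt.Println(ca.Name, cb.Name) // v1 v1
}
```

A deterministic pick like this would also make unit tests for the fallback path easier to write, since the expected variant can be asserted by name.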

@ev-shindin

Collaborator

@biranofer please rebase and address @deanlorenz's review

Development

Successfully merging this pull request may close these issues.

cost_aware_optimizer: greedy efficiency-first allocation overpays when cheaper variant covers overflow at lower absolute cost

3 participants