Benchmarking and tuning helpers for llama-server on local inference machines.
This repo grew out of a practical question: when running the same Qwen model on the same custom-built llama-server, do alternative KV cache strategies improve throughput, reduce memory pressure, or make larger context windows more practical?
The initial surprise was that the compressed/turbo KV path did not obviously win at moderate prompt sizes. On the tested M4 Pro machine, f16 KV remained faster, while the compressed path used somewhat less memory. That shifted the focus from "is it faster?" to the more useful question: "at what context length does the memory tradeoff become worth it?"
- `scripts/benchmark_qwen.sh`: Single-endpoint and two-endpoint benchmark helper for OpenAI-compatible `llama-server` endpoints.
- `scripts/benchmark_qwen_stepwise.sh`: One-at-a-time stepped benchmark runner for comparing two launcher scripts without endpoint contention.
- `scripts/tune_qwen_server.sh`: Sweeps `--ctx-size`, `--batch-size`, and `--ubatch-size`, health-checks each config, benchmarks stable ones, and writes machine-readable plus markdown summaries. It can also derive prompt size from a target fraction of each tested context window.
- `scripts/compare_tuning_runs.sh`: Compares two tuning summary TSV files and writes a side-by-side markdown report.
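As a rough picture of what a sweep over `--ctx-size`, `--batch-size`, and `--ubatch-size` with a context fill ratio involves, the sketch below enumerates a grid and derives a prompt size for each context window. It is not taken from `scripts/tune_qwen_server.sh`: the function names `prompt_tokens_for` and `sweep_grid`, the tab-separated output, and the choice to truncate the prompt size to a whole token count are all assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Illustrative sketch only; not the repo's actual sweep logic.

# Hypothetical helper: prompt tokens for a context window and fill ratio,
# truncated to an integer (for example 32768 at 0.80 gives 26214).
prompt_tokens_for() {
  local ctx="$1" ratio="$2"
  awk -v c="$ctx" -v r="$ratio" 'BEGIN { printf "%d\n", c * r }'
}

# Hypothetical helper: print one TSV row per (ctx, batch, ubatch) combination,
# with the derived prompt size as the fourth column.
# Arguments: "ctx list" "batch list" "ubatch list" fill_ratio
sweep_grid() {
  local ctx batch ubatch ratio="$4"
  for ctx in $1; do
    for batch in $2; do
      for ubatch in $3; do
        printf '%s\t%s\t%s\t%s\n' \
          "$ctx" "$batch" "$ubatch" "$(prompt_tokens_for "$ctx" "$ratio")"
      done
    done
  done
}
```

Calling `sweep_grid "32768 65536" "512 1024" "128" 0.80` prints four rows, one per configuration. Scaling the prompt with the context window keeps each config under comparable relative load instead of testing a large context with a prompt that barely touches it.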
Running two large servers side by side makes it too easy to measure contention instead of model behavior. The same trap shows up on any shared local inference machine. The later scripts in this repo therefore run a single server at a time: wait for it to become healthy, benchmark it, stop it, and only then move to the next config.
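The start, wait, benchmark, stop cycle described above can be sketched roughly as follows. This is a minimal illustration, not the repo's implementation: `wait_until`, `probe_health`, `bench_one`, and the `HEALTH_URL`, `POLL_INTERVAL`, and `MAX_POLLS` variables are hypothetical names, and the sketch assumes the launcher script runs `llama-server` in the foreground (for example via `exec`) so that killing the launcher's PID stops the server. It relies on the `/health` endpoint of `llama-server`, which returns a non-success status until the model has loaded.

```bash
#!/usr/bin/env bash
# Illustrative sketch only; names and defaults are assumptions.

HEALTH_URL="${HEALTH_URL:-http://127.0.0.1:8080/health}"
POLL_INTERVAL="${POLL_INTERVAL:-2}"   # seconds between health probes
MAX_POLLS="${MAX_POLLS:-60}"          # give up after this many probes

# Retry a probe command until it succeeds or the poll budget runs out.
wait_until() {
  local polls="$1"; shift
  local i
  for ((i = 0; i < polls; i++)); do
    "$@" >/dev/null 2>&1 && return 0
    sleep "$POLL_INTERVAL"
  done
  return 1
}

# Default probe: curl -f fails on HTTP errors such as 503 while loading.
probe_health() {
  curl -fsS "$HEALTH_URL"
}

# Start one launcher, wait for health, run the benchmark command, stop the server.
# Usage: bench_one LAUNCHER BENCH_CMD [ARGS...]
bench_one() {
  local launcher="$1"; shift
  "$launcher" &
  local pid=$!
  local status=0
  if wait_until "$MAX_POLLS" probe_health; then
    "$@" || status=$?
  else
    echo "unhealthy: $launcher" >&2
    status=1
  fi
  # Always stop the server before the caller moves on to the next config.
  kill "$pid" 2>/dev/null || true
  wait "$pid" 2>/dev/null || true
  return "$status"
}
```

A caller would then invoke `bench_one` once per launcher in sequence, so at most one server is resident while any measurement is taken. A config that never becomes healthy is reported and skipped instead of blocking the run.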
- Build or point at your desired `llama-server` binary.
- Tune one KV strategy:

      ./scripts/tune_qwen_server.sh \
        --label f16 \
        --cache-type-k f16 \
        --cache-type-v f16

- Tune the other KV strategy:

      ./scripts/tune_qwen_server.sh \
        --label turbo \
        --cache-type-k turbo3 \
        --cache-type-v turbo3 \
        --ctx-fill-ratio 0.80

- Compare the resulting summaries:

      ./scripts/compare_tuning_runs.sh \
        --left tuning-results/turbo-YYYYMMDD-HHMMSS/turbo-summary.tsv \
        --right tuning-results/f16-YYYYMMDD-HHMMSS/f16-summary.tsv \
        --left-label turbo \
        --right-label f16

The published Apple Silicon result bundle is under `results/apple-silicon`. Start with the comparison writeups and max-context summary:
- Pair 1 comparison
- Pair 2 comparison
- Turbo max-context summary
- Source provenance
- Architecture-scoped plots
- Fastest published `f16` config: `ctx=131072`, `batch=1024`, `ubatch=128` at `34.39` completion tok/s with `49928.8 MiB` average RSS.
- Fastest published turbo config: `ctx=98304`, `batch=1024`, `ubatch=128` at `29.18` completion tok/s with `47306.0 MiB` average RSS.
- Most memory-efficient published `f16` config: `ctx=32768`, `batch=512`, `ubatch=128` at `47612.7 MiB` average RSS.
- Most memory-efficient published turbo config: `ctx=32768`, `batch=1024`, `ubatch=128` at `47003.7 MiB` average RSS.
- Published max-context turbo sweep remained healthy through `ctx=245760`; the fastest max-context setting there was `ctx=196608`, `batch=512`, `ubatch=128` at `5.16` completion tok/s with `48713.7 MiB` average RSS.
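To put the published figures above in relative terms, a small arithmetic sketch can help. The helpers `pct_delta` and `abs_delta` are hypothetical names introduced here. The compared configs differ in `ctx` and `batch`, so treat the output as a rough gauge, not a controlled comparison.

```bash
#!/usr/bin/env bash
# Illustrative arithmetic on the published numbers; helper names are assumptions.

# Percentage change from baseline A to value B, two decimals.
pct_delta() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", (b - a) / a * 100 }'
}

# Absolute change from baseline A to value B, one decimal.
abs_delta() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", b - a }'
}

# Completion tok/s: fastest f16 config vs fastest turbo config.
pct_delta 34.39 29.18

# Average RSS in MiB at ctx=32768: most memory-efficient f16 vs turbo.
abs_delta 47612.7 47003.7
pct_delta 47612.7 47003.7
```

On these figures the fastest turbo config delivers about 15% fewer completion tokens per second than the fastest `f16` config, while at `ctx=32768` the turbo path saves about 609 MiB of average RSS, a little over 1%.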
- If your active `llama-server` binary does not support a given cache type, the tuning sweep will fail health checks quickly. That is still useful because it tells you the config is not viable with that binary.
- Throughput at small or moderate prompt sizes can be a misleading proxy for long-context behavior.
- The most interesting comparisons are often at progressively larger prompt sizes, where memory pressure, checkpointing, and stability become the dominant factors.
This repository is licensed under GPL-3.0.
Before publishing results, it helps to include:
- exact `llama-server` commit
- model name and quant
- machine specs
- whether prompt cache was enabled or disabled
- whether runs were one-at-a-time or concurrent
- prompt size strategy
- failure modes, not just the fastest successful run
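One way to make the checklist above hard to forget is to write a small metadata record next to each result set. The sketch below is an assumption about how that could look, not something this repo ships: `run_metadata` and its field names are hypothetical, and it takes the first line printed by `llama-server --version` as the build identifier, which may need adjusting if your build prints device information first. Failure modes still need to be written up by hand.

```bash
#!/usr/bin/env bash
# Illustrative sketch only; function and field names are assumptions.

# Print a run-metadata record as "key: value" lines.
# Usage: run_metadata SERVER_BIN MODEL PROMPT_CACHE CONCURRENCY PROMPT_STRATEGY
run_metadata() {
  local server_bin="$1" model="$2" prompt_cache="$3"
  local concurrency="$4" prompt_strategy="$5"
  local version="unknown"
  # Record the build string only if the binary can actually be found.
  if command -v "$server_bin" >/dev/null 2>&1; then
    version="$("$server_bin" --version 2>&1 | head -n 1)"
  fi
  printf 'llama_server_version: %s\n' "$version"
  printf 'model: %s\n' "$model"
  printf 'machine: %s\n' "$(uname -srm)"
  printf 'prompt_cache: %s\n' "$prompt_cache"
  printf 'concurrency: %s\n' "$concurrency"
  printf 'prompt_strategy: %s\n' "$prompt_strategy"
}
```

For example, `run_metadata ./llama-server "<model and quant>" disabled one-at-a-time "ctx-fill-ratio=0.80" > run-metadata.txt` stores the record alongside the summaries. `uname -srm` only captures the OS and architecture, so add chip and memory details manually.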