A wrong KV-dtype, wrong quantization flag, or a corrupted shard will let
vLLM cheerfully report 60+ tok/s while emitting * * * * or
the the the the. The number is right; the output is garbage. Always
validate coherence before you trust a TPS number, yours, ours, or
anyone's.
Run windows_tools/check_coherence.py
against any running server:
python windows_tools\check_coherence.py --port 5001Three prompts get sent at three different generation lengths:
- Capital of France (200 tok), sanity. Expect one short sentence
ending with
Paris.andfinish_reason=stop. - 300-word Whiskers cat / rooftop garden story (700 tok), long-form narrative. Tests whether KV cache stays coherent past a few hundred tokens.
- Iterative Fibonacci with docstring (500 tok), code generation. Different distribution than prose; catches issues that prose hides.
Exit code:
0, all three coherent. Good.1, at least one degenerate. Don't ship this config.2, server unreachable. Check--port.
The script prints the first 200 chars of each response so you can eyeball the output yourself.
These are the patterns the validator looks for. If you see any of them in production, you have a coherence problem regardless of what TPS says:
| Pattern | What it means |
|---|---|
* * * * * * * |
Attention collapsed to a single high-prob token |
the the the the |
Same with a different collapse point |
a a a a a a |
Same |
**:**:**:** |
Delimiter loop, often after a markdown table |
\n\n\n\n\n\n\n |
Format-token loop |
| Coherent for 30 tokens then "the the the" mid-sentence | KV cache aged past attention-quant coherence limit |
- Wrong
--quantizationflag for the model. AutoRound checkpoint loaded withawq_marlinwill load fine and produce gibberish. - KV-dtype too aggressive.
fp8_e5m2is rejected by TRITON_ATTN (only fp8 / fp8_e4m3 accepted on Windows). 3-bit / TurboQuant KV is not in the 0.19.0 wheel. - Shard corruption. Run
windows_tools/verify_model_sha.pyto check every shard against HuggingFace'sx-linked-etag. One bad shard = consistent local SHA, garbage output. max_tokensruns out before</think>when using--reasoning-parser qwen3. The whole response ends up in thereasoningfield withcontent="", looks degenerate but is just truncated reasoning. Raisemax_tokensor append/no_thinkto the prompt or drop the parser flag.- MTP head missing from quant. Qwen3.6-27B AutoRound from Lorbus
keeps
mtp.fcin BF16. Other quants either OOM allocating a fresh BF16 head or quantize it to INT4 and the loader silently skips it → 0 % draft acceptance, no speedup, output usually still coherent. SeeMTP_HEAD.md.
A safe rollback config is:
TP=1, PP=1, MTP=off, --kv-cache-dtype=auto (BF16),
--enforce-eager, ctx=8000
This is slow but it should always be coherent on the Lorbus AutoRound quant. If even that produces gibberish, the model file is corrupt, not a vLLM problem.
When MTP is enabled, vLLM logs per-window stats:
Spec decode metrics: draft_acceptance_rate=0.62, system_efficiency=1.85
draft_acceptance_rate, what fraction of the speculated tokens were accepted. Should be 0.4–0.7 for prose, 0.2–0.5 for dense code.system_efficiency, net throughput multiplier vs no spec-decode. Should be > 1.2 to be worth running.
If system_efficiency < 1.0, MTP is costing you tokens. Either drop
N or disable spec-decode entirely:
Get-Content logs\vllm_server.5001.log -Wait | Select-String 'draft_acceptance|system_efficiency'