Single landing page for everything Blackwell. If you have an RTX 5060, 5070, 5080, or 5090 on Windows and want to run Qwen3.6-27B natively (no WSL, no Docker), this is the page.
1. Download `qwen3.6-windows-server-portable-x64-blackwell.zip` from the Releases page. Not the default zip, the default is for 30/40-series only.
2. Make sure your NVIDIA driver is 596 or newer (CUDA 13 is required). `nvidia-smi` shows the driver version.
3. Extract anywhere, double-click `start.bat`, pick a 5090 snapshot:
   - `rtx5090_nvfp4` (200k ctx, NVFP4 weights, default since v1.3.0, escapes the 170W prefill ceiling, ~5x faster prefill than AutoRound). Requires the NVFP4 weights at `D:\models\Qwen3.6-27B-NVFP4` (or set `VLLM_NVFP4_MODEL_DIR`); the launcher does not auto-download these yet.
   - `rtx5090_nvfp4_vision` (180k ctx, same NVFP4 weights, vision/video input enabled, experimental). The Peutlefaire NVFP4 quant preserves the unquantized visual tower, so vision is a flag flip on the same weights. Trades ~20k ctx vs the text twin to cover the vision encoder + per-image KV.

As of v1.3.7, NVFP4 is the only supported 5090 path. The AutoRound INT4 5090 snapshots (`rtx5090`, `rtx5090_max`) were removed because they cannot escape the 170W prefill ceiling on consumer Blackwell (see `SM120_GDN_CEILING.md` for the investigation).
That's it. The launcher autodetects the bundled wheel as a CUDA 13 build and installs the right torch index (cu130) plus a runtime shim on first boot. No CUDA Toolkit install required.
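The driver requirement from the steps above can be checked from a script. This is a minimal sketch, not launcher code: the helper names `parse_driver_rows` and `meets_minimum` are made up for illustration, and it assumes `nvidia-smi` is on PATH. The query flags are the same ones this page asks for in bench reports.

```python
import csv
import io
import subprocess

MIN_DRIVER_MAJOR = 596  # minimum driver stated on this page


def parse_driver_rows(csv_text: str) -> list[tuple[str, int]]:
    """Parse `nvidia-smi --query-gpu=name,driver_version --format=csv`
    output into (gpu_name, driver_major) pairs."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    rows = [row for row in reader if row]
    result = []
    for name, version in rows[1:]:  # rows[0] is the CSV header line
        result.append((name.strip(), int(version.strip().split(".")[0])))
    return result


def meets_minimum(csv_text: str, minimum: int = MIN_DRIVER_MAJOR) -> bool:
    """True only if at least one GPU is listed and all meet the minimum."""
    rows = parse_driver_rows(csv_text)
    return bool(rows) and all(major >= minimum for _, major in rows)


def query_nvidia_smi() -> str:
    """Run nvidia-smi and return its CSV output (needs an NVIDIA driver)."""
    return subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv"],
        check=True, capture_output=True, text=True,
    ).stdout
```

On a machine with an NVIDIA driver installed, `meets_minimum(query_nvidia_smi())` returns `False` when any listed GPU reports a driver older than 596.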
The Ampere/Ada zip qwen3.6-windows-server-portable-x64-ampere.zip ships
vllm-0.19.0+devnen.3 against CUDA 12.6 / PyTorch cu126. That torch
build has no sm_120 kernels, so on Blackwell it boots cleanly to the
first torch.zeros call and dies with
cudaErrorNoKernelImageForDevice. There is no wheel-side workaround.
The Blackwell zip ships vllm-0.20.0+cu132.devnen.2 against CUDA 13.2
/ PyTorch cu130, which has sm_120 kernels. Same launcher, same
snapshots, different wheel.
We keep them as separate releases because forcing every existing 30/40-series user to install a CUDA 13 driver is a breaking change for installs that work today. The Blackwell zip will run on Ampere and Ada too if the host has a 596+ driver, but the default zip is the recommended path for non-Blackwell users.
End-to-end on a single RTX 5090 (driver 596.36, sm_120, 32 GB):
| Check | Result |
|---|---|
| Wheel boots, model loads | yes; ~17 s to load 17 GB AutoRound INT4, ~25 s for 20 GB NVFP4 |
| /v1/chat/completions, /v1/messages, /v1/responses | yes |
| Marlin sm_120 + AutoRound INT4 | works; Marlin selects MarlinLinearKernel for GPTQMarlinLinearMethod on first load. The scalar_types.int4 Marlin sm_120 bug from older vLLM versions is fixed in 0.20.0. |
| FlashInfer fp4_gemm + NVFP4 (compressed-tensors) | works; autotuner selects sm_120 native FP4 tactics. First boot is slow (~9 min, autotuning per GEMM shape); subsequent boots cache picks. |
| TP=1 + MTP n=6 | works on both AutoRound and NVFP4 |
| AutoRound decode tok/s (historical, no longer shipped) | 158.1 tok/s on the removed rtx5090 AutoRound snapshot (ctx 240k, MTP n=6, mem_util 0.95, 200-token completion, median of 3 runs at 575W). Long-prompt 24k decode 107.8 tok/s, 24k prefill 3,100-3,300 tok/s. AutoRound prefill on long unique-word prompts was power-capped at ~170W on consumer Blackwell, which is why these snapshots are no longer offered (see SM120_GDN_CEILING.md). |
| NVFP4 prefill tok/s | ~7,460 tok/s @ 24k and ~5,300 tok/s @ 47k on rtx5090_nvfp4 (full TDP, ~580W, peak mem-BW 35%). 5–7x AutoRound. Sidesteps the 170W ceiling by routing FFN GEMMs through FlashInfer's sm_120 native FP4 tensor cores. |
| NVFP4 decode tok/s | ~92 tok/s (300-token completion, MTP n=6). +25% vs AutoRound on the same hardware. |
| NVFP4 long-ctx coherence + needle retrieval | PASS at 50k / 100k / 177k tokens (both needles retrieved at every depth). |
| NVFP4 MTP acceptance | 81.9% @ 50k ctx (4.91/6 avg), 73.9% @ 150k ctx (4.43/6 avg), well above the 50% degraded-weight threshold. |
| NVFP4 vs AutoRound coding parity | Tied 12/12 on a 12-problem HumanEval-style slice (small slice, see caveat in SM120_GDN_CEILING.md; confirms no catastrophic regression, does not prove long-tail parity). |
| CUDA 13 toolkit on host | not required. The launcher copies torch's bundled cudart64_13.dll, cublas64_13.dll, etc. from venv\Lib\site-packages\torch\lib\ into a writable cuda13_shim\bin\ and points CUDA_PATH there so flashinfer's import-time CDLL succeeds. |
These rows in
SPEC_DECODE_MATRIX.md are 0.19-era and have
not been re-tested on the 0.20 wheel that ships in the Blackwell zip:
- PP=2 + MTP: was blocked on 0.19 with `Qwen3NextMTP`/`SupportsPP` `NotImplementedError`. May or may not be fixed on 0.20.
- PP=2 + ngram: was blocked on 0.19 with the missing `drafter` attribute. May or may not be fixed.
- TP=2 numbers: 0.19 was ~7.5 tok/s (unusable) because we ran the CPU-relay patch. 0.20 ships NCCL on Windows (experimental), which removes the CPU-relay floor, so TP=2 numbers may be very different.
- MTP-on-Blackwell tuning: the 0.19 sweep peaked at MTP n=6 ctx 90k for 64.5 tok/s on a 3090. The 5090 has more memory bandwidth and a different cudagraph profile; the optimum almost certainly shifts. Bench yours and post numbers.
- Async scheduling: on by default in 0.20.0 (was opt-in on 0.19). The current decode numbers were taken with the default; cross-zip comparisons (Ampere zip vs Blackwell zip on a 3090) need a fresh bench.
If you have a 2× 5090 box or a 5090 + 3090 box, please boot the
Blackwell zip on it, run the pp2_160k snapshot, and post numbers.
Since v1.2.4 the launcher detects the host GPU at startup, prints a
banner with the architecture, and groups the snapshot cards by arch.
On a Blackwell box the rtx5090_nvfp4 and rtx5090_nvfp4_vision
cards float to the top under a blue Recommended for your Blackwell GPU header, and the 3090-era cards drop below under a neutral header.
Each card gets a [Blackwell] or [Ampere/Ada] chip and a colored top
border so you can tell at a glance which build a snapshot targets.
configs.yaml carries the truth via an optional arch: key per
entry; the rtx5090* snapshots are tagged explicitly. Existing user
snapshots keep working with no edits: when the key is absent, the
heuristic falls back to ampere for anything that doesn't start with
rtx5090.
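
The fallback rule can be sketched in a few lines. This is an illustration under stated assumptions, not the launcher's code: the function names are hypothetical, entries are assumed to be already parsed from `configs.yaml` into dicts, and the `"blackwell"` value for the `arch:` key is a guess (only the `ampere` fallback is documented here).

```python
def snapshot_arch(snapshot_id: str, entry: dict) -> str:
    """Return the target architecture for one snapshot entry.

    An explicit `arch:` key wins. Without it, ids starting with
    "rtx5090" count as Blackwell and everything else falls back to Ampere.
    """
    explicit = entry.get("arch")
    if explicit:
        return str(explicit).lower()
    return "blackwell" if snapshot_id.startswith("rtx5090") else "ampere"


def group_by_arch(entries: dict[str, dict], host_arch: str) -> list[str]:
    """Order snapshot ids so those matching the host GPU come first.

    sorted() is stable, so ids keep their original order within each group.
    """
    return sorted(
        entries,
        key=lambda sid: snapshot_arch(sid, entries[sid]) != host_arch,
    )
```

An untagged user snapshot named `my_custom` resolves to `ampere` under this rule, while an untagged `rtx5090_custom` resolves to `blackwell`.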
As of v1.3.7 the Blackwell zip ships two single-card 5090 snapshots,
both GPU0, attention backend TRITON_ATTN, KV dtype fp8_e4m3, with a
randomised --data-parallel-rpc-port (see "RPC port leak" below) and
no VLLM_ATTENTION_BACKEND env var (deprecated in 0.20.0; the CLI
flag still works). The text snapshot listens on port 5001; the vision
snapshot listens on 5004 so you can run it alongside the text snapshot
if you want both endpoints live at once:
rtx5090_nvfp4 is the new default since v1.3.0. It uses NVFP4
weights (Peutlefaire/Qwen3.6-27B-NVFP4,
~20.6 GB) and routes FFN/QKV/proj GEMMs through FlashInfer's sm_120
native FP4 tensor cores. This bypasses the 170W prefill ceiling that
AutoRound INT4 hits on consumer Blackwell. The GDN linear-attention
layers are still slow for their 10/40 layer share, but FFN dominates
prefill FLOPs and is now firing at full 575W. Set VLLM_NVFP4_MODEL_DIR
to point at the weights directory; default location is
D:\models\Qwen3.6-27B-NVFP4. First boot is slow (~9 min while
FlashInfer JIT-compiles fp4_gemm kernels and autotunes tactics per
GEMM shape). Every subsequent boot still re-runs the Tuning fp4_gemm
progress bars, but the compiled kernel binaries are cached under
~/.cache/flashinfer/ so only tactic selection re-runs and the boot
finishes in ~1-2 min. The progress bars on every boot are expected
and unavoidable. See
SM120_GDN_CEILING.md for the full
prefill-ceiling investigation, validation matrix, and the eval-slice
caveat on quality vs AutoRound.
rtx5090_nvfp4_vision enables image and video input on the same
NVFP4 weights. The Peutlefaire quant deliberately keeps the visual
tower unquantized (the recipe's ignore list excludes
re:visual.* / re:model.visual.*), so the vision encoder is already
on disk; loading it is a CLI flag flip, not a different model. VRAM
cost vs the text twin: ~2 GiB of unquantized vision-tower weights stay
resident, plus a 16,384-token encoder cache for image features. Context
is dropped 200k -> 180k to absorb that. Measured at boot:
Available KV cache memory: 8.75 GiB, GPU KV cache size: 66,912 tokens, Maximum concurrency for 180,000 tokens per request: 1.22x.
Listens on port 5004 (the text NVFP4 default is 5001). Cold first boot
takes longer than the text twin (FlashInfer autotunes the vision-tower
GEMM shapes too; expect 12-15 min vs ~9 min). Every subsequent boot
still shows the Tuning fp4_gemm progress bars but finishes in ~1-2
min because the compiled kernel binaries are cached.
--limit-mm-per-prompt defaults to image=4, video=1; lower that if
you OOM at boot.
Bench 2026-05-06 (v1.2.3, 575W power cap, --no-enable-prefix-caching shipped from v1.2.2, median of 3 × 200-token short runs). v1.2.5 re-enables prefix caching (vLLM PR #25752 / Mamba2 APC in the wheel auto-applies mamba_cache_mode='align'); see the v1.2.5 notes below the table.
| Snapshot | ctx | MTP n | mem_util | Short/300-tok decode | 24k prefill | 47k prefill | Use it when |
|---|---|---|---|---|---|---|---|
| rtx5090_nvfp4 | 200k | 6 | 0.95 | ~92 tok/s | ~7,460 tok/s @ 580W | ~5,300 tok/s @ 584W | Default since v1.3.0. Full TDP prefill via FlashInfer sm_120 native FP4. ~5x AutoRound prefill, ~25% faster decode. Quality tied with AutoRound on a small coding slice (see eval caveat). |
| rtx5090_nvfp4_vision | 180k | 6 | 0.95 | not yet benched | not yet benched | not yet benched | Same NVFP4 weights with the visual tower loaded. Image + video input via --limit-mm-per-prompt={"image":4,"video":1}. Listens on port 5004. KV pool at 180k: 66,912 tokens / max-concurrency 1.22x. Status: experimental until a multimodal smoke test is wired into the coherence harness. |
Historical (removed in v1.3.7): the AutoRound INT4 snapshots rtx5090
(240k, MTP n=6) and rtx5090_max (280k, MTP n=3) hit 158.1 and 154.3
tok/s short decode respectively, with 3,100-3,300 tok/s prefill capped
at the ~170W consumer-Blackwell ceiling on long unique-word prompts.
NVFP4 beats both on prefill by 5-7x at full TDP, so the AutoRound
snapshots are no longer shipped for 5090.
(Earlier 500W baseline was 124.9 / 138.0 short decode. The 500W → 575W cap lift adds ~20–30% short decode and ~10–20% long-prompt decode.)
v1.2.5 update: prefix caching back on, large prefill speedup. vLLM PR
#25752 (Mamba2 Automatic
Prefix Caching) shipped upstream and is in the wheel; with prefix caching
enabled, vLLM auto-applies mamba_cache_mode='align' for Qwen3_5 so the
v1.2.2-era #17140 stepwise decode regression no longer reproduces. The
re-bench below was taken on the (now-removed) AutoRound rtx5090
snapshot at ctx 240k, MTP n=6, batch 4128, 575W and is kept here as
historical reference for AutoRound prefix-caching behavior:
| Prompt size | OFF (v1.2.2-v1.2.4) | ON (v1.2.5) | Speedup |
|---|---|---|---|
| 12 k prefill | 686 tok/s | 2147 tok/s | 3.1x |
| 16 k prefill | 559 tok/s | 2034 tok/s | 3.6x |
| 24 k prefill | timeout | 4333 tok/s | unblocks |
| 32 k prefill | timeout | 2479 tok/s | unblocks |
| 48 k prefill | timeout | 2444 tok/s | unblocks |
| KV pool | 79,968 tokens | 94,656 | +18 % |
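
The Speedup column and the KV pool gain follow directly from the measured values. A short arithmetic check, using only numbers copied from the table above (the timeout rows have no OFF baseline, so they have no ratio):

```python
# (prefix caching OFF tok/s, ON tok/s), copied from the table above
prefill = {"12k": (686, 2147), "16k": (559, 2034)}
kv_pool_off, kv_pool_on = 79_968, 94_656  # KV pool size in tokens

# Speedup is ON throughput divided by OFF throughput.
speedups = {size: round(on / off, 1) for size, (off, on) in prefill.items()}

# KV pool gain as a percentage of the OFF pool size.
kv_gain_pct = round((kv_pool_on / kv_pool_off - 1) * 100)

print(speedups)     # {'12k': 3.1, '16k': 3.6}
print(kv_gain_pct)  # 18
```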
windows_tools/repro_17140.py (3 short → 24k hit → 3 short → 24k hit → 3
short) shows decode drift of +2.6 % then +2.3 % across the two long hits;
the 130 → 90 → 40 cliff does not recur. Warm-prefix TTFT on a 24 k re-hit
drops from ~42 s cold to ~1.6 s. Short-prompt headline decode is in the
~125 tok/s range under bench.py methodology; the 158 tok/s table number
above came from the v1.2.3 prefix-caching-off baseline and will be
re-measured under v1.2.5 conditions in a follow-up.
Both beat every 3090 snapshot on context size at the same MTP n.
Profile history. v1.2.0-v1.2.2 also shipped rtx5090_speed
(120k, MTP n=6) as the headline "speed" config. The 575W re-bench
showed it tied the AutoRound rtx5090 snapshot on short decode (158
vs 158), was slower on long decode (103 vs 108), and had a
reproducible long-prompt prefill regression (~343 tok/s vs ~3,200 for
the other two). Same MTP + chunked-prefill flags, cause not
root-caused. Removed in v1.2.3 because it offered no inference
advantage. The AutoRound rtx5090 and rtx5090_max snapshots
themselves were then removed in v1.3.7 once NVFP4 was the validated
default, since neither AutoRound snapshot could escape the 170W
prefill ceiling on consumer Blackwell.
If you want a different combo, press `e` on the dashboard to open the
snapshot editor. Duplicate either snapshot, edit, save. The launcher
rewrites both the YAML and the .py for you.
| Component | Requirement |
|---|---|
| GPU | RTX 5060 / 5070 / 5080 / 5090 (any sm_120) |
| NVIDIA driver | 596 or newer |
| CUDA Toolkit | not required on the host |
| Visual Studio Build Tools | optional (small flashinfer-sampler decode boost; otherwise the launcher uses the PyTorch fallback sampler) |
| Disk | ~10 GB for the launcher install + ~17 GB for AutoRound INT4 model weights (~20.6 GB for the NVFP4 weights) |
| RAM | 16 GB+ recommended |
| Windows | 10 22H2 or 11 |
The launcher refuses to boot on a Blackwell GPU with the cu126 wheel
(it surfaces a preflight error pointing at this page) and refuses to
boot on a non-Blackwell GPU with a forced cu126 install via
--variant ampere if the host driver is too old.
If you already have the default zip extracted and want to switch
without losing your models\, logs\, custom snapshots:

    update.bat --variant blackwell

See UPGRADING.md for the full updater story.
The portable launcher cannot assume anything about what the user has installed system-wide. The four user environment classes are:
1. Drivers only, most inference users. NVIDIA driver 596+, no CUDA Toolkit, no conda. The bundled cu13 shim and torch's bundled DLLs handle everything. This was always the design target.
2. System CUDA 12.x, devs who installed CUDA 12.x for some other tool. The dangerous case: `C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\bin\` on PATH ahead of the cu13 shim makes FlashInfer log `Failed to get device capability: SM 12.x requires CUDA >= 12.9` and silently fall back, then bake slow fp4_gemm tactics into vLLM's AOT compile cache. See internal forensic notes for the exact reproduction.
3. System CUDA 13.x, devs with a CUDA 13 Toolkit installed. Worked accidentally before v1.3.2; scrubbed for uniformity now.
4. Conda / Mamba with `cudatoolkit`, DLLs live under `<env>/Library/bin` and shadow the shim the same way (2) does.

v1.3.2 hardens against all four by replacing the additive
cuda_env() helper with clean_cuda_env() for the rtx5090_nvfp4
snapshot. It builds the subprocess environment from scratch:
1. Drops every `CUDA_*` / `NVCC_*` / `NVTOOLSEXT*` / `CUDNN_*` key inherited from the host (any one can poison FlashInfer's runtime probe).
2. Filters PATH to remove every NVIDIA-toolkit dir and known conda `Library/bin` path.
3. Prepends the bundled `cuda13_shim/bin` and pins `CUDA_PATH` / `CUDA_HOME` at it.

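
The three steps above can be sketched as a pure function over an environment dict. This is not the launcher's `clean_cuda_env()`: the name `clean_cuda_env_sketch`, the PATH markers used for filtering, and the choice to pin `CUDA_PATH` at the shim root are assumptions made for illustration.

```python
import os

# Key prefixes dropped from the inherited environment (from the list above).
DROP_PREFIXES = ("CUDA_", "NVCC_", "NVTOOLSEXT", "CUDNN_")
# Lower-cased substrings that mark a PATH entry as a toolkit or conda DLL
# directory. Illustrative only; the real filter may match differently.
PATH_BLOCKLIST = ("nvidia gpu computing toolkit", "\\library\\bin", "/library/bin")


def clean_cuda_env_sketch(host_env: dict[str, str], shim_dir: str,
                          sep: str = os.pathsep) -> dict[str, str]:
    """Build a subprocess env that cannot see host CUDA installs."""
    # 1. Drop every inherited CUDA-related key.
    env = {k: v for k, v in host_env.items()
           if not k.upper().startswith(DROP_PREFIXES)}
    # 2. Keep only PATH entries that do not look like a toolkit or conda dir.
    kept = [p for p in host_env.get("PATH", "").split(sep)
            if p and not any(marker in p.lower() for marker in PATH_BLOCKLIST)]
    # 3. Prepend the shim's bin dir and pin CUDA_PATH / CUDA_HOME at the shim.
    shim_bin = shim_dir.rstrip("\\/") + "\\bin"
    env["PATH"] = sep.join([shim_bin] + kept)
    env["CUDA_PATH"] = env["CUDA_HOME"] = shim_dir
    return env
```

Building the dict from scratch rather than patching `os.environ` in place keeps the launcher's own process untouched; the result would be passed as `env=` to `subprocess.Popen` for the snapshot.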
A second layer, preflight_sm120a_or_die(), spawns a 5-second
subprocess under the cleaned env before warmup starts. It calls
flashinfer.utils.is_sm120a_supported(cuda:0) and hard-exits with a
diagnostic if it returns False, instead of letting the launcher run
degraded for 11 minutes before you notice. The error message points
at this doc and at windows_tools/wipe_caches.py.
If you upgraded to v1.3.2 from an older release and saw slow prefill in the past, the pre-v1.3.2 boot may have already poisoned vLLM's compile cache under a polluted env. The fingerprint:
- `rtx5090_nvfp4` prefill ~750 tok/s on a 47 k prompt (validated baseline is ~5,300 tok/s @ 580 W).
- `nvidia-smi dmon` shows SM 100 %, mem-BW ≈ 0 %, power 200–270 W during prompt processing.
- `[preflight WARN] vLLM compile cache was populated under a different env` printed at boot.

Recovery is one command:

    python windows_tools\wipe_caches.py

Backs up and wipes the four caches that, when poisoned, cause this
fingerprint: ~/.cache/vllm/, %LOCALAPPDATA%\Temp\torchinductor_*,
~/.cache/torch/, ~/.cache/flashinfer/. Defaults to
move-to-.bak.<timestamp> for forensic comparison. Use --no-backup
for outright deletion or --dry-run to preview. Cold rebuild on next
boot takes ~11–25 min depending on what was wiped.
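
The move-to-backup default, `--no-backup`, and `--dry-run` behaviors described above can be sketched as follows. This is not the shipped `wipe_caches.py`: the function name, the timestamp format, and the returned action log are assumptions, and the caller is expected to expand the `torchinductor_*` glob before passing directories in.

```python
import shutil
import time
from pathlib import Path


def backup_and_wipe(cache_dirs: list[Path], backup: bool = True,
                    dry_run: bool = False) -> list[str]:
    """Move (or delete) each existing cache dir; return the planned actions."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    actions = []
    for cache in cache_dirs:
        if not cache.exists():
            continue  # nothing to wipe for this cache
        if backup:
            # Rename next to the original so the old cache can be compared later.
            target = cache.with_name(f"{cache.name}.bak.{stamp}")
            actions.append(f"move {cache} -> {target}")
            if not dry_run:
                shutil.move(str(cache), str(target))
        else:
            actions.append(f"delete {cache}")
            if not dry_run:
                shutil.rmtree(cache)
    return actions
```

Calling it with `dry_run=True` only reports what would happen, which mirrors the preview mode described above.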
The cache_env_stamp_check() preflight writes
~/.cache/vllm/.env_stamp.json on a clean boot and warns loudly on
mismatch, so wheel upgrades and env-pollution incidents both surface
the recommendation to wipe.
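
A stamp check of this kind can be sketched as below. The real `cache_env_stamp_check()` is not shown on this page, so the stamp fields (`wheel`, `cuda_path`, `path_sha256`) and both function names are hypothetical; only the idea of writing a stamp on a clean boot and comparing it on later boots comes from the text above.

```python
import hashlib
import json
from pathlib import Path


def env_fingerprint(env: dict[str, str], wheel_version: str) -> dict[str, str]:
    """Summarise inputs that, if changed, should invalidate compiled caches."""
    path_hash = hashlib.sha256(env.get("PATH", "").encode()).hexdigest()
    return {
        "wheel": wheel_version,
        "cuda_path": env.get("CUDA_PATH", ""),
        "path_sha256": path_hash,
    }


def check_env_stamp(stamp_file: Path, current: dict[str, str]) -> bool:
    """Write the stamp on first boot; afterwards return False on mismatch."""
    if not stamp_file.exists():
        stamp_file.parent.mkdir(parents=True, exist_ok=True)
        stamp_file.write_text(json.dumps(current, indent=2))
        return True
    return json.loads(stamp_file.read_text()) == current
```

A `False` result is the point at which a launcher would print the warning and recommend wiping the caches.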
FlashInfer's fp4_gemm autotuner runs once per unique GEMM shape
encountered during CUDA-graph capture and warmup. With MTP n=6,
chunked prefill, max_num_batched_tokens=4128, ctx 200 k there are a
lot of shapes. Expect:
- Warm cache (subsequent boots): 3–8 min.
- Cold FlashInfer cache only: ~9 min.
- After `wipe_caches.py` (vLLM + torch + torchinductor + flashinfer all wiped): 11–25 min.

You'll see many [AutoTuner]: Tuning fp4_gemm: 0/13 → 13/13 cycles.
That's expected, not a loop. Each cycle is a different shape. The
[Autotuner]: Skipped 6 unsupported tactic(s) for fp4_gemm line is
also normal: those six are TMA-WS kernels gated on compute_120f, not
shipped in the bundled FlashInfer cubin pack. The validated 5,300
tok/s prefill is achieved without them. See
SM120_GDN_CEILING.md for the research on
that knob.
Wait for Application startup complete and Uvicorn running on http://0.0.0.0:5001. Do not kill the snapshot mid-autotune;
interrupting forces the cycle to start over on the next boot.
flashinfer 0.4.x does an absolute-path CDLL of cudart64_13.dll at
import time. NVIDIA's CUDA 13 toolkit ships that DLL in
<CUDA_PATH>\bin\, but installing the toolkit just to satisfy a
CDLL call would be silly when torch already bundles every CUDA 13
runtime DLL it needs.
What the launcher does on first boot:
1. Looks at `venv\Lib\site-packages\torch\lib\` and finds `cudart64_13.dll`, `cublas64_13.dll`, `cublasLt64_13.dll`, `cudnn64_*.dll`, etc.
2. Creates `cuda13_shim\bin\` next to the launcher and copies the DLLs there.
3. Sets `CUDA_PATH=<install>\cuda13_shim\` for the snapshot subprocess.
4. flashinfer's import-time `CDLL` succeeds.

This is idempotent (runs every boot, skips DLLs already present),
cheap (~5 MiB of file copies), and doesn't touch your system CUDA
install. If you ever delete cuda13_shim\, the next launcher boot
recreates it from torch's lib\.
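
The copy step can be sketched as an idempotent function. This is an illustration rather than the launcher's implementation: `build_cuda_shim` is a hypothetical name, and the pattern list covers only the DLLs named above (the source's "etc." means the real list is longer).

```python
import shutil
from pathlib import Path

# DLL name patterns named in the steps above; not exhaustive.
DLL_PATTERNS = ("cudart64_13.dll", "cublas64_13.dll",
                "cublasLt64_13.dll", "cudnn64_*.dll")


def build_cuda_shim(torch_lib: Path, shim_root: Path) -> list[str]:
    """Copy torch's bundled CUDA DLLs into <shim_root>/bin.

    Returns the names copied on this call; DLLs already present are skipped,
    so running it on every boot is safe.
    """
    shim_bin = shim_root / "bin"
    shim_bin.mkdir(parents=True, exist_ok=True)
    copied = []
    for pattern in DLL_PATTERNS:
        for dll in sorted(torch_lib.glob(pattern)):
            target = shim_bin / dll.name
            if target.exists():
                continue  # idempotent: skip DLLs already in the shim
            shutil.copy2(dll, target)
            copied.append(dll.name)
    return copied
```

After the copy, the launcher would set `CUDA_PATH` to `shim_root` in the snapshot subprocess environment, as in step 3 above.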
vLLM 0.20.0 hardcodes data_parallel_rpc_port=29550 in its
ParallelConfig default. When an engine-core child orphans (parent
crash, ctrl-C during boot), the port stays held until the orphan is
killed. Back-to-back snapshot launches then deterministically fail
with Address in use 127.0.0.1:29550.
The shipped snapshots in this project pass
--data-parallel-rpc-port=<random> via a helper in
snapshots/_common.py, so you won't hit this on the launcher path.
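
A helper of that kind could look like the sketch below. The actual code in `snapshots/_common.py` is not reproduced here; `pick_free_port` and `rpc_port_arg` are hypothetical names, and asking the OS for a free port by binding to port 0 is one common approach, not necessarily the one the project uses.

```python
import socket


def pick_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


def rpc_port_arg() -> str:
    """Format the vLLM CLI flag with a freshly chosen port."""
    return f"--data-parallel-rpc-port={pick_free_port()}"
```

The port is released before vLLM binds it, so another process could in principle take it in between; for a single-user launcher that window is small, and a retry on `Address in use` covers it.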
If you're invoking vLLM directly without the helper, find and kill the
orphan that holds the port:

    netstat -ano -p tcp | Select-String ":29550"
    taskkill /F /PID <pid>

On 0.19.0, the VLLM_ATTENTION_BACKEND env var was silently ignored and
the CLI flag was load-bearing. On 0.20.0, the env var is genuinely
unrecognised and emits a one-line
Unknown vLLM environment variable: VLLM_ATTENTION_BACKEND warning. The CLI flag
(--attention-backend=TRITON_ATTN) is the right answer either way.
The shipped rtx5090_nvfp4* snapshots do not set the env var. The
other snapshots (carried over from the 0.19 path) still set it; on
the Blackwell zip those produce the warning above and otherwise
behave identically. To silence the warning, drop the env-var line
from the snapshot or rebuild it via the in-TUI editor.
The community vllm-blackwell-guide has reported up to 120 tok/s on 27B and 200 tok/s on the 35B MoE on a 5090, on tuned upstream vLLM running in WSL2. That stack:
- ships pure-upstream vLLM (this project's wheel is mostly upstream with a small reasoning-parser tweak and a wildcard model name; features like NVFP4 KV that landed in upstream after 0.20 are not in our wheel),
- pays the WSL2 tax on the GPU (one community measurement: 85 tok/s in WSL vs 160 tok/s in native Ubuntu).
Pick that route if you specifically need an upstream feature that isn't in our 0.20 base, or if you're already comfortable in Linux and would rather pay the WSL tax. Pick this launcher if you want native Windows decode with Anthropic-API-compatible serving and the shipped tool-calling fixes.
If you bench the Blackwell zip, please post:
- GPU model + driver version (`nvidia-smi --query-gpu=name,driver_version --format=csv`)
- Snapshot id (`rtx5090_nvfp4`, `rtx5090_nvfp4_vision`, or your custom one)
- Output of `windows_tools\check_coherence.py --port 5001` (decode tok/s without coherence is meaningless)
- Output of `windows_tools\bench_summarize.py` (a single TSV row with prefill / decode / TTFT)
- The `Maximum concurrency for X tokens per request: Y.YYx` line from `logs\vllm_server.5001.log` (tells us how much KV headroom you had)

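The concurrency line in the last item can be pulled out of the log with a short script. This is a sketch only: the regex is written against the line format quoted on this page, and `extract_max_concurrency` is a hypothetical helper, not one of the shipped `windows_tools`.

```python
import re
from typing import Optional

CONCURRENCY_RE = re.compile(
    r"Maximum concurrency for ([\d,]+) tokens per request: ([\d.]+)x"
)


def extract_max_concurrency(log_text: str) -> Optional[tuple[int, float]]:
    """Return (tokens_per_request, concurrency) from the last matching line,
    or None if the log has no such line."""
    matches = CONCURRENCY_RE.findall(log_text)
    if not matches:
        return None
    tokens, ratio = matches[-1]  # the most recent boot is last in the log
    return int(tokens.replace(",", "")), float(ratio)
```

To use it, read `logs\vllm_server.5001.log` with `Path.read_text(errors="replace")` and pass the text in.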
A Reddit reply, a GitHub issue, or a PR with a new
launcher\configs.yaml row are all welcome.
- `HARDWARE.md`, full GPU compatibility table.
- `INSTALL.md`, full install procedure including picking the right zip.
- `UPGRADING.md`, in-place updater and variant switching.
- `TROUBLESHOOTING.md`, Blackwell-specific failure rows are at the bottom of the table.
- `SPEC_DECODE_MATRIX.md`, parallelism / spec-decode combos. The 0.19-era results need re-validation on 0.20.
- `HALLUCINATED_FLAGS.md`, the `VLLM_ATTENTION_BACKEND` env-var deprecation note for 0.20.