RTX 50-series (Blackwell) on this launcher

Single landing page for everything Blackwell. If you have an RTX 5060, 5070, 5080, or 5090 on Windows and want to run Qwen3.6-27B natively (no WSL, no Docker), this is the page.

TL;DR

  1. Download qwen3.6-windows-server-portable-x64-blackwell.zip from the Releases page. Do not use the default zip; it is for 30/40-series only.

  2. Make sure your NVIDIA driver is 596 or newer (CUDA 13 is required). nvidia-smi shows the driver version.

  3. Extract anywhere, double-click start.bat, pick a 5090 snapshot:

    • rtx5090_nvfp4 (200k ctx, NVFP4 weights, default since v1.3.0, escapes the 170W prefill ceiling, ~5x faster prefill than AutoRound). Requires the NVFP4 weights at D:\models\Qwen3.6-27B-NVFP4 (or set VLLM_NVFP4_MODEL_DIR); the launcher does not auto-download these yet.
    • rtx5090_nvfp4_vision (180k ctx, same NVFP4 weights, vision/video input enabled, experimental). The Peutlefaire NVFP4 quant preserves the unquantized visual tower, so vision is a flag flip on the same weights. Trades ~20k ctx vs the text twin to cover the vision encoder + per-image KV.

    As of v1.3.7, NVFP4 is the only supported 5090 path. The AutoRound INT4 5090 snapshots (rtx5090, rtx5090_max) were removed because they cannot escape the 170W prefill ceiling on consumer Blackwell (see SM120_GDN_CEILING.md for the investigation).

That's it. The launcher autodetects the bundled wheel as a CUDA 13 build and installs the right torch index (cu130) plus a runtime shim on first boot. No CUDA Toolkit install required.
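The driver check in step 2 can be scripted. The following is a minimal sketch, not the launcher's own preflight: the helper names `driver_ok` and `query_driver_versions` are hypothetical, and the only facts taken from this page are the 596 minimum and the use of nvidia-smi.

```python
import subprocess

MIN_DRIVER_MAJOR = 596  # minimum driver stated on this page


def driver_ok(version: str, minimum: int = MIN_DRIVER_MAJOR) -> bool:
    """Return True if the major part of a string like '596.36' meets the minimum."""
    major = int(version.strip().split(".")[0])
    return major >= minimum


def query_driver_versions() -> list:
    """Ask nvidia-smi for the driver version of each GPU (one line per GPU)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]
```

On a machine with an NVIDIA driver installed, `all(driver_ok(v) for v in query_driver_versions())` should tell you whether the Blackwell zip's driver requirement is met.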

Why two zips

The Ampere/Ada zip qwen3.6-windows-server-portable-x64-ampere.zip ships vllm-0.19.0+devnen.3 against CUDA 12.6 / PyTorch cu126. That torch build has no sm_120 kernels, so on Blackwell it boots cleanly to the first torch.zeros call and dies with cudaErrorNoKernelImageForDevice. There is no wheel-side workaround.

The Blackwell zip ships vllm-0.20.0+cu132.devnen.2 against CUDA 13.2 / PyTorch cu130, which has sm_120 kernels. Same launcher, same snapshots, different wheel.
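If you are unsure which wheel a venv actually contains, you can ask torch which architectures it was compiled for. This is a minimal sketch, assuming `torch.cuda.get_arch_list()` reports the compiled-in targets; the helper names are hypothetical and the launcher's own detection may work differently.

```python
def has_sm120(arch_list) -> bool:
    """True if the given torch arch list contains native sm_120 kernels."""
    return "sm_120" in arch_list


def installed_torch_arches() -> list:
    """Return the arch list of the torch build in this venv, or [] if unavailable."""
    try:
        import torch  # imported lazily so the helper also works without torch
    except ImportError:
        return []
    return list(torch.cuda.get_arch_list())
```

Run `has_sm120(installed_torch_arches())` inside the launcher's venv. A `False` result on a Blackwell card is consistent with the cudaErrorNoKernelImageForDevice failure described above.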

We keep them as separate releases because forcing every existing 30/40-series user to install a CUDA 13 driver is a breaking change for installs that work today. The Blackwell zip will run on Ampere and Ada too if the host has a 596+ driver, but the default zip is the recommended path for non-Blackwell users.

What's verified

End-to-end on a single RTX 5090 (driver 596.36, sm_120, 32 GB):

| Check | Result |
| --- | --- |
| Wheel boots, model loads | yes; ~17 s to load 17 GB AutoRound INT4, ~25 s for 20 GB NVFP4 |
| /v1/chat/completions, /v1/messages, /v1/responses | yes |
| Marlin sm_120 + AutoRound INT4 | works; Marlin selects MarlinLinearKernel for GPTQMarlinLinearMethod on first load. The scalar_types.int4 Marlin sm_120 bug from older vLLM versions is fixed in 0.20.0. |
| FlashInfer fp4_gemm + NVFP4 (compressed-tensors) | works; autotuner selects sm_120 native FP4 tactics. First boot is slow (~9 min, autotuning per GEMM shape); subsequent boots cache picks. |
| TP=1 + MTP n=6 | works on both AutoRound and NVFP4 |
| AutoRound decode tok/s (historical, no longer shipped) | 158.1 tok/s on the removed rtx5090 AutoRound snapshot (ctx 240k, MTP n=6, mem_util 0.95, 200-token completion, median of 3 runs at 575W). Long-prompt 24k decode 107.8 tok/s, 24k prefill 3,100-3,300 tok/s. AutoRound prefill on long unique-word prompts was power-capped at ~170W on consumer Blackwell, which is why these snapshots are no longer offered (see SM120_GDN_CEILING.md). |
| NVFP4 prefill tok/s | ~7,460 tok/s @ 24k and ~5,300 tok/s @ 47k on rtx5090_nvfp4 (full TDP, ~580W, peak mem-BW 35%). 5–7x AutoRound. Sidesteps the 170W ceiling by routing FFN GEMMs through FlashInfer's sm_120 native FP4 tensor cores. |
| NVFP4 decode tok/s | ~92 tok/s (300-token completion, MTP n=6). +25% vs AutoRound on the same hardware. |
| NVFP4 long-ctx coherence + needle retrieval | PASS at 50k / 100k / 177k tokens (both needles retrieved at every depth). |
| NVFP4 MTP acceptance | 81.9% @ 50k ctx (4.91/6 avg), 73.9% @ 150k ctx (4.43/6 avg), well above the 50% degraded-weight threshold. |
| NVFP4 vs AutoRound coding parity | Tied 12/12 on a 12-problem HumanEval-style slice (small slice, see caveat in SM120_GDN_CEILING.md; confirms no catastrophic regression, does not prove long-tail parity). |
| CUDA 13 toolkit on host | not required. The launcher copies torch's bundled cudart64_13.dll, cublas64_13.dll, etc. from venv\Lib\site-packages\torch\lib\ into a writable cuda13_shim\bin\ and points CUDA_PATH there so flashinfer's import-time CDLL succeeds. |

What's NOT yet validated on Blackwell

These rows in SPEC_DECODE_MATRIX.md are 0.19-era and have not been re-tested on the 0.20 wheel that ships in the Blackwell zip:

  • PP=2 + MTP: was blocked on 0.19 with Qwen3NextMTP / SupportsPP NotImplementedError. May or may not be fixed on 0.20.
  • PP=2 + ngram: was blocked on 0.19 with the missing drafter attribute. May or may not be fixed.
  • TP=2 numbers: 0.19 was ~7.5 tok/s (unusable) because we ran the CPU-relay patch. 0.20 ships NCCL on Windows (experimental), which removes the CPU-relay floor, so TP=2 numbers may be very different.
  • MTP-on-Blackwell tuning: the 0.19 sweep peaked at MTP n=6 ctx 90k for 64.5 tok/s on a 3090. The 5090 has more memory bandwidth and a different cudagraph profile; the optimum almost certainly shifts. Bench yours and post numbers.
  • Async scheduling: on by default in 0.20.0 (was opt-in on 0.19). The current decode numbers were taken with the default; cross-zip comparisons (Ampere zip vs Blackwell zip on a 3090) need a fresh bench.

If you have a 2× 5090 box or a 5090 + 3090 box, please boot the Blackwell zip on it, run the pp2_160k snapshot, and post numbers.

Dashboard auto-grouping

Since v1.2.4 the launcher detects the host GPU at startup, prints a banner with the architecture, and groups the snapshot cards by arch. On a Blackwell box the rtx5090_nvfp4 and rtx5090_nvfp4_vision cards float to the top under a blue Recommended for your Blackwell GPU header, and the 3090-era cards drop below under a neutral header. Each card gets a [Blackwell] or [Ampere/Ada] chip and a colored top border so you can tell at a glance which build a snapshot targets.

configs.yaml carries the truth via an optional arch: key per entry; the rtx5090* snapshots are tagged explicitly. Existing user snapshots keep working with no edits: the heuristic falls back to ampere for anything that doesn't start with rtx5090.
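The tagging rule can be sketched in a few lines. This is an illustration of the rule as described, not the launcher's code: the function names are hypothetical, and the literal value "blackwell" for tagged entries is an assumption (only the ampere fallback is stated on this page).

```python
def snapshot_arch(name: str, entry: dict) -> str:
    """Explicit arch: key wins; otherwise fall back on the snapshot name prefix."""
    explicit = entry.get("arch")
    if explicit:
        return str(explicit).lower()
    # Heuristic from this page: rtx5090* is Blackwell, everything else is ampere.
    return "blackwell" if name.startswith("rtx5090") else "ampere"


def group_for_host(snapshots: dict, host_arch: str):
    """Split snapshot names into (recommended for this host, everything else)."""
    recommended, others = [], []
    for name, entry in snapshots.items():
        bucket = recommended if snapshot_arch(name, entry) == host_arch else others
        bucket.append(name)
    return recommended, others
```

A custom snapshot with no arch: key and a name such as my_custom lands in the ampere group, which matches the fallback described above.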

The 5090 snapshots

As of v1.3.7 the Blackwell zip ships two single-card 5090 snapshots, both GPU0, attention backend TRITON_ATTN, KV dtype fp8_e4m3, with a randomised --data-parallel-rpc-port (see "RPC port leak" below) and no VLLM_ATTENTION_BACKEND env var (deprecated in 0.20.0; the CLI flag still works). The text snapshot listens on port 5001; the vision snapshot listens on 5004 so you can run it alongside the text snapshot if you want both endpoints live at once:

rtx5090_nvfp4 is the new default since v1.3.0. It uses NVFP4 weights (Peutlefaire/Qwen3.6-27B-NVFP4, ~20.6 GB) and routes FFN/QKV/proj GEMMs through FlashInfer's sm_120 native FP4 tensor cores. This bypasses the 170W prefill ceiling that AutoRound INT4 hits on consumer Blackwell. The GDN linear-attention layers are still slow for their 10/40 layer share, but FFN dominates prefill FLOPs and is now firing at full 575W. Set VLLM_NVFP4_MODEL_DIR to point at the weights directory; default location is D:\models\Qwen3.6-27B-NVFP4. First boot is slow (~9 min while FlashInfer JIT-compiles fp4_gemm kernels and autotunes tactics per GEMM shape). Every subsequent boot still re-runs the Tuning fp4_gemm progress bars, but the compiled kernel binaries are cached under ~/.cache/flashinfer/ so only tactic selection re-runs and the boot finishes in ~1-2 min. The progress bars on every boot are expected and unavoidable. See SM120_GDN_CEILING.md for the full prefill-ceiling investigation, validation matrix, and the eval-slice caveat on quality vs AutoRound.

rtx5090_nvfp4_vision enables image and video input on the same NVFP4 weights. The Peutlefaire quant deliberately keeps the visual tower unquantized (the recipe's ignore list exempts re:visual.* / re:model.visual.* from quantization), so the vision encoder is already on disk. Loading it is a CLI flag flip, not a different model. VRAM cost vs the text twin: ~2 GiB of unquantized vision-tower weights stay resident, plus a 16,384-token encoder cache for image features. Context is dropped from 200k to 180k to absorb that. Measured at boot: Available KV cache memory: 8.75 GiB, GPU KV cache size: 66,912 tokens, Maximum concurrency for 180,000 tokens per request: 1.22x. Listens on port 5004 (the text NVFP4 default is 5001). Cold first boot takes longer than the text twin (FlashInfer autotunes the vision-tower GEMM shapes too; expect 12-15 min vs ~9 min). Every subsequent boot still shows the Tuning fp4_gemm progress bars but finishes in ~1-2 min because the compiled kernel binaries are cached. --limit-mm-per-prompt defaults to image=4, video=1; lower that if you OOM at boot.

Bench 2026-05-06 (v1.2.3, 575W power cap, --no-enable-prefix-caching shipped from v1.2.2, median of 3 × 200-token short runs). v1.2.5 re-enables prefix caching (vLLM PR #25752 / Mamba2 APC in the wheel auto-applies mamba_cache_mode='align'); see the v1.2.5 notes below the table.

| Snapshot | ctx | MTP n | mem_util | Short/300-tok decode | 24k prefill | 47k prefill | Use it when |
| --- | --- | --- | --- | --- | --- | --- | --- |
| rtx5090_nvfp4 | 200k | 6 | 0.95 | ~92 tok/s | ~7,460 tok/s @ 580W | ~5,300 tok/s @ 584W | Default since v1.3.0. Full TDP prefill via FlashInfer sm_120 native FP4. ~5x AutoRound prefill, ~25% faster decode. Quality tied with AutoRound on a small coding slice (see eval caveat). |
| rtx5090_nvfp4_vision | 180k | 6 | 0.95 | not yet benched | not yet benched | not yet benched | Same NVFP4 weights with the visual tower loaded. Image + video input via --limit-mm-per-prompt={"image":4,"video":1}. Listens on port 5004. KV pool at 180k: 66,912 tokens / max-concurrency 1.22x. Status: experimental until a multimodal smoke test is wired into the coherence harness. |

Historical (removed in v1.3.7): the AutoRound INT4 snapshots rtx5090 (240k, MTP n=6) and rtx5090_max (280k, MTP n=3) hit 158.1 and 154.3 tok/s short decode respectively, with 3,100-3,300 tok/s prefill capped at the ~170W consumer-Blackwell ceiling on long unique-word prompts. NVFP4 beats both on prefill by 5-7x at full TDP, so the AutoRound snapshots are no longer shipped for 5090.

(Earlier 500W baseline was 124.9 / 138.0 short decode. The 500W → 575W cap lift adds ~20–30% short decode and ~10–20% long-prompt decode.)

v1.2.5 update: prefix caching is back on, with a large prefill speedup. vLLM PR #25752 (Mamba2 Automatic Prefix Caching) shipped upstream and is in the wheel; with prefix caching enabled, vLLM auto-applies mamba_cache_mode='align' for Qwen3_5 so the v1.2.2-era #17140 stepwise decode regression no longer reproduces. The re-bench below was taken on the (now-removed) AutoRound rtx5090 snapshot at ctx 240k, MTP n=6, batch 4128, 575W and is kept here as historical reference for AutoRound prefix-caching behavior:

| Prompt size | OFF (v1.2.2-v1.2.4) | ON (v1.2.5) | Speedup |
| --- | --- | --- | --- |
| 12 k prefill | 686 tok/s | 2147 tok/s | 3.1x |
| 16 k prefill | 559 tok/s | 2034 tok/s | 3.6x |
| 24 k prefill | timeout | 4333 tok/s | unblocks |
| 32 k prefill | timeout | 2479 tok/s | unblocks |
| 48 k prefill | timeout | 2444 tok/s | unblocks |
| KV pool | 79,968 tokens | 94,656 | +18 % |
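The Speedup column is a plain ratio of the ON and OFF figures, and the KV pool gain is a relative increase. A small sketch of that arithmetic, using only the numbers from the table (the function names are illustrative):

```python
def speedup(off_tok_s: float, on_tok_s: float) -> float:
    """Prefill speedup as ON throughput divided by OFF throughput."""
    return on_tok_s / off_tok_s


def relative_gain(before: float, after: float) -> float:
    """Relative increase, e.g. 0.18 for +18 %."""
    return (after - before) / before


print(f"12k: {speedup(686, 2147):.1f}x")                 # 12k: 3.1x
print(f"16k: {speedup(559, 2034):.1f}x")                 # 16k: 3.6x
print(f"KV pool: +{relative_gain(79_968, 94_656):.0%}")  # KV pool: +18%
```

The 24k, 32k, and 48k rows have no ratio because the OFF runs timed out, so there is no baseline to divide by.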

windows_tools/repro_17140.py (3 short → 24k hit → 3 short → 24k hit → 3 short) shows decode drift of +2.6 % then +2.3 % across the two long hits; the 130 → 90 → 40 cliff does not recur. Warm-prefix TTFT on a 24 k re-hit drops from ~42 s cold to ~1.6 s. Short-prompt headline decode is in the ~125 tok/s range under bench.py methodology; the 158 tok/s table number above came from the v1.2.3 prefix-caching-off baseline and will be re-measured under v1.2.5 conditions in a follow-up.

Both 5090 snapshots beat every 3090 snapshot on context size at the same MTP n.

Profile history. v1.2.0-v1.2.2 also shipped rtx5090_speed (120k, MTP n=6) as the headline "speed" config. The 575W re-bench showed it tied the AutoRound rtx5090 snapshot on short decode (158 vs 158), was slower on long decode (103 vs 108), and had a reproducible long-prompt prefill regression (~343 tok/s vs ~3,200 for the other two). Same MTP + chunked-prefill flags, cause not root-caused. Removed in v1.2.3 because it offered no inference advantage. The AutoRound rtx5090 and rtx5090_max snapshots themselves were then removed in v1.3.7 once NVFP4 was the validated default, since neither AutoRound snapshot could escape the 170W prefill ceiling on consumer Blackwell.

If you want a different combo, press e on the dashboard to open the snapshot editor. Duplicate either snapshot, edit it, and save. The launcher rewrites both the YAML and the .py for you.

Driver and toolkit requirements

| Component | Requirement |
| --- | --- |
| GPU | RTX 5060 / 5070 / 5080 / 5090 (any sm_120) |
| NVIDIA driver | 596 or newer |
| CUDA Toolkit | not required on the host |
| Visual Studio Build Tools | optional (small flashinfer-sampler decode boost; otherwise the launcher uses the PyTorch fallback sampler) |
| Disk | ~10 GB for the launcher install + ~17 GB for the model weights |
| RAM | 16 GB+ recommended |
| Windows | 10 22H2 or 11 |

The launcher refuses to boot on a Blackwell GPU with the cu126 wheel (it surfaces a preflight error pointing at this page) and refuses to boot on a non-Blackwell GPU with a forced cu126 install via --variant ampere if the host driver is too old.

Switching from the default zip to the Blackwell zip

If you already have the default zip extracted and want to switch without losing your models\, logs\, or custom snapshots:

update.bat --variant blackwell

See UPGRADING.md for the full updater story.

CUDA env hygiene (v1.3.2+)

The portable launcher cannot assume anything about what the user has installed system-wide. The four user environment classes are:

  1. Drivers only: most inference users. NVIDIA driver 596+, no CUDA Toolkit, no conda. The bundled cu13 shim and torch's bundled DLLs handle everything. This was always the design target.
  2. System CUDA 12.x: devs who installed CUDA 12.x for some other tool. The dangerous case: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\bin\ on PATH ahead of the cu13 shim makes FlashInfer log Failed to get device capability: SM 12.x requires CUDA >= 12.9 and silently fall back, then bake slow fp4_gemm tactics into vLLM's AOT compile cache. See internal forensic notes for the exact reproduction.
  3. System CUDA 13.x: devs with a CUDA 13 Toolkit installed. Worked accidentally before v1.3.2; scrubbed for uniformity now.
  4. Conda / Mamba with cudatoolkit: DLLs live under <env>/Library/bin and shadow the shim the same way (2) does.

v1.3.2 hardens against all four by replacing the additive cuda_env() helper with clean_cuda_env() for the rtx5090_nvfp4 snapshot. It builds the subprocess environment from scratch:

  • Drops every CUDA_* / NVCC_* / NVTOOLSEXT* / CUDNN_* key inherited from the host (any one can poison FlashInfer's runtime probe).
  • Filters PATH to remove every NVIDIA-toolkit dir and known conda Library/bin path.
  • Prepends the bundled cuda13_shim/bin and pins CUDA_PATH / CUDA_HOME at it.
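The three steps above can be sketched as one function. This is a hedged illustration, not the shipped clean_cuda_env(): the function name, the PATH substring markers, and the assumption that the variable is keyed as PATH are all mine.

```python
import os

# Key prefixes named on this page as unsafe to inherit from the host.
DROP_PREFIXES = ("CUDA_", "NVCC_", "NVTOOLSEXT", "CUDNN_")
# Hypothetical substring markers for toolkit and conda directories on PATH.
PATH_MARKERS = ("nvidia gpu computing toolkit", "\\library\\bin", "/library/bin")


def clean_cuda_env_sketch(host_env: dict, shim_dir: str, pathsep: str = os.pathsep) -> dict:
    """Build a subprocess env with host CUDA state removed and the shim first on PATH."""
    # 1. Drop inherited CUDA-related keys.
    env = {k: v for k, v in host_env.items() if not k.upper().startswith(DROP_PREFIXES)}
    # 2. Filter toolkit and conda Library/bin directories out of PATH.
    kept = [
        p for p in env.get("PATH", "").split(pathsep)
        if p and not any(marker in p.lower() for marker in PATH_MARKERS)
    ]
    # 3. Prepend the shim and pin CUDA_PATH / CUDA_HOME at it.
    env["PATH"] = pathsep.join([os.path.join(shim_dir, "bin"), *kept])
    env["CUDA_PATH"] = shim_dir
    env["CUDA_HOME"] = shim_dir
    return env
```

Building the environment from a filtered copy rather than patching os.environ in place means nothing the host exports can leak through by accident, which is the point of the v1.3.2 change.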

A second layer, preflight_sm120a_or_die(), spawns a 5-second subprocess under the cleaned env before warmup starts. It calls flashinfer.utils.is_sm120a_supported(cuda:0) and hard-exits with a diagnostic if it returns False, instead of letting the launcher run degraded for 11 minutes before you notice. The error message points at this doc and at windows_tools/wipe_caches.py.
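A preflight of this shape can be sketched with a short-lived child process. This is an assumption-laden illustration rather than the real preflight_sm120a_or_die(): the probe string, the reading of "5-second" as a timeout, and the argument passed to is_sm120a_supported are guesses.

```python
import subprocess
import sys

# Probe run in the child. The function name comes from this page; passing
# torch.device('cuda:0') is an assumption about how it is called.
PROBE = (
    "import torch, flashinfer.utils as u; "
    "raise SystemExit(0 if u.is_sm120a_supported(torch.device('cuda:0')) else 1)"
)


def preflight_ok(env=None, probe: str = PROBE, timeout_s: float = 5.0) -> bool:
    """Run the probe in a child process under `env`; True only on exit code 0."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", probe],
            env=env, timeout=timeout_s, capture_output=True, text=True,
        )
    except subprocess.TimeoutExpired:
        return False  # treat a hung probe as a failed preflight
    return result.returncode == 0
```

Running the probe in a separate process matters here: the check has to see the cleaned environment, not whatever the launcher process itself inherited.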

Recovering from a poisoned cache

If you upgraded to v1.3.2 from an older release and saw slow prefill in the past, the pre-v1.3.2 boot may have already poisoned vLLM's compile cache under a polluted env. The fingerprint:

  • rtx5090_nvfp4 prefill ~750 tok/s on a 47 k prompt (validated baseline is ~5,300 tok/s @ 580 W).
  • nvidia-smi dmon shows SM 100 %, mem-BW ≈ 0 %, power 200–270 W during prompt processing.
  • [preflight WARN] vLLM compile cache was populated under a different env printed at boot.

Recovery is one command:

python windows_tools\wipe_caches.py

The script backs up and wipes the four caches that, when poisoned, cause this fingerprint: ~/.cache/vllm/, %LOCALAPPDATA%\Temp\torchinductor_*, ~/.cache/torch/, ~/.cache/flashinfer/. It defaults to move-to-.bak.<timestamp> for forensic comparison. Use --no-backup for outright deletion or --dry-run to preview. Cold rebuild on next boot takes ~11–25 min depending on what was wiped.
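The backup-then-wipe behavior for a single cache directory might look like the sketch below. It mirrors the three modes described here (backup, no backup, dry run) but is not the source of wipe_caches.py; the function name, return strings, and timestamp format are hypothetical.

```python
import shutil
import time
from pathlib import Path


def wipe_cache(path: Path, backup: bool = True, dry_run: bool = False, stamp=None) -> str:
    """Move a cache directory to <name>.bak.<stamp>, delete it, or just report."""
    if not path.exists():
        return "missing"
    if dry_run:
        return "would-wipe"  # preview only, nothing is touched
    if backup:
        stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
        target = path.with_name(f"{path.name}.bak.{stamp}")
        shutil.move(str(path), str(target))
        return f"moved to {target.name}"
    shutil.rmtree(path)  # outright deletion, no forensic copy kept
    return "deleted"
```

Moving rather than deleting keeps the poisoned cache available for comparison against the rebuilt one, at the cost of disk space until you remove the .bak directory yourself.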

The cache_env_stamp_check() preflight writes ~/.cache/vllm/.env_stamp.json on a clean boot and warns loudly on mismatch, so wheel upgrades and env-pollution incidents both surface the recommendation to wipe.
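The stamp idea is a write-once record compared on later boots. A minimal sketch, assuming the stamp is a flat JSON object; the fields you would record (wheel version, CUDA_PATH, and so on) and the function name are hypothetical, since this page does not list what the real stamp contains.

```python
import json
from pathlib import Path


def check_env_stamp(stamp_path: Path, current: dict) -> str:
    """Write the stamp on first use; afterwards report 'match' or 'mismatch'."""
    if not stamp_path.exists():
        stamp_path.parent.mkdir(parents=True, exist_ok=True)
        stamp_path.write_text(json.dumps(current, sort_keys=True, indent=2))
        return "written"
    recorded = json.loads(stamp_path.read_text())
    return "match" if recorded == current else "mismatch"
```

A "mismatch" result is the case where the launcher would print the warning and recommend wipe_caches.py.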

Expected first-boot autotune time

FlashInfer's fp4_gemm autotuner runs once per unique GEMM shape encountered during CUDA-graph capture and warmup. With MTP n=6, chunked prefill, max_num_batched_tokens=4128, ctx 200 k there are a lot of shapes. Expect:

  • Warm cache (subsequent boots): 3–8 min.
  • Cold FlashInfer cache only: ~9 min.
  • After wipe_caches.py (vLLM + torch + torchinductor + flashinfer all wiped): 11–25 min.

You'll see many [AutoTuner]: Tuning fp4_gemm: 0/13 → 13/13 cycles. That's expected, not a loop. Each cycle is a different shape. The [Autotuner]: Skipped 6 unsupported tactic(s) for fp4_gemm line is also normal: those six are TMA-WS kernels gated on compute_120f, not shipped in the bundled FlashInfer cubin pack. The validated 5,300 tok/s prefill is achieved without them. See SM120_GDN_CEILING.md for the research on that knob.

Wait for Application startup complete and Uvicorn running on http://0.0.0.0:5001. Do not kill the snapshot mid-autotune; interrupting forces the cycle to start over on the next boot.

The CUDA 13 runtime shim, in detail

flashinfer 0.4.x does an absolute-path CDLL of cudart64_13.dll at import time. NVIDIA's CUDA 13 toolkit ships that DLL in <CUDA_PATH>\bin\, but installing the toolkit just to satisfy a CDLL call would be silly when torch already bundles every CUDA 13 runtime DLL it needs.

What the launcher does on first boot:

  1. Looks at venv\Lib\site-packages\torch\lib\ and finds cudart64_13.dll, cublas64_13.dll, cublasLt64_13.dll, cudnn64_*.dll, etc.
  2. Creates cuda13_shim\bin\ next to the launcher and copies the DLLs there.
  3. Sets CUDA_PATH=<install>\cuda13_shim\ for the snapshot subprocess.
  4. flashinfer's import-time CDLL succeeds.

This is idempotent (runs every boot, skips DLLs already present), cheap (~5 MiB of file copies), and doesn't touch your system CUDA install. If you ever delete cuda13_shim\, the next launcher boot recreates it from torch's lib\.
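The copy step can be sketched as an idempotent directory sync. This is an illustration under assumptions, not the launcher's implementation: the function name is hypothetical, copying every *.dll is a simplification, and the size comparison used to decide "already present" is my choice.

```python
import shutil
from pathlib import Path


def sync_shim(torch_lib: Path, shim_bin: Path, pattern: str = "*.dll") -> list:
    """Copy DLLs from torch's lib dir into the shim; return the names copied."""
    shim_bin.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(torch_lib.glob(pattern)):
        dst = shim_bin / src.name
        # Skip files that are already present with the same size.
        if dst.exists() and dst.stat().st_size == src.stat().st_size:
            continue
        shutil.copy2(src, dst)
        copied.append(src.name)
    return copied
```

Calling it twice in a row copies on the first call and returns an empty list on the second, which is the behavior described above: safe to run on every boot, and self-healing if the shim directory is deleted.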

The 29550 RPC port leak

vLLM 0.20.0 hardcodes data_parallel_rpc_port=29550 in its ParallelConfig default. When an engine-core child orphans (parent crash, ctrl-C during boot), the port stays held until the orphan is killed. Back-to-back snapshot launches then deterministically fail with Address in use 127.0.0.1:29550.

The shipped snapshots in this project pass --data-parallel-rpc-port=<random> via a helper in snapshots/_common.py, so you won't hit this on the launcher path.
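If you want the same protection in your own scripts, one common approach is to let the OS hand out an unused port. The sketch below is not the helper from snapshots/_common.py (its contents are not shown on this page); it swaps in bind-to-port-0 as the selection technique, and both function names are hypothetical.

```python
import socket


def pick_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]


def rpc_port_args() -> list:
    """Build the CLI argument that overrides vLLM's fixed RPC port default."""
    return [f"--data-parallel-rpc-port={pick_free_port()}"]
```

There is a small race between releasing the probe socket and vLLM binding the port, but it avoids the deterministic collision with an orphan that is still holding 29550.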

If you're invoking vLLM directly without the helper:

netstat -ano -p tcp | Select-String ":29550"
taskkill /F /PID <pid>
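The same lookup can be done from Python by parsing netstat output. A minimal sketch, assuming the usual `netstat -ano` column layout (protocol, local address, foreign address, state, PID); the function name is hypothetical.

```python
def pids_on_port(netstat_output: str, port: int) -> set:
    """Return the PIDs whose local TCP address ends in :<port>."""
    pids = set()
    suffix = f":{port}"
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: Proto, Local Address, Foreign Address, State, PID
        if len(fields) >= 5 and fields[0].upper() == "TCP" and fields[1].endswith(suffix):
            if fields[-1].isdigit():
                pids.add(int(fields[-1]))
    return pids
```

Feed it the text captured from `netstat -ano -p tcp`, then pass each returned PID to taskkill as shown above.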

VLLM_ATTENTION_BACKEND is gone in 0.20

On 0.19.0, the env var was silently ignored and the CLI flag was load-bearing. On 0.20.0, the env var is genuinely unrecognised and emits a one-line Unknown vLLM environment variable: VLLM_ATTENTION_BACKEND warning. The CLI flag (--attention-backend=TRITON_ATTN) is the right answer either way.

The shipped rtx5090_nvfp4* snapshots do not set the env var. The other snapshots (carried over from the 0.19 path) still set it; on the Blackwell zip those produce the warning above and otherwise behave identically. To silence the warning, drop the env-var line from the snapshot or rebuild it via the in-TUI editor.

When to keep using WSL2 / the Blackwell guide instead

The community vllm-blackwell-guide has reported up to 120 tok/s on 27B and 200 tok/s on the 35B MoE on a 5090, on tuned upstream vLLM running in WSL2. That stack:

  • ships pure-upstream vLLM (this project's wheel is mostly upstream with a small reasoning-parser tweak and a wildcard model name; features like NVFP4 KV that landed in upstream after 0.20 are not in our wheel),
  • pays the WSL2 tax on the GPU (one community measurement: 85 tok/s in WSL vs 160 tok/s in native Ubuntu).

Pick that route if you specifically need an upstream feature that isn't in our 0.20 base, or if you're already comfortable in Linux and would rather pay the WSL tax. Pick this launcher if you want native Windows decode with Anthropic-API-compatible serving and the shipped tool-calling fixes.

Reporting Blackwell numbers

If you bench the Blackwell zip, please post:

  • GPU model + driver version (nvidia-smi --query-gpu=name,driver_version --format=csv)
  • Snapshot id (rtx5090_nvfp4, rtx5090_nvfp4_vision, or your custom one)
  • Output of windows_tools\check_coherence.py --port 5001 (decode tok/s without coherence is meaningless)
  • Output of windows_tools\bench_summarize.py (a single TSV row with prefill / decode / TTFT)
  • The Maximum concurrency for X tokens per request: Y.YYx line from logs\vllm_server.5001.log (tells us how much KV headroom you had)
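To pull the concurrency line out of a log without scrolling, a regex over the log text is enough. This sketch assumes the line has exactly the wording quoted above; the function name is hypothetical.

```python
import re

_CONCURRENCY = re.compile(
    r"Maximum concurrency for ([\d,]+) tokens per request: ([\d.]+)x"
)


def parse_concurrency(log_text: str):
    """Return (tokens_per_request, concurrency) from the last matching line, or None."""
    matches = _CONCURRENCY.findall(log_text)
    if not matches:
        return None
    tokens, factor = matches[-1]  # the last match reflects the most recent boot
    return int(tokens.replace(",", "")), float(factor)
```

For example, reading logs\vllm_server.5001.log and passing its contents to `parse_concurrency` gives you the pair to paste into your report.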

A Reddit reply, a GitHub issue, or a PR with a new launcher\configs.yaml row are all welcome.

Related docs