Lemonade integrates vLLM as an experimental backend for AMD ROCm GPUs on Linux. vLLM brings two core benefits:
- Day-0 model support. vLLM typically supports new transformer architectures within hours of their release on Hugging Face — checkpoints load directly, with no per-architecture porting.
- Concurrency and multi-GPU. Paged-attention KV cache, continuous batching, and chunked prefill scale aggregate throughput with in-flight request count; tensor and pipeline parallelism are supported across multiple GPUs.
Status: experimental. The backend has been validated on gfx1151 (Strix Halo) and gfx1150 (Strix Point). Prebuilt wheels also exist for gfx110X (RDNA3) and gfx120X (RDNA4), but those targets have not been exercised end-to-end yet.
- Platform: Linux only
- Hardware: validated on gfx1151 (Strix Halo) and gfx1150 (Strix Point); prebuilt wheels also exist for gfx110X (RDNA3) and gfx120X (RDNA4)
- Bundle: a self-contained tarball from lemonade-sdk/vllm-rocm with a relocatable Python interpreter, PyTorch (ROCm), the ROCm user-space libs, Triton, and vLLM. No system Python / PyTorch / ROCm install is required on the host.
vLLM on AMD ROCm requires a kernel that exports the CWSR sysfs properties and an amdgpu setup that doesn't shadow the built-in driver. Both are covered with verification commands and fixes on the Kernel Update Required page — that's the canonical reference; the same prerequisites apply to llamacpp:rocm and sd-cpp:rocm-*. Lemonade blocks install of vllm:rocm on systems missing the kernel fix and points users at that page.
lemonade backends install vllm:rocm
Or via HTTP:
curl -X POST http://localhost:13305/api/v1/install \
  -H 'Content-Type: application/json' \
  -d '{"recipe": "vllm", "backend": "rocm"}'
The install fetches a per-GPU-target release (e.g. …-gfx1151, …-gfx1150) from lemonade-sdk/vllm-rocm. The base version is pinned in backend_versions.json; the -{gfx_target} suffix is appended at runtime from SystemInfo::get_rocm_arch(), so a single pin covers all supported architectures.
Models registered with the vllm recipe in server_models.json load automatically on first request. Built-in vllm entries serve the upstream Hugging Face weights as-is in FP16 — there is no quantization step in the load path — so their model IDs carry an explicit -FP16- segment (e.g. Qwen3.5-4B-FP16-vLLM). This mirrors the -Hybrid / -CPU suffix convention used by ryzenai-llm and makes the data type obvious next to llamacpp (GGUF, typically Q4_K_M) and flm (4-bit) entries in the same list. Pointing a user.* vllm registration at a pre-quantized checkpoint (FP8, AWQ, GPTQ, etc.) is still supported.
To register your own:
lemonade pull user.MyModel \
  --checkpoint main Qwen/Qwen3-4B \
  --recipe vllm
Standard OpenAI-compatible endpoints (/v1/chat/completions, /v1/completions) work as usual. Lemonade forwards requests to the vLLM child process, which exposes the engine's own private endpoints (e.g. /metrics, /version) on a backend-only port surfaced via GET /v1/health (backend_url field). These endpoints are useful for observability but are not proxied through Lemonade.
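As a rough illustration of reaching the backend-only endpoints, the sketch below pulls a backend_url value out of a health response. The function name, the sample response, and the base URL http://localhost:13305 in the commented usage are assumptions made for illustration; the real health payload may nest backend_url differently, and a proper JSON parser such as jq would be more robust than this text match.

```shell
# Extract the first backend_url value from a health response read on stdin.
# Text-level sketch only: it does not handle escaped characters in the URL.
backend_url_from_health() {
  grep -o '"backend_url"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | head -n 1 \
    | sed 's/.*"\([^"]*\)"$/\1/'
}

# Hypothetical response fragment, only to show the shape the function expects.
sample='{"status": "ok", "backend_url": "http://127.0.0.1:8001"}'
printf '%s\n' "$sample" | backend_url_from_health

# Against a running server (base URL assumed; adjust the path if
# backend_url already carries a prefix):
#   url=$(curl -s http://localhost:13305/v1/health | backend_url_from_health)
#   curl -s "$url/metrics"
```

Because the backend port is chosen at load time, reading it from the health response on each use is likely safer than hard-coding it.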
Some vLLM model families need extra vllm-server arguments for correct behavior. For example, tool-calling models may need --enable-auto-tool-choice plus a matching --tool-call-parser. Lemonade keeps these built-in family defaults in vllm_model_config.json, separate from server_models.json.
When adding a built-in model with recipe: "vllm", check whether its model family needs vLLM arguments. If it does, update vllm_model_config.json in the same PR as the server_models.json entry.
The config has two layers:
{
  "schema_version": 1,
  "enable_checkpoint_regex_match": true,
  "families": {
    "qwen3.": {
      "match": [
        {
          "checkpoint_regex": "Qwen3\\."
        }
      ],
      "args": "--enable-auto-tool-choice --tool-call-parser qwen3_coder"
    }
  },
  "models": {
    "Qwen3.5-4B-vLLM": {
      "family": "qwen3."
    }
  }
}
- families defines reusable defaults for a model family. Each family can include args and optional match entries.
- models maps Lemonade model names to a family and can also add per-model args.
- checkpoint_regex is matched against the model checkpoint, not only the organization name. This lets one family match checkpoints from different Hugging Face organizations.
- enable_checkpoint_regex_match controls automatic family matching for unlisted models. Set it to false to require explicit entries under models.
- A model entry can set disable_family_match: true to prevent regex family matching for that model while still allowing model-specific args.
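To make the regex matching concrete, here is a small sketch that checks a checkpoint string against the one family rule from the sample config. It assumes the regex is applied as an unanchored search over the full checkpoint string, which is an assumption and not confirmed Lemonade behavior. The function name and the test checkpoints are illustrative. Note that the JSON string "Qwen3\\." decodes to the regex Qwen3\. (a literal dot).

```shell
# Print the family whose checkpoint_regex matches the checkpoint in $1.
# Rules are "family regex" pairs, one per line, mirroring the sample config.
family_for_checkpoint() {
  while read -r family regex; do
    if printf '%s\n' "$1" | grep -Eq -- "$regex"; then
      printf '%s\n' "$family"
      return 0
    fi
  done <<'EOF'
qwen3. Qwen3\.
EOF
  return 1
}

# Matches regardless of the organization prefix.
family_for_checkpoint "Qwen/Qwen3.5-4B"
# No literal dot after "Qwen3", so the rule does not apply.
family_for_checkpoint "Qwen/Qwen3-4B" || echo "no family matched"
```

Under these assumptions, a checkpoint such as Qwen/Qwen3-4B would not pick up the family defaults and would need an explicit entry under models or a broader regex.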
Argument precedence is:
- Family args
- Exact model args
- User-provided vllm_args
Later layers override conflicting earlier flags but keep non-conflicting flags. Binary --flag / --no-flag pairs are resolved like the rest of Lemonade's *_args merge behavior, and repeated generic flags are preserved when the same flag appears multiple times.
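The layering can be sketched as follows. This is not Lemonade's implementation: the function name is hypothetical, and it encodes two assumptions about the rules above, namely that --flag and --no-flag count as the same flag, and that repeated flags are preserved only within the layer that wins. It also splits on whitespace, so quoted values containing spaces are not handled.

```shell
# Merge three arg strings (family, model, user); later layers win on conflicts.
merge_vllm_args() {
  printf '%s\n%s\n%s\n' "$1" "$2" "$3" | awk '
    {
      for (j = 1; j <= NF; j++) {
        if ($j !~ /^--/) continue            # stray value with no flag: skipped
        text = $j
        key = $j
        sub(/=.*/, "", key)                  # --flag=value shares a key with --flag
        sub(/^--(no-)?/, "", key)            # --no-flag shares a key with --flag
        # Attach following non-flag tokens as the values of this flag.
        while (j < NF && $(j + 1) !~ /^--/) { j++; text = text " " $j }
        n++; keys[n] = key; texts[n] = text; layers[n] = NR
        winner[key] = NR                     # highest layer that sets this flag
      }
    }
    END {
      out = ""
      for (e = 1; e <= n; e++)
        if (layers[e] == winner[keys[e]])
          out = out (out == "" ? "" : " ") texts[e]
      print out
    }'
}

# Family defaults from the sample config, no model args, one user override.
merge_vllm_args \
  "--enable-auto-tool-choice --tool-call-parser qwen3_coder" \
  "" \
  "--tool-call-parser hermes --max-num-seqs 128"
```

In this run the user-provided --tool-call-parser replaces the family value, while --enable-auto-tool-choice survives because nothing later conflicts with it.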
Lemonade-managed process arguments cannot be set in this file or in vllm_args: --model, --served-model-name, --host, --port, --max-model-len, --enforce-eager, and --enable-prefix-caching.
Free-form CLI args can be appended to vllm-server via vllm.args:
# Allow more concurrent sequences and turn on prefix caching
lemonade config set vllm.args="--max-num-seqs 128 --enable-prefix-caching"
The flat form (vllm_args=...) is also accepted and maps to the same setting.
- Cold first-load JIT. Loading a new model size triggers a Triton kernel compile. Expect anywhere from 20 s to several minutes the first time you hit a given model+shape; subsequent loads of the same shape are faster because the compiled kernels are cached to disk.
- FP8 first-load is slow on gfx1151. Cold-loading Qwen/Qwen3-4B-FP8 took ~12 minutes in our test, exceeding Lemonade's default wait_for_ready timeout. The engine selects TritonFp8BlockScaledMMKernel and emits "Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal." warnings; that is, no AMD-tuned kernel configs are shipped for this GPU's exact shapes, so vLLM autotunes from defaults. FP16 is the most polished path today; FP8 should improve once AMD ships tuned configs.
- huggingface-hub shadowing. Lemonade launches vllm-server with PYTHONNOUSERSITE=1 so the bundled huggingface_hub is used. If a module-not-found error still appears, ensure ~/.local/lib/python3.12/site-packages/huggingface_hub isn't being injected via PYTHONPATH.
- Long load times can leave orphaned processes if interrupted. If a load times out at the Lemonade level, vLLM's child EngineCore may continue running in the background and hold VRAM until killed. Look for a VLLM::EngineCor process and kill -9 it before retrying.
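One possible way to locate such a leftover process is sketched below. It assumes the name shown above is the kernel's short process name as exposed in /proc/<pid>/comm (which Linux truncates to 15 characters); the function name is hypothetical. The kill step is left as a comment so the listing can be reviewed first.

```shell
# Print the PIDs of processes whose short name starts with VLLM::EngineCor.
find_engine_core_pids() {
  for f in /proc/[0-9]*/comm; do
    [ -r "$f" ] || continue
    # A process can exit between the glob and the read, so tolerate failures.
    name=$(cat "$f" 2>/dev/null) || continue
    case "$name" in
      VLLM::EngineCor*)
        pid=${f#/proc/}
        printf '%s\n' "${pid%/comm}"
        ;;
    esac
  done
}

# List any leftovers; prints nothing when none are running.
find_engine_core_pids

# After confirming the PIDs belong to the stuck load:
#   find_engine_core_pids | xargs -r kill -9
```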