vLLM Backend Options

Lemonade integrates vLLM as an experimental backend for AMD ROCm GPUs on Linux. vLLM brings two core benefits:

  1. Day-0 model support. vLLM typically supports new transformer architectures within hours of their release on Hugging Face — checkpoints load directly, with no per-architecture porting.
  2. Concurrency and multi-GPU. Paged-attention KV cache, continuous batching, and chunked prefill scale aggregate throughput with in-flight request count; tensor and pipeline parallelism are supported across multiple GPUs.

Status: experimental. The backend has been validated on gfx1151 (Strix Halo) and gfx1150 (Strix Point). Prebuilt wheels also exist for gfx110X (RDNA3) and gfx120X (RDNA4) but those targets have not been exercised end-to-end yet.

Available Backend

ROCm

  • Platform: Linux only
  • Hardware: validated on gfx1151 (Strix Halo) and gfx1150 (Strix Point); prebuilt wheels also exist for gfx110X (RDNA3) and gfx120X (RDNA4)
  • Bundle: a self-contained tarball from lemonade-sdk/vllm-rocm with a relocatable Python interpreter, PyTorch (ROCm), the ROCm user-space libs, Triton, and vLLM. No system Python / PyTorch / ROCm install is required on the host.

Prerequisites

vLLM on AMD ROCm requires a kernel that exports the CWSR sysfs properties and an amdgpu setup that doesn't shadow the built-in driver. Both are covered, with verification commands and fixes, on the Kernel Update Required page, which is the canonical reference. The same prerequisites apply to llamacpp:rocm and sd-cpp:rocm-*. Lemonade blocks installation of vllm:rocm on systems missing the kernel fix and points users to that page.

Install

lemonade backends install vllm:rocm

Or via HTTP:

curl -X POST http://localhost:13305/api/v1/install \
  -H 'Content-Type: application/json' \
  -d '{"recipe": "vllm", "backend": "rocm"}'

The install fetches a per-GPU-target release (e.g. …-gfx1151, …-gfx1150) from lemonade-sdk/vllm-rocm. The base version is pinned in backend_versions.json; the -{gfx_target} suffix is appended at runtime from SystemInfo::get_rocm_arch(), so a single pin covers all supported architectures.
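
As a rough illustration of how one pin can cover every architecture, the release name can be thought of as the pinned base version plus the detected target. This is a minimal sketch, not Lemonade's code: `release_name` and the version string in the test are hypothetical, and the only behavior taken from the text above is the `-{gfx_target}` suffix.

```python
def release_name(pinned_version: str, gfx_target: str) -> str:
    """Append the GPU target suffix to the pinned base version.

    `pinned_version` stands in for the value stored in backend_versions.json
    and `gfx_target` for the result of the runtime architecture lookup;
    both names are hypothetical.
    """
    if not gfx_target.startswith("gfx"):
        raise ValueError(f"unexpected ROCm target: {gfx_target!r}")
    return f"{pinned_version}-{gfx_target}"
```

With this shape, supporting another architecture means publishing another suffixed release rather than adding another pin.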

Use

Models registered with the vllm recipe in server_models.json load automatically on first request. Built-in vllm entries serve the upstream Hugging Face weights as-is in FP16 — there is no quantization step in the load path — so their model IDs carry an explicit -FP16- segment (e.g. Qwen3.5-4B-FP16-vLLM). This mirrors the -Hybrid / -CPU suffix convention used by ryzenai-llm and makes the data type obvious next to llamacpp (GGUF, typically Q4_K_M) and flm (4-bit) entries in the same list. Pointing a user.* vllm registration at a pre-quantized checkpoint (FP8, AWQ, GPTQ, etc.) is still supported.

To register your own:

lemonade pull user.MyModel \
  --checkpoint main Qwen/Qwen3-4B \
  --recipe vllm

Standard OpenAI-compatible endpoints (/v1/chat/completions, /v1/completions) work as usual; Lemonade forwards these requests to the vLLM child process. The child process also exposes the engine's own private endpoints (e.g. /metrics, /version) on a backend-only port, which is reported in the backend_url field of GET /v1/health. These endpoints are useful for observability but are not proxied through Lemonade.
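
To reach the engine's private endpoints, a client first has to read backend_url from the health response. The sketch below shows one way to do that in Python. It makes two assumptions that are not confirmed by this page: the exact position of backend_url inside the health JSON (so the helper searches at any depth), and the base URL you pass to `fetch_health`. The helper names are hypothetical.

```python
import json
import urllib.request
from typing import Any, Iterator
from urllib.parse import urlsplit, urlunsplit


def fetch_health(base_url: str) -> Any:
    """GET <base_url>/v1/health and return the parsed JSON body."""
    with urllib.request.urlopen(base_url.rstrip("/") + "/v1/health") as resp:
        return json.load(resp)


def find_backend_urls(payload: Any) -> Iterator[str]:
    """Yield every string stored under a 'backend_url' key, at any depth."""
    if isinstance(payload, dict):
        for key, value in payload.items():
            if key == "backend_url" and isinstance(value, str):
                yield value
            else:
                yield from find_backend_urls(value)
    elif isinstance(payload, list):
        for item in payload:
            yield from find_backend_urls(item)


def engine_endpoint(backend_url: str, path: str) -> str:
    """Build a URL for an engine endpoint at the root of the backend server."""
    parts = urlsplit(backend_url)
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))
```

For each URL returned by `find_backend_urls(fetch_health(...))`, `engine_endpoint(url, "/metrics")` gives the address to scrape directly, since Lemonade does not proxy it.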

Model-Family Argument Config

Some vLLM model families need extra vllm-server arguments for correct behavior. For example, tool-calling models may need --enable-auto-tool-choice plus a matching --tool-call-parser. Lemonade keeps these built-in family defaults in vllm_model_config.json, separate from server_models.json.

When adding a built-in model with recipe: "vllm", check whether its model family needs vLLM arguments. If it does, update vllm_model_config.json in the same PR as the server_models.json entry.

The config has two layers:

{
  "schema_version": 1,
  "enable_checkpoint_regex_match": true,
  "families": {
    "qwen3.": {
      "match": [
        {
          "checkpoint_regex": "Qwen3\\."
        }
      ],
      "args": "--enable-auto-tool-choice --tool-call-parser qwen3_coder"
    }
  },
  "models": {
    "Qwen3.5-4B-vLLM": {
      "family": "qwen3."
    }
  }
}

  • families defines reusable defaults for a model family. Each family can include args and optional match entries.
  • models maps Lemonade model names to a family and can also add per-model args.
  • checkpoint_regex is matched against the model checkpoint, not only the organization name. This lets one family match checkpoints from different Hugging Face organizations.
  • enable_checkpoint_regex_match controls automatic family matching for unlisted models. Set it to false to require explicit entries under models.
  • A model entry can set disable_family_match: true to prevent regex family matching for that model while still allowing model-specific args.
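
The lookup rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not Lemonade's implementation: it assumes an explicit family under models wins over regex matching, that the regex is applied as an unanchored search, that the first matching family is used, and that regex matching is off when enable_checkpoint_regex_match is absent. The function name and the checkpoint strings in the usage note are illustrative.

```python
import re
from typing import Optional


def resolve_family(config: dict, model_name: str, checkpoint: str) -> Optional[str]:
    """Return the family name for a model, or None if no family applies."""
    entry = config.get("models", {}).get(model_name, {})

    # An explicit mapping under "models" is assumed to take priority.
    if "family" in entry:
        return entry["family"]

    # Automatic matching can be disabled globally or per model.
    if not config.get("enable_checkpoint_regex_match", False):
        return None
    if entry.get("disable_family_match", False):
        return None

    # Assumption: unanchored search, first matching family wins.
    for name, family in config.get("families", {}).items():
        for rule in family.get("match", []):
            if re.search(rule["checkpoint_regex"], checkpoint):
                return name
    return None
```

With the example config, a listed model such as Qwen3.5-4B-vLLM resolves through its models entry, while an unlisted model resolves only if its checkpoint contains `Qwen3.` (for instance a checkpoint string like `Qwen/Qwen3.5-4B`, but not `Qwen/Qwen3-4B`).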

Arguments are applied in the following order, from lowest to highest precedence:

  1. Family args
  2. Exact model args
  3. User-provided vllm_args

Later layers override conflicting flags from earlier layers and keep non-conflicting ones. Binary --flag / --no-flag pairs are resolved the same way as in Lemonade's other *_args merges, and a generic flag that appears multiple times is preserved each time.
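
A simplified model of this layering, written in Python, may help when predicting the final command line. It is a sketch, not Lemonade's merge code, and it makes assumptions: a flag in a later layer replaces every earlier occurrence of that flag (and of its --no- counterpart), repeats within a single layer are kept, and values follow their flag either as separate tokens or as `--flag=value`. The function names are hypothetical.

```python
import shlex
from typing import List, Tuple

FlagGroup = Tuple[str, List[str]]


def parse_flags(args: str) -> List[FlagGroup]:
    """Split an argument string into (flag, values) groups."""
    groups: List[FlagGroup] = []
    for token in shlex.split(args):
        if token.startswith("--"):
            flag, sep, value = token.partition("=")
            groups.append((flag, [value] if sep else []))
        elif groups:
            groups[-1][1].append(token)  # value belonging to the previous flag
        else:
            raise ValueError(f"value without a flag: {token!r}")
    return groups


def canonical(flag: str) -> str:
    """Map --no-foo and --foo to one key so that they conflict."""
    return "--" + flag[5:] if flag.startswith("--no-") else flag


def merge_args(*layers: str) -> str:
    """Merge layers in order; later layers override conflicting flags."""
    merged: List[FlagGroup] = []
    for layer in layers:
        groups = parse_flags(layer)
        overridden = {canonical(flag) for flag, _ in groups}
        merged = [g for g in merged if canonical(g[0]) not in overridden]
        merged.extend(groups)
    tokens: List[str] = []
    for flag, values in merged:
        tokens.append(flag)
        tokens.extend(values)
    return shlex.join(tokens)
```

For example, merging family args `--enable-auto-tool-choice --tool-call-parser qwen3_coder`, empty model args, and user args `--tool-call-parser hermes --max-num-seqs 64` yields `--enable-auto-tool-choice --tool-call-parser hermes --max-num-seqs 64`: the user's parser choice wins and the unrelated family flag survives.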

Lemonade-managed process arguments cannot be set in this file or in vllm_args: --model, --served-model-name, --host, --port, --max-model-len, --enforce-eager, and --enable-prefix-caching.
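
Before saving a vllm_args value, it can be useful to check it against this list. The helper below is a minimal Python sketch of such a pre-check. It only reports which managed flags appear; how Lemonade itself reacts to them (rejecting, ignoring, or overriding) is not specified here, and negated forms such as a --no- variant are not covered. The names `MANAGED_FLAGS` and `managed_flags_in` are hypothetical.

```python
import shlex
from typing import List

# Flags listed above as managed by Lemonade.
MANAGED_FLAGS = frozenset({
    "--model",
    "--served-model-name",
    "--host",
    "--port",
    "--max-model-len",
    "--enforce-eager",
    "--enable-prefix-caching",
})


def managed_flags_in(args: str) -> List[str]:
    """Return the managed flags present in an argument string, in order."""
    found: List[str] = []
    for token in shlex.split(args):
        flag = token.partition("=")[0]  # handle the --flag=value spelling
        if flag in MANAGED_FLAGS and flag not in found:
            found.append(flag)
    return found
```

An empty result means the string contains none of the managed flags.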

Tuning

Free-form CLI args can be appended to vllm-server via vllm.args:

# Allow more concurrent sequences
lemonade config set vllm.args="--max-num-seqs 128"

The flat form (vllm_args=...) is also accepted and maps to the same setting.

Known gotchas

  • Cold first-load JIT. Loading a new model size triggers a Triton kernel compile. Expect anywhere from 20 seconds to several minutes the first time you hit a given model+shape; subsequent loads of the same shape are faster because the compiled kernels are cached to disk.
  • FP8 first-load is slow on gfx1151. Cold-loading Qwen/Qwen3-4B-FP8 took ~12 minutes in our test, exceeding Lemonade's default wait_for_ready timeout. The engine selects TritonFp8BlockScaledMMKernel and emits "Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal." warnings — i.e. no AMD-tuned kernel configs are shipped for this GPU's exact shapes, so vLLM autotunes from defaults. FP16 is the most polished path today; FP8 should improve once AMD ships tuned configs.
  • huggingface-hub shadowing. Lemonade launches vllm-server with PYTHONNOUSERSITE=1 so the bundled huggingface_hub is used. If a module-not-found error still appears, ensure ~/.local/lib/python3.12/site-packages/huggingface_hub isn't being injected via PYTHONPATH.
  • Interrupted long loads can leave orphaned processes. If a load times out at the Lemonade level, vLLM's child EngineCore may keep running in the background and hold VRAM until it is killed. Look for a VLLM::EngineCor process (Linux truncates process names to 15 characters, hence the missing "e") and kill -9 it before retrying.
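
Finding the orphaned engine process can be scripted. The Python sketch below scans /proc for processes whose short name matches exactly. It assumes the name shown by process listings is the one stored in /proc/<pid>/comm; the function name and the `proc_root` parameter (useful for testing) are hypothetical.

```python
from pathlib import Path
from typing import List


def find_pids_by_comm(comm: str, proc_root: str = "/proc") -> List[int]:
    """Return the PIDs whose /proc/<pid>/comm equals `comm`."""
    pids: List[int] = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            name = (entry / "comm").read_text().strip()
        except OSError:
            continue  # the process exited or is not readable
        if name == comm:
            pids.append(int(entry.name))
    return sorted(pids)
```

Calling `find_pids_by_comm("VLLM::EngineCor")` returns the candidates; review the list before sending SIGKILL (for example with `os.kill(pid, signal.SIGKILL)`), since any running vLLM engine on the machine will match, not just the orphaned one.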