Deploy GLM-4.6V-Flash — a 9B dense vision-language model — on a Huawei Ascend 910B NPU using vLLM + vllm-ascend, with full multimodal (image) inference, an OpenAI-compatible API, and single-card or dual-card load-balanced serving.
TL;DR: it works out of the box, no source patching. vllm-ascend's support matrix lists multimodal `GLM-4V` as ❌ (issue #2260), but that entry is the old `glm-4v-9b` (`GLM4VForCausalLM`). GLM-4.6V-Flash is a dense `Glm4vForConditionalGeneration` model, which the vLLM core registry already supports, and it runs on Ascend directly. You do not need to build xLLM or fall back to raw `transformers`.
When you try to serve a GLM vision model on Ascend, you hit conflicting signals: the official vllm-ascend matrix marks GLM-4V as unsupported, and GLM-4.6V-Flash itself isn't in the matrix at all. This repo is the validated, reproducible answer: GLM-4.6V-Flash (the 9B dense member of the family, not the 106B MoE GLM-4.6V) serves cleanly on a single 910B card via vLLM, including image input, with an OpenAI-compatible API. It includes ready-to-run scripts, a multimodal smoke test, and a reproducible throughput/latency benchmark.
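
One way to tell the two cases apart before serving is to read the `architectures` field from the checkpoint's `config.json`. The sketch below is illustrative only: the helper name `classify_glm_checkpoint` and the status strings are made up for this example, and the mapping restates the claims above rather than querying vLLM or vllm-ascend.

```python
import json
from pathlib import Path

# Restates this README's claims; nothing here is read from vLLM or vllm-ascend.
KNOWN_ARCHITECTURES = {
    "GLM4VForCausalLM": "old glm-4v-9b: the entry marked unsupported in the matrix",
    "Glm4vForConditionalGeneration": "dense glm4v (e.g. GLM-4.6V-Flash): serves via vLLM",
    "Glm4vMoeForConditionalGeneration": "MoE GLM-4.6V: different Ascend caveats",
}


def classify_glm_checkpoint(model_dir):
    """Return (architecture, note) for the checkpoint stored in model_dir."""
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    architectures = config.get("architectures") or []
    for arch in architectures:
        if arch in KNOWN_ARCHITECTURES:
            return arch, KNOWN_ARCHITECTURES[arch]
    first = architectures[0] if architectures else None
    return first, "unknown architecture: check the vLLM model registry"
```

For a downloaded GLM-4.6V-Flash checkpoint, `classify_glm_checkpoint("/data/models/GLM-4.6V-Flash")` should report the dense class if the table below matches your copy.
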
| Component | Version |
|---|---|
| NPU | Ascend 910B2C (Atlas 800 A2 class), 64 GB HBM × 2 |
| Driver / npu-smi | 25.5.1 |
| CANN | 8.5.1 |
| OS | openEuler 24.03 (container) |
| Python | 3.11 |
| torch / torch_npu | 2.9.0 / 2.9.0 |
| vllm | 0.19.1 |
| vllm-ascend | 0.19.1rc1 |
| transformers | 5.5.3 |
The easiest base is the official vllm-ascend container image (quay.io/ascend/vllm-ascend:...), which ships this whole stack. Any environment with a matching vllm + vllm-ascend + CANN should work.
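
To see whether an existing environment matches the table, a quick check with the standard library may help. This is a sketch, not part of the repo's scripts: the distribution names (especially `torch-npu` and `vllm-ascend`) are assumptions, and the helper names are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the table above. Distribution names are assumptions;
# adjust them if your environment packages these differently.
EXPECTED = {
    "torch": "2.9.0",
    "torch-npu": "2.9.0",
    "vllm": "0.19.1",
    "vllm-ascend": "0.19.1rc1",
    "transformers": "5.5.3",
}


def installed_version(dist_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def compare_versions(expected, lookup=installed_version):
    """Return {name: (expected, found)} for every package that does not match."""
    mismatches = {}
    for name, want in expected.items():
        found = lookup(name)
        # A local build tag such as "2.9.0+cpu" still counts as a match.
        if found is None or found.split("+")[0] != want:
            mismatches[name] = (want, found)
    return mismatches


# Usage:
# for name, (want, found) in compare_versions(EXPECTED).items():
#     print(f"{name}: expected {want}, found {found or 'not installed'}")
```

A mismatch does not necessarily mean the stack will fail; the table records the validated combination, not the only working one.
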

| Field | Value |
|---|---|
| Repo (HF) | zai-org/GLM-4.6V-Flash |
| Repo (ModelScope) | ZhipuAI/GLM-4.6V-Flash |
| Architecture | Glm4vForConditionalGeneration (model_type: glm4v), dense |
| Size / dtype | ~20.6 GB, bf16 (4 shards) |
| Context | 128K |
| License | MIT |

The 106B `GLM-4.6V` is a different class (`Glm4vMoeForConditionalGeneration`, MoE) with different Ascend caveats. This repo targets the 9B dense Flash, which fits on one 64 GB card and avoids the MoE expert-routing kernels.
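
A rough back-of-the-envelope check of the "fits on one card" claim, assuming vLLM's default `gpu_memory_utilization` of 0.9 and 2 bytes per bf16 parameter. It ignores activations, vision-encoder workspace, and runtime overhead, so treat the result as an upper bound on KV-cache room rather than a measurement.

```python
HBM_GB = 64.0          # per-card HBM, from the environment table
WEIGHTS_GB = 20.6      # GLM-4.6V-Flash checkpoint size, from the model table
MEM_UTILIZATION = 0.9  # assumption: vLLM's default gpu_memory_utilization


def kv_cache_headroom_gb(hbm_gb, weights_gb, utilization):
    """Memory left for KV cache after loading weights (upper bound)."""
    return hbm_gb * utilization - weights_gb


def bf16_weights_gb(params_billions):
    """Approximate bf16 weight size: 2 bytes per parameter."""
    return params_billions * 2.0


flash_headroom = kv_cache_headroom_gb(HBM_GB, WEIGHTS_GB, MEM_UTILIZATION)
moe_weights = bf16_weights_gb(106)

print(f"Flash: ~{flash_headroom:.1f} GB left per card after weights")
print(f"106B MoE: ~{moe_weights:.0f} GB of bf16 weights vs {HBM_GB:.0f} GB per card")
```

Under these assumptions the dense model leaves roughly 37 GB per card, while the 106B model's bf16 weights alone (about 212 GB) exceed a single card several times over.
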
In mainland China huggingface.co is typically blocked; ModelScope and the HF mirror both work.

```bash
MODEL_DIR=/data/models/GLM-4.6V-Flash bash scripts/download_model.sh
```

```bash
MODEL_DIR=/data/models/GLM-4.6V-Flash bash scripts/serve_single.sh
# -> OpenAI-compatible API on http://0.0.0.0:8000, model name: glm-4.6v-flash
```

```bash
python scripts/smoke_test.py
# [text] ...
# [image] Red rectangle, blue circle.   <- the vision path is working
```

Run two replicas (one per NPU) behind vLLM's built-in load balancer on a single port:

```bash
MODEL_DIR=/data/models/GLM-4.6V-Flash bash scripts/serve_dp2.sh
```

GLM-4.6V-Flash is a hybrid-reasoning model and emits `<think>...</think>` by default. Turn it off per request by setting `enable_thinking` to `false` in `chat_template_kwargs`.

Standard OpenAI vision format — a base64 data URI or a URL:

```json
{
  "model": "glm-4.6v-flash",
  "messages": [{ "role": "user", "content": [
    { "type": "image_url", "image_url": { "url": "data:image/png;base64,<...>" } },
    { "type": "text", "text": "Describe this image." }
  ]}],
  "chat_template_kwargs": { "enable_thinking": false }
}
```

Reproduce with `scripts/benchmark.sh` (uses `vllm bench serve`, random dataset, input 1024 / output 256, `ignore_eos`). Full numbers and method: `benchmarks/results.md`.

| Concurrency | Single card · out tok/s | Dual card (DP=2) · out tok/s | Single TTFT | Dual TTFT |
|---|---|---|---|---|
| 1 | 40 | 40 | 151 ms | 131 ms |
| 16 | 461 | 520 | 528 ms | 233 ms |
| 64 | 1197 | 1423 | 798 ms | 570 ms |
A single 910B card sustains ~1200 output tok/s (~6000 total tok/s) under load with sub-second TTFT. DP=2's main wins are lower latency under concurrency and ~2× capacity headroom — its throughput advantage widens as offered load exceeds what one card can absorb.
Benchmarks measure the text path. Real vision requests add image-encoding overhead on top.
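
The derived figures quoted above can be recomputed from the table. The sketch below assumes that "total tok/s" counts input plus output tokens, which with a fixed 1024-in / 256-out workload and `ignore_eos` is five times the output rate; the function names are hypothetical.

```python
# (concurrency, single_out_tok_s, dual_out_tok_s, single_ttft_ms, dual_ttft_ms)
RESULTS = [
    (1, 40, 40, 151, 131),
    (16, 461, 520, 528, 233),
    (64, 1197, 1423, 798, 570),
]
INPUT_TOKENS, OUTPUT_TOKENS = 1024, 256


def total_tok_s(out_tok_s):
    """Input + output token rate, assuming every request is 1024 in / 256 out."""
    return out_tok_s * (INPUT_TOKENS + OUTPUT_TOKENS) / OUTPUT_TOKENS


def summarize(rows):
    """Derive total throughput and DP=2 gains for each concurrency level."""
    summary = []
    for conc, s_out, d_out, s_ttft, d_ttft in rows:
        summary.append({
            "concurrency": conc,
            "single_total_tok_s": total_tok_s(s_out),
            "dp2_throughput_gain": d_out / s_out,
            "dp2_ttft_reduction": 1 - d_ttft / s_ttft,
        })
    return summary


for row in summarize(RESULTS):
    print(
        f"c={row['concurrency']:>2}  "
        f"single total={row['single_total_tok_s']:.0f} tok/s  "
        f"DP=2 throughput x{row['dp2_throughput_gain']:.2f}  "
        f"TTFT -{row['dp2_ttft_reduction']:.0%}"
    )
```

At concurrency 64 this gives about 5985 total tok/s on one card and a DP=2 throughput gain of roughly 1.19×, which is why the latency improvement, not raw throughput, is the headline at these load levels.
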
See `docs/notes.md` for:

- Why the `GLM-4V ❌` in the support matrix doesn't apply to dense `glm4v`.
- Why you don't need xLLM, and why building xLLM inside a vllm-ascend container fails (prebuilt `xllm_kernels` is `_GLIBCXX_USE_CXX11_ABI=0`, while the stack's torch 2.9 is ABI=1: an irreconcilable link-time conflict). xLLM's clean path is JD's own dev image.
- The `transformers + torch_npu` fallback (works for the dense 9B, slower).
- Troubleshooting (download source, `HCCL_OP_EXPANSION_MODE=AIV`, OOM tuning).

- Z.ai / Zhipu AI — GLM-V for the model.
- vLLM and vllm-ascend.
MIT. The GLM-4.6V-Flash weights are MIT-licensed by their authors; see the model card.

Request body with thinking disabled:

```json
{ "model": "glm-4.6v-flash", "messages": [ /* ... text and/or image_url ... */ ], "chat_template_kwargs": { "enable_thinking": false } }
```
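
The request bodies above can also be assembled and sent from Python using only the standard library. This is a minimal sketch, not the repo's `scripts/smoke_test.py`: the URL assumes the single-card server is reachable on `localhost:8000`, and `build_vision_request` and `send` are hypothetical helper names.

```python
import base64
import json
import urllib.request

# Assumption: the server started by scripts/serve_single.sh is reachable here.
API_URL = "http://localhost:8000/v1/chat/completions"


def build_vision_request(image_bytes, prompt, mime="image/png", thinking=False):
    """Build an OpenAI-style chat payload with one inline image and one text part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_uri = f"data:{mime};base64,{encoded}"
    return {
        "model": "glm-4.6v-flash",
        "messages": [{"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": data_uri}},
            {"type": "text", "text": prompt},
        ]}],
        # Per-request switch for the <think>...</think> output.
        "chat_template_kwargs": {"enable_thinking": thinking},
    }


def send(payload, url=API_URL, timeout=120):
    """POST the payload and return the assistant's reply text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (requires a running server and a local image file):
# with open("shapes.png", "rb") as f:
#     print(send(build_vision_request(f.read(), "Describe this image.")))
```

Building the payload separately from sending it keeps the request shape easy to inspect or log before it touches the network.
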