GLM-4.6V-Flash on Ascend NPU with vLLM

Deploy GLM-4.6V-Flash — a 9B dense vision-language model — on a Huawei Ascend 910B NPU using vLLM + vllm-ascend, with full multimodal (image) inference, an OpenAI-compatible API, and single-card or dual-card load-balanced serving.

TL;DR — it works out of the box, no source patching. vllm-ascend's support matrix lists multimodal GLM-4V as ❌ (issue #2260) — but that entry is the old glm-4v-9b (GLM4VForCausalLM). GLM-4.6V-Flash is a dense Glm4vForConditionalGeneration model, which the vLLM core registry already supports, and it runs on Ascend directly. You do not need to build xLLM or fall back to raw transformers.

Why this repo

When you try to serve a GLM vision model on Ascend, you hit conflicting signals: the official vllm-ascend support matrix marks GLM-4V as unsupported, and GLM-4.6V-Flash itself is not listed in the matrix at all. This repo is the validated, reproducible answer: GLM-4.6V-Flash (the 9B dense member of the family, not the 106B MoE GLM-4.6V) serves cleanly on a single 910B card via vLLM, including image input, behind an OpenAI-compatible API. It includes ready-to-run scripts, a multimodal smoke test, and a reproducible throughput/latency benchmark.

Tested environment

| Component | Version |
| --- | --- |
| NPU | Ascend 910B2C (Atlas 800 A2 class), 64 GB HBM × 2 |
| Driver / npu-smi | 25.5.1 |
| CANN | 8.5.1 |
| OS | openEuler 24.03 (container) |
| Python | 3.11 |
| torch / torch_npu | 2.9.0 / 2.9.0 |
| vllm | 0.19.1 |
| vllm-ascend | 0.19.1rc1 |
| transformers | 5.5.3 |

The easiest base is the official vllm-ascend container image (quay.io/ascend/vllm-ascend:...), which ships this whole stack. Any environment with a matching vllm + vllm-ascend + CANN should work.

Model


| Property | Value |
| --- | --- |
| Repo (HF) | `zai-org/GLM-4.6V-Flash` |
| Repo (ModelScope) | `ZhipuAI/GLM-4.6V-Flash` |
| Architecture | `Glm4vForConditionalGeneration` (`model_type: glm4v`), dense |
| Size / dtype | ~20.6 GB, bf16 (4 shards) |
| Context | 128K |
| License | MIT |

The 106B GLM-4.6V is a different class (Glm4vMoeForConditionalGeneration, MoE) with different Ascend caveats. This repo targets the 9B dense Flash, which fits on one 64 GB card and avoids the MoE expert-routing kernels.
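
To confirm which variant a downloaded checkpoint actually is, one option is to read the `architectures` field of its `config.json`. The sketch below assumes the standard Hugging Face checkpoint layout (a `config.json` at the top of the model directory); `classify_checkpoint` is an illustrative helper name, not something shipped in this repo.

```python
import json
from pathlib import Path

# Architecture names as given in the Model section above.
DENSE_ARCH = "Glm4vForConditionalGeneration"
MOE_ARCH = "Glm4vMoeForConditionalGeneration"


def classify_checkpoint(model_dir):
    """Return 'dense', 'moe', or 'unknown' from config.json's architectures list."""
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    architectures = config.get("architectures", [])
    if DENSE_ARCH in architectures:
        return "dense"
    if MOE_ARCH in architectures:
        return "moe"
    return "unknown"
```

For example, `classify_checkpoint("/data/models/GLM-4.6V-Flash")` should return `"dense"` for the checkpoint this repo targets.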

Quick start

1. Download the weights

In mainland China huggingface.co is typically blocked; ModelScope and the HF mirror both work.

MODEL_DIR=/data/models/GLM-4.6V-Flash bash scripts/download_model.sh

2. Serve (single card)

MODEL_DIR=/data/models/GLM-4.6V-Flash bash scripts/serve_single.sh
# -> OpenAI-compatible API on http://0.0.0.0:8000, model name: glm-4.6v-flash
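
Model loading takes a while, so a client may want to wait until the server lists the model before sending traffic. A minimal sketch, assuming the server exposes the OpenAI-style `GET /v1/models` listing; `served_model_ids`, `wait_until_ready`, and their parameters are hypothetical helpers, not part of `scripts/`.

```python
import json
import time
import urllib.request


def served_model_ids(base_url, timeout=5.0):
    """Fetch the model ids the server currently reports under /v1/models."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]


def wait_until_ready(base_url, model_name, deadline_s=600.0, poll_s=5.0,
                     fetch=served_model_ids):
    """Poll until model_name is listed; return False if the deadline passes first."""
    start = time.monotonic()
    while time.monotonic() - start < deadline_s:
        try:
            if model_name in fetch(base_url):
                return True
        except (OSError, ValueError):
            # Connection refused, timeout, or a non-JSON body: server not up yet.
            pass
        time.sleep(poll_s)
    return False
```

Typical use would be `wait_until_ready("http://127.0.0.1:8000", "glm-4.6v-flash")` before running the smoke test. The `fetch` argument exists only so the polling logic can be exercised without a live server.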

3. Smoke test (text + image)

python scripts/smoke_test.py
# [text]  ...
# [image] Red rectangle, blue circle.   <- the vision path is working

4. (Optional) Dual-card load balancing

Run two replicas (one per NPU) behind vLLM's built-in load balancer on a single port:

MODEL_DIR=/data/models/GLM-4.6V-Flash bash scripts/serve_dp2.sh

Disabling the thinking chain

GLM-4.6V-Flash is a hybrid-reasoning model and emits <think>...</think> by default. Turn it off per request:

{
  "model": "glm-4.6v-flash",
  "messages": [ /* ... text and/or image_url ... */ ],
  "chat_template_kwargs": { "enable_thinking": false }
}
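
The same request body can be built from Python. The sketch below also includes a client-side fallback that strips a `<think>...</think>` block from returned text, for callers that leave thinking enabled. Both helper names are illustrative, and the stripping regex is an assumption about the output shape described above rather than anything the server guarantees.

```python
import re

# Matches one <think>...</think> block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def build_chat_request(prompt, enable_thinking=False, model="glm-4.6v-flash"):
    """Build a chat-completions body with the per-request thinking switch."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def strip_thinking(text):
    """Remove any <think>...</think> blocks and surrounding whitespace."""
    return THINK_BLOCK.sub("", text).strip()
```

The dictionary returned by `build_chat_request` can be serialized with `json.dumps` and POSTed to the server's `/v1/chat/completions` endpoint.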

Sending an image

Standard OpenAI vision format — a base64 data URI or a URL:

{
  "model": "glm-4.6v-flash",
  "messages": [{ "role": "user", "content": [
    { "type": "image_url", "image_url": { "url": "data:image/png;base64,<...>" } },
    { "type": "text", "text": "Describe this image." }
  ]}],
  "chat_template_kwargs": { "enable_thinking": false }
}
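
For local files, the data URI has to be assembled by the client. A minimal sketch using only the standard library; `image_to_data_uri` and `build_image_request` are illustrative names, and the MIME type is guessed from the file extension, which is an assumption that the extension is accurate.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_uri(path):
    """Encode a local image file as a base64 data URI."""
    path = Path(path)
    mime, _ = mimetypes.guess_type(path.name)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {path.name!r}")
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_image_request(image_path, question, model="glm-4.6v-flash"):
    """Build a chat-completions body with one image and one text part."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_to_data_uri(image_path)}},
                {"type": "text", "text": question},
            ],
        }],
        "chat_template_kwargs": {"enable_thinking": False},
    }
```

Large images produce large request bodies, so passing a plain URL instead is worth considering when the server can reach the image directly.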

Benchmarks

Reproduce with scripts/benchmark.sh (uses vllm bench serve with the random dataset, 1024 input / 256 output tokens per request, and ignore_eos). Full numbers and method: benchmarks/results.md.

| Concurrency | Single card · out tok/s | Dual card (DP=2) · out tok/s | Single TTFT | Dual TTFT |
| --- | --- | --- | --- | --- |
| 1 | 40 | 40 | 151 ms | 131 ms |
| 16 | 461 | 520 | 528 ms | 233 ms |
| 64 | 1197 | 1423 | 798 ms | 570 ms |

At concurrency 64, a single 910B card sustains ~1200 output tok/s (~6000 total tok/s, input plus output) with sub-second TTFT. DP=2's main wins are lower latency under concurrency (TTFT drops from 528 ms to 233 ms at concurrency 16) and ~2× capacity headroom; its throughput advantage widens as offered load exceeds what one card can absorb.

Benchmarks measure the text path. Real vision requests add image-encoding overhead on top.
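
The total-throughput figure and the dual-card ratios follow from the table by simple arithmetic. The sketch below assumes every request has exactly the nominal shape (1024 input tokens, 256 output tokens) and that "total" means input plus output tokens; the function and variable names are illustrative.

```python
INPUT_TOKENS = 1024   # per request, from the benchmark settings
OUTPUT_TOKENS = 256   # per request; ignore_eos forces the full length

# concurrency -> (single out tok/s, dual out tok/s, single TTFT ms, dual TTFT ms)
RESULTS = {
    1: (40, 40, 151, 131),
    16: (461, 520, 528, 233),
    64: (1197, 1423, 798, 570),
}


def total_tok_per_s(output_tok_per_s):
    """Scale output throughput to input+output throughput at a fixed request shape."""
    return output_tok_per_s * (INPUT_TOKENS + OUTPUT_TOKENS) / OUTPUT_TOKENS


def compare(results):
    """Per concurrency level: single-card total tok/s and dual/single ratios."""
    summary = {}
    for concurrency, (single_out, dual_out, single_ttft, dual_ttft) in results.items():
        summary[concurrency] = {
            "single_total_tok_s": total_tok_per_s(single_out),
            "throughput_gain": round(dual_out / single_out, 2),
            "ttft_ratio": round(dual_ttft / single_ttft, 2),
        }
    return summary
```

At concurrency 64 this gives 5985 total tok/s for one card and a dual-card throughput gain of 1.19×, which matches the rounded figures quoted above.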

Notes & FAQ

See docs/notes.md for:

- Why the GLM-4V ❌ in the support matrix doesn't apply to dense glm4v.
- Why you don't need xLLM, and why building xLLM inside a vllm-ascend container fails: the prebuilt xllm_kernels is built with _GLIBCXX_USE_CXX11_ABI=0, while the stack's torch 2.9 is ABI=1, an irreconcilable link-time conflict. xLLM's clean path is JD's own dev image.
- The transformers + torch_npu fallback (works for the dense 9B, slower).
- Troubleshooting (download source, HCCL_OP_EXPANSION_MODE=AIV, OOM tuning).
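
The ABI mismatch mentioned in the notes above can be checked from Python before attempting any build. This is a rough sketch: `torch.compiled_with_cxx11_abi()` reports the ABI setting of the installed torch build, while the ABI of a prebuilt extension has to be supplied by the caller (here as a plain boolean). Both helper names are hypothetical.

```python
def torch_uses_cxx11_abi():
    """Return True/False for the installed torch build, or None if torch is absent."""
    try:
        import torch
    except ImportError:
        return None
    # Reports the _GLIBCXX_USE_CXX11_ABI setting torch was compiled with.
    return bool(torch.compiled_with_cxx11_abi())


def abi_conflict(torch_abi, extension_abi):
    """True when torch and a prebuilt C++ extension disagree on the libstdc++ ABI."""
    return torch_abi is not None and torch_abi != extension_abi
```

With the stack described here, `abi_conflict(torch_uses_cxx11_abi(), False)` would be expected to return `True`, since the prebuilt kernels use ABI=0 and torch 2.9 uses ABI=1.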

Acknowledgements

Z.ai / Zhipu AI — GLM-V for the model.
vLLM and vllm-ascend.

License

MIT. The GLM-4.6V-Flash weights are MIT-licensed by their authors; see the model card.
