
TracyWang95/vllm-ascend-glm-4.6v-flash

GLM-4.6V-Flash on Ascend NPU with vLLM

Deploy GLM-4.6V-Flash — a 9B dense vision-language model — on a Huawei Ascend 910B NPU using vLLM + vllm-ascend, with full multimodal (image) inference, an OpenAI-compatible API, and single-card or dual-card load-balanced serving.

TL;DR — it works out of the box, no source patching. vllm-ascend's support matrix lists multimodal GLM-4V as ❌ (issue #2260) — but that entry is the old glm-4v-9b (GLM4VForCausalLM). GLM-4.6V-Flash is a dense Glm4vForConditionalGeneration model, which the vLLM core registry already supports, and it runs on Ascend directly. You do not need to build xLLM or fall back to raw transformers.

[Figure: benchmark results]


Why this repo

When you try to serve a GLM vision model on Ascend, you hit conflicting signals: the official vllm-ascend matrix marks GLM-4V as unsupported, and GLM-4.6V-Flash itself isn't listed in the matrix at all. This repo is the validated, reproducible answer: GLM-4.6V-Flash (the 9B dense member of the family, not the 106B MoE GLM-4.6V) serves cleanly on a single 910B card via vLLM, including image input, behind an OpenAI-compatible API. It includes ready-to-run scripts, a multimodal smoke test, and a reproducible throughput/latency benchmark.

Tested environment

| Component | Version |
| --- | --- |
| NPU | Ascend 910B2C (Atlas 800 A2 class), 64 GB HBM × 2 |
| Driver / npu-smi | 25.5.1 |
| CANN | 8.5.1 |
| OS | openEuler 24.03 (container) |
| Python | 3.11 |
| torch / torch_npu | 2.9.0 / 2.9.0 |
| vllm | 0.19.1 |
| vllm-ascend | 0.19.1rc1 |
| transformers | 5.5.3 |

The easiest base is the official vllm-ascend container image (quay.io/ascend/vllm-ascend:...), which ships this whole stack. Any environment with a matching vllm + vllm-ascend + CANN should work.
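
If you are not using the container image, a quick way to see how far your environment is from the tested one is to compare installed package versions against the table above. The following is a minimal sketch, not part of this repo's scripts; the distribution names (`torch-npu`, `vllm-ascend`) are assumptions about how the packages are published and may need adjusting.

```python
from importlib import metadata

# Versions copied from the "Tested environment" table. The distribution
# names are assumptions; rename them if your environment differs.
TESTED_VERSIONS = {
    "torch": "2.9.0",
    "torch-npu": "2.9.0",
    "vllm": "0.19.1",
    "vllm-ascend": "0.19.1rc1",
    "transformers": "5.5.3",
}


def installed_versions(names):
    """Return {name: version or None} for each distribution name."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def diff_versions(installed, tested=TESTED_VERSIONS):
    """List (name, installed, tested) for every package that differs.

    A local build suffix such as "+cpu" is ignored in the comparison.
    """
    mismatches = []
    for name, want in tested.items():
        have = installed.get(name)
        base = (have or "").split("+")[0]
        if base != want:
            mismatches.append((name, have, want))
    return mismatches


if __name__ == "__main__":
    for name, have, want in diff_versions(installed_versions(TESTED_VERSIONS)):
        print(f"{name}: installed={have} tested={want}")
```

A reported mismatch only means the combination differs from the one validated here; it may still work.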

Model

| | |
| --- | --- |
| Repo (HF) | zai-org/GLM-4.6V-Flash |
| Repo (ModelScope) | ZhipuAI/GLM-4.6V-Flash |
| Architecture | Glm4vForConditionalGeneration (model_type: glm4v), dense |
| Size / dtype | ~20.6 GB, bf16 (4 shards) |
| Context | 128K |
| License | MIT |

The 106B GLM-4.6V is a different class (Glm4vMoeForConditionalGeneration, MoE) with different Ascend caveats. This repo targets the 9B dense Flash, which fits on one 64 GB card and avoids the MoE expert-routing kernels.
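
Because the three GLM vision variants are easy to confuse, it can help to check which one a downloaded checkpoint actually is before serving it. The sketch below reads the `architectures` field from a Hugging Face style `config.json` and maps it to the classes named in this README. The function names and the returned labels are illustrative, not part of this repo.

```python
import json
from pathlib import Path

# Architecture strings as they appear in this README.
DENSE_ARCH = "Glm4vForConditionalGeneration"      # GLM-4.6V-Flash (9B dense)
MOE_ARCH = "Glm4vMoeForConditionalGeneration"     # GLM-4.6V (106B MoE)
LEGACY_ARCH = "GLM4VForCausalLM"                  # old glm-4v-9b


def classify_glm_vision(config):
    """Map a parsed config.json dict to 'dense', 'moe', 'legacy' or 'unknown'."""
    archs = config.get("architectures") or []
    if DENSE_ARCH in archs:
        return "dense"
    if MOE_ARCH in archs:
        return "moe"
    if LEGACY_ARCH in archs:
        return "legacy"
    return "unknown"


def classify_model_dir(model_dir):
    """Read <model_dir>/config.json and classify it."""
    with open(Path(model_dir) / "config.json", encoding="utf-8") as fh:
        return classify_glm_vision(json.load(fh))
```

For example, `classify_model_dir("/data/models/GLM-4.6V-Flash")` should return `"dense"` for the checkpoint this repo targets, assuming its `config.json` lists the architecture shown in the table above.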


Quick start

1. Download the weights

In mainland China huggingface.co is typically blocked; ModelScope and the HF mirror both work.

MODEL_DIR=/data/models/GLM-4.6V-Flash bash scripts/download_model.sh

2. Serve (single card)

MODEL_DIR=/data/models/GLM-4.6V-Flash bash scripts/serve_single.sh
# -> OpenAI-compatible API on http://0.0.0.0:8000, model name: glm-4.6v-flash

3. Smoke test (text + image)

python scripts/smoke_test.py
# [text]  ...
# [image] Red rectangle, blue circle.   <- the vision path is working

4. (Optional) Dual-card load balancing

Run two replicas (one per NPU) behind vLLM's built-in load balancer on a single port:

MODEL_DIR=/data/models/GLM-4.6V-Flash bash scripts/serve_dp2.sh

Disabling the thinking chain

GLM-4.6V-Flash is a hybrid-reasoning model and emits <think>...</think> by default. Turn it off per request:

{
  "model": "glm-4.6v-flash",
  "messages": [ /* ... text and/or image_url ... */ ],
  "chat_template_kwargs": { "enable_thinking": false }
}
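
On the client side, the same toggle can be wrapped in a small helper. This is a minimal sketch with hypothetical helper names; it also shows one way to strip a `<think>...</think>` block from a reply when thinking is left on, assuming the tags arrive inline in the message content.

```python
import re

# Matches one <think>...</think> block plus trailing whitespace.
_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def chat_payload(messages, enable_thinking=False, model="glm-4.6v-flash"):
    """Build a chat completions request body with the thinking toggle set."""
    return {
        "model": model,
        "messages": messages,
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def strip_think(text):
    """Remove inline <think>...</think> blocks from a model reply."""
    return _THINK_BLOCK.sub("", text).strip()


payload = chat_payload([{"role": "user", "content": "Say hello."}])
reply = strip_think("<think>\nThe user wants a greeting.\n</think>\nHello!")
```

The resulting `payload` can be sent as the JSON body of a POST to the server's `/v1/chat/completions` endpoint with any HTTP client.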

Sending an image

Standard OpenAI vision format — a base64 data URI or a URL:

{
  "model": "glm-4.6v-flash",
  "messages": [{ "role": "user", "content": [
    { "type": "image_url", "image_url": { "url": "data:image/png;base64,<...>" } },
    { "type": "text", "text": "Describe this image." }
  ]}],
  "chat_template_kwargs": { "enable_thinking": false }
}
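
Building the base64 data URI by hand is error-prone, so here is a short sketch of how the request above might be assembled in Python. The helper names are hypothetical; only the message shape comes from the example above.

```python
import base64
import mimetypes
from pathlib import Path


def image_data_uri(data, mime="image/png"):
    """Encode raw image bytes as a data URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_message(data, prompt, mime="image/png"):
    """Build one user message with an image part followed by a text part."""
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_data_uri(data, mime)}},
            {"type": "text", "text": prompt},
        ],
    }


def image_file_message(path, prompt):
    """Same as image_message, reading bytes and guessing the MIME type from a file."""
    mime = mimetypes.guess_type(str(path))[0] or "application/octet-stream"
    return image_message(Path(path).read_bytes(), prompt, mime)
```

A request body is then `{"model": "glm-4.6v-flash", "messages": [image_file_message("photo.png", "Describe this image.")], "chat_template_kwargs": {"enable_thinking": False}}`, where `photo.png` stands in for your own file.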

Benchmarks

Reproduce with scripts/benchmark.sh (uses vllm bench serve, random dataset, input 1024 / output 256, ignore_eos). Full numbers and method: benchmarks/results.md.

| Concurrency | Single card · out tok/s | Dual card (DP=2) · out tok/s | Single TTFT | Dual TTFT |
| --- | --- | --- | --- | --- |
| 1 | 40 | 40 | 151 ms | 131 ms |
| 16 | 461 | 520 | 528 ms | 233 ms |
| 64 | 1197 | 1423 | 798 ms | 570 ms |

A single 910B card sustains ~1200 output tok/s (~6000 total tok/s) under load with sub-second TTFT. DP=2's main wins are lower latency under concurrency and ~2× capacity headroom — its throughput advantage widens as offered load exceeds what one card can absorb.

Benchmarks measure the text path. Real vision requests add image-encoding overhead on top.
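
The derived figures quoted above can be checked from the table with a few lines of arithmetic. This sketch assumes every request processes exactly 1024 input and 256 output tokens, so total throughput is output throughput scaled by (1024 + 256) / 256; the row values are copied from the table, and the function names are illustrative.

```python
INPUT_TOKENS = 1024
OUTPUT_TOKENS = 256

# (concurrency, single out tok/s, dual out tok/s, single TTFT ms, dual TTFT ms)
ROWS = [
    (1, 40, 40, 151, 131),
    (16, 461, 520, 528, 233),
    (64, 1197, 1423, 798, 570),
]


def total_tok_per_s(out_tok_per_s):
    """Scale output throughput to input + output tokens per second."""
    return out_tok_per_s * (INPUT_TOKENS + OUTPUT_TOKENS) / OUTPUT_TOKENS


def summarize(rows=ROWS):
    """Derive total throughput and DP=2 ratios for each concurrency level."""
    summary = []
    for conc, s_out, d_out, s_ttft, d_ttft in rows:
        summary.append({
            "concurrency": conc,
            "single_total_tok_s": round(total_tok_per_s(s_out)),
            "dp2_throughput_gain": round(d_out / s_out, 2),
            "dp2_ttft_ratio": round(d_ttft / s_ttft, 2),
        })
    return summary


for row in summarize():
    print(row)
```

At concurrency 64 this gives 5985 total tok/s for a single card, a DP=2 output throughput gain of about 1.19×, and a DP=2 TTFT of about 0.71× the single-card value.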


Notes & FAQ

See docs/notes.md for:

  • Why the GLM-4V ❌ in the support matrix doesn't apply to dense glm4v.
  • Why you don't need xLLM — and why building xLLM inside a vllm-ascend container fails (prebuilt xllm_kernels is _GLIBCXX_USE_CXX11_ABI=0, while the stack's torch 2.9 is ABI=1 — an irreconcilable link-time conflict). xLLM's clean path is JD's own dev image.
  • The transformers + torch_npu fallback (works for the dense 9B, slower).
  • Troubleshooting (download source, HCCL_OP_EXPANSION_MODE=AIV, OOM tuning).

Acknowledgements

License

MIT. The GLM-4.6V-Flash weights are MIT-licensed by their authors; see the model card.
