
[CPU][Perf] Accelerate unquantized MoE for AArch64 #46353

Open
fadara01 wants to merge 2 commits into vllm-project:main from fadara01:fused_moe_arm


@fadara01 fadara01 commented Jun 22, 2026


Purpose

Accelerate unquantized MoE for AArch64

  • Enable FusedMoE kernel for AArch64
  • Implement AdvSIMD BFMMLA interface to accelerate w13 and w2 GEMMs
  • Extend the generic micro-kernel interface and the MoE kernel to support packing the input matrix
  • Abstract sleef.h includes and tanh symbol for x86 under the AVX vectorizer class
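
To make the BFMMLA and packing bullets more concrete, here is a minimal scalar sketch, not the code in this PR. It models what a single AdvSIMD BFMMLA instruction computes: a 2x4 BF16 tile times the transpose of another 2x4 BF16 tile, accumulated into a 2x2 FP32 tile. It also shows why operands need packing first: each instruction wants two rows of four consecutive K elements sitting contiguously. The names `bfmmla_ref`, `pack_row_pairs`, and `gemm_bfmmla_ref`, the exact packed layout, and the [N x K] weight layout are all assumptions for illustration. Rounding details of the real instruction are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// BF16 stored as the upper 16 bits of an IEEE-754 binary32 value.
using bf16_t = std::uint16_t;

inline float bf16_to_float(bf16_t v) {
  std::uint32_t bits = static_cast<std::uint32_t>(v) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

// Truncating conversion, good enough for illustration.
inline bf16_t float_to_bf16(float f) {
  std::uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  return static_cast<bf16_t>(bits >> 16);
}

// Scalar model of one BFMMLA: acc (2x2, row-major) += a (2x4) * b^T,
// where a and b each hold two rows of four BF16 values.
inline void bfmmla_ref(float acc[4], const bf16_t a[8], const bf16_t b[8]) {
  for (int i = 0; i < 2; ++i) {
    for (int j = 0; j < 2; ++j) {
      float sum = 0.0f;
      for (int k = 0; k < 4; ++k) {
        sum += bf16_to_float(a[4 * i + k]) * bf16_to_float(b[4 * j + k]);
      }
      acc[2 * i + j] += sum;
    }
  }
}

// Hypothetical packing: for each pair of rows and each K chunk of 4, emit
// 8 contiguous values (4 from row r, then 4 from row r + 1).
std::vector<bf16_t> pack_row_pairs(const std::vector<bf16_t>& src,
                                   std::size_t rows, std::size_t K) {
  assert(rows % 2 == 0 && K % 4 == 0);
  assert(src.size() == rows * K);
  std::vector<bf16_t> dst;
  dst.reserve(src.size());
  for (std::size_t r = 0; r < rows; r += 2)
    for (std::size_t k = 0; k < K; k += 4)
      for (std::size_t i = 0; i < 2; ++i)
        for (std::size_t kk = 0; kk < 4; ++kk)
          dst.push_back(src[(r + i) * K + k + kk]);
  return dst;
}

// C[M x N] = A[M x K] * W[N x K]^T, with both operands packed as above.
std::vector<float> gemm_bfmmla_ref(const std::vector<bf16_t>& a_packed,
                                   const std::vector<bf16_t>& w_packed,
                                   std::size_t M, std::size_t N,
                                   std::size_t K) {
  assert(M % 2 == 0 && N % 2 == 0 && K % 4 == 0);
  const std::size_t chunks = K / 4;
  std::vector<float> out(M * N, 0.0f);
  for (std::size_t m = 0; m < M; m += 2) {
    for (std::size_t n = 0; n < N; n += 2) {
      float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
      for (std::size_t kc = 0; kc < chunks; ++kc) {
        const bf16_t* a = &a_packed[((m / 2) * chunks + kc) * 8];
        const bf16_t* b = &w_packed[((n / 2) * chunks + kc) * 8];
        bfmmla_ref(acc, a, b);  // one instruction's worth of work
      }
      out[m * N + n] = acc[0];
      out[m * N + n + 1] = acc[1];
      out[(m + 1) * N + n] = acc[2];
      out[(m + 1) * N + n + 1] = acc[3];
    }
  }
  return out;
}
```

In an actual AArch64 kernel the inner `bfmmla_ref` call would presumably map to the `vbfmmlaq_f32` intrinsic on 128-bit registers, and the weight tiles could be packed once ahead of time while the activation tiles are packed per call. That split is a guess at the motivation for the input-packing bullet, not something the description states.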

Performance

1.96x higher throughput for gpt-oss and 2.18x higher throughput for gemma4, measured with the benchmark below on 64 Neoverse-V2 cores.

MODEL=unsloth/gpt-oss-20b-BF16
#MODEL=google/gemma-4-26B-A4B-it
# gemma4 needs the setting below; attention currently hangs without it.
#export VLLM_CPU_ATTN_SPLIT_KV=0
vllm bench throughput \
  --num-prompts 128 \
  --seed 0 \
  --dataset-name sonnet \
  --dataset-path /home/fadara01/vllm-moe/vllm/benchmarks/sonnet.txt \
  --input-len 256 \
  --output-len 256 \
  --max-model-len 4096 \
  --max-num-batched-tokens 4096 \
  --model "$MODEL" \
  --tensor-parallel-size 1 \
  --no-enable-prefix-caching \
  --num-warmups 5

Test Plan

CI

Test Result



@fadara01


Hi @mgoin @bigPYJ1151 :)

Could you please have a look at this?

@fadara01 fadara01 mentioned this pull request Jun 22, 2026
4 tasks
@fadara01 fadara01 marked this pull request as draft June 22, 2026 10:25
@fadara01 fadara01 marked this pull request as ready for review June 22, 2026 11:42

mergify Bot commented Jun 22, 2026


This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @fadara01.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jun 22, 2026
Comment thread csrc/cpu/cpu_fused_moe.cpp Outdated
fadara01 added 2 commits June 23, 2026 16:05
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>

Labels

  • ci/build
  • cpu: Related to CPU backends
  • gpt-oss: Related to GPT-OSS models
  • needs-rebase
  • performance: Performance-related issues

Projects

Status: To Triage
