Qolda-AVL is a custom extension of ms-swift (the ModelScope training/inference framework) that adds an audio branch to the Qwen3-VL model family, turning a Vision-Language Model (VLM) into an Audio-Vision-Language Model (AVL).
- 🤗 Released model: issai/Qolda-AVL-5B (Qwen3-VL-4B + Whisper audio branch, ≈5B total parameters)
This repo bundles two things so the whole stack works from one place:
- swift/: the ms-swift framework with our qwen3_avl audio branch (training/inference logic).
- transformers/: a forked Transformers that ships the custom Qwen3AVL model classes. The model cannot load without it (see Custom Transformers).
Starting from upstream ms-swift, we add a new model type qwen3_avl that wraps
Qwen3-VL with an audio encoder + audio adapter + an audio DeepStack fusion
mechanism. The work spans the two bundled packages:
In transformers/ (the model itself): a new
src/transformers/models/qwen3_avl/ module —
Qwen3AVLConfig, Qwen3AVLForConditionalGeneration, and Qwen3AVLProcessor —
plus auto-class registration. This is where the Whisper audio encoder, the audio
projection MLP, and the audio DeepStack mergers live.
In swift/ (training/inference glue): the qwen3_avl model + template, which
is intentionally small and self-contained:
| File | Change |
|---|---|
| swift/model/models/qwen.py | Core qwen3_avl loader: audio feature extraction, the audio projector, and the audio DeepStack training-forward that scatters audio embeddings into the LLM and fuses per-layer audio features alongside the vision DeepStack. |
| swift/model/model_arch.py | Registers the qwen3_avl architecture and its aligner modules (audio projector + audio/vision DeepStack mergers) so --freeze_aligner controls them. |
| swift/model/constant.py | Adds the qwen3_avl model-type constant. |
| swift/template/templates/qwen.py | Adds the qwen3_avl chat template with <audio> handling. |
| swift/template/constant.py | Adds the qwen3_avl template-type constant. |
| swift/trainers/mixin.py | Trainer tweak to support the mixed vision + audio forward. |
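The "scatter, then fuse per layer" idea described for swift/model/models/qwen.py can be illustrated with a toy, framework-free sketch. This is not the repository's implementation, which works on batched tensors. The function names, the placeholder token id 99, and the choice to add features residually at the audio positions (mirroring how upstream Qwen3-VL's vision DeepStack is usually described) are all assumptions made for illustration.

```python
def scatter_audio(token_ids, embeds, audio_embeds, audio_token_id):
    """Overwrite the embeddings at audio placeholder positions, in order."""
    positions = [i for i, tok in enumerate(token_ids) if tok == audio_token_id]
    if len(positions) != len(audio_embeds):
        raise ValueError(
            f"{len(positions)} audio placeholder(s) but "
            f"{len(audio_embeds)} audio embedding(s)"
        )
    out = [list(row) for row in embeds]  # copy so the input is untouched
    for pos, emb in zip(positions, audio_embeds):
        out[pos] = list(emb)
    return out, positions


def deepstack_add(hidden, positions, layer_feats):
    """Add one layer's audio features to the hidden states at audio positions."""
    out = [list(row) for row in hidden]
    for pos, feat in zip(positions, layer_feats):
        out[pos] = [h + f for h, f in zip(out[pos], feat)]
    return out


AUDIO_ID = 99  # hypothetical placeholder id, not the real tokenizer value
token_ids = [1, AUDIO_ID, AUDIO_ID, 2]
embeds = [[0.0, 0.0] for _ in token_ids]   # stand-in text embeddings
audio_embeds = [[1.0, 1.0], [2.0, 2.0]]    # stand-in projector output

hidden, positions = scatter_audio(token_ids, embeds, audio_embeds, AUDIO_ID)

# One feature set per early LLM layer (two layers in this toy example).
deepstack_feats = [
    [[0.5, 0.5], [0.5, 0.5]],
    [[0.25, 0.25], [0.25, 0.25]],
]
for layer_feats in deepstack_feats:
    # A real decoder layer would transform `hidden` before this step.
    hidden = deepstack_add(hidden, positions, layer_feats)
```

Text positions are left untouched throughout; only the rows at the audio placeholder positions change.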
Qolda-AVL/
├── swift/ # ms-swift framework + our qwen3_avl audio branch (training/inference)
├── transformers/ # forked Transformers with the Qwen3AVL model (REQUIRED)
├── scripts/train/ # the two training-stage launch scripts
│ ├── audio_alignment_pretrain.sh
│ └── multimodal_finetune.sh
├── data/ # tiny generic examples showing the expected data format
│ ├── audio-text.jsonl
│ ├── vision-text.jsonl
│ └── text-only.jsonl
├── requirements/ # upstream ms-swift dependency lists
├── requirements.txt # curated dependencies for the audio pipeline
├── setup.py / setup.cfg # packaging (installs the `swift` / `megatron` CLIs)
└── README.md
# 1. Create an environment (Python 3.10+ recommended)
conda create -n qolda-avl python=3.10 -y
conda activate qolda-avl
# 2. Install a CUDA-matched PyTorch build (see https://pytorch.org)
pip install torch torchaudio torchvision
# 3. Install the bundled custom Transformers FIRST (required — see below)
pip install -e ./transformers
# 4. Install Qolda-AVL (this ms-swift fork) in editable mode
pip install -e .
# …or just the curated dependency set:
pip install -r requirements.txt
# 5. (recommended) FlashAttention — the training scripts use --attn_impl flash_attn
pip install flash-attn --no-build-isolation
⚠️ Order matters. Install ./transformers before (or instead of) any PyPI transformers, otherwise pip install -r requirements.txt may pull the upstream release, which does not know about Qwen3AVL.
Key dependencies (full list in requirements.txt / requirements/):
- the bundled transformers/ fork (required for the Qwen3AVL model)
- torch, torchaudio, librosa, soundfile (audio)
- deepspeed, accelerate, peft, trl, datasets, modelscope
- qwen-vl-utils, decord, pillow (vision/video, inherited from Qwen3-VL)
- flash-attn (optional but recommended)
The Qwen3AVL model classes live in a fork of Transformers (based on
5.2.0.dev0) under transformers/. Without it,
AutoModel/AutoConfig cannot resolve model_type: qwen3_avl, and both training
and inference will fail with an "unknown model type" error.
Install it (editable) into your environment:
# from the repo root
pip install -e ./transformers
Verify the model is registered:
python -c "import transformers, transformers.models.qwen3_avl as m; \
print('transformers', transformers.__version__); \
print('Qwen3AVL OK:', hasattr(m, 'modeling_qwen3_avl'))"
You should see a 5.2.0.dev0 version string and Qwen3AVL OK: True. After this,
loading the released checkpoint works through the normal HF API:
from transformers import AutoModelForCausalLM, AutoProcessor
model = AutoModelForCausalLM.from_pretrained("issai/Qolda-AVL-5B", trust_remote_code=False)
processor = AutoProcessor.from_pretrained("issai/Qolda-AVL-5B")
If you already have a different transformers installed, the editable install above will replace it in the active environment. We recommend a dedicated venv/conda env for Qolda-AVL to avoid version clashes with other projects.
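For scripted setups, a pre-flight check along the following lines can confirm that the fork is importable before a training or inference job starts. This is a minimal sketch: the helper names are hypothetical and only the standard-library importlib.util.find_spec is used. The module path transformers.models.qwen3_avl is the one given above.

```python
import importlib.util


def has_module(name):
    """Return True if `name` can be located without fully importing it.

    For dotted names the parent packages are imported, so a missing
    parent raises ModuleNotFoundError, which is treated as "absent".
    """
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


def check_qwen3_avl():
    """Return a short status string about the Transformers install."""
    if not has_module("transformers"):
        return "transformers is not installed"
    if not has_module("transformers.models.qwen3_avl"):
        return "transformers is installed, but it is not the Qwen3AVL fork"
    return "Qwen3AVL fork detected"


if __name__ == "__main__":
    print(check_qwen3_avl())
```

If the second message appears, reinstalling with pip install -e ./transformers from the repo root is the likely fix.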
Training data is JSON-lines, one example per line, using the ms-swift
conversation schema. Each modality uses its own tag in the message text and lists
the corresponding file path(s) in a parallel field. The repo ships one tiny
example file per modality in data/:
Audio: <audio> tag + audios field (data/audio-text.jsonl):
{"messages": [{"role": "user", "content": "<audio>Transcribe the audio into text."},
{"role": "assistant", "content": "Good morning everyone."}],
"audios": ["/path/to/audio/sample_01.wav"]}
Vision: <image> (or <video>) tag + images / videos field
(data/vision-text.jsonl):
{"messages": [{"role": "user", "content": "<image>What is shown in this image?"},
{"role": "assistant", "content": "A wooden table with a cup of coffee."}],
"images": ["/path/to/image/sample_01.jpg"]}
Text-only: no media tag or field (data/text-only.jsonl):
{"messages": [{"role": "user", "content": "Explain the difference between a list and a tuple in Python."},
{"role": "assistant", "content": "A list is mutable, while a tuple is immutable."}]}
- Multiple tags are allowed; provide one path per tag in the matching field.
- An optional {"role": "system", ...} message can be prepended.
- Modalities can be mixed within a single example and across files; the multimodal fine-tuning stage trains on all three together.
Replace the placeholder paths with your own .wav/.flac (audio) and
.jpg/.png (image) files.
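Before a long run it can help to check that every example follows the "one path per tag" rule described above. The following is a minimal validator sketch, not part of the repository: the function names are hypothetical, it only counts tags against paths, and it does not check that the files exist or are readable.

```python
import json

# Media tag -> name of the parallel field that lists its file paths.
TAG_FIELDS = {"<audio>": "audios", "<image>": "images", "<video>": "videos"}


def check_example(example):
    """Return a list of problems in one decoded record (empty if none)."""
    problems = []
    messages = example.get("messages") or []
    if not messages:
        problems.append("missing or empty 'messages'")
    text = "".join(str(m.get("content", "")) for m in messages)
    for tag, field in TAG_FIELDS.items():
        n_tags = text.count(tag)
        n_paths = len(example.get(field) or [])
        if n_tags != n_paths:
            problems.append(
                f"{n_tags} {tag} tag(s) but {n_paths} path(s) in '{field}'"
            )
    return problems


def check_file(path):
    """Yield (line_number, problem) for every issue found in a JSONL file."""
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                example = json.loads(line)
            except json.JSONDecodeError as exc:
                yield lineno, f"invalid JSON: {exc.msg}"
                continue
            for problem in check_example(example):
                yield lineno, problem
```

A typical use is to loop over check_file("data/audio-text.jsonl") and print each (line number, problem) pair; an empty result means tag and path counts agree on every line.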
Qolda-AVL is trained in two stages. Each script is a thin wrapper around
swift sft and exposes MODEL, DATA, and OUTPUT as environment variables, so
you can sanity-check the pipeline on the bundled example data and then point them
at your real model/data.
You first need an audio-extended base model: a Qwen3-VL-4B checkpoint (e.g.
Qwen3-VL-4B-Thinking) whose audio branch (Whisper encoder + audio projector +
audio DeepStack mergers) has been initialised. The released, already-trained model
is on the Hub at issai/Qolda-AVL-5B.
| Stage | Script | ViT | LLM | Aligners | Audio enc. | Data |
|---|---|---|---|---|---|---|
| 1 — Audio alignment pre-training | audio_alignment_pretrain.sh | ❄️ frozen | ❄️ frozen | 🔥 trained | ❄️ frozen | audio / transcription |
| 2 — Multimodal fine-tuning | multimodal_finetune.sh | ❄️ frozen | 🔥 trained | 🔥 trained | ❄️ frozen | audio + vision + text-only |
Stage 1: Audio alignment pre-training. Only the aligners (audio projector + audio/vision DeepStack mergers) are trained, so they learn to map the frozen audio encoder's features into the (frozen) LLM's embedding space on large-scale audio data.
Stage 2 — Multimodal fine-tuning. Continue from the Stage-1 checkpoint and
unfreeze the LLM, training jointly on a mixture of audio-text, vision-text,
and text-only samples. This builds a full instruction-following assistant that
handles speech, images, and plain text without forgetting its original
vision/text abilities. For low-resource adaptation you can switch to LoRA inside
the script (--train_type lora --target_modules all-linear).
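The freeze plan in the stage table can be written down as data, which makes it easy to compare the two stages or to generate launch options. The sketch below is illustrative only: the group names and helper functions are hypothetical, and the assumption that the scripts express the plan through ms-swift's --freeze_vit / --freeze_llm / --freeze_aligner options is ours (only --freeze_aligner is named in this README). How the audio encoder is kept frozen is not stated here, so no flag is emitted for it.

```python
# True = trained, False = frozen, copied from the stage table.
STAGE_TRAINABLE = {
    1: {"vit": False, "llm": False, "aligner": True, "audio_encoder": False},
    2: {"vit": False, "llm": True, "aligner": True, "audio_encoder": False},
}


def trainable_groups(stage):
    """Return the sorted names of the component groups trained in a stage."""
    return sorted(g for g, on in STAGE_TRAINABLE[stage].items() if on)


def freeze_flags(stage):
    """Build freeze options for a stage (assumed flag names, see above)."""
    plan = STAGE_TRAINABLE[stage]
    flags = []
    for group in ("vit", "llm", "aligner"):
        flags += [f"--freeze_{group}", "false" if plan[group] else "true"]
    return flags


for stage in (1, 2):
    print(stage, trainable_groups(stage), " ".join(freeze_flags(stage)))
```

The only difference between the two stages is the LLM group, which matches the description of Stage 2 as Stage 1 plus an unfrozen LLM.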
# Stage 1 — audio alignment (smoke-test on the bundled example data)
bash scripts/train/audio_alignment_pretrain.sh
# real run: override the defaults
MODEL=./pretrained/Qwen3-AVL-base \
DATA=./data/my_audio_train.jsonl \
OUTPUT=./output/audio_alignment \
bash scripts/train/audio_alignment_pretrain.sh
# Stage 2 — multimodal fine-tuning (defaults to all three example files;
# pass a space-separated list of your own jsonls via DATA)
MODEL=./output/audio_alignment/checkpoint-last \
DATA="./data/my_audio.jsonl ./data/my_vision.jsonl ./data/my_text.jsonl" \
OUTPUT=./output/multimodal_finetune \
bash scripts/train/multimodal_finetune.sh
This is a fork of ms-swift. The framework's general usage, CLI flags, and
multimodal data conventions follow upstream documentation
(https://github.com/modelscope/ms-swift). Our additions are confined to the
qwen3_avl model/template plus the bundled transformers/ fork, so upstream
features (DeepSpeed, Megatron, LoRA, packing, evaluation, etc.) work unchanged.
Both bundled packages are Apache-2.0 (inherited from ms-swift and 🤗
Transformers); see LICENSE.