Qolda-AVL: Extending a Vision-Language Model with Audio Understanding

Qolda-AVL is a custom extension of ms-swift (the ModelScope training/inference framework) that adds an audio branch to the Qwen3-VL model family, turning a Vision-Language Model (VLM) into an Audio-Vision-Language Model (AVL).

🤗 Released model: issai/Qolda-AVL-5B (Qwen3-VL-4B + Whisper audio branch, ≈5B total parameters)

This repo bundles two things so the whole stack works from one place:

swift/ — the ms-swift framework with our qwen3_avl audio branch (training/inference logic).
transformers/ — a forked Transformers that ships the custom Qwen3AVL model classes. The model cannot load without it (see Custom Transformers).

What we added

Starting from upstream ms-swift, we add a new model type qwen3_avl that wraps Qwen3-VL with an audio encoder + audio adapter + an audio DeepStack fusion mechanism. The work spans the two bundled packages:

In transformers/ (the model itself): a new src/transformers/models/qwen3_avl/ module — Qwen3AVLConfig, Qwen3AVLForConditionalGeneration, and Qwen3AVLProcessor — plus auto-class registration. This is where the Whisper audio encoder, the audio projection MLP, and the audio DeepStack mergers live.

In swift/ (training/inference glue): the qwen3_avl model + template, which is intentionally small and self-contained:

| File | Change |
|---|---|
| `swift/model/models/qwen.py` | Core `qwen3_avl` loader: audio feature extraction, the audio projector, and the audio DeepStack training-forward that scatters audio embeddings into the LLM and fuses per-layer audio features alongside the vision DeepStack. |
| `swift/model/model_arch.py` | Registers the `qwen3_avl` architecture and its aligner modules (audio projector + audio/vision DeepStack mergers) so `--freeze_aligner` controls them. |
| `swift/model/constant.py` | Adds the `qwen3_avl` model-type constant. |
| `swift/template/templates/qwen.py` | Adds the `qwen3_avl` chat template with `<audio>` handling. |
| `swift/template/constant.py` | Adds the `qwen3_avl` template-type constant. |
| `swift/trainers/mixin.py` | Trainer tweak to support the mixed vision + audio forward. |
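
The "scatter, then fuse per layer" idea described for `swift/model/models/qwen.py` can be illustrated with a toy sketch. This is not the repository's implementation: the function names `scatter_audio` and `deepstack_fuse` are hypothetical, plain Python lists stand in for tensors, and additive fusion at the audio positions is an assumption borrowed from how DeepStack is usually described for vision tokens.

```python
from typing import List

Vector = List[float]


def scatter_audio(token_embeds: List[Vector], is_audio: List[bool],
                  audio_embeds: List[Vector]) -> List[Vector]:
    """Replace each audio placeholder position with the next audio embedding."""
    if sum(is_audio) != len(audio_embeds):
        raise ValueError("audio positions and audio embeddings must match in count")
    audio_iter = iter(audio_embeds)
    return [list(next(audio_iter)) if flag else list(vec)
            for vec, flag in zip(token_embeds, is_audio)]


def deepstack_fuse(hidden: List[Vector], is_audio: List[bool],
                   layer_feats: List[Vector]) -> List[Vector]:
    """Add one layer's audio features to the hidden states at audio positions.

    Assumption: fusion is a plain element-wise addition.
    """
    if sum(is_audio) != len(layer_feats):
        raise ValueError("audio positions and layer features must match in count")
    feat_iter = iter(layer_feats)
    fused = []
    for vec, flag in zip(hidden, is_audio):
        if flag:
            feat = next(feat_iter)
            fused.append([h + f for h, f in zip(vec, feat)])
        else:
            fused.append(list(vec))
    return fused


# Four token positions, two of which are audio placeholders.
embeds = [[1.0, 1.0], [0.0, 0.0], [0.0, 0.0], [2.0, 2.0]]
mask = [False, True, True, False]
hidden = scatter_audio(embeds, mask, [[5.0, 5.0], [6.0, 6.0]])
hidden = deepstack_fuse(hidden, mask, [[0.5, 0.5], [0.25, 0.25]])
print(hidden)
```

Text positions pass through unchanged; only the positions flagged as audio receive the projected embeddings and the later per-layer additions.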

Repository layout

Qolda-AVL/
├── swift/                 # ms-swift framework + our qwen3_avl audio branch (training/inference)
├── transformers/          # forked Transformers with the Qwen3AVL model (REQUIRED)
├── scripts/train/         # the two training-stage launch scripts
│   ├── audio_alignment_pretrain.sh
│   └── multimodal_finetune.sh
├── data/                  # tiny generic examples showing the expected data format
│   ├── audio-text.jsonl
│   ├── vision-text.jsonl
│   └── text-only.jsonl
├── requirements/          # upstream ms-swift dependency lists
├── requirements.txt       # curated dependencies for the audio pipeline
├── setup.py / setup.cfg   # packaging (installs the `swift` / `megatron` CLIs)
└── README.md

Installation

# 1. Create an environment (Python 3.10+ recommended)
conda create -n qolda-avl python=3.10 -y
conda activate qolda-avl

# 2. Install a CUDA-matched PyTorch build (see https://pytorch.org)
pip install torch torchaudio torchvision

# 3. Install the bundled custom Transformers FIRST (required — see below)
pip install -e ./transformers

# 4. Install Qolda-AVL (this ms-swift fork) in editable mode
pip install -e .
#    …or just the curated dependency set:
pip install -r requirements.txt

# 5. (recommended) FlashAttention — the training scripts use --attn_impl flash_attn
pip install flash-attn --no-build-isolation

⚠️ Order matters. Install ./transformers before (or instead of) any PyPI transformers, otherwise pip install -r requirements.txt may pull the upstream release, which does not know about Qwen3AVL.

Key dependencies (full list in requirements.txt / requirements/):

the bundled transformers/ fork (required for the Qwen3AVL model)
torch, torchaudio, librosa, soundfile (audio)
deepspeed, accelerate, peft, trl, datasets, modelscope
qwen-vl-utils, decord, pillow (vision/video, inherited from Qwen3-VL)
flash-attn (optional but recommended)

Custom Transformers (required)

The Qwen3AVL model classes live in a fork of Transformers (based on 5.2.0.dev0) under transformers/. Without it, AutoModel/AutoConfig cannot resolve model_type: qwen3_avl, and both training and inference will fail with an "unknown model type" error.

Install it (editable) into your environment:

# from the repo root
pip install -e ./transformers

Verify the model is registered:

python -c "import transformers, transformers.models.qwen3_avl as m; \
print('transformers', transformers.__version__); \
print('Qwen3AVL OK:', hasattr(m, 'modeling_qwen3_avl'))"

You should see a 5.2.0.dev0 version string and Qwen3AVL OK: True. After this, loading the released checkpoint works through the normal HF API:

from transformers import AutoModelForCausalLM, AutoProcessor
model = AutoModelForCausalLM.from_pretrained("issai/Qolda-AVL-5B", trust_remote_code=False)
processor = AutoProcessor.from_pretrained("issai/Qolda-AVL-5B")

If a different transformers is already installed, the editable install above replaces it in the active environment. We recommend a dedicated venv or conda env for Qolda-AVL to avoid version clashes with other projects.
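
If you are unsure which transformers the environment will actually import, a small preflight check can help. The sketch below uses only the standard library; the helper name `is_loaded_from` is hypothetical, and it assumes you run it from the repo root so that `./transformers` is the bundled fork.

```python
import importlib.util
from pathlib import Path


def is_loaded_from(module_name: str, root: str) -> bool:
    """Return True if `module_name` would be imported from a file under `root`."""
    spec = importlib.util.find_spec(module_name)
    # Missing modules and built-ins have no file location to compare.
    if spec is None or not spec.has_location or spec.origin is None:
        return False
    origin = Path(spec.origin).resolve()
    return origin.is_relative_to(Path(root).resolve())


if __name__ == "__main__":
    # Assumption: the current directory is the Qolda-AVL repo root.
    print("bundled fork active:", is_loaded_from("transformers", "./transformers"))
```

A `False` result suggests that a PyPI release is shadowing the fork, in which case re-running `pip install -e ./transformers` is the first thing to try.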

Data format

Training data is JSON-lines, one example per line, using the ms-swift conversation schema. Each modality uses its own tag in the message text and lists the corresponding file path(s) in a parallel field. The repo ships one tiny example file per modality in data/:

Audio — <audio> tag + audios field (data/audio-text.jsonl):

{"messages": [{"role": "user", "content": "<audio>Transcribe the audio into text."},
              {"role": "assistant", "content": "Good morning everyone."}],
 "audios": ["/path/to/audio/sample_01.wav"]}

Vision — <image> (or <video>) tag + images / videos field (data/vision-text.jsonl):

{"messages": [{"role": "user", "content": "<image>What is shown in this image?"},
              {"role": "assistant", "content": "A wooden table with a cup of coffee."}],
 "images": ["/path/to/image/sample_01.jpg"]}

Text-only — no media tag or field (data/text-only.jsonl):

{"messages": [{"role": "user", "content": "Explain the difference between a list and a tuple in Python."},
              {"role": "assistant", "content": "A list is mutable, while a tuple is immutable."}]}

Multiple tags are allowed; provide one path per tag in the matching field.
An optional {"role": "system", ...} message can be prepended.
Modalities can be mixed within a single example and across files — the multimodal fine-tuning stage trains on all three together.

Replace the placeholder paths with your own .wav/.flac (audio) and .jpg/.png (image) files.
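
Before launching a long run it can be worth checking that every example lists one path per tag. The following is a minimal sketch, not part of the repo: `check_example` and `check_jsonl` are hypothetical helpers, message contents are assumed to be plain strings, and the strict tag-count equality is our reading of the "one path per tag" rule above rather than a documented ms-swift check.

```python
import json
from typing import Dict, Iterable, List

# Tag in the message text -> parallel field holding the file paths.
TAG_TO_FIELD = {"<audio>": "audios", "<image>": "images", "<video>": "videos"}


def check_example(example: Dict) -> List[str]:
    """Return a list of problems found in one example (empty if none)."""
    problems = []
    messages = example.get("messages", [])
    if not messages:
        problems.append("missing 'messages'")
    text = "".join(m.get("content", "") for m in messages)
    for tag, field in TAG_TO_FIELD.items():
        n_tags = text.count(tag)
        n_paths = len(example.get(field, []))
        if n_tags != n_paths:
            problems.append(f"{n_tags} {tag} tag(s) but {n_paths} path(s) in '{field}'")
    return problems


def check_jsonl(lines: Iterable[str]) -> Dict[int, List[str]]:
    """Map 1-based line numbers to the problems found on that line."""
    report = {}
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            example = json.loads(line)
        except json.JSONDecodeError as exc:
            report[lineno] = [f"invalid JSON: {exc.msg}"]
            continue
        problems = check_example(example)
        if problems:
            report[lineno] = problems
    return report
```

Usage would look like `check_jsonl(open("data/audio-text.jsonl", encoding="utf-8"))`; an empty dict means no mismatches were found. The check does not verify that the files exist or can be decoded.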

Training

Qolda-AVL is trained in two stages. Each script is a thin wrapper around swift sft and exposes MODEL, DATA, and OUTPUT as environment variables, so you can sanity-check the pipeline on the bundled example data and then point them at your real model/data.

You first need an audio-extended base model: a Qwen3-VL-4B checkpoint (e.g. Qwen3-VL-4B-Thinking) whose audio branch (Whisper encoder + audio projector + audio DeepStack mergers) has been initialised. The released, already-trained model is on the Hub at issai/Qolda-AVL-5B.

| Stage | Script | ViT | LLM | Aligners | Audio enc. | Data |
|---|---|---|---|---|---|---|
| 1 — Audio alignment pre-training | `audio_alignment_pretrain.sh` | ❄️ frozen | ❄️ frozen | 🔥 trained | ❄️ frozen | audio / transcription |
| 2 — Multimodal fine-tuning | `multimodal_finetune.sh` | ❄️ frozen | 🔥 trained | 🔥 trained | ❄️ frozen | audio + vision + text-only |

Stage 1 — Audio alignment pre-training. Only the aligners (audio projector + audio/vision DeepStack mergers) are trained, so the audio projector learns to map the frozen audio encoder's features into the (frozen) LLM's embedding space on large-scale audio data.

Stage 2 — Multimodal fine-tuning. Continue from the Stage-1 checkpoint and unfreeze the LLM, training jointly on a mixture of audio-text, vision-text, and text-only samples. This builds a full instruction-following assistant that handles speech, images, and plain text without forgetting its original vision/text abilities. For low-resource adaptation you can switch to LoRA inside the script (--train_type lora --target_modules all-linear).
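
The freeze plan in the stage table can be written down as data, which makes it easy to see which parameter groups receive gradients in each stage. This is an illustrative sketch only: the parameter-name prefixes in `GROUP_PREFIXES` are hypothetical and will not match the real module names, and in practice the training scripts set this through ms-swift flags rather than code like this.

```python
from typing import Dict, List

# Trainable (True) or frozen (False) per component, copied from the stage table.
STAGE_TRAINABLE: Dict[int, Dict[str, bool]] = {
    1: {"vit": False, "llm": False, "aligner": True, "audio_encoder": False},
    2: {"vit": False, "llm": True, "aligner": True, "audio_encoder": False},
}

# Hypothetical prefixes; more specific entries must come before general ones.
GROUP_PREFIXES = [
    ("audio_projector.", "aligner"),
    ("audio_deepstack.", "aligner"),
    ("visual.deepstack.", "aligner"),
    ("audio_encoder.", "audio_encoder"),
    ("visual.", "vit"),
    ("language_model.", "llm"),
]


def group_of(param_name: str) -> str:
    """Map a parameter name to its component group via the first matching prefix."""
    for prefix, group in GROUP_PREFIXES:
        if param_name.startswith(prefix):
            return group
    raise KeyError(f"no group for parameter {param_name!r}")


def trainable_names(param_names: List[str], stage: int) -> List[str]:
    """Return the parameter names that would be trained in the given stage."""
    plan = STAGE_TRAINABLE[stage]
    return [name for name in param_names if plan[group_of(name)]]


names = [
    "visual.blocks.0.weight",
    "visual.deepstack.0.weight",
    "audio_encoder.layers.0.weight",
    "audio_projector.fc.weight",
    "language_model.layers.0.weight",
]
print(trainable_names(names, stage=1))
print(trainable_names(names, stage=2))
```

With real tensors the same plan would translate into setting `requires_grad` on each parameter according to its group.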

# Stage 1 — audio alignment (smoke-test on the bundled example data)
bash scripts/train/audio_alignment_pretrain.sh

# real run: override the defaults
MODEL=./pretrained/Qwen3-AVL-base \
DATA=./data/my_audio_train.jsonl \
OUTPUT=./output/audio_alignment \
bash scripts/train/audio_alignment_pretrain.sh

# Stage 2 — multimodal fine-tuning (defaults to all three example files;
# pass a space-separated list of your own jsonls via DATA)
MODEL=./output/audio_alignment/checkpoint-last \
DATA="./data/my_audio.jsonl ./data/my_vision.jsonl ./data/my_text.jsonl" \
OUTPUT=./output/multimodal_finetune \
bash scripts/train/multimodal_finetune.sh
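
The `checkpoint-last` path above is a placeholder. If your Stage-1 output directory contains step-numbered folders, a helper like the one below can pick the most recent one to pass as MODEL. It is a sketch under assumptions: `latest_checkpoint` is a hypothetical helper, and it assumes checkpoints are saved as `checkpoint-<step>` directories (the Hugging Face Trainer convention), possibly nested inside a per-run subfolder.

```python
import re
from pathlib import Path
from typing import Optional

_CKPT = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint-<step> directory with the highest step, or None."""
    best_step, best_path = -1, None
    # Search recursively because runs may be nested in a versioned subfolder.
    for path in Path(output_dir).rglob("checkpoint-*"):
        match = _CKPT.match(path.name)
        if path.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


if __name__ == "__main__":
    print(latest_checkpoint("./output/audio_alignment"))
```

Steps are compared numerically, so `checkpoint-1000` ranks above `checkpoint-900`. The highest step is not necessarily the best checkpoint by validation loss.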

Relationship to upstream ms-swift

This is a fork of ms-swift. The framework's general usage, CLI flags, and multimodal data conventions follow upstream documentation (https://github.com/modelscope/ms-swift). Our additions are confined to the qwen3_avl model/template plus the bundled transformers/ fork, so upstream features (DeepSpeed, Megatron, LoRA, packing, evaluation, etc.) work unchanged.

License

Both bundled packages are Apache-2.0 (inherited from ms-swift and 🤗 Transformers); see LICENSE.
