IS2AI/ms-swift-Qolda-AVL

Qolda-AVL: Extending a Vision-Language Model with Audio Understanding

Qolda-AVL is a custom extension of ms-swift (the ModelScope training/inference framework) that adds an audio branch to the Qwen3-VL model family, turning a Vision-Language Model (VLM) into an Audio-Vision-Language Model (AVL).

  • 🤗 Released model: issai/Qolda-AVL-5B (Qwen3-VL-4B + Whisper audio branch, ≈5B total parameters)

This repo bundles two things so the whole stack works from one place:

  1. swift/ — the ms-swift framework with our qwen3_avl audio branch (training/inference logic).
  2. transformers/ — a forked Transformers that ships the custom Qwen3AVL model classes. The model cannot load without it (see Custom Transformers).

What we added

Starting from upstream ms-swift, we add a new model type qwen3_avl that wraps Qwen3-VL with an audio encoder + audio adapter + an audio DeepStack fusion mechanism. The work spans the two bundled packages:

In transformers/ (the model itself): a new src/transformers/models/qwen3_avl/ module — Qwen3AVLConfig, Qwen3AVLForConditionalGeneration, and Qwen3AVLProcessor — plus auto-class registration. This is where the Whisper audio encoder, the audio projection MLP, and the audio DeepStack mergers live.

In swift/ (training/inference glue): the qwen3_avl model + template, which is intentionally small and self-contained:

| File | Change |
| --- | --- |
| swift/model/models/qwen.py | Core qwen3_avl loader: audio feature extraction, the audio projector, and the audio DeepStack training-forward that scatters audio embeddings into the LLM and fuses per-layer audio features alongside the vision DeepStack. |
| swift/model/model_arch.py | Registers the qwen3_avl architecture and its aligner modules (audio projector + audio/vision DeepStack mergers) so --freeze_aligner controls them. |
| swift/model/constant.py | Adds the qwen3_avl model-type constant. |
| swift/template/templates/qwen.py | Adds the qwen3_avl chat template with <audio> handling. |
| swift/template/constant.py | Adds the qwen3_avl template-type constant. |
| swift/trainers/mixin.py | Trainer tweak to support the mixed vision + audio forward. |
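The two operations named for swift/model/models/qwen.py (scattering audio embeddings into the LLM input, then fusing per-layer audio features) can be illustrated with a conceptual sketch. This is not the repository's implementation: it uses plain Python lists instead of tensors, and `AUDIO_TOKEN_ID`, `scatter_audio`, and `deepstack_add` are hypothetical names chosen for illustration.

```python
from typing import List

Vector = List[float]

# Hypothetical placeholder id; the real tokenizer value is not stated in this README.
AUDIO_TOKEN_ID = -1


def _audio_positions(token_ids: List[int]) -> List[int]:
    """Indices of the audio placeholder tokens in the sequence."""
    return [i for i, t in enumerate(token_ids) if t == AUDIO_TOKEN_ID]


def scatter_audio(token_ids: List[int], text_embeds: List[Vector],
                  audio_embeds: List[Vector]) -> List[Vector]:
    """Replace the embedding at each audio placeholder with the next audio vector."""
    positions = _audio_positions(token_ids)
    if len(positions) != len(audio_embeds):
        raise ValueError("placeholder count must match the number of audio vectors")
    merged = [list(v) for v in text_embeds]
    for pos, vec in zip(positions, audio_embeds):
        merged[pos] = list(vec)
    return merged


def deepstack_add(hidden: List[Vector], token_ids: List[int],
                  layer_feats: List[Vector]) -> List[Vector]:
    """Add one layer's audio features to the hidden states at the audio positions."""
    positions = _audio_positions(token_ids)
    if len(positions) != len(layer_feats):
        raise ValueError("placeholder count must match the number of layer features")
    out = [list(v) for v in hidden]
    for pos, feat in zip(positions, layer_feats):
        out[pos] = [h + f for h, f in zip(out[pos], feat)]
    return out
```

In this sketch, `scatter_audio` would run once on the input embeddings, and `deepstack_add` would run once per fused decoder layer with that layer's audio features; text positions are left untouched in both steps.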

Repository layout

Qolda-AVL/
├── swift/                 # ms-swift framework + our qwen3_avl audio branch (training/inference)
├── transformers/          # forked Transformers with the Qwen3AVL model (REQUIRED)
├── scripts/train/         # the two training-stage launch scripts
│   ├── audio_alignment_pretrain.sh
│   └── multimodal_finetune.sh
├── data/                  # tiny generic examples showing the expected data format
│   ├── audio-text.jsonl
│   ├── vision-text.jsonl
│   └── text-only.jsonl
├── requirements/          # upstream ms-swift dependency lists
├── requirements.txt       # curated dependencies for the audio pipeline
├── setup.py / setup.cfg   # packaging (installs the `swift` / `megatron` CLIs)
└── README.md

Installation

# 1. Create an environment (Python 3.10+ recommended)
conda create -n qolda-avl python=3.10 -y
conda activate qolda-avl

# 2. Install a CUDA-matched PyTorch build (see https://pytorch.org)
pip install torch torchaudio torchvision

# 3. Install the bundled custom Transformers FIRST (required — see below)
pip install -e ./transformers

# 4. Install Qolda-AVL (this ms-swift fork) in editable mode
pip install -e .
#    …or just the curated dependency set:
pip install -r requirements.txt

# 5. (recommended) FlashAttention — the training scripts use --attn_impl flash_attn
pip install flash-attn --no-build-isolation

⚠️ Order matters. Install ./transformers before (or instead of) any PyPI transformers, otherwise pip install -r requirements.txt may pull the upstream release, which does not know about Qwen3AVL.

Key dependencies (full list in requirements.txt / requirements/):

  • the bundled transformers/ fork (required for the Qwen3AVL model)
  • torch, torchaudio, librosa, soundfile (audio)
  • deepspeed, accelerate, peft, trl, datasets, modelscope
  • qwen-vl-utils, decord, pillow (vision/video, inherited from Qwen3-VL)
  • flash-attn (optional but recommended)
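After installation, a quick report of which key dependencies are present can catch a missing package before a training run starts. The following is a minimal sketch using only the standard library; the `KEY_PACKAGES` tuple is an illustrative subset of the list above, and the distribution names are assumed to match the PyPI names.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional

# Illustrative subset of the dependencies listed above (assumed PyPI names).
KEY_PACKAGES = (
    "transformers", "torch", "torchaudio", "librosa", "soundfile",
    "deepspeed", "accelerate", "peft", "trl", "datasets", "modelscope",
)


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if it is missing."""
    report: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    for name, version in installed_versions(KEY_PACKAGES).items():
        print(f"{name:<14} {version or 'MISSING'}")
```

For the transformers entry, the reported version should be the fork's 5.2.0.dev0 string rather than an upstream release number.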

Custom Transformers (required)

The Qwen3AVL model classes live in a fork of Transformers (based on 5.2.0.dev0) under transformers/. Without it, AutoModel/AutoConfig cannot resolve model_type: qwen3_avl, and both training and inference will fail with an "unknown model type" error.

Install it (editable) into your environment:

# from the repo root
pip install -e ./transformers

Verify the model is registered:

python -c "import transformers, transformers.models.qwen3_avl as m; \
print('transformers', transformers.__version__); \
print('Qwen3AVL OK:', hasattr(m, 'modeling_qwen3_avl'))"

You should see a 5.2.0.dev0 version string and Qwen3AVL OK: True. After this, loading the released checkpoint works through the normal HF API:

from transformers import AutoModelForCausalLM, AutoProcessor
model = AutoModelForCausalLM.from_pretrained("issai/Qolda-AVL-5B", trust_remote_code=False)
processor = AutoProcessor.from_pretrained("issai/Qolda-AVL-5B")

If you already have a different transformers installed, the editable install above will replace it in the active environment. We recommend a dedicated venv or conda env for Qolda-AVL to avoid version clashes with other projects.
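To tell in advance whether a local checkpoint depends on the fork, one can read the `model_type` field from its `config.json`, which is the value the auto classes resolve. This is a minimal sketch that assumes a standard Hugging Face checkpoint directory; `checkpoint_model_type` and `needs_custom_transformers` are hypothetical helper names.

```python
import json
from pathlib import Path

EXPECTED_MODEL_TYPE = "qwen3_avl"


def checkpoint_model_type(checkpoint_dir: str) -> str:
    """Return the model_type declared in <checkpoint_dir>/config.json ('' if absent)."""
    config_path = Path(checkpoint_dir) / "config.json"
    with config_path.open(encoding="utf-8") as f:
        return json.load(f).get("model_type", "")


def needs_custom_transformers(checkpoint_dir: str) -> bool:
    """True if the checkpoint declares the qwen3_avl model type."""
    return checkpoint_model_type(checkpoint_dir) == EXPECTED_MODEL_TYPE
```

If `needs_custom_transformers` returns True for a checkpoint, loading it in an environment without the bundled fork is expected to hit the "unknown model type" error described above.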


Data format

Training data is JSON-lines, one example per line, using the ms-swift conversation schema. Each modality uses its own tag in the message text and lists the corresponding file path(s) in a parallel field. The repo ships one tiny example file per modality in data/:

Audio: <audio> tag + audios field (data/audio-text.jsonl):

{"messages": [{"role": "user", "content": "<audio>Transcribe the audio into text."},
              {"role": "assistant", "content": "Good morning everyone."}],
 "audios": ["/path/to/audio/sample_01.wav"]}

Vision: <image> (or <video>) tag + images / videos field (data/vision-text.jsonl):

{"messages": [{"role": "user", "content": "<image>What is shown in this image?"},
              {"role": "assistant", "content": "A wooden table with a cup of coffee."}],
 "images": ["/path/to/image/sample_01.jpg"]}

Text-only — no media tag or field (data/text-only.jsonl):

{"messages": [{"role": "user", "content": "Explain the difference between a list and a tuple in Python."},
              {"role": "assistant", "content": "A list is mutable, while a tuple is immutable."}]}

  • Multiple tags are allowed; provide one path per tag in the matching field.
  • An optional {"role": "system", ...} message can be prepended.
  • Modalities can be mixed within a single example and across files — the multimodal fine-tuning stage trains on all three together.

Replace the placeholder paths with your own .wav/.flac (audio) and .jpg/.png (image) files.
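The "one path per tag" rule is easy to break when assembling large files, so a small pre-flight check can help. The sketch below is not part of ms-swift; `validate_example` and `validate_jsonl` are hypothetical helpers that count each media tag across all message contents and compare the count with the length of the matching field.

```python
import json
from typing import Dict, List

# Media tag -> field that must hold one path per tag occurrence.
TAG_FIELDS = {"<audio>": "audios", "<image>": "images", "<video>": "videos"}


def validate_example(example: Dict) -> List[str]:
    """Return a list of format problems for one example (empty list means OK)."""
    errors: List[str] = []
    messages = example.get("messages") or []
    if not messages:
        errors.append("missing or empty 'messages'")
    text = "".join(str(m.get("content", "")) for m in messages)
    for tag, field in TAG_FIELDS.items():
        n_tags = text.count(tag)
        n_paths = len(example.get(field) or [])
        if n_tags != n_paths:
            errors.append(f"{n_tags} {tag} tag(s) but {n_paths} path(s) in '{field}'")
    return errors


def validate_jsonl(path: str) -> Dict[int, List[str]]:
    """Validate a JSON-lines file; map 1-based line numbers to their problems."""
    problems: Dict[int, List[str]] = {}
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                example = json.loads(line)
            except json.JSONDecodeError as exc:
                problems[line_no] = [f"invalid JSON: {exc.msg}"]
                continue
            if not isinstance(example, dict):
                problems[line_no] = ["top-level value must be a JSON object"]
                continue
            errors = validate_example(example)
            if errors:
                problems[line_no] = errors
    return problems
```

Running `validate_jsonl("data/audio-text.jsonl")` on a well-formed file should return an empty dict. The check does not verify that the referenced media files exist.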


Training

Qolda-AVL is trained in two stages. Each script is a thin wrapper around swift sft and exposes MODEL, DATA, and OUTPUT as environment variables, so you can sanity-check the pipeline on the bundled example data and then point them at your real model/data.

You first need an audio-extended base model: a Qwen3-VL-4B checkpoint (e.g. Qwen3-VL-4B-Thinking) whose audio branch (Whisper encoder + audio projector + audio DeepStack mergers) has been initialised. The released, already-trained model is on the Hub at issai/Qolda-AVL-5B.

| Stage | Script | ViT | LLM | Aligners | Audio enc. | Data |
| --- | --- | --- | --- | --- | --- | --- |
| 1 — Audio alignment pre-training | audio_alignment_pretrain.sh | ❄️ frozen | ❄️ frozen | 🔥 trained | ❄️ frozen | audio / transcription |
| 2 — Multimodal fine-tuning | multimodal_finetune.sh | ❄️ frozen | 🔥 trained | 🔥 trained | ❄️ frozen | audio + vision + text-only |

Stage 1 — Audio alignment pre-training. Only the aligners (audio projector + audio/vision DeepStack mergers) are trained, so the audio projector learns to map the frozen audio encoder's features into the (frozen) LLM's embedding space on large-scale audio data.

Stage 2 — Multimodal fine-tuning. Continue from the Stage-1 checkpoint and unfreeze the LLM, training jointly on a mixture of audio-text, vision-text, and text-only samples. This builds a full instruction-following assistant that handles speech, images, and plain text without forgetting its original vision/text abilities. For low-resource adaptation you can switch to LoRA inside the script (--train_type lora --target_modules all-linear).
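The freeze pattern of the two stages can be written down as data and turned into ms-swift freeze flags. This is a sketch under assumptions: `--freeze_vit`, `--freeze_llm`, and `--freeze_aligner` are ms-swift options, but whether the bundled scripts pass exactly these values is not confirmed here, and `STAGES` and `freeze_flags` are hypothetical names. How the audio encoder is kept frozen is not stated in this README, so it is recorded but not mapped to a flag.

```python
from typing import Dict, List

# True = trained, False = frozen, following the stage descriptions above.
STAGES: Dict[int, Dict[str, bool]] = {
    1: {"vit": False, "llm": False, "aligner": True, "audio_encoder": False},
    2: {"vit": False, "llm": True, "aligner": True, "audio_encoder": False},
}


def freeze_flags(stage: int) -> List[str]:
    """Build --freeze_* arguments for a stage (assumed mapping, see lead-in)."""
    plan = STAGES[stage]
    flags: List[str] = []
    # The audio encoder is intentionally skipped: no flag for it is documented here.
    for component in ("vit", "llm", "aligner"):
        frozen = not plan[component]
        flags += [f"--freeze_{component}", str(frozen).lower()]
    return flags
```

Under this mapping, the only difference between the stages is `--freeze_llm`, which flips from true in Stage 1 to false in Stage 2.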

# Stage 1 — audio alignment (smoke-test on the bundled example data)
bash scripts/train/audio_alignment_pretrain.sh

# real run: override the defaults
MODEL=./pretrained/Qwen3-AVL-base \
DATA=./data/my_audio_train.jsonl \
OUTPUT=./output/audio_alignment \
bash scripts/train/audio_alignment_pretrain.sh

# Stage 2 — multimodal fine-tuning (defaults to all three example files;
# pass a space-separated list of your own jsonls via DATA)
MODEL=./output/audio_alignment/checkpoint-last \
DATA="./data/my_audio.jsonl ./data/my_vision.jsonl ./data/my_text.jsonl" \
OUTPUT=./output/multimodal_finetune \
bash scripts/train/multimodal_finetune.sh
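When launching the stage scripts from Python rather than a shell, the same three environment variables can be assembled programmatically. This is a minimal sketch; `stage_env` is a hypothetical helper, and it assumes the scripts read only MODEL, DATA, and OUTPUT as described above.

```python
import os
from typing import Dict, Sequence


def stage_env(model: str, data_files: Sequence[str], output: str) -> Dict[str, str]:
    """Copy the current environment and set MODEL, DATA, and OUTPUT for a stage script."""
    if not data_files:
        raise ValueError("at least one data file is required")
    if any(" " in path for path in data_files):
        # DATA is a space-separated list, so a path containing spaces would be split.
        raise ValueError("paths with spaces cannot be passed in a space-separated DATA list")
    env = dict(os.environ)
    env.update(MODEL=model, DATA=" ".join(data_files), OUTPUT=output)
    return env
```

The returned mapping can be passed as `env=` to `subprocess.run(["bash", "scripts/train/multimodal_finetune.sh"], env=env, check=True)` from the repo root.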

Relationship to upstream ms-swift

This is a fork of ms-swift. The framework's general usage, CLI flags, and multimodal data conventions follow upstream documentation (https://github.com/modelscope/ms-swift). Our additions are confined to the qwen3_avl model/template plus the bundled transformers/ fork, so upstream features (DeepSpeed, Megatron, LoRA, packing, evaluation, etc.) work unchanged.

License

Both bundled packages are Apache-2.0 (inherited from ms-swift and 🤗 Transformers); see LICENSE.
