# =============================================================================
# Qolda-AVL — Python dependencies (curated for the audio training pipeline)
# =============================================================================
# This project is a fork of ms-swift, whose full upstream dependency lists live
# under ./requirements/ (framework.txt, eval.txt, ...) and are also wired into
# setup.py, so `pip install -e .` pulls them in.
#
# The list below is sufficient to run the Qolda-AVL audio training pipeline. It
# pulls in requirements/framework.txt via `-r` and adds the audio-specific
# packages on top. Install a CUDA-matched PyTorch build (torch / torchaudio /
# torchvision) and the bundled transformers fork first, then run
# `pip install -r requirements.txt`.
# -----------------------------------------------------------------------------

# --- ms-swift framework dependencies (inherited from upstream) ---
-r requirements/framework.txt

# --- core deep-learning stack ---
# Install the build that matches your CUDA version, e.g. from pytorch.org
torch>=2.4
torchaudio>=2.4
torchvision

# NOTE: the Qwen3AVL model requires the bundled transformers fork in this repo:
#   pip install -e ./transformers
# Do NOT install `transformers` from PyPI: the PyPI release does not know about
# qwen3_avl.
accelerate>=0.34
deepspeed
peft>=0.11

# --- audio branch (Whisper encoder + audio DeepStack) ---
librosa            # audio loading / log-mel feature extraction
soundfile          # wav/flac I/O
av                 # robust audio/video decoding backend

# --- vision / video (inherited from Qwen3-VL) ---
qwen-vl-utils
decord

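# --- FlashAttention (needed to run the training scripts as shipped) ---
# The training scripts pass --attn_impl flash_attn, so FlashAttention must be
# installed to run them unmodified. It needs a CUDA toolchain to build. If the
# build fails during `pip install -r requirements.txt`, comment out the line
# below and install it separately:
#   pip install flash-attn --no-build-isolation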
flash-attn; platform_system == "Linux"