All notable changes to this project are documented here. This project follows Semantic Versioning and Keep a Changelog.
- `scripts/cross_corpus_urdu_sindhi.py`: trains classifiers on the full corpus of one language and tests on the full corpus of the other; covers all 5 feature sets x 3 classifiers x 2 directions = 30 transfer experiments.
- Cross-corpus Urdu ↔ Sindhi finding: catastrophic negative transfer. Best transfer UAR 0.2734 (eGeMAPS + SVM-RBF) vs within-language best 0.5699, a 29.65 pp drop. Symmetric across directions. Low-dimensional eGeMAPS transfers best; high-dimensional ComParE / IS10 transfer worst (consistent with the curse of dimensionality in cross-lingual SER). To our knowledge, no published work reports this specific result for Urdu and Sindhi.
- XLS-R audio-only validation on RAVDESS-SI: WF1 0.7727, UF1 0.7647, Acc 0.7792. +11.4 pp WF1 over the English-only wav2vec2-base baseline (0.659), and +4.5 pp over the English-only multimodal cross-attention baseline (0.728). The multilingual audio encoder beats both English-only baselines even on the English target task, which supports XLS-R as the audio backbone for the cross-lingual Indo-Aryan extension.
- results/results.md: cross-corpus section with table, observations, field implications; RAVDESS-SI table updated with XLS-R row plus a cross-lingual-finding callout.
- README.md: Active Research banner updated with cross-corpus + XLS-R headline tables.
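
The transfer grid and the UAR metric quoted in the cross-corpus entries can be sketched as follows. This is a minimal standard-library illustration, not the code in `scripts/cross_corpus_urdu_sindhi.py`: the function names `transfer_grid` and `unweighted_average_recall` are hypothetical, and the feature-set and classifier names are taken from the corpus and baseline-trainer entries in this changelog.

```python
from collections import defaultdict
from itertools import product

# Names as listed elsewhere in this changelog.
FEATURE_SETS = ["eGeMAPS", "ComParE", "IS09", "IS10", "Prosody"]
CLASSIFIERS = ["SVM-RBF", "RandomForest", "MLP"]
DIRECTIONS = [("urdu", "sindhi"), ("sindhi", "urdu")]  # (train, test)


def transfer_grid():
    """Enumerate every (feature set, classifier, direction) combination."""
    return [
        {"features": feats, "classifier": clf, "train": train, "test": test}
        for feats, clf, (train, test) in product(FEATURE_SETS, CLASSIFIERS, DIRECTIONS)
    ]


def unweighted_average_recall(y_true, y_pred):
    """UAR: the mean of per-class recall, so every class counts equally."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        hits[truth] += int(truth == pred)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


print(len(transfer_grid()))  # 30
print(unweighted_average_recall(["a", "a", "b", "b"], ["a", "b", "b", "b"]))  # 0.75
```

Because UAR averages recall per class, a model that collapses onto one emotion scores close to 1 / (number of classes) regardless of how frequent that class is.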
- Multilingual encoder configs (`configs/text_only_xlmr_meld.yaml`, `configs/audio_only_xlsr_ravdess_si.yaml`, `configs/multimodal_xlmr_xlsr_meld.yaml`) integrating `xlm-roberta-base` (100 languages) and `facebook/wav2vec2-xls-r-300m` (128 languages) into the existing pipeline.
- `src/models/lightning_module.py`: wired `freeze_text_layers` and `freeze_audio_layers` config keys through to the multimodal branch (previously only the unimodal branches respected freeze settings). Backward-compatible: defaults preserve prior behavior.
- XLM-R pipeline validated on MELD: WF1 0.579 / UF1 0.409 / Acc 0.570 with 6-layer freeze (43M of 278M params trained). Confirms the cross-lingual transformer pipeline works end-to-end on the English baseline before applying it to Indo-Aryan targets.
- Urdu-Sindhi Speech Emotion Corpus integration (Syed et al. 2020, Zenodo DOI 10.5281/zenodo.3685274) — 1,435 recordings (734 Urdu + 701 Sindhi) across 7 emotions including the unusual Sarcasm class. Five hand-crafted feature representations released (eGeMAPS, ComParE, IS09, IS10, Prosody).
- `scripts/train_urdu_sindhi_classical.py`: classical-ML baseline trainer for the Urdu-Sindhi corpus. 5-fold stratified CV across SVM-RBF, RandomForest, and MLP classifiers. Reports UAR, weighted F1, and per-class F1 with comparison against the paper baseline (Urdu UAR 56.96%, Sindhi UAR 55.29%). Runnable per (language, feature-set) combination or all-vs-all.
- `scripts/inspect_urdu_sindhi.py`: standalone .mat-file inspector for understanding the Zenodo feature format before training.
- `configs/text_only_xlmr_meld.yaml`: first attempt with `batch_size=16, freeze=0` OOM'd at epoch 0, batch 23/625 on Apple Silicon MPS (Adam optimizer state pushed allocation to 19.6/20.1 GB). Final config: `batch_size=8, accumulate_grad_batches=4` (effective batch 32 unchanged), `freeze_encoder_layers=6`, `max_text_tokens=96`. Trains to completion with ~12-14 GB peak memory.
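
The effective-batch figure quoted for the final XLM-R config follows from simple arithmetic, sketched below. The helper name `effective_batch_size` is hypothetical, and the calculation assumes single-device training.

```python
def effective_batch_size(batch_size: int, accumulate_grad_batches: int = 1) -> int:
    """Samples that contribute to each optimizer step under gradient accumulation."""
    if batch_size < 1 or accumulate_grad_batches < 1:
        raise ValueError("both values must be positive integers")
    return batch_size * accumulate_grad_batches


# Final text-only XLM-R config: batch_size=8, accumulate_grad_batches=4.
print(effective_batch_size(8, 4))  # 32
```

Smaller micro-batches reduce activation memory per forward/backward pass, while freezing encoder layers reduces the number of parameters for which Adam keeps optimizer state. The two settings therefore address different parts of the memory budget.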
- Three new MELD-7 baselines trained and evaluated end-to-end:
  - Text-only RoBERTa-base + 2-utterance context: WF1 0.609 (top of literature range 0.55–0.62)
  - Audio-only WavLM-base: WF1 0.357 (literature range 0.30–0.45; class-collapse on surprise/fear/disgust as expected on MELD)
  - Multimodal RoBERTa + WavLM cross-attention: WF1 0.590
- Headline finding: on MELD, text-only beats multimodal by 1.9 pp WF1, the opposite of RAVDESS, where multimodal won by 6.9 pp. This isn't a regression: it's the modality-complementarity phenomenon that gives a strong two-dataset story (see `results/results.md` § "Cross-experiment observations").
- New per-class F1 tables and a full "Detailed results — MELD" section in `results/results.md`.
- Long-form writeups under `docs/writeups/`: SEO blog post, IEEE-style short paper, LinkedIn announcement.
- `scripts/prepare_meld.sh`: robust download (resume, gzip integrity check, retry loop, speed-floor abort), inner-tarball extraction, hardcoded video-dir mapping (bash 3.2 / macOS compatible), per-500-file ffmpeg progress, graceful skip of corrupted .mp4s.
- `scripts/check_meld_audio.py`: standalone audit script that scans all three splits and reports missing / empty .wav files vs the CSV labels.
- `src/data/meld.py`: loader now filters rows whose .wav is missing or 0-byte at `__init__` time and prints a clean count (`dropped N rows with missing or empty audio`).
- `configs/audio_only_meld.yaml`: lowered LR from 1e-4 (the WavLM pre-training rate, prone to divergence on small fine-tuning runs) to 2e-5. Reuses the lesson learned from the RAVDESS audio-only divergence saga.
- `configs/multimodal_meld.yaml`: MPS-friendly memory profile with `batch_size: 2` and `accumulate_grad_batches: 8` (effective batch 16 unchanged), `max_audio_seconds` 8.0 → 5.0, `max_text_tokens` 128 → 96. The original config OOM'd at epoch 0, batch 154/1249, on a 24GB unified-memory M-series Mac. The new config completes 10 epochs cleanly.
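
The missing-or-empty audio filter described for `src/data/meld.py` can be sketched as below. This is an illustration under assumptions, not the loader's actual code: the helper names and the `wav_name` row key are hypothetical, and rows are modelled as plain dictionaries rather than a CSV-backed table.

```python
from pathlib import Path


def has_usable_audio(wav_path: Path) -> bool:
    """True when the file exists and contains at least one byte."""
    return wav_path.is_file() and wav_path.stat().st_size > 0


def filter_rows(rows, audio_dir):
    """Keep only rows whose audio file is present and non-empty."""
    audio_dir = Path(audio_dir)
    # "wav_name" is a hypothetical column holding the file name for each row.
    kept = [row for row in rows if has_usable_audio(audio_dir / row["wav_name"])]
    dropped = len(rows) - len(kept)
    if dropped:
        print(f"dropped {dropped} rows with missing or empty audio")
    return kept
```

Filtering once at construction time keeps the per-item loading path free of existence checks, at the cost of one `stat` call per row when the dataset is created.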
- Live deployment: huggingface.co/spaces/Shahnawazkakarh/speech-emotion-recognition is now public, running `superb/wav2vec2-base-superb-er` on a free CPU tier.
- New `🤗 Open in Spaces` badge in the main README.
- Pinned `huggingface_hub<1.0` in `space/requirements.txt` so that `HfFolder` (removed in `huggingface_hub` v1.0 but still imported by older Gradio) remains available.
- Explicitly declared `torch>=2.0,<3.0` and `torchaudio` in `space/requirements.txt` after the HF Spaces base image stopped pre-installing them for this Gradio version.
- Replaced `gr.Label(num_top_classes=5)` with `gr.Markdown` rendering manual probability bars to sidestep `gradio_client.utils.get_type` schema introspection.
- Added a defensive monkeypatch in `space/app.py` that wraps `gradio_client.utils.get_type` and `_json_schema_to_python_type` to handle bool schemas gracefully; this fully eliminates the well-known `TypeError: argument of type 'bool' is not iterable` issue from `/api/info` regardless of Gradio version.
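
The bool-schema guard described above can be illustrated with a generic wrapper. This is a sketch of the general technique, not the patch in `space/app.py`: the names `tolerate_bool_schema` and `naive_get_type` are hypothetical, `naive_get_type` is a stand-in so the example runs without Gradio installed, and returning `"Any"` for boolean schemas is an assumption made for the example.

```python
import functools
from typing import Any, Callable


def tolerate_bool_schema(fn: Callable[..., str]) -> Callable[..., str]:
    """Wrap a schema-to-type function so bare boolean schemas do not crash it."""

    @functools.wraps(fn)
    def wrapper(schema: Any, *args: Any, **kwargs: Any) -> str:
        # JSON Schema allows `true` / `false` as whole schemas. Mapping both
        # to "Any" is a simplification chosen for this sketch.
        if isinstance(schema, bool):
            return "Any"
        return fn(schema, *args, **kwargs)

    return wrapper


def naive_get_type(schema: dict) -> str:
    """Stand-in for a schema inspector that assumes its input is a dict."""
    if "const" in schema:  # raises TypeError when schema is True or False
        return "const"
    return schema.get("type", "Any")


safe_get_type = tolerate_bool_schema(naive_get_type)
print(safe_get_type({"type": "string"}))  # string
print(safe_get_type(True))  # Any

# Applying the wrapper to the real library might look like this (not run here):
# import gradio_client.utils as client_utils
# client_utils.get_type = tolerate_bool_schema(client_utils.get_type)
```

Wrapping the function rather than replacing it keeps the library's behavior for every dictionary schema and only intercepts the boolean case that triggers the `TypeError`.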
- New `space/` directory with a ready-to-deploy Gradio app for HuggingFace Spaces (`space/app.py`), Space frontmatter README (`space/README.md`), Space `requirements.txt`, and a step-by-step deploy guide (`space/README_DEPLOY.md`).
- Default app runs `superb/wav2vec2-base-superb-er` for instant inference; the deploy guide explains how to swap in a custom-trained checkpoint via HF Model Hub.
- Multimodal cross-attention SI (test = actors 21-24): Test WF1 0.728, UF1 0.731, Accuracy 0.729. Beats audio-only SI by +6.9 pp WF1 (larger margin than on the random split — multimodal generalizes to unseen speakers better than audio alone).
- Audio-only wav2vec2-base SI: Test WF1 0.659, UF1 0.631, Accuracy 0.667.
- Text-only RoBERTa-base SI: Test WF1 0.031, still at chance, as expected. Model collapsed to predicting `calm` for every input.
- Key finding: multimodal F1 on the neutral class jumped from 0.21 (audio-only) to 0.78 (+57 pp) on unseen speakers; the text branch acts as a strong disambiguator even though it's at chance overall.
- New `split_strategy: speaker_independent` option in `SERDataModule` with configurable actor lists. Three new configs: `*_ravdess_si.yaml` for text/audio/multimodal.
- New test `test_speaker_independent_split_disjoint` verifies no actor leaks across train/val/test.
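
A speaker-independent split with a disjointness check can be sketched as follows. This is not the `SERDataModule` implementation: the function name `split_by_actor` and the `actor` sample key are hypothetical. Only the test actors (21-24) come from this changelog; the train and validation actor lists are assumptions chosen for illustration.

```python
TEST_ACTORS = {21, 22, 23, 24}    # stated in this changelog
VAL_ACTORS = {17, 18, 19, 20}     # assumption for this sketch
TRAIN_ACTORS = set(range(1, 17))  # assumption for this sketch


def split_by_actor(samples, train_actors, val_actors, test_actors):
    """Assign samples to splits by actor ID, refusing overlapping actor lists."""
    groups = {
        "train": set(train_actors),
        "val": set(val_actors),
        "test": set(test_actors),
    }
    names = list(groups)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            overlap = groups[first] & groups[second]
            if overlap:
                raise ValueError(f"actors {sorted(overlap)} appear in {first} and {second}")

    splits = {name: [] for name in groups}
    for sample in samples:
        for name, actors in groups.items():
            if sample["actor"] in actors:
                splits[name].append(sample)
                break
        # Samples whose actor is in no list are left out of every split.
    return splits


samples = [{"actor": actor, "clip": clip} for actor in range(1, 25) for clip in range(2)]
splits = split_by_actor(samples, TRAIN_ACTORS, VAL_ACTORS, TEST_ACTORS)
print({name: len(items) for name, items in splits.items()})
# {'train': 32, 'val': 8, 'test': 8}
```

Validating the actor lists up front means a misconfigured split fails before training starts instead of silently leaking speakers into evaluation.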
- Updated LinkedIn URL to `linkedin.com/in/skakarh`.
- Added `skakarh.com/products` link in the SK footer.
- Multimodal cross-attention (RoBERTa + wav2vec2): Test WF1 0.858, UF1 0.851, Accuracy 0.858. Beats audio-only by +6.2 pp WF1.
- Audio-only wav2vec2-base: Test WF1 0.796, UF1 0.784, Accuracy 0.795.
- Text-only RoBERTa-base (deliberate ablation): Test WF1 0.053, near chance — confirms text-only fails on RAVDESS's 2-fixed-sentence setup.
- MPS "Unaligned blit request" bug when loading RoBERTa checkpoints via `load_from_checkpoint`. Now load to CPU first with `map_location="cpu"`, then let Lightning move to MPS during `.fit()` / `.test()`. Affects `src/evaluate.py` and `demo/gradio_app.py`.
- Demo device mismatch: input audio tensor was on CPU while model was on MPS. Now explicitly moves audio to the model's device.
- Multimodal RAVDESS config: applied lessons learned from audio-only run (LR=2e-5, freeze 8/12 audio encoder layers, class-weighted CE loss).
- `CONTRIBUTING.md`, `CITATION.cff`, `CHANGELOG.md`, GitHub issue/PR templates, CI status badge in README.
- `train.py`: skip `ckpt_path="best"` in `--fast-dev-run` mode (Lightning disables checkpointing in that mode); added `--patience` and `--skip-test` flags.
- `audio_only_ravdess.yaml`: lower LR (1e-4 → 2e-5), freeze 8/12 encoder layers, enable class weights; the initial config diverged on 1.1k samples.
- `.gitignore`: anchor `/data/`, `/outputs/`, etc. to the repo root so they don't shadow `src/data/`, `src/utils/`, etc.
- CI: relax `ruff` (ignore `N812` for Lightning's `as L` convention and `B905` zip-strict).
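
The changelog does not say how the class weights enabled for the RAVDESS configs are derived. One common choice is the inverse-frequency "balanced" heuristic, sketched below as an assumption; the function name `balanced_class_weights` is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Toy label list: "calm" is three times as frequent as "neutral".
weights = balanced_class_weights(["calm"] * 6 + ["neutral"] * 2)
print(weights["neutral"])  # 2.0
```

Rarer classes receive proportionally larger weights, so each class contributes roughly equally to a weighted cross-entropy loss.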
- Initial project scaffold: configs, data loaders (RAVDESS, MELD, IEMOCAP stub), text + audio encoders, fusion modules (concat / gated / cross-attention), Lightning training and evaluation entry points, Gradio demo, smoke tests, CI workflow.