Code and models for the paper: "No Pitch Left Behind: Addressing Gender Unbalance in Automatic Speech Recognition Through Pitch Manipulation" accepted at ASRU 2023.
To ensure complete reproducibility, we release the ASR model checkpoints used in our experiments, together with the SentencePiece model, the vocabulary file, the YAML configuration files, and the outputs obtained by each model on the tst-COMMON and tst-HE test sets:
- Baseline: checkpoint | config.yaml | tst-COMMON.out | tst-HE.out
- Baseline + VTLP: checkpoint | config.yaml | tst-COMMON.out | tst-HE.out
- Baseline + Random: checkpoint | config.yaml | tst-COMMON.out | tst-HE.out
- Baseline + Opposite: checkpoint | config.yaml | tst-COMMON.out | tst-HE.out
- Baseline + Random - Formant Shifting: checkpoint | config.yaml | tst-COMMON.out | tst-HE.out
- Baseline + Random - Formant Shifting - Gender Swapping: checkpoint | config.yaml | tst-COMMON.out | tst-HE.out
- Vocabulary: vocab.txt | spm_model
Data (MuST-C v1, en-es direction) have to be preprocessed with:
python /path/to/fbk-fairseq/examples/speech_to_text/preprocess_generic.py --data-root /data/to/mustc \
--save-dir /data/to/mustc/save_folder --wav-dir /data/to/mustc/wav_folder \
--split train, dev, tst-HE, tst-COMMON --vocab-type bpe --src-lang en --tgt-lang en \
--task asr --n-mel-bins 80 --store-waveform
The following parameters are intended for training on a system with 4 GPUs, each having 16 GB of VRAM.
The training_data and dev_data files are in TSV format, obtained after preprocessing.
The config_file is a YAML file and can be downloaded above.
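When fewer or more than 4 GPUs are available, the batch size seen by the optimizer changes unless `--update-freq` is adjusted. The sketch below assumes the usual fairseq behaviour, where each update accumulates gradients over `--update-freq` batches per GPU, so that one update covers up to `--max-tokens` × GPUs × `--update-freq` tokens. The helper name `scaled_update_freq` is hypothetical and is not part of the released code; the reference values (4 GPUs, `--update-freq 8`, `--max-tokens 10000`) are taken from the training command in this section.

```python
def scaled_update_freq(num_gpus, ref_gpus=4, ref_update_freq=8):
    """Return the --update-freq that keeps GPUs x update-freq constant.

    Hypothetical helper: the defaults mirror the 4-GPU setup described
    in this README (--update-freq 8).
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be a positive integer")
    total = ref_gpus * ref_update_freq
    if total % num_gpus != 0:
        raise ValueError(
            f"{num_gpus} GPUs cannot reproduce a GPUs x update-freq product of {total}"
        )
    return total // num_gpus


MAX_TOKENS = 10000  # value of --max-tokens in the training command

for gpus in (1, 2, 4, 8):
    update_freq = scaled_update_freq(gpus)
    tokens_per_update = gpus * update_freq * MAX_TOKENS
    print(f"{gpus} GPU(s): --update-freq {update_freq} "
          f"(up to {tokens_per_update} tokens per update)")
```

Under this assumption, every configuration listed keeps the same upper bound on tokens per update, so the learning rate and warmup settings should not need to change. With GPUs that have less than 16 GB of VRAM, `--max-tokens` may also have to be lowered, which this sketch does not cover.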
python train.py /path/to/data_folder \
--train-subset training_data --valid-subset dev_data \
--save-dir /path/to/save_folder \
--num-workers 5 --max-update 50000 --patience 10 --keep-last-epochs 13 \
--max-tokens 10000 --adam-betas '(0.9, 0.98)' \
--user-dir examples/speech_to_text \
--task speech_to_text_ctc --config-yaml config_file \
--criterion ctc_multi_loss --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--arch conformer \
--ctc-encoder-layer 8 --ctc-weight 0.5 \
--optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt \
--warmup-updates 25000 \
--clip-norm 10.0 \
--seed 1 --update-freq 8 \
--skip-invalid-size-inputs-valid-test \
--log-format simple >> /path/to/save_folder/train.log 2> /path/to/save_folder/train.err
python /path/to/fbk-fairseq/scripts/average_checkpoints.py --input /path/to/save_folder --num-epoch-checkpoints 5 --checkpoint-upper-bound $(ls /path/to/save_folder | head -n 5 | tail -n 1 | grep -o "[0-9]*") --output /path/to/save_folder/avg5.pt
Inference can be executed with the following command
(setting TEST_DATA to a TSV obtained from the preprocessing
and CONFIG_FILE to one of the YAML files provided above):
python /path/to/fbk-fairseq/fairseq_cli/generate.py /path/to/data_folder \
--gen-subset $TEST_DATA \
--user-dir examples/speech_to_text \
--max-tokens 40000 \
--config-yaml $CONFIG_FILE \
--beam 5 \
--max-source-positions 10000 \
--max-target-positions 1000 \
--task speech_to_text_ctc \
--criterion ctc_multi_loss \
--underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--no-repeat-ngram-size 5 \
--path /path/to/checkpoint > /path/to/output_file
We use the Python package JiWER to compute the word error rate. Gender-specific evaluations are performed by partitioning the test sets based on the MuST-Speaker resource.
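To make the evaluation procedure concrete, the following is a minimal, dependency-free sketch of corpus-level word error rate computed per speaker group. It is not the evaluation script used in the paper, which relies on JiWER. The function names, the toy sentences, and the group labels `"female"` and `"male"` are illustrative assumptions; in practice the labels would come from MuST-Speaker, and the text normalization applied before scoring is not specified here.

```python
from collections import defaultdict


def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = curr
    return prev[-1], len(ref)


def wer(references, hypotheses):
    """Corpus-level WER: total word errors over total reference words."""
    errors = words = 0
    for ref, hyp in zip(references, hypotheses):
        e, n = word_errors(ref, hyp)
        errors += e
        words += n
    return errors / words


def wer_by_group(references, hypotheses, groups):
    """Partition sentence pairs by group label and score each partition."""
    buckets = defaultdict(lambda: ([], []))
    for ref, hyp, group in zip(references, hypotheses, groups):
        buckets[group][0].append(ref)
        buckets[group][1].append(hyp)
    return {group: wer(refs, hyps) for group, (refs, hyps) in buckets.items()}


# Toy data; the group labels are placeholders for MuST-Speaker annotations.
refs = ["the cat sat on the mat", "hello world", "good morning everyone"]
hyps = ["the cat sat on mat", "hello word", "good morning everyone"]
groups = ["female", "male", "female"]

print(f"overall WER: {wer(refs, hyps):.3f}")
for group, score in sorted(wer_by_group(refs, hyps, groups).items()):
    print(f"{group} WER: {score:.3f}")
```

Scoring each partition separately, rather than averaging per-sentence error rates, keeps long and short utterances weighted by their number of reference words, which is the sense in which a gender gap in WER is computed here.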
@inproceedings{fucci2023pitch,
title={{No Pitch Left Behind: Addressing Gender Unbalance in Automatic Speech Recognition through Pitch Manipulation}},
author={Dennis Fucci and Marco Gaido and Matteo Negri and Mauro Cettolo and Luisa Bentivogli},
year={2023},
booktitle="IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)",
month = dec,
address="Taipei, Taiwan"
}