This is the official PyTorch implementation of the paper Leveraging Unimodal Self-supervised Learning for Multimodal Audio-visual Speech Recognition.
- Clone the repo into a directory.
git clone https://github.com/LUMIA-Group/Leveraging-Self-Supervised-Learning-for-AVSR.git
- Install all required packages.
pip install -r requirements.txt
Note that the PyTorch Lightning library does not support a wrapped ReduceLROnPlateau scheduler, so we need to modify the library manually. First, locate the installed package:
python -c "import pytorch_lightning; print(pytorch_lightning.__file__)"
vi /path/to/pytorch_lightning/trainer/optimizers.py
and comment out lines 154-156:
# scheduler["reduce_on_plateau"] = isinstance(
# scheduler["scheduler"], optim.lr_scheduler.ReduceLROnPlateau
# )
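For context, here is a minimal sketch (assumptions, not the repo's actual code) of the kind of scheduler configuration that motivates this patch: when the `ReduceLROnPlateau` scheduler is returned wrapped in another object, Lightning's `isinstance` auto-detection on those lines would silently overwrite the `reduce_on_plateau` flag, so it has to be set explicitly instead. The `WrappedPlateau` class and all hyperparameters below are hypothetical.

```python
import torch
import pytorch_lightning as pl


class WrappedPlateau:
    """Hypothetical wrapper (e.g. adding warm-up) around ReduceLROnPlateau."""

    def __init__(self, scheduler):
        self.scheduler = scheduler

    def step(self, metrics):
        # Delegate to the inner ReduceLROnPlateau.
        self.scheduler.step(metrics)


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(10, 1)  # dummy parameters so the optimizer is valid

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        plateau = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=5)
        scheduler = {
            "scheduler": WrappedPlateau(plateau),  # not itself a ReduceLROnPlateau instance
            "monitor": "val_loss",
            "reduce_on_plateau": True,  # set explicitly, since the auto-detection is disabled
        }
        return [optimizer], [scheduler]


# Quick check: the object Lightning sees is the wrapper, so the isinstance() test would fail.
model = LitModel()
_, schedulers = model.configure_optimizers()
print(type(schedulers[0]["scheduler"]).__name__)  # WrappedPlateau
```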
- Download the LRW dataset and the LRS2 dataset.
- Download the pretrained MoCo v2 model and wav2vec 2.0 model.
- Change the directories in `config.py` to paths relative to the project root directory (a hedged example of such entries is sketched after the LRW preprocessing commands below).
- Preprocess the LRW dataset.
cd trainFrontend
python saveh5.py
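For reference, here is a hedged sketch of what the relative-path entries in `config.py` might look like. The key names and paths below are illustrative placeholders, not necessarily the exact keys defined in the file; check `config.py` for the real ones.

```python
# Illustrative fragment of config.py (args is the configuration dictionary).
# Key names and paths are placeholders; only "relative to the project root" matters.
args["LRW_DATA_DIRECTORY"] = "data/LRW"            # LRW dataset
args["LRS2_DATA_DIRECTORY"] = "data/LRS2"          # LRS2 dataset
args["MOCO_FILE"] = "pretrained/moco_v2.pth.tar"   # pretrained MoCo v2 checkpoint
args["WAV2VEC_FILE"] = "pretrained/wav2vec2.pt"    # pretrained wav2vec 2.0 checkpoint
```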
- Preprocess the LRS2 dataset.
python saveh5.py
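If you want to sanity-check the preprocessing output, here is a quick hedged sketch, assuming the script writes an HDF5 file; the file name below is a placeholder for the path configured in `config.py`.

```python
import h5py

# "LRS2.h5" is a placeholder; point it at the file saveh5.py actually produced.
with h5py.File("LRS2.h5", "r") as f:
    f.visit(print)  # list every group/dataset stored in the file
```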
- Train the visual front-end on LRW.
python trainfrontend.py
- Change `args["MOCO_FRONTEND_FILE"]` in `config.py` to the trained front-end file, and set `args["MODAL"]` to choose the modality (an illustrative snippet follows the training command below).
- Train the AO and VO models first.
python train.py
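For example (the checkpoint path below is a placeholder, and the modality strings assume the AO/VO/AV naming used in this README):

```python
# Illustrative fragment of config.py.
args["MOCO_FRONTEND_FILE"] = "checkpoints/frontend/trained_frontend.pt"  # produced by trainfrontend.py
args["MODAL"] = "AO"  # audio-only; run again with "VO" for the video-only model
```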
- Then train the AV model. Before that, change `args["TRAINED_AO_FILE"]` and `args["TRAINED_VO_FILE"]` to the trained AO and VO models (an illustrative snippet follows the training command below).
python train.py
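For example (the paths below are placeholders for your trained checkpoints):

```python
# Illustrative fragment of config.py.
args["TRAINED_AO_FILE"] = "checkpoints/models/AO_model.pt"
args["TRAINED_VO_FILE"] = "checkpoints/models/VO_model.pt"
args["MODAL"] = "AV"  # assuming the AV modality string follows the README's naming
```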
- Choose the test configuration and model.
- Evaluate the visual word classification performance.
python evalfrontend.py
- Evaluate the AO/VO/AV model.
python eval.py
If you find this repo useful in your research, please consider citing our paper:
@inproceedings{pan2022leveraging,
title={Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition},
author={Pan, Xichen and Chen, Peiyu and Gong, Yichen and Zhou, Helong and Wang, Xinbing and Lin, Zhouhan},
booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
pages={4491--4503},
year={2022}
}