Visual Speech Recognition for Multiple Languages

Authors

Pingchuan Ma, Alexandros Haliassos, Adriana Fernandez-Lopez, Honglie Chen, Stavros Petridis, Maja Pantic.

Update

2023-07-26: We have released our training recipe for real-time AV-ASR, see here.

2023-06-16: We have released our training recipe for AutoAVSR, see here.

2023-03-27: We have released our AutoAVSR models for LRS3, see here.

Introduction

This is the repository of Visual Speech Recognition for Multiple Languages, which is the successor of End-to-End Audio-Visual Speech Recognition with Conformers. With this repository, you can achieve 19.1%, 1.0%, and 0.9% WER for visual, automatic (audio-only), and audio-visual speech recognition (VSR, ASR, and AV-ASR) on LRS3.

Tutorial

We provide a Colab tutorial that shows how to use our Auto-AVSR models to perform speech recognition (ASR, VSR, and AV-ASR), crop mouth ROIs, or extract visual speech features.

Demo

English -> Mandarin -> Spanish
French -> Portuguese -> Italian

Preparation

  1. Clone the repository and enter it locally:
git clone https://github.com/mpc001/Visual_Speech_Recognition_for_Multiple_Languages
cd Visual_Speech_Recognition_for_Multiple_Languages
  2. Set up the environment:
conda create -y -n autoavsr python=3.8
conda activate autoavsr
  3. Install PyTorch, torchvision, and torchaudio by following instructions here, and install all packages:
pip install -r requirements.txt
conda install -c conda-forge ffmpeg
  4. Download and extract a pre-trained model and/or language model from the model zoo to:
  • ./benchmarks/${dataset}/models

  • ./benchmarks/${dataset}/language_models

  5. [For VSR and AV-ASR] Install the RetinaFace or MediaPipe tracker.
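
As a quick sanity check after downloading, the directory layout described in the download step can be verified with a few lines of Python. This is only a sketch: the helper names `benchmark_dirs` and `missing_dirs` are hypothetical and not part of the repository, and the dataset directory name passed in (for example `LRS3`) is an assumption about how you name the folder.

```python
from pathlib import Path


def benchmark_dirs(dataset, root="."):
    """Return the two download locations for one dataset (hypothetical helper)."""
    base = Path(root) / "benchmarks" / dataset
    return {
        "models": base / "models",
        "language_models": base / "language_models",
    }


def missing_dirs(dataset, root="."):
    """List the expected directories that do not exist yet."""
    return [str(p) for p in benchmark_dirs(dataset, root).values() if not p.is_dir()]


# Example (dataset name is an assumption):
# print(missing_dirs("LRS3"))
```

An empty list means both directories are in place; it does not check that the extracted files inside them are complete.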

Benchmark evaluation

python eval.py config_filename=[config_filename] \
               labels_filename=[labels_filename] \
               data_dir=[data_dir] \
               landmarks_dir=[landmarks_dir]
  • [config_filename] is the model configuration path, located in ./configs.

  • [labels_filename] is the labels path, located in ${lipreading_root}/benchmarks/${dataset}/labels.

  • [data_dir] and [landmarks_dir] are the directories for original dataset and corresponding landmarks.

  • gpu_idx=-1 can be added to switch from cuda:0 to cpu.
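
When evaluating several configurations, it can help to assemble the command line above programmatically. The following is a minimal sketch, assuming only the `key=value` arguments documented in this section; the function name `build_eval_command` and the `use_cpu` flag are hypothetical, and the paths you pass in are your own.

```python
def build_eval_command(config_filename, labels_filename, data_dir,
                       landmarks_dir, use_cpu=False):
    """Build the argument list for eval.py from the documented key=value options."""
    cmd = [
        "python", "eval.py",
        f"config_filename={config_filename}",
        f"labels_filename={labels_filename}",
        f"data_dir={data_dir}",
        f"landmarks_dir={landmarks_dir}",
    ]
    if use_cpu:
        # Documented switch from cuda:0 to cpu.
        cmd.append("gpu_idx=-1")
    return cmd


# Example: pass the list to subprocess.run(cmd, check=True) from the repository root.
```

Building a list rather than a single string avoids shell quoting problems when paths contain spaces.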

Speech prediction

python infer.py config_filename=[config_filename] data_filename=[data_filename]
  • data_filename is the path to the audio/video file.

  • detector=mediapipe can be added to switch from RetinaFace to MediaPipe tracker.
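
To transcribe several files in one go, one option is to generate one `infer.py` invocation per input. This is a sketch under stated assumptions: `build_infer_commands` is a hypothetical helper, and `mediapipe` is the only `detector` value documented above, so other values are not validated here.

```python
def build_infer_commands(config_filename, data_filenames, detector=None):
    """Return one infer.py argument list per input file."""
    commands = []
    for data_filename in data_filenames:
        cmd = [
            "python", "infer.py",
            f"config_filename={config_filename}",
            f"data_filename={data_filename}",
        ]
        if detector is not None:
            # e.g. detector="mediapipe" to switch away from RetinaFace
            cmd.append(f"detector={detector}")
        commands.append(cmd)
    return commands


# Example: run each command with subprocess.run(cmd, check=True).
```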

Mouth ROIs cropping

python crop_mouth.py data_filename=[data_filename] dst_filename=[dst_filename]
  • dst_filename is the path where the cropped mouth will be saved.
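
When cropping many videos, a destination path is needed for each input. The sketch below mirrors the source folder layout under a separate output root; `build_crop_command` is a hypothetical helper, and keeping the same relative path and file extension for the output is an assumption rather than a requirement stated here.

```python
from pathlib import PurePosixPath


def build_crop_command(src, src_root, dst_root):
    """Build a crop_mouth.py argument list, mirroring src's path under dst_root."""
    src = PurePosixPath(src)
    # relative_to raises ValueError if src is not located under src_root.
    dst = PurePosixPath(dst_root) / src.relative_to(src_root)
    return [
        "python", "crop_mouth.py",
        f"data_filename={src}",
        f"dst_filename={dst}",
    ]


# Example: build_crop_command("videos/a/clip.mp4", "videos", "rois")
```

The destination directory may need to exist before the script writes to it; creating it first with `os.makedirs(..., exist_ok=True)` is a safe precaution.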

Model zoo

Overview

We support a number of datasets for speech recognition:

AutoAVSR models

Lip Reading Sentences 3 (LRS3)

| Components | WER | URL | Size (MB) |
|---|---|---|---|
| Visual-only | 19.1 | GoogleDrive or BaiduDrive(key: dqsy) | 891 |
| Audio-only | 1.0 | GoogleDrive or BaiduDrive(key: dvf2) | 860 |
| Audio-visual | 0.9 | GoogleDrive or BaiduDrive(key: sai5) | 1540 |
| Language models | - | GoogleDrive or BaiduDrive(key: t9ep) | 191 |
| Landmarks | - | GoogleDrive or BaiduDrive(key: mi3c) | 18577 |
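
For readers unfamiliar with the metrics reported in the model zoo, word error rate (WER) is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words; character error rate (CER) is the same at character level. The sketch below implements the standard definitions. It is not the repository's scoring code, which may normalise text differently, and dropping spaces before computing CER is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate as a fraction of the reference word count."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def cer(reference, hypothesis):
    """Character error rate; spaces are dropped first (an assumption)."""
    ref = reference.replace(" ", "")
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hypothesis.replace(" ", "")) / len(ref)
```

For example, `wer("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion over six reference words, giving 2/6.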

VSR for multiple languages models

Lip Reading Sentences 2 (LRS2)

| Components | WER | URL | Size (MB) |
|---|---|---|---|
| Visual-only | 26.1 | GoogleDrive or BaiduDrive(key: 48l1) | 186 |
| Language models | - | GoogleDrive or BaiduDrive(key: 59u2) | 180 |
| Landmarks | - | GoogleDrive or BaiduDrive(key: 53rc) | 9358 |

Lip Reading Sentences 3 (LRS3)

| Components | WER | URL | Size (MB) |
|---|---|---|---|
| Visual-only | 32.3 | GoogleDrive or BaiduDrive(key: 1b1s) | 186 |
| Language models | - | GoogleDrive or BaiduDrive(key: 59u2) | 180 |
| Landmarks | - | GoogleDrive or BaiduDrive(key: mi3c) | 18577 |

Chinese Mandarin Lip Reading (CMLR)

| Components | CER | URL | Size (MB) |
|---|---|---|---|
| Visual-only | 8.0 | GoogleDrive or BaiduDrive(key: 7eq1) | 195 |
| Language models | - | GoogleDrive or BaiduDrive(key: k8iv) | 187 |
| Landmarks | - | GoogleDrive or BaiduDrive(key: 1ret) | 3721 |

CMU Multimodal Opinion Sentiment, Emotions and Attributes (CMU-MOSEAS)

| Components | Language | WER | URL | Size (MB) |
|---|---|---|---|---|
| Visual-only | Spanish | 44.5 | GoogleDrive or BaiduDrive(key: m35h) | 186 |
| Visual-only | Portuguese | 51.4 | GoogleDrive or BaiduDrive(key: wk2h) | 186 |
| Visual-only | French | 58.6 | GoogleDrive or BaiduDrive(key: t1hf) | 186 |
| Language models | Spanish | - | GoogleDrive or BaiduDrive(key: 0mii) | 180 |
| Language models | Portuguese | - | GoogleDrive or BaiduDrive(key: l6ag) | 179 |
| Language models | French | - | GoogleDrive or BaiduDrive(key: 6tan) | 179 |
| Landmarks | - | - | GoogleDrive or BaiduDrive(key: vsic) | 3040 |

GRID

| Components | Split | WER | URL | Size (MB) |
|---|---|---|---|---|
| Visual-only | Overlapped | 1.2 | GoogleDrive or BaiduDrive(key: d8d2) | 186 |
| Visual-only | Unseen | 4.8 | GoogleDrive or BaiduDrive(key: ttsh) | 186 |
| Landmarks | - | - | GoogleDrive or BaiduDrive(key: 16l9) | 1141 |

You can include data_ext=.mpg in your command line to match the video file extension in the GRID dataset.

Lombard GRID

| Components | Split | WER | URL | Size (MB) |
|---|---|---|---|---|
| Visual-only | Unseen (Front Plain) | 4.9 | GoogleDrive or BaiduDrive(key: 38ds) | 186 |
| Visual-only | Unseen (Side Plain) | 8.0 | GoogleDrive or BaiduDrive(key: k6m0) | 186 |
| Landmarks | - | - | GoogleDrive or BaiduDrive(key: cusv) | 309 |

You can include data_ext=.mov in your command line to match the video file extension in the Lombard GRID dataset.

TCD-TIMIT

| Components | Split | WER | URL | Size (MB) |
|---|---|---|---|---|
| Visual-only | Overlapped | 16.9 | GoogleDrive or BaiduDrive(key: jh65) | 186 |
| Visual-only | Unseen | 21.8 | GoogleDrive or BaiduDrive(key: n2gr) | 186 |
| Language models | - | - | GoogleDrive or BaiduDrive(key: 59u2) | 180 |
| Landmarks | - | - | GoogleDrive or BaiduDrive(key: bnm8) | 930 |

Citation

If you use the AutoAVSR models or training code, please consider citing the following paper:

@inproceedings{ma2023auto,
  author={Ma, Pingchuan and Haliassos, Alexandros and Fernandez-Lopez, Adriana and Chen, Honglie and Petridis, Stavros and Pantic, Maja},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels}, 
  year={2023},
}

If you use the VSR models for multiple languages, please consider citing the following paper:

@article{ma2022visual,
  title={{Visual Speech Recognition for Multiple Languages in the Wild}},
  author={Ma, Pingchuan and Petridis, Stavros and Pantic, Maja},
  journal={{Nature Machine Intelligence}},
  volume={4},
  pages={930--939},
  year={2022},
  url={https://doi.org/10.1038/s42256-022-00550-z},
  doi={10.1038/s42256-022-00550-z}
}

License

Note that the code may be used only for comparative or benchmarking purposes. Users may use code supplied under a License only for non-commercial purposes.

Contact

[Pingchuan Ma](pingchuan.ma16[at]imperial.ac.uk)
