Jointly Fine-Tuning “BERT-like” Self Supervised Models to Improve Multimodal Speech Emotion Recognition

This repository contains the PyTorch code for multimodal emotion recognition with pretrained RoBERTa and Speech-BERT.

Basic structure of the code

Inspiration from fairseq

This code structure is built on top of the Fairseq interface.
Fairseq is an open-source project by the Facebook AI team that combines different SOTA architectures for sequential data processing.
It also includes SOTA optimization mechanisms such as early stopping, warm-up learning rates, and learning rate schedulers.
We are developing our own architecture so that it is compatible with the Fairseq interface.
For more background, please read the paper published about the Fairseq interface.

Merging of our own architecture with Fairseq interface

This can be a bit tricky in the beginning. First, it is important to understand that Fairseq is built so that all architectures can be accessed through terminal commands (args).
Since our architecture shares many properties with the transformer architecture, we followed a tutorial that describes how to use RoBERTa for a custom classification task.
We built our architecture by adding new code to the following directories in the Fairseq interface.
- fairseq/data
- fairseq/models
- fairseq/modules
- fairseq/tasks
- fairseq/criterions
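
As a rough illustration of why new code placed in these directories becomes reachable from the terminal, the sketch below shows a name-to-class registry of the kind Fairseq exposes through decorators such as register_task. It is a simplified stand-in, not Fairseq's implementation: TASK_REGISTRY, the register_task defined here, build_task, and the EmotionPredictionTask class body are all hypothetical.

```python
# Simplified stand-in for a name-to-class registry.
# Everything defined here is hypothetical and only mimics the idea.
TASK_REGISTRY = {}


def register_task(name):
    """Return a class decorator that files the class under `name`."""
    def decorator(cls):
        if name in TASK_REGISTRY:
            raise ValueError(f"duplicate task name: {name}")
        TASK_REGISTRY[name] = cls
        return cls
    return decorator


@register_task("emotion_prediction")
class EmotionPredictionTask:
    """Placeholder task that only remembers where its data lives."""

    def __init__(self, data):
        self.data = data


def build_task(name, **kwargs):
    """Look up the class registered under `name` and instantiate it."""
    if name not in TASK_REGISTRY:
        raise KeyError(f"unknown task: {name}")
    return TASK_REGISTRY[name](**kwargs)


# The string passed on the command line (e.g. --task emotion_prediction)
# selects the class; here the lookup is done directly.
task = build_task("emotion_prediction", data="./T_data/iemocap")
print(type(task).__name__, task.data)
```

The same idea applies to models and criterions: the name given to --arch or --criterion is resolved to a class registered under that name.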

Main scripts of the code

Our main scripts are categorized into the following parts

The custom dataloader for loading raw audio, face frames, and text is in fairseq/data/raw_audio_text_dataset.py
The emotion prediction task, similar to other tasks such as translation, is in fairseq/tasks/emotion_prediction.py
The custom architecture of our model, similar to RoBERTa and wav2vec, is in fairseq/models/mulT_emo.py
The cross-attention was implemented by modifying the self-attention scripts in the original Fairseq repository. They can be found in fairseq/modules/transformer_multi_encoder.py and fairseq/modules/transformer_layer.py
Finally, the custom loss function and evaluation scripts can be found in fairseq/criterions/emotion_prediction_cri.py
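
To make the difference between self-attention and cross-attention concrete, here is a minimal, dependency-free sketch of single-head scaled dot-product cross-attention, where queries come from one modality (text) and keys and values come from the other (speech). It is not the code in transformer_layer.py: it omits the learned projections, multiple heads, masking, and dropout, and the function names and toy vectors are assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Attend from each query vector over the key/value sequence.

    queries: list of d-dimensional vectors from one modality.
    keys, values: equal-length lists of vectors from the other modality.
    Returns one output vector per query.
    """
    d = len(queries[0])
    outputs = []
    for q in queries:
        # Scaled dot product between this query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Toy inputs: 3 text positions attend over 2 speech positions.
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
speech = [[1.0, 0.0], [0.0, 1.0]]
attended = cross_attention(text, speech, speech)
print(len(attended), len(attended[0]))
```

In self-attention the three arguments would all come from the same sequence; swapping the key and value source for the other modality is the only change this sketch makes.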

Prerequisite models

Please use the following links to download the pretrained SSL models and save them in a separate folder named pretrained_ssl.

For speech features - VQ-wav2vec
For sentence (text) features - RoBERTa

Preprocessing data

We tokenized both the speech and text data and then fed them into training.

For text data, we first tokenized it with the RoBERTa tokenizer and saved each example into a separate text file.
To preprocess speech data, please refer to the script convert_aud_to_token.py.
The preprocessed datasets and their labels can be found in this Google Drive folder.
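
The "one example per text file" layout described above can be sketched as follows. The on-disk format (space-separated token IDs on one line), the utterance names, and the token IDs are assumptions made for illustration; the actual files produced by the preprocessing scripts may differ.

```python
import tempfile
from pathlib import Path


def save_examples(token_ids_by_name, out_dir):
    """Write each example's token IDs to its own <name>.txt file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, ids in token_ids_by_name.items():
        path = out_dir / f"{name}.txt"
        path.write_text(" ".join(str(i) for i in ids) + "\n", encoding="utf-8")
        paths.append(path)
    return paths


def load_example(path):
    """Read one example back as a list of integer token IDs."""
    return [int(tok) for tok in Path(path).read_text(encoding="utf-8").split()]


# Hypothetical tokenized utterances; the IDs are placeholders.
examples = {"utt_0001": [0, 713, 16, 2], "utt_0002": [0, 100, 2]}

with tempfile.TemporaryDirectory() as tmp:
    written = save_examples(examples, tmp)
    restored = load_example(written[0])

print(restored)
```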

Terminal Commands

We followed the Fairseq terminal commands to train and validate our models.

Useful commands

--data - folder that contains the filenames, sizes, and labels of your raw data (please refer to the T_data folder).
--data-raw - path to your raw data folder that contains the tokenized speech and text.
--binary-target-iemocap - train the model with IEMOCAP data for binary accuracy.
--regression-target-mos - train the model with CMU-MOSEI/CMU-MOSI data to predict sentiment scores.
For dataset-specific training commands, please refer to emotion_prediction.py.
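
For readers unfamiliar with how such options are parsed, the following sketch declares the four options above with Python's argparse. It is not the definition used in emotion_prediction.py: treating the two target options as boolean switches is an assumption based on how they appear in the training command, and the help strings are paraphrased.

```python
import argparse


def build_parser():
    """Hypothetical subset of the options listed above."""
    parser = argparse.ArgumentParser(description="emotion prediction options (sketch)")
    parser.add_argument("--data", required=True,
                        help="folder with filenames, sizes, and labels")
    parser.add_argument("--data-raw", required=True,
                        help="folder with tokenized speech and text")
    # Assumed to be on/off switches that take no value.
    parser.add_argument("--binary-target-iemocap", action="store_true",
                        help="train on IEMOCAP for binary accuracy")
    parser.add_argument("--regression-target-mos", action="store_true",
                        help="train on CMU-MOSEI/CMU-MOSI for sentiment score")
    return parser


args = build_parser().parse_args([
    "--data", "./T_data/iemocap",
    "--data-raw", "./iemocap_data/",
    "--binary-target-iemocap",
])
print(args.data, args.data_raw, args.binary_target_iemocap, args.regression_target_mos)
```

Note that argparse converts dashes in option names to underscores, so --data-raw is read back as args.data_raw.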

Training Command

CUDA_VISIBLE_DEVICES=8,7 python train.py --data ./T_data/iemocap --restore-file None --task emotion_prediction --reset-optimizer --reset-dataloader --reset-meters --init-token 0 --separator-token 2 --arch robertEMO_large --criterion emotion_prediction_cri --num-classes 8 --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 --clip-norm 0.0 --lr-scheduler polynomial_decay --lr 1e-05 --total-num-update 2760 --warmup-updates 165 --max-epoch 10 --best-checkpoint-metric loss --encoder-attention-heads 2 --batch-size 1 --encoder-layers-cross 1 --no-epoch-checkpoints --update-freq 8 --find-unused-parameters --ddp-backend=no_c10d --binary-target-iemocap --a-only --t-only --pooler-dropout 0.1 --log-interval 1 --data-raw ./iemocap_data/
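
The learning-rate options in this command (--lr-scheduler polynomial_decay, --lr 1e-05, --warmup-updates 165, --total-num-update 2760) describe a linear warm-up followed by a polynomial decay. The sketch below reproduces that shape under the assumptions that the decay power is 1 and the final learning rate is 0; polynomial_decay_lr is a hypothetical helper, not a Fairseq function.

```python
def polynomial_decay_lr(step, peak_lr=1e-05, warmup_updates=165,
                        total_num_update=2760, power=1.0, end_lr=0.0):
    """Learning rate at a given update under warm-up plus polynomial decay.

    power and end_lr are assumed values; the other defaults are taken
    from the training command.
    """
    if warmup_updates > 0 and step <= warmup_updates:
        # Linear warm-up from 0 to peak_lr.
        return peak_lr * (step / warmup_updates)
    if step >= total_num_update:
        return end_lr
    # Fraction of the decay phase that is still ahead.
    remaining = 1 - (step - warmup_updates) / (total_num_update - warmup_updates)
    return (peak_lr - end_lr) * remaining ** power + end_lr


for step in (0, 165, 1000, 2000, 2760):
    print(step, polynomial_decay_lr(step))
```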

Validation Command

CUDA_VISIBLE_DEVICES=1 python validate.py --data ./T_data/iemocap --path './checkpoints/checkpoint_best.pt' --task emotion_prediction --valid-subset test --batch-size 4

Additional

If you want to preprocess the data again, please refer to this repository.


shamanez/BERT-like-is-All-You-Need
