Awesome Speech Emotion Recognition

Emotions are all over the place
Inside Out 2 (2024) by Pixar Animation Studios

Topics

Reviews
Databases
Developing
Training
Publishing
Learning
Other Awesome Material

Reviews

2024

Gan, C., Zheng, J., Zhu, Q., Cao, Y., & Zhu, Y. (2024). "A survey of dialogic emotion analysis: Developments, approaches and perspectives", Pattern Recognition, 156, 110794. doi:10.1016/j.patcog.2024.110794
A survey of methods and datasets in dialogic emotion analysis based on natural language processing from 2017 to 2024
G. Hu, Y. Xin, W. Lyu, H. Huang, C. Sun, Z. Zhu, L. Gui, and R. Cai, “Recent Trends of Multimodal Affective Computing: A Survey from NLP Perspective”, arXiv, Sep. 11, 2024.
A survey presenting recent trends in multimodal affective computing from an NLP perspective covering four tasks: multimodal sentiment analysis, multimodal emotion recognition in conversation, multimodal aspect-based sentiment analysis, and multimodal multi-label emotion recognition
I. Saadi, D. W. cunningham, A. Taleb-Ahmed, A. Hadid, and Y. E. Hillali, “Driver’s facial expression recognition: A comprehensive survey”, Expert Systems with Applications, vol. 242, p. 122784, May 2024, doi: 10.1016/j.eswa.2023.122784.
A comprehensive survey from 2018 to 2024 on recognizing facial expressions of drivers
H. H. Mustafa, N. R. Darwish, and H. A. Hefny, “Automatic Speech Emotion Recognition: a Systematic Literature Review”, International Journal of Speech Technology, vol. 27, no. 1, pp. 267–285, Mar. 2024, doi: 10.1007/s10772-024-10096-7.
A systematic literature review on Automatic Speech Emotion Recognition from 2011 to 2023
H. Barakat, O. Turk, and C. Demiroglu, “Deep learning-based expressive speech synthesis: a systematic review of approaches, challenges, and resources”, EURASIP Journal on Audio, Speech, and Music Processing, vol. 2024, no. 1, p. 11, Feb. 2024, doi: 10.1186/s13636-024-00329-7.
A systematic review of the literature on expressive speech synthesis models published within the last 5 years, with a particular emphasis on approaches based on deep learning
S. K. Khare, V. Blanes-Vidal, E. S. Nadimi, and U. R. Acharya, “Emotion recognition and artificial intelligence: A systematic review (2014–2023) and research recommendations”, Information Fusion, vol. 102, p. 102019, 2024, doi: https://doi.org/10.1016/j.inffus.2023.102019.
A systematic review of emotion recognition from different input signals (e.g., physical, physiological)

2023

J. de Lope and M. Graña, “An ongoing review of speech emotion recognition”, Neurocomputing, vol. 528, pp. 1–11, Apr. 2023, doi: 10.1016/j.neucom.2023.01.002.
A comprehensive review of most popular datasets, and current machine learning and neural networks models for SER
A. Gandhi, K. Adhvaryu, S. Poria, E. Cambria, and A. Hussain, “Multimodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions”, Information Fusion, vol. 91, pp. 424–444, Mar. 2023, doi: 10.1016/j.inffus.2022.09.025.
A review on multimodal fusion architectures

2022

E. H. Houssein, A. Hammad, and A. A. Ali, “Human emotion recognition from EEG-based brain-computer interface using machine learning: a comprehensive review”, Neural Computing and Applications, May 2022, doi: 10.1007/s00521-022-07292-4.
Human emotion recognition using EEG-based brain signals and machine learning
K. B. Bhangale and M. Kothandaraman, “Survey of Deep Learning Paradigms for Speech Processing”, Wireless Personal Communications, Mar. 2022, doi: 10.1007/s11277-022-09640-y.
Machine learning techniques for speech processing
S. Saganowski, “Bringing Emotion Recognition Out of the Lab into Real Life: Recent Advances in Sensors and Machine Learning”, Electronics, vol. 11, no. 3, Art. no. 3, Feb. 2022, doi: 10.3390/electronics11030496.
Advancements in sensors and machine learning methods and techniques

2021

T. M. Wani, T. S. Gunawan, S. A. A. Qadri, M. Kartiwi, and E. Ambikairajah, “A Comprehensive Review of Speech Emotion Recognition Systems”, IEEE Access, vol. 9, pp. 47795–47814, Mar. 2021, doi: 10.1109/ACCESS.2021.3068045.
SER systems' varied design components/methodologies, databases
E. Lieskovska, M. Jakubec, R. Jarina, and M. Chmulik, “A Review on Speech Emotion Recognition Using Deep Learning and Attention Mechanism”, Electronics, vol. 10, p. 1163, May 2021, doi: 10.3390/electronics10101163.
Extensive comparison of Deep Learning architectures, mainly on the IEMOCAP benchmark database
Md. Shah Fahad, A. Ranjan, J. Yadav, and A. Deepak, “A survey of speech emotion recognition in natural environment”, Digital Signal Processing, vol. 110, p. 102951, Mar. 2021, doi: 10.1016/j.dsp.2020.102951.
A comprehensive survey of SER in the natural environment, various issues of SER in the natural environment, databases, feature extraction, and models
S. P. Yadav, S. Zaidi, A. Mishra, and V. Yadav, “Survey on Machine Learning in Speech Emotion Recognition and Vision Systems Using a Recurrent Neural Network (RNN)”, Archives of Computational Methods in Engineering, vol. 29, no. 3, pp. 1753–1770, May 2022, doi: 10.1007/s11831-021-09647-x.
A survey of deep learning algorithms in speech and vision applications and restrictions
P. Koromilas and T. Giannakopoulos, “Deep Multimodal Emotion Recognition on Human Speech: A Review”, Applied Sciences, vol. 11, no. 17, Art. no. 17, Jan. 2021, doi: 10.3390/app11177962.
A comprehensive review of the latest advancements in multimodal speech emotion recognition methods.

Databases

Russell's circumplex model of affect [1] is a model of human emotion that posits that all emotions can be represented as points on a two-dimensional space, with one dimension representing valence (pleasantness vs. unpleasantness) and the other dimension representing arousal (activation vs. deactivation). Valence refers to the positive and negative degree of emotion and arousal refers to the intensity of emotion. Most categorical emotions used in SER databases are based on this model

A graphical representation of the circumplex model of affect with the horizontal axis representing the valence or pleasant dimension and the vertical axis representing the arousal or activation dimension [2].

[1] Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39(6), 1161–1178. https://doi.org/10.1037/h0077714
[2] Valenza, G., Citi, L., Lanatá, A. et al. Revealing Real-Time Emotional Responses: a Personalized Assessment based on Heartbeat Dynamics. Sci Rep 4, 4998 (2014). https://doi.org/10.1038/srep04998

Datasets for Emotion Recognition

Dataset	Lang	Size	Type	Emotions	Modalities	Resolution
AffectNet	N/A	~450.000 subjects	Natural	Continuous valence/arousal values and categorical emotions: anger, contempt, disgust, fear, happiness, neutral, sadness, surprise	Visual	425x425
Belfast Naturalistic Database	Spanish	127 multi-cultural speakers of 298 emotional clips	Natural	Amusement, anger, disgust, fear, frustration, sadness, surprise	Audio Visual	N/A
Berlin Database of Emotional Speech (Emo-DB)	German	5 male and 5 female speakers, with more than 500 utterances	Acted	Anger, boredom, disgust, fear/anxiety, happiness, neutral, sadness	Audio	Audio: 48kHz, downsampled to 16kHz Formats:.wav
CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI)	English	1000 gender-balanced YouTube speakers, 23500 sentences	Natural	Sentiment: Negative, weakly negative, neutral, weakly positive, positive Emotions: Anger, disgust, fear, happiness, sadness, surprise	Audio Visual Text	N/A
CaFE	French (Canadian)	12 actors, 6 males, and 6 females, 6 sentences	Acted	Anger, disgust, fear, happiness, neutral, sadness, surprise in two different intensities	Audio	Audio: 192kHz and 48kHz Formats:.aiff
CREMA-D	English	91 actors,48 males, and 43 females, 12 sentences	Acted	Anger, disgust, fear, happy, neutral and sad / Emotional Intensity	Audio Visual	Audio: 16kHz Formats:.wav,.mp3,.flv
EmoV-DB	English (North American), French (Belgian)	4 English, (2 males, and 2 females) and 1 French (male)	Acted	Amused, anger, disgust, neutral, and sleepiness	Audio	Audio: 16kHz Formats: wav
EMOVIE	Chinese (Mandarin)	9,754 utterances from television programs and movies	Natural	Emotion Polarity (-1, -0.5, 0, 0.5 and 1)	Audio	Audio: 22.05kHz Formats: wav
ESD	English (North American), Chinese (Mandarin)	10 English (5 males, 5 females) and 10 Chinese (5 males, 5 females) speakers, 700 utterances	Acted	Anger, happiness, neutral, sadness, surprise	Audio Text	Audio: 16kHz Formats: wav
Interactive Emotional Motion Capture (USC-IEMOCAP)	English	A 12h multimodal and multispeaker (5 males and 5 females) database	Acted Elicited	Anger, frustration, happiness, neutral, sadness as well as dimensional labels such as valence, activation and dominance	Audio Visual	Audio: 48kHz Video: 120 fps
MELD: Multimodal EmotionLines Dataset	English	More than 13000 utterances from multiple speakers	Natural	Anger, disgust, fear, joy, neutral, non-neutral, sadness, surprise	Audio Visual Text	Audio: 16bit PCM Formats: .wav
OMG-Emotion	English	10 hours of YouTube videos around 1min long	Natural	Continuous valence/arousal values and categorical emotions: anger, disgust, fear, happiness, neutral, sadness, surprise	Audio Visual Text	N/A
RESD (Russian Emotional Speech Dialogs)	Russian	3.5 hours of live speech	Natural	Anger, disgust, fear, enthusiasm, happiness, neutral and sadness	Audio	Audio: 16 or 44.1kHz Formats: .wav
Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)	English (North American)	A database of emotional speech and song of 12 males and 12 females	Acted	Anger, disgust, calmness, fear, happiness, neutral, sadness, surprise	Audio Visual	Audio: 48kHz - 16bit Video: 720p Formats: .wav,.mp4
SEMAINE	English	95 sessions of human-agent interactions	Natural	4D Emotional space	Audio Visual Text	N/A
Surrey Audio-Visual Expressed Emotion (SAVEE)	English	4 male speakers x 480 utterances	Acted	Anger, disgust, fear, happiness, neutral, sadness, surprise	Audio Visual	Audio: 44.1kHz - mono - 16bit Video: 256p - 60fps Formats: .wav, .avi
SUSAS	English	Speech under stress corpus with more than 16000 utterances from 32 speakers (13 females, 19 males)	Acted	Ten stress styles such as speaking style, single tracking task, and Lombard effect domain	Audio	8kHz, 8bit PCM

Datasets for Sound Classification

Database	Year	Type	Resolution
VGGSound	2020	more than 210k videos for 310 audio classes	N/A, 10sec long
AudioSet	2017	2.1 million sound clips from YouTube videos, 632 audio event classes	N/A, 10sec long
UrbanSound8K	2014	Urban sound excerpts	sampling rate may vary from file to file, duration<=4sec
ESC-50	2000	Environmental audio recordings	44.1kHz, mono, 5sec long

Audio/Visual Room Acoustics Datasets

Database	Year	Description
Real Acoustic Fields (RAF)	2024	An audio-visual room acoustics dataset and benchmark
Room Impulse Responses Datasets	2023	A list of publicly available room impulse response datasets and scripts to download them

Developing

Software

C++
- Essentia | |A C++ library for audio and music analysis, description, and synthesis, including Python bindings
- openSMILE | |An open-source toolkit for audio feature extraction and classification of speech and music signals, including a C API with Python and C# bindings
MATLAB
- Audio Toolbox | Provides tools for audio processing, speech analysis, and acoustic measurement
- Covarep | | A Cooperative Voice Analysis Repository for Speech Technologies
Python Libraries
- Aubio | | Free, open source library for audio and music analysis * Librosa | | A Python package for music and audio analysis
- OpenSoundscape | | A Python utility library for analyzing bioacoustic data
- Parselmouth | | A Pythonic interface to the Praat software
- PyAudioAnalysis | | A Python library that provides a wide range of audio-related functionalities focusing on feature extraction, classification, segmentation, and visualization issues
- Pydub | | Manipulate audio with a simple and easy high-level interface
- SoundFile | | A python library for audio IO processing

Tools

Audacity | | Free, open source, cross-platform audio software
AudioGPT | | Solve AI tasks with speech, music, sound, and talking head understanding and generation in multi-round dialogues, which empower humans to create rich and diverse audio content with unprecedented ease (paper)
ESPNet | | ESPnet is an end-to-end speech processing toolkit covering end-to-end speech recognition, text-to-speech, speech translation, speech enhancement, speaker diarization, and spoken language understanding
Kaldi | | Kaldi is an automatic speech recognition toolkit
NVIDIA NeMo | | NVIDIA NeMo is a conversational AI toolkit built for researchers working on automatic speech recognition (ASR), natural language processing (NLP), and text-to-speech synthesis (TTS)
S3PRL | | A toolkit targeting for Self-Supervised Learning for speech processing. It supports three major features: i) Pre-training, ii) Pre-trained models (Upstream) collection, and iii) Downstream Evaluation
SpeechBrain | | A PyTorch speech and all-in-one conversational AI toolkit

Training

Audio/Speech Data Augmentation

Audiomentations - A Python library for audio data augmentation
torch-audiomentations - Fast audio data augmentation in PyTorch

Embeddings

Audio/Speech
- Data2vec 2.0
- DeepSpectrum
- ECAPA-TDNN
- emotion2vec
- HuBERT
- OpenL3
- Resemblyzer
- UniSpeech-SAT
- VGGish
- Wav2Vec 2.0
- WavLM
- Whisper
- X-Vectors
- Xi-Vector

Metrics

Various performance metrics can be used to evaluate a SER system, such as

Accuracy, Weighted Accuracy
Recall, Unweighted Average Recall, Weighted Average Recall
Precision
F1 score, Weighted F1-score
Mean Absolute Error
Root Mean Square Error
Average Recognition Rate

Publishing

Journals

List

Name	Impact Factor	Review Method	First-decision
Frontiers in Computer Science	1.039	Peer-review	13w
International Journal of Speech Technology	1.803	Peer-review	61d
Machine Vision and Applications	2.012	Peer-review	44d
Applied Accoustics	2.639	Peer-review	7.5w
Applied Sciences	2.679	Peer-review	17.7d
Multimedia Tools and Applications	2.757	Peer-review	99d
IEEE Sensors	3.301	Peer-review	60d
IEEE Access	3.367	Binary Peer-review	4-6w
Computational Intelligence and Neuroscience	3.633	Peer-review	40d
IEEE/ACM Transactions on Audio, Speech and Language Processing	3.919	Peer-review	N/A
Neurocomputing	5.719	Peer-review	74d
IEEE Transactions on Affective Computing	10.506	Peer-review	N/A

Conferences

2023

Name	Date	Location	More
IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR)	June 2023	Canada
International Conference on Acoustics, Speech, & Signal Processing (ICASSP)	June 2023	Greece
International Speech Communication Association - Interspeech (ISCA)	August 2023	Ireland
European Signal Processing Conference (EUSIPCO)	September 2023	Finland	➖
Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE)	September 2023	Finland
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)	October 2023	USA	➖
International Society for Music Information Retrieval Conference (ISMIR)	November 2023	Italy

2024

Name	Date	Location	More
International Conference on Acoustics, Speech, & Signal Processing (ICASSP)	April 2024	Seoul, Korea
IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR)	June 2024	Seattle, WA, USA
International Conference on Machine Learning (ICML)	July 2024	Vienna, Austria	➖
European Signal Processing Conference (EUSIPCO)	August 2024	Lyon, France	➖
International Speech Communication Association - Interspeech (ISCA)	September 2024	Kos, Greece	➖
Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE)	October 2024	Tokyo, Japan
ACM International Conference on Multimedia	October 2024	Melbourne, Australia	➖
International Society for Music Information Retrieval Conference (ISMIR)	November 2024	San Fransisco, CA, USA	➖
Conference and Workshop on Neural Information Processing Systems (NeurIPS)	December 2024	Vancouver, Canada	➖

Learning

Deep Learning for Audio (University of Illinois Urbana-Champaign)
Dive into Deep Learning
Introduction to Speech Processing, 2nd Edition, Aalto University
Mastering Audio - The Art and the Science, Robert A. Katz
Spectral Audio Signal Processing, Julius O. Smith III
The Scientist and Engineer's Guide to Digital Signal Processing, Steven W. Smith

Other Awesome Material

Audio/Speech

Awesome Speaker Diarization
Casual Conversations Dataset
Music Emotion Recognition Datasets
Project TaRSila speech datasets for Brazilian Portuguese language
SER Datasets
Voice Datasets
✨❔

Visual/Face

Awesome-SOTA-FER a curated list of facial expression recognition in both 7-emotion classification and affect estimation
FG 2024 experience the cutting edge of progress in facial analysis, gesture recognition, and biometrics with this repository
✨❔

Perspectives

A picture(s) is worth a thousand words! A 2-min visual example of how we communicate emotions, our perceptions, the role of subjectivity and what is effective listening.

Are emotions consistent?

What about the dynamics of the context to our decisions and emotional wellness?

from Inside Out (2015) by Pixar Animation Studios

Name		Name	Last commit message	Last commit date
Latest commit History 103 Commits
images		images
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
_config.yml		_config.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly