Emotion recognition from audiovisual signals usually relies on feature sets whose extraction is grounded in knowledge gained over several decades of research in speech processing and computer vision. Along with the recent trend of representation learning, whose objective is to learn representations of data that are best suited to the recognition task, there has been a noticeable effort in the field of affective computing to learn representations of audio/visual data in the context of emotion.
There are three different levels of supervision in the way expert knowledge is exploited at the feature extraction step:
- Supervised: Expert-knowledge
- Semi-supervised: Bags-of-X-words
- Unsupervised: Deep Spectrum
The traditional approach in time-continuous emotion recognition consists in summarising low-level descriptors (LLDs) of speech and video data over time with a set of statistical measures computed over a fixed-duration sliding window. These descriptors usually include spectral, cepstral, prosodic, and voice quality information for the audio channel, and appearance and geometric information for the video channel (a minimal summarisation sketch follows the example list below).
e.g.
- ComParE
- FAUs (OpenFace)
- eGeMAPS (openSMILE)
- MFCCs (openSMILE)
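The sketch below illustrates functional-based summarisation of frame-level LLDs over a sliding window, using NumPy only. The window length, hop size, and the choice of statistics (mean and standard deviation) are illustrative assumptions, not the exact configuration of any baseline.

```python
# Minimal sketch: summarise LLDs with statistical functionals over a sliding window.
import numpy as np

def summarise_llds(llds, frame_rate=100, win_sec=4.0, hop_sec=0.1):
    """llds: (n_frames, n_llds) array of frame-level descriptors."""
    win = int(win_sec * frame_rate)      # e.g. 4 s window at 100 frames per second
    hop = int(hop_sec * frame_rate)      # e.g. 0.1 s hop
    feats = []
    for start in range(0, len(llds) - win + 1, hop):
        block = llds[start:start + win]
        # statistical functionals computed over the window (here: mean and std)
        feats.append(np.concatenate([block.mean(axis=0), block.std(axis=0)]))
    return np.asarray(feats)             # shape: (n_windows, 2 * n_llds)

# e.g. 39-dimensional MFCC-style LLDs at 100 frames per second
mfcc_llds = np.random.randn(3000, 39)
print(summarise_llds(mfcc_llds).shape)   # -> (n_windows, 78)
```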
The technique of bags-of-words (BoW), which originates from text processing, can be seen as semi-supervised representation learning, because it represents the distribution of LLDs according to a dictionary (codebook) learned from the LLDs themselves. To generate the XBoW representations, both the acoustic and the visual features are quantised and summarised over blocks of fixed duration (a minimal BoAW sketch follows the example list below).
e.g.
- BoAW (openXBOW)
- BoVW (openXBOW)
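Below is a minimal bag-of-audio-words sketch using scikit-learn's k-means to learn the codebook and hard assignments to build block-level histograms. openXBOW offers more options (soft assignments, multiple codebooks, log-weighting); the codebook size, block length, and hop are illustrative assumptions.

```python
# Minimal sketch: bag-of-audio-words (BoAW) from frame-level LLDs.
import numpy as np
from sklearn.cluster import KMeans

def learn_codebook(lld_frames, n_words=100, seed=0):
    # the dictionary is learned from the LLDs themselves -> "semi-supervised"
    return KMeans(n_clusters=n_words, random_state=seed, n_init=10).fit(lld_frames)

def boaw_histograms(llds, codebook, block=400, hop=100):
    n_words = codebook.n_clusters
    hists = []
    for start in range(0, len(llds) - block + 1, hop):
        # assign each frame in the block to its nearest audio word
        words = codebook.predict(llds[start:start + block])
        hist = np.bincount(words, minlength=n_words).astype(float)
        hists.append(hist / hist.sum())          # normalised term frequencies
    return np.asarray(hists)                     # shape: (n_blocks, n_words)

llds = np.random.randn(5000, 23)                 # e.g. eGeMAPS-style frame-level LLDs
cb = learn_codebook(llds)
print(boaw_histograms(llds, cb).shape)
```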
Deep Spectrum features were first introduced for snore sound classification, and are extracted using a deep representation-learning paradigm heavily inspired by image processing. To generate Deep Spectrum features, the speech files are first transformed into mel-spectrogram plots using Hanning windows, with the power spectral density computed on the dB power scale. These plots are then scaled and cropped to square images of 227 x 227 pixels, without axes and margins, to comply with the input requirements of AlexNet, a deep CNN pre-trained for image classification. The spectrogram images are then forwarded through AlexNet, and 4096-dimensional feature vectors are extracted from the activations of its second fully-connected layer.
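The following is a minimal sketch of a Deep Spectrum-style pipeline, assuming librosa, Pillow, Matplotlib, and torchvision are available; the STFT parameters, colour map, and classifier-layer index are illustrative assumptions and not the exact configuration of the Deep Spectrum toolkit.

```python
# Minimal sketch: mel-spectrogram image -> pre-trained AlexNet -> 4096-dim features.
import numpy as np
import librosa
import torch
from PIL import Image
from matplotlib import cm
from torchvision import models, transforms

def deep_spectrum_features(wav_path):
    # 1. Mel-spectrogram with a Hann window, power expressed on the dB scale
    y, sr = librosa.load(wav_path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                         hop_length=512, window="hann")
    mel_db = librosa.power_to_db(mel, ref=np.max)

    # 2. Render as a square RGB image (227 x 227), no axes or margins
    norm = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)
    rgb = (cm.viridis(norm)[..., :3] * 255).astype(np.uint8)   # colour-mapped plot
    img = Image.fromarray(rgb).resize((227, 227))

    # 3. Forward through AlexNet pre-trained on ImageNet
    preprocess = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])
    x = preprocess(img).unsqueeze(0)
    alexnet = models.alexnet(pretrained=True).eval()   # newer torchvision: weights="IMAGENET1K_V1"
    with torch.no_grad():
        h = alexnet.avgpool(alexnet.features(x))
        h = torch.flatten(h, 1)
        # output of the second fully-connected layer (fc7): 4096 dimensions
        fc7 = alexnet.classifier[:5](h)
    return fc7.squeeze(0).numpy()                      # shape: (4096,)
```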
| Methods | Acoustic features | Visual features |
|---|---|---|
| Yang2018 | arousal histograms, audio LLDs, pause/rate histograms | hands distance histograms, body HDR, action units |
| Xing2018 | eGeMAPS + MFCCs, topic-level features | AUs MHH, AUs, eyesight features, emotion features, body movement |
The baseline recognition system of the BDS consists of a late fusion of the best-performing audio and video representations using a linear SVM trained with the LIBLINEAR toolkit; training instances of the minority classes are duplicated so that all classes are balanced with the majority class, and the type of solver and the value of the complexity parameter C are optimised by grid search, using a logarithmic scale for the latter. A minimal sketch of this scheme follows.
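The sketch below approximates the baseline with scikit-learn, whose LinearSVC wraps LIBLINEAR: minority-class instances are duplicated, C is searched on a logarithmic grid, and the two modality classifiers are fused late by averaging their decision scores. The grid values are illustrative assumptions, and the search over solver types is omitted for brevity.

```python
# Minimal sketch: per-modality linear SVMs, class balancing by duplication,
# log-scale grid search over C, and late fusion of decision scores.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
from sklearn.utils import resample

def balance_by_duplication(X, y):
    # duplicate minority-class instances until every class matches the majority count
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    Xb, yb = [], []
    for c in classes:
        Xc, yc = resample(X[y == c], y[y == c], replace=True,
                          n_samples=n_max, random_state=0)
        Xb.append(Xc)
        yb.append(yc)
    return np.vstack(Xb), np.concatenate(yb)

def train_modality_svm(X, y):
    grid = {"C": np.logspace(-5, 1, 7)}          # complexity C on a logarithmic scale
    clf = GridSearchCV(LinearSVC(max_iter=10000), grid, cv=5)
    clf.fit(*balance_by_duplication(X, y))
    return clf.best_estimator_

def late_fusion_predict(audio_clf, video_clf, X_audio, X_video):
    # late fusion: average the per-class decision scores of both modalities
    # (assumes a multi-class problem, so decision_function returns one score per class)
    scores = (audio_clf.decision_function(X_audio) +
              video_clf.decision_function(X_video)) / 2.0
    return audio_clf.classes_[np.argmax(scores, axis=1)]
```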