Emotion recognition from audiovisual signals usually relies on feature sets whose extraction is grounded in knowledge gained over several decades of research in speech processing and computer vision. Along with the recent trend of representation learning, whose objective is to learn representations of data that are best suited to the recognition task, there has been a noticeable effort in the field of affective computing to learn representations of audio/visual data in the context of emotion.
There are three different levels of supervision in the way expert knowledge is exploited at the feature extraction step:
- Supervised: Expert-knowledge
- Semi-supervised: Bags-of-X-words
- Unsupervised: Deep Spectrum
The traditional approach in time-continuous emotion recognition consists in summarising low-level descriptors (LLDs) of speech and video data over time with a set of statistical measures computed over a fixed-duration sliding window. These descriptors usually include spectral, cepstral, prosodic, and voice quality information for the audio channel, and appearance and geometric information for the video channel (a minimal summarisation sketch follows the example list below).
e.g.
- ComParE
- FAUs (OpenFace)
- eGeMAPS (openSMILE)
- MFCCs (openSMILE)
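The sketch below illustrates functional-based summarisation of frame-level LLDs over a sliding window, using NumPy only. The window length, hop size, and the choice of statistics (mean and standard deviation) are illustrative assumptions, not the exact configuration of any baseline.

```python
# Minimal sketch: summarise LLDs with statistical functionals over a sliding window.
import numpy as np

def summarise_llds(llds, frame_rate=100, win_sec=4.0, hop_sec=0.1):
    """llds: (n_frames, n_llds) array of frame-level descriptors."""
    win = int(win_sec * frame_rate)      # e.g. 4 s window at 100 frames per second
    hop = int(hop_sec * frame_rate)      # e.g. 0.1 s hop
    feats = []
    for start in range(0, len(llds) - win + 1, hop):
        block = llds[start:start + win]
        # statistical functionals computed over the window (here: mean and std)
        feats.append(np.concatenate([block.mean(axis=0), block.std(axis=0)]))
    return np.asarray(feats)             # shape: (n_windows, 2 * n_llds)

# e.g. 39-dimensional MFCC-style LLDs at 100 frames per second
mfcc_llds = np.random.randn(3000, 39)
print(summarise_llds(mfcc_llds).shape)   # -> (n_windows, 78)
```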
The technique of bags-of-words (BoW), which originates from text processing, can be seen as semi-supervised representation learning, because it represents the distribution of LLDs according to a dictionary (codebook) learned from the LLDs themselves. To generate the XBoW representations, both the acoustic and the visual features are quantised and summarised over blocks of fixed duration (a minimal BoAW sketch follows the example list below).
e.g.
- BoAW (openXBOW)
- BoVW (openXBOW)
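Below is a minimal bag-of-audio-words sketch using scikit-learn's k-means to learn the codebook and hard assignments to build block-level histograms. openXBOW offers more options (soft assignments, multiple codebooks, log-weighting); the codebook size, block length, and hop are illustrative assumptions.

```python
# Minimal sketch: bag-of-audio-words (BoAW) from frame-level LLDs.
import numpy as np
from sklearn.cluster import KMeans

def learn_codebook(lld_frames, n_words=100, seed=0):
    # the dictionary is learned from the LLDs themselves -> "semi-supervised"
    return KMeans(n_clusters=n_words, random_state=seed, n_init=10).fit(lld_frames)

def boaw_histograms(llds, codebook, block=400, hop=100):
    n_words = codebook.n_clusters
    hists = []
    for start in range(0, len(llds) - block + 1, hop):
        # assign each frame in the block to its nearest audio word
        words = codebook.predict(llds[start:start + block])
        hist = np.bincount(words, minlength=n_words).astype(float)
        hists.append(hist / hist.sum())          # normalised term frequencies
    return np.asarray(hists)                     # shape: (n_blocks, n_words)

llds = np.random.randn(5000, 23)                 # e.g. eGeMAPS-style frame-level LLDs
cb = learn_codebook(llds)
print(boaw_histograms(llds, cb).shape)
```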
Deep Spectrum features were first introduced for snore sound classification, and are extracted using a deep representation-learning paradigm heavily inspired by image processing. To generate Deep Spectrum features, the speech files are first transformed into mel-spectrogram plots using Hanning windows, with the power spectral density computed on the dB power scale. These plots are then scaled and cropped to square images of 227 x 227 pixels, without axes and margins, to comply with the input requirements of AlexNet, a deep CNN pre-trained for image classification. The spectrogram images are then forwarded through AlexNet, and 4096-dimensional feature vectors are extracted from the activations of its second fully-connected layer.
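The following is a minimal sketch of a Deep Spectrum-style pipeline, assuming librosa, Pillow, Matplotlib, and torchvision are available; the STFT parameters, colour map, and classifier-layer index are illustrative assumptions and not the exact configuration of the Deep Spectrum toolkit.

```python
# Minimal sketch: mel-spectrogram image -> pre-trained AlexNet -> 4096-dim features.
import numpy as np
import librosa
import torch
from PIL import Image
from matplotlib import cm
from torchvision import models, transforms

def deep_spectrum_features(wav_path):
    # 1. Mel-spectrogram with a Hann window, power expressed on the dB scale
    y, sr = librosa.load(wav_path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                         hop_length=512, window="hann")
    mel_db = librosa.power_to_db(mel, ref=np.max)

    # 2. Render as a square RGB image (227 x 227), no axes or margins
    norm = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)
    rgb = (cm.viridis(norm)[..., :3] * 255).astype(np.uint8)   # colour-mapped plot
    img = Image.fromarray(rgb).resize((227, 227))

    # 3. Forward through AlexNet pre-trained on ImageNet
    preprocess = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])
    x = preprocess(img).unsqueeze(0)
    alexnet = models.alexnet(pretrained=True).eval()   # newer torchvision: weights="IMAGENET1K_V1"
    with torch.no_grad():
        h = alexnet.avgpool(alexnet.features(x))
        h = torch.flatten(h, 1)
        # output of the second fully-connected layer (fc7): 4096 dimensions
        fc7 = alexnet.classifier[:5](h)
    return fc7.squeeze(0).numpy()                      # shape: (4096,)
```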
| Methods | Acoustic features | Visual features |
|---|---|---|
| Yang2018 | arousal histograms, audio LLDs, pause/rate histograms | hands distance histograms, body HDR, action units |
| Xing2018 | eGeMAPS + MFCCs, topic-level features | AUs MHH, AUs, eyesight features, emotion features, body movement |
The baseline recognition system of the BDS consists of a late fusion of the best-performing audio and video representations using a linear SVM trained with the LIBLINEAR toolkit; training instances of the minority classes are duplicated so that all classes are balanced with the majority class, and the type of solver and the value of the complexity parameter C are optimised by grid search, using a logarithmic scale for the latter. A minimal sketch of this scheme follows.
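The sketch below approximates the baseline with scikit-learn, whose LinearSVC wraps LIBLINEAR: minority-class instances are duplicated, C is searched on a logarithmic grid, and the two modality classifiers are fused late by averaging their decision scores. The grid values are illustrative assumptions, and the search over solver types is omitted for brevity.

```python
# Minimal sketch: per-modality linear SVMs, class balancing by duplication,
# log-scale grid search over C, and late fusion of decision scores.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
from sklearn.utils import resample

def balance_by_duplication(X, y):
    # duplicate minority-class instances until every class matches the majority count
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    Xb, yb = [], []
    for c in classes:
        Xc, yc = resample(X[y == c], y[y == c], replace=True,
                          n_samples=n_max, random_state=0)
        Xb.append(Xc)
        yb.append(yc)
    return np.vstack(Xb), np.concatenate(yb)

def train_modality_svm(X, y):
    grid = {"C": np.logspace(-5, 1, 7)}          # complexity C on a logarithmic scale
    clf = GridSearchCV(LinearSVC(max_iter=10000), grid, cv=5)
    clf.fit(*balance_by_duplication(X, y))
    return clf.best_estimator_

def late_fusion_predict(audio_clf, video_clf, X_audio, X_video):
    # late fusion: average the per-class decision scores of both modalities
    # (assumes a multi-class problem, so decision_function returns one score per class)
    scores = (audio_clf.decision_function(X_audio) +
              video_clf.decision_function(X_video)) / 2.0
    return audio_clf.classes_[np.argmax(scores, axis=1)]
```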