
04. Emotion Classification from Speech

cotemyriam edited this page Apr 11, 2013 · 1 revision


Priority: Medium

Emotion classification from speech: learn to map from the acoustic sequence of a video clip to a frame-based or sequence-based emotion classification.

YB (lead), Yann, Nicolas, Razvan, (need to do this quickly because both leave in May), Guillaume A., Stephan

Meeting Reports (Brainstorming):

2013.04.05

Audio features

Nicolas: I generated a first set of basic hand-crafted frame-level audio features to get us started, using Yaafe.

The features are in /data/lisa/data/audio_features/ in pickled numpy matrices (one file for each mp3 in the Train/Val folders). There are 3 subsets of features:

  • raw : Only the magnitude spectrogram.
  • minimal : Only MFCCs, MFCC derivatives, AutoCorrelation, Loudness, Flux and other low-dimensional perceptual features (concatenated).
  • full : includes the above plus other features from the Yaafe library (ZCR, TemporalShapeStatistics, SpectralRolloff, SpectralShapeStatistics, SpectralFlatness, SpectralDecrease, SpectralFlatnessPerBand, SpectralCrestFactorPerBand, LPC, LSF, ComplexDomainOnsetDetection, Mel spectrum, MFCC second derivatives, Envelope, EnvelopeShapeStatistics, AmplitudeModulation, OBSI, OBSIR).
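Since each file is a pickled numpy matrix (one per clip), loading one is straightforward. This is an illustrative sketch: the path and the matrix shape below are hypothetical stand-ins for an actual file from the Train/Val folders.

```python
import pickle
import numpy as np

def load_features(path):
    """Load one clip's frame-level feature matrix (n_frames x n_features)."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Round-trip demo with a fake matrix standing in for a real feature file.
fake = np.random.randn(200, 13)            # e.g. 200 frames of 13 MFCCs
with open("/tmp/demo_clip.pkl", "wb") as f:
    pickle.dump(fake, f)

feats = load_features("/tmp/demo_clip.pkl")
print(feats.shape)                         # (200, 13)
```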

Mostly default parameters were used, with an analysis window size of ~25 ms and a hop size of 12.5 ms, as is common in the literature (although this varies). The audio input was normalized.
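For reference, the frame count implied by a ~25 ms window and a 12.5 ms hop (50% overlap) can be sketched as below. The sample rate and clip length are assumptions for illustration only.

```python
# Frame layout arithmetic for a 25 ms window with a 12.5 ms hop.
sr = 16000                       # assumed sample rate (Hz)
win = int(0.025 * sr)            # 400 samples per analysis window
hop = int(0.0125 * sr)           # 200 samples between window starts
n_samples = 3 * sr               # a hypothetical 3-second clip

# Number of full windows that fit in the clip.
n_frames = 1 + (n_samples - win) // hop
print(win, hop, n_frames)        # 400 200 239
```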

There are also whitened versions of the above with .pca.pkl extensions. Each component was whitened independently (zero mean, unit variance, diagonal covariance, no dimensionality reduction) using statistics from the training set, in order to preserve the topology of the original feature space. PCA objects describing the applied transformation have been saved in the same directory with .pca extensions.
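The per-component whitening described above (zero mean, unit variance per feature dimension, statistics from the training set only) amounts to the following minimal sketch; the stand-in data and names are hypothetical, not the actual generation code.

```python
import numpy as np

# Stand-in training features: 1000 frames x 20 dimensions, shifted/scaled
# so the whitening has something to undo.
train = np.random.randn(1000, 20) * 3.0 + 5.0
mean = train.mean(axis=0)        # per-component statistics from the
std = train.std(axis=0)          # training set distribution only

def whiten(x, mean=mean, std=std):
    """Independently whiten each component: zero mean, unit variance."""
    return (x - mean) / std

w = whiten(train)
print(np.allclose(w.mean(axis=0), 0.0, atol=1e-8))   # True
print(np.allclose(w.std(axis=0), 1.0))               # True
```

Because each dimension is scaled independently (diagonal covariance), no rotation or dimensionality reduction is applied, which is what preserves the topology of the original feature space.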

Note that these sets contain no clip-level features; no aggregation, pooling, or multi-scale statistics were performed.
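If clip-level descriptors are needed downstream, one common baseline (hypothetical here, not part of the generated sets) is mean/std pooling over frames, giving one fixed-length vector per clip:

```python
import numpy as np

# Stand-in frame-level matrix: 150 frames x 26 features.
frames = np.random.randn(150, 26)

# Simple clip-level pooling: concatenate per-feature mean and std.
clip_vec = np.concatenate([frames.mean(axis=0), frames.std(axis=0)])
print(clip_vec.shape)            # (52,)
```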

Generation code is in lisa_emotiw/emotiw/boulanni/audio_features.py

Vera Am Mittag dataset

/data/lisa/data/Vera_Am_Mittag/extracted_audio

See the Data Files entry for a description of this data.

Experimental Results