Deep learning has revolutionized speech and text processing. But how do we build speech recognition models? It's not as straightforward as it seems. There are different approaches, each with its own complexities. Let's dive into building three different models for three different speech recognition tasks:
- Frame Level Speech Recognition
- Automatic Speech Recognition: Utterance to Phoneme transcription
- Attention-based end-to-end speech-to-text model with Transformer
Speech data consists of audio recordings, while phonemes represent the smallest sound units ("OH", "AH", etc.). Spectrograms, particularly MelSpectrograms, are commonly used to visually represent speech signals' frequency changes over time. In our dataset, we have audio recordings (utterances) and their corresponding phoneme state (subphoneme) labels from the Wall Street Journal (WSJ).
Inputs: Raw Mel Spectrogram Frame
Outputs: Frame Level Phoneme State Labels
Phonemes Example: ["+BREATH+", "+COUGH+", "+NOISE+", "+SMACK+", "+UH+", "+UM+", "AA", "AE"], and so on, with 40 phoneme labels.
Phonemes are like the building blocks of speech data. One powerful technique in speech recognition is modeling speech as a Markov process with unobserved states, known as phoneme states or subphonemes. Hidden Markov Models (HMMs) estimate parameters to maximize the likelihood of observed speech data.
Instead of HMMs, we're taking a model-free approach: a Multi-Layer Perceptron (MLP) network classifies each Mel Spectrogram frame and outputs class probabilities over all 40 phonemes.
Feature Extraction
Each utterance is converted into a Mel Spectrogram matrix of shape (*, 27) after performing Short-Time Fourier Transform (STFT) on small, overlapping segments of the waveform.
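As a rough sketch, the feature extraction step can be done with torchaudio; the file name, window and hop sizes below are illustrative assumptions, only the 27 mel filterbanks come from the shape described above.

```python
import torch
import torchaudio

# Load one utterance (placeholder file name)
waveform, sample_rate = torchaudio.load("utterance.wav")   # (channels, samples)

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=400,        # STFT window size (assumed, e.g. 25 ms at 16 kHz)
    hop_length=160,   # hop between overlapping frames (assumed, e.g. 10 ms)
    n_mels=27,        # 27 mel filterbanks -> feature dimension 27
)

mel = mel_transform(waveform)          # (channels, 27, time)
mel = mel.squeeze(0).transpose(0, 1)   # (time, 27): one feature vector per frame
```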
Context
To give the network enough information for accurate predictions, we provide context around each vector. For example, a context of 5 means appending the 5 neighboring vectors on each side, resulting in a window of shape (11, 27).
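A small sketch of building these context windows by zero-padding the utterance and unfolding over time; the frame count is a dummy value.

```python
import torch

def add_context(mel, context=5):
    """Build one (2*context + 1, feat_dim) window per frame of the utterance."""
    # mel: (time, feat_dim); zero-pad `context` frames on each side of the time axis
    padded = torch.nn.functional.pad(mel, (0, 0, context, context))
    # slide a window of length 2*context + 1 over time with stride 1
    windows = padded.unfold(0, 2 * context + 1, 1)   # (time, feat_dim, 2*context+1)
    return windows.transpose(1, 2)                   # (time, 2*context+1, feat_dim)

mel = torch.randn(100, 27)             # dummy utterance: 100 frames, 27 features
frames = add_context(mel, context=5)   # (100, 11, 27)
```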
Cepstral Normalization
Cepstral Normalization helps remove channel effects (e.g., differences in recording conditions) from the features. It involves subtracting the mean and dividing by the standard deviation of each coefficient across the utterance.
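A minimal sketch of this normalization applied per utterance:

```python
import torch

def cepstral_normalize(mel, eps=1e-8):
    """Per-utterance cepstral mean and variance normalization.

    mel: (time, n_coefficients); each coefficient is normalized across time.
    """
    mean = mel.mean(dim=0, keepdim=True)
    std = mel.std(dim=0, keepdim=True)
    return (mel - mean) / (std + eps)

normalized = cepstral_normalize(torch.randn(100, 27))
```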
Building the Network
We used a Pyramid MLP architecture, achieving 88% classification accuracy. The hyperparameters considered are summarized below, followed by a sketch of the resulting network.
Hyperparameters | Values Considered | Chosen |
---|---|---|
Number of Layers | 2-8 | 8 |
Activations | ReLU, LeakyReLU, softplus, tanh, sigmoid, Mish, GELU | GELU |
Batch Size | 64, 128, 256, 512, 1024, 2048 | 1024 |
Architecture | Cylinder, Pyramid, Inverse-Pyramid, Diamond | Pyramid |
Dropout | 0-0.5, Dropout in alternate layers | 0.25 |
LR Scheduler | Fixed, StepLR, ReduceLROnPlateau, Exponential, CosineAnnealing | CosineAnnealing |
Weight Initialization | Gaussian, Xavier, Kaiming(Normal and Uniform), Random, Uniform | Kaiming |
Context | 10-50 | 20 |
Batch-Norm | Before or After Activation, Every layer or Alternate Layer or No Layer | Every Layer |
Optimizer | Vanilla SGD, Nesterov’s momentum, RMSProp, Adam, AdamW | AdamW |
Regularization | Weight Decay | - |
LR | 0.001 | 0.001 |
Normalization | Cepstral Normalization | Cepstral Normalization |
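A minimal sketch of the Pyramid MLP with the chosen settings (8 linear layers, GELU, BatchNorm in every layer, dropout 0.25, context 20, Kaiming initialization, AdamW with cosine annealing); the exact hidden widths, weight-decay value, and T_max are illustrative assumptions.

```python
import torch
import torch.nn as nn

CONTEXT = 20
INPUT_DIM = (2 * CONTEXT + 1) * 27          # flattened context window: 41 frames x 27 features
NUM_PHONEMES = 40

def block(in_dim, out_dim, p=0.25):
    return nn.Sequential(
        nn.Linear(in_dim, out_dim),
        nn.BatchNorm1d(out_dim),
        nn.GELU(),
        nn.Dropout(p),
    )

widths = [2048, 2048, 1024, 1024, 512, 512, 256]      # "pyramid": hidden sizes shrink with depth
layers = [block(INPUT_DIM, widths[0])]
layers += [block(widths[i], widths[i + 1]) for i in range(len(widths) - 1)]
layers.append(nn.Linear(widths[-1], NUM_PHONEMES))    # 8 linear layers in total
model = nn.Sequential(*layers)

def init_weights(m):                                  # Kaiming initialization for linear layers
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight)
model.apply(init_weights)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
```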
Automatic Speech Recognition: Utterance to Phoneme Transcription
In this problem, we'll utilize a neural network to process an audio recording of a person saying the word "yes" and produce the phonetic transcription /Y/ /EH/ /S/. Our focus will be on implementing RNNs along with the dynamic programming algorithm known as Connectionist Temporal Classification (CTC) to produce these labels.
Standard speech recognition often involves labeling each frame (time step) of the recording with a phoneme. However, spoken words have variable lengths, making this approach unnatural. We want to directly output the phoneme sequence without worrying about exact timing.
Converting a variable-length speech recording (represented as feature vectors) into a phoneme sequence of a different length, with no one-to-one correspondence in timing, is referred to as generating order-aligned, time-asynchronous labels. PyTorch provides functions like pad_sequence(), pack_padded_sequence(), and pad_packed_sequence() for padding and packing variable-length sequences efficiently, as shown below.
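A small sketch of batching variable-length utterances with these utilities; the utterance lengths and the hidden size are illustrative.

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# Three utterances of different lengths, each frame a 27-dim feature vector
utterances = [torch.randn(95, 27), torch.randn(120, 27), torch.randn(80, 27)]
lengths = torch.tensor([u.shape[0] for u in utterances])

# Pad to a common length so the batch fits in one tensor
padded = pad_sequence(utterances, batch_first=True)                 # (3, 120, 27)

# Pack so the RNN skips the padded frames
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)

lstm = torch.nn.LSTM(input_size=27, hidden_size=64, batch_first=True, bidirectional=True)
packed_out, _ = lstm(packed)

# Unpack back to a padded tensor for the classification layer
output, out_lengths = pad_packed_sequence(packed_out, batch_first=True)   # (3, 120, 128)
```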
- RNN network predicts probabilities for each phoneme at every time step.
- CTC algorithm performs dynamic programming to generate the final phoneme sequence from the probabilities.
RNNs' ability to capture temporal dependencies makes them well suited to analyzing sequential data like speech. CTC then turns the per-frame probabilities into a phoneme sequence by:
- Decoding the probabilities output by the RNN at every time step.
- Employing dynamic programming to find the most likely phoneme sequence given those probabilities.
- Utilizing a "blank" symbol to handle silent regions and repetitions (a minimal greedy-decoding sketch follows this list).
The model is an RNN that processes the speech feature vectors and outputs a sequence of probability vectors over the phonemes (including a blank symbol) at each time step. The architecture involves the following components (a minimal sketch follows the list):
- 1D convolutional layers (CNNs) to capture local dependencies in the speech features.
- Bidirectional LSTMs (BLSTMs) to capture long-term contextual information.
- Pyramidal BLSTMs (pBLSTMs) to optionally downsample the input sequence in time.
- A final layer with softmax activation converts the hidden representations into phoneme probabilities.
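A minimal PyTorch sketch of this architecture, assuming padded inputs of shape (batch, time, 27); the hidden sizes are illustrative, and the length bookkeeping needed for packing and for CTC after downsampling is omitted for brevity.

```python
import torch
import torch.nn as nn

class pBLSTM(nn.Module):
    """Pyramidal BLSTM: halves the time resolution by concatenating adjacent frames."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.blstm = nn.LSTM(input_dim * 2, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, x):                          # x: (batch, time, input_dim)
        b, t, f = x.shape
        x = x[:, : t - (t % 2), :]                 # drop a trailing frame if time is odd
        x = x.reshape(b, t // 2, f * 2)            # concatenate frame pairs -> half the steps
        out, _ = self.blstm(x)
        return out                                 # (batch, time // 2, 2 * hidden_dim)

class SpeechRecognizer(nn.Module):
    def __init__(self, input_dim=27, hidden_dim=256, num_phonemes=41):   # 40 phonemes + blank
        super().__init__()
        self.cnn = nn.Sequential(                  # 1D convolutions over time for local context
            nn.Conv1d(input_dim, hidden_dim, kernel_size=3, padding=1),
            nn.BatchNorm1d(hidden_dim),
            nn.GELU(),
        )
        self.pblstm = pBLSTM(hidden_dim, hidden_dim)
        self.blstm = nn.LSTM(2 * hidden_dim, hidden_dim, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_phonemes)

    def forward(self, x):                          # x: (batch, time, input_dim)
        x = self.cnn(x.transpose(1, 2)).transpose(1, 2)
        x = self.pblstm(x)
        x, _ = self.blstm(x)
        return self.classifier(x).log_softmax(-1)  # per-step log-probabilities for CTC
```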
Training data consists of speech recordings and their corresponding phoneme sequences. Since the target phoneme sequence is shorter and asynchronous compared to the input, we need a way to compute the loss function for training.
- Viterbi Training: Finds the single most likely alignment between the phoneme sequence and the input using the Viterbi algorithm.
- CTC Loss: Calculates the expected loss over all possible alignments using dynamic programming (forward-backward algorithm). PyTorch's CTCLoss function can be used for this purpose.
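A minimal sketch of how PyTorch's CTCLoss is typically called, with dummy tensors standing in for the network outputs and phoneme targets (index 0 reserved for the blank):

```python
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0)                  # index 0 is the blank symbol

batch, T, num_symbols = 4, 120, 41              # 40 phonemes + blank
log_probs = torch.randn(T, batch, num_symbols, requires_grad=True).log_softmax(-1)  # (time, batch, classes)

target_lengths = torch.tensor([30, 25, 28, 22])                        # phoneme count per utterance
targets = torch.randint(1, num_symbols, (int(target_lengths.sum()),))  # concatenated targets (no blanks)
input_lengths = torch.full((batch,), T, dtype=torch.long)              # frames per utterance

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```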
Using this network along with appropriate speech data transformations such as Time Masking and Frequency Masking, we've managed to attain a Levenshtein Distance of 6. This approach forms the foundation for various speech processing applications like voice assistants and automatic captioning.
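Time Masking and Frequency Masking are available as torchaudio transforms; as a rough sketch (the mask widths below are illustrative assumptions, not the values used here):

```python
import torch
import torchaudio

spec = torch.randn(1, 27, 400)   # dummy spectrogram: (channel, mel bands, time frames)

time_mask = torchaudio.transforms.TimeMasking(time_mask_param=40)      # mask up to 40 consecutive frames
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=6)  # mask up to 6 consecutive mel bands

augmented = freq_mask(time_mask(spec))   # apply both masks during training only
```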
Attention-based End-to-End Speech-to-Text Model with Transformer
In this part, we convert a sequence of speech feature vectors (Mel-Frequency Cepstral Coefficients, MFCCs) into the sequence of characters that make up the spoken text. For this sequence-to-sequence task, the encoder-decoder Transformer architecture is highly effective.
Encoder
- Convolutional Modules: Extract features from the input MFCCs.
- Positional Encoding: Injects positional information into the input embeddings, since the Transformer processes the entire sequence at once and has no inherent notion of order (a sketch follows this list).
- Layer Normalization: Normalizes activations of hidden layers for better training stability.
- Feed Forward Neural Network: Increases the model's capacity to learn complex relationships.
- Multi-Head Self-Attention: Allows the encoder to attend to different parts of the input sequence simultaneously, capturing various relationships.
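For illustration, a standard sinusoidal positional encoding module (one common choice; a model may instead learn its positional embeddings):

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Standard sinusoidal positional encoding added to the input embeddings."""
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)                    # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)                                   # not a trainable parameter

    def forward(self, x):                # x: (batch, seq_len, d_model)
        return x + self.pe[: x.size(1)]
```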
Decoder
- Positional Encoding: Similar to the encoder.
- Layer Normalization: Similar to the encoder.
- Self-Attention: The decoder attends to its own previous outputs to understand the context generated so far; a causal mask (sketched after this list) prevents it from peeking at future characters.
- Multi-Head Attention: The decoder attends to the encoder outputs to comprehend how the input relates to the sequence being generated.
- Feed Forward Neural Network: Similar to the encoder.
- Linear Classifier: Outputs the predicted character probabilities.
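The causal masking mentioned above can be built as a simple upper-triangular boolean mask; this sketch follows PyTorch's convention where True marks positions a query may not attend to.

```python
import torch

def causal_mask(size):
    """Boolean mask that blocks each decoder position from attending to future positions."""
    # True = not allowed to attend (PyTorch's convention for boolean attention masks)
    return torch.triu(torch.ones(size, size, dtype=torch.bool), diagonal=1)

print(causal_mask(4))
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
```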
- Loss Function: Cross-entropy loss is used; with teacher forcing the decoder predicts one character at a time against a known target, so each step is a standard classification problem and no CTC-style alignment is needed.
- Teacher Forcing: During training, the true previous output is used as input for the next step, so the model learns directly from the correct character sequence (a training-step sketch follows this list).
- Optimizer: AdamW
- Learning Rate Scheduler: ReduceLROnPlateau
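A sketch of one teacher-forced training step with cross-entropy; the model interface, special-token indices, and padded transcript layout are assumptions for illustration.

```python
import torch
import torch.nn as nn

PAD, SOS, EOS = 0, 1, 2                        # assumed special-token indices
criterion = nn.CrossEntropyLoss(ignore_index=PAD)

def train_step(model, optimizer, mfccs, transcripts):
    """One teacher-forced step: feed the true previous characters to the decoder."""
    # transcripts: (batch, max_len) padded character indices, starting with SOS, ending with EOS
    decoder_input = transcripts[:, :-1]        # <sos> c1 c2 ... c_{n-1}
    target = transcripts[:, 1:]                # c1 c2 ... c_n <eos>

    logits = model(mfccs, decoder_input)       # (batch, max_len - 1, vocab_size), assumed interface
    loss = criterion(logits.reshape(-1, logits.size(-1)), target.reshape(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```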
- Greedy Search: An algorithm used during inference (testing) to iteratively predict the most likely character at each step, given the previously generated characters and the encoder outputs (a sketch follows this list).
- Beam Search: A heuristic search algorithm used in sequence generation tasks, like machine translation, to explore multiple candidate sequences efficiently, retaining only the top-scoring ones at each step. Beam Search typically yields better transcriptions than greedy search because it scores whole candidate sentences instead of committing to the single most likely character at each step.
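A sketch of greedy decoding at inference time; the encode/decode methods are an assumed model interface, not a specific library API.

```python
import torch

def greedy_decode(model, mfccs, max_len=200, sos=1, eos=2):
    """Greedy inference: repeatedly append the most likely next character."""
    model.eval()
    tokens = torch.tensor([[sos]])                       # start with <sos>
    with torch.no_grad():
        memory = model.encode(mfccs)                     # encoder outputs (assumed interface)
        for _ in range(max_len):
            logits = model.decode(tokens, memory)        # (1, current_len, vocab_size)
            next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
            tokens = torch.cat([tokens, next_token], dim=1)
            if next_token.item() == eos:                 # stop at end-of-sequence
                break
    return tokens.squeeze(0)[1:]                         # drop <sos>
```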
By understanding these different approaches, we can harness the power of deep learning for speech recognition.
- Course Website: Deep Learning Course - CMU