Deep learning has revolutionized speech and text processing. But how do we build speech recognition models? It's not as straightforward as it seems. There are different approaches, each with its own complexities. Let's dive into building three different models for three different speech recognition tasks:
- Frame Level Speech Recognition
- Automatic Speech Recognition: Utterance to Phoneme transcription
- Attention-based end-to-end speech-to-text model with Transformer
Speech data consists of audio recordings, while phonemes represent the smallest sound units ("OH", "AH", etc.). Spectrograms, particularly MelSpectrograms, are commonly used to visually represent speech signals' frequency changes over time. In our dataset, we have audio recordings (utterances) and their corresponding phoneme state (subphoneme) labels from the Wall Street Journal (WSJ).
Inputs: Raw Mel Spectrogram Frame
Outputs: Frame Level Phoneme State Labels
Phonemes Example: ["+BREATH+", "+COUGH+", "+NOISE+", "+SMACK+", "+UH+", "+UM+", "AA", "AE"], and so on, with 40 phoneme labels.
Phonemes are like the building blocks of speech data. One powerful technique in speech recognition is modeling speech as a Markov process with unobserved states, known as phoneme states or subphonemes. Hidden Markov Models (HMMs) estimate parameters to maximize the likelihood of observed speech data.
Instead of HMMs, we're taking a model-free approach: a Multi-Layer Perceptron (MLP) network classifies each Mel Spectrogram frame and outputs class probabilities over all 40 phonemes.
Feature Extraction
Each utterance is converted into a Mel Spectrogram matrix of shape (*, 27) after performing Short-Time Fourier Transform (STFT) on small, overlapping segments of the waveform.
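As a rough sketch, the feature extraction step can be done with torchaudio; the file name, window and hop sizes below are illustrative assumptions, only the 27 mel filterbanks come from the shape described above.

```python
import torch
import torchaudio

# Load one utterance (placeholder file name)
waveform, sample_rate = torchaudio.load("utterance.wav")   # (channels, samples)

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=400,        # STFT window size (assumed, e.g. 25 ms at 16 kHz)
    hop_length=160,   # hop between overlapping frames (assumed, e.g. 10 ms)
    n_mels=27,        # 27 mel filterbanks -> feature dimension 27
)

mel = mel_transform(waveform)          # (channels, 27, time)
mel = mel.squeeze(0).transpose(0, 1)   # (time, 27): one feature vector per frame
```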
Context
To give the network enough information for accurate predictions, we provide context around each vector. For example, a context of 5 means appending the 5 neighboring vectors on each side, resulting in a window of shape (11, 27).
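A small sketch of building these context windows by zero-padding the utterance and unfolding over time; the frame count is a dummy value.

```python
import torch

def add_context(mel, context=5):
    """Build one (2*context + 1, feat_dim) window per frame of the utterance."""
    # mel: (time, feat_dim); zero-pad `context` frames on each side of the time axis
    padded = torch.nn.functional.pad(mel, (0, 0, context, context))
    # slide a window of length 2*context + 1 over time with stride 1
    windows = padded.unfold(0, 2 * context + 1, 1)   # (time, feat_dim, 2*context+1)
    return windows.transpose(1, 2)                   # (time, 2*context+1, feat_dim)

mel = torch.randn(100, 27)             # dummy utterance: 100 frames, 27 features
frames = add_context(mel, context=5)   # (100, 11, 27)
```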
Cepstral Normalization
Cepstral Normalization helps remove channel effects (e.g., differences in recording conditions) from the features. It involves subtracting the mean and dividing by the standard deviation of each coefficient across the utterance.
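A minimal sketch of this normalization applied per utterance:

```python
import torch

def cepstral_normalize(mel, eps=1e-8):
    """Per-utterance cepstral mean and variance normalization.

    mel: (time, n_coefficients); each coefficient is normalized across time.
    """
    mean = mel.mean(dim=0, keepdim=True)
    std = mel.std(dim=0, keepdim=True)
    return (mel - mean) / (std + eps)

normalized = cepstral_normalize(torch.randn(100, 27))
```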
Building the Network
We used a Pyramid MLP architecture, achieving 88% classification accuracy. The hyperparameters considered are summarized below, followed by a sketch of the resulting network.
Hyperparameters | Values Considered | Chosen |
---|---|---|
Number of Layers | 2-8 | 8 |
Activations | ReLU, LeakyReLU, softplus, tanh, sigmoid, Mish, GELU | GELU |
Batch Size | 64, 128, 256, 512, 1024, 2048 | 1024 |
Architecture | Cylinder, Pyramid, Inverse-Pyramid, Diamond | Pyramid |
Dropout | 0-0.5, Dropout in alternate layers | 0.25 |
LR Scheduler | Fixed, StepLR, ReduceLROnPlateau, Exponential, CosineAnnealing | CosineAnnealing |
Weight Initialization | Gaussian, Xavier, Kaiming(Normal and Uniform), Random, Uniform | Kaiming |
Context | 10-50 | 20 |
Batch-Norm | Before or After Activation, Every layer or Alternate Layer or No Layer | Every Layer |
Optimizer | Vanilla SGD, Nesterov’s momentum, RMSProp, Adam, AdamW | AdamW |
Regularization | Weight Decay | - |
LR | 0.001 | 0.001 |
Normalization | Cepstral Normalization | Cepstral Normalization |
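A minimal sketch of the Pyramid MLP with the chosen settings (8 linear layers, GELU, BatchNorm in every layer, dropout 0.25, context 20, Kaiming initialization, AdamW with cosine annealing); the exact hidden widths, weight-decay value, and T_max are illustrative assumptions.

```python
import torch
import torch.nn as nn

CONTEXT = 20
INPUT_DIM = (2 * CONTEXT + 1) * 27          # flattened context window: 41 frames x 27 features
NUM_PHONEMES = 40

def block(in_dim, out_dim, p=0.25):
    return nn.Sequential(
        nn.Linear(in_dim, out_dim),
        nn.BatchNorm1d(out_dim),
        nn.GELU(),
        nn.Dropout(p),
    )

widths = [2048, 2048, 1024, 1024, 512, 512, 256]      # "pyramid": hidden sizes shrink with depth
layers = [block(INPUT_DIM, widths[0])]
layers += [block(widths[i], widths[i + 1]) for i in range(len(widths) - 1)]
layers.append(nn.Linear(widths[-1], NUM_PHONEMES))    # 8 linear layers in total
model = nn.Sequential(*layers)

def init_weights(m):                                  # Kaiming initialization for linear layers
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight)
model.apply(init_weights)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
```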
Automatic Speech Recognition: Utterance to Phoneme Transcription
In this problem, we'll utilize a neural network to process an audio recording of a person saying the word "yes" and produce the phonetic transcription /Y/ /EH/ /S/. Our focus will be on implementing RNNs along with the dynamic programming algorithm known as Connectionist Temporal Classification (CTC) to produce these labels.
Standard speech recognition often involves labeling each frame (time step) of the recording with a phoneme. However, spoken words have variable lengths, making this approach unnatural. We want to directly output the phoneme sequence without worrying about exact timing.
Converting a variable-length speech recording (represented as feature vectors) into a phoneme sequence of a different length, with no one-to-one correspondence in timing, is referred to as generating order-aligned, time-asynchronous labels. PyTorch provides functions like pad_sequence(), pack_padded_sequence(), and pad_packed_sequence() for padding and packing variable-length sequences efficiently, as shown below.
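A small sketch of batching variable-length utterances with these utilities; the utterance lengths and the hidden size are illustrative.

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# Three utterances of different lengths, each frame a 27-dim feature vector
utterances = [torch.randn(95, 27), torch.randn(120, 27), torch.randn(80, 27)]
lengths = torch.tensor([u.shape[0] for u in utterances])

# Pad to a common length so the batch fits in one tensor
padded = pad_sequence(utterances, batch_first=True)                 # (3, 120, 27)

# Pack so the RNN skips the padded frames
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)

lstm = torch.nn.LSTM(input_size=27, hidden_size=64, batch_first=True, bidirectional=True)
packed_out, _ = lstm(packed)

# Unpack back to a padded tensor for the classification layer
output, out_lengths = pad_packed_sequence(packed_out, batch_first=True)   # (3, 120, 128)
```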
- RNN network predicts probabilities for each phoneme at every time step.
- CTC algorithm performs dynamic programming to generate the final phoneme sequence from the probabilities.
RNNs' ability to capture temporal dependencies makes them well suited to analyzing sequential data like speech. CTC then turns the per-frame probabilities into a phoneme sequence by:
- Decoding the probabilities output by the RNN at every time step.
- Employing dynamic programming to find the most likely phoneme sequence given those probabilities.
- Utilizing a "blank" symbol to handle silent regions and repetitions (a minimal greedy-decoding sketch follows this list).
The model is an RNN that processes the speech feature vectors and outputs a sequence of probability vectors over the phonemes (including a blank symbol) at each time step. The architecture involves the following components (a minimal sketch follows the list):
- 1D convolutional layers (CNNs) to capture local dependencies in the speech features.
- Bidirectional LSTMs (BLSTMs) to capture long-term contextual information.
- Pyramidal BLSTMs (pBLSTMs) to optionally downsample the input sequence in time.
- A final layer with softmax activation converts the hidden representations into phoneme probabilities.
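A minimal PyTorch sketch of this architecture, assuming padded inputs of shape (batch, time, 27); the hidden sizes are illustrative, and the length bookkeeping needed for packing and for CTC after downsampling is omitted for brevity.

```python
import torch
import torch.nn as nn

class pBLSTM(nn.Module):
    """Pyramidal BLSTM: halves the time resolution by concatenating adjacent frames."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.blstm = nn.LSTM(input_dim * 2, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, x):                          # x: (batch, time, input_dim)
        b, t, f = x.shape
        x = x[:, : t - (t % 2), :]                 # drop a trailing frame if time is odd
        x = x.reshape(b, t // 2, f * 2)            # concatenate frame pairs -> half the steps
        out, _ = self.blstm(x)
        return out                                 # (batch, time // 2, 2 * hidden_dim)

class SpeechRecognizer(nn.Module):
    def __init__(self, input_dim=27, hidden_dim=256, num_phonemes=41):   # 40 phonemes + blank
        super().__init__()
        self.cnn = nn.Sequential(                  # 1D convolutions over time for local context
            nn.Conv1d(input_dim, hidden_dim, kernel_size=3, padding=1),
            nn.BatchNorm1d(hidden_dim),
            nn.GELU(),
        )
        self.pblstm = pBLSTM(hidden_dim, hidden_dim)
        self.blstm = nn.LSTM(2 * hidden_dim, hidden_dim, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_phonemes)

    def forward(self, x):                          # x: (batch, time, input_dim)
        x = self.cnn(x.transpose(1, 2)).transpose(1, 2)
        x = self.pblstm(x)
        x, _ = self.blstm(x)
        return self.classifier(x).log_softmax(-1)  # per-step log-probabilities for CTC
```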
Training data consists of speech recordings and their corresponding phoneme sequences. Since the target phoneme sequence is shorter and asynchronous compared to the input, we need a way to compute the loss function for training.
- Viterbi Training: Finds the single most likely alignment between the phoneme sequence and the input using the Viterbi algorithm.
- CTC Loss: Calculates the expected loss over all possible alignments using dynamic programming (forward-backward algorithm). PyTorch's CTCLoss function can be used for this purpose.
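A minimal sketch of how PyTorch's CTCLoss is typically called, with dummy tensors standing in for the network outputs and phoneme targets (index 0 reserved for the blank):

```python
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0)                  # index 0 is the blank symbol

batch, T, num_symbols = 4, 120, 41              # 40 phonemes + blank
log_probs = torch.randn(T, batch, num_symbols, requires_grad=True).log_softmax(-1)  # (time, batch, classes)

target_lengths = torch.tensor([30, 25, 28, 22])                        # phoneme count per utterance
targets = torch.randint(1, num_symbols, (int(target_lengths.sum()),))  # concatenated targets (no blanks)
input_lengths = torch.full((batch,), T, dtype=torch.long)              # frames per utterance

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```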
Using this network along with appropriate speech data transformations such as Time Masking and Frequency Masking, we've managed to attain a Levenshtein Distance of 6. This approach forms the foundation for various speech processing applications like voice assistants and automatic captioning.
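Time Masking and Frequency Masking are available as torchaudio transforms; as a rough sketch (the mask widths below are illustrative assumptions, not the values used here):

```python
import torch
import torchaudio

spec = torch.randn(1, 27, 400)   # dummy spectrogram: (channel, mel bands, time frames)

time_mask = torchaudio.transforms.TimeMasking(time_mask_param=40)      # mask up to 40 consecutive frames
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=6)  # mask up to 6 consecutive mel bands

augmented = freq_mask(time_mask(spec))   # apply both masks during training only
```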
Attention-based End-to-End Speech-to-Text Model with Transformer
In this part, we convert a sequence of speech feature vectors (Mel-Frequency Cepstral Coefficients, MFCCs) into the sequence of characters that make up the spoken text. For this sequence-to-sequence task, the encoder-decoder Transformer architecture is highly effective.
Encoder
- Convolutional Modules: Extract features from the input MFCCs.
- Positional Encoding: Injects positional information into the input embeddings, since the Transformer processes the entire sequence at once and has no inherent notion of order (a sketch follows this list).
- Layer Normalization: Normalizes activations of hidden layers for better training stability.
- Feed Forward Neural Network: Increases the model's capacity to learn complex relationships.
- Multi-Head Self-Attention: Allows the encoder to attend to different parts of the input sequence simultaneously, capturing various relationships.
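For illustration, a standard sinusoidal positional encoding module (one common choice; a model may instead learn its positional embeddings):

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Standard sinusoidal positional encoding added to the input embeddings."""
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)                    # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)                                   # not a trainable parameter

    def forward(self, x):                # x: (batch, seq_len, d_model)
        return x + self.pe[: x.size(1)]
```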
Decoder
- Positional Encoding: Similar to the encoder.
- Layer Normalization: Similar to the encoder.
- Self-Attention: The decoder attends to its own previous outputs to understand the context generated so far; a causal mask (sketched after this list) prevents it from peeking at future characters.
- Multi-Head Attention: The decoder attends to the encoder outputs to comprehend how the input relates to the sequence being generated.
- Feed Forward Neural Network: Similar to the encoder.
- Linear Classifier: Outputs the predicted character probabilities.
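The causal masking mentioned above can be built as a simple upper-triangular boolean mask; this sketch follows PyTorch's convention where True marks positions a query may not attend to.

```python
import torch

def causal_mask(size):
    """Boolean mask that blocks each decoder position from attending to future positions."""
    # True = not allowed to attend (PyTorch's convention for boolean attention masks)
    return torch.triu(torch.ones(size, size, dtype=torch.bool), diagonal=1)

print(causal_mask(4))
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
```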
- Loss Function: Cross-entropy loss is used; with teacher forcing the decoder predicts one character at a time against a known target, so each step is a standard classification problem and no CTC-style alignment is needed.
- Teacher Forcing: During training, the true previous output is used as input for the next step, so the model learns directly from the correct character sequence (a training-step sketch follows this list).
- Optimizer: AdamW
- Learning Rate Scheduler: ReduceLROnPlateau
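A sketch of one teacher-forced training step with cross-entropy; the model interface, special-token indices, and padded transcript layout are assumptions for illustration.

```python
import torch
import torch.nn as nn

PAD, SOS, EOS = 0, 1, 2                        # assumed special-token indices
criterion = nn.CrossEntropyLoss(ignore_index=PAD)

def train_step(model, optimizer, mfccs, transcripts):
    """One teacher-forced step: feed the true previous characters to the decoder."""
    # transcripts: (batch, max_len) padded character indices, starting with SOS, ending with EOS
    decoder_input = transcripts[:, :-1]        # <sos> c1 c2 ... c_{n-1}
    target = transcripts[:, 1:]                # c1 c2 ... c_n <eos>

    logits = model(mfccs, decoder_input)       # (batch, max_len - 1, vocab_size), assumed interface
    loss = criterion(logits.reshape(-1, logits.size(-1)), target.reshape(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```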
- Greedy Search: An algorithm used during inference (testing) to iteratively predict the most likely character at each step, given the previously generated characters and the encoder outputs (a sketch follows this list).
- Beam Search: A heuristic search algorithm used in sequence generation tasks, like machine translation, to explore multiple candidate sequences efficiently, retaining only the top-scoring ones at each step. Beam Search typically yields better transcriptions than greedy search because it scores whole candidate sentences instead of committing to the single most likely character at each step.
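A sketch of greedy decoding at inference time; the encode/decode methods are an assumed model interface, not a specific library API.

```python
import torch

def greedy_decode(model, mfccs, max_len=200, sos=1, eos=2):
    """Greedy inference: repeatedly append the most likely next character."""
    model.eval()
    tokens = torch.tensor([[sos]])                       # start with <sos>
    with torch.no_grad():
        memory = model.encode(mfccs)                     # encoder outputs (assumed interface)
        for _ in range(max_len):
            logits = model.decode(tokens, memory)        # (1, current_len, vocab_size)
            next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
            tokens = torch.cat([tokens, next_token], dim=1)
            if next_token.item() == eos:                 # stop at end-of-sequence
                break
    return tokens.squeeze(0)[1:]                         # drop <sos>
```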
By understanding these different approaches, we can harness the power of deep learning for speech recognition.
- Course Website: Deep Learning Course - CMU