Speech and Text Processing

Deep learning has revolutionized speech and text processing. But how do we build speech recognition models? It's not as straightforward as it seems: there are several approaches, each with its own complexities. Let's walk through building three models for three different formulations of the speech-to-text problem.

  1. Frame Level Speech Recognition
  2. Automatic Speech Recognition: Utterance to Phoneme transcription
  3. Attention-based end-to-end speech-to-text model with Transformer

1. Frame Level Speech Recognition

Speech data consists of audio recordings, while phonemes represent the smallest sound units ("OH", "AH", etc.). Spectrograms, particularly MelSpectrograms, are commonly used to visually represent speech signals' frequency changes over time. In our dataset, we have audio recordings (utterances) and their corresponding phoneme state (subphoneme) labels from the Wall Street Journal (WSJ).

Inputs: Raw Mel Spectrogram Frame
Outputs: Frame Level Phoneme State Labels

Phonemes Example: ["+BREATH+", "+COUGH+", "+NOISE+", "+SMACK+", "+UH+", "+UM+", "AA", "AE"], and so on, with 40 phoneme labels.
Phonemes are like the building blocks of speech data. One powerful technique in speech recognition is modeling speech as a Markov process with unobserved states, known as phoneme states or subphonemes. Hidden Markov Models (HMMs) estimate parameters to maximize the likelihood of observed speech data.

Instead of HMMs, we take a model-free approach: a Multi-Layer Perceptron (MLP) classifies each Mel Spectrogram frame (with context) and outputs class probabilities over all 40 phonemes.

Feature Extraction

Each utterance is converted into a Mel Spectrogram matrix of shape (T, 27), where T is the number of frames, by performing a Short-Time Fourier Transform (STFT) on small, overlapping segments of the waveform.
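A minimal sketch of this feature-extraction step using torchaudio. The STFT/mel parameters and the file name below are assumptions for illustration; only the 27 mel bins come from the text.

```python
import torch
import torchaudio

# Assumed STFT/mel settings; only n_mels=27 matches the (T, 27) feature matrix above.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,   # assumed sampling rate
    n_fft=400,           # ~25 ms analysis window at 16 kHz
    hop_length=160,      # ~10 ms hop between overlapping segments
    n_mels=27,           # 27 mel bins
)

waveform, sr = torchaudio.load("utterance.wav")   # hypothetical file
mel = mel_transform(waveform)                     # (channels, 27, T)
log_mel = torch.log(mel + 1e-6).squeeze(0).T      # (T, 27), one row per frame
```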

Context

To ensure accurate predictions, we provide context around each frame. For example, a context of 5 means appending 5 frames on each side, turning a single frame into a matrix of shape (11, 27).
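A minimal sketch of context windowing, assuming the frames of one utterance are stored as a NumPy array of shape (T, 27); zero-padding at utterance boundaries is an assumption (repeating edge frames also works).

```python
import numpy as np

def add_context(frames: np.ndarray, context: int) -> np.ndarray:
    """Return, for each frame, a (2*context + 1, num_features) window of neighbours."""
    T, F = frames.shape
    padded = np.pad(frames, ((context, context), (0, 0)), mode="constant")
    # Window i covers padded rows [i, i + 2*context], i.e. original frame i +/- context.
    return np.stack([padded[i:i + 2 * context + 1] for i in range(T)])

windows = add_context(np.random.randn(100, 27), context=5)
print(windows.shape)  # (100, 11, 27): each frame now carries 5 neighbours per side
```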


Cepstral Normalization

Cepstral Normalization helps remove channel effects in speaker recognition. It involves subtracting the mean and dividing by the standard deviation for each coefficient.
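A minimal sketch of per-coefficient cepstral (mean-variance) normalization; computing the statistics per utterance is an assumption.

```python
import numpy as np

def cepstral_normalize(features: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Subtract the mean and divide by the standard deviation of each coefficient.

    features: (T, num_coefficients) for one utterance; statistics are computed
    per coefficient over the time axis.
    """
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + eps)
```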

Building the Network

We used a Pyramid MLP architecture, achieving 88% classification accuracy. The hyperparameters considered and the values chosen are listed below; a minimal model sketch follows the table.

| Hyperparameter | Values Considered | Chosen |
| --- | --- | --- |
| Number of Layers | 2-8 | 8 |
| Activations | ReLU, LeakyReLU, softplus, tanh, sigmoid, Mish, GELU | GELU |
| Batch Size | 64, 128, 256, 512, 1024, 2048 | 1024 |
| Architecture | Cylinder, Pyramid, Inverse-Pyramid, Diamond | Pyramid |
| Dropout | 0-0.5, dropout in alternate layers | 0.25 |
| LR Scheduler | Fixed, StepLR, ReduceLROnPlateau, Exponential, CosineAnnealing | CosineAnnealing |
| Weight Initialization | Gaussian, Xavier, Kaiming (Normal and Uniform), Random, Uniform | Kaiming |
| Context | 10-50 | 20 |
| Batch-Norm | Before or after activation; every layer, alternate layers, or none | Every layer |
| Optimizer | Vanilla SGD, Nesterov's momentum, RMSProp, Adam | AdamW |
| Regularization | Weight Decay | - |
| LR | 0.001 | 0.001 |
| Normalization | Cepstral Normalization | Cepstral Normalization |
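A minimal sketch of a pyramid-shaped MLP frame classifier reflecting the chosen hyperparameters above (8 layers, GELU, batch norm in every layer, dropout 0.25, context 20); the specific layer widths are assumptions.

```python
import torch.nn as nn

class PyramidMLP(nn.Module):
    """Pyramid MLP: hidden widths shrink layer by layer down to 40 phoneme classes."""

    def __init__(self, context: int = 20, num_feats: int = 27, num_classes: int = 40):
        super().__init__()
        in_dim = (2 * context + 1) * num_feats                   # flattened context window
        widths = [2048, 2048, 1024, 1024, 512, 512, 256, 128]    # assumed pyramid widths
        layers, prev = [], in_dim
        for w in widths:
            layers += [nn.Linear(prev, w), nn.BatchNorm1d(w), nn.GELU(), nn.Dropout(0.25)]
            prev = w
        layers.append(nn.Linear(prev, num_classes))              # logits over 40 phonemes
        self.net = nn.Sequential(*layers)

    def forward(self, x):                                        # x: (batch, (2*context+1)*27)
        return self.net(x)
```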

2. Automatic Speech Recognition: Utterance to Phoneme transcription

In this problem, we use a neural network to transcribe an audio recording directly into a phoneme sequence: a recording of someone saying "yes" should produce /Y/ /EH/ /S/. Our focus is on RNNs combined with the dynamic-programming algorithm known as Connectionist Temporal Classification (CTC) to produce these labels.

Problem

Standard speech recognition often involves labeling each frame (time step) of the recording with a phoneme. However, spoken words have variable lengths, making this approach unnatural. We want to directly output the phoneme sequence without worrying about exact timing.


Challenge:

Converting a variable-length speech recording (represented as a sequence of feature vectors) into a phoneme sequence of a different length, with no one-to-one correspondence in timing, can be described as producing order-aligned but time-asynchronous labels. PyTorch provides functions like pad_sequence(), pack_padded_sequence(), and pad_packed_sequence() for padding and packing variable-length sequences efficiently (see the sketch below).
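A minimal sketch of how these utilities handle a batch of variable-length utterances; the lengths and hidden size are illustrative.

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# Three utterances of different lengths, each (T_i, 27).
utterances = [torch.randn(t, 27) for t in (95, 60, 120)]
lengths = torch.tensor([u.shape[0] for u in utterances])

# Pad to a common length so they can be stacked into one batch tensor.
padded = pad_sequence(utterances, batch_first=True)          # (3, 120, 27)

# Pack so the RNN skips the padded time steps.
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)
lstm = torch.nn.LSTM(input_size=27, hidden_size=256, batch_first=True)
packed_out, _ = lstm(packed)

# Unpack back to a padded tensor plus the true lengths.
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)  # (3, 120, 256)
```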


A two-stage approach:

  1. RNN network predicts probabilities for each phoneme at every time step.
  2. CTC algorithm performs dynamic programming to generate the final phoneme sequence from the probabilities.

RNNs

Their ability to capture temporal dependencies makes them suitable for analyzing sequential data like speech.

CTC

  • Decoding probabilities from the RNN's output at every time step.

  • Employing dynamic programming to find the most likely phoneme sequence based on the probabilities.

  • Utilizing a "blank" symbol to handle silent regions and repetitions (a minimal decoding sketch follows this list).

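A minimal sketch of the blank-and-repeat collapsing rule from the list above, applied to a greedy (best-path) decode of the per-frame probabilities; the blank index is an assumption.

```python
import torch

def ctc_greedy_decode(log_probs: torch.Tensor, blank: int = 0) -> list:
    """Best-path CTC decode for one utterance.

    log_probs: (T, num_classes) per-frame log-probabilities from the RNN.
    Rule: take the argmax at each frame, merge consecutive repeats, drop blanks.
    """
    best_path = log_probs.argmax(dim=-1).tolist()   # most likely symbol per frame
    decoded, prev = [], blank
    for symbol in best_path:
        if symbol != prev and symbol != blank:
            decoded.append(symbol)
        prev = symbol
    return decoded

# e.g. frames predicting [blank, A, A, blank, B, B, B] collapse to [A, B]
```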

Building the Model

The model is an RNN that processes the speech feature vectors and outputs a sequence of probability vectors over the phonemes (including a blank symbol) at each time step. The architecture involves the following components; a minimal sketch follows the list.

  • 1D convolutional layers (CNNs) to capture local dependencies in the speech features.
  • Bidirectional LSTMs (BLSTMs) to capture long-term contextual information.
  • Pyramidal LSTMs (pBLSTMs) for potential downsampling of the input sequence.
  • A final layer with softmax activation converts the hidden representations into phoneme probabilities.
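A minimal sketch of this architecture (one Conv1d frontend, a BLSTM, one pBLSTM-style downsampling layer, and a linear + log-softmax output). The layer sizes and the concatenate-adjacent-frames pBLSTM variant are assumptions.

```python
import torch
import torch.nn as nn

class CTCSpeechModel(nn.Module):
    """Conv1d frontend -> BLSTM -> pyramidal BLSTM downsampling -> per-frame phoneme log-probs."""

    def __init__(self, num_feats=27, hidden=256, num_classes=41):  # 40 phonemes + blank
        super().__init__()
        self.conv = nn.Conv1d(num_feats, hidden, kernel_size=3, padding=1)  # local context
        self.blstm = nn.LSTM(hidden, hidden, bidirectional=True, batch_first=True)
        # pBLSTM-style step: concatenate pairs of adjacent frames, halving the time axis.
        self.pblstm = nn.LSTM(4 * hidden, hidden, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                                       # x: (batch, T, num_feats)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)        # (batch, T, hidden)
        x, _ = self.blstm(x)                                    # (batch, T, 2*hidden)
        B, T, H = x.shape
        x = x[:, : T - (T % 2)].reshape(B, T // 2, 2 * H)       # (batch, T/2, 4*hidden)
        x, _ = self.pblstm(x)                                   # (batch, T/2, 2*hidden)
        return self.classifier(x).log_softmax(dim=-1)           # (batch, T/2, num_classes)
```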

Training the Network

Training data consists of speech recordings and their corresponding phoneme sequences. Since the target phoneme sequence is shorter and asynchronous compared to the input, we need a way to compute the loss function for training.

  • Viterbi Training: Finds the single most likely alignment between the phoneme sequence and the input using the Viterbi algorithm.
  • CTC Loss: Calculates the expected loss over all possible alignments using dynamic programming (the forward-backward algorithm). PyTorch's CTCLoss function can be used for this purpose (see the sketch below).
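A minimal sketch of computing the CTC loss with PyTorch's nn.CTCLoss; the shapes, lengths, and blank index are illustrative. Note that nn.CTCLoss expects log-probabilities shaped (T, batch, classes) and targets as one concatenated label tensor.

```python
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0)                 # index 0 reserved for the blank symbol

batch, T, num_classes = 4, 120, 41             # 40 phonemes + blank
log_probs = torch.randn(T, batch, num_classes, requires_grad=True).log_softmax(dim=-1)
input_lengths = torch.full((batch,), T, dtype=torch.long)            # valid frames per utterance
target_lengths = torch.tensor([30, 25, 40, 18])                      # phonemes per utterance
targets = torch.randint(1, num_classes, (int(target_lengths.sum()),))  # concatenated labels

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```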

Using this network along with appropriate speech data transformations such as Time Masking and Frequency Masking, we've managed to attain a Levenshtein Distance of 6. This approach forms the foundation for various speech processing applications like voice assistants and automatic captioning.
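A minimal sketch of the time- and frequency-masking transforms using torchaudio (SpecAugment-style); the mask widths are assumptions, not the values used in the project.

```python
import torch
import torchaudio.transforms as T

freq_mask = T.FrequencyMasking(freq_mask_param=4)   # zero out up to 4 consecutive mel bins
time_mask = T.TimeMasking(time_mask_param=20)       # zero out up to 20 consecutive frames

spectrogram = torch.randn(1, 27, 300)               # (channel, n_mels, time), illustrative
augmented = time_mask(freq_mask(spectrogram))       # apply both masks during training only
```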

3. Attention-based end-to-end speech-to-text model with Transformer

In this part, we convert a sequence of speech feature vectors (Mel-Frequency Cepstral Coefficients, MFCCs) into the sequence of characters representing the spoken text. For this sequence-to-sequence task, an encoder-decoder Transformer architecture works well.

Architecture

Transformer Encoder (CNN-LSTM):

  • Convolutional Modules: Extract features from the input MFCCs.
  • Positional Encoding: Injects positional information into the input embeddings, since the Transformer processes the entire sequence at once and has no inherent notion of order (a sketch follows this list).
  • Layer Normalization: Normalizes activations of hidden layers for better training stability.
  • Feed Forward Neural Network: Increases the model's capacity to learn complex relationships.
  • Multi-Head Self-Attention: Allows the encoder to attend to different parts of the input sequence simultaneously, capturing various relationships.
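A minimal sketch of sinusoidal positional encoding as described above, using the standard sin/cos formulation; d_model and max_len are assumptions.

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Add fixed sinusoidal position information to a (batch, T, d_model) sequence."""

    def __init__(self, d_model: int = 256, max_len: int = 5000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)                       # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)                        # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)                        # odd dimensions
        self.register_buffer("pe", pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:                     # x: (batch, T, d_model)
        return x + self.pe[: x.size(1)]
```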

Transformer Decoder:

  • Positional Encoding: Similar to the encoder.
  • Layer Normalization: Similar to the encoder.
  • Self-Attention: The decoder attends to its own previous outputs to understand the generated context so far.
  • Multi-Head Attention: The decoder attends to the encoder outputs to comprehend how the input relates to the sequence being generated.
  • Feed Forward Neural Network: Similar to the encoder.
  • Linear Classifier: Outputs the predicted character probabilities.

Speech Transformer:

  • Combines the encoder and decoder to form the complete speech recognition model (a minimal sketch follows).

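A minimal sketch of how the encoder and decoder could be combined using PyTorch's built-in Transformer layers. The CNN-LSTM frontend is reduced to a single linear projection, positional encoding is omitted for brevity (see the sketch above), and all sizes and the vocabulary are assumptions.

```python
import torch
import torch.nn as nn

class SpeechTransformer(nn.Module):
    """Encoder over MFCC frames, decoder over characters, linear classifier on top."""

    def __init__(self, num_feats=27, d_model=256, vocab_size=30, nhead=4, num_layers=2):
        super().__init__()
        self.encoder_proj = nn.Linear(num_feats, d_model)   # stands in for the CNN-LSTM frontend
        self.char_embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.classifier = nn.Linear(d_model, vocab_size)    # character logits

    def forward(self, mfccs, chars):                        # mfccs: (B, T, 27), chars: (B, L)
        memory = self.encoder(self.encoder_proj(mfccs))
        L = chars.size(1)
        # Causal mask so each position only attends to earlier characters.
        causal = torch.triu(torch.full((L, L), float("-inf"), device=chars.device), diagonal=1)
        out = self.decoder(self.char_embed(chars), memory, tgt_mask=causal)
        return self.classifier(out)                         # (B, L, vocab_size)
```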

Training:

  • Loss Function: Cross-entropy loss is used; with teacher forcing the decoder predicts one character at a time against an aligned target sequence, so no CTC-style alignment over time is needed.
  • Teacher Forcing: During training, the true previous output is used as input for the next step, so the model directly learns the correct character sequence (see the training-step sketch below).
  • Optimizer: AdamW
  • Learning Rate Scheduler: ReduceLROnPlateau
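A minimal sketch of one teacher-forced training step with cross-entropy, AdamW, and ReduceLROnPlateau, assuming the SpeechTransformer sketch above and hypothetical <sos> and padding token indices.

```python
import torch
import torch.nn as nn

PAD, SOS = 0, 1                                    # hypothetical special-token indices
model = SpeechTransformer()                        # from the sketch above
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min")
criterion = nn.CrossEntropyLoss(ignore_index=PAD)  # ignore padded positions

mfccs = torch.randn(8, 200, 27)                    # (batch, T, num_feats), illustrative
chars = torch.randint(2, 30, (8, 25))              # target character indices, illustrative

# Teacher forcing: feed the true previous characters, predict the next ones.
decoder_input = torch.cat([torch.full((8, 1), SOS), chars[:, :-1]], dim=1)
logits = model(mfccs, decoder_input)               # (batch, L, vocab_size)
loss = criterion(logits.reshape(-1, logits.size(-1)), chars.reshape(-1))

optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step(loss.item())                        # ReduceLROnPlateau monitors a metric
```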

Inference

  • Greedy Search: An algorithm used during inference to iteratively predict the most likely character at each step, conditioned on the previously generated characters and the encoder outputs (see the sketch below).
  • Beam Search: A heuristic search algorithm for sequence generation that keeps only the top-scoring candidate sequences at each step. It typically produces better transcriptions than greedy search because it scores whole-sequence probabilities instead of committing to the single best character at each step.
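A minimal sketch of greedy decoding with the SpeechTransformer sketch above; the <sos>/<eos> indices and maximum length are assumptions.

```python
import torch

@torch.no_grad()
def greedy_decode(model, mfccs, sos=1, eos=2, max_len=200):
    """Predict one character at a time, always taking the most likely next character."""
    batch = mfccs.size(0)
    chars = torch.full((batch, 1), sos, dtype=torch.long)       # start every sequence with <sos>
    for _ in range(max_len):
        logits = model(mfccs, chars)                            # (batch, len, vocab_size)
        next_char = logits[:, -1].argmax(dim=-1, keepdim=True)  # best next character
        chars = torch.cat([chars, next_char], dim=1)
        if (next_char == eos).all():                            # stop once every sequence ended
            break
    return chars[:, 1:]                                         # drop the <sos> token
```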

By understanding these different approaches, we can harness the power of deep learning for speech recognition.
