LipNet is a lip-reading network: the model learns from lip movements in silent video and predicts the corresponding text, without any audio.

LipNet: Lip Reading Model with TensorFlow 🎥🎮

Overview 🚀

LipNet is an advanced deep learning model designed for lip reading. It takes silent video clips as input, analyzes lip movements, and predicts the corresponding text captions. By leveraging cutting-edge neural network architectures like 3D Convolutional Layers, Bidirectional LSTMs, and Connectionist Temporal Classification (CTC), LipNet achieves impressive results in translating visual lip movements into textual representations.


Features 🔄

  • Input: Silent videos with lip movements.
  • Output: Accurate text predictions based on lip movement.
  • Pretrained Weights: Use pretrained weights for evaluation or continue training for fine-tuning.
  • Data Pipeline: Custom TensorFlow dataset for handling video frames and text alignments.
  • Model Architecture: Combination of 3D convolutional layers, LSTMs, and dense layers.
  • Callbacks: Custom callbacks for monitoring predictions during training.

Dataset Structure 🌐

  1. Video Files: Stored in data/s1/ with a .mpg extension.
  2. Alignments: Text annotations corresponding to the lip movements, in data/alignments/s1/ (a minimal parser is sketched below the example layout).

Example:

data/
  s1/
    video1.mpg
    video2.mpg
  alignments/
    s1/
      video1.align
      video2.align
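
Assuming the .align files use the GRID-corpus convention (one word per line as "start end word", with the token sil marking silence), a minimal parser might look like the following; load_alignments is an illustrative helper name, not necessarily the repository's:

def load_alignments(path):
    # GRID-style line format: "<start> <end> <word>"; silence markers are dropped
    words = []
    with open(path, 'r') as f:
        for line in f:
            start, end, word = line.split()
            if word != 'sil':
                words.append(word)
    return ' '.join(words)

print(load_alignments('data/alignments/s1/video1.align'))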

Training the Model 💡

  1. Define Vocabulary:

    vocab = [x for x in "abcdefghijklmnopqrstuvwxyz'?!123456789 "]
  2. Load and Preprocess Data: Videos are split into frames, normalized, and paired with text alignments (see the preprocessing sketch after this list).

  3. Build the Model: Combines Conv3D layers for feature extraction, Bidirectional LSTMs for sequence modeling, and Dense layers for character predictions.

  4. Loss Function: CTC Loss to handle variable-length sequences.

  5. Callbacks: Includes checkpoints, learning rate schedulers, and custom callbacks to monitor predictions.

  6. Resume Training: Resume training from a specific epoch if needed.
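
As referenced in step 2, here is a minimal preprocessing sketch: character lookups built from the vocabulary, plus a frame loader using OpenCV. The crop coordinates, helper names, and normalization are illustrative assumptions, not necessarily the repository's exact code:

import cv2
import tensorflow as tf

vocab = [x for x in "abcdefghijklmnopqrstuvwxyz'?!123456789 "]
# Map characters to integer ids and back (Keras StringLookup handles the OOV token)
char_to_num = tf.keras.layers.StringLookup(vocabulary=vocab, oov_token="")
num_to_char = tf.keras.layers.StringLookup(
    vocabulary=char_to_num.get_vocabulary(), oov_token="", invert=True)

def load_video(path):
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Crop roughly to the mouth region and keep a channel axis (coordinates are illustrative)
        frames.append(frame[190:236, 80:220, None])
    cap.release()
    frames = tf.cast(tf.stack(frames), tf.float32)
    # Standardize to zero mean / unit variance
    return (frames - tf.math.reduce_mean(frames)) / tf.math.reduce_std(frames)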

Training Commands:

model.fit(
    train,
    validation_data=test,
    epochs=100,
    callbacks=[checkpoint_callback, reduce_lr, early_stopping, example_callback]
)
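
The callbacks referenced above are not defined in this snippet. One plausible setup with standard Keras callbacks (the file path, monitored metric, and patience values are illustrative):

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    'checkpoints/lipnet.weights.h5', monitor='loss', save_weights_only=True)
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='loss', factor=0.5, patience=5)
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=15,
                                                  restore_best_weights=True)
example_callback = ProductExampleCallback(test)  # custom callback defined in the Callbacks section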

Evaluate the Model 🔍

  1. Load Pretrained Weights:

    model.load_weights('new_best_weights2.weights.h5')
  2. Prediction:

    • Pass a silent video to the model and decode the output (a decode-to-text sketch follows this list).
    • Example:
      yhat = model.predict(sample[0])
      decoded = tf.keras.backend.ctc_decode(yhat, [75], greedy=True)[0][0].numpy()
  3. Visualize Output:

    plt.imshow(frames[40])  # Visualize a specific frame
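
ctc_decode returns character indices rather than text. Assuming the num_to_char StringLookup from the preprocessing sketch above, the indices can be joined back into strings:

# Entries of -1 are CTC padding; drop them before mapping indices back to characters
predicted_text = [
    tf.strings.reduce_join(num_to_char(seq[seq != -1])).numpy().decode('utf-8')
    for seq in decoded
]
print(predicted_text)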

Visualization with GIFs 🎥

To enhance understanding, add GIFs of the following (a small helper for writing them is sketched at the end of this section):

  1. Input Video Frames: Showing the lip movements of the speaker.
  2. Predicted Text: Overlay the predicted captions on the video.

Input Video Example
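
A small helper for writing such a GIF, assuming imageio is installed and that frames is the tensor of standardized frames produced during preprocessing (values are rescaled to uint8 first):

import imageio
import numpy as np

# Rescale each standardized frame to 0-255 and drop the channel axis for imageio
gif_frames = [np.squeeze(np.uint8(255 * (f - f.min()) / (f.max() - f.min() + 1e-8)))
              for f in frames.numpy()]
imageio.mimsave('lip_movement.gif', gif_frames, fps=10)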


Model Architecture 🎨

Layers (a minimal Keras stack is sketched after this list):

  • Conv3D: Extract spatiotemporal features from video frames.
  • BatchNormalization: Normalize activations for faster convergence.
  • MaxPooling3D: Reduce spatial dimensions.
  • Bidirectional LSTM: Capture sequential dependencies from both directions.
  • Dense: Output layer with vocabulary size + CTC blank token.
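
One possible Keras stack matching this layer list; the filter counts, kernel sizes, and the 75x46x140x1 input shape are illustrative (chosen to match the preprocessing sketch above) rather than the repository's exact configuration:

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(75, 46, 140, 1)),  # 75 frames of 46x140 greyscale mouth crops
    tf.keras.layers.Conv3D(128, 3, padding='same', activation='relu'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.MaxPool3D((1, 2, 2)),
    tf.keras.layers.Conv3D(256, 3, padding='same', activation='relu'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.MaxPool3D((1, 2, 2)),
    tf.keras.layers.Conv3D(75, 3, padding='same', activation='relu'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.MaxPool3D((1, 2, 2)),
    # Flatten each time step, keeping the temporal axis for the recurrent layers
    tf.keras.layers.TimeDistributed(tf.keras.layers.Flatten()),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(len(vocab) + 1, activation='softmax')  # +1 for the CTC blank token
])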

Custom Loss:

def CTCLoss(y_true, y_pred):
    # Per-sample input/label lengths are required by ctc_batch_cost
    batch_len = tf.cast(tf.shape(y_true)[0], dtype="int64")
    input_length = tf.cast(tf.shape(y_pred)[1], dtype="int64") * tf.ones((batch_len, 1), dtype="int64")
    label_length = tf.cast(tf.shape(y_true)[1], dtype="int64") * tf.ones((batch_len, 1), dtype="int64")
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_length, label_length)
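
The loss is attached at compile time; the optimizer and learning rate here are illustrative:

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4), loss=CTCLoss)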

Testing with Videos 🎞️

  1. Input Video:

    sample_video = load_data('data/s1/sample_video.mpg')
  2. Predict:

    yhat = model.predict(tf.expand_dims(sample_video[0], axis=0))
  3. Decode and Compare:

    decoded = tf.keras.backend.ctc_decode(yhat, [75], greedy=True)[0][0].numpy()
    # decoded holds character indices; map them back to text with the num_to_char lookup
    print("Predicted:", decoded)

Callbacks 📊

Example Callback:

  • Displays predictions at the end of each epoch.
class ProductExampleCallback(tf.keras.callbacks.Callback):
    def __init__(self, dataset):
        # Iterate over the (frames, alignments) batches of a tf.data dataset
        self.dataset = dataset.as_numpy_iterator()

    def on_epoch_end(self, epoch, logs=None):
        data = self.dataset.next()
        yhat = self.model.predict(data[0])
        decoded = tf.keras.backend.ctc_decode(yhat, [75, 75], greedy=True)[0][0].numpy()
        print("Predictions:", decoded)

Future Enhancements 🌍

  1. Fine-tune on larger datasets for better accuracy.
  2. Integrate with real-time video streams for live lip reading.
  3. Add support for multilingual datasets.

