LipNet: Lip Reading Model with TensorFlow 🎥🎮

Overview 🚀

LipNet is an advanced deep learning model designed for lip reading. It takes silent video clips as input, analyzes lip movements, and predicts the corresponding text captions. By leveraging cutting-edge neural network architectures like 3D Convolutional Layers, Bidirectional LSTMs, and Connectionist Temporal Classification (CTC), LipNet achieves impressive results in translating visual lip movements into textual representations.


Features 🔄

  • Input: Silent videos with lip movements.
  • Output: Accurate text predictions based on lip movement.
  • Pretrained Weights: Use pretrained weights for evaluation or continue training for fine-tuning.
  • Data Pipeline: Custom TensorFlow dataset for handling video frames and text alignments.
  • Model Architecture: Combination of 3D convolutional layers, LSTMs, and dense layers.
  • Callbacks: Custom callbacks for monitoring predictions during training.

Dataset Structure 🌐

  1. Video Files: Stored in data/s1/ with a .mpg extension.
  2. Alignments: Text annotations corresponding to the lip movements in data/alignments/s1/.

Example:

data/
  s1/
    video1.mpg
    video2.mpg
  alignments/
    s1/
      video1.align
      video2.align
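
The repository's data pipeline handles this loading; as a rough illustration, a loader for one video/alignment pair could look like the sketch below (the load_video/load_alignments names, the grayscale conversion, the fixed mouth-crop window, and the per-video normalization are illustrative assumptions, not the exact repository code):

import cv2
import tensorflow as tf

def load_video(path):
    # Read every frame, convert to grayscale, and crop a fixed mouth region
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        frame = tf.image.rgb_to_grayscale(frame)
        frames.append(frame[190:236, 80:220, :])  # assumed crop window around the lips
    cap.release()
    # Normalize with the per-video mean and standard deviation
    frames = tf.cast(tf.stack(frames), tf.float32)
    return (frames - tf.math.reduce_mean(frames)) / tf.math.reduce_std(frames)

def load_alignments(path):
    # Each .align line is "start end word"; silence tokens are dropped
    with open(path, 'r') as f:
        tokens = [line.split() for line in f]
    return ' '.join(t[2] for t in tokens if len(t) == 3 and t[2] != 'sil')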

Training the Model 💡

  1. Define Vocabulary (the character-to-index mapping built from it is sketched after this list):

    vocab = [x for x in "abcdefghijklmnopqrstuvwxyz'?!123456789 "]
  2. Load and Preprocess Data: Videos are split into frames, normalized, and paired with text alignments.

  3. Build the Model: Combines Conv3D layers for feature extraction, Bidirectional LSTMs for sequence modeling, and Dense layers for character predictions.

  4. Loss Function: CTC Loss to handle variable-length sequences.

  5. Callbacks: Includes checkpoints, learning rate schedulers, and custom callbacks to monitor predictions.

  6. Resume Training: Resume training from a specific epoch if needed.
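
The vocabulary from step 1 is usually wrapped in a pair of StringLookup layers so characters can be mapped to integer indices and back; a minimal sketch (the names char_to_num and num_to_char are assumptions, reused in the decoding snippets later in this README):

import tensorflow as tf

# vocab is the character list defined in step 1
char_to_num = tf.keras.layers.StringLookup(vocabulary=vocab, oov_token="")
num_to_char = tf.keras.layers.StringLookup(
    vocabulary=char_to_num.get_vocabulary(), oov_token="", invert=True
)

# Round trip: "hello" -> token ids -> "hello"
tokens = char_to_num(tf.strings.unicode_split("hello", input_encoding="UTF-8"))
text = tf.strings.reduce_join(num_to_char(tokens)).numpy().decode("utf-8")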

Training Commands:

model.fit(
    train,
    validation_data=test,
    epochs=100,
    callbacks=[checkpoint_callback, reduce_lr, early_stopping, example_callback]
)
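
The callbacks referenced above are not defined elsewhere in this README; one plausible setup is sketched below (the monitored metric, factor, and patience values are assumptions; the checkpoint file name matches the weights loaded in the evaluation section, and example_callback is the custom callback described in the Callbacks section):

import tensorflow as tf

# Save the best weights seen so far
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    'new_best_weights2.weights.h5', monitor='loss',
    save_weights_only=True, save_best_only=True
)
# Shrink the learning rate when the loss plateaus
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='loss', factor=0.5, patience=5)
# Stop training if the loss stops improving
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=15, restore_best_weights=True)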

Evaluate the Model 🔍

  1. Load Pretrained Weights:

    model.load_weights('new_best_weights2.weights.h5')
  2. Prediction:

    • Pass a silent video to the model and decode the output (a decoding-to-text sketch follows this list).
    • Example:
      yhat = model.predict(sample[0])
      decoded = tf.keras.backend.ctc_decode(yhat, [75], greedy=True)[0][0].numpy()
  3. Visualize Output:

    plt.imshow(frames[40])  # Visualize a specific frame
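
ctc_decode returns integer indices rather than characters; a short sketch of turning them into a readable caption, assuming the num_to_char lookup from the vocabulary sketch in the training section:

# Join the decoded character indices of the first sequence into a string
predicted_text = tf.strings.reduce_join(num_to_char(decoded[0])).numpy().decode('utf-8')
print("Predicted:", predicted_text)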

Visualization with GIFs 🎥

To enhance understanding, add GIFs of:

  1. Input Video Frames: Showing the lip movements of the speaker.
  2. Predicted Text: Overlay the predicted captions on the video.

Input Video Example


Model Architecture 🎨

Layers (a build sketch follows this list):

  • Conv3D: Extract spatiotemporal features from video frames.
  • BatchNormalization: Normalize activations for faster convergence.
  • MaxPooling3D: Reduce spatial dimensions.
  • Bidirectional LSTM: Capture sequential dependencies from both directions.
  • Dense: Output layer with vocabulary size + CTC blank token.
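
A minimal sketch of how these layers might be stacked (filter counts, kernel sizes, dropout, and the 75×46×140×1 input shape are assumptions based on a typical LipNet-style setup, not necessarily this repository's exact configuration; char_to_num is the StringLookup from the vocabulary sketch in the training section):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv3D, BatchNormalization, MaxPool3D,
                                     TimeDistributed, Flatten, Bidirectional,
                                     LSTM, Dropout, Dense)

model = Sequential([
    # Spatiotemporal feature extraction over the frame sequence
    Conv3D(128, 3, padding='same', activation='relu', input_shape=(75, 46, 140, 1)),
    BatchNormalization(),
    MaxPool3D((1, 2, 2)),
    Conv3D(256, 3, padding='same', activation='relu'),
    BatchNormalization(),
    MaxPool3D((1, 2, 2)),
    Conv3D(75, 3, padding='same', activation='relu'),
    BatchNormalization(),
    MaxPool3D((1, 2, 2)),
    # Flatten each time step's feature map before the recurrent layers
    TimeDistributed(Flatten()),
    Bidirectional(LSTM(128, return_sequences=True)),
    Dropout(0.5),
    Bidirectional(LSTM(128, return_sequences=True)),
    Dropout(0.5),
    # One unit per vocabulary entry plus the CTC blank token
    Dense(char_to_num.vocabulary_size() + 1, activation='softmax'),
])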

Custom Loss:

def CTCLoss(y_true, y_pred):
    # Derive the per-sample sequence lengths from the batch shapes
    batch_len = tf.cast(tf.shape(y_true)[0], dtype="int64")
    input_length = tf.cast(tf.shape(y_pred)[1], dtype="int64") * tf.ones((batch_len, 1), dtype="int64")
    label_length = tf.cast(tf.shape(y_true)[1], dtype="int64") * tf.ones((batch_len, 1), dtype="int64")
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_length, label_length)
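
The custom loss is then wired in with compile before calling fit; a one-line sketch (the Adam learning rate is an assumption):

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4), loss=CTCLoss)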

Testing with Videos 🎞️

  1. Input Video:

    sample_video = load_data('data/s1/sample_video.mpg')
  2. Predict:

    yhat = model.predict(tf.expand_dims(sample_video[0], axis=0))
  3. Decode and Compare (a ground-truth comparison sketch follows this list):

    decoded = tf.keras.backend.ctc_decode(yhat, [75], greedy=True)[0][0].numpy()
    # num_to_char is the inverse StringLookup from the vocabulary sketch in the training section
    decoded_text = tf.strings.reduce_join(num_to_char(decoded[0])).numpy().decode('utf-8')
    print("Predicted: ", decoded_text)

Callbacks 📊

Example Callback:

  • Displays predictions at the end of each epoch.

class ProductExampleCallback(tf.keras.callbacks.Callback):
    def __init__(self, dataset):
        # Keep an iterator over the dataset so each epoch prints a fresh example
        self.dataset = dataset.as_numpy_iterator()

    def on_epoch_end(self, epoch, logs=None):
        data = self.dataset.next()
        yhat = self.model.predict(data[0])
        # [75, 75]: one input length per video in the batch of two
        decoded = tf.keras.backend.ctc_decode(yhat, [75, 75], greedy=True)[0][0].numpy()
        print("Predictions:", decoded)

Future Enhancements 🌍

  1. Fine-tune on larger datasets for better accuracy.
  2. Integrate with real-time video streams for live lip reading.
  3. Add support for multilingual datasets.