Authors: Jimin Tan, Chenqin Yang, Yakun Wang, Yash Deshpande
Project Website: https://tanjimin.github.io/unsupervised-video-dubbing/
Training code for the dubbing model is under the root directory. We used a pre-processed LRW dataset for training; see `data.py` for details.

We created a simple deployment pipeline, which can be found under the `post_processing` subdirectory. The pipeline takes the model weights we pre-trained on LRW, plus a base video and an audio segment of equal duration, and outputs a dubbed video driven by the audio. See the instructions below for more details.
Dependencies:

- LibROSA 0.7.2
- dlib 19.19
- OpenCV 4.2.0
- Pillow 6.2.2
- PyTorch 1.2.0
- TorchVision 0.4.0
File structure of the `post_processing` directory:

```
.
├── source
│   ├── audio_driver_mp4        # contains audio drivers (saved in mp4 format)
│   ├── audio_driver_wav        # contains audio drivers (saved in wav format)
│   ├── base_video              # contains base videos (videos you'd like to modify)
│   ├── dlib                    # trained dlib models
│   └── model                   # trained landmark generation models
├── main.py                     # main function for post-processing
├── main_support.py             # support functions used in main.py
├── models.py                   # defines the landmark generation model
├── step_3_vid2vid.sh           # Bash script for running vid2vid
├── step_4_denoise.sh           # Bash script for denoising vid2vid results
├── compare_openness.ipynb      # mouth openness comparison across generated videos
└── README.md
```
- `shape_predictor_68_face_landmarks.dat`

  This model is trained on the ibug 300-W dataset (https://ibug.doc.ic.ac.uk/resources/facial-point-annotations/). The license for this dataset excludes commercial use, and Stefanos Zafeiriou, one of the creators of the dataset, asked that a note be included stating that the trained model therefore cannot be used in a commercial product. You should contact a lawyer or talk to Imperial College London to find out whether it is OK for you to use this model in a commercial product.

  {C. Sagonas, E. Antonakos, G. Tzimiropoulos, S. Zafeiriou, M. Pantic. 300 Faces In-The-Wild Challenge: Database and results. Image and Vision Computing (IMAVIS), Special Issue on Facial Landmark Localisation "In-The-Wild". 2016.}
- Go to the `post_processing` directory
- Run `python3 main.py -r step`, where `step` is the number of the corresponding step below
  - e.g. `python3 main.py -r 1` runs the first step, and so on
Step 1: Generate mouth landmarks

- Input
  - Base video file path (`./source/base_video/base_video.mp4`)
  - Audio driver file path (`./source/audio_driver_wav/audio_driver.wav`)
  - Epoch (`int`)
- Output (`./result`)
  - `keypoints.npy`: generated landmarks in `npy` format
  - `source.txt`: contains information about the base video, audio driver, and model epoch
- Process (a minimal extraction sketch follows this list)
  - Extract facial landmarks from the base video
  - Extract MFCC features from the driver audio
  - Pass MFCC features and facial landmarks into the model to retrieve mouth landmarks
  - Combine facial & mouth landmarks and save them in `npy` format
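For orientation, the extraction in this step looks roughly like the sketch below. It is only an illustration, not the pipeline's exact code: the predictor path, sample rate, `n_mfcc`, and `hop_length` are assumptions, and the authoritative MFCC parameters live in the `extract_mfcc` function mentioned in the notes at the end of this README.

```python
# Sketch of step 1's per-frame landmark and MFCC extraction (illustrative only).
import cv2
import dlib
import librosa
import numpy as np

# 68-point predictor shipped under ./source/dlib (see the license note above)
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("./source/dlib/shape_predictor_68_face_landmarks.dat")

def face_landmarks(frame):
    """Return the 68 (x, y) facial landmarks of the first detected face, or None."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if len(faces) == 0:
        return None
    shape = predictor(gray, faces[0])
    return np.array([[p.x, p.y] for p in shape.parts()])

# MFCC features for the driver audio; sr, n_mfcc and hop_length are assumptions --
# the real values are set in extract_mfcc (see the notes at the end)
audio, sr = librosa.load("./source/audio_driver_wav/audio_driver.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13, hop_length=int(sr / 25))
```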
Step 2: Visualize generated landmarks as frames

- Input
  - None
- Output (`./result`)
  - Folder `save_keypoints`: visualized generated frames
  - Folder `save_keypoints_csv`: landmark coordinates for each frame, saved in `txt` format
  - `openness.png`: mouth openness measured and plotted across all frames
- Process (an openness-plot sketch follows this list)
  - Generate images from the `npy` file
  - Generate the openness plot
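As a rough idea of how the openness plot is produced, the sketch below measures the vertical gap between the inner-lip landmarks (indices 62 and 66 in the 68-point convention) for each frame. The array shape and the exact metric are assumptions; `compare_openness.ipynb` holds the actual comparison code.

```python
# Illustrative mouth-openness plot from the generated landmarks.
# The (n_frames, 68, 2) shape and the 62/66 inner-lip indices are assumptions.
import numpy as np
import matplotlib.pyplot as plt

keypoints = np.load("./result/keypoints.npy")
openness = np.abs(keypoints[:, 66, 1] - keypoints[:, 62, 1])  # vertical lip gap per frame

plt.figure()
plt.plot(openness)
plt.xlabel("frame index")
plt.ylabel("mouth openness (pixels)")
plt.savefig("./result/openness.png")
```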
Step 3: Run vid2vid

- Input
  - None
- Output
  - Folder: vid2vid generated images
  - The path of the fake images generated by vid2vid is shown at the end; please copy them back to `/result/vid2vid_frames/` (see the copy sketch after this list)
- Process
  - Run vid2vid
  - Copy the vid2vid results back to the main folder
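The copy-back itself is just a directory copy, for example something like the following (the vid2vid output path is a placeholder for the path printed at the end of the vid2vid run):

```python
# Copy the fake images produced by vid2vid back into the pipeline's result folder.
# The source path is a placeholder; the destination must not already exist.
import shutil

vid2vid_output = "/path/printed/by/vid2vid"  # machine-specific, printed by step 3
shutil.copytree(vid2vid_output, "./result/vid2vid_frames/")
```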
Step 4: Denoise the vid2vid results

- Input
  - vid2vid generated images folder path
  - Original base images folder path
- Output
  - Folder: modified images (base image + vid2vid mouth regions)
  - Folder: denoised and smoothed frames
- Process (a blending sketch follows this list)
  - Crop the mouth areas from the vid2vid generated images and paste them back onto the base images -> modified images
  - Generate circularly smoothed images using gradient masking
  - Take `(modified image, circularly smoothed image)` pairs and denoise them
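The circular gradient masking can be pictured with the sketch below: vid2vid pixels are blended into the base frame with a radial mask that fades from 1 at the centre to 0 at the edge. The file names, centre, and radius are illustrative assumptions; the real region of interest comes from the `frame_crop` function mentioned in the notes.

```python
# Minimal sketch of the paste-back / circular gradient blending idea (assumptions:
# the two frames are the same size, and centre/radius are placeholder values).
import cv2
import numpy as np

def blend_mouth(base_img, vid2vid_img, center, radius):
    """Blend vid2vid pixels into the base frame with a radial (circular) mask."""
    h, w = base_img.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    dist = np.sqrt((xx - center[0]) ** 2 + (yy - center[1]) ** 2)
    # 1.0 at the circle centre, fading linearly to 0.0 at the radius and beyond
    mask = np.clip(1.0 - dist / radius, 0.0, 1.0)[..., None]
    blended = mask * vid2vid_img.astype(np.float32) + (1.0 - mask) * base_img.astype(np.float32)
    return blended.astype(np.uint8)

base = cv2.imread("./result/save_keypoints/frame_0000.png")   # hypothetical file name
fake = cv2.imread("./result/vid2vid_frames/frame_0000.png")   # hypothetical file name
out = blend_mouth(base, fake, center=(128, 180), radius=40)
cv2.imwrite("modified_0000.png", out)
```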
Step 5: Combine audio and video

- Input
  - Saved frames folder path
    - By default, frames are saved in `./result/save_keypoints`; enter `d` to go with the default path, otherwise input the frames folder path
  - Audio driver file path (`./source/audio_driver_wav/audio_driver.wav`)
- Output (`./result/save_keypoints/result/`)
  - `video_without_sound.mp4`: modified video without sound
  - `audio_only.mp4`: audio driver
  - `final_output.mp4`: modified video with sound
- Process (a minimal sketch follows this list)
  - Generate the modified video without sound at the defined fps
  - Extract `wav` audio from the audio driver
  - Combine audio and video to generate the final output
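Conceptually, this step writes the frames to a silent video and then muxes in the driver audio. Below is a minimal sketch, assuming 25 fps, the default paths, PNG frames, and an `ffmpeg` binary on the PATH; the pipeline's own logic sits in the `extract_audio` / `combine_audio_video` functions noted at the end of this README.

```python
# Rough sketch of step 5: frames -> silent video -> mux driver audio with ffmpeg.
import glob
import subprocess
import cv2

fps = 25  # must match your base video fps (see the fps snippet in the notes)
frames = sorted(glob.glob("./result/save_keypoints/*.png"))
h, w = cv2.imread(frames[0]).shape[:2]

writer = cv2.VideoWriter("video_without_sound.mp4",
                         cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
for path in frames:
    writer.write(cv2.imread(path))
writer.release()

# Combine the silent video with the driver audio into the final output
subprocess.run(["ffmpeg", "-y", "-i", "video_without_sound.mp4",
                "-i", "./source/audio_driver_wav/audio_driver.wav",
                "-c:v", "copy", "-c:a", "aac", "-shortest", "final_output.mp4"],
               check=True)
```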
Notes:

- You may need to modify how MFCC features are extracted in the `extract_mfcc` function
  - Be careful about the sample rate, window_length, and hop_length
  - Good resource: https://www.mathworks.com/help/audio/ref/mfcc.html
- You may need to modify the region of interest (mouth area) in the `frame_crop` function
- You may need to modify the frame rate defined in step 3 of `main.py`, which should match your base video's fps:
```python
# How to check your base video fps
# source: https://www.learnopencv.com/how-to-find-frame-rate-or-frames-per-second-fps-in-opencv-python-cpp/
import cv2

video = cv2.VideoCapture("video.mp4")

# Find OpenCV version
(major_ver, minor_ver, subminor_ver) = (cv2.__version__).split('.')

if int(major_ver) < 3:
    fps = video.get(cv2.cv.CV_CAP_PROP_FPS)
    print("Frames per second using video.get(cv2.cv.CV_CAP_PROP_FPS): {0}".format(fps))
else:
    fps = video.get(cv2.CAP_PROP_FPS)
    print("Frames per second using video.get(cv2.CAP_PROP_FPS): {0}".format(fps))

video.release()
```
- You may need to modify the shell path (check yours with `echo $SHELL`)
- You may need to modify the audio sampling rate in the `extract_audio` function
- You may need to customize your parameters in the `combine_audio_video` function
- March 22, 2020: Drafted documentation