Paper, Tags: #audio
Our system translates videos from one language to another. The source language's speech content is transcribed to text, translated, and automatically synthesized into target language speech using the original speaker's voice. The visual content is translated by synthesizing lip movements for the speaker to match the translated audio.
We collected a large multilingual dataset and used it to train a large multilingual multi-speaker lipsync model. We perform speaker-specific fine-tuning using data from each individual target speaker.