DreamDance: Personalized Text-to-Video Generation by Combining Text-to-Image Synthesis and Motion Transfer
The motion transfer is quite successful, even when the character in the reference video performs large motions such as dancing and rotating.
Note that, limited by computing resources, we only generated low-resolution imitation videos; the motion imitation itself performs well.
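For readers curious what a pose-driven motion transfer stage can look like in code, below is a minimal sketch that re-renders a generated character frame by frame, conditioned on poses extracted from the reference video. The ControlNet + OpenPose recipe, the model checkpoints, and the file names are illustrative assumptions, not necessarily the setup behind the results shown here.

```python
# Sketch: motion transfer by extracting a pose skeleton from each frame
# of a reference dance video and re-rendering the character with a
# pose-conditioned image model. ControlNet + OpenPose is one common
# recipe; it is an assumption here, not confirmed as this project's method.
import numpy as np
import torch
import imageio
from controlnet_aux import OpenposeDetector
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

detector = OpenposeDetector.from_pretrained("lllyasviel/Annotators")
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

prompt = "miguel, pixar, cartoon, high quality, full body, single person"

frames = []
for frame in imageio.get_reader("reference_dance.mp4"):  # hypothetical clip
    pose = detector(frame)  # skeleton image for this reference frame
    # Re-seeding every frame keeps the character's identity more stable.
    image = pipe(prompt, image=pose, num_inference_steps=20,
                 generator=torch.Generator("cuda").manual_seed(0)).images[0]
    frames.append(np.asarray(image))
imageio.mimsave("motion_transfer.mp4", frames, fps=8)
```

Generating each frame independently like this tends to flicker, since nothing enforces temporal consistency; smoothing the result is where the frame interpolation discussed below comes in.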
Input images for the prompt: miguel playing guitar on the street, pixar, cartoon, high quality, full body, single person
Output video
Input images for the prompt: miguel running in a forest, pixar, cartoon, green eyes, red hat, high quality, standing, full body, single person
Output video
Input images for the prompt: miguel in a forest, pixar, cartoon, green eyes, red hat, high quality, standing, full body, single person
Output video
Input images for the prompt: miguel, pixar, cartoon, playing guitar, high quality, full body, single person
Output video
We noticed that even when the changes between input images are larger, interpolation still handles the video synthesis well. Although there are some artifacts in the intermediate frames, our limitations come mainly from the input image generation side: if future text-to-image synthesis models can generate more promising images with high consistency across all the factors above, frame interpolation will be a powerful method for text-to-video generation.
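As an illustration of this keyframe-plus-interpolation idea, here is a minimal sketch. The Stable Diffusion checkpoint stands in for whichever text-to-image model is used, and the `interpolate` helper is a naive pixel-space cross-fade acting as a placeholder for a learned interpolator such as FILM or RIFE; all names and parameters here are assumptions.

```python
# Sketch: text-to-video as keyframe synthesis followed by frame
# interpolation. Stable Diffusion stands in for the text-to-image model;
# the linear cross-fade is a placeholder for a learned interpolator
# (e.g. FILM or RIFE), which handles large motion far better.
import numpy as np
import torch
import imageio
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = ("miguel playing guitar on the street, pixar, cartoon, "
          "high quality, full body, single person")

# Different seeds give different keyframes for the same prompt; identity
# consistency across keyframes depends on the text-to-image model itself.
keyframes = [
    np.asarray(pipe(prompt,
                    generator=torch.Generator("cuda").manual_seed(seed)).images[0])
    for seed in (0, 1)
]

def interpolate(a, b, n):
    # Placeholder: pixel-space linear blend; swap in FILM/RIFE in practice.
    return [((1 - t) * a + t * b).astype(np.uint8)
            for t in np.linspace(0.0, 1.0, n)]

imageio.mimsave("dreamdance_clip.mp4",
                interpolate(keyframes[0], keyframes[1], n=16), fps=8)
```

A learned interpolator would replace the cross-fade and handle large appearance changes between keyframes far more gracefully, which is what the observation above relies on.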