I4VGen: Image as Free Stepping Stone for Text-to-Video Generation
Official PyTorch implementation of the arXiv 2024 paper: https://arxiv.org/abs/2406.02230
I4VGen: Image as Free Stepping Stone for Text-to-Video Generation
Xiefan Guo, Jinlin Liu, Miaomiao Cui, Liefeng Bo, Di Huang
https://xiefan-guo.github.io/i4vgen
Abstract: I4VGen is a novel video diffusion inference pipeline that leverages advanced image techniques to enhance pre-trained text-to-video diffusion models and requires no additional training. Instead of the vanilla text-to-video inference pipeline, I4VGen consists of two stages: anchor image synthesis and anchor image-augmented text-to-video synthesis. Correspondingly, a simple yet effective generation-selection strategy is employed to obtain a visually realistic and semantically faithful anchor image, and an innovative noise-invariant video score distillation sampling (NI-VSDS) is developed to animate the image into a dynamic video by distilling motion knowledge from video diffusion models, followed by a video regeneration process to refine the video. Extensive experiments show that the proposed method produces videos with higher visual realism and textual fidelity.
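For orientation, here is a minimal, high-level sketch of the two-stage pipeline described in the abstract. All component callables are hypothetical stand-ins, not the repo's actual API; names and signatures are assumptions for illustration only.

```python
# High-level sketch of the I4VGen inference pipeline (illustrative only).
# The callables passed in stand for the real components in this repo;
# their names and signatures are assumptions, not the actual API.
from typing import Callable, List

import torch


def i4vgen_inference(
    prompt: str,
    t2i_sample: Callable[[str], torch.Tensor],                      # text -> candidate image
    score_image: Callable[[torch.Tensor, str], float],              # image-text alignment score
    ni_vsds_animate: Callable[[torch.Tensor, str], torch.Tensor],   # NI-VSDS: image -> coarse video
    regenerate_video: Callable[[torch.Tensor, str], torch.Tensor],  # video diffusion refinement
    num_candidates: int = 4,
) -> torch.Tensor:
    # Stage 1: anchor image synthesis with a generation-selection strategy.
    candidates: List[torch.Tensor] = [t2i_sample(prompt) for _ in range(num_candidates)]
    anchor_image = max(candidates, key=lambda img: score_image(img, prompt))

    # Stage 2: anchor image-augmented text-to-video synthesis.
    # NI-VSDS animates the static anchor image into a dynamic video by
    # distilling motion knowledge from the video diffusion model ...
    coarse_video = ni_vsds_animate(anchor_image, prompt)
    # ... followed by a video regeneration pass to refine the result.
    return regenerate_video(coarse_video, prompt)
```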
- Linux and Windows are supported, but we recommend Linux for performance and compatibility reasons.
- All experiments are conducted on a single NVIDIA V100 GPU (32 GB).
Python libraries: See environments/animatediff_environment.yaml for the exact library dependencies. You can use the following commands to create and activate the AnimateDiff Python environment:
```bash
# Create conda environment
conda env create -f environments/animatediff_environment.yaml
# Activate conda environment
conda activate animatediff_env
```
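After activating the environment, a quick sanity check like the following (a convenience snippet, not part of the repo) confirms that PyTorch sees your GPU before you start inference:

```python
# Quick sanity check that the activated environment sees PyTorch and a GPU.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("VRAM (GB):", torch.cuda.get_device_properties(0).total_memory / 1024**3)
```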
Inference setup: Please refer to the official repo of AnimateDiff; the setup guide is listed here. `mm-sd-v15-v2` and `stable-diffusion-v1-5` are used in our experiments.
Name | HuggingFace | Type |
---|---|---|
mm-sd-v15-v2 | Link | Motion module |
stable-diffusion-v1-5 | Link | Base T2I diffusion model |
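A download sketch using `huggingface_hub` is shown below. The repo IDs, filenames, and local paths are assumptions based on the usual hosting locations; follow the Link column above for the authoritative sources.

```python
# Sketch of downloading the two checkpoints with huggingface_hub.
# Repo IDs, filenames, and local paths are assumptions; adjust to match the
# links in the table above and your own directory layout.
from huggingface_hub import hf_hub_download, snapshot_download

# Motion module: a single checkpoint file (assumed to live in the AnimateDiff HF repo).
motion_module_path = hf_hub_download(
    repo_id="guoyww/animatediff",        # assumption
    filename="mm_sd_v15_v2.ckpt",        # assumption
    local_dir="models/Motion_Module",    # assumption; match your config
)

# Base T2I diffusion model: the full Stable Diffusion 1.5 repository.
sd15_path = snapshot_download(
    repo_id="runwayml/stable-diffusion-v1-5",                   # assumption
    local_dir="models/StableDiffusion/stable-diffusion-v1-5",   # assumption
)
print(motion_module_path, sd15_path)
```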
Generating videos: Before generating videos, make sure you have set up the required Python environment and downloaded the corresponding checkpoints, then run the following command:
```bash
python -m scripts.animate_animatediff --config configs/animatediff_configs/i4vgen_animatediff.yaml
```
In `configs/animatediff_configs/i4vgen_animatediff.yaml` and `ArgumentParser`, arguments for inference (a programmatic sketch follows the list):
- `motion_module`: path to the motion module, i.e., `mm-sd-v15-v2`
- `pretrained_model_path`: path to the base T2I diffusion model, i.e., `stable-diffusion-v1-5`
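A minimal sketch of pointing these keys at your local checkpoints before running the script, assuming OmegaConf (commonly used for AnimateDiff-style configs) is available; the key names come from the list above, while the paths and output filename are assumptions:

```python
# Point the inference config at local checkpoints (paths are assumptions).
from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/animatediff_configs/i4vgen_animatediff.yaml")
cfg.motion_module = "models/Motion_Module/mm_sd_v15_v2.ckpt"                 # assumed path
cfg.pretrained_model_path = "models/StableDiffusion/stable-diffusion-v1-5"   # assumed path
OmegaConf.save(cfg, "configs/animatediff_configs/i4vgen_animatediff_local.yaml")
```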
Python libraries: See environments/lavie_environment.yaml for the exact library dependencies. You can use the following commands to create and activate the LaVie Python environment:
```bash
# Create LaVie conda environment
conda env create -f environments/lavie_environment.yaml
# Activate LaVie conda environment
conda activate lavie_env
```
Inference setup: Please refer to the official repo of LaVie. The `base` version is employed in our experiments. Download the pre-trained `lavie_base` and `stable-diffusion-v1-4` checkpoints.
Name | HuggingFace | Type |
---|---|---|
lavie_base | Link | LaVie model |
stable-diffusion-v1-4 | Link | Base T2I diffusion model |
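Before moving on to inference, a quick path check like the one below can save a failed run. The paths here are placeholders, not the repo's expected layout; adjust them to wherever you stored `lavie_base` and `stable-diffusion-v1-4`.

```python
# Optional sanity check that downloaded checkpoints exist where your config points.
# The paths below are placeholders; edit them to match your setup.
from pathlib import Path

expected = [
    Path("models/lavie_base.pt"),           # placeholder path for the LaVie checkpoint
    Path("models/stable-diffusion-v1-4"),   # placeholder path for the SD 1.4 repo
]
for p in expected:
    status = "found" if p.exists() else "MISSING"
    print(f"{status}: {p}")
```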
Generating videos: Before generating videos, make sure you have set up the required Python environment and downloaded the corresponding checkpoints, then run the following command:
```bash
python scripts/animate_lavie.py --config configs/lavie_configs/i4vgen_lavie.yaml
```
In `configs/lavie_configs/i4vgen_lavie.yaml` and `ArgumentParser`, arguments for inference (an illustrative sketch follows the list):
- `ckpt_path`: path to the LaVie model, i.e., `lavie_base`
- `sd_path`: path to the base T2I diffusion model, i.e., `stable-diffusion-v1-4`
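To illustrate how a config flag and these keys typically fit together, here is a hedged sketch of the usual `argparse` + OmegaConf wiring. Only `--config` is confirmed by the command shown earlier; the override flags and merge logic below are assumptions, not the repo's actual CLI.

```python
# Illustrative wiring of --config plus optional overrides for ckpt_path / sd_path.
# Only --config is confirmed by the command above; the rest is an assumption.
import argparse

from omegaconf import OmegaConf

parser = argparse.ArgumentParser()
parser.add_argument("--config", type=str, required=True)
parser.add_argument("--ckpt_path", type=str, default=None)  # path to lavie_base (assumed flag)
parser.add_argument("--sd_path", type=str, default=None)    # path to stable-diffusion-v1-4 (assumed flag)
args = parser.parse_args()

cfg = OmegaConf.load(args.config)
if args.ckpt_path is not None:
    cfg.ckpt_path = args.ckpt_path
if args.sd_path is not None:
    cfg.sd_path = args.sd_path
print(OmegaConf.to_yaml(cfg))
```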
```bibtex
@article{guo2024i4vgen,
  title   = {I4VGen: Image as Free Stepping Stone for Text-to-Video Generation},
  author  = {Guo, Xiefan and Liu, Jinlin and Cui, Miaomiao and Bo, Liefeng and Huang, Di},
  journal = {arXiv preprint arXiv:2406.02230},
  year    = {2024}
}
```
The code is built upon AnimateDiff and LaVie; we thank all the contributors for open-sourcing their work.