Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model

Official implementation of Ctrl-Adapter, an efficient and versatile framework that adds diverse controls to any image/video diffusion models by adapting pretrained ControlNets.

arXiv | Project Page | Checkpoints

Han Lin, Jaemin Cho, Abhay Zala, Mohit Bansal



Ctrl-Adapter is an efficient and versatile framework for adding diverse spatial controls to any image or video diffusion model. It supports a variety of useful applications, including video control, video control with multiple conditions, video control with sparse frame conditions, image control, zero-shot transfer to unseen conditions, and video editing.

🔥 News

  • May 26, 2024. Check out our new arXiv v2 for exciting new additions to Ctrl-Adapter!
    • Support for DiT-based backbones (Latte, PixArt-α)
    • Fine-grained patch-level MoE router for multi-control composition
    • Downstream tasks beyond spatial control (video editing, video style transfer, text-guided motion control)
  • Apr. 30, 2024. Training code is now released! It's time to train Ctrl-Adapter on your desired backbone! 🚀🚀
  • Apr. 29, 2024. SDXL, I2VGen-XL, and SVD inference code and checkpoints are all released!

🔧 Setup

Environment Setup

If you only need to perform inference with our code, please install from requirements_inference.txt. To make our codebase easy to use, the primary libraries that need to be installed are Torch, Diffusers, and Transformers. Specific versions of these libraries are not required; the default versions should work fine :)

If you plan to run training, please install from requirements_train.txt instead, which contains the additional dependencies that training needs.

conda create -n ctrl-adapter python==3.10
conda activate ctrl-adapter
pip install -r requirements_inference.txt # install from this if you only need to perform inference
pip install -r requirements_train.txt # install from this if you plan to do some training
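
After installing, a quick sanity check can confirm that the three primary libraries named above are present. The snippet below is a minimal sketch, not part of the Ctrl-Adapter codebase; the helper name `installed_versions` is hypothetical, and the package names are assumed to be the usual PyPI distribution names.

```python
from importlib import metadata

# Assumed PyPI distribution names for the three core libraries named above.
CORE_PACKAGES = ("torch", "diffusers", "transformers")


def installed_versions(packages=CORE_PACKAGES):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Running it inside the activated `ctrl-adapter` environment prints one line per library, which makes a missing install easy to spot before launching an inference script.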

Here we list several questions that we believe are important when you start using this codebase.

🔮 Inference

We provide model checkpoints and inference scripts for Ctrl-Adapter trained on SDXL, I2VGen-XL, and SVD. All inference scripts are located under ./inference_scripts.

📌 Notice Before You Begin

Please note that there is usually no single model that excels at generating images/videos for all motion styles across various control conditions.

Different image/video generation backbones may perform better with specific types of motion. For instance, we have observed that SVD excels at slide motions, while it generally performs worse than I2VGen-XL on complex motions (this is consistent with the findings in DynamiCrafter). Additionally, different control conditions can lead to significantly different results in the generated images/videos, and some control conditions may be more informative than others for certain types of motion.

📌 Inference Data Structure

We put some sample images/frames for inference under the folder ./assets/evaluation. You can add your custom examples following the same file structure illustrated below.

For model inference, we support two options:

  • If you already have condition image/frames extracted from some image/video, you can use inference (w/ extracted condition).
./assets/evaluation/images
    ├── depth
    │   ├── anime_corgi.png
    ├── raw_input
    │   ├── anime_corgi.png
    ├── captions.json

./assets/evaluation/frames
    ├── depth
    │   ├── newspaper_cat
    │   │   ├── 00000.png
    │   │   ├── 00001.png
    │   │   ...
    │   │   ├── 00015.png
    ├── raw_input
    │   ├── newspaper_cat
    │   │   ├── 00000.png # only the 1st frame is needed for I2V models
    ├── captions.json
  • If you haven't extracted control conditions and only have the raw image/frames, you can use inference (w/o extracted condition). In this case, our code automatically extracts the control conditions from the input image/frames and then generates the corresponding image/video.
./assets/evaluation/images
    ├── raw_input
    │   ├── anime_corgi.png
    ├── captions.json

./assets/evaluation/frames
    ├── raw_input
    │   ├── newspaper_cat
    │   │   ├── 00000.png
    │   │   ├── 00001.png
    │   │   ...
    │   │   ├── 00015.png
    ├── captions.json
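
When adding custom video examples for inference with extracted conditions, it can help to check that every raw frame has a matching condition frame. The following is a minimal sketch based only on the folder layout shown above; the function name `missing_condition_frames` is hypothetical, and it assumes that condition frames share file names with the raw frames, as in the listing. It does not inspect captions.json, whose schema is not described here.

```python
from pathlib import Path


def missing_condition_frames(frames_root, condition="depth"):
    """Map each video folder under raw_input to the raw frame names that
    have no same-named file under <condition>/<video>/."""
    root = Path(frames_root)
    missing = {}
    video_dirs = sorted(p for p in (root / "raw_input").iterdir() if p.is_dir())
    for video_dir in video_dirs:
        raw_names = {f.name for f in video_dir.glob("*.png")}
        cond_dir = root / condition / video_dir.name
        # A missing condition folder counts as having no condition frames.
        cond_names = {f.name for f in cond_dir.glob("*.png")} if cond_dir.is_dir() else set()
        gaps = sorted(raw_names - cond_names)
        if gaps:
            missing[video_dir.name] = gaps
    return missing


if __name__ == "__main__":
    print(missing_condition_frames("./assets/evaluation/frames", condition="depth"))
```

An empty dictionary means every raw frame has a counterpart. Because I2V models only need the first raw frame, a video with a single raw frame and a full set of condition frames passes the check.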

📌 Run Inference Scripts

Here is a sample command to run inference on SDXL with depth map as control (w/ extracted condition).

sh inference_scripts/sdxl/sdxl_inference_depth.sh

⚠️ --control_guidance_end: this is the most important parameter, as it balances generated image/video quality against control strength. If the generated image/video does not follow the spatial control well, increase this value; if the generated image/video quality is poor because the spatial control is too strong, decrease this value. A detailed discussion of control strength via this parameter is provided in our paper.
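
To build intuition for this parameter, the sketch below shows how a guidance-end fraction can translate into the set of denoising steps that receive control. It assumes the convention used by the Diffusers ControlNet pipelines, where the value is a fraction of the total denoising steps; the function name `control_keep_mask` is hypothetical, and the Ctrl-Adapter scripts may implement the flag differently.

```python
def control_keep_mask(num_steps, control_guidance_start=0.0, control_guidance_end=1.0):
    """Return one flag per denoising step: 1.0 if control is applied, else 0.0.

    Assumption: start/end are fractions of the total number of steps,
    mirroring the Diffusers ControlNet convention.
    """
    keep = []
    for i in range(num_steps):
        before_start = i / num_steps < control_guidance_start
        after_end = (i + 1) / num_steps > control_guidance_end
        keep.append(0.0 if (before_start or after_end) else 1.0)
    return keep


if __name__ == "__main__":
    # With 10 steps and an end fraction of 0.4, only the early steps are controlled.
    print(control_keep_mask(10, control_guidance_end=0.4))
```

Under this convention, a larger value keeps the control active for more of the late denoising steps, which matches the advice above: raise it for tighter adherence, lower it when the control overpowers image quality.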

We list the inference scripts for different tasks mentioned in our paper as follows ⬇️

Controllable Image Generation



SDXL

| Control Conditions | Checkpoints | Inference (w/ extracted condition) | Inference (w/o extracted condition) |
|---|---|---|---|
| Depth Map | HF link | command | command |
| Canny Edge | HF link | command | command |
| Soft Edge | HF link | command | command |
| Normal Map | HF link | command | command |
| Segmentation | HF link | command | command |
| Scribble | HF link | command | command |
| Lineart | HF link | command | command |

Controllable Video Generation



I2VGen-XL

| Control Conditions | Checkpoints | Inference (w/ extracted condition) | Inference (w/o extracted condition) |
|---|---|---|---|
| Depth Map | HF link | command | command |
| Canny Edge | HF link | command | command |
| Soft Edge | HF link | command | command |

SVD

| Control Conditions | Checkpoints | Inference (w/ extracted condition) | Inference (w/o extracted condition) |
|---|---|---|---|
| Depth Map | HF link | command | command |
| Canny Edge | HF link | command | command |
| Soft Edge | HF link | command | command |

Video Generation with Segmentation Control (NuScenes)



We have currently implemented segmentation control on I2VGen-XL. Run this on a local pretrained checkpoint (no HF).

Note that in these bash files you can set the segmentation type you want to use ("ade" or "odise"), depending on which ControlNet you are using. The Prescan segmentation will then be converted to the correct format.

NB: for inference w/o extracted condition, only segmentation_type "ade" is supported, since the default segmentor extracts the maps in ADE format.

| Inference (w/ extracted condition) | Inference (w/o extracted condition) |
|---|---|
| command | command |

NB: If you want to add more data to test on, place the raw frames and segmentation maps into evaluation/frames/raw_input and evaluation/frames/segmentation, along with a caption.csv that specifies the corresponding folder name (see the existing examples for reference).

Video Generation with Multi-Condition Control



We have currently implemented multi-condition control on I2VGen-XL. The following checkpoints are trained on 7 control conditions: depth, canny, normal, softedge, segmentation, lineart, and openpose. Here are sample inference scripts that use depth, canny, segmentation, and openpose as control conditions.

| Adapter Checkpoint | Router Checkpoint | Inference (w/ extracted condition) | Inference (w/o extracted condition) |
|---|---|---|---|
| HF link | HF link | command | command |
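
As a rough illustration of what combining several control conditions can look like, the sketch below fuses per-condition feature vectors with softmax-normalized weights. This is a simplified stand-in and not the released router: the function names, the flat list-of-floats features, and the single scalar logit per condition are all assumptions made for illustration. The patch-level MoE router mentioned in the news section is finer-grained than this.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_control_features(features, router_logits):
    """Weighted sum of per-condition features.

    features:      {condition name: list of floats}, all of the same length
    router_logits: {condition name: float}, one hypothetical score per condition
    Returns (fused feature list, {condition name: weight}).
    """
    names = list(features)
    weights = softmax([router_logits[name] for name in names])
    fused = [0.0] * len(features[names[0]])
    for name, weight in zip(names, weights):
        for i, value in enumerate(features[name]):
            fused[i] += weight * value
    return fused, dict(zip(names, weights))


if __name__ == "__main__":
    feats = {"depth": [1.0, 1.0], "canny": [3.0, 3.0]}
    print(fuse_control_features(feats, {"depth": 0.0, "canny": 0.0}))
```

Because the weights are normalized only over the conditions that are passed in, the same scheme applies whether all seven conditions or just a subset (such as the four used in the sample scripts) are supplied.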

Video Generation with Sparse Control



Here we provide a sample inference script that uses user scribbles as the condition, with 4 out of 16 frames used for sparse control.

| Control Conditions | Checkpoint | Inference (w/ extracted condition) |
|---|---|---|
| Scribbles | HF link | command |
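
To make the "4 out of 16 frames" setting concrete, the sketch below picks a subset of frame indices and blanks out the condition on all other frames. Evenly spaced selection is an assumption made for illustration; the provided script may choose the conditioned frames differently. Both function names are hypothetical.

```python
def sparse_frame_indices(num_frames=16, num_keyframes=4):
    """Pick num_keyframes evenly spaced indices in [0, num_frames - 1]."""
    if not 1 <= num_keyframes <= num_frames:
        raise ValueError("num_keyframes must be between 1 and num_frames")
    if num_keyframes == 1:
        return [0]
    step = (num_frames - 1) / (num_keyframes - 1)
    return [round(i * step) for i in range(num_keyframes)]


def mask_conditions(frames, keep_indices, fill=None):
    """Keep condition frames at keep_indices; replace the rest with fill."""
    keep = set(keep_indices)
    return [frame if i in keep else fill for i, frame in enumerate(frames)]


if __name__ == "__main__":
    indices = sparse_frame_indices(16, 4)
    print(indices)  # [0, 5, 10, 15]
    scribbles = [f"{i:05d}.png" for i in range(16)]
    print(mask_conditions(scribbles, indices))
```

The frame names follow the `00000.png` to `00015.png` pattern from the data structure section; only the selected positions keep their scribble condition.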

🚅 How To Train

🎉 To make our method reproducible and adaptable to new backbones, we have released all of our training code :)

You can find the detailed training guidelines for Ctrl-Adapter here!

📝 TODO List

  • Release environment setup, inference code, and model checkpoints.
  • Release training code.
  • Training guideline to adapt our Ctrl-Adapter to new image/video diffusion models.
  • Ctrl-Adapter + DiT-based image/video generation backbones (Latte, PixArt-α). (WIP)
  • Code for video editing and text-guided motion control. (WIP)
  • Release evaluation code.

💗 Please let us know in the issues or PRs if you're interested in any relevant backbones or downstream tasks that could be implemented with our Ctrl-Adapter framework! Collaborations and contributions are welcome!

📚 BibTeX

🌟 If you find our project useful in your research or application development, citing our paper would be the best support for us!

@misc{lin2024ctrladapter,
      title={Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model}, 
      author={Han Lin and Jaemin Cho and Abhay Zala and Mohit Bansal},
      year={2024},
      eprint={2404.09967},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

🙏 Acknowledgements

The development of Ctrl-Adapter has been greatly inspired by the following amazing works and teams:

We hope that releasing this model/codebase helps the community to continue pushing these creative tools forward in an open and responsible way.