Zeren Xiong1 · Zedong Zhang1 · Zikun Chen1 · Shuo Chen2 · Xiang Li3 · Gan Sun4 · Jian Yang1 · Jun Li1
1Nanjing University of Science and Technology · 2RIKEN · 3Nankai University · 4South China University of Technology
- Introducing a scale factor and an injection step to balance text and image features in cross-attention, while preserving image information in self-attention during the text-image inversion diffusion process.
- Designing a balanced loss function with a noise parameter, ensuring both optimal editability and fidelity of the object image.
- Presenting a novel similarity score function that maximizes the similarities between the generated object image and the input text/image while balancing these similarities to harmonize text and image integration.
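The balancing idea in the last bullet can be sketched as follows. This is a hypothetical illustration, not the paper's exact score: it assumes two cosine similarities, `s_text` (generated image vs. target text) and `s_img` (generated image vs. input image), both in [0, 1], and combines them with a harmonic mean, which is high only when both terms are high and roughly balanced.

```python
# Hypothetical sketch of a balanced similarity score; the paper's exact
# formulation may differ. s_text and s_img are assumed cosine similarities
# in [0, 1] (e.g., CLIP text-image and image-image similarities).
def balanced_score(s_text: float, s_img: float, eps: float = 1e-8) -> float:
    # Harmonic mean: rewards high values on BOTH similarities and
    # penalizes imbalance between them.
    return 2.0 * s_text * s_img / (s_text + s_img + eps)
```

For instance, a generation that matches the text well but drifts far from the input image (e.g., 0.9 vs. 0.5) scores lower than one that matches both moderately (0.7 vs. 0.7).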
- 2024.12.20: 🎉 Our code is released! Explore the possibilities of novel object synthesis with our framework.
To set up the environment for running the code, follow these steps:

1. Clone the repository:

   ```shell
   git clone https://github.com/xzr52/ATIH-code
   cd ATIH-code
   ```

2. Create a conda environment and install dependencies:

   ```shell
   conda create -n ATIH python=3.10
   conda activate ATIH
   pip install -r requirements.txt
   ```

3. Set CUDA paths:

   ```shell
   export CUDA_HOME=/usr/local/cuda
   ```

4. Install the required submodule:

   ```shell
   cd GroundingDino
   pip install -e .
   ```

5. Download the segmentation model weights seg_ckpts, unzip them, and place them in the `ckpts/` folder.
We provide a Gradio-based application with an intuitive user interface for interacting with the framework. By default, our code runs on two GPUs, each with 24GB of memory. If your single GPU has more than 30GB of memory, you can modify the code to set device2 to the same value as device1, allowing the program to run on a single GPU.
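The single-GPU change described above might look like the following. The variable names `device1`/`device2` follow the README's description; the actual assignments in `app.py` may differ.

```python
# Sketch of the device assignment described above (names hypothetical;
# see app.py for the actual code). Default: models split across two GPUs.
device1 = "cuda:0"
device2 = "cuda:1"

# Single-GPU variant: requires one GPU with more than 30GB of memory.
single_gpu = True  # set according to your hardware
if single_gpu:
    device2 = device1
```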
To launch the app locally:

```shell
export no_proxy=127.0.0.1,localhost
python app.py
```
To perform inference on a single image, use the following command:

```shell
python inference_one_image.py --image_path examples/rabbit.png --target_prompt 'cock'
```

- `--image_path`: Path to the input image.
- `--target_prompt`: Text description of the object.
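The two documented flags can be wired up with `argparse`; the snippet below is a minimal sketch of the interface described above, not the real script (which likely exposes additional options).

```python
# Minimal sketch of the documented CLI flags (the actual
# inference_one_image.py likely accepts additional options).
import argparse

parser = argparse.ArgumentParser(description="ATIH single-image inference")
parser.add_argument("--image_path", required=True, help="Path to the input image")
parser.add_argument("--target_prompt", required=True, help="Text description of the object")

# Example invocation mirroring the command above:
args = parser.parse_args(["--image_path", "examples/rabbit.png", "--target_prompt", "cock"])
```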
Our framework also supports complex prompts. Here's an example:

```shell
python inference_one_image.py --image_path examples/lion.png --target_prompt 'Green triceratops with rough, scaly skin and massive frilled head'
```
This work was partially supported by the National Natural Science Foundation of China under Grant Nos. 62072242 and 62361166670. We also thank the developers of the projects our implementation builds upon; their contributions have been instrumental to our work.
@inproceedings{
xiong2024novel,
title={Novel Object Synthesis via Adaptive Text-Image Harmony},
author={Zeren Xiong and Ze-dong Zhang and Zikun Chen and Shuo Chen and Xiang Li and Gan Sun and Jian Yang and Jun Li},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year={2024},
url={https://openreview.net/forum?id=ENLsNDfys0}
}