Novel Object Synthesis via Adaptive Text-Image Harmony

Zeren Xiong1 · Zedong Zhang1 · Zikun Chen1 · Shuo Chen2 · Xiang Li3 · Gan Sun4 · Jian Yang1 · Jun Li1

1Nanjing University of Science and Technology · 2RIKEN · 3Nankai University · 4South China University of Technology

In this paper, we study an object synthesis task that combines an object text with an object image to create a new object image. However, most diffusion models struggle with this task, often generating an object that predominantly reflects either the text or the image due to an imbalance between their inputs. To address this issue, we propose a simple yet effective method called Adaptive Text-Image Harmony (ATIH) to generate novel and surprising objects. Our contributions include:
  • Introducing a scale factor and an injection step to balance text and image features in cross-attention, while preserving image information in self-attention during the text-image inversion diffusion process.
  • Designing a balanced loss function with a noise parameter, ensuring both optimal editability and fidelity of the object image.
  • Presenting a novel similarity score function that maximizes the similarities between the generated object image and the input text/image while balancing these similarities to harmonize text and image integration.
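To give a feel for the third contribution, here is a minimal sketch of what a balance-aware similarity score could look like, assuming precomputed cosine similarities `s_text` (generated image vs. input text) and `s_image` (generated image vs. input image); the exact formulation used by ATIH is defined in the paper, and this function is only an illustration.

```python
def balanced_score(s_text: float, s_image: float) -> float:
    """Illustrative balance-aware score: reward high similarity to both
    the input text and the input image, while penalizing imbalance
    between the two so neither input dominates the synthesis."""
    return (s_text + s_image) - abs(s_text - s_image)
```

Under this toy form, a generation that is moderately similar to both inputs scores higher than one that matches the text strongly but ignores the image, which is the harmonization behavior described above.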

🚀 News

  • 2024.12.20: 🎉 Our code is released! Explore the possibilities of novel object synthesis with our framework.

🛠️ 1. Set Environment

To set up the environment for running the code, follow these steps:

  1. Clone the repository:

    git clone https://github.com/xzr52/ATIH-code
    cd ATIH-code
  2. Create a conda environment and install dependencies:

    conda create -n ATIH python=3.10
    conda activate ATIH
    pip install -r requirements.txt
  3. Set CUDA paths:

    export CUDA_HOME=/usr/local/cuda
  4. Install the required submodule:

    cd GroundingDino
    pip install -e .
    cd ..
  5. Download the segmentation model weights seg_ckpts, unzip them, and place them in the ckpts/ folder.
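After completing the steps above, a quick sanity check can catch the two most common setup mistakes (wrong working directory, missing weights). This helper is not part of the repository; the `ckpts/` location follows step 5, and `requirements.txt` is assumed to sit in the repo root.

```python
from pathlib import Path


def check_setup(repo_root: str = ".") -> list[str]:
    """Return a list of problems with the ATIH setup, empty if none found."""
    problems = []
    root = Path(repo_root)
    if not (root / "requirements.txt").is_file():
        problems.append("missing requirements.txt (are you in the repo root?)")
    if not (root / "ckpts").is_dir():
        problems.append("missing ckpts/ (segmentation weights from step 5)")
    return problems


if __name__ == "__main__":
    for problem in check_setup():
        print("setup issue:", problem)
```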

🚀 2. Quick Start

We provide a Gradio-based application with an intuitive user interface for interacting with the framework. By default, our code runs on two GPUs, each with 24GB of memory. If your single GPU has more than 30GB of memory, you can modify the code to set device2 to the same device as device1, allowing the program to run on a single GPU.
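The single-GPU modification described above amounts to pointing both device handles at the same card. A sketch of that logic, using the `device1`/`device2` names from the description (the actual variable names in `app.py` may differ):

```python
def pick_devices(num_gpus: int) -> tuple[str, str]:
    """Choose devices for the two model halves: two GPUs by default;
    on a single large GPU, set device2 equal to device1 so everything
    runs on one card."""
    if num_gpus >= 2:
        return "cuda:0", "cuda:1"
    return "cuda:0", "cuda:0"
```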

To launch the app locally:

export no_proxy=127.0.0.1,localhost
python app.py

🖼️ 3. Inference One Image

To perform inference on a single image, use the following command:

python inference_one_image.py --image_path examples/rabbit.png --target_prompt 'cock'
  • --image_path: Path to the input image.
  • --target_prompt: Text description of the object.
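To run the same command over several inputs, a small batch driver can assemble the invocation shown above for each image. This wrapper is hypothetical (not part of the repo) and assumes `inference_one_image.py` accepts exactly the two flags documented here.

```python
import subprocess
import sys
from pathlib import Path


def build_cmd(image_path: str, target_prompt: str) -> list[str]:
    """Assemble the inference CLI call documented above."""
    return [sys.executable, "inference_one_image.py",
            "--image_path", image_path,
            "--target_prompt", target_prompt]


if __name__ == "__main__":
    # Run every example image against a single target prompt.
    for img in sorted(Path("examples").glob("*.png")):
        subprocess.run(build_cmd(str(img), "cock"), check=True)
```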

🎨 4. Complex Prompt Generation

Our framework supports using complex prompts. Here's an example of how the results look:

python inference_one_image.py --image_path examples/lion.png --target_prompt 'Green triceratops with rough, scaly skin and massive frilled head'

🙌 Acknowledgment

This work was partially supported by the National Science Foundation of China under Grant Nos. 62072242 and 62361166670. We also thank the developers of the open-source projects our implementation builds upon; we deeply appreciate their contributions, which have been instrumental to our work.

📖BibTeX

@inproceedings{
  xiong2024novel,
  title={Novel Object Synthesis via Adaptive Text-Image Harmony},
  author={Zeren Xiong and Ze-dong Zhang and Zikun Chen and Shuo Chen and Xiang Li and Gan Sun and Jian Yang and Jun Li},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024},
  url={https://openreview.net/forum?id=ENLsNDfys0}
}
