my_dediffusion

This is my unofficial implementation of the model from the paper "De-Diffusion Makes Text a Strong Cross-Modal Interface" by Chen Wei, Chenxi Liu, Siyuan Qiao, Zhishuai Zhang, Alan Yuille and Jiahui Yu. I hope this repo provides an easy-to-use code framework for anyone who, like me, wishes to reproduce the paper.

Project structure

The directory structure of the project looks like this:

├── README.md            <- The top-level README for developers using this project.
├── datasets
│   └── dataloader.py    <- Loads and preprocesses the datasets.
│
├── requirements.txt     <- The requirements file for reproducing the environment.
│
├── configs              <- Configuration files (e.g. main.yaml).
│
├── models               <- Model source code for this project.
│   ├── __init__.py
│   ├── decoder.py
│   └── encoder.py
│
├── train.py             <- Script for training the model.
├── inference.py         <- Script for running inference with a model.
│
├── utils                <- Generic helper files for the project.
│   ├── __init__.py
│   ├── ckpt.py
│   ├── config.py
│   ├── distributed.py
│   ├── logging.py
│   └── utils.py
├── assets               <- Images used in the GitHub README.
└── LICENSE              <- Open-source license.

Installation

Follow these steps to clone the repository and set up the environment.

Clone this repo:

Clone the repository and navigate to the project directory:

git clone https://github.com/Yaxin9Luo/my_dediffusion.git
cd my_dediffusion

Create a conda virtual environment and activate it:

conda create -n dediffusion python=3.11 -y
conda activate dediffusion

Install the required dependencies:

pip install -r requirements.txt
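
After installing, it can help to confirm that the active environment matches the steps above. The following is a minimal sketch, not part of this repo: the helper names `python_is_at_least` and `missing_modules` are hypothetical, and the module names checked at the bottom (`torch`, `yaml`) are assumptions about what requirements.txt contains, so replace them with the packages you actually need.

```python
import importlib.util
import sys


def python_is_at_least(major, minor):
    """Return True if the running interpreter is at least major.minor."""
    return sys.version_info >= (major, minor)


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed module names; adjust to match the contents of requirements.txt.
modules_to_check = ["torch", "yaml"]

print("Python >= 3.11:", python_is_at_least(3, 11))
print("Missing modules:", missing_modules(modules_to_check))
```

If the second line prints a non-empty list, re-run the pip command inside the activated `dediffusion` environment.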

Training

If you want to train the attention pooler inside the image-to-text encoder block described in the paper, run the following command and modify the config file to suit your needs. (Note: training on 100 images took around 2 hours on a single A100.)

python train.py --config ./configs/main.yaml
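
To make the idea of an attention pooler more concrete, here is a dependency-free sketch of attention pooling in general: a small set of query vectors attends over a longer sequence of image features and produces one pooled vector per query. This is not the code in models/encoder.py and not necessarily the paper's exact design. It assumes single-head scaled dot-product attention with no learned projections, and the function names and sizes are hypothetical. In a real model the queries would be trainable parameters.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(queries, features):
    """Pool `features` (n x d) into len(queries) vectors of size d.

    Each query attends over all feature vectors with scaled dot-product
    attention. The features serve as both keys and values (assumption:
    no learned key/value projections, to keep the sketch short).
    """
    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)
    pooled, all_weights = [], []
    for q in queries:
        # One attention score per feature vector.
        scores = [scale * sum(qi * fi for qi, fi in zip(q, f)) for f in features]
        weights = softmax(scores)
        # Weighted average of the feature vectors.
        pooled.append(
            [sum(w * f[j] for w, f in zip(weights, features)) for j in range(dim)]
        )
        all_weights.append(weights)
    return pooled, all_weights


# Hypothetical sizes: 6 patch features of width 4, pooled down to 2 vectors.
rng = random.Random(0)
features = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(6)]
queries = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)]

pooled, weights = attention_pool(queries, features)
print(len(pooled), len(pooled[0]))
```

The number of output vectors equals the number of queries, regardless of how many image features go in. That is why this kind of block is convenient for mapping a variable number of patch features to a fixed number of outputs.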

Inference with pretrained models

Here I use pretrained BLIP and Stable Diffusion, but you can swap in whatever models you like. For example, the authors of the official paper mention that they use ViT-L and Imagen.

python inference.py --config ./configs/main.yaml

Examples

I have not yet fine-tuned the model on a large amount of data, so the results are not that impressive.

Original images:

Inference images:

Encoder Text: there are two cats that are laying on a couch with remotes on the back of the couch

Encoder Text: there is a dog that is running with a frisbee in it's mouth in the grass

Encoder Text: there is a dog that is running in the snow with a frisbee in it's mouth

Citation

@article{wei2023dediffusion,
  title={De-Diffusion Makes Text a Strong Cross-Modal Interface},
  author={Wei, Chen and Liu, Chenxi and Qiao, Siyuan and Zhang, Zhishuai and Yuille, Alan and Yu, Jiahui},
  journal={arXiv preprint arXiv:2311.00618},
  year={2023},
  url={https://arxiv.org/abs/2311.00618}
}
