[Updated 2024/08/08] Code released.
[Planned for release in July 2024]
PyTorch implementation of Cross-view Masked Diffusion Transformers for Person Image Synthesis, ICML 2024.
Authors: Trung X. Pham, Kang Zhang, and Chang D. Yoo.
Introduction
X-MDPT (Cross-view Masked Diffusion Prediction Transformers) is a diffusion transformer framework for pose-guided person image synthesis. It operates on latent patches with masked diffusion transformers rather than the Unet backbones common in prior work, using a cross-view masked prediction objective that ties the target view to the reference image.
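For intuition, below is a minimal sketch of random masking over latent patch tokens, the basic operation behind masked diffusion transformers. The shapes and the helper are made up for illustration; this is not the actual X-MDPT module.

```python
# Simplified illustration of random masking over latent patch tokens.
# Shapes are illustrative only; this is not the repo's actual module.
import torch

def random_mask(tokens: torch.Tensor, mask_ratio: float = 0.3):
    """tokens: (B, N, D) latent patch embeddings. Returns the visible
    tokens and a boolean mask marking which patches were hidden."""
    B, N, D = tokens.shape
    num_masked = int(N * mask_ratio)
    ids = torch.rand(B, N).argsort(dim=1)          # random patch order
    mask = torch.zeros(B, N, dtype=torch.bool)
    batch = torch.arange(B).unsqueeze(1)
    mask[batch, ids[:, :num_masked]] = True        # mark masked patches
    visible = tokens[~mask].reshape(B, N - num_masked, D)
    return visible, mask

x = torch.randn(2, 256, 768)       # e.g. 16x16 latent patches, dim 768
vis, mask = random_mask(x)
print(vis.shape, mask.sum(dim=1))  # (2, 180, 768); 76 masked per sample
```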
Efficiency Advantages
Comparison with the state of the art
Consistent Targets
Setup Environment
We have tested with PyTorch 1.12 + CUDA 11.6, inside a Docker container.
conda create -n xmdpt python=3.8
conda activate xmdpt
pip install -r requirements.txt
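To confirm the environment roughly matches the tested setup, a quick check like the following can help (a minimal sketch, not part of the repo):

```python
# Minimal environment sanity check: PyTorch / CUDA versions and GPU.
import torch

print(f"PyTorch: {torch.__version__}")         # tested with 1.12.x
print(f"CUDA version: {torch.version.cuda}")   # tested with 11.6
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```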
Prepare Dataset
Download the DeepFashion dataset and process it into LMDB format for easy training and inference. Refer to PIDM (CVPR 2023) for this LMDB. The data structure should be as follows:
datasets/
|-- [ 38]  deepfashion
|   |-- [6.4M]  train_pairs.txt
|   |-- [2.1M]  train.lst
|   |-- [817K]  test_pairs.txt
|   |-- [182K]  test.lst
|   |-- [4.0K]  256-256
|   |   |-- [8.0K]  lock.mdb
|   |   `-- [2.4G]  data.mdb
|   |-- [8.7M]  pose.rar
|   |-- [4.0K]  512-512
|   |   |-- [8.0K]  lock.mdb
|   |   `-- [8.4G]  data.mdb
|   `-- [4.0K]  pose
|       |-- [4.0K]  WOMEN
|       |   |-- [ 12K]  Shorts
|       |   |   |-- [4.0K]  id_00007890
|       |   |   |   |-- [ 900]  04_4_full.txt
|       |-- [4.0K]  MEN
...
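If you want to sanity-check the LMDB files before training, a generic inspection sketch like the one below works. The key layout inside the LMDB follows PIDM's preprocessing and is not documented here, so treat any parsing of the keys as an assumption.

```python
# Generic LMDB sanity check (not part of the repo): prints the entry
# count and the first few raw keys. The key format follows PIDM's
# preprocessing and may differ from what you expect.
import lmdb

env = lmdb.open("datasets/deepfashion/256-256", readonly=True, lock=False)
with env.begin() as txn:
    print("entries:", txn.stat()["entries"])
    for i, (key, value) in enumerate(txn.cursor()):
        print(key[:60], "->", len(value), "bytes")
        if i >= 4:  # only the first five entries
            break
env.close()
```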
Training
CUDA_VISIBLE_DEVICES=0 bash run_train.sh
By default, it will save a checkpoint every 10k steps, which you can then use for inference as described below.
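If a run has produced several checkpoints, a small helper like this can pick the most recent one for inference (the directory and file pattern are hypothetical; adjust them to your run_train.sh configuration):

```python
# Hypothetical helper: pick the newest checkpoint from a run directory.
# "results/xmdpt" and the "*.pt" pattern are assumptions, not repo defaults.
from pathlib import Path

def latest_checkpoint(run_dir: str, pattern: str = "*.pt") -> Path:
    ckpts = sorted(Path(run_dir).glob(pattern), key=lambda p: p.stat().st_mtime)
    if not ckpts:
        raise FileNotFoundError(f"no checkpoints matching {pattern} in {run_dir}")
    return ckpts[-1]

print(latest_checkpoint("results/xmdpt"))  # e.g. results/xmdpt/0100000.pt
```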
Inference
Download all checkpoints and the VAE (only the decoder is fine-tuned) and place them in the paths expected by the defaults in infer_xmdpt.py.
For the DeepFashion test set, run the following:
CUDA_VISIBLE_DEVICES=0 python infer_xmdpt.py
It will save the output image samples to test_img in this repo.
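To eyeball the saved samples, a throwaway script like the following tiles a few of them into one grid (it assumes the outputs in test_img are standard PNG files; adjust the glob if the format differs):

```python
# Throwaway viewer: tile the first few generated samples from test_img
# into a single grid image. Assumes standard PNG outputs.
from pathlib import Path
from PIL import Image

paths = sorted(Path("test_img").glob("*.png"))[:8]
assert paths, "no PNGs found; adjust the glob to match the saved format"
imgs = [Image.open(p) for p in paths]
w, h = imgs[0].size
grid = Image.new("RGB", (w * len(imgs), h))
for i, im in enumerate(imgs):
    grid.paste(im, (i * w, 0))
grid.save("samples_grid.png")
```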
For an arbitrary image, run the following (not implemented yet):
CUDA_VISIBLE_DEVICES=0 python infer_xmdpt.py --image_path test.png
Pretrained Models
All of our models were trained and tested on a single A100 (80GB) GPU.
| Model | Step | Resolution | FID | Params | Inference Time | Link |
|---|---|---|---|---|---|---|
| X-MDPT-S | 300k | 256x256 | 7.42 | 33.5M | 1.1s | Link |
| X-MDPT-B | 300k | 256x256 | 6.72 | 131.9M | 1.3s | Link |
| X-MDPT-L | 300k | 256x256 | 6.60 | 460.2M | 3.1s | Link |
| VAE | - | - | - | - | - | Link |
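After downloading, you can sanity-check a checkpoint against the parameter counts in the table above. This is a rough sketch: the checkpoint filename and layout (e.g. whether weights sit under a "model" key) are assumptions.

```python
# Rough sanity check: the total parameter count of a downloaded checkpoint
# should be close to the table above. Path and state-dict layout are assumed.
import torch

ckpt = torch.load("checkpoints/xmdpt_b.pt", map_location="cpu")
state = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
n_params = sum(v.numel() for v in state.values() if torch.is_tensor(v))
print(f"{n_params / 1e6:.1f}M parameters")  # expect ~131.9M for X-MDPT-B
```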
Expected Outputs
Citation
If X-MDPT is useful or relevant to your research, please kindly recognize our contributions by citing our paper:
@inproceedings{pham2024crossview,
title={Cross-view Masked Diffusion Transformers for Person Image Synthesis},
author={Trung X. Pham and Kang Zhang and Chang D. Yoo},
booktitle={Forty-first International Conference on Machine Learning},
year={2024},
url={https://openreview.net/forum?id=jEoIkNkqyc}
}
Acknowledgements
This work was supported by the Institute for Information & Communications Technology Planning & Evaluation (IITP) grants funded by the Korean government (MSIT) (No. 2021-0-01381, Development of Causal AI through Video Understanding and Reinforcement Learning, and Its Applications to Real Environments) and (No. 2022-0-00184, Development and Study of AI Technologies to Inexpensively Conform to Evolving Policy on Ethics).
Helpful Repos
Thanks to the nice works MDT (ICCV 2023) and PIDM (CVPR 2023) for publishing their code.