📃 Paper
🔥News! The paper of FeedFace is accepted by the ICLR 2024 Tiny Papers track!
In this paper, we introduce FeedFace, a novel inference-based method designed to augment text-to-image diffusion models with face-based conditional generation. Trained on a thoroughly curated and annotated dataset of diverse human faces, FeedFace operates without additional training or fine-tuning for new facial conditions during generation. Our method can create, in seconds, images that are not only true to the textual descriptions but also exhibit remarkable facial faithfulness. In addition, our model supports using multiple faces as input conditions, leveraging extra facial information to improve the consistency of the generated images. FeedFace is compatible with different architectures, including U-ViT-based (e.g., UniDiffuser) and U-Net-based (e.g., Stable Diffusion) diffusion models. A key strength of our method lies in its efficiency. Through our experiments, we demonstrate that FeedFace can produce face-conditioned samples with comparable quality to leading industry methods (e.g., Face0), using only 0.4% of the data volume and fewer than 5% of the training steps required by these methods.
pip install -r requirements.txt
Download the pretrained checkpoints, along with the necessary autoencoder and caption decoder, from Hugging Face and place them in the models
folder as follows (a download sketch is given after the folder layout):
models
├── autoencoder_kl.pth
├── caption_decoder.pth
├── feed-4800000.pt
├── gpt2
│ ├── config.json
│ ├── generation_config.json
│ ├── merges.txt
│ ├── pytorch_model.bin
│ ├── tokenizer.json
│ └── vocab.json
└── uvit_v1.pth
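If the checkpoints are hosted on the Hugging Face Hub, a small helper like the one below can fetch them into the models folder. This is only a sketch: the repository id is a placeholder (not the official release repo), and the gpt2 files can be fetched the same way.

```python
# Sketch of downloading the checkpoints with huggingface_hub.
# NOTE: "your-org/FeedFace" is a placeholder repo id, not the official one.
from huggingface_hub import hf_hub_download

REPO_ID = "your-org/FeedFace"  # placeholder: replace with the actual repo id
for filename in [
    "autoencoder_kl.pth",
    "caption_decoder.pth",
    "feed-4800000.pt",
    "uvit_v1.pth",
]:
    hf_hub_download(repo_id=REPO_ID, filename=filename, local_dir="models")
```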
Then run the inference script directly:
python inference.py --prompt "a beautiful smiling woman" --image "path/to/face.jpg" --output "outputs"
You can also use a JSON file to run inference in batch mode with a batch of reference images; the json_examples
directory provides some examples (a batch driver sketch follows the command below):
python inference.py -j json_examples/1433.json
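To run every example configuration in json_examples in one go, a small driver script along these lines works; it relies only on the -j interface shown above.

```python
# Run inference.py once per JSON config found in json_examples/.
import glob
import subprocess

for json_path in sorted(glob.glob("json_examples/*.json")):
    subprocess.run(["python", "inference.py", "-j", json_path], check=True)
```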
Our model can generate face-conditioned samples with high similarity to the input image.
It also retains the textual alignment of the input prompt.
To reproduce our work, we provide the following steps:
Download the FFHQ in-the-wild images from here.
We provide the list of images used in our training dataset, along with their captions, in configs/FFHQ_in_the_wild-llava_1_5_13b.jsonl.
The masks should be regenerated using the libs/make_data.py
script.
An example of using the script:
python libs/make_data.py --data_folder data_examples --out_folder data_process
# for multi-GPU processing, you can use the accelerate library:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch libs/make_data.py --data_folder data_examples --out_folder data_process
Modify the configs/train_config.py
file to set the necessary hyperparameters. The jsonl_files
must exist, and all paths inside the JSONL files should be correct (a quick validation sketch follows the snippet below).
jsonl_files=[
"configs/FFHQ_in_the_wild-llava_1_5_13b.jsonl"
],
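A quick way to catch broken paths before launching training is a small validation pass over the JSONL files. The sketch below does not assume specific field names; it simply treats any string value with a common file extension as a path (the extension list is an assumption).

```python
# Validate that the jsonl files exist and that path-like values inside them resolve.
import json
import os

jsonl_files = ["configs/FFHQ_in_the_wild-llava_1_5_13b.jsonl"]
# Assumed set of extensions that mark a value as a file path.
PATH_SUFFIXES = (".png", ".jpg", ".jpeg", ".webp", ".npy")

for jsonl_path in jsonl_files:
    assert os.path.exists(jsonl_path), f"missing jsonl file: {jsonl_path}"
    with open(jsonl_path) as f:
        for line_no, line in enumerate(f, 1):
            record = json.loads(line)
            for value in record.values():
                if isinstance(value, str) and value.lower().endswith(PATH_SUFFIXES):
                    if not os.path.exists(value):
                        print(f"{jsonl_path}:{line_no}: missing path {value}")
```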
Then run the training script:
accelerate launch --num_processes 8 --mixed_precision fp16 --multi_gpu train.py --config=configs/train_config.py
In the context of facial generation, nuanced details such as facial expressions and orientation still pose difficulties, often resulting in a pasting-like artifact. Moreover, although our model is proficient at producing high-quality and consistent facial images, there are noticeable trade-offs in semantic alignment with the textual descriptions and in overall image quality. These issues underscore the current limitations and point towards potential avenues for future research.
If you make use of our work, please cite our paper:
@inproceedings{xiang2024feedface,
  title={FeedFace: Efficient Inference-based Face Personalization via Diffusion Models},
  author={Chendong Xiang and Armando Fortes and Khang Hui Chua and Hang Su and Jun Zhu},
  booktitle={The Second Tiny Papers Track at ICLR 2024},
  year={2024},
  url={https://openreview.net/forum?id=PqPKBcamy3}
}