PyTorch implementation of EmoEditor, an emotion-evoked diffusion model.
Code and data will be released upon paper acceptance.
Make Me Happier: Evoking Emotions Through Image Diffusion Models
Qing Lin, Jingfeng Zhang, Yew Soon Ong, Mengmi Zhang*
*Corresponding author
Despite the rapid progress in image generation, emotional image editing remains under-explored. The semantics, context, and structure of an image can evoke emotional responses, making emotional image editing techniques valuable for various real-world applications, including treatment of psychological disorders, commercialization of products, and artistic design. For the first time, we present the challenge of emotion-evoked image generation, which aims to synthesize images that evoke target emotions while retaining the semantics and structures of the original scenes. To address this challenge, we propose a diffusion model capable of effectively understanding and editing source images to convey desired emotions and sentiments. Moreover, to address the lack of emotion-editing datasets, we provide a unique dataset consisting of 340,000 pairs of images and their emotion annotations. Furthermore, we conduct human psychophysics experiments and introduce four new evaluation metrics to systematically benchmark all the methods. Experimental results demonstrate that our method surpasses all competitive baselines: our diffusion model identifies emotional cues in the original images, edits them to elicit the desired emotions, and at the same time preserves the semantic structure of the original images.
The generated images evoke a sense of happiness in viewers, in contrast to the negative emotions elicited by the source images. Given a source image that triggers negative emotions (framed in green), our method (Ours) synthesizes a new image that elicits the given positive target emotion (in red), while maintaining the essential elements and structures of the scene.

The dataset comprises two subsets: the EmoPair-Annotated Subset (EPAS, left blue box) and the EmoPair-Generated Subset (EPGS, right orange box). Each subset includes schematics depicting the creation, selection, and labeling of image pairs in the upper quadrants, with two example pairs in the lower quadrants. Each example pair comprises a source image (framed in green) and a target image; the classified source and target emotion labels (highlighted in red) and the target-emotion-driven text instructions for image editing are provided.

All methods are benchmarked with human psychophysics experiments and four newly introduced metrics.
We compare with five state-of-the-art methods: (1) Color Transfer (CT); (2) Neural Style Transfer (NST); (3) CLIPstyler (Csty); (4) InstructPix2Pix (Ip2p); (5) Large Model Series (LMS), which chains BLIP for image captioning, GPT-3 for text instruction generation, and Ip2p for image editing based on the generated instructions.
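For reference, below is a minimal sketch of the LMS pipeline, assuming the Hugging Face `transformers` BLIP captioner, the `diffusers` InstructPix2Pix pipeline (`timbrooks/instruct-pix2pix`), and the OpenAI API (a chat-completion call stands in for the original GPT-3 step). The checkpoints, prompt template, and example target emotion are illustrative assumptions, not the exact configuration used in the paper.

```python
# Sketch of the LMS baseline: BLIP captioning -> GPT instruction generation -> InstructPix2Pix editing.
# Checkpoints, the prompt template, and the example target emotion are illustrative assumptions.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from diffusers import StableDiffusionInstructPix2PixPipeline
from openai import OpenAI

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Caption the source image with BLIP.
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base").to(device)

def caption(image: Image.Image) -> str:
    inputs = blip_processor(images=image, return_tensors="pt").to(device)
    out = blip.generate(**inputs, max_new_tokens=30)
    return blip_processor.decode(out[0], skip_special_tokens=True)

# 2) Turn the caption and target emotion into an editing instruction with a GPT model
#    (the paper's LMS baseline uses GPT-3; a chat-completion call stands in here).
client = OpenAI()  # expects OPENAI_API_KEY in the environment

def make_instruction(image_caption: str, target_emotion: str) -> str:
    prompt = (
        f"An image shows: {image_caption}. "
        f"Write one short editing instruction that would make the image evoke {target_emotion}."
    )
    resp = client.chat.completions.create(model="gpt-3.5-turbo",
                                          messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content.strip()

# 3) Edit the image with InstructPix2Pix following the generated instruction.
ip2p = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)

source = Image.open("source.jpg").convert("RGB")
instruction = make_instruction(caption(source), "contentment")
edited = ip2p(instruction, image=source, num_inference_steps=20).images[0]
edited.save("edited_lms.jpg")
```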
| Method | EMR (%) ↑ | ESR (%) ↑ | ENRD ↓ | ESS ↓ |
| --- | --- | --- | --- | --- |
| CT | 6.89 | 79.32 | 33.29 | 7.36 |
| NST | 34.42 | 92.01 | 34.45 | 18.57 |
| Csty | 11.51 | 85.52 | 41.47 | 36.64 |
| Ip2p | 2.53 | 67.76 | 9.39 | 12.71 |
| LMS | 11.51 | 77.38 | 26.13 | 19.74 |
| w/o | 5.06 | 69.15 | 19.45 | 14.93 |
| w/o | 43.35 | 91.62 | 22.53 | 16.00 |
| Ours | 50.20 | 92.86 | 23.98 | 16.27 |
Emotion Matching Ratio (EMR) and Emotion Similarity Ratio (ESR) assess the extent to which the generated images evoke target emotions.
EMR: We use the emotion predictor to classify the emotion evoked by each generated image; EMR is the percentage of generated images whose predicted emotion matches the target emotion.
ESR: Our emotion predictor is applied in the same way, but with a more lenient criterion than the exact category match used by EMR: a generated image counts as successful when its predicted emotion is similar to the target emotion.
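As an illustration, here is a minimal sketch of how EMR (and a lenient ESR-style criterion) could be computed from the emotion predictor's per-image labels. The 8-emotion taxonomy, the positive/negative valence grouping, and the `emotion_predictor` call in the usage comment are assumptions, not the released evaluation code.

```python
# Sketch of EMR and a lenient ESR-style criterion computed from predicted emotion labels.
# The 8-emotion taxonomy and the positive/negative valence grouping are assumptions.
EMOTIONS = ["amusement", "awe", "contentment", "excitement",
            "anger", "disgust", "fear", "sadness"]
POSITIVE = {"amusement", "awe", "contentment", "excitement"}

def emr(pred_labels: list[str], target_labels: list[str]) -> float:
    """Emotion Matching Ratio: % of generated images whose predicted emotion equals the target."""
    matches = sum(p == t for p, t in zip(pred_labels, target_labels))
    return 100.0 * matches / len(target_labels)

def esr_valence(pred_labels: list[str], target_labels: list[str]) -> float:
    """A lenient similarity criterion (assumed): predicted and target emotions share the same valence."""
    same = sum((p in POSITIVE) == (t in POSITIVE) for p, t in zip(pred_labels, target_labels))
    return 100.0 * same / len(target_labels)

# Usage (hypothetical): `emotion_predictor(img)` returns logits over EMOTIONS.
# preds = [EMOTIONS[emotion_predictor(img).argmax()] for img in generated_images]
# print(emr(preds, targets), esr_valence(preds, targets))
```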
Emotion-Neutral Region Deviation (ENRD) and Edge Structure Similarity (ESS) measure the structural coherence and semantic consistency between source and generated images.
ENRD: We first identify emotion-neutral regions of the source images by applying the Grad-CAM technique through our emotion predictor: pixels with low emotional activation are treated as emotion-neutral. ENRD then measures the pixel-wise deviation between the source and generated images within these regions, so lower values indicate better preservation of emotion-irrelevant content.
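A rough sketch of this computation is given below, assuming a CNN emotion predictor, the `pytorch-grad-cam` package, a 0.5 CAM threshold for "emotion-neutral", and a mean per-pixel L1 deviation reported in 0-255 units; all of these specifics are assumptions rather than the paper's exact implementation.

```python
# Sketch of ENRD: Grad-CAM on the emotion predictor highlights emotionally salient regions;
# low-activation pixels are treated as emotion-neutral, and the per-pixel L1 deviation between
# source and generated images is averaged over that neutral mask. The predictor, the 0.5 CAM
# threshold, and the 0-255 scaling are assumptions.
import numpy as np
import torch
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

def enrd(source: torch.Tensor, generated: torch.Tensor,
         emotion_predictor: torch.nn.Module, target_layer: torch.nn.Module,
         source_emotion_idx: int, cam_threshold: float = 0.5) -> float:
    """source/generated: (1, 3, H, W) tensors in [0, 1]. Lower ENRD = better preservation."""
    cam = GradCAM(model=emotion_predictor, target_layers=[target_layer])
    heatmap = cam(input_tensor=source,
                  targets=[ClassifierOutputTarget(source_emotion_idx)])[0]       # (H, W), values in [0, 1]
    neutral = torch.from_numpy((heatmap < cam_threshold).astype(np.float32))     # 1 = emotion-neutral pixel
    diff = (source - generated).abs().mean(dim=1)[0].cpu()                       # per-pixel L1 over RGB
    return float((diff * neutral).sum() / neutral.sum().clamp(min=1.0)) * 255.0  # report in 0-255 pixel units
```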
ESS: We apply the Canny edge detection algorithm to both the source and generated images, with low and high thresholds of 200 and 500. We then compute the L1 norm between the resulting edge maps to quantify their structural difference.
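This maps directly onto OpenCV's Canny implementation; a minimal sketch follows, where the grayscale loading, the resolution alignment, and the per-pixel normalization of the L1 norm are assumptions made for illustration.

```python
# Sketch of ESS: Canny edge maps of the source and generated images (thresholds 200 / 500),
# followed by an L1 norm between the two maps. Normalizing by the number of pixels is an
# assumption made here to keep the score independent of image resolution.
import cv2
import numpy as np

def ess(source_path: str, generated_path: str) -> float:
    src = cv2.imread(source_path, cv2.IMREAD_GRAYSCALE)
    gen = cv2.imread(generated_path, cv2.IMREAD_GRAYSCALE)
    gen = cv2.resize(gen, (src.shape[1], src.shape[0]))   # align resolutions before comparing
    src_edges = cv2.Canny(src, 200, 500)                  # low / high hysteresis thresholds
    gen_edges = cv2.Canny(gen, 200, 500)
    l1 = np.abs(src_edges.astype(np.float32) - gen_edges.astype(np.float32)).sum()
    return float(l1) / 255.0 / src_edges.size * 100.0     # % of pixels whose edge labels differ
```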
@article{lin2024emoeditor,
title={Make Me Happier: Evoking Emotions Through Image Diffusion Models},
author={Qing Lin and Jingfeng Zhang and Yew Soon Ong and Mengmi Zhang},
journal={arXiv preprint arXiv:2403.08255},
year={2024}
}
We benefit a lot from the CompVis/stable-diffusion, timothybrooks/instruct-pix2pix, and ayaanzhaque/instruct-nerf2nerf repositories.