This is an implementation of "StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery." Here is the official GitHub link: StyleCLIP
Demo videos: out3.mp4 (video 1) and out7.mp4 (video 2).
A simple approach for leveraging CLIP to guide image manipulation is through direct latent code optimization. This method involves three essential components:
- Requirements:
  - A pre-trained StyleGAN model.
  - A source latent code $w_s$ (usually obtained by passing random noise $z$ through the generator's mapping network; we can also invert a real image with e4e to edit an image of our choice).
  - A pre-trained CLIP model.
- Loss Function: The loss function consists of three parts:
  - CLIP Loss $D_{\text{CLIP}}(G(w), t)$: the cosine distance between the CLIP embeddings of its two arguments, where $G$ is a pre-trained StyleGAN generator and $t$ is the text prompt.
  - L2 Norm: the L2 distance between the source latent code $w_s$ and the target latent code $w$.
  - Identity Loss: ensures that the identity in the image remains unchanged while allowing modifications to other visual features (e.g., hairstyle, expression, presence of glasses). It is computed with a pre-trained ArcFace face-recognition network.
- Finding the Optimized $w$: We find the optimized $w$ by solving the optimization problem with gradient descent. The gradient of the objective function is backpropagated while the pre-trained StyleGAN and CLIP models are kept frozen. Typically, 150 to 250 optimization steps yield decent results. The $\lambda_{L2}$ parameter usually ranges from 0.02 to 0.06, depending on how much you want to change the photo. The $\lambda_{ID}$ parameter is only applied when editing human faces.
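Putting the three terms together, the latent optimization objective takes the form used in the StyleCLIP paper:

$$\min_{w} \; D_{\text{CLIP}}(G(w), t) + \lambda_{L2} \lVert w - w_s \rVert_2 + \lambda_{ID}\, \mathcal{L}_{ID}(w)$$

Below is a minimal sketch of this optimization loop in PyTorch. It assumes `G` is an already-loaded stylegan2-ada-pytorch generator and `w_src` is a source latent code of shape `[1, G.num_ws, G.w_dim]`; the identity term and CLIP's mean/std input normalization are omitted for brevity, and the variable names are illustrative rather than taken from the official code.

```python
# Minimal sketch of the latent-optimization loop (illustrative names; assumes
# `G` is a loaded stylegan2-ada-pytorch generator and `w_src` a source latent).
import torch
import torch.nn.functional as F
import clip

device = "cuda"
clip_model, _ = clip.load("ViT-B/32", device=device)

text = "a person with curly hair"                       # example text prompt t
with torch.no_grad():
    txt_emb = clip_model.encode_text(clip.tokenize([text]).to(device))

lambda_l2 = 0.04                                        # typically 0.02-0.06
w = w_src.clone().detach().requires_grad_(True)         # optimize w, keep w_src fixed
optimizer = torch.optim.Adam([w], lr=0.1)

for step in range(200):                                 # 150-250 steps usually suffice
    img = G.synthesis(w, noise_mode="const")            # [1, 3, H, W] in [-1, 1]
    img = F.interpolate((img + 1) / 2, size=224,        # resize for CLIP's image encoder
                        mode="bilinear", align_corners=False)
    img_emb = clip_model.encode_image(img)              # CLIP mean/std normalization omitted
    clip_loss = 1 - F.cosine_similarity(img_emb, txt_emb).mean()
    l2_loss = torch.norm(w - w_src)                     # keeps w close to the source code
    loss = clip_loss + lambda_l2 * l2_loss              # + lambda_id * identity_loss(img) for faces
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```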
To install the pre-trained CLIP model, run the following command:
pip install git+https://github.com/openai/CLIP.git
To install Ninja, execute the following commands:
!wget https://github.com/ninja-build/ninja/releases/download/v1.8.2/ninja-linux.zip
!sudo unzip ninja-linux.zip -d /usr/local/bin/
!sudo update-alternatives --install /usr/bin/ninja ninja /usr/local/bin/ninja 1 --force
Clone the StyleGAN repository:
!git clone https://github.com/NVlabs/stylegan2-ada-pytorch.git
%cd stylegan2-ada-pytorch
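Once the repository is cloned, a pre-trained generator can be loaded along the lines of the sketch below (run from inside the cloned directory). The checkpoint path and truncation value are placeholders; `legacy.load_network_pkl` is the loader provided by stylegan2-ada-pytorch.

```python
# Sketch of loading a pre-trained StyleGAN2-ADA generator; the .pkl path is a placeholder.
import torch
import dnnlib
import legacy

device = torch.device("cuda")
network_pkl = "ffhq.pkl"  # placeholder: any StyleGAN2-ADA checkpoint

with dnnlib.util.open_url(network_pkl) as f:
    G = legacy.load_network_pkl(f)["G_ema"].to(device)   # generator used for synthesis

# Map random noise z to a latent code w and synthesize an image.
z = torch.randn([1, G.z_dim], device=device)
w = G.mapping(z, None, truncation_psi=0.7)                # [1, G.num_ws, G.w_dim]
img = G.synthesis(w, noise_mode="const")                  # [1, 3, H, W] in [-1, 1]
```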
- Using Pre-trained StyleGAN Model: Install Ninja and modify the code in `torch.utils`.
- Using Mac: Unable to train the network for an unknown reason. The error message was: `Expected scalar_type == ScalarType::Float || inputTensor.scalar_type() == ScalarType::Int || scalar_type == ScalarType::Bool to be true, but got false`.
- CLIP Installation: Install directly from OpenAI's CLIP repository to avoid errors. Do not use `pip install CLIP`.
- Loss Function Parameters: Adjust the parameters in the loss function to achieve the ideal result.
Figure 1: Wild animal image manipulation using the global direction method. The neutral class is 'original' and the target class is 'happy'.
Figure 2: Wild animal image inversion using ReStyle e4e. The first 5 images are generated by the inverted code from e4e, and the last image is the original.
The goal is to manipulate an image based on a given text prompt by mapping the prompt to a single, global manipulation direction $\Delta s$ in StyleGAN's style space $\mathcal{S}$.
- In well-trained regions of the CLIP embedding space, we expect a change in the image embedding and the corresponding change in the text embedding to point in similar directions, i.e., to have high cosine similarity.
- Given two images, $G(s)$ and $G(s + \alpha \Delta s)$, their respective CLIP embeddings, $i$ and $i + \Delta i$, can be computed. The text prompt is encoded as the direction $\Delta t$. By assessing the cosine similarity between $\Delta t$ and $\Delta i$ channel by channel, we can determine the manipulation direction $\Delta s$.
- Natural Language Instruction $(\Delta t)$: Using a pre-defined text prompt bank (e.g., the ImageNet prompt templates), we generate phrases such as "a bad photo of {}", "a photo of a small {}", etc., to produce average embeddings for both the target and neutral classes. The normalized difference between these two embeddings gives $\Delta t$ (see the sketch after this list).
- Channel-Wise Manipulation: Perform channel-wise manipulations $\alpha \Delta s$ of the style code $s$ over several image pairs; the CLIP-space direction between each resulting pair gives $\Delta i_c$ for channel $c$.
- Channel Relevance Calculation: For each channel $c$ in the StyleGAN style space, project its CLIP-space direction $\Delta i_c$ onto the target manipulation direction $\Delta i$ to calculate the channel's relevance; channels with low relevance are left unchanged.
- Apply Manipulation Direction $\Delta s$: Apply the computed manipulation direction $\Delta s$ to the style code of the intended image, generating a modified image consistent with the desired attribute indicated by the text prompt.
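As an illustration of the $\Delta t$ step, here is a hedged sketch of prompt engineering with CLIP. The template strings and class names are illustrative examples, not the exact prompt bank used by StyleCLIP.

```python
# Sketch of computing the text direction Δt via prompt engineering.
import torch
import clip

device = "cuda"
clip_model, _ = clip.load("ViT-B/32", device=device)
templates = ["a bad photo of a {}.", "a photo of a small {}.", "a photo of a {}."]

def average_text_embedding(class_name: str) -> torch.Tensor:
    # Encode the class under every template and average the normalized embeddings.
    tokens = clip.tokenize([t.format(class_name) for t in templates]).to(device)
    with torch.no_grad():
        emb = clip_model.encode_text(tokens).float()
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return emb.mean(dim=0)

# Δt is the normalized difference between the target and neutral class embeddings.
delta_t = average_text_embedding("happy") - average_text_embedding("original")
delta_t = delta_t / delta_t.norm()
```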
References:
- StyleCLIP: Official Implementation for "StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery"
- stylegan2-ada-pytorch: StyleGAN2-ADA, Official PyTorch implementation
- StyleGAN2: Official TensorFlow Implementation
- stylegan2-pytorch: StyleGAN2, PyTorch implementation
- e4e: Encoder for Editing
- CLIP: Contrastive Language-Image Pretraining
- restyle-encoder: Official Implementation for "ReStyle: A Residual-Based StyleGAN Encoder via Iterative Refinement"