This is an implementation of "StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery." Here is the official GitHub link: StyleCLIP
Demo videos: out3.mp4 (video 1) and out7.mp4 (video 2).
A simple approach for leveraging CLIP to guide image manipulation is through direct latent code optimization. This method involves three essential components:
- Requirements:
  - A pre-trained StyleGAN model.
  - A source latent code $w_s$ (usually obtained by passing random noise $z$ through the generator's mapping network; we can also invert a real image with e4e to edit an image of our choice).
  - A pre-trained CLIP model.
- Loss Function: The loss function consists of three parts:
  - CLIP Loss $D_{\text{CLIP}}(G(w), t)$: the cosine distance between the CLIP embeddings of its two arguments, where $G$ is a pre-trained StyleGAN generator and $t$ is the text prompt.
  - L2 Norm: the L2 distance between the source latent code $w_s$ and the target latent code $w$.
  - Identity Loss: ensures that the identity in the image remains unchanged while allowing modifications to other visual features (e.g., hairstyle, expression, presence of glasses). It is computed with a pre-trained ArcFace face-recognition network.
- Finding the Optimized $w$: We find the optimized $w$ by solving the optimization problem with gradient descent. The gradient of the objective function is backpropagated while the pre-trained StyleGAN and CLIP models are kept frozen. Typically, 150 to 250 optimization steps yield decent results. The $\lambda_{L2}$ parameter usually ranges from 0.02 to 0.06, depending on how much you want to change the photo. The $\lambda_{ID}$ parameter is only applied when editing human faces.
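Putting the three terms together, the latent optimization objective takes the form used in the StyleCLIP paper:

$$\min_{w} \; D_{\text{CLIP}}(G(w), t) + \lambda_{L2} \lVert w - w_s \rVert_2 + \lambda_{ID}\, \mathcal{L}_{ID}(w)$$

Below is a minimal sketch of this optimization loop in PyTorch. It assumes `G` is an already-loaded stylegan2-ada-pytorch generator and `w_src` is a source latent code of shape `[1, G.num_ws, G.w_dim]`; the identity term and CLIP's mean/std input normalization are omitted for brevity, and the variable names are illustrative rather than taken from the official code.

```python
# Minimal sketch of the latent-optimization loop (illustrative names; assumes
# `G` is a loaded stylegan2-ada-pytorch generator and `w_src` a source latent).
import torch
import torch.nn.functional as F
import clip

device = "cuda"
clip_model, _ = clip.load("ViT-B/32", device=device)

text = "a person with curly hair"                       # example text prompt t
with torch.no_grad():
    txt_emb = clip_model.encode_text(clip.tokenize([text]).to(device))

lambda_l2 = 0.04                                        # typically 0.02-0.06
w = w_src.clone().detach().requires_grad_(True)         # optimize w, keep w_src fixed
optimizer = torch.optim.Adam([w], lr=0.1)

for step in range(200):                                 # 150-250 steps usually suffice
    img = G.synthesis(w, noise_mode="const")            # [1, 3, H, W] in [-1, 1]
    img = F.interpolate((img + 1) / 2, size=224,        # resize for CLIP's image encoder
                        mode="bilinear", align_corners=False)
    img_emb = clip_model.encode_image(img)              # CLIP mean/std normalization omitted
    clip_loss = 1 - F.cosine_similarity(img_emb, txt_emb).mean()
    l2_loss = torch.norm(w - w_src)                     # keeps w close to the source code
    loss = clip_loss + lambda_l2 * l2_loss              # + lambda_id * identity_loss(img) for faces
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```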
To install the pre-trained CLIP model, run the following command:
pip install git+https://github.com/openai/CLIP.git
To install Ninja, execute the following commands:
!wget https://github.com/ninja-build/ninja/releases/download/v1.8.2/ninja-linux.zip
!sudo unzip ninja-linux.zip -d /usr/local/bin/
!sudo update-alternatives --install /usr/bin/ninja ninja /usr/local/bin/ninja 1 --force
Clone the StyleGAN repository:
!git clone https://github.com/NVlabs/stylegan2-ada-pytorch.git
%cd stylegan2-ada-pytorch
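Once the repository is cloned, a pre-trained generator can be loaded along the lines of the sketch below (run from inside the cloned directory). The checkpoint path and truncation value are placeholders; `legacy.load_network_pkl` is the loader provided by stylegan2-ada-pytorch.

```python
# Sketch of loading a pre-trained StyleGAN2-ADA generator; the .pkl path is a placeholder.
import torch
import dnnlib
import legacy

device = torch.device("cuda")
network_pkl = "ffhq.pkl"  # placeholder: any StyleGAN2-ADA checkpoint

with dnnlib.util.open_url(network_pkl) as f:
    G = legacy.load_network_pkl(f)["G_ema"].to(device)   # generator used for synthesis

# Map random noise z to a latent code w and synthesize an image.
z = torch.randn([1, G.z_dim], device=device)
w = G.mapping(z, None, truncation_psi=0.7)                # [1, G.num_ws, G.w_dim]
img = G.synthesis(w, noise_mode="const")                  # [1, 3, H, W] in [-1, 1]
```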
- Using Pre-trained StyleGAN Model: Install Ninja and modify the code in `torch.utils`.
- Using Mac: Unable to train the network for an unknown reason. The error message was: `Expected scalar_type == ScalarType::Float || inputTensor.scalar_type() == ScalarType::Int || scalar_type == ScalarType::Bool to be true, but got false`.
- CLIP Installation: Install directly from OpenAI's CLIP repository to avoid errors. Do not use `pip install CLIP`.
- Loss Function Parameters: Adjust the parameters in the loss function to achieve the ideal result.
Figure 1: Wild animal image manipulation using the global direction method. The neutral class is 'original' and the target class is 'happy'.
Figure 2: Wild animal image inversion using ReStyle e4e. The first 5 images are generated by the inverted code from e4e, and the last image is the original.
The goal is to manipulate an image based on a given text prompt by mapping the prompt to a single, global manipulation direction $\Delta s$ in StyleGAN's style space $\mathcal{S}$.
- In well-trained regions of the CLIP embedding space, we expect a change in the image embedding and the corresponding change in the text embedding to point in similar directions, i.e., to have high cosine similarity.
- Given two images, $G(s)$ and $G(s + \alpha \Delta s)$, their respective CLIP embeddings, $i$ and $i + \Delta i$, can be computed. The text prompt is encoded as the direction $\Delta t$. By assessing the cosine similarity between $\Delta t$ and $\Delta i$ channel by channel, we can determine the manipulation direction $\Delta s$.
- Natural Language Instruction $(\Delta t)$: Using a pre-defined text prompt bank (e.g., the ImageNet prompt templates), we generate phrases such as "a bad photo of {}", "a photo of a small {}", etc., to produce average embeddings for both the target and neutral classes. The normalized difference between these two embeddings gives $\Delta t$ (see the sketch after this list).
- Channel-Wise Manipulation: Perform channel-wise manipulations $\alpha \Delta s$ of the style code $s$ over several image pairs; the CLIP-space direction between each resulting pair gives $\Delta i_c$ for channel $c$.
- Channel Relevance Calculation: For each channel $c$ in the StyleGAN style space, project its CLIP-space direction $\Delta i_c$ onto the target manipulation direction $\Delta i$ to calculate the channel's relevance; channels with low relevance are left unchanged.
- Apply Manipulation Direction $\Delta s$: Apply the computed manipulation direction $\Delta s$ to the style code of the intended image, generating a modified image consistent with the desired attribute indicated by the text prompt.
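As an illustration of the $\Delta t$ step, here is a hedged sketch of prompt engineering with CLIP. The template strings and class names are illustrative examples, not the exact prompt bank used by StyleCLIP.

```python
# Sketch of computing the text direction Δt via prompt engineering.
import torch
import clip

device = "cuda"
clip_model, _ = clip.load("ViT-B/32", device=device)
templates = ["a bad photo of a {}.", "a photo of a small {}.", "a photo of a {}."]

def average_text_embedding(class_name: str) -> torch.Tensor:
    # Encode the class under every template and average the normalized embeddings.
    tokens = clip.tokenize([t.format(class_name) for t in templates]).to(device)
    with torch.no_grad():
        emb = clip_model.encode_text(tokens).float()
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return emb.mean(dim=0)

# Δt is the normalized difference between the target and neutral class embeddings.
delta_t = average_text_embedding("happy") - average_text_embedding("original")
delta_t = delta_t / delta_t.norm()
```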
References:
- StyleCLIP: Official Implementation for "StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery"
- stylegan2-ada-pytorch: StyleGAN2-ADA, Official PyTorch implementation
- StyleGAN2: Official TensorFlow Implementation
- stylegan2-pytorch: StyleGAN2, PyTorch implementation
- e4e: Encoder for Editing
- CLIP: Contrastive Language-Image Pretraining
- restyle-encoder: Official Implementation for "ReStyle: A Residual-Based StyleGAN Encoder via Iterative Refinement"