A demo of fine-tuning Stable Diffusion on Pokemon-Blip-Captions in English, Japanese and Chinese corpora
Stable Diffusion is a state-of-the-art text-to-image model that generates images from text prompts.
Nowadays, with the help of diffusers, which provides pretrained diffusion models across multiple modalities, anyone can build their own image generator, either conditional (prompt-based) or unconditional.
This project focuses on running the text-to-image example that diffusers provides on lambdalabs/pokemon-blip-captions and migrating the task to the Japanese and Chinese domains,
in both the model and the data dimensions. Comparing the outcomes may give a guideline for fine-tuning Stable Diffusion in different languages.
All scripts are edited versions of the official train_text_to_image.py that make it work in the Japanese and Chinese domains.
Three pretrained models are provided: English, Japanese and Chinese.
Running install.sh will install all dependencies and download all models needed. (Make sure you are logged in to your Hugging Face account and have your access token at hand.) After the download, you can try run_en_model.py, run_ja_model.py and run_zh_model.py yourself.
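The run scripts amount to loading a fine-tuned pipeline and sampling from it. Below is a minimal sketch of that pattern, assuming a standard diffusers checkpoint saved by the training script; the `./en_model` output path and the prompt are illustrative, not the repository's actual values.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a fine-tuned checkpoint saved by the training script.
# "./en_model" is an illustrative output directory, not a fixed path.
pipe = StableDiffusionPipeline.from_pretrained(
    "./en_model",
    torch_dtype=torch.float16,
    safety_checker=None,  # disabled so no output gets blacked out (see below)
)
pipe = pipe.to("cuda")

image = pipe("cartoon bird", num_inference_steps=50).images[0]
image.save("cartoon_bird.png")
```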
To fine-tune in the Japanese and Chinese domains, all we need is lambdalabs/pokemon-blip-captions in Japanese and Chinese. I translated the captions with the help of DeepL and uploaded them to the Hugging Face dataset hub, as svjack/pokemon-blip-captions-en-ja and svjack/pokemon-blip-captions-en-zh.
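The translated sets can be pulled straight from the hub with datasets; this is a minimal sketch, and the split and column names should be checked against the dataset cards rather than assumed:

```python
from datasets import load_dataset

# Each dataset pairs a Pokemon image with its English caption plus the
# DeepL-translated Japanese (or Chinese) caption.
ds_ja = load_dataset("svjack/pokemon-blip-captions-en-ja", split="train")
ds_zh = load_dataset("svjack/pokemon-blip-captions-en-zh", split="train")

print(ds_ja.column_names)  # inspect the caption columns before training
```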
The English version, located in train_en_model.py, is simply a copy of train_text_to_image.py with the accelerate invocation changed from script launch to notebook launch via the notebook_launcher function.
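For context, notebook_launcher takes a plain Python function and spawns the training process(es) for it, so the same loop can run inside a notebook instead of under `accelerate launch`. A minimal sketch, where `train_loop` is a hypothetical stand-in for the script's training function:

```python
from accelerate import notebook_launcher

def train_loop():
    # Hypothetical stand-in for the training function that
    # train_text_to_image.py normally runs under `accelerate launch`.
    ...

# Spawns the process(es) and calls train_loop inside them.
notebook_launcher(train_loop, num_processes=1)
```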
The Japanese version, located in train_ja_model.py, replaces the pretrained model with rinnakk/japanese-stable-diffusion.
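In train_text_to_image.py terms, that swap means loading every component from the Japanese repo's subfolders. A minimal sketch, assuming the repo follows the standard diffusers directory layout; its tokenizer and text encoder classes may differ from stock CLIP, hence the Auto class:

```python
from diffusers import AutoencoderKL, UNet2DConditionModel
from transformers import AutoTokenizer

# Repo name as used in this README; the training script loads each
# component from a subfolder of the pretrained model repository.
model_id = "rinnakk/japanese-stable-diffusion"

tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="tokenizer")
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
```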
The Chinese version, located in train_zh_model.py, replaces the pretrained tokenizer and text_encoder with IDEA-CCNL/Taiyi-CLIP-Roberta-102M-Chinese and feeds the padded logit output of BertForTokenClassification to the downstream network in place of CLIPTextModel's hidden states.
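A minimal sketch of that encoder swap, with loud assumptions: `encode_zh_prompt` is an illustrative helper, `num_labels=768` is assumed so the per-token logits match the UNet's cross-attention width, and 77 mirrors CLIP's context length:

```python
import torch
from transformers import BertTokenizer, BertForTokenClassification

MAX_LEN = 77  # assumed to mirror CLIPTextModel's 77-token context

tokenizer = BertTokenizer.from_pretrained(
    "IDEA-CCNL/Taiyi-CLIP-Roberta-102M-Chinese"
)
text_encoder = BertForTokenClassification.from_pretrained(
    "IDEA-CCNL/Taiyi-CLIP-Roberta-102M-Chinese",
    num_labels=768,  # assumption: match the UNet cross-attention dim
)

def encode_zh_prompt(prompt: str) -> torch.Tensor:
    # Pad/truncate to a fixed length, then hand the per-token logits
    # to the UNet where CLIPTextModel's hidden states would normally go.
    inputs = tokenizer(
        prompt,
        padding="max_length",
        max_length=MAX_LEN,
        truncation=True,
        return_tensors="pt",
    )
    logits = text_encoder(**inputs).logits  # shape (1, MAX_LEN, 768)
    return logits
```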
To be able to look at all outputs, I disable the safety_checker (as in the inference sketch above) so that no output is blacked out during inference.
Prompt | English | Japanese | Chinese
---|---|---|---
A cartoon character with a potted plant on his head<br>鉢植えの植物を頭に載せた漫画のキャラクター<br>一个头上戴着盆栽的卡通人物 | (image) | (image) | (image)
cartoon bird<br>漫画の鳥<br>卡通鸟 | (image) | (image) | (image)
blue dragon illustration<br>ブルードラゴンのイラスト<br>蓝色的龙图 | (image) | (image) | (image)
The pretrained models in English, Japanese and Chinese were trained for 26,000, 26,000 and 20,000 steps respectively. The Japanese model outperforms the others, and the Chinese version ranks third. One interpretation: rinnakk/japanese-stable-diffusion already captures much of the culture and many visual features related to Pokemon; Stable Diffusion in the English domain fine-tunes favourably; and IDEA-CCNL/Taiyi-CLIP-Roberta-102M-Chinese, as its model card states, is only a text-feature fine-tuned version.
svjack - [email protected] - [email protected]
Project Link: https://github.com/svjack/Stable-Diffusion-Pokemon