
Stable-Diffusion-Pokemon

A demo of fine-tuning Stable Diffusion on Pokemon-Blip-Captions with English, Japanese and Chinese corpora

中文简介 (Chinese introduction)

Brief introduction

Stable Diffusion is a state-of-the-art text-to-image model that generates images from text.
Nowadays, with the help of diffusers, which provides pretrained diffusion models across multiple modalities, people can build their own conditional (prompt-based) or unconditional image generators.
This project runs the text-to-image example provided by diffusers on lambdalabs/pokemon-blip-captions, migrates the task to the Japanese and Chinese domains along both the model and data dimensions, and compares the results to offer a guideline for fine-tuning Stable Diffusion in different languages.
All code consists of edited versions of the official train_text_to_image.py that make it work in the Japanese and Chinese domains, and three pretrained models are provided in English, Japanese and Chinese.

Installation and Running

Running install.sh installs all dependencies and downloads all required models (make sure you are logged in to your Hugging Face account and have your access token). After the download, you can try run_en_model.py, run_ja_model.py and run_zh_model.py yourself.
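As a rough illustration of what the run_*_model.py scripts do, the sketch below loads a fine-tuned checkpoint with diffusers and generates an image from a prompt. The checkpoint path, function names and default arguments here are assumptions for illustration, not the project's exact code.

```python
# Minimal inference sketch, assuming a local or hub checkpoint path.
def load_pokemon_pipeline(checkpoint="path/to/fine_tuned_model", device="cpu"):
    from diffusers import StableDiffusionPipeline  # lazy import

    # safety_checker=None disables the NSFW filter so no output is
    # blacked out, mirroring the inference setting described below.
    pipe = StableDiffusionPipeline.from_pretrained(
        checkpoint, safety_checker=None
    )
    return pipe.to(device)

def generate(pipe, prompt, steps=50):
    # Returns a PIL image generated from the text prompt.
    return pipe(prompt, num_inference_steps=steps).images[0]
```

Usage would look like `generate(load_pokemon_pipeline(), "cartoon bird")`; on a GPU, pass `device="cuda"`.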

Dataset preparation

To fine-tune in the Japanese and Chinese domains, all we need is lambdalabs/pokemon-blip-captions in Japanese and Chinese. I translated the captions with the help of DeepL and uploaded them to the Hugging Face dataset hub as svjack/pokemon-blip-captions-en-ja and svjack/pokemon-blip-captions-en-zh.
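The bilingual datasets pair each original BLIP caption with its translation. A minimal sketch of that pairing step is below; the field names mimic the en-zh dataset's column naming, and the sample translation is an illustrative placeholder, not actual DeepL output.

```python
# Sketch: aligning original captions with their translations into
# row dictionaries, as a bilingual caption dataset would store them.
def build_bilingual_records(en_captions, zh_captions):
    if len(en_captions) != len(zh_captions):
        raise ValueError("caption lists must align one-to-one")
    return [
        {"en_text": en, "zh_text": zh}
        for en, zh in zip(en_captions, zh_captions)
    ]

records = build_bilingual_records(["cartoon bird"], ["卡通鸟"])
```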

Fine-tuning pretrained models

The English version, located in train_en_model.py, is simply a copy of train_text_to_image.py with the accelerate command-line launch replaced by a notebook launch via notebook_launcher.
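The script-to-notebook change amounts to wrapping the training entry point with accelerate's notebook_launcher, roughly as sketched below. `train_loop` here stands in for the main training function of train_en_model.py and is an assumption for illustration.

```python
# Sketch: launching an accelerate training function from a notebook
# instead of via `accelerate launch train_text_to_image.py ...`.
def launch_training(train_loop, num_processes=1):
    from accelerate import notebook_launcher  # lazy import

    # Runs train_loop across the given number of processes.
    notebook_launcher(train_loop, args=(), num_processes=num_processes)
```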

The Japanese version, located in train_ja_model.py, replaces the pretrained model with rinnakk/japanese-stable-diffusion.

The Chinese version, located in train_zh_model.py, replaces the pretrained tokenizer and text_encoder with IDEA-CCNL/Taiyi-CLIP-Roberta-102M-Chinese, and feeds the padded logit output of BertForTokenClassification to the downstream network in place of CLIPTextModel.
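The padding mentioned above is needed because the per-token logits from the BERT encoder vary in sequence length, while the downstream cross-attention expects a fixed length (77 tokens for CLIP-style encoders). A minimal sketch of that shape adaptation follows; the function name, the target length and the plain-list representation are assumptions for illustration, not the project's tensor code.

```python
# Sketch: pad (or truncate) per-token feature rows to a fixed
# sequence length so the downstream network sees a constant shape.
def pad_token_features(features, target_len=77, fill=0.0):
    # features: list of per-token vectors (each a list of floats)
    if len(features) >= target_len:
        return features[:target_len]
    width = len(features[0]) if features else 0
    pad_row = [fill] * width
    return features + [pad_row] * (target_len - len(features))

padded = pad_token_features([[0.1, 0.2], [0.3, 0.4]], target_len=4)
```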

To take a look at all outputs, I disabled the safety_checker so that no output is masked during inference.

Generator Results comparison

| Prompt (English) | Japanese prompt | Chinese prompt |
| --- | --- | --- |
| A cartoon character with a potted plant on his head | 鉢植えの植物を頭に載せた漫画のキャラクター | 一个头上戴着盆栽的卡通人物 |
| cartoon bird | 漫画の鳥 | 卡通鸟 |
| blue dragon illustration | ブルードラゴンのイラスト | 蓝色的龙图 |

(For each prompt, the repository shows the images generated by the English, Japanese and Chinese models side by side; the images are omitted here.)

Discussion

The pretrained models in English, Japanese and Chinese were trained for 26,000, 26,000 and 20,000 steps respectively. The Japanese version outperforms the others, and the Chinese version ranks third. A possible interpretation: rinnakk/japanese-stable-diffusion already captures much of the culture and visual features around Pokemon, the English-domain Stable Diffusion fine-tunes favourably on its native language, while IDEA-CCNL/Taiyi-CLIP-Roberta-102M-Chinese, as its model card states, is only a text-feature-finetuned version.

Contact

svjack - [email protected] - [email protected]

Project Link: https://github.com/svjack/Stable-Diffusion-Pokemon

Acknowledgements