A demo of fine-tuning Stable Diffusion on Pokemon-Blip-Captions in English, Japanese and Chinese corpora
Stable Diffusion is a state-of-the-art text-to-image model that generates images from text prompts.
Nowadays, with the help of diffusers, which provides pretrained diffusion models across multiple modalities, anyone can build their own image generator, either conditional (prompt-based) or unconditional.
This project focuses on running the text-to-image example that diffusers provides on lambdalabs/pokemon-blip-captions and migrating the task to the Japanese and Chinese domains,
in both the model and the data dimensions. Comparing the outcomes may give a guideline for fine-tuning Stable Diffusion in different languages.
All scripts are edited versions of the official train_text_to_image.py that make it work in the Japanese and Chinese domains.
Three pretrained models are provided: English, Japanese and Chinese.
Running install.sh will install all dependencies and download all models needed. (Make sure you are logged in to your Hugging Face account and have your access token at hand.) After the download, you can try run_en_model.py, run_ja_model.py and run_zh_model.py yourself.
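The run scripts amount to loading a fine-tuned pipeline and sampling from it. Below is a minimal sketch of that pattern, assuming a standard diffusers checkpoint saved by the training script; the `./en_model` output path and the prompt are illustrative, not the repository's actual values.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a fine-tuned checkpoint saved by the training script.
# "./en_model" is an illustrative output directory, not a fixed path.
pipe = StableDiffusionPipeline.from_pretrained(
    "./en_model",
    torch_dtype=torch.float16,
    safety_checker=None,  # disabled so no output gets blacked out (see below)
)
pipe = pipe.to("cuda")

image = pipe("cartoon bird", num_inference_steps=50).images[0]
image.save("cartoon_bird.png")
```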
To fine-tune in the Japanese and Chinese domains, all we need is lambdalabs/pokemon-blip-captions in Japanese and Chinese. I translated the captions with the help of DeepL and uploaded them to the Hugging Face dataset hub, as svjack/pokemon-blip-captions-en-ja and svjack/pokemon-blip-captions-en-zh.
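The translated sets can be pulled straight from the hub with datasets; this is a minimal sketch, and the split and column names should be checked against the dataset cards rather than assumed:

```python
from datasets import load_dataset

# Each dataset pairs a Pokemon image with its English caption plus the
# DeepL-translated Japanese (or Chinese) caption.
ds_ja = load_dataset("svjack/pokemon-blip-captions-en-ja", split="train")
ds_zh = load_dataset("svjack/pokemon-blip-captions-en-zh", split="train")

print(ds_ja.column_names)  # inspect the caption columns before training
```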
The English version, located in train_en_model.py, is simply a copy of train_text_to_image.py with the accelerate invocation changed from script launch to notebook launch via the notebook_launcher function.
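For context, notebook_launcher takes a plain Python function and spawns the training process(es) for it, so the same loop can run inside a notebook instead of under `accelerate launch`. A minimal sketch, where `train_loop` is a hypothetical stand-in for the script's training function:

```python
from accelerate import notebook_launcher

def train_loop():
    # Hypothetical stand-in for the training function that
    # train_text_to_image.py normally runs under `accelerate launch`.
    ...

# Spawns the process(es) and calls train_loop inside them.
notebook_launcher(train_loop, num_processes=1)
```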
The Japanese version, located in train_ja_model.py, replaces the pretrained model with rinnakk/japanese-stable-diffusion.
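In train_text_to_image.py terms, that swap means loading every component from the Japanese repo's subfolders. A minimal sketch, assuming the repo follows the standard diffusers directory layout; its tokenizer and text encoder classes may differ from stock CLIP, hence the Auto class:

```python
from diffusers import AutoencoderKL, UNet2DConditionModel
from transformers import AutoTokenizer

# Repo name as used in this README; the training script loads each
# component from a subfolder of the pretrained model repository.
model_id = "rinnakk/japanese-stable-diffusion"

tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="tokenizer")
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
```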
The Chinese version, located in train_zh_model.py, replaces the pretrained tokenizer and text_encoder with IDEA-CCNL/Taiyi-CLIP-Roberta-102M-Chinese and feeds the padded logit output of BertForTokenClassification to the downstream network in place of CLIPTextModel's hidden states.
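A minimal sketch of that encoder swap, with loud assumptions: `encode_zh_prompt` is an illustrative helper, `num_labels=768` is assumed so the per-token logits match the UNet's cross-attention width, and 77 mirrors CLIP's context length:

```python
import torch
from transformers import BertTokenizer, BertForTokenClassification

MAX_LEN = 77  # assumed to mirror CLIPTextModel's 77-token context

tokenizer = BertTokenizer.from_pretrained(
    "IDEA-CCNL/Taiyi-CLIP-Roberta-102M-Chinese"
)
text_encoder = BertForTokenClassification.from_pretrained(
    "IDEA-CCNL/Taiyi-CLIP-Roberta-102M-Chinese",
    num_labels=768,  # assumption: match the UNet cross-attention dim
)

def encode_zh_prompt(prompt: str) -> torch.Tensor:
    # Pad/truncate to a fixed length, then hand the per-token logits
    # to the UNet where CLIPTextModel's hidden states would normally go.
    inputs = tokenizer(
        prompt,
        padding="max_length",
        max_length=MAX_LEN,
        truncation=True,
        return_tensors="pt",
    )
    logits = text_encoder(**inputs).logits  # shape (1, MAX_LEN, 768)
    return logits
```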
To be able to look at all outputs, I disable the safety_checker (as in the inference sketch above) so that no output is blacked out during inference.
Prompt | English | Japanese | Chinese
---|---|---|---
A cartoon character with a potted plant on his head<br>鉢植えの植物を頭に載せた漫画のキャラクター<br>一个头上戴着盆栽的卡通人物 | (image) | (image) | (image)
cartoon bird<br>漫画の鳥<br>卡通鸟 | (image) | (image) | (image)
blue dragon illustration<br>ブルードラゴンのイラスト<br>蓝色的龙图 | (image) | (image) | (image)
The pretrained models in English, Japanese and Chinese were trained for 26,000, 26,000 and 20,000 steps respectively. The Japanese model outperforms the others, and the Chinese version ranks third. One interpretation: rinnakk/japanese-stable-diffusion already captures much of the culture and many visual features related to Pokemon; Stable Diffusion in the English domain fine-tunes favourably; and IDEA-CCNL/Taiyi-CLIP-Roberta-102M-Chinese, as its model card states, is only a text-feature fine-tuned version.
svjack - [email protected] - [email protected]
Project Link: https://github.com/svjack/Stable-Diffusion-Pokemon