A Chinese-English bilingual multimodal large model series based on the CPM (Chinese Pretrained Models) base model
VisCPM is a family of open-source large multimodal models that support multimodal conversation (VisCPM-Chat) and text-to-image generation (VisCPM-Paint) in both Chinese and English, achieving state-of-the-art performance among Chinese open-source multimodal models. VisCPM is trained on top of the 10B-parameter large language model CPM-Bee, fusing a visual encoder (Q-Former) and a visual decoder (Diffusion-UNet) to support visual inputs and outputs. Thanks to the strong bilingual capability of CPM-Bee, VisCPM can be pretrained on English multimodal data alone and still generalize well to Chinese, achieving promising Chinese multimodal capabilities.
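As a rough conceptual sketch of how these pieces fit together (the class and method names below are illustrative, not the actual modules in this repository):

```python
# Conceptual sketch only: the real VisCPM modules and interfaces differ.
import torch.nn as nn

class VisCPMSketch(nn.Module):
    def __init__(self, q_former: nn.Module, cpm_bee: nn.Module, unet: nn.Module):
        super().__init__()
        self.q_former = q_former  # visual encoder: image features -> query embeddings
        self.cpm_bee = cpm_bee    # 10B bilingual LLM backbone
        self.unet = unet          # visual decoder: diffusion UNet

    def chat_forward(self, image_feats, text_ids):
        # Image-to-text path (VisCPM-Chat): Q-Former outputs are fed to the
        # language model alongside the text tokens.
        vision_embeds = self.q_former(image_feats)
        return self.cpm_bee(vision_embeds, text_ids)

    def paint_forward(self, text_ids, noisy_latents, timestep):
        # Text-to-image path (VisCPM-Paint): CPM-Bee hidden states condition
        # the diffusion UNet through cross-attention.
        text_states = self.cpm_bee.encode(text_ids)
        return self.unet(noisy_latents, timestep, encoder_hidden_states=text_states)
```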
- 👐 Open-source usage: VisCPM is free for personal and research use. By open-sourcing the VisCPM model family, we hope to promote the development of the open-source community around large multimodal models and related research.
- 🌟 Image and text generation coverage: The VisCPM models provide relatively comprehensive support for multimodal capabilities, covering both multimodal conversation (image-to-text generation) and text-to-image generation.
- 💫 Excellent bilingual performance: Thanks to the excellent bilingual capability of the base language model CPM-Bee, VisCPM achieves outstanding results in both bilingual multimodal conversation and text-to-image generation.
VisCPM-Chat supports bilingual multimodal conversations about images in both Chinese and English. The model uses Q-Former as the visual encoder and CPM-Bee (10B) as the base LLM, combining the visual and language models and optimizing them with the language modeling objective. Training consists of two stages: pretraining and instruction tuning.
- Pretraining: VisCPM-Chat is pretrained on approximately 100M high-quality English image-text pairs, drawn from sources including CC3M, CC12M, COCO, Visual Genome, LAION, etc. In this stage, the language model parameters remain fixed and only the Q-Former parameters are updated, enabling efficient alignment of the vision and language representations.
- Instruction tuning: We use the LLaVA-150K dataset of English multimodal instruction-following data, mixed with corresponding translated Chinese data, to fine-tune the model and align its multimodal capabilities with user intents. In this stage, all model parameters are updated to improve the data efficiency of instruction tuning. Interestingly, we observe that even when fine-tuning only on English instruction data, the model comprehends Chinese questions well but responds only in English. This indicates that the model generalizes well across its multilingual and multimodal capabilities. Incorporating a small amount of translated Chinese data during instruction tuning aligns the model's response language with the user's question language. A sketch of this two-stage parameter schedule appears after this list.
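A minimal sketch of the two-stage parameter schedule, assuming PyTorch modules named `q_former` and `cpm_bee` (hypothetical attribute names, not the repository's actual API):

```python
# Illustrative only: attribute names are assumptions, not VisCPM's real API.

def set_pretraining_mode(model):
    # Stage 1: keep the language model fixed; train only the Q-Former so the
    # visual features align with the frozen LLM's representation space.
    for p in model.cpm_bee.parameters():
        p.requires_grad = False
    for p in model.q_former.parameters():
        p.requires_grad = True

def set_instruction_tuning_mode(model):
    # Stage 2: update all parameters for instruction tuning on LLaVA-150K
    # plus its Chinese translation.
    for p in model.parameters():
        p.requires_grad = True
```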
We evaluate the model on the standard LLaVA English test set and a Chinese test set translated from it. The benchmark examines the model's performance in conversation, detailed description, and complex reasoning, using GPT-4 for scoring (sketched schematically after the table below). VisCPM-Chat achieves the best average performance in Chinese multimodal capabilities, excelling in conversation and complex reasoning, while also demonstrating good English multimodal capabilities. We provide two versions of the model: VisCPM-Chat-balance and VisCPM-Chat-zhplus. The former balances English and Chinese ability, while the latter emphasizes Chinese proficiency. Both use the same data during instruction tuning; VisCPM-Chat-zhplus additionally incorporates 20M cleaned native Chinese image-text pairs and 120M translated Chinese image-text pairs during pretraining.
| | Model | LLM Backbone | Conversation (En) | Detailed Description (En) | Complex Reasoning (En) | Avg (En) | Conversation (Zh) | Detailed Description (Zh) | Complex Reasoning (Zh) | Avg (Zh) |
|---|---|---|---|---|---|---|---|---|---|---|
| English Model | MiniGPT4 | Vicuna-13B | 65.0 | 67.3 | 76.6 | 69.7 | - | - | - | - |
| | InstructBLIP | Vicuna-13B | 81.9 | 68.0 | 91.2 | 80.5 | - | - | - | - |
| | LLaVA | Vicuna-13B | 89.5 | 70.4 | 96.2 | 85.6 | - | - | - | - |
| En-Zh Bilingual Model | mPLUG-Owl | LLaMA-7B | 64.6 | 47.7 | 80.1 | 64.2 | 76.3 | 61.2 | 77.8 | 72.0 |
| | VisualGLM | ChatGLM-6B | 62.4 | 63.0 | 80.6 | 68.7 | 76.6 | 87.8 | 83.6 | 82.7 |
| | Ziya-Visual | Ziya-LLaMA-13B-v1 | 82.7 | 69.9 | 92.1 | 81.7 | 85.0 | 74.7 | 82.4 | 80.8 |
| | VisCPM-Chat-balance | CPMBee-10B | 83.3 | 68.9 | 90.5 | 81.1 | 92.7 | 76.1 | 89.2 | 86.3 |
| | VisCPM-Chat-zhplus | CPMBee-10B | 80.1 | 65.7 | 92.5 | 79.6 | 90.3 | 81.4 | 92.1 | 88.2 |
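The GPT-4 scoring mentioned above follows the LLaVA evaluation protocol. Schematically, it looks like the hedged sketch below; the exact prompts and aggregation rules come from the LLaVA benchmark, not this snippet, and the client usage assumes an API key in the environment:

```python
# Hedged sketch of LLaVA-style GPT-4 scoring; illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge(question: str, reference_answer: str, model_answer: str) -> str:
    prompt = (
        "Evaluate two answers to a question about an image.\n"
        f"Question: {question}\n"
        f"Assistant 1: {reference_answer}\n"
        f"Assistant 2: {model_answer}\n"
        "Rate each assistant's helpfulness, relevance, accuracy, and detail "
        "on a 1-10 scale. Output the two scores on the first line."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```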
VisCPM-Paint supports bilingual text-to-image generation. The model uses CPM-Bee as the text encoder and a UNet as the image decoder, fusing the vision and language models with the diffusion-model objective. During training, the language model parameters remain fixed; the visual decoder is initialized with the Stable Diffusion 2.1 weights and fused with the language model by gradually unfreezing key bridging parameters, as sketched below. The model is trained on the English image-text pairs of LAION-2B.
Similar to VisCPM-Chat, we find that, thanks to the bilingual capability of CPM-Bee, VisCPM-Paint achieves good Chinese text-to-image generation by training on English image-text pairs alone, surpassing Chinese open-source models. Incorporating an additional 20M cleaned native Chinese image-text pairs and 120M translated Chinese image-text pairs further improves its Chinese text-to-image ability. We sample 30,000 images from the standard MSCOCO image generation test set and compute the commonly used FID (Fréchet Inception Distance) metric to assess the quality of generated images (a computation sketch follows the table below). As with the chat models, we provide two versions: VisCPM-Paint-balance and VisCPM-Paint-zhplus. The former balances English and Chinese ability, while the latter emphasizes Chinese proficiency. VisCPM-Paint-balance is trained on English image-text pairs only; VisCPM-Paint-zhplus builds on VisCPM-Paint-balance with the additional 20M native Chinese and 120M translated Chinese image-text pairs.
| Model | Zero-shot FID↓ (English) | Zero-shot FID↓ (Chinese) |
|---|---|---|
| GLIDE | 12.2 | - |
| Make-A-Scene | 11.8 | - |
| DALL·E-2 | 10.4 | - |
| Unidiffuser | 9.7 | - |
| Cogview2 | - | 24.0 |
| Stable Diffusion | 8.6 | - |
| AltDiffusion | 17.2 | 16.1 |
| TaiyiDiffusion | - | 15.6 |
| VisCPM-Paint-balance | 9.5 | 10.9 |
| VisCPM-Paint-zhplus | 9.9 | 9.6 |
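For reference, zero-shot FID compares Inception feature statistics of the generated images against the MSCOCO references. A sketch with torchmetrics (not the exact pipeline behind the table above):

```python
# Sketch: zero-shot FID with torchmetrics; illustrative, not the exact
# evaluation pipeline used for the numbers above.
from torchmetrics.image.fid import FrechetInceptionDistance

def fid_score(real_batches, generated_batches):
    # Batches are uint8 image tensors of shape (N, 3, H, W) in [0, 255].
    fid = FrechetInceptionDistance(feature=2048)
    for batch in real_batches:        # e.g. 30,000 MSCOCO reference images
        fid.update(batch, real=True)
    for batch in generated_batches:   # the model's generations for the same prompts
        fid.update(batch, real=False)
    return float(fid.compute())       # lower is better
```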
- Clone this repository and navigate to the source folder

```bash
git clone https://github.com/OpenBMB/VisCPM.git
cd VisCPM
```
- Create a conda environment

```bash
conda create -n viscpm python=3.10 -y
conda activate viscpm
```
- Install dependencies

```bash
pip install "torch>=1.10"
pip install -r requirements.txt
```
VisCPM currently requires a GPU with more than 20GB of memory. We will soon release more memory-friendly inference methods.
| Model | Description | Download Link |
|---|---|---|
| VisCPM-Chat-balance | Multimodal conversation model with balanced proficiency in both Chinese and English | download |
| VisCPM-Chat-zhplus | Multimodal conversation model with a strong emphasis on Chinese proficiency | download |
| VisCPM-Paint-balance | Text-to-image model with balanced proficiency in both Chinese and English | download |
| VisCPM-Paint-zhplus | Text-to-image model with a strong emphasis on Chinese proficiency | download |
After downloading the checkpoints, refer to the following code to run VisCPM-Chat (replace '/path/to/checkpoint' with the actual path of the downloaded checkpoint).
We can hold a multimodal conversation with VisCPM-Chat in a few lines of code.
```python
from VisCPM import VisCPMChat
from PIL import Image

model_path = '/path/to/checkpoint'
viscpm_chat = VisCPMChat(model_path, image_safety_checker=True)
# We run safety checks on input images by default.

image_path = 'figures/vlu_case1.png'
image = Image.open(image_path).convert("RGB")

question = '如果用一句中国唐代的著名诗人"李白"的古诗来描述这幅图像,你能想到什么?'  # If you were to describe this image with a line from the famous Tang Dynasty poet "Li Bai", what would come to mind?
answer, _, _ = viscpm_chat.chat(image, question)

print(answer)
```
We obtain the following result:

```
“黄河之水天上来,奔流到海不复回。” 李白的这句诗可以用来形容这幅图片中汹涌澎湃、波涛汹涌的景象:一条湍急的河流从山上奔腾而下,形成了一幅令人叹为观止的画面,展示出大自然的力量和雄伟壮丽。
```

"The water of the Yellow River comes from the sky, rushing to the sea, never to return." This line of Li Bai's poetry aptly describes the surging, turbulent scene in this picture: a rushing river pours down from the mountain, forming a breathtaking sight that shows the power and majesty of nature.
The code can easily be extended to support multi-turn multimodal conversations.
```python
from VisCPM import VisCPMChat
from PIL import Image

model_path = '/path/to/checkpoint'
viscpm_chat = VisCPMChat(model_path, image_safety_checker=True)
# We run safety checks on input images by default.

image_path = 'figures/vlu_case2.jpeg'
image = Image.open(image_path).convert("RGB")

question = '这幅图像是在哪个节日拍摄的?'  # At which festival was this image taken?
answer, context, vision_hidden_states = viscpm_chat.chat(image, question)

# pass the history context of the multi-turn conversation
question = '你能用什么古诗描述这幅画?'  # Which ancient poem could you use to describe this painting?
answer, context, _ = viscpm_chat.chat(image, question, context, vision_hidden_states=vision_hidden_states)

print(context)
```
We obtain the following result:

```
User: 这幅图像是在哪个节日拍摄的?
AI: 这幅图像是在中秋节拍摄的, 也就是中国传统节日中的月圆之夜。
User: 你能用什么古诗描述这幅画?
AI: “明月几时有,把酒问青天。” 这是苏轼的《水调歌头》中的一句诗,用来形容这幅图片再贴切不过了:在中秋之夜,月亮高高地挂在天空中,一座古老的建筑沐浴着月光,营造出一种宁静祥和的气氛。
```

English translation:

```
User: At which festival was this image taken?
AI: This image was taken during the Mid-Autumn Festival, the full-moon night among traditional Chinese festivals.
User: Which ancient poem could you use to describe this painting?
AI: "When did the bright moon first appear? Raising my cup, I ask the blue sky." This line from Su Shi's "Shui Diao Ge Tou" could not be more fitting for this picture: on the night of the Mid-Autumn Festival, the moon hangs high in the sky while an ancient building is bathed in moonlight, creating a peaceful and serene atmosphere.
```
After downloading the checkpoints, refer to the following code to run VisCPM-Paint (replace '/path/to/checkpoint' with the actual path of the downloaded checkpoint).
The input prompts for the images above can be found in prompts.txt.
```bash
# If your GPU has less than 40GB of memory, you can set the following environment
# variable; memory usage then drops to about 22GB, at the cost of slower inference.
export CUDA_MEM_SAVE=True
```
```python
from VisCPM import VisCPMPaint

painter = VisCPMPaint('/path/to/checkpoint', image_safety_checker=True, prompt_safety_checker=True, add_ranker=True)
# We run safety checks on the input text and output images by default,
# and image reranking is also enabled by default.

image = painter.generate('人闲桂花落,月静春山空')
# "The sweet-scented osmanthus falls while people are at ease; the moon is
# quiet and the spring mountains are empty." Corresponds to the second picture
# in the first row of the figure above.
image.save('/data/test.png')
```
In our code, safety checks on both the input text and the output images are enabled by default.

Additionally, reranking of generated images is enabled by default: for a given input, we generate four images simultaneously and return the one with the highest relevance score to the input, as evaluated by Chinese-Clip. Reranking stabilizes the quality of the generated images but also slows generation; if you prefer to obtain results quickly, you can disable it, as sketched below.
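The reranking step can be sketched as follows, using the public Chinese-CLIP checkpoint on Hugging Face as a stand-in for the repository's actual scorer (the model name and scoring details are assumptions):

```python
# Illustrative reranker: score each candidate image against the prompt with
# Chinese-CLIP and keep the best one. Not the repository's exact implementation.
import torch
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

clip = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16")
processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16")

def rerank(prompt, candidates):
    # candidates: list of PIL images, e.g. the four images generated per prompt
    inputs = processor(text=[prompt], images=candidates, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = clip(**inputs).logits_per_image.squeeze(-1)  # one score per image
    return candidates[int(scores.argmax())]
```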
If you provide English text as input for image generation, it is advisable to disable both the reranking mechanism and the input text checker, since the scoring model used for reranking and the safety checker for input prompts are trained specifically on Chinese text.
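For example, both mechanisms can be turned off through the constructor flags shown earlier (the flag values below are the only change from the previous snippet; the prompt string is a placeholder):

```python
from VisCPM import VisCPMPaint

# Disable the Chinese-specific prompt checker and reranker for English prompts;
# image safety checking can stay on.
painter = VisCPMPaint('/path/to/checkpoint',
                      image_safety_checker=True,
                      prompt_safety_checker=False,
                      add_ranker=False)
image = painter.generate('an English prompt describing the desired image')
image.save('test_en.png')
```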
As a multimodal model, VisCPM generates content by learning from a vast amount of public image and text data, but it cannot comprehend or express personal opinions or value judgments. Any content generated by VisCPM does not represent the viewpoints or positions of the model developers. Therefore, when using content generated by VisCPM, users are fully responsible for evaluating and verifying it on their own.
To prevent the model from being misused to process or generate content that violates widely accepted societal values, we have incorporated a content safety module into VisCPM. When the safety module detects image or text content that does not comply with safety regulations during processing or generation, it intercepts the corresponding content. We apply safety checks to the input images accepted by VisCPM-Chat and to the input text and output images of VisCPM-Paint. The safety module still has room for improvement and may produce both false positives and false negatives; we will continue to enhance it in future updates.
VisCPM is governed by the GML License and permits individual and research use. If you intend to use the model for commercial purposes, please contact [email protected] to negotiate commercial licensing.

The CPM-Bee base model, governed by the General Model License (GML), permits commercial use. If you intend to use the model for commercial purposes, please contact [email protected] to obtain the certificate of authorization.
VisCPM is still under continuous improvement, and we will further optimize it in the following aspects:
- Integrate into 🤗 Hugging Face
- Enhance the safety module
- Support rapid web deployment
- Enable model quantization
- Support model fine-tuning
This project is developed by the following institutions: THUNLP, ModelBest, and Zhihu.
```bibtex
@misc{thu-2023-viscpm,
  author = {THUNLP, ModelBest, Zhihu},
  title = {VisCPM: Chinese-English Bilingual Multi-modal Large Model Series},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/OpenBMB/VisCPM}}
}
```