Commit 3f326f7 (1 parent: 3b310ac)
Showing 9 changed files with 168 additions and 25 deletions.
@@ -22,3 +22,4 @@ data/log*
*.zip
/results_backup/
/results_on_hard/
/docs/tex/
@@ -1,51 +1,91 @@
# 🖼️ MULTI-Benchmark
# 🖼️ MULTI-Benchmark: Multimodal Understanding Leaderboard with Text and Images

<div align="center">

🌐 [Website](https://opendfm.github.io/MULTI-Benchmark/)
![MULTI](./docs/static/images/overview.png)

📃 [Paper](https://arxiv.org/abs/2402.03173/)
🌐 [Website](https://OpenDFM.github.io/MULTI-Benchmark/) | 📃 [Paper](https://arxiv.org/abs/2402.03173/) | 🤗 [Dataset](https://huggingface.co/datasets/OpenDFM/MULTI-Benchmark) | 🎯 [Leaderboard]() (Coming Soon)

🤗 [Dataset](https://opendfm.github.io/MULTI-Benchmark/) (Coming Soon)

🎯 [Leaderboard](https://opendfm.github.io/MULTI-Benchmark/) (Coming Soon)
[简体中文](./README_zh.md) | English

</div>

# MULTI-Benchmark

[This](https://OpenDFM.github.io/MULTI-Benchmark/) is our official page.
## 🔥 News

## 🔥 News

- **[Coming Soon]** We will soon release our first official version of the dataset and leaderboard.
- **[Coming Soon]** We will release the official evaluation platform.
- **[2024.2.19]** We release the [HuggingFace Page](https://huggingface.co/datasets/OpenDFM/MULTI-Benchmark/).
- **[2024.2.6]** We publish our [paper](https://arxiv.org/abs/2402.03173/) on arXiv.
- **[2023.12.7]** We release the [code](./eval) of our benchmark evaluation.
- **[2023.12.5]** We release the [GitHub Page](https://opendfm.github.io/MULTI-Benchmark/).
- **[2023.12.5]** We release the [GitHub Page](https://OpenDFM.github.io/MULTI-Benchmark/).

## 📖 Overview

We introduce **MULTI**: a multi-level, multi-disciplinary, and multi-type cross-modal test benchmark, aimed at evaluating the performance of multimodal generative large models under different conditions and scenarios. We collected and annotated more than 18K questions from exams, quizzes, textbooks, websites and other resources, most of which underwent at least two rounds of human annotation and checking, and three rounds of script cleaning. Some questions were manually adapted to make them more suitable for evaluating the comprehensive ability of the model. These questions involve four educational levels: junior high school, high school, college and social exams, covering Chinese, mathematics, English, physics, chemistry, biology, history, geography, politics, information technology, driving test and other disciplines and fields, and include single-choice, multiple-choice, fill-in-the-blank (with a given range or fully open), and open-ended questions.
Rapid progress in multimodal large language models (MLLMs) highlights the need to introduce challenging yet realistic benchmarks to the academic community, while existing benchmarks primarily focus on understanding simple natural images and short context. In this paper, we present ***MULTI***, a cutting-edge benchmark for evaluating MLLMs on understanding complex tables and images, and reasoning with long context. MULTI provides multimodal inputs and requires responses that are either precise or open-ended, reflecting real-life examination styles. **MULTI** includes over 18,000 questions and challenges MLLMs with a variety of tasks, ranging from formula derivation to image detail analysis and cross-modality reasoning. We also introduce ***MULTI-Elite***, a 500-question selected hard subset, and ***MULTI-Extend***, with more than 4,500 external knowledge context pieces. Our evaluation indicates significant potential for MLLM advancement, with GPT-4V achieving a **63.7%** accuracy rate on **MULTI**, in contrast to other MLLMs scoring between **28.5%** and **55.3%**. **MULTI** serves not only as a robust evaluation platform but also paves the way for the development of expert-level AI.

## 🏆 Leaderboard

| Modality | Model | Version | Overall | MULTI-Elite |
|:--------:|:-------------:|----------------------------|:-------:|:-----------:|
| 🖼️ | GPT-4V | gpt-4-vision-preview | 63.7 | 14.0 |
| 🖼️ | Yi-VL | Yi-34B-Chat | 55.3 | 26.2 |
| 🖼️ | Gemini Vision | gemini-pro-vision | 53.7 | 12.4 |
| 📃 | Gemini | gemini-pro | 52.2 | 10.5 |
| 📃 | GPT-4 | gpt-4-1106-preview | 50.2 | 5.8 |
| 📃 | DFM-2.0 | dfm-2.0-70b-preview | 49.7 | 18.0 |
| 🖼️ | InternVL | InternVL-Chat-Chinese-V1.1 | 44.9 | 20.7 |
| 🖼️ | Qwen-VL | Qwen-VL-Chat | 39.0 | 10.5 |
| 📃 | ChatGPT | gpt-3.5-turbo-1106 | 35.9 | 4.7 |
| 🖼️ | VisCPM | VisCPM-Chat | 33.4 | 13.0 |
| 📃 | MOSS | moss-moon-003-sft | 32.6 | 13.1 |
| 🖼️ | VisualGLM | visualglm-6b | 31.1 | 12.8 |
| 🖼️ | Chinese-LLaVA | Chinese-LLaVA-Cllama2 | 28.5 | 12.3 |

For more details, please visit our [leaderboard]() (Coming Soon).

## ⏬ Download

You can download the dataset from the [HuggingFace Page](https://huggingface.co/datasets/OpenDFM/MULTI-Benchmark). The current [version](https://huggingface.co/datasets/OpenDFM/MULTI-Benchmark/blob/main/MULTI_v1.2.2_20240212_release.zip) is `v1.2.2`.

We manually selected 500 questions to form a difficult subset, which is used to probe the limits of model performance. These questions often contain multiple images and formulas, test the model's comprehensive understanding of multiple images, and require complex and rigorous logical reasoning. Performance on this subset is displayed separately on the leaderboard.
```
wget https://huggingface.co/datasets/OpenDFM/MULTI-Benchmark/resolve/main/MULTI_v1.2.2_20240212_release.zip
unzip MULTI_v1.2.2_20240212_release.zip -d ./data
```

## 📝 How to Evaluate

This will be updated soon. Please refer to the [legacy README](./eval/models/README.md) for now.

We tested GPT-3.5 and open-source multimodal large models $^\dagger$, and the results show that even the advanced GPT-3.5 achieves only **43.28%** accuracy, leaving huge room for improvement. We believe that MULTI will motivate the community to build the next generation of multimodal foundation models, to achieve expert-level artificial general intelligence.
## 📮 How to Submit

$^\dagger$ Based on the `v0.3.0-20231115` version of the data, tested on the three question types SC/MC/FIB.
You first need to prepare a UTF-8 encoded JSON file with the following format:

## ⏩ How can I early access MULTI 🤔?
```json
{
    "czsx_0_0": {
        "question_id": "czsx_0_0",
        "question_image_number": 1,
        "image_list": [...],
        "input_message": ...,
        "prediction": "C"
    },
    ...
}
```
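
For illustration, a submission file in this format could be assembled with a short script along the following lines. This is a minimal sketch, not part of the official tooling; the question data, file names, and predictions shown here are hypothetical placeholders.

```python
import json

# Minimal sketch with hypothetical data: assemble per-question predictions
# into the UTF-8 encoded JSON submission format shown above.

# Stand-in for your own inference results: question_id -> predicted answer.
predictions = {"czsx_0_0": "C"}

# Stand-in for the loaded benchmark questions (normally read from the dataset files).
questions = {
    "czsx_0_0": {
        "image_list": ["path/to/image_0.png"],
        "input_message": "<full prompt sent to the model>",
    }
}

submission = {
    qid: {
        "question_id": qid,
        "question_image_number": len(q["image_list"]),
        "image_list": q["image_list"],
        "input_message": q["input_message"],
        "prediction": predictions[qid],
    }
    for qid, q in questions.items()
}

# Write with UTF-8 encoding, as required by the submission format.
with open("prediction.json", "w", encoding="utf-8") as f:
    json.dump(submission, f, ensure_ascii=False, indent=2)
```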
If you evaluate the model with our official code, you can simply zip the experiment result folder `./results/EXPERIMENT_NAME`.
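
Any standard zip tool works for this step; the snippet below is a small Python equivalent, with `EXPERIMENT_NAME` as a placeholder for your own experiment folder.

```python
import shutil

# Hypothetical experiment name; replace with your own results folder under ./results/.
experiment_name = "EXPERIMENT_NAME"

# Produces EXPERIMENT_NAME.zip containing the contents of ./results/EXPERIMENT_NAME.
shutil.make_archive(experiment_name, "zip", root_dir=f"./results/{experiment_name}")
```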

We will release our first official version soon (within this week).
Then, you can submit your result to our [evaluation platform](https://wj.sjtu.edu.cn/q/89UmRAJn) (Coming Soon).

Please feel free to contact us (`[email protected]` and `[email protected]`) and keep in touch.
Thank you for being so interested in the **MULTI** dataset! As the automated evaluation platform is not yet online, please fill in [this questionnaire](https://wj.sjtu.edu.cn/q/89UmRAJn) to get the evaluation results. Your information will be kept strictly confidential, so please feel free to fill it out. 🤗

You are also welcome to open a pull request and contribute your code to our evaluation codebase. We will be very grateful for your contribution!

## 📑 Citation

If you find our work useful, please consider citing us!
If you find our work useful, please cite us!

```
@misc{zhu2024multi,
    title={MULTI: Multimodal Understanding Leaderboard with Text and Images},
    title={{MULTI}: Multimodal Understanding Leaderboard with Text and Images},
    author={Zichen Zhu and Yang Xu and Lu Chen and Jingkai Yang and Yichuan Ma and Yiming Sun and Hailin Wen and Jiaqi Liu and Jinyu Cai and Yingzi Ma and Situo Zhang and Zihan Zhao and Liangtai Sun and Kai Yu},
    year={2024},
    eprint={2402.03173},

@@ -56,5 +96,4 @@ If you find our work useful, please consider citing us!

## 📧 Contact Us

If you would like early access to our benchmark or have any questions, please feel free to contact: `[email protected]` and `[email protected]`

If you have any questions, please feel free to contact us via email: `[email protected]` and `[email protected]`
@@ -0,0 +1,99 @@
# 🖼️ MULTI-Benchmark: Multimodal Understanding Leaderboard with Text and Images

<div align="center">

![MULTI](./docs/static/images/overview.png)

🌐 [Website](https://OpenDFM.github.io/MULTI-Benchmark/) | 📃 [Paper](https://arxiv.org/abs/2402.03173/) | 🤗 [Dataset](https://huggingface.co/datasets/OpenDFM/MULTI-Benchmark) | 🎯 [Leaderboard]() (Coming Soon)

Simplified Chinese | [English](./README.md)

</div>

## 🔥 News

- **[Coming Soon]** We will release the official evaluation platform.
- **[2024.2.19]** We released the [HuggingFace page](https://huggingface.co/datasets/OpenDFM/MULTI-Benchmark/).
- **[2024.2.6]** We published our [paper](https://arxiv.org/abs/2402.03173/) on arXiv.
- **[2023.12.7]** We released the evaluation [code](./eval) for our benchmark.
- **[2023.12.5]** We released the [GitHub page](https://OpenDFM.github.io/MULTI-Benchmark/).

## 📖 Overview

Rapid progress in multimodal large language models (MLLMs) makes it especially important to propose benchmarks that are challenging and reflect realistic scenarios, while existing benchmarks mainly focus on understanding simple natural images and short text. In this paper we introduce ***MULTI***, a cutting-edge benchmark for evaluating MLLMs on understanding complex tables and images and reasoning over long context. MULTI provides multimodal inputs and requires answers that are either precise or open-ended, reflecting real-life examination styles. **MULTI** includes more than 18,000 questions and challenges MLLMs with a variety of tasks, from formula derivation to image detail analysis and cross-modality reasoning. We also introduce ***MULTI-Elite***, a carefully selected hard subset of 500 questions, and ***MULTI-Extend***, which contains more than 4,500 external knowledge context pieces. Our evaluation shows significant potential for MLLM improvement: GPT-4V reaches an accuracy of **63.7%** on **MULTI**, while other MLLMs score between **28.5%** and **55.3%**. **MULTI** serves not only as a robust evaluation platform but also points the way toward expert-level AI.

## 🏆 Leaderboard

| Modality | Model | Version | Overall | MULTI-Elite |
|:--------:|:-------------:|----------------------------|:-------:|:-----------:|
| 🖼️ | GPT-4V | gpt-4-vision-preview | 63.7 | 14.0 |
| 🖼️ | Yi-VL | Yi-34B-Chat | 55.3 | 26.2 |
| 🖼️ | Gemini Vision | gemini-pro-vision | 53.7 | 12.4 |
| 📃 | Gemini | gemini-pro | 52.2 | 10.5 |
| 📃 | GPT-4 | gpt-4-1106-preview | 50.2 | 5.8 |
| 📃 | DFM-2.0 | dfm-2.0-70b-preview | 49.7 | 18.0 |
| 🖼️ | InternVL | InternVL-Chat-Chinese-V1.1 | 44.9 | 20.7 |
| 🖼️ | Qwen-VL | Qwen-VL-Chat | 39.0 | 10.5 |
| 📃 | ChatGPT | gpt-3.5-turbo-1106 | 35.9 | 4.7 |
| 🖼️ | VisCPM | VisCPM-Chat | 33.4 | 13.0 |
| 📃 | MOSS | moss-moon-003-sft | 32.6 | 13.1 |
| 🖼️ | VisualGLM | visualglm-6b | 31.1 | 12.8 |
| 🖼️ | Chinese-LLaVA | Chinese-LLaVA-Cllama2 | 28.5 | 12.3 |

For more details, please visit our [leaderboard]() (Coming Soon).

## ⏬ Download

You can download the dataset from the [HuggingFace page](https://huggingface.co/datasets/OpenDFM/MULTI-Benchmark). The latest [version](https://huggingface.co/datasets/OpenDFM/MULTI-Benchmark/blob/main/MULTI_v1.2.2_20240212_release.zip) is `v1.2.2`.

```
wget https://huggingface.co/datasets/OpenDFM/MULTI-Benchmark/resolve/main/MULTI_v1.2.2_20240212_release.zip
unzip MULTI_v1.2.2_20240212_release.zip -d ./data
```

## 📝 How to Evaluate

This section will be updated soon. For now, please refer to the [legacy README](./eval/models/README.md).

## 📮 How to Submit

You first need to prepare a UTF-8 encoded JSON file in the following format:

```json
{
    "czsx_0_0": {
        "question_id": "czsx_0_0",
        "question_image_number": 1,
        "image_list": [...],
        "input_message": ...,
        "prediction": "C"
    },
    ...
}
```
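
As a quick sanity check before submitting, a small script along these lines can verify that the file parses and that every entry carries the expected fields. This is a hypothetical sketch, not part of the official tooling, and the file name is a placeholder.

```python
import json

# Quick sanity check (hypothetical file name): verify the submission file parses as
# UTF-8 JSON and that every entry carries the fields expected by the format above.
REQUIRED_KEYS = {"question_id", "question_image_number", "image_list", "input_message", "prediction"}

with open("prediction.json", encoding="utf-8") as f:
    submission = json.load(f)

for qid, entry in submission.items():
    missing = REQUIRED_KEYS - entry.keys()
    assert not missing, f"{qid} is missing fields: {missing}"
    assert entry["question_id"] == qid, f"key {qid} does not match question_id {entry['question_id']}"

print(f"OK: {len(submission)} entries look well-formed.")
```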
If you evaluate your model with our official code, you can simply zip the experiment result folder `./results/EXPERIMENT_NAME`.

Then, you can submit your results to our [evaluation platform]() (Coming Soon).

Thank you for your interest in the MULTI dataset! As the automated evaluation platform is not yet online, please fill in [this questionnaire](https://wj.sjtu.edu.cn/q/89UmRAJn) to get the evaluation results. Your personal information will be kept strictly confidential, so please feel free to fill it out. 🤗

You are also welcome to open a pull request and contribute your code to our evaluation codebase. We will be very grateful for your contribution!

## 📑 Citation

If you find our work useful, please cite us!

```
@misc{zhu2024multi,
    title={{MULTI}: Multimodal Understanding Leaderboard with Text and Images},
    author={Zichen Zhu and Yang Xu and Lu Chen and Jingkai Yang and Yichuan Ma and Yiming Sun and Hailin Wen and Jiaqi Liu and Jinyu Cai and Yingzi Ma and Situo Zhang and Zihan Zhao and Liangtai Sun and Kai Yu},
    year={2024},
    eprint={2402.03173},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```

## 📧 Contact Us

If you have any questions, please feel free to contact us via email: `[email protected]` and `[email protected]`
@@ -1,3 +1,5 @@
# This README will be deprecated soon

# Overview

This document will guide you through fetching our benchmark.
(Three of the changed files could not be displayed.)