diff --git a/.gitignore b/.gitignore
index 3e7bf52..d95d84d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -22,3 +22,4 @@ data/log*
 *.zip
 /results_backup/
 /results_on_hard/
+/docs/tex/
diff --git a/README.md b/README.md
index 4e3f702..2170b52 100644
--- a/README.md
+++ b/README.md
@@ -1,51 +1,91 @@
-# 🖼️ MULTI-Benchmark
+# 🖼️ MULTI-Benchmark: Multimodal Understanding Leaderboard with Text and Images
-🌐 [Website](https://opendfm.github.io/MULTI-Benchmark/)
+![MULTI](./docs/static/images/overview.png)
-📃 [Paper](https://arxiv.org/abs/2402.03173/)
+🌐 [Website](https://OpenDFM.github.io/MULTI-Benchmark/) | 📃 [Paper](https://arxiv.org/abs/2402.03173/) | 🤗 [Dataset](https://huggingface.co/datasets/OpenDFM/MULTI-Benchmark) | 🎯 [Leaderboard]() (Coming Soon)
-🤗 [Dataset](https://opendfm.github.io/MULTI-Benchmark/) (Coming Soon)
-
-🎯 [Leaderboard](https://opendfm.github.io/MULTI-Benchmark/) (Coming Soon)
+[简体中文](./README_zh.md) | English
-# MULTI-Benchmark
-
-[This](https://OpenDFM.github.io/MULTI-Benchmark/) is our official page.
+## 🔥 News
-## 🔥 News
-
-- **[Coming Soon]** We will soon release our first offical verison of dataset and leaderboard.
+- **[Coming Soon]** We will release the official evaluation platform.
+- **[2024.2.19]** We release the [HuggingFace Page](https://huggingface.co/datasets/OpenDFM/MULTI-Benchmark/).
 - **[2024.2.6]** We publish our [paper](https://arxiv.org/abs/2402.03173/) on arXiv.
 - **[2023.12.7]** We release the [code](./eval) of our benchmark evaluation.
-- **[2023.12.5]** We release the [GitHub Page](https://opendfm.github.io/MULTI-Benchmark/).
+- **[2023.12.5]** We release the [GitHub Page](https://OpenDFM.github.io/MULTI-Benchmark/).
 ## 📖 Overview
-We introduce **MULTI**: a multi-level, multi-disciplinary, and multi-type cross-modal test benchmark, aimed at evaluating the performance of multimodal generative large models under different conditions and scenarios. We collected and annotated more than 18K questions from exams,quizzes, textbooks, websites and other resources, most of which underwent at least two rounds of human annotation and checking, and three rounds of script cleaning. Some questions were manually adapted to make them more suitable for evaluating the comprehensive ability of the model. These questions involve four educational levels: junior high school, high school, college and social exams, covering Chinese, mathematics, English, physics, chemistry, biology, history, geography, politics, information technology, driving test and other disciplines and fields, including single choice, multiple choice, fill in the blank (given range and fully open), and open-ended discussions.
+Rapid progress in multimodal large language models (MLLMs) highlights the need to introduce challenging yet realistic benchmarks to the academic community, while existing benchmarks primarily focus on understanding simple natural images and short contexts. In this paper, we present ***MULTI***, a cutting-edge benchmark for evaluating MLLMs on understanding complex tables and images and on reasoning with long context. MULTI provides multimodal inputs and requires responses that are either precise or open-ended, reflecting real-life examination styles. **MULTI** includes over 18,000 questions and challenges MLLMs with a variety of tasks, ranging from formula derivation to image detail analysis and cross-modality reasoning. We also introduce ***MULTI-Elite***, a hard subset of 500 carefully selected questions, and ***MULTI-Extend***, a collection of more than 4,500 external knowledge context pieces. Our evaluation indicates significant potential for MLLM advancement, with GPT-4V achieving a **63.7%** accuracy rate on **MULTI**, in contrast to other MLLMs scoring between **28.5%** and **55.3%**. **MULTI** serves not only as a robust evaluation platform but also paves the way for the development of expert-level AI.
+
+## 🏆 Leaderboard
+
+| Modality | Model | Version | Overall | MULTI-Elite |
+|:--------:|:-------------:| -------------------------- |:-------:|:-----------:|
+| 🖼️ | GPT-4V | gpt-4-vision-preview | 63.7 | 14.0 |
+| 🖼️ | Yi-VL | Yi-34B-Chat | 55.3 | 26.2 |
+| 🖼️ | Gemini Vision | gemini-pro-vision | 53.7 | 12.4 |
+| 📃 | Gemini | gemini-pro | 52.2 | 10.5 |
+| 📃 | GPT-4 | gpt-4-1106-preview | 50.2 | 5.8 |
+| 📃 | DFM-2.0 | dfm-2.0-70b-preview | 49.7 | 18.0 |
+| 🖼️ | InternVL | InternVL-Chat-Chinese-V1.1 | 44.9 | 20.7 |
+| 🖼️ | Qwen-VL | Qwen-VL-Chat | 39.0 | 10.5 |
+| 📃 | ChatGPT | gpt-3.5-turbo-1106 | 35.9 | 4.7 |
+| 🖼️ | VisCPM | VisCPM-Chat | 33.4 | 13.0 |
+| 📃 | MOSS | moss-moon-003-sft | 32.6 | 13.1 |
+| 🖼️ | VisualGLM | visualglm-6b | 31.1 | 12.8 |
+| 🖼️ | Chinese-LLaVA | Chinese-LLaVA-Cllama2 | 28.5 | 12.3 |
+
+For more details, please visit our [leaderboard]() (Coming Soon).
+
+## ⏬ Download
+
+You can download the dataset from the [HuggingFace Page](https://huggingface.co/datasets/OpenDFM/MULTI-Benchmark). The current [version](https://huggingface.co/datasets/OpenDFM/MULTI-Benchmark/blob/main/MULTI_v1.2.2_20240212_release.zip) is `v1.2.2`.
-We manually selected 500 questions to form a difficult subset, which is used to evaluate the model's extreme performance. These questions often contain multiple images and formulas, test the model's comprehensive understanding of multiple images, and require complex and rigorous logical reasoning. The performance of this part of the data will be displayed separately on the leaderboard.
+```
+wget https://huggingface.co/datasets/OpenDFM/MULTI-Benchmark/resolve/main/MULTI_v1.2.2_20240212_release.zip
+unzip MULTI_v1.2.2_20240212_release.zip -d ./data
+```
+
+## 📝 How to Evaluate
+
+This will be updated soon. Please refer to the [legacy README](./eval/models/README.md) for now.
-We tested on GPT-3.5 and open-source multimodal large models $^\dagger$ , and the results show that even the advanced GPT-3.5 only achieved **43.28%** accuracy, showing a huge room for improvement. We believe that MULTI will motivate the community to build the next generation of multimodal foundation models, to achieve expert-level artificial general intelligence.
+## 📮 How to Submit
-$^\dagger$ Based on `v0.3.0-20231115` version of the data, tested on SC/MC/FIB three question types.
+You need to first prepare a UTF-8 encoded JSON file with the following format:
-## ⏩ How can I early access MULTI 🤔?
+```json
+{
+    "czsx_0_0": {
+        "question_id": "czsx_0_0",
+        "question_image_number": 1,
+        "image_list": [...],
+        "input_message": ...,
+        "prediction": "C"
+    },
+    ...
+}
+```
+If you evaluate the model with our official code, you can simply zip the experiment result folder `./results/EXPERIMENT_NAME`.
-We will release our first official version soon. (Within this week)
+Then, you can submit your result to our [evaluation platform](https://wj.sjtu.edu.cn/q/89UmRAJn) (Coming Soon).
-Please feel free to contact (`JamesZhutheThird@sjtu.edu.cn` and `xuyang0112@sjtu.edu.cn`) and keep in touch with us.
+Thank you for your interest in the **MULTI** dataset! As the automated evaluation platform is not yet online, please fill in [this questionnaire](https://wj.sjtu.edu.cn/q/89UmRAJn) to get your evaluation results. Your information will be kept strictly confidential, so please feel free to fill it out. 🤗
+
+You are also welcome to open a pull request and contribute your code to our evaluation codebase. We will be very grateful for your contribution!
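+
+As a minimal illustration of the format above (the field values, the `predictions` dict, and the output file name below are hypothetical examples, not output of the official tooling), such a file can be written like this:
+
+```python
+import json
+
+# Hypothetical example: map each question_id to the fields required by the
+# submission format shown above.
+predictions = {
+    "czsx_0_0": {
+        "question_id": "czsx_0_0",
+        "question_image_number": 1,
+        "image_list": ["example_image_0.png"],   # placeholder file name
+        "input_message": "example prompt text",  # placeholder prompt
+        "prediction": "C",
+    },
+}
+
+# Write UTF-8 encoded JSON, keeping non-ASCII characters readable.
+with open("submission.json", "w", encoding="utf-8") as f:  # hypothetical output file name
+    json.dump(predictions, f, ensure_ascii=False, indent=4)
+```
+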
 ## 📑 Citation
-If you find our work useful, please consider citing us!
+If you find our work useful, please cite us!
 ```
 @misc{zhu2024multi,
-    title={MULTI: Multimodal Understanding Leaderboard with Text and Images},
+    title={{MULTI}: Multimodal Understanding Leaderboard with Text and Images},
     author={Zichen Zhu and Yang Xu and Lu Chen and Jingkai Yang and Yichuan Ma and Yiming Sun and Hailin Wen and Jiaqi Liu and Jinyu Cai and Yingzi Ma and Situo Zhang and Zihan Zhao and Liangtai Sun and Kai Yu},
     year={2024},
     eprint={2402.03173},
@@ -56,5 +96,4 @@ If you find our work useful, please consider citing us!
 ## 📧 Contact Us
-If you would like to early access our benchmark or have any questions, please feel free to contact: `JamesZhutheThird@sjtu.edu.cn` and `xuyang0112@sjtu.edu.cn`
-
+If you have any questions, please feel free to contact us via email at `JamesZhutheThird@sjtu.edu.cn` and `xuyang0112@sjtu.edu.cn`.
diff --git a/README_zh.md b/README_zh.md
new file mode 100644
index 0000000..12bf382
--- /dev/null
+++ b/README_zh.md
@@ -0,0 +1,99 @@
+# 🖼️ MULTI-Benchmark: Multimodal Understanding Leaderboard with Text and Images
+
+
+![MULTI](./docs/static/images/overview.png)
+
+🌐 [网站](https://OpenDFM.github.io/MULTI-Benchmark/) | 📃 [论文](https://arxiv.org/abs/2402.03173/) | 🤗 [数据](https://huggingface.co/datasets/OpenDFM/MULTI-Benchmark) | 🎯 [榜单]() (即将上线)
+
+简体中文 | [English](./README.md)
+
+
+## 🔥 新闻
+
+- **[即将上线]** 我们将发布官方评估平台。
+- **[2024.2.19]** 我们发布了[HuggingFace页面](https://huggingface.co/datasets/OpenDFM/MULTI-Benchmark/)。
+- **[2024.2.6]** 我们在arXiv上发布了我们的[论文](https://arxiv.org/abs/2402.03173/)。
+- **[2023.12.7]** 我们发布了我们的基准评估[代码](./eval)。
+- **[2023.12.5]** 我们发布了[GitHub页面](https://OpenDFM.github.io/MULTI-Benchmark/)。
+
+## 📖 介绍
+
+在多模态大型语言模型(MLLMs)迅速进步的背景下,提出具有挑战性且符合现实场景的基准测试变得尤为重要,而现有的基准测试主要关注于理解简单的自然图像和短文本。在本文中,我们介绍了***MULTI***,作为一个前沿的基准测试,用于评估MLLMs在理解复杂的表格和图像、以及进行长文本推理的能力。MULTI提供多模态输入,并要求回答是精确的或开放式的,反映了现实生活中的考试风格。**MULTI**包括超过18,000个问题,挑战MLLMs进行多种任务,从公式推导到图像细节分析和跨模态推理。我们还引入了***MULTI-Elite***,一个精心挑选的包含500个问题的难题子集,以及***MULTI-Extend***,包含超过4,500个外部知识上下文。我们的评估显示了MLLMs进步的巨大潜力,GPT-4V在**MULTI**上的准确率达到了**63.7%**,而其他MLLMs的得分介于**28.5%**和**55.3%**之间。**MULTI**不仅作为一个稳健的评估平台,也为专家级AI的发展指明了道路。
+
+## 🏆 Leaderboard
+
+| 模态 | 模型 | 版本 | 总体 | MULTI-Elite |
+|:----:|:-------------:| -------------------------- |:----:|:-----------:|
+| 🖼️ | GPT-4V | gpt-4-vision-preview | 63.7 | 14.0 |
+| 🖼️ | Yi-VL | Yi-34B-Chat | 55.3 | 26.2 |
+| 🖼️ | Gemini Vision | gemini-pro-vision | 53.7 | 12.4 |
+| 📃 | Gemini | gemini-pro | 52.2 | 10.5 |
+| 📃 | GPT-4 | gpt-4-1106-preview | 50.2 | 5.8 |
+| 📃 | DFM-2.0 | dfm-2.0-70b-preview | 49.7 | 18.0 |
+| 🖼️ | InternVL | InternVL-Chat-Chinese-V1.1 | 44.9 | 20.7 |
+| 🖼️ | Qwen-VL | Qwen-VL-Chat | 39.0 | 10.5 |
+| 📃 | ChatGPT | gpt-3.5-turbo-1106 | 35.9 | 4.7 |
+| 🖼️ | VisCPM | VisCPM-Chat | 33.4 | 13.0 |
+| 📃 | MOSS | moss-moon-003-sft | 32.6 | 13.1 |
+| 🖼️ | VisualGLM | visualglm-6b | 31.1 | 12.8 |
+| 🖼️ | Chinese-LLaVA | Chinese-LLaVA-Cllama2 | 28.5 | 12.3 |
+
+更多详情,请访问我们的[排行榜]()(即将推出)。
+
+## ⏬ 下载
+
+你可以从[HuggingFace页面](https://huggingface.co/datasets/OpenDFM/MULTI-Benchmark)下载数据集。最新[版本](https://huggingface.co/datasets/OpenDFM/MULTI-Benchmark/blob/main/MULTI_v1.2.2_20240212_release.zip)为`v1.2.2`。
+
+```
+wget https://huggingface.co/datasets/OpenDFM/MULTI-Benchmark/resolve/main/MULTI_v1.2.2_20240212_release.zip
+unzip MULTI_v1.2.2_20240212_release.zip -d ./data
+```
+
+## 📝 如何评估
+
+此部分即将更新。现在,请参考[历史版本README](./eval/models/README.md)。
+
+## 📮 如何提交
+
+你需要首先准备一个UTF-8编码的JSON文件,格式如下:
+
+```json
+{
+    "czsx_0_0": {
+        "question_id": "czsx_0_0",
+        "question_image_number": 1,
+        "image_list": [...],
+        "input_message": ...,
+        "prediction": "C"
+    },
+    ...
+}
+```
+如果你使用我们的官方代码评估模型,你可以直接压缩实验结果文件夹`./results/EXPERIMENT_NAME`。
+
+然后,你可以将你的结果提交到我们的[评估平台]()(即将推出)。
+
+感谢您对 MULTI 数据集的关注!由于自动评测平台尚未上线,请填写[此问卷](https://wj.sjtu.edu.cn/q/89UmRAJn)以获取评测结果,您的个人信息将被严格保密,请放心填写。🤗
+
+欢迎提交拉取请求(Pull Request),将你的代码贡献到我们的评估代码中。我们感激不尽!
+
+## 📑 引用
+
+如果你觉得我们的工作有用,请引用我们!
+
+```
+@misc{zhu2024multi,
+    title={{MULTI}: Multimodal Understanding Leaderboard with Text and Images},
+    author={Zichen Zhu and Yang Xu and Lu Chen and Jingkai Yang and Yichuan Ma and Yiming Sun and Hailin Wen and Jiaqi Liu and Jinyu Cai and Yingzi Ma and Situo Zhang and Zihan Zhao and Liangtai Sun and Kai Yu},
+    year={2024},
+    eprint={2402.03173},
+    archivePrefix={arXiv},
+    primaryClass={cs.CL}
+}
+```
+
+## 📧 联系我们
+
+如果你有任何问题,请随时通过电子邮件联系我们:`JamesZhutheThird@sjtu.edu.cn` 和 `xuyang0112@sjtu.edu.cn`。
diff --git a/data/README.md b/data/README.md
index 18c076a..720a699 100644
--- a/data/README.md
+++ b/data/README.md
@@ -1,3 +1,5 @@
+# This README will be deprecated soon
+
 # Overview
 
 This document will guide you to fetch our benchmark.
diff --git a/docs/static/images/example.png b/docs/static/images/example.png
index bdf3c26..a36275d 100644
Binary files a/docs/static/images/example.png and b/docs/static/images/example.png differ
diff --git a/docs/static/images/overview.png b/docs/static/images/overview.png
index bd865b2..e4b09f5 100644
Binary files a/docs/static/images/overview.png and b/docs/static/images/overview.png differ
diff --git a/docs/static/images/pipeline.png b/docs/static/images/pipeline.png
new file mode 100644
index 0000000..6e31d3e
Binary files /dev/null and b/docs/static/images/pipeline.png differ
diff --git a/eval/models/README.md b/eval/models/README.md
index 1f5feb9..a896263 100644
--- a/eval/models/README.md
+++ b/eval/models/README.md
@@ -1,3 +1,5 @@
+# This README will be deprecated soon
+
 # Overview
 
 This folder contains all evaluators to support LLM performance test. Each file contains an evaluator specified to one LLM, and implements a `generate_answer` method to receive a question as input and give out the answer of it.
diff --git a/eval/models/internvl_hf.py b/eval/models/internvl_hf.py
index 557b87d..9f45465 100644
--- a/eval/models/internvl_hf.py
+++ b/eval/models/internvl_hf.py
@@ -32,7 +32,7 @@ def generate_response(self, input):
             pixel_values = self.image_processor(images=image, return_tensors='pt').pixel_values
             pixel_values = pixel_values.to(torch.bfloat16).cuda()
             message = question["prompted_content"]
-            response = self.model.chat(self.tokenizer, pixel_values, message, self.sample_params)
+            response = self.model.chat(self.tokenizer, pixel_values, message[0:768], self.sample_params)  # truncate to the first 768 characters (768 is the max length of the message)
             return response, message
         elif isinstance(input, tuple):
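
A note on the `eval/models/internvl_hf.py` hunk above: `message[0:768]` truncates the prompt by characters, not tokens. If token-level truncation were ever preferred instead, a minimal sketch (not part of the repository; it assumes a HuggingFace-style tokenizer exposing `encode`/`decode`, and the helper name is hypothetical) could look like this:

```python
def truncate_by_tokens(tokenizer, text: str, max_tokens: int = 768) -> str:
    """Keep at most `max_tokens` tokens of `text` (hypothetical helper, not in the repo)."""
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    if len(token_ids) <= max_tokens:
        return text
    # Drop the overflow tokens and decode back to a string.
    return tokenizer.decode(token_ids[:max_tokens], skip_special_tokens=True)
```

The character slice used in the patch is simpler and avoids an extra encode/decode round trip.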