diff --git a/index.html b/index.html
index ec48ff3..04445b6 100644
--- a/index.html
+++ b/index.html
@@ -164,7 +164,7 @@

MULTI: Multimodal Understanding Leaderboard

-We introduce MULTI: a multi-level, multi-disciplinary, and multi-type cross-modal test benchmark, aimed at evaluating the performance of multimodal generative large models under different conditions and scenarios. We collected and organized nearly 18K questions from exams, quizzes, textbooks and other educational resources, most of which underwent at least two rounds of human annotation and proofreading, and three rounds of script cleaning. Some questions were manually adapted to make them more suitable for evaluating the comprehensive ability of the model. These questions involve four educational levels: junior high school, high school, college and social exams, covering Chinese, mathematics, English, physics, chemistry, biology, history, geography, politics, information technology, driving test and other disciplines and fields, including single choice, multiple choice, fill in the blank (given range and fully open), and open-ended answer questions.
+We introduce MULTI: a multi-level, multi-disciplinary, and multi-type cross-modal benchmark for evaluating the performance of multimodal generative large models under different conditions and scenarios. We collected and organized nearly 18K questions from exams, quizzes, textbooks, and other educational resources; most underwent at least two rounds of human annotation and proofreading and three rounds of script-based cleaning. Some questions were manually adapted to better evaluate a model's comprehensive abilities. The questions span four educational levels (junior high school, high school, college, and social exams), cover Chinese, mathematics, English, physics, chemistry, biology, history, geography, politics, information technology, the driving test, and other disciplines and fields, and include single-choice, multiple-choice, fill-in-the-blank (with a given range or fully open), and open-ended questions.
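
For concreteness, here is a minimal Python sketch of what a single question record could look like and how a file of such records might be loaded. Every field name and value below is an illustrative assumption, not the benchmark's actual schema.

# A minimal sketch of one MULTI-style question record and a loader for a
# JSON-lines file of such records. All field names here are illustrative
# assumptions, not the benchmark's actual data format.
import json

sample = {
    "question_id": "phys_hs_0001",            # hypothetical identifier
    "education_level": "high_school",         # junior_high / high_school / college / social_exam
    "subject": "physics",
    "question_type": "single_choice",         # single_choice / multiple_choice / fill_in_blank / open_ended
    "question_text": "As shown in the figure, ...",
    "images": ["images/phys_hs_0001_1.png"],  # zero or more images per question
    "choices": ["A. ...", "B. ...", "C. ...", "D. ..."],
    "answer": "B",
}

def load_questions(path):
    """Load a JSON-lines file of question records."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]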

We manually selected 500 questions to form a hard subset used to probe the limits of model performance. These questions often contain multiple images and formulas, test a model's joint understanding of several images, and require complex and rigorous logical reasoning. Results on this subset are reported separately on the leaderboard.
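
The hard subset was selected manually, but as a rough illustration, this sketch shows how one might flag candidate questions with the stated properties (multiple images, formula-heavy text), reusing the hypothetical record format above.

# The hard subset was chosen by hand; this only flags *candidates* with
# the properties described in the text, using the hypothetical schema above.
def is_hard_candidate(q):
    multi_image = len(q.get("images", [])) >= 2
    has_formula = "$" in q.get("question_text", "")  # crude LaTeX heuristic
    return multi_image and has_formula

# Example: hard_candidates = [q for q in questions if is_hard_candidate(q)]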

We evaluated GPT-3.5 and open-source multimodal large models*, and the results show that even the advanced GPT-3.5 achieved only 43.28% accuracy, leaving substantial room for improvement. We believe MULTI will motivate the community to build the next generation of multimodal foundation models on the path to expert-level artificial general intelligence.

*Based on the v0.3.0-20231115 version of the data, tested on three question types: single choice (SC), multiple choice (MC), and fill in the blank (FIB).
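
As a rough sketch of how such an accuracy number could be computed per question type, assuming the hypothetical record format above and simple exact-match scoring (the leaderboard's actual scoring rules may differ, e.g., partial credit on multiple-choice):

# Per-type accuracy over the SC/MC/FIB subset, assuming exact-match
# scoring; the leaderboard's actual scoring rules may differ.
from collections import defaultdict

def accuracy_by_type(records, predictions):
    """records: list of question dicts; predictions: question_id -> model answer."""
    scored = ("single_choice", "multiple_choice", "fill_in_blank")
    correct, total = defaultdict(int), defaultdict(int)
    for q in records:
        t = q["question_type"]
        if t not in scored:
            continue
        total[t] += 1
        if predictions.get(q["question_id"], "").strip() == q["answer"].strip():
            correct[t] += 1
    return {t: correct[t] / total[t] for t in total if total[t]}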