diff --git a/index.html b/index.html
index 3382a3b..92ce487 100644
--- a/index.html
+++ b/index.html
@@ -162,7 +162,7 @@
-We introduce MULTI: a multi-level, multi-disciplinary, and multi-type cross-modal test benchmark, aimed at evaluating the performance of multimodal generative large models under different conditions and scenarios. We collected and annotated more than 18K questions from exams, quizzes, textbooks, websites and other resources, most of which underwent at least two rounds of human annotation and checking, and three rounds of script cleaning. Some questions were manually adapted to make them more suitable for evaluating the comprehensive ability of the model. These questions involve four educational levels: junior high school, high school, college and social exams, covering Chinese, mathematics, English, physics, chemistry, biology, history, geography, politics, information technology, driving test and other disciplines and fields, including single choice, multiple choice, fill in the blank (given range and fully open), and open-ended discussions.
+We introduce MULTI: a multi-level, multi-disciplinary, and multi-type cross-modal benchmark aimed at evaluating the performance of large multimodal generative models under different conditions and scenarios. We collected and annotated more than 18K questions from exams, quizzes, textbooks, websites, and other resources; most underwent at least two rounds of human annotation and checking, plus three rounds of script-based cleaning. Some questions were manually adapted to make them better suited to evaluating models' comprehensive abilities. The questions span four educational levels: junior high school, high school, college, and social exams. They cover Chinese, mathematics, English, physics, chemistry, biology, history, geography, politics, information technology, driving tests, and other disciplines and fields, and include single-choice, multiple-choice, fill-in-the-blank (both with a given answer range and fully open), and open-ended discussion questions.
 We manually selected 500 questions to form a hard subset used to probe the limits of model performance. These questions often contain multiple images and formulas, test a model's comprehensive understanding across several images, and require complex, rigorous logical reasoning. Results on this subset are reported separately on the leaderboard.
 We evaluated GPT-3.5 and open-source multimodal large models*; the results show that even the advanced GPT-3.5 achieved only 43.28% accuracy, leaving substantial room for improvement. We believe MULTI will motivate the community to build the next generation of multimodal foundation models and work toward expert-level artificial general intelligence.
 * Based on the v0.3.0-20231115 version of the data, tested on the three question types SC/MC/FIB (single-choice, multiple-choice, and fill-in-the-blank).
diff --git a/static/css/index.css b/static/css/index.css
index 5d75d0f..639736d 100644
--- a/static/css/index.css
+++ b/static/css/index.css
@@ -49,16 +49,22 @@ body {
 
 .publication-title {
   font-family: 'Google Sans', sans-serif;
+  margin-left: -50px;
+  margin-right: -50px;
 }
 
 .publication-authors {
   font-family: 'Google Sans', sans-serif;
+  margin-left: -50px;
+  margin-right: -50px;
 }
 
 .publication-venue {
   color: #555;
   width: fit-content;
   font-weight: bold;
+  margin-left: -10px;
+  margin-right: -10px;
 }
 
 .publication-awards {
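A note on the stylesheet hunk above: CSS does not allow negative padding, so a declaration such as padding-left: -50px is invalid and silently dropped at parse time, with no visual effect. To pull a centered block outward beyond its container's content box, negative margins are the standard mechanism, which is what the added rules rely on. A minimal sketch of the difference, reusing the .publication-title selector from the hunk:

    /* Invalid: a negative padding value makes the declaration
       invalid, and the browser ignores it entirely. */
    .publication-title {
      padding-left: -50px;
      padding-right: -50px;
    }

    /* Valid: negative horizontal margins pull the box outward,
       letting a block element extend 50px past each side of
       its parent's content box. */
    .publication-title {
      margin-left: -50px;
      margin-right: -50px;
    }

An alternative with a similar visual result is to widen the element explicitly, e.g. width: calc(100% + 100px), but negative margins keep the change local to these three classes.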