Commit

Update index.html
gray311 authored Dec 4, 2023
1 parent dd62790 commit 36ecee3
Showing 1 changed file with 1 addition and 1 deletion.
index.html: 2 changes (1 addition & 1 deletion)

@@ -164,7 +164,7 @@ <h1 class="title is-1 publication-title">MULTI: Multimodal Understanding Leaderb
 </div>
 </div>
 <h3 class="subtitle is-size-3-tablet has-text-left pb-">
-<p style="text-align:justify; line-height:150%; margin-left: -150px; margin-right: -150px; font-size: 20px">We introduce MULTI: a multi-level, multi-disciplinary, and multi-type cross-modal test benchmark, aimed at evaluating the performance of multimodal generative large models under different conditions and scenarios. We collected and organized nearly 18K questions from exams, quizzes, textbooks and other educational resources, most of which underwent at least two rounds of human annotation and proofreading, and three rounds of script cleaning. Some questions were manually adapted to make them more suitable for evaluating the comprehensive ability of the model. These questions involve four educational levels: junior high school, high school, college and social exams, covering Chinese, mathematics, English, physics, chemistry, biology, history, geography, politics, information technology, driving test and other disciplines and fields, including single choice, multiple choice, fill in the blank (given range and fully open), and open-ended answer questions.
+<p style="text-align:justify; line-height:150%; margin-left: -130px; margin-right: -130px; font-size: 20px">We introduce MULTI: a multi-level, multi-disciplinary, and multi-type cross-modal test benchmark, aimed at evaluating the performance of multimodal generative large models under different conditions and scenarios. We collected and organized nearly 18K questions from exams, quizzes, textbooks and other educational resources, most of which underwent at least two rounds of human annotation and proofreading, and three rounds of script cleaning. Some questions were manually adapted to make them more suitable for evaluating the comprehensive ability of the model. These questions involve four educational levels: junior high school, high school, college and social exams, covering Chinese, mathematics, English, physics, chemistry, biology, history, geography, politics, information technology, driving test and other disciplines and fields, including single choice, multiple choice, fill in the blank (given range and fully open), and open-ended answer questions.
 <br><br>We manually selected 500 questions to form a difficult subset, which is used to evaluate the model's extreme performance. These questions often contain multiple images and formulas, test the model's comprehensive understanding of multiple images, and require complex and rigorous logical reasoning. The performance of this part of the data will be displayed separately on the leaderboard.
 <br><br>We tested on GPT-3.5 and open-source multimodal large models<sup>*</sup>, and the results show that even the advanced GPT-3.5 only achieved 43.28% accuracy, showing a huge room for improvement. We believe that MULTI will motivate the community to build the next generation of multimodal foundation models, to achieve expert-level artificial general intelligence.
 <br><br>Based on v0.3.0-20231115 version of the data, tested on SC/MC/FIB three question types.</p>
