Merge branch 'github-page' of https://github.com/X-LANCE/MULTI-Benchmark into github-page
OpenDFM · Dec 7, 2023 · 84f68e2
2 parents 8a206c2 + 74069bf
Showing 2 changed files with 7 additions and 1 deletion.
diff --git a/index.html b/index.html
@@ -162,7 +162,7 @@ <h1 class="title is-1 publication-title">MULTI: Multimodal Understanding Leaderb
 <!--      </div>-->
 <!--    </div>-->
     <h3 class="subtitle is-size-3-tablet has-text-left pb-">
-    <p style="text-align:justify; line-height:150%; margin-left: -130px; margin-right: -130px; font-size: 20px">We introduce <b>MULTI</b>: a multi-level, multi-disciplinary, and multi-type cross-modal test benchmark, aimed at evaluating the performance of multimodal generative large models under different conditions and scenarios. We collected and annotated more than 18K questions from exams, quizzes, textbooks, websites and other resources, most of which underwent at least two rounds of human annotation and checking, and three rounds of script cleaning. Some questions were manually adapted to make them more suitable for evaluating the comprehensive ability of the model. These questions involve four educational levels: junior high school, high school, college and social exams, covering Chinese, mathematics, English, physics, chemistry, biology, history, geography, politics, information technology, driving test and other disciplines and fields, including single choice, multiple choice, fill in the blank (given range and fully open), and open-ended discussions.
+    <p style="text-align:justify; line-height:150%; margin-left: -75px; margin-right: -75px; font-size: 20px">We introduce <b>MULTI</b>: a multi-level, multi-disciplinary, and multi-type cross-modal test benchmark, aimed at evaluating the performance of multimodal generative large models under different conditions and scenarios. We collected and annotated more than 18K questions from exams, quizzes, textbooks, websites and other resources, most of which underwent at least two rounds of human annotation and checking, and three rounds of script cleaning. Some questions were manually adapted to make them more suitable for evaluating the comprehensive ability of the model. These questions involve four educational levels: junior high school, high school, college and social exams, covering Chinese, mathematics, English, physics, chemistry, biology, history, geography, politics, information technology, driving test and other disciplines and fields, including single choice, multiple choice, fill in the blank (given range and fully open), and open-ended discussions.
       <br><br>We manually selected 500 questions to form a difficult subset, which is used to evaluate the model's extreme performance. These questions often contain multiple images and formulas, test the model's comprehensive understanding of multiple images, and require complex and rigorous logical reasoning. The performance of this part of the data will be displayed separately on the leaderboard.
       <br><br>We tested on GPT-3.5 and open-source multimodal large models<sup>*</sup>, and the results show that even the advanced GPT-3.5 only achieved 43.28% accuracy, showing a huge room for improvement. We believe that MULTI will motivate the community to build the next generation of multimodal foundation models, to achieve expert-level artificial general intelligence.
       <br><br> <p style="font-size:15px"><sup>*</sup> Based on v0.3.0-20231115 version of the data, tested on SC/MC/FIB three question types.</p>

diff --git a/static/css/index.css b/static/css/index.css
@@ -49,16 +49,22 @@ body {
 
 .publication-title {
     font-family: 'Google Sans', sans-serif;
+    padding-left: -50px;
+    padding-right: -50px;
 }
 
 .publication-authors {
     font-family: 'Google Sans', sans-serif;
+    padding-left: -50px;
+    padding-right: -50px;
 }
 
 .publication-venue {
     color: #555;
     width: fit-content;
     font-weight: bold;
+    padding-left: -10px;
+    padding-right: -10px;
 }
 
 .publication-awards {