diff --git a/index.html b/index.html
index dd08f88..89f4dd4 100644
--- a/index.html
+++ b/index.html
@@ -224,9 +224,8 @@

(Put some important take-away messages here)

@@ -285,8 +284,8 @@


Overview

@@ -362,8 +361,12 @@

Experiments and Results

During inference, candidate elements are clustered into groups of five and iteratively refined until a single choice is made or all options are discarded. We evaluate supervised fine-tuning (SFT) methods using FLAN-T5 and BLIP2-T5, and in-context learning (ICL) methods using GPT-3.5 and GPT-4. The experimental results are shown in the following table.
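To make the procedure above concrete, the following is a minimal sketch of the group-of-five iterative refinement, assuming a `rank_group` callable that stands in for the underlying model (SFT or ICL): given a group of candidates it returns its pick, or rejects the whole group. It is an illustration of the described loop, not the released evaluation code.

```python
# Minimal sketch of the group-of-five iterative refinement (illustrative only).
from typing import Callable, List, Optional

def iterative_refinement(
    elements: List[str],
    rank_group: Callable[[List[str]], Optional[str]],  # assumed model wrapper
    group_size: int = 5,
) -> Optional[str]:
    """Cluster candidates into groups of `group_size`, keep each group's pick,
    and repeat until a single choice remains or every option is discarded."""
    candidates = list(elements)
    while len(candidates) > 1:
        survivors = []
        for i in range(0, len(candidates), group_size):
            choice = rank_group(candidates[i:i + group_size])
            if choice is not None:   # None means the model rejected the group
                survivors.append(choice)
        if not survivors:            # all options discarded
            return None
        candidates = survivors
    return candidates[0] if candidates else None
```

Because each group contributes at most one survivor, every round shrinks the candidate pool, so the loop ends with either a single choice or no options left.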

+ [Added figure (alt text: "algebraic reasoning")]

@@ -566,17 +569,59 @@

Experiments and Results

+ [Logo]
+ Online Evaluation on Live Websites

Overview


+ We introduce the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark, meticulously curated to assess the expert-level multimodal understanding capability of foundation models across a broad scope of tasks. It covers subjects spanning six disciplines, Art, Business, Health & Medicine, Science, Humanities & Social Science, and Tech & Engineering, and numerous subfields; the detailed subject coverage and statistics are shown in the figure. The questions in our benchmark were manually collected by a team of college students (including coauthors) from various disciplines and subjects, drawing on online sources, textbooks, and lecture materials.

+ [Added figure (alt text: "algebraic reasoning")]

+ MMMU is designed to measure three essential skills in LMMs: perception, knowledge, and reasoning. Our aim is to evaluate how well these models can not only perceive and understand information across different modalities but also apply reasoning with subject-specific knowledge to derive the solution.


+ Our MMMU benchmark introduces key challenges for multimodal foundation models, as detailed in the figure. Among these, we particularly highlight the challenge stemming from the requirement for both expert-level visual perception and deliberate reasoning with subject-specific knowledge. Our tasks not only demand the processing of various heterogeneous image types but also require a model to use domain-specific knowledge to deeply understand both the text and the images and to reason over them. This goes significantly beyond basic visual perception and calls for an approach that integrates advanced multimodal analysis with domain-specific knowledge.
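As a concrete illustration of the evaluation setup described above, here is a minimal, hypothetical sketch of how a single multiple-choice item could be represented and turned into a zero-shot prompt for an ICL model. The record fields (`subject`, `question`, `options`, `image_path`) and the `query_lmm` callable are assumptions for illustration, not MMMU's released data format or evaluation harness.

```python
# Hypothetical sketch of one multiple-choice item and a simple accuracy loop.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Question:
    subject: str         # e.g. a Science or Tech & Engineering subject
    question: str        # question text, usually referring to the image(s)
    options: List[str]   # answer choices labeled A, B, C, ...
    image_path: str      # path to the associated diagram, chart, or photo

def build_prompt(q: Question) -> str:
    """Format the question and options into a zero-shot prompt; the image is
    passed to the model separately as a visual input."""
    lines = [f"Subject: {q.subject}", q.question]
    lines += [f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(q.options)]
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)

def accuracy(items: List[Question], gold: List[str],
             query_lmm: Callable[[str, str], str]) -> float:
    """Fraction of items where the model's reply starts with the gold letter."""
    correct = sum(
        query_lmm(build_prompt(q), q.image_path).strip().upper().startswith(a)
        for q, a in zip(items, gold)
    )
    return correct / max(len(items), 1)
```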

@@ -1921,6 +1966,24 @@

BibTeX

    box-shadow: 0 12px 16px 0 rgba(0,0,0,0.24), 0 17px 50px 0 rgba(0,0,0,0.19); /* shadow effect on mouse hover */
  }
+ /* Commented-out results-carousel styles, kept for reference */
+ /*.results-carousel {*/
+   /*overflow: hidden;*/
+ /*}*/
+
+ /*.results-carousel .item {*/
+   /*margin: 5px;*/
+   /*overflow: hidden;*/
+   /*border: 1px solid #bbb;*/
+   /*border-radius: 10px;*/
+   /*padding: 0;*/
+   /*font-size: 0;*/
+ /*}*/
+
+ /*.results-carousel video {*/
+   /*margin: 0;*/
+ /*}*/
+
  table {
    border-collapse: collapse;
    width: 100%;