From 979a3d75b0dae1b95cd7c38c4d3bedb4e9c652c6 Mon Sep 17 00:00:00 2001
From: Yiqi Zhu <159240994+StephenZhuYiQi@users.noreply.github.com>
Date: Tue, 4 Jun 2024 15:04:14 +0800
Subject: [PATCH] move home page to this repo

---
 index.md | 463 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 463 insertions(+)
 create mode 100644 index.md

diff --git a/index.md b/index.md
new file mode 100644
index 0000000..d8f77a8
--- /dev/null
+++ b/index.md
@@ -0,0 +1,463 @@
+
+ Browse and Concentrate:
Comprehending Multimodal Content via prior-LLM Context Fusion +
+ +
+ +
+ Ziyue Wang1*, + Chi Chen1*, + Yiqi Zhu1, Fuwen Luo1,
+ Peng Li2†, + Ming Yan3, + Fei Huang3†, + Maosong Sun1, + Yang Liu1,2 +
+ +
+ +
+ 1 Department of Computer Science and Technology, Tsinghua University, Beijing, China
+ 2 Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
+ 3 Institute of Intelligent Computing, Alibaba Group
+
+ +
+ +
+ * Equal contribution
+ † Corresponding authors
+
+ +
+ +
+ 📖 arXiv | + Github | + Models 🤗 +
+ +
+
+ +
+ Introduction +
+
+With the bloom of Multimodal Large Language Models, the paradigm of extending Large Language Models with pre-trained vision encoders has shown remarkable abilities in visual reasoning and visual instruction-following tasks. However, this paradigm neglects essential cross-modality and inter-image interactions: the LLM is presented with isolated visual and textual features and never perceives the interleaved multimodal inputs as a whole. We refer to this issue as prior-LLM modality isolation, and it obscures a deeper understanding of multi-image and interleaved inputs.
+
+To mitigate this issue, we propose a novel paradigm named **Bro**wse-and-Concentra**te** (**Brote**). It begins with a browsing phase that generates a condition context vector, serving as a collection of browsing insights that encapsulates the main intent and the visual information derived from the images. Subsequently, a concentrating phase comprehends the multimodal inputs under the guidance of this condition context vector. Our paradigm exhibits notable advancements, improving the average accuracy on 7 multi-image benchmarks by **2.13%** and **7.60%** over strong baselines with 3B and 11B LLMs, respectively.
+
+
+ +
+ +
+
+ +
+ Framework +
+
+Our paradigm progressively comprehends images via two phases, browsing and concentrating. In the browsing phase, the MLLM browses the entire input and generates a condition context, denoted as _C_, as the browsing result. Then, in the concentrating phase, the model comprehends the multimodal inputs under the guidance of _C_. We refer to the model of the browsing phase as _MB_ and the model of the concentrating phase as _MC_.
+
+Moreover, our proposed Brote can be further divided into two modes, explicit and implicit, according to how the browsing result _C_ is incorporated. The explicit mode, denoted as **Brote-EX**, operates with separate parameters (_MB_ ≠ _MC_): it first generates _C_ with _MB_, and _MC_ then infers the final outcome conditioned on it. In contrast, the implicit mode, denoted as **Brote-IM**, employs shared parameters for both phases (_MB_ = _MC_), permitting _MC_ to directly predict the answer without explicitly producing intermediate vectors from another model.
+
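+Below is a minimal, self-contained sketch of this two-phase flow. The module name, method names, and feature shapes (`ToyMLLM`, `browse`, `concentrate`, the pooling) are illustrative assumptions for exposition, not the released Brote implementation.
+
+```python
+# Illustrative sketch only: names and shapes are assumptions, not the released Brote code.
+import torch
+import torch.nn as nn
+
+
+class ToyMLLM(nn.Module):
+    """Stand-in for the multimodal LLM used in both phases."""
+
+    def __init__(self, dim: int = 32):
+        super().__init__()
+        self.fuse = nn.Linear(2 * dim, dim)  # fuses pooled visual + textual features
+        self.head = nn.Linear(dim, dim)      # produces the answer representation
+
+    def browse(self, image_feats, text_feats):
+        """Browsing phase: scan the whole input and emit the condition context C."""
+        pooled = torch.cat([image_feats.mean(0), text_feats.mean(0)], dim=-1)
+        return self.fuse(pooled)             # C, a single condition vector
+
+    def concentrate(self, image_feats, text_feats, condition=None):
+        """Concentrating phase: comprehend the inputs, guided by C when it is passed
+        explicitly (Brote-EX) or absorbed implicitly by shared parameters (Brote-IM)."""
+        fused = self.fuse(torch.cat([image_feats.mean(0), text_feats.mean(0)], dim=-1))
+        if condition is not None:
+            fused = fused + condition        # inject the browsing insights C
+        return self.head(fused)
+
+
+dim = 32
+image_feats = torch.randn(3, dim)            # e.g. three interleaved images
+text_feats = torch.randn(10, dim)            # token features of the interleaved text
+
+# Brote-EX: separate browsing and concentrating models (M_B != M_C).
+m_b, m_c = ToyMLLM(dim), ToyMLLM(dim)
+answer_ex = m_c.concentrate(image_feats, text_feats,
+                            condition=m_b.browse(image_feats, text_feats))
+
+# Brote-IM: a single shared model (M_B == M_C) predicts directly,
+# without materialising an intermediate vector from a second model.
+m = ToyMLLM(dim)
+answer_im = m.concentrate(image_feats, text_feats)
+```
+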
+ +
+ +
+
+ +
+
  Training Strategies
+
+To encourage further exploration of the information in _C_ for VL tasks, we propose a new training strategy named **context-dropping training**. This strategy intentionally omits particular inputs while still requiring the model to infer the answer solely with the assistance of _C_, motivating the model to compensate for the missing information using the provided condition context _C_. We design three dropping strategies (a minimal sketch follows the list):
+1. Drop images: This involves two approaches, removing certain images (**Context Dropping (IMG-N)**) or replacing the original images with blank placeholders (**Context Dropping (IMG-B)**).
+2. Drop text: We remove the text preceding the last image (**Context Dropping (TXT)**).
+3. Drop ALL: A combination of the above settings, denoted as **ALL**, applied with equal probabilities.
+
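+The sketch below shows how such context dropping could be applied to one training sample. The function names, the `<image>` placeholder token, the blank-image path, and the drop probability are assumptions for illustration, not the exact released training code.
+
+```python
+# Illustrative sketch only: names, probabilities, and placeholders are assumptions.
+import random
+from typing import List, Tuple
+
+Sample = Tuple[List[str], str]      # (image paths, interleaved text with <image> tokens)
+
+BLANK_IMAGE = "blank.png"           # hypothetical path to a blank placeholder image
+DROP_PROB = 0.5                     # assumed per-image dropping probability
+
+
+def drop_images(images: List[str], text: str, blank: bool) -> Sample:
+    """IMG-N removes some images outright; IMG-B replaces them with blank placeholders.
+    Either way, the model must recover the missing content from the condition context C."""
+    kept = []
+    for img in images:
+        if random.random() > DROP_PROB:
+            kept.append(img)                 # image survives
+        elif blank:
+            kept.append(BLANK_IMAGE)         # IMG-B: keep positions aligned with a blank image
+        # IMG-N: the image is silently removed
+    return kept, text
+
+
+def drop_text(images: List[str], text: str) -> Sample:
+    """TXT: remove the text that precedes the last image placeholder."""
+    last = text.rfind("<image>")
+    return images, (text[last:] if last != -1 else text)
+
+
+def context_dropping(sample: Sample, mode: str = "ALL") -> Sample:
+    images, text = sample
+    if mode == "ALL":                        # pick one of the settings with equal probability
+        mode = random.choice(["IMG-N", "IMG-B", "TXT"])
+    if mode == "IMG-N":
+        return drop_images(images, text, blank=False)
+    if mode == "IMG-B":
+        return drop_images(images, text, blank=True)
+    return drop_text(images, text)
+```
+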
+ +
+ + +
+
+ +
+ Results +
+ +
+

We report our results in the following tables:

+
+*Results on multi-image and video benchmarks. VQA2 and A-OKVQA are evaluated with in-context learning; NLVR2, DEMON, SEED, MSVD QA, and MSRVTT QA are multi-image / video tasks.*
+
+| Model | #Param LLM | VQA2 | A-OKVQA | NLVR2 | DEMON | SEED | MSVD QA | MSRVTT QA | AVG |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| KOSMOS-1 | 1.3B | 51.8 | - | - | - | - | - | - | - |
+| InstructBLIP-XL | 3B | 31.76* | 39.13* | 52.59* | 32.59* | 52.7 | 43.40* | 12.12* | 37.77 |
+| MMICL-XL | 3B | 69.16 | 53.43* | 71.48* | 38.14* | 54.69* | 53.68 | 42.36* | 54.71 |
+| Otter | 7B | 45.39* | 38.42* | 49.54* | 24.51 | 39.7 | 25.87* | 9.78* | - |
+| VPG-C-LLaMA2 | 7B | - | - | - | 37.22 | - | - | - | - |
+| Flamingo-9B | 7B | 56.3 | - | - | - | - | 30.2 | 13.7 | - |
+| Brote-EX-XL | 3B | 69.97 | 56.00 | 71.41 | 37.33 | 57.51 | 53.02 | 43.14 | 55.48 |
+| Brote-IM-XL | 3B | 68.94 | 56.43 | 76.02 | 37.34 | 57.86 | 56.06 | 45.08 | 56.84 |
+| InstructBLIP-XXL | 11B | 48.21* | 45.92* | 64.54* | 33.00* | 50.81* | 44.30* | 15.49* | 43.18 |
+| MMICL-XXL | 11B | 70.56 | 54.85* | 56.16* | 36.30* | 56.66* | 52.19 | 39.46* | 52.18 |
+| EMU-2 | 33B | 67.0 | - | - | - | 62.8 | 49.0 | 31.4 | - |
+| Flamingo-80B | 70B | 63.1 | - | - | - | - | 35.6 | 17.4 | - |
+| Brote-EX-XXL | 11B | 70.86 | 59.94 | 70.42 | 38.70 | 59.31 | 54.42 | 45.24 | 57.00 |
+| Brote-IM-XXL | 11B | 71.71 | 60.31 | 80.71 | 38.94 | 61.64 | 57.29 | 45.94 | 59.78 |
+
+ +
+ +
+| Model | #Param LLM | VQAv2 | A-OKVQA | ScienceQA-IMG | MME Perception | MME Cognition | MMBench | AVG |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| InstructBLIP-XL | 3B | 36.77 | 54.57 | 70.40 | 1093.70* | 281.43* | 69.68* | 68.52 |
+| MMICL-XL | 3B | 69.13 | 52.12* | 72.58* | 1184.54* | 277.86* | 73.11* | 75.81 |
+| LLaVA | 7B | - | - | - | 457.82 | 214.64 | 36.2 | - |
+| Otter | 7B | 57.89* | 41.92* | 63.10 | 1292.26 | 306.43 | 48.3 | 69.51 |
+| Brote-EX-XL | 3B | 69.90 | 52.93 | 71.15 | 1203.87 | 301.79 | 73.27 | 77.18 |
+| Brote-IM-XL | 3B | 70.24 | 53.40 | 72.58 | 1181.95 | 266.79 | 74.29 | 75.90 |
+| InstructBLIP-XXL | 11B | 63.69 | 57.10 | 70.60 | 1212.82* | 291.79* | 70.34* | 75.99 |
+| MMICL-XXL | 11B | 70.30 | 51.35* | 74.92* | 1313.88* | 311.79* | 76.58* | 80.41 |
+| MMICL-XXL (BLIP2) | 11B | 69.99 | - | - | 1381.74 | 428.93 | 65.24 | - |
+| Brote-EX-XXL | 11B | 71.58 | 56.47 | 77.69 | 1279.73 | 310.01 | 76.67 | 81.31 |
+| Brote-IM-XXL | 11B | 73.02 | 57.83 | 78.38 | 1284.13 | 300.00 | 77.34 | 81.66 |
+
+ +
+ + +
+
+ +
+ Citation +
+ +
+

📑 If you find our project helpful to your research, please consider citing:

+
+        @article{wang2024browse,
+            title={Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion},
+            author={Wang, Ziyue and Chen, Chi and Zhu, Yiqi and Luo, Fuwen and Li, Peng and Yan, Ming and Zhang, Ji and Huang, Fei and Sun, Maosong and Liu, Yang},
+            journal={arXiv preprint arXiv:2402.12195},
+            year={2024}
+        }
+    
+