From 979a3d75b0dae1b95cd7c38c4d3bedb4e9c652c6 Mon Sep 17 00:00:00 2001
From: Yiqi Zhu <159240994+StephenZhuYiQi@users.noreply.github.com>
Date: Tue, 4 Jun 2024 15:04:14 +0800
Subject: [PATCH] move home page to this repo
---
index.md | 463 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 463 insertions(+)
create mode 100644 index.md
diff --git a/index.md b/index.md
new file mode 100644
index 0000000..d8f77a8
--- /dev/null
+++ b/index.md
@@ -0,0 +1,463 @@
+# Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion
+
+Ziyue Wang<sup>1\*</sup>, Chi Chen<sup>1\*</sup>, Yiqi Zhu<sup>1</sup>, Fuwen Luo<sup>1</sup>, Peng Li<sup>2†</sup>, Ming Yan<sup>3</sup>, Fei Huang<sup>3†</sup>, Maosong Sun<sup>1</sup>, Yang Liu<sup>1,2</sup>
+
+<sup>1</sup> Department of Computer Science and Technology, Tsinghua University, Beijing, China
+
+<sup>2</sup> Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
+
+<sup>3</sup> Institute of Intelligent Computing, Alibaba Group
+
+\* Equal contribution, † Corresponding authors
+
+## Introduction
+
+With the bloom of Multimodal Large Language Models (MLLMs), the paradigm of extending Large Language Models with pre-trained vision encoders has shown remarkable abilities in visual reasoning and visual instruction-following tasks. However, this paradigm neglects essential cross-modality and inter-image interactions: the LLM is presented with isolated visual and textual features and never perceives the interleaved multimodal input as a whole. We refer to this issue as prior-LLM modality isolation, and it obscures a deeper understanding of multi-image and interleaved inputs.
+
+To mitigate the issue, we propose a novel paradigm named **Bro**wse-and-Concentra**te** (**Brote**). This paradigm begins with a browsing phase to generate a condition context vector, serving as a collection of browsing insights, encapsulating the main intent and visual information derived from images. Subsequently, a concentrating phase is employed to comprehend multimodal inputs, guided by the condition context vector. Our paradigm exhibits notable advancements, improving the average accuracy on 7 multi-image benchmarks by **2.13%** and **7.60%** against strong baselines with 3B and 11B LLMs, respectively.
+
+## Framework
+
+Our paradigm progressively comprehends images via two phases, browsing and concentrating. In the browsing phase, the MLLM browses the entire input and generates a condition context as the browsing result, denoted as _C_. Then, in the concentrating phase, the model comprehends the multimodal inputs under the guidance of _C_. We refer to the model used in the browsing phase as _MB_ and the model used in the concentrating phase as _MC_.
+
+Moreover, our proposed Brote can be further divided into two modes, explicit and implicit, according to how the browsing result _C_ is incorporated. The explicit version, denoted as **Brote-EX**, operates with separate parameters (_MB_ ≠ _MC_): it first generates _C_ with _MB_, and then _MC_ infers the final outcome. In contrast, the implicit version, denoted as **Brote-IM**, employs shared parameters for both phases (_MB_ = _MC_), permitting _MC_ to directly predict the answer without explicitly producing intermediate vectors with a separate model.
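+
+The following minimal sketch illustrates the control flow of the two modes; the `browse_model`, `concentrate_model`, and `shared_model` callables are hypothetical placeholders for the underlying MLLMs, not the released implementation.
+
+```python
+def brote_ex(browse_model, concentrate_model, images, text):
+    """Explicit mode (Brote-EX): M_B and M_C use separate parameters."""
+    # Browsing phase: M_B reads the full interleaved input and emits the
+    # condition context vector C (the collection of browsing insights).
+    condition_context = browse_model(images=images, text=text)
+    # Concentrating phase: M_C answers under the guidance of C.
+    return concentrate_model(images=images, text=text, condition=condition_context)
+
+
+def brote_im(shared_model, images, text):
+    """Implicit mode (Brote-IM): M_B and M_C share parameters."""
+    # A single model plays both roles, so the answer is predicted directly
+    # without explicitly producing C with a separate model.
+    return shared_model(images=images, text=text)
+```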
+
+## Training Strategies
+
+To encourage further exploitation of the information in _C_ for VL tasks, we propose a new training strategy named **context-dropping training**. The strategy intentionally omits particular inputs yet requires the model to infer the answer solely with the assistance of _C_, motivating the model to compensate for the missing information using the provided condition context _C_. We propose three different dropping strategies (see the sketch after this list):
+1. Drop images: This involves two approaches, removing certain images (**Context Dropping (IMG-N)**) and replacing the original images with blank placeholders (**Context Dropping (IMG-B)**).
+2. Drop text: We remove the text before the last image (**Context Dropping (TXT)**).
+3. Drop ALL: A combination of the above settings, denoted as **ALL**, applied with equal probability.
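+
+The sketch below shows one way such context dropping could be implemented for an interleaved sample; the segment layout, blank-image placeholder, and 0.5 dropping probability are illustrative assumptions rather than the released training code.
+
+```python
+import random
+
+BLANK_IMAGE = "<blank-image>"  # placeholder standing in for an all-blank image
+
+
+def drop_context(segments, mode):
+    """Apply one context-dropping strategy to an interleaved sample.
+
+    `segments` is assumed to be a list of dicts such as
+    {"kind": "image", "data": ...} or {"kind": "text", "data": ...}.
+    """
+    if mode == "ALL":  # pick one concrete strategy with equal probability
+        mode = random.choice(["IMG-N", "IMG-B", "TXT"])
+
+    if mode == "IMG-N":  # remove certain images
+        return [s for s in segments
+                if not (s["kind"] == "image" and random.random() < 0.5)]
+
+    if mode == "IMG-B":  # replace original images with blank placeholders
+        return [{"kind": "image", "data": BLANK_IMAGE}
+                if s["kind"] == "image" and random.random() < 0.5 else s
+                for s in segments]
+
+    if mode == "TXT":  # drop the text before the last image
+        image_positions = [i for i, s in enumerate(segments) if s["kind"] == "image"]
+        if not image_positions:
+            return segments
+        last_image = image_positions[-1]
+        return [s for i, s in enumerate(segments)
+                if i >= last_image or s["kind"] != "text"]
+
+    return segments
+```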
+
+## Results
+
+We report our results in the following tables:
+
+VQAv2 and A-OKVQA are evaluated as in-context learning tasks; NLVR2, DEMON, SEED, MSVD QA, and MSRVTT QA are multi-image / video tasks.
+
+| Model | #Param LLM | VQAv2 | A-OKVQA | NLVR2 | DEMON | SEED | MSVD QA | MSRVTT QA | AVG |
+| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
+| KOSMOS-1 | 1.3B | 51.8 | - | - | - | - | - | - | - |
+| InstructBLIP-XL | 3B | 31.76* | 39.13* | 52.59* | 32.59* | 52.7 | 43.40* | 12.12* | 37.77 |
+| MMICL-XL | 3B | 69.16 | 53.43* | 71.48* | 38.14* | 54.69* | 53.68 | 42.36* | 54.71 |
+| Otter | 7B | 45.39* | 38.42* | 49.54* | 24.51 | 39.7 | 25.87* | 9.78* | - |
+| VPG-C-LLaMA2 | 7B | - | - | - | 37.22 | - | - | - | - |
+| Flamingo-9B | 7B | 56.3 | - | - | - | - | 30.2 | 13.7 | - |
+| Brote-EX-XL | 3B | 69.97 | 56.00 | 71.41 | 37.33 | 57.51 | 53.02 | 43.14 | 55.48 |
+| Brote-IM-XL | 3B | 68.94 | 56.43 | 76.02 | 37.34 | 57.86 | 56.06 | 45.08 | 56.84 |
+| InstructBlip-XXL | 11B | 48.21* | 45.92* | 64.54* | 33.00* | 50.81* | 44.30* | 15.49* | 43.18 |
+| MMICL-XXL | 11B | 70.56 | 54.85* | 56.16* | 36.30* | 56.66* | 52.19 | 39.46* | 52.18 |
+| EMU-2 | 33B | 67.0 | - | - | - | 62.8 | 49.0 | 31.4 | - |
+| Flamingo-80B | 70B | 63.1 | - | - | - | - | 35.6 | 17.4 | - |
+| Brote-EX-XXL | 11B | 70.86 | 59.94 | 70.42 | 38.70 | 59.31 | 54.42 | 45.24 | 57.00 |
+| Brote-IM-XXL | 11B | 71.71 | 60.31 | 80.71 | 38.94 | 61.64 | 57.29 | 45.94 | 59.78 |
+
+
+ - The best results for models larger/smaller than 10B are separately bolded and the second-best are underlined.
+ - VQAv2 and A-OKVQA are conducted under the four-shot setting.
+ - SEED refers to SEED-Bench, which contains both images and videos.
+ - For video benchmarks, we uniformly extract eight frames from each video clip to answer the questions (a minimal sampling sketch follows these notes).
+ - For "AVG", we report the average score over the seven benchmarks in this table.
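+
+For reference, the eight-frame sampling mentioned above can be sketched as follows, assuming frames are indexed from 0 and using a common centre-of-bin convention (the exact preprocessing may differ):
+
+```python
+def uniform_frame_indices(num_video_frames: int, num_samples: int = 8) -> list[int]:
+    """Pick `num_samples` frame indices spread evenly across the clip."""
+    # Place one sample at the centre of each of `num_samples` equal temporal bins.
+    return [int((i + 0.5) * num_video_frames / num_samples)
+            for i in range(num_samples)]
+
+
+print(uniform_frame_indices(120))  # -> [7, 22, 37, 52, 67, 82, 97, 112]
+```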
+
+| Model | #Param LLM | VQAv2 | A-OKVQA | ScienceQA-IMG | MME Perception | MME Cognition | MMBench | AVG |
+| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
+| InstructBLIP-XL | 3B | 36.77 | 54.57 | 70.40 | 1093.70* | 281.43* | 69.68* | 68.52 |
+| MMICL-XL | 3B | 69.13 | 52.12* | 72.58* | 1184.54* | 277.86* | 73.11* | 75.81 |
+| LLaVA | 7B | - | - | - | 457.82 | 214.64 | 36.2 | - |
+| Otter | 7B | 57.89* | 41.92* | 63.10 | 1292.26 | 306.43 | 48.3 | 69.51 |
+| Brote-EX-XL | 3B | 69.90 | 52.93 | 71.15 | 1203.87 | 301.79 | 73.27 | 77.18 |
+| Brote-IM-XL | 3B | 70.24 | 53.40 | 72.58 | 1181.95 | 266.79 | 74.29 | 75.90 |
+| InstructBlip-XXL | 11B | 63.69 | 57.10 | 70.60 | 1212.82* | 291.79* | 70.34* | 75.99 |
+| MMICL-XXL | 11B | 70.30 | 51.35* | 74.92* | 1313.88* | 311.79* | 76.58* | 80.41 |
+| MMICL-XXL (BLIP2) | 11B | 69.99 | - | - | 1381.74 | 428.93 | 65.24 | - |
+| Brote-EX-XXL | 11B | 71.58 | 56.47 | 77.69 | 1279.73 | 310.01 | 76.67 | 81.31 |
+| Brote-IM-XXL | 11B | 73.02 | 57.83 | 78.38 | 1284.13 | 300.00 | 77.34 | 81.66 |
+
+
+
+ - The best results for models larger/smaller than 10B are separately bolded and the second-best are underlined.
+ - VQAv2 and A-OKVQA are conducted under the zero-shot setting.
+ - ScienceQA is conducted under the zero-shot CoT (ZS-CoT) setting.
+ - For "AVG", we first average the MME scores over their subtasks, then average the scores of all benchmarks in this table.
+
+## Citation
+
+📑 If you find our project helpful to your research, please consider citing:
+
+```bibtex
+@article{wang2024browse,
+  title={Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion},
+  author={Wang, Ziyue and Chen, Chi and Zhu, Yiqi and Luo, Fuwen and Li, Peng and Yan, Ming and Zhang, Ji and Huang, Fei and Sun, Maosong and Liu, Yang},
+  journal={arXiv preprint arXiv:2402.12195},
+  year={2024}
+}
+```
+
+