- [2024.10.09] 🤗🤗🤗 We release $\gamma$-MoD, a novel approach to enhance computational efficiency in Multimodal Large Language Models (MLLMs) by incorporating Mixture-of-Depth (MoD) layers. This plug-and-play strategy seamlessly replaces redundant dense layers, significantly reducing computational costs while maintaining performance.
Despite recent advancements in MLLMs, their high computational demands have limited practical applications, especially for real-time inference. Traditional Mixture-of-Experts (MoE) techniques have attempted to address this issue, but often fall short in achieving optimal efficiency.
- ARank Metric: Measures the rank of each layer's attention maps to guide the replacement of redundant dense layers with MoD layers.
- Shared Vision-Language Router: Facilitates cross-modality token routing.
- Masked Routing Learning: Prevents critical tokens from being skipped during model adaptation (see the illustrative sketch after this list).
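For intuition, below is a minimal, self-contained PyTorch sketch of how an MoD layer with a shared vision-language router and masked routing could be wired up. The class and argument names (`MoDLayer`, `keep_mask`, `capacity`) are illustrative assumptions and do not mirror the repository's actual implementation (see `llava_llama_mod.py` for that).

```python
import torch
import torch.nn as nn

class MoDLayer(nn.Module):
    """Illustrative MoD wrapper: a shared router scores every token (image or
    text alike) and only the top-k tokens pass through the expensive dense
    block; the remaining tokens skip it along the residual path."""

    def __init__(self, dense_layer: nn.Module, hidden_size: int, capacity: float = 0.5):
        super().__init__()
        self.dense_layer = dense_layer            # original transformer block
        self.router = nn.Linear(hidden_size, 1)   # shared vision-language router
        self.capacity = capacity                  # fraction of tokens to keep

    def forward(self, hidden_states: torch.Tensor, keep_mask: torch.Tensor = None):
        # hidden_states: (batch, seq_len, hidden); keep_mask: (batch, seq_len) bool
        scores = self.router(hidden_states).squeeze(-1)          # (batch, seq_len)
        if keep_mask is not None:
            # Masked routing: critical tokens (e.g. question tokens) are forced
            # into the dense block by giving them the highest possible score.
            scores = scores.masked_fill(keep_mask, float("inf"))

        k = max(1, int(self.capacity * hidden_states.size(1)))
        top_idx = scores.topk(k, dim=-1).indices                 # tokens that enter the block

        gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, hidden_states.size(-1))
        selected = torch.gather(hidden_states, 1, gather_idx)
        processed = self.dense_layer(selected)

        # Blend by the (sigmoid) router score so the router receives gradients.
        gate = torch.sigmoid(torch.gather(scores, 1, top_idx)).unsqueeze(-1)
        output = hidden_states.clone()
        output.scatter_(1, gather_idx, selected + gate * (processed - selected))
        return output
```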
- Training time: Reduced by 31%.
- Inference time: Reduced by 53.2%.
- FLOPs Reduction: 51.6% with minimal impact on accuracy.
Our routing visualization (Fig. 4) reveals the following patterns:
- Consistent Routing Patterns (Fig. 4a):
  - Question tokens are mostly retained
  - Image tokens show the highest redundancy and are routed out the most
  - Response tokens fall between these two extremes
- Efficient Content Skipping (Fig. 4b):
  - Gray areas in images represent skipped tokens (often background or less relevant pixels)
  - White areas highlight regions the model focuses on more intensely
- Improved Focus on Critical Information:
  - By routing out redundant tokens, the model can allocate more computational resources to important areas
  - Example: in the IQ test image (middle of the first row), the model concentrates on the arithmetic and geometric aspects, leading to more accurate responses
This visualization demonstrates how $\gamma$-MoD skips redundant tokens and concentrates computation on the most informative parts of the input.
(Notice: install the packages and versions required by the model you want to convert to its MoD version. The steps below are for LLaVA-HR; for Mini-Gemini, simply upgrade transformers to 4.36.2, matching its official version.)
- Clone the repository and navigate to the $\gamma$-MoD folder:
git clone https://github.com/Yaxin9Luo/Gamma-MOD.git
cd Gamma-MOD
- Create and activate a new conda environment:
conda create -n gamma-mod python=3.10 -y
conda activate gamma-mod
- Upgrade pip and install the package:
pip install --upgrade pip # enable PEP 660 support
pip install -e .
- Install additional packages for training:
pip install ninja
pip install flash-attn --no-build-isolation
Please refer to the original LLaVA-HR and Mini-Gemini repositories for data preparation, or to the official repository of whichever MLLM you are using.
Important Notice: For the fine-tuning stage, you need to modify the data JSON file so that the image tokens are moved to the beginning of the sequence. You can refer to modify_data_config.py to do so, or follow the steps below:
python modify_data_config.py /path/to/your/llava_v1_5_mix665k.json /path/to/save/your/modified_llava_v1_5_mix665k.json
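For reference, the snippet below is a rough sketch of the kind of transformation this step performs on LLaVA-style conversation JSON (moving the `<image>` placeholder to the front of the first human turn). The data format and helper name are assumptions; use the repository's modify_data_config.py for the real conversion.

```python
import json
import sys

def move_image_token_to_front(sample):
    """Move the <image> placeholder to the start of the first human turn so
    that image tokens end up at the beginning of the token sequence."""
    for turn in sample.get("conversations", []):
        if turn.get("from") == "human" and "<image>" in turn["value"]:
            text = turn["value"].replace("<image>", "").strip()
            turn["value"] = "<image>\n" + text
            break
    return sample

if __name__ == "__main__":
    src_path, dst_path = sys.argv[1], sys.argv[2]
    with open(src_path) as f:
        data = json.load(f)
    data = [move_image_token_to_front(s) for s in data]
    with open(dst_path, "w") as f:
        json.dump(data, f)
```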
Please download the caption annotations blip_laion_cc_sbu_558k.json and images from here. Move the downloaded files to the /data/data folder. Then run the following command to start the training process:
bash scripts/v1_5/pretrain_llava_hr.sh
We recommend using the pre-trained projectors directly; here are the links from the official LLaVA-HR and Mini-Gemini releases.
Version | Vision Encoder | Projection | Pretrain Data | Pretraining schedule | Download |
---|---|---|---|---|---|
LLaVA-HR-7b | CLIP-L & ConvNeXt-L | MLP-2x | LCS-558K | 1e | projector |
LLaVA-HR-X-13b | CLIP-L & ConvNeXt-XXL | MLP-2x | LCS-558K | 1e | projector |
Mini-Gemini-HD-7b | CLIP-L | MLP-2x | MGM-Pretrain | 1e | projector |
Please run the stage-1 alignment model on any dataset you wish in order to compute the ARank. We will use SQA as an example.
bash scripts/v1_5/eval_full/arank.sh /path/to/your/stage1_checkpoint
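Conceptually, ARank estimates how low-rank a layer's attention maps are on a small calibration set; layers with low ARank are considered redundant. The sketch below shows one way such an estimate could be computed in PyTorch; the tensor layout, threshold, and function names are assumptions, not the script's actual logic.

```python
import torch

@torch.no_grad()
def attention_rank(attn_maps: torch.Tensor, eps: float = 1e-3) -> float:
    """Estimate the average rank of one layer's attention maps.

    attn_maps: (num_heads, seq_len, seq_len) attention weights collected from
    the stage-1 model on a small calibration set (e.g. SQA samples).
    A low average rank suggests the layer contributes little and is a good
    candidate for replacement with an MoD layer.
    """
    ranks = []
    for head in attn_maps:
        # Singular values far below the largest one are treated as noise.
        s = torch.linalg.svdvals(head.float())
        ranks.append((s > eps * s[0]).sum().item())
    return sum(ranks) / len(ranks)

# Example: average the estimate over calibration samples for every layer.
# per_layer_maps[l] is a list of (num_heads, seq_len, seq_len) tensors.
# arank = {l: sum(map(attention_rank, maps)) / len(maps)
#          for l, maps in per_layer_maps.items()}
```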
We also provide the stage-1 checkpoint for your convenience.
Version | Download |
---|---|
| model |
| model |
After you obtain the ARank, you can use it to replace the redundant dense layers in the original model. Refer to the llava_llama_mod.py file and the initialize_mod_modules function. Then train the model with the following command:
bash /path/to/your/fine_tune_mod.sh
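The replacement step boils down to selecting the layers with the lowest ARank and wrapping their dense blocks in MoD layers before fine-tuning. The helper below is a hypothetical sketch of that logic (reusing the illustrative `MoDLayer` from the earlier sketch); the repository's actual entry point is `initialize_mod_modules` in `llava_llama_mod.py`.

```python
def replace_redundant_layers(model, arank_per_layer, num_mod_layers, capacity=0.5):
    """Wrap the num_mod_layers layers with the lowest ARank in MoD layers.

    model:            a decoder-only LLM whose transformer blocks live in model.layers
    arank_per_layer:  {layer_index: average ARank} measured on calibration data
    """
    # Lowest attention rank == most redundant == replace with MoD.
    redundant = sorted(arank_per_layer, key=arank_per_layer.get)[:num_mod_layers]
    hidden_size = model.config.hidden_size
    for idx in redundant:
        # MoDLayer is the illustrative wrapper sketched earlier in this README.
        model.layers[idx] = MoDLayer(model.layers[idx], hidden_size, capacity)
    return model
```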
We also provide the stage-2 sft checkpoint for your convenience.
Version | Download |
---|---|
| model |
| model |
| model |
| model |
| model |
We follow LLaVA-v1.5 to conduct evaluations. You should download eval.zip and unzip it to ./playground/data/eval. Please refer to Evaluation.md to prepare the data. Then you can run our evaluation script: bash scripts/v1_5/eval.sh
- LLaVA-HR: Training time reduced by 31% and inference time by 53.2%, with only 1.5% accuracy drop.
- Mini-Gemini-HD: Training time reduced by 41% and inference time by 58.1%, with only 1.0% accuracy drop.
- Generalization: Demonstrated the ability to generalize across different MLLMs.
Model | Training Time Reduction | Inference Time Reduction | Accuracy Change |
---|---|---|---|
LLaVA-HR | 31.0% | 53.2% | -1.5% |
 | 18.8% | 50.4% | -0.3% |
 | 17.4% | 58.6% | +0.4% |
Mini-Gemini-HD | 41.0% | 58.1% | -1.0% |
For more details, check the full report.
If you use $\gamma$-MoD in your research, please cite our paper:
@misc{luo2024gammamodexploringmixtureofdepthadaptation,
title={$\gamma-$MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models},
author={Yaxin Luo and Gen Luo and Jiayi Ji and Yiyi Zhou and Xiaoshuai Sun and Zhiqiang Shen and Rongrong Ji},
year={2024},
eprint={2410.13859},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2410.13859},
}
For questions, please reach out to Yaxin Luo.
This project is licensed under the MIT License - see the LICENSE file for details.
Special thanks to all contributors, and to the LLaVA, LLaVA-HR, and MGM projects for their codebases.
We are also thankful to LLaVA-pp and MoE-LLaVA for releasing their models and code as open-source contributions.