HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments (arXiv 2024)
High-resolution Vision-Language Models (VLMs) have been widely used in multimodal tasks to enhance accuracy by preserving detailed image information. However, these models often generate excessive visual tokens due to encoding multiple partitions of the input image. Processing these excessive visual tokens is computationally challenging, especially in resource-constrained environments with commodity GPUs. To support high-resolution images while meeting resource constraints, we propose High-Resolution Early Dropping (HiRED), a token-dropping scheme that operates within a fixed token budget before the Large Language Model (LLM) stage. HiRED can be integrated with existing high-resolution VLMs in a plug-and-play manner, as it requires no additional training while maintaining superior accuracy. We strategically use the vision encoder’s attention in the initial layers to assess the visual content of each image partition and allocate the token budget accordingly. Then, using the attention in the final layer, we select the most important visual tokens from each partition within the allocated budget and drop the rest. Empirically, when applied to LLaVA-Next-7B on an NVIDIA Tesla P40 GPU, HiRED with a 20% token budget increases token generation throughput by 4.7×, reduces first-token generation latency by 15 seconds, and saves 2.3 GB of GPU memory for a single inference.
HiRED Overview:
1. Distributing Budget: A fixed token budget (e.g., 10%) is distributed across the image partitions based on their visual importance scores.
2. Selecting Top Tokens: Within each partition's allocated budget, the most important tokens are selected based on their feature importance scores, and the rest are dropped.
Visualization: For example, with a 10% token budget, HiRED distributes the budget among the image partitions (the full image and its sub-images) and selects the most important tokens from each partition. The selected tokens are shown in red boxes.
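To make the two steps above concrete, here is a minimal PyTorch sketch of the selection logic. The function name, tensor layouts, and the use of summed CLS-to-patch attention as the importance scores are illustrative assumptions on our part, not the repository's interface; the actual integration lives in modeling_llava_next.

```python
# Minimal sketch of HiRED's two-step selection (illustrative only; tensor
# layouts and scoring heuristics are assumptions, not the repo's API).
import torch

def hired_select(features, early_cls_attn, final_cls_attn, budget_ratio=0.2):
    """Keep a fixed budget of visual tokens across image partitions.

    features:        (num_partitions, num_tokens, hidden_dim) patch features
    early_cls_attn:  (num_partitions, num_tokens) CLS->patch attention from an
                     initial encoder layer (scores partition importance)
    final_cls_attn:  (num_partitions, num_tokens) CLS->patch attention from the
                     final encoder layer (ranks tokens within a partition)
    """
    num_partitions, num_tokens, _ = features.shape
    total_budget = int(budget_ratio * num_partitions * num_tokens)

    # Step 1: distribute the total budget across partitions in proportion
    # to their visual importance (sum of early-layer CLS attention).
    partition_scores = early_cls_attn.sum(dim=1)
    partition_budgets = (partition_scores / partition_scores.sum() * total_budget).long()

    # Step 2: within each partition, keep the top-k tokens ranked by
    # final-layer CLS attention and drop the rest.
    kept = []
    for p in range(num_partitions):
        k = int(partition_budgets[p].clamp(min=1, max=num_tokens))
        top_idx = final_cls_attn[p].topk(k).indices.sort().values  # preserve spatial order
        kept.append(features[p, top_idx])
    return torch.cat(kept, dim=0)  # roughly budget_ratio of the original tokens
```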
conda create --name hired python=3.10
conda activate hired
pip install -e transformers
pip install -e lmms-eval
pip install sentencepiece seaborn ipykernel
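After installation, a quick sanity check is to run a single LLaVA-Next generation through the locally installed transformers fork. The snippet below assumes the fork keeps the upstream LlavaNext API; the checkpoint name and prompt format follow the public llava-hf/llava-v1.6-vicuna-7b-hf model card.

```python
# Environment sanity check (assumes the bundled transformers fork keeps the
# upstream LLaVA-Next API; checkpoint and prompt follow the public model card).
import requests
import torch
from PIL import Image
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor

model_id = "llava-hf/llava-v1.6-vicuna-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")

url = "https://llava-vl.github.io/static/images/view.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "USER: <image>\nDescribe this image. ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda")

output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```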
- The main implementation of HiRED is in modeling_llava_next
- The single image partition version of HiRED (based on llava-1.5) is in modeling_llava
- The accuracy evaluation scripts for selected benchmarks are in accuracy_benchmarks
- The inference efficiency evaluation script (throughput, time-to-first-token latency, and GPU memory usage) is in run_HiRED_sys_report.py (see the measurement sketch after this list)
- The visualization script for HiRED token selection is in view_HiRED_token_selection.ipynb
- Our main baselines (PruMerge and PruMerge+) are implemented in prumerge_llava_next.py. To run them, paste the code from this file into modeling_llava_next.py. To toggle between PruMerge and PruMerge+, change the use_prumerge_plus flag in the code.
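For reference, the metrics reported by run_HiRED_sys_report.py (time-to-first-token latency, decoding throughput, and peak GPU memory) can be measured with standard PyTorch utilities along the lines below. This is an illustrative sketch rather than the script's actual code; model and inputs are assumed to be prepared as in the sanity-check snippet above.

```python
# Illustrative measurement sketch (not the repository script); `model` and
# `inputs` are assumed to come from the sanity-check example above.
import time
import torch

torch.cuda.reset_peak_memory_stats()

# Time-to-first-token: latency of generating a single new token.
start = time.perf_counter()
model.generate(**inputs, max_new_tokens=1)
ttft = time.perf_counter() - start

# Decoding throughput: generated tokens per second over a longer run.
start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=128)
elapsed = time.perf_counter() - start
# Output length minus prompt length approximates the generated-token count
# (exact accounting depends on how this transformers version expands image tokens).
new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
throughput = new_tokens / elapsed

peak_mem_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"TTFT: {ttft:.2f} s, throughput: {throughput:.1f} tok/s, peak memory: {peak_mem_gb:.2f} GB")
```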
If you find this work useful, please consider citing:
@misc{hasan2024hired,
      title={HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments},
      author={Kazi Hasan Ibn Arif and JinYi Yoon and Dimitrios S. Nikolopoulos and Hans Vandierendonck and Deepu John and Bo Ji},
      year={2024},
      eprint={2408.10945},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.10945},
}