Installation | Quick Start | Training | Evaluation
This is an implementation of our method MERV, Multi-Encoder Representation of Videos. We provide a simple and efficient codebase for training video-based large language models (VideoLLMs), particularly with multiple visual encoders for extracting visual information.
- Different Visual Representations. We natively support vision backbones such as SigLIP and DINOv2, as well as video backbones like LanguageBind and ViViT, and even fusions of different backbones. Adding new backbones is easy via TIMM or Huggingface (see the brief sketch after this list). Using multiple encoders is a first-class feature, so it is easy to configure each encoder independently (for example, with different frame rates).
- Base and Instruct-Tuned Language Models. We support arbitrary instances of AutoModelForCausalLM, including both base and instruct-tuned models (with built-in prompt handling), via Transformers.
- Fast, Efficient Training. Our models are trained with PyTorch FSDP and Flash-Attention, making them much quicker to train than other codebases. For example, in our testing, the Video-LLaVA codebase took ~80 hours total to train a model that ours trains in ~24 hours. This also means that training with multiple visual encoders incurs minimal overhead.
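To give a flavor of the backbone ecosystems mentioned above, here is a minimal sketch that loads two such encoders directly from TIMM and Hugging Face; the specific model identifiers are illustrative assumptions, and MERV wraps backbones like these behind its own vision-backbone classes rather than loading them this way.
# Illustrative only: the identifiers below are assumptions, not the exact checkpoints used by MERV.
import timm
from transformers import AutoModel
siglip = timm.create_model("vit_so400m_patch14_siglip_224", pretrained=True, num_classes=0)
dinov2 = AutoModel.from_pretrained("facebook/dinov2-large")
print(type(siglip).__name__, type(dinov2).__name__)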
This repository was built using Python 3.10, but should be backwards compatible with any Python >= 3.8. We require PyTorch 2.1 or greater; installation instructions can be found here. This repository was developed and thoroughly tested with PyTorch 2.1.0, Torchvision 0.16.0, and Flash-Attention 2.3.3.
Once PyTorch has been properly installed, you can install this package locally via an editable installation.
git clone https://github.com/princetonvisualai/merv.git
cd merv
conda create -n merv python=3.10 -y
conda activate merv
pip install -e .
# Training additionally requires Flash-Attention 2 (https://github.com/Dao-AILab/flash-attention)
# Verify Ninja --> should return exit code "0"
ninja --version; echo $?
# Install Flash Attention 2
# =>> If you run into difficulty, try `pip cache remove flash_attn` first
pip install flash-attn --no-build-isolation
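Once it builds, you can sanity-check the Flash-Attention installation from Python (an optional check):
# Optional: verify that flash_attn imports and report its version (tested with 2.3.3).
import flash_attn
print(flash_attn.__version__)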
Additionally, request access here to use LLaMA-2, generate an access token, and put it in .hf_token.
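For example, the token can be written to that file from Python; the placeholder string below is an assumption that you should replace with your own Hugging Face token.
# Store your Hugging Face access token where the quick-start script expects it.
from pathlib import Path
Path(".hf_token").write_text("hf_xxxxxxxxxxxxxxxx")  # placeholder; use your real token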
If you run into any problems during the installation process, please file a GitHub Issue.
For inference, we suggest having at least 80GB of CPU memory and 24GB of GPU memory.
We have tested our model on a single RTX 3090.
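If you are unsure whether your GPU meets this, a quick check (assuming a CUDA device is visible) is:
import torch
# Report total memory of GPU 0; merv-full inference is most comfortable with >= 24 GB.
if torch.cuda.is_available():
    total_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU 0 memory: {total_gib:.1f} GiB")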
See scripts/quick_start.py for a simple example (shown below), and merv/models/registry.py for a list of available models: our two main models merv and merv-full, as well as some single-encoder baselines for testing.
from pathlib import Path
import torch
from merv import load_vid
hf_token = Path(".hf_token").read_text().strip()
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
# Or a local path if models are locally downloaded
vidlm = load_vid("merv-full", hf_token=hf_token)
vidlm.to(device, dtype=torch.bfloat16)
# Run on an example Perception Test video and specify a prompt
video_path = "./assets/video_10336_short.mp4"
user_prompt = "Describe what is happening in this video."
# Build prompt
prompt_builder = vidlm.get_prompt_builder()
prompt_builder.add_turn(role="human", message=user_prompt)
prompt_text = prompt_builder.get_prompt()
# Generate!
generated_text = vidlm.generate(
    video_path,
    prompt_text,
    num_frames=[16, 16, 32, 16],  # one frame count per visual encoder; get from model config
    do_sample=True,
    temperature=0.4,
    max_new_tokens=512,
    min_length=1,
)
print(generated_text)
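The same pattern extends to asking several questions about one video. The helper below is a small sketch of ours (not part of the MERV API) that reuses only the calls shown above:
# Hypothetical convenience wrapper around the quick-start calls above.
def ask(vidlm, video_path, question, num_frames):
    builder = vidlm.get_prompt_builder()
    builder.add_turn(role="human", message=question)
    return vidlm.generate(
        video_path,
        builder.get_prompt(),
        num_frames=num_frames,
        do_sample=True,
        temperature=0.4,
        max_new_tokens=512,
        min_length=1,
    )
for question in ["Describe what is happening in this video.", "How many people appear in the video?"]:
    print(question, "->", ask(vidlm, video_path, question, num_frames=[16, 16, 32, 16]))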
Training instructions are in TRAINING.md.
We evaluate on a diverse set of tasks.
- Preparation for MSVD, MSRVTT, TGIF, and ActivityNet follows that of Video-LLaVA.
- Perception Test can be found here.
- Preparation for NExT-QA, VLEP, and TVQA follows that of SeViLA.
We follow the Video-ChatGPT protocol for evaluation, but use the same prompts as Video-LLaVA for a consistent comparison.
Note that the API model is always subject to change; we query gpt-3.5-turbo-0613.
We provide some example scripts for evaluation.
We run inference in parallel, and then we run GPT evaluation once all of the inference is done.
For open-ended QA, first create a file .oai_keys.yaml with the following content for GPT API access:
- api_key: 123456
api_version: 2023-0613-preview
api_endpoint: https://api.openai.com
- api_key: 123457
api_version: 2023-0613-preview
api_endpoint: https://api.openai2.com
...
Then edit scripts/eval_gpt.py according to which API provider you use.
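For reference, the file is a YAML list of credential entries. A minimal sketch of how such a list can be loaded and rotated through (the actual handling in scripts/eval_gpt.py may differ):
# Illustrative sketch: load .oai_keys.yaml and cycle through the credential entries.
import itertools
import yaml  # pip install pyyaml
with open(".oai_keys.yaml") as f:
    keys = yaml.safe_load(f)  # a list of {api_key, api_version, api_endpoint} dicts
key_cycle = itertools.cycle(keys)
cred = next(key_cycle)
print(cred["api_endpoint"], cred["api_version"])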
The following scripts will run inference and GPT evaluation.
# In parallel, run inference jobs.
python scripts/eval_openended.py --model_path ${CKPT_NAME} --eval_dataset ${BENCHMARK} \
    --num_chunks ${CHUNKS} \
    --chunk_idx ${CHUNK_ID}
# ... wait for all jobs to finish ...
# Then run GPT on the results; API keys taken from .oai_keys.yaml
python scripts/eval_gpt.py \
    --pred_path ${output_file} \
    --output_file eval_result/${CKPT_NAME}/${BENCHMARK}_gpt.json
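For context, --num_chunks and --chunk_idx split the evaluation set across the parallel inference jobs. A minimal sketch of that chunking convention (the exact splitting inside scripts/eval_openended.py may differ):
# Split the questions into num_chunks roughly equal pieces and pick the chunk_idx-th one.
import math
def get_chunk(items, num_chunks, chunk_idx):
    chunk_size = math.ceil(len(items) / num_chunks)
    return items[chunk_idx * chunk_size : (chunk_idx + 1) * chunk_size]
questions = list(range(10))  # stand-in for the evaluation questions
print(get_chunk(questions, num_chunks=4, chunk_idx=1))  # -> [3, 4, 5]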
For MCQ-based tasks, we use the following script:
python scripts/eval_mcq.py --model_path ${CKPT_NAME} --eval_dataset ${BENCHMARK} \
    --num_chunks ${CHUNKS} \
    --chunk_idx ${CHUNK_ID} \
    --filename_question ${FILENAMEQUESTION} \
    --filename_answer ${FILENAMEANSWER} \
    --full_path_ckpt ${FULLPATH} \
    --strategy ${STRATEGY}
where STRATEGY specifies the answer-matching strategy for the MCQ task (naive by default); GPT evaluation is not necessary for MCQ.
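As an illustration, one plausible reading of a naive strategy is direct string matching between the generated answer and the candidate options; this is only a sketch of ours, and the matching in scripts/eval_mcq.py may differ.
# Hypothetical naive MCQ matching: accept if the prediction names the correct option
# letter or contains the correct option text.
def naive_match(prediction, options, answer_idx):
    pred = prediction.strip().lower()
    letter = chr(ord("a") + answer_idx)
    return pred.startswith((letter + ")", letter + ".", "(" + letter)) or options[answer_idx].lower() in pred
options = ["a red ball", "a blue cube", "a green cone"]
print(naive_match("(B) a blue cube", options, answer_idx=1))  # True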
If you find our work useful, please cite our paper.
@misc{chung2024unifying,
  title={Unifying Specialized Visual Encoders for Video Language Models},
  author={Jihoon Chung and Tyler Zhu and Max Gonzalez Saez-Diez and Juan Carlos Niebles and Honglu Zhou and Olga Russakovsky},
  year={2024},
  eprint={2306.XXXX},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
Additionally, if this repository is useful, please also cite the original authors of Prismatic VLMs, who created the fantastic codebase that ours builds on.
@inproceedings{karamcheti2024prismatic,
  title = {Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models},
  author = {Siddharth Karamcheti and Suraj Nair and Ashwin Balakrishna and Percy Liang and Thomas Kollar and Dorsa Sadigh},
  booktitle = {International Conference on Machine Learning (ICML)},
  year = {2024},
}