This directory contains the evaluation scripts and benchmarks for the Cambrian-1 multimodal language model. It includes a wide range of benchmarks to assess the model's performance across various tasks and domains.
The following benchmarks are included:
- GQA
- VizWiz
- ScienceQA
- TextVQA
- POPE
- MME
- MMBench (English and Chinese)
- SEED
- MMMU
- MathVista
- AI2D
- ChartQA
- DocVQA
- InfoVQA
- STVQA
- OCRBench
- MMStar
- RealWorldQA
- SynthDog
- QBench
- BLINK
- MMVP
- VStar
- ADE
- OMNI
- COCO
Each benchmark has its own subdirectory under `eval/` containing:
- an evaluation script (`*_eval.py`), which generates answers and saves them in a `.jsonl` file
- a testing script (`*_test.py`), which reads in the `.jsonl` answers file, performs any necessary post-processing and matching with ground truth, and appends the results to a common `.csv` file for the benchmark, keyed on the `model_id` and `time` of the evaluation (see the sketch below)
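To make this flow concrete, here is a minimal Python sketch of the two-step pattern described above. The field names (`question_id`, `answer`, `model_id`, `time`, `accuracy`) and file paths are assumptions for illustration, not the exact schema used by the Cambrian scripts.

```python
# Illustrative sketch only -- field names and paths are assumptions, not the
# exact schema used by the Cambrian *_eval.py / *_test.py scripts.
import csv
import json
import os
from datetime import datetime, timezone

ANSWERS_PATH = "answers/example_answers.jsonl"   # hypothetical *_eval.py output
RESULTS_CSV = "example_results.csv"              # hypothetical shared benchmark CSV

# 1) What an *_eval.py-style script might write: one JSON object per line.
os.makedirs(os.path.dirname(ANSWERS_PATH), exist_ok=True)
with open(ANSWERS_PATH, "w") as f:
    f.write(json.dumps({"question_id": "q001", "answer": "B"}) + "\n")
    f.write(json.dumps({"question_id": "q002", "answer": "a red bus"}) + "\n")

# 2) What a *_test.py-style script might do: read the answers, score them
#    against ground truth, and append one row to the benchmark's shared CSV,
#    keyed on model_id and time.
ground_truth = {"q001": "B", "q002": "a red bus"}
with open(ANSWERS_PATH) as f:
    answers = [json.loads(line) for line in f]
accuracy = sum(a["answer"] == ground_truth.get(a["question_id"]) for a in answers) / len(answers)

row = {
    "model_id": "cambrian-8b",
    "time": datetime.now(timezone.utc).isoformat(),
    "accuracy": accuracy,
}
write_header = not os.path.exists(RESULTS_CSV)
with open(RESULTS_CSV, "a", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(row))
    if write_header:
        writer.writeheader()
    writer.writerow(row)
```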
- Ensure you have the required dependencies installed. You can find these in the `requirements.txt` file in this subdirectory. You also need the Cambrian codebase installed in the parent directory.
- The datasets will be downloaded automatically when you run the evaluation scripts.
To run evaluations, use the `run_benchmark.sh` script in the `scripts/` directory. Here's the basic usage:

```bash
bash scripts/run_benchmark.sh --benchmark <benchmark_name> --ckpt <path_to_checkpoint> --conv_mode <conversation_mode>
```
For example:

```bash
bash scripts/run_benchmark.sh --benchmark mmmu --ckpt /path/to/cambrian/checkpoint --conv_mode llama_3
```
Or, using the `nyu-visionx/cambrian-8b` HF model:

```bash
bash scripts/run_benchmark.sh --benchmark mmmu --ckpt nyu-visionx/cambrian-8b --conv_mode llama_3
```
To sequentially run all benchmarks for a single checkpoint, use the `run_all_benchmarks.sh` script:

```bash
bash scripts/run_all_benchmarks.sh /path/to/cambrian/checkpoint llama_3
```
This script will run all implemented benchmarks and save progress in a checkpoint file.
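For intuition, here is a minimal Python sketch of the resume-from-checkpoint idea. The actual runner is a bash script; the progress-file name and the subprocess call shown here are assumptions for illustration only.

```python
# Minimal sketch of resumable benchmark running, not the actual
# run_all_benchmarks.sh logic. Progress-file name and benchmark subset
# are hypothetical.
import pathlib
import subprocess

BENCHMARKS = ["gqa", "mmmu", "mathvista"]             # subset for illustration
CKPT = "/path/to/cambrian/checkpoint"
CONV_MODE = "llama_3"
progress_file = pathlib.Path("benchmarks_done.txt")   # hypothetical progress file

done = set(progress_file.read_text().split()) if progress_file.exists() else set()
for bench in BENCHMARKS:
    if bench in done:
        continue  # already evaluated; skip on re-runs
    subprocess.run(
        ["bash", "scripts/run_benchmark.sh",
         "--benchmark", bench, "--ckpt", CKPT, "--conv_mode", CONV_MODE],
        check=True,
    )
    with progress_file.open("a") as f:
        f.write(bench + "\n")  # record completion so a restart resumes here
```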
Tip
See the slurm/ dir for instructions on running evaluations in parallel on HPC clusters using Slurm.
After running the evaluations, you can use the `tabulate.py` script to compile the results:

```bash
python scripts/tabulate.py --eval_dir eval --experiment_csv experiments.csv --out_pivot pivot.xlsx --out_all_results all_results.csv
```
This will generate:
- a long CSV file (`all_results.csv`) with all compiled results
- an Excel/CSV file (`pivot.xlsx`/`pivot.csv`) with the final metrics for each benchmark as columns and each evaluated model as a row (a sketch of this pivot step follows the list)
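The pivot step can be thought of as a long-to-wide reshape. Below is a minimal pandas sketch under assumed column names (`model_id`, `benchmark`, `value`) and placeholder numbers; it is not the actual `tabulate.py` implementation.

```python
# Illustrative sketch only -- column names and values are placeholders, not
# the exact schema produced by the Cambrian eval scripts.
import pandas as pd

# A tiny stand-in for all_results.csv: one row per (model, benchmark) result.
long_df = pd.DataFrame(
    {
        "model_id": ["model_a", "model_a", "model_b", "model_b"],
        "benchmark": ["mmmu", "gqa", "mmmu", "gqa"],
        "value": [0.50, 0.60, 0.55, 0.65],   # placeholder metrics
    }
)

# Pivot to one row per model and one column per benchmark, mirroring the
# shape of the pivot.xlsx / pivot.csv output described above.
pivot = long_df.pivot(index="model_id", columns="benchmark", values="value")
print(pivot)
# pivot.to_excel("pivot.xlsx")  # writing Excel output requires openpyxl
```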
If you want to add a new benchmark or improve existing ones, please follow the structure of the existing benchmark directories and update the `run_all_benchmarks.sh` script accordingly.
- Planned: add a GPT/LLM evaluation option to grade answers instead of the manual/fuzzy matching currently used in the `*_test.py` files