This directory contains the evaluation scripts and benchmarks for the Cambrian-1 multimodal language model. It includes a wide range of benchmarks to assess the model's performance across various tasks and domains.
The following benchmarks are included:
- GQA
- VizWiz
- ScienceQA
- TextVQA
- POPE
- MME
- MMBench (English and Chinese)
- SEED
- MMMU
- MathVista
- AI2D
- ChartQA
- DocVQA
- InfoVQA
- STVQA
- OCRBench
- MMStar
- RealWorldQA
- SynthDog
- QBench
- BLINK
- MMVP
- VStar
- ADE
- OMNI
- COCO
Each benchmark has its own subdirectory under `eval/` containing:
- an evaluation script (`*_eval.py`), which generates answers and saves them in a `.jsonl` file
- a testing script (`*_test.py`), which reads in the `.jsonl` answers file, performs any necessary post-processing and matching with ground truth, and appends the results to a common `.csv` file for the benchmark, keyed on the `model_id` and `time` of the evaluation (see the sketch below)
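To make this flow concrete, here is a minimal Python sketch of the two-step pattern described above. The field names (`question_id`, `answer`, `model_id`, `time`, `accuracy`) and file paths are assumptions for illustration, not the exact schema used by the Cambrian scripts.

```python
# Illustrative sketch only -- field names and paths are assumptions, not the
# exact schema used by the Cambrian *_eval.py / *_test.py scripts.
import csv
import json
import os
from datetime import datetime, timezone

ANSWERS_PATH = "answers/example_answers.jsonl"   # hypothetical *_eval.py output
RESULTS_CSV = "example_results.csv"              # hypothetical shared benchmark CSV

# 1) What an *_eval.py-style script might write: one JSON object per line.
os.makedirs(os.path.dirname(ANSWERS_PATH), exist_ok=True)
with open(ANSWERS_PATH, "w") as f:
    f.write(json.dumps({"question_id": "q001", "answer": "B"}) + "\n")
    f.write(json.dumps({"question_id": "q002", "answer": "a red bus"}) + "\n")

# 2) What a *_test.py-style script might do: read the answers, score them
#    against ground truth, and append one row to the benchmark's shared CSV,
#    keyed on model_id and time.
ground_truth = {"q001": "B", "q002": "a red bus"}
with open(ANSWERS_PATH) as f:
    answers = [json.loads(line) for line in f]
accuracy = sum(a["answer"] == ground_truth.get(a["question_id"]) for a in answers) / len(answers)

row = {
    "model_id": "cambrian-8b",
    "time": datetime.now(timezone.utc).isoformat(),
    "accuracy": accuracy,
}
write_header = not os.path.exists(RESULTS_CSV)
with open(RESULTS_CSV, "a", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(row))
    if write_header:
        writer.writeheader()
    writer.writerow(row)
```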
- Ensure you have the required dependencies installed. You can find these in the `requirements.txt` file in this subdirectory. You also need the Cambrian codebase installed in the parent directory.
- The datasets will be downloaded automatically when you run the evaluation scripts.
To run evaluations, use the `run_benchmark.sh` script in the `scripts/` directory. Here's the basic usage:

```bash
bash scripts/run_benchmark.sh --benchmark <benchmark_name> --ckpt <path_to_checkpoint> --conv_mode <conversation_mode>
```
For example:

```bash
bash scripts/run_benchmark.sh --benchmark mmmu --ckpt /path/to/cambrian/checkpoint --conv_mode llama_3
```
Or, using the `nyu-visionx/cambrian-8b` HF model:

```bash
bash scripts/run_benchmark.sh --benchmark mmmu --ckpt nyu-visionx/cambrian-8b --conv_mode llama_3
```
To sequentially run all benchmarks for a single checkpoint, use the `run_all_benchmarks.sh` script:

```bash
bash scripts/run_all_benchmarks.sh /path/to/cambrian/checkpoint llama_3
```
This script will run all implemented benchmarks and save progress in a checkpoint file.
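For intuition, here is a minimal Python sketch of the resume-from-checkpoint idea. The actual runner is a bash script; the progress-file name and the subprocess call shown here are assumptions for illustration only.

```python
# Minimal sketch of resumable benchmark running, not the actual
# run_all_benchmarks.sh logic. Progress-file name and benchmark subset
# are hypothetical.
import pathlib
import subprocess

BENCHMARKS = ["gqa", "mmmu", "mathvista"]             # subset for illustration
CKPT = "/path/to/cambrian/checkpoint"
CONV_MODE = "llama_3"
progress_file = pathlib.Path("benchmarks_done.txt")   # hypothetical progress file

done = set(progress_file.read_text().split()) if progress_file.exists() else set()
for bench in BENCHMARKS:
    if bench in done:
        continue  # already evaluated; skip on re-runs
    subprocess.run(
        ["bash", "scripts/run_benchmark.sh",
         "--benchmark", bench, "--ckpt", CKPT, "--conv_mode", CONV_MODE],
        check=True,
    )
    with progress_file.open("a") as f:
        f.write(bench + "\n")  # record completion so a restart resumes here
```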
Tip
See the slurm/ dir for instructions on running evaluations in parallel on HPC clusters using Slurm.
After running the evaluations, you can use the `tabulate.py` script to compile the results:

```bash
python scripts/tabulate.py --eval_dir eval --experiment_csv experiments.csv --out_pivot pivot.xlsx --out_all_results all_results.csv
```
This will generate:
- a long CSV file (`all_results.csv`) with all compiled results
- an Excel/CSV file (`pivot.xlsx`/`pivot.csv`) with the final metrics for each benchmark as columns and each evaluated model as a row (a sketch of this pivot step follows the list)
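The pivot step can be thought of as a long-to-wide reshape. Below is a minimal pandas sketch under assumed column names (`model_id`, `benchmark`, `value`) and placeholder numbers; it is not the actual `tabulate.py` implementation.

```python
# Illustrative sketch only -- column names and values are placeholders, not
# the exact schema produced by the Cambrian eval scripts.
import pandas as pd

# A tiny stand-in for all_results.csv: one row per (model, benchmark) result.
long_df = pd.DataFrame(
    {
        "model_id": ["model_a", "model_a", "model_b", "model_b"],
        "benchmark": ["mmmu", "gqa", "mmmu", "gqa"],
        "value": [0.50, 0.60, 0.55, 0.65],   # placeholder metrics
    }
)

# Pivot to one row per model and one column per benchmark, mirroring the
# shape of the pivot.xlsx / pivot.csv output described above.
pivot = long_df.pivot(index="model_id", columns="benchmark", values="value")
print(pivot)
# pivot.to_excel("pivot.xlsx")  # writing Excel output requires openpyxl
```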
If you want to add a new benchmark or improve existing ones, please follow the structure of the existing benchmark directories and update the `run_all_benchmarks.sh` script accordingly.
- Planned: add a GPT/LLM evaluation option to grade answers instead of the manual/fuzzy matching currently used in the `*_test.py` files