# Evaluation

In LLaVA-1.5, we evaluate models on a diverse set of 12 benchmarks. To ensure reproducibility, we evaluate the models with greedy decoding. We do not use beam search, so that the inference process matches the real-time outputs of the chat demo.

Currently, we mostly use the official toolkits or servers for evaluation.
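For concreteness, "greedy decoding" corresponds to the following generation settings in a Hugging Face-style `generate` API. This is only an illustrative sketch; the actual eval scripts may set these flags differently, and `max_new_tokens` here is an arbitrary example value.

```python
# Hedged sketch: kwargs that yield greedy decoding with a Hugging
# Face-style `model.generate(...)` call. Not taken from the eval scripts.
GREEDY_GENERATION_KWARGS = {
    "do_sample": False,     # disable sampling -> deterministic outputs
    "num_beams": 1,         # no beam search, matching the real-time chat demo
    "max_new_tokens": 128,  # illustrative cap; benchmarks may use other values
}

# Example call site (model/input names are placeholders):
# outputs = model.generate(input_ids, images=image_tensor, **GREEDY_GENERATION_KWARGS)
```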

## Evaluate on Custom Datasets

You can evaluate LLaVA on your custom datasets by converting your dataset to LLaVA's jsonl format and evaluating with `model_vqa.py`.

Below we provide general guidelines for evaluating datasets in some common formats.

1. Short-answer (e.g. VQAv2, MME):

```
<question>
Answer the question using a single word or phrase.
```

2. Option-only for multiple-choice (e.g. MMBench, SEED-Bench):

```
<question>
A. <option_1>
B. <option_2>
C. <option_3>
D. <option_4>
Answer with the option's letter from the given choices directly.
```

3. Natural QA (e.g. LLaVA-Bench, MM-Vet): no postprocessing is needed.
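As a minimal sketch of the conversion step, a short-answer dataset could be written out like this. The field names (`question_id`, `image`, `text`) are assumed to match the question files consumed by `model_vqa.py`; verify against that script before use, and note that `custom_data` is a made-up placeholder.

```python
import json

# Hypothetical input: a list of (image filename, question) pairs.
custom_data = [
    ("0001.jpg", "What color is the car?"),
    ("0002.jpg", "How many people are in the image?"),
]

# Prompt suffix for short-answer benchmarks such as VQAv2 and MME.
SHORT_ANSWER_SUFFIX = "\nAnswer the question using a single word or phrase."

def to_llava_jsonl(samples, out_path):
    """Write one JSON object per line in the format assumed for
    model_vqa.py question files: question_id, image, text."""
    with open(out_path, "w") as f:
        for qid, (image, question) in enumerate(samples):
            record = {
                "question_id": qid,
                "image": image,
                "text": question + SHORT_ANSWER_SUFFIX,
            }
            f.write(json.dumps(record) + "\n")

to_llava_jsonl(custom_data, "custom_questions.jsonl")
```

The suffix is appended per question (rather than via the system prompt) so that each jsonl line is self-contained.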

## Scripts

Before preparing task-specific data, download `eval.zip`. It contains custom annotations, scripts, and the LLaVA v1.5 prediction files. If you want to evaluate models on coco-caption, ocrvqa, okvqa, and refcoco, also download `eval_aug.zip`. Extract the archives to `./playground/data/eval`; this also provides the general directory structure for all datasets.
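The extraction step above could be scripted as follows, using only the standard library. This is a convenience sketch, assuming the downloaded archives sit in the current directory; `unzip` on the command line works just as well.

```python
import os
import zipfile

def extract_eval_archives(archives=("eval.zip",), dest="./playground/data/eval"):
    """Extract each downloaded archive into the shared eval directory.
    Add "eval_aug.zip" to `archives` when evaluating coco-caption,
    ocrvqa, okvqa, or refcoco."""
    os.makedirs(dest, exist_ok=True)
    for name in archives:
        if os.path.exists(name):
            with zipfile.ZipFile(name) as zf:
                zf.extractall(dest)
        else:
            print(f"missing archive: {name}")
```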

### VQAv2

1. Download `test2015` and put it under `./playground/data/eval/vqav2`.
2. Multi-GPU inference.

```Shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval_full/vqav2.sh $Weight
```

3. Submit the results to the evaluation server: `./playground/data/eval/vqav2/answers_upload`.
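The VQA evaluation server expects a single JSON list of `{"question_id", "answer"}` objects. The provided script already produces the upload file, so the following is only an illustrative sketch of that conversion; the per-line prediction fields (`question_id`, `text`) are assumptions about the `model_vqa.py` output format.

```python
import json

def answers_jsonl_to_upload(jsonl_path, out_path):
    """Convert per-line predictions (assumed fields: question_id, text)
    into the [{"question_id": ..., "answer": ...}] list format used by
    the VQA challenge evaluation server."""
    results = []
    with open(jsonl_path) as f:
        for line in f:
            rec = json.loads(line)
            results.append({"question_id": rec["question_id"],
                            "answer": rec["text"]})
    with open(out_path, "w") as f:
        json.dump(results, f)
```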

### GQA

1. Download the data following the official instructions here and put it under `./playground/data/eval/gqa/data`.
2. Multi-GPU inference.

```Shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval_full/gqa.sh $Weight
```

### VisWiz

1. Download `test.json` and extract `test.zip` to `test`. Put them under `./playground/data/eval/vizwiz`.
2. Single-GPU inference.

```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval_full/vizwiz.sh $Weight
```

3. Submit the results to the evaluation server: `./playground/data/eval/vizwiz/answers_upload`.

### ScienceQA

1. Under `./playground/data/eval/scienceqa`, download `images`, `pid_splits.json`, and `problems.json` from the `data/scienceqa` folder of the ScienceQA repo.
2. Single-GPU inference and evaluation.

```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval_full/sqa.sh $Weight
```

### TextVQA

1. Download `TextVQA_0.5.1_val.json` and the images, and extract them to `./playground/data/eval/textvqa`.
2. Single-GPU inference and evaluation.

```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval_full/textvqa.sh $Weight
```

### POPE

1. Download `coco` from POPE and put it under `./playground/data/eval/pope`.
2. Single-GPU inference and evaluation.

```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval_full/pope.sh $Weight
```

### MME

1. Download the data following the official instructions here.
2. Put the downloaded images under `MME_Benchmark_release_version`.
3. Put the official `eval_tool` and `MME_Benchmark_release_version` under `./playground/data/eval/MME`.
4. Single-GPU inference and evaluation.

```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval_full/mme.sh $Weight
```

### MMBench

1. Download `mmbench_dev_20230712.tsv` and put it under `./playground/data/eval/mmbench`.
2. Single-GPU inference.

```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval_full/mmbench.sh $Weight
```

3. Submit the results to the evaluation server: `./playground/data/eval/mmbench/answers_upload/mmbench_dev_20230712`.

### MMBench-CN

1. Download `mmbench_dev_cn_20231003.tsv` and put it under `./playground/data/eval/mmbench`.
2. Single-GPU inference.

```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval_full/mmbench_cn.sh $Weight
```

3. Submit the results to the evaluation server: `./playground/data/eval/mmbench/answers_upload/mmbench_dev_cn_20231003`.

### SEED-Bench

1. Follow the official instructions to download the images and the videos. Put the images under `./playground/data/eval/seed_bench/SEED-Bench-image`.
2. Extract the middle frame from each downloaded video and put the frames under `./playground/data/eval/seed_bench/SEED-Bench-video-image`. We provide our script `extract_video_frames.py`, modified from the official one.
3. Multi-GPU inference and evaluation.

```Shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval_full/seed.sh $Weight
```

4. Optionally, submit the results (`./playground/data/eval/seed_bench/answers_upload`) to the leaderboard using the official Jupyter notebook.
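For reference, the middle-frame step amounts to seeking to half the clip's duration and grabbing a single frame. Below is a hedged sketch that builds the corresponding `ffmpeg` command line (assuming `ffmpeg` is installed and the duration is known, e.g. from `ffprobe`); the provided `extract_video_frames.py` may select the frame differently, so treat this only as an illustration.

```python
def middle_frame_cmd(video_path, out_path, duration_s):
    """Build an ffmpeg argv that seeks to the temporal midpoint of the
    clip and writes exactly one frame. `duration_s` is the clip length
    in seconds, e.g. obtained beforehand via ffprobe."""
    return [
        "ffmpeg", "-y",
        "-ss", f"{duration_s / 2:.3f}",  # seek to the middle of the video
        "-i", video_path,
        "-frames:v", "1",                # keep exactly one frame
        out_path,
    ]

# Example usage:
# import subprocess
# subprocess.run(middle_frame_cmd("clip.mp4", "clip_mid.jpg", 8.0), check=True)
```

Placing `-ss` before `-i` makes ffmpeg seek before decoding, which is much faster for long clips.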

### LLaVA-Bench-in-the-Wild

1. Extract the contents of `llava-bench-in-the-wild` to `./playground/data/eval/llava-bench-in-the-wild`.
2. Single-GPU inference and evaluation.

```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval_full/llavabench.sh $Weight
```

### MM-Vet

1. Extract `mm-vet.zip` to `./playground/data/eval/mmvet`.
2. Single-GPU inference.

```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval_full/mmvet.sh $Weight
```

3. Evaluate the predictions in `./playground/data/eval/mmvet/results` using the official Jupyter notebook.

### COCO-Caption

1. Download `val2014` and put it under `./playground/data/eval/coco-caption`.
2. Multi-GPU inference.

```Shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval_full/coco_caption.sh $Weight
```

### RefCOCO

1. Download `train2014` and put it under `./playground/data/eval/refcoco`.
2. Multi-GPU inference.

```Shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval_full/rec.sh $Weight
```

### OCRVQA

1. Multi-GPU inference.

```Shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval_full/ocrvqa.sh $Weight
```

### OKVQA

1. Multi-GPU inference.

```Shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval_full/okvqa.sh $Weight
```