forked from haotian-liu/LLaVA
Showing 34 changed files with 1,506 additions and 33 deletions.
@@ -0,0 +1,142 @@
# Evaluation

In LLaVA-1.5, we evaluate models on a diverse set of 12 benchmarks. To ensure reproducibility, we evaluate the models with greedy decoding. We do not use beam search, so that the inference process stays consistent with the real-time outputs of the chat demo.

Currently, we mostly utilize the official toolkit or server for the evaluation.

## Evaluate on Custom Datasets

You can evaluate LLaVA on your custom datasets by converting them to LLaVA's jsonl format and evaluating with [`model_vqa.py`](https://github.com/haotian-liu/LLaVA/blob/main/llava/eval/model_vqa.py); a sketch is given at the end of this section.

Below we provide a general guideline for evaluating datasets with some common formats.

1. Short-answer (e.g. VQAv2, MME).

```
<question>
Answer the question using a single word or phrase.
```

2. Option-only for multiple-choice (e.g. MMBench, SEED-Bench).

```
<question>
A. <option_1>
B. <option_2>
C. <option_3>
D. <option_4>
Answer with the option's letter from the given choices directly.
```

3. Natural QA (e.g. LLaVA-Bench, MM-Vet).

No postprocessing is needed.

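Putting these together, evaluating a custom short-answer dataset might look like the sketch below. The jsonl field names and all paths are illustrative assumptions; check `model_vqa.py` for the exact arguments it expects.

```Shell
# questions.jsonl: one JSON object per line (assumed fields), e.g.
# {"question_id": 0, "image": "0001.jpg", "text": "What color is the car?\nAnswer the question using a single word or phrase."}
python -m llava.eval.model_vqa \
    --model-path liuhaotian/llava-v1.5-13b \
    --question-file ./playground/data/eval/custom/questions.jsonl \
    --image-folder ./playground/data/eval/custom/images \
    --answers-file ./playground/data/eval/custom/answers.jsonl \
    --temperature 0 \
    --conv-mode vicuna_v1
```

The answers file is also jsonl, so you can score it with whatever metric fits your dataset.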
## Scripts

Before preparing task-specific data, download [eval.zip](https://drive.google.com/file/d/1atZSBBrAX54yYpxtVVW33zFvcnaHeFPy/view?usp=sharing). It contains custom annotations, scripts, and the prediction files of LLaVA v1.5. Extract it to `./playground/data/eval`; this also provides a general structure for all datasets.

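For example, the archive can be fetched and unpacked roughly as follows. Using `gdown` for the Google Drive download is an assumption (any download method works), and the extraction path assumes the archive unpacks into an `eval/` directory.

```Shell
pip install gdown
gdown 1atZSBBrAX54yYpxtVVW33zFvcnaHeFPy -O eval.zip   # file id taken from the link above
mkdir -p ./playground/data
unzip eval.zip -d ./playground/data                    # expected to yield ./playground/data/eval/...
```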
### VQAv2

1. Download [`test2015`](http://images.cocodataset.org/zips/test2015.zip) and put it under `./playground/data/eval/vqav2` (see the sketch after this list).
2. Multi-GPU inference.
```Shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/vqav2.sh
```
3. Submit the results to the evaluation server: `./playground/data/eval/vqav2/answers_upload`.

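A download sketch for step 1, using the URL and target directory given above:

```Shell
wget http://images.cocodataset.org/zips/test2015.zip
unzip test2015.zip -d ./playground/data/eval/vqav2     # should yield ./playground/data/eval/vqav2/test2015
```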
### GQA

1. Download the data following the official instructions [here](https://cs.stanford.edu/people/dorarad/gqa/download.html) and put it under `./playground/data/eval/gqa/data`.
2. Multi-GPU inference.
```Shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/gqa.sh
```

### VisWiz

1. Download [`test.json`](https://vizwiz.cs.colorado.edu/VizWiz_final/vqa_data/Annotations.zip) and extract [`test.zip`](https://vizwiz.cs.colorado.edu/VizWiz_final/images/test.zip) to `test`. Put them under `./playground/data/eval/vizwiz`.
2. Single-GPU inference.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/vizwiz.sh
```
3. Submit the results to the evaluation server: `./playground/data/eval/vizwiz/answers_upload`.

### ScienceQA

1. Under `./playground/data/eval/scienceqa`, download `images`, `pid_splits.json`, and `problems.json` from the `data/scienceqa` folder of the ScienceQA [repo](https://github.com/lupantech/ScienceQA).
2. Single-GPU inference and evaluation.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/sqa.sh
```

### TextVQA

1. Download [`TextVQA_0.5.1_val.json`](https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json) and [images](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip), and extract them to `./playground/data/eval/textvqa`.
2. Single-GPU inference and evaluation.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/textvqa.sh
```

### POPE

1. Download `coco` from [POPE](https://github.com/AoiDragon/POPE/tree/e3e39262c85a6a83f26cf5094022a782cb0df58d/output/coco) and put it under `./playground/data/eval/pope` (see the sketch after this list).
2. Single-GPU inference and evaluation.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/pope.sh
```

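One way to carry out step 1 is to clone the POPE repository at the pinned commit and copy its `output/coco` folder (a sketch; the repo, commit, and paths are taken from the link above):

```Shell
git clone https://github.com/AoiDragon/POPE.git
git -C POPE checkout e3e39262c85a6a83f26cf5094022a782cb0df58d
mkdir -p ./playground/data/eval/pope
cp -r POPE/output/coco ./playground/data/eval/pope/coco
```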
### MME

1. Download the data following the official instructions [here](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation).
2. Download the images to `MME_Benchmark_release_version`.
3. Put the official `eval_tool` and `MME_Benchmark_release_version` under `./playground/data/eval/MME` (the expected layout is sketched after this list).
4. Single-GPU inference and evaluation.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mme.sh
```

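After steps 1-3, the directory is expected to look roughly like this:

```
./playground/data/eval/MME/
├── eval_tool/                          # official evaluation toolkit
└── MME_Benchmark_release_version/      # downloaded images and annotations
```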
### MMBench

1. Download [`mmbench_dev_20230712.tsv`](https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_dev_20230712.tsv) and put it under `./playground/data/eval/mmbench`.
2. Single-GPU inference.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmbench.sh
```
3. Submit the results to the evaluation server: `./playground/data/eval/mmbench/answers_upload/mmbench_dev_20230712`.

### MMBench-CN

1. Download [`mmbench_dev_cn_20231003.tsv`](https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_dev_cn_20231003.tsv) and put it under `./playground/data/eval/mmbench`.
2. Single-GPU inference.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmbench_cn.sh
```
3. Submit the results to the evaluation server: `./playground/data/eval/mmbench/answers_upload/mmbench_dev_cn_20231003`.

### SEED-Bench

1. Follow the official [instructions](https://github.com/AILab-CVC/SEED-Bench/blob/main/DATASET.md) to download the images and the videos. Put the images under `./playground/data/eval/seed_bench/SEED-Bench-image`.
2. Extract the middle frame from each downloaded video and put the frames under `./playground/data/eval/seed_bench/SEED-Bench-video-image`. We provide our script `extract_video_frames.py`, modified from the official one; an ffmpeg-based alternative is sketched after this list.
3. Multi-GPU inference and evaluation.
```Shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/seed.sh
```
4. Optionally, submit the results to the leaderboard (`./playground/data/eval/seed_bench/answers_upload`) using the official Jupyter notebook.

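If you prefer not to use `extract_video_frames.py`, the middle-frame extraction in step 2 can be sketched with ffprobe/ffmpeg. This is only an illustrative alternative: it assumes ffmpeg is installed, the videos are `.mp4`, and frames named after the video file are acceptable; check the provided script for the exact frame naming SEED-Bench expects.

```Shell
mkdir -p ./playground/data/eval/seed_bench/SEED-Bench-video-image
for f in /path/to/seed_bench_videos/*.mp4; do   # hypothetical source directory
  # half of the video duration in seconds
  mid=$(ffprobe -v error -show_entries format=duration -of csv=p=0 "$f" | awk '{print $1/2}')
  # grab a single frame at that timestamp
  ffmpeg -y -ss "$mid" -i "$f" -frames:v 1 \
    "./playground/data/eval/seed_bench/SEED-Bench-video-image/$(basename "${f%.mp4}").png"
done
```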
### LLaVA-Bench-in-the-Wild

1. Extract the contents of [`llava-bench-in-the-wild`](https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild) to `./playground/data/eval/llava-bench-in-the-wild`.
2. Single-GPU inference and evaluation.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/llavabench.sh
```

### MM-Vet

1. Extract [`mm-vet.zip`](https://github.com/yuweihao/MM-Vet/releases/download/v1/mm-vet.zip) to `./playground/data/eval/mmvet`.
2. Single-GPU inference.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmvet.sh
```
3. Evaluate the predictions in `./playground/data/eval/mmvet/results` using the official Jupyter notebook.
@@ -0,0 +1,81 @@
# Evaluate LLaVA's yes/no predictions against the POPE annotation files.
import os
import json
import argparse

def eval_pope(answers, label_file):
    label_list = [json.loads(q)['label'] for q in open(label_file, 'r')]

    # Normalize each free-form answer to a binary yes/no prediction
    for answer in answers:
        text = answer['text']

        # Only keep the first sentence
        if text.find('.') != -1:
            text = text.split('.')[0]

        text = text.replace(',', '')
        words = text.split(' ')
        if 'No' in words or 'not' in words or 'no' in words:
            answer['text'] = 'no'
        else:
            answer['text'] = 'yes'

    # Binarize the ground-truth labels: 'no' -> 0, everything else -> 1
    for i in range(len(label_list)):
        if label_list[i] == 'no':
            label_list[i] = 0
        else:
            label_list[i] = 1

    pred_list = []
    for answer in answers:
        if answer['text'] == 'no':
            pred_list.append(0)
        else:
            pred_list.append(1)

    pos = 1
    neg = 0
    yes_ratio = pred_list.count(1) / len(pred_list)

    # Confusion matrix with 'yes' as the positive class
    TP, TN, FP, FN = 0, 0, 0, 0
    for pred, label in zip(pred_list, label_list):
        if pred == pos and label == pos:
            TP += 1
        elif pred == pos and label == neg:
            FP += 1
        elif pred == neg and label == neg:
            TN += 1
        elif pred == neg and label == pos:
            FN += 1

    print('TP\tFP\tTN\tFN\t')
    print('{}\t{}\t{}\t{}'.format(TP, FP, TN, FN))

    precision = float(TP) / float(TP + FP)
    recall = float(TP) / float(TP + FN)
    f1 = 2 * precision * recall / (precision + recall)
    acc = (TP + TN) / (TP + TN + FP + FN)
    print('Accuracy: {}'.format(acc))
    print('Precision: {}'.format(precision))
    print('Recall: {}'.format(recall))
    print('F1 score: {}'.format(f1))
    print('Yes ratio: {}'.format(yes_ratio))
    print('%.3f, %.3f, %.3f, %.3f, %.3f' % (f1, acc, precision, recall, yes_ratio))

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--annotation-dir", type=str)
    parser.add_argument("--question-file", type=str)
    parser.add_argument("--result-file", type=str)
    args = parser.parse_args()

    questions = [json.loads(line) for line in open(args.question_file)]
    questions = {question['question_id']: question for question in questions}
    answers = [json.loads(q) for q in open(args.result_file)]
    # The annotation dir holds one label file per POPE split (e.g. coco_pope_adversarial.json)
    for file in os.listdir(args.annotation_dir):
        assert file.startswith('coco_pope_')
        assert file.endswith('.json')
        # Strip the 'coco_pope_' prefix and '.json' suffix to get the split name
        category = file[10:-5]
        cur_answers = [x for x in answers if questions[x['question_id']]['category'] == category]
        print('Category: {}, # samples: {}'.format(category, len(cur_answers)))
        eval_pope(cur_answers, os.path.join(args.annotation_dir, file))
        print("====================================")