forked from haotian-liu/LLaVA
Commit ce1aa08 (1 parent: a967492). Showing 30 changed files with 1,721 additions and 22 deletions.
@@ -0,0 +1,142 @@
# Evaluation

In LLaVA-1.5, we evaluate models on a diverse set of 12 benchmarks. To ensure reproducibility, we evaluate the models with greedy decoding rather than beam search, keeping inference consistent with the real-time outputs of the chat demo.

Currently, we mostly rely on the official toolkit or evaluation server for each benchmark.

## Evaluate on Custom Datasets

You can evaluate LLaVA on your custom datasets by converting your dataset to LLaVA's jsonl format and evaluating with [`model_vqa.py`](https://github.com/haotian-liu/LLaVA/blob/main/llava/eval/model_vqa.py).
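
For reference, a minimal sketch of what one line of the questions jsonl can look like, inferred from the question files used elsewhere in this repo; the exact fields are dataset-dependent and the values below are hypothetical:

```Python
import json

# One question per line; "text" holds the full prompt shown to the model.
# All field values here are hypothetical.
line = {
    "question_id": 0,
    "image": "example.jpg",
    "text": "What is the main object in the image?\nAnswer the question using a single word or phrase.",
}
print(json.dumps(line))  # append one such line per question to the jsonl file
```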

Below we provide general guidelines for evaluating datasets in some common formats; a conversion sketch follows the list.

1. Short-answer (e.g. VQAv2, MME).

```
<question>
Answer the question using a single word or phrase.
```

2. Option-only for multiple-choice (e.g. MMBench, SEED-Bench).

```
<question>
A. <option_1>
B. <option_2>
C. <option_3>
D. <option_4>
Answer with the option's letter from the given choices directly.
```

3. Natural QA (e.g. LLaVA-Bench, MM-Vet).

No postprocessing is needed.
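
Putting formats 1 and 2 into practice, here is a minimal, hypothetical conversion sketch; the input `records` schema and the output file name are invented for illustration, so adapt them to your dataset:

```Python
import json

# Hypothetical input records; your dataset's schema will differ.
records = [
    {"id": 0, "image": "cat.jpg",
     "question": "What animal is shown?",
     "options": ["dog", "cat", "rabbit", "teddy bear"]},
]

with open("questions.jsonl", "w") as f:
    for r in records:
        # Build "A. <option_1>" lines and append the option-letter instruction (format 2).
        option_lines = "\n".join(
            f"{letter}. {opt}" for letter, opt in zip("ABCD", r["options"]))
        prompt = (f"{r['question']}\n{option_lines}\n"
                  "Answer with the option's letter from the given choices directly.")
        f.write(json.dumps({
            "question_id": r["id"],
            "image": r["image"],
            "text": prompt,
        }) + "\n")
```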

## Scripts

Before preparing task-specific data, download [eval.zip](https://drive.google.com/file/d/1atZSBBrAX54yYpxtVVW33zFvcnaHeFPy/view?usp=sharing). It contains custom annotations, scripts, and the LLaVA v1.5 prediction files. Extract it to `./playground/data/eval`. This also provides a general directory structure for all datasets.

### VQAv2

1. Download [`test2015`](http://images.cocodataset.org/zips/test2015.zip) and put it under `./playground/data/eval/vqav2`.
2. Multi-GPU inference.
```Shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/vqav2.sh
```
3. Submit the results to the evaluation server: `./playground/data/eval/vqav2/answers_upload`.

### GQA

1. Download the data following the official instructions [here](https://cs.stanford.edu/people/dorarad/gqa/download.html) and put it under `./playground/data/eval/gqa/data`.
2. Multi-GPU inference.
```Shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/gqa.sh
```

### VisWiz

1. Download [`test.json`](https://vizwiz.cs.colorado.edu/VizWiz_final/vqa_data/Annotations.zip) and extract [`test.zip`](https://vizwiz.cs.colorado.edu/VizWiz_final/images/test.zip) to `test`. Put them under `./playground/data/eval/vizwiz`.
2. Single-GPU inference.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/vizwiz.sh
```
3. Submit the results to the evaluation server: `./playground/data/eval/vizwiz/answers_upload`.

### ScienceQA

1. Under `./playground/data/eval/scienceqa`, download `images`, `pid_splits.json`, and `problems.json` from the `data/scienceqa` folder of the ScienceQA [repo](https://github.com/lupantech/ScienceQA).
2. Single-GPU inference and evaluation.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/sqa.sh
```

### TextVQA

1. Download [`TextVQA_0.5.1_val.json`](https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json) and the [images](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip), and extract them to `./playground/data/eval/textvqa`.
2. Single-GPU inference and evaluation.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/textvqa.sh
```

### POPE

1. Download `coco` from [POPE](https://github.com/AoiDragon/POPE/tree/e3e39262c85a6a83f26cf5094022a782cb0df58d/output/coco) and put it under `./playground/data/eval/pope`.
2. Single-GPU inference and evaluation.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/pope.sh
```

### MME

1. Download the data following the official instructions [here](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation).
2. Download the images to `MME_Benchmark_release_version`.
3. Put the official `eval_tool` and `MME_Benchmark_release_version` under `./playground/data/eval/MME`.
4. Single-GPU inference and evaluation.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mme.sh
```

### MMBench

1. Download `mmbench_dev_20230712.tsv` from the official [website](https://github.com/open-compass/MMBench) and put it under `./playground/data/eval/mmbench`.
2. Single-GPU inference.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmbench.sh
```
3. Submit the results to the evaluation server: `./playground/data/eval/mmbench/answers_upload/mmbench_dev_20230712`.

### MMBench-CN

1. Download `mmbench_dev_cn_20231003.tsv` from the official [website](https://github.com/open-compass/MMBench) and put it under `./playground/data/eval/mmbench`.
2. Single-GPU inference.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmbench_cn.sh
```
3. Submit the results to the evaluation server: `./playground/data/eval/mmbench/answers_upload/mmbench_dev_cn_20231003`.

### SEED-Bench

1. Follow the official [instructions](https://github.com/AILab-CVC/SEED-Bench/blob/main/DATASET.md) to download the images and the videos. Put the images under `./playground/data/eval/seed_bench/SEED-Bench-image`.
2. Extract the middle frame from each downloaded video and put the frames under `./playground/data/eval/seed_bench/SEED-Bench-video-image`. We provide our script `extract_video_frames.py`, modified from the official one (see the sketch after this list for the core idea).
3. Multi-GPU inference and evaluation.
```Shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/seed.sh
```
4. Optionally, submit the results to the leaderboard: `./playground/data/eval/seed_bench/answers_upload` using the official jupyter notebook.
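
A minimal sketch of middle-frame extraction, assuming OpenCV; the actual `extract_video_frames.py` shipped with the repo may differ in details:

```Python
import cv2  # assumed dependency; the provided script may use a different reader

def extract_middle_frame(video_path: str, out_path: str) -> bool:
    """Save the middle frame of a video as an image; returns True on success."""
    cap = cv2.VideoCapture(video_path)
    num_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    if num_frames <= 0:
        cap.release()
        return False
    # Seek to the middle frame and decode it.
    cap.set(cv2.CAP_PROP_POS_FRAMES, num_frames // 2)
    ok, frame = cap.read()
    cap.release()
    if ok:
        cv2.imwrite(out_path, frame)
    return ok
```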

### LLaVA-Bench-in-the-Wild

1. Extract the contents of [`llava-bench-in-the-wild`](https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild) to `./playground/data/eval/llava-bench-in-the-wild`.
2. Single-GPU inference and evaluation.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/llavabench.sh
```

### MM-Vet

1. Extract [`mm-vet.zip`](https://github.com/yuweihao/MM-Vet/releases/download/v1/mm-vet.zip) to `./playground/data/eval/mmvet`.
2. Single-GPU inference.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmvet.sh
```
3. Evaluate the predictions in `./playground/data/eval/mmvet/results` using the official jupyter notebook.

@@ -0,0 +1,226 @@
import argparse
import os
import json
import pandas as pd
from tqdm import tqdm
import openai
from concurrent.futures import ThreadPoolExecutor, as_completed
import math
import time


all_options = ['A', 'B', 'C', 'D']


def split_list(lst, n):
    """Split a list into n (roughly) equal-sized chunks"""
    chunk_size = math.ceil(len(lst) / n)  # ceiling division
    return [lst[i:i+chunk_size] for i in range(0, len(lst), chunk_size)]


def get_chunk(lst, n, k):
    """Return the k-th of n chunks of lst."""
    chunks = split_list(lst, n)
    return chunks[k]
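

# For example (hypothetical values), sharding 10 items across 3 workers:
#   split_list(list(range(10)), 3) -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
#   get_chunk(list(range(10)), 3, 1) -> [4, 5, 6, 7]
# This mirrors how the eval scripts shard questions across GPUs.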


def get_row(df, colname, value):
    """Return the unique row of df where df[colname] == value."""
    assert (df[colname] == value).sum() == 1
    return df[df[colname] == value].iloc[0]


def encode_query(question, options, answer):
    """Format a question, its options, and a model answer into one query string."""
    query = ""
    query += "Question: " + question + "\n"
    query += "Options: " + "\n".join([f"{option_char}. {option}" for option_char, option in zip(all_options[:len(options)], options)]) + "\n"
    query += "Answer: " + answer + "\n"
    return query
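

# Example of the string encode_query produces (hypothetical inputs):
#   Question: What is the main object in image?
#   Options: A. teddy bear
#   B. rabbit
#   C. cat
#   D. dog
#   Answer: a cute teddy bear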


def get_openai_api():
    """Build kwargs for the OpenAI API from environment variables (API_TYPE, API_KEY, ...)."""
    api_type = os.environ.get('API_TYPE', 'azure')

    if api_type == 'azure':
        api_key = os.environ.get('API_KEY', 'sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx')
        engine = os.environ.get('ENGINE', 'chatgpt-turbo')
        api_host = os.environ.get('API_BASE')
        return {
            'api_type': 'azure',
            'api_version': '2023-06-01-preview',
            'engine': engine,
            'api_key': api_key,
            'api_base': f'https://{api_host}.openai.azure.com',
        }
    else:
        api_key = os.environ.get('API_KEY', 'sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx')
        model = os.environ.get('MODEL', 'gpt-3.5-turbo-0301')

        return {
            'model': model,
            'api_key': api_key,
        }


def chatgpt_extract_answer(
        question, options, answer, max_tokens=64, temperature=0.2, top_p=0.9, frequency_penalty=0, presence_penalty=0,
        request_timeout=None, num_retry=1):
    """Ask ChatGPT to match a free-form answer to one of the option letters (or X)."""
    api_kwargs = get_openai_api()

    system_message = """You are an AI assistant to help me matching an answer with several options of a multiple choice question.
You are provided with a question, several options, and an answer, and you need to find which option is most similar to the answer.
If the meaning of all options are significantly different from the answer, output X.
You should output a single uppercase character in A, B, C, D, if they are valid options, and X otherwise."""
    exemplars = [
        {
            "question": "What is the main object in image?",
            "options": ["teddy bear", "rabbit", "cat", "dog"],
            "answer": "a cute teddy bear",
            "output": "A",
        },
        {
            "question": "What is the main object in image?",
            "options": ["teddy bear", "rabbit", "cat", "dog"],
            "answer": "Spider",
            "output": "X",
        },
    ]

    # Few-shot prompt: system message, worked examples, then the real query.
    messages = [
        {"role": "system", "content": system_message},
    ]
    for exemplar in exemplars:
        messages.append({"role": "user", "content": encode_query(exemplar['question'], exemplar['options'], exemplar['answer'])})
        messages.append({"role": "assistant", "content": exemplar['output']})
    messages.append({"role": "user", "content": encode_query(question, options, answer)})

    response = None
    attempts = []
    for i in range(num_retry):
        try:
            response = openai.ChatCompletion.create(
                messages=messages,
                max_tokens=max_tokens,
                temperature=temperature,
                top_p=top_p,
                frequency_penalty=frequency_penalty,
                presence_penalty=presence_penalty,
                request_timeout=request_timeout,
                **api_kwargs
            )
        except Exception as e:
            if type(e) in [openai.error.RateLimitError, openai.error.APIError, openai.error.APIConnectionError, openai.error.Timeout]:
                # Transient errors: fall through and retry.
                pass
            elif type(e) in [openai.error.AuthenticationError, openai.error.InvalidRequestError]:
                # Unrecoverable errors: give up immediately.
                print(e)
                return None
            else:
                print(type(e), e)
            attempts.append(e.__class__.__name__)
            time.sleep(1)
        else:
            time.sleep(1)  # throttle successful requests to stay under rate limits
            break

    if response is None:
        print(f'All {num_retry} attempts failed: {attempts}. Returning None.')
        return None

    content = response['choices'][0]['message']['content']
    content = content.strip()
    return content


def is_none(value):
    """True for None, NaN, and the strings 'nan'/'none' (case-insensitive)."""
    if value is None:
        return True
    if type(value) is float and math.isnan(value):
        return True
    if type(value) is str and value.lower() == 'nan':
        return True
    if type(value) is str and value.lower() == 'none':
        return True
    return False


def get_options(row, options):
    """Collect the option values present in a row, stopping at the first empty one."""
    parsed_options = []
    for option in options:
        option_value = row[option]
        if is_none(option_value):
            break
        parsed_options.append(option_value)
    return parsed_options


def auto_parse_answer(question, options, answer):
    """Try to map a raw answer string to an option letter without calling the API."""
    # Case 1: the answer is already a bare letter like "B" or "B.".
    if answer.strip('.').strip().upper() in all_options[:len(options)]:
        return answer.strip('.').strip().upper()
    # Case 2: the answer contains "The answer is <letter>.".
    expand_option_valid = [f'The answer is {option}.'.lower() in answer.lower() for option in all_options[:len(options)]]
    if any(expand_option_valid):
        return all_options[expand_option_valid.index(True)]

    # Case 3: the answer is a substring of exactly one option's text.
    matched_ops = [all_options[_i] for _i, option in enumerate(options) if answer.lower() in option.lower()]
    if len(matched_ops) == 1:
        return matched_ops[0]
    return None
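

# For example (hypothetical inputs), with options ["teddy bear", "rabbit", "cat", "dog"]:
#   auto_parse_answer(q, opts, "B.")               -> 'B'
#   auto_parse_answer(q, opts, "The answer is C.") -> 'C'
#   auto_parse_answer(q, opts, "teddy bear")       -> 'A'  (unique substring match)
#   auto_parse_answer(q, opts, "spider")           -> None (falls back to ChatGPT)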


def eval_results(args):
    # Questions come from the benchmark table; answers from the model's jsonl predictions.
    questions = pd.read_table(os.path.expanduser(args.question_file))
    answers = [json.loads(line) for line in open(os.path.expanduser(args.answers_file))]
    answers = {(row['question_id'], row.get('round_id', 0)): row for row in answers}
    results_file = os.path.expanduser(args.results_file)
    if os.path.exists(results_file):
        # Resume: skip (question_id, round_id) pairs that were already parsed.
        results = [json.loads(line) for line in open(results_file)]
        results = {(row['question_id'], row.get('round_id', 0)): row for row in results}
    else:
        results = {}
    results_writer = open(results_file, 'a')

    def process_answer(idx, answer):
        if idx in results:
            return None
        question_id, round_id = idx
        question_data = get_row(questions, 'index', question_id)
        if 'options' in answer:
            options = answer['options']
            option_char = answer['option_char']
        else:
            assert round_id == 0, "round_id must be 0 when options are not provided"
            options = get_options(question_data, all_options)
            option_char = all_options[:len(options)]
        # Map the parsed letter back to the per-round option characters recorded with the answer.
        option_map = {all_options[i]: option_char[i] for i in range(len(options))}
        option_map['X'] = 'X'
        parsed_answer = auto_parse_answer(question_data['question'], options, answer['text'])
        if parsed_answer is None:
            # Rule-based parsing failed; fall back to ChatGPT extraction.
            parsed_answer = chatgpt_extract_answer(
                question_data['question'], options, answer['text'],
                request_timeout=args.request_timeout, num_retry=args.num_retry)
        if parsed_answer is None:
            return None
        if parsed_answer not in option_map:
            print(f'Invalid parsed answer: {parsed_answer}')
            return None
        answer['parsed_answer'] = option_map[parsed_answer]
        return answer

    with ThreadPoolExecutor(max_workers=args.max_workers) as executor:
        # Submit all tasks to the executor
        futures = {executor.submit(process_answer, key, value): key for key, value in answers.items()}

        # Process results as they become available
        for future in tqdm(as_completed(futures), total=len(answers)):
            answer = future.result()
            if answer is not None:
                results_writer.write(json.dumps(answer) + '\n')
                results_writer.flush()

    results_writer.close()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--question-file", type=str, default="tables/question.jsonl")
    parser.add_argument("--answers-file", type=str, default="answer.jsonl")
    parser.add_argument("--results-file", type=str, default="results.jsonl")
    parser.add_argument("--max-workers", type=int, default=1)
    parser.add_argument("--num-retry", type=int, default=3)
    parser.add_argument("--request-timeout", type=int, default=None)
    args = parser.parse_args()

    eval_results(args)
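

# Hypothetical usage (script and file names are for illustration only):
#   python this_script.py --question-file mmbench_dev_20230712.tsv \
#       --answers-file answers.jsonl --results-file results.jsonl --max-workers 4
# Set API_TYPE and API_KEY (plus ENGINE/API_BASE for Azure, or MODEL for OpenAI)
# beforehand so the ChatGPT fallback in chatgpt_extract_answer can authenticate.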