Release evaluation scripts.
haotian-liu committed Oct 11, 2023
1 parent a967492 commit ce1aa08
Showing 30 changed files with 1,721 additions and 22 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -17,7 +17,7 @@


## Release
- [10/11] The training data and scripts of LLaVA-1.5 are released [here](https://github.com/haotian-liu/LLaVA#train), with evaluation scripts coming this week!
- [10/11] The training data and scripts of LLaVA-1.5 are released [here](https://github.com/haotian-liu/LLaVA#train), and evaluation scripts are released [here](https://github.com/haotian-liu/LLaVA/blob/main/docs/Evaluation.md)!
- [10/5] 🔥 LLaVA-1.5 is out! It achieves SoTA on 11 benchmarks with just simple modifications to the original LLaVA, uses only public data, completes training in ~1 day on a single 8-A100 node, and surpasses methods like Qwen-VL-Chat that use billion-scale data. Check out the [technical report](https://arxiv.org/abs/2310.03744), and explore the [demo](https://llava.hliu.cc/)! Models are available in the [Model Zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md).
- [9/26] LLaVA is enhanced with reinforcement learning from human feedback (RLHF) to improve fact grounding and reduce hallucination. Check out the new SFT and RLHF checkpoints at the project [[LLaVA-RLHF]](https://llava-rlhf.github.io/)
- [9/22] [LLaVA](https://arxiv.org/abs/2304.08485) is accepted by NeurIPS 2023 as an **oral presentation**, and [LLaVA-Med](https://arxiv.org/abs/2306.00890) is accepted by NeurIPS 2023 Datasets and Benchmarks Track as a **spotlight presentation**.
142 changes: 142 additions & 0 deletions docs/Evaluation.md
@@ -0,0 +1,142 @@
# Evaluation

In LLaVA-1.5, we evaluate models on a diverse set of 12 benchmarks. To ensure reproducibility, we evaluate the models with greedy decoding rather than beam search, keeping the inference process consistent with the real-time outputs of the chat demo.
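
For reference, here is a minimal sketch of what "greedy decoding" means in Hugging Face `generate` terms (`do_sample=False`, `num_beams=1`). The model name and prompt are placeholders for illustration only; this is not the exact invocation used by the evaluation scripts.

```python
# Illustrative only: greedy decoding (no sampling, no beam search) with Transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Answer the question using a single word or phrase.", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    do_sample=False,   # greedy: always take the highest-probability token
    num_beams=1,       # no beam search, matching the real-time chat demo
    max_new_tokens=64,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```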

Currently, we mostly rely on each benchmark's official toolkit or evaluation server.

## Evaluate on Custom Datasets

You can evaluate LLaVA on your custom datasets by converting your dataset to LLaVA's jsonl format and evaluating it with [`model_vqa.py`](https://github.com/haotian-liu/LLaVA/blob/main/llava/eval/model_vqa.py); a conversion sketch follows the list below.

Below we provide a general guideline for evaluating datasets with some common formats.

1. Short-answer (e.g. VQAv2, MME).

```
<question>
Answer the question using a single word or phrase.
```

2. Option-only for multiple-choice (e.g. MMBench, SEED-Bench).

```
<question>
A. <option_1>
B. <option_2>
C. <option_3>
D. <option_4>
Answer with the option's letter from the given choices directly.
```

3. Natural QA (e.g. LLaVA-Bench, MM-Vet).

No postprocessing is needed.
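
As a concrete example, the sketch below converts a hypothetical short-answer dataset into a question file for [`model_vqa.py`](https://github.com/haotian-liu/LLaVA/blob/main/llava/eval/model_vqa.py). It assumes the common LLaVA question-file schema (`question_id`, `image`, `text`); the file name and dataset contents are made up for illustration, so adjust the fields to whatever your copy of `model_vqa.py` expects.

```python
import json

# Hypothetical custom dataset: (question_id, image file name, question text).
custom_dataset = [
    (0, "0001.jpg", "What color is the car?"),
    (1, "0002.jpg", "How many people are in the image?"),
]

# Write one JSON object per line (assumed schema: question_id / image / text).
# The appended suffix follows the short-answer format (1) above.
with open("custom_questions.jsonl", "w") as f:
    for question_id, image, question in custom_dataset:
        record = {
            "question_id": question_id,
            "image": image,
            "text": question + "\nAnswer the question using a single word or phrase.",
        }
        f.write(json.dumps(record) + "\n")
```

You can then point `model_vqa.py` at this question file and your image folder, and score the resulting answers jsonl however your benchmark requires.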

## Scripts

Before preparing task-specific data, download [eval.zip](https://drive.google.com/file/d/1atZSBBrAX54yYpxtVVW33zFvcnaHeFPy/view?usp=sharing). It contains custom annotations, scripts, and the prediction files from LLaVA v1.5. Extract it to `./playground/data/eval`. This also provides a general structure for all datasets.

### VQAv2

1. Download [`test2015`](http://images.cocodataset.org/zips/test2015.zip) and put it under `./playground/data/eval/vqav2`.
2. Multi-GPU inference.
```Shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/vqav2.sh
```
3. Submit the results to the evaluation server: `./playground/data/eval/vqav2/answers_upload`.

### GQA

1. Download the data following the official instructions [here](https://cs.stanford.edu/people/dorarad/gqa/download.html) and put under `./playground/data/eval/gqa/data`.
2. Multi-GPU inference.
```Shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/gqa.sh
```

### VisWiz

1. Download [`test.json`](https://vizwiz.cs.colorado.edu/VizWiz_final/vqa_data/Annotations.zip) and extract [`test.zip`](https://vizwiz.cs.colorado.edu/VizWiz_final/images/test.zip) to `test`. Put them under `./playground/data/eval/vizwiz`.
2. Single-GPU inference.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/vizwiz.sh
```
3. Submit the results to the evaluation server: `./playground/data/eval/vizwiz/answers_upload`.

### ScienceQA

1. Under `./playground/data/eval/scienceqa`, download `images`, `pid_splits.json`, `problems.json` from the `data/scienceqa` folder of the ScienceQA [repo](https://github.com/lupantech/ScienceQA).
2. Single-GPU inference and evaluation.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/sqa.sh
```

### TextVQA

1. Download [`TextVQA_0.5.1_val.json`](https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json) and [images](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip) and extract to `./playground/data/eval/textvqa`.
2. Single-GPU inference and evaluation.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/textvqa.sh
```

### POPE

1. Download `coco` from [POPE](https://github.com/AoiDragon/POPE/tree/e3e39262c85a6a83f26cf5094022a782cb0df58d/output/coco) and put under `./playground/data/eval/pope`.
2. Single-GPU inference and evaluation.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/pope.sh
```

### MME

1. Download the data following the official instructions [here](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation).
2. Put the downloaded images under `MME_Benchmark_release_version`.
3. Put the official `eval_tool` and `MME_Benchmark_release_version` under `./playground/data/eval/MME`.
4. Single-GPU inference and evaluation.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mme.sh
```

### MMBench

1. Download `mmbench_dev_20230712.tsv` from the official [website](https://github.com/open-compass/MMBench) and put under `./playground/data/eval/mmbench`.
2. Single-GPU inference.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmbench.sh
```
3. Submit the results to the evaluation server: `./playground/data/eval/mmbench/answers_upload/mmbench_dev_20230712`.

### MMBench-CN

1. Download `mmbench_dev_cn_20231003.tsv` from the official [website](https://github.com/open-compass/MMBench) and put under `./playground/data/eval/mmbench`.
2. Single-GPU inference.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmbench_cn.sh
```
3. Submit the results to the evaluation server: `./playground/data/eval/mmbench/answers_upload/mmbench_dev_cn_20231003`.

### SEED-Bench

1. Follow the official [instructions](https://github.com/AILab-CVC/SEED-Bench/blob/main/DATASET.md) to download the images and the videos. Put the images under `./playground/data/eval/seed_bench/SEED-Bench-image`.
2. Extract the middle frame from each downloaded video and put the frames under `./playground/data/eval/seed_bench/SEED-Bench-video-image`. We provide our script `extract_video_frames.py`, modified from the official one (a rough sketch of this step appears after this list).
3. Multi-GPU inference and evaluation.
```Shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/seed.sh
```
4. Optionally, submit the results in `./playground/data/eval/seed_bench/answers_upload` to the leaderboard using the official Jupyter notebook.
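
For reference, below is a rough sketch of the middle-frame extraction in step 2 using OpenCV. It is not the provided `extract_video_frames.py`; the input directory name is an assumption, and only the output directory matches the path above.

```python
import os

import cv2  # pip install opencv-python

video_dir = "./playground/data/eval/seed_bench/SEED-Bench-video"        # assumed input path
frame_dir = "./playground/data/eval/seed_bench/SEED-Bench-video-image"  # output path from step 2
os.makedirs(frame_dir, exist_ok=True)

for name in os.listdir(video_dir):
    cap = cv2.VideoCapture(os.path.join(video_dir, name))
    num_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Seek to the middle frame and save it as a PNG named after the video.
    cap.set(cv2.CAP_PROP_POS_FRAMES, num_frames // 2)
    success, frame = cap.read()
    if success:
        cv2.imwrite(os.path.join(frame_dir, os.path.splitext(name)[0] + ".png"), frame)
    cap.release()
```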

### LLaVA-Bench-in-the-Wild

1. Extract contents of [`llava-bench-in-the-wild`](https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild) to `./playground/data/eval/llava-bench-in-the-wild`.
2. Single-GPU inference and evaluation.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/llavabench.sh
```

### MM-Vet

1. Extract [`mm-vet.zip`](https://github.com/yuweihao/MM-Vet/releases/download/v1/mm-vet.zip) to `./playground/data/eval/mmvet`.
2. Single-GPU inference.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmvet.sh
```
3. Evaluate the predictions in `./playground/data/eval/mmvet/results` using the official Jupyter notebook.
226 changes: 226 additions & 0 deletions llava/eval/eval_mmbench.py
@@ -0,0 +1,226 @@
import argparse
import os
import json
import pandas as pd
from tqdm import tqdm
import openai
from concurrent.futures import ThreadPoolExecutor, as_completed
import math
import time


all_options = ['A', 'B', 'C', 'D']


def split_list(lst, n):
"""Split a list into n (roughly) equal-sized chunks"""
    chunk_size = math.ceil(len(lst) / n)  # ceiling division so every item ends up in a chunk
return [lst[i:i+chunk_size] for i in range(0, len(lst), chunk_size)]


def get_chunk(lst, n, k):
chunks = split_list(lst, n)
return chunks[k]


def get_row(df, colname, value):
assert (df[colname] == value).sum() == 1
return df[df[colname] == value].iloc[0]


def encode_query(question, options, answer):
query = ""
query += "Question: " + question + "\n"
query += "Options: " + "\n".join([f"{option_char}. {option}" for option_char, option in zip(all_options[:len(options)], options)]) + "\n"
query += "Answer: " + answer + "\n"
return query


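# Assemble keyword arguments for openai.ChatCompletion.create from environment
# variables, supporting both Azure OpenAI (the default) and the standard OpenAI API.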
def get_openai_api():
api_type = os.environ.get('API_TYPE', 'azure')

if api_type == 'azure':
api_key = os.environ.get('API_KEY', 'sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx')
engine = os.environ.get('ENGINE', 'chatgpt-turbo')
api_host = os.environ.get('API_BASE')
return {
'api_type': 'azure',
'api_version': '2023-06-01-preview',
'engine': engine,
'api_key': api_key,
'api_base': f'https://{api_host}.openai.azure.com',
}
else:
api_key = os.environ.get('API_KEY', 'sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx')
model = os.environ.get('MODEL', 'gpt-3.5-turbo-0301')

return {
'model': model,
'api_key': api_key,
}


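# Ask ChatGPT, primed with two few-shot examples, to match a free-form answer to one of
# the option letters ('X' means no option matches). Transient API errors are retried up
# to num_retry times; authentication/invalid-request errors abort and return None.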
def chatgpt_extract_answer(
question, options, answer, max_tokens=64, temperature=0.2, top_p=0.9, frequency_penalty=0, presence_penalty=0,
request_timeout=None, num_retry=1):
api_kwargs = get_openai_api()

system_message = """You are an AI assistant to help me matching an answer with several options of a multiple choice question.
You are provided with a question, several options, and an answer, and you need to find which option is most similar to the answer.
If the meaning of all options are significantly different from the answer, output X.
You should output a single uppercase character in A, B, C, D, if they are valid options, and X otherwise."""
exemplers = [
{
"question": "What is the main object in image?",
"options": ["teddy bear", "rabbit", "cat", "dog"],
"answer": "a cute teddy bear",
"output": "A",
},
{
"question": "What is the main object in image?",
"options": ["teddy bear", "rabbit", "cat", "dog"],
"answer": "Spider",
"output": "X",
},
]

messages = [
{"role": "system", "content": system_message},
]
for exempler in exemplers:
messages.append({"role": "user", "content": encode_query(exempler['question'], exempler['options'], exempler['answer'])})
messages.append({"role": "assistant", "content": exempler['output']})
messages.append({"role": "user", "content": encode_query(question, options, answer)})

response = None
attempts = []
for i in range(num_retry):
try:
response = openai.ChatCompletion.create(
messages = messages,
max_tokens = max_tokens,
temperature = temperature,
top_p = top_p,
frequency_penalty = frequency_penalty,
presence_penalty = presence_penalty,
request_timeout = request_timeout,
**api_kwargs
)
except Exception as e:
if type(e) in [openai.error.RateLimitError, openai.error.APIError, openai.error.APIConnectionError, openai.error.Timeout]:
pass
elif type(e) in [openai.error.AuthenticationError, openai.error.InvalidRequestError]:
print(e)
return None
else:
print(type(e), e)
attempts.append(e.__class__.__name__)
time.sleep(1)
else:
time.sleep(1)
break

if response is None:
print(f'All {num_retry} attempts failed: {attempts}. Returning None.')
return None

content = response['choices'][0]['message']['content']
content = content.strip()
return content

def is_none(value):
if value is None:
return True
if type(value) is float and math.isnan(value):
return True
if type(value) is str and value.lower() == 'nan':
return True
if type(value) is str and value.lower() == 'none':
return True
return False

def get_options(row, options):
parsed_options = []
for option in options:
option_value = row[option]
if is_none(option_value):
break
parsed_options.append(option_value)
return parsed_options

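# Rule-based parsing tried before the ChatGPT fallback: accept a bare option letter,
# a "The answer is X." pattern, or the answer text appearing inside exactly one option.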
def auto_parse_answer(question, options, answer):
if answer.strip('.').strip().upper() in all_options[:len(options)]:
return answer.strip('.').strip().upper()
expand_option_valid = [f'The answer is {option}.'.lower() in answer.lower() for option in all_options[:len(options)]]
if any(expand_option_valid):
return all_options[expand_option_valid.index(True)]

matched_ops = [all_options[_i] for _i, option in enumerate(options) if answer.lower() in option.lower()]
if len(matched_ops) == 1:
return matched_ops[0]
return None

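# Parse every recorded answer into an option letter (rule-based first, ChatGPT as a
# fallback) and append the parsed rows to the results file; previously parsed entries
# are skipped, so the script can be resumed safely.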
def eval_results(args):
questions = pd.read_table(os.path.expanduser(args.question_file))
answers = [json.loads(line) for line in open(os.path.expanduser(args.answers_file))]
answers = {(row['question_id'], row.get('round_id', 0)): row for row in answers}
results_file = os.path.expanduser(args.results_file)
if os.path.exists(results_file):
results = [json.loads(line) for line in open(results_file)]
results = {(row['question_id'], row.get('round_id', 0)): row for row in results}
else:
results = {}
results_writer = open(results_file, 'a')

def process_answer(idx, answer):
if idx in results:
return None
question_id, round_id = idx
question_data = get_row(questions, 'index', question_id)
if 'options' in answer:
options = answer['options']
option_char = answer['option_char']
else:
assert round_id == 0, "round_id must be 0 when options are not provided"
options = get_options(question_data, all_options)
option_char = all_options[:len(options)]
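        # Map canonical letters (A-D) to the option letters actually presented with this
        # answer; 'X' (no matching option) passes through unchanged.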
option_map = {all_options[i]: option_char[i] for i in range(len(options))}
option_map['X'] = 'X'
parsed_answer = auto_parse_answer(question_data['question'], options, answer['text'])
if parsed_answer is None:
parsed_answer = chatgpt_extract_answer(
question_data['question'], options, answer['text'],
request_timeout=args.request_timeout, num_retry=args.num_retry)
if parsed_answer is None:
return None
if parsed_answer not in option_map:
print(f'Invalid parsed answer: {parsed_answer}')
return None
answer['parsed_answer'] = option_map[parsed_answer]
return answer

with ThreadPoolExecutor(max_workers=args.max_workers) as executor:
# Submit all tasks to the executor
futures = {executor.submit(process_answer, key, value): key for key, value in answers.items()}

# Process results as they become available
for future in tqdm(as_completed(futures), total=len(answers)):
answer = future.result()
if answer is not None:
results_writer.write(json.dumps(answer) + '\n')
results_writer.flush()

results_writer.close()


if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--question-file", type=str, default="tables/question.jsonl")
parser.add_argument("--answers-file", type=str, default="answer.jsonl")
parser.add_argument("--results-file", type=str, default="results.jsonl")
parser.add_argument("--max-workers", type=int, default=1)
parser.add_argument("--num-retry", type=int, default=3)
parser.add_argument("--request-timeout", type=int, default=None)
args = parser.parse_args()

eval_results(args)