Release evaluation scripts.
haotian-liu committed Oct 11, 2023
1 parent a967492 commit ce1aa08
Showing 30 changed files with 1,721 additions and 22 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -17,7 +17,7 @@


## Release
- [10/11] The training data and scripts of LLaVA-1.5 are released [here](https://github.com/haotian-liu/LLaVA#train), with evaluation scripts coming this week!
- [10/11] The training data and scripts of LLaVA-1.5 are released [here](https://github.com/haotian-liu/LLaVA#train), and evaluation scripts are released [here](https://github.com/haotian-liu/LLaVA/blob/main/docs/Evaluation.md)!
- [10/5] 🔥 LLaVA-1.5 is out! It achieves SoTA on 11 benchmarks with just simple modifications to the original LLaVA, uses only public data, completes training in ~1 day on a single 8-A100 node, and surpasses methods like Qwen-VL-Chat that use billion-scale data. Check out the [technical report](https://arxiv.org/abs/2310.03744), and explore the [demo](https://llava.hliu.cc/)! Models are available in the [Model Zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md).
- [9/26] LLaVA is enhanced with reinforcement learning from human feedback (RLHF) to improve fact grounding and reduce hallucination. Check out the new SFT and RLHF checkpoints at the project [[LLaVA-RLHF]](https://llava-rlhf.github.io/)
- [9/22] [LLaVA](https://arxiv.org/abs/2304.08485) is accepted by NeurIPS 2023 as an **oral presentation**, and [LLaVA-Med](https://arxiv.org/abs/2306.00890) is accepted by NeurIPS 2023 Datasets and Benchmarks Track as a **spotlight presentation**.
142 changes: 142 additions & 0 deletions docs/Evaluation.md
@@ -0,0 +1,142 @@
# Evaluation

In LLaVA-1.5, we evaluate models on a diverse set of 12 benchmarks. To ensure reproducibility, we evaluate the models with greedy decoding rather than beam search, keeping the inference process consistent with the real-time outputs of the chat demo.
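
For reference, here is a minimal sketch of what "greedy decoding" means in Hugging Face `generate` terms (`do_sample=False`, `num_beams=1`). The model name and prompt are placeholders for illustration only; this is not the exact invocation used by the evaluation scripts.

```python
# Illustrative only: greedy decoding (no sampling, no beam search) with Transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Answer the question using a single word or phrase.", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    do_sample=False,   # greedy: always take the highest-probability token
    num_beams=1,       # no beam search, matching the real-time chat demo
    max_new_tokens=64,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```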

Currently, we mostly rely on each benchmark's official toolkit or evaluation server.

## Evaluate on Custom Datasets

You can evaluate LLaVA on your custom datasets by converting your dataset to LLaVA's jsonl format and evaluating it with [`model_vqa.py`](https://github.com/haotian-liu/LLaVA/blob/main/llava/eval/model_vqa.py); a conversion sketch follows the list below.

Below we provide a general guideline for evaluating datasets with some common formats.

1. Short-answer (e.g. VQAv2, MME).

```
<question>
Answer the question using a single word or phrase.
```

2. Option-only for multiple-choice (e.g. MMBench, SEED-Bench).

```
<question>
A. <option_1>
B. <option_2>
C. <option_3>
D. <option_4>
Answer with the option's letter from the given choices directly.
```

3. Natural QA (e.g. LLaVA-Bench, MM-Vet).

No postprocessing is needed.
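
As a concrete example, the sketch below converts a hypothetical short-answer dataset into a question file for [`model_vqa.py`](https://github.com/haotian-liu/LLaVA/blob/main/llava/eval/model_vqa.py). It assumes the common LLaVA question-file schema (`question_id`, `image`, `text`); the file name and dataset contents are made up for illustration, so adjust the fields to whatever your copy of `model_vqa.py` expects.

```python
import json

# Hypothetical custom dataset: (question_id, image file name, question text).
custom_dataset = [
    (0, "0001.jpg", "What color is the car?"),
    (1, "0002.jpg", "How many people are in the image?"),
]

# Write one JSON object per line (assumed schema: question_id / image / text).
# The appended suffix follows the short-answer format (1) above.
with open("custom_questions.jsonl", "w") as f:
    for question_id, image, question in custom_dataset:
        record = {
            "question_id": question_id,
            "image": image,
            "text": question + "\nAnswer the question using a single word or phrase.",
        }
        f.write(json.dumps(record) + "\n")
```

You can then point `model_vqa.py` at this question file and your image folder, and score the resulting answers jsonl however your benchmark requires.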

## Scripts

Before preparing task-specific data, download [eval.zip](https://drive.google.com/file/d/1atZSBBrAX54yYpxtVVW33zFvcnaHeFPy/view?usp=sharing). It contains custom annotations, scripts, and the prediction files from LLaVA v1.5. Extract it to `./playground/data/eval`. This also provides a general structure for all datasets.

### VQAv2

1. Download [`test2015`](http://images.cocodataset.org/zips/test2015.zip) and put it under `./playground/data/eval/vqav2`.
2. Multi-GPU inference.
```Shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/vqav2.sh
```
3. Submit the results to the evaluation server: `./playground/data/eval/vqav2/answers_upload`.

### GQA

1. Download the data following the official instructions [here](https://cs.stanford.edu/people/dorarad/gqa/download.html) and put under `./playground/data/eval/gqa/data`.
2. Multi-GPU inference.
```Shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/gqa.sh
```

### VisWiz

1. Download [`test.json`](https://vizwiz.cs.colorado.edu/VizWiz_final/vqa_data/Annotations.zip) and extract [`test.zip`](https://vizwiz.cs.colorado.edu/VizWiz_final/images/test.zip) to `test`. Put them under `./playground/data/eval/vizwiz`.
2. Single-GPU inference.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/vizwiz.sh
```
3. Submit the results to the evaluation server: `./playground/data/eval/vizwiz/answers_upload`.

### ScienceQA

1. Under `./playground/data/eval/scienceqa`, download `images`, `pid_splits.json`, `problems.json` from the `data/scienceqa` folder of the ScienceQA [repo](https://github.com/lupantech/ScienceQA).
2. Single-GPU inference and evaluation.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/sqa.sh
```

### TextVQA

1. Download [`TextVQA_0.5.1_val.json`](https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json) and [images](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip) and extract to `./playground/data/eval/textvqa`.
2. Single-GPU inference and evaluation.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/textvqa.sh
```

### POPE

1. Download `coco` from [POPE](https://github.com/AoiDragon/POPE/tree/e3e39262c85a6a83f26cf5094022a782cb0df58d/output/coco) and put under `./playground/data/eval/pope`.
2. Single-GPU inference and evaluation.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/pope.sh
```

### MME

1. Download the data following the official instructions [here](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation).
2. Put the downloaded images under `MME_Benchmark_release_version`.
3. Put the official `eval_tool` and `MME_Benchmark_release_version` under `./playground/data/eval/MME`.
4. Single-GPU inference and evaluation.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mme.sh
```

### MMBench

1. Download `mmbench_dev_20230712.tsv` from the official [website](https://github.com/open-compass/MMBench) and put under `./playground/data/eval/mmbench`.
2. Single-GPU inference.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmbench.sh
```
3. Submit the results to the evaluation server: `./playground/data/eval/mmbench/answers_upload/mmbench_dev_20230712`.

### MMBench-CN

1. Download `mmbench_dev_cn_20231003.tsv` from the official [website](https://github.com/open-compass/MMBench) and put under `./playground/data/eval/mmbench`.
2. Single-GPU inference.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmbench_cn.sh
```
3. Submit the results to the evaluation server: `./playground/data/eval/mmbench/answers_upload/mmbench_dev_cn_20231003`.

### SEED-Bench

1. Follow the official [instructions](https://github.com/AILab-CVC/SEED-Bench/blob/main/DATASET.md) to download the images and the videos. Put the images under `./playground/data/eval/seed_bench/SEED-Bench-image`.
2. Extract the middle frame from each downloaded video and put the frames under `./playground/data/eval/seed_bench/SEED-Bench-video-image`. We provide our script `extract_video_frames.py`, modified from the official one (a rough sketch of this step appears after this list).
3. Multi-GPU inference and evaluation.
```Shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/seed.sh
```
4. Optionally, submit the results in `./playground/data/eval/seed_bench/answers_upload` to the leaderboard using the official Jupyter notebook.
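
For reference, below is a rough sketch of the middle-frame extraction in step 2 using OpenCV. It is not the provided `extract_video_frames.py`; the input directory name is an assumption, and only the output directory matches the path above.

```python
import os

import cv2  # pip install opencv-python

video_dir = "./playground/data/eval/seed_bench/SEED-Bench-video"        # assumed input path
frame_dir = "./playground/data/eval/seed_bench/SEED-Bench-video-image"  # output path from step 2
os.makedirs(frame_dir, exist_ok=True)

for name in os.listdir(video_dir):
    cap = cv2.VideoCapture(os.path.join(video_dir, name))
    num_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Seek to the middle frame and save it as a PNG named after the video.
    cap.set(cv2.CAP_PROP_POS_FRAMES, num_frames // 2)
    success, frame = cap.read()
    if success:
        cv2.imwrite(os.path.join(frame_dir, os.path.splitext(name)[0] + ".png"), frame)
    cap.release()
```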

### LLaVA-Bench-in-the-Wild

1. Extract contents of [`llava-bench-in-the-wild`](https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild) to `./playground/data/eval/llava-bench-in-the-wild`.
2. Single-GPU inference and evaluation.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/llavabench.sh
```

### MM-Vet

1. Extract [`mm-vet.zip`](https://github.com/yuweihao/MM-Vet/releases/download/v1/mm-vet.zip) to `./playground/data/eval/mmvet`.
2. Single-GPU inference.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmvet.sh
```
3. Evaluate the predictions in `./playground/data/eval/mmvet/results` using the official Jupyter notebook.
226 changes: 226 additions & 0 deletions llava/eval/eval_mmbench.py
@@ -0,0 +1,226 @@
import argparse
import os
import json
import pandas as pd
from tqdm import tqdm
import openai
from concurrent.futures import ThreadPoolExecutor, as_completed
import math
import time


all_options = ['A', 'B', 'C', 'D']


def split_list(lst, n):
"""Split a list into n (roughly) equal-sized chunks"""
    chunk_size = math.ceil(len(lst) / n)  # ceiling division so every item ends up in a chunk
return [lst[i:i+chunk_size] for i in range(0, len(lst), chunk_size)]


def get_chunk(lst, n, k):
chunks = split_list(lst, n)
return chunks[k]


def get_row(df, colname, value):
assert (df[colname] == value).sum() == 1
return df[df[colname] == value].iloc[0]


def encode_query(question, options, answer):
query = ""
query += "Question: " + question + "\n"
query += "Options: " + "\n".join([f"{option_char}. {option}" for option_char, option in zip(all_options[:len(options)], options)]) + "\n"
query += "Answer: " + answer + "\n"
return query


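# Assemble keyword arguments for openai.ChatCompletion.create from environment
# variables, supporting both Azure OpenAI (the default) and the standard OpenAI API.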
def get_openai_api():
api_type = os.environ.get('API_TYPE', 'azure')

if api_type == 'azure':
api_key = os.environ.get('API_KEY', 'sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx')
engine = os.environ.get('ENGINE', 'chatgpt-turbo')
api_host = os.environ.get('API_BASE')
return {
'api_type': 'azure',
'api_version': '2023-06-01-preview',
'engine': engine,
'api_key': api_key,
'api_base': f'https://{api_host}.openai.azure.com',
}
else:
api_key = os.environ.get('API_KEY', 'sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx')
model = os.environ.get('MODEL', 'gpt-3.5-turbo-0301')

return {
'model': model,
'api_key': api_key,
}


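# Ask ChatGPT, primed with two few-shot examples, to match a free-form answer to one of
# the option letters ('X' means no option matches). Transient API errors are retried up
# to num_retry times; authentication/invalid-request errors abort and return None.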
def chatgpt_extract_answer(
question, options, answer, max_tokens=64, temperature=0.2, top_p=0.9, frequency_penalty=0, presence_penalty=0,
request_timeout=None, num_retry=1):
api_kwargs = get_openai_api()

system_message = """You are an AI assistant to help me matching an answer with several options of a multiple choice question.
You are provided with a question, several options, and an answer, and you need to find which option is most similar to the answer.
If the meaning of all options are significantly different from the answer, output X.
You should output a single uppercase character in A, B, C, D, if they are valid options, and X otherwise."""
exemplers = [
{
"question": "What is the main object in image?",
"options": ["teddy bear", "rabbit", "cat", "dog"],
"answer": "a cute teddy bear",
"output": "A",
},
{
"question": "What is the main object in image?",
"options": ["teddy bear", "rabbit", "cat", "dog"],
"answer": "Spider",
"output": "X",
},
]

messages = [
{"role": "system", "content": system_message},
]
for exempler in exemplers:
messages.append({"role": "user", "content": encode_query(exempler['question'], exempler['options'], exempler['answer'])})
messages.append({"role": "assistant", "content": exempler['output']})
messages.append({"role": "user", "content": encode_query(question, options, answer)})

response = None
attempts = []
for i in range(num_retry):
try:
response = openai.ChatCompletion.create(
messages = messages,
max_tokens = max_tokens,
temperature = temperature,
top_p = top_p,
frequency_penalty = frequency_penalty,
presence_penalty = presence_penalty,
request_timeout = request_timeout,
**api_kwargs
)
except Exception as e:
if type(e) in [openai.error.RateLimitError, openai.error.APIError, openai.error.APIConnectionError, openai.error.Timeout]:
pass
elif type(e) in [openai.error.AuthenticationError, openai.error.InvalidRequestError]:
print(e)
return None
else:
print(type(e), e)
attempts.append(e.__class__.__name__)
time.sleep(1)
else:
time.sleep(1)
break

if response is None:
print(f'All {num_retry} attempts failed: {attempts}. Returning None.')
return None

content = response['choices'][0]['message']['content']
content = content.strip()
return content

def is_none(value):
if value is None:
return True
if type(value) is float and math.isnan(value):
return True
if type(value) is str and value.lower() == 'nan':
return True
if type(value) is str and value.lower() == 'none':
return True
return False

def get_options(row, options):
parsed_options = []
for option in options:
option_value = row[option]
if is_none(option_value):
break
parsed_options.append(option_value)
return parsed_options

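# Rule-based parsing tried before the ChatGPT fallback: accept a bare option letter,
# a "The answer is X." pattern, or the answer text appearing inside exactly one option.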
def auto_parse_answer(question, options, answer):
if answer.strip('.').strip().upper() in all_options[:len(options)]:
return answer.strip('.').strip().upper()
expand_option_valid = [f'The answer is {option}.'.lower() in answer.lower() for option in all_options[:len(options)]]
if any(expand_option_valid):
return all_options[expand_option_valid.index(True)]

matched_ops = [all_options[_i] for _i, option in enumerate(options) if answer.lower() in option.lower()]
if len(matched_ops) == 1:
return matched_ops[0]
return None

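# Parse every recorded answer into an option letter (rule-based first, ChatGPT as a
# fallback) and append the parsed rows to the results file; previously parsed entries
# are skipped, so the script can be resumed safely.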
def eval_results(args):
questions = pd.read_table(os.path.expanduser(args.question_file))
answers = [json.loads(line) for line in open(os.path.expanduser(args.answers_file))]
answers = {(row['question_id'], row.get('round_id', 0)): row for row in answers}
results_file = os.path.expanduser(args.results_file)
if os.path.exists(results_file):
results = [json.loads(line) for line in open(results_file)]
results = {(row['question_id'], row.get('round_id', 0)): row for row in results}
else:
results = {}
results_writer = open(results_file, 'a')

def process_answer(idx, answer):
if idx in results:
return None
question_id, round_id = idx
question_data = get_row(questions, 'index', question_id)
if 'options' in answer:
options = answer['options']
option_char = answer['option_char']
else:
assert round_id == 0, "round_id must be 0 when options are not provided"
options = get_options(question_data, all_options)
option_char = all_options[:len(options)]
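        # Map canonical letters (A-D) to the option letters actually presented with this
        # answer; 'X' (no matching option) passes through unchanged.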
option_map = {all_options[i]: option_char[i] for i in range(len(options))}
option_map['X'] = 'X'
parsed_answer = auto_parse_answer(question_data['question'], options, answer['text'])
if parsed_answer is None:
parsed_answer = chatgpt_extract_answer(
question_data['question'], options, answer['text'],
request_timeout=args.request_timeout, num_retry=args.num_retry)
if parsed_answer is None:
return None
if parsed_answer not in option_map:
print(f'Invalid parsed answer: {parsed_answer}')
return None
answer['parsed_answer'] = option_map[parsed_answer]
return answer

with ThreadPoolExecutor(max_workers=args.max_workers) as executor:
# Submit all tasks to the executor
futures = {executor.submit(process_answer, key, value): key for key, value in answers.items()}

# Process results as they become available
for future in tqdm(as_completed(futures), total=len(answers)):
answer = future.result()
if answer is not None:
results_writer.write(json.dumps(answer) + '\n')
results_writer.flush()

results_writer.close()


if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--question-file", type=str, default="tables/question.jsonl")
parser.add_argument("--answers-file", type=str, default="answer.jsonl")
parser.add_argument("--results-file", type=str, default="results.jsonl")
parser.add_argument("--max-workers", type=int, default=1)
parser.add_argument("--num-retry", type=int, default=3)
parser.add_argument("--request-timeout", type=int, default=None)
args = parser.parse_args()

eval_results(args)