
update README.md
JamesZhutheThird committed Feb 22, 2024
1 parent 8c7b675 commit e2c9ecd
Showing 5 changed files with 264 additions and 53 deletions.
146 changes: 126 additions & 20 deletions README.md

![MULTI](./docs/static/images/overview.png)

🌐 [Website](https://OpenDFM.github.io/MULTI-Benchmark/) | 📃 [Paper](https://arxiv.org/abs/2402.03173/) | 🤗 [Dataset](https://huggingface.co/datasets/OpenDFM/MULTI-Benchmark) | 🎯 [Leaderboard]() (Coming Soon)

[简体中文](./README_zh.md) | English


## 📖 Overview

Rapid progress in multimodal large language models (MLLMs) highlights the need to introduce challenging yet realistic benchmarks to the academic community, while existing benchmarks primarily focus on understanding simple natural images and short context. In this paper, we present ***MULTI***, a cutting-edge benchmark for evaluating MLLMs on understanding complex tables and images and reasoning with long context. **MULTI** provides multimodal inputs and requires responses that are either precise or open-ended, reflecting real-life examination styles. **MULTI** includes over 18,000 questions and challenges MLLMs with a variety of tasks, ranging from formula derivation to image detail analysis and cross-modality reasoning. We also introduce ***MULTI-Elite***, a hard subset of 500 selected questions, and ***MULTI-Extend***, with more than 4,500 external knowledge context pieces. Our evaluation indicates significant potential for MLLM advancement, with GPT-4V achieving a **63.7%** accuracy rate on **MULTI**, in contrast to other MLLMs scoring between **28.5%** and **55.3%**. **MULTI** serves not only as a robust evaluation platform but also paves the way for the development of expert-level AI.

## 🏆 Leaderboard

| Modality | Model         | Version                    | Overall | MULTI-Elite |
|:--------:|:-------------:|----------------------------|:-------:|:-----------:|
| 🖼️       | GPT-4V        | gpt-4-vision-preview       | 63.7    | 14.0        |
| 🖼️       | Yi-VL         | Yi-34B-Chat                | 55.3    | 26.2        |
| 🖼️       | Gemini Vision | gemini-pro-vision          | 53.7    | 12.4        |
| 📃       | Gemini        | gemini-pro                 | 52.2    | 10.5        |
| 📃       | GPT-4         | gpt-4-1106-preview         | 50.2    | 5.8         |
| 📃       | DFM-2.0       | dfm-2.0-70b-preview        | 49.7    | 18.0        |
| 🖼️       | InternVL      | InternVL-Chat-Chinese-V1.1 | 44.9    | 20.7        |
| 🖼️       | Qwen-VL       | Qwen-VL-Chat               | 39.0    | 10.5        |
| 📃       | ChatGPT       | gpt-3.5-turbo-1106         | 35.9    | 4.7         |
| 🖼️       | VisCPM        | VisCPM-Chat                | 33.4    | 13.0        |
| 📃       | MOSS          | moss-moon-003-sft          | 32.6    | 13.1        |
| 🖼️       | VisualGLM     | visualglm-6b               | 31.1    | 12.8        |
| 🖼️       | Chinese-LLaVA | Chinese-LLaVA-Cllama2      | 28.5    | 12.3        |

For more details, please visit our [leaderboard]() (Coming Soon).

## ⏬ Download

You can download the dataset from the [HuggingFace Page](https://huggingface.co/datasets/OpenDFM/MULTI-Benchmark). Current [version](https://huggingface.co/datasets/OpenDFM/MULTI-Benchmark/blob/main/MULTI_v1.2.2_20240212_release.zip) is `v1.2.2`. Unzip the files and put them under `data`.

```
wget https://huggingface.co/datasets/OpenDFM/MULTI-Benchmark/resolve/main/MULTI_v1.2.2_20240212_release.zip
unzip MULTI_v1.2.2_20240212_release.zip -d ./data/
```

The structure of `data` should be something like:

```
data
├── images # folder containing images
├── problem_v1.2.2_20240212_release.json # MULTI
├── knowledge_v1.2.2_20240212_release.json # MULTI-Extend
├── hard_list_v1.2.1_20240206.json # MULTI-Elite
└── captions_v1.2.0_20231217.csv # image captions generated by BLIP-6.7b
```
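
If you want to double-check the download, a minimal sketch like the following loads the four files (it assumes the `v1.2.2` file names above and is run from the repository root; nothing is assumed about the internal JSON schemas beyond their being valid JSON):

```python
# Quick sanity check of the unzipped release, run from the repository root.
# File names correspond to the v1.2.2 release listed above.
import csv
import json

with open("data/problem_v1.2.2_20240212_release.json", encoding="utf-8") as f:
    problems = json.load(f)        # MULTI questions
with open("data/knowledge_v1.2.2_20240212_release.json", encoding="utf-8") as f:
    knowledge = json.load(f)       # MULTI-Extend knowledge pieces
with open("data/hard_list_v1.2.1_20240206.json", encoding="utf-8") as f:
    hard_list = json.load(f)       # MULTI-Elite question list

with open("data/captions_v1.2.0_20231217.csv", encoding="utf-8") as f:
    # Each row is "<image path>,<caption>"; captions may contain commas,
    # so everything after the first field is re-joined.
    captions = {row[0]: ",".join(row[1:]) for row in csv.reader(f) if row}

print(len(problems), len(knowledge), len(hard_list), len(captions))
```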

## 📝 How to Evaluate

We provide a unified evaluation framework in `eval`. Each file in `eval/models` contains an evaluator specific to one M/LLM and implements a `generate_answer` method that takes a question as input and returns its answer.

```shell
cd eval
python eval.py -h # to list all supported arguments
python eval.py -l # to list all supported models
```

### Environment Preparation Before Usage

Each evaluator requires its own environment setup, and a single universal environment may not work for all of them. **Just follow the official guide for each model.** If the corresponding model runs well on its own, it should also work within our framework.

You only need to install two additional packages to run the evaluation code:

```shell
pip install tiktoken tqdm
```

If you only want to generate data for a specific setting (using the `--debug` argument), the line above is all you need.

### Running Evaluation

For a quick start, see these examples:

Test the GPT-4V model on the full MULTI benchmark with multimodal input, using MULTI-Extend as external knowledge:

```shell
python eval.py \
--problem_file ../data/problem_v1.2.2_20240212_release.json \
--knowledge_file ../data/knowledge_v1.2.2_20240212_release.json \
--questions_type 0,1,2,3 \
--image_type 0,1,2 \
--input_type 2 \
--model gpt-4v \
--model_version gpt-4-vision-preview \
--api_key sk-************************************************
```

Test the Qwen-VL model on MULTI-Elite with image-caption input, skipping all questions that contain no images, evaluating only multiple-choice questions, and setting the CUDA device automatically:

```shell
python eval.py \
--problem_file ../data/problem_v1.2.2_20240212_release.json \
--subset ../data/hard_list_v1.2.1_20240206.json \
--caption_file ../data/captions_v1.2.0_20231217.csv \
--questions_type 0,1 \
--image_type 1,2 \
--input_type 1 \
--model qwen-vl \
--model_dir ../models/Qwen-VL-Chat
```

The evaluation script will generate a folder named `results` under the root directory, and the results will be saved in `../results/EXPERIMENT_NAME`. During the evaluation, the script saves checkpoints in `../results/EXPERIMENT_NAME/checkpoints`; you can delete them after the evaluation is done. If the evaluation is interrupted, you can resume from the last checkpoint:

```shell
python eval.py \
--checkpoint_dir ../results/EXPERIMENT_NAME
```

Most of the arguments are saved in `../results/EXPERIMENT_NAME/args.json`, so you can continue the evaluation without specifying all the arguments again. Please note that `--api_key` is not saved in `args.json` for security reasons, so you need to specify it again.

```shell
python eval.py \
--checkpoint_dir ../results/EXPERIMENT_NAME \
--api_key sk-************************************************
```

For more details on the arguments, use `python eval.py -h` and refer to `args.py` and `eval.py`.

### Add Support for Your Models

It is recommended to read the code of the existing evaluators in `eval/models` before implementing your own.

Create a `class YourModelEvaluator` and implement `generate_answer(self, question: dict)` to match the design expected by `eval.py` and `eval.sh`; this should largely ease the coding process.

**Do not forget to add a reference to your evaluator in `args.py` so that it can be used conveniently.**
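
For illustration only, a custom evaluator might look like the sketch below; the constructor arguments and question fields here are assumptions, so mirror the existing evaluators in `eval/models` for the exact interface that `eval.py` expects.

```python
# Hypothetical sketch of a custom evaluator; the real interface is defined by
# the existing classes in eval/models and by eval.py -- follow those.
from typing import Optional


class YourModelEvaluator:
    def __init__(self, model_dir: str, model_version: Optional[str] = None):
        # Load your model and tokenizer from model_dir here.
        self.model_dir = model_dir
        self.model_version = model_version

    def generate_answer(self, question: dict) -> str:
        # The field names below are placeholders for illustration; use the
        # keys that eval.py actually passes when building your prompt.
        prompt = question.get("question_text", "")
        images = question.get("image_list", [])
        # Run your model on (prompt, images) and return its answer string.
        return "A"
```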

You can run `model_tester.py` in the `eval` folder to check the correctness of your implementation. Various problems, including implementation errors, small bugs in the code, and even wrong environment settings, may cause the evaluation to fail. The examples provided in the file cover most kinds of cases presented in our benchmark. Feel free to change the code in it to debug your implementation 😊

```shell
python model_tester.py <args> # args are similar to the default settings above
```

### Create Captions and OCR Data for Images

Generate captions or OCR data for the images and save them in a CSV file with the format below:

```
../data/images/czls/502_1.png,a cartoon drawing of a man standing in front of a large block
../data/images/czls/525_1.png,a chinese newspaper with the headline, china's new year
...
```

We provide two example scripts to generate captions (`image_caption.py`) and OCR data (`image_ocr.py`) for images.
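
If you prefer to write your own captioning script rather than adapt `image_caption.py`, a rough sketch along these lines would produce the CSV format above; the Hugging Face BLIP-2 checkpoint named here is an assumption and may differ from the exact BLIP-6.7b pipeline used for the released captions.

```python
# Rough sketch: caption every PNG under ../data/images and write rows in the
# "<image path>,<caption>" format shown above. The checkpoint is an assumption;
# swap in whichever captioning model you actually use.
from pathlib import Path

import torch
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-6.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-6.7b", torch_dtype=dtype
).to(device)

with open("../data/captions_custom.csv", "w", encoding="utf-8") as out:
    for image_path in sorted(Path("../data/images").rglob("*.png")):
        image = Image.open(image_path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt").to(device, dtype)
        ids = model.generate(**inputs, max_new_tokens=30)
        caption = processor.decode(ids[0], skip_special_tokens=True).strip()
        out.write(f"{image_path},{caption}\n")
```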

## 📮 How to Submit

You first need to prepare a UTF-8 encoded JSON file with the following format:

```
{
    "czsx_0_0": {
        "question_id": "czsx_0_0",
        "question_image_number": 1,
        "image_list": [...], # optional
        "input_message": ..., # optional
        "prediction": "C"
    },
    ...
}
```
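
If you assemble the file yourself instead of using the official evaluation code, a minimal sketch such as the following writes the predictions as UTF-8 JSON in the layout above (the IDs, answers, and output file name are placeholders):

```python
# Sketch: build a submission file in the format above (placeholder values).
import json

predictions = {
    "czsx_0_0": {
        "question_id": "czsx_0_0",
        "question_image_number": 1,
        "prediction": "C",  # image_list / input_message may also be included
    },
}

with open("prediction.json", "w", encoding="utf-8") as f:
    json.dump(predictions, f, ensure_ascii=False, indent=2)
```
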
If you evaluate the model with our official code, you can simply zip the experiment result folder `results/EXPERIMENT_NAME`.

Then, you can submit your result to our [evaluation platform](https://wj.sjtu.edu.cn/q/89UmRAJn) (Coming Soon).

You are also welcome to open a pull request and contribute your code to our evaluation framework. We will be very grateful for your contribution!

**[Notice]** Thank you for being so interested in the **MULTI** dataset! As the automated evaluation platform is not yet online, please fill in [this questionnaire](https://wj.sjtu.edu.cn/q/89UmRAJn) to get your evaluation results. Your information will be kept strictly confidential, so please feel free to fill it out. 🤗


## 📑 Citation

If you find our work useful, please cite us!

```
@misc{zhu2024multi,
```

116 changes: 110 additions & 6 deletions README_zh.md

## ⏬ Download

You can download the dataset from the [HuggingFace Page](https://huggingface.co/datasets/OpenDFM/MULTI-Benchmark). The latest [version](https://huggingface.co/datasets/OpenDFM/MULTI-Benchmark/blob/main/MULTI_v1.2.2_20240212_release.zip) is `v1.2.2`. Unzip the files and put them under `data`.

```
wget https://huggingface.co/datasets/OpenDFM/MULTI-Benchmark/resolve/main/MULTI_v1.2.2_20240212_release.zip
unzip MULTI_v1.2.2_20240212_release.zip -d ./data/
```

The structure of `data` should be as follows:

```
data
├── images # folder containing images
├── problem_v1.2.2_20240212_release.json # MULTI
├── knowledge_v1.2.2_20240212_release.json # MULTI-Extend
├── hard_list_v1.2.1_20240206.json # MULTI-Elite
└── captions_v1.2.0_20231217.csv # image captions generated by BLIP-6.7b
```

## 📝 How to Evaluate

We provide a unified evaluation framework in `eval`. Each file in `eval/models` contains an evaluator specific to one M/LLM and implements a `generate_answer` method that takes a question as input and returns its answer.

```shell
cd eval
python eval.py -h # to list all supported arguments
python eval.py -l # to list all supported models
```

### Environment Preparation Before Usage

Each model requires its own environment setup, and a single universal environment may not work for evaluating all models. **Just follow the official documentation for each model.** If the corresponding model runs well on its own, it should also fit into our framework.

You only need to install two additional packages to run the evaluation code:

```shell
pip install tiktoken tqdm
```

If you only want to generate data for a specific setting (using the `--debug` argument), the line above is all you need.

### Running Evaluation

For a quick start, see these examples:

Test the GPT-4V model on MULTI with multimodal input, using MULTI-Extend as external knowledge:

```shell
python eval.py \
--problem_file ../data/problem_v1.2.2_20240212_release.json \
--knowledge_file ../data/knowledge_v1.2.2_20240212_release.json \
--questions_type 0,1,2,3 \
--image_type 0,1,2 \
--input_type 2 \
--model gpt-4v \
--model_version gpt-4-vision-preview \
--api_key sk-************************************************
```

Test the Qwen-VL model on MULTI-Elite with image-caption input, skipping all questions that contain no images, evaluating only multiple-choice questions, and setting the CUDA device automatically:

```shell
python eval.py \
--problem_file ../data/problem_v1.2.2_20240212_release.json \
--subset ../data/hard_list_v1.2.1_20240206.json \
--caption_file ../data/captions_v1.2.0_20231217.csv \
--questions_type 0,1 \
--image_type 1,2 \
--input_type 1 \
--model qwen-vl \
--model_dir ../models/Qwen-VL-Chat
```

The evaluation script will generate a folder named `results` under the root directory, and the results will be saved in `../results/EXPERIMENT_NAME`. During the evaluation, the script saves checkpoints in `../results/EXPERIMENT_NAME/checkpoints`; you can delete them after the evaluation is done. If the evaluation is interrupted, you can resume from the last checkpoint:

```shell
python eval.py \
--checkpoint_dir ../results/EXPERIMENT_NAME
```

Most of the arguments are saved in `../results/EXPERIMENT_NAME/args.json`, so you can continue the evaluation without specifying all the arguments again. Please note that `--api_key` is not saved in `args.json` for security reasons, so you need to specify it again.

```shell
python eval.py \
--checkpoint_dir ../results/EXPERIMENT_NAME \
--api_key sk-************************************************
```

For more details on the arguments, use `python eval.py -h` and refer to `args.py` and `eval.py`.

### Add Support for Your Models

It is recommended to read the code of the other evaluators in `eval/models` before implementing your own.

Create a `class YourModelEvaluator` and implement `generate_answer(self, question: dict)` to match the design expected by `eval.py` and `eval.sh`; this should largely ease the coding process.

**Do not forget to add a reference to your evaluator in `args.py` so that it can be used conveniently.**

You can run `model_tester.py` in the `eval` folder to check the correctness of your implementation. Various problems, including implementation errors, small bugs in the code, and even wrong environment settings, may cause the evaluation to fail. The examples provided in the file cover most kinds of cases presented in our benchmark. Feel free to change the code in it to debug your implementation 😊

```shell
python model_tester.py <args> # args are similar to the default settings above
```

### Create Captions and OCR Data for Images

Generate captions or OCR data for the images and save them in a CSV file with the format below:

```
../data/images/czls/502_1.png,a cartoon drawing of a man standing in front of a large block
../data/images/czls/525_1.png,a chinese newspaper with the headline, china's new year
...
```

We provide two example scripts to generate captions (`image_caption.py`) and OCR data (`image_ocr.py`) for images.

## 📮 How to Submit


Then, you can submit your result to our [evaluation platform]() (Coming Soon).

You are also welcome to open a pull request and contribute your code to our evaluation framework. We will be very grateful for your contribution!

**[Notice]** Thank you for your interest in the **MULTI** dataset! As the automated evaluation platform is not yet online, please fill in [this questionnaire](https://wj.sjtu.edu.cn/q/89UmRAJn) to get your evaluation results. Your information will be kept strictly confidential, so please feel free to fill it out. 🤗

## 📑 Citation

If you find our work useful, please cite us!
Binary file not shown.
