Commit

[ColossalEval] Support GSM, Data Leakage Evaluation and Tensor Parallel (hpcaitech#5169)

* Support GSM, Data Leakage Evaluation and Tensor Parallel

* remove redundant code and update inference.py in examples/gpt_evaluation

---------

Co-authored-by: Xu Yuanchen <[email protected]>
chengeharrison and Xu Yuanchen authored Dec 12, 2023
1 parent b07a6f4 commit cefdc32
Showing 19 changed files with 578 additions and 100 deletions.
46 changes: 41 additions & 5 deletions applications/ColossalEval/README.md
@@ -37,7 +37,7 @@
- [Citations](#citations)

## Overview
[ColossalEval](https://github.com/hpcaitech/ColossalAI/tree/main/applications/ColossalEval) is a project that provides a unified pipeline for evaluating language models on public datasets or your own dataset, using both classic metrics and GPT-assisted evaluation. Currently we support AGIEval, CEval, CMMLU, CValues, GAOKAO-Bench, GSM8K, LongBench, MMLU, MT-Bench and SafetyBench. More details can be found in the following sections.

## Leaderboard

@@ -101,7 +101,7 @@ The evaluation process involves two steps: `inference` and `evaluation`.

### Inference

The inference process consists of two parts. We now support tensor parallel inference for large models using [ShardFormer](colossalai/shardformer) in the [example](applications/ColossalEval/examples/dataset_evaluation/inference.py) script.
1. Preprocess and convert the original dataset.
2. Configure your tokenizer and model arguments to perform zero-shot or few-shot prompting.

@@ -193,7 +193,7 @@ In this step, you will configure your tokenizer and model arguments to infer on

A config file consists of two parts.
1. Model config. In the model config, you need to specify the model name, model path, model class, tokenizer arguments and model arguments. For the model class, we currently support `HuggingFaceModel`, `HuggingFaceCausalLM`, `ChatGLMModel` and `ChatGLMModel2`. `HuggingFaceModel` is for models that can be loaded with `AutoModel`, and `HuggingFaceCausalLM` is for models that can be loaded with `AutoModelForCausalLM`. `ChatGLMModel` and `ChatGLMModel2` are for ChatGLM and ChatGLM2 models respectively. You can check all model classes in `colossal_eval/models/__init__.py`. If your model requires `trust_remote_code` to be true, specify it in both the `tokenizer_kwargs` and `model_kwargs` fields.
2. Dataset config. In the dataset config, you need to specify the dataset name, path and dataset class. Currently, we support zero-shot prompting on MMLU, CMMLU, AGIEval, GAOKAO-Bench, GSM8K and LongBench, and few-shot prompting on MMLU, CMMLU, AGIEval and GSM8K. If you want to enable few-shot prompting, set `few_shot` to true. You can check all dataset classes in `colossal_eval/dataset/__init__.py`. A hypothetical sketch of both parts follows below.
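
To make the two parts concrete, here is a hypothetical config sketch expressed as a Python dict, using only the fields described above. The key names are illustrative assumptions, not ColossalEval's exact schema (see the example config referenced below for the real format):

```python
# Hypothetical config sketch -- key names are illustrative, not ColossalEval's exact schema.
config = {
    "model": [
        {
            "name": "my-llama-7b",                 # model name (placeholder)
            "model_class": "HuggingFaceCausalLM",  # one of the supported model classes
            "path": "/path/to/model",
            "tokenizer_kwargs": {"trust_remote_code": True},
            "model_kwargs": {"trust_remote_code": True},
        }
    ],
    "dataset": [
        {
            "name": "cmmlu",                       # dataset name (placeholder)
            "dataset_class": "CMMLUDataset",
            "path": "/path/to/cmmlu",
            "few_shot": True,                      # enable few-shot prompting
        }
    ],
}
```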

Once you have all configs ready, the program will run inference on all the given datasets with all the given models.

@@ -236,17 +236,20 @@ An example config using model class `HuggingFaceCausalLM` and dataset class `CMMLUDataset`

Currently, we support Hugging Face models. The `tokenizer_kwargs` holds the arguments passed to `AutoTokenizer.from_pretrained()`, and the `model_kwargs` holds the arguments passed to `AutoModel.from_pretrained()` or `AutoModelForCausalLM.from_pretrained()`. Set `few_shot` to true if you want to enable few-shot prompting for the dataset, and set `debug` to true if you want to verify whether your prompt is correct.
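
As a rough sketch of how these fields are consumed (standard Hugging Face loading; the path and kwargs below are placeholders, not ColossalEval's actual code):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/model"  # placeholder
tokenizer_kwargs = {"trust_remote_code": True}
model_kwargs = {"torch_dtype": "auto", "trust_remote_code": True}

# tokenizer_kwargs feed AutoTokenizer.from_pretrained(); model_kwargs feed
# AutoModelForCausalLM.from_pretrained() (or AutoModel for `HuggingFaceModel`).
tokenizer = AutoTokenizer.from_pretrained(model_path, **tokenizer_kwargs)
model = AutoModelForCausalLM.from_pretrained(model_path, **model_kwargs)
```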

> For the GSM8K dataset, you can additionally set the `load_train` or `load_reference` flags to true in the dataset config. During inference, the program will then calculate the loss summation over all tokens for each data sample. During evaluation, you can use the metric `loss_over_all_tokens` to calculate the overall loss and use it for data leakage evaluation.
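
As a minimal sketch of the per-sample loss summation described above (assuming a Hugging Face causal LM; the helper below is illustrative, not ColossalEval's actual implementation):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def loss_sum_over_tokens(model, input_ids: torch.Tensor) -> float:
    """Sum (not average) the next-token losses over all tokens of one sample."""
    logits = model(input_ids=input_ids).logits
    # Shift so that tokens < n predict token n.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="sum",  # loss summation over all tokens, used by `loss_over_all_tokens`
    ).item()
```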
#### How to Use
An example script is shown below. The `configs/dataset_evaluation/inference.py` script is the same in all provided examples.

```shell
torchrun --nproc_per_node=4 inference.py \
    --config "path to config file" \
    --load_dataset \
    --tp_size 2 \
    --inference_save_path "path to save inference results"
```

You should specify the path to the config file in `config`. If you have already saved the converted dataset, you can run the script without `load_dataset`; otherwise, set it to first load the original dataset and save the converted dataset. You should specify the path for saving inference results in `inference_save_path`. If you want to use tensor parallel inference, specify the tensor parallel size in `--tp_size`, and the process will automatically calculate the data parallel size.
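
How the data parallel size could be derived is sketched below (an assumption about the script's internals, not its exact code; the process group is set up by `torchrun`):

```python
import torch.distributed as dist

def data_parallel_size(tp_size: int) -> int:
    """Derive the data parallel size from the launched world size and --tp_size."""
    world_size = dist.get_world_size()  # e.g. 4 with --nproc_per_node=4
    assert world_size % tp_size == 0, "world size must be divisible by tp_size"
    return world_size // tp_size        # e.g. 4 // 2 = 2 data parallel groups
```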

### Evaluation

@@ -371,11 +374,13 @@ To make it easier to set the config, you only need to specify all metrics y
- `classification_score`: Calculate the classification score between prediction and reference. It determines whether the output (a class) is equal to the reference. It is used in LongBench.
- `code_sim_score`: Calculate the similarity score between prediction and reference. It is used in LongBench.
- `count_score`: Calculate the count score between prediction and reference. It determines whether the output (the number of given passages) is equal to the reference. It is used in LongBench.
- `gsm_accuracy`: Calculate the accuracy of the predicted answer against the reference. It is used in GSM8K.
- `perplexity`: Calculate perplexity. The formula is $ perplexity = \frac{1}{n} \sum_i e^{loss_i} $ where $n$ is the number of samples and $ loss_i $ is the average loss for sample $ i $. It can be used in all datasets.
- `ppl_score`: Calculate perplexity score. The formula is $ ppl\_score = \frac{1}{n} \sum_i e^{-loss_i} $ where $n$ is the number of samples and $ loss_i $ is the average loss for sample $ i $. It can be used in all datasets.
- `ppl_score_over_choices`: Calculate perplexity score over choices. The formula is $ ppl\_score\_over\_choices = \frac{1}{n} \sum_i e^{-loss\_over\_choices_i} $ where $n$ is the number of samples and $ loss\_over\_choices_i $ is the loss on the first predicted token for sample $ i $. It can be used in all datasets that contain single-choice questions.
- `per_byte_perplexity`: Calculate per-byte perplexity. The formula is $ \frac{1}{n} \sum_i e^{\frac{loss_i}{byte_i}} $ where $n$ is the number of samples, $ loss_i $ is the total loss for sample $ i $ and $ byte_i $ is the number of bytes sample $ i $ occupies. It can be used in all datasets.
- `per_byte_ppl_score`: Calculate per-byte perplexity score. The formula is $ \frac{1}{n} \sum_i e^{-\frac{loss_i}{byte_i}} $ where $n$ is the number of samples, $ loss_i $ is the total loss for sample $ i $ and $ byte_i $ is the number of bytes sample $ i $ occupies. It can be used in all datasets.
- `loss_over_all_tokens`: Calculate the loss over all tokens. The formula is $ loss\_over\_all\_tokens = \frac{1}{n} \sum_i loss_i $ where $n$ is the total number of tokens in the dataset and $ loss_i $ is the loss summation over all tokens for sample $ i $. It can be used in all datasets and is the metric used for data leakage evaluation (a sketch of these metrics follows this list).
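
A minimal sketch of the perplexity-style metrics above, assuming per-sample losses and byte/token counts were already collected during inference (function and variable names are illustrative, not ColossalEval's actual code):

```python
import math

def perplexity(avg_losses: list[float]) -> float:
    # (1/n) * sum_i exp(loss_i), where loss_i is the average loss of sample i
    return sum(math.exp(loss) for loss in avg_losses) / len(avg_losses)

def ppl_score(avg_losses: list[float]) -> float:
    # (1/n) * sum_i exp(-loss_i)
    return sum(math.exp(-loss) for loss in avg_losses) / len(avg_losses)

def per_byte_perplexity(total_losses: list[float], byte_counts: list[int]) -> float:
    # (1/n) * sum_i exp(loss_i / byte_i), where loss_i is the total loss of sample i
    return sum(math.exp(l / b) for l, b in zip(total_losses, byte_counts)) / len(total_losses)

def loss_over_all_tokens(loss_sums: list[float], token_counts: list[int]) -> float:
    # sum of per-sample loss summations divided by the dataset's total token count
    return sum(loss_sums) / sum(token_counts)
```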

We use `combined_single_choice_accuracy` and `first_token_logit` in the leaderboard.

@@ -520,6 +525,15 @@ year={2023}
primaryClass={cs.CL}
}
@misc{xu2023cvalues,
title={CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility},
author={Guohai Xu and Jiayi Liu and Ming Yan and Haotian Xu and Jinghui Si and Zhuoran Zhou and Peng Yi and Xing Gao and Jitao Sang and Rong Zhang and Ji Zhang and Chao Peng and Fei Huang and Jingren Zhou},
year={2023},
eprint={2307.09705},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@inproceedings{Zhang2023EvaluatingTP,
title={Evaluating the Performance of Large Language Models on GAOKAO Benchmark},
author={Xiaotian Zhang and Chunyang Li and Yi Zong and Zhengyu Ying and Liang He and Xipeng Qiu},
@@ -542,6 +556,20 @@ year={2023}
year={2021}
}
@article{zhang2023safetybench,
title={SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions},
author={Zhexin Zhang and Leqi Lei and Lindong Wu and Rui Sun and Yongkang Huang and Chong Long and Xiao Liu and Xuanyu Lei and Jie Tang and Minlie Huang},
journal={arXiv preprint arXiv:2309.07045},
year={2023}
}
@article{cobbe2021training,
title={Training verifiers to solve math word problems},
author={Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and others},
journal={arXiv preprint arXiv:2110.14168},
year={2021}
}
@article{hendrycks2021ethics,
title={Aligning AI With Shared Human Values},
author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
@@ -558,4 +586,12 @@ year={2023}
primaryClass={cs.CL}
}
@misc{wei2023skywork,
title={Skywork: A More Open Bilingual Foundation Model},
author={Tianwen Wei and Liang Zhao and Lichang Zhang and Bo Zhu and Lijie Wang and Haihua Yang and Biye Li and Cheng Cheng and Weiwei Lü and Rui Hu and Chenxia Li and Liu Yang and Xilin Luo and Xuejie Wu and Lunan Liu and Wenjun Cheng and Peng Cheng and Jianhao Zhang and Xiaoyu Zhang and Lei Lin and Xiaokun Wang and Yutuan Ma and Chuanhai Dong and Yanqi Sun and Yifu Chen and Yongyi Peng and Xiaojuan Liang and Shuicheng Yan and Han Fang and Yahui Zhou},
year={2023},
eprint={2310.19341},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
2 changes: 2 additions & 0 deletions applications/ColossalEval/colossal_eval/dataset/__init__.py
@@ -5,6 +5,7 @@
from .colossalai import ColossalDataset
from .cvalues import CValuesDataset
from .gaokaobench import GaoKaoBenchDataset
from .gsm import GSMDataset
from .longbench import LongBenchDataset
from .mmlu import MMLUDataset
from .mtbench import MTBenchDataset
@@ -24,4 +25,5 @@
"SafetyBenchENDataset",
"SafetyBenchZHDataset",
"CValuesDataset",
"GSMDataset",
]
17 changes: 14 additions & 3 deletions applications/ColossalEval/colossal_eval/dataset/agieval.py
@@ -99,11 +99,20 @@ def get_prompt(line: Dict, dataset_name: str, logger: DistributedLogger) -> Dict

# process few-shot raw_prompts
def combine_prompt(prompt_path, dataset_name, load_explanation=True, chat_mode=False):
    demostrations = []
    demostration_en = "Here are the answers for the problems in the exam."
    demostration_zh = "以下是考试中各个问题的答案。"

    if dataset_name in english_qa_datasets or dataset_name in english_cloze_datasets:
        demostrations.append(demostration_en)
    elif dataset_name in chinese_qa_datasets or dataset_name in chinese_cloze_datasets:
        demostrations.append(demostration_zh)

    skip_passage = False
    if dataset_name == "sat-en-without-passage":
        skip_passage = True
        dataset_name = "sat-en"

    # read the prompts by context and explanation
    context_row = [0, 1, 3, 5, 7, 9]
    explanation_row = [0, 2, 4, 6, 8, 10]
@@ -153,7 +162,7 @@ def combine_prompt(prompt_path, dataset_name, load_explanation=True, chat_mode=F
    if chat_mode:
        demostrations.append((question_input,))
    else:
        demostrations.append(question_input)

    return demostrations

@@ -178,7 +187,9 @@ class AGIEvalDataset(BaseDataset):
"""

@staticmethod
def load(path: str, logger: DistributedLogger, few_shot: bool) -> List[Dict]:
def load(
path: str, logger: DistributedLogger, few_shot: bool, forward_only: bool, load_train: bool, load_reference: bool
) -> List[Dict]:
dataset = {"test": {}}

files = glob.glob(os.path.join(path, "*.jsonl"))
4 changes: 2 additions & 2 deletions applications/ColossalEval/colossal_eval/dataset/base.py
@@ -12,8 +12,8 @@ class BaseDataset:
        logger: Logger for the dataset.
    """

    def __init__(self, path, logger, few_shot, forward_only=False, load_train=False, load_reference=False):
        self.dataset = self.load(path, logger, few_shot, forward_only, load_train, load_reference)

    def save(self, save_path):
        """Save the converted dataset"""
10 changes: 6 additions & 4 deletions applications/ColossalEval/colossal_eval/dataset/ceval.py
@@ -71,8 +71,8 @@
}


def get_few_shot_data(data: List[Dict], subject):
    few_shot_data = [f"以下是中国关于{subject}考试的单项选择题,请选出其中的正确答案。"]
    for i in data:
        few_shot_data.append(i["input"] + i["target"])
    return few_shot_data
@@ -86,7 +86,9 @@ class CEvalDataset(BaseDataset):
"""

@staticmethod
def load(path: str, logger: DistributedLogger, few_shot: bool) -> List[Dict]:
def load(
path: str, logger: DistributedLogger, few_shot: bool, forward_only: bool, load_train: bool, load_reference: bool
) -> List[Dict]:
dataset = {"dev": {}, "test": {}}
for split in ["dev", "test"]:
files = os.listdir(os.path.join(path, split))
@@ -105,7 +107,7 @@ def load(path: str, logger: DistributedLogger, few_shot: bool) -> List[Dict]:

if split == "test" and few_shot:
dataset[split][subject]["inference_kwargs"]["few_shot_data"] = get_few_shot_data(
dataset["dev"][subject]["data"]
dataset["dev"][subject]["data"], subject
)

with open(file_dir, encoding="utf-8") as f:
10 changes: 6 additions & 4 deletions applications/ColossalEval/colossal_eval/dataset/cmmlu.py
@@ -86,8 +86,8 @@
}


def get_few_shot_data(data: List[Dict], subject):
    few_shot_data = [f"以下是关于{subject}的单项选择题,请直接给出正确答案的选项。"]
    for i in data:
        few_shot_data.append(i["input"] + i["target"])
    return few_shot_data
@@ -101,7 +101,9 @@ class CMMLUDataset(BaseDataset):
"""

@staticmethod
def load(path: str, logger: DistributedLogger, few_shot: bool) -> List[Dict]:
def load(
path: str, logger: DistributedLogger, few_shot: bool, forward_only: bool, load_train: bool, load_reference: bool
) -> List[Dict]:
dataset = {"dev": {}, "test": {}}
for split in ["dev", "test"]:
files = os.listdir(os.path.join(path, split))
@@ -120,7 +122,7 @@ def load(path: str, logger: DistributedLogger, few_shot: bool) -> List[Dict]:

if split == "test" and few_shot:
dataset[split][subject]["inference_kwargs"]["few_shot_data"] = get_few_shot_data(
dataset["dev"][subject]["data"]
dataset["dev"][subject]["data"], subject
)

with open(file_dir, encoding="utf-8") as f:
applications/ColossalEval/colossal_eval/dataset/gaokaobench.py
@@ -69,7 +69,9 @@ class GaoKaoBenchDataset(BaseDataset):
"""

@staticmethod
def load(path: str, logger: DistributedLogger, few_shot: bool) -> List[Dict]:
def load(
path: str, logger: DistributedLogger, few_shot: bool, forward_only: bool, load_train: bool, load_reference: bool
) -> List[Dict]:
dataset = {"test": {}}
for category in ["Fill-in-the-blank_Questions", "Multiple-choice_Questions", "Open-ended_Questions"]:
files = os.listdir(os.path.join(path, "data", category))
Expand Down
