Commit

[ColossalEval] Support GSM, Data Leakage Evaluation and Tensor Parallel (hpcaitech#5169)

* Support GSM, Data Leakage Evaluation and Tensor Parallel

* remove redundant code and update inference.py in examples/gpt_evaluation

---------

Co-authored-by: Xu Yuanchen <[email protected]>
chengeharrison and Xu Yuanchen authored Dec 12, 2023
1 parent b07a6f4 commit cefdc32
Showing 19 changed files with 578 additions and 100 deletions.
46 changes: 41 additions & 5 deletions applications/ColossalEval/README.md
@@ -37,7 +37,7 @@
- [Citations](#citations)

## Overview
[ColossalEval](https://github.com/hpcaitech/ColossalAI/tree/main/applications/ColossalEval) is a project that provides a unified pipeline for evaluating language models on public datasets or your own dataset, using both classic metrics and GPT-assisted evaluation. Currently we support AGIEval, CEval, CMMLU, CValues, GAOKAO-Bench, GSM8K, LongBench, MMLU, MT-Bench and SafetyBench. More details can be found in the following sections.

## Leaderboard

@@ -101,7 +101,7 @@ The evaluation process involves two steps: `inference` and `evaluation`.

### Inference

The inference process consists of two parts. We now support tensor parallel inference for large models using [ShardFormer](colossalai/shardformer) in the [example](applications/ColossalEval/examples/dataset_evaluation/inference.py) script.
1. Preprocess and convert the original dataset.
2. Configure your tokenizer and model arguments to perform zero-shot or few-shot prompting.

@@ -193,7 +193,7 @@ In this step, you will configure your tokenizer and model arguments to infer on

A config file consists of two parts.
1. Model config. In the model config, you need to specify the model name, model path, model class, tokenizer arguments and model arguments. For the model class, we currently support `HuggingFaceModel`, `HuggingFaceCausalLM`, `ChatGLMModel` and `ChatGLMModel2`. `HuggingFaceModel` is for models that can be loaded with `AutoModel`, and `HuggingFaceCausalLM` is for models that can be loaded with `AutoModelForCausalLM`. `ChatGLMModel` and `ChatGLMModel2` are for ChatGLM and ChatGLM2 models respectively. You can check all model classes in `colossal_eval/models/__init__.py`. If your model requires `trust_remote_code` to be true, specify it in both the `tokenizer_kwargs` and `model_kwargs` fields.
2. Dataset config. In the dataset config, you need to specify the dataset name, path and dataset class. Currently, we support zero-shot prompting on MMLU, CMMLU, AGIEval, GAOKAO-Bench, GSM8K and LongBench, and few-shot prompting on MMLU, CMMLU, AGIEval and GSM8K. If you want to enable few-shot prompting, set `few_shot` to true. You can check all dataset classes in `colossal_eval/dataset/__init__.py`. A hypothetical sketch of both parts follows below.
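
To make the two parts concrete, here is a hypothetical config sketch expressed as a Python dict, using only the fields described above. The key names are illustrative assumptions, not ColossalEval's exact schema (see the example config referenced below for the real format):

```python
# Hypothetical config sketch -- key names are illustrative, not ColossalEval's exact schema.
config = {
    "model": [
        {
            "name": "my-llama-7b",                 # model name (placeholder)
            "model_class": "HuggingFaceCausalLM",  # one of the supported model classes
            "path": "/path/to/model",
            "tokenizer_kwargs": {"trust_remote_code": True},
            "model_kwargs": {"trust_remote_code": True},
        }
    ],
    "dataset": [
        {
            "name": "cmmlu",                       # dataset name (placeholder)
            "dataset_class": "CMMLUDataset",
            "path": "/path/to/cmmlu",
            "few_shot": True,                      # enable few-shot prompting
        }
    ],
}
```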

Once you have all configs ready, the program will run inference on all the given datasets with all the given models.

@@ -236,17 +236,20 @@ An example config using model class `HuggingFaceCausalLM` and dataset class `CMMLUDataset`

Currently, we support Hugging Face models. The `tokenizer_kwargs` holds the arguments passed to `AutoTokenizer.from_pretrained()`, and the `model_kwargs` holds the arguments passed to `AutoModel.from_pretrained()` or `AutoModelForCausalLM.from_pretrained()`. Set `few_shot` to true if you want to enable few-shot prompting for the dataset, and set `debug` to true if you want to verify whether your prompt is correct.
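
As a rough sketch of how these fields are consumed (standard Hugging Face loading; the path and kwargs below are placeholders, not ColossalEval's actual code):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/model"  # placeholder
tokenizer_kwargs = {"trust_remote_code": True}
model_kwargs = {"torch_dtype": "auto", "trust_remote_code": True}

# tokenizer_kwargs feed AutoTokenizer.from_pretrained(); model_kwargs feed
# AutoModelForCausalLM.from_pretrained() (or AutoModel for `HuggingFaceModel`).
tokenizer = AutoTokenizer.from_pretrained(model_path, **tokenizer_kwargs)
model = AutoModelForCausalLM.from_pretrained(model_path, **model_kwargs)
```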

> For the GSM8K dataset, you can additionally set the `load_train` or `load_reference` flags to true in the dataset config. During inference, the program will then calculate the loss summation over all tokens for each data sample. During evaluation, you can use the metric `loss_over_all_tokens` to calculate the overall loss and use it for data leakage evaluation.
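
As a minimal sketch of the per-sample loss summation described above (assuming a Hugging Face causal LM; the helper below is illustrative, not ColossalEval's actual implementation):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def loss_sum_over_tokens(model, input_ids: torch.Tensor) -> float:
    """Sum (not average) the next-token losses over all tokens of one sample."""
    logits = model(input_ids=input_ids).logits
    # Shift so that tokens < n predict token n.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="sum",  # loss summation over all tokens, used by `loss_over_all_tokens`
    ).item()
```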
#### How to Use
An example script is shown below. The `configs/dataset_evaluation/inference.py` script is the same in all provided examples.

```shell
torchrun --nproc_per_node=4 inference.py \
    --config "path to config file" \
    --load_dataset \
    --tp_size 2 \
    --inference_save_path "path to save inference results"
```

You should specify the path to the config file in `config`. If you have already saved the converted dataset, you can run the script without `load_dataset`; otherwise, set it to first load the original dataset and save the converted dataset. You should specify the path for saving inference results in `inference_save_path`. If you want to use tensor parallel inference, specify the tensor parallel size in `--tp_size`, and the process will automatically calculate the data parallel size.
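
How the data parallel size could be derived is sketched below (an assumption about the script's internals, not its exact code; the process group is set up by `torchrun`):

```python
import torch.distributed as dist

def data_parallel_size(tp_size: int) -> int:
    """Derive the data parallel size from the launched world size and --tp_size."""
    world_size = dist.get_world_size()  # e.g. 4 with --nproc_per_node=4
    assert world_size % tp_size == 0, "world size must be divisible by tp_size"
    return world_size // tp_size        # e.g. 4 // 2 = 2 data parallel groups
```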

### Evaluation

@@ -371,11 +374,13 @@ To make it easier to set the config, you only need to specify all metrics y
- `classification_score`: Calculate the classification score between prediction and reference. It determines whether the output (a class) is equal to the reference. It is used in LongBench.
- `code_sim_score`: Calculate the similarity score between prediction and reference. It is used in LongBench.
- `count_score`: Calculate the count score between prediction and reference. It determines whether the output (the number of given passages) is equal to the reference. It is used in LongBench.
- `gsm_accuracy`: Calculate the accuracy of the predicted answer against the reference. It is used in GSM8K.
- `perplexity`: Calculate perplexity. The formula is $ perplexity = \frac{1}{n} \sum_i e^{loss_i} $ where $n$ is the number of samples and $ loss_i $ is the average loss for sample $ i $. It can be used in all datasets.
- `ppl_score`: Calculate perplexity score. The formula is $ ppl\_score = \frac{1}{n} \sum_i e^{-loss_i} $ where $n$ is the number of samples and $ loss_i $ is the average loss for sample $ i $. It can be used in all datasets.
- `ppl_score_over_choices`: Calculate perplexity score over choices. The formula is $ ppl\_score\_over\_choices = \frac{1}{n} \sum_i e^{-loss\_over\_choices_i} $ where $n$ is the number of samples and $ loss\_over\_choices_i $ is the loss on the first predicted token for sample $ i $. It can be used in all datasets that contain single-choice questions.
- `per_byte_perplexity`: Calculate per-byte perplexity. The formula is $ \frac{1}{n} \sum_i e^{\frac{loss_i}{byte_i}} $ where $n$ is the number of samples, $ loss_i $ is the total loss for sample $ i $ and $ byte_i $ is the number of bytes sample $ i $ occupies. It can be used in all datasets.
- `per_byte_ppl_score`: Calculate per-byte perplexity score. The formula is $ \frac{1}{n} \sum_i e^{-\frac{loss_i}{byte_i}} $ where $n$ is the number of samples, $ loss_i $ is the total loss for sample $ i $ and $ byte_i $ is the number of bytes sample $ i $ occupies. It can be used in all datasets.
- `loss_over_all_tokens`: Calculate the loss over all tokens. The formula is $ loss\_over\_all\_tokens = \frac{1}{n} \sum_i loss_i $ where $n$ is the total number of tokens in the dataset and $ loss_i $ is the loss summation over all tokens for sample $ i $. It can be used in all datasets and is the metric used for data leakage evaluation (a sketch of these metrics follows this list).
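
A minimal sketch of the perplexity-style metrics above, assuming per-sample losses and byte/token counts were already collected during inference (function and variable names are illustrative, not ColossalEval's actual code):

```python
import math

def perplexity(avg_losses: list[float]) -> float:
    # (1/n) * sum_i exp(loss_i), where loss_i is the average loss of sample i
    return sum(math.exp(loss) for loss in avg_losses) / len(avg_losses)

def ppl_score(avg_losses: list[float]) -> float:
    # (1/n) * sum_i exp(-loss_i)
    return sum(math.exp(-loss) for loss in avg_losses) / len(avg_losses)

def per_byte_perplexity(total_losses: list[float], byte_counts: list[int]) -> float:
    # (1/n) * sum_i exp(loss_i / byte_i), where loss_i is the total loss of sample i
    return sum(math.exp(l / b) for l, b in zip(total_losses, byte_counts)) / len(total_losses)

def loss_over_all_tokens(loss_sums: list[float], token_counts: list[int]) -> float:
    # sum of per-sample loss summations divided by the dataset's total token count
    return sum(loss_sums) / sum(token_counts)
```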

We use `combined_single_choice_accuracy` and `first_token_logit` in the leaderboard.

@@ -520,6 +525,15 @@ year={2023}
primaryClass={cs.CL}
}
@misc{xu2023cvalues,
title={CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility},
author={Guohai Xu and Jiayi Liu and Ming Yan and Haotian Xu and Jinghui Si and Zhuoran Zhou and Peng Yi and Xing Gao and Jitao Sang and Rong Zhang and Ji Zhang and Chao Peng and Fei Huang and Jingren Zhou},
year={2023},
eprint={2307.09705},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@inproceedings{Zhang2023EvaluatingTP,
title={Evaluating the Performance of Large Language Models on GAOKAO Benchmark},
author={Xiaotian Zhang and Chunyang Li and Yi Zong and Zhengyu Ying and Liang He and Xipeng Qiu},
@@ -542,6 +556,20 @@ year={2023}
year={2021}
}
@article{zhang2023safetybench,
title={SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions},
author={Zhexin Zhang and Leqi Lei and Lindong Wu and Rui Sun and Yongkang Huang and Chong Long and Xiao Liu and Xuanyu Lei and Jie Tang and Minlie Huang},
journal={arXiv preprint arXiv:2309.07045},
year={2023}
}
@article{cobbe2021training,
title={Training verifiers to solve math word problems},
author={Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and others},
journal={arXiv preprint arXiv:2110.14168},
year={2021}
}
@article{hendrycks2021ethics,
title={Aligning AI With Shared Human Values},
author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
@@ -558,4 +586,12 @@ year={2023}
primaryClass={cs.CL}
}
@misc{wei2023skywork,
title={Skywork: A More Open Bilingual Foundation Model},
author={Tianwen Wei and Liang Zhao and Lichang Zhang and Bo Zhu and Lijie Wang and Haihua Yang and Biye Li and Cheng Cheng and Weiwei Lü and Rui Hu and Chenxia Li and Liu Yang and Xilin Luo and Xuejie Wu and Lunan Liu and Wenjun Cheng and Peng Cheng and Jianhao Zhang and Xiaoyu Zhang and Lei Lin and Xiaokun Wang and Yutuan Ma and Chuanhai Dong and Yanqi Sun and Yifu Chen and Yongyi Peng and Xiaojuan Liang and Shuicheng Yan and Han Fang and Yahui Zhou},
year={2023},
eprint={2310.19341},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
2 changes: 2 additions & 0 deletions applications/ColossalEval/colossal_eval/dataset/__init__.py
@@ -5,6 +5,7 @@
from .colossalai import ColossalDataset
from .cvalues import CValuesDataset
from .gaokaobench import GaoKaoBenchDataset
from .gsm import GSMDataset
from .longbench import LongBenchDataset
from .mmlu import MMLUDataset
from .mtbench import MTBenchDataset
@@ -24,4 +25,5 @@
"SafetyBenchENDataset",
"SafetyBenchZHDataset",
"CValuesDataset",
"GSMDataset",
]
17 changes: 14 additions & 3 deletions applications/ColossalEval/colossal_eval/dataset/agieval.py
@@ -99,11 +99,20 @@ def get_prompt(line: Dict, dataset_name: str, logger: DistributedLogger) -> Dict

# process few-shot raw_prompts
def combine_prompt(prompt_path, dataset_name, load_explanation=True, chat_mode=False):
    demostrations = []
    demostration_en = "Here are the answers for the problems in the exam."
    demostration_zh = "以下是考试中各个问题的答案。"

    if dataset_name in english_qa_datasets or dataset_name in english_cloze_datasets:
        demostrations.append(demostration_en)
    elif dataset_name in chinese_qa_datasets or dataset_name in chinese_cloze_datasets:
        demostrations.append(demostration_zh)

    skip_passage = False
    if dataset_name == "sat-en-without-passage":
        skip_passage = True
        dataset_name = "sat-en"

    # read the prompts by context and explanation
    context_row = [0, 1, 3, 5, 7, 9]
    explanation_row = [0, 2, 4, 6, 8, 10]
@@ -153,7 +162,7 @@ def combine_prompt(prompt_path, dataset_name, load_explanation=True, chat_mode=F
    if chat_mode:
        demostrations.append((question_input,))
    else:
        demostrations.append(question_input)

    return demostrations

@@ -178,7 +187,9 @@ class AGIEvalDataset(BaseDataset):
"""

@staticmethod
def load(path: str, logger: DistributedLogger, few_shot: bool) -> List[Dict]:
def load(
path: str, logger: DistributedLogger, few_shot: bool, forward_only: bool, load_train: bool, load_reference: bool
) -> List[Dict]:
dataset = {"test": {}}

files = glob.glob(os.path.join(path, "*.jsonl"))
4 changes: 2 additions & 2 deletions applications/ColossalEval/colossal_eval/dataset/base.py
@@ -12,8 +12,8 @@ class BaseDataset:
        logger: Logger for the dataset.
    """

    def __init__(self, path, logger, few_shot, forward_only=False, load_train=False, load_reference=False):
        self.dataset = self.load(path, logger, few_shot, forward_only, load_train, load_reference)

    def save(self, save_path):
        """Save the converted dataset"""
10 changes: 6 additions & 4 deletions applications/ColossalEval/colossal_eval/dataset/ceval.py
@@ -71,8 +71,8 @@
}


def get_few_shot_data(data: List[Dict], subject):
    few_shot_data = [f"以下是中国关于{subject}考试的单项选择题,请选出其中的正确答案。"]
    for i in data:
        few_shot_data.append(i["input"] + i["target"])
    return few_shot_data
@@ -86,7 +86,9 @@ class CEvalDataset(BaseDataset):
"""

@staticmethod
def load(path: str, logger: DistributedLogger, few_shot: bool) -> List[Dict]:
def load(
path: str, logger: DistributedLogger, few_shot: bool, forward_only: bool, load_train: bool, load_reference: bool
) -> List[Dict]:
dataset = {"dev": {}, "test": {}}
for split in ["dev", "test"]:
files = os.listdir(os.path.join(path, split))
@@ -105,7 +107,7 @@ def load(path: str, logger: DistributedLogger, few_shot: bool) -> List[Dict]:

if split == "test" and few_shot:
dataset[split][subject]["inference_kwargs"]["few_shot_data"] = get_few_shot_data(
dataset["dev"][subject]["data"]
dataset["dev"][subject]["data"], subject
)

with open(file_dir, encoding="utf-8") as f:
10 changes: 6 additions & 4 deletions applications/ColossalEval/colossal_eval/dataset/cmmlu.py
@@ -86,8 +86,8 @@
}


def get_few_shot_data(data: List[Dict], subject):
    few_shot_data = [f"以下是关于{subject}的单项选择题,请直接给出正确答案的选项。"]
    for i in data:
        few_shot_data.append(i["input"] + i["target"])
    return few_shot_data
@@ -101,7 +101,9 @@ class CMMLUDataset(BaseDataset):
"""

@staticmethod
def load(path: str, logger: DistributedLogger, few_shot: bool) -> List[Dict]:
def load(
path: str, logger: DistributedLogger, few_shot: bool, forward_only: bool, load_train: bool, load_reference: bool
) -> List[Dict]:
dataset = {"dev": {}, "test": {}}
for split in ["dev", "test"]:
files = os.listdir(os.path.join(path, split))
@@ -120,7 +122,7 @@ def load(path: str, logger: DistributedLogger, few_shot: bool) -> List[Dict]:

if split == "test" and few_shot:
dataset[split][subject]["inference_kwargs"]["few_shot_data"] = get_few_shot_data(
dataset["dev"][subject]["data"]
dataset["dev"][subject]["data"], subject
)

with open(file_dir, encoding="utf-8") as f:
applications/ColossalEval/colossal_eval/dataset/gaokaobench.py
@@ -69,7 +69,9 @@ class GaoKaoBenchDataset(BaseDataset):
"""

@staticmethod
def load(path: str, logger: DistributedLogger, few_shot: bool) -> List[Dict]:
def load(
path: str, logger: DistributedLogger, few_shot: bool, forward_only: bool, load_train: bool, load_reference: bool
) -> List[Dict]:
dataset = {"test": {}}
for category in ["Fill-in-the-blank_Questions", "Multiple-choice_Questions", "Open-ended_Questions"]:
files = os.listdir(os.path.join(path, "data", category))
Expand Down
