Gauntlet v0.1.0 yaml fixes (#748)
* update yaml

* adding datasets

* adding datasets

* added agi eval

* test CoT eval

* fix broken eval yaml

* fix broken eval yaml

* debugging

* debugging

* commit

* commit

* commit

* commit

* commit

* restore mcli

* adding simple tasks

* add simple human_eval

* fix yaml

* fix yaml

* remove breakpoint

* remove breakpoint

* change bsz

* merge main

* eval gauntlet cb

* add updated readme

* fix precommit

* add pii

* restore line

* restore line

* add execution prediction

* add execution prediction

* add execution prediction

* change mosaicml reqs

* change mosaicml reqs

* fix error

* comment

* test smaller beams

* test

* test

* test

* add coding task

* test

* finish eval

* finish data

* fix

* fix

* remove strategyqa cot

* remove

* remove

* foo

* edit

* fix

* rm breakpoint

* rm breakpoint

* remove execution prediction; make coding optional

* remove execution prediction; make coding optional

* remove import

* remove import

* restore files

* restore

* restore

* update readme; rename gauntlet yamls

* edit yamls

* fix yamllint

* restore mpt eval

* finish

* fix

* precommit

* precommit

---------

Co-authored-by: Michael Carbin <[email protected]>
Co-authored-by: Daniel King <[email protected]>
3 people authored Nov 21, 2023
1 parent 7f5d70c commit e7943e3
Showing 5 changed files with 74 additions and 68 deletions.
2 changes: 1 addition & 1 deletion scripts/eval/README.md
@@ -1,6 +1,6 @@
# In-context learning (ICL) evaluation

This folder contains the MosaicML LLM evaluation suite. It is a [blazingly fast](https://www.mosaicml.com/blog/llm-evaluation-for-icl), multi-GPU-enabled ICL evaluation suite with native [FSDP](https://pytorch.org/docs/stable/fsdp.html) compatibility with any model on the HuggingFace hub and any PyTorch model that implements the [`ComposerModel` interface](https://docs.mosaicml.com/projects/composer/en/latest/api_reference/generated/composer.ComposerModel.html#composermodel). We also include a collection of ICL datasets we refer to as our [Model Gauntlet](https://github.com/mosaicml/llm-foundry/blob/scripts/eval/local_data/eval_gauntlet.md), organized into 6 broad categories of competency that we expect good foundation models to have.
This folder contains the MosaicML LLM evaluation suite. It is a [blazingly fast](https://www.mosaicml.com/blog/llm-evaluation-for-icl), multi-GPU-enabled ICL evaluation suite with native [FSDP](https://pytorch.org/docs/stable/fsdp.html) compatibility with any model on the HuggingFace hub and any PyTorch model that implements the [`ComposerModel` interface](https://docs.mosaicml.com/projects/composer/en/latest/api_reference/generated/composer.ComposerModel.html#composermodel). We also include a collection of ICL datasets we refer to as our [Eval Gauntlet](https://github.com/mosaicml/llm-foundry/blob/scripts/eval/local_data/eval_gauntlet.md), organized into 6 broad categories of competency that we expect good foundation models to have.

You can evaluate a model by preparing an evaluation YAML following the format of the examples in the [`scripts/eval/yamls` directory](https://github.com/mosaicml/llm-foundry/tree/main/scripts/eval/yamls).
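
For orientation, the sketch below shows roughly what such an evaluation YAML looks like. It is an illustrative sketch only, not part of this commit: the key names are assumed from the `hf_eval.yaml`-style examples (a `models` list plus an `icl_tasks` pointer), so consult the files in `scripts/eval/yamls/` for the authoritative format.

```yaml
# Illustrative sketch only -- key names assumed; see scripts/eval/yamls/ for real examples.
max_seq_len: 1024          # maximum sequence length used at evaluation time
precision: amp_fp16        # evaluation precision
device_eval_batch_size: 4  # per-device batch size

models:                    # one or more models to evaluate
-
  model_name: mosaicml/mpt-7b
  model:
    name: hf_causal_lm     # any causal LM on the HuggingFace hub
    pretrained_model_name_or_path: mosaicml/mpt-7b
    pretrained: true
  tokenizer:
    name: mosaicml/mpt-7b

# Path to a file of ICL task definitions (e.g. the tasks.yaml updated in this commit).
icl_tasks: eval/yamls/tasks.yaml
```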

@@ -1,4 +1,4 @@
# Mosaic Model Gauntlet v0 - Evaluation Suite
# Mosaic Eval Gauntlet v0.1.0 - Evaluation Suite


<!-- SETUPTOOLS_LONG_DESCRIPTION_HIDE_BEGIN -->
@@ -7,9 +7,9 @@
<img alt="LLM Foundry" src="../../../assets/radar_blog.png" width="60%">
</picture>
<br>
MPT-7B vs MPT-30B compared on the 6 categories of Model Gauntlet.
MPT-7B vs MPT-30B compared on the 6 categories of Eval Gauntlet v0.
</p>
The Mosaic Model Gauntlet is MosaicML’s new technique for evaluating the quality of pretrained foundation models. The Model Gauntlet encompasses 35 different benchmarks collected from a variety of sources, and organized into 6 broad categories of competency that we expect good foundation models to have. We compiled the categories after an extensive review of existing LLM publications, and open source evaluation harnesses such as EleutherAI Eval Harness and Stanford CRFM’s HELM.
The Mosaic Eval Gauntlet is MosaicML’s new technique for evaluating the quality of pretrained foundation models. The Eval Gauntlet encompasses 35 different benchmarks collected from a variety of sources, and organized into 6 broad categories of competency that we expect good foundation models to have. We compiled the categories after an extensive review of existing LLM publications, and open source evaluation harnesses such as EleutherAI Eval Harness and Stanford CRFM’s HELM.

<br>
While deciding which benchmarks to include, we had a few criteria in mind. We wanted benchmarks to require a broad range of skills that were useful for practical applications; we wanted them to come from a diverse range of sources; we wanted them to capture skills that have been traditionally emphasized by the research community as well as those that have been underexplored; and we wanted them to be evaluated via simple, unambiguous metrics such as exact match and multiple-choice accuracy. The philosophy behind compiling aggregate scores, as opposed to the more common approach of reporting individual metrics, is two-fold.
@@ -24,7 +24,7 @@ At evaluation time, we run all the benchmarks, average the subscores within each

For example, if benchmark A has a random baseline accuracy of 25% and the model achieves 30%, we would report this as (0.3 - 0.25)/(1 - 0.25) = 0.0667. This can be thought of as the accuracy above chance rescaled so the max is 1. For benchmarks in which the random guessing baseline accuracy is ~0, we report the accuracy as is. Note that with this rescaling, a model could technically score below 0 on a category as a whole, but we haven’t found this to occur with any of the models we’ve tested.
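
Written out as a formula, the rescaling described above is simply

$$
\text{rescaled score} = \frac{\text{raw accuracy} - \text{random baseline}}{1 - \text{random baseline}},
$$

so a rescaled score of 0 corresponds to chance-level performance and 1 to perfect accuracy.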

This is version v0, in the coming weeks we will update the mixture to include more benchmarks.
This is version v0.1.0 of the Eval Gauntlet.

### Reading Comprehension

@@ -349,7 +349,7 @@ The Safety category consists of benchmarks designed to assess model's toxicity,
- Random baseline accuracy: 50%

### Programming
Programming tasks evaluate the model's ability to understand code, write functionally correct code given a specification, simulate code, and document code. Right now we just have HumanEval, but later versions will include more.
Programming tasks evaluate the model's ability to understand code, write functionally correct code given a specification, simulate code, and document code. Right now we just have HumanEval, but later versions will include more. By default, the programming tasks are disabled in `scripts/eval/yamls/tasks.yaml` due to their long duration.
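
A sketch of what one re-enabled entry would look like is shown here; the field values simply mirror the commented-out `human_eval` block in this commit's `tasks.yaml` diff, so treat it as illustrative rather than a new configuration:

```yaml
icl_tasks:
-
  label: human_eval
  dataset_uri: eval/local_data/programming/human_eval.jsonl
  num_fewshot: [0]
  pass_at_k: 1
  num_beams: 20
  batch_size: 1
  icl_task_type: code_evaluation
```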

51. HumanEval Python code generation
- Description: HumanEval Python consists of 164 python programming challenges, in which the model is presented with the method signature and docstring comment for a python program and is expected to complete the program. We then test the resultant code’s functional correctness on a number of test input/output pairs.
6 changes: 6 additions & 0 deletions scripts/eval/yamls/copa.yaml
@@ -0,0 +1,6 @@
icl_tasks:
-
  label: copa
  dataset_uri: eval/local_data/commonsense_reasoning/copa.jsonl
  num_fewshot: [0]
  icl_task_type: multiple_choice
54 changes: 27 additions & 27 deletions scripts/eval/yamls/tasks.yaml
@@ -1,69 +1,69 @@
icl_tasks:
-
label: jeopardy
dataset_uri: eval/local_data/world_knowledge/jeopardy_all.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/world_knowledge/jeopardy_all.jsonl
num_fewshot: [10]
icl_task_type: language_modeling
continuation_delimiter: "\nAnswer: " # this separates questions from answers
has_categories: true
-
label: bigbench_qa_wikidata
dataset_uri: eval/local_data/world_knowledge/bigbench_qa_wikidata.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/world_knowledge/bigbench_qa_wikidata.jsonl
num_fewshot: [10]
icl_task_type: language_modeling
-
label: arc_easy
dataset_uri: eval/local_data/world_knowledge/arc_easy.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/world_knowledge/arc_easy.jsonl
num_fewshot: [10]
icl_task_type: multiple_choice
continuation_delimiter: "\nAnswer: " # this separates questions from answers
-
label: arc_challenge
dataset_uri: eval/local_data/world_knowledge/arc_challenge.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/world_knowledge/arc_challenge.jsonl
num_fewshot: [10]
icl_task_type: multiple_choice
continuation_delimiter: "\nAnswer: " # this separates questions from answers
-
label: mmlu
dataset_uri: eval/local_data/world_knowledge/mmlu.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/world_knowledge/mmlu.jsonl
num_fewshot: [10]
icl_task_type: multiple_choice
continuation_delimiter: "\nAnswer: " # this separates questions from answers
has_categories: true
-
label: bigbench_misconceptions
dataset_uri: eval/local_data/world_knowledge/bigbench_misconceptions.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/world_knowledge/bigbench_misconceptions.jsonl
num_fewshot: [10]
icl_task_type: multiple_choice
-
label: copa
dataset_uri: eval/local_data/commonsense_reasoning/copa.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/commonsense_reasoning/copa.jsonl
num_fewshot: [0]
icl_task_type: multiple_choice
-
label: piqa
dataset_uri: eval/local_data/commonsense_reasoning/piqa.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/commonsense_reasoning/piqa.jsonl
num_fewshot: [10]
icl_task_type: multiple_choice
continuation_delimiter: "\nAnswer: " # this separates questions from answers
-
label: openbook_qa
dataset_uri: eval/local_data/commonsense_reasoning/openbook_qa.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/commonsense_reasoning/openbook_qa.jsonl
num_fewshot: [0]
icl_task_type: multiple_choice
-
label: bigbench_novel_concepts
dataset_uri: eval/local_data/commonsense_reasoning/bigbench_novel_concepts.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/commonsense_reasoning/bigbench_novel_concepts.jsonl
num_fewshot: [10]
icl_task_type: multiple_choice
-
label: bigbench_strange_stories
dataset_uri: eval/local_data/commonsense_reasoning/bigbench_strange_stories.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/commonsense_reasoning/bigbench_strange_stories.jsonl
num_fewshot: [10]
icl_task_type: multiple_choice
-
label: bigbench_strategy_qa
dataset_uri: eval/local_data/commonsense_reasoning/bigbench_strategy_qa.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/commonsense_reasoning/bigbench_strategy_qa.jsonl
num_fewshot: [10]
icl_task_type: multiple_choice
-
@@ -73,17 +73,17 @@ icl_tasks:
icl_task_type: language_modeling
-
label: hellaswag
dataset_uri: eval/local_data/language_understanding/hellaswag.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/language_understanding/hellaswag.jsonl
num_fewshot: [10]
icl_task_type: multiple_choice
-
label: winograd
dataset_uri: eval/local_data/language_understanding/winograd_wsc.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/language_understanding/winograd_wsc.jsonl
num_fewshot: [0]
icl_task_type: schema
-
label: winogrande
dataset_uri: eval/local_data/language_understanding/winogrande.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/language_understanding/winogrande.jsonl
num_fewshot: [0]
icl_task_type: schema
-
@@ -154,84 +154,84 @@ icl_tasks:
continuation_delimiter: "\nAnswer: " # this separates questions from answers
-
label: pubmed_qa_labeled
dataset_uri: eval/local_data/reading_comprehension/pubmed_qa_labeled.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/reading_comprehension/pubmed_qa_labeled.jsonl
num_fewshot: [10]
icl_task_type: language_modeling
-
label: squad
dataset_uri: eval/local_data/reading_comprehension/squad.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/reading_comprehension/squad.jsonl
num_fewshot: [10]
icl_task_type: language_modeling
-
label: bigbench_understanding_fables
dataset_uri: eval/local_data/reading_comprehension/bigbench_understanding_fables.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/reading_comprehension/bigbench_understanding_fables.jsonl
num_fewshot: [10]
icl_task_type: multiple_choice
-
label: boolq
dataset_uri: eval/local_data/reading_comprehension/boolq.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/reading_comprehension/boolq.jsonl
num_fewshot: [10]
icl_task_type: multiple_choice
continuation_delimiter: "\nAnswer: " # this separates questions from answers
# -
# label: human_eval
# dataset_uri: eval/local_data/programming/human_eval.jsonl # ADD YOUR OWN DATASET URI
# dataset_uri: eval/local_data/programming/human_eval.jsonl
# num_fewshot: [0]
# pass_at_k: 1
# num_beams: 20
# batch_size: 1
# icl_task_type: code_evaluation
# -
# label: human_eval_cpp
# dataset_uri: eval/local_data/programming/processed_human_eval_cpp.jsonl # ADD YOUR OWN DATASET URI
# dataset_uri: eval/local_data/programming/processed_human_eval_cpp.jsonl
# num_fewshot: [0]
# pass_at_k: 1
# num_beams: 20
# batch_size: 1
# icl_task_type: code_evaluation
# -
# label: human_eval_js
# dataset_uri: eval/local_data/programming/processed_human_eval_js.jsonl # ADD YOUR OWN DATASET URI
# dataset_uri: eval/local_data/programming/processed_human_eval_js.jsonl
# num_fewshot: [0]
# pass_at_k: 1
# num_beams: 20
# batch_size: 1
# icl_task_type: code_evaluation
# -
# label: human_eval_return_simple
# dataset_uri: eval/local_data/programming/human_eval_return_simple.jsonl # ADD YOUR OWN DATASET URI
# dataset_uri: eval/local_data/programming/human_eval_return_simple.jsonl
# num_fewshot: [0]
# pass_at_k: 1
# num_beams: 20
# batch_size: 1
# icl_task_type: code_evaluation
# -
# label: human_eval_return_complex
# dataset_uri: eval/local_data/programming/human_eval_return_complex.jsonl # ADD YOUR OWN DATASET URI
# dataset_uri: eval/local_data/programming/human_eval_return_complex.jsonl
# num_fewshot: [0]
# pass_at_k: 1
# num_beams: 20
# batch_size: 1
# icl_task_type: code_evaluation
# -
# label: human_eval_25
# dataset_uri: eval/local_data/programming/human_eval-0.25.jsonl # ADD YOUR OWN DATASET URI
# dataset_uri: eval/local_data/programming/human_eval-0.25.jsonl
# num_fewshot: [0]
# pass_at_k: 1
# num_beams: 20
# batch_size: 1
# icl_task_type: code_evaluation
# -
# label: human_eval_50
# dataset_uri: eval/local_data/programming/human_eval-0.5.jsonl # ADD YOUR OWN DATASET URI
# dataset_uri: eval/local_data/programming/human_eval-0.5.jsonl
# num_fewshot: [0]
# pass_at_k: 1
# num_beams: 20
# batch_size: 1
# icl_task_type: code_evaluation
# -
# label: human_eval_75
# dataset_uri: eval/local_data/programming/human_eval-0.75.jsonl # ADD YOUR OWN DATASET URI
# dataset_uri: eval/local_data/programming/human_eval-0.75.jsonl
# num_fewshot: [0]
# pass_at_k: 1
# num_beams: 20