Gauntlet v0.1.0 yaml fixes (#748)
* update yaml

* adding datasets

* adding datasets

* added agi eval

* test CoT eval

* fix broken eval yaml

* fix broken eval yaml

* debugging

* debugging

* commit

* commit

* commit

* commit

* commit

* restore mcli

* adding simple tasks

* add simple human_eval

* fix yaml

* fix yaml

* remove breakpoint

* remove breakpoint

* change bsz

* merge main

* eval gauntlet cb

* add updated readme

* fix precommit

* add pii

* restore line

* restore line

* add execution prediction

* add execution prediction

* add execution prediction

* change mosaicml reqs

* change mosaicml reqs

* fix error

* comment

* test smaller beams

* test

* test

* test

* add coding task

* test

* finish eval

* finish data

* fix

* fix

* remove strategyqa cot

* remove

* remove

* foo

* edit

* fix

* rm breakpoint

* rm breakpoint

* remove execution prediction; make coding optional

* remove execution prediction; make coding optional

* remove import

* remove import

* restore files

* restore

* restore

* update readme; rename gauntlet yamls

* edit yamls

* fix yamllint

* restore mpt eval

* finish

* fix

* precommit

* precommit

---------

Co-authored-by: Michael Carbin <[email protected]>
Co-authored-by: Daniel King <[email protected]>
3 people authored Nov 21, 2023
1 parent 7f5d70c commit e7943e3
Showing 5 changed files with 74 additions and 68 deletions.
2 changes: 1 addition & 1 deletion scripts/eval/README.md
@@ -1,6 +1,6 @@
# In-context learning (ICL) evaluation

This folder contains the MosaicML LLM evaluation suite. It is a [blazingly fast](https://www.mosaicml.com/blog/llm-evaluation-for-icl), multi-GPU-enabled ICL evaluation suite with native [FSDP](https://pytorch.org/docs/stable/fsdp.html) compatibility with any model on the HuggingFace hub and any PyTorch model that implements the [`ComposerModel` interface](https://docs.mosaicml.com/projects/composer/en/latest/api_reference/generated/composer.ComposerModel.html#composermodel). We also include a collection of ICL datasets we refer to as our [Model Gauntlet](https://github.com/mosaicml/llm-foundry/blob/scripts/eval/local_data/eval_gauntlet.md), organized into 6 broad categories of competency that we expect good foundation models to have.
This folder contains the MosaicML LLM evaluation suite. It is a [blazingly fast](https://www.mosaicml.com/blog/llm-evaluation-for-icl), multi-GPU-enabled ICL evaluation suite with native [FSDP](https://pytorch.org/docs/stable/fsdp.html) compatibility with any model on the HuggingFace hub and any PyTorch model that implements the [`ComposerModel` interface](https://docs.mosaicml.com/projects/composer/en/latest/api_reference/generated/composer.ComposerModel.html#composermodel). We also include a collection of ICL datasets we refer to as our [Eval Gauntlet](https://github.com/mosaicml/llm-foundry/blob/scripts/eval/local_data/eval_gauntlet.md), organized into 6 broad categories of competency that we expect good foundation models to have.

You can evaluate a model by preparing an evaluation YAML following the format of the examples in the [`scripts/eval/yamls` directory](https://github.com/mosaicml/llm-foundry/tree/main/scripts/eval/yamls).
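
For orientation, the sketch below shows roughly what such an evaluation YAML looks like. It is an illustrative sketch only, not part of this commit: the key names are assumed from the `hf_eval.yaml`-style examples (a `models` list plus an `icl_tasks` pointer), so consult the files in `scripts/eval/yamls/` for the authoritative format.

```yaml
# Illustrative sketch only -- key names assumed; see scripts/eval/yamls/ for real examples.
max_seq_len: 1024          # maximum sequence length used at evaluation time
precision: amp_fp16        # evaluation precision
device_eval_batch_size: 4  # per-device batch size

models:                    # one or more models to evaluate
-
  model_name: mosaicml/mpt-7b
  model:
    name: hf_causal_lm     # any causal LM on the HuggingFace hub
    pretrained_model_name_or_path: mosaicml/mpt-7b
    pretrained: true
  tokenizer:
    name: mosaicml/mpt-7b

# Path to a file of ICL task definitions (e.g. the tasks.yaml updated in this commit).
icl_tasks: eval/yamls/tasks.yaml
```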

@@ -1,4 +1,4 @@
# Mosaic Model Gauntlet v0 - Evaluation Suite
# Mosaic Eval Gauntlet v0.1.0 - Evaluation Suite


<!-- SETUPTOOLS_LONG_DESCRIPTION_HIDE_BEGIN -->
@@ -7,9 +7,9 @@
<img alt="LLM Foundry" src="../../../assets/radar_blog.png" width="60%">
</picture>
<br>
MPT-7B vs MPT-30B compared on the 6 categories of Model Gauntlet.
MPT-7B vs MPT-30B compared on the 6 categories of Eval Gauntlet v0.
</p>
The Mosaic Model Gauntlet is MosaicML’s new technique for evaluating the quality of pretrained foundation models. The Model Gauntlet encompasses 35 different benchmarks collected from a variety of sources, and organized into 6 broad categories of competency that we expect good foundation models to have. We compiled the categories after an extensive review of existing LLM publications, and open source evaluation harnesses such as EleutherAI Eval Harness and Stanford CRFM’s HELM.
The Mosaic Eval Gauntlet is MosaicML’s new technique for evaluating the quality of pretrained foundation models. The Eval Gauntlet encompasses 35 different benchmarks collected from a variety of sources, and organized into 6 broad categories of competency that we expect good foundation models to have. We compiled the categories after an extensive review of existing LLM publications, and open source evaluation harnesses such as EleutherAI Eval Harness and Stanford CRFM’s HELM.

<br>
While deciding which benchmarks to include, we had a few criteria in mind. We wanted benchmarks to require a broad range of skills that were useful for practical applications; we wanted them to come from a diverse range of sources; we wanted them to capture skills that have been traditionally emphasized by the research community as well as those that have been underexplored; and we wanted them to be evaluated via simple, unambiguous metrics such as exact match and multiple-choice accuracy. The philosophy behind compiling aggregate scores, as opposed to the more common approach of reporting individual metrics, is two-fold.
@@ -24,7 +24,7 @@ At evaluation time, we run all the benchmarks, average the subscores within each

For example, if benchmark A has a random baseline accuracy of 25% and the model achieves 30%, we would report this as (0.3 - 0.25)/(1 - 0.25) = 0.0667. This can be thought of as the accuracy above chance rescaled so the max is 1. For benchmarks in which the random guessing baseline accuracy is ~0, we report the accuracy as is. Note that with this rescaling, a model could technically score below 0 on a category as a whole, but we haven’t found this to occur with any of the models we’ve tested.
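
Written out as a formula, the rescaling described above is simply

$$
\text{rescaled score} = \frac{\text{raw accuracy} - \text{random baseline}}{1 - \text{random baseline}},
$$

so a rescaled score of 0 corresponds to chance-level performance and 1 to perfect accuracy.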

This is version v0, in the coming weeks we will update the mixture to include more benchmarks.
This is version v0.1.0 of the Eval Gauntlet.

### Reading Comprehension

@@ -349,7 +349,7 @@ The Safety category consists of benchmarks designed to assess model's toxicity,
- Random baseline accuracy: 50%

### Programming
Programming tasks evaluate the model's ability to understand code, write functionally correct code given a specification, simulate code, and document code. Right now we just have HumanEval, but later versions will include more.
Programming tasks evaluate the model's ability to understand code, write functionally correct code given a specification, simulate code, and document code. Right now we just have HumanEval, but later versions will include more. By default, the programming tasks are disabled in `scripts/eval/yamls/tasks.yaml` due to their long duration.
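
A sketch of what one re-enabled entry would look like is shown here; the field values simply mirror the commented-out `human_eval` block in this commit's `tasks.yaml` diff, so treat it as illustrative rather than a new configuration:

```yaml
icl_tasks:
-
  label: human_eval
  dataset_uri: eval/local_data/programming/human_eval.jsonl
  num_fewshot: [0]
  pass_at_k: 1
  num_beams: 20
  batch_size: 1
  icl_task_type: code_evaluation
```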

51. HumanEval Python code generation
- Description: HumanEval Python consists of 164 python programming challenges, in which the model is presented with the method signature and docstring comment for a python program and is expected to complete the program. We then test the resultant code’s functional correctness on a number of test input/output pairs.
6 changes: 6 additions & 0 deletions scripts/eval/yamls/copa.yaml
@@ -0,0 +1,6 @@
icl_tasks:
-
  label: copa
  dataset_uri: eval/local_data/commonsense_reasoning/copa.jsonl
  num_fewshot: [0]
  icl_task_type: multiple_choice
54 changes: 27 additions & 27 deletions scripts/eval/yamls/tasks.yaml
@@ -1,69 +1,69 @@
icl_tasks:
-
label: jeopardy
dataset_uri: eval/local_data/world_knowledge/jeopardy_all.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/world_knowledge/jeopardy_all.jsonl
num_fewshot: [10]
icl_task_type: language_modeling
continuation_delimiter: "\nAnswer: " # this separates questions from answers
has_categories: true
-
label: bigbench_qa_wikidata
dataset_uri: eval/local_data/world_knowledge/bigbench_qa_wikidata.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/world_knowledge/bigbench_qa_wikidata.jsonl
num_fewshot: [10]
icl_task_type: language_modeling
-
label: arc_easy
dataset_uri: eval/local_data/world_knowledge/arc_easy.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/world_knowledge/arc_easy.jsonl
num_fewshot: [10]
icl_task_type: multiple_choice
continuation_delimiter: "\nAnswer: " # this separates questions from answers
-
label: arc_challenge
dataset_uri: eval/local_data/world_knowledge/arc_challenge.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/world_knowledge/arc_challenge.jsonl
num_fewshot: [10]
icl_task_type: multiple_choice
continuation_delimiter: "\nAnswer: " # this separates questions from answers
-
label: mmlu
dataset_uri: eval/local_data/world_knowledge/mmlu.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/world_knowledge/mmlu.jsonl
num_fewshot: [10]
icl_task_type: multiple_choice
continuation_delimiter: "\nAnswer: " # this separates questions from answers
has_categories: true
-
label: bigbench_misconceptions
dataset_uri: eval/local_data/world_knowledge/bigbench_misconceptions.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/world_knowledge/bigbench_misconceptions.jsonl
num_fewshot: [10]
icl_task_type: multiple_choice
-
label: copa
dataset_uri: eval/local_data/commonsense_reasoning/copa.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/commonsense_reasoning/copa.jsonl
num_fewshot: [0]
icl_task_type: multiple_choice
-
label: piqa
dataset_uri: eval/local_data/commonsense_reasoning/piqa.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/commonsense_reasoning/piqa.jsonl
num_fewshot: [10]
icl_task_type: multiple_choice
continuation_delimiter: "\nAnswer: " # this separates questions from answers
-
label: openbook_qa
dataset_uri: eval/local_data/commonsense_reasoning/openbook_qa.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/commonsense_reasoning/openbook_qa.jsonl
num_fewshot: [0]
icl_task_type: multiple_choice
-
label: bigbench_novel_concepts
dataset_uri: eval/local_data/commonsense_reasoning/bigbench_novel_concepts.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/commonsense_reasoning/bigbench_novel_concepts.jsonl
num_fewshot: [10]
icl_task_type: multiple_choice
-
label: bigbench_strange_stories
dataset_uri: eval/local_data/commonsense_reasoning/bigbench_strange_stories.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/commonsense_reasoning/bigbench_strange_stories.jsonl
num_fewshot: [10]
icl_task_type: multiple_choice
-
label: bigbench_strategy_qa
dataset_uri: eval/local_data/commonsense_reasoning/bigbench_strategy_qa.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/commonsense_reasoning/bigbench_strategy_qa.jsonl
num_fewshot: [10]
icl_task_type: multiple_choice
-
@@ -73,17 +73,17 @@ icl_tasks:
icl_task_type: language_modeling
-
label: hellaswag
dataset_uri: eval/local_data/language_understanding/hellaswag.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/language_understanding/hellaswag.jsonl
num_fewshot: [10]
icl_task_type: multiple_choice
-
label: winograd
dataset_uri: eval/local_data/language_understanding/winograd_wsc.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/language_understanding/winograd_wsc.jsonl
num_fewshot: [0]
icl_task_type: schema
-
label: winogrande
dataset_uri: eval/local_data/language_understanding/winogrande.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/language_understanding/winogrande.jsonl
num_fewshot: [0]
icl_task_type: schema
-
@@ -154,84 +154,84 @@ icl_tasks:
continuation_delimiter: "\nAnswer: " # this separates questions from answers
-
label: pubmed_qa_labeled
dataset_uri: eval/local_data/reading_comprehension/pubmed_qa_labeled.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/reading_comprehension/pubmed_qa_labeled.jsonl
num_fewshot: [10]
icl_task_type: language_modeling
-
label: squad
dataset_uri: eval/local_data/reading_comprehension/squad.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/reading_comprehension/squad.jsonl
num_fewshot: [10]
icl_task_type: language_modeling
-
label: bigbench_understanding_fables
dataset_uri: eval/local_data/reading_comprehension/bigbench_understanding_fables.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/reading_comprehension/bigbench_understanding_fables.jsonl
num_fewshot: [10]
icl_task_type: multiple_choice
-
label: boolq
dataset_uri: eval/local_data/reading_comprehension/boolq.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/reading_comprehension/boolq.jsonl
num_fewshot: [10]
icl_task_type: multiple_choice
continuation_delimiter: "\nAnswer: " # this separates questions from answers
# -
# label: human_eval
# dataset_uri: eval/local_data/programming/human_eval.jsonl # ADD YOUR OWN DATASET URI
# dataset_uri: eval/local_data/programming/human_eval.jsonl
# num_fewshot: [0]
# pass_at_k: 1
# num_beams: 20
# batch_size: 1
# icl_task_type: code_evaluation
# -
# label: human_eval_cpp
# dataset_uri: eval/local_data/programming/processed_human_eval_cpp.jsonl # ADD YOUR OWN DATASET URI
# dataset_uri: eval/local_data/programming/processed_human_eval_cpp.jsonl
# num_fewshot: [0]
# pass_at_k: 1
# num_beams: 20
# batch_size: 1
# icl_task_type: code_evaluation
# -
# label: human_eval_js
# dataset_uri: eval/local_data/programming/processed_human_eval_js.jsonl # ADD YOUR OWN DATASET URI
# dataset_uri: eval/local_data/programming/processed_human_eval_js.jsonl
# num_fewshot: [0]
# pass_at_k: 1
# num_beams: 20
# batch_size: 1
# icl_task_type: code_evaluation
# -
# label: human_eval_return_simple
# dataset_uri: eval/local_data/programming/human_eval_return_simple.jsonl # ADD YOUR OWN DATASET URI
# dataset_uri: eval/local_data/programming/human_eval_return_simple.jsonl
# num_fewshot: [0]
# pass_at_k: 1
# num_beams: 20
# batch_size: 1
# icl_task_type: code_evaluation
# -
# label: human_eval_return_complex
# dataset_uri: eval/local_data/programming/human_eval_return_complex.jsonl # ADD YOUR OWN DATASET URI
# dataset_uri: eval/local_data/programming/human_eval_return_complex.jsonl
# num_fewshot: [0]
# pass_at_k: 1
# num_beams: 20
# batch_size: 1
# icl_task_type: code_evaluation
# -
# label: human_eval_25
# dataset_uri: eval/local_data/programming/human_eval-0.25.jsonl # ADD YOUR OWN DATASET URI
# dataset_uri: eval/local_data/programming/human_eval-0.25.jsonl
# num_fewshot: [0]
# pass_at_k: 1
# num_beams: 20
# batch_size: 1
# icl_task_type: code_evaluation
# -
# label: human_eval_50
# dataset_uri: eval/local_data/programming/human_eval-0.5.jsonl # ADD YOUR OWN DATASET URI
# dataset_uri: eval/local_data/programming/human_eval-0.5.jsonl
# num_fewshot: [0]
# pass_at_k: 1
# num_beams: 20
# batch_size: 1
# icl_task_type: code_evaluation
# -
# label: human_eval_75
# dataset_uri: eval/local_data/programming/human_eval-0.75.jsonl # ADD YOUR OWN DATASET URI
# dataset_uri: eval/local_data/programming/human_eval-0.75.jsonl
# num_fewshot: [0]
# pass_at_k: 1
# num_beams: 20