Commit

update readme; rename gauntlet yamls
bmosaicml committed Nov 12, 2023
1 parent cd08024 commit f0c0b9d
Showing 8 changed files with 1,240 additions and 648 deletions.
2 changes: 1 addition & 1 deletion llmfoundry/utils/builders.py
@@ -300,7 +300,7 @@ def _validate_cfg(icl_cfg: DictConfig):
if dist.get_local_rank() == 0 and os.path.exists(destination_path):
    os.remove(destination_path)
dist.barrier()

dataloaders = get_icl_task_dataloader(
    icl_cfg.icl_task_type,
    icl_cfg.dataset_uri,
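The single-line change in this hunk sits next to a distributed-safety pattern that is fully visible in the context lines: only local rank 0 deletes a previously downloaded ICL dataset file, and every rank then waits at a barrier before the dataloaders are rebuilt. Below is a minimal, self-contained sketch of that pattern; it assumes the dist helpers come from composer.utils (as llm-foundry uses elsewhere), the function name is hypothetical, and the get_icl_task_dataloader arguments that the diff truncates are deliberately not reconstructed.

# Sketch only: the rank-0 cleanup + barrier pattern shown in builders.py.
# `prepare_destination` is a hypothetical name; composer.utils.dist is assumed.
import os

from composer.utils import dist


def prepare_destination(destination_path: str) -> None:
    # Only local rank 0 removes a stale copy of the downloaded dataset file...
    if dist.get_local_rank() == 0 and os.path.exists(destination_path):
        os.remove(destination_path)
    # ...and every rank then waits here, so no rank recreates or reads the file
    # while the delete may still be in flight on rank 0.
    dist.barrier()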
174 changes: 134 additions & 40 deletions scripts/eval/local_data/MODEL_GAUNTLET.md

Large diffs are not rendered by default.

531 changes: 265 additions & 266 deletions scripts/eval/local_data/reading_comprehension/agi_eval_lsat_rc.jsonl

Large diffs are not rendered by default.

392 changes: 196 additions & 196 deletions scripts/eval/local_data/reading_comprehension/agi_eval_sat_en.jsonl

Large diffs are not rendered by default.

168 changes: 114 additions & 54 deletions scripts/eval/yamls/eval_gauntlet.yaml
@@ -9,7 +9,20 @@ eval_gauntlet:
- language_understanding
- symbolic_problem_solving
- reading_comprehension
# - programming
- programming
lm_task_average:
- world_knowledge_lm_task_subscore
- commonsense_reasoning_lm_task_subscore
- language_understanding_lm_task_subscore
- symbolic_problem_solving_lm_task_subscore
- reading_comprehension_lm_task_subscore
lite_average:
- world_knowledge_lite
- commonsense_reasoning_lite
- language_understanding_lite
- symbolic_problem_solving_lite
- reading_comprehension_lite
- programming_lite
categories:
- name: world_knowledge
benchmarks:
@@ -31,20 +44,11 @@ eval_gauntlet:
- name: bigbench_misconceptions
num_fewshot: 10
random_baseline: 0.5
- name: triviaqa_sm_sub
num_fewshot: 3
random_baseline: 0.0
- name: commonsense_reasoning
benchmarks:
- name: copa
num_fewshot: 0
random_baseline: 0.5
- name: siqa
num_fewshot: 10
random_baseline: 0.5
- name: commonsense_qa
num_fewshot: 10
random_baseline: 0.25
- name: piqa
num_fewshot: 10
random_baseline: 0.5
@@ -115,21 +119,6 @@ eval_gauntlet:
- name: logi_qa
num_fewshot: 10
random_baseline: 0.25
- name: aqua
num_fewshot: 3
random_baseline: 0.0
- name: gsm8k
num_fewshot: 3
random_baseline: 0.0
- name: svamp
num_fewshot: 3
random_baseline: 0
- name: agi_eval_sat_math
num_fewshot: 3
random_baseline: 0.0
- name: agi_eval_lsat_ar
num_fewshot: 3
random_baseline: 0.25
- name: reading_comprehension
benchmarks:
- name: pubmed_qa_labeled
@@ -144,34 +133,6 @@ eval_gauntlet:
- name: boolq
num_fewshot: 10
random_baseline: 0.5
- name: coqa
num_fewshot: 0
random_baseline: 0.0
- name: agi_eval_lsat_rc
num_fewshot: 3
random_baseline: 0.25
- name: agi_eval_lsat_lr
num_fewshot: 3
random_baseline: 0.25
- name: agi_eval_sat_en
num_fewshot: 3
random_baseline: 0.25
- name: safety
benchmarks:
- name: winogender_mc_female
num_fewshot: 10
random_baseline: 0.5
- name: winogender_mc_male
num_fewshot: 10
random_baseline: 0.5
- name: enterprise_pii_classification
num_fewshot: 10
random_baseline: 0.5
- name: bbq
num_fewshot: 3
random_baseline: 0.5
# THIS CATEGORY IS PARTICULARLY SLOW, USE SPARINGLY.
# TASKS ARE DEFINED IN `coding_tasks.yaml`
# - name: programming
# benchmarks:
# - name: human_eval
@@ -197,4 +158,103 @@ eval_gauntlet:
# random_baseline: 0.0
# - name: human_eval_75
# num_fewshot: 0
# random_baseline: 0.0
# random_baseline: 0.0
- name: world_knowledge_lm_task_subscore
benchmarks:
- name: jeopardy
num_fewshot: 10
random_baseline: 0
- name: bigbench_qa_wikidata
num_fewshot: 10
random_baseline: 0
- name: language_understanding_lm_task_subscore
benchmarks:
- name: lambada_openai
num_fewshot: 0
random_baseline: 0.0
- name: bigbench_conlang_translation
num_fewshot: 0
random_baseline: 0.0
- name: symbolic_problem_solving_lm_task_subscore
benchmarks:
- name: bigbench_dyck_languages
num_fewshot: 10
random_baseline: 0
- name: bigbench_cs_algorithms
num_fewshot: 10
random_baseline: 0
- name: bigbench_operators
num_fewshot: 10
random_baseline: 0.0
- name: bigbench_repeat_copy_logic
num_fewshot: 10
random_baseline: 0.0
- name: simple_arithmetic_withspaces
num_fewshot: 10
random_baseline: 0.0
- name: simple_arithmetic_nospaces
num_fewshot: 10
random_baseline: 0.0
- name: reading_comprehension_lm_task_subscore
benchmarks:
- name: pubmed_qa_labeled
num_fewshot: 10
random_baseline: 0.0
- name: squad
num_fewshot: 10
random_baseline: 0
- name: world_knowledge_lite
benchmarks:
- name: jeopardy
num_fewshot: 10
random_baseline: 0
- name: arc_challenge
num_fewshot: 10
random_baseline: 0.25
- name: commonsense_reasoning_lite
benchmarks:
- name: copa
num_fewshot: 0
random_baseline: 0.5
- name: piqa
num_fewshot: 10
random_baseline: 0.5
- name: language_understanding_lite
benchmarks:
- name: lambada_openai
num_fewshot: 0
random_baseline: 0.0
- name: hellaswag
num_fewshot: 10
random_baseline: 0.25
- name: winograd
num_fewshot: 0
random_baseline: 0.5
- name: symbolic_problem_solving_lite
benchmarks:
- name: bigbench_elementary_math_qa
num_fewshot: 10
random_baseline: 0.25
- name: bigbench_dyck_languages
num_fewshot: 10
random_baseline: 0
- name: bigbench_operators
num_fewshot: 10
random_baseline: 0.0
- name: bigbench_repeat_copy_logic
num_fewshot: 10
random_baseline: 0.0
- name: simple_arithmetic_withspaces
num_fewshot: 10
random_baseline: 0.0
- name: simple_arithmetic_nospaces
num_fewshot: 10
random_baseline: 0.0
- name: reading_comprehension_lite
benchmarks:
- name: pubmed_qa_labeled
num_fewshot: 10
random_baseline: 0.0
- name: squad
num_fewshot: 10
random_baseline: 0
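The updated eval_gauntlet.yaml adds two composite averages, lm_task_average and lite_average, each defined as a list of category names, together with the *_lm_task_subscore and *_lite categories they reference. As an illustration of how such a config can be turned into scores, here is a hedged Python sketch: the field names (categories, benchmarks, random_baseline, and the average lists) come from the YAML above, while the aggregation itself — rescaling each benchmark's accuracy against its random baseline, averaging equally within a category, then averaging the listed categories — is an assumption about how the gauntlet is typically scored, not a copy of llm-foundry's actual callback.

# Illustrative sketch, not llm-foundry's implementation: aggregate per-benchmark
# accuracies into category scores and the composite averages defined in
# eval_gauntlet.yaml. The rescaling and equal weighting below are assumptions.
from typing import Dict

import yaml


def rescaled_score(accuracy: float, random_baseline: float) -> float:
    # Map random guessing to 0.0 and a perfect score to 1.0.
    return max(0.0, (accuracy - random_baseline) / (1.0 - random_baseline))


def gauntlet_scores(config_path: str, accuracies: Dict[str, float]) -> Dict[str, float]:
    with open(config_path) as f:
        gauntlet = yaml.safe_load(f)['eval_gauntlet']

    scores: Dict[str, float] = {}
    # Each category score is the plain mean of its benchmarks' rescaled accuracies.
    for category in gauntlet['categories']:
        per_benchmark = [
            rescaled_score(accuracies[b['name']], b['random_baseline'])
            for b in category['benchmarks']
            if b['name'] in accuracies
        ]
        if per_benchmark:
            scores[category['name']] = sum(per_benchmark) / len(per_benchmark)

    # Composite averages such as lm_task_average and lite_average are plain means
    # over the category scores they list (assumed to sit under an `averages:` key,
    # as the first hunk of this file suggests).
    for average_name, members in gauntlet.get('averages', {}).items():
        member_scores = [scores[m] for m in members if m in scores]
        if member_scores:
            scores[average_name] = sum(member_scores) / len(member_scores)
    return scores

Under this reading, a model's reading_comprehension_lite score, for example, would be the mean of its rescaled pubmed_qa_labeled and squad accuracies, and lite_average would be the mean of the six *_lite category scores.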