Migrate ICL classes to foundry #936

Merged · 100 commits · Apr 12, 2024

Conversation

bmosaicml (Contributor) commented Feb 2, 2024

DEPRECATING COMPOSER CLASSES: mosaicml/composer#3125

This PR migrates all of the ICL(Dataset|Metric) classes, including their superclasses, since composer no longer depends on them. It also migrates all of the relevant tests and renames QATask to InContextLearningGenerationTaskWithAnswers, to capture the fact that it can and will be used for arbitrary generation tasks (such as summarization) and can even be used with LLM-as-judge.
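
For downstream users, the practical change is the import path and the class name. Below is a minimal before/after sketch; the exact llm-foundry module path and the `Dataset` suffix on the new class name are assumptions here, so verify against the package:

```python
# Sketch only: the module paths below are assumptions about where the
# migrated classes land; check the llm-foundry source for exact locations.

# Before (composer, now deprecated):
# from composer.datasets.in_context_learning_evaluation import (
#     InContextLearningQATaskDataset,
# )

# After (llm-foundry), with QATask renamed to reflect that it handles
# arbitrary generation tasks with reference answers:
from llmfoundry.eval.datasets.in_context_learning_evaluation import (
    InContextLearningGenerationTaskWithAnswersDataset,
)
```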

Relatedly, we need to remove or deprecate the equivalent classes in composer to avoid confusion and to prevent people from adding new functionality to the composer versions in the future.
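
On the composer side, one common way to handle this is a thin stub that warns and points at the new home. The following is a hypothetical sketch of such a shim, not the actual composer change (see mosaicml/composer#3125 for that):

```python
# Hypothetical deprecation stub: illustrative only, not composer's code.
import warnings


class InContextLearningQATaskDataset:
    """Stub for the class that moved to llm-foundry."""

    def __init__(self, *args: object, **kwargs: object) -> None:
        warnings.warn(
            'InContextLearningQATaskDataset has moved to llm-foundry and was '
            'renamed InContextLearningGenerationTaskWithAnswersDataset.',
            DeprecationWarning,
            stacklevel=2,
        )
        raise NotImplementedError(
            'This class no longer lives in composer; import it from llm-foundry.')
```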

Experimental runs:

MPT-7B: mpt-eval-zDGaOU
Llama-2-7B: llama2-eval-66Rw1B

| model_name               |   core_average |   world_knowledge |   commonsense_reasoning |   language_understanding |   symbolic_problem_solving |   reading_comprehension |
|:-------------------------|---------------:|------------------:|------------------------:|-------------------------:|---------------------------:|------------------------:|
| mosaicml/mpt-7b |       0.343081 |          0.421662 |                0.256372 |                 0.634086 |                   0.155426 |                0.247861 |
| meta-llama/Llama-2-7b-hf |       0.417207 |          0.510394 |                0.355677 |                 0.655805 |                   0.227892 |                0.336269 |
| Category                 | Benchmark                    | Subtask                             |   Accuracy | Few-shot count    | Model                    |
|:-------------------------|:-----------------------------|:------------------------------------|-----------:|:------------------|:-------------------------|
| symbolic_problem_solving | gsm8k                        |                                     |  0.0871873 | 0-shot            | mosaicml/mpt-7b |
| commonsense_reasoning    | copa                         |                                     |  0.8       | 0-shot            | mosaicml/mpt-7b |
| commonsense_reasoning    | commonsense_qa               |                                     |  0.225225  | 0-shot            | mosaicml/mpt-7b |
| commonsense_reasoning    | piqa                         |                                     |  0.799238  | 0-shot            | mosaicml/mpt-7b |
| commonsense_reasoning    | bigbench_strange_stories     |                                     |  0.568965  | 0-shot            | mosaicml/mpt-7b |
| commonsense_reasoning    | bigbench_strategy_qa         |                                     |  0.561817  | 0-shot            | mosaicml/mpt-7b |
| language_understanding   | lambada_openai               |                                     |  0.702892  | 0-shot            | mosaicml/mpt-7b |
| language_understanding   | hellaswag                    |                                     |  0.761601  | 0-shot            | mosaicml/mpt-7b |
| reading_comprehension    | coqa                         |                                     |  0.453213  | 0-shot            | mosaicml/mpt-7b |
| reading_comprehension    | boolq                        |                                     |  0.747401  | 0-shot            | mosaicml/mpt-7b |
| world_knowledge          | triviaqa_sm_sub              |                                     |  0.493667  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          | jeopardy                     | Average                             |  0.459835  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | american_history                    |  0.513317  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | literature                          |  0.557143  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | science                             |  0.386555  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | word_origins                        |  0.265753  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | world_history                       |  0.576407  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          | bigbench_qa_wikidata         |                                     |  0.655824  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          | arc_easy                     |                                     |  0.718855  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          | arc_challenge                |                                     |  0.440273  | 3-shot            | mosaicml/mpt-7b |
| commonsense_reasoning    | siqa                         |                                     |  0.54913   | 3-shot            | mosaicml/mpt-7b |
| language_understanding   | winograd                     |                                     |  0.85348   | 3-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | bigbench_operators           |                                     |  0.333333  | 3-shot            | mosaicml/mpt-7b |
| reading_comprehension    | squad                        |                                     |  0.553264  | 3-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | svamp                        |                                     |  0.32      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          | mmlu                         | Average                             |  0.281358  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | abstract_algebra                    |  0.26      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | anatomy                             |  0.303704  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | astronomy                           |  0.309211  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | business_ethics                     |  0.38      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | clinical_knowledge                  |  0.286792  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | college_biology                     |  0.291667  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | college_chemistry                   |  0.21      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | college_computer_science            |  0.25      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | college_mathematics                 |  0.31      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | college_medicine                    |  0.225434  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | college_physics                     |  0.215686  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | computer_security                   |  0.35      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | conceptual_physics                  |  0.289362  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | econometrics                        |  0.245614  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | electrical_engineering              |  0.324138  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | elementary_mathematics              |  0.272487  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | formal_logic                        |  0.222222  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | global_facts                        |  0.32      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_biology                 |  0.3       | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_chemistry               |  0.187192  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_computer_science        |  0.34      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_european_history        |  0.321212  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_geography               |  0.313131  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_government_and_politics |  0.264249  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_macroeconomics          |  0.266667  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_mathematics             |  0.211111  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_microeconomics          |  0.247899  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_physics                 |  0.291391  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_psychology              |  0.251376  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_statistics              |  0.208333  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_us_history              |  0.181373  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_world_history           |  0.253165  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | human_aging                         |  0.403587  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | human_sexuality                     |  0.259542  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | international_law                   |  0.347107  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | jurisprudence                       |  0.324074  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | logical_fallacies                   |  0.251534  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | machine_learning                    |  0.321429  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | management                          |  0.242718  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | marketing                           |  0.299145  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | medical_genetics                    |  0.22      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | miscellaneous                       |  0.301405  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | moral_disputes                      |  0.32659   | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | moral_scenarios                     |  0.259218  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | nutrition                           |  0.30719   | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | philosophy                          |  0.315113  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | prehistory                          |  0.302469  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | professional_accounting             |  0.248227  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | professional_law                    |  0.269231  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | professional_medicine               |  0.198529  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | professional_psychology             |  0.271242  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | public_relations                    |  0.381818  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | security_studies                    |  0.236735  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | sociology                           |  0.268657  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | us_foreign_policy                   |  0.36      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | virology                            |  0.349398  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | world_religions                     |  0.269006  | 5-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | bigbench_dyck_languages      |                                     |  0.304     | 5-shot            | mosaicml/mpt-7b |
| language_understanding   | winogrande                   |                                     |  0.722178  | 5-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | agi_eval_lsat_ar             |                                     |  0.23913   | 5-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | simple_arithmetic_nospaces   |                                     |  0.082     | 5-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | simple_arithmetic_withspaces |                                     |  0.089     | 5-shot            | mosaicml/mpt-7b |
| reading_comprehension    | agi_eval_lsat_rc             |                                     |  0.235075  | 5-shot            | mosaicml/mpt-7b |
| reading_comprehension    | agi_eval_lsat_lr             |                                     |  0.247059  | 5-shot            | mosaicml/mpt-7b |
| reading_comprehension    | agi_eval_sat_en              |                                     |  0.257282  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          | arc_challenge                |                                     |  0.4343    | 25-shot           | mosaicml/mpt-7b |
| commonsense_reasoning    | openbook_qa                  |                                     |  0.452     | 10-shot           | mosaicml/mpt-7b |
| language_understanding   | hellaswag                    |                                     |  0.765385  | 10-shot           | mosaicml/mpt-7b |
|                          | bigbench_cs_algorithms       |                                     |  0.480303  | 10-shot           | mosaicml/mpt-7b |
| symbolic_problem_solving | bigbench_elementary_math_qa  |                                     |  0.281787  | 1-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | gsm8k                        |                                     |   0.148597 | 0-shot            | meta-llama/Llama-2-7b-hf |
| commonsense_reasoning    | copa                         |                                     |   0.8      | 0-shot            | meta-llama/Llama-2-7b-hf |
| commonsense_reasoning    | commonsense_qa               |                                     |   0.383292 | 0-shot            | meta-llama/Llama-2-7b-hf |
| commonsense_reasoning    | piqa                         |                                     |   0.786181 | 0-shot            | meta-llama/Llama-2-7b-hf |
| commonsense_reasoning    | bigbench_strange_stories     |                                     |   0.614943 | 0-shot            | meta-llama/Llama-2-7b-hf |
| commonsense_reasoning    | bigbench_strategy_qa         |                                     |   0.585408 | 0-shot            | meta-llama/Llama-2-7b-hf |
| language_understanding   | lambada_openai               |                                     |   0.736658 | 0-shot            | meta-llama/Llama-2-7b-hf |
| language_understanding   | hellaswag                    |                                     |   0.74995  | 0-shot            | meta-llama/Llama-2-7b-hf |
| reading_comprehension    | coqa                         |                                     |   0.4705   | 0-shot            | meta-llama/Llama-2-7b-hf |
| reading_comprehension    | boolq                        |                                     |   0.792966 | 0-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          | triviaqa_sm_sub              |                                     |   0.582333 | 3-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          | jeopardy                     | Average                             |   0.508028 | 3-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | american_history                    |   0.564165 | 3-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | literature                          |   0.661224 | 3-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | science                             |   0.388655 | 3-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | word_origins                        |   0.30411  | 3-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | world_history                       |   0.621984 | 3-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          | bigbench_qa_wikidata         |                                     |   0.693125 | 3-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          | arc_easy                     |                                     |   0.757155 | 3-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          | arc_challenge                |                                     |   0.494881 | 3-shot            | meta-llama/Llama-2-7b-hf |
| commonsense_reasoning    | siqa                         |                                     |   0.730809 | 3-shot            | meta-llama/Llama-2-7b-hf |
| language_understanding   | winograd                     |                                     |   0.879121 | 3-shot            | meta-llama/Llama-2-7b-hf |
| symbolic_problem_solving | bigbench_operators           |                                     |   0.42381  | 3-shot            | meta-llama/Llama-2-7b-hf |
| reading_comprehension    | squad                        |                                     |   0.532545 | 3-shot            | meta-llama/Llama-2-7b-hf |
| symbolic_problem_solving | svamp                        |                                     |   0.423333 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          | mmlu                         | Average                             |   0.457122 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | abstract_algebra                    |   0.31     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | anatomy                             |   0.422222 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | astronomy                           |   0.460526 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | business_ethics                     |   0.48     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | clinical_knowledge                  |   0.418868 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | college_biology                     |   0.416667 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | college_chemistry                   |   0.28     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | college_computer_science            |   0.29     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | college_mathematics                 |   0.34     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | college_medicine                    |   0.421965 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | college_physics                     |   0.264706 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | computer_security                   |   0.56     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | conceptual_physics                  |   0.434043 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | econometrics                        |   0.307018 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | electrical_engineering              |   0.427586 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | elementary_mathematics              |   0.285714 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | formal_logic                        |   0.325397 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | global_facts                        |   0.41     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_biology                 |   0.512903 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_chemistry               |   0.349754 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_computer_science        |   0.45     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_european_history        |   0.606061 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_geography               |   0.520202 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_government_and_politics |   0.668394 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_macroeconomics          |   0.407692 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_mathematics             |   0.27037  | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_microeconomics          |   0.403361 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_physics                 |   0.258278 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_psychology              |   0.592661 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_statistics              |   0.222222 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_us_history              |   0.578431 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_world_history           |   0.561181 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | human_aging                         |   0.565022 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | human_sexuality                     |   0.564885 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | international_law                   |   0.636364 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | jurisprudence                       |   0.546296 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | logical_fallacies                   |   0.521472 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | machine_learning                    |   0.339286 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | management                          |   0.514563 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | marketing                           |   0.67094  | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | medical_genetics                    |   0.52     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | miscellaneous                       |   0.630907 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | moral_disputes                      |   0.523121 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | moral_scenarios                     |   0.250279 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | nutrition                           |   0.486928 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | philosophy                          |   0.553055 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | prehistory                          |   0.506173 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | professional_accounting             |   0.368794 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | professional_law                    |   0.34485  | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | professional_medicine               |   0.422794 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | professional_psychology             |   0.45098  | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | public_relations                    |   0.490909 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | security_studies                    |   0.420408 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | sociology                           |   0.666667 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | us_foreign_policy                   |   0.68     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | virology                            |   0.475904 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | world_religions                     |   0.649123 | 5-shot            | meta-llama/Llama-2-7b-hf |
| symbolic_problem_solving | bigbench_dyck_languages      |                                     |   0.291    | 5-shot            | meta-llama/Llama-2-7b-hf |
| language_understanding   | winogrande                   |                                     |   0.73086  | 5-shot            | meta-llama/Llama-2-7b-hf |
| symbolic_problem_solving | agi_eval_lsat_ar             |                                     |   0.252174 | 5-shot            | meta-llama/Llama-2-7b-hf |
| symbolic_problem_solving | simple_arithmetic_nospaces   |                                     |   0.245    | 5-shot            | meta-llama/Llama-2-7b-hf |
| symbolic_problem_solving | simple_arithmetic_withspaces |                                     |   0.256    | 5-shot            | meta-llama/Llama-2-7b-hf |
| reading_comprehension    | agi_eval_lsat_rc             |                                     |   0.373134 | 5-shot            | meta-llama/Llama-2-7b-hf |
| reading_comprehension    | agi_eval_lsat_lr             |                                     |   0.329412 | 5-shot            | meta-llama/Llama-2-7b-hf |
| reading_comprehension    | agi_eval_sat_en              |                                     |   0.368932 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          | arc_challenge                |                                     |   0.514505 | 25-shot           | meta-llama/Llama-2-7b-hf |
| commonsense_reasoning    | openbook_qa                  |                                     |   0.458    | 10-shot           | meta-llama/Llama-2-7b-hf |
| language_understanding   | hellaswag                    |                                     |   0.773053 | 10-shot           | meta-llama/Llama-2-7b-hf |
|                          | bigbench_cs_algorithms       |                                     |   0.44697  | 10-shot           | meta-llama/Llama-2-7b-hf |
| symbolic_problem_solving | bigbench_elementary_math_qa  |                                     |   0.274371 | 1-shot            | meta-llama/Llama-2-7b-hf |

Baseline (pre-migration) runs, for comparison against the migrated code above:

MPT-7B: eval-gauntlet-pre-migration-mpt-N3lIuF
Llama-2-7B: eval-gauntlet-pre-migration-llama-imFgAZ
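
The numbers above and below agree to within run-to-run noise (e.g. Llama-2-7B core_average 0.417207 post-migration vs 0.417124 pre-migration), which is the expected outcome for a pure code move. A quick way to check parity mechanically, assuming each summary table is exported to CSV (file names here are hypothetical), is to diff the score columns:

```python
# Parity check between pre- and post-migration gauntlet summaries.
# File names are hypothetical; export the two summary tables however you like.
import pandas as pd

pre = pd.read_csv('gauntlet_pre_migration.csv', index_col='model_name')
post = pd.read_csv('gauntlet_post_migration.csv', index_col='model_name')

# Differences should sit at noise level; a large gap in any category would
# indicate a behavior change introduced by the migration.
print((post - pre).abs().max())
```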

| model_name               |   core_average |   world_knowledge |   commonsense_reasoning |   language_understanding |   symbolic_problem_solving |   reading_comprehension |
|:-------------------------|---------------:|------------------:|------------------------:|-------------------------:|---------------------------:|------------------------:|
| meta-llama/Llama-2-7b-hf |       0.417124 |          0.510394 |                0.355677 |                 0.655805 |                   0.227475 |                0.336269 |
| mosaicml/mpt-7b |       0.343081 |          0.421662 |                0.256372 |                 0.634086 |                   0.155426 |                0.247861 |
| Category                 | Benchmark                    | Subtask                             |   Accuracy | Few-shot count    | Model                    |
|:-------------------------|:-----------------------------|:------------------------------------|-----------:|:------------------|:-------------------------|
| symbolic_problem_solving | gsm8k                        |                                     |  0.0871873 | 0-shot            | mosaicml/mpt-7b |
| commonsense_reasoning    | copa                         |                                     |  0.8       | 0-shot            | mosaicml/mpt-7b |
| commonsense_reasoning    | commonsense_qa               |                                     |  0.225225  | 0-shot            | mosaicml/mpt-7b |
| commonsense_reasoning    | piqa                         |                                     |  0.799238  | 0-shot            | mosaicml/mpt-7b |
| commonsense_reasoning    | bigbench_strange_stories     |                                     |  0.568965  | 0-shot            | mosaicml/mpt-7b |
| commonsense_reasoning    | bigbench_strategy_qa         |                                     |  0.561817  | 0-shot            | mosaicml/mpt-7b |
| language_understanding   | lambada_openai               |                                     |  0.702892  | 0-shot            | mosaicml/mpt-7b |
| language_understanding   | hellaswag                    |                                     |  0.761601  | 0-shot            | mosaicml/mpt-7b |
| reading_comprehension    | coqa                         |                                     |  0.453213  | 0-shot            | mosaicml/mpt-7b |
| reading_comprehension    | boolq                        |                                     |  0.747401  | 0-shot            | mosaicml/mpt-7b |
| world_knowledge          | triviaqa_sm_sub              |                                     |  0.493667  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          | jeopardy                     | Average                             |  0.459835  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | american_history                    |  0.513317  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | literature                          |  0.557143  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | science                             |  0.386555  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | word_origins                        |  0.265753  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | world_history                       |  0.576407  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          | bigbench_qa_wikidata         |                                     |  0.655824  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          | arc_easy                     |                                     |  0.718855  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          | arc_challenge                |                                     |  0.440273  | 3-shot            | mosaicml/mpt-7b |
| commonsense_reasoning    | siqa                         |                                     |  0.54913   | 3-shot            | mosaicml/mpt-7b |
| language_understanding   | winograd                     |                                     |  0.85348   | 3-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | bigbench_operators           |                                     |  0.333333  | 3-shot            | mosaicml/mpt-7b |
| reading_comprehension    | squad                        |                                     |  0.553264  | 3-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | svamp                        |                                     |  0.32      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          | mmlu                         | Average                             |  0.281358  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | abstract_algebra                    |  0.26      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | anatomy                             |  0.303704  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | astronomy                           |  0.309211  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | business_ethics                     |  0.38      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | clinical_knowledge                  |  0.286792  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | college_biology                     |  0.291667  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | college_chemistry                   |  0.21      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | college_computer_science            |  0.25      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | college_mathematics                 |  0.31      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | college_medicine                    |  0.225434  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | college_physics                     |  0.215686  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | computer_security                   |  0.35      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | conceptual_physics                  |  0.289362  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | econometrics                        |  0.245614  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | electrical_engineering              |  0.324138  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | elementary_mathematics              |  0.272487  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | formal_logic                        |  0.222222  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | global_facts                        |  0.32      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_biology                 |  0.3       | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_chemistry               |  0.187192  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_computer_science        |  0.34      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_european_history        |  0.321212  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_geography               |  0.313131  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_government_and_politics |  0.264249  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_macroeconomics          |  0.266667  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_mathematics             |  0.211111  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_microeconomics          |  0.247899  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_physics                 |  0.291391  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_psychology              |  0.251376  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_statistics              |  0.208333  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_us_history              |  0.181373  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_world_history           |  0.253165  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | human_aging                         |  0.403587  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | human_sexuality                     |  0.259542  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | international_law                   |  0.347107  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | jurisprudence                       |  0.324074  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | logical_fallacies                   |  0.251534  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | machine_learning                    |  0.321429  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | management                          |  0.242718  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | marketing                           |  0.299145  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | medical_genetics                    |  0.22      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | miscellaneous                       |  0.301405  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | moral_disputes                      |  0.32659   | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | moral_scenarios                     |  0.259218  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | nutrition                           |  0.30719   | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | philosophy                          |  0.315113  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | prehistory                          |  0.302469  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | professional_accounting             |  0.248227  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | professional_law                    |  0.269231  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | professional_medicine               |  0.198529  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | professional_psychology             |  0.271242  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | public_relations                    |  0.381818  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | security_studies                    |  0.236735  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | sociology                           |  0.268657  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | us_foreign_policy                   |  0.36      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | virology                            |  0.349398  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | world_religions                     |  0.269006  | 5-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | bigbench_dyck_languages      |                                     |  0.304     | 5-shot            | mosaicml/mpt-7b |
| language_understanding   | winogrande                   |                                     |  0.722178  | 5-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | agi_eval_lsat_ar             |                                     |  0.23913   | 5-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | simple_arithmetic_nospaces   |                                     |  0.082     | 5-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | simple_arithmetic_withspaces |                                     |  0.089     | 5-shot            | mosaicml/mpt-7b |
| reading_comprehension    | agi_eval_lsat_rc             |                                     |  0.235075  | 5-shot            | mosaicml/mpt-7b |
| reading_comprehension    | agi_eval_lsat_lr             |                                     |  0.247059  | 5-shot            | mosaicml/mpt-7b |
| reading_comprehension    | agi_eval_sat_en              |                                     |  0.257282  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          | arc_challenge                |                                     |  0.4343    | 25-shot           | mosaicml/mpt-7b |
| commonsense_reasoning    | openbook_qa                  |                                     |  0.452     | 10-shot           | mosaicml/mpt-7b |
| language_understanding   | hellaswag                    |                                     |  0.765385  | 10-shot           | mosaicml/mpt-7b |
|                          | bigbench_cs_algorithms       |                                     |  0.480303  | 10-shot           | mosaicml/mpt-7b |
| symbolic_problem_solving | bigbench_elementary_math_qa  |                                     |  0.281787  | 1-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | gsm8k                        |                                     |   0.148597 | 0-shot            | meta-llama/Llama-2-7b-hf |
| commonsense_reasoning    | copa                         |                                     |   0.8      | 0-shot            | meta-llama/Llama-2-7b-hf |
| commonsense_reasoning    | commonsense_qa               |                                     |   0.383292 | 0-shot            | meta-llama/Llama-2-7b-hf |
| commonsense_reasoning    | piqa                         |                                     |   0.786181 | 0-shot            | meta-llama/Llama-2-7b-hf |
| commonsense_reasoning    | bigbench_strange_stories     |                                     |   0.614943 | 0-shot            | meta-llama/Llama-2-7b-hf |
| commonsense_reasoning    | bigbench_strategy_qa         |                                     |   0.585408 | 0-shot            | meta-llama/Llama-2-7b-hf |
| language_understanding   | lambada_openai               |                                     |   0.736658 | 0-shot            | meta-llama/Llama-2-7b-hf |
| language_understanding   | hellaswag                    |                                     |   0.74995  | 0-shot            | meta-llama/Llama-2-7b-hf |
| reading_comprehension    | coqa                         |                                     |   0.4705   | 0-shot            | meta-llama/Llama-2-7b-hf |
| reading_comprehension    | boolq                        |                                     |   0.792966 | 0-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          | triviaqa_sm_sub              |                                     |   0.582333 | 3-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          | jeopardy                     | Average                             |   0.508028 | 3-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | american_history                    |   0.564165 | 3-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | literature                          |   0.661224 | 3-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | science                             |   0.388655 | 3-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | word_origins                        |   0.30411  | 3-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | world_history                       |   0.621984 | 3-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          | bigbench_qa_wikidata         |                                     |   0.693125 | 3-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          | arc_easy                     |                                     |   0.757155 | 3-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          | arc_challenge                |                                     |   0.494881 | 3-shot            | meta-llama/Llama-2-7b-hf |
| commonsense_reasoning    | siqa                         |                                     |   0.730809 | 3-shot            | meta-llama/Llama-2-7b-hf |
| language_understanding   | winograd                     |                                     |   0.879121 | 3-shot            | meta-llama/Llama-2-7b-hf |
| symbolic_problem_solving | bigbench_operators           |                                     |   0.42381  | 3-shot            | meta-llama/Llama-2-7b-hf |
| reading_comprehension    | squad                        |                                     |   0.532545 | 3-shot            | meta-llama/Llama-2-7b-hf |
| symbolic_problem_solving | svamp                        |                                     |   0.42     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          | mmlu                         | Average                             |   0.457122 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | abstract_algebra                    |   0.31     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | anatomy                             |   0.422222 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | astronomy                           |   0.460526 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | business_ethics                     |   0.48     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | clinical_knowledge                  |   0.418868 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | college_biology                     |   0.416667 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | college_chemistry                   |   0.28     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | college_computer_science            |   0.29     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | college_mathematics                 |   0.34     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | college_medicine                    |   0.421965 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | college_physics                     |   0.264706 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | computer_security                   |   0.56     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | conceptual_physics                  |   0.434043 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | econometrics                        |   0.307018 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | electrical_engineering              |   0.427586 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | elementary_mathematics              |   0.285714 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | formal_logic                        |   0.325397 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | global_facts                        |   0.41     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_biology                 |   0.512903 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_chemistry               |   0.349754 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_computer_science        |   0.45     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_european_history        |   0.606061 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_geography               |   0.520202 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_government_and_politics |   0.668394 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_macroeconomics          |   0.407692 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_mathematics             |   0.27037  | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_microeconomics          |   0.403361 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_physics                 |   0.258278 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_psychology              |   0.592661 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_statistics              |   0.222222 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_us_history              |   0.578431 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_world_history           |   0.561181 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | human_aging                         |   0.565022 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | human_sexuality                     |   0.564885 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | international_law                   |   0.636364 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | jurisprudence                       |   0.546296 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | logical_fallacies                   |   0.521472 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | machine_learning                    |   0.339286 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | management                          |   0.514563 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | marketing                           |   0.67094  | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | medical_genetics                    |   0.52     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | miscellaneous                       |   0.630907 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | moral_disputes                      |   0.523121 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | moral_scenarios                     |   0.250279 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | nutrition                           |   0.486928 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | philosophy                          |   0.553055 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | prehistory                          |   0.506173 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | professional_accounting             |   0.368794 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | professional_law                    |   0.34485  | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | professional_medicine               |   0.422794 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | professional_psychology             |   0.45098  | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | public_relations                    |   0.490909 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | security_studies                    |   0.420408 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | sociology                           |   0.666667 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | us_foreign_policy                   |   0.68     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | virology                            |   0.475904 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | world_religions                     |   0.649123 | 5-shot            | meta-llama/Llama-2-7b-hf |
| symbolic_problem_solving | bigbench_dyck_languages      |                                     |   0.291    | 5-shot            | meta-llama/Llama-2-7b-hf |
| language_understanding   | winogrande                   |                                     |   0.73086  | 5-shot            | meta-llama/Llama-2-7b-hf |
| symbolic_problem_solving | agi_eval_lsat_ar             |                                     |   0.252174 | 5-shot            | meta-llama/Llama-2-7b-hf |
| symbolic_problem_solving | simple_arithmetic_nospaces   |                                     |   0.245    | 5-shot            | meta-llama/Llama-2-7b-hf |
| symbolic_problem_solving | simple_arithmetic_withspaces |                                     |   0.256    | 5-shot            | meta-llama/Llama-2-7b-hf |
| reading_comprehension    | agi_eval_lsat_rc             |                                     |   0.373134 | 5-shot            | meta-llama/Llama-2-7b-hf |
| reading_comprehension    | agi_eval_lsat_lr             |                                     |   0.329412 | 5-shot            | meta-llama/Llama-2-7b-hf |
| reading_comprehension    | agi_eval_sat_en              |                                     |   0.368932 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          | arc_challenge                |                                     |   0.514505 | 25-shot           | meta-llama/Llama-2-7b-hf |
| commonsense_reasoning    | openbook_qa                  |                                     |   0.458    | 10-shot           | meta-llama/Llama-2-7b-hf |
| language_understanding   | hellaswag                    |                                     |   0.773053 | 10-shot           | meta-llama/Llama-2-7b-hf |
| symbolic_problem_solving | bigbench_cs_algorithms       |                                     |   0.44697  | 10-shot           | meta-llama/Llama-2-7b-hf |
| symbolic_problem_solving | bigbench_elementary_math_qa  |                                     |   0.274371 | 1-shot            | meta-llama/Llama-2-7b-hf |

CODE
Pre-migration runs: llama2-code-pre-migration-D3fXGe, mpt7b-code-pre-migration-x8nPTd

| Category   | Benchmark                 | Subtask   |   Accuracy | Number few shot   | Model                    |
|:-----------|:--------------------------|:----------|-----------:|:------------------|:-------------------------|
|            | human_eval                |           |  0.0853659 | 0-shot            | meta-llama/Llama-2-7b-hf |
|            | human_eval_cpp            |           |  0.0372671 | 0-shot            | meta-llama/Llama-2-7b-hf |
|            | human_eval_js             |           |  0.0487805 | 0-shot            | meta-llama/Llama-2-7b-hf |
|            | human_eval_return_simple  |           |  0.675676  | 0-shot            | meta-llama/Llama-2-7b-hf |
|            | human_eval_return_complex |           |  0.220472  | 0-shot            | meta-llama/Llama-2-7b-hf |
|            | human_eval                |           |  0.0426829 | 0-shot            | mosaicml/mpt-7b |
|            | human_eval_cpp            |           |  0.0372671 | 0-shot            | mosaicml/mpt-7b |
|            | human_eval_js             |           |  0.0121951 | 0-shot            | mosaicml/mpt-7b |
|            | human_eval_return_simple  |           |  0.675676  | 0-shot            | mosaicml/mpt-7b |
|            | human_eval_return_complex |           |  0.251969  | 0-shot            | mosaicml/mpt-7b |

Post-migration runs: llama2-code-post-migration-aSqFno, mpt7b-code-post-migration-3N0tKy

| Category   | Benchmark                 | Subtask   |   Accuracy | Number few shot   | Model                    |
|:-----------|:--------------------------|:----------|-----------:|:------------------|:-------------------------|
|            | human_eval                |           |  0.0853659 | 0-shot            | meta-llama/Llama-2-7b-hf |
|            | human_eval_cpp            |           |  0.0372671 | 0-shot            | meta-llama/Llama-2-7b-hf |
|            | human_eval_js             |           |  0.0487805 | 0-shot            | meta-llama/Llama-2-7b-hf |
|            | human_eval_return_simple  |           |  0.675676  | 0-shot            | meta-llama/Llama-2-7b-hf |
|            | human_eval_return_complex |           |  0.220472  | 0-shot            | meta-llama/Llama-2-7b-hf |
|            | human_eval                |           |  0.0426829 | 0-shot            | mosaicml/mpt-7b |
|            | human_eval_cpp            |           |  0.0372671 | 0-shot            | mosaicml/mpt-7b |
|            | human_eval_js             |           |  0.0121951 | 0-shot            | mosaicml/mpt-7b |
|            | human_eval_return_simple  |           |  0.675676  | 0-shot            | mosaicml/mpt-7b |
|            | human_eval_return_complex |           |  0.251969  | 0-shot            | mosaicml/mpt-7b |

@bmosaicml bmosaicml mentioned this pull request Feb 21, 2024
Contributor

@eitanturok eitanturok left a comment

Looks absolutely fire, best PR I've seen in my whole life :)

Will let Max do the final vetting.

@maxisawesome
Contributor

It appears that a single svamp example differed between pre- and post-migration, so I reran svamp only. The runs produced the same results before and after the migration, so I am confident about our results there:
llama2-svamp-post-migration-0tvV1U
llama2-svamp-pre-migration-HZAfmD
svamp: 0.346667 both times
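
For anyone wanting to reproduce a single-task check like this, the sketch below shows roughly how a one-off svamp dataloader could be built with the migrated classes. It is a minimal sketch, not the exact setup used for these runs: the `get_icl_task_dataloader` import path, the argument names, the delimiter choices, and the local `svamp.jsonl` path are all assumptions.

```python
# Minimal sketch of a single-task svamp rerun with the migrated ICL classes.
# Assumptions (not verified against this PR): the helper's module path, its
# argument names, and the local dataset path.
from transformers import AutoTokenizer

from llmfoundry.eval.datasets import get_icl_task_dataloader

tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')

dl = get_icl_task_dataloader(
    # post-rename task type; 'question_answering' stays as a deprecated alias
    icl_task_type='generation_task_with_answers',
    dataset_uri='scripts/eval/local_data/symbolic_problem_solving/svamp.jsonl',
    tokenizer=tokenizer,
    batch_size=8,
    max_seq_len=2048,
    pad_tok_id=tokenizer.eos_token_id,
    num_fewshot=5,
    prompt_string='',
    example_delimiter='\n',
    continuation_delimiter=': ',
    destination_path='/tmp/svamp_icl.jsonl',
)
```

Feeding the same dataloader config to both the pre- and post-migration branches is the quickest way to confirm parity on a single task.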

@maxisawesome
Contributor

Llama2 human_eval pre-migration llama2-code-pre-migration-D3fXGe:

Printing complete results for all models
| Category   | Benchmark                 | Subtask   |   Accuracy | Number few shot   | Model                    |
|:-----------|:--------------------------|:----------|-----------:|:------------------|:-------------------------|
|            | human_eval                |           |  0.0853659 | 0-shot            | meta-llama/Llama-2-7b-hf |
|            | human_eval_cpp            |           |  0.0372671 | 0-shot            | meta-llama/Llama-2-7b-hf |
|            | human_eval_js             |           |  0.0487805 | 0-shot            | meta-llama/Llama-2-7b-hf |
|            | human_eval_return_simple  |           |  0.675676  | 0-shot            | meta-llama/Llama-2-7b-hf |
|            | human_eval_return_complex |           |  0.220472  | 0-shot            | meta-llama/Llama-2-7b-hf |

Llama2 human_eval post-migration llama2-code-post-migration-aSqFno:

Printing complete results for all models
| Category   | Benchmark                 | Subtask   |   Accuracy | Number few shot   | Model                    |
|:-----------|:--------------------------|:----------|-----------:|:------------------|:-------------------------|
|            | human_eval                |           |  0.0853659 | 0-shot            | meta-llama/Llama-2-7b-hf |
|            | human_eval_cpp            |           |  0.0372671 | 0-shot            | meta-llama/Llama-2-7b-hf |
|            | human_eval_js             |           |  0.0487805 | 0-shot            | meta-llama/Llama-2-7b-hf |
|            | human_eval_return_simple  |           |  0.675676  | 0-shot            | meta-llama/Llama-2-7b-hf |
|            | human_eval_return_complex |           |  0.220472  | 0-shot            | meta-llama/Llama-2-7b-hf |

poorly named mpt pre-migration run: llama2-code-pre-migration-gAc90c

Printing complete results for all models
| Category   | Benchmark                 | Subtask   |   Accuracy | Number few shot   | Model           |
|:-----------|:--------------------------|:----------|-----------:|:------------------|:----------------|
|            | human_eval                |           |  0.097561  | 0-shot            | mosaicml/mpt-7b |
|            | human_eval_cpp            |           |  0.0434783 | 0-shot            | mosaicml/mpt-7b |
|            | human_eval_js             |           |  0.0426829 | 0-shot            | mosaicml/mpt-7b |
|            | human_eval_return_simple  |           |  0.810811  | 0-shot            | mosaicml/mpt-7b |
|            | human_eval_return_complex |           |  0.244094  | 0-shot            | mosaicml/mpt-7b |

mpt post-migration run: mpt7b-code-post-migration-oU1rq4

Printing complete results for all models
| Category   | Benchmark                 | Subtask   |   Accuracy | Number few shot   | Model           |
|:-----------|:--------------------------|:----------|-----------:|:------------------|:----------------|
|            | human_eval                |           |  0.097561  | 0-shot            | mosaicml/mpt-7b |
|            | human_eval_cpp            |           |  0.0434783 | 0-shot            | mosaicml/mpt-7b |
|            | human_eval_js             |           |  0.0426829 | 0-shot            | mosaicml/mpt-7b |
|            | human_eval_return_simple  |           |  0.810811  | 0-shot            | mosaicml/mpt-7b |
|            | human_eval_return_complex |           |  0.244094  | 0-shot            | mosaicml/mpt-7b |

@maxisawesome
Contributor

With these results I approve the PR!

@maxisawesome maxisawesome self-requested a review April 12, 2024 17:20
Contributor

@maxisawesome maxisawesome left a comment

Eval results are the same before and after, and I approve.

Collaborator

@dakinggg dakinggg left a comment

Did not review the code super closely, relying on Max's review for that. Results look good to me.

@maxisawesome maxisawesome merged commit 3729ba3 into main Apr 12, 2024
9 checks passed
@maxisawesome maxisawesome deleted the migrate_subclasses_to_foundry branch April 12, 2024 21:00
KuuCi pushed a commit that referenced this pull request Apr 18, 2024
* start

* still need to migrate fixtures

* wip onboarding tests

* still workin'

* still wip

* maybe done; test out on mcli now

* mcli

* remove calibration error

* migration

* migration

* full migration

* precommit

* fix

* fix pytests

* refactor QA

* update

* restore

* add

* fix

* wip

* update readme

* final pyright

* done

* pass prelimiter into ALL the ICL task datasets

* allow QA task name still for backward compatibility (see the shim sketch after this commit list)

* fix

* fix test

* add generation length

* remove max_new_tokens

* fix cpu tests

* try and fix lm eval test

* temp disable lm task eval test

* fix test?

* fix test

* finish

* fix

* Update scripts/eval/README.md

Co-authored-by: Daniel King <[email protected]>

* fix comments

* fix bug with seq len

* restore mcli

* merge

* fix builder

* add deprecation warning

* add deprecation warning

* merge

* merge

* add logging necessities to nlp.py

* add attention_mask test update

* fix generation_length in tests

* fix bug

* restore yamls

* fix typos

* add deprecation warning for code

* pyright wip

* fix pyright

* fix pyright error again

* fix pyright

* fix pyright

* update version

---------

Co-authored-by: Eitan Turok <[email protected]>
Co-authored-by: Max Marion <[email protected]>
Co-authored-by: Daniel King <[email protected]>
Co-authored-by: Max Marion <[email protected]>
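
Based on the "allow QA task name still for backward compatibility" and "add deprecation warning" commits above, the rename was presumably accompanied by a shim of roughly this shape. This is an illustrative sketch, assuming the class lives in `llmfoundry.eval.datasets.in_context_learning_evaluation`; it is not the actual foundry code.

```python
# Illustrative backward-compatibility shim, not the actual foundry code:
# the old QA task class name is kept as a thin alias that warns and delegates.
import warnings

from llmfoundry.eval.datasets.in_context_learning_evaluation import (
    InContextLearningGenerationTaskWithAnswersDataset,
)


class InContextLearningQATaskDataset(
        InContextLearningGenerationTaskWithAnswersDataset):
    """Deprecated alias for InContextLearningGenerationTaskWithAnswersDataset."""

    def __init__(self, *args, **kwargs):
        warnings.warn(
            'InContextLearningQATaskDataset is deprecated; use '
            'InContextLearningGenerationTaskWithAnswersDataset instead.',
            DeprecationWarning,
            stacklevel=2,
        )
        super().__init__(*args, **kwargs)
```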