Refactor in_context_learning_evaluation.py #2713

Merged: 126 commits into mosaicml:dev on Jan 26, 2024

Conversation

@maxisawesome (Contributor) commented on Nov 14, 2023

What does this PR do?

  1. Refactors the classes in composer/datasets/in_context_learning_evaluation.py to inherit from a single base class (see the sketch after this list)
  2. Adds parsing of arbitrary HuggingFace datasets
  3. Cleans up memory usage while loading the datasets
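
To make items 1 and 3 concrete, here is a minimal sketch of the pattern (an illustration under assumptions, not the merged code): a single base class owns dataset loading and tokenization, and each task dataset overrides small hooks. The class and hook names mirror this PR's commit history, but the signatures and bodies are simplified guesses.

```python
from typing import Any, Dict

from datasets import load_dataset
from torch.utils.data import Dataset
from transformers import PreTrainedTokenizerBase


class InContextLearningDataset(Dataset):
    """Shared base: owns loading and tokenization; subclasses override hooks."""

    def __init__(self, dataset_uri: str, tokenizer: PreTrainedTokenizerBase,
                 prompt_string: str = ''):
        self.tokenizer = tokenizer
        self.prompt_string = prompt_string
        raw = load_dataset('json', data_files=dataset_uri, split='train')
        # Tokenize lazily via .map() instead of materializing a second copy of
        # the data in a Python list -- the memory cleanup described in item 3.
        self.dataset = raw.map(self._tokenize_example)

    def construct_context(self, example: Dict[str, Any]) -> str:
        return example['context']

    def get_answer_from_example(self, example: Dict[str, Any]) -> str:
        return example['answer']

    def _tokenize_example(self, example: Dict[str, Any]) -> Dict[str, Any]:
        ctx = self.prompt_string + self.construct_context(example)
        answer = self.get_answer_from_example(example)
        return {
            'context': self.tokenizer(ctx)['input_ids'],
            'answer': self.tokenizer(answer)['input_ids'],
        }

    def __getitem__(self, index: int) -> Dict[str, Any]:
        return self.dataset[index]

    def __len__(self) -> int:
        return len(self.dataset)


class InContextLearningQATaskDataset(InContextLearningDataset):
    """A task subclass overrides only the hooks it needs."""

    def construct_context(self, example: Dict[str, Any]) -> str:
        return f"Q: {example['question']}\nA: "
```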

What issue(s) does this change relate to?

  1. Makes the code more manageable and significantly increases code reuse between the eval classes.
  2. Moves the long-context eval code from this llm-foundry PR into Composer; a sketch of the default HF parsing path follows this list. There will still need to be a follow-up llm-foundry PR that lets a user specify a small Python function to parse arbitrary datasets in a more advanced way than the default added here.
  3. Effectively implements @dakinggg's PR, seen here.
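
As a rough illustration of item 2's default parsing path (a sketch, not the merged interface): load any HuggingFace dataset, then map its columns onto the context/answer keys the eval datasets expect. The kwarg names `hf_loading_vars` and `hf_parsing_map` appear in this PR's commit history; their exact semantics here are assumptions.

```python
from datasets import load_dataset

# Illustrative defaults; gsm8k is just an example dataset with
# 'question'/'answer' columns.
hf_loading_vars = {'name': 'main', 'split': 'test'}               # forwarded to load_dataset
hf_parsing_map = {'context': ['question'], 'answer': ['answer']}  # eval key -> source columns

ds = load_dataset('gsm8k', **hf_loading_vars)

def parse(example):
    # Join each mapped set of source columns into the single key the
    # eval code expects.
    return {key: ' '.join(str(example[col]) for col in cols)
            for key, cols in hf_parsing_map.items()}

ds = ds.map(parse, remove_columns=ds.column_names)
```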

Comments

This is a pretty big change to the ICL datasets. Please pick it apart with scrutiny; I will not be offended. Please point out any additional tests needed or code smells you see.

All the tests pass and the eval_gauntlet runs. Before merging, I need to rerun EleutherAI/gpt-neo-125m on the gauntlet without my changes, as well as the 7B models with and without my changes.

Before submitting

  • Have you read the contributor guidelines?
  • Is this change a documentation change or typo fix? If so, skip the rest of this checklist.
  • Was this change discussed/approved in a GitHub issue first? It is much more likely to be merged if so.
  • Did you update any related docs and document your change?
    No; some long comments explaining classes/functions (and thus their documentation) still need to be updated. In addition, we should properly add the right classes here.
  • Did you update any related tests and add any new tests related to your change? (see testing)
    Yes, but please pick them apart and push me to add more tests to ensure things behave as we expect.
  • Did you run the tests locally to make sure they pass?
  • Did you run pre-commit on your change? (see the pre-commit section of prerequisites)

MPT eval results:

Pre-refactor:
Run name: test-eval-refactor-no-refactor-with-new-logging-mpt-BOGPfN
Core Average: 0.334786

| model_name      |   core_average |   lm_task_average |   lite_average |   world_knowledge |   commonsense_reasoning |   language_understanding |   symbolic_problem_solving |   reading_comprehension |   world_knowledge_lm_task_subscore |   language_understanding_lm_task_subscore |   symbolic_problem_solving_lm_task_subscore |   reading_comprehension_lm_task_subscore |   world_knowledge_lite |   commonsense_reasoning_lite |   language_understanding_lite |   symbolic_problem_solving_lite |   reading_comprehension_lite |
|:----------------|---------------:|------------------:|---------------:|------------------:|------------------------:|-------------------------:|---------------------------:|------------------------:|-----------------------------------:|------------------------------------------:|--------------------------------------------:|-----------------------------------------:|-----------------------:|-----------------------------:|------------------------------:|--------------------------------:|-----------------------------:|
| mosaicml/mpt-7b |       0.334786 |          0.438081 |        0.47509 |          0.356657 |                0.385817 |                 0.383055 |                   0.163968 |                0.384431 |                            0.58974 |                                  0.386438 |                                    0.260531 |                                 0.515614 |               0.357162 |                     0.604679 |                      0.712279 |                        0.185713 |                     0.515614 |

Post-refactor:
Run name: test-eval-refactor-with-new-logging-mpt-m6LcxC
Core Average: 0.334729

| model_name      |   core_average |   lm_task_average |   lite_average |   world_knowledge |   commonsense_reasoning |   language_understanding |   symbolic_problem_solving |   reading_comprehension |   world_knowledge_lm_task_subscore |   language_understanding_lm_task_subscore |   symbolic_problem_solving_lm_task_subscore |   reading_comprehension_lm_task_subscore |   world_knowledge_lite |   commonsense_reasoning_lite |   language_understanding_lite |   symbolic_problem_solving_lite |   reading_comprehension_lite |
|:----------------|---------------:|------------------:|---------------:|------------------:|------------------------:|-------------------------:|---------------------------:|------------------------:|-----------------------------------:|------------------------------------------:|--------------------------------------------:|-----------------------------------------:|-----------------------:|-----------------------------:|------------------------------:|--------------------------------:|-----------------------------:|
| mosaicml/mpt-7b |       0.334729 |          0.437803 |       0.474966 |          0.356733 |                0.386108 |                 0.382913 |                   0.163876 |                0.384012 |                            0.58974 |                                  0.386341 |                                    0.260657 |                                 0.514473 |               0.357731 |                     0.604679 |                      0.712215 |                        0.185731 |                     0.514473 |

Llama2 eval results:

Pre-refactor:
Run name: test-eval-refactor-no-refactor-with-logging-llama2-T3PGVn
Core Average: 0.404405

| model_name               |   core_average |   lm_task_average |   lite_average |   world_knowledge |   commonsense_reasoning |   language_understanding |   symbolic_problem_solving |   reading_comprehension |   world_knowledge_lm_task_subscore |   language_understanding_lm_task_subscore |   symbolic_problem_solving_lm_task_subscore |   reading_comprehension_lm_task_subscore |   world_knowledge_lite |   commonsense_reasoning_lite |   language_understanding_lite |   symbolic_problem_solving_lite |   reading_comprehension_lite |
|:-------------------------|---------------:|------------------:|---------------:|------------------:|------------------------:|-------------------------:|---------------------------:|------------------------:|-----------------------------------:|------------------------------------------:|--------------------------------------------:|-----------------------------------------:|-----------------------:|-----------------------------:|------------------------------:|--------------------------------:|-----------------------------:|
| meta-llama/Llama-2-7b-hf |       0.404405 |          0.502264 |       0.525717 |          0.453943 |                0.417293 |                 0.457704 |                   0.220997 |                0.472087 |                           0.632564 |                                  0.426062 |                                    0.331144 |                                 0.619284 |               0.438208 |                     0.594342 |                      0.713677 |                        0.263072 |                     0.619284 |

Post-refactor:
NOTE: the run name was mislabeled; this run is with the refactor (confirmable via mcli describe)
Run name: test-eval-refactor-no-refactor-with-new-logging-llama2-xG8WHF
Core Average: 0.40438

| model_name               |   core_average |   lm_task_average |   lite_average |   world_knowledge |   commonsense_reasoning |   language_understanding |   symbolic_problem_solving |   reading_comprehension |   world_knowledge_lm_task_subscore |   language_understanding_lm_task_subscore |   symbolic_problem_solving_lm_task_subscore |   reading_comprehension_lm_task_subscore |   world_knowledge_lite |   commonsense_reasoning_lite |   language_understanding_lite |   symbolic_problem_solving_lite |   reading_comprehension_lite |
|:-------------------------|---------------:|------------------:|---------------:|------------------:|------------------------:|-------------------------:|---------------------------:|------------------------:|-----------------------------------:|------------------------------------------:|--------------------------------------------:|-----------------------------------------:|-----------------------:|-----------------------------:|------------------------------:|--------------------------------:|-----------------------------:|
| meta-llama/Llama-2-7b-hf |        0.40438 |          0.502263 |       0.525621 |           0.45366 |                0.417738 |                 0.457752 |                   0.220765 |                0.471982 |                           0.632564 |                                  0.425965 |                                    0.331144 |                                 0.619379 |               0.437639 |                     0.594342 |                      0.713613 |                         0.26313 |                     0.619379 |
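
As a quick arithmetic check on the claim that the refactor is behavior-preserving: the pre/post core averages differ by 0.000057 (MPT-7B) and 0.000025 (Llama-2-7b), well within run-to-run eval noise. A throwaway check over the numbers in the tables above:

```python
# Core averages copied from the tables above (pre-refactor, post-refactor).
results = {
    'mosaicml/mpt-7b': (0.334786, 0.334729),
    'meta-llama/Llama-2-7b-hf': (0.404405, 0.404380),
}
for model, (pre, post) in results.items():
    assert abs(pre - post) < 1e-3, model  # agreement to ~1e-4
```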

@maxisawesome requested a review from a team as a code owner on November 14, 2023 18:15
@eracah (Contributor) commented on Nov 14, 2023

Oooh hoo hoo. I love a good refactor

@maxisawesome mentioned this pull request Nov 15, 2023
@dakinggg mentioned this pull request Nov 15, 2023
@dakinggg (Contributor) left a comment:

Left first round of comments. Thank you for taking this on!

(19 review threads on composer/datasets/in_context_learning_evaluation.py, all marked outdated/resolved)
@dakinggg (Contributor) left a comment:

🚢

@maxisawesome merged commit 39eb817 into mosaicml:dev on Jan 26, 2024
16 checks passed
ShashankMosaicML pushed a commit to ShashankMosaicML/composer that referenced this pull request Feb 3, 2024
* extremely wip commit w/ ICLdataset class

* more extremely broken wip

* add split keys

* first pass at moving QA to new format

* linting

* linting

* tests pass!

* fix repeated defaults, gold_idx --> gold

* basic HF parsing but test not passing

* fix cot. wip

* del device and world_size from tests

* change to .map

* fix schema

* tests passing w/ collate refactor

* finish HF tests

* add hf batch parsing

* linting

* add doc strings, rm hf_parsing_vars

* revert question_prelimiter back to prelimiter

* fix tests

* add more docstrings

* add doc strings, fix hf w/ categories

* add doc strings and default check

* linting

* add temperature

* remove need for hf:// on hf links

* Update composer/datasets/in_context_learning_evaluation.py

Co-authored-by: Daniel King <[email protected]>

* Update composer/datasets/in_context_learning_evaluation.py

Co-authored-by: Daniel King <[email protected]>

* Update composer/datasets/in_context_learning_evaluation.py

Co-authored-by: Daniel King <[email protected]>

* Update composer/datasets/in_context_learning_evaluation.py

Co-authored-by: Daniel King <[email protected]>

* Update composer/datasets/in_context_learning_evaluation.py

Co-authored-by: Daniel King <[email protected]>

* Update composer/datasets/in_context_learning_evaluation.py

Co-authored-by: Daniel King <[email protected]>

* fix comments, add test for check hf uri, still wip

* add gpu tests back

* update fix_eos_on_preamble

* update comments

* add return types

* typing, comments

* init RAG Generation task

* init _construct_context for RAG eval

* fix context key, move hf test dataset, few docstrings

* fix docstrings, add second path for schema

* init collate_fn,  _tokenize_example functions (bug exists)

* fix typo in warning error

* remove canonical_solution from batch

* missed one canonical_sllution

* remove encoded dataset to have just one dataset var

* rename sample to example

* improve comment

* edit RAGtask

* rm hf parsing func

* fix docstring, rename fewshot fun

* docstring

* change default split_batch to check types

* remove need to set split_keys

* doc string update

* improve comments

* rm stacked_keys for tokenize_labels bool

* initial wip in comments

* make _conv_tokens_to_tensors func

* wip - sketch out batch_mappings

* linting and debugging statements to help me remember where I'm doing wip

* all tests except one sus schema test passing

* fix missing fewshot for schema

* rm temperature add generation_kwargs

* add defaults that are currently set in llm-foundry builders.py

* fix defaults in tests, add some comments

* tests wip

* tests for new funcs

* rm RAG task

* more docstring

* tests passing

* wip

* wip

* add dict to data_spec

* Update composer/datasets/in_context_learning_evaluation.py

Co-authored-by: Daniel King <[email protected]>

* Update composer/datasets/in_context_learning_evaluation.py

Co-authored-by: Daniel King <[email protected]>

* Apply suggestions from code review

comment improvements

Co-authored-by: Daniel King <[email protected]>

* default_batch to base_batch and some docstrings

* update comments and fix test. move spacing to default get_answer

* improved docstrings

* finish schema/mc tests

* address pr review comments

* lintign

* fixing import, add type

* update comments

* update keys

* add typechecks for token ids

* rm outdated test

* fix tests

* add microbatch test

* pyright fixes

* linting attempts

* linting wip

* fix linting

* add early stopping and do_normalization documentation

* fix linting

* fix linting

* fix final dist test issue

* fix isort

* fix linting

* fix docstrings

* fix docstrings

* add warning filters

* fix warnings

* Update composer/datasets/in_context_learning_evaluation.py

fix spelling

Co-authored-by: Daniel King <[email protected]>

* Update composer/datasets/in_context_learning_evaluation.py

fix spelling

Co-authored-by: Daniel King <[email protected]>

* Update composer/datasets/in_context_learning_evaluation.py

fix spelling

Co-authored-by: Daniel King <[email protected]>

* Update composer/datasets/in_context_learning_evaluation.py

fix spelling

Co-authored-by: Daniel King <[email protected]>

* Update composer/datasets/in_context_learning_evaluation.py

fix spelling

Co-authored-by: Daniel King <[email protected]>

* Update composer/datasets/in_context_learning_evaluation.py

fix spelling

Co-authored-by: Daniel King <[email protected]>

* add capitalization

* revert default changes

* change update_generate_kwargs to public

* fix type

* move pad_tok_id error

---------

Co-authored-by: Daniel King <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Eitan Turok <[email protected]>
ShashankMosaicML pushed a commit to ShashankMosaicML/composer that referenced this pull request Feb 6, 2024