Refactor in_context_learning_evaluation.py #2713

Merged: 126 commits into mosaicml:dev on Jan 26, 2024

Conversation

@maxisawesome (Contributor) commented on Nov 14, 2023

What does this PR do?

  1. Refactors the classes in composer/datasets/in_context_learning_evaluation.py to inherit from a single base class (see the sketch after this list)
  2. Adds parsing of arbitrary HuggingFace datasets
  3. Cleans up memory usage while loading the datasets
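
To make items 1 and 3 concrete, here is a minimal sketch of the pattern (an illustration under assumptions, not the merged code): a single base class owns dataset loading and tokenization, and each task dataset overrides small hooks. The class and hook names mirror this PR's commit history, but the signatures and bodies are simplified guesses.

```python
from typing import Any, Dict

from datasets import load_dataset
from torch.utils.data import Dataset
from transformers import PreTrainedTokenizerBase


class InContextLearningDataset(Dataset):
    """Shared base: owns loading and tokenization; subclasses override hooks."""

    def __init__(self, dataset_uri: str, tokenizer: PreTrainedTokenizerBase,
                 prompt_string: str = ''):
        self.tokenizer = tokenizer
        self.prompt_string = prompt_string
        raw = load_dataset('json', data_files=dataset_uri, split='train')
        # Tokenize lazily via .map() instead of materializing a second copy of
        # the data in a Python list -- the memory cleanup described in item 3.
        self.dataset = raw.map(self._tokenize_example)

    def construct_context(self, example: Dict[str, Any]) -> str:
        return example['context']

    def get_answer_from_example(self, example: Dict[str, Any]) -> str:
        return example['answer']

    def _tokenize_example(self, example: Dict[str, Any]) -> Dict[str, Any]:
        ctx = self.prompt_string + self.construct_context(example)
        answer = self.get_answer_from_example(example)
        return {
            'context': self.tokenizer(ctx)['input_ids'],
            'answer': self.tokenizer(answer)['input_ids'],
        }

    def __getitem__(self, index: int) -> Dict[str, Any]:
        return self.dataset[index]

    def __len__(self) -> int:
        return len(self.dataset)


class InContextLearningQATaskDataset(InContextLearningDataset):
    """A task subclass overrides only the hooks it needs."""

    def construct_context(self, example: Dict[str, Any]) -> str:
        return f"Q: {example['question']}\nA: "
```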

What issue(s) does this change relate to?

  1. Makes the code more manageable and significantly increases code reuse between the eval classes.
  2. Moves the long-context eval code from this llm-foundry PR into Composer; a sketch of the default HF parsing path follows this list. There will still need to be a follow-up llm-foundry PR that lets a user specify a small Python function to parse arbitrary datasets in a more advanced way than the default added here.
  3. Effectively implements @dakinggg's PR, seen here.
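
As a rough illustration of item 2's default parsing path (a sketch, not the merged interface): load any HuggingFace dataset, then map its columns onto the context/answer keys the eval datasets expect. The kwarg names `hf_loading_vars` and `hf_parsing_map` appear in this PR's commit history; their exact semantics here are assumptions.

```python
from datasets import load_dataset

# Illustrative defaults; gsm8k is just an example dataset with
# 'question'/'answer' columns.
hf_loading_vars = {'name': 'main', 'split': 'test'}               # forwarded to load_dataset
hf_parsing_map = {'context': ['question'], 'answer': ['answer']}  # eval key -> source columns

ds = load_dataset('gsm8k', **hf_loading_vars)

def parse(example):
    # Join each mapped set of source columns into the single key the
    # eval code expects.
    return {key: ' '.join(str(example[col]) for col in cols)
            for key, cols in hf_parsing_map.items()}

ds = ds.map(parse, remove_columns=ds.column_names)
```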

Comments

This is a pretty big change to the ICL datasets. Please pick it apart with scrutiny; I will not be offended. Please point out any additional tests needed or code smells you see.

All the tests pass and the eval_gauntlet runs. Before merging, I need to rerun EleutherAI/gpt-neo-125m on the gauntlet without my changes, as well as the 7B models with and without my changes.

Before submitting

  • Have you read the contributor guidelines?
  • Is this change a documentation change or typo fix? If so, skip the rest of this checklist.
  • Was this change discussed/approved in a GitHub issue first? It is much more likely to be merged if so.
  • Did you update any related docs and document your change?
    No; some long comments explaining classes/functions (and thus their documentation) still need to be updated. In addition, we should properly add the right classes here.
  • Did you update any related tests and add any new tests related to your change? (see testing)
    Yes, but please pick them apart and push me to add more tests to ensure things behave as we expect.
  • Did you run the tests locally to make sure they pass?
  • Did you run pre-commit on your change? (see the pre-commit section of prerequisites)

MPT eval results:

Pre-refactor:
Run name: test-eval-refactor-no-refactor-with-new-logging-mpt-BOGPfN
Core Average: 0.334786

| model_name      |   core_average |   lm_task_average |   lite_average |   world_knowledge |   commonsense_reasoning |   language_understanding |   symbolic_problem_solving |   reading_comprehension |   world_knowledge_lm_task_subscore |   language_understanding_lm_task_subscore |   symbolic_problem_solving_lm_task_subscore |   reading_comprehension_lm_task_subscore |   world_knowledge_lite |   commonsense_reasoning_lite |   language_understanding_lite |   symbolic_problem_solving_lite |   reading_comprehension_lite |
|:----------------|---------------:|------------------:|---------------:|------------------:|------------------------:|-------------------------:|---------------------------:|------------------------:|-----------------------------------:|------------------------------------------:|--------------------------------------------:|-----------------------------------------:|-----------------------:|-----------------------------:|------------------------------:|--------------------------------:|-----------------------------:|
| mosaicml/mpt-7b |       0.334786 |          0.438081 |        0.47509 |          0.356657 |                0.385817 |                 0.383055 |                   0.163968 |                0.384431 |                            0.58974 |                                  0.386438 |                                    0.260531 |                                 0.515614 |               0.357162 |                     0.604679 |                      0.712279 |                        0.185713 |                     0.515614 |

Post-refactor:
Run name: test-eval-refactor-with-new-logging-mpt-m6LcxC
Core Average: 0.334729

| model_name      |   core_average |   lm_task_average |   lite_average |   world_knowledge |   commonsense_reasoning |   language_understanding |   symbolic_problem_solving |   reading_comprehension |   world_knowledge_lm_task_subscore |   language_understanding_lm_task_subscore |   symbolic_problem_solving_lm_task_subscore |   reading_comprehension_lm_task_subscore |   world_knowledge_lite |   commonsense_reasoning_lite |   language_understanding_lite |   symbolic_problem_solving_lite |   reading_comprehension_lite |
|:----------------|---------------:|------------------:|---------------:|------------------:|------------------------:|-------------------------:|---------------------------:|------------------------:|-----------------------------------:|------------------------------------------:|--------------------------------------------:|-----------------------------------------:|-----------------------:|-----------------------------:|------------------------------:|--------------------------------:|-----------------------------:|
| mosaicml/mpt-7b |       0.334729 |          0.437803 |       0.474966 |          0.356733 |                0.386108 |                 0.382913 |                   0.163876 |                0.384012 |                            0.58974 |                                  0.386341 |                                    0.260657 |                                 0.514473 |               0.357731 |                     0.604679 |                      0.712215 |                        0.185731 |                     0.514473 |

Llama2 eval results:

Pre-refactor:
Run name: test-eval-refactor-no-refactor-with-logging-llama2-T3PGVn
Core Average: 0.404405

| model_name               |   core_average |   lm_task_average |   lite_average |   world_knowledge |   commonsense_reasoning |   language_understanding |   symbolic_problem_solving |   reading_comprehension |   world_knowledge_lm_task_subscore |   language_understanding_lm_task_subscore |   symbolic_problem_solving_lm_task_subscore |   reading_comprehension_lm_task_subscore |   world_knowledge_lite |   commonsense_reasoning_lite |   language_understanding_lite |   symbolic_problem_solving_lite |   reading_comprehension_lite |
|:-------------------------|---------------:|------------------:|---------------:|------------------:|------------------------:|-------------------------:|---------------------------:|------------------------:|-----------------------------------:|------------------------------------------:|--------------------------------------------:|-----------------------------------------:|-----------------------:|-----------------------------:|------------------------------:|--------------------------------:|-----------------------------:|
| meta-llama/Llama-2-7b-hf |       0.404405 |          0.502264 |       0.525717 |          0.453943 |                0.417293 |                 0.457704 |                   0.220997 |                0.472087 |                           0.632564 |                                  0.426062 |                                    0.331144 |                                 0.619284 |               0.438208 |                     0.594342 |                      0.713677 |                        0.263072 |                     0.619284 |

Post-refactor:
NOTE: the run name was mislabeled; this run is with the refactor (confirmable via mcli describe)
Run name: test-eval-refactor-no-refactor-with-new-logging-llama2-xG8WHF
Core Average: 0.40438

| model_name               |   core_average |   lm_task_average |   lite_average |   world_knowledge |   commonsense_reasoning |   language_understanding |   symbolic_problem_solving |   reading_comprehension |   world_knowledge_lm_task_subscore |   language_understanding_lm_task_subscore |   symbolic_problem_solving_lm_task_subscore |   reading_comprehension_lm_task_subscore |   world_knowledge_lite |   commonsense_reasoning_lite |   language_understanding_lite |   symbolic_problem_solving_lite |   reading_comprehension_lite |
|:-------------------------|---------------:|------------------:|---------------:|------------------:|------------------------:|-------------------------:|---------------------------:|------------------------:|-----------------------------------:|------------------------------------------:|--------------------------------------------:|-----------------------------------------:|-----------------------:|-----------------------------:|------------------------------:|--------------------------------:|-----------------------------:|
| meta-llama/Llama-2-7b-hf |        0.40438 |          0.502263 |       0.525621 |           0.45366 |                0.417738 |                 0.457752 |                   0.220765 |                0.471982 |                           0.632564 |                                  0.425965 |                                    0.331144 |                                 0.619379 |               0.437639 |                     0.594342 |                      0.713613 |                         0.26313 |                     0.619379 |
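
As a quick arithmetic check on the claim that the refactor is behavior-preserving: the pre/post core averages differ by 0.000057 (MPT-7B) and 0.000025 (Llama-2-7b), well within run-to-run eval noise. A throwaway check over the numbers in the tables above:

```python
# Core averages copied from the tables above (pre-refactor, post-refactor).
results = {
    'mosaicml/mpt-7b': (0.334786, 0.334729),
    'meta-llama/Llama-2-7b-hf': (0.404405, 0.404380),
}
for model, (pre, post) in results.items():
    assert abs(pre - post) < 1e-3, model  # agreement to ~1e-4
```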

@maxisawesome requested a review from a team as a code owner on November 14, 2023 18:15
@eracah (Contributor) commented on Nov 14, 2023

Oooh hoo hoo. I love a good refactor

@maxisawesome mentioned this pull request Nov 15, 2023
@dakinggg mentioned this pull request Nov 15, 2023
@dakinggg (Contributor) left a comment:

Left first round of comments. Thank you for taking this on!

(19 review threads on composer/datasets/in_context_learning_evaluation.py, all marked outdated/resolved)
@dakinggg (Contributor) left a comment:

🚢

@maxisawesome merged commit 39eb817 into mosaicml:dev on Jan 26, 2024
16 checks passed
ShashankMosaicML pushed a commit to ShashankMosaicML/composer that referenced this pull request Feb 3, 2024
* extremely wip commit w/ ICLdataset class

* more extremely broken wip

* add split keys

* first pass at moving QA to new format

* linting

* linting

* tests pass!

* fix repeated defaults, gold_idx --> gold

* basic HF parsing but test not passing

* fix cot. wip

* del device and world_size from tests

* change to .map

* fix schema

* tests passing w/ collate refactor

* finish HF tests

* add hf batch parsing

* linting

* add doc strings, rm hf_parsing_vars

* revert question_prelimiter back to prelimiter

* fix tests

* add more docstrings

* add doc strings, fix hf w/ categories

* add doc strings and default check

* linting

* add temperature

* remove need for hf:// on hf links

* Update composer/datasets/in_context_learning_evaluation.py

Co-authored-by: Daniel King <[email protected]>

* Update composer/datasets/in_context_learning_evaluation.py

Co-authored-by: Daniel King <[email protected]>

* Update composer/datasets/in_context_learning_evaluation.py

Co-authored-by: Daniel King <[email protected]>

* Update composer/datasets/in_context_learning_evaluation.py

Co-authored-by: Daniel King <[email protected]>

* Update composer/datasets/in_context_learning_evaluation.py

Co-authored-by: Daniel King <[email protected]>

* Update composer/datasets/in_context_learning_evaluation.py

Co-authored-by: Daniel King <[email protected]>

* fix comments, add test for check hf uri, still wip

* add gpu tests back

* update fix_eos_on_preamble

* update comments

* add return types

* typing, comments

* init RAG Generation task

* init _construct_context for RAG eval

* fix context key, move hf test dataset, few docstrings

* fix docstrings, add second path for schema

* init collate_fn,  _tokenize_example functions (bug exists)

* fix typo in warning error

* remove canonical_solution from batch

* missed one canonical_sllution

* remove encoded dataset to have just one dataset var

* rename sample to example

* improve comment

* edit RAGtask

* rm hf parsing func

* fix docstring, rename fewshot fun

* docstring

* change default split_batch to check types

* remove need to set split_keys

* doc string update

* improve comments

* rm stacked_keys for tokenize_labels bool

* initial wip in comments

* make _conv_tokens_to_tensors func

* wip - sketch out batch_mappings

* linting and debugging statements to help me remember where I'm doing wip

* all tests except one sus schema test passing

* fix missing fewshot for schema

* rm temperature add generation_kwargs

* add defaults that are currently set in llm-foundry builders.py

* fix defaults in tests, add some comments

* tests wip

* tests for new funcs

* rm RAG task

* more docstring

* tests passing

* wip

* wip

* add dict to data_spec

* Update composer/datasets/in_context_learning_evaluation.py

Co-authored-by: Daniel King <[email protected]>

* Update composer/datasets/in_context_learning_evaluation.py

Co-authored-by: Daniel King <[email protected]>

* Apply suggestions from code review

comment improvements

Co-authored-by: Daniel King <[email protected]>

* default_batch to base_batch and some docstrings

* update comments and fix test. move spacing to default get_answer

* improved docstrings

* finish schema/mc tests

* address pr review comments

* lintign

* fixing import, add type

* update comments

* update keys

* add typechecks for token ids

* rm outdated test

* fix tests

* add microbatch test

* pyright fixes

* linting attempts

* linting wip

* fix linting

* add early stopping and do_normalization documentation

* fix linting

* fix linting

* fix final dist test issue

* fix isort

* fix linting

* fix docstrings

* fix docstrings

* add warning filters

* fix warnings

* Update composer/datasets/in_context_learning_evaluation.py

fix spelling

Co-authored-by: Daniel King <[email protected]>

* Update composer/datasets/in_context_learning_evaluation.py

fix spelling

Co-authored-by: Daniel King <[email protected]>

* Update composer/datasets/in_context_learning_evaluation.py

fix spelling

Co-authored-by: Daniel King <[email protected]>

* Update composer/datasets/in_context_learning_evaluation.py

fix spelling

Co-authored-by: Daniel King <[email protected]>

* Update composer/datasets/in_context_learning_evaluation.py

fix spelling

Co-authored-by: Daniel King <[email protected]>

* Update composer/datasets/in_context_learning_evaluation.py

fix spelling

Co-authored-by: Daniel King <[email protected]>

* add capitalization

* revert default changes

* change update_generate_kwargs to public

* fix type

* move pad_tok_id error

---------

Co-authored-by: Daniel King <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Eitan Turok <[email protected]>
ShashankMosaicML pushed a commit to ShashankMosaicML/composer that referenced this pull request Feb 6, 2024