Refactor in_context_learning_evaluation.py (mosaicml#2713)
* extremely wip commit w/ ICLdataset class

* more extremely broken wip

* add split keys

* first pass at moving QA to new format

* linting

* linting

* tests pass!

* fix repeated defaults, gold_idx --> gold

* basic HF parsing but test not passing

* fix cot. wip

* del device and world_size from tests

* change to .map

* fix schema

* tests passing w/ collate refactor

* finish HF tests

* add hf batch parsing

* linting

* add doc strings, rm hf_parsing_vars

* revert question_prelimiter back to prelimiter

* fix tests

* add more docstrings

* add doc strings, fix hf w/ categories

* add doc strings and default check

* linting

* add temperature

* remove need for hf:// on hf links

* Update composer/datasets/in_context_learning_evaluation.py

Co-authored-by: Daniel King <[email protected]>

* Update composer/datasets/in_context_learning_evaluation.py

Co-authored-by: Daniel King <[email protected]>

* Update composer/datasets/in_context_learning_evaluation.py

Co-authored-by: Daniel King <[email protected]>

* Update composer/datasets/in_context_learning_evaluation.py

Co-authored-by: Daniel King <[email protected]>

* Update composer/datasets/in_context_learning_evaluation.py

Co-authored-by: Daniel King <[email protected]>

* Update composer/datasets/in_context_learning_evaluation.py

Co-authored-by: Daniel King <[email protected]>

* fix comments, add test for check hf uri, still wip

* add gpu tests back

* update fix_eos_on_preamble

* update comments

* add return types

* typing, comments

* init RAG Generation task

* init _construct_context for RAG eval

* fix context key, move hf test dataset, few docstrings

* fix docstrings, add second path for schema

* init collate_fn,  _tokenize_example functions (bug exists)

* fix typo in warning error

* remove canonical_solution from batch

* missed one canonical_solution

* remove encoded dataset to have just one dataset var

* rename sample to example

* improve comment

* edit RAGtask

* rm hf parsing func

* fix docstring, rename fewshot fun

* docstring

* change default split_batch to check types

* remove need to set split_keys

* doc string update

* improve comments

* rm stacked_keys for tokenize_labels bool

* initial wip in comments

* make _conv_tokens_to_tensors func

* wip - sketch out batch_mappings

* linting and debugging statements to help me remember where I'm doing wip

* all tests except one sus schema test passing

* fix missing fewshot for schema

* rm temperature add generation_kwargs

* add defaults that are currently set in llm-foundry builders.py

* fix defaults in tests, add some comments

* tests wip

* tests for new funcs

* rm RAG task

* more docstring

* tests passing

* wip

* wip

* add dict to data_spec

* Update composer/datasets/in_context_learning_evaluation.py

Co-authored-by: Daniel King <[email protected]>

* Update composer/datasets/in_context_learning_evaluation.py

Co-authored-by: Daniel King <[email protected]>

* Apply suggestions from code review

comment improvements

Co-authored-by: Daniel King <[email protected]>

* default_batch to base_batch and some docstrings

* update comments and fix test. move spacing to default get_answer

* improved docstrings

* finish schema/mc tests

* address pr review comments

* linting

* fixing import, add type

* update comments

* update keys

* add typechecks for token ids

* rm outdated test

* fix tests

* add microbatch test

* pyright fixes

* linting attempts

* linting wip

* fix linting

* add early stopping and do_normalization documentation

* fix linting

* fix linting

* fix final dist test issue

* fix isort

* fix linting

* fix docstrings

* fix docstrings

* add warning filters

* fix warnings

* Update composer/datasets/in_context_learning_evaluation.py

fix spelling

Co-authored-by: Daniel King <[email protected]>

* Update composer/datasets/in_context_learning_evaluation.py

fix spelling

Co-authored-by: Daniel King <[email protected]>

* Update composer/datasets/in_context_learning_evaluation.py

fix spelling

Co-authored-by: Daniel King <[email protected]>

* Update composer/datasets/in_context_learning_evaluation.py

fix spelling

Co-authored-by: Daniel King <[email protected]>

* Update composer/datasets/in_context_learning_evaluation.py

fix spelling

Co-authored-by: Daniel King <[email protected]>

* Update composer/datasets/in_context_learning_evaluation.py

fix spelling

Co-authored-by: Daniel King <[email protected]>

* add capitalization

* revert default changes

* change update_generate_kwargs to public

* fix type

* move pad_tok_id error

---------

Co-authored-by: Daniel King <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Eitan Turok <[email protected]>
4 people authored and ShashankMosaicML committed Feb 3, 2024
1 parent b8df31c commit d0f0be9
Showing 4 changed files with 2,415 additions and 1,246 deletions.
11 changes: 11 additions & 0 deletions composer/datasets/__init__.py
@@ -11,6 +11,11 @@
                                      build_streaming_cifar10_dataloader, build_synthetic_cifar10_dataloader)
 from composer.datasets.imagenet import (build_ffcv_imagenet_dataloader, build_imagenet_dataloader,
                                         build_streaming_imagenet1k_dataloader, build_synthetic_imagenet_dataloader)
+from composer.datasets.in_context_learning_evaluation import (InContextLearningCodeEvalDataset,
+                                                              InContextLearningDataset, InContextLearningLMTaskDataset,
+                                                              InContextLearningMultipleChoiceTaskDataset,
+                                                              InContextLearningQATaskDataset,
+                                                              InContextLearningSchemaTaskDataset)
 from composer.datasets.lm_dataset import build_lm_dataloader
 from composer.datasets.mnist import build_mnist_dataloader, build_synthetic_mnist_dataloader
 from composer.datasets.synthetic import (SyntheticBatchPairDataset, SyntheticDataLabelType, SyntheticDataType,
@@ -24,6 +29,12 @@
     'SyntheticDataLabelType',
     'SyntheticDataType',
     'SyntheticPILDataset',
+    'InContextLearningDataset',
+    'InContextLearningQATaskDataset',
+    'InContextLearningLMTaskDataset',
+    'InContextLearningCodeEvalDataset',
+    'InContextLearningMultipleChoiceTaskDataset',
+    'InContextLearningSchemaTaskDataset',
     'build_ade20k_dataloader',
     'build_streaming_ade20k_dataloader',
     'build_streaming_c4_dataloader',
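
The change to `composer/datasets/__init__.py` re-exports the six in-context-learning dataset classes at the package level, so they should be importable as `from composer.datasets import InContextLearningDataset` and so on. A minimal sketch of how one might check this against an installed build follows. The helper `missing_exports` and the list `NEW_EXPORTS` are hypothetical names introduced only for illustration; they are not part of Composer.

```python
import importlib

# The six names this commit adds to composer/datasets/__init__.py
# (copied from the diff; the list itself is an illustration, not Composer code).
NEW_EXPORTS = [
    'InContextLearningDataset',
    'InContextLearningQATaskDataset',
    'InContextLearningLMTaskDataset',
    'InContextLearningCodeEvalDataset',
    'InContextLearningMultipleChoiceTaskDataset',
    'InContextLearningSchemaTaskDataset',
]


def missing_exports(module_name='composer.datasets'):
    """Hypothetical helper: list the NEW_EXPORTS names the module does not expose.

    Returns None when the module cannot be imported at all
    (for example, when Composer is not installed).
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return [name for name in NEW_EXPORTS if not hasattr(module, name)]
```

Calling `missing_exports()` on an environment that includes this commit would be expected to return an empty list; on an older build it would list whichever names are not yet exported.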