Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding more token encoding types #1254

Merged
merged 16 commits into from
Jun 6, 2024
Merged

Adding more token encoding types #1254

merged 16 commits into from
Jun 6, 2024

Conversation

snarayan21
Copy link
Contributor

@snarayan21 snarayan21 commented Jun 5, 2024

Depending on the vocab size, users can encode their token IDs using various int formats. Previously, we only allowed for int64, which covers an absurdly high vocab size. Enabling decoding tokens in uint32 or uint16, for example, would let people save space on their datasets since the max vocab sizes supported would be ~4 million with uint32, or ~65k with uint16. This has been added to both the text and finetuning datasets.

This PR also lets users specify their MDS dataset columns using ndarray types to enable automatically encoding/decoding samples. This was already present for finetuning dataset, so the functionality has been added for the generic text dataset. Accordingly, I've changed the default value in our MDS conversion scripts to use ndarray:uint32 instead of bytes and made relevant changes to get this working.

Added unit tests checking that this works for text and finetuning datasets and that an error is thrown for uncompatible encoding types.

Moved a util function that was applicable to both the text and finetuning dataloaders to a common location to import. This had been written twice to perform the same functionality.

Ran the following scripts successfully: convert_dataset_hf.py, convert_dataset_json.py, convert_finetuning_dataset.py, convert_text_to_mds.py. Updated data prep readme to have instructions for convert_text_to_mds.py.

Using the shakespeare text file here, models trained with/without this branch have deterministic loss curves. One set of runs was with global batch size 32, the other with global batch size 256. See wandb project here.

Foundry regression tests are partially borked right now because of a small bug that's getting addressed in the release branch, but the tests that did run all succeeded. See here.

@snarayan21 snarayan21 requested a review from a team as a code owner June 5, 2024 17:36
Copy link
Contributor

@codestar12 codestar12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good other then default value

Copy link
Contributor

@XiaohanZhangCMU XiaohanZhangCMU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Suggest keeping the previous default for now.

Copy link
Contributor

@codestar12 codestar12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good

@snarayan21 snarayan21 requested review from codestar12 and irenedea June 5, 2024 19:57
llmfoundry/data/data.py Show resolved Hide resolved
llmfoundry/data/finetuning/tasks.py Outdated Show resolved Hide resolved
llmfoundry/data/finetuning/tasks.py Outdated Show resolved Hide resolved
llmfoundry/data/text_data.py Outdated Show resolved Hide resolved
llmfoundry/data/text_data.py Outdated Show resolved Hide resolved
tests/data/test_data_encodings.py Outdated Show resolved Hide resolved
tests/data/test_data_encodings.py Show resolved Hide resolved
@snarayan21 snarayan21 requested a review from dakinggg June 6, 2024 02:55
Copy link
Collaborator

@dakinggg dakinggg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, I think it would be good to manual test at least one of the scripts? maybe do a training run using the new version of convert text to mds and make sure it still trains? and run the regression test suite to make sure existing datasets didn't get broken.

llmfoundry/data/finetuning/tasks.py Outdated Show resolved Hide resolved
llmfoundry/data/text_data.py Outdated Show resolved Hide resolved
llmfoundry/data/text_data.py Show resolved Hide resolved
scripts/data_prep/convert_dataset_json.py Outdated Show resolved Hide resolved
tests/a_scripts/data_prep/test_convert_text_to_mds.py Outdated Show resolved Hide resolved
@snarayan21
Copy link
Contributor Author

Ran the following scripts successfully: convert_dataset_hf.py, convert_dataset_json.py, convert_finetuning_dataset.py, convert_text_to_mds.py. Updated data prep readme to have instructions for convert_text_to_mds.py.

Using the shakespeare text file here, models trained with/without this branch have deterministic loss curves. One set of runs was with global batch size 32, the other with global batch size 256. See wandb project here.

Foundry regression tests are partially borked right now because of a small bug that's getting addressed in the release branch, but the tests that did run all succeeded. See here.

@snarayan21 snarayan21 requested a review from dakinggg June 6, 2024 18:51
Copy link
Collaborator

@dakinggg dakinggg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @snarayan21! And thanks for leaving the documentation better than you found it :)

@snarayan21 snarayan21 merged commit 42c2d9a into main Jun 6, 2024
9 checks passed
KuuCi pushed a commit that referenced this pull request Jun 7, 2024
* add more token encoing types

* add more token encoing types

* add tests

* add tests

* ft support, tests

* linting is shortening my lifespan

* linting is shortening my lifespan

* long tensor

* long tensor

* long tensor

* feedbacc

* import

* import

---------

Co-authored-by: Daniel King <[email protected]>
(cherry picked from commit 42c2d9a)
KuuCi pushed a commit that referenced this pull request Jun 7, 2024
* add more token encoing types

* add more token encoing types

* add tests

* add tests

* ft support, tests

* linting is shortening my lifespan

* linting is shortening my lifespan

* long tensor

* long tensor

* long tensor

* feedbacc

* import

* import

---------

Co-authored-by: Daniel King <[email protected]>
KuuCi pushed a commit that referenced this pull request Jun 7, 2024
* add more token encoing types

* add more token encoing types

* add tests

* add tests

* ft support, tests

* linting is shortening my lifespan

* linting is shortening my lifespan

* long tensor

* long tensor

* long tensor

* feedbacc

* import

* import

---------

Co-authored-by: Daniel King <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants