Adding more token encoding types #1254
Conversation
Looks good other than the default value.
LGTM. Suggest keeping the previous default for now.
looks good
lgtm. I think it would be good to manually test at least one of the scripts: maybe do a training run using the new version of convert_text_to_mds and make sure it still trains? And run the regression test suite to make sure existing datasets didn't get broken.
Thanks @snarayan21! And thanks for leaving the documentation better than you found it :)
* add more token encoing types
* add more token encoing types
* add tests
* add tests
* ft support, tests
* linting is shortening my lifespan
* linting is shortening my lifespan
* long tensor
* long tensor
* long tensor
* feedbacc
* import
* import

Co-authored-by: Daniel King <[email protected]>
(cherry picked from commit 42c2d9a)
Depending on the vocab size, users can encode their token IDs using various int formats. Previously, we only allowed int64, which covers an absurdly high vocab size. Enabling encoding/decoding of tokens as uint32 or uint16, for example, lets people save space on their datasets, since the max vocab sizes supported would be ~4.3 billion with uint32 or ~65k with uint16. This has been added to both the text and finetuning datasets.
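To make the space savings concrete, here is a minimal numpy sketch (not code from this PR; the vocab size and sequence length are made up):

```python
import numpy as np

# Token IDs for a ~50k vocab (e.g. GPT-2's 50257 tokens).
token_ids = np.random.randint(0, 50257, size=2048)

# int64 (the previous only option) spends 8 bytes per token ID.
as_int64 = token_ids.astype(np.int64)
# uint32 halves that and still supports vocabs up to 2**32 - 1 (~4.3 billion).
as_uint32 = token_ids.astype(np.uint32)
# uint16 is 4x smaller but caps the vocab at 2**16 - 1 (~65k).
as_uint16 = token_ids.astype(np.uint16)

print(as_int64.nbytes, as_uint32.nbytes, as_uint16.nbytes)
# 16384 8192 4096 -> a 2x / 4x reduction in on-disk token storage
```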
This PR also lets users specify their MDS dataset columns using `ndarray` types to enable automatic encoding/decoding of samples. This was already present for the finetuning dataset, so the functionality has been added for the generic text dataset. Accordingly, I've changed the default value in our MDS conversion scripts to `ndarray:uint32` instead of `bytes` and made the relevant changes to get this working; see the sketch after this paragraph.

Added unit tests checking that this works for text and finetuning datasets and that an error is thrown for incompatible encoding types.
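For context, here is roughly what the column change looks like when writing and reading a dataset with the `streaming` library (a minimal sketch assuming a local path and toy samples; none of this is code from the PR):

```python
import numpy as np
from streaming import MDSWriter, StreamingDataset

# Previously the default text column was raw bytes:
#   columns = {'tokens': 'bytes'}
# With an ndarray column type, streaming encodes/decodes samples for us:
columns = {'tokens': 'ndarray:uint32'}

with MDSWriter(out='/tmp/mds-out', columns=columns) as writer:
    for toks in ([1, 2, 3], [4, 5, 6]):  # stand-in tokenized samples
        writer.write({'tokens': np.asarray(toks, dtype=np.uint32)})

# Reading back yields uint32 ndarrays directly, with no manual
# np.frombuffer(...) decoding step as the bytes column required.
dataset = StreamingDataset(local='/tmp/mds-out', shuffle=False)
print(dataset[0]['tokens'].dtype)  # uint32
```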
Moved a util function that applied to both the text and finetuning dataloaders to a common location for import; it had previously been written twice with the same functionality.
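The PR doesn't name the shared helper here, but given the description (both dataloaders decode stored tokens into long tensors), it presumably looks something like this hypothetical sketch (the function name and signature are assumptions, not the PR's actual code):

```python
# Hypothetical shared util; the real name and location may differ.
import numpy as np
import torch

def tokens_to_long_tensor(tokens, token_encoding_type=np.int64) -> torch.Tensor:
    """Decode stored tokens (raw bytes or an ndarray) into an int64 tensor."""
    if isinstance(tokens, bytes):
        # bytes columns need an explicit decode with the stored dtype.
        tokens = np.frombuffer(tokens, dtype=token_encoding_type)
    # Models expect int64 token IDs regardless of the on-disk dtype,
    # so upcast before handing the tensor to the dataloader.
    return torch.from_numpy(tokens.astype(np.int64))
```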
Ran the following scripts successfully: `convert_dataset_hf.py`, `convert_dataset_json.py`, `convert_finetuning_dataset.py`, and `convert_text_to_mds.py`. Updated the data prep README to include instructions for `convert_text_to_mds.py`.
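For reference, an invocation of that script looks roughly like the following (a sketch only; the flag names are my assumption based on the data prep README and may differ, so check `--help` before running):

```bash
python scripts/data_prep/convert_text_to_mds.py \
  --output_folder ./my-copy-mds \
  --input_folder ./my-text-files \
  --concat_tokens 2048 \
  --tokenizer EleutherAI/gpt-neox-20b \
  --eos_text '<|endoftext|>'
```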
Using the Shakespeare text file here, models trained with and without this branch have deterministic loss curves. One set of runs used global batch size 32, the other global batch size 256; see the wandb project here. Foundry regression tests are partially borked right now because of a small bug that's being addressed in the release branch, but the tests that did run all succeeded. See here.