Fix pretraining with iterable/streaming Dataset #556
Conversation
What would happen if max_steps is set to a really large number? Would it default to only running through the entire dataset? I'm a bit curious how to determine this value.
No, it would run max_steps (and determine the number of epochs accordingly), see https://github.com/huggingface/transformers/blob/6acc27eea853885270dba5313181443d43e31f2c/src/transformers/trainer.py#L1605 . For IterableDatasets, max_steps is necessary (see https://github.com/huggingface/transformers/blob/6acc27eea853885270dba5313181443d43e31f2c/src/transformers/trainer.py#L574 ).
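For reference, a minimal sketch of that requirement (the data file and argument values below are placeholders, not this repo's code): with `streaming=True` the dataset is an `IterableDataset` with no `__len__`, so the Trainer can't infer steps per epoch and `max_steps` must be set explicitly.

```python
from datasets import load_dataset
from transformers import TrainingArguments

# Streaming returns an IterableDataset, which has no __len__.
streamed = load_dataset("json", data_files="corpus.jsonl", split="train", streaming=True)

# Without max_steps > 0 the Trainer errors out, since it can't derive the
# number of optimization steps from an unsized dataset.
args = TrainingArguments(
    output_dir="out",
    max_steps=10_000,                 # defines the total training length
    per_device_train_batch_size=4,
)
```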
Does this PR fix fine-tuning with a raw corpus?
lgtm. thanks!
* return without packing prep/len
* fix remove columns
* fix encode arguments
* add error when max steps not set
* fix test

Co-authored-by: Jan Philipp Harries <[email protected]>
Due to various changes, `pretraining_dataset` no longer worked; this should fix it. I'm using it without problems with a streaming dataset (works for both local and remote data).
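For anyone following along, a rough sketch of tokenizing a streaming raw-text corpus on the fly (file name, tokenizer, and column name are placeholders, not the PR's actual code):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Works for a local path or a remote URL; streaming avoids downloading the
# whole corpus up front.
raw = load_dataset("json", data_files="corpus.jsonl", split="train", streaming=True)

def encode(batch):
    # Assumes the raw corpus has a "text" column.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

# Drop the original text column so only token ids reach the collator.
tokenized = raw.map(encode, batched=True, remove_columns=["text"])
```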