Better defaults for StreamingDataset subclasses #723
Overall looks good to me. Do we need to change the defaults in `llm-foundry/llmfoundry/data/denoising.py` (line 473 in 84c86e3)?
`dataset = StreamingTextDataset(`
`dataset = dataset_constructor.build_from_streaming(`
oop yes we do. Changing that now.
Can you please also update the `epoch_size` data type to `Optional[Union[int, str]]`, which is in line with the streaming args? Thanks!
There are some more to change in `denoising.py`. Will approve after that, lgtm otherwise.
@snarayan21 please post a PSA in the research slack once you merge this. I'll also include a PSA in the next release notes, but want to make sure research knows that the defaults here are changing.
LGTM. Thank You!
With better defaults and new features being added to Streaming, this PR makes sure those improvements are reflected in llm-foundry. Specifically:
- `shuffle_algo` changed to `py1e`: More balanced downloads and better cache limit performance.
- `partition_algo` changed to `relaxed`: Elastic determinism and resumption on many more node counts.
- `predownload` changed to `None`: Lower predownload for more balanced download demand. Will be set by `StreamingDataset`.
- `num_canonical_nodes` changed to `None`: Now set to equal the number of physical nodes, unless using the `py1s` or `py2s` shuffle algorithms, in which case it will equal 64 * physical nodes.
- `shuffle_block_size` changed to `None`: Now set based on `num_canonical_nodes` to give consistently good shuffle quality without needing a crazy amount of downloads.

Defaults were changed in Streaming v0.7.0 with this commit: mosaicml/streaming@93bf054
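For reviewers who want the changes at a glance, here is a rough sketch of the defaults described above, plus the `num_canonical_nodes` rule as stated in this description. `NEW_DEFAULTS` and `resolve_num_canonical_nodes` are hypothetical names used only for illustration; they are not part of llm-foundry or Streaming, and the real resolution happens inside `StreamingDataset`.

```python
from typing import Optional

# Defaults described in this PR. None means "let StreamingDataset choose".
NEW_DEFAULTS = {
    'shuffle_algo': 'py1e',
    'partition_algo': 'relaxed',
    'predownload': None,
    'num_canonical_nodes': None,
    'shuffle_block_size': None,
}


def resolve_num_canonical_nodes(
    num_canonical_nodes: Optional[int],
    shuffle_algo: str,
    physical_nodes: int,
) -> int:
    """Hypothetical helper mirroring the rule in the PR description.

    This is an illustration of the stated behavior, not Streaming's code.
    """
    # An explicit user value always wins over the computed default.
    if num_canonical_nodes is not None:
        return num_canonical_nodes
    # py1s and py2s get 64 canonical nodes per physical node.
    if shuffle_algo in ('py1s', 'py2s'):
        return 64 * physical_nodes
    # Otherwise the default equals the number of physical nodes.
    return physical_nodes
```

With the new default `shuffle_algo` of `py1e`, a run on 4 physical nodes would resolve to 4 canonical nodes under this rule, while the same run with `py1s` would resolve to 256.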