Pretrain multipack v2 #1470

winglian · 2024-03-31T19:06:01Z

This PR enables more traditional pretraining: datasets are concatenated, without handling cross attention between the packed sequences. This helps fill the context when a sample doesn't come close to using the configured context length.

The multipack buffer size is how far ahead to look when trying to pack samples together for pretraining/streaming datasets. We pull in N samples and attempt to pack those optimally, since we can't pack the entire dataset: these datasets are usually too large to pre-process.
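To illustrate the buffered look-ahead, here is a minimal sketch, not the code in this PR. It assumes samples are already tokenized into lists of token IDs, uses a simple first-fit-decreasing heuristic as a stand-in for whatever packing strategy is actually used, and truncates over-long samples. The names `pack_buffer`, `stream_packed`, `max_len`, and `buffer_size` are hypothetical.

```python
from typing import Iterable, Iterator, List


def pack_buffer(samples: List[List[int]], max_len: int) -> List[List[int]]:
    """Pack one buffer of tokenized samples using first-fit-decreasing."""
    bins: List[List[int]] = []
    # Place the longest samples first; short ones then fill the gaps.
    for sample in sorted(samples, key=len, reverse=True):
        sample = sample[:max_len]  # assumption: truncate over-long samples
        for packed in bins:
            if len(packed) + len(sample) <= max_len:
                packed.extend(sample)  # concatenate into an existing sequence
                break
        else:
            bins.append(list(sample))  # no room anywhere: start a new sequence
    return bins


def stream_packed(
    stream: Iterable[List[int]], max_len: int, buffer_size: int
) -> Iterator[List[int]]:
    """Read buffer_size samples at a time and pack only within that window."""
    buffer: List[List[int]] = []
    for sample in stream:
        buffer.append(sample)
        if len(buffer) == buffer_size:
            yield from pack_buffer(buffer, max_len)
            buffer = []
    if buffer:  # pack whatever is left at the end of the stream
        yield from pack_buffer(buffer, max_len)


if __name__ == "__main__":
    samples = [[0] * n for n in (3, 5, 2, 4, 1, 6)]
    for size in (3, 6):
        lengths = [len(s) for s in stream_packed(samples, max_len=8, buffer_size=size)]
        print(size, lengths)
```

In this toy run, a buffer of 3 produces four packed sequences while a buffer of 6 produces three, which shows the trade-off: a larger buffer gives the packer more candidates to fit together, at the cost of holding more samples in memory before anything is emitted.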

ehartford · 2024-03-31T19:14:07Z

Thank you for the attention to pretraining.

winglian added 2 commits March 31, 2024 12:57

multipack attention options for pretraining

595b2b1

include zstandard needed for some pretraining datasets

40496a8

winglian merged commit 5aa5097 into main Apr 2, 2024
7 checks passed

winglian deleted the pretrain-multipack-v2 branch April 2, 2024 12:42

djsaunde pushed a commit that referenced this pull request Dec 17, 2024

Pretrain multipack v2 (#1470)

c3df4a8
