Collate missing features #1096
For instance, pile-cc with input_ids, labels, attention_mask, position_ids, and length is ~1.2TB, and ~250GB with only input_ids and length.
Is there another part that goes with this to optionally have the tokenization step be a bit more sparse for this feature?
collate_missing_features:
# - labels # copy input_ids
# - position_ids # [0, 1, 2, ..., len(input_ids)-1]
# - attention_mask [1, 1, 1, ...] with same length as input_ids
Suggested change:
- # - attention_mask [1, 1, 1, ...] with same length as input_ids
+ # - attention_mask # [1, 1, 1, ...] with same length as input_ids
Looks good so far. I think a unit test on
I can write something that makes the tokenization step a bit more sparse. How about when all datasets are of type
Currently the BatchSamplerDataCollatorForSeq2Seq typically requires input_ids, labels, attention_mask, and position_ids. For certain use cases (e.g., datasets of type completion), the latter 3 are straightforwardly derived from input_ids. Saving datasets with all 4 features to disk can therefore be redundant, and generally requires a lot more disk space (up to 5x).

The changes here implement a MissingFeaturesCollator together with the collate_missing_features config option. If labels, attention_mask, or position_ids are missing from the dataset, it replaces them with the following defaults:

- labels: a copy of input_ids
- position_ids: [0, 1, 2, ..., len(input_ids)-1]
- attention_mask: [1, 1, 1, ...] with the same length as input_ids

I've observed that eagerly filling these missing features does not lead to longer run-times.