
Collate missing features #1096

Open · wants to merge 1 commit into base: main
Conversation

@RicardoDominguez (Contributor) commented Jan 11, 2024

Currently, BatchSamplerDataCollatorForSeq2Seq typically requires input_ids, labels, attention_mask, and position_ids. For certain use cases (e.g., datasets of type completion), the latter three are straightforwardly derived from input_ids. Saving datasets to disk with all four features can therefore be redundant, and generally requires much more disk space (up to 5x).

The changes here implement a MissingFeaturesCollator together with the collate_missing_features config option. If labels, attention_mask, or position_ids are missing from the dataset, it replaces them with the following defaults:

  • labels: copy input_ids
  • position_ids: [0, 1, 2, ..., len(input_ids)-1]
  • attention_mask: [1, 1, 1, ...] with same length as input_ids

I've observed that eagerly filling these missing features does not lead to longer run times.
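A minimal sketch of the default-filling behavior described above (the function name `fill_missing_features` is illustrative, not the PR's actual API):

```python
def fill_missing_features(example):
    """Fill labels, position_ids, and attention_mask when only input_ids is stored."""
    input_ids = example["input_ids"]
    n = len(input_ids)
    # labels: copy input_ids
    example.setdefault("labels", list(input_ids))
    # position_ids: [0, 1, 2, ..., n-1]
    example.setdefault("position_ids", list(range(n)))
    # attention_mask: [1, 1, 1, ...] with the same length as input_ids
    example.setdefault("attention_mask", [1] * n)
    return example
```

Since `setdefault` is a no-op for keys already present, datasets that do store all four features pass through unchanged.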

@RicardoDominguez (Contributor, Author) commented Jan 11, 2024

For instance, pile-cc with input_ids, labels, attention_mask, position_ids, and length is ~1.2TB, and ~250GB with only input_ids and length.

@winglian (Collaborator)

Is there another part that goes with this to optionally have the tokenization step be a bit more sparse for this feature?

collate_missing_features:
# - labels # copy input_ids
# - position_ids # [0, 1, 2, ..., len(input_ids)-1]
# - attention_mask [1, 1, 1, ...] with same length as input_ids

Suggested change:
- # - attention_mask [1, 1, 1, ...] with same length as input_ids
+ # - attention_mask # [1, 1, 1, ...] with same length as input_ids

@winglian (Collaborator)

Looks good so far. I think a unit test on build_collator could be helpful to validate the collator that is returned based on various options, and then some sort of unit test to validate the functionality of MissingFeaturesCollator would be ideal too.
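A unit test along the lines suggested might look like the sketch below. Only the class name MissingFeaturesCollator comes from the PR; the body here is an illustrative stand-in that fills the defaults and would, in the real implementation, delegate to a wrapped base collator:

```python
class MissingFeaturesCollator:
    # Illustrative stand-in: fill defaults for absent features, then the real
    # implementation would hand the batch to the wrapped base collator (omitted).
    def __call__(self, features):
        for f in features:
            n = len(f["input_ids"])
            f.setdefault("labels", list(f["input_ids"]))
            f.setdefault("position_ids", list(range(n)))
            f.setdefault("attention_mask", [1] * n)
        return features


def test_fills_missing_defaults():
    batch = [{"input_ids": [3, 4, 5]}]
    out = MissingFeaturesCollator()(batch)
    assert out[0]["labels"] == [3, 4, 5]
    assert out[0]["position_ids"] == [0, 1, 2]
    assert out[0]["attention_mask"] == [1, 1, 1]
```

A companion test on build_collator would then assert which collator class is returned for each combination of config options.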

@RicardoDominguez (Contributor, Author)

I can write something that makes the tokenization step a bit more sparse. How about when all datasets are of type completion?
